The Olympics of AI: Benchmarking Machine Learning Systems

Machine learning system comparisons at the AI Olympics


For a long time, running a mile in under four minutes was widely seen as unattainable. Many believed it was an impossible goal, both mentally and physically. Experts in medicine and athletics hypothesized that the human body could not sustain such speed for that distance. Some even went so far as to claim that attempting to disprove this belief could be dangerous.

The British middle-distance runner and medical student Sir Roger Bannister disagreed. He was aware of the difficulty, but he believed the barrier was more mental than physical. Bannister adopted a methodical approach to his training, dividing the mile into segments and timing each one precisely. He used interval training and set a series of mini-goals to push himself in the months leading up to his record attempt.

Bannister attempted to break the four-minute mile on May 6, 1954, at a track in Oxford, England, with the support of his friends Chris Brasher and Chris Chataway as pacemakers. He broke the record for the mile with a time of 3 minutes, 59.4 seconds.

A photograph of Roger Bannister competing in a race. Photo courtesy of Norske Leksikon (CC-BY 4.0 license).
What followed Bannister’s feat was entirely unanticipated. The previous record, just over 4 minutes and 1 second, had been held by Gunder Hägg since 1945; once the four-minute mile was finally broken, many others quickly followed. Just 46 days after Bannister’s run, John Landy ran the mile in 3 minutes, 57.9 seconds. The record was broken five more times in the next decade, and Hicham El Guerrouj set the current record of 3 minutes, 43.1 seconds in 1999.

World-record mile times between 1900 and 2000. The downward trend was practically linear, yet it wasn’t until 1954 that Roger Bannister finally broke the four-minute barrier. Author-made illustration.
Bannister’s success demonstrates the value of standards, not merely as indicators of progress but as catalysts for innovation. When the four-minute “benchmark” was shattered, it changed the way athletes saw their own abilities. The wall wasn’t only physical; it was also mental.

The four-minute mile is a symbol of the revolutionary potential of standards in several fields. The use of benchmarks allows us to measure our progress toward specific goals and compare our results to those of other groups. This is the very foundation upon which competitions like the Olympics are built. To be effective, however, benchmarks require consensus on an overarching objective among the communities they serve.

As the Olympic Games are to athletes, benchmarks are to the machine learning and computer science communities. Here, algorithms, systems, and approaches compete not for medals but for the honor of progress and the impetus to innovate. Just as athletes train for years to shave milliseconds off their times in pursuit of Olympic gold, developers and researchers optimize their models and systems to outperform on established benchmarks.

Establishing that target is both the art and science of benchmarking. It is not enough simply to define a task; the task must capture the essence of real-world challenges, pushing the bounds of what is achievable while remaining relevant. A poorly chosen benchmark can misguide researchers, driving improvements that matter in theory but not in practice. A well-crafted benchmark can lift an entire community to a new, game-changing level of innovation.

While standards can be used in a competitive context, their actual worth resides in their capacity to bring people together around a common goal. A well-conceived standard may elevate an entire field, altering paradigms and ushering in new eras of innovation, much as Bannister’s run did for athletics.

In this article, we will examine the significance of benchmarking in driving forward the fields of computer science and machine learning by looking at its origins, examining the current state of the art in benchmarking machine learning systems, and seeing how it encourages development in the realm of hardware.

In the 1980s, as the PC revolution was getting underway, demand grew for a defined measurement, a benchmark, for comparing the performance of different computer systems. Before industry-wide standards existed, individual manufacturers frequently created and used their own performance indicators. These tests tended to exaggerate a machine’s strengths and hide its flaws. The need for a standardized, objective measure of performance became apparent.

The System Performance Evaluation Cooperative (SPEC) was created as a solution to this problem. Hardware manufacturers, academics, and others came together to form this group with the goal of developing a standardized method of measuring the performance of computer processors (CPUs), or “chips.”

The SPEC89 benchmark suite was SPEC’s first major contribution; it was one of the earliest attempts to provide a uniform CPU benchmark in the industry. SPEC’s benchmarks were designed to offer measures that mattered to end-users rather than arcane or specialist measurements, with an emphasis on real-world applications and computing workloads.

However, as the benchmark matured, an interesting phenomenon emerged: the so-called “benchmark effect.” As the SPEC benchmarks became the industry standard for assessing CPU performance, CPU manufacturers began optimizing their designs toward SPEC’s tests, leading to a noticeable increase in performance on those benchmarks. Because the industry had come to treat these tests as a measure of overall performance, manufacturers had a strong incentive to ensure their CPUs performed well on SPEC benchmarks, even if this meant potentially compromising performance on non-SPEC workloads.

Even though it wasn’t SPEC’s original aim, this sparked heated discussion among computer scientists. Did the performance metrics accurately reflect the state of the art? Or had the field developed a kind of tunnel vision in which the benchmarks became an end in themselves?

In light of these difficulties, SPEC has repeatedly revised its benchmarks over the years to stay current and to discourage narrow optimization. Its benchmark suites now include workloads not just in integer and floating-point arithmetic but also in graphics, file systems, and more.

The history of SPEC and its benchmarks makes clear how much influence benchmarking can have on the course of an entire industry. The benchmarks were more than just indicators of quality; they actively drove improvements. It is proof that standardization works, but it is also a cautionary tale about how optimizing for a single metric can have unforeseen consequences.

The computer hardware market and the purchase decisions of consumers and businesses are still heavily influenced by SPEC benchmarks and other benchmarks today.

Progress in computer vision, the branch of artificial intelligence that aims to teach computers to understand and act on visual information, stalled in the late 2000s. Traditional methods had made strides, but they had plateaued on many tasks. The methodologies of the time depended heavily on hand-crafted features, which experts had to painstakingly design and select for each task. It was a time-consuming process with many limitations.

Then Dr. Fei-Fei Li and her colleagues unveiled ImageNet, a vast visual database. ImageNet made available millions of labeled images spanning thousands of categories. Because of its enormous size, the dataset could only be labeled through crowdsourcing services such as Amazon Mechanical Turk. Since its introduction, the ImageNet paper has been cited over 50,000 times as a benchmark dataset.

A visual collage of images from ImageNet. Picture by Gluon (CC BY 4.0).
However, data collection was only the beginning. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was launched in 2010. Although the task at hand was straightforward in concept, its sheer scope was intimidating. This benchmark challenge would provide an unprecedented opportunity to gauge the true state of the art in computer vision.

In the early years, progress over more conventional approaches was slow but steady. Then a revolutionary change occurred during the 2012 challenge. AlexNet, a deep convolutional neural network (CNN), was developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton at the University of Toronto. Their model cut the previous year’s error rate nearly in half, down to 15.3%.

Error rates in the ImageNet Large Scale Visual Recognition Challenge over time. After deep learning was introduced in 2012, accuracy improved considerably and has stayed high ever since. Typical human error is about 5%. Image from the NIH/RSNA/ACR/The Academy Workshop 2018, republished under the CC BY 4.0 International Creative Commons License.
What made this possible? Deep learning, and CNNs in particular, can learn features directly from raw pixels, eliminating the need for hand-crafted features. With sufficient data and compute, these networks can uncover complex patterns that had previously been impossible to detect.
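To make the idea concrete, here is a minimal sketch of a small convolutional network in PyTorch. It is not AlexNet, and the layer sizes (a 32×32 RGB input and ten output classes) are illustrative assumptions; the point is simply that convolutional layers learn their filters from raw pixels rather than relying on hand-designed feature extractors.

    import torch
    import torch.nn as nn

    class TinyCNN(nn.Module):
        """A tiny convolutional network; layer sizes are illustrative, not AlexNet's."""

        def __init__(self, num_classes: int = 10):
            super().__init__()
            # Convolutional layers learn filters (edges, textures, ...) from raw pixels.
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),   # 32x32 -> 16x16
                nn.Conv2d(16, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),   # 16x16 -> 8x8
            )
            # A small classifier head maps the learned features to class scores.
            self.classifier = nn.Linear(32 * 8 * 8, num_classes)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.features(x)
            return self.classifier(torch.flatten(x, start_dim=1))

    # A batch of four 32x32 RGB images yields one score per class for each image.
    logits = TinyCNN()(torch.randn(4, 3, 32, 32))
    print(logits.shape)   # torch.Size([4, 10])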

The achievement of AlexNet marked a turning point in the history of artificial intelligence. In the years after 2012, deep learning approaches came to dominate the ImageNet challenge, leading to steadily decreasing error rates. The benchmarks sent an unmistakable message: deep learning, a hitherto obscure subfield of machine learning, was about to transform computer vision.

More than that, in fact. On the strength of its success in the ILSVRC, deep learning quickly became one of the most promising fields in artificial intelligence (AI), revolutionizing not just computer vision but also natural language processing (NLP) and game playing. The challenge drew more researchers, and more investment, into deep learning.

By providing a demanding standard, the ImageNet challenge was essential in changing the course of AI research and paving the way for the current deep learning-driven AI renaissance.

The revolutionary effects of the SPEC and ImageNet benchmarks raise the question: where do we go from here? As deep learning models grew more complex, their computing requirements grew in tandem, and researchers began focusing on the hardware that runs these models. Enter MLPerf.

The MLPerf initiative, which drew participation from both corporate and academic heavyweights, set out to provide a uniform set of criteria for evaluating the performance of machine learning infrastructure. As the name suggests, MLPerf covers a wide variety of machine learning problems, from image classification to reinforcement learning. The goal was clear: bring order to a space where claims of “best performance” were multiplying, yet were frequently founded on skewed criteria or selective measurements.

With the release of MLPerf, the tech sector gained a standard measurement tool. It established a yardstick by which academic research could be evaluated and helped create a climate in which algorithmic progress could be tracked over time. For the business world, and the hardware sector in particular, it was both a threat and an opportunity: a globally acknowledged benchmark now puts to the test any claims made about a new chip’s machine learning capability.

MLPerf has shaped the development of AI hardware much as SPEC did for CPUs. Companies began optimizing their designs with MLPerf benchmarks in mind, and not only for raw performance. The benchmarks include efficiency criteria to promote advances that improve not just speed but also energy efficiency, a major concern in this era of massive transformer models and environmental awareness. Tech giants like Nvidia and AMD regularly use these benchmarks to show off their latest and greatest hardware.


MLPerf has grown into a family of benchmark suites, maintained today by MLCommons:

  • MLPerf Training: Measures system performance when training a machine learning model (especially relevant to researchers developing new models).
  • MLPerf Inference: Measures how well a system performs inference with a trained machine learning model (especially important for businesses serving models in production). MLPerf Inference comes in several flavors tailored to specific environments such as data centers, mobile phones, the cloud, and the edge. A simplified sketch of the kind of measurement it reports follows this list.
  • MLPerf Training HPC: Benchmarks training workloads relevant to high-performance computing (HPC) systems.
  • MLPerf Storage: Benchmarks storage systems under machine-learning workloads.
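As a rough illustration of what an inference benchmark measures, the sketch below times repeated forward passes of a model and reports median latency and throughput. It is not the official MLPerf LoadGen harness; the linear model and random batch are stand-ins chosen only to make the example self-contained.

    import statistics
    import time

    import torch

    def benchmark_inference(model, batch, warmup=10, iters=100):
        """Return (median latency in ms, throughput in samples/sec) for one batch."""
        model.eval()
        with torch.no_grad():
            for _ in range(warmup):              # warm-up runs, excluded from timing
                model(batch)
            latencies_ms = []
            for _ in range(iters):
                start = time.perf_counter()
                model(batch)
                latencies_ms.append((time.perf_counter() - start) * 1000.0)
        median_ms = statistics.median(latencies_ms)
        throughput = batch.shape[0] / (median_ms / 1000.0)
        return median_ms, throughput

    # Stand-in "model" and input batch, used only to make the example runnable.
    model = torch.nn.Linear(3 * 224 * 224, 1000)   # hypothetical classifier head
    batch = torch.randn(8, 3 * 224 * 224)          # batch of 8 flattened "images"
    print(benchmark_inference(model, batch))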

However, MLPerf does not exist in a vacuum. Concerns about “overfitting” to benchmarks, in which designs are unduly optimized for the benchmark tests at the possible cost of real-world applicability, apply to any benchmark that gains popularity. Another persistent difficulty is keeping benchmarks up to date with the ever-changing machine learning landscape.

Nonetheless, the history of MLPerf, like that of its forerunners, highlights a fundamental truth: benchmarks spur development. They are not only a barometer of progress but a driving force behind it. By providing ambitious targets to strive toward, they concentrate effort and propel businesses and academic institutions to pioneer new fields. And in a world where AI is constantly expanding the boundaries of possibility, having a map to follow is more than a luxury; it is a necessity.

Beyond AI hardware, large language models, the engines of generative AI, are another area where benchmarking efforts are concentrated. These models are much harder to evaluate than hardware or even many other kinds of machine learning models.

That’s because a language model’s efficacy depends on more than how fast it can compute or how well it does on specific tests. It depends on the model’s capacity to produce answers that are consistent, useful, and applicable across a wide range of prompts and settings. Furthermore, judging the “quality” of a response is highly subjective, varying with context and with the personal preferences of the evaluator. Given these complications, benchmarks for language models like GPT-3 and BERT need to be more varied and comprehensive than their more traditional counterparts.

The General Language Understanding Evaluation (GLUE) benchmark, introduced in 2018, is one of the most well-known measures of language model performance. From sentiment analysis to textual entailment, GLUE included not one but nine different language tasks. The goal was to offer a thorough evaluation, checking that models weren’t simply good at one thing but could handle a wide range of language-understanding problems. A sketch of how a single GLUE task can be scored appears below.
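As an illustration, the sketch below scores a single GLUE task (SST-2 sentiment classification), assuming the Hugging Face `datasets` and `evaluate` libraries are installed. The `my_classifier` function is a hypothetical stand-in for whatever model is being evaluated; a real submission would aggregate scores across all nine tasks.

    # Scoring one GLUE task (SST-2) -- a sketch, not a full GLUE submission.
    from datasets import load_dataset   # assumes the `datasets` package is installed
    import evaluate                     # assumes the `evaluate` package is installed

    def my_classifier(sentence: str) -> int:
        """Hypothetical stand-in: a real model would return 1 (positive) or 0 (negative)."""
        return 1

    validation = load_dataset("glue", "sst2", split="validation")
    metric = evaluate.load("glue", "sst2")   # reports accuracy for SST-2

    predictions = [my_classifier(example["sentence"]) for example in validation]
    references = [example["label"] for example in validation]
    print(metric.compute(predictions=predictions, references=references))
    # e.g. {'accuracy': ...} for this constant-prediction stand-in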

GLUE had a dramatic and immediate effect. Finally, there was a standardized, reliable yardstick by which language models could be judged. Tech titans and academic institutions alike quickly joined the fray, all hoping to rise to the top of the GLUE leaderboard.

When GPT-2 was first compared to other models on the GLUE benchmark, it achieved a score that was considered exceptional at the time. This not only demonstrated GPT-2’s strength but also highlighted GLUE’s usefulness as a transparent yardstick. “State-of-the-art on GLUE” quickly became a coveted accolade.

The success of GLUE, however, had consequences of its own. By late 2019, the GLUE leaderboard was crowded with models approaching the human baseline. This saturation underscored another vital feature of benchmarking: benchmarks must keep pace with developments in the field. In response, the same group released SuperGLUE, a more challenging benchmark meant to push the limits further.

Benchmarks like GLUE, SuperGLUE, and SQuAD measure model performance on specialized tasks such as sentiment analysis and question answering. However, these metrics merely scratch the surface of what foundation models aim to accomplish. New evaluation criteria have emerged that go beyond task-specific accuracy:

  • Robustness: How does the model react to unusual or adversarial inputs? To test how well a model withstands attacks from malicious users or unanticipated conditions, robustness benchmarks present it with intentionally misleading or perturbed data (a toy sketch of this idea follows this list).
  • Generalization and Transfer Learning: Foundation models are expected to perform adequately on tasks they were not explicitly designed for. Understanding a model’s adaptability requires testing its performance on tasks for which it has seen few or no training examples.
  • Interactivity and Coherence: The reliability and coherence of a model’s responses over time is crucial for use cases like chatbots and virtual assistants. Benchmarks in this area may involve long conversations or recalling details across several conversations.
  • Safety and Controllability: These benchmarks become crucial as models grow, to keep them from generating incoherent or harmful output.
  • Customizability: As the use of foundation models expands across industries, there is an increasing need to specialize them. Benchmarks in this space may measure a model’s ability to learn the jargon and subtleties of a particular sector.
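As a toy illustration of the robustness idea from the list above, the sketch below perturbs a prompt in trivial ways and checks whether a model’s answer stays consistent. The `ask_model` function is a hypothetical stand-in for any language-model call; real robustness benchmarks use far more systematic and adversarial perturbations.

    def ask_model(prompt: str) -> str:
        """Hypothetical stand-in for a call to a language model."""
        return "Paris"

    def perturb(prompt: str) -> list[str]:
        """Simple surface-level perturbations of a prompt."""
        return [
            prompt.lower(),                       # casing change
            prompt.replace("?", " ??"),           # punctuation noise
            "Please answer briefly: " + prompt,   # benign prefix
        ]

    def consistency_rate(prompt: str) -> float:
        """Fraction of perturbed prompts whose answer matches the original answer."""
        baseline = ask_model(prompt)
        variants = perturb(prompt)
        return sum(ask_model(v) == baseline for v in variants) / len(variants)

    print(consistency_rate("What is the capital of France?"))  # 1.0 for this stub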

It’s fascinating to see tests originally designed to evaluate human performance being repurposed as language model benchmarks as these models approach human-level performance. GPT-4, for example, has been given questions from the SAT, the LSAT, and medical licensing exams. Its SAT score of 1410 places it in the top 6 percent of all test takers. Even more impressively, GPT-4 averaged 80.7% across iterations of the medical board examinations. In contrast, its LSAT scores of 148 and 157 put it in the 37th and 70th percentiles, respectively.

GPT-4 and GPT-3.5 performance on academic and professional exams. Adapted from a figure in the “GPT-4 Technical Report.” Image courtesy of OpenAI (CC-BY 4.0).
Since language models are quickly catching up to and even surpassing human performance, it will be fascinating to observe how benchmarking techniques evolve for this field.

Benchmarking will continue to branch out to accommodate the wide variety of new technologies and applications. Some emerging areas where benchmarks are taking hold are listed below:

  • RobotPerf: As robotics becomes more pervasive, benchmarks like RobotPerf are being developed to measure and accelerate robotics applications, helping ensure that robots are both effective and safe.
  • NeuroBench: A benchmark for evaluating neuromorphic, brain-inspired computing systems, providing insight into how well these architectures mirror neural processes.
  • XRBench: With new hardware from companies like Meta and Apple, the virtual and augmented reality industries have seen a renaissance. XRBench was created to benchmark extended reality (XR) applications, which must deliver an immersive and smooth user experience.
  • MAVBench: As developments in multi-agent systems and battery technology make drones more practical for commercial use, benchmarks like MAVBench will be crucial for maximizing their potential.

It is widely recognized in computer science and machine learning that benchmarking is a key factor in advancing these disciplines. NeurIPS, one of the most prestigious AI conferences, has added a dedicated datasets and benchmarks track. Now in its third year, the track is gaining remarkable traction, drawing nearly 1,000 submissions this year. This trend demonstrates that, as technology advances at an ever-increasing rate, benchmarks will keep shaping its path in real time.

It is impossible to overstate the importance of benchmarks in guiding development, whether in sports or AI. They serve both as a reflection of the present and as a view into possible futures. With AI’s growing impact on fields as varied as healthcare and finance, reliable metrics are more important than ever. They direct effort toward problems that really matter, ensuring that progress is not only swift but also meaningful. Sir Roger Bannister’s sub-four-minute mile demonstrates how overcoming a seemingly insurmountable obstacle can spark a surge of creativity and invention that lasts for decades. For computing and machine learning, it is still early in the race.
