
LLM Evals using Adaptive Testing

Trismik white paper

December 1, 2024

Anyone working with Generative AI (GenAI) and Large Language Models (LLMs) knows how quickly these systems are evolving. From fine-tuning foundation models like GPT-4 or Mistral to building RAG-capable applications, ensuring they meet real-world performance standards is critical. However, traditional evaluation methods often fall short: static benchmark tests are time-consuming, computationally expensive, and fail to adapt to the unique strengths and weaknesses of different models.

At Trismik, we’re tackling a key challenge: how do you efficiently and effectively evaluate LLMs without compromising accuracy? That’s why we've developed an adaptive testing solution that dynamically assesses LLMs based on their responses. Adaptive testing is not just faster—it’s smarter. By tailoring the difficulty of questions in real time to align with a model’s performance, adaptive testing minimizes the time and computational resources needed to reach reliable conclusions. But what is adaptive testing, and how exactly does it compare to classical benchmark testing?

This white paper takes a deep dive into these questions, using MMLU Pro data to evaluate the performance of 10 models from four leading commercial vendors. Our findings demonstrate how adaptive testing provides actionable insights while reducing evaluation time, enabling AI engineers and data scientists to focus on innovation and deployment.

What is adaptive testing?

Adaptive testing, or Computer Adaptive Testing (CAT), is a method that dynamically tailors the difficulty of questions based on the test-taker's performance in real time. Unlike traditional benchmark testing, which uses a static set of questions, adaptive testing iteratively selects the most informative questions to quickly and accurately estimate performance. This approach, grounded in Item Response Theory (IRT), minimizes the number of questions required while maintaining precision, making it ideal for evaluating LLMs. By focusing computational resources where they matter most, adaptive testing not only reduces evaluation time but also provides a clearer picture of a model’s strengths and weaknesses—offering an efficient alternative to fixed-form tests.
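To make the mechanics concrete, here is a minimal sketch of an IRT-driven adaptive loop using the two-parameter logistic (2PL) model. It is illustrative rather than a description of Trismik's production algorithm: the synthetic item bank, the ask_model stub, and the grid-based ability estimate are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical item bank: each item has a discrimination (a) and difficulty (b).
items = [{"a": rng.uniform(0.5, 2.0), "b": rng.normal(0, 1)} for _ in range(500)]

def p_correct(theta, item):
    """2PL probability that a test-taker with ability theta answers correctly."""
    return 1.0 / (1.0 + np.exp(-item["a"] * (theta - item["b"])))

def fisher_information(theta, item):
    """Information the item provides about theta under the 2PL model."""
    p = p_correct(theta, item)
    return item["a"] ** 2 * p * (1.0 - p)

def estimate_theta(responses, grid=np.linspace(-4, 4, 161)):
    """Grid-search MAP estimate of ability given (item, correct) pairs."""
    log_post = -0.5 * grid ** 2  # standard-normal prior on theta
    for item, correct in responses:
        p = p_correct(grid, item)
        log_post += np.log(p if correct else 1.0 - p)
    return grid[np.argmax(log_post)]

def ask_model(item):
    """Stub for querying an LLM; replace with a real API call."""
    true_theta = 0.8  # pretend ability of the model under test (assumption)
    return rng.random() < p_correct(true_theta, item)

# Adaptive loop: always administer the most informative remaining item.
theta, responses, remaining = 0.0, [], list(items)
for _ in range(60):  # a 60-item test, as in the experiments below
    item = max(remaining, key=lambda it: fisher_information(theta, it))
    remaining.remove(item)
    responses.append((item, ask_model(item)))
    theta = estimate_theta(responses)

print(f"Estimated ability: {theta:.2f}")
```

The key idea is that each new question is chosen to be maximally informative about the current ability estimate, so the estimate stabilizes after far fewer items than a fixed-form test.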

Comparison to classical benchmark testing
[Figure: HuggingFace MMLU Pro leaderboard]

One way to compare adaptive testing with classical methods is through a side-by-side analysis. Using MMLU Pro [1], a 10-way multiple-choice quiz spanning domains like mathematics, engineering, and medicine, we leveraged the HuggingFace leaderboard, which shows classical test results for models against 12,032 test items.


To "Trismikize" MMLU Pro, we reduced the test to a 5-way format, calculated question difficulty, and re-tested using Trismik's adaptive framework. Each Trismik test runs in 60 questions (1/200 of the original test). We then compared results by calculating the Pearson correlation coefficient between leaderboard accuracy scores and Trismik ability scores.

The Pearson correlation coefficient quantifies the linear relationship between two variables, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation). A correlation of 0 indicates no relationship. In this context, a high positive Pearson correlation suggests that Trismik’s adaptive ability scores align well with traditional accuracy metrics, validating the effectiveness of adaptive testing.
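Computing the correlation itself is a one-liner with SciPy; the scores below are placeholder values for illustration, not the results of our runs.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical numbers for illustration only; the real values come from the
# HuggingFace leaderboard and from averaged Trismik ability scores.
leaderboard_accuracy = np.array([0.71, 0.63, 0.45, 0.58, 0.66, 0.40, 0.52, 0.69, 0.61, 0.48])
trismik_ability = np.array([1.10, 0.70, -0.35, 0.40, 0.85, -0.60, 0.10, 1.00, 0.55, -0.20])

r, p_value = pearsonr(leaderboard_accuracy, trismik_ability)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```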

Here are the settings: all tests were run between 21/11/2024 and 23/11/2024 to minimize model drift, all models were run with temperature = 0 and max tokens = 5, and we randomized the order in which the multiple-choice options were presented to minimize option bias. We averaged Trismik results over 5, 10, and 20 test runs to see how the correlation with the classical results on HuggingFace compared. Four vendor families were compared: Mistral (Nemo 2407, Mixtral 8x7B, Mistral 7B), GPT (4o-mini and 4o), Gemini (1.5 Pro and 1.5 Flash), and Claude (3.5 Haiku, 3 Opus, and 3.5 Sonnet).
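As a rough sketch, a single item under these settings might be sent to a vendor API as follows. The OpenAI client is shown purely as one example of the four vendors, and the prompt format and answer_item helper are illustrative assumptions; only the temperature and max-token values come from the setup described above.

```python
from openai import OpenAI  # one vendor SDK, shown as an example

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_item(question, options, model="gpt-4o-mini"):
    """Ask a model for the letter of the correct option under the test settings above."""
    letters = "ABCDE"[: len(options)]
    prompt = (
        question
        + "\n"
        + "\n".join(f"{letter}. {option}" for letter, option in zip(letters, options))
        + "\nAnswer with a single letter."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic decoding, as in the runs above
        max_tokens=5,   # the model only needs to emit one option letter
    )
    return response.choices[0].message.content.strip()
```

So how did our adaptive tests compare?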

[Figure: Pearson correlation between Trismik ability scores (averaged over 5, 10, and 20 test runs) and HuggingFace leaderboard accuracy]

There's a small mismatch which we'll discuss in a moment, but the >0.9 results show a very strong alignment between Trismik’s adaptive testing approach and the classical methods on MMLU Pro.


The difference between the adaptive and classical test results can be attributed to a number of factors, both external and internal to the testing methods themselves:


External factors: (1) Model parameters: there may be a misalignment between the hyperparameters that control model inference in the leaderboard runs and in our test runs, e.g. temperature or number of tokens; (2) Model fluctuation: vendor models may have shifted between the time of the reported leaderboard run and now.


Internal factors: (1) Item selection: adaptive tests rely on an algorithm to select the next question based on previous responses, which may bias the selection toward certain question types. Classical tests, by contrast, present a pre-determined set of questions, ensuring uniform coverage but potentially missing the opportunity for deeper probing; (2) Test length: the shorter length of adaptive tests may introduce slight variance in results; (3) Down-sampling: our multiple-choice test was down-sampled from 10 to 5 options, which raises the random-guess baseline from 10% to 20% and makes it generally easier to guess the correct answer.


Overall, the small gap we have seen in Pearson correlation can be viewed as a natural trade-off for the efficiency gains of adaptive testing. We're working to make further refinements to our approach to narrow this gap even further.


And finally, here are the latency results for each model to complete a single 60-item test run on MMLU Pro (averaged over n = 20 runs). Factors influencing latency include network speed and inference speed.
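Latency here means wall-clock time for a full 60-item run. A simple way to take such a measurement is sketched below, with run_single_test standing in (as an assumption) for one complete adaptive test run against a given model.

```python
import statistics
import time

def timed_test_runs(run_single_test, n=20):
    """Mean and standard deviation of wall-clock latency over n full test runs."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        run_single_test()  # e.g. one 60-item adaptive run, as sketched earlier
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies), statistics.stdev(latencies)
```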

[Figure: Average latency per model for a single 60-item MMLU Pro test run, n = 20]
Conclusion

Adaptive tests are designed to provide a precise estimate of ability with fewer items, focusing only on the most informative questions. Classical tests, on the other hand, use a fixed set of items for all test-takers, which can lead to redundant or irrelevant questions for some models. As a result, adaptive tests are more efficient in terms of time and computational resources, making them an ideal solution for organizations looking to optimize testing workflows without compromising on accuracy.


As a pioneer in applying adaptive testing to LLMs, Trismik is at the forefront of this exciting technology. While adaptive testing is still new in the LLM space, the results show its potential to reshape how AI systems are evaluated. We invite you to explore Trismik’s innovative solutions and join us in setting a new standard for efficient, scalable model evaluation.

References

[1] Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., ... & Chen, W. (2024). MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574.
