The standard approach to evaluating large language models (LLMs) is simple but inefficient: run models through massive
static benchmarks, average the scores, and compare results. The problem is that these benchmarks often require models
to process thousands of items, many of which offer little useful information about a model's actual capabilities.
Computerized Adaptive Testing (CAT) has been quietly transforming educational assessments for decades
[1, 2, 3]. Rather than using a one-size-fits-all test, CAT adapts question difficulty in real time
based on the test-taker’s performance. The concept is intuitive: start with a medium-difficulty question. If the answer
is correct, ask something harder; if it’s wrong, step back to an easier one. By adjusting after every response, the test
homes in on the test-taker’s ability efficiently, using far fewer questions than a fixed-form test.
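
To make the idea concrete, here is a minimal sketch of that adaptive loop as a simple "staircase" rule (not a full CAT system, which would use a statistical ability estimate). The item bank layout, the `answer_correctly` oracle, and the difficulty levels are hypothetical placeholders, not part of any specific benchmark.

```python
import random

def staircase_test(item_bank, answer_correctly, n_items=20):
    """Administer n_items, moving up one difficulty level after a correct
    answer and down one level after an incorrect answer.

    item_bank: dict mapping difficulty level (int) -> list of items
    answer_correctly: callable taking an item, returning True/False
    """
    levels = sorted(item_bank)
    level_idx = len(levels) // 2              # start at medium difficulty
    history = []

    for _ in range(n_items):
        level = levels[level_idx]
        item = random.choice(item_bank[level])
        correct = answer_correctly(item)      # query the test-taker / model
        history.append((level, correct))
        # Correct -> harder; wrong -> easier (clamped to the bank's range).
        if correct:
            level_idx = min(level_idx + 1, len(levels) - 1)
        else:
            level_idx = max(level_idx - 1, 0)

    # A crude ability estimate: average difficulty of the items administered.
    estimate = sum(level for level, _ in history) / len(history)
    return estimate, history
```

Real CAT systems replace this heuristic with a statistical model of ability that is updated after each response, and they pick the next item to be maximally informative at the current estimate; the staircase above only illustrates the core adapt-as-you-go idea.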