The standard approach to evaluating large language models (LLMs) is simple but inefficient: run models through massive
static benchmarks, average the scores, and compare results. The problem is that these benchmarks often require models
to process thousands of items, many of which offer little useful information about a model's actual capabilities.
Computerized Adaptive Testing (CAT) has been quietly transforming educational assessments for decades
[1, 2, 3]. Rather than using a one-size-fits-all test, CAT adapts question difficulty in real time
based on the test-taker’s performance. The concept is intuitive: start with a medium-difficulty question. If the answer
is correct, ask something harder; if it’s wrong, step back to an easier one. By adjusting after every response, the test
homes in on the test-taker’s ability efficiently, using far fewer questions than a fixed-form test.
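
To make the idea concrete, here is a minimal sketch of that adaptive loop as a simple "staircase" rule (not a full CAT system, which would use a statistical ability estimate). The item bank layout, the `answer_correctly` oracle, and the difficulty levels are hypothetical placeholders, not part of any specific benchmark.

```python
import random

def staircase_test(item_bank, answer_correctly, n_items=20):
    """Administer n_items, moving up one difficulty level after a correct
    answer and down one level after an incorrect answer.

    item_bank: dict mapping difficulty level (int) -> list of items
    answer_correctly: callable taking an item, returning True/False
    """
    levels = sorted(item_bank)
    level_idx = len(levels) // 2              # start at medium difficulty
    history = []

    for _ in range(n_items):
        level = levels[level_idx]
        item = random.choice(item_bank[level])
        correct = answer_correctly(item)      # query the test-taker / model
        history.append((level, correct))
        # Correct -> harder; wrong -> easier (clamped to the bank's range).
        if correct:
            level_idx = min(level_idx + 1, len(levels) - 1)
        else:
            level_idx = max(level_idx - 1, 0)

    # A crude ability estimate: average difficulty of the items administered.
    estimate = sum(level for level, _ in history) / len(history)
    return estimate, history
```

Real CAT systems replace this heuristic with a statistical model of ability that is updated after each response, and they pick the next item to be maximally informative at the current estimate; the staircase above only illustrates the core adapt-as-you-go idea.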