Metric Scoring
How Scores Are Computed in Scorebook
Metrics define how model predictions are evaluated against ground-truth labels. Each evaluation dataset in Scorebook declares its own set of metrics. During evaluation, predictions are passed to these metrics to compute aggregate scores and item-level scores.
How Metric Scoring Works
- The inference callable produces a list of outputs, one for each dataset item.
- Each metric is applied to compare predictions with labels.
- For every metric, two levels of scores are returned (see the sketch after this list):
  - Aggregate scores: summary statistics (e.g., overall accuracy, average F1).
  - Item scores: per-item correctness or metric-specific values.
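As a rough illustration of the two levels, the following plain-Python sketch (not Scorebook's internal metric code, with made-up predictions and labels) shows how item scores and an aggregate score relate:

predictions = ["A", "B", "C", "D"]
labels = ["A", "B", "D", "D"]

# Item scores: one correctness value per dataset item.
item_scores = [pred == label for pred, label in zip(predictions, labels)]

# Aggregate score: a summary statistic over the item scores.
aggregate_accuracy = sum(item_scores) / len(item_scores)

print(item_scores)         # [True, True, False, True]
print(aggregate_accuracy)  # 0.75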
Internally, Scorebook enforces a one-to-one match between outputs and labels. If the counts do not match, a DataMismatchError is raised.
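For example, an inference callable that drops an output would trigger this error. The sketch below assumes DataMismatchError is importable from the top-level scorebook package; the actual import path may differ.

from scorebook import EvalDataset, evaluate, DataMismatchError  # import path for DataMismatchError is an assumption
from scorebook.metrics import Accuracy

dataset = EvalDataset.from_json("qa_dataset.json", label="answer", metrics=Accuracy)

def short_inference_fn(items, **kwargs):
    # Deliberately returns one fewer output than there are dataset items.
    return ["A" for _ in items][:-1]

try:
    evaluate(inference=short_inference_fn, datasets=dataset)
except DataMismatchError:
    print("Output count does not match label count")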
Example: Accuracy Metric
from scorebook import EvalDataset, evaluate
from scorebook.metrics import Accuracy

# Use the "answer" field as the ground-truth label and score with Accuracy.
dataset = EvalDataset.from_json("qa_dataset.json", label="answer", metrics=Accuracy)

def inference_fn(items, **kwargs):
    # Naive baseline: predict "A" for every item.
    return ["A" for _ in items]

results = evaluate(inference=inference_fn, datasets=dataset)
print(results)
[
{
"dataset": "qa_dataset",
"run_completed": true,
"accuracy": 0.25
}
]

With this constant-"A" baseline, a quarter of the items in the example dataset have "A" as their label, so the reported accuracy is 0.25.
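If results matches the printed structure above (a list with one dictionary per dataset), individual scores can be read directly; a sketch, assuming that structure:

accuracy = results[0]["accuracy"]  # 0.25 for the run shown above
print(accuracy)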
Multiple Metrics
Datasets can specify multiple metrics. Each is scored independently.
from scorebook.metrics import Accuracy, F1
dataset = EvalDataset.from_json(
"qa_dataset.json",
label="answer",
metrics=[Accuracy, F1]
)
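Evaluation then runs exactly as before; for instance, reusing the baseline inference_fn from the accuracy example above:

results = evaluate(inference=inference_fn, datasets=dataset)
print(results)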
Results will contain scores for each metric.
[
{
"dataset": "qa_dataset",
"run_completed": true,
"accuracy": 0.25
"f1": 0.4
}
]