Metric Scoring

How Scores Are Computed in Scorebook

Metrics define how model predictions are evaluated against ground-truth labels. Each evaluation dataset in Scorebook declares its own set of metrics. During evaluation, predictions are passed to these metrics to compute aggregate scores and item-level scores.

How Metric Scoring Works

  • The inference callable produces a list of outputs, one for each dataset item.
  • Each metric is applied to compare predictions with labels.
  • For every metric, two levels of scores are returned:
    • Aggregate scores: summary statistics (e.g., overall accuracy, average F1).
    • Item scores: per-item correctness or metric-specific values.

Internally, Scorebook enforces a one-to-one match between outputs and labels; if the counts do not match, a DataMismatchError is raised.
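
To make the two levels of scoring concrete, here is a simplified sketch of a metric that returns both an aggregate score and per-item scores, together with the kind of length check described above. The class name SimpleAccuracy and the returned dictionary shape are illustrative assumptions, not Scorebook's actual metric interface.

# Hypothetical sketch only; not Scorebook's real metric API.
class SimpleAccuracy:
    @staticmethod
    def score(outputs, labels):
        # Scorebook enforces a one-to-one match between outputs and labels;
        # in Scorebook, a mismatch raises DataMismatchError.
        if len(outputs) != len(labels):
            raise ValueError("outputs and labels must have the same length")

        # Item scores: one value per dataset item (1 if correct, 0 otherwise).
        item_scores = [int(output == label) for output, label in zip(outputs, labels)]

        # Aggregate score: a summary statistic over all items.
        aggregate = sum(item_scores) / len(item_scores) if item_scores else 0.0

        return {"aggregate": aggregate, "item_scores": item_scores}

# 3 of 4 predictions match their labels: aggregate 0.75, item scores [1, 1, 1, 0].
print(SimpleAccuracy.score(["A", "B", "C", "D"], ["A", "B", "C", "A"]))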

Example: Accuracy Metric

from scorebook.metrics import Accuracy
from scorebook import EvalDataset, evaluate

dataset = EvalDataset.from_json("qa_dataset.json", label="answer", metrics=Accuracy)

def inference_fn(items, **kwargs):
    return ["A" for _ in items]  # naive baseline

results = evaluate(inference=inference_fn, datasets=dataset)

print(results)
[
  {
    "dataset": "qa_dataset",
    "run_completed": true,
    "accuracy": 0.25
  }
]

Multiple Metrics

Datasets can specify multiple metrics. Each is scored independently.

from scorebook.metrics import Accuracy, F1

dataset = EvalDataset.from_json(
    "qa_dataset.json",
    label="answer",
    metrics=[Accuracy, F1]
)

Results will contain scores for each metric.

[
  {
    "dataset": "qa_dataset",
    "run_completed": true,
    "accuracy": 0.25,
    "f1": 0.4
  }
]
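
The aggregate scores can then be read back out of the results, which behave like the list of dictionaries shown above. The loop below assumes that structure; the exact object returned by evaluate may expose scores differently.

# Print each metric's aggregate score per dataset.
# Assumes results is the list of dicts shown above.
for run in results:
    print(run["dataset"], run["accuracy"], run["f1"])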