
Results

Understanding and Working with Evaluation Results

When you run evaluate, Scorebook produces both aggregate scores and per-item scores.
The structure and level of detail of the returned results can be controlled with the following flags (a short example call follows the list):

  • return_dict (default: True): Return results as a dict/list; when False, return an EvalResult instance
  • return_aggregates (default: True): Include aggregate scores in the returned dict
  • return_items (default: False): Include per-item scores in the returned dict
  • return_output (default: False): Include each item's model output in its per-item score row
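
For example, a call that requests per-item scores alongside model outputs might look like the sketch below. The import path and the inference_function / eval_dataset names are assumptions standing in for your own inference callable and dataset.

from scorebook import evaluate  # import path assumed; adjust to your installation

# Request aggregate scores, per-item scores, and the raw model outputs.
results = evaluate(
    inference_function,       # your inference callable (assumed name)
    eval_dataset,             # your evaluation dataset (assumed name)
    return_dict=True,         # default: results come back as a dict/list
    return_aggregates=True,   # default: include aggregate scores
    return_items=True,        # also include per-item scores
    return_output=True,       # attach each item's model output
)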

Return Shapes

Dict Output

By default, results are returned as a dictionary with up to two sections. If return_dict=True and both return_aggregates and return_items are True, evaluate() returns a dictionary with two keys:

  • aggregate_results → list of dicts (one row per dataset × hyperparameter run)
  • item_results → list of dicts (one row per evaluated item)

If only one of return_aggregates or return_items is True, then the return value is a list containing just that section.

At least one of return_aggregates or return_items must be True when return_dict=True, otherwise a ParameterValidationError is raised.

{
  "aggregate_results": [
    {
      "dataset": "qa_dataset",
      "run_completed": true,
      "temperature": 0.7,
      "accuracy": 0.81,
      "f1": 0.78
    }
  ],
  "item_results": [
    {
      "item_id": 0,
      "dataset_name": "qa_dataset",
      "temperature": 0.7,
      "accuracy": 1,
      "f1": 1
    }
  ]
}
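
A minimal sketch of reading those sections back out, assuming the key names shown above and that the single-section case comes back as a flat list of row dicts:

# Both sections requested: results is a dict with two keys.
results = evaluate(inference_function, eval_dataset, return_items=True)
for row in results["aggregate_results"]:
    print(row["dataset"], row.get("accuracy"))
for row in results["item_results"]:
    print(row["item_id"], row.get("accuracy"))

# Only aggregates requested (the defaults): results is a list of aggregate rows.
aggregates_only = evaluate(inference_function, eval_dataset)
for row in aggregates_only:
    print(row["dataset"], row.get("accuracy"))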

EvalResult Output

If return_dict is set to False, evaluate returns an EvalResult instance.

results: EvalResult = evaluate(inference_function, eval_dataset, return_dict=False)
results.scores # Dict[str, List[Dict[str, Any]]]
results.aggregate_scores # List[Dict[str, Any]] (same rows as aggregate_results)
results.item_scores # List[Dict[str, Any]] (same rows as item_results)
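
For instance, the same rows could be summarised from an EvalResult like this; a sketch using the attribute and column names shown above:

results = evaluate(inference_function, eval_dataset, return_dict=False)

# Aggregate rows mirror aggregate_results from the dict form.
for run in results.aggregate_scores:
    print(run["dataset"], run.get("accuracy"), run.get("f1"))

# Item rows mirror item_results.
print(len(results.item_scores), "items scored")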

Return Details

Outputs

Model outputs for each evaluation item can optionally be included with the return_output flag. They appear under "inference_output" within the item results of an evaluation.

"item_results": [
{
"item_id": 0,
"dataset_name": "basic_questions",
"inference_output": "4"
"temperature": 0.7,
"accuracy": 1,
"f1": 1
},
{
"item_id": 1,
"dataset_name": "basic_questions",
"inference_output": "Paris"
"temperature": 0.7,
"accuracy": 1,
"f1": 1
},
{
"item_id": 2,
"dataset_name": "basic_questions",
"inference_output": "William Shakespeare"
"temperature": 0.7,
"accuracy": 1,
"f1": 1
},
]
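
A short sketch pairing each item's output with its score, assuming return_items and return_output are both enabled and the dict return shape is used:

results = evaluate(
    inference_function,
    eval_dataset,
    return_items=True,
    return_output=True,
)

for item in results["item_results"]:
    print(item["item_id"], repr(item["inference_output"]), item["accuracy"])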

Run Id

If using Trismik's services within Scorebook, any evaluation results uploaded to the Trismik dashboard will include a unique run_id for each run within the evaluation. An evaluation run refers to one evaluation dataset × hyperparameter configuration.

{
  "dataset": "dataset",
  "run_id": "387b77604e21654f238c74ec3e12b25df33e89e7",
  "accuracy": 1.0
}
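
If present, the run_id can be read from the aggregate rows like any other column. This sketch assumes the dict return shape and that the field sits alongside the aggregate scores as shown above:

for run in results["aggregate_results"]:
    # run_id is present only for runs uploaded to Trismik
    print(run["dataset"], run.get("run_id"))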