
Results

Understanding and Working with Evaluation Results

When you run evaluate, Scorebook produces both aggregate scores and per-item scores.
The structure and level of detail of the returned results can be controlled with the following flags (a short example call follows the list):

  • return_dict (default: True): Return results as a dict/list; when False, return an EvalResult instance
  • return_aggregates (default: True): Include aggregate scores in the returned dict
  • return_items (default: False): Include per-item scores in the returned dict
  • return_output (default: False): Include each item's model output in its per-item score row
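
For example, a call that requests per-item scores alongside model outputs might look like the sketch below. The import path and the inference_function / eval_dataset names are assumptions standing in for your own inference callable and dataset.

from scorebook import evaluate  # import path assumed; adjust to your installation

# Request aggregate scores, per-item scores, and the raw model outputs.
results = evaluate(
    inference_function,       # your inference callable (assumed name)
    eval_dataset,             # your evaluation dataset (assumed name)
    return_dict=True,         # default: results come back as a dict/list
    return_aggregates=True,   # default: include aggregate scores
    return_items=True,        # also include per-item scores
    return_output=True,       # attach each item's model output
)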

Return Shapes

Dict Output

By default, results are returned as a dictionary with up to two sections. If return_dict=True and both return_aggregates and return_items are True, evaluate() returns a dictionary with two keys:

  • aggregate_results → list of dicts (one row per dataset × hyperparameter run)
  • item_results → list of dicts (one row per evaluated item)

If only one of return_aggregates or return_items is True, then the return value is a list containing just that section.

At least one of return_aggregates or return_items must be True when return_dict=True, otherwise a ParameterValidationError is raised.

{
  "aggregate_results": [
    {
      "dataset": "qa_dataset",
      "run_completed": true,
      "temperature": 0.7,
      "accuracy": 0.81,
      "f1": 0.78
    }
  ],
  "item_results": [
    {
      "item_id": 0,
      "dataset_name": "qa_dataset",
      "temperature": 0.7,
      "accuracy": 1,
      "f1": 1
    }
  ]
}
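
A minimal sketch of reading those sections back out, assuming the key names shown above and that the single-section case comes back as a flat list of row dicts:

# Both sections requested: results is a dict with two keys.
results = evaluate(inference_function, eval_dataset, return_items=True)
for row in results["aggregate_results"]:
    print(row["dataset"], row.get("accuracy"))
for row in results["item_results"]:
    print(row["item_id"], row.get("accuracy"))

# Only aggregates requested (the defaults): results is a list of aggregate rows.
aggregates_only = evaluate(inference_function, eval_dataset)
for row in aggregates_only:
    print(row["dataset"], row.get("accuracy"))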

EvalResult Output

If return_dict is set to False, evaluate returns an EvalResult instance.

results: EvalResult = evaluate(inference_function, eval_dataset, return_dict=False)
results.scores # Dict[str, List[Dict[str, Any]]]
results.aggregate_scores # List[Dict[str, Any]] (same rows as aggregate_results)
results.item_scores # List[Dict[str, Any]] (same rows as item_results)
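
For instance, the same rows could be summarised from an EvalResult like this; a sketch using the attribute and column names shown above:

results = evaluate(inference_function, eval_dataset, return_dict=False)

# Aggregate rows mirror aggregate_results from the dict form.
for run in results.aggregate_scores:
    print(run["dataset"], run.get("accuracy"), run.get("f1"))

# Item rows mirror item_results.
print(len(results.item_scores), "items scored")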

Return Details

Outputs

Model outputs for each evaluation item can optionally be included with the return_output flag. They appear under "inference_output" within the item results of an evaluation.

"item_results": [
{
"item_id": 0,
"dataset_name": "basic_questions",
"inference_output": "4"
"temperature": 0.7,
"accuracy": 1,
"f1": 1
},
{
"item_id": 1,
"dataset_name": "basic_questions",
"inference_output": "Paris"
"temperature": 0.7,
"accuracy": 1,
"f1": 1
},
{
"item_id": 2,
"dataset_name": "basic_questions",
"inference_output": "William Shakespeare"
"temperature": 0.7,
"accuracy": 1,
"f1": 1
},
]
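
A short sketch pairing each item's output with its score, assuming return_items and return_output are both enabled and the dict return shape is used:

results = evaluate(
    inference_function,
    eval_dataset,
    return_items=True,
    return_output=True,
)

for item in results["item_results"]:
    print(item["item_id"], repr(item["inference_output"]), item["accuracy"])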

Run Id

If using Trismik's services within Scorebook, any evaluation results uploaded to the Trismik dashboard will include a unique run_id for each run within the evaluation. An evaluation run refers to one evaluation dataset × hyperparameter configuration.

{
  "dataset": "dataset",
  "run_id": "387b77604e21654f238c74ec3e12b25df33e89e7",
  "accuracy": 1.0
}
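
If present, the run_id can be read from the aggregate rows like any other column. This sketch assumes the dict return shape and that the field sits alongside the aggregate scores as shown above:

for run in results["aggregate_results"]:
    # run_id is present only for runs uploaded to Trismik
    print(run["dataset"], run.get("run_id"))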