
Evaluation Datasets

Creating Evaluation Datasets for Scorebook Evaluations

In Scorebook, every evaluation runs inference over one or more evaluation datasets, each represented by an EvalDataset object. When multiple datasets are passed to evaluate, the model is evaluated on each, and the results are reported separately for every dataset.

Evaluation datasets can be created from local files or loaded directly from Hugging Face. For a complete walkthrough of the different ways datasets can be constructed and combined in evaluations, see example 2.
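
As a quick illustration of that flow, the sketch below creates one small dataset and passes it to evaluate. The import path for evaluate, its call shape, and the inference callable's signature are assumptions made for this illustration, not details confirmed by this page.

from scorebook import EvalDataset, evaluate
from scorebook.metrics import Accuracy

# A toy inference callable: given a list of evaluation items, return one
# prediction per item. A real model call would go here.
def run_inference(items):
    return ["4" for _ in items]

# A minimal dataset; the sections below cover the full set of creation options.
dataset = EvalDataset.from_list(
    name="toy_questions",
    label="answer",
    metrics=Accuracy,
    data=[{"question": "What is 2 + 2?", "answer": "4"}],
)

# Assumed call shape: an inference callable plus one or more datasets;
# results are reported separately for each dataset passed in.
results = evaluate(run_inference, datasets=[dataset])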


Creating Evaluation Datasets

Scorebook's EvalDataset provides three methods for creating evaluation datasets from local data:

  • from_list
  • from_json
  • from_csv

When creating an evaluation dataset from data in any format, the following must be specified:

  • label: The label (ground truth) key for evaluation items
  • metrics: A single metric, or a list of metrics, used to calculate scores during evaluation

The evaluation dataset name can optionally be set; if it is not provided, it is inferred from the source, such as the file name.

from scorebook import EvalDataset
from scorebook.metrics import Accuracy

evaluation_items = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Romeo and Juliet?", "answer": "William Shakespeare"},
]

# Create an EvalDataset from a list
dataset = EvalDataset.from_list(
    name="basic_questions",     # Dataset name
    label="answer",             # Key for the label value in evaluation items
    metrics=Accuracy,           # Metric/metrics used to calculate scores
    data=evaluation_items,      # List of evaluation items
)
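
from_json and from_csv follow the same pattern but read evaluation items from a local file. The positional file-path argument shown below is an assumption for illustration; label, metrics, and the optional name behave as described above, with the name inferred from the file name when omitted.

# Create an EvalDataset from a local JSON file containing a list of items
# shaped like the dictionaries above (the file path is hypothetical).
json_dataset = EvalDataset.from_json(
    "basic_questions.json",
    label="answer",
    metrics=Accuracy,
)

# Create an EvalDataset from a local CSV file with "question" and "answer" columns.
csv_dataset = EvalDataset.from_csv(
    "basic_questions.csv",
    label="answer",
    metrics=Accuracy,
)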

Loading Evaluation Datasets from HuggingFace

Scorebook's EvalDataset provides two methods for creating evaluation datasets from HuggingFace datasets:

  • from_huggingface: Creates an EvalDataset directly from a HuggingFace dataset
  • from_yaml: Creates an EvalDataset from a HuggingFace dataset specified in a YAML file

From HuggingFace

from scorebook import EvalDataset
from scorebook.metrics import Accuracy

# Load cais MMLU from HuggingFace as an EvalDataset
mmlu_eval_dataset = EvalDataset.from_huggingface(
    path="cais/mmlu",     # HuggingFace dataset path
    label="answer",       # Label field
    metrics=Accuracy,     # Metrics for scoring
    split="test",         # Dataset split
    name="all",           # Dataset config name
)

# Load TIGER-Lab MMLU-Pro from HuggingFace as an EvalDataset
mmlu_pro_eval_dataset = EvalDataset.from_huggingface(
    path="TIGER-Lab/MMLU-Pro",
    label="answer",
    metrics=Accuracy,
    split="validation",
    name="default",
)

From YAML

The specification of a HuggingFace dataset to load can be represented in a YAML file:

tiger_lab_mmlu_pro.yaml
name: "TIGER-Lab/MMLU-Pro"
label: "answer"
split: "validation"
config: "default"
metrics:
  - "accuracy"

system_prompt: "You are..."
template: |
  {{ question }}
  Options:
  {% for option in options %}
  {{ number_to_letter(loop.index0) }} : {{ option }}
  {% endfor %}

# Optional metadata
description: "MMLU-Pro multiple choice questions"

Loading an EvalDataset from a YAML spec for a HuggingFace dataset

# Load TIGER-Lab MMLU-Pro from HuggingFace as an EvalDataset, as specified in the YAML spec above
mmlu_pro_eval_dataset = EvalDataset.from_yaml("tiger_lab_mmlu_pro.yaml")