
Evaluation Datasets

Creating Evaluation Datasets for Scorebook Evaluations

In Scorebook, every evaluation runs inference over one or more evaluation datasets, each represented by an EvalDataset object. When multiple datasets are passed to evaluate, the model is evaluated on each, and the results are reported separately for every dataset.

Evaluation datasets can be created from local files or loaded directly from Hugging Face. For a complete walkthrough of the different ways datasets can be constructed and combined in evaluations, see example 2.
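
As a quick illustration of that flow, the sketch below creates one small dataset and passes it to evaluate. The import path for evaluate, its call shape, and the inference callable's signature are assumptions made for this illustration, not details confirmed by this page.

from scorebook import EvalDataset, evaluate
from scorebook.metrics import Accuracy

# A toy inference callable: given a list of evaluation items, return one
# prediction per item. A real model call would go here.
def run_inference(items):
    return ["4" for _ in items]

# A minimal dataset; the sections below cover the full set of creation options.
dataset = EvalDataset.from_list(
    name="toy_questions",
    label="answer",
    metrics=Accuracy,
    data=[{"question": "What is 2 + 2?", "answer": "4"}],
)

# Assumed call shape: an inference callable plus one or more datasets;
# results are reported separately for each dataset passed in.
results = evaluate(run_inference, datasets=[dataset])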


Creating Evaluation Datasets

Scorebook's EvalDataset provides three methods for creating evaluation datasets from local data:

  • from_list
  • from_json
  • from_csv

When creating an evaluation dataset from data in any format, the following must be specified:

  • label: The label (ground truth) key for evaluation items
  • metrics: A single metric, or a list of metrics, used to calculate scores during evaluation

The evaluation dataset name can optionally be set; if it is not provided, it is inferred from the source, such as the file name.

from scorebook import EvalDataset
from scorebook.metrics import Accuracy

evaluation_items = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Romeo and Juliet?", "answer": "William Shakespeare"},
]

# Create an EvalDataset from a list
dataset = EvalDataset.from_list(
    name="basic_questions",     # Dataset name
    label="answer",             # Key for the label value in evaluation items
    metrics=Accuracy,           # Metric/metrics used to calculate scores
    data=evaluation_items,      # List of evaluation items
)
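
from_json and from_csv follow the same pattern but read evaluation items from a local file. The positional file-path argument shown below is an assumption for illustration; label, metrics, and the optional name behave as described above, with the name inferred from the file name when omitted.

# Create an EvalDataset from a local JSON file containing a list of items
# shaped like the dictionaries above (the file path is hypothetical).
json_dataset = EvalDataset.from_json(
    "basic_questions.json",
    label="answer",
    metrics=Accuracy,
)

# Create an EvalDataset from a local CSV file with "question" and "answer" columns.
csv_dataset = EvalDataset.from_csv(
    "basic_questions.csv",
    label="answer",
    metrics=Accuracy,
)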

Loading Evaluation Datasets from HuggingFace

Scorebook's EvalDataset provides two methods for creating evaluation datasets from HuggingFace datasets:

  • from_huggingface: Creates an EvalDataset directly from a HuggingFace dataset
  • from_yaml: Creates an EvalDataset from a HuggingFace dataset specified in a YAML file

From HuggingFace

from scorebook import EvalDataset
from scorebook.metrics import Accuracy

# Load cais MMLU from HuggingFace as an EvalDataset
mmlu_eval_dataset = EvalDataset.from_huggingface(
    path="cais/mmlu",     # HuggingFace dataset path
    label="answer",       # Label field
    metrics=Accuracy,     # Metrics for scoring
    split="test",         # Dataset split
    name="all",           # Dataset config name
)

# Load TIGER-Lab MMLU-Pro from HuggingFace as an EvalDataset
mmlu_pro_eval_dataset = EvalDataset.from_huggingface(
    path="TIGER-Lab/MMLU-Pro",
    label="answer",
    metrics=Accuracy,
    split="validation",
    name="default",
)

From YAML

The specification of a HuggingFace dataset to load can be represented in a YAML file:

tiger_lab_mmlu_pro.yaml
name: "TIGER-Lab/MMLU-Pro"
label: "answer"
split: "validation"
config: "default"
metrics:
  - "accuracy"

system_prompt: "You are..."
template: |
  {{ question }}
  Options:
  {% for option in options %}
  {{ number_to_letter(loop.index0) }} : {{ option }}
  {% endfor %}

# Optional metadata
description: "MMLU-Pro multiple choice questions"

Loading an EvalDataset from a YAML spec for a HuggingFace dataset

# Load TIGER-Lab MMLU-Pro from HuggingFace as an EvalDataset, as specified in the YAML spec above
mmlu_pro_eval_dataset = EvalDataset.from_yaml("tiger_lab_mmlu_pro.yaml")