Evaluation Datasets
Creating Evaluation Datasets for Scorebook Evaluations
In Scorebook, every evaluation runs inference over one or more evaluation datasets, each represented by an EvalDataset object. When multiple datasets are passed to evaluate, the model is evaluated on each dataset, and the results are reported separately for every dataset.
Evaluation datasets can be created from local files or loaded directly from Hugging Face. For a complete walkthrough of the different ways datasets can be constructed and combined in evaluations, see example 2.
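For example, a single evaluate call can cover several datasets at once. The sketch below is illustrative only: run_model is a hypothetical inference function, dataset_a and dataset_b stand in for EvalDataset objects built as shown later on this page, and evaluate is assumed to be importable from scorebook and to accept a callable plus a list of datasets (its exact signature may differ).
from scorebook import evaluate

# Hypothetical inference function: receives evaluation items and
# returns one model output per item. A real implementation would
# call a model here instead of returning a constant.
def run_model(items):
    return ["4" for _ in items]

# Assumed call shape: the model is run over each dataset, and
# results are reported separately per dataset.
results = evaluate(run_model, [dataset_a, dataset_b])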
Creating Evaluation Datasets
Scorebook's EvalDataset class provides three methods for creating evaluation datasets from local data:
- from_list: Creates an EvalDataset from a list of dictionaries
- from_json: Creates an EvalDataset from a JSON file
- from_csv: Creates an EvalDataset from a CSV file
When creating an evaluation dataset from any of these formats, the following must be specified:
- label: The label (ground truth) key for evaluation items
- metrics: A single metric, or a list of metrics, used to calculate scores during evaluation
A dataset name can optionally be set; if it is not provided, it is inferred from the source, such as the file name. The examples below show each method in turn.
From List
from scorebook import EvalDataset
from scorebook.metrics import Accuracy

evaluation_items = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Romeo and Juliet?", "answer": "William Shakespeare"},
]
# Create an EvalDataset from a list
dataset = EvalDataset.from_list(
    name="basic_questions",    # Dataset name
    label="answer",            # Key for the label value in evaluation items
    metrics=Accuracy,          # Metric/metrics used to calculate scores
    data=evaluation_items,     # List of evaluation items
)
From JSON
basic_questions.json
[
  {"question": "What is 2 + 2?", "answer": "4"},
  {"question": "What is the capital of France?", "answer": "Paris"},
  {"question": "Who wrote Romeo and Juliet?", "answer": "William Shakespeare"}
]
# Create an EvalDataset from a JSON file
dataset = EvalDataset.from_json(
    file_path="basic_questions.json",   # Name is inferred from the file name ("basic_questions")
    label="answer",
    metrics=Accuracy,
)
From CSV
basic_questions.csv
question,answer
What is 2 + 2?,4
What is the capital of France?,Paris
Who wrote Romeo and Juliet?,William Shakespeare
# Create an EvalDataset from a CSV file
dataset = EvalDataset.from_csv(
    file_path="basic_questions.csv",
    label="answer",
    metrics=Accuracy,
)
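All three examples above construct an equivalent dataset: the same three question-and-answer items, labelled by the answer key and scored with Accuracy.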
Loading Evaluation Datasets from HuggingFace
Scorebook's EvalDataset class provides two methods for creating evaluation datasets from HuggingFace datasets:
- from_huggingface: Creates an EvalDataset from a HuggingFace dataset directly
- from_yaml: Creates an EvalDataset from a HuggingFace dataset specified in a YAML file
From HuggingFace
from scorebook import EvalDataset
from scorebook.metrics import Accuracy
# Load cais MMLU from HuggingFace as an EvalDataset
mmlu_eval_dataset = EvalDataset.from_huggingface(
    path="cais/mmlu",    # HuggingFace dataset path
    label="answer",      # Label field
    metrics=Accuracy,    # Metrics for scoring
    split="test",        # Dataset split
    name="all",          # Dataset config name
)
# Load TIGER-Lab MMLU-Pro from HuggingFace as an EvalDataset
mmlu_pro_eval_dataset = EvalDataset.from_huggingface(
    path="TIGER-Lab/MMLU-Pro",
    label="answer",
    metrics=Accuracy,
    split="validation",
    name="default",
)
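Both datasets can then be passed to a single evaluate call, with results reported separately for each, as described at the top of this page.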
From YAML
The specification of a HuggingFace dataset for loading can be represented in a YAML file:
tiger_lab_mmlu_pro.yaml
name: "TIGER-Lab/MMLU-Pro"
label: "answer"
split: "validation"
config: "default"
metrics:
  - "accuracy"
system_prompt: "You are..."
template: |
  {{ question }}
  Options:
  {% for option in options %}
  {{ number_to_letter(loop.index0) }} : {{ option }}
  {% endfor %}

# Optional metadata
description: "MMLU-Pro multiple choice questions"
Loading an EvalDataset from a YAML spec for a HuggingFace dataset:
# Load TIGER-Lab MMLU-Pro from HuggingFace as an EvalDataset, as specified in the YAML spec
mmlu_pro_eval_dataset = EvalDataset.from_yaml("tiger_lab_mmlu_pro.yaml")