Quick Start

A Quick Start Guide to Running Evaluations With Scorebook

Getting started with Scorebook is simple. Install it into your project with pip:

pip install scorebook

A Simple Scorebook Evaluation Example

The following example demonstrates the three core steps in a Scorebook evaluation:

  1. Creating an evaluation dataset
  2. Defining an inference callable
  3. Running an evaluation

The full implementation of this simple example can be found in example 1.

1) Creating an Evaluation Dataset

An evaluation dataset can be created from a list of evaluation items. The model being evaluated uses each evaluation item to generate a prediction, which is then scored against the item's label value.

from scorebook import EvalDataset

# Create a list of evaluation items
evaluation_items = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Romeo and Juliet?", "answer": "William Shakespeare"},
]

# Create an evaluation dataset
evaluation_dataset = EvalDataset.from_list(
    name="basic_questions",    # Dataset name
    label="answer",            # Key for the label value in evaluation items
    metrics="accuracy",        # Metric/Metrics used to calculate scores
    data=evaluation_items,     # List of evaluation items
)
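
The metrics argument is described above as "Metric/Metrics", which suggests more than one metric can be scored at once. A minimal sketch, assuming the argument also accepts a list of metric names (this list form is an assumption, not confirmed by this guide):

# Sketch only: assumes metrics can also be passed as a list of names
evaluation_dataset = EvalDataset.from_list(
    name="basic_questions",
    label="answer",
    metrics=["accuracy"],      # assumption: multiple metrics passed as a list
    data=evaluation_items,
)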

2) Defining an Inference Callable

An inference callable can be implemented as a function, method, or class. Its role is to handle the inference process and return a list of model predictions for a list of evaluation items.

Scorebook is model-agnostic, so you can plug in any model or framework. In this example, we use Hugging Face’s Transformers library to run a local Phi-4-mini-instruct model.

An inference callable in Scorebook must:

  • Accept a list of evaluation items
  • Accept hyperparameters as **kwargs
  • Return a list of predictions

from typing import Any, Dict, List

import transformers

# Create a model
pipeline = transformers.pipeline(
    "text-generation",
    model="microsoft/Phi-4-mini-instruct",
    model_kwargs={"torch_dtype": "auto"},
    device_map="auto",
)

# Define an inference function
def inference_function(evaluation_items: List[Dict[str, Any]], **hyperparameters: Any) -> List[Any]:
    """Return a list of model predictions for a list of evaluation items."""
    predictions = []
    for evaluation_item in evaluation_items:
        # Transform the evaluation item into a valid model input format
        messages = [
            {
                "role": "system",
                "content": hyperparameters.get("system_message"),
            },
            {"role": "user", "content": evaluation_item.get("question")},
        ]

        # Run inference on the item
        output = pipeline(messages, temperature=hyperparameters.get("temperature"))

        # Extract and collect the output generated from the model's response
        predictions.append(output[0]["generated_text"][-1]["content"])

    return predictions
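
Before wiring this into a full evaluation, it can help to sanity-check the inference callable by calling it directly on the evaluation items. This is just an illustrative check; the hyperparameter values below mirror the ones used in the evaluate call in the next step:

# Optional sanity check: call the inference function directly
sample_predictions = inference_function(
    evaluation_items,
    system_message="Answer the question directly and concisely.",
    temperature=0.7,
)
print(sample_predictions)  # expected to resemble ["4", "Paris", "William Shakespeare"]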

3) Running an Evaluation

A Scorebook evaluation is run by calling the evaluate function with an inference callable and an evaluation dataset. Hyperparameters can optionally be passed in as a dict.

from scorebook import evaluate

# Evaluate a model against an evaluation dataset
results: List[Dict[str, Any]] = evaluate(
    inference_function,    # The inference function we defined
    evaluation_dataset,    # The evaluation dataset we created
    hyperparameters={
        "temperature": 0.7,
        "system_message": "Answer the question directly and concisely.",
    },
)

By default, evaluate returns results as a list of dicts, with one result per evaluation. In this simple example the evaluate call runs only one evaluation: a single model (Phi-4-mini-instruct) against a single evaluation dataset.

Example Results:

[
    {
        "dataset": "basic_questions",
        "run_completed": true,
        "temperature": 0.7,
        "accuracy": 1
    }
]
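
Each result dict combines the dataset name, the hyperparameters used, and the metric scores, so scores can be read straight out of the returned list. A small usage sketch based on the keys shown in the example output above:

# Print the accuracy score for each evaluation run
for result in results:
    print(f"{result['dataset']}: accuracy = {result['accuracy']}")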

If you made it this far, congrats! You have completed your first Scorebook Evaluation 🎉