Quick Start

A Quick Start Guide to Running Evaluations With Scorebook

Getting started with Scorebook is simple. Install it into your project with pip:

pip install scorebook

A Simple Scorebook Evaluation Example

The following example demonstrates the three core steps in a Scorebook evaluation:

  1. Creating an evaluation dataset
  2. Defining an inference callable
  3. Running an evaluation

The full implementation of this simple example can be found in example 1.

1) Creating an Evaluation Dataset

An evaluation dataset can be created from a list of evaluation items. The model being evaluated uses each evaluation item to generate a prediction, which is then scored against the item's label value.

from scorebook import EvalDataset

# Create a list of evaluation items
evaluation_items = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Romeo and Juliet?", "answer": "William Shakespeare"},
]

# Create an evaluation dataset
evaluation_dataset = EvalDataset.from_list(
    name="basic_questions",    # Dataset name
    label="answer",            # Key for the label value in evaluation items
    metrics="accuracy",        # Metric/Metrics used to calculate scores
    data=evaluation_items,     # List of evaluation items
)
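
The metrics argument is described above as "Metric/Metrics", which suggests more than one metric can be scored at once. A minimal sketch, assuming the argument also accepts a list of metric names (this list form is an assumption, not confirmed by this guide):

# Sketch only: assumes metrics can also be passed as a list of names
evaluation_dataset = EvalDataset.from_list(
    name="basic_questions",
    label="answer",
    metrics=["accuracy"],      # assumption: multiple metrics passed as a list
    data=evaluation_items,
)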

2) Defining an Inference Callable

An inference callable can be implemented as a function, method, or class. Its role is to handle the inference process and return a list of model predictions for a list of evaluation items.

Scorebook is model-agnostic, so you can plug in any model or framework. In this example, we use Hugging Face’s Transformers library to run a local Phi-4-mini-instruct model.

An inference callable in Scorebook must:

  • Accept a list of evaluation items
  • Accept hyperparameters as **kwargs
  • Return a list of predictions

from typing import Any, Dict, List

import transformers

# Create a model
pipeline = transformers.pipeline(
    "text-generation",
    model="microsoft/Phi-4-mini-instruct",
    model_kwargs={"torch_dtype": "auto"},
    device_map="auto",
)

# Define an inference function
def inference_function(evaluation_items: List[Dict[str, Any]], **hyperparameters: Any) -> List[Any]:
    """Return a list of model predictions for a list of evaluation items."""
    predictions = []
    for evaluation_item in evaluation_items:
        # Transform the evaluation item into a valid model input format
        messages = [
            {
                "role": "system",
                "content": hyperparameters.get("system_message"),
            },
            {"role": "user", "content": evaluation_item.get("question")},
        ]

        # Run inference on the item
        output = pipeline(messages, temperature=hyperparameters.get("temperature"))

        # Extract and collect the output generated from the model's response
        predictions.append(output[0]["generated_text"][-1]["content"])

    return predictions
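
Before wiring this into a full evaluation, it can help to sanity-check the inference callable by calling it directly on the evaluation items. This is just an illustrative check; the hyperparameter values below mirror the ones used in the evaluate call in the next step:

# Optional sanity check: call the inference function directly
sample_predictions = inference_function(
    evaluation_items,
    system_message="Answer the question directly and concisely.",
    temperature=0.7,
)
print(sample_predictions)  # expected to resemble ["4", "Paris", "William Shakespeare"]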

3) Running an Evaluation

A Scorebook evaluation is run by calling the evaluate function with an inference callable and an evaluation dataset. Hyperparameters can optionally be passed in as a dict.

from scorebook import evaluate

# Evaluate a model against an evaluation dataset
results: List[Dict[str, Any]] = evaluate(
    inference_function,    # The inference function we defined
    evaluation_dataset,    # The evaluation dataset we created
    hyperparameters={
        "temperature": 0.7,
        "system_message": "Answer the question directly and concisely.",
    },
)

By default, evaluate returns results as a list of dicts, with one result per evaluation. In this simple example the evaluate call runs only one evaluation: a single model (Phi-4-mini-instruct) against a single evaluation dataset.

Example Results:

[
    {
        "dataset": "basic_questions",
        "run_completed": true,
        "temperature": 0.7,
        "accuracy": 1
    }
]
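
Each result dict combines the dataset name, the hyperparameters used, and the metric scores, so scores can be read straight out of the returned list. A small usage sketch based on the keys shown in the example output above:

# Print the accuracy score for each evaluation run
for result in results:
    print(f"{result['dataset']}: accuracy = {result['accuracy']}")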

If you made it this far, congrats! You have completed your first Scorebook Evaluation 🎉