Concepts
Most of the concepts used in Gage are defined by Inspect AI. Below, we discuss how each concept fits into Gage. To develop an in-depth understanding, we recommend following the links to the Inspect documentation.
Tasks
Tasks let you define, implement, and measure well-defined behavior. Tasks can be broad or narrow.
Broadly defined task:
Handle a customer inquiry
Narrowly defined task:
Tell me if a customer has become upset in a chat
In general, narrowly defined tasks are easier to build and measure than broadly defined tasks. Gage encourages a process of task composition, where lower level, narrow tasks are composed to implement higher level, broad tasks.
A task is what you build, test, and run in Gage.
- Implement the task
- Show how the task works through evaluations
- When it’s ready, deploy the task to production
Define a task
A task is an instance of inspect_ai.Task. It
consists of a solver and an optional scorer.
```python
@task
def my_task():
    return Task(
        solver=input_template(),
        scorer=llm_judge()
    )
```

- `solver` implements the task
- `scorer` measures how well the task was performed
This is how Gage supports eval driven development. As you write code, you define what correct and incorrect behavior is. This mirrors the best practice of test driven development — for LLMs.
Run a task
A task by itself is just a recipe for doing work and scoring results. Tasks need to be run.
You can run a task in different ways depending on the stage of development, as shown below.
| Use | When |
|---|---|
| `gage run` command | In development for ad hoc, single cases |
| `gage eval` command | In development and test to formally measure |
| `run_task` Python function | To run your task in an application |
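For example, during development you might run a single case ad hoc, then measure more formally with an eval. These commands mirror the examples used later in this guide:

```shell
# Ad hoc, single case
gage run funny --model anthropic/claude-sonnet-4 --input santa

# Formal measurement against a dataset
gage eval funny --model openai/gpt-5
```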
Evaluate a task
Tasks are evaluated by running them with sample input. For each sample, task output is scored. Sample scores are tallied to give you an idea of how the task performed for a run.
Sample input is defined in a dataset.
The task processes input using a model.
An evaluation measures:
- Task performance
- Model performance
- Sample quality
- Scoring quality
You have opportunities to improve your AI application by carefully reviewing evaluation results and re-running evals to chart your progress. This iterative process exemplifies eval driven development.
Solvers
A solver implements task behavior. The simplest possible solver takes input and passes it to a model, returning the result. In practice, solvers prepare the input using a variety of AI programming patterns.
- Prompt engineering
- Retrieval augmented generation (RAG)
- Few shot examples
- Self critique
Solvers can be used off-the-shelf or customized.
A solver is often defined as a list of sub-solvers. Inspect provides an elegant, high level interface to chain solver logic together to simplify AI development. You assemble complex behavior from simpler, more narrowly scoped behavior.
Here’s a task to assess user sentiment in a chat history. It uses several solvers in series.
```python
@task
def user_sentiment():
    """Return how a user feels at a point in a chat session."""
    return Task(
        solver=[
            instructions(),       # Model prompt and user chat history
            positive_examples(),  # Examples of desired output
            negative_examples(),  # Examples of undesired output
            generate(),           # Call the model
            self_critique()       # Ask the model to critique itself
        ]
    )
```

Each item in the `solver` list is itself a solver. Each solver participates in the task implementation by applying its specialized behavior. You can add and remove solvers from the chain to alter the task behavior.
Here’s a minimal version of the above task.
```python
@task
def user_sentiment():
    """Return how a user feels at a point in a chat session."""
    return Task(
        solver=[
            instructions(),  # Model prompt and user chat history
            generate(),      # Call the model
        ]
    )
```

Which solver is better? The first one does a lot more. But is it better?
That’s the question! It can be hard to answer.
- How accurate is each solver for a given model?
- How is “accuracy” defined? How do we know if the task is correct or not?
- How long does each solver take to run? How much does it cost?
- Does the additional time and cost of a fancier solver provide enough improved accuracy?
You can answer these questions by running different experiments and comparing the results.
Notice we haven’t asked, “which model is better?” That’s also an important topic — one that you should consider alongside different solver options.
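For example, you might run the same eval against two models and compare the results. The model names here are simply the ones used elsewhere in this guide:

```shell
gage eval user_sentiment --model anthropic/claude-sonnet-4
gage eval user_sentiment --model openai/gpt-5
```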
Scorers
A scorer is the part of a task that measures task output and assigns a score.
Here’s the user sentiment task that uses an LLM to assign a score.
```python
@task
def user_sentiment():
    return Task(
        solver=[...],  # Task solver chain (see above)
        scorer=llm_judge(),
    )
```

Accurate scoring is arguably the most challenging aspect of developing software with statistical models. After all, the point of a model is to estimate an answer. What’s a good estimate? What’s good enough?
And without accurate scoring, how do you know if your software is improving? How do you know when it’s ready to ship?
In Inspect, a score is typically a boolean indicator of Correct or
Incorrect. This may seem overly simplistic in many cases but it’s
surprisingly effective, especially when you evaluate enough samples.
Scores aren’t limited to booleans, however; you can also assign numeric values and other categories.
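For instance, Inspect’s `Score` accepts a range of value types. This is a small sketch using Inspect’s scorer API; the example values are ours:

```python
from inspect_ai.scorer import Score

Score(value="C")          # boolean-style correct/incorrect
Score(value=0.75)         # numeric score
Score(value="negative")   # categorical label
```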
Computed vs estimated
When scoring model output, you generally take one of two approaches.
- Run some Python code to compute an answer
- Ask a model for a score
We refer to these methods as computed and estimated scores respectively.
Consider some of the model output in Get started. The task uses an LLM judge to score model output. In some of our examples, the LLM judge was wrong about the word count: it scored a result as Incorrect because it miscounted words.
If your scoring algorithm is systematically wrong, you have little chance of making systematic improvements to your code.
There are a couple ways to address the scoring problem in the Get started example.
Option 1. Fix the LLM judge
There is likely a path to improving the word count accuracy by adjusting the LLM judge prompt. It’s often easy to fix problems like this with more specific instructions.
This leaves the problem of scoring the LLM judge. How do we know if our changes to the judge prompt have improved its word count accuracy?
Great news! Gage lets you implement LLM judge scorers as tasks. This means you can develop and evaluate your judge using the same iterative workflow used for your other work.
Option 2. Count the words using Python
It’s tempting to ask AI to score AI. It’s easy. “LLM, tell me the score!” and your job is done.
Fortunately for your job, it’s not done! If you ask AI to score AI, you need to score the AI that scores the AI. And so on.
At some point you want to find ground truth — the true answer. At a minimum you should have an idea where you could find it, even if it’s too expensive or complex to obtain.
In the case of “did the model stay within the allowed word count," ground truth is easy to come by.
Here’s a simple scorer that uses nltk to compute an accurate
score.
```python
import nltk

def score_word_count(output: str, max_count: int):
    count = len(nltk.word_tokenize(output))
    return "C" if count <= max_count else "I"
```

The use of `"C"` and `"I"` is a convention used in Inspect to indicate “correct” and “incorrect” respectively. Use these values for boolean scores to calculate accuracy tallies for an eval.
In the case of the Get started example, we still need to determine whether model output is funny. This is hard! One person’s funny is another person’s cringe, and vice versa.
This is of course the difficulty of building AI applications. There’s a temptation to hand off implementation details to a model. But when it comes time to measure “how well is the model doing” you can’t hand that off. You need to define what you want with enough precision that you can accurately measure when you get it.
Datasets
A dataset is a list of samples that you can use to evaluate a task. Think of datasets as test cases.
Datasets play a central role in Inspect evaluations — so much so they’re built into the task definitions.
In Gage you can separate datasets from tasks to support different evaluation scenarios.
- Small datasets used in development to quickly check changes as you make them
- Larger, more diverse datasets used for more comprehensive testing
- Narrowly defined datasets to measure specific scenarios and edge cases
Here’s an example of three datasets, each consisting of samples defined in YAML files.
YAML is a convenient data format for text block inputs.
```python
from gage_inspect.dataset import dataset, yaml_dataset

@dataset(default=True)
def custom_sentiment_quickcheck():
    """Quick check samples - use as you develop."""
    return yaml_dataset("samples/quickcheck.yaml")

@dataset()
def custom_sentiment_test():
    """Full test coverage."""
    return yaml_dataset("samples/test.yaml")

@dataset()
def custom_sentiment_issue_123():
    """Test regressions for issue 123."""
    return yaml_dataset("samples/issues/123.yaml")
```

Specify the dataset used for an evaluation in `gage eval` with the `--dataset` option. Otherwise Gage uses the default dataset.
gage eval custom_sentiment --dataset test
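For reference, a quickcheck.yaml file might look something like this. The exact schema is an assumption on our part; Inspect samples typically pair an input with an optional target:

```yaml
- input: |
    Customer: I have asked three times and still no refund.
  target: negative
- input: |
    Customer: Thanks, that fixed it!
  target: positive
```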
Models
A model refers to an LLM. Tasks interact with models to perform their work. In fact, the role of most tasks is to construct model prompts.
When you run a task, you specify the model it uses. You can use a different model for each call. In this way you can compare how different models affect task performance.
gage run funny --model anthropic/claude-sonnet-4 --input santa
gage eval funny --model openai/gpt-5
See Models reference for a list of model providers.
Evals
An eval — short for evaluation — is a recorded series of task calls. Each task call consists of a single sample and a model. Each sample is sent to the task, which uses the configured model to do work. Results are scored and tallied to provide evaluation results.
In Get started, you run an eval with five samples. The samples are defined as a dataset.
```python
@dataset
def samples():
    """Sample topics for the funny task."""
    return ["birds", "cows", "cats", "corn", "barns"]
```

Eval results are stored in logs and are available using Inspect compatible tools.
Not surprisingly, evals play the main role in eval driven development.
A typical workflow looks like this:
Initial dev
- Write some task code
- Run the task with a sample input
- Study the results, get some ideas
- Modify your code, run more samples, study the results
Initial eval
- When you find yourself repeating samples, create a dataset
- Use the dataset to run evals each time you make changes to your code
- Study the results (you’re used to this by now)
Initial scoring
- When you’ve studied enough results, you’ll have a sense of correct and incorrect output — define the criteria for what’s correct
- Use an LLM judge to estimate a score using your criteria
- Run your eval with the LLM judge scorer
- Study the results
Ongoing development
Once you have an eval working you can establish a baseline for future work.
For each iteration, study the results. Is the score correct? It can be just as likely that the score is wrong as that the task is wrong. Look for false positives and false negatives.
If you see an incorrect score, try to address it. If you’re using an LLM judge, implement the judge as a task (as described above) and evaluate it. A judge doesn’t have to be 100% accurate to be useful. An LLM judge estimates scores, and any given estimate can be wrong.
If you can apply ground truth in your scoring, that’s better than estimating. However, getting and applying ground truth can be expensive and challenging. You might start with an LLM judge and move to computed scores once you understand your evals better.
Eval driven development encourages you to study the results. If you’re looking for shortcuts with automated tests, ask yourself: who tests the testers? How can you know without looking? Gage evals make this process easier and faster.
Logs
Everything that happens during a task run is stored in a log. When we say “study the results,” we mean “study the logs.”
Logs are generated for ad hoc runs, evals, and calls to
run_task.
There are various methods of reading Inspect logs.
- Gage Review
- Inspect View (standalone or as a VS Code extension)
- Python API
Gage Review
Gage Review is a terminal based UI (TUI) for reviewing Inspect logs. Gage Review is designed to make “study the results” as fast and effective as possible.
Gage Review is experimental — it’s not obvious to us that a TUI is the best tool for the job but we want to see how it works! We like it, but then again we write software in nerdy languages like Rust!
Advantages of a TUI:
- Runs anywhere — the same experience during local dev and for remote runs/debugging
- Fast navigation through eval samples
- Keyboard navigation encourages fast workflow
Gage Review is a developer tool. It’s not intended for a non-technical audience. We recognize that a TUI is not the tool of choice for most data scientists.
Subject to community feedback, we have plans for Review!
- Annotations (for open, axial, and selective codes)
- Assistance with review and comparison
- Interfaces for non-technical, expert review input
Inspect View
Inspect View is the gold standard for viewing Inspect logs.
Use Inspect View when you need:
- Web interface to Inspect logs
- Information or features that aren’t available in Gage (in particular, Inspect View has outstanding filtering support for pinpointing the information you need for a particular case)
Python API
For interactive notebook work and automation, use Inspect’s Log API.
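Here’s a minimal sketch of reading a log with the Python API. The ./logs directory and the fields printed are illustrative:

```python
from inspect_ai.log import list_eval_logs, read_eval_log

# List logs in the local ./logs directory and read the first one
logs = list_eval_logs("./logs")
log = read_eval_log(logs[0])

print(log.eval.task, log.status)
if log.results:
    print(log.results.scores)
```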
Profiles
Gage introduces profiles to streamline task development.
A profile is a set of configuration settings that you can use for task
runs and evals. Profiles are defined in gage.toml for
your project.
Here’s an example.
```toml
[profiles.openai]
help = "Dev profile for OpenAI"
secrets = "secrets.json"
env.OPENAI_API_KEY = "{openai}"
env.GAGE_MODEL = "openai/gpt-4.1"
```

To apply these settings to future Gage commands, use `gage profile use`.
gage profile use openai
Now when you call gage run or gage eval, you’ll use openai/gpt-4.1
by default. You’ll also have the OPENAI_API_KEY environment variable
set to the API token.
List available profiles with gage profile list.
gage profile list
```
╭────────┬────────────────────────╮
│ Name   │ Description            │
├────────┼────────────────────────┤
│ openai │ Dev profile for OpenAI │
╰────────┴────────────────────────╯
```
Secrets
Use Gage profiles to provide secure access to your API secrets. In the
sample above, the openai profile uses {openai} for the
OPENAI_API_KEY environment variable. This is a secrets reference.
A secret named openai must be defined in secrets.json.
To keep your secrets safe, Gage requires that you use SOPS to
edit secrets.json. SOPS ensures that your secrets are encrypted and
only readable by authorized users. For more information, see the
secrets.json reference.
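To edit the file, run SOPS directly; it opens secrets.json decrypted in your editor and re-encrypts it on save. This assumes your SOPS keys are already configured:

```shell
sops secrets.json
```

The file itself looks something like this: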
```json
{
  "openai": "*******"  // values are always encrypted!
}
```

Inspect docs
We encourage you to read the excellent Inspect docs for these topics. Information on the Inspect website supersedes the Gage docs, except where Gage adds new concepts (e.g. Gage Review, profiles, dataset registry).
In the next section we cover the Gage workflow, starting with development, followed by test and production.