Concepts
Most of the concepts used in Gage are defined by Inspect AI. Below, we discuss how each concept fits into Gage. To develop an in-depth understanding, we recommend following the links to the Inspect documentation.
Tasks
Tasks let you define, implement, and measure well-defined behavior. Tasks can be broad or narrow.
Broadly defined task:
Handle a customer inquiry
Narrowly defined task:
Tell me if a customer has become upset in a chat
In general, narrowly defined tasks are easier to build and measure than broadly defined tasks. Gage encourages a process of task composition, where lower level, narrow tasks are composed to implement higher level, broad tasks.
A task is what you build, test, and run in Gage.
- Implement the task
- Show how the task works through evaluations
- When it’s ready, deploy the task to production
Define a task
A task is an instance of inspect_ai.Task. It
consists of a solver and an optional scorer.
```python
@task
def my_task():
    return Task(
        solver=input_template(),
        scorer=llm_judge()
    )
```

- `solver` implements the task
- `scorer` measures how well the task was performed
This is how Gage supports eval driven development. As you write code, you define what correct and incorrect behavior is. This mirrors the best practice of test driven development — for LLMs.
Run a task
A task by itself is just a recipe for doing work and scoring results. Tasks need to be run.
You can run a task in different ways depending on the stage of development, as shown below.
| Use | When |
|---|---|
| `gage run` command | In development for ad hoc, single cases |
| `gage eval` command | In development and test to formally measure |
| `run_task` Python function | To run your task in an application |
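For example, during development you might run a single case ad hoc, then measure more formally with an eval. These commands mirror the examples used later in this guide:

```shell
# Ad hoc, single case
gage run funny --model anthropic/claude-sonnet-4 --input santa

# Formal measurement against a dataset
gage eval funny --model openai/gpt-5
```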
Evaluate a task
Tasks are evaluated by running them with sample input. For each sample, task output is scored. Sample scores are tallied to give you an idea of how the task performed for a run.
Sample input is defined in a dataset.
The task processes input using a model.
An evaluation measures:
- Task performance
- Model performance
- Sample quality
- Scoring quality
You have opportunities to improve your AI application by carefully reviewing evaluation results and re-running evals to chart your progress. This iterative process exemplifies eval driven development.
Solvers
A solver implements task behavior. The simplest possible solver takes input and passes it to a model, returning the result. In practice, solvers prepare the input using a variety of AI programming patterns.
- Prompt engineering
- Retrieval augmented generation (RAG)
- Few shot examples
- Self critique
Solvers can be used off-the-shelf or customized.
A solver is often defined as a list of sub-solvers. Inspect provides an elegant, high level interface to chain solver logic together to simplify AI development. You assemble complex behavior from simpler, more narrowly scoped behavior.
Here’s a task to assess user sentiment in a chat history. It uses several solvers in series.
```python
@task
def user_sentiment():
    """Return how a user feels at a point in a chat session."""
    return Task(
        solver=[
            instructions(),       # Model prompt and user chat history
            positive_examples(),  # Examples of desired output
            negative_examples(),  # Examples of undesired output
            generate(),           # Call the model
            self_critique()       # Ask the model to critique itself
        ]
    )
```

Each item in the `solver` list is itself a solver. Each solver participates in the task implementation by applying its specialized behavior. You can add and remove solvers from the chain to alter the task behavior.
Here’s a minimal version of the above task.
```python
@task
def user_sentiment():
    """Return how a user feels at a point in a chat session."""
    return Task(
        solver=[
            instructions(),  # Model prompt and user chat history
            generate(),      # Call the model
        ]
    )
```

Which solver is better? The first one does a lot more. But is it better?
That’s the question! It can be hard to answer.
- How accurate is each solver for a given model?
- How is “accuracy” defined? How do we know if the task is correct or not?
- How long does each solver take to run? How much does it cost?
- Does the additional time and cost of a fancier solver provide enough improved accuracy?
You can answer these questions by running different experiments and comparing the results.
Notice we haven’t asked, “which model is better?” That’s also an important topic — one that you should consider alongside different solver options.
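For example, you might run the same eval against two models and compare the results. The model names here are simply the ones used elsewhere in this guide:

```shell
gage eval user_sentiment --model anthropic/claude-sonnet-4
gage eval user_sentiment --model openai/gpt-5
```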
Scorers
A scorer is the part of a task that measures task output and assigns a score.
Here’s the user sentiment task that uses an LLM to assign a score.
```python
@task
def user_sentiment():
    return Task(
        solver=[...],  # Task solver chain (see above)
        scorer=llm_judge(),
    )
```

Accurate scoring is arguably the most challenging aspect of developing software with statistical models. After all, the point of a model is to estimate an answer. What’s a good estimate? What’s good enough?
And without accurate scoring, how do you know if your software is improving? How do you know when it’s ready to ship?
In Inspect, a score is typically a boolean indicator of Correct or
Incorrect. This may seem overly simplistic in many cases but it’s
surprisingly effective, especially when you evaluate enough samples.
Scores aren’t limited to booleans, however; you can also assign numeric values and other categories.
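For instance, Inspect’s `Score` accepts a range of value types. This is a small sketch using Inspect’s scorer API; the example values are ours:

```python
from inspect_ai.scorer import Score

Score(value="C")          # boolean-style correct/incorrect
Score(value=0.75)         # numeric score
Score(value="negative")   # categorical label
```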
Computed vs estimated
When scoring model output, you generally take one of two approaches.
- Run some Python code to compute an answer
- Ask a model for a score
We refer to these methods as computed and estimated scores respectively.
Consider some of the model output in Get started. The task uses an LLM judge to score model output. In some of our examples, the LLM judge was wrong about the word count: it scored a result as Incorrect because it miscounted words.
If your scoring algorithm is systematically wrong, you have little chance of making systematic improvements to your code.
There are a couple ways to address the scoring problem in the Get started example.
Option 1. Fix the LLM judge
There is likely a path to improving the word count accuracy by adjusting the LLM judge prompt. It’s often easy to fix problems like this with more specific instructions.
This leaves the problem of scoring the LLM judge. How do we know if our changes to the judge prompt have improved its word count accuracy?
Great news! Gage lets you implement LLM judge scorers as tasks. This means you can develop and evaluate your judge using the same iterative workflow used for your other work.
Option 2. Count the words using Python
It’s tempting to ask AI to score AI. It’s easy. “LLM, tell me the score!” and your job is done.
Fortunately for your job, it’s not done! If you ask AI to score AI, you need to score the AI that scores the AI. And so on.
At some point you want to find ground truth — the true answer. At a minimum you should have an idea where you could find it, even if it’s too expensive or complex to obtain.
In the case of “did the model stay within the allowed word count," ground truth is easy to come by.
Here’s a simple scorer that uses nltk to compute an accurate
score.
```python
import nltk

def score_word_count(output: str, max_count: int):
    count = len(nltk.word_tokenize(output))
    return "C" if count <= max_count else "I"
```

The use of `"C"` and `"I"` is a convention used in Inspect to indicate “correct” and “incorrect” respectively. Use these values for boolean scores to calculate accuracy tallies for an eval.
In the case of the Get started example, we still need to determine whether model output is funny. This is hard! One person’s funny is another person’s cringe, and vice versa.
This is of course the difficulty of building AI applications. There’s a temptation to hand off implementation details to a model. But when it comes time to measure “how well is the model doing” you can’t hand that off. You need to define what you want with enough precision that you can accurately measure when you get it.
Datasets
A dataset is a list of samples that you can use to evaluate a task. Think of datasets as test cases.
Datasets play a central role in Inspect evaluations — so much so they’re built into the task definitions.
In Gage you can separate datasets from tasks to support different evaluation scenarios.
- Small datasets used in development to quickly check changes as you make them
- Larger, more diverse datasets used for more comprehensive testing
- Narrowly defined datasets to measure specific scenarios and edge cases
Here’s an example of three datasets, each consisting of samples defined in YAML files.
YAML is a convenient data format for text block inputs.
```python
from gage_inspect.dataset import dataset, yaml_dataset

@dataset(default=True)
def custom_sentiment_quickcheck():
    """Quick check samples - use as you develop."""
    return yaml_dataset("samples/quickcheck.yaml")

@dataset()
def custom_sentiment_test():
    """Full test coverage."""
    return yaml_dataset("samples/test.yaml")

@dataset()
def custom_sentiment_issue_123():
    """Test regressions for issue 123."""
    return yaml_dataset("samples/issues/123.yaml")
```

Specify the dataset used for an evaluation in `gage eval` with the `--dataset` option. Otherwise Gage uses the default dataset.
gage eval custom_sentiment --dataset test
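For reference, a quickcheck.yaml file might look something like this. The exact schema is an assumption on our part; Inspect samples typically pair an input with an optional target:

```yaml
- input: |
    Customer: I have asked three times and still no refund.
  target: negative
- input: |
    Customer: Thanks, that fixed it!
  target: positive
```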
Models
A model refers to an LLM. Tasks interact with models to perform their work. In fact, the role of most tasks is to construct model prompts.
When you run a task, you specify the model it uses. You can use a different model for each call. In this way you can compare how different models affect task performance.
gage run funny --model anthropic/claude-sonnet-4 --input santa
gage eval funny --model openai/gpt-5
See Models reference for a list of model providers.
Evals
An eval — short for evaluation — is a recorded series of task calls. Each task call consists of a single sample and a model. Each sample is sent to the task, which uses the configured model to do work. Results are scored and tallied to provide evaluation results.
In Get started, you run an eval with five samples. The samples are defined as a dataset.
```python
@dataset
def samples():
    """Sample topics for the funny task."""
    return ["birds", "cows", "cats", "corn", "barns"]
```

Eval results are stored in logs and are available using Inspect compatible tools.
Not surprisingly, evals play the main role in eval driven development.
A typical workflow looks like this:
Initial dev
- Write some task code
- Run the task with a sample input
- Study the results, get some ideas
- Modify your code, run more samples, study the results
Initial eval
- When you find yourself repeating samples, create a dataset
- Use the dataset to run evals each time you make changes to your code
- Study the results (you’re used to this by now)
Initial scoring
- When you’ve studied enough results, you’ll have a sense of correct and incorrect output — define the criteria for what’s correct
- Use an LLM judge to estimate a score using your criteria
- Run your eval with the LLM judge scorer
- Study the results
Ongoing development
Once you have an eval working you can establish a baseline for future work.
For each iteration, study the results. Is the score correct? It can be just as likely that the score is wrong as that the task is wrong. Look for false positives and false negatives.
If you see an incorrect score, try to address it. If you’re using an LLM judge, implement the judge as a task (as described above) and evaluate it. A judge doesn’t have to be 100% accurate to be useful. An LLM judge estimates scores, and any given estimate can be wrong.
If you can apply ground truth in your scoring, that’s better than estimating. However, getting and applying ground truth can be expensive and challenging. You might start with an LLM judge and move to computed scores once you understand your evals better.
Eval driven development encourages you to study the results. If you’re looking for shortcuts with automated tests, ask yourself: who tests the testers? How can you know without looking? Gage evals make this process easier and faster.
Logs
Everything that happens during a task run is stored in a log. When we say “study the results,” we mean “study the logs.”
Logs are generated for ad hoc runs, evals, and calls to
run_task.
There are various methods of reading Inspect logs.
- Gage Review
- Inspect View (standalone or as a VS Code extension)
- Python API
Gage Review
Gage Review is a terminal based UI (TUI) for reviewing Inspect logs. Gage Review is designed to make “study the results” as fast and effective as possible.
Gage Review is experimental — it’s not obvious to us that a TUI is the best tool for the job but we want to see how it works! We like it, but then again we write software in nerdy languages like Rust!
Advantages of a TUI:
- Runs anywhere — the same experience during local dev and for remote runs/debugging
- Fast navigation through eval samples
- Keyboard navigation encourages fast workflow
Gage Review is a developer tool. It’s not intended for a non-technical audience. We recognize that a TUI is not the tool of choice for most data scientists.
Subject to community feedback, we have plans for Review!
- Annotations (for open, axial, and selective codes)
- Assistance with review and comparison
- Interfaces for non-technical, expert review input
Inspect View
Inspect View is the gold standard for viewing Inspect logs.
Use Inspect View when you need:
- Web interface to Inspect logs
- Information or features that aren’t available in Gage (in particular, Inspect View has outstanding filtering support for pinpointing the information you need for a particular case)
Python API
For interactive notebook work and automation, use Inspect’s Log API.
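Here’s a minimal sketch of reading a log with the Python API. The ./logs directory and the fields printed are illustrative:

```python
from inspect_ai.log import list_eval_logs, read_eval_log

# List logs in the local ./logs directory and read the first one
logs = list_eval_logs("./logs")
log = read_eval_log(logs[0])

print(log.eval.task, log.status)
if log.results:
    print(log.results.scores)
```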
Profiles
Gage introduces profiles to streamline task development.
A profile is a set of configuration settings that you can use for task
runs and evals. Profiles are defined in gage.toml for
your project.
Here’s an example.
```toml
[profiles.openai]
help = "Dev profile for OpenAI"
secrets = "secrets.json"
env.OPENAI_API_KEY = "{openai}"
env.GAGE_MODEL = "openai/gpt-4.1"
```

To apply these settings to future Gage commands, use `gage profile use`.
gage profile use openai
Now when you call gage run or gage eval, you’ll use openai/gpt-4.1
by default. You’ll also have the OPENAI_API_KEY environment variable
set to the API token.
List available profiles with gage profile list.
gage profile list
```
╭────────┬────────────────────────╮
│ Name   │ Description            │
├────────┼────────────────────────┤
│ openai │ Dev profile for OpenAI │
╰────────┴────────────────────────╯
```
Secrets
Use Gage profiles to provide secure access to your API secrets. In the
sample above, the openai profile uses {openai} for the
OPENAI_API_KEY environment variable. This is a secrets reference.
A secret named openai must be defined in secrets.json.
To keep your secrets safe, Gage requires that you use SOPS to
edit secrets.json. SOPS ensures that your secrets are encrypted and
only readable by authorized users. For more information, see the
secrets.json reference.
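To edit the file, run SOPS directly; it opens secrets.json decrypted in your editor and re-encrypts it on save. This assumes your SOPS keys are already configured:

```shell
sops secrets.json
```

The file itself looks something like this: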
```json
{
  "openai": "*******"  // values are always encrypted!
}
```

Inspect docs
We encourage you to read the excellent Inspect docs for these topics. Information on the Inspect website supersedes the Gage docs, except where Gage adds new concepts (e.g. Gage Review, profiles, dataset registry).
In the next section we cover the Gage workflow, starting with development, followed by test and production.