Overview
Gage is a framework for building LLM- and AI-powered applications. It differs from other frameworks because evaluations are baked in from the start.
Gage supports eval-driven development. Think of it as traditional test-driven development (TDD), but for LLM apps.
Gage extends Inspect, the leading open-source framework for LLM evaluations. Build and evaluate your code using Inspect, then deploy it to production with Gage.
Here’s a simple task:

```python
from inspect_ai import task, Task
from gage_inspect.solver import task_doc


@task
def add():
    """Add two numbers.

    Input:
        x: First number
        y: Second number

    Output: Sum of x and y
    """
    return Task(solver=task_doc())
```

A task can be run in various ways:
- Use `gage run` for ad hoc runs to see how code changes affect task behavior
- Use `gage eval` to evaluate the task and measure the performance of different approaches and models
- Use the `run_task` function in your Python code to host the task in an application (see the sketch below)
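For the programmatic option, a minimal sketch is shown below. Only the function name `run_task` appears above; its import path and call signature are assumptions here, so check the Gage API reference for the actual interface.

```python
# Rough sketch only: the `gage` import path and the keyword-argument calling
# convention for run_task are assumptions, not the confirmed Gage API.
from gage import run_task

from my_tasks import add  # hypothetical module containing the task defined above

# Run the task with concrete inputs and use the result in application code.
result = run_task(add(), x=2, y=3)
print(result)
```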
Motivation
We want to build software using the tried-and-true method of TDD: write some code, test the code, and iterate in small steps toward demonstrably correct software. This method has proven time and again to produce higher-quality, more maintainable software.
How do you apply TDD to LLM apps?
It's hard. LLMs handle unanticipated cases, while TDD addresses anticipated ones, and TDD's naive boolean assertions about "correct results" don't work well with statistical models.
That’s where evals come in.
Evals are used by AI experts to measure model behavior. Gage extends evals to measure application behavior wherever models are involved.
That’s why Gage is tightly integrated with Inspect: to support evals from early development, through testing, and even into production.
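For concreteness, a minimal eval in plain Inspect looks something like the sketch below: a dataset of samples with targets, a solver, and a scorer that grades model output against the target rather than asserting an exact program result. This uses Inspect's standard `Task`, `Sample`, `generate`, and `match` building blocks; the sample content itself is illustrative.

```python
from inspect_ai import task, Task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate


@task
def arithmetic():
    # A one-sample dataset for brevity; real evals use many samples so the
    # score is statistically meaningful rather than a single pass/fail check.
    return Task(
        dataset=[Sample(input="What is 2 + 2? Reply with only the number.", target="4")],
        solver=generate(),
        scorer=match(),
    )
```

Running it with Inspect's `inspect eval` CLI or its `eval()` Python function produces a score across the dataset instead of a single boolean assertion.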
When to use Gage
- You’re building software that uses LLMs
- You want to evaluate how your software performs early and often
- You’re excited to try a new framework