Overview
Gage is a framework for building LLM- and AI-powered applications. It differs from other frameworks because evaluations are baked in from the start.
Gage supports eval-driven development. Think of it as traditional test-driven development (TDD), but for LLM apps.
Gage extends Inspect, the leading open-source framework for LLM evaluations. Build and evaluate your code using Inspect, then deploy it to production with Gage.
Here’s a simple task:

```python
from inspect_ai import task, Task
from gage_inspect.solver import task_doc


@task
def add():
    """Add two numbers.

    Input:
        x: First number
        y: Second number

    Output: Sum of x and y
    """
    return Task(solver=task_doc())
```

A task can be run in various ways:
- Use `gage run` for ad hoc runs to see how code changes affect task behavior
- Use `gage eval` to evaluate the task and measure the performance of different approaches and models
- Use the `run_task` function in your Python code to host the task in an application (see the sketch below)
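For the programmatic option, a minimal sketch is shown below. Only the function name `run_task` appears above; its import path and call signature are assumptions here, so check the Gage API reference for the actual interface.

```python
# Rough sketch only: the `gage` import path and the keyword-argument calling
# convention for run_task are assumptions, not the confirmed Gage API.
from gage import run_task

from my_tasks import add  # hypothetical module containing the task defined above

# Run the task with concrete inputs and use the result in application code.
result = run_task(add(), x=2, y=3)
print(result)
```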
Motivation
We want to build software using the tried-and-true method of TDD: write some code, test the code, and iterate in small steps toward demonstrably correct software. This method has proven time and again to produce higher-quality, more maintainable software.
How do you apply TDD to LLM apps?
It's hard. LLMs handle unanticipated cases, while TDD addresses anticipated ones, and TDD's naive boolean assertions about "correct results" don't work well with statistical models.
That’s where evals come in.
Evals are used by AI experts to measure model behavior. Gage extends evals to measure application behavior wherever models are involved.
That’s why Gage is tightly integrated with Inspect: to support evals from early development, through testing, and even into production.
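For concreteness, a minimal eval in plain Inspect looks something like the sketch below: a dataset of samples with targets, a solver, and a scorer that grades model output against the target rather than asserting an exact program result. This uses Inspect's standard `Task`, `Sample`, `generate`, and `match` building blocks; the sample content itself is illustrative.

```python
from inspect_ai import task, Task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate


@task
def arithmetic():
    # A one-sample dataset for brevity; real evals use many samples so the
    # score is statistically meaningful rather than a single pass/fail check.
    return Task(
        dataset=[Sample(input="What is 2 + 2? Reply with only the number.", target="4")],
        solver=generate(),
        scorer=match(),
    )
```

Running it with Inspect's `inspect eval` CLI or its `eval()` Python function produces a score across the dataset instead of a single boolean assertion.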
When to use Gage
- You’re building software that uses LLMs
- You want to evaluate how your software performs early and often
- You’re excited to try a new framework