Gage Review
Gage Review is a terminal application (TUI) used to review Inspect logs.
Start Review using gage review:

```
gage review
```
Review is a TUI application that is intended for quick navigation and
processing of Inspect samples. If you’re not sure how to use a
particular screen, press ?. This opens a help screen with available
commands.
Press ? anytime you need help in Gage Review.
If you want to exit a screen, press q. This will take you back one
level. When you’re at the top level and press q, you exit Review.
Press q to go back and eventually exit Review.
Review shows the list of logs. To view a log, navigate to it using the
arrow keys and press Enter. Alternatively, double-click the log.
Review shows the first sample in the log. If the log is from a run, it
will contain a single sample. If it’s from an eval, it will contain all
of the samples processed for that evaluation.
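You can also open a log programmatically with Inspect’s Python API. Here’s a minimal sketch using read_eval_log (the log path below is illustrative):

```python
from inspect_ai.log import read_eval_log

# Read an Inspect log file (the path is illustrative).
log = read_eval_log("logs/2025-01-01T00-00-00_my-task.eval")

# A log from a run holds a single sample; a log from an eval holds
# every sample processed for that evaluation.
print(len(log.samples or []))
```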
Initial review
The default view in Review helps you focus on three topics:
- Input
- Output
- Score
By reviewing these three topics you can often get a sense of what to do next.
Here’s what to look for:
Does your input make sense?
In most cases it will. But it’s worth taking a moment to ask, “Is this sample meaningful for this task?” If you’re working with a test dataset, the samples may need to be fixed or improved.
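As a concrete illustration, a sample typically pairs an input with an expected target; if either is wrong or ambiguous, the sample itself needs fixing. The field names below are an assumption, not a fixed schema:

```python
# Illustrative sample shape; field names are an assumption, not a fixed schema.
sample = {
    "input": "What is the capital of France?",
    "target": "Paris",
}
```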
Is the output correct?
Did the task generate what you expect? What did you expect? If there’s a single correct answer, it’s easy. Sometimes there is no correct answer or “correct” is ambiguous or subjective.
It’s important to spend time on this problem. If you can’t answer the question, “What do we expect here?” with enough precision to identify at least some correct and incorrect results, think about ways to refine your task definition.
Is the score correct?
This is the crux of the “review” in Gage Review. If you’re using an LLM judge, the judge can be wrong just as easily as the model being judged. If you’re computing a score, your code may have a bug.
An incorrect score undermines the value of eval-driven development.
When you see a score, be skeptical. Say to yourself, “This score is probably wrong, now let me see how.”
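For example, computed scores often fail in subtle ways. Here’s a hypothetical sketch (not the Gage or Inspect scorer API) of a common bug and its fix:

```python
def score(expected: str, answer: str) -> bool:
    # Hypothetical scorer sketch; not the Gage or Inspect scorer API.
    def normalize(s: str) -> str:
        # Ignore case, surrounding whitespace, and trailing punctuation.
        return s.strip().rstrip(".!").lower()

    # A naive `expected == answer` marks "Paris." or " paris" as wrong.
    return normalize(expected) == normalize(answer)
```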
If you’re not an expert in the task, you may not know the answers to these questions. Work with someone who understands the domain to review the logs and identify issues with your task solver, scorer, and sample input.
Transcript messages
The task solver is responsible for processing input and interacting with the LLM. It’s a good idea to look closely at what the solver does for each sample.
At the bottom of the sample view, expand the Messages section. This shows the series of chat messages that occurred for a sample. This is where you can see what the solver did (e.g. prompt generation) and how the model responded.
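Conceptually, the transcript is an ordered list of role-tagged messages. A simplified sketch (the exact structure in the log may differ):

```python
# Simplified view of a sample transcript; the actual log structure may differ.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Prompt generated by the solver..."},
    {"role": "assistant", "content": "Raw model reply..."},
]

for message in messages:
    print(f"{message['role']}: {message['content']}")
```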
Here’s what to look for:
Did the solver do what you expected?
Input transformation and prompts are performed by the solver. Look over what was sent to the model. Is it what you expected?
What did the model respond with?
If the solver transforms model output, task output may not reflect what the model replied with. By looking at the assistant messages you can see exactly what happened.
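For instance, a solver might extract a final answer from the raw reply, in which case the task output differs from the assistant message. A hypothetical sketch:

```python
import re

def extract_answer(reply: str) -> str:
    # Hypothetical post-processing step; your solver may do something different.
    # Pull the text after an "ANSWER:" marker if the model used one,
    # otherwise fall back to the raw reply.
    match = re.search(r"ANSWER:\s*(.+)", reply)
    return match.group(1).strip() if match else reply

# The task output ("Paris") differs from the full assistant message:
print(extract_answer("The capital of France is Paris.\nANSWER: Paris"))
```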
Advanced view
Gage Review has two views:
- The default simplified view, which the previous sections cover
- An advanced view, which provides detailed access to the log
To access the advanced view, press 2. To return to the simplified view, press 1.
Press ? if you ever need help.
The advanced view is convenient for accessing log details within Gage Review, but it’s limited in search and filtering. When you really need to dig into logs, use Inspect View, a far more capable app for exploring them. Gage Review is designed for a high-level review workflow and won’t always get you what you need quickly.