LLM tooling · 2026

Catch model regressions before your users do.

Most teams find out about an LLM regression from a customer ticket. By that point the bad output has been in production for a week. Eval Runner is a small framework that runs your evals on a schedule against whichever model versions you care about, then alerts when a metric drops. So you find out from a Slack message at 9am, not a customer at 9pm.

01

Define evals as files. That's it.

An eval is a folder with a YAML config, a list of prompt-expected-output pairs, and an optional grader. The runner picks them up automatically. No SDK to learn, no service to deploy.

02

Schedules per eval, not per project.

Some evals matter daily. Some matter on every model release. Some only matter when you change the prompt. Each eval declares its own cadence and triggers, so you don't run expensive evals more often than needed.

03

Alerts you'd actually open.

Slack notifications include the diff: which exact prompts started failing, the previous output, the new output, and a one-line summary of what changed. Most regressions are obvious in the diff.
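One way to build that message body with nothing but the standard library's `difflib`; the dict keys here are illustrative, not the real payload schema:

```python
import difflib

def regression_message(eval_name: str, failures: list) -> str:
    """Render one unified diff per newly failing prompt, Slack-style."""
    lines = [f"*{eval_name}*: {len(failures)} prompt(s) started failing"]
    for f in failures:
        lines.append(f"> {f['prompt']}")
        lines.extend(difflib.unified_diff(
            f["previous_output"].splitlines(),
            f["new_output"].splitlines(),
            fromfile="previous", tofile="current", lineterm="",
        ))
    return "\n".join(lines)
```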

What it runs on

A short tech sheet.

Models: Claude, OpenAI, Gemini, local
Scheduling: cron-style, per eval
Graders: string, regex, LLM-as-judge
Storage: Postgres + S3 for runs
Alerts: Slack, email, webhook
Stack: Python, FastAPI, Celery
Hosted on: Fly.io
Stage: internal use, in progress
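The three grader kinds from the tech sheet can be sketched as a simple dispatch table; the names and signatures below are assumptions, and the LLM-as-judge entry is a stub where a model call would go:

```python
import re

def string_grader(expected: str, output: str) -> bool:
    # Exact match, ignoring surrounding whitespace.
    return expected.strip() == output.strip()

def regex_grader(pattern: str, output: str) -> bool:
    # Pass if the pattern matches anywhere in the output.
    return re.search(pattern, output) is not None

def judge_grader(rubric: str, output: str) -> bool:
    # Placeholder: a real judge would send `rubric` and `output` to a model.
    raise NotImplementedError("wire up an LLM call here")

GRADERS = {
    "string": string_grader,
    "regex": regex_grader,
    "llm_judge": judge_grader,
}
```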