LLM tooling · 2026

Catch model regressions before your users do.

Most teams find out about an LLM regression from a customer ticket. By that point the bad output has been in production for a week. Eval Runner is a small framework that runs your evals on a schedule against whichever model versions you care about, then alerts when a metric drops. So you find out from a Slack message at 9am, not a customer at 9pm.

01

Define evals as files. That's it.

An eval is a folder with a YAML config, a list of prompt-expected-output pairs, and an optional grader. The runner picks them up automatically. No SDK to learn, no service to deploy.

02

Schedules per eval, not per project.

Some evals matter daily. Some matter on every model release. Some only matter when you change the prompt. Each eval declares its own cadence and triggers, so you don't run expensive evals more often than needed.

03

Alerts you'd actually open.

Slack notifications include the diff: which exact prompts started failing, the previous output, the new output, and a one-line summary of what changed. Most regressions are obvious in the diff.
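One way to build that message body with nothing but the standard library's `difflib`; the dict keys here are illustrative, not the real payload schema:

```python
import difflib

def regression_message(eval_name: str, failures: list) -> str:
    """Render one unified diff per newly failing prompt, Slack-style."""
    lines = [f"*{eval_name}*: {len(failures)} prompt(s) started failing"]
    for f in failures:
        lines.append(f"> {f['prompt']}")
        lines.extend(difflib.unified_diff(
            f["previous_output"].splitlines(),
            f["new_output"].splitlines(),
            fromfile="previous", tofile="current", lineterm="",
        ))
    return "\n".join(lines)
```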

What it runs on

A short tech sheet.

Models: Claude, OpenAI, Gemini, local
Scheduling: cron-style, per eval
Graders: string, regex, LLM-as-judge
Storage: Postgres + S3 for runs
Alerts: Slack, email, webhook
Stack: Python, FastAPI, Celery
Hosted on: Fly.io
Stage: internal use, in progress
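The three grader kinds from the tech sheet can be sketched as a simple dispatch table; the names and signatures below are assumptions, and the LLM-as-judge entry is a stub where a model call would go:

```python
import re

def string_grader(expected: str, output: str) -> bool:
    # Exact match, ignoring surrounding whitespace.
    return expected.strip() == output.strip()

def regex_grader(pattern: str, output: str) -> bool:
    # Pass if the pattern matches anywhere in the output.
    return re.search(pattern, output) is not None

def judge_grader(rubric: str, output: str) -> bool:
    # Placeholder: a real judge would send `rubric` and `output` to a model.
    raise NotImplementedError("wire up an LLM call here")

GRADERS = {
    "string": string_grader,
    "regex": regex_grader,
    "llm_judge": judge_grader,
}
```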