Most teams find out about an LLM regression from a customer ticket. By that point the bad output has been in production for a week. Eval Runner is a small framework that runs your evals on a schedule against whichever model versions you care about, then alerts when a metric drops. So you find out from a Slack message at 9am, not a customer at 9pm.
An eval is a folder containing a YAML config, a list of prompts paired with expected outputs, and an optional grader. The runner picks these folders up automatically. No SDK to learn, no service to deploy.
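As a minimal sketch, an eval folder might look like the following; the file name `eval.yaml` and the fields `cases`, `grader`, and `model` are illustrative assumptions, not a fixed schema:

```yaml
# evals/summarization/eval.yaml -- hypothetical layout; field names are illustrative
name: summarization-quality
model: gpt-4o              # whichever model version this eval targets
grader: grader.py          # optional; omit to fall back to exact match
cases:
  - prompt: "Summarize: The quarterly report shows revenue up 12%..."
    expected: "Revenue grew 12% in the quarter."
  - prompt: "Summarize: The incident began at 02:14 UTC..."
    expected: "An outage started at 02:14 UTC."
```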
Some evals matter daily. Some matter on every model release. Some only matter when you change the prompt. Each eval declares its own cadence and triggers, so you don't run expensive evals more often than needed.
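Cadence and triggers could be declared in the same config; the `schedule` and `triggers` keys below are an assumed shape for illustration, not the runner's actual keys:

```yaml
# Hypothetical cadence/trigger declaration inside eval.yaml
schedule: daily            # run once a day
triggers:
  - model_release          # also run whenever the target model version changes
  - prompt_change          # and whenever the prompt template is edited
```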
Slack notifications include the diff: which exact prompts started failing, the previous output, the new output, and a one-line summary of what changed. Most regressions are obvious in the diff.
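For illustration only, the payload behind such an alert might look something like this; the structure and values are assumed, not the runner's actual message format:

```yaml
# Hypothetical data behind a regression alert
eval: summarization-quality
metric: pass_rate
previous: 0.97
current: 0.88
failures:
  - prompt: "Summarize: The incident began at 02:14 UTC..."
    previous_output: "An outage started at 02:14 UTC."
    new_output: "The report describes a maintenance window."
    summary: "New output drops the outage framing entirely."
```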