EvalOps Workbench
A local-first evaluation harness for prompts, tools, and agents with regression tracking and experiment history.
Researching · Developer Tool · LLM
Problem: LLM teams lack a lightweight way to compare prompt and tool changes before shipping.
Why now: Evaluation is shifting from an optional best practice to baseline engineering hygiene.
System status: unknown
Mode: showcase (Tier B, see schema)
Last commit: never
Last deploy: never
Schema: v1 (public contract)
Built for: Agent builders, prompt engineers, applied AI teams
Stack: Python, Typer, DuckDB, OpenTelemetry
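As a rough sketch of how this stack might surface on the command line, here is a minimal Typer skeleton; the command names, options, and messages are illustrative assumptions, not the project's documented interface.

```python
# Hypothetical CLI surface built with Typer; command names, options, and
# messages are illustrative assumptions, not the project's real interface.
import typer

app = typer.Typer(help="EvalOps Workbench: run and compare prompt evaluations locally.")


@app.command()
def run(dataset: str, variant: str = "baseline") -> None:
    """Run a prompt or agent variant against a dataset file (JSON or CSV)."""
    typer.echo(f"Running variant '{variant}' against '{dataset}'")


@app.command()
def compare(run_a: str, run_b: str) -> None:
    """Compare two recorded runs and report regressions."""
    typer.echo(f"Comparing '{run_a}' with '{run_b}'")


if __name__ == "__main__":
    app()
```

Typer turns the positional parameters into arguments and the keyword default into a `--variant` option, so a hypothetical invocation would look like `python cli.py run cases.json --variant candidate`.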
What ships first
The MVP scope this project commits to; a sketch of the core loop follows the list.
- Load datasets from JSON or CSV
- Run prompt or agent variants
- Score outputs with rubric functions
- Compare runs and export regressions
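To make that scope concrete, here is a minimal sketch of the core loop under stated assumptions: a JSON dataset of input/expected pairs, a toy exact-match rubric function, and a single DuckDB `results` table. None of these names or shapes come from the project; they only illustrate how runs could be scored, stored, and diffed for regressions.

```python
# A toy end-to-end pass: load a dataset, run two variants, score with a rubric,
# store rows in DuckDB, and report regressions. All names here are illustrative.
import json

import duckdb


def load_dataset(path: str) -> list[dict]:
    """Load a JSON list of {"input": ..., "expected": ...} records."""
    with open(path) as f:
        return json.load(f)


def exact_match_rubric(output: str, expected: str) -> float:
    """Toy rubric function: 1.0 on exact match, 0.0 otherwise."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_variant(run_id, variant, dataset, con):
    """Apply a variant callable to every case, score it, and record the result."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS results (run_id TEXT, case_id INTEGER, score DOUBLE)"
    )
    for case_id, record in enumerate(dataset):
        output = variant(record["input"])
        score = exact_match_rubric(output, record["expected"])
        con.execute("INSERT INTO results VALUES (?, ?, ?)", [run_id, case_id, score])


def regressions(run_a, run_b, con):
    """Cases where run_b scores strictly lower than run_a."""
    return con.execute(
        """
        SELECT a.case_id, a.score AS score_a, b.score AS score_b
        FROM results a JOIN results b USING (case_id)
        WHERE a.run_id = ? AND b.run_id = ? AND b.score < a.score
        """,
        [run_a, run_b],
    ).fetchall()


if __name__ == "__main__":
    # load_dataset("cases.json") would read a real file; inline data keeps this runnable.
    con = duckdb.connect()  # in-memory; pass a file path to keep experiment history
    dataset = [{"input": "2+2", "expected": "4"}, {"input": "3+3", "expected": "6"}]
    run_variant("baseline", lambda q: {"2+2": "4", "3+3": "6"}[q], dataset, con)
    run_variant("candidate", lambda q: "4", dataset, con)
    print(regressions("baseline", "candidate", con))  # -> [(1, 1.0, 0.0)]
```

Keeping runs as rows in a local DuckDB file is one way to get regression tracking and experiment history without a server, which fits the local-first framing.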