EvalOps Workbench

A local-first evaluation harness for prompts, tools, and agents with regression tracking and experiment history.

Tags: OpenTelemetry, Researching, Developer Tool, LLM

Problem: LLM teams lack a lightweight way to compare prompt and tool changes before shipping. Why now: evaluation is moving from an optional best practice to baseline engineering hygiene.

System status: unknown
Mode: showcase (Tier B — see schema)
Last commit: never
Last deploy: never
Schema: v1 (public contract)

Built for: agent builders, prompt engineers, and applied AI teams

Stack: Python, Typer, DuckDB, OpenTelemetry
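
A minimal sketch of how these pieces could fit together: a Typer command that records run results in a local DuckDB file. The command name, flags, and `runs` table are assumptions for illustration, not the project's actual interface, and telemetry wiring is omitted.

```python
# Illustrative skeleton only: the command, its flags, and the `runs`
# table are assumptions, not EvalOps Workbench's committed CLI.
import duckdb
import typer

app = typer.Typer(help="Local-first evaluation harness (sketch)")

@app.command()
def run(dataset: str, variant: str, db: str = "evalops.duckdb") -> None:
    """Run one variant over a dataset and record scores locally."""
    con = duckdb.connect(db)  # local-first: results live in a single file
    con.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "run_id TEXT, variant TEXT, case_id TEXT, output TEXT, score DOUBLE)"
    )
    # ... load cases, execute the variant, score outputs, insert rows ...
    typer.echo(f"ran variant {variant!r} on {dataset}; results in {db}")

if __name__ == "__main__":
    app()
```
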
What ships first
The MVP scope this project commits to; minimal sketches of these pieces follow the list.
  • Load datasets from JSON or CSV
  • Run prompt or agent variants
  • Score outputs with rubric functions
  • Compare runs and export regressions
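
To make the first three bullets concrete, here is a minimal sketch of dataset loading and rubric scoring. The helper names (`load_dataset`, `exact_match`, `score_run`) and the `expected` field are assumptions for illustration; the shipped rubric interface may differ.

```python
# Sketch under assumed names: rubric functions are plain callables
# mapping (case, output) -> float.
import csv
import json
from pathlib import Path

def load_dataset(path: str) -> list[dict]:
    """Load evaluation cases from a JSON array or a CSV file."""
    p = Path(path)
    if p.suffix == ".json":
        return json.loads(p.read_text())
    with p.open(newline="") as f:
        return list(csv.DictReader(f))

def exact_match(case: dict, output: str) -> float:
    """Example rubric: 1.0 if the output equals the expected answer."""
    return 1.0 if output.strip() == case["expected"].strip() else 0.0

def score_run(cases: list[dict], outputs: list[str], rubric) -> list[float]:
    """Apply one rubric function to every (case, output) pair."""
    return [rubric(case, out) for case, out in zip(cases, outputs)]
```
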
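For the last bullet, a sketch of comparing two runs and exporting regressions, assuming the illustrative `runs` table from the CLI skeleton above; the query and CSV layout are not a committed schema.

```python
# Assumes the illustrative `runs` table (run_id, variant, case_id,
# output, score); the regression definition here is simply
# "candidate scored below baseline on the same case".
import csv
import duckdb

def export_regressions(db: str, baseline: str, candidate: str, out_csv: str) -> None:
    """Write cases where the candidate run scored below the baseline."""
    con = duckdb.connect(db)
    rows = con.execute(
        """
        SELECT b.case_id, b.score AS baseline_score, c.score AS candidate_score
        FROM runs b
        JOIN runs c USING (case_id)
        WHERE b.run_id = ? AND c.run_id = ? AND c.score < b.score
        """,
        [baseline, candidate],
    ).fetchall()
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["case_id", "baseline_score", "candidate_score"])
        writer.writerows(rows)
```

Keeping runs in DuckDB makes run comparison a plain SQL join over local data, which is what keeps regression tracking cheap in a local-first setup.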