EvalOps Workbench

A local-first evaluation harness for prompts, tools, and agents with regression tracking and experiment history.

Tags: OpenTelemetry, Researching, Developer Tool, LLM

Problem: LLM teams lack a lightweight way to compare prompt and tool changes before shipping. Why now: evaluation is moving from an optional best practice to baseline engineering hygiene.

System status: unknown
Mode: showcase (Tier B — see schema)
Last commit: never
Last deploy: never
Schema: v1 (public contract)

Built for: agent builders, prompt engineers, and applied AI teams

Stack: Python, Typer, DuckDB, OpenTelemetry
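
A minimal sketch of how these pieces could fit together: a Typer command that records run results in a local DuckDB file. The command name, flags, and `runs` table are assumptions for illustration, not the project's actual interface, and telemetry wiring is omitted.

```python
# Illustrative skeleton only: the command, its flags, and the `runs`
# table are assumptions, not EvalOps Workbench's committed CLI.
import duckdb
import typer

app = typer.Typer(help="Local-first evaluation harness (sketch)")

@app.command()
def run(dataset: str, variant: str, db: str = "evalops.duckdb") -> None:
    """Run one variant over a dataset and record scores locally."""
    con = duckdb.connect(db)  # local-first: results live in a single file
    con.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "run_id TEXT, variant TEXT, case_id TEXT, output TEXT, score DOUBLE)"
    )
    # ... load cases, execute the variant, score outputs, insert rows ...
    typer.echo(f"ran variant {variant!r} on {dataset}; results in {db}")

if __name__ == "__main__":
    app()
```
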
What ships first
The MVP scope this project commits to; minimal sketches of these pieces follow the list.
  • Load datasets from JSON or CSV
  • Run prompt or agent variants
  • Score outputs with rubric functions
  • Compare runs and export regressions
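
To make the first three bullets concrete, here is a minimal sketch of dataset loading and rubric scoring. The helper names (`load_dataset`, `exact_match`, `score_run`) and the `expected` field are assumptions for illustration; the shipped rubric interface may differ.

```python
# Sketch under assumed names: rubric functions are plain callables
# mapping (case, output) -> float.
import csv
import json
from pathlib import Path

def load_dataset(path: str) -> list[dict]:
    """Load evaluation cases from a JSON array or a CSV file."""
    p = Path(path)
    if p.suffix == ".json":
        return json.loads(p.read_text())
    with p.open(newline="") as f:
        return list(csv.DictReader(f))

def exact_match(case: dict, output: str) -> float:
    """Example rubric: 1.0 if the output equals the expected answer."""
    return 1.0 if output.strip() == case["expected"].strip() else 0.0

def score_run(cases: list[dict], outputs: list[str], rubric) -> list[float]:
    """Apply one rubric function to every (case, output) pair."""
    return [rubric(case, out) for case, out in zip(cases, outputs)]
```
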
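For the last bullet, a sketch of comparing two runs and exporting regressions, assuming the illustrative `runs` table from the CLI skeleton above; the query and CSV layout are not a committed schema.

```python
# Assumes the illustrative `runs` table (run_id, variant, case_id,
# output, score); the regression definition here is simply
# "candidate scored below baseline on the same case".
import csv
import duckdb

def export_regressions(db: str, baseline: str, candidate: str, out_csv: str) -> None:
    """Write cases where the candidate run scored below the baseline."""
    con = duckdb.connect(db)
    rows = con.execute(
        """
        SELECT b.case_id, b.score AS baseline_score, c.score AS candidate_score
        FROM runs b
        JOIN runs c USING (case_id)
        WHERE b.run_id = ? AND c.run_id = ? AND c.score < b.score
        """,
        [baseline, candidate],
    ).fetchall()
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["case_id", "baseline_score", "candidate_score"])
        writer.writerows(rows)
```

Keeping runs in DuckDB makes run comparison a plain SQL join over local data, which is what keeps regression tracking cheap in a local-first setup.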