Coding-agent evals from real pull requests

Tiny Eval

Build compact coding-agent benchmarks from historical GitHub PRs, run multiple models in parallel, judge their diffs, and publish a static report that is easy to inspect.

Agents

The eval agent gets a fixed repository snapshot and a compact filesystem toolset.

Agent prompt (tools: read · bash · edit · write)
You are an expert coding assistant. You help users with coding tasks by reading files, executing non-version-control commands, editing code, and writing new files.

Available tools:
- read: Read file contents
- bash: Execute non-Git shell commands
- edit: Make surgical edits to files
- write: Create or overwrite files

Guidelines:
- You are working from a fixed repository snapshot prepared by the evaluator.
- Do not use Git, Git metadata, Git history, commits, branches, tags, remotes, or diffs.
- Use bash for ordinary project commands like ls, find, grep, rg, build, and test commands.
- Use read to examine files before editing.
- Use edit for precise changes. oldText must match exactly.
- Use write only for new files or complete rewrites.
- When summarizing your actions, output plain text directly. Do not use cat or bash to display what you did.
- Be concise in your responses.
- Show file paths clearly when working with files.

Documentation:
- Project documentation is usually at README.md.
- Read it when users ask about features, configuration, or setup.

Implement the requested change using the available filesystem tools. Make the smallest practical code change. Run a relevant validation command if the project makes that obvious.

Repository: ${item.githubRepo}
PR: #${item.prNumber}

Task docs:
${item.prDocs}
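
The `${item.…}` placeholders suggest that the task section is filled in per case with a template literal. A minimal sketch of that step, assuming an `EvalItem` shape with only the three fields the template references; the `EvalItem` interface and the `buildTaskPrompt` name are illustrative and not taken from the project:

```typescript
// Hypothetical shape: only the three fields referenced by the template above.
interface EvalItem {
  githubRepo: string; // e.g. "lodash/lodash"
  prNumber: number;
  prDocs: string; // task description derived from the PR
}

// Append the per-case task section to the fixed system prompt.
function buildTaskPrompt(systemPrompt: string, item: EvalItem): string {
  return [
    systemPrompt.trimEnd(),
    "",
    `Repository: ${item.githubRepo}`,
    `PR: #${item.prNumber}`,
    "",
    "Task docs:",
    item.prDocs.trim(),
  ].join("\n");
}
```

Keeping the system prompt fixed and varying only the trailing task section means every model sees identical instructions for a given case.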

Small Surface, Useful Signal

Designed for quick model comparisons where the artifact is the code diff, not just a final answer.

01 Prepare cases

Fetch merged pull requests, summarize task context when needed, and validate that the base commit can be checked out.
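
As a rough sketch of the selection part of this step, assume PR records shaped like the pull request objects returned by the GitHub REST API (`number`, `title`, `body`, `merged_at`, `base.sha`). The `EvalCase` type and the `selectCases` helper are hypothetical names, and both the LLM summary and the base-commit checkout are left out:

```typescript
// Subset of the fields GitHub's REST API returns for a pull request.
interface PullRecord {
  number: number;
  title: string;
  body: string | null;
  merged_at: string | null; // null when the PR was closed without merging
  base: { sha: string };
}

// Hypothetical case record handed to the agent run.
interface EvalCase {
  prNumber: number;
  baseSha: string;
  prDocs: string;
}

// Keep merged PRs only, most recently merged first, capped at `limit`.
function selectCases(pulls: PullRecord[], limit: number): EvalCase[] {
  return pulls
    .filter((pr) => pr.merged_at !== null)
    .sort((a, b) => Date.parse(b.merged_at!) - Date.parse(a.merged_at!))
    .slice(0, limit)
    .map((pr) => ({
      prNumber: pr.number,
      baseSha: pr.base.sha,
      // Fall back to the title when the PR has no description.
      prDocs: pr.body?.trim() || pr.title,
    }));
}
```

Ordering by merge date is an assumption made for the example; the tool may pick PRs differently.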

02 Run agents

Give each model an isolated repository snapshot and a narrow tool surface for reading, editing, writing, and testing.
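
The agent prompt states that the edit tool's `oldText` must match exactly. A minimal in-memory sketch of that contract follows; the `applyEdit` name and signature are hypothetical, and the requirement that the match be unique is an added assumption rather than something the prompt specifies:

```typescript
// Replace `oldText` with `newText`, requiring exactly one exact match.
// Assumption: ambiguous (repeated) matches are rejected; the prompt only
// says the match must be exact.
function applyEdit(content: string, oldText: string, newText: string): string {
  if (oldText.length === 0) {
    throw new Error("oldText must not be empty");
  }
  const first = content.indexOf(oldText);
  if (first === -1) {
    throw new Error("oldText not found");
  }
  if (content.indexOf(oldText, first + 1) !== -1) {
    throw new Error("oldText matches more than once");
  }
  return content.slice(0, first) + newText + content.slice(first + oldText.length);
}
```

Failing loudly on a missing or ambiguous match gives the model a clear signal to re-read the file before retrying, which fits the "read before editing" guideline.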

03 Judge diffs

Compare each model's candidate diff against the original merged PR, then rank the models in a static HTML report.
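
How the judge scores a diff is not described here. Purely as an illustration of the ranking step, the sketch below assumes the judge emits one numeric score per model and PR (higher is better) and ranks models by mean score; `JudgedRun`, `Ranking`, and `rankModels` are hypothetical names:

```typescript
// Hypothetical judge output: one score per (model, PR) pair, higher is better.
interface JudgedRun {
  model: string;
  prNumber: number;
  score: number;
}

interface Ranking {
  model: string;
  meanScore: number;
  runs: number;
}

// Average each model's scores and sort best-first; ties break by model name.
function rankModels(runs: JudgedRun[]): Ranking[] {
  const totals = new Map<string, { sum: number; count: number }>();
  for (const run of runs) {
    const entry = totals.get(run.model) ?? { sum: 0, count: 0 };
    entry.sum += run.score;
    entry.count += 1;
    totals.set(run.model, entry);
  }
  return Array.from(totals, ([model, t]) => ({
    model,
    meanScore: t.sum / t.count,
    runs: t.count,
  })).sort((a, b) => b.meanScore - a.meanScore || a.model.localeCompare(b.model));
}
```

Reporting the run count next to the mean makes it visible when a model's rank rests on fewer completed cases than its competitors.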

Run It

One command produces `eval.json`, `index.html`, and diff files under `evals/`.

Terminal (OPENROUTER_API_KEY required)
bun run tval run-eval \
  --repo lodash/lodash \
  --eval-llms deepseek/deepseek-v4-pro,deepseek/deepseek-v4-flash \
  --pr-summary-llm deepseek/deepseek-v4-flash \
  --eval-judge-llm deepseek/deepseek-v4-pro \
  --limit 3 \
  --concurrency 8 \
  --retries 3
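
The `--concurrency` and `--retries` flags imply a bounded worker pool with per-task retries. One way such a pool could look is sketched below; `withRetries` and `runPool` are hypothetical helpers, and treating `retries` as the number of extra attempts after the first failure is an assumption about the flag's meaning:

```typescript
// Run `task`, retrying up to `retries` more times if it throws.
// Assumption: `retries` counts extra attempts, so retries = 3 allows 4 tries.
async function withRetries<T>(task: () => Promise<T>, retries: number): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}

// Run tasks with at most `concurrency` in flight; results keep input order.
async function runPool<T>(
  tasks: Array<() => Promise<T>>,
  concurrency: number,
  retries: number,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;

  // Each worker pulls the next unclaimed task index until none remain.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await withRetries(tasks[index], retries);
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

In this sketch a task that still fails after its last retry rejects the whole pool; a real runner would more likely record the failure for that model and PR and continue with the rest.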