Prepare cases
Fetch merged pull requests, summarize task context when needed, and validate that the base commit can be checked out.
Build compact coding-agent benchmarks from historical GitHub PRs, run multiple models in parallel, judge their diffs, and publish a static report that is easy to inspect.
The eval agent gets a fixed repository snapshot and a compact filesystem toolset.
You are an expert coding assistant. You help users with coding tasks by reading files, executing non-version-control commands, editing code, and writing new files.
Available tools:
- read: Read file contents
- bash: Execute non-Git shell commands
- edit: Make surgical edits to files
- write: Create or overwrite files
Guidelines:
- You are working from a fixed repository snapshot prepared by the evaluator.
- Do not use Git, Git metadata, Git history, commits, branches, tags, remotes, or diffs.
- Use bash for ordinary project commands like ls, find, grep, rg, build, and test commands.
- Use read to examine files before editing.
- Use edit for precise changes. oldText must match exactly.
- Use write only for new files or complete rewrites.
- When summarizing your actions, output plain text directly. Do not use cat or bash to display what you did.
- Be concise in your responses.
- Show file paths clearly when working with files.
Documentation:
- Project documentation is usually at README.md.
- Read it when users ask about features, configuration, or setup.
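The exact-match rule for the edit tool can be illustrated with a small TypeScript sketch. `applyExactEdit` is a hypothetical helper, not the project's actual tool implementation. Rejecting an `oldText` that matches more than once is an added assumption that goes beyond the "must match exactly" wording in the guidelines.

```typescript
// Hypothetical helper: replaces exactly one occurrence of oldText in content.
// Assumption: an ambiguous (repeated) match is treated as an error.
function applyExactEdit(content: string, oldText: string, newText: string): string {
  if (oldText.length === 0) {
    throw new Error("oldText must not be empty");
  }
  const first = content.indexOf(oldText);
  if (first === -1) {
    throw new Error("oldText not found; it must match the file contents exactly");
  }
  // Searching from first + 1 also catches overlapping repeats.
  if (content.indexOf(oldText, first + 1) !== -1) {
    throw new Error("oldText matches more than once; include more surrounding context");
  }
  // Slicing instead of String.prototype.replace keeps "$" sequences in newText literal.
  return content.slice(0, first) + newText + content.slice(first + oldText.length);
}

// Usage: a one-line surgical change.
console.log(applyExactEdit("const limit = 3;\n", "limit = 3", "limit = 5"));
```

Failing loudly on a missing or ambiguous match gives the agent a clear signal to re-read the file before retrying, which fits the "use read to examine files before editing" guideline.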
Implement the requested change using the available filesystem tools. Make the smallest practical code change. Run a relevant validation command if the project makes that obvious.
Repository: ${item.githubRepo}
PR: #${item.prNumber}
Task docs:
${item.prDocs}
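The task prompt above is a template over three fields of a case item. A minimal sketch of how it might be assembled is shown here. The `EvalItem` interface and `buildTaskPrompt` function are hypothetical names; only `githubRepo`, `prNumber`, and `prDocs` come from the template itself, and the blank lines between sections are a layout assumption.

```typescript
// Hypothetical item shape: only these three field names appear in the template.
interface EvalItem {
  githubRepo: string;
  prNumber: number;
  prDocs: string;
}

// Assembles the per-case task prompt. Section spacing is an assumption.
function buildTaskPrompt(item: EvalItem): string {
  return [
    "Implement the requested change using the available filesystem tools. " +
      "Make the smallest practical code change. " +
      "Run a relevant validation command if the project makes that obvious.",
    "",
    `Repository: ${item.githubRepo}`,
    `PR: #${item.prNumber}`,
    "",
    "Task docs:",
    item.prDocs.trim(),
  ].join("\n");
}

// Usage with made-up values (the PR number and docs text are placeholders).
console.log(
  buildTaskPrompt({
    githubRepo: "lodash/lodash",
    prNumber: 123,
    prDocs: "Summary of the requested change goes here.",
  }),
);
```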
Designed for quick model comparisons where the artifact is the code diff, not just a final answer.
Give each model an isolated repository snapshot and a narrow tool surface for reading, editing, writing, and testing.
Compare the candidate diff against the original merged PR and rank models in a static HTML report.
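One way the ranking step could work is to average a per-case judge score for each model and sort by that mean. The sketch below assumes the judge emits a numeric `score` per run; that field, the `JudgedRun` shape, and `rankModels` are all hypothetical and are not taken from the project's `eval.json` format.

```typescript
// Hypothetical record for one judged run: the numeric score is an assumption.
interface JudgedRun {
  model: string;
  prNumber: number;
  score: number;
}

interface ModelRanking {
  model: string;
  meanScore: number;
  cases: number;
}

// Averages scores per model and sorts best-first; ties break by model name.
function rankModels(runs: JudgedRun[]): ModelRanking[] {
  const totals = new Map<string, { sum: number; cases: number }>();
  for (const run of runs) {
    const entry = totals.get(run.model) ?? { sum: 0, cases: 0 };
    entry.sum += run.score;
    entry.cases += 1;
    totals.set(run.model, entry);
  }
  return Array.from(totals.entries())
    .map(([model, t]) => ({ model, meanScore: t.sum / t.cases, cases: t.cases }))
    .sort((a, b) => b.meanScore - a.meanScore || a.model.localeCompare(b.model));
}

// Usage with made-up scores.
const exampleRanking = rankModels([
  { model: "model-a", prNumber: 1, score: 0.5 },
  { model: "model-b", prNumber: 1, score: 0.9 },
]);
console.log(exampleRanking.map((r) => r.model).join(" > "));
```

A mean keeps models comparable when some cases fail to produce a run, though it hides variance; reporting `cases` alongside it makes thin samples visible.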
One command produces `eval.json`, `index.html`, and diff files under `evals/`.
bun run tval run-eval \
  --repo lodash/lodash \
  --eval-llms deepseek/deepseek-v4-pro,deepseek/deepseek-v4-flash \
  --pr-summary-llm deepseek/deepseek-v4-flash \
  --eval-judge-llm deepseek/deepseek-v4-pro \
  --limit 3 \
  --concurrency 8 \
  --retries 3
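The `--concurrency` and `--retries` flags suggest a bounded worker pool with per-task retries. The sketch below shows one common way to get that behavior. It is not the tool's implementation: `withRetries` and `mapWithConcurrency` are hypothetical helpers, and treating `retries` as the number of extra attempts after the first failure is an assumption.

```typescript
// Hypothetical helper: runs task once, then up to `retries` more times on failure.
async function withRetries<T>(task: () => Promise<T>, retries: number): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error; // remember the failure and try again
    }
  }
  throw lastError;
}

// Hypothetical helper: maps items through worker with at most `limit` in flight.
// Results keep the input order regardless of completion order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claim the next item synchronously
      results[index] = await worker(items[index]);
    }
  }
  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}

// Usage: eight lanes, three retries per case (mirrors the flags above).
// runCase is a stand-in for whatever executes one model on one PR.
async function runAll(cases: string[], runCase: (c: string) => Promise<string>): Promise<string[]> {
  return mapWithConcurrency(cases, 8, (c) => withRetries(() => runCase(c), 3));
}
```

Using a fixed number of lanes that pull from a shared index avoids starting every task up front, so the limit holds even when individual runs take very different amounts of time.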