Orca Benchmarks

One Agent. Any Model.

Orca is a model-agnostic coding agent. Bring your own key, pick your model, and Orca does the rest — search, understand, edit, verify.

Evaluated on SWE-bench Lite — 300 real GitHub issues, official Docker evaluation, fully reproducible.

Orca v2.8.0 | Last updated: 2026-06-06

Issues Resolved

22

out of 300

Pass Rate

7.3%

11.1% of evaluated

Patches Generated

297

99% of instances

Actively Improving

17%+

projected after v2.8.0 fixes

Why Model-Agnostic Matters

Many coding agents are tied to a single model family. Claude Code works only with Claude models, and Codex CLI works only with GPT models. Orca works with any model: use cheap models for routine tasks, powerful models for complex ones, or the model your enterprise has approved.

BYOK

Bring Your Own Key

15+

Supported Models

You Choose

Price vs Performance

Agent Comparison

SWE-bench Lite scores. Each agent uses its best available model.

Agent         Score    Model Flexibility        Open Source
Claude Code   49.0%    Locked to Claude         No
Codex CLI     45.2%    Locked to GPT-4.1        Yes
Aider         26.3%    Multiple models          Yes
Devin         20.0%    Locked to Proprietary    No
Orca          7.3%     Any model (BYOK)         Yes

Orca's score reflects the first full run. Active optimization is underway — 7/13 re-tested instances now pass after v2.8.0 improvements.

Active Improvement

Orca is improving every week. After the agent optimizations in v2.8.0, 7 of 13 previously failed instances pass when re-tested.

7/13

previously failed instances now pass

Projected full-run score: 17%+

What Changed

  • Smarter tool orchestration — grep over semantic search
  • Raw code visibility — agent sees uncompressed file contents
  • Structured workflow with action deadlines
  • Read-loop detection — forces edit after repeated reads
  • Source-only edits — agent fixes code, not tests
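
The read-loop detection item above can be illustrated with a small guard. This is a minimal sketch, not Orca's actual implementation: the `ReadLoopDetector` class, its `record` method, the action names, and the threshold of three reads are all hypothetical, chosen only to show the idea of forcing an edit once the same file has been read repeatedly with no edit in between.

```python
from collections import Counter


class ReadLoopDetector:
    """Hypothetical guard that flags repeated reads of one file with no edit in between."""

    def __init__(self, max_reads: int = 3):
        # Assumed threshold; the page does not state the real value.
        self.max_reads = max_reads
        self.reads = Counter()

    def record(self, action: str, path: str) -> bool:
        """Record one agent action; return True when an edit should be forced."""
        if action == "edit":
            # Any edit counts as progress, so the read counters start over.
            self.reads.clear()
            return False
        if action == "read":
            self.reads[path] += 1
            return self.reads[path] >= self.max_reads
        return False


detector = ReadLoopDetector(max_reads=3)
trace = [("read", "a.py"), ("read", "a.py"), ("read", "a.py")]
flags = [detector.record(action, path) for action, path in trace]
print(flags)  # [False, False, True]
```

Resetting the counters on every edit is one possible design choice; a stricter variant might reset only the counter for the edited file.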

Resolved Issues (22)

Real GitHub issues fixed by Orca. Each one verified with the project's own test suite in Docker.

How Orca Solves Issues

1

Search

grep across the codebase to find relevant files

2

Understand

Read the specific function, understand the root cause

3

Fix

Edit the source code with a surgical patch

4

Verify

Run tests to confirm the fix works
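
The Search step can be pictured as a plain text search over the repository. The sketch below is an illustration only: `grep_files` is a hypothetical helper written with the Python standard library, and the page does not say how Orca invokes grep internally.

```python
import re
from pathlib import Path
from typing import List


def grep_files(root: str, pattern: str, suffix: str = ".py") -> List[str]:
    """Return sorted relative paths of files under root whose text matches pattern."""
    regex = re.compile(pattern)
    base = Path(root)
    hits = []
    for path in base.rglob(f"*{suffix}"):
        if not path.is_file():
            continue
        # Ignore undecodable bytes so one odd file does not stop the search.
        text = path.read_text(encoding="utf-8", errors="ignore")
        if regex.search(text):
            hits.append(path.relative_to(base).as_posix())
    return sorted(hits)
```

Calling `grep_files(".", r"def parse_date")` would list every Python file under the current directory that defines a function with that name, which narrows down the files worth reading in the Understand step.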

Evaluation Methodology

Official SWE-bench Harness

We use the official swebench v4.1.0 package with Docker containers. Each instance runs in an isolated environment matching the original repo's Python version and dependencies.

Scoring

An instance is “resolved” only if ALL tests that failed before the patch now pass AND ALL previously passing tests still pass. There is no partial credit: it either works or it doesn't.
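
The all-or-nothing rule can be written as a short check. This is a sketch of the rule as stated here, not the harness's own code: `is_resolved` and its arguments are hypothetical names, with the two lists standing in for the FAIL_TO_PASS and PASS_TO_PASS test lists that SWE-bench attaches to each instance.

```python
from typing import Dict, List


def is_resolved(
    results: Dict[str, bool],
    fail_to_pass: List[str],
    pass_to_pass: List[str],
) -> bool:
    """Return True only if every listed test passed after the patch was applied."""
    # A test that is missing from the results is treated as a failure.
    fixed = all(results.get(test, False) for test in fail_to_pass)
    kept = all(results.get(test, False) for test in pass_to_pass)
    return fixed and kept


results = {"test_bug": True, "test_old_a": True, "test_old_b": False}
print(is_resolved(results, ["test_bug"], ["test_old_a"]))                # True
print(is_resolved(results, ["test_bug"], ["test_old_a", "test_old_b"]))  # False
```

The second call returns False even though the target bug is fixed, because one previously passing test regressed.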

Transparency

99 instances had Docker build failures (sympy, sphinx, scikit-learn) and were not evaluated. Our reported score of 7.3% is against all 300 instances, not just the evaluated ones.
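
The headline percentages can be checked with a few lines of arithmetic. One assumption is needed: the page does not say how the "evaluated" denominator is defined, so the sketch takes it to be the 297 generated patches minus the 99 build failures, which gives 198 and reproduces the 11.1% figure.

```python
TOTAL_INSTANCES = 300
RESOLVED = 22
PATCHES_GENERATED = 297
BUILD_FAILURES = 99

# Assumption: evaluated instances = generated patches minus Docker build failures.
evaluated = PATCHES_GENERATED - BUILD_FAILURES


def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


overall_rate = pct(RESOLVED, TOTAL_INSTANCES)            # 7.3
evaluated_rate = pct(RESOLVED, evaluated)                # 11.1
patch_coverage = pct(PATCHES_GENERATED, TOTAL_INSTANCES)  # 99.0

print(evaluated, overall_rate, evaluated_rate, patch_coverage)
```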

Try Orca

Bring your own API key. Pick any model. Start fixing bugs.

npm install -g @axplusb/kepler

Or use the web app