The Project

This project benchmarks the pi-coding-agent on local AMD hardware (specifically the Strix Halo APU and Radeon 9700 GPU). The objective is to measure what is achievable on local hardware with current models and specific quantizations, helping developers make informed decisions.

Models are evaluated on the SWEBench-verified-mini dataset, a curated subset of SWEBench-verified that uses 50 datapoints instead of 500 while preserving the original's distribution of performance, test pass rates, and difficulty.
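A distribution-preserving subset like this can be built by stratified sampling: group the full dataset by a stratification key (difficulty, here, as an illustrative stand-in for whatever keys the mini dataset's authors actually used) and draw from each group in proportion to its size. This is a minimal sketch, not the actual procedure behind SWEBench-verified-mini; `stratified_subset` and the toy corpus are hypothetical.

```python
import random
from collections import Counter

def stratified_subset(instances, key, k, seed=0):
    """Draw a k-item subset whose per-stratum proportions match the
    full dataset's, using largest-remainder rounding for leftovers."""
    rng = random.Random(seed)
    strata = {}
    for inst in instances:
        strata.setdefault(key(inst), []).append(inst)
    total = len(instances)
    # Ideal (fractional) quota per stratum, then floor it.
    quotas = {s: k * len(v) / total for s, v in strata.items()}
    counts = {s: int(q) for s, q in quotas.items()}
    # Hand the remaining slots to the largest fractional remainders.
    leftover = k - sum(counts.values())
    for s in sorted(quotas, key=lambda s: quotas[s] - counts[s],
                    reverse=True)[:leftover]:
        counts[s] += 1
    subset = []
    for s, members in strata.items():
        subset.extend(rng.sample(members, counts[s]))
    return subset

# Toy 500-item corpus across three difficulty buckets (illustrative only).
corpus = ([{"difficulty": "easy"}] * 250
          + [{"difficulty": "medium"}] * 150
          + [{"difficulty": "hard"}] * 100)
mini = stratified_subset(corpus, key=lambda d: d["difficulty"], k=50)
print(Counter(d["difficulty"] for d in mini))  # 25 easy, 15 medium, 10 hard
```

With a 500-to-50 reduction the quotas divide evenly, so the 10x-smaller subset mirrors the original proportions exactly; uneven splits are settled by the remainder step.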

Evaluation Trade-off

To prioritize speed, the heavy containerized test suites are bypassed. Instead, a frontier LLM (Gemini 3.1 Pro) judges each generated diff against the ground-truth solution.
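An LLM-as-judge harness of this kind boils down to building a comparison prompt and parsing a verdict. The sketch below is an assumption about how such a grader could be wired up, not the project's actual implementation: `judge_llm` is a hypothetical callable standing in for a real model API, while `problem_statement` and `patch` are real field names from the SWEBench instance schema.

```python
# Hypothetical LLM-judge grading loop; judge_llm is any callable
# that takes a prompt string and returns the model's reply.

JUDGE_PROMPT = """You are grading a code patch for a software issue.

Problem statement:
{problem}

Ground-truth patch:
{gold}

Candidate patch:
{candidate}

Does the candidate patch solve the issue equivalently to the
ground-truth patch? Answer on a single line: PASS or FAIL,
followed by a brief justification."""

def build_judge_prompt(problem: str, gold: str, candidate: str) -> str:
    return JUDGE_PROMPT.format(problem=problem, gold=gold, candidate=candidate)

def grade(judge_llm, instance: dict, candidate_diff: str) -> bool:
    """Return True if the judge deems the candidate diff a PASS."""
    prompt = build_judge_prompt(
        problem=instance["problem_statement"],
        gold=instance["patch"],          # ground-truth diff
        candidate=candidate_diff,
    )
    verdict = judge_llm(prompt)
    return verdict.strip().upper().startswith("PASS")
```

Parsing only the leading PASS/FAIL token keeps the harness robust to however verbose the judge's justification gets.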

Agent Configuration

The benchmark executes the bare, default pi-coding-agent. No custom system prompts, tools, or wrappers are injected, ensuring a pure test of the model's native capabilities.

// WHOAMI

Donato Capitella

Software Engineer and Ethical Hacker. I enjoy understanding systems by breaking them down and documenting the process. This is a hobby project I maintain alongside strix-halo-toolboxes.com to track the coding abilities of local LLMs. Running these benchmarks can take over 24 hours per model!
