ROIBench builds AI apps and measures how quickly a new user gets real, verified value. Not model accuracy. Not feature count. Activation ROI: value delivered per unit of user effort.
Most AI benchmarks measure model answers. ROIBench measures something different: can a new user reach a verified "aha moment" quickly and without friction?
User effort covers three dimensions: time spent before getting value, steps taken (clicks, uploads, form fills), and friction endured (errors, confusion, dead ends).
Activation means the user actually got what the app promises, confirmed by a deterministic validator. Not vibes: proof.
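What a deterministic validator might look like is easiest to show in code. This is a minimal sketch, not ROIBench's actual validator; the function name, session shape, and citation rule are all assumptions, using a citation-grounded document app as the example:

```python
def validate_activation(session_output: dict) -> bool:
    """Hypothetical validator rule for a citation app: activation counts only
    if every answer carries at least one citation that resolves to a document
    the user actually uploaded. Deterministic: same session, same verdict."""
    answers = session_output.get("answers", [])
    sources = set(session_output.get("uploaded_docs", []))
    if not answers:
        return False  # no answers delivered means no value delivered
    return all(
        any(cite in sources for cite in answer.get("citations", []))
        for answer in answers
    )

# A session whose answer cites an uploaded document activates:
session = {
    "uploaded_docs": ["report.pdf"],
    "answers": [{"text": "Q2 revenue grew 8%.", "citations": ["report.pdf"]}],
}
```

The point of the design is that the check is programmatic and repeatable: a persona's enthusiasm never substitutes for a passing validator.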
An agentic conveyor builds apps, tests them with synthetic personas, and iterates until activation quality hits a target — or proves it can't.
The Activation ROI score is computed from instrumented action traces: every click, upload, and wait is logged with timestamps. It is the headline number that tracks improvement.
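A toy version of such a score makes the trade-off concrete. The weights and trace schema (`t`, `kind` fields) below are entirely illustrative assumptions, not ROIBench's real formula; the shape of the idea is that effort grows with time to value, step count, and friction, and a failed activation scores zero:

```python
def activation_roi(trace, activated):
    """Toy Activation ROI: value per unit of effort. Weights are illustrative.
    `trace` is a list of timestamped events from the instrumented session."""
    if not activated or not trace:
        return 0.0  # no verified value means no ROI, however little effort
    time_to_value = trace[-1]["t"] - trace[0]["t"]            # seconds
    steps = sum(e["kind"] in ("click", "upload", "form") for e in trace)
    friction = sum(e["kind"] == "error" for e in trace)
    effort = 1.0 + time_to_value / 60 + 0.5 * steps + 2.0 * friction
    return round(10.0 / effort, 2)  # higher is better

# Two steps and one minute to value, no errors:
trace = [
    {"t": 0, "kind": "click"},
    {"t": 30, "kind": "upload"},
    {"t": 60, "kind": "value_delivered"},
]
```

Gating on `activated` is what keeps the metric honest: a fast, frictionless session that never delivers the promised value still scores zero.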
Synthetic users write free-text experience reports. These explain why the score is what it is and generate ranked change requests for the builder.
The score measures progress. The feedback drives it. A score without diagnostics is a number you can't act on. Diagnostics without a score is improvement you can't measure.
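In the real pipeline the free-text reports are presumably digested by an LLM; as a minimal sketch of the score-plus-diagnostics pairing, one can at least rank known friction themes by how many personas mention them. The function name and theme list are hypothetical:

```python
from collections import Counter

def rank_change_requests(reports, known_themes):
    """Illustrative ranking: tally which friction themes each persona's
    free-text report mentions, most widely reported first."""
    counts = Counter()
    for report in reports:
        text = report.lower()
        for theme in known_themes:
            if theme in text:
                counts[theme] += 1
    return [theme for theme, _ in counts.most_common()]

# Three persona reports; two complain about citations, one about upload.
reports = [
    "Citations missing on two answers.",
    "Slow, and citations missing again.",
    "Upload failed once.",
]
```

A ranking like this is what turns diagnostics into a work queue the builder can act on round over round.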
Four of the five apps hit the activation target. The one that didn't revealed an architectural ceiling, which is itself a useful result.
| App | Rounds | Score | Status | Key finding |
|---|---|---|---|---|
| DocBench (agentic document workspace with citations) | 23 | 4.75 | Target hit R6 | Weakest persona went 1.8 → 5.0 over 10 rounds; citation grounding was the bottleneck. |
| Night Desk (AI detective game with generated scenes) | 10 | 4.86 | Target hit R5 | Image-quality upgrade drove visual impact to 5.0 across all personas. |
| ClawTrade Ops (trading automation with receipts) | 4 | 4.65 | Target hit R4 | Fastest to target; NLP-first UX was the breakthrough. |
| Annotate (AI annotation platform for students) | 4 | 4.53 | Target hit R2 | Deterministic fallback scored 3.07; a real LLM added +1.46 in one round. |
| TailorCV (resume tailoring tool) | 10 | 3.43 | Target missed | Architecture ceiling at ~3.5; a single LLM pass can't resolve the accuracy tension. |
DocBench in one line: upload documents, ask questions, get answers with grounded citations you can verify. An agentic document workspace, not a chatbot wrapper.
ROIBench is an ongoing research project exploring the intersection of synthetic evaluation, agentic building, and product activation.
Early evidence: TailorCV's flat trajectory (10 rounds, no improvement past R2) suggests architectural ceilings are detectable early. Annotate's +1.46 jump when switching from a deterministic fallback to a real LLM shows infrastructure choices dominate UX polish.
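The early-detection claim can be sketched as a simple plateau test. This is a hypothetical heuristic, not ROIBench's actual rule; the window and epsilon are illustrative:

```python
def hits_ceiling(scores, window=3, epsilon=0.05):
    """Hypothetical plateau heuristic: if the score improved by less than
    `epsilon` over the last `window` rounds, flag a likely ceiling."""
    if len(scores) < window + 1:
        return False  # not enough rounds to judge
    recent = scores[-(window + 1):]
    return recent[-1] - recent[0] < epsilon
```

A TailorCV-like trajectory that goes flat after an early jump would trip this check long before round 10, while a steadily climbing one would not.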
ROIBench is built by Anton as a side project alongside a full-time role. The entire pipeline — discovery, building, persona testing, iteration — is agentic, built with Claude Code (Anthropic) and tested via Playwright.
All code, Value Contracts, persona definitions, and round-by-round results are available in the repository.