Model leaderboard
This site is also a testbed. The same content tasks run through Cloudflare's Edge AI, Claude and Gemini; we score quality and measure latency and cost, then use the strongest outputs to tune the cheapest edge model.
How to read it
- Quality: an LLM judge scores each answer from 0 to 10 against the house rubric (Claude judges when available; otherwise the strongest edge model does).
- Latency: average wall-clock time per task.
- Cost: approximate, and useful for relative comparison only.
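As a rough illustration of how the three columns could be derived from raw eval runs, here is a minimal sketch. The `TaskResult` record, its field names and the `summarize` helper are hypothetical and are not taken from this site's actual pipeline; the sketch also assumes cost is averaged per task, which the text above does not specify.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """One model's answer to one task (hypothetical record shape)."""
    model: str
    quality: float    # judge score on the 0-10 rubric scale
    latency_s: float  # wall-clock seconds for this task
    cost_usd: float   # approximate cost of this task


def summarize(results):
    """Group results by model and average each metric per task."""
    by_model = {}
    for r in results:
        by_model.setdefault(r.model, []).append(r)

    rows = []
    for model, rs in by_model.items():
        rows.append({
            "model": model,
            "quality": mean(r.quality for r in rs),
            "latency_s": mean(r.latency_s for r in rs),
            "cost_usd": mean(r.cost_usd for r in rs),
        })
    # Highest average quality first, as a leaderboard would list it.
    rows.sort(key=lambda row: row["quality"], reverse=True)
    return rows
```

Calling `summarize` on a list of `TaskResult` records returns one row per model, which is enough to compare models relative to each other without treating the cost figures as exact.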
The cheap edge model (Llama 3.1 8B) is the tuning target. The quality gap to Claude/Gemini is what we close with better prompts, few-shot exemplars and (where supported) LoRA; we then re-run this eval to confirm the lift.
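The gap and the lift can be read as simple differences between average quality scores. The sketch below assumes each eval run is reduced to a dict of model name to average quality; the function names and the model keys are illustrative only, not part of the site's code.

```python
def gap_to_reference(scores, target, references):
    """Quality gap between the best reference model and the target model."""
    best_reference = max(scores[name] for name in references)
    return best_reference - scores[target]


def lift(before, after, target):
    """Change in the target model's average quality between two eval runs."""
    return after[target] - before[target]
```

A shrinking `gap_to_reference` across runs, together with a positive `lift`, is what a re-run would need to show before a prompt or LoRA change counts as an improvement.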