Model leaderboard

This site is also a testbed. The same content tasks run through Cloudflare's Edge AI, Claude and Gemini; we score quality and measure latency and cost, then use the strongest outputs to tune the cheapest edge model.


How to read it

Quality: an LLM judge scores each answer from 0 to 10 against the house rubric. Claude is the judge when available; otherwise the strongest edge model judges.
Latency: average wall-clock time per task.
Cost: approximate, and useful only for comparing models against each other.
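As a rough illustration of how the three columns could be derived from per-task results, here is a minimal Python sketch. It is not the site's actual pipeline: the `TaskResult` fields, the `pick_judge` and `summarize` helpers, the judge identifiers, and the choice to report cost as a per-task mean are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """One model's result on one task (hypothetical record shape)."""
    model: str
    quality: float    # judge score on the 0-10 rubric scale
    latency_s: float  # wall-clock seconds for this task
    cost_usd: float   # approximate cost for this task


def pick_judge(available, preferred="claude", fallback="strongest-edge-model"):
    """Use the preferred judge when it is available, else fall back.

    Both identifiers are placeholders, not real model names.
    """
    return preferred if preferred in available else fallback


def summarize(results):
    """Group results by model and average each column per task."""
    by_model = {}
    for r in results:
        by_model.setdefault(r.model, []).append(r)

    rows = []
    for model, rs in by_model.items():
        rows.append({
            "model": model,
            "quality": mean(r.quality for r in rs),
            "latency_s": mean(r.latency_s for r in rs),
            # Assumption: cost is shown as a per-task mean, like latency.
            "cost_usd": mean(r.cost_usd for r in rs),
        })

    # Highest mean quality first, as a leaderboard would list it.
    rows.sort(key=lambda row: row["quality"], reverse=True)
    return rows
```

Averaging per task keeps the columns comparable across models even if one model happened to complete fewer tasks in a given run.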

The cheap edge model (Llama 3.1 8B) is the tuning target. We close its quality gap to Claude and Gemini with better prompts, few-shot exemplars and (where supported) LoRA, then re-run this eval to confirm the lift.
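The gap and the lift can be expressed as two small calculations. The sketch below assumes mean quality scores keyed by model name; the function names `quality_gap` and `lift` and the dictionary shape are hypothetical, chosen only to make the idea concrete.

```python
def quality_gap(scores, target, references):
    """Gap between the best reference model and the tuning target.

    `scores` maps a model name to its mean quality on the 0-10 scale.
    """
    best_reference = max(scores[name] for name in references)
    return best_reference - scores[target]


def lift(before, after, target):
    """Change in the target's mean quality between two eval runs.

    A positive value means the tuning step improved the target.
    """
    return after[target] - before[target]
```

Comparing `quality_gap` before and after a tuning step shows how much of the distance to the reference models remains, while `lift` isolates the target's own improvement.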
