CookingBench

The taste board

15 blind votes on the books

The human half of the benchmark. Visitors blind-taste paired answers and pick the dish they'd rather eat. Bradley-Terry ratings are then fitted to the full history of votes, so beating a strong chef counts for more than beating a weak one. Pure human preference, never blended into the precision score.
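For readers curious how a Bradley-Terry fit turns ballots into ratings, here is a minimal sketch. It is not the board's actual code: the function name `fit_bradley_terry`, the `(winner, loser)` ballot shape, and the small `prior` (a few virtual votes against a fixed-strength anchor, added so a model with no wins still gets a finite rating) are all assumptions made for illustration.

```python
from collections import defaultdict


def fit_bradley_terry(votes, iterations=500, prior=0.1):
    """Fit Bradley-Terry strengths from (winner, loser) ballots.

    Illustrative sketch only. `prior` is an assumed regulariser: each
    model gets `prior` virtual wins and `prior` virtual losses against
    an anchor whose strength is fixed at 1.0.
    """
    wins = defaultdict(float)
    games = defaultdict(lambda: defaultdict(float))
    for winner, loser in votes:
        wins[winner] += 1
        games[winner][loser] += 1
        games[loser][winner] += 1

    models = sorted(games)
    strength = {m: 1.0 for m in models}

    # Minorise-maximise updates: strength = wins / expected exposure.
    for _ in range(iterations):
        updated = {}
        for m in models:
            # Virtual games against the anchor (strength 1.0).
            denom = 2 * prior / (strength[m] + 1.0)
            for opponent, n in games[m].items():
                denom += n / (strength[m] + strength[opponent])
            updated[m] = (wins[m] + prior) / denom
        strength = updated
    return strength


if __name__ == "__main__":
    ballots = [("A", "B"), ("A", "B"), ("B", "C"), ("A", "C")]
    for model, s in sorted(fit_bradley_terry(ballots).items()):
        print(model, round(s, 3))
```

Under this model the chance that one dish beats another is its strength divided by the sum of both strengths, which is why a win over a highly rated opponent moves a rating further than a win over a low-rated one.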

Standings · models with 5+ battles

No model has reached 5 battles yet — every vote at the taste test moves the board.

Still earning their stars · fewer than 5 battles

Claude Opus 4.8 (2 battles) · GPT-5.5 (2 battles) · Gemini 3.1 Pro Preview (1 battle) · Gemini 3.5 Flash (1 battle) · DeepSeek V4 Pro (1 battle) · Qwen 3.5 Plus (4 battles) · Mistral Large 3 (4 battles) · Kimi K2.6 (1 battle) · Claude Sonnet 4.6 (3 battles) · Claude Fable 5 (4 battles) · GPT-5.4 Mini (3 battles) · Llama 4 Maverick (3 battles) · Grok 4.3 (1 battle)

Cast your vote in the Taste Test →

Votes are anonymous, immutable once cast, and archived into the open repository alongside the run artifacts — the board can be rebuilt from the raw ballots at any time. Models need 5+ battles to enter the standings; under-battled models are served more often in the duel until they catch up.
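The two rules above (the 5-battle threshold and the extra servings for under-battled models) could be rebuilt from raw ballots roughly as follows. This is a sketch under stated assumptions: the ballot shape `(winner, loser)`, the helper names, and especially the weighting formula in `pick_duel` are hypothetical, since the page does not say how much more often under-battled models are served.

```python
import random
from collections import Counter

MIN_BATTLES = 5  # threshold stated on the board


def battle_counts(ballots):
    """Count battles per model from (winner, loser) ballots."""
    counts = Counter()
    for winner, loser in ballots:
        counts[winner] += 1
        counts[loser] += 1
    return counts


def split_board(models, counts):
    """Separate models with enough battles from those still short."""
    ranked = [m for m in models if counts[m] >= MIN_BATTLES]
    provisional = [m for m in models if counts[m] < MIN_BATTLES]
    return ranked, provisional


def pick_duel(models, counts, rng=random):
    """Pick two distinct models, favouring under-battled ones.

    Assumed weighting: 1 plus the number of battles a model still
    needs to reach MIN_BATTLES. The real scheme is not documented.
    """
    weights = {m: 1 + max(0, MIN_BATTLES - counts[m]) for m in models}
    first = rng.choices(models, weights=[weights[m] for m in models])[0]
    rest = [m for m in models if m != first]
    second = rng.choices(rest, weights=[weights[m] for m in rest])[0]
    return first, second


if __name__ == "__main__":
    ballots = [("A", "B"), ("A", "C"), ("B", "A")]
    counts = battle_counts(ballots)
    print(split_board(["A", "B", "C"], counts))
    print(pick_duel(["A", "B", "C"], counts, random.Random(0)))
```

Because each ballot adds one battle to both models involved, the per-model battle counts on the board always sum to twice the number of votes.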