Methodology

Why this benchmark

Cooking is an unusually good probe of model reliability: it mixes hard arithmetic (scaling, conversions, nutrition math), regulated facts (food-safety temperatures), and judgement (technique, flavour). Models visibly differ here — and crucially, new versions of the same model family sometimes regress on quantities and volumes while improving elsewhere. CookingBench makes that measurable.

Dataset

184 hand-written questions across 8 categories. Most are graded deterministically; the rest by a reference-anchored LLM judge. The entire dataset is public — we don’t pretend to have a secret hold-out. Contamination defence is mechanical instead: after every run, item analysis demotes saturated questions to a separate Basics tier (a regression gate excluded from the Overall score) and the active set is refreshed with harder, real-life items. The dataset carries a canary string so training-data filters can exclude it.

Quantities & Scaling: 25 questions
Conversions: 28 questions
Food Safety: 22 questions
Substitutions: 21 questions
Technique: 16 questions
Flavour Pairing: 16 questions
Nutrition: 36 questions
Recipe Generation: 20 questions

Grading

Deterministic graders handle anything with a right answer: numbers are extracted from the model's final answer line (handling fractions, thousands separators and ranges), converted across units where equivalent (350°F ≈ 177°C), and checked against a tolerance. Answers that merely restate a value from the question never get credit. Unsafe advice (e.g. washing raw chicken) zeroes the question regardless of anything else said.
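A minimal sketch of the extract, convert and tolerance steps for a temperature question. The function names, the regular expression and the 1°C tolerance are illustrative assumptions, not the benchmark's actual grader, and the sketch leaves out ranges, restated values and the unsafe-advice rule.

```python
import re
from fractions import Fraction

# Matches '1 1/2', '3/4', '1,200' or '176.7' (illustrative pattern).
_NUMBER = r"\d+\s+\d+/\d+|\d+/\d+|\d[\d,]*(?:\.\d+)?"

def parse_number(token: str) -> float:
    """Parse a plain, comma-separated, fractional or mixed number."""
    token = token.replace(",", "").strip()
    parts = token.split()
    if len(parts) == 2:  # mixed number such as '1 1/2'
        return float(int(parts[0]) + Fraction(parts[1]))
    return float(Fraction(token))

def to_celsius(value: float, unit: str) -> float:
    """Normalise a temperature to Celsius so equivalent answers compare equal."""
    unit = unit.upper()
    if unit == "F":
        return (value - 32) * 5 / 9
    if unit == "C":
        return value
    raise ValueError(f"unsupported unit: {unit}")

def grade_temperature(answer: str, expected_c: float, tol_c: float = 1.0) -> bool:
    """Check the temperature on the final line of the answer against a tolerance."""
    lines = answer.strip().splitlines()
    if not lines:
        return False
    match = re.search(rf"({_NUMBER})\s*°?\s*([CF])\b", lines[-1], re.IGNORECASE)
    if match is None:
        return False
    value = to_celsius(parse_number(match.group(1)), match.group(2))
    return abs(value - expected_c) <= tol_c
```

Under these assumptions, an answer ending in "350°F" and one ending in "177 C" both pass against a 177°C reference, because both are compared in Celsius.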

The judge panel (methodology v2) replaces a single LLM judge with three: Claude Opus 4.8, GPT-5.5 and Qwen 3.5 Plus. Each answer is scored by two of the three seats; a judge never scores a model from its own maker (to avoid self-preference bias), and the remaining seat rotation is deterministic by hash, so every published score is reproducible. Judges are fact-checkers, not mark-givers: each compares the answer to a reference and lists concrete faults, typed critical, major or minor, and code maps those to deductions (−40/−15/−5 from 100). Because judges never award points themselves, the grade inflation that pushed v1 scores to the ceiling is removed. Judges are blind to which model wrote the answer, cross-judge disagreements over 15 points are flagged for human review, and every panel seat must independently pass a calibration gate (reproducing hand-scored anchor answers) before a run is accepted. For constrained recipe generation the panel score is blended with deterministic constraint checks, e.g. flagging an allergen that appears in a “nut-free” recipe.
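The fault-to-deduction mapping and the hash-based seat rotation can be sketched as follows. The deduction values and the 15-point review threshold come from the text. Everything else is an assumption made for illustration: the maker labels, the hash input (question id plus model id), the choice of SHA-256 and the floor at 0.

```python
import hashlib
from itertools import combinations

DEDUCTIONS = {"critical": 40, "major": 15, "minor": 5}  # values from the text

# Judge -> maker label; the labels are illustrative, not the project's identifiers.
JUDGES = {
    "Claude Opus 4.8": "anthropic",
    "GPT-5.5": "openai",
    "Qwen 3.5 Plus": "alibaba",
}

def score_from_faults(faults: list[str]) -> int:
    """Start at 100 and subtract per listed fault; the floor at 0 is an assumption."""
    return max(0, 100 - sum(DEDUCTIONS[fault] for fault in faults))

def assign_seats(question_id: str, model_id: str, model_maker: str) -> tuple[str, str]:
    """Pick two judges deterministically, never one from the model's own maker."""
    eligible = sorted(j for j, maker in JUDGES.items() if maker != model_maker)
    pairs = list(combinations(eligible, 2))
    # Assumed hash input: the same (question, model) pair always maps to the same seats.
    digest = hashlib.sha256(f"{question_id}:{model_id}".encode()).digest()
    return pairs[int.from_bytes(digest[:8], "big") % len(pairs)]

def needs_human_review(score_a: int, score_b: int) -> bool:
    """Flag cross-judge disagreements of more than 15 points."""
    return abs(score_a - score_b) > 15
```

In this sketch a model from one of the three makers leaves only one eligible pair, so the hash matters only for models from other makers, where it selects one of three possible pairs.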

Precision and taste are scored separately. Everything above measures precision — facts, math, constraints, technique. But a benchmark that stops there is a metrics test, not a flavour test. The Taste Test is the second axis: blind, side-by-side human votes on paired answers, arena-style. Every ballot is kept forever and a Bradley-Terry rating is fitted to the full history on the Taste Board; with enough battles the human win rate also appears as its own leaderboard column — never folded into the precision score.
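For readers unfamiliar with Bradley-Terry ratings, here is a minimal sketch that fits strengths to a full ballot history using the standard minorisation-maximisation update. The function name and the `(winner, loser)` input format are assumptions, ties are not handled, and the Taste Board's actual fitting code may differ.

```python
from collections import defaultdict

def fit_bradley_terry(battles, iterations=200):
    """Fit Bradley-Terry strengths to a list of (winner, loser) ballots.

    Returns a dict of strengths that sum to 1. Under the model,
    P(a beats b) = strength[a] / (strength[a] + strength[b]).
    """
    wins = defaultdict(int)
    games = defaultdict(int)  # unordered pair -> number of battles
    players = set()
    for winner, loser in battles:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        players.update((winner, loser))

    strength = {p: 1.0 / len(players) for p in players}
    for _ in range(iterations):
        updated = {}
        for p in players:
            # Sum over opponents q of n_pq / (s_p + s_q).
            denom = sum(
                n / (strength[p] + strength[q])
                for pair, n in games.items() if p in pair
                for q in pair - {p}
            )
            updated[p] = wins[p] / denom
        total = sum(updated.values())
        strength = {p: v / total for p, v in updated.items()}
    return strength
```

A model with no wins at all is driven to strength 0 by this update, which is one reason a rating is only meaningful once enough battles have been recorded.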

Every question scores 0–100. The Overall score is the plain mean over active questions, with a 95% bootstrap confidence interval over questions shown as ±. Frontier is the mean over difficulty-4+ items: multi-step chains where errors compound, dangerous-premise traps, buried-constraint briefs and locale traps (a UK pint, an Australian tablespoon). Basics is the saturated tier every model should ace; a dip there is a regression worth investigating. Transport incidents (empty or provider-filtered responses, retried then scored 0) are reported separately so infrastructure noise is never mistaken for skill.
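One way to compute a mean with a 95% bootstrap interval over questions is the percentile bootstrap sketched below. The text does not say which bootstrap variant is used, so the percentile method, the resample count and the fixed seed are all assumptions.

```python
import random

def bootstrap_ci(scores, resamples=10_000, seed=0):
    """Return (mean, lower, upper) for a 95% percentile bootstrap over questions."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(scores)
    # Resample questions with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(resamples))
    lower = means[int(0.025 * resamples)]
    upper = means[int(0.975 * resamples) - 1]
    return sum(scores) / n, lower, upper
```

A single ± figure could then be reported as half the interval width, `(upper - lower) / 2`, though the interval need not be symmetric around the mean.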

The role of human experts

An AI judge scales, but it shares the blind spots of the models it grades. So grading is layered: deterministic checks need no opinion at all; the LLM judge panel handles the subjective bulk; and a sampled and flagged set of answers (anything the two judges disagreed on, plus a random audit slice) is reviewed by people who actually cook. We are recruiting professional chefs and nutritionists for that expert layer, and their verdicts calibrate the judges over time. The Taste Test described above, where visitors blind-vote on paired answers, arena-style, is a third independent signal.

Reproducibility

Models run via OpenRouter at temperature 0 with fixed token caps. Raw responses, per-request costs, grading details and the judge configuration are committed to the open repository, so every published leaderboard can be rebuilt from git alone.

Related work

Existing cooking-adjacent benchmarks measure something different: CookBench (embodied planning in a simulated kitchen), CuisineWorld (multi-agent kitchen coordination), PizzaCommonSense (commonsense reasoning over recipe steps) and Recipe1MSubs (ingredient substitution pairs). To our knowledge CookingBench is the first public leaderboard for culinary knowledge correctness in general-purpose chat models.