Methodology
Why this benchmark
Cooking is an unusually good probe of model reliability: it mixes hard arithmetic (scaling, conversions, nutrition math), regulated facts (food-safety temperatures), and judgement (technique, flavour). Models visibly differ here — and crucially, new versions of the same model family sometimes regress on quantities and volumes while improving elsewhere. CookingBench makes that measurable.
Dataset
129 hand-written questions across 8 categories. Roughly 55% are graded deterministically; the rest by a rubric-driven LLM judge. A subset of questions is held out (never published) to resist contamination.
- Quantities & Scaling: 17 questions
- Conversions: 19 questions
- Food Safety: 14 questions
- Substitutions: 15 questions
- Technique: 11 questions
- Flavour Pairing: 12 questions
- Nutrition: 27 questions
- Recipe Generation: 14 questions
Grading
Deterministic graders handle anything with a right answer: numbers are extracted from the model's final answer line (handling fractions, thousands separators and ranges), converted across units where equivalent (350°F ≈ 177°C), and checked against a tolerance. Answers that merely restate a value from the question never get credit. Unsafe advice (e.g. washing raw chicken) zeroes the question regardless of anything else said.
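As an illustration only, the deterministic pipeline for a temperature question might look like the Python sketch below. It is not the repository's grader: the names `extract_numbers`, `final_line` and `grade_temperature`, the 2 °C default tolerance, and the rule that both ends of a range must fall inside the tolerance are all assumptions made for this example. The unsafe-advice check is left out.

```python
import re
from fractions import Fraction

# Hypothetical number pattern, tried in this order:
_NUMBER = re.compile(
    r"\d+\s+\d+/\d+"                   # mixed number, e.g. "1 1/2"
    r"|\d+/\d+"                        # simple fraction, e.g. "3/4"
    r"|\d{1,3}(?:,\d{3})+(?:\.\d+)?"   # thousands separators, e.g. "1,200"
    r"|\d+(?:\.\d+)?"                  # plain integer or decimal
)
_FAHRENHEIT = re.compile(r"°\s*F\b|\bfahrenheit\b", re.IGNORECASE)


def final_line(answer: str) -> str:
    """Return the last non-empty line of the model's answer."""
    lines = [ln.strip() for ln in answer.splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def extract_numbers(line: str) -> list[float]:
    """Parse every number on a line; a range yields two values."""
    values = []
    for token in _NUMBER.findall(line):
        parts = token.replace(",", "").split()
        values.append(float(sum(Fraction(p) for p in parts)))
    return values


def grade_temperature(answer, expected_c, tol_c=2.0, question_values=()):
    """Score 100 or 0 for a temperature answer (assumed all-or-nothing)."""
    line = final_line(answer)
    values = extract_numbers(line)
    if not values:
        return 0
    # Restating a number that already appears in the question earns nothing.
    if any(v in question_values for v in values):
        return 0
    # Convert Fahrenheit answers so equivalent values compare equal.
    if _FAHRENHEIT.search(line):
        values = [(v - 32) * 5 / 9 for v in values]
    # Assumption: every value (both ends of a range) must be in tolerance.
    return 100 if all(abs(v - expected_c) <= tol_c for v in values) else 0


print(grade_temperature("Preheat first.\nFinal answer: 350°F", 176.67))  # 100
```

Under these assumptions, "Final answer: 175-180 °C" scores 0 against an expected 176.67 °C, because the upper end of the range is outside the 2 °C tolerance.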
The LLM judge grades subjective answers against a per-question written rubric, blind to which model wrote the answer (self-identifying phrases are stripped). Each answer is judged twice at temperature 0 and the two scores are averaged; large disagreements are flagged for manual review. For constrained recipe generation, the judge score (70%) is blended with deterministic constraint checks (30%), which catch violations such as an allergen appearing in a "nut-free" recipe.
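The double-judging and the 70/30 blend can be sketched in a few lines of Python. The function names, the 20-point disagreement threshold, and the idea that the constraint component is the percentage of checks passed are assumptions for illustration; the source only states the weights and that large disagreements are flagged.

```python
def combine_judge_passes(pass_a: float, pass_b: float,
                         disagreement_threshold: float = 20.0):
    """Average two judge scores (0-100) and flag large disagreements.

    The threshold is a hypothetical value, not the published one.
    """
    mean_score = (pass_a + pass_b) / 2
    needs_review = abs(pass_a - pass_b) > disagreement_threshold
    return mean_score, needs_review


def recipe_generation_score(judge_score: float,
                            checks_passed: int,
                            checks_total: int) -> float:
    """Blend the judge score (70%) with constraint checks (30%).

    Assumption: the constraint component is the share of checks passed.
    """
    constraint_score = 100 * checks_passed / checks_total
    return (70 * judge_score + 30 * constraint_score) / 100


judge, flagged = combine_judge_passes(80, 90)
print(judge, flagged)                          # 85.0 False
print(recipe_generation_score(judge, 3, 4))    # 82.0
```

Here a recipe that the judge rates 85 but that violates one of four constraints ends up at 82.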
Every question scores 0–100. A category score is the mean of its questions; the overall score is the unweighted mean of category scores. The leaderboard also reports a Hard set score — difficulty-3 questions only. Frontier models saturate the easy questions (which exist as a floor, to catch regressions and rank smaller models), so the hard set carries the ranking signal at the top: inverse and non-linear scaling traps, unit-identity traps (a UK pint, an Australian tablespoon, weight-vs-volume ounces), chained conversions, given-data nutrition reasoning, and multi-constraint recipe briefs.
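The aggregation is simple enough to show directly. The sketch below assumes each result is a `(category, difficulty, score)` tuple and that the Hard set score is a plain mean over difficulty-3 questions; both the record shape and that second detail are assumptions, since the text does not say how the hard set is averaged.

```python
from collections import defaultdict
from statistics import mean


def leaderboard_scores(results):
    """Aggregate per-question scores (0-100) into leaderboard numbers.

    `results` is an iterable of (category, difficulty, score) tuples,
    a hypothetical record shape chosen for this example.
    """
    by_category = defaultdict(list)
    hard = []
    for category, difficulty, score in results:
        by_category[category].append(score)
        if difficulty == 3:
            hard.append(score)

    # Category score: mean of that category's questions.
    category_scores = {c: mean(s) for c, s in by_category.items()}
    # Overall score: unweighted mean of the category scores.
    overall = mean(category_scores.values())
    # Hard set: assumed here to be a plain mean over difficulty-3 questions.
    hard_score = mean(hard) if hard else None
    return category_scores, overall, hard_score


results = [
    ("Conversions", 1, 100),
    ("Conversions", 3, 50),
    ("Nutrition", 3, 0),
]
categories, overall, hard = leaderboard_scores(results)
print(categories)  # {'Conversions': 75, 'Nutrition': 0}
print(overall)     # 37.5
print(hard)        # 25
```

Note that the overall score here is 37.5, not the 50 a mean over all three questions would give: averaging category scores keeps a large category such as Nutrition from dominating the total.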
The role of human experts
An AI judge scales, but it shares the blind spots of the models it grades. So grading is layered: deterministic checks need no opinion at all; the LLM judge handles the subjective bulk; and a set of answers (anything the two judge passes disagreed on, plus a random audit slice) is reviewed by people who actually cook. We are recruiting professional chefs and nutritionists for that expert layer, and their verdicts will calibrate the judge over time. A future public “taste test” mode will let visitors blind-vote on paired answers, arena-style, as a third independent signal.
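One way to assemble that review set is sketched below: every flagged answer is included, and a seeded random sample of the rest forms the audit slice. The function name, the 5% audit fraction and the fixed seed are illustrative assumptions, not values taken from the project.

```python
import random


def select_for_review(answer_ids, flagged_ids, audit_fraction=0.05, seed=0):
    """Return the set of answer IDs to send to human reviewers.

    All flagged answers are included; the audit slice is a seeded random
    sample of the unflagged ones, so the selection can be reproduced.
    """
    flagged = set(flagged_ids)
    # Sort so the sample does not depend on set iteration order.
    pool = sorted(set(answer_ids) - flagged)
    audit_size = round(len(pool) * audit_fraction)
    audit = random.Random(seed).sample(pool, audit_size)
    return flagged | set(audit)


review_set = select_for_review(range(100), flagged_ids={3, 7})
print(len(review_set))  # 7: two flagged answers plus five audited ones
```

Seeding the sampler keeps the audit slice stable between runs, which fits the aim of rebuilding a published leaderboard from committed data.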
Reproducibility
Models run via OpenRouter at temperature 0 with fixed token caps. Raw responses, per-request costs, grading details and the judge configuration are committed to the open repository, so every published leaderboard can be rebuilt from git alone.
Related work
Existing cooking-adjacent benchmarks measure something different: CookBench (embodied planning in a simulated kitchen), CuisineWorld (multi-agent kitchen coordination), PizzaCommonSense (commonsense reasoning over recipe steps) and Recipe1MSubs (ingredient substitution pairs). To our knowledge CookingBench is the first public leaderboard for culinary knowledge correctness in general-purpose chat models.