CookingBench

Methodology

Why this benchmark

Cooking is an unusually good probe of model reliability: it mixes hard arithmetic (scaling, conversions, nutrition math), regulated facts (food-safety temperatures), and judgement (technique, flavour). Models visibly differ here — and crucially, new versions of the same model family sometimes regress on quantities and volumes while improving elsewhere. CookingBench makes that measurable.

Dataset

129 hand-written questions across 8 categories. Roughly 55% are graded deterministically; the rest by a rubric-driven LLM judge. A subset of questions is held out (never published) to resist contamination.

  • Quantities & Scaling: 17 questions
  • Conversions: 19 questions
  • Food Safety: 14 questions
  • Substitutions: 15 questions
  • Technique: 11 questions
  • Flavour Pairing: 12 questions
  • Nutrition: 27 questions
  • Recipe Generation: 14 questions

Grading

Deterministic graders handle anything with a right answer: numbers are extracted from the model's final answer line (handling fractions, thousands separators and ranges), converted across units where equivalent (350°F ≈ 177°C), and checked against a tolerance. Answers that merely restate a value from the question never get credit. Unsafe advice (e.g. washing raw chicken) zeroes the question regardless of anything else said.
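As a rough illustration of the extract, convert and compare steps described above, here is a minimal sketch for temperature answers only. The function names, the regular expression and the 1°C tolerance are assumptions made for this example, not the benchmark's actual grader; range handling, the restated-value check and the unsafe-advice rule are left out.

```python
import re
from fractions import Fraction

# Matches a mixed number ("1 1/2"), a plain fraction ("3/4"), or a decimal
# with optional thousands separators ("1,250.5").
NUMBER = r"\d+\s+\d+/\d+|\d+/\d+|\d[\d,]*(?:\.\d+)?"
# A number followed by an optional degree sign and a C or F unit letter.
TEMPERATURE = re.compile(rf"({NUMBER})\s*°?\s*([CF])\b", re.IGNORECASE)


def parse_number(token):
    """Turn '1,250', '3/4' or '1 1/2' into a float."""
    token = token.replace(",", "").strip()
    if "/" in token:
        parts = token.split()
        whole = int(parts[0]) if len(parts) == 2 else 0
        return float(whole + Fraction(parts[-1]))
    return float(token)


def to_celsius(value, unit):
    """Normalise to Celsius so that equivalent answers compare equal."""
    return value if unit.upper() == "C" else (value - 32) * 5 / 9


def grade_temperature(final_line, expected_c, tolerance_c=1.0):
    """Return 100 if the first temperature in the line is within tolerance, else 0."""
    match = TEMPERATURE.search(final_line)
    if match is None:
        return 0
    value_c = to_celsius(parse_number(match.group(1)), match.group(2))
    return 100 if abs(value_c - expected_c) <= tolerance_c else 0
```

With these assumptions, "Final answer: 350°F" and "Final answer: 177 C" both earn credit against an expected value of 176.67°C, while "400F" does not.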

The LLM judge grades subjective answers against a per-question written rubric, blind to which model wrote the answer (self-identifying phrases are stripped). Each answer is judged twice at temperature 0 and averaged; large disagreements are flagged for manual review. For constrained recipe generation, the judge score (70%) is blended with deterministic constraint checks (30%) — e.g. an allergen appearing in a "nut-free" recipe.
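The double-judging and blending arithmetic can be sketched as follows. The 70/30 weights come from the text; the 20-point disagreement threshold and the choice to score constraints as the fraction passed are assumptions for illustration, as are the function names.

```python
def judge_score(first_pass, second_pass, disagreement_threshold=20.0):
    """Average two judge passes and flag large disagreements for manual review."""
    needs_review = abs(first_pass - second_pass) > disagreement_threshold
    return (first_pass + second_pass) / 2, needs_review


def recipe_score(judge, constraints_passed, constraints_total):
    """Blend the judge score (70%) with deterministic constraint checks (30%)."""
    if constraints_total == 0:
        constraint_score = 100.0  # assumption: nothing to violate counts as a pass
    else:
        constraint_score = 100.0 * constraints_passed / constraints_total
    return 0.7 * judge + 0.3 * constraint_score
```

For example, two judge passes of 80 and 90 average to 85 without a flag, and a recipe with a judge score of 80 that passes one of two constraint checks would land at roughly 71 under this scheme.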

Every question scores 0–100. A category score is the mean of its questions; the overall score is the unweighted mean of category scores. The leaderboard also reports a Hard set score — difficulty-3 questions only. Frontier models saturate the easy questions (which exist as a floor, to catch regressions and rank smaller models), so the hard set carries the ranking signal at the top: inverse and non-linear scaling traps, unit-identity traps (a UK pint, an Australian tablespoon, weight-vs-volume ounces), chained conversions, given-data nutrition reasoning, and multi-constraint recipe briefs.
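The aggregation above can be sketched in a few lines. The tuple layout and function name are hypothetical, and averaging the hard set as a plain mean over its questions is an assumption, since the text does not say how that score is aggregated.

```python
from collections import defaultdict
from statistics import mean


def leaderboard_scores(results):
    """results: iterable of (category, difficulty, score) tuples, score in 0-100."""
    by_category = defaultdict(list)
    hard = []
    for category, difficulty, score in results:
        by_category[category].append(score)
        if difficulty == 3:
            hard.append(score)
    # Category score: mean of that category's question scores.
    category_scores = {name: mean(scores) for name, scores in by_category.items()}
    return {
        "categories": category_scores,
        # Overall: unweighted mean of category scores, not of all questions.
        "overall": mean(category_scores.values()),
        # Hard set (assumed): plain mean over difficulty-3 questions.
        "hard": mean(hard) if hard else None,
    }
```

Because the overall score averages categories rather than questions, the 27 Nutrition questions carry the same total weight as the 11 Technique questions.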

The role of human experts

An AI judge scales, but it shares the blind spots of the models it grades. So grading is layered: deterministic checks need no opinion at all; the LLM judge handles the subjective bulk; and a sampled and flagged set of answers — anything the double-judge disagreed on, plus a random audit slice — is reviewed by people who actually cook. We are recruiting professional chefs and nutritionists for that expert layer, and their verdicts calibrate the judge over time. A future public “taste test” mode will let visitors blind-vote on paired answers, arena-style, as a third independent signal.
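One way the human review queue could be assembled is sketched below: every flagged answer, plus a seeded random slice of the rest. The record layout, the 5% audit fraction and the fixed seed are assumptions for this example, not details from the benchmark.

```python
import random


def select_for_review(answers, audit_fraction=0.05, seed=0):
    """Return all flagged answers plus a reproducible random audit slice.

    answers: list of dicts with at least a boolean "flagged" key (hypothetical layout).
    """
    flagged = [a for a in answers if a["flagged"]]
    unflagged = [a for a in answers if not a["flagged"]]
    sample_size = min(len(unflagged), round(audit_fraction * len(unflagged)))
    # A seeded generator keeps the audit slice identical across reruns.
    audit = random.Random(seed).sample(unflagged, sample_size)
    return flagged + audit
```

Seeding the sampler means the same answers reach the reviewers each time the queue is rebuilt from the same inputs.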

Reproducibility

Models run via OpenRouter at temperature 0 with fixed token caps. Raw responses, per-request costs, grading details and the judge configuration are committed to the open repository, so every published leaderboard can be rebuilt from git alone.

Related work

Existing cooking-adjacent benchmarks measure something different: CookBench (embodied planning in a simulated kitchen), CuisineWorld (multi-agent kitchen coordination), PizzaCommonSense (commonsense reasoning over recipe steps) and Recipe1MSubs (ingredient substitution pairs). To our knowledge, CookingBench is the first public leaderboard for culinary knowledge correctness in general-purpose chat models.