Which AI model is the best chef?

CookingBench scores models on the things that actually go wrong in a kitchen: scaling quantities, converting units, food safety, substitutions, technique, flavour logic and nutrition math.

Leaderboard · run 2026-06-v2 · methodology v2

2026-06-12

#	Model (vendor)	Overall	Frontier	Basics	Run cost
1	GPT-5.4 Mini (OpenAI)	96.4 ± 2.4	94.1	100.0 ⚠1	$0.58
2	Grok 4.3 (xAI)	95.4 ± 2.8	94.4	100.0	$0.42
3	GPT-5.5 (OpenAI)	94.5 ± 3.4	89.1	100.0	$2.39
4	Claude Fable 5 (Anthropic)	92.7 ± 4.0	84.0	98.8 ⚠3	$4.82
5	Claude Opus 4.8 (Anthropic)	92.6 ± 3.8	85.4	100.0	$2.09
6	Gemini 3.5 Flash (Google)	92.4 ± 3.9	87.6	100.0	$1.49
7	Gemini 3.1 Pro Preview (Google)	91.6 ± 4.3	89.4	100.0	$1.58
8	Kimi K2.6 (Moonshot AI)	91.5 ± 4.1	89.5	100.0 ⚠1	$1.48
9	Qwen 3.5 Plus (Alibaba)	91.4 ± 3.9	91.8	100.0	$0.52
10	Claude Sonnet 4.6 (Anthropic)	91.3 ± 4.1	89.5	100.0 ⚠2	$3.07
11	DeepSeek V4 Pro (DeepSeek)	90.5 ± 4.1	86.3	100.0	$0.38
12	Mistral Large 3 (Mistral)	88.9 ± 4.6	93.3	100.0	$0.05
13	Llama 4 Maverick (Meta)	82.0 ± 5.5	82.1	100.0	$0.03

Categories

Quantities & Scaling

Scaling recipes up and down, yields, pan-size math and baker’s percentages.
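
The arithmetic behind this category can be sketched in a few lines. This is an illustrative sketch only, not CookingBench's own items or grader: the function names and the sample dough are hypothetical, and the pan calculation assumes round pans filled to the same depth.

```python
def scale_recipe(ingredients, original_servings, target_servings):
    """Scale every quantity linearly by target/original servings."""
    factor = target_servings / original_servings
    return {name: qty * factor for name, qty in ingredients.items()}


def bakers_percentages(ingredients_g, flour_key="flour"):
    """Express each ingredient weight as a percentage of the flour weight."""
    flour = ingredients_g[flour_key]
    return {name: 100.0 * grams / flour for name, grams in ingredients_g.items()}


def round_pan_factor(from_diameter, to_diameter):
    """Batter multiplier between two round pans of equal depth (area ratio)."""
    return (to_diameter / from_diameter) ** 2


# Hypothetical dough, weights in grams.
dough = {"flour": 500, "water": 350, "salt": 10}

scaled = scale_recipe(dough, original_servings=4, target_servings=6)
percentages = bakers_percentages(dough)
pan_factor = round_pan_factor(8, 10)  # 8-inch recipe baked in a 10-inch pan
```

Linear scaling is only the starting point: bake times, leavening and seasoning often do not scale in proportion.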

Conversions

Volume, weight and temperature conversions across kitchen units and locales.
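
A minimal sketch of the conversions this category covers, assuming standard unit definitions (US customary cup, 250 mL metric cup, avoirdupois ounce). The helper names are hypothetical and are not part of the benchmark. Volume-to-weight conversions additionally need an ingredient density, which is left out here.

```python
ML_PER_US_CUP = 236.5882365    # US customary cup
ML_PER_METRIC_CUP = 250.0      # metric cup (e.g. Australia)
GRAMS_PER_OUNCE = 28.349523125 # avoirdupois ounce


def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9


def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32


def cups_to_ml(cups, locale="us"):
    """Convert cups to millilitres; the cup size depends on the locale."""
    per_cup = {"us": ML_PER_US_CUP, "metric": ML_PER_METRIC_CUP}[locale]
    return cups * per_cup


def ounces_to_grams(oz):
    return oz * GRAMS_PER_OUNCE


oven_c = fahrenheit_to_celsius(350)  # a common US baking temperature
```

The locale parameter matters in practice: two "cups" of liquid differ by about 27 mL between a US and a metric recipe.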

Food Safety

Safe internal temperatures, the danger zone, storage times and cross-contamination.
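
As an illustration of the kind of check involved, the sketch below encodes widely published USDA consumer guidance in °F. Which thresholds CookingBench itself grades against is not stated here, so the table and function names are assumptions.

```python
# Assumption: USDA consumer guidance, temperatures in °F.
SAFE_MIN_INTERNAL_F = {
    "poultry": 165,
    "ground_meat": 160,
    "whole_cut_beef_pork_lamb": 145,  # followed by a 3-minute rest
    "fish": 145,
}

# Bacteria multiply fastest between these two temperatures.
DANGER_ZONE_F = (40, 140)


def is_safely_cooked(food, internal_temp_f):
    """True if the measured internal temperature meets the minimum for this food."""
    return internal_temp_f >= SAFE_MIN_INTERNAL_F[food]


def in_danger_zone(temp_f):
    """True if food held at this temperature sits strictly inside the danger zone."""
    low, high = DANGER_ZONE_F
    return low < temp_f < high
```

For example, `is_safely_cooked("poultry", 160)` is false, while room temperature (around 70 °F) falls inside the danger zone.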

Substitutions

Ingredient swaps with correct ratios, including allergen-aware alternatives.

Technique

Troubleshooting failures (split sauces, dense bread) and method advice.

Flavour Pairing

Pairing logic, cuisine coherence and balancing dishes.

Nutrition

Calorie and macro math, per-serving calculations and label reasoning.
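
The macro math can be sketched with the general Atwater factors (4 kcal/g for protein and carbohydrate, 9 kcal/g for fat). This is a simplified illustration with hypothetical function names and made-up quantities; it ignores alcohol and fibre adjustments and is not the benchmark's grader.

```python
KCAL_PER_G = {"protein": 4, "carbs": 4, "fat": 9}  # general Atwater factors


def kcal_from_macros(protein_g, carbs_g, fat_g):
    """Estimate energy in kcal from macronutrient weights in grams."""
    return (
        protein_g * KCAL_PER_G["protein"]
        + carbs_g * KCAL_PER_G["carbs"]
        + fat_g * KCAL_PER_G["fat"]
    )


def per_serving(total, servings):
    """Divide a whole-recipe total evenly across servings."""
    return total / servings


# Hypothetical whole-recipe totals in grams.
recipe_kcal = kcal_from_macros(protein_g=40, carbs_g=80, fat_g=20)
serving_kcal = per_serving(recipe_kcal, servings=4)
```

Here the recipe totals 660 kcal (160 + 320 + 180), or 165 kcal per serving across four servings.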

Recipe Generation

Generating complete recipes under constraints: servings, allergens, time, equipment.