CookingBench


Recipe Generation

Generating complete recipes under constraints: servings, allergens, time, equipment.

Ranking

  1. Qwen 3.5 Plus
    97.4
  2. Grok 4.3
    95.6
  3. Gemini 3.1 Pro Preview
    94.2
  4. GPT-5.5
    94.0
  5. Claude Opus 4.8
    93.9
  6. Claude Fable 5
    91.8
  7. Mistral Large 3
    89.4
  8. Llama 4 Maverick
    81.9
  9. DeepSeek V4 Pro
    78.9
  10. Kimi K2.6
    50.0

Question heatmap (public questions only)

Columns (question IDs): 001, 002, 003, 004, 005, 007, 008, 009, 010, 011, 013, 014
Rows (models): Qwen 3.5 Plus, Grok 4.3, Gemini 3.1 Pro Preview, GPT-5.5, Claude Opus 4.8, Claude Fable 5, Mistral Large 3, Llama 4 Maverick, DeepSeek V4 Pro, Kimi K2.6

Each cell shows one model's score on one question; deeper colour = higher score.