CookingBench

Technique

Troubleshooting failures (split sauces, dense bread) and method advice.

Ranking

  1. GPT-5.5
    100.0
  2. Claude Opus 4.8
    100.0
  3. DeepSeek V4 Pro
    99.7
  4. Qwen 3.5 Plus
    99.1
  5. Claude Fable 5
    97.4
  6. Gemini 3.1 Pro Preview
    96.6
  7. Grok 4.3
    96.5
  8. Mistral Large 3
    88.1
  9. Kimi K2.6
    87.6
  10. Llama 4 Maverick
    74.2

Question heatmap (public questions only)

[Heatmap: one row per ranked model (GPT-5.5, Claude Opus 4.8, DeepSeek V4 Pro, Qwen 3.5 Plus, Claude Fable 5, Gemini 3.1 Pro Preview, Grok 4.3, Mistral Large 3, Kimi K2.6, Llama 4 Maverick); one column per public question (001, 002, 003, 004, 005, 007, 008, 009, 010).]

Each cell is one model's score on one question; deeper colour = higher score.