Recipe Generation
Generating complete recipes under constraints: servings, allergens, time, equipment.
Ranking
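| Rank | Model | Score |
|---|---|---|
| 1 | Qwen 3.5 Plus | 97.4 |
| 2 | Grok 4.3 | 95.6 |
| 3 | Gemini 3.1 Pro Preview | 94.2 |
| 4 | GPT-5.5 | 94.0 |
| 5 | Claude Opus 4.8 | 93.9 |
| 6 | Claude Fable 5 | 91.8 |
| 7 | Mistral Large 3 | 89.4 |
| 8 | Llama 4 Maverick | 81.9 |
| 9 | DeepSeek V4 Pro | 78.9 |
| 10 | Kimi K2.6 | 50.0 |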
Question heatmap (public questions only)
[Heatmap: one row per model, in ranking order from Qwen 3.5 Plus to Kimi K2.6; one column per public question: 001, 002, 003, 004, 005, 007, 008, 009, 010, 011, 013, 014. Each cell is one question; deeper colour = higher score.]