Substitutions
Ingredient swaps with correct ratios, including allergen-aware alternatives.
Ranking
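| Rank | Model | Score |
|---|---|---|
| 1 | GPT-5.5 | 100.0 |
| 2 | Claude Opus 4.8 | 100.0 |
| 3 | Claude Fable 5 | 100.0 |
| 4 | DeepSeek V4 Pro | 100.0 |
| 5 | Qwen 3.5 Plus | 96.7 |
| 6 | Grok 4.3 | 96.7 |
| 7 | Mistral Large 3 | 96.7 |
| 8 | Gemini 3.1 Pro Preview | 93.3 |
| 9 | Kimi K2.6 | 87.8 |
| 10 | Llama 4 Maverick | 86.7 |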
Question heatmap (public questions only)
[Heatmap: one row per ranked model, in ranking order; one column per public question: 001, 002, 003, 004, 005, 007, 008, 009, 010, 011, 013, 014, 015.]
Each cell is one question; deeper colour = higher score.