Quantities & Scaling
Scaling recipes up and down, yields, pan-size math and baker’s percentages.
Ranking
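| Rank | Model | Score |
|---|---|---|
| 1 | GPT-5.5 | 100.0 |
| 2 | Claude Opus 4.8 | 100.0 |
| 3 | Qwen 3.5 Plus | 100.0 |
| 4 | Grok 4.3 | 100.0 |
| 5 | Claude Fable 5 | 100.0 |
| 6 | Gemini 3.1 Pro Preview | 100.0 |
| 7 | DeepSeek V4 Pro | 100.0 |
| 8 | Mistral Large 3 | 100.0 |
| 9 | Kimi K2.6 | 100.0 |
| 10 | Llama 4 Maverick | 94.1 |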
Question heatmap (public questions only)
[Heatmap: one row per model in the ranking, one column per public question (001, 002, 003, 004, 005, 007, 008, 009, 010, 011, 013, 014, 015, 017). Each cell is one question; deeper colour = higher score.]