Technique
Troubleshooting failures (split sauces, dense bread) and method advice.
Ranking
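| Rank | Model | Score |
|---|---|---|
| 1 | GPT-5.5 | 100.0 |
| 2 | Claude Opus 4.8 | 100.0 |
| 3 | DeepSeek V4 Pro | 99.7 |
| 4 | Qwen 3.5 Plus | 99.1 |
| 5 | Claude Fable 5 | 97.4 |
| 6 | Gemini 3.1 Pro Preview | 96.6 |
| 7 | Grok 4.3 | 96.5 |
| 8 | Mistral Large 3 | 88.1 |
| 9 | Kimi K2.6 | 87.6 |
| 10 | Llama 4 Maverick | 74.2 |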
Question heatmap (public questions only)
[Heatmap: one row per model, in ranking order from GPT-5.5 to Llama 4 Maverick; one column per public question: 001, 002, 003, 004, 005, 007, 008, 009, 010.]
Each cell is one question; deeper colour = higher score.