Puzzle Solving Benchmark
How well can AI models solve real chess puzzles? This benchmark tests tactical calculation using positions from the Lichess puzzle database.
Coming Soon
We are currently running all models on 1,000 Lichess puzzles rated 500–3000. Results will be published once all runs are complete and verified.
Accuracy
Percentage of puzzles where the model finds the best move.
Puzzle ELO
Glicko-2 rating estimated from each puzzle's difficulty rating and whether the model solved it, on the same scale as human Lichess puzzle ratings.
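To give a feel for how a rating can be derived from puzzle difficulty and solve rate, here is a minimal sketch. It is not Glicko-2: it swaps in a simpler Elo-style performance rating, found by bisection, that ignores rating deviation and volatility. The function names, the 0 to 4000 search bounds, and the tolerance are assumptions made for illustration, not part of the benchmark.

```python
def expected_score(player_rating, puzzle_rating):
    """Elo expected score of a player against a puzzle of the given rating."""
    return 1.0 / (1.0 + 10.0 ** ((puzzle_rating - player_rating) / 400.0))


def performance_rating(results, lo=0.0, hi=4000.0, tol=0.01):
    """Estimate the rating at which expected solves equal actual solves.

    results: list of (puzzle_rating, solved) pairs.
    Simplified Elo-style estimate (an assumption), not Glicko-2.
    """
    if not results:
        raise ValueError("need at least one puzzle result")
    solved = sum(1 for _, ok in results if ok)
    # With no failures (or no solves) there is no finite solution,
    # so the estimate is clamped to the search bounds.
    if solved == 0:
        return lo
    if solved == len(results):
        return hi
    # Expected solves grow monotonically with rating, so bisection works.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        expected = sum(expected_score(mid, rating) for rating, _ in results)
        if expected < solved:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

A model that solves exactly half of a set of equally rated puzzles lands at that rating under this estimate. A full Glicko-2 computation would also track uncertainty in the result.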
Illegal Move Rate
How often models output moves that are illegal in the given position.
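The illegal move rate, alongside accuracy, can be tallied from a list of per-puzzle attempts. The sketch below is illustrative only: the `Attempt` record and its fields are hypothetical, and it assumes the set of legal moves for each position has already been produced by a chess library and is given as UCI strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One model answer for one puzzle (hypothetical record layout)."""
    model_move: str          # move the model output, e.g. "e2e4"
    best_move: str           # solution move from the puzzle database
    legal_moves: frozenset   # all legal moves in the position, precomputed


def summarize(attempts):
    """Return accuracy and illegal move rate as fractions of all attempts."""
    total = len(attempts)
    if total == 0:
        raise ValueError("need at least one attempt")
    # A move is illegal if it is not among the position's legal moves.
    illegal = sum(1 for a in attempts if a.model_move not in a.legal_moves)
    correct = sum(1 for a in attempts if a.model_move == a.best_move)
    return {
        "accuracy": correct / total,
        "illegal_move_rate": illegal / total,
    }
```

In this sketch an illegal move counts as a failed puzzle for accuracy, since it can never match the solution move.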
Difficulty Breakdown
Accuracy by rating band: 500–1000, 1000–1500, 1500–2000, 2000–2500, 2500–3000.
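A per-band breakdown like this can be sketched as a simple bucketing step. The listed bands share their edges, so the code below assumes each band includes its lower bound and excludes its upper bound, with a rating of exactly 3000 folded into the top band. That convention and the function names are assumptions for illustration.

```python
# Rating bands from the breakdown above, as (low, high) pairs.
BANDS = [(500, 1000), (1000, 1500), (1500, 2000), (2000, 2500), (2500, 3000)]


def band_for(rating):
    """Return the band containing rating, or None if it is out of range."""
    for low, high in BANDS:
        if low <= rating < high:
            return (low, high)
    # Assumed convention: the top edge belongs to the last band.
    if rating == BANDS[-1][1]:
        return BANDS[-1]
    return None


def accuracy_by_band(results):
    """results: list of (puzzle_rating, solved) pairs.

    Returns a dict mapping each band to its accuracy, or None for a band
    with no puzzles. Out-of-range ratings are skipped.
    """
    tallies = {band: [0, 0] for band in BANDS}  # band -> [solved, attempted]
    for rating, solved in results:
        band = band_for(rating)
        if band is None:
            continue
        tallies[band][0] += int(solved)
        tallies[band][1] += 1
    return {
        band: (hits / count if count else None)
        for band, (hits, count) in tallies.items()
    }
```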
Theme Breakdown
Performance across tactical themes: fork, pin, skewer, discovered attack, back rank, etc.
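A theme breakdown follows the same pattern, except that one puzzle can carry several themes. This sketch assumes themes arrive as a single space-separated string per puzzle and counts a puzzle once under each of its themes. The function name and input layout are illustrative assumptions.

```python
from collections import defaultdict


def accuracy_by_theme(results):
    """results: list of (themes, solved) pairs, where themes is a
    space-separated string such as "fork middlegame".

    Returns a dict mapping each theme to its accuracy.
    """
    tallies = defaultdict(lambda: [0, 0])  # theme -> [solved, attempted]
    for themes, solved in results:
        # A puzzle contributes to every theme it is tagged with.
        for theme in themes.split():
            tallies[theme][0] += int(solved)
            tallies[theme][1] += 1
    return {theme: hits / count for theme, (hits, count) in tallies.items()}
```

Because puzzles are counted under every theme they carry, the per-theme puzzle counts can add up to more than the number of puzzles.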
Model Preview
The following models will be evaluated. Results pending.
| Model | Accuracy | Puzzle ELO | Illegal Move Rate | Cost / 1K |
|---|---|---|---|---|
| Claude Opus 4.6 | — | — | — | — |
| GPT-5.2 | — | — | — | — |
| Claude Sonnet 4.5 | — | — | — | — |
| Gemini 3 Flash | — | — | — | — |
| GPT-5 Nano | — | — | — | — |