Puzzle Solving Benchmark

How well can AI models solve real chess puzzles? This benchmark tests tactical calculation using positions from the Lichess puzzle database.

Results being collected now

Coming Soon

We are currently running all models on 1,000 Lichess puzzles rated 500–3000. Results will be published once all runs are complete and verified.

Source: Lichess puzzle database | 1,000 puzzles, rated 500–3000

Accuracy

Percentage of puzzles where the model finds the correct best move.
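As a rough illustration of this metric, the sketch below scores a list of model answers against the expected moves. It assumes moves are compared as UCI strings (such as `e2e4`) and that a missing answer counts as a miss; the function name and the input shape are hypothetical, and the actual scoring rules are the ones described in Methodology.

```python
def accuracy(results):
    """Return the percentage of puzzles answered with the expected move.

    `results` is a list of (model_move, expected_move) pairs, where each
    move is a UCI string and model_move may be None (no usable answer).
    This input shape is an assumption made for illustration.
    """
    if not results:
        return 0.0
    correct = 0
    for model_move, expected_move in results:
        if model_move is None:
            continue  # no answer counts as a miss
        # Normalise whitespace and case before comparing.
        if model_move.strip().lower() == expected_move.strip().lower():
            correct += 1
    return 100.0 * correct / len(results)


sample = [
    ("e2e4", "e2e4"),     # correct
    ("g1f3", "d2d4"),     # wrong move
    (None, "a7a8q"),      # no answer
    ("A7A8Q ", "a7a8q"),  # correct after normalisation
]
print(accuracy(sample))
```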

Puzzle ELO

Glicko-2 rating computed from each puzzle's rating and whether the model solved it, so the result is comparable to human Lichess ratings.
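To give a feel for how a rating can be derived from puzzle ratings and solve outcomes, here is a simplified sketch. It is not Glicko-2: it is a plain Elo-style performance rating, found by bisection as the rating at which the expected number of solves equals the observed number. It ignores rating deviation and volatility, which Glicko-2 tracks, and the function names and search bounds are assumptions.

```python
def expected_score(player_rating, puzzle_rating):
    """Elo logistic expectation that the player solves the puzzle."""
    return 1.0 / (1.0 + 10 ** ((puzzle_rating - player_rating) / 400.0))


def performance_rating(outcomes, lo=0.0, hi=4000.0, tol=0.01):
    """Estimate a rating from (puzzle_rating, solved) pairs.

    Bisects for the rating whose summed expected score matches the
    number of solved puzzles. If every puzzle is solved (or none is),
    no finite rating matches and the result sits at the search bound.
    """
    solved = sum(1 for _, was_solved in outcomes if was_solved)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        expected = sum(expected_score(mid, rating) for rating, _ in outcomes)
        # The summed expectation increases with rating, so bisection works.
        if expected < solved:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# One solve and one miss on puzzles placed symmetrically around 1500.
estimate = performance_rating([(1400, True), (1600, False)])
print(round(estimate))
```

Solving one of two puzzles placed 100 points on either side of 1500 yields an estimate of about 1500, which matches the intuition that a 50% solve rate puts the solver at the midpoint of the puzzle ratings.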

Illegal Move Rate

Percentage of model outputs that are illegal moves in the given position.
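A minimal sketch of how this rate could be tallied is shown below. It assumes answers are expected in UCI notation and that anything not in the position's set of legal moves, including unparseable text, counts as illegal; both are assumptions, not the benchmark's documented rules. To stay self-contained, the example hard-codes the 20 legal moves of the standard starting position rather than using a move generator.

```python
import re

# UCI moves look like "e2e4" or "a7a8q" (with a promotion piece).
UCI_PATTERN = re.compile(r"^[a-h][1-8][a-h][1-8][qrbn]?$")

# Legal moves for White in the standard starting position:
# 16 pawn moves (one or two squares) plus 4 knight moves.
START_LEGAL = {f"{f}2{f}{r}" for f in "abcdefgh" for r in "34"} | {
    "b1a3", "b1c3", "g1f3", "g1h3",
}


def is_illegal(move, legal_moves):
    """True if the answer is missing, malformed, or not a legal move."""
    if move is None or not UCI_PATTERN.match(move):
        return True
    return move not in legal_moves


def illegal_move_rate(responses):
    """Percentage of (move, legal_moves) responses that are illegal."""
    if not responses:
        return 0.0
    illegal = sum(1 for move, legal in responses if is_illegal(move, legal))
    return 100.0 * illegal / len(responses)


responses = [
    ("e2e4", START_LEGAL),  # legal
    ("e2e5", START_LEGAL),  # well-formed but not legal here
    ("Nf3", START_LEGAL),   # SAN rather than UCI, counted as illegal
    ("g1f3", START_LEGAL),  # legal
]
print(illegal_move_rate(responses))
```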

Difficulty Breakdown

Accuracy by rating band: 500–1000, 1000–1500, 1500–2000, 2000–2500, 2500–3000.
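The bands as listed share their boundary values, so a bucketing rule is needed for puzzles rated exactly 1000, 1500, and so on. The sketch below assumes each band includes its lower bound and excludes its upper bound, except the last band, which also includes 3000. That rule and the function names are assumptions made for illustration.

```python
BANDS = [(500, 1000), (1000, 1500), (1500, 2000), (2000, 2500), (2500, 3000)]


def band_for(rating):
    """Return the band label for a rating, or None if out of range."""
    for index, (low, high) in enumerate(BANDS):
        is_last = index == len(BANDS) - 1
        # Lower bound inclusive; upper bound inclusive only for the last band.
        if low <= rating < high or (is_last and rating == high):
            return f"{low}-{high}"
    return None


def accuracy_by_band(results):
    """Map band label -> accuracy (%) from (puzzle_rating, solved) pairs."""
    totals = {}
    for rating, solved in results:
        band = band_for(rating)
        if band is None:
            continue  # rating outside 500-3000
        attempted, correct = totals.get(band, (0, 0))
        totals[band] = (attempted + 1, correct + int(solved))
    return {band: 100.0 * c / n for band, (n, c) in totals.items()}


sample = [(500, True), (999, False), (1000, True), (2750, True), (3000, False)]
print(accuracy_by_band(sample))
```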

Theme Breakdown

Performance across tactical themes: fork, pin, skewer, discovered attack, back rank, etc.
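A per-theme breakdown can be sketched as follows, assuming each puzzle carries a space-separated list of theme tags, as in the Lichess puzzle export, and that a puzzle counts toward every theme it is tagged with. The function name and input shape are hypothetical.

```python
from collections import defaultdict


def accuracy_by_theme(results):
    """Map theme -> accuracy (%) from (themes_string, solved) pairs.

    A puzzle tagged with several themes contributes to each of them,
    so the per-theme puzzle counts can sum to more than the total.
    """
    counts = defaultdict(lambda: [0, 0])  # theme -> [attempted, correct]
    for themes, solved in results:
        for theme in set(themes.split()):
            counts[theme][0] += 1
            counts[theme][1] += int(solved)
    return {theme: 100.0 * c / n for theme, (n, c) in sorted(counts.items())}


sample = [
    ("fork middlegame", True),
    ("fork pin", False),
    ("pin", True),
]
print(accuracy_by_theme(sample))
```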

The following models will be evaluated. Results pending.

Model	Accuracy	Puzzle ELO	Illegal Move Rate	Cost / 1K
Claude Opus 4.6	—	—	—	—
GPT-5.2	—	—	—	—
Claude Sonnet 4.5	—	—	—	—
Gemini 3 Flash	—	—	—	—
GPT-5 Nano	—	—	—	—

For details on how puzzles are selected, scored, and rated, see Methodology.