Open Benchmark — Feb 2026
The Chess AI Benchmark
Testing frontier AI models on chess understanding and puzzle solving. Transparent methodology, reproducible results, open data.
- 5 models evaluated
- 900 questions (ChessQA)
- 2 benchmarks
- Peer-reviewed dataset source
Top Models
Ranked by understanding score
| Rank | Model | Provider | Understanding | Puzzles | Overall |
|---|---|---|---|---|---|
| 1st | Claude Opus 4.6 | Anthropic | 83.5% | Coming Soon | Partial |
| 2nd | GPT-5.2 | OpenAI | 78.8% | Coming Soon | Partial |
| 3rd | Claude Sonnet 4.5 | Anthropic | 73.0% | Coming Soon | Partial |
Benchmarks
Chess Understanding
Multiple-choice and classification questions testing semantic chess knowledge and position judgment. Inspired by ChessQA from the University of Toronto.
- 900 total questions
- 2 categories
Paper: arXiv:2510.23948 (University of Toronto)
Top model: Claude Opus 4.6 (83.5%)
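To make the understanding benchmark concrete, here is a minimal grading sketch, assuming choices are labeled A–D and the model is instructed to answer with a single letter. The function name, regex, and sample batch are hypothetical illustrations, not the benchmark's actual grader:

```python
import re

def grade_mcq(model_output: str, correct_choice: str) -> bool:
    """Grade one multiple-choice answer by exact letter match.

    Hypothetical grader: takes the first standalone A-D letter
    in the model's output as its answer.
    """
    match = re.search(r"\b([A-D])\b", model_output.strip().upper())
    return bool(match) and match.group(1) == correct_choice.upper()

# Accuracy over a small hypothetical batch of (output, gold) pairs.
answers = [("The answer is B", "B"), ("C", "A"), ("B.", "B")]
accuracy = sum(grade_mcq(out, gold) for out, gold in answers) / len(answers)
print(f"accuracy: {accuracy:.1%}")  # accuracy: 66.7%
```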
Puzzle Solving
Given a FEN position, models must output the correct move in UCI notation. Puzzles are sampled from the Lichess puzzle database, rated 500–3000 Elo with a stratified difficulty distribution.
- 1,000 puzzles
- 500–3000 Elo range
Source: lichess.org
Scoring: Glicko-2 rating system
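The puzzle protocol is easy to sketch with the python-chess library: parse the FEN, parse the model's UCI string, reject illegal output, and compare against the stored solution. This assumes one solution move per puzzle; the actual harness's parsing and multi-move handling may differ:

```python
import chess

def grade_puzzle(fen: str, model_output: str, solution_uci: str) -> bool:
    """Grade one puzzle: parse the model's move as UCI and compare it
    to the stored solution. One plausible grading rule, not necessarily
    the benchmark's exact one.
    """
    board = chess.Board(fen)
    try:
        move = chess.Move.from_uci(model_output.strip())
    except ValueError:  # not syntactically valid UCI
        return False
    if move not in board.legal_moves:  # illegal in this position
        return False
    return move.uci() == solution_uci

# Italian Game position; the 1,000 benchmark puzzles come from the Lichess dump.
fen = "r1bqkbnr/pppp1ppp/2n5/4p3/2B1P3/5N2/PPPP1PPP/RNBQK2R b KQkq - 3 3"
print(grade_puzzle(fen, "g8f6", "g8f6"))  # True
```

Since Glicko-2 is often conflated with Elo, the sketch below shows a simplified single-game Glicko-2 update: each puzzle is treated as an opponent rated at its Lichess rating, volatility is held constant (the full algorithm re-estimates it per rating period), and the puzzle RD of 75 is an illustrative assumption:

```python
import math

SCALE = 173.7178  # Glicko-2 scale factor

def glicko2_update(rating, rd, opp_rating, opp_rd, score, sigma=0.06):
    """Simplified Glicko-2 update for one model-vs-puzzle 'game'.

    Sketch only: sigma (volatility) is fixed, and we update after
    every puzzle instead of batching a rating period.
    score is 1.0 (solved) or 0.0 (failed).
    """
    # Convert to the internal Glicko-2 scale.
    mu, phi = (rating - 1500) / SCALE, rd / SCALE
    mu_j, phi_j = (opp_rating - 1500) / SCALE, opp_rd / SCALE

    g = 1 / math.sqrt(1 + 3 * phi_j**2 / math.pi**2)   # RD-discount factor
    e = 1 / (1 + math.exp(-g * (mu - mu_j)))           # expected score
    v = 1 / (g**2 * e * (1 - e))                       # estimated variance

    phi_star = math.sqrt(phi**2 + sigma**2)            # add volatility
    phi_new = 1 / math.sqrt(1 / phi_star**2 + 1 / v)   # shrink uncertainty
    mu_new = mu + phi_new**2 * g * (score - e)         # move toward result

    return 1500 + SCALE * mu_new, SCALE * phi_new

# A 1500-rated model (RD 350) solves an 1800-rated puzzle (assumed RD 75).
print(glicko2_update(1500, 350, 1800, 75, 1.0))  # rating rises, RD shrinks
```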
Cost efficiency varies widely
Evaluation cost per 1,000 questions ranges from $0.69 to $25.48 across models, a roughly 37× spread. See the full leaderboard for per-model cost and latency data.
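For readers estimating their own runs, per-1,000-question cost is just token volume times provider pricing. The token counts and prices in this sketch are entirely hypothetical:

```python
def cost_per_1k_questions(in_tokens_per_q: int, out_tokens_per_q: int,
                          usd_per_m_in: float, usd_per_m_out: float) -> float:
    """Estimated USD cost of evaluating 1,000 questions.

    All four inputs are caller-supplied assumptions; real costs depend
    on each provider's current pricing and the benchmark's prompt length.
    """
    per_question = (in_tokens_per_q * usd_per_m_in
                    + out_tokens_per_q * usd_per_m_out) / 1_000_000
    return 1000 * per_question

# Hypothetical: 1,200 input + 150 output tokens/question at $3/$15 per M tokens.
print(f"${cost_per_1k_questions(1200, 150, 3.0, 15.0):.2f}")  # $5.85
```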