Open Benchmark — Feb 2026

The Chess AI Benchmark

Testing frontier AI models on chess understanding and puzzle solving. Transparent methodology, reproducible results, open data.

Models evaluated: 5
Questions (ChessQA): 900
Benchmarks: 2
Dataset source: Peer-reviewed

Top Models

Ranked by understanding score

| Rank | Model             | Provider  | Understanding | Puzzles     | Overall |
|------|-------------------|-----------|---------------|-------------|---------|
| 1st  | Claude Opus 4.6   | Anthropic | 83.5%         | Coming Soon | Partial |
| 2nd  | GPT-5.2           | OpenAI    | 78.8%         | Coming Soon | Partial |
| 3rd  | Claude Sonnet 4.5 | Anthropic | 73.0%         | Coming Soon | Partial |

Benchmarks

Chess Understanding

Live

Multiple-choice and classification questions testing semantic chess knowledge and position judgment. Inspired by ChessQA from the University of Toronto.

Total questions: 900
Categories: 2
Paper: arXiv:2510.23948 (University of Toronto)
Top model: Claude Opus 4.6 (83.5%)
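Scoring multiple-choice questions of this kind typically reduces to exact-match accuracy over the answer labels. A minimal sketch of such a grader (the `accuracy` helper and the case/whitespace normalization are illustrative assumptions, not the benchmark's actual harness):

```python
def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of exact-match answers after trimming whitespace
    and ignoring case (e.g. option letters 'A'-'D')."""
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)
```

Under this scheme a model answering 2 of 3 questions correctly scores 0.667; a real harness would also decide how to treat refusals or malformed outputs.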

Puzzle Solving

Coming Soon

Given a FEN position, models must output the correct UCI move. Puzzles are sampled from the Lichess puzzle database, rated 500–3000 Elo with stratified difficulty distribution.
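Stratified sampling here means drawing a fixed number of puzzles from each rating band so no difficulty level dominates. A minimal sketch, assuming the puzzle set is available as `(puzzle_id, rating)` pairs (the `stratified_sample` helper and its bin boundaries are illustrative, not the benchmark's published procedure):

```python
import random

def stratified_sample(puzzles, bins, per_bin, seed=0):
    """Draw `per_bin` puzzles from each [lo, hi) rating band.

    puzzles: iterable of (puzzle_id, rating) pairs
    bins:    list of (lo, hi) rating boundaries, e.g. [(500, 1000), ...]
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    sample = []
    for lo, hi in bins:
        pool = [p for p in puzzles if lo <= p[1] < hi]
        sample.extend(rng.sample(pool, min(per_bin, len(pool))))
    return sample
```

Fixing the seed keeps the sampled subset reproducible across runs, which matters when comparing models against the same 1,000 puzzles.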

Puzzles: 1,000
Elo range: 500–3000
Source: lichess.org
Scoring: Glicko-2 rating system
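Grading a single puzzle attempt amounts to checking that the model's output is a well-formed UCI move matching the expected solution. A minimal sketch (the `grade_puzzle` helper is hypothetical, and first-move exact match is an assumption; Lichess solutions are multi-move lines, and a fuller harness might accept alternate winning moves):

```python
import re

# UCI move: from-square, to-square, optional promotion piece (e.g. "e7e8q")
UCI_RE = re.compile(r"^[a-h][1-8][a-h][1-8][qrbn]?$")

def grade_puzzle(model_output: str, solution_moves: list[str]) -> bool:
    """Return True if the model's reply matches the first move of the
    solution line, in UCI notation. Malformed output counts as wrong."""
    move = model_output.strip().lower()
    if not UCI_RE.match(move):
        return False
    return move == solution_moves[0].lower()
```

Rejecting malformed output rather than attempting to repair it keeps the protocol strict: producing legal UCI from a FEN is part of what is being tested.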

Cost efficiency varies widely

Evaluation cost per 1,000 questions ranges from $0.69 to $25.48 across models. See the full leaderboard for per-model cost and latency data.
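To put those endpoints in per-question terms (a quick illustrative calculation, using only the two figures quoted above):

```python
def cost_per_question(cost_per_1000: float) -> float:
    """Convert a per-1,000-questions cost to a per-question cost."""
    return cost_per_1000 / 1000

low, high = 0.69, 25.48        # quoted range, USD per 1,000 questions
spread = high / low            # roughly a 37x gap between cheapest and priciest
```

So the cheapest model costs about $0.0007 per question and the most expensive about $0.025, a spread of roughly 37x.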