The Chess AI Benchmark
Testing frontier AI models on chess understanding, puzzle solving, and gameplay. Aggregated data, transparent methodology.
Top Models
Ranked by understanding score
| Rank | Model | Provider | Understanding | Puzzles | Overall |
|---|---|---|---|---|---|
| 1st | Claude Opus 4.6 | Anthropic | 83.5% | Coming Soon | Partial |
| 2nd | GPT-5.2 | OpenAI | 78.8% | Coming Soon | Partial |
| 3rd | Claude Sonnet 4.5 | Anthropic | 73.0% | Coming Soon | Partial |
Benchmarks
Chess Understanding
Multiple-choice and classification questions testing semantic chess knowledge and position judgment. Inspired by the ChessQA benchmark from the University of Toronto.
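As a minimal sketch of how a multiple-choice item might be scored, the snippet below checks the first option letter in a model's answer against the gold letter. The item schema and field names are illustrative assumptions, not the benchmark's actual format.

```python
# Illustrative multiple-choice scoring. The item below is an assumption about
# what an understanding question might look like, not a real dataset entry.
item = {
    "fen": "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq - 0 1",
    "question": "Whose turn is it to move?",
    "options": {"A": "White", "B": "Black"},
    "answer": "B",  # the FEN's side-to-move field is "b"
}

def score_choice(model_output: str, options: dict, gold: str) -> bool:
    """Exact match on the first option letter that appears in the output."""
    for ch in model_output.upper():
        if ch in options:
            return ch == gold
    return False

print(score_choice("B. Black, since White just played 1. e4.",
                   item["options"], item["answer"]))  # True
```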
Puzzle Solving
Given a FEN position, models must output the correct UCI move. Puzzles are sampled from the Lichess puzzle database, rated 500–3000 Elo, with a stratified difficulty distribution.
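A minimal sketch of how a single puzzle answer might be checked, using the python-chess library. The position and solution below are illustrative placeholders, not entries from the evaluation set.

```python
import chess

# Puzzle scoring sketch: parse the model's UCI string, require legality, then
# exact-match against the expected solution move.
fen = "6k1/5ppp/8/8/8/8/5PPP/3R2K1 w - - 0 1"
solution_uci = "d1d8"  # back-rank mate

def check_puzzle(fen: str, solution_uci: str, model_uci: str) -> bool:
    board = chess.Board(fen)
    model_uci = model_uci.strip()
    try:
        move = chess.Move.from_uci(model_uci)
    except ValueError:
        return False  # output was not parseable UCI
    # A legal move that differs from the solution still scores zero.
    return move in board.legal_moves and model_uci == solution_uci

print(check_puzzle(fen, solution_uci, "d1d8"))  # True
print(check_puzzle(fen, solution_uci, "d1e1"))  # False: legal but wrong
```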
Gameplay
How well do LLMs actually play chess? Elo ratings from two independent sources covering 500+ models in real gameplay against engines and against each other.
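For reference, the standard Elo expected-score and update formulas are sketched below. The rating sources may use different K-factors or a Bayesian variant, so treat this as the textbook version rather than their exact method.

```python
# Standard Elo update. K = 32 is a common default; the leaderboard sources
# may use other constants.

def elo_expected(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """New ratings after one game. score_a: 1 win, 0.5 draw, 0 loss."""
    delta = k * (score_a - elo_expected(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# A 1500-rated model beating a 1600-rated one gains about 20 points.
print(elo_update(1500, 1600, 1.0))  # (~1520.5, ~1579.5)
```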
Explainability
LLM-as-judge scoring of how well models explain chess positions, evaluated on five dimensions: relevance, completeness, clarity, correctness, and actionability.
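One plausible way to aggregate the judge's output, assuming a 1–5 score per dimension and equal weighting (both assumptions; the actual rubric scale and weights are not specified here):

```python
# Sketch of aggregating judge scores across the five dimensions. The 1-5
# scale and equal weighting are assumptions, not the documented rubric.

DIMENSIONS = ["relevance", "completeness", "clarity", "correctness", "actionability"]

def aggregate(judge_scores: dict[str, int]) -> float:
    """Mean of per-dimension scores, normalized to 0-100."""
    missing = set(DIMENSIONS) - judge_scores.keys()
    if missing:
        raise ValueError(f"judge omitted dimensions: {missing}")
    raw = sum(judge_scores[d] for d in DIMENSIONS) / len(DIMENSIONS)  # 1-5
    return (raw - 1) / 4 * 100  # map 1 -> 0, 5 -> 100

print(aggregate({"relevance": 5, "completeness": 4, "clarity": 5,
                 "correctness": 3, "actionability": 4}))  # 80.0
```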
Cost efficiency varies widely
Evaluation cost per 1,000 questions ranges from $0.69 to $25.48 across models. See the full leaderboard for per-model cost and latency data.
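Assuming the cost figure is derived from average token usage and per-million-token pricing (the usual way such numbers are computed), here is a back-of-the-envelope sketch; the prices and token counts are placeholders, not real leaderboard data.

```python
# Hypothetical cost-per-1,000-questions calculation from token usage and
# per-million-token prices. All inputs below are illustrative placeholders.

def cost_per_1k_questions(
    input_tokens_per_q: float,
    output_tokens_per_q: float,
    input_price_per_m: float,   # USD per 1M input tokens
    output_price_per_m: float,  # USD per 1M output tokens
) -> float:
    per_question = (
        input_tokens_per_q * input_price_per_m / 1_000_000
        + output_tokens_per_q * output_price_per_m / 1_000_000
    )
    return per_question * 1_000

# e.g. 600 input + 150 output tokens per question at $3 / $15 per 1M tokens
print(f"${cost_per_1k_questions(600, 150, 3.0, 15.0):.2f} per 1,000 questions")  # $4.05
```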