Independent Benchmark — Feb 2026

The Chess AI Benchmark

Testing frontier AI models on chess understanding, puzzle solving, and gameplay. Aggregated data, transparent methodology.

Models evaluated: 5
Questions (ChessQA): 900
Benchmark areas: 4
Gameplay models tracked: 500+

Top Models

Ranked by understanding score

Rank  Model              Provider   Understanding  Puzzles      Overall
1st   Claude Opus 4.6    Anthropic  83.5%          Coming Soon  Partial
2nd   GPT-5.2            OpenAI     78.8%          Coming Soon  Partial
3rd   Claude Sonnet 4.5  Anthropic  73.0%          Coming Soon  Partial

Benchmarks

Chess Understanding

Live

Multiple-choice and classification questions testing semantic chess knowledge and position judgment. Inspired by ChessQA from the University of Toronto.

Total questions: 900
Categories: 2
Paper: arXiv:2510.23948 (University of Toronto)
Top model: Claude Opus 4.6 (83.5%)
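
For illustration, a minimal grading sketch for the multiple-choice portion might look like the snippet below. The answer-letter format and the extraction regex are assumptions, not ChessQA's actual harness.

```python
import re

def grade_multiple_choice(model_output: str, correct_choice: str) -> bool:
    """Naive grader: take the first standalone answer letter (A-D) in the
    model's reply and compare it to the ground-truth choice. The expected
    answer format and this regex are assumptions, not ChessQA's real grader."""
    match = re.search(r"\b([A-D])\b", model_output.upper())
    return bool(match) and match.group(1) == correct_choice.strip().upper()

# Example usage:
print(grade_multiple_choice("The bishop is pinned, so the answer is B.", "b"))  # True
```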

Puzzle Solving

Coming Soon

Given a FEN position, models must output the correct UCI move. Puzzles are sampled from the Lichess puzzle database, rated 500–3000 Elo, with a stratified difficulty distribution.

Puzzles: 1,000
Elo range: 500–3000
Source: lichess.org
Scoring: Glicko-2 rating system
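
As a rough sketch of how a puzzle answer could be checked (the benchmark's actual harness may differ), the snippet below uses the python-chess library to parse a model's UCI reply, reject unparseable or illegal moves, and compare it to the expected solution move. The single-move scoring rule and the function's framing are assumptions.

```python
import chess

def score_puzzle_answer(fen: str, solution_uci: str, model_output: str) -> bool:
    """Check whether the model's UCI move matches the puzzle solution.

    `fen` is the position the model sees; `solution_uci` is the expected
    solution move in UCI notation; `model_output` is the raw text the model
    returned. This framing is an assumption about the harness, not the
    benchmark's actual code.
    """
    board = chess.Board(fen)
    stripped = model_output.strip()
    candidate = stripped.split()[0] if stripped else ""
    try:
        move = chess.Move.from_uci(candidate)
    except ValueError:
        return False  # not parseable as UCI -> scored as incorrect
    if move not in board.legal_moves:
        return False  # illegal in this position -> scored as incorrect
    return candidate == solution_uci

# Example: mate-in-one position, solution Qxf7# (UCI: f3f7)
print(score_puzzle_answer(
    fen="r1bqkbnr/pppp1ppp/2n5/4p3/2B1P3/5Q2/PPPP1PPP/RNB1K1NR w KQkq - 0 4",
    solution_uci="f3f7",
    model_output="f3f7",
))  # True
```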

Gameplay

Aggregated

How well do LLMs actually play chess? Elo ratings from two independent sources covering 500+ models in real gameplay against engines and each other.

dubesor.de models: 407
LLM Chess models: 157
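
The two sources may compute their ratings differently (K-factors, opponent pools, and update rules are not specified here). For reference only, the textbook Elo update that underlies such gameplay ratings is sketched below with illustrative numbers.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> float:
    """Return A's new rating after one game (score_a: 1 win, 0.5 draw, 0 loss)."""
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))

# Example: a 1500-rated model beats a 1600-rated opponent.
print(round(update_elo(1500, 1600, 1.0)))  # 1520
```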

Explainability

Live

LLM-as-judge scoring on how well models explain chess positions. Evaluated on 5 dimensions: relevance, completeness, clarity, correctness, and actionability.

Models tested: 5
Score scale: 1–3
Judge: Claude Opus 4.6
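
As a sketch of how the five 1–3 dimension scores could be combined into a single explainability score, assuming a plain average (the benchmark's actual aggregation may differ):

```python
# Dimensions taken from the description above; the aggregation rule (a simple
# mean over the 1-3 scale) is an assumption, not the benchmark's stated method.
DIMENSIONS = ["relevance", "completeness", "clarity", "correctness", "actionability"]

def aggregate_scores(scores: dict[str, int]) -> float:
    """Average the per-dimension judge scores (each 1-3) into one overall score."""
    for dim in DIMENSIONS:
        if not 1 <= scores.get(dim, 0) <= 3:
            raise ValueError(f"missing or out-of-range score for {dim!r}")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

# Example judge output for one explanation:
print(aggregate_scores({
    "relevance": 3, "completeness": 2, "clarity": 3,
    "correctness": 3, "actionability": 2,
}))  # 2.6
```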

Cost efficiency varies widely

Evaluation cost per 1,000 questions ranges from $0.69 to $25.48 across models. See the full leaderboard for per-model cost and latency data.
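
For context on how a per-1,000-question figure can be derived, a hypothetical calculation from average token usage and per-million-token prices is shown below. The token counts and prices are placeholders, not measured costs from this benchmark.

```python
def cost_per_1k_questions(prompt_tokens: int, completion_tokens: int,
                          input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in USD to run 1,000 questions at the given average per-question
    token usage and per-million-token prices (all values are illustrative)."""
    per_question = (prompt_tokens * input_price_per_m +
                    completion_tokens * output_price_per_m) / 1_000_000
    return per_question * 1_000

# Example with made-up numbers: 600 prompt tokens and 250 completion tokens
# per question, at $3 / $15 per million input/output tokens.
print(round(cost_per_1k_questions(600, 250, 3.0, 15.0), 2))  # 5.55
```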