Independent Benchmark — Feb 2026

The Chess AI Benchmark

Testing frontier AI models on chess understanding, puzzle solving, and gameplay. Aggregated data, transparent methodology.

Models evaluated: 5
Questions (ChessQA): 900
Benchmark areas: 4
Gameplay models tracked: 500+

Top Models

Ranked by understanding score

Rank  Model              Provider   Understanding  Puzzles      Overall
1st   Claude Opus 4.6    Anthropic  83.5%          Coming Soon  Partial
2nd   GPT-5.2            OpenAI     78.8%          Coming Soon  Partial
3rd   Claude Sonnet 4.5  Anthropic  73.0%          Coming Soon  Partial

Benchmarks

Chess Understanding

Live

Multiple-choice and classification questions testing semantic chess knowledge and position judgment. Inspired by ChessQA from the University of Toronto.

Total questions: 900
Categories: 2
Paper: arXiv:2510.23948 (University of Toronto)
Top model: Claude Opus 4.6 (83.5%)
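
For illustration, a minimal grading sketch for the multiple-choice portion might look like the snippet below. The answer-letter format and the extraction regex are assumptions, not ChessQA's actual harness.

```python
import re

def grade_multiple_choice(model_output: str, correct_choice: str) -> bool:
    """Naive grader: take the first standalone answer letter (A-D) in the
    model's reply and compare it to the ground-truth choice. The expected
    answer format and this regex are assumptions, not ChessQA's real grader."""
    match = re.search(r"\b([A-D])\b", model_output.upper())
    return bool(match) and match.group(1) == correct_choice.strip().upper()

# Example usage:
print(grade_multiple_choice("The bishop is pinned, so the answer is B.", "b"))  # True
```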

Puzzle Solving

Coming Soon

Given a FEN position, models must output the correct UCI move. Puzzles are sampled from the Lichess puzzle database, rated 500–3000 Elo, with a stratified difficulty distribution.

Puzzles: 1,000
Elo range: 500–3000
Source: lichess.org
Scoring: Glicko-2 rating system
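
As a rough sketch of how a puzzle answer could be checked (the benchmark's actual harness may differ), the snippet below uses the python-chess library to parse a model's UCI reply, reject unparseable or illegal moves, and compare it to the expected solution move. The single-move scoring rule and the function's framing are assumptions.

```python
import chess

def score_puzzle_answer(fen: str, solution_uci: str, model_output: str) -> bool:
    """Check whether the model's UCI move matches the puzzle solution.

    `fen` is the position the model sees; `solution_uci` is the expected
    solution move in UCI notation; `model_output` is the raw text the model
    returned. This framing is an assumption about the harness, not the
    benchmark's actual code.
    """
    board = chess.Board(fen)
    stripped = model_output.strip()
    candidate = stripped.split()[0] if stripped else ""
    try:
        move = chess.Move.from_uci(candidate)
    except ValueError:
        return False  # not parseable as UCI -> scored as incorrect
    if move not in board.legal_moves:
        return False  # illegal in this position -> scored as incorrect
    return candidate == solution_uci

# Example: mate-in-one position, solution Qxf7# (UCI: f3f7)
print(score_puzzle_answer(
    fen="r1bqkbnr/pppp1ppp/2n5/4p3/2B1P3/5Q2/PPPP1PPP/RNB1K1NR w KQkq - 0 4",
    solution_uci="f3f7",
    model_output="f3f7",
))  # True
```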

Gameplay

Aggregated

How well do LLMs actually play chess? Elo ratings from two independent sources covering 500+ models in real gameplay against engines and each other.

dubesor.de models: 407
LLM Chess models: 157
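
The two sources may compute their ratings differently (K-factors, opponent pools, and update rules are not specified here). For reference only, the textbook Elo update that underlies such gameplay ratings is sketched below with illustrative numbers.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> float:
    """Return A's new rating after one game (score_a: 1 win, 0.5 draw, 0 loss)."""
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))

# Example: a 1500-rated model beats a 1600-rated opponent.
print(round(update_elo(1500, 1600, 1.0)))  # 1520
```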

Explainability

Live

LLM-as-judge scoring on how well models explain chess positions. Evaluated on 5 dimensions: relevance, completeness, clarity, correctness, and actionability.

Models tested: 5
Score scale: 1–3
Judge: Claude Opus 4.6
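
As a sketch of how the five 1–3 dimension scores could be combined into a single explainability score, assuming a plain average (the benchmark's actual aggregation may differ):

```python
# Dimensions taken from the description above; the aggregation rule (a simple
# mean over the 1-3 scale) is an assumption, not the benchmark's stated method.
DIMENSIONS = ["relevance", "completeness", "clarity", "correctness", "actionability"]

def aggregate_scores(scores: dict[str, int]) -> float:
    """Average the per-dimension judge scores (each 1-3) into one overall score."""
    for dim in DIMENSIONS:
        if not 1 <= scores.get(dim, 0) <= 3:
            raise ValueError(f"missing or out-of-range score for {dim!r}")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

# Example judge output for one explanation:
print(aggregate_scores({
    "relevance": 3, "completeness": 2, "clarity": 3,
    "correctness": 3, "actionability": 2,
}))  # 2.6
```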

Cost efficiency varies widely

Evaluation cost per 1,000 questions ranges from $0.69 to $25.48 across models. See the full leaderboard for per-model cost and latency data.
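
For context on how a per-1,000-question figure can be derived, a hypothetical calculation from average token usage and per-million-token prices is shown below. The token counts and prices are placeholders, not measured costs from this benchmark.

```python
def cost_per_1k_questions(prompt_tokens: int, completion_tokens: int,
                          input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in USD to run 1,000 questions at the given average per-question
    token usage and per-million-token prices (all values are illustrative)."""
    per_question = (prompt_tokens * input_price_per_m +
                    completion_tokens * output_price_per_m) / 1_000_000
    return per_question * 1_000

# Example with made-up numbers: 600 prompt tokens and 250 completion tokens
# per question, at $3 / $15 per million input/output tokens.
print(round(cost_per_1k_questions(600, 250, 3.0, 15.0), 2))  # 5.55
```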