Chess Understanding Benchmark

Multiple-choice and classification questions in two parts: Semantic (400 questions) tests chess knowledge, and Position Judgment (500 questions) tests positional evaluation. Inspired by ChessQA (University of Toronto).

arXiv:2510.23948

Semantic: 400 questions across four subcategories.

Model	Easy Random	Keyword	Piece + Stage	Embedding	Overall
Gemini 3 Flash	88.0%	85.0%	87.0%	79.0%	84.8%
Claude Opus 4.6	87.0%	85.0%	83.0%	79.0%	83.5%
GPT-5.2	82.0%	82.0%	80.0%	71.0%	78.8%
Claude Sonnet 4.5	77.0%	75.0%	74.0%	66.0%	73.0%
GPT-5 Nano	39.0%	39.0%	42.0%	36.0%	39.0%
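
The Overall column is consistent with an unweighted mean of the four subcategory scores, which is what you would get if each subcategory held the same number of questions. The page does not state the aggregation rule, so treat that as an assumption. A minimal Python sketch under that assumption reproduces the Overall values from the table; the names `SCORES` and `overall` are illustrative only and not part of any benchmark tooling.

```python
from decimal import Decimal, ROUND_HALF_UP

# Subcategory scores (%) copied from the table above, in column order:
# Easy Random, Keyword, Piece + Stage, Embedding.
SCORES = {
    "Gemini 3 Flash": (88, 85, 87, 79),
    "Claude Opus 4.6": (87, 85, 83, 79),
    "GPT-5.2": (82, 82, 80, 71),
    "Claude Sonnet 4.5": (77, 75, 74, 66),
    "GPT-5 Nano": (39, 39, 42, 36),
}


def overall(scores):
    """Unweighted mean of the subcategory scores, rounded half-up to one decimal.

    Assumption: every subcategory carries equal weight. Decimal is used so
    that values such as 84.75 round to 84.8 without float surprises.
    """
    mean = Decimal(sum(scores)) / Decimal(len(scores))
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    for model, scores in SCORES.items():
        print(f"{model}\t{overall(scores)}%")
```

If the subcategories turned out to have different sizes, a question-weighted mean would be the appropriate replacement for the plain average used here.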

Subcategory Breakdown

Scores per subcategory across all models.

Coverage Profile

Subcategory radar chart per model.