
Chess Understanding Benchmark

The benchmark consists of multiple-choice and classification questions in two parts: Semantic (400 questions) tests chess knowledge, and Position Judgment (500 questions) tests positional evaluation. Inspired by ChessQA (University of Toronto).

arXiv:2510.23948

Semantic set: 400 questions across four subcategories.

Model              Easy Random  Keyword  Piece + Stage  Embedding  Overall
Gemini 3 Flash     88.0%        85.0%    87.0%          79.0%      84.8%
Claude Opus 4.6    87.0%        85.0%    83.0%          79.0%      83.5%
GPT-5.2            82.0%        82.0%    80.0%          71.0%      78.8%
Claude Sonnet 4.5  77.0%        75.0%    74.0%          66.0%      73.0%
GPT-5 Nano         39.0%        39.0%    42.0%          36.0%      39.0%
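
The Overall column appears to be the unweighted mean of the four subcategory scores. The page does not state this, so treat it as an assumption. The minimal Python sketch below checks it against the figures in the table; the names SCORES, REPORTED_OVERALL, and unweighted_mean are illustrative and not part of the benchmark.

```python
# Subcategory scores transcribed from the table, in percent:
# (Easy Random, Keyword, Piece + Stage, Embedding)
SCORES = {
    "Gemini 3 Flash": (88.0, 85.0, 87.0, 79.0),
    "Claude Opus 4.6": (87.0, 85.0, 83.0, 79.0),
    "GPT-5.2": (82.0, 82.0, 80.0, 71.0),
    "Claude Sonnet 4.5": (77.0, 75.0, 74.0, 66.0),
    "GPT-5 Nano": (39.0, 39.0, 42.0, 36.0),
}

# Overall column as reported on the leaderboard.
REPORTED_OVERALL = {
    "Gemini 3 Flash": 84.8,
    "Claude Opus 4.6": 83.5,
    "GPT-5.2": 78.8,
    "Claude Sonnet 4.5": 73.0,
    "GPT-5 Nano": 39.0,
}


def unweighted_mean(values):
    """Plain average; assumes every subcategory carries equal weight."""
    return sum(values) / len(values)


for model, subscores in SCORES.items():
    mean = unweighted_mean(subscores)
    reported = REPORTED_OVERALL[model]
    # Allow half a unit in the last reported digit for rounding.
    matches = abs(mean - reported) <= 0.05 + 1e-9
    print(f"{model:<18} mean={mean:.2f} reported={reported:.1f} match={matches}")
```

If the subcategories had different question counts, a weighted mean would be the appropriate check instead.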

Subcategory Breakdown

Scores per subcategory across all models.

Coverage Profile

Subcategory radar chart per model.