Chess Understanding Benchmark
Multiple-choice and classification questions in two parts: Semantic (400Q) tests chess knowledge; Position Judgment (500Q) tests positional evaluation. Inspired by ChessQA (University of Toronto).
400 questions across four subcategories.
| Model | Easy Random | Keyword | Piece + Stage | Embedding | Overall |
|---|---|---|---|---|---|
| Gemini 3 Flash | 88.0% | 85.0% | 87.0% | 79.0% | 84.8% |
| Claude Opus 4.6 | 87.0% | 85.0% | 83.0% | 79.0% | 83.5% |
| GPT-5.2 | 82.0% | 82.0% | 80.0% | 71.0% | 78.8% |
| Claude Sonnet 4.5 | 77.0% | 75.0% | 74.0% | 66.0% | 73.0% |
| GPT-5 Nano | 39.0% | 39.0% | 42.0% | 36.0% | 39.0% |
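
The Overall column is not defined on this page, but the reported values are consistent with an unweighted mean of the four subcategory scores. This would be the expected result if each subcategory holds the same number of questions, which is an assumption here and not something the page states. The following sketch recomputes that mean from the table values; the `unweighted_mean` helper and the `scores` layout are illustrative names, not part of the benchmark.

```python
# Scores copied from the table above:
# (Easy Random, Keyword, Piece + Stage, Embedding), reported Overall
scores = {
    "Gemini 3 Flash": ((88.0, 85.0, 87.0, 79.0), 84.8),
    "Claude Opus 4.6": ((87.0, 85.0, 83.0, 79.0), 83.5),
    "GPT-5.2": ((82.0, 82.0, 80.0, 71.0), 78.8),
    "Claude Sonnet 4.5": ((77.0, 75.0, 74.0, 66.0), 73.0),
    "GPT-5 Nano": ((39.0, 39.0, 42.0, 36.0), 39.0),
}


def unweighted_mean(values):
    """Plain average; assumes every subcategory carries equal weight."""
    return sum(values) / len(values)


for model, (subcategory_scores, reported) in scores.items():
    mean = unweighted_mean(subcategory_scores)
    # The table shows one decimal place, so allow for rounding when comparing.
    matches = abs(mean - reported) <= 0.05 + 1e-9
    print(f"{model}: mean={mean:.2f} reported={reported:.1f} match={matches}")
```

If the subcategories turn out to differ in size, a question-weighted average would be the appropriate replacement for `unweighted_mean`.
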
Subcategory Breakdown
Scores per subcategory across all models.
Coverage Profile
Subcategory radar chart per model.