Explainability Benchmark
How well does each model explain a chess position? Each model's explanations are scored by Claude Opus 4.6 as judge across five dimensions, on expert-annotated positions from real GM games.
Scale
1–5 per dimension
1 = poor · 3 = adequate · 5 = excellent
Judge
Claude Opus 4.6
Blind evaluation, no access to model identity
Configuration
With Chessvia context
Tactics, eval, plans passed as structured context
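As a concrete illustration of the "With Chessvia context" condition, the structured context might look something like the following. This is a hypothetical sketch: the field names and values are illustrative, not the actual Chessvia schema.

```python
# Hypothetical shape of the structured context supplied with each position.
# Field names are illustrative assumptions, not the real Chessvia schema.
position_context = {
    # Italian Game after 1.e4 e5 2.Nf3 Nc6 3.Bc4, Black to move
    "fen": "r1bqkbnr/pppp1ppp/2n5/4p3/2B1P3/5N2/PPPP1PPP/RNBQK2R b KQkq - 3 3",
    "engine_eval": {"cp": 35, "depth": 22},  # Stockfish score in centipawns
    "tactics": ["pressure on f7", "possible pin motifs on the a2-g8 diagonal"],
    "plans": ["castle short", "challenge the c4 bishop with ...Na5 or ...b5"],
}
```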
Rubric Dimensions
- relevance: Is the analysis about this specific position?
- completeness: Are plans, tactics, and key squares all covered?
- clarity: Is it understandable to an intermediate player?
- correctness: Is the chess analysis accurate?
- actionability: Does the player know what to do next?
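In harness code, the rubric above can be carried as a simple mapping from dimension name to its guiding question. The questions are quoted from the rubric; the dict representation itself is an illustrative choice, not the actual evaluation harness.

```python
# The five rubric dimensions and their guiding questions, as listed above.
# The dict structure is illustrative, not the actual harness representation.
RUBRIC = {
    "relevance": "Is the analysis about this specific position?",
    "completeness": "Are plans, tactics, and key squares all covered?",
    "clarity": "Is it understandable to an intermediate player?",
    "correctness": "Is the chess analysis accurate?",
    "actionability": "Does the player know what to do next?",
}
```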
Overall Rankings
Each dimension column is the mean judge score on the 1–5 scale; Overall is the unweighted mean of the five dimensions.
| Rank | Model | Positions | Relevance | Completeness | Clarity | Correctness | Actionability | Overall |
|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 3 Flash | 100 | 2.38 | 2.04 | 3.51 | 2.92 | 3.02 | 2.78 |
| 2 | GPT-5.2 | 30 | 2.23 | 1.77 | 3.47 | 2.90 | 3.23 | 2.72 |
| 3 | Gemini 3 Pro | 30 | 2.10 | 1.73 | 3.40 | 3.03 | 3.30 | 2.71 |
| 4 | Claude Sonnet 4.5 | 30 | 2.13 | 1.67 | 3.63 | 2.97 | 3.13 | 2.71 |
| 5 | GPT-5 Nano | 30 | 2.07 | 1.70 | 3.13 | 2.73 | 3.03 | 2.53 |
Gemini 3 Flash was tested on 100 positions; the other models were tested on 30 positions from the same expert-annotated pool, drawn from 197 GM-annotated Lichess games.
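The Overall column can be sanity-checked as the unweighted mean of the five dimension columns. Note that the published Overall appears to be averaged over raw per-position scores before rounding, so recomputing from the rounded column means can drift by about 0.01 (e.g. Gemini 3 Flash's rounded columns average to 2.77 versus the 2.78 shown); the two rows below reproduce the table exactly.

```python
# Recompute Overall as the unweighted mean of the five published dimension
# means (Relevance, Completeness, Clarity, Correctness, Actionability).
dimension_means = {
    "GPT-5.2":    [2.23, 1.77, 3.47, 2.90, 3.23],
    "GPT-5 Nano": [2.07, 1.70, 3.13, 2.73, 3.03],
}

recomputed = {m: round(sum(v) / len(v), 2) for m, v in dimension_means.items()}
print(recomputed)  # {'GPT-5.2': 2.72, 'GPT-5 Nano': 2.53}, matching the table
```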
By Position Category
Scores by position type. A position can fall into more than one category, which is why the per-category sample sizes sum to more than each model's position total.
tactical
| Model | n | Relevance | Completeness | Clarity | Correctness | Actionability | Overall |
|---|---|---|---|---|---|---|---|
| Gemini 3 Flash | 16 | 2.29 | 2.07 | 3.48 | 2.83 | 3.08 | 2.75 |
| Claude Sonnet 4.5 | 5 | 1.40 | 1.40 | 4.00 | 3.00 | 3.40 | 2.64 |
| GPT-5.2 | 5 | 2.00 | 1.40 | 3.40 | 2.80 | 3.20 | 2.56 |
| GPT-5 Nano | 5 | 1.80 | 1.40 | 3.20 | 2.40 | 3.40 | 2.44 |
| Gemini 3 Pro | 5 | 1.60 | 1.40 | 3.40 | 2.60 | 3.00 | 2.40 |
endgame
| Model | n | Relevance | Completeness | Clarity | Correctness | Actionability | Overall |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.5 | 8 | 2.13 | 1.88 | 4.13 | 3.25 | 3.63 | 3.00 |
| Gemini 3 Pro | 8 | 2.25 | 1.88 | 3.63 | 3.25 | 3.88 | 2.98 |
| GPT-5.2 | 8 | 2.00 | 1.75 | 3.88 | 3.13 | 3.75 | 2.90 |
| Gemini 3 Flash | 25 | 2.42 | 2.09 | 3.59 | 2.96 | 3.12 | 2.84 |
| GPT-5 Nano | 8 | 2.13 | 1.88 | 3.38 | 2.88 | 3.38 | 2.73 |
positional
| Model | n | Relevance | Completeness | Clarity | Correctness | Actionability | Overall |
|---|---|---|---|---|---|---|---|
| GPT-5.2 | 14 | 2.29 | 1.86 | 3.64 | 3.00 | 3.50 | 2.86 |
| Gemini 3 Pro | 14 | 2.36 | 1.93 | 3.36 | 3.14 | 3.50 | 2.86 |
| Gemini 3 Flash | 37 | 2.34 | 2.01 | 3.44 | 2.93 | 3.01 | 2.74 |
| GPT-5 Nano | 14 | 2.29 | 1.86 | 3.21 | 2.93 | 3.07 | 2.67 |
| Claude Sonnet 4.5 | 14 | 2.21 | 1.64 | 3.43 | 2.79 | 3.14 | 2.64 |
middlegame
| Model | n | Relevance | Completeness | Clarity | Correctness | Actionability | Overall |
|---|---|---|---|---|---|---|---|
| Gemini 3 Flash | 50 | 2.47 | 2.15 | 3.29 | 2.87 | 2.92 | 2.74 |
| Gemini 3 Pro | 15 | 2.20 | 1.87 | 3.13 | 3.00 | 3.00 | 2.64 |
| GPT-5.2 | 15 | 2.40 | 1.93 | 3.13 | 2.73 | 2.93 | 2.63 |
| Claude Sonnet 4.5 | 15 | 2.33 | 1.67 | 3.27 | 2.80 | 2.80 | 2.57 |
| GPT-5 Nano | 15 | 2.20 | 1.73 | 3.00 | 2.73 | 2.87 | 2.51 |
opening
| Model | n | Relevance | Completeness | Clarity | Correctness | Actionability | Overall |
|---|---|---|---|---|---|---|---|
| Gemini 3 Flash | 25 | 2.18 | 1.78 | 3.86 | 3.00 | 3.12 | 2.79 |
| GPT-5.2 | 7 | 2.14 | 1.43 | 3.71 | 3.00 | 3.29 | 2.71 |
| Claude Sonnet 4.5 | 7 | 1.71 | 1.43 | 3.86 | 3.00 | 3.29 | 2.66 |
| Gemini 3 Pro | 7 | 1.71 | 1.29 | 3.71 | 2.86 | 3.29 | 2.57 |
| GPT-5 Nano | 7 | 1.71 | 1.43 | 3.14 | 2.57 | 3.00 | 2.37 |
Coming Soon: Base Model Comparison
The current results show models receiving full Chessvia structured context (tactical patterns, Stockfish evaluation, positional plans). A separate benchmark will test the same models on FEN input only, isolating the impact of structured analysis on explanation quality.