
Explainability Benchmark

How well does each model explain a chess position? Each explanation is scored by Claude Opus 4.6, acting as judge, across five dimensions on expert-annotated positions from real GM games.

Last updated: February 25, 2026
Scale: 1–3 per dimension (1 = poor · 2 = adequate · 3 = excellent)
Judge: Claude Opus 4.6 (blind evaluation; no access to model identity)
Configuration: With Chessvia context (tactics, eval, and plans passed as structured context)

Rubric Dimensions

- relevance: Is the analysis about this specific position?
- completeness: Are plans, tactics, and key squares all covered?
- clarity: Is it understandable to an intermediate player?
- correctness: Is the chess analysis accurate?
- actionability: Does the player know what to do next?

Overall Rankings

Scores on the 1–3 scale.

| Rank | Model | Positions | Relevance | Completeness | Clarity | Correctness | Actionability | Overall |
|------|-------|-----------|-----------|--------------|---------|-------------|---------------|---------|
| 1 | Gemini 3 Flash | 100 | 2.38 | 2.04 | 3.51 | 2.92 | 3.02 | 2.78 |
| 2 | GPT-5.2 | 30 | 2.23 | 1.77 | 3.47 | 2.90 | 3.23 | 2.72 |
| 3 | Gemini 3 Pro | 30 | 2.10 | 1.73 | 3.40 | 3.03 | 3.30 | 2.71 |
| 4 | Claude Sonnet 4.5 | 30 | 2.13 | 1.67 | 3.63 | 2.97 | 3.13 | 2.71 |
| 5 | GPT-5 Nano | 30 | 2.07 | 1.70 | 3.13 | 2.73 | 3.03 | 2.53 |

Gemini 3 Flash was tested on 100 positions; the other models were tested on 30 positions each, drawn from the same expert-annotated pool of 197 GM-annotated Lichess games.
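The Overall column is consistent with an unweighted mean of the five dimension scores; the 0.01 discrepancy in the Gemini 3 Flash row is what you would expect from rounding of the displayed per-dimension values. A minimal sketch (the helper name is ours, not from the benchmark code):

```python
def overall_score(relevance, completeness, clarity, correctness, actionability):
    """Unweighted mean of the five rubric dimensions, rounded to 2 decimals."""
    dims = [relevance, completeness, clarity, correctness, actionability]
    return round(sum(dims) / len(dims), 2)

# GPT-5.2, overall rankings row:
print(overall_score(2.23, 1.77, 3.47, 2.90, 3.23))  # → 2.72
```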

By Position Category

Scores by position type.

tactical

| Model | n | Relevance | Completeness | Clarity | Correctness | Actionability | Overall |
|-------|---|-----------|--------------|---------|-------------|---------------|---------|
| Gemini 3 Flash | 16 | 2.29 | 2.07 | 3.48 | 2.83 | 3.08 | 2.75 |
| Claude Sonnet 4.5 | 5 | 1.40 | 1.40 | 4.00 | 3.00 | 3.40 | 2.64 |
| GPT-5.2 | 5 | 2.00 | 1.40 | 3.40 | 2.80 | 3.20 | 2.56 |
| GPT-5 Nano | 5 | 1.80 | 1.40 | 3.20 | 2.40 | 3.40 | 2.44 |
| Gemini 3 Pro | 5 | 1.60 | 1.40 | 3.40 | 2.60 | 3.00 | 2.40 |

endgame

| Model | n | Relevance | Completeness | Clarity | Correctness | Actionability | Overall |
|-------|---|-----------|--------------|---------|-------------|---------------|---------|
| Claude Sonnet 4.5 | 8 | 2.13 | 1.88 | 4.13 | 3.25 | 3.63 | 3.00 |
| Gemini 3 Pro | 8 | 2.25 | 1.88 | 3.63 | 3.25 | 3.88 | 2.98 |
| GPT-5.2 | 8 | 2.00 | 1.75 | 3.88 | 3.13 | 3.75 | 2.90 |
| Gemini 3 Flash | 25 | 2.42 | 2.09 | 3.59 | 2.96 | 3.12 | 2.84 |
| GPT-5 Nano | 8 | 2.13 | 1.88 | 3.38 | 2.88 | 3.38 | 2.73 |

positional

| Model | n | Relevance | Completeness | Clarity | Correctness | Actionability | Overall |
|-------|---|-----------|--------------|---------|-------------|---------------|---------|
| GPT-5.2 | 14 | 2.29 | 1.86 | 3.64 | 3.00 | 3.50 | 2.86 |
| Gemini 3 Pro | 14 | 2.36 | 1.93 | 3.36 | 3.14 | 3.50 | 2.86 |
| Gemini 3 Flash | 37 | 2.34 | 2.01 | 3.44 | 2.93 | 3.01 | 2.74 |
| GPT-5 Nano | 14 | 2.29 | 1.86 | 3.21 | 2.93 | 3.07 | 2.67 |
| Claude Sonnet 4.5 | 14 | 2.21 | 1.64 | 3.43 | 2.79 | 3.14 | 2.64 |

middlegame

| Model | n | Relevance | Completeness | Clarity | Correctness | Actionability | Overall |
|-------|---|-----------|--------------|---------|-------------|---------------|---------|
| Gemini 3 Flash | 50 | 2.47 | 2.15 | 3.29 | 2.87 | 2.92 | 2.74 |
| Gemini 3 Pro | 15 | 2.20 | 1.87 | 3.13 | 3.00 | 3.00 | 2.64 |
| GPT-5.2 | 15 | 2.40 | 1.93 | 3.13 | 2.73 | 2.93 | 2.63 |
| Claude Sonnet 4.5 | 15 | 2.33 | 1.67 | 3.27 | 2.80 | 2.80 | 2.57 |
| GPT-5 Nano | 15 | 2.20 | 1.73 | 3.00 | 2.73 | 2.87 | 2.51 |

opening

| Model | n | Relevance | Completeness | Clarity | Correctness | Actionability | Overall |
|-------|---|-----------|--------------|---------|-------------|---------------|---------|
| Gemini 3 Flash | 25 | 2.18 | 1.78 | 3.86 | 3.00 | 3.12 | 2.79 |
| GPT-5.2 | 7 | 2.14 | 1.43 | 3.71 | 3.00 | 3.29 | 2.71 |
| Claude Sonnet 4.5 | 7 | 1.71 | 1.43 | 3.86 | 3.00 | 3.29 | 2.66 |
| Gemini 3 Pro | 7 | 1.71 | 1.29 | 3.71 | 2.86 | 3.29 | 2.57 |
| GPT-5 Nano | 7 | 1.71 | 1.43 | 3.14 | 2.57 | 3.00 | 2.37 |

Coming Soon: Base Model Comparison

The current results show models receiving full Chessvia structured context (tactical patterns, Stockfish evaluation, positional plans). A separate benchmark will test the same models on FEN input only, isolating the impact of structured analysis on explanation quality.
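As a rough illustration of the two conditions, assuming plain-text prompt assembly (the actual Chessvia prompt format and context schema are not documented here; all field names and example values below are invented):

```python
# Sketch of the two benchmark conditions. Everything below is an invented
# illustration; the real Chessvia structured context is not specified here.
FEN = "r1bqkbnr/pppp1ppp/2n5/4p3/2B1P3/5N2/PPPP1PPP/RNBQK2R b KQkq - 3 3"

def fen_only_prompt(fen: str) -> str:
    """Base-model condition: the model sees only the raw position."""
    return f"Explain this chess position for an intermediate player.\nFEN: {fen}"

def with_context_prompt(fen: str, eval_cp: int, tactics: list[str], plans: list[str]) -> str:
    """Current condition: tactics, engine eval, and plans passed as structured context."""
    return (
        f"Explain this chess position for an intermediate player.\n"
        f"FEN: {fen}\n"
        f"Engine eval (centipawns): {eval_cp}\n"
        f"Tactical motifs: {', '.join(tactics)}\n"
        f"Plans: {', '.join(plans)}"
    )
```

The FEN-only condition exercises the model's own chess understanding; the with-context condition measures how well it turns precomputed analysis into an explanation.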