Explainability Benchmark
How well does each model explain a chess position? Each model's explanations are scored by Claude Opus 4.6 as judge across five dimensions, on expert-annotated positions from real GM games.
Scale
1–5 per dimension
1 = poor · 3 = adequate · 5 = excellent
Judge
Claude Opus 4.6
Blind evaluation, no access to model identity
Configuration
With Chessvia context
Tactics, eval, plans passed as structured context
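As a concrete illustration of the "With Chessvia context" condition, the structured context might look something like the following. This is a hypothetical sketch: the field names and values are illustrative, not the actual Chessvia schema.

```python
# Hypothetical shape of the structured context supplied with each position.
# Field names are illustrative assumptions, not the real Chessvia schema.
position_context = {
    # Italian Game after 1.e4 e5 2.Nf3 Nc6 3.Bc4, Black to move
    "fen": "r1bqkbnr/pppp1ppp/2n5/4p3/2B1P3/5N2/PPPP1PPP/RNBQK2R b KQkq - 3 3",
    "engine_eval": {"cp": 35, "depth": 22},  # Stockfish score in centipawns
    "tactics": ["pressure on f7", "possible pin motifs on the a2-g8 diagonal"],
    "plans": ["castle short", "challenge the c4 bishop with ...Na5 or ...b5"],
}
```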
Rubric Dimensions
- relevance: Is the analysis about this specific position?
- completeness: Are plans, tactics, and key squares all covered?
- clarity: Is it understandable to an intermediate player?
- correctness: Is the chess analysis accurate?
- actionability: Does the player know what to do next?
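In harness code, the rubric above can be carried as a simple mapping from dimension name to its guiding question. The questions are quoted from the rubric; the dict representation itself is an illustrative choice, not the actual evaluation harness.

```python
# The five rubric dimensions and their guiding questions, as listed above.
# The dict structure is illustrative, not the actual harness representation.
RUBRIC = {
    "relevance": "Is the analysis about this specific position?",
    "completeness": "Are plans, tactics, and key squares all covered?",
    "clarity": "Is it understandable to an intermediate player?",
    "correctness": "Is the chess analysis accurate?",
    "actionability": "Does the player know what to do next?",
}
```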
Overall Rankings
Each dimension column is the mean judge score on the 1–5 scale; Overall is the unweighted mean of the five dimensions.
| Rank | Model | Positions | Relevance | Completeness | Clarity | Correctness | Actionability | Overall |
|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 3 Flash | 100 | 2.38 | 2.04 | 3.51 | 2.92 | 3.02 | 2.78 |
| 2 | GPT-5.2 | 30 | 2.23 | 1.77 | 3.47 | 2.90 | 3.23 | 2.72 |
| 3 | Gemini 3 Pro | 30 | 2.10 | 1.73 | 3.40 | 3.03 | 3.30 | 2.71 |
| 4 | Claude Sonnet 4.5 | 30 | 2.13 | 1.67 | 3.63 | 2.97 | 3.13 | 2.71 |
| 5 | GPT-5 Nano | 30 | 2.07 | 1.70 | 3.13 | 2.73 | 3.03 | 2.53 |
Gemini 3 Flash was tested on 100 positions; the other models were tested on 30 positions from the same expert-annotated pool, drawn from 197 GM-annotated Lichess games.
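The Overall column can be sanity-checked as the unweighted mean of the five dimension columns. Note that the published Overall appears to be averaged over raw per-position scores before rounding, so recomputing from the rounded column means can drift by about 0.01 (e.g. Gemini 3 Flash's rounded columns average to 2.77 versus the 2.78 shown); the two rows below reproduce the table exactly.

```python
# Recompute Overall as the unweighted mean of the five published dimension
# means (Relevance, Completeness, Clarity, Correctness, Actionability).
dimension_means = {
    "GPT-5.2":    [2.23, 1.77, 3.47, 2.90, 3.23],
    "GPT-5 Nano": [2.07, 1.70, 3.13, 2.73, 3.03],
}

recomputed = {m: round(sum(v) / len(v), 2) for m, v in dimension_means.items()}
print(recomputed)  # {'GPT-5.2': 2.72, 'GPT-5 Nano': 2.53}, matching the table
```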
By Position Category
Scores by position type. A position can fall into more than one category, which is why the per-category sample sizes sum to more than each model's position total.
tactical
| Model | n | Relevance | Completeness | Clarity | Correctness | Actionability | Overall |
|---|---|---|---|---|---|---|---|
| Gemini 3 Flash | 16 | 2.29 | 2.07 | 3.48 | 2.83 | 3.08 | 2.75 |
| Claude Sonnet 4.5 | 5 | 1.40 | 1.40 | 4.00 | 3.00 | 3.40 | 2.64 |
| GPT-5.2 | 5 | 2.00 | 1.40 | 3.40 | 2.80 | 3.20 | 2.56 |
| GPT-5 Nano | 5 | 1.80 | 1.40 | 3.20 | 2.40 | 3.40 | 2.44 |
| Gemini 3 Pro | 5 | 1.60 | 1.40 | 3.40 | 2.60 | 3.00 | 2.40 |
endgame
| Model | n | Relevance | Completeness | Clarity | Correctness | Actionability | Overall |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.5 | 8 | 2.13 | 1.88 | 4.13 | 3.25 | 3.63 | 3.00 |
| Gemini 3 Pro | 8 | 2.25 | 1.88 | 3.63 | 3.25 | 3.88 | 2.98 |
| GPT-5.2 | 8 | 2.00 | 1.75 | 3.88 | 3.13 | 3.75 | 2.90 |
| Gemini 3 Flash | 25 | 2.42 | 2.09 | 3.59 | 2.96 | 3.12 | 2.84 |
| GPT-5 Nano | 8 | 2.13 | 1.88 | 3.38 | 2.88 | 3.38 | 2.73 |
positional
| Model | n | Relevance | Completeness | Clarity | Correctness | Actionability | Overall |
|---|---|---|---|---|---|---|---|
| GPT-5.2 | 14 | 2.29 | 1.86 | 3.64 | 3.00 | 3.50 | 2.86 |
| Gemini 3 Pro | 14 | 2.36 | 1.93 | 3.36 | 3.14 | 3.50 | 2.86 |
| Gemini 3 Flash | 37 | 2.34 | 2.01 | 3.44 | 2.93 | 3.01 | 2.74 |
| GPT-5 Nano | 14 | 2.29 | 1.86 | 3.21 | 2.93 | 3.07 | 2.67 |
| Claude Sonnet 4.5 | 14 | 2.21 | 1.64 | 3.43 | 2.79 | 3.14 | 2.64 |
middlegame
| Model | n | Relevance | Completeness | Clarity | Correctness | Actionability | Overall |
|---|---|---|---|---|---|---|---|
| Gemini 3 Flash | 50 | 2.47 | 2.15 | 3.29 | 2.87 | 2.92 | 2.74 |
| Gemini 3 Pro | 15 | 2.20 | 1.87 | 3.13 | 3.00 | 3.00 | 2.64 |
| GPT-5.2 | 15 | 2.40 | 1.93 | 3.13 | 2.73 | 2.93 | 2.63 |
| Claude Sonnet 4.5 | 15 | 2.33 | 1.67 | 3.27 | 2.80 | 2.80 | 2.57 |
| GPT-5 Nano | 15 | 2.20 | 1.73 | 3.00 | 2.73 | 2.87 | 2.51 |
opening
| Model | n | Relevance | Completeness | Clarity | Correctness | Actionability | Overall |
|---|---|---|---|---|---|---|---|
| Gemini 3 Flash | 25 | 2.18 | 1.78 | 3.86 | 3.00 | 3.12 | 2.79 |
| GPT-5.2 | 7 | 2.14 | 1.43 | 3.71 | 3.00 | 3.29 | 2.71 |
| Claude Sonnet 4.5 | 7 | 1.71 | 1.43 | 3.86 | 3.00 | 3.29 | 2.66 |
| Gemini 3 Pro | 7 | 1.71 | 1.29 | 3.71 | 2.86 | 3.29 | 2.57 |
| GPT-5 Nano | 7 | 1.71 | 1.43 | 3.14 | 2.57 | 3.00 | 2.37 |
Coming Soon: Base Model Comparison
The current results show models receiving full Chessvia structured context (tactical patterns, Stockfish evaluation, positional plans). A separate benchmark will test the same models on FEN input only, isolating the impact of structured analysis on explanation quality.