Methodology

All benchmarks are run at temperature 0 via the OpenRouter API. Results are verified before appearing on the leaderboard.

Chess Understanding Benchmark

Description

The Chess Understanding benchmark is a set of multiple-choice and classification questions testing semantic chess knowledge and position judgment. Each question has exactly one correct answer; scoring is exact match with no partial credit. All evaluations run at temperature 0 to ensure reproducibility.

This benchmark is inspired by the ChessQA dataset from the University of Toronto (arXiv:2510.23948), used here with attribution under the CC-BY-4.0 license.

Semantic (400 questions)

Tests general chess knowledge: rules, piece behavior, terminology, and strategic concepts.

Easy Random: 100 Qs
Keyword: 100 Qs
Piece + Stage: 100 Qs
Embedding: 100 Qs

Position Judgment (500 questions)

Given a chess position, the model classifies the evaluation from the perspective of the side to move.

Neutral: 100 Qs
Advantage: 100 Qs
Disadvantage: 100 Qs
Winning: 100 Qs
Losing: 100 Qs
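The five judgment classes can be read as bands over an engine's centipawn evaluation from the side to move's perspective. A minimal sketch of such a mapping follows; the thresholds below are illustrative assumptions, not the benchmark's published cutoffs:

```python
def judge_position(cp_eval: int) -> str:
    """Map a centipawn evaluation (side to move's perspective) to one of
    the five judgment classes. Thresholds are illustrative assumptions."""
    if cp_eval >= 300:
        return "Winning"
    if cp_eval >= 100:
        return "Advantage"
    if cp_eval > -100:
        return "Neutral"
    if cp_eval > -300:
        return "Disadvantage"
    return "Losing"
```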

Scoring

Method: Exact match
Partial credit: None
Temperature: 0
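Since scoring is exact match with no partial credit, per-question grading reduces to a string comparison and benchmark accuracy to its mean. A minimal sketch (the whitespace/case normalization is an assumption; the source specifies only "exact match"):

```python
def score(prediction: str, answer: str) -> int:
    """Exact-match scoring with no partial credit: 1 if the prediction
    equals the reference answer, else 0. The strip/lower normalization
    is an assumption, not part of the published spec."""
    return int(prediction.strip().lower() == answer.strip().lower())

def accuracy(predictions, answers) -> float:
    """Benchmark accuracy: mean exact-match score over all questions."""
    return sum(score(p, a) for p, a in zip(predictions, answers)) / len(answers)
```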

Academic Inspiration

This benchmark is inspired by ChessQA, created by researchers at the University of Toronto (arXiv:2510.23948):

@article{chessqa2025,
  title={ChessQA: A Chess Question Answering Benchmark},
  author={...},
  journal={arXiv preprint arXiv:2510.23948},
  year={2025}
}

Puzzle Solving Benchmark

Overview

Coming Soon

Puzzle evaluation is in progress. The puzzle benchmark will measure a model's ability to find the best move in a given chess position.

Dataset
Count: 1,000 puzzles
Elo range: 500–3000
Selection: Stratified sampling
Format & Scoring
Input: FEN position
Output: UCI move string
Correct: Exact UCI match
Incorrect / Illegal: Loss
Rating: Glicko-2
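Under this scheme, grading a puzzle attempt is a two-step check: reject anything that is not a well-formed UCI move string, then require an exact match against the solution. A minimal sketch (full legality checking against the FEN would additionally need a chess library such as python-chess, omitted here; the regex covers only UCI syntax, not castling notation variants):

```python
import re

# UCI move syntax: from-square, to-square, optional promotion piece.
UCI_RE = re.compile(r"^[a-h][1-8][a-h][1-8][qrbn]?$")

def grade_puzzle(model_move: str, best_move: str) -> bool:
    """Exact UCI match scoring: a malformed move string is scored as a
    loss, as is any move other than the puzzle solution."""
    move = model_move.strip()
    if not UCI_RE.match(move):
        return False  # malformed output counts as a loss
    return move == best_move
```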

Explainability Benchmark

Description

Tests how well a model can explain a chess position to an intermediate player. Each model receives a position (as FEN) along with structured Chessvia API context (tactical patterns, Stockfish evaluation, positional plans). Its explanation is then scored by Claude Opus 4.6 acting as an independent judge.

Ground truth is provided by 197 expert-annotated games from Lichess, covering 3,373 GM-annotated positions across 5 position categories.

Rubric (1–3 scale)

Relevance: Is the analysis about this specific position?
Completeness: Are plans, tactics, and key squares covered?
Clarity: Is it understandable to an intermediate player?
Correctness: Is the chess analysis accurate?
Actionability: Does the player know what to do next?

1 = poor · 2 = adequate · 3 = excellent. Overall score is the mean across all five dimensions.
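Aggregation is a plain mean over the five rubric dimensions. A minimal sketch (the dictionary keys are illustrative names for the five dimensions, not a published schema):

```python
RUBRIC = ("relevance", "completeness", "clarity", "correctness", "actionability")

def overall_score(ratings: dict) -> float:
    """Overall explanation score: mean of the five 1-3 rubric ratings."""
    for dim in RUBRIC:
        if not 1 <= ratings[dim] <= 3:
            raise ValueError(f"{dim} rating must be on the 1-3 scale")
    return sum(ratings[dim] for dim in RUBRIC) / len(RUBRIC)
```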

Configurations

With Chessvia Context
Model receives the full Chessvia API output: tactical patterns, Stockfish centipawn evaluation, positional plans, and piece activity analysis. This measures how well models leverage structured chess context.
Base Model (FEN only) — Coming Soon
Model receives the FEN string only. Provides a baseline to measure how much structured context improves explanation quality.

Evaluation Protocol

API: OpenRouter
Temperature: 0
Max tokens: 16,384
System prompt: None
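The protocol table translates directly into a chat-completions request body. A minimal sketch of such a payload, assuming the OpenAI-compatible schema that OpenRouter accepts (field names are from that schema, not from this document):

```python
def build_request(model: str, prompt: str) -> dict:
    """Request payload per the protocol table: temperature 0, 16,384 max
    tokens, and no system prompt (the question is a single user message)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": 16384,
    }
```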

Answer Extraction

Answers are extracted from model output using two regex patterns, checked in order:

1. FINAL ANSWER: <answer>
2. \boxed{<answer>}
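The two-pattern, first-match-wins extraction can be sketched as follows; the exact regexes are an illustrative reconstruction, not the harness's verbatim patterns:

```python
import re

# Patterns checked in order; the first match wins.
PATTERNS = [
    re.compile(r"FINAL ANSWER:\s*(.+)"),
    re.compile(r"\\boxed\{([^}]*)\}"),
]

def extract_answer(output: str):
    """Extract the answer from raw model output, trying each pattern in
    priority order. Returns None when neither pattern matches."""
    for pattern in PATTERNS:
        match = pattern.search(output)
        if match:
            return match.group(1).strip()
    return None
```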

External Gameplay Sources

The Gameplay Leaderboard aggregates data from two independent research projects. We do not run these evaluations — we scrape and display their published data with full attribution. Data is refreshed periodically by re-running the scraper scripts.

Chess Gameplay Leaderboard

Models play against Stockfish at varying depths. Reports Elo, accuracy, legal move rate, and material metrics across over 400 models and 3,000+ recorded games.

LLM Chess

An agentic gameplay framework in which LLMs manage the full game loop. Reports Elo with 95% confidence intervals, wrong-move rates, and cost per game. Paper: arXiv:2512.01992.