Methodology

All benchmarks are run at temperature 0 via the OpenRouter API. Results are verified before appearing on the leaderboard.

Chess Understanding Benchmark

Description

The Chess Understanding benchmark is a set of multiple-choice and classification questions testing semantic chess knowledge and position judgment. Each question has exactly one correct answer; scoring is exact match with no partial credit. All evaluations run at temperature 0 to ensure reproducibility.

This benchmark is inspired by the ChessQA dataset from the University of Toronto (arXiv:2510.23948), used here with attribution under the CC-BY-4.0 license.

Semantic (400 questions)

Tests general chess knowledge: rules, piece behavior, terminology, and strategic concepts.

Easy Random: 100 Qs
Keyword: 100 Qs
Piece + Stage: 100 Qs
Embedding: 100 Qs

Position Judgment (500 questions)

Given a chess position, the model classifies the evaluation from the perspective of the side to move.

Neutral: 100 Qs
Advantage: 100 Qs
Disadvantage: 100 Qs
Winning: 100 Qs
Losing: 100 Qs
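The five judgment classes can be read as bands over an engine's centipawn evaluation from the side to move's perspective. A minimal sketch of such a mapping follows; the thresholds below are illustrative assumptions, not the benchmark's published cutoffs:

```python
def judge_position(cp_eval: int) -> str:
    """Map a centipawn evaluation (side to move's perspective) to one of
    the five judgment classes. Thresholds are illustrative assumptions."""
    if cp_eval >= 300:
        return "Winning"
    if cp_eval >= 100:
        return "Advantage"
    if cp_eval > -100:
        return "Neutral"
    if cp_eval > -300:
        return "Disadvantage"
    return "Losing"
```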

Scoring

Method: Exact match
Partial credit: None
Temperature: 0
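Since scoring is exact match with no partial credit, per-question grading reduces to a string comparison and benchmark accuracy to its mean. A minimal sketch (the whitespace/case normalization is an assumption; the source specifies only "exact match"):

```python
def score(prediction: str, answer: str) -> int:
    """Exact-match scoring with no partial credit: 1 if the prediction
    equals the reference answer, else 0. The strip/lower normalization
    is an assumption, not part of the published spec."""
    return int(prediction.strip().lower() == answer.strip().lower())

def accuracy(predictions, answers) -> float:
    """Benchmark accuracy: mean exact-match score over all questions."""
    return sum(score(p, a) for p, a in zip(predictions, answers)) / len(answers)
```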

Academic Inspiration

This benchmark is inspired by ChessQA, created by researchers at the University of Toronto (arXiv:2510.23948):

@article{chessqa2025,
  title={ChessQA: A Chess Question Answering Benchmark},
  author={...},
  journal={arXiv preprint arXiv:2510.23948},
  year={2025}
}

Puzzle Solving Benchmark

Overview

Coming Soon

Puzzle evaluation is in progress. The puzzle benchmark will measure a model's ability to find the best move in a given chess position.

Dataset
Count: 1,000 puzzles
Elo range: 500–3000
Selection: Stratified sampling
Format & Scoring
Input: FEN position
Output: UCI move string
Correct: Exact UCI match
Incorrect / Illegal: Loss
Rating: Glicko-2
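Under this scheme, grading a puzzle attempt is a two-step check: reject anything that is not a well-formed UCI move string, then require an exact match against the solution. A minimal sketch (full legality checking against the FEN would additionally need a chess library such as python-chess, omitted here; the regex covers only UCI syntax, not castling notation variants):

```python
import re

# UCI move syntax: from-square, to-square, optional promotion piece.
UCI_RE = re.compile(r"^[a-h][1-8][a-h][1-8][qrbn]?$")

def grade_puzzle(model_move: str, best_move: str) -> bool:
    """Exact UCI match scoring: a malformed move string is scored as a
    loss, as is any move other than the puzzle solution."""
    move = model_move.strip()
    if not UCI_RE.match(move):
        return False  # malformed output counts as a loss
    return move == best_move
```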

Explainability Benchmark

Description

Tests how well a model can explain a chess position to an intermediate player. Each model receives a position (as FEN) along with structured Chessvia API context (tactical patterns, Stockfish evaluation, positional plans). Its explanation is then scored by Claude Opus 4.6 acting as an independent judge.

Ground truth is provided by 197 expert-annotated games from Lichess, covering 3,373 GM-annotated positions across 5 position categories.

Rubric (1–3 scale)

Relevance: Is the analysis about this specific position?
Completeness: Are plans, tactics, and key squares covered?
Clarity: Is it understandable to an intermediate player?
Correctness: Is the chess analysis accurate?
Actionability: Does the player know what to do next?

1 = poor · 2 = adequate · 3 = excellent. Overall score is the mean across all five dimensions.
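Aggregation is a plain mean over the five rubric dimensions. A minimal sketch (the dictionary keys are illustrative names for the five dimensions, not a published schema):

```python
RUBRIC = ("relevance", "completeness", "clarity", "correctness", "actionability")

def overall_score(ratings: dict) -> float:
    """Overall explanation score: mean of the five 1-3 rubric ratings."""
    for dim in RUBRIC:
        if not 1 <= ratings[dim] <= 3:
            raise ValueError(f"{dim} rating must be on the 1-3 scale")
    return sum(ratings[dim] for dim in RUBRIC) / len(RUBRIC)
```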

Configurations

With Chessvia Context
Model receives the full Chessvia API output: tactical patterns, Stockfish centipawn evaluation, positional plans, and piece activity analysis. This measures how well models leverage structured chess context.
Base Model (FEN only) — Coming Soon
Model receives the FEN string only. Provides a baseline to measure how much structured context improves explanation quality.

Evaluation Protocol

API: OpenRouter
Temperature: 0
Max tokens: 16,384
System prompt: None
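The protocol table translates directly into a chat-completions request body. A minimal sketch of such a payload, assuming the OpenAI-compatible schema that OpenRouter accepts (field names are from that schema, not from this document):

```python
def build_request(model: str, prompt: str) -> dict:
    """Request payload per the protocol table: temperature 0, 16,384 max
    tokens, and no system prompt (the question is a single user message)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": 16384,
    }
```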

Answer Extraction

Answers are extracted from model output using two regex patterns, checked in order:

1. FINAL ANSWER: <answer>
2. \boxed{<answer>}
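The two-pattern, first-match-wins extraction can be sketched as follows; the exact regexes are an illustrative reconstruction, not the harness's verbatim patterns:

```python
import re

# Patterns checked in order; the first match wins.
PATTERNS = [
    re.compile(r"FINAL ANSWER:\s*(.+)"),
    re.compile(r"\\boxed\{([^}]*)\}"),
]

def extract_answer(output: str):
    """Extract the answer from raw model output, trying each pattern in
    priority order. Returns None when neither pattern matches."""
    for pattern in PATTERNS:
        match = pattern.search(output)
        if match:
            return match.group(1).strip()
    return None
```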

External Gameplay Sources

The Gameplay Leaderboard aggregates data from two independent research projects. We do not run these evaluations — we scrape and display their published data with full attribution. Data is refreshed periodically by re-running the scraper scripts.

Chess Gameplay Leaderboard

Models play against Stockfish at varying depths. Reports Elo, accuracy, legal move rate, and material metrics across over 400 models and 3,000+ recorded games.

LLM Chess

An agentic gameplay framework in which LLMs manage the full game loop. Reports Elo with 95% confidence intervals, wrong-move rates, and cost per game. Paper: arXiv:2512.01992.