Methodology
All benchmarks are run at temperature 0 via the OpenRouter API. Results are verified before appearing on the leaderboard.
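As a rough sketch of the setup, a single temperature-0 request to OpenRouter's OpenAI-compatible chat-completions endpoint might be built like this (the model slug and question are placeholders, not leaderboard entries):

```python
def build_eval_request(model: str, question: str) -> dict:
    """Build a chat-completions payload for one benchmark question.

    temperature=0 makes decoding deterministic, so repeated runs of the
    same question are reproducible.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0,
    }

# The payload is POSTed to https://openrouter.ai/api/v1/chat/completions
# with an "Authorization: Bearer <OPENROUTER_API_KEY>" header.
payload = build_eval_request("openai/gpt-4o", "What is a fork in chess?")
```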
Chess Understanding Benchmark
Description
The Chess Understanding benchmark is a set of multiple-choice and classification questions testing semantic chess knowledge and position judgment. Each question has exactly one correct answer; scoring is exact match with no partial credit. All evaluations run at temperature 0 to ensure reproducibility.
Inspired by the ChessQA dataset, created by researchers at the University of Toronto (arXiv:2510.23948) and used here with attribution under the CC-BY-4.0 license.
Semantic (400 questions)
Tests general chess knowledge: rules, piece behavior, terminology, and strategic concepts.
Position Judgment (500 questions)
Given a chess position, the model classifies the evaluation from the perspective of the side to move.
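One way such a classification could work is mapping an engine evaluation to a coarse judgment label. The labels and centipawn thresholds below are purely illustrative assumptions; the benchmark's actual category boundaries are not specified here.

```python
def judge_class(cp_eval_side_to_move: int) -> str:
    """Map a centipawn evaluation, taken from the side to move's
    perspective, to a coarse judgment label.

    Both the label set and the thresholds are hypothetical.
    """
    if cp_eval_side_to_move >= 200:
        return "winning"
    if cp_eval_side_to_move >= 50:
        return "better"
    if cp_eval_side_to_move > -50:
        return "equal"
    if cp_eval_side_to_move > -200:
        return "worse"
    return "losing"
```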
Scoring
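The exact-match rule described above can be sketched as follows; the case and whitespace normalization is an assumed detail, not stated in the benchmark description:

```python
def score_exact(predicted: str, gold: str) -> int:
    """Exact-match scoring with no partial credit: 1 if the answers
    match after trimming and case-folding, else 0."""
    return int(predicted.strip().lower() == gold.strip().lower())

def accuracy(predictions: list[str], golds: list[str]) -> float:
    """Fraction of questions answered exactly correctly."""
    return sum(map(score_exact, predictions, golds)) / len(golds)
```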
Academic Inspiration
@article{chessqa2025,
title={ChessQA: A Chess Question Answering Benchmark},
author={...},
journal={arXiv preprint arXiv:2510.23948},
year={2025}
}

Puzzle Solving Benchmark
Overview
Coming soon. Puzzle evaluation is in progress. The puzzle benchmark will measure a model's ability to find the best move in a given chess position.
Explainability Benchmark
Description
Tests how well a model can explain a chess position to an intermediate player. Each model receives a position (as FEN) along with structured context from the Chessvia API (tactical patterns, Stockfish evaluation, positional plans). Its explanation is then scored by Claude Opus 4.6 acting as an independent judge.
Ground truth is provided by 197 expert-annotated games from Lichess, covering 3,373 GM-annotated positions across 5 position categories.
Rubric (1–3 scale)
1 = poor · 2 = adequate · 3 = excellent. Overall score is the mean across all five dimensions.
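The aggregation is a simple mean, which can be sketched as:

```python
from statistics import mean

def overall_score(dimension_scores: list[int]) -> float:
    """Overall explainability score: the mean of the five per-dimension
    rubric scores (1 = poor, 2 = adequate, 3 = excellent)."""
    assert len(dimension_scores) == 5
    assert all(1 <= s <= 3 for s in dimension_scores)
    return mean(dimension_scores)
```

For example, per-dimension scores of [3, 2, 3, 2, 2] yield an overall score of 2.4.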
Configurations
Evaluation Protocol
Answer Extraction
Answers are extracted from model output using two regex patterns, checked in order:
FINAL ANSWER: <answer>
\boxed{<answer>}

External Gameplay Sources
The Gameplay Leaderboard aggregates data from two independent research projects. We do not run these evaluations ourselves; we scrape and display their published data with full attribution. Data is refreshed periodically by re-running the scraper scripts.
Chess Gameplay Leaderboard
Models play against Stockfish at varying depths. Reported metrics include Elo, move accuracy, legal-move rate, and material balance. The leaderboard covers over 400 models and 3,000+ recorded games.
LLM Chess
An agentic gameplay framework in which LLMs manage the full game loop. Reported metrics include Elo with 95% confidence intervals, wrong-move rates, and cost per game. Paper: arXiv:2512.01992.