Gameplay Leaderboard

How well do LLMs actually play chess? Data aggregated from independent research — we do not run these evaluations ourselves.

Data by dubesor.de, an independent evaluation not affiliated with Chess AI Bench. Cached Feb 25, 2026.
406 models
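| # | Model | Mode | ELO | Games | Win Rate | Accuracy | Avg Turns | Illegal/Game |
|---|---|---|---|---|---|---|---|---|
| 1 | gemini-3-pro-preview | Reasoning | 1845 | 40 | 86.4% | 89.5% | 37.1 | 0.00 |
| 2 | gemini-3.1-pro-preview | Reasoning | 1837 | 15 | 77.8% | 88.3% | 40.6 | 0.00 |
| 3 | gpt-4.5-preview ˟ | Continuation | 1800 | 20 | 95.5% | 90.2% | 41.5 | 0.50 |
| 4 | qwen3-max-thinking | Reasoning | 1800 | 1 | 100.0% | 91.0% | 63.0 | 0.00 |
| 5 | gemini-3-pro-preview | Continuation | 1795 | 27 | 92.6% | 88.6% | 43.3 | 0.59 |
| 6 | gpt-5.1-codex | Reasoning | 1785 | 16 | 78.1% | 89.6% | 45.3 | 0.00 |
| 7 | gpt-5-codex | Reasoning | 1777 | 23 | 84.8% | 96.1% | 40.5 | 0.00 |
| 8 | gemini-3-flash-preview | Continuation | 1767 | 24 | 89.6% | 88.9% | 44.5 | 0.25 |
| 9 | gemini-3.1-pro-preview | Continuation | 1661 | 12 | 54.2% | 89.3% | 43.3 | 0.33 |
| 10 | grok-4 | Reasoning | 1615 | 27 | 63.0% | 90.5% | 39.7 | 0.00 |
| 11 | chatgpt-4o-latest ˟ | Continuation | 1584 | 18 | 63.9% | 87.2% | 48.0 | 2.29 |
| 12 | o3 | Reasoning | 1558 | 35 | 80.0% | 83.6% | 47.2 | 0.00 |
| 13 | gpt-5 | Reasoning | 1526 | 23 | 75.0% | 82.1% | 39.2 | 0.00 |
| 14 | gpt-5.1 | Reasoning | 1526 | 19 | 57.9% | 85.7% | 41.2 | 0.00 |
| 15 | gpt-5-chat | Continuation | 1497 | 28 | 62.5% | 87.8% | 41.2 | 5.11 |
| 16 | gpt-4o | Continuation | 1463 | 37 | 73.2% | 76.6% | 39.4 | 1.44 |
| 17 | gpt-5 | Continuation | 1443 | 17 | 61.8% | 84.0% | 45.3 | 1.00 |
| 18 | gemini-3-flash-preview | Reasoning | 1436 | 33 | 67.1% | 84.5% | 34.7 | 0.02 |
| 19 | gpt-5.1-codex | Continuation | 1429 | 14 | 50.0% | 83.1% | 43.4 | 3.29 |
| 20 | gpt-5.1-chat | Continuation | 1401 | 17 | 52.9% | 85.0% | 39.1 | 4.65 |
| 21 | gpt-3.5-turbo-instruct | Continuation | 1379 | 88 | 69.1% | 81.8% | 34.1 | 2.66 |
| 22 | gpt-3.5-turbo | Continuation | 1374 | 63 | 80.2% | 76.5% | 31.8 | 2.67 |
| 23 | gpt-5.1-codex-max | Continuation | 1353 | 14 | 39.3% | 81.5% | 39.3 | 3.93 |
| 24 | o3 | Continuation | 1347 | 17 | 41.2% | 80.5% | 40.9 | 3.00 |
| 25 | gpt-5.1-codex-max | Reasoning | 1342 | 18 | 44.4% | 81.7% | 44.7 | 0.00 |
| 26 | gpt-5.2-codex | Reasoning | 1327 | 17 | 47.1% | 83.3% | 39.7 | 0.00 |
| 27 | gpt-4o-2024-11-20 | Continuation | 1308 | 23 | 58.7% | 78.3% | 36.6 | 1.20 |
| 28 | gpt-5-codex | Continuation | 1291 | 13 | 69.2% | 76.4% | 46.2 | 4.00 |
| 29 | gpt-4.1-mini | Continuation | 1261 | 28 | 65.7% | 78.5% | 37.5 | 21.58 |
| 30 | gpt-4.1 | Continuation | 1239 | 37 | 59.2% | 71.1% | 39.6 | 1.00 |
| 31 | step-3.5-flash | Reasoning | 1232 | 21 | 64.3% | 80.0% | 34.4 | 0.00 |
| 32 | gpt-5.1 | Continuation | 1231 | 12 | 41.7% | 77.0% | 40.2 | 5.92 |
| 33 | grok-4.1-fast-reasoning | Reasoning | 1207 | 28 | 58.9% | 74.7% | 31.0 | 0.00 |
| 34 | gpt-4 | Continuation | 1203 | 15 | 46.7% | 77.6% | 43.3 | 1.80 |
| 35 | gemini-2.0-flash-001 | Continuation | 1196 | 23 | 46.0% | 78.5% | 43.4 | 5.29 |
| 36 | gpt-5.2 | Reasoning | 1174 | 18 | 44.4% | 72.1% | 51.8 | 0.00 |
| 37 | grok-4-fast-reasoning | Reasoning | 1145 | 30 | 51.7% | 77.3% | 34.3 | 0.00 |
| 38 | gpt-5.2-codex | Continuation | 1145 | 13 | 34.6% | 73.7% | 41.6 | 5.85 |
| 39 | gpt-5-mini | Continuation | 1117 | 16 | 34.4% | 77.3% | 46.0 | 7.67 |
| 40 | gpt-5-nano | Reasoning | 1112 | 30 | 56.7% | 72.5% | 40.3 | 0.00 |
| 41 | codex-mini ˟ | Reasoning | 1108 | 35 | 54.2% | 81.2% | 43.1 | 0.00 |
| 42 | gpt-5-mini | Reasoning | 1107 | 30 | 42.2% | 80.3% | 36.5 | 0.00 |
| 43 | gpt-5.3-codex | Reasoning | 1089 | 8 | 43.8% | 71.2% | 41.2 | 0.00 |
| 44 | gpt-5.3-codex | Continuation | 1082 | 7 | 21.4% | 73.7% | 47.7 | 8.29 |
| 45 | claude-opus-4.1 | Continuation | 1078 | 20 | 42.5% | 74.7% | 47.5 | 13.14 |
| 46 | deepseek-v3.2-speciale | Reasoning | 1072 | 23 | 58.7% | 69.3% | 53.4 | 0.00 |
| 47 | gemini-2.5-pro | Continuation | 1069 | 25 | 22.0% | 76.7% | 39.7 | 7.11 |
| 48 | gpt-5-nano | Continuation | 1049 | 16 | 56.3% | 74.9% | 40.6 | 4.67 |
| 49 | o4-mini | Reasoning | 1034 | 37 | 56.6% | 73.6% | 44.6 | 0.00 |
| 50 | gpt-4-turbo | Continuation | 1024 | 19 | 42.1% | 72.5% | 42.3 | 14.40 |
| 51 | o4-mini | Continuation | 995 | 12 | 41.7% | 70.3% | 40.2 | 2.50 |
| 52 | gpt-4.5-preview ˟ | Reasoning | 992 | 15 | 66.7% | 70.9% | 43.3 | 0.00 |
| 53 | grok-4 | Continuation | 992 | 12 | 20.8% | 72.5% | 40.4 | 2.50 |
| 54 | o1 | Continuation | 989 | 7 | 14.3% | 70.6% | 44.6 | 11.50 |
| 55 | gpt-4o-mini | Continuation | 955 | 13 | 38.6% | 69.0% | 42.5 | 27.58 |
| 56 | gpt-5.1-codex-mini | Reasoning | 952 | 22 | 43.2% | 65.6% | 39.1 | 0.00 |
| 57 | gemini-2.5-pro | Reasoning | 943 | 46 | 55.1% | 58.1% | 39.7 | 0.08 |
| 58 | grok-code-fast-1 | Reasoning | 927 | 23 | 52.1% | 69.2% | 49.8 | 0.00 |
| 59 | kimi-k2.5 | Reasoning | 927 | 26 | 46.2% | 72.5% | 41.7 | 0.12 |
| 60 | gpt-oss-120b | Reasoning | 925 | 35 | 61.4% | 62.8% | 44.0 | 0.08 |
| 61 | gpt-5.2 | Continuation | 920 | 12 | 25.0% | 66.0% | 42.0 | 10.08 |
| 62 | gpt-oss-20b | Continuation | 906 | 19 | 42.5% | 69.5% | 42.5 | 8.29 |
| 63 | claude-opus-4.5 | Continuation | 900 | 17 | 41.2% | 67.3% | 55.9 | 12.41 |
| 64 | gemini-2.5-flash | Continuation | 896 | 22 | 50.0% | 63.6% | 42.7 | 3.43 |
| 65 | grok-4.1-fast-reasoning | Continuation | 894 | 12 | 75.0% | 63.7% | 55.8 | 14.17 |
| 66 | nemotron-3-nano-30b-a3b | Reasoning | 892 | 26 | 61.1% | 68.2% | 46.1 | 0.07 |
| 67 | gemini-1.5-pro ˟ | Continuation | 888 | 10 | 50.0% | 64.0% | 66.3 | 13.67 |
| 68 | claude-opus-4 | Continuation | 871 | 18 | 41.7% | 61.8% | 52.9 | 12.50 |
| 69 | o1-mini ˟ | Continuation | 869 | 8 | 6.3% | 68.5% | 55.4 | 14.00 |
| 70 | claude-opus-4.5 | Reasoning | 869 | 28 | 42.2% | 63.9% | 43.6 | 0.03 |
| 71 | gpt-oss-20b | Reasoning | 854 | 31 | 64.5% | 62.8% | 52.4 | 0.18 |
| 72 | codestral-2508 | Reasoning | 854 | 1 | 50.0% | 64.7% | 93.0 | 0.00 |
| 73 | gpt-5.1-chat | Reasoning | 853 | 21 | 38.6% | 65.3% | 37.6 | 0.00 |
| 74 | minimax-m2 | Continuation | 838 | 12 | 20.8% | 66.5% | 50.0 | 38.92 |
| 75 | codex-mini ˟ | Continuation | 833 | 10 | 45.8% | 62.6% | 45.5 | 32.00 |
| 76 | gpt-4o | Reasoning | 829 | 38 | 47.7% | 61.2% | 51.4 | 0.14 |
| 77 | claude-opus-4.6 | Continuation | 827 | 15 | 43.3% | 61.5% | 50.1 | 10.67 |
| 78 | gpt-5.2-chat | Reasoning | 825 | 20 | 38.1% | 62.4% | 33.2 | 0.10 |
| 79 | gpt-5.1-codex-mini | Continuation | 824 | 10 | 40.0% | 61.1% | 37.7 | 9.70 |
| 80 | qwen3.5-397b-a17b | Reasoning | 823 | 15 | 50.0% | 65.9% | 48.3 | 0.00 |
| 81 | deepseek-v3.2-speciale | Continuation | 819 | 10 | 35.0% | 61.9% | 36.2 | 9.40 |
| 82 | seed-oss-36b-instruct | Reasoning | 818 | 13 | 61.5% | 60.8% | 47.8 | 0.00 |
| 83 | gpt-4.1 | Reasoning | 817 | 40 | 53.5% | 61.6% | 51.2 | 0.00 |
| 84 | qwen3.5-plus-02-15 | Reasoning | 812 | 14 | 57.1% | 62.4% | 50.0 | 0.57 |
| 85 | claude-opus-4.6 | Reasoning | 811 | 27 | 44.8% | 62.5% | 44.1 | 0.03 |
| 86 | o1 | Reasoning | 807 | 11 | 36.4% | 60.7% | 48.0 | 0.00 |
| 87 | seed-1.6 | Reasoning | 802 | 21 | 45.2% | 65.1% | 40.9 | 0.00 |
| 88 | chatgpt-4o-latest ˟ | Reasoning | 799 | 17 | 38.9% | 63.0% | 47.7 | 0.31 |
| 89 | lfm-7b ˟ | Continuation | 796 | 11 | 26.3% | 65.5% | 65.1 |  |
| 90 | gpt-5.2-chat | Continuation | 793 | 12 | 29.2% | 59.9% | 56.7 | 15.50 |
| 91 | glm-5 | Reasoning | 791 | 22 | 43.8% | 62.2% | 51.0 | 0.00 |
| 92 | claude-sonnet-4.6 | Reasoning | 790 | 16 | 41.7% | 59.6% | 40.1 | 0.22 |
| 93 | gpt-5-chat | Reasoning | 787 | 26 | 46.3% | 57.0% | 50.0 | 0.00 |
| 94 | qwen3-next-80b-a3b-thinking | Reasoning | 783 | 16 | 68.8% | 55.0% | 51.9 | 0.40 |
| 95 | claude-sonnet-4 | Continuation | 780 | 17 | 32.4% | 66.5% | 51.9 | 27.75 |
| 96 | grok-3 | Continuation | 780 | 17 | 41.7% | 66.0% | 58.7 | 6.80 |
| 97 | deepseek-v3.2 | Reasoning | 780 | 20 | 62.5% | 56.4% | 56.6 | 0.00 |
| 98 | minimax-m2.1 | Continuation | 780 | 8 | 43.8% | 62.7% | 65.4 | 76.00 |
| 99 | kimi-k2.5 | Continuation | 772 | 12 | 45.8% | 59.7% | 52.7 | 16.58 |
| 100 | grok-4-fast-reasoning | Continuation | 771 | 14 | 35.7% | 61.5% | 59.1 | 25.75 |
| 101 | o1-mini ˟ | Reasoning | 766 | 24 | 62.5% | 52.1% | 54.1 | 0.00 |
| 102 | grok-3-mini | Reasoning | 766 | 38 | 44.3% | 62.9% | 40.7 | 0.00 |
| 103 | grok-4-fast-non-reasoning | Reasoning | 756 | 24 | 44.4% | 65.8% | 49.2 | 0.67 |
| 104 | gemini-2.5-flash | Reasoning | 755 | 31 | 50.0% | 63.1% | 52.8 | 1.12 |
| 105 | deepseek-v3.1-terminus | Continuation | 755 | 1 | 100.0% | 59.1% | 71.0 | 41.00 |
| 106 | lfm-2.5-1.2b-instruct | Reasoning | 755 | 1 | 50.0% | 59.6% | 19.0 | 123.00 |
| 107 | grok-2-latest ˟ | Continuation | 744 | 8 | 50.0% | 58.1% | 42.0 | 20.62 |
| 108 | command-a | Reasoning | 743 | 18 | 47.2% | 63.5% | 50.7 | 1.20 |
| 109 | qwen3-8b | Reasoning | 741 | 18 | 63.9% | 58.1% | 61.4 | 0.86 |
| 110 | kimi-k2 | Reasoning | 737 | 31 | 51.6% | 55.9% | 60.4 | 1.62 |
| 111 | aurora-alpha | Reasoning | 736 | 1 | 25.0% | 58.1% | 43.5 | 0.00 |
| 112 | kimi-k2-0905 | Reasoning | 734 | 32 | 46.9% | 66.6% | 49.3 | 0.82 |
| 113 | kimi-k2-thinking | Continuation | 731 | 11 | 72.7% | 51.4% | 60.8 | 27.27 |
| 114 | gemini-2.5-flash-lite | Continuation | 727 | 15 | 52.6% | 59.1% | 52.7 | 9.00 |
| 115 | minimax-m2 | Reasoning | 721 | 24 | 39.6% | 60.2% | 45.3 | 0.50 |
| 116 | lfm-2.5-1.2b-thinking | Reasoning | 721 | 1 | 50.0% | 57.5% | 19.0 | 2.00 |
| 117 | kimi-k2-thinking | Reasoning | 715 | 20 | 42.5% | 56.4% | 49.5 | 0.00 |
| 118 | o3-mini | Continuation | 714 | 10 | 15.0% | 60.5% | 56.5 | 35.20 |
| 119 | glm-5 | Continuation | 714 | 10 | 35.0% | 58.6% | 57.9 | 19.10 |
| 120 | gpt-4.1-mini | Reasoning | 713 | 27 | 51.8% | 53.5% | 70.3 | 0.25 |
| 121 | claude-opus-4 | Reasoning | 709 | 24 | 43.8% | 54.3% | 56.4 | 0.75 |
| 122 | o3-mini | Reasoning | 707 | 16 | 34.2% | 55.7% | 43.8 | 0.00 |
| 123 | qwen3-32b | Reasoning | 706 | 26 | 53.8% | 54.7% | 50.7 | 0.17 |
| 124 | mistral-large-2-2411 | Reasoning | 704 | 35 | 57.1% | 59.1% | 66.3 | 10.50 |
| 125 | qwen2.5-72b-instruct | Reasoning | 704 | 38 | 52.6% | 59.3% | 70.1 | 4.00 |
| 126 | kimi-k2 | Continuation | 704 | 14 | 35.7% | 56.1% | 50.0 | 47.50 |
| 127 | qwen-plus-2025-07-28 | Reasoning | 704 | 17 | 52.9% | 58.2% | 61.3 | 0.81 |
| 128 | longcat-flash-chat | Reasoning | 701 | 22 | 59.1% | 54.2% | 52.9 | 1.00 |
| 129 | qwen3-14b | Reasoning | 700 | 16 | 53.1% | 54.2% | 64.8 | 0.25 |
| 130 | claude-3.7-sonnet | Continuation | 698 | 12 | 41.7% | 59.7% | 68.7 | 23.25 |
| 131 | claude-opus-4.1 | Reasoning | 698 | 28 | 32.8% | 57.9% | 39.3 | 0.57 |
| 132 | internvl3-78b | Reasoning | 697 | 11 | 54.5% | 55.3% | 77.0 | 2.73 |
| 133 | deepseek-r1 | Reasoning | 691 | 13 | 61.5% | 52.1% | 43.2 | 0.00 |
| 134 | llama-3.3-70b-instruct | Reasoning | 687 | 52 | 54.5% | 56.7% | 56.3 | 0.22 |
| 135 | qwen3-max | Reasoning | 686 | 21 | 45.2% | 51.1% | 57.5 | 0.10 |
| 136 | claude-haiku-4.5 | Reasoning | 680 | 21 | 45.5% | 53.2% | 51.5 | 0.33 |
| 137 | glm-4.6v | Reasoning | 680 | 18 | 50.0% | 57.3% | 66.1 | 0.22 |
| 138 | grok-4.1-fast-non-reasoning | Reasoning | 668 | 14 | 39.3% | 53.4% | 43.7 | 0.21 |
| 139 | qwen2.5-max | Reasoning | 667 | 23 | 45.6% | 54.2% | 50.3 | 1.00 |
| 140 | qwen-plus-2025-07-28 | Continuation | 667 | 10 | 20.0% | 58.2% | 48.8 | 22.56 |
| 141 | claude-sonnet-4.6 | Continuation | 667 | 10 | 45.0% | 51.3% | 59.7 | 24.00 |
| 142 | qwen3-235b-a22b-thinking-2507 | Reasoning | 666 | 19 | 44.7% | 54.8% | 60.5 | 0.20 |
| 143 | gemini-2.0-flash-lite-001 | Continuation | 665 | 13 | 50.0% | 53.8% | 60.1 | 12.75 |
| 144 | qwen3-coder-next | Reasoning | 664 | 12 | 50.0% | 51.3% | 75.5 | 2.92 |
| 145 | claude-sonnet-4.5 | Reasoning | 663 | 30 | 45.0% | 58.1% | 46.2 | 0.20 |
| 146 | deepseek-r1-0528 | Reasoning | 662 | 16 | 43.8% | 52.0% | 56.8 | 0.25 |
| 147 | deepseek-v3.2-exp | Reasoning | 660 | 18 | 47.2% | 56.9% | 45.1 | 0.50 |
| 148 | devstral-2512 | Reasoning | 660 | 18 | 39.5% | 53.5% | 50.0 | 0.95 |
| 149 | gpt-4o-mini | Reasoning | 659 | 21 | 30.3% | 55.0% | 43.8 | 0.14 |
| 150 | qwen3-coder-plus | Reasoning | 659 | 14 | 39.3% | 56.1% | 69.4 | 0.62 |
| 151 | grok-2-latest ˟ | Reasoning | 659 | 16 | 56.3% | 53.6% | 73.4 | 1.25 |
| 152 | seed-1.6-flash | Reasoning | 659 | 21 | 40.5% | 53.2% | 72.5 | 0.71 |
| 153 | gemma-2-27b-it | Reasoning | 658 | 18 | 50.0% | 58.0% | 68.2 | 6.60 |
| 154 | glm-4.5 | Reasoning | 658 | 21 | 52.4% | 50.7% | 60.2 | 1.67 |
| 155 | claude-3.7-sonnet | Reasoning | 655 | 26 | 48.2% | 47.2% | 52.2 | 0.00 |
| 156 | claude-3.5-sonnet | Continuation | 653 | 11 | 40.9% | 53.2% | 61.0 | 11.83 |
| 157 | minimax-m1 | Reasoning | 653 | 15 | 33.3% | 56.6% | 52.0 | 0.14 |
| 158 | olmo-3-32b-think | Reasoning | 650 | 15 | 66.7% | 48.0% | 57.7 | 0.07 |
| 159 | phi-4 | Reasoning | 649 | 14 | 60.7% | 50.0% | 75.6 | 21.33 |
| 160 | qwen2.5-plus | Reasoning | 649 | 15 | 63.3% | 49.0% | 72.6 | 1.50 |
| 161 | intellect-3 | Reasoning | 649 | 7 | 71.4% | 50.8% | 79.4 | 0.43 |
| 162 | deepseek-v3-0324 | Continuation | 648 | 13 | 50.0% | 54.4% | 56.2 | 11.50 |
| 163 | hunyuan-a13b-instruct | Continuation | 648 | 8 | 37.5% | 57.5% | 56.1 | 88.50 |
| 164 | gpt-4 | Reasoning | 648 | 12 | 58.3% | 47.2% | 72.4 | 1.50 |
| 165 | gemini-2.5-flash-lite | Reasoning | 645 | 28 | 43.6% | 54.0% | 55.9 | 7.80 |
| 166 | claude-3-sonnet ˟ | Reasoning | 642 | 6 | 58.3% | 53.4% | 81.3 |  |
| 167 | claude-sonnet-4 | Reasoning | 640 | 33 | 44.4% | 51.3% | 55.1 | 0.00 |
| 168 | qwen3-235b-a22b | Reasoning | 639 | 19 | 42.1% | 51.4% | 42.9 | 0.00 |
| 169 | llama-3.1-nemotron-ultra-253b-v1 | Reasoning | 637 | 13 | 42.9% | 51.3% | 59.1 | 0.00 |
| 170 | gpt-oss-120b | Continuation | 637 | 13 | 34.6% | 49.6% | 57.7 | 6.00 |
| 171 | claude-sonnet-4.5 | Continuation | 634 | 17 | 47.1% | 47.6% | 45.8 | 15.75 |
| 172 | ernie-4.5-21b-a3b-thinking | Reasoning | 634 | 13 | 65.4% | 47.9% | 70.7 | 1.33 |
| 173 | gpt-4-turbo | Reasoning | 632 | 21 | 47.6% | 53.4% | 64.6 | 0.67 |
| 174 | qwen3-235b-a22b | Continuation | 630 | 10 | 35.0% | 55.6% | 52.7 | 60.00 |
| 175 | claude-haiku-4.5 | Continuation | 629 | 15 | 23.3% | 57.2% | 51.8 | 59.60 |
| 176 | ministral-14b-2512 | Reasoning | 628 | 13 | 42.3% | 56.8% | 64.9 | 9.31 |
| 177 | gemini-1.5-pro ˟ | Reasoning | 627 | 12 | 37.5% | 52.8% | 50.8 | 5.00 |
| 178 | gemini-2.0-flash-001 | Reasoning | 626 | 47 | 34.8% | 54.9% | 56.4 | 0.00 |
| 179 | deepseek-v3 | Continuation | 626 | 10 | 50.0% | 50.3% | 69.7 | 31.75 |
| 180 | qwen2.5-vl-32b-instruct | Reasoning | 626 | 2 | 50.0% | 52.9% | 150.5 | 4.00 |
| 181 | inflection-3-pi | Reasoning | 625 | 11 | 68.2% | 48.0% | 74.6 | 5.91 |
| 182 | mistral-large-3-2512 | Reasoning | 621 | 15 | 43.8% | 47.5% | 52.6 | 0.19 |
| 183 | grok-code-fast-1 | Continuation | 619 | 10 | 50.0% | 49.8% | 68.4 | 65.33 |
| 184 | llama-3.3-nemotron-super-49b-v1.5 | Reasoning | 619 | 11 | 50.0% | 49.0% | 64.1 | 0.00 |
| 185 | deepseek-v3-0324 | Reasoning | 613 | 27 | 35.9% | 50.2% | 49.9 | 0.25 |
| 186 | glm-4.6 | Reasoning | 613 | 23 | 47.8% | 45.2% | 56.9 | 1.91 |
| 187 | llama-3.3-70b-instruct | Continuation | 611 | 13 | 30.8% | 55.0% | 67.2 | 80.33 |
| 188 | qwen3-30b-a3b | Reasoning | 610 | 17 | 52.9% | 45.0% | 74.9 | 0.00 |
| 189 | deepseek-v3 | Reasoning | 609 | 17 | 41.2% | 53.4% | 42.3 | 0.20 |
| 190 | minimax-m2.1 | Reasoning | 607 | 12 | 29.2% | 52.8% | 51.9 | 9.08 |
| 191 | qwen3.5-397b-a17b | Continuation | 605 | 7 | 35.7% | 52.4% | 53.9 | 8.57 |
| 192 | devstral-2512 | Continuation | 604 | 9 | 55.6% | 46.8% | 67.9 | 43.00 |
| 193 | llama-3.3-nemotron-super-49b-v1 | Reasoning | 602 | 10 | 60.0% | 49.6% | 87.9 | 13.00 |
| 194 | grok-3 | Reasoning | 602 | 21 | 45.5% | 43.2% | 52.0 | 0.00 |
| 195 | devstral-small-2505 | Reasoning | 599 | 3 | 0.0% | 55.4% | 83.3 | 2.00 |
| 196 | minimax-m2.5 | Reasoning | 599 | 14 | 46.4% | 43.5% | 56.8 | 2.57 |
| 197 | magistral-medium-2506 | Reasoning | 598 | 10 | 40.0% | 51.0% | 78.9 | 1.25 |
| 198 | inflection-3-pi | Continuation | 598 | 1 | 50.0% | 50.1% | 54.0 | 33.00 |
| 199 | gpt-4.1-nano | Continuation | 597 | 8 | 50.0% | 47.2% | 100.6 | 288.50 |
| 200 | gemini-1.5-flash ˟ | Reasoning | 594 | 10 | 41.7% | 51.5% | 38.6 | 0.00 |
| 201 | llama-3.1-70b-instruct | Reasoning | 593 | 13 | 53.8% | 47.1% | 68.2 | 1.00 |
| 202 | qwen3-next-80b-a3b-instruct | Reasoning | 593 | 20 | 37.5% | 48.6% | 54.0 | 6.33 |
| 203 | aurora-alpha | Continuation | 593 | 1 | 50.0% | 48.3% | 97.0 | 37.00 |
| 204 | deepseek-v3.1 | Reasoning | 591 | 19 | 38.1% | 51.1% | 52.1 | 0.25 |
| 205 | deepseek-v3.2 | Continuation | 591 | 11 | 54.5% | 45.7% | 47.3 | 28.91 |
| 206 | llama-3.1-405b-instruct | Reasoning | 590 | 24 | 43.8% | 46.6% | 64.0 | 2.25 |
| 207 | nemotron-3-nano-30b-a3b | Continuation | 589 | 5 | 60.0% | 48.7% | 86.4 | 90.60 |
| 208 | gpt-4o-2024-11-20 | Reasoning | 588 | 25 | 36.5% | 44.8% | 58.2 | 0.25 |
| 209 | gemini-2.0-flash-lite-001 | Reasoning | 588 | 18 | 34.2% | 52.1% | 59.7 | 1.25 |
| 210 | devstral-medium | Reasoning | 588 | 12 | 66.7% | 40.1% | 78.4 | 4.42 |
| 211 | glm-4.5-air | Reasoning | 585 | 19 | 55.3% | 39.3% | 61.1 | 2.25 |
| 212 | qwen3-vl-235b-a22b-thinking | Reasoning | 585 | 11 | 50.0% | 46.8% | 56.6 | 0.00 |
| 213 | mimo-v2-flash | Reasoning | 585 | 13 | 43.3% | 42.1% | 61.2 | 2.00 |
| 214 | gemma-3-12b-it | Reasoning | 584 | 16 | 53.1% | 48.1% | 50.5 | 0.25 |
| 215 | claude-3.5-haiku | Reasoning | 581 | 22 | 38.6% | 55.7% | 59.5 | 0.00 |
| 216 | qwen3-coder-480b-a35b | Reasoning | 581 | 12 | 45.8% | 46.3% | 60.6 | 3.40 |
| 217 | qwq-32b | Reasoning | 580 | 13 | 34.6% | 48.0% | 61.8 | 0.67 |
| 218 | gemma-2-27b-it | Continuation | 579 | 6 | 41.7% | 47.5% | 45.2 | 62.00 |
| 219 | magistral-medium-2506:thinking | Reasoning | 578 | 2 | 0.0% | 51.8% | 29.5 |  |
| 220 | hunyuan-a13b-instruct | Reasoning | 576 | 14 | 42.9% | 49.5% | 69.2 | 19.50 |
| 221 | llama-4-maverick | Reasoning | 575 | 26 | 46.2% | 42.9% | 59.4 | 0.50 |
| 222 | ling-1t | Reasoning | 574 | 17 | 50.0% | 45.8% | 48.2 | 8.00 |
| 223 | qwen2.5-turbo | Reasoning | 573 | 13 | 53.8% | 50.1% | 80.6 | 2.33 |
| 224 | jamba-large-1.7 | Reasoning | 572 | 14 | 42.9% | 46.3% | 59.0 | 0.33 |
| 225 | gpt-4.1-nano | Reasoning | 570 | 20 | 38.6% | 51.8% | 57.0 | 3.00 |
| 226 | inflection-3-productivity | Reasoning | 570 | 11 | 54.5% | 46.7% | 87.5 | 6.27 |
| 227 | mistral-small-3.2-24b-instruct | Reasoning | 569 | 14 | 46.7% | 49.2% | 80.5 | 2.50 |
| 228 | ernie-4.5-300b-a47b | Reasoning | 569 | 19 | 36.8% | 51.8% | 72.2 | 0.00 |
| 229 | lfm2-8b-a1b | Reasoning | 569 | 19 | 68.4% | 38.8% | 98.9 | 9.47 |
| 230 | qwen3-next-80b-a3b-thinking | Continuation | 568 | 10 | 35.0% | 50.1% | 38.5 | 42.20 |
| 231 | deepseek-v3.1-terminus | Reasoning | 568 | 14 | 50.0% | 43.4% | 52.9 | 0.80 |
| 232 | claude-opus-4.5-thinking | Reasoning | 567 | 1 | 50.0% | 46.4% | 59.0 | 0.00 |
| 233 | ernie-4.5-21b-a3b | Reasoning | 566 | 18 | 50.0% | 47.7% | 72.5 | 8.57 |
| 234 | mistral-medium-3 | Reasoning | 565 | 17 | 41.2% | 45.7% | 49.8 | 0.33 |
| 235 | glm-4.7-flash | Reasoning | 562 | 11 | 31.8% | 49.4% | 73.0 | 1.73 |
| 236 | qwen3-30b-a3b-thinking-2507 | Reasoning | 561 | 13 | 50.0% | 44.6% | 83.3 | 0.00 |
| 237 | ministral-8b | Reasoning | 560 | 21 | 52.4% | 49.2% | 60.6 | 46.67 |
| 238 | command-r-plus-08-2024 | Reasoning | 560 | 13 | 50.0% | 42.8% | 73.5 | 10.33 |
| 239 | internvl3-78b | Continuation | 559 | 2 | 50.0% | 46.4% | 105.5 | 128.00 |
| 240 | qwen3-30b-a3b-instruct-2507 | Reasoning | 558 | 22 | 50.0% | 41.5% | 57.8 | 3.50 |
| 241 | gpt-3.5-turbo-instruct | Reasoning | 553 | 13 | 26.7% | 50.8% | 46.5 | 5.67 |
| 242 | mistral-small-24b-instruct-2501 | Reasoning | 553 | 15 | 63.3% | 38.3% | 76.8 | 1.50 |
| 243 | nova-2-lite-v1 | Reasoning | 553 | 13 | 39.3% | 47.8% | 79.4 | 1.50 |
| 244 | gemini-1.5-flash-8b ˟ | Reasoning | 552 | 10 | 50.0% | 43.5% | 43.6 | 4.00 |
| 245 | deepseek-r1 | Continuation | 551 | 2 | 50.0% | 46.4% | 61.0 | 28.00 |
| 246 | lfm-7b ˟ | Reasoning | 550 | 25 | 29.1% | 56.8% | 56.2 | 9.00 |
| 247 | gpt-3.5-turbo | Reasoning | 550 | 14 | 27.8% | 50.4% | 52.3 | 11.75 |
| 248 | hermes-4-70b | Reasoning | 550 | 2 | 75.0% | 43.8% | 59.5 | 1.00 |
| 249 | step-3.5-flash | Continuation | 548 | 11 | 36.4% | 42.0% | 46.0 | 18.73 |
| 250 | qwen3-next-80b-a3b-instruct | Continuation | 546 | 11 | 36.4% | 46.2% | 54.8 | 53.80 |
| 251 | glm-4-32b | Reasoning | 545 | 17 | 52.9% | 38.8% | 71.4 | 4.33 |
| 252 | kimi-k2-0905 | Continuation | 545 | 13 | 38.5% | 40.1% | 67.3 | 37.33 |
| 253 | minimax-m2.5 | Continuation | 544 | 1 | 25.0% | 44.9% | 67.5 | 52.00 |
| 254 | claude-3-opus ˟ | Reasoning | 543 | 10 | 35.0% | 47.6% | 64.2 | 5.80 |
| 255 | grok-3-mini | Continuation | 542 | 10 | 65.0% | 36.6% | 51.2 | 35.00 |
| 256 | mistral-medium-3.1 | Reasoning | 541 | 17 | 50.0% | 44.7% | 68.0 | 1.80 |
| 257 | qwen3-vl-32b-instruct | Reasoning | 541 | 7 | 42.9% | 46.1% | 49.6 | 0.71 |
| 258 | qwen3-max | Continuation | 540 | 10 | 40.0% | 41.0% | 69.8 | 67.00 |
| 259 | grok-4-fast-non-reasoning | Continuation | 540 | 12 | 45.8% | 42.1% | 72.7 | 51.29 |
| 260 | claude-3-haiku | Reasoning | 539 | 13 | 56.7% | 41.4% | 79.3 | 0.00 |
| 261 | qwen3-235b-a22b-instruct-2507 | Reasoning | 539 | 27 | 32.8% | 50.1% | 49.8 | 0.25 |
| 262 | command-r-08-2024 | Reasoning | 538 | 12 | 37.5% | 48.5% | 76.6 | 27.00 |
| 263 | llama-4-scout | Continuation | 537 | 10 | 55.0% | 41.2% | 91.4 | 99.00 |
| 264 | gemma-3-27b-it | Reasoning | 536 | 19 | 34.2% | 51.2% | 47.7 | 3.43 |
| 265 | llama-4-maverick | Continuation | 535 | 13 | 42.3% | 45.4% | 87.5 | 109.33 |
| 266 | claude-3.5-sonnet | Reasoning | 533 | 13 | 38.5% | 45.0% | 46.2 | 1.33 |
| 267 | longcat-flash-chat | Continuation | 533 | 10 | 27.3% | 46.6% | 45.2 | 21.80 |
| 268 | deepseek-v3.2-exp | Continuation | 532 | 6 | 28.6% | 48.1% | 42.9 | 17.50 |
| 269 | mimo-v2-flash | Continuation | 531 | 10 | 50.0% | 42.7% | 74.2 | 37.80 |
| 270 | mistral-large-2-2411 | Continuation | 530 | 10 | 30.0% | 43.0% | 93.1 | 83.00 |
| 271 | qwen3-vl-235b-a22b-instruct | Reasoning | 527 | 12 | 29.2% | 47.5% | 58.4 | 0.00 |
| 272 | gemma-2-9b-it | Reasoning | 526 | 19 | 50.0% | 41.9% | 92.7 | 47.43 |
| 273 | llama-3.1-405b-instruct | Continuation | 520 | 10 | 40.0% | 44.0% | 102.5 | 62.00 |
| 274 | claude-3.7-sonnet:thinking | Reasoning | 520 | 2 | 50.0% | 42.0% | 58.0 | 0.00 |
| 275 | qwen3-vl-8b-instruct | Reasoning | 519 | 7 | 42.9% | 44.0% | 80.1 | 2.86 |
| 276 | llama-4-scout | Reasoning | 517 | 26 | 36.5% | 45.5% | 65.8 | 0.00 |
| 277 | glm-z1-32b | Reasoning | 517 | 2 | 75.0% | 41.0% | 44.5 |  |
| 278 | gemini-1.5-flash ˟ | Continuation | 517 | 3 | 66.7% | 39.6% | 74.3 |  |
| 279 | mistral-large-3-2512 | Continuation | 517 | 11 | 63.6% | 33.9% | 96.0 | 70.36 |
| 280 | wizardlm-2-8x22b | Reasoning | 515 | 12 | 29.2% | 48.6% | 48.8 | 3.50 |
| 281 | seed-1.6-flash | Continuation | 513 | 2 | 50.0% | 43.7% | 47.0 | 59.50 |
| 282 | deepseek-prover-v2 | Reasoning | 510 | 8 | 37.5% | 46.0% | 45.9 | 0.80 |
| 283 | mistral-nemo | Reasoning | 510 | 15 | 35.3% | 44.3% | 81.4 | 48.50 |
| 284 | jamba-large-1.6 | Reasoning | 507 | 6 | 41.7% | 42.7% | 85.3 | 1.00 |
| 285 | devstral-small | Reasoning | 506 | 12 | 33.3% | 48.3% | 74.0 | 1.58 |
| 286 | magistral-small-2506 | Continuation | 505 | 2 | 0.0% | 43.3% | 23.0 | 43.00 |
| 287 | llama-3.1-8b-instruct | Reasoning | 503 | 30 | 56.5% | 31.6% | 106.6 | 20.00 |
| 288 | magistral-small-2506 | Reasoning | 503 | 11 | 50.0% | 39.3% | 79.2 | 2.00 |
| 289 | llama-3-8b-instruct | Reasoning | 503 | 14 | 40.0% | 45.3% | 105.3 | 14.64 |
| 290 | qwen3.5-plus-02-15 | Continuation | 503 | 8 | 56.3% | 34.5% | 75.9 | 36.00 |
| 291 | molmo-2-8b | Reasoning | 502 | 12 | 45.8% | 45.2% | 93.1 | 9.17 |
| 292 | gemma-3-27b-it | Continuation | 499 | 7 | 7.1% | 47.6% | 58.9 | 49.50 |
| 293 | olmo-3.1-32b-instruct | Reasoning | 499 | 12 | 29.2% | 46.9% | 58.4 | 1.17 |
| 294 | command-r-08-2024 | Continuation | 495 | 6 | 25.0% | 43.1% | 101.8 | 231.00 |
| 295 | grok-4.1-fast-non-reasoning | Continuation | 495 | 11 | 40.9% | 39.6% | 72.0 | 80.91 |
| 296 | qwen3-coder-480b-a35b | Continuation | 494 | 9 | 50.0% | 38.4% | 76.6 | 57.33 |
| 297 | qwen3-4b | Reasoning | 493 | 5 | 60.0% | 39.7% | 66.8 | 1.00 |
| 298 | llama-3.2-3b-instruct | Reasoning | 491 | 14 | 50.0% | 43.9% | 95.6 | 37.00 |
| 299 | kimi-linear-48b-a3b-instruct | Reasoning | 489 | 11 | 33.3% | 40.2% | 54.3 | 9.92 |
| 300 | jamba-large-1.7 | Continuation | 488 | 10 | 40.0% | 41.8% | 84.5 | 118.00 |
| 301 | ministral-8b-2512 | Reasoning | 488 | 16 | 40.6% | 43.4% | 68.0 | 6.00 |
| 302 | devstral-small | Continuation | 487 | 2 | 33.3% | 39.2% | 55.3 | 50.00 |
| 303 | qwen3-vl-30b-a3b-thinking | Reasoning | 487 | 3 | 66.7% | 35.9% | 52.7 | 0.00 |
| 304 | qwen-2.5-7b-instruct | Reasoning | 487 | 15 | 46.7% | 38.0% | 86.1 | 11.20 |
| 305 | deepseek-r1-0528-qwen3-8b | Reasoning | 486 | 3 | 50.0% | 37.9% | 90.3 | 1.00 |
| 306 | glm-4-32b | Continuation | 486 | 8 | 37.5% | 40.4% | 79.2 | 216.00 |
| 307 | deepseek-v3.1 | Continuation | 483 | 8 | 50.0% | 35.5% | 86.8 | 24.50 |
| 308 | ernie-4.5-300b-a47b | Continuation | 482 | 8 | 37.5% | 36.4% | 128.2 | 72.00 |
| 309 | jamba-large-1.6 | Continuation | 480 | 5 | 40.0% | 39.8% | 90.8 |  |
| 310 | hermes-4-405b | Reasoning | 479 | 2 | 25.0% | 41.2% | 59.0 | 1.00 |
| 311 | qwen2.5-72b-instruct | Continuation | 476 | 11 | 45.5% | 37.4% | 90.1 | 79.20 |
| 312 | deepseek-prover-v2 | Continuation | 474 | 4 | 37.5% | 38.8% | 100.0 | 76.00 |
| 313 | mistral-small-3.1-24b-instruct | Reasoning | 473 | 19 | 42.1% | 35.8% | 69.4 | 2.00 |
| 314 | inflection-3-productivity | Continuation | 471 | 1 | 50.0% | 38.2% | 111.0 | 79.00 |
| 315 | tng-r1t-chimera | Reasoning | 471 | 1 | 50.0% | 38.3% | 67.0 | 0.00 |
| 316 | olmo-3.1-32b-think | Reasoning | 471 | 1 | 0.0% | 40.0% | 27.0 | 0.00 |
| 317 | llama-3.3-8b-instruct | Reasoning | 470 | 15 | 43.3% | 41.3% | 75.6 | 9.00 |
| 318 | rnj-1-instruct | Reasoning | 464 | 12 | 42.3% | 38.0% | 58.7 | 9.23 |
| 319 | seed-1.6 | Continuation | 462 | 3 | 50.0% | 38.1% | 58.0 | 68.67 |
| 320 | phi-4 | Continuation | 461 | 7 | 50.0% | 36.1% | 71.7 | 96.50 |
| 321 | mistral-small-creative | Reasoning | 461 | 13 | 46.2% | 36.3% | 70.2 | 6.85 |
| 322 | gemma-3-4b-it | Reasoning | 460 | 14 | 50.0% | 37.9% | 79.9 | 12.00 |
| 323 | mythomax-l2-13b | Reasoning | 460 | 11 | 31.8% | 44.8% | 85.5 | 38.27 |
| 324 | granite-4.0-h-micro | Reasoning | 459 | 11 | 50.0% | 38.5% | 56.5 | 19.45 |
| 325 | minimax-m1 | Continuation | 456 | 1 | 50.0% | 35.2% | 153.0 | 174.00 |
| 326 | qwen2.5-vl-72b-instruct | Reasoning | 454 | 1 | 50.0% | 35.1% | 85.0 | 0.00 |
| 327 | deepseek-r1t-chimera | Reasoning | 453 | 1 | 50.0% | 35.7% | 58.0 |  |
| 328 | olmo-2-0325-32b-instruct | Reasoning | 453 | 13 | 38.5% | 36.4% | 94.7 | 5.92 |
| 329 | olmo-3-7b-think | Reasoning | 453 | 11 | 40.9% | 39.9% | 106.3 | 1.27 |
| 330 | glm-4.6 | Continuation | 446 | 9 | 50.0% | 30.9% | 80.4 | 66.00 |
| 331 | glm-4.5-air | Continuation | 443 | 8 | 50.0% | 34.4% | 68.4 | 41.00 |
| 332 | lfm2-8b-a1b | Continuation | 443 | 1 | 50.0% | 35.7% | 80.0 | 166.00 |
| 333 | afm-4.5b | Reasoning | 442 | 14 | 50.0% | 31.2% | 97.3 | 22.67 |
| 334 | ui-tars-1.5-7b | Continuation | 442 | 1 | 50.0% | 35.5% | 80.0 | 124.00 |
| 335 | minimax-m2-her | Reasoning | 441 | 1 | 50.0% | 35.2% | 176.0 | 19.00 |
| 336 | qwen3-coder-plus | Continuation | 440 | 5 | 40.0% | 35.4% | 51.8 | 58.33 |
| 337 | ministral-3b | Reasoning | 439 | 17 | 52.9% | 30.2% | 65.6 | 47.71 |
| 338 | llama-3.1-nemotron-ultra-253b-v1 | Continuation | 438 | 2 | 50.0% | 34.4% | 126.0 | 154.00 |
| 339 | seed-oss-36b-instruct | Continuation | 438 | 3 | 33.3% | 34.5% | 83.0 |  |
| 340 | qwen2.5-plus | Continuation | 437 | 9 | 50.0% | 33.0% | 71.9 | 42.33 |
| 341 | deepseek-r1-0528 | Continuation | 437 | 2 | 50.0% | 31.9% | 87.5 | 58.00 |
| 342 | qwen2.5-turbo | Continuation | 437 | 10 | 50.0% | 31.7% | 95.9 | 141.67 |
| 343 | gemma-3-12b-it | Continuation | 435 | 2 | 50.0% | 33.1% | 108.0 | 211.00 |
| 344 | trinity-large-preview | Reasoning | 434 | 1 | 50.0% | 33.5% | 72.0 | 1.00 |
| 345 | ministral-3b-2512 | Reasoning | 433 | 11 | 36.4% | 42.1% | 92.5 | 5.27 |
| 346 | llama-3.1-nemotron-70b-instruct | Reasoning | 432 | 2 | 0.0% | 36.7% | 14.7 | 0.00 |
| 347 | command-r7b-12-2024 | Reasoning | 431 | 17 | 44.1% | 31.4% | 127.6 | 28.06 |
| 348 | glm-4.7-flash | Continuation | 431 | 1 | 100.0% | 31.9% | 107.0 | 173.00 |
| 349 | jamba-mini-1.6 | Reasoning | 430 | 7 | 42.9% | 32.0% | 86.0 | 12.00 |
| 350 | mistral-medium-3 | Continuation | 428 | 9 | 50.0% | 33.4% | 79.2 | 58.50 |
| 351 | jamba-mini-1.7 | Continuation | 428 | 5 | 50.0% | 33.7% | 134.8 | 302.50 |
| 352 | llama-3.3-nemotron-super-49b-v1.5 | Continuation | 428 | 1 | 50.0% | 33.7% | 188.0 | 271.00 |
| 353 | phi-3-medium-128k-instruct | Reasoning | 425 | 12 | 45.8% | 30.8% | 106.9 | 13.00 |
| 354 | lfm-3b ˟ | Continuation | 424 | 1 | 50.0% | 31.1% | 152.0 |  |
| 355 | glm-4.5 | Continuation | 423 | 9 | 50.0% | 29.8% | 74.3 | 38.00 |
| 356 | mistral-7b-instruct-v0.1 | Reasoning | 422 | 12 | 45.8% | 29.9% | 81.7 | 64.00 |
| 357 | olmo-3-7b-instruct | Reasoning | 422 | 17 | 35.3% | 32.3% | 72.6 | 9.35 |
| 358 | qwen3-coder-next | Continuation | 421 | 5 | 50.0% | 28.4% | 94.9 | 108.71 |
| 359 | lfm-2.2-6b | Reasoning | 417 | 12 | 41.7% | 32.0% | 115.7 | 93.00 |
| 360 | jamba-mini-1.7 | Reasoning | 416 | 15 | 33.3% | 35.0% | 115.0 | 15.67 |
| 361 | gemma-3n-e4b-it | Reasoning | 415 | 16 | 37.5% | 39.0% | 65.7 | 27.40 |
| 362 | qwen2.5-max | Continuation | 414 | 7 | 42.9% | 29.3% | 90.3 | 35.00 |
| 363 | qwen3-vl-235b-a22b-thinking | Continuation | 414 | 1 | 50.0% | 30.0% | 100.0 | 81.00 |
| 364 | gemini-1.5-flash-8b ˟ | Continuation | 413 | 1 | 0.0% | 34.1% | 40.0 |  |
| 365 | lfm-3b ˟ | Reasoning | 412 | 14 | 42.9% | 29.0% | 89.0 | 15.00 |
| 366 | qwen3-30b-a3b-instruct-2507 | Continuation | 412 | 6 | 41.7% | 31.7% | 77.2 | 174.00 |
| 367 | glm-4.6v | Continuation | 411 | 10 | 50.0% | 29.9% | 101.7 | 137.30 |
| 368 | gemma-2-9b-it | Continuation | 410 | 1 | 50.0% | 31.4% | 132.0 | 193.00 |
| 369 | qwen3-30b-a3b-thinking-2507 | Continuation | 408 | 1 | 50.0% | 31.1% | 132.0 | 190.00 |
| 370 | claude-3-opus ˟ | Continuation | 407 | 2 | 75.0% | 27.6% | 72.0 | 18.00 |
| 371 | mistral-medium-3.1 | Continuation | 406 | 8 | 37.5% | 33.2% | 90.5 | 96.67 |
| 372 | ui-tars-1.5-7b | Reasoning | 405 | 19 | 36.8% | 39.3% | 92.1 | 33.50 |
| 373 | claude-3.5-haiku | Continuation | 395 | 2 | 50.0% | 29.4% | 84.5 | 65.00 |
| 374 | qwen3-235b-a22b-thinking-2507 | Continuation | 392 | 2 | 50.0% | 28.8% | 84.0 | 68.00 |
| 375 | wizardlm-2-8x22b | Continuation | 391 | 2 | 50.0% | 27.8% | 92.0 | 81.00 |
| 376 | ernie-4.5-21b-a3b | Continuation | 391 | 3 | 50.0% | 28.7% | 117.3 | 385.00 |
| 377 | qwq-32b | Continuation | 385 | 1 | 50.0% | 27.8% | 79.0 | 110.00 |
| 378 | claude-3-haiku | Continuation | 384 | 5 | 50.0% | 27.4% | 158.4 | 236.50 |
| 379 | gemma-3n-e4b-it | Continuation | 384 | 2 | 50.0% | 27.6% | 128.0 | 173.00 |
| 380 | kimi-linear-48b-a3b-instruct | Continuation | 384 | 5 | 50.0% | 27.1% | 148.8 | 307.00 |
| 381 | olmo-3-32b-think | Continuation | 384 | 1 | 100.0% | 24.0% | 70.0 | 96.00 |
| 382 | olmo-2-0325-32b-instruct | Continuation | 384 | 2 | 50.0% | 27.0% | 87.5 | 148.50 |
| 383 | qwen3-30b-a3b | Continuation | 379 | 2 | 50.0% | 26.4% | 75.0 | 142.00 |
| 384 | command-a | Continuation | 378 | 6 | 41.7% | 26.7% | 101.3 | 58.00 |
| 385 | command-r-plus-08-2024 | Continuation | 378 | 5 | 50.0% | 26.4% | 123.8 | 226.00 |
| 386 | qwen3-vl-32b-instruct | Continuation | 378 | 1 | 50.0% | 26.6% | 79.0 | 65.00 |
| 387 | mistral-nemo | Continuation | 375 | 2 | 50.0% | 26.1% | 108.0 | 110.00 |
| 388 | mistral-small-3.2-24b-instruct | Continuation | 374 | 3 | 50.0% | 25.9% | 73.0 | 90.00 |
| 389 | ministral-14b-2512 | Continuation | 372 | 1 | 50.0% | 25.6% | 198.0 | 328.00 |
| 390 | ministral-8b-2512 | Continuation | 370 | 1 | 50.0% | 25.4% | 198.0 | 341.00 |
| 391 | deepseek-r1-distill-llama-8b | Reasoning | 367 | 2 | 75.0% | 21.6% | 96.0 |  |
| 392 | qwen3-235b-a22b-instruct-2507 | Continuation | 366 | 11 | 22.7% | 32.3% | 68.2 | 41.20 |
| 393 | llama-3.1-nemotron-70b-instruct | Continuation | 366 | 1 | 50.0% | 24.2% | 195.0 | 228.00 |
| 394 | olmo-3-7b-instruct | Continuation | 365 | 1 | 50.0% | 24.5% | 182.0 | 414.00 |
| 395 | llama-3.1-8b-instruct | Continuation | 364 | 2 | 50.0% | 24.3% | 87.0 | 60.00 |
| 396 | olmo-3-7b-think | Continuation | 363 | 1 | 50.0% | 24.0% | 182.0 | 395.00 |
| 397 | mistral-small-3.1-24b-instruct | Continuation | 361 | 2 | 50.0% | 23.6% | 87.0 | 67.00 |
| 398 | claude-3-sonnet ˟ | Continuation | 361 | 1 | 50.0% | 23.5% | 254.0 |  |
| 399 | olmo-3.1-32b-instruct | Continuation | 356 | 2 | 50.0% | 20.9% | 148.5 | 174.50 |
| 400 | qwen3-32b | Continuation | 351 | 6 | 33.3% | 27.6% | 76.2 | 141.00 |
| 401 | llama-3.1-70b-instruct | Continuation | 339 | 1 | 50.0% | 19.5% | 189.0 | 177.00 |
| 402 | qwen3-vl-235b-a22b-instruct | Continuation | 339 | 1 | 50.0% | 19.5% | 189.0 | 65.00 |
| 403 | deepseek-r1-distill-qwen-7b | Reasoning | 333 | 2 | 25.0% | 21.8% | 96.0 |  |
| 404 | mistral-small-24b-instruct-2501 | Continuation | 329 | 2 | 50.0% | 17.0% | 80.0 | 77.00 |
| 405 | llama-3.3-nemotron-super-49b-v1 | Continuation | 324 | 1 | 50.0% | 14.4% | 103.0 |  |
| 406 | qwen2.5-vl-32b-instruct | Continuation | 316 | 1 | 50.0% | 14.1% | 75.0 |  |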
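
Several entries in the leaderboard rest on a single game, so a reader may want to narrow the list by mode, minimum ELO, and minimum game count before comparing models. The following is a minimal Python sketch of such a filter. The `Row` class, the `filter_rows` function, and their parameter names are hypothetical and are not part of dubesor.de or Chess AI Bench; the sample rows are copied from the leaderboard, with a missing Illegal/Game value represented as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Row:
    """One leaderboard entry (hypothetical container, not an official schema)."""
    model: str
    mode: str                          # "Reasoning" or "Continuation"
    elo: int
    games: int
    win_rate: float                    # percent
    accuracy: float                    # percent
    avg_turns: float
    illegal_per_game: Optional[float]  # None where the leaderboard shows no value


# A handful of rows copied from the leaderboard.
ROWS = [
    Row("gemini-3-pro-preview", "Reasoning", 1845, 40, 86.4, 89.5, 37.1, 0.00),
    Row("qwen3-max-thinking", "Reasoning", 1800, 1, 100.0, 91.0, 63.0, 0.00),
    Row("gemini-3-pro-preview", "Continuation", 1795, 27, 92.6, 88.6, 43.3, 0.59),
    Row("gpt-3.5-turbo-instruct", "Continuation", 1379, 88, 69.1, 81.8, 34.1, 2.66),
    Row("lfm-7b", "Continuation", 796, 11, 26.3, 65.5, 65.1, None),
]


def filter_rows(rows: List[Row], mode: Optional[str] = None,
                min_elo: int = 0, min_games: int = 0) -> List[Row]:
    """Keep rows matching the mode (if given) and both minimum thresholds."""
    return [
        r for r in rows
        if (mode is None or r.mode == mode)
        and r.elo >= min_elo
        and r.games >= min_games
    ]


if __name__ == "__main__":
    # Drop single-game entries and anything rated below 1000.
    for r in filter_rows(ROWS, min_elo=1000, min_games=10):
        print(f"{r.model:24} {r.mode:13} ELO {r.elo:4} games {r.games}")
```

With `min_games=10`, the single-game `qwen3-max-thinking` entry drops out even though its ELO matches that of a 20-game entry, which is the kind of small-sample row such a filter is meant to screen.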