Gameplay Leaderboard
How well do LLMs actually play chess? Data aggregated from independent research — we do not run these evaluations ourselves.
Data by dubesor.de: independent evaluation, not affiliated with Chess AI Bench. Cached Feb 25, 2026.
406 entries (each model is listed once per mode)
| # | Model | Mode | ELO | Games | Win Rate | Accuracy | Avg Turns | Illegal/Game |
|---|---|---|---|---|---|---|---|---|
| 1 | gemini-3-pro-preview | Reasoning | 1845 | 40 | 86.4% | 89.5% | 37.1 | 0.00 |
| 2 | gemini-3.1-pro-preview | Reasoning | 1837 | 15 | 77.8% | 88.3% | 40.6 | 0.00 |
| 3 | gpt-4.5-preview ˟ | Continuation | 1800 | 20 | 95.5% | 90.2% | 41.5 | 0.50 |
| 4 | qwen3-max-thinking | Reasoning | 1800 | 1 | 100.0% | 91.0% | 63.0 | 0.00 |
| 5 | gemini-3-pro-preview | Continuation | 1795 | 27 | 92.6% | 88.6% | 43.3 | 0.59 |
| 6 | gpt-5.1-codex | Reasoning | 1785 | 16 | 78.1% | 89.6% | 45.3 | 0.00 |
| 7 | gpt-5-codex | Reasoning | 1777 | 23 | 84.8% | 96.1% | 40.5 | 0.00 |
| 8 | gemini-3-flash-preview | Continuation | 1767 | 24 | 89.6% | 88.9% | 44.5 | 0.25 |
| 9 | gemini-3.1-pro-preview | Continuation | 1661 | 12 | 54.2% | 89.3% | 43.3 | 0.33 |
| 10 | grok-4 | Reasoning | 1615 | 27 | 63.0% | 90.5% | 39.7 | 0.00 |
| 11 | chatgpt-4o-latest ˟ | Continuation | 1584 | 18 | 63.9% | 87.2% | 48.0 | 2.29 |
| 12 | o3 | Reasoning | 1558 | 35 | 80.0% | 83.6% | 47.2 | 0.00 |
| 13 | gpt-5 | Reasoning | 1526 | 23 | 75.0% | 82.1% | 39.2 | 0.00 |
| 14 | gpt-5.1 | Reasoning | 1526 | 19 | 57.9% | 85.7% | 41.2 | 0.00 |
| 15 | gpt-5-chat | Continuation | 1497 | 28 | 62.5% | 87.8% | 41.2 | 5.11 |
| 16 | gpt-4o | Continuation | 1463 | 37 | 73.2% | 76.6% | 39.4 | 1.44 |
| 17 | gpt-5 | Continuation | 1443 | 17 | 61.8% | 84.0% | 45.3 | 1.00 |
| 18 | gemini-3-flash-preview | Reasoning | 1436 | 33 | 67.1% | 84.5% | 34.7 | 0.02 |
| 19 | gpt-5.1-codex | Continuation | 1429 | 14 | 50.0% | 83.1% | 43.4 | 3.29 |
| 20 | gpt-5.1-chat | Continuation | 1401 | 17 | 52.9% | 85.0% | 39.1 | 4.65 |
| 21 | gpt-3.5-turbo-instruct | Continuation | 1379 | 88 | 69.1% | 81.8% | 34.1 | 2.66 |
| 22 | gpt-3.5-turbo | Continuation | 1374 | 63 | 80.2% | 76.5% | 31.8 | 2.67 |
| 23 | gpt-5.1-codex-max | Continuation | 1353 | 14 | 39.3% | 81.5% | 39.3 | 3.93 |
| 24 | o3 | Continuation | 1347 | 17 | 41.2% | 80.5% | 40.9 | 3.00 |
| 25 | gpt-5.1-codex-max | Reasoning | 1342 | 18 | 44.4% | 81.7% | 44.7 | 0.00 |
| 26 | gpt-5.2-codex | Reasoning | 1327 | 17 | 47.1% | 83.3% | 39.7 | 0.00 |
| 27 | gpt-4o-2024-11-20 | Continuation | 1308 | 23 | 58.7% | 78.3% | 36.6 | 1.20 |
| 28 | gpt-5-codex | Continuation | 1291 | 13 | 69.2% | 76.4% | 46.2 | 4.00 |
| 29 | gpt-4.1-mini | Continuation | 1261 | 28 | 65.7% | 78.5% | 37.5 | 21.58 |
| 30 | gpt-4.1 | Continuation | 1239 | 37 | 59.2% | 71.1% | 39.6 | 1.00 |
| 31 | step-3.5-flash | Reasoning | 1232 | 21 | 64.3% | 80.0% | 34.4 | 0.00 |
| 32 | gpt-5.1 | Continuation | 1231 | 12 | 41.7% | 77.0% | 40.2 | 5.92 |
| 33 | grok-4.1-fast-reasoning | Reasoning | 1207 | 28 | 58.9% | 74.7% | 31.0 | 0.00 |
| 34 | gpt-4 | Continuation | 1203 | 15 | 46.7% | 77.6% | 43.3 | 1.80 |
| 35 | gemini-2.0-flash-001 | Continuation | 1196 | 23 | 46.0% | 78.5% | 43.4 | 5.29 |
| 36 | gpt-5.2 | Reasoning | 1174 | 18 | 44.4% | 72.1% | 51.8 | 0.00 |
| 37 | grok-4-fast-reasoning | Reasoning | 1145 | 30 | 51.7% | 77.3% | 34.3 | 0.00 |
| 38 | gpt-5.2-codex | Continuation | 1145 | 13 | 34.6% | 73.7% | 41.6 | 5.85 |
| 39 | gpt-5-mini | Continuation | 1117 | 16 | 34.4% | 77.3% | 46.0 | 7.67 |
| 40 | gpt-5-nano | Reasoning | 1112 | 30 | 56.7% | 72.5% | 40.3 | 0.00 |
| 41 | codex-mini ˟ | Reasoning | 1108 | 35 | 54.2% | 81.2% | 43.1 | 0.00 |
| 42 | gpt-5-mini | Reasoning | 1107 | 30 | 42.2% | 80.3% | 36.5 | 0.00 |
| 43 | gpt-5.3-codex | Reasoning | 1089 | 8 | 43.8% | 71.2% | 41.2 | 0.00 |
| 44 | gpt-5.3-codex | Continuation | 1082 | 7 | 21.4% | 73.7% | 47.7 | 8.29 |
| 45 | claude-opus-4.1 | Continuation | 1078 | 20 | 42.5% | 74.7% | 47.5 | 13.14 |
| 46 | deepseek-v3.2-speciale | Reasoning | 1072 | 23 | 58.7% | 69.3% | 53.4 | 0.00 |
| 47 | gemini-2.5-pro | Continuation | 1069 | 25 | 22.0% | 76.7% | 39.7 | 7.11 |
| 48 | gpt-5-nano | Continuation | 1049 | 16 | 56.3% | 74.9% | 40.6 | 4.67 |
| 49 | o4-mini | Reasoning | 1034 | 37 | 56.6% | 73.6% | 44.6 | 0.00 |
| 50 | gpt-4-turbo | Continuation | 1024 | 19 | 42.1% | 72.5% | 42.3 | 14.40 |
| 51 | o4-mini | Continuation | 995 | 12 | 41.7% | 70.3% | 40.2 | 2.50 |
| 52 | gpt-4.5-preview ˟ | Reasoning | 992 | 15 | 66.7% | 70.9% | 43.3 | 0.00 |
| 53 | grok-4 | Continuation | 992 | 12 | 20.8% | 72.5% | 40.4 | 2.50 |
| 54 | o1 | Continuation | 989 | 7 | 14.3% | 70.6% | 44.6 | 11.50 |
| 55 | gpt-4o-mini | Continuation | 955 | 13 | 38.6% | 69.0% | 42.5 | 27.58 |
| 56 | gpt-5.1-codex-mini | Reasoning | 952 | 22 | 43.2% | 65.6% | 39.1 | 0.00 |
| 57 | gemini-2.5-pro | Reasoning | 943 | 46 | 55.1% | 58.1% | 39.7 | 0.08 |
| 58 | grok-code-fast-1 | Reasoning | 927 | 23 | 52.1% | 69.2% | 49.8 | 0.00 |
| 59 | kimi-k2.5 | Reasoning | 927 | 26 | 46.2% | 72.5% | 41.7 | 0.12 |
| 60 | gpt-oss-120b | Reasoning | 925 | 35 | 61.4% | 62.8% | 44.0 | 0.08 |
| 61 | gpt-5.2 | Continuation | 920 | 12 | 25.0% | 66.0% | 42.0 | 10.08 |
| 62 | gpt-oss-20b | Continuation | 906 | 19 | 42.5% | 69.5% | 42.5 | 8.29 |
| 63 | claude-opus-4.5 | Continuation | 900 | 17 | 41.2% | 67.3% | 55.9 | 12.41 |
| 64 | gemini-2.5-flash | Continuation | 896 | 22 | 50.0% | 63.6% | 42.7 | 3.43 |
| 65 | grok-4.1-fast-reasoning | Continuation | 894 | 12 | 75.0% | 63.7% | 55.8 | 14.17 |
| 66 | nemotron-3-nano-30b-a3b | Reasoning | 892 | 26 | 61.1% | 68.2% | 46.1 | 0.07 |
| 67 | gemini-1.5-pro ˟ | Continuation | 888 | 10 | 50.0% | 64.0% | 66.3 | 13.67 |
| 68 | claude-opus-4 | Continuation | 871 | 18 | 41.7% | 61.8% | 52.9 | 12.50 |
| 69 | o1-mini ˟ | Continuation | 869 | 8 | 6.3% | 68.5% | 55.4 | 14.00 |
| 70 | claude-opus-4.5 | Reasoning | 869 | 28 | 42.2% | 63.9% | 43.6 | 0.03 |
| 71 | gpt-oss-20b | Reasoning | 854 | 31 | 64.5% | 62.8% | 52.4 | 0.18 |
| 72 | codestral-2508 | Reasoning | 854 | 1 | 50.0% | 64.7% | 93.0 | 0.00 |
| 73 | gpt-5.1-chat | Reasoning | 853 | 21 | 38.6% | 65.3% | 37.6 | 0.00 |
| 74 | minimax-m2 | Continuation | 838 | 12 | 20.8% | 66.5% | 50.0 | 38.92 |
| 75 | codex-mini ˟ | Continuation | 833 | 10 | 45.8% | 62.6% | 45.5 | 32.00 |
| 76 | gpt-4o | Reasoning | 829 | 38 | 47.7% | 61.2% | 51.4 | 0.14 |
| 77 | claude-opus-4.6 | Continuation | 827 | 15 | 43.3% | 61.5% | 50.1 | 10.67 |
| 78 | gpt-5.2-chat | Reasoning | 825 | 20 | 38.1% | 62.4% | 33.2 | 0.10 |
| 79 | gpt-5.1-codex-mini | Continuation | 824 | 10 | 40.0% | 61.1% | 37.7 | 9.70 |
| 80 | qwen3.5-397b-a17b | Reasoning | 823 | 15 | 50.0% | 65.9% | 48.3 | 0.00 |
| 81 | deepseek-v3.2-speciale | Continuation | 819 | 10 | 35.0% | 61.9% | 36.2 | 9.40 |
| 82 | seed-oss-36b-instruct | Reasoning | 818 | 13 | 61.5% | 60.8% | 47.8 | 0.00 |
| 83 | gpt-4.1 | Reasoning | 817 | 40 | 53.5% | 61.6% | 51.2 | 0.00 |
| 84 | qwen3.5-plus-02-15 | Reasoning | 812 | 14 | 57.1% | 62.4% | 50.0 | 0.57 |
| 85 | claude-opus-4.6 | Reasoning | 811 | 27 | 44.8% | 62.5% | 44.1 | 0.03 |
| 86 | o1 | Reasoning | 807 | 11 | 36.4% | 60.7% | 48.0 | 0.00 |
| 87 | seed-1.6 | Reasoning | 802 | 21 | 45.2% | 65.1% | 40.9 | 0.00 |
| 88 | chatgpt-4o-latest ˟ | Reasoning | 799 | 17 | 38.9% | 63.0% | 47.7 | 0.31 |
| 89 | lfm-7b ˟ | Continuation | 796 | 11 | 26.3% | 65.5% | 65.1 | — |
| 90 | gpt-5.2-chat | Continuation | 793 | 12 | 29.2% | 59.9% | 56.7 | 15.50 |
| 91 | glm-5 | Reasoning | 791 | 22 | 43.8% | 62.2% | 51.0 | 0.00 |
| 92 | claude-sonnet-4.6 | Reasoning | 790 | 16 | 41.7% | 59.6% | 40.1 | 0.22 |
| 93 | gpt-5-chat | Reasoning | 787 | 26 | 46.3% | 57.0% | 50.0 | 0.00 |
| 94 | qwen3-next-80b-a3b-thinking | Reasoning | 783 | 16 | 68.8% | 55.0% | 51.9 | 0.40 |
| 95 | claude-sonnet-4 | Continuation | 780 | 17 | 32.4% | 66.5% | 51.9 | 27.75 |
| 96 | grok-3 | Continuation | 780 | 17 | 41.7% | 66.0% | 58.7 | 6.80 |
| 97 | deepseek-v3.2 | Reasoning | 780 | 20 | 62.5% | 56.4% | 56.6 | 0.00 |
| 98 | minimax-m2.1 | Continuation | 780 | 8 | 43.8% | 62.7% | 65.4 | 76.00 |
| 99 | kimi-k2.5 | Continuation | 772 | 12 | 45.8% | 59.7% | 52.7 | 16.58 |
| 100 | grok-4-fast-reasoning | Continuation | 771 | 14 | 35.7% | 61.5% | 59.1 | 25.75 |
| 101 | o1-mini ˟ | Reasoning | 766 | 24 | 62.5% | 52.1% | 54.1 | 0.00 |
| 102 | grok-3-mini | Reasoning | 766 | 38 | 44.3% | 62.9% | 40.7 | 0.00 |
| 103 | grok-4-fast-non-reasoning | Reasoning | 756 | 24 | 44.4% | 65.8% | 49.2 | 0.67 |
| 104 | gemini-2.5-flash | Reasoning | 755 | 31 | 50.0% | 63.1% | 52.8 | 1.12 |
| 105 | deepseek-v3.1-terminus | Continuation | 755 | 1 | 100.0% | 59.1% | 71.0 | 41.00 |
| 106 | lfm-2.5-1.2b-instruct | Reasoning | 755 | 1 | 50.0% | 59.6% | 19.0 | 123.00 |
| 107 | grok-2-latest ˟ | Continuation | 744 | 8 | 50.0% | 58.1% | 42.0 | 20.62 |
| 108 | command-a | Reasoning | 743 | 18 | 47.2% | 63.5% | 50.7 | 1.20 |
| 109 | qwen3-8b | Reasoning | 741 | 18 | 63.9% | 58.1% | 61.4 | 0.86 |
| 110 | kimi-k2 | Reasoning | 737 | 31 | 51.6% | 55.9% | 60.4 | 1.62 |
| 111 | aurora-alpha | Reasoning | 736 | 1 | 25.0% | 58.1% | 43.5 | 0.00 |
| 112 | kimi-k2-0905 | Reasoning | 734 | 32 | 46.9% | 66.6% | 49.3 | 0.82 |
| 113 | kimi-k2-thinking | Continuation | 731 | 11 | 72.7% | 51.4% | 60.8 | 27.27 |
| 114 | gemini-2.5-flash-lite | Continuation | 727 | 15 | 52.6% | 59.1% | 52.7 | 9.00 |
| 115 | minimax-m2 | Reasoning | 721 | 24 | 39.6% | 60.2% | 45.3 | 0.50 |
| 116 | lfm-2.5-1.2b-thinking | Reasoning | 721 | 1 | 50.0% | 57.5% | 19.0 | 2.00 |
| 117 | kimi-k2-thinking | Reasoning | 715 | 20 | 42.5% | 56.4% | 49.5 | 0.00 |
| 118 | o3-mini | Continuation | 714 | 10 | 15.0% | 60.5% | 56.5 | 35.20 |
| 119 | glm-5 | Continuation | 714 | 10 | 35.0% | 58.6% | 57.9 | 19.10 |
| 120 | gpt-4.1-mini | Reasoning | 713 | 27 | 51.8% | 53.5% | 70.3 | 0.25 |
| 121 | claude-opus-4 | Reasoning | 709 | 24 | 43.8% | 54.3% | 56.4 | 0.75 |
| 122 | o3-mini | Reasoning | 707 | 16 | 34.2% | 55.7% | 43.8 | 0.00 |
| 123 | qwen3-32b | Reasoning | 706 | 26 | 53.8% | 54.7% | 50.7 | 0.17 |
| 124 | mistral-large-2-2411 | Reasoning | 704 | 35 | 57.1% | 59.1% | 66.3 | 10.50 |
| 125 | qwen2.5-72b-instruct | Reasoning | 704 | 38 | 52.6% | 59.3% | 70.1 | 4.00 |
| 126 | kimi-k2 | Continuation | 704 | 14 | 35.7% | 56.1% | 50.0 | 47.50 |
| 127 | qwen-plus-2025-07-28 | Reasoning | 704 | 17 | 52.9% | 58.2% | 61.3 | 0.81 |
| 128 | longcat-flash-chat | Reasoning | 701 | 22 | 59.1% | 54.2% | 52.9 | 1.00 |
| 129 | qwen3-14b | Reasoning | 700 | 16 | 53.1% | 54.2% | 64.8 | 0.25 |
| 130 | claude-3.7-sonnet | Continuation | 698 | 12 | 41.7% | 59.7% | 68.7 | 23.25 |
| 131 | claude-opus-4.1 | Reasoning | 698 | 28 | 32.8% | 57.9% | 39.3 | 0.57 |
| 132 | internvl3-78b | Reasoning | 697 | 11 | 54.5% | 55.3% | 77.0 | 2.73 |
| 133 | deepseek-r1 | Reasoning | 691 | 13 | 61.5% | 52.1% | 43.2 | 0.00 |
| 134 | llama-3.3-70b-instruct | Reasoning | 687 | 52 | 54.5% | 56.7% | 56.3 | 0.22 |
| 135 | qwen3-max | Reasoning | 686 | 21 | 45.2% | 51.1% | 57.5 | 0.10 |
| 136 | claude-haiku-4.5 | Reasoning | 680 | 21 | 45.5% | 53.2% | 51.5 | 0.33 |
| 137 | glm-4.6v | Reasoning | 680 | 18 | 50.0% | 57.3% | 66.1 | 0.22 |
| 138 | grok-4.1-fast-non-reasoning | Reasoning | 668 | 14 | 39.3% | 53.4% | 43.7 | 0.21 |
| 139 | qwen2.5-max | Reasoning | 667 | 23 | 45.6% | 54.2% | 50.3 | 1.00 |
| 140 | qwen-plus-2025-07-28 | Continuation | 667 | 10 | 20.0% | 58.2% | 48.8 | 22.56 |
| 141 | claude-sonnet-4.6 | Continuation | 667 | 10 | 45.0% | 51.3% | 59.7 | 24.00 |
| 142 | qwen3-235b-a22b-thinking-2507 | Reasoning | 666 | 19 | 44.7% | 54.8% | 60.5 | 0.20 |
| 143 | gemini-2.0-flash-lite-001 | Continuation | 665 | 13 | 50.0% | 53.8% | 60.1 | 12.75 |
| 144 | qwen3-coder-next | Reasoning | 664 | 12 | 50.0% | 51.3% | 75.5 | 2.92 |
| 145 | claude-sonnet-4.5 | Reasoning | 663 | 30 | 45.0% | 58.1% | 46.2 | 0.20 |
| 146 | deepseek-r1-0528 | Reasoning | 662 | 16 | 43.8% | 52.0% | 56.8 | 0.25 |
| 147 | deepseek-v3.2-exp | Reasoning | 660 | 18 | 47.2% | 56.9% | 45.1 | 0.50 |
| 148 | devstral-2512 | Reasoning | 660 | 18 | 39.5% | 53.5% | 50.0 | 0.95 |
| 149 | gpt-4o-mini | Reasoning | 659 | 21 | 30.3% | 55.0% | 43.8 | 0.14 |
| 150 | qwen3-coder-plus | Reasoning | 659 | 14 | 39.3% | 56.1% | 69.4 | 0.62 |
| 151 | grok-2-latest ˟ | Reasoning | 659 | 16 | 56.3% | 53.6% | 73.4 | 1.25 |
| 152 | seed-1.6-flash | Reasoning | 659 | 21 | 40.5% | 53.2% | 72.5 | 0.71 |
| 153 | gemma-2-27b-it | Reasoning | 658 | 18 | 50.0% | 58.0% | 68.2 | 6.60 |
| 154 | glm-4.5 | Reasoning | 658 | 21 | 52.4% | 50.7% | 60.2 | 1.67 |
| 155 | claude-3.7-sonnet | Reasoning | 655 | 26 | 48.2% | 47.2% | 52.2 | 0.00 |
| 156 | claude-3.5-sonnet | Continuation | 653 | 11 | 40.9% | 53.2% | 61.0 | 11.83 |
| 157 | minimax-m1 | Reasoning | 653 | 15 | 33.3% | 56.6% | 52.0 | 0.14 |
| 158 | olmo-3-32b-think | Reasoning | 650 | 15 | 66.7% | 48.0% | 57.7 | 0.07 |
| 159 | phi-4 | Reasoning | 649 | 14 | 60.7% | 50.0% | 75.6 | 21.33 |
| 160 | qwen2.5-plus | Reasoning | 649 | 15 | 63.3% | 49.0% | 72.6 | 1.50 |
| 161 | intellect-3 | Reasoning | 649 | 7 | 71.4% | 50.8% | 79.4 | 0.43 |
| 162 | deepseek-v3-0324 | Continuation | 648 | 13 | 50.0% | 54.4% | 56.2 | 11.50 |
| 163 | hunyuan-a13b-instruct | Continuation | 648 | 8 | 37.5% | 57.5% | 56.1 | 88.50 |
| 164 | gpt-4 | Reasoning | 648 | 12 | 58.3% | 47.2% | 72.4 | 1.50 |
| 165 | gemini-2.5-flash-lite | Reasoning | 645 | 28 | 43.6% | 54.0% | 55.9 | 7.80 |
| 166 | claude-3-sonnet ˟ | Reasoning | 642 | 6 | 58.3% | 53.4% | 81.3 | — |
| 167 | claude-sonnet-4 | Reasoning | 640 | 33 | 44.4% | 51.3% | 55.1 | 0.00 |
| 168 | qwen3-235b-a22b | Reasoning | 639 | 19 | 42.1% | 51.4% | 42.9 | 0.00 |
| 169 | llama-3.1-nemotron-ultra-253b-v1 | Reasoning | 637 | 13 | 42.9% | 51.3% | 59.1 | 0.00 |
| 170 | gpt-oss-120b | Continuation | 637 | 13 | 34.6% | 49.6% | 57.7 | 6.00 |
| 171 | claude-sonnet-4.5 | Continuation | 634 | 17 | 47.1% | 47.6% | 45.8 | 15.75 |
| 172 | ernie-4.5-21b-a3b-thinking | Reasoning | 634 | 13 | 65.4% | 47.9% | 70.7 | 1.33 |
| 173 | gpt-4-turbo | Reasoning | 632 | 21 | 47.6% | 53.4% | 64.6 | 0.67 |
| 174 | qwen3-235b-a22b | Continuation | 630 | 10 | 35.0% | 55.6% | 52.7 | 60.00 |
| 175 | claude-haiku-4.5 | Continuation | 629 | 15 | 23.3% | 57.2% | 51.8 | 59.60 |
| 176 | ministral-14b-2512 | Reasoning | 628 | 13 | 42.3% | 56.8% | 64.9 | 9.31 |
| 177 | gemini-1.5-pro ˟ | Reasoning | 627 | 12 | 37.5% | 52.8% | 50.8 | 5.00 |
| 178 | gemini-2.0-flash-001 | Reasoning | 626 | 47 | 34.8% | 54.9% | 56.4 | 0.00 |
| 179 | deepseek-v3 | Continuation | 626 | 10 | 50.0% | 50.3% | 69.7 | 31.75 |
| 180 | qwen2.5-vl-32b-instruct | Reasoning | 626 | 2 | 50.0% | 52.9% | 150.5 | 4.00 |
| 181 | inflection-3-pi | Reasoning | 625 | 11 | 68.2% | 48.0% | 74.6 | 5.91 |
| 182 | mistral-large-3-2512 | Reasoning | 621 | 15 | 43.8% | 47.5% | 52.6 | 0.19 |
| 183 | grok-code-fast-1 | Continuation | 619 | 10 | 50.0% | 49.8% | 68.4 | 65.33 |
| 184 | llama-3.3-nemotron-super-49b-v1.5 | Reasoning | 619 | 11 | 50.0% | 49.0% | 64.1 | 0.00 |
| 185 | deepseek-v3-0324 | Reasoning | 613 | 27 | 35.9% | 50.2% | 49.9 | 0.25 |
| 186 | glm-4.6 | Reasoning | 613 | 23 | 47.8% | 45.2% | 56.9 | 1.91 |
| 187 | llama-3.3-70b-instruct | Continuation | 611 | 13 | 30.8% | 55.0% | 67.2 | 80.33 |
| 188 | qwen3-30b-a3b | Reasoning | 610 | 17 | 52.9% | 45.0% | 74.9 | 0.00 |
| 189 | deepseek-v3 | Reasoning | 609 | 17 | 41.2% | 53.4% | 42.3 | 0.20 |
| 190 | minimax-m2.1 | Reasoning | 607 | 12 | 29.2% | 52.8% | 51.9 | 9.08 |
| 191 | qwen3.5-397b-a17b | Continuation | 605 | 7 | 35.7% | 52.4% | 53.9 | 8.57 |
| 192 | devstral-2512 | Continuation | 604 | 9 | 55.6% | 46.8% | 67.9 | 43.00 |
| 193 | llama-3.3-nemotron-super-49b-v1 | Reasoning | 602 | 10 | 60.0% | 49.6% | 87.9 | 13.00 |
| 194 | grok-3 | Reasoning | 602 | 21 | 45.5% | 43.2% | 52.0 | 0.00 |
| 195 | devstral-small-2505 | Reasoning | 599 | 3 | 0.0% | 55.4% | 83.3 | 2.00 |
| 196 | minimax-m2.5 | Reasoning | 599 | 14 | 46.4% | 43.5% | 56.8 | 2.57 |
| 197 | magistral-medium-2506 | Reasoning | 598 | 10 | 40.0% | 51.0% | 78.9 | 1.25 |
| 198 | inflection-3-pi | Continuation | 598 | 1 | 50.0% | 50.1% | 54.0 | 33.00 |
| 199 | gpt-4.1-nano | Continuation | 597 | 8 | 50.0% | 47.2% | 100.6 | 288.50 |
| 200 | gemini-1.5-flash ˟ | Reasoning | 594 | 10 | 41.7% | 51.5% | 38.6 | 0.00 |
| 201 | llama-3.1-70b-instruct | Reasoning | 593 | 13 | 53.8% | 47.1% | 68.2 | 1.00 |
| 202 | qwen3-next-80b-a3b-instruct | Reasoning | 593 | 20 | 37.5% | 48.6% | 54.0 | 6.33 |
| 203 | aurora-alpha | Continuation | 593 | 1 | 50.0% | 48.3% | 97.0 | 37.00 |
| 204 | deepseek-v3.1 | Reasoning | 591 | 19 | 38.1% | 51.1% | 52.1 | 0.25 |
| 205 | deepseek-v3.2 | Continuation | 591 | 11 | 54.5% | 45.7% | 47.3 | 28.91 |
| 206 | llama-3.1-405b-instruct | Reasoning | 590 | 24 | 43.8% | 46.6% | 64.0 | 2.25 |
| 207 | nemotron-3-nano-30b-a3b | Continuation | 589 | 5 | 60.0% | 48.7% | 86.4 | 90.60 |
| 208 | gpt-4o-2024-11-20 | Reasoning | 588 | 25 | 36.5% | 44.8% | 58.2 | 0.25 |
| 209 | gemini-2.0-flash-lite-001 | Reasoning | 588 | 18 | 34.2% | 52.1% | 59.7 | 1.25 |
| 210 | devstral-medium | Reasoning | 588 | 12 | 66.7% | 40.1% | 78.4 | 4.42 |
| 211 | glm-4.5-air | Reasoning | 585 | 19 | 55.3% | 39.3% | 61.1 | 2.25 |
| 212 | qwen3-vl-235b-a22b-thinking | Reasoning | 585 | 11 | 50.0% | 46.8% | 56.6 | 0.00 |
| 213 | mimo-v2-flash | Reasoning | 585 | 13 | 43.3% | 42.1% | 61.2 | 2.00 |
| 214 | gemma-3-12b-it | Reasoning | 584 | 16 | 53.1% | 48.1% | 50.5 | 0.25 |
| 215 | claude-3.5-haiku | Reasoning | 581 | 22 | 38.6% | 55.7% | 59.5 | 0.00 |
| 216 | qwen3-coder-480b-a35b | Reasoning | 581 | 12 | 45.8% | 46.3% | 60.6 | 3.40 |
| 217 | qwq-32b | Reasoning | 580 | 13 | 34.6% | 48.0% | 61.8 | 0.67 |
| 218 | gemma-2-27b-it | Continuation | 579 | 6 | 41.7% | 47.5% | 45.2 | 62.00 |
| 219 | magistral-medium-2506:thinking | Reasoning | 578 | 2 | 0.0% | 51.8% | 29.5 | — |
| 220 | hunyuan-a13b-instruct | Reasoning | 576 | 14 | 42.9% | 49.5% | 69.2 | 19.50 |
| 221 | llama-4-maverick | Reasoning | 575 | 26 | 46.2% | 42.9% | 59.4 | 0.50 |
| 222 | ling-1t | Reasoning | 574 | 17 | 50.0% | 45.8% | 48.2 | 8.00 |
| 223 | qwen2.5-turbo | Reasoning | 573 | 13 | 53.8% | 50.1% | 80.6 | 2.33 |
| 224 | jamba-large-1.7 | Reasoning | 572 | 14 | 42.9% | 46.3% | 59.0 | 0.33 |
| 225 | gpt-4.1-nano | Reasoning | 570 | 20 | 38.6% | 51.8% | 57.0 | 3.00 |
| 226 | inflection-3-productivity | Reasoning | 570 | 11 | 54.5% | 46.7% | 87.5 | 6.27 |
| 227 | mistral-small-3.2-24b-instruct | Reasoning | 569 | 14 | 46.7% | 49.2% | 80.5 | 2.50 |
| 228 | ernie-4.5-300b-a47b | Reasoning | 569 | 19 | 36.8% | 51.8% | 72.2 | 0.00 |
| 229 | lfm2-8b-a1b | Reasoning | 569 | 19 | 68.4% | 38.8% | 98.9 | 9.47 |
| 230 | qwen3-next-80b-a3b-thinking | Continuation | 568 | 10 | 35.0% | 50.1% | 38.5 | 42.20 |
| 231 | deepseek-v3.1-terminus | Reasoning | 568 | 14 | 50.0% | 43.4% | 52.9 | 0.80 |
| 232 | claude-opus-4.5-thinking | Reasoning | 567 | 1 | 50.0% | 46.4% | 59.0 | 0.00 |
| 233 | ernie-4.5-21b-a3b | Reasoning | 566 | 18 | 50.0% | 47.7% | 72.5 | 8.57 |
| 234 | mistral-medium-3 | Reasoning | 565 | 17 | 41.2% | 45.7% | 49.8 | 0.33 |
| 235 | glm-4.7-flash | Reasoning | 562 | 11 | 31.8% | 49.4% | 73.0 | 1.73 |
| 236 | qwen3-30b-a3b-thinking-2507 | Reasoning | 561 | 13 | 50.0% | 44.6% | 83.3 | 0.00 |
| 237 | ministral-8b | Reasoning | 560 | 21 | 52.4% | 49.2% | 60.6 | 46.67 |
| 238 | command-r-plus-08-2024 | Reasoning | 560 | 13 | 50.0% | 42.8% | 73.5 | 10.33 |
| 239 | internvl3-78b | Continuation | 559 | 2 | 50.0% | 46.4% | 105.5 | 128.00 |
| 240 | qwen3-30b-a3b-instruct-2507 | Reasoning | 558 | 22 | 50.0% | 41.5% | 57.8 | 3.50 |
| 241 | gpt-3.5-turbo-instruct | Reasoning | 553 | 13 | 26.7% | 50.8% | 46.5 | 5.67 |
| 242 | mistral-small-24b-instruct-2501 | Reasoning | 553 | 15 | 63.3% | 38.3% | 76.8 | 1.50 |
| 243 | nova-2-lite-v1 | Reasoning | 553 | 13 | 39.3% | 47.8% | 79.4 | 1.50 |
| 244 | gemini-1.5-flash-8b ˟ | Reasoning | 552 | 10 | 50.0% | 43.5% | 43.6 | 4.00 |
| 245 | deepseek-r1 | Continuation | 551 | 2 | 50.0% | 46.4% | 61.0 | 28.00 |
| 246 | lfm-7b ˟ | Reasoning | 550 | 25 | 29.1% | 56.8% | 56.2 | 9.00 |
| 247 | gpt-3.5-turbo | Reasoning | 550 | 14 | 27.8% | 50.4% | 52.3 | 11.75 |
| 248 | hermes-4-70b | Reasoning | 550 | 2 | 75.0% | 43.8% | 59.5 | 1.00 |
| 249 | step-3.5-flash | Continuation | 548 | 11 | 36.4% | 42.0% | 46.0 | 18.73 |
| 250 | qwen3-next-80b-a3b-instruct | Continuation | 546 | 11 | 36.4% | 46.2% | 54.8 | 53.80 |
| 251 | glm-4-32b | Reasoning | 545 | 17 | 52.9% | 38.8% | 71.4 | 4.33 |
| 252 | kimi-k2-0905 | Continuation | 545 | 13 | 38.5% | 40.1% | 67.3 | 37.33 |
| 253 | minimax-m2.5 | Continuation | 544 | 1 | 25.0% | 44.9% | 67.5 | 52.00 |
| 254 | claude-3-opus ˟ | Reasoning | 543 | 10 | 35.0% | 47.6% | 64.2 | 5.80 |
| 255 | grok-3-mini | Continuation | 542 | 10 | 65.0% | 36.6% | 51.2 | 35.00 |
| 256 | mistral-medium-3.1 | Reasoning | 541 | 17 | 50.0% | 44.7% | 68.0 | 1.80 |
| 257 | qwen3-vl-32b-instruct | Reasoning | 541 | 7 | 42.9% | 46.1% | 49.6 | 0.71 |
| 258 | qwen3-max | Continuation | 540 | 10 | 40.0% | 41.0% | 69.8 | 67.00 |
| 259 | grok-4-fast-non-reasoning | Continuation | 540 | 12 | 45.8% | 42.1% | 72.7 | 51.29 |
| 260 | claude-3-haiku | Reasoning | 539 | 13 | 56.7% | 41.4% | 79.3 | 0.00 |
| 261 | qwen3-235b-a22b-instruct-2507 | Reasoning | 539 | 27 | 32.8% | 50.1% | 49.8 | 0.25 |
| 262 | command-r-08-2024 | Reasoning | 538 | 12 | 37.5% | 48.5% | 76.6 | 27.00 |
| 263 | llama-4-scout | Continuation | 537 | 10 | 55.0% | 41.2% | 91.4 | 99.00 |
| 264 | gemma-3-27b-it | Reasoning | 536 | 19 | 34.2% | 51.2% | 47.7 | 3.43 |
| 265 | llama-4-maverick | Continuation | 535 | 13 | 42.3% | 45.4% | 87.5 | 109.33 |
| 266 | claude-3.5-sonnet | Reasoning | 533 | 13 | 38.5% | 45.0% | 46.2 | 1.33 |
| 267 | longcat-flash-chat | Continuation | 533 | 10 | 27.3% | 46.6% | 45.2 | 21.80 |
| 268 | deepseek-v3.2-exp | Continuation | 532 | 6 | 28.6% | 48.1% | 42.9 | 17.50 |
| 269 | mimo-v2-flash | Continuation | 531 | 10 | 50.0% | 42.7% | 74.2 | 37.80 |
| 270 | mistral-large-2-2411 | Continuation | 530 | 10 | 30.0% | 43.0% | 93.1 | 83.00 |
| 271 | qwen3-vl-235b-a22b-instruct | Reasoning | 527 | 12 | 29.2% | 47.5% | 58.4 | 0.00 |
| 272 | gemma-2-9b-it | Reasoning | 526 | 19 | 50.0% | 41.9% | 92.7 | 47.43 |
| 273 | llama-3.1-405b-instruct | Continuation | 520 | 10 | 40.0% | 44.0% | 102.5 | 62.00 |
| 274 | claude-3.7-sonnet:thinking | Reasoning | 520 | 2 | 50.0% | 42.0% | 58.0 | 0.00 |
| 275 | qwen3-vl-8b-instruct | Reasoning | 519 | 7 | 42.9% | 44.0% | 80.1 | 2.86 |
| 276 | llama-4-scout | Reasoning | 517 | 26 | 36.5% | 45.5% | 65.8 | 0.00 |
| 277 | glm-z1-32b | Reasoning | 517 | 2 | 75.0% | 41.0% | 44.5 | — |
| 278 | gemini-1.5-flash ˟ | Continuation | 517 | 3 | 66.7% | 39.6% | 74.3 | — |
| 279 | mistral-large-3-2512 | Continuation | 517 | 11 | 63.6% | 33.9% | 96.0 | 70.36 |
| 280 | wizardlm-2-8x22b | Reasoning | 515 | 12 | 29.2% | 48.6% | 48.8 | 3.50 |
| 281 | seed-1.6-flash | Continuation | 513 | 2 | 50.0% | 43.7% | 47.0 | 59.50 |
| 282 | deepseek-prover-v2 | Reasoning | 510 | 8 | 37.5% | 46.0% | 45.9 | 0.80 |
| 283 | mistral-nemo | Reasoning | 510 | 15 | 35.3% | 44.3% | 81.4 | 48.50 |
| 284 | jamba-large-1.6 | Reasoning | 507 | 6 | 41.7% | 42.7% | 85.3 | 1.00 |
| 285 | devstral-small | Reasoning | 506 | 12 | 33.3% | 48.3% | 74.0 | 1.58 |
| 286 | magistral-small-2506 | Continuation | 505 | 2 | 0.0% | 43.3% | 23.0 | 43.00 |
| 287 | llama-3.1-8b-instruct | Reasoning | 503 | 30 | 56.5% | 31.6% | 106.6 | 20.00 |
| 288 | magistral-small-2506 | Reasoning | 503 | 11 | 50.0% | 39.3% | 79.2 | 2.00 |
| 289 | llama-3-8b-instruct | Reasoning | 503 | 14 | 40.0% | 45.3% | 105.3 | 14.64 |
| 290 | qwen3.5-plus-02-15 | Continuation | 503 | 8 | 56.3% | 34.5% | 75.9 | 36.00 |
| 291 | molmo-2-8b | Reasoning | 502 | 12 | 45.8% | 45.2% | 93.1 | 9.17 |
| 292 | gemma-3-27b-it | Continuation | 499 | 7 | 7.1% | 47.6% | 58.9 | 49.50 |
| 293 | olmo-3.1-32b-instruct | Reasoning | 499 | 12 | 29.2% | 46.9% | 58.4 | 1.17 |
| 294 | command-r-08-2024 | Continuation | 495 | 6 | 25.0% | 43.1% | 101.8 | 231.00 |
| 295 | grok-4.1-fast-non-reasoning | Continuation | 495 | 11 | 40.9% | 39.6% | 72.0 | 80.91 |
| 296 | qwen3-coder-480b-a35b | Continuation | 494 | 9 | 50.0% | 38.4% | 76.6 | 57.33 |
| 297 | qwen3-4b | Reasoning | 493 | 5 | 60.0% | 39.7% | 66.8 | 1.00 |
| 298 | llama-3.2-3b-instruct | Reasoning | 491 | 14 | 50.0% | 43.9% | 95.6 | 37.00 |
| 299 | kimi-linear-48b-a3b-instruct | Reasoning | 489 | 11 | 33.3% | 40.2% | 54.3 | 9.92 |
| 300 | jamba-large-1.7 | Continuation | 488 | 10 | 40.0% | 41.8% | 84.5 | 118.00 |
| 301 | ministral-8b-2512 | Reasoning | 488 | 16 | 40.6% | 43.4% | 68.0 | 6.00 |
| 302 | devstral-small | Continuation | 487 | 2 | 33.3% | 39.2% | 55.3 | 50.00 |
| 303 | qwen3-vl-30b-a3b-thinking | Reasoning | 487 | 3 | 66.7% | 35.9% | 52.7 | 0.00 |
| 304 | qwen-2.5-7b-instruct | Reasoning | 487 | 15 | 46.7% | 38.0% | 86.1 | 11.20 |
| 305 | deepseek-r1-0528-qwen3-8b | Reasoning | 486 | 3 | 50.0% | 37.9% | 90.3 | 1.00 |
| 306 | glm-4-32b | Continuation | 486 | 8 | 37.5% | 40.4% | 79.2 | 216.00 |
| 307 | deepseek-v3.1 | Continuation | 483 | 8 | 50.0% | 35.5% | 86.8 | 24.50 |
| 308 | ernie-4.5-300b-a47b | Continuation | 482 | 8 | 37.5% | 36.4% | 128.2 | 72.00 |
| 309 | jamba-large-1.6 | Continuation | 480 | 5 | 40.0% | 39.8% | 90.8 | — |
| 310 | hermes-4-405b | Reasoning | 479 | 2 | 25.0% | 41.2% | 59.0 | 1.00 |
| 311 | qwen2.5-72b-instruct | Continuation | 476 | 11 | 45.5% | 37.4% | 90.1 | 79.20 |
| 312 | deepseek-prover-v2 | Continuation | 474 | 4 | 37.5% | 38.8% | 100.0 | 76.00 |
| 313 | mistral-small-3.1-24b-instruct | Reasoning | 473 | 19 | 42.1% | 35.8% | 69.4 | 2.00 |
| 314 | inflection-3-productivity | Continuation | 471 | 1 | 50.0% | 38.2% | 111.0 | 79.00 |
| 315 | tng-r1t-chimera | Reasoning | 471 | 1 | 50.0% | 38.3% | 67.0 | 0.00 |
| 316 | olmo-3.1-32b-think | Reasoning | 471 | 1 | 0.0% | 40.0% | 27.0 | 0.00 |
| 317 | llama-3.3-8b-instruct | Reasoning | 470 | 15 | 43.3% | 41.3% | 75.6 | 9.00 |
| 318 | rnj-1-instruct | Reasoning | 464 | 12 | 42.3% | 38.0% | 58.7 | 9.23 |
| 319 | seed-1.6 | Continuation | 462 | 3 | 50.0% | 38.1% | 58.0 | 68.67 |
| 320 | phi-4 | Continuation | 461 | 7 | 50.0% | 36.1% | 71.7 | 96.50 |
| 321 | mistral-small-creative | Reasoning | 461 | 13 | 46.2% | 36.3% | 70.2 | 6.85 |
| 322 | gemma-3-4b-it | Reasoning | 460 | 14 | 50.0% | 37.9% | 79.9 | 12.00 |
| 323 | mythomax-l2-13b | Reasoning | 460 | 11 | 31.8% | 44.8% | 85.5 | 38.27 |
| 324 | granite-4.0-h-micro | Reasoning | 459 | 11 | 50.0% | 38.5% | 56.5 | 19.45 |
| 325 | minimax-m1 | Continuation | 456 | 1 | 50.0% | 35.2% | 153.0 | 174.00 |
| 326 | qwen2.5-vl-72b-instruct | Reasoning | 454 | 1 | 50.0% | 35.1% | 85.0 | 0.00 |
| 327 | deepseek-r1t-chimera | Reasoning | 453 | 1 | 50.0% | 35.7% | 58.0 | — |
| 328 | olmo-2-0325-32b-instruct | Reasoning | 453 | 13 | 38.5% | 36.4% | 94.7 | 5.92 |
| 329 | olmo-3-7b-think | Reasoning | 453 | 11 | 40.9% | 39.9% | 106.3 | 1.27 |
| 330 | glm-4.6 | Continuation | 446 | 9 | 50.0% | 30.9% | 80.4 | 66.00 |
| 331 | glm-4.5-air | Continuation | 443 | 8 | 50.0% | 34.4% | 68.4 | 41.00 |
| 332 | lfm2-8b-a1b | Continuation | 443 | 1 | 50.0% | 35.7% | 80.0 | 166.00 |
| 333 | afm-4.5b | Reasoning | 442 | 14 | 50.0% | 31.2% | 97.3 | 22.67 |
| 334 | ui-tars-1.5-7b | Continuation | 442 | 1 | 50.0% | 35.5% | 80.0 | 124.00 |
| 335 | minimax-m2-her | Reasoning | 441 | 1 | 50.0% | 35.2% | 176.0 | 19.00 |
| 336 | qwen3-coder-plus | Continuation | 440 | 5 | 40.0% | 35.4% | 51.8 | 58.33 |
| 337 | ministral-3b | Reasoning | 439 | 17 | 52.9% | 30.2% | 65.6 | 47.71 |
| 338 | llama-3.1-nemotron-ultra-253b-v1 | Continuation | 438 | 2 | 50.0% | 34.4% | 126.0 | 154.00 |
| 339 | seed-oss-36b-instruct | Continuation | 438 | 3 | 33.3% | 34.5% | 83.0 | — |
| 340 | qwen2.5-plus | Continuation | 437 | 9 | 50.0% | 33.0% | 71.9 | 42.33 |
| 341 | deepseek-r1-0528 | Continuation | 437 | 2 | 50.0% | 31.9% | 87.5 | 58.00 |
| 342 | qwen2.5-turbo | Continuation | 437 | 10 | 50.0% | 31.7% | 95.9 | 141.67 |
| 343 | gemma-3-12b-it | Continuation | 435 | 2 | 50.0% | 33.1% | 108.0 | 211.00 |
| 344 | trinity-large-preview | Reasoning | 434 | 1 | 50.0% | 33.5% | 72.0 | 1.00 |
| 345 | ministral-3b-2512 | Reasoning | 433 | 11 | 36.4% | 42.1% | 92.5 | 5.27 |
| 346 | llama-3.1-nemotron-70b-instruct | Reasoning | 432 | 2 | 0.0% | 36.7% | 14.7 | 0.00 |
| 347 | command-r7b-12-2024 | Reasoning | 431 | 17 | 44.1% | 31.4% | 127.6 | 28.06 |
| 348 | glm-4.7-flash | Continuation | 431 | 1 | 100.0% | 31.9% | 107.0 | 173.00 |
| 349 | jamba-mini-1.6 | Reasoning | 430 | 7 | 42.9% | 32.0% | 86.0 | 12.00 |
| 350 | mistral-medium-3 | Continuation | 428 | 9 | 50.0% | 33.4% | 79.2 | 58.50 |
| 351 | jamba-mini-1.7 | Continuation | 428 | 5 | 50.0% | 33.7% | 134.8 | 302.50 |
| 352 | llama-3.3-nemotron-super-49b-v1.5 | Continuation | 428 | 1 | 50.0% | 33.7% | 188.0 | 271.00 |
| 353 | phi-3-medium-128k-instruct | Reasoning | 425 | 12 | 45.8% | 30.8% | 106.9 | 13.00 |
| 354 | lfm-3b ˟ | Continuation | 424 | 1 | 50.0% | 31.1% | 152.0 | — |
| 355 | glm-4.5 | Continuation | 423 | 9 | 50.0% | 29.8% | 74.3 | 38.00 |
| 356 | mistral-7b-instruct-v0.1 | Reasoning | 422 | 12 | 45.8% | 29.9% | 81.7 | 64.00 |
| 357 | olmo-3-7b-instruct | Reasoning | 422 | 17 | 35.3% | 32.3% | 72.6 | 9.35 |
| 358 | qwen3-coder-next | Continuation | 421 | 5 | 50.0% | 28.4% | 94.9 | 108.71 |
| 359 | lfm-2.2-6b | Reasoning | 417 | 12 | 41.7% | 32.0% | 115.7 | 93.00 |
| 360 | jamba-mini-1.7 | Reasoning | 416 | 15 | 33.3% | 35.0% | 115.0 | 15.67 |
| 361 | gemma-3n-e4b-it | Reasoning | 415 | 16 | 37.5% | 39.0% | 65.7 | 27.40 |
| 362 | qwen2.5-max | Continuation | 414 | 7 | 42.9% | 29.3% | 90.3 | 35.00 |
| 363 | qwen3-vl-235b-a22b-thinking | Continuation | 414 | 1 | 50.0% | 30.0% | 100.0 | 81.00 |
| 364 | gemini-1.5-flash-8b ˟ | Continuation | 413 | 1 | 0.0% | 34.1% | 40.0 | — |
| 365 | lfm-3b ˟ | Reasoning | 412 | 14 | 42.9% | 29.0% | 89.0 | 15.00 |
| 366 | qwen3-30b-a3b-instruct-2507 | Continuation | 412 | 6 | 41.7% | 31.7% | 77.2 | 174.00 |
| 367 | glm-4.6v | Continuation | 411 | 10 | 50.0% | 29.9% | 101.7 | 137.30 |
| 368 | gemma-2-9b-it | Continuation | 410 | 1 | 50.0% | 31.4% | 132.0 | 193.00 |
| 369 | qwen3-30b-a3b-thinking-2507 | Continuation | 408 | 1 | 50.0% | 31.1% | 132.0 | 190.00 |
| 370 | claude-3-opus ˟ | Continuation | 407 | 2 | 75.0% | 27.6% | 72.0 | 18.00 |
| 371 | mistral-medium-3.1 | Continuation | 406 | 8 | 37.5% | 33.2% | 90.5 | 96.67 |
| 372 | ui-tars-1.5-7b | Reasoning | 405 | 19 | 36.8% | 39.3% | 92.1 | 33.50 |
| 373 | claude-3.5-haiku | Continuation | 395 | 2 | 50.0% | 29.4% | 84.5 | 65.00 |
| 374 | qwen3-235b-a22b-thinking-2507 | Continuation | 392 | 2 | 50.0% | 28.8% | 84.0 | 68.00 |
| 375 | wizardlm-2-8x22b | Continuation | 391 | 2 | 50.0% | 27.8% | 92.0 | 81.00 |
| 376 | ernie-4.5-21b-a3b | Continuation | 391 | 3 | 50.0% | 28.7% | 117.3 | 385.00 |
| 377 | qwq-32b | Continuation | 385 | 1 | 50.0% | 27.8% | 79.0 | 110.00 |
| 378 | claude-3-haiku | Continuation | 384 | 5 | 50.0% | 27.4% | 158.4 | 236.50 |
| 379 | gemma-3n-e4b-it | Continuation | 384 | 2 | 50.0% | 27.6% | 128.0 | 173.00 |
| 380 | kimi-linear-48b-a3b-instruct | Continuation | 384 | 5 | 50.0% | 27.1% | 148.8 | 307.00 |
| 381 | olmo-3-32b-think | Continuation | 384 | 1 | 100.0% | 24.0% | 70.0 | 96.00 |
| 382 | olmo-2-0325-32b-instruct | Continuation | 384 | 2 | 50.0% | 27.0% | 87.5 | 148.50 |
| 383 | qwen3-30b-a3b | Continuation | 379 | 2 | 50.0% | 26.4% | 75.0 | 142.00 |
| 384 | command-a | Continuation | 378 | 6 | 41.7% | 26.7% | 101.3 | 58.00 |
| 385 | command-r-plus-08-2024 | Continuation | 378 | 5 | 50.0% | 26.4% | 123.8 | 226.00 |
| 386 | qwen3-vl-32b-instruct | Continuation | 378 | 1 | 50.0% | 26.6% | 79.0 | 65.00 |
| 387 | mistral-nemo | Continuation | 375 | 2 | 50.0% | 26.1% | 108.0 | 110.00 |
| 388 | mistral-small-3.2-24b-instruct | Continuation | 374 | 3 | 50.0% | 25.9% | 73.0 | 90.00 |
| 389 | ministral-14b-2512 | Continuation | 372 | 1 | 50.0% | 25.6% | 198.0 | 328.00 |
| 390 | ministral-8b-2512 | Continuation | 370 | 1 | 50.0% | 25.4% | 198.0 | 341.00 |
| 391 | deepseek-r1-distill-llama-8b | Reasoning | 367 | 2 | 75.0% | 21.6% | 96.0 | — |
| 392 | qwen3-235b-a22b-instruct-2507 | Continuation | 366 | 11 | 22.7% | 32.3% | 68.2 | 41.20 |
| 393 | llama-3.1-nemotron-70b-instruct | Continuation | 366 | 1 | 50.0% | 24.2% | 195.0 | 228.00 |
| 394 | olmo-3-7b-instruct | Continuation | 365 | 1 | 50.0% | 24.5% | 182.0 | 414.00 |
| 395 | llama-3.1-8b-instruct | Continuation | 364 | 2 | 50.0% | 24.3% | 87.0 | 60.00 |
| 396 | olmo-3-7b-think | Continuation | 363 | 1 | 50.0% | 24.0% | 182.0 | 395.00 |
| 397 | mistral-small-3.1-24b-instruct | Continuation | 361 | 2 | 50.0% | 23.6% | 87.0 | 67.00 |
| 398 | claude-3-sonnet ˟ | Continuation | 361 | 1 | 50.0% | 23.5% | 254.0 | — |
| 399 | olmo-3.1-32b-instruct | Continuation | 356 | 2 | 50.0% | 20.9% | 148.5 | 174.50 |
| 400 | qwen3-32b | Continuation | 351 | 6 | 33.3% | 27.6% | 76.2 | 141.00 |
| 401 | llama-3.1-70b-instruct | Continuation | 339 | 1 | 50.0% | 19.5% | 189.0 | 177.00 |
| 402 | qwen3-vl-235b-a22b-instruct | Continuation | 339 | 1 | 50.0% | 19.5% | 189.0 | 65.00 |
| 403 | deepseek-r1-distill-qwen-7b | Reasoning | 333 | 2 | 25.0% | 21.8% | 96.0 | — |
| 404 | mistral-small-24b-instruct-2501 | Continuation | 329 | 2 | 50.0% | 17.0% | 80.0 | 77.00 |
| 405 | llama-3.3-nemotron-super-49b-v1 | Continuation | 324 | 1 | 50.0% | 14.4% | 103.0 | — |
| 406 | qwen2.5-vl-32b-instruct | Continuation | 316 | 1 | 50.0% | 14.1% | 75.0 | — |
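
Many rows in the table rest on only one or two games, so it can help to filter by mode, minimum ELO, and minimum game count before comparing entries. The following is a minimal sketch, assuming the table is available as markdown text in the layout shown above; the names `Row`, `parse_rows`, and `filter_rows` are hypothetical helpers for illustration and are not part of the data source.

```python
from dataclasses import dataclass


@dataclass
class Row:
    rank: int
    model: str
    mode: str
    elo: int
    games: int


def parse_rows(markdown_table: str) -> list[Row]:
    """Parse the first five columns (#, Model, Mode, ELO, Games) of each data row."""
    rows = []
    for line in markdown_table.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the header, the separator, and anything that is not a data row.
        if len(cells) < 5 or not cells[0].isdigit():
            continue
        rows.append(Row(int(cells[0]), cells[1], cells[2], int(cells[3]), int(cells[4])))
    return rows


def filter_rows(rows, mode=None, min_elo=0, min_games=0):
    """Keep rows matching the mode (if given) and meeting both thresholds."""
    return [
        r for r in rows
        if (mode is None or r.mode == mode)
        and r.elo >= min_elo
        and r.games >= min_games
    ]


# Three rows copied from the table above, used as sample input.
SAMPLE = """\
| # | Model | Mode | ELO | Games | Win Rate | Accuracy | Avg Turns | Illegal/Game |
|---|---|---|---|---|---|---|---|---|
| 1 | gemini-3-pro-preview | Reasoning | 1845 | 40 | 86.4% | 89.5% | 37.1 | 0.00 |
| 3 | gpt-4.5-preview ˟ | Continuation | 1800 | 20 | 95.5% | 90.2% | 41.5 | 0.50 |
| 4 | qwen3-max-thinking | Reasoning | 1800 | 1 | 100.0% | 91.0% | 63.0 | 0.00 |
"""

rows = parse_rows(SAMPLE)
kept = filter_rows(rows, mode="Reasoning", min_games=10)
print([r.model for r in kept])  # ['gemini-3-pro-preview']
```

With `min_games=10`, the single-game qwen3-max-thinking entry drops out even though its ELO ties for third in the sample.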