Alibaba’s Qwen3.6 Plus tops AI benchmark as rivals dispute results
Models from Alibaba have posted strong results on the MMLU-Pro benchmark, with Qwen3.6 Plus claiming the top spot on at least one major ranking. However, independently verified evaluations suggest a more contested picture of leadership in language understanding, highlighting ongoing disputes over how to measure AI performance.
According to LLM Stats, Qwen3.6 Plus achieved a self-reported score of 88.5 percent on MMLU-Pro, ahead of competing models including MiniMax M2.1 and earlier Qwen variants. The company’s Qwen3.5 model, released earlier this year, scored 87.8 percent on the same benchmark, according to its listing on Hugging Face. These results place Alibaba’s models in the top tier of large language models.
Independent rankings present a different hierarchy. Google’s Gemini 3 Pro Preview scored 89.8 percent in one evaluation, while Anthropic’s Claude Opus 4.5 reached 89.5 percent. Other assessments place Gemini 3.1 Pro Preview even higher, at around 91 percent, narrowly ahead of newer Claude models. The variation reflects methodological differences, including prompt design and whether testing is conducted by developers or third-party evaluators.
MMLU-Pro was designed as a more demanding successor to earlier benchmarks, expanding answer choices and introducing more complex reasoning tasks. Yet even this updated test shows signs of saturation. Analysts note that leading models are now separated by margins of roughly one percentage point, limiting the benchmark’s ability to distinguish between top performers.
The rise of Qwen reflects broader momentum among Chinese open-source AI models, which have narrowed the gap with proprietary systems developed in the United States. Advanced architectures such as mixture-of-experts have enabled models like Qwen3.5 to match or exceed competitors on tests including GPQA Diamond and AIME 2025. The newer Qwen3.6 Plus extends capabilities further, supporting long context windows and more advanced coding and multimodal tasks.
The debate over MMLU-Pro rankings underscores a deeper issue in the AI sector. Benchmark scores remain a key marketing tool, even as researchers question their real-world relevance. Earlier tests such as standard MMLU have become less useful, with most frontier models scoring between 89 and 92 percent. While MMLU-Pro offers greater differentiation, narrowing gaps suggest the industry may soon require new evaluation standards to measure progress.