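LLM Agent Capability Benchmark Leaderboard

This page ranks large language models on agent capability across the mainstream agent benchmarks Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and OSWorld-Verified. It compares GPT, Claude, Qwen, DeepSeek, and other models on tool use, task planning, and autonomous execution.

Data updated: 2026-04-28 13:02:03

As of April 2026, this page covers Aider-Polyglot, τ²-Bench, Terminal Bench 2.0, Tool Decathlon, and other benchmarks, with a focus on comparing models by agent capability.

Each model's detail page lists its context length, license, and API pricing. The data methodology page explains how the figures are defined.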

Benchmarks

Agent capability: Aider-Polyglot, τ²-Bench

AI agent tool use: Terminal Bench 2.0, Tool Decathlon, OSWorld-Verified

Leaderboard highlights

Ranked by OSWorld-Verified score.

Current SOTA: Claude Mythos Preview (Anthropic), 79.60 on OSWorld-Verified

Best open-source: Kimi K2.6 (Moonshot AI), 73.10 on OSWorld-Verified (−6.50 vs. SOTA)

Best Chinese model: Qwen3.5-397B-A17B (Alibaba), 62.20 on OSWorld-Verified (−17.40 vs. SOTA)
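The deltas shown next to the best open-source and best Chinese entries (−6.50 and −17.40) match each model's OSWorld-Verified score minus the SOTA score. A minimal sketch of that arithmetic, assuming this is how the page derives them; `gap_to_sota` is a hypothetical helper name, not something the site exposes:

```python
from decimal import Decimal

# OSWorld-Verified scores copied from the highlights on this page.
# Decimal avoids binary floating-point noise in two-decimal scores.
SCORES = {
    "Claude Mythos Preview": Decimal("79.60"),
    "Kimi K2.6": Decimal("73.10"),
    "Qwen3.5-397B-A17B": Decimal("62.20"),
}


def gap_to_sota(scores):
    """Return each model's score minus the highest score (hypothetical helper)."""
    best = max(scores.values())
    return {name: score - best for name, score in scores.items()}


gaps = gap_to_sota(SCORES)
for name, gap in gaps.items():
    # Signed, two-decimal formatting, e.g. a gap of -6.50 for Kimi K2.6.
    print(f"{name}: {gap:+.2f}")
```

The gap is only meaningful between models scored on the same benchmark; rows with no OSWorld-Verified score have no gap to report.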

LLM benchmark results

Data source: DataLearnerAI


Rank	Model	Mode	Vendor	Aider-Polyglot	τ²-Bench	Terminal Bench 2.0	Tool Decathlon	OSWorld-Verified	License
1	Claude Mythos Preview	Extended thinking, tools	Anthropic	—	—	82.00	—	79.60	Closed source
2	GPT-5.5	Thinking on, tools	OpenAI	—	—	82.70	—	78.70	Closed source
3	Opus 4.7	Extended thinking, tools	Anthropic	—	—	69.40	—	78.00	Closed source
4	GPT-5.4	Thinking level: extra high, tools	OpenAI	—	—	75.10	—	75.00	Closed source
5	Kimi K2.6	Thinking on, tools	Moonshot AI	—	—	66.70	50.00	73.10	Free for commercial use
6	Claude Opus 4.6	Extended thinking, tools	Anthropic	—	91.89	65.40	—	72.70	Closed source
7	Claude Sonnet 4.6	Thinking on, tools	Anthropic	—	—	59.10	—	72.50	Closed source
8	GPT-5.4 mini	Thinking level: extra high, tools	OpenAI	—	—	60.00	42.90	72.10	Closed source
9	Qwen3.5-397B-A17B	Thinking on, tools	Alibaba	—	86.70	52.50	38.30	62.20	Free for commercial use
10	Claude Sonnet 4.5	Thinking on, tools	Anthropic	—	84.70	42.80	—	61.40	Closed source
11	Qwen3.5-27B	Thinking on, tools	Alibaba	—	79.00	41.60	—	56.20	Free for commercial use
12	Claude Sonnet 4	Thinking on, tools	Anthropic	—	—	—	—	42.20	Closed source
13	GPT-5.4 nano	Thinking level: extra high, tools	OpenAI	—	—	46.30	35.50	39.00	Closed source
14	Claude Sonnet 3.7	Thinking on, tools	Anthropic	—	61.80	—	—	28.00	Closed source
15	o3-pro	Thinking level: high	OpenAI	84.90	—	—	—	—	Closed source
16	Gemini 2.5 Pro	Thinking on	Google DeepMind	83.10	—	—	—	—	Closed source
17	OpenAI o3	Thinking level: high	OpenAI	81.30	—	—	—	—	Closed source
18	Grok 4	Thinking on	xAI	79.60	—	—	—	—	Closed source
19	DeepSeek-V3.1	Thinking on	DeepSeek-AI	76.30	—	—	—	—	Free for commercial use
20	DeepSeek-V3.1 Terminus	—	DeepSeek-AI	76.10	—	—	—	—	Free for commercial use
21	DeepSeek V3.2-Exp	Thinking on, tools	DeepSeek-AI	74.50	66.70	—	—	—	Free for commercial use
22	OpenAI o4-mini	Thinking level: high	OpenAI	72.00	—	—	—	—	Closed source
23	Claude Opus 4	Thinking on	Anthropic	72.00	—	—	—	—	Closed source
24	DeepSeek-R1-0528	Thinking on	DeepSeek-AI	71.40	—	—	—	—	Free for commercial use
25	Claude Opus 4	—	Anthropic	70.10	—	—	—	—	Closed source
26	DeepSeek V3.2	Thinking on, tools	DeepSeek-AI	69.90	80.30	46.40	—	—	Free for commercial use
27	DeepSeek-V3.1	—	DeepSeek-AI	68.40	—	—	—	—	Free for commercial use
28	Qwen3-Coder-Next	Standard mode, tools	Alibaba	66.20	—	36.20	—	—	Free for commercial use
29	Claude Sonnet 3.7	Thinking on	Anthropic	64.90	—	—	—	—	Closed source
30	Claude Sonnet 4	Thinking on	Anthropic	61.30	—	—	—	—	Closed source
31	M2.1	Thinking on, tools	MiniMaxAI	61.00	—	47.90	—	—	Free for commercial use
32	Claude Sonnet 3.7	—	Anthropic	60.40	—	—	—	—	Closed source
33	Kimi K2	—	Moonshot AI	59.10	—	—	—	—	Free for commercial use
34	Gemini 2.5 Flash	Thinking on	Google DeepMind	56.70	—	—	—	—	Closed source
35	DeepSeek-V3-0324	—	DeepSeek-AI	55.10	—	—	—	—	Free for commercial use
36	GLM-4.7	Thinking on, tools	Zhipu AI	52.10	87.40	41.00	—	—	Free for commercial use
37	Claude 3.5 Sonnet New	—	Anthropic	51.60	—	—	—	—	Closed source
38	Qwen3-Next	—	Alibaba	49.80	—	—	—	—	Free for commercial use
39	Qwen3-32B	Thinking on	Alibaba	40.00	—	—	—	—	Free for commercial use
40	GPT-4o (2025-03-27)	—	OpenAI	27.10	—	—	—	—	Closed source
41	Qwen 3.6 Plus Preview	Thinking on, tools	Alibaba	—	—	61.60	39.80	—	Closed source
42	Composer 2	Thinking on	Cursor	—	—	61.70	—	—	Closed source
43	DeepSeek-V4-Pro	Thinking on, tools	DeepSeek-AI	—	—	63.30	—	—	Free for commercial use
44	GLM 5.1	Thinking on, tools	Zhipu AI	—	—	63.50	40.70	—	Free for commercial use
45	Qwen3.6-Max-Preview	Deep thinking mode, tools	Alibaba	—	—	65.40	—	—	Closed source
46	DeepSeek-V4-Pro	Thinking level: extra high, tools	DeepSeek-AI	—	—	67.90	—	—	Free for commercial use
47	GPT-5.3 Codex	Standard mode, tools	OpenAI	—	—	77.30	—	—	Closed source
48	Haiku 4.5	Standard mode, tools	Anthropic	—	33.00	—	—	—	Closed source
49	Qwen3-235B-A22B	Thinking on, tools	Alibaba	—	34.40	—	—	—	Free for commercial use
50	DeepSeek-V3.1 Terminus	Thinking on, tools	DeepSeek-AI	—	37.00	—	—	—	Free for commercial use

Showing 50 of 93 models. The full OSWorld-Verified benchmark page lists the rest.