MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

Zijian Wu
Xiangyan Liu
Xinyuan Zhang
Lingjun Chen
Fanqing Meng
Lingxiao Du
Yiran Zhao
Fanshi Zhang
Yaoqi Ye
Jiawei Wang
Zirui Wang
Jinjie Ni
Yufan Yang
Arvin Xu
Michael Qizhe Shieh
arXiv | GitHub | ICLR 2026

The Model Context Protocol (MCP) standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of 127 high-quality tasks collaboratively created by domain experts and AI agents, each with a curated initial state and a programmatic verification script. These tasks demand diverse create, read, update, and delete (CRUD) operations and richer interaction with the environment. We evaluate cutting-edge LLMs using a minimal agent framework. The best-performing model, gpt-5-medium, reaches only 52.56% pass@1 and 33.86% pass^4, while other strong models, including claude-sonnet-4 and o3, fall below 30% pass@1 and 15% pass^4. On average, LLMs require 16.2 turns and 17.4 tool calls per task, highlighting the stress-testing nature of MCPMark.
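To make the two reported metrics concrete, here is a minimal sketch of how they could be computed from per-task run outcomes. It assumes that pass@1 is the per-run success rate averaged over tasks and that pass^4 is the fraction of tasks solved in all four independent runs; the abstract does not spell out these definitions, so treat them as assumptions. The function names and the sample data are hypothetical and are not taken from the MCPMark codebase.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Per-run success rate, averaged over tasks (assumed definition)."""
    per_task = [sum(runs) / len(runs) for runs in results.values()]
    return sum(per_task) / len(per_task)


def pass_hat_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks whose first k runs ALL succeed (assumed definition)."""
    for task, runs in results.items():
        if len(runs) < k:
            raise ValueError(f"task {task!r} has fewer than {k} runs")
    solved = [all(runs[:k]) for runs in results.values()]
    return sum(solved) / len(solved)


# Hypothetical outcomes: four independent runs for each of four tasks.
results = {
    "task_a": [True, True, True, True],
    "task_b": [True, False, True, True],
    "task_c": [False, False, False, False],
    "task_d": [True, True, True, True],
}

print(f"pass@1 = {pass_at_1(results):.4f}")      # 0.6875
print(f"pass^4 = {pass_hat_k(results, 4):.4f}")  # 0.5000
```

Under these assumed definitions, pass^4 can never exceed pass@1, because a single failed run disqualifies a task. That is consistent with the gap between the two numbers reported above, and it is why pass^4 reads as a measure of consistency rather than of peak capability.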