The Model Context Protocol (MCP) standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and so fail to capture the complexity and realism of real-world workflows. To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of 127 high-quality tasks collaboratively created by domain experts and AI agents, each with a curated initial state and a programmatic verification script. These tasks demand diverse create, read, update, and delete (CRUD) operations and richer interactions with the environment. We evaluate cutting-edge LLMs using a minimal agent framework. The best-performing model, gpt-5-medium, reaches only 52.56% pass@1 and 33.86% pass^4, while other strong models, including claude-sonnet-4 and o3, fall below 30% pass@1 and 15% pass^4. On average, LLMs require 16.2 turns and 17.4 tool calls per task, highlighting the stress-testing nature of MCPMark.
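The abstract does not define the two reported metrics, so the sketch below shows one common reading, stated here as an assumption: pass@1 is the success rate of a single run averaged over tasks, and pass^4 is the fraction of tasks on which all four independent runs succeed. The function names and the `results` layout are hypothetical and are not taken from the MCPMark code.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Mean single-run success rate, averaged over tasks (assumed definition)."""
    per_task = [sum(runs) / len(runs) for runs in results.values()]
    return sum(per_task) / len(per_task)


def pass_hat_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks whose first k runs all succeed (assumed definition)."""
    solved = 0
    for runs in results.values():
        if len(runs) < k:
            raise ValueError("each task needs at least k recorded runs")
        if all(runs[:k]):
            solved += 1
    return solved / len(results)


# Hypothetical outcomes: three tasks, four runs each (True = verifier passed).
example = {
    "task_a": [True, True, True, True],
    "task_b": [True, False, True, True],
    "task_c": [False, False, False, False],
}

print(f"pass@1 = {pass_at_1(example):.4f}")
print(f"pass^4 = {pass_hat_k(example, 4):.4f}")
```

Under this reading, pass^4 can never exceed pass@1, because a task counts toward pass^4 only if every run succeeds. The gap between the two numbers is therefore a rough indicator of run-to-run inconsistency.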
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
Zijian Wu, Xiangyan Liu, Xinyuan Zhang, Lingjun Chen, Fanqing Meng, Lingxiao Du, Yiran Zhao, Fanshi Zhang, Yaoqi Ye, Jiawei Wang, Zirui Wang, Jinjie Ni, Yufan Yang, Arvin Xu, Michael Qizhe Shieh
arXiv
GitHub
ICLR 2026
