News

LLMs Fail More Than Half of Real-Life MCP Tests: Salesforce Research

A few weeks ago, Salesforce Research released a tool for testing how well AI agents perform when using Model Context Protocol (MCP) servers. The team has now published results from its MCP-Universe benchmark, which show that leading LLMs, including ChatGPT, Google Gemini, Grok, and Claude, all fail more than half the time at real-life tasks such as location navigation, browser automation, and web search. Reality can be so annoying.