The Three Ways an MCP Server Fails You (And Why Tool Design Is Everything)

Three failure modes separate a good MCP server from a bad one. Half-stocked capabilities, wrong labels, stations drawn in the wrong place. Here’s what each one costs you.

Written By

Harry Abram

BizOps & Growth at Attention

Date published

5/20/2026

In the first piece in this series, I introduced MCP servers using the operational analogy of the McDonald’s kitchen. The protocol is the global blueprint for how a standardized kitchen should function. The MCP server is a specific brand’s localized operating manual and physical kitchen setup. The AI model is the new hire who can read the manual and start producing dishes immediately, without any experience in that specific building or with those menu items.

That is what good looks like.

This piece is about what bad looks like. Because here is the thing every buyer eventually figures out: when a vendor announces “we have an MCP,” they mean they have an MCP server. That statement, on its own, tells you almost nothing about whether their AI integration is actually any good. Your corner bodega may tell you they have an operating manual and then show you two grease-stained pages crumpled in a corner. Your bodega is not McDonald’s.

Three things determine whether an MCP server actually performs at the level the marketing implies. They fail independently. Each one ruins service in its own way.

TL;DR

An MCP server fails in three distinct ways: the kitchen is half-stocked (the underlying capabilities are thin), the labels are wrong (tools exist but are described badly), or the stations are drawn in the wrong place (each tool does too little, forcing the model into long, expensive call chains).
Tool count is a dance, not a number to maximize. Five tools is usually too few for any real workflow. Pile on a hundred poorly labeled tools and selection accuracy collapses: one study measured 78% with 10 tools and 13.62% with more than 100. The good kitchen sits in between and leans toward depth.
Niche, purpose-built tools ("unitaskers") exist for the same reason a McDonald’s clam-shell grill exists. Reasoning your way through with a general-purpose tool is its own kind of expensive.
The most important architecture choice in any MCP server is whether it has a bundled, server-side analysis tool. Anthropic’s own engineering team has documented that approaches which push intermediate work out of the model’s context window can collapse a 150,000-token workflow to roughly 2,000 tokens, a 98.7% reduction, with no loss of accuracy.
Attention exposes 68 tools across 15 functional groups today. That is intentional depth, but it is also more than we will ship long-term. We are actively reducing the count while keeping the unitaskers that matter. Five is too few. Sixty-eight is too many. The right number is somewhere in between.

Failure Mode 1: The Kitchen Is Half-Stocked

The underlying API is thin. The vendor exposes five tools when they should expose 25. Or everything is read-only. Or the tools can retrieve information but cannot do anything with it.

The line cook walks in, looks around, and finds a chef’s knife, a single sauté pan, and an electric kettle. They can do something. They cannot cook what was ordered. You cannot produce thousands of identical burgers without the proper equipment, and you certainly cannot run a lunch rush.

A lot of MCP announcements in 2025 fell into this category. Vendors wrapped their existing read-only API in MCP packaging and called it a day. The announcement looked like “we have an MCP.” The reality was “you can now ask Claude to tell you these three numbers we already display in the UI.” Broadly speaking, vendors that already had an "open architecture" and many capable API routes had an advantage: their MCP servers could do more from the day they reached the market.

This is the failure mode that is easiest to detect and easiest to underestimate. Easy to detect because the documentation outs itself: if there are only 3 tools and each of them starts with get_, list_, or search_, you have a cute viewing window, not a kitchen. Fine for a database, maybe, but not much else. Easy to underestimate because the limitation only shows up when you ask the AI to do something, not just look at something.
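The prefix check described above can be sketched in a few lines of Python. This is a rough heuristic under stated assumptions, not an official MCP utility: the function name `read_only_share` is hypothetical, the sample manifests are invented for illustration (only `search_calls` and `add_team_member` appear in this article), and the prefix list is just the three prefixes named above.

```python
READ_PREFIXES = ("get_", "list_", "search_")  # the read-style prefixes named above

def read_only_share(tool_names):
    """Return the fraction of tools whose names look read-only."""
    if not tool_names:
        return 0.0
    read_like = [name for name in tool_names if name.startswith(READ_PREFIXES)]
    return len(read_like) / len(tool_names)

# Hypothetical manifests, for illustration only.
viewing_window = ["get_report", "list_users", "search_calls"]
kitchen = ["search_calls", "get_report", "add_team_member", "update_scorecard"]

print(read_only_share(viewing_window))  # 1.0
print(read_only_share(kitchen))         # 0.5
```

A share of 1.0 does not prove a server is useless, only that every tool name reads like a lookup. It is a prompt to go and read the documentation more closely.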

Attention’s MCP server exposes 68 tools across 15 functional groups, versioned continuously since launch. Read tools for searching and analyzing calls. Write tools for configuring scorecards and managing teams. Admin tools for workspace management. The breadth matters because real workflows span all of it. “Show me objections from last quarter” and “now add a scorecard criterion to track that pattern going forward” are two halves of the same job. A server with only the first half is a brochure.

Failure Mode 2: The Labels Are Wrong

The tools exist. But they are described badly. Names overlap. Station signs are vague or inconsistent. The cook cannot tell from a glance which station handles what, so they pull out four implements, carry them over to the line, and try each one until something works. Time spent rummaging is wasted time. In a kitchen, that is service falling behind. In an AI workflow, time is tokens, and tokens are dollars.

This is not a theoretical failure. Research testing tool selection accuracy found that performance dropped from 78% with 10 tools to just 13.62% with more than 100 tools. The degradation is non-linear and catastrophic at scale. A separate study documented a “lost in the middle” effect: tools positioned in the center of a long list get selected correctly far less often than tools at the beginning or end.

Tool descriptions are not metadata. They are the entire user interface for the AI. A vague label is a station sign that just reads “food.” The cook will eventually grab something. It will not be the right thing.

This is also the failure mode that vendors are most likely to be in denial about, because to the engineer who wrote the tool, the description seems perfectly clear. The test of a tool description is not whether the person who wrote it can understand it. The test is whether an AI model who has never seen this kitchen before can read the label, glance at twenty other labels, and reliably pick the right one for the order on the ticket.
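A crude first pass at that test can be automated before any model is involved. The sketch below assumes each tool is represented as a dictionary with `name` and `description` keys, which mirrors the fields an MCP tool definition carries. The function `label_problems`, the eight-word threshold, and all three sample descriptions are hypothetical choices made for illustration.

```python
def label_problems(tools, min_words=8):
    """Flag descriptions that are too short or shared by several tools."""
    problems = []
    seen = {}  # normalized description -> first tool name that used it
    for tool in tools:
        name = tool["name"]
        text = tool["description"].strip().lower()
        if len(text.split()) < min_words:
            problems.append(f"{name}: description is under {min_words} words")
        if text in seen:
            problems.append(f"{name}: same description as {seen[text]}")
        else:
            seen[text] = name
    return problems

# Hypothetical tool definitions: two vague labels and one precise label.
tools = [
    {"name": "get_data", "description": "Gets data."},
    {"name": "fetch_data", "description": "Gets data."},
    {"name": "add_team_member",
     "description": "Add an existing workspace user to a team, given the user ID and team ID."},
]

for problem in label_problems(tools):
    print(problem)
```

A check like this only catches the most obvious failures. Whether a model reliably picks the right tool still has to be measured by running real requests against the full manifest.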

Failure Mode 3: The Stations Are Drawn in the Wrong Place

The equipment is there. The labels are precise. But each tool does too little, so plating one order requires the cook to make thirty trips between stations. Each trip costs time. Each intermediate component sits on the pass while the cook goes to fetch the next one. By the time the order is finished, the line is backed up and the food is cold.

This is the failure mode with the most severe cost implications. Anthropic’s own engineering team reported that a single workflow dropped from roughly 150,000 tokens under a naive multi-call pattern to roughly 2,000 tokens, a 98.7% reduction, when intermediate processing happened server-side rather than through the model’s context.

This is also the failure mode that is least visible from outside the kitchen. The tools look fine on paper. The labels look fine on paper. But when you ask a meaty analytical question, like “what objections came up in our enterprise deals last quarter,” the proverbial Big Mac takes 3 hours and $230 to assemble because the model had to pick up dozens of tools fifty times just to butter the bun. The architecture was wrong from the start.

Tool Count Is a Dance

Before any cooking happens, the cook looks at the kitchen around them. This is the manifest, the labeled inventory of every station and every tool the kitchen contains. The model reads it on every request to understand what is available.

When the labels are precise, this glance is cheap. The cook sees the ticket, sees a tool labeled for exactly that job, picks it up, and goes to work.

When the labels are vague or overlapping, the cook cannot tell from the labels which implement is the right one, so they grab three or four and carry them to the line. Time spent rummaging is wasted time, and in the world of AI, time is tokens.

Now consider the other extreme: a kitchen with very few tools, all of them general-purpose. The manifest is small. The glance is fast. But when the ticket calls for twenty Quarter Pounders, the cook is suddenly reasoning their way through how to cook them efficiently with what is available. They will get there, eventually, and the result will be uneven in temperature and slow. A highly specific tool -- the clam-shell grill -- would have cooked twenty patties simultaneously and perfectly to temperature in three minutes. The generalist kitchen finishes the order eventually. It does not finish on time, and it does not deliver consistent results.

This is the case for specificity. Niche, purpose-built tools, aka "unitaskers," exist because some jobs have one right instrument that guarantees speed and consistency. A dedicated bun toaster. A sauce dispenser calibrated to a precise volume. A patty-press set to a specific thickness. Each one looks absurd sitting alone. Each one is the only thing on earth that does its job with the precision the brand requires.

Attention exposes 68 tools today. That is intentional depth, but it is also more than we will ship long-term. We are actively reducing the number of tools we expose while keeping the relevant unitaskers in place. Five tools is too few. Sixty-eight is too many. The right number is somewhere in between, and the work of getting there is ongoing.

Why Attention Has 68 Tools, and Why That Will Change

When Claude needs to add a user to a team in Attention, it calls add_team_member. When it needs to change someone’s role, that has its own tool too. The precision of a purpose-built tool is what makes it fast and inexpensive to use.

This depth reflects something older than MCP. Attention has had a public API for years, the same API that lets customers build Attention into their broader sales stack, connect it to their CRM workflows, and automate processes that competitors lock behind closed interfaces. In conversation intelligence, vendors are frequently described as walled gardens: a polished dining room with the kitchen door welded shut, take it or leave it. Attention’s 68 tools are a direct extension of the opposite philosophy. That depth is what genuine openness looks like when you count it.

But depth has a ceiling. We are not going to ship 200 tools. The work in front of us is consolidation: identifying which atomic tools are genuinely used as unitaskers, and which can be folded together or replaced by a single more capable tool. The goal is not the smallest possible number. It is the smallest number that still gives the model the right instrument for every job that matters.

The Robot Coupe: Why Server-Side Processing Changes the Math

There is a second kind of expense that specificity alone cannot solve. Some questions are not “do one specific thing.” They are “look across a hundred conversations and tell me what is happening.”

ask_attention is the Robot Coupe in this kitchen. The Robot Coupe is what real production kitchens use when a knife would take an hour. You drop in the mirepoix, you hit the button, you get a uniform dice in eight seconds. You pass ask_attention a question about your process that spans thousands of calls, like “what are the common objections we face.” Attention’s own AI engine processes up to 25 calls server-side per batch across transcripts, scorecard results, and intelligence items, and returns a finished answer. Looking through 100 calls means picking up the tool four times. By comparison, if search_calls were the only tool you had, the cook would have to pick it up 100 separate times. The heavy work happens on Attention’s infrastructure. Claude gets the plated dish, not the prep.
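The batch arithmetic above is a ceiling division, sketched here in Python. The helper name `tool_invocations` is hypothetical, and the one-call-per-invocation figure for the unbatched case is this article's framing of the comparison, not a documented limit of search_calls.

```python
import math

def tool_invocations(num_calls, calls_per_invocation):
    """Number of tool invocations needed to cover num_calls."""
    return math.ceil(num_calls / calls_per_invocation)

batched = tool_invocations(100, 25)     # up to 25 calls handled per batch
one_by_one = tool_invocations(100, 1)   # assumed: one call handled per invocation

print(batched, one_by_one)  # 4 100
```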

Anthropic’s own engineering team has documented the broader pattern—pushing intermediate work out of the model’s context window and onto the server side—and reported a representative case in which a 150,000-token workflow collapsed to roughly 2,000 tokens, a 98.7% reduction with no loss of accuracy. ask_attention applies the same underlying principle to conversation intelligence: do the heavy work on Attention’s infrastructure and return only the finished answer.
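The 98.7% figure follows directly from the two token counts quoted above. A minimal check in Python, where `reduction_percent` is a hypothetical helper name:

```python
def reduction_percent(before_tokens, after_tokens):
    """Percentage of tokens saved going from before_tokens to after_tokens."""
    return (before_tokens - after_tokens) / before_tokens * 100

print(round(reduction_percent(150_000, 2_000), 1))  # 98.7
```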

The two-part design works together. Precision tools for specific operational jobs like "permission this user" -- one clean call, the right instrument, done. A bundled server-side analysis tool for the heavy analytical questions -- one call, Attention does the cooking, Claude gets the plate. The 68-tool manifest still loads on every request as a fixed cost. But the variable cost, i.e. how many tools Claude actually picks up to answer your question, is where the real spending lives. That is where ask_attention collapses what would be fifty trips into one.

You can reinforce this explicitly. Claude is responsive to tool-use guidance in system prompts, so adding a single instruction to your Claude configuration steers it toward ask_attention for analytical queries:

“When analyzing Attention calls or asking questions about sales conversations, always use the ask_attention tool rather than calling or get_call_details in sequence.”

That does not change the manifest overhead. It does change which tool gets called, and for any real analytical question, that is by far the larger number.

What Comes Next

If you have made it this far, you can name the three things that separate a good MCP server from a bad one. Depth of capability. Precision of labeling. Architecture that scopes tools at the right level of work -- with a server-side bundled tool for the analytical questions that would otherwise destroy your token budget.

The next piece in this series is the practical version of everything in this one. It is a buyer’s checklist: the questions to ask, the documentation to read, the things to verify before you connect any MCP server to a system that touches your company’s data.

→ Continue to Piece 3: How to Evaluate an MCP Server Before You Connect It

References

MCP Playground — “MCP Token Counter: Why Your Tools Are Silently Eating Your Context Window” — https://mcpplaygroundonline.com/blog/mcp-token-counter-optimize-context-window — 2026
Attention Docs — “Attention MCP Server” — https://docs.attention.com/mcp/overview — 2026
Hou et al. / arXiv — “RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation” — https://arxiv.org/pdf/2505.03275 — 2025
vLLM Semantic Router — “Semantic Tool Selection: Building Smarter AI Agents with Context-Aware Routing” — https://vllm-semantic-router.com/blog/semantic-tool-selection/ — 2025
Maxim AI / Bifrost — “Cutting MCP Token Costs by 92% at 500+ Tools” — https://www.getmaxim.ai/articles/cutting-mcp-token-costs-by-92-at-500-tools/ — 2026
Attention Docs — “AI Analysis Tools” — https://docs.attention.com/mcp/tools/ai-analysis — 2026

FAQ

Why does an MCP server cost tokens even before I ask anything?

Every tool definition your MCP server exposes is loaded into the model’s context window at the start of each request. The model needs to see the full inventory before it can decide which tools are relevant to your question. Each definition typically runs anywhere from a few hundred to roughly a thousand tokens, depending on complexity. This is a fixed cost per request, not a variable one—it does not grow as you ask more questions in a session. The more important cost variable is what happens after the model picks a tool: whether it needs to reach for one instrument or fifty to plate your answer.
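To put rough numbers on the fixed and variable parts, here is a back-of-envelope sketch in Python. It uses the 68-tool count from this article and the per-definition range quoted above, and it assumes "a few hundred" means about 300 tokens. The 1,500 tokens per tool call is a purely hypothetical figure chosen for illustration, and both function names are invented.

```python
def manifest_tokens(num_tools, tokens_per_definition):
    """Fixed context cost of loading every tool definition."""
    return num_tools * tokens_per_definition

def request_tokens(num_tools, tokens_per_definition, tool_calls, tokens_per_call):
    """Fixed manifest cost plus the variable cost of the calls actually made."""
    fixed = manifest_tokens(num_tools, tokens_per_definition)
    return fixed + tool_calls * tokens_per_call

low = manifest_tokens(68, 300)      # assuming "a few hundred" is about 300
high = manifest_tokens(68, 1_000)
print(low, high)  # 20400 68000

# Hypothetical 1,500 tokens per call: one bundled call versus fifty chained calls.
print(request_tokens(68, 300, 1, 1_500))    # 21900
print(request_tokens(68, 300, 50, 1_500))   # 95400
```

Under these assumptions the manifest is the same in both cases, and the gap between the two totals comes entirely from how many calls the model has to make.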

What is the “lost in the middle” problem, and does it affect Attention?

When a model receives a long list of tool definitions, research shows it is more likely to correctly select tools near the beginning or end of the list than tools positioned in the middle. This is a real phenomenon at very large tool counts with poorly differentiated descriptions. The mitigation is precise, distinctive labeling—each tool doing exactly one thing and being described unambiguously. That is the design standard Attention builds to. You can also address it directly by adding tool-use guidance to your Claude system prompt, steering it toward the right tool for the right job.

Is having more tools in an MCP server always worse?

No. More tools is worse when the tools are poorly labeled or overlapping, because the model has to reason its way through which one to pick. More tools is better when each tool is a true unitasker for a job that genuinely benefits from a purpose-built instrument. The dance is finding the right balance: enough specificity that the right instrument always exists for the job, but not so many tools that the manifest becomes unreadable or the cook has to walk to forty stations to plate one dish.

What is server-side processing and why does it matter for MCP?

Server-side processing means the heavy analytical work happens on the vendor’s infrastructure, with the result returned to the AI as a finished answer, rather than the AI doing the retrieval and synthesis itself by chaining many smaller tool calls. The difference is the difference between a Robot Coupe and a chef’s knife: both can produce a uniform dice, but one takes eight seconds and the other takes an hour. Anthropic’s engineering team has documented a representative case in which this pattern reduced a 150,000-token workflow to roughly 2,000 tokens—a 98.7% reduction with no loss of accuracy.

What should I read next after learning about MCP failure modes?

If you have not already, the first piece in this series—What Is an MCP Server? The Magic Problem, Explained.—introduces the underlying concepts and the McDonald’s kitchen analogy this piece builds on. The third piece in the series is a practical checklist for evaluating any MCP server before you connect it. Both are available on the Attention blog. And if you want to nail down the exact vocabulary people use when talking about MCPs, Piece 5 is a working glossary of every term you’ll hear.
