In the early days of building AI agents, the hard part was getting anything to work. Connecting models to tools, managing context, handling basic loops — all of it required duct tape and prayer. But as the agentic ecosystem matures, we've crossed an invisible threshold. Tools are no longer scarce. They're everywhere. We now have entire Model Context Protocol (MCP) servers powering multi-agent runtimes with access to thousands of endpoints, scripted tools, live APIs, and memory fetchers. The frontier has shifted. The hard part now isn't wiring up tools. It's helping models discover which ones to use.

This is the crisis of discoverability.

As open-source toolkits like LangGraph, AutoGen, CrewAI, and Manus race to enable complex agent workflows, they quietly inherit an unsolved problem: LLMs don't natively know how to navigate a growing universe of tools. This is not just a prompt-formatting issue. It's a systems design problem. And as more teams deploy agents in real-world production flows, the cost of poor discoverability compounds.

Tools Without Maps

The promise of MCP is that tools and context can be passed in programmatically, so developers can bind agents to external systems without needing to manually cram everything into a static prompt. But this flexibility has a tradeoff: most MCP runtimes now inject hundreds of functions, structured context blobs, and memory references into the model without a clear indexing system. The result? Tools outpace the model's ability to reason about them.

Imagine being handed a toolbelt with 200 tools. Some have clear names. Others don't. Some were just added yesterday and aren't documented yet. Others have overlapping names like get_invoice() and fetch_invoice_data(). You have no IDE, no autocomplete, no teammate to ask. And you're expected to build a plan in 5 seconds.

That's the LLM's experience inside most agentic stacks today.

Some systems try to mitigate this by selectively injecting tools based on hardcoded filters or embedding-based similarity. Others rely on action masking and logit steering. But these are brittle hacks. They sidestep the root issue: discoverability isn't being treated as a first-class problem.
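
To make the brittleness concrete, here is a minimal sketch of embedding-based tool pre-filtering. Everything in it is illustrative: a toy hashing "embedder" stands in for a real embedding model, and the tool names and descriptions are invented.

```python
import hashlib
import math

def embed(text: str, dim: int = 256) -> list[float]:
    """Toy stand-in for a real embedding model: hash each token into a bucket."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[int(hashlib.md5(token.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already unit-length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# Hypothetical tool descriptions; a real runtime would pull these from its MCP tool list.
TOOLS = {
    "get_invoice": "Fetch a single invoice by ID from the billing system.",
    "fetch_invoice_data": "Bulk-export invoice line items for a date range.",
    "resolve_conflicts": "Merge conflicting calendar holds for a meeting slot.",
}

def prefilter_tools(task: str, top_k: int = 2) -> list[str]:
    """Inject only the top_k tools whose descriptions look most similar to the task."""
    task_vec = embed(task)
    ranked = sorted(TOOLS, key=lambda name: cosine(task_vec, embed(TOOLS[name])), reverse=True)
    return ranked[:top_k]

print(prefilter_tools("fetch the invoice data for last month"))
```

The fragility is visible even at this scale: rephrase a description or swap the embedding model and the shortlist changes, with no feedback from what actually worked.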

Emergent Bloat

As teams layer more capabilities into their MCP servers, the action space grows faster than the semantic signal available to navigate it. Planning quality doesn't degrade linearly. It falls off a cliff.

What used to be a 3-tool decision with clear affordances becomes a 50-tool soup where the planner starts guessing. Execution failures rise. Retry loops kick in. Token costs spike. The agent stalls not because it doesn't know how to reason — but because it can't find what to reason with.

We've seen this pattern emerge across several open-source stacks. A new workflow gets added. A few tools join the pool. Then another team adds their vertical. Over time, the MCP server becomes a microservices zoo. No one trims the toolset because it's unclear which tools are safe to remove. Observability is weak. And the agents? They get slower, less reliable, and harder to debug.

This is the paradox: more tools should mean more capability. But without discoverability, it means more entropy.

Beyond Function Names

The naive solution is to "just name tools better." But naming alone isn't enough. Models don't think like devs. They don't pattern match on camelCase or deduce semantics from prefixes. They rely on co-occurrence, frequency, and examples. If the tool name is resolveConflicts, but the model has never seen that term associated with version control or scheduling, it won't guess right.

Tool metadata helps — but only if it's exposed in a form the model can learn from. Descriptions should include natural language examples, expected arguments, side effects, and common failures. Tools that return the same shape should note when they differ in semantics. Models can learn to generalize from patterns, but they need a tight, consistent grammar to do so.
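
As a rough sketch of what that grammar might contain (the field names are suggestions, not an existing MCP schema, and the example tool is invented):

```python
from dataclasses import dataclass

@dataclass
class ToolSpec:
    """Illustrative metadata record for one tool; richer than a bare name and signature."""
    name: str
    description: str            # plain-language summary of what the tool does
    example_calls: list[str]    # natural-language request -> call pairings
    arguments: dict[str, str]   # argument name -> type and meaning
    side_effects: list[str]     # writes, notifications, charges, deletions
    common_failures: list[str]  # failure modes the planner should anticipate
    returns: str                # the shape *and* the semantics of the return value

resolve_conflicts = ToolSpec(
    name="resolveConflicts",
    description="Reschedules overlapping calendar holds for a single attendee.",
    example_calls=['"Clear my double-booking on Friday" -> resolveConflicts(attendee_id, date)'],
    arguments={"attendee_id": "str, internal user ID", "date": "ISO 8601 date"},
    side_effects=["Moves calendar events", "Sends rescheduling notifications"],
    common_failures=["No free slot available", "Attendee calendar is read-only"],
    returns="List of moved event IDs; an empty list means nothing needed to change",
)
```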

What's missing is an index. Not just a JSON list of tool specs, but a contextual, searchable, relevance-ranked interface that the agent can query at runtime. Think of it as grep for agent toolkits. Not unlike VS Code's command palette, but for LLMs.
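
Here's a deliberately crude sketch of what querying such an index could feel like at runtime. Keyword overlap stands in for real retrieval, and the tool records are invented:

```python
# Hypothetical tool records; in practice these would come from the MCP server's tool list.
TOOL_RECORDS = [
    {
        "name": "resolveConflicts",
        "description": "Reschedules overlapping calendar holds for a single attendee.",
        "examples": ['"Clear my double-booking on Friday" -> resolveConflicts(attendee_id, date)'],
    },
    {
        "name": "get_invoice",
        "description": "Fetch a single invoice by ID from the billing system.",
        "examples": ['"Pull up invoice 4417" -> get_invoice(invoice_id)'],
    },
]

def search_tools(query: str, records: list[dict], top_k: int = 3) -> list[dict]:
    """Relevance-ranked lookup over descriptions and examples: grep for the toolbelt."""
    terms = set(query.lower().split())

    def score(record: dict) -> int:
        haystack = " ".join([record["description"], *record["examples"]]).lower()
        return sum(term in haystack for term in terms)

    return sorted(records, key=score, reverse=True)[:top_k]

# The planner queries on demand instead of reading a 200-tool prompt dump up front.
candidates = search_tools("fix a double-booked meeting on Friday", TOOL_RECORDS)
print([c["name"] for c in candidates])
```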

From Discoverability to Adaptivity

True discoverability isn't static. It should evolve. Agents should learn from usage logs, success rates, tool failures, and execution traces. If a tool is repeatedly selected and keeps failing, that should be surfaced. If another tool solves the same problem more efficiently, promote it. Over time, the system becomes adaptive — ranking tools not just by description similarity but by historical context and outcome.
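
One possible shape for that blended ranking, sketched with a hypothetical outcome log and made-up numbers:

```python
from collections import defaultdict

# Hypothetical outcome log: tool name -> list of success/failure flags.
# In a real stack this would be built from execution traces, not hardcoded.
outcomes: dict[str, list[bool]] = defaultdict(list)
outcomes["fetch_invoice_data"] += [True, True, True, False]
outcomes["get_invoice"] += [False, False, True]

def adaptive_score(similarity: float, tool: str, weight: float = 0.6) -> float:
    """Blend description similarity with a smoothed historical success rate."""
    history = outcomes[tool]
    success_rate = (sum(history) + 0.5) / (len(history) + 1)  # Laplace-style smoothing
    return (1 - weight) * similarity + weight * success_rate

# A tool with a slightly weaker description match but a better track record can win.
print(adaptive_score(0.72, "get_invoice"))          # ~0.51
print(adaptive_score(0.65, "fetch_invoice_data"))   # ~0.68
```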

This implies deeper integration between planning and telemetry. The MCP runtime must track not just which tools were called, but in what context, with what arguments, and whether they succeeded. This data can be fed back into embeddings, fine-tuning, or planner heuristics.
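
A minimal sketch of the kind of trace record that loop needs. The field names are illustrative, not an established telemetry schema:

```python
import json
import time
from dataclasses import dataclass, asdict, field

@dataclass
class ToolCallTrace:
    """One record per tool call: enough context to feed ranking, fine-tuning, or heuristics."""
    tool: str
    task_summary: str            # what the planner was trying to accomplish, in plain language
    arguments: dict              # the arguments actually passed
    succeeded: bool
    error: str = ""              # empty when the call succeeded
    latency_ms: float = 0.0
    timestamp: float = field(default_factory=time.time)

def log_trace(trace: ToolCallTrace, path: str = "tool_traces.jsonl") -> None:
    """Append to a local JSONL file; a production runtime would ship this to its telemetry store."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(trace)) + "\n")

log_trace(ToolCallTrace(
    tool="resolveConflicts",
    task_summary="clear a double-booked meeting on Friday",
    arguments={"attendee_id": "u_123", "date": "2025-06-06"},
    succeeded=False,
    error="attendee calendar is read-only",
    latency_ms=840.0,
))
```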

Without this loop, we're flying blind.

Why This Matters Now

MCP adoption is accelerating. Dozens of teams are spinning up local orchestration runtimes. Foundation models are shipping with better tool-use capabilities. LLMs are being turned loose on real-world business processes. The window of experimentation is closing. Discoverability is no longer a toy problem. It's a scaling bottleneck.

The risk isn't just technical debt. It's user trust. When agents fail to pick the right tools, they don't just waste compute. They fail publicly. They misfire in customer support chats. They miss deadlines in operations workflows. They suggest broken next actions in sensitive domains like healthcare or finance.

Users don't care if your planner had 200 tools to pick from. They care that it chose wrong.

Toward a Discoverability Stack

To solve this, we need to treat discoverability like search. Not just a byproduct of prompt design, but an active, composable layer in the agent stack.

What might that look like?

  • A tool registry with natural language embeddings, usage stats, and context-aware ranking.
  • A planner interface that can ask: "Given this task, what are 3 candidate tools I've seen succeed in similar situations?"
  • A self-repair module that can retry with an alternate tool if the initial call fails (sketched after this list).
  • A memory trace system that links goals with tool outcomes over time.
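
To make the self-repair piece concrete, here is a small sketch. Both call_tool and the candidate list are hypothetical stand-ins for a real MCP client and a registry query:

```python
class ToolCallError(Exception):
    pass

def call_tool(name: str, **kwargs) -> dict:
    """Stand-in for an MCP tool invocation; it simulates one flaky tool for the demo."""
    if name == "get_invoice":
        raise ToolCallError("upstream billing API timed out")
    return {"tool": name, "ok": True, "args": kwargs}

def run_with_fallback(candidates: list[str], **kwargs) -> dict:
    """Self-repair loop: when a call fails, retry the task with the next-ranked tool."""
    errors: dict[str, str] = {}
    for name in candidates:
        try:
            return call_tool(name, **kwargs)
        except ToolCallError as exc:
            errors[name] = str(exc)   # surface the failure to telemetry and the planner
    raise RuntimeError(f"all candidate tools failed: {errors}")

# Candidates would come from the registry's context-aware ranking.
print(run_with_fallback(["get_invoice", "fetch_invoice_data"], invoice_id="4417"))
```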

This won't emerge overnight. But teams that build toward this will gain an unfair advantage. Their agents won't just act faster. They'll act smarter, more consistently, and with clearer reasoning paths.

In the age of agent swarms and decentralized workflows, discoverability is the difference between orchestration and chaos. It's not a nice-to-have. It's the next frontier.

The agent that finds the right tool wins. The system that helps it get there, faster and more reliably, defines the next wave of agentic computing.