You ask your AI assistant: "What's the current status of incident-42 in our Jira board?" It types back a real answer - ticket owner, priority, last comment. No hallucination, no stale training data. But you sent one chat message and the model has no internet connection. So what actually happened between your question and that answer?
The answer is tool calling, and understanding it changes how you think about everything from chatbot reliability to why your agent sometimes returns the wrong arguments. This is not a metaphor piece. Here is the mechanics.
What "tool calling" actually means (and what it doesn't)
Tool calling - also called function calling - lets an LLM request the execution of external functions during a conversation. The model does not execute the function itself: it outputs a structured JSON request specifying which function to call and with what arguments. Your application code handles the actual execution and hands the result back.
That distinction matters more than it sounds. The LLM does not call functions. Regardless of what you have been told, it simply doesn't. If your LLM API or SDK calls functions for you, there is a layer of software wrapped around it taking care of that and invoking the function.
LLM tool calling is a mechanism that allows an AI model to generate structured requests - typically in JSON - to invoke external functions or APIs. Instead of the model guessing information it doesn't have, it recognizes a gap in its capability and requests to use a specific tool to bridge it.
The terms "function calling" and "tool calling" are often used interchangeably, but there's a subtle difference worth knowing. While people often use "function calling" and "tool calling" interchangeably, the latter is the modern standard. Function calling originally referred to matching a specific JSON signature, while tool calling builds upon this idea and supports a wider range of capabilities including provider-built tools, such as code interpreters, web browsing, and retrieval.
The loop, step by step
Take the Jira example. The developer building the assistant has registered a tool called jira_get_issue with a JSON schema that describes its name, what it does, and what arguments it accepts - in this case, an issue_id string. That schema goes into every API request alongside the user's message.
Claude decides when to call a tool based on the user's request and the tool's description, then returns a structured call that your application executes.
When you ask about incident-42, the model reads the tool descriptions and infers that jira_get_issue is the right one.
If Claude decides to use a tool, the API response will not contain the final answer. Instead, it will have a stop_reason of tool_use, and the content block will specify the tool call details.
Your code reads that response, sees stop_reason: tool_use, calls your actual Jira API with issue_id: "incident-42", and sends the result back to the model as a new message.
Tool calling follows a request-decide-execute-synthesize loop: the LLM requests a function, your code runs it, and the result feeds back into the next LLM turn. Steps two through four can repeat multiple times - the LLM might call several tools before generating a final response.
The model then reads the Jira API's JSON response, extracts the meaningful fields, and writes you a plain-English summary. That is the entire trick. Nothing magic. One extra network round-trip.
The description you write for each tool is the single biggest driver of whether the model picks the right one. Anthropic's own docs say to aim for at least 3-4 sentences per tool description, more if the tool is complex.
Agents also need to learn correct tool usage from examples, not just schema definitions. JSON schemas define what's structurally valid, but can't express usage patterns: when to include optional parameters, which combinations make sense, or what conventions your API expects. A poorly described tool is one that gets called at the wrong moment or with the wrong arguments.
Where it breaks in production
A few failure modes that catch teams by surprise.
Hallucinated arguments. The LLM may try to call tools that don't exist or pass arguments for functions it doesn't have. Always validate the tool name against your registry before execution. The model's JSON is a suggestion, not a guarantee.
Too many tool definitions. MCP tool definitions provide important context, but as more servers connect, those tokens can add up. A five-server setup can mean 58 tools consuming approximately 55K tokens before the conversation even starts. Add a Jira server (which alone uses ~17K tokens) and you're quickly approaching 100K+ token overhead. Every tool schema you pass in is tokens you're paying for on every request, even when those tools never get called.
Parallel calls burning latency. Each tool call requires a full model inference pass. After receiving results, the model must parse the data, reason about how pieces fit together, and decide what to do next - all through natural language processing. A five-tool workflow means five inference passes plus parsing each result. If your tasks are independent, configure parallel tool invocation; many orchestration layers support it.
There is also an invocation-mode decision to make early. Teams generally choose between two modes: automatic tool invocation, where the LLM decides dynamically if and when to call a tool based on the user's intent - the standard for conversational agents - and forced tool invocation, where the system developer configures the model to always use a specific tool, while the model still generates the arguments based on the input. Forced invocation is ideal for deterministic pipelines, such as structured data extraction where you need the model to output a specific schema every single time.
Prompt caching and why it changes the cost math for tool-heavy agents
Once you have a dozen tool definitions riding in every request, one optimization becomes critical: prompt caching.
Prompt caching is a provider-native feature that stores and reuses the initial, unchanging part of a prompt - the prompt prefix - so that large language models don't have to process it again on every request. More specifically, it caches the internal state of the model for that prefix, reducing redundant computation. This results in reduced latency and input token savings, without any loss in quality.
In practical terms: your system prompt plus tool definitions might be 6,000 tokens. If you're serving 10,000 conversations per day, you're paying for those tokens 10,000 times. That's 50 million tokens per day in system prompt costs alone - before a single user message is processed. Prompt caching makes those repeated tokens near-free.
Prompt caching cuts cached input cost by 90% on Anthropic's platform.
OpenAI offers automatic prompt caching on GPT-4o and newer models, where caching activates automatically for prompts exceeding a minimum token threshold, with cache hits occurring only for exact prefix matches.
Anthropic provides developer-controlled caching through explicit cache breakpoints, allowing users to specify which portions of their prompt should be cached, with configurable time-to-live options.
The catch: cache hits require 100% identical prompt segments, including all text and images up to and including the block marked with cache control. One stray timestamp injected into your system prompt on every request will kill the cache entirely. Avoid timestamps in system prompts. Including a timestamp at the start of a system prompt changes the prefix on every request, defeating the cache entirely. If time context is required, add it to the user message instead.
Order your prompt content from most stable to most dynamic: system instructions first, tool definitions next, conversation history after that, user message last. The cache can only cover a contiguous prefix - anything dynamic in the middle breaks it for everything below.
One team's real numbers make the stakes clear. ProjectDiscovery, building an agentic security platform, moved dynamic content out of their cached prefix and took their cache hit rate from 7% to 74% in a single deployment. Going from 7% to 84% - their eventual ceiling - meant that complex security audits that used to be orders of magnitude more expensive per run became economically viable to run repeatedly.
That shift - from "the model answers questions" to "the model routes requests to real systems" - is what makes tool calling the hinge of most practical AI work. Understanding the loop is the prerequisite for debugging it. And the debugging is where most of the real engineering lives.
For teams building agents that live in Slack or Teams, the same loop applies to every lookup a teammate like Beagle might run - a Jira query, a Linear status check, a thread summary - each one is a tool call your code executes and feeds back to the model. The prompt cache is what keeps that affordable at volume.
For more on how agents connect to external systems at scale, see how MCP fits into the tool calling picture and the Beagle integrations overview.