Most people who use AI agents every day have a rough sense that the model "called a tool." What actually happened is stranger and more interesting than that phrase implies.
Rather than executing code directly, the LLM outputs a JSON object describing which function to call and what arguments to pass.
The LLM itself does not execute the function. Instead, it identifies the appropriate function, gathers all required parameters, and provides the information in a structured JSON format. Your application receives that JSON, runs the actual function, and sends the result back. The model then generates a response using what it got.
This is the loop most people don't visualize clearly. Let's make it concrete.
Say you ask an assistant: "What's the weather in Berlin this week?"
If you ask an LLM without function calling ability about the current weather in a city, it may generate text that describes a weather situation - but the model has no knowledge of the actual current weather. The generated text may read like a weather report, but it may not have anything to do with the real conditions.
With tool calling, something different happens. The system prepares the environment by gathering the prompt, user history, and the tool definitions. You must provide the LLM with a JSON schema that describes each tool and its parameters, allowing the model to understand which tools are available and what inputs they expect before the first token is even generated.
Each tool definition has three parts: a name, a description, and a parameter schema. Each tool needs these components, and the description is critical - the AI uses it to decide when to invoke the tool. A vague description like "gets data" leads to poor decisions.
The model reads the user's message and the tool definitions together. The LLM analyzes the query and recognizes it needs external data or an action to fulfill the request. If the user asks about weather, the model identifies the need to fetch live data. Then the LLM decides to execute a function call.
When it does, the response looks different from a normal chat reply.
When the model wants to call a tool, the content field in the response is empty - the model isn't talking to the user, it's requesting an action. The finish_reason is "tool_calls" instead of the usual "stop".
Your code receives this, runs get_weather("Berlin") against a real API, and appends the result to the conversation as a tool message. The model picks up the thread, now with actual data, and writes the answer.
The round-trip cost is real
A single user message can require three LLM round trips. A smaller model calls tools sequentially - one per round - rather than batching them. Larger models would likely batch them.
Each round trip costs latency and tokens. If the number of functions and tools is high, the JSON schema gets big, increasing the overall token count and cost. Deciding which tool to call, executing it, then doing a final generation adds latency - making it a non-starter for applications requiring very low response times.
There's a hidden cost most developers don't notice immediately. Under the hood, function definitions are injected into the system message in a syntax the model has been trained on. This means callable function definitions count against the model's context limit and are billed as input tokens. A large toolset - say, 40 functions - means you're paying for all those schema descriptions on every single request, whether the model uses them or not.
If you run into token limits, limiting the number of functions loaded up front, shortening descriptions where possible, or using tool search so deferred tools are loaded only when needed helps.
Strict mode and schema enforcement
There's a meaningful difference between the model trying to return valid JSON and being constrained to return it.
Setting strict: true ensures function calls reliably adhere to the function schema, instead of being best effort. OpenAI recommends always enabling strict mode.
Without strict mode, you're essentially using few-shot tool use - teaching the model to format tool calls via examples, which is fragile and requires parsing free-form text output. Structured function calling with a dedicated API mechanism returns validated JSON matching the tool schema, with schema validation happening at the API level, not via text parsing. Far more reliable for production use.
What "tool calling" and "function calling" actually mean
Both terms coexist for historical reasons. LLMs originally had access to a small selection of functions. Later, models became able to handle larger collections of external APIs, and the name "tool" - as in "toolbox" - was established.
While people often use the terms interchangeably, "tool calling" is the modern standard. Function calling originally referred to matching a specific JSON signature, while tool calling builds upon this and supports a wider range of capabilities including provider-built tools such as code interpreters, web browsing, and retrieval.
The distinction matters more in agent frameworks than in everyday conversation. When a teammate like Beagle calls a Jira API or reads a Notion page, there's a full tool-calling loop happening behind every response - schema lookup, model decision, application execution, result injection.
Where things break
The most common failure mode isn't the JSON - it's the description.
Precision and specificity matter. The description should explain what the tool does, when to use it, what the parameters mean, and what constraints apply. Vague descriptions lead to incorrect usage. Good descriptions include expected input types, output format, and failure cases.
The second failure mode is security. Indirect prompt injection can manipulate an agent into calling functions it shouldn't. Prompt injection now sits at the top of the OWASP Top 10 for LLM Applications, and research has formalized how tool-calling agents can transform prompt injection from an information leak into an operational threat. If your agent can take actions - book a meeting, file a ticket, send a message - the stakes of a hijacked tool call are higher than a hijacked chat reply.
The mechanism is simple once you see it. The model reads a menu of tools, decides what it needs, writes an order slip in JSON, and waits. Your code fills the order and slides the result back through the window. The model reads the result and finishes answering.
Nothing the model "does" touches a live system. Everything that touches a live system is code you wrote and control. That's not a limitation - it's the design.