Paste a weather tool into a GPT-4 prompt and ask "What's the temperature in Oslo right now?" The model will not query a weather API. It will write a small JSON object describing which function you should call and with what arguments - then stop and wait. Your application does the actual work. This distinction matters more than it sounds, because most mental models of "AI agents calling tools" have it backwards.
The core insight is that the model doesn't execute code; it describes what code should run, and you control execution. That one sentence untangles a lot of confusion about why agents sometimes behave oddly, why tool errors look the way they do, and where the real costs come from.
What the model actually sees
When you send a tool schema to an LLM, you are not "registering" a function. You are injecting a JSON blob into the model's context window, formatted as a system message. The model is then fine-tuned to output a special token sequence that signals a tool call. This is why the schema counts against your token budget - it is literally part of the prompt.
Under the hood, functions are injected into the system message in a syntax the model has been trained on. This means callable function definitions count against the model's context limit and are billed as input tokens. A schema for a single function with a good description might cost 50-150 tokens. Register 20 tools and you have burned 1,000-3,000 tokens before the user types anything.
The JSON schema you define is not just documentation - it is the only thing the model sees. If your descriptions are vague, the model will hallucinate arguments.
A description that says "location": "a place" is nearly useless. One that says "location": "City name in English, e.g. 'Oslo'" is the whole game.
The tool schema is not metadata - it is a prompt. Every word in the description influences the model's behaviour. Treat it like a system prompt, not a docstring.
Function calling is also known as tool calling, and both terms coexist for historic reasons. First, LLMs only had access to a small selection of functions. Later, models became able to handle larger collections of external APIs, hence the name "tool" (as in "toolbox") was established.
OpenAI calls it "function calling" while Anthropic calls it "tool use," but the implementation is nearly identical. Both use JSON schemas to define tools and return structured outputs. Worth knowing if you switch providers.
The loop, turn by turn
Here is what actually happens when you send "What's the temperature in Oslo?":
- Your application bundles the user message, the system prompt, and all tool schemas into one big context and sends it to the model.
- The model reads the context, decides a tool is needed, and emits a structured response - not the answer, but a tool call object containing the function name and arguments.
- Your application receives that object, extracts it, and runs the actual function - calling the weather API, querying the database, whatever the function does.
- Your application sends the result back to the model in a new message, labelled as a tool response.
- The model reads the result and generates the final reply.
The LLM does not execute these calls directly; instead it creates a data structure that describes the call, passing that to a separate program for execution and further processing. The LLM's prompt includes details about possible function calls and when they should be used.
The model generates a tool_calls field in the response. Under the hood, the model outputs a JSON string inside the arguments field. The API then parses this JSON for you - but if the model outputs malformed JSON, the API returns an error.
Anthropic's Claude uses a different approach: it outputs a special XML-like tag <function_calls> and then a JSON block.
Different syntax, same mechanic.
The conversation does not end when the model decides to use a tool. It pauses. The loop only closes when your code runs the function and sends the result back as a new message.
When the model fires multiple tools at once
Parallel tool calling is one of the most impactful performance features in modern LLM APIs. Instead of waiting for one tool call to finish before starting the next, the model can request multiple tool executions in a single response - and your agent runtime can execute them all at once.
OpenAI ships parallel function calling as a default behaviour. When the model determines that multiple functions are needed, it can emit multiple tool-call objects in a single response. The parallel_tool_calls parameter, which defaults to true, controls this behaviour.
The latency math is worth pausing on. Three tools each taking 200ms, called sequentially, takes 600ms of tool execution plus three separate LLM inference cycles totalling around 1,500ms - roughly 2,100ms end-to-end. In parallel, those same three tools finish in 200ms (the slowest one), with a single inference cycle, putting total latency around 700ms - a 3x improvement.
When the model calls a function, you must execute it and return the result. Since model responses can include zero, one, or multiple calls, it is best practice to assume there are several. A loop that only handles single tool calls in production will silently drop parallel requests and corrupt state.
There is a catch with parallel calls and side effects. When you define multiple functions, the model can request several at once - and if those functions have side effects like decrementing inventory or charging a credit card, executing them in parallel without proper idempotency or ordering guarantees can corrupt state, double-charge customers, or create race conditions. The naive implementation just runs all requested functions simultaneously and returns results; the correct pattern requires sequential execution with dependency tracking, idempotency keys, and rollback logic.
The part nobody mentions at the schema level
The model still has to understand which tools are available, and generate the instructions on which tools it wants to use. As LLMs are nondeterministic by design, there is no guarantee that tool calling works flawlessly all the time.
If you run into token limits, limit the number of functions loaded up front, shorten descriptions where possible, or use tool search so deferred tools are loaded only when needed. Loading 40 tool schemas into every call because it's easy is a reliable way to burn tokens on tools the model will never use for that particular message.
Few-shot tool use - where you teach the model to format tool calls via examples in the prompt - is fragile and requires parsing free-form text output. Structured function calling uses a dedicated API mechanism where the model returns validated JSON matching the tool schema. Schema validation happens at the API level, not via text parsing. Far more reliable for production use.
None of this requires that the tools be external APIs. A tool can read from an in-memory cache, query a local database, or call another model. The mechanism is the same: the LLM writes JSON, your code runs something, the result comes back as a message. An AI teammate like Beagle works this way when it pulls context about a Slack thread or a Jira ticket - the model describes what it needs, and the surrounding infrastructure actually fetches it.
The key thing to carry forward: the intelligence lives in how the model decides which tool to call and with what arguments. The execution lives entirely in your code. Build accordingly - especially around error handling, because LLMs are nondeterministic by design, and there is no guarantee that tool calling works flawlessly all the time. A broken tool call should return a clear error string, not a silent null, so the model has a chance to recover on the next turn.