How LLM Tool Calling Works at the Token Level

When an LLM calls a tool, it isn't running code - it's generating specific tokens your app is trained to intercept. Here's the exact mechanic, step by step.

Cover art for How LLM Tool Calling Works at the Token Level

A teammate pastes a question into your AI assistant: "What's the current sprint velocity?" One second later, the assistant replies with a precise number pulled live from Jira. It looks like magic. The reality is stranger and more specific: the language model never touched Jira. It produced a short sequence of JSON tokens that your application intercepted, executed against an API, and fed back into the conversation. The model just wrote text. Your code did the rest.

That loop - model emits JSON, app runs code, result goes back to model - is what "tool calling" or "function calling" means in practice. Both terms coexist for historic reasons: they refer to the same underlying mechanism. But the name doesn't reveal what's actually happening inside the model. Once you see the full picture, a lot of agent behavior that seems mysterious becomes predictable.

What the model actually receives

Here's the first counterintuitive fact. When you send a tool schema to an LLM, you are not "registering" a function. You are injecting a JSON blob into the model's context window, formatted as a system message. The model has no concept of a function registry. It sees text. The schema you define - name, description, parameters, types - is written into the prompt alongside the user's message before a single output token is generated.

Functions are injected into the system message in a syntax the model has been trained on. This means callable function definitions count against the model's context limit and are billed as input tokens. If you hand five tools to a model, you are paying for those schema definitions on every single request. A verbose tool description with lots of parameter documentation is not free. If the number of functions and tools is high, the JSON gets big, increasing the overall token count and cost.

The practical implication: the JSON schema you define is not just documentation - it is the only thing the model sees. If your descriptions are vague, the model will hallucinate arguments. Treat every tool description like a micro-prompt.

How the model decides to call a tool

After reading the context - system message, injected schemas, user query - the model generates its response the way it always does: one token at a time, autoregressively. The model has been fine-tuned to output a special token sequence that signals a tool call. That is the whole trick. There is no separate "decision module." The model has learned, through training on examples, that certain user intents should produce a particular output shape rather than prose.

The model then generates a tool_calls field in the response. Under the hood, the model outputs a JSON string inside the arguments field. The API then parses this JSON for you - but if the model outputs malformed JSON, the API returns an error.

Different providers implement this slightly differently at the surface. Anthropic's Claude uses a different approach: it outputs a special XML-like <function_calls> tag and then a JSON block. OpenAI uses the tool_calls array. The underlying idea is identical in both cases: structured text that the surrounding application is built to catch.

The constrained decoding problem

Allowing the model to freely generate JSON is risky in production. A misplaced comma or an unescaped newline inside a string breaks the parser. This is where constrained decoding comes in - and it's the piece most explanations skip over.

At each generation step, the model produces a logits vector across its vocabulary, which is then converted into a probability distribution using the softmax function. A sampler then selects the next token from this distribution. Normally, every token in the vocabulary is a candidate. Constrained decoding changes that. Constrained decoding guides the structure of LLM-generated text by restricting available tokens at each step. At each step, tokens that would violate the required structure are identified as invalid. Their logits are set to negative infinity, effectively assigning them zero probability after the softmax operation and preserving the relative probabilities of other valid tokens. This ensures that only valid tokens are sampled.

The schema you pass is compiled into something like a finite state machine. At each decoding step, the engine checks which tokens are legal next given the grammar state - and hard-blocks everything else. When you enable structured output, the model physically cannot produce tokens that violate your schema. You define a JSON Schema, pass it to the API, and get back a response that matches it every single time.

Setting strict to true ensures function calls reliably adhere to the function schema, instead of being best effort. OpenAI recommends always enabling strict mode.

The full loop, and what it costs you

Here is the complete sequence for a single tool-assisted turn:

  1. Your app builds a request: system prompt + tool schemas + user message → sent as input tokens.

  2. The model outputs either prose or a tool_calls block. Model responses can include zero, one, or multiple calls. The response has an array of tool_calls, each with an id and a function containing a name and JSON-encoded arguments.

The LLM waits while the application processes the request and executes the function. This is where Jira gets queried, the database gets read, the calendar API responds. 4. The output of the function is formatted and passed back to the LLM along with the original context. The LLM reasons about the function's outputs and the query and returns a grounded response.

Each round-trip adds latency. The round-trip to execute a function and feed results back adds 500ms-2s per call under typical conditions - which compounds quickly in multi-step agent flows where three or four tool calls happen in sequence.

Token cost compounds too. In production, models can emit multiple tool calls in a single turn. If your loop doesn't handle that, you'll silently drop requests and corrupt state. A common mistake is writing code that assumes tool_calls always has exactly one entry.

The other cost trap is parallel calls. Instead of answering a single question, the model can orchestrate multiple function calls to solve multi-step problems. Planning a trip might involve checking flight availability, booking a hotel, and renting a car through different APIs, all in one conversation. If those calls are independent, you can execute them in parallel and avoid serial latency. If your scaffolding runs them sequentially by default, you are leaving time on the table.

What actually breaks in production

The sprint velocity example at the top is clean. Real production flows are messier. Three failure modes come up repeatedly.

Vague descriptions cause wrong tool selection. The model picks tools based on the name and description text in the schema. Two tools with overlapping descriptions - say, get_sprint_data and get_project_metrics - will be confused if neither description clearly delineates scope. The fix is treating schema descriptions as unambiguous contracts, not friendly labels.

Malformed arguments are a real edge case. Even with constrained decoding, even slight deviations from the expected structure can lead to crashes or unpredictable behavior. This risk is compounded by data type issues - for example, the LLM might output a date in a non-standard format or return a string where an integer is expected. Validate on the way in.

Retries create side effects. Function calls are not transactional. If your payment service processes a charge inside a tool call and the LLM retries due to a timeout, you will double-charge the customer. Always include an idempotency key. This is less of an issue for read-only tools like a Jira query, and a significant issue for anything that writes state.

A tool-calling setup inside a Slack-connected assistant - the kind Beagle lives in - runs this entire loop in the background, invisible to the person who asked the question. The teammate sees an answer. What actually happened was: schema injection, token generation, JSON intercept, API execution, result injection, second model call, final prose generation. Knowing that sequence makes it much easier to debug when something returns wrong data, hallucinates an argument, or simply goes silent.

The model never left the conversation. It just wrote very specific tokens at the right moment.

Keep reading