What Does the Model Actually Do When It Calls a Tool?

Most people who work with AI agents have a rough mental model of tool calling: the model "calls a function," something runs, and an answer comes back. That mental model is close enough for casual use and completely wrong in ways that matter when things break.

Here is what actually happens, step by step, with a concrete example you can hold in your head.

The model can't run anything

Start here. Function calling does not give an LLM the ability to execute code. The model chooses the most appropriate function from a list you provide. The actual execution happens in the same environment that made the LLM call - your application, not the model.

This is not a minor footnote. It means the model is more like a dispatcher than an actor. It reads the situation, decides what needs to happen, and writes a request - in JSON - that your code then acts on. Whether that request results in a Jira ticket update, a database query, or a Slack message is entirely up to your infrastructure.

Step one: describe your tools in JSON Schema

Before the conversation starts, you give the model a catalogue of what it's allowed to request. Each entry is a JSON Schema block: a name, a description, and a set of typed parameters. For models to understand these functions, you outline the function specifications with a JSON Schema - the type, function name, function description, function parameters, and which parameters are required.

Say you're building a Slack bot that can look up open support tickets. You'd describe a get_open_tickets function with parameters like assignee (string) and priority (enum: low, medium, high). That description gets sent to the model with every request - it's not stored anywhere special; it just travels as tokens.

Every request to the LLM includes the tool definitions alongside the conversation messages. The bot doesn't decide when to use tools - the model does. The bot just tells the model what tools exist.

Step two: the model reads the intent and picks a tool

When a message arrives - say, "which high-priority tickets is Maya not making progress on?" - the LLM analyzes the query and recognizes it needs external data or an action to fulfill the request. If the user asks about something requiring live data, the model identifies the need to fetch it.

The model then does something that looks like reasoning but is technically token prediction with hard guardrails applied. It outputs a response with a special shape: when the model wants to call a tool, the content field is empty. The model isn't talking to the user - it's requesting an action. The finish_reason is "tool_calls" instead of the usual "stop".

That response object contains the function name and arguments the model decided to use - for instance, get_open_tickets with assignee: "Maya" and priority: "high".

The model made a decision. Your code made a phone call.

Step three: constrained decoding keeps the JSON valid

Here's the part most explanations skip. You might wonder: how does the model reliably output well-formed JSON with exactly the right field names and types, every time? The answer is not training alone - it's a mechanism called constrained decoding.

When generating text that must conform to specific syntactic structures like JSON, directly sampling from the model's probability distribution alone may not guarantee valid outputs. Constrained decoding addresses this by applying a logits mask before token sampling. This process sets the logits of invalid tokens to negative infinity, effectively zeroing their probabilities after softmax.

To make that concrete: the model generates one token at a time. At each step, it produces a probability score for every token in its vocabulary - often 100,000+ tokens. Instead of letting the model choose from 50,000 words, invalid tokens get masked. If the model is halfway through a JSON object and just typed "age":, the mask says every token in the universe is now illegal except for numbers.

Consider generating a JSON structure: after producing {, a parser identifies that only " (to start a string key) or } (for an empty object) are valid next tokens. The constrained decoding step masks all other tokens' logits to negative infinity, restricting sampling to only those valid tokens. This ensures the generated text strictly adheres to the specified grammar.

This is why structured tool calls don't occasionally produce "prioritty" or forget a closing brace. The grammar won't allow it.

Step four: your code runs the function and returns the result

Your application receives the tool call request, validates the arguments, runs the actual function - hitting a database, calling an API, querying Jira - and packages the result as a new message in the conversation history, tagged with the tool call's ID.

Tool calls and responses are matched with an ID, and generation will error if the IDs don't match. This pairing matters: if you're running parallel tool calls, the model needs to know which result belongs to which request.

The result goes back into the context window as a tool role message. Then the model gets invoked again, this time with the original user question plus the function output in view. It reads the ticket data and writes a plain-language reply: "Maya has three high-priority tickets open for more than five days, two of which haven't been updated since last Thursday."

Step five: the loop can repeat

Basic tool calling is one tool and one result. Agent function calling is when the model chains tools across steps to complete a goal - searching docs, extracting relevant IDs, fetching a record, updating a ticket, notifying a user. This is how LLMs shift from chatbots into workflow assistants.

A teammate like Beagle running in Slack might make three tool calls in sequence - one to look up the ticket, one to check the last comment thread, one to draft a summary - before posting a single message. Each round trip adds latency, which is why systems that can batch parallel tool calls are meaningfully faster.

Smaller models call tools sequentially - one per round - rather than batching both into a single response. Larger models would likely batch them.

What can go wrong

The execution boundary creates a real security surface. The moment an LLM can trigger real actions, mistakes become expensive. Common risks include hallucinated parameters - invented IDs, emails, or amounts - unauthorized access through tool calls that shouldn't be allowed for that user or session, and prompt injection, where malicious input tries to override tool rules.

If you want tool use to work consistently, treat your functions like a public API - because to the model, they are. Use precise names: create_support_ticket beats ticketAction. Keep parameters tight: fewer fields, more explicit constraints.

The core thing to hold onto: the model reasons about what needs to happen and writes a structured request. Your code handles the consequences. That boundary is where most production problems live, and understanding it clearly makes both debugging and security design a lot simpler.

The model can't run anything

Step one: describe your tools in JSON Schema

Step two: the model reads the intent and picks a tool

Step three: constrained decoding keeps the JSON valid

Step four: your code runs the function and returns the result

Step five: the loop can repeat

What can go wrong

Keep reading

How the LLM Context Window Actually Works

Fix Your GitHub Slack PR Notifications Before They Cost You

OpenCode Runs in Your Terminal and Has 160K GitHub Stars