The Moment a Model Stops Talking and Starts Asking for a Tool

When an AI agent checks a calendar or queries a database, the model itself never touches the data. Here's exactly what happens in the gap between your question and the answer.

It's 9:15 on a Tuesday. You type into a chat interface: "What meetings do I have this afternoon?" The reply comes back a few seconds later with your 2 p.m. and your 3:30. Feels seamless. But there is a small, mechanically interesting gap in the middle of that exchange - a moment where the model stops generating language and starts requesting an action instead.

That gap is tool calling, and understanding it changes how you think about what these systems can and cannot do reliably.

The model does not touch the data

This is the first thing to absorb. Tool calling is a structured protocol between your application and the model. The model outputs a JSON object describing which function to call and what arguments to pass. Your code handles execution, then sends the result back for interpretation.

The model is not reaching into your calendar. It is producing a request - a structured description of what it needs - and handing that back to whatever software is running around it. That software does the actual work, then feeds the result back into the conversation.

The model acts as a coordinator. It understands the query, decides whether an external function needs to be called, correctly formats the call, and integrates the tool's output into the conversation. But the execution step lives entirely outside the model.

How the request is shaped

Before any of this can happen, someone has to tell the model what tools exist. Every request to the model includes the tool definitions alongside the conversation messages. The model does not decide when tools exist - it just knows what tools are available because the application told it. Those tool definitions are sent with every request.

Each tool definition is a JSON Schema: a machine-readable description of a function name, what arguments it takes, and what types those arguments should be. Think of it as an API contract written for a language model to read.

When the model decides a tool is needed, it stops generating normal language. When the model wants to call a tool, the content field in the response is empty - the model is not talking to the user, it is requesting an action. The finish_reason field is "tool_calls" instead of the usual "stop". Your application sees that signal, routes to the right function, runs it, and appends the result to the conversation before sending everything back to the model for the actual reply.

Why the output is reliably structured

You might wonder: if a language model fundamentally generates tokens one at a time by sampling from a probability distribution, how does it reliably produce valid JSON every time? The answer is that, for tool calls, providers do not leave it to chance.

Constrained decoding is the inference-time technique that forces an LLM's output to conform to a schema by masking the next-token distribution at every step. Tokens that would make the partial output invalid are set to logit negative infinity before sampling, leaving only legal continuations. Because the constraint is enforced during generation rather than after, the output is guaranteed valid - you never get a parse error, never have to retry, never need a fallback parser.

Constrained decoding is like building a maze where only one path leads to the exit. Instead of letting the model choose from 50,000 tokens, the invalid ones are masked. If the model is halfway through a JSON object and has just typed "age":, the mask says every token is now illegal except for numbers.

A state machine derived from the constraint - a DFA for regex, a pushdown automaton for a context-free grammar, a JSON-schema walker for JSON - tracks what is legal at each generation step. This is not prompt engineering. It is inference-time constraint enforcement sitting between the model's raw output distribution and the sampler.

The model does not produce valid JSON because you asked nicely. It produces valid JSON because the sampling step only allows valid tokens.

When one tool call isn't enough

The exchange described so far - one question, one tool call, one answer - is the simple case. Real tasks often require several. Your question about this afternoon's meetings might require checking a calendar, then looking up whether one of those meetings has a prep doc attached, then surfacing any unread messages from those attendees.

Parallel function calls allow you to perform multiple function calls together, allowing for parallel execution and retrieval of results. This reduces the number of calls to the API that need to be made and can improve overall performance. The model can emit an array of tool calls in a single turn, each with its own unique ID, and your application can run them concurrently before returning all the results at once.

Some models support parallel_tool_calls, allowing the model to return an array of functions to execute in parallel. However, reasoning models may produce a sequence of function calls that must be made in series, particularly when some steps depend on the results of previous ones. Whether calls run in parallel or in sequence matters a lot when you are counting latency.

The round-trip your users never see

Back to the Tuesday morning example. When you asked about your afternoon meetings, here is the actual sequence:

  1. Your message, plus tool definitions for a calendar lookup, were sent to the model.
  2. The model returned a tool_calls response - not an answer, just a request for calendar.get_events with an argument like {"date": "today", "time_range": "afternoon"}.
  3. Your application ran the function, got back a list of events, and appended that as a tool result message.
  4. The full conversation - original message, tool request, tool result - went back to the model.
  5. The model then generated the reply you read.

Even state-of-the-art models frequently fail to make accurate tool calls , which is why tool descriptions matter as much as the schema. A vague function name or a parameter description that omits edge cases will produce worse routing decisions. The schema is the interface contract; treat it with the same care you would a public API.

An AI teammate like Beagle, surfacing Slack context or answering questions about active projects, is running this loop constantly - deciding which workspace resources to query, forming the requests, and assembling the results before any reply is visible. The seams are invisible by design, but they are there.

The useful thing about understanding this loop isn't that it helps you trust it more. It's that it tells you exactly where things go wrong - and where to look when they do.