Does CodeAct Actually Reduce How Many Times the Model Has to Think?

Most agent loops look the same under the hood. The model picks a tool, waits for a result, picks another tool, waits again. For a task with a dozen small lookups, that pattern forces twelve model turns when one would do.

CodeAct changes this by letting the model write a single short Python program that calls your tools via call_tool(...), runs it once in a sandbox, and returns a consolidated result. The model still reasons about the task, but it does so once - up front - then hands a program to the runtime rather than re-entering the reasoning loop after every tool response.

Microsoft Agent Framework reached 1.0 GA on April 2, 2026, bringing the convergence of AutoGen and Semantic Kernel into a single supported platform. CodeAct arrived shortly after as part of the BUILD 2026 announcements. It ships inside a separate alpha package called agent-framework-hyperlight, and the two things - the pattern and the runtime - are worth keeping distinct.

What Hyperlight actually does

The agent-framework-hyperlight package runs the model-generated code in a fresh, locally isolated Hyperlight micro-VM per call. Strong isolation therefore comes at the granularity of a single execute_code call, with little added overhead.

Hyperlight is the sandboxing layer, not the pattern itself. CodeAct as a research idea predates this implementation - the novelty here is that Microsoft baked it into a production-oriented framework with sensible defaults and a clean API surface.

The short version: the model gets one execute_code tool, writes a small Python program per turn, and calls your tools from inside a fast, locally isolated micro-VM via call_tool(...).
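To make the shape of such a program concrete, here is a minimal sketch of what a model-written program might look like. The call_tool name comes from the framework, but its exact signature is an assumption here (a tool name plus keyword arguments), and the stub below only stands in for the proxy the runtime would supply. The tool names get_order and get_tax_rate are hypothetical.

```python
# Hypothetical stand-in for the runtime-provided proxy. In the real sandbox
# the host supplies call_tool; the generated program does not define it.
# The (name, **kwargs) signature is an assumption for illustration.
def call_tool(name, **kwargs):
    tools = {
        "get_order": lambda order_id: {"id": order_id, "total": 40.0 + order_id},
        "get_tax_rate": lambda region: 0.25 if region == "EU" else 0.0,
    }
    return tools[name](**kwargs)


def generated_program():
    """What a model-written program might look like: several reads, one result."""
    orders = [call_tool("get_order", order_id=i) for i in (1, 2, 3)]
    rate = call_tool("get_tax_rate", region="EU")
    subtotal = sum(order["total"] for order in orders)
    return {
        "orders": len(orders),
        "subtotal": subtotal,
        "total_with_tax": subtotal * (1 + rate),
    }


result = generated_program()
print(result)
```

Four tool calls happen inside one program, and only the consolidated dictionary goes back to the model.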

The micro-VM starts clean on every invocation. There is no shared state leaking between agent runs, which is a meaningful property when you are running concurrent agents over the same tool set.

The numbers, and what they mean

Microsoft's claim is that agents using CodeAct can collapse a multi-step plan into a single executable code block, cutting end-to-end latency by roughly 50% and token usage by over 60% in representative workloads, without compromising on safety or isolation.

Those numbers come from Microsoft's own benchmarks on tool-heavy workloads: fetch data, compute something, assemble a result. That is the scenario where CodeAct earns its keep. The gains are real in the procedural, chainable case; for a task with only one or two tool calls, the added abstraction buys you little or nothing.

The improvement is not that tools execute faster. It is that the model reasons once instead of twelve times.
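A back-of-the-envelope model shows where the saving comes from. The per-turn and per-tool latencies below are invented for illustration and are not meant to reproduce Microsoft's reported figures. The point is only that the tool term stays the same in both modes, while the model-turn term shrinks.

```python
def model_turns(tool_calls: int, codeact: bool) -> int:
    """Tool-selection turns: one per call in a classic loop, one in total with CodeAct."""
    return 1 if codeact else tool_calls


def estimated_seconds(tool_calls: int, codeact: bool,
                      model_s: float = 2.0, tool_s: float = 0.5) -> float:
    # model_s and tool_s are made-up latencies, not measurements.
    # Tool time is identical in both modes; only the model-turn term changes.
    return model_turns(tool_calls, codeact) * model_s + tool_calls * tool_s


classic = estimated_seconds(12, codeact=False)
bundled = estimated_seconds(12, codeact=True)
print(classic, bundled)
```

With these assumed values, almost all of the difference is model time, which is why a one-call task gains nothing from the pattern.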

The isolation story, read carefully

This is where the announcement language gets imprecise, and it is worth slowing down.

Hyperlight does provide isolation, but it isolates the model-generated code, not your tools. The tools you write and register live in your application's runtime, with whatever access your process has. The Python program the model writes inside execute_code runs inside the Hyperlight sandbox, with no host access except the file mounts and allowed domains you opted into.

CodeAct sandboxing protects the host from unsafe generated code. It does not automatically make your tools safe. If your tool can send an email, delete a file, update a database, approve a refund, or trigger a deployment, the sandbox is not enough. You still need tool-level permissions, approval policies, and auditability.

This is the boundary that tends to get glossed over in framework announcements. The micro-VM is a real and useful isolation layer. It is not a substitute for thinking about what your tools are actually allowed to do.
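One way to picture tool-level control is a thin wrapper that refuses side-effecting calls without explicit approval and records every attempt. This is a plain-Python sketch, not an Agent Framework API. GatedTool, ApprovalRequired, and the approve_refund tool are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


class ApprovalRequired(Exception):
    """Raised when a side-effecting tool is called without approval."""


@dataclass
class GatedTool:
    name: str
    func: Callable[..., Any]
    side_effecting: bool = False
    audit_log: list = field(default_factory=list)

    def __call__(self, *, approved: bool = False, **kwargs: Any) -> Any:
        # Deny first, log always: every attempt leaves an audit record.
        if self.side_effecting and not approved:
            self.audit_log.append((self.name, kwargs, "denied"))
            raise ApprovalRequired(f"{self.name} needs explicit approval")
        self.audit_log.append((self.name, kwargs, "allowed"))
        return self.func(**kwargs)


refund = GatedTool(
    "approve_refund",
    lambda order_id: f"refunded {order_id}",
    side_effecting=True,
)

try:
    refund(order_id=7)
except ApprovalRequired as exc:
    print("blocked:", exc)

print(refund(order_id=7, approved=True))
```

The check lives in the host process, next to the tool, which is exactly the part the micro-VM does not cover.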

The Agent Control Specification, quietly shipped alongside

One announcement from BUILD that got less attention than CodeAct: an open-source Agent Control Specification (ACS), a portable vendor-neutral spec for runtime agent governance. It defines eight lifecycle interception points - input, pre- and post-model call, pre- and post-tool call, output, startup, and shutdown - with a declarative YAML manifest that works across Python, Node, .NET, and Rust. Write your policies once, enforce them in any framework.
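The eight interception points can be pictured as an enumeration with a declarative policy table keyed by hook. The sketch below is not the ACS manifest schema, which is not reproduced here. The Hook names, the POLICY layout, and the rule keys max_chars and deny_tools are all assumptions chosen to illustrate the idea.

```python
from enum import Enum


class Hook(Enum):
    # The eight lifecycle points listed above; identifiers are illustrative.
    STARTUP = "startup"
    INPUT = "input"
    PRE_MODEL_CALL = "pre_model_call"
    POST_MODEL_CALL = "post_model_call"
    PRE_TOOL_CALL = "pre_tool_call"
    POST_TOOL_CALL = "post_tool_call"
    OUTPUT = "output"
    SHUTDOWN = "shutdown"


# Hypothetical declarative policy: plain data keyed by hook, no code.
POLICY = {
    Hook.INPUT: {"max_chars": 2000},
    Hook.PRE_TOOL_CALL: {"deny_tools": {"delete_file"}},
}


def allowed(hook: Hook, event: dict) -> bool:
    """Return False if any rule registered for this hook rejects the event."""
    rule = POLICY.get(hook, {})
    if "max_chars" in rule and len(event.get("text", "")) > rule["max_chars"]:
        return False
    if event.get("tool") in rule.get("deny_tools", set()):
        return False
    return True


print(allowed(Hook.PRE_TOOL_CALL, {"tool": "delete_file"}))
print(allowed(Hook.PRE_TOOL_CALL, {"tool": "get_order"}))
```

Because the policy is data rather than code, the same table could in principle be enforced by an interpreter written in any language, which is the portability argument behind a declarative spec.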

That is a genuinely useful primitive. A team using LangGraph today and MAF tomorrow should not have to rewrite their governance layer from scratch. Whether ACS gets broad adoption outside Microsoft's own stack is the open question, but the design - declarative, cross-language, anchored to specific lifecycle points - is reasonable.

What to actually do with this

The best fit for CodeAct is read-heavy, chainable work: data lookups, light computation, and report assembly - tasks where several small tool calls can be composed safely. Keep side-effecting tools, such as sending email or writing to production systems, approval-gated as direct tools.
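That split can be expressed as a simple partition over tool metadata. The ToolSpec type, the read_only flag, and the tool names are hypothetical. The sketch only shows the routing decision, not how either group would be registered with a framework.

```python
from typing import NamedTuple


class ToolSpec(NamedTuple):
    name: str
    read_only: bool  # hypothetical flag: True if the tool has no side effects


def split_tools(tools: list[ToolSpec]) -> tuple[list[str], list[str]]:
    """Read-only tools go to the sandboxed program; the rest stay direct and gated."""
    codeact = [t.name for t in tools if t.read_only]
    direct = [t.name for t in tools if not t.read_only]
    return codeact, direct


tools = [
    ToolSpec("search_messages", read_only=True),
    ToolSpec("list_calendar_events", read_only=True),
    ToolSpec("send_email", read_only=False),
]

codeact_tools, direct_tools = split_tools(tools)
print(codeact_tools, direct_tools)
```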

A teammate like Beagle pulling together context from Slack, a project tracker, and a calendar before a standup is exactly the kind of workload CodeAct is designed for - chained reads, no writes, output assembled in one pass.

The alpha label on agent-framework-hyperlight is honest. The backend currently ships for Linux and Windows; macOS support is on the roadmap, and .NET support for CodeAct is listed as coming soon. If you are running Python on Linux in your agent infrastructure, it is worth testing now. If you develop on macOS, you will need to wait or use a Linux container.

The pattern itself - one program per turn, sandboxed, tools reachable via a proxy function - is worth understanding regardless of whether you use MAF. Other frameworks will ship variants of this. The terminology will differ; the core trade-off will not.
