Where AI Agents Run: Isolation, Governance, and Cost in the Cloud

Bishoy Youssef

Cover Image for Where AI Agents Run: Isolation, Governance, and Cost in the Cloud

Bishoy Youssef

March 28, 2026

The model is not the product. The place where your agent acts—running tools, reading data, calling APIs—is.

If the agent only recommends text, your laptop and a browser tab are enough. Once an agent executes—shell commands, database queries, long-running jobs—that execution needs a runtime you can define, isolate, govern, observe, and pay for on purpose.

This guide is for anyone picking (or reviewing) where that runtime lives. It separates the problem from the solution families so you can reason about new vendors even when names change.

Read this first

Problem: Tool-using agents need trustworthy execution—not just a smart model.
Answer: Treat the runtime like production infrastructure: versioned config, isolation, IAM, logs, and cost tags.
Reality: No single vendor sells “the” agent cloud. You mix compute shape (VM, container, function, microVM) with governance (your cloud account vs a vendor’s).

The problem in one picture

An agent is a loop: plan → act → observe. The act step always touches something real—a process, a network, a secret.

Diagram: model and agent logic connect to tools, which execute in a runtime bounded by isolation, policy, observability, and cost tags

If you skip the runtime discussion, you get shadow IT: agents on laptops with ambient credentials, or “temporary” servers nobody can reproduce. If you design the runtime, security and finance can treat it like any other production system.

Five things every serious setup must get right

Think of these as checklist dimensions—not optional extras for “later.”

Dimension	Plain English	Why it matters
1. Definition	The environment is described in code or manifests (IaC, K8s, task definitions).	You can answer what exactly ran and diff it like software.
2. Isolation	Workloads run in a boundary (VM, microVM, hardened container) with limited blast radius.	Agents are high-privilege; assume bugs and misuse.
3. Control	Who can start a run, what it may call, how long it lives—enforced outside the prompt.	Prompts are not security policy.
4. Observability	Logs, traces, audit trails tied to identity and to a definition version.	Incidents and compliance need evidence, not screenshots.
5. Cost clarity	Tags, budgets, per-team or per-customer attribution for compute and tokens.	Parallel agents spike usage fast.

Isolation: why “Firecracker” keeps coming up

Containers share the host kernel—great for density, weaker if you need strong separation for untrusted code.

MicroVMs (often built with Firecracker, AWS’s open-source microVM monitor) give a hardware-virtualization boundary in a small footprint. AWS Lambda uses Firecracker-style execution environments; Vercel Sandbox and several AI sandboxes advertise the same class of isolation.

Other patterns you’ll hear about: gVisor (extra layer between container and host—common on Google Cloud), Kata Containers (VM per pod). Same goal—performance vs isolation tradeoffs differ.

A map of solution families (not a vendor shootout)

Before naming products, know which family fits your agent’s shape:

Family	Typical use	Governance & cost
Managed dev / Linux workspaces	Repo-centric work, human-like environments	Often bundled with Git + seats; check export of logs
VMs & Kubernetes	Full control, custom networking, existing platform team	Best tagging and FinOps integration; you operate more
Serverless containers & functions	Event-driven workers, scale-to-zero, no servers to patch	Watch timeouts, concurrency, cold starts in cost models
CI/CD runners	Bounded, pipeline-attached steps	Great per-run identity; poor fit for long interactive sessions alone
Edge + managed sandboxes	Untrusted code, global routing, API-first	Compose multiple services; read platform limits carefully
Purpose-built AI sandboxes & GPU serverless	Tool-using agents, heavy ML steps	Evaluate SOC2, regions, $/minute vs $/seat

Everything below is an instance of one of these families.

Hyperscalers: VMs, Kubernetes, and the “middle tier”

VMs and Kubernetes (EC2, GCE, Azure VMs, EKS, GKE, AKS)

Best when: You need maximum control, private networking, existing security baselines, or Kubernetes is already standard.

Watch out: You own patching, scaling, and glue—but you also get the richest IAM, network policy, and cost allocation tooling.

Serverless containers and functions

You don’t want to manage EC2? The clouds offer a middle path:

AWS Fargate — run ECS/EKS tasks on AWS-managed capacity: task definitions, IAM, VPC, CloudWatch. Good for batch-shaped or service-shaped agents without running nodes yourself.
Google Cloud Run & Azure Container Apps — HTTP or event-driven containers, scale-to-zero, per-service billing. Good when the agent is a stateless worker behind an API or queue.
AWS Lambda — Short, event-scoped runs with IAM per function and Firecracker-style isolation. Ideal for glue, webhooks, fan-out; pair with Fargate or EKS when you need long runtimes, big images, or shell-heavy toolchains.

Watch out: Timeouts, concurrency caps, and cold starts affect both SLAs and bills.

Managed, repo-backed Linux environments

GitHub Codespaces and similar products: consistent environments from branches, org identity, and predictable minutes-style billing.

Best when: Work looks like software delivery (clone, build, verify).

Watch out: Exotic kernels, hardware, or network needs may push you to general compute.

Long-lived “workstations” (Cloud Workstations, Dev Box, …)

Best when: You standardize human desktops and need IT-style control.

Watch out: Not always tuned for massive parallel ephemeral agent runs without extra automation.

CI/CD as a gate

GitHub Actions, GitLab CI, Cloud Build, CodeBuild, Azure Pipelines: runs are declarative, logged, and tied to pipeline identity.

Best when: Agents run inside release gates or trigger remote sandboxes from a job.

Watch out: Interactive or hours-long sessions usually need another layer.

Edge platforms and managed sandboxes (Vercel, Cloudflare)

These optimize for placement, integration, and disposable execution—not for “I need a pet VM forever.”

Vercel Sandbox

Ephemeral microVMs (Firecracker-class) for untrusted or AI-generated workloads: run, capture output, destroy the boundary.

Best when: You already ship on Vercel and need isolation without folding risk into long-lived functions.

Cloudflare (composition of services)

Workers + Durable Objects — Edge JS/WebAssembly, global, request-level control: routing, auth, rate limits in front of heavier work. Not a full Linux box.
Workers AI + AI Gateway — Model routing, caching, failover, token visibility.
Cloudflare Containers (beta) — More CPU/RAM when Workers aren’t enough; still Workers-orchestrated.
Cloudflare Sandboxes (Sandbox SDK) — Processes, filesystems, tool-using agents on top of Containers.
R2, Queues, Browser Rendering, Vectorize — Artifacts, async, browser automation, retrieval—usually combined with Workers.

Watch out: You may chain several products to match one monolithic cluster—and limits (CPU, wall time, egress) drive cost and architecture.

More options teams actually use

Category	Examples	Notes
AI-native sandboxes	E2B, similar APIs	Programmatic sandboxes—check SOC2, data retention, pricing unit.
Serverless GPU / Python	Modal, Baseten-style, Replicate-style	Great for ML-heavy steps; weaker if you need a full arbitrary OS.
Regional VMs / bare metal	Fly.io, Latitude, Hetzner (where allowed)	Predictable per-VM cost, latency control; you own more ops.
PaaS	Railway, Render, Heroku-style	Fast to ship; verify SSO, VPC, log export for production.
Cloud dev envs	Gitpod, Codespaces ecosystem	Overlap with managed Linux above; sometimes API-first.
Batch on clusters	K8s Jobs, Argo, AWS Batch, Step Functions	Often the cheapest add-on if Kubernetes is already there.
Distributed Python	Ray, Anyscale	Scale for training/simulation—not a sandbox by itself.

Model vendors: intelligence vs execution

OpenAI, Anthropic, Google, etc. sell models. You still decide where tools run—your VPC, cluster, or a vendor sandbox you explicitly choose.

Bundled “agent” products may hide execution. Ask: data residency, audit logs, whether execution can stay in your cloud account, and per-seat vs usage pricing.

API-only models: compare tool-call logging, model version pinning, allowlisted endpoints, and whether token spend and infrastructure spend reconcile in one FinOps view.

Quick-start sandboxes are fine for demos. Production needs the same bar as any regulated workload: SSO, private networking, SLAs, billing tags.

How to choose (durable framework)

Classify the workload — Short tasks vs long sessions? Internal vs customer-facing? Read-mostly vs writes near production?
Codify the environment — If you cannot rebuild from a definition (commit, image digest, manifest), you cannot audit or optimize cost.
Unify telemetry — Same logs / traces / audit standards as the rest of your stack.
Split spend — Separate sandbox vs production accounts or tags; set budgets before scale.
Compare governance and unit economics — IAM, egress, regions, SOC reports, $/vCPU-hour and $/1M tokens—not CPU size alone.

Bottom line

Agents in production are automation. The runtime must be customizable, isolated, controlled, observable, and cost-visible.

Hyperscalers offer primitives and mature FinOps at the cost of assembly. Edge platforms (e.g. Vercel, Cloudflare) offer strong isolation stories and tight product fit at the cost of composition and limit discipline. Model vendors sell intelligence—not your SOC 2 boundary for tool execution.

Shareable takeaways

The model is not the runtime—tool execution needs the same rigor as any production system.
Five dimensions: definition-in-code, isolation, control, observability, cost attribution.
MicroVMs (e.g. Firecracker-class) address strong isolation; Fargate / Cloud Run / Lambda address who manages servers—different questions.
Pick a family first (VM, serverless container, edge sandbox, CI gate), then shortlist vendors.
If you can’t prove what ran, what it cost, and who approved it, you’re not ready for unattended agents at scale.

Questions or gaps? The platform landscape changes fast—use the families above as a lens, then validate limits, contracts, and exports with each vendor.

Blog.