
Where AI Agents Run: Isolation, Governance, and Cost in the Cloud

Bishoy Youssef

The model is not the product. The place where your agent acts—running tools, reading data, calling APIs—is.

If the agent only recommends text, your laptop and a browser tab are enough. Once an agent executes—shell commands, database queries, long-running jobs—that execution needs a runtime you can define, isolate, govern, observe, and pay for on purpose.

This guide is for anyone picking (or reviewing) where that runtime lives. It separates the problem from the solution families so you can reason about new vendors even when names change.


Read this first

  • Problem: Tool-using agents need trustworthy execution—not just a smart model.
  • Answer: Treat the runtime like production infrastructure: versioned config, isolation, IAM, logs, and cost tags.
  • Reality: No single vendor sells “the” agent cloud. You mix compute shape (VM, container, function, microVM) with governance (your cloud account vs a vendor’s).

The problem in one picture

An agent is a loop: plan → act → observe. The act step always touches something real—a process, a network, a secret.
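
To make the loop concrete, here is a minimal sketch in Python. It assumes a scripted planner standing in for the model; `AgentLoop`, `Step`, and `scripted_plan` are hypothetical names invented for illustration, not any framework's API. The point is that the act step is the only place the loop touches the runtime.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    tool: str       # name of the tool to call
    argument: str   # input passed to that tool


@dataclass
class AgentLoop:
    # plan: stand-in for the model; it sees past observations and picks a step.
    plan: Callable[[List[str]], Optional[Step]]
    # tools: the only surface through which the loop touches the runtime.
    tools: Dict[str, Callable[[str], str]]
    observations: List[str] = field(default_factory=list)

    def run(self, max_steps: int = 5) -> List[str]:
        for _ in range(max_steps):
            step = self.plan(self.observations)             # plan
            if step is None:
                break
            result = self.tools[step.tool](step.argument)   # act
            self.observations.append(result)                # observe
        return self.observations


def scripted_plan(observations: List[str]) -> Optional[Step]:
    # Hypothetical planner: request one tool call, then stop.
    if not observations:
        return Step(tool="echo", argument="hello")
    return None


loop = AgentLoop(plan=scripted_plan, tools={"echo": lambda text: text.upper()})
print(loop.run())
```

In a real deployment each entry in `tools` would run inside the isolated runtime this guide is about, rather than in the same process as the planner.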

Diagram: model and agent logic connect to tools, which execute in a runtime bounded by isolation, policy, observability, and cost tags

If you skip the runtime discussion, you get shadow IT: agents on laptops with ambient credentials, or “temporary” servers nobody can reproduce. If you design the runtime, security and finance can treat it like any other production system.


Five things every serious setup must get right

Think of these as checklist dimensions—not optional extras for “later.”

| Dimension | Plain English | Why it matters |
| --- | --- | --- |
| 1. Definition | The environment is described in code or manifests (IaC, K8s, task definitions). | You can answer what exactly ran and diff it like software. |
| 2. Isolation | Workloads run in a boundary (VM, microVM, hardened container) with limited blast radius. | Agents are high-privilege; assume bugs and misuse. |
| 3. Control | Who can start a run, what it may call, how long it lives: all enforced outside the prompt. | Prompts are not security policy. |
| 4. Observability | Logs, traces, audit trails tied to identity and to a definition version. | Incidents and compliance need evidence, not screenshots. |
| 5. Cost clarity | Tags, budgets, per-team or per-customer attribution for compute and tokens. | Parallel agents spike usage fast. |
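
As an illustration of the Control row, the sketch below shows a gate that lives in ordinary code rather than in the prompt. It is a toy under stated assumptions: `RunPolicy`, `RunRequest`, and `authorize` are hypothetical names, and a real setup would delegate these checks to IAM and the platform's own policy engine.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RunPolicy:
    allowed_principals: frozenset   # who can start a run
    allowed_tools: frozenset        # what a run may call
    max_runtime_seconds: int        # how long a run may live


@dataclass(frozen=True)
class RunRequest:
    principal: str
    tool: str
    requested_seconds: int


def authorize(policy: RunPolicy, request: RunRequest) -> Tuple[bool, str]:
    """Decide outside the prompt whether a run may start, and say why."""
    if request.principal not in policy.allowed_principals:
        return False, "principal may not start runs"
    if request.tool not in policy.allowed_tools:
        return False, "tool is not allowlisted"
    if request.requested_seconds > policy.max_runtime_seconds:
        return False, "requested lifetime exceeds limit"
    return True, "ok"


# Hypothetical policy and request values, for illustration only.
policy = RunPolicy(
    allowed_principals=frozenset({"ci-bot"}),
    allowed_tools=frozenset({"run_tests"}),
    max_runtime_seconds=600,
)
print(authorize(policy, RunRequest("ci-bot", "run_tests", 300)))
print(authorize(policy, RunRequest("ci-bot", "drop_database", 300)))
```

Whatever the model writes in its plan, the second request is refused, because the decision never passes through the prompt.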

Isolation: why “Firecracker” keeps coming up

Containers share the host kernel—great for density, weaker if you need strong separation for untrusted code.

MicroVMs (often built with Firecracker, AWS’s open-source virtual machine monitor) give a hardware-virtualization boundary in a small footprint. AWS Lambda runs functions in Firecracker microVMs; Vercel Sandbox and several AI sandboxes advertise the same class of isolation.

Other patterns you’ll hear about: gVisor (a user-space kernel that sits between the container and the host kernel; common on Google Cloud) and Kata Containers (a lightweight VM per pod). The goal is the same; the performance-versus-isolation tradeoffs differ.


A map of solution families (not a vendor shootout)

Before naming products, know which family fits your agent’s shape:

| Family | Typical use | Governance & cost |
| --- | --- | --- |
| Managed dev / Linux workspaces | Repo-centric work, human-like environments | Often bundled with Git + seats; check export of logs |
| VMs & Kubernetes | Full control, custom networking, existing platform team | Best tagging and FinOps integration; you operate more |
| Serverless containers & functions | Event-driven workers, scale-to-zero, no servers to patch | Watch timeouts, concurrency, cold starts in cost models |
| CI/CD runners | Bounded, pipeline-attached steps | Great per-run identity; poor fit for long interactive sessions alone |
| Edge + managed sandboxes | Untrusted code, global routing, API-first | Compose multiple services; read platform limits carefully |
| Purpose-built AI sandboxes & GPU serverless | Tool-using agents, heavy ML steps | Evaluate SOC 2, regions, $/minute vs $/seat |

Everything below is an instance of one of these families.


Hyperscalers: VMs, Kubernetes, and the “middle tier”

VMs and Kubernetes (EC2, GCE, Azure VMs, EKS, GKE, AKS)

Best when: You need maximum control, private networking, existing security baselines, or Kubernetes is already standard.

Watch out: You own patching, scaling, and glue—but you also get the richest IAM, network policy, and cost allocation tooling.

Serverless containers and functions

You don’t want to manage EC2? The clouds offer a middle path:

  • AWS Fargate: Run ECS/EKS tasks on AWS-managed capacity: task definitions, IAM, VPC, CloudWatch. Good for batch-shaped or service-shaped agents without running nodes yourself.

  • Google Cloud Run & Azure Container Apps: HTTP or event-driven containers, scale-to-zero, per-service billing. Good when the agent is a stateless worker behind an API or queue.

  • AWS Lambda: Short, event-scoped runs with IAM per function and Firecracker-style isolation. Ideal for glue, webhooks, fan-out; pair with Fargate or EKS when you need long runtimes, big images, or shell-heavy toolchains.

Watch out: Timeouts, concurrency caps, and cold starts affect both SLAs and bills.
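
A back-of-the-envelope model shows how these three limits interact. The sketch below is only that: every rate and limit in it is a made-up placeholder, not any provider's price or quota, and it assumes cold-start time is billed, which varies by platform. The concurrency estimate uses Little's law (in-flight work = arrival rate × average duration).

```python
def required_concurrency(requests_per_second: float, avg_seconds: float) -> float:
    """Little's law: in-flight executions = arrival rate x average duration."""
    return requests_per_second * avg_seconds


def fits_limits(avg_seconds: float, timeout_seconds: float,
                requests_per_second: float, concurrency_cap: float) -> bool:
    """True when a typical run finishes in time and steady load fits the cap."""
    within_timeout = avg_seconds <= timeout_seconds
    within_cap = required_concurrency(requests_per_second, avg_seconds) <= concurrency_cap
    return within_timeout and within_cap


def monthly_cost(invocations: int, avg_seconds: float, memory_gb: float,
                 price_per_gb_second: float, price_per_million_requests: float,
                 cold_start_fraction: float = 0.0,
                 cold_start_seconds: float = 0.0) -> float:
    """Compute charge plus request charge, with cold starts as extra billed time."""
    billed_seconds = invocations * (avg_seconds + cold_start_fraction * cold_start_seconds)
    compute = billed_seconds * memory_gb * price_per_gb_second
    requests = invocations / 1_000_000 * price_per_million_requests
    return compute + requests


# Placeholder numbers for illustration; substitute your provider's published values.
print(required_concurrency(requests_per_second=50, avg_seconds=2.0))
print(fits_limits(avg_seconds=2.0, timeout_seconds=30.0,
                  requests_per_second=50, concurrency_cap=100))
print(round(monthly_cost(1_000_000, 2.0, 1.0, 0.00002, 0.20,
                         cold_start_fraction=0.1, cold_start_seconds=1.0), 2))
```

Even this crude model makes the tradeoff visible: doubling average run time doubles both the concurrency you need and the compute line of the bill.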

Managed, repo-backed Linux environments

GitHub Codespaces and similar products: consistent environments from branches, org identity, and predictable minutes-style billing.

Best when: Work looks like software delivery (clone, build, verify).

Watch out: Exotic kernels, hardware, or network needs may push you to general compute.

Long-lived “workstations” (Cloud Workstations, Dev Box, …)

Best when: You standardize human desktops and need IT-style control.

Watch out: Not always tuned for massively parallel, ephemeral agent runs without extra automation.

CI/CD as a gate

GitHub Actions, GitLab CI, Cloud Build, CodeBuild, Azure Pipelines: runs are declarative, logged, and tied to pipeline identity.

Best when: Agents run inside release gates or trigger remote sandboxes from a job.

Watch out: Interactive or hours-long sessions usually need another layer.


Edge platforms and managed sandboxes (Vercel, Cloudflare)

These optimize for placement, integration, and disposable execution—not for “I need a pet VM forever.”

Vercel Sandbox

Ephemeral microVMs (Firecracker-class) for untrusted or AI-generated workloads: run, capture output, destroy the boundary.

Best when: You already ship on Vercel and need isolation without folding risk into long-lived functions.
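
The run, capture, destroy lifecycle can be sketched locally with the Python standard library. This is an illustration of the lifecycle only, not of Vercel's API and not of real isolation: a subprocess in a temporary directory shares your kernel, filesystem, and credentials, which is exactly the gap a microVM closes. `run_ephemeral` is a hypothetical helper name.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_ephemeral(code: str, timeout_seconds: float = 5.0) -> dict:
    """Run a snippet in a throwaway directory, capture its output, then clean up.

    NOTE: a local subprocess is NOT a security boundary. In production the
    body of this function would be a call into a real sandbox or microVM.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "task.py"
        script.write_text(code)
        try:
            done = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_seconds,   # hard wall-clock limit
            )
            result = {
                "exit_code": done.returncode,
                "stdout": done.stdout,
                "stderr": done.stderr,
                "timed_out": False,
            }
        except subprocess.TimeoutExpired:
            result = {"exit_code": None, "stdout": "", "stderr": "", "timed_out": True}
    # The temporary directory and everything written into it is gone here.
    return result


print(run_ephemeral("print(2 + 2)"))
```

The shape is what matters: the caller gets back only captured output and an exit status, and nothing the workload wrote survives the run.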

Cloudflare (composition of services)

  • Workers + Durable Objects: Edge JS/WebAssembly, global, request-level control: routing, auth, rate limits in front of heavier work. Not a full Linux box.

  • Workers AI + AI Gateway: Model routing, caching, failover, token visibility.

  • Cloudflare Containers (beta): More CPU/RAM when Workers aren’t enough; still Workers-orchestrated.

  • Cloudflare Sandboxes (Sandbox SDK): Processes, filesystems, tool-using agents on top of Containers.

  • R2, Queues, Browser Rendering, Vectorize: Artifacts, async, browser automation, retrieval; usually combined with Workers.

Watch out: You may chain several products to match one monolithic cluster—and limits (CPU, wall time, egress) drive cost and architecture.


More options teams actually use

| Category | Examples | Notes |
| --- | --- | --- |
| AI-native sandboxes | E2B, similar APIs | Programmatic sandboxes; check SOC 2, data retention, pricing unit. |
| Serverless GPU / Python | Modal, Baseten-style, Replicate-style | Great for ML-heavy steps; weaker if you need a full arbitrary OS. |
| Regional VMs / bare metal | Fly.io, Latitude, Hetzner (where allowed) | Predictable per-VM cost, latency control; you own more ops. |
| PaaS | Railway, Render, Heroku-style | Fast to ship; verify SSO, VPC, log export for production. |
| Cloud dev envs | Gitpod, Codespaces ecosystem | Overlap with managed Linux above; sometimes API-first. |
| Batch on clusters | K8s Jobs, Argo, AWS Batch, Step Functions | Often the cheapest add-on if Kubernetes is already there. |
| Distributed Python | Ray, Anyscale | Scale for training/simulation; not a sandbox by itself. |

Model vendors: intelligence vs execution

OpenAI, Anthropic, Google, etc. sell models. You still decide where tools run—your VPC, cluster, or a vendor sandbox you explicitly choose.

Bundled “agent” products may hide execution. Ask: data residency, audit logs, whether execution can stay in your cloud account, and per-seat vs usage pricing.

API-only models: compare tool-call logging, model version pinning, allowlisted endpoints, and whether token spend and infrastructure spend reconcile in one FinOps view.
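
Reconciling the two kinds of spend mostly comes down to sharing one tag across both bills. A minimal sketch, assuming usage records are already labeled by team: the record shapes, the team names, and both prices below are invented for illustration and do not reflect any vendor's export format or rates.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def spend_by_team(
    token_usage: Iterable[Tuple[str, int]],      # (team tag, tokens consumed)
    infra_usage: Iterable[Tuple[str, float]],    # (team tag, vCPU-hours used)
    price_per_million_tokens: float,
    price_per_vcpu_hour: float,
) -> Dict[str, Dict[str, float]]:
    """Roll token and infrastructure spend up under the same team tag."""
    totals = defaultdict(lambda: {"tokens_usd": 0.0, "infra_usd": 0.0})
    for team, tokens in token_usage:
        totals[team]["tokens_usd"] += tokens / 1_000_000 * price_per_million_tokens
    for team, vcpu_hours in infra_usage:
        totals[team]["infra_usd"] += vcpu_hours * price_per_vcpu_hour
    return {
        team: {**parts, "total_usd": parts["tokens_usd"] + parts["infra_usd"]}
        for team, parts in totals.items()
    }


# Hypothetical usage and prices, for illustration only.
report = spend_by_team(
    token_usage=[("search", 2_000_000), ("support", 500_000)],
    infra_usage=[("search", 10.0)],
    price_per_million_tokens=3.0,
    price_per_vcpu_hour=0.05,
)
for team, parts in sorted(report.items()):
    print(team, parts)
```

A team with token spend and no matching infrastructure line (or the reverse) is a useful signal that something is running untagged or outside the accounts you can see.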

Quick-start sandboxes are fine for demos. Production needs the same bar as any regulated workload: SSO, private networking, SLAs, billing tags.


How to choose (durable framework)

  1. Classify the workload: Short tasks vs long sessions? Internal vs customer-facing? Read-mostly vs writes near production?

  2. Codify the environment: If you cannot rebuild from a definition (commit, image digest, manifest), you cannot audit or optimize cost.

  3. Unify telemetry: Same logs / traces / audit standards as the rest of your stack.

  4. Split spend: Separate sandbox vs production accounts or tags; set budgets before scale.

  5. Compare governance and unit economics: IAM, egress, regions, SOC reports, $/vCPU-hour and $/1M tokens, not CPU size alone.
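
Steps 2 through 4 meet in a single record per run. The sketch below is one possible shape, assuming the environment definition can be expressed as a plain dictionary: it fingerprints the definition with SHA-256 and emits a JSON log line tying that fingerprint to an identity and cost tags. The field names, the example image reference, and the tag values are all hypothetical.

```python
import hashlib
import json


def definition_fingerprint(definition: dict) -> str:
    """Stable SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_record(definition: dict, principal: str, tags: dict) -> str:
    """One log line answering what ran, who started it, and who pays for it."""
    return json.dumps(
        {
            "definition_sha256": definition_fingerprint(definition),
            "principal": principal,
            "tags": tags,
        },
        sort_keys=True,
    )


# Hypothetical definition; the image reference is a placeholder, not a real registry.
definition = {
    "image": "registry.example/agent-runtime:1.4.2",
    "cpu": 1,
    "memory_mb": 2048,
    "timeout_seconds": 600,
}
print(run_record(definition, principal="ci-bot", tags={"team": "search", "env": "sandbox"}))
```

Because the fingerprint changes whenever the definition does, two runs with the same value ran the same environment, and a diff of the two definitions explains any run where it differs.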


Bottom line

Agents in production are automation. The runtime must be defined in code, isolated, controlled, observable, and cost-visible.

Hyperscalers offer primitives and mature FinOps at the cost of assembly. Edge platforms (e.g. Vercel, Cloudflare) offer strong isolation stories and tight product fit at the cost of composition and limit discipline. Model vendors sell intelligence—not your SOC 2 boundary for tool execution.


Shareable takeaways

  1. The model is not the runtime—tool execution needs the same rigor as any production system.
  2. Five dimensions: definition-in-code, isolation, control, observability, cost attribution.
  3. MicroVMs (e.g. Firecracker-class) address strong isolation; Fargate / Cloud Run / Lambda address who manages servers—different questions.
  4. Pick a family first (VM, serverless container, edge sandbox, CI gate), then shortlist vendors.
  5. If you can’t prove what ran, what it cost, and who approved it, you’re not ready for unattended agents at scale.

Questions or gaps? The platform landscape changes fast—use the families above as a lens, then validate limits, contracts, and exports with each vendor.
