Date: 2026-03-10

AI apps in production don't fail for interesting reasons. They fail for boring ones: a provider goes down at 2 AM, a rate limit kicks in during a traffic spike, someone hard-coded a model name that got deprecated. And because most teams have their routing and fallback logic scattered across application code, these failures are invisible until they're already breaking things.

That problem gets worse as AI workloads shift from single-call inference to multi-step agentic pipelines. Your agent is making multiple calls in a loop, so every added millisecond of latency compounds, retries start cascading, and a single provider hiccup doesn’t just slow down one request; it stalls the whole chain.

I wanted to see what it would look like to move that entire layer to the edge. One endpoint, any model, with routing decided before the request ever hits a provider. So I built a proof of concept on Fastly Compute.

Building an Edge AI Gateway for Multi-Provider LLM Routing

This gateway is a policy-driven routing layer that sits between your application and your LLM providers. Your app sends a standard OpenAI-compatible chat completion request to a single Fastly endpoint. At the edge, a fast classification model looks at the request and decides which provider and model should handle it, based on complexity, cost, and availability. Then Fastly forwards the request to the right place.

Your application doesn't change. It sends the same request it always would. The gateway handles the rest.

An important point: this isn't a model marketplace, and it's not selling access to anything. You bring your own provider relationships, your own API keys, your own negotiated rates. The gateway is just the routing layer, the part that decides where each request goes and makes sure it gets there, fast.

Why Fastly Compute

Three properties of Compute make it a natural fit for this.

Latency is the obvious one. Routing decisions happen in the same PoPs that already handle your HTTP traffic. There's no extra hop to a centralized gateway service sitting in us-east-1. And because Compute uses WebAssembly, cold starts are in the low-microsecond range, not the hundreds of milliseconds you'd see with a Lambda or similar container-based edge function. For a gateway that's adding a classification step to every request, that matters.

Isolation matters too. Compute runs on WebAssembly, so each request gets its own sandboxed execution environment. When you're handling API keys for multiple providers, that kind of strict isolation isn't optional.

And a gateway is just a Compute service. You can deploy it today with the existing toolchain: Secret Store for credentials, KV Store for routing policies, backends for provider connections. There's no new infrastructure to stand up.

Classifying AI Requests Before Inference

For the POC, I used Mercury 2 from Inception Labs, and it turned out to be a good fit for a specific reason: speed.

Mercury is a diffusion-based language model, which means it doesn't generate tokens one at a time the way autoregressive models do; it refines the entire output in parallel. For a routing classification task (where the output is a small JSON object with a tier, a reason, and a complexity estimate), that architecture is fast.

The classification adds 200-300ms, but the downstream provider is typically taking 1-5+ seconds depending on the model and task. With a reasoning request that routes to a heavier model, the classification overhead is maybe 5-10% of total response time. That's the tradeoff that makes it worth it: you spend a fraction of a second deciding where to send the request, and save money (and response time) on every request that doesn't need an expensive model.
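To make that tradeoff concrete, here is a small back-of-the-envelope sketch in Python. The function name and the sample timings are illustrative; the timings are picked from the ranges quoted above, not measured.

```python
def classification_overhead(classify_ms: float, provider_ms: float) -> float:
    """Fraction of total response time spent on the classification call."""
    return classify_ms / (classify_ms + provider_ms)


# Illustrative pairs of (classification time, provider time) in milliseconds,
# chosen from the 200-300 ms and 1-5+ s ranges mentioned in the text.
samples = [(200, 3800), (300, 2700), (300, 1000)]

for classify_ms, provider_ms in samples:
    share = classification_overhead(classify_ms, provider_ms)
    print(f"{classify_ms} ms + {provider_ms} ms provider -> {share:.1%} of total")
```

With a provider call in the 3-4 second range, the classification share lands in roughly the 5-10% band. For a 1-second provider call it is closer to a quarter of the total, which is the kind of request where skipping classification starts to look attractive.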

For the gateway use case, I needed three things from the classifier: tool use support (so it could call a structured routing function), schema-aligned JSON output (so the routing decision is always parseable), and enough context window to look at the conversation history being classified. Mercury 2 covers all three.

How the Edge AI Gateway Works

Here's the flow. A client sends a POST to /v1/chat/completions on the Fastly endpoint, the same format you'd send to OpenAI directly.

Compute handles the request and makes a classification call to Mercury 2. The prompt is compact: it sends the system prompt (truncated), the latest user message (truncated), the message count, and the requested model. Mercury returns a structured JSON decision:

    {
      "tier": "reasoning",
      "reason": "multi-step math problem requiring careful sequential logic",
      "estimated_complexity": "high"
    }

That tier maps to a provider and model in a routing policy that you can define and store in Fastly's KV Store:

    {
      "tiers": {
        "fast": { "provider": "openai", "model": "gpt-5-nano" },
        "balanced": { "provider": "anthropic", "model": "claude-haiku-4-5" },
        "quality": { "provider": "openai", "model": "gpt-5.2" },
        "reasoning": { "provider": "anthropic", "model": "claude-opus-4-6" }
      }
    }
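The lookup from tier to provider and model is a plain dictionary access. A minimal sketch in Python, for illustration only: it is not the POC's code and does not use the Compute SDK, and the names `Route` and `resolve_route` are hypothetical. The policy is inlined as a string here; in the gateway it would be read from the KV Store.

```python
import json
from typing import NamedTuple


class Route(NamedTuple):
    provider: str
    model: str


# The routing policy from the text, inlined for the sake of a runnable example.
POLICY_JSON = """
{
  "tiers": {
    "fast":      {"provider": "openai",    "model": "gpt-5-nano"},
    "balanced":  {"provider": "anthropic", "model": "claude-haiku-4-5"},
    "quality":   {"provider": "openai",    "model": "gpt-5.2"},
    "reasoning": {"provider": "anthropic", "model": "claude-opus-4-6"}
  }
}
"""


def resolve_route(policy: dict, tier: str) -> Route:
    """Look up the provider and model configured for a tier."""
    try:
        entry = policy["tiers"][tier]
    except KeyError:
        # Raised when the tier (or the "tiers" table itself) is missing.
        raise ValueError(f"no route configured for tier {tier!r}") from None
    return Route(provider=entry["provider"], model=entry["model"])


policy = json.loads(POLICY_JSON)
route = resolve_route(policy, "reasoning")
print(route.provider, route.model)
```

Raising on an unknown tier is one possible design choice; a gateway could just as reasonably substitute a default tier at this point.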

Compute rewrites the model field in the original request body and forwards it to the selected provider. The response comes back to the client with a set of metadata headers attached:

    X-Routed-Provider: anthropic
    X-Routed-Model: claude-opus-4-6
    X-Routing-Tier: reasoning
    X-Routing-Reason: multi-step math problem requiring careful sequential logic
    X-Classification-Time: 73ms

The client gets a normal chat completion response. The only difference is those headers, which tell you exactly what happened and why.
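The rewrite-and-annotate step can be sketched in a few lines of Python. This is an illustration of the logic, not the POC's implementation: `rewrite_model` and `routing_headers` are hypothetical helper names, and collapsing whitespace in the reason string is an assumption about how to keep a free-text value safe inside a header.

```python
import json


def rewrite_model(body: bytes, model: str) -> bytes:
    """Return the request body with only its "model" field replaced."""
    payload = json.loads(body)
    payload["model"] = model
    return json.dumps(payload).encode("utf-8")


def routing_headers(provider: str, model: str, tier: str,
                    reason: str, elapsed_ms: int) -> dict[str, str]:
    """Build the metadata headers attached to the response."""
    return {
        "X-Routed-Provider": provider,
        "X-Routed-Model": model,
        "X-Routing-Tier": tier,
        # Header values must stay on one line, so collapse whitespace runs
        # (including newlines) in the classifier's free-text reason.
        "X-Routing-Reason": " ".join(reason.split()),
        "X-Classification-Time": f"{elapsed_ms}ms",
    }


original = b'{"model": "auto", "messages": [{"role": "user", "content": "hi"}]}'
print(rewrite_model(original, "claude-opus-4-6").decode("utf-8"))
print(routing_headers("anthropic", "claude-opus-4-6", "reasoning",
                      "multi-step math problem", 73))
```

Everything other than `model` passes through untouched, which is what lets the client keep sending a standard chat completion request.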

The Classification Prompt

The routing logic is the interesting part; here's what it actually looks like. The router LLM gets a system prompt that defines the tiers:

    You are an LLM request router. Given a chat completion request, choose the right quality tier for routing. Be fast and decisive.

    Tiers:
    - fast: simple Q&A, classification, summarization, extraction, short tasks
    - balanced: multi-step tasks, writing, moderate reasoning, medium context
    - quality: complex generation, nuanced writing, long documents, important tasks
    - reasoning: math, code, multi-hop logic, anything requiring careful step-by-step thinking

    Reply ONLY with one word: fast, balanced, quality, or reasoning.

And it receives a condensed version of the inbound request, not the full body, just enough to make the determination:

    Requested model: auto
    Message count: 3
    Latest user message (first 500 chars): Solve this step by step...

Mercury responds with the JSON routing decision, and Compute maps it to the right provider. The whole classification step runs in under 300ms for the vast majority of requests, and that's with a plain call out to the LLM, with no optimization of geographic location or infrastructure placement.
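Condensing the inbound request is mostly string slicing. A minimal Python sketch of the idea follows, assuming every message `content` is a plain string. The function name, the `system_limit` parameter, and the wording of the system-prompt line are assumptions made for illustration; only the three lines shown in the example above come from the POC.

```python
def build_classifier_input(request: dict, user_limit: int = 500,
                           system_limit: int = 500) -> str:
    """Summarise a chat completion request for the routing classifier."""
    messages = request.get("messages", [])
    # First system message, if any; latest user message, searching from the end.
    system = next((m["content"] for m in messages if m.get("role") == "system"), "")
    latest_user = next(
        (m["content"] for m in reversed(messages) if m.get("role") == "user"), ""
    )
    lines = [
        f"Requested model: {request.get('model', 'auto')}",
        f"Message count: {len(messages)}",
    ]
    if system:
        lines.append(f"System prompt (first {system_limit} chars): {system[:system_limit]}")
    lines.append(
        f"Latest user message (first {user_limit} chars): {latest_user[:user_limit]}"
    )
    return "\n".join(lines)


request = {
    "model": "auto",
    "messages": [
        {"role": "system", "content": "You are a careful math tutor."},
        {"role": "user", "content": "Hi"},
        {"role": "user", "content": "Solve this step by step..."},
    ],
}
print(build_classifier_input(request))
```

Capping both fields keeps the classifier prompt roughly constant in size no matter how long the conversation gets, which helps keep classification time predictable.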

Policy Without Deploys

One thing I like about this architecture is that the routing policy lives in Fastly's KV Store, not in code. You can change which tier maps to which model, swap providers, adjust fallback chains, all without redeploying.

That sounds like a small thing, but it matters when you're iterating on model selection, swapping a new model into your "balanced" tier, or shifting traffic after a provider price cut. Each of those is a simple update to your KV Store.

Credentials work the same way. All provider API keys live in Fastly's Secret Store. Your application authenticates to Fastly. Fastly handles provider auth. Your app code never touches a provider key.

Skipping Classification When You Don't Need It

Not every request needs to be classified. If a client already knows it wants a specific tier, it can pass an X-LLM-Policy header:

    curl https://your-gateway.edgecompute.app/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "X-LLM-Policy: quality" \
      -d '{"model": "auto", "messages": [{"role": "user", "content": "Draft a detailed project proposal..."}]}'

That skips Mercury entirely and routes straight to the provider mapped to the "quality" tier. Useful for latency-sensitive paths where the application already has enough context to make the routing decision itself.
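The override amounts to a header check that runs before the classifier. Here is a hedged Python sketch of that decision. `choose_tier` is a hypothetical name, and sending an unrecognized header value on to classification (rather than rejecting the request) is an assumption; the POC's actual behavior for bad values isn't described here.

```python
from typing import Callable, Mapping

VALID_TIERS = {"fast", "balanced", "quality", "reasoning"}


def choose_tier(headers: Mapping[str, str],
                classify: Callable[[], str]) -> tuple[str, bool]:
    """Return (tier, classified); classified is False when the override was used."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    requested = lowered.get("x-llm-policy", "").strip().lower()
    if requested in VALID_TIERS:
        return requested, False  # classifier is never called
    return classify(), True


# Stand-in for the Mercury call, so the example runs without a network.
def fake_classifier() -> str:
    return "balanced"


print(choose_tier({"X-LLM-Policy": "quality"}, fake_classifier))
print(choose_tier({"Content-Type": "application/json"}, fake_classifier))
```

Passing the classifier in as a callable means the expensive call is only made when the header is absent or invalid.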

What the Routing Headers Look Like in Practice

Here's what you actually see when you test it. A simple factual question:

    curl -s -i https://your-gateway.edgecompute.app/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'

Response headers:

    X-Routed-Provider: openai
    X-Routed-Model: gpt-5-nano
    X-Routing-Tier: fast
    X-Routing-Reason: simple factual question, single turn, no reasoning required

A complex reasoning task:

    curl -s -i https://your-gateway.edgecompute.app/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "auto", "messages": [{"role": "user", "content": "Solve this step by step: if a train leaves Chicago at 9am traveling 60mph..."}]}'

Response headers:

    X-Routed-Provider: anthropic
    X-Routed-Model: claude-opus-4-6
    X-Routing-Tier: reasoning
    X-Routing-Reason: multi-step math problem with sequential dependencies

The X-Routing-Reason header is the part that builds trust in the system. You can see exactly why Mercury made the decision it did on every request. That's useful for debugging, for auditing, and for tuning your routing policy over time.
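On the client side, those headers are easy to collect and summarize. A small Python sketch, assuming you have already captured response headers as dictionaries; the helper names are hypothetical and the sample data is just the two responses shown above.

```python
from collections import Counter

ROUTING_HEADERS = (
    "x-routed-provider",
    "x-routed-model",
    "x-routing-tier",
    "x-routing-reason",
)


def routing_metadata(headers: dict) -> dict:
    """Pick out the gateway's routing headers, ignoring header-name case."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {name: lowered[name] for name in ROUTING_HEADERS if name in lowered}


def tier_counts(responses: list[dict]) -> Counter:
    """Count how many captured responses were routed to each tier."""
    return Counter(
        routing_metadata(headers).get("x-routing-tier", "unknown")
        for headers in responses
    )


captured = [
    {"X-Routed-Provider": "openai", "X-Routed-Model": "gpt-5-nano",
     "X-Routing-Tier": "fast",
     "X-Routing-Reason": "simple factual question, single turn, no reasoning required"},
    {"X-Routed-Provider": "anthropic", "X-Routed-Model": "claude-opus-4-6",
     "X-Routing-Tier": "reasoning",
     "X-Routing-Reason": "multi-step math problem with sequential dependencies"},
]
print(tier_counts(captured))
```

A tally like this is one simple way to check whether the tier definitions in the prompt are splitting traffic the way you intended.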

Latency Budget

Two sequential calls happen in the critical path: Mercury classification, then provider inference.

Mercury classification runs in about 200-300ms for typical requests. Provider inference is whatever it normally is, anywhere from 200ms to several seconds, depending on the model and task.

For any non-trivial LLM request (where the provider takes 1-5+ seconds), Mercury's overhead is a small fraction of total latency. And the cost savings from not sending everything to your most expensive model more than offset the classification cost. For ultra-fast requests where even 300ms matters, the X-LLM-Policy override lets you skip classification entirely. Mercury, or another low-latency LLM, can also be one of the routed models itself, for example as the target for those ultra-fast requests.

What This Doesn't Do (Yet)

This is a proof of concept, not a production product. A few things are deliberately left out.

Streaming passthrough isn't wired up in the prototype. Compute supports SSE streaming, but handling it through a gateway adds complexity that wasn't worth solving for a POC.

Automatic failover is only partially there. If Mercury classification fails, the gateway falls back to the "fast" tier by default, but full provider failover with circuit breakers and retry chains is a V1 feature, not a prototype feature.
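The fallback that does exist can be sketched as defensive parsing of the classifier's reply. This Python example is an illustration under assumptions, not the POC's code: `parse_decision` is a hypothetical name, and the `reason` and `estimated_complexity` values in the fallback record are placeholders. Only the fall-back-to-"fast" behavior comes from the description above.

```python
import json

VALID_TIERS = {"fast", "balanced", "quality", "reasoning"}

# Placeholder fallback record; only the "fast" tier is taken from the text.
FALLBACK = {
    "tier": "fast",
    "reason": "classification failed, using default tier",
    "estimated_complexity": "unknown",
}


def parse_decision(raw: str) -> dict:
    """Parse the classifier's reply, falling back to the "fast" tier on any problem."""
    try:
        decision = json.loads(raw)
        tier = decision["tier"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Not JSON, missing "tier", or a JSON value that is not an object.
        return dict(FALLBACK)
    if not isinstance(tier, str) or tier not in VALID_TIERS:
        return dict(FALLBACK)
    return decision


print(parse_decision('{"tier": "reasoning", "reason": "math", "estimated_complexity": "high"}'))
print(parse_decision("upstream timeout"))
```

Treating an unknown tier the same as a failed call keeps a misbehaving classifier from routing traffic to a tier the policy doesn't define.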

There's no semantic caching yet. But there's a natural pairing here with Fastly's AI Accelerator, which could sit in front of the classification step and serve cached responses for repeated queries without ever hitting a provider.

Next Steps and Future Capabilities

The latency overhead is minimal, and the architecture fits naturally into how Compute already works. The bring-your-own-keys model means you keep your existing provider relationships and rates; the gateway just makes them work better together.

We're exploring what it would look like to make this a native Fastly capability. If you're running AI workloads in production and dealing with multi-provider complexity (cost management, failover, model selection sprawl), I'd like to hear what you'd actually need. Things like: BYO keys vs managed billing? What logging destinations matter? What policy controls would actually change how you operate?