Cloud Summit Vancouver 2026

The Illustrated Primer to GenAI Networking.
Why we need AI Gateways when we already had API Gateways.

Morgan Foster  ·  Co-Chair, WG AI Gateway  ·  25 min / 24 slides
Setup

Didn’t we already have API Gateways?

"AI Gateway" is an overloaded term.
What’s actually different — and why can’t we just use what we have?
api-gateway.yaml?
Setup

Two postures.

API Gateways apply policy to traffic entering a cluster. AI Gateways apply policy to traffic leaving it.
[Diagram: external clients → API Gateway (ingress) → in-cluster workloads (web-app, api-svc, agent) → AI Gateway (egress) → inference providers (OpenAI, Anthropic, self-hosted). Ingress: policy for traffic we receive, where we control both sides. Egress: policy for traffic we send, where we don't control the other side.]
Setup

The body is where the signal is.

Traditional gateways route on headers. AI inference moves every routing, security, and caching decision into the payload.
Traditional API request
# headers carry the decision
Host: api.example.com            ◄ route
Authorization: Bearer ...        ◄ auth
Content-Type: application/json   ◄ negotiate

# body is opaque payload
{ "user_id": 42, "action": "get_balance" }
Everything the gateway needs is in the headers.
AI inference request
# headers are decorative
Host: api.openai.com
Authorization: Bearer ...

# body carries every decision
{
  "model": "gpt-4",        ◄ route
  "messages": [
    { "role": "user",      ◄ guardrails
      "content": "..." }   ◄ cache key
  ],
  "temperature": 0.7       ◄ policy
}
Everything the gateway needs is in the body.
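To make this concrete, here is a minimal sketch of body-based routing in Python. The model-to-upstream table and the pick_upstream helper are illustrative, not any product's API.

import json

# Hypothetical model → upstream mapping; providers and paths are illustrative.
UPSTREAMS = {
    "gpt-4": "https://api.openai.com/v1/chat/completions",
    "claude-3": "https://api.anthropic.com/v1/messages",
}

def pick_upstream(raw_body: bytes) -> str:
    """The routing decision lives in the body, not the headers."""
    body = json.loads(raw_body)
    model = body["model"]      # ◄ the routing key
    return UPSTREAMS[model]    # KeyError: unknown model → reject

print(pick_upstream(b'{"model": "gpt-4", "messages": []}'))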
Setup

An “AI Gateway” is broadly defined as a gateway that speaks AI protocols.

But in practice, it means an egress gateway that inspects request payloads — not dissimilar from deep packet inspection.
Policy landscape

What an AI Gateway actually does.

Two categories of work.
01 · Observe & meter.
Token-level cost attribution, usage tracking, and anomaly detection across workloads.
02 · Apply policies that…
Allow or deny.
Pass the request through, or reject it. No mutation.
guardrails · AuthN / AuthZ · token rate limits
Respond.
Short-circuit the pipeline and return a response directly.
semantic cache
Mutate.
Rewrite the request before it reaches the upstream.
semantic routing · credential injection
[Diagram: user workload → AI Gateway (observe · meter, then allow/deny · respond · mutate) → backend api.openai.com; on a cache hit the gateway returns the response without contacting the backend.]
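As a minimal sketch of how the three behaviors could share one pipeline contract: each policy returns a verdict and the gateway interprets it. The Verdict type and field names are hypothetical, not any shipping gateway's API.

from dataclasses import dataclass

@dataclass
class Verdict:
    allow: bool = True                  # allow/deny: pass through or reject
    response: bytes | None = None       # respond: short-circuit with this body
    mutated_body: bytes | None = None   # mutate: rewrite before the upstream

def run_pipeline(body: bytes, policies) -> bytes:
    for policy in policies:
        v = policy(body)
        if not v.allow:
            return b'{"error": "rejected by policy"}'
        if v.response is not None:      # e.g. a semantic-cache hit
            return v.response
        if v.mutated_body is not None:  # e.g. PII redaction
            body = v.mutated_body
    return body                         # forward to the upstream

deny_all = lambda body: Verdict(allow=False)
print(run_pipeline(b"{}", [deny_all]))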
Policy landscape

Observe & meter.

Like API Gateways, one of the most important roles for an AI Gateway is as a centralized source of truth.
Track costs
Token usage delta after upgrading from Opus 4.6 to Opus 4.7 — did our spend change?
Detect anomalies
Sudden spike in tool calls — is an agent stuck in a loop, or did behavior genuinely change?
Attribute usage
Per-workload, per-team token spend — chargeback without guessing.
Monitor policy
Guardrail rejection rates — a spike after deployment means a misconfigured policy, not a security event.
[Chart: tokens/day, Mon–Sat, before vs. after the Opus 4.6 → 4.7 model upgrade; y-axis 0–50k.]
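As a toy illustration of attribution, a per-workload token counter keyed by team, workload, and model; the label names are made up.

from collections import defaultdict

usage: dict = defaultdict(int)   # (team, workload, model) → total tokens

def meter(team: str, workload: str, model: str,
          in_tokens: int, out_tokens: int) -> None:
    usage[(team, workload, model)] += in_tokens + out_tokens

meter("payments", "fraud-agent", "opus-4.7", in_tokens=1200, out_tokens=450)
print(dict(usage))   # chargeback without guessing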
Policy landscape

Allow / deny policies.

These should come early in a pipeline — bail out before spending compute on expensive downstream work.
Guardrails
Inspect the prompt and reject it outright if it violates policy — jailbreak attempts, prompt injection, restricted topics.
Cheap relative to inference. Running them first means a bad request never reaches the tokenizer, cache, or model.
Token rate limiting
Count tokens against a budget and reject when exceeded. Sounds simple, but raises real questions (a budget sketch follows the list):
  • Per-model budgets? Token cost varies dramatically by model. A budget that makes sense for a distilled model will bankrupt you on a frontier one.
  • Streaming responses. Tokens arrive in chunks — you’ll often overshoot your limit before you can cut the stream off.
  • Input ≠ output cost. Output tokens are often 3–5× more expensive. A single budget doesn’t capture real spend.
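A budget sketch under those caveats: per-model budgets, with output tokens weighted more heavily than input. The budgets and the 4× multiplier are made-up numbers, not guidance.

BUDGETS = {"distilled-small": 1_000_000, "frontier-large": 50_000}  # tokens/day
OUTPUT_WEIGHT = 4     # output tokens often cost ~3-5x input tokens

spent: dict[str, int] = {}

def charge(model: str, in_tokens: int, out_tokens: int) -> bool:
    """Return True if the request fits the model's remaining budget."""
    # NB: with streaming, output arrives in chunks, so a request can
    # overshoot the budget before the stream is cut off.
    cost = in_tokens + OUTPUT_WEIGHT * out_tokens
    if spent.get(model, 0) + cost > BUDGETS[model]:
        return False   # reject: budget exceeded
    spent[model] = spent.get(model, 0) + cost
    return True

print(charge("frontier-large", in_tokens=2_000, out_tokens=1_000))  # True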
Policy landscape

Direct response policies.

Short-circuit the pipeline and return immediately — never reaching the upstream.
Semantic cache
Uses a vector database to find semantically similar past requests and return cached responses. A cache hit skips inference entirely.
Should come early in the pipeline — arguably before token counting, since a cache hit means zero inference cost. But it must come after guardrails, so we never serve a cached response to a request that should have been rejected.
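A toy sketch of that ordering, with a bag-of-words "embedding" standing in for a real vector database; the guardrail check, similarity threshold, and call_model hook are all illustrative.

def embed(text: str) -> set[str]:
    return set(text.lower().split())   # toy stand-in for a real embedding

def similarity(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

cache: list = []   # (model, embedding, response)

def handle(prompt: str, model: str, call_model) -> str:
    if "ignore previous instructions" in prompt.lower():   # 1. guardrails first:
        return "I'm unable to help with that."             #    never cached, never served
    e = embed(prompt)
    for m, cached_e, resp in cache:                        # 2. cache lookup
        if m == model and similarity(e, cached_e) >= 0.8:  #    model is part of the key
            return resp                                    #    hit: zero inference cost
    resp = call_model(model, prompt)                       # 3. miss: pay for inference
    cache.append((model, e, resp))
    return resp

fake_model = lambda m, p: f"[{m}] response"
print(handle("Summarize our refund policy", "gpt-4", fake_model))  # miss
print(handle("Summarize our refund policy", "gpt-4", fake_model))  # hit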
Guardrails can respond too
A content-policy guardrail doesn’t have to return a bare 403. It could return a canned message — “I’m unable to help with that” — making it a direct-response policy in disguise.
This blurs the line between allow/deny and respond. A single processor can act as both, depending on configuration.
The category a policy belongs to can change based on how it’s configured — which is why we need a pipeline that can express all three behaviors.
Policy landscape

Mutating policies.

These rewrite the request or response. Powerful, but they create tension in the pipeline.
They add information downstream policies need
  • Model selection. Semantic routing picks a model — and that choice triggers model-specific tokenizers, rate limits, and endpoint pools downstream.
  • Protocol translation. Rewrites the payload into the upstream’s wire format. Must happen after we know which upstream we’re talking to.
  • Credential injection. Attaches provider-specific credentials. Must happen last, after model and endpoint are resolved.
They can invalidate policies that already ran
  • PII redaction — a guardrail that mutates instead of rejecting. Redacting tokens changes the token count, so any rate limit that ran before redaction is now wrong.
  • File handle hydration — expanding a reference into full content can dramatically change payload size and token count.
Mutation means ordering isn’t just about efficiency — it’s about correctness. A policy that ran before a mutation may need to run again after it (a re-run loop is sketched below).
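One way to honor that rule is a fixed-point loop: any mutation sends execution back to the top of the pipeline. A sketch, assuming mutations are idempotent so the loop terminates; the policies are illustrative.

def run(body: str, policies: list) -> str:
    """Each policy is (name, fn); fn returns a possibly-mutated body."""
    i = 0
    while i < len(policies):
        _, fn = policies[i]
        new_body = fn(body)
        if new_body != body:   # mutation: earlier verdicts may now be stale
            body = new_body
            i = 0              # restart from the top
        else:
            i += 1
    return body

count = ("token_count", lambda b: b)   # counting never mutates
redact = ("pii_redaction", lambda b: b.replace("123-45-6789", "[REDACTED]"))
print(run("SSN 123-45-6789", [count, redact]))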
Policy landscape

These features already exist.

Every major AI Gateway ships some combination of these capabilities today.
[Matrix: LiteLLM · Portkey · Kong AI · Envoy AI, each rated supported / partial / not available across: AuthN / AuthZ · Guardrails (in + out) · Model routing · Semantic cache · Token rate limiting · Protocol translation · Credential management · KV-aware endpoint picking · Observability & metering · Retry / fallback.]
Composition

When policies read and write the body, composition gets harder.

Data dependencies
One policy’s output is the next one’s input. Get the order wrong and the pipeline silently degrades.
Retry semantics
Policies make expensive callouts. On failure, some must re-run and some must not.
Composition problem

Data dependencies.

These aren’t filters. They’re a pipeline.
Data dependencies  /  A simple example

PII redaction before token rate limiting.

When a mutating policy runs before a counting policy, the count is wrong.
[Diagram: the token rate limit counts 500 tokens and allows the request (500/1000 budget); PII redaction then strips 80 tokens of PII, so 420 tokens remain. The rate limit committed 500 tokens to the budget but only 420 were actually sent; the budget is now wrong.]
The token rate limit ran first and counted 500 tokens against the budget.
Then PII redaction stripped 80 tokens of sensitive content. The request that actually reached the model had only 420 tokens.
The budget is now overcharged by 80 tokens. Over thousands of requests, this drift compounds.
The fix is ordering: count tokens after redaction. But that means the pipeline must know about the dependency.
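A sketch of the fixed ordering, with a regex redactor and a whitespace tokenizer standing in for real implementations:

import re

def redact_pii(text: str) -> str:
    # Strip SSN-shaped tokens outright, as in the example above.
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b,?\s*", "", text)

def count_tokens(text: str) -> int:
    return len(text.split())   # stand-in for the model's real tokenizer

prompt = "My SSN is 123-45-6789, summarize my account history"
wrong = count_tokens(prompt)               # counted before redaction: overcharges
right = count_tokens(redact_pii(prompt))   # counted after: matches what is sent
print(wrong, right)   # 8 7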
Data dependencies  /  The full picture

B’s input is A’s output.

[Diagram: Semantic routing (in: prompt; out: model name) feeds Semantic cache (in: prompt + model; out: hit/miss), where the model is part of the cache key, and Endpoint picker (in: model; out: endpoint address), which uses it for replica selection; both are hard dependencies. Guardrails (in: prompt, + model?; out: approved/rejected) have only a conditional dependency on routing.]
  • Cache needs the model. The cache key includes the model — you can’t look up a result until routing resolves.
  • Endpoint picker needs the model. Given a model, pick a replica with warm KV state.
  • Guardrails: the dependency is conditional. Some guardrails need the model (model-specific policies), some don’t (PII, jailbreak). The arrow is dashed — not every guardrail is downstream of routing. A sketch after this list makes these edges explicit.
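If each processor declares its inputs, the pipeline can order itself. A sketch using Python's graphlib, with the conditional guardrails edge deliberately left out; processor names are illustrative.

from graphlib import TopologicalSorter

deps = {
    "semantic_cache": {"semantic_routing"},    # cache key includes the model
    "endpoint_picker": {"semantic_routing"},   # needs the model to pick a replica
    "guardrails": set(),                       # model-agnostic guardrails only
}

print(list(TopologicalSorter(deps).static_order()))
# e.g. ['semantic_routing', 'guardrails', 'semantic_cache', 'endpoint_picker']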
Composition problem

Retry semantics.

When the connection fails, what do you re-run?
Retry semantics  /  Connection fails — the pipeline already ran

Re-run which processors?

The pipeline runs fully (endpoint chosen, request sent), then the connection fails. Which processors re-run? (Sketched below.)
  • Semantic cache · validate-only · skip. Same result in 200 ms; rerunning wastes a vector search.
  • Guardrails · validate-only · skip. Payload unchanged, same verdict.
  • Semantic routing · validate-only · skip. Prompt unchanged, same model.
  • Endpoint picker · state-dependent · re-run. That endpoint just failed and KV state shifted; we need a fresh pick.
  • Credential injection · state-dependent · re-run only if the token expired. Token-lifetime aware.
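A sketch of encoding those semantics as per-processor tags the retry logic can read; the names and rerun_if values are illustrative.

from dataclasses import dataclass

@dataclass
class Processor:
    name: str
    validate_only: bool      # pure function of the payload: safe to skip on retry
    rerun_if: str = "never"  # "always" | "token_expired" | "never"

PIPELINE = [
    Processor("semantic_cache", validate_only=True),
    Processor("guardrails", validate_only=True),
    Processor("semantic_routing", validate_only=True),
    Processor("endpoint_picker", validate_only=False, rerun_if="always"),
    Processor("credential_injection", validate_only=False, rerun_if="token_expired"),
]

def to_rerun(token_expired: bool) -> list[str]:
    return [p.name for p in PIPELINE
            if p.rerun_if == "always"
            or (p.rerun_if == "token_expired" and token_expired)]

print(to_rerun(token_expired=False))   # ['endpoint_picker']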
Composition problem

Egress gotchas to consider.

Egress

TLS policies are client-oriented.

TLS SNI and the HTTP Host header won’t necessarily match — the assumption that they do holds for ingress but breaks for egress.
When a gateway proxies egress traffic, the workload connects to a cluster-local service name. The gateway must rewrite the connection to the external FQDN.
If the HTTP Host header carries the internal name but TLS SNI carries the external name, the upstream sees two conflicting identities. Depending on the provider, the handshake fails or the request routes to the wrong vhost.
Ingress gateways never hit this — the client already knows the real hostname. Egress gateways must reconcile two identities.
[Diagram: workload (in-cluster) → AI Gateway (egress proxy) → api.openai.com (external). What the upstream sees: HTTP Host: api-openai.svc.cluster.local (wrong) vs. TLS SNI: api.openai.com (right), producing a handshake failure or wrong-vhost routing.]
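A sketch of the reconciliation, assuming the gateway terminates the workload's connection and re-dials the provider by its external FQDN. Python's HTTPSConnection derives both the TLS SNI and the default Host header from the hostname it dials, so using the external name keeps the two identities aligned; the cluster-local name is illustrative.

import http.client

INTERNAL_NAME = "api-openai.svc.cluster.local"   # what the workload dialed
EXTERNAL_FQDN = "api.openai.com"                 # what the upstream expects

def forward(path: str, body: bytes, headers: dict) -> http.client.HTTPResponse:
    headers = dict(headers, Host=EXTERNAL_FQDN)  # rewrite Host to match SNI
    conn = http.client.HTTPSConnection(EXTERNAL_FQDN, timeout=10)  # SNI = external name
    conn.request("POST", path, body=body, headers=headers)
    return conn.getresponse()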
Egress

Policies scoped too broadly leak credentials.

A credential injection policy on the gateway attaches tokens to every outbound request — including ones that shouldn’t have them.
[Diagram: a gateway-scoped "inject OpenAI creds" policy. /chat → OpenAI + creds ✓ · /embed → OpenAI + creds ✓ · /images → DALL-E + creds ✓ · /health → local svc + creds ✗: bearer token leaked to a non-OpenAI endpoint.]
Credential injection at the gateway scope applies to all routes — including routes that don’t target the provider those credentials belong to.
The /health endpoint hits a local service, but it still gets an OpenAI bearer token attached. That token could end up in plaintext logs, error responses, or forwarded to an untrusted upstream.
Egress policies need finer scoping than gateway-wide. Ideally per-destination — so credentials only attach to routes targeting the provider they belong to.
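A sketch of per-destination scoping: credentials attach only when the route's destination host matches the provider they belong to. The hostnames and env-var placeholders are illustrative.

from urllib.parse import urlsplit

CREDENTIALS = {
    "api.openai.com": "Bearer $OPENAI_API_KEY",
    "api.anthropic.com": "Bearer $ANTHROPIC_API_KEY",
}

def inject(url: str, headers: dict) -> dict:
    host = urlsplit(url).hostname
    token = CREDENTIALS.get(host)    # no match: no credential attached
    if token is not None:
        headers = dict(headers, Authorization=token)
    return headers

# The /health route targets a local service, so nothing is injected.
assert "Authorization" not in inject("http://health.svc.cluster.local/healthz", {})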
Putting it together

Writing policies against AI protocols means our gateways need to handle things they didn’t before.

Responses don’t just flow through our system — they change the flow of the system.
Inference responses trigger tool calls, spawn agents, redirect workflows. The policies we write against these requests are stateful and order-dependent.
Policies have side effects.
Guardrails call models. Routing runs inference. Caching queries a vector database. These are network calls with cost, latency, and failure modes of their own.
We may connect to services we don’t control.
When traffic flows out to external providers, TLS semantics flip, credential scoping matters, and failure modes change at the boundary.
Putting it together

The everything-and-the-kitchen-sink inference policy.

All of these policies exist in products today. What would it look like to compose them in a single AI Gateway?
[Diagram: user workload → AI Gateway. Request path: AuthN/AuthZ → input guardrails (jailbreak, PII) → model picking (SR, mapping) → cache lookup → token rate limit (per-model budget) → endpoint picker (KV locality) → protocol translation → credential injection (per-upstream) → in-cluster primary pool, with cloud fallback. Response path: output guardrails (PII, toxicity, leaks) → token rate limit (output tokens) → cache write. Throughout: observability · token metering.]
Close

Recap: why we need an AI Gateway.

An AI Gateway lets us write policies against AI protocols — against the inference request itself.
This implies changes in how we think about policy composition.
Responses don’t just flow through our system — they change the flow of the system. Policies are stateful, order-dependent, and have side effects.
This may imply changes in how we think about the flow of traffic.
When connecting to external providers, TLS semantics, credential scoping, and failure modes all change at the boundary.
Close

Get involved.

We’re working on a common Kubernetes AI Gateway control plane. Please drop in if you’re interested.
Repo github.com/kubernetes-sigs/wg-ai-gateway
Slack #wg-ai-gateway  (Kubernetes)
Meetings Thursdays · 2 pm EST · weekly
Open for input Payload processing · Retry semantics · Backend × multi-cluster