Cloud Summit Vancouver 2026

The Illustrated Primer to GenAI Networking.
Why we need AI Gateways when we already had API Gateways.

Morgan Foster  ·  Co-Chair, WG AI Gateway  ·  25 min / 24 slides
Setup

Didn’t we already have API Gateways?

"AI Gateway" is an overloaded term.
What’s actually different — and why can’t we just use what we have?
api-gateway.yaml?
Setup

Two postures.

API Gateways apply policy to traffic entering a cluster. AI Gateways apply policy to traffic leaving it.
[Diagram: external clients → API Gateway (ingress) → in-cluster workloads (web-app, api-svc, agent) → AI Gateway (egress) → inference providers (OpenAI, Anthropic, self-hosted). Ingress: policy for traffic we receive, where we control both sides. Egress: policy for traffic we send, where we don't control the other side.]
Setup

The body is where the signal is.

Traditional gateways route on headers. AI inference moves every routing, security, and caching decision into the payload.
Traditional API request
# headers carry the decision
Host: api.example.com            ◄ route
Authorization: Bearer ...        ◄ auth
Content-Type: application/json   ◄ negotiate

# body is opaque payload
{ "user_id": 42, "action": "get_balance" }
Everything the gateway needs is in the headers.
AI inference request
# headers are decorative
Host: api.openai.com
Authorization: Bearer ...

# body carries every decision
{
  "model": "gpt-4",        ◄ route
  "messages": [
    { "role": "user",      ◄ guardrails
      "content": "..." }   ◄ cache key
  ],
  "temperature": 0.7       ◄ policy
}
Everything the gateway needs is in the body.
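To make this concrete, here is a minimal sketch of body-based routing in Python. The model-to-upstream table and the pick_upstream helper are illustrative, not any product's API.

import json

# Hypothetical model → upstream mapping; providers and paths are illustrative.
UPSTREAMS = {
    "gpt-4": "https://api.openai.com/v1/chat/completions",
    "claude-3": "https://api.anthropic.com/v1/messages",
}

def pick_upstream(raw_body: bytes) -> str:
    """The routing decision lives in the body, not the headers."""
    body = json.loads(raw_body)
    model = body["model"]      # ◄ the routing key
    return UPSTREAMS[model]    # KeyError: unknown model → reject

print(pick_upstream(b'{"model": "gpt-4", "messages": []}'))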
Setup

An “AI Gateway” is broadly defined as a gateway that speaks AI protocols.

But in practice, it means an egress gateway that inspects request payloads — not dissimilar from deep packet inspection.
Policy landscape

What an AI Gateway actually does.

Two categories of work.
01 · Observe & meter.
Token-level cost attribution, usage tracking, and anomaly detection across workloads.
02 · Apply policies that…
Allow or deny.
Pass the request through, or reject it. No mutation.
guardrails · AuthN / AuthZ · token rate limits
Respond.
Short-circuit the pipeline and return a response directly.
semantic cache
Mutate.
Rewrite the request before it reaches the upstream.
semantic routing · credential injection
[Diagram: user workload → AI Gateway (observe · meter, then allow/deny · respond · mutate) → backend api.openai.com; on a cache hit the gateway returns the response without contacting the backend.]
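As a minimal sketch of how the three behaviors could share one pipeline contract: each policy returns a verdict and the gateway interprets it. The Verdict type and field names are hypothetical, not any shipping gateway's API.

from dataclasses import dataclass

@dataclass
class Verdict:
    allow: bool = True                  # allow/deny: pass through or reject
    response: bytes | None = None       # respond: short-circuit with this body
    mutated_body: bytes | None = None   # mutate: rewrite before the upstream

def run_pipeline(body: bytes, policies) -> bytes:
    for policy in policies:
        v = policy(body)
        if not v.allow:
            return b'{"error": "rejected by policy"}'
        if v.response is not None:      # e.g. a semantic-cache hit
            return v.response
        if v.mutated_body is not None:  # e.g. PII redaction
            body = v.mutated_body
    return body                         # forward to the upstream

deny_all = lambda body: Verdict(allow=False)
print(run_pipeline(b"{}", [deny_all]))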
Policy landscape

Observe & meter.

Like API Gateways, one of the most important roles for an AI Gateway is as a centralized source of truth.
Track costs
Token usage delta after upgrading from Opus 4.6 to Opus 4.7 — did our spend change?
Detect anomalies
Sudden spike in tool calls — is an agent stuck in a loop, or did behavior genuinely change?
Attribute usage
Per-workload, per-team token spend — chargeback without guessing.
Monitor policy
Guardrail rejection rates — a spike after deployment means a misconfigured policy, not a security event.
[Chart: tokens/day, Mon–Sat, before vs. after the Opus 4.6 → 4.7 model upgrade; y-axis 0–50k.]
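As a toy illustration of attribution, a per-workload token counter keyed by team, workload, and model; the label names are made up.

from collections import defaultdict

usage: dict = defaultdict(int)   # (team, workload, model) → total tokens

def meter(team: str, workload: str, model: str,
          in_tokens: int, out_tokens: int) -> None:
    usage[(team, workload, model)] += in_tokens + out_tokens

meter("payments", "fraud-agent", "opus-4.7", in_tokens=1200, out_tokens=450)
print(dict(usage))   # chargeback without guessing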
Policy landscape

Allow / deny policies.

These should come early in a pipeline — bail out before spending compute on expensive downstream work.
Guardrails
Inspect the prompt and reject it outright if it violates policy — jailbreak attempts, prompt injection, restricted topics.
Cheap relative to inference. Running them first means a bad request never reaches the tokenizer, cache, or model.
Token rate limiting
Count tokens against a budget and reject when exceeded. Sounds simple, but raises real questions (a budget sketch follows the list):
  • Per-model budgets? Token cost varies dramatically by model. A budget that makes sense for a distilled model will bankrupt you on a frontier one.
  • Streaming responses. Tokens arrive in chunks — you’ll often overshoot your limit before you can cut the stream off.
  • Input ≠ output cost. Output tokens are often 3–5× more expensive. A single budget doesn’t capture real spend.
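A budget sketch under those caveats: per-model budgets, with output tokens weighted more heavily than input. The budgets and the 4× multiplier are made-up numbers, not guidance.

BUDGETS = {"distilled-small": 1_000_000, "frontier-large": 50_000}  # tokens/day
OUTPUT_WEIGHT = 4     # output tokens often cost ~3-5x input tokens

spent: dict[str, int] = {}

def charge(model: str, in_tokens: int, out_tokens: int) -> bool:
    """Return True if the request fits the model's remaining budget."""
    # NB: with streaming, output arrives in chunks, so a request can
    # overshoot the budget before the stream is cut off.
    cost = in_tokens + OUTPUT_WEIGHT * out_tokens
    if spent.get(model, 0) + cost > BUDGETS[model]:
        return False   # reject: budget exceeded
    spent[model] = spent.get(model, 0) + cost
    return True

print(charge("frontier-large", in_tokens=2_000, out_tokens=1_000))  # True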
Policy landscape

Direct response policies.

Short-circuit the pipeline and return immediately — never reaching the upstream.
Semantic cache
Uses a vector database to find semantically similar past requests and return cached responses. A cache hit skips inference entirely.
Should come early in the pipeline — arguably before token counting, since a cache hit means zero inference cost. But it must come after guardrails, so we never serve a cached response to a request that should have been rejected.
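A toy sketch of that ordering, with a bag-of-words "embedding" standing in for a real vector database; the guardrail check, similarity threshold, and call_model hook are all illustrative.

def embed(text: str) -> set[str]:
    return set(text.lower().split())   # toy stand-in for a real embedding

def similarity(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

cache: list = []   # (model, embedding, response)

def handle(prompt: str, model: str, call_model) -> str:
    if "ignore previous instructions" in prompt.lower():   # 1. guardrails first:
        return "I'm unable to help with that."             #    never cached, never served
    e = embed(prompt)
    for m, cached_e, resp in cache:                        # 2. cache lookup
        if m == model and similarity(e, cached_e) >= 0.8:  #    model is part of the key
            return resp                                    #    hit: zero inference cost
    resp = call_model(model, prompt)                       # 3. miss: pay for inference
    cache.append((model, e, resp))
    return resp

fake_model = lambda m, p: f"[{m}] response"
print(handle("Summarize our refund policy", "gpt-4", fake_model))  # miss
print(handle("Summarize our refund policy", "gpt-4", fake_model))  # hit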
Guardrails can respond too
A content-policy guardrail doesn’t have to return a bare 403. It could return a canned message — “I’m unable to help with that” — making it a direct-response policy in disguise.
This blurs the line between allow/deny and respond. A single processor can act as both, depending on configuration.
The category a policy belongs to can change based on how it’s configured — which is why we need a pipeline that can express all three behaviors.
Policy landscape

Mutating policies.

These rewrite the request or response. Powerful, but they create tension in the pipeline.
They add information downstream policies need
  • Model selection. Semantic routing picks a model — and that choice triggers model-specific tokenizers, rate limits, and endpoint pools downstream.
  • Protocol translation. Rewrites the payload into the upstream’s wire format. Must happen after we know which upstream we’re talking to.
  • Credential injection. Attaches provider-specific credentials. Must happen last, after model and endpoint are resolved.
They can invalidate policies that already ran
  • PII redaction — a guardrail that mutates instead of rejecting. Redacting tokens changes the token count, so any rate limit that ran before redaction is now wrong.
  • File handle hydration — expanding a reference into full content can dramatically change payload size and token count.
Mutation means ordering isn’t just about efficiency — it’s about correctness. A policy that ran before a mutation may need to run again after it (a re-run loop is sketched below).
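One way to honor that rule is a fixed-point loop: any mutation sends execution back to the top of the pipeline. A sketch, assuming mutations are idempotent so the loop terminates; the policies are illustrative.

def run(body: str, policies: list) -> str:
    """Each policy is (name, fn); fn returns a possibly-mutated body."""
    i = 0
    while i < len(policies):
        _, fn = policies[i]
        new_body = fn(body)
        if new_body != body:   # mutation: earlier verdicts may now be stale
            body = new_body
            i = 0              # restart from the top
        else:
            i += 1
    return body

count = ("token_count", lambda b: b)   # counting never mutates
redact = ("pii_redaction", lambda b: b.replace("123-45-6789", "[REDACTED]"))
print(run("SSN 123-45-6789", [count, redact]))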
Policy landscape

These features already exist.

Every major AI Gateway ships some combination of these capabilities today.
[Matrix: LiteLLM · Portkey · Kong AI · Envoy AI, each rated supported / partial / not available across: AuthN / AuthZ · Guardrails (in + out) · Model routing · Semantic cache · Token rate limiting · Protocol translation · Credential management · KV-aware endpoint picking · Observability & metering · Retry / fallback.]
Composition

When policies read and write the body, composition gets harder.

Data dependencies
One policy’s output is the next one’s input. Get the order wrong and the pipeline silently degrades.
Retry semantics
Policies make expensive callouts. On failure, some must re-run and some must not.
Composition problem

Data dependencies.

These aren’t filters. They’re a pipeline.
Data dependencies  /  A simple example

PII redaction before token rate limiting.

When a mutating policy runs before a counting policy, the count is wrong.
[Diagram: the token rate limit counts 500 tokens and allows the request (500/1000 budget); PII redaction then strips 80 tokens of PII, so 420 tokens remain. The rate limit committed 500 tokens to the budget but only 420 were actually sent; the budget is now wrong.]
The token rate limit ran first and counted 500 tokens against the budget.
Then PII redaction stripped 80 tokens of sensitive content. The request that actually reached the model had only 420 tokens.
The budget is now overcharged by 80 tokens. Over thousands of requests, this drift compounds.
The fix is ordering: count tokens after redaction. But that means the pipeline must know about the dependency.
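A sketch of the fixed ordering, with a regex redactor and a whitespace tokenizer standing in for real implementations:

import re

def redact_pii(text: str) -> str:
    # Strip SSN-shaped tokens outright, as in the example above.
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b,?\s*", "", text)

def count_tokens(text: str) -> int:
    return len(text.split())   # stand-in for the model's real tokenizer

prompt = "My SSN is 123-45-6789, summarize my account history"
wrong = count_tokens(prompt)               # counted before redaction: overcharges
right = count_tokens(redact_pii(prompt))   # counted after: matches what is sent
print(wrong, right)   # 8 7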
Data dependencies  /  The full picture

B’s input is A’s output.

[Diagram: Semantic routing (in: prompt; out: model name) feeds Semantic cache (in: prompt + model; out: hit/miss), where the model is part of the cache key, and Endpoint picker (in: model; out: endpoint address), which uses it for replica selection; both are hard dependencies. Guardrails (in: prompt, + model?; out: approved/rejected) have only a conditional dependency on routing.]
  • Cache needs the model. The cache key includes the model — you can’t look up a result until routing resolves.
  • Endpoint picker needs the model. Given a model, pick a replica with warm KV state.
  • Guardrails: the dependency is conditional. Some guardrails need the model (model-specific policies), some don’t (PII, jailbreak). The arrow is dashed — not every guardrail is downstream of routing. A sketch after this list makes these edges explicit.
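If each processor declares its inputs, the pipeline can order itself. A sketch using Python's graphlib, with the conditional guardrails edge deliberately left out; processor names are illustrative.

from graphlib import TopologicalSorter

deps = {
    "semantic_cache": {"semantic_routing"},    # cache key includes the model
    "endpoint_picker": {"semantic_routing"},   # needs the model to pick a replica
    "guardrails": set(),                       # model-agnostic guardrails only
}

print(list(TopologicalSorter(deps).static_order()))
# e.g. ['semantic_routing', 'guardrails', 'semantic_cache', 'endpoint_picker']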
Composition problem

Retry semantics.

When the connection fails, what do you re-run?
Retry semantics  /  Connection fails — the pipeline already ran

Re-run which processors?

The pipeline runs fully (endpoint chosen, request sent), then the connection fails. Which processors re-run? (Sketched below.)
  • Semantic cache · validate-only · skip. Same result in 200 ms; rerunning wastes a vector search.
  • Guardrails · validate-only · skip. Payload unchanged, same verdict.
  • Semantic routing · validate-only · skip. Prompt unchanged, same model.
  • Endpoint picker · state-dependent · re-run. That endpoint just failed and KV state shifted; we need a fresh pick.
  • Credential injection · state-dependent · re-run only if the token expired. Token-lifetime aware.
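A sketch of encoding those semantics as per-processor tags the retry logic can read; the names and rerun_if values are illustrative.

from dataclasses import dataclass

@dataclass
class Processor:
    name: str
    validate_only: bool      # pure function of the payload: safe to skip on retry
    rerun_if: str = "never"  # "always" | "token_expired" | "never"

PIPELINE = [
    Processor("semantic_cache", validate_only=True),
    Processor("guardrails", validate_only=True),
    Processor("semantic_routing", validate_only=True),
    Processor("endpoint_picker", validate_only=False, rerun_if="always"),
    Processor("credential_injection", validate_only=False, rerun_if="token_expired"),
]

def to_rerun(token_expired: bool) -> list[str]:
    return [p.name for p in PIPELINE
            if p.rerun_if == "always"
            or (p.rerun_if == "token_expired" and token_expired)]

print(to_rerun(token_expired=False))   # ['endpoint_picker']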
Composition problem

Egress gotchas to consider.

Egress

TLS policies are client-oriented.

TLS SNI and the HTTP Host header won’t necessarily match — the assumption that they do holds for ingress but breaks for egress.
When a gateway proxies egress traffic, the workload connects to a cluster-local service name. The gateway must rewrite the connection to the external FQDN.
If the HTTP Host header carries the internal name but TLS SNI carries the external name, the upstream sees two conflicting identities. Depending on the provider, the handshake fails or the request routes to the wrong vhost.
Ingress gateways never hit this — the client already knows the real hostname. Egress gateways must reconcile two identities.
[Diagram: workload (in-cluster) → AI Gateway (egress proxy) → api.openai.com (external). What the upstream sees: HTTP Host: api-openai.svc.cluster.local (wrong) vs. TLS SNI: api.openai.com (right), producing a handshake failure or wrong-vhost routing.]
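A sketch of the reconciliation, assuming the gateway terminates the workload's connection and re-dials the provider by its external FQDN. Python's HTTPSConnection derives both the TLS SNI and the default Host header from the hostname it dials, so using the external name keeps the two identities aligned; the cluster-local name is illustrative.

import http.client

INTERNAL_NAME = "api-openai.svc.cluster.local"   # what the workload dialed
EXTERNAL_FQDN = "api.openai.com"                 # what the upstream expects

def forward(path: str, body: bytes, headers: dict) -> http.client.HTTPResponse:
    headers = dict(headers, Host=EXTERNAL_FQDN)  # rewrite Host to match SNI
    conn = http.client.HTTPSConnection(EXTERNAL_FQDN, timeout=10)  # SNI = external name
    conn.request("POST", path, body=body, headers=headers)
    return conn.getresponse()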
Egress

Policies scoped too broadly leak credentials.

A credential injection policy on the gateway attaches tokens to every outbound request — including ones that shouldn’t have them.
[Diagram: a gateway-scoped "inject OpenAI creds" policy. /chat → OpenAI + creds ✓ · /embed → OpenAI + creds ✓ · /images → DALL-E + creds ✓ · /health → local svc + creds ✗: bearer token leaked to a non-OpenAI endpoint.]
Credential injection at the gateway scope applies to all routes — including routes that don’t target the provider those credentials belong to.
The /health endpoint hits a local service, but it still gets an OpenAI bearer token attached. That token could end up in plaintext logs, error responses, or forwarded to an untrusted upstream.
Egress policies need finer scoping than gateway-wide. Ideally per-destination — so credentials only attach to routes targeting the provider they belong to.
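A sketch of per-destination scoping: credentials attach only when the route's destination host matches the provider they belong to. The hostnames and env-var placeholders are illustrative.

from urllib.parse import urlsplit

CREDENTIALS = {
    "api.openai.com": "Bearer $OPENAI_API_KEY",
    "api.anthropic.com": "Bearer $ANTHROPIC_API_KEY",
}

def inject(url: str, headers: dict) -> dict:
    host = urlsplit(url).hostname
    token = CREDENTIALS.get(host)    # no match: no credential attached
    if token is not None:
        headers = dict(headers, Authorization=token)
    return headers

# The /health route targets a local service, so nothing is injected.
assert "Authorization" not in inject("http://health.svc.cluster.local/healthz", {})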
Putting it together

Writing policies against AI protocols means our gateways need to handle things they didn’t before.

Responses don’t just flow through our system — they change the flow of the system.
Inference responses trigger tool calls, spawn agents, redirect workflows. The policies we write against these requests are stateful and order-dependent.
Policies have side effects.
Guardrails call models. Routing runs inference. Caching queries a vector database. These are network calls with cost, latency, and failure modes of their own.
We may connect to services we don’t control.
When traffic flows out to external providers, TLS semantics flip, credential scoping matters, and failure modes change at the boundary.
Putting it together

The everything-and-the-kitchen-sink inference policy.

All of these policies exist in products today. What would it look like to compose them in a single AI Gateway?
[Diagram: user workload → AI Gateway. Request path: AuthN/AuthZ → input guardrails (jailbreak, PII) → model picking (SR, mapping) → cache lookup → token rate limit (per-model budget) → endpoint picker (KV locality) → protocol translation → credential injection (per-upstream) → in-cluster primary pool, with cloud fallback. Response path: output guardrails (PII, toxicity, leaks) → token rate limit (output tokens) → cache write. Throughout: observability · token metering.]
Close

Recap: why we need an AI Gateway.

An AI Gateway lets us write policies against AI protocols — against the inference request itself.
This implies changes in how we think about policy composition.
Responses don’t just flow through our system — they change the flow of the system. Policies are stateful, order-dependent, and have side effects.
This may imply changes in how we think about the flow of traffic.
When connecting to external providers, TLS semantics, credential scoping, and failure modes all change at the boundary.
Close

Get involved.

We’re working on a common Kubernetes AI Gateway control plane. Please drop in if you’re interested.
Repo github.com/kubernetes-sigs/wg-ai-gateway
Slack #wg-ai-gateway  (Kubernetes)
Meetings Thursdays · 2 pm EST · weekly
Open for input Payload processing · Retry semantics · Backend × multi-cluster