Scale by adding pods. Not by sharding state.
The fastest LLM proxy is the one that doesn't have to think between requests. modelux's hot path stays in memory, writes nothing on the response thread, and keeps every instance interchangeable.
Engineered for 25,000+ requests per second. Per pod.
The proxy hot path is bounded work: parse, route, budget check, log handoff. None of it touches a database; none of it waits on disk. The result is a throughput envelope that scales linearly with concurrency until the host saturates — then you add pods.
A single pod is designed to sustain 25,000+ req/s with p99 under 70ms of internal latency. Beyond that point, horizontal scaling does the rest.
- Stateless — add pods to add capacity
- No database on the hot path
- Log writes are async and bounded
| Concurrency | Req/s | p50 | p99 |
|---|---|---|---|
| 50 | 4,800 | 8ms | 14ms |
| 200 | 14,200 | 11ms | 19ms |
| 500 | 22,800 | 16ms | 32ms |
| 1000 | 25,400 | 28ms | 68ms |
| 2500 | 24,900 | 65ms | 180ms |
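
To make that concrete, here is a minimal sketch of what a bounded hot path can look like. It is illustrative Go, not modelux's published code; the `Engine` type, header names, and field layout are all assumptions. The point is that every step is a memory read, and the only handoff is a non-blocking channel send.

```go
package proxy

import (
	"errors"
	"net/http"
)

type Route struct{ Upstream string }

type AccessRecord struct{ Key, Model string }

type Engine struct {
	routes  map[string]Route  // warmed at startup; read-only on the hot path
	budgets map[string]int64  // in-memory spend counters (synchronization elided)
	logs    chan AccessRecord // buffered; drained by a background flusher
}

func (e *Engine) handle(r *http.Request) (Route, error) {
	// Parse: header reads only; no storage touched.
	key := r.Header.Get("Authorization")
	model := r.Header.Get("X-Model")

	// Route: a pure map lookup against the warmed snapshot.
	route, ok := e.routes[model]
	if !ok {
		return Route{}, errors.New("unknown model")
	}

	// Budget check: an in-memory counter, no database roundtrip.
	if e.budgets[key] <= 0 {
		return Route{}, errors.New("budget exhausted")
	}

	// Log handoff: non-blocking send; the response thread writes nothing.
	select {
	case e.logs <- AccessRecord{Key: key, Model: model}:
	default: // buffer full: shed the log record rather than block
	}
	return route, nil
}
```

At no point does the request goroutine block on anything slower than a channel send, which is what keeps a latency curve like the table above flat until saturation.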
Six choices that make scale a non-event.
Every architectural decision in modelux trades cleverness for boring scalability. The hot path does the minimum. Everything slow happens elsewhere.
Horizontal proxy fleet
Every proxy instance is stateless. No leader election, no sharded ownership, no per-instance config. Add capacity by adding pods. The control plane database is the only piece of stateful infrastructure.
Hot path doesn't hit storage
Auth, rate-limit decision, budget enforcement, and routing all resolve from in-memory state warmed at startup. No per-request database roundtrip — config changes propagate within seconds, not on every call.
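
One common way to get this behavior is an immutable snapshot behind an atomic pointer: the hot path does a single lock-free read, and a background loop swaps in a fresh snapshot when the control plane changes. A sketch of the pattern, with `ConfigCache`, `fetch`, and the field names all assumed rather than taken from modelux:

```go
package proxy

import (
	"sync/atomic"
	"time"
)

type Config struct {
	Routes     map[string]string // model -> upstream
	RateLimits map[string]int    // key -> req/s
}

type ConfigCache struct {
	current atomic.Pointer[Config]
}

// Load is what the hot path calls: one atomic read, no locks, no I/O.
func (c *ConfigCache) Load() *Config { return c.current.Load() }

// Refresh polls the control plane and swaps the snapshot in one step.
// fetch stands in for however the control plane is actually queried.
func (c *ConfigCache) Refresh(fetch func() (*Config, error), every time.Duration) {
	for range time.Tick(every) {
		if cfg, err := fetch(); err == nil {
			c.current.Store(cfg) // readers see old or new, never a mix
		}
	}
}
```

Because the snapshot is replaced wholesale, a request sees either the old config or the new one, never a half-applied change.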
Logs go off the hot path
Decision traces, request logs, and analytics events are buffered in-memory and flushed asynchronously after the response is sent. The slowest analytics query in the world doesn't slow your call.
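
One standard shape for this is a bounded in-memory buffer drained by a background flusher. The sketch below is illustrative (buffer size, batch size, flush interval, and the `flush` callback are assumptions): the request path pays one non-blocking channel send, and overflow is counted instead of waited on.

```go
package proxy

import (
	"sync/atomic"
	"time"
)

type Event struct{ Payload []byte }

// LogPipe decouples the request path from the log sink: Emit is a
// non-blocking channel send; a goroutine batches and flushes behind it.
type LogPipe struct {
	buf     chan Event
	dropped atomic.Uint64 // counted, not blocked on, when the buffer is full
}

func NewLogPipe(size int, flush func([]Event)) *LogPipe {
	p := &LogPipe{buf: make(chan Event, size)}
	go func() {
		batch := make([]Event, 0, 256)
		tick := time.NewTicker(200 * time.Millisecond)
		defer tick.Stop()
		for {
			select {
			case ev := <-p.buf:
				if batch = append(batch, ev); len(batch) == cap(batch) {
					flush(batch)
					batch = batch[:0]
				}
			case <-tick.C: // flush partial batches on a timer
				if len(batch) > 0 {
					flush(batch)
					batch = batch[:0]
				}
			}
		}
	}()
	return p
}

// Emit never blocks: a slow or failing sink costs dropped analytics
// events, never request latency.
func (p *LogPipe) Emit(ev Event) {
	select {
	case p.buf <- ev:
	default:
		p.dropped.Add(1)
	}
}
```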
Rate limits are absorbed, not propagated
When an upstream returns 429, modelux can re-route to a fallback provider, queue the request behind a token bucket, or fail fast — your config decides. By default, app code never sees a provider rate-limit error.
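
The token-bucket option is easy to picture with Go's `golang.org/x/time/rate` package. This is a sketch of the general technique, not modelux's limiter; the rates, the 8-second budget, and `callWithHold` are illustrative.

```go
package proxy

import (
	"context"
	"time"

	"golang.org/x/time/rate"
)

// Illustrative bucket: 50 req/s steady state with bursts of 10.
var upstream = rate.NewLimiter(rate.Limit(50), 10)

// callWithHold queues the request behind the bucket until a token is
// free or the deadline passes, turning a provider's rate-limit window
// into a short wait instead of a 429 surfaced to the application.
func callWithHold(ctx context.Context, do func() error) error {
	ctx, cancel := context.WithTimeout(ctx, 8*time.Second) // total budget
	defer cancel()
	if err := upstream.Wait(ctx); err != nil {
		return err // budget exhausted: fail fast or try a fallback
	}
	return do()
}
```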
Streaming preserves backpressure
SSE chunks are forwarded as they arrive, with the upstream connection's flow control intact. A slow consumer slows the upstream, not the proxy. No head-of-line blocking across requests.
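
In Go, this property mostly falls out of forwarding chunk by chunk: a slow client blocks the write, the blocked write stops the reads, and the stalled reads push back on the upstream connection's own flow control. A sketch under those assumptions (`streamSSE` and the buffer size are illustrative, not modelux's code):

```go
package proxy

import (
	"io"
	"net/http"
)

func streamSSE(w http.ResponseWriter, upstream io.Reader) error {
	w.Header().Set("Content-Type", "text/event-stream")
	flusher, _ := w.(http.Flusher)
	buf := make([]byte, 32*1024)
	for {
		n, err := upstream.Read(buf) // blocks until the provider sends
		if n > 0 {
			// A slow client makes this Write block, which pauses the
			// Read above: backpressure reaches the provider, not the pod.
			if _, werr := w.Write(buf[:n]); werr != nil {
				return werr // client gone: stop reading the upstream
			}
			if flusher != nil {
				flusher.Flush() // push the chunk out immediately
			}
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}
```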
Failure domains are bounded
A degraded provider, a slow analytics flush, or a noisy neighbor can't cascade. Each subsystem has its own circuit breaker, timeout, and bulkhead. One slow thing stays one slow thing.
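
A bulkhead can be as small as a counting semaphore plus a per-call timeout. The sketch below shows the pattern with assumed names (`Bulkhead` and its sizes are not from modelux): a degraded dependency can exhaust its own slots, and nothing else.

```go
package proxy

import (
	"context"
	"errors"
	"time"
)

var ErrRejected = errors.New("bulkhead full")

// Bulkhead caps concurrent calls into one subsystem and bounds how
// long any single call may run.
type Bulkhead struct {
	slots   chan struct{}
	timeout time.Duration
}

func NewBulkhead(size int, timeout time.Duration) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, size), timeout: timeout}
}

func (b *Bulkhead) Do(ctx context.Context, call func(context.Context) error) error {
	select {
	case b.slots <- struct{}{}: // claim a slot
	default:
		return ErrRejected // compartment full: shed load instead of queueing
	}
	defer func() { <-b.slots }()

	ctx, cancel := context.WithTimeout(ctx, b.timeout)
	defer cancel()
	return call(ctx) // a slow call burns one slot, not the whole pod
}
```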
Provider 429s shouldn't reach your code.
When a provider rate-limits, modelux can route the request to a fallback provider, hold it behind a token bucket until the window resets, or fail fast — per your routing config. The retry code in your app stops being load-bearing.
Routing docs →

```json
{
"strategy": "fallback",
"attempts": [
{ "model": "claude-haiku-4-5", "timeout_ms": 2000 },
{ "model": "gpt-4o-mini", "timeout_ms": 3000 },
{ "model": "gemini-2.5-flash", "timeout_ms": 5000 }
],
"retry_on": ["429", "5xx", "timeout"],
"total_budget_ms": 8000
}
```

Built for the next zero on your req/s.
The free tier gets you live in minutes. When you need more headroom, dedicated capacity, or a regional endpoint — we'll have it ready.