Scale by adding pods. Not by sharding state.
The fastest LLM proxy is the one that doesn't have to think between requests. modelux's hot path stays in memory, writes nothing on the response thread, and keeps every instance interchangeable.
Engineered for 25,000+ requests per second. Per pod.
The proxy hot path is bounded work: parse, route, budget check, log handoff. None of it touches a database; none of it waits on disk. The result is a throughput envelope that scales linearly with concurrency until the host saturates — then you add pods.
A single pod is designed to sustain 25,000+ req/s with p99 under 70ms of internal latency. Beyond that point, horizontal scaling does the rest.
- Stateless — add pods to add capacity
- No database on the hot path
- Log writes are async and bounded
| Concurrency | Req/s | p50 | p99 |
|---|---|---|---|
| 50 | 4,800 | 8ms | 14ms |
| 200 | 14,200 | 11ms | 19ms |
| 500 | 22,800 | 16ms | 32ms |
| 1000 | 25,400 | 28ms | 68ms |
| 2500 | 24,900 | 65ms | 180ms |
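
To make that concrete, here is a minimal sketch of what a bounded hot path can look like. It is illustrative Go, not modelux's published code; the `Engine` type, header names, and field layout are all assumptions. The point is that every step is a memory read, and the only handoff is a non-blocking channel send.

```go
package proxy

import (
	"errors"
	"net/http"
)

type Route struct{ Upstream string }

type AccessRecord struct{ Key, Model string }

type Engine struct {
	routes  map[string]Route  // warmed at startup; read-only on the hot path
	budgets map[string]int64  // in-memory spend counters (synchronization elided)
	logs    chan AccessRecord // buffered; drained by a background flusher
}

func (e *Engine) handle(r *http.Request) (Route, error) {
	// Parse: header reads only; no storage touched.
	key := r.Header.Get("Authorization")
	model := r.Header.Get("X-Model")

	// Route: a pure map lookup against the warmed snapshot.
	route, ok := e.routes[model]
	if !ok {
		return Route{}, errors.New("unknown model")
	}

	// Budget check: an in-memory counter, no database roundtrip.
	if e.budgets[key] <= 0 {
		return Route{}, errors.New("budget exhausted")
	}

	// Log handoff: non-blocking send; the response thread writes nothing.
	select {
	case e.logs <- AccessRecord{Key: key, Model: model}:
	default: // buffer full: shed the log record rather than block
	}
	return route, nil
}
```

At no point does the request goroutine block on anything slower than a channel send, which is what keeps a latency curve like the table above flat until saturation.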
Six choices that make scale a non-event.
Every architectural decision in modelux trades cleverness for boring scalability. The hot path does the minimum. Everything slow happens elsewhere.
Horizontal proxy fleet
Every proxy instance is stateless. No leader election, no sharded ownership, no per-instance config. Add capacity by adding pods. The control plane database is the only piece of stateful infrastructure.
Hot path doesn't hit storage
Auth, rate-limit decision, budget enforcement, and routing all resolve from in-memory state warmed at startup. No per-request database roundtrip — config changes propagate within seconds, not on every call.
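
One common way to get this behavior is an immutable snapshot behind an atomic pointer: the hot path does a single lock-free read, and a background loop swaps in a fresh snapshot when the control plane changes. A sketch of the pattern, with `ConfigCache`, `fetch`, and the field names all assumed rather than taken from modelux:

```go
package proxy

import (
	"sync/atomic"
	"time"
)

type Config struct {
	Routes     map[string]string // model -> upstream
	RateLimits map[string]int    // key -> req/s
}

type ConfigCache struct {
	current atomic.Pointer[Config]
}

// Load is what the hot path calls: one atomic read, no locks, no I/O.
func (c *ConfigCache) Load() *Config { return c.current.Load() }

// Refresh polls the control plane and swaps the snapshot in one step.
// fetch stands in for however the control plane is actually queried.
func (c *ConfigCache) Refresh(fetch func() (*Config, error), every time.Duration) {
	for range time.Tick(every) {
		if cfg, err := fetch(); err == nil {
			c.current.Store(cfg) // readers see old or new, never a mix
		}
	}
}
```

Because the snapshot is replaced wholesale, a request sees either the old config or the new one, never a half-applied change.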
Logs go off the hot path
Decision traces, request logs, and analytics events are buffered in-memory and flushed asynchronously after the response is sent. The slowest analytics query in the world doesn't slow your call.
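
One standard shape for this is a bounded in-memory buffer drained by a background flusher. The sketch below is illustrative (buffer size, batch size, flush interval, and the `flush` callback are assumptions): the request path pays one non-blocking channel send, and overflow is counted instead of waited on.

```go
package proxy

import (
	"sync/atomic"
	"time"
)

type Event struct{ Payload []byte }

// LogPipe decouples the request path from the log sink: Emit is a
// non-blocking channel send; a goroutine batches and flushes behind it.
type LogPipe struct {
	buf     chan Event
	dropped atomic.Uint64 // counted, not blocked on, when the buffer is full
}

func NewLogPipe(size int, flush func([]Event)) *LogPipe {
	p := &LogPipe{buf: make(chan Event, size)}
	go func() {
		batch := make([]Event, 0, 256)
		tick := time.NewTicker(200 * time.Millisecond)
		defer tick.Stop()
		for {
			select {
			case ev := <-p.buf:
				if batch = append(batch, ev); len(batch) == cap(batch) {
					flush(batch)
					batch = batch[:0]
				}
			case <-tick.C: // flush partial batches on a timer
				if len(batch) > 0 {
					flush(batch)
					batch = batch[:0]
				}
			}
		}
	}()
	return p
}

// Emit never blocks: a slow or failing sink costs dropped analytics
// events, never request latency.
func (p *LogPipe) Emit(ev Event) {
	select {
	case p.buf <- ev:
	default:
		p.dropped.Add(1)
	}
}
```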
Rate limits are absorbed, not propagated
When an upstream returns 429, modelux can re-route to a fallback provider, queue the request behind a token bucket, or fail fast — your config decides. By default, app code never sees a provider rate-limit error.
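
The token-bucket option is easy to picture with Go's `golang.org/x/time/rate` package. This is a sketch of the general technique, not modelux's limiter; the rates, the 8-second budget, and `callWithHold` are illustrative.

```go
package proxy

import (
	"context"
	"time"

	"golang.org/x/time/rate"
)

// Illustrative bucket: 50 req/s steady state with bursts of 10.
var upstream = rate.NewLimiter(rate.Limit(50), 10)

// callWithHold queues the request behind the bucket until a token is
// free or the deadline passes, turning a provider's rate-limit window
// into a short wait instead of a 429 surfaced to the application.
func callWithHold(ctx context.Context, do func() error) error {
	ctx, cancel := context.WithTimeout(ctx, 8*time.Second) // total budget
	defer cancel()
	if err := upstream.Wait(ctx); err != nil {
		return err // budget exhausted: fail fast or try a fallback
	}
	return do()
}
```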
Streaming preserves backpressure
SSE chunks are forwarded as they arrive, with the upstream connection's flow control intact. A slow consumer slows the upstream, not the proxy. No head-of-line blocking across requests.
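
In Go, this property mostly falls out of forwarding chunk by chunk: a slow client blocks the write, the blocked write stops the reads, and the stalled reads push back on the upstream connection's own flow control. A sketch under those assumptions (`streamSSE` and the buffer size are illustrative, not modelux's code):

```go
package proxy

import (
	"io"
	"net/http"
)

func streamSSE(w http.ResponseWriter, upstream io.Reader) error {
	w.Header().Set("Content-Type", "text/event-stream")
	flusher, _ := w.(http.Flusher)
	buf := make([]byte, 32*1024)
	for {
		n, err := upstream.Read(buf) // blocks until the provider sends
		if n > 0 {
			// A slow client makes this Write block, which pauses the
			// Read above: backpressure reaches the provider, not the pod.
			if _, werr := w.Write(buf[:n]); werr != nil {
				return werr // client gone: stop reading the upstream
			}
			if flusher != nil {
				flusher.Flush() // push the chunk out immediately
			}
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}
```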
Failure domains are bounded
A degraded provider, a slow analytics flush, or a noisy neighbor can't cascade. Each subsystem has its own circuit breaker, timeout, and bulkhead. One slow thing stays one slow thing.
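
A bulkhead can be as small as a counting semaphore plus a per-call timeout. The sketch below shows the pattern with assumed names (`Bulkhead` and its sizes are not from modelux): a degraded dependency can exhaust its own slots, and nothing else.

```go
package proxy

import (
	"context"
	"errors"
	"time"
)

var ErrRejected = errors.New("bulkhead full")

// Bulkhead caps concurrent calls into one subsystem and bounds how
// long any single call may run.
type Bulkhead struct {
	slots   chan struct{}
	timeout time.Duration
}

func NewBulkhead(size int, timeout time.Duration) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, size), timeout: timeout}
}

func (b *Bulkhead) Do(ctx context.Context, call func(context.Context) error) error {
	select {
	case b.slots <- struct{}{}: // claim a slot
	default:
		return ErrRejected // compartment full: shed load instead of queueing
	}
	defer func() { <-b.slots }()

	ctx, cancel := context.WithTimeout(ctx, b.timeout)
	defer cancel()
	return call(ctx) // a slow call burns one slot, not the whole pod
}
```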
Provider 429s shouldn't reach your code.
When a provider rate-limits, modelux can route the request to a fallback provider, hold it behind a token bucket until the window resets, or fail fast — per your routing config. The retry code in your app stops being load-bearing.
Routing docs →

```json
{
"strategy": "fallback",
"attempts": [
{ "model": "claude-haiku-4-5", "timeout_ms": 2000 },
{ "model": "gpt-4o-mini", "timeout_ms": 3000 },
{ "model": "gemini-2.5-flash", "timeout_ms": 5000 }
],
"retry_on": ["429", "5xx", "timeout"],
"total_budget_ms": 8000
}
```

Built for the next zero on your req/s.
The free tier gets you live in minutes. When you need more headroom, dedicated capacity, or a regional endpoint — we'll have it ready.