modelux
$ modelux scale

Scale by adding pods. Not by sharding state.

The fastest LLM proxy is the one that doesn't have to think between requests. modelux's hot path stays in memory, writes nothing on the response thread, and keeps every instance interchangeable.

  • Throughput per instance: 25k+ req/s — design target per pod, p99 under 70 ms at concurrency 1,000
  • Proxy-internal overhead: < 5 ms — p99 CPU work per request (auth + route + log + budget check); queue/wait under load is on top
  • Rate-limit absorption: built-in — with a fallback chain, provider 429s don't reach your code; modelux re-routes, queues, or fails fast per config
  • Scale model: horizontal — stateless; add pods for capacity, no sticky sessions
# throughput

Engineered for 25,000+ requests per second. Per pod.

The proxy hot path is bounded work: parse, route, budget check, log handoff. None of it touches a database, none of it waits on disk. The result is a throughput envelope that scales linearly with concurrency until the host saturates — then you add pods.

A single pod is designed to sustain 25,000+ req/s with p99 under 70ms of internal latency. Beyond that point, horizontal scaling does the rest.

  • Stateless — add pods to add capacity
  • No database on the hot path
  • Log writes are async and bounded
Designed throughput by concurrency (per pod):

| Concurrency | Req/s  | p50   | p99    |
|------------:|-------:|------:|-------:|
| 50          | 4,800  | 8 ms  | 14 ms  |
| 200         | 14,200 | 11 ms | 19 ms  |
| 500         | 22,800 | 16 ms | 32 ms  |
| 1,000       | 25,400 | 28 ms | 68 ms  |
| 2,500       | 24,900 | 65 ms | 180 ms |
Latencies are proxy-internal only — upstream provider time is on top.
# architecture

Six choices that make scale a non-event.

Every architectural decision in modelux trades cleverness for boring scalability. The hot path does the minimum. Everything slow happens elsewhere.

Horizontal proxy fleet

Every proxy instance is stateless. No leader election, no sharded ownership, no per-instance config. Add capacity by adding pods. The control plane database is the only piece of stateful infrastructure.

Hot path doesn't hit storage

Auth, rate-limit decision, budget enforcement, and routing all resolve from in-memory state warmed at startup. No per-request database roundtrip — config changes propagate within seconds, not on every call.

Logs go off the hot path

Decision traces, request logs, and analytics events are buffered in-memory and flushed asynchronously after the response is sent. The slowest analytics query in the world doesn't slow your call.
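The "buffered in-memory, flushed asynchronously" handoff can be sketched as a bounded queue with a drain thread. This is the generic pattern, not modelux internals; sizes and names are illustrative:

```python
import queue
import threading

# Bounded in-memory buffer: the response thread only enqueues;
# a background thread drains and flushes.
buf: "queue.Queue[dict]" = queue.Queue(maxsize=10_000)

def record(event: dict) -> bool:
    """Hot-path side: O(1) and never blocks.
    Returns False if the line was shed because the buffer is full."""
    try:
        buf.put_nowait(event)
        return True
    except queue.Full:
        return False  # shed the log line, never delay the response

def drain(sink: list, stop: threading.Event) -> None:
    """Off the hot path: keeps draining until asked to stop,
    then empties whatever is left in the buffer."""
    while not stop.is_set() or not buf.empty():
        try:
            sink.append(buf.get(timeout=0.05))
        except queue.Empty:
            continue
```

The bound is the point: a stalled sink costs you log lines, never request latency.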

Rate limits are absorbed, not propagated

When an upstream returns 429, modelux can re-route to a fallback provider, queue the request behind a token bucket, or fail fast — your config decides. By default, app code never sees a provider rate-limit error.
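The "queue the request behind a token bucket" option rests on a standard token-bucket limiter: tokens refill at a fixed rate up to a cap, and a request proceeds only if it can take one. A minimal sketch with illustrative parameters:

```python
import time

class TokenBucket:
    """Classic token bucket: `rate` tokens per second refill up to
    `capacity`. Requests that can't take a token are queued or rejected
    by the caller; the bucket itself never blocks."""

    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self, n: float = 1.0) -> bool:
        """Take `n` tokens if available; returns False without waiting."""
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

Because refill is computed lazily from elapsed time, there is no background timer to run per bucket.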

Streaming preserves backpressure

SSE chunks are forwarded as they arrive, with the upstream connection's flow control intact. A slow consumer slows its own upstream read, not the proxy. No head-of-line blocking across requests.
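In a pull-based sketch, backpressure falls out of reading lazily: the next upstream chunk is pulled only after the previous write to the consumer completes, so a slow consumer naturally throttles the upstream read. A generic illustration of the shape (not modelux's implementation):

```python
def relay(upstream_chunks, write):
    """Forward chunks one at a time. `upstream_chunks` is any iterator
    (e.g. an SSE byte stream read lazily); `write` delivers a chunk to
    the consumer and blocks while the consumer is slow. Because nothing
    is read ahead of the loop, a stalled `write` stalls the pull from
    upstream instead of buffering unboundedly in the proxy."""
    for chunk in upstream_chunks:   # lazy: one chunk in flight at a time
        write(chunk)                # blocks on a slow consumer
```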

Failure domains are bounded

A degraded provider, a slow analytics flush, or a noisy neighbor can't cascade. Each subsystem has its own circuit breaker, timeout, and bulkhead. One slow thing stays one slow thing.
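A per-subsystem circuit breaker of the kind described can be as small as a consecutive-failure counter with a cooldown. This is a hedged sketch of the general technique, not modelux's actual breaker; thresholds are illustrative:

```python
import time

class CircuitBreaker:
    """After `threshold` consecutive failures the circuit opens and calls
    fail fast for `cooldown_s`, bounding the blast radius of a degraded
    dependency. After the cooldown it lets traffic probe again."""

    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown_s
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at, self.failures = None, 0  # probe again
            return True
        return False  # fail fast: don't touch the degraded dependency

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
```

One breaker per provider (or per subsystem) is what keeps "one slow thing" from becoming everyone's slow thing.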

# rate-limit absorption

Provider 429s shouldn't reach your code.

When a provider rate-limits, modelux can route the request to a fallback provider, hold it behind a token bucket until the window resets, or fail fast — per your routing config. Your app's retry code stops being load-bearing.

Routing docs →
fallback-on-429:

```json
{
  "strategy": "fallback",
  "attempts": [
    { "model": "claude-haiku-4-5",   "timeout_ms": 2000 },
    { "model": "gpt-4o-mini",        "timeout_ms": 3000 },
    { "model": "gemini-2.5-flash",   "timeout_ms": 5000 }
  ],
  "retry_on": ["429", "5xx", "timeout"],
  "total_budget_ms": 8000
}
```
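To make the semantics of a config like the one above concrete, here is a sketch of the attempt loop it implies. `send` is a hypothetical stand-in for the upstream call, and the status handling is illustrative, not modelux's implementation:

```python
import time

def route_with_fallback(config: dict, send):
    """Walk `attempts` in order, honoring per-attempt timeouts and the
    overall budget. `send(model, timeout_ms)` returns (status, body),
    where status is an HTTP code or the string "timeout"."""
    deadline = time.monotonic() + config["total_budget_ms"] / 1000
    retry_on = set(config["retry_on"])
    last = (None, None)
    for attempt in config["attempts"]:
        remaining_ms = (deadline - time.monotonic()) * 1000
        if remaining_ms <= 0:
            break  # total budget spent: stop trying, fail fast
        status, body = send(attempt["model"],
                            min(attempt["timeout_ms"], remaining_ms))
        last = (status, body)
        label = status if isinstance(status, str) else (
            "5xx" if 500 <= status < 600 else str(status))
        if label not in retry_on:
            return status, body  # success, or an error we don't retry
    return last  # chain exhausted: surface the last failure
```

Note that each attempt's timeout is clipped to the remaining total budget, so the chain can never overshoot `total_budget_ms`.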

Built for the next zero on your req/s.

The free tier gets you live in minutes. When you need more headroom, dedicated capacity, or a regional endpoint — we'll have it ready.