Frontier accuracy. Half the latency. Zero added overhead.
modelux sits on the path of every LLM request your team sends. Two things have to be true: the proxy can't add latency you'd notice, and the routing has to make the underlying calls faster — not slower. This page shows the numbers behind both.
The proxy is invisible at p50.
modelux's internal processing — auth, routing decision, budget check, decision-trace write — is bounded at under 5ms of CPU time at p99. Across the wire, that overhead routinely disappears: a warm connection pool to the upstream provider saves more time than the proxy spends.
Numbers below come from a 500-sample benchmark of OpenAI gpt-4o-mini, non-streaming, measured against the same prompt sent direct to OpenAI from the same client.
- ▸ p50 through modelux: 380ms (vs 410ms direct — pool wins ~30ms)
- ▸ p99 through modelux: 800ms (vs 820ms direct — proxy adds ~zero at the tail)
- ▸ Internal overhead, measured separately: < 5ms p99
Faster first token — especially at the tail.
modelux holds a warm connection pool to every upstream provider. For streaming requests, that means the SSE handshake completes before your direct call would even finish TLS. The effect is biggest where it hurts most: the long tail.
Same prompt, same model, same client — TTFT through modelux is ~17% faster at p50 and ~22% faster at p99 than calling OpenAI direct from the same machine.
- ▸ No buffering — chunks forward the moment they arrive
- ▸ Connection pool kept warm across provider regions
- ▸ SSE passthrough preserves event boundaries exactly (illustrated below)
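To make "preserves event boundaries exactly" concrete, here is roughly what a forwarded stream looks like when the upstream is OpenAI: each SSE event is relayed as its own data: line with the JSON payload untouched. The chunks below are illustrative and trimmed to the fields that matter, not captured output.

data: {"object":"chat.completion.chunk","model":"gpt-4o-mini","choices":[{"index":0,"delta":{"content":"Hel"},"finish_reason":null}]}

data: {"object":"chat.completion.chunk","model":"gpt-4o-mini","choices":[{"index":0,"delta":{"content":"lo"},"finish_reason":null}]}

data: [DONE]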
Match frontier accuracy. 2.5× faster.
Small models running in parallel finish in max(per-model latency) — not the sum. An ensemble of gpt-4.1-mini + claude-haiku with confidence routing ties Claude Sonnet on accuracy and beats GPT-5 — at a fraction of the latency of either.
From the modelux ensembles benchmark: 150 real cases pulled from MMLU, GSM8K, and TriviaQA. Same prompts, same scoring.
- ▸ 74% accuracy — matches Claude Sonnet, beats GPT-5 (73%)
- ▸ 1.5× faster than Sonnet, 2.5× faster than GPT-5
- ▸ Ensembles are a one-line config in modelux routing (sketched below)
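As a rough sketch of what that config could look like, assuming an "ensemble" strategy with a confidence cutoff alongside the candidate shape used in the routing example further down; the strategy and min_confidence names are illustrative assumptions, not the documented schema:

{
  "strategy": "ensemble",
  "candidates": [
    { "model": "gpt-4.1-mini" },
    { "model": "claude-haiku-4-5" }
  ],
  "min_confidence": 0.8,
  "total_budget_ms": 3000
}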
Three more places latency disappears.
Hard latency wins compound. Cache hits skip the provider call entirely. The router picks the fastest healthy provider for every request. Streaming forwards without a buffer.
Cache hits return in under a millisecond.
The semantic cache is embedding-indexed and served from memory. A hit short-circuits the provider call entirely — your app sees a sub-millisecond response and pays $0 for the inference.
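A sketch of what a cache policy might look like, with field names (semantic_cache, similarity_threshold, ttl_seconds) assumed for illustration rather than taken from the documented schema:

{
  "semantic_cache": {
    "enabled": true,
    "similarity_threshold": 0.95,
    "ttl_seconds": 3600
  }
}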
Caching docs →
Latency-optimized routing picks the fastest provider, live.
Every model and provider is continuously scored on live p50 latency. When you ask for a capability instead of a specific model, modelux picks the fastest healthy candidate at that exact moment — and re-decides for the next request.
Routing docs →
Streaming passthrough, chunk-by-chunk.
Streaming responses forward the moment they arrive — no buffering, no head-of-line blocking, no extra round-trip. Your app's first token arrives the instant the upstream sends it.
Streaming API →
Pick "fastest" instead of a model.
Ask for a capability — speed, quality, cost — and let the router pick. modelux measures live p50 per provider continuously and re-decides for each request. Your config doesn't change when a provider degrades; the routing does.
Routing docs →
{
  "strategy": "fastest",
  "candidates": [
    { "model": "gpt-4o-mini" },
    { "model": "claude-haiku-4-5" },
    { "model": "gemini-2.5-flash" }
  ],
  "fallback_on": ["429", "5xx", "timeout"],
  "total_budget_ms": 5000
}
See your own p50 in 60 seconds.
The free tier is everything you need to point your app at modelux, watch real percentiles flow into the dashboard, and try a fastest-route policy. No card.