Everything you need to run LLMs in production.
Gateway features to keep your stack flexible. Control plane features to keep it accountable. Built for engineering teams who treat LLMs like any other production dependency.
Eight routing strategies. All config, no code.
Every routing config is a named, versioned resource. Your app calls @production and doesn't care what's underneath.
Single model
Lock traffic to a specific model + provider. The simplest routing config.
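In the same JSON shape as the ensemble example later on this page, a single-model config might look like this (the keys beyond `strategy` are illustrative assumptions, not a documented schema):

```json
{
  "strategy": "single",
  "model": "claude-haiku-4-5",
  "provider": "anthropic",
  "timeout_ms": 10000
}
```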
Fallback chain
Ordered list of models with per-attempt timeouts. Auto-retry on 429, 5xx, and timeouts.
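Expressed in the same config shape, a fallback chain might look like this (the `attempts` and `retry_on` keys are illustrative assumptions, mirroring the values shown in the terminal transcript later on the page):

```json
{
  "strategy": "fallback",
  "attempts": [
    { "model": "claude-haiku-4-5", "timeout_ms": 2000 },
    { "model": "claude-sonnet-4-5", "timeout_ms": 5000 }
  ],
  "retry_on": ["429", "5xx", "timeout"]
}
```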
Cost-optimized
Pick the cheapest model that meets your quality tier. Allowlist specific models.
Latency-optimized
Route based on real-time p50 latency measurements across healthy providers.
Ensemble
Fan out to multiple models in parallel. Aggregate with voting, first-valid, or weighted consensus.
A/B test
Split traffic by percentage between configs. Compare quality and cost in production.
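A percentage split over two named configs could be sketched like this (hypothetical keys and config names, for illustration only):

```json
{
  "strategy": "ab_test",
  "variants": [
    { "config": "@cascade-v1", "traffic_pct": 90 },
    { "config": "@ensemble-v2", "traffic_pct": 10 }
  ]
}
```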
Cascade
Sequential attempts with early stop on success. Useful for quality-tier fallbacks.
Custom rules
DSL over cost, latency, budget, and tags. Programmable policies for complex traffic.
Match frontier quality at 20% of the cost.
Run multiple smaller models in parallel and aggregate their outputs. Research on model ensembling suggests that 3-model ensembles of small open-weight models can match or exceed frontier single-model quality on many tasks — at a fraction of the cost.
- ▸ Parallel fan-out with per-attempt timeouts
- ▸ Aggregation: voting, first-valid, weighted
- ▸ Weight tuning per-model
- ▸ Live cost estimate in the builder UI
{
  "strategy": "ensemble",
  "aggregation": "weighted_vote",
  "members": [
    { "model": "claude-haiku-4-5", "weight": 1.0 },
    { "model": "gpt-4o-mini", "weight": 1.0 },
    { "model": "gemini-2.5-flash", "weight": 0.8 }
  ],
  "timeout_ms": 5000
}
Ship changes with confidence.
The gateway is table stakes. The control plane is what makes Modelux different: budgets, replay, explainability, audit, versioning.
Budgets & spend governance
Set per-project, per-tag, or org-wide spend caps. Auto-downgrade to cheaper models near the cap. Alerts at 80% and 100%.
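A cap of this kind could be declared roughly as follows (every key and value here is an illustrative assumption, not the documented schema):

```json
{
  "scope": "project:checkout",
  "monthly_cap_usd": 500,
  "near_cap_action": "downgrade_to_cheaper_model",
  "alert_at_pct": [80, 100]
}
```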
Replay simulator
Take 24 hours of historical requests and replay them against a new routing config. Compare cost, latency, and success rate before shipping.
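Conceptually, the simulator re-scores each historical request under the candidate config and compares the aggregates. A toy Python sketch of that comparison step (field names and numbers invented; not Modelux's actual implementation):

```python
def compare(baseline, candidate):
    """Aggregate cost, median latency, and success rate over two replay
    runs of the same requests, and report candidate-minus-baseline deltas."""
    def agg(rows):
        n = len(rows)
        return {
            "cost": sum(r["cost"] for r in rows),
            "p50_ms": sorted(r["latency_ms"] for r in rows)[n // 2],
            "success_rate": sum(r["ok"] for r in rows) / n,
        }
    a, b = agg(baseline), agg(candidate)
    return {k: round(b[k] - a[k], 4) for k in a}

# Three replayed requests under the current config vs. a candidate config.
baseline = [
    {"cost": 0.010, "latency_ms": 900,  "ok": True},
    {"cost": 0.012, "latency_ms": 1100, "ok": True},
    {"cost": 0.011, "latency_ms": 1400, "ok": False},
]
candidate = [
    {"cost": 0.004, "latency_ms": 700, "ok": True},
    {"cost": 0.005, "latency_ms": 800, "ok": True},
    {"cost": 0.004, "latency_ms": 950, "ok": True},
]
print(compare(baseline, candidate))  # negative deltas = candidate is cheaper/faster
```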
Decision explainability
Every request stores a full decision trace: which attempts ran, why the router picked what it picked, per-attempt cost and latency.
Audit log
Every config change, API key action, and team event is logged. Searchable. Exportable. Required for SOC 2.
Config versioning
Every routing config change creates a version. Diff changes, rollback with one click, promote from simulation results.
Webhooks
Subscribe to events: config changes, budget alerts, provider health changes, request anomalies. Signed HMAC payloads.
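Signed payloads should be verified with a constant-time comparison. A minimal Python sketch, assuming the signature arrives as a hex-encoded HMAC-SHA256 of the raw request body (the header name, secret format, and encoding below are assumptions, not documented API):

```python
import hashlib
import hmac

def verify_webhook(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Recompute HMAC-SHA256 over the raw body and compare in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)

# Example: a valid signature verifies, a tampered one does not.
secret = b"whsec_example"  # hypothetical webhook secret
body = b'{"event":"budget.alert","level":80}'
sig = hmac.new(secret, body, hashlib.sha256).hexdigest()
print(verify_webhook(secret, body, sig))        # True
print(verify_webhook(secret, body + b"x", sig))  # False
```

Always verify against the raw bytes of the body, before any JSON parsing, since re-serialization can change the byte sequence and break the signature.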
Stay up when providers don't.
The routing layer is the reliability layer. Failover, health scoring, and circuit breakers are built in — not a bolt-on. See the full reliability page →
Multi-provider failover
Fallback chains with per-attempt timeouts. Automatic retries on 429, 5xx, and timeout — no retry loop in your app.
Health-aware routing
Providers scored continuously on error rate and latency. Traffic shifts to healthier options before you'd notice the degradation on a dashboard.
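Continuous health scoring of this kind is often an exponentially weighted moving average over recent request outcomes. A toy Python sketch of the idea (not Modelux's actual algorithm):

```python
class HealthScore:
    """EWMA over request outcomes: 1.0 = perfectly healthy, 0.0 = failing.
    Recent outcomes dominate; old history decays geometrically."""
    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha  # weight given to the newest outcome
        self.score = 1.0    # start optimistic

    def record(self, ok: bool) -> float:
        self.score = (1 - self.alpha) * self.score + self.alpha * (1.0 if ok else 0.0)
        return self.score

h = HealthScore(alpha=0.2)
for outcome in [True, True, False, False, False]:
    h.record(outcome)
print(round(h.score, 3))  # 0.512 — three failures in a row drag the score down
```

A router can then prefer the provider with the highest score, or drop any provider whose score falls below a circuit-breaker threshold.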
Per-attempt timeouts
Each fallback attempt has its own timeout budget. A slow primary can't blow up your tail latency — the router moves on.
Streaming passthrough
Go proxy engineered for low overhead. Streaming responses forwarded chunk-by-chunk; no buffering, no head-of-line blocking.
Know what's happening. Know why.
Most LLM observability tools stop at request/response logging. We capture the full routing decision trace — so you can answer "why did this request go to that model?" for any request in your history.
Request logs
Every request captured: input, output, tokens, cost, latency, model, provider, decision trace. Searchable by tag, user, project.
Cost attribution
Per-request cost computation. Drill down by project, tag, end-user, model, or provider. Exportable to your warehouse.
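Per-request cost computation typically multiplies prompt and completion token counts by per-million-token rates. A minimal Python sketch (the rates below are illustrative, not real pricing):

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Cost in dollars: tokens scaled to millions, times the per-million rate."""
    return (prompt_tokens / 1e6 * input_rate_per_m
            + completion_tokens / 1e6 * output_rate_per_m)

# Illustrative rates: $0.15/M input tokens, $0.60/M output tokens.
print(request_cost(1200, 300, 0.15, 0.60))
```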
Latency percentiles
p50, p95, p99 latencies per model and provider. Detect regressions before they page you.
Error tracking
Group errors by type, provider, model. See which prompts consistently fail and why.
> create a cascade that tries haiku first and falls back to sonnet
[modelux] creating routing config @cascade-v1
  strategy:  cascade
  attempt_1: claude-haiku-4-5   timeout 2s
  attempt_2: claude-sonnet-4-5  timeout 5s
  retry_on:  [429, 5xx, timeout]
[modelux] config @cascade-v1 created. active.

> show me yesterday's spend by model
[modelux] fetching analytics report...
  gpt-4o-mini       $12.47   4,821 req
  claude-haiku-4-5  $ 8.22   2,140 req
  gemini-2.5-flash  $ 3.91   1,103 req
  total             $24.60   8,064 req
Manage Modelux from your AI.
Every dashboard action is also a tool in our MCP server. Connect Claude Code, Cursor, or any MCP-aware client and manage routing, budgets, providers, and analytics through natural language.
- ▸ 80+ MCP tools covering the full API surface
- ▸ REST API for everything (dashboard is a client)
- ▸ Webhooks for event-driven integrations
- ▸ OpenAPI spec for generated SDKs