$ modelux cost

13× cheaper than GPT-5. Same accuracy.

Most LLM bills are over-provisioned. The frontier model gets used for traffic the cheap model would've answered correctly. modelux is the routing layer that fixes that — without you rewriting your app or your prompts.

  • 13× cheaper than GPT-5 (modelux ensemble vs single GPT-5, 150 cases)
  • 6× cheaper than Sonnet (modelux ensemble vs single Claude Sonnet)
  • $0 cache hit cost (the semantic cache short-circuits the provider call)
  • 50–70% typical workload savings (common chat workloads vs a frontier-only stack, no accuracy regression in the benchmark)
# headline benchmark

Same accuracy. A fraction of the bill.

From the modelux ensembles benchmark: 150 real cases pulled from MMLU, GSM8K, and TriviaQA. Same prompts. Same scoring. The ensemble of gpt-4.1-mini + claude-haiku with confidence routing ties Claude Sonnet on accuracy and beats GPT-5, at roughly a seventh of Sonnet's spend and a thirteenth of GPT-5's.

  • 74% accuracy — same as Sonnet, beats GPT-5 (73%)
  • $0.04 vs $0.27 vs $0.54 for the same 150 cases
  • Configurable as a one-line routing strategy
[Chart: total spend across 150 cases (USD). modelux ensemble, gpt-4.1-mini + claude-haiku with confidence routing: $0.04, 74% acc. Claude Sonnet, single model: $0.27, 74% acc. GPT-5, single model: $0.54, 73% acc.]
Source: modelux ensembles benchmark, MMLU + GSM8K + TriviaQA, 150 cases. Cost is actual provider spend at April 2026 list prices.
# where savings come from

Three levers. Compose any combination.

Each one is a configuration change, not a code change. Combine them and the savings compound multiplicatively.

Ensembles

Run two or three small models in parallel and pick the consensus answer. Match Sonnet at roughly a sixth of the cost. Latency stays bounded by the slowest small model, not the sum.

Sonnet $0.27 → ensemble $0.04
Ensembles concept →
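
In config form, this lever might look something like the sketch below; the block name and field names are illustrative, not a confirmed modelux schema.

@ensemble json
{
  "strategy": "ensemble",
  "models": ["gpt-4.1-mini", "claude-haiku-4-5"],   // small models queried in parallel
  "selection": "confidence",                        // keep the highest-confidence / consensus answer
  "timeout_ms": 10000                               // latency bound set by the slowest member
}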

Fallback downgrades

Mark a cheap model as the primary and a frontier model as the safety net. Most traffic resolves on the cheap one; the rest spills over to the frontier. You never pay frontier rates for traffic that didn't need them.

All-frontier $1.00/req → tiered $0.21/req on a typical mix
Fallback guide →
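
A hedged sketch of what the tiering could look like as config; the field names and escalation triggers are illustrative.

@fallback json
{
  "strategy": "fallback",
  "primary": "gpt-4o-mini",                    // cheap model takes the first attempt
  "fallback": "claude-sonnet-4-5",             // frontier safety net for the remainder
  "escalate_on": ["low_confidence", "refusal", "provider_error"]
}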

Semantic cache

The embedding-keyed cache serves answers for identical-meaning prompts in under a millisecond, at zero provider cost. The hit rate compounds across users and sessions: repeat questions stop costing money.

10% cache hit rate ≈ 10% straight off the bill
Caching docs →
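
A minimal sketch of enabling the cache, assuming a similarity threshold controls what counts as "identical meaning"; the field names are illustrative.

@cache json
{
  "cache": {
    "type": "semantic",              // embedding-keyed lookup, no provider call on a hit
    "similarity_threshold": 0.95,    // how close two prompts must be to count as the same question
    "ttl_hours": 24
  }
}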
# $ per million tokens

The cheap models aren't bad. The expensive ones aren't always worth it.

Cheap → expensive, list prices as of April 2026. The blended column assumes a typical chat mix (75% output cost). Routing in modelux can pick the cheapest model that meets your quality bar per request.

Model Provider Input / 1M Output / 1M Blended
gpt-4o-mini openai $0.15 $0.60 $1.95
gpt-4.1-mini openai $0.40 $1.60 $5.20
gpt-5-mini openai $0.25 $2.00 $6.25
claude-haiku-4-5 anthropic $1.00 $5.00 $16.00
gpt-4.1 openai $2.00 $8.00 $26.00
gpt-5 openai $1.25 $10.00 $31.25
gpt-4o openai $2.50 $10.00 $32.50
gpt-5.4 openai $2.50 $15.00 $47.50
claude-sonnet-4-5 anthropic $3.00 $15.00 $48.00
claude-opus-4-5 anthropic $5.00 $25.00 $80.00
gpt-5-pro openai $15.00 $120.00 $375.00
gpt-5.4-pro openai $30.00 $180.00 $570.00
Blended = input + 3 × output, per 1M tokens. Approximates chat traffic where output dominates cost; e.g. gpt-4o-mini: $0.15 + 3 × $0.60 = $1.95. Pricing as of April 2026.
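
Per-request selection of the cheapest model that clears a quality bar would presumably be one more strategy block. A hypothetical sketch; the field names are illustrative, not a confirmed schema.

@routing json
{
  "strategy": "cost_optimized",
  "candidates": ["gpt-4o-mini", "gpt-4.1-mini", "claude-sonnet-4-5"],
  "quality_bar": 0.8,                 // minimum acceptable quality score per request
  "pick": "cheapest_passing"          // cheapest candidate that clears the bar
}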
# budgets

Hard caps. Enforced before the call leaves the proxy.

Set a daily, weekly, or monthly cap per project, key, or end-user. modelux checks the budget in under 5ms before every request and rejects with a clean error when the cap is hit. No surprise bills. No "we'll true up next month."

  • Per-project, per-key, or per-end-user caps
  • Soft warnings (Slack / email / webhook) before the hard cap hits
  • Atomic enforcement on the hot path — no race conditions, no overshoot
@budget json
{
  "scope": "project",
  "period": "monthly",
  "limit_usd": 500,
  "warn_at_pct": [50, 80, 95],
  "warn_to": ["slack:#ops", "webhook:billing"],
  "on_limit": "reject"
}
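
A request that arrives after the cap is hit is rejected before any provider call. The response below is a hypothetical sketch of that rejection; the error shape is illustrative, not a documented format.

@budget-rejected json
{
  "error": {
    "type": "budget_exceeded",                 // illustrative error shape
    "scope": "project",
    "period": "monthly",
    "limit_usd": 500,
    "spent_usd": 500.00,
    "resets_at": "2026-05-01T00:00:00Z"
  }
}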
# measure before you ship

Projected savings aren't good enough? Measure them.

The ROI calculator gives you a ballpark. An experiment gives you the number. Replay up to 50,000 real requests from your own logs against a cheaper candidate config. Routing-only mode is free: it projects cost and latency for each candidate from real token counts and per-provider health metrics. When the candidate also changes response quality, flip to with-responses mode: modelux actually calls the candidate model and scores each response against the baseline with embedding similarity.

  • Routing-only — $0 provider spend, projected numbers, full decision trace per row
  • With-responses — measured cost, measured latency, cosine-similarity score per pair
  • Per-experiment spend cap + auto-cancel on overrun
  • Promote the winning candidate to a versioned production config in one click
@savings-sim json
{
  "id": "sim_8f3c…",
  "window":   { "last_days": 7 },
  "requests": 14218,
  "mode": "routing_only",
  "baseline": {
    "cost_usd": 412.88
  },
  "candidate": {
    "cost_usd": 178.40,   // −56.8%
    "route_distribution": {
      "gpt-4o-mini":      0.72,
      "claude-haiku-4-5": 0.23,
      "gpt-4o":           0.05
    }
  }
}
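
For context, launching a replay like the one above might be configured along these lines; the block name, fields, and values are illustrative, not a confirmed API.

@experiment json
{
  "name": "downgrade-to-mini",                     // hypothetical experiment definition
  "mode": "routing_only",
  "sample": { "source": "logs", "last_days": 7, "max_requests": 50000 },
  "baseline": "production-config",
  "candidate": {
    "strategy": "fallback",
    "primary": "gpt-4o-mini",
    "fallback": "claude-sonnet-4-5"
  },
  "spend_cap_usd": 25,                             // per-experiment cap, auto-cancel on overrun
  "on_overrun": "cancel"
}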

Plug your numbers in.

The ROI calculator takes your monthly LLM spend, the share of traffic that could downgrade, and the expected cost ratio, and shows you what modelux would save. The free tier gets you live in five minutes; a routing-only experiment then replays your real traffic against the candidate so you can replace the estimate with a measurement. No card required.