$ modelux cost

13× cheaper than GPT-5. Same accuracy.

Most LLM bills are over-provisioned. The frontier model gets used for traffic the cheap model would've answered correctly. modelux is the routing layer that fixes that — without you rewriting your app or your prompts.

  • 13× cheaper than GPT-5 (modelux ensemble vs single GPT-5, 150 cases)
  • 6× cheaper than Sonnet (modelux ensemble vs single Claude Sonnet)
  • $0 cache hit cost (the semantic cache short-circuits the provider call)
  • 50–70% typical workload savings (common chat workloads vs a frontier-only stack, no accuracy regression in the benchmark)
# headline benchmark

Same accuracy. A fraction of the bill.

From the modelux ensembles benchmark: 150 real cases pulled from MMLU, GSM8K, and TriviaQA. Same prompts. Same scoring. The ensemble of gpt-4.1-mini + claude-haiku with confidence routing ties Claude Sonnet on accuracy and beats GPT-5, at roughly a seventh of Sonnet's spend and a thirteenth of GPT-5's.

  • 74% accuracy — same as Sonnet, beats GPT-5 (73%)
  • $0.04 vs $0.27 vs $0.54 for the same 150 cases
  • Configurable as a one-line routing strategy
[Chart: total spend across 150 cases (USD). modelux ensemble, gpt-4.1-mini + claude-haiku with confidence routing: $0.04, 74% acc. Claude Sonnet, single model: $0.27, 74% acc. GPT-5, single model: $0.54, 73% acc.]
Source: modelux ensembles benchmark, MMLU + GSM8K + TriviaQA, 150 cases. Cost is actual provider spend at April 2026 list prices.
# where savings come from

Three levers. Compose any combination.

Each one is a configuration change, not a code change. Combine them and the savings compound multiplicatively.

Ensembles

Run two or three small models in parallel and pick the consensus answer. Match Sonnet at roughly a sixth of the cost. Latency stays bounded by the slowest small model, not the sum.

Sonnet $0.27 → ensemble $0.04
Ensembles concept →
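
In config form, this lever might look something like the sketch below; the block name and field names are illustrative, not a confirmed modelux schema.

@ensemble json
{
  "strategy": "ensemble",
  "models": ["gpt-4.1-mini", "claude-haiku-4-5"],   // small models queried in parallel
  "selection": "confidence",                        // keep the highest-confidence / consensus answer
  "timeout_ms": 10000                               // latency bound set by the slowest member
}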

Fallback downgrades

Mark a cheap model as the primary and a frontier model as the safety net. Most traffic resolves on the cheap one; the rest spills over to the frontier. You never pay frontier rates for traffic that didn't need them.

All-frontier $1.00/req → tiered $0.21/req on a typical mix
Fallback guide →
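
A hedged sketch of what the tiering could look like as config; the field names and escalation triggers are illustrative.

@fallback json
{
  "strategy": "fallback",
  "primary": "gpt-4o-mini",                    // cheap model takes the first attempt
  "fallback": "claude-sonnet-4-5",             // frontier safety net for the remainder
  "escalate_on": ["low_confidence", "refusal", "provider_error"]
}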

Semantic cache

The embedding-keyed cache serves answers for identical-meaning prompts in under a millisecond, at zero provider cost. The hit rate compounds across users and sessions: repeat questions stop costing money.

10% cache hit rate ≈ 10% straight off the bill
Caching docs →
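
A minimal sketch of enabling the cache, assuming a similarity threshold controls what counts as "identical meaning"; the field names are illustrative.

@cache json
{
  "cache": {
    "type": "semantic",              // embedding-keyed lookup, no provider call on a hit
    "similarity_threshold": 0.95,    // how close two prompts must be to count as the same question
    "ttl_hours": 24
  }
}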
# $ per million tokens

The cheap models aren't bad. The expensive ones aren't always worth it.

Cheap → expensive, list prices as of April 2026. The blended column assumes a typical chat mix (75% output cost). Routing in modelux can pick the cheapest model that meets your quality bar per request.

Model Provider Input / 1M Output / 1M Blended
gpt-4o-mini openai $0.15 $0.60 $1.95
gpt-4.1-mini openai $0.40 $1.60 $5.20
gpt-5-mini openai $0.25 $2.00 $6.25
claude-haiku-4-5 anthropic $1.00 $5.00 $16.00
gpt-4.1 openai $2.00 $8.00 $26.00
gpt-5 openai $1.25 $10.00 $31.25
gpt-4o openai $2.50 $10.00 $32.50
gpt-5.4 openai $2.50 $15.00 $47.50
claude-sonnet-4-5 anthropic $3.00 $15.00 $48.00
claude-opus-4-5 anthropic $5.00 $25.00 $80.00
gpt-5-pro openai $15.00 $120.00 $375.00
gpt-5.4-pro openai $30.00 $180.00 $570.00
Blended = input + 3 × output, per 1M tokens. Approximates chat traffic where output dominates cost; e.g. gpt-4o-mini: $0.15 + 3 × $0.60 = $1.95. Pricing as of April 2026.
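
Per-request selection of the cheapest model that clears a quality bar would presumably be one more strategy block. A hypothetical sketch; the field names are illustrative, not a confirmed schema.

@routing json
{
  "strategy": "cost_optimized",
  "candidates": ["gpt-4o-mini", "gpt-4.1-mini", "claude-sonnet-4-5"],
  "quality_bar": 0.8,                 // minimum acceptable quality score per request
  "pick": "cheapest_passing"          // cheapest candidate that clears the bar
}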
# budgets

Hard caps. Enforced before the call leaves the proxy.

Set a daily, weekly, or monthly cap per project, key, or end-user. modelux checks the budget in under 5ms before every request and rejects with a clean error when the cap is hit. No surprise bills. No "we'll true up next month."

  • Per-project, per-key, or per-end-user caps
  • Soft warnings (Slack / email / webhook) before the hard cap hits
  • Atomic enforcement on the hot path — no race conditions, no overshoot
@budget json
{
  "scope": "project",
  "period": "monthly",
  "limit_usd": 500,
  "warn_at_pct": [50, 80, 95],
  "warn_to": ["slack:#ops", "webhook:billing"],
  "on_limit": "reject"
}
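
A request that arrives after the cap is hit is rejected before any provider call. The response below is a hypothetical sketch of that rejection; the error shape is illustrative, not a documented format.

@budget-rejected json
{
  "error": {
    "type": "budget_exceeded",                 // illustrative error shape
    "scope": "project",
    "period": "monthly",
    "limit_usd": 500,
    "spent_usd": 500.00,
    "resets_at": "2026-05-01T00:00:00Z"
  }
}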
# measure before you ship

Projected savings aren't good enough? Measure them.

The ROI calculator gives you a ballpark. An experiment gives you the number. Replay up to 50,000 real requests from your own logs against a cheaper candidate config. Routing-only mode is free: it projects cost and latency for each candidate from real token counts and per-provider health metrics. When the candidate also changes response quality, flip to with-responses mode: modelux actually calls the candidate model and scores each response against the baseline with embedding similarity.

  • Routing-only — $0 provider spend, projected numbers, full decision trace per row
  • With-responses — measured cost, measured latency, cosine-similarity score per pair
  • Per-experiment spend cap + auto-cancel on overrun
  • Promote the winning candidate to a versioned production config in one click
@savings-sim json
{
  "id": "sim_8f3c…",
  "window":   { "last_days": 7 },
  "requests": 14218,
  "mode": "routing_only",
  "baseline": {
    "cost_usd": 412.88
  },
  "candidate": {
    "cost_usd": 178.40,   // −56.8%
    "route_distribution": {
      "gpt-4o-mini":      0.72,
      "claude-haiku-4-5": 0.23,
      "gpt-4o":           0.05
    }
  }
}
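
For context, launching a replay like the one above might be configured along these lines; the block name, fields, and values are illustrative, not a confirmed API.

@experiment json
{
  "name": "downgrade-to-mini",                     // hypothetical experiment definition
  "mode": "routing_only",
  "sample": { "source": "logs", "last_days": 7, "max_requests": 50000 },
  "baseline": "production-config",
  "candidate": {
    "strategy": "fallback",
    "primary": "gpt-4o-mini",
    "fallback": "claude-sonnet-4-5"
  },
  "spend_cap_usd": 25,                             // per-experiment cap, auto-cancel on overrun
  "on_overrun": "cancel"
}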

Plug your numbers in.

The ROI calculator takes your monthly LLM spend, the share of traffic that could downgrade, and the expected cost ratio, and shows you what modelux would save. The free tier gets you live in five minutes; a routing-only experiment then replays your real traffic against the candidate so you can replace the estimate with a measurement. No card required.