Ship routing changes without holding your breath.
Replay your real production traffic against any candidate routing config. See the cost, latency, and quality delta before a single live request hits it. When the number looks right, promote it to production in one click — versioned, with rollback.
Pick a window. Define a candidate. See the diff.
Every experiment runs on real logs, in the same routing engine that serves live traffic. No synthetic benchmarks, no toy prompts, no trust gap between what you measure and what you ship.
- 01 Pick a window. Last 7 / 14 / 30 days, or a custom range. Filter by model, end_user_id, tag, or project.
- 02 Define a candidate. Any supported routing strategy. Stored as a standard routing config — the same object that serves live traffic.
- 03 See the diff. Cost, p50/p95 latency, error rate, route distribution, and a decision trace per request.
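Driven from code instead of the dashboard, the same three steps might look like the sketch below. The endpoint, client, and field names here are illustrative assumptions, not the documented modelux API; only the window, filter, and candidate-config shapes mirror what the steps above describe.

```python
# Hypothetical sketch of steps 01-03 over an HTTP API. The endpoint URL
# and field names are assumptions for illustration, not modelux's real API.
import os
import requests

payload = {
    "mode": "routing_only",
    "window": {"start": "2026-04-09", "end": "2026-04-16"},  # step 01: pick a window
    "filters": {"project": "support-bot"},                   # assumed filter shape
    "candidate": {                                           # step 02: a standard routing config
        "strategy": "cost_optimized",
        "allow": ["gpt-4o-mini", "claude-haiku-4-5"],
    },
}

resp = requests.post(
    "https://api.modelux.example/v1/experiments",            # placeholder URL
    json=payload,
    headers={"Authorization": f"Bearer {os.environ['MODELUX_API_KEY']}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["id"])  # step 03: fetch the diff for this sim id
```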
```json
{
  "id": "sim_8f3c…",
  "window": { "start": "2026-04-09", "end": "2026-04-16" },
  "requests": 14218,
  "mode": "routing_only",
  "baseline": {
    "cost_usd": 412.88,
    "p50_ms": 980, "p95_ms": 3120,
    "errors": 47
  },
  "candidate": {
    "cost_usd": 178.40,   // −56.8%
    "p50_ms": 640,        // −34.7%
    "p95_ms": 2210,       // −29.2%
    "errors": 51
  },
  "route_distribution": {
    "gpt-4o-mini:openai": 0.72,
    "claude-haiku-4-5:anthropic": 0.23,
    "gpt-4o:openai": 0.05
  }
}
```

Every routing change teams hesitate to ship.
The ones where the upside is clear but the downside is hard to quantify. Run them offline first.
Swap a model for its cheaper sibling
Replay last week's traffic against gpt-4o-mini instead of gpt-4o. See cost delta per request, route distribution, and whether the cheap model covers the long tail.
Add a fallback chain
Simulate a cross-provider chain on top of your current primary. Count how often the fallback would have been exercised — and whether the latency budget holds.
Tier expensive end-users
Route your top 10% of end_user_ids to a premium model, everyone else to a cheaper one. Check projected savings against real attribution data before flipping the switch.
Migrate provider
Replay OpenAI traffic against Anthropic (or vice versa). Scope by model, tag, or end-user. Score the candidate responses against baseline with semantic similarity.
Cascade by confidence
Try a cheap-model-first, escalate-on-low-confidence cascade against historical traffic. See what fraction would escalate and what the blended cost looks like; the arithmetic is sketched after these cards.
Split an A/B in advance
Shape an ab_test config — control vs. treatment with different policies — and preview both arms against the same log slice before a single live user sees it.
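For the cascade scenario above, the blended-cost arithmetic is simple enough to sanity-check by hand. A minimal sketch, assuming escalated requests pay for both the cheap attempt and the premium retry; all three constants are illustrative inputs, where a real run reads them out of the replay.

```python
# Blended cost of a cheap-first, escalate-on-low-confidence cascade.
# All three inputs are illustrative assumptions, not replay output.
cheap_cost = 0.0004      # $ per request on the cheap model
premium_cost = 0.0060    # $ per request on the premium model
escalation_rate = 0.18   # fraction of requests below the confidence bar

# Escalated requests pay for the failed cheap attempt plus the premium retry.
blended = ((1 - escalation_rate) * cheap_cost
           + escalation_rate * (cheap_cost + premium_cost))
print(f"${blended:.5f} per request")  # $0.00148, vs $0.00600 premium-only
```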
Projected answers when you want speed. Measured answers when you need proof.
Most teams start with routing-only — it's free, instant, and covers every cost or reliability question. When the change also affects response quality, flip to with-responses and score the candidate against the baseline directly.
Routing-only
Free. Replay the routing decision against every log row. Costs are projected from each candidate model's pricing × the baseline's real token counts; see the sketch after this list. Latency is projected from per-provider health metrics.
- ▸ No provider calls — zero spend
- ▸ Replay 50,000 requests in seconds
- ▸ Full decision trace per request
- ▸ Perfect for cost / routing / reliability sweeps
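A minimal sketch of that projection, assuming published per-1K-token prices and a simple log-row shape; neither is modelux's internal representation.

```python
# Routing-only cost projection: candidate pricing × the baseline's real
# token counts. Prices and the log-row shape are illustrative assumptions.
PRICE_PER_1K = {
    "gpt-4o":      {"input": 0.0025,  "output": 0.0100},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
}

def projected_cost(log_rows, candidate_model):
    """Re-price each logged request's token counts at the candidate's rates."""
    price = PRICE_PER_1K[candidate_model]
    return sum(
        row["input_tokens"] / 1000 * price["input"]
        + row["output_tokens"] / 1000 * price["output"]
        for row in log_rows
    )

rows = [{"input_tokens": 1200, "output_tokens": 300}]  # assumed row shape
print(projected_cost(rows, "gpt-4o-mini"))             # 0.00036
```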
With responses
Metered. Sample the window, actually call the candidate model, and score each response against the baseline using embedding-based similarity (sketched after this list). The number you get is measured, not projected.
- ▸ Real measured cost + latency + response
- ▸ Semantic similarity score per pair (cosine, 0–1)
- ▸ Agreement %, histogram, outlier browser
- ▸ Hard spend caps and overrun auto-cancel built in
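The pairwise score is plain cosine similarity over response embeddings. A self-contained sketch, with embed() standing in for whichever embedding model is used; the choice of that model is an assumption, not specified here.

```python
# Embedding-based similarity between a baseline and a candidate response.
# embed() stands in for a real embedding model; its choice is an assumption.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def score_pair(baseline_text: str, candidate_text: str, embed) -> float:
    """Roughly 0-1 for typical text embeddings; 1.0 means near-identical meaning."""
    return cosine(embed(baseline_text), embed(candidate_text))
```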
Similarity tells you when answers differ. Judge eval tells you whether it matters.
On any completed with-responses experiment, run a third LLM as a judge. We sample stratified pairs from each similarity bucket (high: almost identical; mild: similar wording; major: clear divergence) and label each one better, equivalent, or worse against your rubric. The verdict breakdown lands with Wilson 95% confidence intervals, so you can tell a real signal from noise; the interval math is sketched after the list below.
- ▸ Stratified by default. High / mild / major similarity buckets are sampled independently, so the major-divergence signal isn't drowned out when 90% of pairs land in the high bucket as near-equivalents.
- ▸ Customer-owned criterion. You write the one-sentence rubric ("focus on factual correctness for medical questions"). modelux owns the JSON-output envelope so the verdict shape stays stable across rubric edits.
- ▸ Hard spend cap. Set a per-run dollar ceiling at create time. Worker cancels mid-run if actual spend exceeds it. Default ~$2 for a 300-pair gpt-4o judge.
- ▸ Multiple runs per experiment. A/B two rubrics, cross-check with a second judge model, or drill into one bucket cheaper. Each run is immutable.
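The intervals come from the standard Wilson score formula; a minimal sketch that lands close to the better row in the sample payload below.

```python
# Standard Wilson 95% score interval for a verdict share.
import math

def wilson_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

print(wilson_ci(0.41, 300))  # ≈ (0.356, 0.466): the "better" share over 300 pairs
```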
```json
{
  "judged_count": 300,
  "spend_usd": 2.05,
  "verdict_breakdown": {
    "better": 0.41,
    "equivalent": 0.41,
    "worse": 0.18
  },
  "verdict_breakdown_ci_95": {
    "better": { "lower": 0.36, "upper": 0.46 },
    "equivalent": { "lower": 0.36, "upper": 0.46 },
    "worse": { "lower": 0.14, "upper": 0.23 }
  },
  "by_bucket": {
    "high": { "n": 100, "verdicts": { "better": 0.20, "equivalent": 0.75, "worse": 0.05 } },
    "mild": { "n": 100, "verdicts": { "better": 0.45, "equivalent": 0.40, "worse": 0.15 } },
    "major": { "n": 100, "verdicts": { "better": 0.55, "equivalent": 0.10, "worse": 0.35 } }
  }
}
```

Promote the sim. Keep experimenting in production.
The candidate config you simulated is a standard routing config. Click promote and it becomes a new version, ready to attach to a project or to wrap in an ab_test for a gradual live rollout.
- ▸ ab_test — split traffic by percentage between two configs. Each arm can run any strategy.
- ▸ ensemble — fan out in parallel, aggregate with voting, first-valid, or weighted consensus.
- ▸ cascade — sequential attempts with early-stop on confidence. Cheap model first, escalate when needed.
- ▸ versioned — every change is a new config version. Rollback is one click.
```json
{
  "strategy": "ab_test",
  "split": [
    {
      "weight": 90,
      "name": "control",
      "policy": {
        "strategy": "single",
        "model": "gpt-4o"
      }
    },
    {
      "weight": 10,
      "name": "sim_8f3c_candidate",
      "policy": {
        "strategy": "cost_optimized",
        "allow": ["gpt-4o-mini", "claude-haiku-4-5"]
      }
    }
  ]
}
```

```json
{
  "request_id": "req_3c91…",
  "policy_type": "cost_optimized",
  "policy_name": "sim_8f3c_candidate",
  "candidates": [
    { "model": "gpt-4o-mini:openai", "rank": 1, "reason": "cheapest in allowlist" },
    { "model": "claude-haiku-4-5:anthropic", "rank": 2, "reason": "fallback" }
  ],
  "selected": "gpt-4o-mini:openai",
  "selection_reason": "cheapest",
  "measured": {
    "cost_usd": 0.00041,
    "latency_ms": 612,
    "similarity": 0.91
  }
}
```

Every replayed request explains itself.
A summary number is never enough. modelux writes a full decision trace for every replayed request — which candidates the router considered, which it picked, why, and what it measured.
- ▸ Ranked candidates with per-attempt reasoning
- ▸ Selected model + selection reason in plain English
- ▸ Measured cost, latency, and similarity per row
- ▸ Drill into outliers — high cost-delta, low similarity — without leaving the sim
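Because each trace is a plain record, the outlier drill-down in the last bullet reduces to a filter. A sketch over rows shaped like the trace payload above; baseline_cost_usd is an assumed join against the baseline log, and both thresholds are arbitrary.

```python
# Outlier triage over decision traces. The measured fields mirror the trace
# payload above; baseline_cost_usd and both thresholds are assumptions.
def outliers(traces, max_cost_ratio=2.0, min_similarity=0.70):
    """Yield replayed requests that got pricier or drifted semantically."""
    for t in traces:
        cost_ratio = t["measured"]["cost_usd"] / t["baseline_cost_usd"]
        if cost_ratio > max_cost_ratio or t["measured"]["similarity"] < min_similarity:
            yield t["request_id"], cost_ratio, t["measured"]["similarity"]
```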
Your logs are the benchmark. Nothing else is close.
Observability vendors tell you what happened. Gateways route traffic but can't replay it. Eval platforms score prompts, not policies. modelux is the only control plane where your real production traffic is the dataset.
| Category | Replay real traffic | Note |
|---|---|---|
| Observability (Helicone, Langfuse, Datadog) | — no | captures what happened — no candidate replay |
| Gateway / router (Portkey, LiteLLM, OpenRouter) | — no | routes traffic — no offline replay surface |
| Eval platforms (PromptFoo, Braintrust) | — no | evaluates prompts on fixed datasets, not your real traffic |
| modelux | ✓ yes | replay your real traffic, diff any candidate policy, promote to live |
Run your first experiment in under five minutes.
Point modelux at your last 7 days of traffic, pick a candidate policy, and watch the diff. Routing-only mode is free; it costs you nothing but a refresh. The free tier is enough to get the loop running.