Ship routing changes without holding your breath.
Replay your real production traffic against any candidate routing config. See the cost, latency, and quality delta before a single live request hits it. When the number looks right, promote it to production in one click — versioned, with rollback.
Pick a window. Define a candidate. See the diff.
Every experiment runs on real logs, in the same routing engine that serves live traffic. No synthetic benchmarks, no toy prompts, no trust gap between what you measure and what you ship.
- 01 Pick a window. Last 7 / 14 / 30 days, or a custom range. Filter by model, end_user_id, tag, or project.
- 02 Define a candidate. Any supported routing strategy. Stored as a standard routing config — the same object that serves live traffic.
- 03 See the diff. Cost, p50/p95 latency, error rate, route distribution, and a decision trace per request.
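Driven from code instead of the dashboard, the same three steps might look like the sketch below. The endpoint, client, and field names here are illustrative assumptions, not the documented modelux API; only the window, filter, and candidate-config shapes mirror what the steps above describe.

```python
# Hypothetical sketch of steps 01-03 over an HTTP API. The endpoint URL
# and field names are assumptions for illustration, not modelux's real API.
import os
import requests

payload = {
    "mode": "routing_only",
    "window": {"start": "2026-04-09", "end": "2026-04-16"},  # step 01: pick a window
    "filters": {"project": "support-bot"},                   # assumed filter shape
    "candidate": {                                           # step 02: a standard routing config
        "strategy": "cost_optimized",
        "allow": ["gpt-4o-mini", "claude-haiku-4-5"],
    },
}

resp = requests.post(
    "https://api.modelux.example/v1/experiments",            # placeholder URL
    json=payload,
    headers={"Authorization": f"Bearer {os.environ['MODELUX_API_KEY']}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["id"])  # step 03: fetch the diff for this sim id
```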
```json
{
  "id": "sim_8f3c…",
  "window": { "start": "2026-04-09", "end": "2026-04-16" },
  "requests": 14218,
  "mode": "routing_only",
  "baseline": {
    "cost_usd": 412.88,
    "p50_ms": 980, "p95_ms": 3120,
    "errors": 47
  },
  "candidate": {
    "cost_usd": 178.40,   // −56.8%
    "p50_ms": 640,        // −34.7%
    "p95_ms": 2210,       // −29.2%
    "errors": 51
  },
  "route_distribution": {
    "gpt-4o-mini:openai": 0.72,
    "claude-haiku-4-5:anthropic": 0.23,
    "gpt-4o:openai": 0.05
  }
}
```

Every routing change teams hesitate to ship.
The ones where the upside is clear but the downside is hard to quantify. Run them offline first.
Swap a model for its cheaper sibling
Replay last week's traffic against gpt-4o-mini instead of gpt-4o. See cost delta per request, route distribution, and whether the cheap model covers the long tail.
Add a fallback chain
Simulate a cross-provider chain on top of your current primary. Count how often the fallback would have been exercised — and whether the latency budget holds.
Tier expensive end-users
Route your top 10% of end_user_ids to a premium model, everyone else to a cheaper one. Check projected savings against real attribution data before flipping the switch.
Migrate provider
Replay OpenAI traffic against Anthropic (or vice versa). Scope by model, tag, or end-user. Score the candidate responses against baseline with semantic similarity.
Cascade by confidence
Try a cheap-model-first, escalate-on-low-confidence cascade against historical traffic. See what fraction would escalate and what the blended cost looks like; the arithmetic is sketched after these cards.
Split an A/B in advance
Shape an ab_test config — control vs. treatment with different policies — and preview both arms against the same log slice before a single live user sees it.
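For the cascade scenario above, the blended-cost arithmetic is simple enough to sanity-check by hand. A minimal sketch, assuming escalated requests pay for both the cheap attempt and the premium retry; all three constants are illustrative inputs, where a real run reads them out of the replay.

```python
# Blended cost of a cheap-first, escalate-on-low-confidence cascade.
# All three inputs are illustrative assumptions, not replay output.
cheap_cost = 0.0004      # $ per request on the cheap model
premium_cost = 0.0060    # $ per request on the premium model
escalation_rate = 0.18   # fraction of requests below the confidence bar

# Escalated requests pay for the failed cheap attempt plus the premium retry.
blended = ((1 - escalation_rate) * cheap_cost
           + escalation_rate * (cheap_cost + premium_cost))
print(f"${blended:.5f} per request")  # $0.00148, vs $0.00600 premium-only
```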
Projected answers when you want speed. Measured answers when you need proof.
Most teams start with routing-only — it's free, instant, and covers every cost or reliability question. When the change also affects response quality, flip to with-responses and score the candidate against the baseline directly.
Routing-only
Free. Replay the routing decision against every log row. Costs are projected from each candidate model's pricing × the baseline's real token counts; see the sketch after this list. Latency is projected from per-provider health metrics.
- ▸ No provider calls — zero spend
- ▸ Replay 50,000 requests in seconds
- ▸ Full decision trace per request
- ▸ Perfect for cost / routing / reliability sweeps
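A minimal sketch of that projection, assuming published per-1K-token prices and a simple log-row shape; neither is modelux's internal representation.

```python
# Routing-only cost projection: candidate pricing × the baseline's real
# token counts. Prices and the log-row shape are illustrative assumptions.
PRICE_PER_1K = {
    "gpt-4o":      {"input": 0.0025,  "output": 0.0100},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
}

def projected_cost(log_rows, candidate_model):
    """Re-price each logged request's token counts at the candidate's rates."""
    price = PRICE_PER_1K[candidate_model]
    return sum(
        row["input_tokens"] / 1000 * price["input"]
        + row["output_tokens"] / 1000 * price["output"]
        for row in log_rows
    )

rows = [{"input_tokens": 1200, "output_tokens": 300}]  # assumed row shape
print(projected_cost(rows, "gpt-4o-mini"))             # 0.00036
```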
With responses
Metered. Sample the window, actually call the candidate model, and score each response against the baseline using embedding-based similarity (sketched after this list). The number you get is measured, not projected.
- ▸ Real measured cost + latency + response
- ▸ Semantic similarity score per pair (cosine, 0–1)
- ▸ Agreement %, histogram, outlier browser
- ▸ Hard spend caps and overrun auto-cancel built in
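The pairwise score is plain cosine similarity over response embeddings. A self-contained sketch, with embed() standing in for whichever embedding model is used; the choice of that model is an assumption, not specified here.

```python
# Embedding-based similarity between a baseline and a candidate response.
# embed() stands in for a real embedding model; its choice is an assumption.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def score_pair(baseline_text: str, candidate_text: str, embed) -> float:
    """Roughly 0-1 for typical text embeddings; 1.0 means near-identical meaning."""
    return cosine(embed(baseline_text), embed(candidate_text))
```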
Similarity tells you when answers differ. Judge eval tells you whether it matters.
On any completed with-responses experiment, run a third LLM as a judge. We sample stratified pairs from each similarity bucket (high: almost identical; mild: similar wording; major: clear divergence) and label each one better, equivalent, or worse against your rubric. The verdict breakdown lands with Wilson 95% confidence intervals, so you can tell a real signal from noise; the interval math is sketched after the list below.
- ▸ Stratified by default. High / mild / major similarity buckets are sampled independently, so the major-divergence signal isn't drowned out when 90% of pairs land in the high bucket as near-equivalents.
- ▸ Customer-owned criterion. You write the one-sentence rubric ("focus on factual correctness for medical questions"). modelux owns the JSON-output envelope so the verdict shape stays stable across rubric edits.
- ▸ Hard spend cap. Set a per-run dollar ceiling at create time. Worker cancels mid-run if actual spend exceeds it. Default ~$2 for a 300-pair gpt-4o judge.
- ▸ Multiple runs per experiment. A/B two rubrics, cross-check with a second judge model, or drill into one bucket cheaper. Each run is immutable.
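The intervals come from the standard Wilson score formula; a minimal sketch that lands close to the better row in the sample payload below.

```python
# Standard Wilson 95% score interval for a verdict share.
import math

def wilson_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

print(wilson_ci(0.41, 300))  # ≈ (0.356, 0.466): the "better" share over 300 pairs
```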
```json
{
  "judged_count": 300,
  "spend_usd": 2.05,
  "verdict_breakdown": {
    "better": 0.41,
    "equivalent": 0.41,
    "worse": 0.18
  },
  "verdict_breakdown_ci_95": {
    "better": { "lower": 0.36, "upper": 0.46 },
    "equivalent": { "lower": 0.36, "upper": 0.46 },
    "worse": { "lower": 0.14, "upper": 0.23 }
  },
  "by_bucket": {
    "high": { "n": 100, "verdicts": { "better": 0.20, "equivalent": 0.75, "worse": 0.05 } },
    "mild": { "n": 100, "verdicts": { "better": 0.45, "equivalent": 0.40, "worse": 0.15 } },
    "major": { "n": 100, "verdicts": { "better": 0.55, "equivalent": 0.10, "worse": 0.35 } }
  }
}
```

Promote the sim. Keep experimenting in production.
The candidate config you simulated is a standard routing config. Click promote and it becomes a new version, ready to attach to a project or to wrap in an ab_test for a gradual live rollout.
- ▸ ab_test — split traffic by percentage between two configs. Each arm can run any strategy.
- ▸ ensemble — fan out in parallel, aggregate with voting, first-valid, or weighted consensus.
- ▸ cascade — sequential attempts with early-stop on confidence. Cheap model first, escalate when needed.
- ▸ versioned — every change is a new config version. Rollback is one click.
```json
{
  "strategy": "ab_test",
  "split": [
    {
      "weight": 90,
      "name": "control",
      "policy": {
        "strategy": "single",
        "model": "gpt-4o"
      }
    },
    {
      "weight": 10,
      "name": "sim_8f3c_candidate",
      "policy": {
        "strategy": "cost_optimized",
        "allow": ["gpt-4o-mini", "claude-haiku-4-5"]
      }
    }
  ]
}
```

```json
{
  "request_id": "req_3c91…",
  "policy_type": "cost_optimized",
  "policy_name": "sim_8f3c_candidate",
  "candidates": [
    { "model": "gpt-4o-mini:openai", "rank": 1, "reason": "cheapest in allowlist" },
    { "model": "claude-haiku-4-5:anthropic", "rank": 2, "reason": "fallback" }
  ],
  "selected": "gpt-4o-mini:openai",
  "selection_reason": "cheapest",
  "measured": {
    "cost_usd": 0.00041,
    "latency_ms": 612,
    "similarity": 0.91
  }
}
```

Every replayed request explains itself.
A summary number is never enough. modelux writes a full decision trace for every replayed request — which candidates the router considered, which it picked, why, and what it measured.
- ▸ Ranked candidates with per-attempt reasoning
- ▸ Selected model + selection reason in plain English
- ▸ Measured cost, latency, and similarity per row
- ▸ Drill into outliers — high cost-delta, low similarity — without leaving the sim
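Because each trace is a plain record, the outlier drill-down in the last bullet reduces to a filter. A sketch over rows shaped like the trace payload above; baseline_cost_usd is an assumed join against the baseline log, and both thresholds are arbitrary.

```python
# Outlier triage over decision traces. The measured fields mirror the trace
# payload above; baseline_cost_usd and both thresholds are assumptions.
def outliers(traces, max_cost_ratio=2.0, min_similarity=0.70):
    """Yield replayed requests that got pricier or drifted semantically."""
    for t in traces:
        cost_ratio = t["measured"]["cost_usd"] / t["baseline_cost_usd"]
        if cost_ratio > max_cost_ratio or t["measured"]["similarity"] < min_similarity:
            yield t["request_id"], cost_ratio, t["measured"]["similarity"]
```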
Your logs are the benchmark. Nothing else is close.
Observability vendors tell you what happened. Gateways route traffic but can't replay it. Eval platforms score prompts, not policies. modelux is the only control plane where your real production traffic is the dataset.
| Category | Replay real traffic | Note |
|---|---|---|
| Observability (Helicone, Langfuse, Datadog) | — no | captures what happened — no candidate replay |
| Gateway / router (Portkey, LiteLLM, OpenRouter) | — no | routes traffic — no offline replay surface |
| Eval platforms (PromptFoo, Braintrust) | — no | evaluates prompts on fixed datasets, not your real traffic |
| modelux | ✓ yes | replay your real traffic, diff any candidate policy, promote to live |
Run your first experiment in under five minutes.
Point modelux at your last 7 days of traffic, pick a candidate policy, and watch the diff. Routing-only mode is free; it costs you nothing but a refresh. The free tier is enough to get the loop running.