
A/B testing models

A/B tests route a configurable percentage of traffic to each sub-config so you can compare cost, latency, and quality in real production traffic.

Why A/B test?

  • Changing models is high-stakes, and public benchmarks rarely reflect your specific use case.
  • Ensemble configs are especially tricky — aggregation behavior depends on your data distribution.
  • Vendor cost/latency claims rarely match your real numbers.

Create an A/B test

Define a wrapper routing config (saved here as @experiment) that splits traffic across variants by weight:

```json
{
  "strategy": "ab_test",
  "variants": [
    { "weight": 80, "config": "@production" },
    { "weight": 20, "config": "@production-candidate" }
  ]
}
```

Call the wrapper config from your app. Any OpenAI-compatible client works; the base URL and key below are placeholders:

```python
from openai import OpenAI

# Point an OpenAI-compatible client at Modelux (placeholder credentials).
client = OpenAI(base_url="https://...", api_key="...")

client.chat.completions.create(
    model="@experiment",
    messages=[...],
)
```

Modelux logs which variant served each request, so the two configs can be compared on the same production traffic.

Read the results

Go to Analytics -> Compare variants. Modelux shows side-by-side:

  • Request volume
  • Mean cost per request
  • p50 / p95 latency
  • Error rate

If you tag requests with a quality signal from your app (e.g., user thumbs-up/down), the analytics can also compare quality metrics across variants.

Promote a variant

Once you’ve seen enough volume to be confident, promote the winner:

  1. Go to Simulations or the routing config’s versions view
  2. Select the variant
  3. Click Promote — Modelux atomically switches your traffic over

Replay before you A/B

If you want signal before sending real traffic, use the replay simulator: it takes your last 24h of logged requests and runs them through the candidate config, showing the cost/latency diff without risking production quality.
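Conceptually, a replay is just re-issuing logged requests against the candidate and measuring. A hedged sketch of that loop — the log record shape and the idea of driving it yourself from an export are assumptions; the built-in simulator does this for you:

```python
import time

def replay(logged_requests, client, candidate_config):
    """Re-run logged requests through a candidate config, recording latency.

    `client` is any OpenAI-compatible client pointed at Modelux;
    `logged_requests` is assumed to carry the original `messages` payloads.
    """
    latencies = []
    for req in logged_requests:
        start = time.monotonic()
        client.chat.completions.create(
            model=candidate_config,
            messages=req["messages"],
        )
        latencies.append(time.monotonic() - start)
    return latencies
```

Comparing the resulting latencies (and the costs the candidate reports) against the originals gives the diff without touching production traffic.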