A/B testing models
A/B tests route a configurable percentage of traffic to each sub-config so you can compare cost, latency, and quality in real production traffic.
Why A/B test?
- Changing models is high-stakes, and public benchmarks rarely reflect your specific use case.
- Ensemble configs are especially tricky — aggregation behavior depends on your data distribution.
- Cost/latency claims from vendors rarely match your real numbers.
Create an A/B test
Define a wrapper routing config (the one your app will call, e.g. @experiment) with the ab_test strategy and a traffic weight for each variant:
```json
{
  "strategy": "ab_test",
  "variants": [
    { "weight": 80, "config": "@production" },
    { "weight": 20, "config": "@production-candidate" }
  ]
}
```
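The weight field controls what fraction of requests each variant receives. As an illustration (this is a generic weighted-sampling sketch, not Modelux internals), an 80/20 split behaves like:

```python
import random

def pick_variant(variants, rng=random):
    """Choose a variant's config name with probability proportional to its weight."""
    total = sum(v["weight"] for v in variants)
    roll = rng.uniform(0, total)
    cumulative = 0
    for v in variants:
        cumulative += v["weight"]
        if roll < cumulative:
            return v["config"]
    return variants[-1]["config"]  # floating-point edge case: fall back to the last variant

variants = [
    {"weight": 80, "config": "@production"},
    {"weight": 20, "config": "@production-candidate"},
]

random.seed(0)  # deterministic for the demo
counts = {"@production": 0, "@production-candidate": 0}
for _ in range(10_000):
    counts[pick_variant(variants)] += 1
# counts ends up near an 80/20 split
```

Over many requests the observed split converges on the configured weights, which is why the comparison below needs volume before it is trustworthy.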
Call the wrapper config from your app:
```python
from openai import OpenAI

# Assumes Modelux exposes an OpenAI-compatible endpoint; substitute your
# actual gateway URL and API key for the placeholders.
client = OpenAI(base_url="https://modelux.example/v1", api_key="YOUR_KEY")

client.chat.completions.create(
    model="@experiment",  # the wrapper config, not a concrete model
    messages=[...],
)
```
Modelux logs which variant served each request, so you can compare the variants downstream.
Read the results
Go to Analytics -> Compare variants. Modelux shows side-by-side:
- Request volume
- Mean cost per request
- p50 / p95 latency
- Error rate
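If you also export per-request logs, the same aggregates are easy to reproduce offline. A minimal sketch, assuming illustrative field names ('cost' in USD, 'latency_ms', 'error') rather than Modelux's actual export schema:

```python
import statistics

def summarize(requests):
    """Aggregate per-request rows into the comparison metrics above.
    Field names ('cost' in USD, 'latency_ms', 'error' bool) are assumptions."""
    latencies = sorted(r["latency_ms"] for r in requests)
    return {
        "volume": len(requests),
        "mean_cost": statistics.mean(r["cost"] for r in requests),
        "p50_latency_ms": statistics.quantiles(latencies, n=100)[49],
        "p95_latency_ms": statistics.quantiles(latencies, n=100)[94],
        "error_rate": sum(r["error"] for r in requests) / len(requests),
    }

# Illustrative data: 100 requests, latencies 120-219 ms, 2% errors.
variant_a = [{"cost": 0.002, "latency_ms": 120 + i, "error": i % 50 == 0} for i in range(100)]
```

Running the same function over each variant's rows gives you a side-by-side table equivalent to the dashboard view.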
If you tag requests with a quality signal from your app (e.g., user thumbs-up/down), the analytics can also compare quality metrics across variants.
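For example, if your app records thumbs-up/down keyed by request ID, you can join that feedback onto the variant log yourself. A minimal sketch, with both data shapes assumed for illustration:

```python
def quality_by_variant(request_log, feedback):
    """Join app-side feedback onto a per-request variant log.
    request_log: {request_id: variant_name}; feedback: {request_id: thumbs_up_bool}.
    Both shapes are assumptions, not Modelux's schema."""
    stats = {}
    for req_id, variant in request_log.items():
        if req_id not in feedback:
            continue  # no quality signal for this request
        up, total = stats.get(variant, (0, 0))
        stats[variant] = (up + feedback[req_id], total + 1)
    return {variant: up / total for variant, (up, total) in stats.items()}
```

Requests without feedback are simply skipped, so the approval rate is computed only over the rated subset of each variant's traffic.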
Promote a variant
Once you’ve seen enough volume to be confident, promote the winner:
- Go to Simulations or the routing config’s versions view
- Select the variant
- Click Promote — Modelux atomically switches your traffic over
Replay before you A/B
If you want signal before sending real traffic, use the replay simulator: it takes the last 24h of logged requests and runs them through the candidate config. You'll see the cost/latency diff without putting production traffic at risk.
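Assuming the replay export gives you per-request cost and latency for both configs (the tuple shape here is an assumption), the diff is a few lines:

```python
def replay_diff(baseline, candidate):
    """Compare aggregate cost and mean latency of two replayed request sets.
    Each input is a list of (cost_usd, latency_ms) tuples; this shape is an
    illustrative assumption, not Modelux's replay output format."""
    def agg(rows):
        costs = [cost for cost, _ in rows]
        latencies = [lat for _, lat in rows]
        return sum(costs), sum(latencies) / len(latencies)

    base_cost, base_lat = agg(baseline)
    cand_cost, cand_lat = agg(candidate)
    return {
        "cost_delta_pct": 100 * (cand_cost - base_cost) / base_cost,
        "mean_latency_delta_pct": 100 * (cand_lat - base_lat) / base_lat,
    }
```

A negative cost_delta_pct with a tolerable latency delta is the signal you want to see before routing even 20% of live traffic at the candidate.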