A/B testing models
A/B tests route a configurable percentage of traffic to each sub-config so you can compare cost, latency, and quality in real production traffic.
Why A/B test?
- Changing models is high-stakes, and public benchmarks rarely reflect your specific use case.
- Ensemble configs are especially tricky — aggregation behavior depends on your data distribution.
- Cost/latency claims from vendors rarely match your real numbers.
Create an A/B test
Define a wrapper routing config (the one your app will call, e.g. @experiment) with the ab_test strategy and a traffic weight for each variant:
```json
{
  "strategy": "ab_test",
  "variants": [
    { "weight": 80, "config": "@production" },
    { "weight": 20, "config": "@production-candidate" }
  ]
}
```
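The weight field controls what fraction of requests each variant receives. As an illustration (this is a generic weighted-sampling sketch, not Modelux internals), an 80/20 split behaves like:

```python
import random

def pick_variant(variants, rng=random):
    """Choose a variant's config name with probability proportional to its weight."""
    total = sum(v["weight"] for v in variants)
    roll = rng.uniform(0, total)
    cumulative = 0
    for v in variants:
        cumulative += v["weight"]
        if roll < cumulative:
            return v["config"]
    return variants[-1]["config"]  # floating-point edge case: fall back to the last variant

variants = [
    {"weight": 80, "config": "@production"},
    {"weight": 20, "config": "@production-candidate"},
]

random.seed(0)  # deterministic for the demo
counts = {"@production": 0, "@production-candidate": 0}
for _ in range(10_000):
    counts[pick_variant(variants)] += 1
# counts ends up near an 80/20 split
```

Over many requests the observed split converges on the configured weights, which is why the comparison below needs volume before it is trustworthy.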
Call the wrapper config from your app:
```python
from openai import OpenAI

# Assumes Modelux exposes an OpenAI-compatible endpoint; substitute your
# actual gateway URL and API key for the placeholders.
client = OpenAI(base_url="https://modelux.example/v1", api_key="YOUR_KEY")

client.chat.completions.create(
    model="@experiment",  # the wrapper config, not a concrete model
    messages=[...],
)
```
Modelux logs which variant served each request, so you can compare the variants downstream.
Read the results
Go to Analytics -> Compare variants. Modelux shows side-by-side:
- Request volume
- Mean cost per request
- p50 / p95 latency
- Error rate
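If you also export per-request logs, the same aggregates are easy to reproduce offline. A minimal sketch, assuming illustrative field names ('cost' in USD, 'latency_ms', 'error') rather than Modelux's actual export schema:

```python
import statistics

def summarize(requests):
    """Aggregate per-request rows into the comparison metrics above.
    Field names ('cost' in USD, 'latency_ms', 'error' bool) are assumptions."""
    latencies = sorted(r["latency_ms"] for r in requests)
    return {
        "volume": len(requests),
        "mean_cost": statistics.mean(r["cost"] for r in requests),
        "p50_latency_ms": statistics.quantiles(latencies, n=100)[49],
        "p95_latency_ms": statistics.quantiles(latencies, n=100)[94],
        "error_rate": sum(r["error"] for r in requests) / len(requests),
    }

# Illustrative data: 100 requests, latencies 120-219 ms, 2% errors.
variant_a = [{"cost": 0.002, "latency_ms": 120 + i, "error": i % 50 == 0} for i in range(100)]
```

Running the same function over each variant's rows gives you a side-by-side table equivalent to the dashboard view.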
If you tag requests with a quality signal from your app (e.g., user thumbs-up/down), the analytics can also compare quality metrics across variants.
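For example, if your app records thumbs-up/down keyed by request ID, you can join that feedback onto the variant log yourself. A minimal sketch, with both data shapes assumed for illustration:

```python
def quality_by_variant(request_log, feedback):
    """Join app-side feedback onto a per-request variant log.
    request_log: {request_id: variant_name}; feedback: {request_id: thumbs_up_bool}.
    Both shapes are assumptions, not Modelux's schema."""
    stats = {}
    for req_id, variant in request_log.items():
        if req_id not in feedback:
            continue  # no quality signal for this request
        up, total = stats.get(variant, (0, 0))
        stats[variant] = (up + feedback[req_id], total + 1)
    return {variant: up / total for variant, (up, total) in stats.items()}
```

Requests without feedback are simply skipped, so the approval rate is computed only over the rated subset of each variant's traffic.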
Promote a variant
Once you’ve seen enough volume to be confident, promote the winner:
- Go to Simulations or the routing config’s versions view
- Select the variant
- Click Promote — Modelux atomically switches your traffic over
Replay before you A/B
If you want signal before sending real traffic, use the replay simulator: it takes the last 24h of logged requests and runs them through the candidate config. You'll see the cost/latency diff without putting production traffic at risk.
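Assuming the replay export gives you per-request cost and latency for both configs (the tuple shape here is an assumption), the diff is a few lines:

```python
def replay_diff(baseline, candidate):
    """Compare aggregate cost and mean latency of two replayed request sets.
    Each input is a list of (cost_usd, latency_ms) tuples; this shape is an
    illustrative assumption, not Modelux's replay output format."""
    def agg(rows):
        costs = [cost for cost, _ in rows]
        latencies = [lat for _, lat in rows]
        return sum(costs), sum(latencies) / len(latencies)

    base_cost, base_lat = agg(baseline)
    cand_cost, cand_lat = agg(candidate)
    return {
        "cost_delta_pct": 100 * (cand_cost - base_cost) / base_cost,
        "mean_latency_delta_pct": 100 * (cand_lat - base_lat) / base_lat,
    }
```

A negative cost_delta_pct with a tolerable latency delta is the signal you want to see before routing even 20% of live traffic at the candidate.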