Running experiments
Before you change production routing, you want to know: would this candidate config save money? Speed things up? Break anything? modelux experiments answer that question by replaying your real historical traffic against a proposed config and comparing the results to what actually ran.
There are two modes:
- Routing-only — replays the routing decision against historical logs. No provider calls, no charges, no quality signal. Cost and latency are estimated from token counts × pricing. Free, fast, fine for “what model would this have picked?”
- With responses — re-runs a sample of historical prompts through the candidate model for real, captures the response, and scores each pair against the baseline response with cosine similarity. Real spend, real latency, and a per-pair “did the candidate produce a meaningfully different answer?” signal.
Pick routing-only when you only care about cost or routing logic. Pick with-responses when you care about whether a different model actually produces acceptable answers.
Routing-only: a quick what-if
Replay last week’s traffic through a candidate config:
curl -X POST https://api.modelux.ai/manage/v1/experiments \
-H "Authorization: Bearer $MODELUX_MGMT_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "switch to gpt-4o-mini",
"projectId": "proj_abc",
"candidatePolicy": "single",
"candidateConfig": {
"model": "gpt-4o-mini",
"provider_credential_id": "pc_openai_default"
},
"windowStart": "2026-04-10T00:00:00Z",
"windowEnd": "2026-04-17T00:00:00Z"
}'
The experiment runs asynchronously. Poll GET /manage/v1/experiments/{id}
until status is completed, then read the baseline_summary and
candidate_summary JSON for cost, p50/p95 latency, and route distribution.
The dashboard shows the same data with a Promote to production button that creates a new routing config version from the candidate.
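Completion polling is easy to wrap in a few lines. A minimal sketch in Python, assuming a `fetch` callable that GETs `/manage/v1/experiments/{id}` and returns the decoded JSON; the terminal statuses other than `completed` (`cancelled`, `failed`) are assumptions here, not confirmed API values:

```python
import time

def poll_experiment(fetch, max_polls=900, sleep=time.sleep, interval_s=2.0):
    """Poll an experiment until it reaches a terminal status.

    `fetch` is any callable returning the experiment as a dict, e.g. a thin
    wrapper around GET /manage/v1/experiments/{id}. Statuses other than
    "completed" are assumed for illustration.
    """
    for _ in range(max_polls):
        experiment = fetch()
        if experiment["status"] in ("completed", "cancelled", "failed"):
            return experiment
        sleep(interval_s)
    raise TimeoutError("experiment did not reach a terminal status")
```

Once it returns with status `completed`, read `baseline_summary` and `candidate_summary` as described above.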
With-responses: actually call the model
This mode answers the question routing-only can’t: do the candidate’s responses look like the baseline’s?
A typical workflow:
- Preflight to get a cost estimate and a suggested spend cap.
- Create the experiment with `mode: "with_responses"`, the sample spec, and a `spendCapUsd` (the engine cancels mid-run if actual spend exceeds it).
- Watch the dashboard for the similarity histogram, headline “X% agreement” number, and worst-similarity outlier pairs.

Preflight
curl -X POST https://api.modelux.ai/manage/v1/experiments/preflight \
-H "Authorization: Bearer $MODELUX_MGMT_KEY" \
-H "Content-Type: application/json" \
-d '{
"projectId": "proj_abc",
"candidatePolicy": "single",
"candidateConfig": {
"model": "gpt-4o-mini",
"provider_credential_id": "pc_openai_default"
},
"windowStart": "2026-04-10T00:00:00Z",
"windowEnd": "2026-04-17T00:00:00Z",
"sample": { "size": 1000, "method": "random" }
}'
You get back something like:
{
"data": {
"estimate": {
"estimatedCostUsd": 0.0003,
"effectiveSampleSize": 81,
"rowCountInWindow": 81,
"perCallEstimateUsd": 0.0000037,
"embeddingCostUsd": 0.0000016,
"notes": []
},
"spendCapUsd": 50,
"suggestedSpendCapUsd": 0.05,
"requiresConfirmation": false,
"autoCancelRatio": 1.5
}
}
effectiveSampleSize is min(sample.size, rowCountInWindow) — the actual
number of pairs the engine will run. If you ask for 1,000 but the window
only has 81 rows, you get 81.
suggestedSpendCapUsd is max(estimate × 2, $0.05) clamped to
spendCapUsd. The $0.05 floor stops penny-level dev sims from being
cancelled at fractions of a cent.
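Both derived numbers are simple to reproduce locally. A sketch of the arithmetic described above (function names are illustrative, not part of the API):

```python
def effective_sample_size(requested: int, rows_in_window: int) -> int:
    # min(sample.size, rowCountInWindow): you can't sample more pairs than exist
    return min(requested, rows_in_window)

def suggested_spend_cap_usd(estimated_cost_usd: float, project_cap_usd: float,
                            floor_usd: float = 0.05) -> float:
    # max(estimate x 2, $0.05), clamped to the project-level spend cap
    return min(max(estimated_cost_usd * 2, floor_usd), project_cap_usd)
```

With the preflight response above: `effective_sample_size(1000, 81)` is 81, and `suggested_spend_cap_usd(0.0003, 50)` hits the $0.05 floor.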
Create
Pass the suggested cap (or your own) as spendCapUsd:
curl -X POST https://api.modelux.ai/manage/v1/experiments \
-H "Authorization: Bearer $MODELUX_MGMT_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "gpt-4o-mini quality check",
"projectId": "proj_abc",
"candidatePolicy": "single",
"candidateConfig": {
"model": "gpt-4o-mini",
"provider_credential_id": "pc_openai_default"
},
"windowStart": "2026-04-10T00:00:00Z",
"windowEnd": "2026-04-17T00:00:00Z",
"mode": "with_responses",
"sample": { "size": 1000, "method": "random" },
"spendCapUsd": 0.05
}'
spendCapUsd is required for with_responses and must be ≤ the project’s
experiment_spend_cap_usd setting. The engine watches the running spend
every 50 rows and cancels the experiment if it crosses your cap. The
cancel reason names your cap explicitly so you know what tripped.
There’s also a hard rule: projects in metadata_only logging mode are
rejected because there’s no logged response to compare against.
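The cancel-on-cap behaviour is easy to picture. A toy sketch of the check-every-50-rows loop, illustrating the behaviour described above rather than the engine's actual code:

```python
def run_with_cap(row_costs_usd, cap_usd, check_every=50):
    """Accumulate spend per row; cancel at the first checkpoint over the cap."""
    spend, rows = 0.0, 0
    for cost in row_costs_usd:
        rows += 1
        spend += cost
        if rows % check_every == 0 and spend > cap_usd:
            return {"status": "cancelled", "rows_run": rows, "spend_usd": spend,
                    "reason": f"spend {spend:.4f} exceeded spendCapUsd {cap_usd:.4f}"}
    return {"status": "completed", "rows_run": rows, "spend_usd": spend}
```

Note the consequence of the 50-row cadence: actual spend can overshoot the cap by up to 50 rows' worth of calls before the cancel fires.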
Read the similarity output
Once the experiment is completed, three endpoints expose the
quality signal:
# Headline + histogram (10 fixed buckets across [0, 1])
curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
"https://api.modelux.ai/manage/v1/experiments/$SIM_ID/similarity-histogram?threshold=0.85"
# Per-pair rows; sort by worst-first to surface outliers
curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
"https://api.modelux.ai/manage/v1/experiments/$SIM_ID/results?sort=similarity_asc&limit=10"
# Or just candidate-side errors
curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
"https://api.modelux.ai/manage/v1/experiments/$SIM_ID/results?candidate_status=error"
The histogram returns mean similarity, scored vs. unscored counts, and the fraction of pairs at or above the agreement threshold. The dashboard renders this as a single headline: “73% agreement (≥0.85) over 1,000 scored pairs”.
For the outliers endpoint, sort options are:
- `similarity_asc` / `similarity_desc` — worst-first / best-first
- `cost_delta_desc` / `cost_delta_asc` — most cost-regressive / most cost-saving
- `timestamp_desc` (default) / `timestamp_asc`

Filters: `candidate_status=ok|error|any`, `min_similarity`, `max_similarity`.
Setting any similarity filter excludes unscored rows, so you can ask “show
me the 50 worst scored pairs” without wading through routing-only zeros.
Reading similarity scores
The similarity score is the cosine similarity between the embedding of the
baseline response and the embedding of the candidate response, both run
through text-embedding-3-small by default. Range is [0, 1]:
- 0.95+ — near-identical answers. Safe candidate substitution.
- 0.80–0.95 — same content, different wording. Usually fine.
- 0.60–0.80 — meaningful divergence. Worth a human spot-check.
- Below 0.60 — different answers. Open the diff drawer in the dashboard or fetch the pair via `GET /manage/v1/logs/{request_id}` to see whether the candidate is wrong, more terse, or just stylistically different.
The agreement threshold defaults to 0.85 in the histogram endpoint and the
dashboard headline; raise or lower it via the ?threshold= query param.
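For reference, the score itself is plain cosine similarity over the two embedding vectors. A self-contained sketch (the embedding call is elided; the vectors here are stand-ins for `text-embedding-3-small` outputs):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two embedding vectors.

    Mathematically the range is [-1, 1]; embeddings of natural-language
    responses in practice land in roughly [0, 1], which is the range the
    buckets above assume.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```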
MCP
All experiment operations are exposed as MCP tools — create_experiment,
get_experiment, list_experiments, get_experiment_results,
get_experiment_similarity_histogram, cancel_experiment,
promote_experiment, estimate_experiment. With your IDE pointed at the
modelux MCP server, you can ask in natural language:
Simulate switching production to gpt-4o-mini for the last 7 days, with responses, and tell me if the similarity drop is acceptable.
See the MCP setup guide to wire it up.
What the cost estimate gets wrong
The estimator multiplies the baseline’s average input + output token counts by the candidate’s per-token price. That’s wrong when the candidate is systematically more or less verbose than the baseline:
- A baseline that answers “ok” in 5 tokens, vs. a candidate that answers “Of course! How can I help you today?” in 30 tokens, will under-estimate by ~6×.
- The opposite — a verbose baseline being replaced by a terser candidate — over-estimates.
The per-experiment spend cap exists exactly for this case. Pick a cap with
headroom (the `suggestedSpendCapUsd` is 2× the estimate by default), and
the engine will stop at your cap rather than blow past the estimate.
Run a judge eval on the results
Similarity tells you how often the candidate diverged from the
baseline. The next question is was the divergence acceptable? — i.e.
when the candidate produced a different answer, was it actually worse,
or just stylistically different? A judge eval runs a stratified
sample of your with-responses pairs through a third LLM (the “judge”)
that labels each pair better / equivalent / worse against a
rubric you supply.
Judge runs are explicit. They cost real money on top of the with-responses experiment itself, so we never start one automatically.
Preflight
curl -X POST https://api.modelux.ai/manage/v1/experiments/$SIM_ID/judge/preflight \
-H "Authorization: Bearer $MODELUX_MGMT_KEY" \
-H "Content-Type: application/json" \
-d '{
"judgeModel": "gpt-4o",
"stratification": { "high": 100, "mild": 100, "major": 100 }
}'
Returns the effective per-bucket count (after redistribution if any bucket is short on pairs), the dollar estimate, and a suggested cap:
{
"data": {
"effectivePairCount": 300,
"effectiveByBucket": { "high": 100, "mild": 100, "major": 100 },
"estimatedCostUsd": 2.10,
"perCallEstimateUsd": 0.007,
"spendCapUsd": 20,
"suggestedSpendCapUsd": 4.20,
"requiresConfirmation": false,
"notes": [],
"rubric": "Is the candidate response at least as correct, helpful, and complete as the baseline response for this prompt? Consider correctness first, then helpfulness, then completeness.",
"rubricHash": "a1b2c3..."
}
}
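The redistribution mentioned above isn't specified in detail on this page. One plausible sketch, which caps each bucket at its available pairs and then fills the shortfall from buckets with spares — treat the policy itself as an assumption, not documented behaviour:

```python
def effective_by_bucket(requested, available):
    """Cap each bucket at its available pair count, then redistribute the
    shortfall to buckets with unused pairs (assumed policy, for illustration)."""
    take = {b: min(requested[b], available[b]) for b in requested}
    shortfall = sum(requested[b] - take[b] for b in requested)
    for b in take:
        if shortfall == 0:
            break
        spare = available[b] - take[b]
        extra = min(spare, shortfall)
        take[b] += extra
        shortfall -= extra
    return take
```

Whatever the exact rule, the preflight response is authoritative: read `effectiveByBucket` rather than assuming your requested counts were honoured.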
Create
curl -X POST https://api.modelux.ai/manage/v1/experiments/$SIM_ID/judge \
-H "Authorization: Bearer $MODELUX_MGMT_KEY" \
-H "Content-Type: application/json" \
-d '{
"judgeModel": "gpt-4o",
"rubric": "Focus on factual correctness for medical questions.",
"stratification": { "high": 100, "mild": 100, "major": 100 },
"spendCapUsd": 4.20
}'
The run kicks off asynchronously. Poll GET .../judge/{judgeRunId}
until status is complete; the summary field then contains the
verdict breakdown plus Wilson 95% confidence intervals on each
proportion.
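The Wilson interval is a standard formula and cheap to sanity-check locally. A sketch, with z = 1.96 for the 95% level:

```python
import math

def wilson_ci(p_hat, n, z=1.96):
    """Wilson score interval for a proportion p_hat observed over n trials."""
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

For a 41% `better` rate over 300 judged pairs this gives approximately (0.356, 0.467), consistent with the rounded values in the example summary below.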
Read the verdicts
# Run-level summary
curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
"https://api.modelux.ai/manage/v1/experiments/$SIM_ID/judge/$RUN_ID"
# Per-pair verdicts; surface the worst-quality outliers first
curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
"https://api.modelux.ai/manage/v1/experiments/$SIM_ID/judge/$RUN_ID/results?verdict=worse"
# Or focus on a single similarity tier (e.g. did high-similarity pairs
# all judge equivalent, as expected?)
curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
"https://api.modelux.ai/manage/v1/experiments/$SIM_ID/judge/$RUN_ID/results?bucket=high"
The summary shape:
{
"judged_count": 300,
"parse_errors": 0,
"spend_usd": 2.05,
"verdict_breakdown": { "better": 0.41, "equivalent": 0.41, "worse": 0.18 },
"verdict_breakdown_ci_95": {
"better": { "lower": 0.36, "upper": 0.46 },
"equivalent": { "lower": 0.36, "upper": 0.46 },
"worse": { "lower": 0.14, "upper": 0.23 }
},
"by_bucket": {
"high": { "n": 100, "verdicts": { "better": 0.20, "equivalent": 0.75, "worse": 0.05 } },
"mild": { "n": 100, "verdicts": { ... } },
"major": { "n": 100, "verdicts": { ... } }
}
}
Reading the breakdown
A few rules of thumb:
- High-bucket pairs should mostly be `equivalent`. If high-similarity pairs are showing meaningful `worse` rates, the similarity score is misleading you — likely a stylistic-vs-substantive issue the embedder isn't catching. Consider a different embedding model.
- Major-bucket `worse` rate is the headline risk. If 35% of major-divergence pairs judge worse, the candidate has substantive quality drops on ~35% of the prompts where it answered very differently. That's roughly `0.35 × major_share_of_traffic` of your real users.
- Wilson CIs matter on small samples. A 41% better rate with CI [36%, 46%] is meaningfully different from 41% with CI [10%, 70%]; the latter just means “we don't know yet.” The 100-per-bucket default gives ±5% on the headline; 500-per-bucket gives ±2%.
- Watch the `parse_errors` count. A high parse-error rate means the judge model isn't following the JSON contract reliably. Switch to a different judge (Claude tends to be slightly better at structured output than GPT-4o-mini, but more expensive).
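The major-bucket rule of thumb generalises: weight each bucket's worse rate by its share of real traffic to get the expected fraction of users who'd see a substantively worse answer. A sketch with made-up traffic shares:

```python
def expected_worse_share(traffic_share, worse_rate):
    """Fraction of real traffic expected to get a substantively worse answer,
    given per-bucket traffic shares and per-bucket judged worse rates."""
    return sum(traffic_share[b] * worse_rate[b] for b in traffic_share)
```

If 10% of traffic falls in the major bucket and 35% of that judges worse, the major bucket alone contributes 3.5 points to the total.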
Multiple runs per experiment
Each judge run is immutable — different rubric, model, or sample size
all create a new judge_run row. Use this to:
- A/B two rubrics (“focus on correctness” vs. “focus on conciseness”) on the same pair set.
- Cross-check by running the same rubric through two judge models. If they disagree, your rubric is probably ambiguous.
- Re-judge a subset (`stratification: { "high": 0, "mild": 0, "major": 200 }`) to drill into one bucket more cheaply.
Default judge model + costs
The judge model defaults to gpt-4o. Approximate pricing for the standard 300-pair run:
| Judge model | Cost |
|---|---|
| gpt-4o-mini | $0.13 |
| gpt-4o | $2.10 |
| claude-sonnet-4 | $2.70 |
A with-responses experiment plus a default judge run totals roughly $5 on default settings.
Attaching a hypothesis
Every experiment accepts an optional hypothesis — freeform text, up
to 2,000 characters, never parsed. It renders on the result page next
to the deltas grid so readers know what question the experiment was
meant to answer.
{
"name": "cheaper-model-rehearsal",
"projectId": "proj_...",
"candidatePolicy": "single",
"candidateConfig": { "model": "gpt-4o-mini", "provider_credential_id": "..." },
"windowStart": "2026-04-13T00:00:00Z",
"windowEnd": "2026-04-20T00:00:00Z",
"hypothesis": "Switching to gpt-4o-mini will cut cost ~30% without a noticeable drop in quality."
}
One-off experiments stay exploratory — they do not compute a pass/fail verdict in v1, even when a hypothesis is set. For machine-checkable verdicts, attach the experiment to a scheduled experiment and set structured success criteria.
Limits
- Sample size (`with_responses`): 1,000 minimum, 50,000 maximum per experiment in v1.
- Experiment spend cap defaults to $50; override via `experiment_spend_cap_usd` on the project.
- Judge run spend cap defaults to $20; override via `judge_spend_cap_usd`.
- Judge stratification total capped at 1,500 pairs in v1.
- `with_responses` requires the project to be in `full` logging mode.
- `ensemble` policies cannot be used as candidates in experiments.
- Each experiment / judge run has a 30-minute hard timeout.
API reference
Experiments:
- `POST /manage/v1/experiments` — create
- `POST /manage/v1/experiments/preflight` — preflight cost estimate
- `GET /manage/v1/experiments/{id}` — fetch
- `GET /manage/v1/experiments/{id}/results` — per-pair rows
- `GET /manage/v1/experiments/{id}/similarity-histogram` — histogram + headline
- `POST /manage/v1/experiments/{id}/cancel` — cancel a running experiment
- `POST /manage/v1/experiments/{id}/promote` — promote candidate to production
Judge runs:
- `POST /manage/v1/experiments/{id}/judge` — create
- `POST /manage/v1/experiments/{id}/judge/preflight` — preflight cost estimate
- `GET /manage/v1/experiments/{id}/judge` — list runs for an experiment
- `GET /manage/v1/experiments/{id}/judge/{judgeRunId}` — fetch
- `GET /manage/v1/experiments/{id}/judge/{judgeRunId}/results` — per-pair verdicts
- `POST /manage/v1/experiments/{id}/judge/{judgeRunId}/cancel` — cancel a running judge run