Running experiments

Before you change production routing, you want to know: would this candidate config save money? Speed things up? Break anything? modelux experiments answer those questions by replaying your real historical traffic against a proposed config and comparing the results to what actually ran.

There are two modes:

  • Routing-only — replays the routing decision against historical logs. No provider calls, no charges, no quality signal. Cost and latency are estimated from token counts × pricing. Free, fast, fine for “what model would this have picked?”
  • With responses — re-runs a sample of historical prompts through the candidate model for real, captures the response, and scores each pair against the baseline response with cosine similarity. Real spend, real latency, and a per-pair “did the candidate produce a meaningfully different answer?” signal.

Pick routing-only when you only care about cost or routing logic. Pick with-responses when you care about whether a different model actually produces acceptable answers.

Routing-only: a quick what-if

Replay last week’s traffic through a candidate config:

curl -X POST https://api.modelux.ai/manage/v1/experiments \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "switch to gpt-4o-mini",
    "projectId": "proj_abc",
    "candidatePolicy": "single",
    "candidateConfig": {
      "model": "gpt-4o-mini",
      "provider_credential_id": "pc_openai_default"
    },
    "windowStart": "2026-04-10T00:00:00Z",
    "windowEnd":   "2026-04-17T00:00:00Z"
  }'

The experiment runs asynchronously. Poll GET /manage/v1/experiments/{id} until status is completed, then read the baseline_summary and candidate_summary JSON for cost, p50/p95 latency, and route distribution.
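
A minimal polling sketch; it assumes jq is installed, that $EXPERIMENT_ID holds the id from the create response, and that the GET response nests fields under a top-level data key the way the preflight response does:

while true; do
  STATUS=$(curl -s -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
    "https://api.modelux.ai/manage/v1/experiments/$EXPERIMENT_ID" \
    | jq -r '.data.status')   # the ".data" envelope is an assumption here
  echo "status: $STATUS"
  [ "$STATUS" = "completed" ] && break
  sleep 10                    # runs are asynchronous; poll gently
done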

The dashboard shows the same data with a Promote to production button that creates a new routing config version from the candidate.

With-responses: actually call the model

This mode answers the question routing-only can’t: do the candidate’s responses look like the baseline’s?

A typical workflow:

  1. Preflight to get a cost estimate and a suggested spend cap.
  2. Create the experiment with mode: "with_responses", the sample spec, and a spendCapUsd (the engine cancels mid-run if actual spend exceeds it).
  3. Watch the dashboard for the similarity histogram, headline “X% agreement” number, and worst-similarity outlier pairs.

Preflight

curl -X POST https://api.modelux.ai/manage/v1/experiments/preflight \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "projectId": "proj_abc",
    "candidatePolicy": "single",
    "candidateConfig": {
      "model": "gpt-4o-mini",
      "provider_credential_id": "pc_openai_default"
    },
    "windowStart": "2026-04-10T00:00:00Z",
    "windowEnd":   "2026-04-17T00:00:00Z",
    "sample": { "size": 1000, "method": "random" }
  }'

You get back something like:

{
  "data": {
    "estimate": {
      "estimatedCostUsd": 0.0003,
      "effectiveSampleSize": 81,
      "rowCountInWindow": 81,
      "perCallEstimateUsd": 0.0000037,
      "embeddingCostUsd": 0.0000016,
      "notes": []
    },
    "spendCapUsd": 50,
    "suggestedSpendCapUsd": 0.05,
    "requiresConfirmation": false,
    "autoCancelRatio": 1.5
  }
}

effectiveSampleSize is min(sample.size, rowCountInWindow) — the actual number of pairs the engine will run. If you ask for 1,000 but the window only has 81 rows, you get 81.

suggestedSpendCapUsd is max(estimate × 2, $0.05), clamped to the project-level spendCapUsd. In the response above that works out to max(0.0003 × 2, 0.05) = $0.05. The $0.05 floor stops penny-level dev experiments from being cancelled at fractions of a cent.

Create

Pass the suggested cap (or your own) as spendCapUsd:

curl -X POST https://api.modelux.ai/manage/v1/experiments \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "gpt-4o-mini quality check",
    "projectId": "proj_abc",
    "candidatePolicy": "single",
    "candidateConfig": {
      "model": "gpt-4o-mini",
      "provider_credential_id": "pc_openai_default"
    },
    "windowStart": "2026-04-10T00:00:00Z",
    "windowEnd":   "2026-04-17T00:00:00Z",
    "mode": "with_responses",
    "sample": { "size": 1000, "method": "random" },
    "spendCapUsd": 0.05
  }'

spendCapUsd is required for with_responses and must be ≤ the project’s experiment_spend_cap_usd setting. The engine checks running spend every 50 rows and cancels the experiment if it crosses your cap; the cancellation reason names your cap explicitly, so you know exactly what tripped it.

There’s also a hard rule: projects in metadata_only logging mode are rejected because there’s no logged response to compare against.

Read the similarity output

Once the experiment is completed, two endpoints expose the quality signal:

# Headline + histogram (10 fixed buckets across [0, 1])
curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  "https://api.modelux.ai/manage/v1/experiments/$SIM_ID/similarity-histogram?threshold=0.85"

# Per-pair rows; sort by worst-first to surface outliers
curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  "https://api.modelux.ai/manage/v1/experiments/$SIM_ID/results?sort=similarity_asc&limit=10"

# Or just candidate-side errors
curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  "https://api.modelux.ai/manage/v1/experiments/$SIM_ID/results?candidate_status=error"

The histogram returns mean similarity, scored vs. unscored counts, and the fraction of pairs at or above the agreement threshold. The dashboard renders this as a single headline: “73% agreement (≥0.85) over 1,000 scored pairs”.

For the outliers endpoint, sort options are:

  • similarity_asc / similarity_desc — worst-first / best-first
  • cost_delta_desc / cost_delta_asc — most cost-regressive / most savings
  • timestamp_desc (default) / timestamp_asc

Filters: candidate_status=ok|error|any, min_similarity, max_similarity. Setting any similarity filter excludes unscored rows, so you can ask “show me the 50 worst scored pairs” without wading through routing-only zeros.
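
Putting those together, one way to ask for the 50 worst scored pairs using only the parameters documented above (min_similarity=0 matches every scored row while still excluding unscored ones):

curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  "https://api.modelux.ai/manage/v1/experiments/$SIM_ID/results?min_similarity=0&sort=similarity_asc&limit=50"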

Reading similarity scores

The similarity score is the cosine similarity between the embedding of the baseline response and the embedding of the candidate response, both run through text-embedding-3-small by default. Range is [0, 1]:

  • 0.95+ — near-identical answers. Safe candidate substitution.
  • 0.80–0.95 — same content, different wording. Usually fine.
  • 0.60–0.80 — meaningful divergence. Worth a human spot-check.
  • Below 0.60 — different answers. Open the diff drawer in the dashboard or fetch the pair via GET /manage/v1/logs/{request_id} to see whether the candidate is wrong, more terse, or just stylistically different.
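
For that last tier, pulling the stored pair is a single call; $REQUEST_ID is the request id of the outlier pair you want to inspect:

curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  "https://api.modelux.ai/manage/v1/logs/$REQUEST_ID"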

The agreement threshold defaults to 0.85 in the histogram endpoint and the dashboard headline; raise or lower it via the ?threshold= query param.

MCP

All experiment operations are exposed as MCP tools — create_experiment, get_experiment, list_experiments, get_experiment_results, get_experiment_similarity_histogram, cancel_experiment, promote_experiment, estimate_experiment. With your IDE pointed at the modelux MCP server, you can ask in natural language:

Simulate switching production to gpt-4o-mini for the last 7 days, with responses, and tell me if the similarity drop is acceptable.

See the MCP setup guide to wire it up.

What the cost estimate gets wrong

The estimator multiplies the baseline’s average input + output token counts by the candidate’s per-token price. That’s wrong when the candidate is systematically more or less verbose than the baseline:

  • A baseline that answers “ok” in 5 tokens, vs. a candidate that answers “Of course! How can I help you today?” in 30 tokens, will under-estimate by ~6×.
  • The opposite — a verbose baseline being replaced by a terser candidate — over-estimates.

The per-experiment spend cap exists exactly for this case. Pick a cap with headroom (suggestedSpendCapUsd is 2× the estimate by default), and the engine will stop at your cap rather than blow past the estimate.

Run a judge eval on the results

Similarity tells you how often the candidate diverged from the baseline. The next question is whether the divergence was acceptable: when the candidate produced a different answer, was it actually worse, or just stylistically different? A judge eval runs a stratified sample of your with-responses pairs through a third LLM (the “judge”) that labels each pair better / equivalent / worse against a rubric you supply.

Judge runs are explicit. They cost real money on top of the with-responses experiment itself, so we never start one automatically.

Preflight

curl -X POST https://api.modelux.ai/manage/v1/experiments/$SIM_ID/judge/preflight \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "judgeModel": "gpt-4o",
    "stratification": { "high": 100, "mild": 100, "major": 100 }
  }'

Returns the effective per-bucket count (after redistribution if any bucket is short on pairs), the dollar estimate, and a suggested cap:

{
  "data": {
    "effectivePairCount": 300,
    "effectiveByBucket": { "high": 100, "mild": 100, "major": 100 },
    "estimatedCostUsd": 2.10,
    "perCallEstimateUsd": 0.007,
    "spendCapUsd": 20,
    "suggestedSpendCapUsd": 4.20,
    "requiresConfirmation": false,
    "notes": [],
    "rubric": "Is the candidate response at least as correct, helpful, and complete as the baseline response for this prompt? Consider correctness first, then helpfulness, then completeness.",
    "rubricHash": "a1b2c3..."
  }
}

Create

curl -X POST https://api.modelux.ai/manage/v1/experiments/$SIM_ID/judge \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "judgeModel": "gpt-4o",
    "rubric": "Focus on factual correctness for medical questions.",
    "stratification": { "high": 100, "mild": 100, "major": 100 },
    "spendCapUsd": 4.20
  }'

The run kicks off asynchronously. Poll GET .../judge/{judgeRunId} until status is complete; the summary field then contains the verdict breakdown plus Wilson 95% confidence intervals on each proportion.

Read the verdicts

# Run-level summary
curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  "https://api.modelux.ai/manage/v1/experiments/$SIM_ID/judge/$RUN_ID"

# Per-pair verdicts; surface the worst-quality outliers first
curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  "https://api.modelux.ai/manage/v1/experiments/$SIM_ID/judge/$RUN_ID/results?verdict=worse"

# Or focus on a single similarity bucket (e.g. did high-similarity
# pairs all judge equivalent, as expected?)
curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  "https://api.modelux.ai/manage/v1/experiments/$SIM_ID/judge/$RUN_ID/results?bucket=high"

The summary shape:

{
  "judged_count": 300,
  "parse_errors": 0,
  "spend_usd": 2.05,
  "verdict_breakdown": { "better": 0.41, "equivalent": 0.41, "worse": 0.18 },
  "verdict_breakdown_ci_95": {
    "better":     { "lower": 0.36, "upper": 0.46 },
    "equivalent": { "lower": 0.36, "upper": 0.46 },
    "worse":      { "lower": 0.14, "upper": 0.23 }
  },
  "by_bucket": {
    "high":  { "n": 100, "verdicts": { "better": 0.20, "equivalent": 0.75, "worse": 0.05 } },
    "mild":  { "n": 100, "verdicts": { ... } },
    "major": { "n": 100, "verdicts": { ... } }
  }
}

Reading the breakdown

A few rules of thumb:

  • High-bucket pairs should mostly be equivalent. If high-similarity pairs are showing meaningful worse rates, the similarity score is misleading you — likely a stylistic-vs-substantive issue the embedder isn’t catching. Consider a different embedding model.
  • Major-bucket worse rate is the headline risk. If 35% of major-divergence pairs judge worse, the candidate has substantive quality drops on ~35% of the prompts where it answered very differently. That’s roughly 0.35 × major_share_of_traffic of your real users.
  • Wilson CIs matter on small samples. A 41% better rate with CI [36%, 46%] is meaningfully different from 41% with CI [10%, 70%]; the latter just means “we don’t know yet.” The 100-per-bucket default gives ±5% on the headline; 500-per-bucket gives ±2% (the arithmetic is sketched just after this list).
  • Watch the parse_errors count. A high parse-error rate means the judge model isn’t following the JSON contract reliably. Switch to a different judge (Claude tends to be slightly better at structured output than GPT-4o-mini, but more expensive).
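
To sanity-check those widths: for a proportion p over n judged pairs, the 95% half-width is roughly 1.96 × sqrt(p × (1 − p) / n), the normal approximation, which tracks the Wilson interval closely at these sample sizes. With the default three buckets of 100, the headline proportion p = 0.41 gives 1.96 × sqrt(0.41 × 0.59 / 300) ≈ 0.056, the ±5% above; at 500 per bucket (n = 1,500) it shrinks to ≈ 0.025.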

Multiple runs per experiment

Each judge run is immutable — different rubric, model, or sample size all create a new judge_run row. Use this to:

  • A/B two rubrics (“focus on correctness” vs. “focus on conciseness”) on the same pair set.
  • Cross-check by running the same rubric through two judge models. If they disagree, your rubric is probably ambiguous.
  • Re-judge a subset (stratification: { high: 0, mild: 0, major: 200 }) to drill into one bucket more cheaply, as in the sketch below.
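
Spelling out that third pattern; this assumes the default rubric applies when none is supplied, and the $1 cap is illustrative headroom for 200 gpt-4o-mini calls (run the judge preflight first and size spendCapUsd from its suggestion):

curl -X POST https://api.modelux.ai/manage/v1/experiments/$SIM_ID/judge \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "judgeModel": "gpt-4o-mini",
    "stratification": { "high": 0, "mild": 0, "major": 200 },
    "spendCapUsd": 1.00
  }'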

Default judge model + costs

The judge model defaults to gpt-4o. Pricing for the standard 300-pair run:

Judge model       Cost
gpt-4o-mini       $0.13
gpt-4o            $2.10
claude-sonnet-4   $2.70

Total for both phases on default settings (the with-responses experiment plus the judge run): ~$5.

Attaching a hypothesis

Every experiment accepts an optional hypothesis — freeform text, up to 2,000 characters, never parsed. It renders on the result page next to the deltas grid so readers know what question the experiment was meant to answer.

{
  "name": "cheaper-model-rehearsal",
  "projectId": "proj_...",
  "candidatePolicy": "single",
  "candidateConfig": { "model": "gpt-4o-mini", "provider_credential_id": "..." },
  "windowStart": "2026-04-13T00:00:00Z",
  "windowEnd": "2026-04-20T00:00:00Z",
  "hypothesis": "Switching to gpt-4o-mini will cut cost ~30% without a noticeable drop in quality."
}

One-off experiments stay exploratory — they do not compute a pass/fail verdict in v1, even when a hypothesis is set. For machine-checkable verdicts, attach the experiment to a scheduled experiment and set structured success criteria.

Limits

  • Sample size (with_responses): 1,000 minimum, 50,000 maximum per experiment in v1.
  • Experiment spend cap defaults to $50; override via experiment_spend_cap_usd on the project.
  • Judge run spend cap defaults to $20; override via judge_spend_cap_usd.
  • Judge stratification total capped at 1,500 pairs in v1.
  • with_responses requires the project to be in full logging mode.
  • ensemble policies cannot be used as candidates in experiments.
  • Each experiment / judge run has a 30-minute hard timeout.

API reference

Experiments:

  • POST /manage/v1/experiments — create an experiment (routing-only or with_responses)
  • POST /manage/v1/experiments/preflight — cost estimate and suggested spend cap
  • GET /manage/v1/experiments/{id} — status plus baseline_summary / candidate_summary
  • GET /manage/v1/experiments/{id}/results — per-pair rows (sortable, filterable)
  • GET /manage/v1/experiments/{id}/similarity-histogram — headline agreement + 10-bucket histogram

Judge runs:

  • POST /manage/v1/experiments/{id}/judge/preflight — estimate and suggested cap
  • POST /manage/v1/experiments/{id}/judge — start a judge run
  • GET /manage/v1/experiments/{id}/judge/{judgeRunId} — status and verdict summary
  • GET /manage/v1/experiments/{id}/judge/{judgeRunId}/results — per-pair verdicts