
> Replay historical traffic against a candidate routing config — with or without making real LLM calls — to see what would happen before you ship.

# Running experiments

Before you change production routing, you want to know: would this candidate
config save money? Speed things up? Break anything? modelux experiments
answer that question by replaying your real historical traffic against a
proposed config and comparing the results to what actually ran.

There are two modes:

- **Routing-only** — replays the *routing decision* against historical logs.
  No provider calls, no charges, no quality signal. Cost and latency are
  estimated from token counts × pricing. Free, fast, fine for "what model
  would this have picked?"
- **With responses** — re-runs a sample of historical prompts through the
  candidate model for real, captures the response, and scores each pair
  against the baseline response with cosine similarity. Real spend, real
  latency, and a per-pair "did the candidate produce a meaningfully
  different answer?" signal.

Pick routing-only when you only care about cost or routing logic. Pick
with-responses when you care about whether a different model actually
produces acceptable answers.

## Routing-only: a quick what-if

Replay last week's traffic through a candidate config:

```bash
curl -X POST https://api.modelux.ai/manage/v1/experiments \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "switch to gpt-4o-mini",
    "projectId": "proj_abc",
    "candidatePolicy": "single",
    "candidateConfig": {
      "model": "gpt-4o-mini",
      "provider_credential_id": "pc_openai_default"
    },
    "windowStart": "2026-04-10T00:00:00Z",
    "windowEnd":   "2026-04-17T00:00:00Z"
  }'
```

The experiment runs asynchronously. Poll `GET /manage/v1/experiments/{id}`
until `status` is `completed`, then read the `baseline_summary` and
`candidate_summary` JSON for cost, p50/p95 latency, and route distribution.
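
A minimal polling sketch in bash, assuming `jq` is installed and that the
response wraps the experiment in the same `data` envelope as the preflight
examples below (the id is illustrative):

```bash
EXPERIMENT_ID=exp_123   # illustrative; use the id from the create response
# ".data.status" assumes the envelope shape shown in the preflight examples
until [ "$(curl -s -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  "https://api.modelux.ai/manage/v1/experiments/$EXPERIMENT_ID" \
  | jq -r '.data.status')" = "completed" ]; do
  sleep 10
done
```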

The dashboard shows the same data with a **Promote to production** button
that creates a new routing config version from the candidate.
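
The promote endpoint (listed in the API reference below) does the same
thing from the command line; a sketch, assuming it takes no request body:

```bash
# Promote the candidate config to production (assumption: no request body)
curl -X POST https://api.modelux.ai/manage/v1/experiments/$EXPERIMENT_ID/promote \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY"
```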

## With-responses: actually call the model

This mode answers the question routing-only can't: *do the candidate's
responses look like the baseline's?*

A typical workflow:

1. **Preflight** to get a cost estimate and a suggested spend cap.
2. **Create** the experiment with `mode: "with_responses"`, the sample
   spec, and a `spendCapUsd` (the engine cancels mid-run if actual
   spend exceeds it).
3. **Watch** the dashboard for the similarity histogram, headline
   "X% agreement" number, and worst-similarity outlier pairs.

### Preflight

```bash
curl -X POST https://api.modelux.ai/manage/v1/experiments/preflight \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "projectId": "proj_abc",
    "candidatePolicy": "single",
    "candidateConfig": {
      "model": "gpt-4o-mini",
      "provider_credential_id": "pc_openai_default"
    },
    "windowStart": "2026-04-10T00:00:00Z",
    "windowEnd":   "2026-04-17T00:00:00Z",
    "sample": { "size": 1000, "method": "random" }
  }'
```

You get back something like:

```json
{
  "data": {
    "estimate": {
      "estimatedCostUsd": 0.0003,
      "effectiveSampleSize": 81,
      "rowCountInWindow": 81,
      "perCallEstimateUsd": 0.0000037,
      "embeddingCostUsd": 0.0000016,
      "notes": []
    },
    "spendCapUsd": 50,
    "suggestedSpendCapUsd": 0.05,
    "requiresConfirmation": false,
    "autoCancelRatio": 1.5
  }
}
```

`effectiveSampleSize` is `min(sample.size, rowCountInWindow)` — the actual
number of pairs the engine will run. If you ask for 1,000 but the window
only has 81 rows, you get 81.

`suggestedSpendCapUsd` is `max(estimate × 2, $0.05)`, clamped to
`spendCapUsd`. The $0.05 floor stops penny-scale dev experiments from
being cancelled at fractions of a cent.
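
Plugging in the numbers from the example response:

```
estimatedCostUsd     ≈ 81 calls × $0.0000037/call ≈ $0.0003
suggestedSpendCapUsd = max($0.0003 × 2, $0.05)    = $0.05   # the floor wins here
```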

### Create

Pass the suggested cap (or your own) as `spendCapUsd`:

```bash
curl -X POST https://api.modelux.ai/manage/v1/experiments \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "gpt-4o-mini quality check",
    "projectId": "proj_abc",
    "candidatePolicy": "single",
    "candidateConfig": {
      "model": "gpt-4o-mini",
      "provider_credential_id": "pc_openai_default"
    },
    "windowStart": "2026-04-10T00:00:00Z",
    "windowEnd":   "2026-04-17T00:00:00Z",
    "mode": "with_responses",
    "sample": { "size": 1000, "method": "random" },
    "spendCapUsd": 0.05
  }'
```

`spendCapUsd` is required for `with_responses` and must be ≤ the project's
`experiment_spend_cap_usd` setting. The engine watches the running spend
every 50 rows and cancels the experiment if it crosses your cap. The
cancel reason names your cap explicitly so you know what tripped.
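
You can also stop a run yourself via the cancel endpoint (listed in the
API reference below; assuming it takes no request body):

```bash
curl -X POST https://api.modelux.ai/manage/v1/experiments/$EXPERIMENT_ID/cancel \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY"
```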

There's also a hard rule: projects in `metadata_only` logging mode are
rejected because there's no logged response to compare against.

### Read the similarity output

Once the experiment is `completed`, three endpoints expose the
quality signal:

```bash
# Headline + histogram (10 fixed buckets across [0, 1])
curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  "https://api.modelux.ai/manage/v1/experiments/$SIM_ID/similarity-histogram?threshold=0.85"

# Per-pair rows; sort by worst-first to surface outliers
curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  "https://api.modelux.ai/manage/v1/experiments/$SIM_ID/results?sort=similarity_asc&limit=10"

# Or just candidate-side errors
curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  "https://api.modelux.ai/manage/v1/experiments/$SIM_ID/results?candidate_status=error"
```

The histogram returns mean similarity, scored vs. unscored counts, and the
fraction of pairs at or above the agreement threshold. The dashboard renders
this as a single headline: *"73% agreement (≥0.85) over 1,000 scored pairs"*.

For the outliers endpoint, sort options are:

- `similarity_asc` / `similarity_desc` — worst-first / best-first
- `cost_delta_desc` / `cost_delta_asc` — most cost-regressive / most savings
- `timestamp_desc` (default) / `timestamp_asc`

Filters: `candidate_status=ok|error|any`, `min_similarity`, `max_similarity`.
Setting any similarity filter excludes unscored rows, so you can ask "show
me the 50 worst scored pairs" without wading through routing-only zeros.
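
For example, the "50 worst scored pairs" query, built only from the
parameters documented above:

```bash
# min_similarity=0 excludes unscored rows, per the note above
curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  "https://api.modelux.ai/manage/v1/experiments/$EXPERIMENT_ID/results?sort=similarity_asc&limit=50&min_similarity=0"
```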

## Reading similarity scores

The similarity score is the cosine similarity between the embedding of the
baseline response and the embedding of the candidate response, both run
through `text-embedding-3-small` by default. Range is [0, 1]:

- **0.95+** — near-identical answers. Safe candidate substitution.
- **0.80–0.95** — same content, different wording. Usually fine.
- **0.60–0.80** — meaningful divergence. Worth a human spot-check.
- **Below 0.60** — different answers. Open the diff drawer in the
  dashboard or fetch the pair via `GET /manage/v1/logs/{request_id}` to
  see whether the candidate is wrong, more terse, or just stylistically
  different.

The agreement threshold defaults to 0.85 in the histogram endpoint and the
dashboard headline; raise or lower it via the `?threshold=` query param.
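
For example, to demand near-verbatim agreement in the headline:

```bash
curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  "https://api.modelux.ai/manage/v1/experiments/$EXPERIMENT_ID/similarity-histogram?threshold=0.95"
```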

## MCP

All experiment operations are exposed as MCP tools — `create_experiment`,
`get_experiment`, `list_experiments`, `get_experiment_results`,
`get_experiment_similarity_histogram`, `cancel_experiment`,
`promote_experiment`, `estimate_experiment`. With your IDE pointed at the
modelux MCP server, you can ask in natural language:

> Simulate switching production to gpt-4o-mini for the last 7 days, with
> responses, and tell me if the similarity drop is acceptable.

See the [MCP setup guide](/docs/guides/mcp-setup) to wire it up.

## What the cost estimate gets wrong

The estimator multiplies the *baseline's* average input + output token
counts by the *candidate's* per-token price. That's wrong when the
candidate is systematically more or less verbose than the baseline:

- A baseline that answers "ok" in 5 tokens, vs. a candidate that answers
  "Of course! How can I help you today?" in 30 tokens, will under-estimate
  the output-token cost by ~6×.
- The opposite — a verbose baseline being replaced by a terser candidate —
  over-estimates.
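
Written out, the estimate is roughly (a sketch of the description above,
not the exact engine formula):

```
estimatedCostUsd ≈ n_rows × ( avg_baseline_input_tokens  × candidate_input_price
                            + avg_baseline_output_tokens × candidate_output_price )
```

The output-token term is the one that drifts when verbosity differs.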

The per-experiment spend cap exists exactly for this case. Pick a cap with
headroom (the `suggestedSpendCapUsd` is 2× the estimate by default), and
the engine will stop at your cap rather than blow past the estimate.

## Run a judge eval on the results

Similarity tells you *how often* the candidate diverged from the
baseline. The next question is *was the divergence acceptable?* — i.e.
when the candidate produced a different answer, was it actually worse,
or just stylistically different? A **judge eval** runs a stratified
sample of your with-responses pairs through a third LLM (the "judge")
that labels each pair `better` / `equivalent` / `worse` against a
rubric you supply.

Judge runs are explicit. They cost real money on top of the with-responses
experiment itself, so we never start one automatically.

### Preflight

```bash
curl -X POST https://api.modelux.ai/manage/v1/experiments/$EXPERIMENT_ID/judge/preflight \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "judgeModel": "gpt-4o",
    "stratification": { "high": 100, "mild": 100, "major": 100 }
  }'
```

Returns the effective per-bucket count (after redistribution if any
bucket is short on pairs), the dollar estimate, and a suggested cap:

```json
{
  "data": {
    "effectivePairCount": 300,
    "effectiveByBucket": { "high": 100, "mild": 100, "major": 100 },
    "estimatedCostUsd": 2.10,
    "perCallEstimateUsd": 0.007,
    "spendCapUsd": 20,
    "suggestedSpendCapUsd": 4.20,
    "requiresConfirmation": false,
    "notes": [],
    "rubric": "Is the candidate response at least as correct, helpful, and complete as the baseline response for this prompt? Consider correctness first, then helpfulness, then completeness.",
    "rubricHash": "a1b2c3..."
  }
}
```

### Create

```bash
curl -X POST https://api.modelux.ai/manage/v1/experiments/$EXPERIMENT_ID/judge \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "judgeModel": "gpt-4o",
    "rubric": "Focus on factual correctness for medical questions.",
    "stratification": { "high": 100, "mild": 100, "major": 100 },
    "spendCapUsd": 4.20
  }'
```

The run kicks off asynchronously. Poll `GET .../judge/{judgeRunId}`
until status is `complete`; the `summary` field then contains the
verdict breakdown plus Wilson 95% confidence intervals on each
proportion.

### Read the verdicts

```bash
# Run-level summary
curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  "https://api.modelux.ai/manage/v1/experiments/$EXPERIMENT_ID/judge/$RUN_ID"

# Per-pair verdicts; surface the worst-quality outliers first
curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  "https://api.modelux.ai/manage/v1/experiments/$EXPERIMENT_ID/judge/$RUN_ID/results?verdict=worse"

# Or focus on a single similarity tier (e.g. did high-similarity pairs
# all judge equivalent, as expected?)
curl -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  "https://api.modelux.ai/manage/v1/experiments/$EXPERIMENT_ID/judge/$RUN_ID/results?bucket=high"
```

The summary shape:

```json
{
  "judged_count": 300,
  "parse_errors": 0,
  "spend_usd": 2.05,
  "verdict_breakdown": { "better": 0.41, "equivalent": 0.41, "worse": 0.18 },
  "verdict_breakdown_ci_95": {
    "better":     { "lower": 0.36, "upper": 0.46 },
    "equivalent": { "lower": 0.36, "upper": 0.46 },
    "worse":      { "lower": 0.14, "upper": 0.23 }
  },
  "by_bucket": {
    "high":  { "n": 100, "verdicts": { "better": 0.20, "equivalent": 0.75, "worse": 0.05 } },
    "mild":  { "n": 100, "verdicts": { ... } },
    "major": { "n": 100, "verdicts": { ... } }
  }
}
```

### Reading the breakdown

A few rules of thumb:

- **High-bucket pairs should mostly be `equivalent`.** If high-similarity
  pairs are showing meaningful `worse` rates, the similarity score is
  misleading you — likely a stylistic-vs-substantive issue the embedder
  isn't catching. Consider a different embedding model.
- **Major-bucket `worse` rate is the headline risk.** If 35% of
  major-divergence pairs judge worse, the candidate has substantive
  quality drops on ~35% of the prompts where it answered very
  differently. That's roughly `0.35 × major_share_of_traffic` of your
  real users (see the worked example after this list).
- **Wilson CIs matter on small samples.** A 41% better rate with CI
  [36%, 46%] is meaningfully different from 41% with CI [10%, 70%];
  the latter just means "we don't know yet." The 100-per-bucket
  default gives ±5% on the headline; 500-per-bucket gives ±2%.
- **Watch the `parse_errors` count.** A high parse-error rate means
  the judge model isn't following the JSON contract reliably. Switch
  to a different judge (Claude tends to be slightly better at
  structured output than GPT-4o-mini, but more expensive).
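
To make the major-bucket arithmetic concrete (the traffic share here is
hypothetical):

```
worse_rate(major)       = 0.35
major share of traffic  = 0.10    # hypothetical: 10% of replayed requests diverged majorly
real-user regression    ≈ 0.35 × 0.10 = 3.5% of production requests
```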

### Multiple runs per experiment

Each judge run is immutable — different rubric, model, or sample size
all create a new `judge_run` row. Use this to:

- A/B two rubrics ("focus on correctness" vs. "focus on conciseness")
  on the same pair set.
- Cross-check by running the same rubric through two judge models. If
  they disagree, your rubric is probably ambiguous.
- Re-judge a subset (`stratification: { high: 0, mild: 0, major: 200 }`)
  to drill into one bucket more cheaply (see the request sketch below).
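
A major-only re-judge might look like this (sketch; assumes `rubric` is
optional and falls back to the default returned by preflight, and the
spend cap is illustrative):

```bash
curl -X POST https://api.modelux.ai/manage/v1/experiments/$EXPERIMENT_ID/judge \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "judgeModel": "gpt-4o",
    "stratification": { "high": 0, "mild": 0, "major": 200 },
    "spendCapUsd": 2.00
  }'
```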

### Default judge model + costs

The judge model defaults to **gpt-4o**. Pricing for the standard 300-pair run:

| Judge model | Cost |
|---|---|
| gpt-4o-mini | $0.13 |
| gpt-4o | $2.10 |
| claude-sonnet-4 | $2.70 |

Total for the with-responses experiment plus the judge run on default settings: **~$5**.

## Attaching a hypothesis

Every experiment accepts an optional `hypothesis` — freeform text, up
to 2,000 characters, never parsed. It renders on the result page next
to the deltas grid so readers know what question the experiment was
meant to answer.

```json
{
  "name": "cheaper-model-rehearsal",
  "projectId": "proj_...",
  "candidatePolicy": "single",
  "candidateConfig": { "model": "gpt-4o-mini", "provider_credential_id": "..." },
  "windowStart": "2026-04-13T00:00:00Z",
  "windowEnd": "2026-04-20T00:00:00Z",
  "hypothesis": "Switching to gpt-4o-mini will cut cost ~30% without a noticeable drop in quality."
}
```

One-off experiments stay exploratory — they do not compute a pass/fail
verdict in v1, even when a hypothesis is set. For machine-checkable
verdicts, attach the experiment to a [scheduled
experiment](/docs/guides/scheduled-experiments) and set structured
[success criteria](/docs/guides/experiment-success-criteria).

## Limits

- Sample size (with_responses): 1,000 minimum, 50,000 maximum per
  experiment in v1.
- Experiment spend cap defaults to $50; override via
  `experiment_spend_cap_usd` on the project.
- Judge run spend cap defaults to $20; override via
  `judge_spend_cap_usd`.
- Judge stratification total capped at 1,500 pairs in v1.
- `with_responses` requires the project to be in `full` logging mode.
- `ensemble` policies cannot be used as candidates in experiments.
- Each experiment / judge run has a 30-minute hard timeout.

## API reference

Experiments:

- [`POST /manage/v1/experiments`](/openapi.yaml#/paths/~1experiments/post) — create
- [`POST /manage/v1/experiments/preflight`](/openapi.yaml#/paths/~1experiments~1preflight/post) — preflight cost estimate
- [`GET /manage/v1/experiments/{id}`](/openapi.yaml#/paths/~1experiments~1%7Bid%7D/get) — fetch
- [`GET /manage/v1/experiments/{id}/results`](/openapi.yaml#/paths/~1experiments~1%7Bid%7D~1results/get) — per-pair rows
- [`GET /manage/v1/experiments/{id}/similarity-histogram`](/openapi.yaml#/paths/~1experiments~1%7Bid%7D~1similarity-histogram/get) — histogram + headline
- [`POST /manage/v1/experiments/{id}/cancel`](/openapi.yaml#/paths/~1experiments~1%7Bid%7D~1cancel/post) — cancel running
- [`POST /manage/v1/experiments/{id}/promote`](/openapi.yaml#/paths/~1experiments~1%7Bid%7D~1promote/post) — promote candidate to production

Judge runs:

- [`POST /manage/v1/experiments/{id}/judge`](/openapi.yaml#/paths/~1experiments~1%7Bid%7D~1judge/post) — create
- [`POST /manage/v1/experiments/{id}/judge/preflight`](/openapi.yaml#/paths/~1experiments~1%7Bid%7D~1judge~1preflight/post) — preflight cost estimate
- [`GET /manage/v1/experiments/{id}/judge`](/openapi.yaml#/paths/~1experiments~1%7Bid%7D~1judge/get) — list runs for an experiment
- [`GET /manage/v1/experiments/{id}/judge/{judgeRunId}`](/openapi.yaml#/paths/~1experiments~1%7Bid%7D~1judge~1%7BjudgeRunId%7D/get) — fetch
- [`GET /manage/v1/experiments/{id}/judge/{judgeRunId}/results`](/openapi.yaml#/paths/~1experiments~1%7Bid%7D~1judge~1%7BjudgeRunId%7D~1results/get) — per-pair verdicts
- [`POST /manage/v1/experiments/{id}/judge/{judgeRunId}/cancel`](/openapi.yaml#/paths/~1experiments~1%7Bid%7D~1judge~1%7BjudgeRunId%7D~1cancel/post) — cancel running
