
> Attach a narrative hypothesis to any experiment. On scheduled experiments, set machine-checkable success criteria and get a pass/fail verdict on every run.

# Hypothesis and success criteria

An experiment tells you what changed; a hypothesis tells you what you
were hoping for. modelux lets you attach two things to an experiment:

- **Hypothesis** — a free-form sentence describing what you expect. It
  never gets parsed; it just renders on the result page so anyone reading
  the numbers knows the question the experiment was meant to answer.
- **Success criteria** — a small, checkable DSL that says "pass if cost
  drops 20%, error rate stays within 5%, and judge worse-rate stays under
  10%." On scheduled experiments, every run is evaluated against these
  criteria; the verdict (`pass` / `fail` / `inconclusive`) renders as a
  banner on the run page and ships in the `experiment.completed` webhook.

One-off experiments carry the hypothesis but skip the verdict — they're
exploratory by design. Scheduled experiments carry both.

## Writing a hypothesis

Any text, up to 2000 characters. modelux never reads it. Good hypotheses
answer three questions:

1. **What are you testing?** ("Routing to gpt-4o-mini as the primary.")
2. **What do you expect to happen?** ("Cost drops ~30% on customer-support
   traffic.")
3. **What would count as a regression?** ("Judge 'worse' rate stays under
   10% and p95 latency doesn't slip more than 20%.")

If you're using a template, the create form pre-fills a hypothesis you
can edit before submitting.

## Writing success criteria

Success criteria live on scheduled experiments. Each criterion is a
predicate of the form `{ metric, op, value }`, with an optional
`params` block for metrics that need one. All criteria are AND-ed: a
run passes only if every criterion passes.

```json
{
  "logic": "and",
  "min_sample_size": 100,
  "predicates": [
    { "metric": "cost_delta_pct", "op": "lte", "value": -20 },
    { "metric": "error_rate_delta_pct", "op": "lte", "value": 5 },
    { "metric": "similarity_pct_above_threshold", "op": "gte", "value": 80,
      "params": { "threshold": 0.9 } },
    { "metric": "judge_worse_pct_upper_ci", "op": "lte", "value": 10 }
  ]
}
```

`min_sample_size` (default 100) is a guard against thin windows — a run
with fewer rows than this returns verdict `inconclusive` regardless of
the predicate values.

### Operators

`lt`, `lte`, `gt`, `gte`, `eq`. No `ne` — there's no realistic
"anything-but" hypothesis.

### Direction conventions

Not every metric points the same direction: for some, negative is good.
The most important conventions:

- **Cost delta percentages** — negative means savings. "20% cheaper" is
  `cost_delta_pct` `lte` `-20`.
- **Latency delta percentages** — positive means slower. "No more than 30%
  slower" is `latency_p95_delta_pct` `lte` `30`.
- **Error rate delta percentages** — positive means more errors. "No
  increase" is `error_rate_delta_pct` `lte` `0`.
- **Similarity percentages** — higher is better. "At least 80% of pairs
  scored ≥ 0.9" is `similarity_pct_above_threshold` `gte` `80` with
  `params.threshold: 0.9`.
- **Judge worse percentages** — lower is better. "Fewer than 10% of
  pairs judged worse" is `judge_worse_pct_upper_ci` `lte` `10`.

Prefer `judge_worse_pct_upper_ci` over `judge_worse_pct` when you want a
statistically honest bound — the upper confidence interval protects
against small-sample noise.
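
Putting the conventions together, a criteria object for "at least 10%
cheaper, no more than 30% slower at p95, and no increase in errors" could
look like the following. The thresholds are purely illustrative and the
shape mirrors the JSON example above:

```ts
// Illustrative thresholds only; the shape mirrors the success-criteria JSON above.
const criteria = {
  logic: "and",
  min_sample_size: 200,
  predicates: [
    { metric: "cost_delta_pct", op: "lte", value: -10 },       // at least 10% cheaper
    { metric: "latency_p95_delta_pct", op: "lte", value: 30 }, // at most 30% slower at p95
    { metric: "error_rate_delta_pct", op: "lte", value: 0 },   // errors don't increase
  ],
};
```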

## Metric catalog

The metric enum is closed: the only metrics accepted are the ones below.
Some are only evaluable under `mode=with_responses` (they need real
candidate responses to compute) or when a judge run has completed against
the experiment.

### Cost

| Metric | Definition |
| --- | --- |
| `cost_delta_pct` | Total candidate cost vs total baseline cost, as a percent change. Negative = savings. |
| `cost_delta_usd_total` | Absolute dollar difference across the window. |
| `cost_per_request_delta_pct` | Mean of per-request cost percent changes. Robust to volume skew between sides. |

### Latency

| Metric | Definition |
| --- | --- |
| `latency_p50_delta_pct` | p50 latency delta, percent. |
| `latency_p95_delta_pct` | p95 latency delta, percent. **Preferred** — most customer SLOs live here. |
| `latency_p99_delta_pct` | Tail latency delta, percent. |

### Reliability

| Metric | Definition |
| --- | --- |
| `error_rate_delta_pct` | Candidate error rate vs baseline error rate, percent change. Blows up when baseline ≈ 0 — use `candidate_error_rate_abs_pct` in that case. |
| `candidate_error_rate_abs_pct` | Absolute candidate error rate. "Candidate errors less than 1%" → `lte 1`. |

### Response quality (`mode=with_responses` only)

| Metric | Definition |
| --- | --- |
| `similarity_mean` | Mean cosine similarity between baseline and candidate responses (0..1). |
| `similarity_pct_above_threshold` | Share of response pairs scoring at or above `params.threshold`. |
| `judge_better_pct` | Share of pairs the judge labelled better for the candidate. Requires a completed judge run. |
| `judge_equivalent_pct` | Share labelled equivalent. |
| `judge_worse_pct` | Share labelled worse. Point estimate. |
| `judge_worse_pct_upper_ci` | Wilson 95% upper bound on `judge_worse_pct`. Prefer this for criteria — it accounts for small-sample uncertainty. |

## How the verdict is computed

When a scheduled run completes:

1. If `sample_size < min_sample_size`, verdict is `inconclusive` and no
   predicates are evaluated.
2. Otherwise, every predicate is computed against the run's
   `sim_results` (for cost / latency / reliability / similarity) or the
   latest completed judge run (for `judge_*` metrics). A predicate that
   references a `with_responses`-only metric on a `routing_only` run, or a
   `judge_*` metric with no judge run yet, is marked **unevaluable**.
3. If any predicate is unevaluable, the verdict is `inconclusive`.
4. If every predicate passed, the verdict is `pass`.
5. Otherwise, the verdict is `fail`.

The verdict, per-predicate breakdown, and `computed_at` timestamp land
on the experiment row, the run detail page, and the
`experiment.completed` webhook payload.
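
As a sketch, the decision procedure reduces to a few lines. The types and
function below are illustrative, not modelux internals; an `observed` value
of `null` stands in for an unevaluable predicate:

```ts
// A sketch of the verdict rules above; not the modelux implementation.
type Op = "lt" | "lte" | "gt" | "gte" | "eq";
type Verdict = "pass" | "fail" | "inconclusive";

interface Predicate {
  metric: string;
  op: Op;
  value: number;
  observed: number | null; // null = unevaluable on this run
}

// Same comparison table as in the Operators sketch.
const compare: Record<Op, (observed: number, value: number) => boolean> = {
  lt: (o, v) => o < v,
  lte: (o, v) => o <= v,
  gt: (o, v) => o > v,
  gte: (o, v) => o >= v,
  eq: (o, v) => o === v,
};

function verdictFor(sampleSize: number, minSampleSize: number, predicates: Predicate[]): Verdict {
  if (sampleSize < minSampleSize) return "inconclusive";                  // step 1: thin window
  if (predicates.some((p) => p.observed === null)) return "inconclusive"; // steps 2-3: unevaluable
  const allPass = predicates.every((p) => compare[p.op](p.observed as number, p.value));
  return allPass ? "pass" : "fail";                                       // steps 4-5
}
```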

## What verdicts don't do

- They don't auto-promote passing candidates. Verdict is a signal, not
  an action — you still click promote yourself.
- They don't auto-pause failing scheduled experiments. If a fail
  verdict fires, modelux surfaces it on the overview; it doesn't alter
  the schedule.
- They don't evaluate hypothesis text with an LLM. The hypothesis is
  narrative and the criteria are structured; an LLM only participates at
  the per-response layer via judge metrics, never in the verdict decision.

## Webhook payload

On every scheduled-run completion, `experiment.completed` fires with
the full payload. The `hypothesis` and `verdict` fields are always
present as keys; their values are `null` when not set:

```jsonc
{
  "scheduled_experiment_id": "...",
  "experiment_id": "...",
  "window_start": "2026-04-19T00:00:00Z",
  "window_end": "2026-04-20T00:00:00Z",
  "request_count": 1423,
  "baseline_cost_usd": 12.30,
  "candidate_cost_usd": 8.40,
  "cost_delta_usd": -3.90,
  "cost_delta_pct": -31.7,
  // ...
  "hypothesis": "Routing to gpt-4o-mini will cut cost without hurting quality.",
  "verdict": "pass",
  "verdict_breakdown": {
    "verdict": "pass",
    "sample_size": 1423,
    "min_sample_size": 100,
    "predicates": [
      { "metric": "cost_delta_pct", "op": "lte", "value": -20,
        "observed": -31.7, "passed": true }
      // ...
    ],
    "computed_at": "2026-04-20T09:00:12Z"
  }
}
```

A run with `success_criteria = null` (the scheduled experiment defines no
criteria) carries `verdict: null`. One-off experiments always carry
`verdict: null`; they are never evaluated against criteria.
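
If you consume the webhook, the top-level `verdict` is usually all you need
to branch on; remember it can be `null`. A minimal handler sketch using
Express, where the route path and the alerting are placeholders rather than
anything modelux prescribes:

```ts
import express from "express";

const app = express();
app.use(express.json());

// Placeholder route: point the modelux webhook at whatever URL you actually use.
app.post("/hooks/modelux", (req, res) => {
  const { experiment_id, verdict, hypothesis, verdict_breakdown } = req.body;

  if (verdict === "fail") {
    // Placeholder: wire this to your own alerting instead of console output.
    console.warn(`experiment ${experiment_id} failed its criteria`, { hypothesis, verdict_breakdown });
  }
  // verdict === null means a one-off experiment or a scheduled one without criteria.

  res.sendStatus(204);
});

app.listen(3000);
```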

## Related

- [Experiments](/docs/guides/experiments)
- [Scheduled experiments](/docs/guides/scheduled-experiments)
- [Webhooks](/docs/concepts/webhooks)
