Hypothesis and success criteria

An experiment tells you what changed; a hypothesis tells you what you were hoping for. modelux lets you attach both to an experiment:

  • Hypothesis — a free-form sentence describing what you expect. It never gets parsed; it just renders on the result page so anyone reading the numbers knows the question the experiment was meant to answer.
  • Success criteria — a small, checkable DSL that says “pass if cost drops 20%, error rate stays within 5%, and judge worse-rate stays under 10%.” On scheduled experiments, every run is evaluated against these criteria; the verdict (pass / fail / inconclusive) renders as a banner on the run page and ships in the experiment.completed webhook.

One-off experiments carry the hypothesis but skip the verdict — they’re exploratory by design. Scheduled experiments carry both.

Writing a hypothesis

Any text, up to 2000 characters. modelux never reads it. Good hypotheses answer three questions:

  1. What are you testing? (“Routing to gpt-4o-mini as the primary.”)
  2. What do you expect to happen? (“Cost drops ~30% on customer-support traffic.”)
  3. What would count as a regression? (“Judge ‘worse’ rate stays under 10% and p95 latency doesn’t slip more than 20%.”)

If you’re using a template, the create form pre-fills a hypothesis you can edit before submitting.
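
Assembled from those three answers, a complete hypothesis might read:

  “Routing customer-support traffic to gpt-4o-mini as the primary should cut cost roughly 30%. It counts as a regression if the judge ‘worse’ rate exceeds 10% or p95 latency slips more than 20%.”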

Writing success criteria

Success criteria live on scheduled experiments. Each criterion is a predicate of the form { metric, operator, value }, with an optional params block for metrics that need one. All criteria are AND-ed: a run passes only if every criterion passes.

{
  "logic": "and",
  "min_sample_size": 100,
  "predicates": [
    { "metric": "cost_delta_pct", "op": "lte", "value": -20 },
    { "metric": "error_rate_delta_pct", "op": "lte", "value": 5 },
    { "metric": "similarity_pct_above_threshold", "op": "gte", "value": 80,
      "params": { "threshold": 0.9 } },
    { "metric": "judge_worse_pct_upper_ci", "op": "lte", "value": 10 }
  ]
}

min_sample_size (default 100) is a guard against thin windows — a run with fewer rows than this returns verdict inconclusive regardless of the predicate values.

Operators

lt, lte, gt, gte, eq. No ne — there’s no realistic “anything-but” hypothesis.

Direction conventions

Not every metric points the same way; for several, negative or lower is the good direction. The most important conventions:

  • Cost delta percentages — negative means savings. “20% cheaper” is cost_delta_pct lte -20.
  • Latency delta percentages — positive means slower. “No more than 30% slower” is latency_p95_delta_pct lte 30.
  • Error rate delta percentages — positive means more errors. “No increase” is error_rate_delta_pct lte 0.
  • Similarity percentages — higher is better. “At least 80% of pairs scored ≥ 0.9” is similarity_pct_above_threshold gte 80 with params.threshold: 0.9.
  • Judge worse percentages — lower is better. “Fewer than 10% of pairs judged worse” is judge_worse_pct_upper_ci lte 10.

Prefer judge_worse_pct_upper_ci over judge_worse_pct when you want a statistically honest bound — the upper confidence interval protects against small-sample noise.
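
Combined, a quality-guardrail criteria block that leans on these conventions could look like this (the thresholds are illustrative, the metric names and operators are the ones documented here):

{
  "logic": "and",
  "min_sample_size": 100,
  "predicates": [
    { "metric": "latency_p95_delta_pct", "op": "lte", "value": 30 },
    { "metric": "error_rate_delta_pct", "op": "lte", "value": 0 },
    { "metric": "similarity_pct_above_threshold", "op": "gte", "value": 80,
      "params": { "threshold": 0.9 } },
    { "metric": "judge_worse_pct_upper_ci", "op": "lte", "value": 10 }
  ]
}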

Metric catalog

The metric enum is closed: the only metrics accepted are the ones below. Some are only evaluable under mode=with_responses (they need real candidate responses to compute) or when a judge run has completed against the experiment.

Cost

  • cost_delta_pct: Total candidate cost vs total baseline cost, as a percent change. Negative = savings.
  • cost_delta_usd_total: Absolute dollar difference across the window.
  • cost_per_request_delta_pct: Mean of per-request cost percent changes. Robust to volume skew between sides.

Latency

  • latency_p50_delta_pct: p50 latency delta, percent.
  • latency_p95_delta_pct: p95 latency delta, percent. Preferred — most customer SLOs live here.
  • latency_p99_delta_pct: Tail latency delta, percent.

Reliability

  • error_rate_delta_pct: Candidate error rate vs baseline error rate, percent change. Blows up when baseline ≈ 0 — use candidate_error_rate_abs_pct in that case.
  • candidate_error_rate_abs_pct: Absolute candidate error rate. “Candidate errors less than 1%” → lte 1.
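
For example, a baseline error rate of 0.1% that rises to 0.3% shows up as error_rate_delta_pct = +200 even though the absolute change is only 0.2 points, which is why the absolute metric is the safer guardrail when the baseline is near zero.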

Response quality (mode=with_responses only)

  • similarity_mean: Mean cosine similarity between baseline and candidate responses (0..1).
  • similarity_pct_above_threshold: Share of response pairs scoring at or above params.threshold.
  • judge_better_pct: Share of pairs the judge labelled better for the candidate. Requires a completed judge run.
  • judge_equivalent_pct: Share labelled equivalent.
  • judge_worse_pct: Share labelled worse. Point estimate.
  • judge_worse_pct_upper_ci: Wilson 95% upper bound on judge_worse_pct. Prefer this for criteria — it accounts for small-sample uncertainty.
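
To see why the bound matters, here is a minimal sketch of a Wilson upper bound in Python. Whether modelux uses a one-sided bound or the upper end of a two-sided 95% interval is an assumption (z = 1.96 below), as is everything else about the implementation:

import math

def wilson_upper_pct(worse: int, total: int, z: float = 1.96) -> float:
    # Upper limit of the Wilson score interval for worse/total, as a percent.
    # z = 1.96 corresponds to the upper end of a two-sided 95% interval.
    if total == 0:
        return 100.0  # no judged pairs: most conservative possible bound
    p = worse / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    spread = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return 100.0 * (centre + spread) / denom

# 3 "worse" labels out of 40 judged pairs: judge_worse_pct is 7.5,
# but the Wilson upper bound is roughly 19.9, so "lte 10" would not pass.
print(round(wilson_upper_pct(3, 40), 1))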

How the verdict is computed

When a scheduled run completes:

  1. If sample_size < min_sample_size, verdict is inconclusive and no predicates are evaluated.
  2. Otherwise, every predicate is computed against the run’s sim_results (for cost / latency / reliability / similarity) or the latest completed judge run (for judge_*). A predicate that references a with_responses-only metric on a routing_only run, or a judge_* metric with no judge run yet, is marked unevaluable.
  3. If any predicate is unevaluable, the verdict is inconclusive.
  4. If every predicate passed, the verdict is pass.
  5. Otherwise, the verdict is fail.

The verdict, per-predicate breakdown, and computed_at timestamp land on the experiment row, the run detail page, and the experiment.completed webhook payload.
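
A minimal sketch of that procedure in Python, assuming the run’s observed metric values have already been computed (the names and shapes below are illustrative, not modelux’s actual implementation):

from typing import Dict, Optional

OPS = {
    "lt":  lambda a, b: a < b,
    "lte": lambda a, b: a <= b,
    "gt":  lambda a, b: a > b,
    "gte": lambda a, b: a >= b,
    "eq":  lambda a, b: a == b,
}

def compute_verdict(criteria: dict, sample_size: int,
                    observed: Dict[str, Optional[float]]) -> dict:
    # observed maps metric name -> value, or None when the metric is
    # unevaluable for this run (routing_only, or no completed judge run yet).
    if sample_size < criteria.get("min_sample_size", 100):
        # Thin window: no predicates are evaluated at all.
        return {"verdict": "inconclusive", "sample_size": sample_size, "predicates": []}

    results, any_unevaluable, all_passed = [], False, True
    for pred in criteria["predicates"]:
        value = observed.get(pred["metric"])
        if value is None:
            any_unevaluable = True
            results.append({**pred, "observed": None, "passed": None})
            continue
        passed = OPS[pred["op"]](value, pred["value"])
        all_passed = all_passed and passed
        results.append({**pred, "observed": value, "passed": passed})

    if any_unevaluable:
        verdict = "inconclusive"
    elif all_passed:
        verdict = "pass"
    else:
        verdict = "fail"
    return {"verdict": verdict, "sample_size": sample_size, "predicates": results}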

What verdicts don’t do

  • They don’t auto-promote passing candidates. Verdict is a signal, not an action — you still click promote yourself.
  • They don’t auto-pause failing scheduled experiments. If a fail verdict fires, modelux surfaces it on the overview; it doesn’t alter the schedule.
  • They don’t evaluate hypothesis text with an LLM. Hypothesis is narrative, criteria are structured — the LLM only participates at the per-response layer via judge metrics, never at the verdict decision.

Webhook payload

On every scheduled-run completion, experiment.completed fires with the full payload. The hypothesis and verdict keys are always present; each is null when the experiment has nothing to put there (no hypothesis written, no criteria defined):

{
  "scheduled_experiment_id": "...",
  "experiment_id": "...",
  "window_start": "2026-04-19T00:00:00Z",
  "window_end": "2026-04-20T00:00:00Z",
  "request_count": 1423,
  "baseline_cost_usd": 12.30,
  "candidate_cost_usd": 8.40,
  "cost_delta_usd": -3.90,
  "cost_delta_pct": -31.7,
  // ...
  "hypothesis": "Routing to gpt-4o-mini will cut cost without hurting quality.",
  "verdict": "pass",
  "verdict_breakdown": {
    "verdict": "pass",
    "sample_size": 1423,
    "min_sample_size": 100,
    "predicates": [
      { "metric": "cost_delta_pct", "op": "lte", "value": -20,
        "observed": -31.7, "passed": true }
      // ...
    ],
    "computed_at": "2026-04-20T09:00:12Z"
  }
}

A run with success_criteria = null (scheduled experiment has no criteria) carries verdict: null. One-off experiments carry verdict: null too, regardless of anything else.
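
If you consume the webhook, a minimal handler sketch in Python follows. Flask, the route, and the alerting choice are assumptions; the payload fields are the ones documented above.

from flask import Flask, request

app = Flask(__name__)

@app.post("/hooks/modelux")  # hypothetical route; point the webhook wherever you like
def experiment_completed():
    event = request.get_json(force=True)
    verdict = event.get("verdict")        # "pass" | "fail" | "inconclusive" | null
    hypothesis = event.get("hypothesis")  # free-form text, may be null

    if verdict is None:
        # One-off experiment, or scheduled experiment without success criteria.
        return "", 204

    if verdict == "fail":
        failing = [p["metric"]
                   for p in event["verdict_breakdown"]["predicates"]
                   if p.get("passed") is False]
        # Surface this however you normally alert; plain logging here.
        app.logger.warning("experiment %s failed %s (hypothesis: %s)",
                           event["experiment_id"], ", ".join(failing), hypothesis)
    return "", 204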