Hypothesis and success criteria
An experiment tells you what changed; a hypothesis tells you what you were hoping for. modelux lets you attach both to an experiment:
- Hypothesis — a free-form sentence describing what you expect. It never gets parsed; it just renders on the result page so anyone reading the numbers knows the question the experiment was meant to answer.
- Success criteria — a small, checkable DSL that says “pass if cost drops 20%, error rate stays within 5%, and judge worse-rate stays under 10%.” On scheduled experiments, every run is evaluated against these criteria; the verdict (`pass`/`fail`/`inconclusive`) renders as a banner on the run page and ships in the `experiment.completed` webhook.
One-off experiments carry the hypothesis but skip the verdict — they’re exploratory by design. Scheduled experiments carry both.
Writing a hypothesis
Any text, up to 2000 characters. modelux never reads it. Good hypotheses answer three questions:
- What are you testing? (“Routing to gpt-4o-mini as the primary.”)
- What do you expect to happen? (“Cost drops ~30% on customer-support traffic.”)
- What would count as a regression? (“Judge ‘worse’ rate stays under 10% and p95 latency doesn’t slip more than 20%.”)
If you’re using a template, the create form pre-fills a hypothesis you can edit before submitting.
Writing success criteria
Success criteria live on scheduled experiments. Each criterion is a predicate of the form `{ metric, operator, value }`, with an optional `params` block for metrics that need one. All criteria are AND-ed: a run passes only if every criterion passes.
```json
{
  "logic": "and",
  "min_sample_size": 100,
  "predicates": [
    { "metric": "cost_delta_pct", "op": "lte", "value": -20 },
    { "metric": "error_rate_delta_pct", "op": "lte", "value": 5 },
    { "metric": "similarity_pct_above_threshold", "op": "gte", "value": 80,
      "params": { "threshold": 0.9 } },
    { "metric": "judge_worse_pct_upper_ci", "op": "lte", "value": 10 }
  ]
}
```
`min_sample_size` (default 100) is a guard against thin windows — a run with fewer rows than this returns verdict `inconclusive` regardless of the predicate values.
Operators
`lt`, `lte`, `gt`, `gte`, `eq`. No `ne` — there’s no realistic “anything-but” hypothesis.
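As a rough illustration (not modelux’s implementation), each operator maps directly onto a Python comparison:

```python
import operator

# The five supported operators, mapped onto Python's built-in
# comparison functions. Note the deliberate absence of "ne".
OPS = {
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
    "eq": operator.eq,
}

def check(observed: float, op: str, value: float) -> bool:
    """Evaluate one predicate: observed <op> value."""
    return OPS[op](observed, value)

# "Cost drops at least 20%": an observed -31.7 satisfies lte -20.
assert check(-31.7, "lte", -20)
```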
Direction conventions
Some metrics are negative-is-good. The most important:
- Cost delta percentages — negative means savings. “20% cheaper” is `cost_delta_pct` `lte` `-20`.
- Latency delta percentages — positive means slower. “No more than 30% slower” is `latency_p95_delta_pct` `lte` `30`.
- Error rate delta percentages — positive means more errors. “No increase” is `error_rate_delta_pct` `lte` `0`.
- Similarity percentages — higher is better. “At least 80% of pairs scored ≥ 0.9” is `similarity_pct_above_threshold` `gte` `80` with `params.threshold: 0.9`.
- Judge worse percentages — lower is better. “Fewer than 10% of pairs judged worse” is `judge_worse_pct_upper_ci` `lte` `10`.
Prefer `judge_worse_pct_upper_ci` over `judge_worse_pct` when you want a statistically honest bound — the upper confidence interval protects against small-sample noise.
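For intuition, here is a minimal sketch of that bound, assuming the upper limit of a standard two-sided 95% Wilson score interval (z ≈ 1.96); the exact variant modelux computes may differ:

```python
from math import sqrt

def wilson_upper_pct(worse: int, n: int, z: float = 1.96) -> float:
    """Upper limit of the Wilson score interval for worse/n,
    as a percentage. On thin samples the bound sits well above
    the point estimate, which is exactly the protection wanted."""
    if n == 0:
        return 100.0
    p = worse / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return 100 * (center + half)

# 2 of 20 pairs judged worse: the point estimate is 10%, but the
# upper bound is ~30.1%, so an "lte 10" criterion would not pass.
print(round(wilson_upper_pct(2, 20), 1))
```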
Metric catalog
The metric enum is closed: the only metrics accepted are the ones below. Some are only evaluable under `mode=with_responses` (they need real candidate responses to compute) or when a judge run has completed against the experiment.
Cost
| Metric | Definition |
|---|---|
| `cost_delta_pct` | Total candidate cost vs total baseline cost, as a percent change. Negative = savings. |
| `cost_delta_usd_total` | Absolute dollar difference across the window. |
| `cost_per_request_delta_pct` | Mean of per-request cost percent changes. Robust to volume skew between sides. |
Latency
| Metric | Definition |
|---|---|
| `latency_p50_delta_pct` | p50 latency delta, percent. |
| `latency_p95_delta_pct` | p95 latency delta, percent. Preferred — most customer SLOs live here. |
| `latency_p99_delta_pct` | Tail latency delta, percent. |
Reliability
| Metric | Definition |
|---|---|
| `error_rate_delta_pct` | Candidate error rate vs baseline error rate, percent change. Blows up when baseline ≈ 0 — use `candidate_error_rate_abs_pct` in that case. |
| `candidate_error_rate_abs_pct` | Absolute candidate error rate. “Candidate errors less than 1%” → `lte 1`. |
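To see why the relative metric misbehaves, assume the usual percent-change formula (the standard reading of “percent change,” not quoted from modelux):

```python
def delta_pct(candidate: float, baseline: float) -> float:
    """Standard relative percent change."""
    return (candidate - baseline) / baseline * 100

# Healthy baseline: a 0.5% -> 1.0% error rate reads as +100%.
print(delta_pct(1.0, 0.5))   # 100.0
# Near-zero baseline: 0.01% -> 0.5% reads as +4900%, and a literal
# 0% baseline divides by zero. candidate_error_rate_abs_pct avoids
# the division entirely.
print(delta_pct(0.5, 0.01))  # 4900.0
```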
Response quality (mode=with_responses only)
| Metric | Definition |
|---|---|
| `similarity_mean` | Mean cosine similarity between baseline and candidate responses (0..1). |
| `similarity_pct_above_threshold` | Share of response pairs scoring at or above `params.threshold`. |
| `judge_better_pct` | Share of pairs the judge labelled better for the candidate. Requires a completed judge run. |
| `judge_equivalent_pct` | Share labelled equivalent. |
| `judge_worse_pct` | Share labelled worse. Point estimate. |
| `judge_worse_pct_upper_ci` | Wilson 95% upper bound on `judge_worse_pct`. Prefer this for criteria — it accounts for small-sample uncertainty. |
How the verdict is computed
When a scheduled run completes:
- If `sample_size < min_sample_size`, the verdict is `inconclusive` and no predicates are evaluated.
- Otherwise, every predicate is computed against the run’s `sim_results` (for cost / latency / reliability / similarity) or the latest completed judge run (for `judge_*`). A predicate that references a `with_responses`-only metric on a `routing_only` run, or a `judge_*` metric with no judge run yet, is marked unevaluable.
- If any predicate is unevaluable, the verdict is `inconclusive`.
- If every predicate passed, the verdict is `pass`.
- Otherwise, the verdict is `fail`.
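A minimal sketch of that decision order (the evaluator itself is modelux-internal; this just restates the rules above):

```python
from typing import Optional

def compute_verdict(sample_size: int, min_sample_size: int,
                    predicate_results: list[Optional[bool]]) -> str:
    """One entry per predicate: True (passed), False (failed),
    or None (unevaluable)."""
    if sample_size < min_sample_size:
        return "inconclusive"  # thin window: predicates never evaluated
    if any(r is None for r in predicate_results):
        return "inconclusive"  # at least one predicate unevaluable
    return "pass" if all(predicate_results) else "fail"

assert compute_verdict(50, 100, [True, True]) == "inconclusive"
assert compute_verdict(1423, 100, [True, None]) == "inconclusive"
assert compute_verdict(1423, 100, [True, True]) == "pass"
assert compute_verdict(1423, 100, [True, False]) == "fail"
```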
The verdict, per-predicate breakdown, and `computed_at` timestamp land on the experiment row, the run detail page, and the `experiment.completed` webhook payload.
What verdicts don’t do
- They don’t auto-promote passing candidates. Verdict is a signal, not an action — you still click promote yourself.
- They don’t auto-pause failing scheduled experiments. If a `fail` verdict fires, modelux surfaces it on the overview; it doesn’t alter the schedule.
- They don’t evaluate hypothesis text with an LLM. Hypothesis is narrative, criteria are structured — the LLM only participates at the per-response layer via judge metrics, never at the verdict decision.
Webhook payload
On every scheduled-run completion, `experiment.completed` fires with the full payload. The `hypothesis` and `verdict` keys are always present; their values are `null` when unset:
```json
{
  "scheduled_experiment_id": "...",
  "experiment_id": "...",
  "window_start": "2026-04-19T00:00:00Z",
  "window_end": "2026-04-20T00:00:00Z",
  "request_count": 1423,
  "baseline_cost_usd": 12.30,
  "candidate_cost_usd": 8.40,
  "cost_delta_usd": -3.90,
  "cost_delta_pct": -31.7,
  // ...
  "hypothesis": "Routing to gpt-4o-mini will cut cost without hurting quality.",
  "verdict": "pass",
  "verdict_breakdown": {
    "verdict": "pass",
    "sample_size": 1423,
    "min_sample_size": 100,
    "predicates": [
      { "metric": "cost_delta_pct", "op": "lte", "value": -20,
        "observed": -31.7, "passed": true }
      // ...
    ],
    "computed_at": "2026-04-20T09:00:12Z"
  }
}
```
A run with `success_criteria = null` (the scheduled experiment has no criteria) carries `verdict: null`. One-off experiments carry `verdict: null` too, regardless of anything else.
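A minimal consumer sketch, assuming a FastAPI endpoint; the route, the `notify` helper, and the alerting choice are all illustrative, not part of modelux:

```python
from fastapi import FastAPI, Request

app = FastAPI()

def notify(message: str) -> None:
    print(message)  # stand-in for your real alerting channel

@app.post("/webhooks/modelux")  # hypothetical route on your side
async def experiment_completed(request: Request):
    payload = await request.json()
    verdict = payload.get("verdict")  # "pass" | "fail" | "inconclusive" | None
    if verdict is None:
        # One-off experiment, or scheduled experiment without criteria.
        return {"ok": True}
    if verdict == "fail":
        failed = [p["metric"]
                  for p in payload["verdict_breakdown"]["predicates"]
                  if not p["passed"]]
        notify(f"Experiment {payload['experiment_id']} failed: {failed}")
    return {"ok": True}
```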