<!-- source: https://modelux.ai/docs/guides/scheduled-experiments -->

> Run an experiment on a cron. Detect regressions when a model, provider, or your traffic shape drifts — without babysitting the dashboard.

# Scheduled experiments

A regular [experiment](/docs/guides/experiments) answers a one-shot
question: "what if I switched to this candidate config?" A **scheduled
experiment** asks that same question on a cron — every hour, every day,
every week — and fires a webhook when the answer changes in a way you
care about.

The loop:

1. You define a candidate config, a cadence, and a rolling window.
2. modelux replays your last *N* hours of traffic through the candidate
   on every fire.
3. Cost, latency, and error-rate deltas are stored on the run.
4. If any delta breaches a threshold you set, a regression signal lands
   on the overview and an `experiment.regression_detected` webhook
   fires.
5. You stay informed. No auto-pause, no auto-promote — modelux tells
   you, you decide.

## When to use it

- **Model provider drift.** OpenAI updates `gpt-4o-mini`, Anthropic
  retunes `claude-haiku-4-5`, Google bumps its pricing. You find out
  on the next scheduled run instead of from a bill.
- **Traffic-shape shift.** Your users start asking longer questions, or
  the mix of `end_user_id` changes. A config that was optimal a month
  ago may no longer be.
- **Cheaper-model watch.** Run `gpt-4o-mini` against your production
  `gpt-4o` baseline every night. The day the gap narrows enough, you'll
  know.
- **Pre-launch candidates.** Replay a candidate routing config for two
  weeks before committing to it — catch any latency cliff before
  production does.

## Create one

```bash
curl -X POST https://api.modelux.ai/manage/v1/scheduled_experiments \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "projectId": "proj_abc",
    "name": "daily gpt-4o-mini regression check",
    "candidatePolicy": "single",
    "candidateConfig": {
      "model": "gpt-4o-mini",
      "provider_credential_id": "pc_openai_default"
    },
    "cronExpression": "0 9 * * *",
    "windowHours": 24,
    "mode": "routing_only",
    "hypothesis": "My current routing stays within its established cost, latency, and error baselines.",
    "successCriteria": {
      "logic": "and",
      "min_sample_size": 100,
      "predicates": [
        { "metric": "cost_delta_pct", "op": "lte", "value": 20 },
        { "metric": "latency_p95_delta_pct", "op": "lte", "value": 25 },
        { "metric": "error_rate_delta_pct", "op": "lte", "value": 50 }
      ]
    }
  }'
```

A one-off run kicks off immediately on create so you see results
without waiting for the first scheduled fire. The scheduled row's
`last_experiment_id` points at that run.
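To check on that first run from the API, fetch the scheduled row and follow the ID it reports — a minimal sketch, with `sched_abc` standing in for the ID the create call returned:

```bash
# Fetch the scheduled experiment; the response includes last_experiment_id,
# which you can open in the dashboard or look up in the experiments guide.
curl https://api.modelux.ai/manage/v1/scheduled_experiments/sched_abc \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY"
```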

Or do it from the dashboard: **Experiments → Scheduled → New**.

## Cadence

`cronExpression` is a standard five-field cron expression, evaluated in
UTC. The UI offers these presets:

| Preset | Cron |
|---|---|
| Every hour | `0 * * * *` |
| Daily at 9am UTC | `0 9 * * *` |
| Daily at midnight UTC | `0 0 * * *` |
| Weekly Mon 9am UTC | `0 9 * * 1` |

Custom expressions work too. The worker evaluates the schedule against
`last_run_at` and fires whenever the next cron time has elapsed.
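For example, to move an existing schedule onto a custom every-six-hours cadence, a PATCH sketch (with `sched_abc` as a placeholder ID):

```bash
# Update the cron expression on an existing scheduled experiment
curl -X PATCH https://api.modelux.ai/manage/v1/scheduled_experiments/sched_abc \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "cronExpression": "0 */6 * * *" }'
```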

## Window

`windowHours` is the rolling slice of traffic each run replays. The
window slides forward with each fire — a 24h window on a daily cron
means every run replays the most recent 24h. Values from 1 to 720
hours are accepted; the default is 24, which matches most daily
cadences.
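The window usually pairs with the cadence so consecutive runs neither skip traffic nor overlap heavily — for a weekly cron, a full-week window is the natural fit. A sketch of the relevant create fields:

```json
{
  "cronExpression": "0 9 * * 1",
  "windowHours": 168
}
```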

## Modes

Scheduled experiments support the same two modes as one-shot
experiments:

- **`routing_only`** — free, fast, zero provider calls. Cost and
  latency are estimated from token counts × pricing. Use this for most
  scheduled checks.
- **`with_responses`** — re-runs a sample of prompts through the
  candidate model for real, captures responses, scores similarity
  against the baseline. Requires `sampleSize`, `sampleMethod`, and
  `spendCapUsd`. Needed if you want a chained judge eval.

```bash
# With-responses + daily $0.50 cap
curl -X POST https://api.modelux.ai/manage/v1/scheduled_experiments \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  -d '{
    "projectId": "proj_abc",
    "name": "daily with-responses on gpt-4o-mini",
    "candidatePolicy": "single",
    "candidateConfig": { "model": "gpt-4o-mini", "provider_credential_id": "pc_openai_default" },
    "cronExpression": "0 9 * * *",
    "windowHours": 24,
    "mode": "with_responses",
    "sampleSize": 1000,
    "sampleMethod": "random",
    "spendCapUsd": 0.5
  }'
```

`spendCapUsd` caps each individual scheduled run. The engine watches
actual spend during the run and cancels it mid-run if the cap is
breached — one bad fire cannot drain a day's budget.

Rough monthly-cost math: `spendCapUsd × runs_per_month`. A $0.50 cap on
a daily cron tops out at about $15/month. The create form shows this
projection.

## Chained judge eval

If `mode=with_responses` and you set `judgeRubric`, every completed run
auto-chains a [judge eval](/docs/guides/experiments#run-a-judge-eval-on-the-results)
with that rubric. The judge summary feeds the `judge_worse_pct` /
`judge_worse_pct_upper_ci` metrics in `successCriteria` — so a tight
quality-regression check is just a predicate like
`{"metric": "judge_worse_pct_upper_ci", "op": "lte", "value": 15}`.

```json
{
  "mode": "with_responses",
  "sampleSize": 1000,
  "sampleMethod": "random",
  "spendCapUsd": 0.5,
  "judgeRubric": "Is the candidate response at least as correct and helpful as the baseline?",
  "judgeModel": "gpt-4o",
  "successCriteria": {
    "logic": "and",
    "min_sample_size": 100,
    "predicates": [
      { "metric": "judge_worse_pct_upper_ci", "op": "lte", "value": 15 },
      { "metric": "error_rate_delta_pct", "op": "lte", "value": 5 }
    ]
  }
}
```

Judge spend is separate from the per-run `spendCapUsd`; it is capped by
the project's `judge_spend_cap_usd` default, roughly $20.

## Hypothesis and success criteria

Attach a **hypothesis** (narrative text) and **success criteria**
(structured predicates from the 15-metric catalog) to the scheduled
experiment. Every run computes a `pass` / `fail` / `inconclusive`
verdict, renders a banner on the run detail page, and ships the verdict
+ per-predicate breakdown in the `experiment.completed` webhook.

A run whose verdict is `fail` fires `experiment.regression_detected`
and raises a signal on the overview with a stable `signalKey` of
`experiment-verdict:<scheduled-experiment-id>` — so dismissals and
snoozes hold across reruns of the same schedule.

Severity is `critical` when two or more predicates fail, when
`judge_worse_pct_upper_ci` exceeds 30%, or when the candidate error
rate more than doubles the baseline; otherwise it is `warn`. Both
severities fire the webhook; the severity is included in the payload.

See [hypothesis and success criteria](/docs/guides/experiment-success-criteria)
for the full DSL, metric catalog, and examples.
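Thresholds can be tightened on an existing schedule without recreating it — a PATCH sketch, assuming the `successCriteria` object is replaced wholesale (check the reference above for partial-update semantics):

```bash
curl -X PATCH https://api.modelux.ai/manage/v1/scheduled_experiments/sched_abc \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "successCriteria": {
      "logic": "and",
      "min_sample_size": 200,
      "predicates": [
        { "metric": "cost_delta_pct", "op": "lte", "value": 10 },
        { "metric": "latency_p95_delta_pct", "op": "lte", "value": 15 }
      ]
    }
  }'
```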

## Webhooks

See the full [webhooks reference](/docs/concepts/webhooks) for setup
and signature verification. Two event types fire on scheduled runs:

- **`experiment.completed`** — every completed run, pass or fail.
  Payload includes request count, cost delta, latency delta, and error
  rate delta (both raw and pct).
- **`experiment.regression_detected`** — a specific threshold was
  breached. Fires *once per breach*, not on every subsequent run while
  the regression persists. Payload includes `metric`, `observed_delta_pct`,
  `threshold_pct`, `severity`, and a `signal_key` that correlates the
  event with the in-app signal on the overview page.

If multiple metrics breach in one run, you get one
`experiment.regression_detected` event per metric.
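As a rough sketch of an `experiment.regression_detected` payload — `metric`, `observed_delta_pct`, `threshold_pct`, `severity`, and `signal_key` are the documented fields; the envelope and ID fields here are illustrative assumptions, so verify the exact shape against the webhooks reference:

```json
{
  "type": "experiment.regression_detected",
  "data": {
    "scheduled_experiment_id": "sched_abc",
    "experiment_id": "exp_123",
    "metric": "latency_p95_delta_pct",
    "observed_delta_pct": 38.2,
    "threshold_pct": 25,
    "severity": "warn",
    "signal_key": "experiment-verdict:sched_abc"
  }
}
```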

## Pause / resume / run now

```bash
# Pause (skips future fires, keeps history)
curl -X POST https://api.modelux.ai/manage/v1/scheduled_experiments/sched_abc/pause \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY"

# Resume
curl -X POST https://api.modelux.ai/manage/v1/scheduled_experiments/sched_abc/resume \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY"

# Fire one run now without touching the cron
curl -X POST https://api.modelux.ai/manage/v1/scheduled_experiments/sched_abc/run_now \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY"
```

## Limits

- `ensemble` policies cannot be used as candidates in scheduled
  experiments (same rule as one-shot experiments).
- `with_responses` requires the project to be in `full` logging mode.
- Each scheduled run has the same 30-minute hard timeout as a one-shot
  experiment.
- Thresholds are expressed as percentages, not absolute deltas.

## API reference

- [`POST /manage/v1/scheduled_experiments`](/openapi.yaml) — create
- `GET /manage/v1/scheduled_experiments` — list (filter by `projectId`, `status`)
- `GET /manage/v1/scheduled_experiments/{id}` — fetch
- `PATCH /manage/v1/scheduled_experiments/{id}` — update cron, window, thresholds, rubric
- `DELETE /manage/v1/scheduled_experiments/{id}` — delete
- `POST /manage/v1/scheduled_experiments/{id}/pause`
- `POST /manage/v1/scheduled_experiments/{id}/resume`
- `POST /manage/v1/scheduled_experiments/{id}/run_now`

Equivalent MCP tools: `create_scheduled_experiment`,
`list_scheduled_experiments`, `get_scheduled_experiment`,
`update_scheduled_experiment`, `delete_scheduled_experiment`,
`pause_scheduled_experiment`, `resume_scheduled_experiment`,
`run_scheduled_experiment_now`.
