Scheduled experiments

A regular experiment answers a one-shot question: “what if I switched to this candidate config?” A scheduled experiment asks that same question on a cron — every hour, every day, every week — and fires a webhook when the answer changes in a way you care about.

The loop:

  1. You define a candidate config, a cadence, and a rolling window.
  2. modelux replays your last N hours of traffic through the candidate on every fire.
  3. Cost, latency, and error-rate deltas are stored on the run.
  4. If any delta breaches a threshold you set, a regression signal lands on the overview and an experiment.regression_detected webhook fires.
  5. You stay informed. No auto-pause, no auto-promote — modelux tells you, you decide.

When to use it

  • Model provider drift. OpenAI updates gpt-4o-mini, Anthropic retunes claude-haiku-4-5, Google bumps its pricing. You find out on the next scheduled run instead of from a bill.
  • Traffic-shape shift. Your users start asking longer questions, or the mix of end_user_id changes. A config that was optimal a month ago may no longer be.
  • Cheaper-model watch. Run gpt-4o-mini against your production gpt-4o baseline every night. The day the gap narrows enough, you’ll know.
  • Pre-launch candidates. Replay a candidate routing config for two weeks before committing to it — catch any latency cliff before production does.

Create one

curl -X POST https://api.modelux.ai/manage/v1/scheduled_experiments \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "projectId": "proj_abc",
    "name": "daily gpt-4o-mini regression check",
    "candidatePolicy": "single",
    "candidateConfig": {
      "model": "gpt-4o-mini",
      "provider_credential_id": "pc_openai_default"
    },
    "cronExpression": "0 9 * * *",
    "windowHours": 24,
    "mode": "routing_only",
    "hypothesis": "My current routing stays within its established cost, latency, and error baselines.",
    "successCriteria": {
      "logic": "and",
      "min_sample_size": 100,
      "predicates": [
        { "metric": "cost_delta_pct", "op": "lte", "value": 20 },
        { "metric": "latency_p95_delta_pct", "op": "lte", "value": 25 },
        { "metric": "error_rate_delta_pct", "op": "lte", "value": 50 }
      ]
    }
  }'

A one-off run kicks off immediately on create so you see results without waiting for the first scheduled fire. The scheduled row’s last_experiment_id points at that run.
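
To look at that initial run, fetch the scheduled experiment and follow last_experiment_id (sched_abc is a placeholder id; the GET endpoint is listed in the API reference below):

curl https://api.modelux.ai/manage/v1/scheduled_experiments/sched_abc \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY"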

Or do it from the dashboard: Experiments → Scheduled → New.

Cadence

cronExpression is a standard five-field cron expression, evaluated in UTC. The UI offers these presets:

Preset                  Cron
Every hour              0 * * * *
Daily at 9am UTC        0 9 * * *
Daily at midnight UTC   0 0 * * *
Weekly Mon 9am UTC      0 9 * * 1

Custom expressions work too. The worker evaluates the schedule against last_run_at and fires whenever the next cron time has elapsed.

Window

windowHours is the rolling slice of traffic each run replays. The window slides forward with each fire — a 24h window on a daily cron means every run replays the most recent 24h. 1–720 hours are accepted; 24 is the default and matches most daily cadences.
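
Both the cadence and the window can be changed later via PATCH (see the API reference below). A minimal sketch, assuming the update body takes the same field names as create, that moves a schedule to the weekly preset with a matching one-week window:

curl -X PATCH https://api.modelux.ai/manage/v1/scheduled_experiments/sched_abc \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "cronExpression": "0 9 * * 1",
    "windowHours": 168
  }'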

Modes

Scheduled experiments support the same two modes as one-shot experiments:

  • routing_only — free, fast, zero provider calls. Cost and latency are estimated from token counts × pricing. Use this for most scheduled checks.
  • with_responses — re-runs a sample of prompts through the candidate model for real, captures responses, scores similarity against the baseline. Requires sampleSize, sampleMethod, and spendCapUsd. Needed if you want a chained judge eval.

# With-responses + daily $0.50 cap
curl -X POST https://api.modelux.ai/manage/v1/scheduled_experiments \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  -d '{
    "projectId": "proj_abc",
    "name": "daily with-responses on gpt-4o-mini",
    "candidatePolicy": "single",
    "candidateConfig": { "model": "gpt-4o-mini", "provider_credential_id": "pc_openai_default" },
    "cronExpression": "0 9 * * *",
    "windowHours": 24,
    "mode": "with_responses",
    "sampleSize": 1000,
    "sampleMethod": "random",
    "spendCapUsd": 0.5
  }'

spendCapUsd caps each individual scheduled run. The engine watches actual spend during the run and cancels mid-run on overrun — one bad fire cannot drain a day’s budget.

Rough monthly-cost math: spendCapUsd × runs_per_month. A $0.50 cap on a daily cron tops out at about $15/month. The create form shows this projection.

Chained judge eval

If mode=with_responses and you set judgeRubric, every completed run auto-chains a judge eval with that rubric. The judge summary feeds the judge_worse_pct / judge_worse_pct_upper_ci metrics in successCriteria — so a tight quality-regression check is just a predicate like {"metric": "judge_worse_pct_upper_ci", "op": "lte", "value": 15}.

{
  "mode": "with_responses",
  "sampleSize": 1000,
  "sampleMethod": "random",
  "spendCapUsd": 0.5,
  "judgeRubric": "Is the candidate response at least as correct and helpful as the baseline?",
  "judgeModel": "gpt-4o",
  "successCriteria": {
    "logic": "and",
    "min_sample_size": 100,
    "predicates": [
      { "metric": "judge_worse_pct_upper_ci", "op": "lte", "value": 15 },
      { "metric": "error_rate_delta_pct", "op": "lte", "value": 5 }
    ]
  }
}

Judge spend is in addition to the per-run spendCapUsd and uses the project’s judge_spend_cap_usd default (~$20).

Hypothesis and success criteria

Attach a hypothesis (narrative text) and success criteria (structured predicates from the 15-metric catalog) to the scheduled experiment. Every run computes a pass / fail / inconclusive verdict, renders a banner on the run detail page, and ships the verdict with a per-predicate breakdown in the experiment.completed webhook.

A run whose verdict is fail fires experiment.regression_detected and raises a signal on the overview with a stable signalKey of experiment-verdict:<scheduled-experiment-id> — so dismissals and snoozes hold across reruns of the same schedule.

Severity: critical when ≥ 2 predicates fail, when judge_worse_pct_upper_ci exceeds 30%, or when the candidate error rate more than doubles the baseline; warn otherwise. Both severities fire the webhook; severity is on the payload.

See hypothesis and success criteria for the full DSL, metric catalog, and examples.

Webhooks

See the full webhooks reference for setup and signature verification. Two event types fire on scheduled runs:

  • experiment.completed — every completed run, pass or fail. Payload includes request count, cost delta, latency delta, and error rate delta (both raw and pct).
  • experiment.regression_detected — a specific threshold was breached. Fires once per breach, not every subsequent run while the regression persists. Payload includes metric, observed_delta_pct, threshold_pct, severity, and a signal_key that correlates the event with the in-app signal on the overview page.

If multiple metrics breach in one run, you get one experiment.regression_detected event per metric.
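
An illustrative experiment.regression_detected payload built from the fields listed above; the envelope (event, data) and the concrete values are assumptions, so verify the exact shape against the webhooks reference:

{
  "event": "experiment.regression_detected",
  "data": {
    "metric": "latency_p95_delta_pct",
    "observed_delta_pct": 31.4,
    "threshold_pct": 25,
    "severity": "warn",
    "signal_key": "experiment-verdict:sched_abc"
  }
}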

Pause / resume / run now

# Pause (skips future fires, keeps history)
curl -X POST https://api.modelux.ai/manage/v1/scheduled_experiments/sched_abc/pause \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY"

# Resume
curl -X POST https://api.modelux.ai/manage/v1/scheduled_experiments/sched_abc/resume \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY"

# Fire one run now without touching the cron
curl -X POST https://api.modelux.ai/manage/v1/scheduled_experiments/sched_abc/run_now \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY"

Limits

  • ensemble policies cannot be used as candidates in scheduled experiments (same rule as one-shot experiments).
  • with_responses requires the project to be in full logging mode.
  • Each scheduled run has the same 30-minute hard timeout as a one-shot experiment.
  • Thresholds are expressed as percentages, not absolute deltas.

API reference

  • POST /manage/v1/scheduled_experiments — create
  • GET /manage/v1/scheduled_experiments — list (filter by projectId, status; see the example after this list)
  • GET /manage/v1/scheduled_experiments/{id} — fetch
  • PATCH /manage/v1/scheduled_experiments/{id} — update cron, window, thresholds, rubric
  • DELETE /manage/v1/scheduled_experiments/{id} — delete
  • POST /manage/v1/scheduled_experiments/{id}/pause
  • POST /manage/v1/scheduled_experiments/{id}/resume
  • POST /manage/v1/scheduled_experiments/{id}/run_now
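
For example, listing only paused schedules in a project (projectId and status are the filters named above; the paused value is an assumption based on the pause endpoint):

curl "https://api.modelux.ai/manage/v1/scheduled_experiments?projectId=proj_abc&status=paused" \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY"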

Equivalent MCP tools: create_scheduled_experiment, list_scheduled_experiments, get_scheduled_experiment, update_scheduled_experiment, delete_scheduled_experiment, pause_scheduled_experiment, resume_scheduled_experiment, run_scheduled_experiment_now.