Scheduled experiments
A regular experiment answers a one-shot question: “what if I switched to this candidate config?” A scheduled experiment asks that same question on a cron — every hour, every day, every week — and fires a webhook when the answer changes in a way you care about.
The loop:
- You define a candidate config, a cadence, and a rolling window.
- modelux replays your last N hours of traffic through the candidate on every fire.
- Cost, latency, and error-rate deltas are stored on the run.
- If any delta breaches a threshold you set, a regression signal lands
on the overview and an
experiment.regression_detectedwebhook fires. - You stay informed. No auto-pause, no auto-promote — modelux tells you, you decide.
When to use it
- Model provider drift. OpenAI updates `gpt-4o-mini`, Anthropic retunes `claude-haiku-4-5`, Google bumps their pricing. You find out on the next scheduled run instead of from a bill.
- Traffic-shape shift. Your users start asking longer questions, or the mix of `end_user_id` changes. A config that was optimal a month ago may no longer be.
- Cheaper-model watch. Run `gpt-4o-mini` against your production `gpt-4o` baseline every night. The day the gap narrows enough, you’ll know.
- Pre-launch candidates. Replay a candidate routing config for two weeks before committing to it — catch any latency cliff before production does.
Create one
```shell
curl -X POST https://api.modelux.ai/manage/v1/scheduled_experiments \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "projectId": "proj_abc",
    "name": "daily gpt-4o-mini regression check",
    "candidatePolicy": "single",
    "candidateConfig": {
      "model": "gpt-4o-mini",
      "provider_credential_id": "pc_openai_default"
    },
    "cronExpression": "0 9 * * *",
    "windowHours": 24,
    "mode": "routing_only",
    "hypothesis": "My current routing stays within its established cost, latency, and error baselines.",
    "successCriteria": {
      "logic": "and",
      "min_sample_size": 100,
      "predicates": [
        { "metric": "cost_delta_pct", "op": "lte", "value": 20 },
        { "metric": "latency_p95_delta_pct", "op": "lte", "value": 25 },
        { "metric": "error_rate_delta_pct", "op": "lte", "value": 50 }
      ]
    }
  }'
```
A one-off run kicks off immediately on create so you see results
without waiting for the first scheduled fire. The scheduled row’s
`last_experiment_id` points at that run.
Or do it from the dashboard: Experiments → Scheduled → New.
Cadence
`cronExpression` is a standard five-field UTC cron. Presets the UI
offers:
| Preset | Cron |
|---|---|
| Every hour | 0 * * * * |
| Daily at 9am UTC | 0 9 * * * |
| Daily at midnight UTC | 0 0 * * * |
| Weekly Mon 9am UTC | 0 9 * * 1 |
Custom expressions work too. The worker evaluates the schedule against
`last_run_at` and fires whenever the next cron time has elapsed.
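The due-check is easy to picture. Here is a minimal sketch that handles only `*` and comma-separated values (real five-field cron also has ranges, steps, and names); `next_fire` and `due` are illustrative names, not modelux's worker internals:

```python
from datetime import datetime, timedelta

def field_matches(field: str, value: int) -> bool:
    # Supports "*" and comma-separated literals; real cron also allows
    # ranges, steps, and names, omitted here for brevity.
    if field == "*":
        return True
    return value in {int(part) for part in field.split(",")}

def next_fire(cron: str, after: datetime) -> datetime:
    """Next UTC time matching a five-field cron (minute hour dom month dow),
    strictly after `after`. Scans minute by minute, bounded to one year."""
    minute, hour, dom, month, dow = cron.split()
    t = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    for _ in range(366 * 24 * 60):
        if (field_matches(minute, t.minute) and field_matches(hour, t.hour)
                and field_matches(dom, t.day) and field_matches(month, t.month)
                and field_matches(dow, t.isoweekday() % 7)):  # cron: 0 = Sunday
            return t
        t += timedelta(minutes=1)
    raise ValueError("no matching time within a year")

def due(cron: str, last_run_at: datetime, now: datetime) -> bool:
    # Fire whenever the next cron time after the last run has elapsed.
    return next_fire(cron, last_run_at) <= now
```

So a `0 9 * * *` schedule last run at 09:00 yesterday becomes due at 09:00 today, even if the worker checks late — the "has elapsed" comparison absorbs jitter.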
Window
`windowHours` is the rolling slice of traffic each run replays. The
window slides forward with each fire — a 24h window on a daily cron
means every run replays the most recent 24h. Values from 1 to 720 hours
are accepted; 24 is the default and matches most daily cadences.
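The window arithmetic is just a slide backwards from each fire time; a sketch (the function name is illustrative):

```python
from datetime import datetime, timedelta

def replay_window(fire_at: datetime, window_hours: int = 24):
    """Return the (start, end) slice of traffic a run replays.

    The window always ends at the fire time, so it slides forward
    with each scheduled fire.
    """
    if not 1 <= window_hours <= 720:
        raise ValueError("windowHours must be between 1 and 720")
    return fire_at - timedelta(hours=window_hours), fire_at
```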
Modes
Scheduled experiments support the same two modes as one-shot experiments:
- `routing_only` — free, fast, zero provider calls. Cost and latency are estimated from token counts × pricing. Use this for most scheduled checks.
- `with_responses` — re-runs a sample of prompts through the candidate model for real, captures responses, scores similarity against the baseline. Requires `sampleSize`, `sampleMethod`, and `spendCapUsd`. Needed if you want a chained judge eval.
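In `routing_only` mode the deltas are pure arithmetic over logged token counts. A sketch of that estimate — the per-million-token prices below are illustrative placeholders, not modelux's pricing table:

```python
# Illustrative (input, output) USD prices per million tokens — NOT
# modelux's actual pricing table.
PRICE_PER_MTOK = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def estimated_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate spend from token counts x per-token pricing."""
    price_in, price_out = PRICE_PER_MTOK[model]
    return (prompt_tokens * price_in + completion_tokens * price_out) / 1_000_000

def cost_delta_pct(baseline_usd: float, candidate_usd: float) -> float:
    """Percentage change of the candidate relative to the baseline."""
    return (candidate_usd - baseline_usd) / baseline_usd * 100
```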
```shell
# With-responses + daily $0.50 cap
curl -X POST https://api.modelux.ai/manage/v1/scheduled_experiments \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY" \
  -d '{
    "projectId": "proj_abc",
    "name": "daily with-responses on gpt-4o-mini",
    "candidatePolicy": "single",
    "candidateConfig": { "model": "gpt-4o-mini", "provider_credential_id": "pc_openai_default" },
    "cronExpression": "0 9 * * *",
    "windowHours": 24,
    "mode": "with_responses",
    "sampleSize": 1000,
    "sampleMethod": "random",
    "spendCapUsd": 0.5
  }'
```
`spendCapUsd` caps each individual scheduled run. The engine watches
actual spend during the run and cancels mid-run on overrun — one bad
fire cannot drain a day’s budget.
Rough monthly-cost math: `spendCapUsd × runs_per_month`. A $0.50 cap on
a daily cron tops out at about $15/month. The create form shows this
projection.
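The cap semantics — cancel mid-run the moment actual spend crosses the cap — and the monthly ceiling can be sketched as follows; all names here are illustrative, not the engine's internals:

```python
def replay_with_cap(prompts, replay_one, spend_cap_usd: float):
    """Replay prompts until done or until actual spend crosses the cap.

    replay_one(prompt) -> (result, cost_usd). The check runs before each
    provider call, so one bad fire stops as soon as the cap is crossed.
    """
    spent, results = 0.0, []
    for prompt in prompts:
        if spent >= spend_cap_usd:
            return results, "cancelled_spend_cap"
        result, cost = replay_one(prompt)
        results.append(result)
        spent += cost
    return results, "completed"

def monthly_ceiling_usd(spend_cap_usd: float, runs_per_month: int) -> float:
    """Rough worst case: every run spends exactly up to its cap."""
    return spend_cap_usd * runs_per_month
```

A $0.50 cap on a daily cron gives a ceiling of 0.5 × 30 ≈ $15/month, matching the projection above.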
Chained judge eval
If `mode=with_responses` and you set `judgeRubric`, every completed run
auto-chains a judge eval with that rubric. The judge summary feeds the
`judge_worse_pct` / `judge_worse_pct_upper_ci` metrics in
`successCriteria` — so a tight quality-regression check is just a
predicate like `{"metric": "judge_worse_pct_upper_ci", "op": "lte", "value": 15}`.
```json
{
  "mode": "with_responses",
  "sampleSize": 1000,
  "sampleMethod": "random",
  "spendCapUsd": 0.5,
  "judgeRubric": "Is the candidate response at least as correct and helpful as the baseline?",
  "judgeModel": "gpt-4o",
  "successCriteria": {
    "logic": "and",
    "min_sample_size": 100,
    "predicates": [
      { "metric": "judge_worse_pct_upper_ci", "op": "lte", "value": 15 },
      { "metric": "error_rate_delta_pct", "op": "lte", "value": 5 }
    ]
  }
}
```
Judge spend is in addition to the per-run `spendCapUsd` and uses the
project’s `judge_spend_cap_usd` default (~$20).
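This page doesn't specify how the upper confidence bound behind `judge_worse_pct_upper_ci` is computed; one standard choice is a Wilson score upper bound, sketched here purely as an assumption:

```python
from math import sqrt

def worse_pct_upper_ci(worse: int, n: int, z: float = 1.96) -> float:
    """Wilson score interval upper bound (z = 1.96, ~95%) on the share of
    judged-worse responses, as a percentage. Assumed method — modelux's
    actual CI construction may differ."""
    if n == 0:
        return 100.0
    p = worse / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + margin) / denom * 100
```

Note that with 0 of 100 samples judged worse, this bound is still about 3.7% — small samples cannot clear a very tight threshold, which is what the `min_sample_size` guard is for.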
Hypothesis and success criteria
Attach a hypothesis (narrative text) and success criteria
(structured predicates from the 15-metric catalog) to the scheduled
experiment. Every run computes a pass / fail / inconclusive
verdict, renders a banner on the run detail page, and ships the verdict
and per-predicate breakdown in the `experiment.completed` webhook.
A run whose verdict is fail fires `experiment.regression_detected`
and raises a signal on the overview with a stable `signalKey` of
`experiment-verdict:<scheduled-experiment-id>` — so dismissals and
snoozes hold across reruns of the same schedule.
Severity: `critical` when ≥ 2 predicates fail, when
`judge_worse_pct_upper_ci` exceeds 30%, or when the candidate error rate
more than doubles the baseline; `warn` otherwise. Both severities fire
the webhook; severity is on the payload.
See hypothesis and success criteria for the full DSL, metric catalog, and examples.
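Putting the pieces from this page together, the verdict and severity logic can be sketched as follows — the function names, and the assumption that ops are simple `lte`/`gte`-style comparisons, come from the examples above rather than a published spec:

```python
OPS = {
    "lte": lambda a, b: a <= b,
    "gte": lambda a, b: a >= b,
    "lt": lambda a, b: a < b,
    "gt": lambda a, b: a > b,
}

def verdict(criteria: dict, observed: dict, sample_size: int):
    """Return (verdict, failed_metrics) for one run."""
    if sample_size < criteria.get("min_sample_size", 0):
        return "inconclusive", []
    results = {
        p["metric"]: OPS[p["op"]](observed[p["metric"]], p["value"])
        for p in criteria["predicates"]
    }
    passed = (all if criteria.get("logic", "and") == "and" else any)(results.values())
    failed = [metric for metric, ok in results.items() if not ok]
    return ("pass" if passed else "fail"), failed

def severity(failed: list, observed: dict) -> str:
    """critical: >=2 failed predicates, judge upper CI over 30%, or the
    candidate error rate more than doubling baseline; warn otherwise."""
    if (len(failed) >= 2
            or observed.get("judge_worse_pct_upper_ci", 0) > 30
            or observed.get("error_rate_delta_pct", 0) > 100):
        return "critical"
    return "warn"
```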
Webhooks
See the full webhooks reference for setup and signature verification. Two event types fire on scheduled runs:
- `experiment.completed` — every completed run, pass or fail. Payload includes request count, cost delta, latency delta, and error rate delta (both raw and pct).
- `experiment.regression_detected` — a specific threshold was breached. Fires once per breach, not every subsequent run while the regression persists. Payload includes `metric`, `observed_delta_pct`, `threshold_pct`, `severity`, and a `signal_key` that correlates the event with the in-app signal on the overview page.
If multiple metrics breach in one run, you get one
`experiment.regression_detected` event per metric.
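Webhook deliveries are commonly at-least-once (check the webhooks reference for modelux's delivery guarantees), so a consumer that pages someone should be idempotent. A minimal dedup sketch keyed on the payload fields listed above — the handler shape is illustrative:

```python
def handle_regression_event(event: dict, open_alerts: set) -> bool:
    """Forward a regression event to your alerting only once per
    (signal_key, metric) pair; redeliveries and repeat fires are dropped."""
    key = (event["signal_key"], event["metric"])
    if key in open_alerts:
        return False  # already alerted for this regression
    open_alerts.add(key)
    return True
```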
Pause / resume / run now
```shell
# Pause (skips future fires, keeps history)
curl -X POST https://api.modelux.ai/manage/v1/scheduled_experiments/sched_abc/pause \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY"

# Resume
curl -X POST https://api.modelux.ai/manage/v1/scheduled_experiments/sched_abc/resume \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY"

# Fire one run now without touching the cron
curl -X POST https://api.modelux.ai/manage/v1/scheduled_experiments/sched_abc/run_now \
  -H "Authorization: Bearer $MODELUX_MGMT_KEY"
```
Limits
- `ensemble` policies cannot be used as candidates in scheduled experiments (same rule as one-shot experiments).
- `with_responses` requires the project to be in `full` logging mode.
- Each scheduled run has the same 30-minute hard timeout as a one-shot experiment.
- Thresholds are expressed as percentages, not absolute deltas.
API reference
- `POST /manage/v1/scheduled_experiments` — create
- `GET /manage/v1/scheduled_experiments` — list (filter by `projectId`, `status`)
- `GET /manage/v1/scheduled_experiments/{id}` — fetch
- `PATCH /manage/v1/scheduled_experiments/{id}` — update cron, window, thresholds, rubric
- `DELETE /manage/v1/scheduled_experiments/{id}` — delete
- `POST /manage/v1/scheduled_experiments/{id}/pause`
- `POST /manage/v1/scheduled_experiments/{id}/resume`
- `POST /manage/v1/scheduled_experiments/{id}/run_now`
Equivalent MCP tools: create_scheduled_experiment,
list_scheduled_experiments, get_scheduled_experiment,
update_scheduled_experiment, delete_scheduled_experiment,
pause_scheduled_experiment, resume_scheduled_experiment,
run_scheduled_experiment_now.