# Ensembles
An ensemble routing config fans a single request out to multiple models in parallel, then aggregates their responses into a single final output. Done right, ensembles of smaller models can match or exceed frontier-model quality at a fraction of the cost.
## Aggregation strategies
| Strategy | Description | Best for |
|---|---|---|
| `first_valid` | Return the first response that passes validation | Latency-sensitive workloads with a reliability fallback |
| `weighted_vote` | Classification-style vote across member outputs | Categorical / structured outputs |
| `weighted_average` | Numeric outputs combined by member weight | Scoring, ratings |
| `llm_judge` | Send all outputs to a judge model, which picks the best | Open-ended generation |
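As a concrete illustration of the voting strategy, here is a minimal sketch of `weighted_vote` aggregation, assuming each member returns a categorical answer paired with its configured weight (the function name and shape are illustrative, not the router's actual internals):

```python
from collections import defaultdict

def weighted_vote(outputs):
    """Pick the answer with the highest total member weight.

    `outputs` is a list of (answer, weight) pairs, one per ensemble member.
    """
    totals = defaultdict(float)
    for answer, weight in outputs:
        totals[answer] += weight
    return max(totals, key=totals.get)

# Two members at weight 1.0 agreeing outvote one member at 0.8:
print(weighted_vote([("spam", 1.0), ("spam", 1.0), ("ham", 0.8)]))  # spam
```

This is why `weighted_vote` suits categorical outputs: identical answers accumulate weight, while free-form text would almost never match exactly.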
## Configuration
```json
{
  "strategy": "ensemble",
  "aggregation": "weighted_vote",
  "members": [
    { "model": "claude-haiku-4-5", "weight": 1.0 },
    { "model": "gpt-4o-mini", "weight": 1.0 },
    { "model": "gemini-2.5-flash", "weight": 0.8 }
  ],
  "timeout_ms": 5000
}
```
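To make the fan-out semantics concrete, here is a sketch of how a router might dispatch the request to all members in parallel and enforce `timeout_ms`, using `asyncio`. The `call_model` stub is a hypothetical stand-in for a real API call; only the members/timeout shape comes from the config above:

```python
import asyncio

async def call_model(model, prompt):
    # Hypothetical stand-in for a real model API call.
    await asyncio.sleep(0.01)
    return f"{model}: ok"

async def fan_out(members, prompt, timeout_ms):
    """Dispatch to every member in parallel; drop members that miss the deadline."""
    tasks = [asyncio.create_task(call_model(m["model"], prompt)) for m in members]
    done, pending = await asyncio.wait(tasks, timeout=timeout_ms / 1000)
    for task in pending:
        task.cancel()  # slow members are abandoned, not awaited
    return [task.result() for task in done]

members = [
    {"model": "claude-haiku-4-5", "weight": 1.0},
    {"model": "gpt-4o-mini", "weight": 1.0},
]
results = asyncio.run(fan_out(members, "classify this", 5000))
print(len(results))  # 2
```

Note that `timeout_ms` bounds the whole fan-out: responses that arrive after the deadline never reach the aggregation step.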
## Cost math
A 3-model ensemble of cheap models costs roughly three times as much as a single cheap model, since every member processes the full request. Example:
- Frontier model (e.g. GPT-4o): ~$0.015 per 1k tokens
- 3-model ensemble (haiku + 4o-mini + flash): ~$0.003 per 1k tokens
That's 5x cheaper, often at comparable quality. The ensemble cost estimator in the dashboard shows live per-request cost based on your typical prompt size.
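The arithmetic behind the comparison is just a sum of per-member rates. A small sketch, using illustrative per-member prices chosen to total the ~$0.003/1k figure above (the individual splits are assumptions, not published pricing):

```python
def ensemble_cost_per_1k(member_rates):
    """Total per-1k-token cost of fanning one request to every member."""
    return sum(member_rates)

frontier = 0.015  # ~$0.015 per 1k tokens, from the example above

# Hypothetical per-member rates summing to the ~$0.003/1k ensemble figure:
ensemble = ensemble_cost_per_1k([0.0010, 0.0006, 0.0014])

print(round(frontier / ensemble, 1))  # 5.0
```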
## When to use ensembles
Good fits:
- Structured / classification tasks where voting helps
- Quality-critical tasks where you’d otherwise use a frontier model
- Tasks where small model variance is the main quality issue
Less ideal:
- Streaming-heavy workloads (ensembles don’t stream)
- Latency-critical paths (most aggregation strategies wait for the slowest member, bounded by `timeout_ms`)
- Tasks where cheap models already suffice on their own