# Ensembles
An ensemble routing config fans a single request out to multiple models in parallel, then aggregates their responses into a single final output. Done right, ensembles of smaller models can match or exceed frontier-model quality at a fraction of the cost.
## Aggregation strategies
| Strategy | Description | Best for |
|---|---|---|
| `first_valid` | Return the first response that passes validation | Latency-sensitive workloads with a reliability fallback |
| `weighted_vote` | Classification-style vote across member outputs | Categorical / structured outputs |
| `weighted_average` | Numeric outputs combined by member weight | Scoring, ratings |
| `llm_judge` | Send all outputs to a judge model, which picks the best | Open-ended generation |
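As a concrete illustration of the voting strategy, here is a minimal sketch of `weighted_vote` aggregation, assuming each member returns a categorical answer paired with its configured weight (the function name and shape are illustrative, not the router's actual internals):

```python
from collections import defaultdict

def weighted_vote(outputs):
    """Pick the answer with the highest total member weight.

    `outputs` is a list of (answer, weight) pairs, one per ensemble member.
    """
    totals = defaultdict(float)
    for answer, weight in outputs:
        totals[answer] += weight
    return max(totals, key=totals.get)

# Two members at weight 1.0 agreeing outvote one member at 0.8:
print(weighted_vote([("spam", 1.0), ("spam", 1.0), ("ham", 0.8)]))  # spam
```

This is why `weighted_vote` suits categorical outputs: identical answers accumulate weight, while free-form text would almost never match exactly.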
## Configuration
```json
{
  "strategy": "ensemble",
  "aggregation": "weighted_vote",
  "members": [
    { "model": "claude-haiku-4-5", "weight": 1.0 },
    { "model": "gpt-4o-mini", "weight": 1.0 },
    { "model": "gemini-2.5-flash", "weight": 0.8 }
  ],
  "timeout_ms": 5000
}
```
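To make the fan-out semantics concrete, here is a sketch of how a router might dispatch the request to all members in parallel and enforce `timeout_ms`, using `asyncio`. The `call_model` stub is a hypothetical stand-in for a real API call; only the members/timeout shape comes from the config above:

```python
import asyncio

async def call_model(model, prompt):
    # Hypothetical stand-in for a real model API call.
    await asyncio.sleep(0.01)
    return f"{model}: ok"

async def fan_out(members, prompt, timeout_ms):
    """Dispatch to every member in parallel; drop members that miss the deadline."""
    tasks = [asyncio.create_task(call_model(m["model"], prompt)) for m in members]
    done, pending = await asyncio.wait(tasks, timeout=timeout_ms / 1000)
    for task in pending:
        task.cancel()  # slow members are abandoned, not awaited
    return [task.result() for task in done]

members = [
    {"model": "claude-haiku-4-5", "weight": 1.0},
    {"model": "gpt-4o-mini", "weight": 1.0},
]
results = asyncio.run(fan_out(members, "classify this", 5000))
print(len(results))  # 2
```

Note that `timeout_ms` bounds the whole fan-out: responses that arrive after the deadline never reach the aggregation step.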
## Cost math
A 3-model ensemble of cheap models costs roughly three times as much as a single cheap model, since every member processes the full request. Example:
- Frontier model (e.g. GPT-4o): ~$0.015 per 1k tokens
- 3-model ensemble (haiku + 4o-mini + flash): ~$0.003 per 1k tokens
That's 5x cheaper, often at comparable quality. The ensemble cost estimator in the dashboard shows live per-request cost based on your typical prompt size.
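The arithmetic behind the comparison is just a sum of per-member rates. A small sketch, using illustrative per-member prices chosen to total the ~$0.003/1k figure above (the individual splits are assumptions, not published pricing):

```python
def ensemble_cost_per_1k(member_rates):
    """Total per-1k-token cost of fanning one request to every member."""
    return sum(member_rates)

frontier = 0.015  # ~$0.015 per 1k tokens, from the example above

# Hypothetical per-member rates summing to the ~$0.003/1k ensemble figure:
ensemble = ensemble_cost_per_1k([0.0010, 0.0006, 0.0014])

print(round(frontier / ensemble, 1))  # 5.0
```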
## When to use ensembles
Good fits:
- Structured / classification tasks where voting helps
- Quality-critical tasks where you’d otherwise use a frontier model
- Tasks where small model variance is the main quality issue
Less ideal:
- Streaming-heavy workloads (ensembles don’t stream)
- Latency-critical paths (most aggregation strategies wait for the slowest member, bounded by `timeout_ms`)
- Tasks where cheap models already suffice on their own