Ensembles

An ensemble routing config fans a single request out to multiple models in parallel, then aggregates their responses into a single final output. Done right, ensembles of smaller models can match or exceed frontier-model quality at a fraction of the cost.
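The fan-out/aggregate flow can be sketched in a few lines. This is a minimal illustration, not the router's implementation; `call_model` is a hypothetical stand-in for a real model client, and the aggregation step is pluggable per strategy.

```python
# Minimal ensemble fan-out sketch: issue one request per member in
# parallel, then hand all responses to an aggregation step.
from concurrent.futures import ThreadPoolExecutor

def call_model(model: str, prompt: str) -> str:
    # Hypothetical placeholder for a real model API call.
    return f"{model}: answer to {prompt!r}"

def fan_out(models: list[str], prompt: str) -> list[str]:
    # One worker per member so all requests run concurrently.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = [pool.submit(call_model, m, prompt) for m in models]
        # Preserve member order regardless of completion order.
        return [f.result() for f in futures]

responses = fan_out(["model-a", "model-b"], "classify this")
```

A real router would add per-member error handling and the `timeout_ms` bound described below in place of the unconditional `f.result()` calls.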

Aggregation strategies

| Strategy | Description | Best for |
| --- | --- | --- |
| `first_valid` | Return the first response that passes validation | Latency-sensitive workloads with a reliability fallback |
| `weighted_vote` | Classification-style vote across outputs | Categorical / structured outputs |
| `weighted_average` | Numeric outputs combined by weight | Scoring, ratings |
| `llm_judge` | Send all outputs to a judge model, which picks the best | Open-ended generation |
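As a concrete example, `weighted_vote` can be sketched as a weight-scaled tally over categorical outputs. This assumes the member outputs are already comparable labels; normalization of free-form text is out of scope here.

```python
# Sketch of weighted_vote aggregation: each member's output counts as
# one vote scaled by that member's configured weight.
from collections import defaultdict

def weighted_vote(outputs: list[str], weights: list[float]) -> str:
    tally: dict[str, float] = defaultdict(float)
    for label, weight in zip(outputs, weights):
        tally[label] += weight
    # Highest total weight wins; ties resolve to the first-seen label.
    return max(tally, key=tally.get)

weighted_vote(["spam", "ham", "spam"], [1.0, 1.0, 0.8])  # -> "spam"
```

With the weights from the config below, two agreeing cheap members (total 2.0) always outvote the third (0.8), so a single outlier response cannot flip the result.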

Configuration

```json
{
  "strategy": "ensemble",
  "aggregation": "weighted_vote",
  "members": [
    { "model": "claude-haiku-4-5",   "weight": 1.0 },
    { "model": "gpt-4o-mini",        "weight": 1.0 },
    { "model": "gemini-2.5-flash",   "weight": 0.8 }
  ],
  "timeout_ms": 5000
}
```
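The `timeout_ms` field bounds how long the router waits for members. A sketch of how it interacts with the `first_valid` strategy, with `call_model` and `is_valid` as hypothetical placeholders:

```python
# Sketch of first_valid under a timeout_ms budget: return the earliest
# response that passes validation, or fail once the budget is spent.
from concurrent.futures import ThreadPoolExecutor, as_completed

def call_model(model: str, prompt: str) -> str:
    return f"{model}: ok"            # placeholder for a real API call

def is_valid(response: str) -> bool:
    return response.endswith("ok")   # placeholder validation check

def first_valid(models: list[str], prompt: str, timeout_ms: int = 5000) -> str:
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = [pool.submit(call_model, m, prompt) for m in models]
        # as_completed yields responses in arrival order, so a fast
        # valid member short-circuits the slower ones.
        for future in as_completed(futures, timeout=timeout_ms / 1000):
            result = future.result()
            if is_valid(result):
                return result
    raise RuntimeError("no member returned a valid response in time")
```

For `weighted_vote` and the other all-members strategies, the router instead waits for every member (or the timeout, whichever comes first) before aggregating.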

Cost math

A 3-model ensemble of cheap models costs roughly three times as much as a single cheap model, but is often still far cheaper than one frontier model. Example:

  • Frontier model (e.g. GPT-4o): ~$0.015 per 1k tokens
  • 3-model ensemble (haiku + 4o-mini + flash): ~$0.003 per 1k tokens

That’s roughly 5x cheaper, often at comparable quality. The ensemble cost estimator in the dashboard shows live per-request cost based on your typical prompt size.
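The arithmetic behind the comparison, using the quoted per-1k-token rates (illustrative figures from above, not live pricing):

```python
# Per-request cost comparison at the rates quoted above.
frontier_per_1k = 0.015   # ~$0.015 per 1k tokens (e.g. GPT-4o)
ensemble_per_1k = 0.003   # sum of the three cheap members' rates

tokens = 2_000            # an assumed typical request, prompt + completion

frontier_cost = frontier_per_1k * tokens / 1000   # ~$0.03 per request
ensemble_cost = ensemble_per_1k * tokens / 1000   # ~$0.006 per request
savings = frontier_per_1k / ensemble_per_1k       # ~5x
```

Note that the multiplier is independent of request size: both costs scale linearly with tokens, so the ratio of the per-1k rates is the ratio of the per-request costs.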

When to use ensembles

Good fits:

  • Structured / classification tasks where voting helps
  • Quality-critical tasks where you’d otherwise use a frontier model
  • Tasks where small model variance is the main quality issue

Less ideal:

  • Streaming-heavy workloads (ensembles don’t stream)
  • Latency-critical paths (aggregation waits for the slowest member, bounded by `timeout_ms`)
  • Tasks where cheap models already suffice on their own