<!-- source: https://modelux.ai/docs/concepts/ensembles -->

> Run multiple models in parallel and aggregate their outputs.

# Ensembles

An ensemble routing config fans a single request out to multiple models in
parallel, then aggregates their responses into a single final output. Done
right, ensembles of smaller models can match or exceed frontier-model quality
at a fraction of the cost.

## Aggregation strategies

| Strategy | Description | Best for |
|---|---|---|
| `first_valid` | Return the first response that passes validation | Latency-sensitive with reliability fallback |
| `weighted_vote` | Classification-style vote across outputs | Categorical / structured outputs |
| `weighted_average` | Numeric outputs combined by weight | Scoring, ratings |
| `llm_judge` | Send all outputs to a judge model for best-pick | Open-ended generation |
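
As a concrete illustration, `weighted_vote` tallies each distinct output by the total weight of the members that produced it. The sketch below is a minimal, hypothetical implementation of that idea (the platform's actual aggregator is internal):

```python
from collections import Counter

def weighted_vote(outputs):
    """Pick the output backed by the most total member weight.

    `outputs` is a list of (output, weight) pairs, one per ensemble
    member. Ties break toward the first output seen.
    """
    tally = Counter()
    for output, weight in outputs:
        tally[output] += weight
    # most_common preserves insertion order among equal counts
    return tally.most_common(1)[0][0]
```

For example, two members answering `"A"` with weights 1.0 and 0.8 outvote a single `"B"` with weight 1.0.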

## Configuration

```json
{
  "strategy": "ensemble",
  "aggregation": "weighted_vote",
  "members": [
    { "model": "claude-haiku-4-5",   "weight": 1.0 },
    { "model": "gpt-4o-mini",        "weight": 1.0 },
    { "model": "gemini-2.5-flash",   "weight": 0.8 }
  ],
  "timeout_ms": 5000
}
```

## Cost math

Because every member processes the full request, a 3-model ensemble of cheap
models costs roughly three times as much as a single cheap-model call. Example:

- Frontier model (e.g. GPT-4o): ~$0.015 per 1k tokens
- 3-model ensemble (haiku + 4o-mini + flash): ~$0.003 per 1k tokens

That's roughly 5x cheaper, often at comparable quality. The ensemble
cost estimator in the dashboard shows live per-request cost based on your
typical prompt size.
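
The arithmetic is just a weighted sum, since every member sees the full prompt. A quick sketch using the approximate rates from the example above (the even split across members is illustrative; real prices vary by model):

```python
def ensemble_cost(member_rates, tokens_thousands):
    """Every member processes the full request, so costs add linearly.

    `member_rates` are $ per 1k tokens; `tokens_thousands` is the
    request size in thousands of tokens.
    """
    return sum(member_rates) * tokens_thousands

# Approximate rates from the example: frontier ~$0.015/1k tokens,
# three cheap members totalling ~$0.003/1k tokens.
frontier = 0.015 * 1                              # one frontier call, 1k tokens
ensemble = ensemble_cost([0.001, 0.001, 0.001], 1)  # three cheap members
```

Here `frontier / ensemble` comes out to about 5, matching the figure above.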

## When to use ensembles

Good fits:

- Structured / classification tasks where voting helps
- Quality-critical tasks where you'd otherwise use a frontier model
- Tasks where small model variance is the main quality issue

Less ideal:

- Streaming-heavy workloads (ensembles don't stream)
- Latency-critical paths (you wait for the slowest member, bounded by `timeout_ms`)
- Tasks where cheap models already suffice on their own
