<!-- source: https://modelux.ai/docs/concepts/caching -->

> Exact-match and semantic caching.

# Caching

Modelux caches successful responses keyed by request content. Cache hits skip
the provider call entirely and return the cached response with sub-millisecond
latency at zero provider cost.

## Cache modes

| Mode | Match behavior |
|---|---|
| `exact` | Request body must match byte-for-byte (default) |
| `semantic` | Embedding similarity above a threshold (advanced) |
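To make the two modes concrete, here is a minimal sketch of how each kind of match could work. This is an illustration, not Modelux's actual implementation: the hash function, the embedding shape, and the default threshold value are all assumptions.

```python
import hashlib

def exact_cache_key(raw_body: bytes) -> str:
    # Exact mode: the key is derived from the raw request bytes, so any
    # difference (even whitespace) produces a different key and a cache miss.
    return hashlib.sha256(raw_body).hexdigest()

def semantic_match(query_vec: list[float], cached_vec: list[float],
                   threshold: float = 0.95) -> bool:
    # Semantic mode: compare request embeddings by cosine similarity;
    # a hit requires similarity at or above the configured threshold.
    dot = sum(a * b for a, b in zip(query_vec, cached_vec))
    norm_q = sum(a * a for a in query_vec) ** 0.5
    norm_c = sum(b * b for b in cached_vec) ** 0.5
    return dot / (norm_q * norm_c) >= threshold
```

Note that under exact matching, two JSON bodies that differ only in key order or whitespace are distinct requests; clients that want reliable hits should serialize deterministically.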

## TTL

Configure TTL per routing config:

```json
{
  "cache": {
    "mode": "exact",
    "ttl_seconds": 3600
  }
}
```

## What gets cached

- Successful chat completions and embeddings
- Streaming responses (stitched into a complete response server-side)
- Structured outputs (JSON mode)
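Stitching a streamed response means joining the partial deltas into the complete message before it is written to the cache. A minimal sketch, assuming each chunk carries a `delta` string (a hypothetical shape, not Modelux's wire format):

```python
def stitch_stream(chunks: list[dict]) -> str:
    # Concatenate the streamed deltas in arrival order to reconstruct
    # the full assistant message that gets cached server-side.
    return "".join(c.get("delta", "") for c in chunks)
```

Subsequent cache hits on that request then return the stitched response in one piece, whether or not the client asks for streaming.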

## What doesn't get cached

- Failed responses (4xx, 5xx)
- Requests with `temperature > 0` unless you opt in explicitly
- Requests with tool-calling (by default; opt-in available)
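The rules above amount to a cacheability check like the following sketch. The function and its parameters are illustrative; `cache_opt_in` stands in for whatever explicit opt-in mechanism the config exposes:

```python
def should_cache(status: int, temperature: float, has_tools: bool,
                 cache_opt_in: bool = False) -> bool:
    # Only successful responses are cached; nondeterministic (temperature > 0)
    # and tool-calling requests are skipped unless explicitly opted in.
    if not (200 <= status < 300):
        return False
    if temperature > 0 and not cache_opt_in:
        return False
    if has_tools and not cache_opt_in:
        return False
    return True
```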

## Per-request cache control

Override the caching behavior for a single request:

```python
client.chat.completions.create(
    model="@production",
    messages=[...],
    extra_body={"mlx:cache": {"skip": True}},
)
```
