# Caching
Modelux caches successful responses keyed by request content. Cache hits skip the provider call entirely, returning the cached response with sub-millisecond latency and zero provider cost.
## Cache modes
| Mode | Match behavior |
|---|---|
| `exact` | Request body must match byte-for-byte (default) |
| `semantic` | Embedding similarity above a configurable threshold (advanced) |
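Exact mode implies a deterministic cache key derived from the request body. A minimal sketch of how such a key could be computed (the hashing scheme here is an illustration, not Modelux's documented internals):

```python
import hashlib
import json

def cache_key(request_body: dict) -> str:
    """Derive a deterministic cache key from a request body.

    Serializing with sorted keys makes the key insensitive to field
    ordering, so byte-for-byte matching reduces to comparing digests.
    (Hypothetical scheme for illustration.)
    """
    canonical = json.dumps(request_body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Two identical requests with different field order produce the same key:
a = cache_key({"model": "@production", "temperature": 0})
b = cache_key({"temperature": 0, "model": "@production"})
assert a == b
```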
## TTL
Configure TTL per routing config:
```json
{
  "cache": {
    "mode": "exact",
    "ttl_seconds": 3600
  }
}
```
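The effect of `ttl_seconds` can be sketched as a timestamped in-memory store: entries older than the TTL are treated as misses and evicted on read. This is illustrative only; Modelux's internal store is not described here.

```python
import time

class TTLCache:
    """Minimal TTL cache sketch: entries expire ttl_seconds after write."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

cache = TTLCache(ttl_seconds=3600)
cache.set("k", {"response": "cached"})
assert cache.get("k") == {"response": "cached"}
```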
## What gets cached
- Successful chat completions and embeddings
- Streaming responses (stitched into a complete response server-side)
- Structured outputs (JSON mode)
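Stitching a streamed response into a cacheable whole amounts to concatenating the content deltas in order. A minimal sketch, assuming OpenAI-style chat-completion chunks (the chunk shape is an assumption, not a documented Modelux format):

```python
def stitch_stream(chunks):
    """Reassemble streamed delta chunks into one complete response string.

    Each chunk is assumed to carry a partial "content" delta, mirroring
    OpenAI-style streaming payloads; role-only chunks contribute nothing.
    """
    parts = []
    for chunk in chunks:
        delta = chunk.get("choices", [{}])[0].get("delta", {})
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)

chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": ", world"}}]},
]
assert stitch_stream(chunks) == "Hello, world"
```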
## What doesn’t get cached
- Failed responses (4xx, 5xx)
- Requests with `temperature > 0`, unless you opt in explicitly
- Requests with tool calling (by default; opt-in available)
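The exclusion rules above can be sketched as a predicate over the request and response status (field names mirror an OpenAI-style request body; the `opt_in` flag is hypothetical shorthand for the explicit opt-in):

```python
def is_cacheable(request: dict, status_code: int, opt_in: bool = False) -> bool:
    """Decide whether a response may be cached, per the rules above."""
    if status_code >= 400:  # failed responses (4xx, 5xx) are never cached
        return False
    if request.get("temperature", 0) > 0 and not opt_in:
        return False        # non-deterministic sampling unless opted in
    if request.get("tools") and not opt_in:
        return False        # tool calling is excluded by default
    return True

assert is_cacheable({"temperature": 0}, 200)
assert not is_cacheable({"temperature": 0.7}, 200)
assert not is_cacheable({"temperature": 0}, 500)
```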
## Per-request cache control
Override the cache at request time:
```python
client.chat.completions.create(
    model="@production",
    messages=[...],
    extra_body={"mlx:cache": {"skip": True}},
)
```