# Caching
Modelux caches successful responses keyed by request content. Cache hits skip the provider call entirely, returning the cached response with sub-millisecond latency and zero provider cost.
## Cache modes
| Mode | Match behavior |
|---|---|
| `exact` | Request body must match byte-for-byte (default) |
| `semantic` | Embedding similarity above a configurable threshold (advanced) |
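Exact mode implies a deterministic cache key derived from the request body. A minimal sketch of how such a key could be computed (the hashing scheme here is an illustration, not Modelux's documented internals):

```python
import hashlib
import json

def cache_key(request_body: dict) -> str:
    """Derive a deterministic cache key from a request body.

    Serializing with sorted keys makes the key insensitive to field
    ordering, so byte-for-byte matching reduces to comparing digests.
    (Hypothetical scheme for illustration.)
    """
    canonical = json.dumps(request_body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Two identical requests with different field order produce the same key:
a = cache_key({"model": "@production", "temperature": 0})
b = cache_key({"temperature": 0, "model": "@production"})
assert a == b
```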
## TTL
Configure TTL per routing config:
```json
{
  "cache": {
    "mode": "exact",
    "ttl_seconds": 3600
  }
}
```
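The effect of `ttl_seconds` can be sketched as a timestamped in-memory store: entries older than the TTL are treated as misses and evicted on read. This is illustrative only; Modelux's internal store is not described here.

```python
import time

class TTLCache:
    """Minimal TTL cache sketch: entries expire ttl_seconds after write."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

cache = TTLCache(ttl_seconds=3600)
cache.set("k", {"response": "cached"})
assert cache.get("k") == {"response": "cached"}
```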
## What gets cached
- Successful chat completions and embeddings
- Streaming responses (stitched into a complete response server-side)
- Structured outputs (JSON mode)
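Stitching a streamed response into a cacheable whole amounts to concatenating the content deltas in order. A minimal sketch, assuming OpenAI-style chat-completion chunks (the chunk shape is an assumption, not a documented Modelux format):

```python
def stitch_stream(chunks):
    """Reassemble streamed delta chunks into one complete response string.

    Each chunk is assumed to carry a partial "content" delta, mirroring
    OpenAI-style streaming payloads; role-only chunks contribute nothing.
    """
    parts = []
    for chunk in chunks:
        delta = chunk.get("choices", [{}])[0].get("delta", {})
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)

chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": ", world"}}]},
]
assert stitch_stream(chunks) == "Hello, world"
```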
## What doesn’t get cached
- Failed responses (4xx, 5xx)
- Requests with `temperature > 0`, unless you opt in explicitly
- Requests with tool calling (by default; opt-in available)
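The exclusion rules above can be sketched as a predicate over the request and response status (field names mirror an OpenAI-style request body; the `opt_in` flag is hypothetical shorthand for the explicit opt-in):

```python
def is_cacheable(request: dict, status_code: int, opt_in: bool = False) -> bool:
    """Decide whether a response may be cached, per the rules above."""
    if status_code >= 400:  # failed responses (4xx, 5xx) are never cached
        return False
    if request.get("temperature", 0) > 0 and not opt_in:
        return False        # non-deterministic sampling unless opted in
    if request.get("tools") and not opt_in:
        return False        # tool calling is excluded by default
    return True

assert is_cacheable({"temperature": 0}, 200)
assert not is_cacheable({"temperature": 0.7}, 200)
assert not is_cacheable({"temperature": 0}, 500)
```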
## Per-request cache control
Override the cache at request time:
```python
client.chat.completions.create(
    model="@production",
    messages=[...],
    extra_body={"mlx:cache": {"skip": True}},
)
```