[view as .md]

Caching

Modelux caches successful responses keyed by request content. Cache hits skip the provider call entirely and return the cached response with sub-millisecond latency at zero provider cost.

Cache modes

ModeMatch behavior
exactRequest body must match byte-for-byte (default)
semanticEmbedding similarity above a threshold (advanced)

TTL

Configure TTL per routing config:

{
  "cache": {
    "mode": "exact",
    "ttl_seconds": 3600
  }
}

What gets cached

  • Successful chat completions and embeddings
  • Streaming responses (stitched into a complete response server-side)
  • Structured outputs (JSON mode)

What doesn’t get cached

  • Failed responses (4xx, 5xx)
  • Requests with temperature > 0 unless you opt in explicitly
  • Requests with tool-calling (by default; opt-in available)

Cache-control per-request

Override the cache at request time:

client.chat.completions.create(
    model="@production",
    messages=[...],
    extra_body={"mlx:cache": {"skip": True}},
)