Responses

modelux proxies the full OpenAI Responses API surface (/v1/responses) with auth + rate-limit + entitlements + observability + BYOK on top of byte-faithful request/response forwarding. Routing configs (@config) and fallback-within-OpenAI are supported on the create endpoint so Responses traffic benefits from the same indirection and reliability patterns as /chat/completions, scoped to OpenAI targets.

POST   /openai/v1/responses                       create (sync or stream, @config supported)
GET    /openai/v1/responses/{id}                  retrieve (or ?stream=true to replay)
POST   /openai/v1/responses/{id}/cancel           cancel a background response
DELETE /openai/v1/responses/{id}                  delete a stored response
GET    /openai/v1/responses/{id}/input_items      list input items

When to use this vs /chat/completions

The Responses API is OpenAI-specific — its item taxonomy (function_call, web_search_call, computer_call, reasoning, image outputs, …) has no faithful mapping onto Claude or Gemini. So:

  • Cross-provider routing, ensemble policies, semantic cache, decision traces → use /openai/v1/chat/completions. That’s the canonical multi-provider surface.
  • Reasoning effort, web_search tool, computer use, image outputs, stored / chained responses, the new SSE event taxonomy → use /openai/v1/responses. You get OpenAI’s full feature set plus @config routing, fallback-within-OpenAI, budgets, and A/B tests — the request always lands on OpenAI (or your BYOK OpenAI-compatible base URL).

Create

Drop-in for the OpenAI SDK — point base_url at https://api.modelux.ai/openai/v1:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.modelux.ai/openai/v1",
    api_key="mlx_sk_...",  # your modelux key
)

resp = client.responses.create(
    model="gpt-4o-mini",
    input="Walk me through the proof that sqrt(2) is irrational.",
    instructions="Be concise.",
    max_output_tokens=400,
    reasoning={"effort": "medium"},
)
print(resp.output_text)

The request body is forwarded byte-for-byte, so SDK additions (new tool types, new reasoning options, new input item shapes) work without proxy changes. The proxy only requires model to be present (so unrouted requests are caught at the boundary).

Reasoning parameter normalization

The one exception to byte-identical forwarding is the reasoning object, which is reconciled against the resolved upstream model:

  • reasoning: {"effort": "none"} — stripped from the outbound body before forwarding. OpenAI’s chat-only models (gpt-4o, gpt-4.1) reject reasoning outright, and effort: "none" is a pi-ai-style sentinel meaning “don’t reason” — dropping it is a semantic no-op that keeps the request working on any target.
  • reasoning: {"effort": "minimal" | "low" | "medium" | "high"} (or any other non-empty reasoning shape) — forwarded as-is, but the proxy refuses the request with a 400 unsupported_parameter error when the resolved target is a known non-reasoning model. On a @config fallback chain, non-reasoning targets are pruned; if no reasoning-capable targets remain, the request is rejected.

Unknown models (including custom-hosted OpenAI-compatible endpoints) are always trusted — the upstream’s own error surfaces unchanged if the parameter is actually rejected.
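
A minimal sketch of that reconciliation for the single-target case (the KNOWN_NON_REASONING table, function name, and error class are illustrative stand-ins, not modelux internals):

class UnsupportedParameterError(ValueError):
    """Illustrative stand-in for the 400 unsupported_parameter rejection."""

# Hypothetical table; the proxy's real model list is internal.
KNOWN_NON_REASONING = {"gpt-4o", "gpt-4.1"}

def reconcile_reasoning(body: dict, resolved_model: str) -> dict:
    reasoning = body.get("reasoning")
    if not reasoning:
        return body  # no reasoning field: forward untouched
    if reasoning.get("effort") == "none":
        del body["reasoning"]  # "don't reason" sentinel: stripping is a semantic no-op
        return body
    if resolved_model in KNOWN_NON_REASONING:
        # Known non-reasoning target: refuse before the upstream call.
        raise UnsupportedParameterError(
            f"'reasoning' is not supported by {resolved_model}"
        )
    return body  # unknown models are trusted; the upstream's own error surfaces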

Routing configs (@config)

Pass @<routing-config-name> as the model field to resolve the target(s) through a routing config instead of naming a model directly. This is how you upgrade models, run A/B tests between model versions, and get fallback-within-OpenAI without changing the calling code.

# The app names a stable config; operators change what it points to.
client.responses.create(
    model="@icon-generator",
    input="a minimalist lightning bolt icon",
)

On the proxy side, modelux:

  1. Looks up the routing config by name within the calling project.
  2. Applies the policy (single model, fallback chain, A/B test, etc.).
  3. Validates that every resolved target is on OpenAI — rejects with a 400 if the config points at Anthropic or Gemini, since the Responses request shape doesn’t translate.
  4. Rewrites the outbound body’s model field to the chosen target’s concrete model. All other fields (reasoning options, tool configs, image parameters, custom headers the SDK sent, …) round-trip unchanged.
  5. Uses the target’s credential for the upstream call (not the org default), so different configs can point at different OpenAI accounts / Azure deployments / OpenAI-compatible base URLs.
  6. Enforces the config’s attached budgets if enforce_budgets is set.
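
For example, @icon-generator might resolve through a single-model config (same shape as the nested single blocks in the ab_test example further down; the model name is illustrative):

{
  "policy": "single",
  "config": {"model": "gpt-4o", "provider_credential_id": "..."}
}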

Fallback within OpenAI

A fallback_chain with multiple OpenAI targets retries on transient upstream failures. The proxy advances to the next target when the current one returns:

  • a transport / connection error, OR
  • HTTP 5xx, OR
  • HTTP 429 (rate limit — the next key may not share the limit), OR
  • HTTP 408 (request timeout)

4xx statuses other than 429/408 are not retried — a malformed request, revoked key, or model-not-on-account is a caller-side error that will fail identically on the sibling target. The X-Modelux-Model-Used response header names whichever target actually served the request.
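
A sketch of such a config: the fallback_chain policy name comes from this section, but the targets key and nesting are extrapolated from the ab_test shape below, so treat the exact layout as illustrative:

{
  "policy": "fallback_chain",
  "config": {
    "targets": [
      {"policy": "single", "config": {"model": "gpt-5.2", "provider_credential_id": "..."}},
      {"policy": "single", "config": {"model": "gpt-4o",  "provider_credential_id": "..."}}
    ]
  }
}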

Streaming requests fall back only before the first byte reaches the caller. Once the SSE relay starts we’re committed — a mid-stream disconnect is reported to the caller as a truncated stream, not retried, because switching targets mid-response would corrupt the output-item sequence.

A/B testing model upgrades

Point a routing config at an ab_test policy to split Responses traffic between two OpenAI targets — useful for rolling out a new image-capable or reasoning-capable model to a percentage of production traffic:

{
  "policy": "ab_test",
  "config": {
    "split_percent": 10,
    "bucket_key": "end_user_id",
    "control":   {"policy": "single", "config": {"model": "gpt-4o",   "provider_credential_id": "..."}},
    "treatment": {"policy": "single", "config": {"model": "gpt-5.2",  "provider_credential_id": "..."}}
  }
}

Each response carries X-Modelux-AB-Variant: control | treatment so dashboard breakdowns can compare latency, cost, and error rates between the two variants.
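
One way to read the variant caller-side, using the OpenAI Python SDK's raw-response wrapper:

raw = client.responses.with_raw_response.create(
    model="@icon-generator",
    input="a minimalist lightning bolt icon",
)
print(raw.headers.get("X-Modelux-AB-Variant"))  # "control" or "treatment"
resp = raw.parse()  # the regular Response object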

What doesn’t work with @config on Responses

  • Ensemble / consensus policies — they imply parallel execution with a similarity-aggregation step that has no meaning on the Responses taxonomy. Returns 400.
  • Cross-provider targets — any non-OpenAI target in the config returns 400 before the upstream call.
  • Semantic cache — caching by Responses input is not supported because the item taxonomy (reasoning steps, tool calls, image inputs, encrypted reasoning content) makes cache-key hashing fragile. Cache on /chat/completions if caching matters.

Bare model names (gpt-5.2) and provider-qualified names (openai/gpt-5.2) continue to pass through to OpenAI untouched with the org-default credential — no routing config required, and new OpenAI models work on day one with zero proxy changes.
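
Both forms use the org-default credential, for example:

client.responses.create(model="gpt-5.2", input="hi")         # bare model name
client.responses.create(model="openai/gpt-5.2", input="hi")  # provider-qualified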

Streaming

Set stream: true. modelux relays the SSE stream chunk-by-chunk — events like response.created, response.output_text.delta, response.function_call_arguments.delta, response.completed arrive in real time:

with client.responses.stream(
    model="gpt-4o-mini",
    input="Count from 1 to 5.",
    max_output_tokens=64,
) as stream:
    for event in stream:
        print(event.type, event)

The proxy extracts usage (input_tokens, output_tokens, input_tokens_details.cached_tokens) from the terminal response.completed event into the analytics log row, so streaming traffic shows up in the dashboard with full token/cost breakdowns, just like sync requests.
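
The same fields are visible caller-side on the terminal event, for example:

with client.responses.stream(
    model="gpt-4o-mini",
    input="Count from 1 to 5.",
) as stream:
    for event in stream:
        if event.type == "response.completed":
            usage = event.response.usage
            print(usage.input_tokens, usage.output_tokens,
                  usage.input_tokens_details.cached_tokens)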

?stream=true on GET /v1/responses/{id} works the same way (used by SDK consumers to replay a stored response).

Background mode

Set background: true to start a long-running response that the client can poll for (or cancel):

import time

resp = client.responses.create(
    model="gpt-4o-mini",
    input="...long task...",
    background=True,
)

# poll
while True:
    r = client.responses.retrieve(resp.id)
    if r.status in ("completed", "failed", "cancelled"):
        break
    time.sleep(2)  # back off between polls

# or cancel
client.responses.cancel(resp.id)

Both retrieve and cancel are proxied — same auth + BYOK + observability.

Stored responses

store: true (default for non-streaming) persists the response on OpenAI’s side so you can chain follow-ups via previous_response_id. The proxy forwards both fields verbatim.
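
A minimal chaining example:

first = client.responses.create(model="gpt-4o-mini", input="Pick a random city.")
followup = client.responses.create(
    model="gpt-4o-mini",
    previous_response_id=first.id,
    input="What country is it in?",
)

To clean up a stored response: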

DELETE /openai/v1/responses/{id}

To inspect the input items of a chained response:

GET /openai/v1/responses/{id}/input_items?limit=20&order=desc

BYOK

Pass X-Modelux-Provider-Key: sk-... to use a caller-supplied OpenAI key for this single call. It wins over the org’s stored credential. The base URL still comes from any stored credential (so self-hosted OpenAI-compatible endpoints work with BYOK).

curl https://api.modelux.ai/openai/v1/responses \
  -H "Authorization: Bearer mlx_sk_..." \
  -H "X-Modelux-Provider-Key: sk-..." \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","input":"hi"}'

Observability

Each request type is logged separately so dashboard breakdowns by request_type work:

  • responses_create — POST, both sync and streaming
  • responses_retrieve — GET (and ?stream=true replay)
  • responses_cancel — POST …/cancel
  • responses_delete — DELETE
  • responses_input_items — GET …/input_items

Token counts, cached tokens (from input_tokens_details.cached_tokens), model used, latency, and full error envelopes from the upstream all land on the log row.

Errors

Errors follow OpenAI’s wire shape:

{
  "error": {
    "type": "invalid_request_error",
    "message": "model is required",
    "code": null,
    "param": null
  }
}

When the upstream returns an error envelope the proxy captures error.type + error.message into the analytics log row so the dashboard can group failures by reason.

See also