Responses
modelux proxies the full OpenAI Responses API surface (/v1/responses)
with auth + rate-limit + entitlements + observability + BYOK on top of
byte-faithful request/response forwarding. Routing configs (@config)
and fallback-within-OpenAI are supported on the create endpoint so
Responses traffic benefits from the same indirection and reliability
patterns as /chat/completions, scoped to OpenAI targets.
- POST /openai/v1/responses: create (sync or stream, @config supported)
- GET /openai/v1/responses/{id}: retrieve (or ?stream=true to replay)
- POST /openai/v1/responses/{id}/cancel: cancel a background response
- DELETE /openai/v1/responses/{id}: delete a stored response
- GET /openai/v1/responses/{id}/input_items: list input items
When to use this vs /chat/completions
The Responses API is OpenAI-specific — its item taxonomy
(function_call, web_search_call, computer_call, reasoning, image
outputs, …) has no faithful mapping onto Claude or Gemini. So:
- Cross-provider routing, ensemble policies, semantic cache, decision traces → use /openai/v1/chat/completions. That’s the canonical multi-provider surface.
- Reasoning effort, web_search tool, computer use, image outputs, stored / chained responses, the new SSE event taxonomy → use /openai/v1/responses. You get OpenAI’s full feature set plus @config routing, fallback-within-OpenAI, budgets, and A/B tests — the request always lands on OpenAI (or your BYOK OpenAI-compatible base URL).
Create
Drop-in for the OpenAI SDK — point baseURL at
https://api.modelux.ai/openai/v1:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.modelux.ai/openai/v1",
    api_key="mlx_sk_...",  # your modelux key
)

resp = client.responses.create(
    model="gpt-4o-mini",
    input="Walk me through the proof that sqrt(2) is irrational.",
    instructions="Be concise.",
    max_output_tokens=400,
    reasoning={"effort": "medium"},
)
print(resp.output_text)
The request body is forwarded byte-identically, so SDK additions (new tool
types, new reasoning options, new input item shapes) work without
proxy changes. The proxy only requires that model be present (so
unrouted requests are caught at the boundary).
Reasoning parameter normalization
The one exception to byte-identical forwarding is the reasoning
object, which is reconciled against the resolved upstream model:
- reasoning: {"effort": "none"} — stripped from the outbound body before forwarding. OpenAI’s chat-only models (gpt-4o, gpt-4.1) reject reasoning outright, and effort: "none" is a pi-ai-style sentinel meaning “don’t reason” — dropping it is a semantic no-op that keeps the request working on any target.
- reasoning: {"effort": "minimal" | "low" | "medium" | "high"} (or any other non-empty reasoning shape) — forwarded as-is, but the proxy refuses the request with a 400 unsupported_parameter error when the resolved target is a known non-reasoning model. On a @config fallback chain, non-reasoning targets are pruned; if no reasoning-capable targets remain, the request is rejected.
Unknown models (including custom-hosted OpenAI-compatible endpoints) are always trusted — the upstream’s own error surfaces unchanged if the parameter is actually rejected.
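A minimal sketch of those rules, purely illustrative — the function name, the known_non_reasoning set, and the raised error are assumptions, not modelux internals:

def normalize_reasoning(body: dict, target_model: str, known_non_reasoning: set) -> dict:
    """Illustrative sketch of the reasoning normalization described above."""
    reasoning = body.get("reasoning")
    if not reasoning:
        return body                      # nothing to reconcile
    if reasoning.get("effort") == "none":
        # "don't reason" sentinel: drop it so chat-only models accept the request
        return {k: v for k, v in body.items() if k != "reasoning"}
    if target_model in known_non_reasoning:
        # any other non-empty reasoning shape on a known non-reasoning model -> 400
        raise ValueError("unsupported_parameter: reasoning not supported by " + target_model)
    return body                          # unknown / custom models are trusted; forward as-is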
Routing configs (@config)
Pass @<routing-config-name> as the model field to resolve the
target(s) through a routing config instead of
naming a model directly. This is how you upgrade models, run A/B tests
between model versions, and get fallback-within-OpenAI without changing
the calling code.
# The app names a stable config; operators change what it points to.
client.responses.create(
    model="@icon-generator",
    input="a minimalist lightning bolt icon",
)
On the proxy side modelux:
- Looks up the routing config by name within the calling project.
- Applies the policy (single model, fallback chain, A/B test, etc.).
- Validates that every resolved target is on OpenAI — rejects with a 400 if the config points at Anthropic or Gemini, since the Responses request shape doesn’t translate.
- Rewrites the outbound body’s model field to the chosen target’s concrete model. All other fields (reasoning options, tool configs, image parameters, custom headers the SDK sent, …) round-trip unchanged.
- Uses the target’s credential for the upstream call (not the org default), so different configs can point at different OpenAI accounts / Azure deployments / OpenAI-compatible base URLs.
- Enforces the config’s attached budgets if enforce_budgets is set.
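For illustration, the rewrite step boils down to something like the sketch below; the type and function names are hypothetical, and only the model-field rewrite and the OpenAI-only check come from the behavior described above:

from dataclasses import dataclass

@dataclass
class ResolvedTarget:
    provider: str        # "openai", "anthropic", ...
    model: str           # concrete model, e.g. "gpt-4o-mini"
    credential_id: str   # which stored credential the upstream call uses

def rewrite_for_target(body: dict, target: ResolvedTarget) -> dict:
    """Rewrite only the model field; every other field round-trips unchanged."""
    if target.provider != "openai":
        raise ValueError("400: Responses @config targets must resolve to OpenAI")
    out = dict(body)
    out["model"] = target.model
    return out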
Fallback within OpenAI
A fallback_chain with multiple OpenAI targets retries on transient
upstream failures. The proxy advances to the next target when the
current one returns:
- a transport / connection error, OR
- HTTP 5xx, OR
- HTTP 429 (rate limit — the next key may not share the limit), OR
- HTTP 408 (request timeout)
4xx statuses other than 429/408 are not retried — a malformed
request, revoked key, or model-not-on-account is a caller-side error
that will fail identically on the sibling target. The X-Modelux-Model-Used
response header names whichever target actually served the request.
Streaming requests fall back only before the first byte reaches the caller. Once the SSE relay starts we’re committed — a mid-stream disconnect is reported to the caller as a truncated stream, not retried, because switching targets mid-response would corrupt the output-item sequence.
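The retry rule reduces to a small predicate; an illustrative sketch of the conditions listed above (not modelux source):

def should_advance(status_code) -> bool:
    """True when the fallback chain should try the next OpenAI target."""
    if status_code is None:               # transport / connection error
        return True
    if status_code >= 500:                # upstream 5xx
        return True
    return status_code in (429, 408)      # rate limit / timeout; other 4xx are not retried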
A/B testing model upgrades
Point a routing config at an ab_test policy to split Responses
traffic between two OpenAI targets — useful for rolling out a new
image-capable or reasoning-capable model to a percentage of production
traffic:
{
  "policy": "ab_test",
  "config": {
    "split_percent": 10,
    "bucket_key": "end_user_id",
    "control": {"policy": "single", "config": {"model": "gpt-4o", "provider_credential_id": "..."}},
    "treatment": {"policy": "single", "config": {"model": "gpt-5.2", "provider_credential_id": "..."}}
  }
}
Each response carries X-Modelux-AB-Variant: control | treatment so
dashboard breakdowns can compare latency, cost, and error rates
between the two variants.
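To read the variant (and the served model) from application code, the SDK's raw-response accessor exposes the headers; a sketch assuming a recent openai-python that provides with_raw_response on the responses resource:

raw = client.responses.with_raw_response.create(
    model="@icon-generator",
    input="a minimalist lightning bolt icon",
)
variant = raw.headers.get("X-Modelux-AB-Variant")      # "control" or "treatment"
model_used = raw.headers.get("X-Modelux-Model-Used")   # which target actually served it
resp = raw.parse()                                     # the usual Response object
print(variant, model_used, resp.output_text)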
What doesn’t work with @config on Responses
- Ensemble / consensus policies — they imply parallel execution with a similarity-aggregation step that has no meaning on the Responses taxonomy. Returns 400.
- Cross-provider targets — any non-OpenAI target in the config returns 400 before the upstream call.
- Semantic cache — caching by Responses input is not supported because the item taxonomy (reasoning steps, tool calls, image inputs, encrypted reasoning content) makes cache-key hashing fragile. Cache on /chat/completions if caching matters.
Bare model names (gpt-5.2) and provider-qualified names
(openai/gpt-5.2) continue to pass through to OpenAI untouched with
the org-default credential — no routing config required, and new
OpenAI models work on day one with zero proxy changes.
Streaming
Set stream: true. modelux relays the SSE stream chunk-by-chunk —
events like response.created, response.output_text.delta,
response.function_call_arguments.delta, response.completed arrive
in real time:
with client.responses.stream(
    model="gpt-4o-mini",
    input="Count from 1 to 5.",
    max_output_tokens=64,
) as stream:
    for event in stream:
        print(event.type, event)
The proxy snaps usage (input_tokens, output_tokens,
input_tokens_details.cached_tokens) out of the terminal
response.completed event into the analytics log row, so streaming
traffic shows up in the dashboard with full token/cost breakdowns
just like sync.
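The same usage fields are visible client-side on the terminal event; a sketch assuming the standard event attributes (delta on text-delta events, response.usage on response.completed):

with client.responses.stream(
    model="gpt-4o-mini",
    input="Count from 1 to 5.",
) as stream:
    for event in stream:
        if event.type == "response.output_text.delta":
            print(event.delta, end="", flush=True)
        elif event.type == "response.completed":
            usage = event.response.usage
            print("\n", usage.input_tokens, usage.output_tokens,
                  usage.input_tokens_details.cached_tokens)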
?stream=true on GET /v1/responses/{id} works the same way (used
by SDK consumers to replay a stored response).
Background mode
Set background: true to start a long-running response that the
client can poll for (or cancel):
import time

resp = client.responses.create(
    model="gpt-4o-mini",
    input="...long task...",
    background=True,
)

# poll until the response reaches a terminal state
while True:
    r = client.responses.retrieve(resp.id)
    if r.status in ("completed", "failed", "cancelled"):
        break
    time.sleep(2)

# or cancel
client.responses.cancel(resp.id)
Both retrieve and cancel are proxied — same auth + BYOK + observability.
Stored responses
store: true (default for non-streaming) persists the response on
OpenAI’s side so you can chain via previous_response_id. The proxy
forwards both fields verbatim. To clean up a stored response:
DELETE /openai/v1/responses/{id}
To inspect the input items of a chained response:
GET /openai/v1/responses/{id}/input_items?limit=20&order=desc
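Putting the pieces together, a minimal sketch of chaining and cleanup (prompts and the follow-up flow are illustrative):

first = client.responses.create(
    model="gpt-4o-mini",
    input="Summarize the tradeoffs of background mode vs streaming.",
    store=True,  # default for non-streaming; shown for clarity
)
followup = client.responses.create(
    model="gpt-4o-mini",
    previous_response_id=first.id,
    input="Turn that summary into three bullet points.",
)
print(followup.output_text)

# delete the stored response once the chain is no longer needed
client.responses.delete(first.id)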
BYOK
Pass X-Modelux-Provider-Key: sk-... to use a caller-supplied OpenAI
key for this single call. It takes precedence over the org’s stored
credential. The base URL still comes from any stored credential (so
self-hosted OpenAI-compatible endpoints work with BYOK).
curl https://api.modelux.ai/openai/v1/responses \
  -H "Authorization: Bearer mlx_sk_..." \
  -H "X-Modelux-Provider-Key: sk-..." \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","input":"hi"}'
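The same header can be supplied through the SDK; a sketch using the client's standard header options (key values are placeholders):

from openai import OpenAI

client = OpenAI(
    base_url="https://api.modelux.ai/openai/v1",
    api_key="mlx_sk_...",
    default_headers={"X-Modelux-Provider-Key": "sk-..."},  # BYOK on every call
)

# or per call:
resp = client.responses.create(
    model="gpt-4o-mini",
    input="hi",
    extra_headers={"X-Modelux-Provider-Key": "sk-..."},
)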
Observability
Each request type is logged separately so dashboard breakdowns by
request_type work:
- responses_create — POST, both sync and streaming
- responses_retrieve — GET (and ?stream=true replay)
- responses_cancel — POST …/cancel
- responses_delete — DELETE
- responses_input_items — GET …/input_items
Token counts, cached tokens (from input_tokens_details.cached_tokens),
model used, latency, and full error envelopes from the upstream all
land on the log row.
Errors
Errors follow OpenAI’s wire shape:
{
  "error": {
    "type": "invalid_request_error",
    "message": "model is required",
    "code": null,
    "param": null
  }
}
When the upstream returns an error envelope the proxy captures
error.type + error.message into the analytics log row so the
dashboard can group failures by reason.
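In application code these surface as the SDK's usual exception types; a sketch assuming the standard openai-python exceptions (e.body is typically the inner error object):

import openai

try:
    resp = client.responses.create(model="gpt-4o-mini", input="hi")
except openai.APIStatusError as e:
    # proxy-originated and upstream-originated errors share OpenAI's envelope
    print(e.status_code)
    print(e.body)  # e.g. {"type": "invalid_request_error", "message": "...", ...}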
See also
- Chat completions — the cross-provider OpenAI surface
- Capability matrix — what’s supported where