<!-- source: https://modelux.ai/docs/api/openai-responses -->

> POST /openai/v1/responses — OpenAI's Responses API with routing configs, fallback, BYOK, and observability.

# Responses

modelux proxies the full OpenAI Responses API surface (`/v1/responses`)
with auth + rate-limit + entitlements + observability + BYOK on top of
byte-faithful request/response forwarding. Routing configs (`@config`)
and fallback-within-OpenAI are supported on the create endpoint so
Responses traffic benefits from the same indirection and reliability
patterns as `/chat/completions`, scoped to OpenAI targets.

```
POST   /openai/v1/responses                       create (sync or stream, @config supported)
GET    /openai/v1/responses/{id}                  retrieve (or ?stream=true to replay)
POST   /openai/v1/responses/{id}/cancel           cancel a background response
DELETE /openai/v1/responses/{id}                  delete a stored response
GET    /openai/v1/responses/{id}/input_items      list input items
```

## When to use this vs `/chat/completions`

The Responses API is **OpenAI-specific** — its item taxonomy
(`function_call`, `web_search_call`, `computer_call`, `reasoning`, image
outputs, …) has no faithful mapping onto Claude or Gemini. So:

- **Cross-provider routing, ensemble policies, semantic cache, decision
  traces** → use [`/openai/v1/chat/completions`](/docs/api/chat-completions).
  That's the canonical multi-provider surface.
- **Reasoning effort, web_search tool, computer use, image outputs,
  stored / chained responses, the new SSE event taxonomy** → use
  `/openai/v1/responses`. You get OpenAI's full feature set plus
  `@config` routing, fallback-within-OpenAI, budgets, and A/B tests — the
  request always lands on OpenAI (or your BYOK OpenAI-compatible base
  URL).

## Create

Drop-in for the OpenAI SDK — point `baseURL` at
`https://api.modelux.ai/openai/v1`:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.modelux.ai/openai/v1",
    api_key="mlx_sk_...",  # your modelux key
)

resp = client.responses.create(
    model="gpt-4o-mini",
    input="Walk me through the proof that sqrt(2) is irrational.",
    instructions="Be concise.",
    max_output_tokens=400,
    reasoning={"effort": "medium"},
)
print(resp.output_text)
```

The request body is forwarded byte-for-byte, so SDK additions (new
tool types, new reasoning options, new input item shapes) work without
proxy changes. The only field the proxy requires is `model`, so a
request that names no model at all fails fast at the boundary instead
of upstream.
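With the Python SDK, parameters the SDK does not yet know about can be
passed through `extra_body`; `some_future_param` below is a made-up
placeholder, only there to show that unknown fields survive the proxy
untouched:

```python
resp = client.responses.create(
    model="gpt-4o-mini",
    input="hi",
    # hypothetical field the proxy has never seen; forwarded verbatim
    extra_body={"some_future_param": {"enabled": True}},
)
```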

### Reasoning parameter normalization

The one exception to byte-identical forwarding is the `reasoning`
object, which is reconciled against the resolved upstream model:

- `reasoning: {"effort": "none"}` — stripped from the outbound body
  before forwarding. OpenAI's chat-only models (`gpt-4o`, `gpt-4.1`)
  reject `reasoning` outright, and `effort: "none"` is a pi-ai-style
  sentinel meaning "don't reason" — dropping it is a semantic no-op
  that keeps the request working on any target.
- `reasoning: {"effort": "minimal" | "low" | "medium" | "high"}` (or
  any other non-empty reasoning shape) — forwarded as-is, but the
  proxy refuses the request with a 400 `unsupported_parameter` error
  when the resolved target is a known non-reasoning model. On a
  `@config` fallback chain, non-reasoning targets are pruned; if no
  reasoning-capable targets remain, the request is rejected.

Unknown models (including custom-hosted OpenAI-compatible endpoints)
are always trusted — the upstream's own error surfaces unchanged if
the parameter is actually rejected.
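A minimal sketch of the reconciliation rule described above; the set
name, function name, and exception type are illustrative, not
modelux's actual internals:

```python
class UnsupportedParameterError(Exception):
    """Surfaces to the caller as a 400 `unsupported_parameter` error."""

# known chat-only models (illustrative; the real list lives in the proxy)
KNOWN_NON_REASONING = {"gpt-4o", "gpt-4o-mini", "gpt-4.1"}

def reconcile_reasoning(body: dict, resolved_model: str) -> dict:
    reasoning = body.get("reasoning")
    if not reasoning:
        return body  # nothing to normalize
    if reasoning.get("effort") == "none":
        del body["reasoning"]  # sentinel for "don't reason": a semantic no-op
        return body
    if resolved_model in KNOWN_NON_REASONING:
        raise UnsupportedParameterError(
            f"reasoning is not supported on {resolved_model}"
        )
    return body  # unknown models are trusted; the upstream decides
```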

## Routing configs (`@config`)

Pass `@<routing-config-name>` as the `model` field to resolve the
target(s) through a [routing config](/docs/concepts/routing) instead of
naming a model directly. This is how you upgrade models, run A/B tests
between model versions, and get fallback-within-OpenAI without changing
the calling code.

```python
# The app names a stable config; operators change what it points to.
client.responses.create(
    model="@icon-generator",
    input="a minimalist lightning bolt icon",
)
```

On the proxy side, modelux:

1. Looks up the routing config by name within the calling project.
2. Applies the policy (single model, fallback chain, A/B test, etc.).
3. Validates that every resolved target is on OpenAI — rejects with a
   400 if the config points at Anthropic or Gemini, since the Responses
   request shape doesn't translate.
4. Rewrites the outbound body's `model` field to the chosen target's
   concrete model. All other fields (reasoning options, tool configs,
   image parameters, custom headers the SDK sent, …) round-trip
   unchanged.
5. Uses the target's credential for the upstream call (not the org
   default), so different configs can point at different OpenAI
   accounts / Azure deployments / OpenAI-compatible base URLs.
6. Enforces the config's attached budgets if `enforce_budgets` is set.

### Fallback within OpenAI

A `fallback_chain` with multiple OpenAI targets retries on transient
upstream failures. The proxy advances to the next target when the
current one returns:

- a transport / connection error, OR
- HTTP 5xx, OR
- HTTP 429 (rate limit — the next key may not share the limit), OR
- HTTP 408 (request timeout)

4xx statuses other than 429/408 are **not** retried — a malformed
request, revoked key, or model-not-on-account is a caller-side error
that will fail identically on the sibling target. The `X-Modelux-Model-Used`
response header names whichever target actually served the request.
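For reference, a two-target chain might be declared like this; treat
the field names as illustrative, since the exact schema belongs to the
[routing config](/docs/concepts/routing) docs:

```json
{
  "policy": "fallback_chain",
  "config": {
    "targets": [
      {"model": "gpt-5.2", "provider_credential_id": "..."},
      {"model": "gpt-4o",  "provider_credential_id": "..."}
    ]
  }
}
```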

Streaming requests fall back only before the first byte reaches the
caller. Once the SSE relay starts we're committed — a mid-stream
disconnect is reported to the caller as a truncated stream, not
retried, because switching targets mid-response would corrupt the
output-item sequence.

### A/B testing model upgrades

Point a routing config at an `ab_test` policy to split Responses
traffic between two OpenAI targets — useful for rolling out a new
image-capable or reasoning-capable model to a percentage of production
traffic:

```json
{
  "policy": "ab_test",
  "config": {
    "split_percent": 10,
    "bucket_key": "end_user_id",
    "control":   {"policy": "single", "config": {"model": "gpt-4o",   "provider_credential_id": "..."}},
    "treatment": {"policy": "single", "config": {"model": "gpt-5.2",  "provider_credential_id": "..."}}
  }
}
```

Each response carries `X-Modelux-AB-Variant: control | treatment` so
dashboard breakdowns can compare latency, cost, and error rates
between the two variants.
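To read the variant from application code, the OpenAI SDK's
raw-response accessor works unchanged (a sketch):

```python
raw = client.responses.with_raw_response.create(
    model="@icon-generator",
    input="a minimalist lightning bolt icon",
)
print(raw.headers.get("X-Modelux-AB-Variant"))  # "control" or "treatment"
resp = raw.parse()  # the usual Response object
```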

### What doesn't work with `@config` on Responses

- **Ensemble / consensus policies** — they imply parallel execution
  with a similarity-aggregation step that has no meaning on the
  Responses taxonomy. Returns 400.
- **Cross-provider targets** — any non-OpenAI target in the config
  returns 400 before the upstream call.
- **Semantic cache** — caching by Responses input is not supported
  because the item taxonomy (reasoning steps, tool calls, image
  inputs, encrypted reasoning content) makes cache-key hashing
  fragile. Cache on `/chat/completions` if caching matters.

Bare model names (`gpt-5.2`) and provider-qualified names
(`openai/gpt-5.2`) continue to pass through to OpenAI untouched with
the org-default credential — no routing config required, and new
OpenAI models work on day one with zero proxy changes.

## Streaming

Set `stream: true`. modelux relays the SSE stream chunk-by-chunk —
events like `response.created`, `response.output_text.delta`,
`response.function_call_arguments.delta`, `response.completed` arrive
in real time:

```python
with client.responses.stream(
    model="gpt-4o-mini",
    input="Count from 1 to 5.",
    max_output_tokens=64,
) as stream:
    for event in stream:
        print(event.type, event)
```

The proxy captures usage (`input_tokens`, `output_tokens`,
`input_tokens_details.cached_tokens`) from the terminal
`response.completed` event into the analytics log row, so streaming
traffic shows up in the dashboard with the same token/cost breakdowns
as sync requests.
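The same totals are visible client-side through the SDK's stream
helper, e.g.:

```python
with client.responses.stream(
    model="gpt-4o-mini",
    input="Count from 1 to 5.",
) as stream:
    for _ in stream:
        pass  # drain the event stream
    final = stream.get_final_response()

print(final.usage.input_tokens, final.usage.output_tokens)
```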

`?stream=true` on `GET /v1/responses/{id}` works the same way (used
by SDK consumers to replay a stored response).

## Background mode

Set `background: true` to start a long-running response that the
client can poll for (or cancel):

```python
import time

resp = client.responses.create(
    model="gpt-4o-mini",
    input="...long task...",
    background=True,
)

# poll until the response reaches a terminal status
while True:
    r = client.responses.retrieve(resp.id)
    if r.status in ("completed", "failed", "cancelled", "incomplete"):
        break
    time.sleep(2)

# or cancel instead
client.responses.cancel(resp.id)
```

Both retrieve and cancel are proxied — same auth + BYOK + observability.

## Stored responses

`store: true` (default for non-streaming) persists the response on
OpenAI's side so you can chain via `previous_response_id`. The proxy
forwards both fields verbatim. To clean up a stored response:

```
DELETE /openai/v1/responses/{id}
```

To inspect the input items of a chained response:

```
GET /openai/v1/responses/{id}/input_items?limit=20&order=desc
```
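Chaining itself needs no proxy-specific syntax; a minimal sketch with
the Python SDK:

```python
first = client.responses.create(
    model="gpt-4o-mini",
    input="Pick a random European capital.",
)
followup = client.responses.create(
    model="gpt-4o-mini",
    input="Which country is it the capital of?",
    previous_response_id=first.id,  # server-side chaining; needs store=true
)
print(followup.output_text)
```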

## BYOK

Pass `X-Modelux-Provider-Key: sk-...` to use a caller-supplied OpenAI
key for a single call; it takes precedence over the org's stored
credential. The base URL still comes from any stored credential, so
self-hosted OpenAI-compatible endpoints work with BYOK.

```bash
curl https://api.modelux.ai/openai/v1/responses \
  -H "Authorization: Bearer mlx_sk_..." \
  -H "X-Modelux-Provider-Key: sk-..." \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","input":"hi"}'
```

## Observability

Each request type is logged separately so dashboard breakdowns by
`request_type` work:

- `responses_create` — POST, both sync and streaming
- `responses_retrieve` — GET (and `?stream=true` replay)
- `responses_cancel` — POST .../cancel
- `responses_delete` — DELETE
- `responses_input_items` — GET .../input_items

Token counts, cached tokens (from `input_tokens_details.cached_tokens`),
model used, latency, and full error envelopes from the upstream all
land on the log row.

## Errors

Errors follow OpenAI's wire shape:

```json
{
  "error": {
    "type": "invalid_request_error",
    "message": "model is required",
    "code": null,
    "param": null
  }
}
```

When the upstream returns an error envelope the proxy captures
`error.type` + `error.message` into the analytics log row so the
dashboard can group failures by reason.
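With the Python SDK these surface as `openai.APIStatusError`, so
callers can branch on status and envelope (a sketch):

```python
from openai import APIStatusError

try:
    resp = client.responses.create(model="gpt-4o-mini", input="hi")
except APIStatusError as e:
    print(e.status_code, e.body)  # e.body carries the error details
```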

## See also

- [Chat completions](/docs/api/chat-completions) — the cross-provider OpenAI surface
- [Capability matrix](/docs/concepts/capability-matrix) — what's supported where
