<!-- source: https://modelux.ai/docs/api/audio -->

> POST /openai/v1/audio/transcriptions and /audio/speech — speech-to-text and text-to-speech.

# Audio

Two endpoints on the OpenAI-shape surface cover speech-to-text (STT)
and text-to-speech (TTS). Both route through the same policy pipeline
as [chat completions](/docs/api/chat-completions) and
[embeddings](/docs/api/embeddings) — `@config` slugs, BYOK passthrough,
budgets, and analytics all work.

```
POST /openai/v1/audio/transcriptions
POST /openai/v1/audio/speech
```

## Supported providers

Today: **OpenAI** and **Azure OpenAI** (which serves the same Whisper
and TTS models through an OpenAI-compatible API). The Anthropic
surface has no audio endpoints upstream — use the OpenAI surface
for audio regardless of which surface your chat traffic is on.

| Capability | OpenAI models |
|---|---|
| Transcription (STT) | `whisper-1`, `gpt-4o-transcribe` |
| Speech (TTS) | `tts-1`, `tts-1-hd` |

## Routing

The `model` field accepts the same shapes as every other proxy
endpoint:

- **Raw model name** — `whisper-1`, `tts-1`, etc.
- **Provider-qualified** — `openai/whisper-1`, `azure_openai/tts-1`
- **Routing config slug** — `@transcribe`, `@voice`

For transcriptions, if you omit `model`, the proxy defaults to
`openai/whisper-1`. (Whisper's bare name doesn't match any
auto-resolve prefix, so the default uses the explicit
provider-qualified form to stay routable.)
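
All three shapes ride in the ordinary `model` field on either audio
endpoint. As a quick illustration (slugs like `@transcribe` are
examples and must exist in your account):

```python
# Three equivalent ways to fill the "model" field on either audio
# endpoint (slugs like "@transcribe" must exist in your account):
model = "whisper-1"           # raw model name
model = "openai/whisper-1"    # provider-qualified
model = "@transcribe"         # routing config slug
```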

**Routing policies:** `single` and `fallback_chain` are supported.
`ensemble` and `cascade` are rejected at execution time — running the
same audio through multiple models and merging the outputs has no
well-defined semantics for either STT text or TTS bytes.

Fallback chains work the same as for chat: when the primary fails
(5xx, 429, or a transport error), the next target is attempted. A
typical setup is OpenAI Whisper as primary with Azure OpenAI Whisper
as fallback.
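
This section doesn't define the config schema; purely as an
illustration, that fallback chain could be declared along these
lines (field names here are guesses, not the documented schema):

```json
{
  "slug": "transcribe",
  "policy": "fallback_chain",
  "targets": ["openai/whisper-1", "azure_openai/whisper-1"]
}
```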

## Transcription request

`multipart/form-data` — same shape as OpenAI's upstream endpoint.

| Field | Type | Required | Notes |
|---|---|---|---|
| `file` | file | yes | Audio file. Max 25 MB. |
| `model` | string | no | Defaults to `openai/whisper-1`. |
| `language` | string | no | ISO-639-1 hint (e.g. `en`). |
| `prompt` | string | no | Bias vocabulary / style. |
| `response_format` | string | no | `json` (default), `text`, `srt`, `vtt`, `verbose_json`. |

```bash
curl https://api.modelux.ai/openai/v1/audio/transcriptions \
  -H "Authorization: Bearer mlx_sk_..." \
  -F file=@meeting.mp3 \
  --form-string "model=@transcribe" \
  -F language=en
```
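
One note on the curl form: `--form-string` is required for the
`model` field because plain `-F` treats a leading `@` as a file to
attach. In code the slug is just a string. The same call with
Python's `requests` (nothing Modelux-specific; any HTTP client works):

```python
import requests

resp = requests.post(
    "https://api.modelux.ai/openai/v1/audio/transcriptions",
    headers={"Authorization": "Bearer mlx_sk_..."},
    # multipart/form-data: the audio goes in "file", everything else
    # rides alongside as plain form fields
    files={"file": ("meeting.mp3", open("meeting.mp3", "rb"), "audio/mpeg")},
    data={"model": "@transcribe", "language": "en"},  # slug is a plain string here
)
resp.raise_for_status()
print(resp.json()["text"])
```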

### Transcription response

```json
{
  "text": "Hello, this is the transcript.",
  "duration": 12.4
}
```

`duration` (seconds of audio) is included when the upstream reports
it and drives STT cost computation.

## Speech request

`application/json` — OpenAI TTS shape.

| Field | Type | Required | Notes |
|---|---|---|---|
| `model` | string | yes | `tts-1`, `tts-1-hd`, `@voice`, etc. |
| `input` | string | yes | Text to synthesize. |
| `voice` | string | no | Defaults to `alloy`. OpenAI voices: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`. |
| `response_format` | string | no | `mp3` (default), `opus`, `aac`, `flac`, `wav`, `pcm`. |
| `speed` | number | no | `0.25`–`4.0`. |
| `user` | string | no | End-user identifier (also accepted via `X-Modelux-User-Id` header). |

```bash
curl https://api.modelux.ai/openai/v1/audio/speech \
  -H "Authorization: Bearer mlx_sk_..." \
  -H "Content-Type: application/json" \
  -d '{"model":"tts-1","input":"Hello, world!","voice":"alloy"}' \
  --output hello.mp3
```
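
The same request from Python with `requests`, writing the returned
bytes to disk:

```python
import requests

resp = requests.post(
    "https://api.modelux.ai/openai/v1/audio/speech",
    headers={"Authorization": "Bearer mlx_sk_..."},
    json={"model": "tts-1", "input": "Hello, world!", "voice": "alloy"},
)
resp.raise_for_status()
with open("hello.mp3", "wb") as f:
    f.write(resp.content)   # response body is the raw audio bytes
```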

### Speech response

Binary audio bytes. The upstream `Content-Type` is forwarded
verbatim (typically `audio/mpeg` for `mp3`, `audio/wav` for `wav`,
and so on), so SDKs and `curl --output` handle it naturally. The
body is streamed back with chunked transfer encoding.
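
Because the body is chunked, you can also stream it to disk as it
arrives rather than buffering the whole file; for example, with
`requests`' streaming mode:

```python
import requests

with requests.post(
    "https://api.modelux.ai/openai/v1/audio/speech",
    headers={"Authorization": "Bearer mlx_sk_..."},
    json={"model": "tts-1-hd", "input": "A longer passage...", "voice": "nova"},
    stream=True,   # don't buffer the whole body in memory
) as resp:
    resp.raise_for_status()
    with open("speech.mp3", "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
```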

## Cost accounting

STT is priced per second of audio; TTS is priced per input character.
Costs count against the routing config's budget like any other
request and are reported in the `x-modelux-cost-usd` response header,
using the default pricing table (`whisper-1` and `gpt-4o-transcribe`
at $0.006/min, `tts-1` at $15/1M chars, `tts-1-hd` at $30/1M chars).
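
Applied to the examples above, the arithmetic works out as follows
(a sketch of the math, not the proxy's billing code):

```python
# STT: priced per second of audio at $0.006/min
stt_seconds = 12.4                      # "duration" from the transcript above
stt_cost = stt_seconds / 60 * 0.006     # -> $0.00124

# TTS: priced per input character
tts_chars = len("Hello, world!")        # 13 input characters
tts_cost = tts_chars * 15 / 1_000_000   # $15/1M chars on tts-1 -> $0.000195
```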

## Headers

All the standard proxy request/response headers apply — see
[chat completions](/docs/api/chat-completions#request-headers). For
transcriptions, identify the end user via the `X-Modelux-User-Id`
header (the multipart form has no `user` field); TTS accepts either
the header or the body's `user` field.

BYOK passthrough (`X-Modelux-Provider-Key`) is honored on direct
`provider/model` calls but not on `@config` — same rule as the rest
of the proxy.
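
Putting the header rules together, here is a direct
provider-qualified transcription call that passes through your own
OpenAI key and tags the end user (key values are placeholders):

```python
import requests

resp = requests.post(
    "https://api.modelux.ai/openai/v1/audio/transcriptions",
    headers={
        "Authorization": "Bearer mlx_sk_...",   # Modelux API key
        "X-Modelux-Provider-Key": "sk-...",     # BYOK: your own OpenAI key
        "X-Modelux-User-Id": "user_1234",       # end-user attribution
    },
    files={"file": open("meeting.mp3", "rb")},
    data={"model": "openai/whisper-1"},  # direct provider/model, so BYOK applies
)
resp.raise_for_status()
print(resp.json()["text"])
```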
