Audio
Two endpoints on the OpenAI-shape surface cover speech-to-text (STT)
and text-to-speech (TTS). Both route through the same policy pipeline
as chat completions and
embeddings — @config slugs, BYOK passthrough,
budgets, and analytics all work.
POST /openai/v1/audio/transcriptions
POST /openai/v1/audio/speech
Supported providers
Today: OpenAI and Azure OpenAI (which serves the same Whisper and TTS models through an OpenAI-compatible API). The Anthropic surface has no audio endpoints upstream — use the OpenAI surface for audio regardless of which surface your chat traffic is on.
| Capability | OpenAI models |
|---|---|
| Transcription (STT) | whisper-1, gpt-4o-transcribe |
| Speech (TTS) | tts-1, tts-1-hd |
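Audio for Azure OpenAI goes through the same OpenAI-surface path; only the provider prefix on the model changes. A sketch, assuming azure_openai/whisper-1 resolves to a Whisper deployment in your provider setup:
curl https://api.modelux.ai/openai/v1/audio/transcriptions \
-H "Authorization: Bearer mlx_sk_..." \
-F file=@meeting.mp3 \
-F model=azure_openai/whisper-1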
Routing
The model field accepts the same shapes as every other proxy
endpoint:
- Raw model name — whisper-1, tts-1, etc.
- Provider-qualified — openai/whisper-1, azure_openai/tts-1
- Routing config slug — @transcribe, @voice
For transcriptions, if you omit model the proxy defaults to
openai/whisper-1 (Whisper’s bare name doesn’t match any
auto-resolve prefix, so the default uses the explicit form to stay
routable).
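The request is identical for all three shapes; only the model form field changes. A sketch (pick one of the assignments; --form-string keeps curl from treating a leading @ as a file reference):
# Any of these values can go in the model form field:
MODEL=whisper-1           # raw model name
MODEL=openai/whisper-1    # provider-qualified
MODEL=@transcribe         # routing config slug

curl https://api.modelux.ai/openai/v1/audio/transcriptions \
-H "Authorization: Bearer mlx_sk_..." \
-F file=@meeting.mp3 \
--form-string "model=$MODEL"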
Routing policies: single and fallback_chain are supported.
ensemble and cascade are rejected at execution — running the
same audio through multiple models and merging the output doesn’t
have a well-defined semantic for either STT text or TTS bytes.
Fallback chains work the same as chat: primary fails (5xx / 429 / transport), next target is attempted. Typical setup: OpenAI Whisper primary, Azure OpenAI Whisper fallback.
Transcription request
multipart/form-data — same shape as OpenAI’s upstream endpoint.
| Field | Type | Required | Notes |
|---|---|---|---|
| file | file | yes | Audio file. Max 25 MB. |
| model | string | no | Defaults to openai/whisper-1. |
| language | string | no | ISO-639-1 hint (e.g. en). |
| prompt | string | no | Bias vocabulary / style. |
| response_format | string | no | json (default), text, srt, vtt, verbose_json. |
curl https://api.modelux.ai/openai/v1/audio/transcriptions \
-H "Authorization: Bearer mlx_sk_..." \
-F file=@meeting.mp3 \
--form-string model=@transcribe \
-F language=en
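response_format switches the body away from JSON when you want captions; srt and vtt come back as plain subtitle text, so --output writes a usable caption file directly:
curl https://api.modelux.ai/openai/v1/audio/transcriptions \
-H "Authorization: Bearer mlx_sk_..." \
-F file=@meeting.mp3 \
--form-string model=@transcribe \
-F response_format=srt \
--output meeting.srt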
Transcription response
{
"text": "Hello, this is the transcript.",
"duration": 12.4
}
duration (seconds of audio) is included when the upstream reports
it and drives STT cost computation.
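If you only need the transcript text, the JSON body is easy to post-process, for example with jq (assuming it is installed):
curl -s https://api.modelux.ai/openai/v1/audio/transcriptions \
-H "Authorization: Bearer mlx_sk_..." \
-F file=@meeting.mp3 \
--form-string model=@transcribe \
| jq -r '.text' > transcript.txt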
Speech request
application/json — OpenAI TTS shape.
| Field | Type | Required | Notes |
|---|---|---|---|
| model | string | yes | tts-1, tts-1-hd, @voice, etc. |
| input | string | yes | Text to synthesize. |
| voice | string | no | Defaults to alloy. OpenAI voices: alloy, echo, fable, onyx, nova, shimmer. |
| response_format | string | no | mp3 (default), opus, aac, flac, wav, pcm. |
| speed | number | no | 0.25–4.0. |
| user | string | no | End-user identifier (also accepted via X-Modelux-User-Id header). |
curl https://api.modelux.ai/openai/v1/audio/speech \
-H "Authorization: Bearer mlx_sk_..." \
-H "Content-Type: application/json" \
-d '{"model":"tts-1","input":"Hello, world!","voice":"alloy"}' \
--output hello.mp3
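The optional fields compose as you would expect. A variant that asks tts-1-hd for WAV at a slightly faster rate:
curl https://api.modelux.ai/openai/v1/audio/speech \
-H "Authorization: Bearer mlx_sk_..." \
-H "Content-Type: application/json" \
-d '{"model":"tts-1-hd","input":"Hello, world!","voice":"nova","response_format":"wav","speed":1.25}' \
--output hello.wav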
Speech response
Binary audio bytes. The upstream Content-Type is forwarded
verbatim (typically audio/mpeg for mp3, audio/wav for wav,
etc.), so SDKs and curl --output handle it naturally. Response is
chunked.
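Because the body is chunked, you can start playback before synthesis finishes by piping the stream into a player. A sketch, assuming mpg123 is installed (-N turns off curl's output buffering):
curl -sN https://api.modelux.ai/openai/v1/audio/speech \
-H "Authorization: Bearer mlx_sk_..." \
-H "Content-Type: application/json" \
-d '{"model":"tts-1","input":"Hello, world!","voice":"alloy"}' \
| mpg123 -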
Cost accounting
STT is priced per second of audio; TTS is priced per input character.
Costs apply against the routing config’s budget like any other
request, and are reported on the response as x-modelux-cost-usd
using the default pricing table (whisper-1 and gpt-4o-transcribe
at $0.006/min, tts-1 at $15/1M chars, tts-1-hd at $30/1M chars).
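For example, a 10-minute recording transcribed with whisper-1 costs 10 × $0.006 = $0.06, and a 2,000-character tts-1 request costs 2,000 ÷ 1,000,000 × $15 = $0.03; both amounts appear in x-modelux-cost-usd and count against the config's budget.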
Headers
All the standard proxy request/response headers apply — see
chat completions. For
transcriptions you can identify the end user via
X-Modelux-User-Id (the multipart form has no user field);
TTS supports either the header or the body's user field.
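For example, attributing a transcription to an end user (the user id value is illustrative):
curl https://api.modelux.ai/openai/v1/audio/transcriptions \
-H "Authorization: Bearer mlx_sk_..." \
-H "X-Modelux-User-Id: user_1234" \
-F file=@meeting.mp3 \
--form-string model=@transcribe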
BYOK passthrough (X-Modelux-Provider-Key) is honored on direct
provider/model calls but not on @config — same rule as the rest
of the proxy.
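A BYOK sketch: pass your own provider key alongside a direct provider/model call (sk-... stands in for your OpenAI key; the header is not honored on @config calls, per the rule above):
curl https://api.modelux.ai/openai/v1/audio/transcriptions \
-H "Authorization: Bearer mlx_sk_..." \
-H "X-Modelux-Provider-Key: sk-..." \
-F file=@meeting.mp3 \
-F model=openai/whisper-1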