Audio
Two endpoints on the OpenAI-shape surface cover speech-to-text (STT)
and text-to-speech (TTS). Both route through the same policy pipeline
as chat completions and
embeddings — @config slugs, BYOK passthrough,
budgets, and analytics all work.
POST /openai/v1/audio/transcriptions
POST /openai/v1/audio/speech
Supported providers
Today: OpenAI and Azure OpenAI (which serves the same Whisper and TTS models through an OpenAI-compatible API). The Anthropic surface has no audio endpoints upstream — use the OpenAI surface for audio regardless of which surface your chat traffic is on.
| Capability | OpenAI models |
|---|---|
| Transcription (STT) | whisper-1, gpt-4o-transcribe |
| Speech (TTS) | tts-1, tts-1-hd |
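Audio for Azure OpenAI goes through the same OpenAI-surface path; only the provider prefix on the model changes. A sketch, assuming azure_openai/whisper-1 resolves to a Whisper deployment in your provider setup:
curl https://api.modelux.ai/openai/v1/audio/transcriptions \
-H "Authorization: Bearer mlx_sk_..." \
-F file=@meeting.mp3 \
-F model=azure_openai/whisper-1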
Routing
The model field accepts the same shapes as every other proxy
endpoint:
- Raw model name — whisper-1, tts-1, etc.
- Provider-qualified — openai/whisper-1, azure_openai/tts-1
- Routing config slug — @transcribe, @voice
For transcriptions, if you omit model the proxy defaults to
openai/whisper-1 (Whisper’s bare name doesn’t match any
auto-resolve prefix, so the default uses the explicit form to stay
routable).
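The request is identical for all three shapes; only the model form field changes. A sketch (pick one of the assignments; --form-string keeps curl from treating a leading @ as a file reference):
# Any of these values can go in the model form field:
MODEL=whisper-1           # raw model name
MODEL=openai/whisper-1    # provider-qualified
MODEL=@transcribe         # routing config slug

curl https://api.modelux.ai/openai/v1/audio/transcriptions \
-H "Authorization: Bearer mlx_sk_..." \
-F file=@meeting.mp3 \
--form-string "model=$MODEL"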
Routing policies: single and fallback_chain are supported.
ensemble and cascade are rejected at execution — running the
same audio through multiple models and merging the output doesn’t
have a well-defined semantic for either STT text or TTS bytes.
Fallback chains work the same as chat: primary fails (5xx / 429 / transport), next target is attempted. Typical setup: OpenAI Whisper primary, Azure OpenAI Whisper fallback.
Transcription request
multipart/form-data — same shape as OpenAI’s upstream endpoint.
| Field | Type | Required | Notes |
|---|---|---|---|
| file | file | yes | Audio file. Max 25 MB. |
| model | string | no | Defaults to openai/whisper-1. |
| language | string | no | ISO-639-1 hint (e.g. en). |
| prompt | string | no | Bias vocabulary / style. |
| response_format | string | no | json (default), text, srt, vtt, verbose_json. |
curl https://api.modelux.ai/openai/v1/audio/transcriptions \
-H "Authorization: Bearer mlx_sk_..." \
-F file=@meeting.mp3 \
--form-string model=@transcribe \
-F language=en
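response_format switches the body away from JSON when you want captions; srt and vtt come back as plain subtitle text, so --output writes a usable caption file directly:
curl https://api.modelux.ai/openai/v1/audio/transcriptions \
-H "Authorization: Bearer mlx_sk_..." \
-F file=@meeting.mp3 \
--form-string model=@transcribe \
-F response_format=srt \
--output meeting.srt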
Transcription response
{
"text": "Hello, this is the transcript.",
"duration": 12.4
}
duration (seconds of audio) is included when the upstream reports
it and drives STT cost computation.
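If you only need the transcript text, the JSON body is easy to post-process, for example with jq (assuming it is installed):
curl -s https://api.modelux.ai/openai/v1/audio/transcriptions \
-H "Authorization: Bearer mlx_sk_..." \
-F file=@meeting.mp3 \
--form-string model=@transcribe \
| jq -r '.text' > transcript.txt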
Speech request
application/json — OpenAI TTS shape.
| Field | Type | Required | Notes |
|---|---|---|---|
| model | string | yes | tts-1, tts-1-hd, @voice, etc. |
| input | string | yes | Text to synthesize. |
| voice | string | no | Defaults to alloy. OpenAI voices: alloy, echo, fable, onyx, nova, shimmer. |
| response_format | string | no | mp3 (default), opus, aac, flac, wav, pcm. |
| speed | number | no | 0.25–4.0. |
| user | string | no | End-user identifier (also accepted via X-Modelux-User-Id header). |
curl https://api.modelux.ai/openai/v1/audio/speech \
-H "Authorization: Bearer mlx_sk_..." \
-H "Content-Type: application/json" \
-d '{"model":"tts-1","input":"Hello, world!","voice":"alloy"}' \
--output hello.mp3
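The optional fields compose as you would expect. A variant that asks tts-1-hd for WAV at a slightly faster rate:
curl https://api.modelux.ai/openai/v1/audio/speech \
-H "Authorization: Bearer mlx_sk_..." \
-H "Content-Type: application/json" \
-d '{"model":"tts-1-hd","input":"Hello, world!","voice":"nova","response_format":"wav","speed":1.25}' \
--output hello.wav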
Speech response
Binary audio bytes. The upstream Content-Type is forwarded
verbatim (typically audio/mpeg for mp3, audio/wav for wav,
etc.), so SDKs and curl --output handle it naturally. Response is
chunked.
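Because the body is chunked, you can start playback before synthesis finishes by piping the stream into a player. A sketch, assuming mpg123 is installed (-N turns off curl's output buffering):
curl -sN https://api.modelux.ai/openai/v1/audio/speech \
-H "Authorization: Bearer mlx_sk_..." \
-H "Content-Type: application/json" \
-d '{"model":"tts-1","input":"Hello, world!","voice":"alloy"}' \
| mpg123 -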
Cost accounting
STT is priced per second of audio; TTS is priced per input character.
Costs apply against the routing config’s budget like any other
request, and are reported on the response as x-modelux-cost-usd
using the default pricing table (whisper-1 and gpt-4o-transcribe
at $0.006/min, tts-1 at $15/1M chars, tts-1-hd at $30/1M chars).
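For example, a 10-minute recording transcribed with whisper-1 costs 10 × $0.006 = $0.06, and a 2,000-character tts-1 request costs 2,000 ÷ 1,000,000 × $15 = $0.03; both amounts appear in x-modelux-cost-usd and count against the config's budget.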
Headers
All the standard proxy request/response headers apply — see
chat completions. For
transcriptions you can identify the end user via
X-Modelux-User-Id (the multipart form has no user field);
TTS supports either the header or the body's user field.
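For example, attributing a transcription to an end user (the user id value is illustrative):
curl https://api.modelux.ai/openai/v1/audio/transcriptions \
-H "Authorization: Bearer mlx_sk_..." \
-H "X-Modelux-User-Id: user_1234" \
-F file=@meeting.mp3 \
--form-string model=@transcribe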
BYOK passthrough (X-Modelux-Provider-Key) is honored on direct
provider/model calls but not on @config — same rule as the rest
of the proxy.
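A BYOK sketch: pass your own provider key alongside a direct provider/model call (sk-... stands in for your OpenAI key; the header is not honored on @config calls, per the rule above):
curl https://api.modelux.ai/openai/v1/audio/transcriptions \
-H "Authorization: Bearer mlx_sk_..." \
-H "X-Modelux-Provider-Key: sk-..." \
-F file=@meeting.mp3 \
-F model=openai/whisper-1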