Audio

Two endpoints on the OpenAI-shape surface cover speech-to-text (STT) and text-to-speech (TTS). Both route through the same policy pipeline as chat completions and embeddings: @config slugs, BYOK passthrough, budgets, and analytics all work.

POST /openai/v1/audio/transcriptions
POST /openai/v1/audio/speech

Supported providers

Today: OpenAI and Azure OpenAI (which serves the same Whisper and TTS models through an OpenAI-compatible API). The Anthropic surface has no audio endpoints upstream — use the OpenAI surface for audio regardless of which surface your chat traffic is on.

| Capability | OpenAI models |
| --- | --- |
| Transcription (STT) | whisper-1, gpt-4o-transcribe |
| Speech (TTS) | tts-1, tts-1-hd |

Routing

The model field accepts the same shapes as every other proxy endpoint:

  • Raw model name: whisper-1, tts-1, etc.
  • Provider-qualified: openai/whisper-1, azure_openai/tts-1
  • Routing config slug: @transcribe, @voice

For transcriptions, if you omit model the proxy defaults to openai/whisper-1 (Whisper’s bare name doesn’t match any auto-resolve prefix, so the default uses the explicit form to stay routable).
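
All three shapes take the same request body; only the model value changes. A minimal sketch using Python's requests library, assuming a MODELUX_API_KEY environment variable, a local meeting.mp3, and a @transcribe config you have defined:

import os
import requests

url = "https://api.modelux.ai/openai/v1/audio/transcriptions"
headers = {"Authorization": f"Bearer {os.environ['MODELUX_API_KEY']}"}

# Same multipart request, three equivalent ways to address the model.
for model in ("whisper-1", "openai/whisper-1", "@transcribe"):
    with open("meeting.mp3", "rb") as f:
        resp = requests.post(
            url,
            headers=headers,
            files={"file": ("meeting.mp3", f, "audio/mpeg")},
            data={"model": model, "language": "en"},
        )
    print(model, "->", resp.json()["text"])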

Routing policies: single and fallback_chain are supported. ensemble and cascade are rejected at execution time: running the same audio through multiple models and merging the output has no well-defined semantics for either STT text or TTS bytes.

Fallback chains work the same as chat: primary fails (5xx / 429 / transport), next target is attempted. Typical setup: OpenAI Whisper primary, Azure OpenAI Whisper fallback.

Transcription request

multipart/form-data — same shape as OpenAI’s upstream endpoint.

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| file | file | yes | Audio file. Max 25 MB. |
| model | string | no | Defaults to openai/whisper-1. |
| language | string | no | ISO-639-1 hint (e.g. en). |
| prompt | string | no | Bias vocabulary / style. |
| response_format | string | no | json (default), text, srt, vtt, verbose_json. |

curl https://api.modelux.ai/openai/v1/audio/transcriptions \
  -H "Authorization: Bearer mlx_sk_..." \
  -F file=@meeting.mp3 \
  -F model=@transcribe \
  -F language=en
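
Because the endpoint mirrors OpenAI's shape, the official openai Python SDK can be pointed at the proxy base URL and used as-is. A sketch, assuming the mlx_sk_... key and @transcribe slug are replaced with your own:

from openai import OpenAI

# Drop-in OpenAI client aimed at the proxy's OpenAI-shape surface.
client = OpenAI(
    base_url="https://api.modelux.ai/openai/v1",
    api_key="mlx_sk_...",
)

with open("meeting.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="@transcribe",   # or "whisper-1" / "openai/whisper-1"
        file=audio,
        language="en",
    )
print(transcript.text)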

Transcription response

{
  "text": "Hello, this is the transcript.",
  "duration": 12.4
}

duration (seconds of audio) is included when the upstream reports it and drives STT cost computation.

Speech request

application/json — OpenAI TTS shape.

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| model | string | yes | tts-1, tts-1-hd, @voice, etc. |
| input | string | yes | Text to synthesize. |
| voice | string | no | Defaults to alloy. OpenAI voices: alloy, echo, fable, onyx, nova, shimmer. |
| response_format | string | no | mp3 (default), opus, aac, flac, wav, pcm. |
| speed | number | no | 0.25–4.0. |
| user | string | no | End-user identifier (also accepted via X-Modelux-User-Id header). |

curl https://api.modelux.ai/openai/v1/audio/speech \
  -H "Authorization: Bearer mlx_sk_..." \
  -H "Content-Type: application/json" \
  -d '{"model":"tts-1","input":"Hello, world!","voice":"alloy"}' \
  --output hello.mp3

Speech response

Binary audio bytes. The upstream Content-Type is forwarded verbatim (typically audio/mpeg for mp3, audio/wav for wav, etc.), so SDKs and curl --output handle it naturally. Response is chunked.
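
A sketch of the same flow with the openai Python SDK, writing the returned bytes to disk (assumes your own mlx_sk_... key; @voice stands in for any routing config slug you have defined):

from openai import OpenAI

client = OpenAI(
    base_url="https://api.modelux.ai/openai/v1",
    api_key="mlx_sk_...",
)

speech = client.audio.speech.create(
    model="tts-1",            # or "@voice"
    input="Hello, world!",
    voice="alloy",
    response_format="mp3",
)

# Raw audio bytes; Content-Type is forwarded from upstream (audio/mpeg here).
with open("hello.mp3", "wb") as f:
    f.write(speech.content)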

Cost accounting

STT is priced per second of audio; TTS is priced per input character. Costs apply against the routing config’s budget like any other request, and are reported on the response as x-modelux-cost-usd using the default pricing table (whisper-1 and gpt-4o-transcribe at $0.006/min, tts-1 at $15/1M chars, tts-1-hd at $30/1M chars).
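
A worked example of that arithmetic using the default pricing table above (illustrative only; actual billing is computed server-side and reported via x-modelux-cost-usd):

# STT: priced per second of audio (whisper-1 at $0.006/min).
duration_s = 12.4                       # from the transcription response
stt_cost = duration_s / 60 * 0.006      # 0.00124 USD

# TTS: priced per input character (tts-1 at $15 per 1M chars).
text = "Hello, world!"                  # 13 characters
tts_cost = len(text) * 15 / 1_000_000   # 0.000195 USD

print(f"STT ${stt_cost:.6f}, TTS ${tts_cost:.6f}")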

Headers

All the standard proxy request/response headers apply (see chat completions). For transcriptions, identify the end user via X-Modelux-User-Id, since the multipart form has no user field; TTS accepts either the header or the body's user field.

BYOK passthrough (X-Modelux-Provider-Key) is honored on direct provider/model calls but not on @config — same rule as the rest of the proxy.
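
A sketch of both headers with the openai Python SDK (user_123 and the sk-... provider key are placeholders; the provider key only applies here because the request targets a direct provider/model, not a @config):

from openai import OpenAI

client = OpenAI(
    base_url="https://api.modelux.ai/openai/v1",
    api_key="mlx_sk_...",
)

with open("meeting.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="openai/whisper-1",   # direct model, so BYOK is honored
        file=audio,
        extra_headers={
            "X-Modelux-User-Id": "user_123",
            "X-Modelux-Provider-Key": "sk-...",   # your own OpenAI key
        },
    )
print(transcript.text)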