Your app calls an LLM. Don't let it be the weakest link.
You wrote client.chat.completions.create(...) six months ago and haven't touched it since. Meanwhile your costs have doubled, your latency has crept up, and you still can't tell your CFO what feature X cost last month. Modelux fixes all of that without touching your call sites.
Change these two lines. Keep everything else.
Same OpenAI SDK. Same call sites. Same streaming, same tool calling, same structured outputs. You get fallbacks, analytics, and cost attribution for free.
from openai import OpenAI

client = OpenAI(
-     api_key=os.environ["OPENAI_API_KEY"],
+     base_url="https://api.modelux.ai/v1",
+     api_key=os.environ["MODELUX_API_KEY"],
)
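After the swap, your existing call sites keep working as-is. A minimal runnable sketch of a streaming call routed through Modelux (the model name and prompt are illustrative, everything else is standard OpenAI SDK usage):

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.modelux.ai/v1",
    api_key=os.environ["MODELUX_API_KEY"],
)

# The call site is unchanged: plain OpenAI SDK streaming.
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; use whatever model you call today
    messages=[{"role": "user", "content": "Summarize the release notes in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()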
The problems that show up once LLM traffic is real.
OpenAI is down mid-launch
Your demo hits a 503 and recovery takes hours. Every request runs the same single-provider code path, so there is nothing to fall back to.
Costs are unpredictable
A prompt-engineering tweak doubles your bill overnight. Finance asks why; you open a spreadsheet.
Model tests require a deploy
Trying GPT-4o-mini vs Haiku on your summarization endpoint means a feature flag, a PR, and a rollout.
'Why was this slow?' is unanswerable
A user complains the response took 8 seconds. You have no per-request trace, no provider-level latency history, no way to know if it was the model or the network.
Each problem has a config-driven fix.
Fallbacks fix reliability in five minutes
Define a fallback chain once. Your app calls @production forever. Modelux retries across providers when one is degraded. You stop paging at 2am.
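From the app's side this is a one-line change at most. A sketch only: it assumes the @production alias is accepted wherever a model name normally goes, using the client configured in the migration snippet above.

# client: the OpenAI(...) instance pointed at the Modelux base_url, as shown earlier.
# "@production" is the alias; the fallback chain behind it can change in config
# without redeploying this code.
response = client.chat.completions.create(
    model="@production",
    messages=[{"role": "user", "content": "Draft a status update for the outage."}],
)
print(response.choices[0].message.content)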
Cost optimization without code changes
Flip traffic from gpt-4o to a cost-optimized config that routes 60% of requests to cheaper models meeting the same quality tier. Typical savings: 40-60%.
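The exact schema for a cost-optimized config isn't spelled out on this page, so treat the following as an illustrative shape only; the "weighted" strategy and its field names are hypothetical, not documented Modelux fields. It shows the idea: keep 40% of traffic on gpt-4o and send 60% to a cheaper model.

{
  "strategy": "weighted",
  "routes": [
    { "model": "gpt-4o", "weight": 0.4 },
    { "model": "gpt-4o-mini", "weight": 0.6 }
  ]
}

Because the split lives in config, shifting traffic is an edit on the Modelux side, not a code change or a deploy.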
A/B test models in production
Route 10% of traffic to a candidate config. Compare cost, latency, and your own quality signals. Promote the winner with one click.
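As with the cost example above, this is a hypothetical shape rather than the documented schema; a 90/10 canary between your current config and a candidate might be expressed along these lines.

{
  "strategy": "split",
  "targets": [
    { "config": "production", "weight": 0.90 },
    { "config": "candidate", "weight": 0.10 }
  ]
}

Promoting the winner is then a config change, not a redeploy.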
Full decision traces
Every request gets a full routing trace: which attempts ran and why, per-attempt latency, and cost. The "why was this slow" question now has an answer.
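The trace format isn't specified on this page; as a sketch of the kind of record described (attempts, outcomes, per-attempt latency and cost), one request's trace might look roughly like this, with all field names and values illustrative.

{
  "request_id": "req_abc123",
  "strategy": "fallback",
  "attempts": [
    { "model": "claude-haiku-4-5", "outcome": "timeout", "latency_ms": 2000 },
    { "model": "gpt-4o-mini", "outcome": "success", "latency_ms": 870, "cost_usd": 0.0004 }
  ],
  "total_latency_ms": 2870
}

A record like this answers the question directly: the slow part was the timed-out first attempt, not the model that finally responded.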
A reliability config you can ship tonight.
Three attempts across three providers. When the primary is slow or rate-limited, Modelux retries with the next one. Your app sees a single successful response every time; your p99 latency drops; your "LLM down" Slack channel goes quiet.
▸ Per-attempt timeouts
▸ Auto-retry on 429 / 5xx / timeout
▸ Full decision trace on every request
▸ Zero code changes in your app
{
  "strategy": "fallback",
  "attempts": [
    { "model": "claude-haiku-4-5", "timeout_ms": 2000 },
    { "model": "gpt-4o-mini", "timeout_ms": 3000 },
    { "model": "gemini-2.5-flash", "timeout_ms": 5000 }
  ],
  "retry_on": ["429", "5xx", "timeout"]
}

Two-minute migration. Free tier. No credit card.
Free tier covers 10k requests per month — enough to prove it works before anyone asks you to justify another vendor.