Agents are expensive. Make them observable.
An autonomous agent is dozens or hundreds of LLM calls per task. Most teams don't realize their cost-per-task until the invoice lands. Modelux is the control plane that makes agent traffic legible: per-run costs, multi-model orchestration, decision traces, and replay against historical runs before you ship a change.
Every call in a run is tagged with its run_id.
Pass mlx:tags.run_id on every call the agent makes. Analytics can then group by run_id to show cost, token count, latency, and error rate per run, and budget alerts can fire when a single run exceeds a threshold.
- ▸ Real-time cost-per-run in analytics
- ▸ Per-run budget caps with auto-abort via webhook
- ▸ Search logs by run_id to rebuild the trace
- ▸ Replay historical runs against new configs
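The auto-abort bullet above implies a small webhook receiver on your side. A minimal sketch of that logic, with an assumed payload shape ({"run_id", "cost_usd", "budget_usd"}) that is illustrative, not a documented Modelux schema:

```python
# Runs a single agent process would check before each step.
aborted_runs = set()

def handle_budget_alert(payload: dict) -> bool:
    """Hypothetical budget-alert webhook handler.

    Assumed payload shape (illustrative only):
    {"run_id": "run_ab12", "cost_usd": 0.31, "budget_usd": 0.25}
    Returns True if the run was marked for abort.
    """
    if payload["cost_usd"] >= payload["budget_usd"]:
        aborted_runs.add(payload["run_id"])
        return True
    return False

def should_abort(run_id: str) -> bool:
    # The agent loop calls this before each step and stops early if set.
    return run_id in aborted_runs
```

The agent loop stays simple: it only checks a set membership per step, while the alerting logic lives behind the webhook.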
import uuid

# Assumes an OpenAI-compatible `client` pointed at the Modelux gateway.
run_id = f"run_{uuid.uuid4().hex[:12]}"

def step(prompt, model="@executor"):
    # `prompt` is a standard chat messages list; the run_id tag rides
    # along in extra_body so every request in this run is attributable.
    return client.chat.completions.create(
        model=model,
        messages=prompt,
        extra_body={
            "mlx:tags": {"run_id": run_id, "agent": "research-v2"},
        },
    )

plan = step(plan_prompt, model="@planner")
tools = [step(tool_prompt) for tool_prompt in expand(plan)]
final = step(summary_prompt, model="@summarizer")

Most LLM tools weren't built for loops.
Long-horizon runs that silently double in cost
Your agent loops twelve times instead of the expected four. You notice the next morning when the invoice arrives.
Retries that compound on themselves
The tool step fails, the agent retries, the retry takes a different path, the new path also fails. No single place to see the whole shape.
Mixing cheap and frontier models is a build step
Your planner is Claude Sonnet; your step executor is Haiku; your summarizer is Flash. Three clients, three config paths, three billing relationships.
"What did that agent actually do?" is unanswerable
A customer asks for a replay. You have the final result but not the intermediate steps, not the tool call outputs, not the reasoning chain.
A control plane that understands loops.
Per-run cost attribution
Tag every request in a run with a run_id. Modelux totals cost and latency per run and surfaces anomalies the moment they happen.
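Conceptually, the per-run rollup is a group-by over tagged request records. A minimal sketch, with an assumed record shape (the field names here are illustrative, not Modelux's export format):

```python
from collections import defaultdict

def rollup_by_run(records):
    """Aggregate per-request records into per-run totals.

    Each record is assumed to look like (illustrative shape):
    {"mlx:tags": {"run_id": "..."}, "cost_usd": 0.002, "latency_s": 1.3}
    """
    runs = defaultdict(lambda: {"cost_usd": 0.0, "latency_s": 0.0, "requests": 0})
    for r in records:
        run = runs[r["mlx:tags"]["run_id"]]
        run["cost_usd"] += r["cost_usd"]
        run["latency_s"] += r["latency_s"]
        run["requests"] += 1
    return dict(runs)
```

A budget alert is then just a threshold check over these totals as records stream in.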
Multi-model orchestration via one API
Call @planner for strategy, @executor for tool steps, @summarizer for final writeup. Different routing configs, same client, same SDK, same analytics plane.
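In Modelux the alias-to-model mapping lives server-side in the routing config; shown here as a plain dict purely to make the idea concrete (the model IDs are placeholders, not real identifiers):

```python
# Illustrative alias map; in practice this resolution happens in the
# control plane, not in client code. Model names are placeholders.
ROUTES = {
    "@planner": "claude-sonnet",     # frontier model for strategy
    "@executor": "claude-haiku",     # cheap model for tool steps
    "@summarizer": "gemini-flash",   # fast model for the final writeup
}

def resolve(alias: str) -> str:
    # Concrete model IDs pass through untouched.
    return ROUTES.get(alias, alias)
```

Because the mapping is config, swapping the planner model is a config change, not a code change across three clients.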
Replay any run against a new config
Take last week's 500 real runs, replay them against a candidate config with a new model, diff the cost and quality. Promote if it holds.
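The replay-and-diff loop above reduces to: re-issue each recorded request under the candidate config and compare totals. A sketch of the cost half of that diff, where run_with_config is a stand-in for the client call under the candidate routing (all field names are assumptions for illustration):

```python
def replay_cost_diff(records, run_with_config):
    """Compare recorded cost against cost under a candidate config.

    records: historical requests, assumed shape (illustrative):
        {"messages": [...], "cost_usd": 0.002}
    run_with_config: callable taking messages, returning a dict
        with a "cost_usd" key for the candidate run.
    """
    current = sum(r["cost_usd"] for r in records)
    candidate = sum(run_with_config(r["messages"])["cost_usd"] for r in records)
    return {
        "current": current,
        "candidate": candidate,
        "delta_pct": (candidate - current) / current * 100,
    }
```

The quality half of the diff (agent success rate) needs a task-level grader, which is why replay operates on whole runs, not isolated requests.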
Decision traces for every step
Every call in a run stores its full routing trace. Reconstructing "what the agent did" becomes a log query, not an archaeology project.
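"A log query, not an archaeology project" means trace reconstruction is a filter plus a sort. A sketch against an assumed log-record shape (illustrative field names):

```python
def reconstruct_trace(records, run_id):
    """Rebuild the ordered step sequence for one run.

    Assumed record shape (illustrative):
    {"mlx:tags": {"run_id": "..."}, "timestamp": 1713000000, "model": "@planner"}
    """
    steps = [r for r in records if r["mlx:tags"]["run_id"] == run_id]
    return sorted(steps, key=lambda r: r["timestamp"])
```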
Replaying last 24h of tagged agent traffic...
runs_replayed 487
requests 12,831
window 2026-04-13 → 2026-04-14
                 CURRENT    CANDIDATE   DELTA
mean cost/run    $0.0284    $0.0173     -39.1%
p50 latency      8.21s      7.84s       -4.5%
p95 latency      19.73s     18.12s      -8.2%
error rate       0.42%      0.38%       -0.04pp
agent success    91.4%      91.2%       -0.2pp

[promote] [diff] [rollback-plan]

Know the cost/quality tradeoff before you ship.
Changing the planner model is the kind of change that could save 40% or tank quality. With Modelux, you replay yesterday's real traffic against the candidate config and see the diff. Promote only if it holds.
- ▸ Select any historical window up to 24h
- ▸ Filter replay by tag (run_id, agent, tenant)
- ▸ Promote candidate with audited version bump
Stop flying blind in your agent loops.
Free tier is enough to tag a few thousand runs and get a feel for the analytics. Team tier covers 1M requests, replay, and full decision traces.