Agents are expensive. Make them observable.
An autonomous agent is dozens or hundreds of LLM calls per task. Most teams don't realize their cost-per-task until the invoice lands. Modelux is the control plane that makes agent traffic legible: per-run costs, multi-model orchestration, decision traces, and replay against historical runs before you ship a change.
Every call in a run is tagged with its run_id.
Pass mlx:tags.run_id on every call the agent makes. Analytics can then group by run_id to show cost, token count, latency, and error rate per run, and budget alerts can fire when a single run exceeds a threshold.
- ▸ Real-time cost-per-run in analytics
- ▸ Per-run budget caps with auto-abort via webhook
- ▸ Search logs by run_id to rebuild the trace
- ▸ Replay historical runs against new configs
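The auto-abort bullet above implies a small webhook receiver on your side. A minimal sketch of that logic, with an assumed payload shape ({"run_id", "cost_usd", "budget_usd"}) that is illustrative, not a documented Modelux schema:

```python
# Runs a single agent process would check before each step.
aborted_runs = set()

def handle_budget_alert(payload: dict) -> bool:
    """Hypothetical budget-alert webhook handler.

    Assumed payload shape (illustrative only):
    {"run_id": "run_ab12", "cost_usd": 0.31, "budget_usd": 0.25}
    Returns True if the run was marked for abort.
    """
    if payload["cost_usd"] >= payload["budget_usd"]:
        aborted_runs.add(payload["run_id"])
        return True
    return False

def should_abort(run_id: str) -> bool:
    # The agent loop calls this before each step and stops early if set.
    return run_id in aborted_runs
```

The agent loop stays simple: it only checks a set membership per step, while the alerting logic lives behind the webhook.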
import uuid

# Assumes an OpenAI-compatible `client` pointed at the Modelux gateway.
run_id = f"run_{uuid.uuid4().hex[:12]}"

def step(prompt, model="@executor"):
    # `prompt` is a standard chat messages list; the run_id tag rides
    # along in extra_body so every request in this run is attributable.
    return client.chat.completions.create(
        model=model,
        messages=prompt,
        extra_body={
            "mlx:tags": {"run_id": run_id, "agent": "research-v2"},
        },
    )

plan = step(plan_prompt, model="@planner")
tools = [step(tool_prompt) for tool_prompt in expand(plan)]
final = step(summary_prompt, model="@summarizer")

Most LLM tools weren't built for loops.
Long-horizon runs that silently double in cost
Your agent loops twelve times instead of the expected four. You notice the next morning when the invoice arrives.
Retries that compound on themselves
The tool step fails, the agent retries, the retry takes a different path, the new path also fails. No single place to see the whole shape.
Mixing cheap and frontier models is a build step
Your planner is Claude Sonnet; your step executor is Haiku; your summarizer is Flash. Three clients, three config paths, three billing relationships.
"What did that agent actually do?" is unanswerable
A customer asks for a replay. You have the final result but not the intermediate steps, not the tool call outputs, not the reasoning chain.
A control plane that understands loops.
Per-run cost attribution
Tag every request in a run with a run_id. Modelux totals cost and latency per run and surfaces anomalies the moment they happen.
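Conceptually, the per-run rollup is a group-by over tagged request records. A minimal sketch, with an assumed record shape (the field names here are illustrative, not Modelux's export format):

```python
from collections import defaultdict

def rollup_by_run(records):
    """Aggregate per-request records into per-run totals.

    Each record is assumed to look like (illustrative shape):
    {"mlx:tags": {"run_id": "..."}, "cost_usd": 0.002, "latency_s": 1.3}
    """
    runs = defaultdict(lambda: {"cost_usd": 0.0, "latency_s": 0.0, "requests": 0})
    for r in records:
        run = runs[r["mlx:tags"]["run_id"]]
        run["cost_usd"] += r["cost_usd"]
        run["latency_s"] += r["latency_s"]
        run["requests"] += 1
    return dict(runs)
```

A budget alert is then just a threshold check over these totals as records stream in.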
Multi-model orchestration via one API
Call @planner for strategy, @executor for tool steps, @summarizer for final writeup. Different routing configs, same client, same SDK, same analytics plane.
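In Modelux the alias-to-model mapping lives server-side in the routing config; shown here as a plain dict purely to make the idea concrete (the model IDs are placeholders, not real identifiers):

```python
# Illustrative alias map; in practice this resolution happens in the
# control plane, not in client code. Model names are placeholders.
ROUTES = {
    "@planner": "claude-sonnet",     # frontier model for strategy
    "@executor": "claude-haiku",     # cheap model for tool steps
    "@summarizer": "gemini-flash",   # fast model for the final writeup
}

def resolve(alias: str) -> str:
    # Concrete model IDs pass through untouched.
    return ROUTES.get(alias, alias)
```

Because the mapping is config, swapping the planner model is a config change, not a code change across three clients.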
Replay any run against a new config
Take last week's 500 real runs, replay them against a candidate config with a new model, diff the cost and quality. Promote if it holds.
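The replay-and-diff loop above reduces to: re-issue each recorded request under the candidate config and compare totals. A sketch of the cost half of that diff, where run_with_config is a stand-in for the client call under the candidate routing (all field names are assumptions for illustration):

```python
def replay_cost_diff(records, run_with_config):
    """Compare recorded cost against cost under a candidate config.

    records: historical requests, assumed shape (illustrative):
        {"messages": [...], "cost_usd": 0.002}
    run_with_config: callable taking messages, returning a dict
        with a "cost_usd" key for the candidate run.
    """
    current = sum(r["cost_usd"] for r in records)
    candidate = sum(run_with_config(r["messages"])["cost_usd"] for r in records)
    return {
        "current": current,
        "candidate": candidate,
        "delta_pct": (candidate - current) / current * 100,
    }
```

The quality half of the diff (agent success rate) needs a task-level grader, which is why replay operates on whole runs, not isolated requests.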
Decision traces for every step
Every call in a run stores its full routing trace. Reconstructing "what the agent did" becomes a log query, not an archaeology project.
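"A log query, not an archaeology project" means trace reconstruction is a filter plus a sort. A sketch against an assumed log-record shape (illustrative field names):

```python
def reconstruct_trace(records, run_id):
    """Rebuild the ordered step sequence for one run.

    Assumed record shape (illustrative):
    {"mlx:tags": {"run_id": "..."}, "timestamp": 1713000000, "model": "@planner"}
    """
    steps = [r for r in records if r["mlx:tags"]["run_id"] == run_id]
    return sorted(steps, key=lambda r: r["timestamp"])
```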
Replaying last 24h of tagged agent traffic...
runs_replayed 487
requests 12,831
window 2026-04-13 → 2026-04-14
                 CURRENT    CANDIDATE   DELTA
mean cost/run    $0.0284    $0.0173     -39.1%
p50 latency      8.21s      7.84s       -4.5%
p95 latency      19.73s     18.12s      -8.2%
error rate       0.42%      0.38%       -0.04pp
agent success    91.4%      91.2%       -0.2pp

[promote] [diff] [rollback-plan]

Know the cost/quality tradeoff before you ship.
Changing the planner model is the kind of change that could save 40% or tank quality. With Modelux, you replay yesterday's real traffic against the candidate config and see the diff. Promote only if it holds.
- ▸ Select any historical window up to 24h
- ▸ Filter replay by tag (run_id, agent, tenant)
- ▸ Promote candidate with audited version bump
Stop flying blind in your agent loops.
Free tier is enough to tag a few thousand runs and get a feel for the analytics. Team tier covers 1M requests, replay, and full decision traces.