Messages
Create a message using the Anthropic Messages API wire format. Drop-in
replacement for the Anthropic base URL — point your SDK at
https://api.modelux.ai/anthropic and existing code keeps working while
requests flow through modelux’s routing, budgets, and observability.
POST /anthropic/v1/messages
POST /anthropic/v1/messages/count_tokens
The same routing pipeline backs the OpenAI surface at
/openai/v1/chat/completions — pick whichever
wire format your existing SDK speaks. Both accept the same model identifiers
(raw model names or @config slugs) and the same X-Modelux-* headers.
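For example, with the official Anthropic TypeScript SDK the swap is a one-line baseURL change. A minimal sketch, assuming @anthropic-ai/sdk and a MODELUX_API_KEY environment variable holding an mlx_sk_… key:
import Anthropic from "@anthropic-ai/sdk";

// Point the official SDK at modelux instead of api.anthropic.com.
// The SDK sends the key as x-api-key, which modelux accepts for mlx_sk_… keys.
const client = new Anthropic({
  apiKey: process.env.MODELUX_API_KEY,
  baseURL: "https://api.modelux.ai/anthropic",
});

const message = await client.messages.create({
  model: "claude-sonnet-4-5",   // or "gpt-4o-mini", or a config slug like "@production"
  max_tokens: 256,
  messages: [{ role: "user", content: "Hello!" }],
});

console.log(message.content);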
Cross-provider routing
The headline pitch: an Anthropic-SDK-shaped request can route to any provider’s model. The proxy translates between Anthropic’s content-block format and the upstream provider’s native shape on the way out, then inverts the translation on the response.
# Anthropic SDK shape, Claude upstream:
curl https://api.modelux.ai/anthropic/v1/messages \
-H "Authorization: Bearer mlx_sk_..." \
-H "Content-Type: application/json" \
-d '{
"model": "claude-sonnet-4-5",
"max_tokens": 256,
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Same SDK shape, OpenAI upstream — only the model name changed:
curl https://api.modelux.ai/anthropic/v1/messages \
-H "Authorization: Bearer mlx_sk_..." \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"max_tokens": 256,
"messages": [{"role": "user", "content": "Hello!"}]
}'
The response is in Anthropic’s shape ({type: "message", role: "assistant", content: [{type: "text", text: "..."}], usage: {input_tokens, output_tokens}, ...})
regardless of which upstream actually served the request. SDK consumers
deserialize one shape; behind the scenes they can hit OpenAI, Anthropic,
Google, or Bedrock-Claude.
Request
{
"model": "claude-sonnet-4-5",
"max_tokens": 1024,
"system": "You are a helpful assistant.",
"messages": [
{ "role": "user", "content": "Hello!" }
],
"temperature": 0.7,
"stream": false
}
Model identifier
- Raw model name — claude-sonnet-4-5, gpt-4o-mini, gemini-2.5-flash, anthropic.claude-3-5-haiku-20241022-v1:0 (Bedrock form)
- Routing config slug — @production, @fallback, @experiment
Content blocks
Anthropic’s content-block format is supported on both request and response sides:
- text — text content
- image — vision input (base64 source.type=base64 and URL source.type=url both supported; example below)
- tool_use — assistant’s request to invoke a tool (multi-turn echo)
- tool_result — caller’s reply with the tool’s output
- thinking / redacted_thinking — extended-thinking blocks (round-tripped verbatim, including the opaque signature field)
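For instance, a vision request sends an image block next to the question. A minimal sketch (client from the SDK example above; the base64 payload is truncated):
const message = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 256,
  messages: [{
    role: "user",
    content: [
      {
        type: "image",
        // URL form instead: source: { type: "url", url: "https://example.com/photo.png" }
        source: { type: "base64", media_type: "image/png", data: "iVBORw0KGgo..." },
      },
      { type: "text", text: "What is in this image?" },
    ],
  }],
});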
Streaming
Set stream: true. modelux returns SSE events in Anthropic’s streaming
format:
event: message_start
data: {"type":"message_start","message":{...}}
event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}
event: content_block_stop
data: {"type":"content_block_stop","index":0}
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":N}}
event: message_stop
data: {"type":"message_stop"}
Streaming works identically across providers. When the upstream is Anthropic the events forward 1:1; for OpenAI / Google / Bedrock / Cohere the proxy translates the upstream stream into Anthropic’s event taxonomy on the way out.
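Through the SDK the same event stream surfaces as an async iterable. A sketch, assuming the client configured above:
const stream = await client.messages.create({
  model: "gemini-2.5-flash",   // any upstream; events arrive in Anthropic's taxonomy
  max_tokens: 256,
  stream: true,
  messages: [{ role: "user", content: "Hello!" }],
});

for await (const event of stream) {
  if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
    process.stdout.write(event.delta.text);
  }
}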
Tool calling
Tools are first-class on this endpoint and route across providers:
{
"model": "gpt-4o-mini",
"max_tokens": 1024,
"tools": [{
"name": "get_weather",
"description": "Get the current weather in a city",
"input_schema": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
}
}],
"tool_choice": {"type": "auto"},
"messages": [{"role": "user", "content": "Weather in SF?"}]
}
Response includes tool_use blocks regardless of upstream:
{
"type": "message",
"role": "assistant",
"content": [
{"type": "text", "text": "Let me check."},
{"type": "tool_use", "id": "...", "name": "get_weather", "input": {"city": "SF"}}
],
"stop_reason": "tool_use",
"usage": {"input_tokens": 12, "output_tokens": 18}
}
Echo the assistant’s blocks back verbatim with a tool_result block on the
next turn:
{
"messages": [
{"role": "user", "content": "Weather in SF?"},
{"role": "assistant", "content": [
{"type": "text", "text": "Let me check."},
{"type": "tool_use", "id": "toolu_abc", "name": "get_weather", "input": {"city": "SF"}}
]},
{"role": "user", "content": [
{"type": "tool_result", "tool_use_id": "toolu_abc", "content": "72°F sunny"}
]}
]
}
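The same round trip in TypeScript, as a sketch (client as configured above; getWeather is a stub standing in for your own lookup):
// Stub: replace with a real weather lookup.
async function getWeather(input: unknown): Promise<string> {
  return "72°F sunny";
}

const tools: Anthropic.Messages.Tool[] = [{
  name: "get_weather",
  description: "Get the current weather in a city",
  input_schema: {
    type: "object",
    properties: { city: { type: "string" } },
    required: ["city"],
  },
}];

// First turn: the model may answer with a tool_use block.
const first = await client.messages.create({
  model: "gpt-4o-mini",
  max_tokens: 1024,
  tools,
  messages: [{ role: "user", content: "Weather in SF?" }],
});

// Second turn: echo the assistant's blocks verbatim, then attach the tool output.
for (const block of first.content) {
  if (block.type !== "tool_use") continue;
  const output = await getWeather(block.input);
  const second = await client.messages.create({
    model: "gpt-4o-mini",
    max_tokens: 1024,
    tools,
    messages: [
      { role: "user", content: "Weather in SF?" },
      { role: "assistant", content: first.content },
      { role: "user", content: [
        { type: "tool_result", tool_use_id: block.id, content: output },
      ]},
    ],
  });
  console.log(second.content);
}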
tool_choice shapes supported: {"type": "auto"}, {"type": "any"}
(forces a tool call), {"type": "tool", "name": "X"} (forces a specific
tool). The proxy translates these to each upstream’s equivalent
(required / none / function-named for OpenAI; AUTO / ANY / NONE with
allowedFunctionNames for Google; etc.).
Tool result limitation: tool_result content is currently text-only on
the cross-provider path. Image-typed tool_result content (Anthropic
supports it natively) is dropped during translation because OpenAI’s
role: "tool" shape doesn’t carry images and would break the
cross-provider promise. If you need image results in tool replies, file
an issue.
Native prompt caching (cache_control)
Anthropic’s cache_control marker passes through the proxy verbatim
everywhere the upstream API accepts it: per-content-block on
messages, per-system-block (when system is sent as an array of
blocks), and per-tool. Set it on the caller side and the next
matching request hits Anthropic’s native prompt cache for the
discounted price (cache reads ≈ 0.10× input rate; cache writes ≈
1.25× for the 5-minute ephemeral tier).
{
"model": "claude-sonnet-4-5",
"max_tokens": 256,
"system": [
{"type": "text", "text": "short instructions"},
{"type": "text", "text": "long context to cache",
"cache_control": {"type": "ephemeral"}}
],
"tools": [
{"name": "first", "input_schema": {"type":"object","properties":{}}},
{"name": "last", "input_schema": {"type":"object","properties":{}},
"cache_control": {"type": "ephemeral"}}
],
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "long retrieved chunks…",
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": "Question about that context."}
]
}]
}
The standard pattern for tools is to mark the last tool — that caches the entire tool list as the prefix. For system, mark the long shared block; for messages, mark whatever you expect to reuse across turns.
Honored when the routed upstream is Anthropic or Bedrock-Claude; other
providers ignore it. Cache hit / write counts land in the request log
as cache_read_tokens and cache_creation_tokens so dashboards can
answer “did my marker actually hit?”. The proxy applies the matching
discount when computing logged cost_usd.
This is separate from and additive to modelux’s own semantic cache (which keys on prompt similarity across surfaces and replays a cached response without an upstream call). The two layers compose.
Extended thinking
Pass the thinking config to enable Claude’s extended-thinking models:
{
"model": "claude-sonnet-4-5",
"max_tokens": 4000,
"thinking": {"type": "enabled", "budget_tokens": 2000},
"messages": [{"role": "user", "content": "Walk me through 17 × 24."}]
}
The response includes thinking content blocks ahead of the text answer:
{
"type": "message",
"content": [
{"type": "thinking", "thinking": "I'll break 24 into 20 + 4...", "signature": "..."},
{"type": "text", "text": "17 × 24 = 408."}
],
...
}
For multi-turn conversations, echo the thinking blocks back verbatim
on subsequent turns — the signature field is the opaque token Anthropic
uses to resume reasoning. Streaming emits thinking_delta and
signature_delta events ahead of any text/tool_use blocks, preserving
the signature across the wire.
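A sketch of that echo in TypeScript (client as above):
const turn1 = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 4000,
  thinking: { type: "enabled", budget_tokens: 2000 },
  messages: [{ role: "user", content: "Walk me through 17 × 24." }],
});

// Echo the full content array (thinking blocks, signatures and all) on the next turn.
const turn2 = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 4000,
  thinking: { type: "enabled", budget_tokens: 2000 },
  messages: [
    { role: "user", content: "Walk me through 17 × 24." },
    { role: "assistant", content: turn1.content },
    { role: "user", content: "Now do 17 × 48." },
  ],
});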
Extended thinking is honored when the routed upstream is Anthropic or
Bedrock-Claude. Other providers ignore the thinking field silently.
Semantic caching
When a project enables the semantic cache (Settings → Caching), responses
to similar prompts are reused without a real upstream call. The cache
index is keyed by (project, model, embedding-of-canonical-prompt) on
modelux’s internal canonical message shape — so an entry stored from a
request on /openai/v1/chat/completions is reusable from a request on
/anthropic/v1/messages (same model, same prompt) and vice versa.
Cache hits are emitted in whichever wire shape the calling endpoint
expects: a hit served to /anthropic/v1/messages comes back as
{type:"message", content:[…]}; the same entry served to
/openai/v1/chat/completions comes back as the OpenAI completion shape.
Streaming hits replay the cached response as a well-formed SSE
sequence (message_start → content_block_* → message_delta →
message_stop).
Response headers identify cache behavior:
- X-Modelux-Cache: HIT (or MISS)
- X-Modelux-Cache-Similarity: 0.9876 (cosine similarity, hits only)
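To inspect those headers programmatically, read the raw response; a sketch using fetch:
const res = await fetch("https://api.modelux.ai/anthropic/v1/messages", {
  method: "POST",
  headers: {
    "x-api-key": process.env.MODELUX_API_KEY!,
    "content-type": "application/json",
  },
  body: JSON.stringify({
    model: "claude-sonnet-4-5",
    max_tokens: 256,
    messages: [{ role: "user", content: "Hello!" }],
  }),
});

console.log(res.headers.get("x-modelux-cache"));            // "HIT" or "MISS"
console.log(res.headers.get("x-modelux-cache-similarity")); // e.g. "0.9876" on hits
const message = await res.json();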
Cache writes only happen on successful upstream calls (a partial or
errored stream isn’t poisoned into the cache). High-temperature
(creative) requests skip the cache by default; clients can also set
Cache-Control: no-store to bypass per request.
Count tokens
Pre-flight token estimation:
POST /anthropic/v1/messages/count_tokens
Same request body as /anthropic/v1/messages (model + messages, optionally
tools/system). Returns:
{ "input_tokens": 42 }
The proxy forwards this verbatim to Anthropic upstream — token counts use
Claude’s actual tokenizer. Requires an Anthropic credential configured
for your org (or an X-Modelux-Provider-Key header).
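With the TypeScript SDK this maps to the countTokens helper (a sketch, assuming the client configured above and a recent SDK version that exposes messages.countTokens):
const count = await client.messages.countTokens({
  model: "claude-sonnet-4-5",
  messages: [{ role: "user", content: "Hello!" }],
});

console.log(count.input_tokens); // e.g. 42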
Dry-run
Set X-Modelux-Dry-Run: true (or 1) to evaluate routing without
calling the upstream:
curl https://api.modelux.ai/anthropic/v1/messages \
-H "x-api-key: mlx_sk_..." \
-H "X-Modelux-Dry-Run: true" \
-H "Content-Type: application/json" \
-d '{"model":"@production","max_tokens":256,"messages":[{"role":"user","content":"hi"}]}'
Returns the same dry_run envelope the OpenAI surface uses — routing
policy, target model + provider, candidate trace, matched rule,
variant bucket. Identical shape so the dashboard’s “preview route”
feature works regardless of which endpoint a caller uses.
Other endpoints
GET /anthropic/v1/models
Returns Anthropic’s pagination envelope ({data: [], has_more: false}).
Modelux doesn’t yet expose a curated model registry; the list is
empty. Present so SDK probes via client.models.list() don’t 404.
Message batches
For async batch processing (50% upstream discount), the full
/v1/messages/batches/* surface is proxied. See
Message batches (Anthropic) for the
endpoint table and a complete walkthrough.
Request headers
Same X-Modelux-* headers as the OpenAI surface. See
Chat completions → Request headers.
The Modelux API key is accepted as either Authorization: Bearer mlx_sk_… or x-api-key: mlx_sk_… so the official Anthropic SDKs
(which set x-api-key from their apiKey constructor option) work
as drop-in clients with nothing more than a baseURL swap.
Response headers
x-modelux-request-id: req_a1b2c3
x-modelux-model-used: claude-sonnet-4-5
x-modelux-provider: anthropic
x-modelux-cost-usd: 0.002134
x-modelux-cache: MISS
Errors
Errors follow Anthropic’s wire shape:
{
"type": "error",
"error": {
"type": "invalid_request_error",
"message": "max_tokens must be > 0"
}
}
| error.type | Status |
|---|---|
| invalid_request_error | 400 |
| authentication_error | 401 |
| permission_error | 402 (budget) / 403 |
| not_found_error | 404 |
| rate_limit_error | 429 |
| api_error | 5xx |
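Because the wire shape matches Anthropic’s, the SDK’s typed error classes keep working. A sketch (client as above):
try {
  await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 256,
    messages: [{ role: "user", content: "Hello!" }],
  });
} catch (err) {
  if (err instanceof Anthropic.RateLimitError) {
    // 429 from the table above: back off and retry
  } else if (err instanceof Anthropic.APIError) {
    console.error(err.status, err.message); // other 4xx/5xx, body in Anthropic's error shape
  } else {
    throw err;
  }
}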