Messages
Create a message using the Anthropic Messages API wire format. Drop-in
replacement for the Anthropic base URL — point your SDK at
https://api.modelux.ai/anthropic and existing code keeps working while
requests flow through modelux’s routing, budgets, and observability.
POST /anthropic/v1/messages
POST /anthropic/v1/messages/count_tokens
The same routing pipeline backs the OpenAI surface at
/openai/v1/chat/completions — pick whichever
wire format your existing SDK speaks. Both accept the same model identifiers
(raw model names or @config slugs) and the same X-Modelux-* headers.
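For example, with the official Anthropic TypeScript SDK the swap is a one-line baseURL change. A minimal sketch, assuming @anthropic-ai/sdk and a MODELUX_API_KEY environment variable holding an mlx_sk_… key:
import Anthropic from "@anthropic-ai/sdk";

// Point the official SDK at modelux instead of api.anthropic.com.
// The SDK sends the key as x-api-key, which modelux accepts for mlx_sk_… keys.
const client = new Anthropic({
  apiKey: process.env.MODELUX_API_KEY,
  baseURL: "https://api.modelux.ai/anthropic",
});

const message = await client.messages.create({
  model: "claude-sonnet-4-5",   // or "gpt-4o-mini", or a config slug like "@production"
  max_tokens: 256,
  messages: [{ role: "user", content: "Hello!" }],
});

console.log(message.content);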
Cross-provider routing
The headline pitch: an Anthropic-SDK-shaped request can route to any provider’s model. The proxy translates between Anthropic’s content-block format and the upstream provider’s native shape on the way out, then inverts the translation on the response.
# Anthropic SDK shape, Claude upstream:
curl https://api.modelux.ai/anthropic/v1/messages \
-H "Authorization: Bearer mlx_sk_..." \
-H "Content-Type: application/json" \
-d '{
"model": "claude-sonnet-4-5",
"max_tokens": 256,
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Same SDK shape, OpenAI upstream — only the model name changed:
curl https://api.modelux.ai/anthropic/v1/messages \
-H "Authorization: Bearer mlx_sk_..." \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"max_tokens": 256,
"messages": [{"role": "user", "content": "Hello!"}]
}'
The response is in Anthropic’s shape ({type: "message", role: "assistant", content: [{type: "text", text: "..."}], usage: {input_tokens, output_tokens}, ...})
regardless of which upstream actually served the request. SDK consumers
deserialize one shape; behind the scenes they can hit OpenAI, Anthropic,
Google, or Bedrock-Claude.
Request
{
"model": "claude-sonnet-4-5",
"max_tokens": 1024,
"system": "You are a helpful assistant.",
"messages": [
{ "role": "user", "content": "Hello!" }
],
"temperature": 0.7,
"stream": false
}
Model identifier
- Raw model name — claude-sonnet-4-5, gpt-4o-mini, gemini-2.5-flash, anthropic.claude-3-5-haiku-20241022-v1:0 (Bedrock form)
- Routing config slug — @production, @fallback, @experiment
Content blocks
Anthropic’s content-block format is supported on both request and response sides:
- text — text content
- image — vision input (base64 source.type=base64 and URL source.type=url both supported; example below)
- tool_use — assistant’s request to invoke a tool (multi-turn echo)
- tool_result — caller’s reply with the tool’s output
- thinking / redacted_thinking — extended-thinking blocks (round-tripped verbatim, including the opaque signature field)
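For instance, a vision request sends an image block next to the question. A minimal sketch (client from the SDK example above; the base64 payload is truncated):
const message = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 256,
  messages: [{
    role: "user",
    content: [
      {
        type: "image",
        // URL form instead: source: { type: "url", url: "https://example.com/photo.png" }
        source: { type: "base64", media_type: "image/png", data: "iVBORw0KGgo..." },
      },
      { type: "text", text: "What is in this image?" },
    ],
  }],
});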
Streaming
Set stream: true. modelux returns SSE events in Anthropic’s streaming
format:
event: message_start
data: {"type":"message_start","message":{...}}
event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}
event: content_block_stop
data: {"type":"content_block_stop","index":0}
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":N}}
event: message_stop
data: {"type":"message_stop"}
Streaming works identically across providers. When the upstream is Anthropic the events forward 1:1; for OpenAI / Google / Bedrock / Cohere the proxy translates the upstream stream into Anthropic’s event taxonomy on the way out.
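Through the SDK the same event stream surfaces as an async iterable. A sketch, assuming the client configured above:
const stream = await client.messages.create({
  model: "gemini-2.5-flash",   // any upstream; events arrive in Anthropic's taxonomy
  max_tokens: 256,
  stream: true,
  messages: [{ role: "user", content: "Hello!" }],
});

for await (const event of stream) {
  if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
    process.stdout.write(event.delta.text);
  }
}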
Tool calling
Tools are first-class on this endpoint and route across providers:
{
"model": "gpt-4o-mini",
"max_tokens": 1024,
"tools": [{
"name": "get_weather",
"description": "Get the current weather in a city",
"input_schema": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
}
}],
"tool_choice": {"type": "auto"},
"messages": [{"role": "user", "content": "Weather in SF?"}]
}
Response includes tool_use blocks regardless of upstream:
{
"type": "message",
"role": "assistant",
"content": [
{"type": "text", "text": "Let me check."},
{"type": "tool_use", "id": "...", "name": "get_weather", "input": {"city": "SF"}}
],
"stop_reason": "tool_use",
"usage": {"input_tokens": 12, "output_tokens": 18}
}
Echo the assistant’s blocks back verbatim with a tool_result block on the
next turn:
{
"messages": [
{"role": "user", "content": "Weather in SF?"},
{"role": "assistant", "content": [
{"type": "text", "text": "Let me check."},
{"type": "tool_use", "id": "toolu_abc", "name": "get_weather", "input": {"city": "SF"}}
]},
{"role": "user", "content": [
{"type": "tool_result", "tool_use_id": "toolu_abc", "content": "72°F sunny"}
]}
]
}
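The same round trip in TypeScript, as a sketch (client as configured above; getWeather is a stub standing in for your own lookup):
// Stub: replace with a real weather lookup.
async function getWeather(input: unknown): Promise<string> {
  return "72°F sunny";
}

const tools: Anthropic.Messages.Tool[] = [{
  name: "get_weather",
  description: "Get the current weather in a city",
  input_schema: {
    type: "object",
    properties: { city: { type: "string" } },
    required: ["city"],
  },
}];

// First turn: the model may answer with a tool_use block.
const first = await client.messages.create({
  model: "gpt-4o-mini",
  max_tokens: 1024,
  tools,
  messages: [{ role: "user", content: "Weather in SF?" }],
});

// Second turn: echo the assistant's blocks verbatim, then attach the tool output.
for (const block of first.content) {
  if (block.type !== "tool_use") continue;
  const output = await getWeather(block.input);
  const second = await client.messages.create({
    model: "gpt-4o-mini",
    max_tokens: 1024,
    tools,
    messages: [
      { role: "user", content: "Weather in SF?" },
      { role: "assistant", content: first.content },
      { role: "user", content: [
        { type: "tool_result", tool_use_id: block.id, content: output },
      ]},
    ],
  });
  console.log(second.content);
}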
tool_choice shapes supported: {"type": "auto"}, {"type": "any"}
(forces a tool call), {"type": "tool", "name": "X"} (forces a specific
tool). The proxy translates these to each upstream’s equivalent
(required / none / function-named for OpenAI; AUTO / ANY / NONE with
allowedFunctionNames for Google; etc.).
Tool result limitation: tool_result content is currently text-only on
the cross-provider path. Image-typed tool_result content (Anthropic
supports it natively) is dropped during translation because OpenAI’s
role: "tool" shape doesn’t carry images and would break the
cross-provider promise. If you need image results in tool replies, file
an issue.
Native prompt caching (cache_control)
Anthropic’s cache_control marker passes through the proxy verbatim
everywhere the upstream API accepts it: per-content-block on
messages, per-system-block (when system is sent as an array of
blocks), and per-tool. Set it on the caller side and the next
matching request hits Anthropic’s native prompt cache for the
discounted price (cache reads ≈ 0.10× input rate; cache writes ≈
1.25× for the 5-minute ephemeral tier).
{
"model": "claude-sonnet-4-5",
"max_tokens": 256,
"system": [
{"type": "text", "text": "short instructions"},
{"type": "text", "text": "long context to cache",
"cache_control": {"type": "ephemeral"}}
],
"tools": [
{"name": "first", "input_schema": {"type":"object","properties":{}}},
{"name": "last", "input_schema": {"type":"object","properties":{}},
"cache_control": {"type": "ephemeral"}}
],
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "long retrieved chunks…",
"cache_control": {"type": "ephemeral"}},
{"type": "text", "text": "Question about that context."}
]
}]
}
The standard pattern for tools is to mark the last tool — that caches the entire tool list as the prefix. For system, mark the long shared block; for messages, mark whatever you expect to reuse across turns.
Honored when the routed upstream is Anthropic or Bedrock-Claude; other
providers ignore it. Cache hit / write counts land in the request log
as cache_read_tokens and cache_creation_tokens so dashboards can
answer “did my marker actually hit?”. The proxy applies the matching
discount when computing logged cost_usd.
This is separate from and additive to modelux’s own semantic cache (which keys on prompt similarity across surfaces and replays a cached response without an upstream call). The two layers compose.
Extended thinking
Pass the thinking config to enable Claude’s extended-thinking models:
{
"model": "claude-sonnet-4-5",
"max_tokens": 4000,
"thinking": {"type": "enabled", "budget_tokens": 2000},
"messages": [{"role": "user", "content": "Walk me through 17 × 24."}]
}
The response includes thinking content blocks ahead of the text answer:
{
"type": "message",
"content": [
{"type": "thinking", "thinking": "I'll break 24 into 20 + 4...", "signature": "..."},
{"type": "text", "text": "17 × 24 = 408."}
],
...
}
For multi-turn conversations, echo the thinking blocks back verbatim
on subsequent turns — the signature field is the opaque token Anthropic
uses to resume reasoning. Streaming emits thinking_delta and
signature_delta events ahead of any text/tool_use blocks, preserving
the signature across the wire.
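A sketch of that echo in TypeScript (client as above):
const turn1 = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 4000,
  thinking: { type: "enabled", budget_tokens: 2000 },
  messages: [{ role: "user", content: "Walk me through 17 × 24." }],
});

// Echo the full content array (thinking blocks, signatures and all) on the next turn.
const turn2 = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 4000,
  thinking: { type: "enabled", budget_tokens: 2000 },
  messages: [
    { role: "user", content: "Walk me through 17 × 24." },
    { role: "assistant", content: turn1.content },
    { role: "user", content: "Now do 17 × 48." },
  ],
});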
Extended thinking is honored when the routed upstream is Anthropic or
Bedrock-Claude. Other providers ignore the thinking field silently.
Semantic caching
When a project enables the semantic cache (Settings → Caching), responses
to similar prompts are reused without a real upstream call. The cache
index is keyed by (project, model, embedding-of-canonical-prompt) on
modelux’s internal canonical message shape — so an entry stored from a
request on /openai/v1/chat/completions is reusable from a request on
/anthropic/v1/messages (same model, same prompt) and vice versa.
Cache hits are emitted in whichever wire shape the calling endpoint
expects: a hit served to /anthropic/v1/messages comes back as
{type:"message", content:[…]}; the same entry served to
/openai/v1/chat/completions comes back as the OpenAI completion shape.
Streaming hits replay the cached response as a well-formed SSE
sequence (message_start → content_block_* → message_delta →
message_stop).
Response headers identify cache behavior:
- X-Modelux-Cache: HIT (or MISS)
- X-Modelux-Cache-Similarity: 0.9876 (cosine similarity, hits only)
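To inspect those headers programmatically, read the raw response; a sketch using fetch:
const res = await fetch("https://api.modelux.ai/anthropic/v1/messages", {
  method: "POST",
  headers: {
    "x-api-key": process.env.MODELUX_API_KEY!,
    "content-type": "application/json",
  },
  body: JSON.stringify({
    model: "claude-sonnet-4-5",
    max_tokens: 256,
    messages: [{ role: "user", content: "Hello!" }],
  }),
});

console.log(res.headers.get("x-modelux-cache"));            // "HIT" or "MISS"
console.log(res.headers.get("x-modelux-cache-similarity")); // e.g. "0.9876" on hits
const message = await res.json();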
Cache writes only happen on successful upstream calls (a partial or
errored stream isn’t poisoned into the cache). High-temperature
(creative) requests skip the cache by default; clients can also set
Cache-Control: no-store to bypass per request.
Count tokens
Pre-flight token estimation:
POST /anthropic/v1/messages/count_tokens
Same request body as /anthropic/v1/messages (model + messages, optionally
tools/system). Returns:
{ "input_tokens": 42 }
The proxy forwards this verbatim to Anthropic upstream — token counts use
Claude’s actual tokenizer. Requires an Anthropic credential configured
for your org (or an X-Modelux-Provider-Key header).
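With the TypeScript SDK this maps to the countTokens helper (a sketch, assuming the client configured above and a recent SDK version that exposes messages.countTokens):
const count = await client.messages.countTokens({
  model: "claude-sonnet-4-5",
  messages: [{ role: "user", content: "Hello!" }],
});

console.log(count.input_tokens); // e.g. 42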
Dry-run
Set X-Modelux-Dry-Run: true (or 1) to evaluate routing without
calling the upstream:
curl https://api.modelux.ai/anthropic/v1/messages \
-H "x-api-key: mlx_sk_..." \
-H "X-Modelux-Dry-Run: true" \
-H "Content-Type: application/json" \
-d '{"model":"@production","max_tokens":256,"messages":[{"role":"user","content":"hi"}]}'
Returns the same dry_run envelope the OpenAI surface uses — routing
policy, target model + provider, candidate trace, matched rule,
variant bucket. Identical shape so the dashboard’s “preview route”
feature works regardless of which endpoint a caller uses.
Other endpoints
GET /anthropic/v1/models
Returns Anthropic’s pagination envelope ({data: [], has_more: false}).
Modelux doesn’t yet expose a curated model registry; the list is
empty. Present so SDK probes via client.models.list() don’t 404.
Message batches
For async batch processing (50% upstream discount), the full
/v1/messages/batches/* surface is proxied. See
Message batches (Anthropic) for the
endpoint table and a complete walkthrough.
Request headers
Same X-Modelux-* headers as the OpenAI surface. See
Chat completions → Request headers.
The Modelux API key is accepted as either Authorization: Bearer mlx_sk_… or x-api-key: mlx_sk_… so the official Anthropic SDKs
(which set x-api-key from their apiKey constructor option) work
as drop-in clients with nothing more than a baseURL swap.
Response headers
x-modelux-request-id: req_a1b2c3
x-modelux-model-used: claude-sonnet-4-5
x-modelux-provider: anthropic
x-modelux-cost-usd: 0.002134
x-modelux-cache: MISS
Errors
Errors follow Anthropic’s wire shape:
{
"type": "error",
"error": {
"type": "invalid_request_error",
"message": "max_tokens must be > 0"
}
}
| error.type | Status |
|---|---|
| invalid_request_error | 400 |
| authentication_error | 401 |
| permission_error | 402 (budget) / 403 |
| not_found_error | 404 |
| rate_limit_error | 429 |
| api_error | 5xx |
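Because the wire shape matches Anthropic’s, the SDK’s typed error classes keep working. A sketch (client as above):
try {
  await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 256,
    messages: [{ role: "user", content: "Hello!" }],
  });
} catch (err) {
  if (err instanceof Anthropic.RateLimitError) {
    // 429 from the table above: back off and retry
  } else if (err instanceof Anthropic.APIError) {
    console.error(err.status, err.message); // other 4xx/5xx, body in Anthropic's error shape
  } else {
    throw err;
  }
}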