One endpoint surface. Six resources.
REST + streaming over HTTPS. OpenAI request/response shapes. JSON only. Versioned at the path.
POST /v1/chat/completions
Routes a chat request to the optimal model for the attached policy. OpenAI-shape in, OpenAI-shape out, plus an x-inferly-* envelope with routing metadata.
curl https://api.inferly.com/v1/chat/completions \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role":"user","content":"Summarize Q3."}],
    "stream": true,
    "inferly": {"slo_ms": 350, "max_cost_usd": 0.002}
  }'
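A rough Python equivalent of the curl call, using only the standard library. This is a sketch, not an official client: the function names are illustrative, the API key is a placeholder, and the stream parser assumes the response follows the OpenAI convention of `data: {...}` lines terminated by `data: [DONE]`.

```python
import json
import urllib.request

API_URL = "https://api.inferly.com/v1/chat/completions"


def build_chat_request(api_key, prompt, slo_ms=350, max_cost_usd=0.002):
    """Build (but do not send) the POST request from the curl example."""
    body = {
        "model": "auto",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "inferly": {"slo_ms": slo_ms, "max_cost_usd": max_cost_usd},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def collect_stream_text(lines):
    """Join the text deltas from OpenAI-style streamed chunks.

    `lines` is any iterable of already-decoded strings. Assumes each
    chunk carries its text in choices[i].delta.content.
    """
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # OpenAI-style end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)
```

To send the request, pass it to `urllib.request.urlopen` and feed the decoded response lines to `collect_stream_text`.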
POST /v1/embeddings
Routes to the cheapest embedding model that hits your accuracy floor.
{
  "model": "auto",
  "input": ["two sentences", "to embed"],
  "inferly": { "min_accuracy": 0.95 }
}
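Since responses are OpenAI-shaped, the vectors can be matched back to their inputs by index. The sketch below assumes the response body looks like OpenAI's embeddings response (a `data` list whose items carry `index` and `embedding`); both helper names are hypothetical.

```python
def build_embeddings_body(inputs, min_accuracy=0.95):
    """Build the request body shown above for a list of input strings."""
    return {
        "model": "auto",
        "input": list(inputs),
        "inferly": {"min_accuracy": min_accuracy},
    }


def pair_embeddings(inputs, response):
    """Return (text, vector) pairs, matching on the OpenAI-style `index`."""
    vectors = {item["index"]: item["embedding"] for item in response["data"]}
    if len(vectors) != len(inputs):
        raise ValueError("response does not contain one vector per input")
    return [(text, vectors[i]) for i, text in enumerate(inputs)]
```

Matching on `index` rather than list position keeps the pairing correct even if the items arrive out of order.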
GET /v1/models
Lists every model in your registry, with live health, region availability and current routing cost.
/v1/workflows
CRUD for named workflows. Each workflow attaches a policy, a golden set, retention rules and a routing scope.
POST /v1/workflows
{
  "name": "support-tier-1",
  "policy": { "slo_ms": 400, "max_cost_usd": 0.001 },
  "golden_set_id": "gs_8h...",
  "retention": "30d"
}
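One way to keep workflow definitions typed on the client side is a pair of small dataclasses that serialize to the body above. This is a sketch only: the class names and the positive-value checks are assumptions made here, not documented Inferly behavior, and `gs_example` in the usage note stands in for a real golden set ID.

```python
from dataclasses import asdict, dataclass


@dataclass
class Policy:
    slo_ms: int
    max_cost_usd: float

    def __post_init__(self):
        # Client-side sanity checks (an assumption, not an API rule).
        if self.slo_ms <= 0 or self.max_cost_usd <= 0:
            raise ValueError("slo_ms and max_cost_usd must be positive")


@dataclass
class WorkflowSpec:
    name: str
    policy: Policy
    golden_set_id: str
    retention: str  # e.g. "30d", passed through unchanged

    def to_body(self):
        """Return the JSON-serializable body for POST /v1/workflows."""
        return asdict(self)  # recurses into the nested Policy
```

For example, `WorkflowSpec("support-tier-1", Policy(400, 0.001), "gs_example", "30d").to_body()` yields a dict with the same keys as the JSON example.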
/v1/evals
Upload golden sets, list runs, fetch scores.
/v1/traces
Query per-request traces by workflow, time range, or filters. Server-side pagination. Stream via SSE.
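Server-side pagination is usually consumed with a loop like the one below. The page shape is not specified here, so this sketch assumes a cursor scheme: `fetch_page` is a caller-supplied function, and the `data` and `next_cursor` field names are hypothetical.

```python
def iter_traces(fetch_page, **filters):
    """Yield traces one by one across all pages.

    fetch_page(cursor, **filters) must return a dict with a `data` list
    and, when more pages exist, a `next_cursor` value (assumed names).
    """
    cursor = None
    while True:
        page = fetch_page(cursor, **filters)
        yield from page.get("data", [])
        cursor = page.get("next_cursor")
        if not cursor:
            break  # no cursor means this was the last page
```

Keeping the HTTP call behind `fetch_page` lets the same loop work with any client and makes it easy to test with canned pages.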
Errors
Inferly returns standard HTTP status codes and an OpenAI-shaped error body, plus x-inferly-trace-id for support.
{ "error": { "type": "policy_violation", "message": "no provider met SLO" }}
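A client can turn that body and the trace header into a single exception, which keeps the trace ID at hand for support requests. A minimal sketch: `InferlyError` and `parse_error` are names invented for this example, and only the `type` and `message` fields shown above are read.

```python
import json


class InferlyError(Exception):
    """Carries the HTTP status, error type, message and trace ID."""

    def __init__(self, status, error_type, message, trace_id=None):
        super().__init__(f"{status} {error_type}: {message}")
        self.status = status
        self.error_type = error_type
        self.message = message
        self.trace_id = trace_id


def parse_error(status, headers, body_text):
    """Build an InferlyError from a failed response."""
    try:
        err = json.loads(body_text).get("error", {})
    except (ValueError, AttributeError):
        err = {}  # body was not JSON, or not a JSON object
    if not isinstance(err, dict):
        err = {}
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    return InferlyError(
        status,
        err.get("type", "unknown"),
        err.get("message", body_text),
        lowered.get("x-inferly-trace-id"),
    )
```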
Rate limits
Limits are per-key, scoped by plan. Every response includes the x-ratelimit-remaining and x-ratelimit-reset headers.
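These two headers are enough for simple client-side throttling. The format of `x-ratelimit-reset` is not stated here, so the sketch below assumes it holds the number of seconds until the window resets; if it is an epoch timestamp instead, subtract the current time first. The function name is hypothetical.

```python
def seconds_to_wait(headers, default=1.0):
    """Return how long to pause before the next request, in seconds."""
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        remaining = int(lowered["x-ratelimit-remaining"])
    except (KeyError, ValueError):
        return 0.0  # header missing or unparseable: do not throttle
    if remaining > 0:
        return 0.0  # budget left in the current window
    try:
        # Assumption: the value is seconds until reset, not a timestamp.
        return max(0.0, float(lowered["x-ratelimit-reset"]))
    except (KeyError, ValueError):
        return default  # out of budget but no usable reset value
```

A caller would typically pass the result to `time.sleep` before retrying.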
SDKs
- Use any OpenAI SDK as-is
- Native TypeScript / Python clients add typed access to inferly.* fields
- Go and Rust SDKs in preview