API reference

One endpoint surface. Six resources.

REST + streaming over HTTPS. OpenAI request/response shapes. JSON only. Versioned in the path (/v1).

POST /v1/chat/completions

Routes a chat request to the model that best satisfies the attached policy. OpenAI-shaped request in, OpenAI-shaped response out, plus an x-inferly-* envelope carrying routing metadata.

curl https://api.inferly.com/v1/chat/completions \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role":"user","content":"Summarize Q3."}],
    "stream": true,
    "inferly": {"slo_ms": 350, "max_cost_usd": 0.002}
  }'
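The same call can be sketched in Python with only the standard library. The URL, headers and body are taken from the curl example; the helper names (build_chat_request, iter_stream_text) are illustrative, and the stream parser assumes OpenAI-style SSE chunks ("data: {...}" lines terminated by "data: [DONE]"), which this page implies but does not spell out.

```python
import json
import urllib.request
from typing import Iterable, Iterator

API_URL = "https://api.inferly.com/v1/chat/completions"


def build_chat_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) the streaming chat request shown above."""
    body = {
        "model": "auto",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "inferly": {"slo_ms": 350, "max_cost_usd": 0.002},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines (assumed wire format)."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

To send it, pass the request to urllib.request.urlopen and feed the response object, which iterates line by line, to iter_stream_text.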

POST /v1/embeddings

Routes to the cheapest embedding model that meets your accuracy floor (min_accuracy).

{
  "model": "auto",
  "input": ["two sentences", "to embed"],
  "inferly": { "min_accuracy": 0.95 }
}
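A minimal Python sketch of building this body and reading the vectors back. The response handling assumes the OpenAI embeddings shape (a "data" list whose items carry "embedding" and "index"), since the page promises OpenAI response shapes but shows no sample; both function names are hypothetical.

```python
import json
from typing import Any


def build_embeddings_body(texts: list[str], min_accuracy: float) -> str:
    """Serialize the request body shown above for POST /v1/embeddings."""
    return json.dumps({
        "model": "auto",
        "input": texts,
        "inferly": {"min_accuracy": min_accuracy},
    })


def extract_vectors(response: dict[str, Any]) -> list[list[float]]:
    """Return embeddings in input order (OpenAI response shape, assumed)."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

Sorting by index keeps each vector aligned with its input text even if the items arrive out of order.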

GET /v1/models

Lists every model in your registry, with live health, region availability and current routing cost.

/v1/workflows

CRUD for named workflows. Each workflow attaches a policy, a golden set, retention rules and a routing scope.

POST /v1/workflows
{
  "name": "support-tier-1",
  "policy": { "slo_ms": 400, "max_cost_usd": 0.001 },
  "golden_set_id": "gs_8h...",
  "retention": "30d"
}
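One way to keep this body typed on the client side is a pair of dataclasses. This is only a sketch: the class names are hypothetical, and it covers just the four fields shown in the example (the routing scope is omitted because the page does not show its field).

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Policy:
    slo_ms: int
    max_cost_usd: float


@dataclass
class Workflow:
    name: str
    policy: Policy
    golden_set_id: str
    retention: str  # e.g. "30d", as in the example above

    def to_json(self) -> str:
        """Serialize to the body for POST /v1/workflows."""
        # asdict() recurses into the nested Policy dataclass.
        return json.dumps(asdict(self))
```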

/v1/evals

Upload golden sets, list runs, fetch scores.

/v1/traces

Query per-request traces by workflow, time range, or filters. Server-side pagination. Stream via SSE.

Errors

Inferly returns standard HTTP status codes and an OpenAI-shaped error body, plus an x-inferly-trace-id you can quote to support.

{ "error": { "type": "policy_violation", "message": "no provider met SLO" } }
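A client might fold the status code, error body and trace ID into a single exception, roughly as follows. InferlyError and parse_error are hypothetical names, and the sketch assumes x-inferly-trace-id arrives as a response header.

```python
import json
from typing import Mapping, Optional


class InferlyError(Exception):
    """Hypothetical exception carrying the fields this page documents."""

    def __init__(self, status: int, type_: str, message: str,
                 trace_id: Optional[str]) -> None:
        super().__init__(f"{status} {type_}: {message} (trace {trace_id})")
        self.status = status
        self.type = type_
        self.message = message
        self.trace_id = trace_id


def parse_error(status: int, headers: Mapping[str, str],
                body: bytes) -> InferlyError:
    """Turn a non-2xx response into an InferlyError."""
    try:
        parsed = json.loads(body)
    except ValueError:
        parsed = None  # body was not JSON (e.g. a proxy error page)
    err = parsed.get("error") if isinstance(parsed, dict) else None
    if not isinstance(err, dict):
        err = {}
    # Header names are case-insensitive; normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    return InferlyError(
        status=status,
        type_=err.get("type", "unknown"),
        message=err.get("message", ""),
        trace_id=lowered.get("x-inferly-trace-id"),
    )
```

Falling back to "unknown" instead of raising means a malformed body never hides the status code or trace ID.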

Rate limits

Limits are per key and scoped by plan. Every response includes the x-ratelimit-remaining and x-ratelimit-reset headers.
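A client can use these two headers to pause before it runs out of quota. The sketch below makes one loud assumption: that x-ratelimit-reset holds the number of seconds until the window resets. This page does not define the unit, so adjust the arithmetic if it turns out to be a timestamp. The function name is illustrative.

```python
from typing import Mapping


def seconds_to_wait(headers: Mapping[str, str]) -> float:
    """Return how long to pause before the next call; 0.0 if quota remains.

    ASSUMPTION: x-ratelimit-reset is a count of seconds until reset.
    """
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        remaining = int(lowered["x-ratelimit-remaining"])
        reset = float(lowered["x-ratelimit-reset"])
    except (KeyError, ValueError):
        return 0.0  # headers missing or unparseable: do not block
    return max(reset, 0.0) if remaining <= 0 else 0.0
```

Call it after each response and pass a non-zero result to time.sleep before sending the next request.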

SDKs

  • Use any OpenAI SDK as-is
  • Native TypeScript / Python clients add typed access to inferly.* fields
  • Go and Rust SDKs in preview
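As a sketch of the first option, the official OpenAI Python SDK can be pointed at Inferly through its base_url argument, with the inferly object passed via extra_body. The base URL is inferred from the curl example, INFERLY_API_KEY is a hypothetical environment variable name, and the policy values are copied from the chat example; none of this is confirmed beyond "use any OpenAI SDK as-is".

```python
import os


def chat_kwargs(prompt: str) -> dict:
    """Keyword arguments for client.chat.completions.create()."""
    return {
        "model": "auto",
        "messages": [{"role": "user", "content": prompt}],
        # extra_body merges extra JSON fields into the request body, which
        # lets the non-OpenAI "inferly" object ride along.
        "extra_body": {"inferly": {"slo_ms": 350, "max_cost_usd": 0.002}},
    }


def main() -> None:
    from openai import OpenAI  # third-party: pip install openai

    client = OpenAI(
        base_url="https://api.inferly.com/v1",  # inferred from the curl URL
        api_key=os.environ["INFERLY_API_KEY"],  # hypothetical variable name
    )
    response = client.chat.completions.create(**chat_kwargs("Summarize Q3."))
    print(response.choices[0].message.content)
```

Calling main() performs a real request, so it needs the openai package installed and a valid key in the environment.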