Docs

Get to your first routed request in five minutes.

A guided start, an exhaustive reference, and a recipe book of patterns we've seen work in production.

Quickstart

Install nothing. Inferly is an OpenAI-compatible endpoint. Point your existing SDK at https://api.inferly.com/v1 and you are routed.

# Python — OpenAI SDK
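from openai import OpenAI

# Only base_url and api_key change; the rest of your SDK code stays as it is.
client = OpenAI(base_url="https://api.inferly.com/v1", api_key="sk_live_...")
# [...] is a placeholder for your own message list.
resp = client.chat.completions.create(model="auto", messages=[...])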

Authentication

Use a project-scoped key. Keys carry roles. Rotate from the dashboard or via API.

# curl
curl https://api.inferly.com/v1/chat/completions \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","messages":[{"role":"user","content":"hi"}]}'
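
If you are not using an SDK, the same request can be assembled with Python's standard library. This is a minimal sketch: `build_chat_request` is a hypothetical helper name, the `INFERLY_API_KEY` environment variable is an assumption rather than a documented convention, and the request is only built here, not sent.

```python
import json
import os
import urllib.request

API_URL = "https://api.inferly.com/v1/chat/completions"


def build_chat_request(api_key: str, content: str, model: str = "auto") -> urllib.request.Request:
    """Build (but do not send) the same POST request as the curl example."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": content}]}
    )
    return urllib.request.Request(
        API_URL,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # INFERLY_API_KEY is an assumed variable name, used so the key stays out of source.
    key = os.environ.get("INFERLY_API_KEY", "sk_live_...")
    req = build_chat_request(key, "hi")
    # urllib.request.urlopen(req) would actually send it.
    print(req.get_method(), req.full_url)
```

Reading the key from the environment keeps it out of version control, which matters more once keys carry roles.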

Concepts

Three primitives. Everything else composes from these.

  • Request — a chat, embedding or tool call. Inferly routes one at a time.
  • Policy — your declared constraints: latency, cost, accuracy, region.
  • Workflow — a named bundle of requests + policy + golden set. The unit of evals and observability.
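
One way to picture how the three primitives relate is as plain data types. The sketch below is illustrative only: the class and field names are hypothetical, and the rule that a per-request policy takes precedence over the workflow's policy is an assumption, not documented behavior.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Policy:
    """Declared constraints; every field is optional in this sketch."""
    slo_ms: Optional[int] = None
    max_cost_usd: Optional[float] = None
    min_accuracy: Optional[float] = None
    region: Optional[str] = None


@dataclass
class Request:
    """A single routed call: a chat, embedding or tool call."""
    kind: str
    payload: dict
    policy: Optional[Policy] = None


@dataclass
class Workflow:
    """A named bundle of requests, a policy and a golden set."""
    name: str
    policy: Policy
    golden_set: list = field(default_factory=list)
    requests: list = field(default_factory=list)

    def effective_policy(self, request: Request) -> Policy:
        # Assumption: a policy attached to the request overrides the workflow's.
        return request.policy or self.policy
```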

Routing policy

Policies are JSON. Attach per-request or per-workflow.

{
  "slo_ms": 350,
  "max_cost_usd": 0.002,
  "min_accuracy": 0.92,
  "region": "eu-west",
  "allow": ["anthropic/*", "openai/*"]
}
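
To make the constraints concrete, here is a rough sketch of how a policy like the one above could be checked against a candidate model. It assumes the `allow` entries are glob patterns and that each candidate carries `id`, `region`, `p95_latency_ms`, `cost_usd` and `accuracy` fields; those field names and the model IDs are hypothetical, and Inferly's actual matching logic may differ.

```python
from fnmatch import fnmatchcase

POLICY = {
    "slo_ms": 350,
    "max_cost_usd": 0.002,
    "min_accuracy": 0.92,
    "region": "eu-west",
    "allow": ["anthropic/*", "openai/*"],
}


def satisfies(policy: dict, candidate: dict) -> bool:
    """Return True if the candidate meets every constraint in the policy."""
    return (
        # Assumption: allow-list entries are shell-style glob patterns.
        any(fnmatchcase(candidate["id"], pattern) for pattern in policy["allow"])
        and candidate["region"] == policy["region"]
        and candidate["p95_latency_ms"] <= policy["slo_ms"]
        and candidate["cost_usd"] <= policy["max_cost_usd"]
        and candidate["accuracy"] >= policy["min_accuracy"]
    )


if __name__ == "__main__":
    # Hypothetical candidate, for illustration only.
    candidate = {
        "id": "openai/model-a",
        "region": "eu-west",
        "p95_latency_ms": 300,
        "cost_usd": 0.0015,
        "accuracy": 0.95,
    }
    print(satisfies(POLICY, candidate))
```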

Model registry

Models are listed at GET /v1/models. Each entry is tagged by modality, context length, region availability and current health.
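
Once the listing is parsed, filtering it client-side is straightforward. The sketch below assumes an OpenAI-style envelope (`{"data": [...]}`) and hypothetical tag fields named `modality`, `context`, `regions` and `health`; check the actual response before relying on any of these names. The sample entries are made up.

```python
# Illustrative payload only; real field names and model IDs may differ.
SAMPLE_LISTING = {
    "data": [
        {"id": "vendor-a/chat-1", "modality": "chat", "context": 128000,
         "regions": ["eu-west", "us-east"], "health": "healthy"},
        {"id": "vendor-b/chat-2", "modality": "chat", "context": 32000,
         "regions": ["us-east"], "health": "healthy"},
        {"id": "vendor-c/embed-1", "modality": "embedding", "context": 8000,
         "regions": ["eu-west"], "health": "degraded"},
    ]
}


def healthy_models(listing: dict, region: str, modality: str) -> list[str]:
    """Return IDs of healthy models with the given modality in the given region."""
    return [
        model["id"]
        for model in listing["data"]
        if model["health"] == "healthy"
        and model["modality"] == modality
        and region in model["regions"]
    ]


if __name__ == "__main__":
    print(healthy_models(SAMPLE_LISTING, region="eu-west", modality="chat"))
```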

Failover

Inferly maintains a hot ranking of healthy providers. On an error or an SLO breach mid-stream, it retries on the next-best provider within ~400ms.
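
The failover behavior can be approximated with a simple ranked loop. This is a client-side illustration, not Inferly's implementation: `call_with_failover` and `AllProvidersFailed` are hypothetical names, and the sketch measures latency only after a call completes, whereas the real system is described as reacting mid-stream.

```python
import time
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when no provider returns a result within the SLO."""


def call_with_failover(
    ranked: Sequence[tuple[str, Callable[[], str]]],
    slo_ms: float,
) -> tuple[str, str]:
    """Try providers in rank order; move on after an error or an SLO breach."""
    for name, call in ranked:
        start = time.monotonic()
        try:
            result = call()
        except Exception:
            continue  # provider error: fall through to the next best
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms <= slo_ms:
            return name, result
        # SLO breach: discard the slow response and try the next provider.
    raise AllProvidersFailed("no provider met the SLO")
```

The list order stands in for the hot ranking; in practice that ranking would be refreshed continuously from health data rather than passed in once.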

Semantic cache

Embedding-based deduplication. Tunable cosine threshold per workflow. Cached completions are billed at our caching rate, not the underlying model's.

Configuration

{
  "cache": {
    "mode": "semantic",
    "threshold": 0.97,
    "ttl_s": 86400
  }
}
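
To show what `threshold` and `ttl_s` control, here is a toy in-memory version of a semantic cache. It is a sketch under stated assumptions: the `SemanticCache` class is hypothetical, embeddings are plain lists of floats, and lookup is a linear scan, which a production cache would replace with a vector index.

```python
import math
import time


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.97, ttl_s=86400, clock=time.time):
        self.threshold = threshold
        self.ttl_s = ttl_s
        self.clock = clock  # injectable so expiry can be tested
        self.entries = []   # (embedding, completion, stored_at)

    def put(self, embedding, completion):
        self.entries.append((embedding, completion, self.clock()))

    def get(self, embedding):
        """Return the closest unexpired completion at or above the threshold."""
        now = self.clock()
        # Drop entries older than the TTL before searching.
        self.entries = [e for e in self.entries if now - e[2] < self.ttl_s]
        best = max(self.entries, key=lambda e: cosine(embedding, e[0]), default=None)
        if best is not None and cosine(embedding, best[0]) >= self.threshold:
            return best[1]
        return None
```

A higher threshold means fewer but safer hits; a lower one raises the hit rate at the risk of returning a completion for a prompt that only looks similar.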

Golden sets

Upload a JSONL of prompts and expected outputs. Inferly runs your suite against every candidate model on a schedule and on demand.
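
A golden set is one JSON object per line. The sketch below writes and reads such a file and scores outputs by exact match. The `prompt` and `expected` field names and the exact-match metric are assumptions for illustration; the actual schema and scoring method are not specified here.

```python
import json


def to_jsonl(cases):
    """Serialize a list of dicts as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(case, ensure_ascii=False) for case in cases) + "\n"


def load_jsonl(text):
    """Parse JSONL text, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def exact_match_score(cases, outputs):
    """Fraction of cases whose output equals the expected text (whitespace-trimmed)."""
    if not cases:
        return 0.0
    hits = sum(
        1 for case, out in zip(cases, outputs)
        if out.strip() == case["expected"].strip()
    )
    return hits / len(cases)


if __name__ == "__main__":
    # Hypothetical cases using assumed field names.
    cases = [
        {"prompt": "What is 2 + 2?", "expected": "4"},
        {"prompt": "Capital of France?", "expected": "Paris"},
    ]
    print(to_jsonl(cases), end="")
    print(exact_match_score(cases, ["4", "Lyon"]))
```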

CI / blocking

Hook a webhook into your deploy pipeline. Block the release if any workflow's eval score drops by more than a threshold Δ against its baseline.
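
The blocking rule reduces to a comparison per workflow. A minimal sketch of the gate a pipeline step might run, assuming scores arrive as a mapping of workflow name to score; the function names, the score format and the choice to treat a missing workflow as a score of 0 are all assumptions.

```python
def regressions(baseline: dict, current: dict, delta: float) -> list[str]:
    """Return workflows whose score dropped by more than delta versus baseline."""
    return sorted(
        name
        for name, base in baseline.items()
        # Assumption: a workflow missing from the current run counts as score 0.
        if base - current.get(name, 0.0) > delta
    )


def gate_exit_code(baseline: dict, current: dict, delta: float) -> int:
    """Return 1 (block the release) if any workflow regressed, otherwise 0."""
    failed = regressions(baseline, current, delta)
    for name in failed:
        print(f"BLOCK: {name} dropped from {baseline[name]} to {current.get(name, 0.0)}")
    return 1 if failed else 0
```

A CI step would pass the return value of `gate_exit_code` to `sys.exit`, so a non-zero status stops the deploy.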

Traces

Every request emits an OpenTelemetry span. Provider, tokens, latency, retries, cost, eval score — structured, queryable.
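
Because the spans are structured, simple aggregations fall out directly once they are exported. The sketch below sums cost per provider and counts retried requests over span records represented as dicts. The attribute names (`provider`, `cost_usd`, `retries`) and the sample values are hypothetical; the actual span attribute keys may differ.

```python
from collections import defaultdict


def cost_by_provider(spans):
    """Sum the cost attribute of each span, grouped by provider."""
    totals = defaultdict(float)
    for span in spans:
        totals[span["provider"]] += span["cost_usd"]
    return dict(totals)


def retried(spans):
    """Return only the spans that needed at least one retry."""
    return [span for span in spans if span["retries"] > 0]


if __name__ == "__main__":
    # Made-up records with assumed attribute names.
    spans = [
        {"provider": "vendor-a", "cost_usd": 0.002, "retries": 0},
        {"provider": "vendor-b", "cost_usd": 0.001, "retries": 1},
        {"provider": "vendor-a", "cost_usd": 0.003, "retries": 0},
    ]
    print(cost_by_provider(spans))
    print(len(retried(spans)))
```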

Exports

Snowflake, BigQuery, S3 (Parquet), Datadog, Honeycomb. Configure in the dashboard.

Stuck somewhere?

Engineering office hours every Thursday. Or DM us in the developer Slack.

Get help
API reference