Get to your first routed request in five minutes.
A guided start, an exhaustive reference, and a recipe book of patterns we've seen work in production.
Quickstart
Install nothing. Inferly is an OpenAI-compatible endpoint. Point your existing SDK at https://api.inferly.com/v1 and you are routed.
# Python (OpenAI SDK)
from openai import OpenAI
client = OpenAI(base_url="https://api.inferly.com/v1", api_key="sk_live_...")
resp = client.chat.completions.create(model="auto", messages=[...])
Authentication
Use a project-scoped key. Keys carry roles. Rotate keys from the dashboard or via the API.
# curl
curl https://api.inferly.com/v1/chat/completions \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","messages":[{"role":"user","content":"hi"}]}'
Concepts
Three primitives. Everything else composes from these.
- Request — a chat, embedding or tool call. Inferly routes one at a time.
- Policy — your declared constraints: latency, cost, accuracy, region.
- Workflow — a named bundle of requests + policy + golden set. The unit of evals and observability.
Routing policy
Policies are JSON. Attach per-request or per-workflow.
{
  "slo_ms": 350,
  "max_cost_usd": 0.002,
  "min_accuracy": 0.92,
  "region": "eu-west",
  "allow": ["anthropic/*", "openai/*"]
}
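To make the policy fields concrete, here is a minimal sketch of how such constraints could be checked against candidate models. It is not Inferly's router: the `Candidate` fields, the model names, and the use of glob matching for `allow` are assumptions made for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Candidate:
    # Hypothetical per-model estimates; the real registry schema may differ.
    model: str
    region: str
    p95_latency_ms: float
    est_cost_usd: float
    accuracy: float


def satisfies(policy: dict, c: Candidate) -> bool:
    """Return True if the candidate meets every constraint present in the policy."""
    if "slo_ms" in policy and c.p95_latency_ms > policy["slo_ms"]:
        return False
    if "max_cost_usd" in policy and c.est_cost_usd > policy["max_cost_usd"]:
        return False
    if "min_accuracy" in policy and c.accuracy < policy["min_accuracy"]:
        return False
    if "region" in policy and c.region != policy["region"]:
        return False
    allow = policy.get("allow")
    # Assumption: "allow" entries are shell-style globs such as "openai/*".
    if allow is not None and not any(fnmatchcase(c.model, pat) for pat in allow):
        return False
    return True


policy = {
    "slo_ms": 350,
    "max_cost_usd": 0.002,
    "min_accuracy": 0.92,
    "region": "eu-west",
    "allow": ["anthropic/*", "openai/*"],
}

# Made-up candidates, not real model ids.
candidates = [
    Candidate("anthropic/model-a", "eu-west", 300.0, 0.0015, 0.95),
    Candidate("openai/model-b", "eu-west", 420.0, 0.0010, 0.96),  # breaches slo_ms
    Candidate("other/model-c", "eu-west", 200.0, 0.0005, 0.97),   # not in allow list
]
eligible = [c.model for c in candidates if satisfies(policy, c)]
```

Omitted keys are treated as "no constraint" in this sketch, so a policy with only `region` set would filter on region alone.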
Model registry
Listed at GET /v1/models. Tagged by modality, context, region availability and current health.
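As an illustration of using those tags, the sketch below filters a model listing by health, region, and modality. The response shape is an assumption: the `data`/`id` envelope follows the OpenAI-style list format, while the `health`, `regions`, and `modalities` field names are hypothetical and should be checked against a real response.

```python
def healthy_models(listing: dict, region: str, modality: str) -> list[str]:
    """Pick model ids that are healthy, serve the region, and support the modality."""
    return [
        m["id"]
        for m in listing.get("data", [])
        if m.get("health") == "healthy"          # hypothetical field name
        and region in m.get("regions", [])       # hypothetical field name
        and modality in m.get("modalities", [])  # hypothetical field name
    ]


# Made-up sample standing in for a GET /v1/models response body.
sample_listing = {
    "object": "list",
    "data": [
        {"id": "vendor-a/chat-1", "health": "healthy",
         "regions": ["eu-west", "us-east"], "modalities": ["chat"]},
        {"id": "vendor-b/chat-2", "health": "degraded",
         "regions": ["eu-west"], "modalities": ["chat"]},
        {"id": "vendor-c/embed-1", "health": "healthy",
         "regions": ["eu-west"], "modalities": ["embedding"]},
    ],
}
eu_chat = healthy_models(sample_listing, region="eu-west", modality="chat")
```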
Failover
Inferly maintains a hot ranking of healthy providers. On an error or a mid-stream SLO breach, it retries the request on the next-best provider within ~400ms.
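The core idea, walking a ranked list and moving on when a provider fails, can be sketched client-side as follows. This is a simplified illustration only: it does not model mid-stream recovery, SLO timing, or the ~400ms budget, and `ProviderError`, `call_with_failover`, and the provider names are hypothetical.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Hypothetical error raised when a provider call fails."""


def call_with_failover(
    ranked: Sequence[str], send: Callable[[str], str]
) -> tuple[str, str]:
    """Try providers in rank order; return (provider, reply) from the first success."""
    errors = []
    for provider in ranked:
        try:
            return provider, send(provider)
        except ProviderError as exc:
            # Record the failure and fall through to the next-best provider.
            errors.append(f"{provider}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


def fake_send(provider: str) -> str:
    # Stand-in for a real request; "provider-a" is simulated as down.
    if provider == "provider-a":
        raise ProviderError("503 upstream unavailable")
    return f"ok from {provider}"


used, reply = call_with_failover(["provider-a", "provider-b", "provider-c"], fake_send)
```

Here the first provider fails, so the call is served by the second in the ranking and the third is never contacted.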
Semantic cache
Embedding-based deduplication. Tunable cosine threshold per workflow. Cached completions are billed at our caching rate, not the underlying model's.
Configuration
{
  "cache": {
    "mode": "semantic",
    "threshold": 0.97,
    "ttl_s": 86400
  }
}
Golden sets
Upload a JSONL of prompts and expected outputs. Inferly runs your suite against every candidate model on a schedule and on demand.
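Before uploading, it can help to validate the file locally. The sketch below parses a JSONL golden set and checks that each record has the required fields. The field names `prompt` and `expected` are assumptions based on the description above; confirm the actual schema before relying on them.

```python
import json

REQUIRED_FIELDS = {"prompt", "expected"}  # assumed field names, not confirmed


def parse_golden_set(text: str) -> list[dict]:
    """Parse JSONL text into records, raising ValueError on malformed lines."""
    cases = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: invalid JSON ({exc.msg})") from exc
        if not isinstance(record, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            raise ValueError(f"line {lineno}: missing {sorted(missing)}")
        cases.append(record)
    return cases


# Made-up examples, one JSON object per line.
golden_jsonl = "\n".join(
    json.dumps(case)
    for case in [
        {"prompt": "Translate 'bonjour' to English.", "expected": "hello"},
        {"prompt": "What is 2 + 2?", "expected": "4"},
    ]
)
cases = parse_golden_set(golden_jsonl)
```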
CI / blocking
Hook a webhook into your deploy pipeline. Block the release if any workflow's eval score drops by more than a threshold Δ against its baseline.
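The gating rule itself is simple enough to sketch. The function below flags any workflow whose score fell more than Δ below its baseline; the workflow names, scores, and the choice to skip workflows with no current score are all assumptions for illustration, and the webhook payload format is not shown because it is not specified here.

```python
def regressions(
    scores: dict[str, float], baselines: dict[str, float], delta: float
) -> dict[str, float]:
    """Return {workflow: drop} for workflows whose score fell more than delta."""
    flagged = {}
    for workflow, baseline in baselines.items():
        if workflow not in scores:
            continue  # assumption: workflows without a current score are skipped
        drop = baseline - scores[workflow]
        if drop > delta:
            flagged[workflow] = drop
    return flagged


# Made-up numbers for two hypothetical workflows.
baselines = {"support-triage": 0.94, "invoice-extract": 0.91}
scores = {"support-triage": 0.90, "invoice-extract": 0.90}

failed = regressions(scores, baselines, delta=0.02)
# A CI step could pass this to sys.exit() so a regression fails the job.
exit_code = 1 if failed else 0
```

In this example only `support-triage` is flagged: it dropped by about 0.04, while `invoice-extract` dropped by about 0.01, which is within the 0.02 tolerance.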
Traces
Every request emits an OpenTelemetry span. Provider, tokens, latency, retries, cost, eval score — structured, queryable.
Exports
Snowflake, BigQuery, S3 (Parquet), Datadog, Honeycomb. Configure in the dashboard.
Stuck somewhere?
Engineering office hours every Thursday. Or DM us in the developer Slack.