v2.4 · Latency-aware routing now in public beta

The unified inference cloud for production AI.

One API. Every frontier model. Every GPU cloud. Inferly automatically routes each request to the provider that best meets your cost, latency and accuracy targets, with sub-second failover and built-in caching.

SOC 2 Type II · HIPAA-ready · EU + US data residency · 99.99% uptime SLA

Powering inference at

Lattice AI
Northwind
Helix
Parallax
Forma
Quantica
Why teams switch

Inference is the new networking layer. We built the router.

Hard-coding a single provider was fine when there were three. With 100+ open and closed models and a 10× spread in GPU prices, the routing decision is now the workload.

70%
Average cost reduction vs. single-provider stacks
180ms
Median (p50) routing overhead, globally
10×
GPU price spread across providers, captured automatically
99.99%
Multi-provider availability with sub-second failover
Platform

A complete inference control plane.

Routing, caching, evaluation, observability and governance — engineered as one system, exposed behind one OpenAI-compatible endpoint.

routing

Policy-based model routing

Declare a latency SLO, a cost ceiling, an accuracy floor and PII constraints. Inferly picks the optimal model + provider per request and re-evaluates as the market moves.

model = router.select({ slo: "350ms", max_cost: 0.002 })
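
To make the policy idea concrete, here is a minimal sketch of constraint-based selection. It is not Inferly's actual router: the `Candidate` and `Policy` types, the provider names and all the numbers are hypothetical, and the rule used (filter by every constraint, then take the cheapest) is just one reasonable way to define "optimal".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """One model + provider pairing with hypothetical measured stats."""
    model: str
    provider: str
    region: str
    p50_latency_ms: float
    cost_usd: float   # estimated cost for this request
    accuracy: float   # score from an offline eval, 0..1


@dataclass(frozen=True)
class Policy:
    slo_ms: float
    max_cost_usd: float
    min_accuracy: float
    region: Optional[str] = None  # None means any region is allowed


def select(candidates: list[Candidate], policy: Policy) -> Optional[Candidate]:
    """Return the cheapest candidate that satisfies every constraint, or None."""
    eligible = [
        c for c in candidates
        if c.p50_latency_ms <= policy.slo_ms
        and c.cost_usd <= policy.max_cost_usd
        and c.accuracy >= policy.min_accuracy
        and (policy.region is None or c.region == policy.region)
    ]
    # Tie-break on latency so equal-cost candidates resolve deterministically.
    return min(eligible, key=lambda c: (c.cost_usd, c.p50_latency_ms), default=None)


candidates = [
    Candidate("model-a", "provider-x", "eu-west", 300, 0.0018, 0.95),
    Candidate("model-b", "provider-y", "eu-west", 250, 0.0012, 0.93),
    Candidate("model-c", "provider-z", "us-east", 120, 0.0005, 0.94),  # wrong region
    Candidate("model-d", "provider-x", "eu-west", 200, 0.0008, 0.88),  # below accuracy floor
]
policy = Policy(slo_ms=350, max_cost_usd=0.002, min_accuracy=0.92, region="eu-west")
winner = select(candidates, policy)
print(winner.model if winner else "no eligible provider")
```

Re-running `select` whenever prices or latency measurements change is what "re-evaluates as the market moves" would amount to in this simplified picture.
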
caching

Semantic + exact-match cache

Embedding-based deduplication catches paraphrased prompts. Cache hit rates of 30–60% are typical within two weeks of traffic.

cache.hit_rate = 0.48 (24h)
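
A two-tier lookup of this kind can be sketched as follows. This is an illustration under stated assumptions, not Inferly's cache: `TwoTierCache`, the 0.8 threshold and the bag-of-words `embed` function are all hypothetical. In particular, word counts only catch near-duplicate wording; catching true paraphrases needs a learned embedding model in place of `embed`.

```python
import hashlib
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TwoTierCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.exact: dict[str, str] = {}                # sha256(prompt) -> response
        self.semantic: list[tuple[Counter, str]] = []  # (embedding, response)

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def put(self, prompt: str, response: str) -> None:
        self.exact[self._key(prompt)] = response
        self.semantic.append((embed(prompt), response))

    def get(self, prompt: str):
        """Return (response, tier), where tier is 'exact', 'semantic' or 'miss'."""
        hit = self.exact.get(self._key(prompt))
        if hit is not None:
            return hit, "exact"
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vec, response in self.semantic:  # linear scan; fine for a sketch
            score = cosine(query, vec)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            return best_response, "semantic"
        return None, "miss"


cache = TwoTierCache(threshold=0.8)
cache.put("Summarize the Q3 earnings report", "Q3 summary ...")
print(cache.get("summarize the q3 earnings report please")[1])
```

The threshold is the main design lever: set it too low and users receive answers to questions they did not ask, too high and the semantic tier rarely hits.
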
failover

Sub-second failover

Provider down? Quota hit? We retry on a healthy peer before the user notices.
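
The retry-on-a-healthy-peer behavior can be sketched as a loop over a ranked provider list. Everything here is hypothetical (`ProviderError`, `call_with_failover`, the stub providers); a production router would also handle timeouts, mid-stream failures and health tracking, which this sketch leaves out.

```python
from typing import Callable


class ProviderError(Exception):
    """Raised by a provider call on an outage, quota error, or timeout."""


def call_with_failover(
    ranked: list[tuple[str, Callable[[str], str]]],
    prompt: str,
) -> tuple[str, list[str]]:
    """Try providers in ranked order; return (response, retry_chain)."""
    chain: list[str] = []
    for name, call in ranked:
        chain.append(name)
        try:
            return call(prompt), chain
        except ProviderError:
            continue  # fall through to the next-best peer
    raise ProviderError(f"all providers failed: {chain}")


def flaky(prompt: str) -> str:
    raise ProviderError("quota exceeded")


def healthy(prompt: str) -> str:
    return f"echo: {prompt}"


response, chain = call_with_failover([("primary", flaky), ("backup", healthy)], "hi")
print(response, chain)
```

Returning the retry chain alongside the response keeps the failover visible in traces even though the caller never sees an error.
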

structured

Guaranteed structured output

JSON-schema enforcement across every provider, not just the two that support it natively.
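
One common way to get schema-conforming output from a provider without native support is validate-and-retry, sketched below. This is an assumption about how such a layer could work, not a description of Inferly's mechanism (constrained decoding is another option). The `conforms` helper is hypothetical and checks only `type`, `required` and `properties`; it is not a full JSON Schema validator.

```python
import json
from typing import Callable

# Maps the JSON Schema type names used here to Python types (a small subset).
_TYPES = {"string": str, "number": (int, float), "integer": int,
          "boolean": bool, "array": list, "object": dict}


def conforms(value, schema: dict) -> bool:
    """Check 'type', 'required' and 'properties' only."""
    expected = _TYPES.get(schema.get("type"))
    if expected is not None:
        # bool is a subclass of int in Python, so reject it for numeric types.
        if isinstance(value, bool) and schema["type"] in ("number", "integer"):
            return False
        if not isinstance(value, expected):
            return False
    if isinstance(value, dict):
        if any(key not in value for key in schema.get("required", [])):
            return False
        for key, sub in schema.get("properties", {}).items():
            if key in value and not conforms(value[key], sub):
                return False
    return True


def generate_structured(call: Callable[[str], str], prompt: str,
                        schema: dict, max_attempts: int = 3) -> dict:
    """Call the model until its output parses as JSON and conforms to the schema."""
    for _ in range(max_attempts):
        raw = call(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not JSON at all; try again
        if conforms(parsed, schema):
            return parsed
    raise ValueError(f"no conforming output after {max_attempts} attempts")


# A stub "model" that fails twice before producing valid output.
outputs = iter(['not json', '{"summary": 42}', '{"summary": "Revenue grew.", "risk": "low"}'])
schema = {"type": "object", "required": ["summary"],
          "properties": {"summary": {"type": "string"}}}
result = generate_structured(lambda _prompt: next(outputs), "Summarize Q3.", schema)
print(result["summary"])
```
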

evals

Continuous evals

Run your golden set against every candidate model on a schedule. Regressions block deploy.
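
A regression gate over a golden set might look like the sketch below. The names (`accuracy`, `gate`), the exact-match scoring and the 0.75 baseline are illustrative assumptions; real evals usually need fuzzier scoring than string equality.

```python
from typing import Callable

GoldenSet = list[tuple[str, str]]  # (prompt, expected answer)


def accuracy(model: Callable[[str], str], golden: GoldenSet) -> float:
    """Fraction of golden prompts answered exactly as expected."""
    if not golden:
        raise ValueError("golden set is empty")
    correct = sum(1 for prompt, expected in golden if model(prompt).strip() == expected)
    return correct / len(golden)


def gate(candidates: dict[str, Callable[[str], str]], golden: GoldenSet,
         baseline: float, tolerance: float = 0.0) -> dict[str, bool]:
    """Map each candidate to True (may deploy) or False (regression, blocked)."""
    return {
        name: accuracy(model, golden) >= baseline - tolerance
        for name, model in candidates.items()
    }


golden = [("2+2", "4"), ("capital of France", "Paris"),
          ("3*3", "9"), ("opposite of hot", "cold")]
answers_good = dict(golden)
answers_bad = {**answers_good, "3*3": "6", "opposite of hot": "warm"}

# Stub models backed by lookup tables; real candidates would call a provider.
candidates = {
    "candidate-a": lambda p: answers_good[p],
    "candidate-b": lambda p: answers_bad[p],
}
decisions = gate(candidates, golden, baseline=0.75)
print(decisions)
```

Running `gate` on a schedule and failing the deploy pipeline when the chosen model maps to `False` is one way to make regressions block a release.
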

observability

Per-request traces

Token counts, provider chosen, retry chain, latency breakdown, cost and accuracy — for every call. Exportable to your warehouse.
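
As a rough picture of what one exported trace could contain, the sketch below models a record with the fields listed above and serializes it as JSON Lines. The `Trace` schema and field names are hypothetical, not Inferly's export format.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class Trace:
    request_id: str
    provider: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float
    latency_ms: dict[str, float]  # stage name -> milliseconds
    retry_chain: list[str] = field(default_factory=list)

    @property
    def total_latency_ms(self) -> float:
        return sum(self.latency_ms.values())


def to_jsonl(traces: list[Trace]) -> str:
    """Serialize traces as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(t), sort_keys=True) for t in traces)


trace = Trace(
    request_id="req-example",
    provider="provider-y",
    model="model-b",
    prompt_tokens=900,
    completion_tokens=304,
    cost_usd=0.00118,
    latency_ms={"routing": 162.0, "provider": 840.0},
    retry_chain=["provider-x", "provider-y"],
)
line = to_jsonl([trace])
print(line)
```
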

governance

PII, residency, allowlists

Pin sensitive workloads to specific regions or on-prem providers. Auditable allow/deny per team, per workflow.

Drop-in

A one-line change to your SDK.

Inferly is OpenAI API compatible. Point your existing client at our base URL and routing, caching, evals and observability turn on instantly.

  • No proxy infrastructure to run
  • No vendor SDK to install
  • Works with OpenAI, Anthropic, LangChain, LlamaIndex, Vercel AI SDK
  • Streaming, tools, vision, embeddings: all supported
# Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferly.com/v1",
    api_key="sk_live_...",
)

response = client.chat.completions.create(
    model="auto",                 # let Inferly pick
    messages=[{"role": "user", "content": "Summarize Q3."}],
    extra_body={
        "inferly": {
            "slo_ms": 350,
            "max_cost_usd": 0.002,
            "min_accuracy": 0.92,
            "region": "eu-west",
        },
    },
)
Live router

Watch a single request hit the optimal provider.

Inferly evaluates every viable provider against your policy in under 200ms, then dispatches. Every decision is traceable.

inferly.dashboard / live
Overview · Router · Cache · Evals · Traces · Spend · Policies
REQ_a1f2…b07c
chat.completions · streaming · 1,204 tokens
routed · 162ms
policy: slo<350ms · cost<$0.002 · acc>0.92
candidates: 11 evaluated
winner: anthropic / claude-3.5-sonnet · groq-tpu
runner-up: openai / gpt-4o · fallback ready
cache: miss (new prompt)
cost: $0.00118, 41% under ceiling
trace: view in observability →
Compare

Built for production. Not for demos.

A routing layer that fails open in the wrong direction will quietly drain your budget. Inferly is engineered for the opposite.

Feature | Inferly | Single-provider stack | DIY router
Models supported | 100+ | 1–3 | whatever you wire
GPU clouds | 10+ neoclouds + on-prem | vendor only | one or two
Failover | sub-second, policy-aware | none | retry-on-error only
Semantic cache | embeddings-based | - | exact-match only
Continuous evals | built-in | - | DIY
Cost telemetry | per-request, exportable | invoice only | DIY
Compliance | SOC 2, HIPAA-ready, EU | varies | your problem
From the field

Used by AI engineering teams from seed stage to the Fortune 500.

"We cut inference spend 64% in the first month. The router caught a Sonnet → Haiku swap our team had argued about for a quarter."

Elena Marsh
Head of AI, Northwind

"The semantic cache alone paid for the platform. By month two we were running on 38% of the tokens for the same product."

Daniel Koh
Staff Eng, Helix

"We replaced an internal proxy four engineers maintained. Inferly's evals catch regressions our test suite never did."

Rachel Adeyemi
VP Platform, Parallax
Questions

Things engineering teams ask first.

Not on this list? Talk to engineering →

How is Inferly different from a load balancer?
Load balancers route on availability. Inferly routes on cost, latency, accuracy and compliance, re-evaluated per request against your declared policy, with sub-second failover when a chosen provider degrades.

Do you store prompts or responses?
Only if you opt in to caching or evals. Even then, data is encrypted at rest, pinned to your selected region, and never used to train any model.

Is Inferly compatible with the OpenAI API?
Yes. Chat, completions, embeddings, tools, streaming and vision are surface-compatible. A one-line base URL change is usually all that is required.

What happens when a provider fails?
The router keeps a hot ranking of healthy peers. If your chosen provider returns an error or breaches your SLO mid-stream, we retry on the next-best peer within ~400ms, transparently to your client.

Can Inferly run in our own VPC or on-prem?
Yes. VPC and on-prem deployments are available on the Enterprise plan, including air-gapped installations for regulated environments.

Ship AI to production. Stop overpaying for it.

Free tier includes 1,000 routed requests per day. No credit card. Drop-in OpenAI compatibility.

Start free · Talk to sales