v2.4 · Latency-aware routing now in public beta

The unified inference cloud for production AI.

One API. Every frontier model. Every GPU cloud. Inferly automatically routes each request to the provider that best fits your cost, latency and accuracy targets, with sub-second failover and built-in caching.

Start free · Read the docs

SOC 2 Type II · HIPAA-ready · EU + US data residency · 99.99% uptime SLA

Powering inference at

Lattice AI

Northwind

Helix

Parallax

Forma

Quantica

Why teams switch

Inference is the new networking layer. We built the router.

Hard-coding a single provider was fine when there were three. With 100+ open and closed models and a 10× spread in GPU prices, the routing decision is now the workload.

70%

Average cost reduction vs. single-provider stacks

180ms

Median (p50) routing overhead, globally

10×

GPU price spread across providers, captured automatically

99.99%

Multi-provider availability with sub-second failover

Platform

A complete inference control plane.

Routing, caching, evaluation, observability and governance — engineered as one system, exposed behind one OpenAI-compatible endpoint.

routing

Policy-based model routing

Declare a latency SLO, a cost ceiling, an accuracy floor and PII constraints. Inferly picks the optimal model + provider per request and re-evaluates as the market moves.

model = router.select({"slo": "350ms", "max_cost": 0.002})
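As a rough illustration of what a policy check like this might involve, the sketch below filters a handful of hypothetical candidates by latency SLO, cost ceiling and accuracy floor, then picks the cheapest survivor. The `Candidate` fields, the sample numbers and the cheapest-first tie-break are assumptions made for the example, not Inferly's actual selection logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    # Hypothetical per-provider estimates; not real benchmark data.
    name: str
    p50_latency_ms: float
    cost_usd: float
    accuracy: float


def select(candidates, slo_ms, max_cost_usd, min_accuracy) -> Optional[Candidate]:
    """Return the cheapest candidate that satisfies every constraint, or None."""
    viable = [
        c for c in candidates
        if c.p50_latency_ms <= slo_ms
        and c.cost_usd <= max_cost_usd
        and c.accuracy >= min_accuracy
    ]
    # Assumed tie-break: lowest cost first, then lowest latency.
    return min(viable, key=lambda c: (c.cost_usd, c.p50_latency_ms), default=None)


candidates = [
    Candidate("provider-a/model-x", 310, 0.0018, 0.94),
    Candidate("provider-b/model-y", 280, 0.0012, 0.93),
    Candidate("provider-c/model-z", 150, 0.0006, 0.88),  # below the accuracy floor
    Candidate("provider-d/model-x", 420, 0.0009, 0.94),  # slower than the SLO
]

winner = select(candidates, slo_ms=350, max_cost_usd=0.002, min_accuracy=0.92)
print(winner.name if winner else "no viable candidate")
```

Filtering on hard constraints first and optimising afterwards keeps the policy easy to audit: a candidate is either inside the declared envelope or it is not.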

caching

Semantic + exact-match cache

Embedding-based deduplication catches paraphrased prompts. Cache hit rates of 30–60% are typical within two weeks of traffic.

cache.hit_rate = 0.48 (24h)
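To make the idea of a combined exact-match and semantic cache concrete, here is a minimal sketch. It uses a toy bag-of-words vector in place of a real embedding model, and the `SemanticCache` class, the `embed` helper and the 0.8 similarity threshold are all assumptions for illustration rather than a description of Inferly's cache.

```python
import math
import re
from collections import Counter


def embed(text):
    # Toy stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed similarity cutoff
        self.exact = {}             # prompt -> response
        self.entries = []           # (embedding, response) pairs

    def get(self, prompt):
        if prompt in self.exact:  # exact-match path
            return self.exact[prompt]
        query = embed(prompt)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best and cosine(query, best[0]) >= self.threshold:
            return best[1]  # semantic path: close enough to a stored prompt
        return None

    def put(self, prompt, response):
        self.exact[prompt] = response
        self.entries.append((embed(prompt), response))


cache = SemanticCache()
cache.put("Summarize the Q3 revenue report", "Q3 summary ...")
hit = cache.get("summarize the q3 revenue report please")  # paraphrase of the stored prompt
miss = cache.get("Translate this contract into German")    # unrelated prompt
print(hit, miss)
```

A linear scan is fine for a sketch; at real traffic volumes an approximate nearest-neighbour index would normally replace it.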

failover

Sub-second failover

Provider down? Quota hit? We retry on a healthy peer before the user notices.

structured

Guaranteed structured output

JSON-schema enforcement across every provider, not just the two that support it natively.
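One common way to enforce structure on a provider without native support is validate-and-retry. The sketch below assumes that approach; the source does not say how Inferly implements enforcement. `fake_provider` is a hypothetical stand-in for a model call, and `conforms` is a deliberately tiny check rather than a full JSON Schema validator.

```python
import json

SCHEMA = {
    "required": ["title", "score"],
    "properties": {"title": str, "score": (int, float)},
}


def conforms(payload, schema):
    # Deliberately tiny check (required keys + Python types), not full JSON Schema.
    if not isinstance(payload, dict):
        return False
    if any(key not in payload for key in schema["required"]):
        return False
    return all(
        isinstance(payload[key], expected)
        for key, expected in schema["properties"].items()
        if key in payload
    )


def structured_call(generate, schema, max_attempts=3):
    """Call `generate` until its output parses as JSON and matches `schema`."""
    for attempt in range(1, max_attempts + 1):
        raw = generate(attempt)
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: try again
        if conforms(payload, schema):
            return payload
    raise ValueError(f"no schema-conforming output after {max_attempts} attempts")


def fake_provider(attempt):
    # Hypothetical provider that only produces valid output on the third try.
    outputs = [
        "Sure! Here is the JSON:",
        '{"title": "Q3"}',
        '{"title": "Q3", "score": 0.9}',
    ]
    return outputs[attempt - 1]


result = structured_call(fake_provider, SCHEMA)
print(result)
```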

evals

Continuous evals

Run your golden set against every candidate model on a schedule. Regressions block deploy.
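A regression gate of this kind can be sketched in a few lines. The example below assumes exact-match scoring and a 0.02 tolerance, and models are represented as plain lookup tables; all of these are illustrative choices, not Inferly's evaluation method.

```python
def accuracy(model, golden_set):
    """Fraction of golden examples where the model's answer matches exactly."""
    correct = sum(1 for prompt, expected in golden_set if model(prompt) == expected)
    return correct / len(golden_set)


def gate(baseline_score, candidate_score, tolerance=0.02):
    """Allow deploy only if the candidate is within `tolerance` of the baseline."""
    return candidate_score >= baseline_score - tolerance


golden_set = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
    ("3*3", "9"),
    ("opposite of hot", "cold"),
]

# Hypothetical models represented as lookup tables.
baseline = {"2+2": "4", "capital of France": "Paris", "3*3": "9", "opposite of hot": "cold"}.get
candidate = {"2+2": "4", "capital of France": "Paris", "3*3": "6", "opposite of hot": "cold"}.get

baseline_score = accuracy(baseline, golden_set)
candidate_score = accuracy(candidate, golden_set)
deploy_allowed = gate(baseline_score, candidate_score)
print(baseline_score, candidate_score, deploy_allowed)
```

Here the candidate misses one of four golden examples, so the gate refuses the deploy.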

observability

Per-request traces

Token counts, provider chosen, retry chain, latency breakdown, cost and accuracy — for every call. Exportable to your warehouse.

governance

PII, residency, allowlists

Pin sensitive workloads to specific regions or on-prem providers. Auditable allow/deny per team, per workflow.
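The sketch below shows one way a per-team allow/deny check with an audit trail could be structured. `TeamPolicy`, the team name, the provider names and the deny-wins rule are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class TeamPolicy:
    allowed_regions: set
    denied_providers: set = field(default_factory=set)


def is_allowed(policy, provider, region):
    """Deny wins over allow; a route must also sit in an allowed region."""
    if provider in policy.denied_providers:
        return False
    return region in policy.allowed_regions


# Hypothetical policy table: one sensitive workload pinned to EU and on-prem.
POLICIES = {"clinical-notes": TeamPolicy({"eu-west", "on-prem"}, {"provider-c"})}
audit_log = []


def check(team, provider, region):
    decision = is_allowed(POLICIES[team], provider, region)
    # Every decision is recorded so it can be audited later.
    audit_log.append({"team": team, "provider": provider, "region": region, "allowed": decision})
    return decision


in_region = check("clinical-notes", "provider-a", "eu-west")      # allowed region, provider not denied
out_of_region = check("clinical-notes", "provider-a", "us-east")  # region not on the allowlist
denied = check("clinical-notes", "provider-c", "on-prem")         # provider explicitly denied
print(in_region, out_of_region, denied, len(audit_log))
```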

Drop-in

A one-line change to your SDK.

Inferly is OpenAI API-compatible. Point your existing client at our base URL, and routing, caching, evals and observability turn on instantly.

— No proxy infrastructure to run
— No vendor SDK to install
— Works with OpenAI, Anthropic, LangChain, LlamaIndex, Vercel AI SDK
— Streaming, tools, vision, embeddings — all supported

# Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferly.com/v1",
    api_key="sk_live_...",
)

response = client.chat.completions.create(
    model="auto",                 # let Inferly pick
    messages=[{"role": "user", "content": "Summarize Q3."}],
    extra_body={
        "inferly": {
            "slo_ms": 350,
            "max_cost_usd": 0.002,
            "min_accuracy": 0.92,
            "region": "eu-west",
        },
    },
)

Live router

Watch a single request hit the optimal provider.

Inferly evaluates every viable provider against your policy in under 200ms, then dispatches. Every decision is traceable.

inferly.dashboard / live

Overview

Router

Cache

Evals

Traces

Spend

Policies

REQ_a1f2…b07c

chat.completions · streaming · 1,204 tokens

routed · 162ms

policy: slo<350ms · cost<$0.002 · acc>0.92

candidates: 11 evaluated

winner: anthropic / claude-3.5-sonnet · groq-tpu

runner-up: openai / gpt-4o · fallback ready

cache: miss (new prompt)

cost: $0.00118 (41% under ceiling)

trace: view in observability →

Compare

Built for production. Not for demos.

A routing layer that fails open in the wrong direction will quietly drain your budget. Inferly is engineered for the opposite.

Capability	Inferly	Single-provider stack	DIY router
Models supported	100+	1–3	whatever you wire
GPU clouds	10+ neoclouds + on-prem	vendor only	one or two
Failover	sub-second, policy-aware	none	retry-on-error only
Semantic cache	embeddings-based	—	exact-match only
Continuous evals	built-in	—	DIY
Cost telemetry	per-request, exportable	invoice only	DIY
Compliance	SOC 2 Type II, HIPAA-ready, EU + US data residency	varies	your problem

From the field

Used by AI engineering teams from seed stage to the Fortune 500.

"We cut inference spend 64% in the first month. The router caught a Sonnet → Haiku swap our team had argued about for a quarter."

Elena Marsh

Head of AI, Northwind

"The semantic cache alone paid for the platform. By month two we were running on 38% of the tokens for the same product."

Daniel Koh

Staff Eng, Helix

"We replaced an internal proxy four engineers maintained. Inferly's evals catch regressions our test suite never did."

Rachel Adeyemi

VP Platform, Parallax

Questions

Things engineering teams ask first.

Not on this list? Talk to engineering →

Load balancers route on availability. Inferly routes on cost, latency, accuracy and compliance — re-evaluated per request against your declared policy, with sub-second failover when a chosen provider degrades.

Inferly stores request data only if you opt in to caching or evals. Even then, data is encrypted at rest, pinned to your selected region, and never used to train any model.

The API is surface-compatible with OpenAI's: chat, completions, embeddings, tools, streaming and vision all work. A one-line base URL change is usually all that is required.

The router keeps a hot ranking of healthy peers. If your chosen provider returns an error or breaches your SLO mid-stream, we retry on the next-best within ~400ms, transparently to your client.
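The ranked-retry behaviour described above can be sketched as follows. This is a simplified illustration: it retries on errors only and does not model timing, streaming or mid-stream SLO breaches. `ProviderError`, `flaky` and `healthy` are hypothetical names introduced for the example.

```python
class ProviderError(Exception):
    pass


def call_with_failover(ranked_providers, request):
    """Try providers in ranked order and return the first successful response."""
    retry_chain = []
    for name, call in ranked_providers:
        try:
            response = call(request)
        except ProviderError as exc:
            retry_chain.append((name, str(exc)))  # record the failure, move to the next peer
            continue
        return response, retry_chain
    raise ProviderError(f"all providers failed: {retry_chain}")


def flaky(request):
    # Hypothetical primary provider that has hit its quota.
    raise ProviderError("quota exceeded")


def healthy(request):
    # Hypothetical next-best peer that answers normally.
    return f"ok: {request}"


response, retry_chain = call_with_failover(
    [("primary", flaky), ("next-best", healthy)], "Summarize Q3."
)
print(response, retry_chain)
```

Returning the retry chain alongside the response keeps each failover visible to the caller, in the same spirit as the per-request traces.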

Self-hosting is supported: VPC and on-prem deployments are available on the Enterprise plan, including air-gapped installations for regulated environments.

Ship AI to production. Stop overpaying for it.

Free tier includes 1,000 routed requests per day. No credit card. Drop-in OpenAI compatibility.

Start free · Talk to sales