Product

One platform. Every layer of the inference stack.

Routing, caching, evals, observability, governance — engineered together so you don't have to staff six teams to ship one feature.

Routing

Policy-aware model routing.

Stop hard-coding model names. Declare what you care about — latency, cost, accuracy, region — and let the router pick per request.

SLOs

Latency budgets

Set a p95 latency target. Inferly will not select a provider that has breached it in the last 60 seconds.

cost

Cost ceilings

Hard caps per request, per workflow, per team. Spend never exceeds policy, even under load spikes.

accuracy

Accuracy floors

Reject any model whose eval score on your golden set drops below threshold.

region

Data residency

Pin requests to EU, US, UK, AU, or on-prem clusters. Audited per call.

routing-rr

Trained reranker

Our routing model improves with every request you send.

Caching

A cache that thinks in meaning, not strings.

Embedding-based dedup catches paraphrased prompts your exact-match cache would miss. Typical workloads hit 30–60% in two weeks.

Exact

Sub-millisecond response on identical prompts and parameters.

Semantic

Cosine-similarity match across an embedding index. Tunable threshold per workflow.

Partial

Token-prefix caching for chat continuations. Re-uses prior turns, charges only new tokens.

Evals

Continuous evaluation, not vibes.

Define a golden set once. Inferly runs it nightly against every candidate model, charts the drift, and blocks regressions from reaching prod.

Eval suiteRun onScoreΔ vs. last weekStatus
customer-support-v3claude-3.5-sonnet0.942+0.011passing
customer-support-v3gpt-4o0.918−0.004passing
code-review-v2deepseek-v30.871+0.022passing
summarize-longllama-3.3-70b0.802−0.038blocked
structured-extractgpt-4o-mini0.965+0.003passing
Observability

Per-request truth, not aggregate guesses.

Every request emits a structured trace: provider, tokens, latency, retries, cost, eval score. Stream to Datadog, Honeycomb, or your warehouse.

Traces

OpenTelemetry-native. Drop into your existing tracing UI.

Spend

Live cost per team, per workflow, per model. Budget alerts.

Quality

Eval scores trended against deploys and provider changes.

Export

S3, Snowflake, BigQuery, Postgres. Your data, your warehouse.

Governance

Compliance built in, not bolted on.

SSO, SCIM, RBAC, allow/deny lists, PII redaction and full audit log — available from day one, not at the Enterprise tier.

Identity

SAML, OIDC, SCIM via WorkOS. Group-based RBAC per workflow.

Data controls

PII redaction pre-flight, region pinning, retention windows per workflow.

Audit log

Every config change and every call, immutable, exportable for SOC 2 audits.

The shortest path from prototype to production AI.

Free tier. OpenAI-compatible. One base URL away.

Start freeRead the docs