How we compare
Our transparent, test-driven framework for comparing AI providers across quality, latency, cost, reliability, security & compliance, and support & tooling.
What we measure
- Quality — accuracy, coherence, instruction-following, hallucinations.
- Latency — p50/p95 time to first token and end-to-end time to full output.
- Cost — effective cost per 1K tokens (input + output) and per completed task.
- Reliability — success rate, error codes, timeouts, rate limits.
- Security & Compliance — data retention, regional processing, SOC2/ISO27001, HIPAA/GDPR claims.
- Support & Tooling — SDKs, streaming, function calling, eval tools, ecosystem.
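To make these dimensions concrete, here is a minimal sketch of the record we keep for each scored run. The class and field names are illustrative, not a production schema.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One scored test run for a single provider (illustrative field names)."""
    provider: str
    task: str                  # e.g. "chat", "rag", "json", "coding"
    quality_score: float       # 0-100 from task-specific evals
    ttft_ms: float             # time to first token
    total_ms: float            # end-to-end time to full output
    input_tokens: int
    output_tokens: int
    succeeded: bool            # False on errors, timeouts, rate limits
    error_code: str | None = None
```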
How we run tests
- Standardised prompts: curated suites for chat, RAG, function calling, JSON, coding.
- Fixed settings: temperature and top-p are controlled; when the API supports it, we also fix a seed for determinism.
- Warm-start: we include a warm-up run; reported latency excludes cold-start outliers when noted.
- Repeat runs: each test runs ≥5 times; we report median (p50) and p95.
- Token accounting: we log input/output tokens and calculate effective cost (see the run-loop sketch after this list).
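A minimal sketch of that run loop, assuming a hypothetical `call_model` adapter that returns latency and token counts for one request; the adapter and its return keys are placeholders, not a real SDK call.

```python
import statistics
from typing import Callable


def benchmark(call_model: Callable[[str], dict], prompt: str, repeats: int = 5) -> dict:
    """Warm up, then run the prompt `repeats` times and report p50/p95 latency."""
    call_model(prompt)  # warm-up run; excluded from reported latency

    latencies, in_tok, out_tok = [], 0, 0
    for _ in range(repeats):                  # >= 5 repeat runs
        result = call_model(prompt)
        latencies.append(result["latency_ms"])
        in_tok += result["input_tokens"]
        out_tok += result["output_tokens"]

    latencies.sort()
    p95_index = min(len(latencies) - 1, round(0.95 * (len(latencies) - 1)))
    return {
        "p50_ms": statistics.median(latencies),
        "p95_ms": latencies[p95_index],       # nearest-rank p95
        "avg_input_tokens": in_tok / repeats,
        "avg_output_tokens": out_tok / repeats,
    }
```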
Scoring
Each category is scored 0–100, then we compute a weighted composite:
| Category | Weight | Notes |
|---|---|---|
| Quality | 40% | Task-specific evals + human spot-checks |
| Latency | 15% | p50/p95, streaming support |
| Cost | 20% | $/1K tokens & cost-per-task |
| Reliability | 15% | Error rate, rate-limit ergonomics |
| Security & Compliance | 5% | Retention, regionality, certifications |
| Support & Tooling | 5% | Docs, SDKs, ecosystem |
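As a worked example, the composite is a straight weighted sum of the 0–100 category scores. The weights mirror the table above; the dictionary layout is illustrative.

```python
# Weights from the table above; category scores are on a 0-100 scale.
WEIGHTS = {
    "quality": 0.40,
    "latency": 0.15,
    "cost": 0.20,
    "reliability": 0.15,
    "security": 0.05,
    "support": 0.05,
}


def composite(scores: dict[str, float]) -> float:
    """Weighted composite score: sum of weight x category score."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must total 100%
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)


# Example: composite({"quality": 82, "latency": 70, "cost": 65,
#                     "reliability": 90, "security": 75, "support": 80})
```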
Price normalisation
We convert all pricing to USD and compute effective cost per task using the same prompts and truncation rules.
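A minimal sketch of that normalisation, assuming per-1K-token prices and a currency conversion rate to USD; parameter names are illustrative.

```python
def effective_cost_usd(input_tokens: int, output_tokens: int,
                       input_price_per_1k: float, output_price_per_1k: float,
                       fx_to_usd: float = 1.0) -> float:
    """Effective cost of one completed task, normalised to USD.

    Prices are per 1K tokens in the provider's billing currency;
    fx_to_usd converts that currency to USD (1.0 if already USD).
    """
    cost = ((input_tokens / 1000) * input_price_per_1k
            + (output_tokens / 1000) * output_price_per_1k)
    return cost * fx_to_usd
```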
Disclosures
- We do not accept paid placement for rankings.
- Affiliate links, when present, are marked and never change scores.
- When providers update models or prices, we re-run affected tests and version results.