How we compare
Our transparent, test-driven framework for comparing AI providers across quality, latency, cost, reliability, security & compliance, and support & tooling.
What we measure
- Quality — accuracy, coherence, instruction-following, hallucinations.
- Latency — p50/p95 time to first token and end-to-end time to full output.
- Cost — effective cost per 1K tokens (input + output) and per completed task.
- Reliability — success rate, error codes, timeouts, rate limits.
- Security & Compliance — data retention, regional processing, SOC2/ISO27001, HIPAA/GDPR claims.
- Support & Tooling — SDKs, streaming, function calling, eval tools, ecosystem.
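To make these dimensions concrete, here is a minimal sketch of the record we keep for each scored run. The class and field names are illustrative, not a production schema.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One scored test run for a single provider (illustrative field names)."""
    provider: str
    task: str                  # e.g. "chat", "rag", "json", "coding"
    quality_score: float       # 0-100 from task-specific evals
    ttft_ms: float             # time to first token
    total_ms: float            # end-to-end time to full output
    input_tokens: int
    output_tokens: int
    succeeded: bool            # False on errors, timeouts, rate limits
    error_code: str | None = None
```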
How we run tests
- Standardised prompts: curated suites for chat, RAG, function calling, JSON, coding.
- Fixed settings: temperature and top-p are controlled; when the API supports it, we also fix a seed for determinism.
- Warm-start: we include a warm-up run; reported latency excludes cold-start outliers when noted.
- Repeat runs: each test runs ≥5 times; we report median (p50) and p95.
- Token accounting: we log input/output tokens and calculate effective cost (see the run-loop sketch after this list).
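A minimal sketch of that run loop, assuming a hypothetical `call_model` adapter that returns latency and token counts for one request; the adapter and its return keys are placeholders, not a real SDK call.

```python
import statistics
from typing import Callable


def benchmark(call_model: Callable[[str], dict], prompt: str, repeats: int = 5) -> dict:
    """Warm up, then run the prompt `repeats` times and report p50/p95 latency."""
    call_model(prompt)  # warm-up run; excluded from reported latency

    latencies, in_tok, out_tok = [], 0, 0
    for _ in range(repeats):                  # >= 5 repeat runs
        result = call_model(prompt)
        latencies.append(result["latency_ms"])
        in_tok += result["input_tokens"]
        out_tok += result["output_tokens"]

    latencies.sort()
    p95_index = min(len(latencies) - 1, round(0.95 * (len(latencies) - 1)))
    return {
        "p50_ms": statistics.median(latencies),
        "p95_ms": latencies[p95_index],       # nearest-rank p95
        "avg_input_tokens": in_tok / repeats,
        "avg_output_tokens": out_tok / repeats,
    }
```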
Scoring
Each category is scored 0–100, then we compute a weighted composite:
| Category | Weight | Notes |
|---|---|---|
| Quality | 40% | Task-specific evals + human spot-checks |
| Latency | 15% | p50/p95, streaming support |
| Cost | 20% | $/1K tokens & cost-per-task |
| Reliability | 15% | Error rate, rate-limit ergonomics |
| Security & Compliance | 5% | Retention, regionality, certifications |
| Support & Tooling | 5% | Docs, SDKs, ecosystem |
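As a worked example, the composite is a straight weighted sum of the 0–100 category scores. The weights mirror the table above; the dictionary layout is illustrative.

```python
# Weights from the table above; category scores are on a 0-100 scale.
WEIGHTS = {
    "quality": 0.40,
    "latency": 0.15,
    "cost": 0.20,
    "reliability": 0.15,
    "security": 0.05,
    "support": 0.05,
}


def composite(scores: dict[str, float]) -> float:
    """Weighted composite score: sum of weight x category score."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must total 100%
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)


# Example: composite({"quality": 82, "latency": 70, "cost": 65,
#                     "reliability": 90, "security": 75, "support": 80})
```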
Price normalisation
We convert all pricing to USD and compute effective cost per task using the same prompts and truncation rules.
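A minimal sketch of that normalisation, assuming per-1K-token prices and a currency conversion rate to USD; parameter names are illustrative.

```python
def effective_cost_usd(input_tokens: int, output_tokens: int,
                       input_price_per_1k: float, output_price_per_1k: float,
                       fx_to_usd: float = 1.0) -> float:
    """Effective cost of one completed task, normalised to USD.

    Prices are per 1K tokens in the provider's billing currency;
    fx_to_usd converts that currency to USD (1.0 if already USD).
    """
    cost = ((input_tokens / 1000) * input_price_per_1k
            + (output_tokens / 1000) * output_price_per_1k)
    return cost * fx_to_usd
```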
Disclosures
- We do not accept paid placement for rankings.
- Affiliate links, when present, are marked and never change scores.
- When providers update models or prices, we re-run affected tests and version results.