How we compare

Our transparent, test-driven framework for comparing AI providers across quality, latency, cost, reliability, security and support.

What we measure

How we run tests

  1. Standardised prompts: curated suites for chat, RAG, function calling, JSON output, and coding.
  2. Fixed seeds: temperature and top-p are controlled; when the provider supports it, we fix a seed for determinism.
  3. Warm-start: we include a warm-up run, and reported latency excludes cold-start outliers where noted.
  4. Repeat runs: each test runs ≥5 times; we report the median (p50) and 95th percentile (p95), as in the sketch after this list.
  5. Token accounting: we log input and output tokens and calculate the effective cost per task.
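
A minimal sketch of how a single test could be timed and summarised under these rules, assuming a `run_prompt` callable that wraps one provider call and returns its token counts; the function name, run count, and returned fields are illustrative assumptions, not our production harness.

```python
import statistics
import time
from typing import Callable


def measure(run_prompt: Callable[[], dict], runs: int = 5, warmup: bool = True) -> dict:
    """Run a prompt repeatedly and summarise latency and token usage.

    `run_prompt` is assumed to return a dict with `input_tokens` and
    `output_tokens`; it stands in for a real provider call.
    """
    if warmup:
        run_prompt()  # warm-up run; excluded from reported latency

    latencies, in_toks, out_toks = [], [], []
    for _ in range(runs):
        start = time.perf_counter()
        result = run_prompt()
        latencies.append(time.perf_counter() - start)
        in_toks.append(result["input_tokens"])
        out_toks.append(result["output_tokens"])

    latencies.sort()
    p95_index = max(0, round(0.95 * (len(latencies) - 1)))
    return {
        "p50_s": statistics.median(latencies),
        "p95_s": latencies[p95_index],
        "avg_input_tokens": statistics.mean(in_toks),
        "avg_output_tokens": statistics.mean(out_toks),
    }
```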

Scoring

Each category is scored 0–100; we then compute a weighted composite using the weights below:

Category            Weight   Notes
Quality             40%      Task-specific evals + human spot-checks
Latency             15%      p50/p95, streaming support
Cost                20%      $/1K tokens & cost-per-task
Reliability         15%      Error rate, rate-limit ergonomics
Security            5%       Retention, regionality, certifications
Support & Tooling   5%       Docs, SDKs, ecosystem
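
As a rough illustration of the composite, here is a sketch in Python using the weights from the table above; the category keys and function name are illustrative, not a published API.

```python
# Category weights from the table above, expressed as fractions of 1.0.
WEIGHTS = {
    "quality": 0.40,
    "latency": 0.15,
    "cost": 0.20,
    "reliability": 0.15,
    "security": 0.05,
    "support_tooling": 0.05,
}


def composite_score(scores: dict[str, float]) -> float:
    """Weighted composite of per-category scores, each on a 0-100 scale."""
    assert set(scores) == set(WEIGHTS), "every category needs a score"
    return sum(WEIGHTS[cat] * score for cat, score in scores.items())


# Example: a provider scoring 80 on quality and 70 everywhere else
# gets 0.40*80 + 0.60*70 = 74.
print(composite_score({
    "quality": 80, "latency": 70, "cost": 70,
    "reliability": 70, "security": 70, "support_tooling": 70,
}))
```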

Price normalisation

We convert all pricing to USD and compute effective cost per task using the same prompts and truncation rules.
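
A sketch of that normalisation, assuming per-1K-token prices and a single exchange rate to USD; the parameter names and example figures are illustrative, not actual provider rates.

```python
def effective_cost_usd(input_tokens: int, output_tokens: int,
                       input_price_per_1k: float, output_price_per_1k: float,
                       fx_to_usd: float = 1.0) -> float:
    """Effective cost of one task in USD.

    Prices are per 1K tokens in the provider's billing currency;
    `fx_to_usd` converts that currency to USD (1.0 if already USD).
    """
    cost = (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k
    return cost * fx_to_usd


# Example: 1,200 input + 300 output tokens at $0.50 / $1.50 per 1K tokens
# costs 1.2*0.50 + 0.3*1.50 = $1.05 for the task.
print(round(effective_cost_usd(1200, 300, 0.50, 1.50), 2))
```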

Disclosures