Real bills, real receipts.
We spend $5 of our own money on each gateway every cycle and compare what was billed against what was advertised. The receipts are below. Snapshot taken 2026-05-15.
Audits this cycle: 7
Within 2% of claim: 6
Overcharges: 1
Test spend total: $35.20
Audit ledger
Pricing technically matched but is sub-cost for legitimate API access. Flagged: probable use of compromised or pooled credentials.
Worst finding: Sonnet 4.6 · +0.8%
Score impact: -6
Spend: $5.00
All five tracked models matched within 0.9%.
Worst finding: GPT-4o · +0.7%
Score impact: +1
Spend: $5.01
Matched all five tracked models within 0.6%. Cache-read pricing matched.
Worst finding: GPT-4o · +0.6%
Score impact: +3
Spend: $5.06
Pricing matches the (overpriced) page within 1%. Listing flag stands.
Worst finding: Sonnet 4.6 · +0.6%
Score impact: ±0
Spend: $5.07
Sonnet billed at $1.01/M vs the claimed $0.90/M (+12.2%). The vendor cited a "demand surcharge" that is not documented anywhere on the pricing page.
Worst finding: Sonnet 4.6 · +12.2%
Score impact: -8
Spend: $5.00
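The +12.2% figure above follows directly from the billed/claimed gap. A one-line check (variable names are ours, for illustration only):

```python
# Signed gap between billed and claimed per-million-token prices,
# using the figures from the ledger entry above.
billed, claimed = 1.01, 0.90  # $/M
delta = (billed - claimed) / claimed
print(f"{delta:+.1%}")  # prints +12.2%
```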
Sonnet input billed at $1.515/M vs the claimed $1.50/M (+1.0%) — within the tolerance band.
Worst finding: GPT-4o · +1.1%
Score impact: +1
Spend: $5.02
Matched all four advertised models within 0.4%. Cache-read pricing matched exactly.
Worst finding: Sonnet 4.6 · +0.4%
Score impact: +2
Spend: $5.04
Not audited this cycle
These providers were not eligible for the 2026-Q2 audit — usually because they have no public per-token API endpoint, or because they are pure-subscription. Listed for transparency.
How the audit works
- We open a fresh account on each gateway — no contact with the operator.
- We push a deterministic prompt suite ($5 spend cap) across every advertised model.
- We compare the gateway's billing line item to the price published on the pricing page.
- Findings within ±2% are recorded as MATCH; gaps over +2% trigger a public OVERCHARGE tag and a score penalty.
- Operators get a 48-hour preview before the audit goes live so they can correct documented bugs.
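The classification step above reduces to a single threshold test. A minimal sketch, assuming the ±2% band described in the methodology (the function name and structure are ours, not the audit tooling):

```python
TOLERANCE = 0.02  # ±2% band around the advertised price, per the methodology above

def classify_finding(billed_per_m: float, advertised_per_m: float) -> tuple[str, float]:
    """Return (tag, delta), where delta is the signed fractional gap
    between the billed and advertised per-million-token price."""
    delta = (billed_per_m - advertised_per_m) / advertised_per_m
    if delta > TOLERANCE:
        return ("OVERCHARGE", delta)
    return ("MATCH", delta)

# Worked example from the ledger: $1.515/M billed vs $1.50/M claimed.
tag, delta = classify_finding(1.515, 1.50)
print(tag, f"{delta:+.1%}")  # prints: MATCH +1.0%
```

Note that the band is asymmetric in effect: undercharges never trigger a tag, only gaps over +2% do.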