AI Model Benchmarks · 12 min read

LLM Accuracy Benchmark Q1 2026: Which Models Hallucinate the Least?


Marcus Chen

Lead Engineer, Aretify · Feb 20, 2026

Methodology

For Q1 2026, we tested four leading LLMs across 10,000 factual claims spanning five domains: science, history, law, medicine, and current events. Each claim was independently verified against authoritative sources.

Key Findings

Overall Accuracy Rates

Our testing revealed significant variation in factual accuracy across models:

  • Claude 3.5 Opus: 94.2% accuracy (up from 91.8% in Q4 2025)
  • GPT-4o: 93.1% accuracy (up from 92.4%)
  • Gemini Ultra: 91.7% accuracy (new entry)
  • Llama 3 405B: 88.3% accuracy (up from 85.1%)
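As a rough illustration, an accuracy rate like the ones above is just the fraction of verified claims a model got right. The sketch below assumes a hypothetical record format of `(model, domain, is_correct)` tuples; the actual benchmark schema is not published here.

```python
from collections import defaultdict

def accuracy_by_model(results):
    """Compute per-model accuracy from verified claims.

    `results` is a list of (model, domain, is_correct) tuples --
    a hypothetical format, not the benchmark's actual schema.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for model, _domain, is_correct in results:
        total[model] += 1
        correct[model] += int(is_correct)
    return {m: correct[m] / total[m] for m in total}

# Toy example with four hand-labelled claims:
results = [
    ("model-a", "science", True),
    ("model-a", "law", False),
    ("model-b", "science", True),
    ("model-b", "law", True),
]
print(accuracy_by_model(results))  # {'model-a': 0.5, 'model-b': 1.0}
```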

Domain-Specific Performance

Accuracy varies dramatically by domain. All models performed best on well-established scientific facts (96-98% accuracy) and worst on recent events (78-85% accuracy).

The legal domain showed the most interesting divergence: Claude 3.5 achieved 95.1% accuracy on legal questions, while Llama 3 dropped to 82.4%. This suggests significant differences in training data quality for specialized domains.

Types of Hallucinations

We categorized hallucinations into four types:

  1. Fabrication (42%): Entirely invented facts, names, or events
  2. Distortion (28%): Real facts with incorrect details (wrong dates, numbers, attributions)
  3. Conflation (18%): Merging details from multiple real events or entities
  4. Outdated information (12%): Presenting superseded facts as current

The Confidence-Accuracy Gap

Perhaps our most concerning finding: models often express high confidence in hallucinated content. We measured how well stated confidence tracks actual accuracy:

  • When models say they're "certain," they're correct 96% of the time
  • When models say they're "fairly confident," accuracy drops to 87%
  • Critically, 15% of hallucinations are presented with high-confidence language
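A calibration check like this can be sketched by bucketing verified claims by the model's stated confidence label and measuring accuracy within each bucket. The `(confidence_label, is_correct)` pair format below is a hypothetical stand-in for whatever the verification pipeline actually records.

```python
from collections import defaultdict

def calibration_by_confidence(claims):
    """Group verified claims by stated confidence label and report
    the accuracy within each bucket.

    `claims` is a list of (confidence_label, is_correct) pairs --
    a hypothetical record format for illustration.
    """
    buckets = defaultdict(lambda: [0, 0])  # label -> [correct, total]
    for label, is_correct in claims:
        buckets[label][1] += 1
        buckets[label][0] += int(is_correct)
    return {label: c / t for label, (c, t) in buckets.items()}

claims = [
    ("certain", True),
    ("certain", True),
    ("certain", False),
    ("fairly confident", True),
]
print(calibration_by_confidence(claims))
```

A large gap between a bucket's label and its measured accuracy (e.g. "certain" answers being right only 96% of the time) is exactly the confidence-accuracy gap described above.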

Implications for Verification

These benchmarks reinforce the need for external verification layers. Even the best-performing model produces approximately 600 inaccurate claims per 10,000 — and many of those come with confident framing that makes them harder to detect through human review alone.

Recommendations

Based on our findings, we recommend:

  1. Never trust a single model: Cross-reference important claims across multiple LLMs
  2. Domain-specific caution: Apply extra scrutiny to legal, medical, and current events content
  3. Confidence is not calibrated: Don't use model confidence as a proxy for accuracy
  4. Use verification tools: Automated verification catches hallucinations that human reviewers miss
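Recommendation 1 can be sketched as a simple majority vote: query several models about the same claim and accept a verdict only when a clear majority agrees. The callables below are hypothetical stand-ins for real LLM API calls, not any specific provider's interface.

```python
from collections import Counter

def cross_reference(claim, models):
    """Ask several models to label a claim and accept the verdict
    only when a strict majority agrees; otherwise flag the claim
    for human review.

    `models` is a list of callables returning "true", "false", or
    "unsure" -- hypothetical stand-ins for real LLM API calls.
    """
    votes = Counter(model(claim) for model in models)
    verdict, count = votes.most_common(1)[0]
    if count > len(models) / 2 and verdict != "unsure":
        return verdict
    return "needs human review"

# Two of three stand-in "models" agree, so the verdict is accepted:
models = [lambda c: "true", lambda c: "true", lambda c: "false"]
print(cross_reference("The Eiffel Tower is in Paris.", models))  # true
```

A tie, or a majority of "unsure" answers, falls through to human review rather than guessing, which matches the spirit of recommendation 4.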

Next Quarter

For Q2 2026, we're expanding our benchmark to include multimodal accuracy testing and will add new models as they're released.
