AI Model Benchmarks · 12 min read

LLM Accuracy Benchmark Q1 2026: Which Models Hallucinate the Least?


Marcus Chen

Lead Engineer, Aretify · Feb 20, 2026

Methodology

For Q1 2026, we tested four leading LLMs across 10,000 factual claims spanning five domains: science, history, law, medicine, and current events. Each claim was independently verified against authoritative sources.

Key Findings

Overall Accuracy Rates

Our testing revealed significant variation in factual accuracy across models:

  • Claude 3.5 Opus: 94.2% accuracy (up from 91.8% in Q4 2025)
  • GPT-4o: 93.1% accuracy (up from 92.4%)
  • Gemini Ultra: 91.7% accuracy (new entry)
  • Llama 3 405B: 88.3% accuracy (up from 85.1%)
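As a rough illustration, an accuracy rate like the ones above is just the fraction of verified claims a model got right. The sketch below assumes a hypothetical record format of `(model, domain, is_correct)` tuples; the actual benchmark schema is not published here.

```python
from collections import defaultdict

def accuracy_by_model(results):
    """Compute per-model accuracy from verified claims.

    `results` is a list of (model, domain, is_correct) tuples --
    a hypothetical format, not the benchmark's actual schema.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for model, _domain, is_correct in results:
        total[model] += 1
        correct[model] += int(is_correct)
    return {m: correct[m] / total[m] for m in total}

# Toy example with four hand-labelled claims:
results = [
    ("model-a", "science", True),
    ("model-a", "law", False),
    ("model-b", "science", True),
    ("model-b", "law", True),
]
print(accuracy_by_model(results))  # {'model-a': 0.5, 'model-b': 1.0}
```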

Domain-Specific Performance

Accuracy varies dramatically by domain. All models performed best on well-established scientific facts (96-98% accuracy) and worst on recent events (78-85% accuracy).

The legal domain showed the most interesting divergence: Claude 3.5 achieved 95.1% accuracy on legal questions, while Llama 3 dropped to 82.4%. This suggests significant differences in training data quality for specialized domains.

Types of Hallucinations

We categorized hallucinations into four types:

  1. Fabrication (42%): Entirely invented facts, names, or events
  2. Distortion (28%): Real facts with incorrect details (wrong dates, numbers, attributions)
  3. Conflation (18%): Merging details from multiple real events or entities
  4. Outdated information (12%): Presenting superseded facts as current

The Confidence-Accuracy Gap

Perhaps our most concerning finding: models often express high confidence in hallucinated content. We measured how well stated confidence tracks actual accuracy:

  • When models say they're "certain," they're correct 96% of the time
  • When models say they're "fairly confident," accuracy drops to 87%
  • Critically, 15% of hallucinations are presented with high-confidence language
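A calibration check like this can be sketched by bucketing verified claims by the model's stated confidence label and measuring accuracy within each bucket. The `(confidence_label, is_correct)` pair format below is a hypothetical stand-in for whatever the verification pipeline actually records.

```python
from collections import defaultdict

def calibration_by_confidence(claims):
    """Group verified claims by stated confidence label and report
    the accuracy within each bucket.

    `claims` is a list of (confidence_label, is_correct) pairs --
    a hypothetical record format for illustration.
    """
    buckets = defaultdict(lambda: [0, 0])  # label -> [correct, total]
    for label, is_correct in claims:
        buckets[label][1] += 1
        buckets[label][0] += int(is_correct)
    return {label: c / t for label, (c, t) in buckets.items()}

claims = [
    ("certain", True),
    ("certain", True),
    ("certain", False),
    ("fairly confident", True),
]
print(calibration_by_confidence(claims))
```

A large gap between a bucket's label and its measured accuracy (e.g. "certain" answers being right only 96% of the time) is exactly the confidence-accuracy gap described above.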

Implications for Verification

These benchmarks reinforce the need for external verification layers. Even the best-performing model produces approximately 600 inaccurate claims per 10,000 — and many of those come with confident framing that makes them harder to detect through human review alone.

Recommendations

Based on our findings, we recommend:

  1. Never trust a single model: Cross-reference important claims across multiple LLMs
  2. Domain-specific caution: Apply extra scrutiny to legal, medical, and current events content
  3. Confidence is not calibrated: Don't use model confidence as a proxy for accuracy
  4. Use verification tools: Automated verification catches hallucinations that human reviewers miss
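Recommendation 1 can be sketched as a simple majority vote: query several models about the same claim and accept a verdict only when a clear majority agrees. The callables below are hypothetical stand-ins for real LLM API calls, not any specific provider's interface.

```python
from collections import Counter

def cross_reference(claim, models):
    """Ask several models to label a claim and accept the verdict
    only when a strict majority agrees; otherwise flag the claim
    for human review.

    `models` is a list of callables returning "true", "false", or
    "unsure" -- hypothetical stand-ins for real LLM API calls.
    """
    votes = Counter(model(claim) for model in models)
    verdict, count = votes.most_common(1)[0]
    if count > len(models) / 2 and verdict != "unsure":
        return verdict
    return "needs human review"

# Two of three stand-in "models" agree, so the verdict is accepted:
models = [lambda c: "true", lambda c: "true", lambda c: "false"]
print(cross_reference("The Eiffel Tower is in Paris.", models))  # true
```

A tie, or a majority of "unsure" answers, falls through to human review rather than guessing, which matches the spirit of recommendation 4.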

Next Quarter

For Q2 2026, we're expanding our benchmark to include multimodal accuracy testing and will add new models as they're released.
