HealthBench: Measuring Safe, Real-World AI in Healthcare

Author: Mike Breault
Published: Tue 13 May 2025

An in-depth look at HealthBench, the open-source benchmark for safe, effective healthcare AI. We explore how 5,000 multi-turn clinical conversations are graded against 48,562 rubric criteria written by 262 physicians who have practiced across 60 countries, covering accuracy, communication, context awareness, and instruction following. We also review early results (GPT-3.5 Turbo ~16%, GPT-4o ~32%, o3 ~60%, and the small GPT-4.1 nano surprisingly outperforming the larger GPT-4o) and explain why ecological validity matters for real-world medical AI.

Note: This podcast was AI-generated, and sometimes AI can make mistakes. Please double-check any critical information.
