ChatGPT outperformed doctors in a new large-scale HealthBench test

OpenAI has developed HealthBench, a new benchmark for assessing the medical knowledge of language models. A total of 262 physicians from 60 countries helped develop 5,000 realistic scenarios covering 26 medical topics in 49 languages.

The test covers seven areas of medicine and evaluates AI on five criteria, including communication quality, accuracy and contextual understanding, using about 48,000 physician-written scoring criteria. The latest models, GPT-4.1 and o3, outperformed physician responses in all five evaluation categories.

In September 2024, physicians could still improve on the older models' responses. By April 2025, the new models were producing better answers on their own than the experts could. The o3 model scored 0.60, against the 0.32 that GPT-4o scored just six months earlier, and finished ahead of competitors such as Grok 3 and Gemini 2.5.

The test evaluates only a specific aspect of communication, not actual clinical practice. Even so, GPT-4.1 reduced errors in complex cases, and the smaller GPT-4.1 nano model was 25 times more cost-effective than its predecessors. All test materials are publicly available on GitHub.