In a room somewhere in a hospital, a radiologist sits in near-darkness, eyes scanning image after image across two large monitors, making decisions that will change lives. You’ve probably walked past such a room without giving it much thought. The work is deeply human: cautious, deliberate, slow. And somewhere in a humming data center, a machine is performing a remarkably similar task, often faster and sometimes more effectively.
OpenAI’s multimodal AI model, GPT-4V, has been quietly gaining attention in the medical research community. The model achieved 81.6% accuracy on the New England Journal of Medicine’s Image Challenge, a diagnostic quiz designed to challenge experienced physicians.
| Category | Detail |
|---|---|
| Full name | GPT-4 with Vision (GPT-4V) |
| Developed by | OpenAI, San Francisco, California |
| Release year | 2023 (multimodal capability added to GPT-4) |
| Model type | Multimodal Large Language Model (LLM) — processes both text and images |
| Medical accuracy (NEJM Image Challenge) | 81.6% vs. 77.8% for human physicians |
| USMLE performance (Step 3) | 88.9% accuracy on image-based questions |
| Key strength | Simultaneous analysis of medical images and clinical text |
| Key limitation | Flawed image rationales in 35.5% of correct answers; image misunderstanding in 76.3% of wrong answers |
| Error reduction with expert guidance | Average 40% reduction in errors when paired with human physicians |
| Clinical readiness | Not yet approved for standalone clinical use; requires further validation |
| Benchmark used | NEJM Image Challenge, USMLE Steps 1–3, Diagnostic Radiology Qualifying Core Exam |
| Comparable AI systems | GPT-3.5 Turbo, GPT-4 (text-only), Med-PaLM 2 (Google) |
The physicians it was measured against? 77.8%. That gap is big enough to command attention and small enough to give you pause. Most readers probably still picture AI medicine as something ten years away. It isn’t.
It’s not just intelligence that sets GPT-4V apart from previous systems. It’s the ability to look. Earlier large language models, such as GPT-3.5 Turbo and even the text-only GPT-4, could read a patient’s medical history, parse clinical notes, and suggest differential diagnoses.

Show them an X-ray or a photograph of a skin lesion, however, and those models went blind. GPT-4V changed that. By processing written and visual data together, it began handling the kind of multimodal reasoning that medicine almost always demands: a patient arrives with scans as well as symptoms, and now the machine can see both.
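To make the shift concrete, here’s a minimal sketch of what multimodal prompting looks like in practice, using OpenAI’s Python SDK. The model name, image URL, and clinical vignette are illustrative placeholders, and this shows the interface only, not the protocol of any study discussed here.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A single request carries both modalities: clinical text plus an image.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any vision-capable model works
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "58-year-old with a two-week cough and unintended "
                        "weight loss. Describe notable findings on this "
                        "chest X-ray and suggest a differential diagnosis."
                    ),
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/chest-xray.png"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

The point isn’t the plumbing; it’s that text and pixels now arrive in the same prompt, which is exactly the shape of a real clinical question.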
The figures from hospital- and university-affiliated evaluations are hard to ignore. On image-based questions from USMLE Step 3, arguably the most clinically demanding licensing exam in American medicine, GPT-4V scored 88.9%. That isn’t merely a passing score; it’s a striking one. For comparison, the text-only GPT-4 scored 66.7% on the same questions, and GPT-3.5 Turbo managed only 50%.
That is a generational gap. No single announcement marked the shift, but it’s hard to escape the conclusion that somewhere between 2022 and now, a threshold was crossed.
This is also where things get genuinely complicated, and where the enthusiasm should pause. Researchers from several institutions examined not only whether GPT-4V arrived at the correct answer, but how it got there: the reasoning, the image interpretation, the logical chain. What they found was unsettling.
In over a third of the cases where GPT-4V made the right diagnosis (35.5%), its justification was flawed. The model was right for reasons that didn’t hold up. And in over three-quarters of its errors (76.3%), the root cause was a misreading of the image. It was, at times, reading the room correctly while misinterpreting the walls.
This is not a minor technical detail. In medicine, the reasoning matters as much as the result. A doctor who gives a patient the correct diagnosis but explains it with flawed logic is a doctor you would want double-checked. An algorithm should be held to the same standard.
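The evaluations above graded the reasoning as well as the answer, which is worth making concrete. Here’s a toy sketch of that stricter scoring, with invented records purely for illustration:

```python
# Hypothetical graded cases: was the final answer correct, and was the
# image rationale behind it actually sound? (Values invented.)
cases = [
    {"answer_correct": True,  "rationale_sound": True},
    {"answer_correct": True,  "rationale_sound": False},  # right, wrong reasons
    {"answer_correct": False, "rationale_sound": False},
]

n = len(cases)
accuracy = sum(c["answer_correct"] for c in cases) / n
# The stricter bar: a correct answer reached through defensible reasoning.
sound_accuracy = sum(
    c["answer_correct"] and c["rationale_sound"] for c in cases
) / n

print(f"raw accuracy:   {accuracy:.0%}")        # 67%
print(f"sound accuracy: {sound_accuracy:.0%}")  # 33%
```

Headline accuracy and rationale-sound accuracy can diverge sharply, and it’s the second number that tells you whether to trust the first.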
One researcher’s framing struck me: a correct final answer reached through faulty justification is not the same as a correct clinical decision. The difference between those two things is where patients live.
A genuinely strange dynamic is taking shape here, one similar to what happened with the widespread adoption of GPS navigation: people began reliably arriving at their destinations while losing any sense of how the roads connect.
GPT-4V may be doing something similar with diagnosis, generating accurate results through partially opaque processes that practitioners can’t fully inspect. Tesla faced comparable concerns about automation and accountability in its early years. Predictably, the debate got louder before it got settled.
When researchers added expert guidance to GPT-4V’s workflow, essentially having a physician look over the model’s shoulder, the error rate fell by an average of 40%. That number matters because it suggests the model’s ceiling is significantly higher in combination with human expertise than it is operating alone.
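For a rough sense of scale, here’s a back-of-the-envelope calculation that composes the NEJM figure with that 40% reduction. One loud assumption: the source doesn’t say the reduction was measured on this particular benchmark, so treat the result as illustrative arithmetic, not a reported finding.

```python
baseline_accuracy = 0.816                   # GPT-4V alone, NEJM Image Challenge
baseline_error = 1 - baseline_accuracy      # 0.184
guided_error = baseline_error * (1 - 0.40)  # assume the 40% reduction applies here
guided_accuracy = 1 - guided_error

print(f"{guided_accuracy:.1%}")  # ~89.0%, versus 77.8% for unaided physicians
```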
The most likely short-term scenario is not AI replacing the radiologist in the dark room. It’s AI sitting next to them: flagging what it sees, sometimes catching what weary eyes miss, and getting caught in return when its visual reasoning wanders.
Whether that collaboration survives real-world clinical settings, with all their commotion, urgency, and liability concerns, remains to be seen. Researchers are cautious. Hospital administrators are warier still.
And the regulatory agencies that must approve these systems for clinical use move at a pace that makes academic publishing look brisk. Whether the excitement in research papers translates into deployment timelines measured in years or in decades is still an open question.
One thing is certain: something has changed. In medical AI circles, the question of whether a machine can read a scan as well as a doctor is no longer open; it has a tentative answer. The harder questions, about accountability, error patterns, and what happens when the AI is flatly wrong, are only now starting to be taken seriously. The hospital room with the two monitors may look the same from the outside. What’s happening inside it already doesn’t.
