As a radiologist I can say this is an accurate description of the situation. For the time being, AI tools can help perform some simple but tedious tasks, like finding lung nodules or rib fractures, but they are nowhere close to doing the more complex, big-picture diagnosis. I look forward to having more of these tools so I can spend less time on the mundane and focus more of my attention on the interesting, cognitively involved tasks that drew me to radiology in the first place.
Interesting piece! I didn't like this sentence, though: "Human radiologists spend a minority of their time on diagnostics and the majority on other activities, like talking to patients and fellow clinicians."
The figure for "Interpret images" says 36.4%. Time spent talking to colleagues sums to only 17.8%. The remainder is mainly teaching and interventional radiology, neither of which is something all radiologists do.
So, for the "prototypical" radiologist, the kind people talk about replacing with AI, spending a majority of time on images seems realistic. Since nobody thinks current AI/robotics can replace nimble-fingered interventional radiologists, it seems a bit misleading to lump interventional radiology in with the rest. To a lesser extent, I'd say the same for teaching, which is certainly not the current focus of AI replacement either.
Also, since I'm being overly anal, I wouldn't count "Meals" and "Personal" in the denominator of a workday either. Guess what, I spend a third of my day asleep, but that doesn't help you understand my job. Surely, the point wasn't that AI can't eat the radiologist's meals yet.
I agree. Counting "Meals" and "Personal" in the denominator is highly sus, and borderline offensive.
I lived this previously. I think I can add some color commentary.
# Spray-and-Pray Algorithms
After AlexNet, dozens of companies rushed into medical imaging. They grabbed whatever data they could find, trained a model, then pushed it through the FDA’s broken clearance process. Most of these products failed in practice because they were junk.
In mammography, only 2–3 companies actually built clinically useful products.
# Products Actually Have to Be Useful
There were two products in the space: CAD and Triage. CAD is basically an overlay on the screen as you read the case. Rads hated it because it was distracting and because the feature-engineering-based CAD of the '80s-'90s had been demonstrated to be a failure. Users basically ignored "CADs."
Triage is when you prioritize cases (cancers to the top of the stack). This has little to no value: when you have a stack of 50 cases to read today, why do you care about the order? There were some niche use cases, but it was largely pointless. It could actually be detrimental: the algorithm would put easy cancer cases at the top, so the user would spend less time on the rest of the stack (where the harder cases would end up).
**Side note:** did you know that using CAD was a billable extra to insurance? Even though it was proven not to work, it remained reimbursable for years, up until a few years ago.
# Poor Validation Standards
Models collapsed in the real world because the FDA process is designed for drugs/hardware, not adaptive software.
Validation typically = ~300 “golden” cases, labeled by 3 radiologists with majority vote arbitration.
If 3 rads say it's cancer, it's cancer. If they disagree, it's not a good case for the study. This filtering throws out the hard cases (the ones where readers disagree), which are exactly what models need to handle in the real world.
Instead of 500K noisy real-world studies, you validate on a sanitized dataset. Companies learned how to "cheat" by overfitting to these toy datasets.
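To make that selection effect concrete, here's a minimal, purely illustrative Python sketch. Every number and the toy reader/model behavior below is a hypothetical assumption, not data from any real validation; the point is only that scoring a model on a "golden" subset where three simulated readers agree (majority-vote arbitration plus disagreement filtering) inflates the headline number relative to the full case mix.

```python
import random

random.seed(0)

N = 100_000
full_correct = 0
golden_correct = golden_total = 0

for _ in range(N):
    hard = random.random() < 0.3                 # hypothetical: 30% of cases are "hard"
    # Three simulated readers: near-unanimous on easy cases, frequently split on hard ones
    calls = [random.random() < (0.55 if hard else 0.97) for _ in range(3)]
    unanimous = len(set(calls)) == 1
    # Hypothetical model: strong on easy cases, weak on hard ones
    correct = random.random() < (0.60 if hard else 0.95)

    full_correct += correct
    if unanimous:                                # disagreement cases never reach the "golden" set
        golden_total += 1
        golden_correct += correct

print(f"accuracy on full case mix:     {full_correct / N:.1%}")
print(f"accuracy on 'golden' test set: {golden_correct / golden_total:.1%}")
# The golden set is dominated by easy, unanimous cases, so the headline number
# overstates how the model behaves on the hard cases seen in the real world.
```

With these made-up parameters the gap is several percentage points, and it grows the more aggressively the golden set filters out disagreement.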
You can explain this to regulators endlessly, but the bureaucracy only accepts the previously blessed process.
Note: the previous process was defined by CAD, a product that was cleared in the '80s and later shown to fail miserably in clinical use. That validation standard, a demonstrated historical regulatory failure, is still the one you MUST use for any device that looks like a CAD in mammography.
# Politics Over Outcomes
We ran the largest multi-site prospective trial in the space (15 sites). Results:
- ~50% reduction in radiologist workload.
- Increased cancer detection rate.
- 10x lower cost per study.
We structured the product to catch cancers missed in the standard workflow. Clinics still resisted adoption, because admitting to missed cancers looked bad for their reputation. Bureaucratic EU healthcare systems preferred to avoid the embarrassment even though it was entirely internal. There was a lot of nonsense gatekeeping.
I'll leave you with one particularly salient story. I was speaking to the head of a large US hospital IT/Ops organization. We had a 30-minute conversation about how to keep our software's decision out of the EMR/PACS so that they could avoid litigation risk. Not once did we talk about patient impact. Not once... I wish I could say this was rare, but as most people in healthcare will attest, it's not just the norm, it's the majority.
Despite all that, our system caught cancers that would have been missed. Last I checked at least 104 women had their cancers detected by our software and are still walking around.
That’s the real win, even if politics buried the broader impact.
This article is replete with misconceptions, which is not surprising: in my years at major academic hospitals, I have never encountered an AI developer who knew - or was even trying to learn - what a radiologist actually does.
And I have yet to encounter a single developer who has shown even the slightest awareness of what we do.
We - and other physicians - were reduced, in all of this relentless bluster, to widgets that execute simple functions.
That of course is totally wrong and offensive. And perhaps it was beside the point. Perhaps the point was selling something to investors.
They’re breaking healthcare so thoroughly we become more indispensable by the day.
A main reason radiology salaries are sky high is that APPs were allowed to order images (effectively a consult to radiology).
It's gotten to the point that my community hospital has PAs reading images, with such bad quality that the reads are wholesale ignored by most internists.
The fun part is when those RPA reads feed an AI data set.
As a clinical radiologist in community practice, a former UCLA Professor and Tufts Associate Professor, and a Fellow of the American College of Radiology, I find this article makes scores and scores of points accurately. My only quibble is with one: the increase, and perhaps acceleration, in utilization of medical imaging, particularly CT and MRI, was well underway long before digitization. Apart from that detail, what a sound and comprehensive review! I particularly liked the several paragraphs on the practical non-usefulness of AI in mammography, for which CMS nonetheless paid for decades, bureaucratic nonsense I highlighted in my teaching for many years.
Happy that you wrote this! I don't really agree with your exact conclusions, but you provided a lot of good data.
Here's what I see:
1. We still need more and better training data
2. Doctors' unions/professional groups will be the primary constraint on the widespread adoption of AI for radiology, and therefore on the productivity gains. The tech will be ready well before political clearance.
Looking at the radiologist task breakdown, it's obvious to me that most of the tasks your general radiologist performs can and should be automated.
The training data comes from radiologists. Once you automate them away the field will be stuck.
A great example is EKGs. An automated non-AI read was introduced ubiquitously decades ago. It was 'good enough' in that it caught 80% of dangerous easy rhythms, and we'd still think through the complex stuff it's bad at. Right?
What ended up happening is that barely any doctor knows how to read an EKG now, and there is no more training data set. That's why there's so little advance in the 'AI for EKG' world; it's a desert. And we've now reached a point where the 20% of previously easy rhythms the automated read misses are actually being missed, and the number of complex reads in a hospital is near zero.
I am a physician (non-radiology) and I view the advent of AI as a boon to physician salaries and a net harm to everyone else.
The skill atrophy you discuss, combined with the overdiagnosis you also discuss, is exactly why radiology salaries are sky high.
A study found that PAs order TEN TIMES more CT scans in a primary care setting than MDs, despite seeing easier patients. The number is likely orders of magnitude worse for NPs or in the inpatient setting.
Introducing a prescriber with a higher false positive rate (NP, PA, or AI) has exponential effects, not linear ones.
Another example is a published EKG AI tool that proudly announced its ability to screen for heart failure with a specificity of 80%. A specificity of 80% on a test conducted millions of times a day is not something to celebrate, yet here we are. If implemented, you'll see cardiac sonographers become the highest-paid healthcare professionals in the country overnight…
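To put a rough number on why 80% specificity is alarming at screening scale, here's a small back-of-the-envelope Python sketch. The test volume, prevalence, and sensitivity below are hypothetical placeholders; only the 80% specificity comes from the comment above.

```python
# Back-of-the-envelope: what 80% specificity means at screening scale.
# Assumed (hypothetical) numbers: 1,000,000 EKGs/day screened, 2% true
# prevalence of heart failure in that population, 90% sensitivity.
daily_tests = 1_000_000
prevalence = 0.02
sensitivity = 0.90
specificity = 0.80          # the figure quoted in the comment above

diseased = daily_tests * prevalence
healthy = daily_tests - diseased

true_positives = diseased * sensitivity
false_positives = healthy * (1 - specificity)
ppv = true_positives / (true_positives + false_positives)

print(f"true positives per day:  {true_positives:,.0f}")    # ~18,000
print(f"false positives per day: {false_positives:,.0f}")   # ~196,000
print(f"positive predictive value: {ppv:.1%}")              # ~8.4%
# Under these assumptions, false positives outnumber true positives roughly
# 10 to 1, and most positives would be queued for a confirmatory echo,
# which is exactly the sonographer-demand point above.
```

Change the assumed prevalence or volume and the exact numbers move, but at low prevalence an 80% specificity always drowns the signal in false positives.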
And I just bought a boat!
Good comment “Dong”
It's a similar situation in pathology, except we learned from radiology that AI cannot replace pathologists. And there's also the issue that slides have to be scanned first, which kind of takes forever, since the images are massive file sizes. Hoping to write something about this for you kids.
Very well-done.
Good insight 😃. Can I translate part of this article into Spanish, with links to you and a description of your newsletter?
https://mofsafety.substack.com/cp/160552250 and https://meryam.substack.com/p/the-welfare-arbitrage-ai-is-most explain a framework for evaluating what happens to jobs, and what to look for under a productivity-based firm structure in order to evolve (or remain competitive) as software gains more agency. The core insight is that tasks are bundled together to create value, so augmenting the completion of mechanical tasks substitutes for the worker only insofar as there aren't new areas of opportunity for them to expand into. For some vocations of the past, like bank tellers, there weren't additional profitable operations. Language-based and other predictive AI cannot replace bedside manner, which is part of delivering care so that the patient follows medical advice.
I found the example of AI's assistive-ness drifting in real world settings fascinating. Thanks for writing!
IDx is an ophthalmology company that uses fundus imaging of the retina. The fact that it figures so prominently in an article about radiology seems like a case of overfitting, where the author sought any example of AI imaging analysis, regardless of specialty, to make their case.
Here's an article you AI naysayers should find interesting:
----
The Path to Medical Superintelligence
by Dominic King & Harsha Nori
June 30, 2025
The Microsoft AI team shares research that demonstrates how AI can sequentially investigate and solve medicine’s most complex diagnostic challenges—cases that expert physicians struggle to answer.
Benchmarked against real-world case records published each week in the New England Journal of Medicine, we show that the Microsoft AI Diagnostic Orchestrator (MAI-DxO) correctly diagnoses up to 85% of NEJM case proceedings, a rate more than four times higher than a group of experienced physicians. MAI-DxO also gets to the correct diagnosis more cost-effectively than physicians.
----
https://microsoft.ai/new/the-path-to-medical-superintelligence/
If radiologists are all so wonderful, why can't I get consistent readings on the degree of emphysema I have? I used multiple health services (Stanford/Sutter) here in CA. I was a long-term smoker but quit 13 years ago. I get annual low-dose CT lung scans and they sometimes say I have moderate emphysema, sometimes severe. A lung MD said she thinks low moderate.
I do regular 10-12 mile hikes climbing 2k ft in good time, with level/downhill trail jogging. That doesn't sound like something someone with severe emphysema would be able to do. I can hold my breath for over 60 seconds.
I want an AI to read my scans so I can get consistent and, hopefully, more accurate results.
As to radiologists talking to me, as a patient? That has never happened! Sounds like something from TV.
Really liked the title, and I agree with your points since you back them with evidence. I would just give more weight to how general-purpose multimodal frontier models (e.g., GPT-class, Claude, etc.) could perform better than the narrow radiology models you're covering if deployed on the same tasks.
These models' broad pretraining corpora can help them generalize better out of distribution. In practice, that means a model like GPT-5 could yield larger performance gains than a narrow model from each additional hospital dataset. Access to such sensitive medical data is still a bottleneck, and arguably an even harder one for large models controlled by massive corporations. But my point is that the ceiling on technical performance can be quite different. These models can also integrate other data, such as clinical notes, lab results, and patient history... In fact, you could even expect a general-purpose model to use, or fine-tune, a narrow diagnosis model to improve its own analysis, giving the best results overall - though I recognize this is not an on-the-spot medical-expert task, but rather a system-design consideration.
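As a rough illustration of that last system-design point, here's a minimal, purely hypothetical Python sketch. None of these functions correspond to a real API; the narrow classifier and the general model are stand-in placeholders. The point is only the shape of the orchestration: the general model delegates the pixel-level task to a narrow model, then integrates its output with notes and labs.

```python
from dataclasses import dataclass

@dataclass
class CaseContext:
    image_path: str
    clinical_notes: str
    lab_results: dict

def narrow_imaging_model(image_path: str) -> dict:
    """Stand-in for a narrow radiology classifier (hypothetical)."""
    return {"finding": "suspicious nodule", "probability": 0.72}

def general_model(prompt: str) -> str:
    """Stand-in for a general-purpose multimodal model (hypothetical)."""
    return f"Draft assessment based on: {prompt}"

def orchestrate(case: CaseContext) -> str:
    # Step 1: delegate the pixel-level task to the narrow model
    imaging = narrow_imaging_model(case.image_path)
    # Step 2: have the general model integrate the imaging output
    # with the rest of the record (notes, labs, history)
    prompt = (
        f"Imaging finding: {imaging['finding']} (p={imaging['probability']:.2f}). "
        f"Notes: {case.clinical_notes} Labs: {case.lab_results}. "
        "Produce a differential and suggest next steps."
    )
    return general_model(prompt)

case = CaseContext("chest_ct_001.dcm", "58yo ex-smoker, chronic cough.", {"CRP": 4.1})
print(orchestrate(case))
```

Whether that actually beats an end-to-end narrow model is an empirical question, but it shows why the data-access bottlenecks for the two pieces can be quite different.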
I would also add another framing besides medical-expert replacement. Speaking of transformer-based models especially, they will never be 100% accurate. Collecting field data on the accuracy of these models is tremendously important for informing medical experts how much credence they can give them. I also hear from my medical doctor friends that they don't necessarily trust the results reported by the companies licensing these models, and they have every right to be suspicious. I think this should also be supported by a more patient-centric debate: how long are people waiting for appointments at every level (family care to specific specialists), how does diagnostic accuracy evolve over time, how satisfied are patients with their care, how satisfied are medical experts... Time spent on image analysis vs. patient care is a great split in this regard.