A Rose by Any Other Name
The surprising nuances of diagnoses and what they mean for training AI models
One of the most memorable lectures at UCSF medical school is given by Brad Sharpe. A master clinician, Brad provides a framework for medical students on how to think about the diagnostic process and differential diagnoses. From that lecture, one pearl of wisdom has stuck with me:
“When describing a symptom or condition, add as many adjectives and descriptors as possible”.
Chest pain can be a general symptom leading to a broad list of differential diagnoses, with the lungs, heart, aorta, and esophagus as potential culprits, but sharp, pleuritic, non-exertional, left-sided chest pain that improves with leaning forward has a much narrower list of potential diagnoses[1].
This is incredibly important for medical decision-making: a more specific “illness script” means a more focused examination, fewer but more relevant tests, and ultimately a more direct path to the right diagnosis. But the ultimate conclusion is that words matter in how you frame your thoughts.
Words Matter
With the excitement around large language models, there’s increasing recognition that language provides a great lens into actual knowledge, but we must also be cautious in evaluating their performance on clinical tasks. Already on Twitter (or X!), I’m seeing posts where patients with missed or complex diagnoses report that they listed a one-line summary of their symptoms and LLMs were able to output their diagnosis - with the implication that had we only had LLMs, many medical mysteries would be solved. But even before LLMs, this was a common refrain.
Just ten years ago, patients with rare or missed diseases would mention that if they put their one-liner into Google, their diagnosis would come up in the search results. The conclusion is often that their doctors are incompetent, but I think it’s a form of hindsight bias. It’s the deliberate synthesis of the important symptoms into the summary one-liner that leads to the right diagnosis (whether by LLM, Google, or doctor). Selecting the right adjectives and relevant symptoms to raise into the one-liner incantation[2] is not just part of the diagnostic effort - it is the entire diagnostic effort.
Pleurisy and angina both mean chest pain, but the choice of word is often made only after coming up with a unifying diagnosis. Choosing the language is a big part of the information synthesis (with one involving lung problems and the other involving heart issues), and it’s actually much less impressive that the final phrase can be matched to a diagnosis.
Severity Matters
In medical AI research, it’s common practice to train AI models based on billing code diagnoses (ICD-9 and ICD-10 codes) and other large-scale structured data. I often talk about the inconsistency and inaccuracy of medical labels, and this is especially true for billing codes, but this critique is not even about that. Let’s assume that all diagnoses are correct. Even in that setting, where many patients might meet the precise definition of a particular disease, the severity of the disease means very different things with regard to risk, prognosis, and threshold to treat.
This concept is best described by John Mandrola, and you should read his Substack to get the most thorough description of the situation, particularly of the recent NOAH-AFNET 6 clinical trial, but let me summarize the situation here for context. Atrial fibrillation (AF) is an off-and-on abnormal heart rhythm that can occur for seconds, minutes, hours, or days. Regardless of the duration, if AF is detected, you have the diagnosis (and this would result in ICD-9 code 427.31 or ICD-10 code I48.2 in your health records).
In the past, when diagnostic tools such as smartwatches and portable ECG monitors were not as common, it’s likely that only longer-lasting episodes of AF were caught; patients with AF were likely older, sicker, and had a different risk profile than the patients who are now coming in with abnormal Apple Watch readings. With that older stratum of AF patients (with likely longer episodes), there is strong evidence that many should take blood thinners to reduce their risk of stroke. But what about the new patients with brief runs found on a smartwatch?
Hence the NOAH-AFNET 6 clinical trial presented last month at ESC in Amsterdam. Investigators randomized such patients with short AF runs to either taking blood thinners or not. Surprisingly, not only was there no detectable benefit to taking blood thinners, but the trial was ended early because blood thinners were causing harm (significantly more major bleeding episodes without any reduction in strokes).
Getting back to the topic at hand (medical AI research): importantly, all of these patients have a diagnosis of AF. Can you imagine using this label to train either a screening or diagnostic AI algorithm? This trial calls into question whether even screening for short AF runs on smartwatches is worth it (as the historical evidence might not apply to such a different patient population), and it also highlights how much heterogeneity there is within the same diagnosis.
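To make that concrete, here is a minimal sketch (entirely made-up records and illustrative column names, not any real dataset schema) of how billing codes typically become training labels - and how the severity information vanishes in the process:

```python
import pandas as pd

# Hypothetical billing records: patient 1 had a 30-second AF run caught on a
# smartwatch, patient 2 has had AF episodes lasting days. Both carry an I48.x code.
billing = pd.DataFrame({
    "patient_id":    [1,       2,       3],
    "icd10_code":    ["I48.2", "I48.2", "I10"],  # AF, AF, hypertension
    "af_duration_h": [0.01,    72.0,    None],   # not captured by the code itself
})

# Typical label construction: any AF code becomes a positive label.
labels = (
    billing.groupby("patient_id")["icd10_code"]
    .apply(lambda codes: int(codes.str.startswith("I48").any()))
    .rename("has_af")
)
print(labels)
# Patients 1 and 2 both get a "1", even though their risk, prognosis,
# and threshold to treat are very different.
```

The duration column only exists here because I invented it; in a real billing extract, that severity signal usually isn’t there at all.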
Medical diagnoses are imprecise (and billing codes are even more imprecise), with many different types of patients falling into the same big buckets. Even the buckets we use evolve over time, as our understanding of the disease and our ability to identify it change. And as we’ve described over and over in this blog, using imprecise labels for training AI models leads to a variety of subtle challenges.
These Subtleties Will Actually Dictate Real-World Performance
With many AI diagnostic or screening tools, the hidden premise is that the AI model can identify overlooked patients. While it would be natural to assume that the AI models could work by identifying subtle, earlier forms of disease (after all, that is the goal of screening tools and how non-AI screening tools work[3]), in actual deployment, that might not be the case at all.
When using diagnosis codes, or even clinic populations, to train AI models, you might be baking in biases related to how the cohorts were recruited or which patients get access to care. In a specialty clinic, it might be only the patients with symptoms severe enough to bring them to care. With such a model, it’s not clear at all that it would work well at identifying earlier or subtler cases.
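As a toy illustration (a purely synthetic simulation, with the sampling and effect sizes invented for the example), here is how a model trained on a specialty-clinic cohort - where severity drives both the signal and who shows up - can look great in-clinic and then lose ground on the broader population with subtler cases:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 20_000

# Severity drives both the true disease status and how strong the feature signal is.
severity = rng.exponential(scale=1.0, size=n)
disease = (severity > 0.5).astype(int)
feature = severity * disease + rng.normal(0.0, 1.0, size=n)

# Specialty-clinic cohort: the sicker you are, the more likely you are to show up.
in_clinic = rng.random(n) < np.clip(severity / 3.0, 0.0, 1.0)
X_clinic, y_clinic = feature[in_clinic].reshape(-1, 1), disease[in_clinic]

model = LogisticRegression().fit(X_clinic, y_clinic)

# Evaluate on the clinic cohort vs. the full population (which includes subtle cases).
auc_clinic = roc_auc_score(y_clinic, model.predict_proba(X_clinic)[:, 1])
auc_population = roc_auc_score(disease, model.predict_proba(feature.reshape(-1, 1))[:, 1])
print(f"clinic AUC:     {auc_clinic:.2f}")
print(f"population AUC: {auc_population:.2f}")
```

In this toy setup, the drop comes entirely from who made it into the training cohort, not from the model itself.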
And because of this, we very much need AI clinical trials.
[1] In this example, the description I’m using is trying to evoke pericarditis.
[2] Given the state of LLM research, incantation is the right word for “prompt engineering”. It’s more alchemy than science right now.
[3] Screening tools generally find biomarkers in the causal pathway (the prostate generates PSA, and a polyp might become a large colon cancer), but as described in the previous blog, it’s not clear at all that AI screening tools would work the same way.
"Medical diagnoses are imprecise with many different types of patients falling into the same big buckets."
This is so true. In medicine we lump many different diseases in the same bucket/diagnosis based on their symptoms ( eg. Diabetes, Hypertension, Ca Breast ) until we figure out the underlying cause or mechanism of disease ( Type 1 / 2 DM, HR Positive Breast cancer ).