***Note: This is a summary of some of my comments at SAIL, where I was on a panel on AI and clinical trials with Xiao, Lily, and Arnaub. (btw, fantastic conference! If you’re reading this Substack, chances are you would have a good time if you go.)***
One common refrain is the consistent lack of clinical trial evidence in medical AI. Sure, there are notable exceptions (and I’m only highlighting cardiology here b/c I’m a cardiologist), but to a first approximation, the products on the market have not been tested in clinical trials, and industry is generally uninterested in developing rigorous clinical trials of its products. As of the writing of this post, no commercially available AI product has undergone a randomized clinical trial. Unlike in therapeutics (drugs and devices), where the big multi-center trials are often industry sponsored and rigorously designed, in the field of AI and medicine most of the clinical trials I’ve noted are led and funded by academia, and the trials themselves are of academically driven algorithms (not implementations of commercially available products).
Hence the question: why is this the case? Why are there so few AI clinical trials?
My favorite trials
I promise this tangent is relevant, but let’s take a moment and step back from AI. Think about medical evidence in general, and think about your favorite clinical trial. If you’re a clinician, you’ve probably learned about hundreds if not thousands throughout medical school and residency training. As a cardiologist, I’m fortunate to work in a field with a strong tradition of clinical trials and rigorous science. From PROVE-IT TIMI 22 to the CAST trial, there are so many practice-changing randomized clinical trials that challenged traditional dogma and drastically changed (read: improved) clinical care. Even negative trials (like COURAGE) helped shape the procedures we do and how we think about the benefits and risks of modern medicine.
Now, as we think about all these important clinical trials, how many of your favorite trials are of therapeutics (devices and drugs), and how many are of diagnostics? I think this is a helpful paradigm for why there are few AI clinical trials, because to a first approximation, AI technologies today are diagnostics, whether for screening or diagnosis. When I posed this question to the audience at SAIL and asked everyone to raise their hand if their favorite clinical trial was of a diagnostic test, nearly no one in the 100+ audience raised their hand. As I queried further, people struggled to think of even a single clinical trial of a diagnostic test. There are RCTs of diagnostics, and I’m particularly a fan of DANCAVAS and NordICC (see, I do like trials even outside of cardiology!), but these are the exception rather than the rule.
My favorite agency
And here lies the root of my answer. In almost all countries, medical diagnostics and therapeutics are governed by a regulatory body (the FDA for the USA, the EMA for the EU, and the NMPA for China). Smaller countries often rely on the expertise of one or more of these regulatory bodies (the thinking goes, “if it’s good enough for the FDA and America, it’s probably good enough for Palau”), but in almost all countries, the marketing and use of medical devices and therapeutics are regulated. For therapeutics (drugs and devices), the FDA asks for randomized clinical trials before approval, while for diagnostic tests, clinical trials are neither necessary nor requested. Historically, two pivotal clinical trials (phase III trials) were required for FDA approval, but after decades of industry lobbying and complaints that the regulatory approval process was too long, that’s no longer the case, and one RCT is now sufficient. (That leads to other controversies, but it’s a pendulum, and no one is ever happy about its current position.)
The short answer is that diagnostic tests never needed RCTs for regulatory approval, and businesses are thus not incentivized (in fact, potentially disincentivized, as there’s always the risk of a negative trial) to perform RCTs. This goes beyond just AI (where, given the newness of the technology, there’s been an evolving regulatory landscape): diagnostic tests in general have not undergone randomized clinical trials. Magnetic resonance imaging won a Nobel Prize for being innovative and transformative to medicine, but it has not undergone a randomized clinical trial. When it comes to AI, which I think is just as transformative, the FDA currently has relatively loose standards (it does not require blinding, prospective testing, or randomization), and for companies, the natural incentive is to get to market faster and then see where the chips fall.
Even at the FDA, the radiological sciences section is known to have the most experience with evaluating AI products. This is reflected in the sheer number of radiology AI products already FDA cleared, while cardiovascular, dermatology, pathology, GI, and other specialties’ AI products are still less common. Over time, the FDA has gotten more nuanced in how it approaches clearance, but a requirement for RCTs is not on the horizon: an RCT is neither asked for nor sufficient to justify skipping the multi-center retrospective analyses the agency still wants to see. In some ways, when an RCT has to be done anyway (as for FDA approval of therapeutics), the NEJM paper is the cherry on top; but when it’s purely a business decision, an RCT might be too costly or “unnecessary.”
The business downsides of an RCT
With this cost/benefit landscape in mind, there are clear risks for companies in designing and funding a clinical trial. It invites more scrutiny of what might be a brittle product, it introduces the risk of a negative study, and it incurs cost, as there needs to be a clear-eyed evaluation of what size study is required to be powered for hard endpoints. NordICC enrolled 84,585 participants and DANCAVAS randomized 46,611, which speaks to the scale needed to power an RCT of a diagnostic test for hard endpoints. With a diagnostic test, the choice of endpoint is non-trivial, as a hard outcome might be rare or require a large sample size to power appropriately. To have an impact on a hard outcome, not only does there have to be a good diagnostic test, there also needs to be an appropriate downstream therapeutic with good efficacy and minimal risk.
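To make that scale concrete, here’s a back-of-the-envelope power calculation. This is a minimal sketch using the standard two-proportion normal approximation; the 1% control event rate and 20% relative risk reduction are illustrative assumptions of mine, not figures from NordICC or DANCAVAS.

```python
# Sketch: why diagnostic RCTs powered on hard endpoints get so large.
# Event rates below are illustrative assumptions, not trial data.
from scipy.stats import norm

def n_per_arm(p_control: float, p_screen: float,
              alpha: float = 0.05, power: float = 0.90) -> float:
    """Per-arm sample size for a two-proportion comparison (normal approx.)."""
    z_a = norm.ppf(1 - alpha / 2)  # two-sided significance threshold
    z_b = norm.ppf(power)          # desired power
    variance = p_control * (1 - p_control) + p_screen * (1 - p_screen)
    return (z_a + z_b) ** 2 * variance / (p_control - p_screen) ** 2

# Suppose screening cuts a 1% event rate to 0.8% (a 20% relative reduction):
n = n_per_arm(0.01, 0.008)
print(f"~{n:,.0f} per arm, ~{2 * n:,.0f} total")  # roughly 47,000 / 94,000
```

Even under these fairly generous assumptions, you land in the tens of thousands of participants per arm, which is roughly the territory both of those trials operate in.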
The commercial strategy for almost all medical AI SaaS companies has instead been “let’s just try to be first on the market with an OK product, test product-market fit and try to monopolize the market, and hope market power allows us to iterate once we’re on the market.” It’s not unreasonable: even now, payers won’t give a straight answer on whether they would reimburse for AI technologies if there were an RCT (even as they encourage RCTs to kick the can down the road). While a blinded RCT is strong evidence, there’s still much to be said about the fact that most end-users of AI aren’t sure how to evaluate it, and strong model performance isn’t the metric payers use to justify buying.