Randomized Controlled Trials in Medical AI: A Methodological Critique

Various publications claim that medical AI systems perform as well as, or better than, clinical experts. However, very few controlled trials have been performed, and the quality of existing studies has been called into question. There is growing concern that existing studies overestimate the clinical benefits of AI systems. This has led to calls for more, and higher-quality, randomized controlled trials of medical AI systems. While this is a welcome development, AI RCTs raise novel methodological challenges that have seen little discussion. We discuss some of the challenges arising in the context of AI RCTs and make some suggestions for how to meet them.


Introduction
Recent years have seen increased interest in the application of artificial intelligence (AI) to clinical decision-making. Various high-profile publications claim that medical AI systems perform as well as, or better than, clinical experts, especially when diagnosing disease on the basis of medical images (see Topol 2019 for a review). Prominent examples include applications in dermatology (Liu, Jain et al. 2020), ophthalmology (Gulshan et al. 2016) and oncology (Esteva et al. 2017). Despite these developments, there are growing concerns that these studies overestimate the clinical benefits of AI systems in realistic settings. Most existing studies are retrospective and performed outside of clinical environments (Topol 2020). Moreover, outcomes used to evaluate AI performance tend to be only surrogates for meaningful clinical endpoints (Oren, Gersh and Bhatt 2020). Finally, only a handful of randomized clinical trials (RCTs) have been performed. In one of these few RCTs, Haotian Lin et al. (2019) fail to replicate the apparent AI advantage reported by Erping Long et al. (2017); they find that AI systems are less accurate than senior consultants. In their meta-analysis, Xiaoxuan Liu et al. (2019) found that many AI studies are not transparent about their design and methods, creating barriers to the interpretation and replication of their results. Lastly, most AI studies are framed antagonistically: algorithms are made to compete with clinicians in the hopes of demonstrating their superiority. In all likelihood, medical AI will ultimately supplement, and not replace, human judgement. The envisioned role of AI systems is typically that of a second reader, consulted by a clinician who is uncertain about her initial diagnosis (McKinney et al. 2020). That renders antagonistic studies uninformative about the clinical promise of medical AI (Topol 2020).
Critiques of research in clinical AI usually culminate in a call for more RCTs. The emerging standard is for RCTs with clearly reported designs and methods, in realistic clinical environments, with meaningful clinical endpoints, and in which clinicians are assisted, but not replaced, by AI systems. In support of this emerging standard, existing reporting guidelines for clinical trials have been extended to provide guidance for clinical trials involving AI systems (Liu, Cruz Rivera et al. 2020; Mongan, Moy and Kahn 2020). These extensions focus on improving transparency in reporting trial design and methodology, with a view to easy interpretation, review, and replication. Inspired by multi-phase clinical drug trials, Yoonyoung Park et al. (2020) sketch a methodology of staged trials for clinical AI. These are welcome developments. However, these extensions do not aim to be prescriptive about RCT methodology and refrain from discussions of the unique methodological issues that arise for AI RCTs. In this article, we attempt such a discussion.
RCTs are relatively well studied in the philosophy of medicine. In broad terms, the debate has centred around the privileged status of RCTs within the evidential hierarchy of medical research. RCTs are widely accepted as the methodological "gold standard": they are taken to be the only reliable methods for establishing causal claims about the efficacy of a given intervention. Philosophers have contested this status by questioning the rationale for randomization (Urbach 1985, 1993; Worrall 2002, 2007); by highlighting evidential constraints regarding the transferability of average treatment effects (see Cartwright 2007; Deaton and Cartwright 2018; Worrall 2010); or by arguing that, in order to establish causal claims, RCTs need to be supplemented by mechanistic evidence (Russo and Williamson 2007). For the most part, we take for granted the methodological superiority of the RCT: randomization and control are necessary for arriving at a statistically unbiased estimate of the average treatment effect. However, randomization and control are not sufficient for this purpose. In this article, we address unique threats to validity arising in the context of AI RCTs, many of which arise post-randomization. Although Angus Deaton and Nancy Cartwright (2018) provide a useful catalogue of potential post-randomization biases, these problems have been relatively neglected by philosophers of science. A particular focus of this article is how post-randomization threats to validity arise in the context of AI RCTs.
The remainder of this article is structured as follows: Section 2 reviews the existing critiques of research on medical AI; section 3 reviews the few medical AI RCTs that have already been performed; and, finally, section 4 discusses some of the methodological challenges arising in the context of AI RCTs, with some suggestions for how to meet them.

The Need for Medical AI RCTs
Although computers have been assisting in clinical decisions since the 1970s (see Schaffner 1985;Berner and La Lande 2016;Varghese et al. 2018), advances in deep-learning-based computer vision have set off a new wave of research in AI-assisted diagnosis, prognosis, and treatment. Since Varun Gulshan et al. (2016) developed an algorithm to detect diabetic retinopathy from fundus images, many studies have announced AI systems diagnosing diseases at "expert level" (see Topol 2019). This, in turn, has fuelled the interest in using this latest generation of AI systems in clinical settings.
Most of these studies share a common design (Grote and Berens, forthcoming). First, an AI system (typically a deep neural network) is trained to classify different disease entities on the basis of medical images annotated by medical professionals. To validate the AI system, its accuracy is measured on a benchmark dataset, which has been labelled according to an external standard. If the algorithm performs well on the benchmark task, its performance is compared to clinical experts asked to classify the same images. If it performs similarly to or better than the clinicians, it is considered validated. Although the performance of these systems is impressive, recent meta-analyses by Liu, Faes et al. (2019) and Myura Nagendran et al. (2020) raise concerns that these retrospective in-silico studies overestimate the clinical benefits of AI systems in realistic settings. We can summarize their main critiques as follows:

i. Unfair comparison: The validation task does not reflect the real diagnostic abilities of an expert clinician, as unlike in real clinical settings, she has no access to other diagnostic modalities (for example, patients' testimonies, patients' health records, medical devices).

ii. Irrelevant comparison: Many studies compare surrogate endpoints (diagnostic accuracy) whose bearing on ultimate clinical outcomes is unclear. Identifying more diminutive adenomas may prevent cancer, or it may result in unnecessary surgery or chemotherapy. We discuss this in more detail in the fourth section. Moreover, if the algorithms are supposed to assist clinicians, employing an antagonistic study design in which the two compete is not really informative.

iii. Unreliable comparison: The studies compare the performance of AI with only small groups of clinical experts (with a median of only four).

iv. Unrealistic comparison: The studies are retrospective and do not take place in realistic clinical settings.

v. Few RCTs: Most of the studies are not randomized controlled trials, and virtually none are double-blind.

vi. Bad reporting: Most of the studies do not adhere to reporting standards. For instance, the training data, procedure, or algorithmic details are not made transparent.

vii. Exaggerated claims: Many claims about AI effectiveness in the relevant studies are not backed by their own statistical analyses.

These critiques raise concerns about the reproducibility and external validity of existing studies. If apparently validated AI systems turn out to be unreliable in clinical settings, the consequences are potentially disastrous.
The standard emerging from these critiques is for randomized and controlled trials, with clearly reported designs and methods, in realistic clinical environments, with meaningful clinical endpoints, and in which clinicians are assisted, but not replaced, by AI systems. In support of this emerging standard, existing reporting guidelines for clinical trials (SPIRIT and CONSORT) have been extended to provide guidance for clinical trials involving AI systems (Liu, Cruz Rivera et al. 2020; Mongan, Moy and Kahn 2020). 2 The focus of these guidelines is to improve transparency in reporting trial design and methodology, with the aim of facilitating the interpretation, review, and replication of the studies. To give some examples, researchers are encouraged to specify the model architecture of the algorithm and its training process. In addition, they are encouraged to upload their source code to a public repository. Furthermore, the researchers are required to be explicit about the AI intervention, the clinical setting, and the modality of human-AI interaction. They are also directed to detail cases in which the AI failed or misled clinicians. There is little doubt that the authors of the extensions address many of the critiques of the meta-analyses. However, these extensions explicitly do not aim to be prescriptive about methodology and thus abstain from discussions of the unique methodological issues that arise for AI RCTs. In section 4, we attempt such a discussion. In preparation, we review the few RCTs that have been performed in this area.

Randomized Controlled Trials for Clinical AI: The State of the Art
In a recent commentary on the SPIRIT and CONSORT extensions, Eric J. Topol (2020) reports only seven completed RCTs involving AI-assisted treatment in prospective studies in real clinical environments. All but two studies (Wijnberge et al. 2020 and Lin et al. 2019) concern AI-assisted endoscopy. All but one (Wijnberge et al. 2020) were performed in four hospitals in China. Most studies had large numbers of patients but small numbers of clinicians. One study enrolled four clinicians; three studies enrolled six, and one enrolled eight. In the case of Marije Wijnberge et al. (2020), we were not able to determine how many clinicians were involved. It was often unclear whether each clinician treated patients from only a single arm of the study or whether a single clinician might treat patients from both arms. It was often unclear how the clinicians were assigned to experimental or control groups, whether the groups were roughly equal in experience, or whether their caseload (outside of the study) was roughly similar. Since the number of clinicians tends to be very small, reviewers should be acutely aware of small-sample-size effects. All of these studies could improve their reporting in this respect.
Every study, except Lin et al. (2019), had a collaborative design: clinicians in the intervention group were assisted, not replaced, by AI. Although they chose an antagonistic design, the Lin et al. study is interesting because it fails to replicate an AI advantage found by Long et al. (2017) in a retrospective, antagonistic study. In fact, Lin et al. found that the AI system was even less accurate than senior consultants in diagnosing childhood cataracts.
The six collaborative studies can be divided into three that perform technical interventions and three that perform cognitive interventions. 3 In technical studies, the AI system performs a role that would otherwise be performed by a piece of equipment. Dexin Gong et al. (2020), Lianlian Wu et al. (2020) and Wijnberge et al. (2020) perform technical interventions. Both Gong et al. and Wu et al. develop systems to standardize colonoscopy quality by making the clinician slow down and avoid blind spots. Wijnberge et al. develop an early warning system for hypotension during surgery. These are technical improvements that might otherwise have been achieved by a sophisticated stopwatch or blood pressure monitor. In cognitive interventions, the AI system performs a role normally performed by an expert clinician. The AI system provides clinicians with diagnostic decisions or predictions, which then need to be weighed by the clinicians. Pu Wang et al. (2019, 2020) and Jing-Ran Su et al. (2020) perform such cognitive interventions. Their algorithms identify polyps and adenomas on the basis of endoscope images, a task normally performed by expert clinicians. These studies are the focus of our discussion because they pose the full range of methodological challenges that arise for AI RCTs. For example, Wang et al. (2020) worry that the physicians receiving algorithmic assistance might develop a "competitive spirit" (344) with respect to the algorithmic system and thereby distort the effects of the intervention. Although it is possible that these kinds of effects arise for technical interventions, it is much less plausible that physicians would develop competitive attitudes towards a stopwatch, blood pressure monitor, or other relatively un-opinionated device.
Only Wang et al. (2020) implemented a double-blind design. In this study, they attempt to replicate the results of their earlier (2019) single-blind study. Prima facie, it is difficult to blind clinicians to the presence of AI assistance: how could the clinician fail to notice if she received algorithmic support? To address this, Wang et al. implement a blinding strategy involving the use of a sham AI. Because of its ambitious design, Wang et al. represents the cutting edge of AI RCTs. It also poses the most interesting methodological problems. For this reason, it is a particular focus of our discussion below.

Revisiting the Methodology of Medical AI RCTs
Although the methodological superiority of RCTs is widely accepted, philosophers of science have questioned whether RCTs deserve the esteem in which they are held by methodologists. 4 Indeed, the methodological advantages of RCTs are rarely explained clearly and are subject to many misconceptions (see Senn 2013; Fuller 2019). We take for granted that the goal of clinical trials is to arrive at statistically unbiased estimates of the average effect of the treatment under study. 5 Researchers might systematically overestimate the benefit of treatment if patients assigned to treatment are more likely to be young, healthy, or rich than those assigned to control. Random assignment facilitates unbiased estimation of average treatment effects by rendering baseline prognostic factors (such as age, race, disease severity, and so on) statistically independent of assignment to treatment. Although the details differ, all prominent formal frameworks for reasoning about probabilistic causal inference agree that statistical independence of baseline prognostic factors from assignment to treatment is necessary for unbiased inference and that random assignment is, at least sometimes, sufficient for ensuring that the independence holds. 6 Moreover, the usual rationale for randomization transfers fairly straightforwardly to AI RCTs: randomization ensures that the population of patients receiving AI-assisted care is not more likely to have baseline factors more favourable (or unfavourable) for the outcome than patients receiving regular care.
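The point can be made vivid with a toy simulation (a sketch for illustration only: the prognostic factor, effect sizes, and assignment probabilities below are invented). When assignment to treatment depends on a prognostic factor such as age, the naive difference in group means is biased; under random assignment, it recovers the true average treatment effect.

```python
import random

random.seed(0)

def simulate(randomized, n=100_000):
    """Estimate a treatment effect by the difference in group means.

    Each simulated patient is young or old. The true treatment effect
    is +2.0 regardless of age; being young independently adds +3.0 to
    the outcome. (All numbers are invented for illustration.)
    """
    treat_outcomes, control_outcomes = [], []
    for _ in range(n):
        young = random.random() < 0.5
        if randomized:
            treated = random.random() < 0.5  # coin-flip assignment
        else:
            # confounded: young patients are far more likely to be treated
            treated = random.random() < (0.8 if young else 0.2)
        outcome = (3.0 if young else 0.0) + (2.0 if treated else 0.0) \
                  + random.gauss(0, 1)
        (treat_outcomes if treated else control_outcomes).append(outcome)
    mean = lambda xs: sum(xs) / len(xs)
    return mean(treat_outcomes) - mean(control_outcomes)

print(f"confounded assignment: {simulate(False):.2f}")  # overestimates
print(f"random assignment:     {simulate(True):.2f}")   # near the true +2.0
```

In the confounded condition the estimate absorbs part of the age effect; randomization breaks the dependence between age and assignment, so the estimate is unbiased.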
Statistical independence between baseline prognostic factors and assignment is necessary for unbiased inference, but it is not sufficient. A variety of biases can arise post-randomization. If, for example, random assignment to placebo makes it more likely that the attending clinician will prescribe some concomitant therapy as "compensation", then the trial may underestimate the direct benefit of treatment. These are the kinds of problems that blinding is meant to solve. Deaton and Cartwright (2018) provide a useful catalogue of common sources of post-randomization bias. Particularly relevant in our context are the John Henry and physician effects. The former refers to the tendency of the control group to develop a competitive attitude towards the experimental group and thereby invalidate its status as a control. "John Henry" refers to the folkloric railroad worker who, upon learning that his performance would be compared to the steam drill, worked so hard to outperform the machine that he died. Although the consequences would not be as dire, it is easy to imagine physicians similarly motivated by a spirit of competition with diagnostic algorithms. The physician effect refers to the idiosyncratic effect of the individual physician on patient outcome, beyond that of the treatment under study. Neither of these effects is straightforwardly mitigated even by successful blinding. In the best case, blinding will temper a John Henry effect into a Hawthorne effect: subjects will modify their behaviour as a result of an awareness of being observed and compared, but the modification will not be preponderant in one or the other group. Although successful blinding removes the threat the John Henry effect poses to internal validity, a threat to external validity remains: it is difficult to predict the outcome of the treatment once the subjects are no longer aware of being observed.
Moreover, blinding alone does not mitigate the physician effect: if, for example, experienced physicians are more likely to treat patients in the experimental group, we would expect the effectiveness of the treatment to be overestimated. We turn now to the ways in which physician effects arise in AI RCT trials.
The results of an AI RCT would be misleading if clinicians receiving AI assistance tended to be more experienced, more specialized, or less burdened than those in the control group. We would expect patients treated by more experienced physicians to have better outcomes, regardless of AI assistance. Moreover, experience may interact with the treatment in unexpected ways: more experienced clinicians may react very differently to AI assistance than their less experienced colleagues. For example, Philipp Tschandl et al. (2020) studied dermatologists interacting with an image-based AI for diagnosing skin cancer. They found that less experienced clinicians tended to accept AI-based support that contradicted their initial diagnosis even if they were very confident. More experienced clinicians, by contrast, tended to change their diagnoses to agree with the AI only when they were not confident. Although less experienced clinicians benefited significantly from AI assistance, experienced clinicians benefited only marginally. However, in cases where they were antecedently confident about their diagnosis, experienced clinicians performed worse with AI support: in the rare cases when they changed their diagnoses to agree with the AI, they tended to be led astray. Assuming the results of Tschandl et al. are representative, AI trials would not be probative about the usefulness of AI assistance if senior clinicians were over-represented in the intervention group: treatment decisions would hardly be changed. Conversely, if junior clinicians were over-represented in the intervention group, the trial would overestimate the benefits of AI, since adverse effects on experienced clinicians would rarely be observed.
To adequately control for these potential physician effects, it is important for any study to be clear about the distribution of experience and expertise among clinicians in the experimental and control groups, especially since the number of clinicians involved tends to be very small. Random assignment can mitigate physician effects, but a lot depends on the details of randomization. For example, researchers could randomize each physician to either always or never receive AI assistance. The trouble with this scheme is that, in the usual case when only a few physicians are participating in the trial, small-sample-size effects may predominate: if only three senior and three junior clinicians are participating in the trial, it is not unlikely that all the senior physicians are assigned to always receive AI support. Moreover, in such a scheme, it is not possible to compare an individual physician's performance with and without algorithmic assistance; only the average difference between patient groups can be measured. A more promising scheme first assigns patients to physicians and then randomly assigns each unique physician-patient pair to treatment or control. This scheme ensures independence between clinician experience and treatment assignment and also enables analysts to compare individual physicians to themselves with and without AI assistance. In this way, each physician serves as her own control. Although this is relatively standard practice in drug trials, it is unclear whether it is followed in existing AI RCTs. Only Wang et al. (2019) include comparative statistics on clinician expertise, and this is dropped in Wang et al. (2020). Investigators should be encouraged to be clear on this matter.
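How easily clinician-level randomization goes wrong can be checked by brute-force enumeration (a back-of-the-envelope sketch using the hypothetical three-senior, three-junior trial from the text; the clinician labels are invented):

```python
from itertools import combinations

# Hypothetical trial: three senior and three junior clinicians.
clinicians = ["sen1", "sen2", "sen3", "jun1", "jun2", "jun3"]
seniors = {"sen1", "sen2", "sen3"}

# Enumerate every balanced 3-vs-3 split into AI and control arms.
splits = list(combinations(clinicians, 3))  # the AI arm; the rest is control
all_seniors_together = sum(
    1 for arm in splits
    if seniors.issubset(arm) or seniors.isdisjoint(arm)
)
print(f"{all_seniors_together}/{len(splits)} splits put all seniors in one arm")
# -> 2/20 splits, i.e. a 10% chance

# With independent coin flips (no balance constraint), it is worse still:
p = 2 * (0.5 ** 3)  # all three seniors in the AI arm, or all in control
print(f"coin-flip assignment: {p:.0%}")  # -> 25%
```

Even under the balanced scheme, one trial in ten would confound experience with assignment entirely; the physician-patient-pair scheme described above avoids this by balancing assignment within each clinician's caseload.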
Although physician effects can be mitigated by appropriate randomization, John Henry and related effects must be dealt with in some other way. Nagendran et al. (2020) and Yuichi Mori, Shin-ei Kudo and Masashi Misawa (2020) call for increasing use of double-blind designs in AI RCTs. Double-blind AI RCTs are indeed rare: so far, only Wang et al. (2020) have performed such a study. The usual methodological justification for blinding the clinician is to ensure that preconceived ideas of the investigator are not important to patient outcomes (Friedman, Furberg and DeMets 2010). Of course, blinding does not remove the influence of preconceived ideas, but it does ensure that their effects are not preponderant in any single group. For example, if clinicians are hostile to AI assistance, they may unconsciously sabotage it. If they are uncritical boosters of AI, they may put more effort in when they receive AI assistance than they do without it. To ensure that these effects are not concentrated in the experimental or control group, the clinicians could be blinded to the use of AI assistance. This is prima facie difficult: how could you not know if you were receiving AI assistance? One way this could be achieved is with a Turing-style design. For example, in AI-assisted colonoscopy, the AI displays alerts for adenoma structures that appear in the visual field. In a Turing-style design, a human clinician sitting in a separate room could generate the alerts instead. In such a design, the operating endoscopist would not know whether a human or a machine were generating the alert. Of course, the results of such a trial would only bear on how AI assistance compares with human assistance and not on whether it is better than no assistance at all. For this reason, it may be desirable to run a three-way trial in which a third group of patients is randomized to receive care without additional assistance.

Wang et al. (2020) motivate their double-blind design by appeal to a kind of John Henry effect:

One major limitation of the existing non-blinded studies was the introduction of operational bias, because operating endoscopists using the CADe system might be more vigilant because of a competitive spirit or relax and rely on the CADe system. In both cases, the effectiveness of the CADe system might be overestimated or underestimated. (344)

Wang et al. should be commended for their attention to these potential biases. However, their attempts to account for them are, in our opinion, counterproductive. To mask the endoscopists, Wang et al. developed a "sham system" that "simulated alert boxes on polyp-like non-polyp structures (e.g., bubbles, faeces, undigested debris, and wrinkled mucosa) without tracking actual polyps during the colonoscopy" (2020, 345). Then, the output of both the CADe and the sham system was shown on a second monitor, which was visible only to an observing senior endoscopist and not to the operating endoscopist. In both groups, "the observer was responsible for reporting the location of any visible alert box for the endoscopist with a laser pointer on the primary screen." As the authors note, the very fact of being observed might have improved inspection technique and "motivated the competitive spirit" of the operating endoscopists; that may go some way towards equalizing novelty effects across the two groups. However, endoscopists receiving AI assistance were compared to endoscopists distracted with irrelevant and deliberately misleading laser alerts. The tendency of this design is to exaggerate the helpfulness of AI assistance by comparing it, not with the absence of AI assistance, but with the presence of algorithmic sabotage. This sort of design is convoluted and counterproductive, and it also raises ethical concerns, as it imposes unnecessary risks on patients.
If a Turing-style design is impractical, endoscopists receiving AI assistance could have been compared with endoscopists stimulated by the supervision of a senior colleague. Of course, that would mean that AI-assisted treatment would have to pass a rather severe test: it would have to be an improvement over two clinicians working in tandem. In yet another approach, the AI would assist with every patient, but researchers would randomize how "helpful" it is going to be, for example, by reducing or increasing the number of interventions. In this way, all clinicians could be made to anticipate AI assistance. Then, endoscopy sessions receiving significant assistance could be compared to those receiving minimal assistance; all clinicians would then be motivated by (perceived) algorithmic competition. The benefits of the Wang et al. design could be had without distracting and misleading practising clinicians with flashing lights.
So far, we have been concerned with potential threats to internal validity. But even studies that provide an excellent, unbiased estimate of the average treatment effect in the trial population may fail to generalize outside of the context of the trial. In what follows, we consider which features of AI RCTs threaten their external validity. In the context of a trial, clinicians are interacting with an untested AI system. They may regard it critically or be moved to greater concentration by a spirit of competition. This may give a temporary and artificial advantage to clinicians in the experimental group. In a short trial, researchers will oversample the period in which clinicians are still adjusting to the new system, before they are able to use it effectively. They will not yet understand the AI system's basic capabilities and limitations, or its medical point of view: how severely it grades disease, or how to interpret its probabilistic output (Cai et al. 2019). Though present during the study, these novelty effects would subside outside of it. Were the AI to be widely adopted, clinicians would be interacting with a "proven" system. They would have become acclimated to the AI. In the long run, they may learn to ignore it, or they may be coaxed into an over-reliance on its assistance (Park et al. 2020). A straightforward approach to dealing with novelty effects is to wait until they wash out: if AI trials run longer than a few weeks, clinicians will have time to acclimate themselves to the new system. Then, comparisons of the experimental and control outcomes could investigate how differences in outcomes evolve over time. This would give the most realistic picture of how the effect of AI assistance would evolve in a clinical setting.
It is important to note that widespread adoption of AI systems may pre-empt the development of certain kinds of expertise. If junior clinicians are over-reliant on AI systems, they may never develop the mastery of their more senior colleagues. Junior clinicians may be inadvertently trained to uncritically imitate AI systems, instead of critically collaborating with them. In this way, AI systems may eventually be used by clinicians who are less confident, experienced, or critical than those on whom they were originally tested. These considerations argue in favour of long-run clinical trials that investigate both how AI assistance interacts with clinician experience and how these effects evolve over time. Over-reliance on AI assistance might be mitigated by algorithmic explanations. Tschandl et al. (2020) argue that explanations provided by the AI system (by way of a heatmap, which highlights regions of interest) play an important pedagogical role as clinicians progress from novice to expert (on explainable AI in medicine, see Erasmus, Brunet and Fisher 2020; Sullivan, forthcoming). Through the explanations, the clinicians learn to direct their attention to meaningful signs and symptoms. Of course, this assumes that the AI's judgements, as well as its "reasons", are themselves reliable. In any case, it is better that clinicians understand the reasoning of the AI system, so that they can adjudicate any disputes with standard clinical reasoning. So far, medical AI RCTs have not investigated to what extent explainability improves the diagnostic support of the AI system. We believe that it is important to close this research gap, to get better evidence on when and what sort of explanations are required to improve the interplay of AI systems and clinicians.
Additionally, researchers should take care that the introduction of AI assistance does not induce a survival bias. For example, Google Health deployed the Gulshan et al. (2016) algorithm for detecting diabetic retinopathy in retinal photographs in eleven clinics in Thailand (Beede et al. 2020). To ensure accuracy, the system accepted only high-quality images. Since many images were taken in poor lighting conditions, more than a fifth were rejected. Patients with rejected images were asked to come back another day. Poor internet connections also caused problems with uploading the images. This study was not an RCT, but if it had been, it is easy to imagine how these problems would induce survival bias: patients at well-equipped clinics with stable internet connections would be over-represented in the experimental group. In all likelihood, richer patients would therefore also be over-represented, even if assignments were randomized. This would probably lead to overestimating the effectiveness of AI assistance. Researchers conducting AI trials should be vigilant about these possibilities. If possible, they should include an intention-to-treat analysis as well as a per-protocol analysis, and they should justify their decisions to perform one or the other kind of analysis. The failure of Google Health to successfully implement the AI system also highlights constraints on the transferability of AI systems to different environments. While the relevant systems may work reliably in a state-of-the-art academic hospital, they may be useless for less well-equipped hospitals.
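A toy simulation can make the worry concrete (a sketch only: the rejection rates, clinic mix, and effect sizes below are invented, loosely inspired by the reported one-in-five image rejection rate). If image rejection is correlated with clinic resources, a per-protocol analysis that drops rejected patients compares an artificially favourable AI group against an unselected control group, while an intention-to-treat analysis keeps every randomized patient in the arm they were assigned to.

```python
import random

random.seed(1)

def trial(n=50_000):
    """Compare intention-to-treat and per-protocol estimates when the
    AI arm silently loses patients from poorly equipped clinics.
    All probabilities and effects are invented for illustration."""
    itt_ai, itt_ctrl, pp_ai = [], [], []
    for _ in range(n):
        well_equipped = random.random() < 0.5  # proxy for lighting, internet
        ai_arm = random.random() < 0.5         # randomized assignment
        base = 1.0 if well_equipped else 0.0   # better-resourced clinics, better outcomes
        if ai_arm:
            outcome = base + 0.5               # true AI benefit: +0.5
            itt_ai.append(outcome)
            # images from poorly equipped clinics are often rejected,
            # so those patients drop out of the per-protocol analysis
            if well_equipped or random.random() < 0.4:
                pp_ai.append(outcome)
        else:
            itt_ctrl.append(base)
    mean = lambda xs: sum(xs) / len(xs)
    return mean(itt_ai) - mean(itt_ctrl), mean(pp_ai) - mean(itt_ctrl)

itt, pp = trial()
print(f"intention-to-treat estimate: {itt:.2f}")  # near the true +0.5
print(f"per-protocol estimate:       {pp:.2f}")   # inflated
```

The per-protocol estimate is inflated because the completers in the AI arm are disproportionately drawn from well-equipped clinics; the intention-to-treat estimate preserves the comparability that randomization bought.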
The choice of which clinical endpoint to measure and compare has proven to be intricate. Ideally, what you want to establish in an RCT is that a given treatment had a meaningful effect on patient health (for example, an increase in the survival rate or a reduction in recovery time). The problem with current AI systems is that they are neither a treatment in themselves, nor do they determine treatment on their own. What they do instead is provide a secondary diagnostic opinion to the clinician. The role of the AI system is therefore causally upstream of treatment (see Lalumera and Fanti 2019 for similar concerns regarding medical imaging technologies). The decision of the AI system could be ignored by the clinician, or otherwise be irrelevant to the choice of treatment. This makes it tempting to choose a surrogate endpoint, such as diagnostic accuracy, as in Wang et al. (2020). What speaks in favour of diagnostic accuracy is that getting the diagnosis correct spares patients an odyssey of further diagnostic tests, which can itself be considered a quality-of-life improvement. However, relying on a surrogate endpoint may backfire. A particular worry is that the involvement of AI systems leads to overtreatment (Oren, Gersh and Bhatt 2020). While an AI system may spot tumours more accurately than even expert clinicians, these previously overlooked tumours may be clinically irrelevant. Wang et al. (2019, 2020) find that their system increases the detection of small and diminutive polyps, whose relevance for colorectal cancer prevention is debatable (Vleugels et al. 2017). However, once spotted, further interventions that are harmful to the patient are highly likely to follow, from biopsy to chemotherapy. Hence, merely focussing on surrogate endpoints is insufficient to establish the medical benefit of AI systems.
Theoretically, the problem might be mitigated by using a more refined metric of diagnostic accuracy, one that distinguishes between diseases that will impact patients and those that will not. However, making these sorts of prognoses is beyond what current image-based diagnostic AI systems are capable of.
Although RCTs are necessary for testing the mettle of AI systems, they are not sufficient in and of themselves. A fundamental question any AI trial should be able to answer is how, if at all, AI assistance changed decisions about diagnosis or treatment. In drug trials, the answer is relatively simple: patients in one group are assigned to a drug regimen, while those in the other receive a placebo. Answering this question for AI systems may require building models to predict treatment or diagnostic decisions in both arms of the study and comparing them for salient differences. If AI assistance improves patient outcomes, researchers should ensure that these improvements are stable across time, patient characteristics, and clinicians of different specializations or levels of experience. It is not enough to demonstrate an improvement in patient outcomes: effort should be made to identify the mechanism by which the improvement came about. The latter point is fairly commonplace in the health sciences: Federica Russo and Jon Williamson (2007) argue that probabilistic evidence for causal conclusions in the health sciences must always be buttressed with plausible biomedical mechanisms. However, this point takes on a somewhat different aspect for AI RCTs. In an AI RCT, we are not looking for a biological pathway, but for an institutional and procedural pathway: how did interacting with the algorithmic system change diagnostic and therapeutic practice? Knowledge of biological mechanisms and anatomical microstructures cannot answer this question. Rather, researchers would have to combine statistical and sociological or ethnographic methods to understand the effect an AI intervention has on medical practice.
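One simple form such a stability check could take is a stratified comparison of decision rates between arms. The sketch below uses entirely invented records and parameters; by assumption, junior clinicians in this toy dataset follow the AI's referral suggestions more often than seniors, so the AI-associated change in referral behaviour is not stable across experience levels.

```python
import random
from collections import defaultdict

random.seed(1)

# Hypothetical post-trial check (all parameters invented): is the change in
# referral decisions associated with AI assistance stable across clinician
# experience levels? Here juniors are assumed to follow the AI more often.

records = []
for _ in range(4000):
    experience = random.choice(["junior", "senior"])
    arm = random.choice(["ai", "standard"])
    ai_bump = {"junior": 0.15, "senior": 0.03}[experience] if arm == "ai" else 0.0
    records.append({"arm": arm, "experience": experience,
                    "referred": random.random() < 0.30 + ai_bump})

counts = defaultdict(lambda: [0, 0])  # (arm, experience) -> [n referred, n total]
for r in records:
    key = (r["arm"], r["experience"])
    counts[key][0] += r["referred"]
    counts[key][1] += 1

deltas = {}
for experience in ("junior", "senior"):
    ai_ref, ai_n = counts[("ai", experience)]
    std_ref, std_n = counts[("standard", experience)]
    deltas[experience] = ai_ref / ai_n - std_ref / std_n
    print(f"{experience}: AI-vs-standard referral-rate difference = "
          f"{deltas[experience]:+.3f}")
```

A large gap between the strata, as in this toy example, would suggest that the trial's headline effect depends on who is using the system, and would prompt exactly the kind of institutional and procedural follow-up questions described above.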
Finally, even if the involvement of AI systems improves the health outcomes of patients, it could still be detrimental to the clinician-patient relationship (cf. Bjerring and Busch 2020; Grote and Berens 2020). For instance, if AI assistance speeds up the diagnostic process, this could come at the expense of the care component, even if it theoretically frees up resources for care work. Moreover, AI assistance might interfere with the trust relationship between patient and clinician. If the clinician tends to (over-)rely on diagnostic support, the patient may suspect that it is not the clinician but the AI system that is in command. The bottom line is that while AI systems may contribute to the instrumental aim of medicine (curing disease), they may nevertheless negatively affect the social dynamics of clinical practice. As these aspects are difficult for RCTs to capture, accompanying qualitative studies may be called for.

Conclusion
In this article, we have done three things: we have analysed the rationale for medical AI RCTs, provided an overview of existing medical AI RCTs, and, on this basis, considered different methodological challenges for AI RCTs, pointing out ways to meet them.
The concerns and recommendations we have made above are by no means decisive or exhaustive. Instead, our article is meant to stimulate methodological reasoning about RCTs testing AI assistance. An issue we have not discussed, for instance, concerns the validation of AI systems that are not frozen after the training phase (cf. Topol 2020). While this may not be relevant for current vision-based diagnostic systems, such AI systems could become crucial for personalized monitoring and treatment selection. The threat is that the AI system treats different patient demographics fairly during validation (according to some fairness metric), but then develops a novel bias, resulting in unfair treatment. Building fair AI systems is difficult, precisely because there is no value-neutral way to select the training data, the objective function, the model, the benchmark task, the appropriate notion of fairness, and so on (Biddle, forthcoming; Johnson 2020). That difficulty is compounded by the fact that the algorithms continue to evolve after they are implemented. It may be extremely difficult to detect and overcome such a bias once the system is launched into a clinical environment. The latest protocols are a welcome and important development. But protocols should also be developed that are tailored to the kinds of methodological issues that arise in these novel kinds of trials. Merely adapting the form of an RCT is no substitute for careful methodological reasoning. We hope that our discussion will stimulate the interest and attention of clinical methodologists and philosophers of medicine.
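The post-deployment bias problem can at least be made operational. The following is a minimal monitoring sketch, with all numbers, group labels, and the alert threshold invented for illustration: it compares one simple fairness metric, a demographic-parity gap, measured at validation time against the same gap measured after deployment of a continually updated system.

```python
# Hypothetical monitoring sketch (all numbers and the threshold invented) for a
# continually updated system: compare a simple demographic-parity gap measured
# at validation time with the gap measured after deployment.

def parity_gap(predictions):
    """Absolute difference in positive-prediction rate between groups A and B."""
    def rate(group):
        members = [p for p in predictions if p["group"] == group]
        return sum(p["positive"] for p in members) / len(members)
    return abs(rate("A") - rate("B"))

validation = ([{"group": "A", "positive": True}] * 50
              + [{"group": "A", "positive": False}] * 50
              + [{"group": "B", "positive": True}] * 48
              + [{"group": "B", "positive": False}] * 52)

# after deployment the model has (by assumption) drifted against group B
deployed = ([{"group": "A", "positive": True}] * 50
            + [{"group": "A", "positive": False}] * 50
            + [{"group": "B", "positive": True}] * 30
            + [{"group": "B", "positive": False}] * 70)

ALERT_THRESHOLD = 0.10  # assumed tolerance for the parity gap

print(f"validation parity gap: {parity_gap(validation):.2f}")
print(f"deployed parity gap:   {parity_gap(deployed):.2f}")
if parity_gap(deployed) > ALERT_THRESHOLD:
    print("ALERT: possible fairness drift after deployment")
```

Such a check is of course no substitute for choosing the right fairness notion in the first place; as noted above, that choice is itself value-laden, and demographic parity is only one of several contested candidates.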