Randomized Controlled Trials in Medical AI

A Methodological Critique


  • Konstantin Genin, Research Group: “Epistemology and Ethics of Machine Learning”; Cluster of Excellence: Machine Learning: New Perspectives for Science; University of Tübingen, Germany
  • Thomas Grote, Ethics and Philosophy Lab; Cluster of Excellence: Machine Learning: New Perspectives for Science; University of Tübingen, Germany; International Center for Ethics in the Sciences and Humanities (IZEW); University of Tübingen, Germany




Keywords: Artificial Intelligence, Randomised Controlled Trials, Clinical Methodology, Machine Learning, Medical Diagnosis


Various publications claim that medical AI systems perform as well as, or better than, clinical experts. However, there have been very few controlled trials, and the quality of existing studies has been called into question. There is growing concern that existing studies overestimate the clinical benefits of AI systems. This has led to calls for more, and higher-quality, randomized controlled trials (RCTs) of medical AI systems. While this is a welcome development, AI RCTs raise novel methodological challenges that have seen little discussion. We discuss some of the challenges arising in the context of AI RCTs and make some suggestions for how to meet them.

Author Biography

Konstantin Genin, Research Group: “Epistemology and Ethics of Machine Learning”; Cluster of Excellence: Machine Learning: New Perspectives for Science; University of Tübingen, Germany

Research Group Leader






How to Cite

Genin, K., & Grote, T. (2021). Randomized Controlled Trials in Medical AI: A Methodological Critique. Philosophy of Medicine, 2(1). https://doi.org/10.5195/pom.2021.27




Funding data

  • Deutsche Forschungsgemeinschaft
    Grant numbers (BE5601/4-1; Cluster of Excellence “Machine Learning—New Perspectives for Science”, EXC 2064, project number 390727645)