Steps to Avoid the Overuse and Abuse of Machine Learning in Clinical Research

Overuse refers to unnecessary reliance on artificial intelligence (AI) or advanced machine learning (ML) techniques where alternative, reliable, or superior methodologies already exist. In such cases, the use of AI and ML techniques is not necessarily inappropriate or improper, but the justification for the research is unclear or artificial: for example, a new technology may be proposed that does not provide meaningful new answers.

Several clinical studies have used ML techniques to achieve respectable or impressive performance, as indicated by area under the curve (AUC) values between 0.80 and 0.90, or even >0.90 (Box 1). A high AUC value is not necessarily a sign of quality, because the ML model may be overfitted (Fig. 1). When traditional regression techniques are applied and compared with ML algorithms, the more complex ML models often offer only marginal gains in accuracy, presenting a questionable trade-off between model complexity and accuracy.1,2,8,9,10,11,12 Even a very high AUC does not guarantee robustness: with an overall event rate of less than 1%, an AUC of 0.99 is possible even when all negative events are predicted correctly while the few positive events are not.
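The rare-event caveat above can be illustrated with a small sketch (all numbers hypothetical): a model can rank every positive case above every negative one, yielding a perfect AUC, yet still flag no positives at a fixed clinical decision threshold, because AUC measures ranking rather than decisions.

```python
import random

random.seed(0)

# 1,000 negatives (>99% of cases) with low risk scores, and 10 positives
# whose scores rank above every negative but below a 0.5 decision threshold.
neg_scores = [random.uniform(0.0, 0.40) for _ in range(1000)]
pos_scores = [random.uniform(0.41, 0.45) for _ in range(10)]

# AUC computed as the Mann-Whitney probability that a randomly chosen
# positive outranks a randomly chosen negative.
pairs = sum(p > n for p in pos_scores for n in neg_scores)
auc = pairs / (len(pos_scores) * len(neg_scores))

threshold = 0.5
flagged = sum(s >= threshold for s in pos_scores)

print(f"AUC = {auc:.2f}")
print(f"positives flagged at threshold {threshold}: {flagged} of {len(pos_scores)}")
```

Here the AUC is 1.00, yet no positive case would be acted on at the 0.5 threshold, which is why discrimination alone is an incomplete summary of clinical usefulness.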

Figure 1: Fitting the model.

Given a data set with data points (green points) and a true effect (black line), a statistical model aims to estimate the true effect. The red line represents a close estimate, while the blue line represents an overfitted ML model that relies too heavily on outliers. Such a model may appear to perform excellently on this specific data set, but it fails to perform well on a different (external) data set.
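The pattern in the figure can be sketched numerically (synthetic data, illustrative model degrees): data are generated from a simple linear true effect plus noise, then fitted once with a matching simple model and once with a far more flexible one. The flexible model chases the noise, so its fit looks better on the original sample but worse on an independent sample.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n=30):
    """Noisy observations of the true effect y = 2x."""
    x = np.linspace(0, 1, n)
    y = 2.0 * x + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = make_data()
x_test, y_test = make_data()  # an independent "external" sample

results = {}
for degree in (1, 9):  # simple model vs. highly flexible model
    coefs = np.polyfit(x_train, y_train, degree)
    mse_train = float(np.mean((np.polyval(coefs, x_train) - y_train) ** 2))
    mse_test = float(np.mean((np.polyval(coefs, x_test) - y_test) ** 2))
    results[degree] = (mse_train, mse_test)
    print(f"degree {degree}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")
```

The degree-9 fit always achieves a lower error on the data it was fitted to, which is exactly why apparent performance on the development data set can mislead.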

There is an important difference between a statistically significant improvement and a clinically significant improvement in model performance. ML techniques undoubtedly provide powerful methods for prediction problems involving non-linear, complex, or high-dimensional relationships in the data (Table 1). By contrast, many simple medical prediction problems are linear in nature, with features chosen because they are known to be strong predictors, usually on the basis of previous research or mechanistic considerations. In these cases, ML techniques are unlikely to provide a substantive improvement in discrimination2. Unlike in engineering settings, where any improvement in performance may improve the system as a whole, modest improvements in medical prediction accuracy are unlikely to change clinical practice.

Table 1 Definitions of several key terms in machine learning

ML techniques must be evaluated against traditional statistical methodologies before they are disseminated. If the goal of a study is to develop a predictive model, the ML algorithms should be compared with a predefined set of traditional regression techniques in terms of the Brier score (a scoring rule similar to the mean squared error, used to assess the quality of predicted probabilities), discrimination (or AUC), and calibration. The model must then be externally validated. The analytical methods, and the performance measures being compared, should be specified in a prospective study protocol and should go beyond overall performance, discrimination, and calibration to also include measures related to overfitting.
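A minimal sketch of such a pre-specified head-to-head comparison might look as follows (the synthetic data set, the choice of random forest as the ML comparator, and the crude calibration-in-the-large summary are all illustrative assumptions, not a full protocol):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a clinical data set with a binary outcome.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

results = {}
for name, model in models.items():
    p = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    results[name] = {
        "brier": brier_score_loss(y_te, p),     # overall accuracy of risk estimates
        "auc": roc_auc_score(y_te, p),          # discrimination
        "calib_in_large": p.mean() - y_te.mean(),  # mean predicted vs. observed risk
    }
    print(name, results[name])
```

In a real study these metrics would be pre-specified in the protocol and the winning model would still require external validation on an independent cohort.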

Encouragingly, some algorithms can say “I don’t know” when they encounter unfamiliar data13, an important but often underappreciated capability, because the knowledge that a prediction is highly uncertain may itself be clinically useful.
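A toy sketch of this “I don’t know” behavior (the thresholds and the three-way labels are illustrative, not taken from any cited algorithm): predictions whose estimated probability falls inside an uncertainty band are abstained on rather than forced into a class.

```python
def predict_with_abstention(prob, low=0.25, high=0.75):
    """Map an estimated risk to 'positive', 'negative', or 'uncertain'.

    Probabilities inside the (low, high) band are treated as too
    uncertain to act on, so the model abstains instead of guessing.
    """
    if prob >= high:
        return "positive"
    if prob <= low:
        return "negative"
    return "uncertain"

for p in (0.95, 0.10, 0.50):
    print(p, "->", predict_with_abstention(p))
```

Flagging the uncertain band for human review, rather than silently classifying it, is one simple way this property can be surfaced in clinical workflows.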
