Influence of the correlation structure and the proportion of missing data on the predictive performance of models after imputation
The exponential growth of large, complex, multi-source datasets has made the handling of missing data a central challenge in statistical modelling. Although many imputation methods exist, the effect of the underlying correlation structure among features with missing values remains insufficiently studied.
We conducted simulations under realistic conditions to assess the impact of the imputation method used, the proportion and distribution of missing data, and the correlation structure on different performance metrics.
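A minimal sketch of one cell of such a simulation is given below. It is illustrative only and does not reproduce the study's design: the two-feature Gaussian setup, the logistic outcome model, the missing-completely-at-random mechanism, mean imputation, and the names `simulate` and `youden_max` are all assumptions introduced here. The score is the sum of the two features rather than a fitted model, which keeps the example short.

```python
import numpy as np


def youden_max(score, y):
    """Largest value of sensitivity + specificity - 1 over all distinct cutoffs."""
    order = np.argsort(-score)              # sort by decreasing score
    s, y_sorted = score[order], y[order]
    tpr = np.cumsum(y_sorted) / y_sorted.sum()        # sensitivity at each cutoff
    fpr = np.cumsum(~y_sorted) / (~y_sorted).sum()    # 1 - specificity
    # Evaluate only at the last element of each group of tied scores,
    # so that tied (e.g. fully imputed) rows are classified together.
    last_of_group = np.r_[s[1:] != s[:-1], True]
    return float(np.max((tpr - fpr)[last_of_group]))


def simulate(rho, missing_rate, n=5000, seed=0):
    """Return (J on complete data, J after mean imputation) for one scenario."""
    rng = np.random.default_rng(seed)
    # Assumed design: two standard-normal features with correlation rho.
    cov = np.array([[1.0, rho], [rho, 1.0]])
    X = rng.multivariate_normal(np.zeros(2), cov, size=n)
    # Hypothetical outcome model: logistic in the sum of the features.
    p = 1.0 / (1.0 + np.exp(-(X[:, 0] + X[:, 1])))
    y = rng.random(n) < p
    # Assumed mechanism: each cell is deleted independently (MCAR).
    mask = rng.random(X.shape) < missing_rate
    X_miss = np.where(mask, np.nan, X)
    # Mean imputation, chosen only as the simplest possible baseline.
    col_means = np.nanmean(X_miss, axis=0)
    X_imp = np.where(mask, col_means, X)
    return youden_max(X.sum(axis=1), y), youden_max(X_imp.sum(axis=1), y)


if __name__ == "__main__":
    for rho in (-0.5, 0.0, 0.5):
        j_full, j_imp = simulate(rho, missing_rate=0.3)
        print(f"rho={rho:+.1f}  J complete={j_full:.3f}  J imputed={j_imp:.3f}")
```

Varying `rho`, `missing_rate`, and the imputation step over a grid, and repeating over several seeds, would give a rough analogue of the comparison described above.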
We observed that, when missing data were present, diagnostic performance decreased by 20–30% compared to complete-data models. This decrease was less pronounced when the variables were positively correlated and more pronounced when they were negatively correlated, with independence falling in between.
The results highlight the impact of the structure of missing data on various diagnostic metrics and underline the difficulty of defining optimal clinical cutoffs in incomplete datasets.
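For reference, the Youden Index listed in the keywords is conventionally defined as follows, where Se(c) and Sp(c) denote sensitivity and specificity at cutoff c. Whether the study selects its cutoffs by exactly this criterion is an assumption, since the abstract does not say so.

```latex
J(c) = \mathrm{Se}(c) + \mathrm{Sp}(c) - 1, \qquad c^{*} = \arg\max_{c} J(c)
```

Because Se(c) and Sp(c) are estimated from the imputed data, any distortion introduced by imputation carries over to the estimated cutoff c*.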
Keywords: Missing data; Correlations; Youden Index