Influence of the correlation structure and the proportion of missing data on the predictive performance of models after imputation
S. Sabroso-Lasa, L. M. Esteban Escaño, J. T. Alcalá Nalvaiz, M. Andueza, N. Malats
The exponential growth of large, complex, multi-source datasets has made handling missing data a challenge in statistical modelling. Although many imputation methods exist, the effect of the underlying correlation structure among features with missing values requires further investigation.
We conducted simulations under realistic conditions to assess the impact of the imputation method used, the proportion and distribution of missing data, and the correlation structure on different performance metrics.
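As a rough illustration of this kind of simulation design, the following minimal sketch draws two predictors with a chosen correlation, deletes values completely at random, imputes them, and compares discrimination before and after. It is not the authors' protocol: the logistic outcome model, the MCAR mechanism, mean imputation, the fixed score x1 + x2, and all function names (`simulate`, `mean_impute`, `auc`, `compare`) are assumptions made only for this example.

```python
import math
import random

def simulate(n, rho, missing_rate, seed=0):
    """Draw two standard-normal predictors with correlation rho, a binary
    outcome from an assumed logistic model, and MCAR missingness on x2."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x1 = z1
        x2 = rho * z1 + math.sqrt(1 - rho ** 2) * z2  # corr(x1, x2) = rho
        p = 1 / (1 + math.exp(-(x1 + x2)))            # assumed outcome model
        y = 1 if rng.random() < p else 0
        x2_obs = None if rng.random() < missing_rate else x2
        rows.append((x1, x2, x2_obs, y))
    return rows

def mean_impute(values):
    """Replace None by the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def auc(scores, labels):
    """Probability that a random positive outscores a random negative
    (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def compare(rho, missing_rate=0.3, n=1000, seed=0):
    """Return (AUC on complete data, AUC after mean imputation of x2)."""
    rows = simulate(n, rho, missing_rate, seed)
    y = [r[3] for r in rows]
    full = [r[0] + r[1] for r in rows]
    x2_imp = mean_impute([r[2] for r in rows])
    imputed = [r[0] + v for r, v in zip(rows, x2_imp)]
    return auc(full, y), auc(imputed, y)

for rho in (-0.5, 0.0, 0.5):
    auc_full, auc_imp = compare(rho)
    print(f"rho={rho:+.1f}  complete AUC={auc_full:.3f}  imputed AUC={auc_imp:.3f}")
```

A fuller study would fit the model on the imputed data, vary the missingness mechanism and the imputation method, and repeat over many seeds. This sketch only shows how correlation and missing proportion enter as simulation parameters.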
We observed that, when missing data were present, diagnostic performance decreased by 20–30% compared with complete-data models. This decrease was less pronounced under positive correlations and more pronounced when the variables were negatively correlated, with independence falling in between.
The results highlight the impact of the structure of missing data on various diagnostic metrics and underscore the difficulty of defining optimal clinical cutoffs in incomplete datasets.
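For readers unfamiliar with cutoff selection, the Youden Index listed among the keywords is J = sensitivity + specificity − 1, and a common convention takes the cutoff that maximises J as the optimal one. The sketch below is a minimal, assumed implementation with made-up scores; the function name `youden_cutoff` and the data are hypothetical and do not come from the study.

```python
def youden_cutoff(scores, labels):
    """Return (best_cutoff, best_J), where J = sensitivity + specificity - 1
    and a case is classified positive when score >= cutoff.
    On ties, the lowest maximising cutoff is kept."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_cutoff, best_j = None, -1.0
    for c in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= c)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < c)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_cutoff, best_j = c, j
    return best_cutoff, best_j

# Illustrative data only: predicted scores and true classes.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff, j = youden_cutoff(scores, labels)
print(cutoff, j)  # → 0.4 0.75
```

Because J depends on both sensitivity and specificity at a single threshold, any shift in the score distribution introduced by imputation can move the maximising cutoff, which is one way to see why cutoffs are hard to fix in incomplete datasets.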
Keywords: Missing data, Correlations, Youden Index
Scheduled
GT SW III: Aplicaciones
September 5, 2026 4:00 PM
Aula 21