Influence of the correlation structure and the proportion of missing data on the predictive performance of models after imputation

S. Sabroso-Lasa, L. M. Esteban Escaño, J. T. Alcalá Nalvaiz, M. Andueza, N. Malats

The exponential growth of large, complex, multi-source datasets has made handling missing data a challenge in statistical modelling. Although many imputation methods exist, the effect of the underlying correlation structure among features with missing values requires further investigation.

We conducted simulations under realistic conditions to assess the impact of the imputation method used, the proportion and distribution of missing data, and the correlation structure on different performance metrics.
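One cell of such a simulation grid might look like the sketch below. The abstract does not name the imputation methods, missingness mechanisms, or models used, so every specific here is an assumption made for illustration: two standard-normal predictors with correlation `rho`, a logistic outcome, values deleted completely at random (MCAR) from the second predictor, mean imputation, and a fixed linear score in place of a fitted model. The function names `auc` and `simulate_cell` are hypothetical.

```python
import math
import random


def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pairs = sorted(zip(scores, labels))
    ranks = [0.0] * len(pairs)
    i = 0
    while i < len(pairs):
        # Tied scores share the average of the ranks they span.
        j = i
        while j + 1 < len(pairs) and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = len(pairs) - n_pos
    rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def simulate_cell(rho, missing_prop, n=2000, seed=0):
    """Return (auc_complete, auc_imputed) for one assumed scenario."""
    if not 0 <= missing_prop < 1:
        raise ValueError("missing_prop must be in [0, 1)")
    rng = random.Random(seed)
    x1, x2, y = [], [], []
    for _ in range(n):
        a = rng.gauss(0, 1)
        # Second predictor constructed to have correlation rho with the first.
        b = rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        # Assumed outcome model: logistic in a + b.
        p = 1 / (1 + math.exp(-(a + b)))
        x1.append(a)
        x2.append(b)
        y.append(1 if rng.random() < p else 0)

    # MCAR: each x2 value is deleted independently with probability missing_prop.
    observed = [rng.random() >= missing_prop for _ in range(n)]
    obs_vals = [v for v, keep in zip(x2, observed) if keep]
    mean_x2 = sum(obs_vals) / len(obs_vals)
    # Mean imputation: missing x2 values are replaced by the observed mean.
    x2_imputed = [v if keep else mean_x2 for v, keep in zip(x2, observed)]

    score_complete = [a + b for a, b in zip(x1, x2)]
    score_imputed = [a + b for a, b in zip(x1, x2_imputed)]
    return auc(score_complete, y), auc(score_imputed, y)


if __name__ == "__main__":
    # Sweep positive, zero, and negative correlation at two missing proportions.
    for rho in (0.5, 0.0, -0.5):
        for prop in (0.2, 0.4):
            complete, imputed = simulate_cell(rho, prop)
            print(f"rho={rho:+.1f} missing={prop:.0%} "
                  f"AUC complete={complete:.3f} imputed={imputed:.3f}")
```

A full study would replace the fixed score with a fitted model, repeat each cell over many seeds, and vary the imputation method and the missingness distribution as well.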

We observed that, when missing data were present, diagnostic performance decreased by 20–30% compared with complete-data models. The decrease was less pronounced when the variables were positively correlated and more pronounced when they were negatively correlated, with independence falling in between.

The results highlight the impact of the structure of missing data on various diagnostic metrics and underscore the difficulty of defining optimal clinical cutoffs in incomplete datasets.
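Since the keywords mention the Youden Index, cutoff selection can be illustrated with the standard definition J = sensitivity + specificity - 1, taking as the cutoff the threshold that maximises J. The sketch below is an illustration only, not the authors' procedure; the function name `youden_cutoff` and the rule "positive when score >= cutoff" are assumptions.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximising J = sensitivity + specificity - 1.

    A case is classified as positive when its score is >= cutoff.
    Labels are 1 (positive) or 0 (negative).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    best_cutoff, best_j = None, -1.0
    # Every distinct observed score is a candidate threshold.
    for cutoff in sorted(set(scores)):
        true_pos = sum(1 for s, l in zip(scores, labels) if s >= cutoff and l == 1)
        true_neg = sum(1 for s, l in zip(scores, labels) if s < cutoff and l == 0)
        j = true_pos / n_pos + true_neg / n_neg - 1
        # Strict inequality keeps the lowest threshold among ties.
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


if __name__ == "__main__":
    cutoff, j = youden_cutoff([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
    print(cutoff, j)
```

Because imputed values change the score distribution, the threshold that maximises J on an imputed dataset need not coincide with the one obtained from complete data.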

Keywords: Missing data, Correlations, Youden Index

Scheduled

GT SW III: Applications

September 4, 2026 11:10 AM

Aula 23

Other papers in the same session

Muévete con R: estimación MNAT para modelos de superposición tiempo‑temperatura

S. Naya Fernández, J. Tarrío Saavedra, A. Meneses Freire

Beyond Splash to Stats

A. González Romero, Ó. Soto Sánchez, C. Lancho Martín, I. Martín de Diego, J. Garcia-Ochoa, E. López Cano

Rutas logísticas optimizadas con algoritmos genéticos: implementación en R

L. M. Esteban Escaño, S. Álvarez Tena, C. Asensio Chaves
