Submitted on 28 Nov 2023
Imputation using training labels and classification via label imputation
Thu Nguyen, Tuan L. Vo, Pål Halvorsen, Michael A. Riegler
Missing data is a common problem in practical data science settings. Various
imputation methods have been developed to deal with missing data. However, even
though the labels are available in the training data in many situations, the
common practice of imputation usually only relies on the input and ignores the
label. We propose Classification Based on MissForest Imputation (CBMI), a
classification strategy that initializes the predicted test label with missing
values and stacks the label with the input for imputation, allowing the label
and the input to be imputed simultaneously. In addition, we propose the
imputation using labels (IUL) algorithm, an imputation strategy that stacks the
label with the input, and we illustrate how it can significantly improve the
imputation quality. Experiments show that CBMI achieves good classification accuracy when
the test set contains missing data, especially for imbalanced data and
categorical data. Moreover, for both regression and classification, IUL
consistently shows significantly better results than imputation based on only
the input data.
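
The stacking idea described above can be illustrated with a minimal sketch. It is
not the authors' implementation: a simple nearest-neighbour donor imputer
(the hypothetical helper nn_impute) stands in for MissForest, labels are assumed
to be integer class codes, and the function names iul_impute and cbmi_predict are
chosen here for illustration only.

```python
import math

def nn_impute(rows):
    """Stand-in imputer (NOT MissForest): copy each missing entry from the
    nearest row that has that column observed. Distance is the mean squared
    difference over the columns observed in both rows."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not math.isnan(value):
                continue
            best, best_d = None, math.inf
            for k, other in enumerate(rows):
                if k == i or math.isnan(other[j]):
                    continue
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if not math.isnan(a) and not math.isnan(b)]
                if not shared:
                    continue
                d = sum(shared) / len(shared)
                if d < best_d:
                    best, best_d = other[j], d
            if best is not None:
                filled[i][j] = best
    return filled

def stack(X, y):
    """Append the label as an extra last column of each input row."""
    return [list(x) + [float(label)] for x, label in zip(X, y)]

def iul_impute(X_train, y_train):
    """IUL-style sketch: impute the inputs with the training label stacked in,
    so the label also informs which donor row is nearest."""
    filled = nn_impute(stack(X_train, y_train))
    return [r[:-1] for r in filled]

def cbmi_predict(X_train, y_train, X_test):
    """CBMI-style sketch: test labels start as missing values and are imputed
    together with the inputs. Assumes every test row finds a labelled donor."""
    stacked = stack(X_train, y_train) + stack(X_test, [math.nan] * len(X_test))
    filled = nn_impute(stacked)
    return [int(round(r[-1])) for r in filled[len(X_train):]]

nan = math.nan
X_train = [[0.0, 0.1], [0.2, nan], [5.0, 5.1], [5.2, 4.9]]
y_train = [0, 0, 1, 1]
X_test = [[0.1, nan], [nan, 5.0]]

X_train_filled = iul_impute(X_train, y_train)
y_pred = cbmi_predict(X_train, y_train, X_test)
```

In this toy example the imputed label column of the test rows serves directly as
the prediction, so no separate classifier is trained. Any imputer that accepts a
matrix with missing entries could replace nn_impute in the same way.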
https://arxiv.org/abs/2311.16877