Chapter 4 Handling Missing Data during Development of Predictive Model
4.4 Imputation Methods
The pcaMethods package [117] is used to impute missing values in this study. It contains a collection of principal component analysis (PCA)-based imputation methods and a local least squares imputation method (LLSimpute).
The package was proposed by Stacklies et al. and was originally developed for microarray and metabolite data sets; its imputation methods have since been used to handle missing values in the medical domain. For example, LLSimpute was used for microarray data sets of breast cancer [118-119], Bayesian PCA (BPCA) was used for metabolite data sets of renal cell carcinoma [120], and probabilistic PCA (PPCA) was used for miRNA expression data of diabetic nephropathy in type 1 diabetes [121]. In this study, the imputation methods included in the pcaMethods package are used to estimate missing values of laboratory tests such as prothrombin time, total bilirubin, creatinine, and platelet count. In total, six imputation methods in this package are applied to multiple measurement data sets with different sampling time periods. In addition, average values of row data (i.e., the mean imputation method) are used to impute missing values. The performance of these imputation methods is compared and evaluated across the different multiple measurement data sets.
4.4.1 PPCA
Principal component analysis (PCA) is a popular approach for data analysis and data processing (e.g., dimension reduction). PCA itself is not based on a probability model; probabilistic principal component analysis (PPCA), proposed by Tipping and Bishop [122], formulates PCA with one. PPCA includes an expectation maximization (EM) approach for PCA with a probabilistic model [117]. In PPCA, a likelihood function is defined, and data far from the training set receive a much lower likelihood. This helps to improve the estimation accuracy.
4.4.2 BPCA
Bayesian PCA (BPCA) includes an EM method for PCA with a Bayesian model [90]. Similar to PPCA, BPCA calculates the likelihood of the estimated value based on an EM approach, combined with a Bayesian estimation method [117]. In BPCA, a likelihood function is defined, and data far from the training set receive a much lower likelihood.
4.4.3 Inverse Non-Linear PCA (NLPCA)
Inverse non-linear PCA (NLPCA) is based on a neural network that performs non-linear PCA, and NLPCA is regarded as a non-linear generalization of standard linear PCA. A non-linear PCA method was proposed by Scholz et al. for estimating
network which contains a component layer, a hidden non-linear layer, and an output layer [117].
4.4.4 Nipals PCA
Non-linear estimation by iterative partial least squares PCA (Nipals PCA) was proposed by Wold et al. [124]. Nipals is an algorithm at the root of partial least squares (PLS) regression that can perform PCA with missing values by simply leaving them out of the appropriate inner products. This method can handle a small amount of missing values [117].
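The idea of leaving missing values out of the inner products can be illustrated with a minimal sketch that extracts only the first principal component. It is not the pcaMethods implementation: the function name `nipals_first_component`, the use of `None` for missing entries, and the assumption that the data are already centered are all choices made for this illustration.

```python
import math

def nipals_first_component(rows, max_iter=500, tol=1e-12):
    """First NIPALS component of a matrix (list of rows) with None as missing.

    Assumes the data are already centered (illustrative assumption).
    Every inner product only sums over the observed entries.
    """
    n_rows, n_cols = len(rows), len(rows[0])
    # Start the scores from the first column, using 0 where it is missing.
    t = [row[0] if row[0] is not None else 0.0 for row in rows]
    p = [0.0] * n_cols
    for _ in range(max_iter):
        # Loadings: regress each column on the scores, skipping missing cells.
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if rows[i][j] is not None]
            den = sum(t[i] ** 2 for i in obs)
            p[j] = sum(rows[i][j] * t[i] for i in obs) / den if den > 0 else 0.0
        norm = math.sqrt(sum(v * v for v in p))
        if norm == 0:
            raise ValueError("cannot normalize an all-zero loading vector")
        p = [v / norm for v in p]
        # Scores: regress each row on the loadings, skipping missing cells.
        t_new = []
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if rows[i][j] is not None]
            den = sum(p[j] ** 2 for j in obs)
            t_new.append(sum(rows[i][j] * p[j] for j in obs) / den if den > 0 else 0.0)
        change = sum((a - b) ** 2 for a, b in zip(t_new, t))
        t = t_new
        if change < tol:
            break
    return t, p

# Hypothetical centered data; the second column is twice the first.
data = [[-2.0, -4.0], [-1.0, -2.0], [1.0, 2.0], [2.0, None]]
scores, loadings = nipals_first_component(data)
# Fill each missing cell with its one-component reconstruction t_i * p_j.
filled = [[v if v is not None else scores[i] * loadings[j]
           for j, v in enumerate(row)] for i, row in enumerate(data)]
```

Because the example data lie on a single direction, one component is enough to reconstruct the masked cell; real data would generally need several components, which corresponds to the number-of-components parameter.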
4.4.5 SVDimpute
Singular value decomposition impute (SVDimpute) was proposed by Troyanskaya et al. for estimating missing data in DNA microarrays [125]. The algorithm estimates missing values as a linear combination of the k most significant eigengenes, where the eigengene with the greatest eigenvalue is the most significant [117, 126]. In SVDimpute, singular value decomposition (SVD) is used to obtain a set of mutually orthogonal expression patterns (e.g., eigengenes in their study). The expression of all features in the data set can then be approximated by a linear combination of these patterns.
4.4.6 LLSimpute
Local least squares imputation (LLSimpute) was proposed by Kim et al. for estimating missing data in DNA microarrays [89]. The algorithm estimates a variable with missing values as a linear combination of k similar variables. The k variables are selected based on Pearson, Spearman, or Kendall correlation coefficients, and the optimal combination is found by local least squares (LLS) regression [117].
4.4.7 Mean
The mean imputation method estimates missing values based on the average of row data. A missing value of a feature is estimated as the average of the observed (non-missing) values of that feature. This method is frequently used in studies that handle data with missing values. It is also used in this study, and the comparison between mean imputation and the six imputation methods of the pcaMethods package is discussed.
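As a minimal sketch of mean imputation, the snippet below replaces each missing entry with the average of the observed values of the same feature. It assumes, for illustration only, that features are stored as columns and that missing entries are encoded as `None`; the function name `mean_impute` and the sample values are hypothetical.

```python
from statistics import mean

def mean_impute(rows):
    """Replace None entries with the mean of the observed values in the same column."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        # Only observed (non-missing) values contribute to the feature mean.
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(observed))
    return [[col_means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

# Hypothetical data: two features, three cases, two missing entries.
data = [[1.0, 10.0], [None, 20.0], [3.0, None]]
imputed = mean_impute(data)
```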
4.4.8 Parameter Settings of Imputation Methods
Parameter settings are necessary for the six imputation methods of the pcaMethods package. For each imputation method, different parameter settings are evaluated. The parameter setting of SVDimpute, PPCA, BPCA, NLPCA, and Nipals PCA indicates the number of principal components. The parameter setting of LLSimpute indicates the number of variables selected for regression.
4.4.9 Evaluation of Imputation Methods
The normalized root mean squared error (NRMSE) is frequently used for evaluating the performance of imputation methods [127]. The root mean squared error (RMSE) is calculated between the estimated values and the original true values. The RMSE is further normalized by the range of the original true values from the missing entries [88, 109].

$$\mathrm{NRMSE} = \frac{\sqrt{\operatorname{mean}\left(\left(y_{\mathrm{estimated}} - y_{\mathrm{true}}\right)^{2}\right)}}{\max\left(y_{\mathrm{true}}\right) - \min\left(y_{\mathrm{true}}\right)}$$

The “max()” function indicates the maximum value of a collection of numbers, and the “min()” function indicates the minimum value of a collection of numbers. $y_{\mathrm{estimated}}$ indicates the values estimated for the missing entries by the different imputation methods in the simulation experiments, and $y_{\mathrm{true}}$ indicates the original true values of those entries, which are randomly masked as missing in the complete data set.
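The NRMSE described above (RMSE over the masked entries, divided by the range of their true values) can be sketched as a small function. The function name `nrmse` and the sample values are hypothetical and only serve as an illustration.

```python
import math

def nrmse(y_estimated, y_true):
    """RMSE between estimates and true values, normalized by the range of the true values."""
    if len(y_estimated) != len(y_true) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((e - t) ** 2 for e, t in zip(y_estimated, y_true)) / len(y_true)
    value_range = max(y_true) - min(y_true)
    if value_range == 0:
        raise ValueError("range of true values is zero; NRMSE is undefined")
    return math.sqrt(mse) / value_range

# Hypothetical masked entries and their imputed estimates.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_estimated = [1.5, 2.0, 2.5, 4.0, 5.0]
score = nrmse(y_estimated, y_true)
```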
For each imputation method and its corresponding parameter setting, the simulation experiment is repeated 10 times on a complete data set (cases with missing values are removed). The distribution of the 10 resulting NRMSEs is analyzed rather than summarized by a single mean value. To evaluate both the accuracy and the stability of the imputation methods, the average of the first quartile, the median, and the third quartile of these 10 NRMSEs is regarded as the imputation method selection criterion. A low NRMSE indicates few imputation errors and high accuracy. A low imputation method selection criterion score indicates high stability and accuracy.
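A minimal sketch of the selection criterion is given below. The quartile convention is not specified in the text, so the inclusive method of Python's `statistics.quantiles` is an assumption here; the function name `selection_criterion` and the 10 NRMSE values are hypothetical.

```python
from statistics import quantiles

def selection_criterion(nrmse_values):
    """Average of the first quartile, the median, and the third quartile."""
    # quantiles(..., n=4) returns the three cut points [Q1, median, Q3].
    q1, median, q3 = quantiles(nrmse_values, n=4, method="inclusive")
    return (q1 + median + q3) / 3

# Hypothetical NRMSEs from 10 repeated simulation experiments.
nrmse_runs = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10]
criterion = selection_criterion(nrmse_runs)
```

A different quartile convention (for example, the exclusive method) would shift the score slightly, so the same convention should be used for every combination being compared.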
In a simulation experiment, the seven imputation methods and their corresponding parameter settings (29 combinations in total) are used to estimate features with missing values. The combinations (imputation method and parameter setting) that achieve the 10 best imputation performances according to the imputation method selection criterion score are recorded as leading combinations. For example, suppose there are two imputation methods, A and B, each with five combinations (i.e., five parameter settings). If A has four leading combinations and B has five leading combinations, it can be concluded that B has better imputation performance than A.
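The counting of leading combinations can be sketched as follows. The function name `count_leading`, the method labels, and all scores are hypothetical; a third method C is added only so that the top 10 has to be chosen from more than 10 combinations.

```python
from collections import Counter

def count_leading(criterion_by_combination, top_n=10):
    """Count, per method, how many of its combinations rank among the top_n lowest scores."""
    ranked = sorted(criterion_by_combination.items(), key=lambda item: item[1])
    leaders = ranked[:top_n]
    # Keys are (method, parameter setting) pairs; only the method is counted.
    return Counter(method for (method, _param), _score in leaders)

# Hypothetical selection criterion scores for three methods with five settings each.
scores = {}
for k, s in enumerate([0.06, 0.07, 0.08, 0.09, 0.20], start=1):
    scores[("A", k)] = s
for k, s in enumerate([0.01, 0.02, 0.03, 0.04, 0.05], start=1):
    scores[("B", k)] = s
for k, s in enumerate([0.10, 0.21, 0.22, 0.23, 0.24], start=1):
    scores[("C", k)] = s

leading = count_leading(scores)
```

With these made-up scores, B contributes five leading combinations and A four, which mirrors the example in the text.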