TY - JOUR
T1 - Estimation of a predictor's importance by random forests when there is missing data
T2 - Risk prediction in liver surgery using laboratory data
AU - Hapfelmeier, Alexander
AU - Hothorn, Torsten
AU - Riediger, Carina
AU - Ulm, Kurt
N1 - Publisher Copyright: © by De Gruyter 2014.
PY - 2014/11/1
Y1 - 2014/11/1
N2 - In the last few decades, new developments in liver surgery have led to an expanded applicability and an improved safety. However, liver surgery is still associated with postoperative morbidity and mortality, especially in extended resections. We analyzed a large liver surgery database to investigate whether laboratory parameters like haemoglobin, leucocytes, bilirubin, haematocrit and lactate might be relevant preoperative predictors. It is not uncommon to observe missing values in such data. This also holds for many other data sources and research fields. For analysis, one can make use of imputation methods or approaches that are able to deal with missing values in the predictor variables. A representative of the latter are Random Forests which also provide variable importance measures to assess a variable's relevance for prediction. Applied to the liver surgery data, we observed divergent results for the laboratory parameters, depending on the method used to cope with missing values. We therefore performed an extensive simulation study to investigate the properties of each approach. Findings and recommendations: Complete case analysis should not be used as it distorts the relevance of completely observed variables in an undesirable way. The estimation of a variable's importance by a self-contained measure that can deal with missing values appropriately reflects the decreased relevance of variables with missing values. It can therefore be used to obtain insight into Random Forests which are commonly fit without preprocessing of missing values in the data. By contrast, multiple imputation allows for the assessment of a variable's relevance one would potentially observe in complete-data situations, if imputation performs well. For the laboratory data, lactate and bilirubin seem to be associated with the risk of liver failure and postoperative complications. These relations should be investigated by future studies in more detail. However, it is important to carefully consider the method used for analysis when there are missing values in the predictor variables.
KW - Random Forests
KW - imputation
KW - liver surgery
KW - missing data
KW - variable importance
UR - http://www.scopus.com/inward/record.url?scp=84904300034&partnerID=8YFLogxK
U2 - 10.1515/ijb-2013-0038
DO - 10.1515/ijb-2013-0038
M3 - Article
C2 - 24914728
AN - SCOPUS:84904300034
SN - 1557-4679
VL - 10
SP - 165
EP - 183
JO - International Journal of Biostatistics
JF - International Journal of Biostatistics
IS - 2
ER -