January 20, 2018
David Steinberg
Yang et al. consider the application of predictive data mining techniques in Information Systems research. Their focus is on how data errors and misclassification affect subsequent analysis with econometric models. Typically, data mining methods are first used to generate new variables (e.g., text sentiment), which are then added to econometric models as independent regressors. However, because prediction is almost always imperfect, variables generated by the first-stage data mining models inevitably contain measurement error or misclassification. These errors, if ignored, can introduce systematic biases into the second-stage econometric estimates and threaten the validity of statistical inference. The article examines the nature of this bias, both analytically and empirically, and shows that it can be severe even when the data mining models exhibit relatively high performance. The bias becomes increasingly difficult to anticipate as the functional form of the measurement error grows more complex, or as the number of covariates in the econometric model increases. The authors review several methods for error correction and focus on two simulation-based methods, SIMEX and MC-SIMEX, which can be easily parameterized using standard performance metrics from data mining models, such as the error variance or the confusion matrix, and can be applied under a wide range of econometric specifications. Finally, they demonstrate the effectiveness of SIMEX and MC-SIMEX through simulations and by applying the methods to econometric estimations that use variables mined from three real-world datasets related to travel, social networking, and crowdfunding campaign websites.
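To give a feel for how SIMEX works, here is a minimal Python sketch for the simplest case: a single continuous regressor with additive measurement error of known standard deviation. It is not the authors' code. The function names (`ols_slope`, `simex_slope`), the simulated data, the grid of extra-noise levels, and the quadratic extrapolant are all assumptions chosen for illustration. The idea is to add progressively more noise to the mismeasured variable, track how the estimated slope degrades, and extrapolate that trend back to the hypothetical no-error case.

```python
import numpy as np

def ols_slope(x, y):
    # Slope of a simple linear regression of y on x.
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

def simex_slope(w, y, sigma_u, lambdas=(0.5, 1.0, 1.5, 2.0), n_rep=200, seed=0):
    # w: regressor observed with additive error of std. dev. sigma_u (assumed known,
    #    e.g. taken from the first-stage model's error variance on held-out data).
    rng = np.random.default_rng(seed)
    lam_grid = [0.0]
    slopes = [ols_slope(w, y)]          # naive estimate at lambda = 0
    for lam in lambdas:
        sims = []
        for _ in range(n_rep):
            # Simulation step: inflate the error variance to (1 + lam) * sigma_u**2.
            w_sim = w + np.sqrt(lam) * sigma_u * rng.standard_normal(w.shape)
            sims.append(ols_slope(w_sim, y))
        lam_grid.append(lam)
        slopes.append(np.mean(sims))
    # Extrapolation step: fit a quadratic in lambda and evaluate it at lambda = -1,
    # which corresponds to zero measurement error.
    coeffs = np.polyfit(lam_grid, slopes, deg=2)
    return np.polyval(coeffs, -1.0)

# Simulated example (hypothetical data): true slope is 2.0.
rng = np.random.default_rng(42)
n = 5000
x = rng.normal(0.0, 1.0, n)                  # true, unobserved regressor
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)  # outcome
sigma_u = 0.5
w = x + rng.normal(0.0, sigma_u, n)          # error-prone "mined" variable

naive = ols_slope(w, y)
corrected = simex_slope(w, y, sigma_u)
print(f"naive slope: {naive:.3f}, SIMEX slope: {corrected:.3f}")
```

In this setup the naive slope is attenuated toward zero, and the SIMEX estimate should land closer to the true value of 2.0, though the quadratic extrapolation is only approximate and typically leaves some residual bias. MC-SIMEX follows the same simulate-then-extrapolate logic for misclassified categorical variables, using the confusion matrix in place of the error variance.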
Read the paper:
Mind the Gap: Accounting for Measurement Error and Misclassification in Variables Generated via Data Mining. Yang, G. Adomavicius, G. Burtch, and Y. Ren. Information Systems Research, forthcoming