April 9, 2021
Rosa Arboretti, Riccardo Ceccato, Luca Pegoraro and Luigi Salmaso
Variable selection plays a fundamental role in the analysis of data sets that contain many redundant or irrelevant variables. Identifying and discarding these variables can improve predictive performance and data interpretation while reducing costs and computational time.
Although many methods have been proposed for selecting individual features, some fields are more interested in selecting groups of variables, because the data are continuous in nature and adjacent variables covary. This is the case for near-infrared (NIR) spectroscopy, where several interval selection methods have been proposed, mainly based on Partial Least Squares (PLS) regression. In this paper we consider some of these methods and propose an additional solution, named Sequential Lasso-based Interval Selection (SLIS), which is based on a variable clustering procedure (CoV/VSURF), Lasso regression and permutation tests.
We applied all these interval selection methods to four public data sets and compared their performance. To this end, we first fitted two models, Partial Least Squares and Lasso regression, to each data set using all the variables, and recorded the error measures achieved on the test set. We then repeated this step using only the subsets of variables selected by each method.
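The comparison protocol described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the data generator, the hand-written coordinate-descent Lasso, the penalty value `alpha`, the "selected" interval and all function names are assumptions made for the example. The paper's data sets, its PLS model and the interval selection methods themselves are not reproduced here.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_lasso(X, y, alpha, n_iter=100):
    """Lasso by cyclic coordinate descent, no intercept.

    Minimises (1 / (2n)) * ||y - X b||^2 + alpha * ||b||_1.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residuals y - X b, with b = 0 at the start
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the residual that excludes column j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


def rmse(X, y, beta):
    """Root mean squared error of the linear predictions X b."""
    preds = [sum(x * b for x, b in zip(row, beta)) for row in X]
    return (sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)) ** 0.5


def evaluate(X_train, y_train, X_test, y_test, columns, alpha=0.05):
    """Fit on the given columns of the training set, score on the test set."""
    cols = list(columns)
    take = lambda X: [[row[j] for j in cols] for row in X]
    beta = fit_lasso(take(X_train), y_train, alpha)
    return rmse(take(X_test), y_test, beta)


def make_data(n, p, informative, rng):
    """Synthetic data: only the columns in `informative` drive the response."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [sum(row[j] for j in informative) + rng.gauss(0.0, 0.5) for row in X]
    return X, y


rng = random.Random(0)
p = 30
interval = range(10, 15)  # hypothetical interval returned by a selection method

X_train, y_train = make_data(60, p, interval, rng)
X_test, y_test = make_data(40, p, interval, rng)

# Step 1: model on all variables. Step 2: same model on the selected interval.
rmse_full = evaluate(X_train, y_train, X_test, y_test, range(p))
rmse_interval = evaluate(X_train, y_train, X_test, y_test, interval)

print(f"test RMSE, all variables:     {rmse_full:.3f}")
print(f"test RMSE, selected interval: {rmse_interval:.3f}")
```

In a real analysis the synthetic generator would be replaced by a train/test split of the spectra, and the same two-step comparison would be run once per model and once per interval selection method.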
This analysis showed that the number and position of the intervals selected by the different methods do not always match. Moreover, the performance of the interval selection methods varies considerably across the data sets considered, so no single technique can be recommended over the others in every case. The newly proposed procedure, SLIS, performed reasonably well on almost all the data sets.
Overall, a key result emerged: interval selection techniques tend to improve predictive performance, even when PLS regression is used.
Read the paper:
Rosa Arboretti, Riccardo Ceccato, Luca Pegoraro, and Luigi Salmaso. Interval selection: A case‐study‐based approach. Applied Stochastic Models in Business and Industry, 2021; 1–16 (in press).