Feature selection and its impact on orbital imagery classification accuracy
Abstract
Currently, the wide availability of orbital, airborne and UAV-borne multispectral and hyperspectral imagery has fostered an increasingly broad scope of applications. Like the sensor systems themselves, image classification approaches have continuously evolved. To discriminate targets accurately during classification, the methods employed must be able to handle a massive quantity of input data. The literature shows that using images together with attributes derived from them increases classification accuracy, since these attributes help to differentiate the classes of interest more effectively. Handling big data normally implies the use of data mining, which aims to extract, from a huge volume of information, what is actually relevant to the user's goals. Regardless of the application, relevant information is always accompanied by superfluous content. In this context, feature selection is an alternative worth considering, as it filters out the most meaningful attributes and excludes redundant or irrelevant information. In the present work, a WorldView-3 (WV-3) multispectral image with 16 spectral bands (VIS, NIR and SWIR) was employed in the analyses together with the following derived attributes: Principal Component Analysis (PCA), Minimum Noise Fraction (MNF), Normalized Difference Vegetation Index (NDVI) and Soil-Adjusted Vegetation Index (SAVI). Data pre-processing comprised atmospheric correction, with the conversion of DN values to surface reflectance, followed by feature selection. In this stage, data mining was executed in the software WEKA (Waikato Environment for Knowledge Analysis) through the application of three different feature selection algorithms (Wrapper, Correlation-based Feature Subset Selection (CfsSubsetEval) and Relief), which resulted in four data sets: one complete, and the remaining three derived from the feature selection process.
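The two vegetation indices used as derived attributes can be sketched as below; this is a minimal illustration, assuming the red and NIR bands are already available as surface-reflectance arrays, and using the common soil-brightness factor L = 0.5 for SAVI (the work itself does not state its parameter choice):

```python
import numpy as np

def ndvi(nir, red, eps=1e-10):
    """Normalized Difference Vegetation Index: (NIR - red) / (NIR + red)."""
    return (nir - red) / (nir + red + eps)

def savi(nir, red, L=0.5, eps=1e-10):
    """Soil-Adjusted Vegetation Index with soil-brightness factor L."""
    return (1.0 + L) * (nir - red) / (nir + red + L + eps)

# Toy surface-reflectance values for two pixels (illustrative only).
red = np.array([0.10, 0.20])
nir = np.array([0.50, 0.25])
print(ndvi(nir, red))  # higher values indicate denser green vegetation
print(savi(nir, red))
```

In practice these indices would be computed per pixel over the whole WV-3 scene and stacked with the spectral bands and PCA/MNF components as additional classification attributes.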
For classifying these four data sets, Random Forest (RF) and Support Vector Machine (SVM) were used. Such methods allow the inclusion of a great number of explanatory variables (attributes) with no a priori concern as to their relevance for classification. The accuracies of the classifications generated with the complete data set and the three previously filtered data sets were assessed for both classifiers. The results obtained in this work, supported by findings reported in the peer-reviewed literature, confirm that the data sets subjected to feature selection did not outperform the complete data set. This can be ascribed to the fact that current classification methods are able to cope with a large number of features and also handle weak explanatory variables well.
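The RF-versus-SVM comparison described above can be sketched with scikit-learn; the snippet below is an illustrative stand-in only, using synthetic data in place of the actual WV-3 pixel stack (16 bands plus PCA/MNF/NDVI/SAVI attributes) and arbitrary hyperparameters, not the study's real data or tuning:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic 20-attribute data set standing in for the full feature stack;
# some features are informative, others redundant or noisy, mimicking the
# mix of strong and weak explanatory variables discussed in the text.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_redundant=6, n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for name, clf in [("RF", RandomForestClassifier(n_estimators=200, random_state=0)),
                  ("SVM", SVC(kernel="rbf", C=10, gamma="scale"))]:
    clf.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, clf.predict(X_test))
    print(name, results[name])
```

Running the same fit-and-score loop on each of the four data sets (complete and the three feature-selected subsets) gives the accuracy comparison reported in the study.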