Multi-Feature Fusion is Important - Large-Scale Image Recognition on Mobile Devices

We first evaluate the performance of different features and show that no single feature performs best on all categories. This result implies the importance of multiple features, and we show that multi-feature fusion is more efficient than increasing feature and descriptor dimensions for improving image recognition performance. The overall classification performance of each feature on the two data sets is shown in Fig. 4.4. The result shows that, on average, local features significantly outperform global features. In addition, Hessian affine SIFT performs better with VLAD, while Dense SIFT performs better with LLC and achieves the best performance when only a single feature is considered.

[Figure 4.4 plot. Axis: Classification Accuracy (%). Features: Color Histogram, Color Moment, Gabor, LBP, PHOG, HA VLAD, HA LLC, Dense LLC, DoG LLC, SURF LLC. Panel: ImageNet137.]

Figure 4.4: Average classification accuracies of different features on the original image. c = 64 for VLAD and c = 400 for LLC descriptor. On average, local features significantly outperform global features, and Dense SIFT + LLC achieves the best performance on both data sets. All results are reported with the standard deviation over 10 rounds of experiments.

Although the performance differences between features seem significant, the picture changes when we inspect the classification results per category. The classification accuracies of 6 of the 19 categories in ImageNet19 are shown in Fig. 4.5(A). No single feature achieves the best performance in all categories, and in some categories global features are comparable to local features. This implies the necessity of using multiple features to achieve robust overall classification performance.

The same observation holds for the ImageNet137 data set, where the results for 12 of the 137 categories are shown in Fig. 4.5(B). Note that nearly every feature, including the global features, achieves the best accuracy in some category. The result leads to the same conclusion that multi-feature fusion is important; it further indicates that multiple features become more important as the number and diversity of the categories to be recognized increase.

Based on this observation, we perform late fusion of multiple features for classification to verify the importance of multiple features. We perform feature selection for late fusion by iteratively adding one feature at a time, selecting at each step the feature that yields the largest performance improvement. The process stops when no remaining feature improves the performance. The result is shown in Fig. 4.6, which reports both the absolute and the relative improvement contributed by each fused feature. The relative improvement is defined as the absolute improvement divided by the optimal fusion performance. Multi-feature fusion, even with a simple late fusion strategy (i.e., averaging the normalized confidence scores from different modalities), significantly increases the classification performance, and the relative improvement increases as the number of categories increases. This result is consistent with the previous observation that the importance of multi-feature fusion increases as the diversity of categories increases.
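The greedy selection with late fusion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the text does not specify how confidence scores are normalized, so the sketch divides each sample's scores by their sum (assuming non-negative scores); it also treats the first selected feature's accuracy as its absolute improvement over a zero baseline. All function names and the toy scores are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Scores = List[List[float]]  # scores[i][k]: confidence of class k for sample i


def normalize(scores: Scores) -> Scores:
    """Scale each sample's scores to sum to 1 (assumed scheme, non-negative scores)."""
    out = []
    for row in scores:
        total = sum(row)
        out.append([s / total for s in row] if total > 0 else [1.0 / len(row)] * len(row))
    return out


def late_fusion(score_sets: Sequence[Scores]) -> Scores:
    """Average the (already normalized) confidence scores across features."""
    n = len(score_sets)
    return [[sum(vals) / n for vals in zip(*rows)] for rows in zip(*score_sets)]


def accuracy(scores: Scores, labels: Sequence[int]) -> float:
    """Fraction of samples whose highest-scoring class matches the label."""
    correct = sum(1 for row, y in zip(scores, labels) if row.index(max(row)) == y)
    return correct / len(labels)


def greedy_fusion(features: Dict[str, Scores], labels: Sequence[int]) -> List[Tuple[str, float]]:
    """Add one feature at a time, keeping the one that most improves fused accuracy.

    Stops when no remaining feature improves the accuracy. Returns the selected
    features in order, each with the fused accuracy after adding it.
    """
    normalized = {name: normalize(s) for name, s in features.items()}
    remaining = sorted(normalized)  # sorted for deterministic tie-breaking
    selected: List[str] = []
    history: List[Tuple[str, float]] = []
    best_acc = 0.0
    while remaining:
        trials = [
            (accuracy(late_fusion([normalized[f] for f in selected + [name]]), labels), name)
            for name in remaining
        ]
        acc, name = max(trials, key=lambda t: t[0])
        if acc <= best_acc:
            break
        selected.append(name)
        remaining.remove(name)
        history.append((name, acc))
        best_acc = acc
    return history


def improvements(history: List[Tuple[str, float]]) -> List[Tuple[str, float, float]]:
    """Per-step absolute gain and gain relative to the final fused accuracy."""
    if not history:
        return []
    final = history[-1][1]
    out, prev = [], 0.0  # assumption: the first feature is measured against zero
    for name, acc in history:
        out.append((name, acc - prev, (acc - prev) / final))
        prev = acc
    return out


# Toy example with hypothetical scores: 4 samples, 2 classes.
labels = [0, 1, 0, 1]
features = {
    "local": [[0.9, 0.1], [0.3, 0.7], [0.6, 0.4], [0.6, 0.4]],
    "global": [[0.4, 0.6], [0.45, 0.55], [0.55, 0.45], [0.1, 0.9]],
    "noise": [[0.8, 0.2], [0.8, 0.2], [0.8, 0.2], [0.8, 0.2]],
}
history = greedy_fusion(features, labels)
print(history)                # [('global', 0.75), ('local', 1.0)]
print(improvements(history))  # [('global', 0.75, 0.75), ('local', 0.25, 0.25)]
```

In the toy data, each of the two informative features misclassifies a different sample, so their averaged scores classify all four samples correctly, while the uninformative third feature is never selected. In practice the selection accuracy would be measured on held-out data rather than on the test set.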

A commonly used strategy to increase classification performance with a single feature, especially a local feature, is to increase the descriptor dimension. We next compare the effectiveness of increasing the descriptor dimension against fusing multiple features. In particular, we compare the classification accuracy with respect to the total feature dimension, because the dimension is proportional to the bitrate of the strategy. The dimension of multi-feature fusion is defined as the sum of the dimensions of all the features being fused. The result for the ImageNet19 data set is shown in Fig. 4.7. Multi-feature fusion clearly achieves better performance than increasing the feature dimension under the same bitrate. The result indicates that, when extracting multiple features is possible, using multiple features is more efficient for improving classification accuracy than using complicated models built on a single feature.

[Figure 4.5 plot. Panel titles recovered: Japanese Maple, Dune, Canoe, Centaury, Electric Guitar, Bikini, Safe, Ferris Wheel.]

Figure 4.5: Results for 6 categories from the ImageNet19 data set are shown in (A), and for 12 categories from the ImageNet137 data set in (B). In ImageNet19, no single feature achieves the best performance in all categories. This indicates the necessity of multiple features to achieve optimal classification performance. In ImageNet137, which contains more categories and more diverse concepts, every feature except Gabor achieves the best performance in some category, and even the same local feature shows different performance under different pooling methods. Compared with the results on the ImageNet19 data set, this shows a strong need for multi-modal fusion across different features as the number of categories increases¹.

Figure 4.6: Results of multi-feature fusion. In (A), each section of the histogram indicates the absolute performance improvement from fusing one additional (best) feature. In (B), each section indicates the performance improvement relative to the optimal performance. The relative improvement is larger on the data set with the larger number and diversity of categories (ImageNet137), which confirms that multi-feature fusion becomes more important as the number of categories (or concepts) increases.

¹ Since we conducted extensive experiments with many different configurations, all figures are best viewed in color.
