Experimental Setup - Large-Scale Image Recognition on Mobile Devices

In this section, we describe the features used in evaluation and the feature extraction procedures. Then we describe the classifier adopted for classification.

4.3.1 Features

To explore the optimal recognition performance given an image, we evaluate a variety of popular general-purpose features. Many of these features are adopted in state-of-the-art visual recognition systems and are evaluated on both public data sets and recognition contests such as PASCAL VOC and ImageNet [22, 4, 55, 68]. The features can be categorized into the following two groups.

Global Features

Global features include all features that are non-local. While there exists a large number of global features, ranging from color and texture to edge features, we choose some of the most widely used general-purpose features, as follows:

• Color histogram. We use a 24-dimensional color histogram in HSV color space, where the histogram of each channel is computed separately and the three histograms are then concatenated. For quantization, H is divided into 18 levels, and S and V are divided into 3 levels each.

• Color moment. We use a 225-dimensional grid color moment; each image is divided into a 5×5 grid, and the first to third moments in RGB color space are extracted from each cell (25 cells × 3 channels × 3 moments).

• Gabor. We use 48-dimensional log-Gabor coefficients as features. 24 filters with 6 orientations and 4 scales are used to compute the responses, and the first and second moments of each filter response are extracted [46].

• LBP. We use local binary patterns with the uniform-pattern extension, which results in a 59-dimensional histogram [2].

• PHOG. We use a 3400-dimensional pyramid histogram of oriented gradients with a 4-bin histogram and a 3-level pyramid (K = 4, L = 3) [7].
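As a rough illustration of the two color features listed above, the following sketch computes a 24-dimensional HSV histogram and a 225-dimensional grid color moment with NumPy. It is not the implementation used in the experiments. The function names, the [0, 1) channel range, the per-channel histogram normalization, and the choice of mean, standard deviation, and cube root of the third central moment as the "first to third moments" are all assumptions of this sketch.

```python
import numpy as np


def hsv_color_histogram(hsv):
    """24-d histogram: 18 H bins, 3 S bins, 3 V bins, concatenated.

    `hsv` is an (height, width, 3) float array with every channel scaled
    to [0, 1) (an assumption; libraries use other value ranges).
    """
    parts = []
    for channel, bins in enumerate((18, 3, 3)):
        values = hsv[..., channel].ravel()
        hist, _ = np.histogram(values, bins=bins, range=(0.0, 1.0))
        # Normalizing each channel histogram to sum to 1 is an assumption.
        parts.append(hist / values.size)
    return np.concatenate(parts)


def grid_color_moments(rgb, grid=5):
    """225-d feature: three moments per RGB channel per cell of a 5x5 grid."""
    h, w, _ = rgb.shape
    ys = np.linspace(0, h, grid + 1).astype(int)
    xs = np.linspace(0, w, grid + 1).astype(int)
    feats = []
    for i in range(grid):
        for j in range(grid):
            cell = rgb[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            cell = cell.reshape(-1, 3).astype(np.float64)
            mean = cell.mean(axis=0)
            centered = cell - mean
            std = np.sqrt((centered ** 2).mean(axis=0))
            # Cube root of the third central moment (assumed definition).
            skew = np.cbrt((centered ** 3).mean(axis=0))
            feats.extend([mean, std, skew])
    return np.concatenate(feats)
```

For any image of at least 5×5 pixels, the two functions return vectors of length 18 + 3 + 3 = 24 and 25 × 9 = 225, matching the dimensions stated in the list.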

We also examined the popular Gist feature [54, 19], but its performance is only slightly better than that of the color histogram on ImageNet19 despite its high dimensionality (960), so we do not include it in further experiments.

Local Features

Local features have become the standard component of state-of-the-art visual recognition systems. Among the wide range of detectors as well as descriptors, we choose the following combinations in our evaluations:

• DoG. Difference of Gaussian detector + SIFT descriptor [49].

• HA. Hessian Affine detector + SIFT descriptor [52].

• Dense. SIFT features extracted with dense sampling, using 20×20 patches and overlapping windows shifted by 10 pixels. We use the VLFeat library for dense SIFT extraction [66].

• SURF. SURF detector + SURF descriptor [3].

We also examined the Compressed Histogram of Gradients (CHoG) descriptor [10], which is designed specifically for mobile visual search; we do not include it in the following discussions because its performance was similar to that of SURF in our preliminary tests and it is not especially representative.

4.3.2 Descriptors

To utilize local features in classification, the local descriptors in an image are usually aggregated into a compact feature. Many aggregated descriptors have been proposed in recent years, of which Bag of Words (BoW) and Fisher Vector (FV) are the most popular. We choose the following two descriptors, which are variations of BoW and FV respectively, in our evaluations:

• LLC. Locality-constrained linear coding (LLC) is a variation of BoW that is further combined with spatial pyramid matching (SPM). We choose LLC for its performance, and our preliminary tests also confirm that LLC significantly improves over BoW + SPM. In our experiments, we use codebook sizes c = 200 and 400 with pyramid level l = 2. Note that the codebook was constructed simply by K-means, without optimization for LLC.

• VLAD. Vector of locally aggregated descriptors (VLAD) is a simplification of FV. The reason for choosing VLAD over FV is twofold. First, in our preliminary tests, VLAD shows performance comparable to FV with a smaller dimension (no covariance vector). Second, computing VLAD requires less storage and computation, which is important on mobile devices. In our experiments, we use codebook sizes c = 16, 64, and 256.
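To make the VLAD aggregation concrete, the sketch below assigns each local descriptor to its nearest codeword and accumulates the residuals per codeword. It is a minimal illustration, not the code used in the experiments; the function name `vlad` and the final L2 normalization are assumptions, and the codebook is taken as given (for example, K-means centroids).

```python
import numpy as np


def vlad(descriptors, codebook):
    """Aggregate local descriptors (n, d) into a VLAD vector of length c * d.

    `codebook` is a (c, d) array of centroids, e.g. obtained by K-means.
    """
    # Squared Euclidean distance from every descriptor to every centroid.
    diff = descriptors[:, None, :] - codebook[None, :, :]
    nearest = (diff ** 2).sum(axis=2).argmin(axis=1)

    c, d = codebook.shape
    v = np.zeros((c, d))
    for k in range(c):
        members = descriptors[nearest == k]
        if len(members):
            # Sum of residuals between the descriptors and their centroid.
            v[k] = (members - codebook[k]).sum(axis=0)

    v = v.ravel()
    norm = np.linalg.norm(v)
    # L2 normalization is an assumption of this sketch.
    return v / norm if norm > 0 else v
```

With 128-dimensional SIFT descriptors, codebook sizes c = 16, 64, and 256 give VLAD vectors of 2048, 8192, and 32768 dimensions respectively. Only first-order residuals are stored, which is why the vector is smaller than an FV that also keeps second-order terms.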

4.3.3 Feature Extraction

To evaluate the recognition performance with respect to varying image sizes, we extract the features from the same image at different scales. For the ImageNet19 data set, the images are scaled down to 1/2, 1/4, and 1/8, and for the ImageNet137 data set, the images are scaled down to 1/2, 1/4, 1/8, and 1/16, as shown in Fig. 4.3. Every scaled-down image is scaled back up to its original size using bilinear interpolation before feature extraction. Up-scaling is performed for the sake of recognition performance; our evaluation shows that scaling the images up to the original size generally yields better performance.

In particular, the image size directly determines the number of feature points for the Dense SIFT feature, which in turn affects performance. Therefore, we perform image up-scaling for all features for consistency of evaluation. No information other than the thumbnail image is used during feature extraction.
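The relation between image size and the number of dense feature points can be illustrated with a small sketch. It down-scales a grayscale image, scales it back up with bilinear interpolation, and counts the 20×20 windows on a 10-pixel grid in each case. This is an illustration only: the experiments use OpenCV and VLFeat, whereas the NumPy resize below, the use of bilinear interpolation for down-scaling, the function names, and the simple window-count formula are assumptions of this sketch.

```python
import numpy as np


def bilinear_resize(img, out_h, out_w):
    """Resize a 2-D grayscale array with bilinear interpolation."""
    in_h, in_w = img.shape
    # Map output pixel centers back into input coordinates.
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    img = img.astype(np.float64)
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def dense_patch_count(h, w, patch=20, step=10):
    """Number of patch x patch windows on a regular grid with the given step."""
    if h < patch or w < patch:
        return 0
    return ((h - patch) // step + 1) * ((w - patch) // step + 1)


def thumbnail_then_upscale(img, factor):
    """Scale an image down by `factor`, then back up to its original size."""
    h, w = img.shape
    small = bilinear_resize(img, max(1, h // factor), max(1, w // factor))
    return small, bilinear_resize(small, h, w)
```

For a 320×480 image scaled down to 1/8, the 40×60 thumbnail holds only 3 × 5 = 15 windows, while the up-scaled image holds 31 × 47 = 1457, the same number as the original. Up-scaling therefore keeps the number of dense feature points constant across scales.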

We also examined the performance of various feature extraction strategies. For example, we computed the SIFT descriptors on upscaled images using the salient points detected on the original images, which is the strategy adopted in IMShare [13]. The additional detector information in this strategy turns out to be unhelpful for recognition. We also examined the optimal up-scaling factor for thumbnail images in feature extraction, and the result favors scaling the image up to the original size.

4.3.4 Classifier

In this section, we describe the classifier adopted for classification and the method for multi-feature fusion. For all features, we use an SVM with a linear kernel [23] in a 1-vs-all framework for multi-class classification. A linear SVM is adopted because of its training and testing efficiency and its success in many state-of-the-art visual recognition systems [22, 4, 55, 68]. The SVM parameter is determined by 5-fold cross-validation on the training set.
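The training setup can be sketched as follows, using scikit-learn as a stand-in for the SVM package cited above. This is an assumption-laden illustration: the synthetic data, the candidate grid for the regularization parameter C, and the choice of scikit-learn are not taken from the experiments. `LinearSVC` trains one-vs-rest linear SVMs for multi-class problems, and `GridSearchCV` with `cv=5` selects C by 5-fold cross-validation on the training set.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for image features: 3 classes, 60 samples each, 20-d.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 20))
X = np.vstack([c + rng.normal(size=(60, 20)) for c in centers])
y = np.repeat(np.arange(3), 60)

# One-vs-rest linear SVM; C is chosen by 5-fold cross-validation.
search = GridSearchCV(
    LinearSVC(max_iter=10000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # hypothetical candidate values
    cv=5,
)
search.fit(X, y)

# Per-class decision values, shape (n_samples, n_classes).
decision_values = search.decision_function(X)
```

The per-class decision values produced here are the quantities that a late fusion step would normalize and combine across features.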

For multi-feature fusion, we apply a late fusion strategy: averaging the normalized scores from the classifiers trained on different features. We use late fusion instead of early fusion for its efficiency, which is important due to the large number of possible feature combinations. To perform late fusion, the decision values of each classifier are first normalized with a sigmoid function, and the scores from the same modality over the different categories of each image are then L1-normalized. The sum of the normalized scores over all features is used to determine the class label of the instance. We do not explore more sophisticated fusion methods, because the focus of this study is on the features rather than on fusion algorithms, and such methods do not show significant performance gains.
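The fusion steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used in the experiments; the function name `late_fusion` is hypothetical, and the plain logistic function without fitted parameters is an assumption about the form of the sigmoid.

```python
import numpy as np


def late_fusion(decision_values):
    """Fuse per-feature SVM outputs and return one class index per image.

    `decision_values` is a list holding one (n_images, n_classes) array of
    SVM decision values per feature.
    """
    fused = 0.0
    for scores in decision_values:
        # Step 1: squash decision values with a sigmoid (assumed unfitted).
        probs = 1.0 / (1.0 + np.exp(-scores))
        # Step 2: L1-normalize the scores of each image over the categories.
        probs = probs / probs.sum(axis=1, keepdims=True)
        # Step 3: accumulate the normalized scores over all features.
        fused = fused + probs
    # The label is the category with the largest summed score.
    return np.argmax(fused, axis=1)
```

Because every feature contributes a vector that sums to 1 per image, no single classifier with large raw decision values dominates the sum. Summing and averaging yield the same label, since they differ only by a constant factor.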

4.3.5 Compression Factor of Images

For bitrate comparison, we have to estimate the image size (in storage). The average image size over the entire data set is used as the estimator, and the images are all in JPEG format with the original image quality of the ImageNet data set. For thumbnail images, scaling is performed using OpenCV [8], and the thumbnails are also stored in JPEG format with OpenCV's default quality setting of 95 (on a 0-100 scale).

Although changing the compression configuration may also reduce data size, we focus on image scaling because the efficacy of thumbnail images has been established [65, 13]. Besides, image scaling is a more straightforward way to control image quality. Therefore, the compression configuration is kept the same throughout the evaluations.
