In mobile visual search, much effort has been devoted to generating a compact signature from the raw features to reduce the bitrate for retrieval. In this section, we compare the performance versus bitrate of the scaled-down image and the image signature.
Figure 4.10: The recognition bitrate of the scaled-down image on the ImageNet19 data set. Because a scaled-down image contains the information of multiple features, the performance of both the optimal single feature and multi-feature fusion is reported. With multi-feature fusion, the scaled-down image can significantly outperform the raw feature under a given bitrate.
For signature generation, we use the data-independent sparse random projection (RP) [1]. To verify this choice, we also compare its performance with two state-of-the-art hashing and compression methods, namely the unsupervised product quantization (PQ) [59] and the supervised sequential projection learning hash (SPLH) [67], on the ImageNet19 data set.
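As a minimal sketch of how an RP signature could be computed, the following assumes a square projection matrix with entries in {-1, 0, +1} (about two thirds zeros, in the style of Achlioptas) and one sign bit per projected dimension. The sparsity level, the sign binarization, and the function names are assumptions made for illustration; they are not taken from our implementation.

```python
import numpy as np


def sparse_projection_matrix(dim, rng, density=1.0 / 3.0):
    """Square sparse random matrix with entries in {-1, 0, +1}.

    Assumption: roughly two thirds of the entries are zero by default;
    the text only states that the projection is sparse and data independent.
    """
    values = np.array([-1, 0, 1], dtype=np.int8)
    probs = [density / 2.0, 1.0 - density, density / 2.0]
    return rng.choice(values, size=(dim, dim), p=probs)


def rp_signature(feature, projection):
    """Project the feature and keep one bit (the sign) per output dimension."""
    projected = projection.astype(np.float64) @ feature
    bits = (projected > 0).astype(np.uint8)
    return np.packbits(bits)  # 8 signature bits per byte


def hamming_distance(sig_a, sig_b):
    """Number of differing bits between two packed signatures."""
    return int(np.unpackbits(np.bitwise_xor(sig_a, sig_b)).sum())


# Toy usage with a small dimension; the signatures in this section use 8,192.
rng = np.random.default_rng(0)
dim = 512
projection = sparse_projection_matrix(dim, rng)
feature = rng.standard_normal(dim)
signature = rp_signature(feature, projection)
```

Because the matrix is drawn once from a fixed seed and never trained, the device and the server can regenerate the same projection without exchanging any learned parameters, which is what the data-independent property amounts to in practice.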
The result is in Fig. 4.12. For PQ, the feature vector is first divided into subvectors, with the subvector dimension being G = 8 for VLAD and G = 10 for LLC. The average number of bits for representing each dimension is set to b = 1. For RP and SPLH, we set the projection matrix to be square, so the number of output bits is the same as the input dimension; the compression rate is therefore 64 for all methods. Note that the signatures are long (8k bits) to preserve recognition performance, because our goal is to build a mobile system whose performance is comparable to that of a server-based system. The result shows that SPLH does not perform as well as RP with high-dimensional (8k) signatures; more importantly, it performs poorly with LLC. The performance of PQ and RP is comparable, and we choose RP for further experiments because of its computational efficiency and data-independent property.
Figure 4.11: The recognition bitrate of the scaled-down image on the ImageNet137 data set. Although single-feature performance degrades more significantly on the ImageNet137 data set, the fusion result of the thumbnail image still outperforms raw features in recognition bitrate.
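The PQ bit budget above can be made concrete with a small sketch. With subvector dimension G and b bits per dimension, each subvector is coded with G·b bits, i.e. one of 2^(G·b) centroids (256 for G = 8, b = 1), so a D-dimensional feature costs D·b bits in total. If each original dimension is stored as a 64-bit float (an assumption; only the rate itself is stated above), b = 1 corresponds to the compression rate of 64. The code below is a plain k-means illustration with hypothetical function names, not the implementation of [59].

```python
import numpy as np


def train_pq(train, sub_dim, bits_per_dim=1, iters=10, seed=0):
    """Learn one k-means codebook per subvector (plain Lloyd iterations)."""
    rng = np.random.default_rng(seed)
    n, dim = train.shape
    assert dim % sub_dim == 0, "dimension must be a multiple of sub_dim"
    k = 2 ** (sub_dim * bits_per_dim)  # G = 8, b = 1 gives 256 centroids
    assert n >= k, "need at least k training vectors to initialize centroids"
    codebooks = []
    for start in range(0, dim, sub_dim):
        sub = train[:, start:start + sub_dim]
        centroids = sub[rng.choice(n, size=k, replace=False)].copy()
        for _ in range(iters):
            dists = ((sub[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            assign = dists.argmin(axis=1)
            for j in range(k):
                members = sub[assign == j]
                if len(members) > 0:  # keep the old centroid if a cell is empty
                    centroids[j] = members.mean(axis=0)
        codebooks.append(centroids)
    return codebooks


def pq_encode(feature, codebooks, sub_dim):
    """Replace each subvector by the index of its nearest centroid."""
    codes = []
    for i, centroids in enumerate(codebooks):
        sub = feature[i * sub_dim:(i + 1) * sub_dim]
        codes.append(int(((centroids - sub) ** 2).sum(axis=1).argmin()))
    return np.array(codes)


def pq_code_bits(dim, sub_dim, bits_per_dim=1):
    """Total code length in bits: (dim / sub_dim) subvectors of sub_dim * b bits."""
    return (dim // sub_dim) * sub_dim * bits_per_dim
```

Unlike RP, this requires a training set and a stored codebook per subvector, which illustrates the data-dependent cost that RP avoids.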
The result for signatures is in Fig. 4.13. We can see that the bandwidth requirement for a signature is lower than that of the scaled-down image. This suggests that the scaled-down image contains redundant information that is not yet fully utilized by the recognition system. We also examine the performance of fusing multiple feature signatures. Under the two aspects of mobile visual recognition, that is, the recognition rate and the bitrate, fusing multiple signatures turns out to be the best strategy, with near-optimal performance and roughly the same bitrate as thumbnail images. Note that the performance of fusing multiple feature signatures cannot be further improved by including more signatures under our multi-modal fusion framework; this might indicate that there is an irreversible information loss in signature generation.
Although multiple (local) feature signatures achieve almost the best performance with low bandwidth, the strategy may turn out to be infeasible when we consider the constraints of mobile computing. The most significant problem lies in the computing power of current mobile devices, because the strategy requires extracting multiple features solely on the device, which is computationally intensive. Fortunately, it is feasible to compute at least a single feature signature on mobile devices, which is the basis of many mobile visual search systems. Based on our own implementation, it takes less than a second on average to compute the signature of VLAD with SURF features using codebook size c = 64 on an iPhone 5. Therefore, a more realistic solution is to compute a single feature signature on the device and send both the thumbnail image and the feature signature to the remote server. The result of fusing thumbnail images and a single feature signature is in Fig. 4.13. We use the Hessian affine local feature and the VLAD descriptor with c = 64, with the 8,192-bit signature generated by sparse random projection. Note that we do not choose the signature with optimal performance (Dense+LLC+RP) because LLC is computationally intensive and too demanding for mobile environments. The strategy balances the different constraints on the device, i.e., storage, network bandwidth, and CPU, and it achieves almost the best performance with moderate bitrate. Under the current physical constraints on mobile devices, this is probably the best strategy in our evaluations in terms of recognition bandwidth requirements.
[Figure 4.12 plot: classification accuracy (%) for HA VLAD, Dense VLAD, HA LLC, and Dense LLC; series: Feature, RP, SPLH, PQ.]
Figure 4.12: Performance comparison of different signature generation methods. "Feature" stands for the performance of the original feature. c = 64 is used for VLAD and c = 400 for LLC. For random projection (RP) and sequential projection learning hash (SPLH), the number of output bits is the same as the input dimension. For product quantization (PQ), the average number of bits for each dimension is set to b = 1. Therefore, the compression rate is 64 for all methods. Note that we use high-dimensional signatures (8k) to ensure that the recognition performance does not degrade significantly. Under this compression rate, RP outperforms SPLH, and SPLH performs poorly with LLC. The performance of RP and PQ is comparable; we use RP in the following experiments for its efficient computation.
Figure 4.13: Bitrate comparison of all strategies, including feature signatures. The feature signature achieves the lowest bitrate under similar accuracy. For the signature fused with the thumbnail image, we use HA+VLAD+RP because VLAD is more mobile friendly than LLC. Among all strategies, multiple feature signatures achieve the lowest recognition bitrate, but are not feasible on current mobile devices due to the computational overhead of extracting multiple features on the device. Combining a single (local) feature signature and the thumbnail image is the solution that best fits current mobile device constraints.
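To make the bandwidth cost of the thumbnail-plus-signature strategy concrete, the sketch below packs a thumbnail and an 8,192-bit (1,024-byte) signature into one upload. The header layout and the function names are hypothetical and serve only to illustrate the accounting; the transport format is not specified in this section.

```python
import struct

SIGNATURE_BITS = 8192  # signature length used in this section


def build_payload(thumbnail_bytes, signature_bytes):
    """Concatenate thumbnail and signature behind a small length header.

    The header (two big-endian 32-bit lengths) is a hypothetical layout.
    """
    if len(signature_bytes) * 8 != SIGNATURE_BITS:
        raise ValueError("expected an 8,192-bit (1,024-byte) signature")
    header = struct.pack(">II", len(thumbnail_bytes), len(signature_bytes))
    return header + thumbnail_bytes + signature_bytes


def parse_payload(payload):
    """Server-side inverse of build_payload."""
    thumb_len, sig_len = struct.unpack(">II", payload[:8])
    thumbnail = payload[8:8 + thumb_len]
    signature = payload[8 + thumb_len:8 + thumb_len + sig_len]
    return thumbnail, signature


def payload_bits(payload):
    """Total number of bits sent for one query."""
    return len(payload) * 8
```

Under this layout the signature adds a fixed 1,024 bytes to each query regardless of the thumbnail size, so the total bitrate is dominated by how aggressively the thumbnail is scaled down and compressed.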