Deep Convolution Network - Large-Scale Image Recognition on Mobile Devices

Deep Learning as a learning paradigm has received much attention in the past decade and has been shown to be effective in various domains, including natural language processing [18] and speech recognition [53]. The main challenge with Deep Neural Networks is that they are hard to train, owing to the complexity introduced by their depth and the large number of learnable parameters. Take the Deep Belief Network (DBN) as an example: it is a generative model composed of multiple layers of Restricted Boltzmann Machines (RBM).

While inference on a single-layer RBM is straightforward, inference and training with multiple layers of hidden states are more challenging. To overcome this difficulty, greedy layerwise pre-training is applied before training on the labeled data [31]. Pre-training is an unsupervised learning process that learns the hidden layers one by one, where each layer is learned by maximizing the input likelihood as in an RBM. The entire network is then fine-tuned on the labeled training data. Pre-training is believed to learn a better representation of the input, which facilitates the subsequent supervised learning and leads to better generalization [20]. The supervised learning stage can even be omitted, in which case the network is used as a non-linear feature transform that generates features for other algorithms [45]. Networks learned without supervision can also be used for transfer learning [45, 26], which indicates that they are capable of learning universal features that are not specific to the training set.

In computer vision, the most popular Deep Learning architecture is the Deep Convolution Network (DCN) [43]. It can be viewed as a specialized Multi-Layer Perceptron (MLP) with a manually crafted architecture and regularization. An MLP can be formulated as a series of affine transforms, each followed by a non-linear dimension-wise transform:

Given the activation function $g$ and a loss function, the learnable parameters $W^{(l)}$ are learned by gradient descent. One significant problem with applying an MLP directly to general images is that the image signal usually contains tens of thousands of dimensions, which leads to extremely large affine transform matrices $W$. Without a clever learning process or regularization, the model is extremely prone to overfitting.
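The layer form described above (an affine transform followed by a dimension-wise non-linearity) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the bias vector `b`, the choice of `tanh` as the activation, and the helper names are not taken from the text, and the 1,000-unit hidden layer in the note below is a hypothetical size.

```python
import math

def affine(W, b, h):
    """Affine transform W h + b; W is a list of rows, b and h are vectors."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]

def mlp_forward(layers, x, g=math.tanh):
    """Apply each (W, b) layer as an affine transform followed by element-wise g.

    The bias b and the default activation tanh are illustrative assumptions.
    """
    h = x
    for W, b in layers:
        h = [g(a) for a in affine(W, b, h)]
    return h

def num_parameters(layers):
    """Count learnable values: every weight matrix entry plus every bias entry."""
    return sum(len(W) * len(W[0]) + len(b) for W, b in layers)

def dense_weight_count(n_in, n_out):
    """Number of entries in one fully connected weight matrix."""
    return n_in * n_out
```

For a sense of scale, `dense_weight_count(200 * 200, 1000)` gives 40,000,000 weights for a single fully connected layer on a 200x200 single-channel image, which is the kind of matrix size the paragraph above warns about.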

To avoid overfitting, convolution was introduced as a regularization on $W$ based on prior knowledge of the human visual system and the image signal [43, 44]. Convolution is usually formulated as

where both the input and the output are three-dimensional tensors, and $\tilde{W}^{l,k}$ stands for the $k$-th convolution kernel in the $l$-th layer. The width of the kernel is denoted by $N_W$. It can be reformulated as

Figure 2.1: Convolution illustration. A convolution network is essentially an MLP with two additional constraints: (1) local response and (2) tied weight. Local response enforces sparse connections between layers using the local receptive field heuristic, and the tied weight constraint enforces the weights within a feature map to be the same. The two constraints reduce the number of learnable parameters and help avoid overfitting.

where $N_H$ stands for the width of $h^{l-1}$ (assuming a square input). The convolution kernel $\tilde{W}^{l,k}$ is redefined over the entire input space at each position $(i, j)$ as $W^{l,k,i,j}$ with the following two constraints:

\[ W^{l,k,i,j}_{k',i',j'} = 0, \quad \text{if } \|i - i'\| > \frac{N_W}{2} \text{ or } \|j - j'\| > \frac{N_W}{2} \tag{2.6} \]

and

\[ W^{l,k,i,j}_{k',i',j'} = W^{l,k,i+r,j+s}_{k',i'+r,j'+s}, \quad \forall r, s. \tag{2.7} \]

By vectorizing $h^l_{k,i,j}$ over $i, j, k$ and $W^{l,k,i,j}_{k',i',j'}$ over $i, j, k$ and $i', j', k'$ respectively, $W^{l,k,i,j}_{k',i',j'}$ reduces to the two-dimensional tensor $W^{(l)}$ in Eq. 2.3, as shown in Fig. 2.1, with two additional constraints: local response as in Eq. 2.6 and tied weight as in Eq. 2.7. The local response constraint enforces that only a small part of $W^{l,k,i,j}$ can be non-zero, which means that $h^l_{k,i,j}$ can depend only on a small fraction of $h^{l-1}$, whereas in an MLP it can depend arbitrarily on the previous layer. This reduces the number of learnable parameters, following the heuristic that local patterns are important both for visual recognition and in the human visual system. Tied weight further reduces the number of learnable parameters by enforcing that kernels $W^{l,k,i,j}$ with the same $l, k$ share the same values.
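The two constraints can be made concrete with a small sketch that expands a kernel into the position-dependent weights $W^{l,k,i,j}_{k',i',j'}$ and checks that the constrained dense form gives the same output as the direct sliding-window form. Several simplifications here are assumptions rather than the text's formulation: a single input channel and a single kernel, an odd kernel width, zero padding at the border, cross-correlation indexing (no kernel flip), and no activation function. The function names are hypothetical.

```python
from itertools import product

def conv_direct(h, kernel):
    """Sliding-window form on a square input h with zero padding at the border."""
    n_h, n_w = len(h), len(kernel)
    half = n_w // 2
    out = [[0.0] * n_h for _ in range(n_h)]
    for i, j in product(range(n_h), repeat=2):
        for m, n in product(range(n_w), repeat=2):
            ii, jj = i + m - half, j + n - half
            if 0 <= ii < n_h and 0 <= jj < n_h:
                out[i][j] += kernel[m][n] * h[ii][jj]
    return out

def dense_weights(kernel, n_h):
    """Expand the kernel into weights W[i, j, i', j'] over the whole input."""
    half = len(kernel) // 2
    W = {}
    for i, j, ip, jp in product(range(n_h), repeat=4):
        di, dj = ip - i, jp - j
        if abs(di) > half or abs(dj) > half:
            W[i, j, ip, jp] = 0.0  # local response: zero outside the window
        else:
            # tied weight: the value depends only on the offset (di, dj)
            W[i, j, ip, jp] = kernel[di + half][dj + half]
    return W

def conv_dense(h, W):
    """MLP-style form: each output sums over every input position."""
    n_h = len(h)
    return [[sum(W[i, j, ip, jp] * h[ip][jp]
                 for ip, jp in product(range(n_h), repeat=2))
             for j in range(n_h)]
            for i in range(n_h)]

h = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
kernel = [[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]]
W = dense_weights(kernel, len(h))
same_output = conv_direct(h, kernel) == conv_dense(h, W)
```

In this toy case the dense form stores 3^4 = 81 weights for a 3x3 input, while only the 9 kernel values are free parameters; the gap grows quickly with the input width.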

Despite the convolution regularization, DCNs still have a large number of learnable parameters and are still prone to overfitting. While unsupervised pre-training is popular in DBNs as a way to improve generalization, the DCN, as a fully supervised learning algorithm, rarely utilizes or even defines similar learning techniques. Therefore, a large training set with high diversity is necessary to learn a DCN, yet such data sets were not easily obtainable in the past. Also, the computation cost of convolution made training on moderate-resolution images (200x200, etc.) a formidable task until very recently. The most impressive breakthrough of DCNs in visual recognition comes from their superior performance on the ImageNet data set, where A. Krizhevsky et al. [40] and Q.-V. Le et al. [42] independently report a significant performance improvement over traditional image features. While the network architectures used by the two groups differ significantly, the key factors in the success of both are the extremely large training set and the parallel acceleration that makes learning possible. This partially explains the resurgence of Neural Networks, which had been developed long before they received much attention: it is only recently that such large training sets and sufficient computation power have become available to make the networks learnable.

Chapter 3 Pure Mobile Visual Recognition System

3.1 Goal and system overview

Given the issues and requirements in mobile visual recognition, we conduct a preliminary investigation of the feasibility of performing visual recognition purely on the mobile device. There are two reasons to eliminate the dependency on the wireless network: first, the reliability and coverage of wireless networks are not satisfactory in many places; second, network delay degrades the user experience [25]. We summarize the proposed system in Fig. 3.1.

To ensure both recognition accuracy and efficiency, we propose to adopt linear SVMs with high-dimensional visual features, following many state-of-the-art large-scale visual recognition systems [4, 55]. A nonlinear SVM requires storing multiple support vectors and evaluating the kernel function over all support vectors at classification time, which is not suitable for mobile devices in terms of either storage or computation efficiency. The same concerns hold for nearest neighbor classifiers.

Figure 3.1: System overview for pure mobile visual recognition. The photo (or video) is recorded by the camera on the mobile device. A high-dimensional feature is computed by the device, then the feature dimension is reduced by a sparse projection matrix learned offline. The classifiers are learned in the reduced-dimension space. Multiple classifications are thus performed efficiently on the device using low-dimensional linear support vector machines (SVMs). The key question for mobile visual recognition is whether the projection matrix preserves the classification accuracy and admits a more compact representation (i.e., sparsity).

Although more space efficient, high-dimensional linear SVMs still require huge storage, which limits the scalability of the semantic space. To overcome this limit, we "compress" the classifiers by reducing the input space with a linear projection before classification, as illustrated in Fig. 3.2. We also impose a sparsity constraint on the projection matrix to reduce its storage overhead; otherwise the projection matrix may not fit in the memory of mobile devices (e.g., a dense projection matrix from 200k to 512 dimensions takes about 780 MB). By reducing the size of both the classifiers and the projection matrix, we improve the scalability of native mobile visual recognition systems. Note that because the number of stored values in the classifiers and the projection matrix equals the number of floating point operations performed during classification, a reduction in storage also corresponds to a reduction in computation. The design also improves the updatability of the system by reducing the overhead of updating the classification models and the projection matrix, which is highly desirable or even necessary for real mobile applications.
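The link between stored values and arithmetic can be seen in a minimal sketch of the on-device pipeline: a sparse projection followed by linear classifiers in the reduced space. The data layout (one list of `(input_index, weight)` pairs per output dimension), the absence of bias terms, and all function names are assumptions made for illustration, not the system's actual implementation.

```python
def sparse_project(P_rows, x):
    """Project x with a sparse matrix.

    P_rows[k] holds the (input_index, weight) pairs of the non-zero
    entries that feed output dimension k.
    """
    return [sum(w * x[idx] for idx, w in row) for row in P_rows]

def classify(classifiers, z):
    """Score every linear classifier (a weight vector) on z; return (best, scores)."""
    scores = [sum(w * v for w, v in zip(weights, z)) for weights in classifiers]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

def multiply_adds(P_rows, classifiers):
    """One multiply-add per stored non-zero projection entry and per classifier weight."""
    return sum(len(row) for row in P_rows) + sum(len(w) for w in classifiers)

# Toy example: 4-dimensional feature, 2-dimensional reduced space, 3 categories.
P_rows = [[(0, 1.0), (3, 2.0)], [(2, -1.0)]]
classifiers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = [1.0, 5.0, 2.0, 3.0]

z = sparse_project(P_rows, x)
best, scores = classify(classifiers, z)
ops = multiply_adds(P_rows, classifiers)
```

Each stored value is used exactly once per classification, so shrinking the projection matrix or the classifiers shrinks the operation count by the same amount.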

There are three requirements for dimension reduction that is compliant with mobile classification. First, it has to be computationally efficient. Second, it has to preserve the classification performance. Finally, its storage consumption should be small. To fulfill these requirements, we design a new linear dimension reduction algorithm, KPP, which is described in detail in the next section.

Figure 3.2: The storage reduction achieved by linear dimension reduction. Given a D-dimensional feature and C categories for classification (e.g., by linear SVM), the original classifiers require storage of CD real values, as in (a); performing dimension reduction with a projection matrix with d-dimensional output, as shown in (b), requires (D + C)d storage. If we further consider the sparsity of the projection matrix and let r be the ratio of non-zero matrix elements (r ≈ 12% in our experiment), the storage reduction becomes CD − (rD + C)d, as illustrated in (c). The reduction in storage corresponds to a proportional reduction in computation. (Best seen in color)
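The storage counts in Fig. 3.2 are simple to evaluate numerically. The sketch below plugs in D = 200,000 and d = 512 from the example in Section 3.1 and r = 0.12 from the caption; the number of categories C = 1,000 and the 8 bytes per stored value are assumptions chosen only for illustration, and the count ignores any index overhead a sparse storage format would add.

```python
def storage_counts(D, C, d, r):
    """Number of stored real values for the three cases in Fig. 3.2."""
    original = C * D            # (a) C classifiers in the D-dimensional space
    dense = (D + C) * d         # (b) dense D-by-d projection plus C classifiers in d dims
    sparse = (r * D + C) * d    # (c) only a fraction r of the projection is non-zero
    return {
        "original": original,
        "dense": dense,
        "sparse": sparse,
        "saved": original - sparse,  # CD - (rD + C)d
    }

def mebibytes(num_values, bytes_per_value=8):
    """Convert a value count to MiB; 8 bytes per value is an assumption."""
    return num_values * bytes_per_value / 2**20

# C = 1,000 is a hypothetical category count.
counts = storage_counts(D=200_000, C=1_000, d=512, r=0.12)
dense_projection_mib = mebibytes(200_000 * 512)
```

Under the 8-byte assumption the dense 200k-to-512 projection alone comes to roughly 781 MiB, close to the 780 MB figure quoted in Section 3.1, while the sparse variant with these hypothetical settings stores about 12.8 million values in total instead of 200 million.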

3.2 Dimension Reduction by Kernel Preserving
