Variance Measurement

Chapter 3. Proposed Method

3.3 Variance Measurement

In the sections above, every cell of the grid uses the sum measurement. Its advantage is that the computation is simple and fast. Its drawback is that the sum sometimes does not carry enough information; in some situations it has no discriminative power. For example, in Figure 3-15 (a) the sums of the two rectangles are very close, so the sum cannot distinguish the mouth, which contains many details, from the smooth cheek. In Figure 3-15 (b), if we apply a measurement that captures the difference between the two green rectangles, we can use them to construct a good feature.

Figure 3-15 (a) The sums of the two rectangles are close. (b) The rectangles with a new measurement may form a good feature

To overcome this defect, we introduce a second measurement, the variance. We compute not only the sum of each cell but also its variance. Using the variance has two main advantages. First, it is a second-order moment, which provides information beyond the first moment. Second, the computation is affordable: once the sum of each cell is computed, the mean is already known, which reduces the cost of computing the variance. Figure 3-16 illustrates what a feature looks like after we add the variance measurement.

Figure 3-16 The grid feature with the variance measurement

The right side is a grid feature. It is equivalent to

sum(J)  sum(I)  variance(L)  variance(J)        (Eq 3-1)
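As a rough illustration of why the variance is cheap once the sum is known, the sketch below computes both measurements for one rectangular cell. It assumes two integral images, one of the pixels and one of the squared pixels; this second integral image and all function names are assumptions made for illustration, since the text only states that the known mean reduces the cost of the variance.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row and a zero column prepended."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0, dtype=np.float64), axis=1)
    return ii

def cell_sum(ii, top, left, height, width):
    """Sum over a rectangle using four lookups in the integral image."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def cell_sum_and_variance(ii, ii_sq, top, left, height, width):
    """Return (sum, variance) of one cell; variance = E[x^2] - mean^2."""
    n = height * width
    s = cell_sum(ii, top, left, height, width)
    s_sq = cell_sum(ii_sq, top, left, height, width)
    mean = s / n  # the mean comes for free from the sum
    return s, s_sq / n - mean * mean

# Toy 3x3 image; the cell is the top-left 2x2 block [[1, 2], [4, 5]].
img = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float64)
ii = integral_image(img)
ii_sq = integral_image(img * img)
s, var = cell_sum_and_variance(ii, ii_sq, 0, 0, 2, 2)
```

Under these assumptions, the variance of a cell needs only four extra lookups in the squared-pixel integral image on top of the sum.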

To measure the influence of the variance measurement, we train the two classifiers again. This time we use the same training data and test data so that the results can be compared with those of the previous experiment. We plot the comparison in the following figure.

Figure 3-17 Performance of three classifiers (Haar-like features, grid features with the sum measurement, and grid features with the sum and variance measurements).

After we add the variance measurement, our classifier performs better than the classifier with Haar-like features. In other words, the variance helps to separate faces from non-faces.

We show the first three features we find in this experiment in Figure 3-18.

Figure 3-18 The first three features with the variance measurement

Chapter 4. Experiments

In this chapter, we use the grid feature to train a face classifier with 200 weak learners and compare its performance with that of a classifier using Haar-like features. Before showing the results, we discuss three notable issues in the training process: how we collect the training data, the symmetric property of frontal faces, and the overall cascade structure.

4.1 Training Data

4.1.1 Positive Training Data

Training data plays an important role in any learning-based classifier. If the data are not representative, we may not find the real boundary between the two classes. One of the best-known training databases was built by MIT CBCL. It contains 2429 faces and 4548 non-faces of 19*19 pixels as training data, and another 472 faces and 23573 non-faces for testing. However, both Viola and we found that such a small size does not provide enough information for detection, so we build a new database for our experiments. We collect 2678 faces of 30*30 pixels. Half are male and half are female. Most of these faces are Asian, especially Taiwanese and Japanese. Before fully constructing the database, we noticed that the marked area (only the face, or the face including the hair) also affects the training result. In a small test, the former was more robust, so we adopt it in our experiments. Both are shown in Figure 4-1.

Figure 4-1 Positive training data (a) only the face (b) including the hair

4.1.2 Negative Data

The main problem in collecting negative training data is the asymmetry between the two classes. Because the negative set covers so many different kinds of objects, a very large negative set is needed to represent them. When we train the classifier, however, the sizes of the two training sets should be close; otherwise the boosting algorithm focuses on the smaller set at the beginning (we give equal weights to the two training sets).

After a few iterations, the weights of the misclassified data become too large for the real boundary to be found. If we instead use a random subset of the large negative set, we may obtain a set that is easily separated from the positive data, and a boundary learned from such data does not generalize. We therefore need to choose negative data that lie as close to the positive data as possible. Figure 4-2 and Figure 4-3 illustrate this problem.

Figure 4-2 With poorly chosen negative data, the real boundary cannot be found

Figure 4-3 Choosing the negative data as close to the positive data as possible
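To make the weighting issue concrete, the following sketch initializes the boosting weights so that each class carries half of the total weight. This is only one reading of "equal weights" for the two training sets. The helper name and the negative set size in the usage lines are hypothetical.

```python
def initial_weights(num_pos, num_neg):
    """Per-sample starting weights when each class gets half of the total weight."""
    w_pos = 0.5 / num_pos
    w_neg = 0.5 / num_neg
    return w_pos, w_neg

# 2678 faces (as in the text) against a hypothetical negative set 100 times larger.
w_pos, w_neg = initial_weights(2678, 267800)
ratio = w_pos / w_neg  # how much more one face weighs than one non-face
total = w_pos * 2678 + w_neg * 267800
```

With these sizes, a single positive sample carries about 100 times the weight of a single negative sample, which is why the early boosting rounds concentrate on the smaller set.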

To overcome this problem, we first train a classifier with our collected faces and the non-faces from MIT CBCL. This classifier serves as a filter that eliminates candidates that are clearly non-faces. We pass many non-face patches through the filter and collect only the misclassified ones, which are more similar to faces. The sources of the negative data and the details of the procedure are shown in Figure 4-4.

Figure 4-4 The procedure to find our negative set

We apply uniform sampling in the last step because the second step creates many redundant patches. The negative data are also 30*30 pixels in size. In the following experiments, we use these three negative sets to train our cascade classifiers.
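The filtering and sampling steps described above can be sketched as follows. This is a minimal sketch, not the exact procedure of Figure 4-4: `mine_hard_negatives`, the `is_face` callable, and the use of uniform random sampling (rather than, say, fixed-stride sampling) are all assumptions for illustration.

```python
import random

def mine_hard_negatives(patches, is_face, keep, seed=0):
    """Keep non-face patches that the filter classifier wrongly accepts,
    then draw a uniform random sample of at most `keep` of them."""
    false_positives = [p for p in patches if is_face(p)]
    if len(false_positives) <= keep:
        return false_positives
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(false_positives, keep)

# Toy stand-ins: patches are integers and the "filter" accepts multiples of 3.
toy_patches = list(range(30))
hard = mine_hard_negatives(toy_patches, lambda p: p % 3 == 0, keep=4)
```

Every element of `hard` is a patch the filter mistook for a face, so the resulting set lies closer to the positive data than a random subset of all patches would.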

Here we show some samples of each set.

Figure 4-5 (a) negative set A (b) negative set B (c) negative set C

4.2 Symmetric Property of Front Face

Human faces are highly symmetric, and we want to bring this knowledge into the training process. When we train our frontal face classifiers, we therefore search for features only on the left half of the face. We then map each feature from this side to the right half to form a new symmetric feature. In the ideal case, we would double the threshold of each original feature because the new feature has twice as many cells. However, because most human faces still have some asymmetric parts, we need to find a new suitable threshold. The threshold changes with the training data. If the negative training data are highly asymmetric, a low threshold gives good discrimination, and vice versa.

In our experience, if the original threshold is normalized to 1, the new threshold that best separates the training faces from the non-faces usually lies between 1.45 and 1.75. Later, when we train the cascade classifiers, we adjust the threshold according to each training set.

Figure 4-6 shows how we map half-face features to new symmetric features.

Figure 4-7 plots the error rate against the threshold (the original threshold is normalized to 1).

Figure 4-6 Training features on one half of the face and mapping them to the other half

Figure 4-7 Error rates for different thresholds
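The mapping from a half-face feature to a symmetric feature can be sketched as below. The representation of a feature as a list of (row, column) grid cells, the function names, and the default scale of 1.6 are assumptions for illustration; the scale is simply a value inside the 1.45 to 1.75 range reported above, not a recommended setting.

```python
def mirror_cells(cells, grid_cols):
    """Reflect (row, col) cells across the vertical center line of the grid."""
    return [(row, grid_cols - 1 - col) for row, col in cells]

def symmetric_feature(cells, threshold, grid_cols, scale=1.6):
    """Join a left-half feature with its mirror image and rescale the threshold.

    scale=2.0 would be the ideal, perfectly symmetric case."""
    return cells + mirror_cells(cells, grid_cols), threshold * scale

# A feature with two cells on the left half of a grid that is 6 cells wide.
left_cells = [(1, 0), (2, 1)]
sym_cells, sym_threshold = symmetric_feature(left_cells, threshold=10.0, grid_cols=6)
```

Here the two left-half cells gain mirrored partners in columns 5 and 4, and the threshold grows by less than a factor of two to allow for the asymmetric parts of real faces.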

4.3 Cascade Structure and Bootstrap Method

There are two reasons to use the cascade structure in face detection. First, it rejects most non-faces within the first few layers, which reduces the computational load and makes real-time applications possible. Second, it allows us to use more negative training data. In Section 4.1 we mentioned the asymmetry problem of the training sets. Although we filter out many redundant patches, the negative training set is still larger than the positive one. To handle the large negative training set, we train several classifiers with the same positive training data but different parts of the negative set. When we cascade these classifiers, the final classifier has effectively seen the whole negative training set.

Furthermore, we adopt the bootstrap method in the cascade structure. We collect the negative training data misclassified at the current layer and use them again as negative data in the next layer. In this way, the classifier can focus on the “hard” training data that are not easily classified, and the boundary becomes more robust. Figure 4-8 illustrates how the cascade structure is constructed with the bootstrap method.

Figure 4-8 The cascade structure with the bootstrap method
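The interaction between the cascade and the bootstrap step can be sketched as follows. This is a simplified sketch under stated assumptions: `train_layer` stands for any routine that returns a callable layer, and the toy learner in the usage lines (a threshold halfway between the class means) is purely hypothetical and replaces the boosted classifier used in this thesis.

```python
def cascade_predict(layers, x):
    """Accept x only if every layer accepts it; reject at the first failure."""
    for layer in layers:
        if not layer(x):
            return False
    return True

def train_cascade(train_layer, positives, negative_sets):
    """Train one layer per negative subset, carrying misclassified negatives forward."""
    layers = []
    carried = []  # negatives the previous layer wrongly accepted
    for fresh_negatives in negative_sets:
        negatives = carried + list(fresh_negatives)
        layer = train_layer(positives, negatives)
        layers.append(layer)
        carried = [n for n in negatives if layer(n)]  # bootstrap step
    return layers

def toy_train_layer(positives, negatives):
    """Hypothetical 1-D learner: threshold halfway between the class means."""
    mean_pos = sum(positives) / len(positives)
    mean_neg = sum(negatives) / len(negatives)
    threshold = (mean_pos + mean_neg) / 2
    return lambda x: x >= threshold

layers = train_cascade(toy_train_layer, [10, 12, 14], [[0, 1, 2, 9], [3, 8]])
```

In this toy run the first layer wrongly accepts the negative sample 9, so 9 is carried into the second layer's training set, and the second layer then rejects it.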

At the first layer, we use our collected 2678 faces and 3500 non-faces from MIT [...] weak learners because the negative data here are those misclassified by the layers above. In other words, they are “hard” negative data, and we want a more precise boundary to separate the training faces from these hard training non-faces. From Chapter 3 we know that the number of weak learners controls the discriminative capability of the classifier. That is why we adopt 100 weak learners at this layer.

4.4 Results

4.4.1 Comparison Between Grid Feature and Haar-like Features on Test Pattern One

To compare our new feature with the original Haar-like feature, we randomly collect 74 photos containing 140 faces from the internet as our test pattern. We use our training database and the cascade structure to train two classifiers with 200 weak learners in total. One classifier uses Haar-like features and the other uses our grid features. We run 4 iterations of the progressive feature space process and keep the best 40 features each time. In the first four layers, we use only symmetric grid features to train the classifier; only in the last layer do we lift this restriction. The reason, as mentioned above, is that we want a more complex classifier to handle the hard negative data at the last layer. The first two features of each layer are shown in Figure 4-9.

Figure 4-9 First two features of each stage

The ROC curves are plotted in Figure 4-10, and the detection rates of the two classifiers are listed in Table 4-1 and Table 4-2.

Figure 4-10 ROC curves of two classifiers (with grid features and Haar-like features)

Table 4-1 Detection rate versus false positive rate of the classifier with Haar-like features

False positive rate (*10^-4)    0.16      0.66      3.1       8.7       23
Detection rate (Haar-like)      35.71%    58.57%    78.57%    93.57%    97.86%

Table 4-2 Detection rate versus false positive rate of the classifier with grid features

False positive rate (*10^-4)    0.22      1         3.8       10
Detection rate (Grid)           55.71%    82.86%    92.14%    97.71%

We show some detected photos in Figure 4-11.

Figure 4-11 Some detected photos by the classifier with grid features

4.4.2 Comparison Between Grid Feature and Haar-like Features on Test Pattern Two

We apply the classifiers described above to another test pattern. This time we test on one of the CMU test files (this file is unrelated to the files we used to create the negative data). It contains 60 photos and 172 faces.

The ROC curves are shown below.

Figure 4-12 ROC curves of two classifiers (with grid features and Haar-like features) on the CMU test pattern

The detection rates and false positive rates of both classifiers are listed in Table 4-3 and Table 4-4.

Table 4-3 Detection rate versus false positive rate of the classifier with Haar-like features

False positive rate (*10^-4)    0.16      2.5       6.3       10        12        25
Detection rate (Haar-like)      48.26%    76.16%    81.98%    86.63%    88.37%    92.44%

Table 4-4 Detection rate versus false positive rate of the classifier with grid features

False positive rate (*10^-4)    0.47      1.69      4.74      8.37      13        28
Detection rate (Grid)           51.74%    78.49%    87.79%    92.44%    94.19%    95.93%
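As a quick reading aid for the two tables above, each detection rate can be converted back into a number of detected faces out of the 172 faces in this test pattern. The rates are copied from Table 4-3 and Table 4-4; the variable and function names are illustrative only.

```python
TOTAL_FACES = 172  # faces in the CMU test file used here

haar_rates = [48.26, 76.16, 81.98, 86.63, 88.37, 92.44]  # Table 4-3, in percent
grid_rates = [51.74, 78.49, 87.79, 92.44, 94.19, 95.93]  # Table 4-4, in percent

def detected_faces(rate_percent, total=TOTAL_FACES):
    """Convert a detection rate in percent to the nearest whole number of faces."""
    return round(rate_percent / 100 * total)

haar_counts = [detected_faces(r) for r in haar_rates]
grid_counts = [detected_faces(r) for r in grid_rates]
```

At the highest operating points listed, this corresponds to 159 of 172 faces for the Haar-like classifier and 165 of 172 for the grid classifier, although the two operating points are at slightly different false positive rates.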

We show some detected photos in Figure 4-13.

Figure 4-13 Some detected photos in CMU test file by the classifier with grid features

The CMU test pattern is harder than the photos we collected from the internet, so neither classifier achieves a detection rate near 100% with 200 weak learners. Our worst case is shown in Figure 4-14.

Figure 4-14 The worst case

4.4.3 Summary of Experiments

In the two experiments above, the classifier using our grid features performs better than the one using Haar-like features on both test patterns. Although the test patterns are not large enough to claim that our feature is definitively better, the results at least show that grid features are useful for the face detection problem. Many parameters (such as the number of iterations) have not yet been optimized, and tuning them might improve the performance. To apply grid features in real-time applications, we can use only the sum measurement in the first few layers, so that the integral image is still applicable.

Chapter 5. Conclusion

In this thesis, we develop a new feature named the “grid feature” for boosting-based classifiers. It differs from previous work in three main ways. First, we use the grid representation to construct our features. This reduces the redundancy between features and leaves room for further design. Second, we adopt a progressive method that gradually enlarges the feature space in a promising direction by combining two simple features into a more complex and discriminative one. In this way, we can find more types of features without creating a huge initial feature space. Lastly, we add the variance measurement to the feature. It not only compensates for the insufficiency of the original sum measurement but also yields more robust features.

Our experimental results support the grid features. With both classifiers using 200 weak learners in 5 cascade layers, our grid features perform better than Haar-like features on two test patterns. Applying other measurements within this structure is a possible direction for future work.

References

[1] R.E. Schapire, “The Boosting Approach to Machine Learning: An Overview,” Nonlinear Estimation and Classification, Springer, pp. 149-172, 2001.

[2] R. Meir and G. Rätsch, “An Introduction to Boosting and Leveraging,” Advanced Lectures on Machine Learning (LNAI 2600), pp. 118-183, 2003.

[3] R.E. Schapire, Y. Freund, P. Bartlett, and W.S. Lee, “Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods,” Proc. Fourteenth Int’l Conf. Machine Learning, pp. 322-330, 1997.

[4] P. Viola and M. Jones, “Rapid Object Detection Using a Boosted Cascade of Simple Features,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 511-518, 2001.

[5] P. Viola and M. Jones, “Robust Real-Time Face Detection,” International Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.

[6] Li Fei-Fei, “Recognizing and Learning Object Categories,” Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Short Course, 2007.

[7] A. Torralba, K.P. Murphy, and W.T. Freeman, “Sharing Features: Efficient Boosting Procedures for Multiclass Object Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 762-769, 2004.

[8] C. Rudin, R.E. Schapire, and I. Daubechies, “Analysis of Boosting Algorithms Using the Smooth Margin Function,” Annals of Statistics, vol. 35, no. 6, pp. 2723-2768, 2007.

[9] M.-H. Yang, D.J. Kriegman, and N. Ahuja, “Detecting Faces in Images: A Survey,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 1, Jan. 2002.

[10] C. Huang, H.Z. Ai, Y. Li, and S.H. Lao, “Vector Boosting for Rotation Invariant Multi-View Face Detection,” IEEE Trans. Pattern Analysis and Machine Intelligence, pp. 671-686, 2005.

[11] R.E. Schapire, “Foundations of Machine Learning,” lecture notes, 2006.

