Adaboost

Chapter 2. Backgrounds

2.2 Boosting

2.2.2 Adaboost

Working in Valiant’s PAC learning model, Kearns and Valiant posed the question of whether a “weak” learning algorithm can be boosted into a strong learning algorithm. The answer is yes. In 1995, Freund and Schapire proposed the first practical boosting algorithm, Adaboost. It combines “weak” learners h, whose error rates are only slightly better than random guessing, into a strong classifier.

$$f = \sum_{t=1}^{T} \alpha_t h_t \qquad \text{(Eq. 2-3)}$$

The main idea of Adaboost is to focus on misclassified training data. In each round, Adaboost picks the best weak learner, the one with the smallest weighted error ($\epsilon_t = 1/2 - \gamma_t$). Then it increases the weights of the misclassified data and decreases the weights of the correctly classified ones. In the next round, it finds a new weak learner that minimizes the reweighted error. After T rounds, the strong classifier can achieve a much smaller error ($\mathrm{error}(f) \le \prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)}$). The overall algorithm is as follows:

The final hypothesis is a linear combination of weak learners. This means that after we map the positive and negative training data into the weak learner space, we can separate them linearly. We use Figure 2-8 to illustrate this idea.

Adaboost algorithm
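A minimal sketch of the procedure described above, assuming labels in {-1, +1} and decision stumps on raw feature values as the weak learners. The function names (`train_stump`, `stump_predict`, `adaboost`, `predict`) are illustrative and are not taken from the algorithm box; the coefficient uses the standard choice of half the log-odds of the weighted error.

```python
import math


def stump_predict(x, j, thr, pol):
    """Decision stump: +1 when pol * (x[j] - thr) >= 0, otherwise -1."""
    return 1 if pol * (x[j] - thr) >= 0 else -1


def train_stump(X, y, w):
    """Return (error, feature, threshold, polarity) with the smallest weighted error."""
    best = (float("inf"), 0, 0.0, 1)
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for pol in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(xi, j, thr, pol) != yi)
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best


def adaboost(X, y, T):
    """Run T boosting rounds; return a list of (alpha, feature, threshold, polarity)."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform sample weights
    model = []
    for _ in range(T):
        err, j, thr, pol = train_stump(X, y, w)
        err = min(max(err, 1e-12), 1 - 1e-12)   # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, thr, pol))
        # Raise the weights of misclassified samples, lower the others, renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, j, thr, pol))
             for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return model


def predict(model, x):
    """Sign of the weighted vote of all weak learners."""
    score = sum(a * stump_predict(x, j, thr, pol) for a, j, thr, pol in model)
    return 1 if score >= 0 else -1
```

On a one-dimensional toy set with labels + + - - + +, no single stump separates the classes, but three boosted stumps do.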

Figure 2-8 Training data are linearly separated in the weak learner space

In the algorithm, $\alpha_t$ is the coefficient used to combine the weak learners. How do we choose $\alpha_t$ in the Adaboost algorithm? We treat $Z_t$ as a loss function and try to find a suitable $\alpha_t$ to minimize it.
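The minimization can be written out in a few lines. This is the standard argument for weak learners with outputs in $\{-1,+1\}$; the symbol $D_t(i)$ for the weight of sample $i$ in round $t$ is notation assumed here.

$$Z_t(\alpha) = \sum_i D_t(i)\, e^{-\alpha y_i h_t(x_i)} = (1-\epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha}$$

$$\frac{dZ_t}{d\alpha} = 0 \;\Rightarrow\; \alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}, \qquad Z_t(\alpha_t) = 2\sqrt{\epsilon_t(1-\epsilon_t)}$$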

The last thing we want to prove is the upper bound of the error. We want to show that a few weak learners can theoretically generate a stronger classifier.

The upper bound of the error
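A sketch of the standard bound on the training error, assuming each $\alpha_t$ is chosen to minimize the normalization factor $Z_t$ of round $t$ and writing $\gamma_t = 1/2 - \epsilon_t$ for the edge of the $t$-th weak learner over random guessing:

$$\mathrm{error}(f) \le \prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)} = \prod_{t=1}^{T} \sqrt{1-4\gamma_t^2} \le \exp\Big(-2\sum_{t=1}^{T}\gamma_t^2\Big)$$

If every weak learner keeps an edge $\gamma_t \ge \gamma > 0$, the bound falls at least as fast as $e^{-2\gamma^2 T}$.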

Chapter 3. Proposed Method

Our work proposes a new feature, named the “grid feature”, for boosting-based face classifiers. There are three main differences from previous work. First, we adopt the grid representation to construct our features so that we can reduce the size of the feature space. Second, unlike Viola and Jones, who limited their features to 3 kinds of Haar-like features, we use a progressive method to gradually enlarge our feature space in a “good” direction. Finally, we add the variance measurement to capture more information than the sum measurement used in Haar-like features. We give the details in the following sections.

3.1 Grid Representation

Viola and Jones used 3 kinds of Haar-like features to create nearly 160,000 features as their feature space for picking weak learners. Many features in this set are very similar. For example, Figure 3-1 shows two Haar-like features that differ only slightly in position and size. The high dependence and redundancy among these Haar-like features make training difficult. If we do not restrict ourselves to 3 kinds of Haar-like features and instead add more variety to the feature space, the number of features might grow to hundreds of millions. It is infeasible to train a classifier with such a huge feature space.

Figure 3-1 Two similar Haar-like features

Besides the training problem, a huge feature space also carries the risk of overfitting. From Eq. 2-2, we showed that the complexity of the hypothesis space influences the upper bound of the test error. The hypothesis space of a boosting-based classifier is controlled by two factors. One is the number of weak learners and the other is the feature space for picking weak learners (because the threshold is fixed after we pick the feature). When we enlarge the feature space, we also increase the complexity of the hypothesis space. As a result, even if we were able to train a classifier with hundreds of millions of features, it would still not be a good idea because of the overfitting risk. We use Figure 3-2 to illustrate the relation between the feature space and the hypothesis space.

Figure 3-2 The hypothesis space of boosting-based classifiers

Considering the above two problems, we decided to construct our features based on the grid representation. Our idea is inspired by Li Fei-Fei’s tutorial course at CVPR 2007. She showed that the grid representation performs well on category classification. In other words, when we use the grid representation, we still have enough information to classify the image.

If we apply the grid representation to the original Haar-like features, we can reduce the redundancy. For example, in Figure 3-3, we can use the feature on the right side to approximate the features on the left side.

Figure 3-3 The grid representation reduces many redundancies in Haar-like features


In order to verify the influence of the grid representation, we train two classifiers with different features. One classifier uses the original Haar-like features and the other uses “grid-Haar-like features”. A grid-Haar-like feature is a Haar-like feature whose rectangles are rounded to the nearest grid lines. The rectangles can no longer be placed at arbitrary positions, and their sizes are also limited. We can regard a grid cell as a unit, so that each Haar-like feature is composed of a whole number of cells. We use Figure 3-4 to illustrate this idea.

Figure 3-4 Grid-Haar-like features

We take 1000 faces and non-faces as our training data, and another 1000 faces and non-faces are used to test the performance of the two classifiers. The details of the training and test data are described in Chapter 4. The results are plotted in Figure 3-5.

Figure 3-5 The error rates of two classifiers with Haar-like features and grid-Haar-like features

There is a gap between the error rates of the two classifiers, but the one with grid-Haar-like features still shows notable classification ability.

In summary, the grid representation can reduce redundancy in the feature space. We compared the original Haar-like features with grid-Haar-like features.

Now we ask: if we do not use Haar-like features, how do we create features based on the grid form? Consider the 6*6 grid structure (see Figure 3-6): we sum up every pixel inside each cell. Then we can use the coefficient set {-1,0,+1} to arbitrarily combine any cells of the grid to form a new feature. In other words, we have

$$2C^{36}_{1} + 2^{2}C^{36}_{2} + 2^{3}C^{36}_{3} + \dots + 2^{36}C^{36}_{36}$$

possible combinations (considering only the combination coefficients {-1,0,+1}).
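The size of this count can be made concrete with the binomial theorem, reading $C^{36}_{k}$ as the number of ways to choose $k$ of the 36 cells and $2^{k}$ as the sign choices for the chosen cells:

$$\sum_{k=1}^{36} 2^{k} C^{36}_{k} = (1+2)^{36} - 1 = 3^{36} - 1 \approx 1.5 \times 10^{17}$$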

Figure 3-6 The 6*6 grid structure

In each feature, we subtract the sum of the pixels in the white cells from the sum of the pixels in the black cells.
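A minimal sketch of how such a feature could be evaluated, assuming a square window stored as a list of pixel rows whose side is divisible by the grid size (for example 30*30 pixels with a 6*6 grid). The names `cell_sums` and `grid_feature` are illustrative.

```python
def cell_sums(window, grid=6):
    """Sum the pixels inside each cell of a grid x grid partition of a square window."""
    size = len(window)
    cell = size // grid          # assumes size is divisible by grid, e.g. 30 // 6 = 5
    sums = [[0] * grid for _ in range(grid)]
    for r in range(size):
        for c in range(size):
            sums[r // cell][c // cell] += window[r][c]
    return sums


def grid_feature(sums, coeffs):
    """Combine cell sums with coefficients from {-1, 0, +1}.

    +1 marks a black cell, -1 a white cell and 0 an unused cell.
    """
    return sum(k * s
               for row_k, row_s in zip(coeffs, sums)
               for k, s in zip(row_k, row_s))
```

The cell sums are computed once per window, so evaluating any number of features afterwards only touches 36 values.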

It is hard to search all possible combinations, so we need a strategy for finding suitable combinations of cells as our features. This problem is very similar to the classification problem. We want to find a boundary that separates two classes, but there are too many choices. If our classifier is simple, it cannot distinguish them very well. If the classifier is very complex, it risks overfitting. When facing this problem, we use boosting to combine weak classifiers into a stronger classifier, which overcomes the dilemma above. This idea inspires us to find the combination of cells in the same way.

Since it is hard to directly find the combination of cells that forms our features, we start from simple features that combine only a few cells. We build a simple feature set and pick out several good ones. “Good” means the feature discriminates between training faces and non-faces. Then we combine some of these features to form a more complicated feature. We use Figure 3-7 to demonstrate our idea.

Figure 3-7 Combine two simple features into a more discriminative feature

There is a drawback in the above idea. We cannot ensure that all the more complicated features perform better than the simple ones, so we need to preserve the simple features for that case. In other words, each time we keep the original feature space and use the picked features to create more features. We add these new features to the original space to form a new feature space. We can repeat this process more than once to gradually enlarge our feature space. With this method, we do not construct a huge feature space at the outset (for example, Viola and Jones used ~160,000 features). The size of our feature space is determined by how many iterations we run. We use the following figures to illustrate it.

Figure 3-8 Features at iteration 1 in progressive feature space process

At the first iteration, we use only one cell to construct each feature in our feature set. There are 36 features in total.

(We do not consider the negated sum of each cell because it has the same discriminative ability.)

Figure 3-9 Features at iteration 2 in progressive feature space process

At the second iteration, we pick out all features from the last iteration and use combinations of them to create more features. The new feature space is composed of the original features and the new features. (We pick out all features only this time. In other iterations, we pick out several good features instead of all features.)

Figure 3-10 Features at iteration 3 in progressive feature space process

At the third iteration, we pick out M good features (M is a number we choose by hand) from the last iteration. We use all possible combinations of these picked features and the original 36 features to create new features. This process continues until T iterations finish. Yellow means multiply by 2.

Each time, we pick good features to enlarge our feature space. That is why we say the space progresses in a good direction. We show the relation between the progressive feature space and the hypothesis space of boosting-based classifiers in Figure 3-11.

Figure 3-11 The relation between the progressive feature space and the hypothesis space

In the above process, we start from all possible combinations of features with one cell, but this is not necessary. One can start with more cells. For example, if we start with up to 2 cells, we have $C^{36}_{1} + 2^{2}C^{36}_{2} = 2556$ features at the first iteration, and in the next iteration we will have $2556 + 2^{2}C^{2556}_{2}$ features, which costs more training time. We state this progressive process in mathematical form below.
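These counts can be checked with a few lines of Python, under the assumption that each pair of features contributes $2^2$ signed combinations (the reading that reproduces the figure 2556). The variable names are illustrative.

```python
from math import comb

cells = 36

# Single cells plus every pair of cells with coefficients from {-1, +1}.
first_iteration = comb(cells, 1) + 2**2 * comb(cells, 2)

# Keep all of those and add every signed pair built from them.
second_iteration = first_iteration + 2**2 * comb(first_iteration, 2)

print(first_iteration)   # 2556
print(second_iteration)  # 13063716
```

Starting from two cells therefore moves the second iteration from thousands of candidates to about 13 million.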

Progressive Feature Space

(2) Use the boosting algorithm to find the best $r^t$ features with thresholds

(3)

4. Use the $r^T$ features with thresholds as weak learners to construct a classifier

First, we map the pictures from the raw data space $\{x_1 \ldots x_i \ldots x_n\}$ to the grid measurement space $\{y_1^0 \ldots y_i^0 \ldots y_n^0\}$. We can regard each grid measurement as a feature. At iteration t = 1, we use {-1, +1} to combine any two of the $y_i^0$ into a new feature, which forms the first hypothesis space. Then we apply the boosting algorithm to this space to pick out a few “good” features (this is why we say the space moves in a good direction). Adding these features to the original features (the grid measurements), we get $y_i^1$. At iteration t = 2, we combine any two features of $y_i^1$ into a new feature to form the second hypothesis space, and so on. The difference in size between $y_i^t$ and $y_i^{t-1}$ is the number of features we picked at iteration t. As t increases, we gradually enlarge the hypothesis space and find more complicated features.
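A minimal sketch of this loop, assuming features are stored as coefficient vectors over the grid cells. The `pick` callback is a hypothetical stand-in for the boosting step that selects good features; it is not the selection procedure itself, and the function names are illustrative.

```python
from itertools import combinations, product


def pair_combinations(features):
    """Combine every pair of features with coefficients from {-1, +1}."""
    out = []
    for f, g in combinations(features, 2):
        for a, b in product((1, -1), repeat=2):
            out.append(tuple(a * u + b * v for u, v in zip(f, g)))
    return out


def progressive_feature_space(n_cells, iterations, pick):
    """Grow the feature space for a number of iterations.

    pick(candidates, t) must return the features to keep at iteration t.
    """
    # Iteration 0: one feature per grid cell (unit coefficient vectors).
    space = [tuple(1 if i == j else 0 for j in range(n_cells))
             for i in range(n_cells)]
    for t in range(1, iterations + 1):
        candidates = pair_combinations(space)   # hypothesis space at iteration t
        picked = pick(candidates, t)            # stand-in for the boosting step
        # Keep every earlier feature and append the newly picked ones.
        space = space + [f for f in picked if f not in space]
    return space
```

With 36 cells the first candidate set has 2520 entries, which together with the 36 single-cell features gives the 2556 mentioned above.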

We give some constructed features in the process in Figure 3-12.

Figure 3-12 Constructed features in the progressive process

Now we want to compare the features we found with the original Haar-like features.

We use 1000 training faces and 1000 training non-faces to train two classifiers. One uses our features and the other uses Haar-like features. The grid size we chose is 6 cells*6 cells, and we run 3 iterations of the progressive process. The test data are 1000 faces and 2000 non-faces. The result is shown in Figure 3-13.

Figure 3-13 Performances of two classifiers (with Haar-like features and with our features)

We can see there is still a small gap between the two performances. However, the gap is less than 3%. We can run more iterations to get even closer results. In the next section, we add another measurement to our features, which closes the gap and achieves better performance.

The last issue we want to discuss in this section is the grid size. Different grid sizes lead to different performances and training times. In our experiments, we tried three sizes in 30*30 windows: 10 cells*10 cells, 6 cells*6 cells and 5 cells*5 cells. They all use the same progressive strategy to create features. Again, the training data are 1000 faces and 1000 non-faces and the test data are 1000 faces and 2000 non-faces. The performances of the first two are very close and much better than that of the last one. We plot only the performances of the first two here.

Figure 3-14 Performances of 10*10 and 6*6 grid sizes

Although their results are very close, we notice a little overfitting in the 10*10 grid case. So in later experiments, we always use the 6*6 grid size. This also supports our claim that the grid representation is less prone to overfitting (as long as we do not use too many cells, that is, too small a cell size).

3.3 Variance Measurement

In the above sections, we always use the sum measurement in each cell of the grid. The advantage is that the computation is very easy and fast. The drawback is that the sum measurement sometimes cannot give us enough information; in some situations it has no discriminative power. For example, in Figure 3-15 (a) the sums of the two rectangles are very close, so the sum cannot distinguish the mouth, which has many details, from the smooth cheek. In Figure 3-15 (b), if we can apply a measurement that tells the difference between the two green rectangles, we can use them to construct a good feature.

Figure 3-15 (a) The sums of two rectangles are close. (b) The rectangles with new measurement may be a good feature

Because of the above defect, we bring in another measurement, the variance. We compute not only the sum of each cell but also its variance. There are two main advantages to using the variance as our measurement. First, it is a second-moment statistic, which gives us extra information beyond the first moment. Second, the computation is affordable: after we compute the sum of each cell, we already know the means, which reduces the cost of computing the variance. We use Figure 3-16 to illustrate what the feature looks like after we add the variance measurement.

Figure 3-16 The grid feature with the variance measurement

The right side is a grid feature. It is equivalent to

$$\mathrm{sum}(J) \pm \mathrm{sum}(I) \pm \mathrm{variance}(L) \pm \mathrm{variance}(J) \qquad \text{(Eq. 3-1)}$$
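A minimal sketch of how the variance of each cell can reuse the cell sum, assuming the population variance (mean of squares minus squared mean) and a square window whose side is divisible by the grid size. The name `cell_stats` is illustrative.

```python
def cell_stats(window, grid=6):
    """Return (sums, variances) for each cell of a grid x grid partition."""
    size = len(window)
    cell = size // grid              # assumes size is divisible by grid
    n = cell * cell                  # number of pixels per cell
    sums = [[0.0] * grid for _ in range(grid)]
    squares = [[0.0] * grid for _ in range(grid)]
    for r in range(size):
        for c in range(size):
            v = window[r][c]
            sums[r // cell][c // cell] += v
            squares[r // cell][c // cell] += v * v
    # The mean comes directly from the sum, so only the sum of squares is extra work.
    variances = [[squares[i][j] / n - (sums[i][j] / n) ** 2
                  for j in range(grid)] for i in range(grid)]
    return sums, variances
```

A flat cell and a textured cell with the same sum then receive different variance values, which is the case the sum measurement alone cannot separate.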

We want to know the influence of the variance measurement, so we train two classifiers again. This time we use the same training and test data so that we can compare with the results of the previous experiment. We plot the comparison in the following figure.

Figure 3-17 Performances of three classifiers (Haar-like features, grid features with sum measurement and grid features with sum & variance measurement)

After we add the variance measurement, our classifier performs better than the classifier with Haar-like features. In other words, the variance helps us distinguish faces from non-faces.

We show the first three features we find in this experiment in Figure 3-18.

Figure 3-18 The first three features with the variance measurement

Chapter 4. Experiments

In this chapter, we use the grid feature to train a face classifier with 200 weak learners. We compare its performance with that of a classifier using Haar-like features. Before we show the results, we discuss three notable issues in the training process. The first is how we collect training data. The second is the symmetric property of frontal faces. The last is the overall cascade structure.

4.1 Training Data

4.1.1 Positive Training Data

Training data play an important role in learning-based classifiers. If the data are atypical, we may not find the real boundary between the two classes. One of the most famous training databases was built by MIT CBCL. It contains 2429 faces and 4548 non-faces of 19*19 pixels as training data. Another 472 faces and 23573 non-faces are used for testing. However, both Viola and we found that such a small size could not give enough information for detection, so we built a new database for our experiments. We collected 2678 faces at a size of 30*30 pixels. Half are male and the other half are female. Most of these faces are Asian, especially Taiwanese and Japanese. Before fully constructing our database, we noticed that the marking area (only the face, or including the hair) also affects the training result. In our small test, the former was more robust, so we adopt it in our experiments. We show both in Figure 4-1.

Figure 4-1 Positive training data (a) only face (b) include the hair

4.1.2 Negative Data

The asymmetric property is the main problem when we collect the negative training data. Because many different classes belong to the negative set, we need a very large negative set to represent them. But when we train our classifier, the sizes of the two training sets should be close; otherwise the boosting algorithm will focus on the smaller set at the beginning (we give equal weights to the two training sets). After a few iterations, the weights of the misclassified data would grow too large to find the real boundary. If we randomly use a subset of the large negative set, we might get a set that is easy to separate from the positive data. The boundary learned from such training data does not generalize. So we need to choose negative data as close to the positive data as possible. We use Figure 4-2 and Figure 4-3 to demonstrate this problem.

Figure 4-2 With the wrong negative data, we cannot find the real boundary

Figure 4-3 Choose the negative data as close to positive data as possible

To overcome this problem, we first use our collected faces and the non-faces from MIT CBCL to train a classifier. The classifier is used as a filter to eliminate obvious non-face candidates. We pass many non-face patches through the filter and collect only the misclassified ones, which are more similar to faces. We show the source of the negative data and the details of the procedure in Figure 4-4.

Figure 4-4 The procedure to find our negative set

We sample uniformly in the last step because the second step creates many redundancies. The size of the negative data is also 30*30 pixels. In later experiments, we use these three negative sets to train our cascade classifiers.
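A minimal sketch of the filtering and sampling steps, assuming the filter classifier is available as a function that returns True when it accepts a patch as a face. The names `mine_negatives`, `is_face` and `keep_every` are hypothetical, and taking every k-th patch is only one possible reading of uniform sampling.

```python
def mine_negatives(patches, is_face, keep_every=10):
    """Collect non-face patches that the filter wrongly accepts, then subsample.

    patches: iterable of non-face patches.
    is_face: callable returning True when the filter classifies a patch as a face.
    keep_every: keep one of every keep_every misclassified patches.
    """
    # Patches rejected by the filter are easy negatives and are dropped.
    hard = [p for p in patches if is_face(p)]
    # Uniform subsampling thins out near-duplicate patches.
    return hard[::keep_every]
```

For example, if 50 of 100 patches pass the filter, `keep_every=10` leaves 5 of them in the negative set.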

Here we show some samples of each set.

Figure 4-5 (a) negative set A (b) negative set B (c) negative set C

4.2 Symmetric Property of Front Face

We all know that human faces are highly symmetric, and we want to add this knowledge to our training process. So when we train our frontal face classifiers, we only find features on the left half of the face. Then we map each feature from this side to the right half to form a new symmetric feature. In the ideal case, we should double the threshold of each original feature because the new feature has twice as many cells. But since most human faces still have some asymmetric parts, we need to find a new suitable threshold. The threshold changes with the training data. If our negative training data are highly asymmetric, we can use a low threshold to achieve good discrimination, and vice versa.

In our experience, if we normalize the original threshold to 1, the best new threshold, the one that separates most training faces from non-faces, usually lies between 1.45 and 1.75. Later, when we train the cascade classifiers, we adjust the threshold according to the training set.

We show how we map half features to new symmetric features in Figure 4-6. In Figure 4-7 we plot the error rates for different thresholds (the original threshold is normalized to 1).
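A minimal sketch of the mapping, assuming a feature is stored as a grid of coefficients with an even number of columns and that only the left half carries non-zero entries. The names `mirror_feature` and `symmetric_threshold` are illustrative, and the default scale of 1.6 is just an arbitrary value inside the 1.45 to 1.75 range reported above.

```python
def mirror_feature(coeffs):
    """Copy the left-half coefficients of each row onto the mirrored right-half cells."""
    mirrored = []
    for row in coeffs:
        half = len(row) // 2
        left = list(row[:half])
        # Column j on the left maps to column (width - 1 - j) on the right.
        mirrored.append(left + left[::-1])
    return mirrored


def symmetric_threshold(threshold, scale=1.6):
    """Scale the half-face threshold for the symmetric feature.

    A perfectly symmetric face would use scale=2; real faces need less.
    """
    return threshold * scale
```

The scale would be tuned per training set, as described above.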

