

Chapter 2 Hidden Markov Model (HMM)

2.3 Learning Process Of HMM

The most difficult problem of HMMs is to determine a method to adjust the model parameters (A, B, π) to maximize the probability of the observation sequence given the model. There is no known way to analytically solve for the model which maximizes the probability of the observation sequence. In fact, given any finite observation sequence as training data, there is no optimal way of estimating the model parameters. We can, however, choose λ = (A, B, π) such that P(O|λ) is locally maximized using an iterative procedure such as the Baum-Welch method.

In the learning phase, each HMM must be trained so that it is most likely to generate the symbol patterns for its category. Training an HMM means optimizing the model parameters (A, B, π) to maximize the probability of the observation sequence, P(O|λ). The Baum-Welch algorithm is used for these estimations.

Define the backward variable

βt(i) = P(Ot+1 Ot+2 … OT | qt = Si, λ),

i.e., the probability of the partial observation sequence from t+1 to the end, given state Si at time t and the model λ. βt(i) can also be solved inductively, in a manner similar to that used for the forward variable αt(i), as follows:

(1) Initialization:

βT(i) = 1,  1 ≤ i ≤ N

(2) Induction:

βt(i) = Σj=1..N aij bj(Ot+1) βt+1(j),  t = T−1, T−2, …, 1,  1 ≤ i ≤ N

The initialization step (1) arbitrarily defines βT(i) to be 1 for all i. Step (2), which is illustrated in Figure 2-4, shows that in order to have been in state Si at time t and to account for the observation sequence from time t+1 on, you have to consider all possible states Sj at time t+1, accounting for the transition from Si to Sj (the aij term) as well as the observation Ot+1 in state j (the bj(Ot+1) term), and then account for the remaining partial observation sequence from state j (the βt+1(j) term).

Figure 2-4 Illustration of the sequence of operations required for the computation of the backward variable βt(i)
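As a concrete illustration, the induction above can be computed with a short routine. This is a minimal sketch in Python/NumPy; the function name and matrix layout are our own conventions, not part of the original formulation.

```python
import numpy as np

def backward(A, B, pi, obs):
    """Backward pass: beta[t, i] = P(O_{t+1} ... O_T | q_t = S_i, lambda).

    A:   (N, N) transition matrix, A[i, j] = a_ij
    B:   (N, M) emission matrix, B[j, k] = b_j(v_k)
    obs: observation sequence as a list of symbol indices
    (pi is unused here; it is kept only to mirror lambda = (A, B, pi).)
    """
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[T - 1, :] = 1.0                    # (1) initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):          # (2) induction, t = T-1, ..., 1
        # beta_t(i) = sum_j a_ij * b_j(O_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta
```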

We define the variable

γt(i) = P(qt = Si | O, λ),

i.e., the probability of being in state Si at time t, given the observation sequence O and the model λ.

In order to describe the procedure for re-estimation (iterative update and improvement) of the HMM parameters, we first define εt(i, j), the probability of being in state Si at time t and state Sj at time t+1, given the model and the observation sequence, i.e.,

εt(i, j) = P(qt = Si, qt+1 = Sj | O, λ).

The sequence of events leading to the conditions required by this definition is illustrated in Figure 2-5. It should be clear, from the definitions of the forward and backward variables, that we can write εt(i, j) in the form

εt(i, j) = αt(i) aij bj(Ot+1) βt+1(j) / P(O|λ),

where the numerator is just P(qt = Si, qt+1 = Sj, O | λ); division by P(O|λ) gives the desired probability measure.

Figure 2-5 Illustration of the sequence of operations required for the computation of the joint event that the system is in state Si at time t and state Sj at time t+1

We have previously defined γt(i) as the probability of being in state Si at time t, given the observation sequence and the model; hence we can relate γt(i) to εt(i, j) by summing over j, giving

γt(i) = Σj=1..N εt(i, j).

If we sum γt(i) over the time index t, we get a quantity which can be interpreted as the expected (over time) number of times that state Si is visited, or equivalently, the expected number of transitions made from state Si (if we exclude the time slot t = T from the summation). Similarly, summation of εt(i, j) over t (from t = 1 to t = T−1) can be interpreted as the expected number of transitions from state Si to state Sj. That is,

Σt=1..T−1 γt(i) = expected number of transitions from Si,
Σt=1..T−1 εt(i, j) = expected number of transitions from Si to Sj.

Using the above formulas (and the concept of counting event occurrences), we can give a method for re-estimation of the parameters of an HMM. A set of reasonable re-estimation formulas for π, A, and B is

π̄i = γ1(i)  (expected frequency of being in state Si at time t = 1),

āij = Σt=1..T−1 εt(i, j) / Σt=1..T−1 γt(i),

b̄j(k) = Σt: Ot=vk γt(j) / Σt=1..T γt(j).

If the current model is λ = (A, B, π) and the re-estimated model computed from the formulas above is λ̄ = (Ā, B̄, π̄), then it has been proven by Baum and his colleagues that either (1) the initial model λ defines a critical point of the likelihood function, in which case λ̄ = λ; or (2) model λ̄ is more likely than model λ in the sense that P(O|λ̄) > P(O|λ), i.e., we have found a new model λ̄ from which the observation sequence is more likely to have been produced.

Based on the above procedure, if we iteratively use λ̄ in place of λ and repeat the re-estimation calculation, we can improve the probability of O being observed from the model until some limiting point is reached. The final result of this re-estimation procedure is called a maximum likelihood estimate of the HMM.
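One full re-estimation step as described above can be sketched as follows. This is a simplified single-sequence Python/NumPy implementation, without the scaling normally used to avoid numerical underflow on long sequences; all function names are our own.

```python
import numpy as np

def forward(A, B, pi, obs):
    # alpha[t, i] = P(O_1 ... O_t, q_t = S_i | lambda)
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    # beta[t, i] = P(O_{t+1} ... O_T | q_t = S_i, lambda)
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch re-estimation step: returns (A_bar, B_bar, pi_bar)."""
    T, N, M = len(obs), A.shape[0], B.shape[1]
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    p_obs = alpha[-1].sum()                    # P(O | lambda)
    gamma = alpha * beta / p_obs               # gamma[t, i]
    # eps[t, i, j] = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / P(O|lambda)
    eps = np.array([
        alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        for t in range(T - 1)
    ]) / p_obs
    pi_bar = gamma[0]
    A_bar = eps.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    obs_arr = np.array(obs)
    B_bar = np.stack(
        [gamma[obs_arr == k].sum(axis=0) for k in range(M)], axis=1
    ) / gamma.sum(axis=0)[:, None]
    return A_bar, B_bar, pi_bar
```

Iterating `baum_welch_step` and feeding λ̄ back in as λ reproduces the hill-climbing behavior guaranteed by Baum's result: the likelihood P(O|λ) never decreases between iterations.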

Chapter 3

Proposed Action Recognition Algorithm

3.1 System Overview

The system architecture consists of three parts: feature extraction, mapping features to symbols, and action recognition, as shown in Figure 3-1.

For feature extraction, we use background subtraction, thresholding the difference between the current frame and the background image to segment the foreground object. After foreground segmentation, we extract the posture contour from the human silhouette. As the last phase of feature extraction, a star skeleton technique is applied to describe the posture contour. The extracted star skeletons serve as feature vectors for later action recognition. The process flow of feature extraction is shown in Figure 3-1 (a).

After feature extraction, Vector Quantization (VQ) is used to map the feature vectors to a symbol sequence. We build a posture codebook containing representative feature vectors of each action, and each feature vector in the codebook is assigned a symbol codeword. An extracted feature vector is mapped to the symbol whose codeword is the most similar (minimal-distance) feature vector in the codebook. The output of the feature-to-symbol mapping module is thus a sequence of posture symbols.

The action recognition module involves two phases: training and recognition. We use Hidden Markov Models to model the different actions; training optimizes the model parameters for the training data. Recognition is achieved by probability computation and selection of the maximum probability. The process flows of training and recognition are shown in Figure 3-1 (b) and (c).

Foreground Segmentation → Border Extraction → Star Skeleton

(a) Process flow of feature extraction

(b) Process flow of training

(c) Process flow of recognition

Figure 3-1 Illustration of the system architecture

Figure 3-2 A walk action is a series of postures over time

3.2 Feature Extraction

Human action is composed of a series of postures over time as shown in Figure 3-2.

A good way to represent a posture is to use its boundary shape. However, using the whole human contour to describe a posture is inefficient, since each border point is very similar to its neighboring points. Although techniques like Principal Component Analysis can reduce the redundancy, they are computationally expensive due to matrix operations. On the other hand, simple information such as human width and height may be too rough to represent a posture. Consequently, representative features must be extracted to describe a posture, and the human skeleton seems to be a good choice.

There are many standard techniques for skeletonization, such as thinning and distance transformation. However, these techniques are computationally expensive and, moreover, highly susceptible to noise on the target boundary. Therefore, a simple, real-time, robust technique called the star skeleton [26] is used to extract the features of our action recognition scheme.

3.3 Feature Definition

Vectors from the centroid of the human body to the local maxima are defined as the feature vector, called the star vector. The head, two hands, and two legs are usually outstanding parts of the extracted human contour, so they can properly characterize the shape information. As they are usually local maxima of the star skeleton, we define the dimension of the feature vector to be five. For postures in which, for example, the two legs overlap or one hand is occluded, the number of protruding portions falls below five; in this case zero vectors are added as padding. Similarly, we can adjust the low-pass filter to reduce the number of local maxima for postures with more than five notable parts.

3.3.1 Star Skeletonization

The concept of the star skeleton is to connect the centroid to the gross extremities of a human contour. To find the gross extremities, the distances from the centroid to each border point are processed in a clockwise or counter-clockwise order. Extremities are located at representative local maxima of the distance function. Since noise increases the difficulty of locating gross extremities, the distance signal must be smoothed with a smoothing filter or a low-pass filter in the frequency domain. Local maxima are then detected by finding zero-crossings of the smoothed difference function. The star skeleton is constructed by connecting these points to the target centroid. The process flow for an example human contour is shown in Figure 3-3, where points A, B, C, D and E are local maxima of the distance function. The details of the star skeleton are as follows:

Star Skeleton Algorithm (as described in [26])
Input: Human contour
Output: A skeleton in star fashion

1. Determine the centroid of the target image border (xc, yc):

   xc = (1/Nb) Σi xi,  yc = (1/Nb) Σi yi,

   where Nb is the number of border points and (xi, yi) are the points of the contour.

2. Calculate the distances d(i) from the centroid (xc, yc) to each border point (xi, yi), processed in a clockwise or counter-clockwise order:

   d(i) = sqrt((xi − xc)² + (yi − yc)²)

3. Smooth the distance signal d(i) into d̂(i) for noise reduction by using a linear smoothing filter or a low-pass filter in the frequency domain.

4. Take the local maxima of d̂(i) as extremal points, and construct the star skeleton by connecting them to the centroid (xc, yc). Local maxima are detected by finding zero-crossings of the difference function

   δ(i) = d̂(i) − d̂(i − 1)   (22)

Figure 3-3 Process flow of star skeletonization
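The four steps of the algorithm can be sketched as follows, using the frequency-domain low-pass variant of step 3. This is a minimal Python/NumPy version; the FFT cutoff parameter and its default value are our own illustrative choices, not taken from [26].

```python
import numpy as np

def star_skeleton(contour, cutoff=0.05):
    """contour: (Nb, 2) array of border points in (counter-)clockwise order.
    Returns the centroid (xc, yc) and the extremal border points."""
    centroid = contour.mean(axis=0)                  # step 1: (xc, yc)
    diff = contour - centroid
    d = np.hypot(diff[:, 0], diff[:, 1])             # step 2: distance signal d(i)
    # step 3: low-pass filter d(i) in the frequency domain -> d_hat(i)
    spectrum = np.fft.rfft(d)
    keep = max(1, int(cutoff * len(spectrum)))       # assumed cutoff fraction
    spectrum[keep:] = 0.0
    d_hat = np.fft.irfft(spectrum, n=len(d))
    # step 4: local maxima = zero-crossings of delta(i) = d_hat(i) - d_hat(i-1)
    delta = np.diff(d_hat)
    extremes = [i for i in range(1, len(delta)) if delta[i - 1] > 0 >= delta[i]]
    return centroid, contour[extremes]
```

For example, on an elongated elliptical contour the two extremal points found are the ends of the major axis, which is exactly the "gross extremities" behavior the algorithm is designed to produce.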

3.3.2 Feature Definition

One technique often used to analyze human action or gait is the motion of skeletal components. Therefore, we may want to find which part of the body (e.g., head, hands, legs) each of the five local maxima represents. In [26], the angle between the two legs is used to distinguish walking from running. However, this relies on assumptions such as that the feet are located at the lower extremes of the star skeleton. These assumptions do not fit other actions; for example, the lower extremes of a crawl may be the hands. Moreover, the number of extremal points of the star skeleton varies with the human shape and the low-pass filter used, so the gross extremes do not necessarily correspond to particular parts of the body. Because of the difficulty of determining which body part each local maximum represents, we simply use the distribution of the star skeleton as the feature for action recognition.

As a feature, the dimension of the star skeleton must be fixed. The feature vector is therefore defined as five vectors from the centroid to the shape extremes, because the head, two hands, and two legs are usually local maxima. For postures with more than five contour extremes, we adjust the low-pass filter to lower the dimension of the star skeleton to five. Conversely, zero vectors are added for postures with fewer than five extremes.

Since the feature is a vector, its absolute values vary for people of different sizes and shapes, so normalization must be performed to obtain the relative distribution of the feature vector. This is achieved by dividing the x-components of the vectors by the human width and the y-components by the human height.
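The normalization can be expressed directly. A trivial sketch follows; the (5, 2) array layout of the star vector and the function name are our assumptions.

```python
import numpy as np

def normalize_star_vector(star_vec, width, height):
    """star_vec: (5, 2) array of (dx, dy) sub-vectors from the centroid.
    width/height: bounding-box size of the extracted human silhouette."""
    out = np.asarray(star_vec, dtype=float).copy()
    out[:, 0] /= width     # x-components divided by the human width
    out[:, 1] /= height    # y-components divided by the human height
    return out
```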

3.4 Mapping features to symbols

To apply HMMs to time-sequential video, the extracted feature sequence must be transformed into a symbol sequence for later action recognition. This is accomplished by a well-known technique called Vector Quantization [27].

3.4.1 Vector Quantization

For vector quantization, codewords gj ∈ Rⁿ, which represent the centers of the clusters in the feature space Rⁿ, are needed. Codeword gj is assigned to symbol vj. Consequently, the size of the codebook equals the number of HMM output symbols. Each feature vector fi is transformed into the symbol assigned to the codeword nearest to it in the feature space. That is, fi is transformed into symbol vj if j = argminj d(fi, gj), where d(x, y) is the distance between vectors x and y.
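The mapping fi → vj can be sketched as follows (a minimal Python/NumPy version; plain Euclidean distance stands in here for the star distance defined in Section 3.4.2):

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector f_i to the symbol j = argmin_j d(f_i, g_j),
    where row j of `codebook` is the codeword g_j for symbol v_j."""
    symbols = []
    for f in features:
        dists = np.linalg.norm(codebook - f, axis=1)   # d(f_i, g_j) for all j
        symbols.append(int(np.argmin(dists)))
    return symbols
```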

Figure 3-4 The concept of vector quantization in action recognition

For action recognition, we select m feature vectors of representative postures from each action as codewords in the codebook. An extracted feature is then mapped to the symbol whose codeword is the most similar (minimal-distance) feature vector in the codebook. The concept of the mapping process is shown in Figure 3-4.

The codebook in the figure contains only some representative star skeletons of walk, to illustrate the mapping concept. In the mapping process, the similarity between feature vectors needs to be determined; we therefore define a distance between feature vectors, called the star distance.

3.4.2 Star Distance

Since the star skeleton is a five-dimensional vector of sub-vectors, the star distance between two feature vectors S and T could first be defined as the sum of the Euclidean distances of the five corresponding sub-vectors. However, consider the star skeletons S and T in Figure 3-5 (a): the two skeletons are similar, yet this distance is large because the sub-vectors are mismatched. We therefore modify the distance measurement so that each sub-vector finds its closest counterpart, as shown in Figure 3-5 (b). The star distance is then defined as the sum of the Euclidean distances of the five sub-vectors under such a matching. For simplicity, the star distance is computed as the minimal sum over all permutations of the five sub-vectors; a faster algorithm for the star distance calculation could be substituted.

Star Distance = minσ Σk=1..5 ‖Sk − Tσ(k)‖, where σ ranges over all permutations of {1, …, 5}.

Figure 3-5 Illustration of star distance (a) Mismatch (b) Greedy Match
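Under the minimal-permutation definition, the star distance can be computed by brute force over the 5! = 120 permutations. A sketch follows; the function name is ours.

```python
from itertools import permutations

import numpy as np

def star_distance(S, T):
    """Minimal sum of Euclidean distances between the sub-vectors of S and
    T, taken over all permutations of T's sub-vectors (5! = 120 for 5)."""
    n = len(S)
    return min(
        sum(np.linalg.norm(S[k] - T[p[k]]) for k in range(n))
        for p in permutations(range(n))
    )
```

Because the distance minimizes over permutations, two skeletons whose sub-vectors are identical but stored in a different order have star distance zero, which is exactly the mismatch case of Figure 3-5 (a).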

3.5 Action Recognition

The idea behind using HMMs is to construct a model for each of the actions we want to recognize. HMMs give a state-based representation of each action; the number of states was determined empirically. After training each action model λi, we calculate P(O|λi), the probability of model λi generating the observed posture sequence O, for each action model. We then recognize the action as the one represented by the most probable model.
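The selection of the most probable model amounts to an argmax over the per-model likelihoods. A schematic sketch follows; the scoring interface (each model represented by a function returning log P(O|λi)) is our assumption.

```python
def recognize(models, obs):
    """Return the action name whose model scores the observation sequence
    highest; `models` maps action name -> function giving log P(O | lambda_i)."""
    return max(models, key=lambda name: models[name](obs))
```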

3.6 Action Series Recognition

The discussion above concerns classification of a single action. The following is a more complex situation: a person performs a series of actions, and we want to recognize which action is being performed at the current time. One might try to recognize the action by classifying the posture at the current time T. However, there is a problem. By observation, postures can be divided into two classes: key postures and transitional postures. Key postures uniquely belong to one action, so that people can recognize the action from a single key posture. Transitional postures are interim postures between two actions, and even a human cannot recognize the action from a single transitional posture. Therefore, a human action cannot be recognized from the posture of a single frame; instead, we refer to a period of posture history to determine the action being performed. A sliding-window scheme is applied for real-time action recognition, as shown in Figure 3-6. At the current time T, the symbol subsequence between T − W and T, which represents a period of posture history, is used to recognize the current action by computing the maximal likelihood, where W is the window size. In our implementation, W is set to thirty frames, the average gait cycle of the testing sequences. Note that standing is recognized as walking here, and the unknown label at the beginning is due to insufficient history. With the sliding-window scheme, the action a person is performing can be determined frame by frame.

Figure 3-6 Sliding-window scheme for action series recognition
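The sliding-window scheme can be sketched as follows (a minimal Python version; the classifier is passed in as a function, e.g. the per-model likelihood argmax of Section 3.5, and the function name and "unknown" label are our own conventions):

```python
def sliding_window_recognition(symbols, classify, W=30):
    """Label each frame T by classifying the symbol subsequence between
    T - W and T; emit 'unknown' while fewer than W frames of history exist."""
    labels = []
    for t in range(1, len(symbols) + 1):
        if t < W:
            labels.append("unknown")        # not enough posture history yet
        else:
            labels.append(classify(symbols[t - W:t]))
    return labels
```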

Chapter 4

Experiment Results and Discussion

To test the performance of our approach, we implemented a system capable of recognizing ten different actions. The evaluation contains two parts: (1) single action recognition and (2) recognition over a series of actions. In (1), a confusion matrix is used to present the recognition result. In (2), we compare the real-time recognition result to ground truth labeled by a human.

4.1 Single action recognition

The proposed action recognition system has been tested on real human action videos.

For simplicity, we assumed a uniform background in order to extract human regions with less difficulty. The categories to be recognized were ten types of human actions:

‘walk’, ‘sidewalk’, ‘pick up’, ‘sit’, ‘jump 1’, ‘jump 2’, ‘push up’, ‘sit up’, ‘crawl 1’, and ‘crawl 2’. Five persons performed each of the 10 action types 3 times. The video was captured by a TV camera (NTSC, 30 frames/second) and digitized at 352x240 pixel resolution. The duration of each video clip was from 40 to 110 frames. This range was chosen experimentally: shorter sequences do not suffice to characterize the action, while longer sequences make the learning phase very hard. Figure 4-1 shows some example video clips of the 10 types of human actions. To calculate the recognition rate, we used a leave-out method: all data were separated into 3 categories, each containing the 5 persons performing the ten actions once. One category was used as training data for building an HMM for each action type, and the other two were used as testing data.

(a) walk

(b) sidewalk

(c) sit

(d) pick up

(e) jump 1

(f) jump2

(g) push up

(h) sit up

(i) crawl 1

(j) crawl 2

Figure 4-1 Example clips of each action type

The features of each action type were extracted using the star skeleton; feature examples of each action are shown in Figure 4-2. For vector quantization, we manually selected m representative skeleton features for each action as codewords in the codebook. In our implementation, m is set to five for simple actions like sidewalk and jump 2, and to ten for the other eight actions, so the total number of HMM symbols was 90. We build the codebook in one direction first and reverse all the feature vectors for recognition of actions performed in the opposite direction.

(a) walk

(b) sidewalk

(c) sit

(d) pick up

(e) jump 1

(f) jump 2

(g) push up

(h) sit up

(i) crawl 1

(j) crawl 2

Figure 4-2 Features of each action type using star skeleton

We use a sit action video to explain the recognition process. The sit action is composed of a series of postures. Star skeletons are used for posture description, mapping the sit action into a feature sequence. The feature sequence is then transformed into a symbol sequence O by vector quantization. Each trained action model computes the probability of generating the symbol sequence O; the log-scale probabilities are shown in Figure 4-3. The sit model has the maximum probability, so the video is recognized as sit.

Figure 4-3 Complete recognition process of sit action

Table 1. Confusion matrix for recognition of testing data

Finally, Table 1 shows the confusion matrix for recognition of the testing data. The left side lists the ground-truth action types and the top lists the recognized action types. The numbers on the diagonal are the counts of each action correctly classified; the off-diagonal numbers are misclassifications, showing which actions the system confuses. From this table, we can see that most of the testing data were accurately classified: a recognition rate of 98% was achieved by the proposed method. The only two confusions occurred between sit and pick up. Inspecting the two misclassified clips, both contain a large portion of body bending, and bending does not uniquely belong to sit or pick up, so the two action models are confused. In our opinion, a transitional action, bending, could be added to better distinguish pick up and sit.

4.2 Recognition over a series of actions

In this experiment, a person performs a series of different actions, and the system automatically recognizes the action type in each frame. Three different action-series video clips are used to test the proposed system. We compare the recognition result to human-labeled ground truth to evaluate the system performance.

The first test sequence is “Sit up – get up – Jump 2 – turn about – Walk – turn about – Crawl 1”. The second test sequence is “Sidewalk – turn about – Walk – turn about – Pick up”. The third test sequence is “Crawl 2 – get up – turn about – Walk – turn about – Jump 2”. Each sequence contains about 3-4 defined action types and 1-2 undefined action types (transitional actions). Figures 4-4, 4-5, and 4-6 (a) show the original image sequences (selected frames) of the three action series respectively.

The proposed system recognizes the action type using the sliding-window scheme.

Figures 4-4, 4-5, and 4-6 (b) show the recognition results. The x-coordinate of each graph is the frame number, and the y-coordinate indicates the recognized action. The red line is the ground truth defined by human observation, and the blue line is the recognized action type. An unknown period in the ground truth is a time during which the person performs actions not among the ten defined categories: the first unknown period is get up, and the second and third are turn about. The unknown period in the recognition result is due to the posture history being insufficient (shorter than the window size).
