Human Action Sequence Learning and Recognition

Learning Atomic Human Actions Using Variable-Length Markov

4.2 The Proposed Method for Atomic Action Recognition

4.2.2 Human Action Sequence Learning and Recognition

Atomic Action Learning

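Using the codebook of posture templates, an input sequence of postures $\{b_1, b_2, \ldots, b_n\}$ can be converted into a symbol sequence $\{a_{q(1)}, \ldots, a_{q(n)}\}$, where $a_{q(t)}$ is the posture template assigned to $b_t$. The VLMMs of the atomic actions are then trained with the method outlined in Section 2.3.1. These VLMMs are actually Markov chains of different orders. For simplicity, we transform all the high-order Markov chains into first-order Markov chains by augmenting the state space. For example, the probability of a $d_i$-th order Markov chain with state space $S$ is given by

$$P(X_t = s_t \mid X_{t-1} = s_{t-1}, \ldots, X_{t-d_i} = s_{t-d_i}), \quad s_t, \ldots, s_{t-d_i} \in S.$$

To transform it into a first-order Markov chain, a new state space is constructed such that both $Y_{t-1} = (X_{t-1}, \ldots, X_{t-d_i})$ and $Y_t = (X_t, \ldots, X_{t-d_i+1})$ are included in the new state space. As a result, the high-order Markov chain can be formulated as the following first-order Markov chain [24]:

$$P(Y_t \mid Y_{t-1}) = P(X_t \mid X_{t-1}, \ldots, X_{t-d_i}).$$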

Hereafter, we assume that every VLMM has been transformed into a first-order Markov model.
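The state-space augmentation can be illustrated with a minimal Python sketch. It is not the implementation used in this chapter: the function name `augment_to_first_order`, the dictionary layout, and the toy probabilities are assumptions made for illustration, and the sketch uses a fixed order d for every context, whereas a VLMM has contexts of varying length. Each length-d context becomes one state of the first-order chain, and a transition slides the context window by one symbol.

```python
def augment_to_first_order(high_order):
    """Turn a d-th order chain into a first-order chain over d-tuples.

    high_order maps a context (tuple of the last d symbols, oldest first)
    to a dict {next_symbol: probability}.
    """
    first_order = {}
    for context, next_probs in high_order.items():
        row = {}
        for symbol, prob in next_probs.items():
            # Slide the window: drop the oldest symbol, append the new one.
            new_state = context[1:] + (symbol,)
            row[new_state] = prob
        first_order[context] = row
    return first_order


# Toy second-order chain over the symbols A and B (hypothetical numbers).
second_order = {
    ("A", "A"): {"A": 0.1, "B": 0.9},
    ("A", "B"): {"A": 0.5, "B": 0.5},
    ("B", "A"): {"A": 0.7, "B": 0.3},
    ("B", "B"): {"A": 1.0},
}
chain = augment_to_first_order(second_order)
```

In the augmented chain, the transition from state (A, A) to state (A, B) carries the probability of emitting B after the context AA, so no information from the higher-order model is lost.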

Atomic Action Recognition

After the VLMMs are trained from the training sequence, the VLMM recognition technique, mentioned in Section 2.3.2, can be applied to atomic action recognition. This VLMM recognition technique works well for natural language processing. However, since natural language processing and human action analysis are inherently different, two problems must be solved before the VLMM technique can be applied to atomic action recognition. First, the VLMM technique cannot handle the dynamic time warping problem; hence VLMMs cannot recognize atomic actions when they are performed at different speeds.

Second, the VLMM technique does not include a model for noise observation, so the system is less tolerant of image preprocessing errors.

First, note that the speed of the action affects the number of repeated symbols in the constructed symbol sequence: a slower action produces more repeated symbols. To eliminate this speed-dependent factor, the input symbol sequence is preprocessed to merge repeated symbols. VLMMs corresponding to different atomic actions are then trained with the preprocessed symbol sequences, similar to the method proposed by Galata et al. [22]. However, this approach is only valid when the observed noise is negligible, which is an impractical assumption. The recognition rate of the constructed VLMMs is low because image preprocessing errors may cause repeated postures to be identified as different symbols. To incorporate a noise observation model, the VLMMs trained with unrepeated sequences must be modified to recognize input sequences with repeated symbols. Let $a_{ij}$ denote the state transition probability from state $i$ to state $j$. Initially, $a_{ii} = 0$ because the training data contains no repeated symbols. The self-transition probability is updated by $a_{ii}^{new} = \hat{a}_{ii} + \delta$, where $\hat{a}_{ii}$ is the self-transition probability estimated from the original training sequences and $\delta$ is a small positive number to prevent the over-fitting problem [49]. Note that if the self-transition probability is zero, an action sequence that contains repetitions will be assigned zero probability, so the system will fail on slower action sequences. To overcome this limitation, we add the small positive number $\delta$ to the self-transition probability; this parameter can be determined by cross-validation. The other transition probabilities must also be updated as $a_{ij}^{new} = a_{ij}^{old}(1 - a_{ii}^{new})$ for $j \neq i$, so that the outgoing probabilities of each state still sum to one. For example, if the input training symbol sequence is “AAABBAAACCAAABB,” the preprocessed training symbol sequence becomes “ABACAB.” The VLMM constructed with the original input training sequence is shown in Figure 4.2(a), while the original VLMM and the modified VLMM constructed with the preprocessed training sequence are shown in Figures 4.2(b) and 4.2(c), respectively.
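The preprocessing and the self-transition update can be sketched in Python for the example sequence above. This is a simplified illustration, not the chapter's implementation: it estimates plain first-order transition probabilities instead of training a VLMM, and the function names and the value of `delta` are assumptions chosen for the example.

```python
from collections import Counter, defaultdict
from itertools import groupby


def merge_repeats(sequence):
    """Collapse each run of identical symbols into a single symbol."""
    return "".join(symbol for symbol, _ in groupby(sequence))


def transition_probs(sequence):
    """Estimate first-order transition probabilities from one sequence."""
    counts = defaultdict(Counter)
    for prev, cur in zip(sequence, sequence[1:]):
        counts[prev][cur] += 1
    return {
        state: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for state, row in counts.items()
    }


def add_self_transitions(merged, original, delta):
    """Set a_ii to (estimate from original data + delta), rescale the rest."""
    updated = {}
    for state, row in merged.items():
        # Self-transition estimated from the unmerged data, plus delta.
        a_ii = min(original.get(state, {}).get(state, 0.0) + delta, 1.0)
        # Rescale the other transitions so that the row still sums to one.
        new_row = {nxt: p * (1.0 - a_ii) for nxt, p in row.items()}
        new_row[state] = a_ii
        updated[state] = new_row
    return updated


raw = "AAABBAAACCAAABB"
merged = merge_repeats(raw)  # "ABACAB"
model = add_self_transitions(
    transition_probs(merged), transition_probs(raw), delta=0.01
)
```

Because every row of the merged model sums to one before the update, multiplying the off-diagonal entries by (1 − a_ii) and adding a_ii on the diagonal keeps each row normalized.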

Figure 4.2. (a) The VLMM constructed with the original input training sequence; (b) the original VLMM constructed with the preprocessed training sequence; (c) the modified VLMM, which includes the possibility of self-transition.

Next, a noise observation model is introduced to convert a VLMM into an HMM. Note that the output of a VLMM determines its state transition and vice versa because the state of a VLMM is observable. In general, the possible output is restricted to several discrete symbols. However, due to the noise caused by image preprocessing, the symbol sequence corresponding to an atomic action includes some randomness, which can make the action sequence unrecognizable by the VLMMs. Therefore, we propose to modify the symbol observation model as follows. Suppose that the output symbol of a VLMM is $q_t$ at time $t$, and its posture template retrieved from the codebook is $a_{q_t}$. If the VLMM is the right model, the extracted silhouette image $o_t$ will not deviate too much from its corresponding posture template, provided that the segmentation result does not contain any major errors. Due to observation noise, the silhouette image is a random variable, and so is the CSC distance $D^{csc}(o_t, a_{q_t})$. The distribution of this distance can be learned from the training data; an example is shown in Figure 4.3. In this example, it is clear that a Gaussian distribution can be applied to model the CSC distance, i.e.

$$P(o_t \mid q_t, \Lambda) \propto \exp\!\left(-\frac{D^{csc}(o_t, a_{q_t})^2}{2\sigma^2}\right).$$

The standard deviation $\sigma$ of this distribution is estimated using the maximum-likelihood technique.

Figure 4.3. The distribution of observation error, obtained using the training data.
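A minimal Python sketch of such an observation model follows. It assumes a zero-mean Gaussian on the CSC distance, in which case the maximum-likelihood estimate of the standard deviation is the root mean square of the training distances. The function names and the distance values are hypothetical, and the normalization constant used in the chapter may differ.

```python
import math


def estimate_sigma(distances):
    """Maximum-likelihood sigma of a zero-mean Gaussian: sqrt(mean(d^2))."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def log_observation_density(distance, sigma):
    """Log of the zero-mean Gaussian density evaluated at a CSC distance."""
    return -0.5 * (distance / sigma) ** 2 - math.log(
        sigma * math.sqrt(2.0 * math.pi)
    )


# Hypothetical CSC distances between training silhouettes and their templates.
training_distances = [0.3, 0.4, 0.0, 0.5]
sigma = estimate_sigma(training_distances)
```

A silhouette close to the template of a state receives a higher observation density for that state, so a mislabeled posture lowers the likelihood of the correct model without forcing it to zero.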

Note that the VLMM has now been converted into a first-order Markov chain. If the VLMM's observation model is detached from the symbol of a state, then the VLMM becomes a standard HMM. The probability of the observed silhouette image sequence, $O = o_1 o_2 \ldots o_T$, for a given model $\Lambda$ can be evaluated by the HMM forward/backward procedure with proper scaling [49]. Finally, the category $i^*$ obtained with the following equation is deemed to be the recognition result:

$$i^* = \arg\max_i \log[P(O \mid \Lambda_i)]. \quad (4.3)$$
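The evaluation and decision steps can be sketched in Python as a scaled forward pass followed by an arg max over the candidate models. The data layout is an assumption made for illustration: each model is a tuple of an initial distribution, a transition matrix, and a table `emit_probs[t][i]` holding the observation probability of frame t in state i, computed beforehand because each model compares the frames with its own posture templates.

```python
import math


def log_likelihood(init, trans, emit_probs):
    """Scaled forward procedure returning log P(O | model)."""
    n = len(init)
    alpha = [init[i] * emit_probs[0][i] for i in range(n)]
    log_prob = 0.0
    for t in range(len(emit_probs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit_probs[t][j]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        # Rescale to avoid underflow and accumulate the log of the scale.
        alpha = [a / scale for a in alpha]
        log_prob += math.log(scale)
    return log_prob


def recognize(candidates):
    """candidates: list of (init, trans, emit_probs), one per action model."""
    scores = [log_likelihood(*c) for c in candidates]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores


# Toy two-state left-right model with hypothetical numbers.
trans = [[0.5, 0.5], [0.0, 1.0]]
good = ([1.0, 0.0], trans, [[0.9, 0.1], [0.2, 0.8]])
bad = ([1.0, 0.0], trans, [[0.1, 0.9], [0.2, 0.8]])
best_index, scores = recognize([bad, good])
```

Summing the logarithms of the per-frame scale factors gives the same value as the logarithm of the unscaled forward probability, but stays numerically stable for long sequences.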

4.3 Experiments

We conducted a series of experiments to evaluate the effectiveness of the proposed method. A powerful, scalable recognition system would only use the data extracted from one person for training but would still be capable of recognizing data collected from other people. Accordingly, the training data used in our experiments was a real video sequence comprising approximately 900 frames. The training data contained ten categories of action sequences that were performed by a single person. Some typical image frames are shown in Figure 4.4. Using the posture template selection algorithm, a codebook of 95 posture templates (see Figure 4.5) was constructed from the training data. The data was then used to build ten VLMMs, each of which was associated with one of the atomic actions shown in Figure 4.4.

Figure 4.4. The ten categories of atomic actions used for training

Figure 4.5. Posture templates extracted from the training data

The average log-likelihood of the training error computed with the training data is shown in Table 4.1. The results indicate that the proposed action recognition method can deal with the problem of human action recognition effectively. Next, a test video was used to assess the effectiveness of the proposed method. The test data was obtained from the same human subject. Each atomic action was repeated four times, yielding a total of 40 test samples (for each VLMM, 4 positive samples and 36 negative samples) for evaluating the performance of the learnt VLMMs. The proposed method achieved a 100% recognition rate for all the test sequences. To further verify the recognition results, we tested the similarity of any two VLMMs obtained in the experiment. First, we generated 10,000 action sequences for each of the 10 VLMMs, which yielded a total of 100,000 action sequences. Out of the 100,000 action sequences, only 74 were incorrectly recognized, and all the errors occurred on actions 7 and 8 because these two actions contain many similar postures and can thus be mixed up easily (refer to Figure 4.4). This result is consistent with the data shown in Table 4.1: the log-likelihoods of actions 7 and 8 computed using VLMMs 8 and 7, respectively, were relatively high. This result confirms that the data shown in Table 4.1 is valid.

Furthermore, we have also estimated the p-values [73] for each action model. The posture templates shown in Figure 4.5 were used to generate 10,000 random action sequences using a sample-with-replacement process. The histograms of the log-likelihood of the random sequences and the positive sequences for an action model are shown in Figure 4.6. Since these two histograms do not overlap at all, it is reasonable to infer that the p-value of the action model is very low. To estimate the p-value, we approximate the distributions of the log-likelihood by Gaussian distributions (see Figure 4.6), from which the p-value can be easily computed. The maximum p-value of the ten models is smaller than 0.0001, which confirms that the results are statistically significant.
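One plausible way to compute such a p-value is sketched below in Python. The exact procedure is not spelled out in the text, so this is an assumption: the log-likelihoods of the random sequences are fitted with a Gaussian, and the p-value is taken as the upper-tail probability of that Gaussian at the log-likelihood of a positive sequence. The function name and the sample values are hypothetical.

```python
import math
import statistics


def gaussian_p_value(random_scores, observed_score):
    """Upper-tail probability of observed_score under a Gaussian fit."""
    mu = statistics.fmean(random_scores)
    sigma = statistics.stdev(random_scores)
    z = (observed_score - mu) / sigma
    # P(Z >= z) for a standard normal variable, via the complementary
    # error function, which stays accurate far out in the tail.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical log-likelihoods of random sequences under one action model.
random_scores = [-1.0, 0.0, 1.0]
p_far = gaussian_p_value(random_scores, 4.0)
```

When the two histograms are far apart, as in Figure 4.6, the observed score lies many standard deviations above the mean of the random scores and the tail probability becomes very small.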

Table 4.1. The results of atomic action recognition using the training data

Figure 4.6. The histograms of the log-likelihood of the random sequences and the positive sequences for an action model

In the third experiment, test videos of nine different human subjects (see Figure 4.7) were used to evaluate the performance of the proposed method. Each person repeated each action five times, so we had five sequences for each action and each human subject, which yielded a total of 450 action sequences.

For comparison, we also tested the performance of the HMM method in this experiment. Since the ten atomic actions used in the experiments were acyclic, only left-right HMMs were considered. Because the initial parameters and the number of HMM states affect the recognition results, the HMM implementation was evaluated using a variety of HMMs, each of which had a different number of hidden states. Furthermore, the HMMs were trained ten times and the results were averaged to reduce the effect of the random initial parameters. Table 4.2 compares our method's recognition rate with that of the HMM method for test data from nine different human subjects. Our method clearly outperforms the HMM method, no matter how many states were selected.

In Table 4.2, the shaded cells denote the best recognition results of the HMM approach for a particular action. It is clear that the selection of the number of states is a critical issue for the HMM method. Note that the number of HMM states that yielded the best performance varied across actions, which makes the selection of the number of states even more difficult. In contrast to the difficulty in determining the topology of an HMM, our method is simple and effective because the topology of a VLMM can be determined automatically with a robust algorithm. Note that the recognition rates for action 1 were the worst across all actions. Figure 4.8(a) shows some typical input postures for a human subject performing action 1. The corresponding closest posture templates retrieved from the database are shown in Figure 4.8(b). When comparing the posture templates shown in Figure 4.8(b) with the training posture sequences shown in Figure 4.4, it is clear that the posture templates and the training postures of action 1, in this case, are not well matched. Due to the segmentation error in the lower arm areas, the input postures were incorrectly related to posture templates of different actions. For example, the retrieved posture templates shown in Figure 4.8(b), from left to right, were extracted from training data of actions 1, 4, 2, 2, 2, 1, 2, 2, 2, 4, and 1, respectively.

Since the proposed method is silhouette-based, when the same postures of two individuals appear to be drastically different (due to dissimilar physical characteristics, motion styles, or improper segmentation), observation errors would bias the recognition result. In particular, if most of the input postures have high observation errors, the context information is not sufficient for accurate recognition.

Figure 4.7. Nine test human subjects

Table 4.2. Comparison of our method's recognition rate with that of the HMM method, computed with the test data obtained from nine different human subjects

Figure 4.8. Some typical postures of a human subject exercising action 1: (a) the input posture sequence; (b) the corresponding minimum-CSC-distance posture templates.

In order to show that the selection of the parameter τc in the posture template selection process was not a major concern, we calculated the recognition rates for different values of τc. Figure 4.9 shows these recognition rates and demonstrates that changing τc has only a small influence on the recognition results.

Figure 4.9. Recognition rates with respect to different τc

In the fourth experiment, to evaluate the scalability of the proposed algorithm, we used a new, publicly available database [3, 63]. This database consists of 90 low-resolution (180 × 144) action sequences from nine different people, each performing ten natural actions. These actions include: bending (bend), jumping jacks (jack), jumping forward on two legs (jump), jumping in place on two legs (pjump), running (run), galloping sideways (side), skipping (skip), walking (walk), waving one hand (wave1), and waving two hands (wave2). Sample images of each type of action sequence are shown in Figure 4.10. In [63], a sequence of human silhouettes derived from each action sequence was converted into two representations, namely average motion energy (AME) and mean motion shape (MMS). Subsequently, a nearest neighbor (NN) classifier was used for recognition, and the leave-one-out cross-validation rule was adopted to compute the recognition rate. Recognition results for these two representations, shown in the top two rows of Table 4.3, are compared against our method.

In order to compare our method with the two competing methods more fairly, we also applied the leave-one-out rule to our method. In this case, for each held-out subject, the eight sets of data obtained from the remaining eight human subjects were used to train the VLMMs, resulting in eight VLMMs for each action. Finally, the category with the maximum likelihood was deemed to be the recognition result. Results using this methodology are shown in the last row of Table 4.3. It is clear that our method outperforms the other two methods for this public database.
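The leave-one-out protocol can be sketched in Python as follows. The harness is a hypothetical illustration: `train` and `score` stand in for VLMM training and log-likelihood evaluation, which are passed in as callables, and the data layout (one sequence per subject and action) is an assumption that matches the 90-sequence database described above.

```python
def leave_one_subject_out(data, train, score):
    """Return the recognition rate under leave-one-subject-out testing.

    data maps subject -> {action: sequence}. train(sequence) returns a
    model and score(model, sequence) returns a log-likelihood.
    """
    correct = total = 0
    for held_out in data:
        # One model per remaining subject and action.
        models = {}
        for subject, sequences in data.items():
            if subject == held_out:
                continue
            for action, sequence in sequences.items():
                models.setdefault(action, []).append(train(sequence))
        for action, sequence in data[held_out].items():
            # The action owning the single best-scoring model wins.
            predicted = max(
                models,
                key=lambda a: max(score(m, sequence) for m in models[a]),
            )
            correct += predicted == action
            total += 1
    return correct / total


# Toy stand-ins: a "model" is the training string, the score is 0 or -1.
toy = {s: {"walk": "AB", "run": "CD"} for s in ("s1", "s2", "s3")}
rate = leave_one_subject_out(
    toy, train=lambda seq: seq, score=lambda m, seq: 0.0 if m == seq else -1.0
)
```

With nine subjects, each fold trains on eight subjects, which yields the eight models per action mentioned above.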

Figure 4.10. Sample images in the public action database

Table 4.3. Comparison of our method’s recognition rate with that of the AME plus NN method and the MMS plus NN method for the public database

4.4 Concluding Remarks

We have proposed a framework for understanding human atomic actions using VLMMs. The framework comprises two modules: a posture labeling module, and a VLMM atomic action learning and recognition module. We have developed a simple and efficient posture template selection algorithm based on the modified shape context matching method. A codebook of posture templates is created to convert the input posture sequences into discrete symbols so that the language modeling approach can be applied. The VLMM technique is then used to learn human action sequences. Applying the VLMM technique to action analysis raises two problems: dynamic time warping and the lack of a noise observation model. To handle them, we have also developed a systematic method to convert the learned VLMMs into HMMs. The contribution of our approach is that the topology of the HMMs can be determined automatically and the recognition accuracy is better than that of the traditional HMM approach. Experimental results demonstrate the efficacy of the proposed method.
