

Chapter 3 Proposed Action Recognition Method

3.6 Action Series Recognition

The discussion above concerns the classification of a single action. The following is a more complex situation: a person performs a series of actions, and we must recognize which action he is performing at each moment. One might try to recognize the action by classifying the posture at the current time T. However, there is a problem. By observation, postures can be divided into two classes: key postures and transitional postures. A key posture belongs uniquely to one action, so a person can recognize the action from that single posture. Transitional postures occur in the interim between two actions, and even a human cannot recognize the action from a single transitional posture. Therefore, a human action cannot be recognized from the posture of a single frame. Instead, we refer to a period of posture history to determine which action is being performed. A sliding-window scheme is applied for real-time action recognition, as shown in Figure 3-6. At the current time T, the symbol subsequence between T-W and T, which is a period of the posture history, is used to recognize the current action by computing the maximal likelihood, where W is the window size. In our implementation, W is set to thirty frames, the average gait cycle of the testing sequences. In this scheme, standing is recognized as walking, and an "unknown" result is reported when there is not yet enough history. With the sliding-window scheme, the action a person is performing can be determined at each frame.

Figure 3-6 Sliding-window scheme for action series recognition
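To make the decision rule concrete, the following is a minimal Python sketch of the sliding-window recognition step. The `log_likelihood` method on each trained model is a hypothetical interface standing in for the HMM forward-probability computation; the window size and the "unknown" behavior follow the description above.

```python
W = 30  # window size: the average gait cycle of the testing sequences

def recognize_at(symbol_history, models, t, window=W):
    """Return the action recognized at frame t.

    symbol_history : list of VQ symbols, one per frame
    models         : dict mapping action name -> trained HMM; each model is
                     assumed to expose a hypothetical log_likelihood(seq) method
    """
    if t < window:
        return "unknown"  # not enough posture history yet
    window_seq = symbol_history[t - window:t]
    # The current action is the one whose model gives the window
    # the maximal likelihood.
    return max(models, key=lambda a: models[a].log_likelihood(window_seq))
```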

Chapter 4

Experiment Results and Discussion

To test the performance of our approach, we implemented a system capable of recognizing ten different actions. The evaluation contains two parts: (1) single action recognition and (2) recognition over a series of actions. In (1), a confusion matrix is used to present the recognition results. In (2), we compare the real-time recognition results to ground truth obtained by human observation.

4.1 Single action recognition

The proposed action recognition system has been tested on real human action videos.

For simplicity, we assumed a uniform background in order to extract human regions with less difficulty. The categories to be recognized were ten types of human actions:

'walk', 'sidewalk', 'pickup', 'sit', 'jump 1', 'jump 2', 'push up', 'sit up', 'crawl 1', and 'crawl 2'. Five persons performed each of the 10 action types 3 times. The video content was captured by a TV camera (NTSC, 30 frames/second) and digitized at 352×240 pixel resolution. The duration of each video clip was from 40 to 110 frames.

This range of frame counts was chosen experimentally: shorter sequences do not allow the action to be characterized and, on the other hand, longer sequences make the learning phase very hard. Figure 4-1 shows some example video clips of the 10 types of human actions. To calculate the recognition rate, we used a hold-out method. All data were separated into 3 categories, each containing the 5 persons performing the ten different actions one time. One category was used as training data for building an HMM for each action type, and the others were used as testing data.
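The split can be sketched as follows; the indexing of clips by (person, action, repetition) is an assumption made for illustration, not the thesis's actual data layout.

```python
actions = ['walk', 'sidewalk', 'pickup', 'sit', 'jump 1',
           'jump 2', 'push up', 'sit up', 'crawl 1', 'crawl 2']

def split_by_repetition(clips, train_rep=0):
    """Hold-out split by repetition: one repetition of every (person, action)
    pair trains the per-action HMMs; the other two repetitions are tested.

    clips : dict mapping (person, action, repetition) -> symbol sequence
    """
    train = {k: v for k, v in clips.items() if k[2] == train_rep}
    test = {k: v for k, v in clips.items() if k[2] != train_rep}
    return train, test
```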

Figure 4-1 Example clips of each action type: (a) walk, (b) sidewalk, (c) sit, (d) pick up, (e) jump 1, (f) jump 2, (g) push up, (h) sit up, (i) crawl 1, (j) crawl 2

The features of each action type were extracted using the star skeleton. Feature examples of each action are shown in Figure 4-2. For vector quantization, we manually selected m representative skeleton features for each action as the codewords of the codebook. In our implementation, m is set to five for simple actions such as sidewalk and jump 2; for the other eight actions, m is set to ten. Thus, the total number of HMM symbols was 90. We built the codebook in one direction first and reverse all the feature vectors to recognize the same actions performed in the opposite direction.
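As an illustration, a minimal vector-quantization sketch is given below. The thesis's own star distance (defined in the earlier chapters) should be plugged in as the distance function; the Euclidean distance shown here is only a placeholder assumption to keep the sketch self-contained.

```python
import numpy as np

def star_distance(f, c):
    # Placeholder assumption: the actual star distance between two
    # five-dimensional skeleton features is defined earlier in the thesis;
    # plain Euclidean distance is used here only for illustration.
    return np.linalg.norm(np.asarray(f) - np.asarray(c))

def vector_quantize(features, codebook, distance=star_distance):
    """Map each star-skeleton feature to the symbol (codeword index)
    of its nearest codeword in the 90-entry codebook."""
    return [int(np.argmin([distance(f, c) for c in codebook]))
            for f in features]
```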

Figure 4-2 Features of each action type using the star skeleton: (a) walk, (b) sidewalk, (c) sit, (d) pick up, (e) jump 1, (f) jump 2, (g) push up, (h) sit up, (i) crawl 1, (j) crawl 2

We use a sit action video to explain the recognition process. The sit action is composed of a series of postures. The star skeleton is used for posture description, mapping the sit action into a feature sequence. The feature sequence is then transformed into a symbol sequence O by vector quantization. Each trained action model computes the probability of generating the symbol sequence O; the log-scale probabilities are shown in Figure 4-3. The sit model has the maximal probability, so the video is recognized as sit.

Figure 4-3 Complete recognition process of sit action
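For completeness, the per-model probability of generating O can be computed with the standard scaled forward algorithm for a discrete HMM, sketched below; pi, A, and B denote the initial-state, transition, and emission parameters of one trained action model.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """log P(O | model) via the scaled forward algorithm.

    obs : symbol sequence O (list of codeword indices)
    pi  : (N,) initial state distribution
    A   : (N, N) state transition matrix
    B   : (N, M) symbol emission matrix
    """
    alpha = pi * B[:, obs[0]]          # initialization
    scale = alpha.sum()
    log_p = np.log(scale)
    alpha /= scale                     # rescale to avoid underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # induction step
        scale = alpha.sum()
        log_p += np.log(scale)
        alpha /= scale
    return log_p

# The clip is labeled with the action whose model maximizes the likelihood:
#   recognized = max(models, key=lambda a: forward_log_likelihood(O, *models[a]))
```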

Table 1. Confusion matrix for recognition of testing data

Finally, Table 1 shows the confusion matrix for the recognition of the testing data.

The left side lists the ground-truth action types, and the top lists the recognized action types. The numbers on the diagonal are the counts of each action correctly classified. The numbers off the diagonal are misclassifications, showing which action types the system confuses. From this table, we can see that most of the testing data were accurately classified: a recognition rate of 98% was achieved by the proposed method. The only two confusions occurred between sit and pick up. We inspected the two misclassified clips; both contain a large portion of body bending, and bending does not uniquely belong to sit or pick up, so the two action models become confused. In our opinion, a transitional action, bending, should be added to better distinguish pick up from sit.
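The tabulation itself is straightforward; a sketch is given below, with rows indexing the ground-truth action and columns the recognized action, as in Table 1.

```python
import numpy as np

def confusion_matrix(truth, predicted, actions):
    """Count (ground truth, recognized) pairs over all testing clips."""
    idx = {a: i for i, a in enumerate(actions)}
    cm = np.zeros((len(actions), len(actions)), dtype=int)
    for t, p in zip(truth, predicted):
        cm[idx[t], idx[p]] += 1
    return cm

# The overall recognition rate (98% in our experiment) is the diagonal mass:
#   rate = cm.trace() / cm.sum()
```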

4.2 Recognition over a series of actions

In this experiment, a human performs a series of different actions, and the system automatically recognizes the action type in each frame. Three different action-series video clips are used to test the proposed system. We compare the recognition results to human-made ground truth to evaluate the system performance.

The first test sequence is "Sit up – get up – Jump 2 – turn about – Walk – turn about – Crawl 1". The second test sequence is "Sidewalk – turn about – Walk – turn about – Pick up". The third test sequence is "Crawl 2 – get up – turn about – Walk – turn about – Jump 2". Each sequence contains about 3-4 defined action types and 1-2 undefined action types (transitional actions). Figures 4-4, 4-5, and 4-6 (a) show the original image sequences (selected frames) of the three action series, respectively.

The proposed system recognizes the action type using the sliding-window scheme described in Section 3.6.

Figures 4-4, 4-5, and 4-6 (b) show the recognition results. The x-coordinate of each graph is the frame number, and the y-coordinate indicates the recognized action. The red line is the ground truth defined by human observation, and the blue line is the recognized action type. The unknown periods of the ground truth are times when the human performs actions that are not among the ten defined categories: the first unknown period corresponds to getting up, and the second and third to turning about. The unknown periods in the recognition results occur because the posture history is not yet long enough (shorter than the window size).

From these graphs, we can see that the periods in which the human performs the defined actions are correctly recognized. Some misclassifications can be corrected by smoothing the recognition signal. A small recognition delay occurs at the start of crawl because the sliding-window scheme does not yet have enough history; however, the delay is so small that a human can hardly perceive it. During the periods when the human performs an undefined action, the system chooses the most probable of the ten defined actions. Therefore, more action types should be added to enhance the system.
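One simple realization of such smoothing, given here as an assumption rather than the system's actual post-processing, is a per-frame majority-vote filter over the label signal:

```python
from collections import Counter

def smooth_labels(labels, radius=5):
    """Replace each frame's label with the majority label in its
    (2*radius + 1)-frame neighborhood, removing isolated misclassifications."""
    smoothed = []
    for i in range(len(labels)):
        lo, hi = max(0, i - radius), min(len(labels), i + radius + 1)
        smoothed.append(Counter(labels[lo:hi]).most_common(1)[0][0])
    return smoothed
```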

(a) Some original image sequences of 'sit up – jump 2 – walk – crawl 1'

(b) Recognition result

Figure 4-4 Recognition over a series of actions 'sit up – jump 2 – walk – crawl 1'

(a) Some original image sequences of ‘sidewalk – walk – pickup’

(b) Recognition result

Figure 4-5 Recognition over a series of actions ‘sidewalk – walk – pickup’

(a) Some original image sequences of ‘crawl 2 – walk – jump 2’

(b) Recognition result

Figure 4-6 Recognition over a series of actions ‘crawl 2 – walk – jump 2’

Chapter 5

Conclusion and Future Work

We have presented an efficient mechanism for human action recognition based on the shape information of postures, which are represented by the star skeleton. We define the extracted skeleton as a five-dimensional vector so that it can be used as a recognition feature. A feature distance (star distance) is defined so that feature vectors can be mapped into symbols by vector quantization, and action recognition is achieved by HMMs. The system is able to recognize ten different actions. For single action recognition, a 98% recognition rate was achieved; the accuracy could be improved further with more intensive training. For recognition over a series of actions, the periods in which the human performs the ten defined actions are correctly recognized.

Although we have achieved human action recognition with a high recognition rate, the experimental results also reveal some limitations of the proposed technique. First, the recognition is greatly affected by the quality of the extracted human silhouette; we used a uniform background to make foreground segmentation easy in our experiments, so to build a robust system, a strong mechanism for extracting the correct foreground object contour must be developed. Second, the representative postures in the codebook for vector quantization are picked manually; clustering algorithms could be used to extract them automatically, making the system more convenient. Third, the viewing direction is somewhat fixed, whereas in the real world the viewing direction varies with camera placement. The proposed method should be improved to cope with this, because the human shape and the extracted skeleton change across views.

