

Chapter 4 Experimental Results

4.2 Fuzzy Rule Construction for Action Recognition

For activity recognition, we use the key frame selection technique to automatically select essential templates from the video frames for activity clustering. We chose six essential templates each for “walking from right to left,” “walking from left to right,” and “climbing up”; five for “climbing down”; three for “crouching”; and two for “jumping.” There are 28 essential templates in total, comprising 28 classes. The number of essential templates for each activity depends on how long a complete cycle of the activity takes. Each essential template is a representative cluster center of about five images, extracted from five different training persons, that have similar postures. Figs. 4.4 and 4.5 show some templates of two training models.

As shown in Figs. 4.4 and 4.5, when a model bends down or squats down, the body in the template image is wider than in the others. For normalization, every segmented image is resized until its height equals 128 pixels or its width equals 96 pixels, whichever limit is reached first. Images of a standing posture are usually resized so that the height reaches 128 pixels, since the width stays within the 96-pixel limit. Conversely, when the extracted body shape is small, the magnification factor of the image becomes large.
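The thesis gives no code for this normalization step; the following Python sketch illustrates it, assuming OpenCV and grayscale silhouette images. Centering the resized silhouette on a fixed 96 x 128 canvas is our own assumption, since the text only states the resizing limits.

import cv2
import numpy as np

def normalize_silhouette(segmented, target_h=128, target_w=96):
    # Scale so that the height reaches 128 pixels or the width reaches
    # 96 pixels, whichever limit is hit first.
    h, w = segmented.shape[:2]
    scale = min(target_h / h, target_w / w)
    resized = cv2.resize(segmented, (max(1, int(w * scale)), max(1, int(h * scale))))
    # Centering on a fixed canvas is our assumption, not stated in the thesis.
    canvas = np.zeros((target_h, target_w), dtype=segmented.dtype)
    rh, rw = resized.shape[:2]
    y0, x0 = (target_h - rh) // 2, (target_w - rw) // 2
    canvas[y0:y0 + rh, x0:x0 + rw] = resized
    return canvas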

[Template images for classes 2, 6, 7, 9, 15, 19, 21, 25, and 27]

Fig. 4.4 Some “essential templates of posture” of person 1.

[Template images for classes 2, 6, 7, 9, 15, 19, 21, 25, and 27]

Fig. 4.5 Some “essential templates of posture” of person 5.

The template images are transformed to the canonical space by the methods described in Chapter 2. Each essential template image of a training model is treated as a class center. With four training subjects and 28 essential-template classes per subject, there are 4 × 28 = 112 center vectors. Using the leave-one-out strategy, each of the five subject models is tested in turn.

To construct the fuzzy rules, the training video frames are input for recognition. The nearest essential template to each image frame is found using Eq. (28) in Section 3.4. To include temporal information, we gather three consecutive 5:1 sub-sampled images as a group. Training is accomplished off-line, so we gather groups from different start points when constructing the rules. For example, the 1st, 6th, and 11th frames form one training datum; the 2nd, 7th, and 12th frames form another; the 3rd, 8th, and 13th frames form a third; and so on. Groups from different start points are used for training because the starting frame of a testing video may not be the same either; by utilizing different starting images, the system becomes insensitive to the starting position of the video.
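A minimal Python sketch of the nearest-template assignment and the start-offset grouping described above is given below; the function names and the NumPy-based distance computation are our own illustration, not the thesis implementation.

import numpy as np

def nearest_template(x, centers):
    # Index of the essential template with the minimal Euclidean distance
    # to the canonical-space feature vector x (the role of Eq. (28)).
    d = np.linalg.norm(centers - x, axis=1)
    k = int(np.argmin(d))
    return k, float(d[k])

def training_groups(n_frames, interval=5, group_size=3):
    # Frame-index triples (i, i+5, i+10) taken from every start offset,
    # so the rule base does not depend on where the video begins.
    span = interval * (group_size - 1)
    return [[s + k * interval for k in range(group_size)]
            for s in range(n_frames - span)]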

Each group of three images is converted to a posture sequence, where each image is assigned the essential template with the minimal Euclidean distance by Eq. (28); the sum of the three minimal distances serves as the matching degree of the group. Each posture sequence supports its corresponding rule once. If the corresponding rule does not yet exist, a new rule is generated in the IF-THEN form described in Section 3.4.

A threshold is set after all training patterns have been learned. It is used to discard IF-THEN rules whose cumulative occurrence counts are relatively few, so the number of rules retained varies with the threshold. Table I shows the number of rules at different threshold values; for each row, the training data of four subjects (all five excluding the named subject) are used for rule construction. Clearly, the higher the threshold, the fewer rules we obtain. Although a higher threshold reduces the number of rules, too few rules lose tolerance for the small variations observed even within the same activity. If conflicting rules are generated, we keep the rule supported by the maximum number of training instances.

TABLE I

THE NUMBER OF RULES AT DIFFERENT THRESHOLDS

Excluded model   Threshold = 3   Threshold = 4   Threshold = 5
Person 1         200             137             92
Person 2         190             138             104
Person 3         184             126             94
Person 4         185             128             91
Person 5         210             150             107
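The rule construction, pruning, and conflict resolution described above can be sketched in Python as follows; the data structures and the use of a greater-or-equal comparison against the threshold are our assumptions.

from collections import Counter, defaultdict

def build_rule_base(posture_triples, activity_labels, threshold=4):
    # Count how many training instances support each IF-THEN rule.
    support = defaultdict(Counter)
    for triple, activity in zip(posture_triples, activity_labels):
        support[tuple(triple)][activity] += 1
    rules = {}
    for antecedent, counts in support.items():
        # Conflict resolution: keep the consequent with maximum support.
        activity, n = counts.most_common(1)[0]
        # Prune rules with few occurrences (the exact comparison,
        # >= versus >, is our assumption).
        if n >= threshold:
            rules[antecedent] = activity   # IF (p1, p2, p3) THEN activity
    return rules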

The test images are likewise obtained by 5:1 down-sampling of the video frames. The postures of an activity should appear in their naturally perceived order. For example, let P1 through P6 be the six linguistic labels of the activity “walking from left to right.” This activity should then yield rules with the posture sequences (P1, P2, P3), (P2, P3, P4), (P3, P4, P5), (P4, P5, P6), (P5, P6, P1), and (P6, P1, P2). With the threshold set at four, a set of fuzzy rules generated from the training data excluding person 5 is listed in Table II. Two of these learned fuzzy rules, together with their template images, are shown in Fig. 4.6. After training on all of the image sequences, we compute the mean and standard deviation of the matching degree of each pre-defined activity. In this thesis there are six pre-defined activities, so we compute six means and standard deviations and use them to determine whether an input image sequence belongs to one of the pre-defined activities or to an unknown activity. In Section 3.2 we discussed key postures selected manually and by the unsupervised clustering algorithm. Table III gives the recognition rates of key postures selected manually; Table IV gives the recognition rates of key postures selected by the unsupervised clustering algorithm; and Table V gives the means and standard deviations obtained from the training models excluding person 5.
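A minimal sketch of this statistics step, assuming the matching degrees of each activity's training groups have been collected into a dictionary keyed by activity name:

import numpy as np

def activity_statistics(matching_degrees):
    # matching_degrees: activity name -> list of matching degrees observed
    # on that activity's training groups (sum of three minimal distances).
    return {a: (float(np.mean(d)), float(np.std(d)))
            for a, d in matching_degrees.items()}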

TABLE II

THE FUZZY RULE BASE GENERATED FROM THE TRAINING DATA EXCLUDING PERSON 5

Number   Image 1   Image 2   Image 3   Class
1        P1        P1        P1        WLR
2        P1        P1        P2        WLR
3        P1        P1        P4        WLR
…
50       P7        P11       P8        WRL
…
90       P14       P13       P13       CROUCH
…
110      P17       P17       P17       JUMP
…
120      P19       P20       P21       CUP
…
148      P27       P27       P28       CDOWN
149      P27       P28       P27       CDOWN
150      P27       P28       P28       CDOWN

TABLE III

THE RECOGNITION RATES (%) OF KEY POSTURES SELECTED MANUALLY

Testing data   WLR     WRL     CROUCH   JUMP   CUP     CDOWN
Person 1       100     100     100      100    100     62.26
Person 2       100     94.62   100      100    100     93.33
Person 3       99.11   100     100      100    100     72.61
Person 4       97.06   100     71.70    100    83.54   100
Person 5       100     100     100      100    82.61   100

Overall average: 96.21

TABLE IV

THE RECOGNITION RATES (%) OF KEY POSTURES SELECTED BY THE UNSUPERVISED CLUSTERING ALGORITHM

Testing data   WLR     WRL     CROUCH   JUMP    CUP     CDOWN
Person 1       100     95.74   100      100     100     83.02
Person 2       100     89.25   100      100     100     95.56
Person 3       99.11   100     100      100     87.16   32.88
Person 4       97.06   98.78   85.85    97.17   81.01   91.38
Person 5       100     89.66   100      100     89.86   100

Overall average: 94.79

Fig. 4.6 Two examples of fuzzy rules. (a) Walking from left to right. (b) Climbing down.

TABLE V

THE MEANS AND STANDARD DEVIATIONS OF THE SIX ACTIVITIES’ MATCHING DEGREES, TRAINING MODELS EXCLUDING PERSON 5

Activity   Mean      Standard deviation (σ)
WLR        7060.04   1819.92
WRL        7043.74   1870.61
CROUCH     5227.48   1332.18
JUMP       3630.96   1168.53
CUP        6282.87   2544.87
CDOWN      7131.24   2804.25

4.3 Activity Recognition Using the Fuzzy Rule Base Approach

The activity recognition system in our experiment is trained and tested off-line; the testing videos are not processed in real time. We input each testing video starting from different frames, in the same way as in the training phase: we recognize the video starting from the first frame, the second frame, the third frame, the fourth frame, and so on, with a down-sampling interval of five frames. A testing video is never used for constructing templates or fuzzy rules. In total, there are five video databases for training and testing.

An example of the recognition rates of a testing video started from different frames is shown in Table VI. In this table, WLR is the activity “walking from left to right,” WRL is “walking from right to left,” JUMP is “jumping,” CROUCH is “crouching,” CUP is “climbing up,” CDOWN is “climbing down,” CAVORT is “cavorting,” and SIT is “sitting.” The threshold selected for this model is four, and we employ the mean plus 3.5 standard deviations of the matching degree as the boundary for differentiating between the pre-defined and unknown activities.
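The rejection test can be sketched as follows, using the per-activity statistics computed above; treating the boundary as an upper limit on the distance-based matching degree is our reading of the text.

def classify_with_rejection(antecedent, degree, rules, stats, k=3.5):
    # Fire the matched fuzzy rule, then reject the result as unknown when
    # the matching degree exceeds mean + 3.5 sigma for that activity.
    activity = rules.get(tuple(antecedent))
    if activity is None:
        return "UNKNOWN"       # no rule matches this posture triple
    mean, sigma = stats[activity]
    return activity if degree <= mean + k * sigma else "UNKNOWN"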

TABLE VI

THE RECOGNITION RATES (%) OF PERSON 5 WITH DIFFERENT STARTING FRAMES

Starting frame               WLR   WRL     CROUCH   JUMP   CUP     CDOWN   CAVORT   SIT
From the 1st, 6th, … frame   100   94.44   100      100    85.71   100     64.71    89.66
From the 2nd, 7th, … frame   100   83.33   100      100    78.57   100     52.94    87.93
From the 3rd, 8th, … frame   100   88.24   100      100    78.57   100     47.06    91.23
From the 4th, 9th, … frame   100   88.24   100      100    64.29   100     68.75    89.47
From the 5th, 10th, … frame  100   94.12   100      100    61.54   100     87.50    85.96

Table VII shows the recognition rates of our system on the five testing subject models. The threshold used to construct the fuzzy rules is four. For each activity, the recognition rates obtained from the different starting frames are averaged.

TABLE VII

THE RECOGNITION RATES (%) OF EACH ACTIVITY

Testing data   WLR     WRL     CROUCH   JUMP    CUP     CDOWN   CAVORT   SIT
Person 1       100     95.74   74.34    100     100     83.02   17.54    96.65
Person 2       100     89.25   94.03    87.04   95.96   68.89   14.42    94.74
Person 3       78.57   97.25   100      100     83.49   32.88   84.56    100
Person 4       97.06   98.78   59.43    70.75   49.37   91.38   4.85     100
Person 5       100     89.66   100      100     73.91   100     63.86    88.85

Overall average: 84.63

4.4 Extraction of New Key Postures

After the recognition scheme is complete, we have a set of unknown image frames from the unknown activities of cavorting and sitting. We input these frames, which do not belong to the essential templates, to the unsupervised clustering algorithm to generate extra key postures for the unknown actions. After generating these key postures, we compute the distances between each newly found key posture and the 112 pre-defined key postures. If the minimal distance is greater than the threshold Th = 8500, the key posture is not similar enough to the pre-defined key postures, and we therefore identify it as an extra key posture. Figs. 4.7 and 4.8 show the extra key postures obtained for two testing models. The final new key postures, obtained after again imposing a deviation of at least Th = 8500 from the pre-defined key postures, are shown in Figs. 4.9 and 4.10. The detection accuracies of the real-time and final new key postures belonging to the unknown action videos are summarized in Tables VIII and IX.
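A minimal sketch of this distance test, assuming Euclidean distance between posture feature vectors and that the candidates are the cluster centers produced by the unsupervised clustering algorithm:

import numpy as np

def extract_new_key_postures(candidate_centers, predefined_centers, th=8500.0):
    # Keep a candidate cluster center only if it deviates from every one
    # of the 112 pre-defined key postures by more than Th = 8500.
    kept = []
    for c in candidate_centers:
        d = np.linalg.norm(predefined_centers - c, axis=1)
        if d.min() > th:       # not similar to any pre-defined key posture
            kept.append(c)
    return kept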

Fig. 4.7 The new key postures of person 1.

Fig. 4.8 The new key postures of person 5.

Fig. 4.9 The final new key postures of person 1.

Fig. 4.10 The final new key postures of person 5.

TABLE VIII

THE DETECTION ACCURACY OF REAL-TIME NEW KEY POSTURES BELONGING TO THE UNKNOWN ACTION VIDEO

                                      Person 1   Person 2   Person 3   Person 4   Person 5   Average
Key posture detection accuracy (%)    100.00     88.00      78.38      91.30      90.91      87.90

TABLE IX

THE DETECTION ACCURACY OF FINAL NEW KEY POSTURES BELONGING TO THE UNKNOWN ACTION VIDEO

                                      Person 1   Person 2   Person 3   Person 4   Person 5   Average
Key posture detection accuracy (%)    100.00     94.12      78.05      88.45      94.75      89.56

Chapter 5 Conclusion

In this thesis, we have presented a fuzzy rule base approach to human activity recognition. In our approach, the effect of illumination variation is reduced by adopting the frame ratio method. Moreover, CST and EST are used to reduce the data dimensionality and optimize the class separability simultaneously, and each frame of a video sequence is then converted to one of 28 key frame postures. Finally, the fuzzy rule base is used to infer the activity. We also employ an unsupervised clustering algorithm to obtain new key postures of unknown actions on-line.

Experimental results show that the recognition rate for eight-activity classification is 84.63%. In addition, the detection accuracies of finding new key postures belonging to the unknown actions are 87.90% on-line and 89.56% off-line.

In future work, we will further automate the update learning of the activity recognition rule base. In addition, recognition from different viewing directions, extension of the test environment, and more complicated activities remain to be investigated.

