
Chapter 3 Online Human Action Recognition System

3.2 System Flowchart

3.2.3 Feature Extraction

As mentioned above, three types of features (spatial, temporal and structural) are used to distinguish human actions and provide information for the action classification stage. The spatial features are extracted from RGB colour images, as shown in Figure 3.10 (a). Temporal features are extracted from optical flow, as shown in Figure 3.10 (b).

Figure 3.9 An example of filling the missing joints (a) the missing joints (b) result of joint filling

Structural features are extracted from human skeletal joints, as shown in Figure 3.10 (c).

Both spatial and temporal features are extracted by the pre-trained CNN model InceptionV3, which was proposed by Szegedy et al. [Sze16] in 2016. Table 3.1 (a) outlines the InceptionV3 architecture, including the input size and patch size of every layer. Specifically, the system extracts human action features from the output of the final pooling layer, which has dimensions of 1 × 1 × 2048. Table 3.1 (b) shows the evaluation results comparing InceptionV3 with other networks such as PReLU [He15], BN-Inception [Iof15], VGGNet [Sim14], and GoogLeNet [Sze15], the last of which was also proposed by Szegedy et al. One can observe that InceptionV3 has the lowest Top-1 and Top-5 error rates. Further, InceptionV3 is pre-trained on the ImageNet dataset.

Note that the system crops and resizes the input frames before extracting spatial and temporal features. In the cropping step, the system expands the bounding box by 100 pixels on both the left and right sides and by 150 pixels on both the top and bottom to include more spatial information.
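The bounding-box expansion described above can be sketched as follows. This is a minimal illustration rather than the system's actual code: the function name `expand_box`, the (left, top, right, bottom) box layout, and the clamping to the image border are assumptions; only the 100- and 150-pixel margins come from the text.

```python
def expand_box(box, image_width, image_height, margin_x=100, margin_y=150):
    """Expand a (left, top, right, bottom) box by the margins given in the text.

    Clamping to the image border is an assumption; the text does not say
    how boxes near the edge of the frame are handled.
    """
    left, top, right, bottom = box
    left = max(0, left - margin_x)
    top = max(0, top - margin_y)
    right = min(image_width, right + margin_x)
    bottom = min(image_height, bottom + margin_y)
    return left, top, right, bottom


# Hypothetical detection in a 1920 x 1080 frame.
expanded = expand_box((800, 300, 1000, 700), 1920, 1080)
print(expanded)  # (700, 150, 1100, 850)
```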

Table 3.1 InceptionV3 (a) outline of InceptionV3 architecture (b) evaluation results comparing InceptionV3 with other models [Sze16]

Figure 3.10 Input information (a) RGB colour images (b) optical flow (c) human skeletal joints

Once the system crops the target persons, the cropped human region is resized into 500×450 pixels and sent into InceptionV3 [Sze16] for spatial feature extraction. Note that the cropped human region contains one person if only one person appears in a frame, but it contains two persons if two persons appear in that frame. The cropped and resized human regions are called CR regions hereinafter. The system calculates the Farneback optical flow using two successive CR regions, and sends it into another InceptionV3 [Sze16] to extract temporal features.

Cropping and resizing the human region can partially mitigate the camera-motion problem: cropping forces the system to focus on the target persons, and resizing makes the human regions consistent in size across all frames. Moreover, resizing the cropped human regions lets them fit the input shape of InceptionV3 [Sze16].

Figure 3.11 shows an example of the process to obtain optical flow. Figures 3.11 (a) and (b) show two successive input frames and their corresponding CR regions, respectively. Figure 3.11 (c) shows the optical flow obtained from those successive CR regions. The arrows between Figures 3.11 (a), (b) and (c) indicate the processing direction. In summary, the system extracts a 1 × 1 × 2048 spatial feature vector from each frame and a temporal feature vector of the same size from each pair of successive frames. Consequently, each input sequence with 𝑁 frames yields a feature map of size 𝑁 × 2048.
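The shape bookkeeping in this summary can be illustrated with a short sketch. The function name `build_feature_map` is hypothetical, and the per-frame vectors are zero-filled placeholders standing in for real InceptionV3 outputs; only the 1 × 1 × 2048 and 𝑁 × 2048 sizes come from the text.

```python
import numpy as np

FEATURE_DIM = 2048  # size of the final pooling output stated in the text


def build_feature_map(per_frame_features):
    """Stack per-frame 1 x 1 x 2048 vectors into an N x 2048 feature map."""
    rows = [
        np.asarray(f, dtype=np.float32).reshape(FEATURE_DIM)
        for f in per_frame_features
    ]
    return np.stack(rows, axis=0)


# Placeholder features for a 20-frame sequence; real values would come from the CNN.
dummy = [np.zeros((1, 1, FEATURE_DIM), dtype=np.float32) for _ in range(20)]
feature_map = build_feature_map(dummy)
print(feature_map.shape)  # (20, 2048)
```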

Figure 3.11 An example of optical flow calculation (a) two successive input frames (b) their corresponding CR regions (c) optical flow

Structural features are obtained by calculating the relationship between each pair of skeletal joints. As mentioned above, 13 skeletal joints can be extracted for each person. Thus, each frame contains 13 skeletal joints for a single-person action and 26 for an interactive action between two persons. However, the system reserves enough memory to record 26 skeletal joints in each frame, whether one or two persons appear in it. Frames containing fewer than 26 skeletal joints are zero-padded to prepare the data for structural feature extraction.
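The zero-padding step can be sketched as below. This is only an illustration: the function name `pad_joints`, the representation of a joint as an (x, y) tuple, and appending the zeros at the end of the list are assumptions, since the text does not specify the memory layout.

```python
JOINTS_PER_PERSON = 13
MAX_JOINTS = 2 * JOINTS_PER_PERSON  # 26 slots per frame


def pad_joints(joints):
    """Return a list of exactly 26 (x, y) joints, zero-padded at the end."""
    if len(joints) > MAX_JOINTS:
        raise ValueError("expected at most 26 joints per frame")
    padding = [(0.0, 0.0)] * (MAX_JOINTS - len(joints))
    return list(joints) + padding


# One person: 13 hypothetical joints, padded to 26 slots.
single_person = [(float(i), float(2 * i)) for i in range(JOINTS_PER_PERSON)]
padded = pad_joints(single_person)
print(len(padded))  # 26
```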

Next, the system calculates two kinds of distances between pairwise skeletal joints and concatenates them to form the structural features. One is the Manhattan distance (1-norm) and the other is the Euclidean distance (2-norm). With 26 joints there are C(26, 2) = 325 joint pairs, so in each frame the system calculates 2 × C(26, 2) = 650 1-norm features and 1 × C(26, 2) = 325 2-norm features. There are twice as many 1-norm features because the location difference of each joint pair is calculated separately along the x-axis and the y-axis. Concatenating these features, the system obtains 3 × C(26, 2) = 650 + 325 = 975 features. Moreover, each input sequence with 𝑁 frames yields a feature map of size 𝑁 × 975.
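The feature count above can be checked with a small sketch that builds the per-frame structural vector. The function name `structural_features`, the use of absolute differences, and the interleaved (x, y) ordering inside the 1-norm block are assumptions; placing the 650 1-norm features before the 325 2-norm features follows the axis layout described for the structural feature maps.

```python
from itertools import combinations
from math import hypot


def structural_features(joints):
    """Build the 975-dimensional structural feature vector for one frame.

    `joints` is a list of 26 (x, y) tuples. Absolute differences and the
    ordering inside each block are assumptions, not taken from the text.
    """
    pairs = list(combinations(joints, 2))  # C(26, 2) = 325 joint pairs
    one_norm = []
    for (x1, y1), (x2, y2) in pairs:
        one_norm.extend((abs(x1 - x2), abs(y1 - y2)))  # two values per pair
    two_norm = [hypot(x1 - x2, y1 - y2) for (x1, y1), (x2, y2) in pairs]
    return one_norm + two_norm  # 650 + 325 = 975 values


# 26 hypothetical joint positions for one frame.
joints = [(float(i), float(i * i)) for i in range(26)]
features = structural_features(joints)
print(len(features))  # 975
```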

Figure 3.12 shows two examples of the visualization results of spatial, temporal and structural feature maps. The human action in these examples, as shown in Figure 3.12 (a), is “walk toward each other”. The two sequences each contain 20 frames (𝑁 = 20). Figures 3.12 (b), (c) and (d) show their corresponding spatial (green), temporal (purple), and structural (blue) feature maps, respectively. The horizontal axis indicates the dimension of the feature vectors and the vertical axis indicates the frame number. In particular, the structural feature maps have a second horizontal axis at the bottom, which shows 1-norm features (blue) from 0 to 650 and 2-norm features (red) from 650 to 975.

The shade of colour in these feature maps indicates the magnitude of the extracted features. The corresponding colour bar is shown on the right side of the feature maps and indicates that smaller values have a lighter colour. In the spatial and temporal feature maps (see Figures 3.12 (b) and (c)), values greater than one are represented in red.

Figure 3.13 shows another two examples of the visualization results of spatial, temporal and structural feature maps, this time for the action “drink in stand position”, as shown in Figure 3.13 (a). As above, Figures 3.13 (b), (c) and (d) show the corresponding spatial (green), temporal (purple) and structural (blue) feature maps, respectively.

From these feature maps, one can observe that similar human actions produce similar feature values, while different actions produce different ones. This characteristic makes it easier for the classifiers to classify the actions correctly.

The structural feature maps contain information about the relationship between skeletal joints for both single and interactive actions. For example, in the feature maps of the action “walk toward each other” shown in Figure 3.12 (d), the feature values decrease gradually from time step 0 to 20. This variation means that the skeletal joints are getting closer, which matches the action. By contrast, in the feature maps of the action “drink in stand position” shown in Figure 3.13 (d), the feature values barely change from time step 0 to 20. This means that the skeletal joints move only slightly, which also matches the action.

Figure 3.12 Two examples of feature map visualization (a) human actions (walk toward each other) (b) corresponding spatial feature maps (c) temporal feature maps (d) structural feature maps
