
Chapter 3 Online Human Action Recognition System

3.2 System Flowchart

3.2.4 Action Classification

This research adopts LSTM networks to classify human actions. Each type of feature is classified best by a network that suits it. Therefore, twelve kinds of LSTM networks are implemented to find appropriate ones for the three types of features. The newly proposed temporal enhancement LSTM (TE-LSTM) is among these twelve networks. This subsection gives a brief overview of the LSTM network and describes the LSTM models.

Figure 3.13 Examples of feature map visualisation: (a) human actions (drink in stand position), (b) corresponding spatial feature maps, (c) temporal feature maps, (d) structural feature maps

(1) Overview of the LSTM Network

The LSTM network improves on the RNN. As mentioned above, the RNN suffers from vanishing gradients during training, and the LSTM network is designed to alleviate this problem.

The LSTM network consists of at least one LSTM layer, and each LSTM layer contains many LSTM cells. An LSTM cell includes a forget gate, an input gate, and an output gate, as shown in Figure 3.14. These three gates regulate how information is stored, added, or removed at each cell. The input gate decides what new information should be added to the cell state. The forget gate decides which parts of the cell state should be retained or removed. The output gate decides the final output vector based on the processed cell state.

Given an input vector $x_t$ at time step $t$ and a hidden state vector $h_{t-1}$ at $t-1$, the output values of the forget gate ($f_t$), input gate ($i_t$), output gate ($o_t$), and memory cell candidate ($a_t$), as shown in Figure 3.14, are obtained by the following equations:

$$f_t = \sigma_s(w_f x_t + u_f h_{t-1} + b_f), \tag{5}$$

$$i_t = \sigma_s(w_i x_t + u_i h_{t-1} + b_i), \tag{6}$$

$$o_t = \sigma_s(w_o x_t + u_o h_{t-1} + b_o), \tag{7}$$

$$a_t = \sigma_h(w_c x_t + u_c h_{t-1} + b_c), \tag{8}$$

where $w_f$, $w_i$, $w_o$, $w_c$, $u_f$, $u_i$, $u_o$, and $u_c$ are parameter matrices, $b_f$, $b_i$, $b_o$, and $b_c$ are bias vectors, and $\sigma_s$ and $\sigma_h$ denote the sigmoid function and the hyperbolic tangent function, respectively.

Figure 3.14 An LSTM cell

The cell state vector $c_{t-1}$ at $t-1$ can be regarded as the memory of the previous time step and is used to form the cell state vector $c_t$ at $t$. Then, $c_t$ is calculated by

$$c_t = f_t \ast c_{t-1} + i_t \ast a_t. \tag{9}$$

Finally, the hidden state vector $h_t$ at time step $t$ is defined as

$$h_t = o_t \ast \sigma_h(c_t), \tag{10}$$

where the operator ∗ denotes the element-wise product.
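As an illustration, Equations (5) to (10) can be sketched as a single time step in NumPy. This is a minimal sketch rather than the implementation used in this study: the function names, the `params` dictionary layout, and the random initialisation are assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid, written sigma_s in the text."""
    return 1.0 / (1.0 + np.exp(-z))


def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following Equations (5)-(10).

    params maps each gate name ("f", "i", "o", "c") to a tuple (w, u, b);
    this layout is a choice made for the example.
    """
    w_f, u_f, b_f = params["f"]
    w_i, u_i, b_i = params["i"]
    w_o, u_o, b_o = params["o"]
    w_c, u_c, b_c = params["c"]

    f_t = sigmoid(w_f @ x_t + u_f @ h_prev + b_f)  # Eq. (5), forget gate
    i_t = sigmoid(w_i @ x_t + u_i @ h_prev + b_i)  # Eq. (6), input gate
    o_t = sigmoid(w_o @ x_t + u_o @ h_prev + b_o)  # Eq. (7), output gate
    a_t = np.tanh(w_c @ x_t + u_c @ h_prev + b_c)  # Eq. (8), cell candidate
    c_t = f_t * c_prev + i_t * a_t                 # Eq. (9), element-wise
    h_t = o_t * np.tanh(c_t)                       # Eq. (10), element-wise
    return h_t, c_t


def init_params(input_dim, hidden_dim, seed=0):
    """Small random parameters; the initialisation scheme is an assumption."""
    rng = np.random.default_rng(seed)
    return {
        gate: (
            0.1 * rng.standard_normal((hidden_dim, input_dim)),
            0.1 * rng.standard_normal((hidden_dim, hidden_dim)),
            np.zeros(hidden_dim),
        )
        for gate in ("f", "i", "o", "c")
    }


# Usage: run the cell over a short sequence of 8-dimensional feature vectors.
params = init_params(input_dim=8, hidden_dim=4)
h, c = np.zeros(4), np.zeros(4)
for x_t in np.ones((5, 8)):
    h, c = lstm_cell_step(x_t, h, c, params)
```

Because the output gate lies in (0, 1) and the hyperbolic tangent lies in (-1, 1), every component of the hidden state stays strictly between -1 and 1.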

(2) LSTM Networks

Twelve kinds of LSTM networks, including LSTM networks with different numbers of layers, BiLSTM networks with different numbers of layers, and the proposed TE-LSTM networks, are implemented to find suitable ones for the three types of features. Table 3.2 shows the structures of three LSTM networks with one, two, and three layers, respectively.

Different types of features can be input to train LSTM models. In Table 3.2, 1LSp, 1LTe, and 1LSt indicate the structures of the LSTM networks with one-layer LSTM that are trained by spatial features, temporal features, and structural features, respectively.

Similarly, 2/3LSp, 2/3LTe, and 2/3LSt indicate the structures of the LSTM networks with two/three-layer LSTM trained by spatial features, temporal features, and structural features, respectively.

$\mathrm{LSTM}_i$, $i = 1,2,3,4$, as shown in Table 3.2, indicates the number of hidden state units of the $i$th LSTM layer, and $\mathrm{FC}_j$, $j = 1,2$, indicates the number of neurons in the $j$th fully connected layer. The neuron number of $\mathrm{FC}_2$, 16, is equal to the number of human action classes.

Table 3.3 shows the structures of four BiLSTM networks with one, two, three, and four layers. As in Table 3.2, 1/2/3/4BSp, 1/2/3/4BTe, and 1/2/3/4BSt indicate the structures of the BiLSTM networks with one/two/three/four-layer BiLSTM that are trained by spatial features, temporal features, and structural features, respectively. Further, $\mathrm{BiLSTM}_i$, $i = 1,2,3,4$, indicates the number of hidden state units of the $i$th BiLSTM layer.

Table 3.2 Structures of LSTM networks

Structure columns: 1-Layer LSTM (1LSp, 1LTe, 1LSt), 2-Layer LSTM (2LSp, 2LTe, 2LSt), 3-Layer LSTM (3LSp, 3LTe, 3LSt)

Additionally, each BiLSTM layer consists of two LSTM layers that process the input sequence in two directions. One processes the input sequence from the first frame to the last frame (forward) along the time axis, and the other processes it from the last frame to the first frame (backward). In BiLSTM networks, the relationships of temporal variations can therefore be analysed in both the forward and backward directions. As a result, BiLSTM networks may capture temporal dependencies better than LSTM networks in some cases.
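The two-direction processing described above can be sketched as follows. This is a hedged illustration in NumPy, not the networks used in this study: the helper names, the stacked weight layout, and the choice to concatenate the forward and backward hidden states per frame are assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def run_lstm(seq, w, b):
    """Run one LSTM layer over seq of shape (T, D); returns hidden states (T, H).

    w has shape (4H, D + H) and stacks the forget, input, output, and
    candidate weights; b has shape (4H,). This layout is an assumption.
    """
    hidden = b.shape[0] // 4
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    outputs = []
    for x_t in seq:
        z = w @ np.concatenate([x_t, h]) + b
        f, i, o = (sigmoid(z[k * hidden:(k + 1) * hidden]) for k in range(3))
        a = np.tanh(z[3 * hidden:])
        c = f * c + i * a
        h = o * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)


def run_bilstm(seq, fwd_params, bwd_params):
    """One bidirectional layer: forward reads first-to-last, backward last-to-first."""
    h_fwd = run_lstm(seq, *fwd_params)
    # Reverse time, run the backward LSTM, then restore the original frame order.
    h_bwd = run_lstm(seq[::-1], *bwd_params)[::-1]
    # Concatenating both directions per frame is an assumed merge rule.
    return np.concatenate([h_fwd, h_bwd], axis=1)


def random_lstm_params(input_dim, hidden, rng):
    """Small random weights and zero biases, for illustration only."""
    w = 0.1 * rng.standard_normal((4 * hidden, input_dim + hidden))
    return w, np.zeros(4 * hidden)
```

With this sketch, the forward half of the output at a given frame depends only on that frame and earlier ones, whereas the backward half depends on that frame and later ones.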

This study implemented five types of temporal enhancement (TE)-LSTM networks: TE-LSTM1, TE-LSTM2, TE-LSTM3, TE-LSTM4 and TE-LSTM5. These all have the same structure, a general TE-LSTM structure, but some of the layers use different LSTM models.

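Table 3.3 Structures of BiLSTM networks

Columns 1B, 2B, 3B, and 4B belong to the 1-Layer, 2-Layer, 3-Layer, and 4-Layer BiLSTM networks, respectively.

| Structure | 1BSp | 1BTe | 1BSt | 2BSp | 2BTe | 2BSt | 3BSp | 3BTe | 3BSt | 4BSp | 4BTe | 4BSt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BiLSTM1 | 2048 | 2048 | 1024 | 2048 | 2048 | 1024 | 2048 | 2048 | 1024 | 2048 | 2048 | 1024 |
| BiLSTM2 | - | - | - | 1024 | 1024 | 512 | 2048 | 2048 | 1024 | 2048 | 2048 | 1024 |
| BiLSTM3 | - | - | - | - | - | - | 1024 | 1024 | 512 | 1024 | 1024 | 512 |
| BiLSTM4 | - | - | - | - | - | - | - | - | - | 512 | 512 | 256 |
| FC1 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 |
| FC2 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |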

Figure 3.15 Network of a general TE-LSTM


A general TE-LSTM network comprises a TE network and a deep LSTM network, as shown in Figure 3.15. The TE network consists of an LSTM module and two fully connected layers. The deep LSTM network consists of two LSTM modules and three fully connected layers. Furthermore, the TE network can analyse the sequences to find their important parts and then pay more attention to those parts. By going through the TE network, the temporal information of sequences can be enhanced. The deep LSTM network then analyses the enhanced temporal sequences and classifies human actions.

Note that each LSTM module can be either an LSTM network or a BiLSTM network.

In the TE network, the feature sequences are first passed through the LSTM module and two fully connected layers to analyse the input sequences. Next, the analysed sequences are normalised using softmax normalisation. Finally, the normalised outputs are multiplied by the feature sequences using an element-wise product. In the deep LSTM network, the product is passed through two LSTM modules and three fully connected layers sequentially to classify the human actions.
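The TE steps described above (LSTM module, two fully connected layers, softmax normalisation, element-wise product) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this study: the text does not specify the activation between the two fully connected layers (tanh is assumed here), the output width of the second layer (assumed equal to the feature dimension so that the product is element-wise), or the axis of the softmax (assumed to be the time axis). All function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def run_lstm(seq, w, b):
    """LSTM layer over seq (T, D) -> hidden states (T, H); w is (4H, D + H)."""
    hidden = b.shape[0] // 4
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    outputs = []
    for x_t in seq:
        z = w @ np.concatenate([x_t, h]) + b
        f, i, o = (sigmoid(z[k * hidden:(k + 1) * hidden]) for k in range(3))
        a = np.tanh(z[3 * hidden:])
        c = f * c + i * a
        h = o * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)


def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_enhancement(features, lstm_params, fc1, fc2):
    """TE sketch: LSTM module -> FC -> FC -> softmax -> element-wise product."""
    hidden = run_lstm(features, *lstm_params)      # (T, H)
    w1, b1 = fc1
    w2, b2 = fc2
    scores = np.tanh(hidden @ w1 + b1) @ w2 + b2   # (T, D); tanh is assumed
    weights = softmax(scores, axis=0)              # normalise over time (assumed)
    return weights * features, weights             # enhanced sequence, weights
```

Under these assumptions, the weights for each feature dimension sum to one over the frames, so frames with larger scores contribute more to the enhanced sequence that is passed to the deep LSTM network.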

Tables 3.4, 3.5 and 3.6 show the structures of these five TE-LSTM networks.

Similarly to Table 3.2, T1/2/3/4/5Sp, T1/2/3/4/5Te, and T1/2/3/4/5St indicate the structures of the TE-LSTM1/2/3/4/5 networks trained by spatial features, temporal features and structural features, respectively.
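The structure labels used in the tables follow a regular pattern, which the short helper below decodes. It is a hypothetical convenience function written for this explanation, not part of the system; it only encodes the naming rules stated in the text (L for LSTM, B for BiLSTM, T followed by the TE-LSTM variant number, and Sp/Te/St for the feature type).

```python
import re

FEATURES = {"Sp": "spatial", "Te": "temporal", "St": "structural"}

# Either "<layers><L|B>" or "T<variant>", followed by the feature suffix.
_PATTERN = re.compile(
    r"^(?:(?P<layers>[1-4])(?P<kind>[LB])|T(?P<variant>[1-5]))(?P<feat>Sp|Te|St)$"
)


def parse_structure_name(name):
    """Decode labels such as '2LTe', '4BSt', or 'T3Sp' into a small dict."""
    m = _PATTERN.match(name)
    if m is None:
        raise ValueError(f"unrecognised structure name: {name!r}")
    feature = FEATURES[m.group("feat")]
    if m.group("variant"):
        return {"network": f"TE-LSTM{m.group('variant')}", "features": feature}
    kind = "LSTM" if m.group("kind") == "L" else "BiLSTM"
    layers = int(m.group("layers"))
    if kind == "LSTM" and layers > 3:
        # Table 3.2 lists LSTM networks with one, two, and three layers only.
        raise ValueError("LSTM structures have at most three layers")
    return {"network": kind, "layers": layers, "features": feature}
```

For example, `parse_structure_name("2LTe")` describes a two-layer LSTM network trained by temporal features, and `parse_structure_name("T3Sp")` describes TE-LSTM3 trained by spatial features.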

In Tables 3.4, 3.5 and 3.6, $\mathrm{LSTM}_i^n$, $i = 1,2,3$, $n = 1,2,3,4$, and $\mathrm{BiLSTM}_i^n$, $i = 1,2,3$, $n = 3,4,5$, indicate the numbers of hidden state units of the $i$th LSTM/BiLSTM layers of the $n$th TE-LSTM network, respectively. Here, $\mathrm{FC}_j^n$, $j = 1,2,3,4,5$, $n = 1,2,3,4,5$, indicates the number of neurons in the $j$th fully connected layer of the $n$th TE-LSTM network. The neuron number of $\mathrm{FC}_5^n$, 16, is equal to the number of human action classes. Moreover, the FVN, softmax normalisation, and ⨀ entries indicate whether feature vector normalisation, softmax normalisation, and the element-wise product, respectively, have been applied. A tick indicates that the technique has been applied, and a cross indicates otherwise.

Table 3.4 Structure of TE-LSTM networks: (a) structure of TE-LSTM1, (b) structure of TE-LSTM2

Source document: Online Human Action Recognition Based on Deep Learning Technology for Indoor Mobile Intelligent Robots (pp. 34-39)