
Chapter 3 Online Human Action Recognition System

3.2 System Flowchart

3.2.4 Action Classification

This research adopts LSTM networks to classify human actions. Each type of feature is best classified by a network suited to it. Therefore, twelve kinds of LSTM networks are implemented to find appropriate ones for the three types of features. A newly proposed temporal enhancement LSTM (TE-LSTM) is among the twelve implemented networks. This subsection gives a brief overview of the LSTM network and describes the LSTM models.

Figure 3.13 Examples of feature map visualisation (a) human actions (drink in stand position) (b) corresponding spatial feature maps (c) temporal feature maps (d) structural feature maps

(1) Overview of the LSTM Network

The LSTM network improves on the RNN. As mentioned above, the RNN suffers from vanishing gradients during training, and the LSTM network is designed to mitigate this problem.

The LSTM network consists of at least one LSTM layer, and each LSTM layer contains many LSTM cells. An LSTM cell includes a forget gate, an input gate, and an output gate, as shown in Figure 3.14. These three gates regulate what information is stored in, added to, or removed from each cell. The input gate decides what new information should be added to the cell state. The forget gate decides which parts of the cell state should be retained or removed. The output gate decides the final output vector based on the processed cell state.

Given an input vector, $x_t$, at time step $t$ and a hidden state vector, $h_{t-1}$, at $t-1$, the output values of the forget gate ($f_t$), input gate ($i_t$), output gate ($o_t$), and memory cell candidate ($a_t$), as shown in Figure 3.14, can be obtained by the following equations.

$$f_t = \sigma_s(w_f x_t + u_f h_{t-1} + b_f). \tag{5}$$

$$i_t = \sigma_s(w_i x_t + u_i h_{t-1} + b_i). \tag{6}$$

$$o_t = \sigma_s(w_o x_t + u_o h_{t-1} + b_o). \tag{7}$$

$$a_t = \sigma(w_c x_t + u_c h_{t-1} + b_c). \tag{8}$$

where $w_f$, $w_i$, $w_o$, $w_c$, $u_f$, $u_i$, $u_o$, and $u_c$ indicate parameter matrices. The symbols $b_f$, $b_i$, $b_o$, and $b_c$ are bias vectors, and $\sigma_s$ and $\sigma$ indicate the sigmoid function and the hyperbolic tangent function, respectively.

Figure 3.14 An LSTM cell

The cell state vector, $c_{t-1}$, at $t-1$ can be regarded as the memory of the previous time step and is used to make the connection with the cell state vector, $c_t$, at $t$. Then, $c_t$ can be calculated by

$$c_t = f_t \ast c_{t-1} + i_t \ast a_t. \tag{9}$$

Finally, the hidden state vector, $h_t$, at time step $t$ is defined as

$$h_t = o_t \ast \sigma(c_t), \tag{10}$$

where the operator $\ast$ denotes the element-wise product.
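As a rough illustration of Equations (5) to (10), the following sketch computes a single LSTM time step in plain Python. The names `sigmoid`, `affine`, `lstm_step`, and the `params` dictionary layout are assumptions made for this example and do not come from the implementation used in this research; vectors are lists and matrices are lists of rows.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, used here as the gate activation sigma_s."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(w, x, u, h, b):
    """Return w x + u h + b for list-of-rows matrices and list vectors."""
    return [
        sum(wi * xi for wi, xi in zip(w_row, x))
        + sum(ui * hi for ui, hi in zip(u_row, h))
        + bi
        for w_row, u_row, bi in zip(w, u, b)
    ]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step.

    params (hypothetical layout) maps "f", "i", "o", "c" to a (w, u, b) triple.
    """
    pre = {k: affine(w, x_t, u, h_prev, b) for k, (w, u, b) in params.items()}
    f_t = [sigmoid(z) for z in pre["f"]]      # forget gate, Eq. (5)
    i_t = [sigmoid(z) for z in pre["i"]]      # input gate, Eq. (6)
    o_t = [sigmoid(z) for z in pre["o"]]      # output gate, Eq. (7)
    a_t = [math.tanh(z) for z in pre["c"]]    # memory cell candidate, Eq. (8)
    # Eq. (9): keep part of the old cell state and add part of the candidate.
    c_t = [f * c + i * a for f, c, i, a in zip(f_t, c_prev, i_t, a_t)]
    # Eq. (10): expose a gated, squashed view of the cell state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Usage with a 2-dimensional input and a single hidden unit (all-zero weights).
zero = ([[0.0, 0.0]], [[0.0]], [0.0])
h_t, c_t = lstm_step([1.0, 2.0], [0.0], [2.0], {k: zero for k in "fioc"})
```

With all-zero weights every gate evaluates to 0.5 and the candidate to 0, so the new cell state is simply half of the previous one. This makes the roles of the forget and input gates in Equation (9) easy to check by hand.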

(2) LSTM Networks

Twelve kinds of LSTM networks, including LSTM networks with various numbers of layers, BiLSTM networks with various numbers of layers, and the proposed TE-LSTM networks, are implemented to find suitable ones for the three types of features. Table 3.2 shows the structures of three LSTM networks with one, two, and three layers, respectively.

Different types of features can be input to train LSTM models. In Table 3.2, 1LSp, 1LTe, and 1LSt indicate the structures of the LSTM networks with one-layer LSTM that are trained by spatial features, temporal features, and structural features, respectively.

Similarly, 2/3LSp, 2/3LTe, and 2/3LSt indicate the structures of the LSTM networks with two/three-layer LSTM trained by spatial features, temporal features, and structural features, respectively.

$\mathrm{LSTM}_i$, $i = 1,2,3,4$, as shown in Table 3.2, indicates the number of hidden state units of the $i$th LSTM layer, and $\mathrm{FC}_j$, $j = 1,2$, indicates the number of neurons of the $j$th fully connected layer. The number of neurons of $\mathrm{FC}_2$, 16, is equal to the number of human action classes.

Table 3.2 Structures of LSTM networks

Structure  1-Layer LSTM        2-Layer LSTM        3-Layer LSTM
           1LSp  1LTe  1LSt    2LSp  2LTe  2LSt    3LSp  3LTe  3LSt

Table 3.3 shows the structures of four BiLSTM networks with one, two, three, and four layers. As in Table 3.2, 1/2/3/4BSp, 1/2/3/4BTe, and 1/2/3/4BSt indicate the structures of the BiLSTM networks with one/two/three/four-layer BiLSTM that are trained by spatial features, temporal features, and structural features, respectively. Further, $\mathrm{BiLSTM}_i$, $i = 1,2,3,4$, indicates the number of hidden state units of the $i$th BiLSTM layer.

Additionally, each BiLSTM layer consists of two LSTM layers that process the input sequence in opposite directions. One processes the input sequence from the first frame to the last frame (forward) along the time axis, and the other processes it from the last frame to the first frame (backward). In BiLSTM networks, the relationships of temporal variations can therefore be analysed in both the forward and backward directions. As a result, BiLSTM networks may capture temporal dependencies better than LSTM networks in some cases.
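The two-direction processing described above can be sketched as follows. This is a minimal illustration and not the implementation used in this research: `run_direction` and `bidirectional` are hypothetical helper names, the recurrent step is passed in as a plain function, and combining the two directions by concatenating their hidden states frame by frame is an assumption, since the text does not state how the outputs are merged.

```python
def run_direction(step, xs, h0):
    """Apply a recurrent step over xs in the given order, collecting each hidden state."""
    hs, h = [], h0
    for x in xs:
        h = step(x, h)
        hs.append(h)
    return hs

def bidirectional(step_fwd, step_bwd, xs, h0):
    """Run one pass first-to-last and one pass last-to-first, then merge per frame."""
    fwd = run_direction(step_fwd, xs, h0)
    # Process the reversed sequence, then flip the outputs back into frame order.
    bwd = run_direction(step_bwd, xs[::-1], h0)[::-1]
    # Assumed merge rule: concatenate forward and backward states (lists) per frame.
    return [f + b for f, b in zip(fwd, bwd)]

# Toy recurrent step (a running sum) standing in for an LSTM cell.
running_sum = lambda x, h: [h[0] + x]
states = bidirectional(running_sum, running_sum, [1, 2, 3], [0])
# states == [[1, 6], [3, 5], [6, 3]]
```

In the toy output, the first entry of each pair summarises the frames up to and including the current one, and the second summarises the current frame and all later ones, which is the sense in which both temporal directions are available at every frame.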

This study implemented five types of temporal enhancement (TE)-LSTM networks: TE-LSTM1, TE-LSTM2, TE-LSTM3, TE-LSTM4 and TE-LSTM5. These all have the same structure, a general TE-LSTM structure, but some of the layers use different LSTM models.

Table 3.3 Structures of BiLSTM networks

Structure  1-Layer BiLSTM      2-Layer BiLSTM      3-Layer BiLSTM      4-Layer BiLSTM
           1BSp  1BTe  1BSt    2BSp  2BTe  2BSt    3BSp  3BTe  3BSt    4BSp  4BTe  4BSt
BiLSTM1    2048  2048  1024    2048  2048  1024    2048  2048  1024    2048  2048  1024
BiLSTM2    -     -     -       1024  1024  512     2048  2048  1024    2048  2048  1024
BiLSTM3    -     -     -       -     -     -       1024  1024  512     1024  1024  512
BiLSTM4    -     -     -       -     -     -       -     -     -       512   512   256
FC1        128   128   128     128   128   128     128   128   128     128   128   128
FC2        16    16    16      16    16    16      16    16    16      16    16    16
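For readers who want to reproduce these configurations, the BiLSTM structures of Table 3.3 can be written down as a small lookup. This is only a sketch: the names `BILSTM_UNITS`, `FC_UNITS`, and `layer_stack` are hypothetical, and the assignment of values to networks assumes that rows with fewer entries belong to the deeper networks (for example, the BiLSTM4 row applies only to the four-layer networks).

```python
# Hidden state units per BiLSTM layer, in layer order, keyed by network name.
BILSTM_UNITS = {
    "1BSp": [2048], "1BTe": [2048], "1BSt": [1024],
    "2BSp": [2048, 1024], "2BTe": [2048, 1024], "2BSt": [1024, 512],
    "3BSp": [2048, 2048, 1024], "3BTe": [2048, 2048, 1024], "3BSt": [1024, 1024, 512],
    "4BSp": [2048, 2048, 1024, 512], "4BTe": [2048, 2048, 1024, 512],
    "4BSt": [1024, 1024, 512, 256],
}

# Fully connected layers shared by every network; FC2 has one neuron per action class.
FC_UNITS = [128, 16]

def layer_stack(name):
    """Return the ordered (layer name, size) pairs for one network in the table."""
    layers = [(f"BiLSTM{i}", n) for i, n in enumerate(BILSTM_UNITS[name], start=1)]
    layers += [(f"FC{j}", n) for j, n in enumerate(FC_UNITS, start=1)]
    return layers

# Example: the three-layer BiLSTM network trained on structural features.
stack_3bst = layer_stack("3BSt")
```

Here `stack_3bst` lists three BiLSTM layers of 1024, 1024, and 512 units followed by the two fully connected layers of 128 and 16 neurons.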

Figure 3.15 Network of a general TE-LSTM


A general TE-LSTM network comprises a TE network and a deep LSTM network, as shown in Figure 3.15. The TE network consists of an LSTM module and two fully connected layers. The deep LSTM network consists of two LSTM modules and three fully connected layers. Furthermore, the TE network can analyse the sequences to find their important parts and then pay more attention to those parts. By going through the TE network, the temporal information of sequences can be enhanced. The deep LSTM network then analyses the enhanced temporal sequences and classifies human actions.

Note that each LSTM module can be either an LSTM network or a BiLSTM network.

In the TE network, the feature sequences are first passed through the LSTM module and two fully connected layers to analyse the input sequences. Next, the analysed sequences are normalised using the softmax normalisation method. Finally, the normalised outputs are multiplied by the feature sequences using the element-wise product. In the deep LSTM network, the product result is passed through two LSTM modules and three fully connected layers sequentially to classify the human actions.
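The normalisation and weighting steps of the TE network can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this research: `softmax` and `temporal_enhance` are hypothetical names, `scores` stands for the output of the LSTM module and the two fully connected layers (assumed here to have the same T x D shape as the features), and the softmax is assumed to run along the time axis separately for each feature dimension, which the text does not specify.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_enhance(features, scores):
    """Weight a T x D feature sequence by softmax-normalised scores.

    Assumption: the softmax is taken over the T time steps, per feature dimension.
    """
    n_steps, n_dims = len(features), len(features[0])
    # weights[d][t] is the normalised importance of time step t for dimension d.
    weights = [softmax([scores[t][d] for t in range(n_steps)]) for d in range(n_dims)]
    # Element-wise product of the normalised scores with the original features.
    return [
        [weights[d][t] * features[t][d] for d in range(n_dims)]
        for t in range(n_steps)
    ]

# Two frames with two feature dimensions each.
features = [[1.0, 2.0], [3.0, 4.0]]
scores = [[0.0, 0.0], [math.log(3.0), 0.0]]
enhanced = temporal_enhance(features, scores)
```

In this example the first feature dimension receives weights of 0.25 and 0.75 across the two frames, so the second frame is emphasised, while the second dimension receives equal weights of 0.5. The enhanced sequence would then be passed to the deep LSTM network.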

Tables 3.4, 3.5 and 3.6 show the structures of these five TE-LSTM networks.

Similarly to Table 3.2, T1/2/3/4/5Sp, T1/2/3/4/5Te, and T1/2/3/4/5St indicate the structures of the TE-LSTM1/2/3/4/5 networks trained by spatial features, temporal features and structural features, respectively.

In Tables 3.4, 3.5 and 3.6, $\mathrm{LSTM}_i^n$, $i = 1,2,3$, $n = 1,2,3,4$, and $\mathrm{BiLSTM}_i^n$, $i = 1,2,3$, $n = 3,4,5$, indicate the numbers of hidden state units of the $i$th LSTM/BiLSTM layers of the $n$th TE-LSTM networks, respectively. Here, $\mathrm{FC}_j^n$, $j = 1,2,3,4,5$, $n = 1,2,3,4,5$, indicates the number of neurons of the $j$th fully connected layer of the $n$th TE-LSTM network. The number of neurons of $\mathrm{FC}_5^n$, 16, is equal to the number of human action classes. Moreover, the FVN, softmax normalisation, and ⨀ entries indicate whether feature vector normalisation, softmax normalisation, and the element-wise product, respectively, have been applied. A tick indicates that the technique has been applied, and a cross indicates otherwise.

Table 3.4 Structure of TE-LSTM networks (a) structure of TE-LSTM1 (b) structure of TE-LSTM2
