Chapter 3 Online Human Action Recognition System
3.2 System Flowchart
3.2.4 Action Classification
This research adopts LSTM networks to classify human actions. Each type of feature can be well classified by an appropriate and targeted network. Therefore, twelve kinds of LSTM networks are implemented to find appropriate ones for the three types of features. The newly proposed temporal enhancement LSTM (TE-LSTM) is among the twelve implemented networks. This subsection gives a brief overview of the LSTM network and describes the LSTM models.
Figure 3.13 Examples of feature map visualisation (a) human actions (drink in stand position) (b) corresponding spatial feature maps (c) temporal feature maps (d) structural feature maps
(1) Overview of the LSTM Network
The LSTM network is an improved form of the RNN. As mentioned above, the RNN suffers from vanishing gradients during training, and the LSTM network is designed to alleviate this problem.
The LSTM network consists of at least one LSTM layer, and each LSTM layer contains many LSTM cells. An LSTM cell includes a forget gate, an input gate, and an output gate, as shown in Figure 3.14. Together, these three gates regulate which information is stored in, added to, or removed from the cell state. The input gate decides what new information should be added to the cell state. The forget gate decides which parts of the cell state should be retained or removed. The output gate decides the final output vector based on the processed cell state.
Figure 3.14 An LSTM cell
Given an input vector, $x_t$, at time step $t$ and a hidden state vector, $h_{t-1}$, at $t-1$, the output values of the forget gate ($f_t$), input gate ($i_t$), output gate ($o_t$), and memory cell candidate ($a_t$), as shown in Figure 3.14, can be obtained by the following equations.
$$f_t = \sigma_s(w_f x_t + u_f h_{t-1} + b_f) \tag{5}$$
$$i_t = \sigma_s(w_i x_t + u_i h_{t-1} + b_i) \tag{6}$$
$$o_t = \sigma_s(w_o x_t + u_o h_{t-1} + b_o) \tag{7}$$
$$a_t = \sigma_h(w_c x_t + u_c h_{t-1} + b_c) \tag{8}$$
where $w_f$, $w_i$, $w_o$, $w_c$, $u_f$, $u_i$, $u_o$, and $u_c$ indicate parameter matrices, $b_f$, $b_i$, $b_o$, and $b_c$ are bias vectors, and $\sigma_s$ and $\sigma_h$ indicate the sigmoid function and the hyperbolic tangent function, respectively.
The cell state vector, $c_{t-1}$, at $t-1$ can be regarded as the memory of the previous time step and can be used to make the connection with the cell state vector, $c_t$, at $t$. Then, $c_t$ can be calculated by
$$c_t = f_t \ast c_{t-1} + i_t \ast a_t \tag{9}$$
Finally, the hidden state vector, $h_t$, at time step $t$ is defined as
$$h_t = o_t \ast \sigma_h(c_t) \tag{10}$$
where the operator $\ast$ denotes the element-wise product.
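As an illustration of Eqs. (5) to (10), the following is a minimal sketch of a single LSTM time step written with the Python standard library only. The function names (`sigmoid`, `affine`, `lstm_cell_step`) and the `params` dictionary layout are assumptions made for this example and are not taken from the implementation used in this research.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, the sigma_s of Eqs. (5)-(7)."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(w, x, u, h, b):
    """Return w*x + u*h + b, with matrices stored as lists of rows."""
    return [
        sum(wk * xk for wk, xk in zip(w_row, x))
        + sum(uk * hk for uk, hk in zip(u_row, h))
        + bk
        for w_row, u_row, bk in zip(w, u, b)
    ]


def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One LSTM time step.

    params maps "f", "i", "o", "c" to a tuple (w, u, b)
    (hypothetical layout chosen for this sketch).
    Returns the pair (h_t, c_t).
    """
    def gate(name, activation):
        w, u, b = params[name]
        return [activation(z) for z in affine(w, x_t, u, h_prev, b)]

    f_t = gate("f", sigmoid)    # Eq. (5): forget gate
    i_t = gate("i", sigmoid)    # Eq. (6): input gate
    o_t = gate("o", sigmoid)    # Eq. (7): output gate
    a_t = gate("c", math.tanh)  # Eq. (8): memory cell candidate

    # Eq. (9): element-wise update of the cell state.
    c_t = [f * c + i * a for f, c, i, a in zip(f_t, c_prev, i_t, a_t)]
    # Eq. (10): element-wise product of output gate and tanh(c_t).
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

With all parameters set to zero, every gate evaluates to 0.5 and the candidate to 0, so the new cell state is simply half of the previous one, which gives a quick sanity check of the update rule.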
(2) LSTM Networks
Twelve kinds of LSTM networks, comprising LSTM networks with different numbers of layers, BiLSTM networks with different numbers of layers, and the proposed TE-LSTM networks, are implemented to find suitable ones for the three types of features. Table 3.2 shows the structures of three LSTM networks with one, two and three layers, respectively.
Different types of features can be input to train LSTM models. In Table 3.2, 1LSp, 1LTe, and 1LSt indicate the structures of the LSTM networks with one-layer LSTM that are trained by spatial features, temporal features, and structural features, respectively.
Similarly, 2/3LSp, 2/3LTe, and 2/3LSt indicate the structures of the LSTM networks with two/three-layer LSTM trained by spatial features, temporal features, and structural features, respectively.
$\mathrm{LSTM}_i$, $i = 1,2,3,4$, as shown in Table 3.2, indicates the number of hidden state units of the $i$-th LSTM layer, and $\mathrm{FC}_j$, $j = 1,2$, indicates the number of neurons of the $j$-th fully connected layer. The number of neurons of $\mathrm{FC}_2$, 16, is equal to the number of human action classes.
Table 3.3 shows the structures of four BiLSTM networks with one, two, three, and four layers. As in Table 3.2, 1/2/3/4BSp, 1/2/3/4BTe and 1/2/3/4BSt indicate the structures of the BiLSTM networks with one/two/three/four-layer BiLSTM that are trained by spatial features, temporal features, and structural features, respectively. Further, $\mathrm{BiLSTM}_i$, $i = 1,2,3,4$, indicates the number of hidden state units of the $i$-th BiLSTM layer.
Table 3.2 Structures of LSTM networks
Structure   1-Layer LSTM        2-Layer LSTM        3-Layer LSTM
            1LSp  1LTe  1LSt    2LSp  2LTe  2LSt    3LSp  3LTe  3LSt
Additionally, each BiLSTM layer consists of two LSTM layers that process the input sequence in two directions. One processes the input sequence from the first frame to the last frame (forward) along the time axis, and the other processes it from the last frame to the first frame (backward). In BiLSTM networks, the relationships of temporal variations can therefore be analysed in both forward and backward directions. As a result, BiLSTM networks may capture temporal dependencies better than LSTM networks in some cases.
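The two-direction processing described above can be sketched as follows. This is a minimal illustration and not the implementation used in this research: the helper names `run_direction` and `bidirectional` are hypothetical, the recurrent step is passed in as a generic function so that any cell can be used, and concatenating the forward and backward outputs at each frame is an assumption (a common convention, but the text does not state how the two directions are combined).

```python
def run_direction(sequence, step, h0):
    """Apply a recurrent step along the sequence and collect every hidden state."""
    h = h0
    outputs = []
    for x_t in sequence:
        h = step(x_t, h)
        outputs.append(h)
    return outputs


def bidirectional(sequence, fwd_step, bwd_step, h0):
    """Run one pass first-to-last and one pass last-to-first, then merge per frame."""
    forward = run_direction(sequence, fwd_step, h0)
    # Process the reversed sequence, then flip the outputs back so that
    # index t of both lists refers to the same frame.
    backward = run_direction(sequence[::-1], bwd_step, h0)[::-1]
    # Assumption: the two hidden states are concatenated for each frame.
    return [f + b for f, b in zip(forward, backward)]


def running_sum(x_t, h):
    """Toy stand-in for an LSTM cell: accumulates the inputs seen so far."""
    return [h[0] + x_t[0]]


frames = [[1.0], [2.0], [3.0]]
merged = bidirectional(frames, running_sum, running_sum, [0.0])
```

In this toy run, the first half of each merged vector summarises the frames up to and including the current one, while the second half summarises the current frame and all later ones, which is the extra context a BiLSTM layer provides.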
This study implemented five types of temporal enhancement (TE)-LSTM networks: TE-LSTM1, TE-LSTM2, TE-LSTM3, TE-LSTM4 and TE-LSTM5. These all have the same structure, a general TE-LSTM structure, but some of the layers use different LSTM models.
Table 3.3 Structures of BiLSTM networks
Structure   1-Layer BiLSTM      2-Layer BiLSTM      3-Layer BiLSTM      4-Layer BiLSTM
            1BSp  1BTe  1BSt    2BSp  2BTe  2BSt    3BSp  3BTe  3BSt    4BSp  4BTe  4BSt
BiLSTM1     2048  2048  1024    2048  2048  1024    2048  2048  1024    2048  2048  1024
BiLSTM2     -     -     -       1024  1024  512     2048  2048  1024    2048  2048  1024
BiLSTM3     -     -     -       -     -     -       1024  1024  512     1024  1024  512
BiLSTM4     -     -     -       -     -     -       -     -     -       512   512   256
FC1         128   128   128     128   128   128     128   128   128     128   128   128
FC2         16    16    16      16    16    16      16    16    16      16    16    16
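For readers who wish to reproduce these structures, the values of Table 3.3 can be written as a small configuration. The sketch below only restates the table; the names `BILSTM_UNITS`, `FC_UNITS` and `layer_plan` are hypothetical and chosen for this example.

```python
# Hidden state units of BiLSTM1..BiLSTMn for each structure in Table 3.3.
BILSTM_UNITS = {
    "1BSp": [2048],
    "1BTe": [2048],
    "1BSt": [1024],
    "2BSp": [2048, 1024],
    "2BTe": [2048, 1024],
    "2BSt": [1024, 512],
    "3BSp": [2048, 2048, 1024],
    "3BTe": [2048, 2048, 1024],
    "3BSt": [1024, 1024, 512],
    "4BSp": [2048, 2048, 1024, 512],
    "4BTe": [2048, 2048, 1024, 512],
    "4BSt": [1024, 1024, 512, 256],
}

# FC1 and FC2 are identical for all structures; 16 is the number of action classes.
FC_UNITS = [128, 16]


def layer_plan(structure):
    """Return the ordered (layer type, size) pairs for one structure name."""
    plan = [("BiLSTM", units) for units in BILSTM_UNITS[structure]]
    plan += [("FC", units) for units in FC_UNITS]
    return plan
```

For example, `layer_plan("2BSt")` lists two BiLSTM layers with 1024 and 512 hidden state units followed by the two fully connected layers.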
Figure 3.15 Network of a general TE-LSTM
A general TE-LSTM network comprises a TE network and a deep LSTM network, as shown in Figure 3.15. The TE network consists of an LSTM module and two fully connected layers. The deep LSTM network consists of two LSTM modules and three fully connected layers. Furthermore, the TE network can analyse the sequences to find their important parts and then pay more attention to those parts. By going through the TE network, the temporal information of sequences can be enhanced. The deep LSTM network then analyses the enhanced temporal sequences and classifies human actions.
Note that each LSTM module can be either an LSTM network or a BiLSTM network.
In the TE network, the feature sequences are first passed through the LSTM module and two fully connected layers to analyse the input sequences. Next, the analysed sequences are normalised using the softmax function. Finally, the normalised outputs are multiplied by the original feature sequences using an element-wise product. In the deep LSTM network, the result of this product is passed through two LSTM modules and three fully connected layers sequentially to classify the human actions.
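The normalisation and element-wise product steps of the TE network can be sketched as below. This is an illustrative sketch under stated assumptions, not the implementation used in this research: the scores produced by the LSTM module and the two fully connected layers are taken as a given input, they are assumed to have the same shape as the feature sequence (frames by feature dimensions), and the softmax is assumed to be applied along the time axis for each feature dimension. The function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_enhance(features, scores):
    """Weight a feature sequence by softmax-normalised scores.

    features, scores: lists of T frames, each a list of D values.
    Assumption: softmax runs over the T frames for each of the D dimensions.
    """
    num_frames, dim = len(features), len(features[0])
    # One weight profile over time for every feature dimension.
    weights = [
        softmax([scores[t][d] for t in range(num_frames)])
        for d in range(dim)
    ]
    # Element-wise product of the weights and the original features.
    return [
        [weights[d][t] * features[t][d] for d in range(dim)]
        for t in range(num_frames)
    ]
```

Frames with larger scores receive larger weights, so their features contribute more to the sequence passed on to the deep LSTM network; with equal scores, every frame is weighted by 1/T.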
Tables 3.4, 3.5 and 3.6 show the structures of these five TE-LSTM networks.
Similarly to Table 3.2, T1/2/3/4/5Sp, T1/2/3/4/5Te, and T1/2/3/4/5St indicate the structures of the TE-LSTM1/2/3/4/5 networks trained by spatial features, temporal features and structural features, respectively.
In Tables 3.4, 3.5 and 3.6, LSTM𝑖𝑛, 𝑖 = 1,2,3, 𝑛 = 1,2,3,4, and BiLSTM𝑖𝑛, 𝑖 = 1,2,3, 𝑛 = 3,4,5 indicate the hidden state units of the ith LSTM/BiLSTM layers of the
Table 3.4 Structure of TE-LSTM networks (a) structure of TE-LSTM1 (b) structure of TE-LSTM2