Online Human Action Recognition Using Deep Learning for Indoor Smart Mobile Robots


National Taiwan Normal University
Department of Computer Science and Information Engineering
Master's Thesis

Advisor: Dr. 方瓊瑤

Online Human Action Recognition Using Deep Learning for Indoor Smart Mobile Robots

Graduate student: 謝日棠

July 2020 (Republic of China year 109)

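Acknowledgements

As I write these lines of my thesis, the flame trees have begun to bloom. First, I thank my parents and family, who have always supported me. They allowed me to pursue my master's degree free of financial pressure and encouraged me whenever I was at a low point. Completing a master's thesis takes more than time; even more important are the collaboration and discussion between professor and student. I especially thank Professor 方瓊瑤 for the guidance and mentoring given throughout my graduate studies, and for the many opportunities to go abroad for exchange and study. While I was writing this thesis, Professor 方瓊瑤 patiently and carefully advised me on the details that needed attention and on writing technique, which made the thesis more complete. This guidance was not limited to academic work: Professor 方瓊瑤 also advised me on ways of handling matters and on attitudes in daily life, from which I benefited greatly.

Next, I thank Professor 陳世旺, who often gave me suggestions on my experiments and thesis during laboratory meetings and who helped me enthusiastically and earnestly whenever I ran into difficulties in my studies or thesis. I also thank Dr. 黃仲誼 and Dr. 羅安鈞 for taking the time to review my thesis, attend my oral defence, and offer many suggestions that improved it. I further thank my laboratory colleagues 后玲, 簡佑如, 林旭政, 蔡妤涓, 曾永權, 徐秉琛, 胡雅雯, 江孟霖, and 柯皓中 for their help and company during my research: thank you for helping to record the dataset videos, sharing the laboratory's chores, and keeping me company in the laboratory. Finally, I thank 廖俐智, Riki Otaki, and Fredrik Nilsson for their help with my studies. I dedicate this thesis to everyone who has helped and encouraged me.

謝日棠
Department of Computer Science and Information Engineering, National Taiwan Normal University
July 2020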

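Abstract (Chinese)

This research proposes a deep-learning-based online human action recognition system for indoor smart mobile robots. The system uses visual input to recognise human actions online while the camera moves towards the target person; its main purpose is to give intelligent human-robot interaction more interface choices beyond voice control and touch screens. The system takes three kinds of visual input: colour image information, short-term dynamic information, and human skeleton information. Recognition proceeds in five stages: human detection, human tracking, feature extraction, action recognition, and result fusion. The system first uses a two-dimensional pose estimation method to locate the persons in the image and then tracks them with the Deep SORT tracking method. Next, human action features are extracted from the tracked persons for subsequent action recognition. Three kinds of features are extracted: spatial features, short-term dynamic features, and skeleton features. In the action recognition stage, the three kinds of features are fed into three trained neural networks (LSTM networks) for action classification. Finally, the outputs of the three networks are fused to produce the system's classification result, with the aim of achieving the best performance.

In addition, this research builds a human action dataset recorded with a moving camera, the CVIU Moving Camera Human Action dataset. The dataset contains 3,646 human action videos covering 11 single-person actions and 5 two-person interactions recorded from three camera angles. The single-person actions are drink in stand position, drink in sit position, eat in stand position, eat in sit position, play with a phone, sit down, stand up, use a laptop, walk straight, walk horizontal, and read. The two-person interactions are kick, hug, carry object, walk toward each other, and walk away from each other. The videos in this dataset are also used to train and evaluate the system. Experimental results show recognition rates of 96.64% for the spatial-feature classifier, 81.87% for the short-term dynamic-feature classifier, and 68.10% for the skeleton-feature classifier. Fusing the three kinds of features gives a recognition rate of 96.84%.

Keywords: online human action recognition, indoor smart mobile robot, moving camera, deep learning, long short-term memory, bidirectional long short-term memory, temporal enhancement long short-term memory, spatial feature, temporal feature, structural feature.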

Abstract

This research proposes a vision-based online human action recognition system that uses deep learning methods to recognise human actions under moving camera circumstances. The proposed system consists of five stages: human detection, human tracking, feature extraction, action classification, and fusion. The system uses three kinds of input information: colour intensity, short-term dynamic information, and skeletal joints. In the human detection stage, a two-dimensional (2D) pose estimator is used to detect humans. In the human tracking stage, the Deep SORT tracking method is used to track them. In the feature extraction stage, three kinds of features (spatial, temporal, and structural) are extracted to analyse human actions. In the action classification stage, the three kinds of features are classified by three respective long short-term memory (LSTM) classifiers. In the fusion stage, a fusion method combines the three outputs of the LSTM classifiers.

This study also constructs the computer vision and image understanding (CVIU) Moving Camera Human Action dataset (CVIU dataset), which contains 3,646 human action sequences covering 11 types of single-person actions and 5 types of interactive actions. The single-person actions are drink in sit position, drink in stand position, eat in sit position, eat in stand position, play with a phone, sit down, stand up, use a laptop, walk straight, walk horizontal, and read. The interactive actions are kick, hug, carry object, walk toward each other, and walk away from each other. This dataset was used to train and evaluate the proposed system. Experimental results show that the recognition rates of the spatial, temporal, and structural features were 96.64%, 81.87%, and 68.10%, respectively. The fusion result of human action recognition for indoor smart mobile robots in this study was 96.84%.

Keywords: online human action recognition, indoor smart mobile robot, deep learning, long short-term memory, bidirectional long short-term memory, temporal enhancement long short-term memory, spatial feature, temporal feature, structural feature.

Table of Contents

Acknowledgements
Abstract (Chinese)
Abstract
Table of Contents
List of Tables
List of Figures
Chapter 1. Introduction
    1.1 Research Motivation
    1.2 Background and Difficulty
    1.3 Research Contribution
    1.4 Thesis Framework
Chapter 2. Related Work
    2.1 Features of Human Action Recognition
    2.2 Models of Human Action Recognition
Chapter 3. Online Human Action Recognition System
    3.1 Research Purpose
    3.2 System Flowchart
        3.2.1 Human Detection
        3.2.2 Human Tracking
        3.2.3 Feature Extraction
        3.2.4 Action Classification
        3.2.5 Fusion
Chapter 4. Experimental Results
    4.1 Research Environment and Equipment Setup
    4.2 CVIU Moving Camera Human Action Dataset
    4.3 Action Classification Results of Three Types of Features
    4.4 Fusion Results
    4.5 Multi-Human Action Classification Results
Chapter 5. Conclusions and Future Works
    5.1 Conclusions
    5.2 Future Works
References

List of Tables

Table 3.1 InceptionV3 (a) outline of InceptionV3 architecture (b) evaluation results comparing InceptionV3 with other models [Sze16]
Table 3.2 Structures of LSTM networks
Table 3.3 Structures of BiLSTM networks
Table 3.4 Structure of TE-LSTM networks (a) structure of TE-LSTM1 (b) structure of TE-LSTM2
Table 3.5 Structure of TE-LSTM networks (a) structure of TE-LSTM3 (b) structure of TE-LSTM4
Table 3.6 Structure of TE-LSTM5
Table 4.1 Decision results of frame sampling number selection
Table 4.2 Results of preprocessing using 1/2-layer LSTM networks
Table 4.3 The total amounts of human action sequences used for action classification
Table 4.4 Action classification results of spatial features
Table 4.5 Action classification results of temporal features
Table 4.6 Action classification results of structural features
Table 4.7 The training time of the twelve LSTM networks with the corresponding types of features (a) spatial feature (b) temporal feature (c) structural feature
Table 4.8 Classification results of fusion methods and three types of features

List of Figures

Figure 1.1 Indoor smart mobile robots (a) Troika (b) Aibo (c) Zenbo (d) Pepper
Figure 1.2 Smart robot market from 2017 to 2026 as reported by Maximize Market Research [9]
Figure 1.3 Global indoor robot market from 2018 to 2026 as reported by Maximize Market Research [10]
Figure 1.4 Global robotic market (a) global mobile robotics market from 2018 to 2023 as reported by Markets And Markets [11] (b) global service robotics market from 2020 to 2025 as reported by Mordor Intelligence [12]
Figure 1.5 Robotics market summary from 2020 to 2025 reported by Mordor Intelligence [13]
Figure 1.6 Robotics market growth rates by regions [13]
Figure 3.1 Flowchart of online human action recognition system
Figure 3.2 Comparison of human pose estimators [Cao19]
Figure 3.3 Human skeletal joints (a) location of joints (b) result of joints extraction
Figure 3.4 Results of human detection (a) completed joint extraction (b) incomplete joint extraction
Figure 3.5 Results of human tracking
Figure 3.6 Reduced skeletal joints (a) skeletal joints without head joints (b) detection result of skeletal joints without head joints
Figure 3.7 Schematic diagram of body height
Figure 3.8 An example of joint filling (a) a missing joint (b) result of joint filling
Figure 3.9 An example of filling the missing joints (a) the missing joints (b) result of joint filling
Figure 3.10 Input information (a) RGB colour images (b) optical flow (c) human skeletal joints
Figure 3.11 An example of optical flow calculation (a) two successive input frames (b) their corresponding CR regions (c) optical flow
Figure 3.12 Two examples of feature map visualisation (a) human actions (walk toward each other) (b) corresponding spatial feature maps (c) temporal feature maps (d) structural feature maps
Figure 3.13 Examples of feature map visualisation (a) human actions (drink in stand position) (b) corresponding spatial feature maps (c) temporal feature maps (d) structural feature maps
Figure 3.14 An LSTM cell
Figure 3.15 Network of a general TE-LSTM
Figure 4.1 Schematic diagram of recording dataset
Figure 4.2 Three recording perspectives for “carry object” (a) D1 (b) D2 (c) D3
Figure 4.3 Confusion matrix for (a) the first fusion method (b) the second fusion method
Figure 4.4 Classification results of the online system (a) action “sit” (b) action “walk toward each other” (c) action “stand”
Figure 4.5 Multi-action classification results of the online system (a) actions “walk toward each other” and “carry object” (b) actions “walk horizontal”, “sit”, and “drink in sit position” (c) actions “walk toward each other”, “kick”, and “walk away from each other”

Chapter 1. Introduction

1.1 Research Motivation

Indoor smart mobile robots have been rapidly adopted in human society and are widely used in public and private indoor spaces for guidance, entertainment, home service, security, and so on. For example, a guidance robot such as Troika [1], shown in Figure 1.1 (a), moves around an airport and provides directions and guidance for tourists. Entertainment robots such as Aibo [2], a dog-shaped robot shown in Figure 1.1 (b), can play with children or pets in the house. Home service robots such as Zenbo [3], shown in Figure 1.1 (c), provide company to family members. Multifunctional smart robots such as Pepper [4], shown in Figure 1.1 (d), can serve as receptionists at offices and banks, as companions at home, and as educational robots at schools, colleges, and universities. These robots have a degree of interactivity and autonomy that stems from their “intelligence”, which is created through artificial intelligence techniques. Robots with such intelligence are called smart robots.

Figure 1.1 Indoor smart mobile robots: (a) Troika [5], (b) Aibo [6], (c) Zenbo [7], (d) Pepper [8]

The aforementioned robots, Troika [1], Aibo [2], Zenbo [3], and Pepper [4], have already been released and are used in houses, airports, stores, and other indoor spaces. They are produced, respectively, by LG (Lucky-Goldstar), a South Korean multinational electronics company; Sony, a Japanese multinational conglomerate; Asus, a Taiwan-based multinational computer, phone hardware, and electronics company; and SoftBank Robotics, a holding company in the SoftBank Group. All of these robots interact mainly via voice commands, and Zenbo can also interact via a touch screen. In summary, indoor smart mobile robots interact mainly through voice recognition and touch screen systems.

Verbal and touch-screen commands are indeed direct and smart human-robot interaction techniques. However, voice recognition systems are typically limited by differences in language, accent, and even tone of speech. A touch screen system limits the possible distance between the user and the robot: the user must be close enough to touch the screen or to see the icons shown on it.

Vision-based recognition systems provide an alternative type of direct and smart human-robot interaction, in which users interact with the robot through a vision-based human action recognition system. Users only need to perform an everyday action in front of the robot, and the robot is expected to see and recognise the action and then respond accordingly. For example, if the robot sees the user sit down on a chair, it can move to the user and offer some water and food. With this approach, users who speak different languages can interact smoothly with the robot. Further, because the system is vision-based, the robot can interact with a human from a distance. Thus, the barriers and limitations associated with voice recognition and touch screen systems can be overcome by a vision-based online human action recognition system. Such systems can therefore diversify the human-robot interaction approaches of future robot products.

Moreover, many market research companies have a positive outlook on robot markets and forecast growth in the coming years for smart robots, indoor robots, mobile robots, service robots, and other robots. Robot markets are therefore expected to grow strongly worldwide.

The smart robot market is promising according to Maximize Market Research, as shown in Figure 1.2 [9], where the number below each bar indicates the year and the number above each bar indicates the market value for that year in billion USD. The report covers the smart robot market from 2017 to 2026. In 2017, the market was valued at USD 4.54 billion, and it is expected to grow to USD 29.46 billion by 2026, a compound annual growth rate (CAGR) of 23.1% over the forecast period.

Figure 1.2 Smart robot market from 2017 to 2026 as reported by Maximize Market Research [9]

Maximize Market Research also reported and forecast the value of the global indoor robot market from 2018 to 2026, as shown in Figure 1.3 [10], where the number below each bar indicates the year and the colours indicate particular regions. The global indoor robot market is predicted to have a CAGR of 28.9% over the forecast period.

Figure 1.3 Global indoor robot market from 2018 to 2026 as reported by Maximize Market Research [10]

Markets And Markets reported and forecast the value of the global mobile robot market from 2018 to 2023, as shown in Figure 1.4 (a) [11], where the number below each green bar indicates the year and the number in each green bar indicates the market value for that year in billion USD. In 2018, the mobile robot market was valued at USD 18.7 billion, and it is expected to grow to USD 54.1 billion by 2023 at a CAGR of 23.71%. Mordor Intelligence [12] reported and forecast the value of the global service robotics market from 2020 to 2025, as shown in Figure 1.4 (b), where the number below each orange bar indicates the year and the arrow indicates the CAGR from 2020 to 2025. The global service robotics market was valued at USD 14.39 billion in 2019 and is expected to grow to USD 63.80 billion at a CAGR of 25.34% over the forecast period.

Figure 1.4 Global robotic market: (a) global mobile robotics market from 2018 to 2023 as reported by Markets And Markets [11], (b) global service robotics market from 2020 to 2025 as reported by Mordor Intelligence [12]

The overall robotics market is shown in Figure 1.5 [13], where the number below each blue bar indicates the year and the arrow indicates the CAGR from 2020 to 2025. Mordor Intelligence [13] valued the robotics market at USD 39.72 billion in 2019 and predicted a CAGR of 25% over the forecast period from 2020 to 2025.

Figure 1.5 Robotics market summary from 2020 to 2025 as reported by Mordor Intelligence [13]
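
The CAGR figures quoted above follow from the start value, the end value, and the number of compounding years. The sketch below is only an illustration of that arithmetic, not something taken from the cited market reports; the function name `cagr` and the assumption of nine annual periods between 2017 and 2026 are ours.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    if start_value <= 0 or end_value <= 0 or years <= 0:
        raise ValueError("values and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Smart robot market values quoted in the text: USD 4.54 billion in 2017
# and USD 29.46 billion in 2026. ASSUMPTION: nine annual compounding periods.
smart_robot_cagr = cagr(4.54, 29.46, 2026 - 2017)
print(f"{smart_robot_cagr:.1%}")
```

Under these assumptions the result is about 23.1%, which agrees with the rate reported for the smart robot market.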

Furthermore, Mordor Intelligence shows the growth rate of the overall robotics market from 2019 to 2024 by region, as shown in Figure 1.6 [13], where different colours indicate different growth rates: green regions indicate high growth, yellow regions medium growth, and red regions low growth. The coloured regions cover over half the world, which suggests that robotics markets have a substantial global economic impact.

Figure 1.6 Robotics market growth rates by region [13]

Indoor smart mobile robots appear to have a strong economic outlook and a high chance of bringing considerable economic benefit to many countries. With such high growth rates in the indoor smart mobile robot markets, such robots are likely to be widely used in the foreseeable future. A diversity of hardware and software products is therefore necessary to satisfy different kinds of customer requirements. Here, hardware refers to the physical parts of a robot, such as the central processing unit, the robot's appearance, and the monitor. Software refers to the non-physical parts, such as control systems, input recognition systems, output systems, and inference systems. Improvements in both hardware and software will increase the economic value of the robots.

This research focuses on improving the input recognition system, which is part of the software. Different types of input recognition systems process different kinds of input information and result in different types of human-robot interaction. For example, a voice recognition system lets robots interact with humans via voice commands, a touch screen system lets robots interact with humans via screen touches, and a human action recognition system lets robots interact with humans via human action commands. This research develops a vision-based online human action recognition system for indoor smart mobile robots. The system is expected to let a robot recognise human actions online while the robot is moving towards the user.

1.2 Background and Difficulty

Human action recognition has been a challenging computer vision problem in video analysis for decades. Methods of human action recognition can be divided into online and offline approaches. Offline methods classify human actions after obtaining the entire sequence. By contrast, online methods can classify actions from only a partial sequence. Both types of method classify the action of the current frame based on information from previous frames, but only online methods are capable of early action classification.

Online human action recognition can be done using a traditional machine learning approach or a deep learning approach. An example of the former is the maximum-margin framework proposed by Hoai and De la Torre [Hoa12], which is based on a structured-output support vector machine (SVM) and achieves online action prediction. An example of the latter is the two-stream feedback neural network proposed by De Geest and Tuytelaars [De18], which is built on a recurrent neural network (RNN) with long short-term memory (LSTM) [Hoc97]. Both approaches are popular for solving the online human action recognition problem. This research uses a deep learning method to build the online human action recognition system and explores the characteristics of recurrent neural networks.

Most online action recognition systems [Hoa12] [De18] are designed to process video obtained from stationary cameras.
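
To make the online/offline distinction described in this section concrete, the following minimal sketch contrasts the two modes. It is an illustration under stated assumptions, not the system proposed in this thesis: `toy_classifier` is a hypothetical stand-in for a trained model, each frame is reduced to a single made-up feature value, and the fixed-size window is only one possible way to limit the partial sequence.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Sequence

Frame = Sequence[float]  # placeholder for a per-frame feature vector


def classify_offline(frames: List[Frame],
                     classifier: Callable[[List[Frame]], str]) -> str:
    """Offline: wait for the entire sequence, then emit a single label."""
    return classifier(frames)


def classify_online(stream: Iterable[Frame],
                    classifier: Callable[[List[Frame]], str],
                    window: int = 4) -> Iterator[str]:
    """Online: emit a label at every frame, using only frames seen so far."""
    buffer: Deque[Frame] = deque(maxlen=window)  # keeps the most recent frames
    for frame in stream:
        buffer.append(frame)
        yield classifier(list(buffer))


def toy_classifier(frames: List[Frame]) -> str:
    """Hypothetical model: thresholds the mean of the first feature."""
    mean = sum(f[0] for f in frames) / len(frames)
    return "walk" if mean > 0.5 else "sit"


stream = [[0.1], [0.2], [0.9], [1.0], [1.0], [1.0]]
online_labels = list(classify_online(stream, toy_classifier, window=2))
offline_label = classify_offline(stream, toy_classifier)
```

In this toy run the online mode produces one label per frame and switches from "sit" to "walk" at the third frame, whereas the offline mode returns a single label only after all six frames are available.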
Designing an online action recognition system involves two main problems: (1) projecting real-world three-dimensional (3D) human actions into two-dimensional (2D) video can cause occlusion, and (2) the non-rigid body of the target person can make human tracking and recognition difficult. In addition, it is hard to simulate the vision of a mobile robot with stationary-camera videos. Because mobile robots move, a moving camera is required to simulate their vision. In the rest of this thesis, human action videos recorded by a stationary camera are called S-Videos and those recorded by a moving camera are called M-Videos.

Developing an online action recognition system with M-Videos is more difficult than with S-Videos. While the camera is moving, (1) the background changes in every frame; (2) the distance between the target persons and the camera changes, so the size of the target person in the video also changes; (3) the illumination may not remain consistent from frame to frame; and (4) the camera may vibrate.

Smart mobile indoor robots are an emerging trend: robots will serve elderly people in their homes, help people in airports, and be used in other indoor spaces. Further, as mentioned above, interaction through human action commands is a direct and smart approach to human-robot interaction. In response to these trends, this research proposes a vision-based online human action recognition system that uses a deep learning method to recognise human actions under moving camera circumstances for indoor smart mobile robots.

1.3 Research Contribution

This research makes three main contributions: the collection of an M-Video dataset of human actions, a human action recognition system that works under moving camera circumstances, and a method that simultaneously uses multiple types of feature information to recognise human actions.

(1) M-Video dataset

Many human action datasets have been established and provided by different groups and universities for human action recognition experiments. However, many of these datasets, including the NTU RGB+D 120 Dataset (NTU) [14], the Berkeley Multimodal Human Action Dataset [15], the KTH Dataset [16], the SBU Kinect Interaction Dataset [17], and the PKU Multi-Modality Dataset (PKU-MMD) [18], consist of S-Videos.
M-Video datasets have rarely been established. Therefore, this research collects an M-Video dataset called the computer vision and image understanding (CVIU) Moving Camera Human Action dataset. The human actions are recorded while the camera is moving towards the target persons. Chapter 4 provides details about this dataset.

(2) Human action recognition system under moving camera circumstances

Most research in this field focuses on stationary camera circumstances, and there has been little development of human action recognition systems under moving camera circumstances, which is the aim of this research. The proposed system applies human detection and human tracking to the target persons and then extracts three types of features from them, each of which is passed to its own LSTM classifier to analyse the human actions. Finally, a fusion method integrates the three outputs to determine the final classification of the human action. Experimental results show that the proposed model is robust and stable.

(3) Simultaneous use of three kinds of feature information

The developed system recognises human actions using three kinds of feature information simultaneously: features obtained from red-green-blue (RGB) colour images, features obtained from optical flow, and features generated from human skeletal joints. Each type of feature has a tailored LSTM model and its own output. Fusion methods then integrate the three outputs to improve the human action recognition rate. Experimental results show that the three kinds of features compensate for each other's deficiencies.

1.4 Thesis Framework

This thesis comprises five chapters. Chapter 1 introduces the research motivation, background, and difficulty. Chapter 2 discusses related work. Chapter 3 details the system flowchart. Chapter 4 presents experimental results that show the improvement achieved by the proposed system. Finally, Chapter 5 concludes this research and presents future work.
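
As an illustration of the idea behind contributions (2) and (3), the sketch below fuses the class scores of three classifiers. This chapter does not specify the fusion methods used by the proposed system, so the weighted score averaging here is purely an assumed, illustrative late-fusion rule; the function name `fuse_scores` and all score values are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def fuse_scores(scores: Sequence[Sequence[float]],
                weights: Optional[Sequence[float]] = None
                ) -> Tuple[int, List[float]]:
    """Weighted average of per-classifier class scores, followed by argmax.

    ASSUMPTION: score averaging is an illustrative late-fusion rule only;
    it is not claimed to be the fusion method used in the thesis.
    """
    if not scores or any(len(s) != len(scores[0]) for s in scores):
        raise ValueError("all classifiers must score the same classes")
    if weights is None:
        weights = [1.0] * len(scores)
    if len(weights) != len(scores) or sum(weights) <= 0:
        raise ValueError("need one weight per classifier with a positive sum")
    total = sum(weights)
    fused = [sum(w * s[c] for w, s in zip(weights, scores)) / total
             for c in range(len(scores[0]))]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused


# Hypothetical scores over three action classes from the spatial, temporal,
# and structural classifiers.
spatial = [0.40, 0.45, 0.15]
temporal = [0.60, 0.30, 0.10]
structural = [0.50, 0.20, 0.30]
label, fused = fuse_scores([spatial, temporal, structural])
```

With these made-up scores the spatial classifier alone would choose class 1, while the fused scores choose class 0, which is one way in which several feature types can compensate for each other.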

(18) Chapter 2. Related Work. This chapter discusses some relevant research on human action recognition. The first part introduces various types of features, including spatial and temporal features. The second part introduces human action classifiers.. 2.1 Features of Human Action Recognition Extracting suitable features to represent different actions is key to achieving human action recognition. Spatial and temporal features are those associated with space and time, respectively. Generally speaking, spatial features can be extracted in one frame whereas temporal features can be extracted from at least two successive frames. (1) Spatial Features Skeletal joints are one type of spatial feature. Many datasets have proposed skeletal joint information for researchers, such as NTU Dataset [14], PKU-MMD Dataset [18], and SBU Kinect Interaction Dataset [17]. NTU [14] and PKU-MMD Dataset [18] provide 25 3D-location joints for each person. SBU Kinect Interaction Dataset [17] provide 15 3D-location joints for each person. Many researchers [Han18] [Jun18] [Sha16] [Son18] [Tu18] [Li17] [Liu18] have used skeletal joints as features to classify human actions, although some have used skeletal joints provided by the established datasets and some have extracted their own data. Skeletal joints can be preprocessed to increase their quality as features. Jun and Choe [Jun18] presented data-augmentation methods, such as tilting, flipping, and scale variation on the skeletal joints, to enlarge their training dataset. Tu et al. [Tu18] proposed an LSTM auto-encoder model (LSTM-AE) to eliminate noise and preserve the whole action representation of the skeletal joints. Song et al. [Son18] proposed a spatial and temporal attention model to detect and recognise human actions. They also preprocessed the skeletal joints to maintain consistency for joint position and different perspectives. 
They smoothed each skeletal joint position to decrease the impact of noise before human action recognition and implemented an attention-based model to enhance the important skeletal joints. Skeletal joints can be also used to extract higher-level features. Li et al. [Li17] 9.

(19) calculated the Euclidean distance between each pair of skeletal joints, and the area of the triangle region among three neighbouring skeletal joints as higher-level features of human action classification. Liu et al. [Liu18] proposed a tree-structure based traversal method to represent the structure of skeletal joints. This kind of representation links neighbouring skeletal joints to enhance their interdependency. Soomro et al. [Soo19] proposed an online action localization and prediction system. They extracted individual skeletal joints using a Convolutional Pose Machines (CPM) [Wei16] method. Moreover, they proposed a high level structural information method to reduce the influence of noise by smoothing the locations of obtained skeletal joints, and to minimise the displacements of joint locations by scaling the height of the skeletal joints. Colour/intensity information is another type of spatial feature extracted by various methods. Some researchers [Ull18] [De18] [Ouy19] [Hua19] [You19] [Goe18] [Du18] [Liu19] have used neural network methods, and others [Ni11] [Liu10] have used traditional image processing methods. Ni and Xu [Ni11] proposed a statistical model based on sparse representation of space-time features to recognise human actions. This model uses the Harris3D detector to find the point of interest in space-time, and then applies a Histogram of Gradients (HOG) descriptor to extract the spatial features. Liu et al. [Liu10] proposed an action recognition framework based on multiple features. The proposed method uses Cuboids [Dol05] and 2D Scale-Invariant Feature Transform (SIFT) to extract local spatial features. Moreover, a frame differencing method is implemented to focus on the region of interest and 2D Gabor filters are applied to extract global spatial features. Ullah et al. [Ull18] proposed a human action recognition model using a bidirectional LSTM model (BiLSTM) [Sch97]. 
The proposed model extracts spatial features from the last fully connected layer of a pre-trained convolutional neural network (CNN) model, AlexNet [Kri12]. Ouyang et al. [Ouy19] proposed a network consisting of a 3D CNN model [Tra15] and an LSTM model to recognise human actions. In this architecture, they split an action sequence into 25 clips and randomly select 16 frames in each clip. These selected frames are resized into 112 × 112 pixels to be input into the 3D CNN. The output of the last fully connected layer of the 3D CNN is defined as the spatial features.

Huang et al. [Hua19], You and Jiang [You19], Liu et al. [Liu19], and Goel et al. [Goe18] developed online systems. Huang et al. [Hua19] proposed an online action detection and prediction model based on a convolutional recurrent neural network (RNN). In this model, spatial features are extracted from the output of the last convolutional layer of a pre-trained Visual Geometry Group (VGG)-16 model [Sim14]. The feature dimensions are then reduced by employing a 1 × 1 convolutional layer. You and Jiang [You19] proposed a deep neural network model, Action4DNet, to recognise human actions. This model uses 3D CNNs to extract lower-level spatial features of each person. These extracted features are passed through an attention model [Bah14] and a global max-pooling layer [Lin13] to extract higher-level spatial features. Goel et al. [Goe18] proposed an online human activity detection algorithm using a support vector machine (SVM). To extract spatial features, they proposed a Person-Centred CNN (PC-CNN) method. PC-CNN first uses a Single Shot Multibox Object Detector (SSD) [Liu16] to detect persons. Next, the regions of detected persons are cropped and resized into 224 × 224 pixels. Finally, the resized regions are sent into a ResNet-152 [He16] network to extract spatial features from the last flatten layer.

(2) Temporal Features

LSTM networks are powerful for learning long-term dependencies and modelling sequential data. Moreover, LSTM networks alleviate the vanishing gradient problem that affects the fundamental network structure, the RNN, in the training stage. Therefore, many researchers [Li17] [Liu17] [Tu18] [Son18] [Jun18] [Han18] [Liu18] [Liu19] [Ull19] [De18] [You19] [Ull18] [Du18] [Ouy19] have adopted LSTM models to extract temporal features. Song et al. [Son18] proposed a spatial and temporal attention model to exploit the importance of each frame.
In this model, they added a temporal attention model, which can define the importance level of each frame, to improve the LSTM model. Liu et al. [Liu19] first passed skeletal joints through convolution operations to extract richer temporal statistics and then input these into the LSTM model to extract temporal features. De Geest and Tuytelaars [De18] proposed a two-stream LSTM feedback network to detect and classify actions. This network uses a two-stream LSTM model to extract temporal features. One LSTM stream is used to interpret the input frames, and the other is used to capture the temporal dependencies. Ullah et al. [Ull18] proposed an action

recognition model based on a bidirectional LSTM (BiLSTM) network. They regularly sampled one-sixth of the frames in a sequence and input these into the BiLSTM model to extract the temporal dependencies. The goal of frame sampling is to reduce the computational complexity of the proposed system. Ouyang et al. [Ouy19] proposed a human action recognition network using both a 3D CNN model and an LSTM model. In this architecture, they passed the input sequences through the 3D CNN to enhance the temporal feature representation. They then sent the enhanced temporal features to the LSTM model to extract the final temporal features.

Optical flow is a type of temporal feature that is widely used to observe short-term dynamics. Jagadeesh and Patil [Jag16] presented a vision-based human action detection and recognition method using optical flow. This method calculates optical flow between frames and then converts the calculated optical flow data to binary images. Finally, they applied the HOG descriptor to the optical flow to extract temporal features. Ullah et al. [Ull19] proposed an activity recognition network based on multilayer LSTM models. In this network, they used a pre-trained optical flow detection neural network, FlowNet2 [Ilg17], to obtain optical flow. They extracted the feature maps from the final convolutional layers of FlowNet2 [Ilg17] and used global average pooling to obtain temporal features.

In summary, the above research adopted two kinds of spatial features: human skeletal joints and colour information. Importantly, skeletal joints can be used to roughly describe the structure of human poses, whereas colour information contains more details of human poses. As mentioned above, colour information can be extracted by neural network methods and traditional image processing methods. Using traditional image processing methods, researchers must choose the feature extraction methods themselves.
However, hand-selected methods yield only the features that the researcher anticipates, and these may be unsuitable for classifying human actions. By contrast, with neural network methods, researchers have a higher probability of finding unexpected and suitable spatial features since the neural networks can learn automatically. Therefore, this research adopts neural network methods to extract spatial features. Moreover, this research adopts two kinds of temporal extraction methods for human action sequences: optical flow methods and the LSTM network. Optical flow

methods can capture short-term dynamics and LSTM networks can capture long-term dynamics. By knowing the temporal dynamics of the sequences, the system can discover what distinguishes each human action in the temporal domain.

2.2 Models of Human Action Recognition

In recent years, deep learning methods have been widely studied and developed for human action recognition. Many researchers [Li17] [Liu17] [Tu18] [Son18] [Jun18] [Han18] [Liu18] [Liu19] [Ull19] [De18] [You19] [Hua19] [Ull18] [Cio18] [Du18] [Ouy19] [Cha19] have used deep learning methods to develop their human action recognition models. Some of these studies use the offline approach [Cha19] [Wan16] [Li17] [Ijj14]. Wang et al. [Wan16] proposed a spatio-temporal feature representation method, Joint Trajectory Maps (JTM), to use with the 2D CNN model, AlexNet [Kri12], to recognise human actions. JTM features are generated from three Cartesian planes of human action trajectories and are sent into an AlexNet [Kri12] model to recognise human actions. However, 2D CNN models cannot learn temporal information, so the information on the temporal dynamics of human actions may be lost. Chang et al. [Cha19] proposed a 3D VGG-13 model to recognise human actions. The authors replaced the 2D CNNs in the original VGG-13 with 3D CNNs to construct the 3D VGG-13 network. 3D CNNs are used to learn the spatial and temporal features, but they focus on learning local spatial and temporal features of sequences. Such local information may be easily affected by noise, and global information about the whole sequence may be lost.

On the other hand, some researchers [Liu17] [You19] [Liu19] [De18] have developed their human action recognition systems using online approaches. De Geest and Tuytelaars [De18] developed a two-stream feedback network to detect and classify human actions. The two-stream feedback network consists of an upper stream LSTM network, a lower stream LSTM network, and a fully connected layer.
The upper stream LSTM is used to interpret the input information. The lower stream is used to capture temporal information. Moreover, the fully connected layer is used to project the features into the action classes. In this study, the intensities of RGB colour model are the input information and are first analysed by a CNN model. The results are then sent into the

two-stream feedback network to detect and classify human actions. However, this study ignored other kinds of features, such as skeletal joints or short-term dynamic features. We think that each kind of feature captures a distinct aspect of human actions, and that combining them would increase the accuracy of human action recognition. Liu et al. [Liu19] proposed a Multi-Modality Multi-Task RNN to classify and forecast human actions. The human action forecast aimed to find the start and end points of an action. This network is a two-stream system. The first stream processes the skeletal joint information, and the second processes the colour intensity information. These two types of information are first processed individually by convolutional layers. Then, the features extracted from the convolutional layers are sent into a deep LSTM network with two subtask networks for action classification and forecast, respectively. The deep LSTM network alternately stacks three LSTM layers and three fully connected layers, with a fully connected layer with softmax at the end. The two subtask networks mainly consist of fully connected layers. However, this proposed network ignores the uniqueness of each type of input information. Different kinds of input information have corresponding suitable networks, such as various stacks of LSTM layers and various orders of fully connected layers and LSTM layers. We believe that a tailored network for each type of input information can produce more meaningful results.

In summary, this research adopts three types of features to analyse various aspects of human actions: colour intensity, short-term dynamic information, and skeletal joints. Our proposed system is based on LSTM networks. Compared with 3D CNNs, which have weaknesses related to analysing global information, LSTM networks are superior for learning global temporal features.
LSTM networks treat each frame of the input sequence as one input vector and analyse the relationship of all input vectors directly. This means that the temporal dependencies of sequences can be enhanced. Additionally, this research tries to implement corresponding tailored LSTM networks for different characteristics of features.

Chapter 3. Online Human Action Recognition System

This chapter discusses the online human action recognition system flowchart proposed by this study. We briefly introduce the purpose of this research and then illustrate and detail the system flowchart.

3.1 Research Purpose

This research aims to provide diverse human-robot interaction options for indoor smart mobile robots and to overcome the limitations of voice recognition systems and touch screen systems. We aim to solve the camera moving problem and the online recognition problem. This research develops a system using neural networks due to the recent development and robustness of deep learning techniques. That is, this research proposes an online human action recognition system using deep learning techniques. By analysing human actions through the proposed system, the actions can be recognised and indoor smart mobile robots can give the corresponding response to users.

3.2 System Flowchart

Figure 3.1 Flowchart of online human action recognition system

The system flowchart is shown in Figure 3.1. This flowchart has five stages: human detection, human tracking, feature extraction, action classification, and fusion. Note that feature extraction involves three types of features, spatial, temporal and structural, and each type has its own classifier. After the RGB videos are input into the system, the persons existing in the video are detected and tracked. Here, the detected persons are called target persons. Next, three kinds of features are extracted from the regions of target persons in each frame. These features are then input into their corresponding action classifiers. Finally, the outputs of the three action classifiers are fused together to determine the final human action.

3.2.1 Human Detection

The system adopts OpenPose, a real-time multi-person 2D human pose estimator proposed by Cao et al. [Cao19], to detect humans because it has high speed and accuracy. Figure 3.2 compares OpenPose with other human pose estimators proposed in the literature, including Alpha-Pose [Fan17], Mask R-CNN [He17], PersonLab [Pap18], and METU [Koc18]. In Figure 3.2, the horizontal axis indicates the frames per second (FPS) of a video where each frame contains three target persons. The vertical axis indicates the accuracy (mean average precision) of the results of the human pose estimators. The OpenPose estimator [Cao19] used in this research is marked with red triangles.

Figure 3.2 Comparison of human pose estimators [Cao19]

From Figure 3.2, the OpenPose estimator [Cao19] has the highest FPS, which is the most important property of online systems. Although the OpenPose estimator [Cao19] sometimes fails to detect all the skeletal joints, this shortcoming does not affect the human detection results. Moreover, our system will fill these missing skeletal joints

to improve the OpenPose estimator [Cao19] in the human tracking stage.

Figure 3.3 Human skeletal joints (a) location of joints (b) result of joint extraction

Figure 3.4 Results of human detection (a) complete joint extraction (b) incomplete joint extraction

The OpenPose estimator [Cao19] can extract 18 human skeletal joints for each person. These skeletal joints are two hips, two knees, two ankles, two shoulders, two elbows, two wrists, two ears, two eyes, a nose, and a neck, as shown in Figure 3.3 (a). An example of the results of joint extraction is shown in Figure 3.3 (b). By using the skeletal joint information, the system can enclose and detect the human successfully. The human detection results are shown in Figures 3.4 (a) and (b), which respectively show examples of complete and incomplete extraction. One can observe that the proposed system can detect the human, whether or not skeletal joints are fully extracted.

3.2.2 Human Tracking

After the human detection stage, this system uses the Deep Simple Online and Realtime Tracking (Deep SORT) method proposed by Wojke et al. [Woj17] to track each person in the input video. Some examples of human tracking results are shown in Figure 3.5, where the symbols shown above the bounding boxes, e.g., P-1, P-2, indicate the

person index of the target person. The green and blue bounding boxes show the results of human detection. One can observe that Deep SORT [Woj17] correctly tracks the humans.

Figure 3.5 Results of human tracking

As mentioned above, the OpenPose estimator [Cao19] can extract 18 human skeletal joints for each person. However, the skeletal joints on the head are removed in this study because they are not as important for human action detection, and they are easily detected incorrectly. Therefore, the 18 skeletal joints are reduced to 13 joints in this stage, as shown in Figure 3.6 (a). The detection result is shown in Figure 3.6 (b). After the skeletal joint reduction, some missing skeletal joints are filled in at the human tracking stage.

Figure 3.6 Reduced skeletal joints (a) skeletal joints without head joints (b) detection result of skeletal joints without head joints

Two approaches are used to fill the missing joints. Assume a missing joint $x_i$ has not been detected at frame $i$. Then, we have the following cases.

Case (1): The neck joint, $x_i^0$, is found at frame $i$. A missing joint $x_i$ can be predicted by the relative difference between the neck joint and its corresponding joint at frame $i-1$, $(x_{i-1} - x_{i-1}^0)$, as shown in Equation (1).

$$x_i = x_i^0 + (x_{i-1} - x_{i-1}^0) \times S \tag{1}$$

where $i$ indicates the frame number, $x_i$ indicates a missing joint at frame $i$, and $x_{i-1}$ indicates the corresponding detected joint of the missing joint $x_i$ at frame $i-1$. Note that $x_{i-1}$ is detected and not a missing joint. Symbols $x_i^0$ and $x_{i-1}^0$ indicate the neck

joints at frames $i$ and $i-1$, respectively.

Figure 3.7 Schematic diagram of body height $H_i$

In Equation (1), $S$ is defined as $S = H_i / H_{i-1}$, where $H_i$ and $H_{i-1}$ are the body heights in frames $i$ and $i-1$, respectively. The body height is the Euclidean distance between the neck joint and the centre between the hip joints, illustrated by the red point in Figure 3.7. Thus, $S$ can maintain the consistency of the human height between frames $i$ and $i-1$. The camera moving problem can be fixed partially by considering the scale change of the same person in two successive frames.

Figure 3.8 An example of joint filling (a) a missing joint (b) result of joint filling

Figure 3.8 shows an example of joint filling. In Figure 3.8 (a), the white points are the extracted skeletal joints. However, the right wrist joint has not been successfully extracted, as highlighted by a red circle. Figure 3.8 (b) shows the result of joint filling. In Figure 3.8 (b), the white circles indicate the original extracted skeletal joints, and the blue points indicate the filled joints. Clearly, the missing right wrist joint has been filled, as highlighted by a red circle.
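As a minimal sketch of Case (1), the snippet below computes the body height and applies Equation (1) to one 2D joint. It assumes that $S$ is the ratio of the current to the previous body height ($H_i / H_{i-1}$) and that joints are plain (x, y) tuples in pixels; the function names `body_height` and `fill_joint_with_neck` are hypothetical and are not taken from the thesis.

```python
import math


def body_height(neck, left_hip, right_hip):
    """Distance from the neck to the midpoint between the two hip joints."""
    mid_x = (left_hip[0] + right_hip[0]) / 2.0
    mid_y = (left_hip[1] + right_hip[1]) / 2.0
    return math.hypot(neck[0] - mid_x, neck[1] - mid_y)


def fill_joint_with_neck(neck_cur, neck_prev, joint_prev, height_cur, height_prev):
    """Equation (1): previous joint-to-neck offset, rescaled by S, added to the current neck."""
    if height_prev <= 0:
        raise ValueError("previous body height must be positive")
    scale = height_cur / height_prev  # assumed: S = H_i / H_{i-1}
    return (neck_cur[0] + (joint_prev[0] - neck_prev[0]) * scale,
            neck_cur[1] + (joint_prev[1] - neck_prev[1]) * scale)


# Illustrative values only: the person appears twice as tall in the current frame.
h_prev = body_height((100.0, 100.0), (90.0, 200.0), (110.0, 200.0))
h_cur = body_height((200.0, 100.0), (180.0, 300.0), (220.0, 300.0))
wrist = fill_joint_with_neck((200.0, 100.0), (100.0, 100.0), (120.0, 150.0), h_cur, h_prev)
```

Because the offset is taken relative to the neck and then rescaled, the filled joint follows both the translation and the apparent size change of the person between the two frames.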

Case (2): The neck joint, $x_i^0$, is not found at frame $i$. A missing joint $x_i$ can be predicted by the relative difference between its corresponding detected joints at frames $i-1$ and $i-2$, $(x_{i-1} - x_{i-2})$, as follows:

$$x_i = x_{i-1} + (x_{i-1} - x_{i-2}) \tag{2}$$

where $i$ indicates the frame number, $x_i$ indicates the missing joint at frame $i$, and $x_{i-1}$ and $x_{i-2}$ indicate the corresponding detected joints of the missing joint $x_i$ at frames $i-1$ and $i-2$, respectively. Note that $x_{i-1}$ and $x_{i-2}$ are not missing. The difference between frames $i-1$ and $i-2$, $(x_{i-1} - x_{i-2})$, is used to determine the moving direction of the joint in order to predict the missing joint at frame $i$.

Similarly to Figure 3.8, Figure 3.9 illustrates an example of joint filling. In this example, only three joints are detected successfully, and the others, including the neck joint, have not been extracted, as shown in Figure 3.9 (a). Figure 3.9 (b) shows the result of joint filling. In this situation, the system obtains only limited joint information from the current frame; therefore, the degree of similarity between the filled joints and the real joints is lower. However, the joint filling step is still helpful for the following human action recognition stage.

Figure 3.9 An example of filling the missing joints (a) the missing joints (b) result of joint filling

3.2.3 Feature Extraction

As mentioned above, three types of features, spatial, temporal and structural, are used to distinguish human actions and provide information for the action classification stage. The spatial features are extracted from RGB colour images, as shown in Figure 3.10 (a). Temporal features are extracted from optical flow, as shown in Figure 3.10 (b).

Structural features are extracted from human skeletal joints, as shown in Figure 3.10 (c).

Figure 3.10 Input information (a) RGB colour images (b) optical flow (c) human skeletal joints

Both spatial and temporal features are extracted by the pre-trained CNN model, InceptionV3, which was proposed by Szegedy et al. [Sze16] in 2016. Table 3.1 (a) outlines the InceptionV3 architecture, including the input size and patch size of every layer. Specifically, the system extracts human action features from the output of the final pool layer, which has dimensions of 1 × 1 × 2048. Table 3.1 (b) shows the evaluation results comparing InceptionV3 with other networks such as PReLU [He15], BN-Inception [Iof15], VGGNet [Sim14], and GoogLeNet [Sze15]. One can observe that InceptionV3 has the lowest error rate for both Top-1 error and Top-5 error. Further, InceptionV3 is pre-trained on the ImageNet dataset. Note that the system crops and resizes the input frames before extracting spatial and temporal features. In the cropping step, the system broadens the bounding box by 100 pixels on both the left and right sides and by 150 pixels on both the top and bottom to increase the spatial information.

Table 3.1 InceptionV3 (a) outline of InceptionV3 architecture (b) evaluation results comparing InceptionV3 with other models [Sze16]

Once the system crops the target persons, the cropped human region is resized into 500 × 450 pixels and sent into InceptionV3 [Sze16] for spatial feature extraction. Note that the cropped human region contains one person if only one person appears in a frame,

but it contains two persons if two persons appear in that frame. The cropped and resized human regions are called CR regions hereinafter. The system calculates the Farneback optical flow using two successive CR regions, and sends it into another InceptionV3 [Sze16] to extract temporal features. Cropping and resizing the human region can partially fix the camera moving problem because cropping can force the system to focus on the target persons, and resizing can make the human regions consistent in all frames. Moreover, resizing the cropped human regions lets them fit the input shape of InceptionV3 [Sze16].

Figure 3.11 An example of optical flow calculation (a) two successive input frames (b) their corresponding CR regions (c) optical flow

Figure 3.11 shows an example of the process to obtain optical flow. Figures 3.11 (a) and (b) show two successive input frames and their corresponding CR regions, respectively. Figure 3.11 (c) shows the optical flow obtained from those successive CR regions. The arrows between Figures 3.11 (a), (b) and (c) indicate the processing direction. In summary, each frame yields a 1 × 1 × 2048 dimension spatial feature vector, and two successive frames yield a temporal feature vector of the same size. Moreover, each input sequence with $N$ frames can construct a feature map whose size is $N \times 2048$.

Structural features are obtained by calculating the relationship between each pair of skeletal joints. As mentioned above, each person has 13 skeletal joints that can be extracted. Thus, single human actions and interactive human actions by two persons respectively contain 13 and 26 skeletal joints in each frame. However, the system reserves sufficient memory space to record 26 skeletal joints in each frame, whether

the frame has one or two persons appearing. The system applies zero-padding to frames containing fewer than 26 skeletal joints in order to prepare the information for structural feature extraction. Next, the system calculates two kinds of distances on pairwise skeletal joints and concatenates them to form the structural features. One is the Manhattan distance (1-norm) and the other is the Euclidean distance (2-norm). In each frame, the system can calculate $2 \times C_2^{26}$ (= 650) 1-norm features and $1 \times C_2^{26}$ (= 325) 2-norm features for pairwise skeletal joints. Specifically, the 1-norm features record the location differences of pairwise skeletal joints along the x-axis and the y-axis separately. Concatenating these features, the system can obtain $3 \times C_2^{26}$ (= 650 + 325 = 975) features. Moreover, each input sequence with $N$ frames can construct a feature map whose size is $N \times 975$.

Figure 3.12 shows two examples of the visualisation results of spatial, temporal and structural feature maps. The human action in these examples, as shown in Figure 3.12 (a), is "walk toward each other". The two sequences each contain 20 ($N = 20$) frames. Figures 3.12 (b), (c) and (d) show their corresponding spatial (green), temporal (purple), and structural (blue) feature maps, respectively. The horizontal axis indicates the dimension of the feature vectors and the vertical axis indicates frame numbers. In particular, the structural feature maps have a second horizontal axis on the bottom, which shows 1-norm features (blue) from 0 to 650 and 2-norm features (red) from 650 to 975. The shade of the colours in these feature maps indicates the magnitude of the extracted features. The corresponding ruler is shown on the right side of the feature maps, indicating that smaller values have a lighter colour. In the spatial and temporal feature maps (see Figures 3.12 (b) and (c)), values greater than one are represented in red.
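The structural feature computation described above can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: it assumes joints are (x, y) tuples, that the per-axis 1-norm components are absolute differences, and that the 650 per-axis values are stored before the 325 Euclidean distances (matching the 0 to 650 and 650 to 975 ranges on the feature map axis). The names `pad_joints` and `structural_features` are hypothetical.

```python
import math
from itertools import combinations

NUM_JOINTS = 26  # 13 joints per person, room for two persons


def pad_joints(joints):
    """Zero-pad a frame's joint list to 26 entries (e.g. single-person frames)."""
    if len(joints) > NUM_JOINTS:
        raise ValueError("at most 26 joints are expected per frame")
    return list(joints) + [(0.0, 0.0)] * (NUM_JOINTS - len(joints))


def structural_features(joints):
    """Return 650 per-axis differences followed by 325 Euclidean distances."""
    padded = pad_joints(joints)
    one_norm = []  # |dx| and |dy| for each of the C(26, 2) = 325 pairs
    two_norm = []  # Euclidean distance for each pair
    for (xa, ya), (xb, yb) in combinations(padded, 2):
        dx, dy = abs(xa - xb), abs(ya - yb)
        one_norm.extend([dx, dy])
        two_norm.append(math.hypot(dx, dy))
    return one_norm + two_norm


# Illustrative single-person frame: 13 joints placed along the x-axis.
frame = [(float(k), 0.0) for k in range(13)]
features = structural_features(frame)
```

Stacking the 975-value vector of each frame row by row gives the $N \times 975$ feature map of a sequence with $N$ frames.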
Figure 3.13 shows another two examples of the visualisation results of spatial, temporal and structural feature maps, this time for the action "drink in stand position", as shown in Figure 3.13 (a). Similarly to above, Figures 3.13 (b), (c) and (d) show the corresponding spatial (green), temporal (purple) and structural (blue) feature maps, respectively. From these feature maps, one can observe that similar human actions have similar feature values and dissimilar actions have dissimilar values. This characteristic makes it easier for the classifiers to obtain successful classification results. The structural feature maps contain information about the relationship between

skeletal joints for both single and interactive actions. For example, in the feature maps of the action "walk toward each other" shown in Figure 3.12 (d), the values of the features slowly decrease from time step 0 to 20. This kind of variation means that the skeletal joints are getting closer, which matches the action. By contrast, in the feature maps of the action "drink in stand position" shown in Figure 3.13 (d), the values of the features barely change from time step 0 to 20. This kind of variation means that the skeletal joints change only slightly, which matches the action.

Figure 3.12 Two examples of feature map visualisation (a) human actions (walk toward each other) (b) corresponding spatial feature maps (c) temporal feature maps (d) structural feature maps

Figure 3.13 Examples of feature map visualisation (a) human actions (drink in stand position) (b) corresponding spatial feature maps (c) temporal feature maps (d) structural feature maps

3.2.4 Action Classification

This research adopts LSTM networks to classify human action. Each type of feature can be well classified by an appropriate and targeted network. Therefore, twelve kinds of LSTM networks are implemented to find appropriate ones for the three types of features. A newly proposed temporal enhancement LSTM (TE-LSTM) is among the

implemented twelve networks. This subsection gives a brief overview of the LSTM networks and describes the LSTM models.

(1) Overview of the LSTM Network

The LSTM network improves on the RNN. As mentioned above, the RNN suffers from vanishing gradients in the training stage, and the LSTM network solves this problem. The LSTM network consists of at least one LSTM layer, and each LSTM layer contains many LSTM cells. An LSTM cell includes a forget gate, an input gate, and an output gate, as shown in Figure 3.14. These three gates regulate, store, and add or remove the information at each cell. The input gate decides what new information should be added to the cell state. The forget gate decides which cell states should be retained or removed. The output gate decides the final output vector based on the processed cell state.

Figure 3.14 An LSTM cell

Given an input vector, $x_t$, at time step $t$ and a hidden state vector, $h_{t-1}$, at $t-1$, the output values of the forget gate ($f_t$), input gate ($i_t$), output gate ($o_t$), and memory cell candidate ($a_t$), as shown in Figure 3.14, can be obtained by the following equations.

$$f_t = \sigma_s(w_f x_t + u_f h_{t-1} + b_f) \tag{5}$$

$$i_t = \sigma_s(w_i x_t + u_i h_{t-1} + b_i) \tag{6}$$

$$o_t = \sigma_s(w_o x_t + u_o h_{t-1} + b_o) \tag{7}$$

$$a_t = \sigma_h(w_c x_t + u_c h_{t-1} + b_c) \tag{8}$$

where $w_f$, $w_i$, $w_o$, $w_c$, $u_f$, $u_i$, $u_o$, and $u_c$ indicate parameter matrices. The symbols $b_f$, $b_i$, $b_o$, and $b_c$ are bias vectors, and $\sigma_s$ and $\sigma_h$ indicate the sigmoid

function and the hyperbolic tangent function, respectively. The cell state vector, $c_{t-1}$, at $t-1$ can be regarded as the memory of the previous time step and can be used to make the connection with the cell state vector, $c_t$, at $t$. Then, $c_t$ can be calculated by

$$c_t = f_t * c_{t-1} + i_t * a_t \tag{9}$$

Finally, the hidden state vector, $h_t$, at time step $t$ is defined as

$$h_t = o_t * \sigma_h(c_t) \tag{10}$$

where the operator $*$ denotes the element-wise product.

(2) LSTM Networks

Twelve kinds of LSTM networks, including LSTM networks with various layers, BiLSTM networks with various layers and the proposed TE-LSTM networks, are implemented to find suitable ones for the three types of features. Table 3.2 shows the structures of three LSTM networks with one, two and three layers, respectively. Different types of features can be input to train LSTM models. In Table 3.2, 1LSp, 1LTe, and 1LSt indicate the structures of the LSTM networks with one-layer LSTM that are trained by spatial features, temporal features, and structural features, respectively. Similarly, 2/3LSp, 2/3LTe, and 2/3LSt indicate the structures of the LSTM networks with two/three-layer LSTM trained by spatial features, temporal features, and structural features, respectively. LSTM_i, i = 1, 2, 3, as shown in Table 3.2, indicates the hidden state units of the ith LSTM layer, and FC_j, j = 1, 2, indicates the neuron numbers of the fully connected layers. The neuron number of FC_2, 16, is equal to the number of human action classes.

Table 3.2 Structures of LSTM networks

             1-Layer LSTM        2-Layer LSTM        3-Layer LSTM
Structure    1LSp  1LTe  1LSt    2LSp  2LTe  2LSt    3LSp  3LTe  3LSt
LSTM_1       1024   512  1024    1024   512  1024    1024   512  1024
LSTM_2          -     -     -     512   256   512    1024   512  1024
LSTM_3          -     -     -       -     -     -     512   256   512
FC_1          128   128   128     128   128   128     128   128   128
FC_2           16    16    16      16    16    16      16    16    16

Table 3.3 shows the structures of four BiLSTM networks with one, two, three, and four layers.
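As an illustration of Equations (5) to (10), a single LSTM time step can be written in plain Python as below. This is only a sketch of the cell arithmetic: the dictionary keys (`w_f`, `u_f`, `b_f`, and so on) and the function names are hypothetical, matrices are lists of rows, and a practical system would rely on a deep learning framework instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(w, x, u, h, b):
    """Compute w x + u h + b for list-of-rows matrices w, u and list vectors x, h, b."""
    return [sum(wk * xk for wk, xk in zip(w_row, x))
            + sum(uk * hk for uk, hk in zip(u_row, h)) + b_k
            for w_row, u_row, b_k in zip(w, u, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; returns the new hidden state h_t and cell state c_t."""
    f_t = [sigmoid(z) for z in affine(p["w_f"], x_t, p["u_f"], h_prev, p["b_f"])]    # Eq. (5)
    i_t = [sigmoid(z) for z in affine(p["w_i"], x_t, p["u_i"], h_prev, p["b_i"])]    # Eq. (6)
    o_t = [sigmoid(z) for z in affine(p["w_o"], x_t, p["u_o"], h_prev, p["b_o"])]    # Eq. (7)
    a_t = [math.tanh(z) for z in affine(p["w_c"], x_t, p["u_c"], h_prev, p["b_c"])]  # Eq. (8)
    c_t = [f * c + i * a for f, c, i, a in zip(f_t, c_prev, i_t, a_t)]               # Eq. (9)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                               # Eq. (10)
    return h_t, c_t


# Illustrative check with all-zero parameters (input size 3, hidden size 2):
# every gate outputs 0.5 and the candidate is 0, so the cell state is halved.
zeros_w = [[0.0] * 3 for _ in range(2)]
zeros_u = [[0.0] * 2 for _ in range(2)]
params = {}
for gate in ("f", "i", "o", "c"):
    params["w_" + gate] = zeros_w
    params["u_" + gate] = zeros_u
    params["b_" + gate] = [0.0, 0.0]
h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], params)
```

Applying `lstm_step` frame by frame over an $N$-frame feature map carries the cell state forward through the whole sequence, which is how an LSTM layer accumulates temporal context.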
As in Table 3.2, 1/2/3/4BSp, 1/2/3/4BTe and 1/2/3/4BSt indicate the structures of the BiLSTM networks with one/two/three/four-layer BiLSTM which are trained by spatial features, temporal features, and structural features, respectively. Further,

BiLSTM_i, i = 1, 2, 3, 4, indicates the hidden state units of the ith BiLSTM layer.

Table 3.3 Structures of BiLSTM networks

             1-Layer BiLSTM      2-Layer BiLSTM      3-Layer BiLSTM      4-Layer BiLSTM
Structure    1BSp  1BTe  1BSt    2BSp  2BTe  2BSt    3BSp  3BTe  3BSt    4BSp  4BTe  4BSt
BiLSTM_1     2048  2048  1024    2048  2048  1024    2048  2048  1024    2048  2048  1024
BiLSTM_2        -     -     -    1024  1024   512    2048  2048  1024    2048  2048  1024
BiLSTM_3        -     -     -       -     -     -    1024  1024   512    1024  1024   512
BiLSTM_4        -     -     -       -     -     -       -     -     -     512   512   256
FC_1          128   128   128     128   128   128     128   128   128     128   128   128
FC_2           16    16    16      16    16    16      16    16    16      16    16    16

Additionally, each BiLSTM layer consists of two LSTM layers that process input sequences in two directions. One processes the input sequence from the first frame to the last frame (forward) along the time axis, and the other processes it from the last frame to the first frame (backward). In BiLSTM networks, the relationships of temporal variations can be analysed in both forward and backward directions. Consequently, BiLSTM networks may capture temporal dependencies better than LSTM networks in some cases.

This study implemented five types of temporal enhancement (TE)-LSTM networks: TE-LSTM1, TE-LSTM2, TE-LSTM3, TE-LSTM4 and TE-LSTM5. These all have the same structure, a general TE-LSTM structure, but some of the layers use different LSTM models.

Figure 3.15 Network of a general TE-LSTM

A general TE-LSTM network comprises a TE network and a deep LSTM network, as shown in Figure 3.15. The TE network consists of an LSTM module and two fully connected layers. The deep LSTM network consists of two LSTM modules and three fully connected layers. Furthermore, the TE network can analyse the sequences to find their important parts and then pay more attention to those parts. By going through the TE network, the temporal information of sequences can be enhanced. The deep LSTM network then analyses the enhanced temporal sequences and classifies human actions. Note that the LSTM modules can be either an LSTM network or a BiLSTM network.

In the TE network, the feature sequences are first passed through the LSTM module and two fully connected layers to analyse the input sequences. Next, the analysed sequences are normalised using the softmax normalisation method. Finally, the normalisation outputs are multiplied by the feature sequences using the element-wise product operation. In the deep LSTM network, the product result is passed through two LSTM modules and three fully connected layers sequentially to classify the human actions.

Table 3.4 Structure of TE-LSTM networks (a) structure of TE-LSTM1 (n = 1) (b) structure of TE-LSTM2 (n = 2)

             (a) TE-LSTM1        (b) TE-LSTM2
Structure    T1Sp  T1Te  T1St    T2Sp  T2Te  T2St
LSTM_1^n     2048  2048   975    2048  2048   975
FC_1^n       2048  2048   975    2048  2048   975
FC_2^n       2048  2048   975    2048  2048   975
FVN
⨀
LSTM_2^n      512   512   512     512   512   512
LSTM_3^n      256   256   256     256   256   256
FC_3^n        128   128   128     128   128   128
FC_4^n        128   128   128     128   128   128
FC_5^n         16    16    16      16    16    16

Tables 3.4, 3.5 and 3.6 show the structures of these five TE-LSTM networks. Similarly to Table 3.2, T1/2/3/4/5Sp, T1/2/3/4/5Te, and T1/2/3/4/5St indicate the structures of the TE-LSTM1/2/3/4/5 networks trained by spatial features, temporal features and structural features, respectively.
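The weighting step of the TE network (softmax normalisation followed by an element-wise product) can be sketched as below. Two assumptions are made that the text does not state: the softmax is taken over the feature dimension of each frame, and the `scores` argument stands in for the output of the LSTM module and the two fully connected layers, with the same shape as the feature sequence. The function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over one list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_enhance(features, scores):
    """Weight each frame's features by the softmax of its scores (element-wise product)."""
    if len(features) != len(scores):
        raise ValueError("features and scores must cover the same number of frames")
    enhanced = []
    for feature_row, score_row in zip(features, scores):
        weights = softmax(score_row)  # assumed axis: feature dimension of one frame
        enhanced.append([f * w for f, w in zip(feature_row, weights)])
    return enhanced


# Illustrative two-frame, two-feature example.
features = [[2.0, 4.0], [1.0, 3.0]]
scores = [[0.0, 0.0], [0.0, math.log(3.0)]]
enhanced = temporal_enhance(features, scores)
```

In this sketch the weighted sequence keeps the shape of the input, so it can be passed directly to the subsequent LSTM modules of the deep LSTM network.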
In Tables 3.4, 3.5 and 3.6, LSTM$_i^n$, $i = 1, 2, 3$, $n = 1, 2, 3, 4$, and BiLSTM$_i^n$, $i = 1, 2, 3$, $n = 3, 4, 5$, indicate the numbers of hidden state units of the $i$th LSTM/BiLSTM layers of the $n$th TE-LSTM networks, respectively. Here, FC$_j^n$, $j = 1, 2, 3, 4, 5$, $n = 1, 2, 3, 4, 5$, indicates the number of neurons of the $j$th fully connected layer of the $n$th TE-LSTM network. The number of neurons of FC$_5^n$, 16, is equal to the number of human action classes. Moreover, the FVN and ⨀ rows indicate whether feature vector normalisation (softmax normalisation) and the element-wise product, respectively, have been applied. A tick indicates that the technique has been applied, and a cross indicates otherwise.

Table 3.5 Structure of TE-LSTM networks: (a) structure of TE-LSTM3, (b) structure of TE-LSTM4

(a) TE-LSTM3

| Structure | T3Sp | T3Te | T3St |
|---|---|---|---|
| BiLSTM$_1^3$ | 4096 | 4096 | 1950 |
| FC$_1^3$ | 2048 | 2048 | 975 |
| FC$_2^3$ | 2048 | 2048 | 975 |
| FVN | | | |
| ⨀ | | | |
| LSTM$_1^3$ | 512 | 512 | 512 |
| LSTM$_2^3$ | 256 | 256 | 256 |
| FC$_3^3$ | 128 | 128 | 128 |
| FC$_4^3$ | 128 | 128 | 128 |
| FC$_5^3$ | 16 | 16 | 16 |

(b) TE-LSTM4

| Structure | T4Sp | T4Te | T4St |
|---|---|---|---|
| LSTM$_1^4$ | 2048 | 2048 | 975 |
| FC$_1^4$ | 2048 | 2048 | 975 |
| FC$_2^4$ | 2048 | 2048 | 975 |
| FVN | | | |
| ⨀ | | | |
| BiLSTM$_1^4$ | 1024 | 1024 | 1024 |
| BiLSTM$_2^4$ | 512 | 512 | 512 |
| FC$_3^4$ | 128 | 128 | 128 |
| FC$_4^4$ | 128 | 128 | 128 |
| FC$_5^4$ | 16 | 16 | 16 |

Table 3.6 Structure of TE-LSTM5

| Structure | T5Sp | T5Te | T5St |
|---|---|---|---|
| BiLSTM$_1^5$ | 4096 | 4096 | 1950 |
| FC$_1^5$ | 2048 | 2048 | 975 |
| FC$_2^5$ | 2048 | 2048 | 975 |
| FVN | | | |
| ⨀ | | | |
| BiLSTM$_2^5$ | 1024 | 1024 | 1024 |
| BiLSTM$_3^5$ | 512 | 512 | 512 |
| FC$_3^5$ | 128 | 128 | 128 |
| FC$_4^5$ | 128 | 128 | 128 |
| FC$_5^5$ | 16 | 16 | 16 |

3.2.5 Fusion

Let the outputs of the three LSTM classifiers trained by spatial features, temporal features, and structural features at time $t$ be $v_t^{Sp}$, $v_t^{Te}$, and $v_t^{St}$, respectively. A fusion method is needed to integrate these three outputs and determine the fusion action class $c_t^{fu}$. Note that each of $v_t^{Sp}$, $v_t^{Te}$, and $v_t^{St}$ contains 16 probability values corresponding to the 16 action classes, and each probability value lies between 0 and 1. Here, $o_t^{Sp}$, $o_t^{Te}$, and $o_t^{St}$ indicate the highest probability values of $v_t^{Sp}$, $v_t^{Te}$, and $v_t^{St}$ at time $t$, and their corresponding action classes are $c_t^{Sp}$, $c_t^{Te}$, and $c_t^{St}$, respectively.

In this study, two kinds of fusion method are implemented and compared with each other to find the characteristics of fusion. Both methods consider the output action class from the previous time, $c_{t-1}^{fu}$, to classify the human action at the current time.

In the first fusion method, the output classes of the three types of features, $c_t^{Sp}$, $c_t^{Te}$, and $c_t^{St}$, have the corresponding highest probability values $o_t^{Sp}$, $o_t^{Te}$, and $o_t^{St}$, respectively. If $c_t^{Sp}$, $c_t^{Te}$, and $c_t^{St}$ are all different from $c_{t-1}^{fu}$, the fusion action class $c_t^{fu}$ is assigned the class with the maximum probability value among $o_t^{Sp}$, $o_t^{Te}$, and $o_t^{St}$. For example, consider the case where $c_t^{Sp}$, $c_t^{Te}$, and $c_t^{St}$ are all different from $c_{t-1}^{fu}$, and $\max[o_t^{Sp}, o_t^{Te}, o_t^{St}] = o_t^{Sp}$. Then, $c_t^{fu} = c_t^{Sp}$. Otherwise, $c_t^{fu}$ is set to $c_{t-1}^{fu}$.

In the second fusion method, the output class is determined by the following equation:

$$
c_t^{fu} =
\begin{cases}
c_t^{Sp} & \text{if } c_{t-1}^{fu} \neq c_t^{Sp} \text{ and } c_{t-1}^{fu} \neq c_t^{Te} \text{ and } c_{t-1}^{fu} \neq c_t^{St}, \\
c_{t-1}^{fu} & \text{otherwise.}
\end{cases}
\tag{11}
$$

Experimental results show that the LSTM classifier trained by spatial features has a higher recognition rate than the classifiers trained by temporal features and structural features. This suggests that the action class $c_t^{Sp}$ is sometimes more trustworthy than $c_t^{Te}$ and $c_t^{St}$. Therefore, in the second fusion method, the fusion action class $c_t^{fu}$ is set to $c_t^{Sp}$ if $c_t^{Sp}$, $c_t^{Te}$, and $c_t^{St}$ are all different from $c_{t-1}^{fu}$. Otherwise, $c_t^{fu}$ is set to $c_{t-1}^{fu}$.
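The two fusion rules can be written compactly in code. The sketch below is only an illustration of the rules as stated above. The names `fuse` and `argmax`, the tie-breaking (first maximum wins) and the treatment of the very first frame, where no previous fused class exists, are assumptions that the text does not specify. The example uses 3 classes instead of 16 to stay readable.

```python
from typing import Optional, Sequence

def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (the first one on ties; an assumption)."""
    return max(range(len(values)), key=lambda k: values[k])

def fuse(v_sp: Sequence[float], v_te: Sequence[float], v_st: Sequence[float],
         prev_class: Optional[int], method: int) -> int:
    """Return the fused class c_t^fu from the three outputs and c_{t-1}^fu."""
    outputs = (v_sp, v_te, v_st)
    classes = [argmax(v) for v in outputs]              # c_t^Sp, c_t^Te, c_t^St
    peaks = [v[c] for v, c in zip(outputs, classes)]    # o_t^Sp, o_t^Te, o_t^St
    # Assumption: with no previous output (first frame), behave as if all
    # three classifiers disagreed with the previous class.
    if prev_class is not None and prev_class in classes:
        return prev_class                               # keep c_{t-1}^fu
    if method == 1:
        return classes[argmax(peaks)]                   # class with the highest peak probability
    if method == 2:
        return classes[0]                               # Equation (11): trust the spatial classifier
    raise ValueError("method must be 1 or 2")

# Toy outputs over 3 classes.
v_sp = [0.10, 0.70, 0.20]   # spatial classifier votes for class 1
v_te = [0.05, 0.05, 0.90]   # temporal classifier votes for class 2
v_st = [0.20, 0.50, 0.30]   # structural classifier votes for class 1

print(fuse(v_sp, v_te, v_st, prev_class=0, method=1))  # 2
print(fuse(v_sp, v_te, v_st, prev_class=0, method=2))  # 1
print(fuse(v_sp, v_te, v_st, prev_class=1, method=1))  # 1
```

Both rules change the fused class only when none of the three classifiers still supports the previous class, which damps frame-to-frame flicker. They differ only in which classifier is followed once a change is accepted.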

Chapter 4. Experimental Results

This chapter describes the research environment and equipment and provides details about the CVIU Moving Camera Human Action dataset. It then presents the action classification results for each of the three types of features individually, followed by the fusion results of action classification. Finally, it shows the online human action recognition results for a multi-action sequence.

4.1 Research Environment and Equipment Setup

The focus of this research is the recognition of human actions under moving camera circumstances. These circumstances are simulated by moving a four-wheeled cart carrying a Kinect v2 sensor in a classroom with an uncluttered background. The cart is 76.5 cm high and 45 cm wide. The CVIU Moving Camera Human Action dataset was built with this environment and equipment. The system was implemented in Python 3.7 using Keras 2.3, TensorFlow 1.15 and OpenCV 4.1, and was run on an NVIDIA GeForce GTX 1080 Ti under Ubuntu 16.04.

4.2 CVIU Moving Camera Human Action Dataset

We established an M-Video dataset called the CVIU Moving Camera Human Action dataset (CVIU dataset). The CVIU dataset contains 3,646 human action sequences (252,048 frames), including 11 types of single human actions and 5 types of interactive human actions. The single human actions are drink in sit and stand positions, eat in sit and stand positions, play with a phone, sit down, stand up, use a laptop, walk straight, walk horizontal, and read. The interactive human actions are kick, hug, carry object, walk toward each other, and walk away from each other. The dataset was recorded from three perspectives, and each human action sequence was recorded while the camera was moving slowly towards the target persons. The first recording perspective, D1, had the Kinect v2 sensor facing the target person. The second recording perspective, D2, had the Kinect v2 sensor on the right side of the target person at a 45° angle.
The third recording perspective, D3, had the Kinect v2 sensor on the left side of the target person at a 45° angle. Figure 4.1 provides a schematic diagram of the recording setup, including the three recording perspectives and the recording environment and equipment. Figure 4.2 shows the three recording perspectives for the human action sequence “carry object”.

Figure 4.1 Schematic diagram of recording dataset

Figure 4.2 Three recording perspectives for “carry object”: (a) D1, (b) D2, (c) D3

4.3 Action Classification Results of Three Types of Features

This research adopts the CVIU dataset to train and test the networks. In the training stage, the action sequences are subsampled and three types of features are extracted from them. These three types of features are then fed individually into the LSTM networks for training. In the