DEEP LEARNING MODELS EVALUATION - Socially Collaborative Robots Providing Appropriate Services Based on the Meaning of the Situational Context

5.1.1 K-fold Cross-Validation

K-fold cross-validation partitions the labeled data into K equal-sized subsamples (folds). In each round, a single fold is held out as the testing set and the remaining K-1 folds are used as the training set. The process is repeated K times so that each of the K folds is used exactly once as the testing set. The K results are then averaged to produce a single estimate.
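The procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in this work: the function names `k_fold_indices` and `cross_validate` and the `evaluate` callback are hypothetical, and the folds are contiguous index ranges with no shuffling.

```python
from statistics import mean

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # spread any remainder over the first folds
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(n_samples, k, evaluate):
    """Call evaluate(train_idx, test_idx) once per fold and average the K scores."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one goes into the training set.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return mean(scores), scores
```

Here `evaluate` stands for whatever routine trains a model on `train_idx` and returns its accuracy on `test_idx`; with `k=5` this corresponds to the 5-fold setting used in the tables of this chapter.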

5.1.2 Features Comparison

We use a CNN auto-encoder to extract features from the raw images and take the encoded image as the feature. Using the encoded image offers two advantages. The first is dimensionality reduction: a raw image has 57,600 dimensions (3 × 120 × 160), and the encoder maps it into a more meaningful space of only 4,800 dimensions.

The second is that the encoded image retains the potential to reconstruct the original image from this much smaller representation. In the validation stage, we compare the proposed encoded images against two hand-crafted features, HOG and Optical Flow, using the same LSTM classifier.

• Handcraft Feature Selection

With regard to the situational context, we consider human body language to be the key factor in this scenario. Thus, HOG and Optical Flow are employed as our hand-crafted features for comparison. The HOG feature is commonly applied in human detection and yields good results, especially in pedestrian detection [34]. The advantage of the Optical Flow feature is that it takes the spatio-temporal factor into account and can be used to analyze human motion over time [35]. Because the dimensionality varies across features, Principal Component Analysis (PCA) is used to reduce both the HOG and Optical Flow features to 4,800 dimensions, the same as the encoded image.
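To give a feel for what the HOG feature measures, the sketch below computes a gradient-orientation histogram for a single cell of a grayscale patch. It is a deliberately simplified illustration, not the feature extractor used in this work: the function name `orientation_histogram` and the choice of 9 bins are assumptions, and full HOG additionally tiles the image into cells and normalizes over blocks.

```python
import math

def orientation_histogram(patch, n_bins=9):
    """Unsigned gradient-orientation histogram for one cell (simplified HOG).

    patch is a list of rows of grayscale values; border pixels are skipped.
    """
    height, width = len(patch), len(patch[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]  # central difference in x
            gy = patch[y + 1][x] - patch[y - 1][x]  # central difference in y
            magnitude = math.hypot(gx, gy)
            # Fold the direction into [0, 180) so opposite gradients share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist
```

Each pixel votes for its gradient direction with a weight equal to its gradient magnitude, so a patch containing a vertical edge concentrates its votes in the bin around 0 degrees.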

• Results and Discussion

The LSTM architecture is employed as the classifier, with the following training settings: 30 epochs, batch size 8, initial learning rate 0.001, and the Adam optimizer [36], a gradient descent optimization algorithm that combines adaptive learning rates with momentum. The initial parameters are drawn from the he_normal distribution, a normal distribution centered on 0 with standard deviation sqrt(2 / fan_in), where fan_in is the number of input units in the weight tensor. Dropout is employed to prevent overfitting. The experimental results are shown in Table 5-1. Two aspects are worth discussing. First, all three features yield high accuracy on the training set. This supports our hypothesis that human body language plays a key role in our target task, perceiving that a person needs assistance so that heartwarming services can be provided. Second, because people differ from one another, there is inevitably person-to-person noise, which could easily lead to poor performance on the testing set. Nevertheless, the encoded image, HOG feature, and Optical Flow feature still yield 73%, 69%, and 63% accuracy (chance = 50%), respectively. This indicates that our LSTM-based classifier successfully learned a variety of body language from sequential data. The next step is to examine whether the sequential characteristic is really crucial.

Table 5-1 Results of the feature comparison by 5-fold cross-validation, applying the LSTM architecture to learn from different features.
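The he_normal initialization mentioned in the training setup can be illustrated as follows. This is a sketch under stated assumptions: the helper names `he_normal_std` and `he_normal_weights` are hypothetical, the weights are drawn from a plain (untruncated) normal distribution as described in the text, and the layer sizes in the usage note are chosen only for illustration.

```python
import math
import random

def he_normal_std(fan_in):
    """Standard deviation used by he_normal: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)

def he_normal_weights(fan_in, fan_out, seed=0):
    """Draw a fan_in x fan_out weight matrix from N(0, 2 / fan_in)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    std = he_normal_std(fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]
```

For example, a layer that receives the 4,800-dimensional feature vector would use a standard deviation of sqrt(2 / 4800), roughly 0.0204, so wider inputs get proportionally smaller initial weights.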

5.1.3 Classifier Appropriateness

In this experiment, we investigate whether taking sequential information into account improves performance. We therefore compare our LSTM-based classifier with two prevalent classifiers, SVM and Naive Bayes. For each observation, the input to the LSTM-based classifier has dimension 50 × 4800 (a sequence of 50 keyframes, each with 4800 feature dimensions), whereas the input dimension for the SVM and Naive Bayes classifiers is 24000.
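The difference between the two input layouts can be made concrete with a toy example. The sketch below assumes that the non-sequential classifiers receive the keyframes concatenated into one flat vector, which discards the explicit time axis; the function name `flatten_sequence` and the toy sizes (3 keyframes of 4 features) are illustrative only and do not reproduce the preprocessing used in this work.

```python
def flatten_sequence(sequence):
    """Concatenate per-keyframe feature vectors into one flat vector."""
    return [value for frame in sequence for value in frame]

# Toy observation: 3 keyframes, 4 features each (kept as a sequence for an LSTM).
sequence = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]

# The same observation as a single 12-dimensional vector for SVM or Naive Bayes.
flat = flatten_sequence(sequence)
```

An LSTM consumes `sequence` one keyframe at a time and carries state between steps, while a classifier fed `flat` sees every value as an independent input dimension.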

Table 5-2 Results of the classifier appropriateness experiment, applying the LSTM-RNN architecture, SVM, and Gaussian Naive Bayes to classify the need for assistance from the aforementioned features.

• Classifier Selection

Gaussian Naive Bayes is chosen for its simplicity and its advantage in performing well on small training sets. We chose SVM because it yields high accuracy, offers nice theoretical guarantees regarding overfitting, and is often used in people detection, especially pedestrian detection.
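The simplicity of Gaussian Naive Bayes is easy to see in code: training only estimates a mean and a variance per feature and class. The following is a minimal sketch, not the implementation used in the experiments; the class name `GaussianNaiveBayes` and the `var_smoothing` constant are assumptions introduced for illustration.

```python
import math
from collections import defaultdict

class GaussianNaiveBayes:
    """Minimal Gaussian Naive Bayes for lists of numeric feature vectors."""

    def fit(self, X, y, var_smoothing=1e-9):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        self.stats = {}
        for label, rows in by_class.items():
            n = len(rows)
            columns = list(zip(*rows))
            means = [sum(col) / n for col in columns]
            # Small constant keeps the variance strictly positive.
            variances = [
                sum((v - m) ** 2 for v in col) / n + var_smoothing
                for col, m in zip(columns, means)
            ]
            self.stats[label] = (math.log(n / len(X)), means, variances)
        return self

    def predict_one(self, row):
        def log_posterior(label):
            log_prior, means, variances = self.stats[label]
            # Features are treated as independent Gaussians given the class.
            return log_prior + sum(
                -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
                for x, m, var in zip(row, means, variances)
            )
        return max(self.stats, key=log_posterior)
```

Because only two numbers per feature and class are estimated, the model can be fitted from a handful of observations, which is the small-training-set advantage referred to above.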

• Results and Discussion

In these experiments, the LSTM-based, SVM, and Gaussian Naive Bayes classifiers are applied to the three kinds of features, as shown in Table 5-2. Following the results of the previous experiment, the discussion here focuses mainly on testing accuracy as the measure of classifier performance. With the encoded image and HOG as features, the LSTM-based classifier outperforms the other two classifiers.

For the encoded image, the LSTM-based classifier achieves 23.5% and 7% higher accuracy than the SVM and Gaussian Naive Bayes classifiers, respectively. For HOG, it achieves 19.5% and 9% higher accuracy than the SVM and Gaussian Naive Bayes classifiers, respectively. For the Optical Flow feature, however, the LSTM-based classifier falls about 3% below SVM in accuracy. From our perspective, the Optical Flow feature already takes the transition between two images into consideration, so this feature is slightly better suited to the SVM classifier. The results of this experiment support our hypothesis that the need for assistance is not perceived from a single momentary trigger; rather, a sequence of features considerably raises accuracy.

Table 5-3 Results of multi-feature fusion, applying the LSTM-RNN architecture, SVM, and Gaussian Naive Bayes to classify the need for assistance from concatenations of two of the aforementioned features.

5.1.4 Multi-feature Fusion

In the multi-feature fusion experiment, we examine whether concatenating two of the aforementioned features achieves better performance with each classifier. The results are shown in Table 5-3.
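Feature fusion by concatenation can be sketched as below. The text does not state at which level the features are joined, so this example assumes that the two feature vectors are concatenated keyframe by keyframe; the function name `fuse_features` and the toy values are hypothetical.

```python
def fuse_features(seq_a, seq_b):
    """Concatenate two feature sequences keyframe by keyframe."""
    if len(seq_a) != len(seq_b):
        raise ValueError("both sequences must have the same number of keyframes")
    return [list(a) + list(b) for a, b in zip(seq_a, seq_b)]

# Toy example: 2 keyframes, one feature with 2 values and another with 1 value.
encoded = [[0.1, 0.2], [0.3, 0.4]]
hog = [[0.9], [0.8]]
fused = fuse_features(encoded, hog)
```

The fused sequence keeps the number of keyframes and widens each keyframe to the sum of the two feature dimensions, so the same classifiers can be applied without any other change.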

• Results and Discussion

The previous experimental results show that the encoded image and HOG perform better with the LSTM-based classifier, while Optical Flow is slightly better suited to the SVM classifier. The results of Experiment 3 show that encoded image + HOG with the LSTM-based classifier improves accuracy by 2% compared with the encoded image alone. For Optical Flow, accuracy rises slightly, by 2.5% and 1.5%, with SVM when it is concatenated with the encoded image and HOG, respectively. From our perspective, all three kinds of features represent human body language, so concatenating them may bring only a limited benefit in accuracy. In future work, we would like to consider another kind of feature, such as facial expression; fusing human body language with facial expression may improve performance.

5.1.5 Deep Learning Models Comparison

Table 5-4 Results for perceiving a person’s mentation by 5-fold cross-validation, applying CNNs followed by LSTM architecture to learn from different features.

In the previous experiments, we learned spatial and temporal factors separately: we first extracted spatial features and then applied LSTMs to learn from them in sequence. In this experiment, we let a single deep learning model learn spatio-temporal features automatically. The proposed architecture of CNNs followed by an LSTM, shown in Figure 4-22, has the potential to raise accuracy. This architecture is fed with raw images, HOG images, and Optical Flow images as input. The experimental results are shown in Table 5-4.
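The idea of a per-frame spatial stage feeding a recurrent temporal stage can be sketched in miniature. The code below is not the architecture of Figure 4-22: `encode_frame` is a trivial stand-in for the CNN stage (it averages pixel values), the LSTM has a single unit with scalar input, and the parameter layout in `params` is an assumption made for readability.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One recurrent step of a single-unit LSTM with scalar input x.

    params maps each gate name to a tuple (input weight, recurrent weight, bias).
    """
    pre = {name: w_x * x + w_h * h_prev + b for name, (w_x, w_h, b) in params.items()}
    i = sigmoid(pre["input"])        # how much new information to write
    f = sigmoid(pre["forget"])       # how much of the old cell state to keep
    o = sigmoid(pre["output"])       # how much of the cell state to expose
    g = math.tanh(pre["candidate"])  # candidate value for the cell state
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

def encode_frame(frame):
    """Stand-in for the CNN stage: collapses a frame to its mean pixel value."""
    return sum(frame) / len(frame)

def run_cnn_lstm(frames, params):
    """Encode each frame, feed the codes through the LSTM, return the last output."""
    h, c = 0.0, 0.0
    for frame in frames:
        h, c = lstm_step(encode_frame(frame), h, c, params)
    return h
```

In a real model both stages are trained jointly, so the spatial encoder is shaped by the temporal objective rather than fixed beforehand as in the earlier experiments.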

• Results and Discussion

To limit computational cost, we greatly reduce the number of filters in each convolutional layer. To our surprise, the raw image and HOG image inputs yield poor accuracy; the model learns essentially nothing. From our perspective, there are two reasons. One is that these two kinds of input may contain too much noise, whereas we want to determine whether a person needs assistance by analyzing his/her sequential behaviors. The other is that we have few observations for training, which may not be enough data to learn such a complex model. However, if we apply a preprocessing step, such as extracting the Optical Flow feature beforehand, the training accuracy rises significantly to 92.13%, which means the model does learn something. In terms of testing accuracy, the Optical Flow images reach 78%, which beats the results of the aforementioned experiments. From this experiment, we draw two conclusions. First, when training data are scarce, it may be a good idea to extract some simple hand-crafted features before learning. Second, a deep learning model composed of CNNs followed by LSTMs has more potential than two separate learning models.

5.2 SITUATIONAL CONTEXT PERCEPTION
