RELATED WORK - A Fusion Statistical Method for Predicting User Behavior to Assist Identity Verification

Many works have used keystroke dynamics to verify a user’s identity. The features of keystroke dynamics include timing features (duration, hold time, latency, and so on), pressure, and position. Most works analyze a user’s keystrokes using the timing features. This idea first appeared in 1975 [1]. Since then, researchers have proposed various analysis methods for keystroke dynamics, and a commercial product suite has also appeared [40]. In this chapter, Section 2.1 introduces the keystroke analysis methods proposed in related works, Section 2.2 focuses on behavior change, and Section 2.3 summarizes the related works.

2.1 Methods of Keystroke Analysis

The analysis methods of most works can be classified into two categories: statistical classification methods and machine learning methods. The former includes simple statistical methods and data mining methods, such as the k-nearest neighbor decision rule, the Bayes classifier, and decision trees. The latter includes Neural Networks, Fuzzy Logic, Support Vector Machines, and Hidden Markov Models.

2.1.1 Statistical Classification Method

Early works applied statistical classification methods from the statistics or data mining areas to analyze a typist’s keystroke characteristics. Gaines et al. [2] conducted an experiment in which seven professional secretaries were asked to type predetermined text, and used the t-test to examine their typing characteristics. Joyce and Gupta [3] introduced an intuitive method that analyzes the user’s login name, first name, last name, and password during the login process. They measured the difference between the reference strings and the test strings, and compared that difference with a threshold.
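As a rough illustration of this kind of threshold test, the following Python sketch compares a test latency vector with a reference vector. The L1 distance (sum of absolute differences), the function names, the threshold, and the millisecond values are all assumptions made for illustration; they are not taken from [3].

```python
def l1_distance(reference, test):
    """Sum of absolute differences between two equal-length latency vectors."""
    if len(reference) != len(test):
        raise ValueError("latency vectors must have the same length")
    return sum(abs(r - t) for r, t in zip(reference, test))


def is_accepted(reference, test, threshold):
    """Accept the attempt when its distance to the reference is within the threshold."""
    return l1_distance(reference, test) <= threshold


# Hypothetical digraph latencies in milliseconds (illustrative values only).
reference_latencies = [120, 95, 140, 110]
genuine_attempt = [125, 90, 150, 105]
impostor_attempt = [180, 60, 200, 170]

print(is_accepted(reference_latencies, genuine_attempt, threshold=50))
print(is_accepted(reference_latencies, impostor_attempt, threshold=50))
```

In practice the threshold would be derived from the variability of the user’s own reference samples rather than fixed by hand.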

Monrose and Rubin [7][13] clustered users based on typing speed using three classification methods: the Euclidean distance measure, the non-weighted probability measure, and the weighted probability measure. Their work focused on analyzing the most frequently appearing digraphs. Magalhaes et al. [22] introduced a lightweight algorithm to analyze keystroke characteristics and considered the concept of keyboard gridding based on Revett and Khan’s work [23]. Guven et al. [38] introduced a vector-based algorithm similar to the minimum distance classifier; they calculated norms over the vector dimensions of two given keystroke vectors to make a decision. Hocquet et al. [21] combined three classification methods: a classical method (the average and standard deviation), a measurement of disorder, and a discretization of time. Villani et al. [24] analyzed long-text input to verify the user’s identity with the nearest neighbor classifier. They ran experiments with two input modes (copy and free-text input) and two keyboard types (desktop and laptop keyboards).

In these works, the length of the typed string is usually greater than 14 to improve accuracy, yet the error rate is still higher than 5%.

2.1.2 Machine Learning Method

More recently, most works have used machine learning methods to model the user’s typing behavior, which has improved verification accuracy. Neural Networks [5][6][10] have been applied in this area since 1997. Ru et al. [8] and Araújo et al. [20] then used Fuzzy Logic to distinguish users based on keystroke latencies, the distance between keys on the keyboard, and the typing difficulty of key combinations.

Later, the Support Vector Machine (SVM) [17][19] and Principal Component Analysis (PCA) [25] were also introduced. Haidar et al. [14] presented a suite of techniques using Neural Networks, Fuzzy Logic, statistical methods, and several hybrid combinations of these approaches to learn the typing behavior of a user. Dowland et al. [15] compared the classification accuracy of several data mining methods (k-NN, COG, C4.5, CN2, OC1, RBF). The results showed that the machine learning (OC1 and C4.5) and statistical (k-NN) algorithms are suitable for free-text keystroke analysis.

These machine learning methods involve efficiency trade-offs. With a Neural Network, whenever new members join, the network must be retrained, which may make it unstable. An SVM usually takes a long time to train and needs substantial resources. Because of these training requirements, such classifiers are not appropriate for a real-time authentication system.

We choose the Hidden Markov Model (HMM) from statistical learning theory to model the user’s typing behavior [27]. Several properties make the HMM useful for keystroke analysis. First, each individual has his or her own HMM for that individual’s keystroke timing characteristics, so when a new user enrolls, the only thing to do is to create an HMM for that user. Second, the HMM is easy to implement and does not need large resources. Finally, the HMM is efficient to operate: density approximation during training takes quadratic time, and applying the Forward algorithm during classification takes linear time [37].
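To make the linear-time claim concrete, the following Python sketch shows a scaled Forward algorithm for an HMM. It assumes discrete observation symbols (for example, latencies binned into a few classes), which is a simplification chosen for illustration; the function name, parameter layout, and example probabilities are hypothetical and are not the model used in this thesis.

```python
import math


def forward_log_likelihood(obs, start_p, trans_p, emit_p):
    """Return log P(obs | model) for a discrete-observation HMM.

    obs     : list of observation symbol indices
    start_p : start_p[i] is the probability of starting in state i
    trans_p : trans_p[i][j] is the probability of moving from state i to j
    emit_p  : emit_p[i][k] is the probability of emitting symbol k in state i
    """
    if not obs:
        raise ValueError("observation sequence must not be empty")
    n_states = len(start_p)
    log_likelihood = 0.0
    alpha = None
    for t, symbol in enumerate(obs):
        if t == 0:
            # Initialization: start probability times emission probability.
            alpha = [start_p[i] * emit_p[i][symbol] for i in range(n_states)]
        else:
            # Induction: propagate the previous alpha through the transitions.
            alpha = [
                emit_p[j][symbol]
                * sum(alpha[i] * trans_p[i][j] for i in range(n_states))
                for j in range(n_states)
            ]
        # Rescale at every step to avoid underflow on long sequences.
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        log_likelihood += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_likelihood


# Hypothetical two-state model with two symbols (0 = short latency, 1 = long).
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
print(forward_log_likelihood([0, 1, 0, 0], start, trans, emit))
```

The loop visits each observation once, so the cost grows linearly with the sequence length; each step costs on the order of the square of the number of states. With one such model per user, a test sequence can be scored against the claimed user’s model and the log-likelihood compared with a threshold.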

2.2 Behavior Change

In the literature on keystroke dynamics analysis, we observed that the most common methods rely on the sample mean and sample standard deviation of the keystroke latencies or durations provided in the training phase. However, a user’s keying behavior may change over time. Consequently, some works applied an adaptation mechanism to update users’ profiles. Bleha et al. [4] used a minimum distance classifier and a Bayes classifier to determine the user’s identity; the reference data for each user was updated weekly, using the latest 30 entries to compute the reference patterns. Monrose et al. [11] proposed a novel approach to improving password security by combining typing patterns with the password to generate a secret, which is then used to encrypt data. They used the last h successful logins to update the user’s history file. Araújo et al. [20] performed adaptation after each successful authentication: if the new sample was not far from the original sample mean, it was added to the user template and the oldest sample was discarded. Hosseinzadeh et al. [26] applied Gaussian Mixture Models to keystroke identification because the user’s model can be updated each time he or she is authenticated.
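The sliding-window style of adaptation described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the class name `SlidingProfile`, the window size, and the rule that "not far" means within `tolerance` sample standard deviations of the mean are all hypothetical choices, not details taken from [4], [11], or [20].

```python
from collections import deque
from statistics import mean, stdev


class SlidingProfile:
    """Keeps the most recent accepted samples of one timing feature."""

    def __init__(self, samples, max_size=10, tolerance=2.0):
        if len(samples) < 2:
            raise ValueError("at least two samples are needed to estimate spread")
        # A deque with maxlen silently drops the oldest item when full.
        self.samples = deque(samples, maxlen=max_size)
        self.tolerance = tolerance

    def update(self, new_sample):
        """Add new_sample if it is close to the current mean; report success."""
        mu = mean(self.samples)
        sigma = stdev(self.samples)
        if abs(new_sample - mu) <= self.tolerance * sigma:
            self.samples.append(new_sample)
            return True
        return False


# Hypothetical hold times in milliseconds.
profile = SlidingProfile([100, 110, 90, 105, 95], max_size=5)
print(profile.update(102), list(profile.samples))
print(profile.update(300), list(profile.samples))
```

The window size plays the role of the "latest 30 entries" or "last h logins" in the works cited above.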

All of the above works verify the user’s behavior with recent data and drop old data. This raises the question of how much reference data should be included. If the amount of data is large, the model cannot reflect the user’s current behavior exactly. If it is small, the model overreacts. Moreover, how should each piece of reference data influence the model? In general, later behavior should affect the model more, and earlier behavior should affect it less.
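One simple way to let later samples count more than earlier ones is to weight them exponentially by age. The Python sketch below illustrates that principle only; the exponential scheme, the `decay` parameter, and the function name are assumptions for illustration and are not the prediction method proposed in this thesis.

```python
import math


def recency_weighted_stats(samples, decay=0.8):
    """Return (mean, standard deviation) with newer samples weighted more.

    samples : timing values ordered from oldest to newest
    decay   : weight ratio between consecutive samples, in (0, 1];
              1.0 reproduces the unweighted mean and population deviation
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")
    n = len(samples)
    # The newest sample gets weight 1, the one before it gets decay, and so on.
    weights = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(weights)
    mu = sum(w * x for w, x in zip(weights, samples)) / total
    variance = sum(w * (x - mu) ** 2 for w, x in zip(weights, samples)) / total
    return mu, math.sqrt(variance)


# Hypothetical latencies drifting upward over time (milliseconds).
print(recency_weighted_stats([100, 105, 110, 130, 140], decay=0.5))
print(recency_weighted_stats([100, 105, 110, 130, 140], decay=1.0))
```

A smaller `decay` makes the estimate follow recent behavior more closely, at the cost of reacting more strongly to a single unusual sample, which mirrors the trade-off described above.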

2.3 Summary

The approaches in the related works accept an attempt as valid by checking whether each timing feature provided by the typist falls within a threshold, as follows [3][12]:

$$D_\mu^p - w D_\sigma^p \le D \le D_\mu^p + w D_\sigma^p,$$

where $D$ is one of the timing features in the test profile, $D_\mu^p$ and $D_\sigma^p$ are the corresponding mean and standard deviation of that feature in the individual’s reference profile, and $w$ is the weighting factor. Usually, the mean and standard deviation are estimated by the sample mean and sample standard deviation. These estimates are not always practical, because the user’s behavior may change over time.
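The acceptance test above can be sketched in Python as follows, using the sample mean and sample standard deviation as the estimates. The function name, the default weighting factor, and the reference values are hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev


def within_threshold(d, reference, w=2.0):
    """Check whether timing feature d lies within mean +/- w * std of the reference."""
    mu = mean(reference)      # sample mean of the reference profile
    sigma = stdev(reference)  # sample standard deviation (needs >= 2 samples)
    return mu - w * sigma <= d <= mu + w * sigma


# Hypothetical reference durations for one key, in milliseconds.
reference_durations = [100, 110, 90, 105, 95]
print(within_threshold(110, reference_durations, w=2.0))
print(within_threshold(120, reference_durations, w=2.0))
```

A larger weighting factor widens the accepted band, which admits more genuine attempts but also more impostor attempts.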

If the mean and standard deviation can be estimated more realistically, the model will verify identities more accurately and detect impostors more easily. To achieve this goal, this thesis treats behavior change as an additional feature and uses a statistical prediction method to estimate the user’s probable behavior (mean and standard deviation).
