

3. Modeling and Methodology

3.3. Maximum Likelihood Estimation of the Parameters

Since we cannot collect all of an individual's typing keystrokes and compute the exact means and variances for each distinct combination of n-graph durations, we have to estimate the parameters $\{(\hat{\mu}_q, \hat{\sigma}_q)\}_{q \in Q_n}$ of the n-graph durations, given a keystroke sequence $S$, by the method of maximum likelihood estimation. Fortunately, for a Gaussian distribution the maximum likelihood estimates are simply the sample mean and sample variance: for each distinct n-graph $q$ with observed durations $d_1, d_2, \ldots, d_m$,

$$\hat{\mu}_q = \frac{1}{m}\sum_{i=1}^{m} d_i, \qquad \hat{\sigma}_q^2 = \frac{1}{m}\sum_{i=1}^{m} \left(d_i - \hat{\mu}_q\right)^2.$$
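For concreteness, the following small sketch (Python, with illustrative variable and function names that are not taken from the thesis) computes these maximum likelihood estimates from a collection of observed n-graph durations:

```python
from collections import defaultdict
from math import sqrt

def estimate_ngraph_parameters(samples):
    """Maximum likelihood estimates (sample mean, sample standard deviation)
    of the Gaussian duration model for each distinct n-graph.

    `samples` is an iterable of (ngraph, duration_ms) pairs gathered from
    the reference keystroke sequences."""
    durations = defaultdict(list)
    for ngraph, duration in samples:
        durations[ngraph].append(duration)

    params = {}
    for ngraph, ds in durations.items():
        mu = sum(ds) / len(ds)                            # sample mean
        var = sum((d - mu) ** 2 for d in ds) / len(ds)    # ML (biased) variance
        params[ngraph] = (mu, sqrt(var))
    return params

# Hypothetical digraph durations (in ms) from a few reference samples:
samples = [("na", 95), ("na", 105), ("na", 100), ("an", 120), ("an", 130)]
print(estimate_ngraph_parameters(samples))
# {'na': (100.0, 4.08...), 'an': (125.0, 5.0)}
```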

3.4. Hidden Markov Model

Hidden Markov Models (HMMs) [12, 14, 22] are well suited to modeling sequential data, such as the sequences of keystroke timing information that we consider in this thesis. HMMs have been widely applied in areas such as speech recognition, optical character recognition, machine translation, bioinformatics, and genomics. A Markov process is a stochastic process with the property that the probability of transitioning from the previous state to the current state depends only on the previous state and is independent of all earlier states. In general, a Markov model is a way of describing a process that goes through a series of states [14]. In a general Markov model, the state is directly observed by the observer. In a Hidden Markov Model, the state is not directly visible; only some outputs of the state are observed. A Hidden Markov Model can be viewed as a chain of mixture models with unknown parameters.

The HMM we use to model the timing information of a keystroke sequence is shown in Figure 3.1. It is a statistical graphical model in which each circle is a random variable: unshaded circles represent the unknown (hidden) state variables $q_t$ that we wish to infer, and shaded circles represent the observed output variables $y_t$, where $t$ is a specific point in time. $A$ is the state transition matrix holding the probabilities of transitioning from state $i$ at time $t$ to state $j$ at time $t+1$:

$$P\left(q_{t+1}^{j} = 1 \mid q_t^{i} = 1\right) = A_{ij}.$$

$\eta$ is the state emission matrix holding the output probability $P\left(y_t \mid q_t^{i} = 1\right)$ of the $i$-th state. $\pi_i$ is the initial probability of the $i$-th state. The compact notation $\lambda = (A, \eta, \pi)$ is used to denote the complete parameter set of the model.

Figure 3.1: The Hidden Markov Model for keystroke analysis.

In our setting, given a keystroke sequence $S$, let $G$ denote its n-graphs and $G'$ its [n+1]-graphs, where each [n+1]-graph corresponds to a transition from one n-graph to the next. The state transition matrix $A$ is then obtained from the relative frequency with which each [n+1]-graph appears in $S$, as follows.

For instance, in Figure 3.2, given the keystroke sequence "banana" with digraphs as the n-graphs of interest, the digraph "na" follows the digraph "an". There are 5 (= 6 - 2 + 1) digraphs and 4 (= 6 - 3 + 1) trigraphs in "banana", and the trigraph "ana" appears twice. As a result, the transition probability from "an" to "na" is 2/4 = 0.5.

Figure 3.2: Graphical model for digraphs with keystroke sequence "banana"
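The frequency computation behind this example can be sketched as follows (an illustrative reimplementation of the described counting, not code from the thesis):

```python
from collections import Counter, defaultdict

def digraph_transition_matrix(s):
    """Estimate the transition probability from one digraph to the next as
    the relative frequency of the corresponding trigraph in the sequence,
    as in the "banana" example."""
    trigraphs = Counter(s[i:i + 3] for i in range(len(s) - 2))
    total_trigraphs = len(s) - 2                 # m - 3 + 1 trigraphs

    A = defaultdict(dict)
    for trigraph, count in trigraphs.items():
        src, dst = trigraph[:2], trigraph[1:]    # e.g. "ana" -> "an", "na"
        A[src][dst] = count / total_trigraphs
    return A

A = digraph_transition_matrix("banana")
print(A["an"]["na"])   # trigraph "ana" appears 2 times out of 4 -> 0.5
```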

The state emission matrix $\eta$ here is defined by the Gaussian probability of the n-graphs $G = \{g_1, g_2, g_3, \ldots, g_{m-n+1}\}$ with durations $GD = \{d_1(g_1), d_2(g_2), d_3(g_3), \ldots, d_{m-n+1}(g_{m-n+1})\}$, as follows:

$$P\left(d_i(g_i) \mid q_i^{g_i} = 1\right) = \frac{1}{\sqrt{2\pi}\,\hat{\sigma}_{g_i}} \exp\!\left(-\frac{\left(d_i(g_i) - \hat{\mu}_{g_i}\right)^2}{2\hat{\sigma}_{g_i}^2}\right).$$

For example, given a sample duration of 80 ms for the digraph "na", with mean 100 ms and standard deviation 30 ms, the emission probability of the sample digraph duration is

$$\frac{1}{\sqrt{2\pi}\cdot 30} \exp\!\left(-\frac{(80 - 100)^2}{2 \cdot 30^2}\right) \approx 0.010648267.$$
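This value follows directly from the Gaussian density; a quick check in code (the function name is illustrative):

```python
from math import exp, pi, sqrt

def emission_probability(duration, mu, sigma):
    """Gaussian density used as the HMM emission probability of an observed
    n-graph duration."""
    return exp(-((duration - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

# Digraph "na": observed duration 80 ms, mean 100 ms, standard deviation 30 ms.
print(emission_probability(80, 100, 30))   # ~0.010648267
```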

The initial probability vector $\pi$ is given by the relative frequency with which each n-graph appears in $S$.

There are three basic problems to solve with an HMM $\lambda = (A, \eta, \pi)$. These problems are the following.

♦ Given the model parameters $\lambda = (A, \eta, \pi)$ and an observation output sequence $O = O_1 O_2 O_3 \ldots O_t$, compute the probability $P(O \mid \lambda)$ of the observation output sequence.

♦ Given the model parameters $\lambda = (A, \eta, \pi)$ and an observation output sequence, find the state sequence $Q = q_1 q_2 q_3 \ldots$ which most probably generated the observation output sequence.

♦ Given an observation output sequence $O = O_1 O_2 O_3 \ldots O_t$, estimate the HMM parameters $\lambda = (A, \eta, \pi)$ that maximize $P(O \mid \lambda)$.

We assume that each individual has his or her own HMM $\lambda = (A, \eta, \pi)$ describing that individual's keystroke timing characteristics. The problem to solve is, given a keystroke sequence $S$ and its timing information, to choose from a number of HMMs the one that has the highest probability of generating $S$. Consequently, we first have to calculate the probability of the keystroke sequence $S$ for each HMM. This is the first basic problem listed above, and we show how to solve it with the Forward algorithm in the next section.

3.5. Forward Algorithm

The problem of finding the probability of a keystroke sequence can be viewed as evaluating how well a given HMM matches the sequence. We apply a modified version of the Forward algorithm [22] to calculate the probability of a keystroke sequence $S$ of length $m$ with n-graphs $G$ and n-graph durations $GD$.

The state probabilities $\alpha$ of the states are computed by first calculating $\alpha$ for all states at $t = 1$:

$$\alpha_1(g_1) = \pi_{g_1} \cdot \eta_{g_1}\!\left(d_1(g_1)\right).$$

Then for each time step $t = 2, \ldots, k$, the state probability $\alpha$ is calculated recursively for each state:

$$\alpha_t(g_t) = \alpha_{t-1}(g_{t-1}) \cdot A_{g_{t-1}\,g_t} \cdot \eta_{g_t}\!\left(d_t(g_t)\right).$$

The Forward algorithm described above differs somewhat from the original one in [22]. The emission probabilities take less computation to obtain, since we use Gaussian distributions to model the observed states. Additionally, each observed state is connected only to its corresponding hidden state, because we know the exact combination of n-graphs the individual typed. The summation over all partial probabilities of the states at time $t$ is therefore unnecessary, and only one probability is calculated.

In the original version of the Forward algorithm, the calculation of $\alpha_t(j)$, $1 \leq t \leq T$, $1 \leq j \leq N$, where $T$ is the number of observations in the sequence and $N$ is the number of states in the model, requires $O(N^2 T)$ calculations. Our modified version of the Forward algorithm requires only $O(NT)$ calculations.
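A compact sketch of this modified Forward recursion, under the assumption that a profile stores the transition matrix, initial probabilities, and per-n-graph Gaussian parameters in simple dictionaries (names are illustrative, not from the thesis):

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, sigma):
    """Gaussian density used as the emission probability."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def modified_forward(ngraphs, durations, profile):
    """Modified Forward algorithm. Because the typed n-graph sequence is
    known, each observation is tied to exactly one hidden state, so the
    summation over states in the original algorithm [22] disappears and
    the cost drops from O(N^2 T) to O(NT).

    ngraphs   -- n-graphs g_1 .. g_k extracted from S
    durations -- matching n-graph durations d_1 .. d_k
    profile   -- {'init': {g: pi_g}, 'A': {g: {g2: A_g_g2}}, 'gauss': {g: (mu, sigma)}}
    Returns Pr[S, G, GD | lambda]."""
    init, A, gauss = profile["init"], profile["A"], profile["gauss"]

    mu, sigma = gauss[ngraphs[0]]
    alpha = init[ngraphs[0]] * gaussian_pdf(durations[0], mu, sigma)

    for t in range(1, len(ngraphs)):
        mu, sigma = gauss[ngraphs[t]]
        alpha *= A[ngraphs[t - 1]][ngraphs[t]] * gaussian_pdf(durations[t], mu, sigma)
    return alpha
```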

3.6. General Modules for Keystroke Analysis

In general, two problems can be solved using our model.

♦ Given a keystroke sequence $S$ and an HMM $\lambda$ describing an individual's keystroke timing information, we wish to determine whether $S$ comes from $\lambda$ or not. (Authentication)

♦ Given a keystroke sequence $S$ and a set of HMMs $\lambda$'s describing different individuals' keystroke timing information, we wish to know which HMM most probably generated $S$. (Identification)

The first problem is that, given a test sample of a keystroke sequence and a reference profile, we have to decide whether the sample belongs to the reference profile or not. The second problem is very similar to the identification task addressed by physiological biometrics. In this section, we devise three modules based on the model and algorithm described in the previous sections: the Profile Building Module, the Authentication Module, and the Identification Module.

In the Profile Building Module, we first have to build the reference profile for each user. This requires the user to provide reference samples; the more reference samples provided, the more accurate the extracted parameters. After collecting a sufficient number of reference samples, we use maximum likelihood estimation for the Gaussian model to calculate the parameters of each n-graph duration. We also compute the transition probability matrix and the initial probability vector of the Hidden Markov Model. The parameters calculated for the Hidden Markov Model are then treated as the base elements of the reference profile for each user. The flow chart of the Profile Building Module is shown in Figure 3.3.

[Figure 3.3 flowchart steps: the user provides a sufficient number of reference samples; the samples are transformed into n-graph combinations with a duration time for each n-graph; maximum likelihood estimation for the Gaussian model yields the duration mean and standard deviation of each distinct n-graph; the transition probability matrix and initial probability vector of each n-graph state are computed; the resulting profile, in the form of the HMM λ = (A, η, π), is stored in the User Profile Database.]

Figure 3.3: Flow chart for Profile Building Module
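Tying the earlier sketches together, a user profile might be assembled roughly as follows (reusing the hypothetical `estimate_ngraph_parameters` and `digraph_transition_matrix` sketches above; the data layout is an assumption, not the thesis's format):

```python
def build_profile(reference_samples):
    """Assemble lambda = (A, eta, pi) for one user from reference samples of
    the same fixed target string.

    `reference_samples` is a list of (string, durations) pairs, where
    durations[i] is the measured duration (ms) of the digraph string[i:i+2]."""
    duration_samples = []
    for s, durs in reference_samples:
        for i in range(len(s) - 1):
            duration_samples.append((s[i:i + 2], durs[i]))

    target = reference_samples[0][0]
    digraphs = [target[i:i + 2] for i in range(len(target) - 1)]

    gauss = estimate_ngraph_parameters(duration_samples)   # eta: (mu, sigma) per digraph
    A = digraph_transition_matrix(target)                  # transition probabilities
    init = {g: digraphs.count(g) / len(digraphs) for g in set(digraphs)}  # pi
    return {"init": init, "A": A, "gauss": gauss}
```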

In the Authentication Module, given a keystroke sequence $S$ of the target string from a user with claimed identity $ID$, we wish to examine the possibility that $S$ was generated by $ID$. First we transform the keystroke sequence into its n-graph combinations $G$ and calculate the timing information of the n-graph durations $GD$ as usual. We also construct an n-graph duration threshold vector $GDT$, in which the duration of each n-graph $g_k$ deviates $\varepsilon$ times the duration standard deviation from the duration mean, to evaluate the threshold value of the probability produced by the modified Forward algorithm. With the inputs $S$, $G$, $GD$, $GDT$, and $\lambda_{ID}$, we can apply the modified version of the Forward algorithm to obtain two probability values, $\Pr[S, G, GD \mid \lambda_{ID}]$ and $\Pr[S, G, GDT \mid \lambda_{ID}]$. $\Pr[S, G, GDT \mid \lambda_{ID}]$ can be viewed as the probability obtained if all the n-graph durations in $G$ deviate $\varepsilon$ times the duration standard deviation $\sigma$ from the duration mean $\mu$; it serves as the threshold value of probability used to decide acceptance. Acceptance of the keystroke sequence $S$ is confirmed if the following expression is true:

$$\Pr[S, G, GD \mid \lambda_{ID}] \geq \Pr[S, G, GDT \mid \lambda_{ID}].$$

The weighting factor $\varepsilon$ can be specified with respect to different levels of security strength. The flow chart of the Authentication Module is shown in Figure 3.4.

Figure 3.4: Flow chart for Authentication Module
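A rough sketch of this decision rule, building on the `modified_forward` sketch above (the exact construction of GDT is paraphrased from the description; names are illustrative):

```python
def authenticate(ngraphs, durations, profile, epsilon):
    """Accept the claimed identity if the observed durations score at least
    as high as durations deviating epsilon standard deviations from each
    n-graph's mean (the threshold vector GDT)."""
    gauss = profile["gauss"]
    gdt = [gauss[g][0] + epsilon * gauss[g][1] for g in ngraphs]  # GDT entries

    p_observed = modified_forward(ngraphs, durations, profile)    # Pr[S, G, GD  | lambda_ID]
    p_threshold = modified_forward(ngraphs, gdt, profile)         # Pr[S, G, GDT | lambda_ID]
    return p_observed >= p_threshold
```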

In the Identification Module, given a keystroke sequence from an individual and a set of HMMs $\{\lambda_1, \lambda_2, \ldots\}$ from the User Profile Database, we have to choose the HMM from the $\lambda$'s that most probably generated the sequence, or decide that no such HMM exists. In the beginning, the keystroke sequence is transformed into its n-graph combinations and the timing information of the n-graph durations is calculated; we then apply the modified Forward algorithm to calculate the probability for each HMM. We select the user $U$ whose model attains the maximum probability over all others, i.e., $\Pr[S, G, GD \mid \lambda_U] = \max_j \Pr[S, G, GD \mid \lambda_j]$. If the expression $\Pr[S, G, GD \mid \lambda_U] \geq \Pr[S, G, GDT_U \mid \lambda_U]$ holds, the keystroke sequence is confirmed as generated by user $U$. Otherwise, we consider the keystroke sequence not to be generated by any user in the User Profile Database. The flow chart is shown in Figure 3.5.

[Figure 3.5 flowchart: the testing sample from the user is transformed into n-graph combinations with a duration time for each n-graph; GDT is produced with weighting factor ε; the modified Forward algorithm calculates Pr[S, G, GD | λ] for each HMM in the User Profile Database; the user U with Pr[S, G, GD | λU] = max(Pr[S, G, GD | λj]) over all j in J is selected; if Pr[S, G, GD | λU] >= Pr[S, G, GDTU | λU], the testing sample S is confirmed as generated by user U, otherwise the user is unknown to the User Profile Database. Notation: ID is the claimed identity of the user, S the keystroke sequence of the target string, G the n-graph vector generated from S, GD the n-graph duration vector for G from S, GDTU the n-graph duration vector for G from the Gaussian modeling parameters of user U, λ the HMM parameters of a user, and J the number of profiles in the User Profile Database.]

Figure 3.5: Flow chart for Identification Module
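A minimal sketch of this selection and confirmation step, again reusing the hypothetical `modified_forward` and `authenticate` functions from the earlier sketches:

```python
def identify(ngraphs, durations, profiles, epsilon):
    """Pick the user whose HMM assigns the observed sequence the highest
    probability, then confirm only if that probability also clears the
    selected user's own GDT threshold; otherwise report an unknown user."""
    scores = {user: modified_forward(ngraphs, durations, prof)
              for user, prof in profiles.items()}
    best_user = max(scores, key=scores.get)

    if authenticate(ngraphs, durations, profiles[best_user], epsilon):
        return best_user   # Pr[S,G,GD | lambda_U] is the maximum and clears GDT_U
    return None            # keystroke sequence unknown to the User Profile Database
```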

3.7. Scheme and Measures

Within the literature on fixed-text keystroke analysis, most of the proposed approaches emphasize the application of authentication. Several aspects need to be considered:

♦ The target strings to be analyzed could be usernames, passwords, first names, last names, or pass-phrases, which are normally short, usually three to sixteen characters in length.

♦ The samples used to build the reference profiles and the samples used for comparison are identical, fixed strings; only the timing information extracted from them differs.

We devise the scheme for fixed-text keystroke analysis according to the concerns listed above. There are two phases in the scheme for static keystroke analysis: the training phase and the recognition phase. The training phase builds the user profiles that serve as the database for the recognition phase to compare against.

In the training phase, we have to decide the number of reference samples collected for each target string and the size of the n-graph used to segment the target string. Figure 3.5 depicts the process of the training phase.

Figure 3.5: Flow chart of training phase for fixed-text keystroke analysis

The recognition phase is divided into two parts according to the required function: Authentication or Identification. Figure 3.6 depicts the process of the recognition phase.

[Figure 3.6 flowchart: in the recognition phase, the Authentication Module takes τ, TS, and λτ from the User Profile Database and returns acceptance or rejection of the claimed identity τ; the Identification Module takes TS and all λ's from the User Profile Database and returns the identity τm found to match TS, or reports that TS is unknown to the User Profile Database. Notation: τ is the claimed identity of the user, TS the testing samples of keystroke sequences and timing information, λτ the profile for τ as HMM parameters, λ's all user profiles in the User Profile Database, and τm the identity matched to TS.]

Figure 3.6: Flow chart of recognition phase for fixed-text keystroke analysis

3.8. Authentication Strategy

The target strings to be analyzed in the traditional login-password authentication mechanism are the username and the password. We can use two strategies, as follows:

♦ O-Strategy

The claimed identity is accepted only if both the username and the password pass the verification phase. This strategy requires that users make no mistakes on either target string.

♦ A-Strategy

The claimed identity is rejected only if both the username and the password are denied at the recognition phase. This strategy allows users to make at most one mistake on one of the target strings. (A minimal sketch of both decision rules follows the list.)
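Both strategies reduce to a simple combination of the per-string verification outcomes; a minimal sketch (Python, hypothetical function names):

```python
def o_strategy(username_ok, password_ok):
    """O-Strategy: accept the claimed identity only if both the username and
    the password pass keystroke verification."""
    return username_ok and password_ok

def a_strategy(username_ok, password_ok):
    """A-Strategy: reject only if both target strings are denied, i.e. allow
    at most one of the two strings to fail verification."""
    return username_ok or password_ok
```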

4. Experiments and Results

4.1. Experiment Setting

The experiment was conducted via a web browser, and client-side JavaScript was used to gather the keystroke timing information. Some of the volunteers are colleagues and alumni of NCTU; the others were anonymous volunteers from the Internet. Users provided their login name and password via an HTML form, just as is commonly done in web-based applications. The timing resolution we used is 1 millisecond. In this experiment, we use the digraph as the segment size of the keystroke sequence.

4.2. Data Collection

For the collection of reference samples, 58 volunteers each provided two familiar strings, a login name and a password, 20 times. As for the collection of testing samples, the same 58 volunteers tried to authenticate to their own accounts as legitimate users 15 times each; these 870 testing samples were used to evaluate the FRR. Another 257 anonymous volunteers tried to authenticate to the legitimate users' accounts; each account was attacked between 44 and 82 times, and a total of 3528 impostor testing samples were collected.

The lengths of the login names and passwords are between 4 and 14 characters. Figure 4.1 shows the distribution of target string lengths.

[Bar chart: number of profiles (0 to 60) versus minimum target string length (4 to 11).]

Figure 4.1: Target string length distribution of reference samples

4.3. Evaluation

We evaluate the standard deviation weighting factor ε between 0.2 and 3.5 in steps of 0.1 for both strategies. Figures 4.2 to 4.5 show the FAR and FRR of the O-Strategy with a minimum target string length of 9 and reference sample sizes of 5, 10, 15, and 20, over the possible standard deviation weighting factors.
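As a hedged illustration of how such curves can be tabulated (this is not the thesis's evaluation code; it assumes the hypothetical `authenticate` sketch above and per-trial data already segmented into n-graphs and durations):

```python
def far_frr_curve(genuine_trials, impostor_trials, epsilons):
    """For each weighting factor epsilon, the FRR is the fraction of genuine
    trials that are rejected and the FAR is the fraction of impostor trials
    that are accepted; the EER is read off where the two curves cross.

    Each trial is a (ngraphs, durations, profile) tuple."""
    curve = []
    for eps in epsilons:
        frr = sum(not authenticate(g, d, p, eps)
                  for g, d, p in genuine_trials) / len(genuine_trials)
        far = sum(authenticate(g, d, p, eps)
                  for g, d, p in impostor_trials) / len(impostor_trials)
        curve.append((eps, far, frr))
    return curve

# epsilon swept as in the evaluation: 0.2, 0.3, ..., 3.5
epsilons = [round(0.2 + 0.1 * i, 1) for i in range(34)]
```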

[Plot: FAR and FRR (%) versus threshold of standard deviation, 0.2 to 3.5.]

Figure 4.2: O-Strategy - Minimum target string length = 9, reference sample size = 5

[Plot: FAR and FRR (%) versus threshold of standard deviation, 0.2 to 3.5.]

Figure 4.3: O-Strategy - Minimum target string length = 9, reference sample size = 10, EER = 5.71%

[Plot: FAR and FRR (%) versus threshold of standard deviation, 0.2 to 3.5.]

Figure 4.4: O-Strategy - Minimum target string length = 9, reference sample size = 15, EER = 5.24%

[Plot: FAR and FRR (%) versus threshold of standard deviation, 0.2 to 3.5.]

Figure 4.5: O-Strategy - Minimum target string length = 9, reference sample size = 20, EER = 4.76%

Figures 4.6 to 4.9 show the FAR and FRR of the A-Strategy with a minimum target string length of 9 and reference sample sizes of 5, 10, 15, and 20, over the possible standard deviation weighting factors.

[Plot: FAR and FRR (%) versus threshold of standard deviation, 0.2 to 3.5.]

Figure 4.6: A-Strategy - Minimum target string length = 9, reference sample size = 5, EER = 6.19%

[Plot: FAR and FRR (%) versus threshold of standard deviation, 0.2 to 3.5.]

Figure 4.7: A-Strategy - Minimum target string length = 9, reference sample size = 10, EER = 3.81%

[Plot: FAR and FRR (%) versus threshold of standard deviation, 0.2 to 3.5.]

Figure 4.8: A-Strategy - Minimum target string length = 9, reference sample size = 15, EER = 2.91%

[Plot: FAR and FRR (%) versus threshold of standard deviation, 0.2 to 3.5.]

Figure 4.9: A-Strategy - Minimum target string length = 9, reference sample size = 20, EER = 2.54%

We can see from Figures 4.2 to 4.9 that the A-Strategy obtained a better EER than the O-Strategy. Figures 4.10 and 4.11 show the relation of EER to minimum target string length.

[Plot: equal error rate (0% to 8%) versus minimum target string length (4 to 11), for r_size = 10, 15, 20.]

Figure 4.10: O-Strategy – EER vs. Minimum target string length

[Plot: equal error rate (0% to 12%) versus minimum target string length (4 to 11), for r_size = 5, 10, 15, 20.]

Figure 4.11: A-Strategy – EER vs. Minimum target string length

We can see that the EER drops as the minimum target string length increases.

[Plot: O-Strategy false rejection rate (%) versus number of reference samples (5 to 20), for ε = 1.0, 2.0, 3.0.]

Figure 4.12: O-Strategy: FRR with different number of reference samples

[Plot: O-Strategy false acceptance rate (%) versus number of reference samples (5 to 20), for ε = 1.0, 2.0, 3.0.]

Figure 4.13: O-Strategy: FAR with different number of reference samples

[Plot: A-Strategy false rejection rate (%) versus number of reference samples, for ε = 1.0, 2.0, 3.0.]

Figure 4.14: A-Strategy: FRR with different number of reference samples

[Plot: A-Strategy false acceptance rate (%) versus number of reference samples, for ε = 1.0, 2.0, 3.0.]

Figure 4.15: A-Strategy: FAR with different number of reference samples

We can see from Figures 4.12 to 4.15 that the FRR of both strategies drops as the number of reference samples increases, while the FAR of both strategies rises slightly as the number of reference samples increases.

5. Conclusions

Our approach achieved an EER of 2.54%, which is near the 2% level generally considered acceptable for this type of system. The EER of our scheme can be improved as we conduct more experiments and collect more reference samples with lengths longer than 10.

As for future work, we can combine the proposed scheme with an analysis of the surfing route to the login page. The proposed model can also be extended to devise a scheme for free-text keystroke analysis, such as continuous real-time identity verification.

6. References

[1] D. Gunetti and C. Picardi, “Keystroke Analysis of Free Text”, ACM Transactions on Information and System Security (TISSEC), vol. 8, no. 3, pp. 312-347, Aug 2005.

[2] L. C. F. Araujo, L. H. R. Sucupira Jr., M. G. Lizarraga, L. L. Ling, and J. B. T. Yabu-Uti, "User Authentication Through Typing Biometrics Features", IEEE Transactions on Signal Processing, vol. 53, no. 2, pp. 851-855, Feb. 2005.

[3] S. T. Magalhaes, K. Revett, and H. M. D. Santos, “Password Secured Sites – Stepping Forward with Keystroke Dynamics”, Proceedings of the International Conference on Next Generation Web Services Practices (NWeSP’05), pp. 293-298, Aug. 2005.

[4] W. G. de Ru and J. H. P. Eloff, “Enhanced Password Authentication through fuzzy logic,” IEEE Expert, vol. 17, no. 6, pp. 38–45, Nov. 1997.

[5] K. Revett and A. Khan, “Enhancing Login Security Using Keystroke hardening and Keyboard Gridding”, Proceedings of the IADIS MCCSIS, 2005.

[6] S. T. Magalhaes, H. M. D. Santos, “An Improved Statistical Keystroke Dynamics Algorithm”, Proceedings of the IADIS MCCSIS, 2005.

[7] A. Peacock, X. Ke, and M. Wilkerson, “Typing Patterns: A Key to User Identification”, IEEE Security & Privacy, vol. 2, no. 5, pp. 40-47, Sep 2004.

[8] J. Leggett and G. Williams, "Verifying Identity via Keystroke Characteristics", International Journal of Man-Machine Studies, vol. 28, no. 1, pp. 67-76, 1988.

[9] S. Haidar, A. Abbas, and A. K. Zaidi, “A multi-technique approach for user identification through keystroke dynamics,” in Proc. IEEE International Conference on Systems, Man, and Cybernetics, vol. 2, pp. 1336–1341, 2000.

[10] D. Song, P. Venable, and A. Perrig, “User Recognition by Keystroke Latency Pattern Analysis”, Apr. 1997.

[11] F. Monrose and A. Rubin, “Authentication via Keystroke Dynamics”, Proceedings of the 4th ACM conference on Computer and Communication Security, pp. 48-56, Apr. 1997.

[12] M. I. Jordan, “An Introduction to Probabilistic Graphical Models”. In preparation.

[13] D. X. Song, D. Wagner, and X. Tian, "Timing Analysis of Keystrokes and Timing Attacks on SSH", in 10th USENIX Security Symposium, pp. 337-352, Aug. 2001.

[14] S. Russell and P. Norvig, “Artificial Intelligence, A Modern Approach”, Prentice Hall, 1995.

[15] P. Dowland, H. Singh, and S. M. Furnell, “A Preliminary Investigation of User Authentication using Continuous Keystroke Analysis”, Proceedings of 8th IFIP Annual Working Conference on Information Security Management and Small System Security, Sep. 2001.

[16] P. Dowland, H. Singh, and S. M. Furnell, “Keystroke Analysis as a Method of Advanced User Authentication and Response”, Proceedings of the IFIP TC11 17th International Conference on Information Security: Vision and Perspectives, pp. 215-226, May. 2002.

[17] F. Bergadano, D. Gunetti, and C. Picardi, “User Authentication through Keystroke Dynamics”, ACM Transactions on Information and System Security (TISSEC), vol. 5, no. 4, pp. 367-397, Nov 2002.

[18] R. S. Gaines, W. Lisowski, S. J. Press, and N. Shapiro, "Authentication by Keystroke Timing: Some Preliminary Results", Rand Report R-256-NSF, Rand Corporation, 1980.

[19] R. Joyce and G. Gupta, "Identity Authentication Based on Keystroke Latencies", Communications of the ACM, vol. 33, no. 2, pp. 168-176, 1990.

[20] D. Umphress and G. Williams, "Identity Verification through Keyboard Characteristics", International Journal of Man-Machine Studies, vol. 23, no. 3, pp. 263-273, 1985.

[21] A. J. Mansfield and J. L. Wayman, "Best Practices in Testing and Reporting Performance of Biometric Devices", Biometrics Working Group, Aug. 2002.

[22] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, vol. 77, no. 2, Feb. 1989.

[23] S. Bleha, C. Slivinsky, and B. Hussien, “Computer-Access Security Systems Using Keystroke Dynamics”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 12, Dec. 1990.
