1 National Kaohsiung University of Applied Sciences, Taiwan

Cheng Shiu University, Taiwan [email protected]

Abstract

Attack detection and information security for computer networks have become popular research topics in recent years. In this paper, a PCA-ICA method for attack and intrusion detection is proposed. According to the experimental results, the proposed method achieves a higher correct recognition ratio than PCA alone.

1. Introduction

The development of computer networks has made information exchange convenient. Now that information flows smoothly over the Internet, attack detection and information security for computer networks have become popular topics for many researchers.

Hence, Network Intrusion Detection Systems (NIDS) have emerged.

Papers on intrusion detection usually use KDD-Cup-99 as the database for simulating various kinds of attack and intrusion. In this paper, Principal Component Analysis (PCA) is applied to extract the major characteristics of the database, and then Singular Value Decomposition (SVD) is applied to reduce the dimension of these characteristics in order to raise performance. Moreover, Independent Component Analysis (ICA) is applied to create an independent sub-space for attack and intrusion detection. According to the experimental results, the proposed method, PCA-ICA, achieves a higher correct recognition ratio than PCA.

2. The KDD-Cup-99

The KDD-Cup-99 is a database developed by Lincoln Lab., MIT in 1998 for simulating attack modes on a computer network [1]. It is commonly used for simulating attack modes and for detecting intrusions. The simulations were produced by collecting all kinds of connections, packet flows, and several abnormal situations in a TCP/IP environment.

The KDD-Cup-99 is composed of connection records with known attack labels and a dataset of connection records whose labels are unknown. In general, the abnormal situations are classified into 4 classes covering 22 attack modes. The classification is listed in Table 1. The classes of abnormal situation can be described as follows:

DOS: Denial-of-service, e.g., SYN flood.

R2L: Unauthorized access from a remote machine, e.g., guessing a password.

U2R: Unauthorized access to local super user (root) privileges, e.g., various "buffer overflow" attacks.

Probing: Surveillance and other probing, e.g., port scanning.

Table 1. The classification of the abnormal situations

Attack type   Attack modes
DOS           back, land, neptune, pod, smurf, teardrop
U2R           buffer_overflow, loadmodule, perl, rootkit
R2L           ftp_write, guess_passwd, imap, multihop, phf, spy, warezclient, warezmaster
Probing       ipsweep, nmap, portsweep, satan
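The grouping in Table 1 can be expressed as a simple lookup. The following is a minimal Python sketch; the names `ATTACK_CLASSES`, `MODE_TO_CLASS`, and `classify_label` are hypothetical and chosen only for illustration, and the code assumes that labels use the lowercase, underscore-separated spelling of the dataset files with a trailing period.

```python
# Attack modes grouped by class, following Table 1.
ATTACK_CLASSES = {
    "DOS": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    "U2R": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
    "R2L": ["ftp_write", "guess_passwd", "imap", "multihop",
            "phf", "spy", "warezclient", "warezmaster"],
    "Probing": ["ipsweep", "nmap", "portsweep", "satan"],
}

# Reverse lookup: attack mode -> class name.
MODE_TO_CLASS = {
    mode: cls for cls, modes in ATTACK_CLASSES.items() for mode in modes
}


def classify_label(label):
    """Map a raw record label to one of five classes (4 attack classes + normal)."""
    name = label.strip().rstrip(".")  # assumption: raw labels end with a period
    if name == "normal":
        return "normal"
    return MODE_TO_CLASS[name]  # raises KeyError for a mode not in Table 1
```

A label outside the 22 modes of Table 1 raises a `KeyError` here, which makes unexpected labels visible instead of silently assigning them to a class.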

In the KDD-Cup-99, every row in the database denotes a complete network connection record. There are 42 attributes in each record. The last column denotes whether it is an attack or a normal connection.

An example of a connection record is displayed in Figure 1. A dataset in this form must be preprocessed before it can be passed to the detection methods.

The details of how the data are reformatted are presented later.
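One possible preprocessing step is sketched below in Python. It is an illustration under assumptions, not the procedure used in this paper: the record is an illustrative line in the dataset's comma-separated layout, the symbolic attributes are assumed to sit in columns 1 to 3 (protocol type, service, flag), and mapping them to integer codes is only one of several reasonable encodings. The names `encode_record` and `codebooks` are hypothetical.

```python
# Illustrative record: 41 feature values followed by the label (42 attributes).
RECORD = (
    "0,tcp,http,SF,181,5450,"
    "0,0,0,0,0,1,"
    "0,0,0,0,0,0,0,0,0,0,"
    "8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,"
    "9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,"
    "normal."
)

# Assumption: these zero-based columns hold symbolic (non-numeric) values.
SYMBOLIC_COLUMNS = (1, 2, 3)


def encode_record(line, codebooks):
    """Turn one raw record into (numeric feature vector, label)."""
    fields = line.strip().split(",")
    if len(fields) != 42:
        raise ValueError("expected 42 attributes, got %d" % len(fields))
    *features, label = fields
    vector = []
    for idx, value in enumerate(features):
        if idx in SYMBOLIC_COLUMNS:
            # Assign the next free integer code to each new symbolic value.
            book = codebooks.setdefault(idx, {})
            vector.append(float(book.setdefault(value, len(book))))
        else:
            vector.append(float(value))
    return vector, label.rstrip(".")
```

Passing the same `codebooks` dictionary to every call keeps the integer codes consistent between training and test records.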

The 3rd International Conference on Innovative Computing Information and Control (ICICIC'08)

Authorized licensed use limited to: National Kaohsiung University of Applied Sciences. Downloaded on November 19, 2008 at 03:29 from IEEE Xplore. Restrictions apply.

Figure 1. Data format of the KDD-Cup-99

3. Principal Component Analysis (PCA)

PCA is a well-known method for data compression and data analysis [2], and it is widely applied to engineering problems. The purpose of applying PCA to attack and intrusion detection is to reduce the dimension of the original dataset by projecting it from its high-dimensional space into a low-dimensional sub-space. The projection is chosen so that the variance of the projected data is as large as possible, i.e., so that the most important information is retained. The principle of PCA can be described as follows:

Assume that there are $M$ training samples, each with $l \times h$ dimensions. Reformat the dataset by combining the $l$ sub-datasets of size $1 \times h$ in each sample, giving $M$ datasets with $1 \times n$ dimensions, where $n = l \times h$. Equation (1) is applied to calculate the mean vector $\mu$ over all training samples, where $X_i$ denotes the $i$-th reformatted training sample.

In equation (2), the mean vector $\mu$ is subtracted from every reformatted training sample, and each result is multiplied by its own transpose. The covariance matrix $C$ is thus generated.
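Under the standard eigenface-style formulation of reference [2], the mean vector and the covariance matrix described here would read as follows. This is a sketch under assumptions: each $X_i$ is treated as a column vector of length $n$, and the $1/M$ normalization of the covariance is assumed (omitting it only rescales the eigenvalues).

```latex
\begin{align}
\mu &= \frac{1}{M}\sum_{i=1}^{M} X_i \tag{1} \\
C   &= \frac{1}{M}\sum_{i=1}^{M} (X_i - \mu)(X_i - \mu)^{T} \tag{2}
\end{align}
```

With the (assumed) definition $A = \frac{1}{\sqrt{M}}\,[\,X_1 - \mu, \dots, X_M - \mu\,]$, the same matrix can be written compactly as $C = A A^{T}$.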

After calculating $C$, the eigenvalues $\lambda$ and the eigenvectors $U$ can be evaluated. After sorting $\lambda$ and $U$ in descending order of eigenvalue, the eigenvectors corresponding to the $m$ largest eigenvalues are selected as the principal components. Let $U_r$ denote these principal components, where $r \in [1, m]$. These principal components construct a sub-space.

Via equation (3), the original dataset can be projected into the sub-space. Hence, the dimension of the dataset can be reduced from $n$ to $m$.

$$y_j = U^{T} X_j \tag{3}$$
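The PCA steps above can be sketched in Python with NumPy. This is a minimal illustration, not the implementation used in the experiments: the function names `pca_fit` and `pca_project` are hypothetical, samples are stored as rows, and the covariance is normalized by $M$ as an assumption.

```python
import numpy as np


def pca_fit(X, m):
    """Fit PCA on X of shape (M, n), one reformatted sample per row.

    Returns the mean vector (n,), the basis U (n, m) whose columns are the
    eigenvectors of the m largest eigenvalues, and those eigenvalues (m,).
    """
    mu = X.mean(axis=0)                    # mean vector over all samples
    A = X - mu                             # subtract the mean from each sample
    C = A.T @ A / X.shape[0]               # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1][:m]  # indices of the m largest eigenvalues
    return mu, eigvecs[:, order], eigvals[order]


def pca_project(X, U):
    """Project each row x_j of X into the sub-space: y_j = U^T x_j."""
    # Centering the rows with the training mean first is a common variant.
    return X @ U
```

`np.linalg.eigh` is used instead of `np.linalg.eig` because $C$ is symmetric, which guarantees real eigenvalues and orthonormal eigenvectors.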

4. Independent Component Analysis (ICA)

ICA was proposed for solving the Blind Source Separation (BSS) problem [3]. It rests on a simple assumption: a set of bases can be used to express a series of random variables such that the resulting components are statistically independent, or as nearly independent as possible. In our work, we use these independent bases to transform the samples, so that the outcomes are nearly, or even fully, independent. This helps us analyze and classify the data.

In equation (4), ICA is applied to the $m$-dimensional data $X$ to obtain the outcome $U$. Equation (5) is applied to calculate the separating matrix $W_I$. $W_I$ is calculated iteratively, i.e., by repeating equations (4) through (7). The first step is to calculate $W_Z$ by equation (6). The second step is to calculate $W$ for the current iteration by equation (7). The final step is to calculate $U$. These steps are repeated until the desired number of iterations is reached, at which point the final $W_I$ is obtained.
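The iteration described above can be sketched in Python as follows. The sketch assumes the infomax formulation used in reference [3], in which $W_Z$ is a whitening matrix equal to twice the inverse square root of the data covariance, $W$ is updated by a natural-gradient step with a logistic nonlinearity, and $W_I = W W_Z$. The function name `infomax_ica`, the learning rate, and the iteration count are hypothetical choices, and the data are assumed to have at least two dimensions and a full-rank covariance.

```python
import numpy as np


def infomax_ica(X, n_iter=200, lr=0.01):
    """Infomax ICA sketch for X of shape (m, N), observations in columns.

    Returns the separating matrix W_I (m, m) and the outcome U = W_I @ Xc,
    where Xc is the mean-centered data.
    """
    Xc = X - X.mean(axis=1, keepdims=True)

    # Whitening matrix W_Z = 2 * C^(-1/2), via the eigendecomposition of C.
    C = np.cov(Xc)
    vals, vecs = np.linalg.eigh(C)
    W_Z = 2.0 * vecs @ np.diag(vals ** -0.5) @ vecs.T
    Xw = W_Z @ Xc

    m, N = Xw.shape
    W = np.eye(m)
    for _ in range(n_iter):
        U = W @ Xw                         # current outcome
        Y = 1.0 / (1.0 + np.exp(-U))       # logistic nonlinearity
        # Natural-gradient infomax step, averaged over the N observations.
        W += lr * (np.eye(m) + (1.0 - 2.0 * Y) @ U.T / N) @ W

    W_I = W @ W_Z                          # full separating matrix
    return W_I, W_I @ Xc
```

A fixed iteration count mirrors the stopping rule in the text; a convergence test on the change in $W$ would be a reasonable alternative.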

In the experiment, we combine the scheme of the ICA with the PCA in order to gain higher performance on classification. The details are described as follows.

5. Experiment

Since the KDD-Cup-99 provides both known and unknown connection records, we separate the known records into 5 classes, namely DOS-attack connections, R2L-attack connections, U2R-attack connections, Probing-attack connections, and normal connections. Each class is then divided into 2 parts: one is used as the training samples, and the other is used as the test samples.

First, we apply the PCA to extract the principal components of the training samples. We then apply the ICA to the principal components obtained from the PCA to construct the sub-spaces. The last step is to project both the training samples and the test samples into the constructed sub-spaces. The Euclidean distance is used to evaluate the distances between an incoming test sample and the training samples, and the class closest to the test sample is reported. Figure 2 shows the flowchart of the experiment.

Figure 2. The flowchart of the experiment

During training, the known connection records must be analyzed. As mentioned above, the attack records in KDD-Cup-99 are classified into 4 classes, namely DOS, R2L, U2R, and Probing; we add a 5th class containing the normal connection records in order to identify regular connections. To apply the ICA transform, the input data must be rearranged into a one-dimensional vector. By applying equation (1), we obtain the mean vector $\mu$. According to equation (2), the covariance matrix $C$ is calculated. However, the outcome of the PCA has the same dimension as the input data, which is not suitable for real-time network detection. To solve this problem, we apply the SVD to reduce the dimension [4]. Hence, equation (2) can be rewritten as equation (8), and then the covariance matrix $C$, the eigenvalues, and the eigenvectors can be obtained by equation (4).

$$C = A A^{T} \tag{8}$$

Since we have applied the SVD to the outcome of the PCA, we can directly take $n$ outcomes of the SVD to construct the sub-spaces. This process is represented in equation (9), where $U_i$ is the $i$-th eigenvector of the covariance matrix $C$, and $\lambda_i$ denotes the corresponding eigenvalue. In equation (10), the original training data $X$ is projected into the sub-spaces. To bring in the ICA scheme, we apply the eigenvalue $W$, which comes from the outcome of the PCA, to equations (4) through (7) to create a new sub-space.

Based on the independent sub-space from the ICA, equation (10) can be rewritten as equation (10.1), where $P$ denotes the independent basis calculated by the ICA. Hence, the training samples can be projected into the new sub-space created by the ICA via equation (10.1). Thus, the training phase is accomplished. To examine the outcome of the proposed method, we project the test samples into the sub-space by equation (10.1), and then apply equation (11) to calculate the Euclidean distances between each test sample and all the training samples.
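The SVD-based reduction used in the training phase can be illustrated with the following Python sketch. It is an assumption-laden stand-in rather than the exact procedure of this paper: it takes $A$ to be the matrix of mean-centered samples scaled by $1/\sqrt{M}$, and it uses the thin SVD of $A$ so that the $n \times n$ matrix $C = A A^{T}$ never has to be formed. The function name `pca_via_svd` is hypothetical.

```python
import numpy as np


def pca_via_svd(X, k):
    """Leading eigenvectors of C = A A^T from the thin SVD of A.

    X has shape (M, n), one sample per row. Returns the mean vector (n,),
    the k leading eigenvectors as columns (n, k), and their eigenvalues (k,).
    """
    M = X.shape[0]
    mu = X.mean(axis=0)
    A = (X - mu).T / np.sqrt(M)            # n x M matrix of centered samples
    # Left singular vectors of A are eigenvectors of A A^T, and the squared
    # singular values are the matching eigenvalues (returned in descending order).
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return mu, U[:, :k], s[:k] ** 2
```

When the number of samples $M$ is much smaller than the dimension $n$, the thin SVD works with at most $M$ singular vectors, which is where the saving in computation comes from.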

If the shortest distance is between the test sample and a regular connection in the sub-space, the test sample is determined to be a normal connection. On the other hand, if the shortest distance is between the test sample and any kind of attack record in the sub-space, the test sample is determined to be the corresponding attack, and an alarm is issued.
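The decision rule can be sketched as a nearest-neighbor search under the Euclidean distance. The Python below is a minimal illustration; the function names `nearest_class` and `should_alarm` are hypothetical, and the vectors are assumed to be samples already projected into the sub-space.

```python
import numpy as np


def nearest_class(test_vec, train_vecs, train_labels):
    """Return the label of the closest training sample and its distance.

    test_vec has shape (d,), train_vecs has shape (T, d), and train_labels
    is a sequence of T class names such as "normal" or "DOS".
    """
    dists = np.linalg.norm(train_vecs - test_vec, axis=1)  # Euclidean distances
    best = int(np.argmin(dists))
    return train_labels[best], float(dists[best])


def should_alarm(predicted_class):
    """An alarm is issued for every class except the normal connection."""
    return predicted_class != "normal"
```

For example, with projected training samples at `[0, 0]` ("normal"), `[5, 5]` ("DOS"), and `[10, 0]` ("Probing"), a test sample at `[4, 4]` is closest to the "DOS" sample, so an alarm would be raised.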

6. Experimental Results

In the experiment, we divide the connection records from the KDD-Cup-99 into 5 classes, and we divide each class into training samples and test samples, so that there are 26 training samples and 26 test samples in every class. We compare the correct recognition ratio and the processing time of the proposed method with those of the PCA.

Figure 3 shows the correct recognition ratio of the PCA and the PCA-ICA for different numbers of eigenvectors. According to the experimental results, both the PCA and the PCA-ICA reach a correct ratio of 96.92% when the number of selected eigenvectors is larger than 5.

In Figure 4, we present the correct recognition ratio of the PCA and the PCA-ICA for different numbers of training samples. For all numbers of training samples, the number of test samples is fixed at 26.

Figure 3. The correct ratio on recognition to different numbers of eigenvectors

Figure 4. The correct ratio on recognition to different numbers of training samples

Table 2. The average correct ratio and the average computing time of 1 to 26 training samples

Method    Average correct rate   Average computing time
PCA       73.75 %                17.58 s
PCA+ICA   89.61 %                29.18 s

According to Figure 4 and Table 2, when fewer training samples are available, samples are more often placed into the wrong class, which causes a large drop in the correct ratio of the PCA method. In the same situation, however, the PCA-ICA method still maintains a higher correct recognition ratio. Although the computing time of the PCA-ICA method is longer, it remains acceptable for attack and intrusion detection on computer networks, and the average correct recognition ratio improves by about 16 percentage points (from 73.75% to 89.61%).

7. Conclusion

In this paper, we propose a PCA-ICA method for detecting attacks and intrusions on computer networks, and the proposed method achieves a higher correct recognition ratio. By applying the SVD within the PCA, the dimension of the original data is greatly decreased, and the computation time is reduced. We use KDD-Cup-99 to simulate attacks and intrusions on the computer network. According to the experimental results, the correct recognition ratio is improved by about 5% to 16%.

8. References

[1] KDD Cup 1999 Dataset, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

[2] M. Turk and A. Pentland, "Face recognition using eigenfaces", Proceedings of IEEE CVPR, pp. 586-591, Hawaii, June 1991.

[3] Marian Stewart Bartlett, Javier R. Movellan, and Terrence J. Sejnowski, "Face recognition by independent component analysis", IEEE Transactions on Neural Networks, Vol. 13, No. 6, November 2002.

[4] Jian Yang, Jing-yu Yang, and Alejandro F. Frangi, "Combined Fisherfaces framework", Image and Vision Computing, Vol. 21, pp. 1037-1044, 2003.

[5] Huang Jun, Guang-Ping, and Xiao-Lu Lin, "Intrusion detection based on principal component analysis", Journal of China Jiliang University, Vol. 18, No. 3, September 2007.
