Integrate 3-DES with web server - System Architecture

Chapter 3 System Architecture

3.3. Integrate 3-DES with web server

To analyze the efficiency of 3-DES encryption and decryption, we built a website that offers 3-DES encryption and decryption as a service. The website is intended for internal security: all data stored on it is encrypted. When a user downloads data from the website, the site immediately decrypts the stored ciphertext and sends the result to the user. Conversely, when a user uploads a file, the website encrypts the file first and then stores the ciphertext on the web server.

Fig. 3-5 shows how this data server works. After a user uploads a file, the website refreshes the download page and lists a new hyperlink to the uploaded file. We built this environment to test the speedup of 3-DES when it is integrated with a web service. Traditional 3-DES encryption and decryption is too slow to be tolerable for this purpose. Once 3-DES is implemented in CUDA, encryption and decryption become fast enough for general use.

Moreover, users are not aware that the web server performs 3-DES decryption when they press the download button.

Fig. 3-5 New Web service architecture

Unfortunately, we ran into many problems while implementing the whole environment.

The most important problem concerns security. In theory, when a user tries to download a file from this server, the CPU of the web server should signal the DMA engine to move data from host memory to device memory. In practice this did not work. We eventually found that security measures exist to prevent unsuitable access to the host computer. In other words, the host computer would not accept a GPU access request sent from the Apache server. We discuss how we solved this problem in the next chapter.

Chapter 4 Experimental Results and Analysis

In this chapter we discuss our implementation methods in detail. Based on the framework described in Chapter 3, we explain how we constructed each component. Since our architecture has too many components to cover in full, we introduce only the important or unusual design decisions in our implementation and some exceptional problems programmers may face.

4.1. 3-DES implementation in GPU

4.1.1. Move data from CPU to GPU

The first step of DES encryption and decryption is moving the plaintext or ciphertext from host to device. Here host means the CPU and device means the graphics card. First, the host instructs the device to allocate a buffer of the same size in GPU memory. Second, the host sends the data from host memory to device memory by DMA. Finally, once the GPU has received the data, the host can launch the computation on the GPU shaders.

Once the GPU has received the data and the instruction from the CPU, we need to map the data to threads on the shaders. In DES, a plaintext block is 64 bits. We store each 64-bit block in eight elements of an unsigned char array, and all 64 * N bits are stored in one unsigned char array named block. First, the GPU reads the elements of block and distributes them among the threads. We use J = (blockIdx.x * blockDim.x + threadIdx.x) to identify each thread: J is the absolute thread ID, blockIdx is the index of the current block within the grid, blockDim is the number of threads in a block, and threadIdx is the relative thread ID within a block. Second, we divide the plaintext into two parts named left and right. Each of them contains 32 bits, so we combine four unsigned char elements into left and four into right, as shown in Fig. 4-1.

Fig. 4-1 3-DES Pseudo code with moving data details

Algorithm __global__ DES_EN (block, key); // same as 3-DES_DE
Input: block (64 * N bits of plaintext)
       key (64 * 3 bits, three keys)
Output: block (64 * N bits of ciphertext)
// N is the number of threads used on the device.
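The index arithmetic and the left/right split described above can be checked on the host with a small C sketch. It is only an illustration under stated assumptions: the helper names (absolute_thread_id, pack_half, load_block) are hypothetical, each thread is assumed to own the eight bytes starting at offset 8 * J, and the most-significant-byte-first packing order is an assumption rather than something the text specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Absolute thread ID, mirroring J = blockIdx.x * blockDim.x + threadIdx.x.
   On the GPU these values come from CUDA built-ins; here they are plain
   parameters so the arithmetic can be checked on the host. */
static unsigned absolute_thread_id(unsigned block_idx, unsigned block_dim,
                                   unsigned thread_idx)
{
    assert(thread_idx < block_dim);
    return block_idx * block_dim + thread_idx;
}

/* Combine four bytes into one 32-bit half, most significant byte first.
   The byte order is an assumption of this sketch. */
static uint32_t pack_half(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Load the 64-bit block assumed to belong to thread j from the flat array. */
static void load_block(const unsigned char *block, unsigned j,
                       uint32_t *left, uint32_t *right)
{
    const unsigned char *p = block + (size_t)j * 8;  /* 8 bytes per thread */
    *left  = pack_half(p);
    *right = pack_half(p + 4);
}
```

With 128 threads per block, for example, thread 5 of block 2 has absolute ID 2 * 128 + 5 = 261 and would read bytes 2088 to 2095 of block under this mapping.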

4.1.2. Initial permutation

The permutations employed by the cipher are described using bit numbers. The standards documents number the bits from left to right, starting at 1. When displayed as a matrix, row-major order is used. This is best illustrated by the identity transform shown in Table 4-1.

Table 4-1 Bit number

The Pentium processor reads its memory in the opposite byte (row) order, giving the bit number matrix shown in Table 4-2. Here we have divided the matrix into upper and lower halves. On the Pentium we need one 32-bit register to store each half, so swapping the halves amounts to swapping the roles of those two registers.

Table 4-2 Input / output bit ordering on Pentium

The initial permutation of DES has a very simple structure and can be performed as a series of bit block swaps, known as Hoey's Initial Permutation Algorithm. This algorithm is shown in Table 4-3. Note that there is also an implicit swap of the upper and lower halves at the beginning. To optimize the algorithm for the Pentium, Richard Outerbridge's C code implementing the algorithm was analyzed, providing a set of bit exchanges between the two halves of the input block. Though not the same implementation, the idea for how to code each swap came from Eric Young's libdes. IP^-1 is applied by performing the swaps of IP in reverse order.

Table 4-3 Initial permutation

In our CUDA implementation, the initial permutation is computed on the GPU device.
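The idea of a permutation built from bit block swaps, and of inverting it by running the swaps in reverse order, can be sketched in C as follows. This is a minimal sketch, not the code used in this work: the function names are hypothetical, and the shift and mask values follow the pattern commonly seen in libdes-style code. They are given for illustration only and are not claimed to reproduce Table 4-3 bit for bit.

```c
#include <assert.h>
#include <stdint.h>

/* Exchange the bits selected by mask between (*a >> shift) and *b.
   Applying the same swap twice restores the original values. */
static void swap_bits(uint32_t *a, uint32_t *b, int shift, uint32_t mask)
{
    uint32_t t = ((*a >> shift) ^ *b) & mask;
    *b ^= t;
    *a ^= t << shift;
}

/* Forward permutation: five block swaps between the two halves. */
static void permute_forward(uint32_t *left, uint32_t *right)
{
    swap_bits(right, left, 4, 0x0f0f0f0fu);
    swap_bits(left, right, 16, 0x0000ffffu);
    swap_bits(right, left, 2, 0x33333333u);
    swap_bits(left, right, 8, 0x00ff00ffu);
    swap_bits(right, left, 1, 0x55555555u);
}

/* Inverse permutation: the same swaps applied in reverse order. */
static void permute_inverse(uint32_t *left, uint32_t *right)
{
    swap_bits(right, left, 1, 0x55555555u);
    swap_bits(left, right, 8, 0x00ff00ffu);
    swap_bits(right, left, 2, 0x33333333u);
    swap_bits(left, right, 16, 0x0000ffffu);
    swap_bits(right, left, 4, 0x0f0f0f0fu);
}
```

Because every swap undoes itself, the inverse needs no separate table: it reuses the same shift and mask pairs, which keeps the code small on the device.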

4.1.3. Round function

DES applies the round function 16 times to complete an encryption or decryption.

Generally, one would write a DES function and call it repeatedly to complete 3-DES encryption and decryption. In practice, this wastes a lot of time moving data between CPU memory and GPU memory. To remove the unnecessary memory transfers, we run all 64 iterations of the round function back to back on the device. The program then needs only two memory transfers, one in and one out, to finish one 3-DES encryption or decryption.

The most important concern when rewriting the round function is on-chip memory access. We store our spbox table in constant memory as unsigned long values. Constant memory is usually faster than global memory and does not occupy shared memory. After assigning the correct values to spbox, we perform the round function 64 times, as shown in Fig. 4-2.

Fig. 4-2 Round function details

while (i < 64)
{
    // phase i
    work = ((right >> 4) | (right << 28)) ^ ks[i];
    left ^= Spbox[6][work & 0x3f];
    left ^= Spbox[4][(work >> 8) & 0x3f];
    left ^= Spbox[2][(work >> 16) & 0x3f];
    left ^= Spbox[0][(work >> 24) & 0x3f];
    work = right ^ ks[i+1];
    left ^= Spbox[7][work & 0x3f];
    left ^= Spbox[5][(work >> 8) & 0x3f];
    left ^= Spbox[3][(work >> 16) & 0x3f];
    left ^= Spbox[1][(work >> 24) & 0x3f];
    // the update of i and the exchange of left and right are not shown in the figure
}

4.2. 3-DES Encode implementation

Since we want a general 3-DES encryption program, we need to consider all kinds of file types. For example, document files such as those of PowerPoint or WordPad use ‘\0’ as their end symbol, so we only need to examine each input word to find the end. In streaming data such as music or movie files, however, ‘\0’ is not the end symbol. We therefore need to adapt our program to fit all file types.

Moreover, we read 64 * n bits of plaintext at a time rather than a single 64-bit plaintext. This can mislead us about where the end symbol is. For instance, if we read 1000 words of plaintext at a time and the end symbol ‘\0’ is the 1250^th word of the plaintext, the input loop must run twice to read all the plaintext. After the second iteration, however, we have read up to the 2000^th word of the plaintext. Because the 2000^th word is an empty word that the program cannot recognize, the loop would not stop.

For the two reasons described above, we present the following pseudo code to explain how we solve these problems. The pseudo code of 3-DES encryption is shown in Fig. 4-3.

Fig. 4-3 3-DES Encryption Pseudo code

Algorithm 3-DES Encryption (plaintext, ciphertext);
Input: plaintext (64 * N bits of plaintext).
Output: ciphertext (64 * N bits of ciphertext).
// N is the number of threads used on the device.
begin
    declare integer plaintext[64 * N], temptext[64 * N], ciphertext[64 * N];
    declare integer key1[64], key2[64], key3[64], key[192];
    declare integer plaintextsize = 64 * N;
    while (not at the end of file) {
        input_EN (plaintext, temptext);        // put plaintext into temp buffer
        key = key_schedule (key1, key2, key3); // generate subkeys
        copy temptext from host memory to device memory
        __global__3DES_encryption (temptext, key);
        copy temptext from device memory to host memory
        swap (ciphertext, temptext);
        output_EN (ciphertext);
    }
end

As the figure shows, only the function named __global__3DES_encryption() is executed on the GPU, because it is the part that benefits from parallelism. Besides the __global__3DES_encryption function, only the key_schedule function could be parallelized. However, key_schedule performs so little computation that it is not worth executing in CUDA: its memory access time is larger than its computation time.

Another point worth mentioning is the input function. Since we want robust code that supports all file types, we need to devise a new way to detect the end of the input. Our method is to define a new end symbol for all file types. Assuming we read 1000 words (8000 bits) of plaintext at a time, the design details are as follows:

I. Design a new end symbol

The end symbol is a unique word that is recognized only by the program, so we cannot choose a common word as the end symbol. We choose the word “/nend/ab” as our end symbol, which distinguishes it from other words.

II. Add end symbol to plaintext every 1000 words

The program checks every 1000 words whether the plaintext has reached its end. We use fread() to determine whether all the plaintext has been read. If fread() reports a full read of 1000 words (8000 bits), the file may not have reached its end. In that case we append the end symbol at this point and continue searching for the end of the plaintext. We therefore output exactly 1008 words (1000 words + the 8-word end symbol) at a time.

III. Find out the end of the plaintext

When fread() reports fewer than 1000 words (8000 bits), we have found the end of the plaintext. Since we process a fixed-size chunk each time, the short chunk must be padded: we append the 8-word end symbol “/nend/ab” after the data and then fill the remaining space with “\0” to reach 1008 words.

IV. The stop condition

After the end of the plaintext has been found, the next call to fread() returns 0. A return value of 0 means nothing is left in the plaintext, so the program breaks out of the loop and stops 3-DES encryption.
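Steps I to IV can be sketched in C as a single framing function that works on a buffer already filled by fread(). This is a minimal sketch under stated assumptions: the function name frame_chunk and the macro names are hypothetical, and a word is taken to be one unsigned char, as the 1000 words = 8000 bits figure in the text implies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_DATA   1000                      /* words read per round */
#define MARKER       "/nend/ab"                /* end symbol from step I */
#define MARKER_LEN   8
#define CHUNK_TOTAL  (CHUNK_DATA + MARKER_LEN) /* 1008 words per output chunk */

/* Build one 1008-word chunk from the `len` words just returned by fread().
   Returns CHUNK_TOTAL, or 0 when len == 0 (the stop condition of step IV). */
static size_t frame_chunk(const unsigned char *data, size_t len,
                          unsigned char out[CHUNK_TOTAL])
{
    assert(len <= CHUNK_DATA);
    if (len == 0)
        return 0;                              /* nothing left to encrypt */
    memcpy(out, data, len);                    /* the plaintext itself */
    memcpy(out + len, MARKER, MARKER_LEN);     /* steps II and III: end symbol */
    memset(out + len + MARKER_LEN, 0, CHUNK_DATA - len); /* step III: '\0' fill */
    return CHUNK_TOTAL;
}
```

Every chunk produced this way is 1008 words long, which is exactly 126 blocks of 64 bits, so it can be handed to the device without further padding.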

Fig. 4-4 Input Details of 3-DES Encryption Pseudo Code

Algorithm 3-DES input_EN (plaintext, *temptext);
Input: plaintext (64 * N bits of plaintext).
Output: temptext (64 * N bits of plaintext with the end symbol added).
// N is the number of threads used on the device.
begin
    int trueinputsize = inputsize - 8;
    int len = fread(cp, sizeof(unsigned char), trueinputsize, fileinput);
    if (len equals trueinputsize) {
        // add the end symbol after the data
    }
    if (len is less than trueinputsize and len does not equal 0) {
        // add the end symbol after the data and fill up to 1008 words with ‘\0’
    }
end

4.3. 3-DES Decode implementation

The 3-DES decode implementation is similar to the 3-DES encode implementation. However, 3-DES decryption needs more decision operations, which means it needs more CPU time to determine when to stop. Unlike the encode implementation, the input of 3-DES decryption consists entirely of ciphertext. We need to decode the ciphertext before we can analyze it.

Fig. 4-5 3-DES Decryption Pseudo code

Algorithm 3-DES Decryption (ciphertext, plaintext);
Input: ciphertext (64 * N bits of ciphertext).
Output: plaintext (64 * N bits of plaintext).
// N is the number of threads used on the device.
begin
    declare integer plaintext[64 * N], temptext[64 * N], ciphertext[64 * N];
    declare integer key1[64], key2[64], key3[64], key[192];
    declare integer ciphertextsize = 64 * N;
    while (not at the end of file) {
        Input_DE (ciphertext, temptext, ciphertextsize); // put ciphertext into temp buffer
        key = key_schedule (key3, key2, key1);           // generate subkeys
        copy temptext from host memory to device memory
        __global__3DES_encryption (temptext, key);
        copy temptext from device memory to host memory
        swap (plaintext, temptext);
        output_DE (plaintext);
    }
end

The main implementation difference between encryption and decryption is the decision condition. In the 3-DES encryption program, the system can use feof() to check whether the file has reached its end. In 3-DES decryption, however, all inputs are ciphertext. If we simply decoded and wrote out the result, the plaintext would contain many end symbols. The details of our implementation are as follows, and the output function is shown in Fig. 4-6.

I. Input 1008 words and decode

When the program starts, it reads 1008 words of ciphertext. After reading 1008 words, the program sends the ciphertext to the GPU for decoding. This phase repeats until no ciphertext is left to read.

II. Analysis after decoding

After decoding, the program analyzes the last eight words of the plaintext. If these eight words are “/nend/ab”, the input file has not reached its end; the program outputs the result without “/nend/ab” and returns to stage I. If the eight words are not “/nend/ab”, the input file has reached its end. The program then scans the 1008 words to find where the end symbol is. Once it has found the position of the end symbol, it writes the result without “/nend/ab\0\0\0……” to the plaintext. After that, the program breaks out of the loop and finishes.

Fig. 4-6 Output function of 3-DES Decryption Pseudo code

Algorithm 3-DES output_DE (plaintext, *temptext);
Input: ciphertext (64 * N bits of ciphertext).
Output: plaintext (64 * N bits of plaintext).
// N is the number of threads used on the device.
begin
    int countertemp = 0;
    if (nothing was read from the file) return; // reached the end of the file
    if ((temptext[len-8] == '/') && (temptext[len-7] == 'n') &&
        (temptext[len-6] == 'e') && (temptext[len-5] == 'n') &&
        (temptext[len-4] == 'd') && (temptext[len-3] == '/') &&
        (temptext[len-2] == 'a') && (temptext[len-1] == 'b'))
    {
        countertemp = inputsize - 8; // set the size for output
    }
    else
    {
        unsigned char *temp = temptext;
        for (int i = 0; i < inputsize; i++)
        {
            if (the end symbol starts at temp[i]) break; // stop at the end symbol
            countertemp++; // set the size for output
        }
    }
    fwrite(temptext, sizeof(unsigned char), countertemp, Decodeoutput);
    // Decodeoutput is a pointer to the output file.
end
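The decision made by the output function can be written as a small C helper that returns how many words of a decoded chunk should be passed to fwrite(). This is a sketch only: the name payload_length is hypothetical, a word is assumed to be one unsigned char, and the forward scan follows the description in stage II rather than any code shown in the source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MARKER      "/nend/ab"   /* end symbol used by the encoder */
#define MARKER_LEN  8

/* Number of words of a decoded chunk that belong to the original file. */
static size_t payload_length(const unsigned char *chunk, size_t len)
{
    size_t i;

    assert(chunk != NULL);
    if (len < MARKER_LEN)
        return len;              /* too short to contain the end symbol */

    /* Full chunk: the end symbol occupies the last eight words. */
    if (memcmp(chunk + len - MARKER_LEN, MARKER, MARKER_LEN) == 0)
        return len - MARKER_LEN;

    /* Last chunk: scan for the end symbol; the data stops just before it. */
    for (i = 0; i + MARKER_LEN <= len; i++)
        if (memcmp(chunk + i, MARKER, MARKER_LEN) == 0)
            return i;

    return len;                  /* no end symbol found: keep everything */
}
```

One consequence of scanning forward is that a final chunk whose own data happens to contain the bytes “/nend/ab” would be cut at that point, so the choice of end symbol matters.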

4.4. Configuration

4.4.1. Hardware configuration

To support CUDA computing, we have the following hardware configuration:

CPU          Intel® Core™2 Quad Processor Q6600 (2.4GHz, quad-core)
Motherboard  ASUS P5E-VM-DO-BP, Intel® X38 Chipset
RAM          Transcend 2G DDR-800
GPU          NVIDIA 9800GT 512MB (ASUS OEM)
HDD          WD 320G w/ 8MB buffer

Table 4-4 Hardware Configuration

Since we want to compare the performance of the CPU with that of the GPU, the specification of the GPU is listed in detail in Table 4-5.

Table 4-5 NVIDIA 9800GT Hardware Specification

4.4.2. Software Configuration

OS           openSUSE 11.1 (32-bit version)
GPU Driver   Version 97.73
CUDA         Version 2.0 beta

Table 4-6 Software Configuration

4.5. Evaluation and Analysis

The following are the statistics from our experiments. We obtained the data by measuring CPU time in OpenSSL and by running our DES and 3-DES implementations on the graphics card. Besides DES and 3-DES, AES was implemented on CUDA in 2007 by Svetlin A. Manavski [28]. We will discuss the differences between his results and ours.

DES (64 bits)

File size   OpenSSL on CPU   DES on CUDA (our work)
4M          60ms             32ms
8M          108ms            41ms
109.8M      2s 100ms         712ms
697M        14s 313ms        4s 808ms

Table 4-7 Comparison DES in CPU and in GPU

Fig. 4-7 DES performance comparison

Table 4-7 has two parts: the time needed to run DES on the CPU and the time needed to run DES in CUDA. The CPU baseline was measured with OpenSSL. The CUDA implementation clearly performs better. For a 4M-byte plaintext, DES in CUDA takes about half the time of OpenSSL DES on the CPU (32 ms versus 60 ms). For a 697M plaintext, CUDA is about three times faster (4.808 s versus 14.313 s). This means that larger plaintexts, which suit the parallel algorithm better, benefit more.
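The ratios quoted for Table 4-7 can be reproduced with a few lines of C. The sketch below only restates the table: the struct and function names are hypothetical, and the times are the table entries converted to milliseconds.

```c
#include <assert.h>

/* One row of Table 4-7, with both times converted to milliseconds. */
struct timing {
    const char *size;
    double cpu_ms;   /* OpenSSL on CPU */
    double gpu_ms;   /* DES on CUDA */
};

static const struct timing table_4_7[] = {
    { "4M",        60.0,   32.0 },
    { "8M",       108.0,   41.0 },
    { "109.8M",  2100.0,  712.0 },
    { "697M",   14313.0, 4808.0 },
};

/* Speedup of the GPU run over the CPU run. */
static double speedup(double cpu_ms, double gpu_ms)
{
    assert(gpu_ms > 0.0);
    return cpu_ms / gpu_ms;
}
```

Applied to the four rows, speedup() gives roughly 1.9, 2.6, 2.9 and 3.0, which shows the gain growing with the input size.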

Table 4-8 Comparison 3-DES in CPU and in GPU

Fig. 4-8 3-DES performance comparison

In 3-DES, the GPU achieves about five times the performance of the CPU when the file size is close to 4M bytes, and more than six times when the file size is larger than 700M bytes. The new 3-DES on GPU is faster than AES on CPU in all situations. This means that we can perform the same encryption and decryption with 3-DES and obtain similar performance. The relationship between AES on CPU and 3-DES on GPU is as follows.

In the related work by Svetlin A. Manavski on speeding up AES with CUDA, GPUs are compared with a CPU implementation based on the OpenSSL [29] library. Both the internal GPU processing time and the total time are reported; the total time includes the download and read-back operations, that is, copying data from host memory to the GPU device and back. The CUDA-AES [30] implementation on an NVIDIA card is faster than the CPU for every input size, with respect to both the internal GPU time and the total time [31].

A peak throughput rate of 8.28 Gbit/s is achieved with an input size of 8MB. In that case the GPU is 19.60 times faster than the CPU [32]. In our work, the GPU is about six times faster than the CPU when the file size is more than 512M bytes.

Chapter 5 Conclusion and Future works

5.1. Conclusion

Parallel programming is becoming more and more important to programmers, because multi-core design is the trend in both CPU and GPU architecture. However, not all programs are parallel programs, and this paper provides a method to address this problem. Finding the key points at which a program can be parallelized is the most important task for the programmer.

CUDA code is not easy to optimize if the user does not understand the CUDA architecture. This sometimes leads programmers to misuse the graphics memory, and the new CUDA program may then perform worse than the original program on the CPU. The final performance therefore depends on the programmer’s effort.

Although DES and 3-DES are old ciphers, many people still use them for encoding and decoding, because verifying the security of a cipher takes a long time. DES and 3-DES, however, both have trusted security. 3-DES is compute-intensive, so it is usually used for off-line computing. For example, many financial institutions and banks use 3-DES to protect their systems while taking inventory. Each cipher has its advantages and disadvantages; what we have to do is make proper use of the advantages and avoid the disadvantages. It is better to improve a defective cipher than to abandon it rashly.

Our results show that CUDA programming is a cheap way to speed up a compute-intensive program. Rather than implementing a system on chip or on an FPGA for more than one hundred thousand, users need to pay only about five to eight thousand to use CUDA. A program may achieve better performance on an FPGA than on a graphics card, but CUDA is a cheap and easy way for a programmer to learn how to speed up a program or a system with parallel programming.

5.2. Future work

(1) Speed up more ciphers in CUDA programming

Besides DES and 3-DES, there are still many ciphers that we can try to speed up in CUDA. Some ciphers, however, are not easy to implement on a graphics card. For example, it is very hard to get better performance for RSA in CUDA. RSA is an algorithm for public-key encryption. The encryption function of RSA is $C = m^{e} \bmod n$, where C is the ciphertext, e is the public exponent, and m and n are integers with 0 < m < n. This encryption function is clearly very hard to parallelize.

Some methods, such as the squaring algorithm, can improve the situation. However, the performance in CUDA is still too poor compared with the CPU unless programmers design a new parallel algorithm for RSA. Overall, more high-performance ciphers make our lives more convenient, so it is necessary for programmers to try to speed up more ciphers in CUDA.
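To illustrate why the exponentiation resists parallelization, here is a minimal C sketch of the squaring (square-and-multiply) method for computing m^e mod n. The function name mod_pow is hypothetical, and the sketch assumes a modulus below 2^32 so that intermediate products fit in 64 bits; real RSA moduli are far larger and need multi-precision arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Right-to-left square-and-multiply: computes m^e mod n.
   Assumes 0 < n <= UINT32_MAX so that products fit in 64 bits. */
static uint64_t mod_pow(uint64_t m, uint64_t e, uint64_t n)
{
    uint64_t result;

    assert(n != 0 && n <= UINT32_MAX);
    result = 1 % n;
    m %= n;
    while (e > 0) {
        if (e & 1)
            result = (result * m) % n;  /* multiply when the current bit is set */
        m = (m * m) % n;                /* each squaring needs the previous one */
        e >>= 1;
    }
    return result;
}
```

Each loop iteration depends on the value of m produced by the previous iteration, so the steps for a single message form a sequential chain, unlike the independent 64-bit blocks of DES.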
