
Chapter 1 Introduction

1.2. Motivation

The mainstream ciphers have long been DES, 3-DES and AES [3]. The weakness of DES's 64-bit key, which allegedly allows a specially built attacker to decrypt a message within 24 hours (though this has never been formally demonstrated), raised awareness of cipher security. By using longer keys or more sophisticated computing methods, we can better guarantee the safety of encrypted data.

AES owes its popularity to its fast computation speed, strong security, and easy parallelization; the last property is especially convenient for implementers.

The characteristic of 3-DES, namely roughly three times the computing time of DES, inevitably makes it uncompetitive in speed, even though its security is comparably good to that of AES [4]. Most people therefore turn to other ciphers that offer a certain level of both speed and security. However, inventing a new cipher takes a very long time.

If we focus on reducing 3-DES's heavy consumption of CPU resources, we can not only speed up the underlying DES computation but also provide a more comfortable user experience.


Chapter 2

Background and Related work

2.1. GPU (Graphics Processing Unit) introduction

Nowadays the GPU is no longer just transform-and-lighting (T&L) and rendering hardware; it is also general-purpose computation hardware. This is the basic concept of general-purpose computation on graphics hardware, also called GPGPU [5]. It means that much of the computation a CPU performs can also run on GPU hardware. In fact, some kinds of computation, such as floating-point operations, achieve better performance on a GPU than on a CPU. The comparison of floating-point computation power between GPU and CPU is depicted in Fig. 2-1. In 2003 the GPU and CPU delivered roughly the same floating-point performance, but by 2005 the GPU delivered about twice the performance of the CPU, and today a GPU can deliver three or four times the performance of a CPU.

Fig. 2-1 Floating-Point Operations per Second for the CPU and GPU


Besides floating-point computation, parallel processing is easier and more effective on GPU hardware. A mainstream CPU has two or four cores, but a mainstream GPU card has about 128 cores or even more. If we can find a way to divide a computation-intensive problem into many parallel threads, it may achieve better performance on the GPU hardware. However, the mapping has many constraints and is not straightforward. We may need to design special data structures and modify the algorithm in the way we do graphics rendering. There are now two famous general-purpose GPU computing architectures: CUDA by NVIDIA and OpenCL, backed by AMD/ATI.

2.2. CUDA (Compute Unified Device Architecture)

CUDA (Compute Unified Device Architecture) is the architecture that unifies the general computation model on NVIDIA graphics card devices [6]. Based on the traditional C/C++ programming language, CUDA adds some extension syntax and auxiliary libraries. Programmers can write CUDA code almost as easily as ordinary C++ code [7].

Since we use CUDA to speed up our 3-DES encryption and decryption, we will discuss the advantages of the CUDA architecture and enumerate the main differences between CUDA and OpenCL [8].

The objectives of CUDA programming can be summarized in the following parts:

• Easy to write programs

CUDA programming is almost as easy as C++ programming. The main difference between CUDA and C++ programming is how you optimize your program: a programmer needs to know many hardware details of the graphics card when optimizing a CUDA program.

Fig. 2-2 CUDA platform for parallel processing on Nvidia GPUs

• Fits parallel processing

Unlike C++ programming for sequential processing, CUDA programming is suited to parallel processing. For instance, some loops need to be executed hundreds or thousands of times in sequential processing, but in CUDA programming we can easily break such a loop into hundreds or thousands of threads, as sketched below. With parallel processing, we can sometimes get five to ten times the performance of sequential processing [9].
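As a minimal sketch of this idea (the kernel name scale and the launch sizes are our illustrative assumptions, not code from this thesis), a loop that doubles every array element can be rewritten so that each iteration becomes one GPU thread:

__global__ void scale(const int *in, int *out, int n)
{
    // One thread handles one former loop iteration.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard the last, partially filled block
        out[i] = 2 * in[i];
}

// Launch with enough 256-thread blocks to cover all n iterations:
// scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);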


Fig. 2-3 CUDA Architecture from CUDA ZONE

• Scattered write capability

Scattered write means that CUDA code can write to arbitrary addresses in GPU memory. In the traditional GPU pipeline, programmers cannot choose an arbitrary memory address to write to. With the scattered write capability, it is much easier for us to build a parallel algorithm, just as we would build an algorithm in a C program [10].
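As a small illustrative sketch (the kernel and its arguments are ours, not from the cited work), scattered write lets each thread store its result at an arbitrary, data-dependent address:

__global__ void scatter(const int *val, const int *idx, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[idx[i]] = val[i];   // destination chosen at run time, per thread
}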

• On-chip shared memory

One notable difference between NVIDIA and AMD/ATI is that NVIDIA GPUs have shared memories on chip [10]. With shared memory, we can access data directly without touching the global memory on the graphics card. Since it takes four hundred to six hundred cycles to access global memory once, memory access is one of the key points in optimizing parallel-processing programs [11].
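A minimal sketch of the technique (the names and the neighbor-sum computation are our illustrative assumptions): each block stages a tile of global memory in shared memory once, synchronizes, and then reuses the data from the fast on-chip storage:

__global__ void sum_neighbors(const float *in, float *out, int n)
{
    __shared__ float tile[256];                 // on-chip, one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // assumes blockDim.x == 256
    if (i < n)
        tile[threadIdx.x] = in[i];              // one slow global read per thread
    __syncthreads();                            // wait until the tile is filled
    if (i < n && threadIdx.x > 0)               // interior elements of the block only
        out[i] = tile[threadIdx.x] + tile[threadIdx.x - 1];  // fast on-chip reads
}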


Fig. 2-4 CUDA Memory Architecture from CUDA ZONE

2.3. CUDA memory introduction

In fact, there are six different types of memory that a programmer can access when writing a CUDA program.

Registers are the fastest of all these storage types. Most thread-local variables, including arrays in favorable cases, are placed in registers automatically by the compiler.

Under some conditions, however, the compiler demotes data to slower local memory [12]: when too many variables exhaust the register space permitted by the compiler (controlled with --maxrregcount N, with N = 128 assumed here), or when an array is accessed through dynamically computed indexes, which requires the array to be laid out in addressable memory. In the following we discuss each kind of memory on the graphics card.

The scope of shared memory is a block: it can only be operated on within the same block. Its limitations are its small size and the need to synchronize threads to avoid unexpected data errors. Its speed, second only to registers, makes it the focal point of optimization. Besides, we cannot initialize it anywhere except inside kernel execution, nor access it from the host machine; at the current stage, CUDA only provides an API to set its size.

Local memory is the inevitable outcome when the compiler automatically moves data out to DRAM, similar to page swapping in an operating system. It hurts efficiency, and the effect is hard to control; therefore, when optimizing a program, local-memory usage has to be tracked with --ptxas-options to avoid it. Sometimes, to raise blockDim within a block, we need --maxrregcount N to limit the maximum number of registers usable by each thread [13]; we then have to compromise between the number of variables and the register budget.
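A hedged sketch of the dynamic-index condition above (the kernel is our illustration; compiling it with nvcc --ptxas-options=-v reports the local-memory usage): a small per-thread array stays in registers while its indexes are compile-time constants, but one dynamic index forces it into local memory:

__global__ void spill_demo(const int *sel, int *out, int n)
{
    int t[8];                        // small per-thread array
    for (int k = 0; k < 8; k++)      // constant trip count: the compiler can
        t[k] = k * threadIdx.x;      // unroll this and keep t[] in registers
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = t[sel[i] & 7];      // dynamic index: t[] must be addressable,
                                     // so the compiler places it in local memory
}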

Besides variables declared with the __device__ qualifier, buffers allocated directly through the cudaMalloc API are also treated as global memory. Global memory resides in DRAM, and all execution units can operate on it, including threads in different blocks. Unlike the texture cache, which uses a different hardware port, global memory access is uncached and therefore requires coalesced reads. (A coalesced read means that the threads of a half-warp read one contiguous block of memory, so the memory controller can combine their accesses.)

Constant memory and the texture cache belong to the same administrative level, but constant memory has a lower miss rate because of its limited size. Despite the loading cost on the very first access, its operating speed is as fast as shared memory. It is visible everywhere, but it can only be written when a module is initialized or through the API on the host machine.
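As a minimal sketch (the table name and size are our assumptions), constant memory is declared at file scope and filled from the host through the cudaMemcpyToSymbol API before any kernel launch:

__constant__ int c_key[64];          // lives in cached constant memory

__global__ void use_key(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = c_key[i & 63];      // cached read; fastest when all threads
}                                    // of a warp read the same entry

// Host side, once before launching:
// cudaMemcpyToSymbol(c_key, host_key, sizeof(int) * 64);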

Because of its cache, texture memory does not require coalesced access. The cache lookup costs a few dozen computing cycles, which makes it slightly slower than a direct, coalesced global memory access, but it is still much faster than global memory reads that are not coalesced. Therefore, texture memory is preferable when the access pattern is too complicated to coalesce.

The texture cache is optimized for locality. CUDA provides not only a 1D cache pattern but also 2D and 3D texture caches, which suit the needs of graphics rendering.
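A hedged sketch using the texture-reference API of that era (since deprecated in modern CUDA; the names tex_in and blur are ours): the buffer is bound on the host, and each thread fetches neighbors through the cache without needing coalesced addresses:

texture<float, 1, cudaReadModeElementType> tex_in;   // file-scope texture reference

__global__ void blur(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = (tex1Dfetch(tex_in, i - 1) +
                  tex1Dfetch(tex_in, i) +
                  tex1Dfetch(tex_in, i + 1)) / 3.0f;  // cached, uncoalesced fetches
}

// Host side: bind a linear device buffer before the launch.
// cudaBindTexture(0, tex_in, d_in, n * sizeof(float));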

Table 2-1 and Table 2-2 below summarize the GPU memory types.

Type              Tag            Lifetime   Scope    Access   Hardware
Register          (none)         block      Thread   R/W      On chip
Local memory      (none)         block      Thread   R/W      DRAM
Shared memory     __shared__     block      Block    R/W      On chip
Texture memory    (none)         program    Global   R/W      DRAM + cache
Constant memory   __constant__   program    Global   R        DRAM + cache
Global memory     __device__     program    Global   R/W      DRAM

Table 2-1 Memory Types in CUDA

Type              Access time (clocks)      Performance factor
Register          Immediate                 none
Local memory      400 ~ 600                 Compiler auto
Shared memory     4                         Memory bank conflict
Texture memory    4; 400 ~ 600 (miss)       Cache miss
Constant memory   4; 400 ~ 600 (miss)       Cache miss
Global memory     400 ~ 600                 Memory bank conflict

Table 2-2 Memory Access time in CUDA


2.4. Related Work

In recent years, more and more people have tried to speed up compute-intensive work with CUDA programming. This is because a graphics card is much cheaper than other accelerator hardware: users just need to buy an NVIDIA graphics card and install its driver. In addition, users may get better performance by supplying a good parallel algorithm in CUDA programming. Moreover, some commercial software already supports CUDA; for example, TMPGEnc's video encoder added CUDA acceleration in 2008. That is why some people say that many cores are the ultimate goal for the CPU [14].

2.4.1. Related Work: AES speed up in CUDA

The CUDA-AES implementation takes full advantage of the best-known AES optimization techniques. It is designed for 32-bit processors. Given the flexibility of the memory model, it can efficiently use the four T look-up tables, each containing 256 entries of 32 bits. The CUDA-AES implementation is based on a combination of the round stages, which allows very fast execution on processors with a word length of 32 bits. In the paper, the author gives a formula that describes his AES round structure [15].

For output column j (with column indexes taken modulo 4), each round computes

e_j = T_0[a_{0,j}] ⊕ T_1[a_{1,j+1}] ⊕ T_2[a_{2,j+2}] ⊕ T_3[a_{3,j+3}] ⊕ K_j

where a is the round input, e is the round output, T[ ] is a look-up table, ⊕ means XOR, and K_j is one column of the stage key. This solution takes only 4 look-ups and 4 XORs per column per round [16].
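A hedged CUDA sketch of that formula (table and function names are ours; the cited implementation's actual code is not reproduced here), showing the 4 look-ups and 4 XORs for one column:

__device__ unsigned int T0[256], T1[256], T2[256], T3[256];  // precomputed T-tables

// Output column j from the four 32-bit input columns a[0..3] and the
// round-key column Kj; column indexes wrap modulo 4.
__device__ unsigned int aes_column(const unsigned int a[4], unsigned int Kj, int j)
{
    return T0[(a[j]           >> 24) & 0xff]
         ^ T1[(a[(j + 1) & 3] >> 16) & 0xff]
         ^ T2[(a[(j + 2) & 3] >>  8) & 0xff]
         ^ T3[ a[(j + 3) & 3]        & 0xff]
         ^ Kj;
}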

It is obvious that the CUDA-AES implementation takes fewer operations per round than DES. That is the reason why AES encryption is as fast as DES encryption. Furthermore, AES is more suitable for CUDA programming because it uses massive matrices: the whole AES system can easily be parallelized by spreading the linear matrix operations across corresponding threads on the GPU hardware [17]. The CUDA-AES implementation in that paper achieves about a 5x speed-up on the G80. The detailed statistics will be analyzed together with our results in chapter four.


Chapter 3 System Architecture

In this chapter, we present an overview of our framework, which includes the DES and 3-DES (Triple DES) architectures. Additionally, we not only speed up encryption and decryption but also combine 3-DES with a web service.

We will describe the whole system in the last subsection.

3.1. DES and 3DES Introduction

3.1.1. DES Architecture

Since DES is the main core of the 3-DES operation, we first analyze the DES architecture to find out which components of DES can be improved by CUDA [14]. DES is a block cipher with a secret key. The principle of DES is to divide the plaintext into many 64-bit blocks. First, each 64-bit block is subjected to an initial permutation IP. Then sixteen rounds of logic operations are performed with sixteen subkeys. Finally, we apply the IP−1 operation to obtain the final ciphertext, so we can send the 64-bit ciphertext to other people under DES encryption [18].
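A minimal C-style sketch of that flow (the helper names ip, round_function, and ip_inverse are our illustrative assumptions, not the thesis's code):

#include <stdint.h>

uint64_t ip(uint64_t block);                          // initial permutation IP
uint64_t ip_inverse(uint64_t block);                  // final permutation IP^-1
uint64_t round_function(uint64_t block, uint64_t k);  // one keyed round

// Encrypt one 64-bit DES block: IP, sixteen keyed rounds, then IP^-1.
uint64_t des_encrypt_block(uint64_t block, const uint64_t subkey[16])
{
    block = ip(block);
    for (int r = 0; r < 16; r++)
        block = round_function(block, subkey[r]);
    return ip_inverse(block);
}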

The traditional DES architecture is shown in Fig. 3-1: all the operations, including input, the initial permutation, the permuted input, subkey generation, the IP−1 operation, and output, are performed by the CPU hardware [19]. Therefore, all these logic operations on the CPU waste a great deal of CPU time and CPU resources. Our job is to move some operations from the CPU to the GPU to speed up the whole system and decrease the waste of CPU resources.


Fig. 3-1 Traditional Enciphering Computation

3.1.2. 3-DES Architecture

The secret key of DES has only 64 bits (including 8 bits for error detection), so people believe that DES may easily be cracked by a modern computer because of its short secret key. 3-DES is one of the methods that mitigates this short-key weakness: with 3-DES, the secret key grows from 64 bits to 192 bits [20]. We therefore do not need to worry about the key-length problem in 3-DES, but the computation time of 3-DES is a big problem for users.

3-DES is based on the DES architecture. In order to create a longer secret key, 3-DES performs DES encryption and decryption three times, yielding a key of triple the length of the DES key [21]. As Fig. 3-2 shows, 3-DES needs about triple the CPU time to finish a job.

Fig. 3-2 DES Architecture

In detail, 3-DES encryption can be divided into three parts. First, we use the first key to perform a DES encryption of the plaintext. Then we use the second key to perform a DES decryption, and finally we perform a DES encryption with the third key to accomplish the 3-DES encryption. The 3-DES decryption is the inverse of the 3-DES encryption.
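Written compactly, with E_K and D_K denoting single-DES encryption and decryption under key K, this is the standard encrypt-decrypt-encrypt form:

C = E_K3( D_K2( E_K1(P) ) ),   P = D_K1( E_K2( D_K3(C) ) )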

It is obvious that 3-DES is a computation-intensive workload; it costs too much CPU time and too many CPU resources [22]. What we need to do is use CUDA programming to speed up the 3-DES computation on the GPU.


3.2. Implementing 3-DES in CUDA

3.2.1. Problems faced in CUDA programming

A trivial idea for implementing the 3-DES system in CUDA programming is to put the encryption and decryption operations on the graphics card while keeping the traditional 3-DES architecture. However, this faces several problems in CUDA programming.

• A parallel algorithm is not easy to build

When we try to rewrite the DES architecture for CUDA programming, one of the main problems is how to parallelize the 3-DES architecture.

Since the DES architecture is a sequentially executed program, as in Fig. 3-2, it is difficult to parallelize the internal flow of DES. AES uses massive matrices to solve this problem, but this restriction of DES is not easy to crack, because there is no good matrix formulation of DES.

To solve this problem, we modify the traditional DES algorithm to fit the CUDA programming model. Traditional DES deals with 64 bits of plaintext at a time. In our idea, we input about 64 * N bits of plaintext at once [23], where N is the number of threads we use on the graphics card. By inputting 64 * N bits of plaintext, we can easily create a new DES architecture with a parallel algorithm, as sketched below. Nevertheless, many problems remain when inputting 64 * N bits of plaintext at once; we will discuss them later.
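A hedged sketch of that parallel layout (the kernel name is ours, and a __device__ version of the per-block routine sketched in Section 3.1.1 is assumed): the sixteen rounds stay sequential inside each thread, while the N independent 64-bit blocks run in parallel:

// Each of the N threads encrypts one 64-bit block of the 64 * N bit input.
__global__ void des_encrypt_blocks(uint64_t *text, const uint64_t *subkeys, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // one block per thread
    if (i < n)
        text[i] = des_encrypt_block(text[i], subkeys);  // rounds stay sequential inside
}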

• How to optimize the system using GPU memory

Another troublesome problem is how to assign the GPU memory well. When we tried to input 64 * N bits of plaintext at once, we found that the system efficiency dropped quickly. The main reason was that we had not assigned the GPU memory well [24].

In our original idea, we tried not to rewrite the whole 3-DES architecture and still wanted to get the benefits of CUDA programming. In fact, this was a terrible mistake: we wasted too much time moving data from host memory (CPU memory) to device memory (GPU memory). Although copying data from host memory to device memory can use Direct Memory Access (DMA) to exploit the device's high performance, we still have to reduce the frequency of data movement [25].

Traditional 3-DES encryption contains three components: two DES encryptions and one DES decryption [26]. When we rewrite the 3-DES encryption and decryption from C code to CUDA code, it is necessary to reduce the frequency of data movement. Suppose, for example, that we first write two kernels named __global__ DES_encryption() and __global__ DES_decryption(), where "__global__" means the function will be executed by the GPU cores. As Fig. 3-3 shows, this direct translation needs far too many data transfers to complete one 3-DES encryption [27]. What we need is to combine the three operations into one, so that the six data-moving operations are reduced to two (see the sketch after the figure).

Algorithm 3-DES_Encryption(plaintext, ciphertext)
Input: plaintext (64 * N bits of plaintext).
Output: ciphertext (64 * N bits of ciphertext).
// N means how many threads we use on the device.
begin
    declare integer plaintext[64 * N], temptext[64 * N], ciphertext[64 * N];
    declare integer key1[64], key2[64], key3[64];
    input(plaintext, temptext);          // put plaintext into a temp buffer
    key_schedule(key1, key2, key3);      // generate the subkeys
    copy temptext from host memory to device memory
    __global__ DES_encryption(temptext, key1);
    copy temptext from device memory to host memory
    copy temptext from host memory to device memory
    __global__ DES_decryption(temptext, key2);
    copy temptext from device memory to host memory
    copy temptext from host memory to device memory
    __global__ DES_encryption(temptext, key3);
    copy temptext from device memory to host memory
    swap(ciphertext, temptext);
    output(ciphertext);
end

Fig. 3-3 Traditional 3-DES algorithm with too much data moving
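As a hedged sketch of the reduced-transfer host flow (device buffer names such as d_text and d_key1 are ours, assumed allocated with cudaMalloc), the data stays resident on the device between the three DES passes, so only the first and last transfers remain:

// Host-side sketch: two transfers instead of six. The three kernels reuse
// the same device buffer, so no intermediate copies are needed.
cudaMemcpy(d_text, temptext, bytes, cudaMemcpyHostToDevice);    // transfer 1
DES_encryption<<<numBlocks, threadsPerBlock>>>(d_text, d_key1);
DES_decryption<<<numBlocks, threadsPerBlock>>>(d_text, d_key2); // data stays on the GPU
DES_encryption<<<numBlocks, threadsPerBlock>>>(d_text, d_key3);
cudaMemcpy(ciphertext, d_text, bytes, cudaMemcpyDeviceToHost);  // transfer 2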

3.2.2. 3-DES Architecture for CUDA programming

Our new 3-DES architecture is specially designed for CUDA programming. In our design, we tried many methods to optimize the architecture; the final result is the flow chart in Fig. 3-4. The implementation details of the flow chart will be discussed in the next chapter; here we just need to know how this architecture works and compare the new 3-DES architecture with the traditional one. In fact, Fig. 3-4 only shows how 3-DES encryption works. The 3-DES decryption flow chart is almost like the encryption flow chart, but some differences still exist; we will also discuss those differences in the next chapter.


Fig. 3-4 New 3-DES Encryption Flowchart


3.3. Integrating 3-DES with a web server

To analyze the efficiency of 3-DES encryption and decryption, we constructed a website that offers a 3-DES encryption and decryption service. The website is intended to ensure internal security: all data stored on the website is encrypted. When users want to download data from the website, it immediately decrypts the encrypted data and sends the result to the user. On the other hand, when a user wants to upload a file to the web server, the website encrypts the file first and then stores the ciphertext on the server.

From Fig. 3-5 it is easy to see how this data server works. After a user uploads a file to the server, the website refreshes the download page and lists a new hyperlink to the newly uploaded file. We built this setup because we want to test the 3-DES speed-up in an environment that integrates 3-DES with a web service. With traditional 3-DES encryption and decryption, the efficiency is too low to tolerate; after we implemented 3-DES in CUDA programming, encryption and decryption became fast enough for general-purpose use.

Moreover, users will not even know that the web server performs a 3-DES decryption when they push the download button.


Fig. 3-5 New Web service architecture

Unfortunately, we ran into many troubles while implementing the whole environment.

The most important trouble is a security problem. When a user tries to download a file from this server, the CPU in the web server should, in our theory, send a signal that triggers DMA to move data from host memory to device memory; but in practice this did not work. Finally, we found out that there are security measures for preventing unsuitable access to the host computer. On the other hand, the host computer would not
