結論 - 以多圖形處理器加速近似字串比對

本研究透過實現 bit-parallelism 並執行於多個 GPU 上，相較於傳統執行於 CPU 上的 bit-parallelism，更能有效的加速近似字串比對。此外，

本研究之 device 端為兩個 NVIDIA^® GeForce^® GTX480 的 GPU，由實驗結果可以發現，多個 GPU 在處理龐大資料量時，可以提供較高的效能。

本研究採取 bit-parallelism 為研究對象，乃基於其速度快與節省記憶體空間，而本實驗所需結果之資訊僅判別是否有近似的字串發生。倘若所需結果之資訊為該位置上採取的動作時，並不適合採用 bit-parallelism，

而較適合採取 dynamic programming 來進行比對。因此，目前仍然有許多研究學者積極想要找出速度快、省記憶體空間並擁有豐富資訊量的演 算法，而透過 CPU 或 GPU 進行加速。

另外，有部分學者也透過叢集運算，結合安裝在不同主機上的 CPU 或 GPU 進行運算。例如亞馬遜（Amazon）提供的雲端服務，可以給予使用者依自己需求向伺服端開啟資源或機器，將單一電腦無法於有效時間內處理完的問題，向亞馬遜要求開啟多台機器分工處理。

未來研究建議透過雲端服務建置一個 GPU 叢集運算，透過連結多個不同主機上的 GPU，分別進行運算。

參考文獻

一、外文部分

Baeza-Yates, R., & Navarro, G. (1999). Faster approximate string matching.

Algorithmica, 23(2), 127-158.

Dinu, L. P., & Ionescu, R. (2011, September). A genetic approximation of closest string via rank distance. In L. Ciortuz (Chair), Artificial Intelligence III. Symposium conducted at the Symbolic and Numeric Algorithms for Scientific Computing Symbolic and Numeric Algorithms for Scientific Computing, Timisoara, Romania.

Dong, Y., & Qi, B. (2010, December). The technology of music retrieval by humming and its application in internet music search system. Paper presented at the 2010 IEEE International Conference on Information Theory and Information Security, Beijing, China.

Hyyrö, H. (2008). Improving the bit-parallel NFA of Baeza-Yates and Navarro for approximate string matching. Information Processing Letters, 108(5), 313-319.

Ivanko, E. (2006, July). Fast approximate search in strings with rearrangements. In Y. Zhong (Chair), Neural Networks. Symposium conducted at the 5th IEEE International Conference on Cognitive Informatics, Beijing, China.

Kaplan, K. M., & Kaplan, J. J. (2004, October). Multiple DNA sequence approximate matching. Paper presented at the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, La Jolla, CA.

Kalarot, R., Morris, J., & Gimel'farb, G. (2010, November). Performance analysis of multi-resolution symmetric dynamic programming stereo on GPU. Paper presented at the 25th International Conference of Image

and Vision Computing New Zealand, Queenstown, New Zealand.

Lyras, D. P., Sgarbas, K. N., & Fakotakis, N. D. (2007, October). Using the Levenshtein edit distance for automatic lemmatization: A case study for modern greek and english. Paper presented at the 19th IEEE International Conference on Tools with Artificial Intelligence, Paris, France.

Li, H., Ni, B., Wong, M., & Leung, K. (2011, June). A fast CUDA implementation of agrep algorithm for approximate nucleotide sequence matching. Paper presented at the IEEE 9th Symposium on Application Specific Processors, San Diego, CA.

Liao, H., Lin, Y., & Medioni, G. (2011, November). Aerial 3D reconstruction with line-constrained dynamic programming. Paper presented at the 13th IEEE International Conference on Computer Vision, Barcelona, Spain.

Liu, W., Schmidt, B., Voss, G., Schroder, A., & Muller-Wittig, W. (2006, June). Bio-sequence database scanning on a GPU. In C. Tseng (Chair), High Performance Computational Biology. Symposium conducted at the 20thIEEE International Parallel & Distributed Processing Symposium, Rhodes Island, Greece.

Liu, T., Huang, X., Yang, L., & Zhang, P. (2009, September). Query by humming: Comparing voices to voices. Paper presented at the IEEE International Conference on Management and Service Science, Beijing, China.

Liu, Y., Huang, W., Johnson, J., & Vaidya, S. (2006, May). GPU accelerated Smith-Waterman. In D. Göddeke (Chair), Database Applications - Computer Graphics and Modelling. Symposium conducted at the International Conference on Computational Science, UK.

Liu, Z., Lin, W., Li, N., & Lee, D. (2005, November). Detecting and filtering instant messaging spam - a global and personalized approach. Paper

presented at the 1st IEEE ICNP Workshop on Secure Network Protocols, Boston, MA.

Myers, G. (1999). A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of ACM, 46(3), 395-415.

Moslah, O., Valles-Such, A., Guitteny, V., Couvet, S., & Philipp-Foliguet, S.

(2009, May). Accelerated multi-view stereo using parallel processing capababilities of the GPUS. Paper presented at the 3DTV Conference:

The True Vision - Capture, Potsdam, Germany.

Munekawa, Y., Ino, F., & Hagihara, K. (2008, October). Design and implementation of the Smith-Waterman algorithm on the CUDA-compatible GPU. Paper presented at the 8th IEEE International Conference on BioInformatics and BioEngineering, Athen, Greece.

Qin, Z., Li, P., Zhu, Q., & Tian, C. (2010, March). SWEE: Approximately searching web service with keywords effectively and efficiently. Paper presented at the 2nd IEEE International Conference on Advanced Computer Control, Shenyang, China.

Si, J., Yang, L., Lu, C., Sun, J., & Mei, S. (2009, June). Approximate dynamic programming for continuous state and control problems. Paper presented at the 17th Mediterranean Conference on Control and Automation, Thessaloniki, Greece.

Soleh, M. Y., & Purwarianti, A. (2011, July). A non word error spell checker for Indonesian using morphologically analyzer and HMM. Paper presented at the 2011 IEEE International Conference on Electrical Engineering and Informatics, Bandung, Indonesia.

Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147(1), 195-197.

Song, T., Xue, Y., & Wang, D. (2006, October). An algorithm of large-scale

approximate multiple string matching for network security. In H. Chen (Chair), Security Protocols and Watermarks. Symposium conducted at the Communications and Networking in China, Beijing, China.

Shi, W., & Xie, M. (2011, June). Spam filtering cloud platform based on sharing fingerprints. Paper presented at the 2011 IEEE International Conference on Computer Science and Service System, Nanjing, China.

Wu, O., Zuo, H., Hu, W., Zhu, M., & Li, S. (2008, December). Recognizing and filtering web images based on people's existence. Paper presented at the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Sydney, Australia.

Wu, S., & Manber, U. (1992). Fast text searching: allowing errors.

COMMUNICATION OF THE ACM, 35(10), 83-91.

附錄

附錄一、Dynamic programming（CPU 版本）

#include <stdio.h>

#define PATTERN_LEN 99

#define NUM_THREADS 1

#define iMIN(x,y) ((x)>(y))? (y):(x) pthread_t tid[NUM_THREADS];

pthread_attr_t attr;

void *parallel_bitap_NFA(void *param);

int value[NUM_THREADS]={0};

void *fill_dp(void *param);

static int pattern_num = 0;

unsigned long int pattern_code = 0;

int k;

int dp_width;

int dp_depth;

int pattern_len, input_len;

char* input;

char pattern[100];

unsigned long int *dp_result;

unsigned long int Temp1 = 0;

static int dp_count = 0;

int main(int argc, char* argv[]){

FILE* fr_input;

FILE* fr_pattern;

FILE* fw_dp_match;

int i,j,y,policy;

int bit_length = 64;

struct timeval t_start, t_end;

if((fr_input=fopen(argv[1],"rb"))==NULL) {

printf("Input FILE: %s cannot be opened\n",argv[1]);

exit(1);

}

/* obtain file size: */

40 fseek (fr_input , 0, SEEK_END);

input_len = ftell (fr_input);

printf("original input_len is %d\n", input_len);

rewind(fr_input);

printf("exact input_len is %d\n", input_len);

}

fclose(fr_input);

//malloc result array dp_result and dp2m_result

dp_result = (unsigned long int*)malloc(sizeof(unsigned long int)*input_len);

for(j=0;j<input_len+1;j++){

dp_result[j] = 0;

}

//reading pattern file

if((fr_pattern=fopen(argv[2],"rb"))==NULL) {

if (pthread_attr_getschedpolicy(&attr,&policy) != 0) fprintf(stderr, "Unable to get policy.\n");

else {

if (pthread_attr_setschedpolicy(&attr, SCHED_OTHER) != 0) printf("unable to set scheduling policy to SCHED_OTHER \n");

pattern_len=strlen(pattern);

printf("pattern_len is %d\n", pattern_len);

printf("The throughput is %f Mps\n",(float)(input_len*8)/(elapsedTime*1000000) );

fclose(fr_pattern);

// Output matching results

fw_dp_match = fopen("dp_approximate_match_result.txt", "w");

if (fw_dp_match == NULL)

if(dp_result[i] != 0){

for (j = pattern_num; j >= 0; j--) { Temp1 = (dp_result[i] >> j) & 1;

if (Temp1 == 1) {

(void)fprintf(fw_dp_match, "At position %4d, match pattern

%d\n", i+1, j+1);

void *fill_dp(void *param){

int segmentSize=(input_len+NUM_THREADS-1)/NUM_THREADS;

int remainder=pattern_len+k;

int tid=(*(int*)param);

int pos=segmentSize*tid ;

int end=pos+segmentSize+remainder;

end=iMIN(end,input_len);

int *dp;

int distance=end-pos+1;

dp = (int *)malloc(distance*dp_depth*sizeof(int));

initial_dp(dp, distance, dp_depth);

int i;

43 +(j-pos)-1]+1);

} } }

for(i=pos+1; i<end+1;i++){// while the variable end add 1 for the reason of initial entry

if(dp[(dp_depth-1)*distance+(i-pos)]<=k) dp_result[i-1] |= pattern_code;

}

free(dp);

}

void initial_dp(int* dp_ptr, int width, int depth){

int i;

for(i=0;i<width;i++) dp_ptr[i]=0;

for(i=1;i<depth;i++) dp_ptr[i*width]=i;

}

#define PATTERN_LEN 99

#define iMIN(x,y) ((x)>(y))? (y):(x)

#define NUM_THREADS 1

pthread_t tid[NUM_THREADS];

pthread_attr_t attr;

void *parallel_bitap_NFA(void *param);

int value[NUM_THREADS]={0};

char *text;

char pattern[100];

int k, bit_length;

unsigned long int text_size = 0;

static int m;

unsigned long int *match_result;

unsigned long int pattern_code = 0;

static int pattern_num = 0;

struct timeval t_start, t_end;

float elapsedTime;

int main (int argc, const char *argv[]) { FILE *fpin;

if ((fpin = fopen(argv[1],"rb")) == NULL) {

printf("Input FILE: %s cannot be opened\n", argv[1]);

45 rewind(fpin);

/* allocate memory to contain the whole file: */

text = (char*) malloc (sizeof(char) * text_size);

/* allocate memory for output */

match_result = (unsigned long int *) malloc (sizeof(unsigned long int) * text_size);

if (match_result == NULL) {

// copy distance from command line k = atoi(argv[3]);

/**********************Pthread***********************************/

/* get the default attributes */

pthread_attr_init(&attr);

/* get the current scheduling policy */

if (pthread_attr_getschedpolicy(&attr,&policy) != 0) fprintf(stderr, "Unable to get policy.\n");

46 printf("SCHED_FIFO\n");

}

/* set the scheduling policy - FIFO, RT, or OTHER */

if (pthread_attr_setschedpolicy(&attr, SCHED_OTHER) != 0) printf("unable to set scheduling policy to SCHED_OTHER \n");

m = strlen(pattern);

for (y = 0; y < NUM_THREADS; y++){

pthread_create(&tid[y],&attr,parallel_bitap_NFA,(void *)&value[y]);

}

/* Now join on each thread */

for (y = 0; y < NUM_THREADS; y++) pthread_join(tid[y], NULL);

pattern_num++;

}

// stop time

gettimeofday(&t_end, NULL);

/* compute and print the elapsed time in millisec */

elapsedTime = (t_end.tv_sec - t_start.tv_sec) * 1000.0;

elapsedTime += (t_end.tv_usec - t_start.tv_usec) / 1000.0;

printf("The input size is %ld bytes\n", text_size );

printf("The elapsed time is %lf ms\n", elapsedTime);

printf("The throughput is %lf

Mbps\n",((double)(text_size*8))/(elapsedTime*1000000));

// Output results

fpout = fopen("Bitap_Pthread_match_result.txt", "w");

if (fpout == NULL) {

perror("Open output file failed.\n");

exit(7);

}

// Output match result to file for (y = 0; y < text_size; y++) { if (match_result[y] != 0) {

for (r = 63; r >= 0; r--) {

Temp = (match_result[y] >> r) & 1;

if (Temp == 1) {

fprintf(fpout, "At position %4d, match pattern %d\n", y+1, r+1);

void *parallel_bitap_NFA(void *param){

int segmentSize = (text_size + NUM_THREADS-1) / NUM_THREADS ; int remainder = m + k;

int tid=(*(int*)param);

int pos = segmentSize * tid ;

int end = pos + segmentSize + remainder;

end = iMIN(end, text_size) ; int i, j, z;

int *Mtable;

unsigned long D ; unsigned long X = 0;

unsigned long *M;

unsigned long mask1 = 0;

unsigned long mask2 = 0;

unsigned long mask3 = 0;

bit_length = (k+2) * (m-k);

if (bit_length > 64) {

perror("pattern is larger than 64 bits!");

exit(8);

}

if (pattern[0] == '\0') {

printf("No pattern input!");

exit(9);

}

pattern_code = pow(2, pattern_num);

Mtable = (int*)malloc(sizeof(int)* m * m);

/*Construct table in order to set up mask*/

for (i = 0; i < m; i++) { for (j = 0; j < m; j++) {

if (pattern[i] == pattern[j]) {

Mtable[i*m+j] = 0; //equal 0 }else {

Mtable[i*m+j] = 1; //nonequal 1

48 }

} }

/* Initialize the bit array */

M = malloc(sizeof(unsigned long int) * m);

for (i = 0; i < m; i++) M[i] = 0;

/*Set up the mask of each element of the pattern*/

for (i = 0; i < m; i++) {

M[i] += (Mtable[i*m+m-k-1+(j%(k+2))-z] * pow(2, j));

} } }

/*Set up the mask using throughout the algorithm*/

/**mask 1**/

49 while (pos < end) {

j = 0;

while (j < m) {

if (text[pos] == pattern[j]) {

X = ((D >> k) >> 2) | M[j] ;/*Line 1*/

break;

} j++;

}

if (j == m)

X = ((D >> k) >> 2) | mask3 ;

D = ((D << 1) | mask1) & (((D << k) << 3) | mask2) & (((X + mask1) ^ X)

>> 1) & mask3 ;/*Line 2 - Line 5*/

if(((D >> k) & 1) == 0 ) {

match_result[pos] |= pattern_code;

} pos++;

}

free(M);

free(Mtable);

}

void parallel_bitap_NFA(char *match_result, unsigned long int text_size, int pattern_number );

__global__ void kernel( char *text, char *pattern, int segmentSize, int Overlap, unsigned long int *M, char *match_result, int m, int text_size, int k, unsigned long int mask1, unsigned long int mask2, unsigned long int mask3, char pattern_code );

int k, m, bit_length ; char *input ;

char pattern[100] ; char *m_result ; cudaError_t status;

unsigned long int input_len;

unsigned long int seg_text_length = 0;

unsigned long int seg_text_start = 0;

unsigned long int seg_text_end = 0;

double TotalsetupInputTime = 0;

double TotalgetResultTime = 0;

double GPUTime = 0;

int main (int argc, const char *argv[]) { FILE *fr_input;

if ((fr_input = fopen(argv[1],"rb")) == NULL) {

printf("Input FILE: %s cannot be opened\n", argv[1]);

exit(1);

}

// copy distance from command line k = atoi(argv[3]);

// Output results

fpout = fopen("Bitap_LargeText_GPU_match_result.txt", "w");

if (fpout == NULL) {

perror("Open output file failed.\n");

exit(1);

}

/********For GPU device********/

int deviceID = 0;

if ( cudaSetDevice(deviceID) != cudaSuccess ){

fprintf(stderr, "Set CUDA device %d error\n", deviceID);

exit(1);

}

cudaDeviceProp deviceProp;

cudaGetDeviceProperties(&deviceProp, deviceID);

printf("Using Device %d: \"%s\"\n", deviceID, deviceProp.name);

/******************************/

printf("*******************Bitap Method on GPU********************\n");

/* obtain file size: */

fseek(fr_input , 0 , SEEK_END);

input_len = ftell(fr_input);

printf("original input_len is %ld\n", input_len);

rewind(fr_input);

if ( input_len <= MAX_MB) { seg_text_length = input_len;

}else{

seg_text_length = MAX_MB;

}

seg_text_end = seg_text_length - 1 ; while( seg_text_end < input_len) {

pattern_num = 1;

seg_text_length = seg_text_end - seg_text_start + 1;

input = (char *) malloc (sizeof(char)*seg_text_length);

if (input == NULL){

if (seg_text_end != 0){

break;

}else{

printf("Allocate input memory error\n");

52 exit (1);

} }

/* copy the file into the buffer: */

fseek (fr_input , seg_text_start, SEEK_SET);

result = fread (input, 1, seg_text_length, fr_input);

if( result != seg_text_length ){

printf("error occurs when reading input file to memory\n");

exit(1);

}

//malloc result array dp_result and dp2m_result

m_result = (char*) malloc (sizeof(char) * seg_text_length);

for(i=0;i<seg_text_length;i++){

m_result[i] = 0;

}

//call the algorithm

while (fscanf(fr_pattern, "%s", pattern) == 1 ) { puts(pattern);

m = strlen(pattern);

bit_length = (k+2) * (m-k);

parallel_bitap_NFA(m_result, seg_text_length, pattern_num);

pattern_num++;

}

// Output match result to file

for (i = 0; i < seg_text_length; i++) {

if (seg_text_start > 0 && i < (m + k)) continue;

if (m_result[i] > 0) {

fprintf(fpout, "At position %4ld, match pattern %d\n", i+seg_text_start+1, pattern_num-1);

count++;

} }

seg_text_start = seg_text_start + MAX_MB - m - k;

seg_text_end = seg_text_start + MAX_MB - 1;

seg_text_end = iMIN(seg_text_end, input_len-1);

rewind(fr_input);

rewind(fr_pattern);

}

printf("\nThe gpu elapsed time is %.2f ms\n", GPUTime);

printf("The input length is %.2f Mbytes\n", (float)input_len/1024/1024);

printf("The throughput is %.2f

Mbps\n",(float)(input_len*8)/(GPUTime*1000000/8) );

printf("\nsetupInputTime = %.2f ms\n", TotalsetupInputTime);

printf("getResultTime = %.2f ms\n", TotalgetResultTime);

double totalTime = GPUTime + TotalsetupInputTime + TotalgetResultTime ; printf("total time including PCIe = %.2f ms\n", totalTime);

void parallel_bitap_NFA(char *d_match_result, unsigned long int text_size, int pattern_num ){

struct timeval t_start, t_end;

float times;

unsigned long int mask1 = 0;

unsigned long int mask2 = 0;

unsigned long int mask3 = 0;

unsigned long int *M;

if (bit_length > 64) {

perror("pattern is larger than 64 bits!");

exit(1);

} int i, j, z;

int *Mtable;

if (pattern[0] == '\0') {

printf("No pattern input!");

exit(1);

}

/********For GPU device********/

char *dev_text;

char *dev_match_result;

char *dev_pattern;

unsigned long int *dev_M;

//Allocate memory in GPU

status = cudaMalloc((void **) &dev_text, sizeof(char) * text_size);

if ( cudaSuccess != status ){

fprintf(stderr, "cudaMallocHost dev_text error: %s\n", cudaGetErrorString(status));

exit(1) ;

54 }

status = cudaMalloc((void **) &dev_match_result, sizeof(char) * text_size);

if ( cudaSuccess != status ){

fprintf(stderr, "cudaMallocHost dev_match_result error: %s\n", cudaGetErrorString(status));

exit(1) ; }

cudaMemset(dev_match_result, 0, text_size * sizeof(char));

status = cudaMalloc((void **) &dev_pattern, sizeof(char) * m);

if ( cudaSuccess != status ){

fprintf(stderr, "cudaMallocHost dev_pattern error: %s\n", cudaGetErrorString(status));

exit(1) ; }

status = cudaMalloc((void **) &dev_M, sizeof(unsigned long int) * m);

if ( cudaSuccess != status ){

fprintf(stderr, "cudaMallocHost dev_M error: %s\n", cudaGetErrorString(status));

exit(1) ; }

/******************************/

Mtable = (int*)malloc(sizeof(int)* m * m);

M = (unsigned long int*)malloc(sizeof(unsigned long int) * m);

for (i = 0; i < m; i++) M[i] = 0;

gettimeofday(&t_start, NULL);

/*Construct table in order to set up mask*/

for (i = 0; i < m; i++) { for (j = 0; j < m; j++) {

if (pattern[i] == pattern[j]) {

Mtable[i*m+j] = 0; //equal 0

/*Set up the mask of each element of the pattern*/

for (i = 0; i < m; i++) {

M[i] += (Mtable[i*m+m-k-1+(j%(k+2))-z] * pow(2, j));

} } }

/*Set up the mask using throughout the algorithm*/

/**mask 1**/

/*************For GPU device*************/

//Copt text from Host to Device

status = cudaMemcpy(dev_text, input, sizeof(char) * text_size, cudaMemcpyHostToDevice);

if ( cudaSuccess != status ){

fprintf(stderr, "Error: %s\n", cudaGetErrorString(status));

exit(1) ; }

status = cudaMemcpy(dev_pattern, pattern, sizeof(char) * m, cudaMemcpyHostToDevice);

if ( cudaSuccess != status ){

fprintf(stderr, "Error: %s\n", cudaGetErrorString(status));

exit(1) ; }

status = cudaMemcpy(dev_M, M, sizeof(unsigned long int) * m, cudaMemcpyHostToDevice);

if ( cudaSuccess != status ){

fprintf(stderr, "Error: %s\n", cudaGetErrorString(status));

exit(1) ; }

56 gettimeofday(&t_end, NULL);

// compute and print the elapsed time in millisec

double setupInputTime = (t_end.tv_sec - t_start.tv_sec) * 1000.0;

setupInputTime += (t_end.tv_usec - t_start.tv_usec) / 1000.0;

TotalsetupInputTime += setupInputTime;

/*****************DONE*******************/

int segmentSize = (text_size + dimBlock * dimGrid - 1) / (dimBlock * dimGrid);

int segmentOverlap = segmentSize + m + k ; // record time setting

cudaEvent_t start, stop;

cudaEventCreate(&start);

cudaEventCreate(&stop);

cudaEventRecord(start, 0);

/*****************Call the Kernel function*****************/

kernel <<< dimGrid, dimBlock >>>( dev_text, dev_pattern, segmentSize,

segmentOverlap, dev_M, dev_match_result, m, text_size, k, mask1, mask2, mask3, pattern_num );

// record time setting

cudaEventRecord(stop, 0);

cudaEventSynchronize(stop);

cudaEventElapsedTime(&times, start, stop);

cudaEventDestroy(start);

status = cudaMemcpy(d_match_result, dev_match_result, sizeof(char) * text_size, cudaMemcpyDeviceToHost);

// compute and print the elapsed time in millisec

double getResultTime = (t_end.tv_sec - t_start.tv_sec) * 1000.0;

getResultTime += (t_end.tv_usec - t_start.tv_usec) / 1000.0;

__global__ void kernel(char *text, char *pattern, int segmentSize, int Overlap, unsigned long int *M, char *dev_match_result, int m, int input_size, int k, unsigned long int mask1, unsigned long int mask2, unsigned long int mask3, char pattern_code ){

int j ;

int tid = threadIdx.x + blockIdx.x * blockDim.x ; int pos = tid * segmentSize ;

dev_match_result[pos] |= pattern_code;

} pos++ ; }

}

#define BASIC_UNIT 128 * 1024 * 1024

#define MAX_THREADS 8

#define iMIN(x,y) ((x)>(y))? (y):(x)

void parallel_bitap_NFA(char *d_input, char *match_result, unsigned long int text_size, int pattern_number , int num_GPU);

__global__ void kernel( char *text, char *pattern, int segmentSize, int Overlap, unsigned long int *M, int shiftarray, char *match_result, int m, int text_size, int k, unsigned long int mask1, unsigned long int mask2, unsigned long int mask3, char

pattern_code );

int k, m, bit_length ; char *input ;

char pattern[100] ; char *m_result ; cudaError_t status;

unsigned long int input_len;

unsigned long int seg_text_length = 0;

unsigned long int seg_text_start = 0;

unsigned long int seg_text_end = 0;

double TotalsetupInputTime = 0;

double TotalgetResultTime = 0;

double GPUTime = 0;

int main (int argc, const char *argv[]) { FILE *fr_input;

if ((fr_input = fopen(argv[1],"rb")) == NULL) {

// copy distance from command line k = atoi(argv[3]);

// Output results

fpout = fopen("Bitap_omp_GPU_match_result.txt", "w");

if (fpout == NULL) {

perror("Open output file failed.\n");

exit(1);

}

printf("*******************Bitap Method on GPU using OpenMP********************\n");

/* obtain file size: */

fseek(fr_input , 0 , SEEK_END);

input_len = ftell(fr_input);

printf("original input_len is %ld\n", input_len);

rewind(fr_input);

if ( input_len <= MAX_MB) { seg_text_length = input_len;

}else{

seg_text_length = MAX_MB;

}

seg_text_end = seg_text_length - 1 ; int max_gpus = 0;

int num_gpus = 0;

int basic_unit = BASIC_UNIT;

cudaGetDeviceCount(&max_gpus);

while( seg_text_end < input_len) { pattern_num = 1;

seg_text_length = seg_text_end - seg_text_start + 1;

input = (char *) malloc (sizeof(char)*seg_text_length);

if (input == NULL){

if (seg_text_end != 0){

60 break;

}else{

printf("Allocate input memory error\n");

exit (1);

} }

num_gpus = (seg_text_length + basic_unit - 1) / basic_unit;

num_gpus = iMIN( num_gpus , max_gpus );

omp_set_num_threads(num_gpus); // create as many CPU threads as there are CUDA devices

/* copy the file into the buffer: */

fseek (fr_input , seg_text_start, SEEK_SET);

result = fread (input, 1, seg_text_length, fr_input);

if( result != seg_text_length ){

printf("error occurs when reading input file to memory\n");

exit(1);

}

//malloc result array dp_result and dp2m_result

m_result = (char*) malloc (sizeof(char) * seg_text_length);

for(i=0;i<seg_text_length;i++){

m_result[i] = 0;

}

//call the algorithm

while (fscanf(fr_pattern, "%s", pattern) == 1 ) { m = strlen(pattern);

bit_length = (k+2) * (m-k);

parallel_bitap_NFA(input, m_result, seg_text_length, pattern_num, num_gpus);

pattern_num++;

}

// Output match result to file

for (i = 0; i < seg_text_length; i++) {

if (seg_text_start > 0 && i < (m + k)) continue;

if (m_result[i] > 0) {

fprintf(fpout, "At position %4ld, match pattern %d\n", i+seg_text_start+1, pattern_num-1);

count++;

} }

seg_text_start = seg_text_start + MAX_MB - m - k;

seg_text_end = seg_text_start + MAX_MB - 1;

seg_text_end = iMIN(seg_text_end, input_len-1);

rewind(fr_input);

rewind(fr_pattern);

step++;

61 }

printf("The gpu elapsed time is %.2f ms\n", GPUTime);

printf("The input length is %.2f Mbytes\n", (float)input_len/1024/1024);

printf("The throughput is %.2f Gbps\n",(float)(input_len)/(GPUTime*1000000/8) );

printf("setupInputTime = %.2f ms\n", TotalsetupInputTime);

printf("getResultTime = %.2f ms\n", TotalgetResultTime);

double totalTime = GPUTime + TotalsetupInputTime + TotalgetResultTime ; printf("total time including PCIe = %.2f ms\n", totalTime);

void parallel_bitap_NFA(char *d_input, char *d_match_result, unsigned long int text_size, int pattern_num , int num_GPU){

struct timeval t_start, t_end;

float times;

unsigned long int mask1 = 0;

unsigned long int mask2 = 0;

unsigned long int mask3 = 0;

unsigned long int *M;

if (bit_length > 64) {

perror("pattern is larger than 64 bits!");

exit(1);

} int i, j, z;

int *Mtable;

if (pattern[0] == '\0') {

printf("No pattern input!");

exit(1);

}

/********For GPU device********/

Mtable = (int*)malloc(sizeof(int)* m * m);

M = (unsigned long int*)malloc(sizeof(unsigned long int) * m);

for (i = 0; i < m; i++) M[i] = 0;

62 gettimeofday(&t_start, NULL);

/*Construct table in order to set up mask*/

for (i = 0; i < m; i++) { for (j = 0; j < m; j++) {

if (pattern[i] == pattern[j]) {

Mtable[i*m+j] = 0; //equal 0

/*Set up the mask of each element of the pattern*/

for (i = 0; i < m; i++) {

M[i] += (Mtable[i*m+m-k-1+(j%(k+2))-z] * pow(2, j));

} } }

/*Set up the mask using throughout the algorithm*/

/**mask 1**/

/*************For GPU device*************/

63 gettimeofday(&t_end, NULL);

// compute and print the elapsed time in millisec

double setupInputTime = (t_end.tv_sec - t_start.tv_sec) * 1000.0;

setupInputTime += (t_end.tv_usec - t_start.tv_usec) / 1000.0;

TotalsetupInputTime += setupInputTime;

/*****************DONE*******************/

#pragma omp parallel shared(text_size, d_input, num_GPU, mask1, mask2, mask3) {

char *dev_text = NULL;

char *dev_match_result = NULL;

char *dev_pattern;

unsigned long int *dev_M;

double getResultTime;

int deviceID = omp_get_thread_num();

int num_threads = omp_get_num_threads();

for(i = deviceID; i < num_GPU; i++) { if( i == 1 ) changeDevice = 2;

else changeDevice = i;

if ( cudaSetDevice(changeDevice) != cudaSuccess ){

fprintf(stderr, "Set CUDA device %d error\n", changeDevice);

}

status = cudaMalloc((void **) &dev_pattern, sizeof(char) * m);

if ( cudaSuccess != status ){

fprintf(stderr, "cudaMallocHost dev_pattern error: %s\n", cudaGetErrorString(status));

exit(1) ; }

status = cudaMalloc((void **) &dev_M, sizeof(unsigned long int) * m);

if ( cudaSuccess != status ){

fprintf(stderr, "cudaMallocHost dev_M error: %s\n", cudaGetErrorString(status));

exit(1) ;

64 }

status = cudaMalloc((void **) &dev_text, sizeof(char) * (BASIC_UNIT + m + k));

if ( cudaSuccess != status ){

fprintf(stderr, "cudaMallocHost dev_text error: %s\n", cudaGetErrorString(status));

exit(1) ; }

status = cudaMalloc((void **) &dev_match_result, sizeof(char) * BASIC_UNIT);

if ( cudaSuccess != status ){

fprintf(stderr, "cudaMallocHost dev_match_result error: %s\n", cudaGetErrorString(status));

exit(1) ; }

for (begin = deviceID * BASIC_UNIT ; begin < text_size ; begin +=

(num_threads * BASIC_UNIT)){

if (begin > 0) back = begin - m - k;

else back = begin;

end = begin + BASIC_UNIT - 1;

end = iMIN( end, text_size );

length_For_GPU = end - back + 1;

cudaMemset(dev_match_result, 0, BASIC_UNIT * sizeof(char));

gettimeofday(&t_start, NULL);

status = cudaMemcpy(dev_text, d_input + back, sizeof(char) * length_For_GPU, cudaMemcpyHostToDevice);

if ( cudaSuccess != status ){

fprintf(stderr, "Read input error: %s\n", cudaGetErrorString(status));

exit(1) ; }

//Copt text from Host to Device

status = cudaMemcpy(dev_pattern, pattern, sizeof(char) * m, cudaMemcpyHostToDevice);

if ( cudaSuccess != status ){

fprintf(stderr, "Error: %s\n", cudaGetErrorString(status));

exit(1) ; }

status = cudaMemcpy(dev_M, M, sizeof(unsigned long int) * m, cudaMemcpyHostToDevice);

if ( cudaSuccess != status ){

fprintf(stderr, "Error: %s\n", cudaGetErrorString(status));

exit(1) ; }

gettimeofday(&t_end, NULL);

setupInputTime = (t_end.tv_sec - t_start.tv_sec) * 1000.0;

setupInputTime += (t_end.tv_usec - t_start.tv_usec) / 1000.0;

TotalsetupInputTime += setupInputTime;

// set 256 threads per block, set grid size automatically dimBlock = 128 ;

dimGrid = (length_For_GPU + dimBlock - 1) / dimBlock ; if (dimGrid > 65535) {

dimGrid = 65535 ; }

segmentSize = (length_For_GPU + dimBlock * dimGrid - 1) / (dimBlock * dimGrid);

segmentOverlap = segmentSize + m + k ; // record time setting

cudaEvent_t start, stop;

cudaEventCreate(&start);

cudaEventCreate(&stop);

cudaEventRecord(start, 0);

/*****************Call the kernel function*****************/

kernel <<< dimGrid, dimBlock >>>( dev_text, dev_pattern, segmentSize, segmentOverlap, dev_M, back, dev_match_result, m, length_For_GPU, k, mask1, mask2, mask3, pattern_num );

// record time setting

cudaEventRecord(stop, 0);

cudaEventSynchronize(stop);

cudaEventElapsedTime(&times, start, stop);

cudaEventDestroy(start);

cudaEventDestroy(stop);

GPUTime += times ;

/********Copy result form Device to Host********/

gettimeofday(&t_start, NULL);

status = cudaMemcpy(d_match_result + begin, dev_match_result, sizeof(char) * (end-begin+1), cudaMemcpyDeviceToHost);

if ( cudaSuccess != status ){

fprintf(stderr, "Out put error: %s\n", cudaGetErrorString(status));

exit(1) ; }

gettimeofday(&t_end, NULL);

// compute and print the elapsed time in millisec

getResultTime = (t_end.tv_sec - t_start.tv_sec) * 1000.0;

getResultTime += (t_end.tv_usec - t_start.tv_usec) / 1000.0;

TotalgetResultTime += getResultTime;

}

66 }

__global__ void kernel(char *text, char *pattern, int segmentSize, int Overlap, unsigned long int *M, int shiftarray, char *dev_match_result, int m, int input_size, int k, unsigned long int mask1, unsigned long int mask2, unsigned long int mask3, char pattern_code ){

int j ;

int tid = threadIdx.x + blockIdx.x * blockDim.x ; int pos = tid * segmentSize ;

if (shiftarray > 0){

if ((pos >= (m + k)))

dev_match_result[pos-m-k] |= pattern_code;

}else

dev_match_result[pos] |= pattern_code;

} pos++ ; }

}

在文檔中以多圖形處理器加速近似字串比對 (頁 47-82)

結論

參考文獻

附 錄

附錄