
Chapter 5: Experimental Data and Results

5.1 Test Environment

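This work was developed with Xilinx ISE 8.1i. The hardware description language is Verilog; ModelSim serves as the simulator and Xilinx XST as the synthesizer. The test set consists of the first 10,000 virus signatures (patterns) in the signature database released by ClamAV in February 2008.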

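We verify the design in three steps. First, once the code is written, we derive the results of several stages by hand and compare them with the ModelSim output. Second, if these agree, we compare the output of a reference program written in C++ with the ModelSim output; if these also agree, we take the RTL code to be functionally correct. Finally, we use Xilinx ChipScope Pro 8.1i to verify the design after the RTL code has been programmed into the FPGA board.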

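ChipScope acts as a software logic analyzer: it captures the pins or signals being debugged and displays them on screen, so that each one can be checked against the simulation results.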

The program is organized into the following modules:

System module (System.v)
  - AC module (AC.v)
  - Prefilter module (Prefilter.v)
      - InContr module (InContr.v)
      - HashGen module (HashGen.v)
  - TextRAM module (TextRAM.v)
  - JailRAM module (JailRAM.v)

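Processing starts with the Prefilter module sending the starting address of the search window to the TextRAM module, which fetches one block of data (the block size is 4 bytes). On receiving the address (wire [12:0] PF2TextAddr), TextRAM computes the corresponding address for the address bus of each BlockRAM in the text_bram group (text_bram0 to text_bram3). One clock cycle after receiving its address, each of text_bram0 to text_bram3 outputs one byte (8 bits); the four bytes are arranged and concatenated into a 32-bit data bus, which is sent to the HashGen module inside the Prefilter module. HashGen uses a hash function to map the 32 bits to 14 bits, which drive the address buses of the mqm_bram group (mqm_bram1 to mqm_bram7). One clock cycle after receiving the 14-bit address, the mqm_bram group produces its outputs, which are concatenated into R, a 7-bit vector (each BlockRAM outputs one bit).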
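The fetch path described above can be modeled in software as a minimal sketch. The function names (`split_into_banks`, `fetch_block`, `hash14`) are hypothetical, and three details are assumptions rather than facts from the design: byte i of the text is stored in bank i mod 4, the first byte of a block lands in the most significant position of the 32-bit word, and the 32-to-14-bit hash is a placeholder XOR fold, not the hash function actually used in HashGen.

```python
def split_into_banks(text: bytes, num_banks: int = 4) -> list[list[int]]:
    """Byte-interleave the text: byte i goes to bank i % num_banks (assumed layout)."""
    banks: list[list[int]] = [[] for _ in range(num_banks)]
    for i, value in enumerate(text):
        banks[i % num_banks].append(value)
    return banks


def fetch_block(banks: list[list[int]], addr: int) -> int:
    """Return the 4 bytes starting at byte address `addr` as one 32-bit word.

    Each bank receives its own row address, so an unaligned block is still
    read with exactly one access per bank.
    """
    num_banks = len(banks)
    word = 0
    for k in range(num_banks):
        byte_addr = addr + k
        bank = byte_addr % num_banks    # which text_bram holds this byte
        row = byte_addr // num_banks    # address presented to that bank
        word = (word << 8) | banks[bank][row]  # first byte ends up most significant
    return word


def hash14(word: int) -> int:
    """Placeholder 32-bit to 14-bit hash (XOR folding); not the thesis's hash."""
    return (word ^ (word >> 14) ^ (word >> 28)) & 0x3FFF


if __name__ == "__main__":
    banks = split_into_banks(bytes(range(16)))
    block = fetch_block(banks, 1)       # bytes 1, 2, 3, 4
    print(hex(block), hash14(block))
```

The point of the per-bank row address is that a block starting at any byte offset touches each of the four memories exactly once, which is what lets the hardware deliver 32 bits per request.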

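Inside HashGen, the next value of the main byte row (MB, a 7-bit vector) is computed from R. HashGen also forwards R to the other module inside Prefilter, the InContr module. InContr first uses R to decide whether a suspicious pattern has occurred. If so, it computes the starting address of the current search window and stores it in a 16-bit register (reg [15:0] PF2JailData); if not, it sets all 16 bits of PF2JailData to 0. InContr also uses R to compute how far the search window may move (leftward) next (wire [4:0] shift_bytes). The 13-bit register PF2TextAddr is then updated from shift_bytes to give the starting address of the next block and is sent to TextRAM to request the next block of data for Prefilter. If TextRAM has no bytes, or too few bytes, left to feed Prefilter, it asserts the wire ending (ending = 1) to tell Prefilter to stop. It follows that the interval between TextRAM receiving one 13-bit PF2TextAddr and receiving the next is 3 clock cycles.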
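The control flow of this loop can be sketched as follows. This is only an illustration: the text does not spell out how R is turned into a suspicion flag or a shift distance, so `lookup_r`, `is_suspicious`, and `shift_from_r` are hypothetical callables supplied by the caller. The sketch also simplifies PF2JailData to a list of recorded addresses and assumes the address grows by shift_bytes at each step.

```python
from typing import Callable

CYCLES_PER_STEP = 3  # from the text: one PF2TextAddr request every 3 clocks
BLOCK_SIZE = 4       # bytes fetched per request


def scan(text_len: int,
         lookup_r: Callable[[int], int],
         is_suspicious: Callable[[int], bool],
         shift_from_r: Callable[[int], int]) -> tuple[list[int], int]:
    """Return (addresses flagged as suspicious, clock cycles consumed)."""
    addr = 0
    cycles = 0
    jail: list[int] = []
    # The loop ends when fewer than BLOCK_SIZE bytes remain; in hardware this
    # is the point where TextRAM would raise `ending`.
    while addr + BLOCK_SIZE <= text_len:
        r = lookup_r(addr)             # 7-bit R vector for the block at addr
        if is_suspicious(r):
            jail.append(addr)          # stands in for the PF2JailData write
        shift = shift_from_r(r)
        if shift < 1:
            raise ValueError("shift must be at least 1 byte to make progress")
        addr += shift                  # next PF2TextAddr
        cycles += CYCLES_PER_STEP
    return jail, cycles


if __name__ == "__main__":
    hits, cycles = scan(
        16,
        lookup_r=lambda a: 0b1111111 if a == 8 else 0,
        is_suspicious=lambda r: r == 0b1111111,
        shift_from_r=lambda r: 1 if r else 4,
    )
    print(hits, cycles)
```

Because every step costs the same three cycles, the scan time depends only on how many shifts are needed, which is why the average shift distance reported in Section 5.3 determines the throughput.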


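5.2 Revised Versions of the Prefilter Module

Here we propose two revised versions of the Prefilter module that reduce the number of suspicious patterns it reports: one is speed-oriented and the other is power-oriented. Figure 5.1 shows the initial version designed directly from the algorithm, Figure 5.2 shows the speed-oriented revision, and Figure 5.3 shows the power-oriented revision.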

Figure 5.1: PF1, the original version of the Prefilter module

Figure 5.2: PF2, the speed-oriented revision of the Prefilter module

Figure 5.3: PF3, the power-oriented revision of the Prefilter module

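In Figure 5.2 we add one more BlockRAM, named mqm_bram8. Like mqm_bram1 to mqm_bram7, it is declared as RAMB16_S1, a single-port synchronous memory. Like mqm_bram7, it covers bytes 7 through 10 of a pattern, but the two use different hash functions; its purpose is to improve detection accuracy for bytes 7 through 10. In other words, with mqm_bram8 added, bytes 7 through 10 are judged more precisely, so the false positive rate falls and the number of patterns flagged as suspicious drops sharply, which noticeably improves the throughput of the overall system.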
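The effect of checking the same bytes with two independent hash functions can be illustrated with a small software model. Everything named here is hypothetical: `hash_a` and `hash_b` are stand-in hash functions (an XOR fold and a multiplicative hash), not the ones used for mqm_bram7 and mqm_bram8, and each table is a plain array of 2^14 one-bit entries.

```python
TABLE_BITS = 14
TABLE_SIZE = 1 << TABLE_BITS


def hash_a(word: int) -> int:
    """Stand-in hash for the first table: XOR folding to 14 bits."""
    return (word ^ (word >> 14) ^ (word >> 28)) & (TABLE_SIZE - 1)


def hash_b(word: int) -> int:
    """Stand-in hash for the second table: multiplicative hash, top 14 bits."""
    return ((word * 2654435761) & 0xFFFFFFFF) >> (32 - TABLE_BITS)


def build_tables(blocks: list[int]) -> tuple[bytearray, bytearray]:
    """Mark the entry of every pattern block in both one-bit tables."""
    table7 = bytearray(TABLE_SIZE)
    table8 = bytearray(TABLE_SIZE)
    for word in blocks:
        table7[hash_a(word)] = 1
        table8[hash_b(word)] = 1
    return table7, table8


def hit_single(table7: bytearray, word: int) -> bool:
    """One-table check: one hash function only."""
    return table7[hash_a(word)] == 1


def hit_dual(table7: bytearray, table8: bytearray, word: int) -> bool:
    """Two-table check: both tables must agree before a block is flagged."""
    return table7[hash_a(word)] == 1 and table8[hash_b(word)] == 1


if __name__ == "__main__":
    t7, t8 = build_tables([0x11223344, 0xDEADBEEF])
    single = sum(hit_single(t7, w) for w in range(100_000))
    dual = sum(hit_dual(t7, t8, w) for w in range(100_000))
    print(single, dual)
```

A block that truly belongs to a pattern always passes both checks, while an unrelated block must collide in both tables at once to be flagged, so the dual check can never report more hits than the single one.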

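The design in Figure 5.3 stores the MQM differently from PF1 and PF2. In PF1 and PF2 every BlockRAM is active at all times until the Prefilter module stops, and this accounts for a considerable share of the power consumption. We therefore designed another revision aimed at lower power. It also declares 8 BlockRAMs, each as RAMB16_S9_S9, a dual-port synchronous memory, so that each value of R can be obtained by reading a single BlockRAM. Our prefilter needs only 7 bits for R, but the declared BlockRAM type outputs 8 bits per read, one more than R requires; as in PF2, we use this extra bit to judge bytes 7 through 10 of a pattern more precisely, which lowers the false positive rate and sharply reduces the number of patterns flagged as suspicious. The output of the HashRule1 module drives the port A address bus and yields the upper 7 bits ($r_7 r_6 r_5 r_4 r_3 r_2 r_1$); the output of the HashRule2 module drives the port B address bus and yields the lowest bit ($r_0$). Then $R = \{r_7, r_6, r_5, r_4, r_3, r_2, r_1 \cdot r_0'\}$, which is why the BlockRAMs must be declared dual-port. Consequently PF3 reads at most 2 BlockRAMs per access to the mqm_bram group, a factor of 4 fewer than PF1 and PF2, which saves 75% of the power.

Although PF3 greatly reduces power, it needs extra control circuitry to select which BlockRAM in the mqm_bram group to read and which BlockRAM outputs to use. As a result its clock frequency cannot match that of PF1 and PF2, reaching at most 53%, and the throughput of the prefilter drops by about 50%. The next subsection presents data on the performance of each prefilter.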
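The composition of R in PF3 can be written out as a short sketch. It rests on two assumptions: the dot between the last two bits is read as a logical AND, and the prime on r0 only marks that the bit comes from port B. The names `compose_r` and `read_reduction` are hypothetical, and the power estimate simply follows the text's reasoning that power scales with the number of BlockRAMs read.

```python
def compose_r(port_a: int, port_b: int) -> int:
    """Build the 7-bit R from the 8-bit words read on ports A and B.

    port_a supplies r7..r1 in its upper seven bits; port_b supplies r0 in
    its lowest bit. The result is {r7, ..., r2, r1 AND r0} (AND is assumed).
    """
    high = (port_a >> 1) & 0x7F   # r7..r1, with r1 now in bit 0
    r0 = port_b & 0x1             # extra check bit from port B
    r1 = high & 0x1
    return (high & 0x7E) | (r1 & r0)


def read_reduction(reads: int, baseline_reads: int) -> float:
    """Fraction of BlockRAM reads saved relative to a baseline design."""
    return 1.0 - reads / baseline_reads


if __name__ == "__main__":
    print(bin(compose_r(0xFF, 0x01)), bin(compose_r(0xFF, 0x00)))
    print(read_reduction(2, 8))
```

With at most 2 reads in PF3 against 8 in PF2, `read_reduction(2, 8)` gives 0.75, which is the 75% figure quoted above.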


5.3 Data and Results

[Table: throughput of PF1 and PF1_NMB in Gbps (Min / Max / Aver.) by number of patterns]


Figure 5.5: Number of suspicious patterns caught by PF1 and PF1_NMB (x-axis: the number of patterns, 1,000 to 10,000; y-axis: the number of suspicious patterns)


Table 5.3: Average number of search-window shifts and average number of bytes per shift for PF1 and PF1_NMB (each entry is PF1/PF1_NMB)

# of        Random files                      Real files
patterns    Avg. shifts     Avg. bytes/shift  Avg. shifts     Avg. bytes/shift
1,000       1334.2/1358.8   6.130/6.021       1307.7/1326.7   6.256/6.167
2,000       1459.6/1537.6   5.606/5.322       1410.3/1484.7   5.800/5.509
3,000       1562.6/1726     5.236/4.740       1494.3/1616.3   5.474/5.061
4,000       1672.8/1930     4.890/4.238       1583.3/1816.7   5.166/4.504
5,000       1773.2/2147.8   4.612/3.808       1663.0/1988.7   4.920/4.114
6,000       1872.6/2386.4   4.368/3.428       1755.3/2221.7   4.661/3.686
7,000       1953.8/2577.2   4.187/3.174       1836.0/2366.0   4.456/3.460
8,000       2051.2/2807.4   3.988/2.914       1926.7/2509.0   4.246/3.263
9,000       2135.4/3004.4   3.831/2.723       1981.7/2665.7   4.129/3.072
10,000      2209.8/3217.2   3.702/2.543       2060.0/2869.7   3.971/2.857
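The shift figures in Table 5.3 can be related to throughput with a back-of-the-envelope estimate. The formula is an assumption, not one stated in the text: it supposes that each window shift costs the 3 clock cycles given in Section 5.1 and that the clock runs at the 169.909 MHz reported by XST for PF1 in Appendix 1. The function names are hypothetical. The second helper multiplies shifts by bytes per shift as a consistency check; the product should land near the size of the scanned text, which a 13-bit PF2TextAddr would bound at 8,192 bytes.

```python
def throughput_gbps(avg_shift_bytes: float,
                    f_clk_mhz: float,
                    cycles_per_shift: int = 3) -> float:
    """Estimated scan rate: bytes advanced per shift, times 8 bits, per shift time."""
    bits_per_shift = avg_shift_bytes * 8
    mbps = bits_per_shift * f_clk_mhz / cycles_per_shift
    return mbps / 1000.0


def covered_bytes(avg_shifts: float, avg_shift_bytes: float) -> float:
    """Approximate number of text bytes covered by all shifts of one scan."""
    return avg_shifts * avg_shift_bytes


if __name__ == "__main__":
    # PF1 on random files, 1,000 and 10,000 patterns (values from Table 5.3)
    for shifts, shift_bytes in [(1334.2, 6.130), (2209.8, 3.702)]:
        print(round(throughput_gbps(shift_bytes, 169.909), 3),
              round(covered_bytes(shifts, shift_bytes), 1))
```

Under these assumptions the PF1 random-file rows give roughly 2.78 Gbps at 1,000 patterns and roughly 1.68 Gbps at 10,000 patterns.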

Figure 5.6: Average number of bytes shifted by PF1 and PF1_NMB (series: PF1 random, PF1_NMB random, PF1 real, PF1_NMB real; x-axis: the number of patterns, 1,000 to 10,000; y-axis: the number of shift bytes)


[Table: throughput of PF2 and PF2_NMB in Gbps (Min / Max / Aver.) by number of patterns]


[Figure: PF1 & PF2 versus PF1_NMB & PF2_NMB on random and real files, plotted against the number of patterns (1,000 to 10,000)]


Table 5.6: Average number of search-window shifts and average number of bytes per shift for PF2 and PF2_NMB (each entry is PF2/PF2_NMB)

# of        Random files                      Real files
patterns    Avg. shifts     Avg. bytes/shift  Avg. shifts     Avg. bytes/shift
1,000       1334.2/1358.8   6.130/6.021       1307.7/1326.7   6.256/6.167
2,000       1459.6/1537.6   5.606/5.322       1410.3/1484.7   5.800/5.509
3,000       1562.6/1726     5.236/4.740       1494.3/1616.3   5.474/5.061
4,000       1672.8/1930     4.890/4.238       1583.3/1816.7   5.166/4.504
5,000       1773.2/2147.8   4.612/3.808       1663.0/1988.7   4.920/4.114
6,000       1872.6/2386.4   4.368/3.428       1755.3/2221.7   4.661/3.686
7,000       1953.8/2577.2   4.187/3.174       1836.0/2366.0   4.456/3.460
8,000       2051.2/2807.4   3.988/2.914       1926.7/2509.0   4.246/3.263
9,000       2135.4/3004.4   3.831/2.723       1981.7/2665.7   4.129/3.072
10,000      2209.8/3217.2   3.702/2.543       2060.0/2869.7   3.971/2.857

[Figure: PF1 & PF2 versus PF1_NMB & PF2_NMB on random files, plotted against the number of patterns (1,000 to 10,000)]

Chapter 6: Conclusion

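The prefilter originally designed from the algorithm (PF1) operates at about 170 MHz. Its throughput reaches nearly 2.9 Gbps at best (with only 1,000 patterns in the pattern set) and nearly 1.7 Gbps at worst (with 10,000 patterns in the pattern set). This meets our preset target and gives a worst-case throughput per unit area of about 0.24 Gbps/BlockRAM. As for the number of suspicious patterns detected, we use only one hash function to represent the MQM7 result, so the false positive rate stays high; the verifier is kept too busy, and the overall system throughput cannot rise. We therefore designed PF2, which mainly lowers the false positive rate while keeping the existing throughput as far as possible.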

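PF2 also operates at about 170 MHz. Its throughput reaches nearly 2.9 Gbps at best (1,000 patterns in the pattern set) and nearly 1.7 Gbps at worst (10,000 patterns in the pattern set), the same as PF1. In terms of suspicious patterns detected, however, two hash functions now represent the MQM7 result, so the false positive rate drops sharply. The largest gap is about a factor of 37: PF2 detects only 1/37 as many suspicious patterns as PF1 (with 1,000 patterns in the set, PF1 detects 74 suspicious patterns and PF2 detects 2).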

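PF3 operates at 84.5 MHz. This version keeps PF2's performance in the number of suspicious patterns detected while reducing power consumption. Relative to PF1, PF3 cuts power by 75%, at the cost of throughput, which falls to about 51%.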


Appendix 1

Synthesis Results of Each Synthesizer

I. Using XST, the synthesizer built into Xilinx ISE 8.1i

1. PF1

Timing Summary:
---
Speed Grade: -6

Minimum period: 5.886ns (Maximum Frequency: 169.909MHz)
Minimum input arrival time before clock: 3.072ns
Maximum output required time after clock: 6.503ns
Maximum combinational path delay: No path found

Timing Detail:
---
All values displayed in nanoseconds (ns)

===========================================================
Timing constraint: Default period analysis for Clock 'clk_p'
  Clock period: 5.886ns (frequency: 169.909MHz)
  Total number of paths / destination ports: 2026 / 181
---
Delay:               5.886ns (Levels of Logic = 16)
  Source:            hashgen/B_5 (FF)
  Destination:       input_contr/PF2TextAddr_10 (FF)
  Source Clock:      clk_p rising
  Destination Clock: clk_p rising

  Data Path: hashgen/B_5 to input_contr/PF2TextAddr_10
    LUT4_L:I2->LO  1  0.313  0.128  input_contr/InContr_02_xo<0>64 (input_contr/InContr_02_xo<0>_map1001)
    LUT4:I2->O     1  0.313  0.506  input_contr/InContr_02_xo<0>109 (input_contr/addr_text<2>)

Minimum period: 4.968ns (Maximum Frequency: 201.303MHz)
Minimum input arrival time before clock: 3.063ns
Maximum output required time after clock: 6.049ns
Maximum combinational path delay: No path found

Timing Detail:
All values displayed in nanoseconds (ns)

===========================================================
Timing constraint: Default period analysis for Clock 'clk_p'
  Clock period: 4.968ns (frequency: 201.303MHz)
  Total number of paths / destination ports: 1713 / 171
---
Delay:               4.968ns (Levels of Logic = 14)
  Source:            input_contr/PF2TextAddr_12 (FF)
  Destination:       input_contr/PF2TextAddr_10 (FF)
  Source Clock:      clk_p rising
  Destination Clock: clk_p rising

  Data Path: input_contr/PF2TextAddr_12 to input_contr/PF2TextAddr_10
    Gate     Net

Minimum input arrival time before clock: 3.072ns
Maximum output required time after clock: 6.503ns
Maximum combinational path delay: No path found

Timing Detail:
---
All values displayed in nanoseconds (ns)

===========================================================
Timing constraint: Default period analysis for Clock 'clk_p'
  Clock period: 5.886ns (frequency: 169.909MHz)
  Total number of paths / destination ports: 1949 / 181
---
Delay:               5.886ns (Levels of Logic = 16)
  Source:            hashgen/B_5 (FF)
  Destination:       input_contr/PF2TextAddr_10 (FF)
  Source Clock:      clk_p rising
  Destination Clock: clk_p rising

  Data Path: hashgen/B_5 to input_contr/PF2TextAddr_10
    Gate     Net
    LUT4_L:I2->LO  1  0.313  0.128  input_contr/InContr_02_xo<0>64 (input_contr/InContr_02_xo<0>_map1001)
    LUT4:I2->O     1  0.313  0.506  input_contr/InContr_02_xo<0>109 (input_contr/addr_text<2>)
    LUT2_D:I1->LO  2  0.313  0.000  input_contr/InContr__n0002<0>lut (N1460)
    MUXCY:S->O     1  0.377  0.000  input_contr/InContr__n0002<0>cy (input_contr/InContr__n0002<0>_cyo)
    MUXCY:CI->O    1  0.041  0.000  input_contr/InContr__n0002<1>cy (input_contr/InContr__n0002<1>_cyo)

