Performance comparison

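The forward discrete cosine transform (DCT) is a real-to-real transform and is the computational core of image processing standards such as JPEG and MPEG. Its mathematical expression is as follows: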

NOP |OR R0 1004 |NOP |NOP ;

FFT:

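The fast Fourier transform (FFT) is a complex-to-complex transform, commonly used for baseband signal processing in communication systems (e.g. 802.11a). The following expression gives a length-N FFT in the Cooley-Tukey radix-2 decimation-in-frequency (DIF) form:

… after a layer of butterfly operations, half of the data must be written back to memory, and the DFG is then split into two 8-point FFTs that are computed in sequence, which completes one 16-point complex FFT.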
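The split described above, in which one butterfly layer turns a 16-point FFT into two independent 8-point FFTs, can be sketched in Python. This is only an illustration of the textbook radix-2 DIF recursion, not the scheduled DFG of the thesis; the names `fft_dif` and `dft` are hypothetical.

```python
import cmath


def fft_dif(x):
    """Radix-2 decimation-in-frequency FFT (illustrative sketch).

    len(x) must be a power of two. Returns the spectrum in natural order.
    """
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    # One butterfly layer: the sums feed the even-indexed outputs,
    # the twiddled differences feed the odd-indexed outputs.
    upper = [x[k] + x[k + half] for k in range(half)]
    lower = [(x[k] - x[k + half]) * cmath.exp(-2j * cmath.pi * k / n)
             for k in range(half)]
    # The two halves are now independent half-length FFTs,
    # computed one after the other (16 points -> two 8-point FFTs).
    even = fft_dif(upper)
    odd = fft_dif(lower)
    out = [0j] * n
    out[0::2] = even
    out[1::2] = odd
    return out


def dft(x):
    """Direct O(N^2) DFT, used only as a reference for checking."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]
```

For a 16-element input, `fft_dif` performs the first butterfly layer once and then recurses into two 8-point problems, which mirrors the order of work described in the text.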

| Execution kernel | ADSP-21xx | TI C55x | DSP-lite | Proposed DSP |
|---|---|---|---|---|
| Working frequency | 160 MHz | 200 MHz | 314 MHz | 268 MHz |
| Instruction length | 24 bits (single way) | 8~48 bits (variable length) | 128 bits (4-way) | 64 bits (4-way) |
| 16-point complex FFT | 874 cycles | 356 cycles | 268 cycles | 277 cycles |
| 8-point 1-D DCT | 154 cycles | -- | 43 cycles | 50 cycles |
| 8x8-point 2-D DCT | 2452 cycles | 1078 cycles | 688 cycles | 705 cycles |
| 2nd-order biquad filter | 13 cycles | 5 cycles | 16 cycles | 7 cycles |
| Numerical method | BFP | BFP | SFP | SFP |
| Power dissipation | 120 mW | 321 mW | 52 mW | 37 mW |
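Because the kernels run at different clock rates, cycle counts alone do not give wall-clock time. The sketch below converts the 16-point FFT cycle counts from the comparison into execution time and a rough energy figure. It assumes that the listed power dissipation applies while the kernel is running, which the comparison does not state; the names `KERNELS`, `exec_time_us` and `energy_nj` are hypothetical.

```python
# Values copied from the comparison: (working frequency in MHz, power in mW).
KERNELS = {
    "ADSP-21xx": (160, 120),
    "TI C55x": (200, 321),
    "DSP-lite": (314, 52),
    "Proposed DSP": (268, 37),
}

# Cycle counts for the 16-point complex FFT row.
FFT16_CYCLES = {
    "ADSP-21xx": 874,
    "TI C55x": 356,
    "DSP-lite": 268,
    "Proposed DSP": 277,
}


def exec_time_us(cycles, mhz):
    """Execution time in microseconds: cycles / (cycles per microsecond)."""
    return cycles / mhz


def energy_nj(cycles, mhz, mw):
    """Rough energy in nanojoules (us * mW = nJ).

    Assumption: the quoted power figure holds during this kernel.
    """
    return exec_time_us(cycles, mhz) * mw


if __name__ == "__main__":
    for name, (mhz, mw) in KERNELS.items():
        cycles = FFT16_CYCLES[name]
        print(f"{name}: {exec_time_us(cycles, mhz):.3f} us, "
              f"{energy_nj(cycles, mhz, mw):.1f} nJ")
```

Under that assumption, the proposed DSP needs slightly more time than DSP-lite for the FFT but less energy, because its lower power outweighs the longer run time.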

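Chapter 5 Conclusion

This thesis proposes a programmable DSP accelerator core in which the program controls the hardware data flow. First, a programming model based on a compact instruction set remedies the oversized microcode of its predecessor, the DSP-lite architecture. Second, program memory replaces the microcode table, and synchronized decoding together with branch instructions removes the microcode-table updates that the original DSP-lite required.

With SIU DSG scheduling, the function units of the microprocessor can be organized into an ASIC-like data flow. The design adopts energy-saving measures such as lightweight SFP arithmetic and distributed memory. In terms of computation speed, the results in Table 4-2 show that, with the same hardware resources, the SIU uses the function units more effectively and achieves better performance. In terms of power consumption, among hardware currently on the market, an ASIC IP for the FFT computation in 802.11b consumes 16 mW [25]. Accelerators built on DSP processors consume more than 100 mW on average [26][27], whereas a datapath based on SIU DSG keeps power consumption below 50 mW.

In the power comparison, therefore, the SIU approach effectively reduces power consumption, but there is still room for improvement before it reaches the goal of emulating an ASIC data flow. Regarding code generation, our greatest shortcoming is that code must be generated from an SDFG. Most general DSP processors provide compilers for high-level languages, such as a C compiler, but synthesizing high-level languages for this ISA is difficult in practice.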

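As future work, our laboratory is currently researching automatic program generation. For high-level language support, although a compiler is difficult to produce, object linking can still be provided through an application programming interface (API). For energy saving, the architecture can still apply common low-power techniques, such as low-power adders and gated clocks, to reduce power consumption further. In addition, in multiprocessor platform design, using more concurrent function units for parallel processing is an unavoidable trend, and how to partition these function units so that parallel processing becomes more efficient is also a research topic. Our design has four function units: load/store, adder, multiplier and shifter. Functionally, they are sufficient to carry out all kinds of signal processing computations independently, but this is not necessarily the most balanced or most efficient arrangement. For example, if a task needs 12 function units operating simultaneously to meet its computational requirement, is it better to use three data streams of four function units each, or two data streams of six function units each? We therefore also hope to plan the ISA for future scalability, for example by defining how many extra bits and how much decoding circuitry are needed for the hardware to gain one more function unit without redesigning the whole data-flow architecture. This would make performance verification more convenient during hardware partitioning.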

References:

[1] Alan V. Oppenheim, Ronald W. Schafer, “Discrete-Time Signal Processing”, 2nd Edition, Prentice Hall, Upper Saddle River, NJ, 1998

[2] David A. Patterson, John L. Hennessy, “Computer Organization & Design: The Hardware/Software Interface”, 2nd Edition, Morgan Kaufmann, San Francisco, CA, 1997

[3] David A. Patterson, John L. Hennessy, “Computer Architecture: A Quantitative Approach”, 2nd Edition, Morgan Kaufmann, San Francisco, CA, 1995

[4] Keshab K. Parhi, “VLSI Digital Signal Processing Systems: Design and Implementation”, John Wiley & Sons, 1999

[5] Website: DSP village, http://dspvillage.ti.com/

[6] Gatherer, et al, "DSP-based architectures for mobile communications: past, present and future", IEEE Communications, vol. 38, Jan. 2000

[7] OMAP5910 Dual Core Processor – Technical Reference Manual, Texas Instruments, Jan. 2003

[8] TriCore 2-32-bit Unified Processor Core v.2.0 Architecture, Architecture Manual, Infineon Technology, June 2003

[9] R. A. Quinnell, "Logical combination? Convergence products need both RISC and DSP processors, but merging them may not be the answer", EDN, 2003

[10] "A Block Floating Point Implementation for an N-Point FFT on the TMS320C55x DSP", TI Application Report, 2003

[11] S. Rixner, W. J. Dally, B. Khailany, P. Mattson, U. J. Kapasi, and J. D. Owens, "Register organization for media processing", in Proc. HPCA-6, 2000

[12] IEEE Standard for Binary Floating-Point Arithmetic, IEEE Standard 754, 1985

[13] Digital Signal Processing – Using the ADSP-2100 Family, Analog Device Inc., 1990

[14] A Block Floating Point Implementation for an N-Point FFT on the TMS320C55x DSP, Texas Instruments, 2003

[15] Tay-Jyi Lin, Hung-Yueh Lin, Chie-Min Chao, Chih-Wei Liu, Chein-Wei Jen, "A compact DSP core with static floating-point unit & its microcode generation", Proceedings of the 14th ACM Great Lakes Symposium on VLSI, 2004

[16] Serene Banerjee, Hamid R. Sheikh, Lizy K. John, Brian L. Evans, and Alan C. Bovik, “VLIW DSP vs. Superscalar Implementation of a Baseline H.263 Video Encoder” Dept. of Electrical & Computer Engineering , University of Texas, 2000

[17] Jessica H. Tseng, Krste Asanovic, "Banked multiported register files for high-frequency superscalar microprocessors", ISCA-30, 2003

[18] Javier Zalamea, Josep Llosa, Eduard Ayguadé, Mateo Valero, "Hierarchical clustered register file organization for VLIW Processors", Universitat Politècnica de Catalunya, 2003

[19] TMS320C6000 CPU and Instruction Set Reference Guide, TI, 2000

[20] T.J. Lin, Chein-Wei Jen, "Data stream generation for concurrent computation in VLSI signal processors", International Conference on Signal Processing, 2000

[21] Y.M. Chang, “Design and Implementation of DSP Datapath for Baseband Processing”, Master Thesis, National Chiao Tung University, Taiwan, 2003

[22] H.Y. Lin, “Lightweight DSP Arithmetic and its Application on a programmable DSP core”, Master Thesis, National Chiao Tung University, Taiwan, 2004

[23] C.C. Lee, “An Embedded Digital Signal Processor Design with Hierarchical Register File & Packed Instructions”, National Chiao Tung University, Taiwan, 2004

[24] "TMS320C55x DSP Library Programmer's Reference", TI, 2003

[25] Web site: DSP core, http://www.dspcore.com/cn/Products/SIFT.htm

[26] "TMS320VC5509A Power Consumption Summary", TI C5000 Hardware Application Report, June 2004

[27] Web site: "Analog Devices Extends ADSP-218x DSP Family with Over 50 Percent Power Consumption Savings", http://www.analog.com/en/content/0%2C2886%2C431%255F%255F8954%2C00.html and http://www.analog.com/IST/SelectionTableProcessors/?selection_table_id=

Appendix: Instruction Encoding

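Operands: R

Description:

Reads a 16-bit value from the IO bus and places it on the IO output:

Io = memory(R)

When R10 (or R9) is used as the address, the address is automatically adjusted to R10 (R9) + counter register:

Io = memory(R10+counter register)

When R0 is used as the address, it is treated as null and the IO bus takes no action.

When the Z bit is set to 1, the counter register is reset to 0 in the next cycle.

When the S bit is set to 1, the o field that follows is taken as the left-shift amount of the counter register. The counter register is 4 bits wide. After each indexed access the counter register automatically points to the next word. When the shift amount is 1, each increment of the counter register advances by 2 words; when it is 2, each increment advances by 4 words, and so on. This allows consecutive memory accesses at a fixed stride.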
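The indexed addressing above can be sketched as a small Python model. It is a hypothetical illustration, not the hardware description: the class name `IndexedAddress`, the wrap-around of the 4-bit counter after 15, and the modelling of the Z bit as an explicit `reset()` call are assumptions.

```python
class IndexedAddress:
    """Hypothetical model of the counter-register addressing."""

    def __init__(self):
        self.counter = 0  # 4-bit counter register
        self.shift = 0    # left-shift amount taken from the o field when S = 1

    def effective(self, reg_index, reg_value):
        """Return the address for one access, or None when R0 (null) is used."""
        if reg_index == 0:
            return None  # R0: the IO bus takes no action
        if reg_index in (9, 10):
            # R9/R10: base plus the shifted counter, giving a fixed stride.
            addr = reg_value + (self.counter << self.shift)
            # Assumption: the 4-bit counter wraps around after 15.
            self.counter = (self.counter + 1) & 0xF
            return addr
        return reg_value  # any other register is used as a plain address

    def reset(self):
        """Effect of the Z bit: the counter register returns to 0."""
        self.counter = 0
```

With `shift = 1` and a base of 100 in R10, three consecutive accesses go to addresses 100, 102 and 104, which is the stride-2 pattern the description mentions.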

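1.2. SW: Store 16-bit word

Opcode:

Operands: R1, R2

Description:

Stores the value of R2 at the memory address specified by R1 and also places the value of R2 on the output:

memory(R1)=R2, Io=R2

When R10 (or R9) is used as the address, the address is automatically adjusted to R10 (R9) + counter register:

memory(R10+counter register)=R2, Io=R2

When R0 is used as the address, the IO bus takes no action. The value of R2 is only placed on the output and is not stored to memory: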

Io=R2
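A minimal Python sketch of the store behaviour follows. It assumes memory can be modelled as a dictionary and leaves out the counter-register adjustment for R9/R10; the function name `store_word` is hypothetical.

```python
def store_word(memory, r1_index, address, value):
    """Model of SW: write `value` to memory and return the new Io value.

    memory   -- dict mapping address -> 16-bit word (an assumed model)
    r1_index -- index of the address register; 0 means R0 (null)
    address  -- the value held in that register
    """
    value &= 0xFFFF  # words are 16 bits wide
    if r1_index != 0:
        memory[address] = value  # memory(R1) = R2
    # In both cases the stored value also appears on the output: Io = R2.
    return value
```

Calling `store_word(mem, 0, addr, v)` leaves `mem` unchanged and only returns `v`, matching the R0 case.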

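1.3. Branch and I/O registers update

Instruction: J

Opcode:

Operands: O

Description: The program branches unconditionally to the value of the output selected by O.

Instruction: JC

Opcode:

Operands: O

Description: If the ALU operation in the previous cycle produced a carry-out or borrow-in, the program branches to the value of the output selected by O.

Instruction: JNZ

Opcode:

Operands: O

Description: If the result of the ALU operation in the previous cycle was not 0, the program branches to the value of the output selected by O.

Instruction: JZ

Opcode:

Operands: O

Description: If the result of the ALU operation in the previous cycle was 0, the program branches to the value of the output selected by O.

Instruction: Update registers

Opcode:

Operands: O, R1

Description: Stores the value of the selected output in the specified register:

R1 = selected output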
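The four branch conditions can be summarized in a short Python sketch. The fall-through to `pc + 1` when a branch is not taken and the function name `next_pc` are assumptions made for illustration; `carry` and `zero` stand for the flags left by the ALU operation of the previous cycle.

```python
def next_pc(op, pc, target, carry=False, zero=False):
    """Return the next program counter for J, JC, JNZ or JZ.

    target -- value of the output selected by the O field
    carry  -- previous ALU cycle produced a carry-out or borrow-in
    zero   -- previous ALU result was 0
    """
    taken = {
        "J": True,        # unconditional
        "JC": carry,      # branch on carry/borrow
        "JNZ": not zero,  # branch when the result was non-zero
        "JZ": zero,       # branch when the result was zero
    }[op]
    # Assumption: execution continues with the next instruction otherwise.
    return target if taken else pc + 1
```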

[Encoding diagrams for the instructions above: opcode bit patterns 1 1 0 0 O, 1 1 0 1 O, 1 1 1 0 O and 1 1 1 1 O; field-width labels 2 and 4.]

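2. ALU unit instruction format:

ALU instructions are 22 bits long and mainly provide addition, subtraction, AND and OR operations.

2.1. ADD: addition (no register update)

Opcode:

Operands: ARx1, ARx2, S

Description:

AR = selected output result.

2.4. SUB: subtraction (no register update)

Opcode:

Operands: ARx1, ARx2, S

Description:

2.5. SUI: subtraction with an immediate value (no register update)

AR = selected output result.

2.7. AND

Opcode:

Operands: ARx, K

Description:

Operands: AR, ARx1, ARx2

Description: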

Ao = ARx1 & ARx2

AR = selected output result

0 1 AR ARx1 ARx2

2.9. OR

Opcode:

Operands: ARx, K

Description:

Ao = ARx | (K << e)

where:

| is the bitwise OR operation

K is automatically sign-extended to 16 bits

e is the exponent of K, with range 000 ~ 101

2.10. ORL: OR operation with register update

Opcode:

Operands: AR, ARx1, ARx2

Description:

Ao = ARx1 | ARx2

AR = selected output result
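The immediate form of section 2.9 and the register form of section 2.10 can be sketched in Python as follows. The width of the K field (`k_bits`, set to 11 here) is an assumption; the 16-bit result mask and the helper names `sign_extend`, `or_immediate` and `or_registers` are likewise hypothetical.

```python
MASK16 = 0xFFFF


def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    sign = 1 << (bits - 1)
    value &= (1 << bits) - 1
    return (value ^ sign) - sign


def or_immediate(arx, k, e, k_bits=11):
    """Ao = ARx | (K << e), with K sign-extended to 16 bits first.

    k_bits is an assumed field width; e must lie in 000..101.
    """
    if not 0 <= e <= 0b101:
        raise ValueError("e must be in the range 000 ~ 101")
    k16 = sign_extend(k, k_bits) & MASK16
    return (arx | (k16 << e)) & MASK16


def or_registers(arx1, arx2):
    """Ao = ARx1 | ARx2 (the ORL form also writes the selected output to AR)."""
    return (arx1 | arx2) & MASK16
```

The exponent e lets a short immediate reach higher bit positions: with K = 5 and e = 4, the value OR-ed into ARx is 0x0050.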

[Encoding diagrams: 1 1 1 | AR | ARx1 | ARx2 with field widths 3, 4, 5, 5 and the remaining bits don't care; 1 1 0 | ARx | K with width labels 11 and 5.]

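3. Multiplier unit instruction format:

Multiplier instructions are 12 bits long and provide integer and pure-fraction multiplication.

3.1. M: integer multiplication

Opcode:

Operands: MRx1, MRx2, s

Description:

MRx can take the following values:

10000 ~ 11111: use the value of MR0 ~ MR15

00000: use 0 as the operand

01011: use Ao as the operand

01100: use Mo as the operand

01101: use So as the operand

01110: use Io as the operand

Mo = MRx1 * MRx2

When s is 0 the operation is an integer multiplication; when s is 1 the product is corrected by a 1-bit right shift.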
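The operand selection and the effect of the s bit can be sketched in Python. This is a behavioural illustration under stated assumptions: the output width of Mo is not modelled, undefined selector codes are rejected, and the names `select_operand` and `multiply` are hypothetical.

```python
def select_operand(code, regs, ao, mo, so, io):
    """Decode a 5-bit MRx field into an operand value.

    regs           -- sequence holding MR0..MR15
    ao, mo, so, io -- current values of the four unit outputs
    """
    if code & 0b10000:
        return regs[code & 0b01111]  # 10000..11111 -> MR0..MR15
    fixed = {
        0b00000: 0,   # constant zero
        0b01011: ao,  # ALU output
        0b01100: mo,  # multiplier output
        0b01101: so,  # shifter output
        0b01110: io,  # IO output
    }
    if code not in fixed:
        # Assumption: codes not listed in the description are invalid.
        raise ValueError(f"undefined operand selector {code:05b}")
    return fixed[code]


def multiply(a, b, s):
    """Mo = a * b; when s is 1 the product is shifted right by one bit."""
    product = a * b
    return product >> 1 if s else product
```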

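3.2. fM: pure-fraction multiplication

Opcode:

Operands: MRx1, MRx2, s

Description:

Mo = MRx1 (*,s) MRx2

3.3. UM: “multiply-by-one” multiplication

Opcode:

Operand: MRx

Description:

Passes the value of MRx directly to the output without performing a multiplication:

Mo = MRx

3.4. LM: register update only (no multiplication)

Opcode:

Operands: O, MR

Description:

MR = selected output result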

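[Encoding diagrams: 1 0 0 0 0 1 | MR (width labels 6, 4); 0 0 0 0 0 1 | MRx2 (width labels 6 51); 1 | MRx1 | MRx2 (field widths 1, 5, 5); 0 | MRx1 | MRx2 (field widths 1, 5, 5).]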

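4. Shifter unit instruction format:

Shifter instructions are 15 bits long and provide unconditional shift instructions.

4.1. S: Shift left (no register update)

Opcode:

Operands: SRx, K (5-bit signed integer, -16 to +15)

Description:

K is a 5-bit signed number with range -16 ~ +15.

SRx can take the following values:

10000 ~ 11111: use the value of SR0 ~ SR15

00000: use 0 as the operand

01011: use Ao as the operand

01100: use Mo as the operand

01101: use So as the operand

01110: use Io as the operand

So = SRx << K, where a negative K denotes a right shift.

4.2. SLA, SLM, SLI: shift and update register

Opcode:

Operands: SRx, K (4-bit signed integer, -8 to +7), SR

Description:

K is a 4-bit signed number with range -8 ~ +7.

So = SRx << K, where a negative K denotes a right shift.

SR = selected output result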
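The signed shift amount used by both shifter forms can be sketched in Python. The description does not say whether a right shift is arithmetic or logical; the sketch assumes an arithmetic shift on 16-bit two's-complement data. The names `sign16` and `shift` are hypothetical.

```python
def sign16(value):
    """Interpret a 16-bit pattern as a two's-complement integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def shift(srx, k, k_bits=5):
    """So = SRx << K, where a negative K shifts to the right.

    k_bits -- width of the K field: 5 for S, 4 for SLA/SLM/SLI
    """
    lowest = -(1 << (k_bits - 1))
    highest = (1 << (k_bits - 1)) - 1
    if not lowest <= k <= highest:
        raise ValueError(f"K must be in {lowest}..{highest}")
    value = sign16(srx)
    # Assumption: right shifts are arithmetic (the sign bit is replicated).
    result = value << k if k >= 0 else value >> -k
    return result & 0xFFFF
```

The same function covers section 4.1 with `k_bits=5` and section 4.2 with `k_bits=4`; only the permitted range of K differs.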

[Encoding diagrams: SRx | K with labels 1 0 and field widths 2, 5, 5, remaining bits don't care; SRx | K | SR with width labels 4, 4.]

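Author Biography

劉建良 was born in Tainan County on January 14, 1973. He received his bachelor's degree from the Department of Electrical Engineering, National Tsing Hua University, in 1996. In 2001 he entered the in-service master's program at National Chiao Tung University, and in 2005 he received his master's degree under the supervision of Professor 劉志尉. This thesis, "Low-Power Programmable Data Stream Design for Heterogeneous Platforms" (適用於異質性平台之低功率可程式化資料流設計), is his master's thesis.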
