5.2 Power Estimation

In this section, we quantify the reduction of register accesses. Then we conduct an experiment to see the degree of power saving.

5.2.1 Register Accesses Per Operation

The composite FUs make fewer register accesses in total because a single instruction performs several operations. This directly cuts both the read and the write accesses.

#: register accesses per operation (actual); N: normalized; AVG: average of N.
Benchmark column headers recovered: Biquad, FIR, CFIR, LPFIR, IMDCT, FFT (nine benchmark columns in total; the column order is not recoverable).

Config       #    N     #    N     #    N     #    N     #    N     #    N     #    N     #    N     #    N     AVG
(no label)   3.00 1.00  3.00 1.00  3.00 1.00  3.00 1.00  3.00 1.00  2.76 1.00  2.82 1.00  2.78 1.00  3.00 1.00  1.00
(no label)   3.00 1.00  3.00 1.00  3.00 1.00  3.00 1.00  3.00 1.00  2.76 1.00  2.82 1.00  2.78 1.00  3.00 1.00  1.00
AMS          3.00 1.00  3.00 1.00  2.30 0.77  2.65 0.88  3.00 1.00  2.76 1.00  2.06 0.73  2.10 0.75  2.58 0.86  0.88
ASM          3.00 1.00  3.00 1.00  2.30 0.77  2.65 0.88  3.00 1.00  2.76 1.00  2.14 0.76  2.34 0.84  2.58 0.86  0.90
MAS          2.03 0.68  2.49 0.83  2.39 0.80  2.41 0.80  2.47 0.82  2.76 1.00  2.28 0.81  2.39 0.86  2.77 0.92  0.83
MSA          2.03 0.68  2.49 0.83  2.39 0.80  2.41 0.80  2.47 0.82  2.29 0.83  2.46 0.87  2.23 0.80  2.77 0.92  0.82
SAM          3.00 1.00  3.00 1.00  2.30 0.77  2.65 0.88  3.00 1.00  2.29 0.83  2.34 0.83  2.39 0.86  2.58 0.86  0.89
SMA          2.03 0.68  2.49 0.83  2.39 0.80  2.41 0.80  2.47 0.82  2.29 0.83  2.54 0.90  2.17 0.78  2.77 0.92  0.82
(no label)   3.00 1.00  3.00 1.00  3.00 1.00  3.00 1.00  3.00 1.00  2.76 1.00  2.82 1.00  2.78 1.00  3.00 1.00  1.00
AAMS         2.55 0.85  2.52 0.84  2.04 0.68  2.18 0.73  2.73 0.91  2.41 0.87  1.90 0.68  2.05 0.74  2.24 0.75  0.78
AASM         2.55 0.85  2.52 0.84  2.04 0.68  2.18 0.73  2.73 0.91  2.41 0.87  1.98 0.70  2.28 0.82  2.24 0.75  0.79
AMAS         2.48 0.83  2.49 0.83  1.70 0.57  2.18 0.73  2.47 0.82  2.41 0.87  1.90 0.68  2.05 0.74  2.24 0.75  0.75
AMSA         2.48 0.83  2.49 0.83  1.70 0.57  2.18 0.73  2.47 0.82  1.95 0.71  1.96 0.70  1.98 0.71  2.24 0.75  0.73
ASAM         2.55 0.85  2.52 0.84  2.04 0.68  2.18 0.73  2.73 0.91  1.95 0.71  2.04 0.72  2.23 0.80  2.24 0.75  0.77
ASMA         2.48 0.83  2.49 0.83  1.70 0.57  2.18 0.73  2.47 0.82  1.95 0.71  2.04 0.72  2.23 0.80  2.24 0.75  0.75
MAAS         2.00 0.67  2.02 0.67  2.39 0.80  2.06 0.69  2.20 0.73  2.41 0.87  2.17 0.77  2.23 0.80  2.47 0.82  0.76
MASA         2.03 0.68  2.02 0.67  2.39 0.80  2.06 0.69  2.20 0.73  1.95 0.71  2.24 0.79  2.21 0.80  2.47 0.82  0.74
MSAA         2.03 0.68  2.02 0.67  2.39 0.80  2.06 0.69  2.20 0.73  1.95 0.71  2.39 0.85  2.16 0.78  2.47 0.82  0.74
SAAM         2.55 0.85  2.52 0.84  2.04 0.68  2.18 0.73  2.73 0.91  1.95 0.71  2.19 0.78  2.37 0.85  2.24 0.75  0.78
SAMA         2.48 0.83  2.49 0.83  1.70 0.57  2.18 0.73  2.47 0.82  1.95 0.71  2.19 0.78  2.37 0.85  2.24 0.75  0.76
SMAA         2.03 0.68  2.02 0.67  2.39 0.80  2.06 0.69  2.20 0.73  1.95 0.71  2.46 0.87  2.26 0.81  2.47 0.82  0.75

Table 5-7 Register accesses per operation

Table 5-7 outlines the register accesses per operation. First, all accesses are counted for each case. For both the scalar and the VLIW, an adder makes 2R/1W accesses, a multiplier makes 2R/1W accesses, and a shifter makes 1R/1W accesses. A composite FU with 3 FUs (1A1M1S) makes 3R/1W accesses, and a composite FU with 4 FUs (2A1M1S) makes 4R/1W accesses. Furthermore, when a composite FU executes only a sub-function of its cascade, the accesses of that sub-function are the ones counted. All the data are then normalized so that the scalar is 1.00.

The last column is the geometric mean.
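The counting rule can be illustrated with a small sketch. The per-FU read counts are the ones stated above. The rule that each cascaded operation after the first takes one operand from the preceding FU (and so reads one register fewer), the function names, and the example MAC kernel are assumptions made here for illustration; the rule was chosen because it reproduces the 3R/1W and 4R/1W figures.

```python
from math import prod

# Register reads needed by each primitive operation when issued alone
# (adder 2R, multiplier 2R, shifter 1R); every instruction writes 1 result.
READS = {"add": 2, "mul": 2, "shift": 1}


def instruction_accesses(chain):
    """Register accesses (reads + 1 write) of one instruction.

    `chain` is the sequence of operations cascaded in the instruction.
    Assumption: every operation after the first receives one operand from
    the preceding FU, so it needs one register read fewer.
    """
    reads = READS[chain[0]] + sum(READS[op] - 1 for op in chain[1:])
    return reads + 1


def accesses_per_operation(instructions):
    """Total register accesses divided by total primitive operations."""
    total_accesses = sum(instruction_accesses(c) for c in instructions)
    total_operations = sum(len(c) for c in instructions)
    return total_accesses / total_operations


def geometric_mean(values):
    """Geometric mean, as used for the last column of the table."""
    return prod(values) ** (1.0 / len(values))


# Hypothetical kernel: four multiply-accumulate steps.
scalar = [("mul",), ("add",)] * 4   # one operation per instruction
composite = [("mul", "add")] * 4    # multiply and add cascaded in one instruction

scalar_apo = accesses_per_operation(scalar)        # 3.0
composite_apo = accesses_per_operation(composite)  # 2.0
normalized = composite_apo / scalar_apo            # about 0.67
```

Under these assumptions a pure MAC kernel drops from 3.00 to 2.00 accesses per operation, which is the same order as the best entries in the table.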

Table 5-8 shows the register accesses per operation of the various FU configurations for each benchmark. On average, the 3-FU and the 4-FU composite FUs make about 18% and 27% fewer register accesses per operation, respectively, than the scalar and the VLIW.

Benchmark   Scalar   3-way VLIW

Table 5-8 Outline of register accesses per operation

5.2.2 Target Simulated Architecture

Figure 5-5 gives an overview of how the FU and the RF are mapped into a real architecture. We assume that an instruction memory, a data memory, and a load/store unit are present so that the application can actually run.

Figure 5-5 Simulated architecture

The shaded area enclosed by the dotted line contains the FU and the RF, which are the two parts of interest. In our experiment, we substitute the FU and RF pairs of the scalar, the VLIW, and the composite FU (MSA) for these two parts and track the power consumption of each. Recall that the L/S unit is independent of the FU because of the assumption made earlier for the software analysis.

5.2.3 Ping-Pong Register File

In the analysis and results so far, load/store handling was decoupled from the datapath. Because the power estimate should be close to the real situation, we now reconsider the effect of the load/store operations.

There are two methods to handle the load/store operations.

(1) Add RF ports so that the FUs get the right data transparently in the execution sequence.

(2) Add load/store cycles.

The latter is simple, but it needs extra cycles because the I/O pattern of the register file must be considered.

Without changing the scenario, we use a ping-pong register file, which keeps the design simple and avoids both the new ports of (1) and the performance loss of (2).

Assume that the load bandwidth is fully capable of supplying enough data. For example, because executing a MAC requires loading a coefficient and a data input, the load bandwidth must be twice the arithmetic bandwidth. However, there is some control hardware overhead if the register file must accept two values written at the same time through the same MUX.

We mirror a 16-bit RF with 16 registers and execute in ping-pong mode.

As Figure 5-6 shows, the write ports of the ping-pong RFs are accessed alternately by the load/store unit and the FU. The ports of the FU and the RF for the different datapaths are listed below.

                      Scalar   VLIW    MSA
FU (Input / Output)   2 / 1    5 / 3   3 / 1
RF (Read / Write)     2 / 1    5 / 3   3 / 1

Figure 5-6 Data flow of FU and RF

Besides, power is strongly related to the activity pattern, because almost all of the power is consumed by signal transitions in the logic network. Figure 5-7 shows the access pattern derived from Figure 5-6. The hardware gets the right data in the corresponding cycles, so the ping-pong execution resolves the problems that arise from cooperating with the load/store unit. These two threads make up the total application execution.

Figure 5-7 Access pattern of FU and RF
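The ping-pong behavior described in this subsection can be modeled in a few lines. This is a behavioral sketch only, not the synthesized hardware: the class name, the method names, and the explicit `swap` call are assumptions made for illustration, while the 16 registers of 16 bits follow the text.

```python
class PingPongRegisterFile:
    """Two mirrored register banks; the FU and the load/store unit each own one."""

    def __init__(self, num_regs=16, width=16):
        self.mask = (1 << width) - 1
        self.banks = [[0] * num_regs, [0] * num_regs]
        self.fu_bank = 0  # index of the bank currently visible to the FU

    def swap(self):
        """Exchange the roles of the two banks (ping <-> pong)."""
        self.fu_bank ^= 1

    def fu_read(self, index):
        return self.banks[self.fu_bank][index]

    def fu_write(self, index, value):
        self.banks[self.fu_bank][index] = value & self.mask

    def ls_write(self, index, value):
        # Loads fill the bank the FU is not using, so the FU-side
        # port count stays unchanged.
        self.banks[self.fu_bank ^ 1][index] = value & self.mask

    def ls_read(self, index):
        return self.banks[self.fu_bank ^ 1][index]


rf = PingPongRegisterFile()
rf.ls_write(0, 3)        # load a coefficient into the idle bank
rf.ls_write(1, 5)        # load a data sample into the idle bank
hidden = rf.fu_read(0)   # the FU still sees its own bank
rf.swap()
rf.fu_write(2, rf.fu_read(0) * rf.fu_read(1))  # the FU consumes the loaded operands
rf.swap()
stored = rf.ls_read(2)   # the load/store unit drains the result
```

While the FU works on one bank, the load/store unit prepares the other, which is why neither extra ports nor extra cycles appear in the FU's view.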

5.2.4 Simulation Results

This experiment measures the power consumption of the scalar, the 3-way VLIW, and the MSA-ordered composite FU. We use Synopsys PrimePower as the simulation tool; it estimates the power consumption from the gate-level netlists.

The application is executed as a streaming process. A Remez 16-tap FIR filter with 1,024 Gaussian-distributed random input patterns is used as the test application. Table 5-9 lists the total cycles of the test application on each datapath.

Number of cycles   Scalar   MSA     3-VLIW
Random input       1024     1024    1024
Remez filter       16       16      16
FIR (once)         31       16      17
Total              31744    16384   17408

Table 5-9 Execution cycles

First, consider the factors that affect the power P. The three main factors are the operating frequency f, the capacitance C, and the supply voltage V, which relate to power as P ∝ f·C·V². The voltage V is constant for the TSMC 0.13 cell library, f is set by the cycle time, and C is related to the area.

Table 5-10 is derived from the OPC analyzed earlier and the performance goal, and it is used as the synthesis timing constraint. Table 5-11 shows the synthesized area under the cycle times in Table 5-10. Because the critical path leaves little slack under the timing constraint, the area of the MSA-ordered composite FU explodes at 400 MOPS.

Cycle time (unit: ns)   Scalar   MSA     3-VLIW
100MOPS                 10.00    19.38   18.23
200MOPS                 5.00     9.69    9.12
400MOPS                 2.50     4.84    4.56

Table 5-10 Cycle time
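The cycle times can be reproduced with a short calculation. The sketch below assumes that the OPC of each datapath on the test FIR equals the 31 operations of one output (the scalar cycle count in Table 5-9) divided by that datapath's cycles per output; the function and variable names are illustrative. With this assumption the results agree with Table 5-10 to within 0.01 ns.

```python
def cycle_time_ns(opc, target_mops):
    """Cycle time that delivers `target_mops` million operations per second
    when the datapath completes `opc` operations per cycle."""
    return opc / target_mops * 1000.0


OPS_PER_OUTPUT = 31  # assumption: one scalar cycle per operation (Table 5-9)
CYCLES_PER_OUTPUT = {"Scalar": 31, "MSA": 16, "3-VLIW": 17}

cycle_times = {
    mops: {
        name: cycle_time_ns(OPS_PER_OUTPUT / cycles, mops)
        for name, cycles in CYCLES_PER_OUTPUT.items()
    }
    for mops in (100, 200, 400)
}

for mops, row in cycle_times.items():
    print(mops, {name: round(value, 2) for name, value in row.items()})
```

A higher OPC therefore buys a proportionally longer cycle time at the same throughput, which relaxes the synthesis timing constraint.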

Area (unit: um2). Ping, Pong: the two RF banks; Sum: Ping + Pong.

          Scalar                        MSA                           3-VLIW
          FU     Ping   Pong   Sum      FU     Ping   Pong   Sum      FU     Ping   Pong   Sum
100MOPS   12724  16047  16047  32094    12296  18892  18892  37784    12289  31733  31733  63466
200MOPS   13012  16047  16047  32094    12875  18892  18892  37784    12289  31733  31733  63466
400MOPS   30204  16047  16047  32094    21920  18892  18892  37784    13092  31733  31733  63466

Table 5-11 Synthesis area

Table 5-12 shows the power consumption of each part of interest. The power of the MSA is lower in most cases because its longer cycle time means a lower clock frequency. The FU power of the composite FU grows sharply at 400 MOPS because of the area explosion.

Power (unit: mW). Ping, Pong: the two RF banks; Sum: Ping + Pong; Total: FU + Sum.

          Scalar                               MSA                                  3-VLIW
          FU     Ping   Pong   Sum    Total    FU     Ping   Pong   Sum    Total    FU     Ping   Pong   Sum    Total
100MOPS   0.454  0.608  0.604  1.212  1.666    0.264  0.566  0.560  1.126  1.390    0.234  0.701  0.703  1.404  1.638
200MOPS   0.994  1.208  1.202  2.410  3.404    0.579  1.120  1.107  2.227  2.806    0.460  1.405  1.403  2.808  3.268
400MOPS   5.670  2.406  2.397  4.803  10.473   2.699  2.237  2.228  4.465  7.164    1.203  2.812  2.807  5.619  6.822

Table 5-12 Power consumption

Compared with the scalar and the VLIW, the composite FU saves 16.5% to 31.6% of power consumption under the 100 to 200 MOPS performance requirement, as Table 5-13 shows. The saving comes from the smaller number of RF ports required by the composite FUs and from their fewer register accesses per operation. Figure 5-8 shows bar charts comparing power and energy.

Power improvement (normalized to scalar) (%)

          MSA                      3-VLIW
          FU      RF     Total     FU      RF     Total
100MOPS   -41.8   -7.1   -16.6     -48.4   15.8   -1.7
200MOPS   -41.7   -7.6   -17.6     -53.7   16.5   -4.0
400MOPS   -52.4   -7.0   -31.6     -78.8   17.0   -34.9

Table 5-13 Power improvement
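The percentages in Table 5-13 follow from the power figures in Table 5-12 as the relative change with respect to the scalar. The sketch below recomputes them; the dictionary layout and function names are illustrative choices, and the results match the table to within 0.1 percentage point.

```python
# Power in mW taken from Table 5-12: (FU, RF sum, total) for each datapath.
POWER_MW = {
    100: {"Scalar": (0.454, 1.212, 1.666),
          "MSA": (0.264, 1.126, 1.390),
          "3-VLIW": (0.234, 1.404, 1.638)},
    200: {"Scalar": (0.994, 2.410, 3.404),
          "MSA": (0.579, 2.227, 2.806),
          "3-VLIW": (0.460, 2.808, 3.268)},
    400: {"Scalar": (5.670, 4.803, 10.473),
          "MSA": (2.699, 4.465, 7.164),
          "3-VLIW": (1.203, 5.619, 6.822)},
}


def improvement_percent(power, baseline):
    """Relative change of `power` with respect to `baseline`, in percent.
    Negative values mean less power than the baseline."""
    return (power - baseline) / baseline * 100.0


def improvement_row(mops, datapath):
    """(FU, RF, total) change of `datapath` relative to the scalar."""
    scalar = POWER_MW[mops]["Scalar"]
    return tuple(
        improvement_percent(p, s) for p, s in zip(POWER_MW[mops][datapath], scalar)
    )


for mops in (100, 200, 400):
    for datapath in ("MSA", "3-VLIW"):
        fu, rf, total = improvement_row(mops, datapath)
        print(f"{mops} MOPS {datapath}: FU {fu:.1f}%, RF {rf:.1f}%, total {total:.1f}%")
```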


Figure 5-8 Comparison of (a) power (b) energy
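An energy figure can be estimated from the tables above. The sketch assumes that energy is the total power of Table 5-12 multiplied by the execution time, and that the execution time is the total cycles of Table 5-9 multiplied by the cycle time of Table 5-10; whether Figure 5-8 (b) was produced exactly this way is not stated, so this is an illustration for the 100 MOPS case only.

```python
def energy_uj(power_mw, cycles, cycle_ns):
    """Energy in microjoules: mW * ns gives pJ, and 1e6 pJ = 1 uJ."""
    return power_mw * cycles * cycle_ns / 1e6


# (total power in mW, total cycles, cycle time in ns) at 100 MOPS.
RUNS_100MOPS = {
    "Scalar": (1.666, 31744, 10.00),
    "MSA": (1.390, 16384, 19.38),
    "3-VLIW": (1.638, 17408, 18.23),
}

energy = {name: energy_uj(*run) for name, run in RUNS_100MOPS.items()}

for name, value in energy.items():
    print(f"{name}: {value:.4f} uJ")
```

Because all three datapaths are constrained to the same throughput, their execution times are nearly equal, so under this assumption the energy ranking follows the power ranking.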

6 Summary & Future Work

In this thesis, we propose composite functional units, which cascade all the primitive FUs in a customized order determined by analyzing the DFGs (data-flow graphs) of the target applications, to improve datapath utilization. The composite FU with 3 primitive functional units achieves an OPC of 1.35 on average, and its OPC is comparable to that of the VLIW in several benchmarks.

Besides, the composite FU occupies 10% to 25% less area than the VLIW and saves 16.5% to 31.6% of power consumption compared with both the scalar and the VLIW under performance targets ranging from 100 to 300 MOPS.

Although the composite FUs may result in a long critical path, pipelining can be applied to raise the clock rate. A flexible pipelining design flow is also proposed to assist in FU pipelining. Additionally, interleaved multithreading (IMT) can be applied to hide the pipeline latency completely if enough threads are supported.

The area comparison in Section 5-1 shows that the hardware cost of adding threads is small for the composite FU, so the IMT architecture suits the composite FUs well. By contrast, the hardware cost of a VLIW combined with IMT is too high, so other methods of hiding instruction latency must be used for it.

In Chapter 4, we proposed an Application Programmable Processor Synthesis Flow to design a processor based on composite FUs. The composite FU selection flow helps the user find a proper composite FU for a specific application.

Future Work

• Thread Register File Reduction

Multithreading, in our experience, incurs a large context overhead. Because the IMT architecture needs a thread register file for each thread, the area and the hardware overhead increase with the number of threads. There are other approaches, such as the register file architecture using master latch sharing described in [], that lessen this side effect. In the future, we will continue to study how to reduce the overhead incurred by multithreading.

Shared master latching

[31] introduces a method of reducing the area and power consumption of a synthesizable register tile by using a single master latch shared by a number of slaves. Simulation results show that, depending on the size of the register tile, a reduction in power consumption of more than 50% is achievable.

Data stores are an important power-critical part of resource-sharing architectures [7] and of processing units such as application-specific instruction set processors (ASIPs). They are preferably implemented as a synthesizable register file described as part of the design at the register transfer level, because timing verification of RAM requires a high effort.


Figure 6-1 (a) D flip-flop (b) Word-level register

Figure 6-1 (a) is a typical D flip-flop, composed of a master latch and a slave latch. Several D flip-flops make up a word-level register, as in Figure 6-1 (b). The actual data are stored in the slave latches.

This method also aims to reduce the capacitance connected to the data bus. This is achieved by splitting the master-slave flip-flops into master latches and slave latches. If clock gating is applied, the slave latches of the registers in the register tile can share one master latch, so the number of master latches connected to the data bus decreases. Additional savings in area can be expected.

Figure 6-2 (a) shows a conventional register file with flip-flops, and Figure 6-2 (b) shows the modified register file with shared master latches.

Figure 6-2 (a) Conventional register file with flip-flops (b) Modified register file with shared master latches
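A rough latch count shows where the saving comes from. The sketch assumes one shared master latch per bit position for the whole register tile, which is one reading of the description above and not a figure taken from [31]; the function name and the 16 x 16 tile size are illustrative.

```python
def latch_counts(num_registers, word_bits):
    """Latch counts of a register tile with and without master sharing.

    Assumption: with sharing, all registers of the tile use one master
    latch per bit position, and every bit keeps its own slave latch.
    """
    slaves = num_registers * word_bits
    conventional_masters = num_registers * word_bits
    shared_masters = word_bits
    return {
        "conventional_total": conventional_masters + slaves,
        "shared_total": shared_masters + slaves,
        "masters_on_bus_conventional": conventional_masters,
        "masters_on_bus_shared": shared_masters,
    }


counts = latch_counts(num_registers=16, word_bits=16)
for key, value in counts.items():
    print(key, value)
```

For a tile of sixteen 16-bit registers this gives 272 latches instead of 512, and only 16 master latches load the data bus instead of 256.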

• Fused FU and VLIW Cooperation

In this thesis, we demonstrate that the composite FU effectively increases datapath utilization. However, the long critical path incurs other overheads, such as pipeline latency and register file complexity. Another research direction will focus on fused FUs such as the MAC (multiply and add). We can cascade two or three frequently used primitive operations into a fused FU and then design a VLIW processor with several fused FUs as its primitive FUs. Compared with conventional VLIW processors with the same computing resources, multi-issue fused FUs still demand fewer register file ports. We expect that multi-issue fused FUs will exploit more operation parallelism and thus achieve a higher OPC while avoiding a long critical path.

• Merge Shifter Into Multiplier

Because shift operations are a small fraction of the total operations, the shifter is usually idle. Shift operations mostly occur between different applications and in data transformations for alignment. We may be able to merge the shifter into the multiplier by bypassing the corresponding bits from the product of the multiplier, with little hardware effort. The software analysis may then need modification.
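The idea of taking shift results from the multiplier product can be checked in software. The sketch below is a bit-level illustration under assumed conditions (unsigned 16-bit operands, logical shifts, shift amounts below the word width); it is not a hardware design, and all names are illustrative. A left shift by k is the low half of x * 2^k, and a right shift by k is the high half of x * 2^(16 - k).

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


def multiply(a, b):
    """Unsigned WIDTH x WIDTH multiplier producing a 2*WIDTH-bit product."""
    return (a & MASK) * (b & MASK)


def shift_left(x, k):
    """x << k, truncated to WIDTH bits, read from the low half of the product."""
    if not 0 <= k < WIDTH:
        raise ValueError("shift amount out of range")
    return multiply(x, 1 << k) & MASK


def shift_right(x, k):
    """Logical x >> k, read from the high half of the product x * 2**(WIDTH - k)."""
    if not 0 <= k < WIDTH:
        raise ValueError("shift amount out of range")
    if k == 0:
        return x & MASK  # 2**WIDTH does not fit a WIDTH-bit multiplier input
    return multiply(x, 1 << (WIDTH - k)) >> WIDTH


# Compare against the native shift operators for a few operands.
for x in (0x0001, 0x00F3, 0xF003, 0xFFFF):
    for k in range(WIDTH):
        assert shift_left(x, k) == (x << k) & MASK
        assert shift_right(x, k) == x >> k
```

The only extra hardware implied by this scheme is the decoding of the shift amount into a power of two and the selection of the product half, which is consistent with the expectation of little hardware effort.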

References

[1] J. L. Hennessy and D. A. Patterson, Computer Architecture – A Quantitative Approach, 4th edition, Morgan Kaufmann, 2006.

[2] DECsystem-10/DECSYSTEM-20 Processor Reference Manual, DEC, 1982.

[3] S. Rixner, et al., "Register organization for media processing," in Proc. HPCA-6, pp. 375-386, 2000.

[4] G. T. Byrd and M. A. Holliday, "Multithreaded processor architectures," IEEE Spectrum, August 1995.

[5] T. Ungerer, B. Robic, and J. Silc, "A survey of processors with explicit multithreading," ACM Computing Surveys, Vol. 35, No. 1, pp. 29-63, March 2003.

[6] J. Glossner, "Sandblaster Low Power DSP," IEEE Custom Integrated Circuits Conference, 2004.

[7] K. K. Parhi, VLSI Digital Signal Processing Systems – Design and Implementation, John Wiley & Sons, 1999.

[8] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing, 2nd edition, Prentice Hall, 1999.

[9] X. Y. Li, M. F. Stallmann, and F. Brglez, "Effective bounding techniques for solving unate and binate covering problem," ACM/IEEE Design Automation Conference, 2005.

[10] R. Cordone, F. Ferrandi, D. Sciuto, and R. W. Calvo, "An efficient heuristic approach to solve the unate covering problem," Proc. Design Automation and Test in Europe, pp. 364-371, 2000.

[11] O. Coudert, "On solving binate covering problems," in Proc. Design Automation Conference, pp. 197-202, June 1996.

[12] R. C. Larson and A. R. Odoni, Urban Operations Research, 2nd edition, Dynamic Ideas, 2007.

[13] S. Liao, S. Devadas, K. Keutzer, and S. Tjiang, "Instruction selection using binate covering for code size optimization," in Proc. International Conference on Computer-Aided Design, 1995.

[14] G. Dittmann, "Organizing libraries of DFG patterns," Proc. DATE, 2004.

[15] G. Dittmann, "Organizing pattern libraries for ASIP design," IBM Research Report RZ3488, April 2003.

[16] T. J. Lin, P. C. Hsiao, C. W. Liu, and C. W. Jen, "Area-efficient register organization for fully-synthesizable VLIW DSP cores," International Journal of Electrical Engineering, May 2006 (EI).

[17] P. Chretienne, E. G. Coffman, Jr., J. K. Lenstra, and Z. Liu, Scheduling Theory and its Application, Wiley, June 1995.

[18] P. G. Paulin and J. P. Knight, "Force-directed scheduling for the behavioral synthesis of ASIC's," IEEE Transactions on CAD, Vol. 8, No. 6, June 1989.

[19] Y. N. Chang, C. Y. Wang, and K. K. Parhi, "Loop-list scheduling for heterogeneous functional units," in 6th Great Lakes Symposium on VLSI, pp. 2-7, March 1996.

[20] S. Govindarajan and R. Vemuri, "Cone-based clustering heuristic for list-scheduling algorithms," in Proc. European Design & Test Conference (ED&TC), pp. 456-462, March 1997.

[21] D. D. Gajski, N. D. Dutt, A. C. Wu, and S. Y. Lin, High-Level Synthesis – Introduction to Chip and System Design, Kluwer Academic Publishers, 1992.

[22] C. Liem, T. May, and P. Paulin, "Instruction-set matching and selection for DSP and ASIP code generation," IEEE European Design and Test Conference, EDAC, 1994.

[23] J. Shu, T. C. Wilson, and D. K. Banerji, "Instruction-set matching and GA-based selection for embedded-processor code generation," 9th International Conference on VLSI Design, January 1996.

[24] J. V. Praet, G. Goossens, D. Lanneer, and H. D. Man, "Instruction set definition and instruction selection for ASIPs," Proc. 7th IEEE/ACM Int. Symp. on High-Level Synthesis, Niagara-on-the-Lake, May 1994.

[25] G. Dittmann and A. Herkersdorf, "Multilayer intermediate representation for ASIP design and critical-path optimization," Technical Report RZ 3484, IBM Research, February 2003.

[26] DSPStone - A DSP oriented Benchmarking Methodology, in ICSPAT, Aachen University of Technology, 1994.

[27] BDTI, http://www.bdti.com

[28] S. Gordon, "Simplified use of 8x8 transform," Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, doc. JVT-I022, San Diego, USA, September 2003.

[29] Independent JPEG Group, http://www.ijg.org

[30] MAD: MPEG Audio Decoder, http://www.underbit.com/products/mad/

[31] M. Wroblewski, M. Mueller, A. Wortmann, S. Simon, W. Pieper, and J. A. Nossek, "A power efficient register file architecture using master latch sharing," in Proc. ISCAS, May 2003.

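Author Biography

卓毅 was born in Taipei on April 21, 1983. The author received a bachelor's degree from the Department of Electronics Engineering, National Chiao Tung University, in 2005 and continued at the Institute of Electronics, National Chiao Tung University, to pursue a master's degree, which was received in 2007 under the supervision of Professor 劉志尉. This thesis, 「具複雜運算單元之低功率多執行緒資料路徑的研究與設計」 (Research and Design of a Low-Power Multithreaded Datapath with Composite Functional Units), is the author's master's thesis.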
