
Chapter 3 Design of ALU Cluster Microarchitecture

3.4 System Operation

To increase computation throughput and shorten the operation period, pipelined operation is a well-established approach. As in most high-performance processors, the ALU Cluster therefore operates in a pipelined manner to reach higher instruction throughput. The pipeline execution diagram of the ALU Cluster is depicted in Figure 3.4.1. Executing one instruction passes through the stages FETCH, DECODE, READ REGISTER, EXECUTE 1 to EXECUTE N, and WRITE BACK.

Figure 3.4.1: Pipeline Execution Diagram Details

During the first pipeline stage (FETCH) in cycle N, the decoder fetches and sequences the VLIW-like instructions from the instruction microcode storage. During the decoding stage (DECODE), the decoder decodes the incoming instructions and delivers the decoded results to the controller. During the register file read stage (READ REGISTER), the controller directs the data storage units to read out the required data. The data mainly comes from the off-chip data memory, the function unit's own IRF unit, or the SPRF unit, and is then sent to the dedicated function unit. During the execution stage (EXECUTE), each function unit that has been assigned an operation begins to execute it. The number of execution clock cycles depends on the type of function unit: 2 clock cycles for the ALU unit, 4 for the MUL unit, and 16 for the DIV unit. Finally, during the register file write stage (WRITE BACK), the controller writes the results from the function unit back to the assigned data storage unit. As in the read stage, the destination is mainly the off-chip data memory, one of the IRF units, or the SPRF unit.
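The stage sequence and execution latencies described above can be sketched in a few lines of Python. This is only an illustrative model: it assumes that FETCH, DECODE, READ REGISTER, and WRITE BACK each take exactly one clock cycle, which the text does not state explicitly, and the function names are hypothetical.

```python
# Execute-stage latencies quoted in the text (clock cycles).
EXECUTE_CYCLES = {"ALU": 2, "MUL": 4, "DIV": 16}

# Assumption: every non-execute stage occupies exactly one clock cycle.
SINGLE_CYCLE_STAGES = ("FETCH", "DECODE", "READ REGISTER", "WRITE BACK")


def stage_sequence(unit: str) -> list:
    """Stage occupied in each cycle: FETCH ... EXECUTE 1..N ... WRITE BACK."""
    execute = [f"EXECUTE {i}" for i in range(1, EXECUTE_CYCLES[unit] + 1)]
    return ["FETCH", "DECODE", "READ REGISTER", *execute, "WRITE BACK"]


def instruction_latency(unit: str) -> int:
    """Cycles from the start of FETCH to the end of WRITE BACK."""
    return len(SINGLE_CYCLE_STAGES) + EXECUTE_CYCLES[unit]


if __name__ == "__main__":
    for unit in EXECUTE_CYCLES:
        print(unit, instruction_latency(unit), "cycles")
```

Under these assumptions an ALU instruction occupies the pipeline for 6 cycles, a MUL instruction for 8, and a DIV instruction for 20.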

In summary, when no hazards occur among the VLIW-like instructions, the sequential pipeline operation of the ALU Cluster proceeds as shown in Figure 3.4.2. Although the VLIW-like instructions are scheduled statically and issued in sequence to the ALU Cluster, any hazard during execution can stall the instructions that follow. If a hazard is encountered, instructions issued earlier continue to execute, while instructions issued later are stalled and are re-executed once the stall condition no longer holds.
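The stall behaviour can be illustrated with a deliberately simplified Python model. It assumes in-order issue of one VLIW-like instruction per cycle and models a hazard as a number of extra cycles inserted before the affected instruction. The `stalls` mapping and the function name are hypothetical and do not describe the actual hazard-detection logic of the chip.

```python
def issue_cycles(num_instructions, stalls=None):
    """Return the FETCH cycle of each instruction under in-order issue.

    stalls maps an instruction index to the number of stall cycles that
    instruction must wait (hypothetical input). Earlier instructions are
    unaffected; the stalled instruction and all later ones are delayed.
    """
    stalls = stalls or {}
    cycles = []
    next_cycle = 0
    for index in range(num_instructions):
        next_cycle += stalls.get(index, 0)  # wait until the hazard clears
        cycles.append(next_cycle)
        next_cycle += 1                     # next instruction issues one cycle later
    return cycles


if __name__ == "__main__":
    print(issue_cycles(4))             # hazard-free sequence
    print(issue_cycles(4, {2: 3}))     # instruction 2 stalls for 3 cycles
```

In the second call, instructions 0 and 1 keep their original issue cycles, while instructions 2 and 3 are each pushed back by three cycles.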


Figure 3.4.2: Sequence Pipeline Operation Diagram

Chapter 4 Implementation

In this chapter, the ALU Cluster microarchitecture described in the previous chapter is implemented with the cell-based design method. The EDA flow used for the implementation is introduced, and the circuit implementation results are listed. The verification of this work is then discussed, covering the benchmark choice, the chip configuration for simulation, the functionality verification, and the performance evaluation. Finally, performance comparisons with related architectures, the implementation of power-saving techniques, and a brief summary of this work are presented.

4.1 Design Flow

Implementing the proposed ALU Cluster microarchitecture, from the defined specifications to the finished die, requires a practical design methodology. For most traditional digital circuit designs, computer-aided design (CAD) tools support this process. With their help, the circuit design time can be shortened greatly, and errors are easier to detect and fix during verification and debugging. A complete digital circuit design flow based on a provided standard cell library, the cell-based design flow, is shown in Figure 4.1.1.

Three main kinds of CAD tools are used in this work: a simulator, a synthesizer, and an automatic placement and routing (APR) tool. The major steps of the design flow are architecture design, register transfer level (RTL), gate level, physical level, verification, and tape-out. These steps are explained as follows:

1. Architecture design: This is the initial step in designing an integrated circuit (IC). The detailed specifications and components of the ALU Cluster must be determined definitively and feasibly.

2. RTL: The chosen architecture is described in hardware description language (HDL) code that captures the behavior of each module, and is simulated with a tool such as Cadence NC-Verilog [24]. Novas Debussy [25] is used in this step to verify the functional simulation without taking any timing delays into account. If the RTL functional simulation does not match the required specification, the HDL code is corrected or the architecture is modified until the requirements are met.

3. Gate-level: Once the functional simulation meets the architecture specification, the synthesizable RTL code is synthesized into logic cells with a CAD tool such as Synopsys Design Compiler [21]. The target process technology and the essential synthesis constraints are selected and set to meet the performance requirements. A functional simulation that takes gate delays into account is then performed as pre-layout verification.

4. Physical-level: In this step the synthesized netlist of logic cells is transformed from the gate-level model into the transistor-level model. An APR tool, such as Synopsys Astro [21], completes the physical implementation. The basic APR flow consists of the following sub-steps: global net connection specification, floorplanning setup, timing setup, placement and optimization, clock tree synthesis, global net connection, routing and optimization, and stream-out. Both gate delay and wire delay are taken into account in the post-layout functional simulation.

5. Verification: Two further post-layout verifications are necessary: the design rule check (DRC) and the layout versus schematic (LVS) check. DRC checks the physical layout data against the fabrication design rules, since the design rule document is the golden reference that every design must follow. LVS checks the connectivity of the physical layout against the corresponding schematic circuit netlist. Mentor Calibre [26] can be used for these verifications.

6. Tape-out: After passing through the complete design flow, the final physical layout is produced and can then be fabricated at the foundry.

Figure 4.1.1: Cell-Based Design Flow

4.2 Circuit Implementation and Results

The circuit characteristics of this work are summarized in Table 4.2.1. The implementation uses the UMC 0.18 um CMOS process and the Artisan design kit. The post-layout operating frequency of the ALU Cluster is 100 MHz.

The chip size, core size, gate count, and power dissipation are about 3 mm², 2.2 mm², 411491, and 968 mW, respectively. This work includes fifteen memories in total. Four 32 x 128 single-port static RAMs (SRAMs) and one 14 x 128 single-port SRAM form the instruction memory, which stores the instructions to be executed. Ten 32 x 32 single-port SRAMs form the data memory, which stores the data required for program execution. Without these memories, the core size, gate count, and power dissipation are approximately 1.47 mm², 255669, and 312 mW, respectively. The physical layout of the ALU Cluster is depicted in Figure 4.2.1. The core utilization is close to 88.8%. The floorplan and pad assignment are shown in Figure 4.2.2. There are 127 input/output (I/O) pads in total, comprising 47 input pads, 32 output pads, and 48 power pads. The definition of the I/O ports is summarized in Table 4.2.2. The die microphotograph of the taped-out chip is shown in Figure 4.2.3. The chip is packaged in a CQFP128 package, and a photograph of the packaged prototype is shown in Figure 4.2.4.
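The memory and pad figures quoted above can be cross-checked with a short Python sketch. It reads each SRAM size as word width by number of words (for example, 32 x 128 as 128 words of 32 bits) and assumes the five instruction SRAMs are read in parallel to form one instruction word. Both readings are assumptions rather than statements from the text.

```python
# (word width in bits, number of words, number of instances)
# Assumption: "32 x 128" means 128 words of 32 bits each.
INSTRUCTION_SRAMS = [(32, 128, 4), (14, 128, 1)]
DATA_SRAMS = [(32, 32, 10)]


def memory_count(srams):
    """Number of SRAM instances in a group."""
    return sum(count for _, _, count in srams)


def total_bits(srams):
    """Total storage capacity of a group in bits."""
    return sum(width * depth * count for width, depth, count in srams)


# Assumption: all instruction SRAMs are read side by side as one wide word.
instruction_word_bits = sum(width * count for width, _, count in INSTRUCTION_SRAMS)

# Pad counts and with/without-memory figures quoted in the text.
PADS = {"input": 47, "output": 32, "power": 48}
WITH_MEMORY = {"core_mm2": 2.2, "gates": 411491, "power_mw": 968}
WITHOUT_MEMORY = {"core_mm2": 1.47, "gates": 255669, "power_mw": 312}

# Fraction of each figure attributable to the memories.
memory_share = {key: 1 - WITHOUT_MEMORY[key] / WITH_MEMORY[key] for key in WITH_MEMORY}

if __name__ == "__main__":
    print(memory_count(INSTRUCTION_SRAMS) + memory_count(DATA_SRAMS), "memories")
    print(instruction_word_bits, "bits per instruction word")
    print(sum(PADS.values()), "pads")
    print({key: round(value, 3) for key, value in memory_share.items()})
```

Under these assumptions the memory count and pad count agree with the totals in the text, and the memories account for roughly two thirds of the power dissipation.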

Table 4.2.1: Circuit Summaries

Technology: UMC 0.18 um Mixed Signal (1P6M) CMOS Process

Library: Artisan SAGE-X Standard Cell Library

Clock Rate: 100 MHz

Figure 4.2.1: Layout of the ALU Cluster


Figure 4.2.2: Floorplan and Pad Assignment

Table 4.2.2: The Definition of the I/O Ports

I/O Port Name (I/O): Signal Description

clk (Input): Clock signal for the chip.

reset (Input): Reset signal for the chip.

sel (Input): 4-bit input. Selects which of the instruction memories and data memories is written, or which of the data memories is read.

mem_d_wr (Input): Selects whether the data memory is written or read. “1” means that data is written from the off-chip ports to the data memory. “0” means that data is read from the data memory to the off-chip ports.

mem_d_ctrl (Input): Selects which source controls the activation of the data memory. “1” means that the off-chip port controls the enable signal of the data memory. “0” means that the on-chip signal controls the enable signal of the data memory.

a (Input): 7-bit input. Specifies the address in the instruction memory or the data memory.

d (Input): 32-bit input. Supplies the instructions written to the instruction memory and the data written to the data memory.

q (Output): 32-bit output. Delivers the execution results read from the data memory.

The power supply serves the I/O part of the chip. There are 8 pairs of power supplies in total.
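As a consistency check, the port widths in Table 4.2.2 can be summed to recover the pad counts. The sketch below assumes that clk, reset, mem_d_wr, and mem_d_ctrl are each one bit wide (their widths are not given in the table) and that every signal bit uses one pad.

```python
# Port name -> (direction, width in bits).
# Assumption: ports without a stated width are 1 bit wide.
PORTS = {
    "clk": ("input", 1),
    "reset": ("input", 1),
    "sel": ("input", 4),
    "mem_d_wr": ("input", 1),
    "mem_d_ctrl": ("input", 1),
    "a": ("input", 7),
    "d": ("input", 32),
    "q": ("output", 32),
}


def pad_count(direction):
    """Total number of signal pads for one direction (one pad per bit)."""
    return sum(width for kind, width in PORTS.values() if kind == direction)


addressable_words = 2 ** PORTS["a"][1]      # words reachable with the address port
selectable_memories = 2 ** PORTS["sel"][1]  # memories distinguishable with sel

if __name__ == "__main__":
    print(pad_count("input"), pad_count("output"))
    print(addressable_words, selectable_memories)
```

With these widths the model gives 47 input pads and 32 output pads. The 7-bit address covers 128 words, and the 4-bit select is wide enough to distinguish the fifteen memories.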

Figure 4.2.3: Die Microphotograph

Figure 4.2.4: Photograph of Prototype with Package

4.3 Circuit Verification and Performance Evaluation

In this section, a selected benchmark is used to demonstrate the functionality verification of this work. To make testing feasible and simple, the chip is configured in three steps, which are explained below. The functional simulation and verification at each step of the chip configuration are also described. Finally, the performance evaluation of this work is discussed.

4.3.1 Test Bench: FIR Filter

Media processing applications are easily expressed as a series of computation kernels that operate on large data streams. Any media processing application that can be organized according to the stream processing model is suitable for execution on the ALU Cluster; one example is the FIR filter system introduced in the previous chapter and depicted in Figure 2.2.2. The FIR filter system is chosen as the test bench for the ALU Cluster because it suits a one-dimensional architecture, requires a high proportion of repeated additions and multiplications, and is used in a wide range of DSP applications, such as matched filtering, pulse shaping, and equalization. A brief review of the FIR filter system follows.

The input-output relationship of a linear time-invariant (LTI) FIR filter can be described as

$$y[n] = \sum_{k=0}^{M-1} b_k \, x[n-k]$$

where M represents the length of the FIR filter, the b_k are the filter coefficients, and x[n-k] denotes the data sample at time instance n-k.
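The input-output relationship can be sketched directly in Python. The version below is a plain direct-form evaluation, assuming the sum runs over k = 0 to M-1 and that the input is zero outside its defined samples. It is an illustration of the formula, not the microcode executed on the ALU Cluster.

```python
def fir_filter(b, x):
    """Direct-form FIR: y[n] = sum over k of b[k] * x[n-k].

    b: filter coefficients (length M); x: input samples (length N).
    Samples outside x are treated as zero, so the full response has
    M + N - 1 output values.
    """
    num_taps, num_samples = len(b), len(x)
    y = []
    for n in range(num_taps + num_samples - 1):
        acc = 0
        for k in range(num_taps):
            if 0 <= n - k < num_samples:  # skip samples outside the input
                acc += b[k] * x[n - k]    # one multiply and one add per tap
        y.append(acc)
    return y


if __name__ == "__main__":
    print(fir_filter([1, 2], [1, 1, 1]))
    print(len(fir_filter([1] * 16, [1] * 10)))
```

For a sixteen-tap filter and ten input samples, this full-length formulation produces 16 + 10 - 1 = 25 output values.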

Before simulation, the dimensions of the input and the filter coefficients must be determined. As shown in Figure 4.3.1.1(a), the filter is a sixteen-tap Kaiser-window FIR bandpass filter, and the input is an exponential function with ten sampling points. MathWorks Matlab [27] is used to simulate this FIR filter system in advance, and the simulation results are shown in Figure 4.3.1.1(b). The purpose of this step is to provide reference results against which the FIR filter results computed by the ALU Cluster can be compared, in order to verify whether the chip functions correctly.

[Figure panels: (a) 16-th order Kaiser window FIR filter; (b) FIR output results]

Figure 4.3.1.1: Filter Coefficients, Input Data, and Executed Results of the FIR Filter

4.3.2 Functionality Verification

One design goal of the ALU Cluster is to process abundant parallel data, so the total number of input and output pads required would be very large. SRAMs, namely the instruction memory and the data memory, are therefore used to replace most of the input and output pads. Consequently, for testability and feasibility, the ALU Cluster operates in three different modes: WRITE Mode, EXECUTION Mode, and READ Mode. To execute a program, the chip passes through WRITE Mode, EXECUTION Mode, and READ Mode in that order. The three modes are described in detail as follows:

1. WRITE Mode: The first step is to load the instructions and the required data into the instruction memory and the data memory, respectively, through the input port “d.” By setting the other input ports, such as “sel,” “a,” “mem_d_wr,” and “mem_d_ctrl,” the user selects which instruction memory or data memory is the write target. The values of the memory control signals in this mode are:

mem_d_wr = high & mem_d_ctrl = high

2. EXECUTION Mode: After the instructions and the required data have been loaded into their memories, the second step is for the ALU Cluster to begin executing the assigned program. In this mode, input ports such as “sel” and “a” control the instruction memory to issue the instructions, and the input ports “mem_d_wr” and “mem_d_ctrl” set the data memory to be controlled by the on-chip signals. The values of the memory control signals in this mode are:

mem_d_wr = low & mem_d_ctrl = low

3. READ Mode: In the third step, after the assigned program has finished executing, the user reads the data out of the data memory for testing. By setting the input ports “sel,” “a,” “mem_d_wr,” and “mem_d_ctrl,” the computed data is read from the data memory to the output port “q.” A logic analyzer can be used to confirm whether the computed results are correct. The values of the memory control signals in this mode are:

mem_d_wr = low & mem_d_ctrl = high
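The three control-signal settings listed above can be summarized as a small decoding function. This is a sketch for illustration only: the function name is hypothetical, and because the text does not define the combination mem_d_wr = high with mem_d_ctrl = low, the sketch rejects it.

```python
# (mem_d_wr, mem_d_ctrl) -> operating mode, as listed for the three modes.
MODES = {
    (1, 1): "WRITE",      # off-chip ports write the memories
    (0, 0): "EXECUTION",  # on-chip signals control the data memory
    (0, 1): "READ",       # off-chip ports read the data memory
}


def decode_mode(mem_d_wr: int, mem_d_ctrl: int) -> str:
    """Map the two control inputs (0 = low, 1 = high) to the operating mode."""
    try:
        return MODES[(mem_d_wr, mem_d_ctrl)]
    except KeyError:
        # Assumption: the remaining combination is not a defined mode.
        raise ValueError("undefined control combination") from None


if __name__ == "__main__":
    # A full test run passes through the modes in this order.
    for signals in [(1, 1), (0, 0), (0, 1)]:
        print(signals, decode_mode(*signals))
```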

The three modes needed to complete a program execution were described above, and the functionality verification follows the same order. All functionality verifications are performed in the post-layout simulation environment, and the maximum operating frequency when executing the FIR filter system is 90.9 MHz. The overall operation flow is shown in Figure 4.3.2.1.

Before the assigned program is executed, WRITE Mode is entered first by setting the input ports “mem_d_wr” and “mem_d_ctrl” high. The ALU Cluster in WRITE Mode is shown in Figure 4.3.2.2. In this mode, the instructions are loaded into the instruction memory, and the filter coefficients and input data of the FIR filter are loaded into the data memory. Figure 4.3.2.3 and Figure 4.3.2.4 show the insertion of the filter coefficients and the input data, respectively. The assembly code of all instructions executed for this test bench is listed in Appendix B.

Once WRITE Mode has completed, EXECUTION Mode can start. Figure 4.3.2.5 shows the ALU Cluster operating in EXECUTION Mode after the input ports “mem_d_wr” and “mem_d_ctrl” have been set low. The pre-stored instructions are fetched from the instruction memory by the decoder, and the controller then directs the whole ALU Cluster to execute the program. Meanwhile, the required input data is read from the data memory, and only the calculated results are written back to the data memory.

The last step is to verify the results of the program execution. After EXECUTION Mode has finished, the chip enters READ Mode with the input port “mem_d_wr” set low and “mem_d_ctrl” set high. Figure 4.3.2.6 shows the ALU Cluster working in READ Mode. The results read out in Figure 4.3.2.6 show no difference from the results in Figure 4.3.1.1(b). Therefore, the ALU Cluster functions correctly.

Figure 4.3.2.1: The Overall Operation Flow

Figure 4.3.2.2: The Operation of WRITE Mode

[Waveform signals: reset, clk, sel[3:0], mem_d_wr, mem_d_ctrl, a[6:0], d[31:0]]

Figure 4.3.2.3: Insertion of Filter Coefficients


Figure 4.3.2.4: Insertion of Input Data


Figure 4.3.2.5: The Operation of EXECUTION Mode

[Waveform values legible in the figure: -28966, 19347, 7113, -37380, 86248, -158279, 781792, 1127528, 224741]

Figure 4.3.2.6: The Operation of READ Mode

4.3.3 Performance Evaluation Results

After the FIR filter system has been executed on the ALU Cluster, performance results for code utilization and memory utilization can be obtained. They are discussed in detail below.

Figure 4.3.3.1 shows the code utilization of each arithmetic unit. The ALU Cluster takes 93 instructions in total to finish the FIR filter simulation. Of these, the ALU_0, ALU_1, MUL_0, MUL_1, and DIV_0 units execute 60, 75, 80, 80, and 0 instructions, respectively. The corresponding code utilization of the ALU_0, ALU_1, MUL_0, MUL_1, and DIV_0 units is 64.5%, 80.6%, 86%, 86%, and 0%, respectively, so the average code utilization of the ALU Cluster is about 63.4%. The simulation takes 99 clock cycles to complete, which corresponds to 3.96 clock cycles per output result.
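The utilization figures can be reproduced with the short calculation below. It assumes that the cluster-level figure is the arithmetic mean over the five units, and that the number of output results is 25, which is the full response length of a sixteen-tap filter with ten input samples (16 + 10 - 1). Neither assumption is stated explicitly in the text, although both reproduce the quoted values.

```python
TOTAL_INSTRUCTIONS = 93
TOTAL_CYCLES = 99

# Instructions in which each arithmetic unit is used, as quoted in the text.
ISSUED = {"ALU_0": 60, "ALU_1": 75, "MUL_0": 80, "MUL_1": 80, "DIV_0": 0}

# Per-unit code utilization in percent.
utilization = {
    unit: 100 * count / TOTAL_INSTRUCTIONS for unit, count in ISSUED.items()
}

# Assumption: the cluster figure is the mean over the five units.
cluster_utilization = sum(utilization.values()) / len(utilization)

# Assumption: 16 taps and 10 samples give 16 + 10 - 1 = 25 output results.
NUM_OUTPUTS = 16 + 10 - 1
cycles_per_output = TOTAL_CYCLES / NUM_OUTPUTS

if __name__ == "__main__":
    for unit, value in utilization.items():
        print(f"{unit}: {value:.1f}%")
    print(f"cluster: {cluster_utilization:.1f}%")
    print(f"cycles per output: {cycles_per_output:.2f}")
```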

Figure 4.3.3.2 shows the memory utilization in terms of capacity usage in the ALU Cluster. The IRF unit and the SPRF unit have 32 and 64 entries, respectively. During the FIR filter simulation, the IRF units in the ALU_0 unit and the ALU_1 unit each need 10 and 12 reused entries, respectively; the IRF units in both the MUL_0 unit and the MUL_1 unit need 16 and 10 reused entries, respectively; the SPRF unit needs 3 reused entries; and the IRF units in the DIV_0 unit use no entries. These results show that the storage capacities initially chosen for the IRF unit and the SPRF unit are more than sufficient for executing the FIR filter system.

Figure 4.3.3.3 shows the memory utilization in terms of data reference counts in the ALU Cluster. The data reference count is the number of times data is read from or written to a storage unit, such as an IRF unit, the SPRF unit, or the off-cluster memory, during program execution. The read/write counts for the dedicated off-cluster memories of the ALU_0, ALU_1, MUL_0, MUL_1, and DIV_0 units during the FIR filter simulation are 0/21, 0/0, 0/2, 0/0, 16/1, 10/0, 16/1, 10/0, 0/0, and 0/0, respectively. The read/write counts for the SPRF unit and the dedicated IRF units of the ALU_0, ALU_1, MUL_0, MUL_1, and DIV_0 units are 7/7, 57/57, 56/56, 75/75, 75/75, 80/16, 80/10, 80/16, 80/10, 0/0, and 0/0, respectively.
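Summing the read/write counts listed above gives the on-cluster and off-cluster totals. The sketch below simply adds the pairs as quoted, and the variable names are illustrative. The eighth IRF pair is taken as 80/10, the reading under which the on-cluster total comes to 912.

```python
# (reads, writes) for the ten dedicated off-cluster memories.
OFF_CLUSTER = [(0, 21), (0, 0), (0, 2), (0, 0), (16, 1),
               (10, 0), (16, 1), (10, 0), (0, 0), (0, 0)]

# (reads, writes) for the SPRF unit and the ten dedicated IRF units.
SPRF = (7, 7)
IRF = [(57, 57), (56, 56), (75, 75), (75, 75), (80, 16),
       (80, 10), (80, 16), (80, 10), (0, 0), (0, 0)]  # eighth pair read as 80/10


def total_references(pairs):
    """Total number of reads plus writes over a list of (reads, writes) pairs."""
    return sum(reads + writes for reads, writes in pairs)


on_cluster_total = total_references([SPRF] + IRF)
off_cluster_total = total_references(OFF_CLUSTER)
on_cluster_fraction = on_cluster_total / (on_cluster_total + off_cluster_total)

if __name__ == "__main__":
    print(on_cluster_total, ":", off_cluster_total)
    print(f"{100 * on_cluster_fraction:.1f}% of references stay on-cluster")
```

The totals come to 912 on-cluster and 77 off-cluster references, so about 92% of all data references are served by the IRF and SPRF units.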

From the performance evaluation results above, the ratio of data references between the on-cluster memory, such as the IRF units and the SPRF unit, and the off-cluster memory, such as the SRAM, is 912 : 77. Furthermore, the ratio is 0 : 885 without the memory bandwidth hierarchy, in other words, if the ALU Cluster contains no IRF units or SPRF unit. The
