
CHAPTER 3 DESIGN AND IMPLEMENTATION

3.3 CONSTRUCT THE BASIC ELEMENTS

3.3.6 MAC and LDST Function Units


Figure 3.9 and Figure 3.10 show the block diagrams of the LDST and MAC function units. There are two ALUs, barrel shifters, signed multipliers, pack units, unpack units, and unsigned dividers. The ALU performs common arithmetic, logic, and SIMD operations. The multiplier accepts two 16-bit inputs and generates a 32-bit product. It is based on the shift-and-add algorithm and is constructed from four ripple-carry-save adders. Each ripple-carry-save adder is made up of n+1 full adders, where n is the input width. The last two full adders handle the sign bit and pass it to the next stage. In this way, we avoid extending the inputs to 32 bits to obtain the correct product.
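The shift-and-add principle behind the multiplier can be illustrated with a small behavioural sketch. It is not a model of the ripple-carry-save adder hardware: the function names are hypothetical, and Python integers stand in for the sign handling that the extra full adders perform in the circuit.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def shift_add_multiply(a, b, n=16):
    """Multiply two n-bit two's-complement values by shift-and-add.

    Behavioural sketch only; returns the signed 2n-bit product.
    """
    multiplicand = to_signed(a, n)
    multiplier_bits = b & ((1 << n) - 1)
    acc = 0
    for i in range(n):
        if (multiplier_bits >> i) & 1:
            partial = multiplicand << i
            # The top bit of a two's-complement multiplier has weight -2^(n-1),
            # so its partial product is subtracted instead of added.
            acc += -partial if i == n - 1 else partial
    return to_signed(acc, 2 * n)

print(shift_add_multiply(-3, 7))          # -21
print(shift_add_multiply(0x8000, 0x7FFF)) # -1073709056
```

The second call multiplies the most negative 16-bit value by the most positive one, and the product still fits in 32 bits.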

[Figure: LDST function unit, with blocks Memory Interface, Barrel Shift, Multiplier/Divider, ALU, and Pack/Unpack]

Figure 3.9 Block diagram of LDST function unit

[Figure: MAC function unit, with blocks MAC, Barrel Shift, Address generator, Multiplier/Divider, 40-bit Accumulator Register, ALU, and Pack/Unpack]

Figure 3.10 Block diagram of MAC function unit

The Pack and Unpack units implement the PACK and UNPACK instructions introduced in section 3.2. As shown in Figure 3.11, the Pack unit consists of two multiplexers. The first half of the funt field of the PACK instruction selects whether Rs.H or Rs.L is packed into Rd.H, and the last half of the funt field selects whether Rt.H or Rt.L is packed into Rd.L.
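The two-multiplexer behaviour of the Pack unit can be sketched as follows. The encoding of the funt field is not given here, so the selectors are passed as the hypothetical strings "H" and "L" rather than as bit patterns, and the function names are illustrative only.

```python
def half(reg, which):
    """Return the upper ('H') or lower ('L') 16 bits of a 32-bit register value."""
    return (reg >> 16) & 0xFFFF if which == "H" else reg & 0xFFFF

def pack(rs, rt, rs_half, rt_half):
    """Sketch of PACK: the selected half of Rs goes to Rd.H, the selected half of Rt to Rd.L."""
    return (half(rs, rs_half) << 16) | half(rt, rt_half)

# Rs.L is packed into Rd.H and Rt.H into Rd.L.
rd = pack(0x11112222, 0x33334444, "L", "H")
print(hex(rd))  # 0x22223333
```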

The Unpack unit consists of four multiplexers (Figure 3.12). It unpacks the 16-bit data in Rd.H into Rs.H or Rs.L, and the data in Rd.L into Rt.H or Rt.L. Because our design is based on dual-rail data encoding, only one half of Rs (Rt) receives valid data from Rd, so the other half has to be taken from the current value of Rs (Rt). In this way, completion detection can be performed on the whole register. For example, if Rd.H is unpacked into Rs.H, the data in Rs.L stays in its field.
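A behavioural sketch of the Unpack unit is given below. As with the Pack sketch, the selector values "H" and "L" and the function names are assumptions made for illustration; the point is that the half of Rs (Rt) that is not overwritten keeps its current value, so a complete 32-bit word is always produced.

```python
def insert_half(reg, data16, which):
    """Replace the upper ('H') or lower ('L') 16 bits of reg and keep the other half."""
    data16 &= 0xFFFF
    if which == "H":
        return (data16 << 16) | (reg & 0x0000FFFF)
    return (reg & 0xFFFF0000) | data16

def unpack(rd, rs, rt, rs_half, rt_half):
    """Sketch of UNPACK: Rd.H goes into the selected half of Rs, Rd.L into the selected half of Rt."""
    new_rs = insert_half(rs, (rd >> 16) & 0xFFFF, rs_half)
    new_rt = insert_half(rt, rd & 0xFFFF, rt_half)
    return new_rs, new_rt

# Rd.H is unpacked into Rs.H, so Rs.L (0x2222) stays in its field.
new_rs, new_rt = unpack(0xAAAABBBB, 0x11112222, 0x33334444, "H", "L")
print(hex(new_rs), hex(new_rt))  # 0xaaaa2222 0x3333bbbb
```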

Figure 3.11 Pack unit


Figure 3.12 Unpack unit

3.4 PIPELINE ARCHITECTURE

Our pipeline architecture has six stages: PF (Prefetch), DP (Dispatch), ID/OF (Instruction Decode and Operand Fetch), EX1 (Execute 1), EX2 (Execute 2), and WB (Write Back). This section describes each stage in detail and introduces the solutions to data hazards and control hazards. Data hazards are resolved in the ID/OF stage and control hazards in the DP stage. Figure 3.13 shows our two-way VLIW pipeline architecture.

Because our microprocessor is a two-way VLIW design, we call the path responsible for data transfers with data memory “path A” and the other “path B” in order to describe the features and functions of the datapath conveniently.

Figure 3.13 Pipeline Architecture

3.4.1 PF and DP Stage

An instruction packet is padded with an extra NOP instruction if the instructions in the same packet cannot execute in parallel, but storing such packets may waste too much memory space. Most VLIW processors provide an instruction compression mechanism to solve this problem. In our PF stage, the 64-bit instruction packet is fetched from instruction memory. The next stage (DP stage) then decompresses this instruction packet. If the two instructions in the same packet can be executed in parallel, they are separated into different execution orders.

Furthermore, we resolve the control hazard in the DP stage. Because the utilization of the pipeline is 50%, at most one instruction can pass. When a BEQ/BNEQ instruction has been fetched and is executing in the EX1 stage, the stall mechanism works in the DP stage. After BEQ/BNEQ finishes its job in the EX1 stage, it sends the correct target address to the PC register, and the PF stage can then fetch the correct instruction packet.

3.4.2 ID Stage

The source operands used by an instruction are fetched in this stage, which is also responsible for generating the control signals for the instruction. The control signals are decoded in the Instruction Decoder unit. The outputs of the Instruction Decoder include the control signals of the ID/OF, EX1, EX2, and WB stages. The control signals of the EX1, EX2, and WB stages are delivered stage by stage (Figure 3.14). Because of the two-way VLIW design, there are two Instruction Decoders and two sets of control paths, used for the MAC function unit and the LDST function unit respectively.

The two datapaths share the Register Bank, so source operands in different datapaths can be fetched simultaneously. The ID stage consists of two parts (Figure 3.15): the Instruction Decoder and the Register Bank, which is described in the following section. Moreover, there are two paths between the DeMUX and MERGE: one is a bypass line for NOP instructions, and the other is used for ordinary instructions.

Figure 3.14 Control Path

[Figure: ID stage, with labels Register Bank, Write back 1, Write back 2, and dst]

Figure 3.15 Block diagram of ID stage

3.4.2.1 Register Bank

Figure 3.16 shows the block diagram of the Register Bank. It consists of Operand Decoders, a Lock Queue, and a Register file.

Operand Decoder: It converts an operand register number into a 32-bit 1-of-32 representation, which selects the register to be read. For example, the operand register number 00010 is decoded to 00000000000000000000000000000100, which means that $g2 is to be read.

Register file: It has six read ports and two write ports to serve the two datapaths. The operands include operand A, operand B (the second operand of an R-type instruction), and the immediate value (the second operand of an I-type instruction, taken from the imm field of the instruction).
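The 1-of-32 conversion performed by the Operand Decoder can be written as a single shift. The sketch below is a behavioural illustration with a hypothetical function name, and it reproduces the $g2 example given above.

```python
def decode_operand(reg_num):
    """Convert a 5-bit register number into a 1-of-32 (one-hot) select word."""
    if not 0 <= reg_num < 32:
        raise ValueError("register number must be in the range 0..31")
    return 1 << reg_num

select = decode_operand(0b00010)
print(format(select, "032b"))  # 00000000000000000000000000000100
```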

[Figure: Register Bank, with the Register file, the Lock Module, two Operand Decoders producing Operands, and inputs Write data_1 and Write data_2]

Figure 3.16 Register Bank

Lock Module: The pipeline may suffer a data hazard between two successive instructions if a source operand of the second instruction is the result of the first. In this situation, a RAW (read after write) hazard occurs if the result has not been written before the second instruction reads it. Figure 3.17 shows the block diagram of the Lock Module. It is similar to the Lock FIFO of the asynchronous microprocessor designed by N.C. Paver [2]: we use a queue to store the destination register numbers, and the implementation concept is also similar to the Lock FIFO. We modify the Lock FIFO design slightly to suit our two-way VLIW architecture.

[Figure: Lock Module, with labels Lock Queue, Rd_push, pop, Rs, Rt, Rs1_E, Rt1_E, Write_E, Operand read done, Push done, Ack pre_stage, Ack from EX_latch, three Converters, and C]

Figure 3.17 Block diagram of Lock Module

The three converter units in Figure 3.17 are dual-rail to single-rail converters. (The other datapath has the same set of elements and control path.) When Rs (Rt, Rd) is valid, the operand register number is converted from dual-rail to single-rail, and the control unit sends a request to the Lock Queue to check whether the instruction is stalled by a previous instruction. The Lock Queue is used to resolve RAW hazards. It stores the destination register information and deletes it after the result from the WB stage has been written into the destination register. The instruction is stalled in the ID stage if a RAW hazard occurs. There are two Lock Queues, one in each datapath, and each stores its own destination register numbers.

When an instruction checks the Lock Queue, it has to check not only its own Lock Queue but also the Lock Queue of the other datapath. Therefore, one datapath may be stalled by the other. For example, suppose two successive instruction packets are executed, with the following operands:

ADD $g4, $g1, $g0 … (1), ADD $g2, $g1, $g0 … (2)

ADD $g5, $g2, $g0 … (3), ADD $g6, $g4, $g0 … (4)

Instructions (3) and (4) follow (1) and (2). After instructions (1) and (2) finish their work in the ID/OF stage, the contents of the Lock Queue of each datapath are:

LQ1 LQ2

$g4 $g2

Instruction (1) is executed in parallel with (2), and (3) is executed in parallel with (4). Instruction (3) is stalled by (2) because one of its source operands ($g2) comes from (2), and (2) has not finished. Likewise, instruction (4) is stalled by (1) because its source operand $g4 comes from (1). Instructions (3) and (4) cannot store their destination register numbers in the Lock Queue while they are stalled, because a destination register number is stored only after the operand read has completed.

After (1) and (2) finish, the Lock Queues are updated:

LQ1 LQ2

$g5 $g6

At this time, instructions (3) and (4) are no longer stalled. The destination register information cannot be pushed into the Lock Queue until the operands have been fetched from the register file. This policy ensures that deadlock never occurs. For example, in the instruction “ADD $g3, $g3, $g1”, one source operand ($g3) and the destination register ($g3) are the same. A deadlock could occur in the ID stage if the destination register information were pushed into the Lock Queue before the two operands were fetched from the register file.
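The locking policy can be traced with a small software model. This is only a sketch under simplifying assumptions: the class and function names are hypothetical, each Lock Queue is modelled as a plain FIFO of register names, and the handshaking and dual-rail conversion are left out. An instruction issues only if none of its source registers is held in either Lock Queue, and its destination is pushed only after that check, which stands in for the operand read.

```python
from collections import deque

class LockQueue:
    """Destination registers whose results have not been written back yet."""
    def __init__(self):
        self.pending = deque()

    def push(self, rd):
        self.pending.append(rd)

    def pop(self):
        # Called when the WB stage has written the oldest pending result.
        return self.pending.popleft()

    def locks(self, reg):
        return reg in self.pending

def try_issue(own, other, rd, rs, rt):
    """Return True if the instruction may leave ID; stall (False) on a RAW hazard."""
    if any(q.locks(r) for q in (own, other) for r in (rs, rt)):
        return False
    own.push(rd)  # pushed only after the source operands have been read
    return True

lq1, lq2 = LockQueue(), LockQueue()
print(try_issue(lq1, lq2, "g4", "g1", "g0"))  # (1) True
print(try_issue(lq2, lq1, "g2", "g1", "g0"))  # (2) True
print(try_issue(lq1, lq2, "g5", "g2", "g0"))  # (3) False, stalled by (2)
print(try_issue(lq2, lq1, "g6", "g4", "g0"))  # (4) False, stalled by (1)
lq1.pop(); lq2.pop()                          # (1) and (2) write back
print(try_issue(lq1, lq2, "g5", "g2", "g0"))  # (3) True
print(try_issue(lq2, lq1, "g6", "g4", "g0"))  # (4) True
print(list(lq1.pending), list(lq2.pending))   # ['g5'] ['g6']
```

With this ordering, an instruction such as “ADD $g3, $g3, $g1” issues on empty queues, because $g3 is checked as a source before it is pushed as a destination.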

3.4.3 EX Stage

The EX stage is responsible for computation and for returning the result to the register file. It is separated into three stages: EX1, EX2, and WB. Because of the two-way VLIW design, each datapath has its own function unit, as shown in Figure 3.18 (a) and Figure 3.18 (b), and works independently. However, the two datapaths have to wait for each other before an acknowledgement signal is sent to the previous stage. One datapath is responsible for MAC and branch instructions, and the other is responsible for load and store instructions. The basic arithmetic operations can be executed in both datapaths. The three stages are described in the following sections. The DeMUX and MERGE pairs described in section 3.3.4 are used to select the data flow in each stage, and function units that have no work in an execution stage can be bypassed. Figure 3.18 (a) shows the block diagram of Path A, and Figure 3.18 (b) shows the block diagram of Path B.

3.4.3.1 EX1 Stage

44 latches. It is delivered to the correct path by the DeMUX. Path A has three portions, which perform multiplication, division, shift, and arithmetic operations. The ALU is used for general arithmetic operations and for calculating the memory addresses of load and store instructions.

Path B has four portions. It can also perform multiplication, division, and shift operations. In addition, it contains a sign-extension unit and an address generator. A branch operation calculates its target address in this stage, and the target address is then passed to the DP stage to resolve the control hazard caused by the branch instruction.

3.4.3.2 EX2 Stage

In path A, this stage is separated into two parts: data transfer and data pass. Load and store instructions read and write data in memory through the memory interface. General instructions that do not need to read or write memory do nothing in this stage.

In path B, there are two portions, MAC and ALU. A MAC operation reads the contents of the 40-bit accumulator register in this stage, and the outcome is written into the accumulator in the next stage, which ensures the correctness of the accumulator. If the executing instruction is a branch instruction, which completed its job in the previous stage, the valid token is bypassed in this stage.

3.4.3.3 WB Stage

The WB stage is the final stage of our pipeline. It saves the result back to the register file according to the destination register number. An instruction that has finished in a previous stage, for example “SW Rd, Rs, imm”, has nothing to do in this stage. For a multiply-accumulate instruction, the output of the MAC unit is written into the 40-bit accumulator in this stage. The value in the 40-bit accumulator can be moved into a register with ACCLDH and ACCLDL in order to support other applications. Because our pipeline architecture is based on the 4-phase dual-rail handshaking protocol, the utilization of the pipeline stages is one half, so the accumulator is read and written at different times. Finally, the datapath that finishes its job early has to wait for the other only in this stage.
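The split of a multiply-accumulate operation between the EX2 and WB stages can be sketched as follows. The names are hypothetical, and wrap-around on 40-bit overflow is an assumption made only for this illustration, since the overflow behaviour of the accumulator is not stated here.

```python
ACC_BITS = 40  # accumulator width given in the text

def wrap_signed(value, bits):
    """Wrap an integer to a two's-complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def mac_ex2(acc_value, a, b):
    """EX2: read the accumulator and compute acc + a*b for 16-bit signed a and b."""
    product = wrap_signed(a, 16) * wrap_signed(b, 16)
    # Assumption: overflow wraps around at 40 bits.
    return wrap_signed(acc_value + product, ACC_BITS)

class Accumulator:
    def __init__(self):
        self.value = 0

    def write_back(self, result):
        """WB: commit the result computed in EX2."""
        self.value = result

acc = Accumulator()
for a, b in [(3, 4), (-2, 5)]:
    pending = mac_ex2(acc.value, a, b)  # EX2 reads the accumulator
    acc.write_back(pending)             # WB writes it one stage later
print(acc.value)  # 2
```

Because the read in EX2 and the write in WB of successive MAC instructions do not overlap, each EX2 read in this model sees the value committed by the previous instruction.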

CHAPTER 4 SIMULATION

4.1 TESTING ENVIRONMENT

We use ModelSim 6.0 to verify the functional correctness of the design. In addition, we synthesized our design with Design Compiler using the TSMC 0.13 μm process library. The area and timing reports are described in the following sections.

Figure 4.1 shows the waveform of the function simulation.

Figure 4.1 The waveform of function simulation

4.2 AREA SIMULATION

With the TSMC 0.13 μm process, the area report of each stage of our two-way VLIW processor is shown in Table 4.1, which covers every pipeline stage except the PF and DP stages. Table 4.2 shows the area of the register bank.

(μm²)    ID/OF    EX1    EX2    WB    Total

Table 4.1 The area report of each stage

Lock Module        Register File
11116.3 (μm²)      57450.2 (μm²)

Table 4.2 The area report of register bank

4.3 TIMING SIMULATION

With the TSMC 0.13 μm process, the timing report of each stage of our two-way VLIW processor is shown in Table 4.3. The EX1 stage has a longer latency than the other stages because the multiplier and divider are executed in this stage. In the EX2 stage, we ignore the memory latency because the memory is based on synchronous circuit design.

(ns)    ID/OF    EX1      EX2      WB
LDST    22.62    65.18    10.48    2.8
MAC              65.75    61.23    2.11

Table 4.3 The timing report of each stage

CHAPTER 5 CONCLUSION AND FUTURE WORKS

In this thesis, we have implemented a two-way VLIW processor based on asynchronous circuit design with the four-phase dual-rail handshaking protocol. It has a six-stage pipeline architecture, and each stage can take a variable length of time to execute owing to the nature of asynchronous circuits. Instruction compression reduces the instruction memory space. The processor also supports SIMD applications, for which there are nine instructions, and multiply-accumulate operations. Moreover, the DeMUX and MERGE can be used to improve performance: the datapath can be separated into several parts, and if the function units between a DeMUX and MERGE are not used, the DeMUX bypasses them.

In our datapath design, each datapath has two read ports and one write port. We could increase the number of read and write ports of the register file to improve the performance of SIMD applications, because the Unpack instruction has to be performed twice to unpack a 32-bit value into the destination registers. More importantly, we hope that this lightweight asynchronous core can be used to construct a multi-core processor via an interconnection network in the future.

References

[1] Jens Sparso and Steve Furber, Principles of Asynchronous Circuit Design, Kluwer Academic Publisher, 2001.

[2] N.C. Paver, “The Design and Implementation of an Asynchronous Microprocessor,” Ph.D thesis, Department of Computer Science, The University of Manchester, 1994

[3] S.B. Furber, P. Day, J.D. Garside, S. Temple, J. Lin, and N.C. Paver, “AMULET2e: An Asynchronous Embedded Controller,” in the third International Symposium on Advanced Research in Asynchronous Circuits and Systems, ASYNC97, pp. 243-256, 1997

[4] S.B. Furber, J.D. Garside, D.A. Gilbert, “AMULET3: A High-Performance Self-Timed ARM Microprocessor,” in International Conference on Computer Design: VLSI in Computers and processors, ICCD ’98, pp.247-252, 1998

[5] C.J. Chen, W.M. Cheng, H.Y. Tsai, J.C. Wu, “A Quasi-Delay-Insensitive Microprocessor Core Implementation for Microcontrollers,” Journal of Information Science and Engineering, Vol. 25, No. 2, Mar. 2009, pp. 543-557

[6] Hung-Yue Tsai, “A Self-timed Dual-rail Pipelined Microprocessor Implementation,” National Chiao Tung University, 2007

[7] T. Nanya, et.al, “TITAC: Design of a Quasi-Delay-Insensitive Microprocessor,” IEEE Design & Test of Computer, Summer 1994, pp. 50-63

[8] A. Takamura, et.al, “TITAC-2: A 32-bit Asynchronous Microprocessor based on Scalable-Delay-Insensitive Model,” in Proceedings of the International Conference on Computer Design, Oct. 1997, pp.288-294

[9] T. Kumura, M. Ikekawa, M. Yoshida, and Ichiro Kuroda, “VLIW DSP for Mobile Applications,” Signal Processing Magazine, IEEE, Vol. 19, Issue 4, pp. 10-21, 2002

[10] TMS320C55x Technical Overview, Texas Instruments Inc., Literature Number: SPRU393, 2000, http://www.ti.com

[11] TMS320C64x DSP Library Programmer’s Reference, Texas Instruments Inc., Literature Number: SPRU565B, 2003, http://www.ti.com

[12] T. Kumura, D. Ishii, M. Ikekawa, I. Kuroda, and M. Yoshida, “A low-power programmable DSP core architecture for 3G mobile terminals,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Vol. 2, pp. 1017-1020, 2001

[13] T. Chen, R. Raghavan, J.N. Dale, E. Iwata, “Cell Broadband Engine Architecture and its first implementation - A performance view,” IBM Journal of Research and Development, Vol. 51, Issue 5, pp. 559-572, Sept. 2007

[14] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, P. Roussel, “The Microarchitecture of the Pentium® 4 Processor,” Intel Technology Journal, Vol. 5, Issue 1, 2001, http://www.intel.com
