

Chapter 2 Background

2.5 Media Streaming Processor Architecture

Because the streaming architecture combined with a software defined radio mechanism has been suggested as an effective architecture for media processing applications, we propose an improved media streaming processor architecture, depicted in Figure 2.5.1. The architecture primarily comprises five components: the ALU Cluster, the stream register file, the distributed memory management unit, the power management unit, and the controller. In addition, an architecture simulator has been built to provide a simulation environment for this architecture.

Figure 2.5.1: Media Streaming Processor Architecture

In general, the architecture simulator is used to determine the number and size of ALU Clusters, the stream register file, and the off-chip memory needed for a specific media processing application to run efficiently. The simulation results are then used, together with the software defined radio mechanism, to reconfigure the media streaming processor, letting the system reach the best trade-off between operating time and power consumption. The ALU Cluster shortens the operating time of the kernel computation, and the high speed storage units embedded in the ALU Cluster reduce the use of the scarce global communication bandwidth. A controller decodes instructions and controls the overall operation of all ALU Clusters. The stream register file is a memory organized for streaming and can hold any number of streams of any length. It supports stream transfers between the ALU Cluster and off-chip memory, such as double data rate (DDR) random access memory (RAM), so recirculating streams through the stream register file minimizes the use of scarce off-chip data bandwidth in favor of global register bandwidth. The distributed memory management unit addresses the scarce data bandwidth, reduces memory copies, and provides dynamic scheduling.

Because power dissipation has become an important concern in modern VLSI circuitry for mobile systems, a power management unit with low power techniques for controlling and reducing power dissipation has been integrated into the next generation media streaming processor architecture. The power management unit can scale performance and power by turning function blocks on or off according to the simulation results from the architecture simulator. In addition, the high speed storage units inside the ALU Cluster, the stream register file, and the off-chip DDR RAM form a memory bandwidth hierarchy. This is in contrast to conventional architectures, which use less efficient global register bandwidth when local bandwidth would suffice, in turn forcing the use of more off-chip bandwidth.
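As a rough illustration of how the simulator-driven reconfiguration described above could be expressed in software, the following Python sketch gates hypothetical function blocks on or off from a configuration produced offline; the block names and the shape of the simulator output are assumptions for illustration only, not part of the actual design.

```python
# Minimal sketch of simulator-driven power gating. The block names and the
# shape of the simulator output are hypothetical illustrations only.
FUNCTION_BLOCKS = ["alu_cluster_0", "alu_cluster_1",
                   "stream_register_file", "distributed_mmu"]

def apply_power_config(simulator_output):
    """Turn each function block on or off according to the simulator result.

    simulator_output: dict mapping block name -> bool (True = keep powered).
    Blocks the simulator does not mention stay powered on.
    """
    return {block: simulator_output.get(block, True) for block in FUNCTION_BLOCKS}

if __name__ == "__main__":
    # Example: the simulated application needs only one ALU Cluster.
    state = apply_power_config({"alu_cluster_1": False})
    for block, powered in state.items():
        print(f"{block}: {'on' if powered else 'power-gated'}")
```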

The main components of the improved media streaming processor and a customized architecture simulator have been under development at the SoC Laboratory by students and faculty. In this thesis, I am chiefly responsible for the design of the ALU Cluster micro-architecture, because the ALU Cluster is the central computation part of the processor and a key factor in achieving high kernel performance. The major challenges in completing this work are deciding the architecture of each component in the ALU Cluster, meeting the trade-off between the timing constraint and the area constraint, and handling the complexity of the micro-architecture implementation. The detailed micro-architecture design and the design flow are described in the following chapters.

CHAPTER 3

DESIGN OF ALU CLUSTER MICROARCHITECTURE

In this chapter, the details of the ALU Cluster micro-architecture are discussed, from the design decisions for each component to the integration of these components into the ALU Cluster. In addition, the dedicated instruction set format for the ALU Cluster is explained. Finally, the pipeline steps and the overall system operation of the ALU Cluster with the pipeline mechanism are described.

3.1 ALU Cluster Block Diagram

Conventional processor architectures handle media processing applications poorly, and the two principal solutions to their performance bottlenecks, namely the required high computation throughput and the processor-memory performance gap, are concurrency and locality, respectively. Therefore, the proposed 32-bit ALU Cluster micro-architecture is mainly based on the Stanford Imagine Stream Processor [19][20], with consideration given to implementation feasibility.

Concurrency provides abundant data-level parallelism, which refers to computation on different data elements occurring in parallel across the multiple function units in one ALU Cluster. Locality is temporal and refers to the reuse of coefficients or data during the execution of computation kernels, as well as the temporal locality that exists between different stages of a computation pipeline or between kernels. A temporary high speed storage unit is therefore embedded inside the ALU Cluster, forming a memory bandwidth hierarchy that avoids frequent accesses to the high latency off-chip memory over the global memory bandwidth. The block diagram of the ALU Cluster micro-architecture is shown in Figure 3.1.1.

Figure 3.1.1: ALU Cluster Architecture Block Diagram

As shown in Figure 3.1.1, the ALU Cluster architecture primarily includes several arithmetic units, high speed storage units, namely the intra-register-file (IRF) and the scratch pad register file (SPRF), a decoder, and a controller. The arithmetic units consist of two ALUs, two multipliers, and one divider, supporting the processing of abundant parallel data simultaneously. Most media processing applications are well suited to this mixture of arithmetic units. In addition, the number of parallel arithmetic units can be decided from back-end simulation results based on the contemporary process technology and the standard cell library. Every arithmetic unit in the ALU Cluster embeds a high speed 32-entry IRF unit for each of its inputs. These IRF units mainly store the temporary intermediate results of computation while executing on streams of data, and greatly reduce the required off-chip memory bandwidth. This allows memory bandwidth to be used efficiently, in the sense that the expensive and communication-limited global memory bandwidth is not wasted on the arithmetic units, where inexpensive local memory bandwidth is easy to provide and use. The 64-entry SPRF unit is an extra high speed storage unit that accepts spills from the recirculation of temporary computing data. The decoder fetches instructions and sends the decoded results to the controller, and the controller manages the overall operation of the ALU Cluster. The micro-architectures of these components are described in detail in the following sections.
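To make the storage organization above concrete, the short Python sketch below tallies the register-file capacity implied by the component counts just listed (two ALUs, two multipliers, one divider, a 32-entry IRF per arithmetic-unit input, and one 64-entry SPRF); the class and field names are illustrative bookkeeping, not part of the actual design.

```python
# Minimal sketch that tallies the local storage of one ALU Cluster from the
# component counts described in the text; purely an illustrative calculation.
from dataclasses import dataclass

WORD_BYTES = 4  # 32-bit data path

@dataclass
class ClusterConfig:
    num_alu: int = 2
    num_mul: int = 2
    num_div: int = 1
    inputs_per_unit: int = 2   # one IRF per arithmetic-unit input
    irf_entries: int = 32
    sprf_entries: int = 64

    @property
    def num_irf(self):
        return (self.num_alu + self.num_mul + self.num_div) * self.inputs_per_unit

    @property
    def irf_words(self):
        return self.num_irf * self.irf_entries

cfg = ClusterConfig()
print(cfg.num_irf)     # 10 IRF units
print(cfg.irf_words)   # 320 words in total, matching the figure in Section 3.3.4
print(cfg.irf_words * WORD_BYTES + cfg.sprf_entries * WORD_BYTES)  # bytes of local storage
```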

3.2 Instruction Set Format

The instruction set format for the ALU Cluster is similar to a VLIW instruction format, as shown in Figure 3.2.1. The instruction is mainly composed of one field per arithmetic unit in the ALU Cluster (the ALU units, the MUL units, and the DIV unit). Each arithmetic unit field is further subdivided into sub-fields containing the two input sources and their read addresses, the one output destination and its write address, and the operation to execute.

Figure 3.2.1: Instruction Set Format

The input source is read from one of three places: the off-chip data memory, the unit's own IRF, or the SPRF unit. Likewise, the output destination is written back to one of three places: the off-chip data memory, one of the IRF units, or the SPRF unit. The address length for an input source or output destination is determined by the size of the largest storage unit among the off-chip data memory, the IRF unit, and the SPRF unit. The executable operation types depend on the type of arithmetic unit, so the operation code length also depends on the arithmetic unit: the operation code of the ALU unit, the MUL unit, and the DIV unit is 4 bits, 1 bit, and 2 bits long, respectively. The operation types are detailed in the next section.

The overall instruction length is determined by the total number of arithmetic units of each type, and the length of each arithmetic unit's sub-instruction is primarily determined by the length of its operation code. For example, the sub-instruction of the ALU unit, the MUL unit, and the DIV unit is 30 bits, 27 bits, and 28 bits long, respectively, so the whole instruction is 142 bits long. The microcode defined in the instruction set for the input sources, the output destination, and the executable operations of each arithmetic unit is summarized in Appendix A.
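As a sanity check on the widths quoted above, the Python sketch below adds up the per-unit sub-instruction lengths for the two ALU units, two MUL units, and one DIV unit; only the totals stated in the text are used, and the exact bit positions of the source and destination fields are deliberately not assumed.

```python
# Minimal sketch reproducing the instruction-width arithmetic from Section 3.2.
# Only the totals stated in the text are used; the exact bit layout of the
# source/destination fields is not assumed here.
SUB_INSTRUCTION_BITS = {"ALU": 30, "MUL": 27, "DIV": 28}
OPCODE_BITS = {"ALU": 4, "MUL": 1, "DIV": 2}
UNIT_COUNT = {"ALU": 2, "MUL": 2, "DIV": 1}

total = sum(UNIT_COUNT[u] * SUB_INSTRUCTION_BITS[u] for u in UNIT_COUNT)
print(total)  # 2*30 + 2*27 + 1*28 = 142 bits, the VLIW-like instruction width

for unit, width in SUB_INSTRUCTION_BITS.items():
    # Bits left for the two source selects/addresses and the destination
    # select/address once the opcode is removed.
    print(unit, "operand field bits:", width - OPCODE_BITS[unit])  # 26 for each unit
```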

3.3 ALU Cluster Function Units

In this section, the micro-architecture design of each function unit in the ALU Cluster is discussed. A more detailed view of a function unit, which contains an arithmetic unit and its associated register files, is shown in Figure 3.3.1. Most arithmetic units have two data inputs and one output bus. Data in the function unit is temporarily stored in the IRF units. These function units are developed with a number of design goals in mind, including the trade-off between physical area, data throughput, and operation latency. Consequently, to reduce the design complexity and enhance implementation feasibility, Synopsys DesignWare [21] together with the UMC 0.18 um 1P6M CMOS process and the Artisan SAGE-X standard cell library [22] is used to shorten the design process.

Figure 3.3.1: Function Unit Details

3.3.1 ALU Unit

An ALU Cluster contains two ALU units, each of which can execute the operations listed in Table 3.3.1.1, such as addition, absolute value, logical, shift, and comparison instructions. These operations support 32-bit signed integers and can be implemented with Synopsys DesignWare building block IPs, for example DW01_add, DW01_absval, DW01_ash, and DW01_cmp6. The synthesis results of the available IP implementations are listed in Table 3.3.1.2.

Table 3.3.1.1: The Operations Corresponding to the ALU Unit (Operation, Description)

The slack value is the difference between the timing constraint, namely the clock period, and the data arrival time, so a larger slack value means a larger timing margin for completing the execution within one clock cycle. Additionally, the synthesis of the available IPs only takes the gate delay into consideration, without the wire delay, and the wire delay has grown into a chief critical timing issue with the continued scaling of modern VLSI technology [23]. A larger slack value is therefore preferable for the subsequent place and route process. The initial clock period is set to 8 ns, corresponding to 125 MHz. The gate count is defined as the synthesized area of the design divided by the area of a two-input NAND gate in the UMC 0.18 um CMOS process with the Artisan standard cell library. According to Table 3.3.1.2, in order to obtain the best balance between execution time and physical area, the fast carry-look-ahead architecture is chosen for DW01_add, the carry-look-ahead architecture is chosen for DW01_absval, the 2:1 inverting multiplexer and 2:1 multiplexer architecture is chosen for DW01_ash, and the carry-look-ahead architecture is chosen for DW01_cmp6. Finally, the ALU unit is designed as a 2-stage pipeline: the first stage fetches the two input operands from the controller and decides which operation to execute from the operation code, and the second stage completes the assigned execution.
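The following Python sketch is a purely behavioral model of the operation mix just described (addition, absolute value, logical, shift, and comparison on 32-bit signed integers); it ignores the 2-stage pipelining and the DesignWare implementations, and the opcode mnemonics are hypothetical.

```python
# Behavioral sketch of the ALU unit's operation mix on 32-bit signed integers.
# Opcode mnemonics are hypothetical; pipelining and timing are not modeled.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def wrap32(x):
    """Wrap a Python integer into two's-complement 32-bit range."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x > INT32_MAX else x

def alu(op, a, b):
    if op == "ADD":
        return wrap32(a + b)
    if op == "ABS":
        return wrap32(abs(a))
    if op == "AND":
        return wrap32(a & b)
    if op == "SHL":                 # arithmetic shift left by b (0..31)
        return wrap32(a << (b & 31))
    if op == "SHR":                 # arithmetic shift right keeps the sign
        return wrap32(a >> (b & 31))
    if op == "CMP":                 # comparison; the hardware uses DW01_cmp6 here
        return (a > b) - (a < b)
    raise ValueError(f"unknown opcode {op}")

print(alu("ADD", INT32_MAX, 1))   # wraps to INT32_MIN
print(alu("CMP", 5, -3))          # 1
```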

Table 3.3.1.2: Synthesis Results Corresponding to Different Architectures
IP        Implementation        Slack    Gate Count
          rpl                   0.56     244
(Implementation abbreviations: rpl = ripple-carry, mx2i = 2:1 inverting multiplexers and 2:1 multiplexers, mx4 = 4:1 and 2:1 multiplexers.)

3.3.2 MUL Unit

An ALU Cluster contains two MUL units that execute the multiplication operation listed in Table 3.3.2.1. Like the ALU unit, the MUL unit supports 32-bit signed integer operands.

The operation can likewise be implemented with Synopsys DesignWare building block IPs, namely DW02_mult and DW_mult_pipe. The synthesis results of the available IP implementations with different numbers of pipeline stages are listed in Table 3.3.2.2.

Table 3.3.2.1: The Operation Corresponding to the MUL Unit
Operation    Description
MUL          Multiply

As Table 3.3.2.2 shows, more pipeline stages yield a larger slack value, but the gate count contributed by the pipeline registers increases significantly. Therefore, to balance execution time against physical area, the 3-stage pipelined Booth-encoded Wallace tree architecture of DW_mult_pipe, which wraps DW02_mult, is chosen. The MUL unit is accordingly designed as a 4-stage pipeline: the first three stages fetch the two input operands and complete the multiplication, and the fourth stage saturates the product to the maximum or minimum expressible value if overflow or underflow occurs, respectively.
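A minimal behavioral sketch of the saturation performed in the fourth pipeline stage is given below; it models only the numerical behavior of a 32-bit signed multiply with clamping, not the Booth/Wallace implementation or the pipeline timing.

```python
# Behavioral sketch of the MUL unit's result saturation: a 32-bit signed
# multiply whose product is clamped to the representable range on overflow.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def mul_saturate(a, b):
    """Multiply two 32-bit signed integers and clamp the result."""
    product = a * b                     # full-precision product
    if product > INT32_MAX:
        return INT32_MAX                # overflow -> largest expressible value
    if product < INT32_MIN:
        return INT32_MIN                # underflow -> smallest expressible value
    return product

print(mul_saturate(70000, 70000))       # overflows, clamps to 2147483647
print(mul_saturate(-70000, 70000))      # clamps to -2147483648
print(mul_saturate(1234, -5678))        # fits, returns -7006652
```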

Table 3.3.2.2: Synthesis Results Corresponding to Different Pipeline Stages
IP              Pipeline Stages    Slack    Gate Count
DW_mult_pipe    3                  3.95     9873
DW_mult_pipe    4                  4.54     12084

3.3.3 DIV Unit

An ALU Cluster contains one DIV unit that executes the division and square root operations listed in Table 3.3.3.1. Like the ALU unit and the MUL unit, these operations support 32-bit signed integers. Since the DIV unit is not critical to kernel performance, it is left unpipelined, trading longer execution latency for a smaller area. These operations can be implemented with the DW_div and DW_sqrt Synopsys DesignWare building block IPs, and the synthesis results of the available IP implementations are listed in Table 3.3.3.2.

Table 3.3.3.1: The Operations Corresponding to the DIV Unit

To shrink the area by means of increased execution latency, the ripple-carry architecture is chosen for both DW_div and DW_sqrt. The DIV unit is therefore not pipelined and has a latency of 16 clock cycles per execution.
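To illustrate the cost of the unpipelined choice described above, the Python sketch below models a function unit that stays busy for a fixed 16-cycle latency before its result is available; the class interface is hypothetical and only captures the occupancy behavior, not the DW_div/DW_sqrt arithmetic.

```python
# Minimal occupancy model of the unpipelined DIV unit: one operation at a
# time, 16 cycles from issue to result. The class interface is hypothetical.
DIV_LATENCY = 16

class UnpipelinedDivider:
    def __init__(self):
        self.busy_until = 0        # cycle at which the current result is ready
        self.result = None

    def issue(self, cycle, a, b):
        """Start a divide at `cycle`; refuse if a previous one is still running."""
        if cycle < self.busy_until:
            return False           # structural hazard: unit is occupied
        self.result = int(a / b)   # truncating signed division (behavioral only)
        self.busy_until = cycle + DIV_LATENCY
        return True

    def ready(self, cycle):
        return cycle >= self.busy_until

div = UnpipelinedDivider()
print(div.issue(0, -100, 7))      # True, result available at cycle 16
print(div.issue(5, 9, 3))         # False, still busy
print(div.ready(16), div.result)  # True -14
```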

Table 3.3.3.2: Synthesis Results Corresponding to Different Architectures
IP        Implementation    Data Arrival Time    Gate Count
          rpl               194.34               6628

3.3.4 IRF Unit

In addition to the arithmetic operations in the ALU Cluster, an important non-arithmetic operation is supported by the IRF unit. The IRF unit is a high speed register file with one read port and one write port, and flip-flops are used as its basic storage element, as shown in Figure 3.3.4.1. The storage capacity of an IRF unit is 32 words. Multi-level multiplexer trees take the place of a single-level multiplexer tree to speed up the combinational multiplexing logic. The IRF unit can be written with one datum and read with another within the same clock cycle, and the flip-flops in front of the read selects hold the input values within the IRF unit so that data written in one clock cycle can be read correctly by the arithmetic unit in the subsequent clock cycle.

The key function of the IRF unit is to store the temporary intermediate results of calculation while executing on streams of data, which significantly decreases the required off-chip memory bandwidth. This allows memory bandwidth to be used efficiently, in the sense that the high-cost, communication-limited global memory bandwidth is not wasted on the function units, where inexpensive local memory bandwidth is simple to provide and use. In total, the IRF units in one ALU Cluster hold 320 words and provide 8 GB/s of peak memory bandwidth.

!"#$%

Figure 3.3.4.1: IRF Architecture
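A behavioral Python sketch of such a one-read-port, one-write-port register file is shown below; it models the 32-entry capacity and the rule that a value written in one cycle becomes readable in the next, while the multiplexer trees and flip-flop timing are abstracted away.

```python
# Behavioral sketch of a 1-read-port / 1-write-port IRF: 32 words, with a
# write in cycle N becoming visible to reads from cycle N+1 onward.
IRF_ENTRIES = 32

class IRF:
    def __init__(self):
        self.regs = [0] * IRF_ENTRIES   # committed state (flip-flops)
        self.pending = None             # (address, data) written this cycle

    def read(self, addr):
        """Read the value committed in previous cycles."""
        return self.regs[addr % IRF_ENTRIES]

    def write(self, addr, data):
        """Accept at most one write per cycle; it commits at the clock edge."""
        self.pending = (addr % IRF_ENTRIES, data)

    def clock(self):
        """Advance one cycle: commit the pending write, if any."""
        if self.pending is not None:
            addr, data = self.pending
            self.regs[addr] = data
            self.pending = None

irf = IRF()
irf.write(3, 42)
print(irf.read(3))   # 0  -> a same-cycle read still sees the old value
irf.clock()
print(irf.read(3))   # 42 -> available to the arithmetic unit in the next cycle
```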

3.3.5 SPRF Unit

Another non-arithmetic operation is supported by the SPRF unit, whose architecture is analogous to that of the IRF unit except for its storage capacity. The SPRF unit is a high speed register file with one read port and one write port, and flip-flops are used as its basic storage element. Its storage capacity is 64 words, and it provides 0.8 GB/s of peak memory bandwidth for one ALU Cluster. The SPRF unit can be written with one datum and read with another within the same clock cycle, and data written in one clock cycle can be read correctly by the arithmetic unit in the subsequent clock cycle. The primary functions of the SPRF unit are to hold spills from the IRF units and to store common coefficient parameters.

3.3.6 Decoder Unit

The decoder unit fetches the VLIW-like 142-bit instructions from the off-chip instruction memory and decodes them for the controller. First, the fetched instruction is divided into several segments according to the number of arithmetic units. Second, no-operation segments are discarded, and the remaining segments are transformed into the binary code format requested by the controller. Finally, the decoded results are sequenced to the controller.
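The sketch below illustrates the segmenting step just described: a 142-bit instruction word is sliced into the per-unit sub-instructions whose widths were given in Section 3.2. The ordering of the segments and the all-zero no-operation convention are assumptions for illustration only.

```python
# Sketch of the decoder's first step: slicing a 142-bit instruction into the
# per-unit segments (2 ALU x 30, 2 MUL x 27, 1 DIV x 28 bits). The segment
# ordering and the all-zero NOP encoding are illustrative assumptions.
SEGMENT_LAYOUT = [("ALU0", 30), ("ALU1", 30), ("MUL0", 27), ("MUL1", 27), ("DIV0", 28)]
TOTAL_BITS = sum(width for _, width in SEGMENT_LAYOUT)   # 142

def split_instruction(word):
    """Return {unit: segment} for a 142-bit instruction, dropping NOP segments."""
    assert 0 <= word < 1 << TOTAL_BITS
    segments, offset = {}, TOTAL_BITS
    for unit, width in SEGMENT_LAYOUT:
        offset -= width
        field = (word >> offset) & ((1 << width) - 1)
        if field != 0:                  # assumed NOP encoding: all-zero segment
            segments[unit] = field
    return segments

# Example: an instruction where only ALU0 and DIV0 carry real operations.
instr = (0x2ABCDEF1 << (TOTAL_BITS - 30)) | 0x0F0F0F0
print(split_instruction(instr))
```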

3.3.7 Controller Unit

The controller provides temporary storage for the decoded instructions, and then sequences and issues them to the function units during execution. The controller is divided into two parts: read control and write control. The read control receives the decoded instructions from the decoder and then signals the appropriate storage unit, such as the off-chip memory, an IRF unit, or the SPRF unit, to read out the desired data to the assigned function unit. The write control holds the decoded instructions until the function unit has finished its execution, and then signals the destined storage unit to write back the result of the computation. Precise timing and an exact computation data flow are the two essential tasks for the controller in managing the overall operation of the ALU Cluster.
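The sketch below imitates the split between read control and write control in a cycle-driven way: the read side fetches operands immediately, while the write side holds the decoded destination until the function unit's latency has elapsed. The field names, the storage interface, and the latency bookkeeping are all hypothetical simplifications.

```python
# Simplified sketch of the controller's read/write split; field names,
# the storage interface, and the latency bookkeeping are all hypothetical.
import collections

class Controller:
    def __init__(self):
        # write control holds (ready_cycle, storage_name, address) until the
        # corresponding function unit has finished executing.
        self.pending_writes = collections.deque()

    def read_control(self, cycle, decoded, storages):
        """Read the two source operands for one decoded sub-instruction."""
        a = storages[decoded["src_a"]].read(decoded["addr_a"])
        b = storages[decoded["src_b"]].read(decoded["addr_b"])
        ready_cycle = cycle + decoded["latency"]
        self.pending_writes.append((ready_cycle, decoded["dst"], decoded["dst_addr"]))
        return a, b

    def write_control(self, cycle, result, storages):
        """If the oldest pending operation has finished, write its result back."""
        if self.pending_writes and self.pending_writes[0][0] <= cycle:
            _, dst, dst_addr = self.pending_writes.popleft()
            storages[dst].write(dst_addr, result)
            return True
        return False

class DummyStorage:                  # stands in for off-chip memory / IRF / SPRF
    def __init__(self): self.mem = {}
    def read(self, addr): return self.mem.get(addr, 0)
    def write(self, addr, data): self.mem[addr] = data

storages = {"IRF": DummyStorage(), "SPRF": DummyStorage(), "MEM": DummyStorage()}
ctrl = Controller()
decoded = {"src_a": "IRF", "addr_a": 1, "src_b": "SPRF", "addr_b": 2,
           "dst": "IRF", "dst_addr": 5, "latency": 2}
a, b = ctrl.read_control(0, decoded, storages)   # cycle 0: operands fetched
print(ctrl.write_control(1, a + b, storages))    # cycle 1: still executing -> False
print(ctrl.write_control(2, a + b, storages))    # cycle 2: result written back -> True
```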

3.4 System Operation

In order to increase computation throughput and decrease the operation period, a pipelined system operation is adopted. As is done in most high performance processors, the ALU Cluster operates in a pipelined manner to reach a higher instruction throughput. The pipeline execution diagram of the ALU Cluster is depicted in Figure 3.4.1. The complete pipeline for executing one instruction proceeds from FETCH, DECODE, and READ REGISTER through EXECUTE 1 ~ N to WRITE BACK.

Figure 3.4.1: Pipeline Execution Diagram Details

During the first pipeline stage in cycle N (FETCH), the decoder fetches and sequences the VLIW-like instructions from the instruction microcode storage. During the decoding stage (DECODE), the decoder decodes the incoming instructions and delivers the decoded results to the controller. During the register file read stage (READ REGISTER), the controller directs the data storage unit to read out the desired data, which comes from one of the off-chip data memory, the unit's own IRF, or the SPRF unit, and sends it to the dedicated function unit. During the execution stages (EXECUTE 1 ~ N), each function unit performs its assigned computation, with the number of execution stages depending on the function unit. Finally, during the write back stage (WRITE BACK), the controller directs the destined storage unit to write back the result of the computation.
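The following Python sketch walks one instruction through the stages just described (FETCH, DECODE, READ REGISTER, a variable number of EXECUTE stages, WRITE BACK); it is a timing illustration only, and the per-unit execute-stage counts are assumptions for the example.

```python
# Illustrative walk of one instruction through the ALU Cluster pipeline.
# The per-unit execute-stage counts below are assumptions for the example.
EXECUTE_STAGES = {"ALU": 2, "MUL": 4, "DIV": 16}

def pipeline_timeline(unit, issue_cycle=0):
    """Return (stage_name, cycle) pairs for one instruction on `unit`."""
    stages = ["FETCH", "DECODE", "READ REGISTER"]
    stages += [f"EXECUTE {i + 1}" for i in range(EXECUTE_STAGES[unit])]
    stages += ["WRITE BACK"]
    return [(stage, issue_cycle + i) for i, stage in enumerate(stages)]

for stage, cycle in pipeline_timeline("MUL"):
    print(f"cycle {cycle:2d}: {stage}")
# With one instruction issued per cycle, the pipeline can sustain one result
# per cycle per pipelined function unit once it is filled.
```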
