

Chapter 3 Micro-architecture simulator

3.3 Mechanism of the micro-architecture simulator

This section describes how the micro-architecture simulator simulates the performance of a media application. In this work, FFT is used as the simulation benchmark.

The simulation flow is shown in Figure 3.1. The design of the simulator, i.e., the kernel of the stream programming model, was introduced in Section 3.2. Using the simulator can be roughly divided into two stages:

1.) Determining the micro-architecture on which the application is simulated. This step involves setting a number of parameters and is described in Section 3.3.1.

2.) Feeding the cluster instructions (stream) into the simulator and estimating performance, as described in Sections 3.3.2 and 3.3.3.

3.3.1 Parameter definition

Before the application can be simulated, the “micro-architecture decision” step in Figure 2.3 has to be completed: several parameters that establish the micro-architecture must be decided.

1.) The number of ALU, MUL, and DIV units in a cluster: This step defines the micro-architecture inside a cluster. If a cluster is defined to contain three ALU units, two MUL units, and one DIV unit, the cluster that executes instructions looks like Figure 3.10. These parameters determine the length of the cluster instruction and the number of bits required for the “DEST” field in the instruction format. If the number of functional units in a cluster is between 1 and 5, the DEST field needs only 3 bits to express the destination of the result write-back; if the number of functional units is between 6 and 13, the DEST field needs 4 bits.

Figure 3.10 Block diagram of a cluster

2.) Determine the number of clusters needed for the micro-architecture.

3.) Determine a sufficient amount of SRF, SP, and LRF memory, and the number of bits of “ADDR_BIT”, which expresses a memory location in the instruction format.

4.) From the parameters defined above, we can calculate the instruction lengths for the ALU, MUL, and DIV units, and the length of the cluster instruction:

ALU_BIT = 8 + 3*ADDR_BIT + DEST_BIT

MUL_BIT = 5 + 3*ADDR_BIT + DEST_BIT

DIV_BIT = 6 + 3*ADDR_BIT + DEST_BIT

VLIW_length = ALU_BIT*ALU_Number + MUL_BIT*MUL_Number + DIV_BIT*DIV_Number

5.) Determine the number of clock cycles consumed by each functional unit: “ALU_Time”, “MUL_Time”, and “DIV_Time”. This information is used to calculate the CPU time.
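The length formulas above can be checked with a short calculation. The following is a minimal sketch, not the simulator's own code: the function names are hypothetical, the "number of functional units" is assumed to be the total of ALU, MUL, and DIV units, and ADDR_BIT = 6 is an assumed value chosen because it reproduces the 29-, 26-, and 27-bit instruction lengths quoted in the example of Section 3.3.3.

```python
def dest_bits(num_units: int) -> int:
    """DEST field width, using the thresholds stated in the text."""
    if 1 <= num_units <= 5:
        return 3
    if 6 <= num_units <= 13:
        return 4
    raise ValueError("the text only covers 1 to 13 functional units")


def instruction_lengths(alu: int, mul: int, div: int, addr_bit: int) -> dict:
    """Evaluate the ALU_BIT, MUL_BIT, DIV_BIT, and VLIW_length formulas."""
    # Assumption: the unit count that sets DEST_BIT is ALU + MUL + DIV.
    dest_bit = dest_bits(alu + mul + div)
    alu_bit = 8 + 3 * addr_bit + dest_bit
    mul_bit = 5 + 3 * addr_bit + dest_bit
    div_bit = 6 + 3 * addr_bit + dest_bit
    return {
        "DEST_BIT": dest_bit,
        "ALU_BIT": alu_bit,
        "MUL_BIT": mul_bit,
        "DIV_BIT": div_bit,
        "VLIW_length": alu_bit * alu + mul_bit * mul + div_bit * div,
    }


# Two ALU, two MUL, one DIV; ADDR_BIT = 6 is an assumed value.
lengths = instruction_lengths(alu=2, mul=2, div=1, addr_bit=6)
print(lengths["ALU_BIT"], lengths["MUL_BIT"], lengths["DIV_BIT"])  # 29 26 27
print(lengths["VLIW_length"])  # 137
```

With these values the cluster instruction is 2*29 + 2*26 + 27 = 137 bits long, which matches the 137-bit example given later in the chapter.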

3.3.2 Map application into dedicated binary stream programming codes

With the “micro-architecture decision” step of Section 3.3.1 completed, the ISA of the micro-architecture and the instruction format of each functional unit are known. As shown in Figure 3.11, hand coding replaces the traditional compiler: the instructions of the application are scheduled into cluster instructions according to the ISA, and the cluster instructions are then mapped, according to the instruction formats of the functional units, into stream programming codes that can be executed on the stream processor.

Figure 3.11 Map the application into stream programming code
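The mapping from a scheduled instruction to binary code can be illustrated with a small encoder. This is only a sketch under stated assumptions: the text gives the field widths (a fixed part of 8, 5, or 6 bits, three ADDR_BIT-wide fields, and one DEST field) but not their order or meaning, so the layout "fixed part, three addresses, DEST" and all function and parameter names below are hypothetical.

```python
# Widths of the fixed part, taken from the constant terms of the length formulas.
FIXED_BITS = {"ALU": 8, "MUL": 5, "DIV": 6}


def encode_unit_instruction(unit, fixed, addrs, dest, addr_bit, dest_bit):
    """Encode one functional unit instruction as a string of '0' and '1'.

    Assumed field order: fixed part, three address fields, DEST field.
    """
    if len(addrs) != 3:
        raise ValueError("the length formulas imply three address fields")
    fields = [(fixed, FIXED_BITS[unit])]
    fields += [(addr, addr_bit) for addr in addrs]
    fields.append((dest, dest_bit))
    bits = ""
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        bits += format(value, f"0{width}b")  # zero-padded binary
    return bits


def pack_cluster_instruction(unit_instructions):
    """Concatenate the unit instructions into one cluster instruction."""
    return "".join(unit_instructions)


# Assumed parameters: ADDR_BIT = 6, DEST_BIT = 3 (five functional units).
alu = encode_unit_instruction("ALU", 1, (1, 2, 3), 4, addr_bit=6, dest_bit=3)
mul = encode_unit_instruction("MUL", 0, (0, 0, 0), 0, addr_bit=6, dest_bit=3)
div = encode_unit_instruction("DIV", 0, (0, 0, 0), 0, addr_bit=6, dest_bit=3)
word = pack_cluster_instruction([alu, alu, mul, mul, div])
print(len(alu), len(mul), len(div), len(word))  # 29 26 27 137
```

Representing the instruction as a text string of bits keeps the sketch easy to inspect; the actual file format of the stream programming codes is not specified in this section.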

3.3.3 Simulation

As shown in Figure 2.3, once the micro-architecture decision and the mapping of the application into stream programming codes are complete, the codes can be fed into the simulator to simulate their performance on the stream processor. The simulation proceeds as follows:

1.) Load the cluster instructions stored in files into the instruction memory using file I/O.

2.) According to the number of clusters defined in Section 3.3.1, the simulator creates “cluster_number” cluster objects. Each cluster object represents one cluster in the real micro-architecture. According to the numbers of ALU, MUL, and DIV units defined in Section 3.3.1, each cluster object declares the same numbers of ALU, MUL, and DIV objects. These functional unit objects stand for the functional units that process instructions in a cluster of the actual micro-architecture.

3.) The controller fetches the cluster instructions from the instruction memory and distributes them equally among the clusters for execution.

4.) The cluster divides each cluster instruction into as many instructions as there are functional units in the cluster, and then submits these instructions to the ALU/MUL/DIV units for execution.

For example, if a cluster has two ALU units, two MUL units, and one DIV unit, a 137-bit cluster instruction is divided into two 29-bit ALU instructions, two 26-bit MUL instructions, and one 27-bit DIV instruction, which are submitted to the five functional units for processing. Figure 3.4 illustrates the function of the cluster.

5.) Estimate Performance:

CPU time: Every time the function calculate() is called in an ALU, MUL, or DIV unit, we increment the execution count for that type of functional unit inside the corresponding cluster. Using ALU_Time, MUL_Time, and DIV_Time defined in Section 3.3.1, the number of clock cycles each cluster spends running the application can then be estimated. The CPU time of the application is the maximum of the times consumed by the individual clusters.

Memory access counts: Every time an ALU, MUL, or DIV unit fetches data or writes data back, the memory hierarchy is used. The simulator accumulates the number of accesses to each level of the memory hierarchy (SRF, SP, or LRF) and reports the memory hierarchy usage of the application when the simulation finishes.

Amount of each memory hierarchy level used: The simulation results also show the amount of memory required at each level when the application runs on the stream processor. This information helps determine how much capacity each level of the memory hierarchy needs for the application.
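The simulation steps above can be sketched in a few lines. This is an illustrative sketch rather than the simulator itself, and it rests on several assumptions that the text does not fix: the cycle costs in UNIT_TIME are made-up values, the controller is assumed to distribute instructions round-robin, every cluster instruction is assumed to fire all five units, and the cycle count of a cluster is assumed to be the execution counts weighted by the per-unit cycle costs. All names are hypothetical.

```python
from collections import Counter

# Hypothetical cycle costs (ALU_Time, MUL_Time, DIV_Time).
UNIT_TIME = {"ALU": 1, "MUL": 2, "DIV": 4}
# Field widths for the 137-bit example: two ALU, two MUL, one DIV.
LAYOUT = [("ALU", 29), ("ALU", 29), ("MUL", 26), ("MUL", 26), ("DIV", 27)]


def split_cluster_instruction(bits, layout):
    """Cut a cluster instruction into one bit string per functional unit."""
    if len(bits) != sum(width for _, width in layout):
        raise ValueError("cluster instruction length does not match layout")
    parts, pos = [], 0
    for unit, width in layout:
        parts.append((unit, bits[pos:pos + width]))
        pos += width
    return parts


class Cluster:
    def __init__(self):
        self.executed = Counter()    # executions per unit type
        self.mem_access = Counter()  # accesses per memory level

    def execute(self, bits):
        for unit, _unit_bits in split_cluster_instruction(bits, LAYOUT):
            self.executed[unit] += 1  # stands in for the unit's calculate()

    def record_access(self, level):
        # Called wherever a unit fetches or writes back data (SRF, SP, LRF).
        self.mem_access[level] += 1

    def cycles(self):
        # Assumed aggregation: execution counts weighted by cycle cost.
        return sum(UNIT_TIME[unit] * n for unit, n in self.executed.items())


def simulate(instructions, cluster_number):
    clusters = [Cluster() for _ in range(cluster_number)]
    for i, bits in enumerate(instructions):
        clusters[i % cluster_number].execute(bits)  # round-robin (assumed)
    cpu_time = max(cluster.cycles() for cluster in clusters)
    return clusters, cpu_time


program = ["0" * 137] * 3  # three placeholder cluster instructions
clusters, cpu_time = simulate(program, cluster_number=2)
clusters[0].record_access("LRF")
print([cluster.cycles() for cluster in clusters])  # [20, 10]
print(cpu_time)  # 20
```

Here the first cluster receives two of the three instructions, so it costs 2 * (2*1 + 2*2 + 4) = 20 cycles and determines the CPU time, in line with taking the maximum over all clusters.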
