Design and Implementation of an ALU Cluster

3.3 An ALU Cluster Intellectual Property

In this section, we combine the improved ALU cluster with the wrapper designed earlier. The improvements to the ALU cluster concern its control logic and internal storage; the operational units are unchanged from the previous design. The first part of this section presents the resulting architecture. The second part covers functional simulation, using a 16-tap FIR program, a common media application, as the testbench. The last part summarizes the improvements.

3.3.1 Architecture of an ALU Cluster Intellectual Property

The architecture of the IP consists of the wrapper, the ALU cluster, the instruction memory, and the data memory, as shown in Figure 3.3.1. The wrapper was introduced in the previous section. The main improvement is in the decoder of the ALU cluster. One improvement is a better ability to read sources and write destinations: all memory, including the data and instruction memories, is exposed to the AMBA bus and can be accessed from it directly. The decoder also improves performance by shortening access cycles. A burst read has a latency of two cycles, after which read data comes out every cycle. The original design needed four cycles to access one burst read.

Figure 3.3.1: Architecture of an ALU Cluster IP (block diagram not reproduced; the recoverable labels are the alu_work signal and bus widths of 32, 142, and 14 bits)

To let the ALU cluster keep executing while the AMBA bus is granted to another master, the ALU cluster needs a module that supplies addresses to the instruction memory automatically. A new component, Pc_counter, handles this job by incrementing the program counter by one every clock cycle. The decoder holds the end value of Pc_counter and compares it against the counter every cycle to check whether the ALU cluster has finished the job. When the job is done, the decoder activates the alu_work signal to notify the wrapper. While alu_work is inactive, the IP cannot be accessed and returns a RETRY response to the AMBA bus. One special input combination erases the end value of Pc_counter in the decoder and thereby forces the IP to stop execution. This forced-stop mechanism is designed to avoid the possibility of deadlock.
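The interaction between Pc_counter, the end value, and alu_work can be sketched as a small cycle-level model. This is only an illustrative Python sketch, not the RTL of the IP: the class name AluClusterModel, its method names, the choice that the counter starts at zero, and the use of an end value of zero to mean "stopped" are all assumptions made for the example.

```python
from dataclasses import dataclass

# Bus response names used by this sketch (plain strings, not a real bus API).
OKAY, RETRY = "OKAY", "RETRY"


@dataclass
class AluClusterModel:
    """Toy model of Pc_counter and the decoder's end-value comparison."""

    end_value: int = 0     # last instruction number; 0 means "stopped" (assumption)
    pc: int = 0            # program counter driven by Pc_counter
    running: bool = False  # True while a program is executing

    def start(self, end_value: int) -> None:
        """Write the end value, which starts execution from address 0 (assumption)."""
        self.end_value = end_value
        self.pc = 0
        self.running = end_value > 0

    def force_stop(self) -> None:
        """Model the special input combination that erases the end value."""
        self.end_value = 0
        self.running = False

    def tick(self) -> None:
        """Advance one clock cycle: increment the counter, then compare."""
        if not self.running:
            return
        self.pc += 1
        if self.pc == self.end_value:
            self.running = False

    @property
    def alu_work(self) -> bool:
        """Active once the job is done, following the description in the text."""
        return not self.running

    def bus_access(self) -> str:
        """Response the wrapper would give to a bus access in this cycle."""
        return OKAY if self.alu_work else RETRY
```

In this model, starting with an end value of 0x5E takes 94 ticks before alu_work becomes active, and every bus access before that point is answered with RETRY. Calling force_stop() in the middle of a run makes the model accessible again immediately, which is the deadlock-avoidance behaviour described above.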

3.3.2 Functional Verification

The development of the IP, including the finite state machine in the wrapper and the overall architecture, has been described above. This part presents the simulation of the 16-tap FIR filter testbench and its results, to verify that the IP can execute media applications.

3.3.2.1 Testbench: 16-tap FIR Filter System

The IP is simulated with a 16-tap FIR filter as the benchmark. The FIR filter system is chosen as the testbench for functional verification because it suits a one-dimensional architecture, consists largely of repeated additions and multiplications, and is used in a wide range of DSP applications, such as matched filtering, pulse shaping, and equalization. A brief review of the FIR filter system follows. Equation 3.1 describes the FIR filter system, where M is the length of the FIR filter, b_k are its coefficients, x[n-k] is the data sample at time instance n-k, and y[n] is the output at time instance n.

$$ y[n] = \sum_{k=0}^{M-1} b_k \, x[n-k] \qquad (3.1) $$

The left panel of Figure 3.3.2 shows the coefficients b_k of a 16-tap Kaiser-window FIR bandpass filter. The right panel shows the input function, an exponential function with ten sampling points. The FIR filter is simulated in advance in MathWorks MATLAB, with the input function fed into the filter, to obtain the expected results shown in Figure 3.3.3. If the results from the IP equal those from MATLAB, we can be sure that the functionality is correct.
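The comparison against a software reference can be illustrated with a direct-form evaluation of the FIR sum, y[n] as the sum over k of b_k times x[n-k]. The following Python sketch stands in for the MATLAB reference only for illustration; the function names fir_filter and matches_reference are hypothetical, samples outside the input range are assumed to be zero, and the full convolution length is assumed. The actual coefficient and input values are not given in the text, so none are reproduced here.

```python
def fir_filter(coeffs, samples):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * samples[n - k].

    Samples outside the input range are treated as zero (assumption), so the
    output has len(samples) + len(coeffs) - 1 entries (full convolution).
    """
    taps, n_in = len(coeffs), len(samples)
    out = []
    for n in range(n_in + taps - 1):
        acc = 0
        for k in range(taps):
            if 0 <= n - k < n_in:  # skip terms where x[n-k] is outside the input
                acc += coeffs[k] * samples[n - k]
        out.append(acc)
    return out


def matches_reference(ip_results, coeffs, samples):
    """Return True if the values read back from the IP equal the software reference."""
    return list(ip_results) == fir_filter(coeffs, samples)
```

With 16 coefficients and ten input samples, this full-length form yields 25 output samples. Exact equality is used in matches_reference on the assumption that both sides work with integer data; a tolerance would be needed for floating-point references.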

Figure 3.3.2: Coefficients of the FIR Filter System and Its Input Function (plot not reproduced; panel title: "16-th order Kaiser window FIR filter")

Figure 3.3.3: Expected Output Results of the FIR Filter System (plot not reproduced; panel title: "FIR output results", vertical axis scaled by 10^5)

3.3.2.2 Simulation Results

The simulation uses UMC 0.18 libraries and is run after logic synthesis. The clock period is set to 6.5 ns; in other words, the design is simulated at about 153.85 MHz. The simulation has four steps, as shown in Figure 3.3.4, which gives the full view of the simulation including all steps. The first step, marked by the red rectangle, writes the necessary coefficients into the IP. Media applications usually have many tables and coefficients that are reused throughout execution, so writing the commonly used coefficients in advance accelerates execution by reducing memory references. A total of fifty-two coefficients are written in the first step. They are fed into the operations that follow.

The second step, marked by the pink rectangle, configures the IP. The instructions are written into the instruction memory through the AMBA bus. The configuration time depends on the application. As introduced in the previous chapter, the instruction format is a 142-bit VLIW instruction. However, the input bandwidth from the AMBA bus to the wrapper is only 32 bits, so one instruction needs five writes to be configured. Five 32-bit writes carry 160 bits, and the 18 extra bits are truncated automatically. Our FIR is a VLIW program of 94 instructions, so configuring it takes 470 clock cycles in total.
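The configuration arithmetic can be checked with a short sketch that splits one 142-bit instruction into 32-bit bus words and joins them again. This is a hedged Python illustration, not the wrapper's actual logic: the function names are hypothetical, and the word order (least-significant word first, with the 18 padding bits at the top of the last word) is an assumption that the text does not specify.

```python
INSTR_BITS = 142  # VLIW instruction width stated in the text
BUS_BITS = 32     # AMBA data width into the wrapper stated in the text


def words_per_instruction(instr_bits=INSTR_BITS, bus_bits=BUS_BITS):
    """Number of bus writes needed for one instruction (ceiling division)."""
    return -(-instr_bits // bus_bits)


def padding_bits(instr_bits=INSTR_BITS, bus_bits=BUS_BITS):
    """Bits carried by the bus writes beyond the instruction width."""
    return words_per_instruction(instr_bits, bus_bits) * bus_bits - instr_bits


def split_instruction(instr, instr_bits=INSTR_BITS, bus_bits=BUS_BITS):
    """Split an instruction into bus words, least-significant word first (assumption)."""
    if not 0 <= instr < (1 << instr_bits):
        raise ValueError("instruction does not fit in the instruction width")
    mask = (1 << bus_bits) - 1
    count = words_per_instruction(instr_bits, bus_bits)
    return [(instr >> (i * bus_bits)) & mask for i in range(count)]


def join_words(words, instr_bits=INSTR_BITS, bus_bits=BUS_BITS):
    """Reassemble the bus words and truncate the padding bits."""
    value = 0
    for i, word in enumerate(words):
        value |= word << (i * bus_bits)
    return value & ((1 << instr_bits) - 1)


def config_cycles(num_instructions):
    """Configuration cycles, assuming one bus write per clock cycle."""
    return num_instructions * words_per_instruction()
```

Under these assumptions, words_per_instruction() gives 5, padding_bits() gives 18, and config_cycles(94) gives 470, which agrees with the figures above. At a clock period of 6.5 ns and one write per cycle, 470 cycles correspond to 3055 ns, ignoring any additional bus overhead.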

Figure 3.3.4: Full Waveform View of Simulation (waveform not reproduced; annotated regions: Write coefficients, Configure IP, Execute FIR, Read results)

The IP executes the pre-configured program in the third step, marked by the yellow rectangle in Figure 3.3.4. The detailed simulation waveforms are shown in Figure 3.3.5 and Figure 3.3.6. Two parts of Figure 3.3.5 are highlighted. The first is the light blue rectangle, which marks the key operation that starts the ALU cluster: the last instruction number is written to a dedicated register that stores the end value of the program. As shown in the figure, the value is '5e' in hexadecimal, that is, 94 in decimal, so the internal check halts the FIR program at the 94th instruction.

The second highlight is the green rectangle. We insert a read request in the middle of execution to check whether it would disturb the correctness of the ALU cluster. The IP returns a RETRY response to the AMBA bus, and the correct results appear in the fourth step. One more point deserves attention. HSELx, introduced in Chapter 2, stays low throughout the execution step. This logical low is necessary: it means that the AMBA bus can be used by other IPs while the ALU cluster is executing a media application. Figure 3.3.6 shows the transactions that follow Figure 3.3.5.

(Waveform figure not reproduced; its annotation reads "IP return RETRY response".)

From the document: Design of a Reconfigurable Arithmetic-Unit Hardware-Accelerator Silicon IP for Multimedia Stream Processing (pages 52-57)