Chapter 4 Performance Evaluation

4.3 Result of performance evaluation

This section discusses in detail the performance evaluation of the stream micro-architecture with different numbers of clusters.

4.3.1 1-cluster stream micro-architecture simulation

First, the FFT instructions are scheduled as cluster instructions for a single cluster, which gives 135 cluster instructions in total. The cluster instructions are then translated into binary code, i.e., a representation in 0's and 1's, according to the instruction format defined in Figure 3.5.
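The translation step can be illustrated with a minimal sketch. The format of Figure 3.5 is not reproduced in this section, so the 32-bit field layout, the opcode values, and the function name `encode` below are all hypothetical and serve only to show how one functional-unit operation might be packed into a string of 0's and 1's.

```python
# Hypothetical 32-bit layout (NOT the format of Figure 3.5):
#   bits 31..24 opcode, 23..16 destination, 15..8 source A, 7..0 source B
OPCODES = {"NOP": 0, "ADD": 1, "SUB": 2, "MUL": 3}  # illustrative values only


def encode(op: str, dst: int, src_a: int, src_b: int) -> str:
    """Pack one operation into a 32-character string of '0' and '1'."""
    if op not in OPCODES:
        raise ValueError(f"unknown opcode: {op}")
    for name, value in (("dst", dst), ("src_a", src_a), ("src_b", src_b)):
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name} does not fit in 8 bits: {value}")
    word = (OPCODES[op] << 24) | (dst << 16) | (src_a << 8) | src_b
    return format(word, "032b")


print(encode("ADD", 3, 1, 2))
```

A real cluster instruction would carry one such slot per functional unit; the sketch encodes a single slot to keep the idea visible.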

1-cluster micro-architecture (Figure 4.2): The 1-cluster micro-architecture contains two ALU units, two MUL units, and one DIV unit, i.e., five functional units (FUs), together with the SP (scratch pad), the intermediate medium for exchanging data between the functional units, which consists of sixty-four 32-bit registers. There is one LRF of sixty-four 32-bit registers at the input of every functional unit, and one SRF of sixty-four 32-bit registers. Therefore, the SRF consists of sixty-four 32-bit registers, the SP consists of 64*1 = sixty-four 32-bit registers since there is only one cluster in the entire system, and the LRF consists of 64*2*5 = six hundred and forty 32-bit registers in total since there are five FUs in a cluster and one cluster in the system.
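The register counts above follow a simple pattern that extends to any number of clusters. The sketch below reproduces that arithmetic; the names `RegisterBudget` and `register_budget` are illustrative, and the factor of 2 in the LRF term is taken directly from the product 64*2*5 in the text rather than derived independently.

```python
from dataclasses import dataclass

REGS_PER_FILE = 64    # each register file holds sixty-four 32-bit registers
FUS_PER_CLUSTER = 5   # two ALUs, two MULs, one DIV
LRF_FACTOR = 2        # factor of 2 copied from the text's 64*2*5 product


@dataclass(frozen=True)
class RegisterBudget:
    srf: int
    sp: int
    lrf: int


def register_budget(n_clusters: int) -> RegisterBudget:
    """Return the number of 32-bit registers per hierarchy level."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    return RegisterBudget(
        srf=REGS_PER_FILE,                 # one SRF for the whole system
        sp=REGS_PER_FILE * n_clusters,     # one scratch pad per cluster
        lrf=REGS_PER_FILE * LRF_FACTOR * FUS_PER_CLUSTER * n_clusters,
    )


for n in (1, 2, 4, 8):
    print(n, register_budget(n))
```

For one cluster this gives 64, 64, and 640 registers for the SRF, SP, and LRF, matching the figures stated above.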

Figure 4.2(a) Block diagram of 1-cluster micro-architecture simulator

Figure 4.2(b) Stream programming model of 1-cluster architecture

The cluster instructions, initially saved in the file "fft.dat", are read into instruction memory through file I/O. The controller shown in Figure 4.2 fetches the cluster instructions from instruction memory and submits all of them to the only cluster for execution. The initial FFT data x[0]~x[31] are loaded from memory into the SRF, and the initial inputs of the functional units are taken from the SRF. Many intermediate results are produced on the way from x[0] to X[0], where X[0] is an output of the FFT. If such a result is used again in the same functional unit, it is saved in LRF0 of that functional unit.

four functional units. Only the final results, X[0]~X[31], are written back to memory through the SRF. Intermediate results are kept in the LRF, which takes less time to access than memory, and this speeds up the whole computation.
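The fetch path described above (file, instruction memory, controller, cluster) can be sketched as follows. This is only an illustration under stated assumptions: the layout of "fft.dat" is not given here, so the sketch assumes one binary-coded cluster instruction per text line, and the function names `load_instruction_memory` and `fetch` are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def load_instruction_memory(lines: Iterable[str]) -> List[str]:
    """Keep every non-empty line that consists only of '0' and '1'."""
    memory: List[str] = []
    for number, raw in enumerate(lines, start=1):
        word = raw.strip()
        if not word:
            continue  # skip blank lines
        if set(word) - {"0", "1"}:
            raise ValueError(f"line {number}: not a binary word: {word!r}")
        memory.append(word)
    return memory


def fetch(memory: List[str]) -> Iterator[Tuple[int, str]]:
    """Controller loop: yield (program counter, instruction) in order."""
    for pc, word in enumerate(memory):
        yield pc, word


# With a real file: with open("fft.dat") as f: memory = load_instruction_memory(f)
sample_lines = ["0101\n", "\n", "1100\n"]  # stand-in for the file contents
memory = load_instruction_memory(sample_lines)
for pc, word in fetch(memory):
    print(pc, word)
```

Passing any iterable of lines keeps the loader testable without the actual file.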

Figure 4.3 shows the simulation results. Executing the one hundred and thirty-five cluster instructions uses the ALU units 257 times, the MUL units ninety-six times, and the DIV unit zero times, since the FFT does not require division. The whole FFT takes four hundred and forty-nine clock cycles. Meanwhile, the SRF, SP, and LRF are accessed one hundred and thirty-seven times, one hundred and sixty-three times, and seven hundred and fifty-nine times, respectively.
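The counts reported above can be combined into two derived figures: functional-unit operations per cluster instruction and per clock cycle. The sketch below computes them from the reported numbers only; the class name `ClusterStats` and the two derived metrics are illustrative additions, not quantities reported by the simulator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterStats:
    instructions: int
    alu_uses: int
    mul_uses: int
    div_uses: int
    cycles: int

    @property
    def fu_ops(self) -> int:
        """Total number of functional-unit uses."""
        return self.alu_uses + self.mul_uses + self.div_uses

    def ops_per_instruction(self) -> float:
        return self.fu_ops / self.instructions

    def ops_per_cycle(self) -> float:
        return self.fu_ops / self.cycles


# Values reported for the 1-cluster simulation.
one_cluster = ClusterStats(instructions=135, alu_uses=257, mul_uses=96,
                           div_uses=0, cycles=449)
print(one_cluster.fu_ops)
print(f"{one_cluster.ops_per_instruction():.2f} operations per instruction")
print(f"{one_cluster.ops_per_cycle():.2f} operations per cycle")
```

With 353 functional-unit uses in total, each cluster instruction carries between two and three operations on average.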

Since this micro-architecture has only one cluster, the time needed to execute the FFT equals the number of clock cycles spent by cluster 1. The memory-hierarchy usage ratio of this micro-architecture is SRF : SP : LRF = 59 : 11 : 74.

Figure 4.3 Simulation result of 1-cluster stream micro-architecture

4.3.2 2-cluster stream micro-architecture simulation

First, the FFT instructions are scheduled as cluster instructions for two clusters.

The 2-cluster micro-architecture resembles a dual-CPU system: it can process two cluster instructions in one clock cycle. After scheduling, there are sixty-five cluster instructions for cluster 1 and sixty-five cluster instructions for cluster 2. The final results are obtained after the two clusters have executed one hundred and thirty cluster instructions in total. As with the 1-cluster simulator, these one hundred and thirty cluster instructions must then be translated into binary code that can be executed on the simulator, according to the instruction format shown in Figure 3.5.

2-cluster micro-architecture (Figure 4.4): There are two clusters in the micro-architecture. Each cluster contains two ALU units, two MUL units, and one DIV unit, i.e., five functional units (FUs), together with the SP (scratch pad), the intermediate medium for exchanging data between the functional units, which consists of sixty-four 32-bit registers. The definition is the same as that of the 1-cluster micro-architecture, which has been illustrated in Section 4.2.2. In addition, there is one LRF of sixty-four 32-bit registers at the input of every functional unit, and one SRF of sixty-four 32-bit registers in the memory hierarchy.

Therefore, the SRF consists of sixty-four 32-bit registers, the SP consists of 64*2 = 128 32-bit registers since there are two clusters in the entire system, and the LRF consists of 64*2*5*2 = 1280 32-bit registers in total since there are five FUs in a cluster and two clusters in the system.

Figure 4.4(a) Block diagram of 2-cluster micro-architecture simulator (blocks shown: FILE, File I/O, Instruction memory, Controller, SRF, MEM_D, and two clusters, each with ALU, ALU, MUL, MUL, DIV, Scratch Pad, and LRF)

Figure 4.4(b) Stream programming model of 2-cluster architecture

The cluster instructions, initially saved in the file "fft.dat", are read into instruction memory through file I/O. The controller shown in Figure 4.4 fetches the cluster instructions from instruction memory and distributes them equally to cluster 1 and cluster 2 for execution. The initial FFT data x[0]~x[31] are loaded from memory into the SRF, and the initial inputs of the functional units are taken from the SRF. Many intermediate results are produced on the way from x[n] to X[n], where X[n] is an output of the FFT. If such a result is the input of a functional unit in the same cluster, it is written to the LRF or the SP. If the result is the input of a functional unit in the other cluster, it is passed to that functional unit through the SRF. Data exchange between instructions is kept within the same cluster as far as possible, through the faster LRF or SP. The SRF, which has a slower access speed, is used only when data are exchanged between clusters.

The final results X[0]~X[31] are temporarily saved in the SRF and then written back to memory by a memory store command. Handling the instructions in this way can greatly reduce computation time.
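The routing rule described above can be written as a small decision function. This is a sketch of one reading of the text: a result reused by the same functional unit stays in the LRF (as stated in Section 4.3.1), a result consumed by another functional unit of the same cluster goes through the SP, and a result consumed in another cluster goes through the SRF. The names `Port` and `storage_level` are hypothetical.

```python
from typing import NamedTuple


class Port(NamedTuple):
    cluster: int  # index of the cluster
    unit: int     # index of the functional unit inside that cluster


def storage_level(producer: Port, consumer: Port) -> str:
    """Choose the register level that carries a result to its consumer."""
    if producer.cluster != consumer.cluster:
        return "SRF"  # only inter-cluster traffic uses the slower SRF
    if producer.unit != consumer.unit:
        return "SP"   # exchange between functional units of one cluster
    return "LRF"      # reuse inside the same functional unit


print(storage_level(Port(0, 1), Port(0, 1)))
print(storage_level(Port(0, 1), Port(0, 3)))
print(storage_level(Port(0, 1), Port(1, 1)))
```

A scheduler that keeps producer and consumer in the same cluster therefore avoids the SRF for intermediate results.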

Figure 4.5 shows simulation results of 2-cluster stream micro-architecture.

Compared to the 1-cluster micro-architecture in Section 4.3.1, the performance is greatly improved when two clusters are used to process the cluster instructions of the FFT. As the figure shows, cluster 1 processes sixty-five cluster instructions, using the ALU units one hundred and twenty-seven times and the MUL units twenty-two times; the DIV unit is never used, since the FFT requires no division.

Cluster 1 requires one hundred and seventy-one clock cycles to process its sixty-five cluster instructions. Meanwhile, the SRF, SP, and LRF are accessed one eighty-one times, forty-eight times, and three hundred and eighteen times, respectively.

As the figure shows, cluster 2 processes sixty-five cluster instructions, using the ALU units one hundred and twenty-nine times and the MUL units seventy-four times; the DIV unit is never used, since the FFT requires no division. Cluster 2 requires two hundred and seventy-seven clock cycles to process its sixty-five cluster instructions. Meanwhile, the SRF, SP, and LRF are accessed one hundred and three times, one hundred and twenty-three times, and three hundred and eighty-three times, respectively.

Figure 4.5 Simulation result of 2-cluster stream micro-architecture

The time the 2-cluster stream micro-architecture spends executing the FFT instructions is the maximum of the times spent by cluster 1 and cluster 2, where CPU_Time of cluster 1 = 171 clock cycles and CPU_Time of cluster 2 = 277 clock cycles. Therefore, the CPU_Time of this system is 277 clock cycles (the maximum). In addition, the memory-hierarchy usage ratio of the whole micro-architecture is SRF : SP : LRF = 62 : 6 : 116.
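Because the system finishes only when its slowest cluster finishes, the system time is the maximum of the per-cluster cycle counts. The sketch below applies this to the reported values and also derives the fraction of that time each cluster spends waiting; the function names are illustrative, and the idle fraction is a derived figure that the simulator does not report.

```python
from typing import List, Sequence


def system_cycles(per_cluster_cycles: Sequence[int]) -> int:
    """CPU_Time of the system: the slowest cluster determines it."""
    return max(per_cluster_cycles)


def idle_fractions(per_cluster_cycles: Sequence[int]) -> List[float]:
    """Share of the system time each cluster spends with no work left."""
    longest = system_cycles(per_cluster_cycles)
    return [1 - cycles / longest for cycles in per_cluster_cycles]


two_cluster_cycles = [171, 277]  # cluster 1, cluster 2 (reported values)
print(system_cycles(two_cluster_cycles))
print([f"{fraction:.1%}" for fraction in idle_fractions(two_cluster_cycles)])
```

Cluster 1 finishes well before cluster 2, which suggests that the schedule leaves the load between the two clusters unbalanced.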

4.3.3 4-cluster stream micro-architecture simulation

First, the FFT instructions are scheduled as cluster instructions for four clusters.

The 4-cluster micro-architecture resembles a four-CPU system: it can process four cluster instructions in one clock cycle. After scheduling, every cluster has forty-two cluster instructions to process. The final results are obtained after the four clusters have executed one hundred and sixty-eight cluster instructions in total. As with the 1-cluster simulator, these one hundred and sixty-eight cluster instructions must then be translated into binary code that can be executed on the simulator, according to the instruction format shown in Figure 3.5.

4-cluster micro-architecture (Figure 4.6): There are four clusters in the micro-architecture. Each cluster contains two ALU units, two MUL units, and one DIV unit, i.e., five functional units (FUs), together with the SP (scratch pad), the intermediate medium for exchanging data between the functional units, which consists of sixty-four 32-bit registers. The definition is the same as that of the 1-cluster micro-architecture, which has been illustrated in Section 4.2.2. In addition, there is one LRF of sixty-four 32-bit registers at the input of every functional unit, and one SRF of sixty-four 32-bit registers in the memory hierarchy.

Therefore, the SRF consists of sixty-four 32-bit registers, the SP consists of 64*4 = 256 32-bit registers since there are four clusters in the whole system, and the LRF consists of 64*2*5*4 = 2560 32-bit registers in total since there are five FUs in a cluster and four clusters in the whole system.

Figure 4.6(a) Block diagram of 4-cluster micro-architecture simulator (each cluster contains ALU, ALU, MUL, MUL, DIV, Scratch Pad, and LRF)

Figure 4.6(b) Stream programming model of 4-cluster architecture

The cluster instructions, initially saved in the file "fft.dat", are read into instruction memory through file I/O. The controller shown in Figure 4.6 fetches the cluster instructions from instruction memory and distributes them equally to cluster 1, cluster 2, cluster 3, and cluster 4 for execution. The initial FFT data x[0]~x[31] are loaded from memory into the SRF, and the initial inputs of the functional units are taken from the SRF. Many intermediate results are produced on the way from x[n] to X[n], where X[n] is an output of the FFT. If such a result is the input of a functional unit in the same cluster, it is written to the LRF or the SP. If the result is the input of a functional unit in another cluster, it is passed to that functional unit through the SRF. Data exchange between instructions is kept within the same cluster as far as possible, through the faster LRF or SP. The SRF, which has a slower access speed, is used only when data are exchanged between clusters. The final results X[0]~X[31] are temporarily saved in the SRF and then written back to memory by a memory store command. Handling the instructions in this way can greatly reduce computation time.

Figure 4.7 shows the simulation results of 4-cluster stream micro-architecture.

Compared to 1-cluster micro-architecture in Section 4.3.1, it can be readily seen that the performance has been greatly improved when four clusters are utilized for processing the cluster instructions of FFT.

As the figure shows, cluster 1 processes forty-two cluster instructions, using the ALU units sixty-nine times and the MUL units three times; the DIV unit is never used, since the FFT requires no division. Cluster 1 requires seventy-five clock cycles to process its forty-two cluster instructions. Meanwhile, the SRF, SP, and LRF are accessed fifty-five times, fifteen times, and one hundred and forty-six times, respectively.

As the figure shows, cluster 2 processes forty-two cluster instructions, using the ALU units sixty-four times and the MUL units nineteen times; the DIV unit is never used, since the FFT requires no division. Cluster 2 requires one hundred and two clock cycles to process its forty-two cluster instructions. Meanwhile, the SRF, SP, and LRF are accessed sixty-three times, thirty-nine times, and one hundred and forty-seven times, respectively.

As the figure shows, cluster 3 processes forty-two cluster instructions, using the ALU units sixty-eight times and the MUL units thirty-three times; the DIV unit is never used, since the FFT requires no division. Cluster 3 requires one hundred and thirty-four clock cycles to process its forty-two cluster instructions. Meanwhile, the SRF, SP, and LRF are accessed seventy times, fifty-seven times, and one hundred and seventy-six times, respectively.

As the figure shows, cluster 4 processes forty-two cluster instructions, using the ALU units fifty-six times and the MUL units forty-one times; the DIV unit is never used, since the FFT requires no division. Cluster 4 requires one hundred and thirty-eight clock cycles to process its forty-two cluster instructions. Meanwhile, the SRF, SP, and LRF are accessed sixty-one times, fifty-nine times, and one hundred and seventy-one times, respectively.
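A quick consistency check on the four per-cluster reports is to add up the functional-unit uses and compare them with the 1-cluster run, which used the ALU units 257 times and the MUL units ninety-six times. The sketch below does this with the reported numbers; the dictionary layout and the function name `total_uses` are illustrative.

```python
from typing import Dict

# Functional-unit uses reported for the 4-cluster simulation.
four_cluster_uses: Dict[int, Dict[str, int]] = {
    1: {"alu": 69, "mul": 3, "div": 0},
    2: {"alu": 64, "mul": 19, "div": 0},
    3: {"alu": 68, "mul": 33, "div": 0},
    4: {"alu": 56, "mul": 41, "div": 0},
}


def total_uses(per_cluster: Dict[int, Dict[str, int]]) -> Dict[str, int]:
    """Sum the uses of each functional-unit type over all clusters."""
    totals: Dict[str, int] = {}
    for uses in per_cluster.values():
        for unit, count in uses.items():
            totals[unit] = totals.get(unit, 0) + count
    return totals


print(total_uses(four_cluster_uses))
```

The totals equal the 1-cluster counts, so the same arithmetic work is spread over four clusters; the multiplications are distributed unevenly, from three in cluster 1 to forty-one in cluster 4.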

Figure 4.7 Simulation result of 4-cluster stream micro-architecture

The time the 4-cluster stream micro-architecture spends executing the FFT instructions is the maximum of the times spent by cluster 1, cluster 2, cluster 3, and cluster 4, where CPU_Time of cluster 1 = 75 clock cycles, CPU_Time of cluster 2 = 102 clock cycles, CPU_Time of cluster 3 = 134 clock cycles, and CPU_Time of cluster 4 = 138 clock cycles. Therefore, the CPU_Time of this system is one hundred and thirty-eight clock cycles (the maximum). In addition, the memory-hierarchy usage ratio of the whole micro-architecture is SRF : SP : LRF = 64 : 20 : 153.
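The reported system times can be compared with the 1-cluster result through speedup and parallel efficiency. The sketch below uses the system cycle counts reported in Sections 4.3.1 to 4.3.3 (449, 277, and 138 clock cycles); speedup and efficiency are standard derived metrics that this section does not report, and the function names are illustrative.

```python
from typing import Dict


def speedup(baseline_cycles: int, cycles: int) -> float:
    """How many times faster a configuration is than the baseline."""
    return baseline_cycles / cycles


def efficiency(baseline_cycles: int, cycles: int, n_clusters: int) -> float:
    """Speedup divided by the number of clusters (1.0 would be ideal)."""
    return speedup(baseline_cycles, cycles) / n_clusters


# System CPU_Time in clock cycles, keyed by the number of clusters.
reported_cycles: Dict[int, int] = {1: 449, 2: 277, 4: 138}
baseline = reported_cycles[1]

for n_clusters, cycles in reported_cycles.items():
    s = speedup(baseline, cycles)
    e = efficiency(baseline, cycles, n_clusters)
    print(f"{n_clusters} cluster(s): speedup {s:.2f}, efficiency {e:.2f}")
```

Both multi-cluster configurations reach roughly four fifths of the ideal linear speedup on these numbers.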

4.3.4 8-cluster stream micro-architecture simulation

First, the FFT instructions are scheduled as cluster instructions for eight clusters.

The 8-cluster micro-architecture resembles an eight-CPU system: it can process eight cluster instructions in one clock cycle. After scheduling, every cluster has thirty-four cluster instructions to process. The final results are obtained after the eight clusters have executed two hundred and seventy-two cluster instructions in total.

As with the 1-cluster simulator, these two hundred and seventy-two cluster instructions must then be translated into binary code that can be executed on the simulator, according to the instruction format shown in Figure 3.5.
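The total number of cluster instructions changes with the number of clusters, because scheduling gives every cluster the same number of instructions. The sketch below recomputes the totals from the per-cluster counts reported in Sections 4.3.1 to 4.3.4 and expresses them relative to the 1-cluster schedule; the relative figure is a derived value, and the names used are illustrative.

```python
from typing import Dict

# Cluster instructions per cluster after scheduling (reported values).
per_cluster_instructions: Dict[int, int] = {1: 135, 2: 65, 4: 42, 8: 34}


def total_instructions(n_clusters: int, per_cluster: int) -> int:
    """Total cluster instructions executed by the whole system."""
    return n_clusters * per_cluster


def relative_to_single(n_clusters: int) -> float:
    """Total instruction count divided by the 1-cluster total."""
    single = total_instructions(1, per_cluster_instructions[1])
    total = total_instructions(n_clusters, per_cluster_instructions[n_clusters])
    return total / single


for n in per_cluster_instructions:
    total = total_instructions(n, per_cluster_instructions[n])
    print(f"{n} cluster(s): {total} instructions, "
          f"{relative_to_single(n):.2f}x the 1-cluster total")
```

On these numbers the 8-cluster schedule issues about twice as many cluster instructions in total as the 1-cluster schedule.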

8-cluster micro-architecture (Figure 4.8): There are eight clusters in the micro-architecture. Each cluster contains two ALU units, two MUL units, and one DIV unit, i.e., five functional units (FUs), together with the SP (scratch pad), the intermediate medium for exchanging data between the functional units, which consists of sixty-four 32-bit registers. The definition is the same as that of the 1-cluster micro-architecture, which has been illustrated in Section 4.2.2. In addition, there is one LRF of sixty-four 32-bit registers at the input of every functional unit, and one SRF of sixty-four 32-bit registers in the memory hierarchy.

Therefore, the SRF consists of sixty-four 32-bit registers, the SP consists of 64*8 = 512 32-bit registers since there are eight clusters in the whole system, and the LRF consists of 64*2*5*8 = 5120 32-bit registers in total since there are five FUs in a cluster and eight clusters in the whole system.

Figure 4.8(a) Block diagram of 8-cluster micro-architecture simulator (each cluster contains ALU, ALU, MUL, MUL, DIV, Scratch Pad, and LRF)

Figure 4.8(b) Stream programming model of 8-cluster architecture

The cluster instructions, initially saved in the file "fft.dat", are read into instruction memory through file I/O. The controller shown in Figure 4.8 fetches the cluster instructions from instruction memory and distributes them equally to cluster 1 through cluster 8 for execution. The initial FFT data x[0]~x[31] are loaded from memory into the SRF, and the initial inputs of the functional units are taken from the SRF. Many intermediate results are produced on the way from x[n] to X[n], where X[n] is an output of the FFT. If such a result is the input of a functional unit in the same cluster, it is written to the LRF or the SP. If the result is the input of a functional unit in another cluster, it is passed to that functional unit through the SRF. Data exchange between instructions is kept within the same cluster as far as possible, through the faster LRF or SP. The SRF, which has a slower access speed, is used only when data are exchanged between clusters.

Figure 4.9 shows the simulation results of 8-cluster stream micro-architecture.

Compared to the 4-cluster micro-architecture in Section 4.3.3, the performance has hardly improved when eight clusters are used to process the cluster instructions of the FFT.

As the figure shows, cluster 1 processes thirty-four cluster instructions, using the ALU units thirty-five times and the MUL units once; the DIV unit is never used, since the FFT requires no division.

Cluster 1 requires thirty-seven clock cycles to process its thirty-four cluster instructions. Meanwhile, the SRF, SP, and LRF are accessed forty times, ten times, and fifty-eight times, respectively.

As the figure shows, cluster 2 processes thirty-four cluster instructions, using the ALU units thirty-eight times and the MUL units four times; the DIV unit is never used, since the FFT requires no division. Cluster 2 requires forty-six clock cycles to process its thirty-four cluster instructions. Meanwhile, the SRF, SP, and LRF are accessed thirty-eight times, twenty-five times, and sixty-three times, respectively.

As the figure shows, cluster 3 processes thirty-four cluster instructions, using the ALU units twenty-four times and the MUL units eight times; the DIV unit is never used, since the FFT requires no division. Cluster 3 requires forty clock cycles to process its thirty-four cluster instructions. Meanwhile, the SRF, SP, and LRF are accessed forty times, five times, and fifty-one times, respectively.

From the document: Micro-architecture Simulator for a Multimedia Stream Processor (pages 49-69)