
Chapter 5 Methodology and Experiment

5.4 Experiments on Different Configurations

5.4.1 Different Warp-Thread Configurations

The tables below show the effect of different configurations on performance and related indicators.

The notation AwBc in the Config. rows denotes a configuration of A warp(s) and B core(s), which can serve at most A×B contexts. The number of contexts also determines the memory resources required, in both the register file and the local ray memory. We group the configurations by their number of contexts.
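As a small illustration of this notation (the helper below is our own sketch, not part of the simulator), the maximum context count of an AwBc configuration is simply A×B, which is why every configuration in a group needs the same register-file and local-ray-memory resources:

```python
def contexts(warps: int, cores: int) -> int:
    """Maximum number of contexts an AwBc configuration can serve (A x B)."""
    return warps * cores

# Every configuration in the 16-context group of Table 5.3 serves the same
# maximum number of contexts, and therefore requires the same amount of
# register-file and local-ray-memory resources.
group_16 = [(1, 16), (2, 8), (4, 4), (8, 2), (16, 1)]  # (warps, cores)
assert all(contexts(w, c) == 16 for w, c in group_16)
```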

Table 5.3: Results of single CU with 16 contexts

Results of conference scene (16 contexts, single CU)

Config.                 1w16c   2w8c    4w4c    8w2c    16w1c
Performance (Mrays/s)   18.69   19.14   18.69   15.87   10.58
Time Util. (%)          99.52   99.51   100     99.59   99.46
Res. Util. (%)          99.5    99.49   99.35   99.44   78.96

Indicators of different ALUs (Util.: %, Wait: no. of warps)

IntALU Util.            5.82    11.9    23.08   39.35   52.51
FltALU Util.            4.15    8.48    16.45   28.05   37.43
MemALU Util.            18.44   19.46   20.01   18.95   15.13
RayALU Util.            3.9     7.98    15.47   26.38   35.19
IntALU Wait             0       0       0.06    0.71    2.62
FltALU Wait             0       0       0.06    0.37    0.80
MemALU Wait             0       0.47    1.24    2.23    1.60
RayALU Wait             0       0.06    0.26    0.59    0.80

Utilization of the IntALU, FltALU and RayALU increases as the number of cores decreases: fewer computational cores lead to a higher utilization ratio. The MemALU does not follow this rule, since the size and bandwidth of the data cache stay the same (8KB, 128Gb/s) across configurations. For the RayALU, although the size of the local ray memory stays the same, its bandwidth grows in configurations with more cores.
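This trend can be sanity-checked against the IntALU row of Table 5.3: if the issued integer work stayed constant, utilization multiplied by core count (the average number of busy IntALU lanes) should be roughly flat. The few lines below are our own back-of-the-envelope check on those numbers, not part of the simulator:

```python
# IntALU utilization (%) from Table 5.3, keyed by number of cores.
int_util = {16: 5.82, 8: 11.90, 4: 23.08, 2: 39.35, 1: 52.51}

# Average number of busy IntALU lanes = utilization x cores / 100.
busy_lanes = {cores: util * cores / 100 for cores, util in int_util.items()}

# 16, 8 and 4 cores all keep roughly 0.93-0.95 lanes busy; at 2 and 1 cores
# the product falls (about 0.79 and 0.53) together with overall performance,
# since less work is issued per unit time.
```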

Table 5.4: Results of single CU with 32 contexts

Results of conference scene (32 contexts, single CU)

Config.                 1w32c   2w16c   4w8c    8w4c    16w2c   32w1c
Performance (Mrays/s)   31.77   32.85   31.46   29.14   17.78   13.83
Time Util. (%)          99.19   99.16   99.19   99.25   100     99.65
Res. Util. (%)          100.00  100.00  100.00  100.00  65.57   81.05

Indicators of different ALUs (Util.: %)

IntALU Util.            4.94    10.22   19.57   36.23   44.2    68.47

Table 5.5: Results of single CU with 64 contexts

Results of conference scene (64 contexts, single CU)

Config.                 2w32c   4w16c   8w8c    16w4c   32w2c
Performance (Mrays/s)   43.76   42.64   40.95   37.31   26.87
Time Util. (%)          86.97   87.85   88.39   99.04   99.31
Res. Util. (%)          94.75   94.50   94.51   87.04   87.80

Indicators of different ALUs (Util.: %)

IntALU Util.            8.11    15.32   28.9    46.48   67.24

Figure 5.5 below compares performance across the different configurations.

Figure 5.5 Performance of different configurations (single CU)

[Plot: performance, log2 (Mrays/s), vs. number of cores (32 down to 1), for the 64-, 32- and 16-context groups]

Figure 5.6 Normalized performance of different configurations (single CU)

[Plot: performance per core, log2 (Mrays/s), vs. number of cores (32 down to 1), for the 64-, 32- and 16-context groups]

Within the same group of contexts, moving right along the x-axis reduces the number of cores, which drives the downward trend in performance. Fortunately, although the number of cores drops, more warps are pushed in to fill the ALU pipelines, so performance does not decrease linearly with the number of cores. Furthermore, with the benefit of more warps hiding memory latency, there is a slightly growing trend as warps are added; this is why the cases with 2 warps but fewer cores outperform the single-warp configuration.

From another viewpoint, jumping vertically from fewer contexts to more contexts doubles the number of warps and the corresponding resources, which yields a performance gain. This gain does not double performance, however, since the number of ALUs stays the same. The trend saturates when too many warps suffer from a traffic jam, which can be observed in the waiting-warps indicator.

The 2-core point in the 32-context group drops slightly below the trend discussed above. This is caused by a non-linear effect of the kernel choice policy, which shifts its preference from higher time utilization to higher resource utilization; this can be observed in the Time Util. and Res. Util. indicators.

Figure 5.7 ALU utilization rate in different configurations (32 contexts)

Figure 5.7 supports our assumption about the performance trend by comparing ALU utilization across configurations. Initially, the utilization rate is very low for a single warp with the full number of cores: since our warp scheduling policy has no out-of-order instruction launch, each instruction must wait for the previous one to finish, leaving the pipeline full of bubbles. As the number of warps grows, there are more tasks to fill up the pipelines.
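The bubble argument can be sketched with a toy occupancy model; the pipeline depth below is a hypothetical number for illustration, not a parameter of our design:

```python
def pipeline_util(num_warps: int, depth: int) -> float:
    """Upper bound on utilization of a depth-stage in-order ALU pipeline.

    With in-order issue and no out-of-order launch, each warp keeps at most
    one instruction in flight, so at most num_warps of the depth stages can
    be busy at any moment.
    """
    return min(1.0, num_warps / depth)

# With a hypothetical 8-stage pipeline, a single warp fills only 1/8 of the
# slots, while eight or more warps can keep the pipeline full.
```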

[Bar chart: ALU utilization rate (%) at 32 contexts for the Tid, Brh, Mem, Int, Flt, Slw and Ray ALUs, for configurations 1w32c through 32w1c]

Figure 5.8 ALU utilization rate for different numbers of contexts (4 cores)

Figure 5.8 shows how the ALU utilization rate varies with the number of contexts. The growing trend is somewhat smoother than that in Figure 5.7.

Figure 5.9 Average waiting traffic of ALUs (32 contexts)

We can conclude that the performance drop with fewer cores results from the non-linear growth of waiting traffic toward the ALUs, and that the composition of this traffic varies with the configuration. Initially the bottleneck is the stall at the data cache; as the computational ALUs are reduced, stalls from the IntALU dominate and grow quickly.

[Bar chart: ALU utilization rate (%) at 4 cores for the Mem, Int, Flt, Slw and Ray ALUs, for 16, 32 and 64 contexts]

[Bar chart: average waiting traffic (no. of warps), split into Tid, Brh, Mem, Int, Flt, Slw and Ray waits, for configurations 1w32c through 32w1c at 32 contexts]

5.4.2 Multiple Compute Units

Since the maximum performance of a single compute unit, about 40 Mrays/s, is still far from our expectation of about 100 Mrays/s, we also duplicate the compute units described above and observe the performance growth as the number of compute units doubles, and doubles again.

Table 5.6: Results of 2 CUs with 32 contexts

Results of conference scene (32 contexts, 2 CUs)

Config.                 1w32c   2w16c   4w8c    8w4c    16w2c   32w1c
Performance (Mrays/s)   32.37   36.66   37.41   35.64   30.06   22.56
Time Util. (%)          90.74   89.96   89.71   90.06   98.46   98.85
Res. Util. (%)          57.11   57.60   58.08   59.83   55.70   58.28

Indicators of different ALUs (Util.: %)

IntALU Util.            5.01    6.80    13.76   25.80   39.88   59.22
FltALU Util.            3.44    4.68    9.48    17.80   27.47   40.92
MemALU Util.            29.65   20.57   21.49   21.42   18.38   16.33
RayALU Util.            3.67    4.98    10.08   18.89   29.21   43.41

Table 5.7: Results of 4 CUs with 32 contexts

Results of conference scene (32 contexts, 4 CUs)

Config.                 1w32c   2w16c   4w8c    8w4c    16w2c   32w1c
Performance (Mrays/s)   36.78   46.09   45.97   45.84   41.26   34.65
Time Util. (%)          29.76   92.56   92.59   93.05   97.73   98.16
Res. Util. (%)          45.59   33.17   33.31   35.18   35.21   40.02

Indicators of different ALUs (Util.: %)

IntALU Util.            5.57    6.75    8.26    16.26   28.13   46.35
FltALU Util.            3.80    4.70    5.71    11.23   19.40   32.03
MemALU Util.            24.74   12.84   12.93   13.43   12.30   12.38
RayALU Util.            3.69    5.21    6.04    11.90   20.60   33.97

Table 5.8: Results of 2 CUs with 64 contexts

Results of conference scene (64 contexts, 2 CUs)

Config.                 2w32c   4w16c   8w8c    16w4c   32w2c
Performance (Mrays/s)   62.16   63.65   61.38   53.78   42.41
Time Util. (%)          91.73   91.53   91.68   98.61   98.91
Res. Util. (%)          55.38   56.52   56.94   54.07   56.82

Indicators of different ALUs (Util.: %)

IntALU Util.            5.84    11.65   22.15   35.87   55.94
FltALU Util.            4.02    8.03    15.28   24.72   38.66
MemALU Util.            32.07   32.60   32.34   27.91   24.34
RayALU Util.            4.28    8.55    16.23   26.28   41.02

Table 5.9: Results of 4 CUs with 64 contexts

Results of conference scene (64 contexts, 4 CUs)

Config.                 2w32c   4w16c   8w8c    16w4c   32w2c
Performance (Mrays/s)   80.82   80.64   82.08   75.62   64.31
Time Util. (%)          93.72   93.75   93.65   97.77   98.27
Res. Util. (%)          31.69   32.43   33.30   34.15   38.23

Indicators of different ALUs (Util.: %)

IntALU Util.            6.12    7.41    14.88   26.53   43.71
FltALU Util.            4.26    5.12    10.28   18.35   30.20
MemALU Util.            21.22   20.63   21.47   19.72   18.48
RayALU Util.            4.73    5.44    10.91   19.48   32.06

We can see that performance grows as the compute units are duplicated, but it does not grow linearly, since the bandwidth of the Ray Pool and Intersection Pool is kept constant. In addition, the kernel choice policy on the host CPU imposes another limit: the CUs do not work independently, but are controlled by the same kernel launch signal from the host and run the same kernel code. This synchronized operation causes collisions at the pools, producing additional stalls for one another. That is why many efficiency indicators decrease and the performance growth saturates.
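As a concrete check of this saturation, take the 4w8c column across Tables 5.4, 5.6 and 5.7 (32 contexts); the few lines below are our own arithmetic on those published numbers:

```python
# Performance of the 4w8c configuration (Mrays/s) versus number of CUs,
# read from Tables 5.4, 5.6 and 5.7.
perf = {1: 31.46, 2: 37.41, 4: 45.97}

speedup = {cus: perf[cus] / perf[1] for cus in perf}
# 2 CUs give only ~1.19x and 4 CUs only ~1.46x over a single CU, well short
# of linear scaling, consistent with contention at the shared pools and the
# synchronized kernel launches described above.
```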

The figures below show how performance varies with multiple compute units.

Figure 5.10 Performance of multiple CUs (32 contexts)

[Plot: performance, log2 (Mrays/s), vs. number of cores (32 down to 1), for 1, 2 and 4 CUs]

Figure 5.11 Normalized performance of multiple CUs (32 contexts)

[Plot: performance per core in one CU, log2 (Mrays/s), vs. number of cores (32 down to 1), for 1, 2 and 4 CUs]

Figure 5.12 Performance of multiple CUs (64 contexts)

Figure 5.13 Normalized performance of multiple CUs (64 contexts)

[Plots: performance and performance per core in one CU, log2 (Mrays/s), vs. number of cores (32 down to 1), for 1, 2 and 4 CUs]

Normalized performance here is calculated by dividing performance by the number of cores inside a single CU, not by the total number of cores across all CUs. This gives an increasing appearance as the number of CUs grows, and makes it easier to compare against the cases that grow in contexts.
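Restating that definition in code (the helper below is our own, not part of the thesis toolchain):

```python
def norm_perf(mrays_per_s: float, cores_per_cu: int) -> float:
    """Normalized performance: Mrays/s divided by the cores in ONE CU,
    not by the total core count across all CUs."""
    return mrays_per_s / cores_per_cu

# Example: the 8w8c point with 4 CUs in Table 5.9 runs at 82.08 Mrays/s,
# giving 82.08 / 8 = 10.26 Mrays/s per core of a single CU.
```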

Figure 5.14 Comparison of ALU utilization rate in multiple CUs

Figure 5.15 Multiple CUs versus multiple contexts in performance

Figure 5.15 compares the two duplication strategies: duplicating contexts versus duplicating the whole compute unit. Duplicating contexts has the early advantage of more warps to fill the pipeline, but it saturates since the number of ALUs does not grow accordingly. On the other hand, although multiple CUs show steeper potential, they incur a much larger hardware cost in both warps and cores.

[Bar chart (Figure 5.14): ALU utilization rate (%) at 8 warps and 8 cores for the Mem, Int, Flt, Slw and Ray ALUs, for 1, 2 and 4 CUs]

[Plot (Figure 5.15): performance per core in one CU, log2 (Mrays/s), vs. number of cores, for 16 contexts, 64 contexts, and 4 CUs of 16 contexts]

In short, there is no golden answer to the choice of configuration parameters. Different configurations have different hardware costs in both area and timing; although duplicating contexts seems better, too many warps might cause a timing violation in the dispatcher within the 1 ns clock period.

In our situation, we choose the 8-warp, 8-core configuration with 4 CUs. It preserves the 64-context performance even as the number of cores drops, and since we demand more performance, duplicating CUs is a must for us.
