
Chapter 4 Configurable Multiprocessor Simulation

4.7 Optimization for Ray Tracing

Communicating with the two pools, keeping intersection data locally, and generating ray data all require special function units. We design the RayALU for these tasks.

The local storage of intersections and rays plays a role very similar to shared memory or a software-defined cache in CUDA [17]. However, our ray-pool-based ray tracing system is driven by these intersections and rays, so the flow of this driving data must be kept moving for high efficiency. Here we calculate the efficiency bound imposed by pool bandwidth:

Bandwidth bound = (bandwidth of ray pool, 128 bits per ns) / (size of ray, 48 bytes) = 333 MRays/sec

This is very tight for our target performance of 100 MRays/sec: the ray flow between the pools must be kept ongoing.
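A quick check of this bound, using only the two parameters stated above (128-bit transfers per nanosecond, 48-byte rays):

```python
# Ray-pool bandwidth bound on ray throughput, from the figures in the text.
POOL_BW_BITS_PER_NS = 128          # pool port width: 128 bits moved each nanosecond
RAY_SIZE_BYTES = 48                # one ray package

bytes_per_ns = POOL_BW_BITS_PER_NS / 8          # 16 bytes/ns
ns_per_ray = RAY_SIZE_BYTES / bytes_per_ns      # 3 ns to move one ray
mrays_per_sec = 1e9 / ns_per_ray / 1e6          # rays per second, in millions

print(round(mrays_per_sec))   # → 333
```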

So we design a 3-set local ray memory in our RayALU. One set works as the input buffer for intersection prefetching, another holds the current computation, and the third acts as the output buffer. The roles of these sets swap under instruction control.

Each set holds the full ray content for every thread in that compute unit. In this way, the communication bandwidth with the pools stays nearly continuous.
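A minimal sketch of this rotation, assuming the three roles described above; the class and method names are illustrative, not the RayALU instruction set:

```python
# Triple-buffered local ray memory: one set prefetches input, one feeds the
# current computation, one drains output; an instruction rotates the roles.
class TripleBufferedRayMemory:
    def __init__(self):
        self.sets = {"input": [], "compute": [], "output": []}

    def swap(self):
        """Rotate roles: the prefetched input becomes the working set, the
        finished working set becomes the output to drain, and the drained
        output set is reused for the next prefetch."""
        s = self.sets
        s["input"], s["compute"], s["output"] = s["output"], s["input"], s["compute"]

mem = TripleBufferedRayMemory()
mem.sets["input"] = ["ray0", "ray1"]   # prefetched intersections/rays
mem.swap()
print(mem.sets["compute"])             # → ['ray0', 'ray1']
```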

Figure 4.10 Local ray memory in RayALU

The local ray memory is 5.25 KB per set in the typical configuration (8 cores with 8 warps), summing to about 16 KB per compute unit.
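One way to reproduce these sizes, assuming each set holds one 84-byte intersection package (the package size given in Section 5.2.1) per content:

```python
# Reproducing the quoted local ray memory sizes (our reading of the figures).
CONTENTS = 8 * 8            # 8 cores × 8 warps
PACKAGE_BYTES = 84          # intersection package size (Section 5.2.1)
SETS = 3

set_kb = CONTENTS * PACKAGE_BYTES / 1024
total_kb = SETS * set_kb
print(set_kb, total_kb)     # → 5.25 15.75  (≈ 16 KB per compute unit)
```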

However, there are still some limitations to get full utilization of pool bandwidth.

Since the whole multiprocessor is fired up by kernel launches at the same time, rays and intersections are produced and consumed in batches. Moreover, the local ray memory is organized for coalesced access in order to accelerate local reads and writes by warps. Consequently, batches of packages must be synchronized before and after kernels.

Consequently, though the 3-set architecture seems efficient enough, reaching practically high efficiency with the pools still requires careful control both of the kernel launch algorithm on the host and of the kernel code itself.

Apart from this RayALU, our multiprocessor remains a general-purpose processor like a GPGPU, and the RayALU itself can be viewed as an example of a user-defined special ALU in our configurable multiprocessor.

Chapter 5

Methodology and Experiment

In this chapter we discuss experiments on our simulator, starting from the testing benchmarks, the configuration of the simulator, and how we measure the performance of the proposed system, through to results and discussion.

5.1 Testing Benchmarks

We test our proposed system on the "conference" scene. It has 331K triangles with 36 kinds of material and is one of the most popular ray tracing benchmark scenes. The original model has no reflective material, so we modify all materials to have reflective luminance at half the intensity of their diffuse luminance. Our testing resolution is 512x512 pixels with one sample ray per pixel, and a single point light serves as the light source.

There are two modes for testing: primary ray only and limited depth of bounce.

The primary-ray-only case measures the maximum efficiency of our ray tracer: each pixel requires one eye ray and one shadow ray, giving the highest memory coherence. The other mode is closer to real ray traced rendering; reflective effects can be observed once bounces are turned on. Here we limit the bounce depth to 5. Both modes are based on Whitted-style ray tracing.

Figure 5.1 Target scene: conference

5.2 Simulation Environment Setup

The followings are configuration and simulation parameters of our simulator.

5.2.1 Two Pools and Traverser

Both pools are configured with a maximum depth of 512 packages. Each ray package is 48 bytes while each intersection package is 84 bytes; that is, the ray pool is 24 KB and the intersection pool is 84 KB in size. The transfer rate of both pools is 128 bits per nanosecond, plus a two-nanosecond overhead for each request.
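The transfer latency model above can be sketched as follows, assuming partial 128-bit beats are rounded up (a modeling assumption on our part):

```python
# Pool transfer latency: 128 bits per nanosecond plus a fixed 2 ns overhead
# per request, rounding partial beats up.
import math

def transfer_ns(package_bytes: int) -> int:
    beats = math.ceil(package_bytes * 8 / 128)   # number of 128-bit beats needed
    return beats + 2                             # +2 ns request overhead

print(transfer_ns(48), transfer_ns(84))   # → 5 8  (ray vs. intersection package)
```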

The Traverser is modeled statically: a pure software traverser running an AABB BVH tree algorithm. We do not vary its delay with cache behavior or traversal depth, since we do not have a complete hardware traverser model. We simply assume it is as efficient as our Shader, transforming 64 rays into intersections at the end of each Shader kernel, or after 1000 nanoseconds when the Shader stalls.

The Traverser's throughput scales with the number of compute units in the experiments: whenever we double the compute units in the Shader, we also double the throughput of the Traverser for balance.
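The static model above amounts to a simple event rule; the function below is a sketch of that rule, not the actual simulator code:

```python
# Static Traverser model: convert up to 64 rays per compute unit into
# intersections at each Shader kernel completion, or after a 1000 ns stall.
def traverser_batch(shader_done: bool, stall_ns: int, n_compute_units: int = 1) -> int:
    """Number of rays transformed into intersections at this event."""
    if shader_done or stall_ns >= 1000:
        return 64 * n_compute_units
    return 0

print(traverser_batch(True, 0), traverser_batch(False, 500), traverser_batch(False, 1200, 2))
# → 64 0 128
```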

5.2.2 Related Kernel and Host Control Policy

We have three pieces of kernel code: Casting Eye Ray, Casting Secondary Ray, and Shading. Casting Eye Ray generates an eye ray from the camera and pixel coordinates. Casting Secondary Ray generates a reflection ray and a shadow ray from a previous intersection. The Shading kernel draws out light information from non-blocked shadow rays; a shadow ray is known to be non-blocked when the Traverser returns a null intersection for it.

Figure 5.2 Three kinds of kernel


Kernel code for the two modes (primary ray only and limited bounce depth) differs slightly: in the primary-ray-only mode, the reflection calculation in the Casting Secondary Ray kernel is skipped. Table 5.1 shows the instruction composition of these kernels.

Kernel code also differs slightly for multiple compute units, since additional code is needed to control the block dimension, especially in the Casting Eye Ray kernel. Table 5.1 shows the multiple-compute-unit case on the right-hand side of the "/" (the left-hand side is for a single compute unit).

Number of instructions (single/multiple compute unit(s))

Kernel        Cast Eye R   Cast 2nd R with refl. R   Cast 2nd R without refl. R   Shading
N of Instr.   77 / 85      191                       110                          54 / 55 *

Table 5.1: Number of instructions in different kernel

*: Shading kernel requires additional synchronization to flush local ray memory for multiple compute units.

Kernel    Cast Eye R   Cast 2nd R with refl. R   Cast 2nd R without refl. R   Shading
Consume   --           C nIS                     C nIS                        C sIS
Produce   C nR         C nR + C sR               C sR                         --

Notes: C means the number of contents in such a kernel; nR means normal ray; sR means shadow ray; nIS means intersection of a normal ray; sIS means intersection of a shadow ray.

There are also other mechanisms that destroy rays: rays exceeding the bounce depth limit and null intersections are discarded by the pools.

Table 5.2: Effect to pools of different kernels

The kernel launch policy controlled by the host CPU (or a local controller) is shown in Figure 5.3. Be aware that this policy is not optimal; it is merely a naive policy. The choice of kernel launch policy involves many factors, such as intersection preload, launching early versus waiting for high utilization, and the latency between the control decision and the actual kernel launch. It could be a research topic of its own, so it is not discussed further in this thesis.

The symbols in Figure 5.3 are as follows. RP is the number of rays in the ray pool and ISP the number of intersections in the intersection pool. sIS and nIS are the two kinds of intersections in the intersection pool: sIS is the intersection of a shadow ray and nIS the intersection of a normal ray. The distinction matters because the Shading kernel and the Casting Secondary Ray kernel operate on different kinds of intersections.
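One plausible reading of the naive policy can be sketched as a function of the pool state. The thresholds (RP > 48, ISP > 8) appear in Figure 5.3; the exact branch ordering below is our assumption, not a transcription of the figure:

```python
# Hypothetical reconstruction of the naive kernel choice policy (Figure 5.3).
def choose_kernel(RP: int, ISP: int, sIS: int, nIS: int) -> str:
    if RP + ISP == 0:
        return "done"                 # nothing left to trace
    if ISP > 8:                       # enough intersections to launch on
        if sIS > 0 and not (sIS < nIS):
            return "shading"          # draw out by shadow-ray intersections
        return "cast_2nd_ray"         # extend normal-ray intersections
    if RP > 48:
        return "halt"                 # rays still in flight through the Traverser
    return "cast_eye_ray"             # or halt, if no pixels remain

print(choose_kernel(0, 0, 0, 0))      # → done
```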

Figure 5.3 Kernel choice policy by controller

It is also important to avoid the window effect caused by the latency of the Shader and Traverser, shown in Figure 5.4. It can be solved by casting eye rays of another tile at the cost of divergence. Overall, this gives an efficiency improvement of about 64% (26 Mrays/s → 41 Mrays/s) under a blockage of 512 contents (new tiles are not permanently allowed, due to the divergence issue) with the configuration of 8 warps, 8 cores and a single compute unit. The later experiments are therefore all based on the policy that avoids the window effect.

[Figure content: the controller chooses among Cast Eye Ray, Cast 2nd Ray, Shading (draw out), and halt based on pool-state conditions such as RP+ISP > 0, RP > 48, ISP > 8, sIS > 0 and sIS < nIS; traversal proceeds between kernel launches.]

Figure 5.4 Window effect avoidance in kernel launch policy

5.3 Performance Indicator

The following performance indicators are used to judge the quality of different multiprocessor configurations in Section 5.4.

Firstly, the most important indicator is the overall performance for ray tracing.

Overall performance is measured by dividing the number of rays traced by the simulation time. For the primary-ray-only case, the number of rays traced is a constant: 512 (width) × 512 (height) × 2 (eye ray and shadow ray) = 512 Krays.
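The constant ray count works out as follows:

```python
# Primary-ray-only ray count used in the performance figures.
width, height = 512, 512
rays_per_pixel = 2                      # one eye ray + one shadow ray
total_rays = width * height * rays_per_pixel

print(total_rays, total_rays // 1024)   # → 524288 512   (i.e. 512 Krays)
```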

For utilization we measure three rates: time utilization, resource utilization and ALU utilization.

Time Util. = 100% × (multiprocessor work time) / (total simulation time)

Resource Util. = weighted avg. over kernel time of 100% × (threads launched) / (max. threads launchable)

ALU Util. = 100% − (ALU available for new task but not used) / (multiprocessor work time)
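The three indicators can be sketched as follows; the argument names are ours, but the formulas follow the definitions above:

```python
# Utilization indicators from Section 5.3.
def time_util(work_ns: float, total_ns: float) -> float:
    return 100.0 * work_ns / total_ns

def resource_util(kernels: list) -> float:
    """kernels: list of (kernel_time_ns, threads_launched, max_threads_launchable);
    returns the kernel-time-weighted average launch ratio."""
    total_t = sum(t for t, _, _ in kernels)
    return sum(t * 100.0 * n / cap for t, n, cap in kernels) / total_t

def alu_util(idle_but_available_ns: float, work_ns: float) -> float:
    return 100.0 - 100.0 * idle_but_available_ns / work_ns

# e.g. a short kernel half-filled plus a long kernel fully filled:
print(resource_util([(10, 50, 100), (30, 100, 100)]))   # → 87.5
```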

The ALU utilization rate is defined by subtraction because the ALUs are not uniform across tasks: some are pipelined, some have fixed latency, some are mixed, and others are complex. The ALU utilization rate is measured for each type of ALU separately; it hints at which type of ALU might be the performance bottleneck.

There is one more kind of indicator, ALU congestion: we record the average number of tasks waiting for a busy ALU.


ALU Congestion Traffic = (accumulated queuing tasks bound for that ALU) / (multiprocessor work time)

An abnormally high congestion traffic indicates that more of that resource, or an architectural improvement, is required for higher performance.
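As a sketch, the congestion indicator reduces to a time-averaged queue length; the sampling scheme below is an assumed input format, not the simulator's:

```python
# ALU congestion traffic: average number of tasks queued behind a busy ALU
# over the multiprocessor's working time.
def congestion_traffic(queue_samples: list, work_ns: int) -> float:
    """queue_samples: queued-task count observed at each working nanosecond."""
    return sum(queue_samples) / work_ns

# e.g. an ALU with 2 tasks waiting for half the run and none otherwise:
print(congestion_traffic([2] * 500 + [0] * 500, 1000))   # → 1.0
```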

5.4 Experiment of Different Configuration

We measure the performance and related indicators of our proposed system under different configurations. These experiments help us both in choosing a configuration and in seeing the trends as the system grows.

5.4.1 Different Warp-Thread Configuration

The tables below show the effect of different configurations on performance and indicators. The notation AwBc in the Config. rows denotes a configuration of A warp(s) and B core(s), which can serve at most A×B contents. The content count also determines the memory resources required, both in the register file and in local ray memory. We group the configurations by number of contents.
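The configurations within one content group can be enumerated mechanically; the helper below is illustrative (the warp counts tried are our assumption):

```python
# Enumerate AwBc configurations for a content budget: A warps × B cores = contents.
def configs(contents: int):
    return [f"{a}w{contents // a}c" for a in (1, 2, 4, 8, 16, 32) if a <= contents]

print(configs(16))   # → ['1w16c', '2w8c', '4w4c', '8w2c', '16w1c']
```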

Table 5.3: Results of single CU with 16 contents

Results of conference scene (16 contents, single CU)

Config.                 1w16c   2w8c    4w4c    8w2c    16w1c
Performance (Mrays/s)   18.69   19.14   18.69   15.87   10.58
Time Util. (%)          99.52   99.51   100     99.59   99.46
Res. Util. (%)          99.5    99.49   99.35   99.44   78.96

Indicator of different ALUs (Util.: %, Wait: N. of warps)
IntALU Util.            5.82    11.9    23.08   39.35   52.51
FltALU Util.            4.15    8.48    16.45   28.05   37.43
MemALU Util.            18.44   19.46   20.01   18.95   15.13
RayALU Util.            3.9     7.98    15.47   26.38   35.19
IntALU Wait             0       0       0.06    0.71    2.62
FltALU Wait             0       0       0.06    0.37    0.80
MemALU Wait             0       0.47    1.24    2.23    1.60
RayALU Wait             0       0.06    0.26    0.59    0.80

Utilization of the IntALU, FltALU and RayALU increases as the number of cores decreases: fewer computational cores yield a higher utilization ratio. The MemALU does not follow this rule, since the size and bandwidth of the data cache stay the same (8 KB, 128 Gb/s) across configurations. For the RayALU, although the local ray memory size stays the same, its bandwidth grows with the number of cores.

Table 5.4: Results of single CU with 32 contents

Results of conference scene (32 contents, single CU)

Config.                 1w32c   2w16c   4w8c    8w4c    16w2c   32w1c
Performance (Mrays/s)   31.77   32.85   31.46   29.14   17.78   13.83
Time Util. (%)          99.19   99.16   99.19   99.25   100     99.65
Res. Util. (%)          100.00  100.00  100.00  100.00  65.57   81.05

Indicator of different ALUs
IntALU Util.            4.94    10.22   19.57   36.23   44.2    68.47

Table 5.5: Results of single CU with 64 contents

Results of conference scene (64 contents, single CU)

Config.                 2w32c   4w16c   8w8c    16w4c   32w2c
Performance (Mrays/s)   43.76   42.64   40.95   37.31   26.87
Time Util. (%)          86.97   87.85   88.39   99.04   99.31
Res. Util. (%)          94.75   94.50   94.51   87.04   87.80

Indicator of different ALUs (Util.: %, Wait: N. of warps)
IntALU Util.            8.11    15.32   28.9    46.48   67.24

Figure 5.5 below compares the relationship between performance and the different configurations.

Figure 5.5 Performance of different configurations (single CU)

Figure 5.6 Normalized Performance of different configurations (single CU)


Within the same content group, moving right along the x-axis drops the number of cores, causing a downward trend in performance. Fortunately, fewer cores also means more warps to fill the ALU pipelines, so performance does not decrease linearly with core count. Furthermore, with the benefit of warps hiding memory latency, there is a slightly rising trend with more warps; that is why the 2-warp, fewer-core cases outperform the single-warp configuration.

From another viewpoint, jumping vertically from fewer contents to more contents doubles the number of warps and the corresponding resources, giving a performance gain. The gain does not double performance, however, since the number of ALUs stays the same. The trend saturates once too many warps suffer a traffic jam, which can be observed in the waiting-warps indicator.

The 2-core point in the 32-content group drops slightly below the trend discussed above. This is a non-linear effect of the kernel choice policy, which shifts its preference from higher time utilization to higher resource utilization, as can be observed in the Time Util. and Res. Util. indicators.

Figure 5.7 ALU utilization rate in different configuration (32 contents)

Figure 5.7 supports our explanation of the performance trend by comparing ALU utilization across configurations. Initially, utilization is very low for a single warp with the full core count, because under our warp scheduling policy, which has no out-of-order instruction launch, each instruction must wait for the previous one to finish, leaving the pipeline full of bubbles. As the number of warps grows, there are more tasks to fill the pipelines.


Figure 5.8 ALU utilization rate for different contents (4 cores)

Figure 5.8 shows how the ALU utilization rate varies with the number of contents. The growing trend is somewhat smoother than that in Figure 5.7.

Figure 5.9 Average waiting traffic of ALUs (32 contents)

We can conclude that the performance drop with fewer cores results from the non-linear growth of waiting traffic toward the ALUs, and that the composition of the traffic varies with configuration: initially the bottleneck is stalls at the data cache, but as computational ALUs are reduced, stalls from the IntALU dominate and grow quickly.


5.4.2 Multiple Compute Units

Since the maximum performance of a single compute unit, ~40 Mrays/s, is still far from our target of ~100 Mrays/s, we also duplicate the compute units and observe how performance grows as their number doubles, and doubles again.

Table 5.6: Results of 2 CUs with 32 contents

Results of conference scene (32 contents, 2 CUs)

Config.                 1w32c   2w16c   4w8c    8w4c    16w2c   32w1c
Performance (Mrays/s)   32.37   36.66   37.41   35.64   30.06   22.56
Time Util. (%)          90.74   89.96   89.71   90.06   98.46   98.85
Res. Util. (%)          57.11   57.60   58.08   59.83   55.70   58.28

Indicator of different ALUs
IntALU Util.            5.01    6.80    13.76   25.80   39.88   59.22
FltALU Util.            3.44    4.68    9.48    17.80   27.47   40.92
MemALU Util.            29.65   20.57   21.49   21.42   18.38   16.33
RayALU Util.            3.67    4.98    10.08   18.89   29.21   43.41

Table 5.7: Results of 4 CUs with 32 contents

Results of conference scene (32 contents, 4 CUs)

Config.                 1w32c   2w16c   4w8c    8w4c    16w2c   32w1c
Performance (Mrays/s)   36.78   46.09   45.97   45.84   41.26   34.65
Time Util. (%)          29.76   92.56   92.59   93.05   97.73   98.16
Res. Util. (%)          45.59   33.17   33.31   35.18   35.21   40.02

Indicator of different ALUs
IntALU Util.            5.57    6.75    8.26    16.26   28.13   46.35
FltALU Util.            3.80    4.70    5.71    11.23   19.40   32.03
MemALU Util.            24.74   12.84   12.93   13.43   12.30   12.38
RayALU Util.            3.69    5.21    6.04    11.90   20.60   33.97

Table 5.8: Results of 2 CUs with 64 contents

Results of conference scene (64 contents, 2 CUs)

Config.                 2w32c   4w16c   8w8c    16w4c   32w2c
Performance (Mrays/s)   62.16   63.65   61.38   53.78   42.41
Time Util. (%)          91.73   91.53   91.68   98.61   98.91
Res. Util. (%)          55.38   56.52   56.94   54.07   56.82

Indicator of different ALUs (Util.: %, Wait: N. of warps)
IntALU Util.            5.84    11.65   22.15   35.87   55.94
FltALU Util.            4.02    8.03    15.28   24.72   38.66
MemALU Util.            32.07   32.60   32.34   27.91   24.34
RayALU Util.            4.28    8.55    16.23   26.28   41.02

Table 5.9: Results of 4 CUs with 64 contents

Results of conference scene (64 contents, 4 CUs)

Config.                 2w32c   4w16c   8w8c    16w4c   32w2c
Performance (Mrays/s)   80.82   80.64   82.08   75.62   64.31
Time Util. (%)          93.72   93.75   93.65   97.77   98.27
Res. Util. (%)          31.69   32.43   33.30   34.15   38.23

Indicator of different ALUs (Util.: %, Wait: N. of warps)
IntALU Util.            6.12    7.41    14.88   26.53   43.71
FltALU Util.            4.26    5.12    10.28   18.35   30.20
MemALU Util.            21.22   20.63   21.47   19.72   18.48
RayALU Util.            4.73    5.44    10.91   19.48   32.06

Performance grows as compute units are duplicated, but it does not grow linearly, since the bandwidths of the Ray Pool and Intersection Pool are held constant. There is an additional limit in the kernel choice policy at the host CPU: the CUs do not operate independently. They are controlled by the same kernel launch signal from the host and run the same kernel code. This synchronized operation causes collisions at the pools, adding stalls for each other; that is why many efficiency indicators decrease and the performance growth saturates.

Figures below show the variation of performance by multiple compute units.

Figure 5.10 Performance of multiple CUs (32 contents)

Figure 5.11 Normalized Performance of multiple CUs (32 contents)


Figure 5.12 Performance of multiple CUs (64 contents)

Figure 5.13 Normalized Performance of multiple CUs (64 contents)

Normalized performance here is calculated by dividing the performance by the number of cores inside a single CU, not by the total number of cores across all CUs. This shows an increasing trend as the number of CUs grows, and makes comparison easier with the cases that grow in contents.
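The normalization above reduces to a one-line calculation; the example value is taken from Table 5.9:

```python
# Normalized performance: Mrays/s divided by the number of cores inside a
# single CU (not cores across all CUs).
def norm_perf(mrays_per_s: float, cores_per_cu: int) -> float:
    return mrays_per_s / cores_per_cu

# e.g. 4 CUs of 8w8c at 82.08 Mrays/s (Table 5.9):
print(round(norm_perf(82.08, 8), 2))   # → 10.26
```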

Figure 5.14 Comparison of ALU utilization rate in multiple CUs

Figure 5.15 Multiple CUs versus multiple contents in performance

Figure 5.15 compares the two strategies of duplicating contents versus duplicating the whole compute unit. Duplicating contents has an early advantage, with more warps to fill the pipelines, but it saturates because the number of ALUs does not grow accordingly. On the other hand, although multiple CUs have steeper potential, they incur a much larger hardware cost in both warps and cores.

In short, there is no golden answer for the choice of configuration parameters. Different configurations carry different hardware costs in both area and timing; though duplicating contents may seem better, too many warps might cause a timing violation in the dispatcher within its 1-nanosecond budget.

In our situation, we choose the 8-warp, 8-core configuration with 4 CUs. It keeps the performance of the 64-content group even as the core count drops, and since we demand more performance, duplicating CUs is a must for us.

5.4.3 Changing Kernel Choice Policy

There is another avenue to higher performance: the kernel choice policy in the host CPU. We saw in Section 5.4.2 that resource utilization is low with multiple CUs; a further reason is that the kernel choice policy was designed for fewer cores.

The kernel choice policy adopted in Section 5.4.2 is the one shown in Section 5.2.2 for every configuration, but its parameters in Figure 5.3 are small, chosen for the 16-content cases. We therefore run another experiment, changing the kernel choice policy to the larger parameters shown in Figure 5.16.

Figure 5.16 New kernel choice policy for large scale configuration

[Figure content: the same flowchart as Figure 5.3 with enlarged thresholds: RP > 48 → 192 and ISP > 8 → 96.]

The performance grows dramatically, improving by about 20~30% and reaching 98 Mrays/s. This policy favors resource utilization over time utilization, quite unlike the original case shown in Table 5.9: it stalls more, waiting for a larger number of contents to be prefetched from the intersection pool for the next kernel. The utilization rates of the ALUs also grow.

Table 5.10: Results of 4 CUs with 64 contents by new policy

Results of conference scene (64 contents, 4 CUs)

Config.                 2w32c   4w16c   8w8c    16w4c   32w2c
Performance (Mrays/s)   94.48   98.27   96.83   98.52   81.88
Improvement (%)         +17.5   +21.8   +18.0   +30.3   +27.3
Time Util. (%)          51.41   55.76   56.74   67.33   75.88
Res. Util. (%)          95.39   95.55   95.80   97.31   98.23

Indicator of different ALUs (Util.: %, Wait: N. of warps)
IntALU Util.            8.26    15.58   29.74   49.79   70.84
FltALU Util.            5.66    10.72   20.50   34.36   48.92
MemALU Util.            45.00   43.76   43.57   38.55   30.64
RayALU Util.            6.06    11.41   21.78   36.45   51.86

Figure 5.17 Comparison of different kernel choice policies on 4 CUs, 64 contents


Figure 5.18 Comparison of different policies with normalized performance on 4 CUs, 64 contents

This policy, however, is less suitable for smaller scales such as 16 contents. The experiment shows that the kernel choice policy is another important topic, and that different configurations have their own best policies. We test only a simple policy here; there is plenty of room for more delicate ones. The timing of prefetch, the balance between waiting time and resource utilization, and the dynamically varying trends in the pools are all factors for further research.


Chapter 6

Hardware Implementation

We implemented the proposed configurable multiprocessor in Verilog HDL and verified it by gate-level synthesis. In the Verilog design we take advantage of basic modules from the Synopsys DesignWare Library [19], such as floating-point computation units, dividers, FIFOs, and arbiters. We use Cadence NC-Verilog [20] as our simulation CAD tool, and after Verilog simulation we synthesize the design with Synopsys Design Compiler [21] using the TSMC N45GS 40nm standard cell library.

