
Chapter 3 System Overview

3.6 Hardware Implementation

We implement our configurable multiprocessor in Verilog HDL and verify it by logic synthesis using Synopsys Design Compiler with a TSMC 40 nm process. The designed working frequency is 1 GHz, and the cell area reported by the synthesis tool is 0.537 mm² for a single compute unit under the configuration of 4 cores and 4 warps.

More information is given in Chapter 6.

Chapter 4

Configurable Multiprocessor Simulation

In this chapter we propose a simple but complete SIMT multiprocessor, which is the main contribution of this work.

The following are some specifications of our processor. It runs at a clock rate of 1 GHz with a 4-stage pipeline, using 64-bit self-defined instructions and 32-bit-wide data (both integer and single-precision floating point). The numbers of cores and warps in each compute unit (a streaming multiprocessor in CUDA terms) are configurable as powers of 2 ranging from 1 to 32. Each thread has 16 general registers. The number of compute units is also configurable, ranging from 1 to 4. Each compute unit possesses an 8KB L1 data cache and a 4KB L1 instruction cache; both caches are 2-way set associative.
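For concreteness, these parameters can be collected into a small configuration record. The sketch below is illustrative only; the field names are not the identifiers used in our simulator.

// Illustrative configuration record for one simulator instance; the field
// names are hypothetical, but the ranges and defaults follow the text above.
struct MultiprocessorConfig {
    unsigned n_compute_units  = 1;        // configurable from 1 to 4
    unsigned n_cores          = 4;        // per compute unit, power of 2 in [1, 32]
    unsigned n_warps          = 4;        // per compute unit, power of 2 in [1, 32]
    unsigned warp_size        = 8;        // threads grouped into one warp (typical value)
    unsigned n_gpr_per_thread = 16;       // general registers per thread
    unsigned l1_dcache_bytes  = 8 * 1024; // 2-way set associative
    unsigned l1_icache_bytes  = 4 * 1024; // 2-way set associative
    unsigned clock_mhz        = 1000;     // 1 GHz, 4-stage pipeline
};

// Cores and warps must be powers of two in the allowed range.
inline bool isValidCount(unsigned x) { return x >= 1 && x <= 32 && (x & (x - 1)) == 0; }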

The original framework of this processor was established by Yu-Sheng Lin and designed as a GPGPU (general-purpose GPU). He drew up the basic architecture by referring to a paper [15] about GPGPU-Sim [16]. Our processor further extends Lin's work, making it more configurable, accurate, and concrete. It is now nearly cycle accurate and can be verified against the hardware implementation.

As a result, our processor is a CUDA-like SIMT multiprocessor. We use very similar notions of threads, warps, blocks, kernels, and SMs (we use the term "compute units" here) as the CUDA programming model [17].

4.1 Operating Flow

The operations of our processor are split into two parts (an instruction routine and a data routine), each with a few steps. We group a fixed number of threads (typically 8) into a warp as the basic unit in our operation flow. To keep things simple, we only discuss the operating flow of a single warp in Section 4.1 and discuss the relationship between warps in Section 4.2.

The instruction routine comprises the functions that request the correct instruction to run. First, the instruction buffer updates its state and starts an instruction request. Second, it finds the top entry of the PC (program counter) stack and increments the PC to the next instruction address. Then it sends a request to the instruction fetch unit; the request chosen by the arbiter is allowed to consult the instruction cache. Last, the instruction is brought back to the instruction buffer.

Figure 4.1 Instruction routine

On the other hand, the data routine comprises the functions that actually execute the instruction. Also starting from the instruction buffer, it consults the register file and predicate registers as the instruction indicates (decode). It then goes to a specific type of ALU according to the OP-code. Since other warps may want to use the same ALU, a dispatcher handles the traffic, and an arranger re-arranges the completed tasks back into warp order using the warp ID.

Figure 4.2 Data routine

From the real hardware implementation, taking logic and flip-flop delays into account, we have the following pipeline stages: IBuf, PCStack, IArb, ICache, RegFile, Dispatch, ALUEx, WriBack.
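The stage names can be written down as a simple enumeration (illustrative only); IBuf starts both routines, PCStack/IArb/ICache belong to the instruction routine, and RegFile/Dispatch/ALUEx/WriBack to the data routine.

// Pipeline stages named above. IBuf starts both routines; the next three form
// the instruction routine and the last four form the data routine.
enum class Stage { IBuf, PCStack, IArb, ICache, RegFile, Dispatch, ALUEx, WriBack };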


Figure 4.3 Pipeline stages

The WriBack stage and the next IBuf stage can be done in the same cycle by tracking the execution state in the instruction buffer properly. When neither routine stalls (the request is chosen by the arbiter, the cache hits, and the ALU is single cycle), the two routines can switch tasks with no delay every 4 cycles.

In the baseline implementation there is no out-of-order instruction issue, but there is still pipelining between different warps, so it is still possible to fill up the ALU pipeline. We leave out-of-order issue as future work; a related algorithm such as scoreboarding could be added at the instruction buffer.

Furthermore, for the sake of simulation efficiency, the operating flow in our simulator is much "flatter". We only simulate the appearance of the data routine pipeline by stalling tasks in arrays (see Figure 4.4). Because issue is in-order and instructions overlap little, there is no dependency problem inside each warp. The simulator performs instruction fetch, decode, register read/write and the instruction's function all in the same cycle, yet remains nearly cycle accurate.


Figure 4.4 Simulator model
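A minimal sketch of this flat model, assuming each warp merely records the cycle at which its current task would really finish (the actual simulator is SystemC-based and more detailed):

#include <cstdint>
#include <functional>

// Hypothetical "flat" per-warp model: the instruction is fetched, decoded and
// executed functionally in one call, and the warp is then stalled until the
// cycle at which the modeled latency and pipeline would have finished.
struct WarpSlot {
    uint64_t busy_until = 0;   // the warp may issue its next task at this cycle
};

void stepWarp(WarpSlot& w, uint64_t now, uint64_t modeled_delay,
              const std::function<void()>& executeInstruction) {
    if (now < w.busy_until) return;      // still stalled: the pipeline "appearance"
    executeInstruction();                // fetch, decode, register R/W and function in one cycle
    w.busy_until = now + modeled_delay;  // stall for the modeled latency + pipeline interval
}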

4.2 Warp Scheduling Policy

To simplify the control unit, our warp scheduling policy is very similar to fine-grained multithreading. The dispatcher delivers as many warps to the ALUs as possible according to their OP-codes. For pipelined ALUs, every pipeline stage is filled with a different warp.

The arbitration among warps competing for the same ALU uses static priority. By staggering their progress, we avoid warps executing the same instruction and colliding at the dispatcher, and the scheme is also simple to implement.

This policy has the effect of hiding memory latency. Imagine that the warp with the highest priority (w0, for example) initially occupies the computation ALU. It may then hit a memory read instruction and stall. Meanwhile, the computation ALU is released to lower-priority warps, and so on as warps are switched. Until the original warp (w0) finishes its memory access, the computation resource is kept busy without waste.
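A minimal sketch of the static-priority dispatch for one ALU type (assuming the smallest warp ID has the highest priority; names are illustrative):

#include <vector>

// Pick, for one ALU type, the ready warp with the highest static priority
// (lowest warp ID). A warp stalled on a memory access is simply not ready,
// so lower-priority warps take over the computation ALU in the meantime.
// Returns -1 when no warp wants this ALU type in this cycle.
int pickWarpForAlu(const std::vector<bool>& wantsThisAlu,
                   const std::vector<bool>& isReady) {
    for (int w = 0; w < static_cast<int>(wantsThisAlu.size()); ++w)
        if (wantsThisAlu[w] && isReady[w])
            return w;          // static priority: smallest warp ID wins
    return -1;
}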

Figure 4.5 A detailed snapshot of operating flow hiding memory latency (light green: memory unit; dark green: cycle blocker, i.e., pipeline or latency)

4.3 Types of ALU

In our configurable multiprocessor, ALUs can easily be plugged in, removed, and customized. The set is not restricted to one computational ALU and one memory access unit; we also support multiple ALUs of the same type. If there are two integer ALUs, the dispatcher can issue two of the warps requiring that kind of ALU in a proper way.

It might be a little confusing how to count the number of "cores". Imagine an all-purpose ALU: the number of cores is simply the width of that SIMD (single instruction multiple data) ALU. In our machine, the different ALU types can be thought of as a separation of that all-purpose ALU, so the number of cores in our multiprocessor can be estimated as the maximum SIMD ALU width across the different ALU types.
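As a one-line sketch of this estimate (the width list is hypothetical):

#include <algorithm>
#include <vector>

// Estimated number of "cores": the maximum SIMD width across all ALU types,
// e.g. the widths of {Tid, Brh, Mem, Int, Flt, Slw, Ray}.
unsigned estimatedCores(const std::vector<unsigned>& simdWidths) {
    return simdWidths.empty() ? 0u
                              : *std::max_element(simdWidths.begin(), simdWidths.end());
}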


4.3.1 Framework and Typical ALUs

To realize the features mentioned above, we define a simple framework for these ALUs. It requires an "ALU type" constant signal, an "ALU is free for a new task next cycle" signal, and a "warp ID to free" register signal. With these signals, the dispatcher can count the amount of resource of each type. After a fairly complicated multiplexer executes the dispatch policy, a customized ALU receives the decoded instruction together with register data (rs, rt), immediate data, and the OP-code.

In our ray-pool based multiprocessor, we define the following ALUs: TidALU (for thread/block indexing and dimension control), BrhALU (for branch control such as predicates, jumps, and the function stack), MemALU (for memory access with the cache), IntALU (for integer computation), FltALU (for floating-point computation), SlwALU (for slow special computations such as division and square root), and RayALU (for special ray tracing functions).

In the simulator, we define a virtual base class for such ALUs. The default operating model is an OP-code-dependent fixed latency plus a constant number of pipeline stages. The ALUs used in our case are configured as in Table 4.1; the notation "depends" means the ALU overrides the C++ virtual function for more detailed simulation. These special ALUs are discussed later. The latency and pipeline configurations are illustrated in Figure 4.6.
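The framework can be pictured as a small abstract base class. The member names below are illustrative rather than the exact interface in our code; the default model completes a task after its OP-code-dependent latency and accepts a new one after the pipeline interval.

#include <cstdint>

// Illustrative ALU framework. The signals mirror the text: a constant "ALU
// type", an "is free for a new task next cycle" flag, and a "warp ID to free"
// register.
struct DecodedInstr {
    uint32_t opcode;
    int32_t  rs, rt;      // register data
    int32_t  imm;         // immediate data
    unsigned warp_id;
};

class AluBase {
public:
    explicit AluBase(unsigned type) : type_(type) {}
    virtual ~AluBase() = default;

    unsigned type() const { return type_; }                 // "ALU type" constant
    bool     isFreeNextCycle() const { return free_; }       // "is free" signal
    int      warpIdToFree() const { return warp_to_free_; }  // -1: none this cycle

    // Default model: fixed latency + constant pipeline interval per OP-code.
    // Special ALUs (e.g. MemALU, RayALU) override these for detailed simulation.
    virtual void issue(const DecodedInstr& in, uint64_t now) = 0;
    virtual void tick(uint64_t now) = 0;

protected:
    unsigned type_;
    bool     free_ = true;
    int      warp_to_free_ = -1;
};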

Table 4.1: Latency and pipeline configuration of ALUs

Types      Tid   Brh   Mem       Int   Flt   Slw   Ray
latency    1     1     depends   1     1     1     depends
pipeline   1     2     2         2     3     11    2

Figure 4.6 Latency and pipeline configuration illustration

4.3.2 Memory ALU

Both the data cache and memory can be mapped onto our ALU framework as a new kind of ALU. By dynamically handling the ALU latency and managing the "is free" and "warp ID to free" signals, we can simulate more complex behavior such as a non-blocking cache. The process can be broken into two parts.


Figure 4.7 Simulating non-blocking cache

The first part is control of the "is free" signal. Though we want to simulate a non-blocking cache, there are still some blocking processes, since the cache has limited bandwidth while a warp contains multiple threads that may request different cache lines. For simulation efficiency we add a fake 2-way cache that only records cache tags without storing data; having the tag signature is enough to report a cache hit or miss. Each cache line access leads to a one-cycle stall. For a coalesced read, the blocking time equals (warp size) / (cache bandwidth), while it equals (warp size) for separate reads.
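A sketch of the tag-only 2-way cache and the resulting blocking time, under the assumptions above (class and parameter names are illustrative):

#include <array>
#include <cstdint>
#include <vector>

// Tag-only 2-way set-associative "fake" cache: it stores no data, only tags,
// which is enough to decide hit or miss for the timing model.
class FakeCache {
public:
    FakeCache(size_t n_sets, unsigned line_bytes)
        : sets_(n_sets), line_bytes_(line_bytes) {}

    // Returns true on hit; on miss the tag is installed (LRU replacement).
    bool accessIsHit(uint64_t addr) {
        uint64_t line = addr / line_bytes_;
        uint64_t tag  = line / sets_.size();
        auto&    set  = sets_[line % sets_.size()];
        if (set[0].valid && set[0].tag == tag) return true;
        if (set[1].valid && set[1].tag == tag) {
            auto tmp = set[0]; set[0] = set[1]; set[1] = tmp;   // promote to MRU
            return true;
        }
        set[1] = set[0];                 // evict the LRU way
        set[0].tag = tag;
        set[0].valid = true;
        return false;
    }

private:
    struct Way { uint64_t tag = 0; bool valid = false; };
    std::vector<std::array<Way, 2>> sets_;
    unsigned line_bytes_;
};

// Cycles the MemALU keeps "is free" deasserted for one warp access, per the
// text: (warp size) / (cache bandwidth) for a coalesced read, (warp size) for
// separate reads.
unsigned blockingCycles(bool coalesced, unsigned warp_size, unsigned cache_bandwidth) {
    return coalesced ? warp_size / cache_bandwidth : warp_size;
}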

The second part is deciding when to free the warp. It can be done by maintaining two timing tables. On one side, we record the time at which each missed cache line comes back; from these we can calculate the exact time the warp will finish. On the other side, we keep this finish time in another table, since it is not yet time to free that warp. Both are implemented with event-based simulation and accelerated by a priority queue data structure.
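A sketch of the two timing tables built on std::priority_queue (earliest event first); the structure and names are illustrative:

#include <algorithm>
#include <cstdint>
#include <functional>
#include <queue>
#include <unordered_map>
#include <utility>
#include <vector>

// Event-based timing for the MemALU: record when each missed line returns,
// derive the warp's finish time as the latest return, and keep a "to-finish"
// min-heap so warps are freed in time order.
struct MemAluTiming {
    using Event = std::pair<uint64_t, int>;              // (cycle, warp ID)
    std::unordered_map<int, uint64_t> finish_time;       // latest miss-return time per warp
    std::priority_queue<Event, std::vector<Event>,
                        std::greater<Event>> to_finish;  // "to-finish" table

    // A missed cache line will come back after the miss penalty; the warp can
    // only finish once its latest missing line has returned.
    void onMiss(int warp, uint64_t now, uint64_t miss_penalty) {
        finish_time[warp] = std::max(finish_time[warp], now + miss_penalty);
    }

    // All of the warp's line accesses have been checked: move its finish time
    // into the second table, since it is not yet time to free the warp.
    void onAllLinesChecked(int warp, uint64_t now) {
        auto it = finish_time.find(warp);
        to_finish.push({it == finish_time.end() ? now : it->second, warp});
        if (it != finish_time.end()) finish_time.erase(it);
    }

    // Each cycle, free every warp whose finish time has arrived and report its
    // "warp ID to free" back to the dispatcher.
    void tick(uint64_t now, const std::function<void(int)>& freeWarp) {
        while (!to_finish.empty() && to_finish.top().first <= now) {
            freeWarp(to_finish.top().second);
            to_finish.pop();
        }
    }
};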

We set a constant cache miss penalty of 120 cycles. The value is derived from Micron's DDR3 Verilog model [18]: with a simple close-page memory controller at the configuration of sg25, JEDEC DDR3-800E, 400 MHz inner clock and 100 MHz outer clock, it takes about 120 ns to transfer 64 bytes of data (one cache line).

4.4 Branch Control

Branch control is an important topic for a SIMT processor. We take advantage of predicate registers and PC-RPC stacks, similar to Fung's work [15].

Simple if-else branches are handled by predicate registers. Each instruction contains a predicate index, which is mapped to a predicate register recording the operation mask for the threads in that warp. Data in the predicate registers is handled and updated only by instructions on the BrhALU. There are 16 predicate registers for each warp in our implementation.
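As a minimal illustration, assuming an 8-thread warp and a per-warp array of 16 thread-mask predicate registers (the all-active default for p0 is an assumption):

#include <cstdint>

// Per-warp predicate registers: 16 thread masks, updated only by the BrhALU.
// p0 is initialized to all-active here (an assumption); the rest start cleared.
struct PredicateFile {
    uint8_t mask[16] = { 0xFF };
};

// A lane executes an instruction only if its bit in the referenced predicate is set.
inline bool laneActive(const PredicateFile& pf, unsigned pred_idx, unsigned lane) {
    return (pf.mask[pred_idx] >> lane) & 1u;
}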

As for function calls and loops, control over the PC-RPC stacks is also assigned to instructions on the BrhALU. When calling a function, proper instructions are executed to push a new entry onto the PC-RPC stack. The entry is popped at the end of the function, when PC >= RPC. Each warp has its own PC-RPC stack.
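A sketch of the per-warp PC-RPC stack behavior (push on call, pop when PC reaches the RPC); the 8-byte PC step for our 64-bit instructions and the field widths are assumptions:

#include <cstdint>
#include <vector>

// Per-warp PC-RPC stack: BrhALU instructions push an entry when calling a
// function; the entry is popped once PC >= RPC (the return/reconvergence PC).
struct PcRpcStack {
    struct Entry { uint32_t pc; uint32_t rpc; };
    std::vector<Entry> stack;

    void call(uint32_t target_pc, uint32_t return_pc) {
        stack.push_back({target_pc, return_pc});
    }

    // Advance the active PC by one instruction (64-bit instructions, so 8 bytes
    // here, an assumption) and pop when the top entry runs past its RPC.
    void advance() {
        if (stack.empty()) return;
        stack.back().pc += 8;
        if (stack.back().pc >= stack.back().rpc) stack.pop_back();
    }

    uint32_t currentPc() const { return stack.empty() ? 0 : stack.back().pc; }
};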

4.5 Instruction Generating Tool

The instructions of the different kernels can be generated by a tool built into our simulator. Since it is difficult to write and maintain machine code, this tool helps in writing kernel code for our self-defined instruction set. It is not sufficient to be called a compiler; it is closer to an assembler, though it does more than an assembler.

We make use of operator overloading in C++. We declare a C++ class AsmReg to denote registers; each AsmReg carries its register index and a note of its data type (integer or floating point). When two AsmRegs are combined in an operation (such as addition), we temporarily generate an AsmInfo object. When this AsmInfo meets an assignment operator ("=") with a destination AsmReg, our instruction generator pushes the proper instruction (iadd rD rS rT, for example) into the kernel code storage.
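A condensed sketch of this operator-overloading trick; the class names follow the text, but the fields and the string-based kernel code storage are simplified assumptions:

#include <string>
#include <vector>

// Kernel code storage: generated instructions as assembly-like strings.
static std::vector<std::string> g_kernel_code;

enum class DType { Int, Flt };

struct AsmInfo {                      // temporary object: an operation awaiting its destination
    std::string op;
    int rs, rt;
};

struct AsmReg {                       // denotes one register in kernel code
    int   index;
    DType type;

    // rD = (rS op rT): push the finished instruction, e.g. "iadd r2 r0 r1".
    AsmReg& operator=(const AsmInfo& info) {
        g_kernel_code.push_back(info.op + " r" + std::to_string(index) +
                                " r" + std::to_string(info.rs) +
                                " r" + std::to_string(info.rt));
        return *this;
    }
};

// rS + rT does not compute anything; it only records the operation.
inline AsmInfo operator+(const AsmReg& a, const AsmReg& b) {
    return {a.type == DType::Int ? "iadd" : "fadd", a.index, b.index};
}

// Usage: AsmReg r0{0, DType::Int}, r1{1, DType::Int}, r2{2, DType::Int};
//        r2 = r0 + r1;   // appends "iadd r2 r0 r1" to the kernel code storage

Writing r2 = r0 + r1 in kernel code therefore emits one self-defined instruction instead of performing the addition itself.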

Figure 4.8 Illustration of instruction generating tool

Meanwhile, the operators used in kernel code are still primitive C++ operators. We can easily change the definition of AsmReg to a primitive C++ integer or floating-point type while keeping the functionality correct. That is, our generating tool provides two modes: software simulation and hardware simulation.

Pure software simulation runs the kernel code as primitive C++ code. In this mode, the kernel code is independent of the multiprocessor details such as the warp scheduler, registers, caches, and cycle counts of the SystemC simulation. Users can quickly check their algorithms and debug in the early phase.

Only the hardware simulation mode makes use of the SystemC multiprocessor. It is used for verifying simulation details and obtaining real performance indicators.

Figure 4.9 Two modes of simulation

4.6 Multiple Compute Units

Our multiprocessor simulator supports simulation of multiple compute units, too. Each compute unit possesses a warp scheduler, general and predicate registers, PC-RPC stacks, memory units such as instruction and data caches, bus masters to memory and the pools, and a set of ALUs. Each compute unit is a complete SIMT multiprocessor.

Beyond a single compute unit, a full-system multiprocessor such as a GPGPU may have several of these compute units. Due to hardware constraints, a single compute unit cannot grow without limit, so multi-compute-unit systems are built to serve larger computational requirements. Such a system requires a block scheduler, a bus, and arbitration among the bus masters of those compute units.

4.7 Optimization for Ray Tracing

Communicating with the two pools, keeping intersection data, and generating ray data require special function units; we set up the RayALU for this purpose.

The local data of intersections and rays plays a role very similar to shared memory or a software-defined cache in CUDA [17]. However, our ray-pool based ray tracing system is data-driven by these intersections and rays, which means the flow of this driving data must be kept fluent for high efficiency. Here we calculate the efficiency bound imposed by the bandwidth to the pools:

\[
\text{Bandwidth bound} = \frac{\text{bandwidth of ray pool (128 bits per ns)}}{\text{size of ray (48 bytes)}} = 333\ \text{MRays/sec}
\]

This is very tight for our targeted 100 MRays/sec performance; the ray flow must be kept ongoing between the pools.

We therefore design a 3-set local ray memory in our RayALU. One set works as the input buffer for intersection prefetching, another is used for the current computation, and the other acts as the output buffer. The roles of these sets are swapped accordingly by instructions. Each set holds the full ray contents for the total number of threads in that compute unit. In this way, we can keep the communication bandwidth with the pools nearly continuous.
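A sketch of the role rotation among the three sets (illustrative; in the real design the swap is triggered by RayALU instructions):

#include <array>
#include <cstdint>
#include <vector>

// Three local ray-memory sets: one importing (prefetching intersections from
// the pool), one used by the current computation, one exporting results back.
struct LocalRayMemory {
    std::array<std::vector<uint8_t>, 3> sets;   // raw storage per set
    int importing = 0, computing = 1, exporting = 2;

    // Rotate the roles when a RayALU swap instruction retires: the freshly
    // imported set becomes the working set, the finished set starts exporting,
    // and the drained set is reused for the next import.
    void swapRoles() {
        int drained = exporting;
        exporting = computing;
        computing = importing;
        importing = drained;
    }
};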

Figure 4.10 Local ray memory in RayALU

The local ray memory size is 5.25KB per set for the typical configuration (8 cores with 8 warps), summing to about 16KB for each compute unit.
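One way to reconcile these numbers, assuming each of the 64 threads in that configuration (8 warps of 8 threads) keeps one 84-byte intersection-sized record per set:

\[
64 \times 84\,\mathrm{B} = 5376\,\mathrm{B} = 5.25\,\mathrm{KB\ per\ set}, \qquad 3 \times 5.25\,\mathrm{KB} = 15.75\,\mathrm{KB} \approx 16\,\mathrm{KB\ per\ compute\ unit}.
\]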

However, there are still some limitations to fully utilizing the pool bandwidth. Since the whole multiprocessor is fired up by kernel launches at the same time, rays and intersections are produced and consumed in batches. Moreover, the local ray memory is organized for coalesced access in order to accelerate local memory reads and writes by warps, which means the batches of packets must be synchronized before and after kernels. Consequently, although the 3-set architecture seems efficient enough, achieving practically high efficiency with the pools still requires careful control both of the kernel launch algorithm on the host and of the kernel code programming.

Apart from this RayALU, our multiprocessor remains a general-purpose processor like a GPGPU, and the RayALU can be viewed as an example of a user-defined special ALU in our configurable multiprocessor.

Chapter 5

Methodology and Experiment

In this chapter we discuss the experiments on our simulator, starting from the testing benchmark and the configuration of our simulator, then how we measure the performance of the proposed system, and finally the results and discussion.

5.1 Testing Benchmarks

We test our proposed system on the "conference" scene. This scene has 331K triangles with 36 kinds of materials and is one of the most popular ray tracing benchmark scenes. The original model has no reflective materials, so we modify all materials so that the reflective luminance is half the intensity of the diffuse luminance. Our testing resolution is 512x512 pixels with one sample ray per pixel, and there is a single point light as the light source.

There are two testing modes: primary ray only and limited bounce depth. The primary-ray-only case is used to measure the maximum efficiency of our ray tracer: each pixel requires only an eye ray and a shadow ray, and this case has the highest memory coherence. The other mode is closer to a real rendering case for ray tracing; reflective effects can be observed after turning on the bounces, and here we limit the bounce depth to 5. Both modes are based on Whitted-style ray tracing.

Figure 5.1 Target scene: conference

5.2 Simulation Environment Setup

The following are the configuration and simulation parameters of our simulator.

5.2.1 Two Pools and Traverser

The two pools are configured with a maximum depth of 512 packages. Each ray package is 48 bytes while each intersection package is 84 bytes; that is, the ray pool is 24KB in size and the intersection pool is 84KB in size. The transfer rate of both pools is 128 bits per nanosecond, plus a two-nanosecond overhead for each request.

The efficiency of the Traverser is modeled statically. We use a pure software traverser running an AABB BVH tree algorithm. We do not vary the delay according to cache behavior or traversal depth, since we do not have a complete model of a hardware traverser. We simply assume it is as efficient as our Shader, transforming 64 rays into intersections at the end of each Shader kernel, or every 1000 nanoseconds when the Shader stalls.

The capability of the Traverser is varied in the experiments with different numbers of compute units: when doubling the number of compute units in the Shader, we also double the throughput of the Traverser to keep the system balanced.

5.2.2 Related Kernel and Host Control Policy

We have three pieces of kernel code: Casting Eye Ray, Casting Secondary Ray, and Shading. Casting Eye Ray generates eye rays from the camera and pixel coordinates. Casting Secondary Ray generates a reflection ray and a shadow ray based on the previous intersection. The Shading kernel draws out the light information from non-blocked shadow rays; whether a shadow ray is blocked is known from whether the Traverser returns a null intersection for it.

Figure 5.2 Three kinds of kernel


The kernel code for the two modes (primary ray only and limited bounce depth) differs slightly: in the primary-ray-only mode, the reflection calculation in the Casting Secondary Ray kernel is skipped. Table 5.1 shows the instruction composition of these kernels.

The kernel code also differs slightly for multiple compute units, because additional code is required to control the block dimensions, especially for the Casting Eye Ray kernel. Table 5.1 also shows the multiple-compute-unit case on the right-hand side of the "/" (the left-hand side is for a single compute unit).

Number of instructions (single / multiple compute unit(s))

Kernel        Cast Eye R   Cast 2nd R with refl. R   Cast 2nd R without refl. R   Shading
N of Instr.   77 / 85      191                        110                          54 / 55 *

Table 5.1: Number of instructions in different kernels

*: The Shading kernel requires additional synchronization to flush the local ray memory for multiple compute units.

Kernel    Cast Eye R   Cast 2nd R with refl. R   Cast 2nd R without refl. R   Shading
Consume   --           C nIS                      C nIS                        C sIS
Produce   C nR         C nR + C sR                C sR                         --

Notes: C denotes the number of contents processed in that kernel; nR means normal ray; sR means shadow ray; nIS means intersection of a normal ray; sIS means intersection of a shadow ray. There are also other mechanisms that destroy rays: rays exceeding the bounce depth limit and null intersections are discarded by the pools.

Table 5.2: Effect on pools of different kernels

The kernel launch policy controlled by the hosting CPU (or a local controller) is shown

