
Chapter 2 Previous Work

2.4 Short Summary

A quick comparison of these works is shown in Table 2.1. From these works, it appears that a special function unit handling ray traversal and intersection is a more efficient approach to ray tracing, and that for shaders there is a trade-off between programmability and fixed-function efficiency. From SGRT and GR6500, we learn that the cooperation between shaders and traversers is an important issue.

Our work explores another architecture. We propose a simple but complete system prototype whose performance approaches that of those state-of-the-art works. For comparison, the GTX680 is an extreme example of a purely general-purpose computation machine. Although it has abundant FLOPS and bandwidth, it holds only a small advantage over the others once they scale up. Our ray-pool based system is likewise a simple but efficient solution for ray tracing: the proposed system requires less computation resource while keeping the programmability and performance needed for ray tracing.

Table 2.1: Quick comparison of hardware ray tracers

                     GTX680 [9]    RayCore    SGRT           PowerVR
                     OptiX [10]                              GR6500      Ours
  Year               2013          2014       2013           2015        2015
  Process            28nm          28nm       65nm           28nm        40nm
  Type               GPU           ASIC       GPU+ASIC       GPU+ASIC    GPU+ASIC
  Clock              1GHz          500MHz     ASIC: 500MHz   600MHz      1GHz
                                              MIMD: 1GHz
  N Cores            1536          1+4        4              128         8x4+?
  GFLOPS             3000          -          72.5x4         150         8x4
  BW (GB/s)          192.2         -          -              -           8
  Area (mm2)         294           -          18+1.6         25          16
  Mrays/s (Primary)  432           193        184            300         100

Chapter 3

System Overview

In this chapter we give a brief description of the proposed “ray-pool based ray tracing system” (RPRT for short).

3.1 Main Concept

For real-time ray tracing hardware, how to divide the algorithm between programs and ASICs while keeping them cooperating efficiently is an important topic. Looking at previous work, most designs come with a ray traversal ASIC, and many of them preserve some programmable processor parts.

We also adopt this idea and divide the whole process between processors and special ASICs. Since there are many different ray tracing algorithms, it is better to keep some programmable parts for different applications. Moreover, a programmable processor can share the computation of applications other than ray tracing if needed.

Meanwhile, across the different types of ray tracing algorithms, many share a similar process of finding ray-triangle intersections. Speeding up this intersection process is an important efficiency topic, so we dedicate a special ASIC to accelerating it while keeping support for different algorithms.

To make these two parts cooperate efficiently, we insert two pools between them. Through the arrangement and virtualization of the two pools, the two computational units can work simultaneously and independently, suffering few stalls from each other.
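As a minimal sketch of this decoupling (the Ray layout and the capacity handling here are illustrative assumptions, not our exact implementation), each pool behaves like a bounded FIFO shared by the two units:

    #include <cstddef>
    #include <queue>

    // Hypothetical ray record; the real entry layout depends on the ISA.
    struct Ray { float org[3], dir[3]; int pixelId; };

    // A bounded FIFO decoupling the Shader (producer) from the
    // Traverser (consumer); each side only waits on its own end.
    class RayPool {
    public:
        explicit RayPool(std::size_t cap) : cap_(cap) {}
        bool push(const Ray& r) {                 // called by the Shader
            if (q_.size() >= cap_) return false;  // Shader stalls only when full
            q_.push(r);
            return true;
        }
        bool pop(Ray& r) {                        // called by the Traverser
            if (q_.empty()) return false;         // Traverser idles only when empty
            r = q_.front();
            q_.pop();
            return true;
        }
        std::size_t size() const { return q_.size(); } // read by the Controller
    private:
        std::queue<Ray> q_;
        std::size_t cap_;
    };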

Figure 3.1 Main concept of RPRT

3.2 System Architecture

Figure 3.2 shows the architecture diagram; the following is an illustration of its computational process.

Figure 3.2 System architecture of RPRT

Our system can be viewed as a new graphics card. First, the host CPU moves instructions (kernel codes) and related data from main memory to the DRAMs of our system. After the contents are ready, the host CPU sends a starting signal with the kernel code address to launch. The Shader then starts running the corresponding kernel. At the end of the kernel, the generated rays are pushed into the Ray Pool, which triggers the task of the Traverser.

After the kernel ends, the Controller decides the next kernel to launch based on the sizes of the two pools. The Controller may be the host CPU or an on-board embedded CPU. This kernel selection process repeats many times to complete a single frame.
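A rough sketch of this control step is shown below; the concrete policy (launching whichever kernel drains the larger pool) is only an assumed example, not the exact heuristic:

    // Hypothetical kernel-selection step driven by the two pool sizes.
    enum Kernel { SHADE_KERNEL, RAY_GEN_KERNEL, FRAME_DONE };

    Kernel chooseNextKernel(std::size_t rayPoolSize,
                            std::size_t isectPoolSize,
                            bool frameFinished) {
        if (frameFinished)
            return FRAME_DONE;
        // Drain intersection results when they dominate; otherwise
        // generate more rays to keep the Traverser busy.
        return (isectPoolSize >= rayPoolSize) ? SHADE_KERNEL : RAY_GEN_KERNEL;
    }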

As for the memory system, since the two computational units (Shader and Traverser) work separately, we use two independent memory spaces and ports for them. Only the triangle data is duplicated for both of them.

In our current system, the acceleration data structure for traversal, such as a BVH tree, needs to be pre-computed and moved to the Traverser’s memory in advance. However, this process could also be done later through cooperation between Shader kernels and the host CPU.

3.3 Shader (Configurable Multiprocessor)

Our Shader is a self-designed configurable multiprocessor, and it is the main contribution of this work. More details are discussed in Chapter 4.

Since there are currently few open resources for a general-purpose multiprocessor that can be easily configured and linked with self-defined special units in both a software simulator and a hardware implementation, we built one ourselves. Our configurable multiprocessor is a CUDA-like SIMT (single instruction multiple threads) processor, which has the benefits of being easier to program [11] and able to hide memory latency.

Our configurable multiprocessor can be changed in its numbers of threads (cores), warps and compute units. It is written with many parameters in both the simulator and the hardware Verilog implementation, such as the pipeline stages of ALUs and the combination of ALUs. It is also easy to plug in self-defined function units by following simple interfaces.

The following is the configuration of the multiprocessor used in the proposed typical RPRT. It has 4 compute units, each containing an 8-core ALU set with 8 warps, a 4KB instruction cache, an 8KB data cache and 16KB of specialized local memory for ray tracing.
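Expressed as code, these knobs could look like the following sketch (the field names are illustrative; the actual simulator parameters may differ):

    // Illustrative configuration record for one multiprocessor instance;
    // the default values mirror the typical RPRT setup described above.
    struct MpConfig {
        int computeUnits  = 4;          // number of compute units
        int coresPerCu    = 8;          // SIMD ALU width (cores)
        int warpsPerCu    = 8;          // resident warps per compute unit
        int iCacheBytes   = 4 * 1024;   // L1 instruction cache
        int dCacheBytes   = 8 * 1024;   // L1 data cache
        int localMemBytes = 16 * 1024;  // specialized ray-tracing local memory
    };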

3.4 Ray Traverser

The details of the ray traverser are not the main focus of this thesis. There is plenty of research on hardware traversers, and many of those designs are compatible with our ray-pool based ray tracing system. We only assume that the traverser has nearly the same capability (in both throughput and latency) as our Shader, at about 100 Mrays per second.

The Ray Traverser is a hardware ASIC designed for traversal and intersection. Traversal is the step that finds which triangles may lie on the way of the target ray. Recent traversal schemes arrange triangles into a data structure such as a BVH (bounding volume hierarchy) tree [12] [13] according to spatial locality, and the main process is hardware-accelerated parallel tree traversal. Intersection is the step that exactly calculates the intersection point of the target ray and the candidate triangles suggested by the traversal step.
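A minimal software view of these two steps is sketched below; the stack-based walk and the helper tests (rayHitsBox, rayHitsTriangle) are assumptions for illustration, not the hardware datapath:

    // Illustrative types; a real node also encodes its box layout compactly.
    struct Ray     { float org[3], dir[3]; };
    struct BvhNode { float bounds[6]; int left, right, firstTri, triCount; };

    bool rayHitsBox(const Ray&, const float bounds[6], float tBest);  // assumed
    bool rayHitsTriangle(const Ray&, int triIndex, float* t);         // assumed

    // Traversal visits nodes whose boxes the ray hits; intersection then
    // tests the candidate triangles exactly and keeps the nearest hit.
    int traverse(const BvhNode* nodes, const Ray& ray) {
        int stack[64];
        int sp = 0;
        stack[sp++] = 0;                            // start at the root
        int hitTri = -1;
        float tBest = 1e30f;
        while (sp > 0) {
            const BvhNode& n = nodes[stack[--sp]];
            if (!rayHitsBox(ray, n.bounds, tBest))  // traversal test
                continue;
            if (n.triCount > 0) {                   // leaf: run intersection
                for (int i = 0; i < n.triCount; ++i) {
                    float t;
                    if (rayHitsTriangle(ray, n.firstTri + i, &t) && t < tBest) {
                        tBest = t;
                        hitTri = n.firstTri + i;
                    }
                }
            } else {                                // inner node: push children
                stack[sp++] = n.left;
                stack[sp++] = n.right;
            }
        }
        return hitTri;                              // nearest triangle, or -1
    }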

Take Chang’s work [14] for example: it is a complete system for ray tracing, designed in a cell-based design flow under the UMC 90nm process with about 9 mm2 of core area, operating at a clock rate of 125MHz. It is capable of 125M traversals and 62.5M intersections per second, which suits our requirement. It traverses the BVH tree with packets of 64 rays, which also balances with our Shader in the typical configuration of 8 cores and 8 warps (a batch of 64 thread contexts per compute unit).

3.5 Software Simulation

We build the simulation environment on SystemC version 2.3.0. It is mainly designed as a component-assembly model with approximate timing accuracy in functionality: it is near cycle accurate in the Shader part and approximately timed in the other parts. As for communication between modules, transportation inside our chip (Shader, Traverser and the two pools) is approximately timed based on bandwidth, and memory access from the Shader is approximately timed as well. The remaining communications, such as those with the host CPU and host memory, are untimed in our simulator. More details can be found in section 5.2.

Figure 3.3 Block diagram of software simulator

Figure 3.3 illustrates the main blocks of our simulator. The MP, Trav, RPool and ISPool parts are our SystemC modules and functions for the configurable multiprocessor (Shader), ray traverser, ray pool and intersection pool respectively. Functions of kernel control for the host CPU are written in TB.cpp/h. Benchmark data and the self-made instruction generating tools are handled in Instr.

We do all the measurements and optimize our system based on this simulator, and we verify the Shader part with the Verilog implementation. Experiment details are shown in Chapter 5.

3.6 Hardware Implementation

We implement our configurable multiprocessor in Verilog HDL and verify it by logic synthesis using Synopsys Design Compiler with the TSMC 40nm process. The designed working frequency is 1 GHz, and the cell area reported by the synthesis tool is 0.537 mm2 for a single compute unit under the configuration of 4 cores and 4 warps. More information is shown in Chapter 6.

Chapter 4

Configurable Multiprocessor Simulation

In this chapter we propose a simple but complete SIMT multiprocessor, which is the main contribution of this work.

The following are some specifications of our processor. It runs at a clock rate of 1 GHz with a 4-stage pipeline, using 64-bit self-defined instructions and 32-bit-wide data (both integer and single precision floating point). The numbers of cores and warps in each compute unit (a streaming multiprocessor in CUDA terms) are configurable from 1 to 32 as powers of 2. Each thread has 16 general registers. The number of compute units is also configurable, ranging from 1 to 4. Each compute unit possesses an 8KB L1 data cache and a 4KB L1 instruction cache; both caches are 2-way set associative.

The original framework of this processor was established by Yu-Sheng Lin and designed as a GPGPU (general-purpose GPU). He drew up the basic architecture by referring to a paper [15] about GPGPU-SIM [16]. Our processor further extends Lin’s work, making it more configurable, accurate and concrete. It is near cycle accurate now and can be verified against the hardware implementation.

As a result, our processor is a CUDA-like SIMT multiprocessor. We share very similar ideas of “threads, warps, blocks, kernels and SMs (we use the term ‘compute units’ here)” with the CUDA programming model [17].

4.1 Operating Flow

The operations of our processor are split into two parts (an instruction routine and a data routine), each with a few steps. We group a fixed number of threads (typically 8) into a warp as the basic unit in our operation flow. To keep things simple, we only discuss the operating flow of a single warp in section 4.1 and discuss the relationship between warps in section 4.2.

The instruction routine comprises the functions that request a correct instruction to run. First, the instruction buffer updates its state and starts an instruction request. Second, it finds the top entry of the PC (program counter) stack and increases the PC to the next instruction address. Then it sends a request to the instruction fetch unit. The request chosen by the arbiter is allowed to consult the instruction cache. Last, the instruction is brought back to the instruction buffer.

Figure 4.1 Instruction routine

On the other hand, the data routine comprises the functions that actually execute the instruction. Also starting from the instruction buffer, it consults the register file and the predicate register as the instruction indicates (decode). Execution then goes to the specific type of ALU according to the OP-code. Since other warps may want to use the same ALU, a dispatcher handles the traffic, and an arranger restores the finished tasks back into warp order using the warp ID.

Figure 4.2 Data routine

For the real hardware implementation, taking logic and flip-flop delays into account, we have the following pipeline stages: IBuf, PCStack, IArb, ICache, RegFile, Dispatch, ALUEx, WriBack.


Figure 4.3 Pipeline stages

The WriBack stage and the next IBuf stage can be done in the same cycle by tracking the execution state in the instruction buffer properly. In the case that both routines have no stalls (chosen by the arbiter, cache hits and single-cycle ALUs), they can switch tasks with no delay every 4 cycles.

In the baseline implementation there is no out-of-order instruction issuance, but there is still pipelining between different warps, so it remains possible to fill up the ALU pipeline. We leave out-of-order issuance as future work; a related algorithm such as scoreboarding could be added at the instruction buffer.

Furthermore, for the sake of simulation efficiency, the operating flow in our simulator is much more “flat”. We only simulate the “appearance” of the data routine pipeline by stalling the tasks in arrays (see Figure 4.4). Because issuance is in-order with little overlap between instructions, there is no dependency problem inside each warp. The simulator does all of instruction fetch, decode, register read/write and the instruction’s function in the same cycle, yet is still able to be near cycle accurate.


Figure 4.4 Simulator model

4.2 Warp Scheduling Policy

To simplify the control unit, our warp scheduling policy is very similar to fine-grained multithreading. The dispatcher delivers as many warps to the ALUs as possible according to their OP-codes. For pipelined ALUs, every pipeline stage is filled with a different warp.

The arbitration among warps contending for the same ALU uses static priority. By staggering their progress, we avoid warps executing the same instruction and colliding at the dispatcher. It is also simple to implement.

This policy has the effect of hiding memory latency. Imagine that the warp with the highest priority (w0, for example) initially occupies the computation ALU. It may then encounter a memory read instruction and stall. Meanwhile, the computation ALU is released to lower-priority warps, and so on as warps keep switching. Until the original warp (w0) finishes its memory access, the computation resource is kept busy without waste.
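As a sketch, the static-priority pick for one ALU type could look like the following (an assumed structure in which a lower warp ID means higher priority):

    // Pick the ready warp with the lowest ID (highest static priority)
    // among those requesting this ALU type; returns -1 when none is ready.
    int pickWarp(const bool ready[], const int wantedAluType[],
                 int numWarps, int aluType) {
        for (int w = 0; w < numWarps; ++w)        // scan in priority order
            if (ready[w] && wantedAluType[w] == aluType)
                return w;                         // e.g. w0 wins when ready
        return -1;                                // this ALU idles this cycle
    }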

Figure 4.5 A detailed snapshot of the operating flow hiding memory latency (light green: memory units; dark green: cycle blockers, i.e. pipeline or latency)

4.3 Types of ALU

In our configurable multiprocessor, ALUs can be easily plugged in or out and customized. The configuration is not restricted to one computational ALU and one memory access unit; we also support multiple ALUs of the same type. If there are two integer ALUs, the dispatcher can issue two of the warps requiring that kind of ALU in a proper way.

It might be a little confusing how to count the number of “cores”. Consider an imaginary all-purpose ALU: the number of cores would simply equal the width of that SIMD (single instruction multiple data) ALU. In our machine, the different types of ALU can be thought of as a separation of that all-purpose ALU, so the number of cores in our multiprocessor can be estimated as the maximum SIMD ALU width across the different types of ALU.
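For example, the typical configuration above counts as 8 cores because its widest SIMD ALU has 8 lanes; as a trivial sketch (the widths are illustrative):

    #include <algorithm>
    #include <vector>

    // Core count estimated as the maximum SIMD width over all ALU types.
    int estimateCores(const std::vector<int>& simdWidths) {
        return *std::max_element(simdWidths.begin(), simdWidths.end());
    }
    // estimateCores({8, 8, 4, 8}) == 8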


4.3.1 Framework and Typical ALUs

To carry out the features mentioned above, we define a simple framework for these ALUs. It requires an “ALU type” constant signal, an “ALU is free for a new task next cycle” signal and a “warp ID to free” register signal. With these signals, the dispatcher can count the amount of resource of each type. After a complicated multiplexer executes the dispatch policy, the customized ALU receives the decoded instruction with register data (rs, rt), immediate data and the OP-code.
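In the simulator this framework could be expressed as a base class like the sketch below; the member names are assumptions, and the real interface carries the full per-lane operands:

    // Sketch of the ALU plug-in interface described above. The default
    // timing model completes a task after an OP-code dependent fixed
    // latency plus constant pipeline stages (see Table 4.1); special
    // ALUs override it for detailed simulation.
    class AluBase {
    public:
        virtual ~AluBase() {}
        virtual int  type() const = 0;            // constant “ALU type” signal
        virtual bool isFreeNextCycle() const = 0; // can take a task next cycle
        virtual int  warpIdToFree() const = 0;    // warp whose result retires
        // Receive one dispatched task: OP-code, register data (rs, rt)
        // and immediate data for every thread lane of the warp.
        virtual void issue(int warpId, int opcode,
                           const int* rs, const int* rt, int imm) = 0;
        virtual void tick() = 0;                  // advance one clock cycle
    };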

In our ray-pool based multiprocessor, we define the following ALUs: TidALU (for thread/block indexing and dimension control), BrhALU (for branch control such as predicates, jumps and the function stack), MemALU (for memory access with cache), IntALU (for integer computation), FltALU (for floating point computation), SlwALU (for slow special computation functions such as division and square root), and RayALU (for the special functions of ray tracing).

In the simulator, we define a virtual class for such ALUs. The default operating model is an OP-code dependent fixed latency plus constant pipeline stages. The ALUs used in our case are configured as in Table 4.1, where the notation “depends” means the ALU overrides the C++ virtual function for detailed simulation; these special ALUs are discussed later. The latency and pipeline configurations are illustrated in Figure 4.6.

Table 4.1: Latency and pipeline configuration of ALUs

  Types      Tid   Brh   Mem       Int   Flt   Slw   Ray
  Latency    1     1     depends   1     1     1     depends
  Pipeline   1     2     2         2     3     11    2

Figure 4.6 Latency and pipeline configuration illustration

4.3.2 Memory ALU

Both the data cache and memory can be mapped onto our ALU framework as a new kind of ALU. By dynamically handling the ALU latency and managing the “is free” and “warp ID to free” signals, we can simulate more complex behavior such as a non-blocking cache. The process can be broken into two parts.


Figure 4.7 Simulating non-blocking cache

The first part is the “is free” signal control. Although we want to simulate a non-blocking cache, there are still some blocking processes: the cache has only limited bandwidth, while a warp contains multiple threads that may request different cache lines. Under the consideration of simulation efficiency, we add a fake 2-way cache that only records the cache tags without storing data; having the cache tag signature is enough to report a cache hit or miss. Each cache line access leads to one cycle of stall. For a coalesced read, the blocking time equals (warp size) / (cache bandwidth), while it equals (warp size) for separate reads.
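In code, this stall model reduces to the following sketch (an assumed helper returning the cycles a warp occupies the cache port):

    // Cycles a warp blocks the cache: one cycle per cache line touched.
    int blockingCycles(int warpSize, int cacheBandwidth, bool coalesced) {
        return coalesced ? warpSize / cacheBandwidth  // threads share lines
                         : warpSize;                  // one line per thread
    }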

The second part is deciding when to free the warp. This can be done by maintaining two timing tables. On one side, we record the time at which each missed cache line comes back, from which we can calculate the exact time the warp will finish. On the other side, we keep this finish time in another table, since it is not yet time to free that warp. These are implemented with event-based simulation and accelerated by a priority queue data structure.
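A minimal sketch of this event-driven bookkeeping, with a priority queue ordered by completion time, follows (the names are illustrative, and the 120-cycle penalty is the one justified below):

    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    // Events are (finishTime, warpId) pairs; the earliest finish time is
    // served first, mirroring the two timing tables described above.
    using Event = std::pair<long, int>;
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> pending;

    void onCacheMiss(long now, int warpId) {
        const long kMissPenalty = 120;               // cycles, DDR3 model [18]
        pending.push({now + kMissPenalty, warpId});  // line returns at this time
    }

    // Called every cycle: free the warps whose last outstanding line is back.
    void retireReady(long now, bool warpFree[]) {
        while (!pending.empty() && pending.top().first <= now) {
            warpFree[pending.top().second] = true;
            pending.pop();
        }
    }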

We set a constant cache miss penalty of 120 cycles. The value is chosen based on Micron’s DDR3 Verilog model [18]: we tried a simple memory controller with a close-page policy at the configuration of sg25, JEDEC DDR3-800E, inner clock 400MHz and outer clock 100MHz, and it requires about 120 ns to carry 64 bytes of data (a cache line).

4.4 Branch Control

Branch control is an important topic for a SIMT processor. We take advantage of predicate registers and PC-RPC stacks, similar to Fung’s work [15].

Simple if-else branches are handled by predicate registers. Each instruction contains a predicate index, which is mapped to a predicate register recording the operation mask for the threads in that warp. Data in the predicate registers is handled and updated only by instructions for the BrhALU. There are 16 predicate registers for each warp in our implementation.

As for function calls and loops, we also assign control over the PC-RPC stacks to instructions for the BrhALU. When calling a function, proper instructions are executed to push a new entry onto the PC-RPC stack. The entry is popped at the end of the function, when PC >= RPC. Each warp has its own PC-RPC stack.
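A software sketch of this reconvergence rule follows (the structure and field names are assumptions; the real stack is a per-warp hardware table):

    // Per-warp PC-RPC stack: the top entry is popped once the program
    // counter reaches its reconvergence PC (PC >= RPC).
    struct StackEntry { unsigned pc, rpc; };

    struct PcRpcStack {
        StackEntry entries[32];
        int top = 1;                    // entries[0] is the base frame

        void call(unsigned target, unsigned rpc) {  // pushed by a BrhALU op
            entries[top++] = {target, rpc};
        }
        // Fetch the next PC for this warp, popping a finished frame first.
        unsigned nextPc() {
            if (top > 1 && entries[top - 1].pc >= entries[top - 1].rpc)
                --top;                  // end of function: PC >= RPC
            return entries[top - 1].pc++;
        }
    };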

4.5 Instruction Generating Tool

The instructions of the different kernels can be generated by a tool in our simulator. Since it is difficult to write and maintain machine code, this tool is helpful for writing kernel code in our self-defined instruction set. The tool is still insufficient to be called a compiler; it is more like an assembler, though it does more than an assembler.

We make use of operator overloading in C++. We declare a C++ class AsmReg to denote registers; each AsmReg carries its register index and a note of its data type (integer or floating point). When two AsmRegs are written in an operation (such as an addition), we temporarily generate an AsmInfo object. When this AsmInfo meets an assignment operator (“=”) with a destination AsmReg, our instruction generator pushes a proper instruction (iadd rD rS rT, for example) into the kernel code storage.

Figure 4.8 Illustration of instruction generating tool
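The core of this mechanism could look like the sketch below; AsmReg, AsmInfo and the emitted iadd form come from the description above, while the member names, opcode constants and emitInstruction helper are assumptions:

    // Opcode constants and emitInstruction are hypothetical stand-ins.
    enum Op { OP_IADD, OP_FADD };
    void emitInstruction(int op, int rd, int rs, int rt);  // stores machine code

    struct AsmInfo { int op, rs, rt; };   // a pending, not-yet-emitted operation

    struct AsmReg {
        int index;
        bool isFloat;
        // Combining two registers yields a pending operation record.
        AsmInfo operator+(const AsmReg& rhs) const {
            return {isFloat ? OP_FADD : OP_IADD, index, rhs.index};
        }
        // Assignment to a destination register finalizes the instruction,
        // e.g. “iadd rD rS rT”.
        AsmReg& operator=(const AsmInfo& info) {
            emitInstruction(info.op, index, info.rs, info.rt);
            return *this;
        }
    };

    // Usage inside a kernel source:
    //   AsmReg r1{1, false}, r2{2, false}, r3{3, false};
    //   r3 = r1 + r2;   // emits “iadd r3 r1 r2” into kernel code storage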

Meanwhile, the operators used in the kernel codes are still primitive C++ operators. We can easily change the definition of AsmReg to a primitive C++ integer or floating point type while keeping the functionality correct. That is, our generating tool provides two modes: software simulation and hardware simulation.
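This mode switch can be as simple as redefining the register type, as in the sketch below (the macro name is an assumption):

    // Hardware-simulation mode uses the generator class; software mode
    // maps kernel “registers” directly onto primitive C++ variables.
    #ifdef HW_SIM
    typedef AsmReg FloatReg;   // operators emit self-defined instructions
    #else
    typedef float  FloatReg;   // operators run as primitive C++ arithmetic
    #endif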

Pure software simulation runs the kernel codes as primitive C++ code. In this mode, the kernel codes are independent of the multiprocessor, including the warp scheduler, registers, caches and cycles in the SystemC simulation. Users can quickly check their algorithms and debug in the early phase.

Only the hardware simulation mode makes use of the SystemC multiprocessor. It is used for verifying simulation details and obtaining real performance indicators.
