
Chapter 1 Introduction

1.4 Thesis Organization

The chapters below are arranged as follows. Chapter 2 introduces and reviews some previous ray tracing engines with a brief comparison. Chapter 3 gives an overview of the concepts of the proposed ray-pool based ray tracing system. Chapter 4 focuses on the proposed configurable multiprocessor for this system. Chapter 5 presents the methodology of the simulations and experiments on the performance of the system. Chapter 6 describes the hardware implementation of the multiprocessor to show its feasibility. Chapter 7 then discusses the non-simulated parts and future work, and the conclusion of this thesis is offered in Chapter 8.

Chapter 2

Previous Work

In this chapter, we introduce several hardware-accelerated ray tracing techniques and devices. In Section 2.4, we give a brief comparison of these works and draw a short summary from them.

2.1 RayCore

RayCore [5] is a hardware ASIC (application-specific integrated circuit) for ray tracing on mobile devices. Although its authors describe RayCore as an MIMD (multiple instruction multiple data)-based ray tracing architecture, they also reveal that RayCore is built from fixed-function units instead of programmable units. We would therefore classify it as an MIMD-inspired ASIC, since RayCore borrows ideas from MIMD architectures to build its function pipeline.

It delivers 193 Mrays/s and is claimed to render HD 720p at 56 FPS.

RayCore performs Whitted-style ray tracing with an SAH kd-tree as the acceleration data structure. It supports dynamic scene construction, building an SAH kd-tree for 64K triangles within 20 ms.

RayCore is composed of six ray tracing units (RTUs) and one tree building unit (TBU). It targets a 28 nm process with an area of 18 (RTU) + 1.6 (TBU) mm2 and a clock rate of 500 MHz. It is power efficient, consuming about 1 W.

RayCore is a complete work supporting an OpenGL ES 1.1-like API and has been verified on an FPGA board.

2.2 SGRT

SGRT is short for Samsung reconfigurable GPU-based ray tracing [6]. SGRT is a mobile solution for ray tracing built from two parts: a GPU and an ASIC for T&I (traversal and intersection). A typical device has 4 SGRT cores; each SGRT core consists of an SRP (Samsung Reconfigurable Processor) and a T&I unit.

The SRP is an MIMD mobile GPU with a theoretical peak performance of 72.5 GFLOPS (giga floating-point operations per second) per core at a 1 GHz clock frequency, occupying 2.9 mm2 in a 65 nm process. It comes with a very long instruction word engine and a coarse-grained reconfigurable array of ALUs.

A T&I unit [7] consists of a ray dispatch unit, 4 traversal units, and an intersection unit. The T&I unit in SGRT selects the BVH tree as its acceleration structure. SGRT proposes a parallel pipelined traversal unit architecture to further accelerate the traversal process.

The 4-core SGRT delivers 184 Mrays/s on the Conference scene and is claimed to render FHD 1080p at 34 FPS. SGRT does not support dynamic scene construction: only static scenes are tested, and the BVH trees are built by the CPU. Though it seems weaker than RayCore in performance, it retains a programmable GPU part.

Their processes also differ. SGRT targets an older 65 nm process with an area of 25 mm2 for 4 cores. The clock rate is 500 MHz for the T&I unit and 1 GHz for the SRP. SGRT is also verified on an FPGA board.

2.3 PowerVR GR6500

PowerVR GR6500 [8] is an upcoming device designed by Imagination Technologies. It is a GPU embedded with ray tracing hardware, but no complete official datasheet is available yet; only limited information has been revealed.

It is claimed to render at 300 Mrays/s with 100M dynamic triangles. GR6500 supports hybrid rendering (rasterization + ray tracing) and dynamic scene construction through a scene hierarchy generator. Its ray tracing unit is also fixed-function.

Besides the ray tracing unit, there is a ray tracing data master that manages the transmission of ray intersection data to the main scheduler in preparation for shaders to run. Apart from that, GR6500 is still a GPU, with 128 ALU cores running at 600 MHz for 150 GFLOPS.

2.4 Short Summary

A brief comparison of these works is shown in Table 2.1. From these works, it seems that a special function unit handling ray traversal and intersection is a more efficient approach to ray tracing, and that fixed-function shaders trade programmability for efficiency. From SGRT and GR6500, we learn that the cooperation between shaders and traversers is an important issue.

Our work is a trial of another architecture. We propose a simple but complete system prototype with performance approaching that of these state-of-the-art works. In comparison, the GTX680 is an extreme example of a purely general-purpose computation machine: though it has abundant FLOPS and bandwidth, it holds only a small advantage over the others once they scale up. Our ray-pool based system is a simple but efficient solution for ray tracing. The proposed system requires fewer computation resources while keeping the programmability and performance needed for ray tracing.

Table 2.1: Brief comparison of hardware ray tracers

                     GTX680 [9]     RayCore   SGRT             PowerVR    Ours
                     + OptiX [10]                              GR6500
  Year               2013           2014      2013             2015       2015
  Process            28 nm          28 nm     65 nm            28 nm      40 nm
  Type               GPU            ASIC      GPU+ASIC         GPU+ASIC   GPU+ASIC
  Clock              1 GHz          500 MHz   ASIC: 500 MHz,   600 MHz    1 GHz
                                              MIMD: 1 GHz
  N cores            1536           1+4       4                128        8x4+?
  GFLOPS             3000           -         72.5x4           150        8x4
  BW (GB/s)          192.2          -         -                -          8
  Area (mm2)         294            18+1.6    25               -          16
  Mrays/s (primary)  432            193       184              300        100

Chapter 3

System Overview

In this chapter we give a brief description of the proposed "ray-pool based ray tracing system" (RPRT for short).

3.1 Main Concept

For real-time ray tracing hardware, how to divide the algorithm between programs and ASICs while keeping their cooperation highly efficient is an important topic. Looking at previous works, most of them come with a ray traversal ASIC, and many of them preserve some programmable processor parts.

We also adopt this idea of dividing the whole process between processors and a special ASIC. Since there are many different algorithms for ray tracing, it is better to keep some programmable parts for different applications. A programmable processor can also share the computation of other applications besides ray tracing if needed.

Meanwhile, across the different types of ray tracing algorithms, many share a similar process of finding ray-triangle intersections. Accelerating this intersection process is an important efficiency topic, so we dedicate a special ASIC to it while keeping support for different algorithms.

To make these two parts cooperate efficiently, we insert two pools between them. Through the arrangement and virtualization of the two pools, the two computational units can work simultaneously and independently, suffering little from stalls caused by each other.
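To make the decoupling concrete, the following is a minimal C++ sketch of one pool as a bounded FIFO; the type and member names (RayPool, Ray, push/pop) are illustrative assumptions, not the actual implementation:

#include <cstddef>
#include <queue>

struct Ray { float origin[3], dir[3]; int pixel_id; };  // assumed ray payload

// One pool: the Shader pushes rays, the Traverser pops them.
// As long as the pool is neither full nor empty, both units proceed
// without stalling each other.
class RayPool {
public:
    explicit RayPool(std::size_t capacity) : cap_(capacity) {}
    bool push(const Ray& r) {                 // called by the Shader side
        if (q_.size() >= cap_) return false;  // producer stalls only when full
        q_.push(r);
        return true;
    }
    bool pop(Ray& r) {                        // called by the Traverser side
        if (q_.empty()) return false;         // consumer stalls only when empty
        r = q_.front(); q_.pop();
        return true;
    }
    std::size_t size() const { return q_.size(); }  // read by the Controller
private:
    std::queue<Ray> q_;
    std::size_t cap_;
};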

Figure 3.1 Main concept of RPRT

3.2 System Architecture

Figure 3.2 shows the system architecture, followed by an illustration of its computational process.

Figure 3.2 System architecture of RPRT

Our system can be viewed as a new graphics card. First, the host CPU moves instructions (kernel code) and related data from main memory to the DRAMs of our system. After the contents are ready, the host CPU sends a starting signal with the address of the kernel code to launch. The Shader then starts running the corresponding kernel. At the end of the kernel, the generated rays are pushed into the Ray Pool, triggering the Traverser's task.

After the kernel finishes, the Controller decides the next kernel to launch based on the sizes of the two pools. The Controller may be the host CPU or an on-board embedded CPU. This kernel selection process is repeated many times to complete a single frame.
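As an illustration of this decision logic, here is a hedged C++ sketch; the kernel names and the selection heuristic are assumptions for the sake of example, not the exact policy of our Controller:

#include <cstddef>

// Hypothetical controller policy: RAY_GEN pushes new rays into the Ray
// Pool for the Traverser; SHADE consumes results waiting in the
// Intersection Pool.
enum class Kernel { RAY_GEN, SHADE };

Kernel choose_next_kernel(std::size_t ray_pool_size, std::size_t ray_pool_cap,
                          std::size_t is_pool_size) {
    // Avoid overflowing the Ray Pool; otherwise feed the emptier side,
    // keeping the Shader and the Traverser both busy.
    if (ray_pool_size >= ray_pool_cap) return Kernel::SHADE;
    return (is_pool_size >= ray_pool_size) ? Kernel::SHADE : Kernel::RAY_GEN;
}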

As for the memory system, since the two computational units (Shader and Traverser) work separately, we use two independent memory spaces and ports for them. Only the triangle data is duplicated for both.

The acceleration data structure for traversal, such as the BVH tree, needs to be pre-computed in our current system and moved to the Traverser's memory in advance. In the future, such a process could still be handled by cooperation between a Shader kernel and the host CPU.

3.3 Shader (Configurable Multiprocessor)

Our Shader is a self-designed configurable multiprocessor, which is the main contribution of this thesis. More details are discussed in Chapter 4.

Since there are currently few open resources for a general-purpose multiprocessor that can be easily configured and linked with self-defined special units in both a software simulator and a hardware implementation, we built one ourselves. Our configurable multiprocessor is a CUDA-like SIMT (single instruction multiple threads) processor, which has the benefits of being easier to program [11] and of hiding memory latency.

Our configurable multiprocessor can vary its numbers of threads (cores), warps, and compute units, and is written with many parameters in both the simulator and the hardware Verilog implementation, such as the pipeline stages of the ALUs and the combination of ALUs. It is also easy to plug in self-defined function units by following simple interfaces.

The following is the configuration of the multiprocessor used in the proposed typical RPRT. It has 4 compute units, each containing an 8-core ALU set with 8 warps, a 4KB instruction cache, an 8KB data cache, and 16KB of specialized local memory for ray tracing.
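This typical configuration can be expressed as a small parameter set; the C++ struct below is only an illustrative sketch of those parameters, with assumed field names:

// Illustrative parameter set for the typical RPRT Shader configuration.
struct ShaderConfig {
    unsigned compute_units  = 4;         // number of compute units
    unsigned cores_per_cu   = 8;         // SIMD ALU width per compute unit
    unsigned warps_per_cu   = 8;         // resident warps per compute unit
    unsigned icache_bytes   = 4 * 1024;  // L1 instruction cache
    unsigned dcache_bytes   = 8 * 1024;  // L1 data cache
    unsigned localmem_bytes = 16 * 1024; // specialized local memory for ray tracing
};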

3.4 Ray Traverser

Details of the ray traverser are not the main focus of this thesis. There is a lot of research on hardware traversers, many of which are compatible with our ray-pool based ray tracing system. We only assume that the traverser roughly matches our Shader in capability (both throughput and latency), at about 100 Mrays per second.

The Ray Traverser is a hardware ASIC designed for traversal and intersection. Traversal is the step that finds which triangles may lie on the path of the target ray. Recent traversal approaches arrange triangles into a data structure such as a BVH (bounding volume hierarchy) tree [12][13] according to spatial locality; the main process is hardware-accelerated parallel tree traversal. Intersection is the step that actually calculates the intersection point of the target ray and the candidate triangles suggested by the traversal step.

Take Chang's work [14] for example: it is a complete system for ray tracing, designed in a cell-based design flow under the UMC 90 nm process with about 9 mm2 of core area, operating at a clock rate of 125 MHz. It is capable of 125 M traversals and 62.5 M intersections per second, which suits our requirement. It traverses the BVH tree with packets of 64 rays, which also balances it against our Shader in the typical configuration of 8 cores and 8 warps (a batch of 64 threads per compute unit).

3.5 Software Simulation

We build the simulation environment on SystemC version 2.3.0. It is mainly designed as a component-assembly model with approximate timing accuracy in functionality: it is near cycle accurate in the Shader part and approximately timed elsewhere. As for communication between modules, transport inside our chip (Shader, Traverser, and the two pools) is approximate-timed, calculated from bandwidth, and the memory accesses from the Shader are approximate-timed too. The other communications, such as those with the host CPU and host memory, are untimed in our simulator. More details can be found in Section 5.2.
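For illustration, the sketch below shows how such an approximate-timed transport can be modeled in SystemC by converting payload size and an assumed bandwidth into a single wait; the module and constant names are hypothetical:

#include <systemc.h>

// Sketch of approximate-timed on-chip transport: the whole transfer is
// modeled as one wait whose length is payload size divided by an
// assumed bandwidth (not the simulator's real code).
SC_MODULE(PoolPort) {
    static constexpr double BYTES_PER_NS = 8.0;  // assumed on-chip bandwidth

    // Must be called from an SC_THREAD context, since it suspends.
    void transport(unsigned payload_bytes) {
        wait(sc_core::sc_time(payload_bytes / BYTES_PER_NS, sc_core::SC_NS));
    }

    SC_CTOR(PoolPort) {}
};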

Figure 3.3 Block diagram of software simulator

Figure 3.3 illustrates the main blocks of our simulator. The MP, Trav, RPool, and ISPool parts are the SystemC modules and functions of the configurable multiprocessor (Shader), ray traverser, ray pool, and intersection pool, respectively. The kernel-control functions of the host CPU are written in TB.cpp/h. Benchmark data and our self-made instruction generating tools are handled in Instr.

We perform all measurements and optimize our system based on this simulator, and we verify the Shader part with the Verilog implementation. Experiment details are shown in Chapter 5.

3.6 Hardware Implementation

We implement our configurable multiprocessor in Verilog HDL and verify it by logic synthesis using Synopsys Design Compiler with the TSMC 40 nm process. The designed working frequency is 1 GHz, and the cell area reported by the synthesis tool is 0.537 mm2 for a single compute unit under the configuration of 4 cores and 4 warps.

More information is shown in Chapter 6.

Chapter 4

Configurable Multiprocessor Simulation

In this chapter we propose a simple but complete SIMT multiprocessor, which is the main contribution of this work.

The following are some specifications of our processor. It runs at a clock rate of 1 GHz with a 4-stage pipeline, using 64-bit self-defined instructions and 32-bit data (both integer and single-precision floating point). The numbers of cores and warps are configurable, ranging from 1 to 32 as powers of 2 in each compute unit (a streaming multiprocessor in CUDA terms). Each thread has 16 general registers. The number of compute units is also configurable, ranging from 1 to 4. Each compute unit possesses an 8KB L1 data cache and a 4KB L1 instruction cache; both caches are 2-way set associative.

The original framework of this processor was established by Yu-Sheng Lin and designed as a GPGPU (general-purpose GPU). He drew up the basic architecture by referring to a paper [15] about GPGPU-Sim [16]. Our processor further extends Lin's work, making it more configurable, accurate, and concrete. It is near cycle accurate now and can be verified against the hardware implementation.

As a result, our processor is a CUDA-like SIMT multiprocessor. We share very similar notions of "threads, warps, blocks, kernels, and SMs" (we use the term "compute units" here) with the CUDA programming model [17].

4.1 Operating Flow

The operations of our processor are split into two parts (an instruction routine and a data routine), each with a few steps. We group a fixed number of threads (typically 8) into a warp as the basic unit of our operation flow. To keep things simple, we only discuss the operating flow of a single warp in Section 4.1 and defer the relationship between warps to Section 4.2.

The instruction routine comprises the functions that request the correct instruction to run. First, the instruction buffer updates its state and starts an instruction request. Second, it finds the top entry of the PC (program counter) stack and increases the PC to the next instruction address. Then it sends a request to the instruction fetch unit. The request chosen by the arbiter gets to consult the instruction cache. Last, the instruction is brought back to the instruction buffer.
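The routine can be summarized as the following C++ sketch of one pass for a single warp; the helper functions stand in for the I-cache and fetch arbiter and are assumptions, not the simulator's real code:

#include <cstdint>
#include <stack>

using Addr  = std::uint32_t;
using Instr = std::uint64_t;                  // 64-bit self-defined instruction

struct Warp {
    int id = 0;
    std::stack<Addr> pc_stack;                // top entry = current PC
    Instr fetched = 0;                        // stands in for the instruction buffer slot
};

inline bool  arbiter_grant(int /*warp_id*/) { return true; }  // stub fetch arbiter
inline Instr icache_read(Addr pc)           { return pc; }    // stub I-cache lookup

void instruction_routine(Warp& w) {
    Addr pc = w.pc_stack.top();               // 1-2. read the top PC entry...
    w.pc_stack.top() += 8;                    //      ...and advance it (64-bit instructions)
    if (arbiter_grant(w.id))                  // 3-4. granted requests consult the I-cache
        w.fetched = icache_read(pc);          // 5.   bring the instruction back to the buffer
}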

Figure 4.1 Instruction routine

On the other hand, the data routine comprises the functions that actually execute the instruction. Also starting from the instruction buffer, it consults the register file and the predicate register as the instruction indicates (decode). It then goes to the specific type of ALU according to the OP-code. Since other warps may want the same ALU, a dispatcher handles the traffic, and an arranger re-arranges the completed tasks back into warp order using the warp ID.

Figure 4.2 Data routine

For real hardware implementation, taking logic and flip-flop delays into account, we have the following pipeline stages: IBuf, PCStack, IArb, ICache, RegFile, Dispatch, ALUEx, WriBack.


Figure 4.3 Pipeline stages

The WriBack stage and the next IBuf stage can be done in the same cycle by properly tracking the execution state in the instruction buffer. When neither routine stalls (the request is chosen by the arbiter, the cache hits, and the ALU takes a single cycle), the two routines can switch tasks with no delay every 4 cycles.

In the baseline implementation there is no out-of-order instruction issuance, but there is still pipelining between different warps, so it is still possible to fill up the ALU pipeline. We leave out-of-order issuance as future work; a related algorithm such as scoreboarding could be added at the instruction buffer.

Furthermore, for simulation efficiency, the operating flow in our simulator is much "flatter". We only simulate the appearance of the data routine pipeline by stalling tasks in arrays (see Figure 4.4). Because instruction issuance is in-order with little overlap, there is no dependency problem inside each warp. The simulator performs instruction fetch, decode, register read/write, and the instruction's function all in the same cycle while still staying near cycle accurate.
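A minimal sketch of this flat model, assuming a per-warp ready-cycle record; the helper that executes an instruction and reports its cost is a placeholder:

#include <cstdint>
#include <vector>

// Flat timing model: a warp's whole instruction executes in one simulated
// cycle, and pipeline/latency effects only appear as a future ready_cycle.
struct FlatWarp {
    std::uint64_t ready_cycle = 0;            // warp is stalled until this cycle
};

inline std::uint64_t execute_next_instruction(FlatWarp&) {
    return 1;                                 // placeholder: op's cost in cycles
}

void tick(std::vector<FlatWarp>& warps, std::uint64_t now) {
    for (auto& w : warps) {
        if (w.ready_cycle > now) continue;    // still blocked by a previous op
        // Fetch, decode, register access, and execution "in the same cycle":
        std::uint64_t cost = execute_next_instruction(w);
        w.ready_cycle = now + cost;           // latency + pipeline modeled as a stall
    }
}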


Figure 4.4 Simulator model

4.2 Warp Scheduling Policy

To simplify the control unit, our warp scheduling policy is very similar to fine-grained multithreading. The dispatcher delivers as many warps to the ALUs as possible according to their OP-codes. For pipelined ALUs, every pipeline stage is filled with a different warp.

Arbitration among warps competing for the same ALU uses static priority. By staggering their progress, we avoid warps executing the same instruction and colliding at the dispatcher, and the scheme is also simple to implement.

This policy has the effect of hiding memory latency. Imagine that the warp with the highest priority (say w0) initially occupies the computation ALU. It may then hit a memory read instruction and stall. Meanwhile, the computation ALU is released to lower-priority warps, and so on as warps keep switching. By the time the original warp (w0) finishes its memory access, the computation resource has been kept busy without waste.
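A minimal sketch of the static-priority arbitration, under the assumption that per-warp readiness and ALU requests are tracked as flags; lower warp IDs win:

#include <vector>

// Returns the warp ID granted the ALU this cycle, or -1 if none.
// Static priority: warp 0 beats warp 1, and so on, which naturally
// staggers warp progress and lets low-priority warps fill the ALU
// while high-priority warps wait on memory.
int arbitrate(const std::vector<bool>& wants_alu,
              const std::vector<bool>& is_ready) {
    for (int w = 0; w < static_cast<int>(wants_alu.size()); ++w)
        if (wants_alu[w] && is_ready[w])
            return w;
    return -1;
}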

Figure 4.5 A detailed snapshot of the operating flow hiding memory latency (light green: memory unit; dark green: cycle blocker, i.e., pipeline or latency)

4.3 Types of ALU

In our configurable multiprocessor, ALUs can easily be plugged in or out and customized. The set is not limited to one computational ALU and one memory access unit; we also support multiple ALUs of the same type. If there are two integer ALUs, the dispatcher can issue two of the warps requiring that kind of ALU in a proper way.

It might be a little confusing how to count the number of "cores". Imagine an all-purpose ALU: the number of cores would simply equal the width of that SIMD (single instruction multiple data) ALU. In our machine, the different types of ALU can be thought of as a separation of that all-purpose ALU, so the number of cores in our multiprocessor can be estimated as the maximum SIMD ALU width across the different ALU types.


4.3.1 Framework and Typical ALUs

To realize the features mentioned above, we define a simple framework for these ALUs. It requires an "ALU type" constant signal, an "ALU is free for a new task next cycle" signal, and a "warp ID to free" register signal. With these signals, the dispatcher can count the available resources of each type. After a complicated multiplexer executes the dispatch policy, a customized ALU receives the decoded instruction with its register data (rs, rt), immediate data, and OP-code.

In our ray-pool based multiprocessor, we define the following ALUs: TidALU (for thread/block indexing and dimension control), BrhALU (for branch control such as predicates, jumps, and the function stack), MemALU (for memory access with cache), IntALU (for integer computation), FltALU (for floating-point computation), SlwALU (for special computation functions such as division and square root), and RayALU (for special ray tracing functions).

In the simulator, we define a virtual class for such ALUs. The default operating model is an OP-code dependent fixed latency plus a constant number of pipeline stages. The ALUs used in our case are configured as in Table 4.1, where "depends" means the ALU overrides the C++ virtual function for more detailed simulation. Those special ALUs are discussed later. The latency and pipeline configurations are illustrated in Figure 4.6.

Table 4.1: Latency and pipeline configuration of ALUs

  Types      Tid   Brh   Mem       Int   Flt   Slw   Ray
  Latency    1     1     depends   1     1     1     depends
  Pipeline   1     2     2         2     3     11    2

Figure 4.6 Latency and pipeline configuration illustration
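As an illustration of this framework, here is a C++ sketch of such a virtual ALU class, assuming the default latency-plus-pipeline model and the values from Table 4.1; all class and member names are illustrative:

#include <cstdint>

enum class AluType { Tid, Brh, Mem, Int, Flt, Slw, Ray };

// Default model: a fixed latency (OP-code dependent in general) plus a
// constant number of pipeline stages; "depends" ALUs override latency().
class Alu {
public:
    Alu(AluType type, unsigned latency, unsigned pipeline)
        : type_(type), latency_(latency), pipeline_(pipeline) {}
    virtual ~Alu() = default;

    AluType type() const { return type_; }   // the "ALU type" constant signal
    virtual unsigned latency(std::uint8_t /*opcode*/) const { return latency_; }
    unsigned pipeline_stages() const { return pipeline_; }
    // The "free for a new task next cycle" and "warp ID to free" signals
    // would be derived from these two numbers, as illustrated in Figure 4.6.
private:
    AluType  type_;
    unsigned latency_;
    unsigned pipeline_;
};

// Per Table 4.1, e.g. the floating-point and slow-function ALUs:
Alu flt_alu(AluType::Flt, /*latency=*/1, /*pipeline=*/3);
Alu slw_alu(AluType::Slw, /*latency=*/1, /*pipeline=*/11);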

4.3.2 Memory ALU

Both the data cache and memory can be mapped onto our ALU framework as a new kind of ALU. By dynamically handling the ALU latency and managing the "is free" and "warp ID to free" signals, we can simulate more complex behavior such as a non-blocking cache. The process can be broken into two parts.
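Building on the Alu base class sketched above, the following illustrates how a memory ALU could report a dynamic, hit-or-miss dependent latency; the miss cost and the lookup helper are assumed placeholders:

// Memory ALU with dynamic latency (sketch). A hit is cheap; a miss keeps
// the issuing warp's "warp ID to free" pending until the refill returns,
// while the unit itself stays free for other warps (non-blocking behavior).
class MemAlu : public Alu {
public:
    MemAlu() : Alu(AluType::Mem, /*latency=*/1, /*pipeline=*/2) {}

    unsigned latency(std::uint8_t opcode) const override {
        return cache_hit(opcode) ? 1 : MISS_LATENCY;
    }
private:
    static constexpr unsigned MISS_LATENCY = 40;        // assumed DRAM round trip
    bool cache_hit(std::uint8_t) const { return true; } // stub lookup
};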

(Illustration: an ALU configured as L4P3 has a latency of 4 with 3 pipeline stages; L1P4 means latency 1, fully pipelined with 4 stages. The "isFree" signal marks when a unit is able to receive a new task.)

Figure 4.7 Simulating non-blocking cache

The first part is the "is free" signal control. Though we want to simulate a non-blocking
