Chapter 2 Background

2.1 Stream processing

Stream processors are fully programmable processors aimed at media applications. Media applications, including signal processing, image and video processing, and graphics, are well suited to a stream processor such as Imagine [5] because they possess four key attributes [6]:

High Computation Rate: Many media applications require billions to tens of billions of arithmetic operations per second to achieve real-time performance.

High Computation to Memory Ratio: Structuring media applications as stream programs exposes their locality, allowing implementations to minimize global memory usage. As a result, stream programs tend to achieve a high computation to memory ratio: most media applications perform tens to hundreds of arithmetic operations for each necessary memory reference.

Producer-Consumer Locality with Little Global Data Reuse: The typical data reference pattern in media applications requires a single read and a single write per global data element. Little global reuse means that traditional caches are largely ineffective in these applications. Intermediate results are produced at the end of one computation stage and consumed at the beginning of the next stage.

Parallelism: Media applications exhibit instruction-level, data-level, and task-level parallelism.

Media applications operate on streams of low-precision data, have abundant data parallelism, rarely reuse global data, and perform tens to hundreds of operations per global data reference. These stream programs map easily and efficiently to the data bandwidth hierarchy of the stream micro-architecture. The data parallelism inherent in media applications allows a single instruction to control multiple arithmetic units and allows intermediate data to be localized to small clusters of units, significantly reducing communication demands. Data from one processing kernel are forwarded directly to the next kernel, which localizes data communication and avoids reuse of global data. Furthermore, the computation demands of these applications can be satisfied by keeping intermediate data close to the arithmetic units rather than in memory.

2.1.1 Stream programming model

The stream processor executes applications that have been mapped to the stream programming model [7]. This programming model organizes the computation in an application into a sequence of arithmetic kernels, and organizes the data flow into a series of data streams. The data streams are ordered, finite-length sequences of data records of an arbitrary type (although all the records in one stream are of the same type). The inputs and outputs of kernels are data streams. Streams passing among multiple computation kernels form a stream program. The only non-local data a kernel can reference at any time are the current head elements of its input streams and the current tail elements of its output streams. In the stream programming model, locality and concurrency are exposed both within a kernel and between kernels.
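The kernel and stream structure described above can be sketched in a few lines of Python. This is only an illustration of the model, not Imagine's actual programming interface: the kernel names `scale_kernel` and `offset_kernel`, their parameters, and the use of generators to stand in for streams are all assumptions made for this example.

```python
from typing import Iterable, Iterator, List


def scale_kernel(in_stream: Iterable[float], gain: float) -> Iterator[float]:
    """Hypothetical kernel: read the head of the input stream, write the output tail."""
    for record in in_stream:
        yield record * gain


def offset_kernel(in_stream: Iterable[float], bias: float) -> Iterator[float]:
    """Hypothetical kernel: consumes the stream produced by the previous kernel."""
    for record in in_stream:
        yield record + bias


def run_stream_program(data: List[float]) -> List[float]:
    """Chain two kernels; the intermediate stream is never stored as global data."""
    scaled = scale_kernel(data, gain=2.0)       # stream: input -> kernel 1
    shifted = offset_kernel(scaled, bias=1.0)   # stream: kernel 1 -> kernel 2
    return list(shifted)                        # final output stream


print(run_stream_program([1.0, 2.0, 3.0]))
```

Each kernel touches only the next record of its input and appends to its output, so the intermediate stream between the two kernels exhibits the producer-consumer locality that the model is meant to expose.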

Figure 2.1 shows the mapping of a radix-2 FFT [7] to the stream model. Each oval in the figure corresponds to the execution of a kernel, while each arrow represents a data stream transfer. In the stream implementation, each kernel requires two input streams and one output stream. The output of the last kernel is in bit-reversed order, so it must be reordered in memory. In the FFT, only the data elements passed between kernels need to access the stream register file (SRF), and only the initial input data and final output data need to access the global memory space in DRAM.
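As a rough illustration of this mapping, the sketch below implements a radix-2 decimation-in-frequency FFT as a chain of butterfly kernels followed by a bit-reversal reordering. It is not the implementation from [7]: the function names, the gather and scatter steps in `fft_stage`, and the choice to pass the twiddle factors as an extra kernel argument are assumptions made to keep the example short.

```python
import cmath


def butterfly_kernel(upper, lower, twiddles):
    """Consume two input streams and emit one output stream of butterfly results."""
    out = []
    for a, b, w in zip(upper, lower, twiddles):
        out.append(a + b)        # sum half of the butterfly
        out.append((a - b) * w)  # twiddled difference half
    return out


def fft_stage(data, block):
    """Gather two input streams for one stage, run the kernel, scatter its output."""
    half = block // 2
    pairs = [(start + j, start + j + half, j)
             for start in range(0, len(data), block)
             for j in range(half)]
    upper = [data[i] for i, _, _ in pairs]
    lower = [data[k] for _, k, _ in pairs]
    twiddles = [cmath.exp(-2j * cmath.pi * j / block) for _, _, j in pairs]
    out_stream = butterfly_kernel(upper, lower, twiddles)
    result = list(data)
    for n, (i, k, _) in enumerate(pairs):
        result[i] = out_stream[2 * n]
        result[k] = out_stream[2 * n + 1]
    return result


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    rev = 0
    for _ in range(bits):
        rev = (rev << 1) | (index & 1)
        index >>= 1
    return rev


def stream_fft(samples):
    """Run log2(n) butterfly stages, then undo the bit-reversed output order."""
    n = len(samples)
    bits = n.bit_length() - 1
    if n == 0 or (1 << bits) != n:
        raise ValueError("length must be a power of two")
    data = [complex(v) for v in samples]
    block = n
    while block >= 2:
        data = fft_stage(data, block)
        block //= 2
    return [data[bit_reverse(k, bits)] for k in range(n)]


print(stream_fft([1, 2, 3, 4]))
```

In this sketch the list returned by each `fft_stage` call plays the role of a stream held in the SRF between kernels, while the input `samples` and the reordered result correspond to the data that would reside in DRAM.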

Figure 2.1 Stream and kernel representation

Applications that are more involved than the FFT example map to the stream model in a similar fashion. Examples can be found in other references: Khailany et al. discuss the mapping of a stereo depth extractor [5]; Rixner discusses the mapping of an MPEG-2 encoder [8]; and Owens et al. discuss the mapping of a polygon rendering pipeline [9].

The stream model is important because it organizes an application to expose the locality and parallelism information that is inherent in the application.

2.1.2 Stream micro-architecture

The stream processor is a hardware micro-architecture designed to implement the stream programming model. Imagine, designed by the Computer Systems Laboratory at Stanford University, is a stream processor whose block diagram is shown in Figure 2.2 [14]. The core of Imagine is a 128 KB stream register file (SRF). The SRF is connected to 8 SIMD-controlled, VLIW-like arithmetic clusters driven by a microcontroller, a memory system interface to off-chip DRAM, and a network interface that connects to other nodes of a multi-Imagine system. All modules are controlled by an on-chip stream controller under the direction of an external host processor.

Figure 2.2 Stream processor block diagram

The working set of streams is located in the SRF. Stream loads and stores occur between the memory system and the SRF; network sends and receives occur between the network interface and the SRF. The SRF also provides the stream inputs to kernels and stores their stream outputs.

The kernels are executed in the 8 arithmetic clusters. Each cluster contains several functional units (which can exploit instruction-level parallelism) fed by distributed local register files. The 8 clusters (which can exploit data-level parallelism) are controlled by the microcontroller, which supplies the same instruction stream to each cluster. On Imagine, streams are implemented as contiguous blocks of memory in the SRF or in off-chip memory. Kernels are implemented as programs run on the arithmetic clusters.
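The SIMD control of the clusters can be illustrated with a small Python model. This is a sketch under stated assumptions rather than a description of Imagine's hardware: the function `run_kernel_simd`, the representation of a kernel as a list of Python callables, and the assignment of consecutive stream records to consecutive clusters are all hypothetical.

```python
NUM_CLUSTERS = 8  # number of arithmetic clusters, as stated in the text


def run_kernel_simd(kernel_ops, stream):
    """Apply the same instruction sequence to every cluster, 8 records at a time."""
    out = []
    for base in range(0, len(stream), NUM_CLUSTERS):
        # Assumption: each cluster receives one record from the current block.
        lanes = stream[base:base + NUM_CLUSTERS]
        for op in kernel_ops:
            # The microcontroller broadcasts one instruction to all clusters.
            lanes = [op(value) for value in lanes]
        out.extend(lanes)
    return out


# Hypothetical two-instruction kernel: multiply by 2, then add 1.
kernel = [lambda v: v * 2, lambda v: v + 1]
print(run_kernel_simd(kernel, list(range(10))))
```

Every cluster executes the same operation in the same step, so a single instruction stream is enough to exploit the data-level parallelism across the 8 clusters.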

The memory system, the SRF, and the local register files within the clusters form a three-level bandwidth hierarchy that matches the characteristic behavior of media applications.

2.1.3 An example of a stream processor: Imagine

Imagine is a programmable, general-purpose stream processor and a hardware implementation of the stream model. Its micro-architecture is built around the concept of streams and is organized to take advantage of the locality and parallelism inherent in media applications. A block diagram of the micro-architecture is shown in Figure 2.2.

Imagine contains 48 ALUs [12] and a three-level memory hierarchy designed to keep the functional units saturated during stream processing. The three-tiered data bandwidth hierarchy consists of a stream memory system (2 GB/s), a global stream register file (32 GB/s), and a set of local distributed register files located near the arithmetic units (544 GB/s). The 128 KB SRF at the center of the bandwidth hierarchy not only provides intermediate storage for data streams but also enables additional stream clients, such as a streaming network interface, to be modularly connected to Imagine. A single microcontroller broadcasts cluster instructions in SIMD fashion to all of the arithmetic clusters. Each of Imagine's 8 arithmetic clusters consists of 6 functional units: 3 adders, 2 multipliers, and a divide/square-root unit. These units are controlled by statically scheduled cluster instructions issued by the microcontroller.
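The figures quoted above can be checked with a short calculation. The sketch below uses only the numbers given in the text; the dictionary and function names are chosen for this example.

```python
# Peak bandwidth of each level of the hierarchy, in GB/s (values from the text).
BANDWIDTH_GB_PER_S = {"memory system": 2, "SRF": 32, "local register files": 544}

NUM_CLUSTERS = 8
UNITS_PER_CLUSTER = {"adder": 3, "multiplier": 2, "divide/square-root": 1}


def bandwidth_ratios(levels, baseline="memory system"):
    """Express each level's bandwidth as a multiple of the baseline level."""
    base = levels[baseline]
    return {name: bw / base for name, bw in levels.items()}


def total_functional_units(clusters, units_per_cluster):
    """Total number of functional units across all clusters."""
    return clusters * sum(units_per_cluster.values())


ratios = bandwidth_ratios(BANDWIDTH_GB_PER_S)
total_units = total_functional_units(NUM_CLUSTERS, UNITS_PER_CLUSTER)
print(ratios)
print(total_units)
```

The 8 clusters of 6 units give the 48 ALUs quoted above, and the three levels of the hierarchy stand in a bandwidth ratio of 1 : 16 : 272 from the memory system to the local register files.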
