Organization - 具多齊質性處理器核心之多媒體串流處理架構

Chapter 1 Introduction

1.2 Organization

In the beginning of Chapter 2, the background and challenges about this thesis are introduced. Issues of programming models and system architectures for media applications are discussed.

Next come the details of the development roadmap and the proposed design is described. The development roadmap and micro-architectures of an ALU cluster, an ALU cluster Intellectual Property and Floating point units for the ALU cluster IP and the overview of the AMBA AHB protocol are described in Chapter 3.

In Chapter 4, implementation results of proposed designs are described. The verification and testing results are also introduced in this chapter. Then the performance evaluation and comparison are discussed in the last part of the chapter. In the last chapter of this thesis, Chapter 5, the conclusion and future work are summarized.

CHAPTER 2 Background and Challenges

This chapter begins with the discussion of conventional programming model for media application. Therefore two kinds of programming model are used to deal with data while the comparison between programming models will be introduced in next chapter. Meanwhile, different system architectures of implementing the design are presented and discussed. These issues discussed of the programming models and system architectures are the background of current research. This thesis is motivated from these issues of the background.

2.1 Issues of Programming Model

Traditionally, the media applications are processed by conventional programming model implemented in conventional general purpose processor architecture. As shown in Fig 2.1, conventional programming model read data from memory system for computation and write results back into memory system. The memory system of this processing model depends on caches, which is optimized for latency and data reuse. Remind the characteristics of media processing applications.

First, every stream is read exactly once, resulting in poor cache performance. Second, operating one data element is largely independent to others. It results in a large amount of data parallelism and high latency tolerance. Finally it can not support high ratio of computation to memory access. Above-mentioned issues show that large available parallelism, little data reuse and high computation to memory access ratio are cramped by the attributes of caches.

Another clincher is memory-processor communication bandwidth gap. As shown in Fig 2.2, the processor-memory performance gap reveals that the performance growth of memory is much slower than processor [8]. The phenomenon will cause more latency for memory access and communication between processor and memory

is more critical. And traditional memory system utilizes global structures to provide data bandwidth. It means that it cannot scale to multiple arithmetic logic units for high performance rates in media applications.

Fig 2.1 Conventional Programming Model

Fig 2.2 Processor-Memory Performance Gap

2.2 Issues of System Architecture

Generally speaking, there are many different system architectures when implementing a design. Three main system architectures of design methodology, such as application specified integrated circuit(ASIC), platform-based architecture and reconfigurable architecture will be briefly introduced in this section on the basis of time to market demands, programmability, flexibility and physical area, etc.

Following, the pros and cons of these system architectures are discussed. [9 - 11]

The application specified integrated circuit is the most commonly used in these architectures. The ASIC design principle is shown in Fig 2.3. The chip implementation could be finished very quickly as long as the well-defined specification is given. Overall function and performance, such like area, power consumption and operating frequency, are optimized for the specification required.

Thus long design cycles which include circuit design and the manufacture increase the investment risk. And design verifications and corrections also take a large amount of design effort. It raises investment risk also. In addition, the waste of logic resources and power dissipation for non-active hardware is another issue for the design methodology. Besides, let us consider the situation that the specifications are changed.

In this situation, it reveals the lack of flexibility for ASIC design. It also shows the lack of programmability and non-reusable in the architecture.

Fig 2.3 Application Specified Integrated Circuit Design

The platform-based architecture includes a processor, memory, communication bus and multiple functional hardware accelerators. It gains more flexibility than ASIC architecture from reusing existence intellectual property (IP), such as digital signal processor (DSP), baseband codec, audio applications accelerator and other functional blocks. The example of platform-based architecture is shown in Fig 2.4. Different IP blocks are added or removed to meet different application. The platform-based architecture provides a common communication bus for convenience to integrated different IP macros quickly. Different systems will be set up as fast as possible. It reduces the design and re-develops effort significantly. One more attractive thing is that these platforms have been set up with a developing baseboard. Many common IP and peripherals on the baseboard will benefit to fast prototyping. The existent OS of the baseboard can reduce the effort of connecting the real applications to development design. Current research such as [12], [13] and [14] are listed in the reference.

Thus there are some drawbacks in this architecture. The interface communicate each functional macro increases the overhead of whole system. Besides, the memory bandwidth is limited by the communication bus. These factors decrease the efficiency seriously. In the meantime, the power consumption should be increased when more IP blocks are included. The idle IP macros waste unnecessary power dissipation, too. As discussed above, the platform-based architecture is more flexible and programmable than ASIC design. Thus this architecture is still a task-oriented system. It can not be applied to any application using the same framework.

Fig 2.4 An Example of Platform-Based Architecture

The reconfigurable architecture, the third design methodology, is similar to platform-based architecture. As shown in Fig 2.5, there are multiple general processing elements in this architecture. These processing elements, or ALU cluster IP blocks, play the key role of operating data stream. A system of reconfigurable architecture is built up with a micro controller, a bus or a network on chip system and a well-hierarchical memory system. One advantage of this kind of architecture is the usages of hardware accelerator IP are reconfigurable. It provides a significant flexibility and programmability for different applications. Another advantage is the applications can be operated concurrently. It means that it provides the ability for parallel operation. Nevertheless, there exist some potential drawbacks in using the design methodology. First, without power management system the power dissipation of unused process elements can not be saved. Second, the reconfigurable architecture could not match the above-introduced characteristic very well since the bandwidth of communication bus is insufficient and data transfer bottleneck encounters between process elements and memory system. The efficient memory hierarchy system is needed to solve the performance degradation. Current research such as [15], [16] and [17] are referenced in the bibliography.

Fig 2.5 A Diagram of Reconfigurable Architecture

In conclusion, one of these system architectures can be selected to implement the design trading off between pros and cons addressed above. These pros and cons corresponding to characteristics of media applications are summarized in the Table 2.1 listed below. Thus, any one of them adopted alone suffers from some drawbacks and can not meet the application of media processing very well. Consequently, the proposed design will be addressed and discussed in later section. It must resolves these issues.

Table 2.1 System Architecture vs. media application

ASIC Platform-Based

will be limited by bus

CHAPTER 3 Development Roadmap and Proposed Design

Base on the previous two chapters, the background, challenges are addressed.

Then the developmental roadmap and the proposed design of this thesis will discussed in this chapter.

The first section is the developmental roadmap of this thesis. It introduces the motivation to propose these designs after the issues of Chapter 2 are discussed.

Afterwards the streaming programming model and developmental roadmap overcome the issues mentioned above are introduced.

The second section is the demonstration of previous design of ALU cluster.

Review the architecture of the ALU cluster, the key processing element of ALU cluster intellectual property. Through testing the manufactured chip, the results confirm the correctness of functionality and the architecture is not only feasible but also efficient for media applications. The latter part of the description of this paragraph will be introduced clearly in next chapter.

The third part of this chapter is the description of the designed ALU cluster intellectual property. This design is an integration of the improved ALU cluster and the AMBA AHB slave interface. The improved ALU cluster is based on the ALU cluster discussed in previous section. The detail architectures and overview of AMBA AHB slave protocol are introduced in this section.

Eventually the forth section of this chapter is the description of the designed floating point operation units for ALU cluster IP. It will bring up an idea to integrate the floating point unit in the ALU cluster IP. The design consideration of the floating point units are described in the section. The details of the design are summarized.

3.1 Developmental Roadmap 3.1.1 Motivation

Issues of programming models and system architectures encountered in modern media processing system are addressed in Section 2.1 and Section 2.2. These drawbacks make the media applications handled inefficiently. The situation is more and more serious when great deals of media applications are applied in portable systems.

The media streaming architecture with homogeneous processor cores, a turnkey solution for media applications, is proposed in this thesis. This system intends to provide several cons of processing data stream. First it provides highly parallel computing ability so that multiple processing elements are needed. Performance improvement of media applications is achieved because of exploiting the large available parallelism inherence of media process. Subsequently, the communication bandwidth bottleneck discussed above has to solve. An efficient hierarchy of memory system is needed to expose the characteristic of little global data reuse and high computation to memory access ratio in media applications. Therefore, a reconfigurable hardware accelerator is built as a processing element to form the media streaming architecture with homogeneous processor cores in this thesis.

3.1.2 Roadmap

As mentioned above in the previous chapter, the conventional programming model is not suitable for these applications. So the stream programming model is adopted in this thesis. We will discuss the stream programming model below.

Following the developmental roadmap including system architecture is described in this chapter later.

3.1.2.1 Stream Programming Model

In the stream programming model, data is aligned in order as a stream. Streams are arbitrary data type. Operations are applied on entire streams. These operations perform computations, stream transfers, loads and stores etc. in the programming model. Nodes which carry out these operations are called kernels. They perform computation, such as a function, to each element of whole data streams. Kernels input

one or more data streams to operate and output one or more data streams as outputs.

These kernels only operate on local data and may not make arbitrary memory references.

After introducing streams, operations and kernels in the stream programming model, the structure of the model are depicted in Fig 3.1. As shown in the diagram, the stream programming model handles data by chaining operations together and makes data passing through kernels. Two example of dealing with applications using stream programming model are shown in Fig 3.2 and Fig 3.3 respectively. One is 1024-points complex radix-2 Fast Fourier Transform (FFT), a popular operation in multimedia processing [18]. Another is the example of image processing, the Stereo Depth Extraction, commonly used in modern image and medical diagnosing system [3].

Fig 3.1 Stream Programming model

Fig 3.2 Example of 1024-points radix-2 Fast Fourier Transform

Fig 3.3 Example of Stereo Depth Extraction

Remind the characteristic of the conventional programming model discussed in Section 2.1. Comparison between the stream programming model and the conventional programming model, some obvious pros are revealed when expressing media applications in the stream programming model. Corresponding three features of media process such as little data reuse, high large available parallelism and computation to memory access ratio, Pros are described briefly below. First, multiple kernels exploit the inherent parallelism feature. Second, data streams produced at the end of one kernel will consume at the next kernel makes the programming model fit

the feature of little data reuse. Finally the high computation to memory access ratio results from the minimization global memory usage in the stream programming model.

Features compare between these two programming models are listed in Table 3.1.

Table 3.1 Comparison between programming models

Conventional Programming

Model Stream Programming Model Large available

parallelism One central processor unit Multiple kernels Little data reuse Traditional caches are

ineffective

Data produced at the end of one kernel will consume at the

next kernel High

computation to memory access

ratio

Each element reference the off-chip memory

Minimize global memory usage

The programming model in this thesis was adopted the stream programming model. Because of the features mentioned above, it is suitable for processing system aimed at multimedia processing applications.

3.1.2.2 Developmental Roadmap

The developmental roadmap proposed in this section provides a suitable solution to process applications to conform the requirement expected. A sketch map of proposed development roadmap is illustrated in Fig 3.4. In this roadmap, five steps are segmented to build the media streaming architecture with homogeneous processor cores. As illustrated in the Fig 3.4, the proposed system gains the advantages from platform-based architecture and reconfigurable architecture. The mixture of system architectures are utilized as the structure of proposed system to overcome the challenge of issues deriving from using these system architectures singly. And the whole processing system provides an efficient processing system to get over issues of programming models and system architectures.

ALU cluster IP

Descriptions of these five steps are introduced briefly in this section. First, the leftmost chip photo, implemented in UMC 0.18um CMOS technology, in Fig 3.4 is the prototype 1 of ALU cluster [19] [20] . This is called the first step of proposed developmental roadmap. The chip had been measured and verified. As a processing element, the measurement results demonstrate that the architecture and functionality is suitable and correct for the applications.

By the right side of prototype 1, the ALU cluster IP with AMBA AHB interface, the prototype 2, was designed and taped out using TSMC 0.15um CMOS technology.

The layout and die photo with package are depicted in the development roadmap respectively. This step verifies the designed ALU cluster IP to be suitable as a reconfigurable hardware accelerator in media applications. In addition to this purpose, the interface obeyed the AHB slave bus provides a common communication bridge between these ALU cluster IPs and micro-controller used to manage whole media streaming system. These features make the proposed prototype 2 have ability as a processing element to be applied in the media streaming system. It is a significant feature of the second step in developmental roadmap.

In modern multimedia applications, the floating point operations occupy a large percentage of the computation amount. These floating point operations stand a key role of performance and power consumption in whole applications. A floating point unit is essential in performance improvement if the budget of power consumption and logic resources are agreed. Consequently, an ALU cluster IP supported floating point operation is a requirement. As shown in the middle of Fig 3.4, an ALU cluster IP with floating point supported are introduced. The hard macro of floating point unit is designed and implemented using TSMC 0.18um CMS technology. The combinations of ALU cluster IP and the hard macro provides efficient computing ability to handle floating point operations. It is the third generation in the described developmental roadmap.

Following the forth step in Fig 3.4 is discussed in this paragraph. In this step, system integration is preliminary started. An ALU cluster IP with compatible board for logic tile connector will be stacked into the RealView versatile platform baseboard for ARM926EJ-S [21] [22]. In this baseboard, the ALU cluster IP is verified whether the AHB bus interface fits the bus protocol. One ALU cluster IP stacked in the board are also made sure that the IP has the ability as a hardware accelerator integrated with the platform.

Finally multiple processing elements, ALU cluster IPs, will be combined with the versatile baseboard to form a media streaming architecture with homogeneous processor cores. The proposed system matches the features of media applications such as large available parallelism, little data reuse and high computation to memory access ratio. Suitable benchmarks are ported into this processing system and compare with each other. These applications include some popular operations in multimedia applications and MIMO-OFDM system. The finite impulse response (FIR) filter system and Fast Fourier Transform (FFT) are selected as benchmarks in media applications. And the key operations of MIMO-OFDM system, such as matrix inversion and Gram-Schmitt process, are selected as benchmarks, too [23]. Adoption of stream programming model and mixture of system architectures make the media streaming system become a turkey solution for modern multimedia processing requirement.

Five paragraphs discussed above are introduced the proposed developmental roadmap compendiously. The details of these steps, such as micro architectures, implementation results and etc., are discussed and described in the following sections and chapters respectively.

3.2 An ALU cluster

The briefly description of the previous design, an ALU cluster, included the micro architecture is described. The ALU cluster is called prototype 1 in the above-presented developmental roadmap. It is convincible that it can handle media applications expectedly.

3.2.1 Micro-Architecture of an ALU cluster

As the major part for handling the media processing, an ALU cluster includes five arithmetic units, supporting to process the parallel data concurrently. As shown in Fig 3.5, they are two ALUs, two multipliers and one divider. Large amount of digital signal processing application are suitable for porting in the architecture with mixture of arithmetic units. There is also one scratch pad register file (SPRF), ten banks of intra register file (IRF), a controller and a decoder.

Fig 3.5 Micro-Architecture of ALU cluster

There are thirteen instructions can be executed by the ALU, such as ADD, SUB, ABS, AND, OR, XOR, NOT, SLL, SRL, SRA, LT, GT, EQ. The adder and comparator adopted for ALUs are the carry-lookahead architecture with two stage pipeline. Booth encoding architecture was adopted for our four stage pipeline multiplier. This multiplier can carry out multiplication. The last arithmetic unit is the divider. It performs the division operation that gets the quotient and remainder and calculates the square root. The designed divider in an ALU cluster is not the key kernel about performance concerned so that this unit is not pipelined and considered to shrink the logic resource by increasing latencies of operation.

As mentioned above, in the ALU cluster there are IRFs embedded for each operation units. These intra register files are local to themselves arithmetic units. The purpose of IRFs is to provide an efficient memory bandwidth for arithmetic units. In other words, these arithmetic units would not waste the precious global memory

在文檔中具多齊質性處理器核心之多媒體串流處理架構 (頁 16-0)