• 沒有找到結果。

Issues of System Architecture

Chapter 2 Background and Challenges

2.2 Issues of System Architecture

Generally speaking, there are many different system architectures when implementing a design. Three main system architectures of design methodology, such as application specified integrated circuit(ASIC), platform-based architecture and reconfigurable architecture will be briefly introduced in this section on the basis of time to market demands, programmability, flexibility and physical area, etc.

Following, the pros and cons of these system architectures are discussed. [9 - 11]

The application specified integrated circuit is the most commonly used in these architectures. The ASIC design principle is shown in Fig 2.3. The chip implementation could be finished very quickly as long as the well-defined specification is given. Overall function and performance, such like area, power consumption and operating frequency, are optimized for the specification required.

Thus long design cycles which include circuit design and the manufacture increase the investment risk. And design verifications and corrections also take a large amount of design effort. It raises investment risk also. In addition, the waste of logic resources and power dissipation for non-active hardware is another issue for the design methodology. Besides, let us consider the situation that the specifications are changed.

In this situation, it reveals the lack of flexibility for ASIC design. It also shows the lack of programmability and non-reusable in the architecture.

Fig 2.3 Application Specified Integrated Circuit Design

The platform-based architecture includes a processor, memory, communication bus and multiple functional hardware accelerators. It gains more flexibility than ASIC architecture from reusing existence intellectual property (IP), such as digital signal processor (DSP), baseband codec, audio applications accelerator and other functional blocks. The example of platform-based architecture is shown in Fig 2.4. Different IP blocks are added or removed to meet different application. The platform-based architecture provides a common communication bus for convenience to integrated different IP macros quickly. Different systems will be set up as fast as possible. It reduces the design and re-develops effort significantly. One more attractive thing is that these platforms have been set up with a developing baseboard. Many common IP and peripherals on the baseboard will benefit to fast prototyping. The existent OS of the baseboard can reduce the effort of connecting the real applications to development design. Current research such as [12], [13] and [14] are listed in the reference.

Thus there are some drawbacks in this architecture. The interface communicate each functional macro increases the overhead of whole system. Besides, the memory bandwidth is limited by the communication bus. These factors decrease the efficiency seriously. In the meantime, the power consumption should be increased when more IP blocks are included. The idle IP macros waste unnecessary power dissipation, too. As discussed above, the platform-based architecture is more flexible and programmable than ASIC design. Thus this architecture is still a task-oriented system. It can not be applied to any application using the same framework.

Fig 2.4 An Example of Platform-Based Architecture

The reconfigurable architecture, the third design methodology, is similar to platform-based architecture. As shown in Fig 2.5, there are multiple general processing elements in this architecture. These processing elements, or ALU cluster IP blocks, play the key role of operating data stream. A system of reconfigurable architecture is built up with a micro controller, a bus or a network on chip system and a well-hierarchical memory system. One advantage of this kind of architecture is the usages of hardware accelerator IP are reconfigurable. It provides a significant flexibility and programmability for different applications. Another advantage is the applications can be operated concurrently. It means that it provides the ability for parallel operation. Nevertheless, there exist some potential drawbacks in using the design methodology. First, without power management system the power dissipation of unused process elements can not be saved. Second, the reconfigurable architecture could not match the above-introduced characteristic very well since the bandwidth of communication bus is insufficient and data transfer bottleneck encounters between process elements and memory system. The efficient memory hierarchy system is needed to solve the performance degradation. Current research such as [15], [16] and [17] are referenced in the bibliography.

Fig 2.5 A Diagram of Reconfigurable Architecture

In conclusion, one of these system architectures can be selected to implement the design trading off between pros and cons addressed above. These pros and cons corresponding to characteristics of media applications are summarized in the Table 2.1 listed below. Thus, any one of them adopted alone suffers from some drawbacks and can not meet the application of media processing very well. Consequently, the proposed design will be addressed and discussed in later section. It must resolves these issues.

Table 2.1 System Architecture vs. media application

ASIC Platform-Based

will be limited by bus

CHAPTER 3

Development Roadmap and Proposed Design

Base on the previous two chapters, the background, challenges are addressed.

Then the developmental roadmap and the proposed design of this thesis will discussed in this chapter.

The first section is the developmental roadmap of this thesis. It introduces the motivation to propose these designs after the issues of Chapter 2 are discussed.

Afterwards the streaming programming model and developmental roadmap overcome the issues mentioned above are introduced.

The second section is the demonstration of previous design of ALU cluster.

Review the architecture of the ALU cluster, the key processing element of ALU cluster intellectual property. Through testing the manufactured chip, the results confirm the correctness of functionality and the architecture is not only feasible but also efficient for media applications. The latter part of the description of this paragraph will be introduced clearly in next chapter.

The third part of this chapter is the description of the designed ALU cluster intellectual property. This design is an integration of the improved ALU cluster and the AMBA AHB slave interface. The improved ALU cluster is based on the ALU cluster discussed in previous section. The detail architectures and overview of AMBA AHB slave protocol are introduced in this section.

Eventually the forth section of this chapter is the description of the designed floating point operation units for ALU cluster IP. It will bring up an idea to integrate the floating point unit in the ALU cluster IP. The design consideration of the floating point units are described in the section. The details of the design are summarized.

3.1 Developmental Roadmap 3.1.1 Motivation

Issues of programming models and system architectures encountered in modern media processing system are addressed in Section 2.1 and Section 2.2. These drawbacks make the media applications handled inefficiently. The situation is more and more serious when great deals of media applications are applied in portable systems.

The media streaming architecture with homogeneous processor cores, a turnkey solution for media applications, is proposed in this thesis. This system intends to provide several cons of processing data stream. First it provides highly parallel computing ability so that multiple processing elements are needed. Performance improvement of media applications is achieved because of exploiting the large available parallelism inherence of media process. Subsequently, the communication bandwidth bottleneck discussed above has to solve. An efficient hierarchy of memory system is needed to expose the characteristic of little global data reuse and high computation to memory access ratio in media applications. Therefore, a reconfigurable hardware accelerator is built as a processing element to form the media streaming architecture with homogeneous processor cores in this thesis.

3.1.2 Roadmap

As mentioned above in the previous chapter, the conventional programming model is not suitable for these applications. So the stream programming model is adopted in this thesis. We will discuss the stream programming model below.

Following the developmental roadmap including system architecture is described in this chapter later.

3.1.2.1 Stream Programming Model

In the stream programming model, data is aligned in order as a stream. Streams are arbitrary data type. Operations are applied on entire streams. These operations perform computations, stream transfers, loads and stores etc. in the programming model. Nodes which carry out these operations are called kernels. They perform computation, such as a function, to each element of whole data streams. Kernels input

one or more data streams to operate and output one or more data streams as outputs.

These kernels only operate on local data and may not make arbitrary memory references.

After introducing streams, operations and kernels in the stream programming model, the structure of the model are depicted in Fig 3.1. As shown in the diagram, the stream programming model handles data by chaining operations together and makes data passing through kernels. Two example of dealing with applications using stream programming model are shown in Fig 3.2 and Fig 3.3 respectively. One is 1024-points complex radix-2 Fast Fourier Transform (FFT), a popular operation in multimedia processing [18]. Another is the example of image processing, the Stereo Depth Extraction, commonly used in modern image and medical diagnosing system [3].

Fig 3.1 Stream Programming model

Fig 3.2 Example of 1024-points radix-2 Fast Fourier Transform

Fig 3.3 Example of Stereo Depth Extraction

Remind the characteristic of the conventional programming model discussed in Section 2.1. Comparison between the stream programming model and the conventional programming model, some obvious pros are revealed when expressing media applications in the stream programming model. Corresponding three features of media process such as little data reuse, high large available parallelism and computation to memory access ratio, Pros are described briefly below. First, multiple kernels exploit the inherent parallelism feature. Second, data streams produced at the end of one kernel will consume at the next kernel makes the programming model fit

the feature of little data reuse. Finally the high computation to memory access ratio results from the minimization global memory usage in the stream programming model.

Features compare between these two programming models are listed in Table 3.1.

Table 3.1 Comparison between programming models

Conventional Programming

Model Stream Programming Model Large available

parallelism One central processor unit Multiple kernels Little data reuse Traditional caches are

ineffective

Data produced at the end of one kernel will consume at the

next kernel High

computation to memory access

ratio

Each element reference the off-chip memory

Minimize global memory usage

The programming model in this thesis was adopted the stream programming model. Because of the features mentioned above, it is suitable for processing system aimed at multimedia processing applications.

3.1.2.2 Developmental Roadmap

The developmental roadmap proposed in this section provides a suitable solution to process applications to conform the requirement expected. A sketch map of proposed development roadmap is illustrated in Fig 3.4. In this roadmap, five steps are segmented to build the media streaming architecture with homogeneous processor cores. As illustrated in the Fig 3.4, the proposed system gains the advantages from platform-based architecture and reconfigurable architecture. The mixture of system architectures are utilized as the structure of proposed system to overcome the challenge of issues deriving from using these system architectures singly. And the whole processing system provides an efficient processing system to get over issues of programming models and system architectures.

ALU cluster IP

Descriptions of these five steps are introduced briefly in this section. First, the leftmost chip photo, implemented in UMC 0.18um CMOS technology, in Fig 3.4 is the prototype 1 of ALU cluster [19] [20] . This is called the first step of proposed developmental roadmap. The chip had been measured and verified. As a processing element, the measurement results demonstrate that the architecture and functionality is suitable and correct for the applications.

By the right side of prototype 1, the ALU cluster IP with AMBA AHB interface, the prototype 2, was designed and taped out using TSMC 0.15um CMOS technology.

The layout and die photo with package are depicted in the development roadmap respectively. This step verifies the designed ALU cluster IP to be suitable as a reconfigurable hardware accelerator in media applications. In addition to this purpose, the interface obeyed the AHB slave bus provides a common communication bridge between these ALU cluster IPs and micro-controller used to manage whole media streaming system. These features make the proposed prototype 2 have ability as a processing element to be applied in the media streaming system. It is a significant feature of the second step in developmental roadmap.

In modern multimedia applications, the floating point operations occupy a large percentage of the computation amount. These floating point operations stand a key role of performance and power consumption in whole applications. A floating point unit is essential in performance improvement if the budget of power consumption and logic resources are agreed. Consequently, an ALU cluster IP supported floating point operation is a requirement. As shown in the middle of Fig 3.4, an ALU cluster IP with floating point supported are introduced. The hard macro of floating point unit is designed and implemented using TSMC 0.18um CMS technology. The combinations of ALU cluster IP and the hard macro provides efficient computing ability to handle floating point operations. It is the third generation in the described developmental roadmap.

Following the forth step in Fig 3.4 is discussed in this paragraph. In this step, system integration is preliminary started. An ALU cluster IP with compatible board for logic tile connector will be stacked into the RealView versatile platform baseboard for ARM926EJ-S [21] [22]. In this baseboard, the ALU cluster IP is verified whether the AHB bus interface fits the bus protocol. One ALU cluster IP stacked in the board are also made sure that the IP has the ability as a hardware accelerator integrated with the platform.

Finally multiple processing elements, ALU cluster IPs, will be combined with the versatile baseboard to form a media streaming architecture with homogeneous processor cores. The proposed system matches the features of media applications such as large available parallelism, little data reuse and high computation to memory access ratio. Suitable benchmarks are ported into this processing system and compare with each other. These applications include some popular operations in multimedia applications and MIMO-OFDM system. The finite impulse response (FIR) filter system and Fast Fourier Transform (FFT) are selected as benchmarks in media applications. And the key operations of MIMO-OFDM system, such as matrix inversion and Gram-Schmitt process, are selected as benchmarks, too [23]. Adoption of stream programming model and mixture of system architectures make the media streaming system become a turkey solution for modern multimedia processing requirement.

Five paragraphs discussed above are introduced the proposed developmental roadmap compendiously. The details of these steps, such as micro architectures, implementation results and etc., are discussed and described in the following sections and chapters respectively.

3.2 An ALU cluster

The briefly description of the previous design, an ALU cluster, included the micro architecture is described. The ALU cluster is called prototype 1 in the above-presented developmental roadmap. It is convincible that it can handle media applications expectedly.

3.2.1 Micro-Architecture of an ALU cluster

As the major part for handling the media processing, an ALU cluster includes five arithmetic units, supporting to process the parallel data concurrently. As shown in Fig 3.5, they are two ALUs, two multipliers and one divider. Large amount of digital signal processing application are suitable for porting in the architecture with mixture of arithmetic units. There is also one scratch pad register file (SPRF), ten banks of intra register file (IRF), a controller and a decoder.

Fig 3.5 Micro-Architecture of ALU cluster

There are thirteen instructions can be executed by the ALU, such as ADD, SUB, ABS, AND, OR, XOR, NOT, SLL, SRL, SRA, LT, GT, EQ. The adder and comparator adopted for ALUs are the carry-lookahead architecture with two stage pipeline. Booth encoding architecture was adopted for our four stage pipeline multiplier. This multiplier can carry out multiplication. The last arithmetic unit is the divider. It performs the division operation that gets the quotient and remainder and calculates the square root. The designed divider in an ALU cluster is not the key kernel about performance concerned so that this unit is not pipelined and considered to shrink the logic resource by increasing latencies of operation.

As mentioned above, in the ALU cluster there are IRFs embedded for each operation units. These intra register files are local to themselves arithmetic units. The purpose of IRFs is to provide an efficient memory bandwidth for arithmetic units. In other words, these arithmetic units would not waste the precious global memory bandwidth. It acquires required bandwidth from less precious bandwidth of IRFs.

Another extra storage element inside the ALU cluster is scratch pad register file. The capabilities of SPRF are as the storage element for commonly used coefficient of applications.

The remaining parts of the ALU cluster are the decoder and the controller. The decoder can decode the instructions from off-chip instruction memory and provides control signals needed for ALU cluster. During the execution of an ALU cluster, the controller sequences and issues the decoded instructions to the function units and decides the inputs source, such as IRF, SPRF or data memory. When an ALU cluster operates to read or write, the controller manages the data flowing.

3.3 An ALU cluster Intellectual Property

In this section, an ALU cluster Intellectual Property (IP) is designed and the architecture is also discussed. It is the prototype 2 mentioned in the roadmap presented in previous section [24]. As mentioned in previous chapter, there is an AMBA AHB wrapper in the ALU cluster IP. First the AMBA AHB protocol will be described in the following paragraph. Then the micro architecture of an ALU cluster IP will be presented and discussed in the last part of this section.

3.3.1 Overview of AMBA

The Advanced Microcontroller Bus Architecture (AMBA) specification defines an on-chip communications standard for designing high-performance embedded microcontrollers [25]. There are three buses defined in this specification. They are Advanced High-performance Bus (AHB), Advanced System Bus (ASB), and Advanced Peripheral Bus (APB). These bus protocols are used in different applications. For example, the AHB are used as the high-performance system backbone bus. It is for high-performance and high clock frequency system modules. It provides the efficient connection of processors, on-chip memories and off-chip external memory interfaces with low-power peripheral macrocell functions. The ASB is also for high-performance system modules. It is suitable for system bus that the high-performance features of AHB are not required. It also supports the efficient

connection of processors, on-chip memories and off-chip external memory interfaces with low-power peripheral macrocell functions the same as AHB. Finally, the APB optimized with minimal power consumption and interface complexity is used for peripherals. APB can be used in conjunction with either version of the system bus.

Four key requirements are satisfied by AMBA specification. They are right-first-time, technology-independent, modular system design and minimization of the silicon infrastructure. The system obeyed AMBA protocol could facilitate the right-first-time development of embedded microcontroller products with one or more CPUs or signal processors. The specification is technology-independent and ensures

Four key requirements are satisfied by AMBA specification. They are right-first-time, technology-independent, modular system design and minimization of the silicon infrastructure. The system obeyed AMBA protocol could facilitate the right-first-time development of embedded microcontroller products with one or more CPUs or signal processors. The specification is technology-independent and ensures