單晶片網路系統平台設計最佳化之研究


On the Study of Design Optimization for

Network-on-Chip Platform

Student : Cheng-Yeh Wang

Advisor : Jing-Yang Jou

A Dissertation

Submitted to Department of Electronics Engineering

and Institute of Electronics

College of Electrical and Computer Engineering

National Chiao-Tung University

in partial Fulfillment of the Requirements

for the Degree of

Doctor of Philosophy

in

Electronics Engineering

June 2007

Hsin-Chu, Taiwan, Republic of China

Abstract

As System-on-Chip (SoC) designs grow in scale, reducing development time becomes a crucial challenge. Therefore, a Network-on-Chip (NoC) generator, an FFT generator, and a multiplier generator are developed to reduce design time.

In this work, a high-performance Network-on-Chip platform is presented. To achieve (1) bandwidth guarantee, (2) economical memory usage, and (3) deadlock freedom, the virtual-circuit switching with dedicated connection path, virtual channel flow control with weighted

profile-driven strategy to bind tasks onto the NoC such that the overall system throughput can be maximized. To analyze the network traffic, a cycle-accurate network simulator is implemented. The traffic contention information is analyzed and then fed back to the proposed optimization flow. The effects of communication amount, traffic contention, and bandwidth requirement are considered when performing the task binding. The overall system throughput is improved by up to 20% for 100 test cases compared with task binding that does not consider the communication and contention effects.

This thesis also describes a novel hybrid method for the wordlength optimization of pipelined FFT processors, which are the arithmetic kernel of OFDM-based systems. This methodology combines the rapid computation of statistical analysis with the accurate evaluation of simulation-based analysis to obtain a speedy optimization flow. A statistical error model for varying wordlengths of the PE stages of an FFT processor is developed to support this optimization flow. A technology-dependent model is extracted to support the FFT's operating frequency constraint. The wordlength boundary is found by constraints, and the optimal form is introduced to reduce computation time. Experimental results show that the wordlength optimization employing the speedy flow reduces the total area of the FFT processor by a percentage that increases with the FFT length. The proposed hybrid method requires shorter prediction time than the absolute simulation-based method does and achieves more accurate outcomes than a statistical calculation does.

Finally, an effective multiplier synthesis algorithm for cell-based multipliers is presented. The synthesis algorithm considers gate delay and wire delay for non-regular tree synthesis. Based on the arrival time and required time of the tree constraints, the generated compressed tree can achieve balanced path delay. By using a novel tree generation algorithm with timing consideration for each vertical compressor slice (VCS), the developed synthesizer can automatically generate high-speed multipliers in small area.


Acknowledgements

First and foremost, I would like to express my greatest appreciation to my advisor, Professor Jing-Yang Jou, for his suggestions and guidance. He not only encourages free thinking but is also a model for us. I would also like to thank Professor Lan-Da Van and Professor Juinn-Dar Huang, who gave me valuable suggestions. I also thank the members involved in the projects: Ya-Chi Yang, Tson-Yee Lin, Chih-Bin Kuo, Pao-Jui Huang, Chih-Chieh Chou, and Guan-Hao Chen. Without their seamless cooperation, the projects would not have been so successful. Thanks to Shang-Wei Tu, Geeng-Wei Lee, and Liang-Yu Lin for several useful discussions. Special thanks to all EDA members for the wonderful time we shared together.

I would like to express my sincere appreciation to Miss Zwei-Mei Lee, my girlfriend, for her patient encouragement and for discussing the writing of this thesis. I also thank my family for their patient waiting, and my father for his educational philosophy, "no [...]".

Cheng-Yeh Wang

National Chiao-Tung University 2007, June


List of Tables

3.1 Example of Random Verification
3.2 Common Specifications of FFT for OFDM
3.3 Constraints for Optimization
3.4 Area Reduction of R2SDF with Different FFT Lengths Using Hybrid Method
3.5 Area Reduction of R2²SDF with Different FFT Lengths Using Hybrid Method
3.6 Area Reduction of R2SDF with Different FFT Lengths Using Statistical Analysis
4.1 Selection of partial product
4.2 Experimental Results

List of Figures

1.1 OFDM transmitter data flow graph [1]
1.2 OFDM receiver data flow graph [1]
1.3 OFDM channel estimation [1]
1.4 System modeling graph [2]
1.5 Overview of NoC platform
2.1 A point-to-point network.
2.2 A bus-based network.
2.3 A switch-based network.
2.4 Circuit switched network.
2.5 Packet switched network.
2.6 Virtual-circuit switched network.
2.7 Mesh-based interconnection architecture of the NoC platform.
2.8 Transformation from relay station to switch.
2.9 An example of virtual channel scheme.
2.10 Bandwidth allocation of a physical channel using a weighted round-robin scheduler.
2.11 Interface transactions between two switches.
2.12 Real-time QoS modeling.
2.13 NoC platform model.
2.14 The task graph of MPEG-4 encoder.
2.15 Illustration of binding methodology.
2.16 Contention in time domain.
2.17 Proposed task binding methodology.
2.18 Pseudo code of the path assignment.
2.19 Histograms of normalized latency under different injection rate.
2.20 Histograms of normalized latency under different buffer size of virtual channel.
2.21 Fail rate under different communication factors.
2.22 Fail rate under different buffer size.
2.23 Latency under different communication factors.
2.24 Latency under different buffer size.
2.25 Throughput ratio under different communication factor.
3.1 Conventional R2SDF and R2²SDF DIF implementations.
3.2 Error model of a PE stage.
3.3 Propagation of quantization and scaling errors.
3.4 Propagation of multiplication errors.
3.5 Propagation of noiseless multiplications.
3.6 Block diagram of simulation analysis.
3.7 Histogram of SQNR difference with randomly generated wordlengths.
3.8 Histogram of SQNR difference with partial exhaustive verification.
3.9 Wordlength optimization flow of a PE stage.
3.10 Evaluation of the upper bound wordlength.
3.11 Evaluation of the lower bound wordlength.
3.12 Area increment of each PE stage as the wordlength increases 1 bit
3.13 The procedure to determine the optimized wordlength set candidates.
3.14 Optimized wordlength selection.
3.15 An example of hybrid wordlength optimization.
3.16 An example of pure statistical analysis.
3.17 Area reduction rate versus SQNR constraints.
3.18 Comparisons of results using different analytical methods.
4.1 Multiplication steps.
4.4 Illustration of Wallace tree for reducing 18 partial products.
4.5 Overview of multiplier generation.
4.6 Conceptual profile of input arrival time of final adder.
4.7 Overview of vertical compressor slice.
4.8 Evaluation of output arrival time
4.9 Cell-based delay model of a 3-to-2 compressor
4.10 π-model of the i-th wire (wi)
4.11 L-shaped approximation between two cells.
4.12 An example of wire delay estimation using L-shaped approximation.
4.13 Full Decomposition (FD) procedure.
4.14 Feasibility Checking (FC) algorithm.
4.15 The first step of feasibility checking: decomposition.
4.16 The second step of feasibility checking: further decomposition.
4.17 The third step of feasibility checking: check the derived arrival time.
4.18 VCS generation algorithm.
4.19 Experimental flow.

Chapter 1 Introduction

1.1 Motivation

In the SoC era, the available gate count grows year by year, and utilizing silicon area effectively is a significant challenge. Merging multiple chips into a single chip has become the mainstream way to improve silicon utilization. This trend is especially obvious in the computer industry. CPU companies have developed multi-core processors by implementing more than one core in a processor [3]. Chipset companies intend to integrate more peripherals into a single chip. In addition, embedded system designs integrate processors, accelerators, and peripherals into a single chip to implement hand-held devices [4].

In the application domain, Software-defined Radio (SDR) systems [5] are used to implement multi-standard communications. These SDR systems produce a new radio by running new software. If the target standards use Orthogonal Frequency Division Multiplexing (OFDM) to obtain high spectral efficiency, the computational complexity increases greatly. For example, the OFDM system shown in Fig. 1.1, Fig. 1.2, and Fig. 1.3 was used to achieve high spectral efficiency [1]. This communication software was evaluated, and the design required the computing power of multiple processors to meet the real-time constraint [6]. A programmable multi-core processor with accelerators is suitable for implementing such systems.

Figure 1.1: OFDM transmitter data flow graph [1]

Figure 1.2: OFDM receiver data flow graph [1]

Figure 1.3: OFDM channel estimation [1]

The integrated chip has more complex traffic than a traditional single-core solution. Hence, on-chip communication is critical in the system implementation [7]. A complex communication scenario can make the system performance unpredictable and result in a failed system. Three major communication networks, the point-to-point connection, the bus-based communication, and the switching network, have been developed to handle such complex traffic [8]. In this work, a virtual-circuit switching network for SoC is proposed, which can achieve performance guarantee and high utilization.

To optimize an SoC system, the tradeoff among hardware cost, system performance, and power dissipation can be made at the system level, the register transfer level (RTL), and the circuit level. Assessing the tradeoff at the system level offers more flexibility to improve the system than doing so at the other levels. Traditionally, system designers reached a compromise based on experience or manual estimation. However, systems become more and more complex as technology progresses. This tradeoff can no longer be handled manually, and thus Computer Aided Design (CAD) becomes more and more important in system design [9].

Figure 1.4: System modeling graph [2]. Both axes (computation and communication) range over untimed, approximate-timed, and cycle-timed. A: specification model; B: component-assembly model; C: bus-arbitration model; D: bus-functional model; E: cycle-accurate computation model; F: implementation model.

Several models have been proposed at the system level, as shown in Fig. 1.4 [2]. Researchers proposed a top-down refinement flow from the untimed functional model to the cycle-accurate model. Architecture designers create the model, and systematic analysis is then applied to optimize the architecture.

Design space exploration (DSE) can identify a suitable architecture for a specific application, after which the architecture candidate is evaluated in detail. The DSE is time-consuming if it is performed at low abstraction levels, e.g., RTL. Since reducing design time is crucial for designers in the SoC era, the DSE is preferably performed at the electronic system level (ESL). At ESL, a system's function and timing information are modeled for evaluating the system. The timing information of a building block can easily be modified for a different implementation. For example, single data rate (SDR) DRAM and double data rate (DDR) DRAM are both modeled as memory storage, but with different timing information. If system evaluation indicates that the memory bandwidth is insufficient and SDR DRAM is to be replaced by DDR DRAM, designers only need to change the timing model of this building block to introduce the new component.
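The idea of swapping only the timing model of a building block can be sketched as follows. This is a minimal illustration, not the thesis's ESL model: the class names `TimingModel` and `MemoryBlock` and the parameter `words_per_cycle` are hypothetical, and the only timing fact assumed is that DDR DRAM transfers data on both clock edges, i.e. two words per cycle against one for SDR DRAM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimingModel:
    """Hypothetical timing annotation attached to a functional block."""
    name: str
    words_per_cycle: int  # data words transferred per clock cycle


@dataclass
class MemoryBlock:
    """Memory storage model: the function is fixed, the timing is swappable."""
    timing: TimingModel

    def transfer_cycles(self, words: int) -> int:
        # Ceiling division: clock cycles needed to move `words` data words.
        return -(-words // self.timing.words_per_cycle)


SDR_DRAM = TimingModel("SDR DRAM", words_per_cycle=1)
DDR_DRAM = TimingModel("DDR DRAM", words_per_cycle=2)

memory = MemoryBlock(timing=SDR_DRAM)
sdr_cycles = memory.transfer_cycles(1024)

# Introduce the new component by replacing only the timing model.
memory.timing = DDR_DRAM
ddr_cycles = memory.transfer_cycles(1024)

print(sdr_cycles, ddr_cycles)
```

The functional part of `MemoryBlock` is untouched by the swap, which is the property that makes exploration at ESL cheap.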

After the architecture candidate is obtained at ESL, the DSE of this candidate is performed at RTL for refinement. Since the design space of each functional block needs to be explored, this is time-consuming for a complex system. Therefore, the development time can be reduced significantly if RTL functional blocks can be generated automatically.

In this thesis, a Network-on-Chip (NoC) platform is proposed for software-defined OFDM wireless communication. A NoC generator, a fast Fourier transform (FFT) generator, and a multiplier generator are developed to speed up the DSE of a system, as shown in Fig. 1.5. These generators produce optimized prototypes based on the given specifications. The produced prototypes have accurate function and timing information. This information is used to replace that of the corresponding functional blocks of a system at ESL for the DSE. Therefore, the DSE can be more accurate. When the design flow advances to RTL, these blocks can simply be replaced by the prototype RTL designs produced by the generators for system evaluation. Furthermore, when the flow advances to the physical level, the layouts of these blocks are replaced by the prototype layouts. The total development time is therefore reduced.

Figure 1.5: Overview of NoC platform

1.2 Thesis Organization

This thesis is organized into five chapters. Chapter 1 introduces the thesis, starting from the motivation for automatic generators. Chapter 2 describes the development of the NoC generator and the methodology for optimizing NoC networks. The experimental results of the NoC generator are given in the rest of that chapter. Chapter 3 briefly reviews pipelined FFT processors, and then presents the FFT generator and the wordlength optimization algorithm, followed by the experimental results. Chapter 4 discusses the multiplier generator based on gate delay and wire delay optimization. Finally, conclusions and future work are presented in Chapter 5.

Chapter 2 Network on Chip

Three types of networks were developed for on-chip communications [8]. A point-to-point communication network, shown in Fig. 2.1, is constructed using a dedicated channel between the source and the destination. Because it does not share channels with other communication traffic, this network has minimum run-time uncertainty, but it requires large silicon area due to the large number of communication paths. These communication paths also need to be known at design time. Hence, this communication network is often employed in application-specific designs. Fig. 2.2 shows a bus-based communication network, which is often used in IP-reuse designs. It is a centralized network and needs an arbitration mechanism to decide which processing element (PE) can use the bus. Such a centralized network becomes a communication bottleneck as the number of PEs increases. Since on-chip communication is more and more complex and traffic is heavier in an SoC design, the switch-based network was developed. A conceptual sketch of the switch-based network is shown in Fig. 2.3. This network is decentralized and concurrent, and hence does not waste energy on meaningless signal transitions.

Figure 2.1: A point-to-point network.

Figure 2.2: A bus-based network.

Figure 2.3: A switch-based network.

Circuit switching and packet switching are the most commonly used approaches for network communications. Packet switched networks usually employ either a connectionless approach or a connection-oriented approach. Store-and-forward switching, virtual cut-through switching, and wormhole switching are the connectionless approaches. Wormhole switching is suitable for on-chip communications due to its good average latency and low memory usage, but it has unpredictable latency under heavy traffic. The connection-oriented approach applies the circuit switching concept in the packet switched network, and is hence called virtual-circuit switching.

For Network-on-Chip (NoC), circuit switching and wormhole switching are widely used for on-chip communications [10]. To achieve high resource utilization and performance guarantee, a hybrid method combining circuit switching and wormhole switching was proposed [11]. However, some applications need both high resource utilization and performance guarantee in each path. In this work, an NoC generator is developed based on the virtual-circuit switching typically used in computer networks to achieve high resource utilization and performance guarantee.

The rest of this chapter is organized as follows. Section 2.1 introduces network switching. Section 2.2 introduces the developed switch architecture and the NoC platform design. In Section 2.3, the communication-aware task binding methodology is described. Then, experimental results and discussions are given in Section 2.4. Finally, a summary is given.

2.1 Overview of Network Switching

In this section, circuit switching, connectionless packet switching, and virtual-circuit switching are briefly reviewed.

2.1.1 Circuit Switching

Circuit switching uses dedicated resources to meet real-time requirements. However, the dedicated resources are wasted if the traffic is not continuous. Since on-chip traffic usually consists of burst transactions, circuit switching is not well suited to such traffic. On the other hand, it is satisfactory for real-time applications.

Figure 2.4: Circuit switched network.

Time-division multiplexing (TDM) is employed in circuit switching for transmitting data through dedicated channels. As connection paths are established, the required time slots and buffers will be reserved for data transactions. Hence, the contention will not happen and the performance can be guaranteed.

Fig. 2.4 shows the conceptual plot of a circuit switching network, where T1 to T4 are the time slots in a round. The highlighted time slots and buffers are reserved for the path indicated by the solid arrows in Fig. 2.4. If the time slots are well arranged, the latency can be reduced, but the throughput will not be improved.
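The effect of the slot arrangement on latency can be illustrated with a small sketch. It assumes a simplified timing model that is not taken from the thesis: a word needs one cycle to travel between adjacent switches and then waits at each switch for that switch's reserved slot. The function name `path_latency` and the slot assignments are hypothetical.

```python
def path_latency(slots, round_len):
    """Cycles from the first transmission until the word leaves the last switch.

    slots[k] is the time slot (0 .. round_len - 1) reserved at the k-th
    switch on the path. Assumed model: one cycle per hop, then the word
    waits for the next occurrence of the reserved slot.
    """
    t = slots[0]                      # first switch sends in its reserved slot
    for slot in slots[1:]:
        t += 1                        # one cycle to reach the next switch
        t += (slot - t) % round_len   # wait for that switch's reserved slot
    return t - slots[0] + 1


ROUND_LEN = 4  # four time slots per round, as T1 to T4 in the text

aligned = path_latency([0, 1, 2, 3], ROUND_LEN)     # consecutive slots
misaligned = path_latency([0, 0, 0, 0], ROUND_LEN)  # same slot everywhere

# In both arrangements the path owns one slot per round, so the
# throughput stays at one word per round.
throughput = 1 / ROUND_LEN

print(aligned, misaligned, throughput)
```

Under this model the aligned reservation forwards the word in every cycle, while the misaligned one makes it wait almost a full round at each hop; the throughput is the same in both cases.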

Figure 2.5: Packet switched network.

2.1.2 Connectionless Packet Switching

Connectionless packet switching is widely used in data communications. This switching approach employs the shared resources to achieve high resource utilization, but it has unpredictable latency. If there is heavy traffic, the resources will be occupied, and the packet switching network cannot work well.

In connectionless packet switching, buffers are shared by all transactions. If the network has no handshaking scheme, buffers may overflow and packets will be dropped. If a handshaking scheme exists, a transaction stalls until the buffers of the destination are released. The conceptual sketch of a packet switched network is shown in Fig. 2.5. Buffers BF1 and BF2 in a switch are shared by all packets regardless of the source and the destination. When the switch receives a packet, it reserves a buffer for this packet, and then releases this buffer when the packet passes on toward the destination. Hence, the buffers in the packet switched network can achieve high utilization.

In store-and-forward switched networks, a switch receives a complete packet and then forwards it toward the destination. Hence, the switch must reserve sufficient buffers for this packet. In virtual cut-through switched networks, a switch needs to reserve enough buffers for a complete packet, but it can start forwarding the packet before the packet is completely received. In wormhole switched networks, a switch can forward the received packet directly without reserving buffers for the complete packet.

2.1.3 Virtual-circuit Switching

Virtual-circuit switching requires setting up a virtual connection from the source to the destination before sending packets. Fig. 2.6 shows the conceptual sketch of a virtual-circuit switched network, where the virtual-circuit identifier (VCI) is introduced to specify which virtual circuit accesses the physical wires. The VCI is not a global identifier; it has link-local scope and is carried inside the header of the packet. As shown in Fig. 2.6, the virtual-circuit table of a switch is initially established based on the routing paths, and is used to determine the VCI of the delivered packet (Out VCI) from the packet's original VCI (In VCI). A packet delivered from switch SW1 to switch SW4 has the VCI in its header changed from In VCI to Out VCI according to each virtual-circuit table along the path. If enough buffers and bandwidth are reserved for this path, the quality of service (QoS) can be provided.

Figure 2.6: Virtual-circuit switched network.
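The per-switch VCI translation can be sketched as a chain of table lookups. The table contents and the path below are hypothetical values chosen for illustration; they are not the entries of Fig. 2.6, and the function name `forward` is an assumption.

```python
def forward(path, tables, in_vci):
    """Follow a packet along `path`, rewriting its VCI at every switch.

    tables[switch] maps In VCI -> Out VCI for that switch. The VCI has
    link-local scope, so the same number may mean different circuits on
    different links.
    """
    vci = in_vci
    trace = []
    for switch in path:
        vci = tables[switch][vci]     # header rewrite: In VCI -> Out VCI
        trace.append((switch, vci))
    return trace


# Hypothetical virtual-circuit tables, set up before any packet is sent.
tables = {
    "SW1": {1: 3},
    "SW2": {3: 2},
    "SW4": {2: 4},
}

trace = forward(["SW1", "SW2", "SW4"], tables, in_vci=1)
print(trace)
```

Because each table only has to be unique per link, the tables stay small, which is consistent with the VCI not being a global identifier.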

2.2 Architecture Models and Platform Design

There are many different interconnection architectures for NoC platforms. P. P. Pande et al. [12] compare the performance and characteristics of a variety of NoC architectures and also obtain comparative results for a number of common NoC topologies. In this work, several assumptions are made. First, we assume that the interconnection architecture of the NoC platform is a mesh-based topology, as illustrated in Fig. 2.7. Second, the platform consists of two kinds of components: identical processors and switches. Third, each processor contains local memory and is connected to the local switch. Fourth, each switch connects to the neighboring switches and the local processor.

Figure 2.7: Mesh-based interconnection architecture of the NoC platform. SW = Switch; NI = Network Interface; P = Processor Core; B = Buffer; M = Memory. Each switch has north (N), south (S), east (E), west (W), and local processor (P) ports.

The 2-D mesh topology is chosen for three reasons. First, simple connection and easy routing are preferred in parallel computing platforms [13]. Next, the uniform interconnection among the nodes gives balanced propagation delay between switches and ensures the overall scalability of the network. Finally, this topology matches the planar fabrication of IC technology.

Figure 2.8: Transformation from relay station to switch. (a) Relay station-based. (b) Switch-based.

2.2.1 Network Switching

We propose a switch architecture that is based on the latency-insensitive concepts [14] [15] and utilizes the virtual-circuit switching technique to achieve high bandwidth utilization, bandwidth guarantee, and predictable latency under heavy traffic. In latency-insensitive design, relay stations (RSs) are used to pipeline long interconnects. The topology of the relay station connections is shown in Fig. 2.8 [15]. To improve the low utilization of the dedicated peer-to-peer connections, the RSs are replaced by our switches, and virtual channels are substituted for the connections between RSs.

Figure 2.9: An example of virtual channel scheme.

2.2.2 Switch Design

The proposed switch architecture combines the virtual channel scheme, weighted round-robin scheduling, and SRAM-based configuration, and is capable of providing high throughput, bandwidth guarantee, economical memory usage, and deadlock freedom. We summarize the switch capabilities as follows:

First, each switch based on virtual-circuit switching has the advantages of predictable behavior and real-time response. The switches use virtual channel flow control to improve the overall latency and throughput of the network. For example, in Fig. 2.9, two messages cross the physical channel between switches SW1 and SW2. Without the virtual channel technique, the message data is buffered at the input or output of the physical channel, and the transfer in this channel is blocked until the buffers are released. In this work, the messages can be delivered rather than blocked because the physical channel is divided into several virtual channels. The waiting time of the message transfer is reduced, and the average latency of this channel is decreased. Thus, the physical channel achieves higher utilization and the network obtains a larger throughput.

Figure 2.10: Bandwidth allocation of a physical channel using a weighted round-robin scheduler.

Second, to share the bandwidth of a physical channel among all its virtual channels, we use a weighted round-robin scheduling scheme to grant the use of the physical channel to each virtual channel. Instead of using the time-division method, the weighted round-robin scheduler shown in Fig. 2.10 allocates a different bandwidth to each virtual channel by assigning a different number of time slots. A higher channel weight means that more communication bandwidth is available to that channel.

Figure 2.11: Interface transactions between two switches. STATUS table: 1 = there is data in the buffer, which requests access to the channel. LENGTH-Full table: 1 = the buffer is full and accepts no more data.
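The bandwidth allocation of weighted round-robin scheduling can be sketched as follows. The weights and channel names are hypothetical and do not reproduce the example of Fig. 2.10; the sketch also assumes that every virtual channel always has data to send, so each one uses all of its slots.

```python
from collections import Counter


def wrr_round(weights):
    """Slot assignment of one scheduling round.

    Each virtual channel receives as many consecutive time slots of the
    physical channel as its weight.
    """
    slots = []
    for channel, weight in weights.items():
        slots.extend([channel] * weight)
    return slots


weights = {"A": 2, "B": 2, "C": 1}   # hypothetical weights
round_slots = wrr_round(weights)

# Fraction of the physical-channel bandwidth seen by each virtual channel.
share = {ch: n / len(round_slots) for ch, n in Counter(round_slots).items()}

print(round_slots)
print(share)
```

With these weights, channels A and B each obtain two fifths of the bandwidth and channel C one fifth, which illustrates how a larger weight translates into a larger guaranteed share.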

Third, the data exchange protocol between two switches, or between a switch and the network interface of the local processor, is executed within four clock cycles. Fig. 2.11 shows the interface transaction between two adjacent switches, SW1 and SW2. The address mapping table records the destination address to which the messages are transferred. At the first cycle, if the buffer E1,sw1 of SW1 holds data, the system controller grants the channel priority to this data. At the second cycle, the E1,sw1 buffer sends the address of E1,sw2 through the Address line to indicate that this transaction tries to deliver data to the E1,sw2 buffer. At the third cycle, SW2 sends an acknowledge signal, true or false, back to SW1 through the Ack line according to its buffer status, full or available. Meanwhile, the buffer E1,sw1 sends the data through the Data line, and the buffer E1,sw2 stores this data if it has spare space. However, this data may be discarded if E1,sw2 is already full. During the fourth cycle, the buffer E1,sw1 keeps this data until the transaction is successfully completed.
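The four-cycle transaction can be sketched behaviorally as follows. This is an assumption-laden illustration rather than the switch's RTL: buffers are modeled as Python deques, the function name `transact` and the capacity parameter are hypothetical, and the four cycles are collapsed into one function call.

```python
from collections import deque


def transact(src, dst, dst_capacity):
    """One transaction between a source and a destination buffer.

    Returns the acknowledge value seen by the source switch.
    """
    # Cycle 1: the controller grants the channel only if the source has data.
    if not src:
        return False
    # Cycle 2: the source sends the destination buffer address
    # (modeled implicitly by passing `dst`).
    # Cycle 3: the destination acknowledges according to its status while
    # the data is sent; the data is stored only if there is spare space.
    ack = len(dst) < dst_capacity
    if ack:
        dst.append(src[0])
    # Cycle 4: the source keeps the word until the transaction succeeds.
    if ack:
        src.popleft()
    return ack


src = deque([10, 11])
dst = deque()
first = transact(src, dst, dst_capacity=1)    # destination has space
second = transact(src, dst, dst_capacity=1)   # destination is now full

print(first, second, list(src), list(dst))
```

In the second call the word is discarded at the full destination but retained at the source, so no data is lost and the transfer can be retried later.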

Fourth, our switch provides different memory configurations to improve the local memory utilization, for two reasons. First, not all buffers of a switch are reserved when the number of connection paths is smaller than the number of designed buffers. Second, memory is a critical component for buffering data in a network. Therefore, in the memory implementation, we use two-port SRAM instead of registers when the number of virtual channels in the physical channel is large. In the switch, the memory is divided into buffers of different sizes to optimize utilization. The memory in a switch port can be partitioned into eight 8-word blocks, sixteen 4-word blocks, or thirty-two 2-word blocks.
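The three partitions all use the same 64 words of port memory. The sketch below checks this and shows one possible selection rule, which is an assumption and not a policy stated in the thesis: pick the partition with the fewest blocks that still gives every connection path its own buffer, so that each buffer is as deep as possible. The names `CONFIGS` and `pick_config` are hypothetical.

```python
# (number of buffers, words per buffer) for one switch port, from the text.
CONFIGS = [(8, 8), (16, 4), (32, 2)]


def pick_config(num_paths):
    """Assumed policy: the deepest buffers that still cover all paths."""
    for blocks, words in CONFIGS:
        if num_paths <= blocks:
            return blocks, words
    raise ValueError("more connection paths than available buffers")


total_words = {blocks * words for blocks, words in CONFIGS}

print(total_words)
print(pick_config(5), pick_config(12), pick_config(32))
```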

Finally, to support real-time applications, our switches are able to establish dedicated connection paths in advance by reserving the corresponding virtual channels, since the communication behavior and the number of nodes can be determined at an early stage of system design.

Although both traditional circuit switching and the proposed switching configuration provide a latency guarantee, the proposed one has smaller average latency and higher hardware utilization. Compared with wormhole packet switching, the proposed one provides a worst-case guarantee, while both have small buffer size and high hardware utilization.

Figure 2.12: Real-time QoS modeling.

2.2.3 Quality of Service Modeling and Property

In the real-time system, the latency guarantee is the essential requirement of the quality of service (QoS) while the scheduling algorithm enables the appropriate task scheduling to satisfy the real-time requirement in the worst case condition. On the other hand, the QoS also plays a critical role even in a non-real time system. Generally, when using the communication fabric without performance guarantee, designers have to expend more design efforts to estimate the communication latency to make sure that the communication


load is not underestimated for the given on-chip network communication architecture. As a consequence, the communication infrastructure is usually over-designed to avoid congestion. In this work, we use weighted round-robin scheduling for our QoS model, as shown in Fig. 2.12, because it is a scheduling scheme that requires minimal resources. Each master has a weight $N_i$ in the scheduler that controls it. The scheduler grants a master when the master issues a request. The master can then transmit at most $N_i$ words of data in a round. After that, the scheduler grants the next master, until the round is complete.
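The granting rule above can be sketched in a few lines of Python. This is only an illustrative model, not the switch hardware: the function name `weighted_round_robin`, the use of one `deque` per master, and the example weights are assumptions made for the sketch.

```python
from collections import deque

def weighted_round_robin(queues, weights, rounds):
    """Serve each master in turn; master i may send at most weights[i] words per round."""
    sent = []  # (round, master, word) in transmission order
    for r in range(rounds):
        for i, q in enumerate(queues):
            budget = weights[i]
            # Grant master i only while it has a pending request and budget left.
            while budget > 0 and q:
                sent.append((r, i, q.popleft()))
                budget -= 1
    return sent

# Hypothetical example: three masters with weights 2, 1, 1.
queues = [deque(["a0", "a1", "a2"]), deque(["b0"]), deque(["c0", "c1", "c2"])]
weights = [2, 1, 1]
schedule = weighted_round_robin(queues, weights, rounds=2)
```

A master with nothing to send simply gives up its share of the round, so in this model the round is never longer than the sum of the weights.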

Our switches provide predictable communication quality on the NoC platform and a simple communication model that reduces design complexity. As shown in Fig. 2.12, a communication path from Buffer 1 to Buffer K is established. Transactions from Buffer i are granted by the weighted round-robin Scheduler i. Before analyzing the properties of the QoS model in Fig. 2.12, we give the following definitions:

1) $w_{i,j}$ is the weight of Buffer i in the weighted round-robin Scheduler j.

2) $W_j$ is the sum of the weights of the buffers controlled by Scheduler j.

3) $D_{max}$ denotes the maximum delay of a 1-word transmission.

5) $L_{max}$ denotes the maximum communication path latency of a 1-word transmission.

6) $R_{path}$ denotes the throughput rate of a path.

7) $L_{burst}$ denotes the maximum burst data latency.

Using the above definitions, the proposed network switch design has six properties that guarantee QoS:

• Property 1: If Buffer i is empty, the maximum delay from data arrival to transfer is the period of one round of the round-robin scheduler, i.e., $D_{max} = W_j$.

• Property 2: If there are data in Buffer i, the $D_{max}$ between transactions is $W_j$.

• Property 3: If the buffer size is at least double the buffer's weight, the guaranteed lower-bound throughput rate is the ratio of the weight to the sum of the weights in the round-robin scheduler, i.e., $R \ge \frac{w_{i,j}}{W_j}$.

• Property 4: The maximum path latency of a 1-word transmission is the sum of the maximum node latencies of a 1-word transmission, i.e., $L_{max} = \sum_{j=1}^{k} W_j$.

• Property 5: $R_{path}$ is dominated by the minimum throughput of the buffers in the path, i.e., $R_{path} = \min\left\{ \frac{w_{i,j}}{W_j} \right\}$, where $j = 1, 2, 3, \cdots, k$.

• Property 6: Using Property 4 and Property 5, the burst data delay can be obtained as $L_{burst} = L_{max} + \frac{N}{R_{path}}$.


Figure 2.13: NoC platform model. [Figure: a mesh of switches (SW), each attached to a processing element (PE); each PE contains a processor, memory, buffer, and network interface (NI).]

Property 6 means that our switch has an upper bound on the burst data delay, so system designers can design target systems that meet real-time constraints.
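As a rough worked example, the bounds in Properties 4 to 6 can be evaluated for a given path. The sketch below assumes that delays are counted in word-transfer slots, that the burst term of Property 6 is the burst length in words divided by $R_{path}$, and that each scheduler on the path is described by a pair (weight of our buffer, sum of weights); the function name `qos_bounds` and the numbers are hypothetical.

```python
def qos_bounds(path, burst_words):
    """path: list of (w_ij, W_j) pairs, one per scheduler along the path."""
    d_max = [W for _, W in path]              # Properties 1 and 2: per-node worst-case delay
    l_max = sum(d_max)                        # Property 4: worst-case 1-word path latency
    r_path = min(w / W for w, W in path)      # Property 5: throughput set by the slowest node
    l_burst = l_max + burst_words / r_path    # Property 6 (form assumed in this sketch)
    return l_max, r_path, l_burst

# Hypothetical path through three schedulers.
path = [(2, 8), (1, 4), (2, 4)]
l_max, r_path, l_burst = qos_bounds(path, burst_words=16)
```

For this path the worst-case 1-word latency is 16 slots, the guaranteed rate is 0.25 words per slot, and a 16-word burst is bounded by 80 slots.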

2.3 Task Binding Methodology

In this section, we present a communication-aware methodology to solve the task binding problem on the NoC platform constructed from the proposed switch described above.


Figure 2.14: The task graph of MPEG-4 encoder. [Figure: tasks A (MEC), B (MEF), C (MVMVD), D (HVLC), E (MC), F (DCTQ), G (TVLC), and H (SP), connected by edges labeled C1 to C8; the input is the current frame stream.]

2.3.1 NoC Platform Modeling

Without loss of generality, the proposed NoC platform using an efficient switch architecture can be modeled as in Fig. 2.13. The behavior of the NoC platform model is described as follows:

1) The platform, composed of processing elements (PEs) and switches (SWs), is a mesh-based communication architecture. Each PE contains one processor, memory, and a network interface (NI).

2) All processors in this platform are identical.

3) The processors have limited buffer to store input and output data.

4) Each PE has local memory to store the execution code and the data.


We employ a task graph to model applications and assume that applications can be partitioned into many communicating tasks owing to their parallelism. Fig. 2.14 shows the task graph of an MPEG-4 encoder [16]. A vertex represents a task and is labeled with its functionality. For example, task C, denoted MVMVD, performs the motion vector to motion vector difference calculation. An edge represents a data transmission and the corresponding communication amount. After task B finishes, it transmits C2 units of data to task C and C5 units of data to task E. An edge also indicates a data dependency: a task cannot be executed until it receives the data from its predecessors. For example, task H cannot be executed until it receives C4 units of data from task D and C8 units of data from task G.
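The dependency rule can be illustrated with a small Python sketch. It encodes only the four edges named in the text (the remaining edges of Fig. 2.14 are not reproduced here), keeps the communication amounts as symbolic labels, and uses hypothetical helper names `predecessors` and `is_ready`.

```python
# Only the four edges named in the text; C2, C5, C4, C8 are symbolic amounts.
edges = {("B", "C"): "C2", ("B", "E"): "C5", ("D", "H"): "C4", ("G", "H"): "C8"}

def predecessors(task):
    """Tasks that must deliver data before `task` can run."""
    return sorted(src for (src, dst) in edges if dst == task)

def is_ready(task, finished):
    """A task may start only after every predecessor has delivered its data."""
    return all(p in finished for p in predecessors(task))

h_waits_for = predecessors("H")
h_ready_after_d = is_ready("H", {"D"})
h_ready_after_d_and_g = is_ready("H", {"D", "G"})
```

Because the edge list is partial, a task whose predecessors are not listed is reported as ready; a complete model would need all eight edges.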

2.3.2 Task Binding Problem

The task binding problem is formulated as follows:

• Given:

1) The application is modeled as a task graph G(V,E), where each vertex in V is a task whose weight denotes its computation amount, and each edge in E is a dependence between tasks whose weight denotes the communication amount.

2) The NoC platform (PE,S), where PE and S contain the position information of the processors and the communication architecture information, respectively.


Figure 2.15: Illustration of binding methodology. [Figure: tasks W, X, Y, and Z bound through wrappers to switches (SW) in the mesh, with communication paths A, B, and C routed between them.]

• Goal:

To bind each task in V to a PE and to assign the routing paths between tasks such that the overall system throughput is maximized.

An example of a binding result is shown in Fig. 2.15, where tasks are bound to PEs and the communication paths are established. Path A and path B overlap and will contend in the temporal domain.

In this work, according to Fig. 2.15, we propose a more accurate communication cost calculation scheme that is exploited by the task mapping and routing path assignment processes. The more complete communication cost function ξ is

$$\xi = \sum_{X \in \text{each path}} \; \sum_{\text{each channel in } X} \left( 1 + \frac{C_{amount,X} + \rho_X}{C_{amount,MAX}} + B_{penalty} \right) \qquad (2.1)$$

where $C_{amount}$, $\rho_X$, and $B_{penalty}$ denote the communication amount, the contention factor, which models the total effect of communication contention on a physical channel, and the bandwidth penalty, respectively. The detailed expressions of $\rho_X$ and $B_{penalty}$ are as follows:

$$\rho_X = \sum_{Y \in \text{other paths in the channel}} C_{amount,Y} \times \Gamma_{Y \to X} \qquad (2.2)$$

and

$$B_{penalty} = \begin{cases} \alpha \times \dfrac{B_{demanded} - B_{provided}}{B_{provided}}, & \text{if } B_{demanded} > B_{provided} \\ 0, & \text{if } B_{demanded} \le B_{provided} \end{cases} \qquad (2.3)$$

where $\Gamma$, $\alpha$, $B_{demanded}$, and $B_{provided}$ denote the contention density, the penalty weight, the bandwidth demanded by the communication paths in a physical channel, and the bandwidth provided by a physical channel, respectively. The contention density is formulated as

$$\Gamma_{b \to a} = \frac{t^{(a,b)}_{overlap}}{t_{commun.time,a}} \qquad (2.4)$$

The density of each pair of communication paths can be derived from the communication profile in the time domain, as shown in Fig. 2.16.
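One way to derive the density in (2.4) from a time-domain profile is sketched below. The sketch assumes that the profile of each path is available as a list of non-overlapping (start, end) activity intervals; this representation and the names `overlap` and `contention_density` are illustrative choices, not the format used by the simulator in this work.

```python
def overlap(intervals_a, intervals_b):
    """Total time during which both paths are active (intervals are (start, end))."""
    total = 0
    for sa, ea in intervals_a:
        for sb, eb in intervals_b:
            total += max(0, min(ea, eb) - max(sa, sb))
    return total

def contention_density(intervals_a, intervals_b):
    """Gamma_{b->a}: overlap time divided by the communication time of path a."""
    time_a = sum(e - s for s, e in intervals_a)
    return overlap(intervals_a, intervals_b) / time_a

# Hypothetical profiles of two paths sharing a channel.
path_a = [(0, 4), (10, 14)]
path_b = [(2, 6), (12, 13)]
gamma_b_to_a = contention_density(path_a, path_b)
gamma_a_to_b = contention_density(path_b, path_a)
```

The density is not symmetric: here the two paths overlap for 3 time units, which is 3/8 of the activity of path A but 3/5 of the activity of path B.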

The communication cost in (2.1) means that the efficiency of a communication path is affected by the length of the path, the number of paths sharing each channel, and the bandwidth usage.
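To make the interplay of the three terms concrete, the following sketch evaluates (2.1) to (2.3) for a small set of routed paths. It rests on several assumptions that are not stated in the text: each path is described by a hypothetical dictionary with its channel list, communication amount, and demanded bandwidth; the demanded bandwidth of a channel is taken as the sum over the paths using it; and all channels provide the same bandwidth.

```python
def bandwidth_penalty(demanded, provided, alpha):
    """Equation (2.3): penalize only the bandwidth demanded beyond what is provided."""
    if demanded > provided:
        return alpha * (demanded - provided) / provided
    return 0.0

def communication_cost(paths, gamma, provided_bw, alpha):
    """Equation (2.1). paths: name -> {"channels": [...], "amount": C, "bandwidth": B};
    gamma[(y, x)] is the contention density of path y on path x."""
    c_max = max(p["amount"] for p in paths.values())
    cost = 0.0
    for x, px in paths.items():
        for ch in px["channels"]:
            sharers = [y for y, py in paths.items() if y != x and ch in py["channels"]]
            # Equation (2.2): contention factor of path x on this channel.
            rho = sum(paths[y]["amount"] * gamma[(y, x)] for y in sharers)
            demanded = px["bandwidth"] + sum(paths[y]["bandwidth"] for y in sharers)
            cost += 1 + (px["amount"] + rho) / c_max
            cost += bandwidth_penalty(demanded, provided_bw, alpha)
    return cost

# Hypothetical example: paths A and B share channel "c2".
paths = {
    "A": {"channels": ["c1", "c2"], "amount": 4, "bandwidth": 0.75},
    "B": {"channels": ["c2"], "amount": 2, "bandwidth": 0.75},
}
gamma = {("B", "A"): 0.5, ("A", "B"): 0.25}
cost = communication_cost(paths, gamma, provided_bw=1.0, alpha=2.0)
```

In this example the uncontended channel c1 contributes 2.0, while the shared channel c2 contributes 3.25 for path A and 2.75 for path B, so the contention and bandwidth terms dominate the total of 8.0.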

Therefore, we propose a design methodology, shown in Fig. 2.17, to solve the task binding problem. The first block, Task Graph, contains the computation and communication information of all tasks needed for task mapping. The mapping approach employs the placement techniques used in FPGAs to map tasks onto PEs [17]. The main idea is that if the traffic load between two tasks is heavy, the two tasks are placed next to each other. After task mapping, the shortest-path technique is applied to configure all connection paths for these tasks. Then, simulation is performed to obtain the communication profile in the time domain, and the contention parameters mentioned above are calculated. Finally, the profile is fed back to the task mapping process, and the path assignment uses this profile to achieve a better assignment. This profile-driven optimization provides more accurate contention information than system simulation without routing information. The design flow can be iterated to enhance system performance.

Figure 2.16: Contention in time domain. [Figure: a timeline with marks t1 to t7 showing when Path A and Path B are active; contention occurs where the two overlap.]


[Figure 2.17 (design flow): blocks Application, Task Graph, Task Mapping, NoC Platform, Routing, and Performance Analysis, with a "Good?" decision that leads to Finish (Yes) or to another iteration (No).]


2.3.3 Task Mapping

The task mapping step of the proposed task binding method uses simulated annealing, since this technique is simple and suits the diverse architectures developed for on-chip networks. It can also be easily adapted to different cost requirements. In this work, the goal of task mapping is to minimize the overall communication resource usage. In the task mapping stage, routing has not yet been applied, so the path information in the physical channels is unavailable. The simplified cost function in (2.5) is therefore derived by neglecting the contention factor $\rho_X$ and the bandwidth penalty $B_{penalty}$ in (2.1).

$$\xi = \sum_{\text{path of pair of processors}} \text{distance} \times \left( 1 + \frac{C_{amount,X}}{C_{amount,MAX}} \right) \qquad (2.5)$$

Note that (2.5) can be regarded as a simplified version of (2.1). The distance in (2.5) is the Manhattan distance between the source node and the destination node. The first term of the cost function, (distance × 1), describes the resource usage of the virtual channels. This term is also the conventional cost in FPGA placement algorithms. In our mapping algorithm, not only the distance but also the communication amounts between tasks affect system performance. The second term, (distance × $\frac{C_{amount,X}}{C_{amount,MAX}}$), expressed as a normalized result, represents the
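A minimal sketch of annealing-based mapping with the cost in (2.5) is given below. It is not the implementation used in this work: the swap move, the linear cooling schedule, the parameter defaults, and the names `mapping_cost` and `anneal` are all assumptions chosen to keep the example short, and the mesh is assumed to have at least as many PEs as tasks.

```python
import math
import random

def mapping_cost(mapping, traffic):
    """Equation (2.5): sum over task pairs of Manhattan distance * (1 + C / C_max)."""
    c_max = max(traffic.values())
    total = 0.0
    for (a, b), amount in traffic.items():
        (xa, ya), (xb, yb) = mapping[a], mapping[b]
        total += (abs(xa - xb) + abs(ya - yb)) * (1 + amount / c_max)
    return total

def anneal(tasks, traffic, width, height, steps=2000, t0=5.0, seed=0):
    rng = random.Random(seed)
    # One slot per PE; None marks a PE with no task bound to it.
    slots = list(tasks) + [None] * (width * height - len(tasks))
    rng.shuffle(slots)

    def to_mapping(s):
        return {t: (i % width, i // width) for i, t in enumerate(s) if t is not None}

    cost = initial = mapping_cost(to_mapping(slots), traffic)
    best, best_cost = list(slots), cost
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9      # linear cooling (assumed schedule)
        i, j = rng.sample(range(len(slots)), 2)
        slots[i], slots[j] = slots[j], slots[i]    # move: swap the contents of two PEs
        new_cost = mapping_cost(to_mapping(slots), traffic)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = list(slots), cost
        else:
            slots[i], slots[j] = slots[j], slots[i]  # reject: undo the swap
    return to_mapping(best), best_cost, initial

# Hypothetical four-task chain on a 2-by-2 mesh.
tasks = ["A", "B", "C", "D"]
traffic = {("A", "B"): 8, ("B", "C"): 4, ("C", "D"): 2}
mapping, best_cost, initial_cost = anneal(tasks, traffic, width=2, height=2)
```

For this chain, no placement can cost less than 4.75, the value reached when all three communicating pairs are adjacent, which gives a simple sanity check on the result.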


2.3.4 Connection Path Assignment

After task mapping, a router assigns the connection paths between each pair of communicating processors. The algorithm proposed in [18] is applied to solve the routing problem. This router is essentially a variant of the maze router [19], in which Dijkstra's algorithm [20] finds the lowest-cost path between the transmitting and receiving processors. The Pathfinder algorithm [17] then performs multiple routing iterations, ripping up some or all nets and rerouting them along different paths whenever competition for routing resources makes the routing illegal. Note that ripping up and rerouting these nets only affects the net ordering; all nets are routed by the same maze routing algorithm.
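The lowest-cost path search at the core of such a router can be sketched with Dijkstra's algorithm on a 2D mesh. This is a generic illustration, not the router of [18]: the function name `lowest_cost_path`, the callback `channel_cost`, and the example costs are assumptions, and rip-up and rerouting are not modeled.

```python
import heapq

def lowest_cost_path(width, height, channel_cost, src, dst):
    """Dijkstra over a 2D mesh; channel_cost(u, v) is the cost of channel u -> v."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue  # stale queue entry
        x, y = u
        for v in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= v[0] < width and 0 <= v[1] < height:
                nd = d + channel_cost(u, v)
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    prev[v] = u
                    heapq.heappush(heap, (nd, v))
    # Walk back from the destination to recover the path.
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1], dist[dst]

# Hypothetical 3-by-3 mesh in which one channel is heavily contended.
congested = {((0, 0), (1, 0)): 10.0}
cost_of = lambda u, v: congested.get((u, v), 1.0)
route, route_cost = lowest_cost_path(3, 3, cost_of, (0, 0), (2, 0))
```

With the contended channel priced at 10, the search prefers a four-hop detour of cost 4 over the two-hop direct route of cost 11, which is the behavior a contention-aware cost such as (2.1) is meant to produce.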

The path assignment uses equation (2.1) as its cost function. There are two differences between cost functions (2.1) and (2.5). First, the distance in the path assignment is the actual routed length rather than the Manhattan distance used in task mapping. Second, the contention density and bandwidth penalty are included in the path assignment cost to describe the contention effect and the total bandwidth introduced by other paths.

Fig. 2.18 shows the procedure of the path assignment algorithm. First, the all-shortest-path algorithm is applied. In Step 2, virtual channel overuse is resolved so that all nets of the paths are routed. However, the overused bandwidth of some physical channels is not resolved in this step. From Step 3 to Step 7, the communication paths are redistributed among the physical channels to avoid overuse. The redistribution method finds the path with the maximum cost and reroutes it to reduce its cost. The cost of this critical path may not improve because of heavy contention with other paths, so other paths are rerouted in turn according to their measured cost.

Path Assignment (Fig. 2.18)

Input: An NoC architecture with the locations of the transmitters and receivers. The number of virtual channels in a physical channel is given.

Output: Routes for all communication paths such that the communication cost is minimized.

Algorithm:
1. The all shortest path algorithm
2. Fix virtual channel overuse
3. Sort paths by cost function
4. while (1)
5.     re-route the paths in order
6.     if (no improvement in this iteration)
7.         exit

2.4 Experimental Results and Discussion

In this work, a 2D-mesh configuration is used to establish the network infrastructure for the experiments. The functionality and performance of the proposed platform with 4-by-4 nodes are evaluated. The traffic on this platform is generated by a random traffic generating function. This experiment mainly evaluates the communication performance of the platform. Then, the proposed task binding method is applied to the platform with randomly generated task graphs. Finally, the performance of our task binding algorithm is evaluated on this platform.

The proposed switch architecture is modeled in both cycle-accurate C++ and Verilog HDL. The C++ model is used for system design and platform evaluation. The Verilog model is used for hardware design. After synthesis with a 0.25 µm standard CMOS technology, the switch can operate at 185 MHz in the typical-case corner.


To evaluate the traffic performance of the switch-based network platform, the C++ model of the switch is integrated into the overall model of the network, and each original processing element (PE) is replaced by a random traffic pattern generator. This pattern generator generates packets of random size that travel from an arbitrarily chosen source to a random destination.

The latency in this study is the elapsed time required for a data packet to be transmitted from the source node to the destination node [21]. Maximum latency is defined as the predicted worst-case latency and can be obtained from Property 6. Normalized latency is defined as the latency divided by the maximum latency and indicates the average performance. Injection rate is defined as the required bandwidth of the generated traffic divided by the guaranteed bandwidth of a communication path. By changing the injection rate, different communication loads are available to evaluate the platform.
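The two ratios defined above can be written down directly. The numbers in the sketch are hypothetical and serve only to show how the latency guarantee (normalized latency ≤ 1) would be checked against measured values.

```python
def normalized_latency(latency, max_latency):
    """Measured latency divided by the predicted worst-case latency."""
    return latency / max_latency

def injection_rate(required_bandwidth, guaranteed_bandwidth):
    """Required traffic bandwidth divided by the guaranteed path bandwidth."""
    return required_bandwidth / guaranteed_bandwidth

measured = [12, 30, 48]   # hypothetical packet latencies
worst_case = 80           # hypothetical bound obtained from Property 6
normalized = [normalized_latency(l, worst_case) for l in measured]
guarantee_met = all(v <= 1 for v in normalized)
rate = injection_rate(required_bandwidth=0.125, guaranteed_bandwidth=0.25)
```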

Fig. 2.19 shows histograms of the normalized latency at different injection rates when each virtual channel of the platform has only a 2-word buffer. The network latency guarantee (normalized latency ≤ 1) is achieved even at the high injection rate (injection rate = 1). This means that the proposed NoC platform guarantees a minimum bandwidth for each transmission. The normalized latency approaches zero as the injection rate decreases, which indicates that the average latency falls as the injection rate decreases. With the latency guarantee, the proposed platform is predictable and real-time systems can be realized.

Figure 2.20: Histograms of normalized latency under different buffer size of virtual channel.

Fig. 2.20 shows the normalized latency under different buffer sizes of the virtual channels. The normalized latency decreases as the buffer size increases. In general, the larger the buffers of the virtual channels in the switches, the better the communication performance the system achieves across applications.

In this switch-based network infrastructure, we assume that a processor can begin to operate only when all of its input data are available and its output buffer is large enough for the data it generates.


Task Graphs For Free (TGFF) [22], a user-controllable, general-purpose, pseudo-random task graph generator, is employed to generate the task graphs used in the experiments. One hundred task graphs are used; each has 60 to 100 tasks, and the maximum number of inputs/outputs per task is 7 to 10. Each task graph is generated by TGFF, and its communication amounts are then multiplied by a communication factor. A higher communication factor means a higher ratio of communication to computation. The communication factor also reflects the provided physical bandwidth: the larger the communication factor, the smaller the relative physical bandwidth. In this experiment, the communication factor is set to 0.25, 0.5, 1, 2, and 4. A task graph with a lower communication factor represents a computation-intensive application, whereas a higher factor represents a communication-intensive one. Fail rate is defined as the number of failed transactions divided by the total number of transactions. A higher fail rate means that more power is consumed by useless transactions, so the total power increases.

Fig. 2.21 shows the relationship between the fail rate and the communication factor for the traditional approach and the proposed one. The traditional approach is built without considering communication information. In Fig. 2.21, the fail rate increases with the communication factor and saturates at about 16%. This means that the fail rate stays under control even for communication-intensive applications. Compared with the traditional method, our approach improves the fail rate by 12.3% and 5.6% at communication factors 1 and 4, respectively.

Figure 2.22: Fail rate under different buffer size.

To reduce the fail rate, the buffer size is increased. As shown in Fig. 2.22, the fail rate decreases as the buffer size increases. A transaction fails when the buffers at the destination are full, so a larger buffer size yields a lower fail probability. Compared with the traditional method, the new approach improves the fail rate by 50.8% with 16 buffers.

Fig. 2.23 shows the communication latency versus the communication factor. The latency grows linearly with the communication factor. Compared with the traditional method, our approach improves the latency by 9.8% at communication factor 4.

Figure 2.24: Latency under different buffer size.

Increasing the buffer size also reduces the latency. As shown in Fig. 2.24, the latency decreases as the buffer size increases, which also means that a lower fail rate leads to a smaller latency. Compared with the traditional method, the new approach improves the latency by 14.8% with 16 buffers.

Fig. 2.25 compares the throughput ratio versus the communication factor for the proposed algorithms. The throughput ratio is defined as the throughput of the proposed algorithm divided by that of the traditional approach. The first curve shows the result of applying (2.5). The second curve adds the contention factor in the connection path assignment. Finally, the bandwidth penalty is added in the connection path assignment to improve performance under heavy load. For computation-intensive task graphs, the throughputs of the different algorithms are almost equal, which implies that the communication effect can be neglected. For communication-intensive task graphs, on the other hand, the bandwidth penalty becomes dominant because the required bandwidth exceeds the provided bandwidth, and balancing the bandwidth becomes the most important issue. As shown in Fig. 2.25, the proposed algorithm improves throughput by 20% under normal communication (communication factor = 1).

From the above simulation results, obtained with the new switch-based architecture model and the communication-aware methodology, the proposed approach outperforms the traditional method and provides high QoS, including high throughput, latency insensitivity, bandwidth guarantee, and high memory utilization.

2.5 Summary

This work proposes a switch-based network platform design that combines the latency-insensitive concept, virtual-circuit switching, weighted round-robin scheduling, and a pipelined bus. The platform has a latency guarantee and low average latency. The proposed task binding algorithm employs an iterative profile-driven optimization technique to reduce the effect of the communication amount and communication contention, so that high system throughput is achieved. The experimental results indicate that the task binding approach increases system utilization and improves the network throughput by up to 20%.


Chapter 3 FFT Processor

The FFT is one of the most widely used algorithms for calculating the Discrete Fourier Transform (DFT) owing to its efficiency in reducing computation time [23]. Recently, the FFT requiring real-time processing has played a significant role in many communication systems based on Orthogonal Frequency Division Multiplexing (OFDM) technology such as HDTV, xDSL modems and wideband mobile terminals.

Pipelined FFT implementations are highly appropriate for real-time applications, since a pipelined FFT can be easily merged with the sequential nature of sampling. Several FFT architectures have been developed, such as Radix-2 Multi-path Delay Commutator (R2MDC) [24], Radix-2 Single-path Delay Feedback (R2SDF) [25], Radix-2² Single-path Delay Feedback (R2²SDF) [26][27], Radix-4 Single-path Delay Feedback (R4SDF), and Radix-4 Multi-path Delay Commutator (R4MDC) [24]. Among these architectures, delay feedback approaches are always more efficient than the corresponding delay commutator approaches in terms of required memory size [26] [28]. The R4SDF requires fewer multipliers than the R2SDF; however, the R2SDF architecture is simple and regular. The R2²SDF architecture is a compromise that combines the R2SDF structure with the multiplicative complexity of the R4SDF. This study focuses on the R2SDF and R2²SDF architectures.

Since the pipelined FFT architecture is memory-consuming, reducing its memory requirement saves a significant amount of chip area. Several studies have employed regular module implementations and have attempted to reduce the area-consuming elements in the FFT design. The design of [29] reduces the amount of memory used to store the twiddle factors by employing canonic signed digit (CSD) constant multipliers. A new FFT architecture, the radix-2 single deep delay feedback (R2SD²F) presented in [30], has smaller complex multipliers and adders than other FFT designs. Both the designs of [29] and [30] have a fixed wordlength for data and coefficients in each pipeline stage. The possibility of using varying wordlengths for these stages is frequently ignored when modularized solutions are sought. However, the increasing use of intellectual property (IP) makes non-modular implementation viable, allowing further exploitation of pipelined architectures.

In general, an FFT cannot be implemented exactly. Each multiplier and adder in the pipelined FFT architecture can introduce errors due to rounding or truncation of arithmetic results. Errors typically accumulate over successive FFT stages; that is, errors from early stages can affect performance in later stages. The wordlengths of data and coefficients chiefly affect precision, quantization errors, and hardware complexity. Longer wordlengths increase precision and reduce quantization error at the cost of area and power. Conversely, to keep the hardware cost low, a shorter wordlength can be chosen at the sacrifice of precision. Therefore, identifying an optimized wordlength solution is necessary.

Two conventional methods for FFT error analysis of signal to quantization noise ratio (SQNR) and wordlengths are statistical error analysis and simulation-based analysis. Although the SQNR can be calculated efficiently by employing statistical models [31] [32] [33], the accuracy of the calculated result heavily depends on the model used. A more precise model yields more accurate results. The simulation-based method evaluates the FFT by comparing simulation results of the fixed-point computations with those obtained using floating-point arithmetic [34]. Although simulation increases the accuracy of the evaluation results, it is time-consuming.

According to error analysis, optimizing the wordlengths of pipeline stages in FFT processors for given specifications is feasible. Optimization of an 8192-point FFT processor using the simulation method has shown that progressive wordlengths and scaling in the early stages can achieve a good compromise between SQNR and hardware cost [35]. However, this approach requires a long time to run the simulation.


This work presents a statistical model for error analysis at the stage level with varying wordlengths in the pipelined FFT processor. Furthermore, a hybrid method for reducing the required simulation time is introduced. The optimized wordlength parameters at each stage are generated automatically according to the design specifications of FFT processors, such as the FFT length, the SQNR, and the real-time processing requirements. Finally, the optimization flow using the proposed error model and the hybrid method is demonstrated. The rest of this chapter is organized as follows. Section 3.1 gives a brief review of the FFT. Section 3.2 then introduces statistical and simulation-based error analyses and demonstrates the effectiveness of these methods. Section 3.3 describes the proposed method for wordlength optimization step by step, while Section 3.4 summarizes the experimental results. Conclusions are finally drawn in Section 3.5.

3.1 Overview of FFT

An FFT based on structuring the DFT computation by forming increasingly smaller subsequences of the input sequence x[n] is called a decimation-in-time (DIT) FFT. Alternatively, an FFT can also be decomposed using a first-half/second-half approach that divides the output sequence X(r) into increasingly smaller subsequences; this procedure is called a decimation-in-frequency (DIF) FFT [36]. Since both of these schemes are similar in nature, their performance cannot be exactly compared without a given architecture [33].
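The first-half/second-half decomposition can be seen in a short recursive sketch of a radix-2 DIF FFT. This is a textbook floating-point formulation written for illustration (the names `fft_dif` and `dft` are chosen here), not the pipelined hardware discussed in this chapter; the input length is assumed to be a power of two.

```python
import cmath

def fft_dif(x):
    """Radix-2 decimation-in-frequency FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    # Butterfly on the first and second halves of the input:
    # sums feed the even outputs, twiddled differences feed the odd outputs.
    top = [x[i] + x[i + half] for i in range(half)]
    bottom = [(x[i] - x[i + half]) * cmath.exp(-2j * cmath.pi * i / n)
              for i in range(half)]
    even, odd = fft_dif(top), fft_dif(bottom)
    out = [0j] * n
    out[0::2], out[1::2] = even, odd
    return out

def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

x = [1, 2, 3, 4, 0, -1, -2, -3]
spectrum = fft_dif(x)
reference = dft(x)
```

Each recursion level splits the output sequence into its even-indexed and odd-indexed halves, which is the sense in which the output is divided into increasingly smaller subsequences.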


Figure 3.1: Conventional R2SDF and R2²SDF DIF implementations. [Figure: (a) R2SDF architecture and (b) R2²SDF architecture; each is a chain of butterfly stages with feedback FIFOs, twiddle-factor multipliers Wp, stage wordlengths b₁ to b_P, and a clock-driven controller, with input x[n] and output X(n).]

In this work, the DIF algorithm is used to illustrate the architectural implementations. This work examines the R2SDF and R2²SDF architectures for the fixed-point DIF pipelined FFT processor to demonstrate the effectiveness of the proposed optimization method. Their block diagrams are shown in Fig. 3.1, where N is the FFT length, $b_k$ is the wordlength of stage k, k ∈ {1, 2, · · · , P}, and P = log₂N. Due to spatial regularity, both controllers in these architectures can be implemented using simple P-bit counters [25] [27]. Since the valid output range of the +/- operation of the FFT butterfly is double the valid input range, a scaling by 1/2 is applied to eliminate overflow.
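The effect of per-stage wordlengths and the 1/2 scaling can be explored with a simulation-style sketch. The model below is a simplification and not the proposed error model: it rounds each stage output to $b_k$ fractional bits, leaves the input and the twiddle factors unquantized, and uses the hypothetical names `quantize`, `scaled_dif_fft`, and `sqnr_db`.

```python
import cmath
import math
import random

def quantize(z, bits):
    """Round real and imaginary parts to `bits` fractional bits (assumed format)."""
    scale = 1 << bits
    return complex(round(z.real * scale) / scale, round(z.imag * scale) / scale)

def scaled_dif_fft(x, wordlengths=None):
    """Radix-2 DIF FFT with 1/2 scaling per stage; stage k is rounded to
    wordlengths[k-1] bits. Returns X/N in natural order.
    wordlengths=None keeps floating-point precision."""
    n = len(x)
    stages = n.bit_length() - 1
    data = list(x)
    span = n // 2
    for k in range(stages):
        for start in range(0, n, 2 * span):
            for i in range(span):
                a, b = data[start + i], data[start + i + span]
                w = cmath.exp(-2j * cmath.pi * i / (2 * span))
                top, bot = (a + b) / 2, (a - b) * w / 2   # butterfly scaled by 1/2
                if wordlengths is not None:
                    top = quantize(top, wordlengths[k])
                    bot = quantize(bot, wordlengths[k])
                data[start + i], data[start + i + span] = top, bot
        span //= 2
    # Undo the bit-reversed output order of the in-place DIF flow graph.
    rev = [int(format(i, f"0{stages}b")[::-1], 2) for i in range(n)]
    return [data[rev[i]] for i in range(n)]

def sqnr_db(reference, approx):
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - a) ** 2 for r, a in zip(reference, approx))
    return 10 * math.log10(signal / noise)

rng = random.Random(1)
x = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(64)]
ref = scaled_dif_fft(x)                          # floating-point reference
coarse = sqnr_db(ref, scaled_dif_fft(x, [8] * 6))
fine = sqnr_db(ref, scaled_dif_fft(x, [12] * 6))
```

Passing a non-uniform list such as increasing wordlengths per stage lets the same function compare progressive wordlength choices against a uniform one.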
