Chapter 1 Introduction

1.4 Organization

The organization of this thesis is as follows. Chapter 2 gives an overview of on-chip interconnection networks and describes the design concept of NoCs. Chapter 3 introduces the flow control mechanism and the interconnection network (crossbar), including its arbitration mechanism. Chapter 4 presents an efficient network interface for memory-centric on-chip interconnection networks, which reduces data blocking by means of a borrowing mechanism. Finally, a memory-centric on-chip data communication platform

Chapter 2

Previous Work of On-Chip Interconnection Network

In this chapter, I describe in section 2.1 why Network on Chip (NoC) will be the next major challenge in implementing complex, function-rich applications in advanced process technologies. The general design concept is discussed in section 2.2.

The interconnect architecture (topology) of an NoC should be efficient for a large number of processor elements. A number of different interconnect architectures are presented in section 2.3, together with some advanced topologies adapted to on-chip platforms. The switching fabric (also called the router) is a key component of a network-on-chip and directs the data communication. In section 2.4, I describe the components of the switch fabric and how they influence the NoC system; the section covers four parts: routing units, buffers, switching circuits, and the arbitration unit. The implementation of each unit is also described.

2.1 Why NoC?

System-on-chip (SoC) designs provide an integrated solution to challenging design problems in communications, multimedia, and consumer electronics.

Moreover, System-on-Chip designs become more complex every year, while the associated number of transistors grows exponentially. The successful design of SoCs depends on the availability of methodologies that allow designers to cope with two major challenges: the extreme miniaturization of device and wire features, and the extremely large scale of integration. Most SoCs will find their application within embedded systems, where traditional figures of merit such as performance, energy consumption, and cost will be as important as first-time-correct design, reliable operation, and robustness. Modern SoC design faces a number of problems caused by the scale and complexity of the designs. An ideal IP-based SoC requires on-chip bus interfaces between the IPs and a good verification environment [2.1][2.2].

Figure 2.1 Traditional Synchronous Bus

In the next SoC era, however, the traditional on-chip bus platform shown in Fig. 2.1 faces several challenges. First, the required on-chip communication bandwidth is growing beyond that provided by standard on-chip buses [2.3]. The shared bus architecture becomes a limiting factor as more IP blocks are integrated. Existing bus architectures and techniques are proving to be non-scalable and unable to meet leading-edge complexity and performance requirements.

Second, the interconnect delay across the chip exceeds the average clock period of the IP blocks, especially in nano-scale technologies [2.4]. The ratio of global interconnect delay to average clock period will continue to grow. In a 60 nm process, a signal can reach only 5% of the die’s length in a clock cycle. An interconnect channel design methodology for high-performance ICs has been proposed in [2.5]. It sizes the FIFOs in an interconnect channel containing one or more FIFOs connected in series, and shows that the FIFO sizing is a function of system parameters such as the data production rate, the communication rate, and the number of channel stages.

Third, in nano-scale technologies, the increased coupling between interconnects not only aggravates the power-delay metrics but also degrades signal integrity through capacitive and inductive crosstalk noise. Several options have been proposed to reduce the inter-wire capacitance. The first option is to widen the pitch between bus lines.

routing time do not allow us trying it to minimize the coupling capacitances. The third option is to change the geometrical shape of the bus lines, but the disadvantage of this method is that the frank area will increase since the cross-sectional area of a bus line is fixed. The fourth technique is to add a shielding line (VDD/Ground) between two adjacent signal lines. The fifth option is to reduce power through the use of bus encoding schemes [2.6][2.7][2.8][2.9][2.10][2.11]. According to the International Technology Roadmap for Semiconductors, by the end of the decade SoCs using 60 nm transistors operating below one volt will grow to 4 billion transistors running at 10 GHz.

On-chip physical interconnections will be a limiting factor for performance and energy consumption. Encoding schemes that address low power and reliability are proposed in [2.12]. Designers must overcome the challenge of noise to provide functionally correct, reliable operation of the interacting components. A robust self-calibrating transmission scheme for interconnections is proposed in [2.13]; it examines some physical properties of on-chip interconnects, with the goal of achieving fast, reliable, and low-energy communication.
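To illustrate how a bus encoding scheme can cut switching activity, the following minimal sketch implements bus-invert coding, a well-known low-power code chosen here only as an example; it is not claimed to be the scheme of any of the cited references. The function names and the assumption that the data lines start at zero are illustrative.

```python
def bus_invert_encode(words, width):
    """Encode words so that at most width // 2 data lines toggle per transfer."""
    mask = (1 << width) - 1
    prev = 0  # assumption: the data lines start at all zeros
    encoded = []
    for word in words:
        toggles = bin((prev ^ word) & mask).count("1")
        invert = toggles > width // 2
        # Send the complement when more than half of the lines would toggle.
        sent = (~word & mask) if invert else (word & mask)
        encoded.append((sent, int(invert)))
        prev = sent
    return encoded


def bus_invert_decode(encoded, width):
    """Recover the original words from (sent value, invert flag) pairs."""
    mask = (1 << width) - 1
    return [(~sent & mask) if invert else sent for sent, invert in encoded]


def data_line_toggles(values, width):
    """Total number of data-line transitions for a sequence of bus values."""
    mask = (1 << width) - 1
    prev, total = 0, 0
    for value in values:
        total += bin((prev ^ value) & mask).count("1")
        prev = value
    return total
```

For the alternating pattern 0x00, 0xFF, 0x00, 0xFF on an 8-bit bus, the raw stream toggles 24 data lines in total, while the encoded stream toggles none, at the price of three transitions on the extra invert line.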

Fourth, both the system design and its performance are limited by the complexity of interconnecting the different modules and blocks in a single-clocked design. Different data transfer speeds are required, as well as parallel transmission.

Traditional system buses may not be suitable for such a system, since only one module can transmit at a time. Additionally, the modern SoC designer assembles the system from ready-made virtual components, which might not be easily adaptable to different clocking situations. One solution to the above problems is a segmented bus design combined with the globally asynchronous locally synchronous (GALS) system architecture [2.14][2.15][2.16][2.17][2.18][2.19]. Asynchronous design can make the circuits resilient to delay variation.

Figure 2.2 (a) Multi-Layer Bus Architecture (b) Centralized Crossbar Switch

Figure 2.3 Network-on-Chip Architecture

To address the above-mentioned problems, new architectures for on-chip communication have been proposed for the next SoC era. The traditional synchronous on-chip bus architecture of Fig. 2.1 faces the series of severe tests described in the preceding paragraphs. The multi-layer on-chip shared bus of Fig. 2.2(a) is a revised version of the traditional on-chip bus that reduces the number of shared-medium channels [2.20][2.21][2.22]. It specifies an interconnect scheme that overcomes the limitations of the shared bus by enabling parallel access paths between multiple masters and slaves through a bus matrix. When each master has its own bus, the structure is equivalent to a full crossbar, as in Fig. 2.2(b). However, both centralized crossbar switching systems and multi-layer bus architectures suffer from complex wire routing, which introduces larger power consumption and interconnect delay as the number of processor elements increases.

The network-on-chip architecture shown in Figure 2.3 is based on a homogeneous and scalable switch fabric network, which considers all the requirements of on-chip

design-time specialization. The motivation for establishing an NoC platform is to achieve performance from a system perspective of communication. The core of NoC technology is the active switching fabric that manages multi-purpose data packets within complex, IP-laden designs. The most important characteristics of the NoC architecture can be summarized as a packet-switched approach [2.26], a flexible and user-defined topology, and a globally asynchronous locally synchronous (GALS) implementation.

2.2 The Design Concept of Network-on-Chip

The topic of Network-on-Chip (NoC) design is vast and complex, and there is a large literature on NoC architectures. On-chip communication can be abstracted as a micro-network, and the levels of the micro-network stack can be analyzed bottom-up, from the physical layer to the software layer, as shown in the right part of Fig. 2.4. NoC protocols are typically organized in layers, in a fashion that resembles the OSI protocol stack shown in the left part of Fig. 2.4 [2.27]. However, the OSI protocol stack was devised for a macro-network. For a micro-network, the protocol stack is reduced to the physical layer, the data-link layer, the network and transport layers, and the software layer [2.28]. The characteristics of each layer are described in this section.

Figure 2.4 The design abstraction levels of NoC

2.2.1 The Design Abstraction Levels of Network-on-Chip

NoC protocols are described bottom-up, from the physical layer up to the software layer. In the physical layer, global wires are the physical implementation of the communication channels. Traditional rail-to-rail voltage signaling with capacitive termination, as used today for on-chip communication, is not well suited to high-speed, low-energy communication over future global interconnect. Reduced-swing signaling can significantly reduce communication power dissipation while preserving the speed of data communication. Nevertheless, as technology trends lead to smaller voltage swings and capacitances, upset probabilities will rise. A well-balanced design should not over-design wires so that their behavior approaches the ideal, because the corresponding cost in performance, energy efficiency, and modularity may be too high. Physical layer design should find a compromise between competing quality metrics and provide a clean and complete abstraction of channel characteristics to the micro-network layers above.

Due to the limitations at the physical level and the high bandwidth requirement, SoC designs will use network architectures similar to those used for multi-processors. Network-on-chip design entails the specification of network architectures and control protocols. The data-link layer abstracts the physical layer as an unreliable digital link, where the probability of bit upsets is nonzero. Furthermore, reliability can be traded off for energy. The main purpose of data-link protocols is to increase the reliability of the link up to a minimum required level, under the assumption that the physical layer by itself is not sufficiently reliable. At the data-link layer, error correction can be complemented by several packet-based error detection and recovery protocols. Several protocol parameters can be adjusted to achieve maximum performance at a specified residual error probability within given energy consumption bounds.
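The detect-and-recover idea at the data-link layer can be sketched as follows. This is a minimal illustration only: it assumes a single even-parity bit per frame and a simple retransmission policy, and the `channel` callable is a hypothetical stand-in for the unreliable physical link. Real NoC links may use stronger codes.

```python
def add_parity(bits):
    """Append an even-parity bit to a list of 0/1 values."""
    return bits + [sum(bits) % 2]


def check_parity(frame):
    """Return True when the frame has even parity (no single-bit upset detected)."""
    return sum(frame) % 2 == 0


def send_with_retry(bits, channel, max_attempts=3):
    """Retransmit a frame until the parity check passes at the receiver.

    channel is a hypothetical callable modelling the unreliable link:
    it takes the transmitted frame and returns the received frame.
    Returns the recovered payload and the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        received = channel(add_parity(bits))
        if check_parity(received):
            return received[:-1], attempt
    raise RuntimeError("link failed after retries")
```

A single parity bit detects any odd number of bit upsets but misses even numbers, so the residual error probability depends on the code chosen; this is one of the protocol parameters that can be traded against energy.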

At the network layer, packet data transmission can be customized by the choice

and energy consumption. Robustness and fault tolerance will also be highly desirable.

At the transport layer, algorithms deal with the decomposition of messages into packets at the source and their reassembly at the destination. Packetization granularity is a critical design decision because the behavior of most network control algorithms is very sensitive to packet size. Packet size can be application-specific in SoCs, as opposed to general-purpose networks.
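The decomposition and reassembly steps can be sketched as below. The packet layout (sequence number, packet count, payload) and the function names are assumptions made for illustration; the payload size is the packetization granularity discussed above.

```python
def packetize(message: bytes, payload_size: int):
    """Split a message into (sequence number, packet count, payload) packets."""
    if payload_size <= 0:
        raise ValueError("payload_size must be positive")
    chunks = [message[i:i + payload_size]
              for i in range(0, len(message), payload_size)]
    return [(seq, len(chunks), chunk) for seq, chunk in enumerate(chunks)]


def reassemble(packets):
    """Rebuild the message at the destination, tolerating out-of-order arrival."""
    ordered = sorted(packets, key=lambda packet: packet[0])
    total = ordered[0][1] if ordered else 0
    if len(ordered) != total:
        raise ValueError("missing packets")
    return b"".join(payload for _, _, payload in ordered)
```

With a 4-byte payload, the 15-byte message `b"network-on-chip"` is split into four packets, and `reassemble` restores it even if the packets arrive in reverse order.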

The software layers comprise system and application software, including the processing-element and network operating systems. The system software provides an abstraction of the underlying hardware platform. Moreover, policies implemented at the system software layer request either specific protocols or parameters at the lower layers to achieve the appropriate information flow. The hardware abstraction is coupled to the design of wrappers for processor cores, which act as network interfaces between the cores and the NoC architecture.

2.3 Topologies for Network-on-Chip Architecture

Figure 2.5 NoC architecture (a) SPIN (b) Mesh (c) Torus (d) Folded torus (e) Octagon (f) Butterfly Fat Tree

Network on Chip (NoC) technologies will enable the design of parallel systems that resemble cellular structures and include thousands of processors. Such systems, combined with multi-threaded computing, can increase system efficiency for fine-grain parallel programs [2.29][2.30]. Therefore, the interconnect architecture of an NoC should be efficient for a large number of processor elements. A number of different interconnect architectures have been proposed, as shown in Fig. 2.5. Their origins can be traced back to the field of parallel computing. However, a different set of constraints exists when adapting these architectures to the SoC design paradigm.

2.3.1 Conventional Topologies of Network-on-Chip

A generic interconnect template called SPIN (Scalable, Programmable, Integrated Network) has been proposed for on-chip packet-switched interconnections, as shown in Fig. 2.5(a), where a fat-tree architecture is used to interconnect IP blocks. In this fat tree, every node has four children and the parent is replicated four times at any level of the tree. The functional IP blocks reside at the leaves and the switches reside at the vertices. A mesh-based [2.31][2.32] interconnect architecture consists of an m × n mesh of switches interconnecting computational resources (IPs) placed along with the switches, as shown in Fig. 2.5(b). Every switch, except those at the edges, is connected to four neighboring switches and one IP block.

A 2-D torus has also been proposed as an NoC architecture, as shown in Fig. 2.5(c). The torus architecture is basically the same as a regular mesh. The only difference is that the switches at the edges are connected to the switches at the opposite edge through wrap-around channels. Every switch has five ports, one connected to the local resource and the others connected to the closest neighboring switches. The long end-around connections can yield excessive delays. However, this can be avoided by folding the torus as in Fig. 2.5(d), which yields a layout more suitable for VLSI implementation.
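The effect of the wrap-around channels on path length can be seen in a short sketch. It assumes minimal routing and switches addressed by (column, row) coordinates; the function names are illustrative.

```python
def mesh_hops(src, dst):
    """Minimal hop count between two switches in a 2-D mesh (no wrap-around)."""
    (x0, y0), (x1, y1) = src, dst
    return abs(x1 - x0) + abs(y1 - y0)


def torus_hops(src, dst, cols, rows):
    """Minimal hop count in a 2-D torus: each dimension may use the wrap-around channel."""
    (x0, y0), (x1, y1) = src, dst
    dx = abs(x1 - x0)
    dy = abs(y1 - y0)
    return min(dx, cols - dx) + min(dy, rows - dy)


if __name__ == "__main__":
    # Corner-to-corner traffic in a 4 x 4 network.
    print(mesh_hops((0, 0), (3, 3)))         # 6
    print(torus_hops((0, 0), (3, 3), 4, 4))  # 2
```

The hop count ignores wire length, so it does not capture the extra delay of the long end-around links that motivates the folded torus.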

Karim et al. [2.33] have proposed the OCTAGON MP-SoC architecture. Fig. 2.5(e) shows a basic octagon unit consisting of eight nodes and 12 bidirectional links. Each node is associated with a processing element and a switch. Communication between any pair of nodes takes at most two hops within the basic octagonal unit. For a system consisting of more than eight nodes, the octagon is extended to multidimensional space. Of course, this type of interconnection mechanism may significantly increase the wiring complexity. In a Butterfly Fat-Tree (BFT)

number of switches in the butterfly fat tree architecture converges to a constant independent of the number of levels.
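The two-hop property of the basic octagon unit described above can be checked with a short breadth-first search. The wiring used here (eight ring links plus four links between opposite nodes) is an assumption consistent with the stated count of 12 bidirectional links; the function names are illustrative.

```python
from collections import deque


def octagon_links():
    """Assumed wiring: 8 ring links plus 4 links between opposite nodes."""
    ring = {frozenset((i, (i + 1) % 8)) for i in range(8)}
    cross = {frozenset((i, i + 4)) for i in range(4)}
    return ring | cross


def max_hops(links, n=8):
    """Largest shortest-path length over all node pairs, by breadth-first search."""
    neighbors = {i: set() for i in range(n)}
    for link in links:
        a, b = tuple(link)
        neighbors[a].add(b)
        neighbors[b].add(a)
    worst = 0
    for start in range(n):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst
```

Under this wiring, `len(octagon_links())` is 12 and `max_hops(octagon_links())` is 2.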

2.3.2 Advanced Network-on-Chip Architectures

A popular network topology for NoC implementations is the two-dimensional mesh, which provides a regular topology and regular communication. Therefore, many advanced NoC architectures based on mesh topologies have been proposed.

An advanced NoC architecture called Xpipes (Fig. 2.6), targeting high-performance and reliable communication for on-chip multi-processors, is introduced in [2.34]. Data links can be pipelined with a flexible number of stages to decouple link throughput from link length and to support arbitrary topologies. The I/O ports of each switch can be parameterized, and Xpipes is optimized from a tile-based network-on-chip architecture.

Although it addresses floorplanning and differing bandwidths between neighboring IP blocks, it still belongs to the class of 2-D link architectures.

Figure 2.6 Xpipes Architecture

A hierarchical network-on-chip, shown in Figure 2.7, has also been proposed. The network on chip is divided into two kinds of network, local and global. The local network preserves the features of a 2-D link network on chip, and the global network is designed as a centralized crossbar [2.35].

As the number of processor elements and local networks increases, however, the global network might be designed as distributed crossbars. In Figure 2.7, block M denotes a memory block and block P denotes a processor element. Other hierarchical or hybrid network-on-chip architectures have also been proposed to accommodate multiple processor elements and heterogeneous systems [2.36][2.37][2.38][2.39].

Figure 2.7 Hierarchy Network-on-Chip Architecture

In order to achieve better performance, functionality, and packaging density, three-dimensional (3D) ICs with multiple layers of active devices have been proposed. In addition, 3D ICs allow for performance enhancements in the absence of scaling, because each transistor can reach more nearest neighbors over shorter interconnects. The performance improvement arising from the architectural advantages of NoCs will be significantly enhanced if 3D ICs are adopted as the basic fabrication methodology. Therefore, new 3-D network topologies have also been proposed for future ICs [2.40].

2.4 Switching Fabrics in Network-on-Chip

The switching fabric (also called the router) is a key component of a network-on-chip and directs the data communication. Every processor element is called a resource and connects to a switch fabric. The resources consist of processor elements, IP blocks, embedded memory, DMA controllers, etc. The implementation of the routers depends on the topology and protocol of the network-on-chip. In addition, the topology and control flow are design issues for the interfaces of the processor elements. Regardless of the network-on-chip architecture, the router can be divided into five parts as follows:

• Switching Circuit

• Arbitration Unit

The link control units (routing units) control the communication in the network-on-chip backbone, and the arbitration unit arbitrates among contending data routed to the same channel. The former should avoid deadlock [2.41] in the on-chip communication and the traffic congestion introduced by poor routing algorithms. It also influences the buffer sizes, the number of MUXs used for switching, and the complexity of the interconnect. For example, in a mesh network-on-chip each switch connects to neighboring switches in four directions. The links of the switches are shown in Figure 2.5(b), and the architecture of the switches in a mesh (tile-based) NoC is shown in Figure 2.10.

The link control unit (routing unit) is not discussed here, because this thesis focuses on the buffers (queues), the switching circuit (network interface), and the arbitration unit. The details of these units are described in the following sections.
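As a simple illustration of what an arbitration unit does when several inputs contend for the same channel, the sketch below uses a round-robin policy. The policy, the class name, and the interface are assumptions chosen for illustration; they are not the arbitration mechanism developed in this thesis.

```python
class RoundRobinArbiter:
    """Grant one requesting input per cycle, rotating priority after each grant."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.last = num_inputs - 1  # so that input 0 has the highest priority first

    def grant(self, requests):
        """requests: iterable of input indices requesting the channel this cycle.

        Returns the granted input index, or None when nobody requests.
        """
        requesting = set(requests)
        for offset in range(1, self.num_inputs + 1):
            candidate = (self.last + offset) % self.num_inputs
            if candidate in requesting:
                self.last = candidate
                return candidate
        return None
```

If inputs 0 and 2 of a four-input arbiter request the channel in every cycle, successive calls to `grant` return 0, 2, 0, 2, so neither input is starved.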

2.4.1 Buffers in Switch Fabrics

In a network-on-chip platform, buffers significantly affect the overall performance and the arbitration algorithm. Buffers provide local storage for data that cannot be routed immediately. Unfortunately, queuing buffers have a high cost in terms of area and power consumption, and thus many NoC implementations make do with limited buffer sizes. If the design lacks sufficient buffer space, the buffers may fill up too fast; over-provisioning of buffers, on the other hand, clearly wastes scarce area resources [2.42].

Queuing buffers are used in switch fabrics or network interfaces to store un-routed data, and buffer architectures can be classified by the location and circuit implementation of the buffers. Among the blocks that make up an NoC, the queuing buffers consume the most area and power. However, insufficient buffer size is one cause of the head-of-line blocking problem shown in Fig. 2.8. When the head data of a virtual channel cannot be routed, the data queued behind it keep occupying the buffer and cannot advance, which degrades the performance of the network. This is the so-called “head-of-line blocking problem.” Head-of-line blocking not only reduces performance but also increases the power consumption of on-chip communication. Therefore, head-of-line blocking is a key factor in evaluating different buffer architectures.

Figure 2.8 Head-of-line blocking problem
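Head-of-line blocking can be reproduced with a small cycle-based model. The sketch below is a simplified illustration under stated assumptions: each free output accepts one word per cycle, some outputs stay blocked for the whole run, and the `head_only` flag switches between a plain FIFO and an idealised buffer in which any queued word may leave. All names are hypothetical.

```python
from collections import deque


def simulate(queues, blocked_outputs, cycles, head_only=True):
    """Count the words delivered by a switch with input buffers.

    queues: one list of destination output numbers per input port.
    blocked_outputs: outputs that cannot accept data during the simulation.
    head_only: if True, only the head word of each FIFO may leave (plain FIFO);
               if False, any word may leave (an idealised, non-FIFO buffer).
    """
    fifos = [deque(queue) for queue in queues]
    delivered = 0
    for _ in range(cycles):
        busy = set(blocked_outputs)  # each free output accepts one word per cycle
        for fifo in fifos:
            candidates = list(fifo)[:1] if head_only else list(fifo)
            for dest in candidates:
                if dest not in busy:
                    fifo.remove(dest)  # removes the first queued word bound for dest
                    busy.add(dest)
                    delivered += 1
                    break
    return delivered
```

With `queues = [[1, 2, 2], [3]]` and output 1 blocked, three cycles deliver only one word when `head_only=True`, because the two words bound for the free output 2 are stuck behind the blocked head. The idealised buffer delivers three words in the same time.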

Depending on their location, queuing buffers can be placed either before or after the interconnection matrix in a switch fabric; these are called input buffers and output buffers, respectively. The two behave differently. If a data word is delayed in a switch fabric with input buffers, it stalls all data words arriving on the same input: none of them can be processed until the first one has been forwarded successfully. With output buffers the situation is different, because the switching is performed before the buffering. If a switch fabric cannot send the data over one of its outputs, the buffers at that output fill up, but there is no immediate influence on the inputs, and successive data words can still be received. An architectural disadvantage of output buffering is that in one cycle, data from multiple input ports may have to be written to the same
