Chapter 1 Introduction
1.5 The focus of this thesis
This thesis starts from the premise that communication will become the bottleneck of future system performance and that on-chip communication will be treated as a micro-network. We propose a novel network platform and related infrastructure for on-chip communication. Because our platform exhibits predictable performance, system designers can also use our framework to analyze system performance and make better decisions at a higher level.
The rest of this thesis is organized as follows. In Chapter 2, we introduce basic network concepts. In Chapter 3, we highlight the requirements of future networks and present the details of our platform design and a novel switch design for on-chip communication. We prove the correctness of our platform and explore part of the design space in Chapter 4. Finally, we give conclusions and future work in Chapter 5.
Chapter 2
Preliminaries
In this chapter, we introduce some basic concepts and related issues in constructing a network. This chapter provides the background for our platform and switch design in Chapter 3.
2.1 Topology
The word “topology” defines how the nodes are interconnected by channels and is usually modeled by a graph [7]. The nodes include communication fabrics, bridges and processors. Major network topologies can be categorized as direct networks and indirect networks. In a direct network, nodes are connected directly to each other by the network. In an indirect network, nodes are connected through one or more intermediate switching nodes. The switching nodes perform the routing and arbitration operations. Because of different performance requirements and cost trade-offs, many different network topologies have been designed for specific applications [11]. We give a brief description of some popular network topologies below.
2.1.1 Direct network topologies
1. Orthogonal
A network topology is orthogonal if and only if its nodes can be arranged in an orthogonal n-dimensional space. The most popular direct networks are the k-ary n-dimensional mesh, the k-ary n-dimensional cube (torus) and the hypercube, as shown in Figure 3. Such topologies exhibit regularity and symmetry.
Figure 3 : Orthogonal network topology (4-ary 2-dim mesh, 4-ary 2-dim torus, 2-ary 4-cube (hypercube))
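To make the three orthogonal topologies of Figure 3 concrete, the following Python sketch enumerates the neighbors of a node. It is only an illustration: the function names and the representation of a node as a coordinate tuple are assumptions made for this example, and the hypercube is treated as the special case k = 2.

```python
from itertools import product


def neighbors(node, k, wrap):
    """Neighbors of `node` in a k-ary n-dimensional mesh (wrap=False)
    or torus (wrap=True). `node` is a tuple of n coordinates in range(k)."""
    result = set()
    for dim, coord in enumerate(node):
        for step in (-1, 1):
            c = coord + step
            if wrap:
                c %= k            # torus: wrap-around link
            elif not 0 <= c < k:
                continue          # mesh: no link past the boundary
            candidate = node[:dim] + (c,) + node[dim + 1:]
            if candidate != node:
                result.add(candidate)
    return sorted(result)


def degree_range(k, n, wrap):
    """Smallest and largest node degree over the whole topology."""
    degrees = [len(neighbors(v, k, wrap)) for v in product(range(k), repeat=n)]
    return min(degrees), max(degrees)


print(neighbors((0, 0), 4, wrap=False))  # corner of the 4-ary 2-dim mesh
print(degree_range(4, 2, wrap=False))    # mesh: corner and interior degrees differ
print(degree_range(4, 2, wrap=True))     # torus: every node has the same degree
print(degree_range(2, 4, wrap=True))     # 2-ary 4-cube (hypercube)
```

In this sketch the mesh has node degrees between 2 and 4, while the torus and the hypercube have degree 4 at every node.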
2. Other direct network topologies
In addition to the topologies defined above, many other topologies with different properties have been proposed, as shown in Figure 4. The cube-connected-cycles topology was proposed as an alternative to orthogonal topologies that reduces the degree of each node. The tree topology, in which each node is in turn connected to a disjoint set of descendants, offers the advantage of low implementation cost. The star graph was proposed to minimize the network diameter of cube-connected cycles. However, it needs a more complex routing algorithm.
Figure 4 : Other direct network topology
2.1.2 Indirect network topologies
1. Crossbar networks
Crossbar networks allow any node in the system to communicate with any other node directly, as shown in Figure 5. In this way, several processors or memories can communicate simultaneously without contention. The disadvantage of crossbar networks is their cost, so they have traditionally been used in small-scale systems [11].
2. Multi-stage interconnection network
Multi-stage interconnection networks (MINs) connect the input nodes to the output nodes through stages of switches, where each switch is a crossbar network. The number of stages and the connections between switch stages determine the routing capability of the network. Depending on the interconnection scheme employed between two adjacent stages, various MINs have been proposed.
The fat-tree is one classical MIN topology. A fat-tree network can provide multiple data paths from a source node to a destination node, with the path chosen depending on path usage. As shown in Figure 5, the latency is directly proportional to the depth of the tree.
Figure 5 : Indirect network topology
Among these topologies, the 2-D mesh is considered the most suitable topology for on-chip networks because it offers an acceptable wire cost, reasonably high bandwidth, and easy grouping of components on a plane.
2.2 Switching strategy
Switching strategy is defined as the method used to exchange data between network components. Common switching strategies can be classified into two categories: connection-oriented and connection-less.
The connection-oriented switching technique is widely used in telecommunication. It is also called circuit switching because the connection from source to destination is built before data transmission. Once the connection is established, data can be transmitted from source to destination with guaranteed bandwidth and is delivered without any contention. This property makes circuit switching suitable for building real-time systems. The strategy is advantageous when messages are long and infrequent.
As an alternative to connection-oriented switching, connection-less switching partitions data into several packets before transmission. The routing and transmission of each packet are handled individually by the network fabric. Because no channel bandwidth is reserved, bandwidth is utilized more efficiently. Common types of connection-less switching include store-and-forward, virtual-cut-through and wormhole switching.
Store-and-forward switching is so named because each packet is completely buffered at each intermediate node before it is forwarded to the next node. The intermediate switch extracts the header information of each packet to determine the output channel over which the packet is to be forwarded. In contrast to circuit switching, store-and-forward switching is advantageous when messages are short and frequent. However, implementing store-and-forward switching is expensive because each switch must have enough buffer space to hold a whole packet.
Unlike store-and-forward switching, in which a switch must hold the whole packet before forwarding it to the next switch, virtual-cut-through switching can start the transmission as soon as the routing decision for the packet is made and the output channel is free. In fact, the packet does not even have to be stored in the output buffer and can cut through to the input of the next switch before the complete packet is received at the current switch. In the absence of blocking, virtual-cut-through switching performs better than store-and-forward switching because the packet is effectively pipelined through successive switches. If the header of the packet is blocked on a busy output channel, virtual-cut-through switching holds the complete packet in the switch and behaves like store-and-forward switching.
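The benefit of pipelining can be seen in a simplified, contention-free latency model, sketched below in Python. The model is an assumption made for illustration, not a measurement of any real switch: each hop costs `routing_delay` cycles for the routing decision, every link transfers one flit per cycle, and the function and parameter names are hypothetical.

```python
def store_and_forward_latency(hops, packet_flits, routing_delay=1):
    """Cycles for one packet when every switch buffers the whole packet
    before forwarding it (no contention assumed)."""
    return hops * (routing_delay + packet_flits)


def cut_through_latency(hops, packet_flits, routing_delay=1):
    """Cycles for one packet when the header is forwarded as soon as it is
    routed and the remaining flits follow in a pipeline (no contention assumed)."""
    header = hops * (routing_delay + 1)  # header: routing plus one link cycle per hop
    body = packet_flits - 1              # remaining flits arrive one per cycle
    return header + body


for hops in (1, 4, 8):
    print(hops,
          store_and_forward_latency(hops, packet_flits=16),
          cut_through_latency(hops, packet_flits=16))
```

Under these assumptions a 16-flit packet crossing 4 hops needs 68 cycles with store-and-forward switching and 23 cycles with cut-through switching. The two models agree for a single hop, and the gap grows with distance because store-and-forward latency scales with the product of distance and packet length.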
The requirement to buffer a whole packet in each switch makes it difficult to construct faster and smaller switches. In wormhole switching, packets are pipelined through the network as in virtual-cut-through switching, but the buffer requirements of the switches are reduced. Because the buffer in a switch cannot hold a whole packet, a packet that is blocked in the network occupies buffers in several switches along its path. This degrades network performance because the blocked packet in turn blocks other packets that need those buffers. Moreover, it often leads to deadlock. A deadlock is a network state in which some packets cannot advance toward their destinations because the buffers they request are full. As shown in Figure 6, all the packets involved in a deadlocked configuration are blocked forever [7].
Figure 6 : Deadlock situation
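A deadlocked configuration corresponds to a cycle of packets in which each packet waits for a buffer held by the next one. The Python sketch below checks a wait-for relation for such a cycle. The representation (a dictionary from each packet to the packet holding the buffer it needs) and the packet names are assumptions made for this example.

```python
def find_deadlock(waits_for):
    """Return the packets forming a wait-for cycle, or [] if there is none.

    `waits_for` maps each packet to the packet that holds the buffer it is
    requesting, or to None if the packet can advance."""
    for start in waits_for:
        seen = []
        current = start
        # Follow the chain of blocked packets until it ends or repeats.
        while current is not None and current not in seen:
            seen.append(current)
            current = waits_for.get(current)
        if current is not None:
            return seen[seen.index(current):]  # the cycle itself
    return []


# Four packets, each waiting for a buffer held by the next: a deadlock.
print(find_deadlock({"A": "B", "B": "C", "C": "D", "D": "A"}))
# The chain ends at a packet that can advance: no deadlock.
print(find_deadlock({"A": "B", "B": None}))
```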
Virtual channels were originally introduced to solve the deadlock problem in wormhole switching. The key idea is to multiplex the physical channel to support several virtual channels. Logically, each virtual channel operates as if it were a distinct physical channel running at lower bandwidth. By providing two virtual channels on the output channel of each switch, as in Figure 7 (virtual channels), all the packets blocked in the switches continue to make progress with half the channel bandwidth, as shown in Figure 7 (virtual channels solve the deadlock problem). This technique not only solves the deadlock problem but also improves network throughput.
Figure 7 : Virtual channel
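The multiplexing of one physical channel among several virtual channels can be sketched as follows in Python. This is an illustrative model only: round-robin selection, the one-flit-per-cycle channel, and all names are assumptions made for this example, and the `blocked` set stands for virtual channels whose downstream buffer is full.

```python
from collections import deque


def multiplex(virtual_channels, blocked=frozenset(), cycles=8):
    """Send at most one flit per cycle over the shared physical channel,
    visiting the virtual channels in round-robin order and skipping any
    channel that is blocked or empty. Returns the flits in sending order."""
    queues = {name: deque(flits) for name, flits in virtual_channels.items()}
    order = list(queues)
    sent = []
    pointer = 0
    for _ in range(cycles):
        for offset in range(len(order)):
            name = order[(pointer + offset) % len(order)]
            if name not in blocked and queues[name]:
                sent.append(queues[name].popleft())
                pointer = (pointer + offset + 1) % len(order)
                break
    return sent


vcs = {"vc0": ["a0", "a1", "a2"], "vc1": ["b0", "b1", "b2"]}
print(multiplex(vcs, cycles=6))                   # the two packets share the channel
print(multiplex(vcs, blocked={"vc0"}, cycles=6))  # vc1 still advances while vc0 waits
```

When both virtual channels are active, each receives half of the physical bandwidth. When one is blocked, the other still makes progress, which is the property that breaks the wait-for cycle.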
2.3 Routing algorithm
Routing algorithms determine the path followed by each packet. Figure 8 presents a taxonomy of routing algorithms classified according to several criteria. Routing algorithms can first be classified according to the number of destinations: a packet may have a single destination or be sent to multiple destinations. Routing algorithms can also be classified according to where the routing decisions are made. The route can be established by a centralized controller (centralized routing), at the source node before the packet is injected (source routing), or in a distributed manner while the packet travels across the network (distributed routing). Hybrid schemes are also possible.
Moreover, routing algorithms can be classified according to the way they are implemented. The most popular implementations either look up a routing table or execute a routing algorithm in software or hardware according to a finite-state machine. In both cases, the routing algorithm can be either deterministic or adaptive, depending on whether packets between a given source/destination pair always follow the same path.
Adaptive routing algorithms can be further classified by their progressiveness as progressive or backtracking. Progressive routing algorithms move the header forward, reserving a new channel at each routing operation. Backtracking algorithms allow the header to backtrack when it is blocked, and are mainly used for fault tolerance. Adaptive routing algorithms can also be classified by the length of the routing path as profitable or misrouting. Profitable routing algorithms always move the packet closer to its destination, while misrouting algorithms may send the packet away from it. Finally, adaptive routing algorithms can be classified by the number of alternative paths as completely adaptive or partially adaptive.
Figure 8 : A taxonomy for routing algorithms [7]
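As a concrete instance of one branch of this taxonomy, the Python sketch below implements dimension-order (XY) routing on a 2-D mesh, a standard example of a unicast, distributed, deterministic and profitable algorithm. It is shown only to make the taxonomy concrete and is not the routing scheme proposed in this thesis. The (x, y) coordinate convention and the function names are assumptions made for this example.

```python
def xy_next_hop(current, destination):
    """Next node on the XY route, or None when the packet has arrived.
    The packet first corrects its x coordinate, then its y coordinate."""
    (cx, cy), (dx, dy) = current, destination
    if cx != dx:
        return (cx + (1 if dx > cx else -1), cy)
    if cy != dy:
        return (cx, cy + (1 if dy > cy else -1))
    return None


def xy_route(source, destination):
    """Full path from source to destination, decided hop by hop."""
    path = [source]
    while True:
        nxt = xy_next_hop(path[-1], destination)
        if nxt is None:
            return path
        path.append(nxt)


print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Every hop reduces the distance to the destination (profitable), each switch decides only the next hop (distributed), and a given source/destination pair always yields the same path (deterministic).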
2.4 Transaction protocol
In the networking and telecommunication fields, a protocol is an agreed-upon format or set of rules for transmitting data between two devices. A protocol determines the type of error detection or error correction to be used, the data compression method, and the handshaking convention between the sending device and the receiving device. It not only defines how senders and receivers execute communication transactions, but also determines how data flows across the network. For on-chip communication, the choice of protocol greatly influences reliability and power consumption.
Chapter 3
Our Platform and Switch Design
In this chapter, we describe our platform and switch design in detail. First, we discuss what a future network infrastructure should provide for a communication-driven system design methodology. Next, we present a complete description of our platform and switch design, and illustrate how the switch is used to transmit messages between components. Finally, we review how our platform meets the requirements for constructing future on-chip networks.
3.1 What network we need
When it comes to constructing the network infrastructure for future systems on chip, the hardest problem is meeting the various communication requirements of different application domains. Some applications, such as Software Defined Radio and MPEG codecs, can be processed as parallel threads and need only local, fixed communication bandwidth. Other applications may produce irregular traffic loads across communication channels. Here we summarize what a future communication infrastructure should provide in a communication-driven system design flow.
1. Efficient communication
When we construct a network infrastructure for a system on chip, the first task is to balance computing power and communication capability. Suppose we implement a system by integrating powerful processing elements that coordinate with other components, yet provide only a poor communication infrastructure. The first problem is that messages transmitted between these components spend unnecessary time in transit. Many jobs assigned to processing elements are postponed because the data they need is delayed. This is a serious problem: it not only leaves processing elements idle, which degrades overall system performance, but also consumes extra power while processing elements wait for data. Thus, we must provide a network infrastructure that meets a high network utilization criterion.
2. Guaranteed throughput
For some applications, meeting real-time requirements is the critical issue. Circuit switching may be a good choice for such applications because it provides transmission with guaranteed throughput.
3. Fault tolerance capability
Even with advances in semiconductor technology, it still cannot be guaranteed that a chip is fabricated without any defects. This problem becomes worse in the deep-submicron era. When several processing elements are integrated on one chip, manufacturing faults become more likely. There may be faulty memory cells, wrong connections between components, or failed processing elements. These faults may be rare, but they are unavoidable. A future network infrastructure should provide mechanisms that keep the whole system working smoothly even with faulty components on it.
In this chapter, we propose a novel platform and switch design as a feasible solution to these network requirements and as the network infrastructure for the future communication-driven system design methodology.
3.2 Switch architecture
3.2.1 System Scheme
Figure 9 : A 2-D mesh switched network with 2x3 nodes
Our platform uses a 2-D mesh topology to organize on-chip components, as shown in Figure 9. The main reasons for selecting the two-dimensional mesh are its acceptable wire cost and the ease of grouping components on a plane [5][11]. In our platform, the network is composed of 5-port switches. Processors use a network interface to communicate over the network.
The architecture of the 5-port switch is shown in Figure 10. The switch has four ports connected to neighboring switches and one port connected to the local processing element. Each port is composed of an input stage and an output stage, which are shown in Figure 13 and Figure 14.
Figure 10 : Switch architecture
The basic transmission procedure is illustrated in Figure 11. Suppose that a packet enters the current switch from the neighboring switch to the west and is to be forwarded to the neighboring switch to the east. In the current switch, the packet is first received by the input stage of the west port and stored in the memory of the output stage of the east port. Once the output channel of the east port, which connects to the neighboring switch, becomes available, the output stage of the east port sends the packet to the next switch.
Figure 11 : Basic transmission procedure
The interface of the switch is composed of an input channel and an output channel. As shown in Figure 12, each channel contains an Address-line, a Data-line and an Ack-line. The Address-line carries the input or output address of the packet. The Data-line carries the transmitted data. The Ack-line feeds an acknowledgement back to the source switch or processing element to report the result of the transmission. The output channel and the input channel are complementary to each other.
Figure 12 : Switch interface
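One possible reading of the three lines is sketched below in Python: the sender drives an address and a data word, and the receiver answers with a positive or negative acknowledgement depending on whether the addressed buffer has room. This is a behavioral illustration under our own assumptions, not the hardware interface itself. The class name, the word-granular transfer, the boolean acknowledgement and the buffer names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReceiverPort:
    """Receiving side of one channel. `free_words` maps a buffer address to
    the number of words it can still accept (hypothetical representation)."""
    free_words: dict
    stored: dict = field(default_factory=dict)

    def transfer(self, address, data):
        """Model one transfer. `address` plays the role of the Address-line,
        `data` the Data-line, and the return value the Ack-line."""
        if self.free_words.get(address, 0) <= 0:
            return False                  # negative acknowledgement: no space
        self.free_words[address] -= 1
        self.stored.setdefault(address, []).append(data)
        return True                       # positive acknowledgement: word stored


port = ReceiverPort(free_words={"buf0": 1})
print(port.transfer("buf0", 0xAB))  # True: the word is accepted
print(port.transfer("buf0", 0xCD))  # False: the buffer is now full
```

A sender that receives a negative acknowledgement knows that the word was not stored and has to repeat the transfer later.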
After this basic introduction to the switch architecture, we explain the architecture of the input stage, the output stage and the organization of the memory hierarchy in detail.
3.2.1.1 Input stage
Figure 13 : Input stage of switch port
The main duties of the input stage of a switch port are as follows:
(1) The address controller extracts the packet address from the input Address-line to decide where to store the input data.
(2) The input data on the input Data-line is dispatched to the buffer that stores it.
(3) Output acknowledgements are collected from the Ack controllers of the other output stages and delivered as the acknowledgement signal on the input Ack-line.
3.2.1.2 Output stage
Figure 14 : Output stage of switch port
The output stage is composed of the following elements:
(1) There are four memory modules in each direction, which are called RAM in Figure 14. All of them are one-read/one-write memories. The four memory modules separately store the input data received by the other four input stages. Take Figure 15 as an example. Data of packets that come from the north and turn east in the current switch are always received by the input stage of the north port and then stored in RAM-N of the output stage of the east port.
Figure 15 : Memory duty diagram
(2) The buffer controller records the size and status of the buffers in the switch, and asks the arbiter to grant channel privilege.
(3) The Ack controller checks the status of the buffer indicated by the input address and responds with an acknowledgement according to that status. For example, in Figure 16, a packet from the south is transmitted across the current switch toward the east. The neighboring switch to the south first notifies the Ack controller of the east port of the current switch to check whether buffer space is available to store the data. The Ack controller responds with an acknowledgement accordingly. After receiving the acknowledgement, the neighboring switch to the south knows whether the data sent in this transmission was successfully received by the switch.
(4) The arbiter uses weighted round-robin scheduling to grant channel privilege.
Figure 16 : Ack controller
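Weighted round-robin scheduling, as used by the arbiter in item (4), can be sketched in Python as follows. This is one common variant and only an illustration: the exact policy, the weights and the requester names used here are assumptions, not the implemented arbiter. Each requester may receive up to its weight in consecutive grants before the arbiter moves to the next requester, and requesters with nothing to send are skipped. At least one requester is assumed.

```python
class WeightedRoundRobinArbiter:
    """Grants the channel to one requester per cycle. A requester with
    weight w may be granted up to w times in a row before its turn ends."""

    def __init__(self, weights):
        self.weights = dict(weights)      # requester name -> integer weight
        self.order = list(self.weights)   # fixed visiting order
        self.index = 0
        self.credit = self.weights[self.order[0]]

    def _advance(self):
        """Move to the next requester and refill its credit."""
        self.index = (self.index + 1) % len(self.order)
        self.credit = self.weights[self.order[self.index]]

    def grant(self, requests):
        """Return the requester granted this cycle, or None if nobody asks."""
        # One extra step lets an exhausted sole requester be revisited.
        for _ in range(len(self.order) + 1):
            name = self.order[self.index]
            if name in requests and self.credit > 0:
                self.credit -= 1
                return name
            self._advance()
        return None


arbiter = WeightedRoundRobinArbiter({"RAM-N": 2, "RAM-S": 1})
print([arbiter.grant({"RAM-N", "RAM-S"}) for _ in range(6)])
```

With weights 2 and 1 and both requesters always active, the grants repeat in the pattern RAM-N, RAM-N, RAM-S, so the channel bandwidth is shared in a 2:1 ratio.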
3.2.2 Organization of internal buffers
After the brief description of our memory architecture in item (1) of subsection 3.2.1.2, we present the implementation details in this section. Each memory module, called RAM in Figure 14, can be partitioned into several buffers to provide the necessary virtual channels. As illustrated in Figure 17, we partition each memory module into 4 buffers, giving a total of 16 data buffers in this port, that is, a capacity of 16 virtual channels for routing packets. The size of the memory determines how flexibly it can be partitioned. For example, a memory module of 32 words can be partitioned into two 16-word buffers, four 8-word buffers, or even eight 4-word buffers.
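The arithmetic of this partitioning can be sketched in Python as follows. The sketch assumes equal-sized buffers and represents each buffer as a (start word, size) pair. Both choices, and the function names, are assumptions made for this example.

```python
def partition(module_words, buffer_count):
    """Split one memory module into `buffer_count` equal buffers.
    Returns a list of (start_word, size_in_words) pairs."""
    if buffer_count <= 0 or module_words % buffer_count != 0:
        raise ValueError("buffer_count must evenly divide module_words")
    size = module_words // buffer_count
    return [(i * size, size) for i in range(buffer_count)]


def virtual_channels_per_port(buffer_count, modules_per_port=4):
    """Each buffer acts as one virtual channel; a port has four modules."""
    return modules_per_port * buffer_count


for count in (2, 4, 8):
    print(count, "buffers of", partition(32, count)[0][1], "words")
print(virtual_channels_per_port(4))  # 16 virtual channels, as in Figure 17
```

More buffers per module give more virtual channels per port, at the price of fewer words per buffer.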
We must highlight that the memory partition is reconfigurable independently, not only for each switch but also for each port, even after fabrication. Memory partitioning can be used to trade off routing flexibility against communication performance.
Assume that some applications on our platform need many long-distance transmissions. It becomes difficult to route all the packets if each switch provides only a few virtual channels. On the other hand, a smaller buffer size causes a higher transmission failure rate and degrades communication performance.
Figure 17 : Organization of memory hierarchy
The organization of the memory hierarchy is illustrated in Figure 17. In our platform, each buffer in a switch is given a buffer-id. Figure 18 illustrates the meaning of each part of a buffer-id.
Figure 18 : Buffer naming rule
For example, the buffer-id S3.E-S{3} identifies the third buffer of the memory module that stores data from the south at the east port of switch 3.
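The structure of a buffer-id can be made explicit with a small parser, sketched below in Python. The grammar is inferred from the single example S3.E-S{3} and is therefore an assumption: a switch number, the port, the direction the data comes from, and the buffer index in braces. Single upper-case letters are accepted for the port and the direction, and the field names are hypothetical.

```python
import re
from typing import NamedTuple


class BufferId(NamedTuple):
    switch: int   # switch number, e.g. 3
    port: str     # port whose output stage holds the buffer, e.g. "E"
    source: str   # direction the stored data comes from, e.g. "S"
    index: int    # position of the buffer inside its memory module


# Assumed grammar: S<switch>.<port>-<source>{<index>}
_PATTERN = re.compile(r"S(\d+)\.([A-Z])-([A-Z])\{(\d+)\}")


def parse_buffer_id(text):
    """Split a buffer-id string into its four fields."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a buffer-id: {text!r}")
    switch, port, source, index = match.groups()
    return BufferId(int(switch), port, source, int(index))


def format_buffer_id(buffer_id):
    """Inverse of parse_buffer_id."""
    return (f"S{buffer_id.switch}.{buffer_id.port}-"
            f"{buffer_id.source}{{{buffer_id.index}}}")


print(parse_buffer_id("S3.E-S{3}"))
# BufferId(switch=3, port='E', source='S', index=3)
```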
In addition to the space each buffer uses to store input data, each buffer has another memory space, called the routing table, that records a unique buffer-id as its output address. The data stored in the current buffer will be sent to the buffer identified by this buffer-id at the next successful transaction. Naturally, the buffer with this buffer-id must be one of the buffers of the neighboring switch that is connected to the current buffer. By configuring the routing tables of the switches in our platform, we can provide the transmission paths needed by different applications.
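Since every buffer records exactly one output buffer-id, a configured set of routing tables defines a chain of buffers. The Python sketch below follows such a chain. The dictionary representation, the function name, and the switch numbering and buffer-ids in the example are assumptions made for illustration.

```python
def trace_path(routing_table, start):
    """Follow the output addresses from buffer `start` until a buffer with
    no routing-table entry is reached. `routing_table` maps a buffer-id to
    the buffer-id stored as its output address."""
    path = [start]
    while path[-1] in routing_table:
        nxt = routing_table[path[-1]]
        if nxt in path:
            raise ValueError("routing tables form a loop at " + nxt)
        path.append(nxt)
    return path


# Hypothetical configuration: a packet travelling east through switches 1, 2 and 3.
table = {
    "S1.E-W{1}": "S2.E-W{1}",
    "S2.E-W{1}": "S3.E-W{2}",
}
print(trace_path(table, "S1.E-W{1}"))
# ['S1.E-W{1}', 'S2.E-W{1}', 'S3.E-W{2}']
```

Configuring a transmission path for an application then amounts to filling in one such entry per buffer along the path.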
3.3 Transaction
After a detailed explanation of our switch architecture and the organization of the memory hierarchy, we describe how to form a transmission path and explain transaction