Chapter 1 Introduction
1.5 Thesis Organization
Chapter 2 introduces the traditional NoC platform, the switch design considerations, and the switch configuration. Chapter 3 describes the configuration of the proposed hierarchical 2-D mesh NoC platform and the communication contention-aware task binding methodology that uses the platform. The experiment flow and experimental results are presented and discussed in Chapter 4. Finally, conclusions and future work are given in the last chapter.
Chapter 2
Overview of Network-on-Chip
This chapter will introduce the network-on-chip (NoC) platforms as well as the switch design of on-chip network communications. Switches are the most critical elements for on-chip networks. In this chapter, the common switch strategies, the required switch properties, the switch model, and the transaction behavior will be discussed in more detail.
2.1 Network-on-Chip Platform
Many NoC platform topologies have been developed, such as torus, octagon, butterfly fat-tree (BFT), and 2-D mesh; they are collated and discussed in [11]. NoCs with the 2-D mesh topology have simple connections and easy routing for communications [12]. Such NoC architectures also have uniform interconnections and a uniform transaction time between two elements, which ensures the scalability of the networks.
Besides, the rectangular topology of such NoC architectures matches the IC manufacturing topology; in other words, the architectures are easy to realize. Because of the properties described above, 2-D mesh topology NoCs are investigated in this work.
Figure 2.1: 2-D mesh topology network.
Figure 2.1 illustrates a 2-D mesh topology network consisting of processing elements (PEs) and switches (SWs). Each PE is composed of a processor with buffers, local memories, and a network interface, and is used to execute computing jobs. Each PE is connected to its local switch, which can buffer communication data. Each switch connects to its four neighboring switches, and the PEs communicate through the switches.
2.2 Switch
2.2.1 Switching Strategy
Three widely used switching strategies are connection-oriented switching, connectionless switching, and hybrid switching.
2.2.1.1 Connection-oriented Switching
Connection-oriented switching, also called circuit switching, determines a dedicated physical path from the source to the destination before transmitting the data.
This dedicated path is reserved until all the data are transmitted. A dedicated path can be connected in two ways, distinguished by whether the path can be reprogrammed. In the static way, the chosen path cannot be reprogrammed, as in a point-to-point connection.
In contrast, the dynamic way is reprogrammable. If a physical channel is reserved for one dedicated path, it is not available to other paths. A connection therefore has the full bandwidth of the physical channel, so the latency is guaranteed and the performance is predictable. Hence, circuit switching is suitable for real-time applications and long, infrequent data transmission.
However, one physical channel reserved for only one connection will make the bandwidth utilization low when the transmission is not continuous, and hence this will degrade the overall performance.
2.2.1.2 Connectionless Switching
Figure 2.2: Example of packet switching.
In connectionless switching, also called packet switching, data are transmitted as packets. A packet contains the destination information, the packet size, and the transmission data. In contrast to circuit switching, the connection from the source to the destination is not reserved before data transmission.
For example, a packet is transmitted from a source PE to a destination PE as shown in Figure 2.2. There are four switches, A, B, C, and D, between the source PE and the destination PE, and there is more than one path over which the packet can be transmitted. Switches B, C, and D are all connected to switch A. The transmission path is not reserved before the data transmission: when the data reach A, the next switch on the path, B, C, or D, is decided. With packet switching, the buffers of a switch are not released until the packet has been transmitted to the next switch. If data are buffered at the input or output of a physical channel, other data intending to access this physical channel are stuck as well until the preceding buffers are released. In Figure 2.3, no data can be transmitted forward because none of the buffers ahead is released; this is a deadlock situation. In summary, the advantage of packet switching is that the buffers get high utilization, but the latency is unpredictable, because a packet may be blocked for an uncertain time when there is heavy traffic.
Figure 2.3: Illustration of communication deadlock.
Three methods are commonly used to accomplish packet switching: store-and-forward, virtual-cut-through, and wormhole. In store-and-forward, a packet is allowed to be transmitted to the next switch only when the whole packet is available and the next switch is able to receive the whole packet.
Hence, store-and-forward requires buffers large enough to hold a whole packet. Besides, it is efficient for short, frequent transmission. In virtual-cut-through, a switch can also allocate buffers for a whole packet, but the packet is transmitted to the next switch as soon as the routing information is available, without waiting for the whole packet. The latency can be short and the bandwidth utilization can be high if the routing information is not blocked. However, if the routing information is blocked, the packet is completely buffered until it can be transmitted. In summary, virtual-cut-through is more efficient than store-and-forward in terms of latency and bandwidth utilization, but both require large buffers. In wormhole switching, a switch only has the capacity for some units of a packet instead of a whole packet. It transmits data directly when the next switch has a released buffer.
Hence, the wormhole switching requires fewer buffers than the store-and-forward and the virtual-cut-through.
2.2.1.3 Hybrid Switching
Hybrid switching means using both circuit switching and packet switching for different communication requirements in NoC platforms. Therefore, it can have the characteristics of both. Virtual-circuit switching is a hybrid switching technique in that it uses dedicated virtual connections, similar to circuit switching, and packet transmission, similar to packet switching.
2.2.2 Virtual-Circuit Switching
In this work, we use a switch architecture that is based on the latency-insensitive concepts [13][14] and utilizes the virtual-circuit switching technique. This switch architecture can achieve high bandwidth utilization, guaranteed bandwidth, and predictable latency under heavy communication load. It also has predictable characteristics and can support real-time applications.
Figure 2.4: Example of virtual channel scheme.
In virtual channel flow control [15], a physical channel can be divided into several virtual channels. For example, as shown in Figure 2.4, both path A and path B try to access the physical channel between SW1 and SW2. Without the virtual channel scheme, if path A is granted the physical channel first, path B cannot access the physical channel until path A finishes transmitting its data, and the data of path B are buffered at the input or output of the physical channel. With the virtual channel technique, path A and path B access the physical channel in turn, so the waiting time for transmitting the buffered data is reduced. Thus, the latency is decreased, the utilization of the channel is increased, and the system throughput is improved.
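The effect on waiting time can be illustrated with a minimal sketch. It assumes, purely for illustration, that each path sends one data item per clock cycle and that two virtual channels alternate strictly; the function names are hypothetical and not part of the switch design. The sketch only shows when path B first gets the channel, not the overall throughput.

```python
def serve_in_order(a_len, b_len):
    """No virtual channels: path A holds the channel until all its data are sent."""
    return ["A"] * a_len + ["B"] * b_len


def serve_in_turns(a_len, b_len):
    """Two virtual channels: A and B alternate, one data item per cycle (assumed)."""
    remaining = {"A": a_len, "B": b_len}
    order = []
    while remaining["A"] > 0 or remaining["B"] > 0:
        for path in ("A", "B"):
            if remaining[path] > 0:
                order.append(path)
                remaining[path] -= 1
    return order


def first_cycle(order, path):
    """1-based clock cycle in which `path` first uses the physical channel."""
    return order.index(path) + 1
```

With four data items on path A and two on path B, `first_cycle(serve_in_order(4, 2), "B")` is 5, whereas `first_cycle(serve_in_turns(4, 2), "B")` is 2.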
Figure 2.5: Request-oriented weighted round-robin scheduling scheme.
In the virtual-circuit switching, the physical channel is divided into several virtual channels. Thus, these virtual channels share the bandwidth of the physical channel.
To allocate the available bandwidth to each virtual channel, the request-oriented weighted round-robin scheduling scheme is applied. The round-robin scheduler gives grants sequentially and cyclically, and skips the buffers that make no request. If a requesting buffer has a weight number w, it can transmit w times continuously once it gets the grant of transmission from the scheduler. A buffer with a higher weight gets more bandwidth. For example, in Figure 2.5, Buffers A, B, C, and D have weight numbers 1, 2, 2, and 1, respectively, so the physical channel is scheduled in the sequence A, B, B, C, C, D. If only Buffers A, C, and D make requests, Buffer B is skipped when the sequence index reaches it in a clock cycle, and Buffer C gets the grant in that same clock cycle. After a round is finished, the sequence index returns to Buffer A. When all four buffers make requests, Buffers A and D each get one-sixth of the physical channel bandwidth, and Buffers B and C each get one-third.
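The scheduling rule can be sketched in a few lines of Python. This is a behavioral illustration only, not the hardware scheduler: the function name `wrr_grants` and the representation of weights as a dictionary are assumptions made for the example.

```python
def wrr_grants(weights, requesting, cycles):
    """Return the buffer granted in each of `cycles` clock cycles (None if idle)."""
    # Expand the weights into the slot sequence, e.g. A, B, B, C, C, D.
    slots = [name for name, weight in weights.items() for _ in range(weight)]
    grants = []
    index = 0  # sequence index; wraps to the first slot after each round
    for _cycle in range(cycles):
        granted = None
        # Skip the slots of non-requesting buffers within the same clock cycle.
        for _skip in range(len(slots)):
            name = slots[index]
            index = (index + 1) % len(slots)
            if name in requesting:
                granted = name
                break
        grants.append(granted)
    return grants


weights = {"A": 1, "B": 2, "C": 2, "D": 1}
all_requesting = wrr_grants(weights, {"A", "B", "C", "D"}, 6)
without_b = wrr_grants(weights, {"A", "C", "D"}, 4)
```

Here `all_requesting` is the sequence A, B, B, C, C, D from the text, and `without_b` is A, C, C, D, because the two slots of Buffer B are skipped in the cycle in which Buffer C is granted.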
Figure 2.6: Data transmission from a switch to the adjacent switch or the local processor.
Figure 2.6 illustrates the data transmission from a switch to an adjacent switch or to the local processor. The address mapping table in a switch records the destination buffer address of each buffer in this switch. As shown in Figure 2.6, buffer E1,SW1 makes a connection to buffer E1,SW2. If there are data in E1,SW1, they will be transmitted to E1,SW2. The data transmission takes four clock cycles. In the first cycle, the requesting buffer E1,SW1 gets the grant from the scheduler, so it can transmit the data in the following cycles. In the second cycle, E1,SW1 sends the destination address of E1,SW2 to SW2 through the Address-line to tell SW2 that the transmitted data should be stored in E1,SW2. In the third cycle, E1,SW2 sends an acknowledge signal back to E1,SW1 through the Ack-line to notify E1,SW1 of the status of E1,SW2. If E1,SW2 is full, the acknowledge signal is false; if E1,SW2 is available, the acknowledge signal is true. In the same cycle, E1,SW1 sends the data to E1,SW2 through the Data-line. The transmitted data are stored or discarded according to whether E1,SW2 is available or not. In the fourth cycle, the data in E1,SW1 that have been transmitted are kept or released according to the acknowledge signal issued from E1,SW2. If the acknowledge signal is false, the data are kept and retransmitted in the next round.
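The four-cycle transaction can be summarized by the following behavioral sketch. It assumes that a true acknowledge means the destination buffer has room, which is the polarity that agrees with the retransmit-on-false rule above and with the L2 transfer described in Chapter 3. The queue model and the name `transfer_once` are illustrative assumptions, not the actual switch logic.

```python
from collections import deque


def transfer_once(src, dst, dst_capacity):
    """Model one four-cycle transaction; return the acknowledge value."""
    # Cycle 1: the source buffer is assumed to have received the grant.
    # Cycle 2: the destination buffer address goes out on the Address-line.
    # Cycle 3: the acknowledge comes back while the data go out on the Data-line.
    ack = len(dst) < dst_capacity  # assumed polarity: True = destination has room
    if ack:
        dst.append(src[0])         # the destination stores the data
    # Otherwise the destination discards the data.
    # Cycle 4: the source releases the data on success and keeps them on failure.
    if ack:
        src.popleft()
    return ack
```

With a one-entry destination buffer, the first call succeeds and releases the data at the source; a second call returns false and leaves the source data in place for retransmission in the next round.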
Figure 2.7: Switch buffer organization.
Switch buffer utilization is an important factor in communication efficiency, because not all the switch buffers are used when the number of connection paths is less than the number of virtual channels in a physical channel. We use two-port SRAM memories instead of registers to implement the switch buffers, which provides the flexibility to trade off the buffer size against the number of virtual channels. Figure 2.7 shows an example of the switch buffer organization. There are four buffer banks in a port, and each bank only receives data from the corresponding direction. For example, bank-N of the E-port only receives data from the input of the N-port. Each buffer bank can be divided into several buffer queues to provide the necessary virtual channels. As shown in Figure 2.7, a buffer bank is divided into four buffer queues, so a port provides sixteen virtual channels in total. Designers can determine the number of virtual channels needed in a buffer bank at the early system design stage and partition the switch buffers accordingly. For example, a 32-word buffer bank can be divided into four 8-word buffer queues, eight 4-word buffer queues, or sixteen 2-word buffer queues for different applications. Thus, this switch buffer organization raises the switch buffer utilization and improves the communication efficiency.
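The partition arithmetic in this example can be checked with a short sketch. It assumes that a bank is always split into equally sized queues; the helper names are hypothetical.

```python
BANKS_PER_PORT = 4  # one bank per source direction, as in Figure 2.7


def partition_bank(bank_words, num_queues):
    """Return the depth in words of each queue when a bank is split evenly."""
    if num_queues <= 0 or bank_words % num_queues != 0:
        raise ValueError("bank size must be a positive multiple of the queue count")
    return bank_words // num_queues


def virtual_channels_per_port(num_queues):
    """Each queue provides one virtual channel, so a port has banks x queues."""
    return BANKS_PER_PORT * num_queues
```

For a 32-word bank, `partition_bank(32, 4)`, `partition_bank(32, 8)`, and `partition_bank(32, 16)` give queue depths of 8, 4, and 2 words, and `virtual_channels_per_port(4)` gives the sixteen virtual channels of Figure 2.7.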
Finally, the switch assigns the dedicated connection paths by reserving the corresponding virtual channels and buffers before the data transmission. The switches that each connection path passes through and the communication behavior are therefore known at an early stage of the system design. This means that the switch has predictable characteristics and can support real-time applications.
We utilize the switch architecture with the virtual channel scheme, the request-oriented weighted round-robin scheduling, and SRAM-based switch buffers. It is deadlock-free, achieves high bandwidth utilization and high buffer utilization, and can support real-time applications. It also provides low latency and high throughput. Among the traditional switching strategies, circuit switching guarantees latency and can support real-time applications, while wormhole switching needs smaller buffers and has higher hardware utilization. Virtual-circuit switching has not only the capabilities of circuit switching and wormhole switching but also the other advantages mentioned above.
2.2.3 Switch Architecture
Figure 2.8: Switch architecture.
Figure 2.8 shows the architecture of the proposed switch. A switch has five ports: East (E), South (S), West (W), North (N), and Local (L). The outputs of the E, S, W, and N ports are connected to the corresponding input ports of the adjacent switches, and the L port is connected to the interface of the local PE. For example, in Figure 2.8, the E port output of the left switch is connected to the W port input of the right switch, and two physical channels with opposite directions, forward and backward, are built between the two switches. Each physical channel includes an Address-line, a Data-line, and an Ack-line. The Address-line transmits the destination address, the Data-line transmits the data, and the Ack-line transmits the acknowledge signal.
Figure 2.9: Expression of the buffer-id.
The buffer at the output of a port is partitioned into four buffer banks, each of which only receives data from one of the other four ports. The detailed buffer configuration is shown in Figure 2.7. A switch transmits data to the next switch according to the address in the address mapping table. The expression of the buffer-id is shown in Figure 2.9. For example, the buffer-id S1.E-W2 denotes the second buffer in the W-bank of the E-port of SW1.
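As an illustration of how such an identifier decomposes, the sketch below parses a buffer-id into its fields. The pattern is inferred only from the single example S1.E-W2 (switch number, port letter, bank letter, buffer index); the exact format defined in Figure 2.9 may differ, and the names `BufferId` and `parse_buffer_id` are hypothetical.

```python
import re
from typing import NamedTuple


class BufferId(NamedTuple):
    switch: int  # e.g. 1 for SW1
    port: str    # port letter: E, S, W, N or L
    bank: str    # bank named after the direction it receives data from
    index: int   # position of the buffer inside the bank (1-based)


# Assumed format, inferred from the example "S1.E-W2".
_PATTERN = re.compile(r"S(\d+)\.([ESWNL])-([ESWNL])(\d+)")


def parse_buffer_id(text):
    """Split a buffer-id string into switch, port, bank and buffer index."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a buffer-id: {text!r}")
    switch, port, bank, index = match.groups()
    return BufferId(int(switch), port, bank, int(index))
```

Under this assumed format, `parse_buffer_id("S1.E-W2")` yields switch 1, port E, bank W, and index 2.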
As mentioned in Section 2.2.2, a physical channel can be divided into several virtual channels. We use a factor, the channel width factor, to indicate the maximum allowable number of virtual channels in each buffer bank in a port of a switch. For example, in Figure 2.7 there are four virtual channels in each buffer bank, so the channel width factor is four and a buffer bank can be assigned at most four virtual channels. In other words, there are at most sixteen virtual channels for a physical channel.
The traditional architecture is described in this chapter. However, when the application is complex and the communication loading is heavy, long distance transmission is necessary. We need an improved architecture to support these applications. In the next chapter, the proposed hierarchical architecture will be discussed.
Chapter 3 Hierarchical NoC Architecture
In this chapter, the proposed hierarchical architecture for the NoC platform is presented, and the task binding method is applied to this platform. Using the hierarchical architecture for complex applications can improve the overall performance. The communication contention-aware methodology for task binding, which includes task mapping and path assignment, is discussed in detail.
3.1 Hierarchical Architecture
Figure 3.1: Hierarchical 2-D mesh NoC platform.
Figure 3.1 shows the proposed hierarchical 2-D mesh NoC platform. In this hierarchical architecture, two 2-D mesh switch networks are connected by interchange switches (SW_I). We define the added network, which includes SW_I and SW_L2, as network level-2 (L2); the traditional part is defined as network level-1 (L1).
At every third PE position in the x-direction and y-direction, the PE is replaced by a SW_I and a SW_L2.
The port names of a SW_I are shown in Figure 3.1. Some vertical or horizontal physical channels are disconnected in order to release switch ports to connect to SW_I.
As shown in Figure 3.1, let the coordinate of a SW_I be (x, y). If the sum of x and y is odd, the horizontal physical channels are disconnected; if the sum is even, the vertical ones are disconnected. The buffers between two SW_L2 are relay stations without switches, and they forward data directly in the next clock cycle. A parameter named the physical channel width ratio, R, is used to characterize the hierarchical architecture, and is defined as
$$R = \frac{\text{L2 physical channel width}}{\text{L1 physical channel width}} \tag{3.1}$$
The physical channel width is the size of the data that can be transmitted at a time. If the L1 transmission data size is one word and the L2 data size is four words, R is four.
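The two rules that characterize the hierarchy, the parity rule for the disconnected channels at a SW_I and the width ratio R, can be written as a minimal sketch. The function names and the string return values are assumptions made for illustration.

```python
def disconnected_channels(x, y):
    """Orientation of the physical channels removed at the SW_I located at (x, y)."""
    # Odd coordinate sum: horizontal channels; even sum: vertical channels.
    return "horizontal" if (x + y) % 2 == 1 else "vertical"


def width_ratio(l2_width_words, l1_width_words):
    """Physical channel width ratio R of Equation (3.1)."""
    return l2_width_words / l1_width_words
```

For example, a SW_I at (1, 2) has its horizontal channels disconnected, one at (2, 2) has its vertical channels disconnected, and `width_ratio(4, 1)` gives R = 4 for four-word L2 channels and one-word L1 channels.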
Figure 3.2: Data transmission between two L2 switches.
In this hierarchical architecture, the data transmission between two L2 switches requires eight clock cycles. The detailed timing diagram and the address mapping table of the connections of a SW_L2 are shown in Figure 3.2. This address mapping table records the destination buffer address of each buffer in this switch. In Figure 3.2, the corresponding buffer is labeled in the table. For example, the E1,SW1 buffer has a connection to the E1,SW2 buffer. In the first clock cycle, the requesting buffer E1,SW1 gets the grant from the scheduler and is allowed to transmit the data in the next cycle. From the second clock cycle to the fourth clock cycle, E1,SW1 sends the address of the destination buffer, E1,SW2, to SW2 through the Address-line. In the fifth clock cycle, the address arrives at SW2 and notifies SW2 that the following data should be stored in E1,SW2. From the fifth clock cycle to the seventh clock cycle, E1,SW2 sends an acknowledge signal back to E1,SW1 through the Ack-line. In the same clock cycles, E1,SW1 continuously sends three four-word data to E1,SW2 through the Data-line. If E1,SW2 does not have enough buffer space to save the three incoming data, the acknowledge signal is false; if E1,SW2 is available, the acknowledge signal is true. In the eighth clock cycle, the three data in E1,SW1 that have been transmitted are kept or released according to the acknowledge signal received from E1,SW2. If the acknowledge signal is false, the transaction has failed, and the data are kept and transmitted again in the next round. If the acknowledge signal is true, the transaction has succeeded.
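A rough comparison of the two transaction types follows from the cycle counts alone. The sketch below assumes that every transaction succeeds and that transactions run back to back without overlapping, neither of which is stated in the text, and that an L1 transaction moves a single one-word data in four cycles; the function name is hypothetical.

```python
def words_per_cycle(words_per_data, data_per_transaction, cycles_per_transaction):
    """Average words moved per clock cycle, assuming no failures and no overlap."""
    return words_per_data * data_per_transaction / cycles_per_transaction


# L1: one one-word data per four-cycle transaction (Section 2.2.2).
l1_rate = words_per_cycle(1, 1, 4)
# L2: three four-word data per eight-cycle transaction (this section).
l2_rate = words_per_cycle(4, 3, 8)
```

Under these assumptions `l1_rate` is 0.25 words per cycle and `l2_rate` is 1.5 words per cycle, which gives a sense of why long-distance traffic may benefit from the L2 network.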
Figure 3.3: Example of data transmission passing through hierarchy.
For the L1 network, the physical channel is one word wide; for the L2 network, it is four words wide. Consider the connection path shown in Figure 3.3, in which the source PE transmits data to the sink PE through L2. A four-word data can be transmitted from SW2 to SW3 only when four one-word data from SW1 are available. SW4 does not start a transaction to SW5 until three four-word data from SW3 are available, because a SW_L2 transmits three four-word data continuously. Hence, for the SW_L2 buffers connected to another SW_L2, the weights of the round-robin scheduling are assigned at least three or six.
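The two waiting conditions in this example, four one-word data before entering L2 and three four-word data before an L2-to-L2 transaction, can be sketched as follows. The list-based model and the names `pack_l1_to_l2` and `complete_bursts` are assumptions for illustration, not the buffer logic of the switches.

```python
def pack_l1_to_l2(words, ratio=4):
    """Group one-word L1 data into `ratio`-word L2 data; leftover words wait."""
    full = len(words) // ratio
    packed = [tuple(words[i * ratio:(i + 1) * ratio]) for i in range(full)]
    waiting = list(words[full * ratio:])
    return packed, waiting


def complete_bursts(packed, burst=3):
    """Number of L2-to-L2 transactions that can start (three data per burst)."""
    return len(packed) // burst
```

With thirteen one-word data, `pack_l1_to_l2` yields three four-word data and leaves one word waiting, and `complete_bursts` reports that exactly one three-data transaction can start.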
Figure 3.4: Buffer architecture of SW_1I.
Figure 3.5: Buffer architecture of SW_I.
Since the transmission data size of L2 is four-word, some allocated buffer size should be adjusted. Taking Figure 3.3 for example, if a buffer bank is divided into 16