Transaction protocol - Communication Switch Design for Single-Chip Multiprocessor Systems

Chapter 2 Preliminaries

2.4 Transaction protocol

In networking and telecommunications, a protocol is an agreed-upon format or set of rules for transmitting data between two devices. A protocol determines the type of error detection or error correction to be used, the data compression method, and the handshaking convention between the sending and receiving devices. It defines not only how senders and receivers execute communication transactions but also how data flows across the network. For on-chip communication, the choice of protocol strongly influences reliability and power consumption.

Chapter 3 Our Platform and Switch Design

In this chapter, we describe our platform and switch design in detail. First, we outline what a future network infrastructure should provide for a communication-driven system design methodology. Next, we present a complete description of our platform and switch design, and illustrate how the switch transmits messages between components. Finally, we review how our platform meets the requirements for constructing a future on-chip network.

3.1 What network we need

When it comes to constructing the network infrastructure for a future system on chip, the hardest problem is meeting the varied communication requirements of different application domains. Some applications, such as Software Defined Radio and MPEG codecs, can be processed as parallel threads and need only local, fixed communication bandwidth. Other applications may place irregular traffic loads on the communication channels. Here we summarize the basic properties that a future communication infrastructure should provide in a communication-driven system design flow.

1. Efficient communication

When we construct a network infrastructure for a system on chip, the first task is to balance computing power and communication capability. Suppose we implement a system by integrating powerful processing elements that coordinate with other components, yet provide only a poor communication infrastructure. Messages transmitted between these components will then spend unnecessary time in transit, and many jobs assigned to the processing elements will be postponed because the data they need is delayed.

This is a serious problem: it not only leaves processing elements idle, which degrades overall system performance, but also consumes extra power while the processing elements wait for data. Thus, we must provide a network infrastructure that meets the criterion of high network utilization.

2. Guaranteed throughput

For some applications, meeting real-time requirements is the critical issue. Circuit switching may be a good choice for such applications because it provides transmission with guaranteed throughput.

3. Fault tolerance capability

Even with advances in semiconductor technology, it still cannot be guaranteed that a chip is fabricated without any defect. This problem becomes worse in the deep-submicron era. When several processing elements are integrated on one chip, there is a higher chance of finding manufacturing faults in it. There may be faults in memory, wrong connections between components, or broken processing elements. These faults may be rare, but they are unavoidable. A future network infrastructure should provide mechanisms that let the whole system keep working smoothly with faulty components on it.

In this chapter, we propose a novel platform and switch design as a feasible solution to these network requirements and as the network infrastructure for the future communication-driven system design methodology.

3.2 Switch architecture

3.2.1 System Scheme

Figure 9 : A 2-D mesh switched network with 2x3 nodes

Our platform uses a 2-D mesh topology to organize on-chip components, as shown in Figure 9. We selected the two-dimensional mesh mainly because its wire cost is acceptable and because components are easy to group on a plane [5][11]. In our platform, the network is composed of 5-port switches. Processors communicate over the network through network interfaces.

The architecture of the 5-port switch is shown in Figure 10. The switch has four ports connecting to neighboring switches and one port connecting to the local processing element.

Each port is composed of an input stage and an output stage, shown in Figure 13 and Figure 14 respectively.

Figure 10 : Switch architecture

The basic transmission procedure is illustrated in Figure 11. Suppose that a packet enters the current switch from the neighboring switch to the west and is to be forwarded to the neighboring switch to the east. In the current switch, the packet is first received by the input stage of the west port and then stored in the memory of the output stage of the east port. Once the output channel of the east port, which connects to the neighboring switch, becomes available, the output stage of the east port sends the packet on to the next switch.

Figure 11 : Basic transmission procedure

The interface of the switch is composed of an input channel and an output channel. Each channel contains an Address-line, a Data-line and an Ack-line, as shown in Figure 12. The Address-line carries the input or output address of the packet. The Data-line carries the transmitted data. The Ack-line feeds an acknowledgement back to the source switch or processing element to report the result of the transmission. The output channel and the input channel are complementary to each other.

Figure 12 : Switch interface

Having introduced the switch architecture, we now explain the input stage, the output stage and the organization of the memory hierarchy in detail.

3.2.1.1 Input stage

Figure 13 : Input stage of switch port

The main duties of the input stage of a switch port are as follows:

(1) The address controller extracts the packet address from the input Address-line to decide where to store the input data.

(2) The input stage dispatches the data on the input Data-line to the buffer that stores it.

(3) The input stage collects the output acknowledgement from the Ack-controllers of the other output stages and delivers the acknowledgement signal on the input Ack-line.

3.2.1.2 Output stage

Figure 14 : Output stage of switch port

The output stage is composed of the following elements:

(1) There are four memory modules in each direction, labeled RAM in Figure 14. Each is a one-read/one-write memory. The four memory modules separately store the input data received by the other four input stages. Take Figure 15 as an example: the data of packets that come from the north direction and turn east in the current switch is always received by the input stage of the north port and then stored in RAM-N of the output stage of the east port.

Figure 15 : Memory duty diagram

(2) The buffer controller records the size and status of the buffers in the switch, and asks the Arbiter to grant channel privilege.

(3) The Ack-controller checks the status of the buffer indicated by the input address and responds with an acknowledgement according to that status. For example, in Figure 16, a packet from the south direction is transmitted across the current switch to the east direction. The neighboring switch to the south first notifies the Ack-controller of the east port of the current switch, which checks whether buffer space is available to store the data.

The Ack-controller then responds with an acknowledgement. After receiving the acknowledgement, the neighboring switch to the south knows whether the data sent in this transmission was successfully received by the current switch.

(4) The Arbiter uses weighted round robin scheduling to grant channel privilege.

Figure 16 : Ack controller

3.2.2 Organization of internal buffers

Following the brief description of our memory architecture in subsection 3.2.1.2(1), we present the implementation details in this section. Each memory module, labeled RAM in Figure 14, can be partitioned into several buffers to provide the necessary virtual channels. As illustrated in Figure 17, we partition each memory module into 4 buffers, giving a total of 16 data buffers in this port, that is, a capacity of 16 virtual channels for routing packets. The size of the memory determines how flexibly it can be partitioned. For example, a memory module of 32 words can be partitioned into two buffers of 16 words, four buffers of 8 words, or even eight buffers of 4 words.

We emphasize that the memory partition can be reconfigured independently, not only for each switch but also for each port, even after fabrication. Memory partitioning can be used to trade off flexibility of routing packets against communication performance.

Suppose that some applications on our platform need many long-distance transmissions. It becomes difficult to route all the packets if each switch provides only a few virtual channels. On the other hand, a smaller buffer size causes a higher transmission failure rate and degrades communication performance.
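The partition options quoted above can be enumerated with a short sketch. This is only an illustration: the function names, the power-of-two buffer counts, and the 4-word lower bound are assumptions taken from the 32-word example, not constraints stated by the design.

```python
def partition_options(memory_words, min_buffer_words=4):
    """List (buffer_count, words_per_buffer) pairs for one memory module.

    Assumption: buffer counts are powers of two and a buffer holds at
    least min_buffer_words words, as in the 32-word example in the text.
    """
    options = []
    count = 2
    while memory_words // count >= min_buffer_words:
        if memory_words % count == 0:
            options.append((count, memory_words // count))
        count *= 2
    return options


def virtual_channels_per_port(buffers_per_module, modules_per_port=4):
    """Each buffer acts as one virtual channel; a port has four modules."""
    return buffers_per_module * modules_per_port


print(partition_options(32))         # [(2, 16), (4, 8), (8, 4)]
print(virtual_channels_per_port(4))  # 16
```

More buffers per module give more virtual channels for routing, while fewer, larger buffers reduce the chance that a transmission fails because the destination buffer is full.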

Figure 17 : Organization of memory hierarchy

The organization of the memory hierarchy is illustrated in Figure 17. In our platform, each buffer in a switch is given a buffer-id. Figure 18 illustrates how a buffer-id is read.

Figure 18 : Buffer naming rule

For example, the buffer-id S3.E-S{3} identifies the third buffer of the memory module that stores data arriving from the south direction, at the east port of switch 3.
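To make the naming rule concrete, the following sketch splits a buffer-id into its fields. The field names and the accepted letters (N, E, S, W, L for ports, plus P as a source label) are assumptions inferred from the buffer-ids that appear in this chapter, since Figure 18 is not reproduced here.

```python
import re
from typing import NamedTuple


class BufferId(NamedTuple):
    switch: int   # switch number, e.g. 3 in "S3"
    port: str     # output port that owns the memory module
    source: str   # direction (or processor) the stored data came from
    index: int    # position of the buffer inside the memory module


# Assumed grammar: S<switch>.<port>-<source>{<index>}
_BUFFER_ID = re.compile(r"^S(\d+)\.([NESWL])-([NESWLP])\{(\d+)\}$")


def parse_buffer_id(text):
    """Parse a buffer-id such as 'S3.E-S{3}' into its four fields."""
    match = _BUFFER_ID.match(text)
    if match is None:
        raise ValueError(f"not a buffer-id: {text!r}")
    switch, port, source, index = match.groups()
    return BufferId(int(switch), port, source, int(index))


print(parse_buffer_id("S3.E-S{3}"))
```

For the example in the text, the parsed fields are switch 3, east port, data from the south, third buffer.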

In addition to the space in which each buffer stores input data, each buffer has another memory space, called the routing table, which records a unique buffer-id as its output address.

The data stored in the current buffer is sent to the buffer identified by this buffer-id at the next successful transaction. That buffer must be one of the buffers of the neighboring switch connected to the current buffer. By configuring the routing tables of the switches in our platform, we can provide the transmission paths needed by different applications.

3.3 Transaction

Having explained our switch architecture and the organization of the memory hierarchy in detail, we now describe how to form a transmission path and illustrate the transaction procedure.

3.3.1 Path configuration

In the system design flow introduced in section 1.3, after mapping the applications onto their associated processing elements, we have to provide all the transmission paths that the applications need. Assume that the system designers have decided all the transmission paths for routing packets. The next step is to configure our platform to form these paths. For each transmission path, we look for one available buffer in each switch along the path and reserve these buffers to form a dedicated virtual channel. Among the buffers that form this dedicated virtual channel, we assign to each buffer the buffer-id of its succeeding buffer as its output address. By configuring the routing tables of the buffers in this way, we set up the transmission path.

In packet switching, the switch must decode the packet address, search for space to store the data, and compute the routing path of the packet. In contrast, we simplify the duty of the switch by giving each buffer a routing table.

In Figure 19, assume that each memory module in the switch is partitioned into four buffers of eight words. Suppose that one of the applications on our platform delivers messages from processor-1 to processor-2. The transmission path runs from the network interface of processor-1 through switch-1 and switch-2 to processor-2, and is established by configuring the routing tables of the buffers in these two switches. First, we assign buffer-id S1.S-P{2} to processor-1 as the source buffer-id to use when it wants to send a message to processor-2. Second, we assign buffer-id S2.L-N{1} to buffer S1.S-P{2} as its output address, and assign a memory address to buffer S2.L-N{1} as its output address.

This buffer chain, composed of the network interface of processor-1, the buffer S1.S-P{2}, the buffer S2.L-N{1} and the network interface of processor-2, forms a dedicated transmission path for packets from processor-1 to processor-2.
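The path configuration described above amounts to writing one routing-table entry per reserved buffer. The sketch below models the routing tables as a single Python dictionary; the function names and the destination label "P2.mem" (standing in for the memory address in processor-2) are hypothetical and used only for illustration.

```python
def configure_path(routing_table, buffer_chain, destination):
    """Reserve the buffers in buffer_chain and link each to its successor.

    routing_table maps a buffer-id to its output address. The last buffer
    in the chain points at destination (a memory address at the receiver).
    """
    for buffer_id in buffer_chain:
        if buffer_id in routing_table:
            raise ValueError(f"buffer already reserved: {buffer_id}")
    targets = list(buffer_chain[1:]) + [destination]
    for current, succeeding in zip(buffer_chain, targets):
        routing_table[current] = succeeding
    return routing_table


def trace_path(routing_table, source_buffer):
    """Follow output addresses from source_buffer until no entry remains."""
    hops = [source_buffer]
    while hops[-1] in routing_table and routing_table[hops[-1]] not in hops:
        hops.append(routing_table[hops[-1]])
    return hops


table = configure_path({}, ["S1.S-P{2}", "S2.L-N{1}"], "P2.mem")
print(trace_path(table, "S1.S-P{2}"))
# ['S1.S-P{2}', 'S2.L-N{1}', 'P2.mem']
```

Because every buffer already knows its single output address, a switch never has to decode a packet header or compute a route at run time.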

Figure 19 : Path transaction procedure

3.3.2 Transaction procedure

Having described how to set up a transmission path, we illustrate our transaction procedure in Figure 19. Assume that processor-1 sends a packet of 2 words to processor-2. We mark the words as grey circle-{a} and grey circle-{b} in Figure 19.

First, processor-1 sends word-{a} to the input stage of the local port of switch-1, and switch-1 stores word-{a} in buffer S1.S-L{2}, as shown in Figure 19(a). In Figure 19(b), the output stage of the south port of switch-1 sends word-{a} to the input stage of the north port of switch-2, and switch-2 stores word-{a} in buffer S2.L-N{1}. At the same time, processor-1 can send word-{b} to switch-1, and switch-1 stores word-{b} in buffer S1.S-L{2} as it did with word-{a}. This shows that we allow pipelined transactions. In the last step, the output stage of the local port of switch-2 sends word-{a} to processor-2, which completes the transmission of word-{a}. Word-{b} arrives at processor-2 by the same procedure.

3.4 Transaction protocol

In this section, we explain the details of the transaction procedure between neighboring switches. Figure 20 shows the interface diagram between two neighboring switches. The routing table is a mapping between the buffers of the east port of switch-1 and the buffers of switch-2. It records a buffer-id of switch-2 as the output address of each output buffer in the east port of switch-1. For example, the output address of S1.E-N{1} is S2.E-W{1} in Figure 20.

Figure 20 : Transaction protocol between switches

Assume that at this moment the buffer S1.E-N{1} in switch-1 stores data to be delivered to buffer S2.E-W{1} in switch-2. We show the detailed transaction procedure between these two switches. The example is simple but shows our procedure clearly. The procedure is divided into four steps: channel privilege arbitration, output address transmission, data transmission, and acknowledgement. Each step finishes in one clock cycle, and transactions can be executed in a pipelined manner to increase throughput.

The detailed transaction is as follows:

Cycle 1: buffer S1.E-N{1} notifies the controller that it wants to access the output channel and requests that the controller grant it channel privilege.

Cycle 2: buffer S1.E-N{1} sends address ‘S2.E-W{1}’ on the Address-line. This address indicates that this transaction tries to send data into buffer S2.E-W{1}.

Cycle 3: In switch-1, buffer S1.E-N{1} sends data on the Data-line. In switch-2, the Ack-controller sends the acknowledgement (false/true) on the Ack-line according to the status (full/available) of buffer S2.E-W{1}. At the same time, buffer S2.E-W{1} stores the data on the input Data-line if it still has memory space; otherwise it discards the data.

Cycle 4: according to the acknowledgement on the Ack-line, buffer S1.E-N{1} in switch-1 decides whether to keep or erase the data it stores. A true acknowledgement means that the transmission was successful, so buffer S1.E-N{1} erases the data that has been transmitted. Otherwise it keeps the data.
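The four cycles above can be sketched as a small behavioral model. This is a simplified software illustration under stated assumptions, not the hardware implementation: the `Buffer` class and `transact` function are hypothetical names, the arbiter decision is passed in as a flag, and the four cycles run sequentially rather than pipelined.

```python
from collections import deque


class Buffer:
    """A FIFO buffer with a fixed capacity and one routing-table entry."""

    def __init__(self, buffer_id, capacity, output_address=None):
        self.buffer_id = buffer_id
        self.capacity = capacity
        self.output_address = output_address  # buffer-id in the next switch
        self.words = deque()

    def has_space(self):
        return len(self.words) < self.capacity


def transact(source, next_switch_buffers, granted=True):
    """Run one four-cycle transaction and return the acknowledgement."""
    # Cycle 1: the buffer requests the output channel from the arbiter.
    if not granted or not source.words:
        return False
    # Cycle 2: the output address selects the destination buffer.
    target = next_switch_buffers[source.output_address]
    # Cycle 3: data is sent while the Ack-controller reports buffer status.
    word = source.words[0]
    ack = target.has_space()
    if ack:
        target.words.append(word)  # a full buffer discards the word instead
    # Cycle 4: the sender erases the word only after a true acknowledgement.
    if ack:
        source.words.popleft()
    return ack


sender = Buffer("S1.E-N{1}", capacity=8, output_address="S2.E-W{1}")
receiver = Buffer("S2.E-W{1}", capacity=1)
sender.words.extend(["a", "b"])
switch_2 = {receiver.buffer_id: receiver}

first = transact(sender, switch_2)   # receiver has space: word "a" moves
second = transact(sender, switch_2)  # receiver is full: word "b" is kept
print(first, second, list(sender.words), list(receiver.words))
# True False ['b'] ['a']
```

The receiver capacity of one word is chosen only to force a false acknowledgement on the second attempt; the sender still holds word "b" afterwards and can try again in a later transaction.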

3.5 Round robin scheduling

At the output stage of a switch port, we implement the arbiter with a round robin scheduling technique to decide which virtual channel gets the privilege to access the output channel and transmit data [7]. In this way, the transmission paths that deliver data through the same port of a switch use equal shares of the output channel bandwidth. This technique avoids starvation in channel access. Note that only the virtual channels that are actively transmitting data are scheduled. Buffers that have no data to transmit are not granted channel privilege. This increases channel utilization and prevents unnecessary power consumption.

Figure 21 : Output arbitration

Figure 22 : Round robin scheduling

In Figure 21, three messages are delivered by three separate virtual channels in the output port. We identify these three messages as A, B and C. Without round robin scheduling, a short message may be postponed for a very long time before it is transmitted. We show this situation in Figure 22 (without scheduling). With round robin scheduling, the messages share the channel bandwidth equally, as shown in Figure 22 (with round robin scheduling).

Moreover, different applications may have different communication requirements, and we sometimes need to provide larger bandwidth for some transmission paths. By assigning a different weight to each transmission path, we can provide flexible bandwidth. We show this in Figure 22 (weighted round robin scheduling). In this example, buffer A has twice the bandwidth of buffer B or buffer C.
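A minimal software sketch of weighted round robin arbitration follows. It assumes that a channel with weight w may send up to w consecutive words per round and that channels with nothing to send are skipped; the actual grant order inside the hardware arbiter may interleave differently, and the function name and arguments are hypothetical.

```python
def weighted_round_robin(pending, weights):
    """Return the order in which channels are granted the output channel.

    pending maps a channel name to the number of words it still has to
    send; weights maps a channel name to its positive integer weight.
    """
    if any(weights.get(channel, 1) < 1 for channel in pending):
        raise ValueError("weights must be positive integers")
    remaining = dict(pending)
    schedule = []
    while any(remaining.values()):
        for channel in pending:  # visit channels in a fixed order
            for _ in range(weights.get(channel, 1)):
                if remaining[channel] == 0:
                    break  # inactive channels do not consume a slot
                schedule.append(channel)
                remaining[channel] -= 1
    return schedule


equal = weighted_round_robin({"A": 4, "B": 1, "C": 2}, {"A": 1, "B": 1, "C": 1})
weighted = weighted_round_robin({"A": 4, "B": 2, "C": 2}, {"A": 2, "B": 1, "C": 1})
print("".join(equal))     # ABCACAA
print("".join(weighted))  # AABCAABC
```

In the equal-weight run the short message B finishes in the first round instead of waiting behind A, and once B and C are done, A uses every slot. In the weighted run A receives two slots for each slot given to B or C.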

3.6 Performance

For each transmission between on-chip components, we reserve buffers in the switches along the transmission path to form a virtual channel connection. With such a dedicated channel and round robin scheduling, we can guarantee a minimum bandwidth for each transmission. For example, consider the east port of the switch in Figure 23, where three paths share the bandwidth. We assign them the same weight. This means that each transmission path is guaranteed at least one third of the bandwidth in this switch.

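We make a simple expression:

LGBW_{i} = \frac{w_{i}}{\sum_{j} w_{j}} \times BW_{channel} \qquad (3-1)

where LGBW_{i} is the local guaranteed bandwidth of path i in the switch, w_{i} is the weight assigned to path i, the sum runs over all paths sharing the output channel, and BW_{channel} is the bandwidth of that channel.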

Figure 23 : Bandwidth sharing example

If the transmission intervals of these three paths do not overlap, the switch provides higher bandwidth to the paths that are transmitting data.

The example in Figure 23 and equation (3-1) describe only the local guaranteed bandwidth in one switch. To obtain the guaranteed bandwidth of a whole transmission path, we need to trace the local guaranteed bandwidths in all the switches along the path and take the smallest one as the minimum bandwidth of the path. Equations (3-2) and (3-3) describe this expression.

Assume that all the channels of our platform in Figure 24 have the same bandwidth, called the standard channel bandwidth (scb). There is a transmission path from P1 to P6 through switches S1, S2, S3 and S6. The local guaranteed bandwidths (LGBW) of these switches are shown in the figure. The guaranteed bandwidth of this path equals the smallest local guaranteed bandwidth among these switches, which is LGBW_{S3}.
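The bandwidth reasoning in this section can be sketched numerically: a weighted share of the channel in each switch, then the minimum along the path. The function names are hypothetical, and the per-switch weights below are made-up values chosen only so that S3 is the bottleneck as in Figure 24; they are not the values shown in the figure.

```python
from fractions import Fraction


def local_guaranteed_bandwidth(weight, all_weights, channel_bandwidth=1):
    """Weighted share of one output channel, in units of the scb."""
    return Fraction(weight, sum(all_weights)) * channel_bandwidth


def path_guaranteed_bandwidth(local_bandwidths):
    """Return (bottleneck switch, bandwidth) for a path.

    local_bandwidths maps each switch on the path to the local
    guaranteed bandwidth the path receives in that switch.
    """
    bottleneck = min(local_bandwidths, key=local_bandwidths.get)
    return bottleneck, local_bandwidths[bottleneck]


# Three equally weighted paths sharing one port: one third each.
print(local_guaranteed_bandwidth(1, [1, 1, 1]))  # 1/3

# Hypothetical weights of the paths sharing each switch on P1 -> P6.
path = {
    "S1": local_guaranteed_bandwidth(1, [1, 1]),
    "S2": local_guaranteed_bandwidth(1, [1, 1, 1]),
    "S3": local_guaranteed_bandwidth(1, [1, 1, 1, 1]),
    "S6": local_guaranteed_bandwidth(1, [1, 1]),
}
print(path_guaranteed_bandwidth(path))  # ('S3', Fraction(1, 4))
```

Exact fractions are used instead of floating point so that shares such as one third compare cleanly when looking for the minimum.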

Figure 24 : Bandwidth guaranteed transmission path

3.7 Design overview

In section 3.1, we summarized the requirements that a future communication infrastructure should meet: efficient communication, guaranteed throughput and fault tolerance capability. Here we explain how our platform meets these requirements.

1. Efficient communication

By profiling the applications in the communication-driven system methodology, we can obtain statistics about the communication traffic of our system. We also learn the constraints that different applications impose. To avoid traffic congestion between transmission paths that need larger bandwidth, we can alleviate the network load by assigning different paths to them. Figure 25 is an example. In (a), if both transmission paths use the channel connection S2 – S3 – S6, they contend for the channel and degrade system performance. In (b), we assign them different paths to avoid the overlap and obtain better performance. With a proper setting of dedicated transmission paths, we can provide an efficient communication environment.

Figure 25 : Assigning different path to avoid traffic congestion

2. Guaranteed throughput

With dedicated virtual channels and round robin scheduling, we can provide guaranteed bandwidth, as expressed by the equations in section 3.6. In this way, we can implement a hard real-time system.

In addition, a network infrastructure with guaranteed throughput is an advantage for system modeling. The system designer can predict the worst case of a transmission and estimate the system performance at higher levels.

3. Fault tolerance capability

Figure 26 : Fault tolerance

With different path assignments, we can easily avoid using faulty components. In Figure 26 (a), the original transmission path from P1 to P6 uses the faulty switch S3. By assigning another transmission path, we can still transmit data from P1 to P6 correctly. This example shows that we provide a flexible environment for overcoming fabrication faults.
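Choosing a replacement path that avoids a faulty switch can be sketched as a breadth-first search over the mesh. The text does not say how paths are chosen, so the search itself is an assumption, as are the function names and the row-major numbering of the 2x3 mesh (S1 S2 S3 on the first row, S4 S5 S6 on the second), which is inferred from the path S1, S2, S3, S6 mentioned in section 3.6.

```python
from collections import deque


def mesh_neighbors(switch, rows=2, cols=3):
    """Neighbors of a switch in a row-major numbered 2-D mesh."""
    row, col = divmod(switch - 1, cols)
    result = []
    # Visit east, south, west, north in that fixed order.
    for d_row, d_col in ((0, 1), (1, 0), (0, -1), (-1, 0)):
        r, c = row + d_row, col + d_col
        if 0 <= r < rows and 0 <= c < cols:
            result.append(r * cols + c + 1)
    return result


def find_path(source, destination, faulty=(), rows=2, cols=3):
    """Shortest switch sequence from source to destination, or None."""
    if source in faulty or destination in faulty:
        return None
    previous = {source: None}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == destination:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbor in mesh_neighbors(current, rows, cols):
            if neighbor not in previous and neighbor not in faulty:
                previous[neighbor] = current
                queue.append(neighbor)
    return None


print(find_path(1, 6))              # [1, 2, 3, 6]
print(find_path(1, 6, faulty={3}))  # [1, 2, 5, 6]
```

Once an alternative switch sequence is found, the path is put into effect by configuring the routing tables of the buffers in those switches, as described in section 3.3.1.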

Chapter 4 Experimental Results

To verify the functionality and evaluate the communication performance of our
