Thesis Organization - A Task Binding Method for Multi-Processor System-on-Chip

Chapter 1 Introduction

1.3 Thesis Organization

The rest of the thesis is organized as follows. Chapter 2 introduces our Multi-Processor System-on-Chip (MPSoC) platform, the switch design considerations, the model of the switch, and our design flow. Chapter 3 presents the details of the task binding techniques we use. Chapter 4 describes the experiment flow and discusses the experimental results. Finally, Chapter 5 concludes the thesis.

Chapter 2 Preliminary

2.1 Our Platform

As shown in Figure 1, there are two components in our platform: processors and switches. Each processor contains 32K bytes of local memory and connects to the local switch. Each switch connects to the four neighboring switches and the local processor.

Figure 1 Our platform

The topology of the network is a 2D mesh. We choose it for three reasons. First, as shown in [15], it is widely used in parallel computing platforms because adjacency gives it simple connections and easy routing. Second, the interconnect length between nodes is uniform, which ensures uniform performance and overall scalability of the network. Last, it meets the inherent constraints of IC manufacturing technology.

Data communication in the network is carried out as follows. A processor that wants to pass information to another processor sends the information to its local switch. That switch decides which adjacent switch receives the information next. If the local processor of the receiving switch is the destination, the information is delivered to that processor; otherwise, the procedure repeats until the information reaches its destination.
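The hop-by-hop procedure can be illustrated with a minimal sketch. The text does not state how a switch picks the next neighbor, so the sketch assumes XY (dimension-order) routing as a stand-in decision rule; the function names and the (x, y) coordinate convention are hypothetical.

```python
# Hypothetical sketch of hop-by-hop forwarding on a 2D mesh.
# ASSUMPTION: XY (dimension-order) routing stands in for the switch's
# decision rule, which the text does not specify.

def next_hop(current, destination):
    """Return the next switch coordinate, or None once the packet has arrived."""
    (cx, cy), (dx, dy) = current, destination
    if cx != dx:                      # first move along the X dimension
        return (cx + (1 if dx > cx else -1), cy)
    if cy != dy:                      # then move along the Y dimension
        return (cx, cy + (1 if dy > cy else -1))
    return None                       # the local processor is the destination


def route(source, destination):
    """List every switch visited, starting at the source's local switch."""
    path = [source]
    while (hop := next_hop(path[-1], destination)) is not None:
        path.append(hop)
    return path


print(route((0, 0), (2, 1)))
```

Each call to `next_hop` plays the role of one switch deciding where the information goes next; the loop in `route` is the repetition described above.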

2.2 Switch Design

2.2.1 Switching Strategies

Before describing how our switch works, we review common switching strategies. They can be classified into two categories: connection-oriented switching and connectionless switching [16]. In connection-oriented switching, also called circuit switching, a connection from the source to the destination is established before data transmission. Once the connection is established, the full bandwidth of the hardware path is available. Circuit switching is generally a good strategy when data transmission is long and infrequent, that is, when the time to establish the connection is short compared to the time to transmit the data. Since the connection is dedicated to one data transmission, the available bandwidth is known and the latency can be guaranteed, which suits real-time applications. However, while a connection is reserved for one data transmission, other transmissions cannot use the resources it occupies. This may degrade the overall network performance.

Alternatively, for connectionless switching, data can be partitioned into packets. Each packet can be individually routed from the source to the destination without any connection reserved prior to data transmission. As a result, the bandwidth utilization is more efficient.

Popular packet switching strategies include store-and-forward, virtual cut-through, and wormhole switching. In store-and-forward, a packet is completely buffered at each intermediate node before it is forwarded to the next node. Compared to circuit switching, it performs better when data transmission is short and frequent, since it does not require a dedicated connection. However, the switch implementation is expensive because the switch must be able to buffer a whole packet.

Store-and-forward assumes that a whole packet must be available before it can be forwarded to the next node. This is not generally necessary, however, because the first few bytes of a packet may contain the routing information. A switch can start forwarding a packet to the next node as soon as its routing information is available. The switching technique that exploits this is referred to as virtual cut-through. In the absence of blocking, the latency experienced by data transmission using this method is shorter, implying higher bandwidth utilization. On the other hand, if the header of a packet is blocked, the packet is completely buffered at the current node until it can be transferred. In that case, virtual cut-through behaves just like store-and-forward. The switch implementation is therefore still expensive, because the switch has to be able to buffer a whole packet.

It is difficult to construct a switch that is small, compact, fast, and capable of storing a whole packet. In wormhole switching, a packet is further decomposed into smaller units, so a switch only has to store a few units when data transmission is blocked. The buffer requirement within a switch is therefore substantially lower than for virtual cut-through, and the switch can be much smaller and faster. However, when the head of a packet is blocked, the following parts of the packet cannot move on. This forms a blocking path that in turn stalls other data transmissions. In this situation, the latency of data transmission is unpredictable [17].
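To make the latency difference concrete, the following sketch compares store-and-forward with cut-through style switching when nothing is blocked. It is a simplified model that is not taken from the thesis: it assumes every hop has the same bandwidth, that a whole packet needs `packet_cycles` cycles to cross one channel, and that its routing header needs `header_cycles` of those cycles. All names are hypothetical.

```python
# Simplified no-blocking latency model (an ASSUMPTION for illustration only).

def store_and_forward_cycles(hops, packet_cycles):
    """Each node receives the whole packet before sending it on."""
    return hops * packet_cycles


def cut_through_cycles(hops, packet_cycles, header_cycles):
    """Only the header pays the per-hop cost; the body streams behind it."""
    return hops * header_cycles + (packet_cycles - header_cycles)


hops, packet_cycles, header_cycles = 4, 16, 1
print(store_and_forward_cycles(hops, packet_cycles))
print(cut_through_cycles(hops, packet_cycles, header_cycles))
```

Under these assumptions the two models coincide for a single hop, and the gap grows with the number of hops, which is why forwarding as soon as the header is available pays off on longer paths.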

Switching strategy | Strength | Weakness
Circuit switching | Bandwidth and latency are guaranteed | Reserved resources cannot be used by other transmissions
Store-and-forward | Good when data transmission is short and frequent | Buffer size is big
Virtual cut-through | High bandwidth utilization | Buffer size is big
Wormhole | Buffer size is small; high bandwidth utilization | Blocking paths may make network performance unpredictable

Table 1 Comparison of different switching strategies

Before we move on to the next section, the characteristics of all the switching strategies mentioned above are summarized in Table 1.

2.2.2 Features of Our Switch

Our switch has four important features. First, circuit switching is chosen because it does not require much memory. On-chip memory is usually scarce because it occupies a large area. Circuit switching also supports real-time applications, which connectionless switching strategies do not.

Second, we exploit the idea of virtual channel flow control [18]. If data is buffered at the input or output of each physical channel, then once a message occupies the buffer, no other message can access that channel until the buffer is released. Even worse, deadlock could occur: a network state in which no message can advance because each message requires a physical channel occupied by another message.

Figure 2 illustrates such a situation.

Figure 2 Deadlock

In fact, a physical channel can be divided into several unidirectional virtual channels, each realized as a pair of buffers multiplexed across the physical channel. For example, in Figure 3, buffers a1 and b1 form a unidirectional virtual channel flowing from a1 to b1; buffers b2 and a2 form another from b2 to a2. Assuming that these two virtual channels share the bandwidth of the physical channel equally, each virtual channel operates as if on a separate physical channel, only with half of the original bandwidth.

Figure 3 Two unidirectional virtual channels multiplexed across a physical channel

By dividing a physical channel into several virtual channels, messages can make progress rather than being blocked. For example, Figure 4 shows two messages crossing the physical channel between switch 1 and switch 2. Without virtual channels, one message may prevent the other from advancing, depending on which is granted the physical channel first. With virtual channels multiplexed across the physical channel, however, both messages continue to make progress, each at half the rate achievable when a message has the physical channel to itself. Since the time a message waits before it is transferred is reduced, the average latency decreases. As a result, the physical channel utilization is higher and the network throughput increases. Adding more virtual channels can improve the overall message latency and network throughput further, at the cost of more buffer space and a more complex multiplexer.

Figure 4 Messages make progress rather than being blocked

Third, although the bandwidth of a physical channel can be equally shared among all virtual channels, this is generally not a good idea. As shown in Figure 5, instead of sharing bandwidth equally, we use a round-robin schedule to grant rights to each virtual channel. Only when messages are to be transmitted over some virtual channels does our round-robin schedule give rights to them. Those virtual channels without data transmission cannot, and do not have to, access the physical channel.

[Figure 5: buffers A, B, and C take turns on the physical channel; transmission order over time: A1 B1 C1 A2 B2 C2 A3 B3 C3 A4]

Figure 5 Virtual channels sharing the bandwidth of a physical channel using round-robin schedule

Moreover, if we find heavy traffic over some virtual channels, we can grant them access more frequently using the weighted round-robin schedule shown in Figure 6. All we have to do is configure our switch accordingly.

[Figure 6: buffer A is granted the physical channel twice per round; transmission order over time: A1 A2 B1 C1 A3 A4 B2 C2 B3 C3]

Figure 6 Virtual channels sharing the bandwidth of a physical channel using weighted round-robin schedule
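The two schedules in Figures 5 and 6 can be reproduced with a short sketch. It is an illustration only, not the switch's actual arbiter logic: the function name, the dictionary-based interface, and the idea of expressing a weight as "items sent per turn" are assumptions.

```python
from collections import deque


def weighted_round_robin(queues, weights=None):
    """Return the order in which buffered items cross the physical channel.

    queues  -- dict mapping a virtual channel name to its pending items
    weights -- dict mapping a name to the number of items it may send per
               turn (ASSUMED encoding); a missing name defaults to 1, so
               omitting weights gives the plain round-robin schedule
    """
    pending = {name: deque(items) for name, items in queues.items()}
    weights = weights or {}
    order = []
    while any(pending.values()):
        for name, queue in pending.items():
            # A virtual channel with nothing to send is skipped and
            # consumes no bandwidth.
            for _ in range(weights.get(name, 1)):
                if not queue:
                    break
                order.append(queue.popleft())
    return order


queues = {
    "A": ["A1", "A2", "A3", "A4"],
    "B": ["B1", "B2", "B3"],
    "C": ["C1", "C2", "C3"],
}
print(weighted_round_robin(queues))             # schedule of Figure 5
print(weighted_round_robin(queues, {"A": 2}))   # schedule of Figure 6
```

With equal weights the sketch yields the order of Figure 5, and giving buffer A a weight of 2 yields the order of Figure 6.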

Last, in traditional networks, the number of nodes is not known and the communication behavior is not predetermined. This is not true for on-chip networks, where the number of nodes and the communication behavior can be known before run-time. Therefore, we can establish dedicated connection paths in advance by reserving the corresponding virtual channels for them. With proper configuration, our switching method acts just like circuit switching. If needed, we can also allocate more bandwidth to some connection paths by using the weighted round-robin schedule mentioned above. Our switching technique can therefore support real-time applications in the same way as circuit switching does.

To summarize, features of our switch are listed below:

Small and fast

Re-configurable

Provides efficient bandwidth utilization

Supports real-time applications: dedicated connection paths can be reserved

Bandwidth of each connection path can be configured

2.2.3 Switch Model

As shown in Figure 7, each switch has five pairs of input and output ports. The output ports at the North (N), East (E), South (S), and West (W) sides connect to the corresponding input ports of the adjacent switches. The output port at the E side of the left switch, for example, connects to the input port at the W side of the right switch. The remaining pair of ports, at the O side, connects to the local processor: the output port sends data to it and the input port receives data from it.

Figure 7 Inside the switch

There are four queues at each side. Each queue stores data ready to be transferred to and received from the adjacent switch. For the input part of a queue, the label on it indicates that the received data will be transferred later to that side of the switch. For instance, the input part of the queue labeled N at the E side of a switch stores data to the N side of the same switch.

On the other hand, for the output part of a queue, the label on it indicates to which side the stored data is transferred in the adjacent switch. For example, the output part of the queue labeled S at the E side of a switch stores data that will later be transferred to the S side of the adjacent switch.

As mentioned in 2.2.2, a physical channel can be divided into several virtual channels. Our switch supports four modes with channel width factors equal to 1, 2, 4, and 8 respectively. Here, the channel width factor indicates the maximum allowable number of virtual channels passing each input/output queue at each side of a switch.

Since there are always four input and four output queues at each side of a switch, the number of virtual channels connecting to the adjacent switch at each side is four times the channel width factor. Therefore, if the channel width factor is 2, there are at most 8 virtual channels flowing out of and 8 flowing into each side of a switch.
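The relationship between the channel width factor and the number of virtual channels can be written as a small helper. The constant and function names are hypothetical; the numbers come from the description above.

```python
QUEUES_PER_SIDE = 4                # four input and four output queues per side
SUPPORTED_FACTORS = (1, 2, 4, 8)   # the four modes of the switch


def virtual_channels_per_side(channel_width_factor):
    """Maximum number of virtual channels per direction at one side of a switch."""
    if channel_width_factor not in SUPPORTED_FACTORS:
        raise ValueError(f"unsupported channel width factor: {channel_width_factor}")
    return QUEUES_PER_SIDE * channel_width_factor


for factor in SUPPORTED_FACTORS:
    print(factor, virtual_channels_per_side(factor))
```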

Data is transmitted over pipelined buses. In Figure 8, the grey line indicates a transmission path from processor 1 to processor 2, and processor 1 is sending two packets, ‘a’ and ‘b’, to processor 2. The transmission steps are as follows. First, in Figure 8(a), processor 1 sends ‘a’ to its local switch, S1. In the next cycle, ‘a’ is sent to S2 while ‘b’ is sent to S1, as shown in Figure 8(b). Then, in Figure 8(c), ‘a’ arrives at its destination, processor 2, while ‘b’ is sent to S2. Finally, ‘b’ also arrives at processor 2 in the fourth cycle, finishing the transmission.

Figure 8 Processor 1 sending two words to processor 2

Since the details of a transaction are not our focus in the switch model, we only need to know that it takes three cycles for a packet to pass a virtual channel. Therefore, the latency experienced by each packet is three times the number of virtual channels it passes, plus the time the packet waits while it is blocked. In the worst case, the bandwidth available to this packet is the minimum bandwidth among all virtual channels along its transmission path.
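The latency and bandwidth rules of the switch model translate directly into two small functions. The names are hypothetical, and expressing each virtual channel's bandwidth as a fraction of the physical channel is an assumption made for the example.

```python
CYCLES_PER_VIRTUAL_CHANNEL = 3   # from the switch model


def packet_latency(num_virtual_channels, blocked_cycles=0):
    """Latency in cycles: 3 cycles per virtual channel plus waiting time."""
    return CYCLES_PER_VIRTUAL_CHANNEL * num_virtual_channels + blocked_cycles


def worst_case_bandwidth(channel_bandwidths):
    """The slowest virtual channel on the path bounds the available bandwidth."""
    return min(channel_bandwidths)


# A packet crossing 4 virtual channels and waiting 5 cycles while blocked.
print(packet_latency(4, blocked_cycles=5))
# ASSUMED bandwidth shares (fractions of a physical channel) along one path.
print(worst_case_bandwidth([0.5, 0.25, 1.0]))
```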

2.3 Our Design Flow

Figure 9 depicts our design flow, which starts with an application. Initially, the algorithm for the application is chosen and partitioned into interacting tasks. These tasks and their relationships are then captured in a graph model, which indicates the amount of computation required by each task and the amount of communication between each pair of tasks. Since many algorithms contain feedback loops, an iteration bound, which is the lower bound on the achievable iteration period, is imposed. It is not possible to achieve an iteration period less than the iteration bound, even when infinite resources are available. At this point we examine whether our algorithm meets the performance constraints. If not, we modify and repartition the algorithm, repeating the same procedure until the constraints are met.

Figure 9 Our design flow
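The iteration bound check in the design flow can be sketched as follows. The text does not give a formula, so the sketch assumes the usual definition from the dataflow-graph literature: each feedback loop has a loop bound equal to its total computation time divided by its number of delay elements, and the iteration bound is the largest loop bound. The function names and the example loops are hypothetical.

```python
from fractions import Fraction


def iteration_bound(loops):
    """loops: iterable of (total_computation_time, number_of_delays) pairs,
    one pair per feedback loop (ASSUMED input format)."""
    return max(Fraction(time, delays) for time, delays in loops)


def meets_constraint(loops, required_period):
    """The required iteration period is achievable only if it is not
    smaller than the iteration bound."""
    return iteration_bound(loops) <= required_period


loops = [(6, 2), (10, 4), (4, 1)]   # hypothetical loops: bounds 3, 5/2 and 4
print(iteration_bound(loops))
print(meets_constraint(loops, required_period=5))
```

If the check fails, no amount of extra processors helps, which is why the flow goes back to modifying and repartitioning the algorithm.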

After the algorithm analysis step, processors are allocated for the tasks and we try to schedule the tasks. In this step, two important factors are considered: memory size and computing power. For the former, since each processor has only a limited amount of memory, we must make sure that the data required by a task and the intermediate data it generates fit into the memory of a processor. For the latter, we wish to share the task load equally among the allocated processors so that no processor is idle while others are busy. Handling these two points carefully improves the system performance, because each processor is utilized as fully as possible.
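As an illustration of how the two factors interact, the sketch below balances load under a memory limit. It is not the allocation method of the thesis, which is not described here; it swaps in a simple greedy heuristic (heaviest piece of work first, placed on the least-loaded processor that still has room). The work names, loads, and memory figures are hypothetical; only the 32 KB local memory size comes from the platform description.

```python
LOCAL_MEMORY_BYTES = 32 * 1024   # local memory of each processor


def allocate(tasks, num_processors, memory_limit=LOCAL_MEMORY_BYTES):
    """tasks: dict name -> (load, memory_bytes). Returns (groups, loads)."""
    loads = [0] * num_processors
    memory = [0] * num_processors
    groups = [[] for _ in range(num_processors)]
    # Heaviest work first, so that large items are balanced before small ones.
    for name, (load, mem) in sorted(tasks.items(), key=lambda kv: -kv[1][0]):
        fits = [p for p in range(num_processors) if memory[p] + mem <= memory_limit]
        if not fits:
            raise ValueError(f"{name} does not fit into any local memory")
        target = min(fits, key=lambda p: loads[p])   # least-loaded candidate
        groups[target].append(name)
        loads[target] += load
        memory[target] += mem
    return groups, loads


tasks = {   # hypothetical workloads: (load, memory in bytes)
    "fft": (8, 20 * 1024),
    "filter": (5, 16 * 1024),
    "quant": (4, 8 * 1024),
    "pack": (3, 4 * 1024),
}
print(allocate(tasks, num_processors=2))
```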

The task binding process is performed next. In this step, we decide which task is mapped onto which processor. Interacting tasks should be mapped onto processors in the same region to reduce the time spent on their interaction. In our platform, dedicated connection paths reserved in advance are used when processors communicate. When assigning connection paths, we try to find the shortest path between each pair of communicating processors while avoiding overuse of routing resources.

Finally, all information generated will be collected and fed to our simulator. We can run some applications on our simulator to see if they work well on our platform.

Chapter 3 Task Binding

In this chapter, the task binding problem will be discussed. First, the problem formulation of task binding will be given in 3.1. Then our solution to the problem is presented in 3.2. Finally, details of the techniques we exploit to solve this problem will be shown in 3.3 and 3.4.

3.1 Problem Formulation

The task binding problem can be formulated as follows.

Given

Applications A_1 to A_k

The corresponding task graph (a directed acyclic graph) G_i = G(V_i, E_i) for each application A_i

With

Each vertex v ∈ V_i representing a task to run on one processor, with the amount of computation shown in the vertex

Each edge e ∈ E_i indicating data transmission along the arrow, with the amount of communication shown on the edge

NV_i representing the total number of vertices in G_i, and NV representing the sum of all NV_i; NV must be less than or equal to the number of processors available

We wish to

Map each vertex onto a processor

Place connected tasks as close as possible to reduce interaction time

Find a corresponding connection path for each edge

Minimize the total routing resource requirement
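One way to make the formulation concrete is to represent a candidate solution as a mapping from tasks to mesh coordinates and to score it. The sketch below is an assumption-laden illustration: it uses the summed Manhattan distance over all edges as a stand-in for the routing resource requirement, which the formulation does not define precisely, and all names are hypothetical.

```python
def binding_cost(task_graphs, mapping, mesh_size):
    """task_graphs: list of edge lists [(src, dst, volume), ...], one per application.
    mapping: dict task -> (x, y) processor coordinate.
    Returns the total hop count over all edges (ASSUMED proxy for routing resources)."""
    width, height = mesh_size
    positions = list(mapping.values())
    if len(mapping) > width * height:
        raise ValueError("NV exceeds the number of processors available")
    if len(set(positions)) != len(positions):
        raise ValueError("two tasks are mapped onto the same processor")
    cost = 0
    for edges in task_graphs:
        for src, dst, _volume in edges:
            (x1, y1), (x2, y2) = mapping[src], mapping[dst]
            cost += abs(x1 - x2) + abs(y1 - y2)   # shortest-path length on a mesh
    return cost


graphs = [[("t0", "t1", 4), ("t1", "t2", 2)]]      # one hypothetical application
close = {"t0": (0, 0), "t1": (1, 0), "t2": (1, 1)}
spread = {"t0": (0, 0), "t1": (1, 1), "t2": (0, 1)}
print(binding_cost(graphs, close, (2, 2)), binding_cost(graphs, spread, (2, 2)))
```

Placing connected tasks on neighboring processors lowers the score, which matches the two goals of keeping connected tasks close and minimizing routing resources.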

3.2 Task Binding Flow

The task binding process tries to find a solution in which every vertex of the task graphs is mapped onto a processor and, for each pair of connected vertices, a connection path is reserved for communication. Because of the complexity of the problem, the process is divided into two parts.

The first part of the process, task mapping, is analogous to the placement problem in FPGA design. In our problem, however, instead of determining which logic block within an FPGA should implement each logic block required by the circuit, we decide which processor should execute which task.

The second part of the process, connection path assignment, is very similar to routing in FPGA design. In an FPGA, one source node and many sink nodes can be connected to a signal wire. In our platform, by contrast, only one source node and one sink node can be connected to the ends of a virtual channel. This is because a virtual channel carries data, not voltages.
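Connection path assignment can be sketched as a shortest-path search that respects a limit on how many virtual channels each physical channel offers. The sketch is a simplified stand-in, not the routing technique of the thesis: it uses breadth-first search, a single `capacity` value for every directed link, and hypothetical names throughout.

```python
from collections import deque


def assign_path(src, dst, mesh_size, usage, capacity):
    """Find a shortest path from src to dst over links with a free virtual channel.

    usage    -- dict mapping a directed link (a, b) to its reserved channel
                count; updated in place when a path is found
    capacity -- ASSUMED uniform number of virtual channels per directed link
    Returns the list of switch coordinates, or None if no path is available.
    """
    width, height = mesh_size
    parent = {src: None}
    frontier = deque([src])
    while frontier:
        node = frontier.popleft()
        if node == dst:
            break
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= nxt[0] < width and 0 <= nxt[1] < height
            if inside and nxt not in parent and usage.get((node, nxt), 0) < capacity:
                parent[nxt] = node
                frontier.append(nxt)
    if dst not in parent:
        return None
    path = [dst]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    path.reverse()
    for a, b in zip(path, path[1:]):          # reserve one channel per link
        usage[(a, b)] = usage.get((a, b), 0) + 1
    return path


usage = {}
first = assign_path((0, 0), (1, 0), (2, 2), usage, capacity=1)
second = assign_path((0, 0), (1, 0), (2, 2), usage, capacity=1)
third = assign_path((0, 0), (1, 0), (2, 2), usage, capacity=1)
print(first, second, third)
```

In this example the second request has to detour because the direct link is already fully reserved, and the third finds no free path at all. Each reservation has exactly one source and one sink, matching the point made above about virtual channels.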

Figure 10 shows the flow of task binding. First, a task graph containing the computation and communication information for all tasks is given. Note that this task graph should be a directed acyclic graph. Next, we apply placement techniques used for FPGAs to map tasks onto processors. If any two tasks communicate, processors in the same region are allocated to them. Then, we reserve connection paths for these tasks, so that data transmission uses pre-allocated, dedicated connection paths. Finally, the process reports the position of the processor assigned to each task, the connection path for each pair of communicating processors, and the number of virtual channels needed across each physical channel.

Figure 10 Task binding flow

3.3 Task Mapping

The three major types of placers commonly in use today are min-cut, simulated annealing, and analytic placers. The use of an analytic placer is usually followed by iterative improvement [19]. Since the architecture of on-chip networks is an important design issue, we wish to explore many different architectures. Thus, the optimization goals of our placer may change from architecture to architecture. Among the three types, a simulated annealing based placer can be adapted to new optimization goals more easily than min-cut and analytic placers [20]. Therefore, we use simulated annealing to map tasks onto processors.

3.3.1 Simulated Annealing

Simulated annealing, proposed in [21], is a widely used heuristic for solving combinatorial optimization problems, including many well-known CAD problems. It belongs to the class of non-deterministic algorithms. As its name suggests, simulated annealing mimics the annealing process used to gradually cool molten metal to produce high-quality metal objects. During the process, a metal is heated to a very high temperature and then slowly cooled down. At a proper cooling rate, the process has a very good chance of producing high-quality metal objects. If we compare optimization to the annealing process, the attainment of a good solution is analogous to that of refined metal objects.
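A minimal sketch of simulated annealing applied to task mapping is given below. It is a generic textbook form, not the placer of the thesis: the cost function (communication volume times Manhattan distance), the swap move, the geometric cooling schedule, and every parameter value are assumptions chosen for illustration.

```python
import math
import random


def total_cost(edges, mapping):
    """ASSUMED cost: communication volume times Manhattan distance, summed over edges."""
    return sum(volume * (abs(mapping[u][0] - mapping[v][0]) +
                         abs(mapping[u][1] - mapping[v][1]))
               for u, v, volume in edges)


def anneal(edges, tasks, slots, seed=0, start_temp=10.0,
           cooling=0.95, moves_per_temp=50, min_temp=0.01):
    """Map tasks onto processor slots; requires len(slots) >= len(tasks)."""
    rng = random.Random(seed)
    placement = list(slots)        # placement[i] hosts tasks[i]; the rest stay free
    rng.shuffle(placement)

    def cost_of(order):
        return total_cost(edges, dict(zip(tasks, order)))

    current = cost_of(placement)
    best, best_cost = placement[:], current
    temp = start_temp
    while temp > min_temp:
        for _ in range(moves_per_temp):
            i, j = rng.sample(range(len(placement)), 2)
            placement[i], placement[j] = placement[j], placement[i]   # swap move
            delta = cost_of(placement) - current
            # Always accept improvements; accept worse moves with a
            # probability that shrinks as the temperature drops.
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current += delta
                if current < best_cost:
                    best, best_cost = placement[:], current
            else:
                placement[i], placement[j] = placement[j], placement[i]   # undo
        temp *= cooling            # geometric cooling schedule
    return dict(zip(tasks, best)), best_cost


tasks = ["a", "b", "c", "d"]
edges = [("a", "b", 1), ("b", "c", 1), ("c", "d", 1)]   # hypothetical task chain
slots = [(x, y) for x in range(2) for y in range(2)]     # 2 x 2 mesh
mapping, cost = anneal(edges, tasks, slots)
print(mapping, cost)
```

Accepting some cost-increasing moves at high temperature is what lets the search escape poor local arrangements; as the temperature falls the search behaves more and more like plain iterative improvement.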

For a combinatorial optimization problem, we wish to find out solutions with low costs in the solution space, which is a set containing all possible solutions. Some
