Thesis Organization - Performance Optimization in Latency Insensitive Systems Using a Minimal Number of Buffers

Chapter 1 Introduction

1.3 Thesis Organization

This thesis is organized as follows. Chapter 2 gives the preliminaries of our work: an introduction to latency insensitive systems (LISs), how to repair the performance degradation that multi-cycle communication causes in an LIS, and related work. Chapter 3 presents the proposed strategy for performance optimization with minimal buffer size. Chapter 4 provides the experimental results and related analyses. Chapter 5 concludes this thesis and lists possible future work.

Chapter 2 Preliminaries

2.1 Latency Insensitive System (LIS)

The idea of designing a system that is insensitive to arbitrary variations in interconnect delay was first presented in [14]. The proposed approach, latency insensitive design (LID), is a design methodology for SoCs that automatically adjusts the original system so that the new system tolerates varying interconnect delays.

LID encapsulates each IP core (the pearl) within an automatically synthesized interface (the shell) and inserts repeaters to pipeline long interconnects. These repeaters are called relay stations (RSs) in an LIS. Using LID, one can derive an LIS from the original synchronous system. IP cores may be synchronous sequential logic blocks of any complexity as long as they are stallable, i.e., their operation can be temporarily stalled [12]. Relay stations are clocked buffers with twofold storage capacity that pipeline every long interconnect so that it meets the target clock period. After these transformations, the LIS is latency-equivalent to the original synchronous system [12]: if we ignore the stalling (void) events, the remaining informative (valid) events on each channel of the LIS are exactly the same as the informative events on the corresponding channel of the original system. In summary, the contribution of LID is the guarantee that a design can cope with any amount of interconnect delay without redesigning any IP core.

Figure 2-1 illustrates the typical structure of an LIS implementation. Four pre-designed IP cores are encapsulated within shells, and five relay stations are inserted on long interconnects. The IP cores communicate with each other through a set of point-to-point, pipelined channels. The encapsulated IP cores, relay stations, and point-to-point channels form the entire LIS.

Figure 2-1. Shell encapsulation and RS insertion in an LIS.

Figure 2-2 shows the detailed architecture of an encapsulated IP core. The block diagram in the example contains two input channels, one output channel, a stallable IP core, and a controller that drives each element. Each input channel has two end points: one goes directly to an input port of the stallable IP core, and the other goes to the storage element (queue) that every channel has. Through a multiplexer, the IP core takes data either directly from the input channel or from the queue. Each encapsulated IP core is accompanied by a controller, which determines the vital control signals, such as the select signal of the multiplexer, the stall signal of the IP core, and the operation signals of the queue. The RTL designs of the shell and the relay station are detailed in [15].

Every shell and every relay station follows a common communication protocol. This protocol, which allows shells and relay stations to exchange data over channels of varying length, is the latency insensitive protocol (LIP) [11]. LIP defines each datum exchanged by the shells as either valid or void and lets the shells ignore void data. The shell fires (executes) the IP core if and only if a valid datum is available on every input channel, either directly from the channel or from the channel's queue. Otherwise, the shell stalls the core. The architecture of a relay station is similar to that of an encapsulated IP core; the IP core of a relay station can be viewed as a simple edge-triggered flip-flop.
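The firing rule of the shell can be sketched in a few lines of Python. This is only an illustrative model under our own assumptions, not the RTL design of [15]: the names `InputPort`, `shell_step`, and `VOID` are hypothetical, the stop signal is simply reported when a queue is full at the end of the cycle, and a valid datum that arrives at a full queue while the core is stalled is dropped (the sender is expected to re-send it).

```python
from collections import deque
from typing import Any, List, Tuple

VOID = None  # hypothetical marker for a void (stalling) datum


class InputPort:
    """One input channel of a shell with its storage queue (assumed model)."""

    def __init__(self, capacity: int = 1) -> None:
        self.queue: deque = deque()
        self.capacity = capacity


def shell_step(ports: List[InputPort], incoming: List[Any]) -> Tuple[bool, List[Any], List[bool]]:
    """Simulate one clock cycle of a shell.

    Returns (fired, operands, stop), where stop[i] is True when the queue of
    port i is full at the end of the cycle (back-pressure to the sender).
    """
    # Fire only if every port can supply a valid datum, from its queue or directly.
    can_fire = all(len(p.queue) > 0 or d is not VOID for p, d in zip(ports, incoming))
    operands: List[Any] = []
    for p, d in zip(ports, incoming):
        if can_fire:
            if p.queue:
                operands.append(p.queue.popleft())  # oldest queued datum first
                if d is not VOID:
                    p.queue.append(d)               # keep the newly arrived datum
            else:
                operands.append(d)                  # bypass: take it from the channel
        elif d is not VOID and len(p.queue) < p.capacity:
            p.queue.append(d)                       # core stalls: store the valid datum
    stop = [len(p.queue) >= p.capacity for p in ports]
    return can_fire, operands, stop


ports = [InputPort(1), InputPort(1)]
print(shell_step(ports, [VOID, "b1"]))   # one input is void: the core stalls
print(shell_step(ports, ["a1", VOID]))   # queued "b1" plus direct "a1": the core fires
```

In the first call the core stalls and the valid datum is queued; in the second call the queued datum and the directly received datum together satisfy the firing condition.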

Figure 2-2. Block diagram of an encapsulated IP core.

System throughput is the primary metric of system performance. It is usually calculated as the rate at which valid data are generated. Figures 2-3 and 2-4 show how to calculate the throughput of an LIS.

Figure 2-3. Progressive trace of a simple LIS.

[Figure 2-4 content: valid data 1 to 9 at timestamps T1 to T9]

Figure 2-4. Output data sequence at core C in Figure 2-3.

In Figure 2-3, the large white rectangles represent the IP cores of a system, and the small white rectangles inside the IP cores are the queues on their input channels. IP A and IP B each have one input channel, and IP C has two input channels; the queue on every input channel has size 1. A channel queue of size 1 is called a minimum queue, so Figure 2-3 is an LIS with a minimum queue on every channel. The red numbers in Figure 2-3 represent valid data, and a positive integer "i" denotes the i-th valid datum generated by an IP core.

Note that when an IP core fires after taking the (i-1)-th valid data from its input channels, it outputs its i-th valid datum to its output channels. When the IP core stalls instead, the shell stores the incoming valid data in the queues. We trace the i-th valid data to obtain the valid data generation rate. This trace of the data produced by the IP cores is called a progressive trace [16]. Since IP C is the only output of this simple LIS, the system throughput can be derived by analyzing the data generated on the output channel of IP C. Figure 2-4 shows the resulting output data sequence. IP C produces a valid datum at every clock cycle, so the throughput of this LIS is 1. However, this simple example does not consider the effects of inserted relay stations and of the back-pressure mentioned in [17] and [18]. A more realistic LIS example is shown in Figure 2-5.

The shaded rectangle indicates a relay station, which simply passes the data it receives to its output channel at the next clock cycle. Red numbers are valid data, as in Figure 2-3, and blue numbers are void data. Since a relay station only passes on received data, it never generates new valid data. We use the symbol 'τ' to represent the non-generated, void datum that a relay station outputs at timestamp 1.

Figure 2-5. Simple LIS example with inserted relay station and back-pressure.

At timestamp 1, all IP cores produce their first valid data, while the relay station can only stall and release the void datum τ. At timestamp 2, IP C receives a valid datum on only one of its input channels, but it needs the first valid datum from each of its input channels to generate its second valid datum. Therefore, IP C stalls and outputs a void datum.

The first valid datum generated by IP B is not processed, so it is stored in the queue of the lower input channel of IP C. As a result, this queue becomes full at the end of timestamp 2. To avoid losing valid data through queue overflow, the shell of IP C forces IP B to stall at timestamp 3. The stop signal that requests a source IP to stall at the next timestamp is called back-pressure. The occurrence of back-pressure is highlighted by coloring the affected channel red. At timestamp 3, IP C gets all required data from its input channels, so it can generate its next valid datum. IP B has to stall because of the back-pressure raised at timestamp 2. Note that since the queue of the lower input channel of IP C is full at timestamp 2, the datum sent by IP B at timestamp 2 is discarded by the shell of IP C. For this reason, IP B re-sends its generated datum at timestamp 3 even though it stalls at timestamp 3. Also note that the queue of IP B is full at timestamp 3, so IP B sends a stop signal to IP A to request IP A to stall at the next timestamp. At timestamp 4, all IPs except IP A produce their next valid data. IP A stalls at timestamp 4 but still re-sends its datum to IP B, as IP B does at timestamp 3. At timestamp 5, all IP cores fire to produce valid data, and the relay station passes on a received void datum. The system behavior at timestamp 5 is identical to that at timestamp 1. By progressive trace, we infer that the LIS example has a period of four clock cycles, as shown in Figure 2-6, which gives the output data sequence of IP C. This LIS outputs three valid data in every four clock cycles, so its throughput is three fourths (3/4).

Figure 2-6. Output data sequence of core C in Figure 2-5
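The throughput figures quoted for the two examples can be checked with a small Python sketch that counts valid data over one period of an output trace. The function name `steady_state_throughput` and the two traces are our own illustration; the second trace is transcribed from the prose description of the example with a relay station (valid, void, valid, valid, repeating), not read from the figure itself.

```python
from fractions import Fraction
from typing import Any, List, Tuple

VOID = "τ"  # symbol used in the text for a void datum


def steady_state_throughput(trace: List[Any]) -> Tuple[int, Fraction]:
    """Return (period, throughput) of the valid/void pattern of an output trace.

    The period is the smallest shift under which the pattern repeats within the
    observed trace; throughput is the fraction of valid data in one period.
    """
    if not trace:
        raise ValueError("trace must contain at least one timestamp")
    valid = [d != VOID for d in trace]
    n = len(valid)
    for period in range(1, n + 1):
        if all(valid[i] == valid[i + period] for i in range(n - period)):
            return period, Fraction(sum(valid[:period]), period)
    raise AssertionError("unreachable: period n always matches")


# Simple LIS without relay stations: one valid datum per clock cycle.
print(steady_state_throughput([1, 2, 3, 4, 5, 6, 7, 8, 9]))

# LIS with a relay station and back-pressure, as described in the text.
print(steady_state_throughput([1, VOID, 2, 3, 4, VOID, 5, 6]))
```

The first trace yields a period of 1 and a throughput of 1; the second yields a period of 4 and a throughput of 3/4. A trace must cover at least two periods for the detected period to be meaningful.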

Finally, we summarize the advantages of LIS. LIS is an effective solution for global interconnects whose lengths vary and are unknown in the early design stage. By adding relay stations and encapsulating IP cores, the LIS approach guarantees robust system behavior under LIP. However, it does not guarantee the same robustness for throughput, which is affected by the back-pressure mechanism. Two techniques have been proposed to address the throughput optimization problem of LIS: relay station insertion and sizing of the channel queues.

2.2 Throughput Optimization for Latency Insensitive System

The advantages and disadvantages of LIS have been discussed. Next, we discuss two related techniques used to optimize the throughput of an LIS.

2.2.1 Relay Station Insertion

In Figure 2-5, one of the causes of back-pressure is the imbalance of channel latency. Data transmitted from IP A to IP C on the upper path experience one clock cycle of delay, but data transmitted on the lower channel do not. This imbalance causes back-pressure and degrades the throughput of the LIS. Casu and Macchiarulo suggest equalization, which equalizes all paths by inserting enough relay stations to give them the same latency [19]. There are therefore two reasons to add relay stations to an LIS. The first is to break up long channels to meet the target clock period. The second is to optimize throughput by balancing the latency of channels. Figure 2-7 demonstrates how to balance latency by inserting relay stations.

Figure 2-7. Optimize throughput by inserting relay station.

The left of Figure 2-7 is the same LIS example as in Figure 2-5, in which back-pressure occurs. We now insert a relay station on the channel connecting IP B and IP C, as shown on the right of Figure 2-7. As a result, all data arriving at IP C have experienced the same latency, so back-pressure does not occur.
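For an acyclic system, this balancing step can be sketched as follows. The sketch is our own simplification and not the algorithm of [19]: it assumes a feed-forward topology given in topological order, counts only relay stations as latency, and uses a hypothetical function name `equalize`. The channel list is the topology we assume from the description of the example (A to B, B to C, and A to C through one relay station).

```python
from typing import Dict, List, Tuple

Channel = Tuple[str, str]


def equalize(channels: Dict[Channel, int], order: List[str]) -> Dict[Channel, int]:
    """Return the number of extra relay stations to insert on each channel.

    channels maps (source, destination) to the relay stations already on it.
    order lists the cores in topological order (sources before destinations),
    so the sketch only applies to acyclic systems.
    """
    lag = {core: 0 for core in order}  # relay-station delay accumulated at each core
    for core in order:
        incoming = [(src, rs) for (src, dst), rs in channels.items() if dst == core]
        if incoming:
            lag[core] = max(lag[src] + rs for src, rs in incoming)
    # Pad every channel so that all inputs of a core arrive with the same lag.
    return {(src, dst): lag[dst] - lag[src] - rs for (src, dst), rs in channels.items()}


channels = {("A", "B"): 0, ("A", "C"): 1, ("B", "C"): 0}
extra = equalize(channels, ["A", "B", "C"])
print(extra)  # one extra relay station on channel (B, C), none elsewhere
```

Under these assumptions the sketch pads only channel (B, C), which matches the insertion described for the right of Figure 2-7. It does not handle feedback loops.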

With this relay station inserted, the throughput of the LIS improves to 1. This is how relay station insertion optimizes throughput. Nevertheless, relay station insertion has its limitations. Lu and Koh proved that equalization does not work for all systems [20]. Figure 2-8 illustrates a counterexample. To balance the latency of the paths from IP A to IP E, a relay station must be added to either channel (A, C) or channel (C, E), but this unbalances either the paths from IP C to IP A or the paths from IP E to IP C. More relay stations then need to be inserted to balance those paths. An exhaustive progressive trace shows that the throughput never improves to 1.

Figure 2-8. Counter example of relay station insertion

The counterexample shows that relay station insertion is not a general solution for all LISs, which motivates the search for better solutions.

2.2.2 Queue Sizing

Another cause of back-pressure is the size of the queues. When a queue is full, the shell has to send a stop signal to stall the source IP. This motivates increasing the queue size so that back-pressure does not occur; without back-pressure, the performance of the LIS can be optimized. Figure 2-9 illustrates the effect of increasing the queue size of the lower channel of IP C to 2. The left of Figure 2-9 is the example of Figure 2-5 at timestamp 2, where back-pressure occurs because the queue is full. After one queue slot is added to the lower channel of IP C, as on the right of Figure 2-9, one slot always remains unused, and hence back-pressure never occurs. The throughput of the example also improves to 1.

Figure 2-9. Optimize throughput by queue sizing

Relay station insertion can be viewed as another kind of queue sizing, because a relay station is a clocked storage element like a queue. The difference is that a relay station delays every received datum by one clock cycle, whereas a queue does not. The advantage of queue sizing is that, unlike relay station insertion, it does not potentially affect other parts of the system, since a queue delays data only when needed. Increasing the queue size causes only a slight additional hardware cost, which does not influence the overall system architecture in most systems. Because of these characteristics, queue sizing has become the mainstream approach to LIS throughput optimization.

In summary, queue sizing offers a trade-off between performance optimization and area overhead.

2.3 Related Works

LIS has been discussed frequently in recent years, and many research works adopt different assumptions about the hardware architecture and the available physical information. Next, we introduce two important research works on LIS. Works before 2003 considered only ideal LISs (LISs with infinite queues and no back-pressure). Lu and Koh were the first to propose solving the back-pressure problem of LIS by queue sizing [17]. They showed that a practical LIS with finite queues and back-pressure can reach the performance of an ideal LIS if proper queue sizing is adopted. They also proposed an approach to analyzing complex LISs: the lis graph and the extended lis graph were presented to model LISs. The throughput of an LIS is decided by its most critical cycles, called the system cycles, and is calculated by equation (1).

In equation (1), C is the set of all cycles in the lis graph, W(Ci) is the sum of the edge weights of cycle Ci, and |Ci| is the number of edges of cycle Ci. System cycles are the cycles with the maximum cycle mean, and they determine the throughput upper bound. Once the throughput reaches this upper bound, which equals 1 in most cases, it cannot be improved further by queue sizing.
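For reference, the cycle mean mentioned above can be written with the symbols just defined. This is the standard definition of a cycle mean and is not meant as a restatement of equation (1); the names \mu and \mu^{*} are introduced here only for illustration and do not come from [17].

```latex
\mu(C_i) = \frac{W(C_i)}{|C_i|},
\qquad
\mu^{*} = \max_{C_i \in C} \mu(C_i)
```

The system cycles are then the cycles C_i for which \mu(C_i) = \mu^{*}.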

Finally, Lu and Koh proposed a mixed integer linear programming (MILP) solution for queue sizing.

Collins and Carloni proposed a heuristic for queue sizing that produces near-optimal solutions in a shorter time [21], [22]. They model LISs with marked graphs; a marked graph is a bipartite directed graph. The performance of an LIS modeled as a marked graph is represented by the maximal sustainable throughput (MST) θ. The MST is determined by the cycles with the lowest ratio of tokens to places, which is similar to the cycle mean in [17]. The details of the marked graph and the MST are introduced in Section 3.1.

The token deficit problem (TDP) is the problem of filling the token (queue) deficit of the cycles in an LIS. Collins et al. claimed that their heuristic algorithm for TDP is guaranteed to produce a solution that is optimal in performance but may require more queue space. Additionally, Collins et al. reported two trends of LISs. One is that the positions at which relay stations are inserted seriously affect throughput. The other concerns the efficiency of a fixed queue size: Collins et al. claimed that setting every queue size to 5 yields a throughput above 90% of the optimal solution. Collins et al. also make a hardware architecture assumption different from that of Lu and Koh in [17]. In our opinion, the hardware architecture assumption of Collins et al. is closer to practical situations.

There are other methods that address the LIS problem for different purposes. For instance, Casu and Macchiarulo avoided the queue sizing issue by scheduling the activation of IPs [23]. A limitation of their work is that building the schedules requires knowledge of the global system behavior. Bufistov et al. proposed a method that combines queue sizing and relay station insertion to achieve optimal throughput [24]. However, they assumed that increasing the queue size also increases the channel delay. This assumption does not hold in the hardware architecture we use.

Chapter 3 Throughput Optimization for LIS with Minimal Buffer Size

3.1 Marked Graph and Quantitative Graph (QG)

We first introduce the details of the marked graph. The marked graph is a modeling formalism for synchronous systems. Its simplicity makes it well suited to analyzing synchronous systems with periodic behavior, such as LISs. A marked graph has two kinds of vertices: places and transitions. By definition, each place has exactly one incoming edge and one outgoing edge, both of which connect to transitions. Places can hold zero or more tokens. Transitions cannot hold tokens, but they can fire and move tokens around in the graph. A transition is enabled to fire when the place on each of its incoming edges holds at least one token, which corresponds to the firing condition described in Section 2.1. All components of a marked graph fire according to the global clock to produce valid data.

Detailed definitions of the marked graph are reported in [21], [22], and [25].
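The firing rule can be illustrated with a minimal Python sketch. The class `MarkedGraph` and the ring example are our own assumptions for illustration and are not taken from [21], [22], or [25]; every enabled transition is assumed to fire once per clock step.

```python
from typing import Dict, List, Tuple


class MarkedGraph:
    """Marked graph in which each place has one source and one destination transition."""

    def __init__(self, places: Dict[str, Tuple[str, str, int]]) -> None:
        # places maps a place name to (source transition, destination transition, tokens).
        self.src = {p: s for p, (s, _, _) in places.items()}
        self.dst = {p: d for p, (_, d, _) in places.items()}
        self.tokens = {p: k for p, (_, _, k) in places.items()}
        self.transitions = sorted(set(self.src.values()) | set(self.dst.values()))

    def enabled(self) -> List[str]:
        """Transitions whose incoming places all hold at least one token."""
        return [t for t in self.transitions
                if all(self.tokens[p] >= 1 for p in self.tokens if self.dst[p] == t)]

    def step(self) -> List[str]:
        """Fire all enabled transitions once and return their names."""
        fired = set(self.enabled())
        for p in self.tokens:
            if self.dst[p] in fired:
                self.tokens[p] -= 1   # the consumer removes one token
            if self.src[p] in fired:
                self.tokens[p] += 1   # the producer adds one token
        return sorted(fired)


# Hypothetical ring with four places and three tokens.
ring = MarkedGraph({
    "p0": ("t3", "t0", 1),
    "p1": ("t0", "t1", 1),
    "p2": ("t1", "t2", 1),
    "p3": ("t2", "t3", 0),
})
fires_t0 = sum("t0" in ring.step() for _ in range(8))
print(fires_t0)  # t0 fires in 6 of the 8 steps
```

In this ring each transition fires in three out of every four steps, which is the ratio of tokens to places on the cycle.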

Figure 3-1. Modeling relay station and shell with marked graph representation.

Figure 3-1 shows the marked graph representation of a relay station and a shell.

The large white circles represent places, the small black circles represent tokens, the black vertical bars represent transitions, and the number q indicates that the place holds q tokens. Initially, the relay station's place on the solid edge has no token, since the relay station produces a void datum at timestamp 1, and the place on its dashed edge has two tokens, corresponding to the two available storage slots in its queue. Recall that a relay station is a clocked buffer with twofold storage capacity. The shell's place on the solid edge has one token, since the shell produces a valid datum at timestamp 1, and the place on its dashed edge has q tokens, where q is a positive integer. In the marked graph representation, the presence or absence of tokens on the solid edges represents valid or void data, and the tokens on the dashed edges represent the available queue slots of the channel [21].
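The initial markings just described can be collected in a small helper. This is only a sketch of the rule stated in the text; the names `ChannelMarking` and `initial_marking` and the string labels for the block kinds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelMarking:
    forward_tokens: int   # tokens on the place of the solid edge (data)
    backward_tokens: int  # tokens on the place of the dashed edge (free queue slots)


def initial_marking(block: str, q: int = 1) -> ChannelMarking:
    """Initial tokens for a relay station or a shell with queue size q."""
    if block == "relay_station":
        # Outputs a void datum at timestamp 1 and has two free storage slots.
        return ChannelMarking(forward_tokens=0, backward_tokens=2)
    if block == "shell":
        if q < 1:
            raise ValueError("q must be a positive integer")
        # Outputs a valid datum at timestamp 1 and has q free queue slots.
        return ChannelMarking(forward_tokens=1, backward_tokens=q)
    raise ValueError(f"unknown block kind: {block!r}")


print(initial_marking("relay_station"))
print(initial_marking("shell", q=1))
```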

Figure 3-2. Transformation from original LIS graph to marked graph representation.

Figure 3-2 illustrates how to transform the original LIS graph into a marked graph representation. All queue sizes of the shells are set to 1 in this case. Once the LIS has been transformed into a marked graph, the MST is convenient to calculate. System throughput can be computed by progressive trace, as described in Section 2.1, but progressive trace spends a lot of time simulating IP behavior at every timestamp, so it is impractical for complex systems. Instead, as noted in Section 2.3, we can compute the MST of the graph by finding the cycles with the lowest ratio of tokens to places. In Figure 3-2, the most critical cycle {A, D, C, B, A} has four places but only three tokens, so the system has an MST of three fourths (3/4).
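The ratio computation can be sketched in Python as follows. The function name `mst` is hypothetical, the cycles are assumed to be enumerated beforehand, and capping the result at 1 is our assumption. The per-place token counts of the critical cycle are also assumed; only the totals (four places, three tokens) come from the text.

```python
from fractions import Fraction
from typing import List


def mst(cycles: List[List[int]]) -> Fraction:
    """Lowest tokens-to-places ratio over the given cycles, capped at 1.

    Each cycle is given as the list of token counts of its places.
    """
    ratios = [Fraction(sum(tokens), len(tokens)) for tokens in cycles if tokens]
    return min([Fraction(1)] + ratios)


critical_cycle = [1, 0, 1, 1]   # assumed marking: four places, three tokens
other_cycle = [1, 1]            # hypothetical non-critical cycle with ratio 1
print(mst([critical_cycle, other_cycle]))   # 3/4

# Adding one token to the critical cycle, as an enlarged queue would, removes the deficit.
print(mst([[1, 0, 2, 1], other_cycle]))     # 1
```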

Another convenience of the marked graph representation is that it reflects the queue sizing problem easily. Figure 3-3 shows how queue sizing is reflected in the marked graph. If we want to add an extra queue to IP B, we only need to put an additional
