Related Works - Performance Optimization of Latency-Insensitive Systems Using a Minimal Number of Buffers

Chapter 2 Preliminaries

2.3 Related Works

LIS has been discussed frequently in recent years. Many research works have been carried out under different hardware architecture assumptions and different physical information assumptions. Next, we introduce two important research works on LIS. Works before 2003 considered only ideal LISs (LISs with infinite queues and no back-pressure). Lu and Koh were the first to propose solving the LIS back-pressure problem by queue sizing [17]. They showed that the performance of a practical LIS with finite queues and back-pressure can reach that of an ideal LIS if proper queue sizing is adopted. They also proposed an approach to analyze complex LISs, presenting the lis graph and the extended lis graph to model LISs. The throughput of an LIS is decided by its most critical cycles, called the system cycles. The throughput calculation for those LISs is shown in equation (1).

where C is the set of all cycles in the lis graph, W(Ci) is the sum of edge weights of cycle Ci, and |Ci| is the number of edges in cycle Ci. System cycles are the cycles with the maximum cycle mean, and they determine the throughput upper bound. Throughput cannot be improved further by queue sizing once it reaches this upper bound, which is equal to 1 in most cases.

Finally, Lu and Koh proposed a mixed integer linear programming (MILP) solution for queue sizing.

Collins and Carloni proposed a heuristic for queue sizing that produces near-optimal solutions in less time, as reported in [21] and [22]. A marked graph is a bipartite directed graph, and Collins et al. use it to model LISs. The performance of an LIS in a marked graph is represented by the maximal sustainable throughput (MST) θ. The MST is determined by the cycles with the lowest ratio of tokens to places. This ratio is similar to the cycle mean in [17]. The details of the marked graph and the MST are introduced in Section 3.1.

The token deficit problem (TDP) is the problem of filling the token (queue) deficit of cycles in an LIS. Collins et al. claimed that their heuristic algorithm for TDP is guaranteed to produce a solution that is optimal in performance but may require more queue space. Additionally, Collins et al. reported two trends in LISs. One is that the position where a relay station is inserted strongly affects throughput. The other concerns the efficiency of a fixed queue size: Collins et al. claimed that setting every queue size to 5 yields a throughput above 90% of the optimal solution. Collins et al. also make a hardware architecture assumption different from the one Lu et al. made in [17]. In our opinion, the hardware architecture assumption of Collins et al. is closer to practical situations.

There are also other methods that solve the LIS problem for different purposes. For instance, Casu and Macchiarulo avoided the queue sizing issue by scheduling the activation of IPs [23]. A limitation of their work is that building schedules requires knowledge of the global system behavior. Bufistov et al. proposed a method that combines queue sizing and relay station insertion to achieve optimal throughput [24]. However, they assumed that increasing the queue size also increases the channel delay. This assumption does not hold in the hardware architecture we use.

Chapter 3 Throughput Optimization for LIS with Minimal Buffer Size

3.1 Marked Graph and Quantitative Graph (QG)

We first introduce the details of the marked graph. The marked graph is a modeling formalism for synchronous systems. Its simplicity makes it well suited to analyzing synchronous systems with periodic behavior, such as LISs. A marked graph has two kinds of vertices: places and transitions. By definition, each place has exactly one incoming edge and one outgoing edge, both of which connect to transitions. Places can hold zero or more tokens. Transitions cannot hold tokens, but they can fire and move tokens around in the graph. Each outgoing edge from a place connects to a transition, and each incoming edge to a place comes from a transition. A transition is enabled to fire when the place on each of its incoming edges has at least one token; this is the firing condition. All components of a marked graph fire to produce valid data according to the global clock.

Detailed definitions of the marked graph are reported in [21], [22], and [25].
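The firing rule described above can be illustrated with a small sketch. This is only a minimal model under stated assumptions, not the formal definition from [21], [22], or [25]: the class `MarkedGraph`, its method names, and the two-transition example are hypothetical, and all enabled transitions are assumed to fire together on each clock tick.

```python
from dataclasses import dataclass, field


@dataclass
class MarkedGraph:
    """Minimal marked graph: places hold tokens, transitions fire."""
    # place name -> (source transition, destination transition)
    places: dict = field(default_factory=dict)
    # place name -> current token count
    tokens: dict = field(default_factory=dict)

    def add_place(self, name, src, dst, tokens=0):
        # Each place has exactly one incoming and one outgoing edge.
        self.places[name] = (src, dst)
        self.tokens[name] = tokens

    def transitions(self):
        found = set()
        for src, dst in self.places.values():
            found.update((src, dst))
        return sorted(found)

    def inputs(self, t):
        return [p for p, (_, dst) in self.places.items() if dst == t]

    def outputs(self, t):
        return [p for p, (src, _) in self.places.items() if src == t]

    def enabled(self, t):
        # Enabled when every input place holds at least one token.
        return all(self.tokens[p] >= 1 for p in self.inputs(t))

    def step(self):
        """Fire all currently enabled transitions (one clock tick)."""
        firing = [t for t in self.transitions() if self.enabled(t)]
        for t in firing:          # consume one token per input place
            for p in self.inputs(t):
                self.tokens[p] -= 1
        for t in firing:          # produce one token per output place
            for p in self.outputs(t):
                self.tokens[p] += 1
        return firing


# Hypothetical two-transition loop carrying a single token.
g = MarkedGraph()
g.add_place("p_ab", "A", "B", tokens=1)
g.add_place("p_ba", "B", "A", tokens=0)
print(g.step())  # ['B']
print(g.step())  # ['A']
```

Because every place feeds exactly one transition, two transitions never compete for the same token, which is why the enabled set can be computed once per tick before any token is moved.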

Figure 3-1. Modeling relay station and shell with marked graph representation.

Figure 3-1 shows the marked graph representation of a relay station and a shell.

The large white circles represent places, the small black circles represent tokens, the black vertical bars represent transitions, and the number q indicates that the channel has q tokens. Initially, the relay station’s place on the solid edge has no token, since the relay station produces void data at timestamp 1, and the place on its dashed edge has two tokens, corresponding to the two available storage spaces in the queue. Recall that a relay station is a clocked buffer with a storage capacity of two. The shell’s place on the solid edge has one token, since the shell produces valid data at timestamp 1, and the place on its dashed edge has q tokens, where q is a positive integer. In a marked graph representation, valid or void data are represented by tokens on the solid edges, and the tokens on the dashed edges represent the available queue spaces in the channel [21].

Figure 3-2. Transformation from original LIS graph to marked graph representation.

Figure 3-2 illustrates how to transform the original LIS graph into a marked graph representation. All shell queue sizes are set to 1 in this case. Calculating the MST is convenient once the LIS has been transformed into a marked graph. Previously, we computed system throughput by progressive trace, as mentioned in Section 2.1, but progressive trace spends a lot of time simulating IP behavior at every timestamp, so it is impractical for calculating throughput in a complex system. However, based on Section 2.3, we can compute the MST of the graph by finding the cycles with the lowest ratio of tokens to places. In Figure 3-2, the most critical cycle {A, D, C, B, A} has four places but only three tokens on it, so the system has an MST of three fourths.

Another convenience of the marked graph representation is that it reflects the queue sizing problem easily. Figure 3-3 shows how the queue sizing problem maps to the marked graph. If we want to add an extra queue to IP B, we only need to put an additional token on the dashed edge of IP B. As a result, the most critical cycle {A, D, C, B, A} of the system has a ratio equal to 1, so the system has the optimal MST of 1.
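The MST calculation and the effect of adding one queue can be sketched numerically. The sketch below assumes each cycle is given simply as the list of token counts on its places; the token layout of the critical cycle and the second cycle are hypothetical, chosen only to match the counts quoted above (four places, three tokens). Capping the result at 1 is also an assumption, reflecting the usual throughput upper bound.

```python
from fractions import Fraction


def cycle_ratio(place_tokens):
    """Tokens-to-places ratio of one cycle (one entry per place)."""
    return Fraction(sum(place_tokens), len(place_tokens))


def mst(cycles):
    """Lowest tokens-to-places ratio over all cycles, capped at 1."""
    return min(Fraction(1), min(cycle_ratio(c) for c in cycles))


# Hypothetical token layout: the critical cycle has four places and
# three tokens; a second, non-critical cycle has two places, two tokens.
critical = [1, 1, 1, 0]
other = [1, 1]
print(mst([critical, other]))        # 3/4

# Adding one queue puts one more token on a dashed-edge place of the
# critical cycle, which raises its ratio to 1.
critical_sized = [1, 1, 1, 1]
print(mst([critical_sized, other]))  # 1
```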

Figure 3-3. Queue sizing problem reflected in the marked graph representation.

We prefer to adopt the marked graph representation in our LIS research for three reasons. (1) It is easy to transform the original LIS graph into a marked graph representation. All we need to do is find all channels and IP cores in the LIS and transform them into relay station or shell representations, as shown in Figure 3-1. (2) The throughput of an LIS is easy to calculate in a marked graph, since we only need to find the cycles with the lowest ratio of tokens to places. (3) It is easy to decide which places in the marked graph should have more tokens. This greatly helps us find the optimal solution.

Although the marked graph representation is convenient, it still has some drawbacks. One is that we are used to calculating throughput with plain integers. Tokens and places are easy to manipulate on a graph, but they are an indirect way to calculate throughput. Therefore, we propose a new graph representation, the quantitative graph (QG), which handles those problems properly. Figure 3-4 shows how we transform a marked graph into a quantitative graph. First, we quantify the numbers of places and tokens as integers, obtaining an intermediate graph that contains only four integers in each channel. Second, we transform every transition into a vertex and remove all dashed edges from the intermediate graph. This is feasible because each dashed edge corresponds to a solid edge in the marked graph: whenever a solid edge exists, a corresponding dashed edge must also exist. In the end, we obtain a new graph with vertices and four weightings in each channel. Those weightings represent the numbers of places and tokens on the solid edge and the dashed edge of the channel. We call this new graph the quantitative graph.

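Legend of Figure 3-4: each edge e carries the weightings [ps(e), ts(e), pd(e), td(e)], where ps(e) is the number of places of the solid edge, ts(e) is the number of tokens of the solid edge, pd(e) is the number of places of the dashed edge, and td(e) is the number of tokens of the dashed edge.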

Figure 3-4. Transformation from a marked graph to a quantitative graph.

Definition of quantitative graph: A quantitative graph GQ = (VQ, EQ, ps, ts, pd, td) is a weighted directed graph, where

‧ VQ is the set of vertices.

‧ EQ is the set of edges, and each edge carries four weightings ps, ts, pd, and td.

‧ ps: EQ→Z⁺ gives the number of places of the corresponding solid edge.

‧ ts: EQ→N gives the number of tokens of the corresponding solid edge.

‧ pd: EQ→Z⁺ gives the number of places of the corresponding dashed edge.

‧ td: EQ→Z⁺ gives the number of tokens of the corresponding dashed edge.

Formal transformation from a specified marked graph to the quantitative graph is described as follows. Each transition ti in the marked graph converts to a vertex vi in the quantitative graph. Each edge (vi,vj) in the quantitative graph corresponds to a pair of edges in the marked graph, including a solid edge (ti,tj) and a dashed edge (tj,ti).

Places and tokens of solid edges transform to the weightings ps and ts in the quantitative graph, and places and tokens of dashed edges transform to the weightings pd and td. For example, ps((vi, vj))=1, ts((vi, vj))=0, pd((vj, vi))=1, and td((vj, vi))=2 represent an input channel of a relay station in the quantitative graph. The system throughput of a QG is decided by the cycles with the lowest ratio of tokens to places, which is identical to the original marked graph. However, tokens and places in the marked graph are transformed into weightings in the quantitative graph, so the throughput calculation of a QG becomes finding the lowest value, over all cycles C, of the ratio

$$ T(C) = \frac{\sum_{e \in C \cap S} t_s(e) + \sum_{e \in C \cap D} t_d(e)}{\sum_{e \in C \cap S} p_s(e) + \sum_{e \in C \cap D} p_d(e)} $$

The sums of ts and td give the total tokens in cycle C, and the sums of ps and pd give the total places in cycle C. S represents the set of solid edges and D represents the set of dashed edges. We define this ratio as T(C).
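As a small illustration of the ratio T(C), the sketch below evaluates it for cycles given as two lists: the QG edges traversed along their solid direction and those traversed along their dashed direction. The class name `QGEdge` and the function `cycle_throughput` are hypothetical. The weightings are the ones that appear in Figure 3-6, but the edge directions between v1 and v3 are an assumption made for this example.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class QGEdge:
    """One QG edge with its four weightings [ps, ts, pd, td]."""
    src: str
    dst: str
    ps: int  # places on the solid edge
    ts: int  # tokens on the solid edge
    pd: int  # places on the dashed edge
    td: int  # tokens on the dashed edge


def cycle_throughput(solid, dashed=()):
    """T(C): total tokens over total places along one cycle.

    `solid` holds edges traversed along their solid direction and
    `dashed` holds edges traversed along their dashed direction.
    """
    tokens = sum(e.ts for e in solid) + sum(e.td for e in dashed)
    places = sum(e.ps for e in solid) + sum(e.pd for e in dashed)
    return Fraction(tokens, places)


# Assumed directions: two parallel edges v1 -> v3 and one edge v3 -> v1.
upper = QGEdge("v1", "v3", 2, 1, 2, 3)
lower = QGEdge("v1", "v3", 2, 2, 2, 2)
back = QGEdge("v3", "v1", 1, 1, 1, 1)

print(cycle_throughput([upper, back]))    # 2/3
print(cycle_throughput([lower, back]))    # 1
print(cycle_throughput([upper], [lower])) # 3/4
```

The first two results correspond to the solid-only cycles; the third is a mixed cycle that leaves v1 along the solid direction of one edge and returns along the dashed direction of the other.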

3.2 Quantitative Graph Reduction

Some vital problems remain unsolved even after we transform the marked graph into a QG. One of them is that, as global interconnect latency becomes worse, more relay stations are needed to pipeline the interconnects, so the graph becomes huge. Some LISs may be unsolvable because of the huge graph size. This motivates us to reduce the graph size further.

3.2.1 Path Condensation

Suppose there exists a simple path in the QG in which every intermediate vertex has only one input edge and only one output edge. We find that the throughput calculation is unchanged if we combine all edges and vertices of the simple path into a single edge, where each weighting of the single edge is the sum of the corresponding weightings of all combined edges.

Figure 3-5. Two graphs are equivalent in throughput calculation.

Figure 3-5 illustrates the concept of combination. The left graph of Figure 3-5 is the original QG, and the right graph is the graph after the combining operation. The pink vertex represents a relay station, and the two red edges correspond to the two combined paths in the left graph. The left graph has two cycles when we consider solid edges only. The T(C) values of those two cycles are two thirds and one. Since system throughput is determined by the cycles with the lowest T(C), the system throughput of the left graph is two thirds. The right graph also has two cycles, with T(C) equal to two thirds and one, so its system throughput is also two thirds. The two graphs are therefore equivalent in system throughput, but the right graph has fewer vertices and edges. In other words, counting cycles is more efficient in the right graph, and so is calculating system throughput. We define this combining operation as path condensation. By path condensation, we can eliminate all the relay stations and some IPs in the QG without influencing system throughput.
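Path condensation as described above can be sketched as a repeated merge of the in-edge and out-edge of every vertex whose in-degree and out-degree are both 1. The edge format `(src, dst, ps, ts, pd, td)`, the function name `condense_paths`, and the five-edge example graph are all hypothetical; the example was chosen so that a relay-station input channel [1,0,1,2] followed by a shell channel [1,1,1,1] condenses to [2,1,2,3].

```python
def condense_paths(edges):
    """Merge edges through every vertex with in-degree = out-degree = 1.

    Each edge is a tuple (src, dst, ps, ts, pd, td); the weightings of
    a merged edge are the sums of the weightings of the two edges.
    """
    edges = list(edges)
    changed = True
    while changed:
        changed = False
        vertices = {e[0] for e in edges} | {e[1] for e in edges}
        for v in sorted(vertices):
            ins = [e for e in edges if e[1] == v]
            outs = [e for e in edges if e[0] == v]
            if len(ins) != 1 or len(outs) != 1:
                continue
            a, b = ins[0], outs[0]
            if a is b or a[0] == b[1]:
                # Self-loop, or merging would create one: the path
                # would not be simple, so leave it alone.
                continue
            weights = tuple(x + y for x, y in zip(a[2:], b[2:]))
            edges.remove(a)
            edges.remove(b)
            edges.append((a[0], b[1]) + weights)
            changed = True
            break  # degrees changed; rescan from scratch
    return edges


# Hypothetical QG: v1 reaches v3 through a relay station "r" and
# through a shell "s"; v3 feeds back to v1.
qg = [
    ("v1", "r", 1, 0, 1, 2),   # input channel of the relay station
    ("r", "v3", 1, 1, 1, 1),
    ("v1", "s", 1, 1, 1, 1),
    ("s", "v3", 1, 1, 1, 1),
    ("v3", "v1", 1, 1, 1, 1),
]
for edge in condense_paths(qg):
    print(edge)
# ('v3', 'v1', 1, 1, 1, 1)
# ('v1', 'v3', 2, 1, 2, 3)
# ('v1', 'v3', 2, 2, 2, 2)
```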

Definition of path condensation: We call a simple path pu,v = <u, v1, …, vn, v> condensable if it satisfies the following two conditions.

‧ The length of the path, counted in vertices, satisfies |pu,v| ≥ 3, that is, n ≥ 1.

‧ Each intermediate vertex in {v1, v2, …, vn} has input degree and output degree both equal to 1.

Each condensable path pu,v can be replaced by a condensed edge ec = (u, v) without affecting the overall system throughput. For each condensed edge, ps(ec) = pd(ec) = n+1, and the token weightings are the sums over the edges of the path:

$$ t_s(e_c) = \sum_{e \in p_{u,v}} t_s(e), \qquad t_d(e_c) = \sum_{e \in p_{u,v}} t_d(e) $$

After path condensation, the graph can be further reduced in number of edges. We observe that one of the two red edges dominates the throughput calculation in Figure 3-5. From the left of Figure 3-6, we know the system throughput is two thirds. In other words, the cycle containing the upper red edge dominates system throughput. That is to say, we can eliminate the other red edge without affecting the correctness of the throughput calculation. This is shown on the right of Figure 3-6, in which the dominating edge is kept and the other is eliminated. We define this operation as edge unification.

Figure 3-6. Operation of edge unification. (Vertices v1 and v3; edge weightings [1,1,1,1], [2,1,2,3], and [2,2,2,2] before unification, [1,1,1,1] and [2,1,2,3] after; system throughput is 2/3 in both graphs.)

Definition of edge unification: For any two vertices vi, vj in the quantitative graph, if there exist multiple edges from vi to vj, we group those edges into a set Em. Each Em can be unified into a dominating edge ed: we keep the dominating edge and remove the other edges belonging to the same Em. This unification preserves system throughput. The dominating edge ed is the edge with max(ps(e)-ts(e)), where e∈Em.

In Figure 3-6, the graph has only one Em, which contains the two red edges. From the definition, the upper red edge is the dominating edge of Em, so we eliminate the lower edge to decrease the number of cycles in the graph. By edge unification, the QG can be further reduced in size. Figure 3-7 demonstrates an example of the total reduction procedure.
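The edge unification rule can be written down in a few lines. This is a minimal sketch assuming edges are tuples `(src, dst, ps, ts, pd, td)`; the function name `unify_edges` is hypothetical, the weightings come from Figure 3-6, and the edge directions are assumed. Ties between equally dominating edges are broken by keeping the first one, which is a choice made here rather than something the definition specifies.

```python
from collections import defaultdict


def unify_edges(edges):
    """Keep one dominating edge per (src, dst) pair.

    Edges sharing the same endpoints form a group Em; the dominating
    edge is the one with the largest ps - ts.
    """
    groups = defaultdict(list)
    for e in edges:
        groups[(e[0], e[1])].append(e)
    # max() returns the first maximal element, so ties keep input order.
    return [max(group, key=lambda e: e[2] - e[3])
            for group in groups.values()]


qg = [
    ("v1", "v3", 2, 1, 2, 3),  # ps - ts = 1 (dominating)
    ("v1", "v3", 2, 2, 2, 2),  # ps - ts = 0 (eliminated)
    ("v3", "v1", 1, 1, 1, 1),
]
print(unify_edges(qg))
# [('v1', 'v3', 2, 1, 2, 3), ('v3', 'v1', 1, 1, 1, 1)]
```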

Figure 3-7. Total reduction procedure of an LIS example.

Figure 3-7 starts from a marked graph with seven channels whose queue sizes need to be decided. Those variables are indexed a1 to a7. Only shell queues are variables because the queue size of a relay station is fixed to 2 in the marked graph; the marked graph makes this assumption to keep the relay station small and consistent. Therefore, we only need to treat the queue size of each shell as a variable, which gives seven variables in this example. Next, we transform the marked graph representation into the QG representation and then apply the reduction procedure to the QG. From the definitions of path condensation and edge unification, we know these procedures do not affect the correctness of throughput. Finally, we obtain a reduced graph that has the same throughput as the original marked graph while reducing the number of variables from seven to three. This makes the throughput calculation faster in the reduced graph than in the initial marked graph.

After obtaining the system throughput, we need to recover the original QG from the reduced graph to get the correct number of queues in the whole system. We show this recovery procedure in Figure 3-8 and Figure 3-9.

Figure 3-8. An example of the recovery procedure, step 1.

Figure 3-9. An example of the recovery procedure, step 2.

In Figure 3-8, we illustrate step 1 of the recovery procedure. In step 1, we first recover the reduced graph from edge unification. To maintain the optimal throughput during recovery, one condition must be satisfied: all edges belonging to the same Em must have equal td(e)-pd(e). This is because every e∈Em needs the same number of extra queues. Whenever a cycle passes through the dominating edge of Em, there must exist other cycles in the original QG that pass through the other edges belonging to the same Em. When the cycle passing through the dominating edge needs extra queues to achieve optimal throughput, we infer that the other edges belonging to the same Em also need the same number of extra queues to maintain optimal throughput.

For instance, we assign td(e) of the dominating edge on the left of Figure 3-8 to be 8. Then td(e) of the other edge is equal to 6, since 8-4 = 6-2. In step 2, we recover the reduced graph from path condensation, as shown in Figure 3-9. We already know that the queue size of a relay station is fixed to 2, so we only need to distribute the remaining queues equally among the shells. For instance, a condensed edge with 6 queues on the left of Figure 3-9 is recovered into the two corresponding edges (v1, v5) and (v5, v4) on the right of Figure 3-9, and each edge is allotted 3 queues. We distribute queues equally so that every shell has a similar hardware area. As a result, we obtain the final correct queue size solution on the right of Figure 3-9.
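The two recovery steps can be sketched with the numbers used in the text. The function names, the `kinds` list that labels each recovered edge as "relay" or "shell", and the handling of a remainder (leftover queues go to the earliest shells, one each) are assumptions made for illustration; the text itself only covers the evenly divisible case.

```python
RELAY_QUEUE = 2  # queue size of a relay station is fixed to 2


def recover_unified_td(dom_td, dom_pd, other_pd):
    """Step 1: give a removed edge the same td - pd as the dominating edge."""
    return dom_td - dom_pd + other_pd


def split_condensed_td(total_td, kinds):
    """Step 2: split the queues of a condensed edge over its edges.

    `kinds` lists the recovered edges in path order, each "relay" or
    "shell". Relay stations keep their fixed size; the rest is shared
    equally among shells (remainder handling is an assumption).
    """
    relay_count = kinds.count("relay")
    shell_count = len(kinds) - relay_count
    rest = total_td - RELAY_QUEUE * relay_count
    if rest < 0 or (shell_count == 0 and rest != 0):
        raise ValueError("queues cannot be distributed over this path")
    base, extra = divmod(rest, shell_count) if shell_count else (0, 0)
    shares, shells_seen = [], 0
    for kind in kinds:
        if kind == "relay":
            shares.append(RELAY_QUEUE)
        else:
            shares.append(base + (1 if shells_seen < extra else 0))
            shells_seen += 1
    return shares


print(recover_unified_td(8, 4, 2))                # 6, since 8 - 4 = 6 - 2
print(split_condensed_td(6, ["shell", "shell"]))  # [3, 3]
```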

3.3 Problem Formulation of Our approach

With path condensation and edge unification, we can greatly decrease the graph size while preserving the correctness of system throughput. This helps considerably in counting cycles in the graph for throughput calculation. Hence, we can find the optimal throughput quickly with the reduced QG. We then propose an integer linear programming (ILP) formulation to find the minimal queue size while maintaining optimal throughput. The proposed problem formulation is as follows:

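Given: the reduced QG and the set C of its cycles.

Objective: minimize $\sum_{e \in E_Q} t_d(e)$ while maintaining maximum throughput.

Constraints: for every cycle $C_i \in C$,

$$ \sum_{e \in C_i \cap S} t_s(e) + \sum_{e \in C_i \cap D} t_d(e) \ge \sum_{e \in C_i \cap S} p_s(e) + \sum_{e \in C_i \cap D} p_d(e) $$

where S represents the set of solid edges and D represents the set of dashed edges.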

The proposed ILP formulation for the minimal queue size is very efficient because it has only |E| integer variables and |C| constraints, where |C| is the number of cycles.

The flow of our approach is separated into three main processes, discussed as follows:

1. Initial setup: In this process, we set the parameters of the graphs, including constructing graphs from the benchmark, assigning the latency of each channel, and reducing graph size by path condensation and edge unification. These steps make the graph easier and faster to handle.

2. Find cycles: In this process, we identify all the cycles in the graph. Thanks to the reduction procedure, the number of cycles in the graph decreases greatly, so the time spent in this process is also greatly shortened. We use Johnson’s algorithm [26] to find all the cycles in the graph.

3. ILP process: This is the main process of our approach. We take the cycles obtained from process 2 and, for each cycle, decide the queue size of each shell so that every cycle’s T(C) is greater than or equal to 1 while minimizing the total queue size.
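To make the third process concrete, the sketch below replaces the ILP solver with a brute-force search over small queue increments; it is a stand-in for illustration, not the ILP itself. Every name in it is hypothetical, the three edges and three cycles are invented, and the cycle list is supplied by hand instead of being produced by Johnson's algorithm. Each cycle is a pair (edges traversed as solid, edges traversed as dashed), and extra queues are modeled as extra tokens on td.

```python
from itertools import product


def min_queue_sizing(edges, cycles, max_extra=4):
    """Smallest total number of extra queues with T(C) >= 1 on all cycles.

    `edges` maps an edge name to (ps, ts, pd, td). Returns a dict of
    extra tokens per edge, or None if no assignment up to `max_extra`
    per edge satisfies every cycle.
    """
    names = sorted(edges)
    best = None
    for extra in product(range(max_extra + 1), repeat=len(names)):
        add = dict(zip(names, extra))
        feasible = True
        for solid, dashed in cycles:
            tokens = (sum(edges[e][1] for e in solid)
                      + sum(edges[e][3] + add[e] for e in dashed))
            places = (sum(edges[e][0] for e in solid)
                      + sum(edges[e][2] for e in dashed))
            if tokens < places:  # T(C) < 1 on this cycle
                feasible = False
                break
        if feasible and (best is None
                         or sum(extra) < sum(best.values())):
            best = add
    return best


# Hypothetical reduced QG: "b" is a condensed path through a relay
# station; "a" and "c" are plain shell channels.
edges = {"a": (1, 1, 1, 1), "b": (2, 1, 2, 3), "c": (1, 1, 1, 1)}
cycles = [(["b"], ["a"]), (["b"], ["c"]), (["a"], ["c"])]
print(min_queue_sizing(edges, cycles))  # {'a': 1, 'b': 0, 'c': 1}
```

The search is exponential in the number of edges, so it only serves to show what the constraints and the objective look like on a tiny instance.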

3.4 Bit Width of Channels

In a practical SoC, channels usually have a bit width. For example, a 32-bit CPU may have 32-bit, 16-bit, and 1-bit channels. Therefore, we take the channel bit width into consideration. In LISs, placing queues at different positions results in different area costs when bit width is considered. Figure 3-10 illustrates the different area costs caused by adding queues at different positions.

Figure 3-10. Queues are added to different positions.

To take bit width of channels into consideration, we only need to modify our graph representation slightly. We add an extra width weighting w(e) to every edge in the QG. In other words, we modify four weightings (ps(e), ts(e), pd(e), td(e)) in each edge into five weightings (ps(e), ts(e), pd(e), td(e), w(e)). Since our purpose is to maintain optimal throughput with minimal queue size, the reduction procedure and
