Bit Width of Channels - Throughput Optimization for LIS with Minimal Buffer Size

Chapter 3 Throughput Optimization for LIS with Minimal Buffer Size

3.4 Bit Width of Channels

In a practical SoC, channels have different bit widths. For example, a 32-bit CPU may have 32-bit, 16-bit, and 1-bit channels. Therefore, we take channel bit width into consideration. In an LIS, placing queues at different positions incurs different area costs once bit width is considered. Figure 3-10 illustrates how the area cost varies with the positions at which queues are added.

Figure 3-10. Queues are added to different positions.

To take the bit width of channels into consideration, we only need to modify our graph representation slightly. We add an extra width weighting w(e) to every edge in the QG. In other words, the four weightings (ps(e), ts(e), pd(e), td(e)) on each edge become five weightings (ps(e), ts(e), pd(e), td(e), w(e)). Since our purpose is to maintain optimal throughput with minimal queue size, the reduction procedure and the formulation must be changed so that system throughput stays consistent. For path condensation, ps, ts, pd, and td are the same as in Section 3.2, and w(e) is set to the minimal width among the edges of the condensable path. This is because we prefer to put queues on edges with lower bit width to achieve minimal queue size. For edge unification, ps, ts, pd, and td of the dominating edge are the same as in Section 3.2, and w(e) is set to the sum of w(e) over all edges belonging to the same Em. This is because all e∈Em must receive the same number of extra queues: whenever the dominating edge needs an extra queue, the other edges belonging to the same Em need one, too. Figure 3-11 shows the changes in path condensation and edge unification.

In the upper graphs of Figure 3-11, w(e) of the condensed edge is the smallest w(e) among the edges of the condensable path <v1, v4>. In the lower graphs of Figure 3-11, w(e) of the dominating edge is the sum of the w(e) of the two edges from v1 to v4.

Figure 3-11. Changes in the reduction procedure.
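The two width rules can be sketched as follows. This is only a minimal illustration, assuming each edge is a small record that carries its width w(e); the names `Edge`, `condense_path_width`, and `unify_edges_width` are hypothetical, the example widths are made up, and the ps, ts, pd, td weightings are left out because they follow Section 3.2 unchanged.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge record: only the width weighting w(e) is modelled."""
    src: str
    dst: str
    width: int


def condense_path_width(path: Iterable[Edge]) -> int:
    """Path condensation: the condensed edge takes the minimal width on the path."""
    return min(e.width for e in path)


def unify_edges_width(parallel: Iterable[Edge]) -> int:
    """Edge unification: the dominating edge takes the sum of the widths in Em."""
    return sum(e.width for e in parallel)


# Made-up widths, chosen only to show the two rules side by side.
path = [Edge("v1", "v2", 32), Edge("v2", "v3", 8), Edge("v3", "v4", 16)]
parallel = [Edge("v1", "v4", 8), Edge("v1", "v4", 16)]

condensed_w = condense_path_width(path)   # cheapest place for a queue on the path
unified_w = unify_edges_width(parallel)   # every edge in Em gets the same extra queue
print(condensed_w, unified_w)
```

Taking the minimum reflects that a queue on a condensed path can be placed on its narrowest edge, while taking the sum reflects that an extra queue on the dominating edge is paid for on every edge of Em.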

Some changes are made to our proposed ILP formulation. The modified formulation is as follows:

Given:

‧ A quantitative graph GQ(VQ, EQ, ps, ts, pd, td, w).

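Objective:

‧ Minimize the total queue size, a sum (∑) over the edges e ∈ S and e ∈ D in which each edge is weighted by its bit width w(e),

where S represents the set of solid edges and D represents the set of dashed edges.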

We modify only the objective function of the ILP formulation in Section 3.3, and only slightly.

Bit width is thus easy to take into account in our graph representation and ILP formulation. This makes the proposed graph representation and formulation applicable both to systems in which all bit widths equal 1 and to systems with mixed bit widths.
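The effect of the width weighting on the objective can be illustrated with a short sketch. It assumes that the modified objective charges each extra queue on edge e a cost of w(e); the function name `weighted_queue_cost`, the edge labels, and the numbers are hypothetical and are not taken from the formulation in Section 3.3.

```python
def weighted_queue_cost(extra_queues, width):
    """Sum of w(e) times the number of extra queues on e (assumed objective)."""
    return sum(width[e] * q for e, q in extra_queues.items())


# Two hypothetical ways to add one queue: on a 32-bit edge or on a 1-bit edge.
width = {"e1": 32, "e2": 1}
cost_on_wide = weighted_queue_cost({"e1": 1, "e2": 0}, width)
cost_on_narrow = weighted_queue_cost({"e1": 0, "e2": 1}, width)

# With every width equal to 1, the cost is just the number of queues.
unit_width = {"e1": 1, "e2": 1}
cost_unit = weighted_queue_cost({"e1": 1, "e2": 0}, unit_width)
print(cost_on_wide, cost_on_narrow, cost_unit)
```

Both placements add a single queue, yet under this assumed cost they differ by a factor of 32, which is the situation Figure 3-10 describes.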

Chapter 4 Experimental Results

4.1 Environment Setup and Benchmarks

The benchmarks we use come from three sets: MCNC, GSRC, and ISCAS89.

However, MCNC and GSRC lack transfer direction information. To add data dependencies between the IPs in each benchmark, we break each net in those benchmarks into 2-pin nets and randomly assign each one a direction. To provide more realistic cases, we take two cases of ISCAS89 as another benchmark set. The ISCAS89 benchmarks already have direction information. The experiments are run on a computer with an AMD 1.81 GHz CPU and 2 GB of DRAM. We use the non-commercial LP/ILP solver lp_solver [27] to solve the proposed ILP formulation.
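The net-splitting step can be sketched as below. The text does not say how a multi-pin net is decomposed, so this sketch assumes a star decomposition in which the first pin of each net is paired with every other pin; the function name `nets_to_directed_edges`, the seed parameter, and the sample nets are all hypothetical.

```python
import random


def nets_to_directed_edges(nets, seed=0):
    """Split each net into 2-pin nets (assumed star shape) with random directions."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    edges = []
    for pins in nets:
        hub, *others = pins
        for pin in others:
            # Each 2-pin net gets one of its two possible directions at random.
            edge = (hub, pin) if rng.random() < 0.5 else (pin, hub)
            edges.append(edge)
    return edges


# Hypothetical nets over IPs A..D: one 3-pin net and one 2-pin net.
nets = [["A", "B", "C"], ["B", "D"]]
edges = nets_to_directed_edges(nets)
print(edges)
```

A k-pin net yields k-1 directed edges under this assumption; other decompositions, such as a chain or a clique, would give a different graph.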

Figure 4-1 shows the flowchart of our experiments.

Figure 4-1. The flowchart of our experiments.

4.2 Weight and Channel Latency Assignment

The latency of each channel is generated randomly in our experiments. More precisely, channel latency is a random real number drawn from the interval [1, A]. In other words, 0 to A-1 relay stations are inserted to pipeline the channel into 1 to A stages. For example, if the randomly generated number is 2.4, data needs 2.4 clock cycles to travel along the channel, and two relay stations need to be inserted.
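The latency-to-relay-station mapping can be written down in a few lines. This is a sketch that assumes the number of relay stations is the latency rounded up to whole clock cycles, minus one, which matches the 2.4-cycle example; the function names `relay_stations` and `random_latency` are hypothetical.

```python
import math
import random


def relay_stations(latency: float) -> int:
    """Relay stations needed to pipeline a channel with the given latency."""
    return math.ceil(latency) - 1


def random_latency(a: float, rng: random.Random) -> float:
    """Draw a channel latency from the interval [1, a]."""
    return rng.uniform(1.0, a)


rng = random.Random(1)  # fixed seed, only to make the sketch repeatable
samples = [relay_stations(random_latency(3.0, rng)) for _ in range(1000)]
print(relay_stations(2.4), min(samples), max(samples))
```

With A = 3, every sampled channel receives between 0 and 2 relay stations, in line with the 0 to A-1 range stated above.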

Each relay station in the experiments has two fixed queues, as mentioned in Section 3.2.

The queue size of each shell is initially set to one.

The bit width of each channel in a benchmark is initially set to one, meaning that each channel is a one-bit communication channel. To test the influence of bit width, we also assign a set of different bit widths to the channels and compare the two bit width assignments. To model the worst case, we assume that every benchmark can achieve the optimal throughput of 1. This is the worst case because, when the throughput upper bound is 1, we must consider every cycle and make its T(C) greater than or equal to 1. If the throughput upper bound is a real number smaller than 1, cycles whose tokens-to-places ratio exceeds the upper bound can be omitted.
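The pruning argument can be made concrete with a small sketch. It assumes that T(C) is the tokens-to-places ratio of a cycle, as the paragraph above suggests, and that each cycle is summarized by a (tokens, places) pair; the names `cycle_ratio` and `cycles_to_consider` and the sample cycles are hypothetical.

```python
def cycle_ratio(tokens: int, places: int) -> float:
    """Tokens-to-places ratio of a cycle, used here as T(C)."""
    return tokens / places


def cycles_to_consider(cycles, bound):
    """Keep the cycles whose ratio does not exceed the throughput upper bound."""
    return [name for name, (tokens, places) in cycles.items()
            if cycle_ratio(tokens, places) <= bound]


# Made-up cycles given as (tokens, places).
cycles = {"C1": (2, 3), "C2": (3, 3), "C3": (4, 3)}
kept_at_1 = cycles_to_consider(cycles, 1.0)
kept_at_07 = cycles_to_consider(cycles, 0.7)
print(kept_at_1, kept_at_07)
```

A lower upper bound leaves fewer cycles to handle, which is why an upper bound of 1 is treated as the worst case.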

4.3 Results

For each benchmark, we run three experiments. Experiment Ⅰ measures the efficiency of the reduction procedure, that is, how many cycles are eliminated after path condensation and edge unification are performed. In experiment Ⅱ, we compare our approach with the heuristic algorithm proposed in [21] and examine how the comparison changes when channel latency becomes worse. Finally, in experiment Ⅲ, we compare our approach with the heuristic algorithm when bit width is considered.

4.3.1 Experiment Ⅰ

In experiment Ⅰ, we count the number of cycles in the original marked graph and in the reduced QG. We use Johnson’s algorithm [26] to count all cycles in both graph representations. Channel latency lies in the interval [1, 3]. In other words, 0 to 2 vertices are added to every edge in the graphs.
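Cycle counting on a small graph can be sketched as follows. This is not Johnson’s algorithm [26] but a simpler backtracking enumeration with a worse running time, shown only to make the counted quantity concrete; the function name `count_elementary_cycles` and the sample graph are hypothetical, and parallel edges are assumed to be absent.

```python
def count_elementary_cycles(adj):
    """Count elementary cycles of a directed graph given as {vertex: [successors]}."""
    vertices = set(adj) | {w for ws in adj.values() for w in ws}
    order = {v: i for i, v in enumerate(sorted(vertices))}
    count = 0

    def extend(start, v, on_path):
        nonlocal count
        for w in adj.get(v, ()):
            if w == start:
                count += 1  # the path closed back on its starting vertex
            elif order[w] > order[start] and w not in on_path:
                # Only visit vertices ranked above the start, so each cycle
                # is found exactly once, from its lowest-ranked vertex.
                on_path.add(w)
                extend(start, w, on_path)
                on_path.remove(w)

    for s in vertices:
        extend(s, s, {s})
    return count


graph = {"v1": ["v2"], "v2": ["v3", "v1"], "v3": ["v1", "v2"]}
n_cycles = count_elementary_cycles(graph)
print(n_cycles)
```

Inserting extra vertices on an edge lengthens the cycles through it but does not change how many cycles there are, so the count depends on the graph topology rather than on the assigned latencies.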

Table 4-1. Reduction in the number of cycles after the reduction procedure is performed.

Benchmark | Original QG | Reduced QG

Table 4-1 covers five MCNC benchmarks, six GSRC benchmarks, and two ISCAS89 benchmarks, listing each benchmark’s name and its experimental results. Column (V, E) under marked graph gives the numbers of vertices and edges in the original marked graph. Column # Cycles under marked graph gives the number of cycles in the marked graph representation. Columns (V, E) and # Cycles under reduced QG have the same meanings for the reduced QG. An asterisk (*) indicates that the number of cycles exceeds one million, which makes a problem of this size too hard to solve. The reduction procedure brings one of the five MCNC benchmarks, and four of the GSRC benchmarks, from unsolvable down to solvable size. We conclude that the reduction procedure is effective in decreasing the number of cycles in the graph.

4.3.2 Experiment Ⅱ

In experiment Ⅱ, we examine the difference between our proposed method and Collins’ method in [21]. We use two different sets of channel latency assignments. The results for channel latency in [1, 3] are shown in Table 4-2, and the results for channel latency in [1, 5] are shown in Table 4-3. All bit widths are set to 1 in experiment Ⅱ.

Table 4-2. Experimental results when channel latency lies in [1, 3].

Benchmark | Proposed Method | Collins’ Method

Table 4-2 shows the experimental results of the two methods. Column # Queues gives the number of queues needed to maintain optimal throughput in our proposed method and in Collins’ method. Run time is the time needed to compute the solution, in seconds. On average, our method needs 22% fewer queues than Collins’ method, but its run time is 2.5 times that of Collins’ method.

Table 4-3. Experimental results when channel latency lies in [1, 5].

Benchmark | Proposed Method | Collins’ Method

The run time of our method is still about 2.5 times that of Collins’ method in Table 4-3.

Comparing Tables 4-2 and 4-3, we find that when channel latency becomes worse, the difference between Collins’ method and our method widens. Our method therefore performs better than previous work as channel latency becomes worse. From Chapter 1, we know that channel latency becomes worse as the manufacturing process scales down.

In summary, our method offers smaller area cost than Collins’ method at an acceptable cost in extra run time.

4.3.3 Experiment Ⅲ

In experiment Ⅲ, we examine the difference between our method and Collins’ method when bit width is considered. Channel latency is limited to the interval [1, 3]. The bit width of each channel is randomly assigned to 8, 16, 32, or 64. These bit widths are commonly used in practical chips.

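Table 4-4. Experimental results when bit width is randomly assigned to 8, 16, 32, or 64.

Benchmark | Proposed Method: # Queues | Proposed Method: Run time (s) | Collins’ Method: # Queues | Collins’ Method: Run time (s)
ami49 | 14192 | 782 | 15568 | 327
n10 | 472 | 0 | 568 | 0

The run time of our method is still about 2.5 times that of Collins’ method.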
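The comparison can be worked through for the two rows listed for Table 4-4 (ami49 and n10). The sketch assumes that the four numbers in each row are, in order, # Queues and run time in seconds for the proposed method, followed by the same two values for Collins’ method; the helper name `queue_saving` is hypothetical. The figures cover only these two rows and are not the averages reported for the full benchmark set.

```python
# (proposed queues, proposed run time, Collins queues, Collins run time)
rows = {
    "ami49": (14192, 782, 15568, 327),
    "n10": (472, 0, 568, 0),
}


def queue_saving(ours: int, theirs: int) -> float:
    """Fraction of queues saved relative to the other method."""
    return (theirs - ours) / theirs


savings = {name: queue_saving(r[0], r[2]) for name, r in rows.items()}

# The run-time ratio is only defined where the reference run time is non-zero.
runtime_ratio = rows["ami49"][1] / rows["ami49"][3]

for name, s in savings.items():
    print(f"{name}: {s:.1%} fewer queues")
print(f"ami49 run-time ratio: {runtime_ratio:.2f}")
```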

Compared with the experimental results in Table 4-2, in which all bit widths are set to 1, the difference between our method and Collins’ method widens greatly once bit width is taken into consideration. In summary, our method offers better area cost than Collins’ method on more practical circuits.

4.4 Discussion

In experiment Ⅰ, our proposed reduction procedure is effective in decreasing the number of cycles in the graph. Since the number of cycles in a directed graph can grow faster than the exponential 2ⁿ, it is important to reduce the graph size of practical circuits. Without the reduction procedure, Table 4-1 shows that the number of cycles in some benchmarks exceeds one million. Cycles on the order of millions are hard to process on ordinary computers, and merely counting them is time-consuming. This is why the reduction procedure is so important.

In experiment Ⅱ, our proposed method needs 22% fewer queues than Collins’ method on average. Although our method takes about 2.5 times the run time of Collins’ method, the additional time is still acceptable. For instance, ami49, the benchmark with the most cycles in our experiments, takes only 947 seconds to solve. We therefore prefer to spend an acceptable amount of extra time to save valuable chip area. We run a second experiment to see what happens when channel latency becomes worse. The results show that our method is more suitable than Collins’ method under worse channel latency.

In experiment Ⅲ, our proposed method needs 33% fewer queues than Collins’ method on average when bit width is considered. With a time overhead similar to that of the results shown in Table 4-2, our method saves more area than Collins’ method when bit width is considered. The run time is similar because the only difference between the two formulations is the objective function. This makes our method flexible in adapting to different bit width assignments.

Since the number of cycles determines the efficiency of our method, decreasing the number of cycles is vital. We propose the reduction procedure, consisting of path condensation and edge unification, to decrease the number of cycles.

In addition, several practical techniques help us further reduce the number of cycles. One is to ignore a cycle if and only if its T(C)>1, since it is not the most critical cycle. Another is to collapse each strongly connected component (SCC) into a single vertex. This is valid because the throughput upper bound in our experiments is 1, so each sub-system must ultimately have throughput 1, too. Hence, we can view each SCC as a sub-system with throughput 1 and collapse it into a single vertex.

The final one is to ignore cycles containing only two edges, since such a cycle must be a self-loop cycle in the original marked graph.
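Collapsing SCCs can be sketched with a standard two-pass depth-first search (Kosaraju's algorithm). The choice of algorithm is an assumption, since the text does not say how SCCs are found; the function names `strongly_connected_components` and `collapse` and the sample graph are hypothetical, and the recursive form is only suitable for small graphs.

```python
def strongly_connected_components(adj):
    """Map each vertex to an SCC label using Kosaraju's two-pass DFS."""
    vertices = set(adj) | {w for ws in adj.values() for w in ws}
    visited, finish = set(), []

    def dfs_forward(v):
        visited.add(v)
        for w in adj.get(v, ()):
            if w not in visited:
                dfs_forward(w)
        finish.append(v)  # record vertices in order of completion

    for v in sorted(vertices):
        if v not in visited:
            dfs_forward(v)

    reverse = {v: [] for v in vertices}
    for v, ws in adj.items():
        for w in ws:
            reverse[w].append(v)

    component = {}

    def dfs_backward(v, label):
        component[v] = label
        for u in reverse[v]:
            if u not in component:
                dfs_backward(u, label)

    label = 0
    for v in reversed(finish):  # latest finisher first, on the reversed graph
        if v not in component:
            dfs_backward(v, label)
            label += 1
    return component


def collapse(adj):
    """Return the SCC labels and the edges that remain between collapsed SCCs."""
    comp = strongly_connected_components(adj)
    edges = {(comp[v], comp[w]) for v, ws in adj.items() for w in ws
             if comp[v] != comp[w]}
    return comp, edges


graph = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": ["e"], "e": ["d"]}
comp, dag_edges = collapse(graph)
print(comp, dag_edges)
```

In the sample graph the vertices a, b, c form one component and d, e another, leaving a single edge between the two collapsed vertices.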

Chapter 5 Future Works and Conclusions

As manufacturing processes move to deep submicron technology, the length of interconnects becomes more unpredictable and uncontrollable. This makes it hard for designers to assemble pre-designed IP cores at an early design stage, because signal transfer times are unknown. Repeater insertion is a promising way to solve this problem without heavily changing the designs. However, slight modifications to existing IP cores are unavoidable, which prolongs the product development period with unproductive modification work. Even worse, repeater insertion degrades the performance of the overall system through multi-clock communication. LIS is a good solution to these problems. LIS handles the unpredictable-interconnect problem by automatically inserting relay stations, which is similar to the repeater insertion mentioned above. LIS avoids iterations of modification by encapsulating every existing IP core.

Encapsulation adds additional hardware, called a shell, to the existing IPs. This step lets all encapsulated IP cores and relay stations follow the same communication protocol, the latency-insensitive protocol. LIS addresses performance degradation mainly through queue sizing. As a result, the product development period shortens and the company can earn more profit. For these reasons, LIS is an attractive solution for time-to-market. However, physical parameters such as the length of interconnects and the positions of IPs are known only after floorplanning is performed. From Section 2.1, we know that the throughput upper bound of an LIS is determined by the system architecture. In other words, a poor system architecture limits the room for improvement that LIS has. In our experimental results, channel latency is drawn from a reasonable interval rather than obtained from realistic floorplanning results. Many studies on determining the best system architecture at the floorplanning stage have been reported, for example [28] and [29]. After such performance-aware floorplanning is performed, we acquire real physical information that is closer to the optimal architecture. Our future direction is therefore to combine our proposed method with real physical information acquired from performance-aware floorplanning.

We propose a throughput optimization technique for LIS that achieves optimal throughput with minimal queue size. First, we transform the original marked graph into a quantitative graph. Then, we develop the reduction procedure to reduce the graph size. We use an ILP formulation to guarantee the minimal queue demand. After obtaining the minimal queue solution from the reduced quantitative graph, we apply a recovery procedure to transform the reduced quantitative graph back into the quantitative graph while preserving the correctness of the minimal queue size. The experimental results show that our method outperforms Collins’ method in terms of queue size (area cost). The run time of our method is acceptable for real industrial systems.

References

[1] R. H. Havemann and J. A. Hutchby, “High-performance interconnects: an integration overview,” in Proceedings of IEEE, pp. 586–601, 2001.

[2] R. Ho, K. W. Mai and M.A. Horowitz, “The future of wires,” in Proceedings of IEEE, pp. 490–504, 2001.

[3] International Technology Roadmap for Semiconductors, Semiconductor Industry Association, 2005.

[4] D. M. Chapiro, “Globally-asynchronous locally-synchronous systems,” Ph.D. dissertation, Stanford Univ., CA, 1984.

[5] D. S. Bormann and P. Y. Cheung, “Asynchronous wrapper for heterogeneous systems,” in Proc. ICCD, pp. 307–317, 1997.

[6] J. Muttersbach, T. Villiger and W. Fichtner, “Practical design of globally-asynchronous locally-synchronous systems,” in Proc. ASYNC, pp. 52–59, 2000.

[7] S. Kumar, A. Jantsch, J-P. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja, and A. Hemani, “A network on chip architecture and design methodology,” in Proc. Symposium on VLSI, pp. 117–124, 2002.

[8] A. Hemani, A. Jantsch, S. Kumar, A. Postula, J. Oberg, M. Millberg, and D. Lindqvist, “Network on a chip: an architecture for billion transistor era,” in Proc. of the IEEE NorChip Conf., 2002.

[9] J. Cong, Y. Fan, G. Han, X. Yang, and Z. Zhang, “Architecture and synthesis for on-chip multicycle communication,” in Proc. TCAD, pp. 550–564, 2004.

[10] J. Cong, Y. Fan, Z. Zhang, “Architecture-level synthesis for automatic interconnect pipelining,” in Proc. DAC, pp. 602–607, 2004.

[11] L. P. Carloni, K. L. McMillan, and A. L. Sangiovanni-Vincentelli, “Latency-insensitive protocols,” in Proc. of the Computer-Aided Verification (CAV), 1999.

[12] L. P. Carloni, K. L. McMillan, and A.L. Sangiovanni-Vincentelli, “Theory of latency-insensitive design,” in IEEE Tran. CAD, vol. 20, no. 9, 2001.

[13] V. Adler, E. G. Friedman, “Repeater design to reduce delay and power in resistive interconnect,” in IEEE Trans. Circuits Syst. II: Analog Digital Signal Process, vol. 45, no. 5, pp. 607–616, 1998.

[14] L. P. Carloni, K. L. McMillan, A. Saldanha, and A. L. Sangiovanni-Vincentelli, “A methodology for correct-by-construction latency insensitive design,” in Proc. ICCAD, pp. 309–315, 1999.

[15] C. Li, R. Collins, S. Sonalkar, and L. P. Carloni, “Design, implementation, and validation of a new class of interface circuits for latency insensitive design,” in Fifth ACM-IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE), 2007.

[16] L. P. Carloni and A. L. Sangiovanni-Vincentelli, “Performance analysis and optimization of latency insensitive systems,” in Proc. DAC, pp. 361–367, 2000.

[17] R. Lu and C. Koh, “Performance optimization of latency insensitive systems through buffer queue sizing of communication channels,” in Proc. ICCAD, pp. 207–231, 2003.

[18] L. P. Carloni, “The role of back-pressure in implementing latency-insensitive systems,” in Electronic Notes in Theoretical Computer Science (ENTCS), vol. 146, no. 2, 2006.

[19] M. R. Casu and L. Macchiarulo, “Issues in implementing latency insensitive protocols,” in Proc. of the Conf. on Design, Automation and Test in Europe, pp. 1390–1391, 2004.

[20] R. Lu and C. Koh, “Performance analysis of latency-insensitive systems,” in IEEE Trans. CAD, vol. 25, pp. 469–483, 2006.

[21] C. Li, R. Collins, and L. P. Carloni, “Topology-based optimization of maximal sustainable throughput in a latency-insensitive system,” in Proc. DAC, pp. 410–415, 2007.

[22] C. Li, R. Collins, and L. P. Carloni, “Topology-based performance analysis and optimization of latency-insensitive systems,” in IEEE Trans. CAD, vol. 27, pp. 2277–2290, 2008.

[23] M. R. Casu and L. Macchiarulo, “A new approach to latency insensitive design,” in Proc. DAC, pp. 576–581, 2004.

[24] D. Bufistov, J. Julvez, and J. Cortadella, “Performance optimization of elastic systems using buffer resizing and buffer insertion,” in Proc. ICCAD, pp. 442–448, 2008.

[25] F. Commoner, A. W. Holt, S. Even, and A. Pnueli, “Marked directed graphs,” in J. Comput. Syst. Sci., pp. 511–523, 1971.

[26] D. B. Johnson, “Finding All the Elementary Circuits of a Directed Graph,” in SIAM J. Comput., 1975.

[27] “lp_solver,” http://lpsolve.sourceforge.net/5.5/

[28] M. R. Casu and L. Macchiarulo, “Floorplanning for throughput,” in Proc. ISPD, pp. 62–69, 2004.

[29] J. Wang, H. Zhou and P. Wu, “Processing rate optimization by sequential system floorplanning,” in Proc. ISQED, pp. 340–345, 2006.
