Single-Sequence Approach - Consecutive Sampling Techniques


        group[group_num].size = g_t - g_h;
        group_num++;
        g_t = g_h;
        group_avg = pattern_pair[g_h].cdc;
        pattern_pair[g_h].group_num = group_num;
    }

Figure 2-6. The pseudo code of the grouping algorithm

2.5 Consecutive Sampling Techniques

2.5.1 Single-Sequence Approach

After grouping, we can sample from each group a number of pattern pairs equal to the group size divided by a user-defined compaction ratio. This is often called the proportional sampling strategy [27]. Alternatively, we can sample a single pattern pair from each group, which is called the single sampling strategy [40]. The single sampling strategy can achieve a very high compaction ratio, but it can only be used if the power characteristic is a very precise approximation of the real power. The proportional sampling strategy does not require such a precise power characteristic, but its compaction ratio may not be as high as that of the single sampling strategy.
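As a minimal sketch of how the two strategies translate into per-group sample counts, the Python snippet below derives the requirement set T from the group sizes. The function names are hypothetical, and the use of floor division as the rounding rule is an assumption made for illustration; the text does not specify how fractional counts are rounded.

```python
def proportional_targets(group_sizes, compaction_ratio):
    """Samples per group = group size divided by the compaction ratio.

    Assumption: fractional counts are rounded down (floor division).
    """
    return {g: size // compaction_ratio for g, size in group_sizes.items()}


def single_targets(group_sizes):
    """Single sampling strategy: exactly one pattern pair per group."""
    return {g: 1 for g in group_sizes}


# Hypothetical group sizes, keyed by group number.
sizes = {1: 750, 2: 260, 3: 100}
print(proportional_targets(sizes, 250))
print(single_targets(sizes))
```

Under this rounding assumption, a group smaller than the compaction ratio receives a sample count of zero.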

In traditional sampling methods, independent pattern pairs are sampled from each group and concatenated into one continuous sequence for simulation. As a result, about half of the transitions in that sequence are useless, as shown in Figure 2-1. One way to reduce useless transitions is to build a state transition graph and select Euler trails on it with enough samples. However, this works only when most states and transitions have been visited many times. In typical cases, not all parts of the state transition graph are visited often, so it is hard to obtain enough samples. Therefore, to reduce the useless transitions in most cases, we propose a single-sequence algorithm that samples a sequence of consecutive pattern pairs from the original input sequence with the desired distribution and compaction ratio. The single-sequence problem can be formulated as follows.

Problem formulation: Given a sequence S of length N with entries in a set G = {g1, g2, …, gm}, where gi ∈ Z⁺ (1, 2, …) for 1 ≤ i ≤ m, and a set T = {t1, t2, …, tm} with ti ∈ {0, 1, 2, …} and ti ≤ s(gi) for 1 ≤ i ≤ m, where s(gi) is the number of times gi appears in S, find the shortest subsequence S’ of consecutive entries of S such that every gi ∈ G appears in S’ at least ti times.

Solution: From the problem formulation, the shortest subsequence that satisfies the requirements must also satisfy the following two conditions. The first condition is that the group of the start point of the shortest subsequence appears in the subsequence exactly as many times as required in T. If that group appeared more often than required, we could drop the start point to obtain a shorter subsequence that still satisfies the requirements in T. The second condition is that the group of the end point of the shortest subsequence also appears exactly as many times as required in T, for the same reason. Based on these two conditions, we propose an algorithm that finds the shortest subsequence of the original input sequence that satisfies the requirements in T.

Step 1: Assume the subsequence starts at index tail and ends at index head. Find a subsequence whose tail is located at the start point of S and that satisfies the requirements in T and the second condition.

Step 2: Move tail forward until the subsequence satisfies the first condition. This subsequence is then one candidate for the shortest subsequence. We keep track of the shortest of all candidates.

Step 3: Move tail one step forward. The subsequence now violates the requirements in T.

Step 4: Move head forward on S to find the next subsequence that satisfies the requirements and the second condition. If head reaches N and the subsequence still does not satisfy the requirements, stop. Otherwise, go back to Step 2.

The global shortest subsequence is the shortest of the candidates found in Step 2. Because the process keeps track of the shortest candidate, the result is available as soon as the process stops. It is hard to give a formal proof for our algorithm, but the idea can be explained as follows. If there were a shorter subsequence than the one we found, some pattern pairs could be dropped from the subsequence we found. However, the groups of the first and the last pattern pairs in our subsequence appear exactly as many times as required in T, which implies that no pattern pair can be dropped from either end. Therefore, the subsequence is the shortest subsequence that satisfies the requirements.

The pseudo code of this algorithm is shown in Figure 2-7. The subroutine Shortest_subsequence() finds the shortest subsequence that satisfies the requirements in T. The subroutine sub_seq_statisfied() tests whether the subsequence from index tail to index head in S satisfies the requirements in T and the second condition. The subroutine trace_forward() increases the index tail one step at a time until the subsequence satisfies the first condition. The time complexity of this algorithm is O(n), where n is the length of S, because head and tail each walk through the sequence S only once.

Figure 2-7. The pseudo code of the single-sequence algorithm
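Since the body of Figure 2-7 is not reproduced here, the following Python sketch illustrates Steps 1 to 4 as a sliding window over S. It is a reading of the steps above under stated assumptions, not the original pseudo code: the function name, the dictionary representation of T (group to required count), the 0-based inclusive indices, and the rule that the earliest candidate wins a tie are all choices made for illustration.

```python
from collections import Counter


def shortest_subsequence(S, T):
    """Return (tail, head), the inclusive 0-based bounds of the shortest
    window of S that contains each group g at least T[g] times, or None
    if no window qualifies."""
    need = {g: t for g, t in T.items() if t > 0}
    if not need:
        return None  # nothing has to be sampled
    count = Counter()
    missing = len(need)  # groups still below their requirement
    best = None
    tail = 0
    for head, g in enumerate(S):
        # Steps 1 and 4: move head forward until T is satisfied.
        count[g] += 1
        if g in need and count[g] == need[g]:
            missing -= 1
        if missing:
            continue
        # Step 2: move tail forward while the pair at tail is surplus.
        while count[S[tail]] > need.get(S[tail], 0):
            count[S[tail]] -= 1
            tail += 1
        if best is None or head - tail < best[1] - best[0]:
            best = (tail, head)  # keep the shortest candidate so far
        # Step 3: drop the pair at tail, which breaks the requirement.
        count[S[tail]] -= 1
        missing += 1
        tail += 1
    return best


S = [1, 4, 1, 4, 3, 3, 3, 2, 4, 4, 1, 2, 3, 2, 1]
print(shortest_subsequence(S, {1: 3, 2: 1, 3: 2, 4: 2}))
```

For the sequence S and T={3,1,2,2} used later with Figure 2-11, this sketch selects the first 11 entries of S, which agrees with the 11 transitions quoted there for the single-sequence result of Figure 2-3(a). Both indices only move forward, so the loop does a linear amount of work in the length of S.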

Figure 2-8 is a simple example of the shortest subsequence searching process that samples one pattern pair from each sampled group (G={1,2,3,4}, T={1,1,1,1}). The first step is sub_seq_statisfied(), which finds the first subsequence that satisfies the requirements in T and the second condition. The two indexes tail and head define the subsequence. The second step is trace_forward(), which moves the index tail forward until the subsequence satisfies the first condition. This subsequence is one candidate for the shortest subsequence, so we record it with the indexes f_tail and f_head as the temporary shortest subsequence. The third step moves the index tail one step forward and applies sub_seq_statisfied() to find the next subsequence that satisfies the requirements in T and the second condition. The fourth step is trace_forward() again, which moves the index tail forward until the subsequence satisfies the first condition. This subsequence is also a candidate for the shortest subsequence. The recorded temporary shortest subsequence is shorter than the new one, so the indexes f_tail and f_head are not changed.

The fifth step is the same as the third step: it moves the index tail one step forward and applies sub_seq_statisfied() to find the next suitable subsequence. The sixth step is trace_forward() again, which moves the index tail forward until the subsequence satisfies the first condition. This subsequence is also a candidate for the shortest subsequence. This time the new subsequence is shorter than the recorded one, so the indexes f_tail and f_head are updated to define the new temporary shortest subsequence from tail to head. After the sixth step, no new subsequence can be found, and the recorded temporary shortest subsequence is the final shortest subsequence.

1 4 1 4 3 3 3 2 4 4 1 2 3 2 1

Figure 2-8. An example of the shortest subsequence searching

Ideally, we can find a compacted sequence without any useless transitions as shown in Figure 2-2. In real cases, however, the compacted sequence may still have some undesired or over-sampled transitions.

2.5.2 Multi-Sequence Approach

As shown in Figure 2-3, if we relax the limitation slightly and allow multiple consecutive sequences, we can find better solutions for the vector compaction problem by minimizing the number of sequences instead of fixing it at one. In this section, we discuss this extension and propose an algorithm to solve this more general problem.

Problem formulation: Given a sequence S of length N with entries in a set G = {g1, g2, …, gm}, where gi ∈ Z⁺ (1, 2, …) for 1 ≤ i ≤ m, and a set T = {t1, t2, …, tm} with ti ∈ {0, 1, 2, …} and ti ≤ s(gi) for 1 ≤ i ≤ m, where s(gi) is the number of times gi appears in S, find the minimum number of disjoint subsequences in S such that every gi ∈ G appears in those subsequences exactly ti times, for 1 ≤ i ≤ m.

This problem is very similar to the well-known EXACT COVER BY 3-SETS (X3C) problem [80]. The X3C problem is described as follows.

INSTANCE: Set X with |X| = 3q and a collection C of 3-element subsets of X.

QUESTION: Does C contain an exact cover for X, i.e., a subcollection C’ ⊆ C such that every element of X occurs in exactly one member of C’ ?

In fact, the X3C problem can be transformed into a special case of our problem in polynomial time. Instead of giving the detailed derivation, we briefly illustrate the transformation in Figure 2-9. If the minimum number of disjoint subsequences in our problem is equal to q in the X3C problem, the selected subsequences correspond to a subcollection C’ that satisfies the requirement.

In X3C problem:

X={1,2,3,4,5,6}, |X|=3*2, q=2 and C={{6,2,3}, {5,3,6}, {1,2,4}, {1,5,4}}

Transformed into the multi-sequence problem:

G={1,2,3,4,5,6,0}, s(1)=2, s(2)=2,s(3)=2, s(4)=2, s(5)=2, s(6)=2, s(0)=3 and T={1,1,1,1,1,1,0}

S 6 2 3 0 5 3 6 0 1 2 4 0 1 5 4

Figure 2-9. Transforming an X3C problem into a multi-sequence problem
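The construction in Figure 2-9 can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes, as in the figure, that the separator symbol 0 is not an element of X and that each member of C is written in the order in which it should appear in S.

```python
def x3c_to_multi_sequence(X, C, separator=0):
    """Build (S, T) for the multi-sequence problem from an X3C instance.

    S lists the triples of C one after another, divided by a separator
    group that must never be sampled (its requirement in T is 0).
    """
    S = []
    for k, triple in enumerate(C):
        if k > 0:
            S.append(separator)  # keeps neighbouring triples apart
        S.extend(triple)
    T = {x: 1 for x in X}  # every element of X is needed exactly once
    T[separator] = 0
    return S, T


X = [1, 2, 3, 4, 5, 6]
C = [(6, 2, 3), (5, 3, 6), (1, 2, 4), (1, 5, 4)]
S, T = x3c_to_multi_sequence(X, C)
print(S)
print(T)
```

Because the separator may not be sampled, no selected subsequence can span two triples, so a solution that uses q subsequences has to pick q whole triples that together cover X exactly once.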

Because the X3C problem is NP-complete, our multi-sequence problem is at least as hard (NP-hard). Therefore, we propose a heuristic algorithm to solve it. First, we find the longest subsequence in which the pattern pairs of each group do not appear more often than required in T. By construction, this longest subsequence does not include any useless transitions. After that, we update T by subtracting, for each group, the number of pattern pairs of that group that appear in the subsequence from the required sample number. Then we find the next longest subsequence and update T again. This process is repeated until all numbers in T are equal to zero. To ensure that all subsequences found in this process are disjoint, the subsequence found in each iteration is marked in S. Finally, the concatenation of those subsequences is the solution of our algorithm. In summary, the algorithm proceeds step by step as follows, and its pseudo code is given in Figure 2-10.

Step 1: Find the longest subsequence in which each gi in G appears ti′ times, where ti′ ≤ ti for all ti in T.

Step 2: Mark the longest subsequence in S and set ti = ti − ti′ for all ti in T.

Step 3: If ti = 0 for all 1 ≤ i ≤ m, STOP and concatenate all subsequences found in Step 1 to form the solution. Otherwise, go to Step 1.

In Figure 2-10, the subroutine sub_seq() finds the longest subsequence in which the pattern pairs of each group do not appear more often than required in T. The subroutine concatenate() appends the subsequence found by sub_seq() to the result. The subroutine modify() updates the numbers in T according to the longest subsequence found by sub_seq(). The subroutine all_zero() tests whether all entries in T are equal to zero.

The number of useless transitions in the final sequence will be the number of subsequences minus one.

Multi_sequence(S[],G[],T[],N)

Figure 2-10. The pseudo code of the multi-sequence algorithm
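Only the signature of the pseudo code in Figure 2-10 survives here, so the following Python sketch shows one way Steps 1 to 3 could be realized. It is an illustration under assumptions rather than the original code: the function names, the dictionary form of T, the boolean marking array, and the rule that the leftmost of several equally long subsequences is chosen are all hypothetical.

```python
from collections import Counter


def longest_within_budget(S, used, T):
    """Step 1: longest run of unmarked entries in which no group appears
    more often than its remaining requirement in T.
    Returns inclusive 0-based bounds (start, end), or None."""
    best = None
    count = Counter()
    start = 0
    for end, g in enumerate(S):
        if used[end] or T.get(g, 0) == 0:
            count.clear()  # this entry cannot belong to any window
            start = end + 1
            continue
        count[g] += 1
        while count[g] > T[g]:  # shrink until g fits its budget again
            count[S[start]] -= 1
            start += 1
        if best is None or end - start > best[1] - best[0]:
            best = (start, end)
    return best


def multi_sequence(S, T):
    """Repeat Steps 1 to 3; returns the chosen (start, end) windows."""
    T = dict(T)  # work on a copy of the requirements
    used = [False] * len(S)
    pieces = []
    while any(T.values()):  # Step 3: stop when every entry is zero
        window = longest_within_budget(S, used, T)
        if window is None:
            break  # remaining requirements cannot be met
        a, b = window
        for i in range(a, b + 1):  # Step 2: mark in S and update T
            used[i] = True
            T[S[i]] -= 1
        pieces.append(window)
    return pieces


S = [1, 4, 1, 4, 3, 3, 3, 2, 4, 4, 1, 2, 3, 2, 1]
pieces = multi_sequence(S, {1: 3, 2: 1, 3: 2, 4: 2})
print([S[a:b + 1] for a, b in pieces])
```

With the sequence and T={3,1,2,2} of Figure 2-11, this sketch picks 1 4 1 4 3 3 first, which leaves T={1,1,0,0}, and then 1 2, giving two subsequences and therefore one useless transition in the concatenated result.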

The time complexity of this algorithm is O(n²) in the worst case. It is dominated by the number of times the sub_seq() subroutine is executed, which in the worst case is n divided by the desired compaction ratio. Each execution of sub_seq() performs operations similar to those of the single-sequence algorithm, whose time complexity is O(n), so the overall worst-case time complexity of the proposed multi-sequence algorithm is O(n²).

Figure 2-11 is an example of the detailed searching process of the multi-sequence algorithm for G={1,2,3,4} and T={3,1,2,2}. The input sequence S in this example is the same as in Figure 2-3, to illustrate the improvement of the multi-sequence algorithm. The first step is sub_seq(), which finds the longest subsequence in which the pattern pairs of each group do not appear more often than required in T. The second step is concatenate(), which appends the result of the first step to S’. The third step is modify(), which updates the numbers in T according to the subsequence found in the first step and marks the transitions of that subsequence on S. After the update, T becomes {1,1,0,0}. The all_zero() subroutine, which tests whether all entries in T are equal to zero, does not return 1, so the process continues.

The fourth step is sub_seq() again, which finds the longest subsequence in which the pattern pairs of each group do not appear more often than required in T. The fifth step is concatenate(), which appends the result of the fourth step to S’. The sixth step is modify() again, which updates T according to the subsequence found in the fourth step and marks it on S. After the update, T becomes {0,0,0,0}. Because all entries in T are equal to zero, the all_zero() subroutine returns 1 and the process stops. At this moment, the sequence S’ is the final result.

1 4 1 4 3 3 3 2 4 4 1 2 3 2 1

modify(); T={1,1,0,0}; all_zero() != 1;

1 2

Figure 2-11. An example of the multi-sequence algorithm

In this example, the compacted sequence S’ consists of two subsequences and therefore still contains one useless transition. However, compared with the single-sequence approach, the multi-sequence approach still saves 2 transitions. The single-sequence algorithm finds a shortest sequence with 11 transitions, as shown in Figure 2-3(a). With the multi-sequence algorithm, the final sequence has only 9 transitions, including one useless transition, as shown in Figure 2-3(b).

2.6 Average Power Calculation

For efficiency, it is not necessary to sample pattern pairs from groups that contain very few pattern pairs, because those groups contribute little to the overall power consumption. Spending too much effort on sampling them may lower the achieved compaction ratio. Instead, we can directly calculate the contribution of those non-sampled groups to the overall power consumption, which provides a trade-off between efficiency and accuracy. Therefore, we set the sampling number to zero for every group whose size is smaller than the desired compaction ratio.

The detailed equations for deriving the average power consumption of a circuit are shown in Equations (2-10), (2-11) and (2-12).

(2-10)

In Equation (2-10), N is the number of pattern pairs in the original input sequence, Pavg is the average power consumption, Psam is the total power consumption of sampled groups, and Pnon is the total power consumption of non-sampled groups. In Equation (2-11), m is the number of sampled groups, hi is the number of sampled pattern pairs in group i, CDCavg_i is the average CDC value of group i, CDCj is the CDC value of pattern pair j, Pj is the power consumption of pattern pair j, and htotal_i is the total number of pattern pairs in group i. In Equation (2-12), CDCtotal_non is the total CDC value of non-sampled groups, and CDCtotal_sam is the total CDC value of sampled groups.

2.7 Experimental Results

In this section, we demonstrate the experimental results of our approaches on the ISCAS’85 benchmark circuits. The estimation environment is a SUN UltraSPARC IIi workstation with 512 MB of memory. The original input sequence contains 50,000 pseudo-random vectors for each circuit. The variance limitation in the grouping process is set to ±2.5%. The number of sampled groups is decided by a user-specified parameter, the desired compaction ratio (DCR). In our experiments, the DCR is set to a high value (250) to demonstrate that these sampling techniques can achieve a high compaction ratio and a high speedup without losing too much accuracy. However, for the same DCR, the achieved compaction ratio may not be as high as expected because useless transitions may exist in the compacted input sequence. Therefore, the effective compaction ratio is also calculated to show the effect of reducing useless transitions in the proposed vector compaction technique.

The experimental results are shown in Table 2-3. The first row lists the names of the circuits. The second and third rows give the estimation result and the run time of the PowerMill simulator with the original input sequence. The following six rows show the estimation results of the random sampling technique, and the last twelve rows show the estimation results of the consecutive sampling techniques. The row L/U gives the length of the compacted sequence (L) and the number of useless transitions (U) in the compacted sequence. ECR is the effective compaction ratio, which is the number of pattern pairs in the original input sequence divided by the number of pattern pairs (including useful and useless transitions) in the compacted sequence. The speedup is the elapsed time of PowerMill with the original input sequence divided by the elapsed time of the power estimation with the vector compaction technique, which includes Verilog-XL simulation, grouping, sampling, PowerMill simulation and average power calculation.
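The two ratios can be checked by hand. The short Python sketch below (the helper names are hypothetical) recomputes ECR and speedup for C432 under random sampling from the values reported in Table 2-3: an original length of 50,000, a compacted length L of 377, and run times of 8542 s and 93.6 s.

```python
def effective_compaction_ratio(original_length, compacted_length):
    """ECR = original length / compacted length (useless ones included)."""
    return original_length / compacted_length


def speedup(original_time_s, compacted_time_s):
    """Speedup = PowerMill time on the original sequence / total time of
    the compacted flow."""
    return original_time_s / compacted_time_s


ecr = effective_compaction_ratio(50_000, 377)
spd = speedup(8542, 93.6)
print(f"ECR = {ecr:.2f}, speedup = {spd:.2f}")  # ECR = 132.63, speedup = 91.26
```

Both values match the C432 column of the random sampling rows in Table 2-3.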

Table 2-3. A comparison of random, single-sequence and multi-sequence techniques

Circuit          C432     C499     C880    C1355    C1908    C2670    C3540    C5315    C6288    C7552     Avg.

PowerMill (L=50,000)
  I (uA)        289.6    685.5    411.8    790.9    841.7    939.1   2058.7   2540.5  21936.6   3786.6
  Time (s)       8542    16757    13774    22064    22925    18100    57450    49813   290751    63705

Random Sampling
  I (uA)        299.4    708.3    437.6    824.2    879.7    981.4   2134.3   2613.3  22603.6   3999.4
  Error (%)      3.38     3.33     6.27     4.21     4.51     4.50     3.67     2.87     3.04     5.62     4.14
  L/U         377/187  385/190  390/194  388/192  380/188  388/192  393/195  391/194  388/193  390/193
  ECR          132.63   129.87   128.21   128.87   131.58   128.87   127.23   127.88   128.87   128.21   129.22
  Time (s)       93.6    177.8    147.5    232.1    237.9    229.2    563.7    552.3   2640.0    703.4
  Speedup       91.26    94.25    93.38    95.06    96.36    78.97   101.92    90.19   110.13    90.57    94.21

Consecutive Sampling, Single Sequence
  I (uA)        291.6    694.6    422.0    824.4    891.2    978.4   2160.8   2620.3  22741.0   3887.7
  Error (%)      0.69     1.33     2.48     4.24     5.88     4.18     4.96     3.14     3.67     2.67     3.32
  L/U           294/0    245/0    233/0    324/0    228/0    347/0    323/0    272/0    262/0    316/0
  ECR          170.07   204.08   214.59   154.32   219.30   144.09   154.80   183.82   190.84   158.23   179.41
  Time (s)       79.6    129.3    103.2    201.4    169.0    214.3    482.9    433.1   1908.5    609.8
  Speedup      107.31   129.60   133.47   109.55   135.65    84.46   118.97   115.02   152.35   104.47   119.08

Consecutive Sampling, Multi Sequence
  I (uA)        301.7    699.2    430.1    824.3    886.9    975.3   2143.6   2616.9  22829.2   4036.3
  Error (%)      4.18     2.00     4.44     4.22     5.37     3.85     4.12     3.01     4.07     6.59     4.19
  L/U           195/5    201/6    202/6    200/4    196/4    198/2    202/4    200/3    197/2    201/4
  ECR          256.41   248.76   247.52   250.00   255.10   252.53   247.52   250.00   253.81   248.76   251.04
  Time (s)       62.3    115.2     94.5    148.9    153.2    160.4    343.0    362.8   1530.4    463.1
  Speedup      137.11   145.46   145.76   148.18   149.64   112.84   167.49   137.30   189.98   137.56   147.13

Table 2-4. Consecutive sampling techniques for LFSR input sequences

Circuit          C432     C499     C880    C1355    C1908    C2670    C3540    C5315    C6288    C7552     Avg.

PowerMill (L=50,000)
  I (uA)        293.3    668.1    383.9    772.4    847.1    963.9   1944.6   2341.7  19912.7   3867.8
  Time (s)       8206    15296    13655    21123    21139    19741    52857    46819   273940    64931

Random Sampling
  I (uA)        307.3    683.4    408.2    794.6    883.4   1007.4   2033.9   2434.1  21290.4   3948.1
  Error (%)      4.77     2.29     6.33     2.87     4.29     4.51     4.59     3.95     6.92     2.08     4.26
  L/U         380/188  386/192  395/195  386/191  382/189  387/192  383/189  391/193  388/192  390/193
  ECR          131.58   129.53   126.58   129.53   130.89   129.20   130.55   127.88   128.87   128.21   129.28
  Time (s)       94.1    178.8    144.2    219.4    242.1    238.1    547.7    544.3     2458    701.4
  Speedup       87.21    85.55    94.69    96.28    87.32    82.91    96.51    86.02   111.45    92.57    92.05

Consecutive Sampling, Single Sequence
  I (uA)        311.1    697.4    398.6    795.3    878.7   1000.8   2047.3   2430.8  20973.8   3957.5
  Error (%)      6.07     4.39     3.83     2.96     3.73     3.83     5.28     3.80     5.33     2.32     4.15
  L/U           260/0    357/0    240/0    306/0    285/0    263/0    266/0    254/0    473/0    244/0
  ECR          192.31   140.06   208.33   163.40   175.44   190.11   187.97   196.85   105.71   204.92   176.51
  Time (s)       69.4    157.6     96.3    187.9    180.1    195.1    385.6    406.2     2971    528.4
  Speedup      118.24    97.06   141.80   112.42   117.37   101.18   137.08   115.26    92.20   122.88   115.55

Consecutive Sampling, Multi Sequence
  I (uA)        315.2    704.5      403    781.3      891    999.9   2059.2   2460.4  21238.4   3956.4
  Error (%)      1.57     3.83     6.41     0.17      6.7     3.73     5.89     5.07     6.66     2.29     4.23
  L/U           196/4    198/4    205/5    199/4    197/4    198/3    198/4    201/3    200/4    200/3
  ECR          255.10   252.53   243.90   251.26   253.81   252.53   252.53   248.76   250.00   250.00   251.04
  Time (s)       56.5    106.1     85.4    138.9    140.8    164.3    319.4    358.7   1517.3    467.3
  Speedup      145.24   144.17   159.89   152.07   150.13   120.15   165.49   130.52   180.54   138.95   148.72

According to the experimental results in Table 2-3, the average speedups of the three methods are 94.21, 119.08 and 147.13, respectively. The multi-sequence approach is 56% faster than the random sampling approach, whereas the single-sequence approach is only 26% faster. The multi-sequence approach gives the highest speedup for all test cases in the benchmark. The average compaction ratio achieved by the random sampling approach is 129.22, with an average error of 4.14%. The average compaction ratio achieved by the single-sequence approach is 179.41, with an average error of 3.32%. The average compaction ratio achieved by the multi-sequence approach is 251.04, with an average error of 4.19%. This shows that the multi-sequence approach dramatically reduces the useless transitions of the random sampling method, so the achieved compaction ratio stays almost exactly at the desired value. In our experiments, the number of useless transitions in the multi-sequence approach never exceeds 6, whereas in the random sampling approach it is at least 187. Compared with the single-sequence approach, the compaction ratio of the multi-sequence approach is still much higher, especially when the pattern pairs of some groups are not uniformly distributed in the original sequence, as in C2670. The average compaction errors of all three approaches are below 5%. This shows that the multi-sequence approach can improve the speedup considerably while keeping reasonable accuracy.

To verify the effect of our approach on input patterns that are not purely random, we perform another experiment with input sequences generated by a linear feedback shift register (LFSR), because an LFSR sequence is easy to generate and has a high spatial correlation, which is quite different from purely random patterns. The experimental results are shown in Table 2-4. Compared with the results in Table 2-3, we can see that the proposed approach remains effective for a different input distribution.

2.8 Summary

In this work, we proposed a multi-sequence sampling technique that reduces the useless transitions in the compacted sequence and alleviates the over-sampling problem of our previous single-sequence approach. By relaxing the limitation slightly and allowing multiple consecutive sequences, we can find better solutions for the vector compaction problem by minimizing the number of sequences instead of fixing it at one. Of course, the number of sequences can still be one, as in the original single-sequence approach, but that is just a special case of the multi-sequence approach. As demonstrated in the experimental results, the multi-sequence approach improves 56% on speed compared to the
