Building Neural Network - Power Model Construction with Neural Networks

Chapter 4 High-Level Power Model with Neural Network

4.3 Power Model Construction with Neural Networks

4.3.1 Building Neural Network

As described in Section 4.2, our power model uses a 3-layer fully connected feedforward neural network trained with the Levenberg-Marquardt algorithm and the mean square error function. The first step is to decide the input data type of this neural network, the number of hidden neurons, and the transfer function of the internal neurons. In practice, the best choices differ from case to case and are hard to analyze theoretically. Therefore, we use a simple experiment on the circuit C1355, which is arbitrarily chosen from the ISCAS’85 benchmark circuits, to explain our decisions for those parameters. Because it is not feasible to show a detailed analysis for every circuit, we verify the feasibility of our approach on the complete benchmark set in Section 4.4 by using the metrics defined in Section 4.3.

A. Input Data Type and Transfer Function

In this work, the input data type of the neural network and the transfer function of the internal neurons are decided together, because the most suitable transfer function depends on the relationship between the input data and the output values of the training set. If this relationship is non-linear, a neural network that uses a linear transfer function for its internal neurons loses accuracy, because this is similar to fitting a non-linear curve with a piece-wise linear curve, as illustrated in Figure 4-1.

Since we are building a high-level power model, the input data of this neural power model can use only the primary input and output information of circuits. The most straightforward idea is to use the complete 4 states of bit transitions (0→0, 0→1, 1→0, 1→1) at the input and output pins of circuits as the input data of the neural network, because both state-dependent leakage power and transition-dependent switching power can then be considered. If we use such bit-level transition data (bit-level statistics) as the input data of our neural power model, the number of neurons in the input layer is fixed at 4*(n+m) for a circuit with n inputs and m outputs. These neurons are TIi,00, TIi,01, TIi,10, TIi,11, TOj,00, TOj,01, TOj,10 and TOj,11. Here, TIi,xy=1 indicates that the i-th input pin changes from logic state x to y, and TOj,xy=1 indicates that the j-th output pin changes from logic state x to y in a pattern pair.

Instead of using bit-level statistics, we could use word-level statistics as the input data of our neural power model. If we consider the input statistics and the output statistics separately, the number of neurons in the input layer is fixed at 8: TI00, TI01, TI10, TI11, TO00, TO01, TO10 and TO11. Here, TIxy represents the ratio of input signals that change from logic state x to y in an input pattern pair, and TOxy represents the ratio of output signals that change from logic state x to y in an output pattern pair. For example, given a circuit with 10 inputs and 10 outputs, we assume that its output signals change from 0101101100 to 0110110111 when the input signals change from 0001110101 to 1010101011. For this pattern pair, the bit-level and word-level statistics are shown in the 4th to 7th rows of Figure 4-5(a) and Figure 4-5(c) respectively. According to the definitions, their input data are formed as shown in Figure 4-5(b) and Figure 4-5(d).

Figure 4-5(a). An example of the bit-level statistics characterization (panel label: Input vectors)

Bit-level statistics (total length=40)

Input vector part                          Corresponding output vector part
TI0,00  TI0,01  . . .  TI9,10  TI9,11      TO0,00  TO0,01  . . .  TO9,10  TO9,11
0       1       …      0       1           1       0       …      0       0

Figure 4-5(b). Input data format with bit-level statistics

Figure 4-5(c). An example of the word-level statistics characterization (panel label: Input vectors)

Word-level statistics (total length=8, fixed)

Input vector part             Corresponding output vector part
TI00  TI01  TI10  TI11        TO00  TO01  TO10  TO11
0.1   0.4   0.3   0.2         0.1   0.4   0.2   0.3

Figure 4-5(d). Input data format with word-level statistics
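The two encodings can be illustrated with a minimal Python sketch that reproduces the example pattern pair above. The function names are hypothetical, and the sketch assumes that pin 0 is the leftmost bit of each pattern string, which the text does not state.

```python
from collections import Counter

# The four possible bit transitions, in the order 0->0, 0->1, 1->0, 1->1.
TRANSITIONS = ("00", "01", "10", "11")


def bit_level_stats(before, after):
    """One-hot transition flags per pin: [T0,00, T0,01, T0,10, T0,11, T1,00, ...]."""
    if len(before) != len(after):
        raise ValueError("both patterns of a pair must have the same width")
    flags = []
    for x, y in zip(before, after):
        flags.extend(1 if x + y == t else 0 for t in TRANSITIONS)
    return flags


def word_level_stats(before, after):
    """Fraction of pins making each transition: [T00, T01, T10, T11]."""
    if len(before) != len(after):
        raise ValueError("both patterns of a pair must have the same width")
    counts = Counter(x + y for x, y in zip(before, after))
    return [counts[t] / len(before) for t in TRANSITIONS]


# Pattern pair from the example in the text (assumed: pin 0 is the leftmost bit).
in_before, in_after = "0001110101", "1010101011"
out_before, out_after = "0101101100", "0110110111"

bit_vector = bit_level_stats(in_before, in_after) + bit_level_stats(out_before, out_after)
word_vector = word_level_stats(in_before, in_after) + word_level_stats(out_before, out_after)

print(len(bit_vector))   # 4 * (10 + 10) entries
print(word_vector)       # 8 entries, independent of the pin count
```

The word-level vector keeps its length of 8 for any circuit, while the bit-level vector grows with the number of pins.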

If we use bit-level statistics as the input data of our neural power model, the neural network may recognize the individual contribution of each input transition to the total power consumption, so the estimation error of the power model is more likely to be reduced. However, the size of this power model grows very fast, especially for circuits with a large number of I/O pins. Moreover, because the complexity of bit-level statistics is so high, it becomes much harder for a neural network to learn the relationship between the bit-level statistics and the power consumption. If we use word-level statistics as the input data of the neural power model, some individual characteristics of each possible input transition may be lost, especially for control-dominated circuits with significantly different power modes. However, this is a common heuristic used in many power models [46][65,66][73] to reduce the modeling complexity with a reasonable loss of accuracy. According to their experimental results, the induced errors are indeed in an acceptable range in most cases.

When selecting the transfer function, we also have to consider the output format of our neural power model. The output of our neural power model is expected to represent the estimated power consumption of pattern pairs. Because the values of power consumption are often distributed continuously over a wide range, transfer functions that produce discrete values, such as the unit-step or the sign function, are not suitable. In our observations, three commonly used functions, the logarithmic sigmoid (logsig), hyperbolic tangent sigmoid (tansig) and linear (linear) functions, are more suitable for our application; they are defined in Equations (4-10), (4-11) and (4-12) respectively. However, it should be noted that the values of power consumption have to be normalized between 0 and 1 in both the training set and the test set if the logarithmic sigmoid or hyperbolic tangent sigmoid function is used in the output neuron. To save the normalization effort, we use the linear function as the transfer function of the output neuron. In the following discussion, we focus on the comparison of those three transfer functions for hidden neurons.

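logsig transfer function: $f(s) = \dfrac{1}{1 + e^{-s}}$ (4-10)

tansig transfer function: $f(s) = \dfrac{e^{s} - e^{-s}}{e^{s} + e^{-s}}$ (4-11)

linear transfer function: $f(s) = s$ (4-12)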
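As a rough illustration of why the output range matters, the three transfer functions named above can be written in their standard textbook forms. The Python names below simply mirror the labels used in the text; this is a sketch, not the implementation used in this work.

```python
import math


def logsig(s):
    """Logarithmic sigmoid: output always lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))


def tansig(s):
    """Hyperbolic tangent sigmoid: output always lies in (-1, 1)."""
    return math.tanh(s)


def linear(s):
    """Linear (identity) transfer function: output is unbounded."""
    return s


# Only the linear function can emit an arbitrary power value directly,
# which is why a sigmoid output neuron would need normalized targets.
for s in (-5.0, 0.0, 5.0):
    print(s, logsig(s), tansig(s), linear(s))
```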

To help us make better decisions, we perform an experiment that compares the accuracy and performance of the 6 combinations of 2 input data types (bit-level and word-level statistics of the input and output pattern pairs) and 3 transfer functions (logsig, tansig, linear). Because this experiment is used only to evaluate the input data types and transfer functions, the other parameters of the neural network are arbitrarily chosen and fixed. Every training process is stopped after 15 iterations instead of using validation error checking as the stop criterion. The number of hidden neurons is fixed at 4. The training set includes 20 sequences, which are uniformly distributed over the space of the average signal probability (Pin) and the average signal transition density (Din), and each sequence includes 1,000 random input pattern pairs with the chosen combination of Pin and Din (PD combination). The comparison using AESP and STDESP for those test cases on C1355 is shown in Table 4-1.

The neural power model using bit-level statistics has higher complexity than the one using word-level statistics, which can be observed in the number of weights |W| and the construction time. However, as shown in the experimental results, the neural power model using bit-level statistics does not improve AESP and STDESP much. According to the analysis above, we select word-level statistics as the input data of our neural power model. Among the neural power models using word-level statistics, the one that uses the tansig function as the transfer function of hidden neurons provides better results on AESP and STDESP. Therefore, we select the tansig function as the transfer function of the hidden neurons in this work.
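The resulting structure (8 word-level inputs, tansig hidden neurons, one linear output neuron) can be sketched as a plain forward pass. The function name and the placeholder parameter values are hypothetical; in this work the parameters come from Levenberg-Marquardt training in MATLAB, which is not reproduced here.

```python
import math


def predict_power(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a 3-layer network: tansig hidden layer, linear output."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


# Word-level input vector taken from Figure 4-5(d).
x = [0.1, 0.4, 0.3, 0.2, 0.1, 0.4, 0.2, 0.3]

# Placeholder (untrained) parameters for 4 hidden neurons; purely illustrative.
w_hidden = [[0.0] * 8 for _ in range(4)]
b_hidden = [0.0] * 4
w_out = [1.0] * 4
b_out = 0.25

# With all-zero hidden weights every hidden neuron outputs tanh(0) = 0,
# so the prediction collapses to the output bias.
print(predict_power(x, w_hidden, b_hidden, w_out, b_out))  # 0.25
```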

Table 4-1. Comparison of input data types vs. transfer functions

[Table body not recoverable. Header fragments: Average Power; Circuits; Input Data & |W|; f(s); AESP; Construction]

B. Number of Hidden Neurons

Another issue to be decided is the number of hidden neurons required in the neural power model. Typically, the minimal number of hidden neurons depends on the complexity of the relationship between the input data and output values in the training set. However, according to experience in neural network research, there is no easy or general way to determine the optimal number of hidden neurons [87]. As mentioned in Section 4.2, our strategy is to start with a neural network that has a small number of hidden neurons and to increase the number of hidden neurons until the stop criterion is satisfied. In the following discussion, we explain the reason for using this strategy through a simple experiment.

We first build a neural power model for C1355 and set the initial number of hidden neurons to 2. The training set is the same one as used in the experiment of Section 4.3.1.A. The validation set includes 20 sequences with the same PD distribution as in the training set, but the size of each sequence is increased to 3,000. In the following experiment, we increase the number of hidden neurons from 2 to 15 and test the accuracy of the neural networks after 15 training iterations using the same test set as used in the experiment of Section 4.3.1.A. The effects of the number of hidden neurons are shown in Table 4-2.

Table 4-2. The effects of the number of hidden neurons

Circuit: C1355 (PI=41, PO=32). The first column is the number of hidden neurons; of the remaining column headers, only the fragment "Average Power" survives.

 2   4.96   3.57   0.004264   50.51
 3   4.92   3.56   0.004255   53.25
 4   3.81   2.81   0.004098   57.17
 5   3.93   3.06   0.004272   59.78
 6   5.67   3.21   0.004550   62.97
 7   3.25   3.38   0.004252   66.41
 8   3.62   3.30   0.004205   71.49
 9   3.20   2.24   0.004011   73.14
10   2.96   2.64   0.003969   78.99
11   3.79   2.89   0.004406   81.83
12   3.72   2.65   0.004357   88.52
13   3.36   2.81   0.004675   90.28
14   3.26   2.41   0.004179   96.56
15   3.31   2.59   0.004163   99.76

According to the results in Table 4-2, increasing the number of hidden neurons does improve both AESP and STDESP when the number is small. However, when the number of hidden neurons is larger than 10, the validation error, AESP, and STDESP may become worse due to the overfitting problem mentioned in Section 4.2.2. Therefore, for this case, the best choice is to set the number of hidden neurons to 10.
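The growing strategy can be sketched as a simple search loop. The exact stop criterion is defined in Section 4.2 and is not reproduced here, so the `patience` rule below is a hypothetical stand-in, and the error values are toy numbers rather than measurements from this work.

```python
def grow_hidden_neurons(validation_error, start=2, limit=15, patience=3):
    """Increase the hidden-layer size until validation error stops improving.

    validation_error: callable that maps a hidden-neuron count to the
        validation error of a network trained with that many hidden neurons.
    patience: hypothetical stop criterion, the number of consecutive sizes
        without improvement that ends the search.
    """
    best_n, best_err = start, validation_error(start)
    stale = 0
    for n in range(start + 1, limit + 1):
        err = validation_error(n)
        if err < best_err:
            best_n, best_err, stale = n, err, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_n, best_err


# Toy error curve (made-up numbers, not results from this work).
toy_errors = {2: 0.50, 3: 0.42, 4: 0.45, 5: 0.38, 6: 0.40, 7: 0.41, 8: 0.43, 9: 0.30}

best = grow_hidden_neurons(toy_errors.__getitem__, start=2, limit=9, patience=3)
print(best)  # (5, 0.38)
```

In this toy run the search stops at 8 hidden neurons and never evaluates 9. An early stop saves training time, but because the error curve is not monotonic it can miss a better size.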

4.3.2 Design of Training Sets

Typically, a power model is expected to be used for different test sets with various input distributions. To achieve this target, the neural power model should be trained over a wide range of the input space so that it can learn from enough examples. Therefore, we randomly decide Pin and Din while generating each sequence. Because an input signal is assumed to make at most one transition per cycle, there is a relationship between Pin and Din as shown in Equation (4-13), whose detailed proof can be found in [65]. Therefore, while generating training sets over a wide range of Pin and Din, we use only the PD combinations that satisfy Equation (4-13), so that the neural networks can learn the correct relationship between the input signal statistics and the power consumption of circuits.

$\dfrac{D_{in}}{2} \le P_{in} \le 1 - \dfrac{D_{in}}{2}$ (4-13)
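One way to see where the bound comes from, and to generate a signal with a chosen PD combination, is a two-state Markov chain. This generator is an assumption made for illustration; the text does not say how the sequences in this work were produced. A chain with P(0→1) = Din / (2(1 − Pin)) and P(1→0) = Din / (2 Pin) has signal probability Pin and transition density Din, and both probabilities stay at or below 1 exactly when Din/2 ≤ Pin ≤ 1 − Din/2.

```python
import random


def is_feasible(p_in, d_in):
    """Check the bound D_in / 2 <= P_in <= 1 - D_in / 2."""
    return d_in / 2 <= p_in <= 1 - d_in / 2


def transition_probs(p_in, d_in):
    """Return (P(0->1), P(1->0)) of a two-state Markov chain whose stationary
    signal probability is p_in and whose transition density is d_in."""
    if not (0 < p_in < 1 and is_feasible(p_in, d_in)):
        raise ValueError("infeasible (P_in, D_in) combination")
    return d_in / (2 * (1 - p_in)), d_in / (2 * p_in)


def generate_bits(p_in, d_in, length, rng):
    """Generate one bit stream with the requested statistics."""
    p01, p10 = transition_probs(p_in, d_in)
    bit = 1 if rng.random() < p_in else 0
    bits = [bit]
    for _ in range(length - 1):
        # Flip the bit with the transition probability of the current state.
        if rng.random() < (p10 if bit else p01):
            bit = 1 - bit
        bits.append(bit)
    return bits


rng = random.Random(0)
stream = generate_bits(0.3, 0.4, 20000, rng)
measured_p = sum(stream) / len(stream)
measured_d = sum(a != b for a, b in zip(stream, stream[1:])) / (len(stream) - 1)
print(round(measured_p, 2), round(measured_d, 2))
```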

The size of the training set is also an important issue while training the neural power model. A related study [88] suggests determining the size of the training set according to Equation (4-14), in which P is the number of samples, |W| is the number of weights to be adjusted and a is the expected accuracy. In this work, our target is a ≥ 95%. According to this requirement, we suggest generating a training set of size P >> 20|W|. A larger training set is expected to produce a more accurate neural power model, but it also increases the characterization time of the power model. In the following experiment, we observe the relationship between the size of the training set and the modeling accuracy.

$P > \dfrac{|W|}{1 - a}$ (4-14)
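As a quick check of the arithmetic behind the P >> 20|W| guideline, assume that Equation (4-14) has the form P > |W|/(1 − a), which is what the guideline implies for a = 0.95. The value |W| = 91 is the largest weight count reported in Section 4.4 and is used here only as an example.

```latex
P > \frac{|W|}{1-a} = \frac{|W|}{1-0.95} = 20\,|W|,
\qquad
|W| = 91 \;\Rightarrow\; P > 1820 .
```

By comparison, the training sets in the following experiment contain 10,000 to 100,000 pattern pairs.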

In this experiment, we use the best neural network structure for C1355 decided in Section 4.3.1, which is a neural network that uses 10 hidden neurons and word-level statistics. As mentioned above, the samples of the training sets should be distributed over a wide range of Pin and Din. Therefore, we generate 4 training sets of 20 sequences each, which have the same uniform distribution over the input space that satisfies the Pin and Din constraints in Equation (4-13). However, the length of each sequence differs between the sets: 500, 1,000, 2,000 and 5,000 pattern pairs respectively. In other words, the total sizes of the training sets are 10,000, 20,000, 40,000 and 100,000 respectively. The validation sets consist of 20 sequences that have the same distribution as the training sequences, but their sequence lengths are multiplied by 3. Therefore, the sizes of the validation sets are 30,000, 60,000, 120,000 and 300,000 respectively. The 4 neural networks trained under these conditions are evaluated after 15 training iterations using the same test set, which consists of 20 sequences with 3,000 pattern pairs each; the results are shown in Table 4-3.

Table 4-3. The effects of the size of training set

Circuit C1355

The experimental results show that the size of the training set does not affect the accuracy much if the training set is large enough. According to this observation, we generate 20 sequences with 3,000 pattern pairs in each sequence as the training set of our neural power model in the following experiments, as a trade-off between the size of the training set and accuracy. Of course, those 20 sequences have a distribution that covers a wide range of the input space.

4.4 Experimental Results

In this section, we demonstrate the accuracy and efficiency of our power model with the ISCAS’85 benchmark circuits and one real design, a combinational divider with a 32-bit dividend/quotient and a 12-bit divisor/remainder. All circuits in our experiments were synthesized with a 0.35 μm cell library. The accuracy is also compared with the traditional 3D-LUT power modeling methodology, which uses Pin, Din and the average output signal transition density (Dout) as its three dimensions, with the interval size of each dimension set to 0.1. Our neural power models, including the training algorithms, were all constructed in MATLAB on an Intel Pentium III 1 GHz mobile CPU with 384 MB of RAM.
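For readers unfamiliar with the baseline, a 3D-LUT power model can be sketched as a table indexed by binned Pin, Din and Dout values. The text gives only the three dimensions and the interval size of 0.1; storing the mean power of the training samples in each cell and returning it without interpolation are assumptions of this sketch, and the sample values are toy numbers.

```python
from collections import defaultdict

BINS = 10  # interval size 0.1 on each dimension, as stated in the text


def bin_index(value):
    """Map a value in [0, 1] to one of BINS intervals (1.0 falls in the last).
    The small epsilon guards against floating-point round-off at bin edges."""
    return min(int(value * BINS + 1e-9), BINS - 1)


def build_lut(samples):
    """samples: iterable of (p_in, d_in, d_out, power) tuples.
    Returns a dict that maps each visited cell to its mean power."""
    sums = defaultdict(lambda: [0.0, 0])
    for p_in, d_in, d_out, power in samples:
        cell = (bin_index(p_in), bin_index(d_in), bin_index(d_out))
        sums[cell][0] += power
        sums[cell][1] += 1
    return {cell: total / count for cell, (total, count) in sums.items()}


def lookup(lut, p_in, d_in, d_out):
    """Return the stored power of the matching cell, or None if the cell
    was never hit by a training sample."""
    return lut.get((bin_index(p_in), bin_index(d_in), bin_index(d_out)))


# Toy characterization data (made-up numbers, not results from this work).
toy_samples = [
    (0.52, 0.31, 0.27, 4.0),
    (0.55, 0.38, 0.22, 6.0),
    (0.12, 0.10, 0.05, 1.0),
]
lut = build_lut(toy_samples)
print(lookup(lut, 0.50, 0.30, 0.20))  # 5.0
print(lookup(lut, 0.90, 0.10, 0.10))  # None
```

Every sequence that falls into the same cell receives the same estimate, which is one plausible reason for a table of this granularity to show a large spread of errors.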

In the model construction phase, the input training sequences are generated over a wide range of input distributions as described in Section 4.3.2. The real power of those input sequences is simulated by a transistor-level simulator, PowerMill, so that the measured power consumption includes both switching power and leakage power, and both can be characterized in the power model. To show that the power models can be used for various input distributions, we test those models with 200 test sequences of 3,000 pattern pairs each. Each sequence has a different Pin and Din, randomly selected over a wide range that satisfies the condition in Equation (4-13). After simulation, the average power consumption estimated by this power model is compared to the simulation results from PowerMill.

All the test circuits are tested with two power estimation approaches that use the same information: the traditional 3D-LUT power model and our neural power model. The same training and test sequences are used for both approaches to make a fair comparison. The performance of both power models is summarized in Table 4-4. The construction time of the neural power models includes the pre-loading time of the training and validation sets, the time to set up the neural network and the elapsed time of the network training process. The transistor-level simulation time is not included.

According to Table 4-4, the average values of AESP and STDESP are 17.58% and 18.59% respectively when we use the traditional 3D-LUT power models. The large values of STDESP show that the errors of this approach are widely spread, which implies that the LUT-based power model may have large errors in some cases.

In contrast, our neural power model has an error of only 4.72% on average over all cases, and the largest AESP is only 8.93%, for the 32-bit divider. The improvement of our neural power model can also be seen in the STDESP. The largest STDESP is only 5.88%, for the 32-bit divider, which shows good agreement with the real power. The combined scatter plots of all ISCAS’85 circuits for our approach and the 3D-LUT approach are shown in Figure 4-6 and Figure 4-7 respectively. To examine all circuits on the same plot, the power consumption of each circuit is normalized by the circuit size and operating frequency. Comparing the two plots, we can see that the estimates of our approach follow the simulated power more closely.

Table 4-4. The comparison between traditional 3D LUT power model and our neural power model

Circuits     C432    C499    C880    C1355   C1908   C2670   C3540   C5315   C6288   C7552   Divider  Average
Input        36      41      60      41      33      233     50      178     32      207     44       -
Time (sec)   313.07  216.28  310.83  323.51  262.48  390.29  391.53  267.34  320.19  324.29  325.23   313.19

Training

Figure 4-6. Scatter plot of neural power model estimation versus PowerMill simulation in ISCAS’85 benchmarks

Figure 4-7. Scatter plot of 3D-LUT estimation versus PowerMill simulation in ISCAS’85 benchmarks

The storage requirements of our approach are also much smaller. According to the results shown in Table 4-4, the maximum number of hidden neurons in our experiments is 9. This means that our largest model is a 9-hidden-neuron structure with 91 elements in the weight matrix |W| to record the power characteristics, which is quite small compared to the lookup tables, which require 500 (=10*10*10/2) numbers. These experimental results also show that the complexity of our neural power model has almost no relationship with the circuit size or the number of inputs and outputs. Even for large circuits such as the 32-bit divider, the complexity of the power model is still the same as that of smaller circuits such as C432. Besides, the construction of the neural power model is fast, as can be observed from the short construction time and the total training iterations of each neural power model in Table 4-4. Therefore, such a power model is efficient even for complex circuits and also has high accuracy.
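The storage figures quoted above can be checked with a few lines of Python. The sketch assumes that |W| counts one bias per hidden and output neuron in addition to the connection weights; this assumption reproduces the 91 elements stated for 9 hidden neurons. The function names are hypothetical.

```python
def weight_count(n_inputs, n_hidden, n_outputs=1):
    """Connection weights plus biases of a fully connected 3-layer network."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs


def lut_entries(bins_per_dim=10):
    """Cells of a 3D-LUT with interval size 0.1, halved as in the text."""
    return bins_per_dim ** 3 // 2


# Word-level model: 8 inputs regardless of the circuit.
print(weight_count(8, 9))              # 91
print(lut_entries())                   # 500

# For contrast, a bit-level input layer for C1355 (41 inputs, 32 outputs)
# with the same 9 hidden neurons would need 4 * (41 + 32) = 292 inputs.
print(weight_count(4 * (41 + 32), 9))  # 2647
```

Because the word-level input layer always has 8 neurons, the weight count depends only on the number of hidden neurons, which is why the model size does not grow with the circuit.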

Another important piece of information not shown in Table 4-4 is the estimation time of our power model. The estimation time of our neural power model is in fact dominated by the functional simulation time of a logic simulator, which simulates the circuits with specific input vectors to obtain the corresponding output vectors. If we assume that the output values under specific input sequences are also provided by users, the estimation time of our neural power model is always less than one second for all ISCAS’85 circuits. Therefore, the estimation time is not shown in Table 4-4, because it is almost equal to the logic simulation time, which is quite small compared with that of low-level power estimation methods such as PowerMill.

To demonstrate that the neural power model can handle specific functional patterns in practical use, we also test the practical design, the 32-bit divider, with user-given functional patterns. The functional sequence consists of only 1,000 pattern pairs. Nevertheless, the average estimation error is only 5.98% compared to the PowerMill results.

4.5 Summary

In this work, we propose a novel power model for complex digital circuits, which uses neural networks to learn the power characteristics during simulation, including both leakage power and switching power. Unlike the power characterization process in traditional
