Advantage of applying importance sampling on QMC

Chapter 3 Table-lookup Monte-Carlo (MC) Framework

3.4 Applying importance sampling on QMC

3.4.2 Advantage of applying importance sampling on QMC

Two problems arise when the expectation of some function with respect to a

integration: the integrand can have singularities when the domain of the distribution is unbounded and it can be very expensive or difficult to sample points from a general multivariate distribution.

( ) ( ) ( ) ( ) ( )

d d

f q

R R

E

∫ q x f x dx

∫ q x dF x

For typical applications, we want to derive expectation, variance and some quantities of all marginal distributions together with all correlations between the variables. If we consider, for instance, the expectation, it is obvious that we have to use q(x) = xk to obtain the expectation of the marginal.

The convergence rate often can be increased when highly uniform point sets (HUPS, also called low discrepancy sequences or quasi-random numbers) are used instead of (pseudo-) random points. Such methods are called quasi-Monte Carlo methods (QMC). There exist HUPS where the star discrepancy (and thus the QMC estimator) converges withO((lnN)^d/N). When E f (q) has to be evaluated with respect to some non-uniform distribution F with bounded domain, similar results exist.

The QMC approach requires point sets with low F–discrepancy. Such point sets are also created by applying appropriate transformation methods on low discrepancy sequences. However, for general multivariate distributions such transformations are hard to ﬁnd and/or numerically very expensive. Moreover, these may introduce singularities into our integration problem and thus convergence is not guaranteed by the Koksma-Hlawka inequality.

Thus, we need to transform low discrepancy point sets into sets of points with low F-discrepancy when we calculate the QMC estimator. However, the transformation methods that have been developed for non-uniform random variety generation cannot be applied for QMC, because these destroy the structure of the underlying point set. Moreover, the theory of non-uniform random numbers does not directly apply when quasi-random numbers are used.

The problem of generating non-uniform random points and that of generating non-uniform quasi-random points should be seen as different problems. For the first one, we need to transform uniform random numbers into random points. The correctness of the transformation is verified using probability theory. The structure of the used uniform pseudo-random point set is usually not taken into consideration. The

latter problem of generating quasi-random points should be interpreted as transforming the integration problem with respect to F over Rd into an equivalent one over (0, 1) with respect to the Lebesgue measure. This is required as HUPS are constructed to work for the integral over the unit cube. From this perspective, it is somewhat surprising that most papers dealing with QMC methods for evaluating expectations E f (q) do not concern about the problem of appropriate transformations.

There are some problems with this approach. First, the inverse CDF, F⁻¹, is often not given in closed form and thus numerical methods that only compute F⁻¹(u) approximately have to be used. There exist fast methods for this task, but they either require the CDF or compute it by integrating the density function numerically. In the multivariate case the inversion method can be applied to the marginal distributions, if the components of the random vector X are stochastically independent. Otherwise, the conditional distribution method must be used which can be seen as the multivariate generalization of the inversion method. It needs the (inverse) CDF of conditional distributions of marginal distributions which is practically never available in practice.

Moreover, the F-discrepancy is increased when the components are not independent.

A more serious problem in the framework of QMC is the fact that the inter-grand q(F⁻¹(u)) is often unbounded and thus has unbounded variation when the support of the distribution is unbounded. This is for instance the case when the m-th moment of the i-th variable has to be computed in Bayesian inference where q(x) = xim

or in derivative pricing in ﬁnancial engineering when q(x) behaves like exp (∑ xd i

i=1 ). In this case the Koksma-Hlawka inequality does not apply and convergence of the QMC estimator is not guaranteed.

Chapter 4 Support-Vector-Regression (SVR) Learning Framework

The table-lookup Monte-Carlo framework is inherently limited in execution efficiency because it computes δstrike and δprop indirectly using extensive samplings of Monte-Carlo runs. In this section, we propose another learning-based framework to do the task directly with support of support vector regression (SVR) and is found to be both more efficient and more accurate. Note that our SVR-learning framework can be represented in the same flowchart as Figure 1.3 with the replacement of first-strike tables (Tstrike) and propagation tables (Tprop) with respective learning models (δstrike

and δprop) as shown in Figure 4.1.

Figure 4.1: Proposed statistical SER framework using support-vector-regression models

By definition, δstrikeand δprop are functions of pw that is a random variable. From Figure 2.1 and Figure 2.2, we assume pw follows the normal distribution, which can be written as:

pw~N(μ_pw, σ_pw) (13) Therefore, we can decompose δstrike and δprop into four models: δ_strike^μ , δstrikeσ

, δpropμ

, and δpropσ

where each can be defined as:

δ: x�⃗ → y (14)

where x�⃗ denotes a vector of input variables and y is called the model’s label or target value

Integrating the impact of process variations, four models are traditionally built using lookup tables. However, lookup tables have two limitations on applicability: (1) inaccurate interpolation and (2) coarse model size control. First, lookup tables can take only finite table indices and must use interpolation. However, interpolation functions are often not accurate enough or difficult to obtain, especially as the table dimensionality grows. Second, a lookup table stores data samples in a grid-like fashion, where the table will grow prohibitively large for fine resolution. Meanwhile, the information richness often differs across different parts of a table. For example, we observe that pulse widths generated by strong charges behave much simpler than weaker charges do. Naturally, simple behaviors can be encoded with fewer data points in the model, whereas complicated behaviors need to be encoded with more.

In statistical learning theory, such models are built using regression, which can be roughly divided into linear [27] and non-linear [28] methods. Among them, Support Vector Regression (SVR) [29] [30] combines linear methods’ efficiency and non-linear methods’ descriptive power. SVR has two advantages over lookup tables:

(1) It gives an explicit function and need no interpolation. (2) It filters out unnecessary points and yields compact models. In the following, we propose a methodology to adapt the framework in Chapter 3 to a learning-based one based on SVR models, which comprises training sample preparation, SVR model training, and parameter selection. Also, the modification of the MAX operation in Equation (11) is addressed.

4.1 Training sample preparation

SVR models differ from lookup tables on the way we prepare training samples for them. For lookup tables, one starts from selecting a finite set of points along each table dimension. On one hand, they should be chosen economically; on the other hand, it is difficult to cover all corner cases with only a limited numbers of points. For SVR models, we do not need to select these points. Instead, we provide large sets of training samples, and let the SVR algorithm do the selection task.

A training sample set S of m samples is defined as:

S ∈ (X��⃗ × Y)^m = {(x�⃗₁, y₁), ⋯ , (x�⃗m, y_m)} (15)

where m pairs of input variables x�⃗i’s and target values yi’s are obtained from massive Monte-Carlo SPICE simulation. For δ_strike^μ , δstrikeσ

, we use input variables including charge strength, driving gate, input pattern, and output loading; for δpropμ

, δpropσ

, we use input variables including input pattern, pin index, driving gate, input pulse-width distribution ( μ_pwⁱ⁻¹ and σ_pwⁱ⁻¹ ), propagation depth, and output loading.

In our training samples, we implement output loading using combinations of arbitrary cell input pins. Doing so preserves additional information for the output loading status and saves the labor (and risk) of characterizing the capacity of each cell’s input pin. Although the number of such combinations can easily explode, there are usually only a limited number of representatives, which are automatically identified by SVR. Furthermore, from a learning perspective, since both peak voltage and pulse width are the responses of charge injection current formulated in Equation (10), they are highly correlated. Empirically, using pulse-width information alone can yield satisfactory SSERs and thus in our framework, we do not need to incorporate models for peak voltage.

4.2 Support vector machine and its extension to regression

Support vector machine (SVM) is one of the most widely used algorithms for learning problems [29] and can be summarized with the following characteristics:

 SVM is an efficient algorithm and finds a global minimum (or maximum) for a convex optimization problem formulated from the learning problem.

 SVM avoids the curse of dimensionality by capacity control and works well

with high-dimensional data.

 SVM automatically finds the decision boundary for a collection of samples using a small subset where each sample is called a support vector.

The basic idea behind SVM is to find a function as the decision boundary with minimal errors and a maximal margin to separate data in multi-dimensional space.

Given a training set S, with x�⃗i ∈ Rⁿ, yi ∈ R, the SVM learning problem is to find a function f (first assume y = f(x�⃗i) = 〈w��⃗ ∙ x�⃗〉 + b that models S properly. Accordingly, the learning task is formulated into a constrained optimization problem as follows,

minimize ‖w��⃗‖²+ C(∑ ξ^m_i=1 _i)^k

subject to �yi(〈w��⃗ ∙ x�⃗〉 + b) ≥ 1 − ξ_i, i = 1, … , m,

ξ_i ≥ 0, i = 1, … , m (16) ξ_i is a slack variable providing an estimate of the error induced by the current decision boundary; C and k are user-specified parameters indicating the penalty of function errors in control. Later, the Lagrange multiplier method can efficiently solve such a constrained optimization problem [29] and finds w��⃗ and b for f(x�⃗i) =

〈w��⃗ ∙ x�⃗〉 + b with a maximal margin 2/|w��⃗| between 〈w��⃗ ∙ x�⃗〉 + b = +1 and h

〈w��⃗ ∙ x�⃗〉 + b = −1. Figure 4.2 shows an example for a two-dimensional data set containing samples of two different classes. Figure 4.2(a) illustrates many possible decision boundaries to separate the data set whereas Figure 4.2(b) shows the one with the maximal margin and the minimal errors that the user can tolerate among all boundaries.

Figure 4.2: Linear decision boundaries for a two-class data set

One SVM algorithm can be applied to regression problems with three steps: (1) primal form optimization, (2) dual form expansion, and (3) kernel function

substitution. The primal form presents the nature of the regression whereas the dual form provides the key to the later non-linear extension using kernel functions. In our framework, ϵ-SVR [29] is implemented to realize a family of highly non-linear regression models f(x�⃗_i) ∶ x�⃗ → y for δ_strike^μ , δ_strike^σ , δ_prop^μ , and δ_prop^σ for pulse-width mean and sigma of first-strike functions and pulse-width mean and sigma of propagation functions, respectively.

4.2.1 Primal form optimization

The regression’s goal is to derive a function that minimizes slacks and meanwhile to make f as smooth as possible. The corresponding constrained optimization problem for ϵ-SVR is modified as follows,

minimize ‖w��⃗‖²+ C ∑ (ξ^m_i=1 _i²+ ξ�_ı²)

subject to �

(〈w��⃗ ∙ x�⃗_i〉 − y_i) ≤ ϵ + ξ_i, i = 1, … , m, y_i− (〈w��⃗ ∙ x�⃗〉 + b) ≤ ϵ + ξ�, i = 1, … , m,_ı

ξ_i, ξ� ≥ 0, i = 1, … , m_ı

(17)

where the two slack variables ξ_i and ξ� represent variations of the error exceeding _ı and below the target value by more than ǫ, respectively. The parameter C determines the trade-off between the smoothness of f(x�⃗i) and the variation amount of errors (ξ_i and ξ�) to be tolerated. Equation (17) is termed the regression’s primal form. _ı

4.2.2 Dual form expansion

Instead of finding w��⃗ directly, the Lagrange multiplier method transforms the optimization problem from the primal form to its dual form and derives f as,

f(x�⃗i) = ∑ (α^m_i=1 i− α_i^∗)〈x�⃗ ∙ x�⃗i〉 + b (18)

where α_i, α_i^∗ are Lagrange multipliers and b is a function of ϵ, C, α’s and α^∗’s [30].

Several findings can be inferred from Equation (18). First, the only inner product

〈x�⃗ ∙ x�⃗i〉 implies that only an unseen sample x�⃗ and a training sample x�⃗i, are sufficient to predict a new unseen target value y. Second, only training samples x�⃗i’s that correspond to nonzero (α_i− α_i^∗)’s contribute to the prediction outcome. All other samples are unnecessary for the model and are filtered out during the training process.

Third, the inner product operation is a form of linear combination. As a result, the predicted target values of such a model are all linear combinations of training samples and thus f is a linear model. In practice, SVR often keeps only few samples (i.e., x�⃗i’s with nonzero coefficients) in its models and thus benefits from both smaller model size and faster prediction efficiency.

4.2.3 Kernel function substitution

According to the statistical learning theory [29], SVM remains valid if the inner product operation 〈u�⃗ ∙ v�⃗〉 in Equation (18) is substituted by a kernel function K(u�⃗ ∙ v�⃗) [31]. That is,

f(x�⃗i) = ∑ (α^m_i=1 i− α_i^∗)K(x�⃗ ∙ x�⃗i) + b (19) Radial Basis Function (RBF) is one kernel function used in our framework and can be formulated as K(u�⃗ ∙ v�⃗) = e−γ∙‖u��⃗−v��⃗‖² where γ is a controlling parameter. Unlike the inner product operation, the RBF kernel is highly non-linear. This enables the SVM algorithm to produce families of non-linear models that are suitable to capture complicated behaviors, like that of generation and propagation of pulse-width distributions of transient faults.

4.3 Parameter search

Now we return to the issue of selecting parameters (ϵ,C,γ) that have an unbounded number of combinations and is critical to achieving fine model quality.

Figure 4.3 illustrates 200 models built from the same training sample set; each point represents one model using a distinct parameter combination. Their quality is measured along two coordinates: Y-axis denotes the error rate for prediction; X-axis denotes the sample compression ratio, the ratio between the number of samples kept by the model and the original size of S. Figure 4.3 shows that while it is possible to obtain an ideal model that is small and accurate (indicated by the circle), it is also possible to obtain a large and inaccurate model (indicated by the square). The differences are 20X in both axes, and there is so far no deterministic method to find the best combination.

Since exhaustive search is clearly impractical, we need an efficient searching process with an effective cost function, which is written as:

�ϵ�, C�, γ�� = arg min (E^kR) (20)

In Equation (20), E^kR denotes the cost function, where E and R respectively denote error rate and compression ratio, and k is a parameter controlling the trade-off between E and R. A larger k makes the cost function more sensitive to the error rate, and vice versa. Note that if a single k is used, the cost function may wrongly select a combination with one matrix being extremely low whereas the other being undesirably high, e.g. E = 0.01% and R = 0.99%, as indicated by the triangle in Figure 4.3.

Figure 4.3: Quality comparison of 200 models using different parameter combinations Therefore we applied a prioritized scheme according to predefined goals on both matrices as illustrated in Figure 4.4. Assuming the goal (E < 6%, R < 10%), we draw finite grids near these goals, and prioritize them accordingly. For example, G0 is preferred over G1, G1 is preferred over G2, and G0 ∼ G⁵ are preferred over G^′. The main idea is that in grids where E is small or R is large, the cost function is adjusted to be more insensitive to error. Therefore, k is assigned smaller in grids with lower error rate or larger compression ratio, as illustrated in Figure 4.4. In G^′, k = 3.5 generally works well.

After determining the cost function, exhaustive search may still take months. To speed up the searching process, we observe two helpful properties from samples of all our four types of models. First, a sufficiently large (> 500) sample subset shares similar behaviors as the complete sample set. Second, points forming a cluster in Figure 4.3 have similar parameter combinations. For example, the combination (ϵ, C, γ) of points within the circle has a range of [2⁻⁴, 2⁻⁶] on ϵ, [2⁴, 2⁸] on C, and [2⁻², 2⁻⁶] on γ. The first property enables the use of subset search; the second property allows for incremental search with granularity. Parameter search is critical for building SVR models. Using the prioritized cost function, we can systematically find a good

parameter combination

Figure 4.4: Prioritized scheme for parameter search

4.4 An intensified learning framework

This section uses an intensified learning with data construction method for statistical model extraction. The proposed algorithm also incorporates an automatic bounding-charge selection technique to remove unnecessary charges for facilitating SER estimation.

4.4.1 Intensified learning with data reconstruction

Although the SVM provides accurate and compact models estimate SER, two problems remain unsolved: (1) the training time for data preparation and (2) parameter search for high quality models. For these two problems, this framework incorporates a metaheurisitc, particle swarm optimization (PSO), to facilitate the search for the optimal setting within short training time.

PSO is one of the evolutionary computation techniques developed by Kennedy and Eberhart in 1995 [32]. PSO adopts a strategy to search for potential solutions based on the behavior of particle swarms which are inspired by swarm intelligence from insects, birds and fish. Initially, PSO generates a set of random particles in a multi-dimensional search space. The position and velocity are each represented by a particle. The position indicates a possible solution of the optimization problem and the velocity is used to determine the search direction. During each iteration, particles change their positions by tracking the best position of all particles (Gbest) and their best positions (Pbest). The velocity and position of particle i is updated according to the following equation:

Vik+1= wVik+ c1r1�Pbest − Xik� + c2r2�Gbest − Xik� X_i^k+1= X_i^k+ V_i^k+1

,where k is the iteration number, w is the inertia weight, c1 and c2 are the learning factor, and r1 and r2 are random numbers among range [0,1].

The advantages of PSO are easy implementation, it requires only a few setting parameters to be adjusted, and it is capable of avoiding being trapped in a local optimum solution when compared with other evolutionary algorithms, such as the genetic algorithm (GA).

Figure 4.5 illustrates the interaction between our intensified SVM-learning and PSO. First, PSO generates a set of training parameters required for SVM to build behavioral models. After building the training models, the SVM reports model’s accuracy to PSO as its fitness value. Based on the model’s accuracy, PSO will breed new generations and generate better parameters for training. This process iterates for a specific number of generations or until achieving a stopping criteria.

Figure 4.5: Intensified SVM learning with PSO

Besides PSO, this study uses a data reconstruction technique to reduce the size of the training data and greatly improve the training time and the compression ratio of models. This data reconstruction calculates the average value of training data in each block (Fig. 4.6). The red points represent the raw data from the extensive SPICE simulation. The green points illustrate the average values of each block. After reconstruction, the size of the training data is greatly reduced. Combining the intensified learning with data reconstruction, the framework can systematically find a set of high quality parameters to build accurate models. Furthermore, training time significantly reduces from the order of months to the order of hours.

Figure 4.6: Example for data construction

4.4.2 Automatic bounding-charge selection

Computing SER with full-spectrum charge collection is still challenging, even using the new models. Therefore, to save time from too many rounds of statistical analysis, a technique of automatic bounding-charge selection is further proposed to discover charges that only need to be computed by traditional static analysis.

Figure 4.7 shows the mean, sigma, lower bound (mean-3*sigma), and upper bound (mean+3*sigma) of TF distributions, which are induced by different levels of deposited charges. Results show that the mean of pulse widths increases monotonically as the deposited charge increases. The larger deposited charge also leads to a smaller sigma of its pulse width. Hence, larger lower- and upper- bounds of the TF distribution can be observed when the level of charge collection increases.

Figure 4.7: The mean, sigma, lower bound (mean-3*sigma) and upper bound (mean+3*sigma) of TF distribution which are induced by different electrical charges.

Based on this finding, a technique of automatic bounding-charge selection is proposed to accelerate the overall SER estimation. For computing overall SERs, this study only needs to consider the distribution of a pulse width, which overlaps the

latching-window (Fig. 4.8). The pulse-width distributions in dotted lines are entirely masked. The pulse-width distributions in solid lines undoubtedly results in soft errors.

In other words, when the lower bound of a TF distribution exceeds than the latching-window size, the SER from such distribution can be replaced by the corresponding static results. On the contrary, the SER from a distribution in the dotted line induced by a weaker deposited charge (its upper bound is smaller than the

在文檔中後次微米時代新興電子設計自動化技術之研究---子計畫四：應用計算智慧推理處理後深次微米時代電路設計上的可靠度挑戰(III) (頁 38-0)