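Third year - Research method - Research on Emerging Electronic Design Automation Technologies for the Post-Submicron Era, Subproject 4: Applying Computational Intelligence and Reasoning to Address Reliability Challenges in Circuit Design in the Post-Deep-Submicron Era (III)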

Chapter 1 Introduction

1.2 Research method

1.2.3 Third year

In this year, we proposed a novel approach for SSER analysis, similar to block-based SSTA, in which a transient fault is decomposed into two transitions for analysis: a rising edge and a falling edge. Each edge is processed using an analytical approach and statistical static timing analysis (SSTA), which is based on a first-order closed-form expression.

Because the transient fault is analyzed mathematically, the runtime can be greatly reduced and timing information can be preserved, which helps describe the interactive behavior of transient faults. However, correlations are the main concern when applying a closed-form block-based approach to SSER estimation. In theory, all correlations between transition signals and the corresponding gate delays must be considered. In our experiments, however, neglecting the correlation between transition signals changed the SER by less than 1%, so this correlation can be ignored. Thus, we devised a parameterized SSTA framework that takes the timing correlation into account to derive a more accurate SER.

Experimental results demonstrate that our approach can provide reasonable results much more rapidly than all previous works.

Chapter 2 Fundamentals of Statistical Soft Error Rate

2.1 Transient-fault behavior in very deep submicron era

Transient faults exhibit two characteristics in the very deep sub-micron era. One makes the faults more unpredictable, whereas the other causes the discrepancy in Figure 1.2. In this section, these two characteristics are explained and associated with the electrical-masking and timing-masking mechanisms, respectively.

2.1.1 To be electrically better or worse?

The first observation comes from running static SPICE simulation on a path consisting of various gates (2 AND, 2 OR and 4 NOT gates) in the 45nm PTM technology. As shown in Figure 2.1, the radiation particle first strikes the output of the first NOT gate with a collection charge of 32fC, and the resulting transient fault then propagates along the other gates with all side inputs set properly. The voltage pulse widths ($pw_i$) of the transient fault, starting at the struck node and then after each gate along the path in order, are 171ps, 183ps, 182ps, 177ps, 178ps, 169ps, 166ps and 173ps, respectively. Comparing each $pw_i$ with $pw_{i+1}$ shows how the voltage pulse width changes during propagation in Figure 2.1.

Figure 2.1: Static SPICE simulation of a path in the 45nm technology

As we can see, the voltage pulse width of the transient fault grows through gates #1, #4 and #7, whereas gates #2, #3, #5 and #6 attenuate it.
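The classification above can be checked directly from the reported pulse widths. The following is a minimal sketch; the function name `classify_gates` is our own and is not part of the original analysis flow.

```python
# Pulse widths (ps) reported in the text: pw_0 at the struck node,
# then pw_1..pw_7 after gates #1..#7.
pulse_widths_ps = [171, 183, 182, 177, 178, 169, 166, 173]


def classify_gates(pws):
    """Return (broadening, attenuating) gate numbers, 1-based."""
    broadening, attenuating = [], []
    for k in range(1, len(pws)):
        if pws[k] > pws[k - 1]:
            broadening.append(k)      # output pulse wider than input pulse
        elif pws[k] < pws[k - 1]:
            attenuating.append(k)     # output pulse narrower than input pulse
    return broadening, attenuating


broadening, attenuating = classify_gates(pulse_widths_ps)
print("broadening gates:", broadening)    # [1, 4, 7]
print("attenuating gates:", attenuating)  # [2, 3, 5, 6]
```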

Furthermore, gates of the same type behave differently when receiving different voltage pulses. Taking AND-type gates as an example, the output $pw_1$ is larger than the input $pw_0$ on gate #1, while the opposite ($pw_3 < pw_2$) occurs on gate #3. This result suggests that the voltage pulse width of a transient fault does not always diminish, which contradicts some assumptions made in traditional static analysis [10]. A similar phenomenon called Propagation Induced Pulse Broadening (PIPB) was reported in [25], which states that the voltage pulse width of a transient fault widens as it propagates along a long inverter chain.

2.1.2 When error-latching probability meets process variations

The second observation is dedicated to the timing-masking effect under process variations. In [9][18], the error-latching probability (PL) for one flip-flop is defined as

$$P_L = \frac{pw - w}{t_{clk}} \tag{1}$$

where $pw$, $w$ and $t_{clk}$ denote the pulse width of the arriving transient fault, the latching window of the flip-flop, and the clock period, respectively. However, process variations turn $pw$ and $w$ into random variables. Therefore, we need to redefine Equation (1) as follows.

Definition ($P_{\text{err-latch}}$, error-latching probability)

Assume that the pulse width of an arriving transient fault and the latching window ($t_{setup} + t_{hold}$) of the flip-flop are random variables, denoted as $pw$ and $w$, respectively. Let $x = pw - w$ be another random variable, and let $\mu_x$ and $\sigma_x$ be its mean and standard deviation. The latch probability is defined as:

$$P_{\text{err-latch}}(pw, w) = \frac{1}{t_{clk}} \int_{0}^{\mu_x + 3\sigma_x} x \cdot P(x > 0) \cdot dx \tag{2}$$

With the above definition, we further illustrate the impact of process variations on SER analysis. Figure 2.2(a) shows three transient-fault distributions with the same pulse-width mean (95ps) under different σproc’s: 1%, 5% and 10%. A fixed latching window w = 100ps is assumed, as indicated by the solid lines. According to Equation (1), static analysis results in zero SER under all σproc’s because 95 − 100 < 0.

From a statistical perspective, however, these transient faults all yield positive and different SERs. This can be explained by two terms in Equation (2): P(x > 0) and x.

First, in Figure 2.2(a), the cumulative probabilities for pw > w under the three different σproc’s are 17%, 40%, and 49%, respectively. The largest σproc corresponds to the largest P(x > 0) term. Second, in Figure 2.2(b), we compute the pulse-width averages for the portion x = pw − w > 0, and they are 1, 13 and 26, respectively. Again, the largest σproc corresponds to the largest x term.

These two effects jointly suggest that a larger σproc leads to a larger P_err−latch, an effect that has been neglected in traditional static analysis, and they also explain the increasing discrepancy shown in Figure 1.2. In summary, process variations make traditional static analysis no longer effective and should be considered for accurate SER estimation in scaled CMOS designs.
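The contrast between the static and the statistical view can be sketched numerically. The snippet below is only an illustration under our own assumptions: x = pw − w is taken to be normally distributed, the three spreads in `ASSUMED_SIGMAS_PS` and the clock period `T_CLK` are hypothetical, and Equation (1) is clamped at zero for pulses narrower than the window. It is not meant to reproduce the percentages read from Figure 2.2, which come from SPICE-derived distributions.

```python
import math


def static_latch_probability(pw, w, t_clk):
    """Equation (1), clamped at zero when the pulse is narrower than the window."""
    return max(pw - w, 0.0) / t_clk


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def statistical_terms(mu_x, sigma_x):
    """Return P(x > 0) and E[x | x > 0] for x ~ Normal(mu_x, sigma_x**2)."""
    alpha = (0.0 - mu_x) / sigma_x
    p_pos = 1.0 - normal_cdf(alpha)
    mean_pos = mu_x + sigma_x * normal_pdf(alpha) / p_pos  # truncated-normal mean
    return p_pos, mean_pos


PW_MEAN, W = 95.0, 100.0                # values from the text (ps)
T_CLK = 1000.0                          # assumed clock period (ps)
ASSUMED_SIGMAS_PS = [2.0, 10.0, 30.0]   # hypothetical spreads of x (ps)

static_p = static_latch_probability(PW_MEAN, W, T_CLK)
results = [statistical_terms(PW_MEAN - W, s) for s in ASSUMED_SIGMAS_PS]

print("static P_L:", static_p)
for s, (p_pos, mean_pos) in zip(ASSUMED_SIGMAS_PS, results):
    print(f"sigma={s:5.1f}ps  P(x>0)={p_pos:.3f}  E[x|x>0]={mean_pos:.2f}ps")
```

Under these assumptions both terms grow with the spread while the static estimate stays at zero, which is the qualitative behavior described above.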

Figure 2.2: Process-variation vs. error-latching probabilities

2.2 Impact of spatial correlation

Variations have become increasingly important as technology scales. As technology nodes move beyond 90nm, growing device-parameter variations are changing design flows from deterministic to probabilistic. Process variations can be classified into two categories: inter-die variations and intra-die variations. As modern technologies continue to scale rapidly and steadily, intra-die variations can significantly affect the variability of performance parameters on a chip. Intra-die variations are locally layout-dependent and are therefore spatially correlated.

Devices with similar layout patterns and proximity structures tend to have similar characteristics. Intra-die variations are also globally location-dependent: devices located close to each other are more likely to have similar characteristics than devices placed far apart. With increased process scaling, intra-die variations are becoming a more dominant portion of the overall variability of device features, meaning that devices on the same die can no longer be treated as identical copies of the same device.

Ignoring process variations leads to an underestimated (overly optimistic) SSER. Previous works do consider the impact of process variations, but they do not include spatial correlations in the statistical soft error rate, which leads to incorrect SSERs. Therefore, we investigate the impact of spatial correlations in our project to assess the accuracy of SSERs, as shown in Figure 2.3.

Figure 2.3: SSER comparison of static SPICE simulation, Monte Carlo SPICE simulation, and the proposed MC frameworks with and without spatial correlations

According to Figure 2.3, circuit SER is overestimated under 5% process variation when spatial correlations are not considered: the SER that accounts for spatial correlations is generally lower than the SER that does not. Therefore, we propose an effective model that extends the analysis of statistical soft error rate to include spatial correlations. We next explain the model used for process variations and for the spatial correlations of intra-die variations.

Several models are available for handling parameter correlations. The first is the grid model, in which the die area is divided by a square grid. Each square of the grid is assumed to contain a group of fully correlated devices and is modeled as a random variable (RV) that is correlated with the random variables of the other squares. The second is the quadtree model, which recursively divides the die area into four squares until individual gates fit into the grid. The resulting partitions are stacked level upon level, and each square at each level is assigned an independent random variable. The random variable for a gate is computed by summing the random variables of all squares that cover that gate. Because nearby gates share common random variables at the higher levels, spatial correlations are captured properly.
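The quadtree construction can be sketched as follows. This is a minimal illustration under our own assumptions: the die is the unit square, every square at every level gets a zero-mean Gaussian RV with the same variance, and the gate coordinates are hypothetical. Under the equal-variance assumption, the correlation between two gates is simply the fraction of quadtree levels on which they share a square.

```python
import random


def covering_regions(x, y, levels):
    """Return the quadtree squares (level, col, row) covering point (x, y).

    Level 0 is the whole die; level l splits the die into 2**l x 2**l squares.
    """
    regions = []
    for level in range(levels):
        cells = 2 ** level
        col = min(int(x * cells), cells - 1)
        row = min(int(y * cells), cells - 1)
        regions.append((level, col, row))
    return regions


def model_correlation(p1, p2, levels):
    """Correlation implied by the model when all square RVs have equal variance."""
    shared = set(covering_regions(*p1, levels)) & set(covering_regions(*p2, levels))
    return len(shared) / levels


def sample_variations(points, levels, sigma=1.0, seed=0):
    """Draw one sample of the per-gate variation as a sum of square RVs."""
    rng = random.Random(seed)
    region_value = {}
    samples = []
    for x, y in points:
        total = 0.0
        for region in covering_regions(x, y, levels):
            if region not in region_value:
                region_value[region] = rng.gauss(0.0, sigma)
            total += region_value[region]
        samples.append(total)
    return samples


LEVELS = 3
# Hypothetical gate locations on a unit-square die.
gate_a, gate_b, gate_c, gate_e = (0.10, 0.10), (0.12, 0.11), (0.30, 0.10), (0.90, 0.90)
samples = sample_variations([gate_a, gate_b, gate_c, gate_e], LEVELS)

print(model_correlation(gate_a, gate_b, LEVELS))  # same leaf square
print(model_correlation(gate_a, gate_c, LEVELS))  # share the two upper levels
print(model_correlation(gate_a, gate_e, LEVELS))  # share only the top level
```

Note that in this sketch even distant gates remain correlated through the top-level square, which plays a role similar to a die-wide component.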

Without loss of generality, we used the grid model at the beginning of our project to apply spatial correlations to soft-error analysis. We partitioned the die region into $n_{row} \times n_{col} = n^2$ grids to model the intra-die spatial correlations of parameters. Devices in the same grid are assumed to be perfectly correlated, devices in nearby grids highly correlated, and devices in far-apart grids weakly correlated or uncorrelated.

Devices that are close to each other are more likely to have similar characteristics than devices placed far apart. For example, Figure 2.4 shows gate a in grid (1, 1) and gate e in grid (3, 3). Since they are far from each other, we assume that their parameters are uncorrelated. Gate c lies in grid (1, 2), so gate a and gate c are in neighboring grids; because of their spatial proximity, their parameter variations are not identical but should be highly correlated. Since gate a and gate b are located in the same grid, we assume that the variations of their gate lengths are identical.
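The qualitative rules of the grid model can be written as a small correlation function. The sketch below is only one possible choice: the linear decay with grid distance and the `cutoff` value are our own assumptions, not the correlation profile used in the project, and a production flow would also need to ensure that the resulting correlation matrix is valid (positive semidefinite).

```python
import math


def grid_correlation(grid_a, grid_b, cutoff=2.5):
    """Assumed correlation between two grids, decaying linearly with distance."""
    distance = math.dist(grid_a, grid_b)   # Euclidean distance in grid units
    if distance == 0:
        return 1.0                          # same grid: perfectly correlated
    return max(0.0, 1.0 - distance / cutoff)


# Grid coordinates of the gates discussed in the text.
gates = {"a": (1, 1), "b": (1, 1), "c": (1, 2), "e": (3, 3)}

rho_ab = grid_correlation(gates["a"], gates["b"])
rho_ac = grid_correlation(gates["a"], gates["c"])
rho_ae = grid_correlation(gates["a"], gates["e"])
print(rho_ab, rho_ac, rho_ae)
```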

Figure 2.4: Gates in different grids with different process variations

Our algorithm makes a second assumption: there are no correlations between different types of process parameters, and nonzero correlations may exist only among the same type of process parameter in different grids. For instance, the Lg values of transistors in nearby grids are correlated with each other, but they are uncorrelated with other parameters such as Wg or Wint in any grid. In addition, we treat interconnect parameters in different layers as different types of parameters.

2.3 Full-spectrum analysis or not

Some previous works simplify the SER estimation by injecting only four levels of electrical charges. Therefore, our project poses a simple, yet important question,

“Are four levels of electrical charges enough to converge SER correctly and properly address the process-variation effect?”

Figure 2.5: (a) SERs of four-level and full-spectrum charge collection w.r.t. different latching-window size (b) SERs w.r.t. different levels of charge collection

Figure 2.5(a) compares SERs from Monte-Carlo SPICE simulations in which different levels of charge were collected on a sample circuit (c17 from ISCAS’85) with different latching-window sizes. The line with square symbols and the line with circle symbols represent the SERs induced by four-level and full-spectrum charge collection, respectively. When the latching-window size was set to 100ps, the SERs obtained from four-level and full-spectrum analyses were the same.

However, as the latching-window size grew to 150ps, the effective range of charge collection for SSER analysis increased from 35fC to 132fC, and the SER difference between four-level and full-spectrum analyses grew to 69%. Another question naturally arises: “If four levels of charge collection are not sufficient to derive accurate SERs, how many levels are sufficient?”

Figure 2.6: Transient-fault distributions induced by four-level and full-spectrum charge collection

Figure 2.5(b) suggests the answer. All levels of deposited charge should be considered because SERs increase with the levels of charge collection. The SER difference caused by using different levels of deposited charge is further illustrated in Figure 2.6, where the upper and lower parts show SER estimation using only four levels of charge and using all levels of charge, respectively. The X-axis and Y-axis denote the pulse width of transient faults and the effective frequency of a particle strike for different levels of deposited charge, respectively. For the analysis using four-level deposited charges, only four transient-fault (TF) distributions were generated and could contribute to the final soft error rate. In other words, soft errors can only be generated from four concentrated distributions, which may introduce errors into SER integration. When the latching-window size of a flip-flop was far from the first TF distribution, soft errors from such TF distributions were entirely masked by the timing-masking effect. For example, the biggest pulse-width distribution in the upper part of Figure 2.6 is excluded from SER estimation. In the analysis using all levels of deposited charge (Figure 2.6, lower part), however, only part of them (the smaller decomposed TF distributions) were masked.

As a result, SER estimation using only four levels of charge is no longer valid; the analysis should instead comprehensively consider full-spectrum charge collection.

2.4 Problem formulation of statistical soft error rate (SSER)

In this section, we formulate the statistical soft error rate (SSER) problem for general cell-based circuit designs. Figure 2.7 illustrates a sample circuit subject to process variations, where the geometries of each cell vary [21]. Once high-energy particles strike the diffusion regions of these variable-size cells, the electrical behavior of the resulting transient faults also varies considerably, as shown in Figures 1.2, 2.1 and 2.2. Accordingly, to accurately analyze the soft error rate (SER) of a circuit, we need to consider both the process-variation impacts and the three masking effects discussed in Chapter 1 simultaneously, which gives rise to the statistical soft error rate (SSER) problem.

Figure 2.7: An example for illustrating the SSER problem

The SSER problem is composed of three elements: (1) electrical-probability computation, (2) propagation-probability computation and (3) overall SER estimation.

The mathematical explanation of the SSER problem below proceeds in reverse order, from overall SER estimation down to electrical probability computation.

2.4.1 Overall SER estimation

The overall SER for the circuit under test (CUT) can be computed by summing up the SER’s of each individual node in the circuit. That is,

$$SER_{CUT} = \sum_{i=0}^{N_{node}} SER_i \tag{3}$$

where $N_{node}$ denotes the total number of nodes in the CUT that can be struck by radiation particles, and $SER_i$ denotes the SER contributed by node i.

Each SERi can be further formulated by integrating over the range q = 0 to Qmax (the maximum collection charge from the environment) the products of particle-hit rate and the total number of soft errors that q can induce at node i.

Therefore,

$$SER_i = \int_{q=0}^{Q_{max}} \left( R_i(q) \times F_{\text{soft-err}}(i, q) \right) dq \tag{4}$$

In a circuit, F_soft−err(i, q) represents the total number of expected soft errors from each flip-flop that a transient fault from node i can propagate to. Ri(q) represents the effective frequency for a particle hit of charge q at node i in unit time according to [1][8]. That is,

$$R_i(q) = F \times K \times A_i \times \frac{1}{Q_s} e^{-q/Q_s} \tag{5}$$

where $F$, $K$, $A_i$ and $Q_s$ denote the neutron flux (> 10MeV), a technology-independent fitting parameter, the susceptible area of node i in cm², and the charge collection slope, respectively.
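Equations (4) and (5) can be sketched numerically, which also illustrates why the number of charge levels matters (Section 2.3). In the snippet below, all parameter values in `RATE` and the function `toy_soft_err` are hypothetical placeholders standing in for F_soft−err(i, q), and the trapezoidal rule is our own choice of discretization; the printed numbers are therefore not comparable to the SERs reported in Figure 2.5.

```python
import math


def particle_hit_rate(q, flux, k_fit, area_cm2, q_slope):
    """Equation (5): effective hit frequency for collected charge q at one node."""
    return flux * k_fit * area_cm2 * math.exp(-q / q_slope) / q_slope


def node_ser(charges, soft_err_fn, **rate_params):
    """Trapezoidal approximation of Equation (4) over sorted charge levels."""
    total = 0.0
    for q_lo, q_hi in zip(charges, charges[1:]):
        f_lo = particle_hit_rate(q_lo, **rate_params) * soft_err_fn(q_lo)
        f_hi = particle_hit_rate(q_hi, **rate_params) * soft_err_fn(q_hi)
        total += 0.5 * (f_lo + f_hi) * (q_hi - q_lo)
    return total


# Placeholder parameters (assumptions, not values from the report).
RATE = dict(flux=1.0, k_fit=1.0, area_cm2=1.0, q_slope=20.0)
Q_MAX = 150.0


def toy_soft_err(q):
    """Hypothetical stand-in for F_soft-err(i, q): ramps from 0 to 1 with charge."""
    return min(1.0, max(0.0, (q - 30.0) / 100.0))


four_level = node_ser([0.0, 50.0, 100.0, 150.0], toy_soft_err, **RATE)
full_spectrum = node_ser([float(q) for q in range(151)], toy_soft_err, **RATE)
print("four-level SER_i:   ", four_level)
print("full-spectrum SER_i:", full_spectrum)
```

With these placeholder inputs the coarse four-level sum deviates noticeably from the fine-grained one, which is the kind of integration error discussed in Section 2.3.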

2.4.2 Logical probability computation

F_soft−err(i, q) depends on all three masking effects and can be decomposed into

$$F_{\text{soft-err}}(i, q) = \sum_{j=0}^{N_{ff}} P_{logic}(i, j) \times P_{elec}(i, j, q) \tag{6}$$

where Nff denotes the total number of flip-flops in the circuit under test. Plogic(i, j) denotes the overall logical probability of successfully generating a transient fault and propagating it through all gates along the path from node i to flip-flop j. It can be computed by multiplying the signal probabilities for specific values on target gates as follows.

$$P_{logic}(i, j) = P_{sig}(i = 0) \times \prod_{k \in i \to j} P_{side}(k) \tag{7}$$

where k denotes one gate along the target path (i→j) starting from node i and ending at flip-flop j, $P_{sig}$ denotes the signal probability for the designated logic value, and $P_{side}$ denotes the signal probability for the non-controlling values (i.e., 1 for AND gates and 0 for OR gates) on all side inputs along the target path.

Figure 2.8 illustrates an example where a particle striking net a results in a transient fault that propagates through net c and net e. Suppose that the signal probability of being 1 and 0 on one arbitrary net i is Pi and (1-Pi), respectively. In order to propagate the transient fault from a towards e successfully, net a needs to be 0 while net b, the side input of a, and net d, the side input of c, need to be non-controlling, simultaneously.

Figure 2.8: Logical probability computation for one sample path

Therefore, according to Equation (7),

$$\begin{aligned} P_{logic}(a, e) &= P_{sig}(a = 0) \times P_{side}(a) \times P_{side}(c) \\ &= P_{sig}(a = 0) \times P_{sig}(b = 1) \times P_{sig}(d = 0) \\ &= (1 - P_a) \times P_b \times (1 - P_d) \end{aligned}$$
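As a quick numerical check of Equation (7) on this sample path, the sketch below multiplies the three probabilities. The signal probabilities assigned to nets a, b and d are hypothetical values chosen only for illustration.

```python
def logic_probability(p_strike_value, side_input_probs):
    """Equation (7): struck-node probability times all side-input probabilities."""
    prob = p_strike_value
    for p in side_input_probs:
        prob *= p
    return prob


# Hypothetical probabilities of nets a, b and d being logic 1.
P_a, P_b, P_d = 0.5, 0.8, 0.3

# Net a must be 0, side input b must be 1, side input d must be 0.
p_logic_a_e = logic_probability(1 - P_a, [P_b, 1 - P_d])
print(p_logic_a_e)  # (1 - 0.5) * 0.8 * (1 - 0.3), about 0.28
```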

2.4.3 Electrical probability computation

Electrical probability Pelec(i, j, q) comprises the electrical and timing masking effects and can be further defined as

$$P_{elec}(i, j, q) = P_{\text{err-latch}}(pw_j, w_j) = P_{\text{err-latch}}(\lambda_{\text{elec-mask}}(i, j, q), w_j) \tag{8}$$

While $P_{\text{err-latch}}$ accounts for the timing-masking effect as defined in Equation (2), $\lambda_{\text{elec-mask}}$ accounts for the electrical-masking effect with the following definition.

Definition ($\lambda_{\text{elec-mask}}$, electrical masking function)

Given the node i where the particle strikes to cause a transient fault and the flip-flop j where the transient fault finally ends, assume that the transient fault propagates along one path $(i \rightsquigarrow j)$ through $v_0, v_1, \ldots, v_m, v_{m+1}$, where $v_0$ and $v_{m+1}$ denote node i and flip-flop j, respectively. Then the electrical masking function is defined as

$$\lambda_{\text{elec-mask}}(i, j, q) = \delta_{prop}(\cdots(\delta_{prop}(\delta_{prop}(pw_0, 1), 2))\cdots, m) \tag{9}$$

where $pw_0 = \delta_{strike}(q, i)$ and $pw_k = \delta_{prop}(pw_{k-1}, k)$ $\forall k \in [1, m]$.

In the above definition, the two functions $\delta_{strike}$ and $\delta_{prop}$ represent the first-strike function and the electrical propagation function of transient-fault distributions, respectively. $\delta_{strike}(q, i)$ is invoked once and maps the collection charge q at node i into a voltage pulse width $pw_0$. $\delta_{prop}(pw_{k-1}, k)$ is invoked m times and iteratively computes the pulse width $pw_k$ after the input pulse width $pw_{k-1}$ propagates through the k-th cell from node i. These two functions are also the most critical components for the success of a statistical SER analysis framework because of the difficulty of integrating process-variation impacts.
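The nested structure of Equation (9) is a left fold of the propagation function over the cells on the path. The sketch below shows this with deterministic toy stand-ins: `toy_delta_strike` and `toy_delta_prop` simply replay the pulse widths reported in Section 2.1.1, whereas the real functions operate on transient-fault distributions under process variations. All names here are our own.

```python
from functools import reduce


def elec_mask(q, node, path_cells, delta_strike, delta_prop):
    """Equation (9): apply delta_strike once, then delta_prop for cells 1..m."""
    pw0 = delta_strike(q, node)
    return reduce(delta_prop, path_cells, pw0)


# Toy stand-ins replaying the path of Section 2.1.1 (widths in ps).
STRIKE_PW_PS = {32: 171.0}                       # 32 fC strike -> pw_0
WIDTH_CHANGE_PS = {1: 12.0, 2: -1.0, 3: -5.0, 4: 1.0,
                   5: -9.0, 6: -3.0, 7: 7.0}     # pw_k - pw_(k-1) per gate


def toy_delta_strike(q, node):
    return STRIKE_PW_PS[q]


def toy_delta_prop(pw_prev, k):
    return pw_prev + WIDTH_CHANGE_PS[k]


pw_final = elec_mask(32, "n0", range(1, 8), toy_delta_strike, toy_delta_prop)
print(pw_final)  # 173.0, the last pulse width listed in Section 2.1.1
```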

The theoretical SSER in Equation (7) and Equation (9) is analyzed from a path perspective. In practice, however, since the signal probabilities and the transient-pulse changes through a cell are independent of each other, the computation of SSER only needs to proceed stage by stage and thus can be implemented in a block-based fashion.

Chapter 3, Chapter 4 and Chapter 5 will present three different block-based SSER frameworks: a table-lookup framework, an SVR learning framework, and an SSTA-like framework, respectively. These frameworks all consider process variations but differ in the way they compute δstrike and δprop.

Chapter 3 Table-lookup Monte-Carlo (MC) Framework

The first framework combines the current static approaches with the Monte-Carlo (MC) method, a computational algorithm using repeated random samplings to mimic complex statistical behaviors of physical or mathematical systems.

As depicted in Figure 1.3, this framework maps to the formulation in Section 2.4 using three loops: the outermost loop considers various levels of collection charge $q_i$, which forms the discrete approximation of Equation (4); the second loop accounts for all vulnerable nodes within a circuit, which corresponds to Equation (6); the innermost loop maps to Equation (9) and computes δstrike and δprop implicitly. As the key component of the framework, the innermost loop can be further decomposed into two parts: (1) cell pre-characterization and (2) sampling and renewal of transient faults.
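The three-loop structure can be sketched as a skeleton. This is not the framework's actual implementation: `toy_hit_rate` and `toy_soft_errors` are hypothetical stand-ins for the particle-hit rate and for the table-driven sampling of transient faults, the midpoint rule is our own choice for discretizing the charge range, and all constants are placeholders.

```python
import math
import random


def mc_sser(charges, nodes, n_samples, hit_rate, sample_soft_errors, seed=0):
    """Three nested loops: charge levels, struck nodes, Monte-Carlo samples."""
    rng = random.Random(seed)
    ser = 0.0
    for q_lo, q_hi in zip(charges, charges[1:]):        # loop 1: charge levels
        q_mid, dq = 0.5 * (q_lo + q_hi), q_hi - q_lo
        for node in nodes:                                # loop 2: vulnerable nodes
            errors = sum(sample_soft_errors(node, q_mid, rng)
                         for _ in range(n_samples))       # loop 3: MC samples
            ser += hit_rate(node, q_mid) * (errors / n_samples) * dq
    return ser


# Hypothetical constants and models (assumptions, not from the report).
W_PS, T_CLK_PS, Q_SLOPE = 100.0, 1000.0, 20.0


def toy_hit_rate(node, q):
    return math.exp(-q / Q_SLOPE) / Q_SLOPE


def toy_soft_errors(node, q, rng):
    pw = rng.gauss(4.0 * q, 0.4 * q)          # assumed pulse-width model (ps)
    return max(pw - W_PS, 0.0) / T_CLK_PS     # latched fraction of the clock period


charges = [float(q) for q in range(0, 151, 5)]
nodes = ["n1", "n2", "n3"]
ser = mc_sser(charges, nodes, 200, toy_hit_rate, toy_soft_errors)
print(ser)
```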

3.1 Cell pre-characterization

To reflect the electrical masking effect of transient faults on one cell intertwined with process variations, an approach similar to [26] is employed to extract pre-characterized tables. The objective of such pre-characterized tables is to model the pulse width and voltage magnitude for each cell as random variables that can be sampled during the particle-strike process and transient-fault propagation of one cell.

Table contents are derived from Monte-Carlo SPICE simulation data with targeted process-variation parameters (or from direct silicon measurement on test structures, if applicable). Considering the mapping relationship, two types of tables are built for each cell separately: one for the particle-strike process, Tstrike, and the other for transient-fault propagation, Tprop.

3.1.1 Particle-strike table Tstrike

Tstrike maps the collection charge q incurred by the particle strike to the electrical properties of cells. Figure 3.1 illustrates an example of pre-characterizing one AND gate by properly setting up the SPICE simulation environment. Figure 3.1(a) is the circuit netlist, where a charge q is injected at the output of the AND gate as an independent current source according to [7]:

$$I(q, t) = \frac{q}{\tau_\alpha - \tau_\beta} \times \left( e^{-t/\tau_\alpha} - e^{-t/\tau_\beta} \right) \tag{10}$$

An arbitrary number of cells are also generated and connected as the output loading of the AND gate. The capacitance of each cell is normalized to that of the unit-size inverter (NOT). The final output loading is obtained by summing over all output cells and is expressed as a total number of equivalent NOTs.
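A useful sanity check on the current source in Equation (10) is that its time integral equals the injected charge q. The sketch below verifies this numerically; the time constants `TAU_ALPHA_PS` and `TAU_BETA_PS` are assumed values for illustration only, and the trapezoidal integration is our own choice.

```python
import math


def strike_current(q, t, tau_alpha, tau_beta):
    """Equation (10): double-exponential current pulse for collected charge q."""
    scale = q / (tau_alpha - tau_beta)
    return scale * (math.exp(-t / tau_alpha) - math.exp(-t / tau_beta))


def collected_charge(q, tau_alpha, tau_beta, t_end, step):
    """Trapezoidal integral of the current from 0 to t_end."""
    n_steps = int(round(t_end / step))
    total = 0.0
    for n in range(n_steps):
        t0, t1 = n * step, (n + 1) * step
        total += 0.5 * step * (strike_current(q, t0, tau_alpha, tau_beta)
                               + strike_current(q, t1, tau_alpha, tau_beta))
    return total


Q_FC = 32.0                                # charge used in Section 2.1.1 (fC)
TAU_ALPHA_PS, TAU_BETA_PS = 164.0, 50.0    # assumed time constants (ps)

total_q = collected_charge(Q_FC, TAU_ALPHA_PS, TAU_BETA_PS, t_end=4000.0, step=0.5)
print(total_q)  # close to 32 fC
```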

Figure 3.1: Pre-characterization of particle-strike table Tstrike for an AND gate

Given a fixed q, a number of MC runs with different SPICE settings are repeated, as in Figure 1.3, to compute the means and variances of the pulse width and voltage magnitude of the resulting transient fault. Figure 3.1(b) shows the table for the AND gate, which includes four matrices that store the mean and sigma values of the pulse widths and voltage magnitudes of transient faults: the pulse-width mean matrix ($M_{pw}^{\mu}$), pulse-width variance matrix ($M_{pw}^{\sigma}$), voltage-magnitude mean matrix ($M_{vm}^{\mu}$) and voltage-magnitude variance matrix ($M_{vm}^{\sigma}$). Note that since first-strike transient faults are sensitive to input vectors, the input vector also serves as an index in Tstrike.
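One way to picture Tstrike is as a lookup keyed by input vector, collected charge and output loading, from which a first-strike transient fault is sampled. The sketch below is a hypothetical layout: the dictionary structure, the Gaussian sampling and every number in `T_STRIKE_AND` are placeholders of our own, not characterized data from the report.

```python
import random

# Hypothetical T_strike for one AND gate.
# key   = (input vector, charge in fC, load in equivalent unit inverters)
# value = (pulse-width mean, pulse-width sigma, voltage mean, voltage sigma)
T_STRIKE_AND = {
    ("11", 32, 1): (150.0, 10.0, 1.00, 0.05),
    ("11", 32, 2): (140.0, 9.0, 0.95, 0.05),
    ("10", 32, 1): (145.0, 10.0, 0.98, 0.05),
}


def sample_strike(table, input_vector, charge, load, rng):
    """Draw one (pulse width, voltage magnitude) sample from a table entry."""
    pw_mu, pw_sigma, vm_mu, vm_sigma = table[(input_vector, charge, load)]
    pw = max(0.0, rng.gauss(pw_mu, pw_sigma))   # widths cannot be negative
    vm = max(0.0, rng.gauss(vm_mu, vm_sigma))
    return pw, vm


rng = random.Random(1)
draws = [sample_strike(T_STRIKE_AND, "11", 32, 1, rng) for _ in range(2000)]
mean_pw = sum(pw for pw, _ in draws) / len(draws)
print(mean_pw)
```

Sampling from per-entry means and sigmas in this way is what allows the Monte-Carlo loop to avoid running SPICE for every strike.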

3.1.2 Transient-fault propagation table Tprop

The transient-fault propagation table Tprop, on the other hand, reflects the changes of electrical properties when propagating the transient fault through one cell.

Figure 3.2(a) shows the sample SPICE simulation environment to pre-characterize the
