Parallelization for Nanoscale Double-Gate MOSFETs Simulation . 34

3.3 The Developed Parallelization Technique

3.3.1 Parallelization for Nanoscale Double-Gate MOSFETs Simulation . 34

In this section, we introduce the parallelization for semiconductor devices simulation which is based on the partition of the simulation domain. In this investigation, based on a poste-riori error estimation, the triangular mesh generation, the adaptive finite volume method,

the monotone iterative method, and the parallel domain decomposition algorithm, a set of two-dimensional quantum correction hydrodynamic (HD) equations is solved numerically for the double-gate metal-oxide-semiconductor field effect transistors (MOSFETs) on our constructed cluster system.

Classical HD model consists of at least five coupled partial differential equations (PDEs).

The HD equations in semiconductor device simulation are as follows,

4φ = q

where φ is the electrostatic potential and n and p are classical electron and hole concen-trations. T_n and T_p are electron and hole temperature (K). The electric field E (V/cm) is defined by E = −∇φ, q is the elementary charge its unit is coulomb. ε_sis the semiconduc-tor permittivity (F/cm). w0 is the average carrier energy (eV) in the thermal equilibrium, and the net doping concentration is D(x, y) = N_D⁺(x, y) − N_A⁻(x, y). R is the net recom-bination rate (cm⁻³s⁻¹), and τ_nw and τ_pware the electron and hole energy relaxation time

(s) approximations. The average carrier energy consists of the thermal energy and the drift

for electrons and holes, respectively. m^∗_nand m^∗_pare the electron and hole effective masses (Kg). vn and vp are the electron and hole mean velocities (cm/s). The carrier’s currents and energy flux densities are given by

J_n= −qµ_nn∇φ + qD_n∇n + µ_nk_Bn∇T_n, (3.8)

where µn and µp are the carrier mobility (cm²/V−s). The diffusion coefficients, Dn and D_p (cm²/s), satisfy the Einstein relation. Qn and Qp are the heat flows. However, to simulate the nanoscale double-gate MOSFETs, the classical HD model is insufficient due to significant quantum confinement effects. Therefore, we have to consider the quantum mechanical effects in the classical HD model above. The quantum correction equation for the quantum corrected inversion-layer charge densities n_QM is

n_QM = a₀n_CL(1 − exp(−a₁ξ²(1 −1 2(ξ

ξ0

)²) − a₂ξ³), (3.12)

Gate Oxide 1

Figure 3.8: An illustration of the cross-section view for the n-type double-gate MOSFET device.

where n_CLis the classical electron density solved from the Poisson equation. ξ = x/λ_th is dimensionless and λth = (_2m₀^~_k²_B_T)^1/2 is the thermal wavelength ( ˚A), ~ is the reduced Planck constant (J−s), m₀ is the electron rest mass (Kg), k_B is the Boltzmann constant (J/K), T is the absolute temperature (K) and ξ₀ = T_si/2λ_this dimensionless. Together with the auxiliary equation (3.12), the conventional HD model forms a quantum correction HD model. The associated boundary condition for the model is the same with the conventional HD model.

We solve the above quantum correction HD model with a parallel adaptive computing technique for nanoscale double-gate MOSFETs, as shown in Fig. 3.8. The full set of quantum HD model is firstly decoupled with Gummel’s method, and each decoupled PDE is solved sequentially. Figure 3.9 shows the adaptive computational procedure. For each

Finite Volume

Figure 3.9: Adaptive finite volume solution algorithm for each

decoupled semiconductor device equation. A is the system matrix, Z is the unknown vector, and F is the nonlinear vector.

decoupled semiconductor device equation, we first partition the solution domain into a set of finite volumes. Each decoupled PDE is then approximated by the finite volume method. After the assembling the nonlinear algebraic system, the monotone iterative solver is directly applied to solve the system of nonlinear equations. Once an approximate solution is computed, a posteriori error analysis is performed to assess the quality of the approximate solution. The error analysis will produce error indicators and an error estimator. If the

Gate Oxide 1

Figure 3.10: Parallel domain decomposition for the two-dimensional double-gate MOSFET. The dash-lines indicate the boundary of partition domains.

estimator is less than a preset error tolerance (TOL), the adaptive process will be terminated and the approximate solution can be post-processed for solving next equation or analyzing physical properties. Otherwise, a refinement scheme is employed to refine the current elements depending on the magnitude of the error indicator for that element. A newer partition of the domain is thus created and a new solution procedure is repeated.

In parallelization, a pre-processor will prepare the input data required for each client.

In the configuration of MPI and Linux cluster, input data is prepared on the pre-processor and is sent to each client through TCP/IP, and then all clients do their own local jobs and exchange essential data with the neighbor clients. When each client completes its own jobs, it sends the results to a post-processor. The post-processor then estimates the solution er-ror. If the estimator is larger than a TOL, the post-processor delivers the computed results

to the pre-processor for the next mesh refinement. After the calculation and adaptation stages for error estimation as well as error indicators, a workload imbalance may exist due to the change of the number of meshes on individual client. As shown in Fig. 3.10, based on device structure and bias condition, the simulation domain is dynamically partitioned into several disjoint sub-domains. When a refined tree structure is created, the number of processors for the next computation will first be dynamically assigned and allocated fol-lowing the total number of nodes. We apply the geometric dynamic graph partitioning method in x- or/and y-direction to partition the total number of nodes and assign those par-titioned nodes to each processor. For the quantum correction HD model simulation, each partition sub-domain contains five nonlinear systems to be solved, where the systems have arisen from Gummel’s decoupled and adaptive FV approximated device PDEs. Once previ-ous results are given, the boundaries for partitioned sub-domains are totally separated. We solve the nonlinear systems with the MI method independently. When newer MI solutions of nonlinear systems are computed, we perform the boundary data exchange for the next Gummel’s iteration loop. The parallel domain decomposition is shown in Algorithm 2 and a procedure for the corresponding parallel dynamic partition is shown in Algorithm 3. The communicate between processors is based on the MPI libraries.

Begin Parallel Domain Decomposition Algorithm While (Error>Tolerance)

For (all elements)

Count the number of nodes and Do the dynamic partition algorithm End For

For (all jobs) Solve this job End For

For (all elements)

Perform error estimation and mesh refinement End For

End While

End Parallel Domain Decomposition Algorithm

Begin Dynamic Partition of Domain Decomposition For (all elements)

Count the number S of nodes sequentially End For

Estimate an optimal number N of CPUs with the number of nodes empirically

Figure 3.11: The number of nodes (and elements) in Log scale versus the refinement levels.

The number M of jobs for each CPU = S / N For (i = 1 to N)

Assign M nodes to CPU(i) End For

End Dynamic Partition of Domain Decomposition

The number of refined elements and nodes grows as the refinement levels increase. At

Table 3.2: A list of the achieved sequential and parallel time, efficiency, and speedup with respect to different number of nodes. It is performed on an 8-processors PC-based Linux cluster system.

Nodes Sequential time (sec.) Parallel time (sec.) of Speedup Efficiency the 8-processors system

1000 9.4 4.1 2.29 28.66 %

4000 87 26.1 3.30 41.67 %

8000 367 83.6 4.39 54.87 %

16000 794 179.2 4.43 55.39 %

32000 4041 735.6 5.49 68.67 %

64000 9403 1993.6 5.30 66.27 %

90000 27775 4487 6.19 77.38 %

250000 151016 20906 7.22 90.29 %

the beginning, the number of refined nodes (and elements) grows fast due to significant variations of solution gradients. After several refinements and solution processes, the in-creasing rate of the number of nodes (and elements) gradually becomes slow when the refinements increase. It eventually reaches to a saturated condition. Figure 3.11 reports the relationship of the number of nodes (and elements) versus the refinement levels. The increasing rate of the number of nodes (and elements) gradually becomes slow when the refinements increase. Therefore, it confirms the computational effectiveness of the adaptive computing method in the numerical simulation of the quantum correction transport model.

The explored double-gate MOSFETs is shown in Fig. 3.8, where the length of the device is equal to 20nm.

Figure 3.12: The maximum difference versus the number of nodes.

Parallelization of the adaptive simulation is performed on our Linux cluster. To verify the parallel performance of the domain decomposition method for the nanoscale double-gate MOSFETs simulation, the same device under V_G1 = V_G2 = 1.0 V and V_DS = 0 V is considered for the following cases. Table 1 reports the details of the achieved sequential and parallel time, efficiency, and speedup with respect to different number of nodes. It is performed on an 8 nodes PC-based Linux cluster system. In our numerical experience, a 7.22 speedup factor is obtained on the tested 8 nodes system. Figure 3.12 is the maximum

Figure 3.13: The parallel speedup and efficiency versus the number of processors.

difference versus the number of nodes. The maximum difference is defined as the maxi-mum difference of the code execution time divided by the maximaxi-mum execution time. For the simulation with 2, 4, 8, and 16 processors, the maximum difference decreases and tends to a stable value when the number of nodes increases. It shows a good dynamic load bal-ancing for the domain decomposition. Figure 3.13 is the achieved speedup and efficiency, where the speedup is the ratio of the code execution time on a single processor to that on

multiple processors. Efficiency is defined as the speedup divided by the number of proces-sors. The speedup is about 12.2 for the simulation running on a 16 nodes system and 75%

efficiency is maintained.

3.3.2 Parallelization for Large-scale Protein Folding Dynamics

In this section, one parallelization technique for the implicitly restarted Arnoldi algorithm that used to solve the eigenvalue problem for large-scale protein folding dynamics are pre-sented. Dynamics of protein folding has recently been of great interest [131–133]. It is nowadays explored in different way, such as mass action models, all atoms, lattice, and off-lattice model, and methods between macroscopic and microscopic models. One of biopolymer folding dynamic core problems is finding an ensemble of transition state con-formations or rate-limiting steps. For small molecules, numerical methods theoretically could find transition states, i.e. the saddle points of the potential energy surface. However, for proteins or RNA macromolecules, the potential energy surface of them is usually more complicated and has many pathways. Among approaches, a master equation is crucial in dynamic simulation of protein folding. Deriving from the Liouvillian, the master equa-tion basically describes time evoluequa-tion of a distribuequa-tion of trajectories. Numerical soluequa-tion of the master equation requires computing eigenvalues λ_N and eigenvectors for the cor-responding N by N transition matrix. Those negative and larger eigenvalues determine

the slowest dynamical relaxation processes. The size of transition matrix grows dramat-ically with respect to the length of studied protein, which results in a large-scale eigen-value problem to be solved. Three different approaches, the QR [134], implicitly restarted Arnoldi [135–137], and Jacobi-Davidson [138] methods are compared in calculating all or first few larger nonpositive eigenvalues of the matrix arising from the master equation in protein folding problems. The implicitly restarted Arnoldi method presents its robust in calculating the desired eigenpairs of the master equation. Therefore, parallelization of the implicitly restarted Arnoldi method is implemented for protein folding dynamics with large sequences. The developed parallelization scheme principally partitions the operations of the matrix.

The master equation describing time evolution of trajectory (i.e., the transition of states) is expressed as

dP (t)

dt = AP (t). (3.13)

To study the dynamics of Eq. (3.13), we have to compute the eigenvalues (in general, the first few largest nonpositive eigenvalue) of the matrix A, where P (t) is the N-dimensional vector of the instantaneous probability of the N conformations. Depending on different free energy, A is a sparse and asymmetric N by N transition (or rate) matrix, where the matrix entries is defined as

A_ij =

½ k_i→j ,for i 6= j P

l6=iki→l ,for i = j, (3.14)

where k_i→j is the rate constant for a protein changes its conformation from the i^th con-formation to the j^th conformation, and only depends on the energy states between two conformations

k_i→j =

½ 1 ,for E_i ≥ E_j e^(Eⁱ^−E^j^)/kT ,for Ei < Ej

. (3.15)

where E_i and E_j are energy states. k is the Boltzmann constant and T means the tem-perature. For any initial conformations, solutions of Eq. (3.13) provides the dynamics of population

P (t) =

N =1X

m=0

C_m^(c)n_me^−λ^m^t, (3.16)

where the −λ_mand n_mare the m^theigenvalue and eigenvector. The overall folding kinetics is the linear combination by a coefficient Cm^(c)of N possible protein folding conformations.

Therefore, Cm^(c)means contribution to overall kinetics from the m^thmode. The eigenvalues can be ordered as λ₀ < λ₁ ≤ λ₂ ≤ ... ≤ λ_{N −1}. The eigenvector n₀ for the equilibrium mode gives the equilibrium population distribution. The eigenvector n₁ for the slowest mode indicates the intermediate state of protein folding process which takes longest time to fold to the next conformation.

Furthermore, people may interest in the largest nonpositive eigenvalue λ₁. If there is a large gap between the eigenvalues of slow modes group and a fast modes group, then the overall folding speed is limited by these slow modes. Therefore, this protein may fold fast in the beginning because of the group of fast modes; then there is a rate-determining process

Table 3.3: The number of conformations of proteins with the length of L residues in a 2D square lattice model.

Protein length (L) Number of conformations (N )

4 5

because of the group of slow modes, and finally it fold as the ground state conformation with its function. Especially for the eigenvalue spectrum with extremely slow modes and a large gap from other modes, this mode is the bottleneck process of the whole folding process. It suggests that some eigenvectors with λ_m << 0 disappear quickly. On the other hand, any other eigenvectors with λ_m ∼ 0 will determine the slowest dynamic relaxation processes. The overall folding processing time of the protein is approximated as 1/λ1. We use HP (hydrophobic and polar residues) model in a 2D square lattice. The number of conformations of proteins with L residues is listed in Table 3.3.

To calculate the first few larger nonpositive eigenvalues (or all eigenvalues) of the con-structed matrix above, the implicitly restarted Arnoldi method is employed and imple-mented in this investigation. The implicitly restarted Arnoldi method is known as a direct algorithm for reducing a general matrix into upper Hessenberg form, and it could be more effective than subspace iteration methods for computing the dominant eigenvalues of a large, sparse, real, and asymmetric matrix. An algorithm for implementing the implicitly restarted Arnoldi method is shown below.

Arnoldi Algorithm:

Make an Arnoldi factorization AV_m = V_mH_m+ f_me^T_m While (converge)

1. Compute the eigenvalues λ_j, j = 1, 2, ..., m;

2. Sorting eigenvalues into a wanted set λ_j, j = 1, 2, ..., k and unwanted set λ_j, j = k + 1, k + 2, ..., m;

3. Perform m − k = p steps of the QR iteration with the unwanted eigenvalues shift to obtain H_mQ_m = Q_mH⁺_m;

4. AV_mQ_k = V_mQ_kH⁺_k+f_k⁺e^T_k, where H⁺_k is the leading principle submatrix of order k for H⁺_m

5. Extend length k Arnoldi factorization to a length m factorization.

Parallelization scheme used in this work is to partition the operations of the matrix. For the Arnoldi factorization

AV_m = V_mH_m+ f_me^T_m, (3.17)

where V_mis the set of Arnoldi vectors, H_mis the upper Hessenberg matrix, and f_m is the residual vector, it replicates Hm on every processor. Vm and fm are partitioned by rows and are distributed to every processor in the cluster.

Figure 3.14 is a computational flowchart for the proposed parallel procedure of the so-lution method. The communication between processors is based on the MPI, where the initialization of the MPI environment has to be carried out firstly. We then perform an Arnoldi factorization, and replicate matrix H on each node. The next step is to parti-tion the calculaparti-tion of V matrix by the available number of processors, and distributes it to each node. Each processor performs the respective matrix operation, exchanges data to other processors. With this approach there are two communication points within the construction of the Arnoldi factorization: the computation of the 2-norm of the distrib-uted vector f_m and the orthogonalization of f_m to V_m using classical Gram-Schmidt with DGKS correction. Typically the partitioning of V is comparable with the parallel decom-position of the matrix A. For n-node PCs cluster, the V matrix is partitioned by blocks V^T = (V^(1)T, V^(2)T, ..., V^(n)T) with one block per processor and with H replicated on

Figure 3.14: A computational flowchart of the implemented parallel method.

each node. Since H is replicated on each processor, all operations on the matrix H are replicated on each node and there is no communication overhead. We note that the parallel technique is that the partition of operations on the matrix can accelerate the calculations and reduce communications among processors.

To verify the stability and correctness of the parallel computational method, we test the sequence 0011101011 with 10 residues which has unique conformation in the ground state. To verify the correctness of the parallel solution technique, we plot the distributions

Table 3.4: The achieved parallel speedup and efficiency of the parallel computing algorithm, where the tested case is a 17-mer with only hydrophobic residues.

Processors Simulation Time (sec.) Speedup Efficiency

1 34704 – –

of all computed eigenvalues, shown in Figure 3.15, from a single processor PC and 32-node PC-based Linux cluster. Obviously, the computed eigenvalues are totally the same in a single PC and 32-node PC-based cluster. We also perform other verifications with larger sequences and have similar results. This indicates that the implemented parallel method provides a computationally efficient way to accurately calculate the eigenvalues in studying dynamics of protein folding. Figure 3.16 shows the maximum norm error of the smallest and fiftieth eigenvalue versus the number of iterations in a single PC and 32-node PC-based cluster. The convergence of the smallest eigenvalue is much faster than that of the fiftieth eigenvalue. It is also found that our parallel method has a similar behavior of convergence in 32-nodes cluster compared with the behavior in a single PC.

Table 3.4 summarizes the parallel speedup and efficiency of the implemented parallel technique. The tested cases is a 17-mer with only hydrophobic residues, and we compute

Figure 3.15: The distribution of all computed eigenvalues for the sequence 0011101011 (L = 10) in (a) a single processor (b) and 32-node PC-based Linux cluster.

the first 50 eigenvalues of the corresponding transition matrix. Figure 3.17 is the computed eigenvalues of the transition matrix of a 17-mer with only hydrophobic residues. The di-mension of the transition matrix is 2155667 by 2155667, and 50 eigenvalues have been computed. We find that the number of processors increases, and the efficiency decreases, as shown in Tab. 3.4. However, a 23-times speedup is maintained and the efficiency is over 72% on the 32-nodes cluster for the 17-mer case. For the same tested case, as shown in Fig.

Figure 3.16: The maximum norm error of the (a) first and (b) the 50^th eigenvalues versus the number of iterations.

3.18, the variation of maximum difference is within 8% when the number of processors is increased from 2 to 32. The maximum difference is defined as the maximum difference of the code execution time divided by the maximum execution time. Figure 3.19 shows the achieved parallel efficiency versus the matrix size for different number of processors. The parallel efficiency can hold over 70% for parallelization on 2, 4 and 8 processors, which is almost independent of matrix size. It even achieves 90 % efficiency on 2 processors for all matrix sizes. The parallelization on 16 and 32 processors shows the efficiency in-creases when the matrix size increase. However, the efficiency is degraded for the cases

Figure 3.17: The eigenvalues of the transition matrix of a 17-mer with only hydrophobic residues. The dimension of the

transition matrix is 2155667 by 2155667, and 50 eigenvalues have been computed.

Figure 3.18: The achieved load balancing versus the number of processors for the tested case of a 17-mer with only hydrophobic residues.

Figure 3.19: The achieved parallel efficiency versus the matrix size for different number of processors.

Figure 3.20: The achieved load balancing versus the matrix size for different number of processors.

with small matrix size due to the increased data communication time among the processors.

According to the results, when the matrix size is less than 10⁵, it is suitable to perform par-allelization on the small number of processors (e.g., < 8) for maintaining optimal speedup

在文檔中統合性的最佳化理論與架構在電子設計自動化應用之研究 (頁 60-84)