www.elsevier.com/locate/cpc
A new parallel adaptive finite volume method for the numerical
simulation of semiconductor devices
Yiming Li a,∗, Jinn-Liang Liu b, Tien-Sheng Chao c, S.M. Sze a
a Department of Electronics Engineering and Institute of Electronics, National Chiao Tung University, Hsinchu 300, Taiwan
b Department of Applied Mathematics, National Chiao Tung University, Hsinchu 300, Taiwan
c National Nano Device Laboratories, Hsinchu 300, Taiwan
Abstract
Based on adaptive finite volume approximation, a posteriori error estimation, and monotone iteration, a novel system is proposed for parallel simulation of semiconductor devices. The system uses two distinct parallel algorithms to perform a complete set of I–V simulations for a given device model. The first is a domain decomposition on 1-irregular unstructured meshes; the second is a parallelization over multiple I–V points. Implemented on a Linux cluster using message passing interface (MPI) libraries, both algorithms are shown to achieve good dynamic load balancing and hence efficient speedup. Computational results for sub-micron MOSFET devices are compared with measurement data to demonstrate the accuracy and efficiency of the system. © 2001 Elsevier Science B.V. All rights reserved.
PACS: 73.40.Ty; 73.40.Qv; 02.70.Fj; 02.70.-c
Keywords: Adaptive FVM; Parallel semiconductor device simulation; Load balancing
1. Introduction
Parallel numerical simulation has proven to be an indispensable tool for fast characterization and optimal design of semiconductor devices (see [1] and references therein). Adaptive computation is currently one of the major concepts in large-scale simulations [2]. Considerable effort has been directed to the development of high-performance computational techniques for semiconductor physics and devices. We propose here a prototype parallel system for semiconductor device simulation. The main features of the system are an adaptive finite volume method (FVM) with a 1-irregular mesh refinement strategy, a posteriori error estimation, constructive monotone iteration [3], domain decomposition, and parallel I–V computations. Implemented on a Linux cluster with the message passing interface (MPI), the system has been tested on devices such as PN diodes, MOSFETs, and SOI devices [4].
* Corresponding author. E-mail address: [email protected] (Y. Li).
For most practical semiconductor devices, physical quantities such as the potential and electron density exhibit sharp layers, particularly in the neighborhood of p-n junctions [4]. These layers lead to highly unstructured grids and hence to complex code structure and parallelization. The complexity can be alleviated by exploiting object-oriented programming (OOP) principles [5,6] and by developing suitable parallel algorithms with good dynamic load balancing. Two distinct parallel algorithms are proposed in this paper. They are designed to perform a complete set of I–V simulations for any specific device model. The first algorithm is a domain decomposition on 1-irregular unstructured meshes, whereas the second is a parallelization of multiple I–V points. Simulation results for an LDD N-MOSFET [4] are compared with measurement data to show the accuracy and effectiveness of the system.
0010-4655/01/$ – see front matter © 2001 Elsevier Science B.V. All rights reserved.
2. Semiconductor device model
Hydrodynamic (HD) equations are used to model submicron MOSFET devices in which hot electron and non-local effects are main concerns [4,7]. The following is a commonly used model [8,9]:
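$$\Delta\phi = \frac{q}{\varepsilon_s}\left(n - p + N_A^- - N_D^+\right), \tag{1}$$
$$\frac{1}{q}\nabla\cdot J_n = R(n,p), \tag{2}$$
$$-\frac{1}{q}\nabla\cdot J_p = R(n,p), \tag{3}$$
$$\nabla\cdot S_n = J_n\cdot E - n\,\frac{\omega_n - \omega_0}{\tau_{nw}(T_n)}, \tag{4}$$
$$\nabla\cdot S_p = J_p\cdot E - p\,\frac{\omega_p - \omega_0}{\tau_{pw}(T_p)}, \tag{5}$$
where $\phi$ is the electrostatic potential, $n$ and $p$ the carrier concentrations, $N_A^-$ and $N_D^+$ the ionized doping profiles, $J_n$ and $J_p$ the carrier current densities, $S_n$ and $S_p$ the carrier energy fluxes, $R(n,p)$ the generation-recombination rate, $E = -\nabla\phi$ the electric field, $\omega_n$ and $\omega_p$ the carrier energies, $\tau_{nw}$ and $\tau_{pw}$ the carrier energy relaxation times, and $\omega_0 = \frac{3}{2}k_B T_L$ the thermal equilibrium carrier energy. Explicitly, $J_n$, $J_p$, $S_n$, and $S_p$ are as follows:
$$J_n = -q\mu_n n\nabla\phi + qD_n\nabla n + n\mu_n k_B\nabla T_n, \tag{6}$$
$$J_p = -q\mu_p p\nabla\phi - qD_p\nabla p - p\mu_p k_B\nabla T_p, \tag{7}$$
$$S_n = \frac{J_n}{-q}\,\omega_n + \frac{J_n}{-q}\,k_B T_L + Q_n, \tag{8}$$
$$S_p = \frac{J_p}{q}\,\omega_p + \frac{J_p}{q}\,k_B T_L + Q_p. \tag{9}$$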
Here $\mu_n$ and $\mu_p$ are the carrier mobilities, $D_n$ and $D_p$ the diffusion coefficients, and $Q_n$ and $Q_p$ the heat flows. The model is subject to suitable boundary conditions. In this simulation, based on Fermi–Dirac statistics [4,10,11], the model can also be expressed in terms of quasi-Fermi levels instead of carrier concentrations.
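As a small worked example of the quantities entering Eqs. (4) and (5), the sketch below evaluates the thermal equilibrium carrier energy, three halves of k_B times T_L, and the energy relaxation sink term. The function names, the 300 K lattice temperature, and the sample arguments are illustrative assumptions, not values taken from the simulations reported here.

```python
# Physical constants (exact SI values).
K_B = 1.380649e-23      # Boltzmann constant, J/K
Q = 1.602176634e-19     # elementary charge, C


def thermal_equilibrium_energy(lattice_temp_k):
    """Return omega_0 = (3/2) k_B T_L in joules."""
    return 1.5 * K_B * lattice_temp_k


def energy_relaxation_rate(density, omega, lattice_temp_k, tau_w):
    """Return the sink term n (omega - omega_0) / tau_w of Eqs. (4)-(5)."""
    omega0 = thermal_equilibrium_energy(lattice_temp_k)
    return density * (omega - omega0) / tau_w


# Assumed lattice temperature of 300 K; the result is about 0.0388 eV.
omega0_ev = thermal_equilibrium_energy(300.0) / Q
```

The sink term vanishes when the carrier energy equals the equilibrium value, so no energy is exchanged with the lattice in that case.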
3. Adaptive numerical methods
We now briefly outline the adaptive algorithm and numerical methods that are implemented in our device simulation system.
Adaptive Algorithm.
Step 1. Initialization and initial mesh generation.
Step 2. Construction of the data structure on the current mesh.
Step 3. Outer loop iteration (i.e. Gummel's iteration [3,10]).
    Step 3.1. Inner loop iteration on FV solution of Eq. (1) for $\phi$.
    Step 3.2. Inner loop iteration on FV solution of Eq. (2) for $n$.
    Step 3.3. Inner loop iteration on FV solution of Eq. (3) for $p$.
    Step 3.4. Computation of $J_n$ and $J_p$.
    Step 3.5. Inner loop iteration on FV solution of Eq. (4) for $T_n$.
    Step 3.6. Inner loop iteration on FV solution of Eq. (5) for $T_p$.
Step 4. A posteriori error estimation.
Step 5. Run mesh refinement and go to Step 2 if the stopping criteria are not satisfied.
Step 6. Postprocessing.
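The solve, estimate, and refine cycle of Steps 2 to 5 can be illustrated with a deliberately simplified one-dimensional sketch. It is not the 2D 1-irregular refinement of the system described here. The "solution" is a hypothetical tanh profile standing in for a computed potential with a junction layer, elements are bisected rather than split into four, and the indicator is simply the jump of the profile across an element.

```python
import math


def layered_profile(x, center=0.5, width=0.01):
    """Stand-in for a computed potential with a sharp junction layer."""
    return math.tanh((x - center) / width)


def adapt_mesh(f, nodes, tol, max_passes=30):
    """Bisect every element whose jump |f(b) - f(a)| exceeds tol.

    Mirrors Steps 2-5: evaluate on the current mesh, compute a local
    indicator per element, refine flagged elements, and repeat.
    """
    for _ in range(max_passes):
        values = [f(x) for x in nodes]                       # "solve" (Step 3)
        flagged = {i for i in range(len(nodes) - 1)
                   if abs(values[i + 1] - values[i]) > tol}  # indicator (Step 4)
        if not flagged:                                      # stopping criterion
            break
        refined = []
        for i in range(len(nodes) - 1):
            refined.append(nodes[i])
            if i in flagged:
                refined.append(0.5 * (nodes[i] + nodes[i + 1]))  # bisect (Step 5)
        refined.append(nodes[-1])
        nodes = refined
    return nodes


initial = [i / 4 for i in range(5)]          # 4 uniform elements on [0, 1]
final = adapt_mesh(layered_profile, initial, tol=0.1)
```

In this toy setting the refined nodes cluster around x = 0.5, where the profile changes rapidly, while elements far from the layer keep their initial size.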
A description of the FV approximation and a posteriori error estimation for linear elliptic partial differential equations on unstructured meshes can be found in [5]. For semiconductor device simulation, the error estimation has to be modified to account for the fundamental principle of flux and charge conservation. For each decoupled equation, the FVM results in a system of nonlinear algebraic equations, which is solved by a monotone iterative scheme similar to that of [3]. The refinement process is guided by local error indicators that are based on element-by-element calculations of the maximum gradient of the electrostatic potential $\phi$ and the variation of the current densities $J_n$ and $J_p$.
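To make the idea of a monotone iteration concrete, the following sketch applies a Jacobi-type monotone scheme to a scaled one-dimensional nonlinear Poisson equation, -lam2 * phi'' + 2 sinh(phi) - doping = 0. This is a generic textbook construction under stated assumptions and is not claimed to be the scheme of [3]. The scaling, the parameter `lam2`, the step doping profile, and the charge-neutral Dirichlet values are all illustrative choices.

```python
import math


def monotone_jacobi(doping, lam2, max_iter=20000, tol=1e-12):
    """Solve -lam2*phi'' + 2*sinh(phi) - doping = 0 on a uniform grid of [0, 1].

    Dirichlet values are the charge-neutral potentials asinh(doping/2).
    Starting from a constant upper solution, every sweep lowers (never
    raises) the iterate, which is the monotone property.
    """
    n = len(doping)
    h = 1.0 / (n - 1)
    a = lam2 / h ** 2
    neutral = [math.asinh(d / 2.0) for d in doping]
    upper = max(neutral)                       # constant upper solution
    bound = max(abs(v) for v in neutral)
    gamma = 2.0 * math.cosh(bound)             # bounds the slope of 2*sinh(phi)
    phi = [upper] * n
    phi[0], phi[-1] = neutral[0], neutral[-1]  # impose boundary values
    history = []                               # largest increase per sweep
    for _ in range(max_iter):
        new = phi[:]
        for i in range(1, n - 1):
            f = 2.0 * math.sinh(phi[i]) - doping[i]
            new[i] = (a * (phi[i - 1] + phi[i + 1])
                      + gamma * phi[i] - f) / (2.0 * a + gamma)
        history.append(max(new[i] - phi[i] for i in range(n)))
        change = max(abs(new[i] - phi[i]) for i in range(n))
        phi = new
        if change < tol:
            break
    return phi, history


doping = [-10.0] * 20 + [10.0] * 21            # scaled p-side / n-side doping
phi, history = monotone_jacobi(doping, lam2=0.01)
```

Because the shift `gamma` dominates the derivative of the nonlinearity on the interval between the lower and upper bounds, each sweep is an order-preserving map, so the iterates decrease monotonically from the upper solution regardless of how far the starting guess is from the solution.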
4. Parallel algorithms
Owing to the nature of p-n junctions in semiconductor devices, meshes generated adaptively by the refinement procedure are highly unstructured and consequently lead to complicated data structures. This makes load balancing among multiple processors a difficult task. Exploiting special properties of the monotone iterative method [3], two parallel algorithms are proposed here. The first algorithm is a domain decomposition on 1-irregular unstructured meshes, whereas the second is a parallelization of multiple I–V points. Note that the I–V curve of a submicron device is the main objective of device simulation, and computing it for various parameters generally requires a tremendous amount of computing time. The Linux cluster and its network configuration in this work comprise the cluster nodes, a server, NFS, NIS, UDP, TCP/IP, and an Internet connection. The cluster contains 8 PCs in this study; files are accessed and shared through the network file system (NFS) and the network information system (NIS). The user datagram protocol (UDP), controlled by MPI, is used for fast short-distance communication.
Domain Decomposition Algorithm.
Step 1. Initialize the MPI environment and configuration parameters.
Step 2. Generate a tree data structure of the current mesh.
Step 3. Number all nodes in accordance with refinement levels and identify the critical nodes. Count the total number of regular nodes of the mesh.
Step 4. On the server, uniformly partition the nodes into two categories, one corresponding to the bulk region and the other to the surface region. Dynamically assign the nodes in each category to different processors (clients), on each of which the same iterative solver is installed. The assignment is performed along the x- or y-direction (from left to right and bottom to top) in the 2D device domain. In the neighborhood of the p-n junction, the assignment direction may have to be changed to obtain a better load balance.
Step 5. All processors perform inner and outer iterations in a synchronized way.
Step 6. Computed data needed by neighboring processors are exchanged among the processors via MPI until the stopping criteria of both iterations are satisfied.
Step 7. Each processor computes local error indicators on an element-by-element basis and, provided that the global stopping criteria are not satisfied, subdivides each element that exhibits a large error into four sub-elements.
Step 8. Repeat Steps 2–7 if the refinement process is invoked. Otherwise, stop the iteration and run postprocessing on the final data.
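The node assignment of Step 4 can be pictured with the following minimal sketch, which sorts nodes along one coordinate direction and cuts the sorted list into contiguous strips of nearly equal size. It is an illustration only: the function name `partition_nodes`, the sample grid, and the equal-count criterion are assumptions, and the bulk/surface categories and the dynamic reassignment near the junction are not modeled.

```python
def partition_nodes(nodes, n_procs, axis=0):
    """Split nodes into n_procs contiguous strips of near-equal size.

    nodes : list of (x, y) coordinates
    axis  : 0 sweeps left to right, 1 sweeps bottom to top
    Returns a list of index lists, one per processor.
    """
    order = sorted(range(len(nodes)),
                   key=lambda i: (nodes[i][axis], nodes[i][1 - axis]))
    base, extra = divmod(len(order), n_procs)
    parts, start = [], 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)   # spread the remainder
        parts.append(order[start:start + size])
        start += size
    return parts


# Hypothetical 5 x 4 grid of nodes split among 3 processors along x.
grid = [(float(i), float(j)) for i in range(5) for j in range(4)]
parts = partition_nodes(grid, 3, axis=0)
```

Calling the same function with `axis=1` gives a bottom-to-top sweep, which corresponds to changing the assignment direction as mentioned in Step 4.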
Parallel I–V Algorithm.
Step 1. Initialize the MPI environment and configuration parameters for all processors.
Step 2. For the set of I–V points to be calculated for a given device model, create a queue of jobs with various biasing voltages on the server. Each job represents a complete process of adaptive computations as described above.
Step 3. The server assigns a job to the next available processor until the queue is empty.
Step 4. Each processor performs its own job (an I–V point) independently of the others (other I–V points corresponding to different boundary conditions), owing to the global convergence property of the monotone iterative method [3].
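The job-queue pattern of Steps 2 to 4 can be sketched as below. This is a single-machine illustration under stated assumptions: a thread pool from the Python standard library stands in for the MPI server and clients, and `solve_bias_point` is a hypothetical placeholder that returns a toy curve instead of running an adaptive simulation.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def solve_bias_point(v_gate):
    """Placeholder for one full adaptive simulation at a given bias.

    A toy saturating curve is returned so the sketch runs on its own;
    it is not a device model.
    """
    return v_gate, 1.0e-3 * math.tanh(max(v_gate - 0.5, 0.0))


def run_iv_sweep(bias_points, n_workers):
    """Hand each bias point to the next idle worker and collect results."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() queues all jobs; idle workers pull the next one, and the
        # results come back in the order of bias_points.
        return list(pool.map(solve_bias_point, bias_points))


biases = [0.1 * k for k in range(21)]          # 0.0 V ... 2.0 V
curve = run_iv_sweep(biases, n_workers=4)
```

Since the jobs share no data, no communication is needed between workers while they run, which is why this level of parallelism tends to balance well when there are many more bias points than processors. Threads are used here only to keep the sketch self-contained; they do not reproduce the performance of separate cluster nodes.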
5. Numerical results
We now present numerical results of our device simulation, organized into two examples. The first example shows the effectiveness of the adaptive algorithm and the efficiency of the domain decomposition algorithm. A 0.35 µm LDD N-MOSFET device model with biasing conditions VDS = 2 V and VGS = 2 V is used for this example. The adaptive process begins with an initial mesh of 16 elements on the solution domain and ends with the final mesh shown in Fig. 1. The same adaptive process was performed first on a single processor and then on a Linux cluster of 6 processors. Table 1 compares the performance in CPU time and efficiency. The lower line in Fig. 2 indicates the achieved speedup for a typical mesh with 22,000 nodes on an 8-processor Linux cluster.

Table 1
Efficiency and CPU time for parallel domain decomposition simulation on a 6-processor Linux cluster
Nodes     Sequential time (s)   Parallel time (s)   Speedup   Efficiency
2000      29                    8                   3.62      60.4%
4000      101                   28                  3.61      60.1%
8000      1250                  306                 4.08      68.0%
16,000    5233                  1192                4.39      73.1%
22,000    9878                  2054                4.81      80.2%

Fig. 1. Adaptive final refined mesh for the simulated device.
Fig. 2. Speedup for parallel domain decomposition and I–V points calculation algorithms.
Fig. 3. Measured and simulated device I–V characteristics.

For the same device, the second example presents the performance of the parallel I–V algorithm applied to compute a set of 189 I–V points on the Linux cluster. The speedup factor is larger than 7, as shown in Fig. 2 (middle line). A subset of the simulated I–V points is plotted in Fig. 3 along with data measured on a fabricated device, obtained from National Nano Device Laboratories, Taiwan.
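The speedup and efficiency columns of Table 1 follow from the usual definitions: speedup is the sequential time divided by the parallel time, and efficiency is the speedup divided by the number of processors. The short sketch below recomputes them from the tabulated times. The function and variable names are illustrative.

```python
def speedup_and_efficiency(t_sequential, t_parallel, n_procs):
    """Return (speedup, efficiency) with efficiency as a fraction of ideal."""
    speedup = t_sequential / t_parallel
    return speedup, speedup / n_procs


# (nodes, sequential time in s, parallel time in s) from Table 1, 6 processors.
table1 = [(2000, 29, 8), (4000, 101, 28), (8000, 1250, 306),
          (16000, 5233, 1192), (22000, 9878, 2054)]
rows = [(n,) + speedup_and_efficiency(ts, tp, 6) for n, ts, tp in table1]
```

The recomputed values agree with the tabulated ones up to rounding in the last digit. For the largest mesh, for example, 9878 / 2054 gives a speedup of about 4.81 and an efficiency of about 80%.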
Acknowledgements
This work was supported in part by the National Science Council of Taiwan under contract number NSC-89-2215-E-317-009.
References
[1] N.R. Aluru et al., IEEE Trans. CAD 15 (1996) 1029–1047.
[2] R. Verfürth, A Review of a Posteriori Error Estimation and Adaptive Mesh-Refinement Techniques, Teubner-Wiley, Stuttgart, 1996.
[3] Y. Li et al., in: Proc. IEEE Int. Symp. VLSI-TSA, 1999, pp. 27–30.
[4] S.M. Sze, Physics of Semiconductor Devices, 2nd edn., Wiley-Interscience, New York, 1981.
[5] T. Gallouët et al., SIAM J. Num. Anal. 37 (2000) 1935.
[6] J.-L. Liu et al., Appl. Num. Math. 21 (1996) 439–467.
[7] M. Ieong, T.-W. Tang, IEEE Trans. ED 44 (1997) 2242–2251.
[8] P. Degond et al., SIAM J. Sci. Comp. 22 (2000) 986–1007.
[9] K. Bløtekjer, IEEE Trans. ED 17 (1970) 38–47.
[10] J.W. Jerome, SIAM J. Appl. Math. 45 (1985) 565–590.
[11] J.W. Slotboom, IEEE Trans. ED 20 (1973) 669–679.