
A Parallel Additive Schwarz Preconditioned Jacobi-Davidson Algorithm for Polynomial Eigenvalue Problems in Quantum Dot Simulation

Feng-Nan Hwang^a, Zih-Hao Wei^a, Tsung-Ming Huang^b, Weichung Wang^c,∗

^a Department of Mathematics, National Central University, Jhongli 320, Taiwan
^b Department of Mathematics, National Taiwan Normal University, Taipei 116, Taiwan
^c Department of Mathematics, National Taiwan University, Taipei 106, Taiwan

Abstract

We develop a parallel Jacobi-Davidson approach for finding a partial set of eigenpairs of large sparse polynomial eigenvalue problems with application in quantum dot simulation. A Jacobi-Davidson eigenvalue solver is implemented based on the Portable, Extensible Toolkit for Scientific Computation (PETSc). The eigensolver thus inherits PETSc's wide range of efficient parallel operations, linear solvers, and preconditioning schemes, as well as its ease of use.

The parallel eigenvalue solver is then used to solve higher degree polynomial eigenvalue problems arising in numerical simulations of three-dimensional quantum dots governed by Schrödinger's equation. We find that the parallel restricted additive Schwarz preconditioner in conjunction with a parallel Krylov subspace method (e.g., GMRES) can solve the correction equations, the most costly step in the Jacobi-Davidson algorithm, very efficiently in parallel. The overall performance is also quite satisfactory: we have observed superlinear speedup using up to 320 processors. The parallel eigensolver can find all target interior eigenpairs of a quintic polynomial eigenvalue problem with more than 32 million variables within 12 minutes by using 272 Intel 3.0 GHz processors.

Key words: Parallel computing, restricted additive Schwarz preconditioning, Jacobi-Davidson methods, polynomial eigenvalue problems, Schrödinger's equation, quantum dot simulation.

∗Corresponding author. Tel.: +886-2-3366-2871; Fax: +886-2-2391-4439

Email addresses: hwangf@math.ncu.edu.tw (Feng-Nan Hwang), socrates.wei@gmail.com (Zih-Hao Wei), min@math.ntnu.edu.tw (Tsung-Ming Huang), wwang@math.ntu.edu.tw (Weichung Wang)

Preprint submitted to Elsevier December 4, 2009


1. Introduction

Polynomial eigenvalue problems (PEPs) arise in many scientific and engineering applications. A PEP can be written as

$$A(\lambda)f \equiv \Big(\sum_{i=0}^{\tau} \lambda^i A_i\Big) f = 0, \qquad (1)$$

where (λ, f) is an eigenpair with λ ∈ ℂ and f ∈ ℂ^N, τ is the degree of the matrix polynomial, and A_i ∈ ℝ^{N×N} are the corresponding coefficient matrices.
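For concreteness, the following minimal NumPy sketch evaluates the polynomial residual A(λ)f of (1) for a candidate eigenpair; the function name and the random test data are illustrative and not taken from the paper.

```python
import numpy as np

def pep_residual(As, lam, f):
    """Evaluate A(lambda) f = (sum_i lambda^i A_i) f, cf. Eq. (1).

    As : list [A_0, A_1, ..., A_tau] of (N, N) coefficient matrices
    lam: candidate eigenvalue (possibly complex)
    f  : candidate eigenvector of length N
    """
    r = np.zeros(f.shape[0], dtype=complex)
    for i, Ai in enumerate(As):
        r += (lam ** i) * (Ai @ f)
    return r

# Hypothetical usage on a small random cubic PEP (tau = 3, N = 5).
rng = np.random.default_rng(0)
As = [rng.standard_normal((5, 5)) for _ in range(4)]
print(np.linalg.norm(pep_residual(As, 0.3, rng.standard_normal(5))))
```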

Solving some PEPs can be a computational challenge, especially when the PEPs are very large and only a small number of eigenpairs are of interest. In this article, we develop a parallel Jacobi-Davidson approach for solving PEPs with a particular application in quantum dot (QD) simulation.

A nanoscale semiconductor QD is a particular electronic structure in which the carriers are confined within the dot in three dimensions. A great deal of research related to the laboratory fabrication, theoretical investigation, and real-world applications of QDs has been conducted. Among these studies, numerical simulations based on Schrödinger equations play an important role in exploring the properties of QDs. A computational focus of QD simulations is to solve the eigenvalue problems derived from the discretized Schrödinger equations. Various discretization schemes such as finite differences [27, 34], finite volumes [24], finite elements [30, 49], boundary elements [16], scanning methods [12], and pseudo-spectral methods [15, 33] have been used. Furthermore, various physical models and numerical schemes result in different standard [20, 23], generalized [22], polynomial [20, 53], or rational [48, 50] eigenvalue problems. Such eigenvalue problems are then solved to find approximate energy levels (eigenvalues) and wave or envelope functions (eigenvectors).

Many eigensolvers have been proposed to solve the eigenvalue problems associated with QD simulations [6, 7, 14, 20, 21, 22, 23, 29, 43, 44, 46, 47, 52, 53]. In the literature, some sequential and parallel eigensolvers have been proposed for solving standard or generalized eigenvalue problems. On the other hand, sequential Jacobi-Davidson type methods have also been used to solve PEPs. As an extension of these previous efforts, we aim to develop and implement an efficient parallel Jacobi-Davidson (JD) algorithm for solving PEPs.

The use of JD algorithms to compute interior eigenpairs of PEPs can be justified as follows. To solve a PEP, one can recast the original PEP as a linearized eigenvalue problem, which can then be solved using Arnoldi or Lanczos type methods. From the numerical viewpoint, however, this approach has some potential drawbacks. For example, since the dimension of the enlarged matrix is much larger than that of the original problem, the memory requirement is more demanding and the linearized eigenvalue problem is more ill-conditioned [42]. Furthermore, in order to find selected interior eigenvalues, Arnoldi or Lanczos type methods need to be used in conjunction with a shift-and-invert technique, which requires solving linear systems to a certain degree of accuracy. The computational cost of this type of operation is quite expensive. On the other hand, the use of a JD type algorithm is an efficient alternative. JD was proposed by


Sleijpen and Van der Vorst for solving linear eigenvalue problems [38] and was later extended to higher degree PEPs [37, 51, 53]. The JD algorithm belongs to a class of subspace methods that consists of two key steps. One first extracts an approximate eigenpair from a given search space using the Rayleigh-Ritz procedure. If the approximate eigenpair is not close enough, one enlarges the search space by adding a new basis vector, which is the approximate solution of a large sparse linear system of equations known as the correction equation. The major advantages of the JD algorithm are that it deals directly with PEPs, that the interior spectrum can be found without using any shift-and-invert technique, and that an inexact correction equation solve (very often requiring only a few linear iterations) is sufficient for rapid JD convergence.

Parallelism is a natural choice as a method for accelerating large-scale JD eigensolvers.

Some variants of parallel JD algorithms have been studied for solving linear and quadratic eigenvalue problems. In most existing parallel JD research, the focus has been on the design of appropriate parallel preconditioners for the correction equations, the most expensive part of JD algorithms. For example, Nool and Ploeg [31, 32] first used a shift-and-invert technique to convert a generalized magnetohydrodynamics eigenvalue problem into a standard eigenvalue problem, so that the Alfven spectrum (certain interior eigenvalues) is shifted to the dominant part of the spectrum, and then solved it by a parallel JD method in which the Generalized Minimal Residual Method (GMRES) [36] with an inexact block Cholesky-based preconditioner was used for the correction equation. Bergamaschi et al. [4, 5] reported computational experiences with a parallel approximate inverse (AINV) preconditioned JD method for solving standard symmetric eigenvalue problems in a variety of applications, including 3D porous media flow, the 2D Richards equation, and the 3D Laplacian equation. The parallel efficiency achieved by their parallel JD code ranges from 26% to 62%. Arbenz et al. [1] proposed a hybrid preconditioner combining a hierarchical basis preconditioner and an algebraic multigrid preconditioner for the correction equation in the JD algorithm for solving a symmetric generalized Maxwell eigenvalue problem; their parallel code never fell below 65% parallel efficiency using up to 16 processors. In addition, Gijzen [17] implemented a parallel JD method for a quadratic eigenvalue problem arising in acoustics with damping, in which the GMRES method is used without any preconditioning for the solution of the correction equation, and showed that the method scales almost linearly up to 64 processors. Note that several state-of-the-art parallel eigensolver packages such as PARPACK [26], PLOPEX [25], PRIMME [40, 41], and the Scalable Library for Eigenvalue Problem Computation (SLEPc) [18] are publicly available for standard or generalized eigenvalue problems.

In order to develop an efficient parallel preconditioned Jacobi-Davidson algorithm for solving PEPs arising in QD simulations, we address the following issues:

• Preconditioning for the correction equations. We propose using additive Schwarz type preconditioners to solve the correction equations, which represent the most expensive part of the proposed algorithm. This type of preconditioner is efficient, widely used, and well understood as applied to linear systems arising from partial differential equations, such as symmetric positive definite elliptic problems, non-symmetric and indefinite elliptic problems, parabolic convection-diffusion problems, and hyperbolic problems [39, 45]. However, its use in solving eigenvalue problems for specific applications has not yet been well studied.

• Parallel implementation. We discuss our parallel implementation on a distributed memory parallel computer in terms of a parallel correction equation solver and parallel basic linear algebraic operations. Our implementation is mainly based on the Portable, Extensible Toolkit for Scientific Computation (PETSc) [2] and thus shares all the convenient features provided by PETSc.

• Parameter tuning. We identify tunable parameter combinations for achieving the best performance. The parameters include the overlapping size of the additive Schwarz preconditioner, the solution quality of the subdomain problems, and the stopping criterion of the inner iterations (for the correction equations), and we study scalability with respect to the number of processors and problem sizes.

• Parallel performance. Numerical results exhibit superior performance in speedup and parallel efficiency, since the additive Schwarz preconditioner in conjunction with a Krylov subspace method (e.g., GMRES) improves the overall convergence rate of the proposed algorithm. For example, we have observed superlinear speedup and over 100% parallel efficiency using up to 320 processors.

• Performance on a large-scale problem. Our parallel eigensolver also solves a quintic PEP with more than 32 million variables. The eigensolver computes all five target eigenpairs within 12 minutes by using 272 Intel 3.0 GHz processors.

The remainder of this paper is organized as follows. In Section 2, we describe the mathematical model of QDs and the corresponding PEPs. Section 3 describes the parallel additive Schwarz preconditioned JD algorithm for solving PEPs. Section 4 discusses our implementation of the proposed algorithm using PETSc and SLEPc on a distributed memory parallel computer. Section 5 reports the parallel performance of the proposed algorithm. Concluding remarks are given in Section 6.

2. Mathematical model for quantum dot and its discretization

The governing equation for a three-dimensional QD with a single particle in the conduction band is the time-independent Schrödinger equation

$$-\nabla \cdot \Big( \frac{\hbar^2}{2m(x)} \nabla f \Big) + c(x) f = \lambda f, \qquad (2)$$

where ℏ is the reduced Planck constant; the eigenvalue λ and eigenvector f represent the total electron energy and the corresponding wave function, respectively; m(x) represents the electron effective mass and c(x) is the confinement potential.

The above equation describes hetero-structures such as those shown in Figure 1. At the interface of the hetero-junction, both m(x) and c(x) are discontinuous. In cases where constant effective mass models are considered, m(x) is a piecewise constant function of the space variable x:

$$m(x) = \begin{cases} m_1 & \text{in the dot,} \\ m_2 & \text{in the matrix.} \end{cases}$$

Alternatively, the nonparabolicity of the electron's dispersion relation [11] gives the effective mass as a rational function of the energy. In particular, the effective mass model


Figure 1: Structure schema of a pyramidal and a cylindrical quantum dot. Each of the quantum dots is embedded in a hetero-structure matrix.

becomes

$$\frac{1}{m_\ell(\lambda)} = \frac{P_\ell^2}{\hbar^2} \left( \frac{2}{\lambda + g_\ell - c_\ell} + \frac{1}{\lambda + g_\ell - c_\ell + \delta_\ell} \right), \qquad (3)$$

for ℓ = 1, 2. Here, P_ℓ, g_ℓ, and δ_ℓ are the momentum, main energy gap, and spin-orbit splitting in the ℓth region, respectively. Similarly, the confinement potential c(x) is a piecewise constant function:

$$c(x) = \begin{cases} c_1 & \text{in the dot,} \\ c_2 & \text{in the matrix.} \end{cases}$$
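As an illustration of the nonparabolic model (3), the sketch below evaluates 1/m_ℓ(λ) for the dot and matrix regions of the pyramidal QD using the parameters listed later in Table 3. The function name is ours, and the units are assumed to be normalized so that ℏ = 1, which is our assumption rather than a statement from the paper.

```python
def inv_effective_mass(lam, P, g, c, delta, hbar=1.0):
    """Reciprocal effective mass of Eq. (3):
    1/m(lam) = (P^2 / hbar^2) * ( 2/(lam + g - c) + 1/(lam + g - c + delta) ).
    hbar = 1 is an assumed normalization, not taken from the paper."""
    e = lam + g - c
    return (P ** 2 / hbar ** 2) * (2.0 / e + 1.0 / (e + delta))

# Pyramidal QD parameters from Table 3 (ell = 1: dot, ell = 2: matrix).
dot    = dict(P=0.8503, g=0.420, c=0.000, delta=0.48)
matrix = dict(P=0.8878, g=1.520, c=0.770, delta=0.34)
lam = 0.4  # an energy inside the confinement interval [0, 0.77]
print(1.0 / inv_effective_mass(lam, **dot), 1.0 / inv_effective_mass(lam, **matrix))
```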

We impose the Ben Daniel-Duke interface conditions

$$f\big|_{\partial D_+} = f\big|_{\partial D_-}, \qquad
\frac{\hbar^2}{2 m_2} \frac{\partial f}{\partial n}\Big|_{\partial D_+} = \frac{\hbar^2}{2 m_1} \frac{\partial f}{\partial n}\Big|_{\partial D_-}
\quad \left( \text{or } \frac{\hbar^2}{2 m_2(\lambda)} \frac{\partial f}{\partial n}\Big|_{\partial D_+} = \frac{\hbar^2}{2 m_1(\lambda)} \frac{\partial f}{\partial n}\Big|_{\partial D_-} \right). \qquad (4)$$

By letting D be the QD domain, we use ∂D+ and ∂D− to denote the outward normal derivatives at the interface taken from the matrix side and the dot side, respectively. Additionally, we apply homogeneous Dirichlet conditions on the boundary of the quantum matrix.

This QD model can be derived from the following approximations. We first consider the Hamiltonian acting on the envelope functions for multi-band electrons, in which the envelope function formulation is equivalent to the k · p band structure Kane model [3, 10]. Effective mass theory is then applied to project the multi-band Hamiltonian onto the conduction band, which results in the single-band Hamiltonian eigenvalue problem (2).

Such approximations are desirable because they reduce the intensive computational burden of more complicated yet complete models, such as ab initio simulations of many-electron Schrödinger equations. However, the simplifications also limit the applicability of the model and, consequently, of the numerical methods considered in this article.

A variety of numerical schemes for simulating QDs with different geometries have been developed for cylinders [21, 23, 53], cones [28, 50], pyramids [13, 20, 35, 52], and irregular shapes [22]. Based on the variants of the QD model and numerical schemes,


PEPs with different degrees are derived. Table 1 lists six representative QD eigenvalue problems available in the literature. Numerical comparisons of our proposed eigensolver with other state-of-the-art parallel eigensolvers for the first three linear problems in the table will be reported elsewhere. Here, we focus on the two higher degree PEPs: the quintic and cubic eigenvalue problems, which correspond to the pyramidal and cylindrical QDs, respectively; both problems assume the nonparabolic effective mass model. Alternatively, the QD eigenvalue problem (2) in the rational form considered in [48, 50] can be solved by nonlinear Arnoldi methods, JD methods, and full approximation methods.

Type          QD geometry         Effective mass model   Discretization   Reference(s)
Standard      cylindrical         constant               FVM              [23]
Standard      pyramidal           constant               FVM              [20]
Generalized   irregular           constant               FDM              [22]
Cubic         cylindrical         nonparabolic           FDM              [53]
Quintic       pyramidal           nonparabolic           FVM              [20]
Rational      conical/pyramidal   nonparabolic           FEM/FDM          [28, 48, 50]

Table 1: Some representative QD eigenvalue problems available in the literature. FDM: finite difference methods; FVM: finite volume methods; FEM: finite element methods.

For the pyramidal QD case, the width of the QD base is 12.4 nm, the height of the QD is 6.2 nm, and the cuboid matrix of dimension 24.8 × 24.8 × 18.6 nm is uniformly partitioned into (L + 1), (M + 1), and (N + 1) grid points in each direction, so that the grid sizes are ∆x = 24.8/L, ∆y = 24.8/M, and ∆z = 18.6/N. Since homogeneous Dirichlet boundary conditions are imposed, the total number of unknowns, i.e., the dimension of A_0, is (L−1) × (M−1) × (N−1). The resulting quintic PEPs are obtained using a finite volume method. For the cylindrical QD case, the cylindrical QD of diameter 15 nm and height 2.5 nm is placed in a cylindrical matrix of radius 75 nm and height 12.5 nm. The matrix is non-uniformly partitioned into (ρ + 1), η, and (ζ + 1) grid points in the radial, azimuthal, and axial directions, respectively, and the Schrödinger equation is discretized by finite difference schemes. By applying the periodicity in the azimuthal direction, suitable permutations, and a Fourier transformation, η independent 2D cubic PEPs can be decoupled from the 3D eigenvalue problem. As a test case, only the grid associated with azimuthal index 1 is considered. The derivations of these two PEPs are quite tedious; we refer interested readers to the references given in Table 1.
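As a quick check on these dimensions, the short sketch below recovers the matrix sizes of the quintic problems Q1-Q3 listed in Table 2 from the grid parameters; the helper function name is ours.

```python
def quintic_problem_size(L, M, N):
    """Grid spacings and number of unknowns for the pyramidal QD discretization:
    the 24.8 x 24.8 x 18.6 nm cuboid is cut into L, M, N uniform intervals, and the
    homogeneous Dirichlet boundary leaves (L-1)(M-1)(N-1) interior unknowns."""
    spacings = (24.8 / L, 24.8 / M, 18.6 / N)
    return spacings, (L - 1) * (M - 1) * (N - 1)

print(quintic_problem_size(64, 64, 48)[1])     # 186543   (Problem Q1)
print(quintic_problem_size(128, 128, 96)[1])   # 1532255  (Problem Q2)
print(quintic_problem_size(352, 352, 264)[1])  # 32401863 (Problem Q3)
```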

3. A parallel additive Schwarz preconditioned Jacobi-Davidson algorithm

We propose a parallel additive Schwarz preconditioned Jacobi-Davidson algorithm (ASPJD) for solving the PEP (1), as shown in Algorithm 1. The ASPJD algorithm shares a framework similar to that of other polynomial Jacobi-Davidson type algorithms such as [20].

Here, we focus on the efficiency of the algorithm on parallel computers; its parallel implementation will be discussed in the next section. The algorithm contains two loops. In the inner while-loop between lines 4 and 12, we compute the Ritz pair that approximates the jth desired eigenpair of (1). In the outer for-loop between lines 2 and 16, we compute the k desired eigenpairs one by one using a locking technique.


Algorithm 1 Parallel Additive Schwarz Preconditioned Jacobi-Davidson Algorithm (ASPJD) for Polynomial Eigenvalue Problems

Input: coefficient matrices A_i for i = 0, ..., τ; the number of desired eigenvalues k; an initial orthonormal vector V_ini
Output: the desired eigenpairs (λ_j, f_j) for j = 1, ..., k
 1: Set V = [V_ini], V_F = [ ], and Λ = ∅
 2: for j = 1 to k do
 3:   Compute W_i = A_i V and M_i = V*W_i for i = 0, ..., τ
 4:   while (user-defined stopping criteria are not satisfied) do
 5:     Compute the eigenpairs (θ, s) of (Σ_{i=0}^{τ} θ^i M_i) s = 0
 6:     Select the desired eigenpair (θ, s) with ||s||_2 = 1 and θ ∉ Λ
 7:     Compute u = Vs, p = A′(θ)u, r = A(θ)u
 8:     Solve the correction equation
            (I − pu*/(u*p)) A(θ) (I − uu*) t = −r
        approximately for t ⊥ u by a Krylov subspace method with an additive Schwarz type preconditioner
 9:     Orthogonalize t against V; set v = t/||t||_2
10:     Compute w_i = A_i v and update
            M_i ← [ M_i   V*w_i ; v*W_i   v*w_i ]   for i = 0, ..., τ
11:     Expand V = [V, v] and W_i = [W_i, w_i]
12:   end while
13:   Set λ_j = θ, f_j = u, and Λ = Λ ∪ {λ_j}
14:   Perform locking by orthogonalizing f_j against V_F; compute f_j = f_j/||f_j||_2; update V_F = [V_F, f_j]
15:   Choose an orthonormal matrix V_ini ⊥ V_F; set V = [V_F, V_ini]
16: end for


3.1. Inner loop

The inner loop contains the following components:

• Small size projected eigenvalue problems. In the while-loop starting from line 4, we compute an eigenpair (θ, s), with ||s||_2 = 1, of the projected eigenvalue problem (Σ_{i=0}^{τ} θ^i M_i) s = 0 by solving the corresponding linearized projected eigenvalue problem

$$M_A z = \theta M_B z, \qquad (5)$$

where

$$M_A = \begin{bmatrix} 0 & I & 0 & \cdots & 0 \\ 0 & 0 & I & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & I \\ M_0 & M_1 & M_2 & \cdots & M_{\tau-1} \end{bmatrix}, \qquad
M_B = \begin{bmatrix} I & 0 & \cdots & 0 & 0 \\ 0 & I & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & I & 0 \\ 0 & 0 & \cdots & 0 & -M_\tau \end{bmatrix}, \qquad
z = \begin{bmatrix} s \\ \theta s \\ \theta^2 s \\ \vdots \\ \theta^{\tau-1} s \end{bmatrix}.$$

Let Θ be the set of eigenvalues θ of (5). For a given target value µ, which is near the desired eigenvalue, we select one of the eigenvalues, say θ_j, from the set Θ\Λ so that

$$|\mu - \theta_j| = \min_{\theta \in \Theta \setminus \Lambda} |\mu - \theta|. \qquad (6)$$

The pair (θ_j, s_j) is the desired eigenpair selected in line 6 of Algorithm 1. Depending on the application, some additional condition on (6) may need to be imposed; for example, we typically select µ = 0 and require θ to be positive in our QD simulations. Note that the dimension of V^T A(θ)V is usually small and not larger than a user-defined restart number. A sequential sketch of this projected solve appears at the end of this subsection.

• Restarting. To avoid loss of numerical orthogonality in V and to keep the generalized eigenvalue problem (5) at a manageable size, we restart the eigenpair search and choose a new orthogonal V after a certain number of inner iterations.

• Correction vector. The main purpose of the inner loop is to keep searching for a Ritz pair with small residual over the space V that is gradually expanded. As shown in lines 7 and 8 of Algorithm 1, we compute

$$p = A'(\theta) u, \qquad (7)$$

where $A'(\theta) = \sum_{i=1}^{\tau} i\,\theta^{i-1} A_i$, and then solve the correction equation

$$\Big( I - \frac{p u^*}{u^* p} \Big) A(\theta) (I - u u^*)\, t = -r \qquad (8)$$


for a correction vector t in each inner while-loop iteration, and then append the normalized correction vector to V. The derivation of Eq. (8) is based on a Taylor expansion and the fact that we can find the orthogonal complement t ⊥ u that corrects the error of the current approximation u such that

$$A(\lambda)(u + t) = 0. \qquad (9)$$

See [19] for a detailed explanation regarding the correction vector.

Here, the correction equation (8) is solved approximately by a Krylov subspace method (e.g., GMRES) with the preconditioning operator B_d^{-1}, where

$$B_d = \Big( I - \frac{p u^*}{u^* p} \Big) B (I - u u^*) \approx \Big( I - \frac{p u^*}{u^* p} \Big) A(\theta) (I - u u^*) \qquad (10)$$

and B is an approximation of A(θ). In practice, there is no need to explicitly form B_d in order to solve B_d z = y with z ⊥ u for a given y, as this can be done equivalently by computing

$$z = B^{-1} y - \eta B^{-1} p, \qquad \text{with} \quad \eta = \frac{u^* B^{-1} y}{u^* B^{-1} p}. \qquad (11)$$

Note that the preconditioning operation B^{-1}p and the inner product u*B^{-1}p need to be computed only once for each correction equation; there is no need to recompute them within the Krylov subspace iteration. Consequently, the Krylov subspace iteration in line 8 only requires preconditioning operations of the form B^{-1}y.

While some sequential preconditioners such as SSOR were studied in [20, 21], we next consider a parallel domain-decomposed preconditioner B^{-1} based on an additive Schwarz framework.

• Construction of the additive Schwarz preconditioner B^{-1}. To define parallel Schwarz-type preconditioners, we begin by partitioning the computational quantum dot domain D into N_s disjoint subdomains {D_i^0, i = 1, ..., N_s}, whose union covers the entire domain. Overlapping subdomains are then obtained by expanding each subdomain D_i^0 to a larger subdomain D_i^δ with boundary ∂D_i^δ. Here δ is a nonnegative integer indicating the level of overlap. Let n_i^δ and n be the total numbers of interior grid points on D_i^δ and D, respectively. We define R_i^δ as an n_i^δ × n restriction matrix whose entries are either 0 or 1. The interpolation operator (R_i^δ)^T is then defined as the transpose of R_i^δ. The multiplication of R_i^δ (and (R_i^δ)^T) with a vector does not involve any arithmetic operations, but does involve some communication in a distributed memory implementation: the global-to-local restriction operator R_i^δ collects data from neighboring subdomains, and the local-to-global interpolation operator (R_i^δ)^T sends partial solutions to neighboring subdomains.

Using the restriction and interpolation matrices, we write the one-level restricted additive Schwarz (RAS) preconditioner [9] in matrix form as

$$B^{-1} = \sum_{i=1}^{N_s} (R_i^0)^T B_i^{-1} R_i^\delta, \qquad (12)$$

where B_i^{-1} is the subspace inverse of B_i and B_i = R_i^δ A(θ) (R_i^δ)^T. Note that when δ = 0, the RAS preconditioner in (12) reduces to the block Jacobi preconditioner. Although the theoretical analysis of the convergence of RAS is still incomplete, much numerical evidence suggests that using RAS to solve a variety of linear systems not only halves the communication needed by the classical additive Schwarz preconditioner, but also accelerates the convergence of iterative methods.
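The following dense NumPy/SciPy sketch illustrates lines 5-6 of Algorithm 1: it assembles the linearization (5) of the projected PEP and applies the selection rule (6). It is a sequential illustration only (the actual solver calls LAPACK's QZ routine redundantly on every processor, as described in Section 4), and the function name is ours.

```python
import numpy as np
from scipy.linalg import eig

def solve_projected_pep(Ms, mu, locked):
    """Solve (sum_i theta^i M_i) s = 0 via the linearization (5) and return the
    Ritz pair whose value is closest to the target mu and not yet locked, Eq. (6).
    Ms: list [M_0, ..., M_tau] of small projected matrices (each m x m)."""
    tau, m = len(Ms) - 1, Ms[0].shape[0]
    MA = np.zeros((tau * m, tau * m), dtype=complex)
    MB = np.eye(tau * m, dtype=complex)
    for k in range(tau - 1):                       # identity blocks above the diagonal
        MA[k * m:(k + 1) * m, (k + 1) * m:(k + 2) * m] = np.eye(m)
    for i in range(tau):                           # last block row: [M_0 ... M_{tau-1}]
        MA[(tau - 1) * m:, i * m:(i + 1) * m] = Ms[i]
    MB[(tau - 1) * m:, (tau - 1) * m:] = -Ms[tau]  # last diagonal block of M_B is -M_tau

    thetas, Z = eig(MA, MB)                        # generalized eigenproblem (5)
    candidates = [j for j in range(len(thetas)) if np.isfinite(thetas[j])
                  and all(abs(thetas[j] - lam) > 1e-10 for lam in locked)]
    j = min(candidates, key=lambda j: abs(mu - thetas[j]))   # selection rule (6)
    s = Z[:m, j]                                   # first block of z = [s; theta s; ...]
    return thetas[j], s / np.linalg.norm(s)
```

The second sketch is a sequential, algebraic analogue of the RAS preconditioner (12) with inexact subdomain solves (13): subdomains are contiguous index blocks, SciPy's `spilu` stands in for the ILU subdomain solver, and δ = 0 reproduces the block Jacobi preconditioner. The class name and the contiguous partitioning are our simplifications; the parallel solver instead relies on PETSc's PCASM (Section 4).

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

class RASPreconditioner:
    """One-level restricted additive Schwarz, Eq. (12):
    B^{-1} w = sum_i (R_i^0)^T B_i^{-1} R_i^delta w,  B_i = R_i^delta A(theta) (R_i^delta)^T."""
    def __init__(self, A, n_sub, delta=1, fill_factor=2.0):
        n = A.shape[0]
        cuts = np.linspace(0, n, n_sub + 1, dtype=int)
        self.owned = [np.arange(cuts[i], cuts[i + 1]) for i in range(n_sub)]
        self.over = [np.arange(max(0, cuts[i] - delta), min(n, cuts[i + 1] + delta))
                     for i in range(n_sub)]
        A = sp.csr_matrix(A)
        # Subdomain problems (13) are solved inexactly; spilu plays the role of ILU.
        self.solvers = [spla.spilu(A[idx][:, idx].tocsc(), fill_factor=fill_factor)
                        for idx in self.over]

    def apply(self, w):
        v = np.zeros_like(w)
        for owned, over, lu in zip(self.owned, self.over, self.solvers):
            vi = lu.solve(w[over])          # B_i^{-1} R_i^delta w
            keep = np.isin(over, owned)     # restricted interpolation (R_i^0)^T:
            v[owned] = vi[keep]             # values in the overlap region are discarded
        return v
```

For instance, `RASPreconditioner(A_theta, n_sub=64, delta=0)` (with a hypothetical assembled matrix `A_theta`) would mimic the RAS(0), i.e., block Jacobi, setting recommended for the quintic pyramidal problems in Section 5, with one subdomain per processor.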

3.2. Outer loop

In the outer loop, k desired eigenpairs are computed successively. To compute the (j + 1)th eigenpair, we apply the following locking technique with a suitably chosen orthonormal search space V. This technique is incorporated in Algorithm 1 as follows:

• In line 13, we take a Ritz pair (θ_j, f_j), where f_j = V s_j, with small residual ||A(θ_j) f_j||_2 as the jth approximate eigenpair and append θ_j to Λ.

• In line 14, the approximate eigenvector f_j is normalized and then appended to V_F. Consequently, the columns of V_F form an orthonormal basis of the eigenspace spanned by f_1, ..., f_j.

• In line 15, we choose an orthonormal matrix V_ini ⊥ V_F and set V = [V_F, V_ini] such that V^T V = I.

In so doing, the first j desired eigenpairs will be included (or locked) among the eigenpairs of the projected eigenvalue problem in line 3 of Algorithm 1. On the other hand, these j eigenvalues will not be chosen in line 6, as we search for the (j+1)th eigenvector over the space that is orthogonal to the V defined in line 15 of the jth iteration.

4. Detailed parallel implementations using PETSc and SLEPc

In this section, we describe the parallel implementation of the ASPJD algorithm based on two powerful scientific software libraries, namely PETSc [2] and SLEPc [18].

As shown in Figure 2, the design of PETSc adopts the principle of software layering. As an application code built on PETSc, the major component of our ASPJD eigensolver, the JD object, is built on top of KSP, the linear equation solver component. All PETSc libraries are based on the Message Passing Interface (MPI) and two linear algebra libraries: the Basic Linear Algebra Subprograms (BLAS) and the Linear Algebra Package (LAPACK).

The vector (Vec) and matrix (Mat) are two basic objects in PETSc. The eigenvectors and other working vectors are created as parallel vectors through the Vec object. The column vectors of V and W_i in lines 3 and 11 of Algorithm 1 are stored as arrays of parallel vectors. The coefficient matrices A_i and the matrix A(θ) are created in a parallel sparse matrix format. We explicitly form A(θ) using parallel matrix-matrix addition; it is used in the construction of the preconditioner.
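For illustration, a sequential SciPy analogue of this assembly is given below; the real code performs the equivalent operation with PETSc's parallel sparse matrices, and the function name here is ours.

```python
import scipy.sparse as sp

def assemble_A_theta(As, theta):
    """Form A(theta) = sum_i theta^i A_i by sparse matrix addition; the assembled
    matrix is what the additive Schwarz preconditioner of Section 3 is built from."""
    A = sp.csr_matrix(As[0].shape, dtype=complex)
    for i, Ai in enumerate(As):
        A = A + (theta ** i) * sp.csr_matrix(Ai)
    return A
```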

We now discuss the parallel implementation of the ASPJD algorithm in detail, classifying the algorithm into three main parts: (1) the linearized projected eigenvalue solve, (2) the correction equation solve, and (3) a sequence of basic linear algebraic operations.


Figure 2: The organization of PETSc, SLEPc, and the ASPJD eigensolver.

• Redundant linearized projected eigenvalue solve. As mentioned in the previous section, the linearized projected eigenvalue problem (5) is typically quite small. Solving such small problems in parallel, especially when large numbers of processors are used, may incur a large communication cost relative to the floating-point work. Therefore, we adopt another approach: on each processor, the sequential QZ routine ZGGEVX in LAPACK is employed to redundantly solve the same linearized projected eigenvalue problem, M_A z = θ M_B z, through an interface provided by SLEPc [18]. Here, the matrices M_A and M_B, as well as the M_i, are stored in a sequential dense matrix format and their sizes increase as ASPJD iterates. The additional blocks V*w_i (or v*W_i) and v*w_i to be inserted into the new M_i are computed in parallel; the results are then distributed to each processor via the multiple-vector and single inner products, VecMDot and VecDot, respectively.

• Parallel correction equation solve. We use a Krylov subspace method (e.g., GMRES) in conjunction with the preconditioner (10) and the RAS preconditioner B^{-1} in (12) to approximately solve the correction equation (8). We implement the parallel correction equation solve mainly by using the following three PETSc objects. First, a Krylov subspace method, such as parallel GMRES, can be chosen from the KSP object at runtime. Second, the preconditioning operation defined in (10) is implemented as a PETSc user-defined preconditioner, named PCSHELL, as described in Algorithm 2. Third, the RAS preconditioner B^{-1}, as described in (12), is a built-in PC object of PETSc named PCASM.


We remark that the multiplication of B^{-1} with a given vector, v = B^{-1} w, consists of three steps. On each subdomain we (i) collect the data from the subdomain and its neighboring subdomains, w_i = R_i^δ w; (ii) solve the subdomain problem

$$B_i v_i = w_i; \qquad (13)$$

and (iii) compute the sum $v = \sum_{i=1}^{N_s} (R_i^0)^T v_i$, ignoring the computed values in the overlapping region. Furthermore, to save computational cost and memory in practice, the local subdomain problems (13) are solved by an inexact solver, incomplete LU (ILU). The ILU solver performs an incomplete LU decomposition and then a forward and backward substitution with the incomplete factors.

Algorithm 2 Preconditioning operation: B_d z = y

Input: y, z_2 = B^{-1} p, and µ_2 = u* z_2
Output: z
1: Apply the preconditioner: z_1 = B^{-1} y
2: Compute the vector dot product: µ_1 = u* z_1
3: Compute η = −µ_1/µ_2
4: Compute z = z_1 + η z_2
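A sequential sketch of this preconditioning operation, wrapped as a SciPy LinearOperator so that it could be handed to a Krylov solver in place of PETSc's KSP/PCSHELL machinery, is shown below; all names are ours, and `apply_Binv` is any routine applying B^{-1} (for example, the RAS sketch of Section 3).

```python
import numpy as np
import scipy.sparse.linalg as spla

def correction_preconditioner(apply_Binv, u, p):
    """Return a LinearOperator implementing Algorithm 2 / Eq. (11):
    z = B^{-1} y - eta B^{-1} p with eta = (u* B^{-1} y)/(u* B^{-1} p).
    z2 and mu2 are computed once per correction equation, as noted in the text."""
    z2 = apply_Binv(p)
    mu2 = np.vdot(u, z2)

    def apply(y):
        z1 = apply_Binv(y)
        eta = -np.vdot(u, z1) / mu2   # line 3 of Algorithm 2
        return z1 + eta * z2          # line 4 of Algorithm 2

    n = u.shape[0]
    return spla.LinearOperator((n, n), matvec=apply, dtype=complex)
```

In a fuller sketch, the correction equation operator (I − pu*/(u*p))A(θ)(I − uu*) would also be wrapped as a LinearOperator and passed to `scipy.sparse.linalg.gmres` with `M` set to this preconditioner and a fixed small iteration budget (e.g., `restart=10, maxiter=1`), mirroring the GMRES(10) setting used in Section 5.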

• Parallel basic linear algebraic operations. PETSc provides numerous Vec and Mat commands for performing basic vector and matrix operations in parallel. Some of these are used in the implementation of the ASPJD eigensolver, as follows: (i) matrix-vector products (MatMult): W_i = A_i V (line 3), r = A(θ)u (line 7), and w_i = A_i v (line 10) in Algorithm 1; (ii) dot products (VecDot, or VecMDot for the multiple-vector case): M_i = V*W_i (line 3), v*W_i, V*w_i, and v*w_i (line 10) in Algorithm 1, and µ_1 = u*z_1 (line 2 of Algorithm 2); (iii) vector updates (VecWAXPY, or VecMAXPY for the multiple-vector case): u = Vs and p = A′(θ)u after performing the matrix-vector products A_i u in line 7 of Algorithm 1, and z = z_1 + ηz_2 (line 4 of Algorithm 2). In addition, IPOrthogonalize in SLEPc is used to perform the orthogonalization in line 9 of Algorithm 1.

5. Numerical results

To validate our parallel ASPJD eigensolver and to evaluate the parallel performance of our proposed algorithm, we consider the test cases listed in Table 2. These cases arise in the discretization of two QD models described in Section 2 by using different grid sizes. Physical parameters used in the numerical simulations are summarized in Table 3.

The ASPJD eigensolver has been implemented in C using several components of PETSc and SLEPc. The Intel C compiler version 9.1 with the -O2 optimization flag is used to compile the code. The numerical experiments were performed on the Vger cluster at the National Central University in Taiwan.

The Vger consists of 108 computing nodes; each node has two Intel Xeon 3.0GHz Dual-Core processors with 4 GB of memory. The computing nodes are interconnected by InfiniBand switches of 2 GB/s bandwidth with a peak performance rate of 5184 Gflop/s.

All computations were done in double precision complex arithmetic.


Problem   PEP type   QD model      Grid (L, M, N)/(ρ, ζ)        Matrix size
Q1        Quintic    pyramidal     uniform (64, 64, 48)         186,543
Q2        Quintic    pyramidal     uniform (128, 128, 96)       1,532,255
Q3        Quintic    pyramidal     uniform (352, 352, 264)      32,401,863
C1        Cubic      cylindrical   non-uniform (535, 190)       101,650
C2        Cubic      cylindrical   non-uniform (1720, 585)      1,006,200
C3        Cubic      cylindrical   uniform (1742, 579)          1,008,618

Table 2: Test problems and their statistics. The notations L, M, and N represent the number of grid points in the x-, y-, and z-coordinates, respectively. The notations ρ and ζ represent the number of grid points in the radial and height coordinates, respectively.

QD model      Hetero-structure    c_ℓ     g_ℓ     δ_ℓ    P_ℓ
Pyramidal     Dot (ℓ = 1)         0.000   0.420   0.48   0.8503
              Matrix (ℓ = 2)      0.770   1.520   0.34   0.8878
Cylindrical   Dot (ℓ = 1)         0.000   0.235   0.81   0.2875
              Matrix (ℓ = 2)      0.350   1.590   0.80   0.1993

Table 3: Physical parameters used for the non-parabolic effective mass models described in (3). These parameters are taken from [20, 53].

The execution timings are reported in seconds and exclude the time spent on coefficient matrix loading and construction of the working vectors.

We declare that the JD iterations (i.e., the while-loop starting from line 4 of Algorithm 1) have converged to an eigenpair if the absolute or relative residual of equation (1) is less than 10^{-10}. The vector V_ini = (1, 1, ..., 1)^T is used as the initial search space. The maximum number of ASPJD restarts is set to 30.

A left restricted additive Schwarz preconditioned GMRES with a zero initial guess is used to solve the correction equation (8). In what follows, "GMRES iterations" refers to the iterations used by GMRES to solve the correction equation. The subdomain problems (13) are solved by an incomplete LU decomposition. For simplicity of implementation, the actual partitioning and overlapping size are determined internally by PETSc. Except when explicitly stated otherwise, the target eigenvalue is the smallest positive eigenvalue.

        Computed eigenvalues    Reference values     Relative differences
Problem Q2
λ1      0.4164706472791792      0.416470647281298    2.11880e-12
λ2      0.5992679796464911      0.599267979636767    9.72411e-12
λ3      0.5992679796452610      0.599267979636393    8.86801e-12
λ4      0.7179076438816558      0.717907643900873    1.92171e-11
λ5      0.7296997979449217      0.729699797946620    1.69830e-12
λ6      0.7924924167774992      0.792492416781907    4.40780e-12
Problem C2
λ1      0.087344809377190       0.087344809345140    3.20501e-11
λ2      0.150294727564833       0.150294727561288    3.54500e-12
λ3      0.245994432693207       0.245994432699191    5.98435e-12
λ4      0.330502438790559       0.330502438793776    3.21687e-12
λ5      0.350930026609811       0.350930026592148    1.76629e-11

Table 4: Comparison of a few computed and reference eigenvalues of Problems Q2 and C2. Note that only the real parts of the eigenvalues are shown in this table and that the imaginary parts of the eigenvalues are all less than 1.0e-12.


5.1. Parallel code validation

To check the correctness of our parallel implementation, we compute the target positive eigenvalues of Problems Q2 and C2 using 64 processors. In these cases, the target eigenvalues for the pyramidal and cylindrical QD models lie within the confinement potential intervals I1 ≡ [0, 0.77] and I2 ≡ [0, 0.35], respectively. For both cases, we use a fixed 10-step block Jacobi preconditioned GMRES, and ILU with zero level of fill-in is selected as the subdomain solver. As presented in Table 4, our computed results are compared with reference results obtained by a sequential Fortran JD code [20, 53]. The table shows that the relative differences in 2-norm are all less than 10^{-10}. Note that the sixth positive eigenvalue of Problem Q2 and the fifth positive eigenvalue of Problem C2 listed in the table lie beyond the intervals I1 and I2.

5.2. Algorithmic parameter tuning for the parallel ASPJD solver

We study how the following factors affect the number of JD iterations and the overall execution time:

1. The overlapping size: Let RAS(δ) denote the restricted additive Schwarz preconditioner (12) with overlapping size δ = 0, 1, 2, 3.

2. The quality of subdomain solve: Let µ be the level of fill-ins in ILU(µ), which is used to solve the subdomain problems (13). We choose µ = 0, 1, 2, or 3.

3. The stopping strategy for GMRES: Let s be the number of iterations performed by GMRES(s) while solving the correction equation in line 8 of Algorithm 1. We choose s = 10, 15, or 20.

The numbers of JD iterations and execution times are summarized in terms of RAS(δ), ILU(µ), and GMRES(s) in Table 5. Here, we compute only the first positive eigenpair. In the table, we also include results obtained by setting the GMRES relative residual reduction tolerance to gtol = 10^{-3}. Additionally, the numbers of JD iterations and execution times for GMRES(s) with respect to the number of processors are summarized in Tables 6 and 7. We highlight our observations regarding parameter tuning as follows:

Avoiding over-solving of the correction equations. One advantage of the ASPJD algorithm is that convergence can be achieved by solving the correction equations with only modest accuracy, usually without degrading the overall performance. Such behavior has been observed for linear eigenvalue problems [38]; we present similar observations for high degree PEPs here.

Table 5 suggests that, with gtol = 10^{-3} (a tolerance commonly used in inner-outer iterative algorithms such as inexact Newton methods), the correction equations are solved to a precision beyond what is needed. Although the ASPJD algorithm then takes fewer outer iterations in general, the corresponding numbers of GMRES iterations are excessively large. Consequently, all timing results are much slower than those observed using fixed-step GMRES.

Note that the actual relative residual reductions corresponding to GMRES(10), GMRES(15), and GMRES(20) for Problem Q2 (C2) are roughly equal to 0.2870, 0.2620, and 0.1912 (0.3254, 0.2715, and 0.2306), respectively.


(a) Problem Q2

GMRES(s)          ILU(µ)   RAS(0)          RAS(1)          RAS(2)
GMRES with        ILU(0)   12 (31.9); 87   12 (45.3); 86   13 (63.9); 80
gtol = 10^-3      ILU(1)   12 (34.3); 86   14 (62.3); 84   12 (67.3); 74
                  ILU(2)   11 (35.8); 88   11 (54.6); 76   19 (142.6); 77
                  ILU(3)   12 (42.1); 86   12 (77.5); 72   15 (159.3); 76
GMRES(10)         ILU(0)   21 (7.6)        25 (16.9)       28 (26.6)
                  ILU(1)   20 (7.8)        28 (21.8)       16 (16.7)
                  ILU(2)   19 (8.4)        19 (17.5)       13 (17.5)
                  ILU(3)   19 (9.9)        18 (23.4)       13 (25.8)
GMRES(15)         ILU(0)   22 (11.7)       13 (11.1)       13 (15.3)
                  ILU(1)   26 (14.7)       12 (11.3)       14 (19.4)
                  ILU(2)   21 (12.6)       12 (14.4)       16 (28.6)
                  ILU(3)   24 (18.0)       16 (26.7)       17 (44.9)
GMRES(20)         ILU(0)   19 (12.3)       15 (17.0)       13 (19.2)
                  ILU(1)   15 (10.2)       14 (16.8)       13† (22.7)
                  ILU(2)   15 (11.9)       12 (18.2)       15 (33.9)
                  ILU(3)   14 (12.6)       12 (24.9)       14 (46.1)

(b) Problem C2

GMRES(s)          ILU(µ)   RAS(0)           RAS(1)           RAS(2)
GMRES with        ILU(0)   11 (15.0); 100   11 (17.3); 100   11 (18.9); 100
gtol = 10^-3      ILU(1)   12 (17.4); 100   12 (20.6); 100   16 (29.8); 100
                  ILU(2)   11 (18.2); 100   14 (27.7); 100   12 (26.0); 100
                  ILU(3)   11 (18.6); 100   15 (30.4); 100   15 (36.6); 100
GMRES(10)         ILU(0)   67 (11.1)        58 (12.7)        57 (18.2)
                  ILU(1)   53 (9.3)         41 (9.1)         38 (9.4)
                  ILU(2)   48 (9.0)         36 (8.6)         30 (8.1)
                  ILU(3)   44 (9.1)         29 (7.9)         27 (8.0)
GMRES(15)         ILU(0)   44 (9.6)         41 (11.5)        41 (12.5)
                  ILU(1)   33 (8.0)         27 (8.1)         26 (8.5)
                  ILU(2)   30 (7.8)         23 (7.2)         22 (7.6)
                  ILU(3)   28 (8.0)         22 (7.7)         19 (7.4)
GMRES(20)         ILU(0)   31 (9.9)         27 (9.9)         27 (10.7)
                  ILU(1)   25 (7.9)         21 (8.0)         20 (8.3)
                  ILU(2)   24 (8.0)         19 (7.7)         18 (8.0)
                  ILU(3)   25 (9.3)         17 (7.6)         27 (13.9)

Table 5: JD iterations and execution time in seconds (shown within parentheses) for Problems Q2 and C2. Various combinations of RAS(δ), ILU(µ), and GMRES(s) are applied. The number of processors is 64. For the rows with gtol = 10^{-3}, the number after the semicolon is the average number of GMRES iterations. The mark † indicates that the eigensolver converges to an undesired eigenpair.


       Problem Q1                                       Problem Q2
np     s = 5       s = 10      s = 15      s = 20       s = 5       s = 10      s = 15      s = 20
4      27 (13.9)   16 (10.2)   16 (13.9)   13 (14.7)    -           -           -           -
8      28 (6.9)    16 (4.6)    21 (8.6)    17 (8.9)     28 (89.9)   23 (96.5)   18 (98.4)   11 (75.5)
16     27 (3.3)    16 (2.0)    13 (2.2)    14 (3.1)     30 (38.1)   22 (35.7)   19 (40.5)   14 (37.6)
32     22 (1.4)    17 (1.1)    15 (1.1)    21 (5.4)     33 (19.9)   21 (16.4)   22 (23.4)   13 (17.1)
64     24 (1.1)    22 (0.9)    17 (0.7)    21 (1.2)     38 (11.2)   21 (7.6)    22 (11.7)   19 (12.3)
128    -           -           -           -            41 (5.9)    21 (3.8)    20 (4.5)    17 (6.5)
256    -           -           -           -            41 (3.6)    21 (2.1)    20 (2.5)    17 (2.7)
320    -           -           -           -            41 (3.2)    21 (1.7)    21 (2.2)    17 (2.0)

Table 6: Effects of GMRES(s) on JD iterations and timing (shown within parentheses) for Problems Q1 and Q2. The shortest execution time for each number of processors (np) is shown in boldface. Note that Problem Q1 (with 186,543 unknowns) is too small to be solved efficiently using more than 64 processors, and Problem Q2 (with 1,532,255 unknowns) is too large to be solved using 4 processors due to memory constraints.

Recommended parameter combinations of RAS(δ) and ILU(µ). For Problem Q2, as shown in part (a) of Table 5, we find that there is no obvious benefit from increasing the overlapping size for RAS or using more levels of fill-in in ILU.

The situation for Problem C2, as shown in part (b) of Table 5, is slightly different: an overlapping RAS with a more accurate inexact subdomain solve reduces both the number of JD iterations and the execution time.

Based on these numerical results, we use the following parameters for the rest of the experiments: (i) RAS(0) and ILU(0) for the quintic pyramidal QD problems, and (ii) RAS(1) and ILU(2) for the cubic cylindrical QD problems.

Effects of GMRES(s) on JD iterations and timing. As shown in Table 6 for Problems Q1 and Q2, we observe that GMRES(10) results in the best timing in almost all cases; the two exceptions are np = 64 for Problem Q1 and np = 8 for Problem Q2. Furthermore, the number of JD iterations corresponding to the shortest execution time is nearly independent of np and only mildly dependent on the problem size.

For Problems C1 and C2, Table 7 shows different convergence behaviors. No single value of s achieves the best JD iteration counts and timings overall; roughly speaking, more GMRES iterations are needed to achieve the best timing as np increases. Unlike Problems Q1 and Q2, the numbers of JD iterations corresponding to the shortest execution times for Problems C1 and C2 are more sensitive to np.

Drawing on numerical experience with solving linear systems, the RAS preconditioner for these two problems may need to be augmented with a coarse space when the number of processors increases. Without a coarse space, a larger overlap is required to improve the global exchange of information, and a smaller GMRES tolerance and more exact local solves become necessary.


       Problem C1                                      Problem C2
np     s = 5      s = 10     s = 15     s = 20         s = 5        s = 10      s = 15      s = 20
4      21 (4.9)   16 (5.3)   19 (8.9)   15 (9.1)       -            -           -           -
8      35 (4.0)   17 (2.4)   18 (3.6)   19 (5.1)       50 (79.2)    25 (54.5)   19 (54.5)   21 (79.3)
16     37 (2.0)   22 (1.5)   15 (1.3)   18 (2.0)       51 (37.3)    24 (24.3)   19 (25.7)   20 (35.4)
32     40 (1.3)   23 (0.8)   17 (0.6)   22 (1.1)       55 (19.6)    29 (14.6)   21 (13.8)   18 (15.2)
64     47 (1.0)   22 (0.5)   21 (0.5)   17 (0.4)       74 (14.2)    36 (8.6)    23 (7.2)    19 (7.7)
128    -          -          -          -              102 (10.2)   48 (6.2)    29 (5.0)    24 (4.9)
192    -          -          -          -              119 (9.8)    52 (5.3)    32 (4.0)    25 (3.7)

Table 7: Effects of GMRES(s) on JD iterations and timing (shown within parentheses) for Problems C1 and C2. The shortest execution time for each number of processors (np) is shown in boldface. Note that Problem C1 (with 101,650 unknowns) is too small to be solved efficiently using more than 64 processors, and Problem C2 (with 1,006,200 unknowns) is too large to be solved using 4 processors due to memory constraints.

5.3. Parallel performance

To evaluate the parallel performance of the eigensolver, we consider speedup and parallel efficiency, defined as

$$\text{speedup} = \frac{T_{np_1}}{T_{np_2}} \qquad \text{and} \qquad \text{parallel efficiency} = \frac{np_1 \times T_{np_1}}{np_2 \times T_{np_2}},$$

respectively. Speedup indicates how much faster a parallel run on np_2 processors is than a corresponding parallel run on np_1 processors. Parallel efficiency shows how much we gain from parallelization and reflects the fraction of the execution time spent on communication and synchronization. Figure 3 presents the speedup and parallel efficiency plots for Problems Q2 and C2. We choose the results obtained with np_1 = 16 (np_1 = 4) for Problem Q2 (C2) as the reference timings. All data presented in the figures are based on the best timing for each np, excerpted from Tables 6 and 7.
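For concreteness, these definitions can be checked against the total times reported later in Table 9(a); the short script below reproduces the roughly 16-fold speedup and the parallel efficiency of about 100% observed for Problem Q2 when going from 16 to 256 processors (the variable names are ours).

```python
def speedup(T_np1, T_np2):
    return T_np1 / T_np2

def efficiency(np1, T_np1, np2, T_np2):
    return (np1 * T_np1) / (np2 * T_np2)

# Total times for Problem Q2 from Table 9(a), with np1 = 16 as the reference.
T = {16: 35.7, 32: 16.5, 64: 7.7, 128: 3.8, 256: 2.2}
for np2, T2 in T.items():
    print(np2, round(speedup(T[16], T2), 1),
          "{:.0f}%".format(100 * efficiency(16, T[16], np2, T2)))
```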

Figure 3: Parallel efficiency and speedup for Problems Q2 and C2. Algorithmic parameters used: (a) Problem Q2, RAS(0) and ILU(0); (b) Problem C2, RAS(1) and ILU(2).

From Figure 3, we highlight the following points.


                                  np = 8            np = 256
µ       λ                         JD its (Time)     JD its (Time)    Parallel efficiency (%)
-0.1    -0.4199999999844743       54 (250.6)        56 (7.7)         101
 0.1     0.4164706473388473       30 (136.9)        28 (3.7)         117
 0.4     0.4164706473196381       23 (96.0)         21 (2.2)         138
 0.6     0.5992679796328200       31 (138.1)        42 (5.1)         84
 0.7     0.7179076438971341       47 (205.5)        52 (6.7)         95
 0.75    0.7296997979519239       54 (226.6)        77 (9.5)         75
 0.8     0.8064207692754761       72 (320.6)        99 (13.2)        76

Table 8: Robustness and scalability test of the ASPJD algorithm for finding a variety of target eigenvalues. Problem Q2 is considered. The notations µ and λ stand for the target value and the computed eigenvalue, respectively.

• For Problem Q2, the ASPJD eigensolver exhibits very impressive parallel performance: superlinear speedup and over 100% parallel efficiency up to 320 processors.

• For Problem C2, the eigensolver achieves nearly linear speedup for np_2 < 64. The performance of the eigensolver degrades somewhat beyond that, but it still maintains 70% parallel efficiency up to np_2 = 192.

Additionally, to numerically verify the robustness and scalability of the ASPJD algorithm, we also include results of the ASPJD eigensolver for finding eigenvalues other than the smallest positive one. As shown in Table 8, the ASPJD algorithm is quite robust and scalable. In these numerical experiments, we employ GMRES(10) with RAS(0), where the subdomain problem is solved by ILU(0), for np = 8 and np = 256; Problem Q2 is considered. The ASPJD eigensolver successfully converges to the eigenvalues closest to the given target values µ and retains good parallel efficiency, ranging from 75% to 138%. Note that the ASPJD eigensolver takes longer (more than 200 and 6 seconds for np = 8 and np = 256, respectively) to solve the eigenvalue problems for µ = −0.1, 0.7, 0.75, and 0.8. This is because the eigenvalues are clustered and the corresponding eigenvectors are more oscillatory for these values of µ. In fact, these particular values of µ are close to or beyond the boundary of the confinement potential interval [c1, c2] = [0, 0.77], and the mathematical QD model is invalid if the energy levels (i.e., eigenvalues) lie outside this interval.

5.4. Performance analysis

We further analyze the parallel performance of the ASPJD eigensolver (Figure 3) by answering the following questions: Why does the ASPJD eigensolver perform so remarkably for Problem Q2? Why does the parallel efficiency for Problem C2 degrade as the number of processors increases? We believe these phenomena are caused by the combined effects of the preconditioning strategy, the characteristics of the QD eigenvalue problem, and the structure of the corresponding coefficient matrices, as discussed below.

• The no-communication preconditioner with RAS(0) leads to superlinear scale-up for correction equation solve and overall execution time.


We begin by presenting a scaling analysis of each key component of the ASPJD algorithm in Table 9. Part (a) of the table shows that all components scale very well for Problem Q2, with the exception of the relatively inexpensive linearized projected eigenvalue solve. In particular, ASPJD achieves superlinear time scaling for the most expensive component, the correction equation solve. We believe the superlinear speedup results from the interplay of the following factors. First, the preconditioner used for this particular problem, RAS(0), does not involve any data communication between processors. Second, the number of GMRES iterations is fixed at 10 and the number of JD iterations (as shown in Table 6) remains nearly the same as the number of processors increases. Third, the CPU cache memories further accelerate data movement.

In contrast, Table 9-(b) shows that ASPJD does not achieve superlinear scale-up in the correction equation solve for Problem C2, while the scaling behavior of all other components is similar to that for Problem Q2. Recall that, in order to keep the number of JD iterations under control, we use RAS(1) for preconditioning Problem C2. This is the price paid for using RAS with a nonzero overlap, which necessarily involves some communication. Consequently, the scalability of the correction equation solve is not as good as for Problem Q2, and the overall efficiency for Problem C2 is degraded.

It should be noted that the QD eigenvalue problem has a special structure: the eigenvectors corresponding to the eigenvalues of interest are localized to the dot. That is, the components of the eigenvector corresponding to the matrix (outside of the QD) are mostly zero. In our simulations, the ratio of the cuboid matrix to the pyramidal dot in the computational domain is roughly 35:1, compared to roughly 10:1 for the cylindrical QD problem. Consequently, we are able to decouple the original pyramidal QD eigenvalue problem into many subproblems using RAS(0) without penalty in terms of an increased number of JD iterations, as the cuboid matrix problem (Q2) has relatively fewer non-zero elements in its eigenvectors than the cylindrical matrix problem (C2). We believe this is at least partially responsible for the different behaviors of the Q2 and C2 problems illustrated in Table 5.

• A discretization based on a non-uniform grid may lead to lower parallel efficiency.

As shown in Table 2, Problem Q2 is formulated on a uniform grid (with a finite volume discretization) and Problem C2 on a non-uniform grid (with a finite difference discretization). In a non-uniform grid, grid points around the hetero-structure are placed on a fine grid while the remaining grid points are placed on a coarse grid. Such non-uniform grids are essential for efficiently computing eigenpairs with higher accuracy [53]. However, the non-uniform grid also introduces more unbalanced entry magnitudes in the A_i's, which can result in lower computational accuracy while solving the corresponding correction equation [8]. We conjecture that the coefficient matrices arising from the non-uniform grids further affect the parallel efficiency of the ASPJD eigensolver. This conjecture is supported by the following numerical experiments.


(a) Problem Q2

Component            np = 16        np = 32        np = 64        np = 128       np = 256
LPEP solve           0.3 (0.9%)     0.3 (1.7%)     0.3 (3.5%)     0.3 (7.0%)     0.3 (13.0%)
CorrEq solve         22.3 (62.5%)   10.3 (62.7%)   4.7 (60.8%)    2.1 (55.5%)    1.0 (46.6%)
Form Mi              7.1 (19.8%)    3.2 (19.2%)    1.5 (19.4%)    0.8 (21.5%)    0.5 (25.3%)
Compute u, p, r      2.3 (6.4%)     1.1 (6.4%)     0.5 (7.1%)     0.3 (8.3%)     0.2 (8.6%)
Orthogonalization    1.3 (3.5%)     0.3 (2.0%)     0.1 (1.8%)     0.1 (1.9%)     0.1 (2.5%)
Form A(θ)            2.4 (6.7%)     1.3 (7.8%)     0.6 (7.3%)     0.2 (5.6%)     0.1 (3.9%)
Total time           35.7           16.5           7.7            3.8            2.2

(b) Problem C2

Component            np = 16        np = 32        np = 64        np = 128
LPEP solve           0.2 (0.6%)     0.1 (0.7%)     0.1 (1.7%)     0.1 (2.9%)
CorrEq solve         17.7 (72.9%)   11.0 (81.4%)   5.9 (81.5%)    4.1 (84.1%)
Form Mi              3.3 (13.5%)    1.2 (8.5%)     0.7 (9.3%)     0.4 (7.5%)
Compute u, p, r      1.2 (4.7%)     0.4 (3.2%)     0.2 (3.0%)     0.1 (2.2%)
Orthogonalization    0.7 (2.7%)     0.3 (2.0%)     0.1 (1.3%)     0.1 (1.6%)
Form A(θ)            1.3 (5.4%)     0.6 (4.1%)     0.2 (3.1%)     0.1 (1.7%)
Total                24.4           13.6           7.2            4.9

Table 9: Timing and the corresponding percentage (shown within parentheses) breakdown of each key component in the ASPJD algorithm for Problems Q2 and C2. The components include (a) LPEP solve: linearized projected eigenvalue problem solve; (b) CorrEq solve: correction equation solve; (c) Form Mi: M_i calculation; (d) Compute u, p, r: calculation of the vectors u, p, and r; (e) Form A(θ): A(θ) calculation.

Figure 4: Parallel efficiency and speedup for Problem C3, formulated on a uniform grid. The preconditioning strategy used in (a) is RAS(1) and in (b) is RAS(0). Note that ILU(2) is applied in both cases.

To verify the above claims regarding how RAS(δ) and the matrix structure affect parallel performance, we present performance plots for Problem C3 in Figure 4.

Both Problem C3 and Problem C2 are derived by modeling the same nano-structure and using the same discretization scheme. The only difference is that Problem C3 is formulated on a uniform grid, while Problem C2 is formulated on a non-uniform grid.

Comparing Figure 3-(b) (associated with the non-uniform grid, RAS(1), and ILU(2)) with
