B.21 Complexity of Postprocessing (Block-Column Parallelism)
III- 7 Wen-Yin Integration (Gramian)
Require: 𝑸[1], 𝑸[2], … , 𝑸[𝑁](real𝑚×𝑙 orthogonal matrices), 𝑸init(initial guess),𝜏0≥ 0 (initial step size), 𝛽 ∈ (0, 1) (scaling parameter for step size searching), 𝜎 ∈ (0, 1) (parameter for step size searching), 𝜂 ∈ (0, 1) (parameter for next step searching), 𝜏max, 𝜏min(maximum and minimum predicting step size).
Ensure: Integrated orthogonal basis𝑸opt.
1: Combine𝕼 = [𝑸[1] 𝑸[2] ⋯ 𝑸[𝑁]].
2: Compute𝕭 = 𝕼⊤𝕼.
3: Initialize𝑩𝑐 ← 𝕼⊤𝑸init, ̃𝑭 ← 𝑰𝑙, ̃𝑬 ← 𝑶𝑁𝑙×𝑙,𝜏𝑔 ← 𝜏0,𝜁 ← 1, 𝜙 ← 2𝑁1 ‖𝑩𝑐‖2𝐹.
4: Assign𝑫𝑐 ← 𝑁1𝑩⊤𝑐𝑩𝑐.
5: while (not convergent) do
6: Assign𝑩𝑐𝑔 = 𝑁1𝕭𝑩𝑐and𝑫𝑐𝑔 ← 𝑁1𝑩𝑐⊤𝑩𝑔𝑐.
7: Compute𝜇 ← tr(𝑫𝑐𝑔) − ‖𝑫𝑐‖2𝐹.
8: repeat for𝑡 = 0, 1, … do
9: Let𝜏 = 𝜏𝑔𝛽𝑡. Compute𝑪 byeq. (3.18)and find its inverse.
10: Compute𝑭𝑐and𝑭𝑐𝑔byeq. (3.21).
11: Assign𝑩+ ← 𝑩𝑐𝑭𝑐+ 𝑩𝑐𝑔𝑭𝑐𝑔.
12: Assign𝜙 ←̃ 2𝑁1 ‖𝑩+‖2𝐹.
13: until𝜙 ≥ 𝜙 + 𝜏𝜎𝜇̃
14: Assign𝑫+ ← 𝑁1𝑩⊤+𝑩+ and𝑬𝑐 ← 𝑁1𝑩𝑐𝑭𝑐𝑔.
15: Update𝜙 ← 𝜂𝜁 𝜙+ ̃𝜂𝜁 +1𝜙 and then𝜁 ← 𝜂𝜁 + 1.
16: Update ̃𝑭 ← ̃𝑭 𝑭𝑐and ̃𝑬 ← ̃𝑬𝑭𝑐+ 𝑬𝑐.
17: Update𝜏𝑔 ← max (min (𝜏guess, 𝜏max) , 𝜏min) usingeq. (3.30), where
𝜏guess= tr(𝜟⊤1𝜟1)
|tr(𝜟⊤1𝜟2)|
or |tr(𝜟⊤1𝜟2)|
tr(𝜟⊤2𝜟2) .
18: Update𝑩𝑐 ← 𝑩+and𝑫𝑐 ← 𝑫+.
19: end while
20: Output𝑸opt ← 𝑸init𝑭 + 𝕼 ̃̃ 𝑬.
Chapter 4 Parallelism
In this chapter, we use𝑃 as the number of processors, and assume that 𝑁 = 𝑃 for sim-plicity. For 𝑁 > 𝑃 , one may just handle more matrices in a processor. Note that in the Block-Row/Column Parallelism,𝑁 and 𝑃 are independent so that one may use more processors.
We propose three parallelism ideas — Naïve Parallelism, Block-Row Parallelism, and Block-Column Parallelism. The Block-Row/Column Parallelism split the matrix into row/column-blocks. These data structures are similar to the one-dimensional block row/column distribution in ScaLAPACK [7]. These block parallelism ideas are proposed to reduce communication cost. Precisely, the Block-Row Parallelism is targeted to the problems with near-square or square𝑨, and the Block-Column Parallelism is specialized for short and fat𝑨. All of the block-row/column algorithms are balanced, so that the parallel over-heads are limited.
4.1 Naïve Parallelism
Sketching and orthogonalization can be easily parallelized by using 𝑁 processors — computes𝜴[𝑖], 𝒀[𝑖], 𝑸[𝑖] in the𝑖-th processor — without any communication. Also, the Kolmogorov-Nagumo Integration (Algorithm III-1) can also be easily parallelized using the same idea.
Algorithm III-8 Kolmogorov-Nagumo Integration (Naïve Parallelism) Require: 𝑸[1], 𝑸[2], … , 𝑸[𝑁], 𝑸init.
Ensure: Integrated orthogonal basis𝑸opt.
1: Initialize the current iterate𝑸𝑐 ← 𝑸init.
2: while (not convergent) do
3: Assign𝑩[𝑖] ← 𝑸⊤[𝑖]𝑸𝑐in processor𝑖.
4: Assign𝑿[𝑖] ← 𝑁1 (𝑸[𝑖]𝑩[𝑖]− 𝑸𝑐𝑩[𝑖]⊤𝑩[𝑖]) in processor 𝑖.
5: Sum𝑿 ← ∑𝑁𝑖=1𝑿[𝑖]to all processors (MPI_Allreduce).
6: Compute𝑪 ← (
𝑰
2 + (𝑰4 − 𝑿⊤𝑿)1/2)
1/2
.
7: Update𝑸𝑐 ← 𝑸𝑐𝑪 + 𝑿𝑪−1.
8: end while
9: Output𝑸opt ← 𝑸𝑐.
Similarly, the Hierarchical Reduction Integration (Algorithm III-3) can be implemented as follow. However, the algorithm is unbalanced due to the reduction structure.
Algorithm III-9 Hierarchical Reduction Integration (Naïve Parallelism) Require: 𝑸[1], 𝑸[2], … , 𝑸[𝑁](real𝑚 × 𝑙 orthogonal matrices).
Ensure: Integrated orthogonal basis𝑸hr.
1: Set ̃𝑁 ← 𝑁.
2: while ̃𝑁 > 1 do
3: Setℎ ← ⌊𝑁2̃⌋
4: Transfer𝑸[𝑖+ℎ]from processor𝑖 + ℎ to processor 𝑖 (MPI_SendandMPI_Receive).
5: if the processor ID𝑖 ≤ ℎ then
6: Compute𝑾 𝑺𝑻⊤ ← svd(𝑸⊤[𝑖]𝑸[𝑖+ℎ]) in processor 𝑖.
7: Update𝑸[𝑖] ← (𝑸[𝑖]𝑾 + 𝑸[𝑖+ℎ]𝑻 ) (2(𝑰 + 𝑺))−1/2 in processor𝑖.
8: end if
9: Update ̃𝑁 ← ⌈𝑁̃2⌉.
10: end while
11: Output𝑸hr ← 𝑸[1]in processor1.
4.2 Block-Row Parallelism for Integration
We proposed block-row algorithms for lower communication cost. The Naïve Parallelism (Algorithm III-8) requires an𝑚×𝑙 data communication (Step 5) in each iteration. It will be
very slow when𝑚 is huge. Therefore, we propose a new algorithm base on the following
InAlgorithm III-10(Block-Row Parallelism of the Optimized Kolmogorov-Nagumo Integration,Algorithm III-4), we split the matrices into𝑚𝑏× 𝑙 row-blocks (𝑚𝑏 = 𝑚/𝑃 ) and host them in different processors.
𝕼 =
Algorithm III-10 Kolmogorov-Nagumo Integration (Block-Row Parallelism)
Require: 𝑸[1], 𝑸[2], … , 𝑸[𝑁], 𝑸init (real𝑚 × 𝑙 orthogonal matrices), 𝑃 (number of pro-cessors).
Ensure: Integrated orthogonal basis𝑸opt.
1: Rearrange𝑸[𝑖]to𝔔(𝑗)in processor𝑗 (MPI_Alltoall).
2: Initialize the current iterate𝘘𝑐(𝑗) ← 𝘘init(𝑗) in processors𝑗.
3: Sum𝑩𝑐 ← ∑𝑃𝑗=1𝔔(𝑗)⊤𝘘𝑐(𝑗)to all processors (MPI_Allreduce).
4: while (not convergent) do
5: Assign𝘎𝑐(𝑗)← 𝑁1𝔔(𝑗)𝑩𝑐 in processor𝑗.
Similarly, we propose Algorithm III-11 (Block-Row Parallelism of the Optimized Wen-Yin Integration,Algorithm III-5) using the block-row idea by splitting the following matrices.
Algorithm III-11 Wen-Yin Integration (Block-Row Parallelism)
Require: 𝑸[1], 𝑸[2], … , 𝑸[𝑁](real𝑚×𝑙 orthogonal matrices), 𝑸init(initial guess),𝜏0≥ 0 (initial step size), 𝛽 ∈ (0, 1) (scaling parameter for step size searching), 𝜎 ∈ (0, 1) (parameter for step size searching), 𝜂 ∈ (0, 1) (parameter for next step searching), 𝜏max, 𝜏min(maximum and minimum predicting step size),𝑃 (number of processors).
Ensure: Integrated orthogonal basis𝑸opt.
1: Rearrange𝑸[𝑖]to𝔔(𝑗)in processor𝑗 (MPI_Alltoall).
2: Initialize the current iterate𝘘𝑐(𝑗) ← 𝘘init(𝑗) in processors𝑗.
3: Sum𝑩𝑐 ← ∑𝑃𝑗=1𝔔(𝑗)⊤𝘘𝑐(𝑗)to all processors (MPI_Allreduce).
4: Assign𝑫𝑐 ← 𝑁1𝑩⊤𝑐𝑩𝑐,𝘎𝑐(𝑗) ← 𝑁1𝔔(𝑗)𝑩𝑐, and𝘟𝑐(𝑗) ← 𝘎𝑐(𝑗)− 𝘘𝑐(𝑗)𝑫𝑐in processor𝑗.
5: Initialize𝜏𝑔 ← 𝜏0,𝜁 ← 1, 𝜙 ← 2𝑁1 ‖𝑩𝑐‖2𝐹.
6: while (not convergent) do
7: Sum𝑩𝑐𝑔 ← ∑𝑃𝑗=1𝔔(𝑗)⊤𝘎𝑐(𝑗)to all processors (MPI_Allreduce).
8: Assign𝑫𝑐𝑔 ← 𝑁1𝑩𝑐⊤𝑩𝑐𝑔.
9: Compute𝜇 ← tr(𝑫𝑐𝑔) − ‖𝑫𝑐‖2𝐹.
10: repeat for𝑡 = 0, 1, … do
11: Let𝜏 = 𝜏𝑔𝛽𝑡. Compute𝑪 byeq. (3.18)and find its inverse.
12: Compute𝑭𝑐and𝑭𝑐𝑔byeq. (3.21).
13: Assign𝑩+ ← 𝑩𝑐𝑭𝑐+ 𝑩𝑐𝑔𝑭𝑐𝑔.
14: Assign𝜙 ←̃ 2𝑁1 ‖𝑩+‖2𝐹.
15: until𝜙 ≥ 𝜙 + 𝜏𝜎𝜇̃
16: Update𝜙 ← 𝜂𝜁 𝜙+ ̃𝜂𝜁 +1𝜙 and then𝜁 ← 𝜂𝜁 + 1.
17: Assign 𝑫+ ← 𝑁1𝑩⊤+𝑩+, 𝘘+(𝑗) ← 𝘘𝑐(𝑗)𝑭𝑐 + 𝘎𝑐(𝑗)𝑭𝑐𝑔, 𝘎+(𝑗) ← 𝑁1𝔔(𝑗)𝑩+, 𝘟+(𝑗) ← 𝘎+(𝑗)− 𝘘+(𝑗)𝑫+ in processor𝑗.
18: Compute the differences Δ(𝑗)1 = 𝘘+(𝑗)− 𝘘𝑐(𝑗)and Δ(𝑗)2 = 𝘟+(𝑗)− 𝘟𝑐(𝑗) in processor𝑗.
19: Sum 𝑡11 ← ∑𝑃𝑗=1tr(Δ(𝑗)1 ⊤Δ(𝑗)1 ), 𝑡12 ← ∑𝑃𝑗=1tr(Δ(𝑗)1 ⊤Δ(𝑗)2 ), and 𝑡22 ←
∑𝑃𝑗=1tr(Δ(𝑗)2 ⊤Δ(𝑗)2 ) to all processors (MPI_Allreduce).
20: Update𝜏𝑔 ← max (min (𝜏guess, 𝜏max) , 𝜏min), where 𝜏guess = 𝑡11
𝑡12 or 𝑡12 𝑡22.
21: Update𝘘𝑐(𝑗) ← 𝘘+(𝑗),𝘎𝑐(𝑗) ← 𝘎+(𝑗),𝘟𝑐(𝑗)← 𝘟+(𝑗),𝑩𝑐 ← 𝑩+,𝑫𝑐 ← 𝑫+in processor 𝑗.
22: end while
23: Gather𝑸opt ← [𝘘𝑐(1)⊤ 𝘘𝑐(2)⊤ ⋯ 𝘘𝑐(𝑃 )⊤]
⊤
(MPI_Gather).
Algorithm III-12 (Block-Row Parallelism of the Hierarchical Reduction Integration, Algorithm 3) also use the block-row idea. Unlike the Naïve Parallelism (Algorithm III-9), this algorithm is balanced.
Algorithm III-12 Hierarchical Reduction Integration (Block-Row Parallelism)
Require: 𝑸[1], 𝑸[2], … , 𝑸[𝑁](real𝑚×𝑙 orthogonal matrices), 𝑃 (number of processors).
Ensure: Integrated orthogonal basis𝑸hr.
1: Rearrange𝑸[𝑖]to𝔔(𝑗)in processor𝑗 (MPI_Alltoall).
4.3 Block-Row Parallelism for Gramian Integration
With the improvement on the Gramian integration (Algorithm III-6 and Algorithm III-7), we only need to handle tiny matrices (with size𝑙 or 𝑁𝑙) in the iteration, which cost 𝑂(𝑁2𝑙3) ⋅ #Iter flops only. Since 𝑙 is extremely small, the iteration does not need paral-lelism. Therefore, we focus computing𝕭 = 𝕼⊤𝕼. To reduce data communication, we use the block-row idea. Define
𝕼 =
We proposed several subroutines by splitting
Subroutines S-1toS-3are used before the iteration, andSubroutines S-4toS-5are used after the iteration. Furthermore, if we pick the𝑸[1]as the initial guess (i.e.𝑸init = 𝑸[1]), we can skipSubroutine S-3and copy𝑩𝑐from the first column-block of𝕭.
Subroutine S-1 Matrices Rearrangement (Block-Row Parallelism) Require: 𝑸[𝑖],𝑸init,𝑃 (number of processors).
Ensure: 𝔔(𝑗),𝘘init(𝑗).
1: Split the𝑸[𝑖] into𝑃 row-blocks and send 𝘘[𝑖](𝑗) to processor𝑗 (MPI_Alltoall).
2: Combine𝔔(𝑗)= [𝘘[1](𝑗) 𝘘[2](𝑗) ⋯ 𝘘[𝑁](𝑗)] in processor 𝑗.
3: Split the𝑸init into𝑃 row-blocks and send 𝘘init(𝑗) to processor𝑗 (MPI_Scatter).
Subroutine S-2 Computing𝕭 = 𝕼⊤𝕼 (Block-Row Parallelism) Require: 𝔔(𝑗).
Ensure: 𝕭.
1: Sum𝕭 ← ∑𝑃𝑗=1𝔔(𝑗)⊤𝔔(𝑗)to all processors (MPI_Allreduce).
Subroutine S-3 Computing𝑩𝑐 = 𝕼⊤𝑸init (Block-Row Parallelism) Require: 𝔔(𝑗),𝘘init(𝑗).
Subroutine S-5 Matrices Gathering (Block-Row Parallelism)
(MPI_GatherorMPI_Allgather).
4.4 Block-Row Parallelism for Sketching, Orthogonaliza-tion, and Postprocessing
The block-row idea can be also used on sketching, orthogonalization, and postprocessing.
With the idea, the rearrangement (Subroutine S-1) and the gathering (Subroutine S-5) are unnecessary. Also, splitting the input matrix𝑨 can reduce memory usage so that we can handle larger problems. We split The following matrices into𝑚𝑏×𝑛 and 𝑚𝑏×𝑙 row-blocks
𝑨 =
For sketching, we split𝑨 to row-blocks 𝘈(𝑗) and computes𝘠[𝑖](𝑗) in the𝑗-th processors.
The only communication is sending the random seed to all processors (with𝑞 = 0).
Algorithm I-2 Gaussian Projection Sketching (Block-Row Parallelism)
Require: 𝘈(𝑗)(real𝑚𝑏×𝑛 matrix, row-block), 𝑙 (dimension of the sketched column space), 𝑞 (exponent of the power method), 𝑁 (number of random sketches), 𝑃 (number of processors).
Ensure: 𝘠[𝑖](𝑗)(real 𝑚𝑏 × 𝑙 matrices, row-blocks), where columns of 𝒀[𝑖] spans a column subspace of𝑨 for 𝑖 = 1, … , 𝑁.
1: Broadcast the random seed to all processors (MPI_Bcast).
2: Generate𝑛 × 𝑙 random matrices 𝜴[𝑖]using Gaussian normal distribution in all proces-sors.
3: Assign𝘠[𝑖](𝑗) ← 𝘈(𝑗)𝜴[𝑖]in processor𝑗.
4: loop𝑞 times do
5: Sum ̃𝒀[𝑖] ← ∑𝑃𝑗=1𝘈(𝑗)⊤𝘠[𝑖](𝑗)to all processors (MPI_Allreduce).
6: Update𝘠[𝑖](𝑗) ← 𝘈(𝑗)𝒀̃[𝑖]in processor𝑗.
7: end loop
We also use the block-row idea on Gramian orthogonalization (Algorithm II-2). Here we only need to communicate𝑙 × 𝑙 matrices.
Algorithm II-3 Gramian Orthogonalization (Block-Row Parallelism)
Require: 𝘠[𝑖](𝑗)(real𝑚𝑏× 𝑙 matrices, row-blocks), 𝑃 (number of processors).
Ensure: 𝘘[𝑖](𝑗)(real𝑚𝑏×𝑙 matrices, row-blocks), where columns of 𝑸[𝑖]are an orthonormal basis of𝒀[𝑖].
1: Sum𝑴[𝑖]← ∑𝑃𝑗=1𝘠[𝑖](𝑗)⊤𝘠[𝑖](𝑗)to all processors (MPI_Allreduce).
2: Compute𝑾[𝑖]𝑺[𝑖]2 𝑾[𝑖]⊤← eig(𝑴[𝑖]).
3: Assign𝘘[𝑖](𝑗) ← 𝘠[𝑖](𝑗)𝑾[𝑖]𝑺[𝑖]−1in processor𝑗.
To parallel the Gramian Postprocessing (Algorithm IV-2) and the symmetric postpro-cessing (Algorithm IV-3), we split𝒁 = 𝑨⊤𝑸 into 𝑛𝑏× 𝑙 row-blocks
𝒁 =
⎡⎢
⎢⎢
⎢⎢
⎢⎣ 𝘡(1) 𝘡(2)
⋮ 𝘡(𝑃 )
⎤⎥
⎥⎥
⎥⎥
⎥⎦
(4.9)
with an extra communication usingMPI_Reduce_scatter_block.
Algorithm IV-4 Gramian Postprocessing (Block-Row Parallelism)
Require: 𝘈(𝑗)(real𝑚𝑏× 𝑛 matrix, row-block), 𝘘(𝑗)(real𝑚𝑏× 𝑙 orthogonal matrix, row-block),𝑘 (desired rank of approximate SVD), 𝑃 (number of processors).
Ensure: Approximate rank-𝑘 SVD of 𝑨 ≈ ̂𝑼𝑘𝜮̂𝑘𝑽̂𝑘⊤.
1: Compute𝘈(𝑗)⊤𝘘(𝑗)in each processor and sum the𝑗-th row-blocks to 𝘡(𝑗)in processor 𝑗 (MPI_Reduce_scatter_block).
2: Sum𝑴 ← ∑𝑃𝑗=1𝘡(𝑗)⊤𝘡(𝑗) to all processors (MPI_Allreduce).
3: Compute ̂𝑾𝑙𝜮̂𝑙2𝑾̂𝑙⊤ ← eig(𝑴).
4: Extract the largest𝑘 eigen-pairs from ̂𝑾𝑙, ̂𝜮𝑙 to obtain ̂𝑾𝑘, ̂𝜮𝑘.
5: Assign ̂𝘜𝑘(𝑗) ← 𝘘(𝑗)𝑾̂𝑘.
6: Assign ̂𝘝𝑘(𝑗) ← 𝘡(𝑗)𝑾̂𝑘𝜮̂𝑘−1.
7: Gather ̂𝑼𝑘 ← [ ̂𝘜𝑘(1)⊤ 𝘜̂𝑘(2)⊤ ⋯ 𝘜̂𝑘(𝑃 )⊤]
⊤
(MPI_Gather).
8: Gather ̂𝑽𝑘 ← [ ̂𝘝𝑘(1)⊤ 𝘝̂𝑘(2)⊤ ⋯ 𝘝̂𝑘(𝑃 )⊤]
⊤
(MPI_Gather).
Algorithm IV-5 Symmetric Postprocessing (Block-Row Parallelism)
Require: 𝘈(𝑗)(real𝑚𝑏× 𝑚 matrix, row-block), 𝘘(𝑗)(real𝑚𝑏× 𝑙 orthogonal matrix, row-block),𝑘 (desired rank of approximate SVD), 𝑃 (number of processors).
Ensure: Approximate rank-𝑘 SVD of 𝑨 ≈ ̂𝑼𝑘𝜮̂𝑘𝑼̂𝑘⊤.
1: Compute𝘈(𝑗)⊤𝘘(𝑗)in each processor and sum the𝑗-th row-blocks to 𝘡(𝑗)in processor 𝑗 (MPI_Reduce_scatter_block).
2: Sum𝑴 ← ∑𝑃𝑗=1𝘡(𝑗)⊤𝘘(𝑗)to all processors (MPI_Allreduce).
3: Compute ̂𝑾𝑙𝜮̂𝑙𝑾̂𝑙⊤ ← eig(𝑴).
4: Extract the largest𝑘 eigen-pairs from ̂𝑾𝑙, ̂𝜮𝑙 to obtain ̂𝑾𝑘, ̂𝜮𝑘.
5: Assign ̂𝘜𝑘(𝑗) ← 𝘘(𝑗)𝑾̂𝑘.
6: Gather ̂𝑼𝑘 ← [ ̂𝘜𝑘(1)⊤ 𝘜̂𝑘(2)⊤ ⋯ 𝘜̂𝑘(𝑃 )⊤]
⊤
(MPI_Gather).
Demmel et al. proposed Block-Row Parallelism based Tall Skinny QR (TSQR) [8]
for computing QR decomposition (see Appendix A.2 for detail). Algorithm II-4is the orthogonalization using TSQR.
Table 4.1: Communication Tree of Tall Skinny QR Level
1 1&2 3&4 5&6 7&8 ⋯ 2 1&3 2&4 5&7 6&8 ⋯ 3 1&5 2&6 3&7 4&8 ⋯
⋮ ⋮ ⋮ ⋮ ⋮ ⋱
Algorithm II-4 TSQR [8] Orthogonalization (Block-Row Parallelism) Require: 𝘠[𝑖](𝑗)(real𝑚𝑏× 𝑙 matrix, row-block), 𝑃 (number of processors).
Ensure: 𝘘[𝑖](𝑗)(real𝑚𝑏×𝑙 matrices, row-blocks), where columns of 𝑸[𝑖]are an orthonormal basis of𝒀[𝑖].
1: Denote𝐻 = ⌈log2𝑃 ⌉.
2: Compute𝘘[𝑖],0(𝑗) 𝘙[𝑖],0(𝑗) ← qr(𝘠[𝑖](𝑗)), where 𝘘
(𝑗)
[𝑖],0 ∈ ℝ𝑚𝑏×𝑙and𝘙[𝑖],0(𝑗) ∈ ℝ𝑙×𝑙.
3: for𝑡 = 1 to 𝐻 do
4: if processor𝑗 has neighbor at level 𝑡 then
5: Combine𝘠[𝑖],𝑡(𝑗 ̌𝑗) from𝘙[𝑖],𝑡−1(𝑗) and𝘙[𝑖],𝑡−1( ̌𝑗) , where ̌𝑗 denote the neighbor processor (seeTable 4.1). (MPI_Sendrecv).
6: Compute𝘘[𝑖],𝑡(𝑗 ̌𝑗)𝘙[𝑖],𝑡(𝑗 ̌𝑗) ← qr(𝘠[𝑖],𝑡(𝑗 ̌𝑗)), where 𝘘[𝑖],𝑡(𝑗 ̌𝑗) ∈ ℝ2𝑙×𝑙 and𝘙[𝑖],𝑡(𝑗 ̌𝑗) ∈ ℝ𝑙×𝑙.
7: Assign 𝘘[𝑖],𝑡(𝑗) as the correspond row-block of 𝘘[𝑖],𝑡(𝑗 ̌𝑗) (upper/lower block if 𝑗 is smaller/larger than ̌𝑗).
8: Denote𝘙[𝑖],𝑡(𝑗) ← 𝘙[𝑖],𝑡(𝑗 ̌𝑗).
9: else
10: Assign𝘙[𝑖],𝑡(𝑗) ← 𝘙[𝑖],𝑡−1(𝑗) and𝘘[𝑖],𝑡 ← 𝑰𝑙 (the latter is stored implicitly).
11: end if
12: end for
13: Output𝘘[𝑖](𝑗) ← 𝘘[𝑖],0(𝑗) 𝘘[𝑖],1(𝑗) ⋯ 𝘘[𝑖],𝐻(𝑗) .
Also, we can compute the SVD of 𝒁 = 𝑨⊤𝑸 by applying in TSQR and computing the small SVD of the upper triangular part. Using the idea, we proposeAlgorithm IV-6.
Algorithm IV-6 TSQR [8] Postprocessing (Block-Row Parallelism)
Require: 𝘈(𝑗)(real𝑚𝑏× 𝑛 matrix, row-block), 𝘘(𝑗)(real𝑚𝑏× 𝑙 orthogonal matrix, row-block),𝑘 (desired rank of approximate SVD), 𝑃 (number of processors).
Ensure: Approximate rank-𝑘 SVD of 𝑨 ≈ ̂𝑼𝑘𝜮̂𝑘𝑽̂𝑘⊤.
1: Denote𝐻 = ⌈log2𝑃 ⌉.
2: Compute𝘈(𝑗)⊤𝘘(𝑗)in each processor and sum the𝑗-th row-blocks to 𝘡(𝑗)in processor 𝑗 (MPI_Reduce_scatter_block).
3: Compute𝘝0(𝑗)𝘙0(𝑗) ← qr(𝘡(𝑗)), where 𝘝0(𝑗) ∈ ℝ𝑚𝑏×𝑙and𝘙0(𝑗)∈ ℝ𝑙×𝑙.
4: for𝑡 = 1 to 𝐻 do
5: if processor𝑗 has neighbor at level 𝑡 then
6: Combine ̃𝘡𝑡 from 𝘙𝑡−1(𝑗) and 𝘙𝑡−1( ̌𝑗), where 𝑗 denote the neighbor processor (seě Table 4.1). (MPI_Sendrecv).
7: Compute ̃𝘝𝑡𝘙𝑡 ← qr( ̃𝘡𝑡), where𝘝̃𝑡 ∈ ℝ2𝑙×𝑙 and𝘙𝑡 ∈ ℝ𝑙×𝑙.
8: Assign 𝘝𝑡(𝑗) as the correspond row-block of ̃𝘝𝑡 (upper/lower block if 𝑗 is smaller/larger than ̌𝑗).
9: else
10: Assign𝘙𝑡 ← 𝘙𝑡−1and𝘝𝑡 ← 𝑰𝑙 (the latter is stored implicitly).
11: end if
12: end for
13: Compute ̂𝑾𝑙𝜮̂𝑙𝑻̂𝑙⊤ ← svd(𝘙𝐻⊤).
14: Extract the largest𝑘 singular-pairs from ̂𝑾𝑙, ̂𝜮𝑙, ̂𝑻𝑙 to obtain ̂𝑾𝑘, ̂𝜮𝑘, ̂𝑻𝑘.
15: Assign ̂𝘜𝑘(𝑗) ← 𝘘(𝑗)𝑾̂𝑘.
16: Assign ̂𝘝𝑘(𝑗) ← 𝘝0(𝑗)𝘝1(𝑗)⋯ 𝘝𝐻(𝑗)𝑻̂𝑘.
17: Gather ̂𝑼𝑘 ← [ ̂𝘜𝑘(1)⊤ 𝘜̂𝑘(2)⊤ ⋯ 𝘜̂𝑘(𝑃 )⊤]
⊤
(MPI_Gather).
18: Gather ̂𝑽𝑘 ← [ ̂𝘝𝑘(1)⊤ 𝘝̂𝑘(2)⊤ ⋯ 𝘝̂𝑘(𝑃 )⊤]
⊤
(MPI_Gather).
4.5 Block-Column Parallelism
The Block-Column Parallelism is applied only on the input matrix𝑨 since it is the only short and fat matrix in the algorithm. Therefore, we only apply Block-Column Parallelism on sketching and postprocessing since𝑨 is only used in this two stages.
For the cases with extremely large𝑛,Step 1inAlgorithm IV-4(Block-Row Parallelism of the Gramian Postprocessing) cost𝑛𝑙 words communication. For 𝑚 ≪ 𝑛 (more precisely,
𝑁𝑚 < 𝑛), we choose to parallelize the input matrix 𝑨 into column blocks
𝑨 = [𝘈⟨1⟩ 𝘈⟨2⟩ ⋯ 𝘈⟨𝑃 ⟩] , (4.10)
and useMPI_Reduce_scatter_blockto convert the result into row-blocks. HereStep 2in Algorithm I-3only cost𝑁𝑚𝑙 words. (SeeChapter 5for more detail.)
Algorithm I-3 Gaussian Projection Sketching (Block-Column Parallelism)
Require: 𝘈⟨𝑗⟩(real𝑚 × 𝑛𝑏 matrix, column-block),𝑙 (dimension of the sketched column space),𝑁 (number of random sketches), 𝑃 (number of processors).
Ensure: 𝘠[𝑖](𝑗)(real 𝑚𝑏 × 𝑙 matrices, row-blocks), where columns of 𝒀[𝑖] spans a column subspace of𝑨 for 𝑖 = 1, … , 𝑁.
1: Generate𝑛𝑏× 𝑙 random matrices Ω(𝑗)[𝑖] using Gaussian normal distribution in all pro-cessors (with different random seed).
2: Compute𝘈⟨𝑗⟩Ω(𝑗)[𝑖] in each processor and sum the𝑗-th row-blocks to 𝘠[𝑖](𝑗)in processor 𝑗 (MPI_Reduce_scatter_block).
With column blocks𝘈⟨𝑗⟩,Step 2inAlgorithm IV-7can be done without communica-tion. Through we need to gather𝑸 (usingSubroutine S-5, require𝑚𝑙 words communica-tion) before postprocessing, the communication is still reduced since𝑚 ≪ 𝑛. Furthermore, with column blocks, if one only needs left singular vectors ̂𝑼𝑘, iSVD can be done without any commutation related to𝑛.
Algorithm IV-7 Gramian Postprocessing (Block-Column Parallelism)
Require: 𝘈(𝑗) (real𝑚𝑏 × 𝑛 matrix, column-block), 𝘘(𝑗) (real𝑚𝑏 × 𝑙 orthogonal matrix, row-block),𝑘 (desired rank of approximate SVD), 𝑃 (number of processors).
Ensure: Approximate rank-𝑘 SVD of 𝑨 ≈ ̂𝑼𝑘𝜮̂𝑘𝑽̂𝑘⊤.
1: Gather𝑸 ← [𝘘(1)⊤ 𝘘(2)⊤ ⋯ 𝘘(𝑃 )⊤]
⊤
to all processors (MPI_Allgather).
2: Assign𝘡(𝑗)← 𝘈⟨𝑗⟩⊤𝑸 in processor 𝑗.
3: Sum𝑴 ← ∑𝑃𝑗=1𝘡(𝑗)⊤𝘡(𝑗) to all processors (MPI_Allreduce).
4: Compute ̂𝑾𝑙𝜮̂𝑙2𝑾̂𝑙⊤ ← eig(𝑴).
5: Extract the largest𝑘 eigen-pairs from ̂𝑾𝑙, ̂𝜮𝑙 to obtain ̂𝑾𝑘, ̂𝜮𝑘.
6: Assign ̂𝑼𝑘 ← 𝑸 ̂𝑾𝑘.
7: Assign ̂𝘝𝑘(𝑗) ← 𝘡(𝑗)𝑾̂𝑘𝜮̂𝑘−1.
8: Gather ̂𝑽𝑘 ← [ ̂𝘝𝑘(1)⊤ 𝘝̂𝑘(2)⊤ ⋯ 𝘝̂𝑘(𝑃 )⊤]
⊤
(MPI_Gather).
The TSQR Postprocessing (Algorithm IV-8) can also use column block 𝘈⟨𝑗⟩ in the same way asAlgorithm IV-7.
Algorithm IV-8 TSQR [8] Postprocessing (Block-Column Parallelism)
Require: 𝘈(𝑗) (real𝑚𝑏 × 𝑛 matrix, column-block), 𝘘(𝑗) (real𝑚𝑏 × 𝑙 orthogonal matrix, row-block),𝑘 (desired rank of approximate SVD), 𝑃 (number of processors).
Ensure: Approximate rank-𝑘 SVD of 𝑨 ≈ ̂𝑼𝑘𝜮̂𝑘𝑽̂𝑘⊤.
1: Denote𝐻 = ⌈log2𝑃 ⌉.
2: Gather𝑸 ← [𝘘(1)⊤ 𝘘(2)⊤ ⋯ 𝘘(𝑃 )⊤]
⊤
to all processors (MPI_Allgather).
3: Assign𝘡(𝑗)← 𝘈⟨𝑗⟩⊤𝑸 in processor 𝑗.
4: Compute𝘝0(𝑗)𝘙0(𝑗) ← qr(𝘡(𝑗)), where 𝘝0(𝑗) ∈ ℝ𝑚𝑏×𝑙and𝘙0(𝑗)∈ ℝ𝑙×𝑙.
5: for𝑡 = 1 to 𝐻 do
6: if processor𝑗 has neighbor at level 𝑡 then
7: Combine ̃𝘡𝑡 from 𝘙𝑡−1(𝑗) and 𝘙𝑡−1( ̌𝑗), where 𝑗 denote the neighbor processor (seě Table 4.1). (MPI_Sendrecv).
8: Compute ̃𝘝𝑡𝘙𝑡 ← qr( ̃𝘡𝑡), where𝘝̃𝑡 ∈ ℝ2𝑙×𝑙 and𝘙𝑡 ∈ ℝ𝑙×𝑙.
9: Assign 𝘝𝑡(𝑗) as the correspond row-block of ̃𝘝𝑡 (upper/lower block if 𝑗 is smaller/larger than ̌𝑗).
10: else
11: Assign𝘙𝑡 ← 𝘙𝑡−1and𝘝𝑡 ← 𝑰𝑙 (the latter is stored implicitly).
12: end if
13: end for
14: Compute ̂𝑾𝑙𝜮̂𝑙𝑻̂𝑙⊤ ← svd(𝘙𝐻⊤).
15: Extract the largest𝑘 singular-pairs from ̂𝑾𝑙, ̂𝜮𝑙, ̂𝑻𝑙 to obtain ̂𝑾𝑘, ̂𝜮𝑘, ̂𝑻𝑘.
16: Assign ̂𝑼𝑘 ← 𝑸 ̂𝑾𝑘.
17: Assign ̂𝘝𝑘(𝑗) ← 𝘝0(𝑗)𝘝1(𝑗)⋯ 𝘝𝐻(𝑗)𝑻̂𝑘.
18: Gather ̂𝑽𝑘 ← [ ̂𝘝𝑘(1)⊤ 𝘝̂𝑘(2)⊤ ⋯ 𝘝̂𝑘(𝑃 )⊤]
⊤
(MPI_Gather).
We also propose the block-column version of the Symmetric Postprocessing (Algo-rithm IV-3). Although the matrix𝑨 is not short and fat (due to the symmetry of 𝑨), since 𝑨 is symmetric, we can use Block-Row Parallelism in sketching and Block-Column Par-allelism in forming to minimize the communication.
Algorithm IV-9 Symmetric Postprocessing (Block-Column Parallelism)
Require: 𝘈(𝑗) (real𝑚𝑏× 𝑚 matrix, column-block), 𝘘(𝑗) (real𝑚𝑏 × 𝑙 orthogonal matrix, row-block),𝑘 (desired rank of approximate SVD), 𝑃 (number of processors).
Ensure: Approximate rank-𝑘 SVD of 𝑨 ≈ ̂𝑼𝑘𝜮̂𝑘𝑼̂𝑘⊤.
1: Gather𝑸 ← [𝘘(1)⊤ 𝘘(2)⊤ ⋯ 𝘘(𝑃 )⊤]
⊤
to all processors (MPI_Allgather).
2: Assign𝘡(𝑗)← 𝘈⟨𝑗⟩⊤𝑸 in processor 𝑗.
3: Sum𝑴 ← ∑𝑃𝑗=1𝘡(𝑗)⊤𝘘(𝑗)to all processors (MPI_Allreduce).
4: Compute ̂𝑾𝑙𝜮̂𝑙𝑾̂𝑙⊤ ← eig(𝑴).
5: Extract the largest𝑘 eigen-pairs from ̂𝑾𝑙, ̂𝜮𝑙 to obtain ̂𝑾𝑘, ̂𝜮𝑘.
6: Assign ̂𝑼𝑘 ← 𝑸 ̂𝑾𝑘.
4.6 GPU Acceleration
Sketching and postprocessing stages are the only algorithms using the huge input matrix 𝑨, and cost square to the input size (𝑂(𝑚𝑛𝑙) flops). Fortunately, all the operations require 𝑨 are matrix-matrix multiplication, which can be highly accelerated using GPU [9].
We use GPU to compute𝒀[𝑖] ← 𝑨𝜴[𝑖]in sketching and𝒁 ← 𝑨⊤𝑸 in postprocessing stages. For𝑨 with extremely large sizes that exceed the GPU memory limit, we split the matrices into column/row blocks and compute the multiplication successively.
Chapter 5
Comparison of Algorithms
In this chapter, we denote 𝐼 as the number of iteration, and 𝐼𝑠 as the total number of iteration in finding step size (𝐼𝑠 ≥ 𝐼). Here we assume 𝑚 ≫ 𝑁𝑙, 𝑙 = 𝑘, and 𝑞 = 0. The following tables only show the leading terms.1 In the table, we abbreviate the Kolmogorov-Nagumo Iteration, the Wen-Yin Iteration, and the Hierarchical Reduction Integration as KN, WY, and HR respectively.
Tables 5.1to5.3summarize the computational and communication complexity of all the algorithms. For integration, we obtain that the optimized algorithms indeed reduce the computational complexity. We do not recommend the Gramian algorithms, unless one expects the number of iteration to be larger than 𝑁
4. For orthogonalization and postpro-cessing, we recommend the Gramian algorithms, which are fast and well-parallelized.
By comparing the parallelisms, we obtain that the block-row postprocessing requires an extra𝑛𝑙 communication, and the block-column sketching requires an extra 𝑁𝑚𝑙 com-munication. Therefore we recommend using Block-Column Parallelisms on sketching and postprocessing if and only if the input matrix𝑨 is short and fat enough (that is, 𝑁𝑚 < 𝑛).
1SeeAppendix Bfor detail.
Table 5.1: The Complexity of Sequential Algorithms
Stage Name #flops
Sketching Algorithm I-1Gaussian Projection 2𝑁𝑛𝑚𝑙 Orthogonalization Algorithm II-1Canonical QR 4𝑁𝑚𝑙2
Algorithm II-2Gramian 3𝑁𝑚𝑙2
Integration
Algorithm III-1KN (Original) (4𝑁 + 10)𝑚𝑙2𝐼 Algorithm III-4KN (Optimized) (4𝑁 + 4)𝑚𝑙2𝐼 Algorithm III-6KN (Gramian) (𝑁2+ 2𝑁 + 2)𝑚𝑙2 Algorithm III-2WY (Original) (2𝑁 + 8)𝑚𝑙2𝐼𝑠+ (4𝑁 + 4)𝑚𝑙2𝐼 Algorithm III-5WY (Optimized) (4𝑁 + 6)𝑚𝑙2𝐼
Algorithm III-7WY (Gramian) (𝑁2+ 2𝑁 + 2)𝑚𝑙2
Algorithm III-3HR 6𝑁𝑚𝑙2
Postprocessing
Algorithm IV-1SVD with QR 2𝑛𝑚𝑙 + 6𝑛𝑙2+ 2𝑚𝑙2 Algorithm IV-2Gramian 2𝑛𝑚𝑙 + 3𝑛𝑙2+ 2𝑚𝑙2 Algorithm IV-3Symmetric 2𝑚2𝑙 + 3𝑚𝑙2 Table 5.2: The Complexity of Block-Row Algorithms
Stage Name #flops #words
Sketching Algorithm I-2Gaussian Projection 2𝑁𝑃𝑛𝑚𝑙 -Orthogonalization Algorithm II-4TSQR 4𝑁𝑃𝑚𝑙2 (log 𝑃 )𝑁𝑙2
Algorithm II-3Gramian 3𝑁𝑃𝑚𝑙2 (log 𝑃 )𝑁𝑙2
Integration
Algorithm III-10KN 1
𝑃(4𝑁 + 4)𝑚𝑙2𝐼 (log 𝑃 )𝑁𝑙2𝐼 Algorithm III-6KN (Gramian) 1
𝑃(𝑁2+ 2𝑁 + 2)𝑚𝑙2 (log 𝑃 )𝑁2𝑙2
Algorithm III-11WY 1
𝑃(4𝑁 + 6)𝑚𝑙2𝐼 (log 𝑃 )𝑁𝑙2𝐼 Algorithm III-7WY (Gramian) 1
𝑃(𝑁2+ 2𝑁 + 2)𝑚𝑙2 (log 𝑃 )𝑁2𝑙2 Table 5.3: The Complexity of Block-Column Algorithms
Stage Name #flops #words
Sketching Algorithm I-3Gaussian Projection 2𝑁𝑃𝑛𝑚𝑙 𝑁𝑚𝑙
Postprocessing
Chapter 6
Implementation
The iSVD algorithm is implemented in C++ with more than 20000 lines of codes. The package provides multinode, multithread, and GPU support. We also use some tools for automatic installation and correctness test.
6.1 C++ Implementation
We design and implement iSVD package in C++ with objective oriented programming model. It is paralleled using MPI for multicore clusters. Intel Math Kernel Library [10]
is used for BLAS/LAPACK [11] routines. The package is highly benefited from adopt-ing object-oriented programmadopt-ing, C++11 standard [12], and template meta-programmadopt-ing.
We also wrap the BLAS/LAPACK routines that delegate the selection of the correct type of routines to the compiler. With these techniques, the package can be easily used with different types of input matrix and different iSVD algorithms without modifying exist codes.
6.1.1 Object Oriented Programming
Object-oriented programming (OOP) is a programming paradigm based on objects, which combines data and functions. The benefit to adopt OOP is huge in our implementation.
With OOP, we warp data structures into object, and design linear algebra routines based on these object instead of using raw data directly. This technique makes programming
easier and less fallible. The iSVD algorithms are also implemented with OOP. The user can use them easily without dealing with parameter transfer and memory control.
6.1.2 C++11 Standard
Our implementation benefited plenty from C++11 standard [12]. This new standard is a great improvement from C++03, with many new features such as right value reference helps developer manipulate temporary objects generated by some syntax and avoid per-formance traps, constexpr specifier allows the compiler to do more compile-time de-cision and gain better optimization. The standard library of C++11 provided a smart pointer classes such asshared_ptrwith features such as automatic memory management or bounds checking. All of the data in our package are controlled by smart pointers. One no need to concerned about the dynamic memory allocations.
6.1.3 Template Meta-Programming
Template, one of an important feature of C++, allows function and classes to operate with generic types. In iSVD Package, almost all the codes, such as storage structures, linear algebra functions, and iSVD drivers, are implemented with template. With template usage, we can combine write a single code for all data types and storage layout instead of having multiple similar codes.
The implementation only contains header files. One can easily use it without any binary library to link to, and any configured header file. Our package is a pure template library defined in the headers.
6.1.4 Curiously Recurring Template Pattern
The curiously recurring template pattern (CRTP) was formalized in the 1980s as “F-bounded quantification” [13], and was named by Jim Coplien [14] in 1995. CRTP is an idiom in C++ in which a classXderives from a class template instantiation usingXitself as template argument [15]. With CRTP, we implement interfaces without virtual functions to reduce run-time overheads.
6.1.5 Data Storage and Linear Algebra Wrappers
We implement the matrix structures using OOP, and implement methods so that one may use it similar to the usage of MATLAB matrices. The expressions such as transpose, sub-matrix access, and data type changes are done in compile time. We also warp BLAS/LAPACK routines with our data structures using template to decide the correct type of routines in compile-time. These wrappers are written in intuitive interfaces. With above these tech-nique, one does not need to concern for the complicated matrix memory structure and BLAS/LAPACK routines.
6.2 Parallelism
6.2.1 MPI
The support of MPI allows iSVD package to access more computing resources. In Chap-ter 4, we discussed the parallelism using multiple computing nodes. Like BLAS/LAPACK, MPI routines are also wrapped using template for convenience.
6.2.2 OpenMP
The Intel Math Kernel Library [10] provided multi-threaded BLAS/LAPACK libraries based on OpenMP threading. However, since the vector statistics library of MKL does not support multi-threading, we use OpenMP [16] to parallel the random generating and gain near-linear acceleration.
6.2.3 GPU
The support of GPU accelerates the BLAS-3 matrix-matrix multiplication of the input huge matrix𝑨. The MAGMA library [9] provides efficient linear algebra routines imple-mented on GPU. Our implementation benefited plenty from MAGMA library and gains further 4X faster timing performance.
Gaussian Projection Sketching
Canonical Orthogonalization
Gramian Orthogonalization
Tall-Skinny QR Orthogonalization
Kolmogorov-Nagumo Integration
Wen-Yin Integration
Hierarchical Reduction Integration
Canonical Postprocessing
Gramian Postprocessing
Tall-Skinny QR Postprocessing
Symmetric Postprocessing Naïve Parallelism
Block-Row Parallelism Block-Column Parallelism GPU
To be Implemented
Figure 6.1: The Table of Implemented Algorithms
6.3 Tools
6.3.1 CMake
CMake is a cross-platform tool to build and test package. It allows the user to install the package without modifying any configuration codes. It can automatically determine the compile flags for linking libraries. With CMake, one can use a simple GUI to choose the library, and switch the options in our package.
6.3.2 Google Test
iSVD uses Google Test, a unit testing library for C++, to test the correctness of the pack-age. We also combine it with CMake’s test system. It allows us to make sure that each modification of our code does not harm existing features, and allows the user to check if his/her system is compatible with our code.
6.4 Package Structure
Figure 6.1 lists the algorithms implemented in iSVD package. The red blocks are the sketching algorithms, the orange ones are the orthogonalization algorithms, the yellow ones are the integration algorithms, and the green ones are the postprocessing algorithms.
The light blue stickers represent that the algorithm is implemented with Naïve Parallelism, the dark blue ones represent the Block-Row Parallelism, and the purple ones represent the Block-Column Parallelism. The NVIDIA logos represent that the algorithm has GPU support. The translucent items are not finished yet but planned to be done in the future updates. We also provide some routines for converting data between naïve storage, block-row storage, and block-column storage.
The Canonical Orthogonalization and the Canonical Postprocessing are implemented in neither Block-Row Parallelism nor Block-Column Parallelism since the canonical QR and SVD are difficult to be paralleled. Instead, we provide Tall-Skinny QR as the block-row/column version. The Block-Column Parallelism is only available in the sketching and postprocessing stages since they are the only stages that contain the short and fat matrix 𝑨. (SeeSection 4.5for detail)
Chapter 7
Numerical Results
We use two computer systems to test our program. The single node tests use a computer with two Intel Xeon E5-2650 v4 CPU (Broadwell-EP 2.20GHz, 12cores, 30MB L3-cache) and 512GB RAM (DDR4-2400, 153.6 GB/sec). We also use the Reedbush supercomputer system of the University of Tokyo. Each node of the Reedbush-U has two Intel Xeon E5-2695 v4 CPU (Broadwell-EP 2.10GHz, 18cores, 45MB L3-cache) with 256GB RAM (DDR4-2400, 153.6 GB/sec). The connection between nodes uses InfiniBand EDR 4x (100 Gbps/node). The Reedbush-H is an extension of the Reedbush-U. It contains two NVIDIA Tesla P100 GPU (16 GB Memory) per node.
In this chapter, we first (inSection 7.1) compare the Naïve Parallelism and the Block-Row Parallelism to show that the advantage of the Block-Block-Row Parallelism. InSection 7.2, we show the scalability of the integration algorithms. Last, we display out implementation and apply it on two real big data application inSection 7.3.
7.1 Comparison of Parallelisms
In this section, we compare the Naïve Parallelism (Algorithm III-8) and the Block-Row Parallelism (Algorithm III-10) on the Reedbush-U supercomputer system1. Here we use Kolmogorov-Nagumo Integration as an example since the algorithm structure of the
in-1Each node contains Intel Xeon E5-2695 v4 CPU ×2 (Broadwell-EP 2.10GHz, 18cores, 45MB L3-cache), 256GB RAM (DDR4-2400, 153.6 GB/sec), InfiniBand EDR 4x (100 Gbps/node).
Seconds
Naïve Block-Row Naïve Block-Row Naïve Block-Row Naïve Block-Row
0.754
Naïve Block-Row Naïve Block-Row Naïve Block-Row Naïve Block-Row
0.69
(b) Iteration Part of Kolmogorov-Nagumo Integration
(b) Iteration Part of Kolmogorov-Nagumo Integration