7 Wen-Yin Integration (Gramian) - Complexity of Postprocessing (Block-Column Parallelism)

B.21 Complexity of Postprocessing (Block-Column Parallelism)

III- 7 Wen-Yin Integration (Gramian)

Require: 𝑸_[1], 𝑸_[2], … , 𝑸_[𝑁](real𝑚×𝑙 orthogonal matrices), 𝑸_init(initial guess),𝜏₀≥ 0 (initial step size), 𝛽 ∈ (0, 1) (scaling parameter for step size searching), 𝜎 ∈ (0, 1) (parameter for step size searching), 𝜂 ∈ (0, 1) (parameter for next step searching), 𝜏_max, 𝜏_min(maximum and minimum predicting step size).

Ensure: Integrated orthogonal basis𝑸_opt.

1: Combine𝕼 = [𝑸^[1] 𝑸_[2] ⋯ 𝑸_[𝑁]].

2: Compute𝕭 = 𝕼^⊤𝕼.

3: Initialize𝑩_𝑐 ← 𝕼^⊤𝑸_init, ̃𝑭 ← 𝑰_𝑙, ̃𝑬 ← 𝑶_{𝑁𝑙×𝑙},𝜏_𝑔 ← 𝜏₀,𝜁 ← 1, 𝜙 ← _2𝑁¹ ‖𝑩𝑐‖²_𝐹.

4: Assign𝑫_𝑐 ← _𝑁¹𝑩^⊤_𝑐𝑩_𝑐.

5: while (not convergent) do

6: Assign𝑩_𝑐^𝑔 = _𝑁¹𝕭𝑩_𝑐and𝑫_𝑐^𝑔 ← _𝑁¹𝑩_𝑐^⊤𝑩^𝑔_𝑐.

7: Compute𝜇 ← tr(𝑫^𝑐^𝑔) − ‖𝑫𝑐‖²_𝐹.

8: repeat for𝑡 = 0, 1, … do

9: Let𝜏 = 𝜏_𝑔𝛽^𝑡. Compute𝑪 byeq. (3.18)and find its inverse.

10: Compute𝑭_𝑐and𝑭_𝑐^𝑔byeq. (3.21).

11: Assign𝑩₊ ← 𝑩_𝑐𝑭_𝑐+ 𝑩_𝑐^𝑔𝑭_𝑐^𝑔.

12: Assign𝜙 ←̃ _2𝑁¹ ‖𝑩+‖²_𝐹.

13: until𝜙 ≥ 𝜙 + 𝜏𝜎𝜇̃

14: Assign𝑫₊ ← _𝑁¹𝑩^⊤₊𝑩₊ and𝑬_𝑐 ← _𝑁¹𝑩_𝑐𝑭_𝑐^𝑔.

15: Update𝜙 ← ^{𝜂𝜁 𝜙+ ̃}_{𝜂𝜁 +1}^𝜙 and then𝜁 ← 𝜂𝜁 + 1.

16: Update ̃𝑭 ← ̃𝑭 𝑭_𝑐and ̃𝑬 ← ̃𝑬𝑭_𝑐+ 𝑬_𝑐.

17: Update𝜏_𝑔 ← max (min (𝜏guess, 𝜏_max) , 𝜏min) usingeq. (3.30), where

𝜏_guess= tr(𝜟^⊤1𝜟₁)

|tr(𝜟^⊤₁𝜟₂)|

or |tr(𝜟^⊤₁𝜟₂)|

tr(𝜟^⊤2𝜟₂) .

18: Update𝑩_𝑐 ← 𝑩₊and𝑫_𝑐 ← 𝑫₊.

19: end while

20: Output𝑸_opt ← 𝑸_init𝑭 + 𝕼 ̃̃ 𝑬.

Chapter 4 Parallelism

In this chapter, we use𝑃 as the number of processors, and assume that 𝑁 = 𝑃 for sim-plicity. For 𝑁 > 𝑃 , one may just handle more matrices in a processor. Note that in the Block-Row/Column Parallelism,𝑁 and 𝑃 are independent so that one may use more processors.

We propose three parallelism ideas — Naïve Parallelism, Block-Row Parallelism, and Block-Column Parallelism. The Block-Row/Column Parallelism split the matrix into row/column-blocks. These data structures are similar to the one-dimensional block row/column distribution in ScaLAPACK [7]. These block parallelism ideas are proposed to reduce communication cost. Precisely, the Block-Row Parallelism is targeted to the problems with near-square or square𝑨, and the Block-Column Parallelism is specialized for short and fat𝑨. All of the block-row/column algorithms are balanced, so that the parallel over-heads are limited.

4.1 Naïve Parallelism

Sketching and orthogonalization can be easily parallelized by using 𝑁 processors — computes𝜴_[𝑖], 𝒀_[𝑖], 𝑸_[𝑖] in the𝑖-th processor — without any communication. Also, the Kolmogorov-Nagumo Integration (Algorithm III-1) can also be easily parallelized using the same idea.

Algorithm III-8 Kolmogorov-Nagumo Integration (Naïve Parallelism) Require: 𝑸_[1], 𝑸_[2], … , 𝑸_[𝑁], 𝑸_init.

Ensure: Integrated orthogonal basis𝑸_opt.

1: Initialize the current iterate𝑸_𝑐 ← 𝑸_init.

2: while (not convergent) do

3: Assign𝑩_[𝑖] ← 𝑸^⊤_[𝑖]𝑸_𝑐in processor𝑖.

4: Assign𝑿_[𝑖] ← _𝑁¹ (𝑸^[𝑖]𝑩_[𝑖]− 𝑸_𝑐𝑩_[𝑖]^⊤𝑩_[𝑖]) in processor 𝑖.

5: Sum𝑿 ← ∑^𝑁_𝑖=1𝑿_[𝑖]to all processors (MPI_Allreduce).

6: Compute𝑪 ← (

𝑰

2 + (^𝑰4 − 𝑿^⊤𝑿)^1/2)

1/2

7: Update𝑸_𝑐 ← 𝑸_𝑐𝑪 + 𝑿𝑪⁻¹.

8: end while

9: Output𝑸_opt ← 𝑸_𝑐.

Similarly, the Hierarchical Reduction Integration (Algorithm III-3) can be implemented as follow. However, the algorithm is unbalanced due to the reduction structure.

Algorithm III-9 Hierarchical Reduction Integration (Naïve Parallelism) Require: 𝑸_[1], 𝑸_[2], … , 𝑸_[𝑁](real𝑚 × 𝑙 orthogonal matrices).

Ensure: Integrated orthogonal basis𝑸_hr.

1: Set ̃𝑁 ← 𝑁.

2: while ̃𝑁 > 1 do

3: Setℎ ← ⌊^𝑁2^̃⌋

4: Transfer𝑸_[𝑖+ℎ]from processor𝑖 + ℎ to processor 𝑖 (MPI_SendandMPI_Receive).

5: if the processor ID𝑖 ≤ ℎ then

6: Compute𝑾 𝑺𝑻^⊤ ← svd(𝑸^⊤^[𝑖]𝑸_[𝑖+ℎ]) in processor 𝑖.

7: Update𝑸_[𝑖] ← (𝑸[𝑖]𝑾 + 𝑸_[𝑖+ℎ]𝑻 ) (2(𝑰 + 𝑺))^−1/2 in processor𝑖.

8: end if

9: Update ̃𝑁 ← ⌈^𝑁^̃2⌉.

10: end while

11: Output𝑸_hr ← 𝑸_[1]in processor1.

4.2 Block-Row Parallelism for Integration

We proposed block-row algorithms for lower communication cost. The Naïve Parallelism (Algorithm III-8) requires an𝑚×𝑙 data communication (Step 5) in each iteration. It will be

very slow when𝑚 is huge. Therefore, we propose a new algorithm base on the following

InAlgorithm III-10(Block-Row Parallelism of the Optimized Kolmogorov-Nagumo Integration,Algorithm III-4), we split the matrices into𝑚_𝑏× 𝑙 row-blocks (𝑚_𝑏 = 𝑚/𝑃 ) and host them in different processors.

𝕼 =

Algorithm III-10 Kolmogorov-Nagumo Integration (Block-Row Parallelism)

Require: 𝑸_[1], 𝑸_[2], … , 𝑸_[𝑁], 𝑸_init (real𝑚 × 𝑙 orthogonal matrices), 𝑃 (number of pro-cessors).

Ensure: Integrated orthogonal basis𝑸_opt.

1: Rearrange𝑸_[𝑖]to𝔔^(𝑗)in processor𝑗 (MPI_Alltoall).

2: Initialize the current iterate𝘘_𝑐^(𝑗) ← 𝘘_init^(𝑗) in processors𝑗.

3: Sum𝑩_𝑐 ← ∑^𝑃_𝑗=1𝔔^(𝑗)⊤𝘘_𝑐^(𝑗)to all processors (MPI_Allreduce).

4: while (not convergent) do

5: Assign𝘎_𝑐^(𝑗)← _𝑁¹𝔔^(𝑗)𝑩_𝑐 in processor𝑗.

Similarly, we propose Algorithm III-11 (Block-Row Parallelism of the Optimized Wen-Yin Integration,Algorithm III-5) using the block-row idea by splitting the following matrices.

Algorithm III-11 Wen-Yin Integration (Block-Row Parallelism)

Ensure: Integrated orthogonal basis𝑸_opt.

1: Rearrange𝑸_[𝑖]to𝔔^(𝑗)in processor𝑗 (MPI_Alltoall).

2: Initialize the current iterate𝘘_𝑐^(𝑗) ← 𝘘_init^(𝑗) in processors𝑗.

3: Sum𝑩_𝑐 ← ∑^𝑃_𝑗=1𝔔^(𝑗)⊤𝘘_𝑐^(𝑗)to all processors (MPI_Allreduce).

4: Assign𝑫_𝑐 ← _𝑁¹𝑩^⊤_𝑐𝑩_𝑐,𝘎_𝑐^(𝑗) ← _𝑁¹𝔔^(𝑗)𝑩_𝑐, and𝘟_𝑐^(𝑗) ← 𝘎_𝑐^(𝑗)− 𝘘_𝑐^(𝑗)𝑫_𝑐in processor𝑗.

5: Initialize𝜏_𝑔 ← 𝜏₀,𝜁 ← 1, 𝜙 ← _2𝑁¹ ‖𝑩𝑐‖²_𝐹.

6: while (not convergent) do

7: Sum𝑩_𝑐^𝑔 ← ∑^𝑃_𝑗=1𝔔^(𝑗)⊤𝘎_𝑐^(𝑗)to all processors (MPI_Allreduce).

8: Assign𝑫_𝑐^𝑔 ← _𝑁¹𝑩_𝑐^⊤𝑩_𝑐^𝑔.

9: Compute𝜇 ← tr(𝑫^𝑐^𝑔) − ‖𝑫𝑐‖²_𝐹.

10: repeat for𝑡 = 0, 1, … do

11: Let𝜏 = 𝜏_𝑔𝛽^𝑡. Compute𝑪 byeq. (3.18)and find its inverse.

12: Compute𝑭_𝑐and𝑭_𝑐^𝑔byeq. (3.21).

13: Assign𝑩₊ ← 𝑩_𝑐𝑭_𝑐+ 𝑩_𝑐^𝑔𝑭_𝑐^𝑔.

14: Assign𝜙 ←̃ _2𝑁¹ ‖𝑩+‖²_𝐹.

15: until𝜙 ≥ 𝜙 + 𝜏𝜎𝜇̃

16: Update𝜙 ← ^{𝜂𝜁 𝜙+ ̃}_{𝜂𝜁 +1}^𝜙 and then𝜁 ← 𝜂𝜁 + 1.

17: Assign 𝑫₊ ← _𝑁¹𝑩^⊤₊𝑩₊, 𝘘₊^(𝑗) ← 𝘘_𝑐^(𝑗)𝑭_𝑐 + 𝘎_𝑐^(𝑗)𝑭_𝑐^𝑔, 𝘎₊^(𝑗) ← _𝑁¹𝔔^(𝑗)𝑩₊, 𝘟₊^(𝑗) ← 𝘎₊^(𝑗)− 𝘘₊^(𝑗)𝑫₊ in processor𝑗.

18: Compute the differences Δ^(𝑗)₁ = 𝘘₊^(𝑗)− 𝘘_𝑐^(𝑗)and Δ^(𝑗)₂ = 𝘟₊^(𝑗)− 𝘟_𝑐^(𝑗) in processor𝑗.

19: Sum 𝑡₁₁ ← ∑^𝑃_𝑗=1tr(Δ^(𝑗)¹ ^⊤Δ^(𝑗)₁ ), 𝑡¹² ← ∑^𝑃_𝑗=1tr(Δ^(𝑗)¹ ^⊤Δ^(𝑗)₂ ), and 𝑡²² ←

∑^𝑃_𝑗=1tr(Δ^(𝑗)² ^⊤Δ^(𝑗)₂ ) to all processors (MPI_Allreduce).

20: Update𝜏_𝑔 ← max (min (𝜏guess, 𝜏_max) , 𝜏min), where 𝜏_guess = 𝑡₁₁

𝑡₁₂ or 𝑡₁₂ 𝑡₂₂.

21: Update𝘘_𝑐^(𝑗) ← 𝘘₊^(𝑗),𝘎_𝑐^(𝑗) ← 𝘎₊^(𝑗),𝘟_𝑐^(𝑗)← 𝘟₊^(𝑗),𝑩_𝑐 ← 𝑩₊,𝑫_𝑐 ← 𝑫₊in processor 𝑗.

22: end while

23: Gather𝑸_opt ← [𝘘^𝑐⁽¹⁾^⊤ 𝘘_𝑐⁽²⁾^⊤ ⋯ 𝘘_𝑐^{(𝑃 )}^⊤]

⊤

(MPI_Gather).

Algorithm III-12 (Block-Row Parallelism of the Hierarchical Reduction Integration, Algorithm 3) also use the block-row idea. Unlike the Naïve Parallelism (Algorithm III-9), this algorithm is balanced.

Algorithm III-12 Hierarchical Reduction Integration (Block-Row Parallelism)

Require: 𝑸_[1], 𝑸_[2], … , 𝑸_[𝑁](real𝑚×𝑙 orthogonal matrices), 𝑃 (number of processors).

Ensure: Integrated orthogonal basis𝑸_hr.

1: Rearrange𝑸_[𝑖]to𝔔^(𝑗)in processor𝑗 (MPI_Alltoall).

4.3 Block-Row Parallelism for Gramian Integration

With the improvement on the Gramian integration (Algorithm III-6 and Algorithm III-7), we only need to handle tiny matrices (with size𝑙 or 𝑁𝑙) in the iteration, which cost 𝑂(𝑁²𝑙³) ⋅ #Iter flops only. Since 𝑙 is extremely small, the iteration does not need paral-lelism. Therefore, we focus computing𝕭 = 𝕼^⊤𝕼. To reduce data communication, we use the block-row idea. Define

𝕼 =

We proposed several subroutines by splitting

Subroutines S-1toS-3are used before the iteration, andSubroutines S-4toS-5are used after the iteration. Furthermore, if we pick the𝑸_[1]as the initial guess (i.e.𝑸_init = 𝑸_[1]), we can skipSubroutine S-3and copy𝑩_𝑐from the first column-block of𝕭.

Subroutine S-1 Matrices Rearrangement (Block-Row Parallelism) Require: 𝑸_[𝑖],𝑸_init,𝑃 (number of processors).

Ensure: 𝔔^(𝑗),𝘘_init^(𝑗).

1: Split the𝑸_[𝑖] into𝑃 row-blocks and send 𝘘_[𝑖]^(𝑗) to processor𝑗 (MPI_Alltoall).

2: Combine𝔔^(𝑗)= [𝘘[1]^(𝑗) 𝘘_[2]^(𝑗) ⋯ 𝘘_[𝑁]^(𝑗)] in processor 𝑗.

3: Split the𝑸_init into𝑃 row-blocks and send 𝘘_init^(𝑗) to processor𝑗 (MPI_Scatter).

Subroutine S-2 Computing𝕭 = 𝕼^⊤𝕼 (Block-Row Parallelism) Require: 𝔔^(𝑗).

Ensure: 𝕭.

1: Sum𝕭 ← ∑^𝑃_𝑗=1𝔔^(𝑗)⊤𝔔^(𝑗)to all processors (MPI_Allreduce).

Subroutine S-3 Computing𝑩_𝑐 = 𝕼^⊤𝑸_init (Block-Row Parallelism) Require: 𝔔^(𝑗),𝘘_init^(𝑗).

Subroutine S-5 Matrices Gathering (Block-Row Parallelism)

(MPI_GatherorMPI_Allgather).

4.4 Block-Row Parallelism for Sketching, Orthogonaliza-tion, and Postprocessing

The block-row idea can be also used on sketching, orthogonalization, and postprocessing.

With the idea, the rearrangement (Subroutine S-1) and the gathering (Subroutine S-5) are unnecessary. Also, splitting the input matrix𝑨 can reduce memory usage so that we can handle larger problems. We split The following matrices into𝑚_𝑏×𝑛 and 𝑚_𝑏×𝑙 row-blocks

𝑨 =

For sketching, we split𝑨 to row-blocks 𝘈^(𝑗) and computes𝘠_[𝑖]^(𝑗) in the𝑗-th processors.

The only communication is sending the random seed to all processors (with𝑞 = 0).

Algorithm I-2 Gaussian Projection Sketching (Block-Row Parallelism)

Require: 𝘈^(𝑗)(real𝑚_𝑏×𝑛 matrix, row-block), 𝑙 (dimension of the sketched column space), 𝑞 (exponent of the power method), 𝑁 (number of random sketches), 𝑃 (number of processors).

Ensure: 𝘠_[𝑖]^(𝑗)(real 𝑚_𝑏 × 𝑙 matrices, row-blocks), where columns of 𝒀_[𝑖] spans a column subspace of𝑨 for 𝑖 = 1, … , 𝑁.

1: Broadcast the random seed to all processors (MPI_Bcast).

2: Generate𝑛 × 𝑙 random matrices 𝜴_[𝑖]using Gaussian normal distribution in all proces-sors.

3: Assign𝘠_[𝑖]^(𝑗) ← 𝘈^(𝑗)𝜴_[𝑖]in processor𝑗.

4: loop𝑞 times do

5: Sum ̃𝒀_[𝑖] ← ∑^𝑃_𝑗=1𝘈^(𝑗)⊤𝘠_[𝑖]^(𝑗)to all processors (MPI_Allreduce).

6: Update𝘠_[𝑖]^(𝑗) ← 𝘈^(𝑗)𝒀̃_[𝑖]in processor𝑗.

7: end loop

We also use the block-row idea on Gramian orthogonalization (Algorithm II-2). Here we only need to communicate𝑙 × 𝑙 matrices.

Algorithm II-3 Gramian Orthogonalization (Block-Row Parallelism)

Require: 𝘠_[𝑖]^(𝑗)(real𝑚_𝑏× 𝑙 matrices, row-blocks), 𝑃 (number of processors).

Ensure: 𝘘_[𝑖]^(𝑗)(real𝑚_𝑏×𝑙 matrices, row-blocks), where columns of 𝑸_[𝑖]are an orthonormal basis of𝒀_[𝑖].

1: Sum𝑴_[𝑖]← ∑^𝑃_𝑗=1𝘠_[𝑖]^(𝑗)^⊤𝘠_[𝑖]^(𝑗)to all processors (MPI_Allreduce).

2: Compute𝑾_[𝑖]𝑺_[𝑖]² 𝑾_[𝑖]^⊤← eig(𝑴[𝑖]).

3: Assign𝘘_[𝑖]^(𝑗) ← 𝘠_[𝑖]^(𝑗)𝑾_[𝑖]𝑺_[𝑖]⁻¹in processor𝑗.

To parallel the Gramian Postprocessing (Algorithm IV-2) and the symmetric postpro-cessing (Algorithm IV-3), we split𝒁 = 𝑨^⊤𝑸 into 𝑛_𝑏× 𝑙 row-blocks

𝒁 =

⎡⎢

⎢⎢

⎢⎣ 𝘡⁽¹⁾ 𝘡⁽²⁾

⋮ 𝘡^{(𝑃 )}

⎤⎥

⎥⎥

⎥⎦

(4.9)

with an extra communication usingMPI_Reduce_scatter_block.

Algorithm IV-4 Gramian Postprocessing (Block-Row Parallelism)

Require: 𝘈^(𝑗)(real𝑚_𝑏× 𝑛 matrix, row-block), 𝘘^(𝑗)(real𝑚_𝑏× 𝑙 orthogonal matrix, row-block),𝑘 (desired rank of approximate SVD), 𝑃 (number of processors).

Ensure: Approximate rank-𝑘 SVD of 𝑨 ≈ ̂𝑼_𝑘𝜮̂_𝑘𝑽̂_𝑘^⊤.

1: Compute𝘈^(𝑗)⊤𝘘^(𝑗)in each processor and sum the𝑗-th row-blocks to 𝘡^(𝑗)in processor 𝑗 (MPI_Reduce_scatter_block).

2: Sum𝑴 ← ∑^𝑃_𝑗=1𝘡^(𝑗)⊤𝘡^(𝑗) to all processors (MPI_Allreduce).

3: Compute ̂𝑾_𝑙𝜮̂_𝑙²𝑾̂_𝑙^⊤ ← eig(𝑴).

4: Extract the largest𝑘 eigen-pairs from ̂𝑾_𝑙, ̂𝜮_𝑙 to obtain ̂𝑾_𝑘, ̂𝜮_𝑘.

5: Assign ̂𝘜_𝑘^(𝑗) ← 𝘘^(𝑗)𝑾̂_𝑘.

6: Assign ̂𝘝_𝑘^(𝑗) ← 𝘡^(𝑗)𝑾̂_𝑘𝜮̂_𝑘⁻¹.

7: Gather ̂𝑼_𝑘 ← [ ̂𝘜_𝑘⁽¹⁾^⊤ 𝘜̂_𝑘⁽²⁾^⊤ ⋯ 𝘜̂_𝑘^{(𝑃 )}^⊤]

⊤

(MPI_Gather).

8: Gather ̂𝑽_𝑘 ← [ ̂𝘝_𝑘⁽¹⁾^⊤ 𝘝̂_𝑘⁽²⁾^⊤ ⋯ 𝘝̂_𝑘^{(𝑃 )}^⊤]

⊤

(MPI_Gather).

Algorithm IV-5 Symmetric Postprocessing (Block-Row Parallelism)

Require: 𝘈^(𝑗)(real𝑚_𝑏× 𝑚 matrix, row-block), 𝘘^(𝑗)(real𝑚_𝑏× 𝑙 orthogonal matrix, row-block),𝑘 (desired rank of approximate SVD), 𝑃 (number of processors).

Ensure: Approximate rank-𝑘 SVD of 𝑨 ≈ ̂𝑼_𝑘𝜮̂_𝑘𝑼̂_𝑘^⊤.

1: Compute𝘈^(𝑗)⊤𝘘^(𝑗)in each processor and sum the𝑗-th row-blocks to 𝘡^(𝑗)in processor 𝑗 (MPI_Reduce_scatter_block).

2: Sum𝑴 ← ∑^𝑃_𝑗=1𝘡^(𝑗)⊤𝘘^(𝑗)to all processors (MPI_Allreduce).

3: Compute ̂𝑾_𝑙𝜮̂_𝑙𝑾̂_𝑙^⊤ ← eig(𝑴).

4: Extract the largest𝑘 eigen-pairs from ̂𝑾_𝑙, ̂𝜮_𝑙 to obtain ̂𝑾_𝑘, ̂𝜮_𝑘.

5: Assign ̂𝘜_𝑘^(𝑗) ← 𝘘^(𝑗)𝑾̂_𝑘.

6: Gather ̂𝑼_𝑘 ← [ ̂𝘜_𝑘⁽¹⁾^⊤ 𝘜̂_𝑘⁽²⁾^⊤ ⋯ 𝘜̂_𝑘^{(𝑃 )}^⊤]

⊤

(MPI_Gather).

Demmel et al. proposed Block-Row Parallelism based Tall Skinny QR (TSQR) [8]

for computing QR decomposition (see Appendix A.2 for detail). Algorithm II-4is the orthogonalization using TSQR.

Table 4.1: Communication Tree of Tall Skinny QR Level

1 1&2 3&4 5&6 7&8 ⋯ 2 1&3 2&4 5&7 6&8 ⋯ 3 1&5 2&6 3&7 4&8 ⋯

⋮ ⋮ ⋮ ⋮ ⋮ ⋱

Algorithm II-4 TSQR [8] Orthogonalization (Block-Row Parallelism) Require: 𝘠_[𝑖]^(𝑗)(real𝑚_𝑏× 𝑙 matrix, row-block), 𝑃 (number of processors).

Ensure: 𝘘_[𝑖]^(𝑗)(real𝑚_𝑏×𝑙 matrices, row-blocks), where columns of 𝑸_[𝑖]are an orthonormal basis of𝒀_[𝑖].

1: Denote𝐻 = ⌈log₂𝑃 ⌉.

2: Compute𝘘_[𝑖],0^(𝑗) 𝘙_[𝑖],0^(𝑗) ← qr(𝘠^[𝑖]^(𝑗)), where 𝘘

(𝑗)

[𝑖],0 ∈ ℝ^𝑚^𝑏^×𝑙and𝘙_[𝑖],0^(𝑗) ∈ ℝ^𝑙×𝑙.

3: for𝑡 = 1 to 𝐻 do

4: if processor𝑗 has neighbor at level 𝑡 then

5: Combine𝘠_[𝑖],𝑡^{(𝑗 ̌}^𝑗) from𝘙_{[𝑖],𝑡−1}^(𝑗) and𝘙_{[𝑖],𝑡−1}^{( ̌}^𝑗) , where ̌𝑗 denote the neighbor processor (seeTable 4.1). (MPI_Sendrecv).

6: Compute𝘘_[𝑖],𝑡^{(𝑗 ̌}^𝑗)𝘙_[𝑖],𝑡^{(𝑗 ̌}^𝑗) ← qr(𝘠^[𝑖],𝑡^{(𝑗 ̌}^𝑗)), where 𝘘^[𝑖],𝑡^{(𝑗 ̌}^𝑗) ∈ ℝ^2𝑙×𝑙 and𝘙_[𝑖],𝑡^{(𝑗 ̌}^𝑗) ∈ ℝ^𝑙×𝑙.

7: Assign 𝘘_[𝑖],𝑡^(𝑗) as the correspond row-block of 𝘘_[𝑖],𝑡^{(𝑗 ̌}^𝑗) (upper/lower block if 𝑗 is smaller/larger than ̌𝑗).

8: Denote𝘙_[𝑖],𝑡^(𝑗) ← 𝘙_[𝑖],𝑡^{(𝑗 ̌}^𝑗).

9: else

10: Assign𝘙_[𝑖],𝑡^(𝑗) ← 𝘙_{[𝑖],𝑡−1}^(𝑗) and𝘘_[𝑖],𝑡 ← 𝑰_𝑙 (the latter is stored implicitly).

11: end if

12: end for

13: Output𝘘_[𝑖]^(𝑗) ← 𝘘_[𝑖],0^(𝑗) 𝘘_[𝑖],1^(𝑗) ⋯ 𝘘_[𝑖],𝐻^(𝑗) .

Also, we can compute the SVD of 𝒁 = 𝑨^⊤𝑸 by applying in TSQR and computing the small SVD of the upper triangular part. Using the idea, we proposeAlgorithm IV-6.

Algorithm IV-6 TSQR [8] Postprocessing (Block-Row Parallelism)

Require: 𝘈^(𝑗)(real𝑚_𝑏× 𝑛 matrix, row-block), 𝘘^(𝑗)(real𝑚_𝑏× 𝑙 orthogonal matrix, row-block),𝑘 (desired rank of approximate SVD), 𝑃 (number of processors).

Ensure: Approximate rank-𝑘 SVD of 𝑨 ≈ ̂𝑼_𝑘𝜮̂_𝑘𝑽̂_𝑘^⊤.

1: Denote𝐻 = ⌈log₂𝑃 ⌉.

2: Compute𝘈^(𝑗)⊤𝘘^(𝑗)in each processor and sum the𝑗-th row-blocks to 𝘡^(𝑗)in processor 𝑗 (MPI_Reduce_scatter_block).

3: Compute𝘝₀^(𝑗)𝘙₀^(𝑗) ← qr(𝘡^(𝑗)), where 𝘝₀^(𝑗) ∈ ℝ^𝑚^𝑏^×𝑙and𝘙₀^(𝑗)∈ ℝ^𝑙×𝑙.

4: for𝑡 = 1 to 𝐻 do

5: if processor𝑗 has neighbor at level 𝑡 then

6: Combine ̃𝘡_𝑡 from 𝘙_𝑡−1^(𝑗) and 𝘙_𝑡−1^{( ̌}^𝑗), where 𝑗 denote the neighbor processor (seě Table 4.1). (MPI_Sendrecv).

7: Compute ̃𝘝_𝑡𝘙_𝑡 ← qr( ̃𝘡_𝑡), where𝘝̃_𝑡 ∈ ℝ^2𝑙×𝑙 and𝘙_𝑡 ∈ ℝ^𝑙×𝑙.

8: Assign 𝘝_𝑡^(𝑗) as the correspond row-block of ̃𝘝_𝑡 (upper/lower block if 𝑗 is smaller/larger than ̌𝑗).

9: else

10: Assign𝘙_𝑡 ← 𝘙_𝑡−1and𝘝_𝑡 ← 𝑰_𝑙 (the latter is stored implicitly).

11: end if

12: end for

13: Compute ̂𝑾_𝑙𝜮̂_𝑙𝑻̂_𝑙^⊤ ← svd(𝘙𝐻^⊤).

14: Extract the largest𝑘 singular-pairs from ̂𝑾_𝑙, ̂𝜮_𝑙, ̂𝑻_𝑙 to obtain ̂𝑾_𝑘, ̂𝜮_𝑘, ̂𝑻_𝑘.

15: Assign ̂𝘜_𝑘^(𝑗) ← 𝘘^(𝑗)𝑾̂_𝑘.

16: Assign ̂𝘝_𝑘^(𝑗) ← 𝘝₀^(𝑗)𝘝₁^(𝑗)⋯ 𝘝_𝐻^(𝑗)𝑻̂_𝑘.

17: Gather ̂𝑼_𝑘 ← [ ̂𝘜_𝑘⁽¹⁾^⊤ 𝘜̂_𝑘⁽²⁾^⊤ ⋯ 𝘜̂_𝑘^{(𝑃 )}^⊤]

⊤

(MPI_Gather).

18: Gather ̂𝑽_𝑘 ← [ ̂𝘝_𝑘⁽¹⁾^⊤ 𝘝̂_𝑘⁽²⁾^⊤ ⋯ 𝘝̂_𝑘^{(𝑃 )}^⊤]

⊤

(MPI_Gather).

4.5 Block-Column Parallelism

The Block-Column Parallelism is applied only on the input matrix𝑨 since it is the only short and fat matrix in the algorithm. Therefore, we only apply Block-Column Parallelism on sketching and postprocessing since𝑨 is only used in this two stages.

For the cases with extremely large𝑛,Step 1inAlgorithm IV-4(Block-Row Parallelism of the Gramian Postprocessing) cost𝑛𝑙 words communication. For 𝑚 ≪ 𝑛 (more precisely,

𝑁𝑚 < 𝑛), we choose to parallelize the input matrix 𝑨 into column blocks

𝑨 = [𝘈^⟨1⟩ 𝘈^⟨2⟩ ⋯ 𝘈^{⟨𝑃 ⟩}] , (4.10)

and useMPI_Reduce_scatter_blockto convert the result into row-blocks. HereStep 2in Algorithm I-3only cost𝑁𝑚𝑙 words. (SeeChapter 5for more detail.)

Algorithm I-3 Gaussian Projection Sketching (Block-Column Parallelism)

Require: 𝘈^⟨𝑗⟩(real𝑚 × 𝑛_𝑏 matrix, column-block),𝑙 (dimension of the sketched column space),𝑁 (number of random sketches), 𝑃 (number of processors).

Ensure: 𝘠_[𝑖]^(𝑗)(real 𝑚_𝑏 × 𝑙 matrices, row-blocks), where columns of 𝒀_[𝑖] spans a column subspace of𝑨 for 𝑖 = 1, … , 𝑁.

1: Generate𝑛_𝑏× 𝑙 random matrices Ω^(𝑗)_[𝑖] using Gaussian normal distribution in all pro-cessors (with different random seed).

2: Compute𝘈^⟨𝑗⟩Ω^(𝑗)_[𝑖] in each processor and sum the𝑗-th row-blocks to 𝘠_[𝑖]^(𝑗)in processor 𝑗 (MPI_Reduce_scatter_block).

With column blocks𝘈^⟨𝑗⟩,Step 2inAlgorithm IV-7can be done without communica-tion. Through we need to gather𝑸 (usingSubroutine S-5, require𝑚𝑙 words communica-tion) before postprocessing, the communication is still reduced since𝑚 ≪ 𝑛. Furthermore, with column blocks, if one only needs left singular vectors ̂𝑼_𝑘, iSVD can be done without any commutation related to𝑛.

Algorithm IV-7 Gramian Postprocessing (Block-Column Parallelism)

Require: 𝘈^(𝑗) (real𝑚_𝑏 × 𝑛 matrix, column-block), 𝘘^(𝑗) (real𝑚_𝑏 × 𝑙 orthogonal matrix, row-block),𝑘 (desired rank of approximate SVD), 𝑃 (number of processors).

Ensure: Approximate rank-𝑘 SVD of 𝑨 ≈ ̂𝑼_𝑘𝜮̂_𝑘𝑽̂_𝑘^⊤.

1: Gather𝑸 ← [𝘘^(1)⊤ 𝘘^(2)⊤ ⋯ 𝘘^{(𝑃 )⊤}]

⊤

to all processors (MPI_Allgather).

2: Assign𝘡^(𝑗)← 𝘈^{⟨𝑗⟩⊤}𝑸 in processor 𝑗.

3: Sum𝑴 ← ∑^𝑃_𝑗=1𝘡^(𝑗)⊤𝘡^(𝑗) to all processors (MPI_Allreduce).

4: Compute ̂𝑾_𝑙𝜮̂_𝑙²𝑾̂_𝑙^⊤ ← eig(𝑴).

5: Extract the largest𝑘 eigen-pairs from ̂𝑾_𝑙, ̂𝜮_𝑙 to obtain ̂𝑾_𝑘, ̂𝜮_𝑘.

6: Assign ̂𝑼_𝑘 ← 𝑸 ̂𝑾_𝑘.

7: Assign ̂𝘝_𝑘^(𝑗) ← 𝘡^(𝑗)𝑾̂_𝑘𝜮̂_𝑘⁻¹.

8: Gather ̂𝑽_𝑘 ← [ ̂𝘝_𝑘⁽¹⁾^⊤ 𝘝̂_𝑘⁽²⁾^⊤ ⋯ 𝘝̂_𝑘^{(𝑃 )}^⊤]

⊤

(MPI_Gather).

The TSQR Postprocessing (Algorithm IV-8) can also use column block 𝘈^⟨𝑗⟩ in the same way asAlgorithm IV-7.

Algorithm IV-8 TSQR [8] Postprocessing (Block-Column Parallelism)

Ensure: Approximate rank-𝑘 SVD of 𝑨 ≈ ̂𝑼_𝑘𝜮̂_𝑘𝑽̂_𝑘^⊤.

1: Denote𝐻 = ⌈log₂𝑃 ⌉.

2: Gather𝑸 ← [𝘘^(1)⊤ 𝘘^(2)⊤ ⋯ 𝘘^{(𝑃 )⊤}]

⊤

to all processors (MPI_Allgather).

3: Assign𝘡^(𝑗)← 𝘈^{⟨𝑗⟩⊤}𝑸 in processor 𝑗.

4: Compute𝘝₀^(𝑗)𝘙₀^(𝑗) ← qr(𝘡^(𝑗)), where 𝘝₀^(𝑗) ∈ ℝ^𝑚^𝑏^×𝑙and𝘙₀^(𝑗)∈ ℝ^𝑙×𝑙.

5: for𝑡 = 1 to 𝐻 do

6: if processor𝑗 has neighbor at level 𝑡 then

7: Combine ̃𝘡_𝑡 from 𝘙_𝑡−1^(𝑗) and 𝘙_𝑡−1^{( ̌}^𝑗), where 𝑗 denote the neighbor processor (seě Table 4.1). (MPI_Sendrecv).

8: Compute ̃𝘝_𝑡𝘙_𝑡 ← qr( ̃𝘡_𝑡), where𝘝̃_𝑡 ∈ ℝ^2𝑙×𝑙 and𝘙_𝑡 ∈ ℝ^𝑙×𝑙.

9: Assign 𝘝_𝑡^(𝑗) as the correspond row-block of ̃𝘝_𝑡 (upper/lower block if 𝑗 is smaller/larger than ̌𝑗).

10: else

11: Assign𝘙_𝑡 ← 𝘙_𝑡−1and𝘝_𝑡 ← 𝑰_𝑙 (the latter is stored implicitly).

12: end if

13: end for

14: Compute ̂𝑾_𝑙𝜮̂_𝑙𝑻̂_𝑙^⊤ ← svd(𝘙𝐻^⊤).

15: Extract the largest𝑘 singular-pairs from ̂𝑾_𝑙, ̂𝜮_𝑙, ̂𝑻_𝑙 to obtain ̂𝑾_𝑘, ̂𝜮_𝑘, ̂𝑻_𝑘.

16: Assign ̂𝑼_𝑘 ← 𝑸 ̂𝑾_𝑘.

17: Assign ̂𝘝_𝑘^(𝑗) ← 𝘝₀^(𝑗)𝘝₁^(𝑗)⋯ 𝘝_𝐻^(𝑗)𝑻̂_𝑘.

18: Gather ̂𝑽_𝑘 ← [ ̂𝘝_𝑘⁽¹⁾^⊤ 𝘝̂_𝑘⁽²⁾^⊤ ⋯ 𝘝̂_𝑘^{(𝑃 )}^⊤]

⊤

(MPI_Gather).

We also propose the block-column version of the Symmetric Postprocessing (Algo-rithm IV-3). Although the matrix𝑨 is not short and fat (due to the symmetry of 𝑨), since 𝑨 is symmetric, we can use Block-Row Parallelism in sketching and Block-Column Par-allelism in forming to minimize the communication.

Algorithm IV-9 Symmetric Postprocessing (Block-Column Parallelism)

Require: 𝘈^(𝑗) (real𝑚_𝑏× 𝑚 matrix, column-block), 𝘘^(𝑗) (real𝑚_𝑏 × 𝑙 orthogonal matrix, row-block),𝑘 (desired rank of approximate SVD), 𝑃 (number of processors).

Ensure: Approximate rank-𝑘 SVD of 𝑨 ≈ ̂𝑼_𝑘𝜮̂_𝑘𝑼̂_𝑘^⊤.

1: Gather𝑸 ← [𝘘^(1)⊤ 𝘘^(2)⊤ ⋯ 𝘘^{(𝑃 )⊤}]

⊤

to all processors (MPI_Allgather).

2: Assign𝘡^(𝑗)← 𝘈^{⟨𝑗⟩⊤}𝑸 in processor 𝑗.

3: Sum𝑴 ← ∑^𝑃_𝑗=1𝘡^(𝑗)⊤𝘘^(𝑗)to all processors (MPI_Allreduce).

4: Compute ̂𝑾_𝑙𝜮̂_𝑙𝑾̂_𝑙^⊤ ← eig(𝑴).

5: Extract the largest𝑘 eigen-pairs from ̂𝑾_𝑙, ̂𝜮_𝑙 to obtain ̂𝑾_𝑘, ̂𝜮_𝑘.

6: Assign ̂𝑼_𝑘 ← 𝑸 ̂𝑾_𝑘.

4.6 GPU Acceleration

Sketching and postprocessing stages are the only algorithms using the huge input matrix 𝑨, and cost square to the input size (𝑂(𝑚𝑛𝑙) flops). Fortunately, all the operations require 𝑨 are matrix-matrix multiplication, which can be highly accelerated using GPU [9].

We use GPU to compute𝒀_[𝑖] ← 𝑨𝜴_[𝑖]in sketching and𝒁 ← 𝑨^⊤𝑸 in postprocessing stages. For𝑨 with extremely large sizes that exceed the GPU memory limit, we split the matrices into column/row blocks and compute the multiplication successively.

Chapter 5 Comparison of Algorithms

In this chapter, we denote 𝐼 as the number of iteration, and 𝐼_𝑠 as the total number of iteration in finding step size (𝐼_𝑠 ≥ 𝐼). Here we assume 𝑚 ≫ 𝑁𝑙, 𝑙 = 𝑘, and 𝑞 = 0. The following tables only show the leading terms.¹ In the table, we abbreviate the Kolmogorov-Nagumo Iteration, the Wen-Yin Iteration, and the Hierarchical Reduction Integration as KN, WY, and HR respectively.

Tables 5.1to5.3summarize the computational and communication complexity of all the algorithms. For integration, we obtain that the optimized algorithms indeed reduce the computational complexity. We do not recommend the Gramian algorithms, unless one expects the number of iteration to be larger than ^𝑁

4. For orthogonalization and postpro-cessing, we recommend the Gramian algorithms, which are fast and well-parallelized.

By comparing the parallelisms, we obtain that the block-row postprocessing requires an extra𝑛𝑙 communication, and the block-column sketching requires an extra 𝑁𝑚𝑙 com-munication. Therefore we recommend using Block-Column Parallelisms on sketching and postprocessing if and only if the input matrix𝑨 is short and fat enough (that is, 𝑁𝑚 < 𝑛).

1SeeAppendix Bfor detail.

Table 5.1: The Complexity of Sequential Algorithms

Stage Name #flops

Sketching Algorithm I-1Gaussian Projection 2𝑁𝑛𝑚𝑙 Orthogonalization Algorithm II-1Canonical QR 4𝑁𝑚𝑙²

Algorithm II-2Gramian 3𝑁𝑚𝑙²

Integration

Algorithm III-1KN (Original) (4𝑁 + 10)𝑚𝑙²𝐼 Algorithm III-4KN (Optimized) (4𝑁 + 4)𝑚𝑙²𝐼 Algorithm III-6KN (Gramian) (𝑁²+ 2𝑁 + 2)𝑚𝑙² Algorithm III-2WY (Original) (2𝑁 + 8)𝑚𝑙²𝐼_𝑠+ (4𝑁 + 4)𝑚𝑙²𝐼 Algorithm III-5WY (Optimized) (4𝑁 + 6)𝑚𝑙²𝐼

Algorithm III-7WY (Gramian) (𝑁²+ 2𝑁 + 2)𝑚𝑙²

Algorithm III-3HR 6𝑁𝑚𝑙²

Postprocessing

Algorithm IV-1SVD with QR 2𝑛𝑚𝑙 + 6𝑛𝑙²+ 2𝑚𝑙² Algorithm IV-2Gramian 2𝑛𝑚𝑙 + 3𝑛𝑙²+ 2𝑚𝑙² Algorithm IV-3Symmetric 2𝑚²𝑙 + 3𝑚𝑙² Table 5.2: The Complexity of Block-Row Algorithms

Stage Name #flops #words

Sketching Algorithm I-2Gaussian Projection 2^𝑁_𝑃𝑛𝑚𝑙 -Orthogonalization Algorithm II-4TSQR 4^𝑁_𝑃𝑚𝑙² (log 𝑃 )𝑁𝑙²

Algorithm II-3Gramian 3^𝑁_𝑃𝑚𝑙² (log 𝑃 )𝑁𝑙²

Integration

Algorithm III-10KN ¹

𝑃(4𝑁 + 4)𝑚𝑙²𝐼 (log 𝑃 )𝑁𝑙²𝐼 Algorithm III-6KN (Gramian) ¹

𝑃(𝑁²+ 2𝑁 + 2)𝑚𝑙² (log 𝑃 )𝑁²𝑙²

Algorithm III-11WY ¹

𝑃(4𝑁 + 6)𝑚𝑙²𝐼 (log 𝑃 )𝑁𝑙²𝐼 Algorithm III-7WY (Gramian) ¹

𝑃(𝑁²+ 2𝑁 + 2)𝑚𝑙² (log 𝑃 )𝑁²𝑙² Table 5.3: The Complexity of Block-Column Algorithms

Stage Name #flops #words

Sketching Algorithm I-3Gaussian Projection 2^𝑁_𝑃𝑛𝑚𝑙 𝑁𝑚𝑙

Postprocessing

Chapter 6 Implementation

The iSVD algorithm is implemented in C++ with more than 20000 lines of codes. The package provides multinode, multithread, and GPU support. We also use some tools for automatic installation and correctness test.

6.1 C++ Implementation

We design and implement iSVD package in C++ with objective oriented programming model. It is paralleled using MPI for multicore clusters. Intel Math Kernel Library [10]

is used for BLAS/LAPACK [11] routines. The package is highly benefited from adopt-ing object-oriented programmadopt-ing, C++11 standard [12], and template meta-programmadopt-ing.

We also wrap the BLAS/LAPACK routines that delegate the selection of the correct type of routines to the compiler. With these techniques, the package can be easily used with different types of input matrix and different iSVD algorithms without modifying exist codes.

6.1.1 Object Oriented Programming

Object-oriented programming (OOP) is a programming paradigm based on objects, which combines data and functions. The benefit to adopt OOP is huge in our implementation.

With OOP, we warp data structures into object, and design linear algebra routines based on these object instead of using raw data directly. This technique makes programming

easier and less fallible. The iSVD algorithms are also implemented with OOP. The user can use them easily without dealing with parameter transfer and memory control.

6.1.2 C++11 Standard

Our implementation benefited plenty from C++11 standard [12]. This new standard is a great improvement from C++03, with many new features such as right value reference helps developer manipulate temporary objects generated by some syntax and avoid per-formance traps, constexpr specifier allows the compiler to do more compile-time de-cision and gain better optimization. The standard library of C++11 provided a smart pointer classes such asshared_ptrwith features such as automatic memory management or bounds checking. All of the data in our package are controlled by smart pointers. One no need to concerned about the dynamic memory allocations.

6.1.3 Template Meta-Programming

Template, one of an important feature of C++, allows function and classes to operate with generic types. In iSVD Package, almost all the codes, such as storage structures, linear algebra functions, and iSVD drivers, are implemented with template. With template usage, we can combine write a single code for all data types and storage layout instead of having multiple similar codes.

The implementation only contains header files. One can easily use it without any binary library to link to, and any configured header file. Our package is a pure template library defined in the headers.

6.1.4 Curiously Recurring Template Pattern

The curiously recurring template pattern (CRTP) was formalized in the 1980s as “F-bounded quantification” [13], and was named by Jim Coplien [14] in 1995. CRTP is an idiom in C++ in which a classXderives from a class template instantiation usingXitself as template argument [15]. With CRTP, we implement interfaces without virtual functions to reduce run-time overheads.

6.1.5 Data Storage and Linear Algebra Wrappers

We implement the matrix structures using OOP, and implement methods so that one may use it similar to the usage of MATLAB matrices. The expressions such as transpose, sub-matrix access, and data type changes are done in compile time. We also warp BLAS/LAPACK routines with our data structures using template to decide the correct type of routines in compile-time. These wrappers are written in intuitive interfaces. With above these tech-nique, one does not need to concern for the complicated matrix memory structure and BLAS/LAPACK routines.

6.2 Parallelism

6.2.1 MPI

The support of MPI allows iSVD package to access more computing resources. In Chap-ter 4, we discussed the parallelism using multiple computing nodes. Like BLAS/LAPACK, MPI routines are also wrapped using template for convenience.

6.2.2 OpenMP

The Intel Math Kernel Library [10] provided multi-threaded BLAS/LAPACK libraries based on OpenMP threading. However, since the vector statistics library of MKL does not support multi-threading, we use OpenMP [16] to parallel the random generating and gain near-linear acceleration.

6.2.3 GPU

The support of GPU accelerates the BLAS-3 matrix-matrix multiplication of the input huge matrix𝑨. The MAGMA library [9] provides efficient linear algebra routines imple-mented on GPU. Our implementation benefited plenty from MAGMA library and gains further 4X faster timing performance.

Gaussian Projection Sketching

Canonical Orthogonalization

Gramian Orthogonalization

Tall-Skinny QR Orthogonalization

Kolmogorov-Nagumo Integration

Wen-Yin Integration

Hierarchical Reduction Integration

Canonical Postprocessing

Gramian Postprocessing

Tall-Skinny QR Postprocessing

Symmetric Postprocessing Naïve Parallelism

Block-Row Parallelism Block-Column Parallelism GPU

To be Implemented

Figure 6.1: The Table of Implemented Algorithms

6.3 Tools

6.3.1 CMake

CMake is a cross-platform tool to build and test package. It allows the user to install the package without modifying any configuration codes. It can automatically determine the compile flags for linking libraries. With CMake, one can use a simple GUI to choose the library, and switch the options in our package.

6.3.2 Google Test

iSVD uses Google Test, a unit testing library for C++, to test the correctness of the pack-age. We also combine it with CMake’s test system. It allows us to make sure that each modification of our code does not harm existing features, and allows the user to check if his/her system is compatible with our code.

6.4 Package Structure

Figure 6.1 lists the algorithms implemented in iSVD package. The red blocks are the sketching algorithms, the orange ones are the orthogonalization algorithms, the yellow ones are the integration algorithms, and the green ones are the postprocessing algorithms.

The light blue stickers represent that the algorithm is implemented with Naïve Parallelism, the dark blue ones represent the Block-Row Parallelism, and the purple ones represent the Block-Column Parallelism. The NVIDIA logos represent that the algorithm has GPU support. The translucent items are not finished yet but planned to be done in the future updates. We also provide some routines for converting data between naïve storage, block-row storage, and block-column storage.

The Canonical Orthogonalization and the Canonical Postprocessing are implemented in neither Block-Row Parallelism nor Block-Column Parallelism since the canonical QR and SVD are difficult to be paralleled. Instead, we provide Tall-Skinny QR as the block-row/column version. The Block-Column Parallelism is only available in the sketching and postprocessing stages since they are the only stages that contain the short and fat matrix 𝑨. (SeeSection 4.5for detail)

Chapter 7 Numerical Results

We use two computer systems to test our program. The single node tests use a computer with two Intel Xeon E5-2650 v4 CPU (Broadwell-EP 2.20GHz, 12cores, 30MB L3-cache) and 512GB RAM (DDR4-2400, 153.6 GB/sec). We also use the Reedbush supercomputer system of the University of Tokyo. Each node of the Reedbush-U has two Intel Xeon E5-2695 v4 CPU (Broadwell-EP 2.10GHz, 18cores, 45MB L3-cache) with 256GB RAM (DDR4-2400, 153.6 GB/sec). The connection between nodes uses InfiniBand EDR 4x (100 Gbps/node). The Reedbush-H is an extension of the Reedbush-U. It contains two NVIDIA Tesla P100 GPU (16 GB Memory) per node.

In this chapter, we first (inSection 7.1) compare the Naïve Parallelism and the Block-Row Parallelism to show that the advantage of the Block-Block-Row Parallelism. InSection 7.2, we show the scalability of the integration algorithms. Last, we display out implementation and apply it on two real big data application inSection 7.3.

7.1 Comparison of Parallelisms

In this section, we compare the Naïve Parallelism (Algorithm III-8) and the Block-Row Parallelism (Algorithm III-10) on the Reedbush-U supercomputer system¹. Here we use Kolmogorov-Nagumo Integration as an example since the algorithm structure of the

in-1Each node contains Intel Xeon E5-2695 v4 CPU ×2 (Broadwell-EP 2.10GHz, 18cores, 45MB L3-cache), 256GB RAM (DDR4-2400, 153.6 GB/sec), InfiniBand EDR 4x (100 Gbps/node).

Seconds

Naïve Block-Row Naïve Block-Row Naïve Block-Row Naïve Block-Row

0.754

Naïve Block-Row Naïve Block-Row Naïve Block-Row Naïve Block-Row

0.69

(b) Iteration Part of Kolmogorov-Nagumo Integration

在文檔中利用高度平行之演算法整合多重隨機奇異值分解並應用於巨量資料分析 (頁 47-110)