Highly Scalable Parallelism of Integrated Randomized Singular Value Decomposition with Big Data Applications

Institute of Applied Mathematical Sciences, College of Science

National Taiwan University Master Thesis

楊慕 Mu Yang

Advisor: Weichung Wang, Ph.D.

July, 2017


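Acknowledgements (Chinese version)

As this thesis is completed, I must thank everyone who helped me along my research journey; without your help I could not have achieved these results.

The person who helped me most during my research is my advisor, Dr. Weichung Wang. Since I first met him, he has spared no effort in guiding my studies, and I have gained a great deal from him. The door to his office was always open to me: whenever my research or writing made progress or hit a bottleneck, he patiently discussed it with me and gave thorough feedback. Under Prof. Wang's guidance, I had many opportunities to take part in academic conferences and visits and to exchange ideas with experts from various fields, which helped my research considerably. I am also very grateful to Dr. Su-Yun Huang and Dr. Ting-Li Chen of the Institute of Statistical Science, Academia Sinica. Their research laid the foundation of this field; in carrying out this research and writing this thesis, I have been standing on the shoulders of giants.

I sincerely thank my classmate Dawei Chang, whose enthusiastic participation and input greatly assisted my research; our discussions inspired many of my ideas. He also gave me many comments on the writing that greatly improved the quality of the thesis. I also thank the Information Technology Center of the University of Tokyo for providing the Reedbush-U/H supercomputer systems, and Mr. Shih-En Chou for providing the Facebook datasets, which helped me complete the numerical tests in this thesis. I also thank my labmates, who not only helped me a great deal in my research but also gave me the most cheerful research environment and supported me in spirit whenever I hit a bottleneck.

Finally, I thank my family for their support and care over the years, which allowed me to pursue my studies without worry and to concentrate on learning and research. I dedicate this thesis to my beloved family and to everyone who cares about me.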


Acknowledgements

I am glad to express my sincere gratitude to my thesis advisor, Dr. Weichung Wang of the Institute of Applied Mathematical Sciences at National Taiwan University, for his support of my study and research. The door to Prof. Wang's office was always open whenever I ran into a trouble spot or had a question about my research or writing.

I would also like to express my gratitude to Dr. Su-Yun Huang and Dr. Ting-Li Chen at the Institute of Statistical Science of Academia Sinica. Without their insight and expertise, the research could not have been conducted successfully.

My sincere thanks also go to my colleague Dawei Chang for his passionate participation and input, which greatly assisted the research. The discussions with him inspired me a lot. He also gave me many comments that greatly improved the manuscript.

I would also like to thank the Information Technology Center at the University of Tokyo for providing the Reedbush-U/H supercomputer systems, and Mr. Shih-En Chou for providing the Facebook datasets. I also thank my fellow labmates for the stimulating discussions and for all the fun we have had.

Last but not least, I would like to thank my family for supporting me spiritually throughout my life.


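Abstract (Chinese version)

Low-rank approximation plays an important role in big data analysis. Integrated Singular Value Decomposition (iSVD) is an algorithm for computing a low-rank approximate singular value decomposition of a large matrix. iSVD integrates the different low-rank SVDs obtained from multiple random subspace sketches, achieving higher accuracy and better stability. Although the multiple random sketches and the integration process incur a higher computational cost, these operations can be parallelized, so iSVD can still save computing time. We parallelize the algorithm on multicore clusters and modify the computational methods and data structures to increase scalability and reduce data transfer. With parallelization, iSVD can find the approximate SVD of huge matrices, attains near-linear scalability with respect to the matrix size and the number of machines, and achieves a fourfold speedup in the sketching step by using GPUs. We implement the algorithm in C++ and apply several techniques that improve maintainability, extensibility, and usability. We use iSVD to solve several large-scale application problems on a hybrid CPU-GPU supercomputer system.

Keywords: singular value decomposition, parallel algorithms, distributed algorithms, randomized algorithms, graphics processing units, big data analysis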


Abstract

Low-rank approximation plays an important role in big data analysis. Integrated Singular Value Decomposition (iSVD) is an algorithm for computing low-rank approximate singular value decompositions of large matrices. The iSVD integrates different low-rank SVDs obtained from multiple random subspace sketches and achieves higher accuracy and better stability. While iSVD incurs a higher computational cost due to the multiple random sketches and the integration process, these operations can be parallelized to save computational time. We parallelize iSVD for multicore clusters, and modify the algorithms and data structures to increase scalability and reduce communication. With parallelization, iSVD can find the approximate SVD of huge matrices and achieves near-linear scalability with respect to the matrix size and the number of machines; using GPUs makes the sketching stage a further 4x faster. We implement the algorithms in C++, with several techniques for high maintainability, extensibility, and usability. The iSVD is applied to several huge applications on hybrid CPU-GPU supercomputer systems.

Keywords. Singular value decomposition, Parallel algorithms, Distributed algorithms, Randomized algorithms, Graphics processing units, Big data analysis


List of Figures

6.1 The Table of Implemented Algorithms

7.1 Comparison of Naïve and Block-Row Parallelism

7.2 Scalability of Integration Algorithms

7.4 The Time of iSVD using MATLAB and C++

7.5 The Time of iSVD using CPU and GPU


List of Tables

1.1 Notations Used in This Article

3.1 Notations and Formulas Used in Optimization Integrations

4.1 Communication Tree of Tall Skinny QR

5.1 The Complexity of Sequential Algorithms

5.2 The Complexity of Block-Row Algorithms

5.3 The Complexity of Block-Column Algorithms

B.1 Complexity of Canonical Methods

B.2 Complexity of $\mathfrak{Y}$ Orthogonalization

B.3 Complexity of Sketching

B.4 Complexity of Orthogonalization

B.5 Complexity of Integration

B.8 Complexity of Postprocessing

B.9 Complexity of Sketching (Naïve Parallelism)

B.10 Complexity of Sketching (Block-Row Parallelism)

B.11 Complexity of Sketching (Block-Column Parallelism)

B.12 Complexity of Orthogonalization (Naïve Parallelism)

B.13 Complexity of Orthogonalization (Block-Row Parallelism)

B.14 Complexity of Integration (Naïve Parallelism)

B.15 Complexity of Integration (Block-Row Parallelism)

B.19 Complexity of Postprocessing (Block-Row Parallelism)

B.21 Complexity of Postprocessing (Block-Column Parallelism)


List of Algorithms

Main Algorithms

1 Integrated Singular Value Decomposition

Sketching Algorithms

I-1 Gaussian Projection Sketching

I-2 Gaussian Projection Sketching (Block-Row Parallelism)

I-3 Gaussian Projection Sketching (Block-Column Parallelism)

Orthogonalization Algorithms

II-1 Canonical Orthogonalization

II-2 Gramian Orthogonalization

II-3 Gramian Orthogonalization (Block-Row Parallelism)

II-4 TSQR Orthogonalization (Block-Row Parallelism)

Integration Algorithms

III-1 Kolmogorov-Nagumo Integration

III-2 Wen-Yin Integration

III-3 Hierarchical Reduction Integration

III-4 Kolmogorov-Nagumo Integration (Optimized)

III-5 Wen-Yin Integration (Optimized)

III-6 Kolmogorov-Nagumo Integration (Gramian)

III-7 Wen-Yin Integration (Gramian)

III-8 Kolmogorov-Nagumo Integration (Naïve Parallelism)

III-9 Hierarchical Reduction Integration (Naïve Parallelism)

III-10 Kolmogorov-Nagumo Integration (Block-Row Parallelism)

III-11 Wen-Yin Integration (Block-Row Parallelism)

III-12 Hierarchical Reduction Integration (Block-Row Parallelism)

Postprocessing Algorithms

IV-1 Canonical Postprocessing

IV-2 Gramian Postprocessing

IV-3 Symmetric Postprocessing

IV-4 Gramian Postprocessing (Block-Row Parallelism)

IV-5 Symmetric Postprocessing (Block-Row Parallelism)

IV-6 TSQR Postprocessing (Block-Row Parallelism)

IV-7 Gramian Postprocessing (Block-Column Parallelism)

IV-8 TSQR Postprocessing (Block-Column Parallelism)

IV-9 Symmetric Postprocessing (Block-Column Parallelism)

Subroutines

S-1 Matrices Rearrangement (Block-Row Parallelism)

S-2 Computing $\mathfrak{B} = \mathfrak{Q}^\top\mathfrak{Q}$ (Block-Row Parallelism)

S-3 Computing $\boldsymbol{B}_c = \mathfrak{Q}^\top\boldsymbol{Q}_{\text{init}}$ (Block-Row Parallelism)

S-4 Computing $\boldsymbol{Q}_{\text{opt}} = \boldsymbol{Q}_{\text{init}}\tilde{\boldsymbol{F}} + \mathfrak{Q}\tilde{\boldsymbol{E}}$ (Block-Row Parallelism)

S-5 Matrices Gathering (Block-Row Parallelism)


Chapter 1 Introduction

Big data analysis is an important field today. Low-rank approximation is often used for feature selection and dimension reduction, and singular value decomposition (SVD) is an essential tool for finding low-rank approximations. In this article, we consider the rank-$k$ singular value decomposition

$$\boldsymbol{A} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^\top \approx \boldsymbol{U}_k\boldsymbol{\Sigma}_k\boldsymbol{V}_k^\top, \tag{1.1}$$

where $\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^\top$ is the full SVD and $\boldsymbol{U}_k\boldsymbol{\Sigma}_k\boldsymbol{V}_k^\top$ is the truncated rank-$k$ SVD. However, traditional algorithms may take about $12nm^2$ flops (assuming $m < n$) to compute the SVD of an $m \times n$ matrix $\boldsymbol{A}$; this scales poorly and becomes a computational bottleneck when the matrix is large. Randomized singular value decomposition (rSVD) [1,2] reduces the computational cost to $4nmk$ flops. Furthermore, Chen et al. proposed an efficient algorithm, integrated singular value decomposition (iSVD) (Algorithm 1) [3], which integrates the results of multiple rSVDs and achieves higher accuracy.
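To make the gap between the two flop counts concrete, the short Python sketch below evaluates both formulas for one set of sizes. The sizes and the function names `svd_flops` and `rsvd_flops` are hypothetical and chosen only for illustration; the ratio simplifies to $3m/k$.

```python
def svd_flops(m: int, n: int) -> int:
    """Approximate flops of a dense SVD of an m-by-n matrix (m < n): 12 n m^2."""
    return 12 * n * m**2


def rsvd_flops(m: int, n: int, k: int) -> int:
    """Approximate flops of a rank-k randomized SVD: 4 n m k."""
    return 4 * n * m * k


# Hypothetical problem size, not taken from the thesis experiments.
m, n, k = 10_000, 100_000, 100
speedup = svd_flops(m, n) / rsvd_flops(m, n, k)  # algebraically 3 * m / k
print(f"dense SVD / rSVD flop ratio: {speedup:.0f}")
```

For these sizes the ratio is $3m/k = 300$, which shows why the randomized approach pays off when $k \ll m$.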

In this article, we focus on the implementation and application of iSVD on large-scale clusters. The development of supercomputers allows us to handle huge problems. Multithread and multicore parallelization speed up the computations, and the computation also benefits from GPU acceleration on hybrid CPU-GPU architectures. On such machines, data communication becomes an essential problem.

We modify the iSVD algorithms to reuse matrices and thereby reduce the computational cost. For large-scale clusters, we propose a new data structure for the parallel algorithms that reduces data communication, so that the communication cost is independent of the problem size. These algorithms are balanced and weakly scalable. We compare the computational and communication complexity of each algorithm and recommend suitable choices for different shapes of the input matrix. The iSVD is implemented in C++ with several techniques for high maintainability, extensibility, and usability. We test our implementation on the Reedbush supercomputer system of the University of Tokyo, using up to 128 nodes, and apply it to several huge applications.

This article is organized as follows. The algorithms are introduced in Chapter 2. The improvements of the algorithms and the parallelism ideas are discussed in Chapters 3 to 5. Chapter 6 describes the implementation techniques. Several numerical results are provided in Chapter 7. The article ends with a discussion in Chapter 8. Some formulas are derived in detail in Appendix A. The complexity tables of all the algorithms are listed in Appendix B.

For notation, we use italic letters (e.g. $m$, $\beta$, $N$) for scalars, bold italic uppercase letters (e.g. $\boldsymbol{A}$, $\boldsymbol{\Omega}$) for matrices, italic sans serif uppercase letters (e.g. $\mathsf{G}$, $\mathsf{X}$) for row/column-block submatrices, bold fraktur uppercase letters (e.g. $\mathfrak{B}$, $\mathfrak{Q}$) for matrices that remain unchanged during the iteration, subscript bracketed numbers (e.g. $\boldsymbol{Q}_{[i]}$, $\boldsymbol{Y}_{[i]}$) for the matrices of the $i$-th sketch, superscript parenthesized numbers (e.g. $\mathsf{U}^{(j)}$, $\Omega^{(j)}$) for the $j$-th block-row on the $j$-th processor, superscript angled numbers (e.g. $\mathsf{A}^{\langle j\rangle}$) for the $j$-th block-column on the $j$-th processor, subscript $c$ (e.g. $\boldsymbol{Q}_c$, $\boldsymbol{X}_c$) for the matrices of the current iteration, and subscript plus signs (e.g. $\boldsymbol{B}_+$, $\boldsymbol{D}_+$) for the matrices of the next iteration. Table 1.1 lists the notations and formulas used in this article.

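| Notation | Description |
| --- | --- |
| $m$, $n$ | Row and column dimensions of a matrix $\boldsymbol{A}$, with the assumption $m \le n$. |
| $k$ | Desired rank of the approximate SVD. |
| $p$ | Oversampling parameter. |
| $l$ | Dimension of the randomized sketches, i.e., $l = k + p \ll m$. |
| $N$ | Number of random sketches. |
| $P$ | Number of processors. |
| $m_b$, $n_b$ | Row/column dimensions of a row/column-block, i.e., $m_b = m/P$, $n_b = n/P$. |
| $\boldsymbol{A} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^\top$ | An $m \times n$ matrix and its SVD. |
| $\boldsymbol{A} \approx \boldsymbol{U}_k\boldsymbol{\Sigma}_k\boldsymbol{V}_k^\top$ | Rank-$k$ SVD defined in eq. (1.1). |
| $\boldsymbol{A} \approx \hat{\boldsymbol{U}}_k\hat{\boldsymbol{\Sigma}}_k\hat{\boldsymbol{V}}_k^\top$ | Rank-$k$ SVD computed by Algorithm 1 (iSVD). |
| $\boldsymbol{\Omega}_{[i]}$ | The $i$-th Gaussian random projection matrix. |
| $\boldsymbol{Y}_{[i]}$ | The $i$-th sketched matrix. |
| $\boldsymbol{Q}_{[i]}$ | The $i$-th orthonormal basis of the sketched subspace. |
| $\boldsymbol{P}$ | The average of $\boldsymbol{Q}_{[i]}\boldsymbol{Q}_{[i]}^\top$ defined in eq. (3.2) (stored implicitly). |
| $\boldsymbol{Q}$ | The integrated orthonormal basis of the sketched subspace. |
| $\boldsymbol{Q}_{\text{opt}}$ | The $\boldsymbol{Q}$ of the optimization problem defined in eq. (3.2). |
| $\boldsymbol{Q}_{\text{hr}}$ | The $\boldsymbol{Q}$ of hierarchical reduction integration defined in Algorithm III-3. |
| $\lVert\cdot\rVert_F$ | The Frobenius norm, $\lVert\boldsymbol{A}\rVert_F = \sqrt{\sum_i\sum_j a_{ij}^2} = \sqrt{\operatorname{tr}(\boldsymbol{A}^\top\boldsymbol{A})}$. |
| $\operatorname{orth}(\cdot)$ | Computes an orthonormal basis of the given matrix using QR or SVD. |
| $\operatorname{qr}(\cdot)$ | Computes the QR decomposition. |
| $\operatorname{eig}(\cdot)$ | Computes the eigenvalue decomposition. |
| $\operatorname{svd}(\cdot)$ | Computes the singular value decomposition. |

Table 1.1: Notations Used in This Article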

Chapter 2 Algorithms

In this chapter, we briefly describe the Integrated Singular Value Decomposition algorithm, and the algorithms of each stage.

2.1 Integrated Singular Value Decomposition

Integrated Singular Value Decomposition (iSVD, Algorithm 1) [3] finds an approximate rank-$k$ SVD

$$\boldsymbol{A} \approx \hat{\boldsymbol{U}}_k\hat{\boldsymbol{\Sigma}}_k\hat{\boldsymbol{V}}_k^\top. \tag{2.1}$$

In the algorithm, $\boldsymbol{A}$ is the $m \times n$ input matrix, $k$ is the desired rank of the approximate SVD, $p$ is the oversampling parameter, $l = k + p$ is the dimension of the sketched column space, and $N$ is the number of random sketches. We split iSVD into four stages.

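• Stage I: Sketching. Sketch $N$ rank-$l$ column subspaces of the input matrix $\boldsymbol{A}$. In other words, compute $m \times l$ matrices $\boldsymbol{Y}_{[i]}$ whose columns span a subspace of the column space of $\boldsymbol{A}$. A naïve way is to multiply $\boldsymbol{A}$ by a randomly generated matrix $\boldsymbol{\Omega}_{[i]}$.

• Stage II: Orthogonalization. Compute an approximate basis for the range of the input matrix $\boldsymbol{A}$ from each $\boldsymbol{Y}_{[i]}$; that is, find matrices $\boldsymbol{Q}_{[i]}$ with orthonormal columns such that

$$\boldsymbol{A} \approx \boldsymbol{Q}_{[i]}\boldsymbol{Q}_{[i]}^\top\boldsymbol{A}.$$

Given $\boldsymbol{Y}_{[i]}$, one may directly orthogonalize it to obtain $\boldsymbol{Q}_{[i]}$.

• Stage III: Integration. Integrate $\boldsymbol{Q} \leftarrow \{\boldsymbol{Q}_{[i]}\}_{i=1}^N$; that is, find an orthonormal basis $\boldsymbol{Q}$ that best represents the $\boldsymbol{Q}_{[i]}$.

• Stage IV: Postprocessing. Compute a rank-$k$ approximate SVD $\hat{\boldsymbol{U}}_k$, $\hat{\boldsymbol{\Sigma}}_k$, $\hat{\boldsymbol{V}}_k$ of $\boldsymbol{A}$ in the range of $\boldsymbol{Q}$; i.e., find the SVD

$$\boldsymbol{Q}\boldsymbol{Q}^\top\boldsymbol{A} = \hat{\boldsymbol{U}}_l\hat{\boldsymbol{\Sigma}}_l\hat{\boldsymbol{V}}_l^\top$$

and extract the largest $k$ singular pairs $\hat{\boldsymbol{U}}_k$, $\hat{\boldsymbol{\Sigma}}_k$, $\hat{\boldsymbol{V}}_k$.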

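Algorithm 1 Integrated Singular Value Decomposition [3]

Require: $\boldsymbol{A}$ (real $m \times n$ matrix), $k$ (desired rank of approximate SVD), $p$ (oversampling parameter), $l = k + p$ (dimension of the sketched column space), $N$ (number of random sketches).

Ensure: Approximate rank-$k$ SVD of $\boldsymbol{A} \approx \hat{\boldsymbol{U}}_k\hat{\boldsymbol{\Sigma}}_k\hat{\boldsymbol{V}}_k^\top$.

1: (Sketching.) Compute $m \times l$ matrices $\boldsymbol{Y}_{[i]}$ whose columns span a subspace of the column space of $\boldsymbol{A}$ for $i = 1, \dots, N$.

2: (Orthogonalization.) Compute $\boldsymbol{Q}_{[i]}$ whose columns are an orthonormal basis of $\boldsymbol{Y}_{[i]}$ for $i = 1, \dots, N$.

3: (Integration.) Integrate $\boldsymbol{Q} \leftarrow \{\boldsymbol{Q}_{[i]}\}_{i=1}^N$.

4: (Postprocessing.) Compute a rank-$k$ approximate SVD $\hat{\boldsymbol{U}}_k\hat{\boldsymbol{\Sigma}}_k\hat{\boldsymbol{V}}_k^\top$ of $\boldsymbol{A}$ in the range of $\boldsymbol{Q}$.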

2.2 Stage I: Sketching

We use the same sketching algorithm as rSVD. Algorithm I-1 (Gaussian Projection Sketching) multiplies $\boldsymbol{A}$ by random matrices drawn from the Gaussian normal distribution.

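Algorithm I-1 Gaussian Projection Sketching

Require: $\boldsymbol{A}$ (real $m \times n$ matrix), $l$ (dimension of the sketched column space), $q$ (exponent of the power method), $N$ (number of random sketches).

Ensure: $\boldsymbol{Y}_{[i]}$ (real $m \times l$ matrices) whose columns span a subspace of the column space of $\boldsymbol{A}$ for $i = 1, \dots, N$.

1: Generate $n \times l$ random matrices $\boldsymbol{\Omega}_{[i]}$ using the Gaussian normal distribution.

2: Assign $\boldsymbol{Y}_{[i]} \leftarrow (\boldsymbol{A}\boldsymbol{A}^\top)^q\boldsymbol{A}\boldsymbol{\Omega}_{[i]}$.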

According to Halko et al. [1], when this algorithm is used with $q > 0$, repeatedly multiplying by $\boldsymbol{A}$ and $\boldsymbol{A}^\top$ amplifies rounding errors. They suggest orthogonalizing the columns between each multiplication by $\boldsymbol{A}$ and $\boldsymbol{A}^\top$. In this article, we focus on the case $q = 0$, so this issue does not arise.
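As an illustration of Algorithm I-1 with $q = 0$, the following NumPy sketch draws $N$ Gaussian test matrices and forms the sketches. It is a minimal serial example, not the thesis's C++ implementation; the function name `gaussian_sketches`, the test matrix, and all sizes are assumptions made for the example.

```python
import numpy as np


def gaussian_sketches(A, l, N, rng):
    """Return N sketches Y_[i] = A @ Omega_[i] (Algorithm I-1 with q = 0)."""
    n = A.shape[1]
    # Each Omega_[i] is an n-by-l matrix with i.i.d. standard normal entries.
    return [A @ rng.standard_normal((n, l)) for _ in range(N)]


rng = np.random.default_rng(0)
# Hypothetical test matrix of exact rank 8 (m = 50, n = 200).
A = rng.standard_normal((50, 8)) @ rng.standard_normal((8, 200))
Ys = gaussian_sketches(A, l=12, N=4, rng=rng)
print(len(Ys), Ys[0].shape)
```

Because the test matrix has rank 8, each $50 \times 12$ sketch also has numerical rank 8: the sketch cannot leave the column space of $\boldsymbol{A}$.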

2.3 Stage II: Orthogonalization

In general, we can simply find the orthonormal basis using the canonical QR or SVD of $\boldsymbol{Y}_{[i]}$ (Algorithm II-1). Additionally, we may compute the orthonormal basis using the eigenvalue decomposition of $\boldsymbol{Y}_{[i]}^\top\boldsymbol{Y}_{[i]}$, the Gramian of $\boldsymbol{Y}_{[i]}$ (Algorithm II-2).

Algorithm II-1 Canonical Orthogonalization

Require: $\boldsymbol{Y}_{[i]}$ (real $m \times l$ matrices).

Ensure: $\boldsymbol{Q}_{[i]}$ (real $m \times l$ matrices) whose columns are an orthonormal basis of $\boldsymbol{Y}_{[i]}$.

1: Compute $\boldsymbol{Q}_{[i]}$ whose columns are an orthonormal basis of $\boldsymbol{Y}_{[i]}$ using QR or SVD.

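In Canonical Orthogonalization, we can use either QR ($\boldsymbol{Y}_{[i]} = \boldsymbol{Q}_{[i]}\boldsymbol{R}_{[i]}$) or SVD ($\boldsymbol{Y}_{[i]} = \boldsymbol{Q}_{[i]}\boldsymbol{S}_{[i]}\boldsymbol{W}_{[i]}^\top$). Although the two $\boldsymbol{Q}_{[i]}$ might not be exactly the same, they span the same space; that is, the product $\boldsymbol{Q}_{[i]}\boldsymbol{Q}_{[i]}^\top$ is exactly the same for both decompositions.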

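Algorithm II-2 Gramian Orthogonalization

Require: $\boldsymbol{Y}_{[i]}$ (real $m \times l$ matrices).

Ensure: $\boldsymbol{Q}_{[i]}$ (real $m \times l$ matrices) whose columns are an orthonormal basis of $\boldsymbol{Y}_{[i]}$.

1: Compute $\boldsymbol{W}_{[i]}\boldsymbol{S}_{[i]}^2\boldsymbol{W}_{[i]}^\top \leftarrow \operatorname{eig}(\boldsymbol{Y}_{[i]}^\top\boldsymbol{Y}_{[i]})$.

2: Assign $\boldsymbol{Q}_{[i]} \leftarrow \boldsymbol{Y}_{[i]}\boldsymbol{W}_{[i]}\boldsymbol{S}_{[i]}^{-1}$.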

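Instead of computing the QR decomposition of the $m \times l$ matrices $\boldsymbol{Y}_{[i]}$ (Step 1 in Algorithm II-1), Gramian Orthogonalization (Algorithm II-2) computes the eigenvalue decomposition of the $l \times l$ matrices $\boldsymbol{Y}_{[i]}^\top\boldsymbol{Y}_{[i]}$. Denoting the SVD as $\boldsymbol{Y}_{[i]} = \boldsymbol{Q}_{[i]}\boldsymbol{S}_{[i]}\boldsymbol{W}_{[i]}^\top$, the eigenvalue decomposition of the Gramian matrix $\boldsymbol{Y}_{[i]}^\top\boldsymbol{Y}_{[i]}$ can be written as

$$\boldsymbol{Y}_{[i]}^\top\boldsymbol{Y}_{[i]} = \boldsymbol{W}_{[i]}\boldsymbol{S}_{[i]}\boldsymbol{Q}_{[i]}^\top\boldsymbol{Q}_{[i]}\boldsymbol{S}_{[i]}\boldsymbol{W}_{[i]}^\top = \boldsymbol{W}_{[i]}\boldsymbol{S}_{[i]}^2\boldsymbol{W}_{[i]}^\top. \tag{2.2}$$

Note that $\boldsymbol{Q}_{[i]}^\top\boldsymbol{Q}_{[i]}$ is the identity matrix since the columns of $\boldsymbol{Q}_{[i]}$ are orthonormal. With the eigenvalue decomposition, we can form the orthonormal basis by solving the equation

$$\boldsymbol{Q}_{[i]} = \boldsymbol{Y}_{[i]}\left(\boldsymbol{S}_{[i]}\boldsymbol{W}_{[i]}^\top\right)^{-1}. \tag{2.3}$$

Since $\boldsymbol{W}_{[i]}$ is orthogonal and $\boldsymbol{S}_{[i]}$ is diagonal, the inverse can be applied by multiplying by $\boldsymbol{W}_{[i]}$ and dividing the columns by the diagonal elements of $\boldsymbol{S}_{[i]}$; that is,

$$\boldsymbol{Q}_{[i]} = \boldsymbol{Y}_{[i]}\boldsymbol{W}_{[i]}\boldsymbol{S}_{[i]}^{-1}. \tag{2.4}$$

As shown in Chapter 5, the Gramian algorithm is faster.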
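The relation between the Gramian and canonical approaches can be checked numerically. The NumPy sketch below implements eq. (2.4) with `numpy.linalg.eigh` and compares the resulting projector with the one obtained from QR. It assumes $\boldsymbol{Y}$ has full column rank and is well conditioned (forming the Gramian squares the condition number); the function name `orth_gramian` and the random test matrix are illustrative assumptions.

```python
import numpy as np


def orth_gramian(Y):
    """Orthonormal basis of range(Y) via the eigendecomposition of Y^T Y."""
    S2, W = np.linalg.eigh(Y.T @ Y)   # Y^T Y = W diag(S2) W^T, eq. (2.2)
    return (Y @ W) / np.sqrt(S2)      # Q = Y W S^{-1}, eq. (2.4): scale columns


rng = np.random.default_rng(0)
Y = rng.standard_normal((200, 6))     # hypothetical sketched matrix
Q_gram = orth_gramian(Y)
Q_qr, _ = np.linalg.qr(Y)             # canonical alternative (Algorithm II-1)

# The bases differ, but they span the same space: the projectors agree.
print(np.allclose(Q_gram @ Q_gram.T, Q_qr @ Q_qr.T))
```

Only an $l \times l$ eigenproblem is solved here; the two products involving the tall matrix $\boldsymbol{Y}$ are matrix multiplications.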

2.4 Stage III: Integration

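In the integration stage, we solve the optimization problem (see Section 3.1 for details)

$$\boldsymbol{Q}_{\text{opt}} := \underset{\boldsymbol{Q}^\top\boldsymbol{Q} = \boldsymbol{I}}{\arg\max}\ f(\boldsymbol{Q}) \quad\text{with}\quad f(\boldsymbol{Q}) = \frac{1}{2}\operatorname{tr}\left(\boldsymbol{Q}^\top\boldsymbol{P}\boldsymbol{Q}\right). \tag{2.5}$$

There are two algorithms for this optimization problem. Algorithm III-1 uses the Kolmogorov-Nagumo-type average [4] and Algorithm III-2 uses the line search proposed by Wen and Yin [5].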

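Algorithm III-1 Kolmogorov-Nagumo Integration [4]

Require: $\boldsymbol{Q}_{[1]}, \boldsymbol{Q}_{[2]}, \dots, \boldsymbol{Q}_{[N]}$ (real $m \times l$ orthogonal matrices), $\boldsymbol{Q}_{\text{init}}$ (initial guess).

Ensure: Integrated orthogonal basis $\boldsymbol{Q}_{\text{opt}}$.

1: Initialize the current iterate $\boldsymbol{Q}_c \leftarrow \boldsymbol{Q}_{\text{init}}$.

2: while (not convergent) do

3: Assign $\boldsymbol{X}_c \leftarrow (\boldsymbol{I} - \boldsymbol{Q}_c\boldsymbol{Q}_c^\top)\boldsymbol{P}\boldsymbol{Q}_c$.

4: Compute $\boldsymbol{C} \leftarrow \left(\dfrac{\boldsymbol{I}}{2} + \left(\dfrac{\boldsymbol{I}}{4} - \boldsymbol{X}_c^\top\boldsymbol{X}_c\right)^{1/2}\right)^{1/2}$.

5: Update $\boldsymbol{Q}_c \leftarrow \boldsymbol{Q}_c\boldsymbol{C} + \boldsymbol{X}_c\boldsymbol{C}^{-1}$.

6: end while

7: Output $\boldsymbol{Q}_{\text{opt}} \leftarrow \boldsymbol{Q}_c$.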

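In Kolmogorov-Nagumo Integration, we stop the iteration when $\boldsymbol{I} - \boldsymbol{Q}_+^\top\boldsymbol{Q}_c$ is small enough. This condition measures the similarity of $\boldsymbol{Q}_+$ and $\boldsymbol{Q}_c$. Since $\boldsymbol{X}_c^\top\boldsymbol{Q}_c = \boldsymbol{0}$ and $\boldsymbol{C}$ is symmetric, $\boldsymbol{Q}_+^\top\boldsymbol{Q}_c = \boldsymbol{C}$, so in the implementation we use the equivalent condition $\lVert\boldsymbol{I} - \boldsymbol{C}\rVert_2 < \epsilon$ for some tolerance $\epsilon$.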
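Algorithm III-1 can be prototyped in a few lines of NumPy. The sketch below is a serial illustration under stated assumptions, not the thesis's C++ code: the helper names (`sqrtm_psd`, `objective`, `kn_integration`), the synthetic test matrix, the tolerance, and the iteration cap are all hypothetical. The product $\boldsymbol{P}\boldsymbol{Q}_c$ is accumulated sketch by sketch so that the $m \times m$ matrix $\boldsymbol{P}$ is never formed, and the stopping test checks $\lVert\boldsymbol{I} - \boldsymbol{C}\rVert_2$, which equals $\lVert\boldsymbol{I} - \boldsymbol{Q}_+^\top\boldsymbol{Q}_c\rVert_2$ because $\boldsymbol{Q}_+^\top\boldsymbol{Q}_c = \boldsymbol{C}$.

```python
import numpy as np


def sqrtm_psd(S):
    """Square root of a symmetric positive semidefinite matrix via eigh."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T


def objective(Qs, Q):
    """f(Q) = 1/2 tr(Q^T P Q), evaluated without forming the m-by-m matrix P."""
    return 0.5 * sum(np.linalg.norm(Qi.T @ Q, "fro") ** 2 for Qi in Qs) / len(Qs)


def kn_integration(Qs, Q_init, tol=1e-10, max_iter=500):
    """Kolmogorov-Nagumo iteration following Steps 1-7 of Algorithm III-1."""
    Qc = Q_init.copy()
    I = np.eye(Qc.shape[1])
    for _ in range(max_iter):
        G = sum(Qi @ (Qi.T @ Qc) for Qi in Qs) / len(Qs)   # P Q_c
        X = G - Qc @ (Qc.T @ G)                             # (I - Q_c Q_c^T) P Q_c
        C = sqrtm_psd(I / 2 + sqrtm_psd(I / 4 - X.T @ X))   # Step 4
        Qc = Qc @ C + X @ np.linalg.inv(C)                  # Step 5
        if np.linalg.norm(I - C, 2) < tol:                  # ||I - Q_+^T Q_c||_2
            break
    return Qc


# Hypothetical data: a noisy rank-4 matrix and N = 8 sketches of dimension l = 6.
rng = np.random.default_rng(0)
m, n, l, N = 60, 300, 6, 8
A = rng.standard_normal((m, 4)) @ rng.standard_normal((4, n))
A += 0.1 * rng.standard_normal((m, n))
Qs = [np.linalg.qr(A @ rng.standard_normal((n, l)))[0] for _ in range(N)]
Q_opt = kn_integration(Qs, Qs[0])
print(objective(Qs, Qs[0]), objective(Qs, Q_opt))
```

Each update keeps the columns of the iterate orthonormal and does not decrease $f$, so the integrated basis scores at least as well as the initial guess.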

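Algorithm III-2 Wen-Yin Integration [5]

Require: $\boldsymbol{Q}_{[1]}, \boldsymbol{Q}_{[2]}, \dots, \boldsymbol{Q}_{[N]}$ (real $m \times l$ orthogonal matrices), $\boldsymbol{Q}_{\text{init}}$ (initial guess), $\tau_0 \ge 0$ (initial step size), $\beta \in (0, 1)$ (scaling parameter for step size searching), $\sigma \in (0, 1)$ (parameter for step size searching), $\eta \in (0, 1)$ (parameter for next step searching), $\tau_{\max}$, $\tau_{\min}$ (maximum and minimum predicting step size).

1: Initialize $\boldsymbol{Q}_c \leftarrow \boldsymbol{Q}_{\text{init}}$, $\tau_g \leftarrow \tau_0$, $\zeta \leftarrow 1$, $\phi \leftarrow f(\boldsymbol{Q}_c)$.

2: while (not convergent) do

3: Assign $\boldsymbol{G}_c \leftarrow \boldsymbol{P}\boldsymbol{Q}_c$.

4: Let $\tau = \tau_g\beta^t$. Find the smallest integer $t \ge 0$ satisfying the inequality

$$\tilde{\phi} \ge \phi + \tau\sigma\,\frac{1}{2}\lVert\boldsymbol{M}\rVert_F^2,$$

where $\tilde{\phi} = f(\boldsymbol{Q}_+)$, $\boldsymbol{Q}_+ = \left(\boldsymbol{I} - \frac{\tau}{2}\boldsymbol{M}\right)^{-1}\left(\boldsymbol{I} + \frac{\tau}{2}\boldsymbol{M}\right)\boldsymbol{Q}_c$, and $\boldsymbol{M} = \boldsymbol{G}_c\boldsymbol{Q}_c^\top - \boldsymbol{Q}_c\boldsymbol{G}_c^\top$.

5: Update $\phi \leftarrow \dfrac{\eta\zeta\phi + \tilde{\phi}}{\eta\zeta + 1}$ and then $\zeta \leftarrow \eta\zeta + 1$.

6: Compute the differences $\boldsymbol{\Delta}_1 = \boldsymbol{Q}_+ - \boldsymbol{Q}_c$ and $\boldsymbol{\Delta}_2 = \boldsymbol{X}_+ - \boldsymbol{X}_c$, where

$$\boldsymbol{X}_c = (\boldsymbol{I} - \boldsymbol{Q}_c\boldsymbol{Q}_c^\top)\boldsymbol{P}\boldsymbol{Q}_c, \qquad \boldsymbol{X}_+ = (\boldsymbol{I} - \boldsymbol{Q}_+\boldsymbol{Q}_+^\top)\boldsymbol{P}\boldsymbol{Q}_+.$$

7: Update $\tau_g \leftarrow \max\left(\min\left(\tau_{\text{guess}}, \tau_{\max}\right), \tau_{\min}\right)$, where

$$\tau_{\text{guess}} = \frac{\operatorname{tr}(\boldsymbol{\Delta}_1^\top\boldsymbol{\Delta}_1)}{\lvert\operatorname{tr}(\boldsymbol{\Delta}_1^\top\boldsymbol{\Delta}_2)\rvert} \quad\text{or}\quad \frac{\lvert\operatorname{tr}(\boldsymbol{\Delta}_1^\top\boldsymbol{\Delta}_2)\rvert}{\operatorname{tr}(\boldsymbol{\Delta}_2^\top\boldsymbol{\Delta}_2)}.$$

8: Assign $\boldsymbol{Q}_c \leftarrow \boldsymbol{Q}_+$.

9: end while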

In Wen-Yin Integration, we stop the iteration when $\boldsymbol{X}_c$ is small enough [5]. In the implementation, we use the condition $\lVert\boldsymbol{X}_c\rVert_F < \epsilon$ for some tolerance $\epsilon$. Note that $\lVert\boldsymbol{X}_c\rVert_F^2 = \frac{1}{2}\lVert\boldsymbol{M}\rVert_F^2$, which is already computed in Step 4.

Instead of solving the optimization problem, Chang proposed a divide-and-conquer algorithm (Algorithm III-3) [6]. It integrates pairs of $\boldsymbol{Q}_{[i]}$ recursively (see Appendix A.1 for details).

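Algorithm III-3 Hierarchical Reduction Integration [6]

Require: $\boldsymbol{Q}_{[1]}, \boldsymbol{Q}_{[2]}, \dots, \boldsymbol{Q}_{[N]}$ (real $m \times l$ orthogonal matrices).

Ensure: Integrated orthogonal basis $\boldsymbol{Q}_{\text{hr}}$.

1: Set $\tilde{N} \leftarrow N$.

2: while $\tilde{N} > 1$ do

3: Set $h \leftarrow \lfloor\tilde{N}/2\rfloor$.

4: for $i = 1$ to $h$ do

5: Compute $\boldsymbol{W}\boldsymbol{S}\boldsymbol{T}^\top \leftarrow \operatorname{svd}(\boldsymbol{Q}_{[i]}^\top\boldsymbol{Q}_{[i+h]})$.

6: Update $\boldsymbol{Q}_{[i]} \leftarrow \left(\boldsymbol{Q}_{[i]}\boldsymbol{W} + \boldsymbol{Q}_{[i+h]}\boldsymbol{T}\right)\left(2(\boldsymbol{I} + \boldsymbol{S})\right)^{-1/2}$.

7: end for

8: Update $\tilde{N} \leftarrow \lceil\tilde{N}/2\rceil$.

9: end while

10: Output $\boldsymbol{Q}_{\text{hr}} \leftarrow \boldsymbol{Q}_{[1]}$.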

As shown in Chapter 5, Hierarchical Reduction Integration is much faster than solving the optimization problem. It costs only $O(Nml^2)$, which is roughly the complexity of a single iteration of Kolmogorov-Nagumo Integration or Wen-Yin Integration. However, according to Chang [6], the result is less accurate, though still better than any single $\boldsymbol{Q}_{[i]}$. He suggests using this algorithm as a preprocessing step to find $\boldsymbol{Q}_{\text{init}}$ for Kolmogorov-Nagumo Integration and Wen-Yin Integration, which reduces the number of iterations.
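The pairwise merge in Steps 5 and 6 of Algorithm III-3 is easy to prototype. The NumPy sketch below is an illustration under the simplifying assumption that $N$ is a power of two, so every level pairs all remaining bases; the function names `merge_pair` and `hierarchical_reduction` and the test data are hypothetical. The merged basis stays orthonormal because $(\boldsymbol{Q}_a\boldsymbol{W} + \boldsymbol{Q}_b\boldsymbol{T})^\top(\boldsymbol{Q}_a\boldsymbol{W} + \boldsymbol{Q}_b\boldsymbol{T}) = 2(\boldsymbol{I} + \boldsymbol{S})$.

```python
import numpy as np


def merge_pair(Qa, Qb):
    """Merge two orthonormal bases (Steps 5-6 of Algorithm III-3)."""
    W, S, Tt = np.linalg.svd(Qa.T @ Qb)          # Qa^T Qb = W diag(S) T^T
    # (Qa W + Qb T) (2 (I + S))^{-1/2}; S is diagonal, so scale the columns.
    return (Qa @ W + Qb @ Tt.T) / np.sqrt(2.0 * (1.0 + S))


def hierarchical_reduction(Qs):
    """Reduce a list of bases pairwise; this sketch assumes N is a power of two."""
    Qs = list(Qs)
    n_tilde = len(Qs)
    if n_tilde < 1 or (n_tilde & (n_tilde - 1)) != 0:
        raise ValueError("this sketch assumes N is a power of two")
    while n_tilde > 1:
        h = n_tilde // 2
        for i in range(h):
            Qs[i] = merge_pair(Qs[i], Qs[i + h])
        n_tilde = h
    return Qs[0]


# Hypothetical data: N = 4 sketches of a noisy rank-4 matrix.
rng = np.random.default_rng(0)
m, n, l, N = 60, 300, 6, 4
A = rng.standard_normal((m, 4)) @ rng.standard_normal((4, n))
A += 0.1 * rng.standard_normal((m, n))
Qs = [np.linalg.qr(A @ rng.standard_normal((n, l)))[0] for _ in range(N)]
Q_hr = hierarchical_reduction(Qs)
print(np.allclose(Q_hr.T @ Q_hr, np.eye(l)))
```

Each merge needs only an $l \times l$ SVD plus products with $m \times l$ matrices, and a tree over $N$ bases performs $N - 1$ merges, consistent with the $O(Nml^2)$ cost quoted above.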

2.5 Stage IV: Postprocessing

There are several methods for postprocessing. Algorithm IV-1, the canonical method, forms the decomposition using the SVD of $\boldsymbol{Q}^\top\boldsymbol{A}$. Similar to Gramian Orthogonalization (Algorithm II-2), Algorithm IV-2 computes the eigenvalue decomposition of $\boldsymbol{Q}^\top\boldsymbol{A}\boldsymbol{A}^\top\boldsymbol{Q}$ (the Gramian of $\boldsymbol{A}^\top\boldsymbol{Q}$) instead of computing an SVD.

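Algorithm IV-1 Canonical Postprocessing

Require: $\boldsymbol{A}$ (real $m \times n$ matrix), $\boldsymbol{Q}$ (real $m \times l$ orthogonal matrix), $k$ (desired rank of approximate SVD).

1: Compute $\hat{\boldsymbol{W}}_l\hat{\boldsymbol{\Sigma}}_l\hat{\boldsymbol{V}}_l^\top \leftarrow \operatorname{svd}(\boldsymbol{Q}^\top\boldsymbol{A})$.

2: Extract the largest $k$ singular pairs from $\hat{\boldsymbol{W}}_l$, $\hat{\boldsymbol{\Sigma}}_l$, $\hat{\boldsymbol{V}}_l$ to obtain $\hat{\boldsymbol{W}}_k$, $\hat{\boldsymbol{\Sigma}}_k$, $\hat{\boldsymbol{V}}_k$.

3: Assign $\hat{\boldsymbol{U}}_k \leftarrow \boldsymbol{Q}\hat{\boldsymbol{W}}_k$.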

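Since the projected matrix $\boldsymbol{Q}\boldsymbol{Q}^\top\boldsymbol{A}$ has the same size as the input matrix $\boldsymbol{A}$, computing the SVD of $\boldsymbol{Q}\boldsymbol{Q}^\top\boldsymbol{A}$ directly is unwise. Canonical Postprocessing computes the SVD of the smaller matrix $\boldsymbol{Q}^\top\boldsymbol{A}$ to reduce the computational complexity. Denoting the SVD as $\boldsymbol{Q}^\top\boldsymbol{A} = \hat{\boldsymbol{W}}_l\hat{\boldsymbol{\Sigma}}_l\hat{\boldsymbol{V}}_l^\top$, the SVD of $\boldsymbol{Q}\boldsymbol{Q}^\top\boldsymbol{A}$ can be written as

$$\boldsymbol{Q}\boldsymbol{Q}^\top\boldsymbol{A} = \boldsymbol{Q}\left(\hat{\boldsymbol{W}}_l\hat{\boldsymbol{\Sigma}}_l\hat{\boldsymbol{V}}_l^\top\right) = \left(\boldsymbol{Q}\hat{\boldsymbol{W}}_l\right)\hat{\boldsymbol{\Sigma}}_l\hat{\boldsymbol{V}}_l^\top = \hat{\boldsymbol{U}}_l\hat{\boldsymbol{\Sigma}}_l\hat{\boldsymbol{V}}_l^\top. \tag{2.6}$$

Note that the product $\hat{\boldsymbol{U}}_l = \boldsymbol{Q}\hat{\boldsymbol{W}}_l$ has orthonormal columns since $\boldsymbol{Q}$ has orthonormal columns and $\hat{\boldsymbol{W}}_l$ is orthogonal.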

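Algorithm IV-2 Gramian Postprocessing

Require: $\boldsymbol{A}$ (real $m \times n$ matrix), $\boldsymbol{Q}$ (real $m \times l$ orthogonal matrix), $k$ (desired rank of approximate SVD).

1: Assign $\boldsymbol{Z} \leftarrow \boldsymbol{A}^\top\boldsymbol{Q}$.

2: Compute $\hat{\boldsymbol{W}}_l\hat{\boldsymbol{\Sigma}}_l^2\hat{\boldsymbol{W}}_l^\top \leftarrow \operatorname{eig}(\boldsymbol{Z}^\top\boldsymbol{Z})$.

3: Extract the largest $k$ eigen-pairs from $\hat{\boldsymbol{W}}_l$, $\hat{\boldsymbol{\Sigma}}_l$ to obtain $\hat{\boldsymbol{W}}_k$, $\hat{\boldsymbol{\Sigma}}_k$.

4: Assign $\hat{\boldsymbol{U}}_k \leftarrow \boldsymbol{Q}\hat{\boldsymbol{W}}_k$.

5: Assign $\hat{\boldsymbol{V}}_k \leftarrow \boldsymbol{Z}\hat{\boldsymbol{W}}_k\hat{\boldsymbol{\Sigma}}_k^{-1}$.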
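The two postprocessing variants can be compared side by side. The NumPy sketch below follows Algorithm IV-1 and Algorithm IV-2 for a small random matrix; it assumes that the left factor in the Gramian variant is formed as $\boldsymbol{Q}\hat{\boldsymbol{W}}_k$, as in Algorithm IV-1, and the function names and test sizes are hypothetical.

```python
import numpy as np


def postprocess_canonical(A, Q, k):
    """Algorithm IV-1: SVD of the small l-by-n matrix Q^T A."""
    W, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ W[:, :k], s[:k], Vt[:k].T


def postprocess_gramian(A, Q, k):
    """Algorithm IV-2: eigendecomposition of the l-by-l Gramian Z^T Z."""
    Z = A.T @ Q                              # n-by-l
    lam, W = np.linalg.eigh(Z.T @ Z)         # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:k]          # indices of the k largest eigenvalues
    s = np.sqrt(lam[idx])
    Wk = W[:, idx]
    return Q @ Wk, s, (Z @ Wk) / s           # U_k, Sigma_k, V_k = Z W_k Sigma_k^{-1}


rng = np.random.default_rng(0)
A = rng.standard_normal((40, 300))           # hypothetical input matrix
Q, _ = np.linalg.qr(A @ rng.standard_normal((300, 10)))
U1, s1, V1 = postprocess_canonical(A, Q, k=5)
U2, s2, V2 = postprocess_gramian(A, Q, k=5)
print(np.allclose(s1, s2))
```

Individual singular vectors may differ in sign between the two routes, so the comparison is best made on the singular values or on the rank-$k$ products $\hat{\boldsymbol{U}}_k\hat{\boldsymbol{\Sigma}}_k\hat{\boldsymbol{V}}_k^\top$.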

For symmetric $\boldsymbol{A}$, Halko, Martinsson and Tropp proposed an elegant algorithm [1] (Algorithm IV-3). The algorithm is much faster than the canonical method and preserves the symmetry of the result, with about twice the error of the canonical algorithm.

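Algorithm IV-3 Symmetric Postprocessing [1]

Require: $\boldsymbol{A}$ (real symmetric $m \times m$ matrix), $\boldsymbol{Q}$ (real $m \times l$ orthogonal matrix), $k$ (desired rank of approximate SVD).

Ensure: Approximate rank-$k$ eigenvalue decomposition of $\boldsymbol{A} \approx \hat{\boldsymbol{U}}_k\hat{\boldsymbol{\Sigma}}_k\hat{\boldsymbol{U}}_k^\top$.

1: Compute $\hat{\boldsymbol{W}}_l\hat{\boldsymbol{\Sigma}}_l\hat{\boldsymbol{W}}_l^\top \leftarrow \operatorname{eig}(\boldsymbol{Q}^\top\boldsymbol{A}\boldsymbol{Q})$.

2: Extract the largest $k$ eigen-pairs from $\hat{\boldsymbol{W}}_l$, $\hat{\boldsymbol{\Sigma}}_l$ to obtain $\hat{\boldsymbol{W}}_k$, $\hat{\boldsymbol{\Sigma}}_k$.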

Chapter 3 Improvements of Integration

In this chapter, we optimize Kolmogorov-Nagumo Integration (Algorithm III-1) and Wen-Yin Integration (Algorithm III-2) for better performance, and propose algorithms based on the Gramian idea that target the case of many iterations. Table 3.1 lists the notations and formulas used in this chapter. Here, we use bold italic uppercase letters (e.g. $\boldsymbol{A}$, $\boldsymbol{\Omega}$) for matrices and bold fraktur uppercase letters (e.g. $\mathfrak{B}$, $\mathfrak{Q}$) for matrices that remain unchanged during the iteration. Subscript $c$ (e.g. $\boldsymbol{Q}_c$, $\boldsymbol{X}_c$) denotes the matrices of the current iteration, and a subscript plus sign (e.g. $\boldsymbol{B}_+$, $\boldsymbol{D}_+$) denotes the matrices of the next iteration. Moreover, we use matrices with superscript $g$ for the $\boldsymbol{G}_c$ terms. For example, in the update of $\boldsymbol{Q}_+$, $\boldsymbol{F}_c$ is the coefficient of $\boldsymbol{Q}_c$, and $\boldsymbol{F}_c^g$ is the coefficient of $\boldsymbol{G}_c$.

3.1 Optimization Problem

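The integration stage finds an orthogonal matrix $\boldsymbol{Q}$ that best represents the orthonormal bases $\boldsymbol{Q}_{[1]}, \boldsymbol{Q}_{[2]}, \dots, \boldsymbol{Q}_{[N]}$. Here, we define the best such $\boldsymbol{Q}_{\text{opt}}$ as

$$\boldsymbol{Q}_{\text{opt}} := \underset{\boldsymbol{Q}^\top\boldsymbol{Q} = \boldsymbol{I}}{\arg\min}\ \frac{1}{N}\sum_{i=1}^{N}\left\lVert\boldsymbol{Q}_{[i]}\boldsymbol{Q}_{[i]}^\top - \boldsymbol{Q}\boldsymbol{Q}^\top\right\rVert_F^2. \tag{3.1}$$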

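| Symbol | Definition |
| --- | --- |
| $\mathfrak{Q}$ | $\begin{bmatrix}\boldsymbol{Q}_{[1]} & \boldsymbol{Q}_{[2]} & \cdots & \boldsymbol{Q}_{[N]}\end{bmatrix}$ |
| $\mathfrak{B}$ | $\mathfrak{Q}^\top\mathfrak{Q}$ |
| $\boldsymbol{G}_c$ | $\boldsymbol{P}\boldsymbol{Q}_c = \frac{1}{N}\mathfrak{Q}\boldsymbol{B}_c$ |
| $\boldsymbol{X}_c$ | $(\boldsymbol{I} - \boldsymbol{Q}_c\boldsymbol{Q}_c^\top)\boldsymbol{P}\boldsymbol{Q}_c = \boldsymbol{G}_c - \boldsymbol{Q}_c\boldsymbol{D}_c$ |
| $\boldsymbol{B}_c$ | $\mathfrak{Q}^\top\boldsymbol{Q}_c$ |
| $\boldsymbol{B}_c^g$ | $\mathfrak{Q}^\top\boldsymbol{G}_c = \frac{1}{N}\mathfrak{B}\boldsymbol{B}_c$ |
| $\boldsymbol{D}_c$ | $\boldsymbol{Q}_c^\top\boldsymbol{P}\boldsymbol{Q}_c = \boldsymbol{Q}_c^\top\boldsymbol{G}_c = \boldsymbol{G}_c^\top\boldsymbol{Q}_c = \frac{1}{N}\boldsymbol{B}_c^\top\boldsymbol{B}_c$ |
| $\boldsymbol{D}_c^g$ | $\boldsymbol{Q}_c^\top\boldsymbol{P}^2\boldsymbol{Q}_c = \boldsymbol{G}_c^\top\boldsymbol{G}_c = \frac{1}{N}\boldsymbol{B}_c^\top\boldsymbol{B}_c^g = \frac{1}{N^2}\boldsymbol{B}_c^\top\mathfrak{B}\boldsymbol{B}_c$ |
| $\boldsymbol{C}$ | Defined in Step 4 of Algorithm III-1. |
| $\boldsymbol{C}^{-1}_{11}$, $\boldsymbol{C}^{-1}_{12}$, $\boldsymbol{C}^{-1}_{21}$, $\boldsymbol{C}^{-1}_{22}$ | Defined in eq. (3.19). |
| $\boldsymbol{F}_c$ | $\boldsymbol{C} - \boldsymbol{D}_c\boldsymbol{C}^{-1}$ in Kolmogorov-Nagumo Integration; $\boldsymbol{I} + \boldsymbol{C}^{-1}_{22}\boldsymbol{D}_c - \boldsymbol{C}^{-1}_{21}$ in Wen-Yin Integration |
| $\boldsymbol{F}_c^g$ | $\boldsymbol{C}^{-1}$ in Kolmogorov-Nagumo Integration; $\boldsymbol{C}^{-1}_{12}\boldsymbol{D}_c - \boldsymbol{C}^{-1}_{11}$ in Wen-Yin Integration |
| $\boldsymbol{E}_c$ | $\frac{1}{N}\boldsymbol{B}_c\boldsymbol{F}_c^g$ |
| $\boldsymbol{Q}_+$ | $\boldsymbol{Q}_c\boldsymbol{F}_c + \boldsymbol{G}_c\boldsymbol{F}_c^g = \boldsymbol{Q}_c\boldsymbol{F}_c + \mathfrak{Q}\boldsymbol{E}_c$ |
| $\boldsymbol{B}_+$ | $\boldsymbol{B}_c\boldsymbol{F}_c + \boldsymbol{B}_c^g\boldsymbol{F}_c^g = \boldsymbol{B}_c\boldsymbol{F}_c + \mathfrak{B}\boldsymbol{E}_c$ |
| $\boldsymbol{Q}_{\text{opt}}$ | $\boldsymbol{Q}_{\text{init}}\tilde{\boldsymbol{F}} + \mathfrak{Q}\tilde{\boldsymbol{E}}$ |
| $\tilde{\boldsymbol{F}}_+$ | $\tilde{\boldsymbol{F}}_c\boldsymbol{F}_c$ |
| $\tilde{\boldsymbol{E}}_+$ | $\tilde{\boldsymbol{E}}_c\boldsymbol{F}_c + \boldsymbol{E}_c$ |

Table 3.1: Notations and Formulas Used in Optimization Integrations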

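The optimization problem is equivalent to the maximization problem

$$\boldsymbol{Q}_{\text{opt}} := \underset{\boldsymbol{Q}^\top\boldsymbol{Q} = \boldsymbol{I}}{\arg\max}\ \frac{1}{2}\operatorname{tr}\left(\boldsymbol{Q}^\top\boldsymbol{P}\boldsymbol{Q}\right) \quad\text{with}\quad \boldsymbol{P} := \frac{1}{N}\sum_{i=1}^{N}\boldsymbol{Q}_{[i]}\boldsymbol{Q}_{[i]}^\top. \tag{3.2}$$

Here, we define

$$f(\boldsymbol{Q}) = \frac{1}{2}\operatorname{tr}\left(\boldsymbol{Q}^\top\boldsymbol{P}\boldsymbol{Q}\right) \tag{3.3}$$

as the objective function.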
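The equivalence of eq. (3.1) and eq. (3.2) follows from expanding the Frobenius norm. The short derivation below is a sketch that uses only $\boldsymbol{Q}_{[i]}^\top\boldsymbol{Q}_{[i]} = \boldsymbol{Q}^\top\boldsymbol{Q} = \boldsymbol{I}_l$ and the cyclic property of the trace:

$$\begin{aligned}
\left\lVert\boldsymbol{Q}_{[i]}\boldsymbol{Q}_{[i]}^\top - \boldsymbol{Q}\boldsymbol{Q}^\top\right\rVert_F^2
&= \operatorname{tr}\left(\boldsymbol{Q}_{[i]}\boldsymbol{Q}_{[i]}^\top\right) - 2\operatorname{tr}\left(\boldsymbol{Q}^\top\boldsymbol{Q}_{[i]}\boldsymbol{Q}_{[i]}^\top\boldsymbol{Q}\right) + \operatorname{tr}\left(\boldsymbol{Q}\boldsymbol{Q}^\top\right) \\
&= 2l - 2\operatorname{tr}\left(\boldsymbol{Q}^\top\boldsymbol{Q}_{[i]}\boldsymbol{Q}_{[i]}^\top\boldsymbol{Q}\right), \\
\frac{1}{N}\sum_{i=1}^{N}\left\lVert\boldsymbol{Q}_{[i]}\boldsymbol{Q}_{[i]}^\top - \boldsymbol{Q}\boldsymbol{Q}^\top\right\rVert_F^2
&= 2l - 2\operatorname{tr}\left(\boldsymbol{Q}^\top\boldsymbol{P}\boldsymbol{Q}\right) = 2l - 4f(\boldsymbol{Q}).
\end{aligned}$$

Since $2l$ is a constant, minimizing the average squared distance between projectors is the same as maximizing $f(\boldsymbol{Q})$.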


3.2 Improvements of Kolmogorov-Nagumo Integration

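In the implementation, instead of explicitly forming $m \times m$ matrices such as $\boldsymbol{Q}_c\boldsymbol{Q}_c^\top$ (with $2m^2l$ flops), we compute $l \times l$ matrices such as $\boldsymbol{Q}_{[i]}^\top\boldsymbol{Q}_c$ (with $2ml^2$ flops) to reduce the computational cost and memory usage. For example, Step 3 in Algorithm III-1 (Kolmogorov-Nagumo Integration) can be rewritten as

$$\begin{aligned}
\boldsymbol{X}_c &= (\boldsymbol{I} - \boldsymbol{Q}_c\boldsymbol{Q}_c^\top)\boldsymbol{P}\boldsymbol{Q}_c = (\boldsymbol{I} - \boldsymbol{Q}_c\boldsymbol{Q}_c^\top)\left(\frac{1}{N}\sum_{i=1}^{N}\boldsymbol{Q}_{[i]}\boldsymbol{Q}_{[i]}^\top\right)\boldsymbol{Q}_c \\
&= \frac{1}{N}\sum_{i=1}^{N}(\boldsymbol{I} - \boldsymbol{Q}_c\boldsymbol{Q}_c^\top)\boldsymbol{Q}_{[i]}\boldsymbol{Q}_{[i]}^\top\boldsymbol{Q}_c \\
&= \frac{1}{N}\sum_{i=1}^{N}\boldsymbol{Q}_{[i]}\boldsymbol{Q}_{[i]}^\top\boldsymbol{Q}_c - \frac{1}{N}\sum_{i=1}^{N}\boldsymbol{Q}_c\boldsymbol{Q}_c^\top\boldsymbol{Q}_{[i]}\boldsymbol{Q}_{[i]}^\top\boldsymbol{Q}_c \\
&= \frac{1}{N}\sum_{i=1}^{N}\boldsymbol{Q}_{[i]}\left(\boldsymbol{Q}_{[i]}^\top\boldsymbol{Q}_c\right) - \frac{1}{N}\boldsymbol{Q}_c\sum_{i=1}^{N}\left(\boldsymbol{Q}_{[i]}^\top\boldsymbol{Q}_c\right)^\top\left(\boldsymbol{Q}_{[i]}^\top\boldsymbol{Q}_c\right) \\
&= \frac{1}{N}\sum_{i=1}^{N}\boldsymbol{Q}_{[i]}\boldsymbol{B}_{[i]} - \frac{1}{N}\boldsymbol{Q}_c\sum_{i=1}^{N}\boldsymbol{B}_{[i]}^\top\boldsymbol{B}_{[i]},
\end{aligned} \tag{3.4}$$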

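where $\boldsymbol{B}_{[i]} = \boldsymbol{Q}_{[i]}^\top\boldsymbol{Q}_c$ are $l \times l$ matrices. Moreover, those matrix products can be accelerated by combining the matrices

$$\aleph_{[1]}\beth_{[1]} + \aleph_{[2]}\beth_{[2]} + \cdots + \aleph_{[N]}\beth_{[N]} = \begin{bmatrix}\aleph_{[1]} & \aleph_{[2]} & \cdots & \aleph_{[N]}\end{bmatrix}\begin{bmatrix}\beth_{[1]} \\ \beth_{[2]} \\ \vdots \\ \beth_{[N]}\end{bmatrix}. \tag{3.5}$$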
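The identity in eq. (3.5) replaces $N$ small products by one large matrix multiplication, which typically maps better onto optimized BLAS routines. The NumPy check below uses hypothetical sizes and variable names (`alephs`, `beths`) that are not part of the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, l = 4, 30, 5                                   # hypothetical sizes
alephs = [rng.standard_normal((m, l)) for _ in range(N)]
beths = [rng.standard_normal((l, l)) for _ in range(N)]

# Left-hand side of eq. (3.5): N separate products, then a sum.
loop_sum = sum(a @ b for a, b in zip(alephs, beths))

# Right-hand side: concatenate horizontally and vertically, multiply once.
fused = np.hstack(alephs) @ np.vstack(beths)

print(np.allclose(loop_sum, fused))
```

With the $\aleph_{[i]}$ taken to be the $\boldsymbol{Q}_{[i]}$ and the $\beth_{[i]}$ taken to be the $\boldsymbol{B}_{[i]}$, the fused product is the first sum in eq. (3.4).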

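Therefore, eq. (3.4) can be rewritten as

$$\boldsymbol{X}_c = \boldsymbol{G}_c - \boldsymbol{Q}_c\boldsymbol{D}_c, \tag{3.6}$$

where $\boldsymbol{G}_c = \frac{1}{N}\mathfrak{Q}\boldsymbol{B}_c$, $\boldsymbol{D}_c = \frac{1}{N}\boldsymbol{B}_c^\top\boldsymbol{B}_c$,

$$\mathfrak{Q} = \begin{bmatrix}\boldsymbol{Q}_{[1]} & \boldsymbol{Q}_{[2]} & \cdots & \boldsymbol{Q}_{[N]}\end{bmatrix} \quad\text{and}\quad \boldsymbol{B}_c = \begin{bmatrix}\boldsymbol{B}_{[1]} \\ \boldsymbol{B}_{[2]} \\ \vdots \\ \boldsymbol{B}_{[N]}\end{bmatrix}. \tag{3.7}$$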

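Note that we may compute $\boldsymbol{B}_c$ as $\boldsymbol{B}_c = \mathfrak{Q}^\top\boldsymbol{Q}_c$. Hence, the update of $\boldsymbol{Q}_+$ (Step 5 in Algorithm III-1, Kolmogorov-Nagumo Integration) can be written as

$$\boldsymbol{Q}_+ = \boldsymbol{Q}_c\boldsymbol{C} + \boldsymbol{X}_c\boldsymbol{C}^{-1} = \boldsymbol{Q}_c\boldsymbol{C} + \boldsymbol{G}_c\boldsymbol{C}^{-1} - \boldsymbol{Q}_c\boldsymbol{D}_c\boldsymbol{C}^{-1} = \boldsymbol{Q}_c\boldsymbol{F}_c + \boldsymbol{G}_c\boldsymbol{F}_c^g, \tag{3.8}$$

where

$$\boldsymbol{F}_c = \boldsymbol{C} - \boldsymbol{D}_c\boldsymbol{C}^{-1}, \qquad \boldsymbol{F}_c^g = \boldsymbol{C}^{-1}. \tag{3.9}$$

Similarly, we can update $\boldsymbol{B}_+ = \mathfrak{Q}^\top\boldsymbol{Q}_+$ as

$$\boldsymbol{B}_+ = \mathfrak{Q}^\top\boldsymbol{Q}_+ = \mathfrak{Q}^\top\boldsymbol{Q}_c\boldsymbol{F}_c + \mathfrak{Q}^\top\boldsymbol{G}_c\boldsymbol{F}_c^g = \boldsymbol{B}_c\boldsymbol{F}_c + \boldsymbol{B}_c^g\boldsymbol{F}_c^g, \tag{3.10}$$

where $\boldsymbol{B}_c^g = \mathfrak{Q}^\top\boldsymbol{G}_c$. Furthermore, instead of forming $\boldsymbol{X}_c$, we may compute $\boldsymbol{\Xi} = \boldsymbol{X}_c^\top\boldsymbol{X}_c$ directly as

$$\begin{aligned}
\boldsymbol{\Xi} = \boldsymbol{X}_c^\top\boldsymbol{X}_c &= \left(\boldsymbol{G}_c^\top - \boldsymbol{D}_c\boldsymbol{Q}_c^\top\right)\left(\boldsymbol{G}_c - \boldsymbol{Q}_c\boldsymbol{D}_c\right) \\
&= \boldsymbol{G}_c^\top\boldsymbol{G}_c - \boldsymbol{G}_c^\top\boldsymbol{Q}_c\boldsymbol{D}_c - \boldsymbol{D}_c\boldsymbol{Q}_c^\top\boldsymbol{G}_c + \boldsymbol{D}_c\boldsymbol{Q}_c^\top\boldsymbol{Q}_c\boldsymbol{D}_c \\
&= \boldsymbol{G}_c^\top\boldsymbol{G}_c - \boldsymbol{D}_c\boldsymbol{D}_c - \boldsymbol{D}_c\boldsymbol{D}_c + \boldsymbol{D}_c\boldsymbol{D}_c \\
&= \boldsymbol{D}_c^g - \boldsymbol{D}_c^2,
\end{aligned} \tag{3.11}$$

where $\boldsymbol{D}_c^g = \boldsymbol{G}_c^\top\boldsymbol{G}_c = \frac{1}{N}\boldsymbol{B}_c^\top\boldsymbol{B}_c^g$. Note that $\boldsymbol{Q}_c^\top\boldsymbol{G}_c = \boldsymbol{G}_c^\top\boldsymbol{Q}_c = \boldsymbol{D}_c$ and $\boldsymbol{Q}_c^\top\boldsymbol{Q}_c = \boldsymbol{I}$.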

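Algorithm III-4 Kolmogorov-Nagumo Integration (Optimized)

Require: $\boldsymbol{Q}_{[1]}, \boldsymbol{Q}_{[2]}, \dots, \boldsymbol{Q}_{[N]}$ (real $m \times l$ orthogonal matrices), $\boldsymbol{Q}_{\text{init}}$ (initial guess).

1: Combine $\mathfrak{Q} = \begin{bmatrix}\boldsymbol{Q}_{[1]} & \boldsymbol{Q}_{[2]} & \cdots & \boldsymbol{Q}_{[N]}\end{bmatrix}$.

2: Initialize the current iterate $\boldsymbol{Q}_c \leftarrow \boldsymbol{Q}_{\text{init}}$.

3: Assign $\boldsymbol{B}_c \leftarrow \mathfrak{Q}^\top\boldsymbol{Q}_c$.

4: while (not convergent) do

5: Assign $\boldsymbol{G}_c \leftarrow \frac{1}{N}\mathfrak{Q}\boldsymbol{B}_c$, $\boldsymbol{B}_c^g \leftarrow \mathfrak{Q}^\top\boldsymbol{G}_c$, $\boldsymbol{D}_c \leftarrow \frac{1}{N}\boldsymbol{B}_c^\top\boldsymbol{B}_c$, $\boldsymbol{D}_c^g \leftarrow \frac{1}{N}\boldsymbol{B}_c^\top\boldsymbol{B}_c^g$.

6: Compute $\boldsymbol{C} \leftarrow \left(\dfrac{\boldsymbol{I}}{2} + \left(\dfrac{\boldsymbol{I}}{4} - \boldsymbol{\Xi}\right)^{1/2}\right)^{1/2}$, where $\boldsymbol{\Xi} = \boldsymbol{D}_c^g - \boldsymbol{D}_c^2$.

7: Assign $\boldsymbol{F}_c \leftarrow \boldsymbol{C} - \boldsymbol{D}_c\boldsymbol{C}^{-1}$ and $\boldsymbol{F}_c^g \leftarrow \boldsymbol{C}^{-1}$.

8: Update $\boldsymbol{Q}_c \leftarrow \boldsymbol{Q}_c\boldsymbol{F}_c + \boldsymbol{G}_c\boldsymbol{F}_c^g$ and $\boldsymbol{B}_c \leftarrow \boldsymbol{B}_c\boldsymbol{F}_c + \boldsymbol{B}_c^g\boldsymbol{F}_c^g$.

9: end while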
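A compact NumPy prototype of Algorithm III-4 is given below as a sketch under stated assumptions: it is serial, it uses the stopping test $\lVert\boldsymbol{I} - \boldsymbol{C}\rVert_2 < \epsilon$, and the function names, tolerance, iteration cap, and test data are hypothetical. Note that $\boldsymbol{B}_c$ is carried along by the recurrence in Step 8, so $\mathfrak{Q}^\top\boldsymbol{Q}_c$ is formed explicitly only once.

```python
import numpy as np


def sqrtm_psd(S):
    """Square root of a symmetric positive semidefinite matrix via eigh."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T


def kn_integration_optimized(Qs, Q_init, tol=1e-10, max_iter=500):
    """Optimized Kolmogorov-Nagumo iteration following Algorithm III-4."""
    N = len(Qs)
    Qfrak = np.hstack(Qs)                  # Step 1: m-by-(N l) combined matrix
    Qc = Q_init.copy()                     # Step 2
    Bc = Qfrak.T @ Qc                      # Step 3: (N l)-by-l
    I = np.eye(Qc.shape[1])
    for _ in range(max_iter):
        Gc = Qfrak @ Bc / N                # Step 5
        Bg = Qfrak.T @ Gc
        Dc = Bc.T @ Bc / N
        Dg = Bc.T @ Bg / N
        Xi = Dg - Dc @ Dc                  # eq. (3.11)
        C = sqrtm_psd(I / 2 + sqrtm_psd(I / 4 - Xi))   # Step 6
        Fg = np.linalg.inv(C)              # Step 7
        F = C - Dc @ Fg
        Qc = Qc @ F + Gc @ Fg              # Step 8
        Bc = Bc @ F + Bg @ Fg
        if np.linalg.norm(I - C, 2) < tol:
            break
    return Qc, Bc, Qfrak


# Hypothetical data: N = 8 sketches of a noisy rank-4 matrix.
rng = np.random.default_rng(0)
m, n, l, N = 60, 300, 6, 8
A = rng.standard_normal((m, 4)) @ rng.standard_normal((4, n))
A += 0.1 * rng.standard_normal((m, n))
Qs = [np.linalg.qr(A @ rng.standard_normal((n, l)))[0] for _ in range(N)]
Q_opt, B_opt, Qfrak = kn_integration_optimized(Qs, Qs[0])
print(np.allclose(B_opt, Qfrak.T @ Q_opt, atol=1e-8))
```

Inside the loop, the only operations whose cost grows with $m$ are the two products with $\mathfrak{Q}$ and the update of $\boldsymbol{Q}_c$; everything else involves matrices with $l$ columns and at most $Nl$ rows.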

3.3 Improvements of Wen-Yin Integration

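Similar to the Optimized Kolmogorov-Nagumo Integration (Algorithm III-4), we combine the matrices $\mathfrak{Q} = \begin{bmatrix}\boldsymbol{Q}_{[1]} & \boldsymbol{Q}_{[2]} & \cdots & \boldsymbol{Q}_{[N]}\end{bmatrix}$ and define

$$\begin{aligned}
\boldsymbol{G}_c &= \boldsymbol{P}\boldsymbol{Q}_c = \tfrac{1}{N}\mathfrak{Q}\boldsymbol{B}_c, \\
\boldsymbol{B}_c &= \mathfrak{Q}^\top\boldsymbol{Q}_c, \\
\boldsymbol{B}_c^g &= \mathfrak{Q}^\top\boldsymbol{G}_c, \\
\boldsymbol{D}_c &= \boldsymbol{Q}_c^\top\boldsymbol{P}\boldsymbol{Q}_c = \boldsymbol{Q}_c^\top\boldsymbol{G}_c = \boldsymbol{G}_c^\top\boldsymbol{Q}_c = \tfrac{1}{N}\boldsymbol{B}_c^\top\boldsymbol{B}_c, \\
\boldsymbol{D}_c^g &= \boldsymbol{Q}_c^\top\boldsymbol{P}^2\boldsymbol{Q}_c = \boldsymbol{G}_c^\top\boldsymbol{G}_c = \tfrac{1}{N}\boldsymbol{B}_c^\top\boldsymbol{B}_c^g.
\end{aligned} \tag{3.12}$$

We observe that the objective value $f(\boldsymbol{Q}_c)$ used in Algorithm III-2 (Wen-Yin Integration) can be computed by

$$f(\boldsymbol{Q}_c) = \frac{1}{2}\operatorname{tr}\left(\boldsymbol{Q}_c^\top\boldsymbol{P}\boldsymbol{Q}_c\right) = \frac{1}{2}\operatorname{tr}\left(\frac{1}{N}\boldsymbol{B}_c^\top\boldsymbol{B}_c\right) = \frac{1}{2N}\lVert\boldsymbol{B}_c\rVert_F^2. \tag{3.13}$$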

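Moreover, $\lVert\boldsymbol{M}\rVert_F^2$ can be written as

$$\begin{aligned}
\lVert\boldsymbol{M}\rVert_F^2 = \operatorname{tr}(\boldsymbol{M}^\top\boldsymbol{M}) &= \operatorname{tr}\left(\left(\boldsymbol{Q}_c\boldsymbol{G}_c^\top - \boldsymbol{G}_c\boldsymbol{Q}_c^\top\right)\left(\boldsymbol{G}_c\boldsymbol{Q}_c^\top - \boldsymbol{Q}_c\boldsymbol{G}_c^\top\right)\right) \\
&= \operatorname{tr}\left(\boldsymbol{Q}_c\boldsymbol{G}_c^\top\boldsymbol{G}_c\boldsymbol{Q}_c^\top + \boldsymbol{G}_c\boldsymbol{Q}_c^\top\boldsymbol{Q}_c\boldsymbol{G}_c^\top - \boldsymbol{Q}_c\boldsymbol{G}_c^\top\boldsymbol{Q}_c\boldsymbol{G}_c^\top - \boldsymbol{G}_c\boldsymbol{Q}_c^\top\boldsymbol{G}_c\boldsymbol{Q}_c^\top\right) \\
&= 2\operatorname{tr}\left(\boldsymbol{G}_c^\top\boldsymbol{G}_c\right) - 2\operatorname{tr}\left(\boldsymbol{Q}_c^\top\boldsymbol{G}_c\boldsymbol{Q}_c^\top\boldsymbol{G}_c\right) = 2\operatorname{tr}\left(\boldsymbol{D}_c^g\right) - 2\lVert\boldsymbol{D}_c\rVert_F^2.
\end{aligned} \tag{3.14}$$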

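To compute $\boldsymbol{Q}_+$, instead of explicitly forming the $m \times m$ matrix $\boldsymbol{M} = \boldsymbol{G}_c\boldsymbol{Q}_c^\top - \boldsymbol{Q}_c\boldsymbol{G}_c^\top$, we construct two low-rank matrices [5]

$$\boldsymbol{L} = \begin{bmatrix}\boldsymbol{G}_c & \boldsymbol{Q}_c\end{bmatrix} \quad\text{and}\quad \boldsymbol{R} = \begin{bmatrix}\boldsymbol{Q}_c & -\boldsymbol{G}_c\end{bmatrix} \tag{3.15}$$

with $\boldsymbol{M} = \boldsymbol{L}\boldsymbol{R}^\top$. Using the Woodbury matrix identity, the inverse can be rewritten as

$$\left(\boldsymbol{I} - \frac{\tau}{2}\boldsymbol{M}\right)^{-1} = \left(\boldsymbol{I} - \frac{\tau}{2}\boldsymbol{L}\boldsymbol{R}^\top\right)^{-1} = \boldsymbol{I} - \boldsymbol{L}\left(\boldsymbol{R}^\top\boldsymbol{L} - \frac{2}{\tau}\boldsymbol{I}\right)^{-1}\boldsymbol{R}^\top. \tag{3.16}$$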

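Therefore,

$$\begin{aligned}
\boldsymbol{Q}_+ &= \left(\boldsymbol{I} - \frac{\tau}{2}\boldsymbol{M}\right)^{-1}\left(\boldsymbol{I} + \frac{\tau}{2}\boldsymbol{M}\right)\boldsymbol{Q}_c = \left(2\left(\boldsymbol{I} - \frac{\tau}{2}\boldsymbol{M}\right)^{-1} - \boldsymbol{I}\right)\boldsymbol{Q}_c \\
&= \boldsymbol{Q}_c - \boldsymbol{L}\left(\frac{1}{2}\boldsymbol{R}^\top\boldsymbol{L} - \frac{1}{\tau}\boldsymbol{I}\right)^{-1}\boldsymbol{R}^\top\boldsymbol{Q}_c \\
&= \boldsymbol{Q}_c - \begin{bmatrix}\boldsymbol{G}_c & \boldsymbol{Q}_c\end{bmatrix}\boldsymbol{C}^{-1}\begin{bmatrix}\boldsymbol{Q}_c^\top \\ -\boldsymbol{G}_c^\top\end{bmatrix}\boldsymbol{Q}_c \\
&= \boldsymbol{Q}_c - \begin{bmatrix}\boldsymbol{G}_c & \boldsymbol{Q}_c\end{bmatrix}\boldsymbol{C}^{-1}\begin{bmatrix}\boldsymbol{I} \\ -\boldsymbol{D}_c\end{bmatrix},
\end{aligned} \tag{3.17}$$

where

$$\begin{aligned}
\boldsymbol{C} &= \frac{1}{2}\boldsymbol{R}^\top\boldsymbol{L} - \frac{1}{\tau}\boldsymbol{I} = \frac{1}{2}\begin{bmatrix}\boldsymbol{Q}_c^\top \\ -\boldsymbol{G}_c^\top\end{bmatrix}\begin{bmatrix}\boldsymbol{G}_c & \boldsymbol{Q}_c\end{bmatrix} - \frac{1}{\tau}\boldsymbol{I} \\
&= \frac{1}{2}\begin{bmatrix}\boldsymbol{Q}_c^\top\boldsymbol{G}_c & \boldsymbol{Q}_c^\top\boldsymbol{Q}_c \\ -\boldsymbol{G}_c^\top\boldsymbol{G}_c & -\boldsymbol{G}_c^\top\boldsymbol{Q}_c\end{bmatrix} - \frac{1}{\tau}\boldsymbol{I} = \frac{1}{2}\begin{bmatrix}\boldsymbol{D}_c - \frac{2}{\tau}\boldsymbol{I} & \boldsymbol{I} \\ -\boldsymbol{D}_c^g & -\boldsymbol{D}_c - \frac{2}{\tau}\boldsymbol{I}\end{bmatrix}
\end{aligned} \tag{3.18}$$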

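is a $2l \times 2l$ matrix, which is much smaller than $\boldsymbol{M}$. Denoting

$$\boldsymbol{C}^{-1} = \begin{bmatrix}\boldsymbol{C}^{-1}_{11} & \boldsymbol{C}^{-1}_{12} \\ \boldsymbol{C}^{-1}_{21} & \boldsymbol{C}^{-1}_{22}\end{bmatrix}, \tag{3.19}$$

eq. (3.17) becomes

$$\begin{aligned}
\boldsymbol{Q}_+ &= \boldsymbol{Q}_c - \begin{bmatrix}\boldsymbol{G}_c & \boldsymbol{Q}_c\end{bmatrix}\begin{bmatrix}\boldsymbol{C}^{-1}_{11} & \boldsymbol{C}^{-1}_{12} \\ \boldsymbol{C}^{-1}_{21} & \boldsymbol{C}^{-1}_{22}\end{bmatrix}\begin{bmatrix}\boldsymbol{I} \\ -\boldsymbol{D}_c\end{bmatrix} \\
&= \boldsymbol{Q}_c - \boldsymbol{G}_c\boldsymbol{C}^{-1}_{11} + \boldsymbol{G}_c\boldsymbol{C}^{-1}_{12}\boldsymbol{D}_c - \boldsymbol{Q}_c\boldsymbol{C}^{-1}_{21} + \boldsymbol{Q}_c\boldsymbol{C}^{-1}_{22}\boldsymbol{D}_c \\
&= \boldsymbol{Q}_c\left(\boldsymbol{C}^{-1}_{22}\boldsymbol{D}_c - \boldsymbol{C}^{-1}_{21} + \boldsymbol{I}\right) + \boldsymbol{G}_c\left(\boldsymbol{C}^{-1}_{12}\boldsymbol{D}_c - \boldsymbol{C}^{-1}_{11}\right) \\
&= \boldsymbol{Q}_c\boldsymbol{F}_c + \boldsymbol{G}_c\boldsymbol{F}_c^g,
\end{aligned} \tag{3.20}$$

where

$$\boldsymbol{F}_c = \boldsymbol{C}^{-1}_{22}\boldsymbol{D}_c - \boldsymbol{C}^{-1}_{21} + \boldsymbol{I}, \qquad \boldsymbol{F}_c^g = \boldsymbol{C}^{-1}_{12}\boldsymbol{D}_c - \boldsymbol{C}^{-1}_{11}. \tag{3.21}$$

Therefore, we can update $\boldsymbol{B}_+ = \mathfrak{Q}^\top\boldsymbol{Q}_+$ as

$$\boldsymbol{B}_+ = \mathfrak{Q}^\top\boldsymbol{Q}_+ = \mathfrak{Q}^\top\boldsymbol{Q}_c\boldsymbol{F}_c + \mathfrak{Q}^\top\boldsymbol{G}_c\boldsymbol{F}_c^g = \boldsymbol{B}_c\boldsymbol{F}_c + \boldsymbol{B}_c^g\boldsymbol{F}_c^g. \tag{3.22}$$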
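The low-rank form of the Wen-Yin update can be verified against the direct Cayley transform. The NumPy sketch below builds the $2l \times 2l$ matrix $\boldsymbol{C}$ of eq. (3.18), extracts the blocks of its inverse as in eq. (3.19), and forms $\boldsymbol{Q}_+$ via eqs. (3.20) and (3.21). The function names, the step size $\tau = 0.5$, and the test data are assumptions for illustration; the check relies on $\boldsymbol{Q}_c$ having orthonormal columns and on $\boldsymbol{G}_c = \boldsymbol{P}\boldsymbol{Q}_c$ with symmetric $\boldsymbol{P}$.

```python
import numpy as np


def cayley_direct(Qc, Gc, tau):
    """Q_+ from the m-by-m system, as in Step 4 of Algorithm III-2."""
    m = Qc.shape[0]
    M = Gc @ Qc.T - Qc @ Gc.T
    I = np.eye(m)
    return np.linalg.solve(I - tau / 2 * M, (I + tau / 2 * M) @ Qc)


def cayley_lowrank(Qc, Gc, tau):
    """Q_+ from the 2l-by-2l matrix C of eq. (3.18), via eqs. (3.19)-(3.21)."""
    l = Qc.shape[1]
    I = np.eye(l)
    Dc = Qc.T @ Gc                      # symmetric when G_c = P Q_c
    Dg = Gc.T @ Gc
    C = 0.5 * np.block([[Dc - 2 / tau * I, I],
                        [-Dg, -Dc - 2 / tau * I]])
    Cinv = np.linalg.inv(C)
    C11, C12 = Cinv[:l, :l], Cinv[:l, l:]
    C21, C22 = Cinv[l:, :l], Cinv[l:, l:]
    F = C22 @ Dc - C21 + I              # eq. (3.21)
    Fg = C12 @ Dc - C11
    return Qc @ F + Gc @ Fg             # eq. (3.20)


# Hypothetical data: N = 4 random orthonormal bases, current iterate Q_c = Q_[1].
rng = np.random.default_rng(0)
m, l, N = 40, 5, 4
Qs = [np.linalg.qr(rng.standard_normal((m, l)))[0] for _ in range(N)]
Qc = Qs[0]
Gc = sum(Qi @ (Qi.T @ Qc) for Qi in Qs) / N    # G_c = P Q_c without forming P
Q_direct = cayley_direct(Qc, Gc, tau=0.5)
Q_lowrank = cayley_lowrank(Qc, Gc, tau=0.5)
print(np.allclose(Q_direct, Q_lowrank))
```

The direct route solves an $m \times m$ linear system, whereas the low-rank route inverts only a $2l \times 2l$ matrix and applies two $m \times l$ by $l \times l$ products.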
