GPU Acceleration - 利用高度平行之演算法整合多重隨機奇異值分解並應用於巨量資料分析

to all processors (MPI_Allgather).

2: Assign𝘡^(𝑗)← 𝘈^{⟨𝑗⟩⊤}𝑸 in processor 𝑗.

3: Sum𝑴 ← ∑^𝑃_𝑗=1𝘡^(𝑗)⊤𝘘^(𝑗)to all processors (MPI_Allreduce).

4: Compute ̂𝑾_𝑙𝜮̂_𝑙𝑾̂_𝑙^⊤ ← eig(𝑴).

5: Extract the largest𝑘 eigen-pairs from ̂𝑾_𝑙, ̂𝜮_𝑙 to obtain ̂𝑾_𝑘, ̂𝜮_𝑘.

6: Assign ̂𝑼_𝑘 ← 𝑸 ̂𝑾_𝑘.

4.6 GPU Acceleration

Sketching and postprocessing stages are the only algorithms using the huge input matrix 𝑨, and cost square to the input size (𝑂(𝑚𝑛𝑙) flops). Fortunately, all the operations require 𝑨 are matrix-matrix multiplication, which can be highly accelerated using GPU [9].

We use GPU to compute𝒀_[𝑖] ← 𝑨𝜴_[𝑖]in sketching and𝒁 ← 𝑨^⊤𝑸 in postprocessing stages. For𝑨 with extremely large sizes that exceed the GPU memory limit, we split the matrices into column/row blocks and compute the multiplication successively.

Chapter 5 Comparison of Algorithms

In this chapter, we denote 𝐼 as the number of iteration, and 𝐼_𝑠 as the total number of iteration in finding step size (𝐼_𝑠 ≥ 𝐼). Here we assume 𝑚 ≫ 𝑁𝑙, 𝑙 = 𝑘, and 𝑞 = 0. The following tables only show the leading terms.¹ In the table, we abbreviate the Kolmogorov-Nagumo Iteration, the Wen-Yin Iteration, and the Hierarchical Reduction Integration as KN, WY, and HR respectively.

Tables 5.1to5.3summarize the computational and communication complexity of all the algorithms. For integration, we obtain that the optimized algorithms indeed reduce the computational complexity. We do not recommend the Gramian algorithms, unless one expects the number of iteration to be larger than ^𝑁

4. For orthogonalization and postpro-cessing, we recommend the Gramian algorithms, which are fast and well-parallelized.

By comparing the parallelisms, we obtain that the block-row postprocessing requires an extra𝑛𝑙 communication, and the block-column sketching requires an extra 𝑁𝑚𝑙 com-munication. Therefore we recommend using Block-Column Parallelisms on sketching and postprocessing if and only if the input matrix𝑨 is short and fat enough (that is, 𝑁𝑚 < 𝑛).

1SeeAppendix Bfor detail.

Table 5.1: The Complexity of Sequential Algorithms

Stage Name #flops

Sketching Algorithm I-1Gaussian Projection 2𝑁𝑛𝑚𝑙 Orthogonalization Algorithm II-1Canonical QR 4𝑁𝑚𝑙²

Algorithm II-2Gramian 3𝑁𝑚𝑙²

Integration

Algorithm III-1KN (Original) (4𝑁 + 10)𝑚𝑙²𝐼 Algorithm III-4KN (Optimized) (4𝑁 + 4)𝑚𝑙²𝐼 Algorithm III-6KN (Gramian) (𝑁²+ 2𝑁 + 2)𝑚𝑙² Algorithm III-2WY (Original) (2𝑁 + 8)𝑚𝑙²𝐼_𝑠+ (4𝑁 + 4)𝑚𝑙²𝐼 Algorithm III-5WY (Optimized) (4𝑁 + 6)𝑚𝑙²𝐼

Algorithm III-7WY (Gramian) (𝑁²+ 2𝑁 + 2)𝑚𝑙²

Algorithm III-3HR 6𝑁𝑚𝑙²

Postprocessing

Algorithm IV-1SVD with QR 2𝑛𝑚𝑙 + 6𝑛𝑙²+ 2𝑚𝑙² Algorithm IV-2Gramian 2𝑛𝑚𝑙 + 3𝑛𝑙²+ 2𝑚𝑙² Algorithm IV-3Symmetric 2𝑚²𝑙 + 3𝑚𝑙² Table 5.2: The Complexity of Block-Row Algorithms

Stage Name #flops #words

Sketching Algorithm I-2Gaussian Projection 2^𝑁_𝑃𝑛𝑚𝑙 -Orthogonalization Algorithm II-4TSQR 4^𝑁_𝑃𝑚𝑙² (log 𝑃 )𝑁𝑙²

Algorithm II-3Gramian 3^𝑁_𝑃𝑚𝑙² (log 𝑃 )𝑁𝑙²

Integration

Algorithm III-10KN ¹

𝑃(4𝑁 + 4)𝑚𝑙²𝐼 (log 𝑃 )𝑁𝑙²𝐼 Algorithm III-6KN (Gramian) ¹

𝑃(𝑁²+ 2𝑁 + 2)𝑚𝑙² (log 𝑃 )𝑁²𝑙²

Algorithm III-11WY ¹

𝑃(4𝑁 + 6)𝑚𝑙²𝐼 (log 𝑃 )𝑁𝑙²𝐼 Algorithm III-7WY (Gramian) ¹

𝑃(𝑁²+ 2𝑁 + 2)𝑚𝑙² (log 𝑃 )𝑁²𝑙² Table 5.3: The Complexity of Block-Column Algorithms

Stage Name #flops #words

Sketching Algorithm I-3Gaussian Projection 2^𝑁_𝑃𝑛𝑚𝑙 𝑁𝑚𝑙

Postprocessing

Chapter 6 Implementation

The iSVD algorithm is implemented in C++ with more than 20000 lines of codes. The package provides multinode, multithread, and GPU support. We also use some tools for automatic installation and correctness test.

6.1 C++ Implementation

We design and implement iSVD package in C++ with objective oriented programming model. It is paralleled using MPI for multicore clusters. Intel Math Kernel Library [10]

is used for BLAS/LAPACK [11] routines. The package is highly benefited from adopt-ing object-oriented programmadopt-ing, C++11 standard [12], and template meta-programmadopt-ing.

We also wrap the BLAS/LAPACK routines that delegate the selection of the correct type of routines to the compiler. With these techniques, the package can be easily used with different types of input matrix and different iSVD algorithms without modifying exist codes.

6.1.1 Object Oriented Programming

Object-oriented programming (OOP) is a programming paradigm based on objects, which combines data and functions. The benefit to adopt OOP is huge in our implementation.

With OOP, we warp data structures into object, and design linear algebra routines based on these object instead of using raw data directly. This technique makes programming

easier and less fallible. The iSVD algorithms are also implemented with OOP. The user can use them easily without dealing with parameter transfer and memory control.

6.1.2 C++11 Standard

Our implementation benefited plenty from C++11 standard [12]. This new standard is a great improvement from C++03, with many new features such as right value reference helps developer manipulate temporary objects generated by some syntax and avoid per-formance traps, constexpr specifier allows the compiler to do more compile-time de-cision and gain better optimization. The standard library of C++11 provided a smart pointer classes such asshared_ptrwith features such as automatic memory management or bounds checking. All of the data in our package are controlled by smart pointers. One no need to concerned about the dynamic memory allocations.

6.1.3 Template Meta-Programming

Template, one of an important feature of C++, allows function and classes to operate with generic types. In iSVD Package, almost all the codes, such as storage structures, linear algebra functions, and iSVD drivers, are implemented with template. With template usage, we can combine write a single code for all data types and storage layout instead of having multiple similar codes.

The implementation only contains header files. One can easily use it without any binary library to link to, and any configured header file. Our package is a pure template library defined in the headers.

6.1.4 Curiously Recurring Template Pattern

The curiously recurring template pattern (CRTP) was formalized in the 1980s as “F-bounded quantification” [13], and was named by Jim Coplien [14] in 1995. CRTP is an idiom in C++ in which a classXderives from a class template instantiation usingXitself as template argument [15]. With CRTP, we implement interfaces without virtual functions to reduce run-time overheads.

6.1.5 Data Storage and Linear Algebra Wrappers

We implement the matrix structures using OOP, and implement methods so that one may use it similar to the usage of MATLAB matrices. The expressions such as transpose, sub-matrix access, and data type changes are done in compile time. We also warp BLAS/LAPACK routines with our data structures using template to decide the correct type of routines in compile-time. These wrappers are written in intuitive interfaces. With above these tech-nique, one does not need to concern for the complicated matrix memory structure and BLAS/LAPACK routines.

在文檔中利用高度平行之演算法整合多重隨機奇異值分解並應用於巨量資料分析 (頁 63-69)