
Algorithm 2 Multi-threaded

Academic year: 2022


(1)

Algorithm 2

Michael Tsai 2014/1/2

(2)

Scheduling

 The scheduler's job: assign strands to processors for execution.

 On-line: the scheduler does not know in advance when strands will spawn, or when spawned strands will complete.

 Centralized scheduler: a single scheduler knows the global state and does the scheduling (easier to analyze)

 Distributed scheduler: the threads communicate and cooperate with one another to work out the best schedule

(3)

Greedy scheduler

In each time step, assign as many ready strands as possible to as many processors as possible.

If at least P strands are ready to execute at a given time step, it is called a complete step.

Otherwise it is called an incomplete step.

Lower bounds: even the best schedule needs at least this much time:

work law: 𝑇𝑃 ≥ 𝑇1/𝑃

span law: 𝑇𝑃 ≥ 𝑇∞

The series of proofs that follows shows that the greedy scheduler is in fact a rather good scheduler.

(4)

 Theorem: On an ideal parallel computer with P processors, a greedy scheduler executes a multithreaded computation with work 𝑇1 and span 𝑇∞ in time 𝑇𝑃 ≤ 𝑇1/𝑃 + 𝑇∞.

 Proof:

 We want to show that there are at most ⌊𝑇1/𝑃⌋ complete steps.

 By contradiction: suppose there are more than ⌊𝑇1/𝑃⌋ complete steps.

 Then the work done in the complete steps alone is at least:

 𝑃 ∙ (⌊𝑇1/𝑃⌋ + 1) = 𝑃⌊𝑇1/𝑃⌋ + 𝑃 = 𝑇1 − (𝑇1 mod 𝑃) + 𝑃 > 𝑇1

Contradiction: the whole computation only takes 𝑇1 work.

Therefore there are at most ⌊𝑇1/𝑃⌋ complete steps.

(5)

Now consider the number of incomplete steps:

Let G′ be the dag of strands that have not yet been executed. The longest path in G′ must start at a vertex with in-degree 0.

An incomplete step of the greedy scheduler executes every in-degree-0 vertex of G′ (so the remaining dag G′′ contains none of them).

Hence the longest path in G′′ must be one shorter than the longest path in G′.

In other words, each incomplete step shortens the longest path of the "dag of not-yet-executed strands" by 1.

Therefore the number of incomplete steps is at most the span: 𝑇∞.

Finally, every step is either complete or incomplete.

Therefore 𝑇𝑃 ≤ ⌊𝑇1/𝑃⌋ + 𝑇∞.

(Figure: the successive dags 𝐺, 𝐺′, 𝐺′′, … of not-yet-executed strands.)
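The bound just proved can be checked mechanically. Below is a minimal sketch (the function and variable names are mine, not from the slides) that simulates a greedy scheduler on a dag of unit-time strands and counts time steps; on the small diamond-shaped dag used here, the step count indeed stays within ⌊𝑇1/𝑃⌋ + 𝑇∞.

```python
def greedy_schedule(succ, indeg, P):
    # succ[v]: strands that depend on v; indeg[v]: unfinished predecessors.
    # Each time step greedily runs up to P ready (indeg == 0) strands.
    indeg = list(indeg)
    ready = [v for v, d in enumerate(indeg) if d == 0]
    steps = 0
    while ready:
        batch, ready = ready[:P], ready[P:]  # a complete step if len(batch) == P
        steps += 1
        for v in batch:
            for w in succ[v]:
                indeg[w] -= 1
                if indeg[w] == 0:
                    ready.append(w)
    return steps

# Diamond dag: 0 -> {1, 2} -> 3.  Work T1 = 4, span T_inf = 3.
succ = [[1, 2], [3], [3], []]
T_P = greedy_schedule(succ, [0, 1, 1, 1], P=2)
print(T_P)  # 3, and indeed 3 <= 4 // 2 + 3
```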

(6)

 Corollary: The running time 𝑇𝑃 of any

multithreaded computation scheduled by a

greedy scheduler on an ideal parallel computer with P processors is within a factor of 2 of optimal.

 Proof:

 Let 𝑇𝑃∗ be the running time of an optimal scheduler on a machine with P processors.

 Let 𝑇1 and 𝑇∞ be the work and span, respectively.

 By the work and span laws, 𝑇𝑃∗ ≥ max(𝑇1/𝑃, 𝑇∞).

 By the theorem, 𝑇𝑃 ≤ 𝑇1/𝑃 + 𝑇∞ ≤ 2 max(𝑇1/𝑃, 𝑇∞) ≤ 2𝑇𝑃∗.

(7)

Corollary: Let 𝑇𝑃 be the running time of a multithreaded computation produced by a greedy scheduler on an ideal parallel computer with P processors, and let 𝑇1 and 𝑇∞ be the work and span of the computation, respectively. Then, if 𝑷 ≪ 𝑻𝟏/𝑻∞, we have 𝑻𝑷 ≈ 𝑻𝟏/𝑷, or equivalently, a speedup of approximately P.

Proof:

If 𝑃 ≪ 𝑇1/𝑇∞, then 𝑇∞ ≪ 𝑇1/𝑃.

So 𝑇𝑃 ≤ 𝑇1/𝑃 + 𝑇∞ ≈ 𝑇1/𝑃.

(Work law: 𝑇𝑃 ≥ 𝑇1/𝑃)

So the speedup is 𝑇1/𝑇𝑃 ≈ 𝑃.

When is ≪ "enough"? In practice, when the slackness (𝑇1/𝑇∞)/𝑃 ≥ 10.

(8)
(9)
(10)

Back to P-FIB

 𝑇1(𝑛) = 𝑇(𝑛) = Θ(𝜙ⁿ)

 𝑇∞(𝑛) = max(𝑇∞(𝑛−1), 𝑇∞(𝑛−2)) + Θ(1) = 𝑇∞(𝑛−1) + Θ(1)

 𝑇∞(𝑛) = Θ(𝑛)

 Parallelism: 𝑇1(𝑛)/𝑇∞(𝑛) = Θ(𝜙ⁿ/𝑛)

P-FIB(n)
  if n <= 1
    return n
  else
    x = spawn P-FIB(n-1)
    y = P-FIB(n-2)
    sync
    return x + y

Even on a large parallel computer, an ordinary n already lets our program achieve near-perfect linear speedup. (The parallelism is much larger than P, so the slackness is large.)
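The spawn/sync structure of P-FIB can be sketched directly in Python, with "spawn" as a thread start and "sync" as a join. This is only a structural illustration (the helper name is mine, and CPython's GIL prevents real speedup for CPU-bound work):

```python
import threading

def p_fib(n):
    # Mirrors the P-FIB pseudocode: spawn -> Thread.start, sync -> Thread.join.
    if n <= 1:
        return n
    result = {}
    t = threading.Thread(target=lambda: result.__setitem__('x', p_fib(n - 1)))
    t.start()                  # spawn P-FIB(n-1)
    y = p_fib(n - 2)           # parent continues with P-FIB(n-2)
    t.join()                   # sync
    return result['x'] + y

print(p_fib(10))  # 55
```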

(11)

Parallel Loops

 Parallel loops: execute the iterations of a loop in parallel.

 Add the keyword "parallel" in front of the for keyword.

 The same effect can be achieved with spawn and sync, but this syntax is more convenient.
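A minimal sketch of what a parallel for amounts to, assuming the iterations are independent (the `parallel_for` helper and its worker count are illustrative choices, not part of the pseudocode syntax):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_for(body, n, workers=4):
    # Run body(i) for i = 1..n, with iterations free to run concurrently.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        list(ex.map(body, range(1, n + 1)))

squares = [0] * 9
parallel_for(lambda i: squares.__setitem__(i, i * i), 8)
print(squares[1:])  # [1, 4, 9, 16, 25, 36, 49, 64]
```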

(12)

Matrix-vector multiplication

 A: an n x n matrix

 x: a vector of size n

 We want to compute y = Ax.

 𝑦𝑖 = ∑_{𝑗=1}^{𝑛} 𝑎𝑖𝑗𝑥𝑗.

(Figure: y = Ax; row i of A times x gives 𝑦𝑖.)

MAT-VEC(A, x)
  n = A.rows
  let y be a new vector of length n
  parallel for i = 1 to n
    𝑦𝑖 = 0
  parallel for i = 1 to n
    for j = 1 to n
      𝑦𝑖 = 𝑦𝑖 + 𝑎𝑖𝑗𝑥𝑗
  return y
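The parallel loops of MAT-VEC can be sketched with an executor; each row's dot product writes only its own 𝑦𝑖, so the iterations are independent (this is an illustrative transcription, not the book's code):

```python
from concurrent.futures import ThreadPoolExecutor

def mat_vec(A, x):
    # Flat "parallel for" version of MAT-VEC: each row's dot product
    # is an independent task, so rows can run concurrently.
    n = len(A)
    y = [0] * n
    def row(i):
        for j in range(n):
            y[i] += A[i][j] * x[j]
    with ThreadPoolExecutor() as ex:
        list(ex.map(row, range(n)))
    return y

print(mat_vec([[1, 2], [3, 4]], [5, 6]))  # [17, 39]
```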

(13)

MAT-VEC-MAIN-LOOP(A, x, y, n, i, i′)
  if i == i′
    for j = 1 to n
      𝑦𝑖 = 𝑦𝑖 + 𝑎𝑖𝑗𝑥𝑗
  else
    mid = ⌊(i + i′)/2⌋
    spawn MAT-VEC-MAIN-LOOP(A, x, y, n, i, mid)
    MAT-VEC-MAIN-LOOP(A, x, y, n, mid+1, i′)
    sync

n: the size of A and x
i: the first row to compute
i′: the last row to compute
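MAT-VEC-MAIN-LOOP's spawn/sync recursion can likewise be sketched with threads (0-based indices here; one thread per spawn is wasteful in practice and only illustrates the recursion shape):

```python
import threading

def mat_vec_main_loop(A, x, y, n, i, i2):
    # Compute y[i..i2] += A[i..i2] * x by halving the row range.
    if i == i2:
        for j in range(n):
            y[i] += A[i][j] * x[j]
    else:
        mid = (i + i2) // 2
        t = threading.Thread(target=mat_vec_main_loop,
                             args=(A, x, y, n, i, mid))
        t.start()                                   # spawn first half
        mat_vec_main_loop(A, x, y, n, mid + 1, i2)  # second half in parent
        t.join()                                    # sync

y = [0, 0]
mat_vec_main_loop([[1, 2], [3, 4]], [5, 6], y, 2, 0, 1)
print(y)  # [17, 39]
```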

(14)

(Figure: y = Ax computed by halving the row range, e.g. rows 1–4 and 5–8.)

(15)

Analyze MAT-VEC

 Work: compute the ordinary serial running time, as if the parallel keywords were removed.

 𝑇1(𝑛) = Θ(𝑛²)

 (dominated by the doubly nested loop)

 But note that the spawn/parallel-loop machinery adds extra overhead!

MAT-VEC(A, x)
  n = A.rows
  let y be a new vector of length n
  parallel for i = 1 to n
    𝑦𝑖 = 0
  parallel for i = 1 to n
    for j = 1 to n
      𝑦𝑖 = 𝑦𝑖 + 𝑎𝑖𝑗𝑥𝑗
  return y

(16)

The recursion tree has n−1 internal nodes and n leaves.

Each internal node performs one spawn.

(the overhead of a spawn is constant)

There are more leaves than internal nodes, so the spawn cost can be amortized over the leaves.

Therefore the spawn overhead does not raise the order of 𝑇1(𝑛).

(though the constant gets larger)

(17)

Analyze MAT-VEC

 Now we compute the span.

 This time the spawn overhead must be taken into account.

 The number of spawns along a path is log(n), since each level splits the range in two, and each spawn costs a constant overhead.

 Let 𝑖𝑡𝑒𝑟∞(𝑖) denote the span of the i-th iteration.

 So in general, the span of a parallel loop is:

 𝑇∞(𝑛) = Θ(log 𝑛) + max_{1≤𝑖≤𝑛} 𝑖𝑡𝑒𝑟∞(𝑖)

 Here 𝑇∞(𝑛) = Θ(log 𝑛) + Θ(𝑛) = Θ(𝑛)

 Parallelism = Θ(𝑛²)/Θ(𝑛) = Θ(𝑛)

MAT-VEC(A, x)
  n = A.rows
  let y be a new vector of length n
  parallel for i = 1 to n
    𝑦𝑖 = 0
  parallel for i = 1 to n
    for j = 1 to n
      𝑦𝑖 = 𝑦𝑖 + 𝑎𝑖𝑗𝑥𝑗
  return y

(18)

Race Conditions

 A multithreaded algorithm should be deterministic:

every run produces the same result,

 no matter how the different parts of the program are scheduled onto different processors/cores.

 When an algorithm "fails" to be deterministic, the usual culprit is a "determinacy race".

 Determinacy race: two logically parallel instructions access the same memory location and at least one of them is a write.

(19)

Determinacy Race Examples

Therac-25: a radiation therapy machine.

Between 1985 and 1987 it gave at least 6 patients about 100 times the intended radiation dose, causing deaths and severe radiation burns.

Further reading: "Killer Softwares".

http://hiraman-sharma.blogspot.com/2010/07/killer-softwares.html

(20)

Determinacy Race Examples

2003 Northeast Blackout: Affected 10M people in Ontario & 45M people in 8 U.S. states.

Cause: a software bug known as a race condition existed in General Electric Energy's Unix-based XA/21 energy management system.

(21)

Race-Example

RACE-EXAMPLE()
  x = 0
  parallel for i = 1 to 2
    x = x + 1
  print x

Note: most execution orders give the correct result; only a few orderings produce the error!

Finding such errors is extremely difficult!
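The failure can be made deterministic by simulating the three machine-level steps of x = x + 1 (read, increment, write) under two different interleavings. A sketch, with all names made up for illustration:

```python
# Simulate two strands each running "x = x + 1", where the statement
# is three machine-level steps: read x, increment, write x.
def run(schedule):
    x = 0
    regs = {}
    for thread, op in schedule:
        if op == "read":
            regs[thread] = x
        elif op == "incr":
            regs[thread] += 1
        else:  # "write"
            x = regs[thread]
    return x

serial = [(1, "read"), (1, "incr"), (1, "write"),
          (2, "read"), (2, "incr"), (2, "write")]
racy   = [(1, "read"), (2, "read"), (1, "incr"),
          (2, "incr"), (1, "write"), (2, "write")]
print(run(serial), run(racy))  # 2 1  (the racy order loses an update)
```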

(22)

How do we avoid race conditions?

 There are many techniques (mutexes etc.; covered in an OS course).

 A simple approach: only parallelize computations that are independent, i.e. that have no dependences on one another.

 Make sure a spawned child is independent of its parent and of every other spawned child.
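As a sketch of the mutex technique mentioned above, using Python's threading.Lock (this shows the structure only; for this tiny loop a lock serializes everything):

```python
import threading

# A lock makes the read-modify-write of x atomic,
# so the two increments cannot race.
x = 0
lock = threading.Lock()

def increment():
    global x
    with lock:  # only one strand at a time may enter
        x = x + 1

threads = [threading.Thread(target=increment) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(x)  # 2
```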

(23)

Socrates chess-playing program

Original version: 𝑇1 = 2048, 𝑇∞ = 1
"Optimized version": 𝑇1 = 1024, 𝑇∞ = 8

𝑃 = 32:
𝑇32 = 2048/32 + 1 = 65 (original)
𝑇32 = 1024/32 + 8 = 40 ("optimized")

𝑃 = 512:
𝑇512 = 2048/512 + 1 = 5 (original)
𝑇512 = 1024/512 + 8 = 10 ("optimized")

The "optimization" wins on 32 processors but loses on 512, where the larger span dominates.
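The four running times above come straight from the greedy-scheduler bound 𝑇𝑃 ≈ 𝑇1/𝑃 + 𝑇∞; a quick arithmetic check (helper name is mine):

```python
def greedy_bound(work, span, P):
    # Running time estimated by the greedy-scheduler bound T1/P + T_inf.
    return work // P + span

print(greedy_bound(2048, 1, 32), greedy_bound(1024, 8, 32))    # 65 40
print(greedy_bound(2048, 1, 512), greedy_bound(1024, 8, 512))  # 5 10
```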

(24)

Square-matrix multiplication

P-SQUARE-MATRIX-MULTIPLY(A, B)
  n = A.rows
  let C be a new n x n matrix
  parallel for i = 1 to n
    parallel for j = 1 to n
      𝑐𝑖𝑗 = 0
      for k = 1 to n
        𝑐𝑖𝑗 = 𝑐𝑖𝑗 + 𝑎𝑖𝑘 ∙ 𝑏𝑘𝑗
  return C

𝑇1(𝑛) = Θ(𝑛³)

𝑇∞(𝑛) = Θ(log 𝑛) + Θ(log 𝑛) + Θ(𝑛) = Θ(𝑛)

Parallelism: 𝑇1(𝑛)/𝑇∞(𝑛) = Θ(𝑛³)/Θ(𝑛) = Θ(𝑛²)
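A direct transcription of P-SQUARE-MATRIX-MULTIPLY as a sketch: the two outer loops are parallel (each 𝑐𝑖𝑗 is independent), while the inner k-loop stays serial, which is what makes the span Θ(n).

```python
from concurrent.futures import ThreadPoolExecutor

def p_square_matrix_multiply(A, B):
    # Each cell c_ij is an independent task; its inner k-loop is serial.
    n = len(A)
    C = [[0] * n for _ in range(n)]
    def cell(ij):
        i, j = divmod(ij, n)
        for k in range(n):
            C[i][j] += A[i][k] * B[k][j]
    with ThreadPoolExecutor() as ex:
        list(ex.map(cell, range(n * n)))
    return C

print(p_square_matrix_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]
```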

(25)

Divide-and-conquer

Multithreaded Algorithm for Matrix Multiplication (rather long-winded)

 𝐴 = [𝐴11 𝐴12; 𝐴21 𝐴22]

 𝐵 = [𝐵11 𝐵12; 𝐵21 𝐵22]

 𝐶 = [𝐶11 𝐶12; 𝐶21 𝐶22] = [𝐴11 𝐴12; 𝐴21 𝐴22][𝐵11 𝐵12; 𝐵21 𝐵22]

= [𝐴11𝐵11 𝐴11𝐵12; 𝐴21𝐵11 𝐴21𝐵12] + [𝐴12𝐵21 𝐴12𝐵22; 𝐴22𝐵21 𝐴22𝐵22]

(the first product matrix is accumulated into C, the second into the temporary matrix T)

(26)

P-MATRIX-MULTIPLY-RECURSIVE(C, A, B)
  n = A.rows
  if n == 1
    𝑐11 = 𝑎11𝑏11
  else
    let T be a new n x n matrix
    partition A, B, C, and T into n/2 x n/2 submatrices 𝐴𝑖𝑗, 𝐵𝑖𝑗, 𝐶𝑖𝑗, 𝑇𝑖𝑗 (𝑖, 𝑗 ∈ {1,2})
    spawn P-MATRIX-MULTIPLY-RECURSIVE(𝐶11, 𝐴11, 𝐵11)
    spawn P-MATRIX-MULTIPLY-RECURSIVE(𝐶12, 𝐴11, 𝐵12)
    spawn P-MATRIX-MULTIPLY-RECURSIVE(𝐶21, 𝐴21, 𝐵11)
    spawn P-MATRIX-MULTIPLY-RECURSIVE(𝐶22, 𝐴21, 𝐵12)
    spawn P-MATRIX-MULTIPLY-RECURSIVE(𝑇11, 𝐴12, 𝐵21)
    spawn P-MATRIX-MULTIPLY-RECURSIVE(𝑇12, 𝐴12, 𝐵22)
    spawn P-MATRIX-MULTIPLY-RECURSIVE(𝑇21, 𝐴22, 𝐵21)
    P-MATRIX-MULTIPLY-RECURSIVE(𝑇22, 𝐴22, 𝐵22)
    sync
    parallel for i = 1 to n
      parallel for j = 1 to n
        𝑐𝑖𝑗 = 𝑐𝑖𝑗 + 𝑡𝑖𝑗

Work: 𝑀1(𝑛) = 8𝑀1(𝑛/2) + Θ(𝑛²) = Θ(𝑛³)

Span: 𝑀∞(𝑛) = 𝑀∞(𝑛/2) + Θ(log 𝑛) + Θ(log 𝑛) = Θ(log² 𝑛)

Parallelism: 𝑀1(𝑛)/𝑀∞(𝑛) = Θ(𝑛³)/Θ(log² 𝑛) = Θ(𝑛³/log² 𝑛)
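The work recurrence can be sanity-checked numerically by unrolling a unit-cost version of it: the ratio 𝑀1(𝑛)/𝑛³ settles toward a constant, consistent with Θ(𝑛³).

```python
def m1(n):
    # Unit-cost version of the recurrence M1(n) = 8*M1(n/2) + n^2.
    return 1 if n == 1 else 8 * m1(n // 2) + n * n

# Ratios m1(n) / n^3 for n = 2, 4, ..., 32 increase toward a constant (2),
# so m1 grows like n^3.
print([m1(2 ** k) / (2 ** k) ** 3 for k in range(1, 6)])
```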

(27)

How about Strassen’s method?

 Reading assignment: p.795-796.

 Parallelism: Θ(𝑛^(log 7)/log² 𝑛), slightly less than that of the original recursive version!
