Algorithm 2
Michael Tsai 2014/1/2
Scheduling
The scheduler's job: assign strands to processors for execution.
On-line: the scheduler does not know in advance when strands will spawn, or when spawned strands will finish.
Centralized scheduler: a single scheduler sees the global state and does the scheduling (easier to analyze).
Distributed scheduler: threads communicate and cooperate with one another to work out a good schedule.
Greedy scheduler
In each time step, assign as many strands as possible to as many processors as possible.
If at least P strands are ready to execute in a given time step, it is called a complete step;
otherwise it is an incomplete step.
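As an illustration (my own sketch, not from the lecture), a greedy scheduler over a dag of unit-time strands can be simulated directly, classifying every time step as complete or incomplete; the dag and function name below are made up for this sketch.

```python
from collections import defaultdict

def greedy_schedule(successors, num_strands, P):
    """Greedy scheduler over a dag of unit-time strands.
    Returns (total_steps, complete_steps, incomplete_steps)."""
    indeg = defaultdict(int)
    for u in range(num_strands):
        for v in successors.get(u, ()):
            indeg[v] += 1
    ready = [u for u in range(num_strands) if indeg[u] == 0]
    done = steps = complete = incomplete = 0
    while done < num_strands:
        if len(ready) >= P:                  # complete step: all P processors busy
            batch, ready = ready[:P], ready[P:]
            complete += 1
        else:                                # incomplete step: run everything ready
            batch, ready = ready, []
            incomplete += 1
        steps += 1
        done += len(batch)
        for u in batch:                      # retire strands, release successors
            for v in successors.get(u, ()):
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
    return steps, complete, incomplete

# Diamond dag: strand 0 fans out to 1, 2, 3, which join at strand 4.
# With P = 2: 4 steps total, 1 complete and 3 incomplete,
# and 4 <= T1/P + Tinf = 5/2 + 3, as the greedy bound below promises.
steps, comp, inc = greedy_schedule({0: [1, 2, 3], 1: [4], 2: [4], 3: [4]}, 5, 2)
```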
Lower bounds: even in the best case we need at least this much time:
Work law: T_P ≥ T_1 / P
Span law: T_P ≥ T_∞
The following series of proofs shows that a greedy scheduler is in fact quite a good scheduler.
Theorem: On an ideal parallel computer with P processors, a greedy scheduler executes a multithreaded computation with work T_1 and span T_∞ in time T_P ≤ T_1/P + T_∞.
Proof:
We want to show that there are at most ⌊T_1/P⌋ complete steps.
By contradiction: suppose there are more than ⌊T_1/P⌋ complete steps.
Then the work done by the complete steps alone is at least
P · (⌊T_1/P⌋ + 1) = P⌊T_1/P⌋ + P = T_1 − (T_1 mod P) + P > T_1,
a contradiction, because the entire computation only requires T_1 work.
Hence there are at most ⌊T_1/P⌋ complete steps.
Now consider the number of incomplete steps. Let G' be the dag of strands not yet executed at the start of a step, and G'' what remains of it after the step.
The longest path in G' must start at a vertex with in-degree 0.
In an incomplete step, a greedy scheduler executes every in-degree-0 vertex of G' (so no in-degree-0 vertex of G' survives into G'').
Hence the longest path in G'' is one shorter than the longest path in G'.
In other words, each incomplete step shortens the longest path of the dag of not-yet-executed strands by 1.
Therefore the number of incomplete steps is at most the span, T_∞.
Finally, every step is either complete or incomplete. Therefore
T_P ≤ ⌊T_1/P⌋ + T_∞ ≤ T_1/P + T_∞.
(The accompanying figure showed the full dag G, the not-yet-executed subdag G', and the subdag G'' left after one incomplete step.)
Corollary: The running time T_P of any multithreaded computation scheduled by a greedy scheduler on an ideal parallel computer with P processors is within a factor of 2 of optimal.
Proof:
Let T_P* be the running time achieved by an optimal scheduler on a machine with P processors, and let T_1 and T_∞ be the work and span of the computation.
By the work and span laws, T_P* ≥ max(T_1/P, T_∞).
By the theorem above, T_P ≤ T_1/P + T_∞ ≤ 2 max(T_1/P, T_∞) ≤ 2 T_P*.
Corollary: Let T_P be the running time of a multithreaded computation produced by a greedy scheduler on an ideal parallel computer with P processors, and let T_1 and T_∞ be the work and span of the computation, respectively. Then, if P ≪ T_1/T_∞, we have T_P ≈ T_1/P, or equivalently, a speedup of approximately P.
Proof:
If P ≪ T_1/T_∞, then T_∞ ≪ T_1/P.
So T_P ≤ T_1/P + T_∞ ≈ T_1/P.
(By the work law, T_P ≥ T_1/P, so indeed T_P ≈ T_1/P.)
Equivalently, the speedup is T_1/T_P ≈ P.
When is P "much less" than T_1/T_∞? A common rule of thumb: when the slackness (T_1/T_∞)/P is at least 10.
Back to P-FIB
T_1(n) = T(n) = Θ(φ^n)
T_∞(n) = max(T_∞(n−1), T_∞(n−2)) + Θ(1) = T_∞(n−1) + Θ(1)
T_∞(n) = Θ(n)
Parallelism: T_1(n)/T_∞(n) = Θ(φ^n / n)
P-FIB(n)
  if n <= 1
    return n
  else
    x = spawn P-FIB(n-1)
    y = P-FIB(n-2)
    sync
    return x + y
Even on a large parallel machine, a moderate n is enough for the program to achieve near-perfect linear speedup (the parallelism is much larger than P, so the slackness is large).
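The work and span recurrences above can be checked numerically; this sketch (mine, not from the slides) counts unit-cost call frames of P-FIB's computation dag.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def work(n):
    """Total unit-cost strands in P-FIB(n)'s dag (both recursive calls counted)."""
    if n <= 1:
        return 1
    return 1 + work(n - 1) + work(n - 2)

@lru_cache(maxsize=None)
def span(n):
    """Critical-path length: the spawned call and the continuation run in
    parallel, so only the longer branch counts."""
    if n <= 1:
        return 1
    return 1 + max(span(n - 1), span(n - 2))

# span(n) == n for n >= 1, matching T_inf(n) = Theta(n), while work(n)
# grows like phi**n, so the parallelism work(n)/span(n) grows with n.
```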
Parallel Loops
Parallel loops: execute the iterations of a loop in parallel.
Add the keyword "parallel" in front of "for".
The same thing can be written with spawn and sync, but this syntax is more convenient.
Matrix-vector multiplication
A: an n × n matrix
x: a vector of size n
We want to compute y = Ax, that is,
y_i = Σ_{j=1}^{n} a_ij · x_j
MAT-VEC(A, x)
  n = A.rows
  let y be a new vector of length n
  parallel for i = 1 to n
    y_i = 0
  parallel for i = 1 to n
    for j = 1 to n
      y_i = y_i + a_ij · x_j
  return y
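A serial rendering of MAT-VEC (my sketch, assuming 0-based indexing): each "parallel for" is written as an ordinary loop, which is legal because the iterations over i are independent of one another.

```python
def mat_vec(A, x):
    """Serial rendering of MAT-VEC for an n x n list-of-lists A."""
    n = len(A)
    y = [0] * n                    # parallel for i = 1 to n: y_i = 0
    for i in range(n):             # parallel for i = 1 to n
        for j in range(n):         #     for j = 1 to n
            y[i] += A[i][j] * x[j]
    return y

# mat_vec([[1, 2], [3, 4]], [5, 6]) == [17, 39]
```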
MAT-VEC-MAIN-LOOP(A, x, y, n, i, i')
  if i == i'
    for j = 1 to n
      y_i = y_i + a_ij · x_j
  else
    mid = ⌊(i + i')/2⌋
    spawn MAT-VEC-MAIN-LOOP(A, x, y, n, i, mid)
    MAT-VEC-MAIN-LOOP(A, x, y, n, mid+1, i')
    sync
n: the size of A and x
i: the first row to compute
i': the last row to compute
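MAT-VEC-MAIN-LOOP is how the runtime realizes a parallel for with spawn/sync; a serial sketch of it (mine, 0-based indexing, plain calls in place of spawn):

```python
def mat_vec_main_loop(A, x, y, n, i, ip):
    """Recursive expansion of 'parallel for' over rows i..ip.
    The first recursive call is the one spawned in the pseudocode,
    and a sync would follow the second."""
    if i == ip:
        for j in range(n):
            y[i] += A[i][j] * x[j]
    else:
        mid = (i + ip) // 2
        mat_vec_main_loop(A, x, y, n, i, mid)        # spawn ...
        mat_vec_main_loop(A, x, y, n, mid + 1, ip)   # ... then sync
```

Calling it with the full row range 0..n-1 computes y = Ax in place.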
(Figure: the recursion tree for y = Ax, with the row range 1..8 split into 1..4 and 5..8.)
Analyze MAT-VEC
Work: compute the ordinary (serial) running time as if the parallel keywords were removed.
T_1(n) = Θ(n²)
(dominated by the doubly nested loop)
But beware: the spawn / parallel-loop machinery adds extra overhead!
The recursion tree of the parallel loop has n − 1 internal nodes and n leaves.
Each internal node performs one spawn (with constant overhead).
There are more leaves than internal nodes, so the spawn cost can be amortized over the leaves.
Hence the spawn overhead does not raise the order of T_1(n) (it only increases the constant).
Analyze MAT-VEC
Now we compute the span; this time the spawn overhead must be taken into account.
The recursion splits the range in two at every level, so the spawning has depth Θ(log n), and each spawn costs a constant.
Let iter_∞(i) denote the span of the i-th iteration.
In general, then, the span of a parallel loop is:
T_∞(n) = Θ(log n) + max_{1≤i≤n} iter_∞(i)
For MAT-VEC: T_∞(n) = Θ(log n) + Θ(n) = Θ(n)
Parallelism = Θ(n²)/Θ(n) = Θ(n)
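The parallel-loop span formula can be evaluated exactly on the binary splitting; the helper below (my own sketch) charges one unit per split level plus the span of the critical iteration, matching Θ(log n) + max iter_∞(i).

```python
def loop_span(i, ip, iter_span):
    """Span of 'parallel for' over iterations i..ip, realized by the
    binary splitting of MAT-VEC-MAIN-LOOP: one unit per spawn level,
    plus the span of the most expensive iteration."""
    if i == ip:
        return iter_span(i)
    mid = (i + ip) // 2
    return 1 + max(loop_span(i, mid, iter_span),
                   loop_span(mid + 1, ip, iter_span))

# For MAT-VEC, each iteration's span is the inner j-loop, i.e. n, so for
# n a power of 2: loop_span(0, n - 1, lambda i: n) == log2(n) + n.
```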
Race Conditions
A multithreaded algorithm should be deterministic:
every execution produces the same result,
no matter how the parts of the program are scheduled onto different processors/cores.
When this "fails", it is usually because of a determinacy race.
Determinacy race: two logically parallel instructions access the same memory location and at least one of the accesses is a write.
Determinacy Race Examples
Therac-25: a radiation therapy machine.
Between 1985 and 1987 it gave at least six patients roughly 100 times the intended radiation dose, causing deaths and severe radiation burns.
Further reading: "Killer Softwares".
http://hiraman-sharma.blogspot.com/2010/07/killer-softwares.html
Determinacy Race Examples
2003 Northeast Blackout: affected 10M people in Ontario and 45M people in 8 U.S. states.
Cause: a race-condition bug in General Electric Energy's Unix-based XA/21 energy management system.
Race-Example
RACE-EXAMPLE()
  x = 0
  parallel for i = 1 to 2
    x = x + 1
  print x
Note: most execution orders produce the correct result; only a few orderings trigger the error!
That makes these bugs very hard to find.
How do we avoid race conditions?
There are many mechanisms (mutexes and the like, covered in an OS course).
A simple approach: only run computations in parallel when they are independent, i.e. they do not interact:
a spawned child, its parent, and the other spawned children touch none of each other's data.
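The lost update in RACE-EXAMPLE can be made deterministic by replaying one specific interleaving of the load/store halves of x = x + 1; the schedule encoding below is my own illustration, not lecture code.

```python
# Each logical strand executes "x = x + 1" as a load into a private
# register followed by a store. run() replays a given interleaving.
def run(schedule):
    x = 0
    regs = {}
    for strand, op in schedule:
        if op == "load":
            regs[strand] = x          # read shared x into strand's register
        else:                         # "store"
            x = regs[strand] + 1      # write back register + 1
    return x

serial      = [(1, "load"), (1, "store"), (2, "load"), (2, "store")]
interleaved = [(1, "load"), (2, "load"), (1, "store"), (2, "store")]
# run(serial) == 2, but run(interleaved) == 1: strand 2's stale load
# loses strand 1's update.
```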
Socrates chess-playing program
Original version: T_1 = 2048, T_∞ = 1.
"Optimized" version: T_1' = 1024, T_∞' = 8.
With P = 32: T_32 ≈ 2048/32 + 1 = 65, but T_32' ≈ 1024/32 + 8 = 40.
With P = 512: T_512 ≈ 2048/512 + 1 = 5, but T_512' ≈ 1024/512 + 8 = 10.
So the "optimized" version wins on 32 processors yet loses on 512.
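The arithmetic behind these numbers is just the greedy-scheduler bound T_P ≤ T_1/P + T_∞ evaluated for both versions; a quick check:

```python
def greedy_bound(T1, Tinf, P):
    """Greedy-scheduler running-time estimate: T_1/P + T_inf."""
    return T1 / P + Tinf

# Original version (T1 = 2048, Tinf = 1) vs "optimized" (T1' = 1024,
# Tinf' = 8): on 32 processors the optimized version wins (40 < 65),
# but on 512 processors it loses (10 > 5).
```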
Multithreaded matrix multiplication
P-SQUARE-MATRIX-MULTIPLY(A, B)
  n = A.rows
  let C be a new n x n matrix
  parallel for i = 1 to n
    parallel for j = 1 to n
      c_ij = 0
      for k = 1 to n
        c_ij = c_ij + a_ik · b_kj
  return C
T_1(n) = Θ(n³)
T_∞(n) = Θ(log n) + Θ(log n) + Θ(n) = Θ(n)
Parallelism: T_1(n)/T_∞(n) = Θ(n³)/Θ(n) = Θ(n²)
Divide-and-conquer Multithreaded Algorithm for Matrix Multiplication (quite a mouthful)
A = ( A11 A12 ; A21 A22 ),  B = ( B11 B12 ; B21 B22 )  (rows separated by ";")
C = ( C11 C12 ; C21 C22 ) = ( A11 A12 ; A21 A22 ) ( B11 B12 ; B21 B22 )
  = ( A11·B11  A11·B12 ; A21·B11  A21·B12 ) + ( A12·B21  A12·B22 ; A22·B21  A22·B22 )
The first product is accumulated into C and the second into a temporary matrix T.
P-MATRIX-MULTIPLY-RECURSIVE(C, A, B)
  n = A.rows
  if n == 1
    c_11 = a_11 · b_11
  else
    let T be a new n x n matrix
    partition A, B, C, and T into n/2 x n/2 submatrices
      A_ij, B_ij, C_ij, T_ij for i, j = 1, 2
    spawn P-MATRIX-MULTIPLY-RECURSIVE(C11, A11, B11)
    spawn P-MATRIX-MULTIPLY-RECURSIVE(C12, A11, B12)
    spawn P-MATRIX-MULTIPLY-RECURSIVE(C21, A21, B11)
    spawn P-MATRIX-MULTIPLY-RECURSIVE(C22, A21, B12)
    spawn P-MATRIX-MULTIPLY-RECURSIVE(T11, A12, B21)
    spawn P-MATRIX-MULTIPLY-RECURSIVE(T12, A12, B22)
    spawn P-MATRIX-MULTIPLY-RECURSIVE(T21, A22, B21)
    P-MATRIX-MULTIPLY-RECURSIVE(T22, A22, B22)
    sync
    parallel for i = 1 to n
      parallel for j = 1 to n
        c_ij = c_ij + t_ij
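A runnable sketch of the same divide-and-conquer scheme (mine, 0-based Python, n assumed a power of 2): it returns the product instead of mutating C in place, but the eight quadrant products (the C-parts and T-parts that the pseudocode spawns in parallel before the sync) and the final addition mirror the pseudocode.

```python
def p_mat_mul_rec(A, B):
    """Divide-and-conquer matrix multiply on list-of-lists matrices."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    q = lambda M, r, c: [row[c*h:(c+1)*h] for row in M[r*h:(r+1)*h]]
    A11, A12, A21, A22 = q(A, 0, 0), q(A, 0, 1), q(A, 1, 0), q(A, 1, 1)
    B11, B12, B21, B22 = q(B, 0, 0), q(B, 0, 1), q(B, 1, 0), q(B, 1, 1)
    # The eight recursive products; in the pseudocode the first seven are
    # spawned and all eight run in parallel until the sync.
    C11, C12 = p_mat_mul_rec(A11, B11), p_mat_mul_rec(A11, B12)
    C21, C22 = p_mat_mul_rec(A21, B11), p_mat_mul_rec(A21, B12)
    T11, T12 = p_mat_mul_rec(A12, B21), p_mat_mul_rec(A12, B22)
    T21, T22 = p_mat_mul_rec(A22, B21), p_mat_mul_rec(A22, B22)
    # C = C + T (a doubly nested parallel for in the pseudocode)
    add = lambda X, Y: [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    top = [r1 + r2 for r1, r2 in zip(add(C11, T11), add(C12, T12))]
    bot = [r1 + r2 for r1, r2 in zip(add(C21, T21), add(C22, T22))]
    return top + bot
```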
M_1(n) = 8·M_1(n/2) + Θ(n²) = Θ(n³)
M_∞(n) = M_∞(n/2) + Θ(log n) + Θ(log n) = Θ(log² n)
Parallelism: M_1(n)/M_∞(n) = Θ(n³)/Θ(log² n) = Θ(n³ / log² n)
How about Strassen’s method?
Reading assignment: p.795-796.
Parallelism: Θ(n^{log 7} / log² n), slightly less than that of the original recursive version!