Algorithm 2
Michael Tsai 2014/1/2
Scheduling
The scheduler's job: assign strands to processors for execution.
On-line: the scheduler does not know in advance when strands will spawn, or when spawned strands will finish.
Centralized scheduler: a single scheduler sees the global state and does the scheduling (easier to analyze).
Distributed scheduler: threads communicate and cooperate with one another to work out a good schedule.
Greedy scheduler
In each time step, assign as many strands as possible to as many processors as possible.
If at least P strands are ready to execute in a given time step, it is called a complete step;
otherwise it is an incomplete step.
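As an illustration (my own sketch, not from the lecture), a greedy scheduler over a dag of unit-time strands can be simulated directly, classifying every time step as complete or incomplete; the dag and function name below are made up for this sketch.

```python
from collections import defaultdict

def greedy_schedule(successors, num_strands, P):
    """Greedy scheduler over a dag of unit-time strands.
    Returns (total_steps, complete_steps, incomplete_steps)."""
    indeg = defaultdict(int)
    for u in range(num_strands):
        for v in successors.get(u, ()):
            indeg[v] += 1
    ready = [u for u in range(num_strands) if indeg[u] == 0]
    done = steps = complete = incomplete = 0
    while done < num_strands:
        if len(ready) >= P:                  # complete step: all P processors busy
            batch, ready = ready[:P], ready[P:]
            complete += 1
        else:                                # incomplete step: run everything ready
            batch, ready = ready, []
            incomplete += 1
        steps += 1
        done += len(batch)
        for u in batch:                      # retire strands, release successors
            for v in successors.get(u, ()):
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
    return steps, complete, incomplete

# Diamond dag: strand 0 fans out to 1, 2, 3, which join at strand 4.
# With P = 2: 4 steps total, 1 complete and 3 incomplete,
# and 4 <= T1/P + Tinf = 5/2 + 3, as the greedy bound below promises.
steps, comp, inc = greedy_schedule({0: [1, 2, 3], 1: [4], 2: [4], 3: [4]}, 5, 2)
```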
Lower bounds: even in the best case we need at least this much time:
Work law: T_P ≥ T_1 / P
Span law: T_P ≥ T_∞
The following series of proofs shows that a greedy scheduler is in fact quite a good scheduler.
Theorem: On an ideal parallel computer with P processors, a greedy scheduler executes a multithreaded computation with work T_1 and span T_∞ in time T_P ≤ T_1/P + T_∞.
Proof:
We want to show that there are at most ⌊T_1/P⌋ complete steps.
By contradiction: suppose there are more than ⌊T_1/P⌋ complete steps.
Then the work done by the complete steps alone is at least
P · (⌊T_1/P⌋ + 1) = P⌊T_1/P⌋ + P = T_1 − (T_1 mod P) + P > T_1,
a contradiction, because the entire computation only requires T_1 work.
Hence there are at most ⌊T_1/P⌋ complete steps.
Now consider the number of incomplete steps. Let G' be the dag of strands not yet executed at the start of a step, and G'' what remains of it after the step.
The longest path in G' must start at a vertex with in-degree 0.
In an incomplete step, a greedy scheduler executes every in-degree-0 vertex of G' (so no in-degree-0 vertex of G' survives into G'').
Hence the longest path in G'' is one shorter than the longest path in G'.
In other words, each incomplete step shortens the longest path of the dag of not-yet-executed strands by 1.
Therefore the number of incomplete steps is at most the span, T_∞.
Finally, every step is either complete or incomplete. Therefore
T_P ≤ ⌊T_1/P⌋ + T_∞ ≤ T_1/P + T_∞.
(The accompanying figure showed the full dag G, the not-yet-executed subdag G', and the subdag G'' left after one incomplete step.)
Corollary: The running time T_P of any multithreaded computation scheduled by a greedy scheduler on an ideal parallel computer with P processors is within a factor of 2 of optimal.
Proof:
Let T_P* be the running time achieved by an optimal scheduler on a machine with P processors, and let T_1 and T_∞ be the work and span of the computation.
By the work and span laws, T_P* ≥ max(T_1/P, T_∞).
By the theorem above, T_P ≤ T_1/P + T_∞ ≤ 2 max(T_1/P, T_∞) ≤ 2 T_P*.
Corollary: Let T_P be the running time of a multithreaded computation produced by a greedy scheduler on an ideal parallel computer with P processors, and let T_1 and T_∞ be the work and span of the computation, respectively. Then, if P ≪ T_1/T_∞, we have T_P ≈ T_1/P, or equivalently, a speedup of approximately P.
Proof:
If P ≪ T_1/T_∞, then T_∞ ≪ T_1/P.
So T_P ≤ T_1/P + T_∞ ≈ T_1/P.
(By the work law, T_P ≥ T_1/P, so indeed T_P ≈ T_1/P.)
Equivalently, the speedup is T_1/T_P ≈ P.
When is P "much less" than T_1/T_∞? A common rule of thumb: when the slackness (T_1/T_∞)/P is at least 10.
Back to P-FIB
T_1(n) = T(n) = Θ(φ^n)
T_∞(n) = max(T_∞(n−1), T_∞(n−2)) + Θ(1) = T_∞(n−1) + Θ(1)
T_∞(n) = Θ(n)
Parallelism: T_1(n)/T_∞(n) = Θ(φ^n / n)
P-FIB(n)
  if n <= 1
    return n
  else
    x = spawn P-FIB(n-1)
    y = P-FIB(n-2)
    sync
    return x + y
Even on a large parallel machine, a moderate n is enough for the program to achieve near-perfect linear speedup (the parallelism is much larger than P, so the slackness is large).
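The work and span recurrences above can be checked numerically; this sketch (mine, not from the slides) counts unit-cost call frames of P-FIB's computation dag.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def work(n):
    """Total unit-cost strands in P-FIB(n)'s dag (both recursive calls counted)."""
    if n <= 1:
        return 1
    return 1 + work(n - 1) + work(n - 2)

@lru_cache(maxsize=None)
def span(n):
    """Critical-path length: the spawned call and the continuation run in
    parallel, so only the longer branch counts."""
    if n <= 1:
        return 1
    return 1 + max(span(n - 1), span(n - 2))

# span(n) == n for n >= 1, matching T_inf(n) = Theta(n), while work(n)
# grows like phi**n, so the parallelism work(n)/span(n) grows with n.
```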
Parallel Loops
Parallel loops: execute the iterations of a loop in parallel.
Add the keyword "parallel" in front of "for".
The same thing can be written with spawn and sync, but this syntax is more convenient.
Matrix-vector multiplication
A: an n × n matrix
x: a vector of size n
We want to compute y = Ax, that is,
y_i = Σ_{j=1}^{n} a_ij · x_j
MAT-VEC(A, x)
  n = A.rows
  let y be a new vector of length n
  parallel for i = 1 to n
    y_i = 0
  parallel for i = 1 to n
    for j = 1 to n
      y_i = y_i + a_ij · x_j
  return y
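A serial rendering of MAT-VEC (my sketch, assuming 0-based indexing): each "parallel for" is written as an ordinary loop, which is legal because the iterations over i are independent of one another.

```python
def mat_vec(A, x):
    """Serial rendering of MAT-VEC for an n x n list-of-lists A."""
    n = len(A)
    y = [0] * n                    # parallel for i = 1 to n: y_i = 0
    for i in range(n):             # parallel for i = 1 to n
        for j in range(n):         #     for j = 1 to n
            y[i] += A[i][j] * x[j]
    return y

# mat_vec([[1, 2], [3, 4]], [5, 6]) == [17, 39]
```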
MAT-VEC-MAIN-LOOP(A, x, y, n, i, i')
  if i == i'
    for j = 1 to n
      y_i = y_i + a_ij · x_j
  else
    mid = ⌊(i + i')/2⌋
    spawn MAT-VEC-MAIN-LOOP(A, x, y, n, i, mid)
    MAT-VEC-MAIN-LOOP(A, x, y, n, mid+1, i')
    sync
n: the size of A and x
i: the first row to compute
i': the last row to compute
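MAT-VEC-MAIN-LOOP is how the runtime realizes a parallel for with spawn/sync; a serial sketch of it (mine, 0-based indexing, plain calls in place of spawn):

```python
def mat_vec_main_loop(A, x, y, n, i, ip):
    """Recursive expansion of 'parallel for' over rows i..ip.
    The first recursive call is the one spawned in the pseudocode,
    and a sync would follow the second."""
    if i == ip:
        for j in range(n):
            y[i] += A[i][j] * x[j]
    else:
        mid = (i + ip) // 2
        mat_vec_main_loop(A, x, y, n, i, mid)        # spawn ...
        mat_vec_main_loop(A, x, y, n, mid + 1, ip)   # ... then sync
```

Calling it with the full row range 0..n-1 computes y = Ax in place.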
(Figure: the recursion tree for y = Ax, with the row range 1..8 split into 1..4 and 5..8.)
Analyze MAT-VEC
Work: compute the ordinary (serial) running time as if the parallel keywords were removed.
T_1(n) = Θ(n²)
(dominated by the doubly nested loop)
But beware: the spawn / parallel-loop machinery adds extra overhead!
The recursion tree of the parallel loop has n − 1 internal nodes and n leaves.
Each internal node performs one spawn (with constant overhead).
There are more leaves than internal nodes, so the spawn cost can be amortized over the leaves.
Hence the spawn overhead does not raise the order of T_1(n) (it only increases the constant).
Analyze MAT-VEC
Now we compute the span; this time the spawn overhead must be taken into account.
The recursion splits the range in two at every level, so the spawning has depth Θ(log n), and each spawn costs a constant.
Let iter_∞(i) denote the span of the i-th iteration.
In general, then, the span of a parallel loop is:
T_∞(n) = Θ(log n) + max_{1≤i≤n} iter_∞(i)
For MAT-VEC: T_∞(n) = Θ(log n) + Θ(n) = Θ(n)
Parallelism = Θ(n²)/Θ(n) = Θ(n)
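The parallel-loop span formula can be evaluated exactly on the binary splitting; the helper below (my own sketch) charges one unit per split level plus the span of the critical iteration, matching Θ(log n) + max iter_∞(i).

```python
def loop_span(i, ip, iter_span):
    """Span of 'parallel for' over iterations i..ip, realized by the
    binary splitting of MAT-VEC-MAIN-LOOP: one unit per spawn level,
    plus the span of the most expensive iteration."""
    if i == ip:
        return iter_span(i)
    mid = (i + ip) // 2
    return 1 + max(loop_span(i, mid, iter_span),
                   loop_span(mid + 1, ip, iter_span))

# For MAT-VEC, each iteration's span is the inner j-loop, i.e. n, so for
# n a power of 2: loop_span(0, n - 1, lambda i: n) == log2(n) + n.
```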
Race Conditions
A multithreaded algorithm should be deterministic:
every execution produces the same result,
no matter how the parts of the program are scheduled onto different processors/cores.
When this "fails", it is usually because of a determinacy race.
Determinacy race: two logically parallel instructions access the same memory location and at least one of the accesses is a write.
Determinacy Race Examples
Therac-25: a radiation therapy machine.
Between 1985 and 1987 it gave at least six patients roughly 100 times the intended radiation dose, causing deaths and severe radiation burns.
Further reading: "Killer Softwares".
http://hiraman-sharma.blogspot.com/2010/07/killer-softwares.html
Determinacy Race Examples
2003 Northeast Blackout: affected 10M people in Ontario and 45M people in 8 U.S. states.
Cause: a race-condition bug in General Electric Energy's Unix-based XA/21 energy management system.
Race-Example
RACE-EXAMPLE()
  x = 0
  parallel for i = 1 to 2
    x = x + 1
  print x
Note: most execution orders produce the correct result; only a few orderings trigger the error!
That makes these bugs very hard to find.
How do we avoid race conditions?
There are many mechanisms (mutexes and the like, covered in an OS course).
A simple approach: only run computations in parallel when they are independent, i.e. they do not interact:
a spawned child, its parent, and the other spawned children touch none of each other's data.
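The lost update in RACE-EXAMPLE can be made deterministic by replaying one specific interleaving of the load/store halves of x = x + 1; the schedule encoding below is my own illustration, not lecture code.

```python
# Each logical strand executes "x = x + 1" as a load into a private
# register followed by a store. run() replays a given interleaving.
def run(schedule):
    x = 0
    regs = {}
    for strand, op in schedule:
        if op == "load":
            regs[strand] = x          # read shared x into strand's register
        else:                         # "store"
            x = regs[strand] + 1      # write back register + 1
    return x

serial      = [(1, "load"), (1, "store"), (2, "load"), (2, "store")]
interleaved = [(1, "load"), (2, "load"), (1, "store"), (2, "store")]
# run(serial) == 2, but run(interleaved) == 1: strand 2's stale load
# loses strand 1's update.
```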
Socrates chess-playing program
Original version: T_1 = 2048, T_∞ = 1.
"Optimized" version: T_1' = 1024, T_∞' = 8.
With P = 32: T_32 ≈ 2048/32 + 1 = 65, but T_32' ≈ 1024/32 + 8 = 40.
With P = 512: T_512 ≈ 2048/512 + 1 = 5, but T_512' ≈ 1024/512 + 8 = 10.
So the "optimized" version wins on 32 processors yet loses on 512.
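The arithmetic behind these numbers is just the greedy-scheduler bound T_P ≤ T_1/P + T_∞ evaluated for both versions; a quick check:

```python
def greedy_bound(T1, Tinf, P):
    """Greedy-scheduler running-time estimate: T_1/P + T_inf."""
    return T1 / P + Tinf

# Original version (T1 = 2048, Tinf = 1) vs "optimized" (T1' = 1024,
# Tinf' = 8): on 32 processors the optimized version wins (40 < 65),
# but on 512 processors it loses (10 > 5).
```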
Multithreaded matrix multiplication
P-SQUARE-MATRIX-MULTIPLY(A, B)
  n = A.rows
  let C be a new n x n matrix
  parallel for i = 1 to n
    parallel for j = 1 to n
      c_ij = 0
      for k = 1 to n
        c_ij = c_ij + a_ik · b_kj
  return C
T_1(n) = Θ(n³)
T_∞(n) = Θ(log n) + Θ(log n) + Θ(n) = Θ(n)
Parallelism: T_1(n)/T_∞(n) = Θ(n³)/Θ(n) = Θ(n²)
Divide-and-conquer Multithreaded Algorithm for Matrix Multiplication (quite a mouthful)
A = ( A11 A12 ; A21 A22 ),  B = ( B11 B12 ; B21 B22 )  (rows separated by ";")
C = ( C11 C12 ; C21 C22 ) = ( A11 A12 ; A21 A22 ) ( B11 B12 ; B21 B22 )
  = ( A11·B11  A11·B12 ; A21·B11  A21·B12 ) + ( A12·B21  A12·B22 ; A22·B21  A22·B22 )
The first product is accumulated into C and the second into a temporary matrix T.
P-MATRIX-MULTIPLY-RECURSIVE(C, A, B)
  n = A.rows
  if n == 1
    c_11 = a_11 · b_11
  else
    let T be a new n x n matrix
    partition A, B, C, and T into n/2 x n/2 submatrices
      A_ij, B_ij, C_ij, T_ij for i, j = 1, 2
    spawn P-MATRIX-MULTIPLY-RECURSIVE(C11, A11, B11)
    spawn P-MATRIX-MULTIPLY-RECURSIVE(C12, A11, B12)
    spawn P-MATRIX-MULTIPLY-RECURSIVE(C21, A21, B11)
    spawn P-MATRIX-MULTIPLY-RECURSIVE(C22, A21, B12)
    spawn P-MATRIX-MULTIPLY-RECURSIVE(T11, A12, B21)
    spawn P-MATRIX-MULTIPLY-RECURSIVE(T12, A12, B22)
    spawn P-MATRIX-MULTIPLY-RECURSIVE(T21, A22, B21)
    P-MATRIX-MULTIPLY-RECURSIVE(T22, A22, B22)
    sync
    parallel for i = 1 to n
      parallel for j = 1 to n
        c_ij = c_ij + t_ij
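A runnable sketch of the same divide-and-conquer scheme (mine, 0-based Python, n assumed a power of 2): it returns the product instead of mutating C in place, but the eight quadrant products (the C-parts and T-parts that the pseudocode spawns in parallel before the sync) and the final addition mirror the pseudocode.

```python
def p_mat_mul_rec(A, B):
    """Divide-and-conquer matrix multiply on list-of-lists matrices."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    q = lambda M, r, c: [row[c*h:(c+1)*h] for row in M[r*h:(r+1)*h]]
    A11, A12, A21, A22 = q(A, 0, 0), q(A, 0, 1), q(A, 1, 0), q(A, 1, 1)
    B11, B12, B21, B22 = q(B, 0, 0), q(B, 0, 1), q(B, 1, 0), q(B, 1, 1)
    # The eight recursive products; in the pseudocode the first seven are
    # spawned and all eight run in parallel until the sync.
    C11, C12 = p_mat_mul_rec(A11, B11), p_mat_mul_rec(A11, B12)
    C21, C22 = p_mat_mul_rec(A21, B11), p_mat_mul_rec(A21, B12)
    T11, T12 = p_mat_mul_rec(A12, B21), p_mat_mul_rec(A12, B22)
    T21, T22 = p_mat_mul_rec(A22, B21), p_mat_mul_rec(A22, B22)
    # C = C + T (a doubly nested parallel for in the pseudocode)
    add = lambda X, Y: [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    top = [r1 + r2 for r1, r2 in zip(add(C11, T11), add(C12, T12))]
    bot = [r1 + r2 for r1, r2 in zip(add(C21, T21), add(C22, T22))]
    return top + bot
```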
M_1(n) = 8·M_1(n/2) + Θ(n²) = Θ(n³)
M_∞(n) = M_∞(n/2) + Θ(log n) + Θ(log n) = Θ(log² n)
Parallelism: M_1(n)/M_∞(n) = Θ(n³)/Θ(log² n) = Θ(n³ / log² n)
How about Strassen’s method?
Reading assignment: p.795-796.
Parallelism: Θ(n^{log 7} / log² n), slightly less than that of the original recursive version!