Greedy scheduler

(1)

Algorithm 2

Michael Tsai 2011/5/27

(2)

Scheduling

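■ The scheduler's job: assign strands to processors for execution.

■ On-line: the scheduler does not know in advance when a strand will spawn, or when a spawned strand will finish.

■ Centralized scheduler: a single scheduler knows the global state and makes all scheduling decisions (easier to analyze).

■ Distributed scheduler: the threads communicate and cooperate with each other to find the best schedule.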

(3)

Greedy scheduler

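■ At every time step, assign as many ready strands to processors as possible.

■ If at least $P$ strands are ready to execute during a time step, the step is called a complete step.

■ Otherwise it is called an incomplete step.

■ Lower bounds: even in the best case, the running time $T_P$ on $P$ processors (with work $T_1$ and span $T_\infty$) needs at least this much time:

    work law: $T_P \ge T_1 / P$

    span law: $T_P \ge T_\infty$

■ The series of proofs that follows shows that the greedy scheduler is actually a rather good scheduler.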

(4)

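■ Theorem: On an ideal parallel computer with $P$ processors, a greedy scheduler executes a multithreaded computation with work $T_1$ and span $T_\infty$ in time $T_P \le T_1/P + T_\infty$.

■ Proof:

    ▪ We first show that there are at most $\lfloor T_1/P \rfloor$ complete steps.

    ▪ By contradiction: suppose there are more than $\lfloor T_1/P \rfloor$ complete steps.

    ▪ Then the work done by the complete steps alone is at least
      $P \cdot (\lfloor T_1/P \rfloor + 1) = P \lfloor T_1/P \rfloor + P = T_1 - (T_1 \bmod P) + P > T_1$.

    ▪ This is a contradiction, because the entire computation requires only $T_1$ work.

    ▪ Hence there are at most $\lfloor T_1/P \rfloor$ complete steps.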

(5)



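▪ Now consider the number of incomplete steps. Let $G'$ be the dag of strands that have not yet been executed at the start of an incomplete step, and $G''$ the dag that remains after that step.

▪ A longest path must start at a vertex with in-degree 0.

▪ An incomplete step of a greedy scheduler executes every vertex of $G'$ with in-degree 0 (so $G''$ contains none of them).

▪ Therefore the longest path in $G''$ is exactly 1 shorter than the longest path in $G'$.

▪ In other words, every incomplete step shortens the longest path in the dag of not-yet-executed strands by 1.

▪ Hence the number of incomplete steps is at most the span, $T_\infty$.

▪ Finally, every step is either complete or incomplete.

▪ Therefore $T_P \le \lfloor T_1/P \rfloor + T_\infty \le T_1/P + T_\infty$.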
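The bound $T_P \le T_1/P + T_\infty$ can be checked on a small dag. The Python sketch below simulates a greedy scheduler under the assumption that every strand takes one time step. The names `greedy_schedule`, `span`, and `example_dag`, as well as the dict-of-successors encoding, are illustrative choices and not part of the lecture.

```python
from collections import deque

def greedy_schedule(succ, P):
    """Simulate a greedy scheduler on a dag of unit-time strands.

    succ maps each vertex to the list of its successors.
    Returns (T_P, complete_steps, incomplete_steps).
    """
    indeg = {v: 0 for v in succ}
    for v in succ:
        for w in succ[v]:
            indeg[w] += 1
    ready = deque(v for v in succ if indeg[v] == 0)
    complete = incomplete = 0
    while ready:
        if len(ready) >= P:              # complete step: run exactly P strands
            complete += 1
            batch = [ready.popleft() for _ in range(P)]
        else:                            # incomplete step: run every ready strand
            incomplete += 1
            batch = list(ready)
            ready.clear()
        for v in batch:                  # successors become ready for later steps
            for w in succ[v]:
                indeg[w] -= 1
                if indeg[w] == 0:
                    ready.append(w)
    return complete + incomplete, complete, incomplete

def span(succ):
    """Number of vertices on a longest path (T_inf for unit-time strands)."""
    memo = {}
    def longest(v):
        if v not in memo:
            memo[v] = 1 + max((longest(w) for w in succ[v]), default=0)
        return memo[v]
    return max(longest(v) for v in succ)

# Hypothetical fork-join dag: vertex 0 forks into 1..4, which all join at 5.
example_dag = {0: [1, 2, 3, 4], 1: [5], 2: [5], 3: [5], 4: [5], 5: []}
T_P, complete, incomplete = greedy_schedule(example_dag, P=2)
```

For this dag the work is 6 and the span is 3, so with `P=2` the theorem allows at most 3 complete steps, at most 3 incomplete steps, and a running time of at most 6.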

(6)

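■ Corollary: The running time $T_P$ of any multithreaded computation scheduled by a greedy scheduler on an ideal parallel computer with $P$ processors is within a factor of 2 of optimal.

■ Proof:

    ▪ Let $T_P^*$ be the running time achieved by an optimal scheduler on a machine with $P$ processors.

    ▪ Let $T_1$ and $T_\infty$ be the work and span, respectively.

    ▪ By the work and span laws, $T_P^* \ge \max(T_1/P, T_\infty)$.

    ▪ By the theorem, $T_P \le T_1/P + T_\infty \le 2 \max(T_1/P, T_\infty) \le 2 T_P^*$.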

(7)

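■ Corollary: Let $T_P$ be the running time of a multithreaded computation produced by a greedy scheduler on an ideal parallel computer with $P$ processors, and let $T_1$ and $T_\infty$ be the work and span of the computation, respectively. Then, if $P \ll T_1/T_\infty$, we have $T_P \approx T_1/P$, or equivalently, a speedup of approximately $P$.

■ Proof:

    ▪ If $P \ll T_1/T_\infty$, then $T_\infty \ll T_1/P$.

    ▪ So $T_P \le T_1/P + T_\infty \approx T_1/P$.

    ▪ (Work law: $T_P \ge T_1/P$.)

    ▪ Or, the speedup is $T_1/T_P \approx P$.

When is $P \ll T_1/T_\infty$?

When the slackness $(T_1/T_\infty)/P \ge 10$.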
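A short numeric sketch of why a slackness of 10 is enough: dividing the greedy bound $T_1/P + T_\infty$ by the ideal time $T_1/P$ gives $1 + 1/\text{slackness}$. The function names and the sample numbers below are made up for illustration.

```python
def slackness(work, span, processors):
    """Slackness (T_1 / T_inf) / P: by what factor the parallelism exceeds P."""
    return (work / span) / processors

def greedy_upper_bound(work, span, processors):
    """Upper bound T_1/P + T_inf on the running time of a greedy scheduler."""
    return work / processors + span

# Hypothetical computation: work 10000, span 10, run on 100 processors.
work, span_, P = 10000, 10, 100
s = slackness(work, span_, P)                          # parallelism 1000 over P = 100
ratio = greedy_upper_bound(work, span_, P) / (work / P)  # bound relative to T_1/P
```

With these numbers the slackness is 10 and the bound is 110 time units against an ideal 100, so the greedy schedule is within 10% of perfect linear speedup.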

(8)

(9)

(10)

Back to P‐FIB

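■ Work: $T_1(n) = \Theta(\phi^n)$, where $\phi$ is the golden ratio

■ Span: $T_\infty(n) = \max(T_\infty(n-1), T_\infty(n-2)) + \Theta(1)$

    $= T_\infty(n-1) + \Theta(1)$

    $= \Theta(n)$

■ Parallelism: $T_1(n)/T_\infty(n) = \Theta(\phi^n / n)$

P-FIB(n)
    if n<=1
        return n
    else
        x=spawn P-FIB(n-1)
        y=P-FIB(n-2)
        sync
        return x+y

Even on a very large parallel computer, a moderate n is enough for the program to achieve near perfect linear speedup. (The parallelism is much larger than P, so the slackness is large.)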
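The recurrences for work and span can be evaluated directly. The sketch below is a serial Python function that follows the structure of P-FIB and charges one unit of time per call. That cost model and the name `p_fib_cost` are assumptions made for illustration.

```python
def p_fib_cost(n):
    """Return (value, work, span) of P-FIB(n), charging one unit per call."""
    if n <= 1:
        return n, 1, 1
    x, work_x, span_x = p_fib_cost(n - 1)   # the spawned child
    y, work_y, span_y = p_fib_cost(n - 2)   # runs logically in parallel with it
    # Work adds up over both branches; span follows the longer branch.
    return x + y, work_x + work_y + 1, max(span_x, span_y) + 1

value, work, span = p_fib_cost(20)
parallelism = work / span
```

Under this cost model the work grows like the Fibonacci numbers while the span grows linearly in n, which is why the parallelism becomes large even for moderate n.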

(11)

Parallel Loops

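■ Parallel loops: the iterations of a loop are executed in parallel.

■ Written by putting the keyword "parallel" in front of the for keyword.

■ The same effect can be obtained with spawn and sync, but this syntax is more convenient.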
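As a rough analogue in Python, a parallel loop can be sketched with the standard `concurrent.futures` module. The helper `parallel_for` is a hypothetical name, and the sketch only illustrates the semantics (all iterations may run concurrently, and the loop ends when every iteration has finished). In standard CPython, threads typically do not speed up CPU-bound loop bodies.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_for(body, lo, hi, max_workers=4):
    """Run body(i) for i = lo..hi (inclusive); return once all have finished."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list(...) waits for every iteration and re-raises any exception
        list(pool.map(body, range(lo, hi + 1)))

squares = [0] * 9

def body(i):
    squares[i] = i * i      # each iteration writes only its own slot

parallel_for(body, 1, 8)
```

Because every iteration writes a different element of `squares`, the iterations are independent and the result does not depend on the order in which they run.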

(12)

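Matrix-vector multiplication

■ A: an n x n matrix

■ x: a vector of length n

■ We need to compute y=Ax:

■ $y_i = \sum_{j=1}^{n} a_{ij} x_j$, for $i = 1, \dots, n$.

[Figure: y = A x, with row index i and column index j]

MAT-VEC(A,x)
    n=A.rows
    let y be a new vector of length n
    parallel for i=1 to n
        $y_i = 0$
    parallel for i=1 to n
        for j=1 to n
            $y_i = y_i + a_{ij} x_j$
    return y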

(13)

MAT-VEC-MAIN-LOOP(A,x,y,n,i,i’)
    if i==i’
        for j=1 to n
            $y_i = y_i + a_{ij} x_j$
    else
        mid = $\lfloor (i+i')/2 \rfloor$
        spawn MAT-VEC-MAIN-LOOP(A,x,y,n,i,mid)
        MAT-VEC-MAIN-LOOP(A,x,y,n,mid+1,i’)
        sync

i: the first row to compute

i’: the last row to compute

n: the size of A and x
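The recursive splitting can be tried out in Python. The sketch below maps spawn to starting a `threading.Thread` and sync to `join`. That mapping, the 0-based row indices, and the wrapper `mat_vec` are assumptions of this sketch rather than the lecture's implementation, and creating one thread per split is only reasonable for small n.

```python
import threading

def mat_vec_main_loop(A, x, y, n, i, i2):
    """Accumulate rows i..i2 (0-based, inclusive) of A times x into y."""
    if i == i2:
        for j in range(n):
            y[i] += A[i][j] * x[j]
    else:
        mid = (i + i2) // 2
        child = threading.Thread(target=mat_vec_main_loop,
                                 args=(A, x, y, n, i, mid))
        child.start()                                # spawn: rows i..mid
        mat_vec_main_loop(A, x, y, n, mid + 1, i2)   # parent: rows mid+1..i2
        child.join()                                 # sync

def mat_vec(A, x):
    """Compute y = A x for a non-empty square matrix A."""
    n = len(A)
    y = [0] * n
    mat_vec_main_loop(A, x, y, n, 0, n - 1)
    return y

y = mat_vec([[1, 2], [3, 4]], [5, 6])
```

Each leaf of the recursion writes only `y[i]` for its own row, so the two halves never touch the same memory location.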

(14)

[Figure: y = A x with row index i and column index j; the row ranges 1,4 and 5,8 are marked]

(15)

Analyze MAT‐VEC

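■ Work: computed like the ordinary running time with the parallel keywords removed.

■ $T_1(n) = \Theta(n^2)$

■ (dominated by the doubly nested loop)

■ Note, however, that spawn and parallel loops introduce additional overhead!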


(16)

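n-1 internal nodes

n leaves

Each internal node performs exactly one spawn.

(The overhead of a spawn is constant.)

There are more leaves than internal nodes, so the cost of the spawns can be amortized over the leaves.

Therefore the spawn overhead does not increase the asymptotic order of the work.

(Only the constant factor becomes larger.)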
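The node counts can be confirmed by walking the same recursion that splits a range of iterations in half. The function name `loop_tree` is a hypothetical helper. It also reports the depth of the tree, which is the number of spawns along any root-to-leaf path.

```python
def loop_tree(i, i2):
    """Return (internal_nodes, leaves, depth) of the recursion tree that
    halves the range i..i2 until single iterations remain."""
    if i == i2:
        return 0, 1, 0                      # a leaf: one loop iteration
    mid = (i + i2) // 2
    l_int, l_leaf, l_depth = loop_tree(i, mid)
    r_int, r_leaf, r_depth = loop_tree(mid + 1, i2)
    # This call is an internal node and performs one spawn.
    return l_int + r_int + 1, l_leaf + r_leaf, max(l_depth, r_depth) + 1

internal, leaves, depth = loop_tree(1, 8)
```

For the range 1..8 this gives 7 internal nodes, 8 leaves, and depth 3, matching n-1 spawns spread over n leaves.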

(17)

Analyze MAT‐VEC

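■ Now we compute the span.

■ This time the spawn overhead has to be taken into account.

■ The recursion that splits the loop has depth $\log n$, because every level splits the range in two. The overhead of each spawn is constant.

■ $iter_\infty(i)$ denotes the span of the $i$-th iteration.

■ So in general, the span of a parallel loop is:

■ $T_\infty(n) = \Theta(\log n) + \max_{1 \le i \le n} iter_\infty(i)$

■ For MAT-VEC: $T_\infty(n) = \Theta(\log n) + \Theta(n) = \Theta(n)$

■ Parallelism $= T_1(n)/T_\infty(n) = \Theta(n^2)/\Theta(n) = \Theta(n)$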


(18)

Race Conditions

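■ A multithreaded algorithm should be deterministic:

    ▪ every execution produces the same result,

    ▪ no matter how the different parts of the program are scheduled onto different processors/cores.

■ When it "fails", the cause is usually a "determinacy race".

■ Determinacy race: two logically parallel instructions access the same memory location and at least one of them performs a write.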

(19)

Determinacy Race Examples

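Therac-25: a radiation therapy machine.

Between 1985 and 1987, at least 6 patients received 100 times the intended radiation dose, resulting in death or severe radiation burns.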


2003 Northeast Blackout: Affected 10M people in Ontario & 45M people in 8 U.S. states.

Cause: a software bug known as a race condition existed in General Electric Energy's Unix-based XA/21 energy management system.

(21)

Race‐Example

RACE-EXAMPLE()
    x=0
    parallel for i=1 to 2
        x=x+1
    print x

Note: most executions produce the correct result; only a few orderings produce the error!

This makes such bugs very hard to find!
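To see which results RACE-EXAMPLE can print, one can enumerate every interleaving of the two iterations. The Python sketch below assumes that `x = x + 1` executes as three separate instructions (load, increment, store) with a private register per thread. This instruction model and the function names are assumptions made for illustration.

```python
from itertools import combinations

def run(schedule):
    """Execute two threads that each perform x = x + 1 as load / add / store.

    schedule is a sequence of thread ids (0 or 1) giving the order in which
    the six instructions execute. Returns the final value of x.
    """
    x = 0
    reg = [0, 0]        # private register of each thread
    pc = [0, 0]         # index of the next instruction of each thread
    for t in schedule:
        if pc[t] == 0:
            reg[t] = x          # load x into the register
        elif pc[t] == 1:
            reg[t] += 1         # increment the register
        else:
            x = reg[t]          # store the register back to x
        pc[t] += 1
    return x

def possible_outcomes():
    """Final values of x over all interleavings of the two threads."""
    results = set()
    for slots in combinations(range(6), 3):     # positions used by thread 0
        schedule = [0 if k in slots else 1 for k in range(6)]
        results.add(run(schedule))
    return results

outcomes = possible_outcomes()
```

Running one thread entirely before the other yields 2, while any ordering in which both threads load `x` before either stores it yields 1. How often each ordering occurs on real hardware depends on timing and is not modeled here.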

(22)

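How to avoid race conditions?

■ There are many techniques (mutexes and so on, covered in the operating systems course).

■ A simple approach: apply parallelism only to independent computations, that is, computations that do not depend on each other.

■ A spawned child must be independent of its parent and of every other spawned child.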

(23)

Socrates chess‐playing program

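                   Original version                   "Optimized version"
Work               $T_1 = 2048$                       $T_1' = 1024$
Span               $T_\infty = 1$                     $T_\infty' = 8$
32 processors      $T_{32} = 2048/32 + 1 = 65$        $T_{32}' = 1024/32 + 8 = 40$
512 processors     $T_{512} = 2048/512 + 1 = 5$       $T_{512}' = 1024/512 + 8 = 10$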
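The comparison is easy to reproduce by using $T_P = T_1/P + T_\infty$ as an estimate of the running time. The function name `estimated_time` and the dictionary layout are illustrative. The work and span values are the ones from the slide.

```python
def estimated_time(work, span, processors):
    """Estimate T_P as T_1/P + T_inf (the greedy bound used as a model)."""
    return work / processors + span

versions = {"original": (2048, 1), "optimized": (1024, 8)}

for P in (32, 512):
    for name, (work, span) in versions.items():
        print(P, name, estimated_time(work, span, P))
```

The "optimized" version wins on 32 processors (40 against 65) but loses on 512 processors (10 against 5), because its larger span dominates once the work term $T_1/P$ becomes small.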

(24)

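Multithreaded matrix multiplication

P-SQUARE-MATRIX-MULTIPLY(A,B)
    n=A.rows
    let C be a new n x n matrix
    parallel for i=1 to n
        parallel for j=1 to n
            $c_{ij} = 0$
            for k=1 to n
                $c_{ij} = c_{ij} + a_{ik} \cdot b_{kj}$
    return C

Work: $T_1(n) = \Theta(n^3)$

Span: $T_\infty(n) = \Theta(\log n) + \Theta(\log n) + \Theta(n) = \Theta(n)$

Parallelism: $\Theta(n^3)/\Theta(n) = \Theta(n^2)$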
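A minimal Python sketch of the doubly nested parallel loop, assuming a thread pool from `concurrent.futures` as a stand-in for `parallel for`. The two parallel loops are flattened into n² independent tasks, which is a simplification of this sketch and not something the slide prescribes.

```python
from concurrent.futures import ThreadPoolExecutor

def p_square_matrix_multiply(A, B, max_workers=4):
    """Multiply two n x n matrices, one task per entry of the result."""
    n = len(A)
    C = [[0] * n for _ in range(n)]

    def compute_entry(ij):
        i, j = divmod(ij, n)            # recover (i, j) from the flat index
        total = 0
        for k in range(n):              # the serial inner loop over k
            total += A[i][k] * B[k][j]
        C[i][j] = total                 # only this task writes C[i][j]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(compute_entry, range(n * n)))
    return C

C = p_square_matrix_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The loop over k stays serial because all of its iterations update the same entry, so parallelizing it naively would introduce a determinacy race.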

(25)

Divide‐and‐conquer

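Multithreaded Algorithm for Matrix Multiplication (rather long)

$\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \underbrace{\begin{pmatrix} A_{11}B_{11} & A_{11}B_{12} \\ A_{21}B_{11} & A_{21}B_{12} \end{pmatrix}}_{C} + \underbrace{\begin{pmatrix} A_{12}B_{21} & A_{12}B_{22} \\ A_{22}B_{21} & A_{22}B_{22} \end{pmatrix}}_{T}$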

(26)

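P-MATRIX-MULTIPLY-RECURSIVE(C,A,B)
    n=A.rows
    if n==1
        $c_{11} = a_{11} b_{11}$
    else let T be a new n x n matrix
        partition A,B,C, and T into n/2 x n/2 submatrices ($A_{ij}$, $B_{ij}$, $C_{ij}$, $T_{ij}$, $i,j = 1,2$)
        spawn P-MATRIX-MULTIPLY-RECURSIVE($C_{11}$, $A_{11}$, $B_{11}$)
        spawn P-MATRIX-MULTIPLY-RECURSIVE($C_{12}$, $A_{11}$, $B_{12}$)
        spawn P-MATRIX-MULTIPLY-RECURSIVE($C_{21}$, $A_{21}$, $B_{11}$)
        spawn P-MATRIX-MULTIPLY-RECURSIVE($C_{22}$, $A_{21}$, $B_{12}$)
        spawn P-MATRIX-MULTIPLY-RECURSIVE($T_{11}$, $A_{12}$, $B_{21}$)
        spawn P-MATRIX-MULTIPLY-RECURSIVE($T_{12}$, $A_{12}$, $B_{22}$)
        spawn P-MATRIX-MULTIPLY-RECURSIVE($T_{21}$, $A_{22}$, $B_{21}$)
        P-MATRIX-MULTIPLY-RECURSIVE($T_{22}$, $A_{22}$, $B_{22}$)
        sync
        parallel for i=1 to n
            parallel for j=1 to n
                $c_{ij} = c_{ij} + t_{ij}$

Work: $T_1(n) = 8 T_1(n/2) + \Theta(n^2) = \Theta(n^3)$

Span: $T_\infty(n) = T_\infty(n/2) + \Theta(\log n) = \Theta(\log^2 n)$

Parallelism: $T_1(n)/T_\infty(n) = \Theta(n^3)/\Theta(\log^2 n) = \Theta(n^3/\log^2 n)$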
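The structure of the recursion can be checked with a serial Python sketch. It executes the eight sub-products one after another instead of spawning them, represents submatrices by row and column offsets instead of copying, and assumes n is a power of two. The helper name `mm_rec` and its offset parameters are hypothetical.

```python
def mm_rec(C, cr, cc, A, ar, ac, B, br, bc, n):
    """Write the product of the n x n block of A at (ar, ac) and the n x n
    block of B at (br, bc) into the n x n block of C at (cr, cc)."""
    if n == 1:
        C[cr][cc] = A[ar][ac] * B[br][bc]
        return
    h = n // 2
    T = [[0] * n for _ in range(n)]         # temporary matrix T
    for i in (0, h):
        for j in (0, h):
            # C_ij = A_i1 * B_1j  (one of the eight recursive products)
            mm_rec(C, cr + i, cc + j, A, ar + i, ac, B, br, bc + j, h)
            # T_ij = A_i2 * B_2j  (one of the eight recursive products)
            mm_rec(T, i, j, A, ar + i, ac + h, B, br + h, bc + j, h)
    # After the sync: C = C + T (a doubly nested parallel loop in the pseudocode)
    for i in range(n):
        for j in range(n):
            C[cr + i][cc + j] += T[i][j]

def p_matrix_multiply_recursive(A, B):
    """Multiply two n x n matrices, n a power of two."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    mm_rec(C, 0, 0, A, 0, 0, B, 0, 0, n)
    return C

product = p_matrix_multiply_recursive([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The eight products write to disjoint blocks of C and T, which is what allows the pseudocode to spawn them without a determinacy race. Only the final addition needs the results of all of them.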

(27)

How about Strassen’s method?

■ Reading assignment: p.795-796.

■ Parallelism: $\Theta(n^{\log_2 7}/\log^2 n)$, slightly less than the original recursive version!