Algorithm 1 Multi-threaded

Academic year: 2022

(1)

2013/12/31

(2)

The rise of multi-core processors…

(3)

Chip Multiprocessors: multiple processing cores inside one IC, each able to compute independently and to access a common memory

Computer clusters: many computers connected by a network

Supercomputers: custom architectures and custom communication networks between the processing units

Inexpensive

(4)

What if a problem is NP-hard?

 Maybe n will not be very large? Maybe the constant is small?

 Over the range of n that can actually occur, is the running time reasonable?

 My take:

[Figure: running time versus n. A multithreaded algorithm on a parallel platform stays under a reasonable running time for a wider range of n than the regular (serial) algorithm: the range of n we can handle grows!]

(5)

[Figure: two multiprocessor organizations. Shared-memory multiprocessing: several processors connected to a single shared memory. Distributed-Memory Multiprocessing: each processor paired with its own memory.]

(6)

[Figure: left, one process whose threads all share the memory used by the process; right, several processes, each with its own memory and execution context, communicating via Inter Process Communications. An OS Scheduler assigns the threads and processes to Processor 1 through Processor n.]

The variables are shared among all threads.

Threads are assigned to processors by the scheduler for execution.

Each process has its own memory (for storing variables).

Processes are assigned to processors by the scheduler for execution.

(7)

Concurrency platforms

 Static threading: the number of threads in a program is usually fixed.

 The hard part: every thread must receive a roughly equal share of the work for the running time to be minimized; dynamic allocation is difficult.

 A scheduler is needed to allocate resources and to coordinate the threads.

 Hence concurrency platforms were created, so that programmers do not have to write their own scheduler.

 A platform may be a programming library, or an entire language/syntax plus compiler plus library.

(8)

Dynamic Multithreading

 One kind of concurrency platform

 Usually has two features:

Nested parallelism: somewhat like fork. "Spawn" a new subroutine, so that the original program can keep running without waiting.

Parallel loops: different iterations of a for loop can execute at the same time.

 Key idea: the programmer only specifies how to parallelize the program logically, and does not manage details such as scheduling and resource allocation (the platform handles those itself).

(9)

Advantages of this model

 The parallelized program is a simple extension of the original program; only a few keywords are added: parallel, spawn, sync.

 Easy to analyze theoretically.

 Many divide-and-conquer algorithms can easily be converted into multithreaded algorithms using nested parallelism.

 Many existing concurrency platforms use this model.

(10)

Computing Fibonacci numbers

FIB(n)
  if n <= 1
    return n
  else
    x = FIB(n-1)    // takes T(n-1)
    y = FIB(n-2)    // takes T(n-2)
    return x + y    // Θ(1)

T(n) = T(n-1) + T(n-2) + Θ(1)
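The recurrence above can be checked empirically. A small Python sketch (the helper `calls`, which counts recursive calls, is my own addition for illustration): the ratio of call counts for successive n approaches φ ≈ 1.618, matching the Θ(φ^n) bound derived on the next slide.

```python
def fib(n):
    """Naive recursion from the slide: T(n) = T(n-1) + T(n-2) + Θ(1)."""
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

def calls(n):
    """Total number of calls made by fib(n); grows like Θ(φ^n)."""
    if n <= 1:
        return 1
    return 1 + calls(n - 1) + calls(n - 2)

print(fib(10))                # 55
print(calls(20) / calls(19))  # ≈ 1.618, the golden ratio φ
```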

(11)

Running-time analysis

Proof by induction:

Assume T(m) ≤ aF_m − b for all m < n, where a > 1 and b > 0 are constants.

T(n) = T(n-1) + T(n-2) + Θ(1)
     ≤ (aF_{n-1} − b) + (aF_{n-2} − b) + Θ(1)
     = a(F_{n-1} + F_{n-2}) − 2b + Θ(1)
     = aF_n − b − (b − Θ(1))
     ≤ aF_n − b

Hence T(n) = Θ(φ^n), where φ = (1 + √5)/2.

(12)

Computing Fibonacci numbers

FIB(n)
  if n <= 1
    return n
  else
    x = FIB(n-1)
    y = FIB(n-2)
    return x + y

These two calls are independent, so they can be executed separately!

(13)

Computing Fibonacci numbers

P-FIB(n)
  if n <= 1
    return n
  else
    x = spawn P-FIB(n-1)
    y = P-FIB(n-2)
    sync
    return x + y

The newly created subroutine is called the child; the original subroutine is the parent.

The child executes the call that follows spawn, while the parent continues with the next statement instead of waiting for the child. The parent can proceed past sync only after all of its children have returned; only then is it safe to use a spawned subroutine's return value.
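The spawn/sync semantics above can be sketched with ordinary OS threads. This is only an illustrative Python sketch, not how a real dynamic-multithreading platform is implemented (a real platform would not create one OS thread per spawn); the names `p_fib` and `child` are made up. spawn becomes starting a thread, and sync becomes joining it.

```python
import threading

def p_fib(n):
    """Sketch of P-FIB: spawn = Thread.start, sync = Thread.join."""
    if n <= 1:
        return n
    result = {}
    def child():
        # the child computes P-FIB(n-1) concurrently with the parent
        result["x"] = p_fib(n - 1)
    t = threading.Thread(target=child)
    t.start()          # spawn: the parent does not wait
    y = p_fib(n - 2)   # the parent keeps working in the meantime
    t.join()           # sync: wait before using the child's value
    return result["x"] + y

print(p_fib(10))  # 55
```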

(14)

一些細節

 "Spawn" does not mean "must execute in parallel".

 (That is, the two calls may sometimes simply be executed one after another by the same processor.)

 The scheduler decides whether parallel execution is needed/appropriate.

 When a function returns, there is also an implicit sync keyword

 (waiting for all child subroutines to finish and return).

(15)

Dependencies between instruction strands

P-FIB(n)
  if n <= 1
    return n
  else
    x = spawn P-FIB(n-1)
    y = P-FIB(n-2)
    sync
    return x + y

If there is a path u → v:

u and v are in series.

If there is no path u → v:

u and v are in parallel.
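The series/parallel test above is just reachability in the computation dag: u and v are in series iff one can reach the other. A minimal Python sketch (the dag encoding and the function name `in_series` are my own assumptions):

```python
def in_series(dag, u, v):
    """u and v are in series iff there is a path u→v or v→u in the dag."""
    def reaches(a, b, seen=None):
        seen = set() if seen is None else seen
        if a == b:
            return True
        seen.add(a)
        return any(reaches(w, b, seen) for w in dag[a] if w not in seen)
    return reaches(u, v) or reaches(v, u)

# diamond dag: a spawns b and c, both join at d
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(in_series(dag, "b", "c"))  # False: b and c are in parallel
print(in_series(dag, "a", "d"))  # True:  a precedes d
```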

(16)

[Figure: the computation dag of P-FIB, from the Initial Strand to the Final Strand, with edges labeled as Continuation edges, Spawn edges, Call edges, and Return edges.]

(17)

Ideal Parallel Computer

A set of processors and a sequentially consistent shared memory

 Sequential consistency: although in practice many processors may read and write memory at the same time, the outcome is the same as if the processors' read/write instructions were executed one at a time.

 Every processor has the same computing power.

 The overhead of scheduling is negligible.

(18)

Performance measures

 Work: the time needed to execute the entire computation on one processor.

 If every strand takes about the same time to execute, then work = the number of vertices in the computation dag.

 Span: the time needed to execute the longest path in the computation dag.

 If every strand takes about the same time to execute, then span = the number of vertices on the longest path of the computation dag.

(19)

 The actual running time depends on how many processors there are.

 Suppose there are P processors.

 T_P: the running time on P processors.

 T_1: the running time on 1 processor = work.

 T_∞: the running time on infinitely many processors = span.
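Under the equal-cost-strand assumption, both measures can be read directly off a computation dag: work is the number of vertices and span is the longest path. A small Python sketch (the dag encoding and the function name `work_and_span` are my own assumptions):

```python
def work_and_span(dag):
    """dag maps each strand to its successor strands (assumed acyclic)."""
    memo = {}
    def longest(v):
        # number of vertices on the longest path starting at v
        if v not in memo:
            memo[v] = 1 + max((longest(w) for w in dag[v]), default=0)
        return memo[v]
    work = len(dag)                      # T_1: total number of strands
    span = max(longest(v) for v in dag)  # T_inf: critical-path length
    return work, span

# diamond dag: a spawns b and c, both join at d
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(work_and_span(dag))  # (4, 3)
```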

(20)

Lower bounds

 Work law: T_P ≥ T_1 / P

 After P processors run for time T_P, the total work done is at most P·T_P; the total amount of work is T_1.

 Span law: T_P ≥ T_∞

 A machine with infinitely many processors can simulate a machine with P processors, so the running time on P processors is always at least as large.

(21)

Speedup

 Speedup = T_1 / T_P ≤ P

 Linear speedup: when T_1 / T_P = Θ(P)

 Perfect linear speedup: when T_1 / T_P = P

(22)

Parallelism

 Parallelism = T_1 / T_∞

 Parallelism can be interpreted in three ways:

1. The average amount of work that can be done in parallel with each step on the critical path (the longest path in the computation dag).

2. Parallelism is the maximum speedup that can possibly be obtained (using any number of processors).

3. A limit on the possibility of perfect linear speedup: as soon as the number of processors exceeds the parallelism, perfect linear speedup becomes impossible.

 Why: suppose P > T_1/T_∞; then T_1/T_P ≤ T_1/T_∞ < P. There is little point in using more than parallelism-many processors (the further above it, the further from perfect speedup).

(23)

Back to P-FIB

For P-FIB(4):

Work = T_1 = 17

Span = T_∞ = 8

Parallelism = 17/8 = 2.125

So the best achievable speedup is roughly a factor of two (adding more processors does not help!).

Try P-FIB(n) with larger n!

Parallel Slackness = (T_1 / T_∞) / P

(by how much the parallelism exceeds P)

Slackness < 1 means there is no hope of perfect linear speedup.

When slackness > 1, how much work each processor is assigned becomes the key to whether perfect linear speedup can be achieved, and the scheduler can help!
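Plugging the slide's P-FIB(4) numbers into these definitions makes the limit concrete; once P exceeds the parallelism, the slackness drops below 1 (the variable names here are my own):

```python
T1, Tinf = 17, 8           # work and span of P-FIB(4), from the slide
parallelism = T1 / Tinf    # the maximum possible speedup
print(parallelism)         # 2.125

for P in (1, 2, 4):
    slackness = parallelism / P
    # slackness < 1 rules out perfect linear speedup for this P
    print(P, slackness)
```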

(24)

Scheduling

 The scheduler's job: assign strands to processors for execution.

 On-line: the scheduler does not know in advance when a strand will spawn, or when a spawned strand will complete.

 Centralized scheduler: a single scheduler that knows the overall state and performs the scheduling (easier to analyze).

 Distributed scheduler: the threads communicate and cooperate with one another to find the best schedule.

(25)

Greedy scheduler

In each time step, assign as many ready strands to as many processors as possible.

If, in some time step, at least P strands are ready to execute, it is called a complete step.

Otherwise it is called an incomplete step.

Lower bounds (even the best schedule needs this much time):

work law: T_P ≥ T_1 / P

span law: T_P ≥ T_∞

The upper bound for greedy scheduling is the sum of these two lower bounds: T_P ≤ T_1/P + T_∞.
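These bounds are easy to check numerically. A small sketch (the function name is made up): with the P-FIB(4) values T_1 = 17 and T_∞ = 8 on P = 2 processors, no schedule can beat 8.5 time steps, while a greedy scheduler is guaranteed to finish within 16.5.

```python
def greedy_bounds(T1, Tinf, P):
    """Lower and upper bounds on T_P under a greedy scheduler."""
    lower = max(T1 / P, Tinf)   # work law and span law
    upper = T1 / P + Tinf       # greedy guarantee: T_P <= T1/P + T_inf
    return lower, upper

print(greedy_bounds(17, 8, 2))  # (8.5, 16.5)
```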
