Algorithm 1 Multi-threaded

Academic year: 2022

(1)

2013/12/31

(2)

The rise of multi-core processors…

(3)

Chip Multiprocessors: multiple processing cores inside one IC, each able to compute independently and to access a common memory

Computer clusters: many computers connected by a network

Supercomputers: custom architectures and custom communication networks between the processing units

Inexpensive

(4)

What if a problem is NP-hard?

 Maybe n will not be very large? Maybe the constant is small?

 Over the range of n that can actually occur, is the running time reasonable?

 My take:

[Figure: running time versus n. A multithreaded algorithm on a parallel platform stays under a reasonable running time for a wider range of n than the regular (serial) algorithm: the range of n we can handle grows!]

(5)

[Figure: two multiprocessor organizations. Shared-memory multiprocessing: several processors connected to a single shared memory. Distributed-Memory Multiprocessing: each processor paired with its own memory.]

(6)

[Figure: left, one process whose threads all share the memory used by the process; right, several processes, each with its own memory and execution context, communicating via Inter Process Communications. An OS Scheduler assigns the threads and processes to Processor 1 through Processor n.]

The variables are shared among all threads.

Threads are assigned to processors by the scheduler for execution.

Each process has its own memory (for storing variables).

Processes are assigned to processors by the scheduler for execution.

(7)

Concurrency platforms

 Static threading: the number of threads in a program is usually fixed.

 The hard part: every thread must receive a roughly equal share of the work for the running time to be minimized; dynamic allocation is difficult.

 A scheduler is needed to allocate resources and to coordinate the threads.

 Hence concurrency platforms were created, so that programmers do not have to write their own scheduler.

 A platform may be a programming library, or an entire language/syntax plus compiler plus library.

(8)

Dynamic Multithreading

 One kind of concurrency platform

 Usually has two features:

Nested parallelism: somewhat like fork. "Spawn" a new subroutine, so that the original program can keep running without waiting.

Parallel loops: different iterations of a for loop can execute at the same time.

 Key idea: the programmer only specifies how to parallelize the program logically, and does not manage details such as scheduling and resource allocation (the platform handles those itself).

(9)

Advantages of this model

 The parallelized program is a simple extension of the original program; only a few keywords are added: parallel, spawn, sync.

 Easy to analyze theoretically.

 Many divide-and-conquer algorithms can easily be converted into multithreaded algorithms using nested parallelism.

 Many existing concurrency platforms use this model.

(10)

Computing Fibonacci numbers

FIB(n)
  if n <= 1
    return n
  else
    x = FIB(n-1)    // takes T(n-1)
    y = FIB(n-2)    // takes T(n-2)
    return x + y    // Θ(1)

T(n) = T(n-1) + T(n-2) + Θ(1)
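The recurrence above can be checked empirically. A small Python sketch (the helper `calls`, which counts recursive calls, is my own addition for illustration): the ratio of call counts for successive n approaches φ ≈ 1.618, matching the Θ(φ^n) bound derived on the next slide.

```python
def fib(n):
    """Naive recursion from the slide: T(n) = T(n-1) + T(n-2) + Θ(1)."""
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

def calls(n):
    """Total number of calls made by fib(n); grows like Θ(φ^n)."""
    if n <= 1:
        return 1
    return 1 + calls(n - 1) + calls(n - 2)

print(fib(10))                # 55
print(calls(20) / calls(19))  # ≈ 1.618, the golden ratio φ
```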

(11)

Running-time analysis

Proof by induction:

Assume T(m) ≤ aF_m − b for all m < n, where a > 1 and b > 0 are constants.

T(n) = T(n-1) + T(n-2) + Θ(1)
     ≤ (aF_{n-1} − b) + (aF_{n-2} − b) + Θ(1)
     = a(F_{n-1} + F_{n-2}) − 2b + Θ(1)
     = aF_n − b − (b − Θ(1))
     ≤ aF_n − b

Hence T(n) = Θ(φ^n), where φ = (1 + √5)/2.

(12)

Computing Fibonacci numbers

FIB(n)
  if n <= 1
    return n
  else
    x = FIB(n-1)
    y = FIB(n-2)
    return x + y

These two calls are independent, so they can be executed separately!

(13)

Computing Fibonacci numbers

P-FIB(n)
  if n <= 1
    return n
  else
    x = spawn P-FIB(n-1)
    y = P-FIB(n-2)
    sync
    return x + y

The newly created subroutine is called the child; the original subroutine is the parent.

The child executes the call that follows spawn, while the parent continues with the next statement instead of waiting for the child. The parent can proceed past sync only after all of its children have returned; only then is it safe to use a spawned subroutine's return value.
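The spawn/sync semantics above can be sketched with ordinary OS threads. This is only an illustrative Python sketch, not how a real dynamic-multithreading platform is implemented (a real platform would not create one OS thread per spawn); the names `p_fib` and `child` are made up. spawn becomes starting a thread, and sync becomes joining it.

```python
import threading

def p_fib(n):
    """Sketch of P-FIB: spawn = Thread.start, sync = Thread.join."""
    if n <= 1:
        return n
    result = {}
    def child():
        # the child computes P-FIB(n-1) concurrently with the parent
        result["x"] = p_fib(n - 1)
    t = threading.Thread(target=child)
    t.start()          # spawn: the parent does not wait
    y = p_fib(n - 2)   # the parent keeps working in the meantime
    t.join()           # sync: wait before using the child's value
    return result["x"] + y

print(p_fib(10))  # 55
```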

(14)

一些細節

 "Spawn" does not mean "must execute in parallel".

 (That is, the two calls may sometimes simply be executed one after another by the same processor.)

 The scheduler decides whether parallel execution is needed/appropriate.

 When a function returns, there is also an implicit sync keyword

 (waiting for all child subroutines to finish and return).

(15)

Dependencies between instruction strands

P-FIB(n)
  if n <= 1
    return n
  else
    x = spawn P-FIB(n-1)
    y = P-FIB(n-2)
    sync
    return x + y

If there is a path u → v:

u and v are in series.

If there is no path u → v:

u and v are in parallel.
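The series/parallel test above is just reachability in the computation dag: u and v are in series iff one can reach the other. A minimal Python sketch (the dag encoding and the function name `in_series` are my own assumptions):

```python
def in_series(dag, u, v):
    """u and v are in series iff there is a path u→v or v→u in the dag."""
    def reaches(a, b, seen=None):
        seen = set() if seen is None else seen
        if a == b:
            return True
        seen.add(a)
        return any(reaches(w, b, seen) for w in dag[a] if w not in seen)
    return reaches(u, v) or reaches(v, u)

# diamond dag: a spawns b and c, both join at d
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(in_series(dag, "b", "c"))  # False: b and c are in parallel
print(in_series(dag, "a", "d"))  # True:  a precedes d
```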

(16)

[Figure: the computation dag of P-FIB, from the Initial Strand to the Final Strand, with edges labeled as Continuation edges, Spawn edges, Call edges, and Return edges.]

(17)

Ideal Parallel Computer

A set of processors and a sequentially consistent shared memory

 Sequential consistency: although in practice many processors may read and write memory at the same time, the outcome is the same as if the processors' read/write instructions were executed one at a time.

 Every processor has the same computing power.

 The overhead of scheduling is negligible.

(18)

Performance measures

 Work: the time needed to execute the entire computation on one processor.

 If every strand takes about the same time to execute, then work = the number of vertices in the computation dag.

 Span: the time needed to execute the longest path in the computation dag.

 If every strand takes about the same time to execute, then span = the number of vertices on the longest path of the computation dag.

(19)

 The actual running time depends on how many processors there are.

 Suppose there are P processors.

 T_P: the running time on P processors.

 T_1: the running time on 1 processor = work.

 T_∞: the running time on infinitely many processors = span.
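Under the equal-cost-strand assumption, both measures can be read directly off a computation dag: work is the number of vertices and span is the longest path. A small Python sketch (the dag encoding and the function name `work_and_span` are my own assumptions):

```python
def work_and_span(dag):
    """dag maps each strand to its successor strands (assumed acyclic)."""
    memo = {}
    def longest(v):
        # number of vertices on the longest path starting at v
        if v not in memo:
            memo[v] = 1 + max((longest(w) for w in dag[v]), default=0)
        return memo[v]
    work = len(dag)                      # T_1: total number of strands
    span = max(longest(v) for v in dag)  # T_inf: critical-path length
    return work, span

# diamond dag: a spawns b and c, both join at d
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(work_and_span(dag))  # (4, 3)
```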

(20)

Lower bounds

 Work law: T_P ≥ T_1 / P

 After P processors run for time T_P, the total work done is at most P·T_P; the total amount of work is T_1.

 Span law: T_P ≥ T_∞

 A machine with infinitely many processors can simulate a machine with P processors, so the running time on P processors is always at least as large.

(21)

Speedup

 Speedup = T_1 / T_P ≤ P

 Linear speedup: when T_1 / T_P = Θ(P)

 Perfect linear speedup: when T_1 / T_P = P

(22)

Parallelism

 Parallelism = T_1 / T_∞

 Parallelism can be interpreted in three ways:

1. The average amount of work that can be done in parallel with each step on the critical path (the longest path in the computation dag).

2. Parallelism is the maximum speedup that can possibly be obtained (using any number of processors).

3. A limit on the possibility of perfect linear speedup: as soon as the number of processors exceeds the parallelism, perfect linear speedup becomes impossible.

 Why: suppose P > T_1/T_∞; then T_1/T_P ≤ T_1/T_∞ < P. There is little point in using more than parallelism-many processors (the further above it, the further from perfect speedup).

(23)

Back to P-FIB

For P-FIB(4):

Work = T_1 = 17

Span = T_∞ = 8

Parallelism = 17/8 = 2.125

So the best achievable speedup is roughly a factor of two (adding more processors does not help!).

Try P-FIB(n) with larger n!

Parallel Slackness = (T_1 / T_∞) / P

(by how much the parallelism exceeds P)

Slackness < 1 means there is no hope of perfect linear speedup.

When slackness > 1, how much work each processor is assigned becomes the key to whether perfect linear speedup can be achieved, and the scheduler can help!
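Plugging the slide's P-FIB(4) numbers into these definitions makes the limit concrete; once P exceeds the parallelism, the slackness drops below 1 (the variable names here are my own):

```python
T1, Tinf = 17, 8           # work and span of P-FIB(4), from the slide
parallelism = T1 / Tinf    # the maximum possible speedup
print(parallelism)         # 2.125

for P in (1, 2, 4):
    slackness = parallelism / P
    # slackness < 1 rules out perfect linear speedup for this P
    print(P, slackness)
```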

(24)

Scheduling

 The scheduler's job: assign strands to processors for execution.

 On-line: the scheduler does not know in advance when a strand will spawn, or when a spawned strand will complete.

 Centralized scheduler: a single scheduler that knows the overall state and performs the scheduling (easier to analyze).

 Distributed scheduler: the threads communicate and cooperate with one another to find the best schedule.

(25)

Greedy scheduler

In each time step, assign as many ready strands to as many processors as possible.

If, in some time step, at least P strands are ready to execute, it is called a complete step.

Otherwise it is called an incomplete step.

Lower bounds (even the best schedule needs this much time):

work law: T_P ≥ T_1 / P

span law: T_P ≥ T_∞

The upper bound for greedy scheduling is the sum of these two lower bounds: T_P ≤ T_1/P + T_∞.
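These bounds are easy to check numerically. A small sketch (the function name is made up): with the P-FIB(4) values T_1 = 17 and T_∞ = 8 on P = 2 processors, no schedule can beat 8.5 time steps, while a greedy scheduler is guaranteed to finish within 16.5.

```python
def greedy_bounds(T1, Tinf, P):
    """Lower and upper bounds on T_P under a greedy scheduler."""
    lower = max(T1 / P, Tinf)   # work law and span law
    upper = T1 / P + Tinf       # greedy guarantee: T_P <= T1/P + T_inf
    return lower, upper

print(greedy_bounds(17, 8, 2))  # (8.5, 16.5)
```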
