Thread Manager Unit Performance Analysis

Chapter 5 Experimental Results

5.2. Benchmark Analysis

5.2.3. Thread Manager Unit Performance Analysis

In this stage we use four benchmark programs, the parallel benchmark programs of the Jembench Suite plus Multi-logic, to measure the context-switching execution time and the effect of different time-slice parameters on JAIP's temporal multithreading mechanism.

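As described in Section 3-3, when the Thread Controller performs a context switch it takes 1 clock cycle to swap the special-purpose registers of the ready thread and the current thread, such as the stack pointer, local variable pointer, Java program counter, and current thread ID. In the same cycle it swaps the two interleaved on-chip memory banks of the Ping-pong Java Stack. Switching the current method of the ready thread and the current thread, however, takes at least 5 clock cycles, because every switch between two methods must first check whether the image of the next invoked method is still held in the Method Area circular buffer (see Figure 18). Table 8 lists the average context-switching execution time of the JAIP processors when all parallel benchmark programs run on a 4-core JAIP processor. Once there are 8 or more threads, the Thread Manager Unit of each JAIP processor starts scheduling the threads so that they run in turn. As the table shows, the average context-switching time is close to 5 clock cycles in every case, which indicates that context switching is not the main performance bottleneck for these benchmark programs.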
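The residency check mentioned above can be pictured with a small software model. This is only an illustrative sketch: the class name MethodAreaModel, integer method IDs, the fixed capacity, and oldest-first (FIFO) replacement are hypothetical assumptions, not details of the JAIP Method Area. The model only reports whether a method image was still resident; it says nothing about cycle counts.

```java
/**
 * Hypothetical software model of a Method Area circular buffer.
 * Assumptions (not from the JAIP design): methods are identified by int IDs,
 * the buffer holds a fixed number of method images, and when it is full the
 * oldest image is overwritten first.
 */
class MethodAreaModel {
    private final int[] slots;   // IDs of the method images currently resident
    private int head = 0;        // next slot to overwrite (the oldest entry)
    private int size = 0;        // number of occupied slots

    MethodAreaModel(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        slots = new int[capacity];
    }

    /** The check performed before every method switch. */
    boolean isResident(int methodId) {
        for (int i = 0; i < size; i++) {
            if (slots[i] == methodId) {
                return true;
            }
        }
        return false;
    }

    /**
     * Switches to methodId. Returns true if its image was already resident
     * (fast path), false if it had to be loaded into the buffer.
     */
    boolean switchTo(int methodId) {
        if (isResident(methodId)) {
            return true;
        }
        slots[head] = methodId;              // overwrite the oldest image
        head = (head + 1) % slots.length;    // advance circularly
        if (size < slots.length) {
            size++;
        }
        return false;
    }
}
```

For example, with a capacity of 2, the access sequence 1, 2, 1, 3 hits only on the second access to method 1, and loading method 3 overwrites method 1, the oldest image in the buffer.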

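Benchmark              | #threads | #clock | #count | Avg. clock
-----------------------|----------|--------|--------|-----------
Dummy Test             | 8        | 700330 | 140029 | 5.0013
Dummy Test             | 12       | 698554 | 139674 | 5.0013
Dummy Test             | 16       | 745249 | 149013 | 5.0012
Matrix Multiplication  | 8        | 628350 | 125633 | 5.0015
Matrix Multiplication  | 12       | 693833 | 138730 | 5.0013
Matrix Multiplication  | 16       | 777639 | 155490 | 5.00122
Multi-logic            | 8        | 721969 | 144356 | 5.00131
Multi-logic            | 12       | 800786 | 160121 | 5.00113
Multi-logic            | 16       | 924149 | 184794 | 5.00097
N-Queens               | 8        | 251066 | 50195  | 5.00189
N-Queens               | 12       | 256930 | 51368  | 5.001751
N-Queens               | 16       | 320619 | 64105  | 5.001466

Table 8. Context-switching execution time for different benchmark programs (#clock: total clock cycles spent on context switching; #count: number of context switches; Avg. clock = #clock / #count)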
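The Avg. clock column is simply #clock divided by #count. The short program below recomputes it for a few Table 8 entries as a sanity check; the class and method names are illustrative only.

```java
/** Recomputes Table 8's Avg. clock column as #clock / #count. */
class ContextSwitchAverage {
    /** Average cycles per context switch. */
    static double average(long totalClocks, long switchCount) {
        return (double) totalClocks / switchCount;
    }

    public static void main(String[] args) {
        String[] labels = {
            "Dummy Test, 8 threads", "Matrix Multiplication, 16 threads",
            "Multi-logic, 16 threads", "N-Queens, 16 threads"
        };
        long[] clocks = {700330, 777639, 924149, 320619};   // #clock from Table 8
        long[] counts = {140029, 155490, 184794, 64105};    // #count from Table 8
        for (int i = 0; i < labels.length; i++) {
            System.out.printf("%-35s %.5f%n", labels[i], average(clocks[i], counts[i]));
        }
    }
}
```

Every entry comes out just above 5 cycles per switch, matching the best-case method-switch cost described above.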

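On a single-core JAIP we tested the Thread Manager Unit with four time-slice settings: 20, 50, 100, and 500 microseconds. Figure 50 shows the results. With a 20-microsecond time slice most benchmark programs reach their highest scores, so all remaining tests in this subsection have the Thread Manager Unit give each thread the same 20-microsecond time slice. Scores also fall as the time slice grows; one reason is the way the benchmark programs control when each thread executes its main workload.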
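The time slices in Figure 50 can be put in perspective by converting them to clock cycles and comparing them with the roughly 5-cycle switch cost from Table 8. The clock frequency is not stated in this section, so CLOCK_MHZ = 100 below is a placeholder assumption, as is the class name SliceOverhead. The overhead is taken as switch cycles / (slice cycles + switch cycles).

```java
/**
 * Estimates the fraction of time spent context switching for a given time slice.
 * CLOCK_MHZ is a placeholder assumption, not the actual JAIP clock frequency.
 */
class SliceOverhead {
    static final double CLOCK_MHZ = 100.0;      // assumed clock frequency
    static final double SWITCH_CYCLES = 5.0;    // about 5 cycles per switch (Table 8)

    /** Length of one time slice in clock cycles. */
    static double sliceCycles(double sliceMicros) {
        return sliceMicros * CLOCK_MHZ;         // microseconds x cycles per microsecond
    }

    /** Fraction of each (slice + switch) period spent switching. */
    static double overhead(double sliceMicros) {
        double slice = sliceCycles(sliceMicros);
        return SWITCH_CYCLES / (slice + SWITCH_CYCLES);
    }

    public static void main(String[] args) {
        double[] slices = {20, 50, 100, 500};   // time slices tested in Figure 50
        for (double s : slices) {
            System.out.printf("%5.0f us -> %8.0f cycles, overhead %.4f%%%n",
                    s, sliceCycles(s), overhead(s) * 100);
        }
    }
}
```

Under the assumed 100 MHz clock, a 20-microsecond slice is 2000 cycles and the switch overhead is about 0.25%, which is consistent with context switching not being the bottleneck; the score drop at longer slices therefore has a different cause.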

Figure 50. Effect of different time-slice parameters on benchmark scores: (a) Dummy Test, (b) Multi-Logic, (c) Matrix Multiplication, (d) N-Queens

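In every benchmark program used in this chapter, the main thread records the execution time. When a test program starts, the main thread first enters the measure method (Figure 51(a)) and creates the configured number of threads. Then, in each round, the main thread calls Util.getTimeMillis() to read JAIP's internal timer and executes executeParallel. The calls r.run() in Figure 51(a) and work.run() in Figure 51(b) are where the main thread and each worker thread, respectively, execute the benchmark's main workload. Every worker thread (class Worker) uses a field, finished, to control when it executes work.run(). On entering executeParallel, the main thread first sets each worker's finished variable to false, which allows the workers to start executing work.run(). After finishing r.run() itself, the main thread enters a loop that checks whether every worker's finished variable has been set back to true.

Only when all threads have completed the main workload does the main thread leave the loop and return from executeParallel; it then calls Util.getTimeMillis() again to read JAIP's internal timer and computes the execution time and score. Now suppose that, in some measurement round, the main thread is switched out right after its first timer read, before it has set the other threads' finished variables to false. The other threads are then stuck in the while-loop of the run method (Figure 51(b)) and cannot execute the main workload in work.run(). Because the Thread Manager Unit gives every thread the same time slice in turn, each of these threads spins for a full slice before the main thread runs again and releases them, so the wasted time grows with the time slice. In this situation a larger time slice leads to lower performance.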
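The effect can be reproduced with a toy single-core round-robin model. Everything here is a hypothetical simplification: the class name SpinWasteModel, equal work per thread, threads that never yield (a thread with nothing useful to do spins until its slice expires), and zero context-switch cost. The flag preemptedBeforeRelease models the case where the main thread loses the processor right after its first timer read.

```java
/**
 * Toy single-core round-robin model of one executeParallel() round.
 * Thread 0 is the main thread; threads 1..workers are Worker threads.
 * Hypothetical assumptions: each thread owns workUs of work, threads never
 * yield, and context-switch cost is ignored.
 */
class SpinWasteModel {

    /** Simulated time (us) between the main thread's two timer reads. */
    static long simulateRound(int workers, long sliceUs, long workUs,
                              boolean preemptedBeforeRelease) {
        if (workers < 1 || sliceUs <= 0 || workUs < 0) {
            throw new IllegalArgumentException("invalid parameters");
        }
        long[] remaining = new long[workers];   // work left per worker
        boolean released = false;               // finished flags cleared yet?
        long mainRemaining = workUs;            // main thread's own r.run() share
        long now = 0;                           // first timer read at t = 0
        // An early preemption hands the first slice to worker 1, not the main thread.
        int turn = preemptedBeforeRelease ? 1 : 0;

        while (true) {
            if (turn == 0) {                                  // main thread
                if (!released) {                              // set finished = false
                    java.util.Arrays.fill(remaining, workUs);
                    released = true;
                }
                long run = Math.min(sliceUs, mainRemaining);  // r.run()
                mainRemaining -= run;
                if (mainRemaining == 0 && allDone(remaining)) {
                    return now + run;                         // second timer read
                }
                now += sliceUs;                               // spins in the do-while loop
            } else {                                          // worker thread
                int w = turn - 1;
                if (released && remaining[w] > 0) {
                    remaining[w] -= Math.min(sliceUs, remaining[w]);  // work.run()
                }
                now += sliceUs;   // idle or finished workers spin for the rest of the slice
            }
            turn = (turn + 1) % (workers + 1);
        }
    }

    private static boolean allDone(long[] remaining) {
        for (long r : remaining) {
            if (r > 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (long slice : new long[] {20, 50, 100, 500}) {
            long normal = simulateRound(3, slice, 100, false);
            long delayed = simulateRound(3, slice, 100, true);
            System.out.printf("slice %3d us: %5d us normal, %5d us preempted early%n",
                    slice, normal, delayed);
        }
    }
}
```

In this model the early preemption always adds exactly workers × slice to the round, because each unreleased worker burns one full slice. The normal case also grows at 500 us, since in the model a thread that has finished its share keeps spinning until its slice expires.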

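// Figure 51(b): worker thread
class Worker extends Thread {
    volatile boolean finished;
    Runnable work;
    // …

    void setExecute(Runnable r) {
        work = r;
    }

    // …

    public void run() {
        while (true) {              // the worker never exits; it keeps polling finished
            if (!finished) {
                work.run();         // main benchmark workload
                finished = true;    // signal completion to the main thread
            }
        }
    }
}

// Figure 51(a): main thread
public int measure() {
    // …
    start = Util.getTimeMillis();   // first read of the JAIP internal timer
    for (int i = 0; i < cnt; ++i) {
        executeParallel(r);
    }
    end = Util.getTimeMillis();     // second read of the JAIP internal timer
    // …
}

public void executeParallel(Runnable r) {
    // Give the workload to the cpus-1 workers and release them.
    for (int i = 0; i < cpus - 1; ++i) {
        runner[i].setExecute(r);
        runner[i].finished = false;
    }
    r.run();                        // the main thread runs its own share
    // Busy-wait until every worker has set finished back to true.
    boolean allFinished;
    do {
        allFinished = true;
        for (int i = 0; i < cpus - 1; ++i) {
            allFinished &= runner[i].finished;
        }
    } while (!allFinished);
}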

Figure 51. How the Jembench Suite parallel benchmark programs measure execution time
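For experimentation on a desktop JVM, the harness in Figure 51 can be approximated with standard java.lang.Thread. This is a sketch, not the Jembench code: System.nanoTime() stands in for the JAIP-specific Util.getTimeMillis(), and the class name ParallelHarness, the initial value finished = true, the daemon flag, and the constructor that starts the workers are assumptions added so that the program runs and exits. As in Figure 51, idle workers busy-wait on the volatile finished flag.

```java
/** Desktop-JVM approximation of the Figure 51 measurement harness (sketch). */
class ParallelHarness {
    static class Worker extends Thread {
        volatile boolean finished = true;   // assumed: idle until the main thread clears it
        volatile Runnable work;

        Worker() {
            setDaemon(true);                // assumed: lets the JVM exit while workers spin
        }

        void setExecute(Runnable r) {
            work = r;
        }

        @Override
        public void run() {
            while (true) {
                if (!finished) {
                    work.run();             // main benchmark workload
                    finished = true;        // volatile write publishes the work's effects
                }
            }
        }
    }

    private final Worker[] runner;

    ParallelHarness(int cpus) {
        runner = new Worker[cpus - 1];
        for (int i = 0; i < runner.length; i++) {
            runner[i] = new Worker();
            runner[i].start();
        }
    }

    void executeParallel(Runnable r) {
        for (Worker w : runner) {
            w.setExecute(r);
            w.finished = false;             // release the worker
        }
        r.run();                            // main thread runs its own share
        boolean allFinished;
        do {                                // busy-wait for all workers
            allFinished = true;
            for (Worker w : runner) {
                allFinished &= w.finished;
            }
        } while (!allFinished);
    }

    /** Returns elapsed nanoseconds for cnt rounds of executeParallel. */
    long measure(Runnable r, int cnt) {
        long start = System.nanoTime();
        for (int i = 0; i < cnt; ++i) {
            executeParallel(r);
        }
        return System.nanoTime() - start;
    }
}
```

With cpus = 4, each round runs the workload four times (three workers plus the main thread), so a counter incremented by the workload reaches 40 after 10 rounds.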

From the document 多執行緒Java處理器設計 (Multithreaded Java Processor Design), pp. 83-86.
