Related Work on Multithreading and Synchronization

Chapter 2 Related Work

2.2. Related Work on Multithreading and Synchronization

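Among other research on multithreaded Java processors, the Java accelerator circuit proposed in [22] consists of a CPU portion and a thread lifetime unit that manages and schedules the active threads. A timer counts down the execution time of the current thread. Once the timer value is less than or equal to 0, the thread lifetime unit signals the CPU portion. The current thread's register sets in the CPU portion (including, among other state, the current thread's operand stack values) are then backed up to memory and cleared, and the Java Virtual Machine is loaded into the CPU portion to perform thread scheduling.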
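The countdown behavior described above can be sketched in software. This is only a minimal model under stated assumptions: the class name `ThreadLifetimeTimer`, the `sliceLength` parameter, and the one-tick-per-cycle granularity are hypothetical and are not taken from [22].

```java
// Hypothetical software model of the time-slice countdown described for [22].
// All names here are illustrative; the cited design is a hardware unit.
final class ThreadLifetimeTimer {
    private final int sliceLength; // assumed reload value, in cycles
    private int remaining;         // cycles left for the current thread

    ThreadLifetimeTimer(int sliceLength) {
        this.sliceLength = sliceLength;
        this.remaining = sliceLength;
    }

    /** Advances one cycle and returns true once the timer value is <= 0. */
    boolean tick() {
        remaining--;
        return remaining <= 0;
    }

    /** Reloads the timer after the next thread has been scheduled. */
    void reload() {
        remaining = sliceLength;
    }
}
```

In this sketch, a `true` result from `tick()` stands in for the signal that the thread lifetime unit sends to the CPU portion; saving the register sets and running the scheduler are left out.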

Figure 5 The block diagram of the KOMODO microcontroller

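KOMODO [16] uses four thread slots to store the execution state of each thread, including a program counter, an instruction window, and a set of stack registers (Figure 5). Each fetch loads one 4-byte instruction package from the Memory Interface into the instruction windows, at which point the four instruction windows may hold 0 to 4 opcodes. The Priority Manager then checks the characteristic value [19] of each thread to decide which thread's instruction window is executed first and passed to MEM or ALU. The characteristic value is an internal register of the Priority Manager and can be regarded as the thread state: it records whether the thread is waiting for an atomic lock and whether an instruction of the thread is performing an external memory operation. In the KOMODO processor, when an I/O operation completes (e.g., an external timer that measures the execution time of each task, or a memory access), an external signal is sent that triggers the Signal Unit, which causes the Priority Manager to update the characteristic value. Increasing the number of threads under this architecture tends to drive circuit resource usage too high.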
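The selection step performed by the Priority Manager can be illustrated with a small software model. This is a sketch under assumptions: the text above does not state the selection policy, so a plain "highest numeric priority among non-stalled slots" rule is assumed, and the names `ThreadSlot`, `PriorityManagerModel`, `select`, and `onMemorySignal` are hypothetical.

```java
// Hypothetical model of one KOMODO-style thread slot. The two boolean flags
// mirror the thread state that the characteristic value is said to record.
final class ThreadSlot {
    final int id;
    int priority;             // assumed numeric priority, higher wins
    boolean waitingForLock;   // thread is waiting for an atomic lock
    boolean waitingForMemory; // an external memory operation is in flight
    int opcodesInWindow;      // 0 to 4 opcodes buffered in the instruction window

    ThreadSlot(int id, int priority) {
        this.id = id;
        this.priority = priority;
    }

    /** A slot can issue only if it is not stalled and has an opcode buffered. */
    boolean isIssuable() {
        return !waitingForLock && !waitingForMemory && opcodesInWindow > 0;
    }
}

final class PriorityManagerModel {
    /** Returns the id of the slot to issue from, or -1 if every slot is stalled. */
    static int select(ThreadSlot[] slots) {
        ThreadSlot best = null;
        for (ThreadSlot slot : slots) {
            if (slot.isIssuable() && (best == null || slot.priority > best.priority)) {
                best = slot;
            }
        }
        return best == null ? -1 : best.id;
    }

    /** Models the Signal Unit clearing a memory wait when the external signal arrives. */
    static void onMemorySignal(ThreadSlot slot) {
        slot.waitingForMemory = false;
    }
}
```

The model also hints at why adding threads is costly in such a design: every additional slot needs its own program counter, instruction window, and stack registers, plus one more input to the selection logic.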

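Jamuth [20] builds on KOMODO and proposes several design improvements (Figure 6). For example, all hardware threads share a 2k-entry stack cache, and a 4-KB direct-mapped instruction cache holds part of the method bytecode to reduce instruction fetch latency. However, the 2k-entry stack cache is divided into four parts that hold the stack frames of the four threads. With this approach, only one thread at a time can execute instructions and access the stack cache, while the other threads can only move their own stack frames into the stack cache or out to external memory. This approach likewise increases circuit complexity. Jamuth implements all thread synchronization mechanisms, such as the Java instruction monitorenter [6], with complex trap routines, and therefore incurs a relatively high synchronization overhead.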
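For reference, the following sketch shows where monitorenter arises in ordinary Java source: a `synchronized` block is compiled to a monitorenter/monitorexit pair, so every entry into such a block pays whatever cost the processor attaches to those instructions. The class `SharedCounter` and the helper `runDemo` are illustrative names only and are unrelated to the Jamuth implementation.

```java
// Illustrative only: a counter guarded by a synchronized block.
final class SharedCounter {
    private final Object lock = new Object();
    private int value;

    void increment() {
        synchronized (lock) { // compiled to monitorenter on `lock`
            value++;
        }                     // monitorexit on normal and exceptional exit
    }

    int get() {
        synchronized (lock) {
            return value;
        }
    }

    /** Starts several threads that increment one counter and returns the total. */
    static int runDemo(int threadCount, int incrementsPerThread) {
        SharedCounter counter = new SharedCounter();
        Thread[] workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < incrementsPerThread; n++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
        return counter.get();
    }
}
```

On a processor that handles monitorenter in a trap routine, each call to `increment()` in this sketch would enter that routine twice, once to acquire and once to release the monitor.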

Figure 6 The block diagram of jamuth

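The aJile aJ-102 [16] contains an execution unit, a unified I-cache and D-cache, and 32 KB of microcode RAM. The aJ-200 supports direct execution of thread management and synchronization functions through extended bytecode instructions, an approach that also requires very little hardware cost. The official aJ-200 documentation states that the context-switching overhead is less than 1 microsecond. In addition, the aJ-200 provides an MJM (multiple JVM manager) feature that allows two JVMs to run two independent Java applications concurrently. Each JVM is assigned an internal timer that counts the time slice of the current thread, and under MJM the thread scheduling algorithm of each JVM can be configured individually.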

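In picoJava-II [13][21], each thread context consists of a set of registers that record the following information: the base address of the current method, the position of the top-of-stack element, the starting position of the local variables, the external memory address range allocated to each thread, the base address of the constant pool, the program status register, the program counter, and so on. picoJava uses an interrupt to run additional context-switching code that swaps the contexts of the current thread and the next ready thread, and it requires additional software to support thread scheduling. This approach increases the context-switching overhead.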

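picoJava supports a hardware synchronization mechanism that can handle the acquisition or release of up to two lock objects at the same time. During a synchronization operation, special-purpose registers temporarily hold the reference address of the lock object, whether another waiting thread currently exists, and the number of times the lock owner has re-acquired the lock. In memory, every object header contains a lock bit that indicates whether the object has already been acquired by some thread. [21] proposes maintaining, inside each Java object, a pointer to a user-defined Java class data structure that stores the waiting threads associated with each lock object. Although this approach requires very little hardware cost, maintaining the waiting threads during synchronization operations increases the number of object field accesses, so execution time increases.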
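The per-lock state listed above (lock bit, owner re-entry count, and waiting threads) can be modeled in a few lines. This is a sketch under assumptions: the name `LockRecord`, the use of integer thread ids, and the FIFO hand-off to the oldest waiter are all hypothetical choices, not details taken from picoJava or [21].

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical software model of the lock state described for picoJava.
final class LockRecord {
    private boolean lockBit;  // mirrors the lock bit in the object header
    private int owner = -1;   // id of the owning thread, -1 when free
    private int recursion;    // how many times the owner has taken the lock
    private final Deque<Integer> waiters = new ArrayDeque<>(); // waiting thread ids

    /** Models monitorenter; returns true if the thread now holds the lock. */
    boolean enter(int threadId) {
        if (!lockBit) {
            lockBit = true;
            owner = threadId;
            recursion = 1;
            return true;
        }
        if (owner == threadId) {
            recursion++;              // re-entrant acquisition by the owner
            return true;
        }
        if (!waiters.contains(threadId)) {
            waiters.addLast(threadId); // extra bookkeeping on the contended path
        }
        return false;
    }

    /** Models monitorexit; returns the id of the thread handed the lock, or -1. */
    int exit(int threadId) {
        if (!lockBit || owner != threadId) {
            throw new IllegalMonitorStateException();
        }
        if (--recursion > 0) {
            return -1;                // owner still holds the lock
        }
        if (waiters.isEmpty()) {
            lockBit = false;
            owner = -1;
            return -1;
        }
        owner = waiters.removeFirst(); // assumed FIFO hand-off
        recursion = 1;
        return owner;
    }

    boolean hasWaiters() {
        return !waiters.isEmpty();
    }
}
```

In this model, the contended path touches the waiter queue on both `enter` and `exit`, which corresponds to the extra object field accesses that the text identifies as the cost of keeping the waiting threads in a Java data structure.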

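In a multicore execution environment, the Java Optimized Processor (JOP) [24][28] uses up to eight Java processor cores together with shared memory [23][27] (Figure 7). Each processor has its own stack cache and method cache, both implemented in on-chip memory, and uses custom microcode to perform stack operations.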

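When a Java program starts, one processor in the multicore JOP environment (assume JOP0) is responsible for initializing the whole system and begins executing bytecode once initialization completes. Each processor is assigned an ID number. Whenever a new thread is created in the system, the first processor (JOP0) always assigns ID numbers and threads to the other processors; thread migration is not supported in this architecture. Each processor executes its threads in rotation and checks execution time with an internal timer. When a JOP processor needs to perform a context switch, a local timer interrupt triggers the thread scheduler, which is implemented in Java and performs thread scheduling and switches the stack frames of the two threads. Because JOP implements the stack cache in on-chip memory, the number of stack frames used by the threads determines the time spent moving data to and from main memory, and therefore also affects the overall context-switching time. For hardware synchronization, the Arbiter handles the read/write operations of each JOP processor on shared memory in order. The Synchronization unit contains only one global lock, and every thread must acquire this global lock from the unit before entering a synchronized method or synchronized code block. A JOP processor that fails to acquire the lock must wait, and while it waits it cannot switch to another ready thread to continue executing bytecode.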
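The single-global-lock behavior can be sketched as follows. This is a software analogy under assumptions, not the JOP hardware: Java threads stand in for cores, and the names `GlobalLockModel`, `acquire`, `release`, and `runDemo` are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model: one global lock shared by all cores. A core that fails
// to acquire it keeps retrying and does no other work in the meantime.
final class GlobalLockModel {
    private final AtomicBoolean held = new AtomicBoolean(false);
    private int sharedCounter; // stands in for any state guarded by the lock

    void acquire() {
        while (!held.compareAndSet(false, true)) {
            // The modeled core simply stalls here. The yield only keeps the
            // host machine responsive while this demo runs.
            Thread.yield();
        }
    }

    void release() {
        held.set(false);
    }

    /** Runs `cores` threads that each enter the critical section repeatedly. */
    static int runDemo(int cores, int entriesPerCore) {
        GlobalLockModel sync = new GlobalLockModel();
        Thread[] workers = new Thread[cores];
        for (int i = 0; i < cores; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < entriesPerCore; n++) {
                    sync.acquire();
                    try {
                        sync.sharedCounter++;
                    } finally {
                        sync.release();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
        return sync.sharedCounter;
    }
}
```

Because there is only one lock in this model, two threads that synchronize on unrelated objects still serialize against each other, and a waiting core makes no progress on its other ready threads, which is the limitation the paragraph above describes.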

[Figure 7 block labels: JOP_0, JOP_1, ..., JOP_N (each with a stack cache and a method cache), Arbiter, scheduler, I/O, shared memory, Sync. unit]

Figure 7 CMP system based on JOP

