Advanced Computer Architecture: Course Key Points

(1)


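Chapter 1: Introduction

• Reasons for the rapid progress of computer technology

– Technological progress: Moore's law

– Advances in computer architecture

• Evolution of computer architecture

• Components of modern computer architecture

• Research scope of advanced computer architecture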

(2)

• Classification of computer architectures

– Flynn's taxonomy: qualitative

– Feng's taxonomy: quantitative

(3)

Chapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation

• What is pipelining?

• How is pipelining implemented?

• What makes pipelining hard to implement?

• How is CPI reduced?

– CPI=1

– CPI<1

– CPI<<1

(4)

• Ideal Performance for Pipelining

– Ideal speedup equals the number of pipeline stages

• MIPS instruction format

• Work done in each stage of the MIPS 5-stage pipeline

• The MIPS pipeline

• Pipeline hazards: the major hurdle

– Structural hazards

– Data hazards

– Control hazards

– All can be resolved by stalling

(5)

• Possible solutions for structural hazards

– “Double bump”

– Insert stalls

– Provide another memory port

– Split instruction memory and data memory

– Use an instruction buffer

– Fully pipelined functional units

• Why allow a machine with structural hazards?

(6)

Possible solutions for data hazards

– Interlock: insert stalls

– Detect: data hazard logic

– Forwarding: reduce data hazard stalls

– Compiler scheduling to avoid load stalls

(7)

Possible solutions for control hazards

– Move the branch computation forward

– Simple solutions

• Freeze or flush the pipeline

• Predict-not-taken (Predict-untaken)

– Treat every branch as not taken

• Predict-taken

– Treat every branch as taken

– Delayed branch

• a,b,c

– Cancelling function

(8)

Extending the MIPS Pipeline to Handle Multicycle Operations

• complex pipeline structure

• Pipelining time parameter

– Latency

– Initiation interval

• Out-of-order completion

• New types of data hazards

– RAW

Stalls arising

– WAW

(9)

Instruction-Level Parallelism

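• Pipelined CPI:

$$\begin{aligned}
\mathrm{CPI}_{\text{pipelined}} &= \text{Ideal pipeline CPI} + \text{Pipeline stall cycles per instruction} \\
&= 1 + \text{Structural stalls} + \text{RAW stalls} + \text{WAR stalls} + \text{WAW stalls} + \text{Control stalls}
\end{aligned}$$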

• ILP within a basic block is quite small

• Data dependences and hazards

– True data dependence → RAW (read after write)

– Name dependence

• Antidependence → WAR (write after read)

• Output dependence → WAW (write after write)
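
The mapping from dependence type to hazard type can be illustrated with a short sketch. This is only a minimal illustration, assuming each instruction is described by one destination register and a tuple of source registers; the names `Instr` and `classify` are hypothetical and not part of the course material.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str               # assembly text, for display only
    dest: Optional[str]     # register written, or None
    srcs: Tuple[str, ...]   # registers read


def classify(first: Instr, second: Instr) -> List[str]:
    """Hazards `second` may have with `first`, where `first` is earlier in program order."""
    found = []
    if first.dest is not None and first.dest in second.srcs:
        found.append("RAW")  # true data dependence
    if second.dest is not None and second.dest in first.srcs:
        found.append("WAR")  # antidependence (a name dependence)
    if first.dest is not None and first.dest == second.dest:
        found.append("WAW")  # output dependence (a name dependence)
    return found


div = Instr("DIV.D F0,F2,F4", "F0", ("F2", "F4"))
add = Instr("ADD.D F6,F0,F8", "F6", ("F0", "F8"))
sub = Instr("SUB.D F8,F10,F14", "F8", ("F10", "F14"))
mul = Instr("MUL.D F6,F10,F8", "F6", ("F10", "F8"))

print(classify(div, add))  # ['RAW']
print(classify(add, sub))  # ['WAR']
print(classify(add, mul))  # ['WAW']
```

Whether a reported pair actually stalls depends on the pipeline, not on this check: the sketch only finds dependences in the program.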

(10)

Some Properties of Dependences and Hazards

• Dependences are a property of programs

• The presence of a hazard and the length of any stall are properties of the pipeline (hardware)

• Control Dependencies

– Branch Behavior

(11)

Overcoming Data Hazards with Dynamic Scheduling

• Key idea: allow instructions behind a stall to proceed

– in-order issue

– out-of-order execution

– out-of-order completion

(12)

Dynamic Scheduling with a Scoreboard

• Issue: an instruction is issued when

– the functional unit is available, and

– no other active instruction has the same destination register.

– This avoids structural hazards and WAW hazards.

• Read operands (RD)

– The read is delayed until the operands are available.

– That is, no previously issued but uncompleted instruction has the operand as its destination.

– This resolves RAW hazards dynamically.

• Execution (EX)

– Notify the scoreboard on completion so the functional unit can be reused.

• Write result (WB)

(13)

Dynamic Scheduling with Tomasulo's Algorithm (renaming in hardware!)

• Control and buffers are distributed with the functional units (FUs)

– called reservation stations

• Registers in instructions are replaced by values or by pointers to reservation stations (RS)

– called register renaming

– avoids WAR and WAW hazards

• Results go to the FUs from the RSs

– not through registers,

– but over a Common Data Bus that broadcasts results to all FUs
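
Why renaming removes WAR and WAW hazards can be seen in a small sketch. It is a deliberately simplified, static model and not Tomasulo's algorithm itself: each instruction is assumed to be a `(dest, src1, src2)` triple, every destination gets a fresh tag, and the function name `rename` and the tag names `RS1`, `RS2`, ... are hypothetical.

```python
from itertools import count


def rename(program):
    """Replace each destination with a fresh tag and each source with its latest producer's tag."""
    latest = {}                           # architectural register -> tag of most recent writer
    tags = (f"RS{n}" for n in count(1))
    renamed = []
    for dest, src1, src2 in program:
        a = latest.get(src1, src1)        # read sources before recording the new writer
        b = latest.get(src2, src2)
        tag = next(tags)
        latest[dest] = tag
        renamed.append((tag, a, b))
    return renamed


program = [
    ("F0", "F2", "F4"),     # DIV.D F0,F2,F4
    ("F6", "F0", "F8"),     # ADD.D F6,F0,F8   (RAW on F0)
    ("F8", "F10", "F14"),   # SUB.D F8,F10,F14 (WAR on F8 with the ADD.D)
    ("F6", "F10", "F8"),    # MUL.D F6,F10,F8  (WAW on F6 with the ADD.D)
]
renamed = rename(program)
for row in renamed:
    print(row)
```

After renaming, every destination is unique, so no WAW remains, and the SUB.D no longer writes the F8 that the ADD.D reads, so the WAR is gone. The true dependences (RS1 into the ADD.D, RS3 into the MUL.D) are preserved.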

(14)

Three Stages of Tomasulo Algorithm

• Issue: get an instruction from the FP Op Queue

– If a reservation station is free (no structural hazard), control issues the instruction and sends the operands (renames registers).

• Execute: operate on operands (EX)

– When both operands are ready, execute; if not, watch the Common Data Bus for the result.

• Write result: finish execution (WB)

– Write on the Common Data Bus to all waiting units; mark the reservation station available.

(15)

[Figure: Tomasulo organization. The FP Op Queue (from the instruction unit) and the FP registers feed the reservation stations: Add1 to Add3 for the FP adders and Mult1 and Mult2 for the FP multipliers. Load buffers Load1 to Load6 receive data from memory, store buffers send data to memory, and all results are broadcast on the Common Data Bus (CDB).]

(16)

Reservation Station Components

Reservation station:

• Op: operation to perform in the unit

• Vj, Vk: values of the source operands

– Store buffers have a V field: the result to be stored

• Qj, Qk: reservation stations producing the source registers (the value to be written)

– Note: Qj, Qk = 0 means ready

– Store buffers have only Qi, for the RS producing the result

• A: holds information for the memory address calculation

• Busy: indicates that the reservation station or FU is busy

Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.

(17)

(18)

Reducing Branch Costs with Dynamic Hardware Prediction (3.4)

• 1-bit branch-prediction buffer

• 2-bit branch-prediction buffer

• Correlating branch-prediction buffer

• Tournament branch predictor

• Branch target buffer

• Trace cache
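
The difference between the 1-bit and 2-bit prediction buffers can be sketched with saturating counters. This is a minimal sketch under stated assumptions: the buffer is indexed by the low bits of a word-aligned PC, the 2-bit scheme is modeled as a saturating counter (one common variant), and the class and function names are hypothetical.

```python
class CounterPredictor:
    """Branch-prediction buffer of n-bit saturating counters (bits=1 or bits=2)."""

    def __init__(self, bits=2, entries=16):
        self.max_count = (1 << bits) - 1
        self.threshold = 1 << (bits - 1)   # counter >= threshold predicts taken
        self.entries = entries
        self.counters = [0] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries    # assumes 4-byte, word-aligned instructions

    def predict(self, pc):
        return self.counters[self._index(pc)] >= self.threshold

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(self.max_count, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def mispredictions(predictor, pc, outcomes):
    """Count wrong predictions for one branch over a sequence of outcomes."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A loop branch: taken 9 times, then not taken on exit; the loop runs 3 times.
outcomes = ([True] * 9 + [False]) * 3
one_bit = mispredictions(CounterPredictor(bits=1), 0x400, outcomes)
two_bit = mispredictions(CounterPredictor(bits=2), 0x400, outcomes)
print(one_bit, two_bit)  # 6 5
```

In steady state the 1-bit buffer mispredicts twice per loop execution (on exit and again on re-entry), while the 2-bit counter mispredicts only on exit.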

(19)

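Hardware-Based Speculation (3.7)

• Basic concept: hardware-based speculation is essentially an integration of three techniques:

– dynamic branch prediction, to choose which instructions to execute speculatively;

– speculation, to execute those instructions before the control dependences are resolved;

– dynamic scheduling, to schedule different combinations of basic blocks.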

(20)

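Hardware speculation based on Tomasulo dynamic scheduling

• Out-of-order execution

• In-order completion

• An added pipeline stage: Commit

• An added hardware structure: the reorder buffer

• The role of the reorder buffer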

(21)

The four steps of instruction execution with hardware speculation

• Issue in order

• Execute out of order

• Write result out of order

• Commit in order

(22)

Multiple-issue techniques

• Multiple issue means issuing more than one instruction per clock cycle.

• Two basic flavors of multiple issue:

– Superscalar

– VLIW (very long instruction word)

(23)

Scheduled

• A superscalar processor has dynamic issue capability

• A VLIW processor has static issue capability

(24)

Dual-issue Tomasulo pipeline

(25)

Multiple Issue with Speculation (without speculation)

(26)

Multiple Issue with Speculation (with speculation)

(27)

Chapter 4: Exploiting Instruction-Level Parallelism with Software Approaches

• Loop Unrolling

– Using Loop Unrolling and Pipeline Scheduling with Static Multiple Issue

(28)

Static Multiple Issue: the VLIW Approach

(29)

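• Loop-carried dependence: data accesses in later iterations depend on data values produced in earlier iterations

– Dependence distance: iteration i uses data from iteration i+n, where n is the dependence distance

– Recurrence (wrap-around dependence): a dependence within the loop body that is also carried across iterations

• The greatest common divisor (GCD) test

• Determining the type of dependence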
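
The GCD test can be written in a few lines. The sketch assumes the usual affine form, where the loop writes X[a*i + b] and reads X[c*i + d]; a loop-carried dependence is possible only if gcd(a, c) divides (d - b). The function name `gcd_test` is hypothetical, and the test is conservative: it ignores loop bounds, so a result of True means only that a dependence may exist.

```python
from math import gcd


def gcd_test(a, b, c, d):
    """True if a dependence between a write to X[a*i + b] and a read of X[c*i + d] is possible."""
    g = gcd(a, c)
    if g == 0:
        # Both strides are zero: the same element is touched only if the offsets match.
        return b == d
    return (d - b) % g == 0


# X[2*i + 3] = X[2*i] * 5.0  -> gcd(2, 2) = 2 does not divide -3: no dependence
print(gcd_test(2, 3, 2, 0))  # False

# X[i + 1] = X[i] + Y[i]     -> gcd(1, 1) = 1 divides -1: dependence possible
print(gcd_test(1, 1, 1, 0))  # True
```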

(31)

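Software Pipelining: Symbolic Loop Unrolling

[Figure: software pipelining. Iterations 0 to 4, each consisting of LD, ADD, SD, are overlapped in time; the diagram marks the start-up code, the software-pipelined loop body, and the finish-up code.]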

(32)

Global Code Scheduling: trace scheduling

• Trace selection

• Trace compaction

• Compensation code for mispredicted paths

(33)

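Hardware support for exposing more parallelism

• Conditional instructions

• Predicated instructions

With a conditional load (LWC):

First instruction         Second instruction
LW   R1, 40(R2)           ADD R3, R4, R5
LWC  R8, 0(R10), R10      ADD R6, R3, R7
BEQZ R10, L
LW   R9, 0(R8)

Without the conditional load:

First instruction         Second instruction
LW   R1, 40(R2)           ADD R3, R4, R5
                          ADD R6, R3, R7
BEQZ R10, L
LW   R8, 0(R10)
LW   R9, 0(R8)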

(34)

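Three techniques for compiler speculation with hardware support

• The hardware and the operating system cooperate to ignore exceptions raised by speculative instructions; that is, the exception is simply not handled.

• When a speculative instruction raises an exception, a status (flag) bit, called the poison bit, is set on the register that the instruction writes. When a normal instruction later tries to use that register, the poison bit signals a fault, so a possibly wrong result is never used.

• The hardware marks an instruction as speculative and holds its result in a buffer. Once the instruction is no longer speculative, the buffered result is moved to the register or memory, which avoids errors when the speculation fails.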

(35)

Chapter 6: Multiprocessors and Thread-Level Parallelism

• Definition of parallel processing

• The meaning of parallelism

• The challenges of parallelism

• Centralized shared-memory architecture

– SMP (symmetric (shared-memory) multiprocessors)

– UMA (uniform memory access)

(36)

• Distributed-memory architecture

– Interconnection network

– Message passing

– Nodes

• Models of distributed-memory architecture

– Distributed shared memory (DSM, or scalable shared memory)

– Multicomputers

(37)

Communication models

• Shared-memory communication model

• Message-passing model

• Two memory organizations

– Single address space (distributed shared memory)

– Private address spaces (multicomputer)

• Two communication models

– Shared memory

– Message passing

• Correspondence between the two

(38)

What Is Multiprocessor Cache Coherence?

• Cache incoherence in multiprocessors

– Write back

– Write through

– Memory, cache, another cache, bus events

• Definition of cache coherence

– The coherence problem

– The consistency problem

• A correct definition of coherence has three aspects

(39)

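Two classes of coherence protocols

1. Directory based

– The sharing status of each block of physical memory is kept in a structure called the directory (covered in Section 6.5).

2. Snooping

– The sharing status of a block is kept in every cache that holds a copy of it; there is no centralized structure. Because the caches are usually attached to the shared-memory bus, all cache controllers monitor (snoop) the bus to determine whether they hold a copy of the requested block.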

(40)

Two snooping protocols

1. Write invalidate protocol

2. Write update (write broadcast) protocol

• How a shared cache block changes under write back and write through

• Snooping protocol example

(41)

Requests and actions of the coherence mechanism (write invalidate): processor and bus

(42)

Cache state transitions based on requests (Ep558)

1. Requests from the CPU

States: Invalid; Shared (read only); Exclusive (read/write)

Invalid → Shared: CPU read; place read miss on bus
Invalid → Exclusive: CPU write; place write miss on bus
Shared → Shared: CPU read hit
Shared → Shared: CPU read miss; place read miss on bus
Shared → Exclusive: CPU write; place write miss on bus
Exclusive → Shared: CPU read miss; write back block
Exclusive → Exclusive: CPU write miss; write back cache block; place write miss on bus

(43)

Cache state transitions based on requests

2. Requests from the bus

States: Invalid; Shared (read only); Exclusive (read/write)

Shared → Invalid: write miss for this block
Exclusive → Invalid: write miss for this block; write back block; abort memory access
Exclusive → Shared: read miss for this block; write back block; abort memory access
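
The CPU-side and bus-side transitions can be combined in a small simulation. This is a sketch under simplifying assumptions: a single memory block, a write-invalidate write-back protocol, and no replacement misses (so a miss in Shared or Exclusive for a different address is not modeled). The class `SnoopingBus` and its method names are hypothetical.

```python
INVALID, SHARED, EXCLUSIVE = "Invalid", "Shared", "Exclusive"


class SnoopingBus:
    """Tracks the state of one block in each of n caches attached to a shared bus."""

    def __init__(self, n_caches):
        self.states = [INVALID] * n_caches
        self.log = []

    def _snoop(self, requester, event):
        # Every other cache reacts to the bus event for this block.
        for cpu, state in enumerate(self.states):
            if cpu == requester:
                continue
            if state == EXCLUSIVE:
                self.log.append(f"P{cpu}: write back block; abort memory access")
                self.states[cpu] = SHARED if event == "read miss" else INVALID
            elif state == SHARED and event == "write miss":
                self.states[cpu] = INVALID

    def read(self, cpu):
        if self.states[cpu] == INVALID:          # CPU read miss
            self.log.append(f"P{cpu}: place read miss on bus")
            self._snoop(cpu, "read miss")
            self.states[cpu] = SHARED
        # A read hit in Shared or Exclusive causes no bus traffic.

    def write(self, cpu):
        if self.states[cpu] != EXCLUSIVE:        # write to an Invalid or Shared block
            self.log.append(f"P{cpu}: place write miss on bus")
            self._snoop(cpu, "write miss")
            self.states[cpu] = EXCLUSIVE
        # A write hit in Exclusive causes no bus traffic.


bus = SnoopingBus(2)
bus.read(0)    # P0: Invalid -> Shared
bus.read(1)    # P1: Invalid -> Shared
bus.write(0)   # P0: Shared -> Exclusive, P1: Shared -> Invalid
bus.read(1)    # P0 writes back and drops to Shared, P1: Invalid -> Shared
print(bus.states)
for line in bus.log:
    print(line)
```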

(44)

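Implementing a directory protocol

• Possible states of a cache block in a directory protocol

• Shared

– Copies of the block exist in the caches of one or more processors.

• Uncached

– No processor has a copy of the block in its cache.

• Exclusive

– Exactly one processor holds a copy of the block and has written it, so the copy in memory is out of date. That processor is the owner of the block.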

(45)

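Operating conventions of the directory protocol

– A write to data that is not exclusive always generates a cache write miss, and the processor stalls until the access completes.

How a directory-based cache coherence protocol differs from snooping:

– Snooping uses the bus (the interconnect) as the point of arbitration. A directory protocol cannot use the interconnection network as the point of arbitration.

– Directory protocols are message oriented (unlike a bus, which is transaction oriented and can use interrupts), and every message must be explicitly acknowledged.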

(46)

Types of messages sent between processors and directories (write invalidate): three kinds of nodes

(47)

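Relationship among the three kinds of nodes

• Local node

– The node where the access request originates

• Home node

– The node where the memory location and the directory entry of the address reside (the home of the data being accessed)

• Remote node

– A node that holds a copy of the data being accessed

• Relationships and messages, using a read by the local node as an example

– Local node → home node → remote node → home node → local node

– Read miss → Fetch → Data value reply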

(48)

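Directory protocol example

State transition diagram for a cache block under the directory protocol

States: Invalid; Shared (read only); Exclusive (read/write)

Invalid → Shared: CPU read; send read miss message
Invalid → Exclusive: CPU write; send write miss message
Shared → Shared: CPU read hit
Shared → Shared: CPU read miss; read miss
Shared → Exclusive: CPU write; send write miss message
Shared → Invalid: invalidate
Exclusive → Shared: CPU read miss; data write back; read miss
Exclusive → Exclusive: CPU write miss; data write back; write miss
Exclusive → Invalid: fetch invalidate; data write back
Exclusive → Shared: fetch; data write back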

(49)

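State transition diagram for a block's directory entry

States: Uncached; Shared (read only); Exclusive (read/write)

Uncached → Shared: read miss; data value reply; Sharers = {P}
Uncached → Exclusive: write miss; data value reply; Sharers = {P}
Shared → Shared: read miss; data value reply; Sharers += {P}
Shared → Exclusive: write miss; invalidate; Sharers = {P}; data value reply
Exclusive → Shared: read miss; fetch; data value reply; Sharers += {P}
Exclusive → Exclusive: write miss; fetch/invalidate; data value reply; Sharers = {P}
Exclusive → Uncached: data write back; Sharers = {}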
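
The directory-side transitions can be sketched as a small state machine for one block. This is an illustrative model only: messages are returned as `(message, node)` tuples instead of being sent over a network, acknowledgements are not modeled, and the class `DirectoryEntry` and its method names are hypothetical.

```python
class DirectoryEntry:
    """Directory state for a single memory block: Uncached, Shared, or Exclusive."""

    def __init__(self):
        self.state = "Uncached"
        self.sharers = set()

    def read_miss(self, p):
        msgs = []
        if self.state == "Exclusive":
            owner = next(iter(self.sharers))
            msgs.append(("Fetch", owner))        # owner writes the block back to its home
        self.sharers.add(p)                      # Sharers += {P} (= {P} when Uncached)
        self.state = "Shared"
        msgs.append(("Data value reply", p))
        return msgs

    def write_miss(self, p):
        msgs = []
        if self.state == "Shared":
            msgs += [("Invalidate", q) for q in sorted(self.sharers - {p})]
        elif self.state == "Exclusive":
            owner = next(iter(self.sharers))
            msgs.append(("Fetch/Invalidate", owner))
        self.sharers = {p}                       # Sharers = {P}
        self.state = "Exclusive"
        msgs.append(("Data value reply", p))
        return msgs

    def data_write_back(self):
        # The owner replaces the block: memory is up to date again.
        self.sharers = set()                     # Sharers = {}
        self.state = "Uncached"


entry = DirectoryEntry()
print(entry.read_miss(1), entry.state, sorted(entry.sharers))
print(entry.read_miss(2), entry.state, sorted(entry.sharers))
print(entry.write_miss(3), entry.state, sorted(entry.sharers))
print(entry.read_miss(1), entry.state, sorted(entry.sharers))
```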

(50)

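Synchronization: role and implementation

• What hardware primitives provide

• Typical hardware primitives

• Implementing locks using coherence

• Performance analysis and improvement of spin locks

• Spin-lock synchronization and lock contention

• Barrier synchronization

– Implementation and improvements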

(51)

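• Synchronization mechanisms for large-scale multiprocessors

– Cost

• Keep the latency of a synchronization operation low when there is no contention

• Minimize serialization when contention is heavy

– Spin lock with exponential backoff

– Combining tree barrier

– Implementing queuing locks

– The fetch-and-increment primitive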
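
A spin lock with exponential backoff can be sketched in software. This is a minimal sketch, not a hardware implementation: a non-blocking `threading.Lock.acquire` stands in for the atomic test-and-set primitive, and the class name `BackoffSpinLock`, the helper `run`, and the delay parameters are assumptions chosen for illustration.

```python
import random
import threading
import time


class BackoffSpinLock:
    """Spin lock that waits a randomized, doubling delay after each failed attempt."""

    def __init__(self, base_delay=1e-6, max_delay=1e-3):
        self._flag = threading.Lock()   # stands in for an atomic test-and-set location
        self.base_delay = base_delay
        self.max_delay = max_delay

    def acquire(self):
        delay = self.base_delay
        while not self._flag.acquire(blocking=False):    # test-and-set failed
            time.sleep(random.uniform(0, delay))         # back off before retrying
            delay = min(2 * delay, self.max_delay)       # double the window, up to a cap

    def release(self):
        self._flag.release()


def run(n_threads=4, n_increments=200):
    """Increment a shared counter under the lock and return its final value."""
    lock = BackoffSpinLock()
    counter = [0]

    def worker():
        for _ in range(n_increments):
            lock.acquire()
            counter[0] += 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]


print(run())  # 800
```

Backing off reduces the number of retries that contend for the lock at the same moment, which is the serialization cost the list above aims to minimize under heavy contention.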

(52)

Memory consistency models (omitted)

(53)

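Multithreading: Exploiting Thread-Level Parallelism within a Processor (6.9)

• Multithreaded architecture

• Ways to implement multithreading

1. Fine-grained multithreading:

– Switches threads on every instruction (every clock cycle), interleaving the threads and skipping any that are stalled.

• Advantage: hides the throughput loss caused by stalls

• Disadvantage: slows down the execution of an individual thread

2. Coarse-grained multithreading:

– Switches threads only on costly stalls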

(54)

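Simultaneous multithreading (SMT):

– SMT is a variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP at the same time as ILP:

• A multiple-issue processor has many parallel functional units (resources)

• SMT uses them more efficiently than a single thread can

– Fewer dependences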