Lecture 1
• Course Introduction
• Computer Architecture Overview
• Technology Trends in Computer Architecture
• Quantitative Principles of Computer Design
Course Introduction
• Instructor: 楊佳玲
– yangc@csie.ntu.edu.tw
– Office: CSIE 411
– Office Hours: Tuesday and Thursday 10:30-11:30, or by appointment
• Course Web Page
– Lectures, homeworks, project topics
• TA: to be announced
• Textbook: Computer Architecture: A Quantitative Approach, 2nd Edition, John L. Hennessy and David A. Patterson
Computer Architecture Is
"the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation."
Amdahl, Blaauw, and Brooks, 1964
Course Focus
• Why?
– Understanding the design techniques, machine structures, and technology factors
• Analysis and Evaluation
– Evaluating the performance of alternative design choices using simulators
• Interesting research topics
Grading
• 30% Homeworks
– every two weeks
• 30% Project
– team work
• 30% Exams
– midterm and final (closed book)
• 10% Class participation
Computer Architecture Topics
• Instruction Set Architecture
– Addressing, Protection, Exception Handling
• Pipelining and Instruction Level Parallelism
– Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation
• Memory Hierarchy
– L1 Cache, L2 Cache, DRAM: Coherence, Bandwidth, Latency, Interleaving
– Emerging Technologies, VLSI
• Input/Output and Storage
– Disks, WORM, Tape; RAID; Bus protocols
Computer Architecture Topics
• Multiprocessors
– Shared Memory, Message Passing, Data Parallel
– Processor-Memory-Switch organization
• Networks and Interconnections
– Topologies, Routing, Bandwidth, Latency, Reliability
– Network Interfaces
(Figure: processors (P) and memories (M) connected through an interconnection network)
Computer Architecture Topics
• Portable Systems
– Computers vs. Intelligent Communicators
– High Performance vs. Low Power
– LCD Technologies
– Real-time Input/Output (e.g., Pen Input)
– Portable Storage Devices (PCMCIA)
Course Outline
• Fundamentals of Computer Architecture (ch1)
– Cost and performance measurement
• Instruction Set Design and Basic Pipelining (ch2 & 3)
– DLX architecture
• Exploiting Instruction Level Parallelism (ch4)
– Multiple-issue processors and compiler support
• Memory Hierarchy Design (ch5)
– Cache and virtual memory
• I/O System Design (ch6)
• Multiprocessors (ch8)
– Parallel architecture types and memory consistency models
• Advanced topics: power management, SMT, CMP
Computer Engineering Methodology
An iterative cycle, informed throughout by technology trends:
• Evaluate existing systems for bottlenecks (driven by benchmarks)
• Simulate new designs and organizations (driven by workloads)
• Implement the next generation system (constrained by implementation complexity)
Context for Designing New Architectures
• Application Area
– Special Purpose (e.g., DSP) / General Purpose
– Scientific (FP intensive) / Commercial (Integer)
• Level of Software Compatibility
– Object Code / Binary Compatible (cost trade-off: HW vs. SW; IBM S/360)
– Programming Language; why not?
Context for Designing New Architectures (cont.)
• Operating System Requirements for General Purpose Applications
– Size of Address Space
– Memory Management / Protection
– Context Switch
– Interrupts and Traps
• Standards: Innovation vs. Competition
– IEEE 754 Floating Point
– I/O Bus
– Networks
– Operating Systems / Programming Languages ...
Technology Trends (Summary)

         Capacity         Speed
Logic    2x in 3 years    2x in 3 years
DRAM     4x in 3 years    1.4x in 10 years
Disk     2x in 3 years    1.4x in 10 years

How do technology trends affect architectural design?
• The memory subsystem is the performance bottleneck because of the performance gap between logic and DRAM speed
• The increase in logic capacity allows computer designers to place a larger cache on chip
Processor Performance
(Figure: processor performance, 1987-1995, with two growth rates marked: 1.35x/year and 1.54x/year. Machines plotted include Sun-4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, DEC AXP 3000, IBM Power2/590, DEC 21064a, and Sun UltraSparc.)
Technology Trends: Microprocessor Capacity
(Figure: transistors per chip, 1970-2000, plotting the i4004, i8080, i8086, i80286, i80386, i80486, and Pentium.)
Transistor counts (millions): UltraSparc: 5.2, Pentium Pro: 5.5, PowerPC: 6.9, Alpha 21164: 9.3, Alpha 21264: 15, Pentium III: 28, Pentium 4: 42, Alpha 21364: 100, Alpha 21464: 250
Memory Capacity (Single Chip DRAM)
(Figure: DRAM chip capacity vs. year, 1970-2000.)

year    size      cycle time
1980    64 Kb     250 ns
1983    256 Kb    220 ns
1986    1 Mb      190 ns
1989    4 Mb      165 ns
1992    16 Mb     145 ns
1995    64 Mb
1998    128 Mb
Measurement and Evaluation
Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems
(Figure: creativity generates good, mediocre, and bad ideas; cost/performance analysis filters them through a repeated design, measurement, and analysis loop, keeping the good ideas.)
Measurement Tools
• Benchmarks, Traces, Mixes
• Cost, delay, area, power estimation
• Simulation (many levels)
– ISA, RT, Gate, Circuit
• Queuing Theory
• Rules of Thumb
• Fundamental Laws
The Bottom Line: Performance (and Cost)
• Time to run the task (ExTime)
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns, ... (Performance)
– Throughput, bandwidth

Plane              Speed      DC to Paris   Passengers   Throughput (pmph)
Boeing 747         610 mph    6.5 hours     470          286,700
BAC/Sud Concorde   1350 mph   3 hours       132          178,200
Performance Terminology
"X is n times faster than Y" means:

    ExTime(Y)   Performance(X)
    --------- = -------------- = n
    ExTime(X)   Performance(Y)

"X is n% faster than Y" means:

    ExTime(Y)   Performance(X)        n
    --------- = -------------- = 1 + ---
    ExTime(X)   Performance(Y)       100

    n = 100 × (Performance(X) - Performance(Y)) / Performance(Y)
Example
Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?

    ExTime(Y)   15         Performance(X)
    --------- = -- = 1.5 = --------------
    ExTime(X)   10         Performance(Y)

    n = 100 × (1.5 - 1.0) / 1.0 = 50%
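The "n% faster" definition above can be written as a one-line helper (a minimal sketch in plain Python; the function name is my own):

```python
def percent_faster(extime_x, extime_y):
    """Return n such that X is n% faster than Y (smaller execution time is faster)."""
    return 100.0 * (extime_y / extime_x - 1.0)

print(percent_faster(10.0, 15.0))  # 50.0  (X takes 10 s, Y takes 15 s)
```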
Amdahl's Law
Speedup due to enhancement E:

                 ExTime w/o E    Performance w/ E
    Speedup(E) = ------------- = -----------------
                 ExTime w/ E     Performance w/o E

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:

    ExTime(E) = ?        Speedup(E) = ?
Amdahl's Law

    ExTime_new = ExTime_old × ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

    Speedup_overall = ExTime_old / ExTime_new
                    = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
Amdahl's Law
• Floating point instructions improved to run 2x; but only 10% of actual instructions are FP

    ExTime_new = ExTime_old × (0.9 + 0.1/2) = 0.95 × ExTime_old

    Speedup_overall = 1 / 0.95 = 1.053
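The overall-speedup formula can be checked with a small sketch (plain Python; the function name is my own):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's Law: overall speedup when a fraction F of execution
    is accelerated by a factor S and the remainder is unaffected."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# FP instructions made 2x faster, but only 10% of the work is FP:
print(round(amdahl_speedup(0.10, 2.0), 3))  # 1.053
```

Note how weak the payoff is: even an infinite FP speedup would give at most 1/0.9 ≈ 1.11x overall.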
Corollary: Make the Common Case Fast
• All instructions require an instruction fetch; only a fraction require a data fetch/store.
– Optimize instruction access over data access
• Programs exhibit locality
– Spatial Locality
– Temporal Locality
• Access to small memories is faster
– Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories: Registers, Cache, Memory, Disk / Tape
Metrics of Performance
Each level of the system has its own natural metric:
• Application: answers per month, operations per second
• Programming Language / Compiler
• ISA: (millions of) instructions per second (MIPS); (millions of) FP operations per second (MFLOP/s)
• Datapath / Control: megabytes per second
• Function Units / Transistors, Wires, Pins: cycles per second (clock rate)
Aspects of CPU Performance

    CPU time = Seconds / Program
             = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)

                  Inst Count   CPI   Clock Rate
    Program           X
    Compiler          X
    Inst. Set         X         X
    Organization                X        X
    Technology                           X
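The product above (the "iron law" of CPU performance) is straightforward to evaluate; a minimal sketch with made-up numbers:

```python
def cpu_time(instructions, cpi, clock_rate_hz):
    """CPU time = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)."""
    return instructions * cpi / clock_rate_hz

# e.g., 1M instructions at CPI 1.5 on a 1 GHz clock:
print(cpu_time(1_000_000, 1.5, 1_000_000_000))  # 0.0015 seconds
```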
Marketing Metrics

    MIPS = Instruction Count / (Time × 10^6) = Clock Rate / (CPI × 10^6)

• Machines with different instruction sets?
• Programs with different instruction mixes?
– Dynamic frequency of instructions
• Uncorrelated with performance

    MFLOP/s = FP Operations / (Time × 10^6)

• Machine dependent
• Often not where time is spent
Cycles Per Instruction

    CPU time = CycleTime × Σ(i=1..n) CPI_i × I_i

    CPI = Σ(i=1..n) CPI_i × F_i    where F_i = I_i / Instruction Count
                                   (the frequency of instruction type i)

Invest resources where time is spent!

    CPI = Cycles / Instruction Count
        = (CPU Time × Clock Rate) / Instruction Count

(average cycles per instruction)
Example: Calculating CPI
Base Machine (Reg / Reg), Typical Mix:

Op       Freq   CPI(i)   CPI(i)×Freq   (% Time)
ALU      50%    1        .5            (33%)
Load     20%    2        .4            (27%)
Store    10%    2        .2            (13%)
Branch   20%    2        .4            (27%)
                         CPI = 1.5
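The bottom-line CPI is just the frequency-weighted sum from the CPI formula; a quick sketch using the table's values:

```python
# (frequency, cycles) per operation class, taken from the table above
mix = {
    "ALU":    (0.50, 1),
    "Load":   (0.20, 2),
    "Store":  (0.10, 2),
    "Branch": (0.20, 2),
}

cpi = sum(freq * cycles for freq, cycles in mix.values())
print(round(cpi, 3))  # 1.5
```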
Example
Add register/memory operations:
– One source operand in memory
– One source operand in register
– Cycle count of 2
Branch cycle count increases to 3.
What fraction of the loads must be eliminated for this to pay off?

Example Solution
Exec Time = Instr Cnt × CPI × Clock

              Old                      New
Op        Freq   Cycles  CPI×Freq   Freq    Cycles  CPI×Freq
ALU       .50    1       .5         .5-X    1       .5-X
Load      .20    2       .4         .2-X    2       .4-2X
Store     .10    2       .2         .1      2       .2
Branch    .20    2       .4         .2      3       .6
Reg/Mem                             X       2       2X
Total     1.00           1.5        1-X             1.7-X

New CPI = (1.7-X)/(1-X)

    Instr Cnt_old × CPI_old × Clock_old = Instr Cnt_new × CPI_new × Clock_new
    1.00 × 1.5 = (1-X) × (1.7-X)/(1-X) = 1.7-X
    X = 0.2

ALL loads must be eliminated for this to be a win!
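The break-even condition can be checked numerically (a sketch; `new_cycles` is my own name, and each reg/mem op is assumed to replace one ALU op and one load, as in the table):

```python
def new_cycles(x):
    """Total cycles per original instruction after converting a fraction x
    of the instruction stream into reg/mem ops."""
    # ALU, Load, Store, Branch, Reg/Mem contributions; simplifies to 1.7 - x
    return (0.5 - x) * 1 + (0.2 - x) * 2 + 0.1 * 2 + 0.2 * 3 + x * 2

old_cycles = 1.00 * 1.5      # base machine: instruction count x CPI
x = 1.7 - old_cycles         # solve 1.7 - x = 1.5 (same clock rate)
print(round(x, 3), round(new_cycles(x), 3))  # 0.2 1.5
```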
Programs to Evaluate Processor Performance
• (Toy) Benchmarks
– 10-100 line programs
– e.g., sieve, puzzle, quicksort
• Synthetic Benchmarks
– Attempt to match average frequencies of real workloads
– e.g., Whetstone, Dhrystone
• Kernels
– Time-critical excerpts
– e.g., Livermore loops
• Real programs
– e.g., gcc, spice
Benchmarking Games
• Differing configurations used to run the same workload on two systems
• Compiler wired to optimize the workload
• Test specification written to be biased towards one machine
• Workload arbitrarily picked
• Very small benchmarks used
• Benchmarks manually translated to optimize performance
SPEC: System Performance Evaluation Cooperative
• First Round, 1989
– 10 programs yielding a single number
• Second Round, 1992
– SPECint92 (6 integer programs) and SPECfp92 (14 floating point programs)
– Compiler flags unlimited
• Third Round, 1995
– Single flag setting for all programs; new set of benchmark programs
• SPEC2000: two options: 1) specific flags, 2) whatever you want
Common Benchmarking Mistakes
• Only average behavior represented in test workload
• Caching effects ignored
• Inaccuracies due to sampling ignored
• Ignoring monitoring overhead
• Not validating measurements
• Not ensuring same initial conditions
• Not measuring transient (cold start) performance
How to Summarize Performance
• Arithmetic mean (or weighted arithmetic mean) tracks execution time: Σ(T_i)/n or Σ(W_i × T_i)
• Harmonic mean (or weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: n/Σ(1/R_i) or n/Σ(W_i/R_i)
• Normalized execution time is handy for scaling performance
• But do not take the arithmetic mean of normalized execution times; use the geometric mean
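The three means can be compared on a toy set of execution times (a sketch; the data is made up). It also shows why the harmonic mean of rates "tracks execution time": its reciprocal is exactly the arithmetic mean of the times.

```python
import math

times = [2.0, 4.0, 8.0]             # hypothetical execution times (seconds)
rates = [1.0 / t for t in times]    # corresponding rates (tasks/second)

arith_mean = sum(times) / len(times)                   # mean time
harm_mean = len(rates) / sum(1.0 / r for r in rates)   # mean rate
geo_mean = math.exp(sum(math.log(t) for t in times) / len(times))

print(abs(1.0 / harm_mean - arith_mean) < 1e-12)  # True: both track total time
print(round(geo_mean, 3))                          # 4.0 (use with normalized times)
```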
Performance Evaluation
• Since sales are a function of performance relative to the competition, vendors invest heavily in improving the product as reported by the performance summary
• Good products are created when you have:
– Good benchmarks
– Good ways to summarize performance
• If the benchmarks or summary are inadequate, vendors must choose between improving the product for real programs and improving it to get more sales; sales almost always win!
• Execution time is the measure of computer performance!
• What about cost?