Computer Architecture Is


(1)

Lecture 1

• Course Introduction
• Computer Architecture Overview
• Technology Trends in Computer Architecture
• Quantitative Principles of Computer Design

(2)

Course Introduction

• Instructor: 楊佳玲
• yangc@csie.ntu.edu.tw
• Office: CSIE 411
• Office Hours: Tuesday and Thursday 10:30-11:30 or by appointment
• Course Web Page
– Lectures, homeworks, project topics
• TA: to be announced
• Textbook: Computer Architecture: A Quantitative Approach, 2nd Edition, John L. Hennessy and David A. Patterson

(3)

Computer Architecture Is

"... the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation."

Amdahl, Blaauw, and Brooks, 1964


(4)

Course Focus

• Why?
– Understanding the design techniques, machine structures, and technology factors
• Analysis and Evaluation
– Evaluating the performance of alternative choices in system design using simulators
• Interesting research topics

(5)

Grading

• 30% Homeworks
– every two weeks
• 30% Project
– team work
• 30% Exams
– midterm and final (closed book)
• 10% Class participation

(6)

Computer Architecture Topics

[Diagram: course topics by system layer]
• Instruction Set Architecture
• Pipelining and Instruction Level Parallelism: Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation
• Addressing, Protection, Exception Handling
• Memory Hierarchy: L1 Cache, L2 Cache, DRAM
• Coherence, Bandwidth, Latency
• Emerging Technologies, Interleaving, Bus protocols
• Input/Output and Storage: Disks, WORM, Tape, RAID
• VLSI

(7)

Computer Architecture Topics

[Diagram: processors (P) and memories (M) connected through a switch (S) and an interconnection network]

Processor-Memory-Switch
• Multiprocessors: Shared Memory, Message Passing, Data Parallel
• Networks and Interconnections: Topologies, Routing, Bandwidth, Latency, Reliability, Network Interfaces

(8)

Computer Architecture Topics

• Portable Systems
– Computers vs. Intelligent Communicators
– High Performance vs. Low Power
– LCD Technologies
– Real-time Input/Output (e.g., Pen Input)
– Portable Storage Devices (PCMCIA)

(9)

Course Outline

• Fundamentals of Computer Architecture (ch1)
– Cost and performance measurement
• Instruction Set Design and Basic Pipelining (ch2 & 3)
– DLX architecture
• Exploiting Instruction Level Parallelism (ch4)
– Multiple-issue processors and compiler support
• Memory Hierarchy Design (ch5)
– Cache and virtual memory
• I/O System Design (ch6)
• Multiprocessors (ch8)
– Parallel architecture types and memory consistency models
• Advanced topics: power management, SMT, CMP

(10)


Computer Engineering Methodology

[Diagram: a design cycle with three steps]
• Evaluate Existing Systems for Bottlenecks
• Simulate New Designs and Organizations
• Implement Next Generation System

Inputs to the cycle: Technology Trends, Benchmarks, Workloads, Implementation Complexity

(14)

Context for Designing New Architectures

• Application Area
– Special Purpose (e.g., DSP) / General Purpose
– Scientific (FP intensive) / Commercial (Integer)
• Level of Software Compatibility
– Object Code/Binary Compatible (cost HW vs. SW; IBM S/360)
– Programming Language; Why not?

(15)

Context for Designing New Architectures

• Operating System Requirements for General Purpose Applications
– Size of Address Space
– Memory Management/Protection
– Context Switch
– Interrupts and Traps
• Standards: Innovation vs. Competition
– IEEE 754 Floating Point
– I/O Bus
– Networks
– Operating Systems / Programming Languages ...

(16)

Technology Trends (Summary)

          Capacity          Speed
Logic     2x in 3 years     2x in 3 years
DRAM      4x in 3 years     1.4x in 10 years
Disk      2x in 3 years     1.4x in 10 years

How do technology trends affect architectural design?

• The memory subsystem is the performance bottleneck because of the performance gap between logic speed and DRAM speed
• The increase in logic capacity allows computer designers to place a larger cache on chip
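To make the logic/DRAM speed gap concrete, here is a minimal sketch that compounds the two speed rates from the table above. The function names `growth` and `speed_gap` are illustrative and not from the lecture. The model assumes smooth exponential growth, which is a simplification.

```python
def growth(factor: float, period_years: float, years: float) -> float:
    """Compound growth: `factor` per `period_years`, applied over `years`."""
    return factor ** (years / period_years)


# Speed rates taken from the summary table.
LOGIC_SPEED = (2.0, 3.0)    # 2x every 3 years
DRAM_SPEED = (1.4, 10.0)    # 1.4x every 10 years


def speed_gap(years: float) -> float:
    """How much faster logic has become relative to DRAM after `years`."""
    return growth(*LOGIC_SPEED, years) / growth(*DRAM_SPEED, years)


if __name__ == "__main__":
    for y in (3, 6, 10):
        print(f"after {y:2d} years the logic/DRAM speed ratio has grown {speed_gap(y):.2f}x")
```

Because the gap itself grows exponentially, each processor generation depends more heavily on caches than the one before.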

(17)

Processor Performance

[Figure: processor performance (0 to 300) vs. year (1987 to 1995). Machines plotted: Sun-4/260, MIPS M/120, MIPS M2000, IBM RS 6000/540, HP 9000/750, DEC AXP 3000, IBM Power 2/590, DEC 21064a, Sun UltraSparc. Trend lines: 1.35X/yr and 1.54X/yr.]

(18)

Technology Trends: Microprocessor Capacity

[Figure: transistors per chip (log scale, 1,000 to 100,000,000) vs. year (1970 to 2000). Data points: i4004, i8080, i8086, i80286, i80386, i80486, Pentium.]

Transistor counts (millions):
SparcUltra: 5.2
Pentium Pro: 5.5
Power PC: 6.9
Alpha 21164: 9.3
Alpha 21264: 15
Pentium III: 28
Pentium 4: 42
Alpha 21364: 100
Alpha 21464: 250

(19)

Memory Capacity

(Single Chip DRAM)

[Figure: DRAM chip size in bits (log scale, 1,000 to 1,000,000,000) vs. year (1970 to 2000)]

year    size      cyc time
1980    64 Kb     250 ns
1983    256 Kb    220 ns
1986    1 Mb      190 ns
1989    4 Mb      165 ns
1992    16 Mb     145 ns
1995    64 Mb
1998    128 Mb

(20)

Measurement and Evaluation

Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems

[Diagram: a Design / Analysis loop. Creativity produces ideas; Cost / Performance Analysis sorts them into Good Ideas, Mediocre Ideas, and Bad Ideas.]

(21)

Measurement Tools

• Benchmarks, Traces, Mixes
• Cost, delay, area, power estimation
• Simulation (many levels)
– ISA, RT, Gate, Circuit
• Queuing Theory
• Rules of Thumb
• Fundamental Laws

(22)

The Bottom Line: Performance (and Cost)

Plane               DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747          6.5 hours     610 mph    470          286,700
BAC/Sud Concorde    3 hours       1350 mph   132          178,200

• Time to run the task (ExTime)
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns ... (Performance)
– Throughput, bandwidth
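The aircraft comparison can be checked with a few lines of code. This is a small sketch using the numbers from the slide. The `Plane` class and its field names are illustrative and not from the lecture. Throughput is taken as passengers times speed, in passenger-miles per hour (pmph).

```python
from dataclasses import dataclass


@dataclass
class Plane:
    name: str
    speed_mph: int
    trip_hours: float   # DC to Paris
    passengers: int

    def throughput_pmph(self) -> int:
        """Passenger-miles per hour: how much 'work' is done per unit time."""
        return self.passengers * self.speed_mph


planes = [
    Plane("Boeing 747", 610, 6.5, 470),
    Plane("Concorde", 1350, 3.0, 132),
]

# Latency view: which plane finishes one trip soonest?
fastest = min(planes, key=lambda p: p.trip_hours)
# Throughput view: which plane moves the most passenger-miles per hour?
highest = max(planes, key=lambda p: p.throughput_pmph())

if __name__ == "__main__":
    print("lowest latency:", fastest.name)
    print("highest throughput:", highest.name)
```

The two views pick different winners, which is why the slide separates execution time from throughput.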

(23)

Performance Terminology

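"X is n times faster than Y" means:

$$\frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = n$$

"X is n% faster than Y" means:

$$\frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = 1 + \frac{n}{100}$$

$$n = \frac{100\,\bigl(\text{Performance}(X) - \text{Performance}(Y)\bigr)}{\text{Performance}(Y)}$$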

(24)

Example

Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?

$$\frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{15}{10} = \frac{1.5}{1.0} = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$$

$$n = \frac{100\,(1.5 - 1.0)}{1.0} = 50\%$$
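The same calculation as a short sketch. The helper names `times_faster` and `percent_faster` are illustrative. Both take execution times, so the slower machine Y goes in the numerator.

```python
def times_faster(extime_y: float, extime_x: float) -> float:
    """'X is n times faster than Y': n = ExTime(Y) / ExTime(X)."""
    return extime_y / extime_x


def percent_faster(extime_y: float, extime_x: float) -> float:
    """'X is n% faster than Y': ExTime(Y) / ExTime(X) = 1 + n/100."""
    return 100.0 * (extime_y / extime_x - 1.0)


if __name__ == "__main__":
    # Y takes 15 seconds, X takes 10 seconds.
    print(times_faster(15.0, 10.0))
    print(percent_faster(15.0, 10.0))
```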

(25)

Amdahl's Law

Speedup due to enhancement E:

$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:

ExTime(E) = ?
Speedup(E) = ?

(26)

Amdahl’s Law

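$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$$

$$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$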

(27)


Amdahl’s Law

• Floating point instructions improved to run 2X, but only 10% of actual instructions are FP

$$\text{ExTime}_{new} = \text{ExTime}_{old} \times (0.9 + 0.1/2) = 0.95 \times \text{ExTime}_{old}$$

$$\text{Speedup}_{overall} = \frac{1}{0.95} = 1.053$$
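Amdahl's Law is easy to experiment with in code. The sketch below uses an illustrative function name, `amdahl_speedup`. It reproduces the floating-point example and then shows how little is gained by making the enhancement faster when the enhanced fraction is small.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when `fraction_enhanced` of the task runs
    `speedup_enhanced` times faster and the rest is unchanged."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


if __name__ == "__main__":
    # FP instructions run 2x faster, but are only 10% of the instructions.
    print(f"{amdahl_speedup(0.10, 2.0):.3f}")

    # Even a huge FP speedup is bounded by 1 / (1 - 0.10).
    for s in (2.0, 10.0, 1000.0):
        print(f"S = {s:6.0f}: overall speedup {amdahl_speedup(0.10, s):.3f}")
```

The bound 1 / (1 - Fraction_enhanced) is what motivates the corollary that follows: effort is best spent on the common case.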

(29)

Corollary: Make The Common Case Fast

• All instructions require an instruction fetch; only a fraction require a data fetch/store
– Optimize instruction access over data access
• Programs exhibit locality
– Spatial Locality
– Temporal Locality
• Access to small memories is faster
– Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories

Registers -> Cache -> Memory -> Disk / Tape

(30)

Metrics of Performance

[Diagram: levels of a computer system and the performance metric used at each]
• Application: answers per month, operations per second
• Programming Language
• Compiler
• ISA: (millions of) instructions per second (MIPS); (millions of) FP operations per second (MFLOP/s)
• Datapath, Control: megabytes per second
• Function Units
• Transistors, Wires, Pins: cycles per second (clock rate)

(31)


Aspects of CPU Performance

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

                Inst Count    CPI    Clock Rate
Program             X
Compiler            X
Inst. Set           X          X
Organization                   X         X
Technology                               X

(33)

Marketing Metrics

$$\text{MIPS} = \frac{\text{Instruction Count}}{\text{Time} \times 10^6} = \frac{\text{Clock Rate}}{\text{CPI} \times 10^6}$$

• Machines with different instruction sets?
• Programs with different instruction mixes?
– Dynamic frequency of instructions
• Uncorrelated with performance

$$\text{MFLOP/s} = \frac{\text{FP Operations}}{\text{Time} \times 10^6}$$

• Machine dependent
• Often not where time is spent
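A small sketch of why MIPS can mislead. The two machines and all their numbers are hypothetical, chosen only for illustration: they run the same program at the same clock rate, but machine B's instruction set needs fewer, more complex instructions.

```python
def cpu_time(instr_count: float, cpi: float, clock_hz: float) -> float:
    """CPU time = Instructions x Cycles/Instruction x Seconds/Cycle."""
    return instr_count * cpi / clock_hz


def mips(cpi: float, clock_hz: float) -> float:
    """MIPS = Clock Rate / (CPI x 10^6)."""
    return clock_hz / (cpi * 1e6)


CLOCK_HZ = 100e6  # hypothetical 100 MHz clock for both machines

# Hypothetical machine A: many simple instructions.
time_a = cpu_time(instr_count=10e6, cpi=1.0, clock_hz=CLOCK_HZ)
mips_a = mips(cpi=1.0, clock_hz=CLOCK_HZ)

# Hypothetical machine B: fewer instructions, each taking more cycles.
time_b = cpu_time(instr_count=4e6, cpi=2.0, clock_hz=CLOCK_HZ)
mips_b = mips(cpi=2.0, clock_hz=CLOCK_HZ)

if __name__ == "__main__":
    print(f"A: {mips_a:.0f} MIPS, {time_a * 1e3:.0f} ms")
    print(f"B: {mips_b:.0f} MIPS, {time_b * 1e3:.0f} ms")
```

Machine B has the lower MIPS rating yet finishes the program sooner, so MIPS alone cannot rank machines with different instruction sets.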

(34)

Cycles Per Instruction

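Average cycles per instruction:

$$\text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Cycles}}{\text{Instruction Count}}$$

$$\text{CPU time} = \text{CycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$$

Instruction frequency:

$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \quad \text{where} \quad F_i = \frac{I_i}{\text{Instruction Count}}$$

Invest resources where time is spent!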

(35)

Example: Calculating CPI

Base Machine (Reg / Reg), Typical Mix

Op        Freq    CPI(i)    CPI(i)*Freq    (% Time)
ALU       50%     1         .5             (33%)
Load      20%     2         .4             (27%)
Store     10%     2         .2             (13%)
Branch    20%     2         .4             (27%)
Total                       1.5
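The table can be reproduced with a weighted sum. This is a minimal sketch. The names `MIX`, `average_cpi`, and `time_share` are illustrative, while the frequencies and cycle counts come from the table.

```python
# op -> (frequency, cycles per instruction), from the typical mix above
MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}


def average_cpi(mix: dict) -> float:
    """CPI = sum over instruction classes of CPI_i * F_i."""
    return sum(freq * cycles for freq, cycles in mix.values())


def time_share(mix: dict) -> dict:
    """Fraction of execution time spent in each instruction class."""
    total = average_cpi(mix)
    return {op: freq * cycles / total for op, (freq, cycles) in mix.items()}


if __name__ == "__main__":
    print(f"CPI = {average_cpi(MIX):.2f}")
    for op, share in time_share(MIX).items():
        print(f"{op:6s} {share:.0%} of time")
```

Loads and branches together account for more than half of the time despite being only 40% of the instructions, which is where the "% Time" column comes from.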

(36)

Example

Base Machine (Reg / Reg), Typical Mix

Op        Freq    Cycles
ALU       50%     1
Load      20%     2
Store     10%     2
Branch    20%     2

Add register / memory operations:
– One source operand in memory
– One source operand in register
– Cycle count of 2

As a side effect, the branch cycle count increases to 3.

What fraction of the loads must be eliminated for this to pay off?

(37)

Example Solution

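Exec Time = Instr Cnt x CPI x Clock

Let X be the fraction of the original instructions that are loads folded into a register/memory operation. Each one removes one load and one ALU instruction and adds one Reg/Mem instruction.

            Old machine                   New machine
Op          Freq    Cycles    Freq x Cyc    Freq      Cycles    Freq x Cyc
ALU         .50     1         .5            .5 - X    1         .5 - X
Load        .20     2         .4            .2 - X    2         .4 - 2X
Store       .10     2         .2            .1        2         .2
Branch      .20     2         .4            .2        3         .6
Reg/Mem                                     X         2         2X
Total       1.00              1.5           1 - X               1.7 - X

CPI Old = 1.5, CPI New = (1.7 - X)/(1 - X)

Instr Cnt Old x CPI Old x Clock Old = Instr Cnt New x CPI New x Clock New
1.00 x 1.5 = (1 - X) x (1.7 - X)/(1 - X)
1.5 = 1.7 - X
0.2 = X

ALL loads must be eliminated for this to be a win!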
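The break-even point can be checked numerically. The sketch below assumes, as the equation above does, that the clock is unchanged between the two machines. The names `new_cycles`, `OLD_CYCLES`, and `x_break_even` are illustrative. Exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction as F


def new_cycles(x: F) -> F:
    """Cycles on the new machine per ORIGINAL instruction, when a fraction x
    of the original instructions are loads folded into reg/mem operations."""
    alu = (F(1, 2) - x) * 1      # each folded load also absorbs one ALU op
    load = (F(1, 5) - x) * 2
    store = F(1, 10) * 2
    branch = F(1, 5) * 3         # branches now take 3 cycles
    reg_mem = x * 2              # the new combined operations
    return alu + load + store + branch + reg_mem


OLD_CYCLES = F(3, 2)             # 1.00 instructions x CPI of 1.5

# new_cycles(x) = 1.7 - x is linear with slope -1, so the break-even x is
# simply the gap between the x = 0 cost and the old cost.
x_break_even = new_cycles(F(0)) - OLD_CYCLES

if __name__ == "__main__":
    print("break-even X =", x_break_even)
    print("loads in the mix =", F(1, 5))
```

Since loads are only 20% of the mix, the break-even value equals the entire load frequency, matching the slide's conclusion.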

(38)

Programs to Evaluate Processor Performance

• (Toy) Benchmarks
– 10-100 line programs
– e.g., sieve, puzzle, quicksort
• Synthetic Benchmarks
– Attempt to match average frequencies of real workloads
– e.g., Whetstone, Dhrystone
• Kernels
– Time-critical excerpts
– e.g., Livermore loops
• Real programs
– e.g., gcc, spice

(39)

Benchmarking Games

• Differing configurations used to run the same workload on two systems
• Compiler wired to optimize the workload
• Test specification written to be biased towards one machine
• Workload arbitrarily picked
• Very small benchmarks used
• Benchmarks manually translated to optimize performance

(40)

SPEC: System Performance Evaluation Cooperative

• First Round 1989
– 10 programs yielding a single number
• Second Round 1992
– SpecInt92 (6 integer programs) and SpecFP92 (14 floating point programs)
– Compiler flags unlimited
• Third Round 1995
– Single flag setting for all programs; new set of benchmark programs
• Spec2000: two options
– 1) specific flags
– 2) whatever you want

(41)

Common Benchmarking Mistakes

• Only average behavior represented in test workload
• Caching effects ignored
• Inaccuracies due to sampling ignored
• Ignoring monitoring overhead
• Not validating measurements
• Not ensuring same initial conditions
• Not measuring transient (cold start) performance

(42)

How to Summarize Performance

• Arithmetic mean (or weighted arithmetic mean) tracks execution time: SUM(Ti)/n or SUM(Wi*Ti)
• Harmonic mean (or weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: n/SUM(1/Ri) or 1/SUM(Wi/Ri), with the weights Wi summing to 1
• Normalized execution time is handy for scaling performance
• But do not take the arithmetic mean of normalized execution times; use the geometric mean
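The warning about averaging normalized execution times is easiest to see with numbers. In this sketch the execution times for machines A and B are hypothetical, chosen only to make the effect obvious, and the function names are illustrative.

```python
import math


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def harmonic_mean(rates):
    """n / SUM(1/Ri): the right way to average rates such as MFLOPS."""
    return len(rates) / sum(1.0 / r for r in rates)


def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


# Hypothetical execution times (seconds) of two programs on two machines.
times_a = [1.0, 1000.0]
times_b = [10.0, 100.0]

b_relative_to_a = [b / a for a, b in zip(times_a, times_b)]  # normalized to A
a_relative_to_b = [a / b for a, b in zip(times_a, times_b)]  # normalized to B

if __name__ == "__main__":
    # Arithmetic mean: each machine looks slower than the other,
    # depending only on which one is used as the reference.
    print(arithmetic_mean(b_relative_to_a), arithmetic_mean(a_relative_to_b))
    # Geometric mean: the two views are reciprocals, so they agree.
    print(geometric_mean(b_relative_to_a), geometric_mean(a_relative_to_b))
```

The geometric mean is independent of the reference machine, but unlike the arithmetic mean of raw times it does not track total execution time.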

(43)

Performance Evaluation

• Since sales are a function of performance relative to the competition, companies invest heavily in improving the product as reported by the performance summary
• Good products are created when we have:
– Good benchmarks
– Good ways to summarize performance
• If the benchmarks/summary are inadequate, one must choose between improving the product for real programs vs. improving the product to get more sales; sales almost always wins!
• Execution time is the measure of computer performance!
• What about cost?
