Solution 1.1

1.1.1 Computer used to run large problems and usually accessed via a network: 5 supercomputers

1.1.2 10¹⁵ or 2⁵⁰ bytes: 7 petabyte

1.1.3 Computer composed of hundreds to thousands of processors and terabytes of memory: 3 servers

1.1.4 Today’s science fiction application that probably will be available in near future: 1 virtual worlds

1.1.5 A kind of memory called random access memory: 12 RAM

1.1.6 Part of a computer called central processor unit: 13 CPU

1.1.7 Thousands of processors forming a large cluster: 8 datacenters

1.1.8 A microprocessor containing several processors in the same chip: 10 multi-core processors

1.1.9 Desktop computer without screen or keyboard usually accessed via a network: 4 low-end servers

1.1.10 Currently the largest class of computer that runs one application or one set of related applications: 9 embedded computers

1.1.11 Special language used to describe hardware components: 11 VHDL

1.1.12 Personal computer delivering good performance to single users at low cost: 2 desktop computers

1.1.13 Program that translates statements in high-level language to assembly language: 15 compiler

1.1.14 Program that translates symbolic instructions to binary instructions: 21 assembler

1.1.15 High-level language for business data processing: 25 cobol

1.1.16 Binary language that the processor can understand: 19 machine language

1.1.17 Commands that the processors understand: 17 instruction

1.1.18 High-level language for scientific computation: 26 fortran

1.1.19 Symbolic representation of machine instructions: 18 assembly language

1.1.20 Interface between user’s program and hardware providing a variety of services and supervision functions: 14 operating system

1.1.21 Software/programs developed by the users: 24 application software

1.1.22 Binary digit (value 0 or 1): 16 bit

1.1.23 Software layer between the application software and the hardware that includes the operating system and the compilers: 23 system software

1.1.24 High-level language used to write application and system software: 20 C

1.1.25 Portable language composed of words and algebraic expressions that must be translated into assembly language before run in a computer: 22 high-level language

1.1.26 10¹² or 2⁴⁰ bytes: 6 terabyte

Solution 1.2

1.2.1 8 bits × 3 colors = 24 bits/pixel = 4 bytes/pixel. 1280 × 800 pixels = 1,024,000 pixels. 1,024,000 pixels × 4 bytes/pixel = 4,096,000 bytes (approx 4 Mbytes).

1.2.2 2 GB = 2000 Mbytes. No. frames = 2000 Mbytes/4 Mbytes = 500 frames.

1.2.3 Network speed: 1 gigabit network ==> 1 gigabit per second = 125 Mbytes/second. File size: 256 Kbytes = 0.256 Mbytes. Time for 0.256 Mbytes = 0.256/125 = 2.048 ms.

1.2.4 2 microseconds from cache ==> 20 microseconds from DRAM. 20 microseconds from DRAM ==> 2 seconds from magnetic disk. 20 microseconds from DRAM ==> 2 ms from flash memory.

Solution 1.3

1.3.1 P2 has the highest performance

performance of P1 (instructions/sec) = 2 × 10⁹/1.5 = 1.33 × 10⁹
performance of P2 (instructions/sec) = 1.5 × 10⁹/1.0 = 1.5 × 10⁹
performance of P3 (instructions/sec) = 3 × 10⁹/2.5 = 1.2 × 10⁹
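The comparison in 1.3.1 can be reproduced with a short script. This is a minimal sketch, not part of the original solution; the dictionary layout and the function name `instructions_per_second` are illustrative choices, and the clock rates and CPIs are the ones that appear in the arithmetic above.

```python
# Clock rate (Hz) and CPI for each processor, as used in solution 1.3.1.
processors = {
    "P1": (2.0e9, 1.5),
    "P2": (1.5e9, 1.0),
    "P3": (3.0e9, 2.5),
}


def instructions_per_second(clock_rate_hz, cpi):
    """Instruction throughput = clock rate / cycles per instruction."""
    return clock_rate_hz / cpi


throughput = {
    name: instructions_per_second(rate, cpi)
    for name, (rate, cpi) in processors.items()
}
fastest = max(throughput, key=throughput.get)

for name, value in throughput.items():
    print(f"{name}: {value:.3e} instructions/s")
print("highest performance:", fastest)
```

Note that the processor with the highest clock rate (P3) is not the fastest, because its CPI is also the highest.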

1.3.2 No. cycles = time × clock rate

cycles(P1) = 10 × 2 × 10⁹ = 20 × 10⁹
cycles(P2) = 10 × 1.5 × 10⁹ = 15 × 10⁹
cycles(P3) = 10 × 3 × 10⁹ = 30 × 10⁹

time = (No. instr. × CPI)/clock rate, then No. instructions = No. cycles/CPI

instructions(P1) = 20 × 10⁹/1.5 = 13.33 × 10⁹
instructions(P2) = 15 × 10⁹/1 = 15 × 10⁹
instructions(P3) = 30 × 10⁹/2.5 = 12 × 10⁹

1.3.3 time_new = time_old × 0.7 = 7 s

CPI_new = CPI_old × 1.2, then CPI(P1) = 1.8, CPI(P2) = 1.2, CPI(P3) = 3

ƒ = No. instr. × CPI/time, then

ƒ(P1) = 13.33 × 10⁹ × 1.8/7 = 3.42 GHz
ƒ(P2) = 15 × 10⁹ × 1.2/7 = 2.57 GHz
ƒ(P3) = 12 × 10⁹ × 3/7 = 5.14 GHz

1.3.4 IPC = 1/CPI = No. instr./(time × clock rate)

IPC(P1) = 1.42
IPC(P2) = 2
IPC(P3) = 3.33

1.3.5 Time_new/Time_old = 7/10 = 0.7. So ƒ_new = ƒ_old/0.7 = 1.5 GHz/0.7 = 2.14 GHz.

1.3.6 Time_new/Time_old = 9/10 = 0.9.

So Instructions_new = Instructions_old × 0.9 = 30 × 10⁹ × 0.9 = 27 × 10⁹.

Solution 1.4

1.4.1 P2

Class A: 10⁵ instr.
Class B: 2 × 10⁵ instr.
Class C: 5 × 10⁵ instr.
Class D: 2 × 10⁵ instr.

Time = No. instr. × CPI/clock rate

P1:
Time class A = 0.66 × 10⁻⁴
Time class B = 2.66 × 10⁻⁴
Time class C = 10 × 10⁻⁴
Time class D = 5.33 × 10⁻⁴
Total time P1 = 18.65 × 10⁻⁴

P2:
Time class A = 10⁻⁴
Time class B = 2 × 10⁻⁴
Time class C = 5 × 10⁻⁴
Time class D = 3 × 10⁻⁴
Total time P2 = 11 × 10⁻⁴

1.4.2 CPI = time × clock rate/No. instr.

CPI(P1) = 18.65 × 10⁻⁴ × 1.5 × 10⁹/10⁶ = 2.79
CPI(P2) = 11 × 10⁻⁴ × 2 × 10⁹/10⁶ = 2.2

1.4.3

clock cycles(P1) = 10⁵ × 1 + 2 × 10⁵ × 2 + 5 × 10⁵ × 3 + 2 × 10⁵ × 4 = 28 × 10⁵
clock cycles(P2) = 10⁵ × 2 + 2 × 10⁵ × 2 + 5 × 10⁵ × 2 + 2 × 10⁵ × 3 = 22 × 10⁵
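The cycle counts, execution times, and average CPIs of 1.4.1 to 1.4.3 all follow from the same weighted sum. The sketch below recomputes them from the per-class instruction counts and CPIs shown above; the variable and function names are illustrative assumptions. Because the per-class times in 1.4.1 are rounded before being added, the directly computed values can differ from the printed ones in the last digit (for example a CPI of 2.8 for P1 rather than 2.79).

```python
# Instruction counts for classes A..D and per-class CPIs, from 1.4.1 and 1.4.3.
counts = [100_000, 200_000, 500_000, 200_000]
cpi_p1 = [1, 2, 3, 4]
cpi_p2 = [2, 2, 2, 3]
clock_p1 = 1.5e9  # Hz, as used in 1.4.2
clock_p2 = 2.0e9  # Hz, as used in 1.4.2


def total_cycles(instr_counts, cpis):
    """Total clock cycles = sum over classes of (count x CPI)."""
    return sum(n * c for n, c in zip(instr_counts, cpis))


cycles_p1 = total_cycles(counts, cpi_p1)
cycles_p2 = total_cycles(counts, cpi_p2)

time_p1 = cycles_p1 / clock_p1  # seconds
time_p2 = cycles_p2 / clock_p2  # seconds

avg_cpi_p1 = cycles_p1 / sum(counts)
avg_cpi_p2 = cycles_p2 / sum(counts)

print(cycles_p1, cycles_p2)
print(time_p1, time_p2)
print(avg_cpi_p1, avg_cpi_p2)
```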

1.4.4

(500 × 1 + 50 × 5 + 100 × 5 + 50 × 2) × 0.5 × 10⁻⁹ = 675 ns

1.4.5 CPI = time × clock rate/No. instr.

CPI = 675 × 10⁻⁹ × 2 × 10⁹/700 = 1.92

1.4.6

Time = (500 × 1 + 50 × 5 + 50 × 5 + 50 × 2) × 0.5 × 10⁻⁹ = 550 ns

Speed-up = 675 ns/550 ns = 1.22

CPI = 550 × 10⁻⁹ × 2 × 10⁹/700 = 1.57

Solution 1.5

1.5.1

a. 1G, 0.75G inst/s b. 1G, 1.5G inst/s

1.5.2

a. P2 is 1.33 times faster than P1 b. P1 is 1.03 times faster than P2

1.5.3

a. P2 is 1.31 times faster than P1 b. P1 is 1.00 times faster than P2

1.5.4

a. 2.05 µs b. 1.93 µs

1.5.5

a. 0.71 µs b. 0.86 µs

1.5.6

a. 1.30 times faster b. 1.40 times faster

Solution 1.6

1.6.1

      Compiler A CPI    Compiler B CPI
a.    1.00              1.17
b.    0.80              0.58


1.6.2

a. 0.86 b. 1.37

1.6.3

      Compiler A speed-up    Compiler B speed-up
a.    1.52                   1.77
b.    1.21                   0.88

1.6.4

      P1 peak       P2 peak
a.    4G Inst/s     3G Inst/s
b.    4G Inst/s     3G Inst/s

1.6.5 Speed-up, P1 versus P2:

a. 0.967105263 b. 0.730263158

1.6.6

a. 6.204081633 b. 8.216216216

Solution 1.7

1.7.1 Geometric mean clock rate ratio = (1.28 × 1.56 × 2.64 × 3.03 × 10.00 × 1.80 × 0.74)^(1/7) = 2.15

Geometric mean power ratio = (1.24 × 1.20 × 2.06 × 2.88 × 2.59 × 1.37 × 0.92)^(1/7) = 1.62
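The two geometric means in 1.7.1 can be checked numerically. This is a small sketch using only the ratios listed above; the helper name `geometric_mean` is an illustrative choice, and the log-sum form is one common way to compute an n-th root of a product.

```python
import math


def geometric_mean(ratios):
    """n-th root of the product, computed via logarithms for stability."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Generation-to-generation ratios as listed in solution 1.7.1.
clock_ratios = [1.28, 1.56, 2.64, 3.03, 10.00, 1.80, 0.74]
power_ratios = [1.24, 1.20, 2.06, 2.88, 2.59, 1.37, 0.92]

print(round(geometric_mean(clock_ratios), 2))
print(round(geometric_mean(power_ratios), 2))
```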

1.7.2 Largest clock rate ratio = 2000 MHz/200 MHz = 10 (Pentium Pro to Pentium 4 Willamette)

Largest power ratio = 29.1 W/10.1 W = 2.88 (Pentium to Pentium Pro)


1.7.3 Clock rate: 2.667 × 10⁹/(12.5 × 10⁶) = 212.8

Power: 95 W/3.3 W = 28.78

1.7.4 C = P/(V² × clock rate)

80286: C = 0.0105 × 10⁻⁶
80386: C = 0.01025 × 10⁻⁶
80486: C = 0.00784 × 10⁻⁶
Pentium: C = 0.00612 × 10⁻⁶
Pentium Pro: C = 0.0133 × 10⁻⁶
Pentium 4 Willamette: C = 0.0122 × 10⁻⁶
Pentium 4 Prescott: C = 0.00183 × 10⁻⁶
Core 2: C = 0.0294 × 10⁻⁶

1.7.5 3.3/1.75 = 1.89 (Pentium Pro to Pentium 4 Willamette)

1.7.6

Pentium to Pentium Pro: 3.3/5 = 0.66
Pentium Pro to Pentium 4 Willamette: 1.75/3.3 = 0.53
Pentium 4 Willamette to Pentium 4 Prescott: 1.25/1.75 = 0.71
Pentium 4 Prescott to Core 2: 1.1/1.25 = 0.88

Geometric mean = 0.68

Solution 1.8

1.8.1 Power₁ = V² × clock rate × C. Power₂ = 0.9 Power₁

C₂/C₁ = (0.9 × 5² × 0.5 × 10⁹)/(3.3² × 1 × 10⁹) = 1.03

1.8.2 Power₂/Power₁ = (V₂² × clock rate₂)/(V₁² × clock rate₁)

Power₂/Power₁ = 0.87 => Reduction of 13%

1.8.3

Power₂ = V₂² × 1 × 10⁹ × 0.8 × C₁ = 0.6 × Power₁

Power₁ = 5² × 0.5 × 10⁹ × C₁

V₂² × 1 × 10⁹ × 0.8 × C₁ = 0.6 × 5² × 0.5 × 10⁹ × C₁

V₂ = ((0.6 × 5² × 0.5 × 10⁹)/(1 × 10⁹ × 0.8))^(1/2) = 3.06 V
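The voltage in 1.8.3 comes from solving the dynamic power relation Power = C × V² × clock rate for V₂. The sketch below does that with the values shown above (5 V and 0.5 GHz for the old part, 1 GHz, 0.8 × C₁ and a 0.6 × power target for the new part). The function name and parameter names are illustrative assumptions, not notation from the exercise.

```python
import math


def required_voltage(v_old, f_old, f_new, cap_scale, power_scale):
    """Solve power_scale * C * v_old^2 * f_old = cap_scale * C * v_new^2 * f_new
    for v_new. The capacitance C cancels on both sides."""
    return math.sqrt(power_scale * v_old ** 2 * f_old / (cap_scale * f_new))


v2 = required_voltage(
    v_old=5.0, f_old=0.5e9, f_new=1.0e9, cap_scale=0.8, power_scale=0.6
)
print(round(v2, 2))
```

Substituting the result back into C × V² × clock rate gives 0.6 times the original power, which is a quick way to confirm the algebra.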


1.8.4 Power_new = 1 × C_old × V_old²/(2^(1/4))² × clock rate_old × 2^(1/2) = Power_old. Thus, power scales by 1.

1.8.5 1/2^(−1/2) = 2^(1/2)

1.8.6 Voltage = 1.1 × 1/2^(1/4) = 0.92 V. Clock rate = 2.667 × 2^(1/2) = 3.771 GHz

Solution 1.9

1.9.1

a. 1/49 × 100 = 2%

b. 45/120 × 100 = 37.5%

1.9.2

a. I_leak = 1/3.3 = 0.3 b. I_leak = 45/1.1 = 40.9

1.9.3

a. Power_st/Power_dyn = 1/49 = 0.02 b. Power_st/Power_dyn = 45/75 = 0.6

1.9.4 Power_st/Power_dyn = 0.6 => Power_st = 0.6 × Power_dyn

a. Power_st = 0.6 × 40 W = 24 W b. Power_st = 0.6 × 30 W = 18 W

1.9.5

a. I_lk = 24/0.8 = 30 A b. I_lk = 18/0.8 = 22.5 A


1.9.6

      Power_st at 1.0 V    I_lk at 1.0 V    Power_st at 1.2 V    I_lk at 1.2 V    Larger
a.    119 W                119 A            136 W                113.3 A          I_lk at 1.0 V
b.    93.5 W               93.5 A           110.5 W              92.1 A           I_lk at 1.0 V

Solution 1.10

1.10.1

a. Processors Instructions per processor Total instructions

1 4096 4096

2 2048 4096

4 1024 4096

8 512 4096

b. Processors Instructions per processor Total instructions

1 4096 4096

2 2278 4556

4 1464 5856

8 1132 9056

1.10.2

a. Processors Execution time (µs)

1 4.096

2 2.048

4 1.024

8 0.512

b. Processors Execution time (µs)

1 4.096

2 3.203

4 3.164

8 3.582


1.10.3

a. Processors Execution time (µs)

1 5.376

2 2.688

4 1.344

8 0.672

b. Processors Execution time (µs)

1 5.376

2 3.878

4 3.564

8 3.882

1.10.4

a. Cores Execution time (s) @ 3 GHz

1 4.00

2 2.17

4 1.25

8 0.75

b. Cores Execution time (s) @ 3 GHz

1 4.00

2 2.00

4 1.00

8 0.50


1.10.5

a.

Cores    Power (W) per core @ 3 GHz    Power (W) per core @ 500 MHz    Power (W) @ 3 GHz    Power (W) @ 500 MHz
1        15                            0.625                           15                   0.625
2        15                            0.625                           30                   1.25
4        15                            0.625                           60                   2.5
8        15                            0.625                           120                  5

b.

Cores    Power (W) per core @ 3 GHz    Power (W) per core @ 500 MHz    Power (W) @ 3 GHz    Power (W) @ 500 MHz
1        15                            0.625                           15                   0.625
2        15                            0.625                           30                   1.25
4        15                            0.625                           60                   2.5
8        15                            0.625                           120                  5

1.10.6

a. Processors Energy (J) @ 3 GHz Energy (J) @ 500 MHz

1 60 15

2 65 16.25

4 75 18.75

8 90 22.5

b. Processors Energy (J) @ 3 GHz Energy (J) @ 500 MHz

1 60 15

2 60 15

4 60 15

8 60 15


Solution 1.11

1.11.1 Wafer area = π × (d/2)²

a. Wafer area = π × 7.5² = 176.7 cm²
b. Wafer area = π × 12.5² = 490.9 cm²

Die area = wafer area/dies per wafer

a. Die area = 176.7/90 = 1.96 cm²
b. Die area = 490.9/140 = 3.51 cm²

Yield = 1/(1 + (defect per area × die area)/2)²

a. Yield = 0.97
b. Yield = 0.92
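The three formulas of 1.11.1 chain together naturally. The sketch below recomputes wafer area, die area, and yield; the function names are illustrative, the wafer diameters (15 cm and 25 cm) are inferred from the radii 7.5 and 12.5 used above, and the defect densities (0.018 and 0.024 defects/cm²) are taken from the arithmetic shown in 1.11.3.

```python
import math


def wafer_area(diameter_cm):
    """Wafer area = pi * (d/2)^2, in cm^2."""
    return math.pi * (diameter_cm / 2) ** 2


def die_yield(defects_per_cm2, die_area_cm2):
    """Yield = 1 / (1 + (defects per area * die area) / 2)^2."""
    return 1.0 / (1.0 + defects_per_cm2 * die_area_cm2 / 2.0) ** 2


# (diameter in cm, dies per wafer, defects per cm^2) for parts a and b.
cases = {"a": (15.0, 90, 0.018), "b": (25.0, 140, 0.024)}

results = {}
for label, (diameter, dies, defects) in cases.items():
    area = wafer_area(diameter)
    die_area = area / dies
    results[label] = (area, die_area, die_yield(defects, die_area))
    print(label, round(area, 1), round(die_area, 2), round(results[label][2], 2))
```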

1.11.2 Cost per die = cost per wafer/(dies per wafer × yield)

a. Cost per die = 0.12 b. Cost per die = 0.16

1.11.3

a. Dies per wafer = 1.1 × 90 = 99
Defects per area = 1.15 × 0.018 = 0.021 defects/cm²
Die area = wafer area/Dies per wafer = 176.7/99 = 1.78 cm²
Yield = 0.97

b. Dies per wafer = 1.1 × 140 = 154
Defects per area = 1.15 × 0.024 = 0.028 defects/cm²
Die area = wafer area/Dies per wafer = 490.9/154 = 3.19 cm²
Yield = 0.93

1.11.4 Yield = 1/(1 + (defect per area × die area)/2)²

Then defect per area = (2/die area)(y^(−1/2) − 1)

Replacing values for T1 and T2 we get

T1: defects per area = 0.00085 defects/mm² = 0.085 defects/cm²
T2: defects per area = 0.00060 defects/mm² = 0.060 defects/cm²
T3: defects per area = 0.00043 defects/mm² = 0.043 defects/cm²
T4: defects per area = 0.00026 defects/mm² = 0.026 defects/cm²

1.11.5 no solution provided


Solution 1.12

1.12.1 CPI = clock rate × CPU time/instr. count

clock rate = 1/cycle time = 3 GHz

a. CPI(perl) = 3 × 10⁹ × 500/(2118 × 10⁹) = 0.7
b. CPI(mcf) = 3 × 10⁹ × 1200/(336 × 10⁹) = 10.7

1.12.2 SPECratio = ref. time/execution time.

a. SPECratio(pearl) = 9770/500 = 19.54 b. SPECratio(mcf) = 9120/1200 = 7.6

1.12.3

(19.54 × 7.6)^1/2 = 12.19

1.12.4 CPU time = No. instr. × CPI/clock rate

If CPI and clock rate do not change, the CPU time increase is equal to the increase in the number of instructions, that is, 10%.

1.12.5 CPU time(before) = No. instr. × CPI/clock rate CPU time(after) = 1.1 × No. instr. × 1.05 × CPI/clock rate

CPU time(after)/CPU time(before) = 1.1 × 1.05 = 1.155. Thus, CPU time is increased by 15.5%.

1.12.6 SPECratio = reference time/CPU time

SPECratio(after)/SPECratio(before) = CPU time(before)/CPU time(after) = 1/1.155 = 0.87. That is, the SPECratio is decreased by 13%.
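The relations used in 1.12.2 through 1.12.6 can be tied together in a few lines. This sketch uses the reference and execution times shown in 1.12.2 and the 10% instruction and 5% CPI increases from 1.12.5; the function and variable names are illustrative assumptions. Computed directly, the SPECratio scaling factor is about 0.866.

```python
import math


def spec_ratio(reference_time, execution_time):
    """SPECratio = reference time / execution time."""
    return reference_time / execution_time


ratio_a = spec_ratio(9770, 500)   # part a in 1.12.2
ratio_b = spec_ratio(9120, 1200)  # part b in 1.12.2
geo_mean = math.sqrt(ratio_a * ratio_b)

# CPU time = No. instr. x CPI / clock rate, so with an unchanged clock rate
# the scaling factors for instruction count and CPI simply multiply.
time_factor = 1.10 * 1.05
# SPECratio is inversely proportional to CPU time.
spec_factor = 1.0 / time_factor

print(round(ratio_a, 2), round(ratio_b, 2), round(geo_mean, 2))
print(round(time_factor, 3), round(spec_factor, 3))
```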

Solution 1.13

1.13.1 CPI = (CPU time × clock rate)/No. instr.

a. CPI = 450 × 4 × 10⁹/(0.85 × 2118 × 10⁹) = 0.99 b. CPI = 1150 × 4 × 10⁹/(0.85 × 336 × 10⁹) = 16.10


1.13.2 Clock rate ratio = 4 GHz/3 GHz = 1.33.

a. CPI @ 4 GHz = 0.99, CPI @ 3 GHz = 0.7, ratio = 1.41 b. CPI @ 4 GHz = 16.1, CPI @ 3 GHz = 10.7, ratio = 1.50

They are different because although the number of instructions has been reduced by 15%, the CPU time has been reduced by a lower percentage.

1.13.3

a. 450/500 = 0.90. CPU time reduction: 10%.

b. 1150/1200 = 0.958. CPU time reduction: 4.2%.

1.13.4 No. instr. = CPU time × clock rate/CPI.

a. No. instr. = 820 × 0.9 × 4 × 10⁹/0.96 = 3075 × 10⁹ b. No. instr. = 580 × 0.9 × 4 × 10⁹/2.94 = 710 × 10⁹

1.13.5 Clock rate = No. instr. × CPI/CPU time.

Clock rate_new = No. instr. × CPI/(0.9 × CPU time) = (1/0.9) × clock rate_old = 3.33 GHz.

1.13.6 Clock rate = No. instr. × CPI/CPU time.

Clock rate_new = No. instr. × 0.85 × CPI/(0.80 × CPU time) = (0.85/0.80) × clock rate_old = 3.18 GHz.

Solution 1.14

1.14.1 No. instr. = 10⁶

T_cpu(P1) = 10⁶ × 1.25/(4 × 10⁹) = 0.3125 × 10⁻³ s
T_cpu(P2) = 10⁶ × 0.75/(3 × 10⁹) = 0.25 × 10⁻³ s

clock rate(P1) > clock rate(P2), but performance(P1) < performance(P2)

1.14.2

P1: 10⁶ instructions, T_cpu(P1) = 0.3125 × 10⁻³ s
P2: T_cpu(P2) = N × 0.75/(3 × 10⁹), then N = 1.25 × 10⁶


1.14.3 MIPS = Clock rate × 10⁻⁶/CPI

MIPS(P1) = 4 × 10⁹ × 10⁻⁶/1.25 = 3200
MIPS(P2) = 3 × 10⁹ × 10⁻⁶/0.75 = 4000

MIPS(P1) < MIPS(P2), performance(P1) < performance(P2) in this case (from 1.14.1)
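The MIPS and CPU-time comparison for P1 and P2 can be sketched as follows, assuming the clock rates (4 GHz, 3 GHz), CPIs (1.25, 0.75) and the 10⁶-instruction program used in 1.14.1 and 1.14.3. The function names are illustrative.

```python
def mips(clock_rate_hz, cpi):
    """MIPS = clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


def cpu_time(instr_count, cpi, clock_rate_hz):
    """CPU time in seconds = instruction count x CPI / clock rate."""
    return instr_count * cpi / clock_rate_hz


mips_p1 = mips(4e9, 1.25)
mips_p2 = mips(3e9, 0.75)
time_p1 = cpu_time(1e6, 1.25, 4e9)
time_p2 = cpu_time(1e6, 0.75, 3e9)

print(mips_p1, mips_p2)
print(time_p1, time_p2)
```

For the same instruction count, the processor with the higher MIPS rating also has the shorter CPU time, even though its clock rate is lower.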

1.14.4

a. FP op = 10⁶ × 0.4 = 4 × 10⁵, clock cycles_fp = CPI × No. FP instr. = 4 × 10⁵
T_fp = 4 × 10⁵ × 0.33 × 10⁻⁹ = 1.32 × 10⁻⁴, then MFLOPS = 3.03 × 10³

b. FP op = 3 × 10⁶ × 0.4 = 1.2 × 10⁶, clock cycles_fp = CPI × No. FP instr. = 0.70 × 1.2 × 10⁶
T_fp = 0.84 × 10⁶ × 0.33 × 10⁻⁹ = 2.77 × 10⁻⁴, then MFLOPS = 4.33 × 10³

1.14.5 CPU clock cycles = FP cycles + CPI(L/S) × No. instr. (L/S) + CPI(Branch) × No. instr. (Branch)

a. 5 × 10⁵ L/S instr., 4 × 10⁵ FP instr. and 10⁵ Branch instr.

CPU clock cycles = 4 × 10⁵ + 0.75 × 5 × 10⁵ + 1.5 × 10⁵ = 9.25 × 10⁵
T_cpu = 9.25 × 10⁵ × 0.33 × 10⁻⁹ = 3.05 × 10⁻⁴

MIPS = 10⁶/(3.05 × 10⁻⁴ × 10⁶) = 3.2 × 10³

b. 1.2 × 10⁶ L/S instr., 1.2 × 10⁶ FP instr. and 0.6 × 10⁶ Branch instr.

CPU clock cycles = 0.84 × 10⁶ + 1.25 × 1.2 × 10⁶ + 1.25 × 0.6 × 10⁶ = 3.09 × 10⁶
T_cpu = 3.09 × 10⁶ × 0.33 × 10⁻⁹ = 1.01 × 10⁻³

MIPS = 3 × 10⁶/(1.01 × 10⁻³ × 10⁶) = 2.97 × 10³

1.14.6

a. performance = 1/T_cpu = 3.2 × 10³ b. performance = 1/T_cpu = 9.9 × 10²

The second program has the higher performance and the higher MFLOPS figure, but the first program has the higher MIPS figure.

Solution 1.15

1.15.1

a. T_fp = 35 × 0.8 = 28 s, T_p1 = 28 + 85 + 50 + 30 = 193 s. Reduction: 3.5%

b. T_fp = 50 × 0.8 = 40 s, T_p4 = 40 + 80 + 50 + 30 = 200 s. Reduction: 4.7%


1.15.2

a. T_p1 = 200 × 0.8 = 160 s, T_fp + T_l/s + T_branch = 115 s, T_int = 45 s. Reduction time INT: 47%

b. T_p4 = 210 × 0.8 = 168 s, T_fp + T_l/s + T_branch = 130 s, T_int = 38 s. Reduction time INT: 52.4%

1.15.3

a. T_p1 = 200 × 0.8 = 160 s, T_fp + T_int + T_l/s = 170 s. NO
b. T_p4 = 210 × 0.8 = 168 s, T_fp + T_int + T_l/s = 180 s. NO

1.15.4 Clock cycles = CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr.

T_cpu = clock cycles/clock rate = clock cycles/(2 × 10⁹)

a. 1 processor: clock cycles = 8192; T_cpu = 4.096 µs
b. 8 processors: clock cycles = 1024; T_cpu = 0.512 µs

To halve the number of clock cycles by improving the CPI of FP instructions:

CPI_improved fp × No. FP instr. + CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr. = clock cycles/2

CPI_improved fp = (clock cycles/2 − (CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr.))/No. FP instr.

a. 1 processor: CPI_improved fp = (4096 − 7632)/560 < 0 ==> not possible
b. 8 processors: CPI_improved fp = (512 − 944)/80 < 0 ==> not possible

1.15.5 Using the clock cycle data from 1.15.4:

To halve the number of clock cycles by improving the CPI of L/S instructions:

CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_improved l/s × No. L/S instr. + CPI_branch × No. branch instr. = clock cycles/2

CPI_improved l/s = (clock cycles/2 − (CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_branch × No. branch instr.))/No. L/S instr.

a. 1 processor: CPI_improved l/s = (4096 − 3072)/1280 = 0.8
b. 8 processors: CPI_improved l/s = (512 − 384)/160 = 0.8
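The reasoning in 1.15.4 and 1.15.5 is the same calculation with different inputs: subtract the cycles that cannot change from the cycle target, then divide by the number of instructions in the class being improved. The sketch below uses the cycle totals shown above; the function name `required_cpi` and its parameters are illustrative assumptions. A negative result means the target cannot be met by improving that class alone.

```python
def required_cpi(target_cycles, other_cycles, class_instr_count):
    """CPI the improved class would need so that total cycles hit the target.

    target_cycles: desired total clock cycles (here, half the original)
    other_cycles: cycles spent in all classes that are NOT being improved
    class_instr_count: instruction count of the class being improved
    """
    return (target_cycles - other_cycles) / class_instr_count


# 1.15.4: improving FP instructions only.
fp_1_proc = required_cpi(4096, 7632, 560)
fp_8_proc = required_cpi(512, 944, 80)

# 1.15.5: improving load/store instructions only.
ls_1_proc = required_cpi(4096, 3072, 1280)
ls_8_proc = required_cpi(512, 384, 160)

for name, value in [("FP, 1 proc", fp_1_proc), ("FP, 8 proc", fp_8_proc),
                    ("L/S, 1 proc", ls_1_proc), ("L/S, 8 proc", ls_8_proc)]:
    verdict = "not possible" if value < 0 else f"CPI = {value:.2f}"
    print(f"{name}: {verdict}")
```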

1.15.6 Clock cycles = CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr.

T_cpu = clock cycles/clock rate = clock cycles/(2 × 10⁹)

CPI_int = 0.6 × 1 = 0.6; CPI_fp = 0.6 × 1 = 0.6; CPI_l/s = 0.7 × 4 = 2.8; CPI_branch = 0.7 × 2 = 1.4

a. 1 processor: T_cpu(before improv.) = 4.096 µs; T_cpu(after improv.) = 2.739 µs
b. 8 processors: T_cpu(before improv.) = 0.512 µs; T_cpu(after improv.) = 0.342 µs

Solution 1.16

1.16.1 Without reduction in any routine:

a. total time 2 proc = 185 ns b. total time 16 proc = 34 ns

Reducing time in routines A, C and E:

a. 2 proc: T(A) = 17 ns, T(C) = 8.5 ns, T(E) = 4.1 ns, total time = 179.6 ns ==> reduction = 2.9%

b. 16 proc: T(A) = 3.4 ns, T(C) = 1.7 ns, T(E) = 1.7 ns, total time = 32.8 ns ==> reduction = 3.5%

1.16.2

a. 2 proc: T(B) = 72 ns, total time = 177 ns ==> reduction = 4.3%

b. 16 proc: T(B) = 12.6 ns, total time = 32.6 ns ==> reduction = 4.1%

1.16.3

a. 2 proc: T(D) = 63 ns, total time = 178 ns ==> reduction = 3.7%

b. 16 proc: T(D) = 10.8 ns, total time = 32.8 ns ==> reduction = 3.5%


1.16.4

# Processors    Computing time    Computing time ratio    Routing time ratio
2               176
4               96                0.55                    1.18
8               49                0.51                    1.31
16              30                0.61                    1.29
32              14                0.47                    1.05
64              6.5               0.46                    1.13

1.16.5 Geometric mean of computing time ratios = 0.52. Multiplying this by the computing time for a 64-processor system gives a computing time for a 128-processor system of 3.4 ms.

Geometric mean of routing time ratios = 1.19. Multiplying this by the routing time for a 64-processor system gives a routing time for a 128-processor system of 30.9 ms.

1.16.6 Computing time = 176/0.52 = 338 ms. Routing time = 0, since no communication is required.

Solution 2.1

2.1.1

a.
    add f, g, h
    add f, f, i
    add f, f, j

b.
    addi f, h, 5
    add f, f, g

2.1.2

a. 3 b. 2

2.1.3

a. 14 b. 10

2.1.4

a. f = g + h b. f = g + h

2.1.5

a. 5 b. 5

Solution 2.2

2.2.1

a.
    add f, f, f
    add f, f, i

b.
    addi f, j, 2
    add f, f, g

2.2.2

a. 2 b. 2

2.2.3

a. 6 b. 5

2.2.4

a. f += h;

b. f = 1–f;

2.2.5

a. 4 b. 0

Solution 2.3

2.3.1

a.
    add f, f, g
    add f, f, h
    add f, f, i
    add f, f, j
    addi f, f, 2

b.
    addi f, f, 5
    sub f, g, f

2.3.2

a. 5 b. 2

2.3.3

a. 17 b. –4


2.3.4

a. f = h – g;

b. f = g – f – 1;

2.3.5

a. 1 b. 0

Solution 2.4

2.4.1

a.
    lw $s0, 16($s7)
    add $s0, $s0, $s1
    add $s0, $s0, $s2

b.
    lw $t0, 16($s7)
    lw $s0, 0($t0)
    sub $s0, $s1, $s0

2.4.2

a. 3 b. 3

2.4.3

a. 4 b. 4

2.4.4

a. f += g + h + i + j;

b. f = A[1];


2.4.5

a. no change b. no change

2.4.6

a. 5 as written, 5 minimally b. 2 as written, 2 minimally

Solution 2.5

2.5.1

a.
Address    Data
12         1
8          6
4          4
0          2

temp = Array[3];
Array[3] = Array[2];
Array[0] = temp;

b.
Address    Data
16         1
12         2
8          3
4          4
0          5

temp = Array[4];
Array[0] = temp;
temp = Array[3];
Array[1] = temp;

2.5.2

a.
Address    Data
12         1
8          6
4          4
0          2

temp = Array[3];
Array[0] = temp;

    lw $t0, 12($s6)
    lw $t1, 8($s6)
    sw $t1, 12($s6)
    lw $t1, 4($s6)
    sw $t1, 8($s6)
    lw $t1, 0($s6)
    sw $t1, 4($s6)
    sw $t0, 0($s6)

b.
Address    Data
16         1
12         2
8          3
4          4
0          5

temp = Array[4];
Array[0] = temp;
temp = Array[3];
Array[1] = temp;

    lw $t0, 16($s6)
    lw $t1, 0($s6)
    sw $t1, 16($s6)
    sw $t0, 0($s6)
    lw $t0, 12($s6)
    lw $t1, 4($s6)
    sw $t1, 12($s6)
    sw $t0, 4($s6)

2.5.3

a. Address Data

12 1

8 6

4 4

0 2

temp = Array[3];

Array[0] = temp;

    lw $t0, 12($s6)
    lw $t1, 8($s6)
    sw $t1, 12($s6)
    lw $t1, 4($s6)
    sw $t1, 8($s6)
    lw $t1, 0($s6)
    sw $t1, 4($s6)
    sw $t0, 0($s6)

8 MIPS instructions, +1 MIPS inst. for every non-zero offset lw/sw pair (11 MIPS inst.)

b. Address Data

16 1

12 2

8 3

4 4

0 5

temp = Array[4];

Array[0] = temp;

temp = Array[3];

Array[1] = temp;

    lw $t0, 16($s6)
    lw $t1, 0($s6)
    sw $t1, 16($s6)
    sw $t0, 0($s6)
    lw $t0, 12($s6)
    lw $t1, 4($s6)
    sw $t1, 12($s6)
    sw $t0, 4($s6)

8 MIPS instructions, +1 MIPS inst. for every non-zero offset lw/sw pair (11 MIPS inst.)

2.5.4

a. 305419896 b. 3199070221

2.5.5

a.
Little-Endian         Big-Endian
Address    Data       Address    Data
12         12         12         78
8          34         8          56
4          56         4          34
0          78         0          12

b.
Little-Endian         Big-Endian
Address    Data       Address    Data
12         be         12         0d
8          ad         8          f0
4          f0         4          ad
0          0d         0          be
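The byte orders in 2.5.5 and the decimal values in 2.5.4 can be checked with Python's standard `struct` module. This is a sketch for verification only; the helper name `byte_layout` is an illustrative choice. The returned string lists bytes from the lowest address to the highest, so it reads bottom-to-top relative to the tables above.

```python
import struct


def byte_layout(word, little_endian):
    """Return the 4 bytes of a 32-bit word as hex, lowest address first."""
    fmt = "<I" if little_endian else ">I"
    return struct.pack(fmt, word).hex()


word_a = 0x12345678
word_b = 0xBEADF00D

print(word_a, word_b)  # decimal values, as in 2.5.4
print(byte_layout(word_a, little_endian=True))   # least significant byte first
print(byte_layout(word_a, little_endian=False))  # most significant byte first
print(byte_layout(word_b, little_endian=True))
print(byte_layout(word_b, little_endian=False))
```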

Solution 2.6

2.6.1

a.
    lw $s0, 4($s7)
    sub $s0, $s0, $s1
    add $s0, $s0, $s2

b.
    add $t0, $s7, $s1
    lw $t0, 0($t0)
    add $t0, $t0, $s6
    lw $s0, 4($t0)

2.6.2

a. 3 b. 4

2.6.3

a. 4 b. 5

2.6.4

a. f = 2i + h;

b. f = A[g – 3];

2.6.5

a. $s0 = 110 b. $s0 = 300

2.6.6

a.

Instruction             Type      opcode    rs    rt    rd    immed
add $s0, $s0, $s1       R-type    0         16    17    16
add $s0, $s3, $s2       R-type    0         19    18    16
add $s0, $s0, $s3       R-type    0         16    19    16

b.

Instruction             Type      opcode    rs    rt    rd    immed
addi $s6, $s6, –20      I-type    8         22    22          –20
add $s6, $s6, $s1       R-type    0         22    17    22
lw $s0, 8($s6)          I-type    35        22    16          8

Solution 2.7

2.7.1

a. –1391460350 b. –19629

2.7.2

a. 2903506946 b. 4294947667

2.7.3

a. AD100002 b. FFFFB353

2.7.4

a. 01111111111111111111111111111111 b. 1111101000

2.7.5

a. 7FFFFFFF b. 3E8

2.7.6

a. 80000001 b. FFFFFC18
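The conversions in Solution 2.7 (hexadecimal, unsigned decimal, two's-complement signed decimal, and negation) can be verified with a couple of helper functions. This is a minimal sketch assuming 32-bit words; the names `to_signed32` and `negate32` are illustrative, not from the exercise.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(word):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    return word - (1 << 32) if word & 0x80000000 else word


def negate32(word):
    """Two's-complement negation, wrapped to 32 bits."""
    return (-word) & MASK32


pattern_a = 0xAD100002
pattern_b = 0xFFFFB353

print(to_signed32(pattern_a), to_signed32(pattern_b))  # signed view (2.7.1)
print(pattern_a, pattern_b)                            # unsigned view (2.7.2)
print(f"{negate32(0x7FFFFFFF):08X}", f"{negate32(0x3E8):08X}")  # negation (2.7.6)
```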

Solution 2.8

2.8.1

a. 7FFFFFFF, no overflow b. 80000000, overflow

2.8.2

a. 60000001, no overflow b. 0, no overflow

2.8.3

a. EFFFFFFF, overflow b. C0000000, overflow

2.8.4

a. overflow b. no overflow

2.8.5

a. no overflow b. no overflow

2.8.6

Solution 2.9

2.9.1

2.9.2

2.9.3

a. no overflow b. overflow

2.9.4

a. no overflow b. no overflow

2.9.5

a. 1D100002 b. 6FFFB353

2.9.6

a. 487587842 b. 1879028563

Solution 2.10

2.10.1

a. sw $t3, 4($s0) b. lw $t0, 64($t0)

2.10.2

a. I-type b. I-type

2.10.3

a. AE0B0004 b. 8D080040
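The hexadecimal encodings in 2.10.3 follow from packing the I-type fields (6-bit opcode, 5-bit rs, 5-bit rt, 16-bit immediate) into one 32-bit word. The sketch below assumes the standard MIPS opcodes for `lw` (35) and `sw` (43) and the conventional register numbers $t0 = 8, $t3 = 11, $s0 = 16; the function name `encode_i_type` is an illustrative choice.

```python
LW, SW = 35, 43           # standard MIPS opcodes
T0, T3, S0 = 8, 11, 16    # conventional MIPS register numbers


def encode_i_type(opcode, rs, rt, immediate):
    """Pack an I-type instruction: opcode | rs | rt | 16-bit immediate.

    The immediate is masked to 16 bits so that negative offsets are stored
    in two's-complement form.
    """
    return (opcode << 26) | (rs << 21) | (rt << 16) | (immediate & 0xFFFF)


# sw $t3, 4($s0): base register goes in rs, data register in rt.
sw_word = encode_i_type(SW, rs=S0, rt=T3, immediate=4)
# lw $t0, 64($t0)
lw_word = encode_i_type(LW, rs=T0, rt=T0, immediate=64)

print(f"{sw_word:08X}", f"{lw_word:08X}")
```

The same function reproduces the negative-offset variants `sw $t3, -4($s0)` and `lw $t0, -64($t0)`, since only the low 16 bits change.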


2.10.4

a. 0x01004020 b. 0x8E690004

2.10.5

a. R-type b. I-type

2.10.6

a. op=0x0, rd=0x8, rs=0x8, rt=0x0, funct=0x20 b. op=0x23, rs=0x13, rt=0x9, imm=0x4

Solution 2.11

2.11.1

a. 1010 1110 0000 1011 1111 1111 1111 1100_two b. 1000 1101 0000 1000 1111 1111 1100 0000_two

2.11.2

a. 2920022012 b. 2366177216

2.11.3

a. sw $t3, –4($s0) b. lw $t0, –64($t0)

2.11.4

a. R-type b. I-type


2.11.5

a. add $v1, $at, $v0 b. sw $a1, 4($s0)

2.11.6

a. 0x00221820 b. 0xAD450004

Solution 2.12

2.12.1

Type opcode rs rt rd shamt funct

a. R-type 6 3 3 3 5 6 total bits = 26

b. R-type 6 5 5 5 5 6 total bits = 32

2.12.2

Type opcode rs rt immed

a. I-type 6 3 3 16 total bits = 28

b. I-type 6 5 5 10 total bits = 26

2.12.3

a. fewer registers → fewer bits per instruction → could reduce code size
fewer registers → more register spills → more instructions

b. smaller constants → more lui instructions → could increase code size
smaller constants → smaller opcodes → smaller code size

2.12.4

a. 17367056 b. 2366177298

2.12.5

a. add $t0, $t1, $0 b. lw $t1, 12($t0)


2.12.6

a. R-type, op=0x0, rt=0x9 b. I-type, op=0x23, rt=0x8

Solution 2.13

2.13.1

a. 0x57755778 b. 0xFEFFFEDE

2.13.2

a. 0x55555550 b. 0xEADFEED0

2.13.3

a. 0x0000AAAA b. 0x0000BFCD

2.13.4

a. 0x00015B5A b. 0x00000000

2.13.5

a. 0x5b5a0000 b. 0x000000f0

2.13.6

a. 0xEFEFFFFF b. 0x000000F0


Solution 2.14

2.14.1

a.
    add $t1, $t0, $0
    srl $t1, $t1, 5
    andi $t1, $t1, 0x0001ffff

b.
    add $t1, $t0, $0
    sll $t1, $t1, 10
    andi $t1, $t1, 0xffff8000

2.14.2

a.
    add $t1, $t0, $0
    andi $t1, $t1, 0x0000000f

b.
    add $t1, $t0, $0
    srl $t1, $t1, 14
    andi $t1, $t1, 0x0003c000

2.14.3

a.
    add $t1, $t0, $0
    srl $t1, $t1, 28

b.
    add $t1, $t0, $0
    srl $t1, $t1, 14
    andi $t1, $t1, 0x0001c000

2.14.4

a.
    add $t2, $t0, $0
    srl $t2, $t2, 11
    andi $t2, $t2, 0x0000003f
    andi $t1, $t1, 0xffffffc0
    or $t1, $t1, $t2

b.
    add $t2, $t0, $0
    sll $t2, $t2, 3
    andi $t2, $t2, 0x000fc000
    andi $t1, $t1, 0xfff03fff
    or $t1, $t1, $t2

2.14.5

a.
    add $t2, $t0, $0
    andi $t2, $t2, 0x0000001f
    andi $t1, $t1, 0xffffffe0
    or $t1, $t1, $t2

b.
    add $t2, $t0, $0
    sll $t2, $t2, 14
    andi $t2, $t2, 0x0007c000
    andi $t1, $t1, 0xfff83fff
    or $t1, $t1, $t2

2.14.6

a.
    add $t2, $t0, $0
    srl $t2, $t2, 29
    andi $t2, $t2, 0x00000003
    andi $t1, $t1, 0xfffffffc
    or $t1, $t1, $t2

b.
    add $t2, $t0, $0
    srl $t2, $t2, 15
    andi $t2, $t2, 0x0000c000
    andi $t1, $t1, 0xffff3fff
    or $t1, $t1, $t2

Solution 2.15

2.15.1

a. 0x0000a581 b. 0x00ff5a66

2.15.2

a.
    nor $t1, $t2, $t2
    and $t1, $t1, $t3

b.
    xor $t1, $t2, $t3
    nor $t1, $t1, $t1

2.15.3

a.
    nor $t1, $t2, $t2
    and $t1, $t1, $t3

    000000 01010 01010 01001 00000 100111
    000000 01001 01011 01001 00000 100100

b.
    xor $t1, $t2, $t3
    nor $t1, $t1, $t1

    000000 01010 01011 01001 00000 100110
    000000 01001 01001 01001 00000 100111
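The binary encodings in 2.15.3 can be regenerated by formatting each R-type field at its fixed width (opcode 6, rs 5, rt 5, rd 5, shamt 5, funct 6). The sketch below assumes the standard MIPS function codes (`and` = 0x24, `xor` = 0x26, `nor` = 0x27) and register numbers $t1 = 9, $t2 = 10, $t3 = 11; the function name `r_type_fields` is an illustrative choice.

```python
FUNCT = {"and": 0x24, "xor": 0x26, "nor": 0x27}  # standard MIPS funct codes
T1, T2, T3 = 9, 10, 11                            # conventional register numbers


def r_type_fields(mnemonic, rd, rs, rt, shamt=0):
    """Format an R-type instruction as space-separated binary fields.

    Assembly order is `op rd, rs, rt`, but the encoded field order is
    opcode, rs, rt, rd, shamt, funct.
    """
    funct = FUNCT[mnemonic]
    return f"{0:06b} {rs:05b} {rt:05b} {rd:05b} {shamt:05b} {funct:06b}"


print(r_type_fields("nor", rd=T1, rs=T2, rt=T2))  # nor $t1, $t2, $t2
print(r_type_fields("and", rd=T1, rs=T1, rt=T3))  # and $t1, $t1, $t3
print(r_type_fields("xor", rd=T1, rs=T2, rt=T3))  # xor $t1, $t2, $t3
print(r_type_fields("nor", rd=T1, rs=T1, rt=T1))  # nor $t1, $t1, $t1
```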


2.15.4

a. 0x00000220 b. 0x00001234

2.15.5 Assuming $t1 = A, $t2 = B, $s1 = base of Array C

a.
    lw $t3, 0($s1)
    and $t1, $t2, $t3

b.
          beq $t1, $0, ELSE
          add $t1, $t2, $0
          beq $0, $0, END
    ELSE: lw $t2, 0($s1)
    END:

2.15.6

a.
    lw $t3, 0($s1)
    and $t1, $t2, $t3

    100011 10001 01011 0000000000000000
    000000 01010 01011 01001 00000 100100

b.
          beq $t1, $0, ELSE
          add $t1, $t2, $0
          beq $0, $0, END
    ELSE: lw $t2, 0($s1)
    END:

    000100 01001 00000 0000000000000010
    000000 01010 00000 01001 00000 100000
    000100 00000 00000 0000000000000001
    100011 10001 01010 0000000000000000

Solution 2.16

2.16.1

a. $t2 = 1 b. $t2 = 1

2.16.2

a. all, 0x8000 to 0x7FFFF b. 0x8000 to 0xFFFE

2.16.3

a. jump—no, beq—no b. jump—no, beq—no


2.16.4

a. $t2 = 2 b. $t2 = 2

2.16.5

a. $t2 = 0 b. $t2 = 1

2.16.6

a. jump—yes, beq—no b. jump—yes, beq—yes

Solution 2.17

2.17.1 The answer is really the same for all. All of these instructions are either supported by an existing instruction, or sequence of existing instructions. Looking for an answer along the lines of, “these instructions are not common, and we are only making the common case fast”.

2.17.2

a. could be either R-type or I-type b. R-type

2.17.3

a.
    ABS:  sub $t2, $zero, $t3     # t2 = -t3
          ble $t3, $zero, DONE    # if t3 <= 0, result is t2
          add $t2, $t3, $zero     # if t3 > 0, result is t3
    DONE:

b.
    slt $t1, $t3, $t2

2.17.4

a. 20 b. 200


2.17.5

a.
    i = 10;
    do {
        B += 2;
        i = i - 1;
    } while (i > 0);

b.
    i = 10;
    do {
        temp = 10;
        do {
            B += 2;
            temp = temp - 1;
        } while (temp > 0);
        i = i - 1;
    } while (i > 0);

2.17.6

a. 5 × N + 3 b. 33 × N

Solution 2.18

2.18.1

a. [Flowchart: test i < 10?; loop body A += B; then i += 1]

b. [Flowchart: test A < 10; loop body D[a] = b + a; then A += 1]

2.18.2

a.
          addi $t0, $0, 0
          beq $0, $0, TEST
    LOOP: add $s0, $s0, $s1
          addi $t0, $t0, 1
    TEST: slti $t2, $t0, 10
          bne $t2, $0, LOOP

b.
    LOOP: slti $t2, $s0, 10
          beq $t2, $0, DONE
          add $t3, $s1, $s0
          sll $t2, $s0, 2
          add $t2, $s2, $t2
          sw $t3, ($t2)
          addi $s0, $s0, 1
          j LOOP
    DONE:

2.18.3

a. 6 instructions to implement and 44 instructions executed b. 8 instructions to implement and 2 instructions executed

2.18.4

a. 501 b. 301

2.18.5

a. for(i=100; i>0; i--){

result += MemArray[s0];

s0 += 1;

}

b. for(i=0; i<100; i+=2){

result += MemArray[s0 + i];

result += MemArray[s0 + i + 1];

}

2.18.6

a.
          addi $t1, $s0, 400
    LOOP: lw $s1, 0($s0)
          add $s2, $s2, $s1
          addi $s0, $s0, 4
          bne $s0, $t1, LOOP

b. already reduced to minimum instructions

Solution 2.19

2.19.1

a.
compare:
    addi $sp, $sp, -4
    sw $ra, 0($sp)
    add $s0, $a0, $0
    add $s1, $a1, $0
    jal sub
    addi $t1, $0, 1
    beq $v0, $0, exit
    slt $t2, $0, $v0
    bne $t2, $0, exit
    add $t1, $0, $0
exit:
    add $v0, $t1, $0
    lw $ra, 0($sp)
    addi $sp, $sp, 4
    jr $ra
sub:
    sub $v0, $a0, $a1
    jr $ra

b.
fib_iter:
    addi $sp, $sp, -16
    sw $ra, 12($sp)
    sw $s0, 8($sp)
    sw $s1, 4($sp)
    sw $s2, 0($sp)
    add $s0, $a0, $0
    add $s1, $a1, $0
    add $s2, $a2, $0
    add $v0, $s1, $0
    bne $s2, $0, exit
    add $a0, $s0, $s1
    add $a1, $s0, $0
    addi $a2, $s2, -1
    jal fib_iter
exit:
    lw $s2, 0($sp)
    lw $s1, 4($sp)
    lw $s0, 8($sp)
    lw $ra, 12($sp)
    addi $sp, $sp, 16
    jr $ra

2.19.2

a.
compare:
    addi $sp, $sp, -4
    sw $ra, 0($sp)
    sub $t0, $a0, $a1
    addi $t1, $0, 1
    beq $t0, $0, exit
    slt $t2, $0, $t0
    bne $t2, $0, exit
    add $t1, $0, $0
exit:
    add $v0, $t1, $0
    lw $ra, 0($sp)
    addi $sp, $sp, 4
    jr $ra

b. Due to the recursive nature of the code, it is not possible for the compiler to in-line the function call.

2.19.3

a. after calling function compare:

    old $sp => 0x7ffffffc    ???
    $sp =>     –4            contents of register $ra

after calling function sub:

               –4            contents of register $ra
    $sp =>     –8            contents of register $ra #return to compare

b. after calling function fib_iter:

               –4            contents of register $ra
               –8            contents of register $s0
               –12           contents of register $s1
    $sp =>     –16           contents of register $s2

2.19.4

a.
f:  addi $sp, $sp, -8
    sw $ra, 4($sp)
    sw $s0, 0($sp)
    move $s0, $a2
    jal func
    move $a0, $v0
    move $a1, $s0
    jal func
    lw $ra, 4($sp)
    lw $s0, 0($sp)
    addi $sp, $sp, 8
    jr $ra

b.
f:  addi $sp, $sp, -12
    sw $ra, 8($sp)
    sw $s1, 4($sp)
    sw $s0, 0($sp)
    move $s0, $a1
    move $s1, $a2
    jal func
    move $a0, $s0
    move $a1, $s1
    move $s0, $v0
    jal func
    add $v0, $v0, $s0
    lw $ra, 8($sp)
    lw $s1, 4($sp)
    lw $s0, 0($sp)
    addi $sp, $sp, 12
    jr $ra

2.19.5

a. We can use the tail-call optimization for the second call to func, but then we must restore $ra and $sp before that call. We save only one instruction (jr $ra).

b. We can NOT use the tail call optimization here, because the value returned from f is not equal to the value returned by the last call to func.

2.19.6 Register $ra is equal to the return address in the caller function, registers $sp and $s3 have the same values they had when function f was called, and register $t5 can have an arbitrary value. For register $t5, note that although our function f does not modify it, function func is allowed to modify it, so we cannot assume anything about the value of $t5 after function func has been called.

Solution 2.20

2.20.1

a.
FACT: addi $sp, $sp, -8
      sw $ra, 4($sp)
      sw $a0, 0($sp)
      add $s0, $0, $a0
      slti $t0, $a0, 2
      beq $t0, $0, L1
      addi $v0, $0, 1
      addi $sp, $sp, 8
      jr $ra
L1:   addi $a0, $a0, -1
      jal FACT
      mul $v0, $s0, $v0
      lw $a0, 0($sp)
      lw $ra, 4($sp)
      addi $sp, $sp, 8
      jr $ra

b.
FACT: addi $sp, $sp, -8
      sw $ra, 4($sp)
      sw $a0, 0($sp)
      add $s0, $0, $a0
      slti $t0, $a0, 2
      beq $t0, $0, L1
      addi $v0, $0, 1
      addi $sp, $sp, 8
      jr $ra

2.20.2

a. 25 MIPS instructions to execute nonrecursive vs. 45 instructions to execute (corrected version of) recursion

Nonrecursive version:

FACT: addi $sp, $sp, -4
      sw $ra, 4($sp)
      add $s0, $0, $a0
      add $s2, $0, $1
LOOP: slti $t0, $s0, 2
      bne $t0, $0, DONE
      mul $s2, $s0, $s2
      addi $s0, $s0, -1
      j LOOP
DONE: add $v0, $0, $s2
      lw $ra, 4($sp)
      addi $sp, $sp, 4
      jr $ra

b. 25 MIPS instructions to execute nonrecursive vs. 45 instructions to execute (corrected version of) recursion

FACT: addi $sp, $sp, -4
      sw $ra, 4($sp)
      add $s0, $0, $a0
      add $s2, $0, $1
LOOP: slti $t0, $s0, 2
      bne $t0, $0, DONE
      mul $s2, $s0, $s2
      addi $s0, $s0, -1
      j LOOP
DONE: add $v0, $0, $s2
      lw $ra, 4($sp)
      addi $sp, $sp, 4
      jr $ra

2.20.3

a. Recursive version

FACT: addi $sp, $sp, -8
      sw $ra, 4($sp)
      sw $a0, 0($sp)
      add $s0, $0, $a0
HERE: slti $t0, $a0, 2
      beq $t0, $0, L1
      addi $v0, $0, 1
      addi $sp, $sp, 8
      jr $ra

at label HERE, after calling function FACT with input of 4:

old $sp => 0xnnnnnnnn ???

$sp => –8 contents of register $a0 at label HERE, after calling function FACT with input of 3:

–4 contents of register $ra –8 contents of register $a0 –12 contents of register $ra

–4 contents of register $ra –8 contents of register $a0 –12 contents of register $ra –16 contents of register $a0 –20 contents of register $ra

–4 contents of register $ra –8 contents of register $a0 –12 contents of register $ra –16 contents of register $a0 –20 contents of register $ra –24 contents of register $a0 –28 contents of register $ra

$sp => –32 contents of register $a0


b. Recursive version

FACT: addi $sp, $sp, -8
      sw $ra, 4($sp)
      sw $a0, 0($sp)
      add $s0, $0, $a0
HERE: slti $t0, $a0, 2
      beq $t0, $0, L1
      addi $v0, $0, 1
      addi $sp, $sp, 8
      jr $ra

at label HERE, after calling function FACT with input of 4:

–4 contents of register $ra –8 contents of register $a0 –12 contents of register $ra

–4 contents of register $ra –8 contents of register $a0 –12 contents of register $ra –16 contents of register $a0 –20 contents of register $ra

–4 contents of register $ra –8 contents of register $a0 –12 contents of register $ra –16 contents of register $a0 –20 contents of register $ra –24 contents of register $a0 –28 contents of register $ra


2.20.4

a.

FIB:  addi $sp, $sp, -12
      sw   $ra, 8($sp)
      sw   $s1, 4($sp)
      sw   $a0, 0($sp)
      slti $t0, $a0, 3
      beq  $t0, $0, L1
      addi $v0, $0, 1
      j    EXIT
L1:   addi $a0, $a0, -1
      jal  FIB
      add  $s1, $v0, $0     # keep fib(n-1); add, not addi, for a register operand
      addi $a0, $a0, -1
      jal  FIB
      add  $v0, $v0, $s1
EXIT: lw   $a0, 0($sp)
      lw   $s1, 4($sp)
      lw   $ra, 8($sp)
      addi $sp, $sp, 12
      jr   $ra

b.

FIB:  addi $sp, $sp, -12
      sw   $ra, 8($sp)
      sw   $s1, 4($sp)
      sw   $a0, 0($sp)
      slti $t0, $a0, 3
      beq  $t0, $0, L1
      addi $v0, $0, 1
      j    EXIT


2.20.5

a. 23 MIPS instructions to execute nonrecursive vs. 73 instructions to execute (corrected version of) recursion

FIB:  addi $sp, $sp, -4
      sw   $ra, ($sp)
      addi $s1, $0, 1
      addi $s2, $0, 1
LOOP: slti $t0, $a0, 3
      bne  $t0, $0, EXIT
      add  $s3, $s1, $0
      add  $s1, $s1, $s2
      add  $s2, $s3, $0
      addi $a0, $a0, -1
      j    LOOP
EXIT: add  $v0, $s1, $0
      lw   $ra, ($sp)
      addi $sp, $sp, 4
      jr   $ra

b. 23 MIPS instructions to execute nonrecursive vs. 73 instructions to execute (corrected version of) recursion

FIB:  addi $sp, $sp, -4
      sw   $ra, ($sp)
      addi $s1, $0, 1
      addi $s2, $0, 1
LOOP: slti $t0, $a0, 3
      bne  $t0, $0, EXIT
      add  $s3, $s1, $0
      add  $s1, $s1, $s2
      add  $s2, $s3, $0
      addi $a0, $a0, -1
      j    LOOP
EXIT: add  $v0, $s1, $0
      lw   $ra, ($sp)
      addi $sp, $sp, 4
      jr   $ra


2.20.6

a. recursive version

FIB:  addi $sp, $sp, -12
      sw   $ra, 8($sp)
      sw   $s1, 4($sp)
      sw   $a0, 0($sp)
HERE: slti $t0, $a0, 3
      beq  $t0, $0, L1
      addi $v0, $0, 1
      j    EXIT

at label HERE, after calling function FIB with input of 4:

           -4  contents of register $ra
           -8  contents of register $s1
$sp =>    -12  contents of register $a0

b. recursive version

FIB:  addi $sp, $sp, -12
      sw   $ra, 8($sp)
      sw   $s1, 4($sp)
      sw   $a0, 0($sp)
HERE: slti $t0, $a0, 3
      beq  $t0, $0, L1
      addi $v0, $0, 1
      j    EXIT

at label HERE, after calling function FIB with input of 4:

           -4  contents of register $ra
           -8  contents of register $s1
$sp =>    -12  contents of register $a0


Solution 2.21

2.21.1

a. after entering function main:

$sp =>  -4  contents of register $ra

after entering function leaf_function:

$sp =>  -8  contents of register $ra (return to main)

b. after entering function main:

$sp =>  -4  contents of register $ra

after entering function my_function:

$sp =>  -8  contents of register $ra (return to main)

global pointers:

0x10008000  100  my_global

2.21.2

a.

MAIN: addi $sp, $sp, -4
      sw   $ra, ($sp)
      addi $a0, $0, 1
      jal  LEAF
      lw   $ra, ($sp)
      addi $sp, $sp, 4
      jr   $ra
LEAF: addi $sp, $sp, -8
      sw   $ra, 4($sp)
      sw   $s0, 0($sp)
      addi $s0, $a0, 1
      slti $t2, $a0, 6      # $t2 = 1 while the argument is <= 5
      beq  $t2, $0, DONE    # argument > 5: return
      add  $a0, $s0, $0
      jal  LEAF
DONE: add  $v0, $s0, $0
      lw   $s0, 0($sp)
      lw   $ra, 4($sp)
      addi $sp, $sp, 8
      jr   $ra


b.

MAIN: addi $sp, $sp, -4
      sw   $ra, ($sp)
      addi $a0, $0, 10
      addi $t1, $0, 20
      lw   $a1, ($s0)       # assume $s0 has global variable base
      jal  FUNC
      add  $t2, $v0, $0
      lw   $ra, ($sp)
      addi $sp, $sp, 4
      jr   $ra
FUNC: sub  $v0, $a0, $a1
      jr   $ra

2.21.3

a.

MAIN: addi $sp, $sp, -4
      sw   $ra, ($sp)
      addi $a0, $0, 1
      jal  LEAF
      lw   $ra, ($sp)
      addi $sp, $sp, 4
      jr   $ra
LEAF: addi $sp, $sp, -8
      sw   $ra, 4($sp)
      sw   $s0, 0($sp)
      addi $s0, $a0, 1
      slti $t2, $a0, 6      # $t2 = 1 while the argument is <= 5
      beq  $t2, $0, DONE    # argument > 5: return
      add  $a0, $s0, $0
      jal  LEAF
DONE: add  $v0, $s0, $0
      lw   $s0, 0($sp)
      lw   $ra, 4($sp)
      addi $sp, $sp, 8
      jr   $ra

b.

MAIN: addi $sp, $sp, -4
      sw   $ra, ($sp)
      addi $a0, $0, 10
      addi $t1, $0, 20
      lw   $a1, ($s0)       # assume $s0 has global variable base
      jal  FUNC
      add  $t2, $v0, $0
      lw   $ra, ($sp)
      addi $sp, $sp, 4
      jr   $ra
FUNC: sub  $v0, $a0, $a1
      jr   $ra


2.21.4

a. Register $s0 is used to hold a temporary result without saving $s0 first. To correct this problem, $t0 (or $v0) should be used in place of $s0 in the first two instructions. Note that a sub-optimal solution would be to continue using $s0, but add code to save/restore it.

b. The two addi instructions move the stack pointer in the wrong direction. Note that the MIPS calling convention requires the stack to grow down. Even if the stack grew up, this code would be incorrect because $ra and $s0 are saved according to the stack-grows-down convention.

2.21.5

a. int f(int a, int b, int c, int d){
       return 2*(a-d)+c-b;
   }

b. int f(int a, int b, int c){
       return g(a,b)+c;
   }

2.21.6

a. The function returns 842 (which is 2 × (1 - 30) + 1000 - 100).

b. The function returns 1500 (g(a, b) is 500, so it returns 500 + 1000).
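
The arithmetic in part a can be replayed directly. The sketch below is a Python transcription of the C function from 2.21.5 a; the argument values (a = 1, b = 100, c = 1000, d = 30) are read off the expression quoted above and are an assumption about the original call.

```python
def f(a, b, c, d):
    # Python transcription of the C function given in 2.21.5 a
    return 2 * (a - d) + c - b


# Argument values inferred from the expression 2 x (1 - 30) + 1000 - 100
value = f(1, 100, 1000, 30)
print(value)
```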

Solution 2.22

2.22.1

a. 65 32 98 121 116 101

b. 99 111 109 112 117 116 101 114

2.22.2

a. U+0041, U+0020, U+0062, U+0079, U+0074, U+0065

b. U+0063, U+006f, U+006d, U+0070, U+0075, U+0074, U+0065, U+0072
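
These byte values and code points can be regenerated mechanically. The sketch below assumes the two strings are "A byte" and "computer", which is inferred from the listed values rather than stated in this section; the helper names are illustrative only. Note that the space character is 32 in decimal and 0x20 in hexadecimal.

```python
def ascii_decimal(text):
    # Decimal ASCII value of each character
    return [ord(ch) for ch in text]


def unicode_code_points(text):
    # Code points in the U+xxxx notation used in 2.22.2
    return ["U+%04x" % ord(ch) for ch in text]


for s in ("A byte", "computer"):   # strings are an assumption, see above
    print(ascii_decimal(s))
    print(unicode_code_points(s))
```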

2.22.3

a. add b. shift


Solution 2.23

2.23.1

a.

MAIN:  addi $sp, $sp, -4
       sw   $ra, ($sp)
       addi $t6, $0, 0x30    # '0'
       addi $t7, $0, 0x39    # '9'
       add  $s0, $0, $0
       add  $t0, $a0, $0
LOOP:  lb   $t1, ($t0)
       slt  $t2, $t1, $t6
       bne  $t2, $0, DONE
       slt  $t2, $t7, $t1
       bne  $t2, $0, DONE
       sub  $t1, $t1, $t6
       beq  $s0, $0, FIRST
       mul  $s0, $s0, 10
FIRST: add  $s0, $s0, $t1
       addi $t0, $t0, 1
       j    LOOP
DONE:  add  $v0, $s0, $0
       lw   $ra, ($sp)
       addi $sp, $sp, 4
       jr   $ra

b.

MAIN:  addi $sp, $sp, -4
       sw   $ra, ($sp)
       addi $t4, $0, 0x41    # 'A'
       addi $t5, $0, 0x46    # 'F'
       addi $t6, $0, 0x30    # '0'
       addi $t7, $0, 0x39    # '9'
       add  $s0, $0, $0
       add  $t0, $a0, $0
LOOP:  lb   $t1, ($t0)
       slt  $t2, $t1, $t6
       bne  $t2, $0, DONE
       slt  $t2, $t7, $t1
       bne  $t2, $0, HEX
       sub  $t1, $t1, $t6
       j    DEC
HEX:   slt  $t2, $t1, $t4
       bne  $t2, $0, DONE
       slt  $t2, $t5, $t1
       bne  $t2, $0, DONE
       sub  $t1, $t1, $t4
       addi $t1, $t1, 10
DEC:   beq  $s0, $0, FIRST
       mul  $s0, $s0, 16     # hexadecimal digits: scale by 16, not 10
FIRST: add  $s0, $s0, $t1
       addi $t0, $t0, 1
       j    LOOP
DONE:  add  $v0, $s0, $0
       lw   $ra, ($sp)
       addi $sp, $sp, 4
       jr   $ra


Solution 2.24

2.24.1

a. 0x00000012 b. 0x12ffffff

2.24.2

a. 0x00000080 b. 0x80000000

2.24.3

a. 0x00000011 b. 0x11555555

Solution 2.25

2.25.1 Generally, all solutions are similar:

lui $t1, top_16_bits

ori $t1, $t1, bottom_16_bits
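
As a rough illustration of the lui/ori pattern, the Python sketch below splits a 32-bit constant into the two 16-bit immediates and then recombines them the way the two instructions would. The helper names are hypothetical, and 0x12345678 is used only because it appears in 2.25.6.

```python
MASK32 = 0xFFFFFFFF


def split_constant(value):
    # Return (top_16_bits, bottom_16_bits) of a 32-bit constant
    value &= MASK32
    return value >> 16, value & 0xFFFF


def lui_ori(top_16_bits, bottom_16_bits):
    reg = (top_16_bits << 16) & MASK32   # lui: immediate goes to the upper half, lower half is zero
    reg |= bottom_16_bits & 0xFFFF       # ori: immediate is zero-extended, so the upper half is untouched
    return reg


top, bottom = split_constant(0x12345678)
print(hex(top), hex(bottom), hex(lui_ori(top, bottom)))
```

Using ori rather than addi for the lower half matters because addi would sign-extend a bottom half whose bit 15 is set.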

2.25.2 Jump can go up to 0x0FFFFFFC.

a. no b. no

2.25.3 Range is 0x604 + 0x1FFFC = 0x0002 0600 to 0x604 − 0x20000 = 0xFFFE 0604.

a. no b. yes

2.25.4 Range is 0x0042 0600 to 0x003E 0600.

a. no b. no


2.25.5 Generally, all solutions are similar:

add  $t1, $zero, $zero          #clear $t1
addi $t2, $zero, top_8_bits     #set top 8b
sll  $t2, $t2, 24               #shift left 24 spots
or   $t1, $t1, $t2              #place top 8b into $t1
addi $t2, $zero, nxt1_8_bits    #set next 8b
sll  $t2, $t2, 16               #shift left 16 spots
or   $t1, $t1, $t2              #place next 8b into $t1
addi $t2, $zero, nxt2_8_bits    #set next 8b
sll  $t2, $t2, 8                #shift left 8 spots
or   $t1, $t1, $t2              #place next 8b into $t1
ori  $t1, $t1, bot_8_bits       #or in bottom 8b
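
The byte-at-a-time construction can be modeled in a few lines. This is only a sketch: the function name is invented, each piece is assumed to be an unsigned 8-bit value, and the three shift amounts are taken to be 24, 16, and 8 so that each byte lands in its own position.

```python
MASK32 = 0xFFFFFFFF


def build_from_bytes(top, nxt1, nxt2, bot):
    t1 = 0                                   # add $t1, $zero, $zero
    for piece, shift in ((top, 24), (nxt1, 16), (nxt2, 8)):
        t2 = piece & 0xFF                    # addi $t2, $zero, piece
        t2 = (t2 << shift) & MASK32          # sll  $t2, $t2, shift
        t1 |= t2                             # or   $t1, $t1, $t2
    t1 |= bot & 0xFF                         # ori  $t1, $t1, bot
    return t1


print(hex(build_from_bytes(0x12, 0x34, 0x56, 0x78)))
```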

2.25.6

a. 0x12345678 b. 0x12340000

2.25.7

a. t0 = (0x1234 << 16) | 0x5678;

b. t0 = (t0 | 0x5678);

   t0 = 0x1234 << 16;

Solution 2.26

2.26.1 Branch range is 0x00020000 to 0xFFFE0004.

a. one branch b. three branches

2.26.2

a. one b. can’t be done

2.26.3 Branch range is 0x00000200 to 0xFFFFFE04.

a. eight branches b. 512 branches


2.26.4

a. branch range is 16x larger b. branch range is 16x smaller

2.26.5

a. no change

b. jump to addresses 0 to 2¹² instead of 0 to 2²⁸, assuming the PC<0x08000000

2.26.6

a. rs field now 3 bits b. no change

Solution 2.27

2.27.1

a. jump register b. beq

2.27.2

a. R-type b. I-type

2.27.3

a. + can jump to any 32b address

– need to load a register with a 32b address, which could take multiple cycles

b. + allows the PC to be set to the current PC + 4 +/– BranchAddr, supporting quick forward and backward branches

– range of branches is smaller than large programs

2.27.4

a. Address      Instruction             Encoding
   0x00000000   lui $s0, 100            0x3c100100
   0x00000004   ori $s0, $s0, 40        0x36100028

b. Address      Instruction             Encoding
   0x00000100   addi $t0, $0, 0x0000    0x20080000
   0x00000104   lw $t1, 0x4000($t0)     0x8d094000
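
The four machine words can be reproduced from the standard MIPS I-type layout (6-bit opcode, 5-bit rs, 5-bit rt, 16-bit immediate). The Python sketch below is illustrative only; the helper name and register table are made up. It also assumes the lui operand written as 100 is hexadecimal (0x0100), since that is the value present in the listed word 0x3c100100.

```python
# Register numbers and opcodes from the standard MIPS encoding
REG = {"$0": 0, "$t0": 8, "$t1": 9, "$s0": 16}
OPCODE = {"addi": 0x08, "ori": 0x0D, "lui": 0x0F, "lw": 0x23}


def encode_i_type(op, rs, rt, imm):
    # opcode[31:26] | rs[25:21] | rt[20:16] | immediate[15:0]
    return (OPCODE[op] << 26) | (REG[rs] << 21) | (REG[rt] << 16) | (imm & 0xFFFF)


words = [
    encode_i_type("lui", "$0", "$s0", 0x0100),   # assumed hex reading of "100"
    encode_i_type("ori", "$s0", "$s0", 40),
    encode_i_type("addi", "$0", "$t0", 0x0000),
    encode_i_type("lw", "$t0", "$t1", 0x4000),   # base register goes in rs
]
print([hex(w) for w in words])
```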


2.27.5

a. addi $s0, $zero, 0x80
   sll  $s0, $s0, 17
   ori  $s0, $s0, 40

b. addi $t0, $0, 0x0040
   sll  $t0, $t0, 8
   lw   $t1, 0($t0)

2.27.6

a. 1 b. 1

Solution 2.28

2.28.1

a. 4 instructions

2.28.2

a. One of the locations specified by the LL instruction has no corresponding SC instruction.

2.28.3

a.

try:  MOV  R3,R4
      MOV  R6,R7
      LL   R2,0(R2)
      # adjustment or test code here
      SC   R3,0(R2)
      BEQZ R3,try
try2: LL   R5,0(R1)
      # adjustment or test code here
      SC   R6,0(R1)
      BEQZ R6,try2
      MOV  R4,R2
      MOV  R7,R5


2.28.4

a.

                                             Processor 1     Mem      Processor 2
Processor 1        Processor 2       Cycle   $t1     $t0     ($s1)    $t1     $t0
                                     0       1       2       99       30      40
ll $t1, 0($s1)     ll $t1, 0($s1)    1       99      2       99       99      40
sc $t0, 0($s1)                       2       99      1       2        99      40
                   sc $t0, 0($s1)    3       99      1       2        99      0

b.

                                                         Processor 1             Mem      Processor 2
Processor 1              Processor 2             Cycle   $s4     $t1     $t0     ($s1)    $s4     $t1     $t0
                                                 0       2       3       4       99       10      20      30
                         try: add $t0, $0, $s4   1       2       3       4       99       10      20      10
try: add $t0, $0, $s4    ll $t1, 0($s1)          2       2       3       2       99       10      99      10
ll $t1, 0($s1)                                   3       2       99      2       99       10      99      10
sc $t0, 0($s1)                                   4       2       99      1       2        10      99      10
beqz $t0, try            sc $t0, 0($s1)          5       2       99      1       2        10      99      0
add $s4, $0, $t1         beqz $t0, try           6       99      99      1       2        10      99      0
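
The first trace (both processors execute ll, then Processor 1 and Processor 2 execute sc in turn) can be replayed with a toy model of load-linked/store-conditional. The class below is a simplified sketch, not a description of real hardware: it tracks one memory word and a set of processors holding a valid link, and it assumes any successful store breaks every outstanding link.

```python
class LinkedWord:
    """Toy model of a single memory word with per-processor link flags."""

    def __init__(self, value):
        self.value = value
        self.linked = set()

    def ll(self, cpu):
        self.linked.add(cpu)          # remember that this processor holds a link
        return self.value

    def sc(self, cpu, new_value):
        if cpu not in self.linked:    # link was broken by another store
            return 0
        self.value = new_value
        self.linked.clear()           # a successful store breaks all links
        return 1


mem = LinkedWord(99)
p1_t1 = mem.ll("P1")                  # cycle 1
p2_t1 = mem.ll("P2")                  # cycle 1
p1_t0 = mem.sc("P1", 2)               # cycle 2: P1 stores its $t0 (2)
p2_t0 = mem.sc("P2", 40)              # cycle 3: P2 tries to store its $t0 (40)
print(p1_t1, p1_t0, mem.value, p2_t1, p2_t0)
```

The final state matches the last row of that trace: Processor 1 sees its sc succeed and memory holds 2, while Processor 2 gets 0 back and memory is left unchanged.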

Solution 2.29

2.29.1 The critical section can be implemented as:

trylk: li   $t1,1
       ll   $t0,0($a0)
       bnez $t0,trylk
       sc   $t1,0($a0)
       beqz $t1,trylk
       operation
       sw   $zero,0($a0)

Where operation is implemented as:

a. lw   $t0,0($a1)
   add  $t0,$t0,$a2
   sw   $t0,0($a1)

b.    lw   $t0,0($a1)
      sge  $t1,$t0,$a2
      bnez $t1,skip
      sw   $a2,0($a1)
skip:


2.29.2 The entire critical section is now:

a.

try:  ll   $t0,0($a1)
      add  $t0,$t0,$a2
      sc   $t0,0($a1)
      beqz $t0,try

b.

try:  ll   $t0,0($a1)
      sge  $t1,$t0,$a2
      bnez $t1,skip
      mov  $t0,$a2
      sc   $t0,0($a1)
      beqz $t0,try
skip:

2.29.3 The code that directly uses ll/sc to update shvar avoids the entire lock/unlock code. When SC is executed, this code needs 1) one extra instruction to check the outcome of SC, and 2) if the register used for SC is needed again, an instruction to copy its value. However, these two additional instructions may not be needed, e.g., if SC is not on the best-case path or if it uses a register whose value is no longer needed. We have:

     Lock-based    Direct LL/SC implementation
a.   6+3           4
b.   6+3           3

2.29.4

a. Both processors attempt to execute SC at the same time, but one of them completes the write first. The other’s SC detects this and its SC operation fails.

b. It is possible for one or both processors to complete this code without ever reaching the SC instruction. If only one executes SC, it completes successfully. If both reach SC, they do so in the same cycle, but one SC completes first and then the other detects this and fails.

2.29.5 Every processor has a different set of registers, so a value in a register cannot be shared. Therefore, the shared variable shvar must be kept in memory, loaded each time its value is needed, and stored each time a task wants to change it. For local variable x there is no such restriction. On the contrary, we want to minimize the time spent in the critical section (or between the LL and SC), so if variable x is in memory it should be loaded into a register before the critical section to avoid loading it during the critical section.

2.29.6 If we simply do two instances of the code from 2.29.2 one after the other (to update one shared variable and then the other), each update is performed atomically, but the entire two-variable update is not atomic, i.e., after the update to the first variable and before the update to the second variable, another process can perform its own update of one or both variables. If we attempt to do two LLs (one for each variable), compute their new values, and then do two SC instructions (again, one for each variable), the second LL causes the SC that corresponds to the first LL to fail (we have a LL and SC with a non-register-register instruction executed between them). As a result, this code can never successfully complete.
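
The failure mode described in 2.29.6 can be sketched with a toy processor that has a single link register, so a second LL replaces the link set by the first. This is a simplified model under that stated assumption, not a specification of MIPS hardware; the class, function, and addresses are invented for illustration, and the retry loop is bounded only so the example terminates.

```python
class Cpu:
    """Toy processor with one link register shared by all LL/SC pairs."""

    def __init__(self, memory):
        self.memory = memory
        self.link_addr = None

    def ll(self, addr):
        self.link_addr = addr             # a new LL replaces any earlier link
        return self.memory[addr]

    def sc(self, addr, value):
        ok = self.link_addr == addr
        if ok:
            self.memory[addr] = value
        self.link_addr = None             # every SC clears the link
        return 1 if ok else 0


def update_both(cpu, addr_a, addr_b, max_tries=5):
    # Attempt LL a, LL b, SC a, SC b; return the attempt that succeeded, or None
    for attempt in range(1, max_tries + 1):
        a = cpu.ll(addr_a)
        b = cpu.ll(addr_b)
        if cpu.sc(addr_a, a + 1) and cpu.sc(addr_b, b + 1):
            return attempt
    return None


memory = {0x100: 10, 0x104: 20}
attempts = update_both(Cpu(memory), 0x100, 0x104)
print(attempts, memory)
```

In this model the SC to the first variable fails on every attempt because the link register already points at the second variable, so neither value is ever updated.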

Solution 2.30

2.30.1

a. add  $t1, $t2, $0

b. addi $t0, $0, small
   beq  $t1, $t0, LOOP

2.30.2

a. Yes. The address of v is not known until the data segment is built at link time.

b. No. The branch displacement does not depend on the placement of the instruction in the text segment.

Solution 2.31

2.31.1

a.

Text Size    0x440
Data Size    0x90

Text    Address       Instruction
        0x00400000    lw $a0, 0x8000($gp)
        0x00400004    jal 0x0400140
        …             …
        0x00400140    sw $a1, 0x8040($gp)
        0x00400144    jal 0x0400000
        …             …
Data    0x10000000    (X)
        …             …
        0x10000040    (Y)


b.

Text Size    0x440
Data Size    0x90

Text    Address       Instruction
        0x00400000    lui $at, 0x1000
        0x00400004    ori $a0, $at, 0
        0x00400008    jal 0x0400140
        …             …
        0x00400140    sw $a0, 8040($gp)
        0x00400144    jmp 0x04002C0
        …             …
        0x004002C0    jr $ra
        …             …
Data    0x10000000    (X)
        …             …
        0x10000040    (Y)

2.31.2 0x8000 data, 0xFC00000 text. However, because of the size of the beq immediate field, 2¹⁸ words is a more practical program limitation.

2.31.3 The limitation on the sizes of the displacement and address fields in the instruction encoding may make it impossible to use branch and jump instructions for objects that are linked too far apart.

Solution 2.32

2.32.1

a. swap:
     sll $t0,$a1,2
     add $t0,$t0,$a0
     lw  $t2,0($t0)
     sll $t1,$a2,2
     add $t1,$t1,$a0
     lw  $t3,0($t1)
     sw  $t3,0($t0)
     sw  $t2,0($t1)
     jr  $ra

b. swap:
     lw  $t0,0($a0)
     lw  $t1,4($a0)
     sw  $t1,0($a0)
     sw  $t0,4($a0)
     jr  $ra


2.32.2

a. Pass j+1 as a third parameter to swap. We can do this by adding an “addi $a2,$a1,1” instruction right before “jal swap”.

b. Pass the address of v[j] to swap. Since that address is already in $t2 at the point when we want to call swap, we can replace the two parameter-passing instructions before “jal swap” with a simple “mov $a0,$t2”.

2.32.3

a. swap:
     add $t0,$a1,$a0   ; No sll, so index $a1 is added directly
     lb  $t2,0($t0)    ; Byte-sized load
     add $t1,$a2,$a0   ; No sll, so index $a2 is added directly
     lb  $t3,0($t1)
     sb  $t3,0($t0)    ; Byte-sized store
     sb  $t2,0($t1)
     jr  $ra

b. swap:
     lb  $t0,0($a0)    ; Byte-sized load
     lb  $t1,1($a0)    ; Offset is 1, not 4
     sb  $t1,0($a0)    ; Byte-sized store
     sb  $t0,1($a0)
     jr  $ra

2.32.4

a. Yes, we must save the additional s-registers. Also, the code for sort() in Figure 2.27 is using 5 t-registers and only 4 s-registers remain. Fortunately, we can easily reduce this number, e.g., by using t1 instead of t0 for loop comparisons.

b. No change to saving/restoring code is needed because the same s-registers are used in the modified sort() code.

2.32.5 When the array is already sorted, the inner loop always exits in its first iteration, as soon as it compares v[j] with v[j+1]. We have:

a. We need 4 more instructions to save and 4 more to restore registers. The number of instructions in the rest of the code is the same, so there are exactly 8 more instructions executed in the modified sort(), regardless of how large the array is.

b. One fewer instruction is executed in each iteration of the inner loop. Because the array is already sorted, the inner loop always exits during its first iteration, so we save one instruction per iteration of the outer loop. Overall, we execute 10 fewer instructions.

2.32.6 When the array is sorted in reverse order, the inner loop always executes the maximum number of iterations and swap is called in each iteration of the inner loop (a total of 45 times). We have:

a. This change only affects the number of instructions needed to save/restore registers in swap(), so the answer is the same as in Problem 2.32.5.
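
The swap count quoted in 2.32.6 can be checked with a short model of the sort. The sketch below assumes a 10-element array (inferred from 45 = 10 × 9 / 2) and assumes the loop structure of the textbook's sort(), where the inner loop walks j downward from i - 1 while v[j] > v[j+1]. The function name is made up for illustration.

```python
def sort_counting_swaps(values):
    # Sort in the assumed style of sort(); return (sorted list, number of swap calls)
    v = list(values)
    swaps = 0
    for i in range(len(v)):
        j = i - 1
        while j >= 0 and v[j] > v[j + 1]:
            v[j], v[j + 1] = v[j + 1], v[j]   # swap(v, j)
            swaps += 1
            j -= 1
    return v, swaps


reverse_sorted = list(range(10, 0, -1))
already_sorted = list(range(1, 11))
print(sort_counting_swaps(reverse_sorted)[1])
print(sort_counting_swaps(already_sorted)[1])
```

Under these assumptions the reverse-sorted input triggers a swap for every pair of elements, and the already-sorted input never calls swap because the inner loop exits at its first comparison.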