Solution 1.1
1.1.1 Computer used to run large problems and usually accessed via a network: 5 supercomputers
1.1.2 10^15 or 2^50 bytes: 7 petabyte
1.1.3 Computer composed of hundreds to thousands of processors and terabytes of memory: 3 servers
1.1.4 Today’s science fiction application that probably will be available in near future: 1 virtual worlds
1.1.5 A kind of memory called random access memory: 12 RAM
1.1.6 Part of a computer called central processor unit: 13 CPU
1.1.7 Thousands of processors forming a large cluster: 8 datacenters
1.1.8 A microprocessor containing several processors in the same chip: 10 multicore processors
1.1.9 Desktop computer without screen or keyboard usually accessed via a network: 4 low-end servers
1.1.10 Currently the largest class of computer that runs one application or one set of related applications: 9 embedded computers
1.1.11 Special language used to describe hardware components: 11 VHDL
1.1.12 Personal computer delivering good performance to single users at low cost: 2 desktop computers
1.1.13 Program that translates statements in high-level language to assembly language: 15 compiler
1.1.14 Program that translates symbolic instructions to binary instructions: 21 assembler
1.1.15 High-level language for business data processing: 25 cobol
1.1.16 Binary language that the processor can understand: 19 machine language
1.1.17 Commands that the processors understand: 17 instruction
1.1.18 High-level language for scientific computation: 26 fortran
1.1.19 Symbolic representation of machine instructions: 18 assembly language
1.1.20 Interface between user’s program and hardware providing a variety of services and supervision functions: 14 operating system
1.1.21 Software/programs developed by the users: 24 application software
1.1.22 Binary digit (value 0 or 1): 16 bit
1.1.23 Software layer between the application software and the hardware that includes the operating system and the compilers: 23 system software
1.1.24 High-level language used to write application and system software: 20 C
1.1.25 Portable language composed of words and algebraic expressions that must be translated into assembly language before run in a computer: 22 high-level language
1.1.26 10^12 or 2^40 bytes: 6 terabyte
Solution 1.2
1.2.1 8 bits × 3 colors = 24 bits/pixel (3 bytes), stored here as 4 bytes/pixel, which assumes each pixel is padded to a 32-bit word. 1280 × 800 pixels = 1,024,000 pixels. 1,024,000 pixels × 4 bytes/pixel = 4,096,000 bytes (approx. 4 Mbytes).
1.2.2 2 GB = 2000 Mbytes. No. frames = 2000 Mbytes/4 Mbytes = 500 frames.
1.2.3 Network speed: 1 gigabit network ==> 1 gigabit per second = 125 Mbytes/second. File size: 256 Kbytes = 0.256 Mbytes. Time for 0.256 Mbytes = 0.256/125 s = 2.048 ms.
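The arithmetic in 1.2.1 and 1.2.3 can be checked with a short Python sketch. The variable names are illustrative, the 4 bytes/pixel figure is taken from the solution as given, and decimal units (1 Mbyte = 10^6 bytes) are an assumption.

```python
# Frame buffer size (1.2.1) and network transfer time (1.2.3).
BYTES_PER_PIXEL = 4          # value used in the solution above
WIDTH, HEIGHT = 1280, 800

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL

# 1 gigabit/s expressed in bytes/s (assumes 10^9 bits per gigabit).
link_bytes_per_s = 1e9 / 8
file_bytes = 256e3           # 256 Kbytes, decimal units assumed
transfer_ms = file_bytes / link_bytes_per_s * 1e3

print(frame_bytes)           # frame buffer size in bytes
print(transfer_ms)           # transfer time in milliseconds
```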
1.2.4 2 microseconds from cache ==> 20 microseconds from DRAM. 20 microseconds from DRAM ==> 2 seconds from magnetic disk. 20 microseconds from DRAM ==> 2 ms from flash memory.
Solution 1.3
1.3.1 P2 has the highest performance
performance of P1 (instructions/sec) = 2 × 10^9/1.5 = 1.33 × 10^9
performance of P2 (instructions/sec) = 1.5 × 10^9/1.0 = 1.5 × 10^9
performance of P3 (instructions/sec) = 3 × 10^9/2.5 = 1.2 × 10^9
1.3.2 No. cycles = time × clock rate
cycles(P1) = 10 × 2 × 10^9 = 20 × 10^9
cycles(P2) = 10 × 1.5 × 10^9 = 15 × 10^9
cycles(P3) = 10 × 3 × 10^9 = 30 × 10^9
time = (No. instr. × CPI)/clock rate, then No. instructions = No. cycles/CPI
instructions(P1) = 20 × 10^9/1.5 = 13.33 × 10^9
instructions(P2) = 15 × 10^9/1 = 15 × 10^9
instructions(P3) = 30 × 10^9/2.5 = 12 × 10^9
1.3.3 time_new = time_old × 0.7 = 7 s
CPI_new = CPI_old × 1.2, then CPI(P1) = 1.8, CPI(P2) = 1.2, CPI(P3) = 3
ƒ = No. instr. × CPI/time, then
ƒ(P1) = 13.33 × 10^9 × 1.8/7 = 3.42 GHz
ƒ(P2) = 15 × 10^9 × 1.2/7 = 2.57 GHz
ƒ(P3) = 12 × 10^9 × 3/7 = 5.14 GHz
1.3.4 IPC = 1/CPI = No. instr./(time × clock rate)
IPC(P1) = 1.42
IPC(P2) = 2
IPC(P3) = 3.33
1.3.5 Time_new/Time_old = 7/10 = 0.7. So ƒ_new = ƒ_old/0.7 = 1.5 GHz/0.7 = 2.14 GHz.
1.3.6 Time_new/Time_old = 9/10 = 0.9.
So Instructions_new = Instructions_old × 0.9 = 30 × 10^9 × 0.9 = 27 × 10^9.
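The steps of 1.3.1 to 1.3.3 all rearrange the same performance equation, time = instructions × CPI / clock rate. The following Python sketch shows that under the clock rates and CPIs used in this solution; the function and dictionary names are illustrative, not part of the exercise.

```python
# CPU performance equation: time = instructions * CPI / clock_rate.
# Clock rates and CPIs are the ones used in Solution 1.3.
processors = {
    "P1": {"clock_hz": 2.0e9, "cpi": 1.5},
    "P2": {"clock_hz": 1.5e9, "cpi": 1.0},
    "P3": {"clock_hz": 3.0e9, "cpi": 2.5},
}

def instructions_per_second(clock_hz, cpi):
    return clock_hz / cpi

def instructions_in(seconds, clock_hz, cpi):
    # cycles = time * clock rate; instructions = cycles / CPI
    return seconds * clock_hz / cpi

def required_clock_hz(instructions, cpi, seconds):
    # Rearranged performance equation: f = instructions * CPI / time
    return instructions * cpi / seconds

for name, p in processors.items():
    ips = instructions_per_second(p["clock_hz"], p["cpi"])
    n = instructions_in(10, p["clock_hz"], p["cpi"])
    # 1.3.3: CPI grows by 20% and the time target is 7 s.
    f_new = required_clock_hz(n, p["cpi"] * 1.2, 7)
    print(f"{name}: {ips:.3g} instr/s, {n:.4g} instr, {f_new / 1e9:.2f} GHz")
```

Because the script carries full precision through each step, its last digits can differ slightly from hand-rounded intermediate values.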
Solution 1.4
1.4.1 P2
Class A: 10^5 instr.
Class B: 2 × 10^5 instr.
Class C: 5 × 10^5 instr.
Class D: 2 × 10^5 instr.
Time = No. instr. × CPI/clock rate
P1: Time class A = 0.66 × 10^-4
    Time class B = 2.66 × 10^-4
    Time class C = 10 × 10^-4
    Time class D = 5.33 × 10^-4
    Total time P1 = 18.65 × 10^-4
P2: Time class A = 10^-4
    Time class B = 2 × 10^-4
    Time class C = 5 × 10^-4
    Time class D = 3 × 10^-4
    Total time P2 = 11 × 10^-4
1.4.2 CPI = time × clock rate/No. instr.
CPI(P1) = 18.65 × 10^-4 × 1.5 × 10^9/10^6 = 2.79
CPI(P2) = 11 × 10^-4 × 2 × 10^9/10^6 = 2.2
1.4.3
clock cycles(P1) = 10^5 × 1 + 2 × 10^5 × 2 + 5 × 10^5 × 3 + 2 × 10^5 × 4 = 28 × 10^5
clock cycles(P2) = 10^5 × 2 + 2 × 10^5 × 2 + 5 × 10^5 × 2 + 2 × 10^5 × 3 = 22 × 10^5
1.4.4
(500 × 1 + 50 × 5 + 100 × 5 + 50 × 2) × 0.5 × 10^-9 = 675 ns
1.4.5 CPI = time × clock rate/No. instr.
CPI = 675 × 10^-9 × 2 × 10^9/700 = 1.92
1.4.6
Time = (500 × 1 + 50 × 5 + 50 × 5 + 50 × 2) × 0.5 × 10^-9 = 550 ns
Speed-up = 675 ns/550 ns = 1.22
CPI = 550 × 10^-9 × 2 × 10^9/650 = 1.69 (the instruction count drops from 700 to 650 once the number of loads is halved)
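The per-class bookkeeping of 1.4.1 to 1.4.3 can be sketched in Python as a weighted sum. The instruction counts, the per-class CPIs, and the 1.5 GHz and 2 GHz clock rates are read off the expressions in this solution; the helper names are illustrative.

```python
# Total clock cycles and average CPI from per-class instruction counts (1.4.1-1.4.3).
counts = {"A": 10**5, "B": 2 * 10**5, "C": 5 * 10**5, "D": 2 * 10**5}
cpi_p1 = {"A": 1, "B": 2, "C": 3, "D": 4}
cpi_p2 = {"A": 2, "B": 2, "C": 2, "D": 3}

def total_cycles(counts, cpi):
    # Sum of (instructions in class) x (CPI of class)
    return sum(counts[c] * cpi[c] for c in counts)

def average_cpi(counts, cpi):
    return total_cycles(counts, cpi) / sum(counts.values())

def exec_time(counts, cpi, clock_hz):
    return total_cycles(counts, cpi) / clock_hz

cycles_p1 = total_cycles(counts, cpi_p1)
cycles_p2 = total_cycles(counts, cpi_p2)
time_p1 = exec_time(counts, cpi_p1, 1.5e9)
time_p2 = exec_time(counts, cpi_p2, 2.0e9)
print(cycles_p1, cycles_p2, time_p1, time_p2)
```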
Solution 1.5
1.5.1
a. 1G, 0.75G inst/s b. 1G, 1.5G inst/s
1.5.2
a. P2 is 1.33 times faster than P1 b. P1 is 1.03 times faster than P2
1.5.3
a. P2 is 1.31 times faster than P1 b. P1 is 1.00 times faster than P2
1.5.4
a. 2.05 µs b. 1.93 µs
1.5.5
a. 0.71 µs b. 0.86 µs
1.5.6
a. 1.30 times faster b. 1.40 times faster
Solution 1.6
1.6.1
Compiler A CPI Compiler B CPI
a. 1.00 1.17
b. 0.80 0.58
1.6.2
a. 0.86 b. 1.37
1.6.3
Compiler A speed-up Compiler B speed-up
a. 1.52 1.77
b. 1.21 0.88
1.6.4
P1 peak P2 peak
a. 4G Inst/s 3G Inst/s
b. 4G Inst/s 3G Inst/s
1.6.5 Speed-up, P1 versus P2:
a. 0.967105263 b. 0.730263158
1.6.6
a. 6.204081633 b. 8.216216216
Solution 1.7
1.7.1
Geometric mean clock rate ratio = (1.28 × 1.56 × 2.64 × 3.03 × 10.00 × 1.80 × 0.74)^(1/7) = 2.15
Geometric mean power ratio = (1.24 × 1.20 × 2.06 × 2.88 × 2.59 × 1.37 × 0.92)^(1/7) = 1.62
1.7.2
Largest clock rate ratio = 2000 MHz/200 MHz = 10 (Pentium Pro to Pentium 4 Willamette)
Largest power ratio = 29.1 W/10.1 W = 2.88 (Pentium to Pentium Pro)
1.7.3
Clock rate: 2.667 × 10^9/(12.5 × 10^6) = 213.4
Power: 95 W/3.3 W = 28.79
1.7.4 C = P/(V^2 × clock rate)
80286: C = 0.0105 × 10^-6
80386: C = 0.01025 × 10^-6
80486: C = 0.00784 × 10^-6
Pentium: C = 0.00612 × 10^-6
Pentium Pro: C = 0.0133 × 10^-6
Pentium 4 Willamette: C = 0.0122 × 10^-6
Pentium 4 Prescott: C = 0.00183 × 10^-6
Core 2: C = 0.0294 × 10^-6
1.7.5 3.3/1.75 = 1.89 (Pentium Pro to Pentium 4 Willamette)
1.7.6
Pentium to Pentium Pro: 3.3/5 = 0.66
Pentium Pro to Pentium 4 Willamette: 1.75/3.3 = 0.53
Pentium 4 Willamette to Pentium 4 Prescott: 1.25/1.75 = 0.71
Pentium 4 Prescott to Core 2: 1.1/1.25 = 0.88
Geometric mean = 0.68
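The geometric means in 1.7.1 and 1.7.6 follow directly from the listed ratios. A minimal Python sketch, with an illustrative helper name and the ratios copied from this solution:

```python
import math

def geometric_mean(values):
    # n-th root of the product of n ratios
    return math.prod(values) ** (1.0 / len(values))

clock_ratios = [1.28, 1.56, 2.64, 3.03, 10.00, 1.80, 0.74]
power_ratios = [1.24, 1.20, 2.06, 2.88, 2.59, 1.37, 0.92]
voltage_ratios = [0.66, 0.53, 0.71, 0.88]

for label, ratios in [("clock", clock_ratios),
                      ("power", power_ratios),
                      ("voltage", voltage_ratios)]:
    print(label, round(geometric_mean(ratios), 2))
```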
Solution 1.8
1.8.1 Power_1 = V^2 × clock rate × C. Power_2 = 0.9 Power_1
C_2/C_1 = 0.9 × 5^2 × 0.5 × 10^9/(3.3^2 × 1 × 10^9) = 1.03
1.8.2 Power_2/Power_1 = (V_2^2 × clock rate_2)/(V_1^2 × clock rate_1)
Power_2/Power_1 = 0.87 => Reduction of 13%
1.8.3
Power_2 = V_2^2 × 1 × 10^9 × 0.8 × C_1 = 0.6 × Power_1
Power_1 = 5^2 × 0.5 × 10^9 × C_1
V_2^2 × 1 × 10^9 × 0.8 × C_1 = 0.6 × 5^2 × 0.5 × 10^9 × C_1
V_2 = ((0.6 × 5^2 × 0.5 × 10^9)/(1 × 10^9 × 0.8))^(1/2) = 3.06 V
1.8.4 Power_new = 1 × C_old × (V_old/2^(1/4))^2 × clock rate × 2^(1/2) = Power_old. Thus, power scales by 1.
1.8.5 1/2^(-1/2) = 2^(1/2)
1.8.6 Voltage = 1.1 × 1/2^(1/4) = 0.92 V. Clock rate = 2.667 × 2^(1/2) = 3.771 GHz
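The ratios in 1.8.1 to 1.8.3 all come from the dynamic power model P = C × V^2 × f. The Python sketch below re-derives them, using the 5 V / 0.5 GHz and 3.3 V / 1 GHz operating points that appear in the expressions of this solution; the function name is illustrative.

```python
import math

def dynamic_power(capacitance, voltage, clock_hz):
    # Dynamic power model used in Solution 1.8: P = C * V^2 * f
    return capacitance * voltage ** 2 * clock_hz

# 1.8.1: Power_2 = 0.9 * Power_1; solve for C_2/C_1.
cap_ratio = 0.9 * (5.0 ** 2 * 0.5e9) / (3.3 ** 2 * 1.0e9)

# 1.8.2: same capacitance on both sides, so it cancels in the ratio.
power_ratio = dynamic_power(1.0, 3.3, 1.0e9) / dynamic_power(1.0, 5.0, 0.5e9)

# 1.8.3: 0.8 * C_1 * V_2^2 * f_2 = 0.6 * Power_1; solve for V_2.
v2 = math.sqrt(0.6 * 5.0 ** 2 * 0.5e9 / (1.0e9 * 0.8))

print(round(cap_ratio, 2), round(power_ratio, 2), round(v2, 2))
```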
Solution 1.9
1.9.1
a. 1/49 × 100 = 2%
b. 45/120 × 100 = 37.5%
1.9.2
a. I_leak = 1/3.3 = 0.3 A
b. I_leak = 45/1.1 = 40.9 A
1.9.3
a. Power_st/Power_dyn = 1/49 = 0.02
b. Power_st/Power_dyn = 45/75 = 0.6
1.9.4 Power_st/Power_dyn = 0.6 => Power_st = 0.6 × Power_dyn
a. Power_st = 0.6 × 40 W = 24 W
b. Power_st = 0.6 × 30 W = 18 W
1.9.5
a. I_lk = 24/0.8 = 30 A
b. I_lk = 18/0.8 = 22.5 A
1.9.6
    Power_st at 1.0 V   I_lk at 1.0 V   Power_st at 1.2 V   I_lk at 1.2 V   Larger
a.  119 W               119 A           136 W               113.3 A         I_lk at 1.0 V
b.  93.5 W              93.5 A          110.5 W             92.1 A          I_lk at 1.0 V
Solution 1.10
1.10.1
a. Processors Instructions per processor Total instructions
1 4096 4096
2 2048 4096
4 1024 4096
8 512 4096
b. Processors Instructions per processor Total instructions
1 4096 4096
2 2278 4556
4 1464 5856
8 1132 9056
1.10.2
a. Processors Execution time (µs)
1 4.096
2 2.048
4 1.024
8 0.512
b. Processors Execution time (µs)
1 4.096
2 3.203
4 3.164
8 3.582
1.10.3
a. Processors Execution time (µs)
1 5.376
2 2.688
4 1.344
8 0.672
b. Processors Execution time (µs)
1 5.376
2 3.878
4 3.564
8 3.882
1.10.4
a. Cores Execution time (s) @ 3 GHz
1 4.00
2 2.17
4 1.25
8 0.75
b. Cores Execution time (s) @ 3 GHz
1 4.00
2 2.00
4 1.00
8 0.50
1.10.5
a.
Cores   Power (W) per core   Power (W) per core   Power (W)   Power (W)
        @ 3 GHz              @ 500 MHz            @ 3 GHz     @ 500 MHz
1       15                   0.625                15          0.625
2       15                   0.625                30          1.25
4       15                   0.625                60          2.5
8       15                   0.625                120         5
b.
Cores   Power (W) per core   Power (W) per core   Power (W)   Power (W)
        @ 3 GHz              @ 500 MHz            @ 3 GHz     @ 500 MHz
1       15                   0.625                15          0.625
2       15                   0.625                30          1.25
4       15                   0.625                60          2.5
8       15                   0.625                120         5
1.10.6
a. Processors Energy (J) @ 3 GHz Energy (J) @ 500 MHz
1 60 15
2 65 16.25
4 75 18.75
8 90 22.5
b. Processors Energy (J) @ 3 GHz Energy (J) @ 500 MHz
1 60 15
2 60 15
4 60 15
8 60 15
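The energy figures in 1.10.6 are power multiplied by execution time. The sketch below reproduces the part (b) rows in Python. It assumes, as the tables suggest but do not state, that running at 500 MHz takes 3 GHz / 500 MHz = 6 times as long as at 3 GHz; the names are illustrative.

```python
# Energy = power x time, using the ideal-scaling rows of 1.10.4(b) and 1.10.5.
cores = [1, 2, 4, 8]
time_3ghz_s = [4.00, 2.00, 1.00, 0.50]       # 1.10.4(b)
power_per_core_w = {"3 GHz": 15.0, "500 MHz": 0.625}
# Assumption: running at 500 MHz takes 3 GHz / 500 MHz = 6 times longer.
SLOWDOWN_500MHZ = 6

def energy_joules(n_cores, seconds, watts_per_core):
    return n_cores * watts_per_core * seconds

energy_3ghz = [energy_joules(n, t, power_per_core_w["3 GHz"])
               for n, t in zip(cores, time_3ghz_s)]
energy_500mhz = [energy_joules(n, t * SLOWDOWN_500MHZ, power_per_core_w["500 MHz"])
                 for n, t in zip(cores, time_3ghz_s)]
print(energy_3ghz, energy_500mhz)
```

With perfect scaling the energy stays constant as cores are added, because the time saved exactly offsets the extra power.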
Solution 1.11
1.11.1 Wafer area = π × (d/2)^2
a. Wafer area = π × 7.5^2 = 176.7 cm^2
b. Wafer area = π × 12.5^2 = 490.9 cm^2
Die area = wafer area/dies per wafer
a. Die area = 176.7/90 = 1.96 cm^2
b. Die area = 490.9/140 = 3.51 cm^2
Yield = 1/(1 + (defect per area × die area)/2)^2
a. Yield = 0.97
b. Yield = 0.92
1.11.2 Cost per die = cost per wafer/(dies per wafer × yield)
a. Cost per die = 0.12
b. Cost per die = 0.16
1.11.3
a. Dies per wafer = 1.1 × 90 = 99
Defects per area = 1.15 × 0.018 = 0.021 defects/cm^2
Die area = wafer area/Dies per wafer = 176.7/99 = 1.78 cm^2
Yield = 0.97
b. Dies per wafer = 1.1 × 140 = 154
Defects per area = 1.15 × 0.024 = 0.028 defects/cm^2
Die area = wafer area/Dies per wafer = 490.9/154 = 3.19 cm^2
Yield = 0.93
1.11.4 Yield = 1/(1 + (defect per area × die area)/2)^2
Then defect per area = (2/die area)(y^(-1/2) − 1)
Replacing values for T1 and T2 we get
T1: defects per area = 0.00085 defects/mm^2 = 0.085 defects/cm^2
T2: defects per area = 0.00060 defects/mm^2 = 0.060 defects/cm^2
T3: defects per area = 0.00043 defects/mm^2 = 0.043 defects/cm^2
T4: defects per area = 0.00026 defects/mm^2 = 0.026 defects/cm^2
1.11.5 no solution provided
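For reference, the wafer area, die area and yield steps of 1.11.1 can be sketched in Python. The wafer diameters (15 cm and 25 cm) are inferred from the radii 7.5 and 12.5 used in that part, and the defect densities 0.018 and 0.024 per cm^2 are the base values that appear in 1.11.3; the function names are illustrative.

```python
import math

def wafer_area_cm2(diameter_cm):
    return math.pi * (diameter_cm / 2) ** 2

def die_area_cm2(diameter_cm, dies_per_wafer):
    return wafer_area_cm2(diameter_cm) / dies_per_wafer

def die_yield(defects_per_cm2, die_area):
    # Yield = 1 / (1 + defects_per_area * die_area / 2)^2
    return 1.0 / (1.0 + defects_per_cm2 * die_area / 2) ** 2

# (diameter in cm, dies per wafer, defects per cm^2) for cases a and b
cases = {"a": (15.0, 90, 0.018), "b": (25.0, 140, 0.024)}
for label, (d, dies, defects) in cases.items():
    area = die_area_cm2(d, dies)
    print(label, round(wafer_area_cm2(d), 1), round(area, 2),
          round(die_yield(defects, area), 2))
```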
Solution 1.12
1.12.1 CPI = clock rate × CPU time/instr. count
clock rate = 1/cycle time = 3 GHz
a. CPI(perl) = 3 × 10^9 × 500/(2118 × 10^9) = 0.7
b. CPI(mcf) = 3 × 10^9 × 1200/(336 × 10^9) = 10.7
1.12.2 SPECratio = ref. time/execution time.
a. SPECratio(perl) = 9770/500 = 19.54
b. SPECratio(mcf) = 9120/1200 = 7.6
1.12.3
(19.54 × 7.6)^(1/2) = 12.19
1.12.4 CPU time = No. instr. × CPI/clock rate
If CPI and clock rate do not change, the CPU time increase is equal to the increase in the number of instructions, that is, 10%.
1.12.5 CPU time(before) = No. instr. × CPI/clock rate
CPU time(after) = 1.1 × No. instr. × 1.05 × CPI/clock rate
CPU time(after)/CPU time(before) = 1.1 × 1.05 = 1.155. Thus, CPU time is increased by 15.5%.
1.12.6 SPECratio = reference time/CPU time
SPECratio(after)/SPECratio(before) = CPU time(before)/CPU time(after) = 1/1.155 = 0.866. That is, the SPECratio is decreased by about 13.4%.
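The CPI, SPECratio and geometric mean of 1.12.1 to 1.12.3 can be recomputed with the Python sketch below. The instruction counts, execution times and reference times are the ones used in those parts; the dictionary layout and function names are illustrative.

```python
import math

CLOCK_HZ = 3e9  # 1.12.1: clock rate = 1/cycle time = 3 GHz

# name: (instruction count, execution time in s, reference time in s)
benchmarks = {
    "perl": (2118e9, 500, 9770),
    "mcf": (336e9, 1200, 9120),
}

def cpi(instr, exec_s, clock_hz=CLOCK_HZ):
    # CPI = clock rate * CPU time / instruction count
    return clock_hz * exec_s / instr

def spec_ratio(ref_s, exec_s):
    return ref_s / exec_s

ratios = [spec_ratio(ref, t) for _, t, ref in benchmarks.values()]
geo_mean = math.prod(ratios) ** (1 / len(ratios))

for name, (instr, t, ref) in benchmarks.items():
    print(name, round(cpi(instr, t), 1), spec_ratio(ref, t))
print(round(geo_mean, 2))
```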
Solution 1.13
1.13.1 CPI = (CPU time × clock rate)/No. instr.
a. CPI = 450 × 4 × 10^9/(0.85 × 2118 × 10^9) = 0.99
b. CPI = 1150 × 4 × 10^9/(0.85 × 336 × 10^9) = 16.10
1.13.2 Clock rate ratio = 4 GHz/3 GHz = 1.33.
a. CPI @ 4 GHz = 0.99, CPI @ 3 GHz = 0.7, ratio = 1.41
b. CPI @ 4 GHz = 16.1, CPI @ 3 GHz = 10.7, ratio = 1.50
They are different because although the number of instructions has been reduced by 15%, the CPU time has been reduced by a lower percentage.
1.13.3
a. 450/500 = 0.90. CPU time reduction: 10%.
b. 1150/1200 = 0.958. CPU time reduction: 4.2%.
1.13.4 No. instr. = CPU time × clock rate/CPI.
a. No. instr. = 820 × 0.9 × 4 × 10^9/0.96 = 3075 × 10^9
b. No. instr. = 580 × 0.9 × 4 × 10^9/2.94 = 710 × 10^9
1.13.5 Clock rate = No. instr. × CPI/CPU time.
Clock rate_new = No. instr. × CPI/(0.9 × CPU time) = (1/0.9) × clock rate_old = 3.33 GHz.
1.13.6 Clock rate = No. instr. × CPI/CPU time.
Clock rate_new = No. instr. × 0.85 × CPI/(0.80 × CPU time) = (0.85/0.80) × clock rate_old = 3.18 GHz.
Solution 1.14
1.14.1 No. instr. = 10^6
Tcpu(P1) = 10^6 × 1.25/(4 × 10^9) = 0.3125 × 10^-3 s
Tcpu(P2) = 10^6 × 0.75/(3 × 10^9) = 0.25 × 10^-3 s
clock rate(P1) > clock rate(P2), but performance(P1) < performance(P2)
1.14.2
P1: 10^6 instructions, Tcpu(P1) = 0.3125 × 10^-3 s
P2: Tcpu(P2) = N × 0.75/(3 × 10^9), then N = 1.25 × 10^6
1.14.3 MIPS = Clock rate × 10^-6/CPI
MIPS(P1) = 4 × 10^9 × 10^-6/1.25 = 3200
MIPS(P2) = 3 × 10^9 × 10^-6/0.75 = 4000
MIPS(P1) < MIPS(P2), performance(P1) < performance(P2) in this case (from 1.14.1)
1.14.4
a. FP op = 10^6 × 0.4 = 4 × 10^5, clock cycles_fp = CPI × No. FP instr. = 4 × 10^5
T_fp = 4 × 10^5 × 0.33 × 10^-9 = 1.32 × 10^-4, then MFLOPS = 3.03 × 10^3
b. FP op = 3 × 10^6 × 0.4 = 1.2 × 10^6, clock cycles_fp = CPI × No. FP instr. = 0.70 × 1.2 × 10^6
T_fp = 0.84 × 10^6 × 0.33 × 10^-9 = 2.77 × 10^-4, then MFLOPS = 4.33 × 10^3
1.14.5 CPU clock cycles = FP cycles + CPI(L/S) × No. instr. (L/S) + CPI(Branch) × No. instr. (Branch)
a. 5 × 10^5 L/S instr., 4 × 10^5 FP instr. and 10^5 Branch instr.
CPU clock cycles = 4 × 10^5 + 0.75 × 5 × 10^5 + 1.5 × 10^5 = 9.25 × 10^5
Tcpu = 9.25 × 10^5 × 0.33 × 10^-9 = 3.05 × 10^-4
MIPS = 10^6/(3.05 × 10^-4 × 10^6) = 3.2 × 10^3
b. 1.2 × 10^6 L/S instr., 1.2 × 10^6 FP instr. and 0.6 × 10^6 Branch instr.
CPU clock cycles = 0.84 × 10^6 + 1.25 × 1.2 × 10^6 + 1.25 × 0.6 × 10^6 = 3.09 × 10^6
Tcpu = 3.09 × 10^6 × 0.33 × 10^-9 = 1.01 × 10^-3
MIPS = 3 × 10^6/(1.01 × 10^-3 × 10^6) = 2.97 × 10^3
1.14.6
a. performance = 1/Tcpu = 3.2 × 10^3
b. performance = 1/Tcpu = 9.9 × 10^2
The second program has the higher performance and the higher MFLOPS figure, but the first program has the higher MIPS figure.
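The MIPS comparison of 1.14.1 and 1.14.3 can be checked with a small Python sketch. It uses the clock rates and CPIs given for P1 and P2 in this solution; the function names are illustrative.

```python
def mips(clock_hz, cpi):
    # MIPS = clock rate / (CPI * 10^6)
    return clock_hz / (cpi * 1e6)

def cpu_time(instr, cpi, clock_hz):
    # CPU time = instructions * CPI / clock rate
    return instr * cpi / clock_hz

p1 = {"clock_hz": 4e9, "cpi": 1.25}
p2 = {"clock_hz": 3e9, "cpi": 0.75}

for name, p in [("P1", p1), ("P2", p2)]:
    print(name, mips(p["clock_hz"], p["cpi"]), cpu_time(1e6, p["cpi"], p["clock_hz"]))
```

For a fixed instruction count, the processor with the higher MIPS rating here is also the faster one, even though its clock rate is lower.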
Solution 1.15
1.15.1
a. Tfp = 35 × 0.8 = 28 s, Tp1 = 28 + 85 + 50 + 30 = 193 s. Reduction: 3.5%
b. Tfp = 50 × 0.8 = 40 s, Tp4 = 40 + 80 + 50 + 30 = 200 s. Reduction: 4.7%
1.15.2
a. Tp1 = 200 × 0.8 = 160 s, Tfp + Tl/s + Tbranch = 115 s, Tint = 45 s. Reduction time INT: 47%
b. Tp4 = 210 × 0.8 = 168 s, Tfp + Tl/s + Tbranch = 130 s, Tint = 38 s. Reduction time INT: 52.4%
1.15.3
a. Tp1 = 200 × 0.8 = 160 s, Tfp + Tint + Tl/s = 170 s. NO b. Tp4 = 210 × 0.8 = 168 s, Tfp + Tint + Tl/s = 180 s. NO
1.15.4
Clock cycles = CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr.
T_cpu = clock cycles/clock rate = clock cycles/(2 × 10^9)
a. 1 processor: clock cycles = 8192; T_cpu = 4.096 µs
b. 8 processors: clock cycles = 1024; T_cpu = 0.512 µs
To halve the number of clock cycles by improving the CPI of FP instructions:
CPI_improved fp × No. FP instr. + CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr. = clock cycles/2
CPI_improved fp = (clock cycles/2 − (CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr.))/No. FP instr.
a. 1 processor: CPI_improved fp = (4096 − 7632)/560 < 0 ==> not possible
b. 8 processors: CPI_improved fp = (512 − 944)/80 < 0 ==> not possible
1.15.5 Using the clock cycle data from 1.15.4:
To halve the number of clock cycles by improving the CPI of L/S instructions:
CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_improved l/s × No. L/S instr. + CPI_branch × No. branch instr. = clock cycles/2
CPI_improved l/s = (clock cycles/2 − (CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_branch × No. branch instr.))/No. L/S instr.
a. 1 processor: CPI_improved l/s = (4096 − 3072)/1280 = 0.8
b. 8 processors: CPI_improved l/s = (512 − 384)/160 = 0.8
1.15.6
Clock cycles = CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr.
T_cpu = clock cycles/clock rate = clock cycles/(2 × 10^9)
CPI_int = 0.6 × 1 = 0.6; CPI_fp = 0.6 × 1 = 0.6; CPI_l/s = 0.7 × 4 = 2.8; CPI_branch = 0.7 × 2 = 1.4
a. 1 processor: T_cpu(before improv.) = 4.096 µs; T_cpu(after improv.) = 2.739 µs
b. 8 processors: T_cpu(before improv.) = 0.512 µs; T_cpu(after improv.) = 0.342 µs
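The 1-processor case of 1.15.6 can be reproduced with the Python sketch below. The FP and L/S instruction counts (560 and 1280) appear in 1.15.4 and 1.15.5. The INT and branch counts (2000 and 256) are not stated in this solution: they are an assumption, inferred so that the cycle totals 8192, 7632 and 3072 used above all hold. Times are reported in microseconds.

```python
CLOCK_HZ = 2e9

# Instruction counts for 1 processor. FP and L/S counts appear in 1.15.4/1.15.5;
# the INT and branch counts are inferred (assumption) from the cycle totals.
counts = {"fp": 560, "int": 2000, "ls": 1280, "branch": 256}
cpi_before = {"fp": 1, "int": 1, "ls": 4, "branch": 2}
cpi_after = {"fp": 0.6, "int": 0.6, "ls": 2.8, "branch": 1.4}

def clock_cycles(counts, cpi):
    # Sum over classes of (instruction count) x (class CPI)
    return sum(counts[k] * cpi[k] for k in counts)

def cpu_time_us(counts, cpi, clock_hz=CLOCK_HZ):
    return clock_cycles(counts, cpi) / clock_hz * 1e6

print(clock_cycles(counts, cpi_before))
print(round(cpu_time_us(counts, cpi_before), 3))
print(round(cpu_time_us(counts, cpi_after), 3))
```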
Solution 1.16
1.16.1 Without reduction in any routine:
a. total time 2 proc = 185 ns b. total time 16 proc = 34 ns
Reducing time in routines A, C and E:
a. 2 proc: T(A) = 17 ns, T(C) = 8.5 ns, T(E) = 4.1 ns, total time = 179.6 ns ==> reduction = 2.9%
b. 16 proc: T(A) = 3.4 ns, T(C) = 1.7 ns, T(E) = 1.7 ns, total time = 32.8 ns ==> reduction = 3.5%
1.16.2
a. 2 proc: T(B) = 72 ns, total time = 177 ns ==> reduction = 4.3%
b. 16 proc: T(B) = 12.6 ns, total time = 32.6 ns ==> reduction = 4.1%
1.16.3
a. 2 proc: T(D) = 63 ns, total time = 178 ns ==> reduction = 3.7%
b. 16 proc: T(D) = 10.8 ns, total time = 32.8 ns ==> reduction = 3.5%
1.16.4
# Processors   Computing time   Computing time ratio   Routing time ratio
2              176
4              96               0.55                   1.18
8              49               0.51                   1.31
16             30               0.61                   1.29
32             14               0.47                   1.05
64             6.5              0.46                   1.13
1.16.5 Geometric mean of computing time ratios = 0.52. Multiplying this by the computing time for a 64-processor system gives a computing time for a 128-processor system of 3.4 ms.
Geometric mean of routing time ratios = 1.19. Multiplying this by the routing time for a 64-processor system gives a routing time for a 128-processor system of 30.9 ms.
1.16.6 Computing time = 176/0.52 = 338 ms. Routing time = 0, since no communication is required.
Solution 2.1
2.1.1
a. add f, g, h
add f, f, i
add f, f, j
b. addi f, h, 5
add f, f, g
2.1.2
a. 3 b. 2
2.1.3
a. 14 b. 10
2.1.4
a. f = g + h b. f = g + h
2.1.5
a. 5 b. 5
Solution 2.2
2.2.1
a. add f, f, f
add f, f, i
b. addi f, j, 2
add f, f, g
2.2.2
a. 2 b. 2
2.2.3
a. 6 b. 5
2.2.4
a. f += h;
b. f = 1–f;
2.2.5
a. 4 b. 0
Solution 2.3
2.3.1
a. add f, f, g
add f, f, h
add f, f, i
add f, f, j
addi f, f, 2
b. addi f, f, 5
sub f, g, f
2.3.2
a. 5 b. 2
2.3.3
a. 17 b. -4
2.3.4
a. f = h - g;
b. f = g - f - 1;
2.3.5
a. 1 b. 0
Solution 2.4
2.4.1
a. lw $s0, 16($s7)
add $s0, $s0, $s1
add $s0, $s0, $s2
b. lw $t0, 16($s7)
lw $s0, 0($t0)
sub $s0, $s1, $s0
2.4.2
a. 3 b. 3
2.4.3
a. 4 b. 4
2.4.4
a. f += g + h + i + j;
b. f = A[1];
2.4.5
a. no change b. no change
2.4.6
a. 5 as written, 5 minimally b. 2 as written, 2 minimally
Solution 2.5
2.5.1
a. Address Data
12 1
8 6
4 4
0 2
temp = Array[3];
Array[3] = Array[2];
Array[2] = Array[1];
Array[1] = Array[0];
Array[0] = temp;
b. Address Data
16 1
12 2
8 3
4 4
0 5
temp = Array[4];
Array[4] = Array[0];
Array[0] = temp;
temp = Array[3];
Array[3] = Array[1];
Array[1] = temp;
2.5.2
a. Address Data
12 1
8 6
4 4
0 2
temp = Array[3];
Array[3] = Array[2];
Array[2] = Array[1];
Array[1] = Array[0];
Array[0] = temp;
lw $t0, 12($s6)
lw $t1, 8($s6)
sw $t1, 12($s6)
lw $t1, 4($s6)
sw $t1, 8($s6)
lw $t1, 0($s6)
sw $t1, 4($s6)
sw $t0, 0($s6)
b. Address Data
16 1
12 2
8 3
4 4
0 5
temp = Array[4];
Array[4] = Array[0];
Array[0] = temp;
temp = Array[3];
Array[3] = Array[1];
Array[1] = temp;
lw $t0, 16($s6)
lw $t1, 0($s6)
sw $t1, 16($s6)
sw $t0, 0($s6)
lw $t0, 12($s6)
lw $t1, 4($s6)
sw $t1, 12($s6)
sw $t0, 4($s6)
2.5.3
a. Address Data
12 1
8 6
4 4
0 2
temp = Array[3];
Array[3] = Array[2];
Array[2] = Array[1];
Array[1] = Array[0];
Array[0] = temp;
lw $t0, 12($s6)
lw $t1, 8($s6)
sw $t1, 12($s6)
lw $t1, 4($s6)
sw $t1, 8($s6)
lw $t1, 0($s6)
sw $t1, 4($s6)
sw $t0, 0($s6)
8 MIPS instructions, +1 MIPS inst. for every nonzero-offset lw/sw pair (11 MIPS inst.)
b. Address Data
16 1
12 2
8 3
4 4
0 5
temp = Array[4];
Array[4] = Array[0];
Array[0] = temp;
temp = Array[3];
Array[3] = Array[1];
Array[1] = temp;
lw $t0, 16($s6)
lw $t1, 0($s6)
sw $t1, 16($s6)
sw $t0, 0($s6)
lw $t0, 12($s6)
lw $t1, 4($s6)
sw $t1, 12($s6)
sw $t0, 4($s6)
8 MIPS instructions, +1 MIPS inst. for every nonzero-offset lw/sw pair (11 MIPS inst.)
2.5.4
a. 305419896 b. 3199070221
2.5.5
     Little-Endian      Big-Endian
a.   Address  Data      Address  Data
     12       12        12       78
     8        34        8        56
     4        56        4        34
     0        78        0        12
b.   Address  Data      Address  Data
     12       be        12       0d
     8        ad        8        f0
     4        f0        4        ad
     0        0d        0        be
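The decimal values in 2.5.4 and the byte orderings in 2.5.5 can be checked with Python's built-in integer conversions. In this sketch the helper name is illustrative, and the bytes are listed at consecutive offsets 0 to 3 (lowest address first) rather than at the addresses 0, 4, 8, 12 shown in the table.

```python
def byte_layout(value, byteorder):
    """Return the four bytes of a 32-bit word as hex strings, lowest address first."""
    return [f"{b:02x}" for b in value.to_bytes(4, byteorder)]

words = {"a": 0x12345678, "b": 0xBEADF00D}
for label, word in words.items():
    # Decimal value, then little-endian and big-endian byte order
    print(label, word, byte_layout(word, "little"), byte_layout(word, "big"))
```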
Solution 2.6
2.6.1
a. lw $s0, 4($s7)
sub $s0, $s0, $s1
add $s0, $s0, $s2
b. add $t0, $s7, $s1
lw $t0, 0($t0)
add $t0, $t0, $s6
lw $s0, 4($t0)
2.6.2
a. 3 b. 4
2.6.3
a. 4 b. 5
2.6.4
a. f = 2i + h;
b. f = A[g – 3];
2.6.5
a. $s0 = 110 b. $s0 = 300
2.6.6
a.
                     Type    opcode  rs  rt  rd  immed
add $s0, $s0, $s1    R-type  0       16  17  16
add $s0, $s3, $s2    R-type  0       19  18  16
add $s0, $s0, $s3    R-type  0       16  19  16
b.
                     Type    opcode  rs  rt  rd  immed
addi $s6, $s6, -20   I-type  8       22  22      -20
add $s6, $s6, $s1    R-type  0       22  17  22
lw $s0, 8($s6)       I-type  35      22  16      8
Solution 2.7
2.7.1
a. -1391460350 b. -19629
2.7.2
a. 2903506946 b. 4294947667
2.7.3
a. AD100002 b. FFFFB353
2.7.4
a. 01111111111111111111111111111111 b. 1111101000
2.7.5
a. 7FFFFFFF b. 3E8
2.7.6
a. 80000001 b. FFFFFC18
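The signed, unsigned and hexadecimal answers in Solution 2.7 are three views of the same 32-bit patterns. A minimal Python sketch of the conversions, with illustrative helper names:

```python
def to_signed32(word):
    """Interpret a 32-bit pattern as a two's complement integer."""
    word &= 0xFFFFFFFF
    return word - (1 << 32) if word & 0x80000000 else word

def to_hex32(value):
    """Return the 32-bit two's complement pattern of an integer as hex."""
    return f"{value & 0xFFFFFFFF:08X}"

for pattern in (0xAD100002, 0xFFFFB353):
    # hex pattern, unsigned value, signed value
    print(to_hex32(pattern), pattern, to_signed32(pattern))
```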
Solution 2.8
2.8.1
a. 7FFFFFFF, no overflow b. 80000000, overflow
2.8.2
a. 60000001, no overflow b. 0, no overflow
2.8.3
a. EFFFFFFF, overflow b. C0000000, overflow
2.8.4
a. overflow b. no overflow
2.8.5
a. no overflow b. no overflow
2.8.6
a. overflow b. no overflow
Solution 2.9
2.9.1
a. overflow b. no overflow
2.9.2
a. overflow b. no overflow
2.9.3
a. no overflow b. overflow
2.9.4
a. no overflow b. no overflow
2.9.5
a. 1D100002 b. 6FFFB353
2.9.6
a. 487587842 b. 1879028563
Solution 2.10
2.10.1
a. sw $t3, 4($s0) b. lw $t0, 64($t0)
2.10.2
a. I-type b. I-type
2.10.3
a. AE0B0004 b. 8D080040
2.10.4
a. 0x01004020 b. 0x8E690004
2.10.5
a. R-type b. I-type
2.10.6
a. op=0x0, rd=0x8, rs=0x8, rt=0x0, funct=0x20 b. op=0x23, rs=0x13, rt=0x9, imm=0x4
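The I-type encodings in 2.10.3 and 2.10.4 can be reproduced by packing the four fields into a 32-bit word. The Python sketch below covers only the registers and opcodes needed here; the table and function names are illustrative, while the register numbers and opcodes are the standard MIPS values.

```python
REG = {"$zero": 0, "$t0": 8, "$t1": 9, "$t3": 11, "$s0": 16, "$s3": 19}
OPCODE = {"lw": 0x23, "sw": 0x2B}

def encode_i_type(op, rs, rt, imm):
    """Pack opcode(6) | rs(5) | rt(5) | immediate(16) into one 32-bit word."""
    return (OPCODE[op] << 26) | (REG[rs] << 21) | (REG[rt] << 16) | (imm & 0xFFFF)

# sw $t3, 4($s0): the base register is rs, the register being stored is rt
sw_word = encode_i_type("sw", "$s0", "$t3", 4)
# lw $t0, 64($t0): base and destination are the same register
lw_word = encode_i_type("lw", "$t0", "$t0", 64)

print(f"{sw_word:08X}", f"{lw_word:08X}")
```

Masking the immediate with 0xFFFF also handles negative offsets, which are stored as 16-bit two's complement values.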
Solution 2.11
2.11.1
a. 1010 1110 0000 1011 1111 1111 1111 1100_two
b. 1000 1101 0000 1000 1111 1111 1100 0000_two
2.11.2
a. 2920022012 b. 2366177216
2.11.3
a. sw $t3, -4($s0) b. lw $t0, -64($t0)
2.11.4
a. R-type b. I-type
2.11.5
a. add $v1, $at, $v0 b. sw $a1, 4($s0)
2.11.6
a. 0x00221820 b. 0xAD450004
Solution 2.12
2.12.1
     Type    opcode  rs  rt  rd  shamt  funct
a.   R-type  6       3   3   3   5      6      total bits = 26
b.   R-type  6       5   5   5   5      6      total bits = 32
2.12.2
     Type    opcode  rs  rt  immed
a.   I-type  6       3   3   16     total bits = 28
b.   I-type  6       5   5   10     total bits = 26
2.12.3
a. fewer registers → fewer bits per instruction → could reduce code size
   fewer registers → more register spills → more instructions
b. smaller constants → more lui instructions → could increase code size
   smaller constants → smaller opcodes → smaller code size
2.12.4
a. 17367056 b. 2366177298
2.12.5
a. add $t0, $t1, $0 b. lw $t1, 12($t0)
2.12.6
a. R-type, op=0x0, rt=0x9 b. I-type, op=0x23, rt=0x8
Solution 2.13
2.13.1
a. 0x57755778 b. 0xFEFFFEDE
2.13.2
a. 0x55555550 b. 0xEADFEED0
2.13.3
a. 0x0000AAAA b. 0x0000BFCD
2.13.4
a. 0x00015B5A b. 0x00000000
2.13.5
a. 0x5b5a0000 b. 0x000000f0
2.13.6
a. 0xEFEFFFFF b. 0x000000F0
Solution 2.14
2.14.1
a. add $t1, $t0, $0
srl $t1, $t1, 5
andi $t1, $t1, 0x0001ffff
b. add $t1, $t0, $0
sll $t1, $t1, 10
andi $t1, $t1, 0xffff8000
2.14.2
a. add $t1, $t0, $0
andi $t1, $t1, 0x0000000f
b. add $t1, $t0, $0
srl $t1, $t1, 14
andi $t1, $t1, 0x0003c000
2.14.3
a. add $t1, $t0, $0
srl $t1, $t1, 28
b. add $t1, $t0, $0
srl $t1, $t1, 14
andi $t1, $t1, 0x0001c000
2.14.4
a. add $t2, $t0, $0
srl $t2, $t2, 11
andi $t2, $t2, 0x0000003f
andi $t1, $t1, 0xffffffc0
or $t1, $t1, $t2
b. add $t2, $t0, $0
sll $t2, $t2, 3
andi $t2, $t2, 0x000fc000
andi $t1, $t1, 0xfff03fff
or $t1, $t1, $t2
2.14.5
a. add $t2, $t0, $0
andi $t2, $t2, 0x0000001f
andi $t1, $t1, 0xffffffe0
or $t1, $t1, $t2
b. add $t2, $t0, $0
sll $t2, $t2, 14
andi $t2, $t2, 0x0007c000
andi $t1, $t1, 0xfff83fff
or $t1, $t1, $t2
2.14.6
a. add $t2, $t0, $0
srl $t2, $t2, 29
andi $t2, $t2, 0x00000003
andi $t1, $t1, 0xfffffffc
or $t1, $t1, $t2
b. add $t2, $t0, $0
srl $t2, $t2, 15
andi $t2, $t2, 0x0000c000
andi $t1, $t1, 0xffff3fff
or $t1, $t1, $t2
Solution 2.15
2.15.1
a. 0x0000a581 b. 0x00ff5a66
2.15.2
a. nor $t1, $t2, $t2
and $t1, $t1, $t3
b. xor $t1, $t2, $t3
nor $t1, $t1, $t1
2.15.3
a. nor $t1, $t2, $t2
and $t1, $t1, $t3
000000 01010 01010 01001 00000 100111
000000 01001 01011 01001 00000 100100
b. xor $t1, $t2, $t3
nor $t1, $t1, $t1
000000 01010 01011 01001 00000 100110
000000 01001 01001 01001 00000 100111
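The bit patterns in 2.15.3 follow the R-type layout opcode | rs | rt | rd | shamt | funct. The Python sketch below rebuilds them field by field. The register numbers and funct codes are the standard MIPS values for the three instructions used here; the table and function names are illustrative.

```python
REG = {"$t1": 9, "$t2": 10, "$t3": 11}
FUNCT = {"and": 0x24, "xor": 0x26, "nor": 0x27}

def encode_r_type(name, rd, rs, rt, shamt=0):
    """Return the fields opcode rs rt rd shamt funct as a spaced bit string."""
    fields = [(0, 6), (REG[rs], 5), (REG[rt], 5),
              (REG[rd], 5), (shamt, 5), (FUNCT[name], 6)]
    return " ".join(format(value, f"0{width}b") for value, width in fields)

# Arguments follow assembly operand order: destination first, then sources.
print(encode_r_type("nor", "$t1", "$t2", "$t2"))
print(encode_r_type("and", "$t1", "$t1", "$t3"))
```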
2.15.4
a. 0x00000220 b. 0x00001234
2.15.5 Assuming $t1 = A, $t2 = B, $s1 = base of Array C
a. lw $t3, 0($s1)
and $t1, $t2, $t3
b. beq $t1, $0, ELSE
add $t1, $t2, $0
beq $0, $0, END
ELSE: lw $t2, 0($s1)
END:
2.15.6
a. lw $t3, 0($s1)
and $t1, $t2, $t3
100011 10001 01011 0000000000000000
000000 01010 01011 01001 00000 100100
b. beq $t1, $0, ELSE
add $t1, $t2, $0
beq $0, $0, END
ELSE: lw $t2, 0($s1)
END:
000100 01001 00000 0000000000000010
000000 01010 00000 01001 00000 100000
000100 00000 00000 0000000000000001
100011 10001 01010 0000000000000000
Solution 2.16
2.16.1
a. $t2 = 1 b. $t2 = 1
2.16.2
a. all, 0x8000 to 0x7FFFF b. 0x8000 to 0xFFFE
2.16.3
a. jump—no, beq—no b. jump—no, beq—no
2.16.4
a. $t2 = 2 b. $t2 = 2
2.16.5
a. $t2 = 0 b. $t2 = 1
2.16.6
a. jump—yes, beq—no b. jump—yes, beq—yes
Solution 2.17
2.17.1 The answer is really the same for all. All of these instructions are either supported by an existing instruction, or sequence of existing instructions. Looking for an answer along the lines of, “these instructions are not common, and we are only making the common case fast”.
2.17.2
a. could be either R-type or I-type b. R-type
2.17.3
a. ABS: sub $t2,$zero,$t3 # t2 = -t3
ble $t3,$zero,DONE # if t3 <= 0, result is t2
add $t2,$t3,$zero # if t3 > 0, result is t3
DONE:
b. slt $t1, $t3, $t2
2.17.4
a. 20 b. 200
2.17.5
a. i = 10;
do {
    B += 2;
    i = i - 1;
} while (i > 0);
b. i = 10;
do {
    temp = 10;
    do {
        B += 2;
        temp = temp - 1;
    } while (temp > 0);
    i = i - 1;
} while (i > 0);
2.17.6
a. 5 × N + 3 b. 33 × N
Solution 2.18
2.18.1
a. [Control-flow graph; recoverable nodes: i < 10?, A += B, i += 1]
b. [Control-flow graph; recoverable nodes: A < 10, D[a] = b + a;, A += 1]
2.18.2
a. addi $t0, $0, 0
beq $0, $0, TEST
LOOP: add $s0, $s0, $s1
addi $t0, $t0, 1
TEST: slti $t2, $t0, 10
bne $t2, $0, LOOP
b. LOOP: slti $t2, $s0, 10
beq $t2, $0, DONE
add $t3, $s1, $s0
sll $t2, $s0, 2
add $t2, $s2, $t2
sw $t3, ($t2)
addi $s0, $s0, 1
j LOOP
DONE:
2.18.3
a. 6 instructions to implement and 44 instructions executed b. 8 instructions to implement and 2 instructions executed
2.18.4
a. 501 b. 301
2.18.5
a. for(i=100; i>0; i--){
result += MemArray[s0];
s0 += 1;
}
b. for(i=0; i<100; i+=2){
result += MemArray[s0 + i];
result += MemArray[s0 + i + 1];
}
2.18.6
a. addi $t1, $s0, 400
LOOP: lw $s1, 0($s0)
add $s2, $s2, $s1
addi $s0, $s0, 4
bne $s0, $t1, LOOP
b. already reduced to minimum instructions
Solution 2.19
2.19.1
a. compare:
addi $sp, $sp, -4
sw $ra, 0($sp)
add $s0, $a0, $0
add $s1, $a1, $0
jal sub
addi $t1, $0, 1
beq $v0, $0, exit
slt $t2, $0, $v0
bne $t2, $0, exit
add $t1, $0, $0
exit:
add $v0, $t1, $0
lw $ra, 0($sp)
addi $sp, $sp, 4
jr $ra
sub:
sub $v0, $a0, $a1
jr $ra
b. fib_iter:
addi $sp, $sp, -16
sw $ra, 12($sp)
sw $s0, 8($sp)
sw $s1, 4($sp)
sw $s2, 0($sp)
add $s0, $a0, $0
add $s1, $a1, $0
add $s2, $a2, $0
add $v0, $s1, $0
bne $s2, $0, exit
add $a0, $s0, $s1
add $a1, $s0, $0
addi $a2, $s2, -1
jal fib_iter
exit:
lw $s2, 0($sp)
lw $s1, 4($sp)
lw $s0, 8($sp)
lw $ra, 12($sp)
addi $sp, $sp, 16
jr $ra
2.19.2
a. compare:
addi $sp, $sp, -4
sw $ra, 0($sp)
sub $t0, $a0, $a1
addi $t1, $0, 1
beq $t0, $0, exit
slt $t2, $0, $t0
bne $t2, $0, exit
add $t1, $0, $0
exit:
add $v0, $t1, $0
lw $ra, 0($sp)
addi $sp, $sp, 4
jr $ra
b. Due to the recursive nature of the code, it is not possible for the compiler to in-line the function call.
2.19.3
a. after calling function compare:
old $sp => 0x7ffffffc   ???
$sp =>     -4           contents of register $ra
after calling function sub:
old $sp => 0x7ffffffc   ???
           -4           contents of register $ra
$sp =>     -8           contents of register $ra #return to compare
b. after calling function fib_iter:
old $sp => 0x7ffffffc   ???
           -4           contents of register $ra
           -8           contents of register $s0
           -12          contents of register $s1
$sp =>     -16          contents of register $s2
2.19.4
a. f: addi $sp,$sp,-8
sw $ra,4($sp)
sw $s0,0($sp)
move $s0,$a2
jal func
move $a0,$v0
move $a1,$s0
jal func
lw $ra,4($sp)
lw $s0,0($sp)
addi $sp,$sp,8
jr $ra
b. f: addi $sp,$sp,-12
sw $ra,8($sp)
sw $s1,4($sp)
sw $s0,0($sp)
move $s0,$a1
move $s1,$a2
jal func
move $a0,$s0
move $a1,$s1
move $s0,$v0
jal func
add $v0,$v0,$s0
lw $ra,8($sp)
lw $s1,4($sp)
lw $s0,0($sp)
addi $sp,$sp,12
jr $ra
2.19.5
a. We can use the tail-call optimization for the second call to func, but then we must restore $ra and $sp before that call. We save only one instruction (jr $ra).
b. We can NOT use the tail call optimization here, because the value returned from f is not equal to the value returned by the last call to func.
2.19.6 Register $ra is equal to the return address in the caller function, registers $sp and $s3 have the same values they had when function f was called, and register $t5 can have an arbitrary value. For register $t5, note that although our function f does not modify it, function func is allowed to modify it, so we cannot assume anything about the value of $t5 after function func has been called.
Solution 2.20
2.20.1
a. FACT: addi $sp, $sp, -8
sw $ra, 4($sp)
sw $a0, 0($sp)
add $s0, $0, $a0
slti $t0, $a0, 2
beq $t0, $0, L1
addi $v0, $0, 1
addi $sp, $sp, 8
jr $ra
L1: addi $a0, $a0, -1
jal FACT
mul $v0, $s0, $v0
lw $a0, 0($sp)
lw $ra, 4($sp)
addi $sp, $sp, 8
jr $ra
b. FACT: addi $sp, $sp, -8
sw $ra, 4($sp)
sw $a0, 0($sp)
add $s0, $0, $a0
slti $t0, $a0, 2
beq $t0, $0, L1
addi $v0, $0, 1
addi $sp, $sp, 8
jr $ra
L1: addi $a0, $a0, -1
jal FACT
mul $v0, $s0, $v0
lw $a0, 0($sp)
lw $ra, 4($sp)
addi $sp, $sp, 8
jr $ra
2.20.2
a. 25 MIPS instructions to execute nonrecursive vs. 45 instructions to execute (corrected version of) recursion
Nonrecursive version:
FACT: addi $sp, $sp, -4
sw $ra, 0($sp)
add $s0, $0, $a0
addi $s2, $0, 1
LOOP: slti $t0, $s0, 2
bne $t0, $0, DONE
mul $s2, $s0, $s2
addi $s0, $s0, -1
j LOOP
DONE: add $v0, $0, $s2
lw $ra, 0($sp)
addi $sp, $sp, 4
jr $ra
b. 25 MIPS instructions to execute nonrecursive vs. 45 instructions to execute (corrected version of) recursion
Nonrecursive version:
FACT: addi $sp, $sp, -4
sw $ra, 0($sp)
add $s0, $0, $a0
addi $s2, $0, 1
LOOP: slti $t0, $s0, 2
bne $t0, $0, DONE
mul $s2, $s0, $s2
addi $s0, $s0, -1
j LOOP
DONE: add $v0, $0, $s2
lw $ra, 0($sp)
addi $sp, $sp, 4
jr $ra
2.20.3
a. Recursive version
FACT: addi $sp, $sp, –8 sw $ra, 4($sp) sw $a0, 0($sp) add $s0, $0, $a0 HERE: slti $t0, $a0, 2 beq $t0, $0, L1 addi $v0, $0, 1 addi $sp, $sp, 8 jr $ra
L1: addi $a0, $a0, –1 jal FACT
mul $v0, $s0, $v0 lw $a0, 0($sp) lw $ra, 4($sp) addi $sp, $sp, 8 jr $ra
at label HERE, after calling function FACT with input of 4:
old $sp => 0xnnnnnnnn ???
–4 contents of register $ra
$sp => –8 contents of register $a0 at label HERE, after calling function FACT with input of 3:
old $sp => 0xnnnnnnnn ???
–4 contents of register $ra –8 contents of register $a0 –12 contents of register $ra
$sp => –16 contents of register $a0 at label HERE, after calling function FACT with input of 2:
old $sp => 0xnnnnnnnn ???
–4 contents of register $ra –8 contents of register $a0 –12 contents of register $ra –16 contents of register $a0 –20 contents of register $ra
$sp => –24 contents of register $a0 at label HERE, after calling function FACT with input of 1:
old $sp => 0xnnnnnnnn ???
–4 contents of register $ra –8 contents of register $a0 –12 contents of register $ra –16 contents of register $a0 –20 contents of register $ra –24 contents of register $a0 –28 contents of register $ra
$sp => –32 contents of register $a0
b. Recursive version
FACT: addi $sp, $sp, –8 sw $ra, 4($sp) sw $a0, 0($sp) add $s0, $0, $a0 HERE: slti $t0, $a0, 2 beq $t0, $0, L1 addi $v0, $0, 1 addi $sp, $sp, 8 jr $ra
L1: addi $a0, $a0, –1 jal FACT
mul $v0, $s0, $v0 lw $a0, 0($sp) lw $ra, 4($sp) addi $sp, $sp, 8 jr $ra
at label HERE, after calling function FACT with input of 4:
old $sp => 0xnnnnnnnn ???
–4 contents of register $ra
$sp => –8 contents of register $a0
at label HERE, after calling function FACT with input of 3:
old $sp => 0xnnnnnnnn ???
–4 contents of register $ra
–8 contents of register $a0
–12 contents of register $ra
$sp => –16 contents of register $a0
at label HERE, after calling function FACT with input of 2:
old $sp => 0xnnnnnnnn ???
–4 contents of register $ra
–8 contents of register $a0
–12 contents of register $ra
–16 contents of register $a0
–20 contents of register $ra
$sp => –24 contents of register $a0
at label HERE, after calling function FACT with input of 1:
old $sp => 0xnnnnnnnn ???
–4 contents of register $ra
–8 contents of register $a0
–12 contents of register $ra
–16 contents of register $a0
–20 contents of register $ra
–24 contents of register $a0
–28 contents of register $ra
$sp => –32 contents of register $a0
2.20.4
a.
FIB:  addi $sp, $sp, -12
      sw   $ra, 8($sp)
      sw   $s1, 4($sp)
      sw   $a0, 0($sp)
      slti $t0, $a0, 3
      beq  $t0, $0, L1
      addi $v0, $0, 1
      j    EXIT
L1:   addi $a0, $a0, -1
      jal  FIB
      add  $s1, $v0, $0
      addi $a0, $a0, -1
      jal  FIB
      add  $v0, $v0, $s1
EXIT: lw   $a0, 0($sp)
      lw   $s1, 4($sp)
      lw   $ra, 8($sp)
      addi $sp, $sp, 12
      jr   $ra
b.
FIB:  addi $sp, $sp, -12
      sw   $ra, 8($sp)
      sw   $s1, 4($sp)
      sw   $a0, 0($sp)
      slti $t0, $a0, 3
      beq  $t0, $0, L1
      addi $v0, $0, 1
      j    EXIT
L1:   addi $a0, $a0, -1
      jal  FIB
      add  $s1, $v0, $0
      addi $a0, $a0, -1
      jal  FIB
      add  $v0, $v0, $s1
EXIT: lw   $a0, 0($sp)
      lw   $s1, 4($sp)
      lw   $ra, 8($sp)
      addi $sp, $sp, 12
      jr   $ra
2.20.5
a. 23 MIPS instructions to execute nonrecursive vs. 73 instructions to execute (corrected version of) recursion
Nonrecursive version:
FIB:  addi $sp, $sp, -4
      sw   $ra, ($sp)
      addi $s1, $0, 1
      addi $s2, $0, 1
LOOP: slti $t0, $a0, 3
      bne  $t0, $0, EXIT
      add  $s3, $s1, $0
      add  $s1, $s1, $s2
      add  $s2, $s3, $0
      addi $a0, $a0, -1
      j    LOOP
EXIT: add  $v0, $s1, $0
      lw   $ra, ($sp)
      addi $sp, $sp, 4
      jr   $ra
b. 23 MIPS instructions to execute nonrecursive vs. 73 instructions to execute (corrected version of) recursion
Nonrecursive version:
FIB:  addi $sp, $sp, -4
      sw   $ra, ($sp)
      addi $s1, $0, 1
      addi $s2, $0, 1
LOOP: slti $t0, $a0, 3
      bne  $t0, $0, EXIT
      add  $s3, $s1, $0
      add  $s1, $s1, $s2
      add  $s2, $s3, $0
      addi $a0, $a0, -1
      j    LOOP
EXIT: add  $v0, $s1, $0
      lw   $ra, ($sp)
      addi $sp, $sp, 4
      jr   $ra
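As a cross-check of the loop logic, the nonrecursive routine can be mirrored in a high-level language. The Python sketch below is an illustration only; the function names are hypothetical, and each line is annotated with the MIPS instruction it is meant to mirror. A recursive counterpart is included so the two can be compared.

```python
def fib_iterative(n):
    """Mirror of the nonrecursive loop: s1 and s2 hold consecutive terms."""
    s1, s2 = 1, 1          # addi $s1, $0, 1 / addi $s2, $0, 1
    while n >= 3:          # slti $t0, $a0, 3 / bne $t0, $0, EXIT
        s3 = s1            # add  $s3, $s1, $0
        s1 = s1 + s2       # add  $s1, $s1, $s2
        s2 = s3            # add  $s2, $s3, $0
        n -= 1             # addi $a0, $a0, -1
    return s1              # add  $v0, $s1, $0


def fib_recursive(n):
    """Mirror of the recursive version: inputs below 3 return 1."""
    if n < 3:
        return 1
    return fib_recursive(n - 1) + fib_recursive(n - 2)
```

Both functions return the same values; the iterative one does a fixed amount of work per loop iteration, while the recursive one repeats subproblems, which is the source of the instruction-count gap quoted above.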
2.20.6
a. recursive version
FIB:  addi $sp, $sp, -12
      sw   $ra, 8($sp)
      sw   $s1, 4($sp)
      sw   $a0, 0($sp)
HERE: slti $t0, $a0, 3
      beq  $t0, $0, L1
      addi $v0, $0, 1
      j    EXIT
L1:   addi $a0, $a0, -1
      jal  FIB
      add  $s1, $v0, $0
      addi $a0, $a0, -1
      jal  FIB
      add  $v0, $v0, $s1
EXIT: lw   $a0, 0($sp)
      lw   $s1, 4($sp)
      lw   $ra, 8($sp)
      addi $sp, $sp, 12
      jr   $ra
at label HERE, after calling function FIB with input of 4:
old $sp => 0xnnnnnnnn ???
–4 contents of register $ra
–8 contents of register $s1
$sp => –12 contents of register $a0
b. recursive version
FIB:  addi $sp, $sp, -12
      sw   $ra, 8($sp)
      sw   $s1, 4($sp)
      sw   $a0, 0($sp)
HERE: slti $t0, $a0, 3
      beq  $t0, $0, L1
      addi $v0, $0, 1
      j    EXIT
L1:   addi $a0, $a0, -1
      jal  FIB
      add  $s1, $v0, $0
      addi $a0, $a0, -1
      jal  FIB
      add  $v0, $v0, $s1
EXIT: lw   $a0, 0($sp)
      lw   $s1, 4($sp)
      lw   $ra, 8($sp)
      addi $sp, $sp, 12
      jr   $ra
at label HERE, after calling function FIB with input of 4:
old $sp => 0xnnnnnnnn ???
–4 contents of register $ra
–8 contents of register $s1
$sp => –12 contents of register $a0
Solution 2.21
2.21.1
a. after entering function main:
old $sp => 0x7ffffffc ???
$sp => –4 contents of register $ra
after entering function leaf_function:
old $sp => 0x7ffffffc ???
–4 contents of register $ra
$sp => –8 contents of register $ra (return to main)
b. after entering function main:
old $sp => 0x7ffffffc ???
$sp => –4 contents of register $ra
after entering function my_function:
old $sp => 0x7ffffffc ???
–4 contents of register $ra
$sp => –8 contents of register $ra (return to main)
global pointers:
0x10008000 100 my_global
2.21.2
a.
MAIN: addi $sp, $sp, -4
      sw   $ra, ($sp)
      addi $a0, $0, 1
      jal  LEAF
      lw   $ra, ($sp)
      addi $sp, $sp, 4
      jr   $ra
LEAF: addi $sp, $sp, -8
      sw   $ra, 4($sp)
      sw   $s0, 0($sp)
      addi $s0, $a0, 1
      addi $t2, $0, 5
      slt  $t2, $t2, $a0     # $t2 = 1 if 5 < $a0
      bne  $t2, $0, DONE
      add  $a0, $s0, $0
      jal  LEAF
DONE: add  $v0, $s0, $0
      lw   $s0, 0($sp)
      lw   $ra, 4($sp)
      addi $sp, $sp, 8
      jr   $ra
b.
MAIN: addi $sp, $sp, -4
      sw   $ra, ($sp)
      addi $a0, $0, 10
      addi $t1, $0, 20
      lw   $a1, ($s0)        # assume $s0 has global variable base
      jal  FUNC
      add  $t2, $v0, $0
      lw   $ra, ($sp)
      addi $sp, $sp, 4
      jr   $ra
FUNC: sub  $v0, $a0, $a1
      jr   $ra
2.21.3
a.
MAIN: addi $sp, $sp, -4
      sw   $ra, ($sp)
      addi $a0, $0, 1
      jal  LEAF
      lw   $ra, ($sp)
      addi $sp, $sp, 4
      jr   $ra
LEAF: addi $sp, $sp, -8
      sw   $ra, 4($sp)
      sw   $s0, 0($sp)
      addi $s0, $a0, 1
      addi $t2, $0, 5
      slt  $t2, $t2, $a0     # $t2 = 1 if 5 < $a0
      bne  $t2, $0, DONE
      add  $a0, $s0, $0
      jal  LEAF
DONE: add  $v0, $s0, $0
      lw   $s0, 0($sp)
      lw   $ra, 4($sp)
      addi $sp, $sp, 8
      jr   $ra
b.
MAIN: addi $sp, $sp, -4
      sw   $ra, ($sp)
      addi $a0, $0, 10
      addi $t1, $0, 20
      lw   $a1, ($s0)        # assume $s0 has global variable base
      jal  FUNC
      add  $t2, $v0, $0
      lw   $ra, ($sp)
      addi $sp, $sp, 4
      jr   $ra
FUNC: sub  $v0, $a0, $a1
      jr   $ra
2.21.4
a. Register $s0 is used to hold a temporary result without saving $s0 first. To correct this problem, $t0 (or $v0) should be used in place of $s0 in the first two instructions. Note that a sub-optimal solution would be to continue using $s0, but add code to save/restore it.
b. The two addi instructions move the stack pointer in the wrong direction. Note that the MIPS calling convention requires the stack to grow down. Even if the stack grew up, this code would be incorrect because $ra and $s0 are saved according to the stack-grows-down convention.
2.21.5
a. int f(int a, int b, int c, int d){
  return 2*(a-d)+c-b;
}
b. int f(int a, int b, int c){
return g(a,b)+c;
}
2.21.6
a. The function returns 842 (which is 2 × (1 – 30) + 1000 – 100)
b. The function returns 1500 (g(a, b) is 500, so it returns 500 + 1000)
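The arithmetic for part a can be checked directly. The sketch below transcribes the expression from 2.21.5 a into Python; the argument values (a = 1, b = 100, c = 1000, d = 30) are read off the worked expression above rather than from the exercise statement, so treat them as an assumption.

```python
def f(a, b, c, d):
    # Same expression as the C function given for 2.21.5 a.
    return 2 * (a - d) + c - b


result = f(1, 100, 1000, 30)   # 2 * (1 - 30) + 1000 - 100
print(result)                  # prints 842
```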
Solution 2.22
2.22.1
a. 65 32 98 121 116 101
b. 99 111 109 112 117 116 101 114
2.22.2
a. U+0041, U+0020, U+0062, U+0079, U+0074, U+0065
b. U+0063, U+006f, U+006d, U+0070, U+0075, U+0074, U+0065, U+0072
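These values can be reproduced with a few lines of Python. The strings "A byte" and "computer" are inferred from the code points listed above, and the helper names are hypothetical.

```python
def code_points(text):
    """Decimal character codes, one per character."""
    return [ord(ch) for ch in text]


def unicode_labels(text):
    """Code points in U+xxxx notation with four lowercase hex digits."""
    return ["U+%04x" % ord(ch) for ch in text]


print(code_points("A byte"))        # the space character is 32 decimal (0x20)
print(unicode_labels("computer"))
```

For characters in the ASCII range the Unicode code point equals the ASCII value, which is why the two answers differ only in notation.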
2.22.3
a. add
b. shift
Solution 2.23
2.23.1
a.
MAIN:  addi $sp, $sp, -4
       sw   $ra, ($sp)
       addi $t6, $0, 0x30     # '0'
       addi $t7, $0, 0x39     # '9'
       add  $s0, $0, $0
       add  $t0, $a0, $0
LOOP:  lb   $t1, ($t0)
       slt  $t2, $t1, $t6
       bne  $t2, $0, DONE
       slt  $t2, $t7, $t1
       bne  $t2, $0, DONE
       sub  $t1, $t1, $t6
       beq  $s0, $0, FIRST
       mul  $s0, $s0, 10
FIRST: add  $s0, $s0, $t1
       addi $t0, $t0, 1
       j    LOOP
DONE:  add  $v0, $s0, $0
       lw   $ra, ($sp)
       addi $sp, $sp, 4
       jr   $ra
b.
MAIN:  addi $sp, $sp, -4
       sw   $ra, ($sp)
       addi $t4, $0, 0x41     # 'A'
       addi $t5, $0, 0x46     # 'F'
       addi $t6, $0, 0x30     # '0'
       addi $t7, $0, 0x39     # '9'
       add  $s0, $0, $0
       add  $t0, $a0, $0
LOOP:  lb   $t1, ($t0)
       slt  $t2, $t1, $t6
       bne  $t2, $0, DONE
       slt  $t2, $t7, $t1
       bne  $t2, $0, HEX
       sub  $t1, $t1, $t6
       j    DEC
HEX:   slt  $t2, $t1, $t4
       bne  $t2, $0, DONE
       slt  $t2, $t5, $t1
       bne  $t2, $0, DONE
       sub  $t1, $t1, $t4
       addi $t1, $t1, 10
DEC:   beq  $s0, $0, FIRST
       mul  $s0, $s0, 16      # hexadecimal digits: scale by 16
FIRST: add  $s0, $s0, $t1
       addi $t0, $t0, 1
       j    LOOP
DONE:  add  $v0, $s0, $0
       lw   $ra, ($sp)
       addi $sp, $sp, 4
       jr   $ra
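The control flow of the decimal routine in part a may be easier to follow in a high-level form. The Python sketch below is only an illustration (the name parse_decimal is hypothetical): it stops at the first character outside '0'..'9', as the two slt/bne pairs do, and otherwise folds the digit into the running total.

```python
def parse_decimal(text):
    """Accumulate leading ASCII digits, stopping at the first non-digit."""
    value = 0                                # add $s0, $0, $0
    for ch in text:
        code = ord(ch)                       # lb  $t1, ($t0)
        if code < 0x30 or code > 0x39:       # outside '0'..'9': go to DONE
            break
        value = value * 10 + (code - 0x30)   # mul by 10, then add the digit
    return value                             # add $v0, $s0, $0
```

The MIPS loop ends on the string's terminating null byte because 0 is below '0'; in Python the loop simply runs out of characters.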
Solution 2.24
2.24.1
a. 0x00000012
b. 0x12ffffff
2.24.2
a. 0x00000080
b. 0x80000000
2.24.3
a. 0x00000011
b. 0x11555555
Solution 2.25
2.25.1 Generally, all solutions are similar:
lui $t1, top_16_bits
ori $t1, $t1, bottom_16_bits
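To make the two-instruction pattern concrete, here is a small Python model of lui followed by ori. The helper names are hypothetical and the model covers only the effect on a 32-bit register value, not instruction encoding.

```python
def lui(imm16):
    """Place a 16-bit immediate in the upper half of a 32-bit word."""
    return (imm16 & 0xFFFF) << 16


def ori(reg, imm16):
    """OR a zero-extended 16-bit immediate into a 32-bit register value."""
    return (reg | (imm16 & 0xFFFF)) & 0xFFFFFFFF


def load_constant(value):
    """Build a 32-bit constant the way the lui/ori pair does."""
    top_16_bits = (value >> 16) & 0xFFFF
    bottom_16_bits = value & 0xFFFF
    t1 = lui(top_16_bits)
    return ori(t1, bottom_16_bits)
```

Because ori zero-extends its immediate, the lower half can be any 16-bit pattern without disturbing the upper half set by lui.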
2.25.2 Jump can go up to 0x0FFFFFFC.
a. no
b. no
2.25.3 Range is 0x604 + 0x1FFFC = 0x0002 0600 to 0x604 − 0x20000 = 0xFFFE 0604.
a. no
b. yes
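The range arithmetic in 2.25.3 can be reproduced as follows. This sketch assumes 0x604 is the base address the branch offset is added to, and that the 16-bit word offset spans -0x8000 to 0x7FFF; the function name is hypothetical.

```python
def branch_range(base):
    """Farthest forward and backward byte addresses, wrapped to 32 bits."""
    max_offset = 0x7FFF * 4     # largest positive word offset, in bytes (0x1FFFC)
    min_offset = -0x8000 * 4    # most negative word offset, in bytes (-0x20000)
    forward = (base + max_offset) & 0xFFFFFFFF
    backward = (base + min_offset) & 0xFFFFFFFF
    return forward, backward


forward, backward = branch_range(0x604)
print(hex(forward), hex(backward))   # prints 0x20600 0xfffe0604
```

The backward limit wraps around zero, which is why it appears as a large unsigned address.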
2.25.4 Range is 0x0042 0600 to 0x003E 0600.
a. no
b. no
2.25.5 Generally, all solutions are similar:
add  $t1, $zero, $zero          # clear $t1
addi $t2, $zero, top_8_bits     # set top 8b
sll  $t2, $t2, 24               # shift left 24 spots
or   $t1, $t1, $t2              # place top 8b into $t1
addi $t2, $zero, nxt1_8_bits    # set next 8b
sll  $t2, $t2, 16               # shift left 16 spots
or   $t1, $t1, $t2              # place next 8b into $t1
addi $t2, $zero, nxt2_8_bits    # set next 8b
sll  $t2, $t2, 8                # shift left 8 spots
or   $t1, $t1, $t2              # place next 8b into $t1
ori  $t1, $t1, bot_8_bits       # or in bottom 8b
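The byte-at-a-time construction can be sketched in Python to confirm the shift amounts (24, 16, 8, then no shift for the bottom byte). The function name and the sample bytes are hypothetical.

```python
def build_from_bytes(top, nxt1, nxt2, bot):
    """Assemble a 32-bit value from four 8-bit pieces, most significant first."""
    t1 = 0                                   # add $t1, $zero, $zero
    for byte, shift in ((top, 24), (nxt1, 16), (nxt2, 8), (bot, 0)):
        t2 = (byte & 0xFF) << shift          # addi, then sll by the shift amount
        t1 |= t2                             # or (ori for the bottom byte)
    return t1 & 0xFFFFFFFF
```

Each piece lands in its own byte lane, so the OR operations never overlap.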
2.25.6
a. 0x12345678
b. 0x12340000
2.25.7
a. t0 = (0x1234 << 16) | 0x5678;
b. t0 = (t0 | 0x5678);
t0 = 0x1234 << 16;
Solution 2.26
2.26.1 Branch range is 0x00020000 to 0xFFFE0004.
a. one branch
b. three branches
2.26.2
a. one
b. can’t be done
2.26.3 Branch range is 0x00000200 to 0xFFFFFE04.
a. eight branches
b. 512 branches
2.26.4
a. branch range is 16x larger
b. branch range is 16x smaller
2.26.5
a. no change
b. jump to addresses 0 to 2^12 instead of 0 to 2^28, assuming the PC<0x08000000
2.26.6
a. rs field now 3 bits
b. no change
Solution 2.27
2.27.1
a. jump register
b. beq
2.27.2
a. R-type
b. I-type
2.27.3
a. + can jump to any 32b address
– need to load a register with a 32b address, which could take multiple cycles
b. + allows the PC to be set to the current PC + 4 +/– BranchAddr, supporting quick forward and backward branches
– range of branches is smaller than large programs
2.27.4
a.
Address     Instruction            Encoding
0x00000000  lui $s0, 100           0x3c100100
0x00000004  ori $s0, $s0, 40       0x36100028
b.
Address     Instruction            Encoding
0x00000100  addi $t0, $0, 0x0000   0x20080000
0x00000104  lw $t1, 0x4000($t0)    0x8d094000
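The encodings above can be checked by packing the I-type fields by hand. The Python sketch below uses the standard MIPS opcodes and register numbers ($s0 = 16, $t0 = 8, $t1 = 9); the function name is hypothetical. Note that the printed encoding for lui corresponds to reading its immediate as 0x100, while the one for ori corresponds to 40 decimal (0x28); that reading is an inference from the encodings, not something the text states.

```python
def encode_i_type(opcode, rs, rt, imm):
    """Pack opcode (6 bits), rs (5), rt (5), and a 16-bit immediate."""
    return ((opcode & 0x3F) << 26) | ((rs & 0x1F) << 21) \
        | ((rt & 0x1F) << 16) | (imm & 0xFFFF)


OPCODE = {"lui": 0x0F, "ori": 0x0D, "addi": 0x08, "lw": 0x23}
ZERO, T0, T1, S0 = 0, 8, 9, 16   # register numbers

lui_word = encode_i_type(OPCODE["lui"], ZERO, S0, 0x100)
ori_word = encode_i_type(OPCODE["ori"], S0, S0, 40)
addi_word = encode_i_type(OPCODE["addi"], ZERO, T0, 0x0000)
lw_word = encode_i_type(OPCODE["lw"], T0, T1, 0x4000)
```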
2.27.5
a. addi $s0, $zero, 0x80
   sll  $s0, $s0, 17
   ori  $s0, $s0, 40
b. addi $t0, $0, 0x0040
   sll  $t0, $t0, 8
   lw   $t1, 0($t0)
2.27.6
a. 1
b. 1
Solution 2.28
2.28.1
a. 4 instructions
2.28.2
a. One of the locations specified by the LL instruction has no corresponding SC instruction.
2.28.3
a.
try:  MOV  R3,R4
      MOV  R6,R7
      LL   R2,0(R2)
      # adjustment or test code here
      SC   R3,0(R2)
      BEQZ R3,try
try2: LL   R5,0(R1)
      # adjustment or test code here
      SC   R6,0(R1)
      BEQZ R6,try2
      MOV  R4,R2
      MOV  R7,R5
2.28.4
a.
Processor 1       Processor 2       Cycle   Processor 1    Mem     Processor 2
                                            $t1   $t0      ($s1)   $t1   $t0
                                    0       1     2        99      30    40
ll $t1, 0($s1)    ll $t1, 0($s1)    1       99    2        99      99    40
sc $t0, 0($s1)                      2       99    1        2       99    40
                  sc $t0, 0($s1)    3       99    1        2       99    0
b.
Processor 1             Processor 2             Cycle   Processor 1       Mem     Processor 2
                                                        $s4   $t1   $t0   ($s1)   $s4   $t1   $t0
                                                0       2     3     4     99      10    20    30
                        try: add $t0, $0, $s4   1       2     3     4     99      10    20    10
try: add $t0, $0, $s4   ll $t1, 0($s1)          2       2     3     2     99      10    99    10
ll $t1, 0($s1)                                  3       2     99    2     99      10    99    10
sc $t0, 0($s1)                                  4       2     99    1     2       10    99    10
beqz $t0, try           sc $t0, 0($s1)          5       2     99    1     2       10    99    0
add $s4, $0, $t1        beqz $t0, try           6       99    99    1     2       10    99    0
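The first trace (two processors each executing ll followed by sc on the same word) can be replayed with a toy model. The Python sketch below is an illustration under simplifying assumptions: one shared word, one reservation per processor, and a successful sc invalidating every outstanding reservation. The class and variable names are hypothetical, and real hardware tracks reservations in an implementation-specific way.

```python
class LLSCMemory:
    """Toy model: one shared word plus a set of processors holding a link."""

    def __init__(self, value):
        self.value = value
        self.reserved = set()

    def ll(self, cpu):
        self.reserved.add(cpu)       # remember that this processor linked
        return self.value

    def sc(self, cpu, new_value):
        if cpu not in self.reserved:
            return 0                 # link was broken: nothing is stored
        self.value = new_value
        self.reserved.clear()        # a successful store breaks every link
        return 1


mem = LLSCMemory(99)
p1 = {"t1": 1, "t0": 2}
p2 = {"t1": 30, "t0": 40}

p1["t1"] = mem.ll("P1")              # cycle 1
p2["t1"] = mem.ll("P2")              # cycle 1
p1["t0"] = mem.sc("P1", p1["t0"])    # cycle 2: succeeds, memory becomes 2
p2["t0"] = mem.sc("P2", p2["t0"])    # cycle 3: fails, $t0 becomes 0
```

The final state agrees with the last row of that trace: memory holds 2, Processor 1 has $t0 = 1, and Processor 2 has $t0 = 0.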
Solution 2.29
2.29.1 The critical section can be implemented as:
trylk: li   $t1,1
       ll   $t0,0($a0)
       bnez $t0,trylk
       sc   $t1,0($a0)
       beqz $t1,trylk
       operation
       sw   $zero,0($a0)
Where operation is implemented as:
a.     lw   $t0,0($a1)
       add  $t0,$t0,$a2
       sw   $t0,0($a1)
b.     lw   $t0,0($a1)
       sge  $t1,$t0,$a2
       bnez $t1,skip
       sw   $a2,0($a1)
skip:
2.29.2 The entire critical section is now:
a.
try:   ll   $t0,0($a1)
       add  $t0,$t0,$a2
       sc   $t0,0($a1)
       beqz $t0,try
b.
try:   ll   $t0,0($a1)
       sge  $t1,$t0,$a2
       bnez $t1,skip
       mov  $t0,$a2
       sc   $t0,0($a1)
       beqz $t0,try
skip:
2.29.3 The code that directly uses ll/sc to update shvar avoids the entire lock/unlock code. When SC is executed, this code needs 1) one extra instruction to check the outcome of SC, and 2) if the register used for SC is needed again, an instruction to copy its value. However, these two additional instructions may not be needed, e.g., if SC is not on the best-case path or if it uses a register whose value is no longer needed. We have:
     Lock-based    Direct LL/SC implementation
a.   6+3           4
b.   6+3           3
2.29.4
a. Both processors attempt to execute SC at the same time, but one of them completes the write first. The other’s SC detects this and its SC operation fails.
b. It is possible for one or both processors to complete this code without ever reaching the SC instruction. If only one executes SC, it completes successfully. If both reach SC, they do so in the same cycle, but one SC completes first and then the other detects this and fails.
2.29.5 Every processor has a different set of registers, so a value in a register cannot be shared. Therefore, the shared variable shvar must be kept in memory, loaded each time its value is needed, and stored each time a task wants to change it. For local variable x there is no such restriction. On the contrary, we want to minimize the time spent in the critical section (or between the LL and SC), so if variable x is in memory it should be loaded to a register before the critical section to avoid loading it during the critical section.
2.29.6 If we simply do two instances of the code from 2.29.2 one after the other (to update one shared variable and then the other), each update is performed atomically, but the entire two-variable update is not atomic, i.e., after the update to the first variable and before the update to the second variable, another process can perform its own update of one or both variables. If we attempt to do two LLs (one for each variable), compute their new values, and then do two SC instructions (again, one for each variable), the second LL causes the SC that corresponds to the first LL to fail (we have a LL and SC with a non-register-register instruction executed between them). As a result, this code can never successfully complete.
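The failure of the two-LL approach can be illustrated with a toy model of a processor that tracks a single link. This is a sketch under stated assumptions, not a description of any particular implementation: the processor is assumed to remember only the address of its most recent LL, and any SC is assumed to consume the link. All names are hypothetical.

```python
class SingleLinkCPU:
    """Toy processor with one link register, so only the latest LL is tracked."""

    def __init__(self, memory):
        self.memory = memory
        self.link = None

    def ll(self, addr):
        self.link = addr             # a second LL overwrites the first link
        return self.memory[addr]

    def sc(self, addr, value):
        ok = self.link == addr
        if ok:
            self.memory[addr] = value
        self.link = None             # assumption: any SC consumes the link
        return 1 if ok else 0


def try_two_variable_update(cpu, attempts):
    """Retry LL, LL, SC, SC; report whether both stores ever succeed."""
    for _ in range(attempts):
        a = cpu.ll("var1")
        b = cpu.ll("var2")
        if cpu.sc("var1", a + 1) and cpu.sc("var2", b + 1):
            return True
    return False


memory = {"var1": 10, "var2": 20}
cpu = SingleLinkCPU(memory)
completed = try_two_variable_update(cpu, 5)   # never succeeds in this model
```

In this model the SC to the first variable fails on every attempt because the link already names the second variable, so retrying does not help.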
Solution 2.30
2.30.1
a. add  $t1, $t2, $0
b. addi $t0, $0, small
   beq  $t1, $t0, LOOP
2.30.2
a. Yes. The address of v is not known until the data segment is built at link time.
b. No. The branch displacement does not depend on the placement of the instruction in the text segment.
Solution 2.31
2.31.1
a.
Text Size 0x440
Data Size 0x90
Text Address Instruction
0x00400000 lw $a0, 0x8000($gp)
0x00400004 jal 0x0400140
… …
0x00400140 sw $a1, 0x8040($gp)
0x00400144 jal 0x0400000
… …
Data 0x10000000 (X)
… …
0x10000040 (Y)
b.
Text Size 0x440
Data Size 0x90
Text Address Instruction
0x00400000 lui $at, 0x1000
0x00400004 ori $a0, $at, 0
0x00400008 jal 0x0400140
… …
0x00400140 sw $a0, 8040($gp)
0x00400144 jmp 0x04002C0
… …
0x004002C0 jr $ra
… …
Data 0x10000000 (X)
… …
0x10000040 (Y)
2.31.2 0x8000 data, 0xFC00000 text. However, because of the size of the beq immediate field, 2^18 words is a more practical program limitation.
2.31.3 The limitation on the sizes of the displacement and address fields in the instruction encoding may make it impossible to use branch and jump instructions for objects that are linked too far apart.
Solution 2.32
2.32.1
a. swap:
   sll $t0,$a1,2
   add $t0,$t0,$a0
   lw  $t2,0($t0)
   sll $t1,$a2,2
   add $t1,$t1,$a0
   lw  $t3,0($t1)
   sw  $t3,0($t0)
   sw  $t2,0($t1)
   jr  $ra
b. swap:
   lw  $t0,0($a0)
   lw  $t1,4($a0)
   sw  $t1,0($a0)
   sw  $t0,4($a0)
   jr  $ra
2.32.2
a. Pass j+1 as a third parameter to swap. We can do this by adding an “addi $a2,$a1,1” instruction right before “jal swap”.
b. Pass the address of v[j] to swap. Since that address is already in $t2 at the point when we want to call swap, we can replace the two parameter-passing instructions before “jal swap” with a simple “mov $a0,$t2”.
2.32.3
a. swap:
   add $t0,$a1,$a0    # No sll
   lb  $t2,0($t0)     # Byte-sized load
   add $t1,$a2,$a0    # No sll
   lb  $t3,0($t1)
   sb  $t3,0($t0)     # Byte-sized store
   sb  $t2,0($t1)
   jr  $ra
b. swap:
   lb  $t0,0($a0)     # Byte-sized load
   lb  $t1,1($a0)     # Offset is 1, not 4
   sb  $t1,0($a0)     # Byte-sized store
   sb  $t0,1($a0)
   jr  $ra
2.32.4
a. Yes, we must save the additional s-registers. Also, the code for sort() in Figure 2.27 is using 5 t-registers and only 4 s-registers remain. Fortunately, we can easily reduce this number, e.g., by using t1 instead of t0 for loop comparisons.
b. No change to saving/restoring code is needed because the same s-registers are used in the modified sort() code.
2.32.5 When the array is already sorted, the inner loop always exits in its first iteration, as soon as it compares v[j] with v[j+1]. We have:
a. We need 4 more instructions to save and 4 more to restore registers. The number of instructions in the rest of the code is the same, so there are exactly 8 more instructions executed in the modified sort(), regardless of how large the array is.
b. One fewer instruction is executed in each iteration of the inner loop. Because the array is already sorted, the inner loop always exits during its first iteration, so we save one instruction per iteration of the outer loop. Overall, we execute 10 instructions fewer.
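The loop behavior behind these counts can be observed with a short Python sketch. It assumes sort() has the usual shape of the textbook routine (an outer loop over i and an inner loop that walks j down from i - 1 while v[j] > v[j+1], swapping as it goes) and assumes a 10-element array, which is what the figure of 10 saved instructions implies. The function name and counters are hypothetical.

```python
def sort_with_counts(values):
    """Sort a copy of the input, counting swaps and outer-loop iterations."""
    v = list(values)
    swaps = 0
    outer_iterations = 0
    for i in range(len(v)):
        outer_iterations += 1
        j = i - 1
        while j >= 0 and v[j] > v[j + 1]:
            v[j], v[j + 1] = v[j + 1], v[j]   # swap(v, j)
            swaps += 1
            j -= 1
    return v, swaps, outer_iterations


_, swaps_sorted, outer = sort_with_counts(range(10))
_, swaps_reversed, _ = sort_with_counts(range(9, -1, -1))
```

With already-sorted input the inner loop never swaps, so the cost is one exit test per outer iteration (10 iterations here). For contrast, a reverse-ordered array of the same size swaps on every inner-loop iteration, 45 times in total.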
2.32.6 When the array is sorted in reverse order, the inner loop always executes the maximum number of iterations and swap is called in each iteration of the inner loop (a total of 45 times). We have:
a. This change only affects the number of instructions needed to save/restore registers in swap(), so the answer is the same as in Problem 2.32.5.