
# Part II: Solutions Guide


52 Instructors Manual for Computer Organization and Design

### 1 Solutions

1.1 q 1.2 u 1.3 f 1.4 a 1.5 c 1.6 d 1.7 i 1.8 k 1.9 j 1.10 o 1.11 w 1.12 p 1.13 n 1.14 r 1.15 y 1.16 s 1.17 l 1.18 g 1.19 x 1.20 z 1.21 t 1.22 b 1.23 h 1.24 m 1.25 e 1.26 v 1.27 j 1.28 b 1.29 f 1.30 j 1.31 i 1.32 e

1.33 d 1.34 g 1.35 c 1.36 g 1.37 d 1.38 c 1.39 j 1.40 b 1.41 f 1.42 h 1.43 a 1.44 a

1.45 Time for

$$\tfrac{1}{2}\ \text{revolution} = \tfrac{1}{2}\ \text{rev} \times \frac{1}{5400}\ \frac{\text{minutes}}{\text{rev}} \times 60\ \frac{\text{seconds}}{\text{minute}} = 5.56\ \text{ms}$$

Time for

$$\tfrac{1}{2}\ \text{revolution} = \tfrac{1}{2}\ \text{rev} \times \frac{1}{7200}\ \frac{\text{minutes}}{\text{rev}} \times 60\ \frac{\text{seconds}}{\text{minute}} = 4.17\ \text{ms}$$

1.46 As discussed in section 1.4, die costs rise very fast with increasing die area. Consider a wafer with a large number of defects. It is quite likely that if the die area is very small, some dies will escape with no defects. On the other hand, if the die area is very large, it might be likely that every die has one or more defects. In general, then, die area greatly affects yield (as the equations on page 48 indicate), and so we would expect that dies from wafer B would cost much more than dies from wafer A.

1.47 The die area of the Pentium processor in Figure 1.16 is 91 mm² and it contains about 3.3 million transistors, or roughly 36,000 per square millimeter. If we assume the period has an area of roughly .1 mm², it would contain 3500 transistors (this is certainly a very rough estimate). Similar calculations with regard to Figure 1.26 and the Intel 4004 result in 191 transistors per square millimeter or roughly 19 transistors.

1.48 We can write Dies per wafer = f((Die area)⁻¹) and Yield = f((Die area)⁻²) and thus Cost per die = f((Die area)³). More formally, we can write:

$$\text{Cost per die} = \frac{\text{Cost per wafer}}{\text{Dies per wafer} \times \text{yield}}$$

$$\text{Dies per wafer} = \frac{\text{Wafer area}}{\text{Die area}}$$

$$\text{Yield} = \frac{1}{(1 + \text{Defects per area} \times \text{Die area}/2)^2}$$

1.49 No solution provided.

1.50 From the caption in Figure 1.16 we have 198 dies at 100% yield. If the defect density is 1 per square centimeter, then the yield is approximated by 1/((1 + 1 × .91/2)²) = .47. Thus 198 × .47 = 93 dies with a cost of \$1000/93 = \$10.75 per die.

1.51 Defects per area.
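The yield and cost arithmetic of solution 1.50 can be checked with a short Python sketch. The function names are illustrative only; the inputs (1 defect per cm², a 0.91 cm² die, 198 dies per wafer, a 1000-dollar wafer) are the figures quoted in that solution. The script carries the unrounded yield through, so its cost per die comes out slightly below the 10.75 obtained in the text from the rounded count of 93 good dies.

```python
def die_yield(defects_per_cm2: float, die_area_cm2: float) -> float:
    """Yield = 1 / (1 + defects per area * die area / 2)^2."""
    return 1.0 / (1.0 + defects_per_cm2 * die_area_cm2 / 2.0) ** 2


def cost_per_die(cost_per_wafer: float, dies_per_wafer: int, yield_fraction: float) -> float:
    """Cost per die = cost per wafer / (dies per wafer * yield)."""
    return cost_per_wafer / (dies_per_wafer * yield_fraction)


if __name__ == "__main__":
    y = die_yield(1.0, 0.91)  # 1 defect per cm^2, 0.91 cm^2 die
    print(f"yield        = {y:.2f}")
    print(f"good dies    = {198 * y:.1f}")
    print(f"cost per die = {cost_per_die(1000.0, 198, y):.2f}")
```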

1.52

1.53

1.54 No solution provided.

1.55 No solution provided.

1.56 No solution provided.

| Year | Die area | Yield | Defect density |
|---|---|---|---|
| 1980 | 0.16 | 0.48 | 17.04 |
| 1992 | 0.97 | 0.48 | 1.98 |

1992 + 1980 Improvement 8.62

$$\text{Yield} = \frac{1}{(1 + \text{Defects per area} \times \text{Die area}/2)^2}$$

### 2 Solutions

2.1 For program 1, M2 is 2.0 (10/5) times as fast as M1. For program 2, M1 is 1.33 (4/3) times as fast as M2.

2.2 Since we know the number of instructions executed and the time it took to execute the instructions, we can easily calculate the number of instructions per second while running program 1 as (200 × 10⁶)/10 = 20 × 10⁶ for M1 and (160 × 10⁶)/5 = 32 × 10⁶ for M2.

2.3 We know that Cycles per instruction = Cycles per second / Instructions per second. For M1 we thus have a CPI of 200 × 10⁶ cycles per second / 20 × 10⁶ instructions per second = 10 cycles per instruction. For M2 we have 300/32 = 9.4 cycles per instruction.

2.4 We are given the number of cycles per second and the number of seconds, so we can calculate the number of required cycles for each machine. If we divide this by the CPI we'll get the number of instructions. For M1, we have 3 seconds × 200 × 10⁶ cycles/second = 600 × 10⁶ cycles per program / 10 cycles per instruction = 60 × 10⁶ instructions per program. For M2, we have 4 seconds × 300 × 10⁶ cycles/second = 1200 × 10⁶ cycles per program / 9.4 cycles per instruction = 127.7 × 10⁶ instructions per program.

2.5 M2 is twice as fast as M1, but it does not cost twice as much. M2 is clearly the machine to purchase.

2.6 If we multiply the cost by the execution time, we are multiplying two quantities, for each of which smaller numbers are preferred. For this reason, cost times execution time is a good metric, and we would choose the machine with a smaller value. In the example, we get \$10,000 × 10 seconds = 100,000 for M1 vs. \$15,000 × 5 seconds = 75,000 for M2, and thus M2 is the better choice. If we used cost divided by execution time and assume we choose the machine with the larger value, then a machine with a ridiculously high cost would be chosen. This makes no sense. If we choose the machine with the smaller value, then a machine with a ridiculously high execution time would be chosen. This too makes no sense.

2.7 We would define cost-effectiveness as performance divided by cost. This is essentially (1/Execution time) × (1/Cost), and in both cases larger numbers are more cost-effective when we multiply.

2.8 We can use the method in Exercise 2.7, but the execution time is the sum of the two execution times.

$$\text{Executions per second per dollar for M1} = \frac{1}{13 \times 10{,}000} = \frac{1}{130{,}000}$$

$$\text{Executions per second per dollar for M2} = \frac{1}{9 \times 15{,}000} = \frac{1}{135{,}000}$$

So M1 is slightly more cost-effective, specifically 1.04 times more.
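As a quick check of solutions 2.2 and 2.3, the following Python sketch recomputes the instruction rate and CPI for program 1. The dictionary and function names are made up for illustration; the clock rates, instruction counts, and run times are the ones quoted in those solutions.

```python
# Figures for program 1 as quoted in solutions 2.2 and 2.3.
CLOCK_HZ = {"M1": 200e6, "M2": 300e6}
P1_INSTRUCTIONS = {"M1": 200e6, "M2": 160e6}
P1_SECONDS = {"M1": 10.0, "M2": 5.0}


def instructions_per_second(machine: str) -> float:
    """Instructions executed divided by the time taken."""
    return P1_INSTRUCTIONS[machine] / P1_SECONDS[machine]


def cycles_per_instruction(machine: str) -> float:
    """CPI = cycles per second / instructions per second."""
    return CLOCK_HZ[machine] / instructions_per_second(machine)


if __name__ == "__main__":
    for name in ("M1", "M2"):
        rate = instructions_per_second(name)
        print(f"{name}: {rate:.3g} instructions/s, CPI = {cycles_per_instruction(name):.1f}")
```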

2.9 We do this problem by finding the number of times that program 2 can be run in an hour and using that as the throughput measure.

$$\text{Executions of P2 per hour} = \frac{3600\ \frac{\text{seconds}}{\text{hour}} - 200 \times \frac{\text{seconds}}{\text{Execution of P1}}}{\frac{\text{seconds}}{\text{Execution of P2}}}$$

$$\text{Executions of P2 per hour on M1} = \frac{3600 - 200 \times 10}{3} = \frac{1600}{3} = 533$$

$$\text{Executions of P2 per hour on M2} = \frac{3600 - 200 \times 5}{4} = \frac{2600}{4} = 650$$

With performance measured by throughput for program 2, machine M2 is 650/533 = 1.2 times faster than M1. The cost-effectiveness of the machines is to be measured in units of throughput on program 2 per dollar, so

Cost-effectiveness of M1 = 533/10,000 = 0.053

Cost-effectiveness of M2 = 650/15,000 = 0.043

Thus, M1 is more cost-effective than M2. (Machine costs are from Exercise 2.5.)

2.10 For M1 the peak performance will be achieved with a sequence of instructions of class A, which have a CPI of 1. The peak performance is thus 500 MIPS.

For M2, a mixture of A and B instructions, both of which have a CPI of 2, will achieve the peak performance, which is 375 MIPS.

2.11 Let's find the CPI for each machine first.

$$\text{CPI for M1} = \frac{1 + 2 + 3 + 4}{4} = 2.5, \qquad \text{CPI for M2} = \frac{2 + 2 + 4 + 4}{4} = 3.0$$

Using

$$\text{CPU time} = \frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}}$$

we get the following:

$$\text{CPU time for M1} = \frac{\text{Instruction count} \times 2.5}{500\ \text{MHz}} = \frac{\text{Instruction count}}{200\ \text{million}}$$

$$\text{CPU time for M2} = \frac{\text{Instruction count} \times 3}{750\ \text{MHz}} = \frac{\text{Instruction count}}{250\ \text{million}}$$

M2 has a smaller execution time and is thus faster by the inverse ratio of the execution time or 250/200 = 1.25.

2.12 M1 would be as fast if the clock rate were 1.25 higher, so 500 × 1.25 = 625 MHz.

2.13 Note: There is an error in Exercise 2.13 on page 92 in the text. The table entry for row c, column 3 ("CPI on M2") should be 3 instead of 8. This will be corrected in the first reprint of the book. With the corrected value of 3, this solution is valid. Using C1, the CPI on M1 = 5.8 and the CPI on M2 = 3.2. Because M1 has a clock rate twice as fast as that of M2, M1 is 1.10 times as fast. Using C2, the CPI on M1 = 6.4 and the CPI on M2 = 2.9. M2 is (6.4/2)/2.9 = 1.10 times as fast. Using a third-party product, CPI on M1 = 5.4 and on M2 = 2.8. The third-party compiler is the superior product regardless of machine purchase. M1 is the machine to purchase using the third-party compiler, as it will be 1.04 times faster for typical programs.
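The throughput measure used in Exercise 2.9 is easy to recompute. The sketch below is only an illustration: the function name and its default arguments are assumptions, while the run times (10 s and 3 s on M1, 5 s and 4 s on M2), the 200 runs of program 1, and the machine costs are the values used in the solution.

```python
def p2_runs_per_hour(p1_seconds: float, p2_seconds: float,
                     p1_runs: int = 200, seconds_per_hour: int = 3600) -> float:
    """Runs of program 2 that fit in the hour left after p1_runs runs of program 1."""
    return (seconds_per_hour - p1_runs * p1_seconds) / p2_seconds


if __name__ == "__main__":
    machines = {"M1": (10.0, 3.0, 10_000), "M2": (5.0, 4.0, 15_000)}
    for name, (p1, p2, cost) in machines.items():
        runs = p2_runs_per_hour(p1, p2)
        print(f"{name}: {runs:.0f} runs/hour, {runs / cost:.3f} runs/hour per dollar")
```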

2.14 Let I = number of instructions in program and C = number of cycles in program.

The six subsets are {clock rate, C} {cycle time, C} {MIPS, I} {CPI, C, MIPS} {CPI, I, clock rate} {CPI, I, cycle time}. Note that in every case each subset has to have at least one rate {CPI, clock rate, cycle time, MIPS} and one absolute {C, I}.

2.15

$$\text{MIPS} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6}$$

Let's find the CPI for MFP first:

CPI for MFP = 0.1 × 6 + 0.15 × 4 + 0.05 × 20 + 0.7 × 2 = 3.6

Of course, the CPI for MNFP is simply 2. So

$$\text{MIPS for MFP} = \frac{1000}{\text{CPI}} = 278 \quad\text{and}\quad \text{MIPS for MNFP} = \frac{1000}{\text{CPI}} = 500$$

2.16

| Instruction class | Frequency on MFP | Count on MFP in millions | Count on MNFP in millions |
|---|---|---|---|
| Floating point multiply | 10% | 30 | 900 |
| Floating point add | 15% | 45 | 900 |
| Floating point divide | 5% | 15 | 750 |
| Integer instructions | 70% | 210 | 210 |
| Totals | 100% | 300 | 2760 |

2.17

$$\text{Execution time} = \frac{\text{IC}}{\text{MIPS} \times 10^6}$$

So execution time on MFP is 300/278 = 1.08 seconds, and execution time on MNFP is 2760/500 = 5.52 seconds.

2.18 CPI for Mbase = 2 × 0.4 + 3 × 0.25 + 3 × 0.25 + 5 × 0.1 = 2.8

CPI for Mopt = 2 × 0.4 + 2 × 0.25 + 3 × 0.25 + 4 × 0.1 = 2.45

2.19 MIPS for Mbase = 500/2.8 = 179. MIPS for Mopt = 600/2.45 = 245.

2.20 Since it's the same architecture, we can compare the native MIPS ratings. Mopt is faster by the ratio 245/179 = 1.4.

2.21 This problem can be done in one of two ways. Either find the new mix and adjust the frequencies first or find the new (relative) instruction count and divide the CPI by that. We use the latter.

Ratio of instructions = 0.9 × 0.4 + 0.9 × 0.25 + 0.85 × 0.25 + 0.1 × 0.95 = 0.89

So we can calculate CPI as

$$\text{CPI} = \frac{2 \times 0.4 \times 0.9 + 3 \times 0.25 \times 0.9 + 3 \times 0.25 \times 0.85 + 5 \times 0.1 \times 0.95}{0.89} = 2.8$$

2.22 How much faster is Mcomp than Mbase?

$$\text{CPU time Mbase} = \frac{\text{IC} \times \text{CPI}}{\text{Clock rate}} = \frac{\text{IC} \times 2.8}{\text{Clock rate}}$$

$$\text{CPU time Mcomp} = \frac{\text{IC} \times 0.89 \times 2.8}{\text{Clock rate}} = \frac{\text{IC} \times 2.5}{\text{Clock rate}}$$

So then

$$\frac{\text{Performance Mcomp}}{\text{Performance Mbase}} = \frac{\text{CPU time Mbase}}{\text{CPU time Mcomp}} = \frac{\text{IC} \times 2.8 / \text{Clock rate}}{\text{IC} \times 2.5 / \text{Clock rate}} = \frac{2.8}{2.5} = 1.12$$

2.23 The CPI is different from either Mbase or Mcomp; find that first:

$$\text{Mboth CPI} = \frac{2 \times 0.4 \times 0.9 + 2 \times 0.25 \times 0.9 + 3 \times 0.25 \times 0.85 + 4 \times 0.1 \times 0.95}{0.89} = 2.5$$

The relative instruction count times this CPI is 0.89 × 2.5 = 2.2, so

$$\frac{\text{Performance Mboth}}{\text{Performance Mbase}} = \frac{\text{CPU time Mbase}}{\text{CPU time Mboth}} = \frac{\text{IC} \times 2.8 / 500\ \text{MHz}}{\text{IC} \times 2.2 / 600\ \text{MHz}} = \frac{2.8 \times 600\ \text{MHz}}{2.2 \times 500\ \text{MHz}} = 1.5$$

2.24 First, compute the performance growth after 6 and 8 months. After 6 months = 1.034⁶ = 1.22. After 8 months = 1.034⁸ = 1.31. The best choice would be to implement either Mboth or Mopt.

2.25 No solution provided.

2.26 Total execution time of computer A is 1001 seconds; computer B, 110 seconds; computer C, 40 seconds. Computer C is fastest. It's 25 times faster than computer A and 2.75 times faster than computer B.

2.27 We can just take the GM of the execution times and use the inverse.

$$\text{GM(A)} = \sqrt{1000} = 32, \quad \text{GM(B)} = \sqrt{1000} = 32, \quad \text{GM(C)} = \sqrt{400} = 20$$

so C is fastest.

2.28 A, B: B has the same performance as A. If we run program 2 once, how many times should we run program 1: x + 1000 = 10x + 100, or x = 100. So the mix is 99% program 1, 1% program 2.

B, C: C is faster by the ratio of 32/20 = 1.6. Program 2 is run once, so we have 10x + 100 = 1.6 × (20x + 20), x = 3.1 times. So the mix is 76% program 1 and 24% program 2.

A, C: C is also faster by 1.6 here. We use the same equation, but with the proper times: x + 1000 = 1.6 × (20x + 20), x = 31.2. So the mix is 97% program 1 and 3% program 2.

Note that the mix is very different in each case!

2.29

| Program | Weight | Computer A | Computer B | Computer C |
|---|---|---|---|---|
| Program 1 (seconds) | 10 | 1 | 10 | 20 |
| Program 2 (seconds) | 1 | 1000 | 100 | 20 |
| Weighted AM | | 91.8 | 18.2 | 20 |

So B is fastest; it is 1.10 times faster than C and 5.0 times faster than A. For an equal number of executions of the programs, the ratio of total execution times A:B:C is 1001:110:40, thus C is 2.75 times faster than B and 25 times faster than A.
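The program mixes found in Exercise 2.28 each come from one linear equation in x, the number of runs of program 1 per run of program 2. A minimal Python sketch of that calculation follows; the function names are hypothetical, and the execution times and the 1.6 performance ratio are the ones used in the solution.

```python
def runs_of_program_1(first, second, ratio):
    """Solve first_total(x) = ratio * second_total(x) for x.

    first and second are (program 1 seconds, program 2 seconds) pairs;
    each total is x runs of program 1 plus one run of program 2.
    """
    first_p1, first_p2 = first
    second_p1, second_p2 = second
    return (ratio * second_p2 - first_p2) / (first_p1 - ratio * second_p1)


def program_1_share(x):
    """Fraction of the mix that is program 1 when program 2 runs once."""
    return x / (x + 1.0)


if __name__ == "__main__":
    a, b, c = (1, 1000), (10, 100), (20, 20)
    for label, x in (("A, B", runs_of_program_1(a, b, 1.0)),
                     ("B, C", runs_of_program_1(b, c, 1.6)),
                     ("A, C", runs_of_program_1(a, c, 1.6))):
        print(f"{label}: x = {x:.1f}, program 1 share = {program_1_share(x):.0%}")
```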

2.30 Equal time on machine A:

| Program | Weight | Computer A | Computer B | Computer C |
|---|---|---|---|---|
| Program 1 (seconds) | 1 | 1 | 10 | 20 |
| Program 2 (seconds) | 1/1000 | 1000 | 100 | 20 |
| Weighted AM | | 2 | 10.1 | 20 |

This makes A the fastest.

Now with equal time on machine B:

| Program | Weight | Computer A | Computer B | Computer C |
|---|---|---|---|---|
| Program 1 (seconds) | 1 | 1 | 10 | 20 |
| Program 2 (seconds) | 1/10 | 1000 | 100 | 20 |
| Weighted AM | | 91.8 | 18.2 | 20 |

Machine B is the fastest.

Comparing them to unweighted numbers, we notice that this weighting always makes the base machine fastest, and machine C second. The unweighted mean makes machine C fastest (and is equivalent to equal time weighting on C).

2.31 Assume 100 instructions, then the number of cycles will be 90 × 4 + 10 × 12 = 480 cycles. Of these, 120 are spent doing multiplication, and thus 25% of the time is spent doing multiplication.

2.32 Unmodified for 100 instructions we are using 480 cycles, and if we improve multiplication it will only take 420 cycles. But the improvement increases the cycle time by 20%. Thus we should not perform the improvement as the original is 1.2(420)/480 = 1.05 times faster than the improvement!

2.33 No solution provided.

2.34 No solution provided.

2.35 No solution provided.

2.36 No solution provided.

2.37 No solution provided.

2.38

| Program | Computer A | Computer B | Computer C |
|---|---|---|---|
| 1 | 10 | 1 | 0.5 |
| 2 | 0.1 | 1 | 5 |

2.39 The harmonic mean of a set of rates,

$$\text{HM} = \frac{n}{\sum_{i=1}^{n} \frac{1}{\text{Rate}_i}} = \frac{n}{\sum_{i=1}^{n} \text{Time}_i} = \frac{1}{\frac{\sum_{i=1}^{n} \text{Time}_i}{n}} = \frac{1}{\frac{1}{n}\sum_{i=1}^{n} \text{Time}_i} = \frac{1}{\text{AM}}$$

where AM is the arithmetic mean of the corresponding execution times.
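The identity in 2.39 can be confirmed numerically. In this Python sketch the helper names are illustrative, and the sample times (1 s and 1000 s, computer A's two programs) are borrowed from the earlier tables purely as example data.

```python
import math


def harmonic_mean(rates):
    """n divided by the sum of the reciprocals of the rates."""
    return len(rates) / sum(1.0 / rate for rate in rates)


def arithmetic_mean(values):
    """Sum of the values divided by their count."""
    return sum(values) / len(values)


if __name__ == "__main__":
    times = [1.0, 1000.0]              # seconds per execution (example data)
    rates = [1.0 / t for t in times]   # executions per second
    hm = harmonic_mean(rates)
    am = arithmetic_mean(times)
    print(hm, 1.0 / am, math.isclose(hm, 1.0 / am))
```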

2.40 No solution provided.

2.41 No solution provided.

2.42 No solution provided.

2.43 No solution provided.

2.44 Using Amdahl’s law (or just common sense) we can determine the following:

Speedup if we improve only multiplication = 100 / (30 + 50 + 20/4) = 100/85 = 1.18.

Speedup if we only improve memory access = 100 / (100 – (50 – 50/2)) = 100/75 = 1.33.

Speedup if both improvements are made = 100 / (30 + 50/2 + 20/4) = 100/60 = 1.67.

2.45 The problem is solved algebraically and results in the equation 100/(Y + (100 – X – Y) + X/4) = 100/(X + (100 – X – Y) + Y/2)

where X = multiplication percentage and Y = memory percentage. Solving, we get memory percentage = 1.5 × multiplication percentage. Many examples thus exist, e.g., multiplication = 20%, memory = 30%, other = 50%, or multiplication = 30%, memory = 45%, other = 25%, etc.
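The speedups in 2.44 and the equal-speedup condition in 2.45 can both be checked with one small helper. This is a sketch only: the `speedup` function and its (percent, factor) input format are assumptions made for illustration, and the percentages are the ones quoted in the two solutions.

```python
def speedup(parts):
    """Overall speedup for parts given as (percent of original time, improvement factor)."""
    before = sum(percent for percent, _ in parts)
    after = sum(percent / factor for percent, factor in parts)
    return before / after


if __name__ == "__main__":
    # Exercise 2.44: 30% other, 50% memory access, 20% multiplication.
    print("multiply only:", round(speedup([(30, 1), (50, 1), (20, 4)]), 2))
    print("memory only:  ", round(speedup([(30, 1), (50, 2), (20, 1)]), 2))
    print("both:         ", round(speedup([(30, 1), (50, 2), (20, 4)]), 2))
    # Exercise 2.45: multiplication = 20%, memory = 30%, other = 50%.
    multiply_only = speedup([(50, 1), (30, 1), (20, 4)])
    memory_only = speedup([(50, 1), (30, 2), (20, 1)])
    print("2.45 speedups equal:", abs(multiply_only - memory_only) < 1e-12)
```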

2.46

$$\text{Speedup} = \frac{\text{Execution time before improvement}}{\text{Execution time after improvement}}$$

Rewrite the execution time equation:

$$\text{Execution time after improvement} = \frac{\text{Execution time affected by improvement}}{\text{Amount of improvement}} + \text{Execution time unaffected}$$

$$= \frac{\text{Execution time affected} + \text{Amount of improvement} \times \text{Execution time unaffected}}{\text{Amount of improvement}}$$

Rewrite execution time affected by improvement as execution time before improvement × f, where f is the fraction affected. Similarly, execution time unaffected is execution time before improvement × (1 – f).

$$= \frac{\text{Execution time before improvement} \times f}{\text{Amount of improvement}} + \text{Execution time before improvement} \times (1 - f)$$

$$= \left( \frac{f}{\text{Amount of improvement}} + (1 - f) \right) \times \text{Execution time before improvement}$$

$$\text{Speedup} = \frac{\text{Execution time before improvement}}{\left( \frac{f}{\text{Amount of improvement}} + (1 - f) \right) \times \text{Execution time before improvement}}$$

$$\text{Speedup} = \frac{1}{\frac{f}{\text{Amount of improvement}} + (1 - f)}$$

The denominator has two terms: the fraction improved (f) divided by the amount of the improvement, and the fraction unimproved (1 – f).

### 3 Solutions

3.1 The program computes the sum of odd numbers up to the largest odd number smaller than or equal to n, e.g., 1 + 3 + 5 + ... + n (or n – 1 if n is even). There are many alternative ways to express this summation. For example, an equally valid answer is that the program calculates (ceiling(n/2))².

3.2 The code determines the most frequent word appearing in the array and returns it in \$v1 and its multiplicity in \$v0.

3.3 Ignoring the four instructions before the loops, we see that the outer loop (which iterates 5000 times) has four instructions before the inner loop and six after in the worst case. The cycles needed to execute these are 1 + 2 + 1 + 1 = 5 and 1 + 2 + 1 + 1 + 1 + 2 = 8, for a total of 13 cycles per iteration, or 5000 × 13 for the outer loop. The inner loop requires 1 + 2 + 2 + 1 + 1 + 2 = 9 cycles per iteration and it repeats 5000 × 5000 times, for a total of 9 × 5000 × 5000 cycles. The overall execution time is thus approximately (5000 × 13 + 9 × 5000 × 5000) / (500 × 10⁶) = .45 sec. Note that the execution time for the inner loop is really the only code of significance.

3.4 addi \$t0,\$t1,100 # register \$t0 = \$t1 + 100

3.5 The base address of x, in binary, is 0000 0000 0011 1101 0000 1001 0000 0000, which implies that we must use lui:

lui \$t1, 0000 0000 0011 1101

ori \$t1, \$t1, 0000 1001 0000 0000

lw \$t2, 44(\$t1)

add \$t2, \$t2, \$t0

sw \$t2, 40(\$t1)

3.6

addi \$v0,\$zero,-1 # Initialize to avoid counting zero word

loop: lw \$v1,0(\$a0) # Read next word from source

addi \$v0,\$v0,1 # Increment count words copied

sw \$v1,0(\$a1) # Write to destination

addi \$a0,\$a0,4 # Advance pointer to next source

addi \$a1,\$a1,4 # Advance pointer to next dest

bne \$v1,\$zero,loop # Loop if the word copied ≠ zero

Bugs:

1. Count (\$v0) is not initialized.

2. Zero word is counted. (1 and 2 fixed by initializing \$v0 to –1).

3. Source pointer (\$a0) incremented by 1, not 4.

4. Destination pointer (\$a1) incremented by 1, not 4.

3.7

| Instruction | Format | op | rs | rt | immediate |
|---|---|---|---|---|---|
| lw \$v1,0(\$a0) | I | 35 | 4 | 3 | 0 |
| addi \$v0,\$v0,1 | I | 8 | 2 | 2 | 1 |
| sw \$v1,0(\$a1) | I | 43 | 5 | 3 | 0 |
| addi \$a0,\$a0,1 | I | 8 | 4 | 4 | 1 |
| addi \$a1,\$a1,1 | I | 8 | 5 | 5 | 1 |
| bne \$v1,\$zero,loop | I | 5 | 3 | 0 | –20 |
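To see how the fields tabulated in 3.7 form a machine word, here is a small Python sketch that packs an I-format instruction. It assumes the standard MIPS I-format layout (6-bit op, 5-bit rs, 5-bit rt, 16-bit immediate); the function name is made up for illustration, and the immediate is packed exactly as listed in the table.

```python
def encode_i_format(op: int, rs: int, rt: int, immediate: int) -> int:
    """Pack the four I-format fields into one 32-bit word.

    Field widths are 6, 5, 5 and 16 bits; a negative immediate is stored
    in 16-bit two's complement.
    """
    if not (0 <= op < 64 and 0 <= rs < 32 and 0 <= rt < 32):
        raise ValueError("op, rs or rt out of range")
    if not (-(1 << 15) <= immediate < (1 << 15)):
        raise ValueError("immediate does not fit in 16 bits")
    return (op << 26) | (rs << 21) | (rt << 16) | (immediate & 0xFFFF)


if __name__ == "__main__":
    rows = [
        ("lw   $v1,0($a0)", 35, 4, 3, 0),
        ("addi $v0,$v0,1", 8, 2, 2, 1),
        ("sw   $v1,0($a1)", 43, 5, 3, 0),
    ]
    for text, op, rs, rt, imm in rows:
        print(f"{text:<18} 0x{encode_i_format(op, rs, rt, imm):08X}")
```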

3.8 count = -1;

do {

temp = *source;

count = count + 1;

*destination = temp;

source = source + 1;

destination = destination + 1;

} while (temp != 0);

3.9 The C loop is

while (save[i] == k) i = i + j;

with i, j, and k corresponding to registers \$s3, \$s4, and \$s5 and the base of the array save in \$s6. The assembly code given in the example is

Code before

Loop: add \$t1, \$s3, \$s3 # Temp reg \$t1 = 2 * i

add \$t1, \$t1, \$t1 # Temp reg \$t1 = 4 * i

add \$t1, \$t1, \$s6 # \$t1 = address of save[i]

lw \$t0, 0(\$t1) # Temp reg \$t0 = save[i]

bne \$t0, \$s5, Exit # go to Exit if save[i] ≠ k

add \$s3, \$s3, \$s4 # i = i + j

j Loop # go to Loop

Exit:

Number of instructions executed if save[i + m * j] does not equal k for m = 10 and does equal k for 0 ≤ m ≤ 9 is 10 × 7 + 5 = 75, which corresponds to 10 complete iterations of the loop plus a final pass that goes to Exit at the bne instruction before updating i. Straightforward rewriting to use at most one branch or jump in the loop yields

Code after

add \$t1, \$s3, \$s3 # Temp reg \$t1 = 2 * i

add \$t1, \$t1, \$t1 # Temp reg \$t1 = 4 * i

add \$t1, \$t1, \$s6 # \$t1 = address of save[i]

lw \$t0, 0(\$t1) # Temp reg \$t0 = save[i]

bne \$t0, \$s5, Exit # go to Exit if save[i] ≠ k

Loop: add \$s3, \$s3, \$s4 # i = i + j

add \$t1, \$s3, \$s3 # Temp reg \$t1 = 2 * i

add \$t1, \$t1, \$t1 # Temp reg \$t1 = 4 * i

add \$t1, \$t1, \$s6 # \$t1 = address of save[i]

lw \$t0, 0(\$t1) # Temp reg \$t0 = save[i]

beq \$t0, \$s5, Loop # go to Loop if save[i] = k

Exit:

The number of instructions executed by this new form of the loop is 5 + 10 × 6 = 65. If 4 × j is computed before the loop, then further saving in the loop body is possible.

Code after further improvement

add \$t2, \$s4, \$s4 # Temp reg \$t2 = 2 * j

add \$t2, \$t2, \$t2 # Temp reg \$t2 = 4 * j

add \$t1, \$s3, \$s3 # Temp reg \$t1 = 2 * i

add \$t1, \$t1, \$t1 # Temp reg \$t1 = 4 * i

add \$t1, \$t1, \$s6 # \$t1 = address of save[i]

lw \$t0, 0(\$t1) # Temp reg \$t0 = save[i]

bne \$t0, \$s5, Exit # go to Exit if save[i] ≠ k

Loop: add \$t1, \$t1, \$t2 # \$t1 = address of save[i + m * j]

lw \$t0, 0(\$t1) # Temp reg \$t0 = save[i + m * j]

beq \$t0, \$s5, Loop # go to Loop if save[i + m * j] = k

Exit:

The number of instructions executed is now 7 + 10 × 3 = 37.

3.10

| Pseudoinstruction | What it accomplishes | Solution |
|---|---|---|
| move \$t5, \$t3 | \$t5 = \$t3 | add \$t5, \$t3, \$zero |
| clear \$t5 | \$t5 = 0 | add \$t5, \$zero, \$zero |
| li \$t5, small | \$t5 = small | addi \$t5, \$zero, small |
| li \$t5, big | \$t5 = big | lui \$t5, upper_half(big); ori \$t5, \$t5, lower_half(big) |
| lw \$t5, big(\$t3) | \$t5 = Memory[\$t3 + big] | li \$at, big; add \$at, \$at, \$t3; lw \$t5, 0(\$at) |
| addi \$t5, \$t3, big | \$t5 = \$t3 + big | li \$at, big; add \$t5, \$t3, \$at |
| beq \$t5, small, L | if (\$t5 = small) go to L | li \$at, small; beq \$t5, \$at, L |
| beq \$t5, big, L | if (\$t5 = big) go to L | li \$at, big; beq \$t5, \$at, L |
| ble \$t5, \$t3, L | if (\$t5 <= \$t3) go to L | slt \$at, \$t3, \$t5; beq \$at, \$zero, L |
| bgt \$t5, \$t3, L | if (\$t5 > \$t3) go to L | slt \$at, \$t3, \$t5; bne \$at, \$zero, L |
| bge \$t5, \$t3, L | if (\$t5 >= \$t3) go to L | slt \$at, \$t5, \$t3; beq \$at, \$zero, L |

Note: In the solutions, we make use of the li instruction, which should be implemented as shown in rows 3 and 4.

3.11 The fragment of C code is

for (i=0; i<=100; i=i+1) {a[i] = b[i] + c;}

with a and b arrays of words at base addresses \$a0 and \$a1, respectively. First initialize i to 0 with i kept in \$t0:

add \$t0, \$zero, \$zero # Temp reg \$t0 = 0

Assume that \$s0 holds the address of c (if \$s0 is assumed to hold the value of c, omit the following instruction):

lw \$t1, 0(\$s0) # Temp reg \$t1 = c

To compute the byte address of successive array elements and to test for loop termination, the constants 4 and 401 are needed. Assume they are placed in memory when the program is loaded:

lw \$t2, AddressConstant4(\$zero) # Temp reg \$t2 = 4

lw \$t3, AddressConstant401(\$zero) # Temp reg \$t3 = 401

In section 3.8 the instructions addi (add immediate) and slti (set less than immediate) are introduced. These instructions can carry constants in their machine code representation, saving a load instruction and use of a register. Now the loop body accesses array elements, performs the computation, and tests for termination:

Loop: add \$t4, \$a1, \$t0 # Temp reg \$t4 = address of b[i]

lw \$t5, 0(\$t4) # Temp reg \$t5 = b[i]

add \$t6, \$t5, \$t1 # Temp reg \$t6 = b[i] + c

This add instruction would be add \$t6, \$t5, \$s0 if it were assumed that \$s0 holds the value of c. Continuing the loop:

add \$t7, \$a0, \$t0 # Temp reg \$t7 = address of a[i]

sw \$t6, 0(\$t7) # a[i] = b[i] + c

add \$t0, \$t0, \$t2 # i = i + 4

slt \$t8, \$t0, \$t3 # \$t8 = 1 if \$t0 < 401, i.e., i ≤ 100

bne \$t8, \$zero, Loop # go to Loop if i ≤ 100

The number of instructions executed is 4 + 101 × 8 = 812. The number of data references made is 3 + 101 × 2 = 205.

3.12 The problem is that we are using PC-relative addressing, so if the address of there is too far away, we won’t be able to use 16 bits to describe where it is relative to the PC. One simple solution would be

here: bne \$t1, \$t2, skip

j there

skip:

...

This will work as long as our program does not cross the 256-MB address boundary described in the elaboration on page 150.

3.13 Let I be the number of instructions taken by gcc on the unmodified MIPS. This decomposes into .48I arithmetic instructions, .33I data transfer instructions, .17I conditional branches, and .02I jumps. Using the CPIs given for each instruction class, we get a total of (.48 × 1.0 + .33 × 1.4 + .17 × 1.7 + .02 × 1.2) × I cycles; if we call the unmodified machine's cycle time C seconds, then the time taken on the unmodified machine is (.48 × 1.0 + .33 × 1.4 + .17 × 1.7 + .02 × 1.2) × I × C seconds. Changing some fraction, f (namely .25), of the data transfer instructions into the autoincrement or autodecrement version will leave the number of cycles spent on data transfer instructions unchanged. However, each of the .33 × I × f data transfer instructions that is changed corresponds to an arithmetic instruction that can be eliminated. So, there are now only (.48 – (.33 × f)) × I arithmetic instructions, and the modified machine, with its cycle time of 1.1 × C seconds, will take ((.48 – .33f) × 1.0 + .33 × 1.4 + .17 × 1.7 + .02 × 1.2) × I × 1.1 × C seconds to execute gcc. When f is .25, the unmodified machine is 2.8% faster than the modified one.
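The comparison in 3.13 reduces to a ratio of two weighted sums, which the following Python sketch evaluates. The dictionary and function names are illustrative; the instruction mix, per-class CPIs, the fraction f = .25, and the 1.1 cycle-time factor are the values given in the solution.

```python
GCC_MIX = {"arithmetic": 0.48, "data transfer": 0.33,
           "conditional branch": 0.17, "jump": 0.02}
CLASS_CPI = {"arithmetic": 1.0, "data transfer": 1.4,
             "conditional branch": 1.7, "jump": 1.2}


def cycles_per_original_instruction(f: float = 0.0) -> float:
    """Cycles per instruction of the original program when a fraction f of the
    data transfers absorbs (and so eliminates) one arithmetic instruction each."""
    mix = dict(GCC_MIX)
    mix["arithmetic"] -= GCC_MIX["data transfer"] * f
    return sum(mix[name] * CLASS_CPI[name] for name in mix)


def modified_over_unmodified(f: float = 0.25, cycle_time_factor: float = 1.1) -> float:
    """Execution time of the modified machine divided by that of the unmodified one."""
    return cycles_per_original_instruction(f) * cycle_time_factor / cycles_per_original_instruction(0.0)


if __name__ == "__main__":
    ratio = modified_over_unmodified()
    print(f"modified / unmodified time = {ratio:.3f}")
```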

3.14 From Figure 3.38, 33% of all instructions executed by gcc are data access instructions. Thus, for every 100 instructions there are 100 + 33 = 133 memory accesses: one to read each instruction and 33 to access data.

a. The percentage of all memory accesses that are for data = 33/133 = 25%.

b. Assuming two-thirds of data transfers are loads, the percentage of all memory accesses that are reads =

$$\frac{100 + \left(33 \times \frac{2}{3}\right)}{133} = 92\%$$

3.15 From Figure 3.38, 41% of all instructions executed by spice are data access instructions. Thus, for every 100 instructions there are 100 + 41 = 141 memory accesses: one to read each instruction and 41 to access data.

a. The percentage of all memory accesses that are for data = 41/141 = 29%.

b. Assuming two-thirds of data transfers are loads, the percentage of all memory accesses that are reads =

$$\frac{100 + \left(41 \times \frac{2}{3}\right)}{141} = 90\%$$

3.16

$$\text{Effective CPI} = \sum_{\text{Instruction classes}} \text{CPI}_{\text{class}} \times \text{Frequency of execution}_{\text{class}}$$

For gcc, CPI = 1.0 × 0.48 + 1.4 × 0.33 + 1.7 × 0.17 + 1.2 × 0.02 = 1.3. For spice, CPI = 1.0 × 0.5 + 1.4 × 0.41 + 1.7 × 0.08 + 1.2 × 0.01 = 1.2.

3.17 Let the program have n instructions.

Let the original clock cycle time be t. Let N be the fraction of loads retained.

$$\text{exec}_{\text{old}} = n \times \text{CPI} \times t$$

$$\text{exec}_{\text{new}} = (0.78n + N \times 0.22n) \times \text{CPI} \times 1.1t$$

$$\text{exec}_{\text{new}} \le \text{exec}_{\text{old}}$$

$$(0.78n + N \times 0.22n) \times \text{CPI} \times 1.1t \le n \times \text{CPI} \times t$$

$$(0.78n + N \times 0.22n) \times 1.1 \le n$$

$$(0.78 + N \times 0.22) \times 1.1 \le 1$$

$$0.78 + N \times 0.22 \le \frac{1}{1.1}$$

$$N \times 0.22 \le \frac{1}{1.1} - 0.78$$

$$N \le \frac{\frac{1}{1.1} - 0.78}{0.22}$$

$$N \le \frac{1 - 1.1 \times 0.78}{1.1 \times 0.22}$$

$$N \le 0.587$$

We need to eliminate at least 41.3% of loads.

3.18 No solution provided.
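The memory-access percentages in 3.14 and 3.15 follow the same two formulas, so they can be recomputed together. The Python sketch below is illustrative only: the function names are assumptions, and the inputs (33 and 41 data accesses per 100 instructions, two-thirds of them loads) come from those two solutions.

```python
def data_access_share(data_per_100_instructions: float) -> float:
    """Share of all memory accesses that are data accesses."""
    return data_per_100_instructions / (100.0 + data_per_100_instructions)


def read_share(data_per_100_instructions: float, load_fraction: float = 2.0 / 3.0) -> float:
    """Share of all memory accesses that are reads (instruction fetches plus loads)."""
    reads = 100.0 + data_per_100_instructions * load_fraction
    return reads / (100.0 + data_per_100_instructions)


if __name__ == "__main__":
    for data_accesses in (33, 41):
        print(f"{data_accesses} data accesses per 100 instructions: "
              f"data = {data_access_share(data_accesses):.0%}, "
              f"reads = {read_share(data_accesses):.0%}")
```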

3.19

Accumulator

| Instruction | Code bytes | Data bytes |
|---|---|---|
| load b # Acc = b; | 3 | 4 |
| add c # Acc += c; | 3 | 4 |
| store a # a = Acc; | 3 | 4 |
| add c # Acc += c; | 3 | 4 |
| store b # b = Acc; | 3 | 4 |
| neg # Acc = -Acc; | 1 | 0 |
| add a # Acc += a; | 3 | 4 |
| store d # d = Acc; | 3 | 4 |
| Total: | 22 | 28 |

Code size is 22 bytes, and memory bandwidth is 22 + 28 = 50 bytes.

Stack

| Instruction | Code bytes | Data bytes |
|---|---|---|
| push b | 3 | 4 |
| push c | 3 | 4 |
| add | 1 | 0 |
| dup | 1 | 0 |
| pop a | 3 | 4 |
| push c | 3 | 4 |
| add | 1 | 0 |
| dup | 1 | 0 |
| pop b | 3 | 4 |
| neg | 1 | 0 |
| push a | 3 | 4 |
| add | 1 | 0 |
| pop d | 3 | 4 |
| Total: | 27 | 28 |

Code size is 27 bytes, and memory bandwidth is 27 + 28 = 55 bytes.

Memory-Memory

| Instruction | Code bytes | Data bytes |
|---|---|---|
| add a, b, c # a=b+c | 7 | 12 |
| add b, a, c # b=a+c | 7 | 12 |
| sub d, a, b # d=a–b | 7 | 12 |
| Total: | 21 | 36 |

Code size is 21 bytes, and memory bandwidth is 21 + 36 = 57 bytes.

Load-Store

| Instruction | Code bytes | Data bytes |
|---|---|---|
| load \$1, b # \$1 = b; | 4 | 4 |
| load \$2, c # \$2 = c; | 4 | 4 |
| add \$3, \$1, \$2 # \$3 = \$1 + \$2; | 3 | 0 |
| store \$3, a # a = \$3; | 4 | 4 |
| add \$1, \$2, \$3 # \$1 = \$2 + \$3; | 3 | 0 |
| store \$1, b # b = \$1; | 4 | 4 |
| sub \$4, \$3, \$1 # \$4 = \$3 - \$1; | 3 | 0 |
| store \$4, d # d = \$4; | 4 | 4 |
| Total: | 29 | 20 |

Code size is 29 bytes, and memory bandwidth is 29 + 20 = 49 bytes.

The load-store machine has the lowest amount of data traffic. It has enough registers that it only needs to read and write each memory location once. On the other hand, since all ALU operations must be separate from loads and stores, and all operations must specify three registers or one register and one address, the load-store has the worst code size. The memory-memory machine, on the other hand, is at the other extreme. It has the fewest instructions (though also the largest number of bytes per instruction) and the largest number of data accesses.

3.20 To know the typical number of memory addresses per instruction, the nature of a typical instruction must be agreed upon. For the purpose of categorizing computers as 0-, 1-, 2-, 3-address machines, an instruction that takes two operands and produces a result, for example, add, is traditionally taken as typical.

Accumulator: An add on this architecture reads one operand from memory, one from the accumulator, and writes the result in the accumulator. Only the location of the operand in memory need be specified by the instruction. CATEGORY: 1-address architecture.

Memory-memory: Both operands are read from memory and the result is written to memory, and all locations must be specified. CATEGORY: 3-address architecture.

Stack: Both operands are read (removed) from the stack (top of stack and next to top of stack), and the result is written to the stack (at the new top of stack). All locations are known; none need be specified. CATEGORY: 0-address architecture.

Load-store: Both operands are read from registers and the result is written to a register. Just like memory-memory, all locations must be specified; however, location addresses are much smaller: 5 bits for a location in a typical register file versus 32 bits for a location in a common memory. CATEGORY: 3-address architecture.

3.21 Figure 3.15 shows decimal values corresponding to ASCII characters.

| A | (space) | b | y | t | e | (space) | i | s | (space) | 8 | (space) | b | i | t | s | (null) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 65 | 32 | 98 | 121 | 116 | 101 | 32 | 105 | 115 | 32 | 56 | 32 | 98 | 105 | 116 | 115 | 0 |

3.22 No solution provided.

3.23 No solution provided.

3.24 No solution provided.
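The byte values asked for in 3.21 can be reproduced mechanically. This short Python sketch (the function name is made up for illustration) lists the ASCII code of each character of the string "A byte is 8 bits" and appends the zero byte that terminates a C-style string.

```python
def c_string_codes(text: str) -> list:
    """ASCII code of each character, followed by the terminating zero byte."""
    return list(text.encode("ascii")) + [0]


if __name__ == "__main__":
    print(c_string_codes("A byte is 8 bits"))
```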

3.25 Here is the C code for itoa, as taken from The C Programming Language by Kernighan and Ritchie:

void reverse( char *s ) {

int c, i, j;

for( i = 0, j = strlen(s)-1; i < j; i++, j-- ) {

c = s[i];

s[i] = s[j];

s[j] = c;

}

}

void itoa( int n, char *s ) {

int i, sign;

if( ( sign = n ) < 0 ) n = -n;

i = 0;

do {

s[i++] = n % 10 + '0';

} while( ( n /= 10 ) > 0 );

if( sign < 0 ) s[i++] = '-';

s[i] = '\0';

reverse( s );

}

The MIPS assembly code, along with a main routine to test it, might look something like this:

.data

hello: .ascii "\nEnter a number:"

newln: .asciiz "\n"

str: .space 32

.text

reverse: # Expects string to

# reverse in \$a0

# s = i = \$a0

# j = \$t2

addi \$t2, \$a0, -1 # j = s - 1;

lbu \$t3, 1(\$t2) # while( *(j+1) )

beqz \$t3, end_strlen

strlen_loop:

addi \$t2, \$t2, 1 # j++;

lbu \$t3, 1(\$t2)

bnez \$t3, strlen_loop

end_strlen: # now j =

# &s[strlen(s)–1]

bge \$a0, \$t2, end_reverse # while( i < j )

# {

reverse_loop:

lbu \$t3, (\$a0) # \$t3 = *i;

lbu \$t4, (\$t2) # \$t4 = *j;

sb \$t3, (\$t2) # *j = \$t3;

sb \$t4, (\$a0) # *i = \$t4;

addi \$a0, \$a0, 1 # i++;

addi \$t2, \$t2, -1 # j--;

blt \$a0, \$t2, reverse_loop # }

end_reverse:

jr \$31

.globl itoa # \$a0 = n

itoa: addi \$29, \$29, -4 # \$a1 = s

sw \$31, 0(\$29)

move \$t0, \$a0 # sign = n;

move \$t3, \$a1 # \$t3 = s;

bgez \$a0, non_neg # if( sign < 0 )

sub \$a0, \$0, \$a0 # n = -n

non_neg:

li \$t2, 10

itoa_loop: # do {

div \$a0, \$t2 # lo = n / 10;

# hi = n % 10;

mfhi \$t1

mflo \$a0 # n /= 10;

addi \$t1, \$t1, 48 # \$t1 =

# '0' + n % 10;

sb \$t1, 0(\$a1) # *s = \$t1;

addi \$a1, \$a1, 1 # s++;

bnez \$a0, itoa_loop # } while( n );

bgez \$t0, non_neg2 # if( sign < 0 )

# {

li \$t1, '-' # *s = '-';

sb \$t1, 0(\$a1) # s++;

addi \$a1, \$a1, 1 # }

non_neg2:

sb \$0, 0(\$a1)

move \$a0, \$t3

jal reverse # reverse( s );

lw \$31, 0(\$29)

addi \$29, \$29, 4

jr \$31

.globl main

main: addi \$29, \$29, -4

sw \$31, 0(\$29)

li \$v0, 4

la \$a0, hello

syscall

li \$v0, 5 # read_int system service

syscall

move \$a0, \$v0 # itoa( \$a0, str );

la \$a1, str #

jal itoa #

la \$a0, str

li \$v0, 4

syscall

la \$a0, newln

syscall

lw \$31, 0(\$29)

addi \$29, \$29, 4

jr \$31

One common problem that occurred was to treat the string as a series of words rather than as a series of bytes. Each character in a string is a byte. One zero byte terminates a string. Thus when people stored ASCII codes one per word and then attempted to invoke the print_str system call, only the first character of the number printed out.

3.26

# Description: Computes the Fibonacci function using a recursive

# process.

# Function: F(n) = 0, if n = 0;

# 1, if n = 1;

# F(n–1) + F(n–2), otherwise.

# Input: n, which must be a non-negative integer.

# Output: F(n).

# Preconditions: none

# Instructions: Load and run the program in SPIM, and answer the

# prompt.

# Algorithm for main program:

# print prompt

# call fib(read) and print result.

# Register usage:

# \$a0 = n (passed directly to fib)

# \$s1 = f(n)

.data

.align 2

# Data for prompts and output description

prmpt1: .asciiz "\n\nThis program computes the Fibonacci function."

prmpt2: .asciiz "\nEnter value for n: "

descr: .asciiz "fib(n) = "

.text

.align 2

.globl __start


__start:

# Print the prompts

li \$v0, 4 # print_str system service ...

la \$a0, prmpt1 # ... passing address of first prompt

syscall

li \$v0, 4 # print_str system service ...

la \$a0, prmpt2 # ... passing address of 2nd prompt

syscall

# Read n and call fib with result

li \$v0, 5 # read_int system service

syscall

move \$a0, \$v0 # \$a0 = n = result of read

jal fib # call fib(n)

move \$s1, \$v0 # \$s1 = fib(n)

# Print result

li \$v0, 4 # print_str system service ...

la \$a0, descr # ... passing address of output descriptor

syscall

li \$v0, 1 # print_int system service ...

move \$a0, \$s1 # ... passing argument fib(n)

syscall

# Call system - exit

li \$v0, 10

syscall

# Algorithm for Fib(n):

# if (n == 0) return 0

# else if (n == 1) return 1

# else return fib(n–1) + fib(n–2).

#

# Register usage:

# \$a0 = n (argument)

# \$t1 = fib(n–1)

# \$t2 = fib(n–2)

# \$v0 = 1 (for comparison)

#

# Stack usage:

# 1. push return address, n, before calling fib(n–1)

# 2. pop n

# 3. push n, fib(n–1), before calling fib(n–2)

# 4. pop fib(n–1), n, return address

fib: bne \$a0, \$zero, fibne0 # if n == 0 ...

move \$v0, \$zero # ... return 0

jr \$31

fibne0: # Assert: n != 0

li \$v0, 1

bne \$a0, \$v0, fibne1 # if n == 1 ...

jr \$31 # ... return 1

fibne1: # Assert: n > 1


## Compute fib(n–1)

addi \$sp, \$sp, –8 # push ...

sw \$ra, 4(\$sp) # ... return address

sw \$a0, 0(\$sp) # ... and n

addi \$a0, \$a0, -1 # pass argument n-1 ...

jal fib # ... to fib

move \$t1, \$v0 # \$t1 = fib(n-1)

lw \$a0, 0(\$sp) # pop n

addi \$sp, \$sp, 4 # ... from stack

## Compute fib(n-2)

addi \$sp, \$sp, -8 # push ...

sw \$a0, 4(\$sp) # ... n

sw \$t1, 0(\$sp) # ... and fib(n-1)

addi \$a0, \$a0, -2 # pass argument n-2 ...

jal fib # ... to fib

move \$t2, \$v0 # \$t2 = fib(n-2)

lw \$t1, 0(\$sp) # pop fib(n-1) ...

lw \$a0, 4(\$sp) # ... n

lw \$ra, 8(\$sp) # ... and return address

addi \$sp, \$sp, 12 # ... from stack

## Return fib(n-1) + fib(n-2)

add \$v0, \$t1, \$t2 # \$v0 = fib(n) = fib(n-1) + fib(n-2)

jr \$31 # return to caller

3.27

# Description: Computes the Fibonacci function using an iterative

# process.

# Function: F(n) = 0, if n = 0;

# 1, if n = 1;

# F(n–1) + F(n–2), otherwise.

# Input: n, which must be a non-negative integer.

# Output: F(n).

# Preconditions:none

# Instructions: Load and run the program in SPIM, and answer the

# prompt.

#

# Algorithm for main program:

# print prompt

# call fib(1, 0, read) and print result.

#

# Register usage:

# \$a2 = n (passed directly to fib)

# \$s1 = f(n)

.data

.align 2

# Data for prompts and output description

prmpt1: .asciiz "\n\nThis program computes the Fibonacci function."

prmpt2: .asciiz "\nEnter value for n: "

descr: .asciiz "fib(n) = "

.text

.align 2

.globl __start


__start:

# Print the prompts

li \$v0, 4 # print_str system service ...

la \$a0, prmpt1 # ... passing address of first prompt

syscall

li \$v0, 4 # print_str system service ...

la \$a0, prmpt2 # ... passing address of 2nd prompt

syscall

# Read n and call fib with result

li \$v0, 5 # read_int system service

syscall

move \$a2, \$v0 # \$a2 = n = result of read

li \$a1, 0 # \$a1 = fib(0)

li \$a0, 1 # \$a0 = fib(1)

jal fib # call fib(n)

move \$s1, \$v0 # \$s1 = fib(n)

# Print result

li \$v0, 4 # print_str system service ...

la \$a0, descr # ... passing address of output descriptor

syscall

li \$v0, 1 # print_int system service ...

move \$a0, \$s1 # ... passing argument fib(n)

syscall

# Call system - exit

li \$v0, 10

syscall

# Algorithm for Fib(a, b, count):

# if (count == 0) return b

# else return fib(a + b, a, count – 1).

#

# Register usage:

# \$a0 = a = fib(n–1)

# \$a1 = b = fib(n–2)

# \$a2 = count (initially n, finally 0).

# \$t1 = temporary a + b

fib: bne \$a2, \$zero, fibne0 # if count == 0 ...

move \$v0, \$a1 # ... return b

jr \$31

fibne0: # Assert: n != 0

addi \$a2, \$a2, -1 # count = count - 1

add \$t1, \$a0, \$a1 # \$t1 = a + b

move \$a1, \$a0 # b = a

move \$a0, \$t1 # a = a + old b

j fib # tail call fib(a+b, a, count-1)

3.28 No solution provided.


3.29

start: sbn temp, b, .+1 # Sets temp = -b (assuming temp starts at 0), always goes to next instruction

sbn a, temp, .+1 # Sets a = a – temp = a – (–b) = a + b

3.30 There are a number of ways to do this, but this is perhaps the most concise and elegant:

sbn c, c, .+1 # c = 0;

sbn tmp, tmp, .+1 # tmp = 0;

loop: sbn b, one, end # while (b–– > 0)

sbn tmp, a, loop # c –= a; /* always continue */

end: sbn c, tmp, .+1 # c = –tmp; /* = a × b */
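To see why the loop in Exercise 3.30 terminates with the product, the sequence can be traced in C. This is only a sketch: the helper `sbn` and the function name `sbn_multiply` are hypothetical, memory operands are modeled as local `int` variables, and we assume a ≥ 0 and b ≥ 0, which is what the "always continue" comment relies on.

```c
/* One sbn step: *a = *a - *b. Returns nonzero when the result is
   negative, i.e. when the branch would be taken. */
static int sbn(int *a, const int *b)
{
    *a -= *b;
    return *a < 0;
}

/* Multiplies a by b using only sbn steps, mirroring the five
   instructions of the solution. Assumes a >= 0 and b >= 0. */
int sbn_multiply(int a, int b)
{
    int c = 0, tmp = 0;
    const int one = 1;

    sbn(&c, &c);                  /* c = 0 */
    sbn(&tmp, &tmp);              /* tmp = 0 */
    for (;;) {
        if (sbn(&b, &one))        /* loop: b = b - 1; branch to end once b < 0 */
            break;
        if (!sbn(&tmp, &a))       /* tmp = tmp - a; falls through to end if not negative */
            break;
    }
    sbn(&c, &tmp);                /* end: c = 0 - tmp = a * b */
    return c;
}
```

Calling `sbn_multiply(6, 7)` runs the loop body seven times, leaving tmp = -42 before the final step negates it.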


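4.1 In a manner analogous to that used in the example on page 214, the number $512_{\text{ten}} = 5 \times 10^2 + 1 \times 10^1 + 2 \times 10^0$. Given that $1_{\text{two}} = 10^0$, $1010_{\text{two}} = 10^1$, and $110\,0100_{\text{two}} = 10^2$, we have

$512_{\text{ten}} = 5 \times 110\,0100_{\text{two}} + 1 \times 1010_{\text{two}} + 2 \times 1_{\text{two}}$

Adding five copies of 110 0100, one copy of 1010, and two copies of 1 in binary gives

110 0100 + 110 0100 + 110 0100 + 110 0100 + 110 0100 + 1010 + 1 + 1 = 10 0000 0000

The number $512_{\text{ten}}$ is positive, so the sign bit is 0, and sign extension yields the 32-bit two’s complement result $512_{\text{ten}}$ = 0000 0000 0000 0000 0000 0010 0000 0000$_{\text{two}}$.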

4.2 This exercise can be solved with exactly the same approach used by the solution to Exercise 4.1, with the sign and sign extension being 1 digits. Because memorizing the base 10 representations of powers of 2 is useful for many hardware and software tasks, another conversion technique is widely used.

Let N be the magnitude of the decimal number, x[i] be the ith bit of the binary representation of N, and m be the greatest integer such that $2^m \le N$. Then

for (i = m; i ≥ 0; i = i - 1) {

if (2^i ≤ N) { x[i] = 1; N = N - 2^i; }

else x[i] = 0;

}

### 4 Solutions


For $-1023_{\text{ten}}$, N = 1023:

| Subtract | Bit set | Remaining N |
|---|---|---|
| 512 | x[9] = 1 | 511 |
| 256 | x[8] = 1 | 255 |
| 128 | x[7] = 1 | 127 |
| 64 | x[6] = 1 | 63 |
| 32 | x[5] = 1 | 31 |
| 16 | x[4] = 1 | 15 |
| 8 | x[3] = 1 | 7 |
| 4 | x[2] = 1 | 3 |
| 2 | x[1] = 1 | 1 |
| 1 | x[0] = 1 | 0 (done) |

So N = 0000 0000 0000 0000 0000 0011 1111 1111$_{\text{two}}$.

Thus $-1023_{\text{ten}}$ = 1111 1111 1111 1111 1111 1100 0000 0001$_{\text{two}}$.
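The subtraction procedure of Exercise 4.2 can be sketched in C as a quick check of the results above. The function names are hypothetical, and for simplicity the loop starts at bit 31 rather than at m; the leading bits simply stay 0.

```c
#include <stdint.h>

/* Greedy conversion: for each power of two, from the largest down to
   2^0, set bit i and subtract 2^i whenever it still fits in N. */
uint32_t magnitude_to_bits(uint32_t n)
{
    uint32_t bits = 0;
    for (int i = 31; i >= 0; i--) {
        uint32_t power = (uint32_t)1 << i;
        if (power <= n) {      /* 2^i fits in what is left of N */
            bits |= power;     /* x[i] = 1 */
            n -= power;        /* N = N - 2^i */
        }                      /* otherwise x[i] stays 0 */
    }
    return bits;
}

/* Two's complement negation: invert every bit, then add 1. */
uint32_t negate_bits(uint32_t bits)
{
    return ~bits + 1u;
}
```

For example, `negate_bits(magnitude_to_bits(1023))` yields 0xFFFFFC01, the bit pattern given for $-1023_{\text{ten}}$.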

4.3 Using the method of the solution to either Exercise 4.1 or Exercise 4.2, $-4{,}000{,}000_{\text{ten}}$ = 1111 1111 1100 0010 1111 0111 0000 0000$_{\text{two}}$.

4.4 We could substitute into the formula at the bottom of page 213, $(x_{31} \times -2^{31}) + (x_{30} \times 2^{30}) + (x_{29} \times 2^{29}) + \cdots + (x_1 \times 2^1) + (x_0 \times 2^0)$, to get the answer. Because this two’s complement number has predominantly 1 digits, the formula will have many nonzero terms to add. The negation of this number will have mostly 0 digits, so using the negation shortcut in the Example on page 216 first and remembering the original sign will save enough work in the formula to be a good strategy. Thus,

Negating 1111 1111 1111 1111 1111 1110 0000 1100$_{\text{two}}$ gives

0000 0000 0000 0000 0000 0001 1111 0011$_{\text{two}}$ + 1$_{\text{two}}$

= 0000 0000 0000 0000 0000 0001 1111 0100$_{\text{two}}$

Then the nonzero terms of the formula are

$2^8 + 2^7 + 2^6 + 2^5 + 2^4 + 2^2 = 500$

and the original sign is 1, meaning negative, so 1111 1111 1111 1111 1111 1110 0000 1100$_{\text{two}}$ = $-500_{\text{ten}}$.

4.5 Here, negating first as in the solution for Exercise 4.4 really pays off. The negation is the all-zero string incremented by 1, yielding +1. Remembering the original sign is negative, 1111 1111 1111 1111 1111 1111 1111 1111$_{\text{two}}$ = $-1_{\text{ten}}$.

4.6 Negating the two’s complement representation gives 1000 0000 0000 0000 0000 0000 0000 0001, which equals

$(1 \times -2^{31}) + (1 \times 2^0) = -2{,}147{,}483{,}648_{\text{ten}} + 1_{\text{ten}} = -2{,}147{,}483{,}647_{\text{ten}}$

Recalling that the original two’s complement number is positive, 0111 1111 1111 1111 1111 1111 1111 1111$_{\text{two}}$ = $2{,}147{,}483{,}647_{\text{ten}}$.


4.7 By lookup using the table in Figure 4.1, page 218, 7fff fffa$_{\text{hex}}$ = 0111 1111 1111 1111 1111 1111 1111 1010$_{\text{two}}$, and by the same technique used to solve Exercise 4.6 this equals $2{,}147{,}483{,}642_{\text{ten}}$.

4.8 By lookup using the table in Figure 4.1, page 218, 1100 1010 1111 1110 1111 1010 1100 1110$_{\text{two}}$ = cafe face$_{\text{hex}}$.

4.9 Since MIPS includes add immediate and since immediates can be positive or negative, subtract immediate would be redundant.

4.10

addu \$t2, \$zero, \$t3 # copy \$t3 into \$t2

bgez \$t3, next # if \$t3 >= 0 then done

sub \$t2, \$zero, \$t3 # negate \$t3 and place into \$t2

next:

4.11 You should be quite suspicious of both claims. A simple examination yields

6 = 2 + 4

12 = 4 + 8

18 = 2 + 16

24 = 8 + 16

30 = 2 + 4 + 8 + 16 (so we know Harry is wrong)

36 = 4 + 32

42 = 2 + 8 + 32 (so we know David is wrong).

4.12 The code loads the sll instruction at shifter into a register and masks off the shift amount, placing the least significant 5 bits from \$s2 in its place. It then writes the instruction back to memory and proceeds to execute it. The code is self-modifying; thus it is very hard to debug and likely to be forbidden in many modern operating systems.

One key problem is that we would be writing into the instruction cache, and clearly this slows things down and would require a fair amount of work to implement correctly (see Chapters 6 and 7).

4.13 The problem is that A_lower will be sign-extended and then added to \$t0. The solution is to adjust A_upper by adding 1 to it if the most significant bit of A_lower is a 1. As an example, consider 6-bit two’s complement and the address 23 = 010111. If we split it up, we notice that A_lower is 111 and will be sign-extended to 111111 = –1 during the arithmetic calculation. A_upper_adjusted = 011000 = 24 (we added 1 to 010 and the lower bits are all 0s). The calculation is then 24 + –1 = 23.
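The same adjustment for a full 32-bit address with 16-bit halves can be sketched in C. The type and function names are hypothetical, and the final addition is modeled with wraparound arithmetic, so any overflow trap on the add is ignored.

```c
#include <stdint.h>

/* The two 16-bit operands: upper for lui, lower for the add immediate. */
typedef struct { uint16_t upper; uint16_t lower; } split_addr;

/* Splits an address, adding 1 to the upper half when the most
   significant bit of the lower half is 1. */
split_addr split_address(uint32_t addr)
{
    split_addr s;
    s.lower = (uint16_t)(addr & 0xFFFFu);
    s.upper = (uint16_t)(addr >> 16);
    if (s.lower & 0x8000u)                     /* lower half will sign-extend to a negative value */
        s.upper = (uint16_t)(s.upper + 1u);    /* compensate in A_upper */
    return s;
}

/* Rebuilds the address the way the hardware would: upper << 16 plus
   the sign-extended lower half, modulo 2^32. */
uint32_t rebuild_address(split_addr s)
{
    uint32_t hi = (uint32_t)s.upper << 16;
    uint32_t lo = (s.lower & 0x8000u) ? (0xFFFF0000u | s.lower) : s.lower;
    return hi + lo;
}
```

With the address 0x12348765, the lower half 0x8765 has its top bit set, so the adjusted upper half becomes 0x1235 and the sum still comes out to 0x12348765.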

4.14

a. The sign bit is 1, so this is a negative number. We first take its two’s complement.

A = 1000 1111 1110 1111 1100 0000 0000 0000

–A = 0111 0000 0001 0000 0100 0000 0000 0000

$= 2^{30} + 2^{29} + 2^{28} + 2^{20} + 2^{14}$

= 1,073,741,824 + 536,870,912 + 268,435,456 + 1,048,576 + 16,384

= 1,880,113,152

A = –1,880,113,152

b.

A = 1000 1111 1110 1111 1100 0000 0000 0000

= 8FEFC000$_{\text{hex}}$

$= 8 \times 16^7 + 15 \times 16^6 + 14 \times 16^5 + 15 \times 16^4 + 12 \times 16^3$

= 2,147,483,648 + 251,658,240 + 14,680,064 + 983,040 + 49,152

= 2,414,854,144

c.

s = 1

exponent = 0001 1111 = $2^5 - 1$ = 31

significand = 110 1111 1100 0000 0000 0000

$(-1)^S \times (1 + \text{significand}) \times 2^{\text{exponent} - 127} = -1 \times 1.1101\,1111\,1_{\text{two}} \times 2^{-96}$

$= -1 \times (1 + 13 \times 16^{-1} + 15 \times 16^{-2} + 2^{-9}) \times 2^{-96}$

$= -1.873 \times 2^{-96}$

$= -2.364 \times 10^{-29}$

d.

opcode (6 bits) = 100011 = lw

rs (5 bits) = 11111 = 31

rt (5 bits) = 01111 = 15

address (16 bits) = 1100 0000 0000 0000

Since the address is negative we have to take its two’s complement.

Two’s complement of address = 0100 0000 0000 0000

address = $-2^{14}$ = –16384

Therefore the instruction is lw \$15, –16384(\$31).

Notice that the address embedded within the 16-bit immediate field is a byte address, unlike the constants embedded in PC-relative branch instructions, where word addressing is used.

4.15 a. 0 b. 0 c. 0.0

d. sll \$0,\$0,0

4.16 Figure 4.54 shows 21% of the instructions as being lw. If 15% of these could take advantage of the new variant, that would be 3.2% of all instructions. Each of these presumably now has an addu instruction (that adds the two register values together) that could be eliminated. Thus roughly 3.2% of the instructions, namely those addu instructions, could be eliminated if the addition were now done as part of the lw. The savings may be a bit overestimated (slightly less than 3.2%) because some existing instructions may be sharing addu instructions. Thus we might not eliminate one addu for every changed lw.

4.17 Either the instruction sequence

addu \$t2, \$t3, \$t4

sltu \$t2, \$t2, \$t4

or

addu \$t2, \$t3, \$t4

sltu \$t2, \$t2, \$t3

works.

4.18 If overflow detection is not required, then

addu \$t3, \$t5, \$t7

sltu \$t2, \$t3, \$t5

addu \$t2, \$t2, \$t4

addu \$t2, \$t2, \$t6

is sufficient. If overflow detection is desired, the last two addu instructions should be replaced by add instructions:

addu \$t3, \$t5, \$t7

sltu \$t2, \$t3, \$t5

add \$t2, \$t2, \$t4

add \$t2, \$t2, \$t6
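The addu/sltu idiom used in Exercises 4.17 and 4.18 relies on the fact that an unsigned sum is smaller than either operand exactly when the addition wrapped. A minimal C sketch, with hypothetical function names and the non-trapping variant of 4.18 (the high and low words are passed as separate 32-bit values):

```c
#include <stdint.h>

/* Carry out of a 32-bit unsigned add: addu followed by sltu. */
uint32_t carry_out(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;      /* addu: wraps modulo 2^32 */
    return sum < a;            /* sltu: 1 exactly when the add wrapped */
}

/* 64-bit add built from 32-bit halves, ignoring overflow of the
   high word. Returns the result packed into one 64-bit value. */
uint64_t add64(uint32_t a_hi, uint32_t a_lo, uint32_t b_hi, uint32_t b_lo)
{
    uint32_t lo = a_lo + b_lo;         /* addu on the low words */
    uint32_t carry = lo < a_lo;        /* sltu: carry out of the low words */
    uint32_t hi = carry + a_hi + b_hi; /* two addu steps on the high words */
    return ((uint64_t)hi << 32) | lo;
}
```

Comparing the sum against either operand works, which is why both sequences in Exercise 4.17 are acceptable.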

4.19 To detect whether \$s0 < \$s1, it’s tempting to subtract them and look at the sign of the result. This idea is problematic, because if the subtraction results in an overflow, an exception would occur! To overcome this, there are two possible methods. You can subtract them as unsigned numbers (which never produces an exception) and then check to see whether overflow would have occurred (this is discussed in an elaboration on page 223). This method is acceptable, but it is lengthy and does more work than necessary. An alternative is to check signs. Overflow can occur only if \$s0 and (–\$s1) share the same sign, i.e., if \$s0 and \$s1 differ in sign. But in that case we don’t need to subtract them, since the negative one is obviously the smaller! Treating zero as nonnegative (sign bit 0), the solution in pseudocode would be

if (\$s0<0) and (\$s1>=0) then

\$t0:=1

else if (\$s0>=0) and (\$s1<0) then

\$t0:=0

else

\$t1:=\$s0–\$s1 # overflow can never occur here

if (\$t1<0) then

\$t0:=1

else

\$t0:=0
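The sign-check strategy of Exercise 4.19 can be written out in C as a sketch. The function name is hypothetical; the comparisons against zero stand in for sign-bit tests, and zero is grouped with the nonnegative values so that a subtraction only happens when both operands have the same sign bit.

```c
#include <stdint.h>

/* Returns 1 if s0 < s1, without ever performing a subtraction
   that could overflow. */
int less_than(int32_t s0, int32_t s1)
{
    if (s0 < 0 && s1 >= 0)      /* signs differ: the negative one is smaller */
        return 1;
    if (s0 >= 0 && s1 < 0)
        return 0;
    /* Same sign: the difference always fits in 32 bits. */
    return (s0 - s1) < 0;
}
```

The extreme operands are the interesting cases: `less_than(INT32_MIN, INT32_MAX)` and `less_than(0, INT32_MIN)` are both decided by the sign checks and never reach the subtraction.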

4.20 The new instruction treats \$s0 and \$s1 as a register pair and performs a shift, wherein the least significant bit of \$s0 becomes the most significant bit of \$s1 and both \$s0 and \$s1 are shifted right one place.

4.21

sll \$s1, \$s0, 2

addu \$s1, \$s0, \$s1

4.22 No solution provided.

4.23 The ALU-supported set less than (slt) uses just the sign bit. In this case, if we try a set less than operation using the values $-7_{\text{ten}}$ and $6_{\text{ten}}$, we would get –7 > 6. This is clearly wrong. Modify the 32-bit ALU in Figure 4.11 on page 169 to handle slt correctly by factoring overflow into the decision.

If there is no overflow, the calculation is done properly in Figure 4.17 and we simply use the sign bit (Result31). If there is overflow, however, then the sign bit is wrong and we need the inverse of the sign bit.

LessThan = Overflow ⊕ Result31

| Overflow | Result31 | LessThan |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

4.24 No solution provided.

4.25

$10_{\text{ten}} = 1010_{\text{two}} = 1.01_{\text{two}} \times 2^3$

Sign = 0, Significand = .01

Single exponent = 3 + 127 = 130

Double exponent = 3 + 1023 = 1026

| Precision | Sign (1 bit) | Exponent (8 or 11 bits) | Fraction (23 or 52 bits) |
|---|---|---|---|
| Single | 0 | 10000010 | 01000000000000000000000 |
| Double | 0 | 10000000010 | 0100000000000000000000000000000000000000000000000000 |

4.26

$10.5_{\text{ten}} = 1010.1_{\text{two}} = 1.0101_{\text{two}} \times 2^3$

Sign = 0, Significand = .0101

Single exponent = 3 + 127 = 130

Double exponent = 3 + 1023 = 1026

The solution is the same as that for Exercise 4.25, but with the fourth bit from the left of the significand changed to a 1.

4.27

$0.1_{\text{ten}} = 0.0\overline{0011}_{\text{two}} = 1.1\overline{0011}_{\text{two}} \times 2^{-4}$

Sign = 0, Significand = $.1\overline{0011}$

Single exponent = –4 + 127 = 123

Double exponent = –4 + 1023 = 1019

| Precision | Sign (1 bit) | Exponent (8 or 11 bits) | Fraction, truncated | Fraction, rounded |
|---|---|---|---|---|
| Single | 0 | 01111011 | 10011001100110011001100 | 10011001100110011001101 |
| Double | 0 | 01111111011 | 1001100110011001100110011001100110011001100110011001 | 1001100110011001100110011001100110011001100110011010 |

4.28

$-2/3 = -1.\overline{01}_{\text{two}} \times 2^{-1}$

Single exponent = –1 + 127 = 126

Double exponent = –1 + 1023 = 1022

| Precision | Sign (1 bit) | Exponent (8 or 11 bits) | Fraction, truncated | Fraction, rounded |
|---|---|---|---|---|
| Single | 1 | 01111110 | 01010101010101010101010 | 01010101010101010101011 |
| Double | 1 | 01111111110 | 0101010101010101010101010101010101010101010101010101 | 0101010101010101010101010101010101010101010101010110 |

4.29

#include <stdio.h>

#include <stdint.h>

#include <string.h>

int main(void) {

float x;

uint32_t bits; /* assumes float is a 32-bit IEEE 754 number */

printf("> ");

scanf("%f", &x);

memcpy(&bits, &x, sizeof bits); /* copy the bit pattern, no conversion */

printf("%08lx\n", (unsigned long) bits);

return 0;

}

4.30

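#include <iostream>

#include <cstdint>

#include <cstring>

// assumes float is a 32-bit IEEE 754 number

int main( )

{

float x;

std::uint32_t bits;

std::cout << "> "; // prompt

std::cin >> x;

std::memcpy(&bits, &x, sizeof bits); // copy the bit pattern (don't convert)

std::cout << std::hex << bits << std::endl;

}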

4.31 The IEEE number can be loaded easily into a register, \$t0, as a word since it occupies 4 bytes. To multiply it by 2, we recall that it is made up of three fields: a sign, an exponent, and a fraction. The actual stored number is the fraction (with an implied 1) multiplied by a power of two. Hence to multiply the whole number by 2, we simply add 1 to the exponent! Note that the following variations are incorrect:

Multiply the register \$t0 by 2 (directly or through sll): this is wrong because \$t0 does not store an integer.

Multiply the fractional part by 2. This is correct in principle, but the resulting number would need to be renormalized into IEEE format because the multiplication would shift the binary point. This would lead to adding 1 to the exponent!

Shift the exponent left: this is wrong because it amounts to squaring the number rather than doubling it.

To add 1 to the exponent (which occupies bit positions 23 through 30) in \$t0, there are two ways (both ignore possible overflow):

1. Add \$t0 to \$a0, where \$a0 contains a number having zero bits everywhere except at bit position 23. If a carry were to occur, this addition would change the sign bit, but we’re ignoring floating-point overflow anyway. Here’s the full sequence:

lw \$t0, X(\$0)

addi \$a0, \$0, 1

sll \$a0, \$a0, 23

addu \$t0, \$t0, \$a0

sw \$t0, X(\$0)

2. Isolate the exponent (possibly by shifting left once, then shifting 24 times to the right), add 1 to it, then shift it left 23 places to its proper place. To insert it back in its place, we start by “cleaning out” the old exponent (by ANDing with an appropriate mask) and then we OR the cleaned \$t0 with the incremented exponent. This method, albeit lengthy, is 100% acceptable. Here is a possible sequence:

lw \$t0, X(\$0)

andi \$t1, \$t0, 0x7f800000

srl \$t1, \$t1, 23

addi \$t1, \$t1, 1

sll \$t1, \$t1, 23

and \$t0, \$t0, 0x807fffff

or \$t0, \$t0, \$t1

sw \$t0, X(\$0)
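The first method of Exercise 4.31 translates almost line for line into C. This sketch assumes `float` is a 32-bit IEEE 754 single-precision value and that the input is a normalized number whose exponent does not overflow; zero, denormalized numbers, infinities, and NaNs are not handled, just as in the assembly. The function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Doubles a normalized float by adding 1 to its exponent field. */
float double_by_exponent(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* lw: fetch the raw bit pattern */
    bits += (uint32_t)1 << 23;        /* addu with a 1 in bit 23, the exponent's lowest bit */
    memcpy(&x, &bits, sizeof x);      /* sw: store the pattern back */
    return x;
}
```

Because only the exponent field changes, the sign and fraction bits of the result are identical to those of the input.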

4.32

| Rank | Instruction | Percent executed |
|---|---|---|
| 1 | lw | 21% |
| 3 | sw | 12% |
| 4 | beq | 9% |
| 6 | bne | 8% |
| 7 | sll | 5% |
| 8 | slt | 2% |
| 8 | andi | 2% |
| 8 | lui | 2% |
| 8 | sra | 2% |

4.33

| Rank | Instruction | Percent executed |
|---|---|---|
| 1 | l.s | 24% |
| 3 | s.s | 9% |
| 4 | lw | 7% |
| 5 | lui | 6% |
| 6 | mul.d | 5% |
| 6 | sll | 5% |
| 9 | beq | 3% |
| 9 | sub.d | 3% |

4.34 a. lw

b. List in Exercise 4.34a = 46% of gcc instructions executed.

c. List in Exercise 4.32 = 89% of gcc instructions executed.

d. List in Exercise 4.34a = 31% of spice instructions executed.

e. List in Exercise 4.33 = 76% of spice instructions executed.

They are the five most popular for gcc, and they represent 53% of the instructions executed for spice. The other popular spice instructions are never used in gcc, so they are used by only some programs and not others.

4.36

| Instruction category | Average CPI | Frequency (%) |
|---|---|---|
| Loads and stores | 1.4 | 21 + 12 = 33 |
| Conditional branch | 1.8 | 9 + 8 + 1 + 1 = 19 |
| Jumps | 1.2 | 1 + 1 = 2 |
| Integer multiply | 10.0 | 0 |
| Integer divide | 30.0 | 0 |
| FP add and subtract | 2.0 | 0 |
| FP multiply, single precision | 4.0 | 0 |
| FP multiply, double precision | 5.0 | 0 |
| FP divide, single precision | 12.0 | 0 |
| FP divide, double precision | 19.0 | 0 |
| Integer ALU other than multiply and divide (addu, addiu, and, andi, sll, lui, lb, sb, slt, slti, sltu, sltiu, sra, lh) | 1.0 | 9 + 17 + 1 + 2 + 5 + 2 + 1 + 1 + 2 + 1 + 1 + 1 + 2 + 1 = 46 |

Therefore,

CPI = 1.4 × 33% + 1.8 × 19% + 1.2 × 2% + 1.0 × 46%

= 1.3

4.37

| Instruction category | Average CPI | Frequency (%) |
|---|---|---|
| Loads and stores | 1.4 | 7 + 2 + 24 + 9 = 42 |
| Conditional branch | 1.8 | 3 + 2 + 1 + 1 + 1 = 8 |
| Jumps | 1.2 | 1 + 1 = 2 |
| Integer multiply | 10.0 | 0 |
| Integer divide | 30.0 | 0 |
| FP add and subtract | 2.0 | 4 + 3 = 7 |
| FP multiply, single precision | 4.0 | 0 |
| FP multiply, double precision | 5.0 | 5 |
| FP divide, single precision | 12.0 | 0 |
| FP divide, double precision | 19.0 | 2 |
| Integer ALU other than multiply and divide (addu, addiu, subu, andi, sll, srl, lui, c.x.d, mtcl, mtc2, cut) | 1.0 | 10 + 1 + 1 + 1 + 5 + 1 + 6 + 1 + 2 + 2 + 1 = 31 |

Therefore,

CPI = 1.4 × 42% + 1.8 × 8% + 1.2 × 2% + 2.0 × 7% + 5.0 × 5% + 19.0 × 2% + 1.0 × 31%

= 1.8

Note that the percentages given in Figure 4.54 for spice add up to 97%.

4.38 No solution provided.

4.39 No solution provided.

4.40

xor \$s0, \$s0, \$s1

xor \$s1, \$s0, \$s1

xor \$s0, \$s0, \$s1
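The CPI figures in Exercises 4.36 and 4.37 are weighted sums of the per-category CPI values, which can be recomputed with a few lines of C. The type, function, and array names are hypothetical; only the categories with nonzero frequency are listed.

```c
#include <stddef.h>

/* One instruction category: its average CPI and its share of
   executed instructions in percent. */
typedef struct { double cpi; double percent; } category;

/* Weighted CPI: sum of cpi * frequency over all categories. */
double weighted_cpi(const category *mix, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += mix[i].cpi * mix[i].percent / 100.0;
    return total;
}

/* Nonzero rows of the gcc table (Exercise 4.36). */
static const category gcc_mix[] = {
    {1.4, 33}, {1.8, 19}, {1.2, 2}, {1.0, 46}
};

/* Nonzero rows of the spice table (Exercise 4.37). */
static const category spice_mix[] = {
    {1.4, 42}, {1.8, 8}, {1.2, 2}, {2.0, 7}, {5.0, 5}, {19.0, 2}, {1.0, 31}
};
```

Before rounding, `weighted_cpi(gcc_mix, 4)` is about 1.288 and `weighted_cpi(spice_mix, 7)` is about 1.836, consistent with the 1.3 and 1.8 reported above.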


4.41

nor \$s0, \$zero, \$zero

xor \$s0, \$s0, \$s1
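The two XOR sequences in Exercises 4.40 and 4.41 can be checked in C. This is a sketch with hypothetical names: the registers are modeled as fields of a small struct passed by value, so the two operands are always distinct storage locations, as two different registers are.

```c
#include <stdint.h>

typedef struct { uint32_t s0; uint32_t s1; } reg_pair;

/* Exercise 4.40: three XORs exchange the two values without a temporary. */
reg_pair xor_swap(reg_pair r)
{
    r.s0 ^= r.s1;   /* s0 = a ^ b */
    r.s1 ^= r.s0;   /* s1 = b ^ (a ^ b) = a */
    r.s0 ^= r.s1;   /* s0 = (a ^ b) ^ a = b */
    return r;
}

/* Exercise 4.41: nor of zero with zero gives all ones, and XOR with
   all ones flips every bit, leaving the complement of s1. */
uint32_t complement_via_nor_xor(uint32_t s1)
{
    uint32_t s0 = ~(0u | 0u);   /* nor $s0, $zero, $zero */
    return s0 ^ s1;             /* xor $s0, $s0, $s1 */
}
```

Both rely on the identity x ⊕ x = 0, so each value cancels out of the expression in which it appears twice.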

4.42 Given that a number that is greater than or equal to zero is termed positive and a number that is less than zero is negative, inspection reveals that the last two rows of Figure 4.44 restate the information of the first two rows. Because A – B = A + (–B), the operation A – B when A is positive and B negative is the same as the operation A + B when A is positive and B is positive. Thus the third row restates the conditions of the first. The second and fourth rows refer also to the same condition.

Because subtraction of two’s complement numbers is performed by addition, a complete examination of overflow conditions for addition suffices to show also when overflow will occur for subtraction. Begin with the first two rows of Figure 4.44 and add rows for A and B with opposite signs. Build a table that shows all possible combinations of Sign and CarryIn to the sign bit position and derive the CarryOut, Overflow, and related information. Thus,

| Sign A | Sign B | Carry In | Carry Out | Sign of result | Correct sign of result | Overflow? | Carry In XOR Carry Out | Notes |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | No | 0 | |
| 0 | 0 | 1 | 0 | 1 | 0 | Yes | 1 | Carries differ |
| 0 | 1 | 0 | 0 | 1 | 1 | No | 0 | \|A\| < \|B\| |
| 0 | 1 | 1 | 1 | 0 | 0 | No | 0 | \|A\| > \|B\| |
| 1 | 0 | 0 | 0 | 1 | 1 | No | 0 | \|A\| > \|B\| |
| 1 | 0 | 1 | 1 | 0 | 0 | No | 0 | \|A\| < \|B\| |
| 1 | 1 | 0 | 1 | 0 | 1 | Yes | 1 | Carries differ |
| 1 | 1 | 1 | 1 | 1 | 1 | No | 0 | |

From this table an Exclusive OR (XOR) of the CarryIn and CarryOut of the sign bit serves to detect overflow. When the signs of A and B differ, the value of the CarryIn is determined by the relative magnitudes of A and B, as listed in the Notes column.
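The conclusion of Exercise 4.42, that overflow equals CarryIn XOR CarryOut at the sign bit, can be confirmed by brute force on a narrower word. The sketch below uses 8-bit two's complement values so that all 65,536 operand pairs can be tried; the function names are hypothetical.

```c
#include <stdint.h>

/* Overflow computed as CarryIn XOR CarryOut of the sign bit (bit 7). */
int overflow_by_carries(uint8_t a, uint8_t b)
{
    unsigned low7 = (a & 0x7Fu) + (b & 0x7Fu);   /* add bits 0..6 only */
    unsigned carry_in = (low7 >> 7) & 1u;        /* carry into the sign bit */
    unsigned full = (unsigned)a + (unsigned)b;
    unsigned carry_out = (full >> 8) & 1u;       /* carry out of the sign bit */
    return (int)(carry_in ^ carry_out);
}

/* Reference definition: the true sum does not fit in [-128, 127]. */
int overflow_by_range(uint8_t a, uint8_t b)
{
    int sa = (a & 0x80u) ? (int)a - 256 : (int)a;
    int sb = (b & 0x80u) ? (int)b - 256 : (int)b;
    int sum = sa + sb;
    return sum < -128 || sum > 127;
}

/* Counts operand pairs on which the two definitions disagree. */
int count_mismatches(void)
{
    int bad = 0;
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            if (overflow_by_carries((uint8_t)a, (uint8_t)b) !=
                overflow_by_range((uint8_t)a, (uint8_t)b))
                bad++;
    return bad;
}
```

The two "Yes" rows of the table correspond to cases such as 0x7F + 0x01 (carry in without carry out) and 0x80 + 0x80 (carry out without carry in).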
