Computer Arithmetic Design

(1)

Instructor: Kuan Jen Lin

E-Mail: [email protected]

Web: http://vlsi.ee.fju.edu.tw/teacher/kjlin/kjlin.htm

Dept. of EE, FJU, Taiwan

Room: SF 727B

(2)

SW & HW

SW = Algorithm + Data Structure + Programming techniques

HW = Algorithm + Architecture + Design Method

Computing

Communication

Pipeline

Systolic array Low power Interface

…

Full custom Cell based FPGA

System level

(3)

Course Objectives

Learn computer algorithms to do arithmetic operations

Learn hardware designs for computer arithmetic.

After completing the course

Students are able to implement computer arithmetic hardware designs using HDL.

Students are able to read research papers about computer arithmetic.

(4)

Textbook

•Textbook:

Behrooz Parhami, “Computer Arithmetic: Algorithms and Hardware Designs,” Oxford University Press

•Reference books:

Ercegovac and Lang, “Digital Arithmetic,” MKP.

Stine, “Digital Computer Arithmetic Datapath Design Using Verilog HDL,” CAP

(5)

Syllabus

Number representation

Two-operand Addition

Multi-operand Addition

Multiplication

Division

Square Root

Papers reading and presentation

(6)

Grading

Mid Exam (30%)

Papers reading and presentation (30%)

Homework (some problems need HDL programming) (30%)

Attendance and Others (10%)

(7)

Number Representation

Most slides are revised from the PowerPoint files available on the textbook website.

(8)

Numbers and Arithmetic

Chapter Goals

Define scope and provide motivation

Set the framework for the rest of the book

Review positional fixed-point numbers

Chapter Highlights

What goes on inside your calculator?

Ways of encoding numbers in k bits

Radices and digit sets: conventional, exotic

Conversion from one system to another

(9)

What is Computer Arithmetic?

Pentium Division Bug (1994-95): Pentium’s radix-4 SRT algorithm occasionally gave an incorrect quotient

First noted in 1994 by T. Nicely, who computed sums of reciprocals of twin primes:

1/5 + 1/7 + 1/11 + 1/13 + . . . + 1/p + 1/(p + 2) + . . .

Worst-case example of division error in Pentium:

c = 4 195 835 / 3 145 727

Correct quotient: 1.333 820 44...

circa 1994 Pentium double FLP value: 1.333 739 06... (accurate to only 14 bits; worse than single!)

(10)

Hardware (our focus in this book)

Design of efficient digital circuits for primitive and other arithmetic operations such as +, –, ×, ÷, √, log, sin, cos

Issues: Algorithms; Error analysis; Speed/cost trade-offs; Hardware implementation; Testing, verification

Software

Numerical methods for solving systems of linear equations, partial differential equations, etc.

Issues: Algorithms; Error analysis; Computational complexity; Programming; Testing, verification

General-purpose

Flexible data paths; Fast primitive operations like +, –, ×, ÷, √; Benchmarking

Special-purpose

Tailored to applications like: Digital filtering; Image processing; Radar tracking

The Scope of Computer Arithmetic.

(11)

Using a calculator with √, x², and x^y functions, compute:

u = √√ … √ 2 = 1.000 677 131 (“1024th root of 2”, i.e., ten successive square roots)

v = 2^(1/1024) = 1.000 677 131

Save u and v; if you can’t save, recompute values when needed

x = (((u²)²)...)² = 1.999 999 963

x' = u¹⁰²⁴ = 1.999 999 973

y = (((v²)²)...)² = 1.999 999 983

y' = v¹⁰²⁴ = 1.999 999 994

Perhaps v and u are not really the same value

w = v – u = 1 × 10^–11 Nonzero due to hidden digits

(u – 1) × 1000 = 0.677 130 680 [Hidden ... (0) 68]

(v – 1) × 1000 = 0.677 130 690 [Hidden ... (0) 69]

A Motivating Example

(12)

Finite Precision Can Lead to Disaster

Example: Failure of Patriot Missile (1991 Feb. 25)

Source: http://www.math.psu.edu/dna/455.f96/disasters.html

American Patriot Missile battery in Dhahran, Saudi Arabia, failed to intercept an incoming Iraqi Scud missile

The Scud struck an American Army barracks, killing 28

Cause, per GAO/IMTEC-92-26 report: “software problem” (inaccurate calculation of the time since boot)

Problem specifics:

Time in tenths of second as measured by the system’s internal clock was multiplied by 1/10 to get the time in seconds

Internal registers were 24 bits wide

1/10 = 0.0001 1001 1001 1001 1001 100 (chopped to 24 b)

Error ≈ 0.1100 1100 × 2^–23 ≈ 9.5 × 10^–8

Error in 100-hr operation period

≈ 9.5 × 10^–8 × 100 × 60 × 60 × 10 = 0.34 s

Distance traveled by Scud = (0.34 s) × (1676 m/s) ≈ 570 m
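The arithmetic on this slide can be checked with a short sketch. It assumes the chopped constant keeps 23 bits after the binary point, as in the bit string shown above; the names `FRAC_BITS`, `chopped_tenth`, and `SCUD_SPEED` are illustrative and not taken from the GAO report.

```python
from fractions import Fraction

FRAC_BITS = 23        # assumption: bits kept after the binary point, per the slide's bit string
SCUD_SPEED = 1676     # m/s, as quoted on the slide

tenth = Fraction(1, 10)
# Chop (truncate) 1/10 to FRAC_BITS fractional bits.
chopped_tenth = Fraction(int(tenth * 2**FRAC_BITS), 2**FRAC_BITS)
error_per_tick = tenth - chopped_tenth           # error in each 0.1 s clock tick

ticks_in_100_hours = 100 * 60 * 60 * 10          # clock counts tenths of a second
time_error = float(error_per_tick * ticks_in_100_hours)
distance = time_error * SCUD_SPEED

print(f"error per tick = {float(error_per_tick):.3e}")
print(f"time error     = {time_error:.3f} s")
print(f"distance       = {distance:.0f} m")
```

Exact rational arithmetic is used here so that the only error in the computation is the one being studied.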

(13)

Numbers and Their Encodings

[Figure: Some 4-bit number representation formats. Unsigned integer; signed integer (± sign bit); fixed point, 3+1 (with radix point); signed fraction; 2's-compl fraction; floating point (exponent e in {−2, −1, 0, 1}, significand s in {0, 1, 2, 3}); logarithmic (base-2 logarithm, log x).]

(14)

Encoding Numbers in 4 Bits

[Figure: Values representable in each 4-bit number format, plotted on a number line from −16 to 16. Number formats shown: unsigned integers; signed-magnitude; 3 + 1 fixed-point, xxx.x; signed fraction, ±.xxx; 2’s-compl. fraction, x.xxx; 2 + 2 floating-point, s × 2^e, e in [−2, 1], s in [0, 3]; 2 + 2 logarithmic (log = xx.xx).]

(15)

Fixed-Radix Positional Number Systems

$(x_{k-1} x_{k-2} \ldots x_1 x_0 \,.\, x_{-1} x_{-2} \ldots x_{-l})_r = \sum_{i=-l}^{k-1} x_i r^i$

One can generalize to:

Arbitrary radix (not necessarily integer, positive, constant)

Arbitrary digit set, usually {–α, –α+1, . . . , β–1, β} = [–α, β]

Example 1.1. Balanced ternary number system:

Radix r = 3, digit set = [–1, 1]

Example 1.2. Negative-radix number systems:

Radix –r, r ≥ 2, digit set = [0, r – 1]

The special case with radix –2 and digit set [0, 1] is known as the negabinary number system

Can it represent all integers?
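The negabinary question can be explored with a small sketch. The function names `to_negabinary` and `from_negabinary` are illustrative; the conversion uses repeated division by –2, adjusting each remainder into the digit set [0, 1].

```python
def to_negabinary(n):
    """Return the radix -2 digits of integer n, most significant first."""
    if n == 0:
        return [0]
    digits = []
    while n != 0:
        n, r = divmod(n, -2)
        if r < 0:          # Python gives r in {0, -1}; move it into {0, 1}
            r += 2
            n += 1
        digits.append(r)
    return digits[::-1]


def from_negabinary(digits):
    """Evaluate radix -2 digits (most significant first) with Horner's rule."""
    value = 0
    for d in digits:
        value = value * (-2) + d
    return value


for n in (-5, -1, 0, 2, 6):
    print(n, to_negabinary(n))
```

Both positive and negative integers round-trip without a separate sign, which is the point of a negative radix.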

(16)

More Examples of Number Systems

Example 1.3. Digit set [–4, 5] for r = 10:

(3 –1 5)_ten represents 295 = 300 – 10 + 5

Example 1.4. Digit set [–7, 7] for r = 10:

(3 –1 5)_ten = (3 0 –5)_ten = (1 –7 0 –5)_ten

Example 1.7. Quater-imaginary number system:

radix r = 2j, digit set [0, 3]

(17)

Number Radix Conversion

u = w . v (whole part w, fractional part v)

= ( x_k–1x_k–2 . . . x₁x₀ . x_–1x_–2 . . . x_–l )_r Old

= ( X_K–1X_K–2 . . . X₁X₀ . X_–1X_–2 . . . X_–L )_R New

Radix conversion, using arithmetic in the old radix r: convenient when converting from r = 10

Radix conversion, using arithmetic in the new radix R: convenient when converting to R = 10

Example: (31)_eight = (25)_ten; 31 Oct. = 25 Dec.; Halloween = Xmas

(18)

Radix Conversion: Old-Radix Arithmetic

Converting whole part w: (105)_ten = (?)_five

Repeatedly divide by five:

Quotient   Remainder
105
21         0
4          1
0          4

Therefore, (105)_ten = (410)_five

Converting fractional part v: (105.486)_ten = (410.?)_five

Repeatedly multiply by five:

Whole part   Fraction
             .486
2            .430
2            .150
0            .750
3            .750
3            .750

Therefore, (105.486)_ten ≅ (410.22033)_five
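The two procedures above can be sketched as follows. The function names are illustrative, and `Fraction` is used so that the decimal fraction .486 is held exactly while it is repeatedly multiplied.

```python
from fractions import Fraction


def whole_to_radix(w, radix):
    """Repeatedly divide by the new radix; remainders are digits, LSD first."""
    digits = []
    while w > 0:
        w, remainder = divmod(w, radix)
        digits.append(remainder)
    return digits[::-1] or [0]


def fraction_to_radix(v, radix, num_digits):
    """Repeatedly multiply by the new radix; whole parts are digits, MSD first."""
    digits = []
    for _ in range(num_digits):
        v *= radix
        whole = int(v)       # whole part becomes the next digit
        digits.append(whole)
        v -= whole           # keep only the fraction for the next step
    return digits


print(whole_to_radix(105, 5))
print(fraction_to_radix(Fraction(486, 1000), 5, 5))
```

The fractional conversion generally does not terminate, so the caller chooses how many digits to produce.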

(19)

Radix Conversion: New-Radix Arithmetic

Converting whole part w: (22033)_five = (?)_ten

((((2 × 5) + 2) × 5 + 0) × 5 + 3) × 5 + 3

Values of the successively wider parenthesized groups: 10, 12, 60, 303, 1518

Converting fractional part v: (410.22033)_five = (105.?)_ten

(0.22033)_five × 5⁵ = (22033)_five = (1518)_ten

1518 / 5⁵ = 1518 / 3125 = 0.48576

Therefore, (410.22033)_five = (105.48576)_ten

Horner’s rule or formula

(20)

Horner’s Rule for Fractions

Converting fractional part v: (0.22033)_five = (?)_ten

(((((3 / 5) + 3) / 5 + 0) / 5 + 2) / 5 + 2) / 5

Values of the successively wider parenthesized groups: 0.6, 3.6, 0.72, 2.144, 2.4288, 0.48576

Horner’s rule or formula
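Horner's rule for both the whole and fractional parts can be sketched as below. The function names are illustrative; `Fraction` keeps the divisions exact.

```python
from fractions import Fraction


def horner_whole(digits, radix):
    """Evaluate whole-part digits (MSD first): multiply by the radix, add next digit."""
    value = 0
    for d in digits:
        value = value * radix + d
    return value


def horner_fraction(digits, radix):
    """Evaluate fractional digits (first digit after the radix point first),
    working from the last digit inward: add the digit, then divide by the radix."""
    value = Fraction(0)
    for d in reversed(digits):
        value = (value + d) / radix
    return value


print(horner_whole([2, 2, 0, 3, 3], 5))
print(float(horner_fraction([2, 2, 0, 3, 3], 5)))
```

Each step uses one multiplication (or division) and one addition, so a k-digit number needs only k of each.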

(21)

Classes of Number Representations

Signed number

Redundant number system

Residue number system

Real number

(22)

2 Representing Signed Numbers

Chapter Goals

Learn different encodings of the sign info

Discuss implications for arithmetic design

Chapter Highlights

Using sign bit, biasing, complementation

Properties of 2’s-complement numbers

Signed vs unsigned arithmetic

Signed numbers, positions, or digits

(23)

[Figure: Four-bit signed-magnitude number representation system for integers. Bit patterns 0000 to 0111 represent the signed values +0 to +7; bit patterns 1000 to 1111 represent –0 to –7.]

(24)

[Figure: Four-bit biased integer number representation system with a bias of 8. Bit patterns 0000 to 1111 represent the signed values –8 to +7 (value = unsigned pattern – 8).]

(25)

Arithmetic with Biased Numbers

Addition/subtraction of biased numbers

x + y + bias = (x + bias) + (y + bias) – bias

x – y + bias = (x + bias) – (y + bias) + bias

A power-of-2 (or 2^a – 1) bias simplifies addition/subtraction

Comparison of biased numbers:

Compare like ordinary unsigned numbers; find true difference by ordinary subtraction

We seldom perform arbitrary arithmetic on biased numbers

Main application: Exponent field of floating-point numbers

(26)

Example and Two Special Cases

Example -- complement system for fixed-point numbers:

Complementation constant M = 12.000

Fixed-point number range [–6.000, +5.999]

Represent –3.258 as 12.000 – 3.258 = 8.742

Auxiliary operations for complement representations:

complementation or change of sign (computing M – x)

computations of residues mod M

Thus, M must be selected to simplify these operations

Two choices allow just this for fixed-point radix-r arithmetic with k whole digits and l fractional digits

Radix complement M = r^k

Digit complement M = r^k – ulp (aka diminished radix compl)

ulp (unit in least position) stands for r^−l

Allows us to forget about l, even for nonintegers

(27)

Two’s- Complement Numbers

[Figure: Four-bit 2’s-complement number representation system. Unsigned representations 0000 to 0111 have the signed values +0 to +7; 1000 to 1111 have the signed values –8 to –1.]

Two’s complement = radix complement system for r = 2

M = 2^k

2^k – x = [(2^k – ulp) – x] + ulp = x^compl + ulp

Range of representable numbers with k whole bits: from –2^(k–1) to 2^(k–1) – ulp

ulp (unit in least position) stands for r^−l

Allows us to forget about l, even for nonintegers

(28)

One’s-Complement Number Representation

One’s complement = digit complement (diminished radix complement) system for r = 2

M = 2^k – ulp

(2^k – ulp) – x = x^compl

Range of representable numbers with k whole bits: from –2^(k–1) + ulp to 2^(k–1) – ulp

[Figure: Four-bit 1’s-complement number representation system. Unsigned representations 0000 to 0111 have the signed values +0 to +7; 1000 to 1111 have the signed values –7 to –0.]

(29)

Range/Precision extension for 2’s- and 1’s Complement

Range/precision extension for 2’s-complement numbers

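. . . x_k–1 x_k–1 x_k–1 x_k–1 x_k–2 . . . x₁ x₀ . x_–1 x_–2 . . . x_–l 0 0 0 . . .

Sign extension on the left replicates the sign bit x_k–1; extension to the right of the LSD x_–l appends 0s

Range/precision extension for 1’s-complement numbers

. . . x_k–1 x_k–1 x_k–1 x_k–1 x_k–2 . . . x₁ x₀ . x_–1 x_–2 . . . x_–l x_k–1 x_k–1 x_k–1 . . .

Sign extension on the left replicates the sign bit x_k–1; extension to the right of the LSD x_–l also appends copies of the sign bit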

(30)

Mod-2^k vs Mod-(2^k – 1)

Mod-2^k operation needed in 2’s-complement arithmetic is trivial:

Simply drop the carry-out (subtract 2^k if result is 2^k or greater)

Mod-(2^k – ulp) operation needed in 1’s-complement arithmetic is done via end-around carry:

(x + y) – (2^k – ulp); connect c_out to c_in

Since the dropped carry is worth 2^k units and the inserted carry is worth ulp, the combined effect is to reduce the magnitude by 2^k – ulp.
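The two reduction rules can be compared in a short sketch for k-bit integers (so ulp = 1). The function names are illustrative; bit patterns are held as nonnegative Python integers.

```python
def add_twos_complement(x, y, k):
    """Add two k-bit patterns mod 2^k: simply drop the carry-out."""
    return (x + y) & ((1 << k) - 1)


def add_ones_complement(x, y, k):
    """Add two k-bit patterns mod (2^k - 1) using the end-around carry."""
    mask = (1 << k) - 1
    total = x + y
    carry_out = total >> k               # 0 or 1
    return (total & mask) + carry_out    # re-insert the carry at the LSB


def decode_twos(p, k):
    return p if p < (1 << (k - 1)) else p - (1 << k)


def decode_ones(p, k):
    return p if p < (1 << (k - 1)) else p - ((1 << k) - 1)


# (-3) + (+5) in 4 bits, in both systems
print(decode_twos(add_twos_complement(0b1101, 0b0101, 4), 4))
print(decode_ones(add_ones_complement(0b1100, 0b0101, 4), 4))
```

In hardware the end-around carry is a wire from c_out to c_in rather than a second addition; the sketch models only the arithmetic effect.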

(31)

Why 2’s-Complement Is the Universal Choice

[Figure: Adder/subtractor architecture for 2’s-complement numbers. A mux controlled by the add/sub signal selects y or its complement (controlled complementation); the adder computes s = x ± y with c_in = add/sub (0 for addition, 1 for subtraction) and produces c_out. The mux can be replaced with k XOR gates.]

(32)

Interpreting a 2’s-complement number as having a negatively weighted most-significant digit.

x = (1 0 1 0 0 1 1 0)two’s-compl

–2⁷ 2⁶ 2⁵ 2⁴ 2³ 2² 2¹ 2⁰

–128 + 32 + 4 + 2 = –90

Check:

x = (1 0 1 0 0 1 1 0)two’s-compl

–x = (0 1 0 1 1 0 1 0)_two

2⁷ 2⁶ 2⁵ 2⁴ 2³ 2² 2¹ 2⁰

64 + 16 + 8 + 2 = 90
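The negatively weighted interpretation can be checked with a small sketch. The function names are illustrative; bits are listed most significant first.

```python
def twos_complement_value(bits):
    """Value of a 2's-complement bit list: the MSB has weight -2^(k-1)."""
    k = len(bits)
    value = -bits[0] * (1 << (k - 1))
    for i in range(1, k):
        value += bits[i] << (k - 1 - i)
    return value


def negate(bits):
    """Change sign by complementing every bit and adding 1 (mod 2^k)."""
    k = len(bits)
    pattern = int("".join(str(b) for b in bits), 2)
    negated = (~pattern + 1) & ((1 << k) - 1)
    return [int(c) for c in format(negated, f"0{k}b")]


x = [1, 0, 1, 0, 0, 1, 1, 0]
print(twos_complement_value(x))
print(negate(x), twos_complement_value(negate(x)))
```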

(33)

Redundant Number Systems

Chapter Goals

Explore the advantages and drawbacks of using more than r digit values in radix r

Chapter Highlights

Redundancy eliminates long carry chains

Redundancy takes many forms: trade-offs

Conversions between redundant and nonredundant representations

Redundancy used for end values too?

(34)

Coping with the Carry Problem

Ways of dealing with the carry propagation problem:

1. Limit propagation to within a small number of bits (Chapters 3-4)

2. Detect end of propagation; don’t wait for worst case (Chapter 5)

3. Speed up propagation via lookahead etc. (Chapters 6-7)

4. Ideal: Eliminate carry propagation altogether! (Chapter 3)

(35)

Use Redundant Number System (1/2)

     5   7   8   2   4   9
+    6   2   9   3   8   9    Operand digits in [0, 9]
––––––––––––––––––––––––––––––––––
    11   9  17   5  12  18    Position sums in [0, 18]

But how can we extend this beyond a single addition?

Subsequent additions will cause problems.

•The digit values 10 through 18 are redundant.

•A conventional adder would generate a carry whenever a position sum is 10 or more; here the position sums, which never exceed 18, are kept as digits instead.

(36)

Use Redundant Number System (2/2)

  18  18  18  18  18
+  0   0   0   0   1

Is there still a carry propagation problem?

The sum of digits for each position is in [0, 36]; each can be decomposed into an interim sum in [0, 16] and a transfer digit (i.e., carry) in [0, 2].

Interim sums: 8 8 8 8 9

Transfer digits: 1 1 1 1 1

Sum digits: 9 9 9 9 9

(37)

Example: Addition of Redundant Numbers

Position sum decomposition [0, 36] = 10 × [0, 2] + [0, 16]

Absorption of transfer digit [0, 16] + [0, 2] = [0, 18]

      11   9  17  10  12  18
+      6  12   9  10   8  18    Operand digits in [0, 18]
––––––––––––––––––––––––––––––––––
      17  21  26  20  20  36    Position sums in [0, 36]
       7  11  16   0  10  16    Interim sums in [0, 16]
   1   1   1   2   1   2        Transfer digits in [0, 2]
––––––––––––––––––––––––––––––––––
   1   8  12  18   1  12  16    Sum digits in [0, 18]
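The decomposition above can be sketched in code. The position-sum decomposition is not unique, so this sketch uses one simple, assumed selection rule (transfer = min(p // 10, 2)); its sum digits therefore differ from those on the slide, although they represent the same value. The names `carry_free_add` and `value` are illustrative.

```python
RADIX = 10


def carry_free_add(x, y):
    """Add two radix-10 numbers with digits in [0, 18], most significant first.
    Every sum digit depends only on its own position and its right neighbor."""
    position_sums = [a + b for a, b in zip(x, y)]                  # in [0, 36]
    transfers = [min(p // RADIX, 2) for p in position_sums]        # in [0, 2]
    interim = [p - RADIX * t for p, t in zip(position_sums, transfers)]  # in [0, 16]
    incoming = transfers[1:] + [0]      # transfer arrives from the position to the right
    sum_digits = [w + t for w, t in zip(interim, incoming)]        # in [0, 18]
    return [transfers[0]] + sum_digits


def value(digits):
    """Numeric value of a digit list, most significant first."""
    v = 0
    for d in digits:
        v = v * RADIX + d
    return v


x = [11, 9, 17, 10, 12, 18]
y = [6, 12, 9, 10, 8, 18]
s = carry_free_add(x, y)
print(s, value(s))
```

No transfer ever propagates beyond one position, because an interim sum of at most 16 can always absorb an incoming transfer of at most 2.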

(38)

Carry-Free Addition Schemes

[Figure: Carry-free addition schemes. (a) Ideal single-stage carry-free (impossible for positional system with fixed digit set). (b) Two-stage carry-free. (c) Single-stage with lookahead. Each panel shows the operand digits x_i, y_i at position i, the interim sum at position i, the transfer digit t_i into position i, and the sum digits s_i+1, s_i, s_i–1.]

(39)

Redundancy Index

So, redundancy helps us achieve carry-free addition

But how much redundancy is actually needed? Is [0, 11] enough for r = 10?

      11  10   7  11   3   8
+      7   2   9  10   9   8    Operand digits in [0, 11]
––––––––––––––––––––––––––––––––––
      18  12  16  21  12  16    Position sums in [0, 22]
       8   2   6   1   2   6    Interim sums in [0, 9]
   1   1   1   2   1   1        Transfer digits in [0, 2]
––––––––––––––––––––––––––––––––––
   1   9   3   8   2   3   6    Sum digits in [0, 11]

Redundancy index ρ = α + β + 1 – r

For example, 0 + 11 + 1 – 10 = 2

(40)

Digit Sets and Digit-Set Conversions

Example 3.1: Convert from digit set [0, 18] to [0, 9] in radix 10

11   9  17  10  12  18      18 = 10 (carry 1) + 8
11   9  17  10  13   8      13 = 10 (carry 1) + 3
11   9  17  11   3   8      11 = 10 (carry 1) + 1
11   9  18   1   3   8      18 = 10 (carry 1) + 8
11  10   8   1   3   8      10 = 10 (carry 1) + 0
12   0   8   1   3   8      12 = 10 (carry 1) + 2
 1   2   0   8   1   3   8  Answer; all digits in [0, 9]

Note: Conversion from redundant to nonredundant representation always involves carry propagation

Thus, the process is sequential and slow
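The right-to-left conversion of Example 3.1 can be sketched as a loop that carries the excess of each digit into the next position. The function name `to_conventional_digits` is illustrative.

```python
def to_conventional_digits(digits, radix=10):
    """Convert digits in a redundant set such as [0, 18] (most significant first)
    to the conventional digit set [0, radix - 1] by propagating carries."""
    result = []
    carry = 0
    for d in reversed(digits):                 # start at the least significant digit
        carry, low = divmod(d + carry, radix)
        result.append(low)
    while carry:                               # leftover carry creates new leading digits
        carry, low = divmod(carry, radix)
        result.append(low)
    return result[::-1]


print(to_conventional_digits([11, 9, 17, 10, 12, 18]))
```

Each iteration depends on the carry from the previous one, which is exactly the sequential behavior noted above.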

(41)

Generalized Signed-Digit Numbers

[Figure: Taxonomy of radix-r positional number systems with digit set [–α, β].

ρ = 0: Non-redundant (α = 0: Conventional; α ≥ 1: Non-redundant signed-digit)

ρ ≥ 1: Generalized signed-digit (GSD)

ρ = 1: Minimal GSD (α = β, even r: Symmetric minimal GSD; α ≠ β: Asymmetric minimal GSD)

ρ ≥ 2: Non-minimal GSD (α = β: Symmetric non-minimal GSD; α ≠ β: Asymmetric non-minimal GSD)

Special cases named in the figure: BSD or BSB (r = 2), stored-carry (SC), non-binary SB, BSC, ordinary signed-digit (OSD, α < r) with minimally and maximally redundant OSD, unsigned-digit redundant (UDR), SCB, BSCB.]

Radix r

Digit set [–α, β]

Requirement α + β + 1 ≥ r

Redundancy index ρ = α + β + 1 – r

(42)

Binary Signed Digit (BSD)

x_i           1    –1     0    –1     0    BSD representation of +6
〈s, v〉       01    11    00    11    00    Sign and value encoding
2’s-compl    01    11    00    11    00    2-bit 2’s-complement
〈n, p〉       01    10    00    10    00    Negative & positive flags
〈n, z, p〉   001   100   010   100   010    1-out-of-3 encoding

(43)

Carry-Free Addition Algorithms

Carry-free addition of GSD numbers:

Compute the position sums p_i = x_i + y_i

Divide p_i into a transfer t_i+1 and interim sum w_i = p_i – r t_i+1

Add incoming transfers to get the sum digits s_i = w_i + t_i

[Figure: Two-stage carry-free addition. Operand digits x_i, y_i produce the interim sum w_i and the transfer t_i+1; the sum digit s_i combines w_i with the incoming transfer t_i.]

If the transfer digits t_i are in [–λ, μ], we must have:

–α + λ ≤ p_i – r t_i+1 ≤ β – μ

The lower bound is the smallest interim sum if a transfer of –λ is to be absorbable; the upper bound is the largest interim sum if a transfer of μ is to be absorbable

These constraints lead to:

λ ≥ α / (r – 1) and μ ≥ β / (r – 1)

(44)

Is Carry-Free Addition Always Applicable?

No: It requires one of the following two conditions [Parh 90]

a. r > 2, ρ ≥ 3

b. r > 2, ρ = 2, α ≠ 1, β ≠ 1 (e.g., not [−1, 10] in radix 10)

In other words, it is inapplicable for

r = 2 (perhaps the most useful case)

ρ = 1 (e.g., carry-save)

ρ = 2 with α = 1 or β = 1 (e.g., carry/borrow-save)

BSD is not two-stage carry-free -1 -1 0 -1 -1 -2 -1

-1

(45)

Use Carry-Estimate

A position sum –1 is kept intact when the incoming transfer is in [0, 1], whereas it is rewritten as 1 with a carry of –1 for incoming transfer in [–1, 0]. This guarantees that t_i ≠ w_i and thus –1≤ s_i ≤ 1.

          1   –1    0   –1    0     x_i in [–1, 1]
+         0   –1   –1    0    1     y_i in [–1, 1]
––––––––––––––––––––––––––––––––––
          1   –2   –1   –1    1     p_i in [–2, 2]
        low  low  low high high     e_i in {low: [–1, 0], high: [0, 1]}
          1    0    1   –1   –1     w_i in [–1, 1]
     0   –1   –1    0    1          t_i+1 in [–1, 1]
––––––––––––––––––––––––––––––––––
     0    0   –1    1    0   –1     s_i in [–1, 1]

(46)

Residue Number Systems

Chapter Goals

Study a way of encoding large numbers as a collection of smaller numbers to simplify and speed up some operations

Chapter Highlights

Moduli, range, arithmetic operations

Many sets of moduli possible: tradeoffs

Conversions between RNS and binary

The Chinese remainder theorem

Why are RNS applications limited?

(47)

RNS Representations and Arithmetic

Chinese puzzle, 1500 years ago:

What number has the remainders of 2, 3, and 2 when divided by 7, 5, and 3, respectively?

Residues uniquely identify the number, hence they constitute a representation

Pairwise relatively prime moduli: m_k–1 > . . . > m₁ > m₀

The residue x_i of x wrt the ith modulus m_i (similar to a digit):

x_i = x mod m_i = 〈x〉_mi

RNS representation contains a list of k residues or digits:

x = (2 | 3 | 2)_RNS(7|5|3)

Default RNS for this chapter: RNS(8 | 7 | 5 | 3)

(48)

RNS Dynamic Range

Product M of the k pairwise relatively prime moduli is the dynamic range

M = m_k–1 × . . . × m₁ × m₀

For RNS(8 | 7 | 5 | 3), M = 8 × 7 × 5 × 3 = 840

Negative numbers: Complement relative to M

〈–x〉_mi = 〈M – x〉_mi

21 = (5 | 0 | 1 | 0)_RNS

–21 = (8 – 5 | 0 | 5 – 1 | 0)_RNS = (3 | 0 | 4 | 0)_RNS

Here are some example numbers in our default RNS(8 | 7 | 5 | 3):

(0 | 0 | 0 | 0)_RNS Represents 0 or 840 or . . .

(1 | 1 | 1 | 1)_RNS Represents 1 or 841 or . . .

(2 | 2 | 2 | 2)_RNS Represents 2 or 842 or . . .

(0 | 1 | 4 | 1)_RNS Represents 64 or 904 or . . .

(2 | 0 | 0 | 2)_RNS Represents –70 or 770 or . . .

(7 | 6 | 4 | 2)_RNS Represents –1 or 839 or . . .

We can take the range of RNS(8|7|5|3) to be [−420, 419] or any other set of 840 consecutive integers

(49)

RNS as Weighted Representation

We will see later how the weights can be determined for a given RNS

For RNS(8 | 7 | 5 | 3), the weights of the 4 positions are:

105 120 336 280

Example: (1 | 2 | 4 | 0)_RNS represents the number

〈105×1 + 120×2 + 336×4 + 280×0〉₈₄₀ = 〈1689〉₈₄₀ = 9

For RNS(7 | 5 | 3), the weights of the 3 positions are:

15 21 70

Example -- Chinese puzzle: (2 | 3 | 2)_RNS(7|5|3) represents the number

〈15 × 2 + 21 × 3 + 70 × 2〉₁₀₅ = 〈233〉₁₀₅ = 23

(50)

RNS Encoding and Arithmetic Operations

Arithmetic in RNS(8 | 7 | 5 | 3)

(5 | 5 | 0 | 2)_RNS Represents x = +5

(7 | 6 | 4 | 2)_RNS Represents y = –1

(4 | 4 | 4 | 1)_RNS x + y : 〈5 + 7〉₈ = 4, 〈5 + 6〉₇ = 4, etc.

(6 | 6 | 1 | 0)_RNS x – y : 〈5 – 7〉₈ = 6, 〈5 – 6〉₇ = 6, etc. (alternatively, find –y and add to x)

(3 | 2 | 0 | 1)_RNS x × y : 〈5 × 7〉₈ = 3, 〈5 × 6〉₇ = 2, etc.

[Figure: Binary-coded format for RNS(8 | 7 | 5 | 3), with 3 + 3 + 3 + 2 bits for the mod-8, mod-7, mod-5, and mod-3 residues. Operand 1 and Operand 2 feed separate Mod-8, Mod-7, Mod-5, and Mod-3 units, which produce the Result.]
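The digit-wise nature of RNS arithmetic can be illustrated with a short sketch for the default RNS(8 | 7 | 5 | 3). The function names are illustrative; each residue position is processed independently, with no carries between positions.

```python
MODULI = (8, 7, 5, 3)


def to_rns(x, moduli=MODULI):
    """Residues of x; Python's % already yields nonnegative results for negative x."""
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    return tuple((p + q) % m for p, q, m in zip(a, b, moduli))


def rns_sub(a, b, moduli=MODULI):
    return tuple((p - q) % m for p, q, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    return tuple((p * q) % m for p, q, m in zip(a, b, moduli))


x, y = to_rns(5), to_rns(-1)
print(x, y)
print(rns_add(x, y), rns_sub(x, y), rns_mul(x, y))
```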

(51)

Choosing the RNS Moduli

Target range for our RNS: Decimal values [0, 100 000]

Strategy 1: To minimize the largest modulus, and thus ensure high-speed arithmetic, pick prime numbers in sequence

Pick m₀ = 2, m₁ = 3, m₂ = 5, etc. After adding m₅ = 13:

RNS(13 | 11 | 7 | 5 | 3 | 2) M = 30 030 Inadequate

RNS(17 | 13 | 11 | 7 | 5 | 3 | 2) M = 510 510 Too large

RNS(17 | 13 | 11 | 7 | 3 | 2) M = 102 102 Just right!

5 + 4 + 4 + 3 + 2 + 1 = 19 bits

Fine tuning: Combine pairs of moduli 2 & 13 (26) and 3 & 7 (21)

RNS(26 | 21 | 17 | 11) M = 102 102

(52)

An Improved Strategy

Strategy 2: Improve strategy 1 by including powers of smaller primes before proceeding to the next larger prime

RNS(2² | 3) M = 12

RNS(3² | 2³ | 7 | 5) M = 2520

RNS(11 | 3² | 2³ | 7 | 5) M = 27 720

RNS(13 | 11 | 3² | 2³ | 7 | 5) M = 360 360 (remove one 3, combine 3 & 5)

RNS(15 | 13 | 11 | 2³ | 7) M = 120 120

4 + 4 + 4 + 3 + 3 = 18 bits

Fine tuning: Maximize the size of the even modulus within the 4-bit limit

RNS(2⁴ | 13 | 11 | 3² | 7 | 5) M = 720 720 Too large

We can now remove 5 or 7; not an improvement in this example

(53)

Low-Cost RNS Moduli

Strategy 3: To simplify the modular reduction (mod m_i) operations, choose only moduli of the forms 2^a or 2^a – 1, aka “low-cost moduli”

RNS(2^a_k–1 | 2^a_k–2 – 1 | . . . | 2^a_1 – 1 | 2^a_0 – 1)

We can have only one even modulus

2^ai – 1 and 2^aj – 1 are relatively prime iff a_i and a_j are relatively prime

RNS(2³ | 2³–1 | 2²–1) basis: 3, 2 M = 168

RNS(2⁴ | 2⁴–1 | 2³–1) basis: 4, 3 M = 1680

RNS(2⁵ | 2⁵–1 | 2³–1 | 2²–1) basis: 5, 3, 2 M = 20 832

RNS(2⁵ | 2⁵–1 | 2⁴–1 | 2³–1) basis: 5, 4, 3 M = 104 160

Comparison

RNS(15 | 13 | 11 | 2³ | 7) 18 bits M = 120 120

RNS(2⁵ | 2⁵–1 | 2⁴–1 | 2³–1) 17 bits M = 104 160

Reduction mod 2^k and mod 2^k – 1 is easy to implement

(54)

Encoding and Decoding of Numbers

Conversion from binary/decimal to RNS

Table 4.1 Residues of the first 10 powers of 2

–––––––––––––––––––––––––––––
i    2ⁱ    〈2ⁱ〉₇   〈2ⁱ〉₅   〈2ⁱ〉₃
–––––––––––––––––––––––––––––
0    1     1       1       1
1    2     2       2       2
2    4     4       4       1
3    8     1       3       2
4    16    2       1       1
5    32    4       2       2
6    64    1       4       1
7    128   2       3       2
8    256   4       1       1
9    512   1       2       2
–––––––––––––––––––––––––––––

Example 4.1: Represent the number y = (1010 0100)_two = (164)_ten in RNS(8 | 7 | 5 | 3)

The mod-8 residue is easy to find: x₃ = 〈y〉₈ = (100)_two = 4

We have y = 2⁷ + 2⁵ + 2²; thus

x₂ = 〈y〉₇ = 〈2 + 4 + 4〉₇ = 3

x₁ = 〈y〉₅ = 〈3 + 2 + 4〉₅ = 4

x₀ = 〈y〉₃ = 〈2 + 2 + 1〉₃ = 2
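The table-based encoding of Example 4.1 can be sketched as follows. The names `residue_via_powers` and `encode_rns` are illustrative; the residues of powers of 2 are computed with `pow(2, i, m)` rather than read from a stored table, which is an assumption made to keep the sketch short.

```python
def residue_via_powers(y, m):
    """Sum the residues <2^i>_m over the 1 bits of y, then reduce mod m."""
    total = 0
    i = 0
    while y:
        if y & 1:
            total += pow(2, i, m)   # entry <2^i>_m of Table 4.1
        y >>= 1
        i += 1
    return total % m


def encode_rns(y, moduli=(8, 7, 5, 3)):
    """Binary-to-RNS conversion for a nonnegative integer y."""
    return tuple(residue_via_powers(y, m) for m in moduli)


print(encode_rns(0b10100100))
```

For the power-of-2 modulus the loop is unnecessary in hardware, since the residue is just the low-order bits of y.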

(55)

Conversion from RNS to Binary/Decimal

Theorem 4.1 (The Chinese remainder theorem)

x = (x_k–1 | . . . | x₂ | x₁ | x₀)_RNS = 〈 ∑_i M_i 〈α_i x_i〉_mi 〉_M

where M_i = M/m_i and α_i = 〈M_i ^–1〉_mi (multiplicative inverse of M_i wrt m_i)

Implementing CRT-based RNS-to-binary conversion

x = 〈 ∑_i M_i 〈α_i x_i〉_mi 〉_M = 〈 ∑_i f_i(x_i)〉_M

We can use a table to store the f_i values: ∑_i m_i entries

Table 4.2 Values needed in applying the Chinese remainder theorem to RNS(8 | 7 | 5 | 3)

––––––––––––––––––––––––––––––
i    m_i   x_i   〈M_i 〈α_i x_i〉_mi〉_M
––––––––––––––––––––––––––––––
3    8     0     0
           1     105
           2     210
           3     315
           .     .

(56)

Intuitive Justification for CRT

Puzzle: What number has the remainders of 2, 3, and 2

when divided by the numbers 7, 5, and 3, respectively?

x = (2 | 3 | 2)_RNS(7|5|3) = (?)_ten

(1 | 0 | 0)_RNS(7|5|3) = multiple of 15 that is 1 mod 7 = 15

(0 | 1 | 0)_RNS(7|5|3) = multiple of 21 that is 1 mod 5 = 21

(0 | 0 | 1)_RNS(7|5|3) = multiple of 35 that is 1 mod 3 = 70

(2 | 3 | 2)_RNS(7|5|3) = (2 | 0 | 0) + (0 | 3 | 0) + (0 | 0 | 2)

= 2 × (1 | 0 | 0) + 3 × (0 | 1 | 0) + 2 × (0 | 0 | 1)

= 2 × 15 + 3 × 21 + 2 × 70

= 30 + 63 + 140

= 233 = 23 mod 105

Therefore, x = (23)_ten
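The CRT decoding formula can be sketched directly. The function name `crt_decode` is illustrative; the sketch assumes Python 3.8 or later, where `math.prod` exists and `pow(a, -1, m)` returns a modular inverse.

```python
from math import prod


def crt_decode(residues, moduli):
    """Decode an RNS number: x = < sum_i M_i * <alpha_i * x_i>_mi >_M."""
    M = prod(moduli)
    total = 0
    for x_i, m_i in zip(residues, moduli):
        M_i = M // m_i
        alpha_i = pow(M_i, -1, m_i)      # multiplicative inverse of M_i wrt m_i
        total += M_i * ((alpha_i * x_i) % m_i)
    return total % M


print(crt_decode((2, 3, 2), (7, 5, 3)))
print(crt_decode((1, 2, 4, 0), (8, 7, 5, 3)))
```

The result lies in [0, M – 1]; for a signed range such as [–420, 419], values of M/2 or more would be mapped to negatives by subtracting M.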

(57)

Difficult RNS Arithmetic Operations

Sign test

Magnitude comparison

Division

•Could convert back and forth to/from binary.

•Another approach: convert to a mixed radix system, as numbers in a mixed radix system are comparable.

(58)

Difficult RNS Arithmetic Operations

Example: Of the following RNS(8 | 7 | 5 | 3) numbers:

Which, if any, are negative?

Which is the largest?

Which is the smallest?

Assume a range of [–420, 419]

a = (0 | 1 | 3 | 2)_RNS

b = (0 | 1 | 4 | 1)_RNS

c = (0 | 6 | 2 | 1)_RNS

d = (2 | 0 | 0 | 2)_RNS

e = (5 | 0 | 1 | 0)_RNS

f = (7 | 6 | 4 | 2)_RNS

Answers:

d < c < f < a < e < b

–70 < –8 < –1 < 8 < 21 < 64

(59)

General RNS Division

General RNS division, as opposed to division by one of the moduli (aka scaling), is difficult; hence, use of RNS is unlikely to be effective when an application requires many divisions

Scheme proposed in 1994 PhD thesis of Ching-Yu Hung (UCSB):

Use an algorithm that has built-in tolerance to imprecision, and apply the approximate CRT decoding to choose quotient digits

Example: SRT algorithm (s is the partial remainder)

s < 0 quotient digit = –1

s ≅ 0 quotient digit = 0

s > 0 quotient digit = 1

The BSD quotient can be converted to RNS on the fly

(60)

Limits of Fast Arithmetic in RNS

Known results from number theory

Theorem 4.2: The ith prime p_i is asymptotically i ln i

Theorem 4.3: The number of primes in [1, n] is asymptotically n / ln n

Theorem 4.4: The product of all primes in [1, n] is asymptotically eⁿ

Implications to speed of arithmetic in RNS

Theorem 4.5: It is possible to represent all k-bit binary numbers in RNS with O(k / log k) moduli such that the largest modulus has O(log k) bits

That is, with fast log-time adders, addition needs O(log log k) time

(61)

Hardware Implementation for RNS Representations

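[Figure: Hardware for RNS(8 | 7 | 5 | 3). Operand 1 and Operand 2, each coded in 3 + 3 + 3 + 2 bits (mod 8, mod 7, mod 5, mod 3), feed separate Mod-8, Mod-7, Mod-5, and Mod-3 units, which produce the Result.]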

(62)

Addition/Subtraction

Most slides originate from the textbook author’s PowerPoint presentation files.

(63)

II Addition / Subtraction

Chapter 5 Basic Addition and Counting

Chapter 6 Carry-Lookahead Adders

Chapter 7 Variations in Fast Adders

Chapter 8 Multioperand Addition

Topics in This Part

Review addition schemes and various speedup methods

• Addition is a key op (in itself, and as a building block)

• Subtraction = negation + addition

• Carry propagation speedup: lookahead, skip, select, …

• Two-operand versus multioperand addition

(64)

Basic Addition and Counting

Chapter Goals

Study the design of ripple-carry adders, discuss why their latency is unacceptable, and set the foundation for faster adders

Chapter Highlights

Full adders are versatile building blocks

Longest carry chain on average: log₂k bits

Fast asynchronous adders are simple

Counting is relatively easy to speed up

(65)

HA and FA Adders

Half-adder (HA): Truth table and block diagram

Inputs   Outputs
x  y     c  s
0  0     0  0
0  1     0  1
1  0     0  1
1  1     1  0

Full-adder (FA): Truth table and block diagram

Inputs         Outputs
x  y  c_in     c_out  s
0  0  0        0      0
0  0  1        0      1
0  1  0        0      1
0  1  1        1      0
1  0  0        0      1
1  0  1        1      0
1  1  0        1      0
1  1  1        1      1

[Figure: HA block with inputs x, y and outputs c, s; FA block with inputs x, y, c_in and outputs c_out, s.]
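The truth tables above can be modeled at the bit level. This is a behavioral sketch in Python, not HDL; the function names are illustrative, and building the full adder from two half adders plus an OR gate is one common construction, assumed here for illustration.

```python
def half_adder(x, y):
    """Return (carry, sum) for two input bits."""
    return x & y, x ^ y


def full_adder(x, y, c_in):
    """Return (c_out, s), built from two half adders and an OR gate."""
    c1, s1 = half_adder(x, y)
    c2, s = half_adder(s1, c_in)
    return c1 | c2, s


def ripple_carry_add(xs, ys, c_in=0):
    """Add two equal-length bit lists (least significant bit first).
    Returns (sum_bits, c_out); the carry ripples through every stage."""
    sum_bits = []
    carry = c_in
    for x, y in zip(xs, ys):
        carry, s = full_adder(x, y, carry)
        sum_bits.append(s)
    return sum_bits, carry


# 6 + 7 with 4-bit operands, LSB first
print(ripple_carry_add([0, 1, 1, 0], [1, 1, 1, 0]))
```

The loop makes the ripple-carry latency visible: stage i cannot finish until the carry from stage i – 1 is known.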