Computer Arithmetic Design
Instructor: Kuan Jen Lin
E-Mail: [email protected]
Web: http://vlsi.ee.fju.edu.tw/teacher/kjlin/kjlin.htm Dept. of EE, FJU, Taiwan
Room: SF 727B
SW & HW
SW = Algorithm + Data Structure + Programming techniques
HW = Algorithm + Architecture + Design Method
Computing
Communication
Pipeline
Systolic array Low power Interface
…
Full custom Cell based FPGA
System level
Course Objectives
Learn computer algorithms to do arithmetic operations
Learn hardware designs for computer arithmetic.
After completing the course
Students are able to implement computer arithmetic hardware designs using HDL.
Students are able to read research papers about computer arithmetic.
Textbook
•Textbook
Behrooz Parhami, “Computer Arithmetic: Algorithms and Hardware Designs,” Oxford University Press
•Reference books:
Ercegovac and Lang, “Digital Arithmetic,” MKP.
Stine, “Digital Computer Arithmetic Datapath Design Using Verilog HDL,” CAP
Syllabus
Number representation
Two-operand Addition
Multi-operand Addition
Multiplication
Division
Square Root
Paper reading and presentation
Grading
Mid Exam (30%)
Paper reading and presentation (30%)
Homework (some problems need HDL programming) (30%)
Attendance and Others (10%)
Number Representation
Most slides are revised from the PowerPoint files available on the textbook website.
Numbers and Arithmetic
Chapter Goals
Define scope and provide motivation
Set the framework for the rest of the book
Review positional fixed-point numbers
Chapter Highlights
What goes on inside your calculator?
Ways of encoding numbers in k bits
Radices and digit sets: conventional, exotic
Conversion from one system to another
What is Computer Arithmetic?
Pentium Division Bug (1994-95): Pentium’s radix-4 SRT algorithm occasionally gave incorrect quotient
First noted in 1994 by T. Nicely who computed sums of reciprocals of twin primes:
1/5 + 1/7 + 1/11 + 1/13 + . . . + 1/p + 1/(p + 2) + . . .
Worst-case example of division error in Pentium:
c = 4 195 835 / 3 145 727
Correct quotient: 1.333 820 44...
circa 1994 Pentium double FLP value: 1.333 739 06... (accurate to only 14 bits; worse than single!)
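The "14 bits" figure can be checked numerically. The following is a minimal sketch, assuming a correctly working FPU for the reference quotient and taking the flawed value exactly as quoted on the slide; the variable names are illustrative.

```python
import math

# Reference quotient, computed on a (presumably correct) modern FPU.
correct = 4195835 / 3145727
# Flawed quotient as quoted on the slide for the 1994 Pentium.
flawed = 1.33373906

rel_err = abs(correct - flawed) / correct
# Number of leading bits on which the two values agree.
good_bits = -math.log2(rel_err)

print(f"correct = {correct:.8f}")
print(f"relative error = {rel_err:.2e}, about {good_bits:.1f} good bits")
```

A relative error near 2^-14 is what "accurate to only 14 bits" means here.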
Hardware (our focus in this book)
Design of efficient digital circuits for primitive and other arithmetic operations such as +, –, ×, ÷, √, log, sin, cos
Issues: Algorithms; Error analysis; Speed/cost trade-offs; Hardware implementation; Testing, verification

Software
Numerical methods for solving systems of linear equations, partial differential equations, etc.
Issues: Algorithms; Error analysis; Computational complexity; Programming; Testing, verification

General-purpose
Flexible data paths; fast primitive operations like +, –, ×, ÷, √; benchmarking

Special-purpose
Tailored to applications like: digital filtering, image processing, radar tracking
The Scope of Computer Arithmetic.
Using a calculator with √, $x^2$, and $x^y$ functions, compute:
u = √√ … √2 = 1.000 677 131 (“1024th root of 2”)
v = $2^{1/1024}$ = 1.000 677 131
Save u and v; if you can’t save, recompute values when needed
x = $(((u^2)^2)\ldots)^2$ = 1.999 999 963
x' = $u^{1024}$ = 1.999 999 973
y = $(((v^2)^2)\ldots)^2$ = 1.999 999 983
y' = $v^{1024}$ = 1.999 999 994
Perhaps v and u are not really the same value
w = v – u = 1 × $10^{-11}$ Nonzero due to hidden digits
(u – 1) × 1000 = 0.677 130 680 [Hidden ... (0) 68]
(v – 1) × 1000 = 0.677 130 690 [Hidden ... (0) 69]
A Motivating Example
Finite Precision Can Lead to Disaster
Example: Failure of Patriot Missile (1991 Feb. 25)
Source: http://www.math.psu.edu/dna/455.f96/disasters.html
American Patriot Missile battery in Dhahran, Saudi Arabia, failed to intercept incoming Iraqi Scud missile
The Scud struck an American Army barracks, killing 28
Cause, per GAO/IMTEC-92-26 report: “software problem” (inaccurate calculation of the time since boot)
Problem specifics:
Time in tenths of second as measured by the system’s internal clock was multiplied by 1/10 to get the time in seconds
Internal registers were 24 bits wide
1/10 = 0.0001 1001 1001 1001 1001 100 (chopped to 24 b)
Error ≈ 0.1100 1100 × $2^{-23}$ ≈ 9.5 × $10^{-8}$
Error in 100-hr operation period ≈ 9.5 × $10^{-8}$ × 100 × 60 × 60 × 10 = 0.34 s
Distance traveled by Scud = (0.34 s) × (1676 m/s) ≈ 570 m
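The error figures above can be reproduced with exact rational arithmetic. This is a minimal sketch, assuming the chopped constant keeps the 23 fractional bits shown on the slide; `FRAC_BITS` and the other names are illustrative, not taken from the GAO report.

```python
from fractions import Fraction

FRAC_BITS = 23                      # fractional bits kept on the slide (assumption)
tenth = Fraction(1, 10)

# Chop (truncate) 1/10 to FRAC_BITS fractional bits.
chopped = Fraction(int(tenth * 2**FRAC_BITS), 2**FRAC_BITS)
err_per_tick = tenth - chopped      # error added for every 0.1 s clock tick

ticks = 100 * 60 * 60 * 10          # tenths of a second in 100 hours
drift = float(err_per_tick * ticks) # accumulated timing error in seconds
distance = drift * 1676             # metres travelled by the Scud in that time

print(f"error per tick = {float(err_per_tick):.3e}")
print(f"drift = {drift:.3f} s, distance = {distance:.0f} m")
```

The unrounded values land slightly above the slide's rounded 0.34 s and 570 m.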
Numbers and Their Encodings
Some 4-bit number representation formats
[Figure: 4-bit formats shown: unsigned integer; signed integer (± sign bit); fixed point, 3+1 (with radix point); signed fraction; 2's-compl fraction; floating point (exponent e in {−2, −1, 0, 1}, significand s in {0, 1, 2, 3}); logarithmic (base-2 logarithm, log x)]
Encoding Numbers in 4 Bits
[Figure: values representable in each 4-bit number format, plotted on a number line from −16 to 16]
Unsigned integers
Signed-magnitude (±)
3 + 1 fixed-point, xxx.x
Signed fraction, ±.xxx
2’s-compl. fraction, x.xxx
2 + 2 floating-point, s × $2^e$, e in [−2, 1], s in [0, 3]
2 + 2 logarithmic (log = xx.xx)
Fixed-Radix Positional Number Systems
$(x_{k-1} x_{k-2} \ldots x_1 x_0 \,.\, x_{-1} x_{-2} \ldots x_{-l})_r = \sum_{i=-l}^{k-1} x_i r^i$
One can generalize to:
Arbitrary radix (not necessarily integer, positive, constant)
Arbitrary digit set, usually {–α, –α+1, . . . , β–1, β} = [–α, β]
Example 1.1. Balanced ternary number system:
Radix r = 3, digit set = [–1, 1]
Example 1.2. Negative-radix number systems:
Radix –r, r ≥ 2, digit set = [0, r – 1]
The special case with radix –2 and digit set [0, 1] is known as the negabinary number system
Can it represent all integers?
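The question about negabinary can be explored experimentally. The sketch below is an illustration, not part of the slides: `to_negabinary` and `from_digits` are hypothetical helper names, and the conversion uses repeated division by −2 with the remainder forced into [0, 1].

```python
def to_negabinary(n):
    """Return the radix -2 digits of integer n, most significant first."""
    if n == 0:
        return [0]
    digits = []
    while n != 0:
        n, rem = divmod(n, -2)
        if rem < 0:          # Python gives rem in {0, -1}; move it into {0, 1}
            n += 1
            rem += 2
        digits.append(rem)
    return digits[::-1]

def from_digits(digits, radix):
    """Evaluate a digit vector (most significant first) in any integer radix."""
    value = 0
    for d in digits:
        value = value * radix + d
    return value

for n in (-3, -1, 2, 6):
    print(n, to_negabinary(n))
```

Every integer tried, positive or negative, round-trips without a sign, which suggests the answer to the slide's question is yes.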
More Examples of Number Systems
Example 1.3. Digit set [–4, 5] for r = 10:
(3 –1 5)ten represents 295 = 300 – 10 + 5
Example 1.4. Digit set [–7, 7] for r = 10:
(3 –1 5)ten = (3 0 –5)ten = (1 –7 0 –5)ten
Example 1.7. Quater-imaginary number system:
radix r = 2j, digit set [0, 3]
Number Radix Conversion
u = w . v (whole part w, fractional part v)
= $(x_{k-1} x_{k-2} \ldots x_1 x_0 \,.\, x_{-1} x_{-2} \ldots x_{-l})_r$ Old
= $(X_{K-1} X_{K-2} \ldots X_1 X_0 \,.\, X_{-1} X_{-2} \ldots X_{-L})_R$ New
Radix conversion, using arithmetic in the old radix r
Convenient when converting from r = 10
Radix conversion, using arithmetic in the new radix R
Convenient when converting to R = 10
Example: (31)eight = (25)ten 31 Oct. = 25 Dec. Halloween = Xmas
Radix Conversion: Old-Radix Arithmetic
Converting whole part w: (105)ten = (?)five
Repeatedly divide by five
Quotient   Remainder
105
21         0
4          1
0          4
Therefore, (105)ten = (410)five
Converting fractional part v: (105.486)ten = (410.?)five
Repeatedly multiply by five
Whole part   Fraction
             .486
2            .430
2            .150
0            .750
3            .750
3            .750
Therefore, (105.486)ten ≅ (410.22033)five
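The two procedures above can be sketched as follows. The function names are illustrative; `Fraction` is used for the fractional part so that the decimal value .486 is held exactly, as it would be in old-radix (decimal) arithmetic.

```python
from fractions import Fraction

def convert_whole(w, new_radix):
    """Repeatedly divide by the new radix; remainders are digits, LSD first."""
    digits = []
    while w > 0:
        w, rem = divmod(w, new_radix)
        digits.append(rem)
    return digits[::-1] or [0]

def convert_fraction(v, new_radix, num_digits):
    """Repeatedly multiply by the new radix; whole parts are digits, MSD first."""
    digits = []
    for _ in range(num_digits):
        v *= new_radix
        whole = int(v)
        digits.append(whole)
        v -= whole
    return digits

print(convert_whole(105, 5))                        # [4, 1, 0]
print(convert_fraction(Fraction(486, 1000), 5, 5))  # [2, 2, 0, 3, 3]
```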
Radix Conversion: New-Radix Arithmetic
Converting whole part w: (22033)five = (?)ten
((((2 × 5) + 2) × 5 + 0) × 5 + 3) × 5 + 3
Intermediate values, innermost parentheses first: 10, 12, 60, 303, 1518
Converting fractional part v: (410.22033)five = (105.?)ten
(0.22033)five × $5^5$ = (22033)five = (1518)ten
1518 / $5^5$ = 1518 / 3125 = 0.48576
Therefore, (410.22033)five = (105.48576)ten
Horner’s rule or formula
Horner’s Rule for Fractions
Converting fractional part v: (0.22033)five = (?)ten
(((((3 / 5) + 3) / 5 + 0) / 5 + 2) / 5 + 2) / 5
Intermediate values, innermost parentheses first: 0.6, 3.6, 0.72, 2.144, 2.4288, 0.48576
Horner’s rule or formula
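Both uses of Horner's rule (whole part and fractional part) fit in a few lines. This is a minimal sketch with illustrative function names; exact fractions are used so the result can be compared directly with 1518 / 3125.

```python
from fractions import Fraction

def horner_whole(digits, radix):
    """Digits most significant first: multiply by the radix, then add."""
    value = 0
    for d in digits:
        value = value * radix + d
    return value

def horner_fraction(digits, radix):
    """Fractional digits, first digit after the radix point first:
    start from the last digit, add, then divide by the radix."""
    value = Fraction(0)
    for d in reversed(digits):
        value = (value + d) / radix
    return value

print(horner_whole([2, 2, 0, 3, 3], 5))            # 1518
print(float(horner_fraction([2, 2, 0, 3, 3], 5)))  # 0.48576
```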
Classes of Number Representations
Signed number
Redundant number system
Residue number system
Real number
2 Representing Signed Numbers
Chapter Goals
Learn different encodings of the sign info
Discuss implications for arithmetic design
Chapter Highlights
Using sign bit, biasing, complementation
Properties of 2’s-complement numbers
Signed vs unsigned arithmetic
Signed numbers, positions, or digits
[Figure: Four-bit signed-magnitude number representation system for integers. Bit patterns 0000 through 0111 encode +0 through +7; patterns 1000 through 1111 encode −0 through −7. Incrementing the bit pattern increments positive values and decrements negative values.]
[Figure: Four-bit biased integer number representation system with a bias of 8. Bit patterns 0000 through 1111 encode −8 through +7; incrementing the bit pattern always increments the value.]
Arithmetic with Biased Numbers
Addition/subtraction of biased numbers
x + y + bias = (x + bias) + (y + bias) – bias
x – y + bias = (x + bias) – (y + bias) + bias
A power-of-2 (or $2^a - 1$) bias simplifies addition/subtraction
Comparison of biased numbers:
Compare like ordinary unsigned numbers; find true difference by ordinary subtraction
We seldom perform arbitrary arithmetic on biased numbers
Main application: Exponent field of floating-point numbers
Example and Two Special Cases
Example -- complement system for fixed-point numbers:
Complementation constant M = 12.000
Fixed-point number range [–6.000, +5.999]
Represent –3.258 as 12.000 – 3.258 = 8.742
Auxiliary operations for complement representations:
complementation or change of sign (computing M – x)
computations of residues mod M
Thus, M must be selected to simplify these operations
Two choices allow just this for fixed-point radix-r arithmetic with k whole digits and l fractional digits:
Radix complement M = $r^k$
Digit complement M = $r^k$ – ulp (aka diminished radix compl)
ulp (unit in least position) stands for $r^{-l}$
Allows us to forget about l, even for nonintegers
Two’s-Complement Numbers
[Figure: Four-bit 2’s-complement representation. Bit patterns 0000 through 0111 encode +0 through +7; patterns 1000 through 1111 encode −8 through −1.]
Two’s complement = radix complement system for r = 2
M = $2^k$
$2^k - x = [(2^k - ulp) - x] + ulp = x^{compl} + ulp$
Range of representable numbers with k whole bits: from $-2^{k-1}$ to $2^{k-1} - ulp$
ulp (unit in least position) stands for $r^{-l}$
Allows us to forget about l, even for nonintegers
One’s-Complement Number Representation
One’s complement = digit complement (diminished radix complement) system for r = 2
M = $2^k$ – ulp
$(2^k - ulp) - x = x^{compl}$
Range of representable numbers with k whole bits: from $-2^{k-1} + ulp$ to $2^{k-1} - ulp$
[Figure: Four-bit 1’s-complement representation. Bit patterns 0000 through 0111 encode +0 through +7; patterns 1000 through 1111 encode −7 through −0.]
Range/Precision extension for 2’s- and 1’s Complement
Range/precision extension for 2’s-complement numbers
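$\ldots\ x_{k-1}\ x_{k-1}\ x_{k-1}\ \ x_{k-1}\ x_{k-2} \ldots x_1\ x_0 \,.\, x_{-1}\ x_{-2} \ldots x_{-l}\ \ 0\ 0\ 0 \ldots$
Sign extension on the left replicates the sign bit $x_{k-1}$; extension to the right of the LSD $x_{-l}$ appends 0s
Range/precision extension for 1’s-complement numbers
$\ldots\ x_{k-1}\ x_{k-1}\ x_{k-1}\ \ x_{k-1}\ x_{k-2} \ldots x_1\ x_0 \,.\, x_{-1}\ x_{-2} \ldots x_{-l}\ \ x_{k-1}\ x_{k-1}\ x_{k-1} \ldots$
Sign extension on the left replicates the sign bit $x_{k-1}$; extension to the right of the LSD $x_{-l}$ also appends copies of the sign bit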
Mod $2^k$ vs. Mod $2^k - 1$
Mod-$2^k$ operation needed in 2’s-complement arithmetic is trivial:
Simply drop the carry-out (subtract $2^k$ if result is $2^k$ or greater)
Mod-($2^k$ – ulp) operation needed in 1’s-complement arithmetic is done via end-around carry:
(x + y) – ($2^k$ – ulp): connect $c_{out}$ to $c_{in}$
Since the dropped carry is worth $2^k$ units and the inserted carry is worth ulp, the combined effect is to reduce the magnitude by $2^k$ – ulp.
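End-around carry can be illustrated for k-bit integers (ulp = 1). The following is a sketch under that assumption; `encode`, `decode`, and `ones_complement_add` are hypothetical names, and overflow detection is omitted.

```python
def encode(v, k):
    """1's-complement bit pattern of v, for -(2**(k-1) - 1) <= v <= 2**(k-1) - 1."""
    mask = (1 << k) - 1
    return v if v >= 0 else (~(-v)) & mask

def decode(p, k):
    """Signed value of a k-bit 1's-complement pattern (both zeros decode to 0)."""
    mask = (1 << k) - 1
    return p if p < (1 << (k - 1)) else -((~p) & mask)

def ones_complement_add(x, y, k):
    """Add two k-bit patterns, feeding the carry-out back in as carry-in."""
    mask = (1 << k) - 1
    total = x + y
    carry_out = total >> k
    # Dropping the carry subtracts 2**k; re-inserting it adds 1 (ulp).
    return ((total & mask) + carry_out) & mask

s = ones_complement_add(encode(5, 4), encode(-3, 4), 4)
print(format(s, "04b"), decode(s, 4))   # 0010 2
```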
Why 2’s-Complement Is the Universal Choice
Adder/subtractor architecture for 2’s-complement numbers.
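[Figure: a mux controlled by the add/sub signal selects either y or its complement as the adder's second operand (controlled complementation); the same add/sub signal (0 for addition, 1 for subtraction) drives $c_{in}$; the adder produces s = x ± y and $c_{out}$]
Can replace this mux with k XOR gates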
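A bit-level model of this adder/subtractor, using the XOR form of controlled complementation, might look like the sketch below. The function name and the (s, c_out) return convention are assumptions made for illustration.

```python
def add_sub(x, y, sub, k):
    """k-bit 2's-complement adder/subtractor; sub = 0 adds, sub = 1 subtracts.
    Returns (s, c_out) as unsigned k-bit pattern and carry-out bit."""
    mask = (1 << k) - 1
    y_sel = (y & mask) ^ (mask if sub else 0)   # k XOR gates: y or its complement
    total = (x & mask) + y_sel + sub            # add/sub also drives c_in
    return total & mask, total >> k

print(add_sub(5, 3, 0, 8))   # (8, 0)
print(add_sub(5, 3, 1, 8))   # (2, 1)
print(add_sub(3, 5, 1, 8))   # (254, 0), i.e. -2 in 8-bit 2's complement
```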
Interpreting a 2’s-complement number as having a negatively weighted most-significant digit.
x = (1 0 1 0 0 1 1 0)two’s-compl
Weights: $-2^7\ \ 2^6\ \ 2^5\ \ 2^4\ \ 2^3\ \ 2^2\ \ 2^1\ \ 2^0$
–128 + 32 + 4 + 2 = –90
Check:
x = (1 0 1 0 0 1 1 0)two’s-compl
–x = (0 1 0 1 1 0 1 0)two
Weights: $2^7\ \ 2^6\ \ 2^5\ \ 2^4\ \ 2^3\ \ 2^2\ \ 2^1\ \ 2^0$
64 + 16 + 8 + 2 = 90
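The negatively weighted interpretation translates directly into code. A minimal sketch, with an illustrative function name and bits given most significant first:

```python
def twos_complement_value(bits):
    """Value of a 2's-complement bit list (MSB first): the MSB has weight -2**(k-1)."""
    k = len(bits)
    value = -bits[0] * 2 ** (k - 1)
    for i, b in enumerate(bits[1:], start=1):
        value += b * 2 ** (k - 1 - i)
    return value

print(twos_complement_value([1, 0, 1, 0, 0, 1, 1, 0]))   # -90
print(twos_complement_value([0, 1, 0, 1, 1, 0, 1, 0]))   # 90
```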
Redundant Number Systems
Chapter Goals
Explore the advantages and drawbacks of using more than r digit values in radix r
Chapter Highlights
Redundancy eliminates long carry chains
Redundancy takes many forms: trade-offs
Conversions between redundant and nonredundant representations
Redundancy used for end values too?
Coping with the Carry Problem
Ways of dealing with the carry propagation problem:
1. Limit propagation to within a small number of bits (Chapters 3-4)
2. Detect end of propagation; don’t wait for worst case (Chapter 5)
3. Speed up propagation via lookahead etc. (Chapters 6-7)
4. Ideal: Eliminate carry propagation altogether! (Chapter 3)
Use Redundant Number System (1/2)
    5  7  8  2  4  9
+   6  2  9  3  8  9     Operand digits in [0, 9]
––––––––––––––––––––––––––––––––––
   11  9 17  5 12 18     Position sums in [0, 18]
But how can we extend this beyond a single addition?
Subsequent additions will cause problems.
•The digit values 10 through 18 are redundant.
•A conventional carry would occur whenever a position sum is 10 or more; here the sum, which never exceeds 18, is kept as a single digit.
Use Redundant Number System (2/2)
   18 18 18 18 18
+   0  0  0  0  1
Is there still a carry propagation problem?
The sum of digits for each position is in [0, 36]; each can be decomposed into an interim sum in [0, 16] and a transfer digit in [0, 2], i.e., a carry.
    8  8  8  8  9     Interim sums
 1  1  1  1  1        Transfer digits
 1  9  9  9  9  9     Sum digits (the leftmost transfer becomes the leading digit)
Example: Addition of Redundant Numbers
Position sum decomposition [0, 36] = 10 × [0, 2] + [0, 16]
Absorption of transfer digit [0, 16] + [0, 2] = [0, 18]
      11  9 17 10 12 18
+      6 12  9 10  8 18     Operand digits in [0, 18]
––––––––––––––––––––––––––––––––––
      17 21 26 20 20 36     Position sums in [0, 36]
       7 11 16  0 10 16     Interim sums in [0, 16]
    1  1  1  2  1  2        Transfer digits in [0, 2]
––––––––––––––––––––––––––––––––––
    1  8 12 18  1 12 16     Sum digits in [0, 18]
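The two-step scheme (decompose, then absorb) can be sketched for radix 10 with digit set [0, 18]. The decomposition rule used here, transfer = min(p // 10, 2), is one valid choice among several; it differs from some of the choices made in the worked example above, so individual digits may differ while the represented value is the same. Function names are illustrative.

```python
def value(digits, radix=10):
    """Value of a digit vector, most significant digit first."""
    v = 0
    for d in digits:
        v = v * radix + d
    return v

def carry_free_add(x, y, radix=10):
    """Add two equal-length digit vectors with digits in [0, 18], MSD first."""
    p = [a + b for a, b in zip(x, y)]              # position sums in [0, 36]
    t = [min(pi // radix, 2) for pi in p]          # transfer digits in [0, 2]
    w = [pi - radix * ti for pi, ti in zip(p, t)]  # interim sums in [0, 16]
    incoming = t[1:] + [0]                         # transfer from the position to the right
    s = [wi + ti for wi, ti in zip(w, incoming)]   # sum digits in [0, 18]
    return [t[0]] + s                              # leftmost transfer is the leading digit

x = [11, 9, 17, 10, 12, 18]
y = [6, 12, 9, 10, 8, 18]
s = carry_free_add(x, y)
print(s, value(s))
```

Each sum digit depends only on its own position and the one to its right, so no carry chain forms regardless of operand width.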
Carry-Free Addition Schemes
[Figure: three carry-free addition schemes, each mapping operand digits $x_i$, $y_i$ at position i to sum digits $s_i$]
(a) Ideal single-stage carry-free. (Impossible for positional system with fixed digit set)
(b) Two-stage carry-free: the interim sum at position i is combined with the transfer digit $t_i$ into position i.
(c) Single-stage with lookahead.
Redundancy Index
So, redundancy helps us achieve carry-free addition
But how much redundancy is actually needed? Is [0, 11] enough for r = 10?
      11 10  7 11  3  8
+      7  2  9 10  9  8     Operand digits in [0, 11]
––––––––––––––––––––––––––––––––––
      18 12 16 21 12 16     Position sums in [0, 22]
       8  2  6  1  2  6     Interim sums in [0, 9]
    1  1  1  2  1  1        Transfer digits in [0, 2]
––––––––––––––––––––––––––––––––––
    1  9  3  8  2  3  6     Sum digits in [0, 11]
Redundancy index ρ = α + β + 1 – r
For example, 0 + 11 + 1 – 10 = 2
Digit Sets and Digit-Set Conversions
Example 3.1: Convert from digit set [0, 18] to [0, 9] in radix 10
11  9 17 10 12 18     18 = 10 (carry 1) + 8
11  9 17 10 13  8     13 = 10 (carry 1) + 3
11  9 17 11  3  8     11 = 10 (carry 1) + 1
11  9 18  1  3  8     18 = 10 (carry 1) + 8
11 10  8  1  3  8     10 = 10 (carry 1) + 0
12  0  8  1  3  8     12 = 10 (carry 1) + 2
1 2 0 8 1 3 8         Answer; all digits in [0, 9]
Note: Conversion from redundant to nonredundant representation always involves carry propagation
Thus, the process is sequential and slow
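The right-to-left carry propagation of Example 3.1 can be written as a short loop. This is a sketch with an illustrative function name; it assumes nonnegative digits.

```python
def to_conventional(digits, radix=10):
    """Convert a redundant digit vector (MSD first, digits >= 0) to digits in [0, radix-1]."""
    out = []
    carry = 0
    for d in reversed(digits):          # sequential: each step needs the previous carry
        carry, digit = divmod(d + carry, radix)
        out.append(digit)
    while carry:                        # emit any remaining carry as leading digits
        carry, digit = divmod(carry, radix)
        out.append(digit)
    return out[::-1]

print(to_conventional([11, 9, 17, 10, 12, 18]))   # [1, 2, 0, 8, 1, 3, 8]
```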
Generalized Signed-Digit Numbers
[Figure: a taxonomy of redundant and non-redundant positional number systems, reconstructed as an outline]
Radix-r Positional
  ρ = 0: Non-redundant
    α = 0: Conventional
    α ≥ 1: Non-redundant signed-digit
  ρ ≥ 1: Generalized signed-digit (GSD)
    ρ = 1: Minimal GSD
      α = β (even r): Symmetric minimal GSD; r = 2 gives BSD or BSB
      α ≠ β: Asymmetric minimal GSD
        α = 0: Stored-carry (SC); r = 2 gives BSC
        α = 1 (r ≠ 2): Non-binary SB
    ρ ≥ 2: Non-minimal GSD
      α = β: Symmetric non-minimal GSD
        α < r: Ordinary signed-digit (OSD)
          α = ⌊r/2⌋ + 1: Minimally redundant OSD
          α = r – 1: Maximally redundant OSD
      α ≠ β: Asymmetric non-minimal GSD
        α = 0: Unsigned-digit redundant (UDR)
        α = 1, β = r: SCB; r = 2 gives BSCB
Radix r
Digit set [–α, β]
Requirement α + β + 1 ≥ r
Redundancy index ρ = α + β + 1 – r
Binary Signed Digit (BSD)
$x_i$         1    –1    0    –1    0     BSD representation of +6
〈s, v〉       01   11    00   11    00    Sign and value encoding
2’s-compl    01   11    00   11    00    2-bit 2’s-complement
〈n, p〉       01   10    00   10    00    Negative & positive flags
〈n, z, p〉    001  100   010  100   010   1-out-of-3 encoding
Carry-Free Addition Algorithms
Carry-free addition of GSD numbers
Compute the position sums $p_i = x_i + y_i$
Divide $p_i$ into a transfer $t_{i+1}$ and interim sum $w_i = p_i - r t_{i+1}$
Add incoming transfers to get the sum digits $s_i = w_i + t_i$
[Figure: two-stage cell structure; operand digits $x_i$, $y_i$ produce $w_i$ and $t_{i+1}$, and $s_i$ is formed from $w_i$ and $t_i$]
If the transfer digits $t_i$ are in [–λ, μ], we must have:
$-\alpha + \lambda \le p_i - r t_{i+1} \le \beta - \mu$
The middle term is the interim sum. The left bound is the smallest interim sum if a transfer of –λ is to be absorbable; the right bound is the largest interim sum if a transfer of μ is to be absorbable.
These constraints lead to:
$\lambda \ge \alpha / (r - 1)$, $\mu \ge \beta / (r - 1)$
Is Carry-Free Addition Always Applicable?
No: It requires one of the following two conditions [Parh 90]
a. r > 2, ρ ≥ 3
b. r > 2, ρ = 2, α ≠ 1, β ≠ 1 (e.g., not [−1, 10] in radix 10)
In other words, it is inapplicable for
r = 2 (perhaps most useful case)
ρ = 1 (e.g., carry-save)
ρ = 2 with α = 1 or β = 1 (e.g., carry/borrow-save)
BSD is not two-stage carry-free -1 -1 0 -1 -1 -2 -1
-1
Use Carry-Estimate
A position sum –1 is kept intact when the incoming transfer is in [0, 1], whereas it is rewritten as 1 with a carry of –1 for incoming transfer in [–1, 0]. This guarantees that $t_i \ne w_i$ and thus –1 ≤ $s_i$ ≤ 1.
        1  –1   0  –1   0     $x_i$ in [–1, 1]
+       0  –1  –1   0   1     $y_i$ in [–1, 1]
        1  –2  –1  –1   1     $p_i$ in [–2, 2]
        1   0   1  –1  –1     $w_i$ in [–1, 1]
    0  –1  –1   0   1         $t_{i+1}$ in [–1, 1] (incoming transfer at the rightmost position is 0)
        0  –1   1   0  –1     $s_i$ in [–1, 1]
Transfer estimates $e_i$ in {low: [–1, 0], high: [0, 1]}, left to right: high low low low high high
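The carry-estimate rule can be sketched as below. One assumption is made explicit: the estimate for position i is taken as "high" when the position sum to its right is nonnegative (or when there is no position to the right) and "low" otherwise; this is consistent with the rule stated above, but the slide does not spell it out. Function names are illustrative.

```python
from itertools import product

def bsd_value(digits):
    """Value of a binary signed-digit vector, most significant digit first."""
    v = 0
    for d in digits:
        v = 2 * v + d
    return v

def bsd_add(x, y):
    """Add two equal-length BSD vectors (MSD first, digits in {-1, 0, 1})."""
    n = len(x)
    p = [a + b for a, b in zip(reversed(x), reversed(y))]   # LSD first
    w, t = [0] * n, [0] * (n + 1)                           # t[i] enters position i
    for i in range(n):
        # Assumed estimate: incoming transfer is in [0, 1] if p[i-1] >= 0, else in [-1, 0].
        high = (i == 0) or (p[i - 1] >= 0)
        if p[i] == 2:
            w[i], t[i + 1] = 0, 1
        elif p[i] == 1:
            w[i], t[i + 1] = (-1, 1) if high else (1, 0)
        elif p[i] == 0:
            w[i], t[i + 1] = 0, 0
        elif p[i] == -1:
            w[i], t[i + 1] = (-1, 0) if high else (1, -1)
        else:
            w[i], t[i + 1] = 0, -1
    s = [w[i] + t[i] for i in range(n)] + [t[n]]
    return s[::-1]

def exhaustive_check(n):
    """Check value and digit range for all pairs of n-digit BSD operands."""
    vectors = list(product((-1, 0, 1), repeat=n))
    return all(bsd_value(bsd_add(a, b)) == bsd_value(a) + bsd_value(b)
               and all(-1 <= d <= 1 for d in bsd_add(a, b))
               for a in vectors for b in vectors)

print(bsd_add([1, -1, 0, -1, 0], [0, -1, -1, 0, 1]))   # [0, 0, -1, 1, 0, -1]
```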
Residue Number Systems
Chapter Goals
Study a way of encoding large numbers as a collection of smaller numbers to simplify and speed up some operations
Chapter Highlights
Moduli, range, arithmetic operations
Many sets of moduli possible: tradeoffs
Conversions between RNS and binary
The Chinese remainder theorem
Why are RNS applications limited?
RNS Representations and Arithmetic
Chinese puzzle, 1500 years ago:
What number has the remainders of 2, 3, and 2 when divided by 7, 5, and 3, respectively?
Residues uniquely identify the number, hence they constitute a representation
Pairwise relatively prime moduli: $m_{k-1} > \ldots > m_1 > m_0$
The residue $x_i$ of x wrt the ith modulus $m_i$ (similar to a digit):
$x_i = x \bmod m_i = \langle x \rangle_{m_i}$
RNS representation contains a list of k residues or digits:
x = (2 | 3 | 2)RNS(7|5|3)
Default RNS for this chapter: RNS(8 | 7 | 5 | 3)
RNS Dynamic Range
Product M of the k pairwise relatively prime moduli is the dynamic range
M = $m_{k-1} \times \ldots \times m_1 \times m_0$
For RNS(8 | 7 | 5 | 3), M = 8 × 7 × 5 × 3 = 840
Negative numbers: Complement relative to M
$\langle -x \rangle_{m_i} = \langle M - x \rangle_{m_i}$
21 = (5 | 0 | 1 | 0)RNS
–21 = (8 – 5 | 0 | 5 – 1 | 0)RNS = (3 | 0 | 4 | 0)RNS
Here are some example numbers in our default RNS(8 | 7 | 5 | 3):
(0 | 0 | 0 | 0)RNS Represents 0 or 840 or . . .
(1 | 1 | 1 | 1)RNS Represents 1 or 841 or . . .
(2 | 2 | 2 | 2)RNS Represents 2 or 842 or . . .
(0 | 1 | 4 | 1)RNS Represents 64 or 904 or . . .
(2 | 0 | 0 | 2)RNS Represents –70 or 770 or . . .
(7 | 6 | 4 | 2)RNS Represents –1 or 839 or . . .
We can take the range of RNS(8|7|5|3) to be [−420, 419] or any other set of 840 consecutive integers
We will see later how the weights can be determined for a given RNS
RNS as Weighted Representation
For RNS(8 | 7 | 5 | 3), the weights of the 4 positions are:
105 120 336 280
Example: (1 | 2 | 4 | 0)RNS represents the number
〈105×1 + 120×2 + 336×4 + 280×0〉840 = 〈1689〉840 = 9
For RNS(7 | 5 | 3), the weights of the 3 positions are:
15 21 70
Example -- Chinese puzzle: (2 | 3 | 2)RNS(7|5|3) represents the number
〈15 × 2 + 21 × 3 + 70 × 2〉105 = 〈233〉105 = 23
RNS Encoding and Arithmetic Operations
Binary-coded format for RNS(8 | 7 | 5 | 3).
Arithmetic in RNS(8 | 7 | 5 | 3)
(5 | 5 | 0 | 2)RNS Represents x = +5
(7 | 6 | 4 | 2)RNS Represents y = –1
(4 | 4 | 4 | 1)RNS x + y : $\langle 5 + 7 \rangle_8$ = 4, $\langle 5 + 6 \rangle_7$ = 4, etc.
(6 | 6 | 1 | 0)RNS x – y : $\langle 5 - 7 \rangle_8$ = 6, $\langle 5 - 6 \rangle_7$ = 6, etc. (alternatively, find –y and add to x)
(3 | 2 | 0 | 1)RNS x × y : $\langle 5 \times 7 \rangle_8$ = 3, $\langle 5 \times 6 \rangle_7$ = 2, etc.
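The digit-wise nature of RNS addition, subtraction, and multiplication is easy to see in code. A minimal sketch for the default RNS(8 | 7 | 5 | 3); the function names are illustrative.

```python
MODULI = (8, 7, 5, 3)

def to_rns(n, moduli=MODULI):
    """Residues of n; Python's % already yields nonnegative residues for negative n."""
    return tuple(n % m for m in moduli)

def rns_add(a, b, moduli=MODULI):
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, moduli))

def rns_sub(a, b, moduli=MODULI):
    return tuple((ai - bi) % m for ai, bi, m in zip(a, b, moduli))

def rns_mul(a, b, moduli=MODULI):
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))

x, y = to_rns(5), to_rns(-1)
print(x, y)            # (5, 5, 0, 2) (7, 6, 4, 2)
print(rns_add(x, y))   # (4, 4, 4, 1)
print(rns_sub(x, y))   # (6, 6, 1, 0)
print(rns_mul(x, y))   # (3, 2, 0, 1)
```

Each residue position is processed independently, which is why the hardware can use four small parallel units with no carries between them.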
[Figure: binary-coded operands with residue fields of 3, 3, 3, and 2 bits (mod 8, mod 7, mod 5, mod 3); operand 1 and operand 2 feed independent Mod-8, Mod-7, Mod-5, and Mod-3 units, which produce the result residues]
Choosing the RNS Moduli
Target range for our RNS: Decimal values [0, 100 000]
Strategy 1: To minimize the largest modulus, and thus ensure high-speed arithmetic, pick prime numbers in sequence
Pick m0 = 2, m1 = 3, m2 = 5, etc. After adding m5 = 13:
RNS(13 | 11 | 7 | 5 | 3 | 2) M = 30 030 Inadequate
RNS(17 | 13 | 11 | 7 | 5 | 3 | 2) M = 510 510 Too large
RNS(17 | 13 | 11 | 7 | 3 | 2) M = 102 102 Just right! 5 + 4 + 4 + 3 + 2 + 1 = 19 bits
Fine tuning: Combine pairs of moduli 2 & 13 (26) and 3 & 7 (21)
RNS(26 | 21 | 17 | 11) M = 102 102
An Improved Strategy
Target range for our RNS: Decimal values [0, 100 000]
Strategy 2: Improve strategy 1 by including powers of smaller primes before proceeding to the next larger prime
RNS($2^2$ | 3) M = 12
RNS($3^2$ | $2^3$ | 7 | 5) M = 2520
RNS(11 | $3^2$ | $2^3$ | 7 | 5) M = 27 720
RNS(13 | 11 | $3^2$ | $2^3$ | 7 | 5) M = 360 360 (remove one 3, combine 3 & 5)
RNS(15 | 13 | 11 | $2^3$ | 7) M = 120 120 4 + 4 + 4 + 3 + 3 = 18 bits
Fine tuning: Maximize the size of the even modulus within the 4-bit limit
RNS($2^4$ | 13 | 11 | $3^2$ | 7 | 5) M = 720 720 Too large
We can now remove 5 or 7; not an improvement in this example
Low-Cost RNS Moduli
Target range for our RNS: Decimal values [0, 100 000]
Strategy 3: To simplify the modular reduction (mod $m_i$) operations, choose only moduli of the forms $2^a$ or $2^a - 1$, aka “low-cost moduli”
RNS($2^{a_{k-1}}$ | $2^{a_{k-2}} - 1$ | . . . | $2^{a_1} - 1$ | $2^{a_0} - 1$)
We can have only one even modulus
$2^{a_i} - 1$ and $2^{a_j} - 1$ are relatively prime iff $a_i$ and $a_j$ are relatively prime
RNS($2^3$ | $2^3 - 1$ | $2^2 - 1$) basis: 3, 2 M = 168
RNS($2^4$ | $2^4 - 1$ | $2^3 - 1$) basis: 4, 3 M = 1680
RNS($2^5$ | $2^5 - 1$ | $2^3 - 1$ | $2^2 - 1$) basis: 5, 3, 2 M = 20 832
RNS($2^5$ | $2^5 - 1$ | $2^4 - 1$ | $2^3 - 1$) basis: 5, 4, 3 M = 104 160
Comparison
RNS(15 | 13 | 11 | $2^3$ | 7) 18 bits M = 120 120
RNS($2^5$ | $2^5 - 1$ | $2^4 - 1$ | $2^3 - 1$) 17 bits M = 104 160
It’s easy to compute mod $2^k$ and mod $2^k - 1$
Encoding and Decoding of Numbers
Conversion from binary/decimal to RNS
Table 4.1 Residues of the first 10 powers of 2
–––––––––––––––––––––––––––––––––––––––––
i    $2^i$   $\langle 2^i \rangle_7$   $\langle 2^i \rangle_5$   $\langle 2^i \rangle_3$
–––––––––––––––––––––––––––––––––––––––––
0    1      1    1    1
1    2      2    2    2
2    4      4    4    1
3    8      1    3    2
4    16     2    1    1
5    32     4    2    2
6    64     1    4    1
7    128    2    3    2
8    256    4    1    1
9    512    1    2    2
–––––––––––––––––––––––––––––––––––––––––
Example 4.1: Represent the number y = (1010 0100)two = (164)ten in RNS(8 | 7 | 5 | 3)
The mod-8 residue is easy to find: $x_3 = \langle y \rangle_8$ = (100)two = 4
We have y = $2^7 + 2^5 + 2^2$; thus
$x_2 = \langle y \rangle_7 = \langle 2 + 4 + 4 \rangle_7 = 3$
$x_1 = \langle y \rangle_5 = \langle 3 + 2 + 4 \rangle_5 = 4$
$x_0 = \langle y \rangle_3 = \langle 2 + 2 + 1 \rangle_3 = 2$
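The table-lookup conversion of Example 4.1 can be sketched as follows. The names are illustrative, and the input is assumed to be a nonnegative integer that fits in `bits` bits.

```python
MODULI = (8, 7, 5, 3)

def power_residues(modulus, count):
    """Residues of 2**0 .. 2**(count-1), built by repeated doubling mod the modulus."""
    table, r = [], 1 % modulus
    for _ in range(count):
        table.append(r)
        r = (2 * r) % modulus
    return table

def to_rns(y, moduli=MODULI, bits=10):
    """Sum the stored residues of the powers of 2 present in y, then reduce once."""
    residues = []
    for m in moduli:
        table = power_residues(m, bits)
        total = sum(table[i] for i in range(bits) if (y >> i) & 1)
        residues.append(total % m)
    return tuple(residues)

print(power_residues(7, 10))   # [1, 2, 4, 1, 2, 4, 1, 2, 4, 1]
print(to_rns(0b10100100))      # (4, 3, 4, 2)
```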
Conversion from RNS to Binary/Decimal
Theorem 4.1 (The Chinese remainder theorem)
$x = (x_{k-1} | \ldots | x_2 | x_1 | x_0)_{RNS} = \langle \sum_i M_i \langle \alpha_i x_i \rangle_{m_i} \rangle_M$
where $M_i = M/m_i$ and $\alpha_i = \langle M_i^{-1} \rangle_{m_i}$ (multiplicative inverse of $M_i$ wrt $m_i$)
Implementing CRT-based RNS-to-binary conversion
$x = \langle \sum_i M_i \langle \alpha_i x_i \rangle_{m_i} \rangle_M = \langle \sum_i f_i(x_i) \rangle_M$
We can use a table to store the $f_i$ values: $\sum_i m_i$ entries
Table 4.2 Values needed in applying the Chinese remainder theorem to RNS(8 | 7 | 5 | 3)
––––––––––––––––––––––––––––––
i    $m_i$   $x_i$   $\langle M_i \langle \alpha_i x_i \rangle_{m_i} \rangle_M$
––––––––––––––––––––––––––––––
3    8      0      0
            1      105
            2      210
            3      315
.    .      .      .
.    .      .      .
.    .      .      .
Intuitive Justification for CRT
Puzzle: What number has the remainders of 2, 3, and 2
when divided by the numbers 7, 5, and 3, respectively?
x = (2 | 3 | 2)RNS(7|5|3) = (?)ten
(1 | 0 | 0)RNS(7|5|3) = multiple of 15 that is 1 mod 7 = 15
(0 | 1 | 0)RNS(7|5|3) = multiple of 21 that is 1 mod 5 = 21
(0 | 0 | 1)RNS(7|5|3) = multiple of 35 that is 1 mod 3 = 70
(2 | 3 | 2)RNS(7|5|3) = (2 | 0 | 0) + (0 | 3 | 0) + (0 | 0 | 2)
= 2 × (1 | 0 | 0) + 3 × (0 | 1 | 0) + 2 × (0 | 0 | 1)
= 2 × 15 + 3 × 21 + 2 × 70
= 30 + 63 + 140
= 233 = 23 mod 105
Therefore, x = (23)ten
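CRT decoding follows the formula of Theorem 4.1 term by term. The sketch below uses illustrative names and relies on Python 3.8+ for `pow(a, -1, m)` (modular inverse) and `math.prod`; the signed variant assumes the symmetric range convention [−M/2, M/2 − 1] used on these slides.

```python
from math import prod

def crt_decode(residues, moduli):
    """RNS to integer in [0, M) via the Chinese remainder theorem."""
    M = prod(moduli)
    total = 0
    for x_i, m_i in zip(residues, moduli):
        M_i = M // m_i
        alpha_i = pow(M_i, -1, m_i)          # multiplicative inverse of M_i mod m_i
        total += M_i * ((alpha_i * x_i) % m_i)
    return total % M

def crt_decode_signed(residues, moduli):
    """Same, mapped to the symmetric range [-M/2, M/2 - 1]."""
    M = prod(moduli)
    value = crt_decode(residues, moduli)
    return value - M if value >= M // 2 else value

print(crt_decode((2, 3, 2), (7, 5, 3)))                 # 23
print(crt_decode((1, 2, 4, 0), (8, 7, 5, 3)))           # 9
print(crt_decode_signed((2, 0, 0, 2), (8, 7, 5, 3)))    # -70
```

Decoding is also one way to perform sign tests and magnitude comparisons, at the cost of leaving the residue domain.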
Difficult RNS Arithmetic Operations
Sign test
Magnitude comparison
Division
•Could convert back and forth to/from binary.
•Another approach: convert to a mixed radix system, as numbers in a mixed radix system are comparable.
Difficult RNS Arithmetic Operations
Example: Of the following RNS(8 | 7 | 5 | 3) numbers:
Which, if any, are negative?
Which is the largest?
Which is the smallest?
Assume a range of [–420, 419]
a = (0 | 1 | 3 | 2)RNS
b = (0 | 1 | 4 | 1)RNS
c = (0 | 6 | 2 | 1)RNS
d = (2 | 0 | 0 | 2)RNS
e = (5 | 0 | 1 | 0)RNS
f = (7 | 6 | 4 | 2)RNS
Answers:
d < c < f < a < e < b
–70 < –8 < –1 < 8 < 21 < 64
General RNS Division
General RNS division, as opposed to division by one of the moduli (aka scaling), is difficult; hence, use of RNS is unlikely to be effective when an application requires many divisions
Scheme proposed in 1994 PhD thesis of Ching-Yu Hung (UCSB):
Use an algorithm that has built-in tolerance to imprecision, and apply the approximate CRT decoding to choose quotient digits
Example –– SRT algorithm (s is the partial remainder)
s < 0 quotient digit = –1
s ≅ 0 quotient digit = 0
s > 0 quotient digit = 1
The BSD quotient can be converted to RNS on the fly
Limits of Fast Arithmetic in RNS
Known results from number theory
Implications to speed of arithmetic in RNS
Theorem 4.5: It is possible to represent all k-bit binary numbers in RNS with O(k / log k) moduli such that the largest modulus has O(log k) bits
That is, with fast log-time adders, addition needs O(log log k) time
Theorem 4.2: The ith prime $p_i$ is asymptotically i ln i
Theorem 4.3: The number of primes in [1, n] is asymptotically n / ln n
Theorem 4.4: The product of all primes in [1, n] is asymptotically $e^n$
Hardware Implementation for RNS Representations
[Figure: operand 1 and operand 2, each with residue fields of 3, 3, 3, and 2 bits (mod 8, mod 7, mod 5, mod 3), feed independent Mod-8, Mod-7, Mod-5, and Mod-3 units, which produce the result residues]
Addition/Subtraction
Most slides originate from the textbook author’s PowerPoint presentation files.
II Addition / Subtraction
Chapter 5 Basic Addition and Counting
Chapter 6 Carry-Lookahead Adders
Chapter 7 Variations in Fast Adders
Chapter 8 Multioperand Addition
Topics in This Part
Review addition schemes and various speedup methods
• Addition is a key op (in itself, and as a building block)
• Subtraction = negation + addition
• Carry propagation speedup: lookahead, skip, select, …
• Two-operand versus multioperand addition
Basic Addition and Counting
Chapter Goals
Study the design of ripple-carry adders, discuss why their latency is unacceptable, and set the foundation for faster adders
Chapter Highlights
Full adders are versatile building blocks
Longest carry chain on average: $\log_2 k$ bits
Fast asynchronous adders are simple
Counting is relatively easy to speed up
HA and FA Adders
Half-adder (HA): Truth table and block diagram
Full-adder (FA): Truth table and block diagram
Half-adder (HA)
Inputs    Outputs
x  y      c  s
0  0      0  0
0  1      0  1
1  0      0  1
1  1      1  0

Full-adder (FA)
Inputs          Outputs
x  y  c_in      c_out  s
0  0  0         0      0
0  0  1         0      1
0  1  0         0      1
0  1  1         1      0
1  0  0         0      1
1  0  1         1      0
1  1  0         1      0
1  1  1         1      1

[Block diagrams: HA with inputs x, y and outputs c, s; FA with inputs x, y, c_in and outputs c_out, s]
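The truth tables correspond to simple Boolean expressions, and chaining full adders gives the ripple-carry adder that this chapter studies. A behavioral sketch in Python (not HDL), with illustrative function names; building the FA from two HAs and an OR gate is one common construction among several.

```python
def half_adder(x, y):
    """Return (c, s) for one-bit inputs x and y."""
    return x & y, x ^ y

def full_adder(x, y, c_in):
    """Return (c_out, s), built from two half-adders and an OR gate."""
    c1, s1 = half_adder(x, y)
    c2, s = half_adder(s1, c_in)
    return c1 | c2, s

def ripple_carry_add(x_bits, y_bits, c_in=0):
    """Add two equal-length bit lists given LSB first; return (sum_bits, c_out)."""
    s_bits, carry = [], c_in
    for x, y in zip(x_bits, y_bits):
        carry, s = full_adder(x, y, carry)   # each stage waits for the previous carry
        s_bits.append(s)
    return s_bits, carry

# 6 + 7 = 13 with 4-bit operands, LSB first
print(ripple_carry_add([0, 1, 1, 0], [1, 1, 1, 0]))   # ([1, 0, 1, 1], 0)
```

The sequential dependence on `carry` in the loop is the software analogue of the carry chain that makes ripple-carry latency grow linearly with word width.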