Linear Codes


CHAPTER 1

An Introduction to Codes

Basic Definitions

The concept of a string is fundamental to the subjects of information and coding theory.

Let A = {a1, a2, . . . , an} be a finite nonempty set which we refer to as an alphabet. A string over A is simply a finite sequence of elements of A. Strings will be denoted by boldface letters, such as x and y. If x = x1x2· · · xk is a string over A, then each xi in x is called an element of x. The length of a string x, denoted by Len(x), is the number of elements in the string x.

A code is nothing more than a set of strings over a certain alphabet. Of course, codes are generally used to encode messages.

Definition 1.1. Let A = {a1, a2, . . . , ar} be a set of r elements, which we call a code alphabet and whose elements are called code symbols. An r-ary code over A is a subset C of the set of all strings over A. The number r is called the radix of the code. The elements of C are called codewords and the number of codewords of C is called the size of C. When A = Z2 or A = Z3, codes over A are referred to as binary codes or ternary codes, respectively.

Definition 1.2. Let S = {s1, s2, . . . , sn} be a finite set which we refer to as a source alphabet. The elements of S are called source symbols and the number of source symbols in S is called the size of S. Let C be a code. An encoding function is a bijective function f : S → C, from S to C. We refer to the ordered pair (C, f) as an encoding scheme for S.

Definition 1.3. If all the codewords in a code C have the same length, we say that C is a fixed length code, or block code. Any encoding scheme that uses a fixed length code will be referred to as a fixed length encoding scheme. If C contains codewords of different lengths, we say that C is a variable length code. Any encoding scheme that uses a variable length code will be referred to as a variable length encoding scheme.

Fixed length codes have advantages and disadvantages over variable length codes. One advantage is that they never require a special symbol to separate the source symbols in the message being coded. Perhaps the main disadvantage of fixed length codes is that source symbols that are used frequently have codes as long as source symbols that are used infrequently. On the other hand, variable length codes, which can encode frequently used source symbols using shorter codewords, can save a great deal of time and space.

Uniquely Decipherable Codes

Definition 1.4. A code C over an alphabet A is uniquely decipherable if no two different sequences of codewords in C represent the same string over A. In symbols, if

c1c2 · · · cm = d1d2 · · · dn for ci, dj ∈ C, then m = n and ci = di for all i = 1, . . . , n.


The following theorem, proved by McMillan in 1956, provides us with some information about the codeword lengths of uniquely decipherable codes.

Theorem 1.5 (McMillan's Theorem). Let C = {c1, c2, . . . , cn} be a uniquely decipherable r-ary code and let li = Len(ci). Then its codeword lengths l1, l2, . . . , ln must satisfy

∑_{i=1}^{n} 1/r^{l_i} ≤ 1.

Remark 1.6. Consider the binary code C = {0, 11, 100, 110}. Its codeword lengths 1, 2, 3 and 3 satisfy Kraft’s Inequality, but it is not uniquely decipherable. Hence McMillan’s Theorem cannot tell us when a particular code is uniquely decipherable, but only when it is not.
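To make the inequality concrete, here is a small Python sketch (my own illustration, not part of the original text) that evaluates the Kraft–McMillan sum for a list of codeword lengths; it confirms that the lengths of the code C in Remark 1.6 satisfy the inequality even though C is not uniquely decipherable.

```python
from fractions import Fraction

def kraft_sum(lengths, r):
    """Evaluate the Kraft-McMillan sum  sum_i 1 / r**l_i  exactly."""
    return sum(Fraction(1, r ** l) for l in lengths)

# Codeword lengths of C = {0, 11, 100, 110} from Remark 1.6 (binary, so r = 2).
s = kraft_sum([1, 2, 3, 3], r=2)
print(s, s <= 1)   # 1 True: the inequality holds, yet C is not uniquely decipherable
```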

Instantaneous Codes

Definition 1.7. A code is said to be instantaneous if each codeword in any string of codewords can be decoded (reading from left to right) as soon as it is received.

If a code is instantaneous, then it is also uniquely decipherable. However, there exist codes that are uniquely decipherable but not instantaneous.

Definition 1.8. A code is said to have the prefix property if no codeword is a prefix of any other codeword, that is, if whenever c = x1x2· · · xn is a codeword, then x1x2· · · xk is not a codeword for 1 ≤ k < n.

Given a code C, it is easy to determine whether or not it has the prefix property. It is only necessary to compare each codeword with all codewords of greater length to see if it is a prefix. The importance of the prefix property comes from the following proposition.

Proposition 1.9. A code is instantaneous if and only if it has the prefix property.

Now we come to a theorem, published by L. G. Kraft in 1949, which gives a simple criterion to determine whether or not there is an instantaneous code with given codeword lengths.

Theorem 1.10 (Kraft's Theorem). There exists an instantaneous r-ary code C with codeword lengths l1, . . . , ln if and only if these lengths satisfy Kraft's inequality

∑_{i=1}^{n} 1/r^{l_i} ≤ 1.

Remark 1.11. Again we should point out that, as in Remark 1.6, Kraft's Theorem does not say that any code whose codeword lengths satisfy Kraft's inequality must be instantaneous. However, we can always construct an instantaneous code with these codeword lengths.
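The construction alluded to in Remark 1.11 can be made explicit. The sketch below is one standard way to do it (my illustration, not the text's): process the lengths in increasing order and take as the i-th codeword the base-r expansion, padded to length l_i, of the running sum ∑_{j<i} r^{l_i − l_j}; when Kraft's inequality holds, the resulting code has the prefix property.

```python
def instantaneous_code(lengths, r=2):
    """Build an r-ary prefix (instantaneous) code with the given codeword lengths,
    assuming they satisfy Kraft's inequality and r <= 10 (single-digit symbols)."""
    assert sum(r ** (-l) for l in lengths) <= 1 + 1e-9, "Kraft's inequality fails"
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    code = [None] * len(lengths)
    for pos, i in enumerate(order):
        li = lengths[i]
        w = sum(r ** (li - lengths[j]) for j in order[:pos])  # an integer, since li >= lj
        digits = []
        for _ in range(li):                                   # base-r expansion of w,
            digits.append(str(w % r))                         # padded to length li
            w //= r
        code[i] = "".join(reversed(digits))
    return code

print(instantaneous_code([1, 2, 3, 3]))   # ['0', '10', '110', '111'], a prefix code
```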

Definition 1.12. An instantaneous code C is said to be maximal instantaneous if C is not contained in any strictly larger instantaneous code.

Corollary 1.13. Let C be an instantaneous r-ary code with codeword lengths l1, . . . , ln. Then C is maximal instantaneous if and only if these lengths satisfy

∑_{i=1}^{n} 1/r^{l_i} = 1.


McMillan’s Theorem and Kraft’s Theorem together tell us something interesting about the relationship between uniquely decipherable codes and instantaneous codes. We have the following useful result.

Corollary 1.14. If a uniquely decipherable code exists with codeword lengths l1, . . . , ln, then an instantaneous code must also exist with these same codeword lengths.

Our interest in Corollary 1.14 will come later, when we turn to questions related to codeword lengths. For it tells us that we lose nothing by considering only instantaneous codes rather than all uniquely decipherable codes.

Exercises

(1) What is the minimum possible length for a binary block code containing n codewords?

(2) How many encoding functions are possible from the source alphabet S = {a, b, c} to the code C = {00, 01, 11}? List them.

(3) How many r-ary codes are there with maximum codeword length n over an alphabet A? What is this number for r = 2 and n = 5?

(4) Which of the following codes

C1 = {0, 01, 011, 0111, 01111, 11111} and C2 = {0, 10, 1101, 1110, 1011, 110110}

are uniquely decipherable?

(5) Is it possible to construct a uniquely decipherable code over the alphabet {0, 1, . . . , 9} with nine codewords of length 1, nine codewords of length 2, ten codewords of length 3 and ten codewords of length 4?

(6) For a given binary code C = {0, 10, 11}, let N(k) be the total number of sequences of codewords that contain exactly k bits. For instance, we have N(3) = 5. Show that in this case N(k) = N(k − 1) + 2N(k − 2), for all k ≥ 3.

(7) Suppose that we want an instantaneous binary code that contains the codewords 0, 10 and 1100. How many additional codewords of length 6 could be added to this code?

(8) Suppose that C is a maximal instantaneous code with maximum codeword length m. Show that C must contain at least two codewords of maximum length m.


CHAPTER 2

Noiseless Coding

Optimal Encoding Schemes

In order to achieve unique decipherability, McMillan’s Theorem tells us that we must allow reasonably long codewords. Unfortunately, this tends to reduce the efficiency of a code. On the other hand, it is often the case that not all source symbols occur with the same frequency within a given class of messages. When no errors can occur in the transmission of data, it makes sense to assign the longer codewords to the less frequently used source symbols, thereby improving the efficiency of the code.

Definition 2.1. An information source is an ordered pair I = (S, P), where S = {s1, . . . , sn} is a source alphabet and P is a probability law that assigns to each source symbol si of S a probability P(si). The sequence P(s1), . . . , P(sn) is the probability distribution for I.

For noiseless coding, the measure of efficiency of an encoding scheme is its average codeword length.

Definition 2.2. The average codeword length of an encoding scheme (C, f) for an information source I = (S, P), where S = {s1, . . . , sn}, is defined by

∑_{i=1}^{n} Len(f(si)) P(si).

We should emphasize the fact that the average codeword length of an encoding scheme is not the same as the average codeword length of a code, since the former depends also on the probability distribution.

It is clear that the average codeword length of an encoding scheme is not affected by the nature of the source symbols themselves. Hence, for the purposes of measuring average codeword length, we may assume that the codewords are assigned directly to the probabilities. Accordingly, we may speak of an encoding scheme (c1, . . . , cn) for the probability distribution (p1, . . . , pn). With this in mind, the average codeword length of an encoding scheme C = (c1, . . . , cn) is

AveLen(C) = ∑_{i=1}^{n} pi Len(ci).
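For concreteness, a two-line computation of AveLen (an illustrative sketch of mine; the scheme and distribution below are made up):

```python
def ave_len(codewords, probs):
    """AveLen(C) = sum_i p_i * Len(c_i) for the scheme C = (c_1, ..., c_n)."""
    return sum(p * len(c) for c, p in zip(codewords, probs))

# A hypothetical binary scheme for the distribution (0.5, 0.25, 0.25).
print(ave_len(["0", "10", "11"], [0.5, 0.25, 0.25]))   # 1.5
```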

Let (C1, f1) and (C2, f2) be two encoding schemes of the information source I such that the corresponding codes have the same radix. We say that (C1, f1) is more efficient than (C2, f2), if AveLen(C1) < AveLen(C2). We should point out that it makes sense to compare the average codeword lengths of different encoding schemes only when the corresponding codes have the same radix. For in general the larger the radix, the shorter we can make the average codeword length.


We will use the notation MinAveLenr(p1, . . . , pn) to denote the minimum average codeword length among all r-ary instantaneous encoding schemes for the probability distribution (p1, . . . , pn).

Definition 2.3. An optimal r-ary encoding scheme for a probability distribution (p1, . . . , pn) is an r-ary instantaneous encoding scheme (c1, . . . , cn) for which

AveLen(c1, . . . , cn) = MinAveLenr(p1, . . . , pn).

Note that optimal encoding schemes are, by definition, instantaneous. By virtue of Corollary 1.14, this minimum is also the minimum over all uniquely decipherable encoding schemes. Hence, we may restrict attention to instantaneous codes.

Huffman Encoding

In 1952 D. A. Huffman published a method for constructing optimal encoding schemes. This method is now known as Huffman encoding.

Since we are dealing with r-ary codes, we may as well assume that the code alphabet is {1, 2, . . . , r}.

Lemma 2.4. Let P = (p1, . . . , pn) be a probability distribution with p1 ≥ p2 ≥ · · · ≥ pn. Then there exists an optimal r-ary encoding scheme C = (c1, . . . , cn) for P that has exactly s codewords of maximum length, of the form d1, d2, . . . , ds for some string d (that is, a common string d followed by each of the code symbols 1, 2, . . . , s), where s is uniquely determined by the conditions s ≡ n (mod r − 1) and 2 ≤ s ≤ r.

As a result, for such probability distributions, we have

MinAveLenr(p1, . . . , pn) = MinAveLenr(p1, . . . , pn−s, q) + q,

where q = ∑_{i=n−s+1}^{n} pi.

By Lemma 2.4 we can present Huffman’s algorithm.

Theorem 2.5. The following algorithm H produces r-ary optimal encoding schemes C for probability distributions P:

(1) If P = (p1, . . . , pn), where n ≤ r, then let C = (1, . . . , n).

(2) If P = (p1, . . . , pn), where n > r, then

(a) Reorder P if necessary so that p1 ≥ p2 ≥ · · · ≥ pn.

(b) Let Q = (p1, . . . , pn−s, q), where q = ∑_{i=n−s+1}^{n} pi and s is uniquely determined by the conditions s ≡ n (mod r − 1) and 2 ≤ s ≤ r.

(c) Perform the algorithm H on Q, obtaining an encoding scheme D = (c1, . . . , cn−s, d).

(d) Let C = (c1, . . . , cn−s, d1, d2, . . . , ds).
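The recursive description in Theorem 2.5 translates almost line by line into code. The following Python sketch is my own rendering (not from the text); it writes the code alphabet {1, . . . , r} with decimal digits, so it assumes r ≤ 9.

```python
def huffman(probs, r=2):
    """r-ary Huffman encoding following the algorithm H of Theorem 2.5.
    Returns codewords (strings over '1'..str(r)) in the order of the input probabilities."""
    n = len(probs)
    if n <= r:                                   # step (1)
        return [str(i + 1) for i in range(n)]

    # step (2a): sort by decreasing probability, remembering the original positions
    order = sorted(range(n), key=lambda i: probs[i], reverse=True)
    p = [probs[i] for i in order]

    # step (2b): choose s with s ≡ n (mod r-1) and 2 <= s <= r
    s = 2 + (n - 2) % (r - 1)
    q = sum(p[n - s:])

    # step (2c): recurse on the reduced distribution Q = (p_1, ..., p_{n-s}, q)
    d_scheme = huffman(p[:n - s] + [q], r)
    c, d = d_scheme[:n - s], d_scheme[-1]

    # step (2d): expand the codeword d for q into the s longest codewords d1, ..., ds
    sorted_code = c + [d + str(j + 1) for j in range(s)]

    code = [None] * n                            # undo the sorting of step (2a)
    for rank, i in enumerate(order):
        code[i] = sorted_code[rank]
    return code

P = [0.4, 0.3, 0.2, 0.1]
C = huffman(P, r=2)
print(C)                                          # e.g. ['1', '21', '221', '222']
print(sum(p * len(c) for p, c in zip(P, C)))      # AveLen = 1.9
```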

Entropy of a Source

The information obtained from a source symbol should have the property that the less likely a source symbol is to occur, the more information we obtain from an occurrence of that symbol, and conversely. Because the information obtained from a source symbol is not a function of the symbol itself, but rather of the symbol's probability of occurrence p, we use the notation I(p) to denote the information obtained from a source symbol with probability of occurrence p.


Definition 2.6. For a source alphabet S, the r-ary information Ir(p) obtained from a source symbol s ∈ S with probability of occurrence p is given by

Ir(p) = logr(1/p).

Ir(p) can be characterized by the fact that it is the only continuous function on (0, 1] with the properties that Ir(pq) = Ir(p) + Ir(q) and Ir(1/r) = 1.

Definition 2.7. Let P = {p1, . . . , pn} be a probability distribution. The r-ary entropy of the distribution P is

Hr(P) = ∑_{i=1}^{n} pi Ir(pi) = ∑_{i=1}^{n} pi logr(1/pi).

(When pi = 0 we set pi logr(1/pi) = 0.) If I = (S, P) is an information source with probability distribution P = {p1, . . . , pn}, then we refer to Hr(I) = Hr(P) as the entropy of the source I.

The quantity Hr(I) is the average information obtained from a single sample of I. It seems reasonable to say that sampling from a source of size r whose symbols occur with equal probability gives an amount of information equal to one r-ary unit. For instance, if S = {0, 1} with P(0) = 1/2 and P(1) = 1/2, then each sample gives us one binary unit of information (or one bit of information). We mention that many books on information theory restrict attention to binary entropy and use the notation H(p1, . . . , pn) for binary entropy.
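A direct computation of r-ary entropy (a minimal sketch of my own; it simply implements Definition 2.7 with the convention 0 · log(1/0) = 0):

```python
import math

def entropy(probs, r=2):
    """H_r(P) = sum_i p_i * log_r(1/p_i), with the convention 0 * log_r(1/0) = 0."""
    return sum(p * math.log(1.0 / p, r) for p in probs if p > 0)

print(entropy([0.5, 0.5]))       # 1.0, one bit per sample
print(entropy([0.25] * 4))       # 2.0
print(entropy([1.0, 0.0]))       # 0.0, no information
```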

To establish the main properties of entropy, we begin with a lemma that can be easily derived from the fact that ln x ≤ x − 1 for all x > 0, with equality only when x = 1.

Lemma 2.8. Let P = {p1, . . . , pn} be a probability distribution. Let Q = {q1, . . . , qn} have the property that 0 ≤ qi ≤ 1 for all i, and ∑_{i=1}^{n} qi ≤ 1. Then

∑_{i=1}^{n} pi logr(1/pi) ≤ ∑_{i=1}^{n} pi logr(1/qi).

(We set 0 · logr(1/0) = 0 and p logr(1/0) = +∞, for p > 0.)

Furthermore, the equality holds if and only if pi = qi for all i.

With Lemma 2.8 at our disposal, we can determine the range of the entropy function.

Theorem 2.9. For an information source I = (S, P) of size n (i.e. |S| = n), the entropy satisfies

0 ≤ Hr(P) ≤ logr n.

Furthermore, Hr(P) = logrn if and only if the source has a uniform distribution (i.e. all of the source symbols are equally likely to occur), and Hr(P) = 0 if and only if one of the source symbols has probability 1 of occurring.

Theorem 2.9 confirms the fact that, on the average, the most information is obtained from sources for which each source symbol is equally likely to occur.


The Noiseless Coding Theorem

As we know, the entropy H(I) of an information source I is the amount of information contained in the source. Further, since an instantaneous encoding scheme for I captures the information in the source, it is reasonable to believe that the average codeword length of such a code must be at least as large as the entropy. In fact, this is what the Noiseless Coding Theorem says.

Theorem 2.10 (The Noiseless Coding Theorem). For any probability distribution P = (p1, . . . , pn), we have

Hr(p1, . . . , pn) ≤ MinAveLenr(p1, . . . , pn) < Hr(p1, . . . , pn) + 1.

Notice that the condition for equality in Theorem 2.10 is that li = −logr pi for all i, which means that each logr pi must be an integer. Since this is not often the case, we cannot often expect equality.

In general, if we choose the integers li to satisfy

logr(1/pi) ≤ li < logr(1/pi) + 1,

for all i, then, by Kraft's Theorem, there is an instantaneous encoding with these codeword lengths. An encoding scheme constructed by this method is referred to as a Shannon-Fano encoding scheme. However, this method does not, in general, give the smallest possible average codeword length.
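As an illustration (mine, not the author's), the Shannon-Fano choice of lengths can be computed directly; the resulting lengths always satisfy Kraft's inequality, so an instantaneous code with these lengths exists, for example via the construction sketched in Chapter 1.

```python
import math

def shannon_fano_lengths(probs, r=2):
    """Lengths l_i = ceil(log_r(1/p_i)); the small epsilon guards against floating-point noise."""
    return [math.ceil(math.log(1.0 / p, r) - 1e-12) for p in probs]

P = [0.4, 0.3, 0.2, 0.1]
L = shannon_fano_lengths(P)
print(L)                                      # [2, 2, 3, 4]
print(sum(2 ** (-l) for l in L))              # 0.6875 <= 1, so Kraft's inequality holds
print(sum(p * l for p, l in zip(P, L)))       # 2.4, larger than the Huffman value 1.9 above
```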

The Noiseless Coding Theorem determines MinAveLenr(p1, . . . , pn) to within 1 r-ary unit, but this may still be too much for some purposes. Fortunately, there is a way to improve upon this, based on the following idea.

Definition 2.11. Let S = {x1, . . . , xn} with probability distribution P(xi) = pi, for all i. The k-th extension of I = (S, P) is I^k = (S^k, P^k), where S^k is the set of all strings of length k over S and P^k is the probability distribution defined for x = x1x2 · · · xk ∈ S^k by P^k(x) = P(x1) · · · P(xk).

The entropy of an extension I^k is related to the entropy of I in a very simple way. It seems intuitively clear that, since we get k times as much information from a string of length k as from a single symbol, the entropy of I^k should be k times the entropy of I. The following lemma confirms this.

Lemma 2.12. Let I be an information source and let I^k be its k-th extension. Then Hr(I^k) = kHr(I).

Applying the Noiseless Coding Theorem to the extension I^k and using Lemma 2.12 gives the final version of the Noiseless Coding Theorem.

Theorem 2.13. Let P be a probability distribution and let P^k be its k-th extension. Then

Hr(P) ≤ MinAveLenr(S^k)/k < Hr(P) + 1/k.

Since each codeword in the k-th extension S^k encodes k source symbols from S, the quantity

MinAveLenr(S^k)/k


is the minimum average codeword length per source symbol of S, taken over all uniquely decipherable r-ary encodings of S^k. Theorem 2.13 says that, by encoding a sufficiently long extension of I, we may make the minimum average codeword length per source symbol of S as close to the entropy Hr(P) as desired. The penalty for doing so is that, since |S^k| = |S|^k, the number of codewords required to encode the k-th extension S^k grows exceedingly large as k gets large.

Exercise

(1) Let P = (0.3, 0.1, 0.1, 0.1, 0.1, 0.06, 0.05, 0.05, 0.05, 0.04, 0.03, 0.02). Find the Huff- man encodings of P for the given radix r, with r = 2, 3, 4.

(2) Determine possible probability distributions that have (00, 01, 10, 11) and (0, 10, 110, 111) as binary Huffman encodings.

(3) Determine all possible ternary Huffman encodings of sizes 5 and 6.

(4) Let C be a binary Huffman encoding. Prove that C is maximal instantaneous.

(5) Let C be a binary Huffman encoding for the uniform probability distribution P = (1/n, . . . , 1/n) and suppose that Len(ci) = li for i = 1, . . . , n. Let m = maxi{li}.

(a) Show that C has the minimum total codeword length ∑_{i=1}^{n} li among all instantaneous encodings.

(b) Show that there exist two codewords c and d in C such that Len(c) = Len(d) = m, and c and d differ only in their last positions.

(c) Show that m − 1 ≤ li ≤ m for i = 1, . . . , n.

(d) Let n = α · 2^k, where 1 < α ≤ 2. Let u be the number of codewords of length m − 1 and let v be the number of codewords of length m. Determine u, v and m in terms of α and k.

(e) Find MinAveLen2(P).

(6) Prove the following properties of entropy.

(a) Let {p1, . . . , pn, q1, . . . , qm} be a probability distribution. If p = p1 + · · · + pn, then

Hr(p1, . . . , pn, q1, . . . , qm) = Hr(p, 1 − p) + p Hr(p1/p, . . . , pn/p) + (1 − p) Hr(q1/(1 − p), . . . , qm/(1 − p)).

(b) Let P = {p1, . . . , pn} and Q = {q1, . . . , qn} be two probability distributions. For 0 ≤ t ≤ 1, we have

Hr(tp1 + (1 − t)q1, . . . , tpn + (1 − t)qn) ≥ t Hr(p1, . . . , pn) + (1 − t) Hr(q1, . . . , qn).

(c) Let P = {p1, . . . , pn} be a probability distribution. Suppose that ε is a positive real number such that p1 − ε > p2 + ε ≥ 0. Thus, {p1 − ε, p2 + ε, p3, . . . , pn} is also a probability distribution. Show that

Hr(p1, . . . , pn) < Hr(p1 − ε, p2 + ε, p3, . . . , pn).

(7) Let S = {0, 1}. In order to guarantee that the average codeword length per source symbol of S is at most 0.01 greater than the entropy of S, which extension of S should we encode? How many codewords would we need?

(8) Let I be an information source and let I^2 be its second extension. Is the second extension of I^2 equal to the fourth extension of I?

(9) Show that the Noiseless Coding Theorem is best possible by showing that for any ε > 0, there is a probability distribution P = {p1, . . . , pn} for which MinAveLenr(p1, . . . , pn) − Hr(p1, . . . , pn) ≥ 1 − ε.


CHAPTER 3

Noisy Coding

Communications Channels

In the previous chapter, we discussed the question of how to most efficiently encode source information for transmission over a noiseless channel, where we did not need to be concerned about correcting errors. Now we are ready to consider the question of how to encode source data efficiently and, at the same time, minimize the probability of uncorrected errors when transmitting over a noisy channel.

Definition 3.1. A communications channel consists of a finite input alphabet I = {x1, . . . , xs} and output alphabet O = {y1, . . . , yt}, and a set of forward channel probabilities or transition probabilities Pf(yj | xi), satisfying

∑_{j=1}^{t} Pf(yj | xi) = 1, for all i = 1, . . . , s.

Intuitively, we think of Pf(yj | xi) as the probability that yj is received, given that xi is sent through the channel. It is important not to confuse the forward channel probability Pf(yj | xi) with the so-called backward channel probability Pb(xi | yj). In the forward probabilities, we assume a certain input symbol was sent. In the backward probabilities, we assume a certain output symbol is received.

Example 3.2. The noiseless channel, which we discussed in the previous chapter, has the same input and output alphabet I = O = {x1, . . . , xs} and channel probabilities

Pf(xi | xj) = 1 if i = j, and Pf(xi | xj) = 0 otherwise.

Example 3.3. A communications channel is called symmetric if it has the same input and output alphabet I = O = {x1, . . . , xs} and channel probabilities Pf(xi| xi) = Pf(xj| xj) and Pf(xi| xj) = Pf(xj| xi), for all i, j = 1, . . . , s. Perhaps the most important memoryless channel is the binary symmetric channel, which has I = O = {0, 1} and channel probabilities Pf(1 | 0) = Pf(0 | 1) = p and Pf(0 | 0) = Pf(1 | 1) = 1 − p. Thus, the probability of a symbol error, also called the crossover probability, is p.

Example 3.4. Another important memoryless channel is the binary erasure channel, which has input alphabet I = {0, 1}, output alphabet O = {0, ?, 1} and channel probabilities Pf(1 | 0) = Pf(0 | 1) = q, Pf(? | 0) = Pf(? | 1) = p and Pf(0 | 0) = Pf(1 | 1) = 1 − p − q.

We will deal only with channels that have no memory, in the following sense.

Definition 3.5. A communications channel is said to be memoryless if for c = c1 · · · cn ∈ I^n and d = d1 · · · dn ∈ O^n, the probability that d is received, given that c is sent, is

Pf(d | c) = ∏_{i=1}^{n} Pf(di | ci).

We will also refer to the probabilities Pf(d | c) as forward channel probabilities.


We use the term memoryless because the probability that an output symbol di is received depends only on the current input ci, and not on previous inputs.

Decision Rules

A decision rule for a code C over the input alphabet is a partial function f from the set of output strings to the set of codewords C. The process of applying a decision rule is referred to as decoding. The word “partial” refers to the fact that f may not be defined for all output strings. The intention is that, if an output string d is received and if f(d) ∈ C is defined, then the decision rule declares that f(d) is the codeword that was sent, or else it declares a decoding error.

Our goal is to find a decision rule that maximizes the probability of correct decoding. The probability of correct decoding can be expressed in a variety of ways.

Conditioning on the codeword sent gives

P(correct decoding) = ∑_{c∈C} ∑_{d∈Bc} Pf(d | c) Pi(c),

where Bc = {d | f(d) = c} and Pi(c) is the probability that c is sent through the channel. The probabilities {Pi(c) | c ∈ C} form the so-called input distribution for the channel.

Conditioning instead on the string received gives

P(correct decoding) = ∑_{d} Pb(f(d) | d) Po(d),

where Po(d) is the probability that d is received through the channel; the probabilities {Po(d)} are called the output distribution for the channel.

The probability of correct decoding can be maximized by choosing a decision rule that maximizes each of the conditional probabilities Pb(f(d) | d).

Definition 3.6. Any decision rule f for which f(d) has the property that

Pb(f(d) | d) = max_{c∈C} Pb(c | d),

for every possible received string d, is called an ideal observer.

Proposition 3.7. An ideal observer decision rule maximizes the probability of the correct decoding of received strings among all decision rules.

We remark that an ideal observer decision rule depends on the input distribution because

Pb(c | d) = Pf(d | c) Pi(c) / ∑_{c′∈C} Pf(d | c′) Pi(c′).

For the case that the input probability distribution is uniform, i.e. Pi(c) = 1/|C|, we have

Pb(c | d) = Pf(d | c) / ∑_{c′∈C} Pf(d | c′).

Now the denominator on the right is a sum of forward channel probabilities and thus depends only on the communications channel. Thus, maximizing Pb(c | d) is equivalent to maximizing Pf(d | c). This leads to the following definition and proposition.

Definition 3.8. Any decision rule f for which f(d) maximizes the forward channel probabilities, that is, for which

Pf(d | f(d)) = max_{c∈C} Pf(d | c),


for every possible received string d, is called a maximum likelihood decision rule.

Proposition 3.9. For the uniform input distribution, an ideal observer is the same as a maximum likelihood decision rule.
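To make the maximum likelihood rule concrete, here is a small sketch (my construction, not the text's) for a memoryless binary symmetric channel: by Definition 3.5 the forward probability of a received string factors into a product over positions, so the rule simply picks a codeword whose product is largest. For p < 1/2 this is the same as picking the codeword that disagrees with the received string in the fewest positions.

```python
def bsc_forward_prob(received, codeword, p):
    """P_f(d | c) for a memoryless binary symmetric channel with crossover probability p."""
    prob = 1.0
    for d_i, c_i in zip(received, codeword):
        prob *= p if d_i != c_i else 1 - p
    return prob

def ml_decode(received, code, p):
    """Maximum likelihood decision rule: a codeword maximizing P_f(d | c)."""
    return max(code, key=lambda c: bsc_forward_prob(received, c, p))

code = ["0000", "1111"]                    # the code from Exercise (2) below
print(ml_decode("0010", code, p=0.01))     # '0000': one disagreement beats three
```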

Conditional Entropy and Channel Capacity

In general, knowing the value of the output of a channel will have an effect on our information about the input. This leads us to make the following definition.

Definition 3.10. Consider a communications channel with input alphabet I and output alphabet O. The r-ary conditional entropy of I, given y ∈ O, is defined by

Hr(I | y) = ∑_{x∈I} Pb(x | y) logr(1/Pb(x | y)).

The r-ary conditional entropy of I, given O, is the average conditional entropy defined by

Hr(I | O) = ∑_{y∈O} Hr(I | y) Po(y).

Note that Hr(I | O) measures the amount of information remaining in I after sampling O, and so it can be interpreted as the loss of information about I caused by the channel.

Conditional entropy can also be defined for strings.

Definition 3.11. Let C be a code over the input alphabet I and let D be the set of output strings over the output alphabet O. The r-ary conditional entropy of C, given that d = y1 · · · ym ∈ D, is defined by

Hr(C | d) = ∑_{c∈C} Pb(c | d) logr(1/Pb(c | d)).

The r-ary conditional entropy of C, given D, is defined by

Hr(C | D) = ∑_{d∈D} Hr(C | d) Po(d).

The quantity Ir(I, O) = Hr(I) − Hr(I | O) is the amount of information in I minus the amount of information still in I after knowing O. In other words, Ir(I, O) is the amount of information about I that gets through the channel.

Definition 3.12. The r-ary mutual information of I and O is defined by

Ir(I, O) = Hr(I) − Hr(I | O) = ∑_{x∈I} Pi(x) logr(1/Pi(x)) − Hr(I | O).

Notice that the quantity Ir(I, O) depends upon the input distribution of I as well as the forward channel probabilities Pf(y | x).

We are now ready to define the concept of the capacity of a communications channel. This concept plays a key role in the main results of information theory.

Definition 3.13. The capacity of a communications channel is the maximum mutual information Ir(I, O), taken over all input distributions of I.


Proposition 3.14. Consider a symmetric channel with input alphabet and output alphabet I of size r. Then the capacity of this symmetric channel is

1 − ∑_{y∈I} Pf(y | x) logr(1/Pf(y | x)),

for any x ∈ I. Furthermore, the capacity is achieved by the uniform input distribution.

Corollary 3.15. The capacity of the binary symmetric channel with crossover probability p is

1 + p log2 p + (1 − p) log2(1 − p).
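Corollary 3.15 is easy to evaluate numerically; the following sketch (an illustration of mine) shows how the capacity falls from 1 toward 0 as the crossover probability approaches 1/2.

```python
import math

def bsc_capacity(p):
    """Capacity of the binary symmetric channel: 1 + p*log2(p) + (1-p)*log2(1-p)."""
    def plog(x):
        return x * math.log2(x) if x > 0 else 0.0   # convention: 0 * log2(0) = 0
    return 1 + plog(p) + plog(1 - p)

for p in (0.0, 0.01, 0.1, 0.5):
    print(p, round(bsc_capacity(p), 4))   # 1.0, 0.9192, 0.531, 0.0
```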

The Noisy Coding Theorem

It is sometimes said that there are two main results in information theory. One is the Noiseless Coding Theorem, which we discussed in the previous chapter, and the other is the so-called Noisy Coding Theorem.

Before we can state the Noisy Coding Theorem formally, we need to discuss in detail the notion of rate of transmission. Let us suppose that the source information is in the form of strings of length k over the input alphabet I of size r, and that the r-ary block code C consists of codewords of fixed length n over I. Now, since the channel must transmit n code symbols in order to send k source symbols, the rate of transmission is R = k/n source symbols per code symbol. Further, since there are r^k possible source strings, the code must have size at least r^k in order to accommodate all of these strings. Assuming that |C| = r^k, we have k = logr |C| and hence R = (logr |C|)/n. Thus we have the following.

Definition 3.16. An r-ary block code C of length n and size |C| is called an (n, |C|)-code. The number

R(C) = (logr |C|)/n

is called the rate of C.

Now we can state the Noisy Coding Theorem. Let ⌈x⌉ denote the smallest integer greater than or equal to x.

Theorem 3.17 (The Noisy Coding Theorem). Consider a memoryless communications channel with capacity C. For any positive number R < C, there exists a sequence Cn of r-ary block codes and corresponding decision rules fn with the following properties.

(1) Cn is an (n, ⌈r^{nR}⌉)-code. Thus, Cn has length n and rate at least R.

(2) The probability of decoding error of fn approaches 0 as n → ∞.

Roughly speaking, the Noisy Coding Theorem says that, if we choose any transmission rate below the capacity of the channel, there exists a code that can transmit at that rate and yet maintain a probability of decoding error below some predefined limit.

The price we pay for this efficient encoding is that the code length n may be extremely large. Furthermore, the known proofs of this theorem tell us only that such a code must exist, but do not show us how to actually find these codes.


Exercise

(1) Consider a channel whose input alphabet is the set of all integers between −n and n and whose output is the square of the input. Determine the forward channel probabilities of this channel.

(2) Suppose that codewords from the code {0000, 1111} are being sent over a binary symmetric channel (c.f. Example 3.3) with crossover probability p = 0.01. Use the maximum likelihood decision rule to decode the received strings 0000, 0010 and 1010.

(3) Let C be a block code consisting of all 8 binary strings of length 3. Denote the input codeword by i1i2i3 and the received string by o1o2o3. Let B.S.C. denote a binary symmetric channel with crossover probability p = 0.001. Consider the following different channels.

(a) The first channel works as follows: send i1 through the B.S.C. to get o1 and no matter what i2 and i3 are, choose o2 and o3 randomly.

(b) The second channel works as follows: send i1 through the B.S.C. to get o1, send i2 through the B.S.C. to get o2 and send i3 through the B.S.C. to get o3.

(c) The third channel works as follows: choose o1 = o2 = o3 to be the majority bit among i1, i2 and i3.

Compute the probability of correct decoding for each of these channels, assuming a uniform input distribution. Which channel is best?

(4) Show that for a symmetric channel with uniform input distribution, the output distribution is also uniform.

(5) Let I and O be the input alphabet and the output alphabet of a noiseless communications channel. Show that Hr(I | O) = 0.

(6) Let I and O be the input alphabet and the output alphabet of a communications channel with forward channel probabilities {Pf(y | x) | x ∈ I, y ∈ O}. Suppose that {Pi(x) | x ∈ I} is the input distribution and {Po(y) | y ∈ O} is the output distribution for the channel.

(a) Show that the backward channel probability for x ∈ I and y ∈ O is

Pb(x | y) = Pf(y | x) Pi(x) / Po(y).

(b) Show that for an r-ary symmetric channel,

Ir(I, O) = ∑_{y∈O} Po(y) logr(1/Po(y)) − ∑_{y∈O} Pf(y | x) logr(1/Pf(y | x)),

for any x ∈ I.

(7) Consider the special case of a binary erasure channel (c.f. Example 3.4), which has input alphabet I = {0, 1}, output alphabet O = {0, ?, 1} and channel probabilities Pf(1 | 0) = Pf(0 | 1) = 0, Pf(? | 0) = Pf(? | 1) = p and Pf(0 | 0) = Pf(1 | 1) = 1 − p.

Calculate the mutual information I2(I, O) in terms of the input probability Pi(0) = p0. Then determine the capacity of the channel, and an input probability that achieves that capacity.


CHAPTER 4

General Remarks on Codes

Nearest Neighbor Decoding

In general the problem of finding good codes is a very difficult one. However, by making certain assumptions about the channel, we can at least give the problem a highly intuitive flavor. We begin with a definition.

Definition 4.1. Let x = x1x2 · · · xn and y = y1y2 · · · yn be strings of the same length n over the same alphabet A. The Hamming distance d(x, y) between x and y is the number of positions in which xi ≠ yi.

For instance, if x = 10112 and y = 20110, then d(x, y) = 2. The following result says that Hamming distance is a metric.
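The definition is a one-liner in code (a trivial sketch, included for concreteness):

```python
def hamming(x, y):
    """Hamming distance: the number of positions in which x and y differ."""
    assert len(x) == len(y), "strings must have the same length"
    return sum(a != b for a, b in zip(x, y))

print(hamming("10112", "20110"))   # 2, as in the example above
```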

Proposition 4.2. Let A^n be the set of all strings of length n over the alphabet A. Then the Hamming distance function d : A^n × A^n → N satisfies the following properties. For all x, y and z in A^n,

(1) d(x, y) ≥ 0 and d(x, y) = 0 if and only if x = y;

(2) d(x, y) = d(y, x);

(3) d(x, y) ≤ d(x, z) + d(z, y).

In other words, (A^n, d) is a metric space.

Suppose that C is a block code of length n over A. The codewords that are closest to a given received string x are referred to as nearest neighbor codewords. Nearest neighbor decoding, or minimum distance decoding, is the decision rule that decodes a received string as a nearest neighbor codeword. When there is more than one nearest neighbor codeword, we will refer to this situation as a tie. In some cases, we may wish to choose randomly from among the candidates. In other cases, it might be more desirable simply to admit a decoding error. The term complete decoding refers to the case where all received strings are decoded and the term incomplete decoding refers to the case where we prefer occasionally to simply admit an error, rather than always decode.

There are many channels for which maximum likelihood decoding takes the intuitive form of nearest neighbor decoding. For instance, the r-ary symmetric channel with forward channel probabilities

Pf(xi | xj) = 1 − p if i = j, and Pf(xi | xj) = p/(r − 1) otherwise,

has this property, for p < 1/2.

In implementing nearest neighbor decoding, the following concepts are useful.

Definition 4.3. Let C be a block code with at least two codewords. The minimum distance of C is defined to be

d(C) = min{d(c, d) | c, d ∈ C, c ≠ d}.


An (n, M, d)-code is a block code of size M, length n and minimum distance d. The numbers n, M and d are called the parameters of the code.

Since d(c, d) ≥ 1 for c ≠ d, the minimum distance of a code must be at least 1.

Perfect Code

Definition 4.4. Let x be a string in A^n, where |A| = r, and let ρ > 0. The sphere in A^n with center x and radius ρ is the set

S_r^n(x, ρ) = {y ∈ A^n | d(x, y) ≤ ρ}.

The volume V_r^n(ρ) of the sphere S_r^n(x, ρ) is the number of elements in S_r^n(x, ρ). This volume is independent of the center and is given by

V_r^n(ρ) = ∑_{k=0}^{⌊ρ⌋} \binom{n}{k} (r − 1)^k,

where ⌊ρ⌋ denotes the greatest integer smaller than or equal to ρ.
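The volume formula is straightforward to compute (a sketch of mine using Python's math.comb for the binomial coefficients):

```python
from math import comb, floor

def sphere_volume(n, r, rho):
    """V_r^n(rho) = sum_{k=0}^{floor(rho)} binom(n, k) * (r - 1)^k."""
    return sum(comb(n, k) * (r - 1) ** k for k in range(floor(rho) + 1))

print(sphere_volume(7, 2, 1))   # 8: the center plus its 7 neighbours at distance 1
```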

We can determine the minimum distance of a code C by simply increasing the radius t of the spheres centered at each codeword of C until just before two spheres become “tangent” (which will happen when d(C) = 2t + 2), or just before two spheres “overlap” (which will happen when d(C) = 2t + 1).

Definition 4.5. Let C ⊆ A^n be a code. The packing radius of C is the largest integer ρ for which the spheres S_r^n(c, ρ) centered at each codeword c are disjoint. The covering radius of C is the smallest integer ρ′ for which the spheres S_r^n(c, ρ′) centered at each codeword c cover A^n. We will denote the packing radius of C by pr(C) and the covering radius by cr(C).

Proposition 4.6. The packing radius of an (n, M, d)-code C is pr(C) = ⌊(d − 1)/2⌋.

The following concept plays a major role in coding theory.

Definition 4.7. An r-ary (n, M, d)-code C is perfect if pr(C) = cr(C).

In words, a code C ⊆ A^n is perfect if there exists a number ρ for which the spheres S_r^n(c, ρ) centered at each codeword c are disjoint and cover A^n.

The size of a perfect code is uniquely determined by the length and the minimum distance.

The following result is known as the sphere-packing condition.

Proposition 4.8. Let C be an r-ary (n, M, d)-code. Then C is perfect if and only if d = 2v + 1 is odd and

M · V_r^n(v) = M · ∑_{k=0}^{v} \binom{n}{k} (r − 1)^k = r^n.

It is important to emphasize that the existence of numbers n, M and d = 2v + 1 for which the sphere-packing condition holds does not mean that there is a perfect code with these parameters. The problem of determining all perfect codes has not yet been solved.

However, a great deal is known about perfect codes over alphabets whose size is a power of a prime.
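The sphere-packing condition itself is easy to test numerically. The sketch below (mine, reusing the sphere_volume helper sketched above) checks it for the parameters (7, 16, 3) of the binary Hamming code, which does correspond to a perfect code, and for the parameters (90, 2^78, 5), which satisfy the condition even though no such perfect binary code exists, illustrating the caveat just mentioned.

```python
def satisfies_sphere_packing(n, M, d, r=2):
    """Check the sphere-packing condition M * V_r^n(v) == r**n with d = 2v + 1 odd."""
    if d % 2 == 0:
        return False              # a perfect code must have odd minimum distance
    v = (d - 1) // 2
    return M * sphere_volume(n, r, v) == r ** n

print(satisfies_sphere_packing(7, 16, 3))        # True: the binary Hamming code is perfect
print(satisfies_sphere_packing(7, 20, 3))        # False
print(satisfies_sphere_packing(90, 2 ** 78, 5))  # True, yet no perfect code with these parameters exists
```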


Error Detection and Error Correction

Let u be a positive integer. If u errors occur in the transmission of a codeword, we will say that an error of size u has occurred. It is possible that so many errors occurred as to change the codeword into another codeword, so that we cannot detect if any error has occurred or not.

Definition 4.9. A code C is u-error-detecting if, whenever an error of size at most u but at least one has occurred, the resulting string is not a codeword. A code C is exactly u-error-detecting if it is u-error-detecting but not (u + 1)-error-detecting.

The next theorem is essentially just a restatement of the definition of u-error-detecting in terms of minimum distance.

Theorem 4.10. A code C is u-error-detecting if and only if d(C) ≥ u + 1. In particular, C is exactly u-error-detecting if and only if d(C) = u + 1.

Definition 4.11. Let v be a positive integer. A code C is v-error-correcting if nearest neighbor decoding is able to correct v or fewer errors, assuming that if a tie occurs in the decoding process, a decoding error is reported. A code is exactly v-error-correcting if it is v-error-correcting but not (v + 1)-error-correcting.

It should be kept in mind that, as long as the received word is not a codeword, nearest neighbor decoding will decode it as some codeword, but the receiver has no way of knowing whether that codeword is the one that was actually sent. We know only that, under a v-error-correcting code, if no more than v errors were introduced, then nearest neighbor decoding will produce the codeword that was sent.

Theorem 4.12. A code is v-error-correcting if and only if d(C) ≥ 2v + 1. In particular, C is exactly v-error-correcting if and only if d(C) = 2v + 1 or d(C) = 2v + 2.

Corollary 4.13. A code C has d(C) = d if and only if it is exactly ⌊(d − 1)/2⌋-error-correcting.

The following result is a consequence of Proposition 4.6 and Theorem 4.12. It shows the connection between error correction and pr(C).

Corollary 4.14. Assuming that ties are always reported as error, a code C is exactly v-error-correcting if and only if pr(C) = v.

Example 4.15. The r-ary repetition code of length n is

Repr(n) = {00 · · · 0, 11 · · · 1, . . . , (r − 1)(r − 1) · · · (r − 1)},

consisting of r codewords each of length n. The r-ary repetition code of length n can detect up to n − 1 errors in transmission, and so it is exactly (n − 1)-error-detecting. Furthermore, it is exactly ⌊(n − 1)/2⌋-error-correcting.
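As a concrete illustration of nearest neighbor decoding on this code (a sketch of mine, not from the text), decoding a repetition code amounts to a plurality vote among the received symbols:

```python
def repetition_encode(symbol, n):
    """Encode one code symbol as n copies of itself (the repetition code Rep_r(n))."""
    return symbol * n

def repetition_decode(received):
    """Nearest neighbor decoding for Rep_r(n): the symbol occurring most often gives the
    nearest codeword (ties are broken arbitrarily here)."""
    return max(set(received), key=received.count)

print(repetition_encode("1", 5))    # '11111'
print(repetition_decode("10110"))   # '1', so up to floor((5-1)/2) = 2 errors are corrected
```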

Suppose that a code C has minimum distance d. If we use C for error detection only, it can detect up to d − 1 errors. On the other hand, if we want C to also correct errors whenever possible, then it can correct up to ⌊(d − 1)/2⌋ errors, but may no longer be able to detect a situation where more than ⌊(d − 1)/2⌋ but fewer than d errors have occurred. For if more than ⌊(d − 1)/2⌋ errors are made, nearest neighbor decoding might “correct” the received word to the wrong codeword and thus the errors will go undetected.


We consider the following strategy: Let v be a positive integer. If a string x is received, if the closest codeword c to x is at a distance of at most v, and if there is only one such codeword, then decode x as c. If there is more than one codeword at minimum distance to x, or if the closest codeword has distance greater than v, then simply declare an error.

Definition 4.16. A code C is simultaneously v-error-correcting and u-error-detecting if, whenever at least one but at most v errors are made, the strategy described above will correct these errors, and whenever at least v + 1 but at most v + u errors are made, the strategy above simply reports an error.

Theorem 4.17. A code C is simultaneously v-error-correcting and u-error-detecting if and only if d(C) ≥ 2v + u + 1.

It is intuitively clear that, given any code C, we cannot continually add new codewords to it at no cost to its minimum distance. This leads us to make the following definition.

Definition 4.18. An (n, M, d)-code is said to be maximal if it is not contained in any larger code with the same minimum distance, that is, if it is not contained in any (n, M +1, d)- code.

Thus an (n, M, d)-code C is maximal if and only if, for all strings x ∈ A^n, there is a codeword c ∈ C with the property that d(x, c) < d.

Proposition 4.19. For the binary symmetric channel with crossover probability p using minimum distance decoding, the probability of a decoding error for a maximal (n, M, d)-code satisfies

∑_{k=d}^{n} \binom{n}{k} p^k (1 − p)^{n−k} ≤ P(decode error) ≤ 1 − ∑_{k=0}^{⌊(d−1)/2⌋} \binom{n}{k} p^k (1 − p)^{n−k}.

Furthermore, for a non-maximal code, the upper bound still holds, but the lower bound may not.

Making New Codes from Old Codes

There are several useful techniques that can be used to obtain new codes from old codes.

In the following, we always suppose that our codes are over the alphabet A = Zr = Z/rZ.

Extending a Code. The process of adding one or more additional positions to all the codewords in a code, thereby increasing the length of the code, is referred to as extending the code. The most common way to extend a code is by adding an overall parity check, which is done as follows. If C is an r-ary (n, M, d)-code over Zr, we define the extended code C̄ by

C̄ = {c1c2 · · · cncn+1 | c1c2 · · · cn ∈ C and ∑_{k=1}^{n+1} ck ≡ 0 (mod r)}.

If C is an (n, M, d)-code, then C̄ is an (n̄, M̄, d̄)-code, where n̄ = n + 1, M̄ = M and d̄ = d or d + 1.

We remark that for a binary (n, M, d)-code C, the minimum distance of C̄ depends on the parity of d. In particular, since all of the codewords in C̄ have even sum, the minimum distance of C̄ is even. It follows that if d is even then d(C̄) = d, and if d is odd then d(C̄) = d + 1. Moreover, since ⌊(d(C̄) − 1)/2⌋ = ⌊(d(C) − 1)/2⌋, the error-correcting capabilities of the code do not increase.
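A quick sketch of the overall parity check (my illustration; the code below assumes codewords over Zr written with single decimal digits):

```python
def extend(code, r=2):
    """Append to each codeword the symbol that makes the symbol sum 0 mod r."""
    return [c + str((-sum(int(x) for x in c)) % r) for c in code]

print(extend(["000", "111"]))   # ['0000', '1111']: d goes from 3 to 4, as the remark predicts
```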


Puncturing a Code. The opposite process to extending a code is puncturing a code, in which one or more positions are removed from the codewords. If C is an r-ary (n, M, d)-code and if d ≥ 2, then the code obtained by puncturing C once has length n − 1, size M and minimum distance d or d − 1.

For binary codes, the processes of extending and puncturing can be used to prove the following useful result.

Lemma 4.20. A binary (n, M, 2v + 1)-code exists if and only if a binary (n + 1, M, 2v + 2)-code exists.

Shortening a Code. Shortening a code refers to the process of keeping only those codewords in a code that have a given symbol in a given position, and then deleting that position. If C is an (n, M, d)-code then a shortened code has length n − 1 and minimum distance at least d. In fact, shortening a code can result in a substantial increase in the minimum distance, but shortening a code does result in a code with smaller size.

The shortened code formed by taking codewords with an s in the i-th position is referred to as the cross-section xi = s. We will have many occasions to use cross-sections in the sequel.

Augmenting a Code. Augmenting a code simply means adding additional strings to the code. A common way to augment a binary code C is to include the complements of each codeword in C, where the complement of a binary codeword c is the string obtained from c by interchanging all 0's and 1's.

Let us denote the complement of c by c^c and denote the set of all complements of the codewords in C by C^c. It is easy to check that if x, y ∈ Z_2^n, then d(x, y^c) = n − d(x, y).

Proposition 4.21. Let C be a binary (n, M, d)-code. Suppose that d′ is the maximum distance between codewords in C. Then d(C ∪ C^c) = min{d, n − d′}.

The Direct Sum Construction. If C1 is an r-ary (n1, M1, d1)-code and C2 is an r-ary (n2, M2, d2)-code, the direct sum C1 ⊙ C2 is the code

C1 ⊙ C2 = {cd | c ∈ C1, d ∈ C2}.

Clearly, C1 ⊙ C2 has parameters n = n1 + n2, M = M1M2 and d = min{d1, d2}.

The u(u + v) Construction. A much more useful construction than the direct sum is the following. If C1 is an r-ary (n, M1, d1)-code and C2 is an r-ary (n, M2, d2)-code, then we define a code C1 ⊕ C2 by

C1 ⊕ C2 = {u(u + v) | u ∈ C1, v ∈ C2}.

Certainly, the length of C1 ⊕ C2 is 2n and the size is M1M2. As for the minimum distance, consider two distinct codewords x = u1(u1 + v1) and y = u2(u2 + v2). If v1 = v2, then d(x, y) ≥ 2d1. On the other hand, if v1 ≠ v2, then d(x, y) ≥ d2. Since equality can hold in both cases, we get the following result.

Lemma 4.22. Let C1 be an r-ary (n, M1, d1)-code and C2 be an r-ary (n, M2, d2)-code. Then C1 ⊕ C2 is a (2n, M1M2, d′)-code, where d′ = min{2d1, d2}.
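A small sketch of the u(u + v) construction (my own illustration over Z2; the example codes below are chosen only for convenience):

```python
def u_u_plus_v(C1, C2, r=2):
    """The u(u+v) construction: { u(u+v) : u in C1, v in C2 }, addition mod r per position."""
    def add(u, v):
        return "".join(str((int(a) + int(b)) % r) for a, b in zip(u, v))
    return [u + add(u, v) for u in C1 for v in C2]

C1 = ["00", "01", "10", "11"]    # the (2, 4, 1)-code of all binary strings of length 2
C2 = ["00", "11"]                # the (2, 2, 2) repetition code
print(u_u_plus_v(C1, C2))        # the 8 even-weight strings of length 4: a (4, 8, 2)-code,
                                 # matching d' = min{2*1, 2} = 2 from Lemma 4.22
```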
