Channel Capacity of Finite-Length Block Codes


Error Probability Analysis of

Binary Asymmetric Channels

Final Report of NSC Project

“Finite Blocklength Capacity”

Date: 31 January 2012

Project-Number: NSC 97–2221–E–009–003–MY3
Project Duration: 1 August 2008 – 31 October 2011
Funded by: National Science Council, Taiwan

Author: Stefan M. Moser
Co-Authors: Po-Ning Chen, Hsuan-Yin Lin
Organization: Information Theory Laboratory
Department of Electrical and Computer Engineering
National Chiao Tung University
Address: Engineering Building IV, Office 727
1001 Daxue Rd.
Hsinchu 30010, Taiwan
E-mail: [email protected]


Abstract

In his world-famous paper of 1948, Shannon defined channel capacity as the ultimate rate at which information can be transmitted over a communication channel with an error probability that vanishes as the blocklength grows to infinity. While this result is of tremendous theoretical importance, the reality of practical systems looks quite different: no communication system will tolerate an infinite delay caused by an extremely large blocklength, nor can it deal with the computational complexity of decoding such huge codewords. On the other hand, it is not necessary to have an error probability that is exactly zero either; a small but nonzero value will suffice.

Therefore, the question arises what can be done in a practical scheme. In particular, what is the maximal rate at which information can be transmitted over a communication channel for a given fixed maximum blocklength (i.e., a fixed maximum delay) if we allow a certain maximal probability of error? In this project, we have started to study these questions.

Block-codes with very short blocklength over the most general binary channel, the binary asymmetric channel (BAC), are investigated. It is shown that for only two possible messages, flip-flop codes are optimal; however, depending on the blocklength and the channel parameters, the optimal code is not necessarily the linear flip-flop code. Furthermore, it is shown that the optimal decoding rule is a threshold rule. Some fundamental dependencies of the best code on the channel are given.

Block-codes with a very small number of codewords are investigated for the two special binary memoryless channels, the binary symmetric channel (BSC) and the Z-channel (ZC). The optimal (in the sense of minimum average error probability, using maximum likelihood decoding) code structure is derived for the cases of two, three, and four codewords and an arbitrary blocklength. It is shown that for two possible messages, on a BSC, the so-called flip codes of type t are optimal for any t, while on a ZC, the flip code of type 0 is optimal. For codes with three or four messages it is shown that the so-called weak flip codes of some given type are optimal where the type depends on the blocklength. For all cases an algorithm is presented that constructs an optimal code for blocklength n recursively from an optimal code of length n − 1. In the situation of two and four messages, the optimal code is shown to be linear. For the ZC a recursive optimal code design is conjectured in the case of five possible messages.

The derivation of these optimal codes relies heavily on a new approach of constructing and analyzing the code-matrix not row-wise (codewords), but column-wise. Moreover, these results also prove that the minimum Hamming distance might be the wrong design criterion for optimal codes even for very symmetric channels like the BSC.

Keywords: Channel capacity, binary asymmetric channel (BAC), error probability, finite blocklengths, ML, optimal codes, Z-channel.


1 Introduction

The analytical study of optimal communication over a channel is very difficult even if we restrict ourselves to discrete memoryless channels (DMCs). Most known results are derived using the mathematical trick of considering some limits, in particular, usually it is assumed that the blocklength tends to infinity. The insights that have been achieved in this way are considerable, but there still remains the open question how far these asymptotic results can be applied to the practical scenario where the blocklength is strongly restricted.

Shannon proved in his ground-breaking work [1] that it is possible to find an information transmission scheme that can transmit messages at arbitrarily small error probability as long as the transmission rate in bits per channel use is below the so-called capacity of the channel. However, he did not provide a way to find such schemes; in particular, he did not tell us much about the design of codes apart from the fact that good codes need to have large blocklength.

For many practical applications exactly this latter constraint is rather unfortunate, as often we cannot tolerate too much delay (e.g., inter-human communication, time-critical control and communication, etc.). Moreover, the system complexity usually grows exponentially in the blocklength. So we see that having large blocklength might not be an option and we have to restrict the blocklength to some reasonable size. The question now arises what can theoretically be said about the performance of communication systems with such restricted block size.

During the last years there has been an increased interest in the theoretical understanding of finite-length coding, see for example [2], [3]. There are several possible ways to approach the problem of finite-length codes. In [3] the authors fix an acceptable error probability and a finite blocklength and then try to find bounds on the possible transmission rates. In another approach, one fixes the transmission rate and studies how the error probability depends on the blocklength (i.e., one basically studies error exponents, but for relatively small n) [2]. Both approaches are related to Shannon’s ideas in the sense that they try to make fundamental statements about what is possible and what is not. The exact manner in which these systems have to be built is ignored on purpose.

Our approach in this work is different: based on the insight that for very short blocklength one has no big hope of transmitting much information with acceptable error probability, we concentrate on codes with only a very small fixed number of codewords: so-called ultra-small block-codes. For such codes we try to find a best possible design that minimizes the average error probability. Hence, we put a big emphasis on gaining insight into how to actually design an optimal system.

For these reasons we have started to investigate the fundamental behavior of communication in the extreme case of an ultra-short blocklength. We would like to ask the following questions: What performance can we expect from codes of fixed, very short blocklength? What can we say about good design for such codes?

There are interesting applications for such codes. For example, in the situation of establishing an initial connection in a wireless link, the amount of information that needs to be transmitted during the setup of the link is very much limited to usually only a couple of bits. However, these bits need to be transmitted in very short time (e.g., blocklength in the range of n = 20 to n = 30) with the highest possible reliability [4]. Note that while the motivation of this work focuses on rather smaller values of n, our results nevertheless hold for arbitrary finite n.

The study of ultra-small block-codes is interesting not only because of the above mentioned direct applications, but because their analytic description is a first step to a better fundamental understanding of optimal nonlinear coding schemes (with ML decoding) and of their performance based on the true error probability rather than an upper bound on the error probability derived from the union bound. To simplify our analysis, we have restricted ourselves for the moment to binary discrete memoryless channels.

For simplification of the exposition, in this paper we will exclusively focus on two special cases: the binary symmetric channel (BSC) and the Z-channel (ZC). For results on general binary channels we refer to [5]. Note that while particularly for the BSC much is known about linear code design [6], there is basically no literature about optimal, possibly nonlinear codes.


The remainder of this report is structured as follows: after some comments about our notation, we will introduce the channel models in Section 3. In Section 5 we will give some code definitions that will be used for the main results that are summarized in Section 6. Some of the proofs are omitted for space reasons; we refer to [5] for more details. Finally, Section 7 contains a discussion about the optimal code structure for the BSC.

As is common in coding theory, vectors (denoted by bold face Roman letters, e.g., x) are row-vectors. However, for simplicity of notation and to avoid a huge number of transpose-signs we slightly misuse this notational convention for one special case: any vector c is a column-vector. This should always be clear from the context because these vectors are used to build codebook matrices and are therefore also conceptually quite different from the transmitted codewords x or the received sequence y. Otherwise our notation follows the mainstream: we use capital letters for random quantities and small letters for realizations.

2 Definitions

2.1 Discrete Memoryless Channel

The probably most fundamental model describing communication over a noisy channel is the so-called discrete memoryless channel (DMC). A DMC consists of

• a finite input alphabet $\mathcal{X}$;
• a finite output alphabet $\mathcal{Y}$; and
• a conditional probability distribution $P_{Y|X}(\cdot|x)$ for all $x \in \mathcal{X}$ such that

$$P_{Y_k|X_1,X_2,\ldots,X_k,Y_1,Y_2,\ldots,Y_{k-1}}(y_k|x_1,x_2,\ldots,x_k,y_1,y_2,\ldots,y_{k-1}) = P_{Y|X}(y_k|x_k) \quad \forall\, k. \tag{1}$$

Note that a DMC is called memoryless because the current output $Y_k$ depends only on the current input $x_k$. Moreover, note that the channel is time-invariant in the sense that for a particular input $x_k$, the distribution of the output $Y_k$ does not change over time.

Definition 1. We say a DMC is used without feedback if

$$P(x_k|x_1,\ldots,x_{k-1},y_1,\ldots,y_{k-1}) = P(x_k|x_1,\ldots,x_{k-1}) \quad \forall\, k, \tag{2}$$

i.e., $X_k$ depends only on past inputs (by choice of the encoder), but not on past outputs. Hence, there is no feedback link from the receiver back to the transmitter that would inform the transmitter about the last outputs.

Note that even though we assume the channel to be memoryless, we do not restrict the encoder to be memoryless! We now have the following theorem.

Theorem 2. If a DMC is used without feedback, then

$$P(y_1,\ldots,y_n|x_1,\ldots,x_n) = \prod_{k=1}^{n} P_{Y|X}(y_k|x_k) \quad \forall\, n \ge 1. \tag{3}$$
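The product form of Theorem 2 is easy to evaluate numerically. The following Python sketch is an illustration only and not part of the original report: the function name `sequence_likelihood` and the example transition matrix `W` are hypothetical choices.

```python
from math import prod

def sequence_likelihood(x, y, W):
    """P(y|x) for a DMC used without feedback: product of per-symbol terms.

    W[a][b] is the single-letter transition probability P_{Y|X}(b|a).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return prod(W[a][b] for a, b in zip(x, y))

# Hypothetical binary channel (an assumption for this example only):
# 0 -> 1 with probability 0.1, 1 -> 0 with probability 0.3.
W = [[0.9, 0.1],
     [0.3, 0.7]]

# Sent (0, 1, 1), received (0, 0, 1): 0.9 * 0.3 * 0.7
p = sequence_likelihood((0, 1, 1), (0, 0, 1), W)
```

Summing `sequence_likelihood` over all received sequences of a fixed length gives 1, as it must for a conditional distribution.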


2.2 Coding for DMC

Definition 3. An (M, n) coding scheme for a DMC $(\mathcal{X}, \mathcal{Y}, P_{Y|X})$ consists of

• the message set $\mathcal{M} = \{1, \ldots, M\}$ of M equally likely random messages $M$;
• the (M, n) codebook (or simply code) consisting of M length-n channel input sequences, called codewords;
• an encoding function $f\colon \mathcal{M} \to \mathcal{X}^n$ that assigns to every message $m \in \mathcal{M}$ a codeword $\mathbf{x} = (x_1, \ldots, x_n)$; and
• a decoding function $g\colon \mathcal{Y}^n \to \hat{\mathcal{M}}$ that maps the received channel output n-sequence $\mathbf{y}$ to a guess $\hat{m} \in \hat{\mathcal{M}}$. (Usually, we have $\hat{\mathcal{M}} = \mathcal{M}$.)

Note that an (M, n) code consists merely of an unsorted list of M codewords of length n, whereas an (M, n) coding scheme additionally also defines the encoding and decoding functions. Hence, the same code can be part of many different coding schemes.

Definition 4. A code is called linear if the sum of any two codewords again is a codeword.

Note that a linear code always contains the all-zero codeword.

The two main parameters of interest of a code are the number of possible messages M (the larger, the more information is transmitted) and the blocklength n (the shorter, the less time is needed to transmit the message):

• we have M equally likely messages, i.e., the entropy is $H(M) = \log_2 M$ bits and we need $\log_2 M$ bits to describe the message in binary form;
• we need n transmissions of a channel input symbol $X_k$ over the channel in order to transmit the complete message.

Hence, it makes sense to give the following definition.

Definition 5. The rate¹ of an (M, n) code is defined as

$$R \triangleq \frac{\log_2 M}{n} \quad \text{bits/transmission}. \tag{4}$$

It describes what amount of information (i.e., what part of the $\log_2 M$ bits) is transmitted in each channel use.

However, this definition of a rate only makes sense if the message really arrives at the receiver, i.e., if the receiver does not make a decoding error!

Definition 6. An (M, n) coding scheme for a BAC consists of a codebook $\mathcal{C}^{(M,n)}$ with M binary codewords $\mathbf{x}_m$ of length n, an encoder that maps every message m into its corresponding codeword $\mathbf{x}_m$, and a decoder that makes a decoding decision $g(\mathbf{y}) \in \{1, \ldots, M\}$ for every received binary n-vector $\mathbf{y}$.

We will always assume that the M possible messages are equally likely.

¹ We define the rate here using a logarithm of base 2. However, we can use any logarithm as long …

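Definition 7. Given that message m has been sent, let $\lambda_m^{(n)}$ be the probability of a decoding error of an (M, n) coding scheme with blocklength n:

$$\lambda_m^{(n)} \triangleq \Pr[g(\mathbf{Y}) \neq m \mid \mathbf{X} = \mathbf{x}_m] \tag{5}$$

$$= \sum_{\mathbf{y}} P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}_m)\, I\{g(\mathbf{y}) \neq m\}, \tag{6}$$

where $I\{\cdot\}$ is the indicator function

$$I\{\text{statement}\} \triangleq \begin{cases} 1 & \text{if statement is true,} \\ 0 & \text{if statement is false.} \end{cases} \tag{7}$$

The maximum error probability $\lambda^{(n)}$ of an (M, n) coding scheme is defined as

$$\lambda^{(n)} \triangleq \max_{m \in \mathcal{M}} \lambda_m^{(n)}. \tag{8}$$

The average error probability $P_e^{(n)}$ of an (M, n) coding scheme is defined as

$$P_e^{(n)} \triangleq \frac{1}{M} \sum_{m=1}^{M} \lambda_m^{(n)}. \tag{9}$$

Moreover, sometimes it will be more convenient to focus on the probability of not making any error, called the success probability $\psi_m^{(n)}$:

$$\psi_m^{(n)} \triangleq \Pr[g(\mathbf{Y}) = m \mid \mathbf{X} = \mathbf{x}_m] \tag{10}$$

$$= \sum_{\mathbf{y}} P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}_m)\, I\{g(\mathbf{y}) = m\}. \tag{11}$$

The maximum success probability $\psi^{(n)}$ and the average success probability $P_c^{(n)}$ are defined accordingly.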

Definition 8. For a given codebook $\mathcal{C}$, we define the decoding region $\mathcal{D}_m$ corresponding to the m-th codeword $\mathbf{x}_m$ as follows:

$$\mathcal{D}_m \triangleq \{\mathbf{y} : g(\mathbf{y}) = m\}. \tag{12}$$

Note that we will always assume that the decoder g is a maximum likelihood (ML) decoder:

$$g(\mathbf{y}) \triangleq \arg\max_{1 \le m \le M} P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}_m), \tag{13}$$

which minimizes the average error probability $P_e^{(n)}$.

Note that we write the codebook $\mathcal{C}^{(M,n)}$ as an M × n matrix with the M rows corresponding to the M codewords:

$$\mathcal{C}^{(M,n)} = \begin{pmatrix} \mathbf{x}_1 \\ \vdots \\ \mathbf{x}_M \end{pmatrix}. \tag{14}$$

Since we are only considering memoryless channels, any permutation of the columns of $\mathcal{C}^{(M,n)}$ will lead to another codebook that is completely equivalent to the first in the sense that it has the exact same error probability. Similarly, since we assume equally likely messages, any permutation of rows only changes the assignment of codewords to messages and has no impact on the performance. Therefore, in the remainder of this paper, we will always consider such equivalent codes as being the same. In particular, when we speak of unique design we do not exclude the always possible permutations of columns and rows.

The most famous relation between code rate and error probability has been derived by Shannon in his landmark paper from 1948 [1].

Theorem 9 (The Channel Coding Theorem for a DMC). Define

$$C \triangleq \max_{P_X(\cdot)} I(X;Y) \tag{15}$$

where X and Y have to be understood as input and output of a DMC and where the maximization is over all input distributions $P_X(\cdot)$.

Then for every R < C there exists a sequence of $(2^{nR}, n)$ coding schemes with maximum error probability $\lambda^{(n)} \to 0$ as the blocklength n gets very large.

Conversely, any sequence of $(2^{nR}, n)$ coding schemes with maximum error probability $\lambda^{(n)} \to 0$ must have a rate R ≤ C.

So we see that C denotes the maximum rate at which reliable communication is possible. Therefore C is called channel capacity.

Note that this theorem considers only the situation of n tending to infinity and thereby the error probability going to zero. However, in a practical system, we cannot allow the blocklength n to be too large because of delay and complexity. On the other hand it is not necessary to have zero error probability either.

So the question arises what we can say about “capacity” for finite n, i.e., if we allow a certain maximal probability of error, what is the smallest necessary blocklength n to achieve it? Or, vice versa, fixing a certain short blocklength n, what is the best average error probability that can be achieved? And, what is the optimal code structure for a given channel?

3 Channel Models

In the following we will concentrate on the special cases of binary DMCs, i.e., we restrict our channel alphabets to be binary.

The most general binary discrete memoryless channel is the so-called binary asymmetric channel (BAC). It has a probability $\epsilon_0$ that an input 0 will be flipped into a 1 and a (possibly different) probability $\epsilon_1$ for a flip from 1 to 0. See Figure 1.

Figure 1: The binary asymmetric channel (BAC).


For symmetry reasons and without loss of generality we can restrict the values of these parameters as follows:

$$0 \le \epsilon_0 \le \epsilon_1 \le 1, \tag{16}$$

$$\epsilon_0 \le 1 - \epsilon_0, \tag{17}$$

$$\epsilon_0 \le 1 - \epsilon_1. \tag{18}$$

Note that in the case where $\epsilon_0 > \epsilon_1$ we can simply flip all zeros to ones and vice versa to get an equivalent channel with $\epsilon_0 \le \epsilon_1$. For the case where $\epsilon_0 > 1 - \epsilon_0$, we can flip the output Y, i.e., change all output zeros to ones and ones to zeros, to get an equivalent channel with $\epsilon_0 \le 1 - \epsilon_0$. Note that (17) can be simplified to $\epsilon_0 \le \frac{1}{2}$. And for the case where $\epsilon_0 > 1 - \epsilon_1$, we can flip the input X to get an equivalent channel that satisfies $\epsilon_0 \le 1 - \epsilon_1$.

We have depicted the region of possible choices of the parameters $\epsilon_0$ and $\epsilon_1$ in Figure 2. The region of interesting choices given by (16)–(18) is denoted by Ω.

Figure 2: Region of possible choices of the channel parameters $\epsilon_0$ and $\epsilon_1$ of a BAC. The shaded area corresponds to the interesting area according to (16), (17) and (18). Its boundaries are $\epsilon_0 + \epsilon_1 = 1$ (completely noisy), $\epsilon_0 = \epsilon_1$ (BSC), and $\epsilon_0 = 0$ (Z-channel).

Note that the BAC includes all well-known binary channel models: if $\epsilon_0 = \epsilon_1$, we have a BSC; and if $\epsilon_0 = 0$, we have a Z-channel. In the case when $\epsilon_0 = 1 - \epsilon_1$ we end up with a completely noisy channel of zero capacity: the distribution of Y is the same for both inputs, so observing Y = y tells us nothing about X, i.e., X ⊥⊥ Y.

In this report we will put special emphasis on the former two special cases of the BAC. The binary symmetric channel (BSC) has equal cross-over probability $\epsilon_0 = \epsilon_1 = \epsilon$, see Figure 3. For symmetry reasons and without loss of generality, we assume that $\epsilon < \frac{1}{2}$.

The Z-channel (ZC) will never distort an input 0, i.e., $\epsilon_0 = 0$. But the input 1 is flipped to 0 with probability $\epsilon_1$, see Figure 4.

Figure 3: The binary symmetric channel (BSC).

Figure 4: The Z-channel (ZC).

4 Preliminaries

4.1 Capacity of the BAC

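As mentioned above, without loss of generality, we only consider BACs with $0 \le \epsilon_0 \le \epsilon_1 \le 1$. The capacity of a BAC is given by

$$C_{\text{BAC}} = \frac{\epsilon_0}{1-\epsilon_0-\epsilon_1}\, H_b(\epsilon_1) - \frac{1-\epsilon_1}{1-\epsilon_0-\epsilon_1}\, H_b(\epsilon_0) + \log_2\Bigl(1 + 2^{\frac{H_b(\epsilon_0)-H_b(\epsilon_1)}{1-\epsilon_0-\epsilon_1}}\Bigr) \tag{19}$$

bits, where $H_b(\cdot)$ is the binary entropy function defined as

$$H_b(p) \triangleq -p \log_2 p - (1-p) \log_2 (1-p).$$

The input distribution $P_X^*(\cdot)$ that achieves this capacity is given by

$$P_X^*(0) = 1 - P_X^*(1) = \frac{1 - \epsilon_1 (1+z)}{(1-\epsilon_0-\epsilon_1)(1+z)} \tag{20}$$

with

$$z \triangleq 2^{\frac{H_b(\epsilon_0)-H_b(\epsilon_1)}{1-\epsilon_0-\epsilon_1}}. \tag{21}$$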
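The closed forms (19) to (21) can be cross-checked numerically against the definition of mutual information. The following Python sketch is an illustration only and not part of the original report; all function names are hypothetical, and the code assumes $\epsilon_0 + \epsilon_1 \neq 1$ so that the denominators are nonzero.

```python
from math import log2

def hb(p):
    """Binary entropy function in bits, with Hb(0) = Hb(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def _z(e0, e1):
    # Auxiliary quantity z from (21); assumes e0 + e1 != 1.
    return 2 ** ((hb(e0) - hb(e1)) / (1 - e0 - e1))

def bac_capacity(e0, e1):
    """Closed-form BAC capacity (19) in bits."""
    s = 1 - e0 - e1
    return e0 / s * hb(e1) - (1 - e1) / s * hb(e0) + log2(1 + _z(e0, e1))

def bac_optimal_p0(e0, e1):
    """Capacity-achieving probability of input 0 from (20)."""
    z = _z(e0, e1)
    return (1 - e1 * (1 + z)) / ((1 - e0 - e1) * (1 + z))

def mutual_information(p0, e0, e1):
    """I(X;Y) = H(Y) - H(Y|X) for the input distribution (p0, 1 - p0)."""
    q1 = p0 * e0 + (1 - p0) * (1 - e1)   # Pr[Y = 1]
    return hb(q1) - p0 * hb(e0) - (1 - p0) * hb(e1)
```

For a BSC the closed form reduces to $1 - H_b(\epsilon)$, and for a Z-channel with $\epsilon_1 = \frac{1}{2}$ the function `bac_optimal_p0` returns 0.6, i.e., Pr[X = 1] = 2/5.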

4.2 Error Probability of the BAC

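To simplify our notation we introduce $d_{\alpha\beta}(\mathbf{x}_m, \mathbf{y})$ to be the number of positions j where $x_m^{(j)} = \alpha$ and $y^{(j)} = \beta$, where as usual $\mathbf{x}_m$, $m \in \{1, 2, \ldots, M\}$, is the sent codeword and $\mathbf{y}$ is the received sequence.

The conditional probability of the received vector given the sent codeword can now be written as

$$P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}_m) = (1-\epsilon_0)^{d_{00}(\mathbf{x}_m,\mathbf{y})} \cdot \epsilon_0^{d_{01}(\mathbf{x}_m,\mathbf{y})} \cdot \epsilon_1^{d_{10}(\mathbf{x}_m,\mathbf{y})} \cdot (1-\epsilon_1)^{d_{11}(\mathbf{x}_m,\mathbf{y})}. \tag{22}$$

Note that we can express these different $d_{\alpha\beta}$'s as follows:

$$d_{11}(\mathbf{x}_m, \mathbf{y}) = \tfrac{1}{2}\, w_H(\mathbf{x}_m + \mathbf{y} - |\mathbf{x}_m - \mathbf{y}|), \tag{23}$$

$$d_{10}(\mathbf{x}_m, \mathbf{y}) = w_H(I\{\mathbf{x}_m - \mathbf{y} > 0\}), \tag{24}$$

$$d_{01}(\mathbf{x}_m, \mathbf{y}) = w_H(I\{\mathbf{y} - \mathbf{x}_m > 0\}), \tag{25}$$

$$d_{00}(\mathbf{x}_m, \mathbf{y}) = n - d_{11}(\mathbf{x}_m, \mathbf{y}) - d_{10}(\mathbf{x}_m, \mathbf{y}) - d_{01}(\mathbf{x}_m, \mathbf{y}), \tag{26}$$

where $w_H(\mathbf{x})$ is the Hamming weight of $\mathbf{x}$.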
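The four counts and the resulting BAC likelihood can be computed directly. The Python sketch below is illustrative and not part of the original report; the function names are hypothetical. One assumption is made explicit in the code: in (23) the weight of the vector $\mathbf{x}_m + \mathbf{y} - |\mathbf{x}_m - \mathbf{y}|$, whose entries are 0 or 2, is evaluated as the sum of its components, so that the factor 1/2 yields the number of positions where both entries equal 1.

```python
def transition_counts(x, y):
    """Return (d00, d01, d10, d11), where d_ab counts positions with x = a, y = b."""
    n = len(x)
    # (23): x + y - |x - y| equals 2 exactly where both entries are 1.
    # Assumption: the weight is taken as the sum of the components.
    d11 = sum(a + b - abs(a - b) for a, b in zip(x, y)) // 2
    d10 = sum(1 for a, b in zip(x, y) if a - b > 0)   # (24)
    d01 = sum(1 for a, b in zip(x, y) if b - a > 0)   # (25)
    d00 = n - d11 - d10 - d01                         # (26)
    return d00, d01, d10, d11

def bac_likelihood(x, y, e0, e1):
    """P(y|x) on a BAC as a product of powers indexed by the four counts."""
    d00, d01, d10, d11 = transition_counts(x, y)
    return (1 - e0) ** d00 * e0 ** d01 * e1 ** d10 * (1 - e1) ** d11
```

For x = (0, 0, 1, 1, 1) and y = (0, 1, 0, 1, 1) the counts are (d00, d01, d10, d11) = (1, 1, 1, 2).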

The average error probability of a code C over a BAC (assuming equally likely messages) can be expressed as

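$$P_e^{(n)}(\mathcal{C}) = \frac{(1-\epsilon_0)^n}{M} \sum_{\mathbf{y}} \sum_{\substack{m=1 \\ m \neq g(\mathbf{y})}}^{M} \Bigl(\frac{\epsilon_0}{1-\epsilon_0}\Bigr)^{d_{01}(\mathbf{x}_m,\mathbf{y})} \Bigl(\frac{\epsilon_1}{1-\epsilon_0}\Bigr)^{d_{10}(\mathbf{x}_m,\mathbf{y})} \Bigl(\frac{1-\epsilon_1}{1-\epsilon_0}\Bigr)^{d_{11}(\mathbf{x}_m,\mathbf{y})} \tag{27}$$

$$= \frac{1}{M} \sum_{m=1}^{M} \sum_{\substack{\mathbf{y} \\ g(\mathbf{y}) \neq m}} P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}_m), \tag{28}$$

where $g(\mathbf{y})$ is the ML decision (13) for the observation $\mathbf{y}$.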

Note that a closer investigation shows that some of these optimal codes are linear, but some are not.

4.3 Error Probability of the BSC

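Consider the situation of a BSC and assume that we transmit the m-th codeword $\mathbf{x}_m$, 1 ≤ m ≤ M, and that we receive $\mathbf{y}$. The maximum likelihood (ML) decision is then

$$g(\mathbf{y}) \triangleq \arg\max_{1 \le i \le M} P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}_i). \tag{29}$$

The average probability of error can be computed as

$$P_e^{(n)} = \frac{(1-\epsilon)^n}{M} \sum_{\mathbf{y}} \sum_{\substack{m=1 \\ m \neq g(\mathbf{y})}}^{M} \Bigl(\frac{\epsilon}{1-\epsilon}\Bigr)^{d_H(\mathbf{x}_m,\mathbf{y})} \tag{30}$$

where $d_H(\cdot, \cdot)$ is the Hamming distance.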

Note that if we want to find the best average error probability, we need to check through all possible codes (including both linear and nonlinear codes). The complexity of such a search grows exponentially fast in n: for M = 4,

• for n = 3 there are $\binom{8}{4} = 70$ different codes;
• for n = 4 there are $\binom{16}{4} = 1820$ different codes;
• for n = 5 there are $\binom{32}{4} = 35960$ different codes, etc.
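These counts are binomial coefficients: a code is an unordered set of M distinct words out of the $2^n$ binary words of length n. A short Python sketch (illustrative only; the function name `num_codes` is hypothetical) reproduces them:

```python
from math import comb

def num_codes(M, n):
    """Number of unordered sets of M distinct binary codewords of length n."""
    return comb(2 ** n, M)

for n in (3, 4, 5):
    print(n, num_codes(4, n))   # prints 70, 1820, 35960 for n = 3, 4, 5
```

This count treats codes that differ only by a column permutation as different, so it is an upper bound on the number of non-equivalent codes.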

It turns out that for a given BSC, blocklength n, and number of messages M, there is a vast number of different codes (linear and nonlinear) that are all optimal. This is not really surprising because the BSC is strongly symmetric.


4.4 Error (and Success) Probability of the Z-Channel

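A special case of the BAC is the Z-channel where we have $\epsilon_0 = 0$. By symmetry, assume that $\epsilon_1 \le \frac{1}{2}$. Note that it is often easier to maximize the success probability instead of minimizing the error probability. For the convenience of later derivations, we now are going to derive its error and success probabilities:

$$P_c^{(n)}(\mathcal{C}) = \frac{1}{M} \sum_{m=1}^{M} \sum_{\substack{\mathbf{y} \\ g(\mathbf{y}) = m}} (1-\epsilon_1)^{w_H(\mathbf{x}_m)} \Bigl(\frac{\epsilon_1}{1-\epsilon_1}\Bigr)^{d_{10}(\mathbf{x}_m,\mathbf{y})} \cdot I\bigl\{x_m^{(j)} = 0 \implies y^{(j)} = 0,\ \forall\, j\bigr\}. \tag{31}$$

The error probability formula is accordingly

$$P_e^{(n)}(\mathcal{C}) = \frac{1}{M} \sum_{m=1}^{M} \sum_{\substack{\mathbf{y} \\ g(\mathbf{y}) \neq m}} (1-\epsilon_1)^{w_H(\mathbf{x}_m)} \Bigl(\frac{\epsilon_1}{1-\epsilon_1}\Bigr)^{d_{10}(\mathbf{x}_m,\mathbf{y})} \cdot I\bigl\{x_m^{(j)} = 0 \implies y^{(j)} = 0,\ \forall\, j\bigr\}. \tag{32}$$

Note that the capacity-achieving distribution for $\epsilon_1 = \frac{1}{2}$ is

$$\Pr[X = 1] = \frac{2}{5}. \tag{33}$$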

4.5 Pairwise Hamming Distance

The minimum Hamming distance is a well-known and often used quality criterion of a codebook. However, for the description of an optimal code design for a fixed blocklength n, it turns out to be too crude. We define a slightly more general and more concise description of a codebook: the pairwise Hamming distance vector.

Definition 10. Given a codebook $\mathcal{C}^{(M,n)}$ with codewords $\mathbf{x}_m$ we define the pairwise Hamming distance vector $\mathbf{d}^{(M,n)}$ of length $\frac{(M-1)M}{2}$ as

$$\mathbf{d}^{(M,n)} \triangleq \bigl(d_{12}^{(n)}, d_{13}^{(n)}, d_{23}^{(n)}, d_{14}^{(n)}, d_{24}^{(n)}, d_{34}^{(n)}, \ldots, d_{1M}^{(n)}, d_{2M}^{(n)}, \ldots, d_{(M-1)M}^{(n)}\bigr) \tag{34}$$

with $d_{ij}^{(n)} \triangleq d_H(\mathbf{x}_i, \mathbf{x}_j)$, 1 ≤ i < j ≤ M. The minimum Hamming distance $d_{\min}^{(M,n)}$ is defined as the minimum component of the vector $\mathbf{d}^{(M,n)}$.

Note that we have seen in Section 4.3 that the error probability of a binary code that is used over a BSC can be described using the Hamming distance between the received vectors $\mathbf{y}$ and the different codewords $\mathbf{x}_m$. We introduce the following notation for this type of Hamming distance.

Definition 11. For a given codebook $\mathcal{C}^{(M,n)}$ and for some received n-vector $\mathbf{y}$, the received Hamming distance vector $\mathbf{d}^{(n)}(\mathbf{y})$ is defined as

$$\mathbf{d}^{(n)}(\mathbf{y}) = \bigl(d_1^{(n)}(\mathbf{y}), \ldots, d_M^{(n)}(\mathbf{y})\bigr) \triangleq \bigl(d_H(\mathbf{y}, \mathbf{x}_1), \ldots, d_H(\mathbf{y}, \mathbf{x}_M)\bigr), \tag{35}$$

where $d_m^{(n)}(\mathbf{y}) \triangleq d_H(\mathbf{y}, \mathbf{x}_m)$ denotes its m-th component and is called received Hamming distance.

Note that an ML decoder will always decode a received vector $\mathbf{y}$ to that message that results in a minimum value of the received Hamming distance:

$$d_{\min}^{(n)}(\mathbf{y}) \triangleq \min_{m=1,\ldots,M} d_m^{(n)}(\mathbf{y}). \tag{36}$$

5 Flip Codes and Weak Flip Codes

Next, we will introduce some special codebooks that will be used later on.

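Definition 12. The flip code of type t for $t \in \{0, 1, \ldots, \lfloor n/2 \rfloor\}$ is a code with M = 2 codewords defined by the following codebook matrix $\mathcal{C}_t^{(2,n)}$:

$$\mathcal{C}_t^{(2,n)} \triangleq \begin{pmatrix} \mathbf{x} \\ \bar{\mathbf{x}} \end{pmatrix} = \begin{pmatrix} 0 \cdots 0 & \overbrace{1 \cdots 1}^{t \text{ columns}} \\ 1 \cdots 1 & 0 \cdots 0 \end{pmatrix}. \tag{37}$$

Defining the column vectors

$$\mathbf{c}_1^{(2)} \triangleq \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \quad \mathbf{c}_2^{(2)} \triangleq \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \tag{38}$$

we see that a flip code of type t is given by a codebook matrix that consists of (n − t) columns $\mathbf{c}_1^{(2)}$ and t columns $\mathbf{c}_2^{(2)}$.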

We again remind the reader that due to the memorylessness of the BSC and the ZC, the order of the columns of any codebook matrix is irrelevant. Moreover, we would like to point out that while the flip code of type 0 corresponds to a repetition code, the general flip code of type t with t > 0 is neither a repetition code nor is it even linear.

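Definition 13. A weak flip code of type $(t_2, t_3)$ for M = 3 or M = 4 codewords is defined by a codebook matrix $\mathcal{C}_{t_2,t_3}^{(M,n)}$ that consists of $t_1 \triangleq (n - t_2 - t_3)$ columns $\mathbf{c}_1^{(M)}$, $t_2$ columns $\mathbf{c}_2^{(M)}$, and $t_3$ columns $\mathbf{c}_3^{(M)}$, where

$$\mathbf{c}_1^{(3)} \triangleq \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}, \quad \mathbf{c}_2^{(3)} \triangleq \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \quad \mathbf{c}_3^{(3)} \triangleq \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix} \tag{39}$$

or

$$\mathbf{c}_1^{(4)} \triangleq \begin{pmatrix} 0 \\ 0 \\ 1 \\ 1 \end{pmatrix}, \quad \mathbf{c}_2^{(4)} \triangleq \begin{pmatrix} 0 \\ 1 \\ 0 \\ 1 \end{pmatrix}, \quad \mathbf{c}_3^{(4)} \triangleq \begin{pmatrix} 0 \\ 1 \\ 1 \\ 0 \end{pmatrix}, \tag{40}$$

respectively.³ We often describe a weak flip code of type $(t_2, t_3)$ by the code parameters

$$[t_1, t_2, t_3] \tag{41}$$

where $t_1$ can be computed from blocklength n and the type $(t_2, t_3)$ as $t_1 = n - t_2 - t_3$.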

Lemma 14. The pairwise Hamming distance vector of a weak flip code can be computed as follows:

$$\mathbf{d}^{(3,n)} = (t_2 + t_3,\ t_1 + t_3,\ t_1 + t_2), \tag{42}$$

$$\mathbf{d}^{(4,n)} = (t_2 + t_3,\ t_1 + t_3,\ t_1 + t_2,\ t_1 + t_2,\ t_1 + t_3,\ t_2 + t_3). \tag{43}$$
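Lemma 14 can be verified mechanically by building the codebook column by column and computing all pairwise distances. The following Python sketch is an illustration and not part of the original report; the names `COLUMNS`, `weak_flip_code`, and `pairwise_distance_vector` are hypothetical.

```python
# Candidate columns of the weak flip codes (each tuple is read top to bottom).
COLUMNS = {
    3: [(0, 0, 1), (0, 1, 0), (0, 1, 1)],
    4: [(0, 0, 1, 1), (0, 1, 0, 1), (0, 1, 1, 0)],
}

def weak_flip_code(M, n, t2, t3):
    """Codewords (rows) of the weak flip code of type (t2, t3)."""
    t1 = n - t2 - t3
    if M not in COLUMNS or min(t1, t2, t3) < 0:
        raise ValueError("need M in {3, 4} and 0 <= t2 + t3 <= n")
    c1, c2, c3 = COLUMNS[M]
    columns = [c1] * t1 + [c2] * t2 + [c3] * t3
    # Row m of the codebook matrix collects entry m of every column.
    return [tuple(col[m] for col in columns) for m in range(M)]

def pairwise_distance_vector(code):
    """(d12, d13, d23, d14, d24, d34, ...): the pair (i, j) is listed by increasing j."""
    def dist(a, b):
        return sum(u != v for u, v in zip(a, b))
    M = len(code)
    return tuple(dist(code[i], code[j]) for j in range(1, M) for i in range(j))
```

For example, with n = 7 and type (2, 1), so that [t1, t2, t3] = [4, 2, 1], the function returns (3, 5, 6) for M = 3 and (3, 5, 6, 6, 5, 3) for M = 4, in agreement with (42) and (43).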

³ The name weak flip code is motivated by the fact that the weak flip code is a generalization of the flip code: while for M = 3 it is not possible to have all codewords be flipped versions of other codewords and for M = 4 such a definition would be too restrictive, it is still true that the distribution of zeros and ones in the candidate columns $\mathbf{c}_1$, $\mathbf{c}_2$, and $\mathbf{c}_3$ is very balanced.

6 Main Results

6.1 An Example

To show that the search for an optimal (possibly nonlinear) code is neither trivial nor intuitive even in the symmetric BSC case, we would like to start with a small example before we summarize our main results.

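Example 15. Assume a BSC with crossover probability $\epsilon = 0.4$, M = 4, and a blocklength n = 4. Then consider the following two weak flip codes:

$$\mathcal{C}_{1,0}^{(4,4)} \triangleq \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \end{pmatrix}, \quad \mathcal{C}_{2,0}^{(4,4)} \triangleq \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 \end{pmatrix}. \tag{44}$$

We observe that while both codes are linear, the first code has a minimum Hamming distance of 1, and the second has a minimum Hamming distance of 2. Assuming an ML decoder, the average error probability can be expressed using the Hamming distance between the received sequence and the codewords:

$$P_e^{(n)}(\mathcal{C}) = \frac{1}{M} \sum_{m=1}^{4} \sum_{\substack{\mathbf{y} \\ g(\mathbf{y}) \neq m}} P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}_m) \tag{45}$$

$$= \frac{(1-\epsilon)^4}{4} \sum_{m=1}^{4} \sum_{\substack{\mathbf{y} \\ g(\mathbf{y}) \neq m}} \Bigl(\frac{\epsilon}{1-\epsilon}\Bigr)^{d_H(\mathbf{x}_m,\mathbf{y})}, \tag{46}$$

where $d_H(\mathbf{x}_m, \mathbf{y})$ is the Hamming distance between a codeword $\mathbf{x}_m$ and a received vector $\mathbf{y}$.

If evaluated, we get an error probability $P_e^{(n)} = 0.6112$ for $\mathcal{C}_{1,0}^{(4,4)}$ and 0.64 for $\mathcal{C}_{2,0}^{(4,4)}$. Hence, even though the minimum Hamming distance of the first codebook is smaller, its overall performance is superior to the second codebook! ♦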
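The two error probabilities of Example 15 can be reproduced by brute force. The Python sketch below is an illustration and not the authors' code; the function name `bsc_ml_error` is hypothetical. It uses the identity $P_e = 1 - \frac{1}{M} \sum_{\mathbf{y}} \max_m P(\mathbf{y}|\mathbf{x}_m)$, which holds for any ML decoder with equally likely messages, so ties need no special handling.

```python
from itertools import product

def bsc_ml_error(code, eps):
    """Exact average error probability of `code` on a BSC under ML decoding."""
    M, n = len(code), len(code[0])
    success = 0.0
    for y in product((0, 1), repeat=n):
        best = 0.0
        for x in code:
            d = sum(a != b for a, b in zip(x, y))          # Hamming distance
            best = max(best, eps ** d * (1 - eps) ** (n - d))
        success += best                                     # ML picks the largest term
    return 1.0 - success / M

code_10 = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 0), (1, 1, 1, 1)]
code_20 = [(0, 0, 0, 0), (0, 0, 1, 1), (1, 1, 0, 0), (1, 1, 1, 1)]

print(round(bsc_ml_error(code_10, 0.4), 4))   # 0.6112
print(round(bsc_ml_error(code_20, 0.4), 4))   # 0.64
```

The first code splits into a length-3 repetition code and one uncoded bit, which gives a success probability of 0.648 · 0.6 = 0.3888; the second splits into two length-2 repetition codes with success probability 0.6 · 0.6 = 0.36.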

Our goal is to find the structure of an optimal code $\mathcal{C}^{(M,n)*}$ that satisfies

$$P_e^{(n)}\bigl(\mathcal{C}^{(M,n)*}\bigr) \le P_e^{(n)}\bigl(\mathcal{C}^{(M,n)}\bigr) \tag{47}$$

for any code $\mathcal{C}^{(M,n)}$.

6.2 Optimal Codes on BAC for M = 2

Due to the memorylessness of the BAC, the order of the columns of any code is irrelevant. We therefore can restrict ourselves without loss of generality to flip-flop codes of type t to describe all possible flip-flop codes. Also note that the only possible linear flip-flop code is $\mathcal{C}_0^{(2,n)}$. All other flip-flop codes are nonlinear.

We are now ready for the following result.

Proposition 16. Consider the case M = 2, and fix the blocklength n. Then, irrespective of the channel parameters $\epsilon_0$ and $\epsilon_1$, on a BAC there always exists a t, $0 \le t \le \lfloor n/2 \rfloor$, such that the flip-flop code of type t, $\mathcal{C}_t^{(2,n)}$, is an optimal code in the sense that it minimizes the error probability.

This result is intuitively very pleasing because it seems to be a rather bad choice to have two codewords with the same symbol in a particular position. However, the proposition does not exclude the possibility that such a code might exist that also is optimal.
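For small blocklengths, Proposition 16 can be checked by exhaustive search. The Python sketch below is an illustration under our own naming (`bac_ml_error`, `flip_flop_code`, `best_flip_flop_vs_exhaustive` are hypothetical); it compares the best flip-flop code with the best of all pairs of distinct binary words of length n.

```python
from itertools import combinations, product

def bac_ml_error(code, e0, e1):
    """Exact average ML error probability of `code` on a BAC, equally likely messages."""
    n = len(code[0])
    W = ((1 - e0, e0), (e1, 1 - e1))           # W[x][y] = P(y|x)
    success = 0.0
    for y in product((0, 1), repeat=n):
        best = 0.0
        for x in code:
            p = 1.0
            for a, b in zip(x, y):
                p *= W[a][b]
            best = max(best, p)
        success += best
    return 1.0 - success / len(code)

def flip_flop_code(n, t):
    """Type-t code for M = 2: n - t zeros followed by t ones, plus its complement."""
    x = (0,) * (n - t) + (1,) * t
    return [x, tuple(1 - b for b in x)]

def best_flip_flop_vs_exhaustive(n, e0, e1):
    """Return (best type t, its error, minimum error over all two-codeword codes)."""
    errs = {t: bac_ml_error(flip_flop_code(n, t), e0, e1) for t in range(n // 2 + 1)}
    t_best = min(errs, key=errs.get)
    words = list(product((0, 1), repeat=n))
    exhaustive = min(bac_ml_error(list(c), e0, e1) for c in combinations(words, 2))
    return t_best, errs[t_best], exhaustive
```

For every channel tried, the second and third return values agree up to rounding, as the proposition predicts. The search is only feasible for very small n because the number of candidate codes grows as $\binom{2^n}{2}$.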

We would like to point out that the exact choice of t is not obvious and depends strongly on n, $\epsilon_0$, and $\epsilon_1$. As an example, the optimal choices of t are shown in Figure 5 for n = 5. We see that depending on the channel parameters, the optimal value of t changes. Note that on the boundaries the optimal choice is not unique: for a completely noisy BAC ($\epsilon_1 = 1 - \epsilon_0$), the choice of the codebook is irrelevant since the probability of error is $\frac{1}{2}$ in any case. For a BSC, t = 0, t = 1, or t = 2 are equivalent. And for a Z-channel we can prove that a linear code is always optimal.

Figure 5: Optimal codebooks on a BAC: the optimal choice of the parameter t for different values of $\epsilon_0$ and $\epsilon_1$ for a fixed blocklength n = 5. (Plot for M = 2, n = 5, with $\epsilon_0$ ranging over [0, 0.5], $\epsilon_1$ ranging over [0, 1], and regions labeled t = 0, t = 1, and t = 2.)

6.3 Optimal Decision Rule on BAC for M = 2

In any system with only two possible messages the optimal ML receiver can easily be described by the log likelihood ratio (LLR):

$$\text{LLR}(\mathbf{y}) \triangleq \log \frac{P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}_1)}{P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}_2)}. \tag{48}$$

If LLR(y) > 0, then the receiver decides for 1, while if LLR(y) < 0, it decides for 2. In the situation of LLR(y) = 0, both decisions are equally good.

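In the situation of a flip-flop code of type t, $\mathcal{C}_t^{(2,n)}$, the LLR is given as

$$\text{LLR}_t^{(n)}(\epsilon_0, \epsilon_1, d) \triangleq (t - d) \log\frac{1-\epsilon_1}{\epsilon_0} + (n - t - d) \log\frac{1-\epsilon_0}{\epsilon_1}, \tag{49}$$

where d is defined to be the Hamming distance between the received vector and the first codeword:

$$d \triangleq d_H(\mathbf{x}_1, \mathbf{y}). \tag{50}$$

Note that 0 ≤ d ≤ n depends on the received vector, while t and n are code parameters, and $\epsilon_0$ and $\epsilon_1$ are channel parameters.

Hence, the optimal decision rule can be expressed in terms of d.

Proposition 17. We list some properties of $\text{LLR}_t^{(n)}(\epsilon_0, \epsilon_1, d)$:

1. If $\epsilon_0 + \epsilon_1 = 1$, then $\text{LLR}_t^{(n)}(\epsilon_0, \epsilon_1, d) = 0$ for all d.

2. $\text{LLR}_t^{(n)}(\epsilon_0, \epsilon_1, d)$ is a decreasing function in d:

$$\text{LLR}_t^{(n)}(\epsilon_0, \epsilon_1, d) \ge \text{LLR}_t^{(n)}(\epsilon_0, \epsilon_1, d+1), \quad \forall\, 0 \le d \le n-1. \tag{51}$$

3. For d ≤ t and for $d > \lfloor n/2 \rfloor$ the $\text{LLR}_t^{(n)}$ is never negative and never positive, respectively:

$$\text{LLR}_t^{(n)}(\epsilon_0, \epsilon_1, d) \begin{cases} \ge 0 & \text{for } 0 \le d \le t, \\ \gtrless 0 & \text{for } t < d \le \lfloor n/2 \rfloor, \text{ depending on } \epsilon_0, \epsilon_1, \\ \le 0 & \text{for } \lfloor n/2 \rfloor < d \le n. \end{cases} \tag{52}$$

4. $\text{LLR}_t^{(n)}(\epsilon_0, \epsilon_1, d)$ is an increasing function in n, when we fix d, $\epsilon_0$, and $\epsilon_1$.

5. $\text{LLR}_t^{(n)}(\epsilon_0, \epsilon_1, d)$ is an increasing function in t when we fix n, d, $\epsilon_0$, and $\epsilon_1$.

6. For 0 ≤ d ≤ n − 1,

$$\text{LLR}_{t}^{(n+1)}(\epsilon_0, \epsilon_1, d+1) < \text{LLR}_t^{(n)}(\epsilon_0, \epsilon_1, d). \tag{53}$$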

Proof. Omitted.
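Since the proof is omitted, a numerical sanity check may be helpful. The Python sketch below is illustrative and not part of the original report; the function names are hypothetical, and the code assumes $0 < \epsilon_0, \epsilon_1 < 1$ so that all logarithms are finite. It compares the closed-form LLR (49) with the log-likelihood ratio computed symbol by symbol for every received vector.

```python
from itertools import product
from math import log

def llr(n, t, e0, e1, d):
    """Closed-form LLR (49) of the type-t flip-flop code (natural logarithm)."""
    return (t - d) * log((1 - e1) / e0) + (n - t - d) * log((1 - e0) / e1)

def llr_direct(n, t, e0, e1, y):
    """log P(y|x1) - log P(y|x2), computed symbol by symbol."""
    W = ((1 - e0, e0), (e1, 1 - e1))            # W[x][y] = P(y|x)
    x1 = (0,) * (n - t) + (1,) * t
    x2 = tuple(1 - b for b in x1)
    return sum(log(W[a][b]) - log(W[c][b]) for a, c, b in zip(x1, x2, y))

def max_formula_gap(n, t, e0, e1):
    """Largest |closed form - direct value| over all received vectors y."""
    x1 = (0,) * (n - t) + (1,) * t
    gap = 0.0
    for y in product((0, 1), repeat=n):
        d = sum(a != b for a, b in zip(x1, y))  # d = d_H(x1, y)
        gap = max(gap, abs(llr(n, t, e0, e1, d) - llr_direct(n, t, e0, e1, y)))
    return gap
```

The gap is zero up to rounding, which confirms that the LLR depends on the received vector only through d. The monotonicity properties can be checked the same way by evaluating `llr` on a grid of d, t, and n.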

From these properties we immediately obtain an interesting result about the optimal decision rule.

Proposition 18 (Optimal Decision Rule has a Threshold). For a fixed flip-flop code $\mathcal{C}_t^{(2,n)}$ and a fixed BAC $(\epsilon_0, \epsilon_1) \in \Omega$, there exists a threshold ℓ, $t \le \ell \le \lfloor \frac{n-1}{2} \rfloor$, such that the optimal ML decision rule can be stated as

$$g(\mathbf{y}) = \begin{cases} 1 & \text{if } 0 \le d \le \ell, \\ 2 & \text{if } \ell + 1 \le d \le n. \end{cases} \tag{54}$$

The threshold ℓ depends on $(\epsilon_0, \epsilon_1)$, but similar channels will usually have the same threshold. We define the region of channel parameters with identical threshold as follows:

$$\Omega_{\ell,t}^{(n)} \triangleq \bigl\{(\epsilon_0, \epsilon_1) : \text{LLR}_t^{(n)}(\epsilon_0, \epsilon_1, \ell) \ge 0\bigr\} \cap \bigl\{(\epsilon_0, \epsilon_1) : \text{LLR}_t^{(n)}(\epsilon_0, \epsilon_1, \ell+1) \le 0\bigr\}. \tag{55}$$

In Figure 6 an example of this threshold behavior is shown. For $\epsilon_0 \in [0.136, 0.270]$ we see that $\text{LLR}_1^{(7)}(\epsilon_0, 1-2\epsilon_0, d) \ge 0$ for d = 0, d = 1, and d = 2, while $\text{LLR}_1^{(7)}(\epsilon_0, 1-2\epsilon_0, d) \le 0$ for d ≥ 3, i.e., the threshold is ℓ = 2.

Figure 6: The log likelihood function depicted as a function of $(\epsilon_0, \epsilon_1)$ for different values of d (d = 0, …, 7). To simplify the plot, only $\epsilon_0$ is depicted with $\epsilon_1$ being a fixed function of $\epsilon_0$, here $\epsilon_1 = 1 - 2\epsilon_0$. The solid blue lines depict the case n = 7, the dashed red lines n = 6. The code is fixed to be t = 1. We see that for n = 7 and $\epsilon_0 \in [0.136, 0.270]$ the threshold is ℓ = 2 (region $\Omega_{\ell=2,t=1}^{(7)}$).
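The threshold of Proposition 18 can be computed directly from the LLR. The Python sketch below is an illustration only; the function names are hypothetical, it assumes $0 < \epsilon_0, \epsilon_1 < 1$, and it resolves ties (LLR exactly zero) in favor of message 1, which does not change the error probability.

```python
from math import log

def llr(n, t, e0, e1, d):
    """LLR of the type-t flip-flop code as a function of d = d_H(x1, y)."""
    return (t - d) * log((1 - e1) / e0) + (n - t - d) * log((1 - e0) / e1)

def ml_threshold(n, t, e0, e1):
    """Largest d with LLR >= 0; the ML decoder decides for message 1 iff d <= threshold."""
    return max(d for d in range(n + 1) if llr(n, t, e0, e1, d) >= 0)

# n = 7, t = 1 and e1 = 1 - 2*e0, the setting of the threshold example above
for e0 in (0.10, 0.14, 0.20, 0.26, 0.30):
    print(e0, ml_threshold(7, 1, e0, 1 - 2 * e0))
```

For this setting the computed threshold is 1 at $\epsilon_0 = 0.10$, 2 at $\epsilon_0 = 0.14$, 0.20, and 0.26, and 3 at $\epsilon_0 = 0.30$, consistent with a threshold of 2 on the interval [0.136, 0.270].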

6.4 Optimal Codes on ZC

Theorem 19. For a ZC and for any n ≥ 1, an optimal codebook with two codewords (M = 2) is the flip codebook of type 0, $\mathcal{C}_0^{(2,n)}$. It has an error probability

$$P_e^{(n)}\bigl(\mathcal{C}_0^{(2,n)}\bigr) = \frac{1}{2}\, \epsilon_1^n. \tag{56}$$

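Proof. Let the $m$-th codeword be $\mathbf{x}_m = (x_{m,1}, x_{m,2}, \ldots, x_{m,n})$, $m = 1, 2$, and let the received vector be $\mathbf{y} = (y_1, y_2, \ldots, y_n)$. Then the average error probability of a ZC can be expressed as

\begin{align*}
P_e^{(n)}(\mathcal{C}) &= \frac{1}{2} \sum_{\mathbf{y}} \sum_{\substack{m=1 \\ m \ne g(\mathbf{y})}}^{2} (1-\epsilon_1)^{w_H(\mathbf{x}_m)} \left(\frac{\epsilon_1}{1-\epsilon_1}\right)^{d_{10}(\mathbf{x}_m,\mathbf{y})} \cdot I\{x_{m,j} = 0 \implies y_j = 0,\ \forall j\} \tag{57} \\
&= \frac{1}{2} \sum_{\mathbf{y}} \min\bigl\{P_{Y|X}(\mathbf{y}|\mathbf{x}_1),\, P_{Y|X}(\mathbf{y}|\mathbf{x}_2)\bigr\}, \tag{58}
\end{align*}

where $w_H(\mathbf{x}_m)$ is the Hamming weight of the codeword $\mathbf{x}_m$ and $d_{10}(\mathbf{x}_m, \mathbf{y})$ denotes the number of positions $j$ where $x_{m,j} = 1$ and $y_j = 0$.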

Now note that Proposition 16 shows that an optimal code should have two codewords that are flipped to each other. The intuition behind this is that an optimal decoder will simply ignore all those bit positions where both codewords are identical, leading to the same performance that can be achieved for a code of shorter length. We therefore now assume that $\mathbf{x}_2 = \bar{\mathbf{x}}$ is the flipped version of $\mathbf{x}_1 = \mathbf{x}$, with $\mathbf{x}$ defined in (37).

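For such a flip code we now observe that, due to the peculiarity of the ZC that it will never flip a zero to a one, we can only make an error if the received vector is the all-zero vector $\mathbf{y} = \mathbf{0}$:

\[ \min\bigl\{P_{Y|X}(\mathbf{y}|\mathbf{x}),\, P_{Y|X}(\mathbf{y}|\bar{\mathbf{x}})\bigr\} = \begin{cases} 0 & \text{if } \mathbf{y} \ne \mathbf{0}, \\ \epsilon_1^{\max\{w_H(\mathbf{x}),\, w_H(\bar{\mathbf{x}})\}} & \text{if } \mathbf{y} = \mathbf{0}. \end{cases} \tag{59} \]

This error probability is minimized if one of the codewords is the all-one codeword, i.e., we see that $\mathcal{C}^{(2,n)}_0$ is optimal.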

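Lemma 20. For a ZC and for any $n \ge 2$, the average success probabilities of the weak flip code of type $(t, 0)$, $1 \le t \le \lfloor \frac{n}{2} \rfloor$, with three codewords $M = 3$ or four codewords $M = 4$ are

\begin{align*}
3P_c^{(n)}\bigl(\mathcal{C}^{(3,n)}_{t,0}\bigr) &= 1 + \sum_{d=0}^{t-1} \binom{t}{d} (1-\epsilon_1)^{t-d} \epsilon_1^{d} + \sum_{d=0}^{(n-t)-1} \binom{n-t}{d} (1-\epsilon_1)^{(n-t)-d} \epsilon_1^{d}; \tag{60} \\
4P_c^{(n)}\bigl(\mathcal{C}^{(4,n)}_{t,0}\bigr) &= 1 + \sum_{d=0}^{t-1} \binom{t}{d} (1-\epsilon_1)^{t-d} \epsilon_1^{d} + \sum_{d=0}^{n-t-1} \binom{n-t}{d} (1-\epsilon_1)^{(n-t)-d} \epsilon_1^{d} \\
&\quad + \sum_{d=0}^{n-1} \left[ \binom{n}{d} - \binom{n-t}{d-t} - \binom{t}{d-(n-t)} \right] (1-\epsilon_1)^{n-d} \epsilon_1^{d}. \tag{61}
\end{align*}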


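Proof. Note that the average success probability of a code with $M$ messages used over a ZC can be written as follows:

\[ P_c^{(n)}(\mathcal{C}) = \frac{1}{M} \sum_{m=1}^{M} \sum_{\substack{\mathbf{y} \\ g(\mathbf{y}) = m}} (1-\epsilon_1)^{w_H(\mathbf{x}_m)} \left(\frac{\epsilon_1}{1-\epsilon_1}\right)^{d_{10}(\mathbf{x}_m,\mathbf{y})} \cdot I\{x_{m,j} = 0 \implies y_j = 0,\ \forall j\}. \tag{62} \]

We will use this together with the peculiar behavior of the ZC, which ensures that $P_{Y|X}(1|0) = 0$, to derive (60) and (61).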

We start with $M = 4$. Consider the weak flip code of type $(t, 0)$, $\mathcal{C}^{(4,n)}_{t,0}$, where $1 \le t \le \lfloor \frac{n}{2} \rfloor$. Denote the success probability of the $m$-th codeword by $\psi^{(4,n)}_{t,m}$ and the corresponding decoding region by $\mathcal{D}^{(4,n)}_{t,m}$. Also, in the following we will use a superscript to emphasize the length of a vector, e.g., $\mathbf{y}^{(t)}$ is a vector of length $t$.

Recall that the first codeword of $\mathcal{C}^{(4,n)}_{t,0}$ is the all-zero codeword, the second codeword has Hamming weight $t$, and the remaining two are flipped versions of the first two. Since $P_{Y|X}(0|0) = 1$, the first codeword will always be received as $\mathbf{0}^{(n)}$, i.e., $\mathcal{D}^{(4,n)}_{t,1}$ only consists of the all-zero vector.

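Next note that for any $\mathbf{y}$ with $\mathbf{y}^{(n)} = [\mathbf{0}^{(n-t)}\, \mathbf{y}^{(t)}]$ and where $w_H(\mathbf{y}^{(t)}) \ge 1$ we have

\begin{align*}
&\max\bigl\{P_{Y|X}(\mathbf{y}|\mathbf{x}_1),\, P_{Y|X}(\mathbf{y}|\mathbf{x}_2),\, P_{Y|X}(\mathbf{y}|\mathbf{x}_3),\, P_{Y|X}(\mathbf{y}|\mathbf{x}_4)\bigr\} \\
&\quad = \max\bigl\{0,\, P_{Y|X}(\mathbf{y}|\mathbf{x}_2),\, 0,\, P_{Y|X}(\mathbf{y}|\mathbf{x}_4)\bigr\} \tag{63} \\
&\quad = P_{Y|X}(\mathbf{y}|\mathbf{x}_2) \tag{64} \\
&\quad = (1-\epsilon_1)^{t-d} \cdot \epsilon_1^{d}, \tag{65}
\end{align*}

where $d \le t - 1$ denotes the Hamming distance between $\mathbf{y}$ and $\mathbf{x}_2$. The last step follows because we assume that $0 < \epsilon_1 \le \frac{1}{2}$ and because $w_H(\mathbf{x}_4) = n > t = w_H(\mathbf{x}_2)$.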

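The same argument can be applied to the decoding region $\mathcal{D}^{(4,n)}_{t,3}$, and $\mathcal{D}^{(4,n)}_{t,4}$ must be $\{0,1\}^n \setminus \bigcup_{i=1}^{3} \mathcal{D}^{(4,n)}_{t,i}$. Hence we get the following list:

\begin{align*}
\mathcal{D}^{(4,n)}_{t,1} &= \bigl\{\mathbf{0}^{(n)}\bigr\}, \tag{66} \\
\mathcal{D}^{(4,n)}_{t,2} &= \bigl\{\mathbf{y}^{(n)} : \mathbf{y}^{(n)} = [\mathbf{0}^{(n-t)}\, \mathbf{y}^{(t)}] \text{ with } 1 \le w_H(\mathbf{y}^{(t)}) \le t \bigr\}, \tag{67} \\
\mathcal{D}^{(4,n)}_{t,3} &= \bigl\{\mathbf{y}^{(n)} : \mathbf{y}^{(n)} = [\mathbf{y}^{(n-t)}\, \mathbf{0}^{(t)}] \text{ with } 1 \le w_H(\mathbf{y}^{(n-t)}) \le n-t \bigr\}, \tag{68} \\
\mathcal{D}^{(4,n)}_{t,4} &= \{0,1\}^n \setminus \bigcup_{m=1}^{3} \mathcal{D}^{(4,n)}_{t,m}. \tag{69}
\end{align*}

Using this in (62) will then lead to (61). Similarly, we also can show that the success probability of $\mathcal{C}^{(3,n)}_{t,0}$ is as given in (60).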
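The closed forms (60) and (61) can be cross-checked against a brute-force ML decoder. The Python sketch below assumes that the vector $\mathbf{x}$ of (37) has the form $[0^{(n-t)}\,1^{(t)}]$, which matches the decoding regions (66)–(69); the function names are illustrative and not part of the report.

```python
from itertools import product
from math import comb


def zc_likelihood(y, x, eps1):
    """P(y|x) on a Z-channel: a 0 is never flipped, a 1 turns into 0 w.p. eps1."""
    p = 1.0
    for xj, yj in zip(x, y):
        if xj == 0:
            if yj == 1:
                return 0.0  # P(1|0) = 0
        else:
            p *= eps1 if yj == 0 else 1.0 - eps1
    return p


def weak_flip_code(n, t, M=4):
    """Codewords 0, x, x_bar, 1 with x = [0^(n-t) 1^(t)] (assumed form of (37))."""
    x = (0,) * (n - t) + (1,) * t
    x_bar = tuple(1 - b for b in x)
    return [(0,) * n, x, x_bar, (1,) * n][:M]


def ml_success(code, eps1):
    """Average success probability of an ML decoder, enumerating all received y."""
    n = len(code[0])
    total = sum(max(zc_likelihood(y, x, eps1) for x in code)
                for y in product((0, 1), repeat=n))
    return total / len(code)


def binom(a, b):
    """Binomial coefficient that is zero for a negative lower index."""
    return comb(a, b) if b >= 0 else 0


def success_eq60(n, t, e):
    """Right-hand side of (60), divided by 3."""
    s = 1.0
    s += sum(comb(t, d) * (1 - e) ** (t - d) * e ** d for d in range(t))
    s += sum(comb(n - t, d) * (1 - e) ** (n - t - d) * e ** d for d in range(n - t))
    return s / 3


def success_eq61(n, t, e):
    """Right-hand side of (61), divided by 4."""
    s = 3 * success_eq60(n, t, e)
    s += sum((comb(n, d) - binom(n - t, d - t) - binom(t, d - (n - t)))
             * (1 - e) ** (n - d) * e ** d for d in range(n))
    return s / 4


if __name__ == "__main__":
    n, t, eps1 = 6, 3, 0.3
    print(ml_success(weak_flip_code(n, t, 3), eps1), success_eq60(n, t, eps1))
    print(ml_success(weak_flip_code(n, t, 4), eps1), success_eq61(n, t, eps1))
```

The enumeration grows as $2^n$, so this check is only practical for the ultra-small blocklengths considered here.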

Theorem 21. For a ZC and for any $n \ge 2$, an optimal codebook with three codewords $M = 3$ or four codewords $M = 4$ is the weak flip code of type $(t^*, 0)$ with $t^* \triangleq \lfloor \frac{n}{2} \rfloor$:

\[ \mathcal{C}^{(M,n)*}_{\mathrm{ZC}} = \mathcal{C}^{(M,n)}_{t^*,0}. \tag{70} \]


Proof. Our proof is based on induction on $n$. The optimal code for $M = 4$ and $n = 2$ is trivial since there are only four possible different codewords. The optimal code is

\[ \mathcal{C}^{(4,2)}_{1,0} = \bigl(\mathbf{c}^{(4)}_1,\, \mathbf{c}^{(4)}_2\bigr). \tag{71} \]

Next assume that for blocklength $n$, $\mathcal{C}^{(4,n)}_{\lfloor n/2 \rfloor,0}$ is optimal. Then, from Claim 22 below, this assumption still holds for blocklength $(n + 1)$. This will prove the theorem.

Claim 22. Let's append one new column to the weak flip code of type $(\lfloor \frac{n}{2} \rfloor, 0)$, $\mathcal{C}^{(4,n)}_{\lfloor n/2 \rfloor,0}$, to generate a new code of length $(n + 1)$. The optimal (in the sense of resulting in the biggest success probability) choice among all possible $2^M = 16$ columns is $\mathbf{c}^{(4)}_2$.

In fact, this claim holds not only for $t = \lfloor \frac{n}{2} \rfloor$ but for any $t \in \{1, 2, \ldots, \lfloor \frac{n}{2} \rfloor\}$. Note that there are actually only 14 possible columns that we could choose as the $(n + 1)$-th column because the all-zero and the all-one columns clearly are suboptimal, as in this case an optimal decoder will simply ignore the $(n + 1)$-th received digit.

To prove Claim 22 we append an additional bit to all four codewords as follows:

\[ \begin{pmatrix} [\mathbf{0} \;\; x_{1,n+1}] \\ [\mathbf{x} \;\; x_{2,n+1}] \\ [\bar{\mathbf{x}} \;\; x_{3,n+1}] \\ [\mathbf{1} \;\; x_{4,n+1}] \end{pmatrix} \tag{72} \]

where $x_{m,n+1} \in \{0, 1\}$ and $\mathbf{x}$ is given in (37) with $t \triangleq \lfloor \frac{n}{2} \rfloor$. Note that in the remainder of this proof $t$ can be read as shorthand for $\lfloor \frac{n}{2} \rfloor$.

We now extend the decoding regions given in (66)–(69) by one bit for $m = 1, 2, 3, 4$: $[\mathcal{D}^{(4,n)}_{t,m}\; 0] \cup [\mathcal{D}^{(4,n)}_{t,m}\; 1]$. Observe that these new decoding regions retain the same success probability $\psi^{(4,n+1)}_{t,m} = \psi^{(4,n)}_{t,m} \cdot 1$, because

\[ P_{Y|X}(0|x_{m,n+1}) + P_{Y|X}(1|x_{m,n+1}) = 1. \tag{73} \]

However, it is quite clear that these new regions are in general not the optimal decision regions anymore for the new code. So the question is how to fix them to make them optimal again (and thereby also finding out how to optimally choose xm,n+1).

Firstly note that if $x_{m,n+1} = 0$, adding a 0 to the received vector $\mathbf{y}^{(n)}$ will not change the decision $m$ because 0 is the success outcome anyway. Similarly, if $x_{m,n+1} = 1$, adding a 1 to the vector $\mathbf{y}^{(n)}$ will not change the decision $m$.

Secondly, we claim that even if $x_{m,n+1} = 1$, all received vectors $\mathbf{y}^{(n+1)} \in [\mathcal{D}^{(4,n)}_{t,m}\; 0]$ still will optimally be decoded to $m$. To see this, let's have a look at the four cases separately:

• $[\mathcal{D}^{(4,n)}_{t,1}\; 0]$: The decoding region $[\mathcal{D}^{(4,n)}_{t,1}\; 0]$ only contains one vector: the all-zero vector. We have

\[ P_{Y|X}\bigl(\mathbf{0}^{(n+1)} \,\big|\, \mathbf{x}^{(n+1)}_1 = [\mathbf{0}^{(n)}\, 1]\bigr) = \epsilon_1 \ge P_{Y|X}\bigl(\mathbf{0}^{(n+1)} \,\big|\, \mathbf{x}^{(n+1)}_j\bigr), \qquad \forall\, j = 2, 3, 4, \tag{74} \]

independent of the choices of $x_{j,n+1}$, $j = 2, 3, 4$. Hence, we decide for $m = 1$.


• $[\mathcal{D}^{(4,n)}_{t,2}\; 0]$: All vectors in $[\mathcal{D}^{(4,n)}_{t,2}\; 0]$ contain ones in positions that make it impossible to decode them as $m = 1$ or $m = 3$. On the other hand, $m = 4$ obviously is less likely than $m = 2$, i.e., we decide $m = 2$.

• $[\mathcal{D}^{(4,n)}_{t,3}\; 0]$: All vectors in $[\mathcal{D}^{(4,n)}_{t,3}\; 0]$ contain ones in positions that make it impossible to decode them as $m = 1$ or $m = 2$. On the other hand, $m = 4$ obviously is less likely than $m = 3$, i.e., we decide $m = 3$.

• $[\mathcal{D}^{(4,n)}_{t,4}\; 0]$: All vectors in $[\mathcal{D}^{(4,n)}_{t,4}\; 0]$ contain ones in positions that make it impossible to decode them as $m = 1$, $m = 2$, or $m = 3$. It only remains to decide $m = 4$.

So, it only remains to investigate the decisions made about the vectors in $[\mathcal{D}^{(4,n)}_{t,m}\; 1]$ if $x_{m,n+1} = 0$. Note that we do not need to bother about $[\mathcal{D}^{(4,n)}_{t,4}\; 1]$ as it is impossible to receive such a vector because for all $\mathbf{y} \in \mathcal{D}^{(4,n)}_{t,4}$,

\[ P_{Y|X}\bigl(\mathbf{y}^{(n)} \big| \mathbf{0}^{(n)}\bigr) = P_{Y|X}\bigl(\mathbf{y}^{(n)} \big| \mathbf{0}^{(n-t)}\mathbf{1}^{(t)}\bigr) = P_{Y|X}\bigl(\mathbf{y}^{(n)} \big| \mathbf{1}^{(n-t)}\mathbf{0}^{(t)}\bigr) = 0. \tag{75} \]

For $m = 1$, $2$, or $3$, if $x_{m,n+1} = 0$, the received vectors in $[\mathcal{D}^{(4,n)}_{t,m}\; 1]$ will change to another decoding region not equal to $m$ because $P_{Y|X}(1|0) = 0$.

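• $[\mathcal{D}^{(4,n)}_{t,1}\; 1]$: If we assign these vectors (actually, it's only one) to the new decoding region $\mathcal{D}^{(4,n+1)}_{t,2}$, the amount of newly added conditional success probability for $m = 2$ is given by

\begin{align*}
\Delta\psi_2 &\triangleq \psi^{(4,n+1)}_{t,2} - \psi^{(4,n)}_{t,2} \tag{76} \\
&= \sum_{\mathbf{y}^{(n)} \in \mathcal{D}^{(4,n)}_{t,1}} P_{Y|X}\bigl([\mathbf{y}^{(n)}\, 1] \,\big|\, [\mathbf{0}^{(n-t)}\, \mathbf{1}^{(t)}\, 1]\bigr) \cdot (x_{2,n+1} - x_{1,n+1})^{+} \tag{77} \\
&= \epsilon_1^{t} \cdot (1-\epsilon_1) \cdot (x_{2,n+1} - x_{1,n+1})^{+}, \tag{78}
\end{align*}

where

\[ (x)^{+} = \begin{cases} x & \text{if } x \ge 0, \\ 0 & \text{if } x < 0. \end{cases} \tag{79} \]

Note that $x_{2,n+1}$ must be 1 if it shall be possible for this event to occur!

Similarly, we compute

\begin{align*}
\Delta\psi_3 &= \epsilon_1^{n-t} \cdot (1-\epsilon_1) \cdot (x_{3,n+1} - x_{1,n+1})^{+}; \tag{80} \\
\Delta\psi_4 &= \epsilon_1^{n} \cdot (1-\epsilon_1) \cdot (x_{4,n+1} - x_{1,n+1})^{+}. \tag{81}
\end{align*}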

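From $\epsilon_1^{t} \ge \epsilon_1^{n-t} > \epsilon_1^{n}$ we see that $\Delta\psi_2$ gives the highest increase, followed by $\Delta\psi_3$ and then $\Delta\psi_4$. Hence, we should write them as follows:

\begin{align*}
\Delta\psi_2 &= \epsilon_1^{t} \cdot (1-\epsilon_1) \cdot (x_{2,n+1} - x_{1,n+1})^{+}, \tag{82} \\
\Delta\psi_3 &= \epsilon_1^{n-t} \cdot (1-\epsilon_1) \cdot (x_{3,n+1} - x_{2,n+1} - x_{1,n+1})^{+}, \tag{83} \\
\Delta\psi_4 &= \epsilon_1^{n} \cdot (1-\epsilon_1) \cdots
\end{align*}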


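• $[\mathcal{D}^{(4,n)}_{t,2}\; 1]$: In this case only $\mathcal{D}^{(4,n+1)}_{t,4}$ will yield a nonzero additional conditional success probability:

\begin{align*}
\Delta\psi_4 &= \sum_{\mathbf{y}^{(n)} \in \mathcal{D}^{(4,n)}_{t,2}} P_{Y|X}\bigl([\mathbf{y}^{(n)}\, 1] \,\big|\, [\mathbf{1}^{(n)}\, 1]\bigr) \cdot (x_{4,n+1} - x_{2,n+1})^{+} \tag{85} \\
&= \sum_{d=0}^{t-1} \binom{t}{d} (1-\epsilon_1)^{t-d} \cdot \epsilon_1^{n-t+d} \cdot (1-\epsilon_1) \cdot (x_{4,n+1} - x_{2,n+1})^{+} \tag{86} \\
&= \bigl(\epsilon_1^{n-t} - \epsilon_1^{n}\bigr) \cdot (1-\epsilon_1) \cdot (x_{4,n+1} - x_{2,n+1})^{+}. \tag{87}
\end{align*}

• $[\mathcal{D}^{(4,n)}_{t,3}\; 1]$: Again, only $\mathcal{D}^{(4,n+1)}_{t,4}$ will yield a nonzero additional conditional success probability:

\begin{align*}
\Delta\psi_4 &= \sum_{\mathbf{y}^{(n)} \in \mathcal{D}^{(4,n)}_{t,3}} P_{Y|X}\bigl([\mathbf{y}^{(n)}\, 1] \,\big|\, [\mathbf{1}^{(n)}\, 1]\bigr) \cdot (x_{4,n+1} - x_{3,n+1})^{+} \tag{88} \\
&= \bigl(\epsilon_1^{t} - \epsilon_1^{n}\bigr) \cdot (1-\epsilon_1) \cdot (x_{4,n+1} - x_{3,n+1})^{+}. \tag{89}
\end{align*}

From $\epsilon_1^{t} \ge \epsilon_1^{n-t} > \epsilon_1^{n}$, we can therefore now conclude that the best solution for the choice of $x_{m,n+1}$ yielding the largest increase in success probability in (82), (83), (84), (87), and (89) is as follows:

\[ \left\{ \begin{aligned} x_{2,n+1} - x_{1,n+1} &= 1, \\ x_{4,n+1} - x_{2,n+1} &= 0, \\ x_{4,n+1} - x_{3,n+1} &= 1 \end{aligned} \right. \quad\implies\quad \left\{ \begin{aligned} x_{1,n+1} &= 0, \\ x_{2,n+1} &= 1, \\ x_{3,n+1} &= 0, \\ x_{4,n+1} &= 1. \end{aligned} \right. \tag{90} \]

This will lead to a total increase of success probability of

\[ 4\Delta P_c = \epsilon_1^{t}(1-\epsilon_1) + \bigl(\epsilon_1^{t} - \epsilon_1^{n}\bigr)(1-\epsilon_1). \tag{91} \]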

Note that for $n$ even with $t = \frac{n}{2}$, adding the column $\mathbf{c}^{(4)}_2$ to the code $\mathcal{C}^{(4,n)}_{n/2,0}$ will result in a code that is equivalent to $\mathcal{C}^{(4,n+1)}_{\lfloor (n+1)/2 \rfloor,0}$ by just exchanging the roles of the second and third codeword and re-ordering the columns. For $n$ odd with $t = \lfloor \frac{n}{2} \rfloor$, adding the column $\mathbf{c}^{(4)}_2$ to the code $\mathcal{C}^{(4,n)}_{\lfloor n/2 \rfloor,0}$ results in $\mathcal{C}^{(4,n+1)}_{(n+1)/2,0}$. In particular, since $t = \lfloor \frac{n}{2} \rfloor < n - t$, this also proves that for even blocklength these optimal linear codes are unique.
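Claim 22 and the increment (91) lend themselves to an exhaustive check at small blocklength. The sketch below tries all 16 candidate columns for the appended bit and evaluates each extended code with a brute-force ML decoder on the ZC. It assumes $\mathbf{x} = [0^{(n-t)}\,1^{(t)}]$ for (37) and $\mathbf{c}^{(4)}_2 = (0,1,0,1)^{\mathsf T}$, the column obtained in (90); the helper names are illustrative.

```python
from itertools import product


def zc_likelihood(y, x, eps1):
    """P(y|x) on a Z-channel with P(1|0) = 0 and P(0|1) = eps1."""
    p = 1.0
    for xj, yj in zip(x, y):
        if xj == 0:
            if yj == 1:
                return 0.0
        else:
            p *= eps1 if yj == 0 else 1.0 - eps1
    return p


def ml_success(code, eps1):
    """Average ML success probability by enumerating all received vectors."""
    n = len(code[0])
    total = sum(max(zc_likelihood(y, x, eps1) for x in code)
                for y in product((0, 1), repeat=n))
    return total / len(code)


def weak_flip_code4(n, t):
    """Four codewords 0, x, x_bar, 1 with x = [0^(n-t) 1^(t)] (assumed form)."""
    x = (0,) * (n - t) + (1,) * t
    return [(0,) * n, x, tuple(1 - b for b in x), (1,) * n]


def success_per_column(code, eps1):
    """Success probability of the extended code for each of the 2^M columns."""
    results = {}
    for column in product((0, 1), repeat=len(code)):
        extended = [x + (bit,) for x, bit in zip(code, column)]
        results[column] = ml_success(extended, eps1)
    return results


if __name__ == "__main__":
    n, t, eps1 = 4, 2, 0.3
    code = weak_flip_code4(n, t)
    results = success_per_column(code, eps1)
    best = max(results.values())
    winners = [c for c, v in results.items() if best - v < 1e-12]
    print("columns achieving the maximum:", winners)
    print("gain of c2:", 4 * (results[(0, 1, 0, 1)] - ml_success(code, eps1)))
```

Several columns can tie with $\mathbf{c}^{(4)}_2$, so the check asks only whether $\mathbf{c}^{(4)}_2$ attains the maximum, not whether it is the unique maximizer.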

Finally, the case with three codewords $M = 3$ can be proved in a similar manner. We observe that

\[ \bigl(\mathbf{c}^{(3)}_1,\, \mathbf{c}^{(3)}_3\bigr) \equiv \bigl(\mathbf{c}^{(3)}_1,\, \mathbf{c}^{(3)}_2\bigr) \tag{92} \]

are optimal codebooks for $n = 2$. An optimal way of extending these codes is then to add columns $\mathbf{c}^{(3)}_2$ or $\mathbf{c}^{(3)}_3$.

Similarly, we also can prove that the codebook consisting of $(n - t)$ columns $\mathbf{c}^{(3)}_1$ and $t$ columns arbitrarily chosen from $\mathbf{c}^{(3)}_2$ or $\mathbf{c}^{(3)}_3$ is optimal on a ZC.


Note that for $M = 2$ and $M = 4$, the optimal codes given in Theorem 19 and Theorem 21 are linear. The proof of Theorem 21 shows that for even $n$, these linear codes are the unique optimal codes. For odd $n$, there are other (also nonlinear) designs that achieve the same optimal performance.

It is remarkable that these optimal codes perform quite well even for very short blocklength. As an example, consider four codewords $M = 4$ of blocklength $n = 10$ that are used over a ZC with $\epsilon_1 = 0.3$: the optimal average error probability is $P_e^{(n)}\bigl(\mathcal{C}^{(4,10)}_{5,0}\bigr) \approx 2.43 \cdot 10^{-3}$. If we increase the blocklength to $n = 20$, we already achieve an average error probability $P_e^{(n)}\bigl(\mathcal{C}^{(4,20)}_{10,0}\bigr) \approx 5.90 \cdot 10^{-6}$.

Moreover, also note that the optimal code $\mathcal{C}^{(4,n)}_{t,0}$ can be seen as a double-flip code consisting of the combination of the flip code of type 0 with the flip code of type $t > 0$:

\[ \mathcal{C}^{(4,n)}_{t,0} = \begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \mathbf{x}_3 \\ \mathbf{x}_4 \end{pmatrix} = \begin{pmatrix} \mathbf{0} \\ \mathbf{x} \\ \bar{\mathbf{x}} \\ \mathbf{1} \end{pmatrix} \tag{93} \]

with $\mathbf{x}$ defined in (37).

Since we know that the success probability increases with n on a binary DMC, it is quite natural to try to construct the optimal codes recursively in n.

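Corollary 23. The optimal codebooks defined in Theorem 21 for $M = 3$ and $M = 4$ can be constructed recursively in the blocklength $n$. We start with an optimal codebook for $n = 2$:

\[ \mathcal{C}^{(M,2)*}_{\mathrm{ZC}} = \bigl(\mathbf{c}^{(M)}_1,\, \mathbf{c}^{(M)}_2\bigr). \tag{94} \]

Then, we recursively construct the optimal codebook for $n \ge 3$ by using $\mathcal{C}^{(M,n-1)*}_{\mathrm{ZC}}$ and appending

\[ \begin{cases} \mathbf{c}^{(M)}_1 & \text{if } n \bmod 2 = 1, \\ \mathbf{c}^{(M)}_2 & \text{if } n \bmod 2 = 0. \end{cases} \tag{95} \]

6.5 Conjectured Optimal Codes on ZC for M = 5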

The idea of designing an optimal code recursively promises to be a very powerful approach. However, note that for larger values of M, the recursion might need a step-size larger than 1. In the following we conjecture an optimal code construction for a ZC in the case of five codewords M = 5 with a different recursive design for n odd and n even.

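We define the following five column vectors:

\[ \left\{ \mathbf{c}^{(5)}_1 \triangleq \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \\ 1 \end{pmatrix},\; \mathbf{c}^{(5)}_2 \triangleq \begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 1 \end{pmatrix},\; \mathbf{c}^{(5)}_3 \triangleq \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 1 \end{pmatrix},\; \mathbf{c}^{(5)}_4 \triangleq \begin{pmatrix} 0 \\ 0 \\ 1 \\ 1 \\ 1 \end{pmatrix},\; \mathbf{c}^{(5)}_5 \triangleq \begin{pmatrix} 0 \\ 1 \\ 0 \\ 1 \\ 1 \end{pmatrix} \right\}. \tag{96} \]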


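An optimal code can be constructed recursively for even $n$ in the following way: we start with an optimal codebook for $n = 8$:

\[ \mathcal{C}^{(5,8)*}_{\mathrm{ZC}} = \bigl(\mathbf{c}^{(5)}_1, \mathbf{c}^{(5)}_2, \mathbf{c}^{(5)}_3, \mathbf{c}^{(5)}_1, \mathbf{c}^{(5)}_2, \mathbf{c}^{(5)}_3, \mathbf{c}^{(5)}_4, \mathbf{c}^{(5)}_5\bigr). \tag{97} \]

Then, we recursively construct an optimal codebook for $n \ge 10$, $n$ even, by using $\mathcal{C}^{(5,n-2)*}_{\mathrm{ZC}}$ and appending

\[ \begin{cases} \bigl(\mathbf{c}^{(5)}_4, \mathbf{c}^{(5)}_5\bigr) & \text{if } n \bmod 10 = 0, \\ \bigl(\mathbf{c}^{(5)}_1, \mathbf{c}^{(5)}_2\bigr) & \text{if } n \bmod 10 = 2, \\ \bigl(\mathbf{c}^{(5)}_1, \mathbf{c}^{(5)}_3\bigr) & \text{if } n \bmod 10 = 4, \\ \bigl(\mathbf{c}^{(5)}_3, \mathbf{c}^{(5)}_4\bigr) & \text{if } n \bmod 10 = 6, \\ \bigl(\mathbf{c}^{(5)}_2, \mathbf{c}^{(5)}_5\bigr) & \text{if } n \bmod 10 = 8. \end{cases} \tag{98} \]

For $n$ odd we have

\[ \mathcal{C}^{(5,9)*}_{\mathrm{ZC}} = \bigl(\mathbf{c}^{(5)}_1, \mathbf{c}^{(5)}_2, \mathbf{c}^{(5)}_3, \mathbf{c}^{(5)}_4, \mathbf{c}^{(5)}_5, \mathbf{c}^{(5)}_1, \mathbf{c}^{(5)}_2, \mathbf{c}^{(5)}_1, \mathbf{c}^{(5)}_3\bigr). \tag{99} \]

Then, we recursively construct an optimal codebook for $n \ge 11$, $n$ odd, by using $\mathcal{C}^{(5,n-2)*}_{\mathrm{ZC}}$ and appending

\[ \begin{cases} \bigl(\mathbf{c}^{(5)}_3, \mathbf{c}^{(5)}_4\bigr) & \text{if } n \bmod 10 = 1, \\ \bigl(\mathbf{c}^{(5)}_2, \mathbf{c}^{(5)}_5\bigr) & \text{if } n \bmod 10 = 3, \\ \bigl(\mathbf{c}^{(5)}_4, \mathbf{c}^{(5)}_5\bigr) & \text{if } n \bmod 10 = 5, \\ \bigl(\mathbf{c}^{(5)}_1, \mathbf{c}^{(5)}_2\bigr) & \text{if } n \bmod 10 = 7, \\ \bigl(\mathbf{c}^{(5)}_1, \mathbf{c}^{(5)}_3\bigr) & \text{if } n \bmod 10 = 9. \end{cases} \tag{100} \]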

Note that the recursive structure in (98) and (100) is actually identical apart from the ordering. Also note that when increasing the blocklength by 10, we add each of the five column vectors in (96) exactly twice.
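The step-size-2 recursion of (97)–(100) is easy to mechanize. The following Python sketch builds the conjectured column sequence for any $n \ge 8$ and counts how often each column of (96) is used; the table and function names are illustrative, and the sketch only reproduces the construction, it does not test optimality.

```python
from collections import Counter

# Column vectors of (96), indexed 1..5 (top entry belongs to the first codeword).
COLUMNS = {
    1: (0, 0, 0, 1, 1),
    2: (0, 0, 1, 0, 1),
    3: (0, 1, 0, 0, 1),
    4: (0, 0, 1, 1, 1),
    5: (0, 1, 0, 1, 1),
}

# Starting codebooks (97) and (99), written as sequences of column indices.
BASE = {8: [1, 2, 3, 1, 2, 3, 4, 5], 9: [1, 2, 3, 4, 5, 1, 2, 1, 3]}

# Pairs appended according to (98) (even n) and (100) (odd n), keyed by n mod 10.
APPEND = {0: (4, 5), 2: (1, 2), 4: (1, 3), 6: (3, 4), 8: (2, 5),
          1: (3, 4), 3: (2, 5), 5: (4, 5), 7: (1, 2), 9: (1, 3)}


def conjectured_columns(n):
    """Column indices of the conjectured code of blocklength n >= 8."""
    if n < 8:
        raise ValueError("the recursion is only stated for n >= 8")
    start = 8 if n % 2 == 0 else 9
    columns = list(BASE[start])
    for m in range(start + 2, n + 1, 2):
        columns.extend(APPEND[m % 10])
    return columns


def codebook(n):
    """The five codewords (rows) obtained from the column sequence."""
    columns = [COLUMNS[i] for i in conjectured_columns(n)]
    return [tuple(col[m] for col in columns) for m in range(5)]


if __name__ == "__main__":
    for n in (8, 9, 18, 19):
        print(n, sorted(Counter(conjectured_columns(n)).items()))
```

Comparing the counts for $n$ and $n + 10$ shows each column appearing exactly two more times, which is the periodicity noted above.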

For $n < 10$ the optimal code structure goes through some transient states.

6.6 Optimal Codes on BSC

Theorem 24. For a BSC and for any $n \ge 1$, an optimal codebook with two codewords $M = 2$ is the flip code of type $t$ for any $t \in \bigl\{0, 1, \ldots, \lfloor \frac{n}{2} \rfloor\bigr\}$.

Proof. This proof is basically a corollary of Proposition 16. The details are omitted.

Theorem 25. For a BSC and for any $n \ge 2$, an optimal codebook with three codewords $M = 3$ or four codewords $M = 4$ is the weak flip code of type $(t_2^*, t_3^*)$:

\[ \mathcal{C}^{(M,n)*}_{\mathrm{BSC}} = \mathcal{C}^{(M,n)}_{t_2^*,t_3^*}, \tag{101} \]

where we define

\[ t_2^* \triangleq \left\lfloor \frac{n-1}{3} \right\rfloor, \qquad t_3^* \triangleq \left\lfloor \frac{n+1}{3} \right\rfloor. \tag{102} \]


Note that for $M = 2$, the optimal codes given in Theorem 24 can be linear or nonlinear. For $M = 4$, by the definition of the weak flip code of type $(t_2, t_3)$, the optimal codes in Theorem 25 are linear. However, due to the strong symmetry of the BSC, there also exist nonlinear codes with the same optimal performance.

Moreover, note that one can learn from the proof of Theorem 25 that the received vector $\mathbf{y}$ that is farthest from the three codewords when $M = 3$ is

\[ \mathbf{y} = (\underbrace{1, 1, \ldots, 1}_{t_1^*},\, \underbrace{1, 1, \ldots, 1}_{t_2^*},\, \underbrace{0, 0, \ldots, 0}_{t_3^*}). \tag{103} \]

This is identical to the optimal choice of a fourth codeword $\mathbf{x}_4$ when $M = 4$.

Corollary 26. The optimal codebooks defined in Theorem 25 for $M = 3$ and $M = 4$ can be constructed recursively in the blocklength $n$. We start with an optimal codebook for $n = 2$:

\[ \mathcal{C}^{(M,2)*}_{\mathrm{BSC}} = \bigl(\mathbf{c}^{(M)}_1,\, \mathbf{c}^{(M)}_3\bigr). \tag{104} \]

Then, we recursively construct the optimal codebook for $n \ge 3$ by using $\mathcal{C}^{(M,n-1)*}_{\mathrm{BSC}}$ and appending

\[ \begin{cases} \mathbf{c}^{(M)}_1 & \text{if } n \bmod 3 = 0, \\ \mathbf{c}^{(M)}_2 & \text{if } n \bmod 3 = 1, \\ \mathbf{c}^{(M)}_3 & \text{if } n \bmod 3 = 2. \end{cases} \tag{105} \]
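The recursion (104)–(105) and the closed-form parameters (102) describe the same column counts, which a few lines of Python can confirm. In this sketch the floors in (102) are taken as written above, the columns are tracked only by their index, and the optional codebook builder assumes $\mathbf{c}^{(4)}_1 = (0,0,1,1)^{\mathsf T}$, $\mathbf{c}^{(4)}_2 = (0,1,0,1)^{\mathsf T}$, and $\mathbf{c}^{(4)}_3 = (0,1,1,0)^{\mathsf T}$, a definition that lies outside this section.

```python
# Assumed M = 4 columns c1, c2, c3; their definition is not part of this section.
COLS4 = {1: (0, 0, 1, 1), 2: (0, 1, 0, 1), 3: (0, 1, 1, 0)}


def recursive_column_indices(n):
    """Column indices produced by the recursion (104)-(105) for blocklength n >= 2."""
    indices = [1, 3]  # starting codebook (c1, c3) for n = 2
    for m in range(3, n + 1):
        indices.append({0: 1, 1: 2, 2: 3}[m % 3])
    return indices


def recursive_counts(n):
    """Numbers [t1, t2, t3] of columns c1, c2, c3 after running the recursion."""
    indices = recursive_column_indices(n)
    return [indices.count(1), indices.count(2), indices.count(3)]


def closed_form_counts(n):
    """[t1*, t2*, t3*] with t2*, t3* from (102) and t1* = n - t2* - t3*."""
    t2 = (n - 1) // 3
    t3 = (n + 1) // 3
    return [n - t2 - t3, t2, t3]


def codebook4(n):
    """Four codewords (rows) built from the assumed M = 4 columns."""
    columns = [COLS4[i] for i in recursive_column_indices(n)]
    return [tuple(col[m] for col in columns) for m in range(4)]


if __name__ == "__main__":
    for n in range(2, 10):
        print(n, recursive_counts(n), closed_form_counts(n))
```

Only the multiset of columns matters for the error probability, so comparing counts rather than column orderings is sufficient here.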

7 Pairwise Hamming Distance Structure

It is quite common in conventional coding theory to use the minimum Hamming distance or the weight enumerating function (WEF) of a code as a design and quality criterion [6]. This is motivated by the equivalence of Hamming weight and Hamming distance for linear codes, and by the union bound that converts the search for the global error probability into pairwise error probabilities. Since we are interested in the globally optimal code design and the best performance achieved by an ML decoder, we can neither use the union bound, nor can we a priori restrict our search to linear codes. Note that for most values of $M$, linear codes do not even exist!

In order to demonstrate that these commonly used design criteria do not work when searching for an optimal code, we will now investigate the minimum Hamming distance of an optimal code. Although, as (46) shows, the error probability performance of a BSC is completely specified by the Hamming distance between codewords and received vectors, it turns out that a design based on the minimum Hamming distance can fail, even for the very symmetric BSC and even for linear codes. Recall that we have seen a first glimpse of this behavior in Example 15. In the case of a more general (and not symmetric) BAC, this is even more pronounced [5].

For the symmetric case of a BSC, one can rely on the pairwise Hamming distance vector as defined in Section 4.5.

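For $M = 3$ or $M = 4$, we know from Theorem 25 that the optimal code $\mathcal{C}^{(M,n)*}_{\mathrm{BSC}}$ for a BSC consists of $t_2^*$ columns $\mathbf{c}^{(M)}_2$, $t_3^*$ columns $\mathbf{c}^{(M)}_3$, and $t_1^* \triangleq n - t_2^* - t_3^*$ columns $\mathbf{c}^{(M)}_1$, where the parameters $t_2^*$ and $t_3^*$ are defined in (102).

Using the shorthand $k \triangleq \lfloor \frac{n}{3} \rfloor$, we can write the optimal code parameters of (102) as

\[ [t_1^*, t_2^*, t_3^*] = \begin{cases} [k+1,\, k-1,\, k] & \text{if } n \bmod 3 = 0, \\ [k+1,\, k,\, k] & \text{if } n \bmod 3 = 1, \\ [k+1,\, k,\, k+1] & \text{if } n \bmod 3 = 2. \end{cases} \tag{106} \]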

Using (42) we can compute the pairwise Hamming distance vector of this code for $M = 3$ as follows:

\[ \mathbf{d}^{(3,n)} = \begin{cases} (2k-1,\, 2k+1,\, 2k) & \text{if } n \bmod 3 = 0, \\ (2k,\, 2k+1,\, 2k+1) & \text{if } n \bmod 3 = 1, \\ (2k+1,\, 2k+2,\, 2k+1) & \text{if } n \bmod 3 = 2, \end{cases} \tag{107} \]

i.e.,

\[ d^{(3,n)}_{\min} = \begin{cases} 2k-1 & \text{if } n \bmod 3 = 0, \\ 2k & \text{if } n \bmod 3 = 1, \\ 2k+1 & \text{if } n \bmod 3 = 2. \end{cases} \tag{108} \]

For $M = 4$ we get accordingly:

\[ \mathbf{d}^{(4,n)} = \begin{cases} (2k-1,\, 2k+1,\, 2k,\, 2k,\, 2k+1,\, 2k-1) & \text{if } n \bmod 3 = 0, \\ (2k,\, 2k+1,\, 2k+1,\, 2k+1,\, 2k+1,\, 2k) & \text{if } n \bmod 3 = 1, \\ (2k+1,\, 2k+2,\, 2k+1,\, 2k+1,\, 2k+2,\, 2k+1) & \text{if } n \bmod 3 = 2, \end{cases} \tag{109} \]

with the same values for the minimum Hamming distance as for $M = 3$. We will compare this optimal code with the following different weak flip code $\mathcal{C}^{(M,n)}_{\mathrm{subopt}}$:

\[ [t_1, t_2, t_3] = \begin{cases} [k,\, k,\, k] & \text{if } n \bmod 3 = 0, \\ [k+1,\, k-1,\, k+1] & \text{if } n \bmod 3 = 1, \\ [k+2,\, k,\, k] & \text{if } n \bmod 3 = 2. \end{cases} \tag{110} \]

This code can actually be constructed from the optimal code $\mathcal{C}^{(M,n-1)*}_{\mathrm{BSC}}$ by appending a corresponding column (depending on $n$). In fact, by adapting the proof of Corollary 26, we can show that this second weak flip code is strictly suboptimal.

The pairwise Hamming distance vectors of this suboptimal code are given as follows. For $M = 3$:

\[ \mathbf{d}^{(3,n)} = \begin{cases} (2k,\, 2k,\, 2k) & \text{if } n \bmod 3 = 0, \\ (2k,\, 2k+2,\, 2k) & \text{if } n \bmod 3 = 1, \\ (2k,\, 2k+2,\, 2k+2) & \text{if } n \bmod 3 = 2, \end{cases} \tag{111} \]

i.e., $d^{(3,n)}_{\min} = 2k$ in all cases. For $M = 4$ the situation is analogous, again with $d^{(4,n)}_{\min} = 2k$ in all cases.

Hence, we see that for $n \bmod 3 = 0$ the minimum Hamming distance of the optimal code is $2k - 1$ and therefore strictly smaller than the minimum Hamming distance $2k$ of the suboptimal code. By adapting the construction of the strictly suboptimal code $\mathcal{C}^{(M,n)}_{\mathrm{subopt}}$, a similar statement can be made for the case when $n \bmod 3 = 1$.

We have shown the following proposition.

Proposition 27. On a BSC for $M = 3$ or $M = 4$ and for all $n$ with $n \bmod 3 = 0$ or $n \bmod 3 = 1$, the codes that maximize the minimum Hamming distance $d^{(n)}_{\min}$ can be strictly suboptimal. This is not true in the case of $n \bmod 3 = 2$.
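The smallest instance of this effect, $M = 3$ and $n = 3$, can be reproduced by brute force. The sketch below assumes the columns $\mathbf{c}^{(3)}_1 = (0,0,1)^{\mathsf T}$, $\mathbf{c}^{(3)}_2 = (0,1,0)^{\mathsf T}$, $\mathbf{c}^{(3)}_3 = (0,1,1)^{\mathsf T}$ (their definition in (39) is outside this section) and a BSC with crossover probability $p < \frac{1}{2}$, for which ML decoding is minimum-distance decoding. Names are illustrative.

```python
from itertools import combinations, product

# Assumed M = 3 columns c1, c2, c3; their definition is not part of this section.
COLS3 = {1: (0, 0, 1), 2: (0, 1, 0), 3: (0, 1, 1)}


def codebook(t1, t2, t3):
    """Three codewords built from t1 columns c1, t2 columns c2, t3 columns c3."""
    columns = [COLS3[1]] * t1 + [COLS3[2]] * t2 + [COLS3[3]] * t3
    return [tuple(col[m] for col in columns) for m in range(3)]


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


def distance_vector(code):
    """Pairwise Hamming distances in the order (d12, d13, d23)."""
    return [hamming(u, v) for u, v in combinations(code, 2)]


def bsc_ml_success(code, p):
    """Average ML success probability on a BSC with crossover probability p < 1/2."""
    n = len(code[0])
    total = 0.0
    for y in product((0, 1), repeat=n):
        d = min(hamming(x, y) for x in code)  # ML = nearest codeword
        total += p ** d * (1 - p) ** (n - d)
    return total / len(code)


if __name__ == "__main__":
    p = 0.1
    optimal = codebook(2, 0, 1)      # [t1, t2, t3] from (106) with n = 3, k = 1
    suboptimal = codebook(1, 1, 1)   # [t1, t2, t3] from (110) with n = 3, k = 1
    for name, code in (("optimal", optimal), ("suboptimal", suboptimal)):
        print(name, distance_vector(code), bsc_ml_success(code, p))
```

For $n = 3$ the first code has the smaller minimum distance yet the larger success probability, which is the behavior Proposition 27 describes.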


8 Conclusion

We have studied ultra-small block-codes to be used on the most general binary channel, the binary asymmetric channel (BAC), and its two special cases, the Z-channel and the binary symmetric channel. We have shown that, in contrast to capacity, which can always be achieved with linear codes, the best codes in the sense that they achieve the smallest average probability of error for a fixed blocklength often are not linear. For an arbitrary blocklength, we have given the optimal construction for the cases of four or fewer messages. In the case of the Z-channel, we have also conjectured an optimal construction for the case of $M = 5$ messages.

We have introduced a new powerful way of generating these codes recursively by using a column-wise build-up of the codebook matrix. This recursive construction might be extended to a higher number of codewords $M \ge 5$; however, we might then require a larger step-size, i.e., the optimal code of blocklength $n$ can be constructed from the optimal code of blocklength $n - k$, where the step-size $k$ might be larger than 1. We have conjectured a step-size of $k = 2$ for the case of $M = 5$ on the Z-channel.

In any case, the idea of the column-wise approach is a very powerful tool both for construction and analysis that will also help in the quest of finding the optimal construction for codes with more than four codewords.

Finally, we have shown that the well-known and commonly used code parameter

minimum Hamming distance might not be suitable as an optimum design criterion

for codes even for strongly symmetric channels like the BSC.

References

[1] Claude E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, pp. 379–423 and 623–656, July and October 1948.

[2] Robert G. Gallager, Information Theory and Reliable Communication. John Wiley & Sons, 1968.

[3] Yury Polyanskiy, H. Vincent Poor, and Sergio Verdú, “Channel coding rate in the finite blocklength regime,” IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2307–2359, May 2010.

[4] Chia-Lung Wu, Po-Ning Chen, Yunghsiang S. Han, and Yan-Xiu Zheng, “On the coding scheme for joint channel estimation and error correction over block fading channels,” in Proceedings IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Tokyo, Japan, September 13–16, 2009, pp. 1272–1276.

[5] Po-Ning Chen, Hsuan-Yin Lin, and Stefan M. Moser, “On ultra-small block-codes for binary discrete memoryless channels,” August 2011, in preparation.

[6] Shu Lin and Daniel J. Costello, Jr., Error Control Coding, 2nd ed. Prentice Hall, 2004.

[7] Stefan M. Moser, Information Theory (Lecture Notes), version 1, fall semester 2011/2012, Information Theory Lab, Department of Electrical Engineering, National Chiao Tung University (NCTU), September 2011. [Online]. Available:

A Appendix: Derivation of Proposition 16

Assume that the optimal code for blocklength $n$ is not a flip-flop code. Then the code has a number $m$ of positions where both codewords have the same symbol. The optimal decoder will ignore these $m$ positions completely. Hence, the performance of this code will be identical to a flip-flop code of length $n - m$.

We therefore only need to show that increasing $n$ will always allow us to find a new flip-flop code with a better performance. In other words, Proposition 16 is proven once we have shown that

\[ P_e\bigl(\mathcal{C}^{(2,n-1)}_t\bigr) \ge \max\Bigl\{ P_e\bigl(\mathcal{C}^{(2,n)}_t\bigr),\, P_e\bigl(\mathcal{C}^{(2,n)}_{t+1}\bigr) \Bigr\}. \tag{112} \]

Here we have used the following notation:

\[ \mathcal{C}^{(2,n-1)}_t = \begin{pmatrix} \mathbf{x}^{(n-1)} \\ \bar{\mathbf{x}}^{(n-1)} \end{pmatrix} \tag{113} \]

is a length-$(n-1)$ flip-flop code of some type $t$, and

\[ \mathcal{C}^{(2,n)}_t = \begin{pmatrix} [\mathbf{x}^{(n-1)}\; 0] \\ [\bar{\mathbf{x}}^{(n-1)}\; 1] \end{pmatrix}, \qquad \mathcal{C}^{(2,n)}_{t+1} = \begin{pmatrix} [\mathbf{x}^{(n-1)}\; 1] \\ [\bar{\mathbf{x}}^{(n-1)}\; 0] \end{pmatrix} \tag{114} \]

are the two length-$n$ flip-flop codes that can be derived from $\mathcal{C}^{(2,n-1)}_t$.

As shown in Proposition 18, the optimal decision rule for any flip-flop code is a threshold rule with some threshold $\ell$: the decision rule for received $\mathbf{y}$ only depends on $d$ such that

\[ g(\mathbf{y}) = \begin{cases} \mathbf{x} & \text{if } 0 \le d \le \ell, \\ \bar{\mathbf{x}} & \text{if } \ell + 1 \le d \le n, \end{cases} \tag{115} \]

where we use $g(\cdot)$ to denote the ML decoding rule.

The threshold satisfies $0 \le \ell \le \lfloor \frac{n-1}{2} \rfloor$. Note that when $\ell = \lfloor \frac{n-1}{2} \rfloor$, the decision rule is equivalent to a majority rule. Also note that when $n$ is even and $d = \frac{n}{2}$, the decisions for $\mathbf{x}$ and $\bar{\mathbf{x}}$ are equally likely, i.e., without loss of generality we then always decode to $\bar{\mathbf{x}}$.

So let the threshold for $\mathcal{C}^{(2,n-1)}_t$ be $\ell^{(n-1)}$. We will now argue that the threshold for $\mathcal{C}^{(2,n)}_t$ and $\mathcal{C}^{(2,n)}_{t+1}$ according to (114) must satisfy

\[ \ell^{(n-1)} \le \ell^{(n)} \le \ell^{(n-1)} + 1. \tag{116} \]

Consider the code $\mathcal{C}^{(2,n)}_t$. Note that since $t$ is unchanged ($\mathcal{C}^{(2,n-1)}_t$ is changed to $\mathcal{C}^{(2,n)}_t$), the first codeword was appended a 0, while the second codeword was appended a 1, i.e., $\mathbf{x}^{(n)} = [\mathbf{x}^{(n-1)}\; 0]$ and $\bar{\mathbf{x}}^{(n)} = [\bar{\mathbf{x}}^{(n-1)}\; 1]$, see (114).

Now firstly assume by contradiction that $\ell^{(n)} < \ell^{(n-1)}$ and pick a received $\mathbf{y}^{(n-1)}$ with $d^{(n-1)} = \ell^{(n-1)}$ that (for the code $\mathcal{C}^{(2,n-1)}_t$) is decoded to $\mathbf{x}^{(n-1)}$. The received length-$n$ vector $\mathbf{y}^{(n)} = [\mathbf{y}^{(n-1)}\; 0]$ has $d^{(n)} = \ell^{(n-1)} > \ell^{(n)}$, i.e., it will now be decoded to $\bar{\mathbf{x}}^{(n)}$. This, however, is a contradiction to the assumption that the ML decision for the code $\mathcal{C}^{(2,n-1)}_t$ was $\mathbf{x}^{(n-1)}$.

Then secondly assume by contradiction that $\ell^{(n)} > \ell^{(n-1)} + 1$. Pick a received $\mathbf{y}^{(n-1)}$ with $d^{(n-1)} = \ell^{(n-1)} + 1$ that (for the code $\mathcal{C}^{(2,n-1)}_t$) is decoded to $\bar{\mathbf{x}}^{(n-1)}$. The received length-$n$ vector $\mathbf{y}^{(n)} = [\mathbf{y}^{(n-1)}\; 1]$ has $d^{(n)} = \ell^{(n-1)} + 2 < \ell^{(n)} + 1$, i.e., it will now be decoded to $\mathbf{x}^{(n)}$. This, however, is a contradiction to the assumption that the ML decision for the code $\mathcal{C}^{(2,n-1)}_t$ was $\bar{\mathbf{x}}^{(n-1)}$.


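The same arguments also hold for the other code $\mathcal{C}^{(2,n)}_{t+1}$. Hence, we see that there are only two possible changes with respect to the decoding rule to be considered.

We will next use this fact to prove that $P_e^{(n-1)}\bigl(\mathcal{C}^{(2,n-1)}_t\bigr) \ge P_e^{(n)}\bigl(\mathcal{C}^{(2,n)}_t\bigr)$. The error probability is given by

\[ P_e = \frac{1}{2} \sum_{\substack{\mathbf{y} \\ g(\mathbf{y}) = \bar{\mathbf{x}}}} P_{Y|X}(\mathbf{y}|\mathbf{x}) + \frac{1}{2} \sum_{\substack{\mathbf{y} \\ g(\mathbf{y}) = \mathbf{x}}} P_{Y|X}(\mathbf{y}|\bar{\mathbf{x}}). \tag{117} \]

Here, in (119) we use the fact that $P_{Y|X}(1|0) + P_{Y|X}(0|0) = 1$ and $P_{Y|X}(1|1) + P_{Y|X}(0|1) = 1$; and in (120) we combine the terms together using the definition of $\mathcal{C}^{(2,n)}_t$ according to (114).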

We can now distinguish two cases in (116):

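(i) If the decision rule is unchanged, i.e., $\ell^{(n)} = \ell^{(n-1)}$, we only need to take care of the third summation in (120) that contains some terms that will now be decoded differently. Since we have assumed that $\ell^{(n)} = \ell^{(n-1)}$, we know that for all $\mathbf{y}^{(n-1)}$ with $d^{(n-1)} = \ell^{(n-1)}$ the length-$n$ received vector $[\mathbf{y}^{(n-1)}\; 1]$ has $d^{(n)} = \ell^{(n-1)} + 1 = \ell^{(n)} + 1$ and will be decoded to $\bar{\mathbf{x}}^{(n)}$. Hence we must have

\[ \frac{P^n_{Y|X}\bigl([\mathbf{y}^{(n-1)}\; 1] \mid \mathbf{x}^{(n)}\bigr)}{P^n_{Y|X}\bigl([\mathbf{y}^{(n-1)}\; 1] \mid \bar{\mathbf{x}}^{(n)}\bigr)} \le 1, \tag{121} \]

and therefore

\begin{align*}
\sum_{\substack{\mathbf{y}^{(n-1)} \\ 1 \le d^{(n)} \le \ell^{(n-1)}+1}} P^n_{Y|X}\bigl([\mathbf{y}^{(n-1)}\; 1] \mid \bar{\mathbf{x}}^{(n)}\bigr)
&= \sum_{\substack{\mathbf{y}^{(n-1)} \\ d^{(n)} = \ell^{(n-1)}+1}} \underbrace{P^n_{Y|X}\bigl([\mathbf{y}^{(n-1)}\; 1] \mid \bar{\mathbf{x}}^{(n)}\bigr)}_{\ge\, P^n_{Y|X}([\mathbf{y}^{(n-1)}\; 1] \mid \mathbf{x}^{(n)})} + \sum_{\substack{\mathbf{y}^{(n-1)} \\ 1 \le d^{(n)} \le \ell^{(n-1)}}} P^n_{Y|X}\bigl([\mathbf{y}^{(n-1)}\; 1] \mid \bar{\mathbf{x}}^{(n)}\bigr) \tag{122} \\
&\ge \sum_{\substack{\mathbf{y}^{(n-1)} \\ d^{(n)} = \ell^{(n-1)}+1}} P^n_{Y|X}\bigl([\mathbf{y}^{(n-1)}\; 1] \mid \mathbf{x}^{(n)}\bigr) + \sum_{\substack{\mathbf{y}^{(n-1)} \\ 1 \le d^{(n)} \le \ell^{(n-1)}}} P^n_{Y|X}\bigl([\mathbf{y}^{(n-1)}\; 1] \mid \bar{\mathbf{x}}^{(n)}\bigr). \tag{123}
\end{align*}

Hence, we get from (120):

\begin{align*}
2P_e^{(n-1)}\bigl(\mathcal{C}^{(2,n-1)}_t\bigr)
&\ge \sum_{\substack{\mathbf{y}^{(n-1)} \\ \ell^{(n-1)}+2 \le d^{(n)} \le n}} P^n_{Y|X}\bigl([\mathbf{y}^{(n-1)}\; 1] \mid \mathbf{x}^{(n)}\bigr) + \sum_{\substack{\mathbf{y}^{(n-1)} \\ \ell^{(n-1)}+1 \le d^{(n)} \le n-1}} P^n_{Y|X}\bigl([\mathbf{y}^{(n-1)}\; 0] \mid \mathbf{x}^{(n)}\bigr) \\
&\quad + \sum_{\substack{\mathbf{y}^{(n-1)} \\ d^{(n)} = \ell^{(n-1)}+1}} P^n_{Y|X}\bigl([\mathbf{y}^{(n-1)}\; 1] \mid \mathbf{x}^{(n)}\bigr) + \sum_{\substack{\mathbf{y}^{(n-1)} \\ 1 \le d^{(n)} \le \ell^{(n-1)}}} P^n_{Y|X}\bigl([\mathbf{y}^{(n-1)}\; 1] \mid \bar{\mathbf{x}}^{(n)}\bigr) \\
&\quad + \sum_{\substack{\mathbf{y}^{(n-1)} \\ 0 \le d^{(n)} \le \ell^{(n-1)}}} P^n_{Y|X}\bigl([\mathbf{y}^{(n-1)}\; 0] \mid \bar{\mathbf{x}}^{(n)}\bigr) \tag{124} \\
&= \sum_{\substack{\mathbf{y}^{(n)} \\ \ell^{(n-1)}+1 \le d^{(n)} \le n}} P^n_{Y|X}\bigl(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}\bigr) + \sum_{\substack{\mathbf{y}^{(n)} \\ 0 \le d^{(n)} \le \ell^{(n-1)}}} P^n_{Y|X}\bigl(\mathbf{y}^{(n)} \mid \bar{\mathbf{x}}^{(n)}\bigr) \tag{125} \\
&= 2P_e^{(n)}\bigl(\mathcal{C}^{(2,n)}_t\bigr). \tag{126}
\end{align*}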

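(ii) If the decision rule is changed such that $\ell^{(n)} = \ell^{(n-1)} + 1$, we need to take care of the second summation in (120) that contains some terms that will now be decoded differently. Since we have assumed that $\ell^{(n)} = \ell^{(n-1)} + 1$, we know that for all $\mathbf{y}^{(n-1)}$ with $d^{(n-1)} = \ell^{(n-1)} + 1$ the length-$n$ received vector $[\mathbf{y}^{(n-1)}\; 0]$ has $d^{(n)} = \ell^{(n-1)} + 1 = \ell^{(n)}$ and will be decoded to $\mathbf{x}^{(n)}$. Hence we must have

\[ \frac{P^n_{Y|X}\bigl([\mathbf{y}^{(n-1)}\; 0] \mid \mathbf{x}^{(n)}\bigr)}{P^n_{Y|X}\bigl([\mathbf{y}^{(n-1)}\; 0] \mid \bar{\mathbf{x}}^{(n)}\bigr)} \ge 1, \tag{127} \]

and therefore

\begin{align*}
\sum_{\substack{\mathbf{y}^{(n-1)} \\ \ell^{(n-1)}+1 \le d^{(n)} \le n-1}} P^n_{Y|X}\bigl([\mathbf{y}^{(n-1)}\; 0] \mid \mathbf{x}^{(n)}\bigr)
&= \sum_{\substack{\mathbf{y}^{(n-1)} \\ d^{(n)} = \ell^{(n-1)}+1}} P^n_{Y|X}\bigl([\mathbf{y}^{(n-1)}\; 0] \mid \mathbf{x}^{(n)}\bigr) + \sum_{\substack{\mathbf{y}^{(n-1)} \\ \ell^{(n-1)}+2 \le d^{(n)} \le n-1}} P^n_{Y|X}\bigl([\mathbf{y}^{(n-1)}\; 0] \mid \mathbf{x}^{(n)}\bigr) \tag{128} \\
&\ge \sum_{\substack{\mathbf{y}^{(n-1)} \\ d^{(n)} = \ell^{(n-1)}+1}} P^n_{Y|X}\bigl([\mathbf{y}^{(n-1)}\; 0] \mid \bar{\mathbf{x}}^{(n)}\bigr) + \sum_{\substack{\mathbf{y}^{(n-1)} \\ \ell^{(n-1)}+2 \le d^{(n)} \le n-1}} P^n_{Y|X}\bigl([\mathbf{y}^{(n-1)}\; 0] \mid \mathbf{x}^{(n)}\bigr). \tag{129}
\end{align*}

The rest of the argument now is analogous to case (i).

This proves that $P_e^{(n-1)}\bigl(\mathcal{C}^{(2,n-1)}_t\bigr) \ge P_e^{(n)}\bigl(\mathcal{C}^{(2,n)}_t\bigr)$. It only remains to show that $P_e^{(n-1)}\bigl(\mathcal{C}^{(2,n-1)}_t\bigr) \ge P_e^{(n)}\bigl(\mathcal{C}^{(2,n)}_{t+1}\bigr)$. This derivation is similar and therefore omitted.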
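Inequality (112) can also be verified numerically for small blocklengths. The sketch below enumerates all received vectors and uses the ML error expression $\frac{1}{2}\sum_{\mathbf{y}} \min\{\cdot,\cdot\}$, the same form as (58). It assumes the BAC convention $\epsilon_0 = P_{Y|X}(1|0)$, $\epsilon_1 = P_{Y|X}(0|1)$ and a first codeword of the form $[0^{(n-1-t)}\,1^{(t)}]$; the names are illustrative and the check is a sanity test, not a replacement for the proof.

```python
from itertools import product


def bac_likelihood(y, x, eps0, eps1):
    """P(y|x) on a BAC with eps0 = P(1|0) and eps1 = P(0|1) (assumed convention)."""
    p = 1.0
    for xj, yj in zip(x, y):
        if xj == 0:
            p *= eps0 if yj == 1 else 1.0 - eps0
        else:
            p *= eps1 if yj == 0 else 1.0 - eps1
    return p


def flip_flop_error(x, eps0, eps1):
    """ML error probability of the two-codeword code {x, x_bar}."""
    x_bar = tuple(1 - b for b in x)
    return 0.5 * sum(min(bac_likelihood(y, x, eps0, eps1),
                         bac_likelihood(y, x_bar, eps0, eps1))
                     for y in product((0, 1), repeat=len(x)))


def check_inequality_112(n, eps0, eps1, tol=1e-12):
    """Check (112) for every type t of the length-(n-1) flip-flop code."""
    for t in range(n):
        x = (0,) * (n - 1 - t) + (1,) * t         # length-(n-1) codeword of type t
        shorter = flip_flop_error(x, eps0, eps1)
        type_t = flip_flop_error(x + (0,), eps0, eps1)       # C_t^(2,n) in (114)
        type_t_plus_1 = flip_flop_error(x + (1,), eps0, eps1)  # C_{t+1}^(2,n) in (114)
        if shorter < max(type_t, type_t_plus_1) - tol:
            return False
    return True


if __name__ == "__main__":
    print(all(check_inequality_112(n, 0.1, 0.3) for n in range(2, 9)))
```

The small tolerance only absorbs floating-point rounding; the inequality itself is not strict, since an appended bit may leave the error probability unchanged.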

B Appendix: Derivation of Theorem 25

B.1 M = 3

Our proof is based on induction in $n$. We start with an optimal code of length $n - 1$ and then prove that appending a column according to the choice given in Corollary 26 will result in a new optimal code. We rely on a couple of observations that for clarity are summarized here once more:

• The proof that the $n = 2$ binary code given in (104) is optimal is straightforward and omitted.

• We do not need to worry about any other codebook columns than those given in (39) because firstly the all-zero and the all-one column can be neglected by the same argument as used in the proof of Proposition 16, and because secondly the flipped version of the columns $\mathbf{c}^{(3)}_1$, $\mathbf{c}^{(3)}_2$, and $\mathbf{c}^{(3)}_3$ will result in the same performance because the BSC is strongly symmetric.

• We need to distinguish three cases in the induction from $n - 1$ to $n$, depending on whether $n \bmod 3 = 0$, $1$, or $2$.

Considering the possible choices $\mathbf{c}^{(3)}_1$, $\mathbf{c}^{(3)}_2$, and $\mathbf{c}^{(3)}_3$ of (39), we see that exactly 2 components in $\mathbf{d}^{(3,n-1)}$ will be increased by 1 to form the new $\mathbf{d}^{(3,n)}$. For example, if the newly added column is $\mathbf{c}^{(3)}_1$, then $\bigl(d^{(n)}_{12}, d^{(n)}_{13}, d^{(n)}_{23}\bigr) = \bigl(d^{(n-1)}_{12},\, d^{(n-1)}_{13} + 1,\, d^{(n-1)}_{23} + 1\bigr)$.


Contents

1 Introduction
2 Definitions
3 Channel Models
4 Preliminaries
5 Flip Codes and Weak Flip Codes
6 Main Results
7 Pairwise Hamming Distance Structure
8 Conclusion
References
A Appendix: Derivation of Proposition 16
B Appendix: Derivation of Theorem 25