Least Squares Superposition Codes of Moderate Dictionary Size Are Reliable at Rates up to Capacity


Antony Joseph, Student Member, IEEE, and Andrew R. Barron, Senior Member, IEEE

Abstract—For the additive white Gaussian noise channel with average codeword power constraint, coding methods are analyzed in which the codewords are sparse superpositions, that is, linear combinations of subsets of vectors from a given design, with the possible messages indexed by the choice of subset. Decoding is by least squares (maximum likelihood), tailored to the assumed form of codewords being linear combinations of elements of the design.

Communication is shown to be reliable with error probability exponentially small for all rates up to the Shannon capacity.

Index Terms—Achieving capacity, compressed sensing, exponential error bounds, Gaussian channel, maximum likelihood estimation, subset selection.

I. INTRODUCTION

THE additive white Gaussian noise channel is basic to Shannon theory and real communication models. In superposition coding schemes, the codewords are sparse linear combinations of elements from a given dictionary. We show that superposition codes from polynomial size dictionaries with maximum likelihood (minimum distance) decoding achieve exponentially small error probability for any communication rate less than the Shannon capacity. A companion paper [8], [9] provides a fast decoding method and its analysis. The developments involve a merging of modern perspectives on statistical linear model selection and information theory.

The familiar communication problem is as follows. Input bit strings $u = (u_1, u_2, \ldots, u_K)$ of length $K$ are mapped into codewords of length $n$, with control of their power. The channel adds independent $N(0, \sigma^2)$ noise to the selected codeword, yielding a received length-$n$ string $Y$. Using the received string and knowledge of the codebook, the decoder then produces an estimate $\hat{u}$ of the transmitted string $u$. Block error is the event $\hat{u} \neq u$, bit error at position $i$ is the event $\hat{u}_i \neq u_i$, and the bit error rate is $\frac{1}{K}\sum_{i=1}^{K} 1\{\hat{u}_i \neq u_i\}$. An analogous section error rate for our code is defined later. The reliability requirement is that, with sufficiently large $n$, the bit error rate or section error rate is small with high probability, when averaged over input strings $u$ as well as the distribution of $Y$. A more stringent requirement would be to have small block error probability, again averaged over the distributions of $u$ and $Y$. As will be made clear later on, for ease of analysis, we perform a further averaging over the distribution of our design matrix.

Manuscript received June 05, 2010; revised July 07, 2011; accepted November 28, 2011. Date of publication January 31, 2012; date of current version April 17, 2012. The material in this paper was presented in part at the 2010 IEEE International Symposium on Information Theory.

The authors are with the Department of Statistics, Yale University, New Haven, CT 06520 USA (e-mail: [email protected]; an- [email protected]).

Communicated by I. Kontoyiannis, Associate Editor for Shannon Theory.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIT.2012.2184847

The communication rate $R = K/n$ is the ratio of the input length $K$ to the code length $n$ for communication across the channel.

By traditional information theory, as in [20], [35], and [60], the supremum of reliable rates is the channel capacity $C = \frac{1}{2}\log_2(1 + P/\sigma^2)$, where $P$ is a constraint on the power of the codewords and $\sigma^2$ is the noise variance. Standard communication models, even in continuous time, have been reduced to the aforementioned discrete-time white Gaussian noise setting, as in [31] and [35].
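As a quick numerical illustration of the capacity, the following minimal Python sketch evaluates the standard Shannon formula for the real Gaussian channel, one half of the base-2 logarithm of one plus the signal-to-noise ratio, which is assumed here. The function name `awgn_capacity_bits` and the argument `snr` (the ratio of the power constraint to the noise variance) are illustrative choices, not notation from the paper.

```python
import math

def awgn_capacity_bits(snr: float) -> float:
    """Capacity of the real AWGN channel in bits per channel use.

    Assumes the standard formula C = 0.5 * log2(1 + snr), where snr is the
    ratio of the codeword power constraint to the noise variance.
    """
    if snr < 0:
        raise ValueError("snr must be nonnegative")
    return 0.5 * math.log2(1.0 + snr)

# The two signal-to-noise ratios considered in Fig. 2.
for snr in (20, 100):
    print(f"snr = {snr}: C = {awgn_capacity_bits(snr):.4f} bits per channel use")
```

Any rate quoted in bits per channel use in what follows can be compared directly against these values.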

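We now describe the superposition coding scheme. The story begins with a dictionary (design matrix) $X = [X_1 : X_2 : \cdots : X_N]$, with $N$ columns $X_j \in \mathbb{R}^n$ for $j = 1, \ldots, N$. We further assume that $N = LM$, with $L$ and $M$ being positive integers. The dictionary is partitioned into $L$ sections, each of size $M$, as depicted in Fig. 1.

The codewords take the form of particular linear combinations of subsets of columns of the dictionary. Specifically, each codeword is of the form $X\beta$, where $\beta$ belongs to a set $\mathcal{B}$ given by

$$\mathcal{B} = \{\beta \in \mathbb{R}^N : \beta \text{ has exactly one nonzero entry in each section, with nonzero value equal to } 1\}.$$

Consequently, for $\beta \in \mathcal{B}$, the codeword $X\beta$ is a superposition of $L$ columns of $X$, with exactly one column selected from each section. The received vector is then in accordance with the statistical linear model

$$Y = X\beta + \varepsilon \qquad (1)$$

where $\varepsilon$ is the noise vector distributed $N(0, \sigma^2 I)$.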

For ease of encoding, it is assumed that the section size $M$ is a power of 2. The input bit strings are of length $K = L \log_2 M$, which split into $L$ substrings of size $\log_2 M$. The encoder maps $u$ to $\beta$ simply by interpreting each substring of $u$ as giving the index of which coordinate of $\beta$ is nonzero in the corresponding section. That is, each substring is the binary representation of the corresponding index.
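The encoding rule just described can be sketched as follows. This is a minimal illustration, assuming that the nonzero coefficient value is 1 and that sections occupy contiguous index ranges; the function name `encode_bits_to_beta` and the arguments `L` (number of sections) and `M` (section size) are hypothetical names chosen for the example.

```python
def encode_bits_to_beta(bits, L, M):
    """Map a bit string to a coefficient vector with one nonzero per section.

    bits: list of 0/1 values of length L * log2(M).
    Returns a list of length L * M. The nonzero value 1 is an assumption of
    this sketch.
    """
    b = M.bit_length() - 1              # log2(M) when M is a power of 2
    if M < 2 or (1 << b) != M:
        raise ValueError("M must be a power of 2, at least 2")
    if len(bits) != L * b:
        raise ValueError("bit string must have length L * log2(M)")
    beta = [0] * (L * M)
    for sec in range(L):
        chunk = bits[sec * b:(sec + 1) * b]
        # Read the substring as the binary representation of a column index.
        index = int("".join(str(bit) for bit in chunk), 2)
        beta[sec * M + index] = 1
    return beta

u = [1, 0, 0, 1]                        # two substrings: "10" and "01"
print(encode_bits_to_beta(u, L=2, M=4)) # [0, 0, 1, 0, 0, 1, 0, 0]
```

With two sections of size four, the substring "10" selects index 2 in the first section and "01" selects index 1 in the second.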

As mentioned earlier, we analyze the maximum likelihood decoder. This decoder is the same as that which chooses the $\beta$ that maximizes the posterior probability when the prior distribution is uniform over the set $\mathcal{B}$ of allowed coefficient vectors. The decoder is given by

$$\hat{\beta} = \arg\min_{\beta \in \mathcal{B}} \|Y - X\beta\|^2 \qquad (2)$$

where $\|\cdot\|$ denotes the Euclidean norm. Here, we implicitly assume that if the minimization has a nonunique solution, one may take $\hat{\beta}$ to be any value in the solution set. Since (2) is a least squares minimization problem over coefficient vectors in $\mathcal{B}$, we also call this the least squares decoder. Although this decoder is not computationally feasible, the result is significant since we show that one can achieve rates up to capacity with a codebook that has a compact representation in the form of the dictionary $X$.

Fig. 1. Schematic rendering of the dictionary matrix $X$ and coefficient vector $\beta$. The vertical bars in the $X$ matrix indicate the selected columns from a section.
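For intuition only, a brute-force version of the least squares decoder can be written for toy sizes. The sketch below assumes that the received vector is the sum of the selected columns (one per section, coefficient 1) plus Gaussian noise, and that dictionary entries are i.i.d. normal with variance `P / L`; all names (`least_squares_decode`, `L`, `M`, `n`, `P`, `sigma`) and the numerical values are illustrative assumptions.

```python
import itertools
import random

def least_squares_decode(X, Y, L, M):
    """Exhaustive least squares search: one column index per section.

    X is an n-by-(L*M) list of rows, Y a list of length n. Returns the tuple
    of within-section indices minimizing the residual sum of squares.
    """
    n = len(Y)
    best_choice, best_cost = None, float("inf")
    for choice in itertools.product(range(M), repeat=L):
        cost = 0.0
        for i in range(n):
            fitted = sum(X[i][sec * M + j] for sec, j in enumerate(choice))
            cost += (Y[i] - fitted) ** 2
        if cost < best_cost:
            best_choice, best_cost = choice, cost
    return best_choice

rng = random.Random(0)
L, M, n = 3, 4, 12          # toy sizes: the search visits M**L = 64 candidates
P, sigma = 15.0, 1.0        # illustrative power and noise standard deviation
X = [[rng.gauss(0.0, (P / L) ** 0.5) for _ in range(L * M)] for _ in range(n)]
sent = (2, 0, 3)            # selected column index within each section
Y = [sum(X[i][sec * M + j] for sec, j in enumerate(sent)) + rng.gauss(0.0, sigma)
     for i in range(n)]
print("sent:", sent, "decoded:", least_squares_decode(X, Y, L, M))
```

The number of candidates grows as `M**L`, which is why the exhaustive search is useful as a theoretical benchmark rather than as a practical decoder.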

The entries of $X$ are drawn independently from a normal distribution with mean zero and variance $P/L$. With this distribution, one has that for each $\beta \in \mathcal{B}$, the expected codeword power, given by $E\|X\beta\|^2/n$, is equal to $P$. Our design produces a distribution of codeword powers $\|X\beta\|^2/n$, across the $2^K$ codewords, that is highly concentrated near $P$, with the average codeword power $2^{-K}\sum_{\beta \in \mathcal{B}} \|X\beta\|^2/n$ having expectation $P$. Use of an average power constraint rather than an individual power constraint does not increase the capacity.

An alternative method would be to arrange the entries of to be equiprobable random variables. This would achieve an approximately Gaussian distribution for the ’s. It is very likely that this alternative design also achieves capacity, though that is not explored here.

As we have said, the rate of the code is $R = K/n$ input bits per channel use, and we arrange for $R$ arbitrarily close to $C$. For our code, this rate is $R = (L \log M)/n$. For specified rate $R$, the code length is $n = (L \log M)/R$. As explained in the following, the section size $M$ will be related to the number of sections $L$ by an expression of polynomial size. Consequently, the length $n$ and the number of terms $L$ agree to within a log factor.

Control of the dictionary size is critical to computationally advantageous coding and decoding. If the number of sections $L$ were fixed, then $M = 2^{nR/L}$ has size that is exponential in $n$, making its direct use impractical. Instead, with $L$ agreeing with $n$ to within a log factor, the dictionary size is more manageable. In this setting, we construct reliable, high-rate codes with codewords corresponding to linear combinations of subsets of terms in moderate size dictionaries.
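The contrast between the two regimes can be made concrete with a small calculation. The sketch below assumes the relations code length = (number of sections) × log2(section size) / rate and, for the polynomial regime, section size equal to a power `a` of the number of sections; the function names and the values `L = 64`, `a = 2`, `R = 1.5` are purely illustrative.

```python
import math

def code_length(L, M, R):
    """Code length n = L * log2(M) / R for rate R in bits per channel use."""
    return L * math.log2(M) / R

def section_size_for_fixed_L(n, R, L):
    """Section size M = 2**(n*R/L) needed to keep rate R with L sections."""
    return 2.0 ** (n * R / L)

L, a, R = 64, 2, 1.5                # illustrative values only
M = L ** a                          # polynomial section size
n = code_length(L, M, R)
print("polynomial regime: M =", M, " dictionary columns =", L * M, " n =", n)

# Holding the number of sections at a small fixed value forces a huge M.
print("fixed L = 4, same n and R: M =", section_size_for_fixed_L(n, R, 4))
```

In the first case the dictionary has a few hundred thousand columns, whereas with the number of sections held fixed the section size becomes astronomically large at the same code length and rate.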

The idea of superposition codes for Gaussian channels began with Cover [19] in the context of determination of the capacity region of certain multiple user channels. There, $L$ represents the number of messages decoded and a selected column represents the codeword for a message. Codes for the Gaussian channel based on sparse linear combinations have been proposed in the compressed sensing community by Tropp [64]. However, as he discusses, the rate achieved there is not up to capacity. The relationship of our study to that in these communities will be discussed in further detail later on.

We now describe our main result concerning the performance of the least squares decoder. We show that if $M = L^a$, for any $a$ exceeding a particular positive function of the signal-to-noise ratio $v = P/\sigma^2$, then rates arbitrarily close to capacity can be achieved.

This function is near for small and near 1 for large . Consequently, the dictionary has size that is polynomial in . This required section size does not depend on the gap , and thus, the dictionary has a compact representation irrespective of the closeness of to .

For , let denote the joint distribution of given . Further, let denote the number of mistakes made by the least squares decoder, that is, the number of sections in which the position of the nonzero term in is different from that in the true . Denote the error event

(3) that the decoder makes mistakes in at least fraction of sections. Assuming that is drawn from a uniform distribution over all elements from , the average probability of error conditional on is given by

Deriving bounds for the aforementioned is not easy. We follow the information theory tradition and bound the average of the earlier over the distribution of , given by

(4)

For positive , let . Furthermore, for

, let

(5)

A positive expression possessing properties explained in Section IV, Lemma 5, is used. For large , it is near a function near for small and near 1 for large . Our main result is the following.


Proposition 1: Assume , where , and rate is less than capacity . Let represents the fraction of section mistakes made by the least squares decoder. Then

with , where

(6)

is evaluated at and .

Proposition 1 is proved in Section V.

Remark: It is shown in Appendix C that the exponent

can be improved by replacing with

where

Here, is a positive function of and , which for given is near for small , where and are positive expressions as in (43) and (44) shown later.

Let . Then, it is not hard to see that (7) Accordingly, the function , appearing in the lower bound (6), may be replaced by , revealing that the exponent is, up to a constant, of the form , where

. With the improved bound in Appendix C, it is of the form .

Moreover, an approach is discussed which completes the task of identifying the terms by arranging sufficient distance between the subsets, using composition with an outer Reed–Solomon (RS) code of rate near one. It corrects the small fraction of remaining mistakes so that we end up not only with small mistake rate but also with small block error probability. If $R_{outer} = 1 - \delta$ is the rate of an RS code, with $0 < \delta < 1$, then section error rates less than $\alpha_0$ can be corrected, provided $\delta \ge 2\alpha_0$. Furthermore, if $R_{inner}$ (or simply $R$) is the rate associated with our inner (superposition) code, then the total rate after correcting for the remaining mistakes is given by $R_{tot} = R_{inner} R_{outer}$. The end result, using our theory for the distribution of the fraction of mistakes of the superposition code, is that the block error probability is exponentially small. One may regard the composite code as a superposition code in which the subsets are forced to maintain at least a certain minimal separation, so that decoding to within a certain distance from the true subset implies exact decoding. Accordingly, we make the following claim about block error probability.
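The rate bookkeeping for the concatenated code can be sketched in a few lines. The sketch assumes the usual Reed–Solomon trade-off, in which an outer code with redundancy fraction twice the target section error fraction corrects that fraction of section mistakes, and that the total rate is the product of inner and outer rates; the names `outer_rs_rate`, `total_rate`, and `alpha0` are illustrative.

```python
def outer_rs_rate(alpha0):
    """Outer RS rate 1 - 2*alpha0 (assumed trade-off for correcting a
    fraction alpha0 of section mistakes)."""
    if not 0.0 <= alpha0 < 0.5:
        raise ValueError("alpha0 must lie in [0, 0.5)")
    return 1.0 - 2.0 * alpha0

def total_rate(inner_rate, alpha0):
    """Rate of the composite code: inner rate times outer RS rate."""
    return inner_rate * outer_rs_rate(alpha0)

# Illustrative numbers: inner rate 2 bits per channel use, 5% section mistakes.
print(total_rate(inner_rate=2.0, alpha0=0.05))
```

A small tolerated mistake fraction therefore costs only a proportionally small reduction in rate.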

Proposition 2: For given fraction of mistakes , let be a rate for which the partitioned superposition code with sections has exponentially small probability of Proposition 1. Then, through concatenation with an outer RS code, one obtains a code with rate and block error probability less than or equal to .

Proposition 2 is proved in Section VI.

Particular interest is given to the case that the rate is made to

approach the capacity . Arrange and .

One may let the rate gap tend to zero (e.g., at a rate or any polynomial rate not faster than ); then, the overall

rate continues to have drop from

capacity of order , with the composite code having block error probability of order

The aforementioned exponent, of order for near , is in agreement with the form of the optimal reliability bounds as in [35] and [53], though, here, our constant is not demonstrated to be optimal.

In Fig. 2, we plot curves of achievable rates using our scheme for block error probability fixed at and signal-to-noise ratios of 20 and 100. We also compare this to a rate curve given in [53] (PPV curve), where it is demonstrated that for a Gaussian channel with signal-to-noise ratio , the block error probability , code length , and rate with an optimal code can be well approximated by the following relation:

(8)

where is the channel dispersion and is the inverse normal cumulative distribution function.

For the superposition code curve, the -axis gives the highest for which the error probability stays below . We see for the given and block error probability values, the achievable rates using our scheme are reasonably close to the theoretically best scheme. Note that the PPV curve was computed with an approach that uses a codebook of size that is exponential in block length, whereas our dictionary, of size , is of considerably smaller size.
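A benchmark curve of this kind can be sketched with the Python standard library. The code below is an approximation under stated assumptions: it keeps only the capacity term and the dispersion term of the normal approximation (any lower-order correction that may appear in (8) is omitted), it uses the standard dispersion expression for the real Gaussian channel, and the function names and the target error probability `1e-4` are illustrative choices rather than values taken from the text.

```python
import math
from statistics import NormalDist

def capacity_bits(snr):
    """Capacity of the real AWGN channel in bits per channel use."""
    return 0.5 * math.log2(1.0 + snr)

def dispersion_bits_sq(snr):
    """Channel dispersion of the real AWGN channel, in bits squared."""
    return 0.5 * snr * (snr + 2.0) / (snr + 1.0) ** 2 * math.log2(math.e) ** 2

def normal_approx_rate(snr, n, eps):
    """Two-term normal approximation: C - sqrt(V/n) * Qinv(eps)."""
    q_inv = NormalDist().inv_cdf(1.0 - eps)   # inverse Gaussian tail function
    return capacity_bits(snr) - math.sqrt(dispersion_bits_sq(snr) / n) * q_inv

for snr in (20, 100):
    for n in (500, 2000):
        rate = normal_approx_rate(snr, n, eps=1e-4)
        print(f"snr = {snr}, n = {n}: rate approx {rate:.4f} bits")
```

The gap to capacity shrinks like the inverse square root of the code length, which is the behavior the comparison in Fig. 2 is meant to display.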

A. Variants of the Superposition Scheme

To distinguish it from other sparse superposition codes, the code analyzed here may be called a partitioned superposition code. The motivations for introducing the partitioning, versus arbitrary subsets, in the superposition coding scheme are the ease in mapping the input bit string to the coefficient vector and the ease in composition with the outer RS code. Natural variants of the scheme include subset superposition coding, where one arranges for a number $L$ of the coordinates to be nonzero and taking the value 1, with the message conveyed by the choice of subset. With somewhat greater freedom, one may have signed superposition coding, where one arranges the nonzero coefficients to be $+1$ or $-1$. Then, the message is conveyed by the sequence of signs as well as the choice of subset. In both cases, if one takes the elements of $X$ to be i.i.d. $N(0, P/L)$ as before, then the expected power of each codeword is $P$. The signed superposition coding scheme has been proposed in [36] and [64].

As mentioned earlier, superposition codes began with [19] for multiuser channels in the context of determination of the capacity region of Gaussian broadcast channels. There the number of users corresponds to $L$. The codewords for user $j$, for $j = 1, \ldots, L$, correspond to the columns in section $j$. In that setting, what is sent is the sum of codewords, one from each user. With $L$ fixed, $M$ is exponential in $n$. Here, for the single-user channel, by allowing $L$ to be of the same order as $n$, to within a log factor, we make it possible to achieve rates close to capacity with polynomial size dictionaries. Related rate splitting (partitioning) for superposition codes is developed for Gaussian multiple-access problems in [18] and [59].

Fig. 2. Plot of comparison between achievable rates using our scheme and the theoretical best possible rates for block error probability of10 and signal-to-noise ratio ($v$) values of 20 and 100. The curves for our partitioned superposition code were evaluated at points with number of sections $L$ ranging from 20 to 100 in steps of 10, with corresponding $M$ values taken to be $L^a$, where $a$ is as given in Lemma 5, (32), and (33) later on. For the $v$ values of 20 and 100 shown previously, $a$ is around 2.6 and 1.6, respectively. For details on computations, refer to Appendix D.

As to the relationship to single-user decoding, note that in the Gaussian broadcast channel, with optimal decoding, it is arranged that one of the receivers decodes all the messages. This is also the case for the multiple access channel receiver. The terminology of superposition codes, rate splitting (partitioning), and issues of power allocations arise from such work in multiuser Shannon theory. Here, to achieve the benefits of the reduced size dictionary, we decode the sections jointly rather than successively. Joint decoding allows the power allocation to be constant across sections. In the companion paper [8], in achieving a practical decoder, we do make use of standard variable power allocation in the sections.

Sparse superposition codes have been proposed for communication in random access channels as in [14] and [27].

Our ideas of sparse superposition coding are adapted to Gaussian vector quantization in [42].

B. Related Work on Sparse Signal Recovery

While reviewing works on sparse signal recovery and compressed sensing, we adhere to our notation that we have a linear model of the form

where is a deterministic or random matrix and has exactly nonzero values. The quantities , and will be called parameters for the model. In our description in the following, we denote as some positive constant whose value will change from time to time.

The conclusions here complement recent work on sparse signal recovery in the linear model setup as we now discuss.

In a broad sense, these works analyze for various schemes (practical or otherwise), conditions on the parameters so that certain reliability requirements are satisfied with high probability. Closely related to our work is the requirement that only the indices corresponding to the nonzero elements of , that is the support of , be recovered exactly or almost exactly.

The conditions explored by this community do translate into results on communication rate, though heretofore not rates up to capacity.

In this paper, in order to achieve rates arbitrarily close to capacity, we require $M = L^a$, with precise values of $a$ specified later on, putting us in the sublinear sparsity regime, that is, $L/N \to 0$ as $N \to \infty$. Also, if we change the scale and take the elements of the $X$ matrix as i.i.d. standard normal, the nonzero values of $\beta$ assume the value $\sqrt{P/L}$. Accordingly, although most of the claims in this area are for more general sparsity regimes and values of $\beta$, the results most relevant to us are those for the sublinear sparsity regime and when the nonzero $\beta_j$’s are

at least .

A significant portion of the work in this area focuses on deterministic matrices satisfying certain assumptions. A common assumption is the mutual incoherence condition [15], [24], [33] which places controls on the magnitude correlation between distinct pairs of columns. Another related assumption is the exact recovery condition [64], [68], [72].

The recovery uses $\ell_1$-relaxation methods such as Lasso [62] or iterative methods such as orthogonal matching pursuit [49], [52]. This line has been pursued by the authors in [23], [24], [32], [70], [72], and others, for general sparse signal recovery problems and by Tropp [64] for the communication problem. While the aforementioned results cover broad classes of dictionaries, they impose severe constraints on the dictionaries.

Indeed, when applied to Gaussian matrices, they require

or sparsity ,

which would correspond to rate approaching 0. In contrast,

for our scheme , which using


and , one gets that is sufficient for subset recovery, which is of a smaller order of magnitude than the aforementioned. Consequently, these results on deterministic matrices, when applied to our setting, are insufficient to communicate at positive rates, let alone rates close to capacity.

The aforementioned works allow for decoding of arbitrary sparse subsets with high probability. This rather stringent form of conclusion corresponds to worst case error probability in the communication setting.

Work which does correspond to positive rate, when translated to the communication setting, arises from three approaches.

First, there is the work of Candes and Plan [15] and Tropp [65], [66], in which one looks at the probability of error averaged over codewords (i.e., the subset is chosen randomly). This achieves reliable support recovery with as high as . Second, there is the work of Zhang [71] that employs a more involved forward/backward stepwise selection algorithm, for dictionaries satisfying certain properties, to achieve reliable performance for arbitrary subsets (worst case error probability), again for as high . However, the constants are such that demonstration that rates up to capacity can be achieved has been lacking.

Third, analysis using random matrices in the noisy setting has also been carried out in [17], [22], and [68], among others, where the analysis in [68] addresses the issue of support recovery. More closely related to ours, support recovery of the least squares decoder is analyzed in [2], [28], and [67], for Gaussian matrices, where Akcakaya and Tarokh [2] also address the issue of partial support recovery. As with the aforementioned works, one can infer from this that communication at positive rates is possible using random designs. The signal recovery purpose is somewhat different here from our communications purpose, in that the work typically does not constrain the nonzero coefficients to the same value, and the resulting freedom in their values leads to order of magnitude conclusions that obstruct interpretation in terms of exact rate.

Furthermore, there are results giving necessary conditions for exact support recovery [29], [67], [69] and for partial support recovery [2]. Both of these agree in terms of order of magnitude, requiring an order of for the regime we deal with. In [56], it is shown that in the linear sparsity regime, that is, when is of the same order as , one requires for reliable recovery of the support. An implication of this is that the sublinear sparsity regime is necessary for communication at positive rates.

Consequently, one can infer, from some of the aforementioned works, that communication at positive rates is possible with sparse superposition codes. We add to the existing literature by showing that one can achieve any rate up to capacity in certain sparsity regimes with a compact dictionary, albeit for a computationally infeasible scheme. Furthermore, we demonstrate that the error exponents are of the optimal form.

C. Practical Decoding Algorithms Approaching Capacity

Along with this paper, we pursued the problem of achieving capacity using computationally feasible schemes. In [8] and [9], an iterative decoding scheme, called adaptive successive decoding, is analyzed. This is similar in spirit to iterative decoding techniques such as forward stepwise regression [7], [41], relaxed greedy algorithm [6], [38], [44], and orthogonal matching pursuit [49], [52], and other iterative algorithms [13], [21], [51].

The rate attained there is of the order of below capacity, with corresponding error probability being exponentially small in . These performance levels are not as good as obtained here with the optimal decoder. The sparse superposition codes achieving these performance levels, by least squares and by adaptive successive decoding, are different in an important aspect. For this paper, we use a constant power allocation, with the same power for each term. However, in [8] and [9], to yield rates near capacity, we needed a variable power allocation, achieved by a specific schedule of the nonzero ’s. In contrast, if one were to use equal power allocation for the decoding scheme, then reliable decoding holds only up to a threshold rate , which is less than the capacity . Since the focus here is on the least squares decoder, we defer detailed discussion to the later paper [9].

D. Related Communication Issues and Schemes

The development, here, is specific to the discrete-time

channel for which for with

real-valued inputs and outputs and with independent Gaussian noise.

Standard approaches, as discussed in [31], entail a decomposition of the problem into separate problems of coding and of shaping of a multivariate signal constellation. Notice that we build shaping directly into the coding scheme by choosing codewords to follow a Gaussian distribution.

For the low signal-to-noise regime, binary codes suffice for communication near capacity and there is no need for shaping.

The performance of the maximum likelihood decoder for binary linear codes, with a random design matrix and with exponential error bounds at rates up to capacity for the binary symmetric channel, has been established in [26]. Computationally feasible schemes, with empirically good performance, for discrete channels include LDPC codes [34], [46], [47], [57], [58] and turbo codes [12], [50]. Error bounds for rates up to capacity for expander codes (related to LDPC) are shown in [5] and for LDPC codes with random low-density design matrix in [43], whereas turbo codes have an error floor [37], [54] that precludes such exponential scaling of error probability. Thus, the work in [5], [26], and [43], with a random design matrix of controlled size, are conclusions for discrete channels that correspond to the conclusion obtained here for the Gaussian channel for rates up to capacity.

Recently, practical and capacity-achieving polar codes have been developed for discrete channels [3], [4], though with an error probability that is exponentially small in $\sqrt{n}$ rather than $n$. Unlike the present development, it remains unknown how the exponent for the polar codes depends on the closeness of $R$ to $C$.

When the signal-to-noise ratio is not small, proper shaping for the Gaussian channel requires larger size signal alphabets, as explained in [31]. For example, Abbe and Barron [1] provide such analysis, adapting polar codes for use on the Gaussian channel.

The analysis of concatenated codes in [30] is an important forerunner to the development we give here. The author in [30] identified benefits of an outer RS code paired in theory with an optimal inner code of Shannon–Gallager type and in practice with binary inner codes based on linear combinations of orthogonal terms (for target rates less than 1 such a basis is available). The challenge concerning theoretically good inner codes is that the number of messages searched is exponentially large in the inner code length. Forney made the inner code length of logarithmic size compared to the outer code length as a step toward a practical solution. However, caution is required with such a strategy. Suppose the rate of the inner code has only a small drop from capacity, . For small inner code error probability, the inner code length must be of order at least . So with that scheme, one has the undesirable consequence that the required outer code length becomes exponential in .

For the Gaussian noise channel, our tactic to overcome that difficulty uses a superposition inner code with a polynomial size dictionary. We use inner and outer code lengths that are comparable, with the outer code used to correct errors in a small fraction of the sections of the inner code. The overall code length to achieve error probability remains of the order

.

Section II contains brief preliminaries. Section III provides core lemmas on the reliability of least squares for our superposition codes. Section IV analyzes the matter of section size sufficient for reliability. In Sections V and VI, we give proofs of Propositions 1 and 2, respectively. In Section VII, we discuss how our results can be adapted for an approximate form of the least squares decoder. The Appendix collects some auxiliary matters.

II. PRELIMINARIES

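For vectors $a, b$ of length $n$, let $\|a\|^2 = \sum_{i=1}^{n} a_i^2$ be the sum of squares of coordinates, let $|a|^2 = (1/n)\sum_{i=1}^{n} a_i^2$ be the average square, and let $a \cdot b = (1/n)\sum_{i=1}^{n} a_i b_i$ be the associated inner product. It is a matter of taste, but we find it slightly more convenient to work, henceforth, with the norm $|a|$ rather than $\|a\|$.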

Concerning the base of the logarithm ($\log$) and associated exponential ($\exp$), base 2 is most suitable for interpretation and base $e$ most suitable for the calculus. For instance, the rate $R = (L \log M)/n$ is measured in bits if the log is base 2 and nats if the log is base $e$. Typically, conclusions are stated in a manner that can be interpreted to be invariant to the choice of base, and base $e$ is used for convenience in the derivations.

We make repeated use of the following moment generating function and its associated Cramer–Chernoff large deviation exponent in constructing bounds on error probabilities. If and are normal with means equal to 0, variances equal to 1, and correlation coefficient , then

(9)

when and infinity otherwise. So, taking

the logarithm, the associated cumulant generating function

of is , with the

understanding that the minus log is replaced by infinity when is at least . For positive , we define the quantity

given by

(10) The expression corresponding to but with the maximum

restricted to is denoted as , that is

(11) When the optimal is strictly less than 1, the value of matches as given previously.

The case occurs when

, or equivalently . Then, the exponent is

, which is at least .

Consequently, in this regime, is between and . The special case is included with .
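The closed form of the moment generating function used above can be checked by a short calculation. The derivation below is a sketch under the assumption that the statistic in (9) is $\tfrac{\lambda}{2}(Z^2 - \tilde{Z}^2)$ for standard normal $Z$ and $\tilde{Z}$ with correlation coefficient $\rho$; the symbols $Z$, $\tilde{Z}$, $\lambda$, $\rho$, $U$, and $W$ are notation chosen for this illustration.

```latex
\begin{align*}
Z^2 - \tilde{Z}^2 &= U W, \qquad U = Z - \tilde{Z}, \quad W = Z + \tilde{Z}, \\
\operatorname{Cov}(U, W) &= \operatorname{Var}(Z) - \operatorname{Var}(\tilde{Z}) = 0,
\qquad \operatorname{Var}(U) = 2(1 - \rho), \quad \operatorname{Var}(W) = 2(1 + \rho), \\
E\!\left[ e^{\frac{\lambda}{2} U W} \,\middle|\, U \right]
&= \exp\!\left( \tfrac{\lambda^2}{4} (1 + \rho)\, U^2 \right)
\qquad \text{(Gaussian moment generating function of } W\text{)}, \\
E\!\left[ e^{\frac{\lambda}{2} (Z^2 - \tilde{Z}^2)} \right]
&= E\!\left[ \exp\!\left( \tfrac{\lambda^2}{4} (1 + \rho)\, U^2 \right) \right]
= \left[ 1 - \lambda^2 (1 - \rho^2) \right]^{-1/2}
\qquad \text{for } \lambda^2 < \frac{1}{1 - \rho^2}.
\end{align*}
```

Here $U$ and $W$ are jointly normal and uncorrelated, hence independent, and the last step uses $E[e^{sU^2}] = (1 - 2s\operatorname{Var}(U))^{-1/2}$. If, as a further assumption, the exponent in (10) is the maximum over $\lambda \ge 0$ of $\lambda\Delta + \tfrac{1}{2}\log(1 - \lambda^2(1 - \rho^2))$, then this objective is concave in $\lambda$, its derivative at $\lambda = 1$ is nonnegative exactly when $\Delta \ge (1 - \rho^2)/\rho^2$, and its value at $\lambda = 1$ is $\Delta + \tfrac{1}{2}\log\rho^2 \ge \Delta - \tfrac{1}{2}\log(1 + \Delta)$.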

There is a role for the function

(12)

for , where is the signal-to-noise ratio

and is the channel capacity. We

note that is a nonnegative concave function equal to 0 when is 0 or 1 and strictly positive in between. The quantity is larger by the additional amount , positive when the rate is less than the Shannon capacity .

Remark on average codeword power: The average codeword

power has expectation with respect to

that matches , for all . The distribution of the average codeword power is tightly concentrated around as explained in [11, Appendix], and will not be explored further here.

III. PERFORMANCE OF LEAST SQUARES

In this section, we examine the performance of the least squares decoder (2) in terms of rate and reliability. For , let denote the set of indices for which is nonzero. Furthermore, let

(13) denote the set of allowed subsets of terms. It corresponds to the subsets of column indices of size $L$ comprising exactly one term from each section.

Recall that we are interested in bounding given in (4).

By symmetry

where . Correspondingly, for fixed

, we proceed to obtain bounds for . Let . Furthermore, let be the least squares solution (2) and

. Notice that , which is

also the number of sections incorrectly decoded.
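The identity between the size of the set difference and the number of incorrectly decoded sections can be illustrated directly. In the sketch below, a message is represented by the tuple of selected column indices within each section; the function names and the section size argument `M` are illustrative, and contiguous section layout is assumed.

```python
def index_set(choice, M):
    """Global column indices selected by a tuple of within-section indices."""
    return {sec * M + j for sec, j in enumerate(choice)}

def section_mistakes(sent, decoded):
    """Number of sections in which the decoded index differs from the sent one."""
    if len(sent) != len(decoded):
        raise ValueError("both choices must cover the same number of sections")
    return sum(1 for s, d in zip(sent, decoded) if s != d)

def section_error_rate(sent, decoded):
    """Fraction of sections decoded incorrectly."""
    return section_mistakes(sent, decoded) / len(sent)

sent, decoded, M = (2, 0, 3, 1), (2, 1, 3, 0), 4
# Because each allowed subset has exactly one index per section, the set
# difference has one element per mistaken section.
assert len(index_set(decoded, M) - index_set(sent, M)) == section_mistakes(sent, decoded)
print(section_mistakes(sent, decoded), section_error_rate(sent, decoded))
```

The section error rate computed this way is the quantity that the error event compares against a threshold fraction.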


For let be the event that there are exactly mistakes. Now, can be expressed as a disjoint union of , for . Correspondingly

(14)

In the following two lemmas, we give bounds for

for .

Lemma 3: Set for an . The prob-

ability can be bounded by , where

(15)

where and . Here, is

the signal-to-noise ratio.

Remark: Notice that depends also on , and . Whether is exponentially small depends on the relative size of the combinatorial term and the exponential term in

and .

Proof of Lemma 3: For the occurrence of , there must be an which differs from the subset sent in an amount

and which has , or equivalently has , where

(16) The analysis proceeds by considering an arbitrary such , bounding the probability that , and then using an appropriately designed union bound to put such probabilities together. Notice that the subsets and have an intersection

of size and difference of size

.

Let denote the joint density of and when is sent. Furthermore, let . The actual density of given , denoted by , has mean and variance . Furthermore, there is conditional inde- pendence of and given .

Next, consider the alternative hypothesis that was sent and let denote the corresponding joint density under this hypothesis. The conditional density for given and ,

denoted by , is now Normal( ). With

respect to this alternative hypothesis, the conditional distribu-

tion for given remains Normal( ). That

is, .

We decompose the test statistic in (16) as , where (17) and

(18) Note that depends only on terms in , whereas

depends also on the part of not in .

Concerning , note that we may express it as

(19) where

is the adjustment by the logarithm of the ratio of the normalizing constants of these densities. Using Bayes rule, notice that

Correspondingly, one gets from (19) that

(20) We are examining the event that there is an , with and . For positive , the indicator of this event satisfies

where is of size and of size .

The earlier follows since if there is such an with , then indeed that contributes a term on the right side of value at least 1. Here, the outer sum is over . For each such , for the inner sum, we have sections in each of which, to comprise , there is a term selected from among choices other than the one prescribed by .
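The counting behind the two sums can be verified by enumeration on a toy example. The sketch below assumes that a competing subset differing from the sent one in exactly `ell` sections is specified by choosing which `ell` of the `L` sections are wrong and then one of the `M - 1` other terms in each of them; the function names are illustrative.

```python
import itertools
import math

def count_by_formula(L, M, ell):
    """Number of allowed subsets differing from a fixed one in ell sections."""
    return math.comb(L, ell) * (M - 1) ** ell

def count_by_enumeration(sent, M, ell):
    """Same count obtained by brute-force enumeration of all M**L choices."""
    L = len(sent)
    return sum(
        1
        for cand in itertools.product(range(M), repeat=L)
        if sum(c != s for c, s in zip(cand, sent)) == ell
    )

sent, M = (0, 1, 2), 3
for ell in range(len(sent) + 1):
    print(ell, count_by_formula(len(sent), M, ell), count_by_enumeration(sent, M, ell))
```

Summing the counts over all values of `ell` recovers the total number `M**L` of codewords, which is a useful sanity check on the union bound bookkeeping.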

To bound the probability of , take the expectation of both sides, bring the expectation on the right inside the outer sum, and write it as the iterated expectation, where on the inside condition on , , and to pull out the factor involving , to get that is not more than

Notice that , that is, is

independent of , and . Correspondingly, the inner expectation may be expressed as . Furthermore, we arrange for to be not more than 1. Then, by Jensen’s inequality, the expectation may be brought inside the power and inside the inner sum, yielding

(21)

Recall that

from (20). Consequently, one has


which is equal to . The sum over entails less than , where , choices so the bound (21) becomes

(22)

The sum over in the aforementioned expression is over terms. Furthermore, is a sum of independent mean- zero random variables each of which is the difference of squares of normals for which the squared correlation is . So using (9), the expectation is found to be equal

to . When plugged in earlier and

optimized over in , one gets from the expression of given in (11) that the expectation in the right side of (22) is equal to . This completes the proof of the lemma.
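The moment generating function step can be checked numerically. The sketch below assumes the standard closed form for standard normals Z1 and Z2 with correlation rho, namely E exp{(lam/2)(Z1^2 - Z2^2)} = [1 - lam^2 (1 - rho^2)]^(-1/2), valid when lam^2 (1 - rho^2) < 1. The function names, the sample size, and the chosen values of lam and rho are illustrative assumptions.

```python
import math
import random


def mgf_closed_form(lam, rho):
    """Closed form of E exp{(lam/2)(Z1^2 - Z2^2)} for correlated standard normals."""
    arg = 1.0 - lam ** 2 * (1.0 - rho ** 2)
    if arg <= 0:
        raise ValueError("the expectation is infinite for these values")
    return 1.0 / math.sqrt(arg)


def mgf_monte_carlo(lam, rho, samples=200_000, seed=0):
    """Monte Carlo estimate of the same expectation, using a fixed seed."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        z1 = rng.gauss(0.0, 1.0)
        w = rng.gauss(0.0, 1.0)
        # z2 is standard normal with correlation rho to z1.
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * w
        total += math.exp(0.5 * lam * (z1 ** 2 - z2 ** 2))
    return total / samples


lam, rho = 0.5, 0.6
print("closed form:", mgf_closed_form(lam, rho))
print("simulation :", mgf_monte_carlo(lam, rho))
```

A larger squared correlation brings the closed form closer to 1, which is the sense in which maximizing the correlation reduces the spread of the difference of squares.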

Remark: A natural question to ask is why we did not use the simpler union bound for given by

where , is any set with . One could

then use a Chernoff bound for the term . Indeed, this is what we tried initially; however, due to the presence of the two combinatorial terms, we were unable to make the aforementioned go to zero, with large , for all rates less than capacity. In our aforementioned proof, by introducing the term in the exponent, we were able to reduce the term to . Optimizing over revealed the best bound using this method. Somewhat similar analysis has been done before to obtain error exponent for the standard channel coding problem, for example, in [35].

A difficulty with the Lemma 3 bound is that for near 1 and for correspondingly close to , in the key quantity

, the order of is , which is too close to zero to cancel the effect of the combinatorial coefficient .

The following lemma refines the analysis of Lemma 3, obtaining the same exponent with an improved correlation coefficient. The denominator of now becomes

. This is an improvement due to the presence of the factor allowing the conclusion to be useful also for near 1. The price we pay is the presence of an additional term in the bound.

Lemma 4: Let a positive integer be given and let . Then, is bounded by the minimum for

in the interval of , where

(23)

where, here, the quantities and

Proof of Lemma 4: Split the test statistic where

and

Take positive and negative . Then,

, with being the event that there is an , with and . Similarly, is the corresponding event that . The part has no dependence on so its treatment is simpler. It is a mean zero average of differences of squared normal random variables, with squared correlation . So using its moment generating function, is exponentially small, bounded by the second of the two expressions in (23).

Concerning , its analysis is much the same as for Lemma 3. We again decompose as the sum

, where is the same as earlier. The difference is that in forming we subtract rather than . Consequently

which again involves a difference of squares of standardized normals. But here the coefficient multiplying is such that we have maximized the correlations between the

and . Consequently, we have reduced the spread of the distribution of the differences of squares of their standardizations as quantified by the cumulant generating function.

One finds that the squared correlation coefficient is

for which .

Accordingly we have that the moment generating function is which gives rise to the bound appearing as the first of the two expressions in (23). This completes the proof of Lemma 4.

From Lemma 4, one gets that , where

Consequently, from Lemmas 3 and 4, along with (14), one gets

that , where

(24)

This is the bound we use to numerically compute the rate curve in Fig. 2. Accordingly, the error exponent of Proposition 1 satisfies

(25)

Our task will be to give simplified lower bounds for the right side of (25) for all . In the next section, we characterize the section size required to achieve rates up to capacity. In Section V, we prove Proposition 1 and in Section VI, we prove Proposition 2. We also remark that in Appendix F we discuss how the bounds of the aforementioned two lemmas may be modified to deal with the subset superposition coding scheme described in Section I-A.


Since the bounds of Lemma 4 are better than those in Lemma 3 for values near 1, for simplicity we only use the bounds from Lemma 4 in characterizing the error exponents. Correspondingly, from here on we take

(26) as in Lemma 4.

IV. SUFFICIENT SECTION SIZE

We call $a = (\log M)/(\log L)$ the section size rate, that is, the bits required to describe the member of a section relative to the bits required to describe which section. It is invariant to the base of the log. Equivalently, we have $M$ and $L$ related by $M = L^a$. Note that the size of $a$ controls the polynomial size of the dictionary $N = LM = L^{a+1}$.

The code length may be written as

$$n = \frac{a L \log L}{R}.$$

We do not want a requirement on the section sizes with of order for then the complexity would grow exponentially with this inverse of the gap from capacity. So, instead, we

decompose where .
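To make the bookkeeping concrete, the sketch below computes the section size rate and the code length for given numbers of sections and section sizes. It assumes the relations a = log M / log L and n = L log2(M) / R, with the rate R in bits per channel use, together with the Shannon capacity C = (1/2) log2(1 + v) at signal-to-noise ratio v. The function names and the numerical values are illustrative.

```python
import math


def capacity_bits(snr):
    """Shannon capacity of the Gaussian channel in bits per channel use."""
    return 0.5 * math.log2(1.0 + snr)


def section_size_rate(num_sections, section_size):
    """Section size rate a = log M / log L (any base of logarithm)."""
    return math.log(section_size) / math.log(num_sections)


def code_length(num_sections, section_size, rate_bits):
    """Code length n = L * log2(M) / R for a rate R given in bits."""
    return num_sections * math.log2(section_size) / rate_bits


L, M, snr = 64, 64 ** 2, 15.0
C = capacity_bits(snr)
R = 0.75 * C
a = section_size_rate(L, M)
n = code_length(L, M, R)
print("a =", a, " C =", C, " n =", n, " dictionary size =", L * M)
```

With M equal to L squared the dictionary has L cubed columns, which illustrates how the section size rate controls the polynomial size of the dictionary.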

We investigate in this section the use of to cancel out the combinatorial coefficient appearing in the first term in (23). In subsequent sections, excess in , beyond that needed to cancel the combinatorial coefficient, plus are used to produce exponentially small error probability.

Define and .

Now, is increasing as a function of , so is greater than whenever . Accordingly, we decompose the exponent as the sum of two components,

namely, and the difference .

We then ask whether the first part of the exponent denoted is sufficient to cancel out the effect of the log combinatorial coefficient . That is, we want to arrange for the nonnegativity of the difference

(27) This function is plotted in Fig. 3 for specific choices of , ,

, and .

Using , one finds that for sufficiently large depending on , the difference is nonnegative uniformly for the permitted in . The smallest such section size rate is

(28)

where the maximum is for in . This

definition is invariant to the choice of base of the logarithm, assuming that the same base is used for the communication rate

and for the that arises in the definition of . In the aforementioned ratio, the numerator and denominator are both 0 at and (yielding at the ends).

Accordingly, we have excluded 0 and 1 from the definition of for finite . Nevertheless, limiting ratios arise at these ends.

We give bounds for and show that the value of is fairly insensitive to the value of , with the maximum over the whole range being close to a limit which is characterized by values in the vicinity of .

Let near 15.8 be the solution to

Lemma 5: The quantity has the following properties.

(a) For

(29)

where .

(b) The limit for large of is a continuous function which is given, for , by

(30)

and for by

(31)

(c) For all and using log base e, the aforementioned is bounded by

(32) in the case , which is approximately for small positive , whereas in the case , it is bounded by

(33) which asymptotes to the value 1 for large .

The proof of the aforementioned lemma is routine. For convenience, it is given in Appendix B.

While is undesirably large for small , we have reasonable values for moderately large . In particular, equals 5.0 and 3, respectively, at and , and it is near 1 for large . Numerically, it is of interest to ascertain the minimal section size rate , for a specified such as , for chosen to be a given high fraction of , say , for at a fixed small target fraction of mistakes, say , and for

to be a small target probability, so as to obtain . Here, as in (24). This is illustrated in Fig. 4 plotting the minimal section size rate as a function of for . With such moderately less than , we observe substantial reduction in the required section size rate.

V. PROOF OF PROPOSITION 1

In this section, we put the aforementioned conclusions together to prove Proposition 1, demonstrating the reliability of approximate least squares. The following lemma will be useful in proving the lower bound for the error exponent in Proposition 1. Let as earlier.


Fig. 3. Exponents of contributions to the error probability as functions of ℓ/L using exact least squares, i.e., t = 0, with L = 100, M = 2 , signal-to-noise ratio v = 15, and rate 70% of capacity. The red and blue curves are the −log[Ẽ] and −log[E] bounds, using the natural logarithm, from the two terms in Lemma 4 with optimized t. The dotted green curve is d (27). With = 0.1, the total probability of at least that fraction of mistakes is bounded by 1.8 × 10 .

Fig. 4. Sufficient section size rate a as a function of the signal-to-noise ratio v. The dashed curve shows a at L = 64. Just below it, the thin solid curve is the limit for large L. For section size M L , the error probabilities are exponentially small for all R < C and any > 0. The bottom curve shows the minimal section size rate for the bound on the error probability contributions to be less than e , with R = 0.8C and = 0.1 at L = 64.

Lemma 6: The following bounds hold.

(a) For positive and correlation , let

. Then

(34)

and

(35)

(b) For , let . Then

(36)

For convenience, we put its proof in Appendix A.

We now prove Proposition 1. Consider the exponent appearing in the error bound (23).

Now, has a nondecreasing derivative with

respect to . So is greater than

. Consequently, it lies above the tangent line (the first order Taylor expansion) at , that is

(37) where is the derivative of

with respect to , which is, here, evaluated at . In detail, the derivative is seen to equal

(38)

when , and this derivative is equal to 1 otherwise. (The latter case with derivative equal to 1 includes the

situations and where with ; all

other have ).

We now lower bound the derivative evaluated at . Using the upper bound on given in (36) and the form of , one gets that is bounded by

, which using

and , one gets that

Further using the lower bound in (36), one has

is at least , where we make use of .

Correspondingly

(39) the right side of which is , where is as in (5).

Now, we are in a position to apply Lemma 4 and Lemma 5.

If the section size rate is at least , we have that cancels the combinatorial coefficient , and hence, the first term in the bound (23) (the part controlling ) is not more than

where . Using and

and (39) yield not more than the sum of

and


for any choice of . For convenience, we take to be . In this case, the first part of the aforementioned

sum is .

Now, use (34) to get that is at

least , where .

Correspondingly, using , one gets that

. Accordingly, is

at least .

It follows from the aforementioned that

where . Consequently, summing over all

, for which , one gets

The exponent in the right side of the aforementioned equation

is . Now, use the to complete the proof of Proposition 1.

Remarks: The form given for the exponential bound is meant only to reveal the general character of what is available. A compromise was made by introduction of an inequality (the tangent bound on the exponent) to proceed most simply to this demonstration. Now, understanding that it is exponentially small, our best evaluation avoids this compromise and proceeds directly, using the bound (24), as it provides substantial numerical improvement.

In the next section, we prove Proposition 2 and, at the same time, review basic properties of the RS codes.

VI. PROOF OF PROPOSITION 2

We employ RS codes [45], [48], [55] as an outer code for correcting any remaining section mistakes. The symbols for the RS code come from a Galois field consisting of elements denoted by , with typically taken to be of the form . If represent message and codeword lengths, respectively, then an RS code with symbols in and minimum distance between codewords given by can have the following parameters:

Here, gives the number of parity check symbols added to the message to form the codeword. In what follows, we find it convenient to take to be equal to so that we can view each symbol in as giving a number between 1 and .

We now demonstrate how the RS code can be used as an outer code in conjunction with our inner superposition code to achieve low block error probability. For simplicity, assume that is a power of 2. First, consider the case when equals . Taking , we have that since is equal to , the RS code length becomes . Thus, one can view each symbol as representing an index in each of the sections. The number of input

symbols is, then, , so setting ,

one sees that the outer rate equals which is at

least .

For code composition, message bits become the input symbols to the outer code. The symbols of the outer codeword, having length , give the labels of terms sent from each section using our inner superposition with

code length . From the received , the

estimated labels using our least squares decoder can be again thought of as output symbols for our RS codes. If denotes the section mistake rate, it follows from the distance property of the outer code that if , then these errors can be corrected. The overall rate is seen to be equal to the product of rates which is at least . Since we arrange for to be smaller than some with exponentially small probability , it follows from the previous that composition with an outer code allows us to communicate with the same reliability, albeit with a slightly

smaller rate given by .

The case when can be dealt with by observing ([45], p. 240) that an RS code as aforementioned can be shortened by length , where , to form an

code with the same minimum distance as earlier. This is easily seen by viewing each codeword as being created by appending parity check symbols to the end of the corresponding message string. Then, the code formed by considering the set of codewords with the leading symbols identical to zero has precisely the properties stated earlier.

With equal to as earlier, we have equals ; so

taking to be , we get an code, with

, , and minimum distance . Now,

since the outer code length is and symbols of this code are in the code composition can be carried out as earlier.

This completes the proof of Proposition 2.
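The parameter bookkeeping for the outer code can be sketched as follows. The sketch relies only on standard facts: a Reed-Solomon code is maximum distance separable, so its minimum distance equals length minus dimension plus one; a code of minimum distance d corrects up to floor((d - 1)/2) symbol errors; and shortening by s symbols reduces length and dimension by s while preserving the minimum distance. Taking the unshortened length equal to the field size M, the rule that picks the distance from a target fraction delta, the function names, and the numerical values are all illustrative assumptions.

```python
def distance_for_fraction(length, delta):
    """Largest integer distance d with d <= delta * length (and at least 1)."""
    return max(1, int(delta * length))


def mds_code(n, d):
    """Parameters of a maximum distance separable code: k = n - d + 1."""
    if not 1 <= d <= n:
        raise ValueError("need 1 <= d <= n")
    return {"n": n, "k": n - d + 1, "d": d}


def shorten(code, s):
    """Shorten by s symbols: length and dimension drop by s, distance is kept."""
    if not 0 <= s < code["k"]:
        raise ValueError("need 0 <= s < k")
    return {"n": code["n"] - s, "k": code["k"] - s, "d": code["d"]}


def correctable_errors(code):
    """Number of symbol errors guaranteed correctable by the distance property."""
    return (code["d"] - 1) // 2


def rate(code):
    """Outer code rate k / n."""
    return code["k"] / code["n"]


# Hypothetical sizes: L = 48 sections, field size M = 64, target fraction 0.125.
L, M, delta = 48, 64, 0.125
d = distance_for_fraction(L, delta)
outer = shorten(mds_code(M, d), M - L)
print(outer, "rate:", rate(outer), "correctable:", correctable_errors(outer))
```

In this sketch the shortened code has length L, so each outer symbol can label the term chosen in one section, and its rate is at least 1 - delta because k/n = 1 - d/L + 1/L with d at most delta L.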

VII. GENERALIZATION TO APPROXIMATE LEAST SQUARES

In conclusion, we remark that our results are equally valid for an approximate least squares decoder, which for some nonnegative chooses a satisfying

(40)

where is what is sent. Since the aforementioned is less restrictive than (2), it may be possible to find a computationally feasible algorithm for it. Indeed, we show in Appendix E that any computationally feasible algorithm that is an accurate decoder must be an approximate least squares decoder for some small .

We now describe how our error probability bounds can be generalized to incorporate (40). We note that (40) is equivalent

to finding an , so that , with ,

where is as in (16). We find that the expression for

in Lemma 3 holds for approximate least squares decoders with

, if we replace by . Furthermore, the expression for of Lemma 4 is also true for , if one replaces the appearing in the second term of the bound by . Accordingly, for such approximate


decoders, with , the bound corresponding to lemma 4 becomes

(41)

where and

is as in lemma 4.

The analysis of this decoder is quite similar to that of (2).

Interested readers may refer to [11] for a more general analysis incorporating (40).
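A toy illustration of exact and approximate least squares decoding by exhaustive search is given below. It assumes a partitioned dictionary with L sections of M columns each, one term selected per section, i.i.d. standard normal entries, and equal coefficients sqrt(P/L). The acceptance rule `is_approx_ls`, with an additive slack relative to the codeword sent, is only one plausible reading of condition (40), and all names and sizes are hypothetical. Exhaustive search costs M^L distance evaluations, so it is feasible only for tiny examples.

```python
import itertools
import math
import random


def make_dictionary(n, L, M, rng):
    """X[sec][j] is the j-th length-n column of section sec."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(M)]
            for _ in range(L)]


def codeword(X, choice, coeff):
    """Superposition of one column per section, scaled by a common coefficient."""
    n = len(X[0][0])
    return [coeff * sum(X[sec][j][i] for sec, j in enumerate(choice))
            for i in range(n)]


def sq_dist(y, x):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(y, x))


def least_squares_decode(X, y, coeff):
    """Exact least squares: search all M**L choices of one term per section."""
    L, M = len(X), len(X[0])
    return min(itertools.product(range(M), repeat=L),
               key=lambda c: sq_dist(y, codeword(X, c, coeff)))


def is_approx_ls(X, y, coeff, candidate, sent, slack):
    """Assumed rule: the candidate fits within `slack` of the fit of what was sent."""
    return (sq_dist(y, codeword(X, candidate, coeff))
            <= sq_dist(y, codeword(X, sent, coeff)) + slack)


rng = random.Random(1)
n, L, M, P, sigma = 12, 3, 4, 15.0, 1.0
X = make_dictionary(n, L, M, rng)
coeff = math.sqrt(P / L)
sent = (2, 0, 3)
y = [x + rng.gauss(0.0, sigma) for x in codeword(X, sent, coeff)]
decoded = least_squares_decode(X, y, coeff)
print("decoded:", decoded, "sent:", sent)
print("within slack 0:", is_approx_ls(X, y, coeff, decoded, sent, slack=0.0))
```

By construction the exact least squares choice always passes the check with zero slack, since its distance to the received string is no larger than that of the codeword sent; a positive slack enlarges the set of acceptable candidates.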

APPENDIX A PROOF OF LEMMA 6

We first prove (a). Write explicitly as an increasing function of the ratio . Working with logarithm base , the derivative with respect to of the expression being maximized yields a quadratic equation which can be solved for the optimal

Using this , we get that ,

which is at least . Here, , with

. Correspondingly, . This

proves (34).

For the lower bound on , recall that the case occurs when ,

in which case is at least . Using

proves (35).

Next we prove (b). Notice that has second derivative . It follows that

, since the difference of the two sides has negative second derivative, so it is concave and equals 0 at

and .

For the upper bound, notice that the derivative of is

at and at , where and

. Correspondingly, is bounded from above by the minimum of and . Now, it is not hard to see that

Correspondingly, we get the upper bound in (36).

APPENDIX B PROOF OF LEMMA 5

We first prove (a). Define , which, using the lower bound on given in lemma 6 (b) and

, is at least . Consequently, is at least

using . Correspondingly, using (35) and the lower

bound (7), one gets that is at least

which is equal to times

Furthermore, can be bounded by and . Therefore, it is at most

, where . Using this,

the lower bound on and the form of given in (28), one gets that can be bounded by

times

Now, use to get that

Now, observe that the second term in the maximum given previously dominates the other two terms for all . This completes the proof of (a).

Next we prove (b). For in , we use

and the strict positivity of to see that the ratio in the definition of tends to zero uniformly within compact sets interior to . So the limit is determined by the maximum of the limits of the ratios at the two ends. In the vicinity of the left and right ends, we replace by the continuous upper bounds and , respectively, which are tight at and , respectively. Then, in accordance with L’Hôpital’s rule, the limit of the ratios equals the ratios of the derivatives at and , respectively. Accordingly

(42)

where and are the derivatives of with respect to evaluated at and , respectively.

To determine the behavior of in the vicinity of 0 and 1, we first need to determine whether the optimal in its definition is strictly less than 1 or equal to 1. From Section II,

the case occurs if and only if . The

right side of this is . So it is equivalent to determine whether the ratio

is less than 1 for in the vicinity of 0 and 1. Using L’Hôpital’s rule, it suffices to determine whether the ratio of derivatives is less than 1 when evaluated at 0 and 1. At , it is

which is not more than 1/2 (certainly less than 1) for all positive , whereas at , the ratio of derivatives is which is less than 1 if and only


if . In other words, at , the optimum is less than one for all , whereas at , it is less than one if and only if

.

For the cases in which the optimal , we need to determine the derivative of at and . Recall that is

the composition of the functions and

and . Also recall that

the limit of , as tends to 0 or 1, is zero.

Use the chain rule for finding the derivative of , taking the products of the associated derivatives. The first of these functions has derivative which is 1/4 at ,

the second of these has derivative which is 1/2 at , and the third of these functions is

which has derivative that evaluates to at and evaluates to

at . Correspondingly, for , the derivative of is for all , whereas for , its derivative is for .

For , the magnitude of the derivative of at 1 is smaller than at 0. Indeed, taking square roots, this is the same as

the claim that .

Replacing and rearranging, it reduces to

, which is true for since the two sides match at and have derivatives . Thus, the limiting value for near 1 is what matters for the maximum. This produces the claimed form of for .

In contrast for , the optimal equals 1 for in the vicinity of 1. In this case, we use

which has derivative equal to

at , which is again smaller in magnitude than the derivative at , producing the claimed form for for .

At we equate and see that

both of the expressions for the magnitude of the derivative at 1 agree with each other (both reducing to ), so the argument extends to this case, and the expression for is continuous in .

(c) is proved by using and simplifying

the resulting expression. This completes the proof of Lemma 5.

APPENDIX C

IMPROVEMENT IN FORM OF EXPONENT

The following improvement in the form of the exponent in Proposition 1 can be obtained.

Theorem 7: Assume , where , and rate is less than capacity . For the least squares decoder

Here

where is positive and tends to as tends to 0. Here

(43) and

(44)

Proof of Theorem 7: Here, we determine the minimum value of for which the combinatorial term is canceled, and we characterize the amount beyond that minimum which makes the error probability exponentially small. Arrange to be the solution to the equation

To see its characteristics, let at

using log base . Here, is the inverse of the function which is the composition of the increasing functions

and previously dis-

cussed in Section II. This is near for small . When , the condition is satisfied and indeed solves the aforementioned equation;

otherwise, provides the solution.

Now, , which from earlier

can be bounded by . Also,

. Consequently, is small for large

; moreover, for near 0 and 1, it is of order and , respectively, and via the indicated bounds, derivatives at 0 and 1 can be explicitly determined.

The analysis in Lemma 5 may be interpreted as determining section size rates such that the differentiable upper bounds on

are less than or equal to for ,

where, noting that these quantities are 0 at the endpoints of the interval, the critical section size rate is determined by matching the slopes at . At the other end of the interval, the bound on the difference has a strictly positive slope for

at , given by as in (43). The positivity of follows from recalling that , since the second term in (42) always turns out to be the greater one. Consequently, one

may take for some positive , where

tends to as tends to 0.

Recall that . Express as the sum

of needed to cancel the combinatorial coefficient, and , which is positive. This arises in establishing that the main term in the probability bound is exponentially small. It decomposes as

. Arrange to be

so that .

Consider the exponent as given

in lemma 4. We take a reference for which

and for which is at least and at least a multiple of

. For convenience, we set to


be half way between and . Recall that has a nondecreasing derivative with respect to . So

is greater than . Consequently, it lies above the tangent line at , that is

where as earlier is the derivative of

with respect to , which is, here, evaluated at . Its expression is as in (38).

We wish to examine the behavior of for near 0.

For this, we first lower bound the derivative . Since this derivative is nondecreasing, it is at least as large as the value at

. Now, recall that has a limit 0 as

tends to 0. Furthermore, has limit as

tends to 0. Consequently, from (38), at , we have tends to , given by (44), as tends to 0. Consequently,

, where is positive and tends to as goes to 0.

Next examine . Since is at least , it follows

that is at least . Consequently, as

in the proof of Proposition 1, if the section size rate is at least , then the bound (23) is not more than the sum of

and

Using half way between and , the first part of the bound is at most

This bound is superior to the previous one, when closely matches , because of the addition of the nonnegative term. The second part of the bound can be dealt with as in Proposition 1. Accordingly, we have proved that

where for small . It tends to as

tends to 0. This completes the proof of Theorem 7.

APPENDIX D COMPUTATIONS

We describe how the rate curves in Fig. 2 were computed. The block error probability was fixed at and the signal-to-noise ratio was taken to be 20 and 100. The PPV curve was computed using the right side of (8) for the given and . The maximum achievable (composite) rate for the superposition code was calculated in the following manner. The number of sections ranged from 20 to 100 in steps of 10, with the corresponding section size taken to be , where as in (32) and (33).

For given and values of , and , the inner code rate was decreased from to in decrements of

. For a given , the minimum section mistake rate so that the error probability, computed using bounds (24), is at most was computed. The corresponding composite rate is taken to be

The maximum of the composite rates , when ranged from to in decrements of , is the reported maximum achievable rate for the superposition code for the given values of , and .
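The search described in this appendix can be organized as in the following sketch. The error bound (24) and the composite rate formula are passed in as callables, since their exact forms are not reproduced here; `toy_bound` and `toy_composite` are made-up stand-ins used only to exercise the loop, and the grids of rates and mistake fractions are illustrative.

```python
import math


def max_composite_rate(rates, alphas, error_bound, composite_rate, eps):
    """For each inner rate, find the smallest mistake fraction whose error
    bound is at most eps, then return the best (composite, rate, fraction)."""
    best = None
    for R in rates:
        feasible = [a for a in alphas if error_bound(R, a) <= eps]
        if not feasible:
            continue  # no mistake fraction on the grid meets the target
        a0 = min(feasible)
        value = composite_rate(R, a0)
        if best is None or value > best[0]:
            best = (value, R, a0)
    return best


def toy_bound(R, alpha, C=2.0):
    """Made-up stand-in for the bound (24); NOT the bound from the paper."""
    return math.exp(-40.0 * alpha * (C - R))


def toy_composite(R, alpha):
    """Made-up stand-in for the composite rate of inner and outer codes."""
    return (1.0 - 2.0 * alpha) * R


rates = [2.0 - 0.1 * i for i in range(1, 11)]
alphas = [0.01 * j for j in range(1, 21)]
print(max_composite_rate(rates, alphas, toy_bound, toy_composite, eps=1e-2))
```

Lowering the inner rate permits a smaller mistake fraction and hence a higher outer rate, so the maximum over the grid balances the two effects.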

APPENDIX E

ACCURATE DECODER IMPLIES APPROXIMATE LEAST SQUARES

In Lemma 9 below, we show that any accurate decoder is an approximate least squares decoder. More specifically, we show that if the fraction of mistakes made by a decoder is small, the distance of the estimated fit from cannot be much greater than the distance of the codeword sent, that is , from . To prove this, we require the following lemma, which is a consequence of the restricted isometry property [16], [17] for Gaussian random matrices. We recall that the entries of our matrix are i.i.d. .

Lemma 8: Let and . Then,

the following holds except on a set with probability at most :

(45)

where is related to the

restricted isometry property constant.

Proof: Statement (45) is equivalent to giving uniform bounds on the maximum singular value of the matrices , for all , where is as in (13). For , let denote the maximum singular value of . We use a result in [61] (see also [17]), giving tail bounds for the maximum singular value for Gaussian matrices from which one gets that for positive

Accordingly, choose and use

to get that , except on a set with probability .

We need to hold uniformly for all sets

, with high probability. Correspondingly, using , using a union bound, one gets that the probability of the event

is at least . This completes the proof of the lemma.

If , then from standard results on the tail bounds of chi-square random variables, one has that

(46)


Lemma 8 and (46) give us the following.

Lemma 9: Assume that a decoder for the superposition code, operating at rate , makes at most section of mistakes. Denote as the estimate of the true

outputted by the decoder. Then, with probability at least , the estimate satisfies

with . In other words, with

high probability, is the solution of an approximate least squares decoder (40) with the given .

Proof: We need to show that cannot be much greater than . Notice that

(47) where for (47) we use the fact that the noise .

Now, , since the decoder makes at most

mistakes. Accordingly, using Lemma 8 and (46), with probability at least , one gets that and . Consequently, from (47), one gets

with probability at least , where

.

APPENDIX F

ERROR BOUNDS FOR SUBSET SUPERPOSITION CODES

The method of analysis also allows the consideration of subset superposition coding described in Section I-A. In this case, all subsets of size correspond to codewords, so with the rate in nats, we have . The analysis proceeds in the same manner, with the same number of choices

of sets where and agree on terms,

but now with choices of sets of size

they disagree. We obtain the same bounds as earlier except that where we have , with the exponent , it is

replaced by , with the exponent defined

by .

Correspondingly, for subset superposition coding, the probability is bounded by the minimum of the same expressions given in Lemma 3 and Lemma 4, except that the term appearing in these expressions is replaced by the quantity defined previously. We have not investigated in greater detail whether there is reliability for any rate below capacity for these codes.

ACKNOWLEDGMENT

We thank John Hartigan, Cong Huang, Yiannis Kontiyiannis, Mokshay Madiman, Xi Luo, Dan Spielman, Edmund Yeh, Imre Teletar, Harrison Zhou, David Smalling, and Creighton Heaukulani for helpful conversations.

REFERENCES

[1] A. Abbe and A. R. Barron, “Polar coding schemes for the AWGN channel,” in Proc. IEEE Int. Symp. Inf. Theory, St. Petersburg, Russia, Aug. 2011, pp. 194–198.

[2] M. Akcakaya and V. Tarokh, “Shannon-theoretic limits on noisy compressive sampling,” IEEE Trans. Inf. Theory, vol. 56, no. 1, pp. 492–504, Jan. 2010.

[3] E. Arikan, “Channel polarization,” IEEE Trans. Inf. Theory, vol. 55, no. 7, pp. 3051–3073, Jul. 2009.

[4] E. Arikan and E. Telatar, “On the rate of channel polarization,” in Proc. IEEE Int. Symp. Inf. Theory, Seoul, Korea, Jul. 2009, pp. 1493–1495.

[5] A. Barg and G. Zémor, “Error exponents of expander codes under linear-complexity decoding,” SIAM J. Discrete Math., vol. 17, no. 3, pp. 426–445, 2004.

[6] A. R. Barron, “Universal approximation bounds for superpositions of a sigmoidal function,” IEEE Trans. Inf. Theory, vol. 39, no. 3, pp. 930–944, May 1993.

[7] A. Barron, A. Cohen, W. Dahmen, and R. Devore, “Approximation and learning by greedy algorithms,” Ann. Statist., vol. 36, no. 1, pp. 64–94, 2007.

[8] A. R. Barron and A. Joseph, Sparse superposition codes: Fast and reliable at rates approaching capacity with Gaussian noise [Online]. Available: http://www.stat.yale.edu/arb4

[9] A. R. Barron and A. Joseph, “Towards fast reliable communication at rates near capacity with Gaussian noise,” in Proc. IEEE Int. Symp. Inf. Theory, Austin, TX, Jun. 13–18, 2010, pp. 315–319.

[10] A. R. Barron and A. Joseph, “Least squares superposition codes of moderate dictionary size, reliable at rates up to capacity,” in Proc. IEEE Int. Symp. Inf. Theory, Austin, TX, Jun. 13–18, 2010, pp. 275–279.

[11] A. R. Barron and A. Joseph, Least squares superposition codes of moderate dictionary size, reliable at rates up to capacity [Online]. Available: http://arxiv.org/abs/1006.3780

[12] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit error-correcting coding: turbo codes,” in Proc. Int. Conf. Commun., Geneva, Switzerland, May 1993, pp. 1064–1070.

[13] T. Blumensath and M. E. Davies, “Iterative thresholding for sparse approximations,” J. Fourier Anal. Appl., vol. 14, pp. 629–654, 2008.

[14] L. Applebaum, W. U. Bajwa, M. F. Duarte, and R. Calderbank, “Multiuser detection in asynchronous on-off random access channels using lasso,” in Proc. 48th Annu. Allerton Conf. Commun., Control, Comput., Sep. 2010, pp. 130–137.

[15] E. Candes and Y. Plan, “Near-ideal model selection by $\ell_1$ minimization,” Ann. Statist., vol. 37, no. 5A, pp. 2145–2177, 2009.

[16] E. Candes and T. Tao, “Decoding by linear programming,” IEEE Trans. Inf. Theory, vol. 51, no. 12, pp. 4203–4215, Dec. 2005.

[17] E. Candes and T. Tao, “Near-optimal signal recovery from random projections: Universal encoding strategies?,” IEEE Trans. Inf. Theory, vol. 52, no. 12, pp. 5406–5425, Dec. 2006.

[18] J. Cao and E. M. Yeh, “Asymptotically optimal multiple-access communication via distributed rate splitting,” IEEE Trans. Inf. Theory, vol. 53, no. 1, pp. 304–319, Jan. 2007.

[19] T. M. Cover, “Broadcast channels,” IEEE Trans. Inf. Theory, vol. 18, no. 1, pp. 2–14, Jan. 1972.

[20] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley-Interscience, 2006.

[21] I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Commun. Pure Appl. Math., vol. 57, no. 11, pp. 1413–1457, 2004.

[22] D. Donoho, “For most large underdetermined systems of linear equations the minimal $\ell_1$-norm near solution approximates the sparsest solution,” Commun. Pure Appl. Math., vol. 59, no. 10, pp. 907–934, 2006.

[23] D. L. Donoho, M. Elad, and V. M. Temlyakov, “Stable recovery of sparse overcomplete representations in the presence of noise,” IEEE Trans. Inf. Theory, vol. 52, no. 1, pp. 6–18, Jan. 2006.

[24] D. Donoho and X. Huo, “Uncertainty principles and ideal atomic decomposition,” IEEE Trans. Inf. Theory, vol. 47, no. 7, pp. 2845–2862, Nov. 2001.