Least Squares Superposition Codes of Moderate Dictionary Size Are Reliable at Rates up to Capacity
Antony Joseph, Student Member, IEEE, and Andrew R. Barron, Senior Member, IEEE
Abstract—For the additive white Gaussian noise channel with average codeword power constraint, coding methods are analyzed in which the codewords are sparse superpositions, that is, linear combinations of subsets of vectors from a given design, with the possible messages indexed by the choice of subset. Decoding is by least squares (maximum likelihood), tailored to the assumed form of codewords being linear combinations of elements of the design.
Communication is shown to be reliable with error probability exponentially small for all rates up to the Shannon capacity.
Index Terms—Achieving capacity, compressed sensing, exponential error bounds, Gaussian channel, maximum likelihood estimation, subset selection.
I. INTRODUCTION
The additive white Gaussian noise channel is basic to Shannon theory and real communication models. In superposition coding schemes, the codewords are sparse linear combinations of elements from a given dictionary. We show that superposition codes from polynomial size dictionaries with maximum likelihood (minimum distance) decoding achieve exponentially small error probability for any communication rate less than the Shannon capacity. A companion paper [8], [9] provides a fast decoding method and its analysis. The developments involve a merging of modern perspectives on statistical linear model selection and information theory.
The familiar communication problem is as follows. Input bit strings $u = (u_1, u_2, \ldots, u_K)$ of length $K$ are mapped into codewords of length $n$, with control of their power. The channel adds independent $N(0, \sigma^2)$ noise to the selected codeword, yielding a received string $Y$ of length $n$. Using the received string and knowledge of the codebook, the decoder then produces an estimate $\hat{u}$ of the transmitted string $u$. Block error is the event $\hat{u} \neq u$, bit error at position $i$ is the event $\hat{u}_i \neq u_i$, and the bit error rate is $\frac{1}{K}\sum_{i=1}^{K} 1_{\{\hat{u}_i \neq u_i\}}$. An analogous section error rate for our code is defined below. The reliability requirement is that, with sufficiently large $n$, the bit error rate or section error rate is small with high probability, when averaged over input strings $u$ as well as the distribution of $Y$. A more stringent requirement would be to have small block error probability, again averaged over the distributions of $u$ and $Y$. As will be made clear later on, for ease of analysis, we perform a further averaging over the distribution of our design matrix.
Manuscript received June 05, 2010; revised July 07, 2011; accepted November 28, 2011. Date of publication January 31, 2012; date of current version April 17, 2012. The material in this paper was presented in part at the 2010 IEEE International Symposium on Information Theory.
The authors are with the Department of Statistics, Yale University, New Haven, CT 06520 USA (e-mail: [email protected]; an- [email protected]).
Communicated by I. Kontoyiannis, Associate Editor for Shannon Theory.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIT.2012.2184847
The communication rate $R = K/n$ is the ratio of the input length to the code length for communication across the channel.
By traditional information theory, as in [20], [35], and [60], the supremum of reliable rates is the channel capacity $C = \frac{1}{2}\log_2(1 + P/\sigma^2)$, where $P$ is a constraint on the power of the codewords. Standard communication models, even in continuous time, have been reduced to the aforementioned discrete-time white Gaussian noise setting, as in [31] and [35].
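As a small numerical companion to the capacity expression, the following Python sketch evaluates the standard Gaussian channel capacity $\frac{1}{2}\log_2(1 + P/\sigma^2)$ in bits per channel use. The function name `awgn_capacity_bits` is hypothetical, and the two signal-to-noise ratios are simply the values 20 and 100 that appear later in the rate comparison.

```python
import math


def awgn_capacity_bits(power, noise_var):
    """Capacity (1/2) * log2(1 + P / sigma^2) in bits per channel use."""
    if power < 0 or noise_var <= 0:
        raise ValueError("need power >= 0 and noise_var > 0")
    return 0.5 * math.log2(1.0 + power / noise_var)


if __name__ == "__main__":
    for snr in (20.0, 100.0):  # illustrative signal-to-noise ratios
        print(f"snr={snr:g}: C = {awgn_capacity_bits(snr, 1.0):.4f} bits")
```

Only the ratio $P/\sigma^2$ matters, so the noise variance is set to 1 in the usage loop.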
We now describe the superposition coding scheme. The story begins with a dictionary (design matrix) $X$, with $N$ columns $X_j$ for $j = 1, 2, \ldots, N$. We further assume that $N = LM$, with $L$ and $M$ being positive integers. The dictionary is partitioned into $L$ sections, each of size $M$, as depicted in Fig. 1.
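The codewords take the form of particular linear combinations of subsets of columns of the dictionary. Specifically, each codeword is of the form $X\beta$, where $\beta$ belongs to the set $\mathcal{B}$ given by
$$\mathcal{B} = \{\beta \in \mathbb{R}^N : \beta \text{ has exactly one nonzero entry in each section, with value } 1\}.$$
Consequently, for $\beta \in \mathcal{B}$, the codeword $X\beta$ is a superposition of $L$ columns of $X$, with exactly one column selected from each section. The received vector is then in accordance with the statistical linear model
$$Y = X\beta + \varepsilon, \tag{1}$$
where $\varepsilon$ is the noise vector distributed $N(0, \sigma^2 I)$.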
For ease of encoding, it is assumed that the section size $M$ is a power of 2. The input bit strings are of length $K = L\log_2 M$, which split into $L$ substrings of size $\log_2 M$. The encoder maps $u$ to $\beta$ simply by interpreting each substring of $u$ as giving the index of which coordinate of $\beta$ is nonzero in the corresponding section. That is, each substring is the binary representation of the corresponding index.
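The encoding rule just described can be sketched in a few lines of Python. This is an illustration only: the function name `encode_bits_to_beta`, the zero-based indexing within each section, and the most-significant-bit-first reading of each substring are assumptions made here, not conventions fixed by the text.

```python
def encode_bits_to_beta(bits, L, M):
    """Map L*log2(M) input bits to a 0/1 coefficient vector of length L*M.

    Each block of log2(M) bits is read as a binary number (most significant
    bit first, an assumption) giving the zero-based position of the single
    nonzero coordinate inside the corresponding section.
    """
    log_m = M.bit_length() - 1
    if M < 1 or (1 << log_m) != M:
        raise ValueError("section size M must be a power of 2")
    if len(bits) != L * log_m:
        raise ValueError("expected exactly L * log2(M) input bits")
    beta = [0] * (L * M)
    for section in range(L):
        chunk = bits[section * log_m:(section + 1) * log_m]
        index = 0
        for b in chunk:
            index = (index << 1) | b
        beta[section * M + index] = 1
    return beta


if __name__ == "__main__":
    # Two sections of size four: bits "10" pick index 2, bits "01" pick index 1.
    print(encode_bits_to_beta([1, 0, 0, 1], L=2, M=4))
```

Every output vector has exactly one nonzero coordinate per section, so the map is a bijection between bit strings and the allowed coefficient vectors.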
Fig. 1. Schematic rendering of the dictionary matrix $X$ and coefficient vector $\beta$. The vertical bars in the $X$ matrix indicate the selected columns from a section.
As mentioned earlier, we analyze the maximum likelihood decoder. This decoder is the same as that which chooses the $\beta$ that maximizes the posterior probability when the prior distribution is uniform over the set $\mathcal{B}$ of allowed coefficient vectors. The decoder is given by
$$\hat{\beta} = \arg\min_{\beta \in \mathcal{B}} \|Y - X\beta\|, \tag{2}$$
where $\|\cdot\|$ denotes the Euclidean norm. Here, we implicitly assume that if the minimization has a nonunique solution, one may take $\hat{\beta}$ to be any value in the solution set. Since (2) is a least squares minimization problem over coefficient vectors in $\mathcal{B}$, we also call this the least squares decoder. Although this decoder is not a computationally feasible scheme, the result is significant since we show that one can achieve rates up to capacity with a codebook that has a compact representation in the form of the dictionary $X$.
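To make the least squares decoder concrete, here is a brute-force Python sketch that searches all $M^L$ allowed column choices and returns the one whose codeword is closest to the received vector. It is only usable for toy sizes, which is exactly the computational infeasibility noted above. The helper names, the seed, and the toy parameters are hypothetical, and the Gaussian design with variance $P/L$ is assumed.

```python
import itertools
import random


def make_dictionary(n, L, M, power, rng):
    """n x (L*M) matrix with i.i.d. N(0, power/L) entries (assumed design)."""
    sd = (power / L) ** 0.5
    return [[rng.gauss(0.0, sd) for _ in range(L * M)] for _ in range(n)]


def codeword(X, choice, M):
    """Sum of the selected columns; choice[s] is the index chosen in section s."""
    return [sum(row[s * M + j] for s, j in enumerate(choice)) for row in X]


def least_squares_decode(X, y, L, M):
    """Exhaustive search over all M**L allowed choices (tiny L and M only)."""
    best_choice, best_dist = None, float("inf")
    for choice in itertools.product(range(M), repeat=L):
        c = codeword(X, choice, M)
        dist = sum((yi - ci) ** 2 for yi, ci in zip(y, c))
        if dist < best_dist:
            best_choice, best_dist = choice, dist
    return best_choice


def demo(seed=0, n=16, L=3, M=4, power=10.0, noise_sd=1.0):
    """Send one random message and count the section mistakes of the decoder."""
    rng = random.Random(seed)
    X = make_dictionary(n, L, M, power, rng)
    sent = tuple(rng.randrange(M) for _ in range(L))
    y = [c + rng.gauss(0.0, noise_sd) for c in codeword(X, sent, M)]
    decoded = least_squares_decode(X, y, L, M)
    mistakes = sum(a != b for a, b in zip(sent, decoded))
    return sent, decoded, mistakes
```

The number of sections in which `decoded` differs from `sent` is the quantity whose distribution the error bounds in this paper control.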
The entries of $X$ are drawn independently from a normal distribution with mean zero and variance $P/L$. With this distribution, one has that for each $\beta \in \mathcal{B}$, the expected codeword power, given by $E\|X\beta\|^2/n$, is equal to $P$. Our design produces a distribution of codeword powers $\|X\beta\|^2/n$, across the codewords, that is highly concentrated near $P$, with average codeword power $\frac{1}{M^L}\sum_{\beta \in \mathcal{B}} \|X\beta\|^2/n$ having expectation $P$. Use of an average power constraint rather than an individual power constraint does not increase the capacity.
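The concentration of codeword power can be checked with a short simulation. The sketch below assumes the Gaussian design with entry variance $P/L$, under which each coordinate of a fixed codeword is a sum of $L$ independent $N(0, P/L)$ terms. For simplicity it draws independent codewords rather than codewords sharing one dictionary, so it illustrates the per-codeword concentration only. The function names and parameter values are hypothetical.

```python
import random


def codeword_power(n, L, power, rng):
    """Power ||X beta||^2 / n of one codeword under the assumed Gaussian design.

    Each coordinate of X beta is a sum of L independent N(0, power/L) entries,
    one per section, so only those L entries per row need to be simulated.
    """
    sd = (power / L) ** 0.5
    total = 0.0
    for _ in range(n):
        coord = sum(rng.gauss(0.0, sd) for _ in range(L))
        total += coord * coord
    return total / n


def power_summary(n, L, power, trials, seed=0):
    """Mean and range of simulated codeword powers over independent trials."""
    rng = random.Random(seed)
    samples = [codeword_power(n, L, power, rng) for _ in range(trials)]
    mean = sum(samples) / trials
    spread = max(samples) - min(samples)
    return mean, spread


if __name__ == "__main__":
    print(power_summary(n=2000, L=8, power=5.0, trials=20))
```

Since $\|X\beta\|^2/n$ is then $P$ times a chi-square average, its standard deviation is $P\sqrt{2/n}$, which shrinks as the code length grows.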
An alternative method would be to arrange the entries of $X$ to be equiprobable $\pm\sqrt{P/L}$ random variables. This would achieve an approximately Gaussian distribution for the coordinates $(X\beta)_i$ of the codewords. It is very likely that this alternative design also achieves capacity, though that is not explored here.
As we have said, the rate of the code is $R = K/n$ input bits per channel use, and we arrange for $R$ to be arbitrarily close to $C$. For our code, this rate is $R = (L\log M)/n$. For specified rate $R$, the code length is $n = (L\log M)/R$. As explained in the following, the section size $M$ will be related to the number of sections $L$ by an expression of polynomial size. Consequently, the length $n$ and the number of terms $L$ agree to within a log factor.
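The bookkeeping between the number of sections, the section size, the rate, and the code length can be made explicit. The following Python sketch assumes the relations $K = L\log_2 M$, $n = K/R$, and dictionary size $N = LM$, with the rate in bits; the function name `code_parameters` and the example values are hypothetical.

```python
import math


def code_parameters(L, M, rate_bits):
    """Return (K, n, N): input bits, code length, and dictionary size.

    Assumes K = L * log2(M) input bits, code length n = ceil(K / R), and a
    dictionary with N = L * M columns.
    """
    log_m = math.log2(M)
    if not log_m.is_integer():
        raise ValueError("section size M must be a power of 2")
    if rate_bits <= 0:
        raise ValueError("rate must be positive")
    K = L * int(log_m)
    n = math.ceil(K / rate_bits)
    N = L * M
    return K, n, N


if __name__ == "__main__":
    # Example with M = L**2: 64 sections of size 4096 at rate 1.5 bits.
    K, n, N = code_parameters(64, 4096, 1.5)
    print(K, n, N)  # the codebook itself has 2**K codewords
```

The contrast to notice is between the $N = LM$ stored columns and the $2^K$ codewords they represent.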
Control of the dictionary size is critical to computationally advantageous coding and decoding. If the number of sections $L$ were fixed, then $M = 2^{nR/L}$ has size that is exponential in $n$, making its direct use impractical. Instead, with $L$ agreeing with $n$ to within a log factor, the dictionary size is more manageable. In this setting, we construct reliable, high-rate codes with codewords corresponding to linear combinations of subsets of terms in moderate size dictionaries.
The idea of superposition codes for Gaussian channels began with Cover [19] in the context of determination of the capacity region of certain multiple user channels. There, $L$ represents the number of messages decoded and a selected column represents the codeword for a message. Codes for the Gaussian channel based on sparse linear combinations have been proposed in the compressed sensing community by Tropp [64]. However, as he discusses, the rate achieved there is not up to capacity. The relationship of our study to that in these communities will be discussed in further detail later on.
We now describe our main result concerning the performance of the least squares decoder. We show that if $M = L^a$, for any $a$ exceeding a particular positive function of the signal-to-noise ratio $v = P/\sigma^2$, then rates arbitrarily close to capacity can be achieved.
This function is near 1 for large $v$; its behavior for small $v$ is given with Lemma 5 in Section IV. Consequently, the dictionary has size $L^{a+1}$, which is polynomial in $n$. This required section size does not depend on the gap $C - R$, and thus, the dictionary has a compact representation irrespective of the closeness of $R$ to $C$.
For , let denote the joint distribution of given . Further, let denote the number of mistakes made by the least squares decoder, that is, the number of sections in which the position of the nonzero term in is different from that in the true . Denote the error event
(3) that the decoder makes mistakes in at least fraction of sections. Assuming that $\beta$ is drawn from a uniform distribution over the set $\mathcal{B}$ of allowed coefficient vectors, the average probability of error conditional on $X$ is given by
Deriving bounds for this conditional error probability is not easy. We follow the information theory tradition and bound its average over the distribution of $X$, given by
(4)
For positive , let . Furthermore, for
, let
(5)
A positive expression possessing properties explained in Section IV, lemma 5 is used. For large , it is near a function near for small and near 1 for large . Our main result is the following.
Proposition 1: Assume , where , and rate is less than capacity . Let represents the fraction of section mistakes made by the least squares decoder. Then
with , where
(6)
is evaluated at and .
Proposition 1 is proved in Section V.
Remark: It is shown in Appendix C that the exponent
can be improved by replacing with
where
Here, is a positive function of and , which for given is near for small , where and are positive expressions as in (43) and (44) shown later.
Let . Then, it is not hard to see that (7) Accordingly, the function , appearing in the lower bound (6), may be replaced by , revealing that the exponent is, up to a constant, of the form , where
. With the improved bound in Appendix C, it is of the form .
Moreover, an approach is discussed which completes the task of identifying the terms by arranging sufficient distance between the subsets, using composition with an outer Reed–Solomon (RS) code of rate near one. It corrects the small fraction of remaining mistakes so that we end up not only with small mistake rate but also with small block error probability. If
is the rate of an RS code, with , then section error rates less than can be corrected, provided . Furthermore, if (or simply ) is the rate associated with our inner (superposition) code, then the total rate after correcting for
the remaining mistakes is given by . The
end result, using our theory for the distribution of the fraction of mistakes of the superposition code, is that the block error probability is exponentially small. One may regard the composite code as a superposition code in which the subsets are forced to maintain at least a certain minimal separation, so that decoding to within a certain distance from the true subset implies exact decoding. Accordingly, we make the following claim about block error probability.
Proposition 2: For given fraction of mistakes , let be a rate for which the partitioned superposition code with sections has exponentially small probability of Proposition 1. Then, through concatenation with an outer RS code, one obtains a code with rate and block error probability less than or equal to .
Proposition 2 is proved in Section VI.
Particular interest is given to the case that the rate is made to
approach the capacity . Arrange and .
One may let the rate gap tend to zero (e.g., at a rate or any polynomial rate not faster than ); then, the overall
rate continues to have drop from
capacity of order , with the composite code having block error probability of order
The aforementioned exponent, of order for near , is in agreement with the form of the optimal reliability bounds as in [35] and [53], though, here, our constant is not demonstrated to be optimal.
In Fig. 2, we plot curves of achievable rates using our scheme for block error probability fixed at and signal-to-noise ratios of 20 and 100. We also compare this to a rate curve given in [53] (PPV curve), where it is demonstrated that for a Gaussian channel with signal-to-noise ratio , the block error probability , code length , and rate with an optimal code can be well approximated by the following relation:
(8)
where is the channel dispersion and is the inverse normal cumulative distribution function.
For the superposition code curve, the -axis gives the highest for which the error probability stays below . We see for the given and block error probability values, the achievable rates using our scheme are reasonably close to the theoretically best scheme. Note that the PPV curve was computed with an approach that uses a codebook of size that is exponential in block length, whereas our dictionary, of size , is of considerably smaller size.
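For readers who want to reproduce a curve of the PPV type, the following Python sketch evaluates the widely quoted normal approximation for the Gaussian channel, $R \approx C - \sqrt{V/n}\,Q^{-1}(\epsilon)$, with dispersion $V = \frac{v(v+2)}{2(v+1)^2}\log_2^2 e$ in bits. This is an assumption about the form of the relation: whether the formula used for the plotted curve also includes the $\frac{1}{2}(\log_2 n)/n$ correction is not settled here, so that term is an optional flag. The function name and the example error probability are hypothetical.

```python
import math
from statistics import NormalDist


def normal_approx_rate(snr, n, eps, include_log_term=True):
    """Normal approximation to the best rate (bits) at length n and error eps.

    Assumed form: C - sqrt(V / n) * Qinv(eps) [+ (1/2) * log2(n) / n], where
    V = snr * (snr + 2) / (2 * (snr + 1)**2) * (log2 e)**2.
    """
    capacity = 0.5 * math.log2(1.0 + snr)
    log2e = math.log2(math.e)
    dispersion = snr * (snr + 2.0) / (2.0 * (snr + 1.0) ** 2) * log2e ** 2
    q_inv = NormalDist().inv_cdf(1.0 - eps)  # inverse of the upper tail Q
    rate = capacity - math.sqrt(dispersion / n) * q_inv
    if include_log_term:
        rate += 0.5 * math.log2(n) / n
    return rate


if __name__ == "__main__":
    for snr in (20.0, 100.0):
        for n in (500, 1000, 2000):
            print(snr, n, round(normal_approx_rate(snr, n, 1e-4), 4))
```

The approximation approaches capacity at the rate $1/\sqrt{n}$, which is the benchmark against which the superposition code rates are compared.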
A. Variants of the Superposition Scheme
To distinguish it from other sparse superposition codes, the code analyzed here may be called a partitioned superposition code. The motivations for introducing the partitioning, rather than allowing arbitrary subsets, in the superposition coding scheme are the ease in mapping the input bit string to the coefficient vector and the ease in composition with the outer RS code. A natural variant of the scheme is subset superposition coding, where one arranges for a number $L$ of the coordinates to be nonzero and to take the value 1, with the message conveyed by the choice of subset. With somewhat greater freedom, one may have signed superposition coding, where one arranges the nonzero coefficients to be $+1$ or $-1$. Then, the message is conveyed by the sequence of signs as well as the choice of subset. In both cases, if one takes the elements of $X$ to be i.i.d. $N(0, P/L)$ as before, then the expected power of each codeword is $P$. The signed superposition coding scheme has been proposed in [36] and [64].
As mentioned earlier, superposition codes began with [19] for multiuser channels in the context of determination of the capacity region of Gaussian broadcast channels. There, the number of users corresponds to $L$. The codewords for user $j$, for $j = 1, 2, \ldots, L$, correspond to the columns in section $j$. In that setting, what is sent is the sum of codewords, one from each user. With $L$ fixed, $M$ is exponential in $n$. Here, for the single-user channel, by allowing $L$ to be of the same order as $n$, to within a log factor, we make it possible to achieve rates close to capacity with polynomial size dictionaries. Related rate splitting (partitioning) for superposition codes is developed for Gaussian multiple-access problems in [18] and [59].
Fig. 2. Plot of comparison between achievable rates using our scheme and the theoretical best possible rates for block error probability of10 and signal-to-noise ratio (v) values of 20 and 100. The curves for our partitioned superposition code were evaluated at points with number of sections $L$ ranging from 20 to 100 in steps of 10, with corresponding $M$ values taken to be $L^a$, where $a$ is as given in Lemma 5, (32), and (33) later on. For the v values of 20 and 100 shown previously, $a$ is around 2.6 and 1.6, respectively. For details on computations, refer to Appendix D.
As to the relationship to single-user decoding, note that in the Gaussian broadcast channel, with optimal decoding, it is arranged that one of the receivers decodes all the messages. This is also the case for the multiple access channel receiver. The terminology of superposition codes, rate splitting (partitioning), and issues of power allocation arise from such work in multiuser Shannon theory. Here, to achieve the benefits of the reduced size dictionary, we decode the sections jointly rather than successively. Joint decoding does allow the power allocation to be constant across sections. In the companion paper [8], in achieving a practical decoder, we do make use of standard variable power allocation in the sections.
Sparse superposition codes have been proposed for communication in random access channels as in [14] and [27].
Our ideas of sparse superposition coding are adapted to Gaussian vector quantization in [42].
B. Related Work on Sparse Signal Recovery
While reviewing works on sparse signal recovery and compressed sensing, we adhere to our notation that we have a linear model of the form
$$Y = X\beta + \varepsilon,$$
where $X$ is a deterministic or random $n \times N$ matrix and $\beta$ has exactly $L$ nonzero values. The quantities $n$, $N$, and $L$ will be called parameters for the model. In our description in the following, we denote as some positive constant whose value will change from time to time.
The conclusions here complement recent work on sparse signal recovery in the linear model setup as we now discuss.
In a broad sense, these works analyze, for various schemes (practical or otherwise), conditions on the parameters so that certain reliability requirements are satisfied with high probability. Closely related to our work is the requirement that only the indices corresponding to the nonzero elements of $\beta$, that is, the support of $\beta$, be recovered exactly or almost exactly.
The conditions explored by this community do translate into results on communication rate, though heretofore not rates up to capacity.
In this paper, in order to achieve rates arbitrarily close to capacity, we require $M = L^a$, with precise values of $a$ specified later on, putting us in the sublinear sparsity regime, that is, $L/N \to 0$ as $N \to \infty$. Also, if we change the scale and take the elements of the matrix $X$ as i.i.d. standard normal, the nonzero values of $\beta$ assume the value $\sqrt{P/L}$. Accordingly, although most of the claims in this area are for more general sparsity regimes and values of $\beta$, the results most relevant to us are those for the sublinear sparsity regime and when the nonzero $\beta_j$'s are
at least .
A significant portion of the work in this area focuses on deterministic matrices satisfying certain assumptions. A common assumption is the mutual incoherence condition [15], [24], [33], which places controls on the magnitude of the correlation between distinct pairs of columns. Another related assumption is the exact recovery condition [64], [68], [72].
The recovery uses $\ell_1$-relaxation methods such as Lasso [62] or iterative methods such as orthogonal matching pursuit [49], [52]. This line has been pursued by the authors in [23], [24], [32], [70], [72], and others, for general sparse signal recovery problems and by Tropp [64] for the communication problem. While these works cover broad classes of dictionaries, they impose severe constraints on the dictionaries.
Indeed, when applied to Gaussian matrices, they require
or sparsity ,
which would correspond to rate approaching 0. In contrast,
for our scheme , which using
and , one gets that is sufficient for subset recovery, which is of a smaller order of magnitude than the aforementioned. Consequently, these results on deterministic matrices, when applied to our setting, are insufficient to communicate at positive rates, let alone rates close to capacity.
The aforementioned works allow for decoding of arbitrary sparse subsets with high probability. This rather stringent form of conclusion corresponds to worst case error probability in the communication setting.
Work which does correspond to positive rate, when translated to the communication setting, arises from three approaches.
First, there is the work of Candes and Plan [15] and Tropp [65], [66], in which one looks at the probability of error averaged over codewords (i.e., the subset is chosen randomly). This achieves reliable support recovery with as high as . Second, there is the work of Zhang [71] that employs a more involved forward/backward stepwise selection algorithm, for dictionaries satisfying certain properties, to achieve reliable performance for arbitrary subsets (worst case error probability), again for as high . However, the constants are such that demonstration that rates up to capacity can be achieved has been lacking.
Third, analysis using random matrices in the noisy setting has also been carried out in [17], [22], and [68], among others, where the analysis in [68] addresses the issue of support recovery. More closely related to ours, support recovery of the least squares decoder is analyzed in [2], [28], and [67], for Gaussian matrices, where Akcakaya and Tarokh [2] also address the issue of partial support recovery. As with the aforementioned works, one can infer from this that communication at positive rates is possible using random designs. The signal recovery purpose is somewhat different here from our communications purpose, in that the work typically does not constrain the nonzero coefficients to the same value, and the resulting freedom in their values leads to order of magnitude conclusions that obstruct interpretation in terms of exact rate.
Furthermore, there are results giving necessary conditions for exact support recovery [29], [67], [69] and for partial support recovery [2]. Both these agree in terms of order of magnitude, requiring an order of for the regime we deal with. In [56], it is shown that in the linear sparsity regime, that is, when is of the same order as , one requires for reliable recovery of the support. An implication of this is that the sublinear sparsity regime is necessary for communication at positive rates.
Consequently, one can infer, from some of the aforementioned works, that communication at positive rates is possible with sparse superposition codes. We add to the existing literature by showing that one can achieve any rate up to capacity in certain sparsity regimes with a compact dictionary, albeit for a computationally infeasible scheme. Furthermore, we demonstrate that the error exponents are of the optimal form.
C. Practical Decoding Algorithms Approaching Capacity
Along with this paper, we pursued the problem of achieving capacity using computationally feasible schemes. In [8] and [9], an iterative decoding scheme, called adaptive successive decoding, is analyzed. This is similar in spirit to iterative decoding techniques such as forward stepwise regression [7], [41], relaxed greedy algorithm [6], [38], [44], and orthogonal matching pursuit [49], [52], and other iterative algorithms [13], [21], [51].
The rate attained there is of the order of below capacity, with corresponding error probability being exponentially small in . These performance levels are not as good as obtained here with the optimal decoder. The sparse superposition codes achieving these performance levels, by least squares and by adaptive successive decoding, are different in an important aspect. For this paper, we use a constant power allocation, with the same power for each term. However, in [8] and [9], to yield rates near capacity, we needed a variable power allocation, achieved by a specific schedule of the nonzero $\beta_j$'s. In contrast, if one were to use equal power allocation for the decoding scheme, then reliable decoding holds only up to a threshold rate , which is less than the capacity $C$. Since the focus here is on the least squares decoder, we defer detailed discussion to the later paper [9].
D. Related Communication Issues and Schemes
The development, here, is specific to the discrete-time
channel for which for with
real-valued inputs and outputs and with independent Gaussian noise.
Standard approaches, as discussed in [31], entail a decomposition of the problem into separate problems of coding and of shaping of a multivariate signal constellation. Notice that we build shaping directly into the coding scheme by choosing codewords to follow a Gaussian distribution.
For the low signal-to-noise regime, binary codes suffice for communication near capacity and there is no need for shaping.
The performance of the maximum likelihood decoder for binary linear codes, with a random design matrix and with exponential error bounds at rates up to capacity for the binary symmetric channel, has been established in [26]. Computationally feasible schemes, with empirically good performance, for discrete channels include LDPC codes [34], [46], [47], [57], [58] and turbo codes [12], [50]. Error bounds for rates up to capacity for expander codes (related to LDPC) are shown in [5] and for LDPC codes with random low-density design matrix in [43], whereas turbo codes have an error floor [37], [54] that precludes such exponential scaling of error probability. Thus, the results in [5], [26], and [43], with a random design matrix of controlled size, are conclusions for discrete channels that correspond to the conclusion obtained here for the Gaussian channel for rates up to capacity.
Recently, practical and capacity-achieving polar codes have been developed for discrete channels [3], [4], though with an error probability that is exponentially small in $\sqrt{n}$ rather than $n$. Unlike the present development, it remains unknown how the exponent for the polar codes depends on the closeness of $R$ to $C$.
When the signal-to-noise ratio is not small, proper shaping for the Gaussian channel requires larger size signal alphabets, as explained in [31]. For example, Abbe and Barron [1] provide such analysis adapting polar codes to use for the Gaussian channel.
The analysis of concatenated codes in [30] is an important forerunner to the development we give here. Forney [30] identified benefits of an outer RS code paired in theory with an optimal inner code of Shannon–Gallager type and in practice with binary inner codes based on linear combinations of orthogonal terms (for target rates less than 1 such a basis is available). The challenge concerning theoretically good inner codes is that the number of messages searched is exponentially large in the inner code length. Forney made the inner code length of logarithmic size compared to the outer code length as a step toward a practical solution. However, caution is required with such a strategy. Suppose the rate of the inner code has only a small drop from capacity, . For small inner code error probability, the inner code length must be of order at least . So with that scheme, one has the undesirable consequence that the required outer code length becomes exponential in .
For the Gaussian noise channel, our tactic to overcome that difficulty uses a superposition inner code with a polynomial size dictionary. We use inner and outer code lengths that are comparable, with the outer code used to correct errors in a small fraction of the sections of the inner code. The overall code length to achieve error probability remains of the order
.
Section II contains brief preliminaries. Section III provides core lemmas on the reliability of least squares for our superposition codes. Section IV analyzes the matter of section size sufficient for reliability. In Sections V and VI, we give proofs of Propositions 1 and 2, respectively. In Section VII, we discuss how our results can be adapted for an approximate form of the least squares decoder. The Appendix collects some auxiliary matters.
II. PRELIMINARIES
For vectors $a, b$ of length $n$, let $\|a\|^2$ be the sum of squares of coordinates, let $|a|^2 = (1/n)\sum_{i=1}^{n} a_i^2$ be the average square, and let $a \cdot b = (1/n)\sum_{i=1}^{n} a_i b_i$ be the associated inner product.
It is a matter of taste, but we find it slightly more convenient to work, henceforth, with the norm $|a|$ rather than $\|a\|$.
Concerning the base of the logarithm ($\log$) and associated exponential ($\exp$), base 2 is most suitable for interpretation and base $e$ most suitable for the calculus. For instance, the rate $R = (L\log M)/n$ is measured in bits if the log is base 2 and nats if the log is base $e$. Typically, conclusions are stated in a manner that can be interpreted to be invariant to the choice of base, and base $e$ is used for convenience in the derivations.
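We make repeated use of the following moment generating function and its associated Cramér–Chernoff large deviation exponent in constructing bounds on error probabilities. If $Z$ and $\tilde{Z}$ are normal with means equal to 0, variances equal to 1, and correlation coefficient $\rho$, then
$$E\, e^{(\lambda/2)(Z^2 - \tilde{Z}^2)} = \frac{1}{[1 - \lambda^2(1-\rho^2)]^{1/2}} \tag{9}$$
when $\lambda^2 < 1/(1-\rho^2)$ and infinity otherwise. So, taking the logarithm, the associated cumulant generating function of $(1/2)(Z^2 - \tilde{Z}^2)$ is $-(1/2)\log(1 - \lambda^2(1-\rho^2))$, with the understanding that the minus log is replaced by infinity when $\lambda^2$ is at least $1/(1-\rho^2)$. For positive $\Delta$, we define the quantity $D = D(\Delta, 1-\rho^2)$ given by
$$D = \max_{\lambda \ge 0}\left\{\lambda\Delta + \tfrac{1}{2}\log\left(1 - \lambda^2(1-\rho^2)\right)\right\}. \tag{10}$$
The expression corresponding to $D$ but with the maximum restricted to $0 \le \lambda \le 1$ is denoted as $D_1 = D_1(\Delta, 1-\rho^2)$, that is
$$D_1 = \max_{0 \le \lambda \le 1}\left\{\lambda\Delta + \tfrac{1}{2}\log\left(1 - \lambda^2(1-\rho^2)\right)\right\}. \tag{11}$$
When the optimal $\lambda$ is strictly less than 1, the value of $D_1$ matches $D$ as given previously.
The case $\lambda = 1$ occurs when $\Delta \ge (1-\rho^2)/\rho^2$, or equivalently $\rho^2 \ge 1/(1+\Delta)$. Then, the exponent is $D_1 = \Delta + (1/2)\log\rho^2$, which is at least $\Delta - (1/2)\log(1+\Delta)$.
Consequently, in this regime, $D_1$ is between $\Delta - (1/2)\log(1+\Delta)$ and $\Delta$. The special case $\rho^2 = 1$ is included with $D_1 = \Delta$.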
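The restricted Cramér–Chernoff exponent can be evaluated numerically. The Python sketch below assumes that the cumulant generating function is $-\frac{1}{2}\log(1 - \lambda^2 s)$ with $s = 1 - \rho^2$ and that the exponent is the maximum over $0 \le \lambda \le 1$ of $\lambda\Delta$ minus that function, in nats. The function names are hypothetical. Because the objective is concave in $\lambda$, clipping the stationary point at 1 gives the maximizer; a grid search is included as an independent check.

```python
import math


def cgf(lam, s):
    """Cumulant generating function -(1/2) log(1 - lam^2 * s), s = 1 - rho^2."""
    arg = 1.0 - lam * lam * s
    return math.inf if arg <= 0.0 else -0.5 * math.log(arg)


def d1_closed_form(delta, s):
    """Max over 0 <= lam <= 1 of lam*delta - cgf(lam, s), for delta > 0."""
    if s == 0.0:
        return delta  # rho^2 = 1: the objective is lam*delta, maximal at lam = 1
    # The stationary point solves delta*s*lam^2 + s*lam - delta = 0.
    lam = (-s + math.sqrt(s * s + 4.0 * delta * delta * s)) / (2.0 * delta * s)
    lam = min(lam, 1.0)  # concavity makes the clipped point optimal
    return lam * delta - cgf(lam, s)


def d1_grid(delta, s, steps=20000):
    """Same maximum by brute-force search over a grid of lam in [0, 1]."""
    return max((k / steps) * delta - cgf(k / steps, s) for k in range(steps + 1))


if __name__ == "__main__":
    # Boundary regime: delta >= s / (1 - s), so the optimum sits at lam = 1.
    print(d1_closed_form(2.0, 0.5), 2.0 + 0.5 * math.log(0.5))
    # Interior regime: the two evaluations should agree closely.
    print(d1_closed_form(0.3, 0.8), d1_grid(0.3, 0.8))
```

In the boundary regime the value reduces to $\Delta + \frac{1}{2}\log\rho^2$, which is what the first printed pair compares.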
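There is a role for the function
$$C_\alpha = \tfrac{1}{2}\log(1 + \alpha v) \tag{12}$$
for $0 \le \alpha \le 1$, where $v = P/\sigma^2$ is the signal-to-noise ratio and $C = C_1 = \frac{1}{2}\log(1 + v)$ is the channel capacity. We note that $C_\alpha - \alpha C$ is a nonnegative concave function equal to 0 when $\alpha$ is 0 or 1 and strictly positive in between. The quantity $C_\alpha - \alpha R$ is larger by the additional amount $\alpha(C - R)$, positive when the rate $R$ is less than the Shannon capacity $C$.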
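The stated properties of this function are easy to check numerically. The sketch below assumes the function in question is $C_\alpha = \frac{1}{2}\log(1 + \alpha v)$, so that the nonnegative concave quantity is $C_\alpha - \alpha C$ and the rate-dependent quantity is $C_\alpha - \alpha R$; these identifications and the function names are assumptions made for illustration, with logarithms in nats.

```python
import math


def c_alpha(alpha, snr):
    """Assumed form C_alpha = (1/2) log(1 + alpha * snr), in nats."""
    return 0.5 * math.log(1.0 + alpha * snr)


def gap_to_chord(alpha, snr):
    """C_alpha - alpha * C: zero at alpha = 0 and 1, positive in between."""
    return c_alpha(alpha, snr) - alpha * c_alpha(1.0, snr)


def rate_margin(alpha, snr, rate):
    """C_alpha - alpha * R, larger than the gap by alpha * (C - R)."""
    return c_alpha(alpha, snr) - alpha * rate


if __name__ == "__main__":
    snr = 20.0
    capacity = c_alpha(1.0, snr)
    for alpha in (0.0, 0.1, 0.5, 0.9, 1.0):
        print(alpha, gap_to_chord(alpha, snr), rate_margin(alpha, snr, 0.9 * capacity))
```

The gap behaves like a multiple of $1 - \alpha$ as $\alpha$ approaches 1, which is why fractions of mistakes close to 1 need separate care in the analysis.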
Remark on average codeword power: The average codeword
power has expectation with respect to
that matches $P$, for all . The distribution of the average codeword power is tightly concentrated around $P$ as explained in [11, Appendix], and will not be explored further here.
III. PERFORMANCE OF LEAST SQUARES
In this section, we examine the performance of the least squares decoder (2) in terms of rate and reliability. For , let denote the set of indices for which is nonzero. Furthermore, let
(13) denote the set of allowed subsets of terms. It corresponds to the subsets of of size and comprising exactly one term from each section.
Recall that we are interested in bounding given in (4).
By symmetry
where . Correspondingly, for fixed
, we proceed to obtain bounds for . Let . Furthermore, let be the least squares solution (2) and
. Notice that , which is
also the number of sections incorrectly decoded.
For let be the event that there are exactly mistakes. Now, can be expressed as a disjoint union of , for . Correspondingly
(14)
In the following two lemmas, we give bounds for
for .
Lemma 3: Set for an . The probability can be bounded by , where
(15)
where and . Here, is
the signal-to-noise ratio.
Remark: Notice that depends also on , and . Whether is exponentially small depends on the relative size of the combinatorial term and the exponential term in
and .
Proof of Lemma 3: For the occurrence of , there must be an which differs from the subset sent in an amount
and which has , or equivalently has , where
(16) The analysis proceeds by considering an arbitrary such , bounding the probability that , and then using an appropriately designed union bound to put such probabilities together. Notice that the subsets and have an intersection
of size and difference of size
.
Let denote the joint density of and when is sent. Furthermore, let . The actual density of given , denoted by , has mean and variance . Furthermore, there is conditional independence of and given .
Next, consider the alternative hypothesis that was sent and let denote the corresponding joint density under this hypothesis. The conditional density for given and ,
denoted by , is now Normal( ). With
respect to this alternative hypothesis, the conditional distribution for given remains Normal( ). That
is, .
We decompose the test statistic in (16) as , where (17) and
(18) Note that depends only on terms in , whereas
depends also on the part of not in .
Concerning , note that we may express it as
(19) where
is the adjustment by the logarithm of the ratio of the normalizing constants of these densities. Using Bayes rule, notice that
Correspondingly, one gets from (19) that
(20) We are examining the event that there is an , with and . For positive , the indicator of this event satisfies
where is of size and of size .
The earlier follows since if there is such an with , then indeed that contributes a term on the right side of value at least 1. Here, the outer sum is over . For each such , for the inner sum, we have sections in each of which, to comprise , there is a term selected from among choices other than the one prescribed by .
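The counting behind this union bound can be verified by enumeration for small cases. The sketch below assumes sections of size $M$, so that each mistaken section offers $M - 1$ wrong choices, giving $\binom{L}{\ell}(M-1)^{\ell}$ allowed subsets that differ from the sent one in exactly $\ell$ sections. The function name and the toy sizes are hypothetical.

```python
import itertools
import math


def count_at_distance(L, M, ell):
    """Number of allowed choices differing from a fixed one in exactly ell sections."""
    sent = (0,) * L  # by symmetry any sent choice gives the same count
    return sum(
        1
        for choice in itertools.product(range(M), repeat=L)
        if sum(a != b for a, b in zip(choice, sent)) == ell
    )


if __name__ == "__main__":
    L, M = 3, 4
    for ell in range(L + 1):
        formula = math.comb(L, ell) * (M - 1) ** ell
        print(ell, count_at_distance(L, M, ell), formula)
```

Summing the counts over all $\ell$ recovers the $M^L$ codewords, which is the binomial expansion of $(1 + (M-1))^L$.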
To bound the probability of , take the expectation of both sides, bring the expectation on the right inside the outer sum, and write it as the iterated expectation, where on the inside we condition on , , and to pull out the factor involving , to get that is not more than
Notice that , that is, is
independent of , and . Correspondingly, the inner expectation may be expressed as . Furthermore, we arrange for to be not more than 1. Then, by Jensen’s inequality, the expectation may be brought inside the power and inside the inner sum, yielding
(21)
Recall that
from (20). Consequently, one has
which is equal to . The sum over entails less than , where , choices so the bound (21) becomes
(22)
The sum over in the aforementioned expression is over terms. Furthermore, is a sum of independent mean-zero random variables, each of which is the difference of squares of normals for which the squared correlation is . So using (9), the expectation is found to be equal to . When this is plugged in above and optimized over in , one gets from the expression of given in (11) that the expectation on the right side of (22) is equal to . This completes the proof of the lemma.
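The moment generating function step can be checked numerically. The sketch below uses the standard Gaussian quadratic-form identity E exp{(lam/2)(Z1^2 - Z2^2)} = 1/sqrt(1 - lam^2 (1 - rho^2)) for standard normals Z1, Z2 with correlation rho; that this is the form of (9) is an assumption here, and the names `lam`, `rho`, `mgf_closed_form`, and `mgf_monte_carlo` are illustrative.

```python
import math
import random


def mgf_closed_form(lam: float, rho: float) -> float:
    """Closed form of E exp{(lam/2)(Z1^2 - Z2^2)} for correlation rho."""
    arg = 1.0 - lam ** 2 * (1.0 - rho ** 2)
    if arg <= 0.0:
        raise ValueError("expectation is infinite for this lam and rho")
    return 1.0 / math.sqrt(arg)


def mgf_monte_carlo(lam: float, rho: float, samples: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same expectation, using a fixed seed."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho ** 2)
    total = 0.0
    for _ in range(samples):
        z1 = rng.gauss(0.0, 1.0)
        # z2 is standard normal with correlation rho to z1.
        z2 = rho * z1 + scale * rng.gauss(0.0, 1.0)
        total += math.exp(0.5 * lam * (z1 * z1 - z2 * z2))
    return total / samples


if __name__ == "__main__":
    lam, rho = 0.5, 0.6
    print(mgf_closed_form(lam, rho), mgf_monte_carlo(lam, rho))
```

A larger squared correlation brings the closed form closer to 1, which is the sense in which increasing the correlation tightens the resulting Chernoff-type bound.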
Remark: A natural question to ask is why we did not use the simpler union bound for given by
where , is any set with . One could
then use a Chernoff bound for the term . Indeed, this is what we tried initially; however, due to the presence of the two combinatorial terms, we were unable to make the above bound go to zero, with large , for all rates less than capacity. In our proof above, by introducing the term in the exponent, we were able to reduce the term to . Optimizing over revealed the best bound using this method. A somewhat similar analysis has been done before to obtain error exponents for the standard channel coding problem, for example, in [35].
A difficulty with the Lemma 3 bound is that for near 1 and for correspondingly close to , in the key quantity
, the order of is , which is too close to zero to cancel the effect of the combinatorial coefficient .
The following lemma refines the analysis of Lemma 3, obtaining the same exponent with an improved correlation coefficient. The denominator of now becomes
. This is an improvement due to the presence of the factor allowing the conclusion to be useful also for near 1. The price we pay is the presence of an additional term in the bound.
Lemma 4: Let a positive integer be given and let . Then, is bounded by the minimum for
in the interval of , where
(23)
where, here, the quantities and
Proof of Lemma 4: Split the test statistic where
and
Take positive and negative . Then,
, with being the event that there is an , with and . Similarly, is the corresponding event that . The part has no dependence on so its treatment is simpler. It is a mean-zero average of differences of squared normal random variables, with squared correlation . So using its moment generating function, is exponentially small, bounded by the second of the two expressions in (23).
Concerning , its analysis is much the same as for Lemma 3. We again decompose as the sum
, where is the same as earlier. The difference is that in forming we subtract rather than . Consequently
which again involves a difference of squares of standardized normals. But here the coefficient multiplying is such that we have maximized the correlations between the
and . Consequently, we have reduced the spread of the distribution of the differences of squares of their standardizations as quantified by the cumulant generating function.
One finds that the squared correlation coefficient is
for which .
Accordingly we have that the moment generating function is which gives rise to the bound appearing as the first of the two expressions in (23). This completes the proof of Lemma 4.
From Lemma 4, one gets that , where
Consequently, from Lemmas 3 and 4, along with (14), one gets
that , where
(24)
This is the bound we use to numerically compute the rate curve in Fig. 2. Accordingly, the error exponent of Proposition 1 satisfies
(25)
Our task will be to give simplified lower bounds for the right side of (25) for all . In the next section, we characterize the section size required to achieve rates up to capacity. In Section V, we prove Proposition 1 and in Section VI, we prove Proposition 2. We also remark that in Appendix F we discuss how the bounds of the aforementioned two lemmas may be modified to deal with the subset superposition coding scheme described in Section I-A.
Since the bounds of Lemma 4 are better than those in Lemma 3 for values near 1, for simplicity we only use the bounds from Lemma 4 in characterizing the error exponents. Correspondingly, from here on we take
(26) as in Lemma 4.
IV. SUFFICIENT SECTION SIZE
We call $a = (\log M)/(\log L)$ the section size rate, that is, the bits required to describe the member of a section relative to the bits required to describe which section. It is invariant to the base of the log. Equivalently, we have $M$ and $L$ related by $M = L^a$. Note that the size of $a$ controls the polynomial size of the dictionary $N = LM = L^{a+1}$.
The code length may be written as
$n = (a L \log L)/R$.
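As a rough numerical illustration of how the section size rate sets the dictionary size and code length, the sketch below assumes the relations M = L^a, N = LM, and n = (a L log L)/R with the rate R measured in nats per channel use. The function name and the example values are hypothetical.

```python
from math import log


def section_size_summary(L: int, a: float, R: float):
    """Return (M, N, n) for L sections, section size rate a, and rate R in nats."""
    if L < 2 or a <= 0 or R <= 0:
        raise ValueError("need L >= 2, a > 0 and R > 0")
    M = L ** a              # section size, M = L^a
    N = L * M               # dictionary size, N = L^(a + 1)
    n = a * L * log(L) / R  # code length implied by R = (L log M) / n
    return M, N, n


if __name__ == "__main__":
    M, N, n = section_size_summary(L=64, a=2, R=1.0)
    print(M, N, round(n, 2))
```

The dictionary size is polynomial in L for any fixed a, which is the sense in which a moderate section size rate keeps the dictionary moderate.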
We do not want a requirement on the section sizes with of order for then the complexity would grow exponentially with this inverse of the gap from capacity. So, instead, we
decompose where .
We investigate in this section the use of to cancel out the combinatorial coefficient appearing in the first term in (23). In subsequent sections, excess in , beyond that needed to cancel the combinatorial coefficient, plus are used to produce exponentially small error probability.
Define and .
Now, is increasing as a function of , so is greater than whenever . Accordingly, we decompose the exponent as the sum of two components,
namely, and the difference .
We then ask whether the first part of the exponent denoted is sufficient to cancel out the effect of the log combinatorial coefficient . That is, we want to arrange for the nonnegativity of the difference
(27) This function is plotted in Fig. 3 for specific choices of , ,
, and .
Using , one finds that for sufficiently large depending on , the difference is nonnegative uniformly for the permitted in . The smallest such section size rate is
(28)
where the maximum is for in . This
definition is invariant to the choice of base of the logarithm, assuming that the same base is used for the communication rate
and for the that arises in the definition of . In the aforementioned ratio, the numerator and denominator are both 0 at and (yielding at the ends).
Accordingly, we have excluded 0 and 1 from the definition of for finite . Nevertheless, limiting ratios arise at these ends.
We give bounds for and show that the value of is fairly insensitive to the value of , with the maximum over the whole range being close to a limit which is characterized by values in the vicinity of .
Let near 15.8 be the solution to
Lemma 5: The quantity has the following properties.
(a) For
(29)
where .
(b) The limit for large of is a continuous function which is given, for , by
(30)
and for by
(31)
(c) For all and using log base e, the aforementioned is bounded by
(32) in the case , which is approximately for small positive , whereas in the case , it is bounded by
(33) which asymptotes to the value 1 for large .
The proof of the aforementioned lemma is routine. For convenience, it is given in Appendix B.
While is undesirably large for small , we have reasonable values for moderately large . In particular, equals 5.0 and 3, respectively, at and , and it is near 1 for large . Numerically, it is of interest to ascertain the minimal section size rate , for a specified such as , for chosen to be a given high fraction of , say , for at a fixed small target fraction of mistakes, say , and for
to be a small target probability, so as to obtain . Here, as in (24). This is illustrated in Fig. 4 plotting the minimal section size rate as a function of for . With such moderately less than , we observe substantial reduction in the required section size rate.
V. PROOF OF PROPOSITION 1
In this section, we put the aforementioned conclusions together to prove Proposition 1, demonstrating the reliability of approximate least squares. The following lemma will be useful in proving the lower bound for the error exponent in Proposition 1. Let as earlier.
Fig. 3. Exponents of contributions to the error probability as functions of α = ℓ/L using exact least squares, i.e., t = 0, with L = 100, M = 2 , signal-to-noise ratio v = 15, and rate 70% of capacity. The red and blue curves are the −log[Ẽ ] and −log[E ] bounds, using the natural logarithm, from the two terms in Lemma 4 with optimized t . The dotted green curve is d (27). With = 0.1, the total probability of at least that fraction of mistakes is bounded by 1.8 × 10 .
Fig. 4. Sufficient section size rate a as a function of the signal-to-noise ratio v. The dashed curve shows a at L = 64. Just below it, the thin solid curve is the limit for large L. For section size M L , the error probabilities are exponentially small for all R < C and any > 0. The bottom curve shows the minimal section size rate for the bound on the error probability contributions to be less than e , with R = 0.8C and = 0.1 at L = 64.
Lemma 6: The following bounds hold.
(a) For positive and correlation , let
. Then
(34)
and
(35)
(b) For , let . Then
(36)
For convenience, we put the proof in Appendix A.
We now prove Proposition 1. Consider the exponent appearing in the error bound (23).
Now, has a nondecreasing derivative with
respect to . So is greater than
. Consequently, it lies above the tangent line (the first order Taylor expansion) at , that is
(37) where is the derivative of
with respect to , which is, here, evaluated at . In detail, the derivative is seen to equal
(38)
when , and this derivative is equal to 1 otherwise. (The latter case with derivative equal to 1 includes the
situations and where with ; all
other have ).
We now lower bound the derivative evaluated at . Using the upper bound on given in (36) and the form of , one gets that is bounded by
, which using
and , one gets that
Further using the lower bound in (36), one has
is at least , where we make use of .
Correspondingly
(39) the right side of which is , where is as in (5).
Now, we are in a position to apply Lemma 4 and Lemma 5.
If the section size rate is at least , we have that cancels the combinatorial coefficient , and hence, the first term in the bound (23) (the part controlling ) is not more than
where . Using and
and (39) yield not more than the sum of
and
for any choice of . For convenience, we take to be . In this case, the first part of the aforementioned
sum is .
Now, use (34) to get that is at
least , where .
Correspondingly, using , one gets that
. Accordingly, is
at least .
It follows from the aforementioned that
where . Consequently, summing over all
, for which , one gets
The exponent in the right side of the aforementioned equation
is . Now, use the to
complete the proof of Proposition 1.
Remarks: The form given for the exponential bound is meant only to reveal the general character of what is available. A compromise was made by introduction of an inequality (the tangent bound on the exponent) to proceed most simply to this demonstration. Now, understanding that it is exponentially small, our best evaluation avoids this compromise and proceeds directly, using the bound (24), as it provides substantial numerical improvement.
In the next section, we prove Proposition 2, while also reviewing basic properties of the RS codes.
VI. PROOF OF PROPOSITION 2
We employ RS codes [45], [48], [55] as an outer code for correcting any remaining section mistakes. The symbols for the RS code come from a Galois field consisting of elements denoted by , with typically taken to be of the form . If represent message and codeword lengths, respectively, then an RS code with symbols in and minimum distance between codewords given by can have the following parameters:
Here, gives the number of parity check symbols added to the message to form the codeword. In what follows, we find it convenient to take to be equal to so that we can view each symbol in as giving a number between 1 and .
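The parameter relations for a Reed-Solomon code can be sketched as follows. This uses the standard facts that an RS code over a field with q elements has length q - 1, that the number of parity check symbols equals the minimum distance minus one, and that t symbol errors are correctable whenever 2t is less than the minimum distance. The function and variable names are hypothetical.

```python
def rs_parameters(q: int, d: int):
    """Return (n, k, t) for an RS code over a field with q elements.

    n: codeword length, n = q - 1
    k: message length, from n - k = d - 1 parity check symbols
    t: number of symbol errors that minimum distance d can correct
    """
    n = q - 1
    k = n - d + 1
    if d < 1 or k < 1:
        raise ValueError("minimum distance must satisfy 1 <= d <= q - 1")
    t = (d - 1) // 2
    return n, k, t


if __name__ == "__main__":
    print(rs_parameters(q=64, d=13))
```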
We now demonstrate how the RS code can be used as an outer code in conjunction with our inner superposition code to achieve low block error probability. For simplicity, assume that is a power of 2. First, consider the case when equals . Taking , we have that since is equal to , the RS code length becomes . Thus, one can view each symbol as representing an index in each of the sections. The number of input
symbols is, then, , so setting ,
one sees that the outer rate equals which is at
least .
For code composition, message bits become the input symbols to the outer code. The symbols of the outer codeword, having length , give the labels of terms sent from each section using our inner superposition with
code length . From the received , the
estimated labels using our least squares decoder can be again thought of as output symbols for our RS codes. If denotes the section mistake rate, it follows from the distance property of the outer code that if , then these errors can be corrected. The overall rate is seen to be equal to the product of rates which is at least . Since we arrange for to be smaller than some with exponentially small probability , it follows from the previous that composition with an outer code allows us to communicate with the same reliability, albeit with a slightly
smaller rate given by .
The case when can be dealt with by observing ([45], p. 240) that an RS code as aforementioned can be shortened by length , where , to form an
code with the same minimum distance as earlier. This is easily seen by viewing each codeword as being created by appending parity check symbols to the end of the corresponding message string. Then, the code formed by considering the set of codewords with the leading symbols identical to zero has precisely the properties stated earlier.
With equal to as earlier, we have equals ; so
taking to be , we get an code, with
, , and minimum distance . Now,
since the outer code length is and symbols of this code are in the code composition can be carried out as earlier.
This completes the proof of Proposition 2.
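The shortening argument can be made concrete with a small sketch. It assumes an RS code over a field with M elements, shortened from length M - 1 to length L, with the minimum distance chosen as 2t + 1 so that t = floor(alpha0 * L) section mistakes are correctable. That choice of distance, the function names, and the example values are assumptions made for illustration, not the exact constants of the proof.

```python
from math import floor


def shortened_rs_outer_code(L: int, M: int, alpha0: float):
    """Return (n, k, d, outer_rate) for a length-L outer code over M symbols."""
    if L > M - 1:
        raise ValueError("need L <= M - 1 to shorten a length M - 1 code")
    t = floor(alpha0 * L)       # correctable number of section mistakes
    d = 2 * t + 1               # assumed minimum distance, enough to correct t errors
    n_full = M - 1              # unshortened RS length
    k_full = n_full - d + 1     # unshortened message length
    s = n_full - L              # number of leading zero symbols removed
    n, k = n_full - s, k_full - s
    if k < 1:
        raise ValueError("alpha0 too large for this L")
    return n, k, d, k / n


def composite_rate(inner_rate: float, outer_rate: float) -> float:
    """Overall rate of the composed code: product of inner and outer rates."""
    return inner_rate * outer_rate


if __name__ == "__main__":
    n, k, d, r_out = shortened_rs_outer_code(L=64, M=128, alpha0=0.1)
    print(n, k, d, r_out, composite_rate(1.0, r_out))
```

Shortening leaves the minimum distance unchanged, so the shortened code corrects the same number of section mistakes while its length matches the number of sections.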
VII. GENERALIZATION TO APPROXIMATE LEAST SQUARES
In conclusion, we remark that our results are equally valid for an approximate least squares decoder, which for some nonnegative chooses a satisfying
(40)
where is what is sent. Since the aforementioned criterion is less restrictive than (2), it may be possible to find a computationally feasible algorithm for it. Indeed, we show in Appendix E that any computationally feasible algorithm that is an accurate decoder must be an approximate least squares decoder for some small .
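To make the distinction between exact and approximate least squares concrete, here is a toy brute-force sketch. It assumes codewords are plain sums of one selected column per section (unit coefficients) and models the approximate criterion as an additive `slack` on the squared distance relative to the codeword sent; the exact scaling in (40) may differ, and all names are hypothetical. The exhaustive search is exponential in the number of sections and is only meant for tiny examples.

```python
import itertools


def codeword(X, choice):
    """Sum the selected column from each section; X[sec][j] is a length-n list."""
    n = len(X[0][0])
    return [sum(X[sec][j][i] for sec, j in enumerate(choice)) for i in range(n)]


def sq_dist(y, x):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(y, x))


def least_squares_decode(X, y):
    """Exact least squares: search every choice of one term per section."""
    M = len(X[0])
    candidates = itertools.product(range(M), repeat=len(X))
    return min(candidates, key=lambda c: sq_dist(y, codeword(X, c)))


def is_approx_least_squares(X, y, estimate, sent, slack):
    """Check whether `estimate` fits y nearly as well as `sent`, within `slack`."""
    return sq_dist(y, codeword(X, estimate)) <= sq_dist(y, codeword(X, sent)) + slack


if __name__ == "__main__":
    # Two sections with two columns each, code length 2 (toy values).
    X = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
    sent = (1, 0)
    y = [2.2, 0.9]  # codeword for `sent` plus a small perturbation
    print(least_squares_decode(X, y))
    print(is_approx_least_squares(X, y, (0, 0), sent, slack=2.0))
```

With zero slack the check reduces to the exact least squares comparison against the sent codeword, so any exact least squares solution passes it.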
We now describe how our error probability bounds can be generalized to incorporate (40). We note that (40) is equivalent
to finding an , so that , with ,
where is as in (16). We find that the expression for
in Lemma 3 holds for approximate least squares decoders with , if we replace by . Furthermore, the expression for of Lemma 4 is also true for , if one replaces the appearing in the second term of the bound by . Accordingly, for such approximate decoders, with , the bound corresponding to Lemma 4 becomes
(41)
where and
is as in Lemma 4.
The analysis of this decoder is quite similar to that of (2).
Interested readers may refer to [11] for a more general analysis incorporating (40).
APPENDIX A PROOF OF LEMMA 6
We first prove (a). Write explicitly as an increasing function of the ratio . Working with logarithm base , the derivative with respect to of the expression being maximized yields a quadratic equation which can be solved for the optimal
Using this , we get that ,
which is at least . Here, , with
. Correspondingly, . This
proves (34).
For the lower bound on , recall that the case occurs when , in which case is at least . Using
proves (35).
Next we prove (b). Notice that has second derivative . It follows that , since the difference of the two sides has negative second derivative, so it is concave and equals 0 at and .
For the upper bound, notice that the derivative of is
at and at , where and
. Correspondingly, is bounded from above by the minimum of and . Now, it is not hard to see that
Correspondingly, we get the upper bound in (36).
APPENDIX B PROOF OF LEMMA 5
We first prove (a). Define , which, using the lower bound on given in Lemma 6(b) and
, is at least . Consequently, is at least
using . Correspondingly, using (35) and the lower
bound (7), one gets that is at least
which is equal to times
Furthermore, can be bounded by and . Therefore, it is at most
, where . Using this,
the lower bound on and the form of given in (28), one gets that can be bounded by
times
Now, use to get that
Now, observe that the second term in the maximum given previously dominates the other two terms for all . This completes the proof of (a).
Next we prove (b). For in , we use
and the strict positivity of to see that the ratio in the definition of tends to zero uniformly within compact sets interior to . So the limit is determined by the maximum of the limits of the ratios at the two ends. In the vicinity of the left and right ends, we replace by the continuous upper bounds and , respectively, which are tight at and , respectively. Then, in accordance with L’Hôpital’s rule, the limit of the ratios equals the ratios of the derivatives at and , respectively. Accordingly
(42)
where and are the derivatives of with respect to evaluated at and , respectively.
To determine the behavior of in the vicinity of 0 and 1, we first need to determine whether the optimal in its definition is strictly less than 1 or equal to 1. From Section II,
the case occurs if and only if . The
right side of this is . So it is equivalent to determine whether the ratio
is less than 1 for in the vicinity of 0 and 1. Using L’Hôpital’s rule, it suffices to determine whether the ratio of derivatives is less than 1 when evaluated at 0 and 1. At , it is
which is not more than 1/2 (certainly less than 1) for all positive , whereas at , the ratio of derivatives is which is less than 1 if and only
if . In other words, at , the optimum is less than one for all , whereas at , it is less than one if and only if
.
For the cases in which the optimal , we need to determine the derivative of at and . Recall that is
the composition of the functions and
and . Also recall that
the limit of , as tends to 0 or 1, is zero.
Use the chain rule to find the derivative of , taking the products of the associated derivatives. The first of these functions has derivative which is 1/4 at , the second of these has derivative which is 1/2 at , and the third of these functions is which has derivative that evaluates to at and evaluates to at . Correspondingly, for , the derivative of is for all , whereas for , its derivative is for .
For , the magnitude of the derivative of at 1 is smaller than at 0. Indeed, taking square roots, this is the same as
the claim that .
Replacing and rearranging, it reduces to
, which is true for since the two sides match at and have derivatives . Thus, the limiting value for near 1 is what matters for the maximum. This produces the claimed form of for .
In contrast for , the optimal equals 1 for in the vicinity of 1. In this case, we use
which has derivative equal to
at , which is again smaller in magnitude than the derivative at , producing the claimed form for for .
At we equate and see that
both of the expressions for the magnitude of the derivative at 1 agree with each other (both reducing to ), so the argument extends to this case, and the expression for is continuous in .
(c) is proved by using and simplifying
the resulting expression. This completes the proof of Lemma 5.
APPENDIX C
IMPROVEMENT IN FORM OF EXPONENT
The following improvement in the form of the exponent in Proposition 1 can be obtained.
Theorem 7: Assume , where , and rate is less than capacity . For the least squares decoder
Here
where is positive and tends to as tends to 0. Here
(43) and
(44) Proof of Theorem 7: Here, we determine the minimum value of for which the combinatorial term is canceled, and we characterize the amount beyond that minimum which makes the error probability exponentially small. Arrange to be the solution to the equation
To see its characteristics, let at
using log base . Here, is the inverse of the function which is the composition of the increasing functions
and previously dis-
cussed in Section II. This is near for small . When , the condition is satisfied and indeed solves the aforementioned equation;
otherwise, provides the solution.
Now, , which from earlier
can be bounded by . Also,
. Consequently, is small for large
; moreover, for near 0 and 1, it is of order and , respectively, and via the indicated bounds, derivatives at 0 and 1 can be explicitly determined.
The analysis in Lemma 5 may be interpreted as determining section size rates such that the differentiable upper bounds on
are less than or equal to for ,
where, noting that these quantities are 0 at the endpoints of the interval, the critical section size rate is determined by matching the slopes at . At the other end of the interval, the bound on the difference has a strictly positive slope for
at , given by as in (43). The positivity of follows from recalling that , since the second term in (42) always turns out to be the greater one. Consequently, one
may take for some positive , where
tends to as tends to 0.
Recall that . Express as the sum
of needed to cancel the combinatorial coefficient, and , which is positive. This arises in establishing that the main term in the probability bound is exponentially small. It decomposes as
. Arrange to be
so that .
Consider the exponent as given
in Lemma 4. We take a reference for which
and for which is at least and at least a multiple of
. For convenience, we set to
be halfway between and . Recall that has a nondecreasing derivative with respect to . So is greater than . Consequently, it lies above the tangent line at , that is
where as earlier is the derivative of
with respect to , which is, here, evaluated at . Its expression is as in (38).
We wish to examine the behavior of for near 0.
For this, we first lower bound the derivative . Since this derivative is nondecreasing, it is at least as large as the value at
. Now, recall that has a limit 0 as
tends to 0. Furthermore, has limit as
tends to 0. Consequently, from (38), at , we have tends to , given by (44), as tends to 0. Consequently,
, where is positive and tends to as goes to 0.
Next examine . Since is at least , it follows
that is at least . Consequently, as
in the proof of Proposition 1, if the section size rate is at least , then the bound (23) is not more than the sum of
and
Using half way between and , the first part of the bound is at most
This bound is superior to the previous one, when closely matches , because of the addition of the nonnegative term. The second part of the bound can be dealt with as in Proposition 1. Accordingly, we have proved that
where for small . It tends to as
tends to 0. This completes the proof of Theorem 7.
APPENDIX D COMPUTATIONS
We describe how the rate curves in Fig. 2 were computed. The block error probability was fixed at and the signal-to-noise ratio was taken to be 20 and 100. The PPV curve was computed using the right side of (8) for the given and . The maximum achievable (composite) rate for the superposition code was calculated in the following manner. The number of sections ranged from 20 to 100 in steps of 10, with the corresponding section size taken to be , where as in (32) and (33).
For given and values of , and , the inner coder rate was decreased from to in decrements of
. For a given , the minimum section mistake rate so that the error probability, computed using bounds (24), is at most was computed. The corresponding composite rate is taken to be
The maximum of the composite rates , when ranged from to in decrements of , is the reported maximum achievable rate for the superposition code for the given values of , and .
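The search just described can be sketched as a small grid search. The sketch takes the error bound as a user-supplied function `error_bound(alpha, R)`, standing in for the bound (24), and assumes the composite rate has the form (1 - 2 * alpha) * R suggested by the outer code construction of Section VI. Both the callable and that rate formula are assumptions for illustration, and the grids are left to the caller.

```python
def max_composite_rate(error_bound, rates, alphas, eps):
    """Largest composite rate over a grid of inner rates.

    error_bound: callable (alpha, R) -> bound on the error probability
    rates:       candidate inner code rates R
    alphas:      candidate section mistake rates
    eps:         target block error probability
    """
    best = 0.0
    for R in rates:
        # Section mistake rates for which the bound meets the target.
        feasible = [a for a in alphas if error_bound(a, R) <= eps]
        if not feasible:
            continue
        alpha = min(feasible)  # minimum section mistake rate for this R
        # Assumed composite rate after the outer code corrects these mistakes.
        best = max(best, (1.0 - 2.0 * alpha) * R)
    return best


if __name__ == "__main__":
    # Toy stand-in bound: succeeds only when alpha is at least R / 10.
    toy_bound = lambda alpha, R: 1.0 if alpha < R / 10 else 0.0
    print(max_composite_rate(toy_bound, [0.5, 1.0], [0.05, 0.1, 0.2], eps=1e-4))
```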
APPENDIX E
ACCURATE DECODER IMPLIES APPROXIMATE LEAST SQUARES
In Lemma 9 below, we show that any accurate decoder is an approximate least squares decoder. More specifically, we show that if the fraction of mistakes made by a decoder is small, the distance of the estimated fit from cannot be much greater than the distance of the codeword sent, that is , from . To prove this, we require the following lemma, which is a consequence of the restricted isometry property [16], [17] for Gaussian random matrices. We recall that the entries of our matrix are i.i.d. .
Lemma 8: Let and . Then,
the following holds except on a set with probability at most :
(45)
where is related to the
restricted isometry property constant.
Proof: Statement (45) is equivalent to giving uniform bounds on the maximum singular value of the matrices , for all , where is as in (13). For , let denote the maximum singular value of . We use a result in [61] (see also [17]), giving tail bounds for the maximum singular value for Gaussian matrices from which one gets that for positive
Accordingly, choose and use
to get that , except on a set with probability .
We need to hold uniformly for all sets
, with high probability. Correspondingly, using , using a union bound, one gets that the probability of the event
is at least . This completes the proof of the lemma.
If , then from standard results on the tail bounds of chi-square random variables, one has that
(46)
Lemma 8 and (46) give us the following.
Lemma 9: Assume that a decoder for the superposition code, operating at rate , makes at most section mistakes. Denote as the estimate of the true
outputted by the decoder. Then, with probability at least , the estimate satisfies
with . In other words, with
high probability, is the solution of an approximate least squares decoder (40) with the given .
Proof: We need to show that cannot be much greater than . Notice that
(47) where for (47) we use the fact that the noise .
Now, , since the decoder makes at most
mistakes. Accordingly, using Lemma 8 and (46), with probability at least , one gets that and . Consequently, from (47), one gets
with probability at least , where
.
APPENDIX F
ERROR BOUNDS FOR SUBSET SUPERPOSITION CODES
The method of analysis also allows the consideration of subset superposition coding described in Section I-A. In this case, all subsets of size correspond to codewords, so with the rate in nats, we have . The analysis proceeds in the same manner, with the same number of choices
of sets where and agree on terms,
but now with choices of sets of size
they disagree. We obtain the same bounds as earlier except that where we have , with the exponent , it is
replaced by , with the exponent defined
by .
Correspondingly, for subset superposition coding, the probability is bounded by the minimum of the same expressions given in Lemma 3 and Lemma 4, except that the term appearing in these expressions is replaced by the quantity defined previously. We have not investigated in greater detail whether there is reliability for any rate below capacity for these codes.
ACKNOWLEDGMENT
We thank John Hartigan, Cong Huang, Yiannis Kontiyiannis, Mokshay Madiman, Xi Luo, Dan Spielman, Edmund Yeh, Imre Teletar, Harrison Zhou, David Smalling, and Creighton Heaukulani for helpful conversations.
REFERENCES
[1] A. Abbe and A. R. Barron, “Polar coding schemes for the AWGN channel,” in Proc. IEEE Int. Symp. Inf. Theory, St. Petersburg, Russia, Aug. 2011, pp. 194–198.
[2] M. Akcakaya and V. Tarokh, “Shannon-theoretic limits on noisy compressive sampling,” IEEE Trans. Inf. Theory, vol. 56, no. 1, pp. 492–504, Jan. 2010.
[3] E. Arikan, “Channel polarization,” IEEE Trans. Inf. Theory, vol. 55, no. 7, pp. 3051–3073, Jul. 2009.
[4] E. Arikan and E. Telatar, “On the rate of channel polarization,” in Proc. IEEE Int. Symp. Inf. Theory, Seoul, Korea, Jul. 2009, pp. 1493–1495.
[5] A. Barg and G. Zémor, “Error exponents of expander codes under linear-complexity decoding,” SIAM J. Discrete Math., vol. 17, no. 3, pp. 426–445, 2004.
[6] A. R. Barron, “Universal approximation bounds for superpositions of a sigmoidal function,” IEEE Trans. Inf. Theory, vol. 39, no. 3, pp. 930–944, May 1993.
[7] A. Barron, A. Cohen, W. Dahmen, and R. Devore, “Approximation and learning by greedy algorithms,” Ann. Statist., vol. 36, no. 1, pp. 64–94, 2007.
[8] A. R. Barron and A. Joseph, Sparse superposition codes: Fast and reliable at rates approaching capacity with Gaussian noise [Online]. Available: http://www.stat.yale.edu/arb4
[9] A. R. Barron and A. Joseph, “Towards fast reliable communication at rates near capacity with Gaussian noise,” in Proc. IEEE Int. Symp. Inf. Theory, Austin, TX, Jun. 13–18, 2010, pp. 315–319.
[10] A. R. Barron and A. Joseph, “Least squares superposition codes of moderate dictionary size, reliable at rates up to capacity,” in Proc. IEEE Int. Symp. Inf. Theory, Austin, TX, Jun. 13–18, 2010, pp. 275–279.
[11] A. R. Barron and A. Joseph, Least squares superposition codes of moderate dictionary size, reliable at rates up to capacity [Online]. Available: http://arxiv.org/abs/1006.3780
[12] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit error-correcting coding: turbo codes,” in Proc. Int. Conf. Commun., Geneva, Switzerland, May 1993, pp. 1064–1070.
[13] T. Blumensath and M. E. Davies, “Iterative thresholding for sparse approximations,” J. Fourier Anal. Appl., vol. 14, pp. 629–654, 2008.
[14] L. Applebaum, W. U. Bajwa, M. F. Duarte, and R. Calderbank, “Multiuser detection in asynchronous on-off random access channels using lasso,” in Proc. 48th Annu. Allerton Conf. Commun., Control, Comput., Sep. 2010, pp. 130–137.
[15] E. Candes and Y. Plan, “Near-ideal model selection by ℓ1 minimization,” Ann. Statist., vol. 37, no. 5A, pp. 2145–2177, 2009.
[16] E. Candes and T. Tao, “Decoding by linear programming,” IEEE Trans. Inf. Theory, vol. 51, no. 12, pp. 4203–4215, Dec. 2005.
[17] E. Candes and T. Tao, “Near-optimal signal recovery from random projections: Universal encoding strategies?,” IEEE Trans. Inf. Theory, vol. 52, no. 12, pp. 5406–5425, Dec. 2006.
[18] J. Cao and E. M. Yeh, “Asymptotically optimal multiple-access communication via distributed rate splitting,” IEEE Trans. Inf. Theory, vol. 53, no. 1, pp. 304–319, Jan. 2007.
[19] T. M. Cover, “Broadcast channels,” IEEE Trans. Inf. Theory, vol. 18, no. 1, pp. 2–14, Jan. 1972.
[20] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley-Interscience, 2006.
[21] I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Commun. Pure Appl. Math., vol. 57, no. 11, pp. 1413–1457, 2004.
[22] D. Donoho, “For most large underdetermined systems of linear equations the minimal ℓ1-norm near solution approximates the sparsest solution,” Commun. Pure Appl. Math., vol. 59, no. 10, pp. 907–934, 2006.
[23] D. L. Donoho, M. Elad, and V. M. Temlyakov, “Stable recovery of sparse overcomplete representations in the presence of noise,” IEEE Trans. Inf. Theory, vol. 52, no. 1, pp. 6–18, Jan. 2006.
[24] D. Donoho and X. Huo, “Uncertainty principles and ideal atomic decomposition,” IEEE Trans. Inf. Theory, vol. 47, no. 7, pp. 2845–2862, Nov. 2001.