
Information-Theoretic Limits on Sparse Signal Recovery: Dense versus Sparse Measurement Matrices

Wei Wang, Member, IEEE, Martin J. Wainwright, Member, IEEE, and Kannan Ramchandran, Fellow, IEEE

Abstract—We study the information-theoretic limits of exactly recovering the support set of a sparse signal, using noisy projections defined by various classes of measurement matrices. Our analysis is high-dimensional in nature, in which the number of observations n, the ambient signal dimension p, and the signal sparsity k are all allowed to tend to infinity in a general manner. This paper makes two novel contributions. First, we provide sharper necessary conditions for exact support recovery using general (including non-Gaussian) dense measurement matrices. Combined with previously known sufficient conditions, this result yields sharp characterizations of when the optimal decoder can recover a signal for various scalings of the signal sparsity k and sample size n, including the important special case of linear sparsity (k = Θ(p)) using a linear scaling of observations (n = Θ(p)). Our second contribution is to prove necessary conditions on the number of observations n required for asymptotically reliable recovery using a class of γ-sparsified measurement matrices, where the measurement sparsity parameter γ(n, p, k) ∈ (0, 1] corresponds to the fraction of nonzero entries per row. Our analysis allows general scaling of the quadruplet (n, p, k, γ), and reveals three different regimes, corresponding to whether measurement sparsity has no asymptotic effect, a minor effect, or a dramatic effect on the information-theoretic limits of the subset recovery problem.

Index Terms—ℓ1-relaxation, compressed sensing, Fano’s method, high-dimensional statistical inference, information-theoretic bounds, sparse approximation, sparse random matrices, sparsity recovery, subset selection, support recovery.

I. INTRODUCTION

SPARSITY recovery refers to the problem of estimating the support of a p-dimensional but k-sparse vector β*, based on a set of noisy linear observations. The sparsity

Manuscript received August 01, 2008; revised September 01, 2009. Current version published May 19, 2010. The work of W. Wang and K. Ramchandran was supported by NSF Grant CCF-0635114. The work of M. J. Wainwright was supported by NSF Grants CAREER-CCF-0545862 and DMS-0605165. This work was posted in May 2008 as Technical Report 754 in the Department of Statistics, University of California, Berkeley, and was posted as arXiv:0806.0604 [math.ST]. The material in this work was presented in part at the IEEE International Symposium on Information Theory, Toronto, ON, Canada, July 2008.

W. Wang and K. Ramchandran are with the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720 USA (e-mail: wangwei@eecs.berkeley.edu; kannanr@eecs.berkeley.edu).

M. J. Wainwright is with the Department of Statistics and the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720 USA (e-mail: wainwrig@eecs.berkeley.edu).

Communicated by J. Romberg, Associate Editor for Signal Processing.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIT.2010.2046199

recovery problem is of broad interest, arising in subset selection in regression [19], graphical model selection [18], group testing, signal denoising [8], sparse approximation [20], and compressive sensing [11], [7]. A large body of recent work (e.g., [6]–[8], [11], [12], [18], [27], [28], and [30]) has analyzed the performance of computationally tractable methods, in particular based on ℓ1 or other convex relaxations, for estimating high-dimensional sparse signals. Such results have established conditions, on signal sparsity and the choice of measurement matrices, under which a given recovery method succeeds with high probability.

Of complementary interest are the information-theoretic limits of the sparsity recovery problem, which apply to the performance of any procedure regardless of its computational complexity. Such analysis has two purposes: first, to demonstrate where known polynomial-time methods achieve the information-theoretic bounds, and second, to reveal situations in which current methods are sub-optimal. An interesting question which arises in this context is the effect of the choice of measurement matrix on the information-theoretic limits. As we will see, the standard Gaussian measurement ensemble achieves an optimal scaling of the number of observations required for recovery. However, this choice produces highly dense matrices, which may lead to prohibitively high computational complexity and storage requirements.1 In contrast, sparse measurement matrices directly reduce encoding and storage costs, and can also lead to fast decoding algorithms by exploiting problem structure (see Section I-B for a brief overview of the growing literature in this area). In addition, measurement sparsity can be used to lower communication cost and latency in distributed sensor network and streaming applications. On the other hand, measurement sparsity can potentially reduce statistical efficiency by requiring more observations to recover the signal.

Intuitively, the nonzeros in the signal may rarely align with the nonzeros in a sparse measurement matrix.2 Therefore, an important question is to characterize the trade-off between measurement sparsity and statistical efficiency.

With this motivation, this paper makes two contributions.

First, we derive sharper necessary conditions for exact support recovery, applicable to a general class of dense measurement matrices (including non-Gaussian ensembles). In conjunction with the sufficient conditions from previous work [29], this analysis provides a sharp characterization of necessary and

1For example, ℓ1-recovery methods based on linear programming have complexity O(p³) in the signal dimension p.

2Note, however, that misalignments between the measurements and the signal still reveal some information about the locations of the nonzeros in the signal.



sufficient conditions for various sparsity regimes. Our second contribution is to address the effect of measurement sparsity, meaning the fraction γ of nonzeros per row in the matrices used to collect measurements. We derive lower bounds on the number of observations required for exact sparsity recovery, as a function of the signal dimension p, signal sparsity k, and measurement sparsity γ. This analysis highlights a trade-off between the statistical efficiency of a measurement ensemble and the computational complexity associated with storing and manipulating it.

The remainder of the paper is organized as follows. We first define our problem formulation in Section I-A, and then discuss our contributions and some connections to related work in Section I-B. Section II provides precise statements of our main results, as well as a discussion of their consequences.

Section III provides proofs of the necessary conditions for various classes of measurement matrices, while proofs of more technical lemmas are given in Appendices A–F. Finally, we conclude and discuss open problems in Section IV.

A. Problem Formulation

Let β* ∈ ℝ^p be a fixed but unknown vector, with the support set of β* defined as

S(β*) := {i ∈ {1, ..., p} : β*_i ≠ 0}.   (1)

We refer to k = |S(β*)| as the signal sparsity, and p as the signal dimension. Suppose we are given a vector of n noisy observations Y ∈ ℝ^n, of the form

Y = Xβ* + W   (2)

where X ∈ ℝ^{n×p} is the known measurement matrix, and W ~ N(0, σ²I_{n×n}) is additive Gaussian noise. Our goal is to perform exact recovery of the underlying sparsity pattern S(β*), which we refer to as the sparsity recovery problem. The focus of this paper is to find conditions on the model parameters (n, p, k, λ) that are necessary for any method to successfully recover the support set S(β*). Our results apply to various classes of dense and γ-sparsified measurement matrices, which will be defined in Section II.

1) Classes of Signals: The difficulty of sparsity recovery from noisy measurements naturally depends on the minimum value of β* on its support, defined by the function

λ*(β) := min_{i ∈ S(β)} |β_i|.   (3)

In this paper, we study the class of signals parameterized by a lower bound λ on the minimum value

C(λ) := {β ∈ ℝ^p : |S(β)| = k and λ*(β) ≥ λ}.   (4)

The associated class of sparsity patterns is the collection of all possible subsets of size k. We assume without loss of generality that the noise variance σ² = 1, since any scaling of σ can be accounted for in the scaling of λ.

2) Decoders and Error Criterion: Suppose that nature chooses some vector β* from the signal class C(λ). The statistician observes the n samples Y = Xβ* + W and tries to infer the underlying sparsity pattern S(β*). The results of this paper apply to arbitrary decoders. A decoder is a mapping g from the observations Y to an estimated subset Ŝ = g(Y). We measure the error between the estimate Ŝ and the true support S(β*) using the {0, 1}-valued loss function 1[g(Y) ≠ S(β*)], which corresponds to a standard model selection error criterion. The probability of incorrect subset selection is then the associated 0–1 risk P[g(Y) ≠ S(β*)], where the probability is taken over the measurement noise W and the choice of random measurement matrix X. We define the maximal probability of error over the class C(λ) as

p_err := max_{β ∈ C(λ)} P[g(Y) ≠ S(β)].   (5)

We say that sparsity recovery is asymptotically reliable over the signal class C(λ) if p_err → 0 as n → ∞.

With this setup, our goal is to find necessary conditions on the parameters (n, p, k, λ) that any decoder, regardless of its computational complexity, must satisfy for asymptotically reliable recovery to be possible. We are interested in lower bounds on the number of measurements n in general settings where both the signal sparsity k and the measurement sparsity γ are allowed to scale with the signal dimension p.
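To make the setup concrete, the following minimal Python sketch simulates the observation model (2) and decodes the support by exhaustive search over all k-subsets; the function name, the default parameter values, and the choice of a standard Gaussian measurement matrix are illustrative assumptions on our part, not part of the formal development.

```python
import itertools
import numpy as np

def simulate_exact_recovery(p=20, k=3, n=12, lam=1.0, seed=0):
    """Simulate Y = X beta* + W and decode S(beta*) by exhaustive least-squares search."""
    rng = np.random.default_rng(seed)
    support = tuple(int(i) for i in sorted(rng.choice(p, size=k, replace=False)))
    beta = np.zeros(p)
    beta[list(support)] = lam                 # every nonzero sits at the minimum value lambda
    X = rng.standard_normal((n, p))           # dense standard Gaussian ensemble
    Y = X @ beta + rng.standard_normal(n)     # unit-variance additive Gaussian noise

    # Exhaustive decoder: choose the k-subset with the smallest least-squares residual.
    best_subset, best_resid = None, np.inf
    for S in itertools.combinations(range(p), k):
        cols = list(S)
        coef, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
        resid = np.sum((Y - X[:, cols] @ coef) ** 2)
        if resid < best_resid:
            best_subset, best_resid = S, resid
    return best_subset == support             # exact support recovery (0-1 loss)

if __name__ == "__main__":
    rate = np.mean([simulate_exact_recovery(seed=t) for t in range(20)])
    print(f"empirical success rate: {rate:.2f}")
```

The exhaustive decoder above is only a stand-in for an optimal decoder in this toy setting; its cost grows combinatorially in p and k, which is exactly why the information-theoretic limits studied here are of interest independently of any particular algorithm.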

B. Past Work and Our Contributions

One body of past work [14], [24], [1] has focused on the information-theoretic limits of sparse estimation under ℓ2 and other distortion metrics, using power-based SNR measures of the form

SNR := E[||Xβ*||²₂] / E[||W||²₂] = ||β*||²₂.   (6)

(Note that the second equality assumes that the noise variance σ² = 1, and that the measurement matrix is standardized, with each element having zero mean and variance one.) It is important to note that the power-based SNR (6), though appropriate for ℓ2-distortion, is not the key parameter for the support recovery problem. Although the minimum value λ*(β*) is related to this power-based measure by the inequality k[λ*(β*)]² ≤ SNR, for the ensemble of signals C(λ) defined in (4), the ℓ2-based SNR (6) can be made arbitrarily large while still having one coefficient equal to the minimum value λ (assuming that k ≥ 2). Consequently, as our results show, it is possible to generate problem instances for which support recovery is arbitrarily difficult (in particular, by sending λ → 0 at an arbitrarily rapid rate) even as the power-based SNR (6) becomes arbitrarily large.
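As a concrete illustration of this point, consider the following worked example in our own notation (consistent with (4) and (6), but not quoted from the paper): take a signal whose first k − 1 nonzero entries equal λ and whose last nonzero entry is a free parameter t ≥ λ.

\[
\beta^* = (\underbrace{\lambda, \ldots, \lambda}_{k-1},\, t,\, 0, \ldots, 0),
\qquad
\mathrm{SNR} = \|\beta^*\|_2^2 = (k-1)\lambda^2 + t^2,
\qquad
\lambda^*(\beta^*) = \lambda.
\]

Letting t → ∞ drives the SNR to infinity while the minimum value λ, which governs support recovery, stays fixed; sending λ → 0 at the same time makes support recovery arbitrarily hard even though the SNR diverges.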

The paper [29] was the first to consider the information-theoretic limits of exact subset recovery using standard Gaussian measurement ensembles, explicitly identifying the minimum value λ as the key parameter. This analysis yielded necessary and sufficient conditions on general quadruples (n, p, k, λ) for asymptotically reliable recovery. Subsequent work on the problem has yielded sharper conditions for standard Gaussian ensembles [22], [3], [13], [2], and extended this type of analysis to the criterion of partial support recovery [3], [22]. In this paper (initially posted as [32]), we consider only exact support recovery, but provide results for general dense measurement ensembles, including non-Gaussian matrices. In conjunction


with known sufficient conditions [29], one consequence of our first main result (Theorem 1, below) is a set of sharp necessary and sufficient conditions for the optimal decoder to recover the support of a signal with linear sparsity (k = Θ(p)), using only a linear fraction of observations (n = Θ(p)). As we discuss at more length in Section II-A, for the special case of the standard Gaussian ensemble, Theorem 1 also recovers some results independently obtained in past work by Reeves [22], and concurrent work by Fletcher et al. [13] and Aeron et al. [2].

In addition, this paper addresses the effect of measurement sparsity, which we assess in terms of the fraction γ of nonzeros per row of the measurement matrix X. In the noiseless setting, a growing body of work has examined computationally efficient recovery methods based on sparse measurement matrices, including work inspired by expander graphs and coding theory [25], [33], [4], as well as dimension-reducing embeddings and sketching [9], [15], [31]. In addition, some results have been shown to be stable in the ℓ1 or ℓ2 norm in the presence of noise [9], [4]; note, however, that stability does not guarantee exact recovery of the support set. In the noisy setting, the paper [1] provides results for sparse measurements and distortion-type error metrics (using a power-based SNR), as opposed to the subset recovery metric of interest here. For the noisy observation model (2), some concurrent work [21] provides sufficient conditions for support recovery using the Lasso (i.e., ℓ1-constrained quadratic programming) for appropriately sparsified ensembles. These results can be viewed as complementary to the information-theoretic analysis of this paper, in which we characterize the inherent trade-off between measurement sparsity and statistical efficiency. More specifically, our second main result (Theorem 2, below) provides necessary conditions for exact support recovery using γ-sparsified Gaussian measurement matrices [defined in (7)], for general scalings of the parameters (n, p, k, λ, γ). This analysis reveals three regimes of interest, corresponding to whether measurement sparsity has no asymptotic effect, a small effect, or a significant effect on the number of measurements necessary for recovery. Thus, there exist regimes in which measurement sparsity fundamentally alters the ability of any method to decode.

II. MAIN RESULTS AND CONSEQUENCES

In this section, we state our main results, and discuss some of their consequences. Our analysis applies to random ensembles of measurement matrices X ∈ ℝ^{n×p}, where each entry X_ij is drawn i.i.d. from some underlying distribution. The most commonly studied random ensemble is the standard Gaussian case, in which each X_ij ~ N(0, 1). Note that this choice generates a highly dense measurement matrix X, with np nonzero entries. Our first result (Theorem 1) applies to more general ensembles that satisfy the moment conditions E[X_ij] = 0 and var(X_ij) = 1, which allows for a variety of non-Gaussian distributions (e.g., uniform, Bernoulli, etc.).3 In addition, we also

3In fact, our results can be further generalized to ensembles of matrices which have independent rows drawn from any distribution with zero mean and covariance matrix Σ (see Theorem 3 in Appendix F).

derive results (Theorem 2) for γ-sparsified matrices X, in which each entry X_ij is i.i.d. drawn according to

X_ij ~ N(0, 1/γ)  with probability γ,
X_ij = 0          with probability 1 − γ.   (7)

Note that when γ = 1, the distribution in (7) is exactly the standard Gaussian ensemble. We refer to the sparsification parameter γ ∈ (0, 1] as the measurement sparsity. Our analysis allows this parameter to vary as a function of (n, p, k).
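A minimal sketch of how a matrix from the γ-sparsified ensemble (7) can be generated is given below; the function name and the empirical check are our own. The construction simply gates N(0, 1/γ) draws by independent Bernoulli(γ) masks, so that each entry has zero mean and unit variance regardless of γ.

```python
import numpy as np

def sparsified_gaussian(n, p, gamma, rng=None):
    """Draw an n-by-p matrix from the gamma-sparsified Gaussian ensemble (7)."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random((n, p)) < gamma                          # keep each entry w.p. gamma
    values = rng.normal(0.0, 1.0 / np.sqrt(gamma), size=(n, p))
    return mask * values                                       # zero w.p. 1 - gamma, N(0, 1/gamma) otherwise

X = sparsified_gaussian(n=500, p=2000, gamma=0.05, rng=np.random.default_rng(0))
print(X.mean(), X.var())   # empirically close to 0 and 1, matching the unit-variance normalization
```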

A. Tighter Bounds on Dense Ensembles

We begin by stating a set of necessary conditions on (n, p, k, λ) for asymptotically reliable recovery with any method, which apply to general ensembles of zero-mean and unit-variance measurement matrices. In addition to the standard Gaussian ensemble X_ij ~ N(0, 1), this result also covers matrices from other common ensembles (e.g., Bernoulli ±1). Furthermore, our analysis can be extended to matrices with independent rows drawn from any distribution with zero mean and covariance matrix Σ (see Appendix F).

Theorem 1 (General Ensembles): Let the measurement matrix X ∈ ℝ^{n×p} be drawn with i.i.d. elements from any distribution with zero mean and unit variance. Then a necessary condition for asymptotically reliable recovery over the signal class C(λ) is

(8)

where

(9)

for m = 1, ..., k.

The proof of Theorem 1, given in Section III, uses Fano’s method [10], [16], [17], [34], [35] to bound the probability of error in a restricted ensemble, which can then be viewed as a type of channel coding problem. Moreover, the proof constructs a family of restricted ensembles that sweeps the range of possible overlaps between subsets, and tries to capture the difficulty of distinguishing between subsets at various distances.

We now consider some consequences of the necessary conditions in Theorem 1 under two scalings of the signal sparsity: the regime of linear signal sparsity, in which k/p → α for some constant α ∈ (0, 1), and the regime of sublinear signal sparsity, meaning k/p → 0. In particular, the necessary conditions in Theorem 1 can be compared against the sufficient conditions in Wainwright [29] for exact support recovery using the standard Gaussian ensemble, as shown in Table I. This comparison reveals that Theorem 1 generalizes and strengthens earlier results on necessary conditions for subset recovery [29]. We obtain tight scalings of the necessary and sufficient conditions in the regime of linear signal sparsity (meaning k = Θ(p)), under various scalings of the minimum value λ (shown in the first three rows of Table I).


TABLE I
TIGHT SCALINGS OF THE NECESSARY AND SUFFICIENT CONDITIONS ON THE NUMBER OF OBSERVATIONS n REQUIRED FOR EXACT SUPPORT RECOVERY ARE OBTAINED IN SEVERAL REGIMES OF INTEREST

We also obtain tight scaling conditions in the regime of sublinear signal sparsity (in which k = o(p)), when the minimum value λ scales as shown in row 4 of Table I. There remains a slight gap, however, in the sublinear sparsity regime covered by the bottom two rows of Table I.

In the regime of linear sparsity, Wainwright [29] showed, by direct analysis of the optimal decoder, that a certain scaling of the minimum value λ is sufficient for exact support recovery using a linear fraction of observations. Combined with the necessary condition in Theorem 1, we obtain the following corollary that provides a sharp characterization of the linear-linear regime.

Corollary 1: Consider the regime of linear sparsity, meaning that k = Θ(p), and suppose that a linear fraction n = Θ(p) of observations are made. Then the optimal decoder can recover the support exactly if and only if the minimum value λ obeys the corresponding scaling in Table I.

Theorem 1 has some consequences related to results proved in recent and concurrent work. Reeves and Gastpar [22] have shown that in the regime of linear sparsity (k = Θ(p)), and for standard Gaussian measurements, if any decoder is given only a linear fraction sample size (meaning that n = Θ(p)), then one must have λ²k → ∞ in order to recover the support exactly. This result is one corollary of Theorem 1, since if λ²k remains bounded, then the lower bound of Theorem 1 grows super-linearly in p,

so that the scaling n = Θ(p) is precluded. In concurrent work, Fletcher et al. [13] used direct methods to show that for the special case of the standard Gaussian ensemble, the number of observations must satisfy a corresponding lower bound. The qualitative form of this bound follows from our lower bound in Theorem 1, which holds for standard Gaussian ensembles as well as more general (non-Gaussian) ensembles. However, we note that the direct methods used by Fletcher et al. [13] yield better control of the constant prefactors for the standard Gaussian ensemble.

Similarly, concurrent work by Aeron et al. [2] showed that in the regime of linear sparsity (i.e., k = Θ(p)) and for standard Gaussian measurements, the number of observations must satisfy a related necessary condition. This result also follows as a consequence of our lower bound in Theorem 1.

The results in Theorem 1 can also be compared to an intuitive bound based on classical channel capacity results, as pointed out previously by various researchers (e.g., [24] and [3]). Consider a restricted problem, in which the values of β* associated with each possible sparsity pattern are fixed and known at the decoder. Then support recovery can be viewed as a type of channel coding problem, in which the possible support sets of β* correspond to messages to be sent over a Gaussian channel.

Suppose each support set S is encoded as the codeword X_S β*_S, where X has i.i.d. Gaussian entries. The effective code rate is then R = log C(p, k) / n, and by standard Gaussian channel capacity results, we have the lower bound

(10)

This bound is tight for k = 1 and Gaussian measurements, but loose in general. As Theorem 1 clarifies, there are additional elements in the support recovery problem that distinguish it from a standard Gaussian coding problem: first, the signal power ||β*||²₂ does not capture the inherent problem difficulty for k > 1, and second, there is overlap between support sets for k > 1. Note that kλ² ≤ ||β*||²₂ (with equality in the case when β*_i = λ for all indices i ∈ S), so that Theorem 1 is strictly tighter than the intuitive bound (10). Moreover, by fixing the value of β* at k − 1 indices to λ and allowing the last component of β* to tend to infinity, we can drive the power to infinity, while still having a non-trivial lower bound in Theorem 1.
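For reference, the capacity-style bound sketched above can be written out explicitly. The display below is our reconstruction from the stated ingredients (code rate R = log C(p, k) / n, unit noise variance, received power ||β*||²₂ per measurement), and may differ from the paper's equation (10) in its exact constants:

\[
\frac{\log \binom{p}{k}}{n} \;\le\; \frac{1}{2}\log\!\bigl(1 + \|\beta^*\|_2^2\bigr)
\quad\Longleftrightarrow\quad
n \;\ge\; \frac{\log \binom{p}{k}}{\tfrac{1}{2}\log\!\bigl(1 + \|\beta^*\|_2^2\bigr)}.
\]

The discussion in the text explains why this capacity argument is loose for k > 1: the total power in the denominator can be inflated without making the supports any easier to distinguish.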

B. Effect of Measurement Sparsity

We now turn to the effect of measurement sparsity on subset recovery, considering in particular the γ-sparsified ensemble (7). Since each X_ij has zero mean and unit variance for all choices of γ by construction, Theorem 1 applies to the γ-sparsified Gaussian ensemble (7); however, it yields necessary conditions that are independent of γ. Intuitively, it is clear that the procedure of γ-sparsification should cause deterioration in support


Fig. 1. Rate R, defined as the logarithm of the number of possible subsets the decoder can reliably estimate based on n observations, is plotted using (12) in three regimes, depending on how the quantity γk scales. In particular, γk corresponds to the average number of nonzeros in β* that align with the nonzeros in each row of the measurement matrix.

recovery. Indeed, the following result provides more refined bounds that capture the effects of γ-sparsification. We first state a set of necessary conditions on (n, p, k, λ, γ) in general form, and subsequently bound these conditions in different regimes of sparsity. Let φ(·; μ, σ²) denote the Gaussian density with mean μ and variance σ², and define the family of mixture distributions with densities ψ given by

(11)

Furthermore, let h(·) denote the differential entropy functional.

With this notation, we have the following result.

Theorem 2 (Sparse Ensembles): Let the measurement matrix X ∈ ℝ^{n×p} be drawn with i.i.d. elements from the γ-sparsified Gaussian ensemble (7). Then a necessary condition for asymptotically reliable recovery over the signal class C(λ) is

(12)

where

(13)

for m = 1, ..., k.

The proof of Theorem 2, given in Section III, again uses Fano’s inequality, but explicitly analyzes the effect of measurement sparsification on the entropy of the observations. The necessary condition in Theorem 2 is plotted in Fig. 1, showing distinct regimes of behavior depending on how the quantity γk scales, where γ is the measurement sparsification parameter and k is the signal sparsity index. In order to characterize the regimes in which measurement sparsity begins to degrade the recovery performance of any decoder, Corollary 2

below further bounds the necessary conditions in Theorem 2 in three cases.

Corollary 2 (Three Regimes): For any scalar γ ∈ (0, 1], let H(γ) denote the entropy of a Bernoulli(γ) variate. The necessary conditions in Theorem 2 can be simplified as follows.

(a) If γk → ∞, then

(14a)

(b) If γk → c for some constant c ∈ (0, ∞), then

(14b)

where c′ > 0 is a constant.

(c) If γk → 0, then

(14c)

Corollary 2 reveals three regimes of behavior, defined by the scaling of the measurement sparsity γ and the signal sparsity k. Intuitively, γk is the average number of nonzeros in β* that align with the nonzeros in each row of the measurement matrix. If γk → ∞ as p → ∞, then the recovery threshold (14a) is of the same order as the threshold for dense measurement ensembles. In this regime, sparsifying the measurement ensemble has no asymptotic effect on performance. In sharp contrast, if γk → 0 sufficiently fast as p → ∞, then the denominator in (14c) goes to zero, and the recovery threshold changes fundamentally compared to the dense case. Hence, the number of measurements that any decoder needs in order to reliably recover increases


TABLE II
NECESSARY CONDITIONS ON THE NUMBER OF OBSERVATIONS n REQUIRED FOR EXACT SUPPORT RECOVERY ARE SHOWN IN DIFFERENT REGIMES OF THE PARAMETERS (p, k, λ, γ)

dramatically in this regime. Finally, if γk = Θ(1), then the recovery threshold (14b) transitions between the two extremes.

Using the bounds in Corollary 2, the necessary conditions in Theorem 2 are shown in Table II under different scalings of the parameters (p, k, λ, γ). In particular, if γk → 0 and the minimum value λ does not increase with p, then the denominator in (14c) goes to zero.
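The qualitative role of the product γk can also be explored numerically. The sketch below is our own experiment rather than part of the paper's analysis: it draws a single measurement Y_i = λ · Σ_{j∈S} X_ij + W_i with X_ij from the γ-sparsified ensemble (7) and |S| = k, and estimates its differential entropy with a simple histogram estimator. As γk shrinks, the measurement carries less information beyond the noise, consistent with the degradation captured by Theorem 2.

```python
import numpy as np

def measurement_entropy(k, gamma, lam=1.0, num_samples=200_000, bins=400, seed=0):
    """Histogram estimate of the differential entropy of one sparsified measurement."""
    rng = np.random.default_rng(seed)
    mask = rng.random((num_samples, k)) < gamma
    x = mask * rng.normal(0.0, 1.0 / np.sqrt(gamma), size=(num_samples, k))
    y = lam * x.sum(axis=1) + rng.standard_normal(num_samples)   # Y_i = lam * sum_j X_ij + W_i
    counts, edges = np.histogram(y, bins=bins, density=True)
    widths = np.diff(edges)
    nonzero = counts > 0
    return -np.sum(counts[nonzero] * np.log(counts[nonzero]) * widths[nonzero])

noise_entropy = 0.5 * np.log(2 * np.pi * np.e)   # differential entropy of N(0, 1)
for gamma in [1.0, 0.1, 0.01, 0.001]:
    h = measurement_entropy(k=100, gamma=gamma)
    print(f"gamma*k = {gamma * 100:6.1f}   h(Y_i) - h(W_i) ~ {h - noise_entropy:.3f}")
```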

III. PROOFS OF OUR MAIN RESULTS

In this section, we provide the proofs of Theorems 1 and 2.

Establishing necessary conditions for exact sparsity recovery amounts to finding conditions on (n, p, k, λ) (and possibly γ) under which the probability of error of any recovery method stays bounded away from zero as n → ∞. At a high level, our general approach is quite simple: we consider restricted problems in which the decoder has been given some additional side information, and then apply Fano’s method [10], [16], [17], [35], [34] to lower bound the probability of error. In order to establish the full collection of necessary conditions in Theorem 1 (one for each m = 1, ..., k), we construct a family of restricted ensembles which sweeps the range of possible overlaps between support sets. At the extremes of this family are two classes of ensembles: one which captures the bulk effect of having many competing subsets at large distances, and the other which captures the effect of a smaller number of subsets at very close distances [this is illustrated in Fig. 2(a)]. Accordingly, we consider the family of restricted ensembles indexed by m = 1, ..., k, where the m-th restricted ensemble is defined as follows.

Throughout the remainder of the paper, we use the notation X_i to denote column i of the matrix X, and X_S to denote the submatrix containing the columns indexed by the set S. Similarly, let β_S denote the subvector of β corresponding to the index set S. In addition, let H(·) and h(·) denote the entropy and differential entropy functionals, respectively.

A. Restricted Ensemble

Suppose that the decoder is given the locations of all but the m smallest nonzero values of the vector β*, as well as the values of β* on its support. More precisely, let S represent the true underlying support of β*, and let T ⊂ S denote the set of revealed indices, which has size k − m. Let U = S \ T denote the set of unknown locations, and assume that β*_i = λ for all i ∈ U. Given knowledge of T and β*, the decoder may simply subtract X_T β*_T from Y, so that it is left with the modified n-vector of observations

Ỹ := Y − X_T β*_T = X_U β*_U + W.   (15)

By re-ordering indices as need be, we may assume without loss of generality that T consists of the last k − m indices, so that U ⊆ {1, ..., p − k + m}. The remaining sub-problem is to determine, given the observations Ỹ, the locations of the m nonzeros in β*_U.4 We will now argue that analyzing the probability of error of this restricted problem gives us a lower bound on the probability of error in the original problem. Consider the restricted signal

class C_m(λ) defined as

(16)

where we denote the support set of a vector β as S(β) = {i : β_i ≠ 0}. For any β ∈ C_m(λ), we can concatenate β with a vector of k − m nonzeros at the end to obtain a p-dimensional vector. If a decoder can recover the support of any p-dimensional k-sparse vector in C(λ), then it can recover the support of the augmented vector and, hence, the support of β. Furthermore, providing the decoder with the nonzero values of β* cannot increase the probability of error. Thus, we can apply Fano’s inequality to lower bound the

4Note that if we assume the support of β* is uniformly chosen over all possible subsets of size k, then given T, the remaining subset U is uniformly distributed over the C(p − k + m, m) possible subsets of size m.


Fig. 2. Illustration of restricted ensembles. (a) In the first restricted ensemble, the decoder must distinguish between support sets with small average overlap, whereas in the second, it must decode amongst a subset of the k(p − k) + 1 supports with overlap k − 1. (b) In the second restricted ensemble, the decoder is given the locations of the k − 1 largest nonzeros, and it must estimate the location of the smallest nonzero from the p − k + 1 remaining possible indices.

probability of error in the restricted problem, and so obtain a lower bound on the probability of error for the general problem.

B. Applying Fano to Restricted Ensembles

Consider the class of signals C_m(λ) defined in (16), which consists of models corresponding to the possible subsets U of size m contained in {1, ..., p − k + m}. Suppose that a model index is chosen uniformly at random from this collection, and we sample n observations Y via the measurement matrix X. For any decoding function g, the average probability of error is obtained by averaging the error probability over the uniform choice of model, while the maximal probability of error over the class is obtained by taking the maximum instead.

We first apply Fano’s lemma [10] to bound the error probability over C_m(λ) for a particular instance of the random measurement matrix X, and subsequently average over the ensemble of matrices. Thus, by Fano’s inequality, the average probability of error, and hence the maximal probability of error, is lower bounded as

(17)

Consequently, the problem of establishing necessary conditions for asymptotically reliable recovery is reduced to obtaining upper bounds on the conditional mutual information I(U; Y | X).
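In our notation, with U denoting the uniformly chosen unknown support of size m and M = C(p − k + m, m) denoting the number of competing models, the Fano bound referred to in (17) takes the standard form below; this display is a reconstruction of the standard inequality, stated here for completeness rather than quoted from the paper:

\[
p_{\mathrm{err}} \;\ge\; 1 \;-\; \frac{I(U; Y \mid X) + \log 2}{\log M},
\qquad M = \binom{p-k+m}{m},
\]

so that asymptotically reliable recovery forces the conditional mutual information I(U; Y | X) to be of order log M.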

C. Proof of Theorem 1

In this section, we derive the necessary conditions stated in Theorem 1 for the general class of measurement matrices, by applying Fano’s inequality to bound the probability of decoding error in each of the restricted ensembles in the family indexed by m = 1, ..., k.

We begin by performing our analysis of the error probability over C_m(λ) for any m ∈ {1, ..., k}. Let X be a matrix with independent, zero-mean and unit-variance entries. Conditioned on the event that U is the true underlying support of β, the vector of observations can be written as

Y = X_U β_U + W.

Accordingly, the conditional mutual information in (17) can be expanded as

I(U; Y | X) = h(Y | X) − h(Y | U, X).

We bound the first term using the fact that the differential entropy of the observation vector Y for a particular instance of the matrix X is maximized by the Gaussian distribution with a matched variance. More specifically, for a fixed X, the distribution of Y is a Gaussian mixture whose components are Gaussian densities with means X_U β_U and identity covariance, one for each possible support U. Let Σ denote the covariance matrix of Y conditioned on X (hence, entry σ_ii on the diagonal represents the variance of Y_i given X). With this notation, the entropy associated with the i-th marginal density is upper bounded by (1/2) log(2πe σ_ii). When X is randomly chosen, the conditional entropy of Y given X (averaged over the choice of X) can be bounded term by term.


The conditional entropy can be further bounded by exploiting the concavity of the logarithm and applying Jensen’s inequality. Next, the entropy of the Gaussian noise vector W ~ N(0, I_n) can be computed as h(W) = (n/2) log(2πe). Combining these two terms, we then obtain a bound on the conditional mutual information I(U; Y | X).
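The chain of bounds described in the preceding paragraphs can be summarized as follows; this is our own compressed rendering, written under the stated assumption of unit noise variance and under the simplifying assumption that, after averaging over X, the diagonal entries of the conditional covariance share a common value E[σ²]:

\[
I(U; Y \mid X) \;=\; h(Y \mid X) - h(W)
\;\le\; \sum_{i=1}^{n} \tfrac{1}{2}\log\bigl(2\pi e\, \mathbb{E}[\sigma_{ii}]\bigr) - \tfrac{n}{2}\log(2\pi e)
\;=\; \tfrac{n}{2}\log \mathbb{E}[\sigma^{2}],
\]

where the inequality combines the Gaussian maximum-entropy bound with Jensen's inequality applied to the concave logarithm.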

It remains to compute the expectation E[Σ] of the conditional covariance, over the ensemble of matrices X drawn with i.i.d. entries from any distribution with zero mean and unit variance. The proof of the following lemma involves some relatively straightforward but lengthy calculation, and is given in Appendix A.

Lemma 1: Given X drawn with i.i.d. entries with zero mean and unit variance, the averaged covariance matrix of Y given X is

(18)

Finally, combining Lemma 1 with equation (17), we obtain that the average probability of error is bounded away from zero unless n exceeds the quantity defined in (9); taking the maximum over m = 1, ..., k then yields the necessary condition (8), as claimed.

D. Proof of Theorem 2

This section contains proofs of the necessary conditions in Theorem 2 for the γ-sparsified Gaussian measurement ensemble (7). We proceed as before, applying Fano’s inequality to each restricted class C_m(λ) in the family, in order to derive the corresponding conditions in Theorem 2.

In analyzing the probability of error over C_m(λ), the initial steps proceed as in the proof of Theorem 1, by expanding the conditional mutual information in equation (17) as

I(U; Y | X) = h(Y | X) − h(W),

using the Gaussian entropy h(W) = (n/2) log(2πe) for W ~ N(0, I_n).

From this point, the key subproblem is to compute the conditional entropy h(Y | X), when the support U of β is uniformly chosen over all possible subsets of size m. To characterize the limiting behavior of this random variable, note that for a fixed matrix X, each Y_i is distributed according to a density f_i. This density is a mixture of Gaussians with unit variances and means that depend on the values of X_{iU} β_U, summed over subsets U with |U| = m.

At a high level, our immediate goal is to characterize the entropy h(Y_i). Note that as X varies over the sparse ensemble (7), the sequence of densities f_i, indexed by the signal dimension p, is actually a sequence of random densities. As an intermediate step, the following lemma characterizes the average pointwise behavior of this random sequence of densities, and is proven in Appendix B.

Lemma 2: Let X be drawn with i.i.d. entries from the γ-sparsified Gaussian ensemble (7). For any fixed index i and any fixed argument y, E_X[f_i(y)] = ψ(y), where

(19)

is a mixture of Gaussians with Binomial(m, γ) weights.

For certain scalings, we can use concentration results for U-statistics [26] to prove that f_i converges uniformly to ψ, and from there that h(f_i) → h(ψ). In general, however, we always have an upper bound, which is sufficient for our purposes. Indeed, since differential entropy is a concave function of the density, by Jensen’s inequality and Lemma 2, we have E_X[h(f_i)] ≤ h(E_X[f_i]) = h(ψ).

With these ingredients, we conclude that the conditional mutual information in (17) is upper bounded by n h(ψ) − (n/2) log(2πe), where the last step uses the fact that the entropies associated with the densities f_i are the same for all i = 1, ..., n. Therefore, the probability of decoding error, averaged over the


sparsified Gaussian measurement ensemble, is bounded away from zero unless n exceeds the quantity defined in (13); taking the maximum over m = 1, ..., k then yields the necessary condition (12), as claimed.

E. Proof of Corollary 2

In this section, we derive bounds on the necessary conditions on n stated in Theorem 2.

We begin by applying a simple yet general bound on the entropy of the Gaussian mixture distribution with density ψ defined in (11). The entropy h(ψ) is bounded by the entropy of a Gaussian distribution with the same variance. This yields the first set of bounds in (14a).
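Concretely, the maximum-entropy step has the generic form shown below; the variance value 1 + mλ² is our own computation from the restricted observation model (m unknown entries equal to λ, unit-variance measurement entries, unit noise variance), so the constants should be read as an assumption rather than as the exact display behind (14a):

\[
h(\psi) \;\le\; \tfrac{1}{2}\log\bigl(2\pi e \,\mathrm{var}(\psi)\bigr)
\;=\; \tfrac{1}{2}\log\bigl(2\pi e\,(1 + m\lambda^{2})\bigr).
\]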

Next, to derive more refined bounds which capture the effects of measurement sparsity, we will make use of the following lemma (which is proven in Appendix C) to bound the entropy associated with the mixture density ψ.

Lemma 3: For the Gaussian mixture distribution with density ψ defined in (11),

where B denotes a Binomial(m, γ) variate.

We can further bound the expression in Lemma 3 in three cases, delineated by the quantity γm. The proof of the following claim is given in Appendix D.

Lemma 4: Let B be a Binomial(m, γ) variate, where γ ∈ (0, 1].

(a) If γm → ∞, then

(20a)

and

(20b)

(b) If γm → c for some constant c ∈ (0, ∞), then

(21a)

and

(21b)

(c) If γm → 0, then

(22a)

and

(22b)

Finally, combining Lemmas 3 and 4 with some simple bounds on the entropy of the binomial variate (summarized in Lemmas 5 and 6 in Appendix E), we obtain the bounds in (14b) and (14c).

IV. DISCUSSION

In this paper, we have studied the information-theoretic limits of exact support recovery for general scalings of the parameters (n, p, k, λ, γ). Our first result (Theorem 1) applies generally to measurement matrices with zero-mean and unit-variance entries. It strengthens previously known bounds, and combined with known sufficient conditions [29], yields a sharp characterization of recovering signals with linear sparsity using a linear fraction of observations (Corollary 1). Our second result (Theorem 2) applies to γ-sparsified Gaussian measurement ensembles, and reveals three different regimes of measurement sparsity, depending on how significantly they impair statistical efficiency. For linear signal sparsity, Theorem 2 is not a sharp result (it is loose by a constant factor in comparison to Theorem 1 in the dense case); however, its tightness for sublinear signal sparsity is an interesting open problem. Finally, Theorem 1 implies that no measurement ensemble with zero-mean and unit-variance entries can further reduce the number of observations necessary for recovery, while the paper [29] shows that the standard Gaussian ensemble can achieve the same scaling. This raises an interesting open question on the design of other, more computationally friendly, measurement matrices which achieve the same information-theoretic bounds.

APPENDIX

A) Proof of Lemma 1: We begin by defining some additional notation. Recall that for a given instance of the matrix X, the observation vector Y has a Gaussian mixture distribution, whose components are Gaussian densities with means X_U β_U and identity covariance, one for each possible support U. Let μ and Σ be the mean vector and covariance matrix of Y given X, respectively.

With this notation, we can now compute the expectation of the covariance matrix Σ, averaged over any distribution


on X with independent, zero-mean and unit-variance entries. To compute the first term, we use the fact that the entries of X have zero mean and unit variance, and that distinct entries are independent. Next, we compute the second term by averaging over the random choice of the subset U. From here, note that there are C(p − k + m, m) possible subsets U of size m. For each overlap size j, a counting argument reveals that there are C(m, j) C(p − k, m − j) subsets of size m which have overlap j with U. Thus, the scalar multiplicative factor above can be written as a sum over the possible overlap sizes. Finally, using a substitution of variables and applying Vandermonde’s identity [23], we conclude that the averaged covariance matrix takes the form stated in (18).

B) Proof of Lemma 2: Consider the sequence of random densities f_i defined in Section III-D and the deterministic mixture density ψ defined in (19). Our goal is to show that, for any fixed argument, the pointwise average of the stochastic sequence of densities over the ensemble of matrices X satisfies E_X[f_i] = ψ.

By symmetry of the random measurement matrix X, it is sufficient to compute this expectation for a single fixed subset U. When each entry X_ij is i.i.d. drawn according to the γ-sparsified ensemble (7), the random variable Y_i has a Gaussian mixture distribution which can be described as follows. Denoting the mixture label by L, the random variable Y_i is conditionally Gaussian given L. Moreover, conditioned on the mixture label L, a suitably modified random variable has a noncentral chi-square distribution with 1 degree of freedom. Writing out its moment-generating function and evaluating the moment-generating function [5] of a noncentral chi-square random variable then gives the desired quantity, as claimed.

C) Proof of Lemma 3: Let Z be a random variable distributed according to the density (19), with mixture label B. To compute the entropy of Z, we expand the mutual information and obtain h(Z) = h(Z | B) + I(Z; B).


The conditional distribution of Z given that B = l is Gaussian, and so the conditional entropy of Z given B can be written in closed form. Using the fact that 0 ≤ I(Z; B) ≤ H(B), we obtain the upper and lower bounds on h(Z), as claimed.

D) Proof of Lemma 4: Let B be a Binomial(m, γ) variate, where γ ∈ (0, 1]. We first derive a general upper bound and then show that this bound is reasonably tight in the case when γm → 0. We can rewrite the binomial probability explicitly, and by taking the first two terms of the binomial expansion and noting that all the terms are non-negative, we obtain an intermediate inequality. Using a change of variables and applying the binomial theorem, we then obtain the desired upper bound.

In the case when γm → 0, we can derive a similar lower bound by first bounding the binomial probability from below, and then using elementary bounds on the logarithm. This yields the upper and lower bounds in (22).

Next, we examine the case when γm → c for some constant c. The derivation of the upper bound from the previous case holds here as well. The proof of the lower bound follows the same steps as in the previous case, except that we stop before applying the final logarithmic inequality. This gives the bounds in (21).

Finally, we derive bounds in the case when γm → ∞. Since the mean of a Binomial(m, γ) random variable is γm, by Jensen’s inequality the following upper bound always holds:

To derive a matching lower bound, we use the fact that the median of a Binomial(m, γ) distribution is one of ⌊γm⌋ or ⌈γm⌉. This allows us to bound the quantity from below, where in the last step we use elementary bounds on the logarithm. Thus, we obtain the bounds in (20).

E) Bounds on Binomial Entropy:

Lemma 5: Let B be a binomial variate; then

Proof: We immediately obtain this bound by applying the differential entropy bound on discrete entropy [10]. As detailed in [10], the proof follows by relating the entropy of the discrete random variable to the differential entropy of a particular continuous random variable, and then upper bounding the latter by the entropy of a Gaussian random variable.
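For reference, the differential entropy bound on discrete entropy from [10] has the following form for an integer-valued random variable; specializing it to a Binomial(m, γ) variate B with variance mγ(1 − γ) gives the display below. We state it as the standard inequality rather than as the exact display of Lemma 5:

\[
H(B) \;\le\; \tfrac{1}{2}\log\Bigl(2\pi e\bigl(\mathrm{var}(B) + \tfrac{1}{12}\bigr)\Bigr)
\;=\; \tfrac{1}{2}\log\Bigl(2\pi e\bigl(m\gamma(1-\gamma) + \tfrac{1}{12}\bigr)\Bigr).
\]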

Lemma 6: The entropy of a binomial random variable is bounded by


Proof: We can express the binomial variate B as a sum of i.i.d. Bernoulli(γ) variables. Since the entropy of B is no larger than the joint entropy of these Bernoulli variables, which equals the number of summands times H(γ), we have the claimed bound.

Lemma 7: If γk → 0, then the following limits hold as p → ∞.

Proof: To find the limit of the relevant expression, let γk = g(p) for some function g, and assume that g(p) → 0. We can expand the first term, and the second term can also be expanded in a similar fashion. Since g(p) → 0 as p → ∞, we obtain the corresponding limits, which in turn imply the claimed result.

F) Generalized Measurement Ensembles: In this section, we extend the necessary conditions in Theorem 1 to a generalized class of measurement matrices by relaxing the i.i.d. assumption. The proof of Theorem 3 below exactly mirrors that of Theorem 1, and is omitted. The key modification occurs when constructing the restricted ensembles, because the choice of columns to be removed from the matrix affects the distribution of the observations. The proof of Lemma 1 can then be easily extended to the generalized measurement ensemble. In order to state the result, we define the functions

(23)

for m = 1, ..., k, which sum over all possible subsets of size m of the covariance matrix Σ.

Theorem 3: Let each row of the measurement matrix X ∈ ℝ^{n×p} be drawn i.i.d. from any distribution with zero mean and covariance matrix Σ. Then a necessary condition for asymptotically reliable recovery over the signal class C(λ) is

(24)

where

(25)

for m = 1, ..., k.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for helpful comments that improved the presentation of this paper.

REFERENCES

[1] S. Aeron, M. Zhao, and V. Saligrama, “Information-theoretic bounds to sensing capacity of sensor networks under fixed SNR,” presented at the Information Theory Workshop, Sep. 2007.

[2] S. Aeron, M. Zhao, and V. Saligrama, Fundamental Limits on Sensing Capacity for Sensor Networks and Compressed Sensing, 2008, Tech. Rep. arXiv:0804.3439v1 [cs.IT].

[3] M. Akcakaya and V. Tarokh, Shannon Theoretic Limits on Noisy Compressive Sampling, 2007, Tech. Rep. arXiv:0711.0366v1 [cs.IT].

[4] R. Berinde, A. C. Gilbert, P. Indyk, H. Karloff, and M. J. Strauss, “Combining geometry and combinatorics: A unified approach to sparse signal recovery,” presented at the Allerton Conf. Communication, Control and Computing, Monticello, IL, Sep. 2008.

[5] L. Birgé, “An alternative point of view on Lepski’s method,” in State of the Art in Probability and Statistics, ser. IMS Lecture Notes. Beachwood, OH: Institute of Mathematical Statistics, 2001, pp. 113–133.

[6] E. Candes, J. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” Commun. Pure Appl. Math., vol. 59, no. 8, pp. 1207–1223, Aug. 2006.

[7] E. Candes and T. Tao, “Decoding by linear programming,” IEEE Trans. Inf. Theory, vol. 51, no. 12, pp. 4203–4215, Dec. 2005.

[8] S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM J. Sci. Comput., vol. 20, no. 1, pp. 33–61, 1998.

[9] G. Cormode and S. Muthukrishnan, Towards an Algorithmic Theory of Compressed Sensing, Rutgers Univ., 2005, Tech. Rep.

[10] T. Cover and J. Thomas, Elements of Information Theory. New York: Wiley, 1991.

[11] D. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006.

[12] D. Donoho, M. Elad, and V. M. Temlyakov, “Stable recovery of sparse overcomplete representations in the presence of noise,” IEEE Trans. Inf. Theory, vol. 52, no. 1, pp. 6–18, Jan. 2006.


[13] A. K. Fletcher, S. Rangan, and V. K. Goyal, Necessary and Sufficient Conditions on Sparsity Pattern Recovery, 2008, Tech. Rep. arXiv:0804.1839v1 [cs.IT].

[14] A. K. Fletcher, S. Rangan, V. K. Goyal, and K. Ramchandran, “Denoising by sparse approximation: Error bounds based on rate-distortion theory,” J. Appl. Signal Process., vol. 10, pp. 1–19, 2006.

[15] A. Gilbert, M. Strauss, J. Tropp, and R. Vershynin, “Algorithmic linear dimension reduction in the ℓ1-norm for sparse vectors,” presented at the Allerton Conf. Communication, Control and Computing, Monticello, IL, Sep. 2006.

[16] R. Z. Has’minskii, “A lower bound on the risks of nonparametric estimates of densities in the uniform metric,” Theory Prob. Appl., vol. 23, pp. 794–798, 1978.

[17] I. A. Ibragimov and R. Z. Has’minskii, Statistical Estimation: Asymptotic Theory. New York: Springer-Verlag, 1981.

[18] N. Meinshausen and P. Buhlmann, “High-dimensional graphs and variable selection with the lasso,” Ann. Statist., vol. 34, no. 3, pp. 1436–1462, 2006.

[19] A. J. Miller, Subset Selection in Regression. New York: Chapman-Hall, 1990.

[20] B. K. Natarajan, “Sparse approximate solutions to linear systems,” SIAM J. Comput., vol. 24, no. 2, pp. 227–234, 1995.

[21] D. Omidiran and M. J. Wainwright, High-Dimensional Subset Recovery in Noise: Sparsified Measurements Without Loss of Statistical Efficiency, Dept. Statistics, Univ. California, Berkeley, 2008, Tech. Rep. 753.

[22] G. Reeves and M. Gastpar, “Sampling bounds for sparse support recovery in the presence of noise,” presented at the Int. Symp. Information Theory, Toronto, Canada, 2008.

[23] J. Riordan, Combinatorial Identities. New York: Wiley, 1968, Wiley Series in Probability and Mathematical Statistics.

[24] S. Sarvotham, D. Baron, and R. G. Baraniuk, “Measurements versus bits: Compressed sensing meets information theory,” presented at the Allerton Conf. Control, Communication and Computing, Sep. 2006.

[25] S. Sarvotham, D. Baron, and R. G. Baraniuk, “Sudocodes: Fast measurement and reconstruction of sparse signals,” presented at the Int. Symp. Information Theory, Seattle, WA, Jul. 2006.

[26] R. J. Serfling, Approximation Theorems of Mathematical Statistics, ser. Wiley Series in Probability and Statistics. New York: Wiley, 1980.

[27] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Roy. Statist. Soc. Ser. B, vol. 58, no. 1, pp. 267–288, 1996.

[28] J. Tropp, “Just relax: Convex programming methods for identifying sparse signals in noise,” IEEE Trans. Inf. Theory, vol. 52, no. 3, pp. 1030–1051, Mar. 2006.

[29] M. J. Wainwright, “Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting,” IEEE Trans. Inf. Theory, vol. 55, no. 12, pp. 5728–5741, Dec. 2009.

[30] M. J. Wainwright, “Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso),” IEEE Trans. Inf. Theory, vol. 55, no. 5, pp. 2183–2202, May 2009.

[31] W. Wang, M. Garofalakis, and K. Ramchandran, “Distributed sparse random projections for refinable approximation,” presented at the Int. Conf. Information Processing in Sensor Networks, Nashville, TN, Apr. 2007.

[32] W. Wang, M. J. Wainwright, and K. Ramchandran, Information-Theoretic Limits on Sparse Support Recovery: Dense Versus Sparse Measurements, Dept. Statistics, Univ. California, Berkeley, 2008, Tech. Rep. 754.

[33] W. Xu and B. Hassibi, “Efficient compressed sensing with deterministic guarantees using expander graphs,” presented at the Information Theory Workshop (ITW), Sep. 2007.

[34] Y. Yang and A. Barron, “Information-theoretic determination of minimax rates of convergence,” Ann. Statist., vol. 27, no. 5, pp. 1564–1599, 1999.

[35] B. Yu, “Assouad, Fano and Le Cam,” in Festschrift for Lucien Le Cam. Berlin, Germany: Springer-Verlag, 1997, pp. 423–435.

Wei Wang (M’09) received the B.S. degree (with honors) in electrical and computer engineering from Rice University, Houston, TX, and the M.S. and Ph.D. degrees in electrical engineering and computer sciences from the University of California, Berkeley.

Her research interests include statistical signal processing, high-dimensional statistics, coding and information theory, and large-scale distributed systems.

She has been awarded an NSF Graduate Fellowship, a Bell Labs Graduate Research Fellowship, a GAANN Graduate Fellowship, a Hertz Foundation Grant, and the James S. Waters Award (Rice University).

Martin J. Wainwright (M’03) received the Ph.D. degree in electrical engineering and computer science from the Massachusetts Institute of Technology (MIT), Cambridge.

He is currently an Associate Professor at the University of California, Berkeley, with a joint appointment between the Department of Statistics and the Department of Electrical Engineering and Computer Sciences. His research interests include statistical signal processing, coding and information theory, statistical machine learning, and high-dimensional statistics.

Prof. Wainwright has been awarded an Alfred P. Sloan Foundation Fellowship, an NSF CAREER Award, the George M. Sprowls Prize for his dissertation research (EECS department, MIT), a Natural Sciences and Engineering Research Council of Canada 1967 Fellowship, an IEEE Signal Processing Society Best Paper Award in 2008, and several outstanding conference paper awards.

Kannan Ramchandran (S’92–M’93–SM’98–F’05) received the Ph.D. degree in electrical engineering from Columbia University, New York, in 1993.

He is a Professor in the Electrical Engineering and Computer Science Department, University of California (UC), Berkeley. He has been at UC Berkeley since 1999. From 1993 to 1999, he was on the faculty of the Electrical and Computer Engineering Department, University of Illinois at Urbana-Champaign (UIUC), Urbana. Prior to that, he was a member of the Technical Staff at AT&T Bell Laboratories from 1984 to 1990. His current research interests include distributed signal processing and coding for networks, video communications and peer-to-peer content delivery, multi-user information theory, security, and multi-scale image processing and modeling.

Prof. Ramchandran has published extensively in his field, holds 12 patents, and has received several awards, including an Outstanding Teaching Award at Berkeley (2009), an Okawa Foundation Research Prize at Berkeley (2001), a Hank Magnusky Scholar Award at the University of Illinois (1999), two Best Paper Awards from the IEEE Signal Processing Society (1997 and 1993), an NSF CAREER Award (1997), an ONR Young Investigator Award (1997), an ARO Young Investigator Award (1996), and the Eliahu I. Jury Award for his doctoral thesis at Columbia University (1993). He has additionally won numerous best conference paper awards in his field, and serves on the technical program committees for the premier conferences in information theory, communications, and signal and image processing.
