
Information-Theoretic Limits on Sparsity Recovery in the High-Dimensional and Noisy Setting

Martin J. Wainwright, Member, IEEE

Abstract: The problem of sparsity pattern or support set recovery refers to estimating the set of nonzero coefficients of an unknown vector $\beta^* \in \mathbb{R}^p$ based on a set of $n$ noisy observations. It arises in a variety of settings, including subset selection in regression, graphical model selection, signal denoising, compressive sensing, and constructive approximation. The sample complexity of a given method for subset recovery refers to the scaling of the required sample size $n$ as a function of the signal dimension $p$, the sparsity index $k$ (number of nonzeros in $\beta^*$), the minimum value $\beta_{\min}$ of $\beta^*$ over its support, and other parameters of the measurement matrix. This paper studies the information-theoretic limits of sparsity recovery: in particular, for a noisy linear observation model based on random measurement matrices drawn from general Gaussian ensembles, we derive both a set of sufficient conditions for exact support recovery using an exhaustive search decoder and a set of necessary conditions that any decoder, regardless of its computational complexity, must satisfy for exact support recovery. This analysis of fundamental limits complements our previous work on sharp thresholds for support set recovery over the same set of random measurement ensembles using the polynomial-time Lasso method ($\ell_1$-constrained quadratic programming).

Index Terms: Compressed sensing, $\ell_1$-relaxation, Fano's method, high-dimensional statistical inference, information-theoretic bounds, Lasso, model selection, signal denoising, sparsity pattern, sparsity recovery, subset selection, support recovery.

I. INTRODUCTION

SUPPOSE that we are given a set of observations of a fixed but unknown vector $\beta^* \in \mathbb{R}^p$. In a variety of settings, it is known a priori that the vector $\beta^*$ is sparse, meaning that its support set $S$ (those indices $i$ for which $\beta^*_i$ is nonzero) is relatively small, say with size $k \ll p$. Sparsity recovery refers to the problem of correctly estimating the support set $S$ based on a set of noisy observations. This sparsity recovery problem is of broad interest, arising in various areas, including subset selection in regression [24], structure estimation in graphical models [22], sparse approximation [10], [25], signal denoising [7], and compressive sensing [11], [5].

Manuscript received August 28, 2007; revised April 20, 2009. Current version published November 20, 2009. This work was supported in part by the National Science Foundation under Grants NSF DMS-0528488, CAREER-CCF-0545862, a Microsoft Research Grant, and a Sloan Foundation Fellowship. The material in this paper was presented in part at the IEEE International Symposium on Information Theory (ISIT), Nice, France, June 2007, and was posted on arXiv in February 2007 (math/0702301).

The author is with the Department of Statistics, and Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720 USA.

Communicated by A. Krzyżak, Associate Editor for Pattern Recognition, Statistical Learning and Inference.

Digital Object Identifier 10.1109/TIT.2009.2032816

A great deal of work over the past few years has focused on the performance of computationally tractable methods, many based on $\ell_1$-norm or other convex relaxations, both for recovering the exact sparsity pattern and for related problems in sparse approximation. We provide a brief overview of those parts of this extensive literature most relevant to our work in Section I-A. Of equal interest and complementary in nature, however, are the information-theoretic limits associated with the performance of any procedure for sparsity recovery. Such understanding of fundamental limitations is crucial in assessing the behavior of computationally tractable methods. In particular, there is little point in proposing novel methods for sparsity recovery, possibly with higher computational complexity, if currently extant and computationally tractable methods achieve the information-theoretic limits. On the other hand, an information-theoretic analysis can reveal where there currently exists a gap between the performance of computationally tractable methods and the fundamental limits. Indeed, the information-theoretic analysis of this paper makes contributions of both types.

With this motivation in mind, the focus of this paper is on the information-theoretic limitations of sparsity recovery. In particular, our analysis focuses on the noisy and high-dimensional setting, meaning that the observations are contaminated by noise, and all three problem parameters (the number of observations $n$, the model dimension $p$, and the sparsity index $k$, defined below) may tend to infinity. Our main results, stated more precisely in Section II, are necessary and sufficient conditions for subset recovery, stated in terms of the triplet $(n, p, k)$ as well as signal-to-noise parameters such as the minimum value $\beta_{\min}$ of the signal and the noise variance $\sigma^2$. More specifically, our analysis applies to the class of random $\Sigma$-Gaussian measurement ensembles, in which each measurement is based on the inner product between $\beta^*$ and a random $p$-vector $x_i \sim N(0, \Sigma)$. As a special but important case, this model includes the standard Gaussian ensemble in which the entries $X_{ij} \sim N(0, 1)$ are independent and identically distributed (i.i.d.), obtained by setting $\Sigma = I_{p \times p}$. In this paper, we derive a set of sufficient conditions for asymptotically perfect recovery using an exhaustive search decoder, as well as a set of necessary conditions that any decoder must satisfy for perfect recovery.

The analysis given here complements our earlier paper [31] that established precise thresholds on the success/failure of the Lasso (i.e., $\ell_1$-constrained quadratic programming) for sparsity recovery.

The remainder of this paper is organized as follows. In Section I-A, we provide a more precise formulation of the problem and a brief discussion of past work, whereas Section II provides a precise statement of our main results and a discussion of their consequences. Section IV provides the proof of the sufficient conditions, based on analyzing the oracle decoder, whereas Section V provides the proof of the necessary conditions. More technical aspects of these proofs are provided in the Appendices. We conclude in Section VI with a discussion of open directions.

A. Problem Formulation

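We begin with a more precise formulation of the problem, as well as a discussion of previous work, with emphasis on that most closely related to the results in this paper. Let $\beta^* \in \mathbb{R}^p$ be a fixed but unknown vector; we refer to the ambient dimension $p$ as the model dimension. Define the support set of $\beta^*$ as

$$S(\beta^*) := \{\, i \in \{1, \ldots, p\} \mid \beta^*_i \neq 0 \,\}. \qquad (1)$$

We refer to its size $k := |S(\beta^*)|$ as the sparsity index. Consider the observation model

$$Y = X\beta^* + W, \qquad (2)$$

where $Y \in \mathbb{R}^n$ is a vector of observations, $X \in \mathbb{R}^{n \times p}$ is the measurement matrix, and $W \in \mathbb{R}^n$ is additive observation noise.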

We assume throughout the paper that .
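As an illustration of the observation model (2), the following sketch draws a design matrix whose rows are i.i.d. $N(0, \Sigma)$ and forms noisy observations. It is a minimal simulation rather than anything taken from the paper: the function names, the `noise_std` parameter, and the choice of Gaussian noise with unit variance by default are assumptions made for the example. Indices are 0-based in the code, whereas the text indexes coordinates from 1.

```python
import numpy as np


def sample_observations(beta_star, Sigma, n, noise_std=1.0, seed=0):
    """Draw (X, Y) from Y = X beta* + W (hypothetical helper).

    Rows of X are i.i.d. N(0, Sigma); W is assumed i.i.d. N(0, noise_std^2).
    """
    rng = np.random.default_rng(seed)
    p = beta_star.shape[0]
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    W = noise_std * rng.standard_normal(n)
    return X, X @ beta_star + W


def support(beta):
    """Support set S(beta): indices of the nonzero coefficients."""
    return set(np.flatnonzero(beta).tolist())


# Example: p = 6, k = 2, standard Gaussian ensemble (Sigma = I).
beta_star = np.zeros(6)
beta_star[[1, 4]] = [2.0, -1.5]
X, Y = sample_observations(beta_star, np.eye(6), n=5)
```

Passing a non-identity positive semidefinite `Sigma` gives a correlated design of the kind covered by the general ensemble.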

1) Error Metrics: Consider some method that generates the vector $\hat{\beta}$ as an estimate of the truth $\beta^*$. There are various distinct criteria for assessing how close the estimate is to the truth, including

• various $\ell_q$ norms of the error $\hat{\beta} - \beta^*$, especially $\ell_2$ and $\ell_1$, or

• some measurement of predictive power (e.g., the prediction error of $\hat{Y}$, where $\hat{Y}$ is the estimate based on $\hat{\beta}$).

Given the abundance of recent results on sparse approximation (not all of which are mutually comparable), it is particularly important to specify up front the choice of error metric. In this paper, we focus exclusively on the sparsity recovery problem, for which the appropriate error metric is simply the 0-1 loss associated with the event of recovering the correct support $S$, viz.

$$\rho(\hat{\beta}, \beta^*) := \mathbb{I}\big[\{\hat{\beta}_i \neq 0 \ \ \forall i \in S\} \cap \{\hat{\beta}_j = 0 \ \ \forall j \notin S\}\big]. \qquad (3)$$

Of interest are conditions on the triplet $(n, p, k)$, as well as properties of the signal vector $\beta^*$ and design matrix $X$, under which exact support recovery is either possible or impossible.
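The distinction between the support recovery criterion and norm-based criteria can be made concrete with a short sketch. The function name and the example vectors below are hypothetical; the function simply tests whether an estimate has exactly the same nonzero pattern as the truth.

```python
import numpy as np


def exact_support_recovery(beta_hat, beta_star):
    """True iff beta_hat and beta_star have identical nonzero patterns."""
    return bool(np.array_equal(beta_hat != 0, beta_star != 0))


beta_star = np.array([0.0, 2.0, 0.0, -1.0])
# Small l2 error, but one spurious nonzero entry: the support is wrong.
close_but_wrong = np.array([0.01, 2.0, 0.0, -1.0])
# Large l2 error, but the same nonzero pattern: the support is right.
far_but_right = np.array([0.0, 5.0, 0.0, -4.0])
```

The two example estimates show that a small $\ell_2$ error neither implies nor is implied by correct support recovery.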

2) Past and Ongoing Work: A great deal of recent work has studied the behavior of $\ell_1$-relaxations for sparse approximation, including linear programming techniques [7], [12], [5], [14] and $\ell_1$-constrained quadratic programming [7], [13], [29], known as the Lasso in the statistics literature [22], [28], [36]. Some papers in this growing literature have provided conditions under which estimation of a noise-contaminated vector via the Lasso [29], [13] or other types of convex relaxation [6] is guaranteed to be stable in the $\ell_2$ sense; however, it should be noted that such $\ell_2$-stability does not guarantee exact recovery of the underlying support set. Most directly related to this paper are results, applicable to $\ell_1$-constrained quadratic programming or the Lasso, that provide sufficient conditions [22], [36], [31] or necessary conditions [31] on the amount of data required for subset recovery (i.e., with the error metric (3)). These results isolate a mutual incoherence property [17], [29] of the design matrix that must be satisfied for the Lasso to succeed in recovering the support, and the paper [31] provides sharp scalings on $(n, p, k)$ that demarcate the boundary between success and failure. As we discuss in the sequel, our results show that the exhaustive search decoder can recover the support set for design matrices for which the $\ell_1$-based Lasso fails with high probability (see Section III-B), or for sample sizes at which the Lasso fails with high probability (see Section III-A.2).

Some past work on sparse linear regression [1], [27], [16] shares the information-theoretic motivation of this paper, but focuses on the rate-distortion perspective (i.e., under the $\ell_2$-loss), as opposed to the subset recovery metric (3) of interest here.

Since this paper was first posted [30], a number of papers have followed up on the information-theoretic limits of the subset recovery problem. Akçakaya and Tarokh [2] analyzed the performance of a certain type of "joint typicality decoder," obtaining similar results for the support recovery problem studied here as well as various results for partial support recovery metrics (e.g., metrics in which it is sufficient to recover a large fraction of the support, as opposed to every element of the support). Their analysis is based on the same framework and type of partitioning scheme, but uses alternative large deviation bounds based on concentration of empirical entropies (joint typicality).

Reeves and Gastpar [26] analyzed the partial support recovery problem in the regime of linear sparsity (i.e., $k = \alpha p$ for some $\alpha \in (0, 1)$), and showed that the signal-to-noise ratio (SNR) must tend to infinity in order for exhaustive search decoders to succeed. In subsequent work, Fletcher et al. [15] used direct methods to show that for any signal with squared minimum value $\beta_{\min}^2$, any decoder applied to measurement matrices drawn from the standard Gaussian ensemble requires

measurements. Concurrent work by Wang et al. [32] used refinements of the Fano approach from the initial posting of this work [30] to establish the same scaling for general i.i.d. measurement matrices. Although these extensions did not appear in the original posting of this work [30], following the reviewers' suggestion, we have also included in this revised version some consequences of the refined Fano approach [32] for necessary conditions (Theorem 2) as applied to the non-i.i.d. $\Sigma$-Gaussian ensembles considered here.

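Notation: We use the following standard notation for asymptotics of real sequences $\{a_n\}$ and $\{b_n\}$: (i) $a_n = O(b_n)$ means that $a_n \le C b_n$ for some constant $C \in (0, \infty)$; (ii) $a_n = \Omega(b_n)$ means that $a_n \ge C' b_n$ for some constant $C' > 0$; (iii) $a_n = \Theta(b_n)$ is shorthand for $a_n = O(b_n)$ and $a_n = \Omega(b_n)$; and (iv) $a_n = o(b_n)$ means that $a_n / b_n \to 0$.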

II. STATEMENT OF MAIN RESULTS

The analysis of this paper applies to the high-dimensional setting, in that all three elements of the triplet $(n, p, k)$ are permitted to tend to infinity. We provide both positive results (that is, scalings of the triplet $(n, p, k)$ and associated signal/measurement parameters such that an exhaustive search decoder can recover the exact support with high probability) and converse results, meaning scalings for which the probability of error remains bounded away from zero for any method. Although we allow for completely general scaling of this triplet, our results can also be specialized to two particular cases of sparsity scaling: (a) the linear sparsity regime, e.g., [5], [12], in which $k = \alpha p$ for some $\alpha \in (0, 1)$; or (b) the sublinear sparsity regime, e.g., [22], [36], in which $k/p$ tends to zero. Depending on the underlying motivation for sparse approximation, both of these sparsity regimes are of independent interest. In covering the full range of scaling, the results given here are complementary to those of our previous paper [31] that provided threshold results, also applicable to general scaling of $(n, p, k)$, for the success/failure of the Lasso when used for sparsity recovery with general Gaussian measurement ensembles.

We focus on the linear observation model (2) in the noisy setting, with the measurement matrix $X$ drawn from the $\Sigma$-Gaussian ensemble, meaning that each row $x_i$ is drawn i.i.d. as $N(0, \Sigma)$ for $i = 1, \ldots, n$. Note that setting $\Sigma = I_{p \times p}$ yields as a special case the standard Gaussian ensemble, for which the entries $X_{ij} \sim N(0, 1)$ are i.i.d.

In addition to the three parameters $(n, p, k)$, our results also highlight the importance of some other parameters associated with the signal ensemble. In particular, both the sufficient and necessary conditions require control of the minimum value of the unknown vector on its support. Consequently, for a given minimum value $\beta_{\min}$, we consider the class of signals

$$\mathcal{C}(\beta_{\min}) := \{\, \beta \in \mathbb{R}^p \mid |\beta_i| \ge \beta_{\min} \ \text{for all } i \in S \,\}, \qquad (4)$$

where $S$ is the support of $\beta$. Our results show that an SNR parameter governed by the minimum value $\beta_{\min}$, as opposed to the more traditional measure governed by the norm $\|\beta^*\|_2$ that would arise in assessing $\ell_2$ error, is the key quantity that controls subset selection. Indeed, we show that $\|\beta^*\|_2$ can be arbitrarily large without having any effect on the difficulty of subset recovery.

A. Decoders and Error Probabilities

A decoder is a mapping $\phi$ from the pair $(Y, X)$ to the family of all $k$-sized subsets of $\{1, \ldots, p\}$. The output $\phi(Y, X)$ corresponds to the decoder's best estimate of the unknown underlying subset. The underlying true vector $\beta^*$ is fixed but unknown. We focus on three different types of error, depending on whether (a) the error probability is taken conditionally on a fixed support set $S$, or (b) the error probability is averaged over a support set chosen uniformly at random (u.a.r.) from all possible $k$-sized subsets, or (c) the error probability is worst case over a support set chosen in an adversarial manner. In particular, in the case that $\beta^*$ has fixed but unknown support $S$, we define the $S$-based error

$$p_{\mathrm{err}}(\phi; S) := \mathbb{P}\big[\phi(Y, X) \neq S \mid S\big]. \qquad (5)$$

Here the probability is taken over the random measurement $(Y, X)$, or equivalently, over the observation noise $W$ and the random design matrix $X$, with the underlying support being $S$. When $S$ is viewed as a random variable, chosen u.a.r., we define the average error probability

$$p^{\mathrm{av}}_{\mathrm{err}}(\phi) := \binom{p}{k}^{-1} \sum_{S} p_{\mathrm{err}}(\phi; S). \qquad (6)$$

Finally, when $S$ is chosen in an adversarial manner, we define the worst case error probability

$$p^{\mathrm{w}}_{\mathrm{err}}(\phi) := \max_{S} \, p_{\mathrm{err}}(\phi; S). \qquad (7)$$

B. Design Covariance Parameters

Our second set of parameters involves the covariance matrix $\Sigma$ that defines the $\Sigma$-Gaussian ensemble of design matrices, in which each row of the design matrix is drawn i.i.d. from the $p$-dimensional normal distribution $N(0, \Sigma)$. We begin with the key quantities that arise in our sufficient conditions, as stated in Theorem 1. Given a pair of subsets $(S, T)$ with

, we define the matrix

(8)

Note that this matrix corresponds to the Schur complement [19] of the matrix with respect to the submatrix . With the support set $S$ viewed as fixed, we define the quantity

(9)

As will be clarified in our analysis, this quantity controls the relative distinguishability of subsets $S$ and $T$ under the exhaustive search decoder. For the case of average and worst case error probabilities, we define the uniform bound

(10)

Note that all of these quantities are extremely simple in the case of the standard Gaussian ensemble ($\Sigma = I_{p \times p}$): the Schur complement in (8) reduces to an identity matrix for all pairs of distinct subsets, and hence the quantities (9) and (10) are both equal to one.
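The quantities in (8) to (10) are built from Schur complements of sub-blocks of the covariance matrix and their minimum eigenvalues. The sketch below computes a generic Schur complement numerically. Because the index sets appearing in (8) are not legible in this copy, the function takes the two index blocks `A` and `B` explicitly and makes no claim about which of $S$ and $T$ plays which role; the function names are hypothetical.

```python
import numpy as np


def schur_complement(Sigma, A, B):
    """Sigma_AA - Sigma_AB Sigma_BB^{-1} Sigma_BA for disjoint index lists A, B.

    For a Gaussian vector with covariance Sigma, this is the conditional
    covariance of the coordinates in A given the coordinates in B.
    """
    A, B = list(A), list(B)
    S_AA = Sigma[np.ix_(A, A)]
    S_AB = Sigma[np.ix_(A, B)]
    S_BB = Sigma[np.ix_(B, B)]
    # Sigma is symmetric, so Sigma_BA equals S_AB transposed.
    return S_AA - S_AB @ np.linalg.solve(S_BB, S_AB.T)


def min_eigenvalue(M):
    """Smallest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(M)[0])


# Standard Gaussian ensemble: every Schur complement is an identity matrix.
M_std = schur_complement(np.eye(5), A=[2, 3], B=[0, 1])

# Two coordinates with correlation 0.5: conditional variance 1 - 0.5**2.
M_corr = schur_complement(np.array([[1.0, 0.5], [0.5, 1.0]]), A=[1], B=[0])
```

For correlated designs the minimum eigenvalue drops below one, which is the effect the covariance parameters are meant to capture.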

A closely related set of quantities arises in the statement of our necessary conditions on any algorithm, as stated in Theorem 2. In particular, letting $S$ denote a subset chosen uniformly at random from the family of all $k$-sized subsets, we define

(11)


As our analysis will demonstrate, this quantity measures the difficulty of distinguishing a subset from the collection of subsets that differ from it in only one position. The second quantity that we define plays a role in specifying the bulk effect of all subsets in

(12) As with the quantities that arise in the statement of Theorem 1, these quantities are especially simple for the case of the standard Gaussian ensemble ($\Sigma = I_{p \times p}$); in

particular, we have and . More

generally, the quantity is closely related to the quantity

; in particular, letting denote the submatrix of indexed by , we have the inequality

(13) Inequality follows from the definition (11), whereas inequality follows because by choosing subsets and such

that , we have

As mentioned above, these inequalities are met with equality for the standard Gaussian ensemble ($\Sigma = I_{p \times p}$); in Section III-B, we provide a more general family of matrices for which equality also holds (in particular, see Example 2).

C. Statement of Sufficient Conditions

We now have the necessary ingredients to state conditions on the triplet $(n, p, k)$, the minimum value $\beta_{\min}$, and the design covariance parameters from (9) and (10) that are sufficient to ensure exact support recovery using an optimal decoder (to be specified later). So as to simplify the statements of our results, we define the function

(14) Here, the quantity will be set to either or , depending on the error probability under discussion.

Theorem 1 (Sufficient Conditions): Given a problem instance from the $\Sigma$-Gaussian linear observation model (2), there exists a decoder with the following characteristics.

(a) For any fixed vector with fixed support , if the sample size satisfies

(15)

for some , then .

(b) For the support set chosen uniformly at random from , if the sample size satisfies

(16)

for some , then .

(c) For the support set chosen adversarially from , if the sample size satisfies

(17)

for some , then .

Remarks: Note that there are only minor differences in the conditions required for the three different types of error probability. The mildest conditions are required for the error probability (5) associated with a fixed subset: it requires only bounds on the quantity from (9), that is, only a uniform lower bound on the eigenvalues of the matrices defined in (8). The average and worst case error probabilities (6) and (7) involve any possible subset, and so require lower bounds on the quantity (10), which controls these eigenvalues uniformly over all distinct pairs of subsets. In addition, the worst case error probability requires an additional term, corresponding to the (logarithm of the) number of possible subsets of cardinality $k$.
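To make the notion of an exhaustive search decoder concrete, here is a minimal sketch. The decoder analyzed in the paper is specified in a later section; this version assumes the natural least-squares rule (choose the $k$-subset whose columns leave the smallest residual), and the function name and the test values are hypothetical. Its running time grows with the number of $k$-subsets, $\binom{p}{k}$, so it is only usable for very small problems.

```python
import itertools

import numpy as np


def exhaustive_search_decoder(X, Y, k):
    """Return the k-subset U minimizing min_b ||Y - X_U b||_2^2 (assumed rule)."""
    p = X.shape[1]
    best_subset, best_rss = None, np.inf
    for U in itertools.combinations(range(p), k):
        X_U = X[:, list(U)]
        # Least-squares fit restricted to the columns indexed by U.
        coef, *_ = np.linalg.lstsq(X_U, Y, rcond=None)
        rss = float(np.sum((Y - X_U @ coef) ** 2))
        if rss < best_rss:
            best_subset, best_rss = U, rss
    return set(best_subset)


# Small instance from the standard Gaussian ensemble with weak noise.
rng = np.random.default_rng(0)
n, p, k = 40, 8, 2
beta_star = np.zeros(p)
beta_star[[1, 5]] = [3.0, -2.0]
X = rng.standard_normal((n, p))
Y = X @ beta_star + 0.05 * rng.standard_normal(n)
S_hat = exhaustive_search_decoder(X, Y, k)
```

In this low-noise example the residual of the true support is far smaller than that of any competing subset, so the sketch is expected to return the true support; the interesting regime in the theorems is when $n$, $\beta_{\min}$, and the noise level make that comparison close.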

D. Statement of Necessary Conditions

Thus far, we have provided sufficient conditions for an exhaustive search decoder to succeed with high probability in recovering the support set. Of equal interest and complementary in nature are necessary conditions that must be satisfied by any method for reliable recovery to be possible. We state a result of this nature in this subsection.

Before proceeding, note that for any fixed subset $S$, it is not possible to provide any type of lower bound on the error probability (5), since the trivial decoder that outputs $S$ for all inputs $(Y, X)$ always achieves perfect recovery in this setting. Accordingly, it is necessary to lower-bound either the average probability of error (with $S$ drawn uniformly at random from the family of all $k$-sized subsets) or the worst case probability of error. The following result provides lower bounds on the sample size for the average error probability. Since the adversarial setting is not any easier, the following theorem also provides lower bounds for the worst case error.

Theorem 2 (Necessary Conditions): Consider the family of problem instances defined by random $\Sigma$-Gaussian designs and the linear observation model (2), with $S$ chosen uniformly at random from the family of all $k$-sized subsets. If the sample size $n$ is upper-bounded as

(18)

then for any decoding algorithm , there exists

a vector such that

The proof of this claim, given in Section V, is somewhat more indirect in nature, based on the Fano method [8], [18], [20], [34], [35] in order to lower-bound the probability of error for a restricted ensemble, which can be viewed as a certain type of hypothesis testing or channel decoding problem.

III. SOME CONSEQUENCES OF OUR RESULTS

In this section, we explore some consequences of our results. We begin by discussing two regimes in which Theorems 1 and 2 provide a sharp characterization of the sample complexity of the subset recovery problem. By comparison to known threshold results on the Lasso ($\ell_1$-constrained quadratic programming), these results reveal that the Lasso is information-theoretically optimal in some regimes, while dramatically suboptimal in others. We then discuss conditions on the design covariance matrix $\Sigma$, and show with an explicit construction that the exhaustive search decoder can succeed for designs for which the Lasso fails with high probability.

A. Consequences for Different SNR and Sparsity Regimes

We begin by discussing some regimes of SNR and sparsity in which the results of Theorems 1 and 2 provide a sharp characterization of the sample complexity of the subset recovery problem. In order to make explicit comparisons to the Lasso ($\ell_1$-constrained quadratic programming), we first state its sample complexity. For random design matrices drawn from any $\Sigma$-Gaussian ensemble satisfying a certain mutual incoherence condition (see (25) to follow), Wainwright [31] establishes a sharp phase transition for the success/failure of the Lasso. If we assume that the minimum eigenvalue and the incoherence parameter remain bounded away from zero, the Lasso threshold [31] is of the form

(19)

for constants .

1) Regime of Bounded Norm Vectors: We begin by considering the regime of bounded norm vectors (i.e., ), which implies (due to $k$-sparsity of $\beta^*$) that for some constant . In this regime, we have the following corollary of Theorems 1 and 2.

Corollary 1: Consider a signal with . Then the information-theoretic sample complexity of subset selection is given by

More precisely, there exist constants as follows.

(a) For sequences such that

(20) the exhaustive search decoder has error probability

. (b) For sequences such that

(21) any algorithm fails often—that is, .

Remarks: By comparison to the Lasso threshold (19), Corollary 1 implies that for signals with , the sample complexity of the Lasso is equal, up to constants independent of and , to the information-theoretic capacity.

Although Theorems 1 and 2 provide matching scalings for , it should be noted that the conditions do not match for all scalings of the squared minimum value $\beta_{\min}^2$. For instance, if the squared minimum value is constant (i.e., for some constant ), then Theorem 2 implies that samples are needed, whereas Theorem 1 dictates that samples are sufficient. It remains an open question to determine the sharp order of scaling for such regimes.

2) Consequences for Linear Sparsity: We have seen that the Lasso is information-theoretically optimal for certain regimes of the SNR parameter . In contrast, for other regimes of SNR and sparsity, Theorem 1 reveals a dramatic difference between the $\ell_1$-based Lasso and the performance of the optimal decoder. This difference appears in the regime of linear sparsity, in which $k = \alpha p$ for some $\alpha \in (0, 1)$. This linear sparsity regime is particularly relevant for compressed sensing [5], [11], where the parameter $\alpha$ corresponds to the fraction of nonzero entries in a signal to be reconstructed. First, note that if $k = \alpha p$, then according to the previously stated Lasso threshold (19), there is a constant such that (for any scaling of the squared minimum value $\beta_{\min}^2$) the Lasso fails unless the sample size satisfies . Hence, the number of samples required by the Lasso grows faster than linearly in $p$ (i.e., ). In sharp contrast, as long as the squared minimum value $\beta_{\min}^2$ does not decrease too quickly (as made precise below), Theorems 1 and 2 imply that the information-theoretic threshold is observations.

The following corollary makes these observations precise. To simplify the statement, for , define the function

(22) as well as the function

(23)

Here $H(\alpha) := -\alpha \log \alpha - (1 - \alpha) \log(1 - \alpha)$ is the binary entropy function. With this notation, we have the following.

Corollary 2: Consider a signal with linear sparsity (i.e., $k = \alpha p$ for some $\alpha \in (0, 1)$). Suppose that the minimum value for some . Then the information-theoretic threshold for subset recovery is . More precisely:

(a) Given size , the exhaustive search decoder has error probability .

(b) Conversely, if , then the probability of error of any algorithm is at least .

Remark: Note that we have


Consequently, for any fixed SNR constant and design parameter , for sufficiently small fractions $\alpha$, the optimal decoder can recover with . We note that the constant in the definition (22) of the threshold function is far from optimal,¹ and could certainly be improved by more careful control of constants in the large deviations analysis.
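For reference, the binary entropy function that enters the threshold functions above can be evaluated numerically as in the following sketch. The function name is hypothetical and the natural logarithm is an assumption of the example. The last lines check a standard bound of the kind used for binomial coefficients, $\log \binom{p}{k} \le p \, H(k/p)$, on one instance with linear sparsity.

```python
import math


def binary_entropy(alpha):
    """H(alpha) = -alpha log(alpha) - (1 - alpha) log(1 - alpha), natural log."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if alpha in (0.0, 1.0):
        return 0.0  # convention: 0 log 0 = 0
    return -alpha * math.log(alpha) - (1.0 - alpha) * math.log(1.0 - alpha)


# Linear sparsity example: p = 100, k = 20, so alpha = 0.2.
p, k = 100, 20
log_num_subsets = math.log(math.comb(p, k))
entropy_bound = p * binary_entropy(k / p)
```

Since $H(\alpha) \to 0$ as $\alpha \to 0$, the logarithm of the number of candidate supports per coordinate shrinks for very sparse signals.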

Proof: Recall the required sample size from (15) of Theorem 1. Under the stated conditions of the corollary, the ratio

is given by

For and , we have . Moreover,

from standard bounds on binomial coefficients (see bound (54) in Appendix C), for , we have

Combining the pieces yields the stated claim in part (a).

Turning to the claim in part (b), from Theorem 2, we know that at least samples are required. Substituting

in and , we obtain

where the final inequality uses the fact that .

In summary, for a squared minimum value scaling as for some constant , Corollary 2 demonstrates that the Lasso is highly suboptimal in the linear sparsity regime . Regardless of the linear fraction $\alpha$ and the squared minimum value $\beta_{\min}^2$, success of the Lasso for support recovery [31] requires the number of samples to scale so quickly that .

As pointed out by one of the reviewers, results by Candes and Tao [6] on the Dantzig selector (an $\ell_1$-based relaxation) apply to the linear-linear regime of Corollary 2. In particular, for measurement matrices drawn from the standard Gaussian ensemble, they establish bounds on the mean-square error (MSE) of prediction as well as on the $\ell_2$ error of the Dantzig selector. These results show that for the case of design matrices drawn from the standard Gaussian ensemble (with i.i.d. entries), a sample size of

(24)

is sufficient to achieve a squared $\ell_2$ reconstruction error that is bounded. Related results by Meinshausen and Yu [23] and Bickel et al. [3] have similar consequences for the Lasso.

¹ As pointed out by a reviewer, it requires that 10 for a meaningful result.

In the context of this paper (which focuses exclusively on support recovery), we note that the criterion of support recovery is related to but distinct from the criteria of prediction error and $\ell_2$ reconstruction error. On one hand, given a procedure that correctly recovers the support $S$ of the unknown vector $\beta^*$, we can simply restrict our problem to the subset $S$ and use standard methods (e.g., ordinary linear regression) to obtain estimates with good MSE prediction or $\ell_2$ error. In the opposite direction, however, an estimate $\hat{\beta}$ can be close to $\beta^*$ in $\ell_2$ norm but still have a different support than the true vector $\beta^*$. Indeed, as discussed above, for standard Gaussian matrices, the sample size (24) guarantees that the Dantzig selector and Lasso achieve $\ell_2$ errors that are bounded.

As pointed out by one of the reviewers, if the minimum value $\beta_{\min}$ were also strictly bounded away from zero and if also had entries bounded above by , then an estimate such that would be sufficient to guarantee support recovery. However, in the regimes of practical interest, the minimum value $\beta_{\min}$ decreases to zero at some rate (e.g., when $\beta^*$ has constant $\ell_2$ norm), so that recovery with constant $\ell_2$ error bounds is not sufficient. Indeed, a consequence of the results of Wainwright [31] is that the Lasso requires samples to perform support recovery, which scales much more rapidly than in the case of linear sparsity. Theorem 2 demonstrates that when , this scaling, as opposed to , is unavoidable for subset selection.

Moving on to consideration of arbitrary methods, a consequence of Corollary 2 is that no method can recover the support exactly with observations unless the squared minimum value $\beta_{\min}^2$ is lower-bounded as . Nonetheless, it is an interesting question to consider the subset selection performance of other computationally efficient methods.

B. Conditions on the Design Covariance

It is worthwhile comparing the conditions on the design matrix imposed by Theorems 1 and 2 to those conditions imposed in past work on $\ell_1$-based methods. One set of conditions, sufficient for guarantees in terms of $\ell_2$ or prediction error, are based on restricted isometry properties (RIP) [5], [12], requiring that the condition numbers of various submatrices of the matrix $X$ are uniformly very close to one. (For instance, among other conditions, RIP requires the bound , for a suitably small .) By known concentration results in random matrix theory [9], such RIP conditions hold with high probability for design matrices whose columns are close to orthogonal (e.g., for $X$ drawn from the standard Gaussian ensemble with and ). It should be noted that these RIP conditions, while sufficient, are far from necessary to obtain bounds on $\ell_2$ or prediction error; we refer the reader to Bickel et al. [3] for a much weaker set of conditions for $\ell_2$ and prediction error consistency.

By contrast, the focus of this paper is on the problem of exact support recovery, for which a related but distinct set of conditions on the design covariance are known to be necessary and sufficient. First, successful Lasso-based support recovery requires that the minimal eigenvalue stay bounded away from zero, and secondly (and more significantly), that a certain mutual incoherence parameter stay bounded strictly away from zero, namely

(25)

Whereas the lower bound on the minimal eigenvalue is a mild condition, the mutual incoherence condition (25) is more restrictive. It was first defined in the context of sample design matrices independently by Fuchs [17] and Tropp [29], and also imposed in other high-dimensional analyses of the Lasso [22], [36], [31]. Note that the eigenvalue lower bound and mutual incoherence condition (25) are trivially satisfied for random design matrices drawn from the standard Gaussian ensemble ($\Sigma = I_{p \times p}$).
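A small numerical sketch may help in checking an incoherence condition of this kind for a given covariance matrix and support. The exact expression in (25) is not legible in this copy, so the code assumes the form that is standard in the Lasso literature, $1 - \max_{j \notin S} \|\Sigma_{jS} (\Sigma_{SS})^{-1}\|_1$, and the function name is hypothetical. Under this assumed form, a positive value means the condition holds and a nonpositive value means it is violated.

```python
import numpy as np


def incoherence_margin(Sigma, S):
    """1 - max_{j not in S} ||Sigma_{jS} Sigma_SS^{-1}||_1 (assumed form of (25))."""
    p = Sigma.shape[0]
    S = sorted(S)
    Sc = [j for j in range(p) if j not in S]
    # Each row of `coeffs` is Sigma_{jS} Sigma_SS^{-1} for one j outside S.
    coeffs = np.linalg.solve(Sigma[np.ix_(S, S)], Sigma[np.ix_(S, Sc)]).T
    return 1.0 - float(np.max(np.sum(np.abs(coeffs), axis=1)))


# Standard Gaussian ensemble: the margin is exactly one.
margin_std = incoherence_margin(np.eye(4), [0, 1])

# Equicorrelated covariance with correlation 0.5 (illustrative choice).
Sigma_eq = 0.5 * np.eye(4) + 0.5 * np.ones((4, 4))
margin_eq = incoherence_margin(Sigma_eq, [0, 1])
```

For the equicorrelated matrix with $|S| = 2$, each off-support row has $\ell_1$ norm $2 \cdot 0.5 / 1.5 = 2/3$, so the margin under the assumed form is $1/3$.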

It is known [36], [31] that if the Lasso is applied to any ensemble of $\Sigma$-Gaussian measurement matrices for which the incoherence assumption (25) is violated and the noise has a symmetric distribution, it will fail with probability at least $1/2$, regardless of the sample size (see Wainwright [31] for a precise statement). Exploiting this fact, the following examples show that there exist covariance matrix families for which the optimal decoder can succeed while the Lasso fails.

Example 1: Consider the $\Sigma$-Gaussian family with covariance matrices of the form

... ... ... ... ... ...

(26)

for some . In particular, for , it can be verified that we have uniformly for all

.

Consider some -sized subset that does not include the index , and let be another -sized subset. Using the

notation , we have

If also does not include the index , then , so that . In the more interesting case that includes the index , a little calculation shows that

where is a vector of all ones. Consequently, for this family

where the first inequality uses the definition , and the second inequality uses the fact that . Consequently, the

optimal decoder succeeds in recovering the support set with observations.

On the other hand, suppose that the given subset has cardinality . By definition of the covariance matrix (26) and the mutual incoherence parameter in (25), we have

showing that with , the mutual incoherence condition (25) is violated. Consequently, for this ensemble and for any subset $S$ with elements that excludes the index , the probability of incorrect support recovery using $\ell_1$-constrained quadratic programming is at least one half [31], regardless of the sample size, whereas the optimal decoder will succeed with high probability for sample sizes .

It is also interesting to consider the quantities and that arise in the necessary and sufficient conditions of Theorems 1 and 2. As previously shown (13), the quantity always lower-bounds the quantity . The following example provides a family of matrices for which this lower bound is met with equality, so that the dependence on the design covariance identified by Theorems 1 and 2 is tight.

Example 2: Letting denote the all-ones vector, consider the family of covariance matrices

(27)

In this example, we show that for a squared minimum value of the order and any , Theorems 1 and 2 predict that the critical sample size scales as . To establish this fact, we begin by calculating the matrices that define the quantity . For any , any , and any pair of -sized subsets with , a little calculation (using the matrix inversion formula [19]) shows that

(28)

so that for any subset , and hence . Consequently, the exhaustive search decoder succeeds with high probability (w.h.p.) as long as the sample size satisfies

for some constant .

Let us compare this sufficient condition to the necessary conditions from Theorem 2. For this particular covariance matrix , a little calculation shows that

and moreover that . Therefore, for this ensemble with , Theorem 2 implies that if

for some constant , then the error probability of any algorithm is at least . Consequently, for this ensemble of matrices parameterized by , Theorems 1 and 2 provide a set of necessary and sufficient conditions that are matching up to constant factors independent of .

On the other hand, for any -sized subset , the condition number of the submatrix is given by

which tends to infinity for any fixed .
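The behavior of this condition number can be checked numerically. The sketch below assumes that the family (27) is the equicorrelated one, $\Sigma(\mu) = (1-\mu) I + \mu \vec{1}\vec{1}^T$ with $\mu \in [0, 1)$, in which case every $k$-sized principal submatrix has the same form. The closed form $(1 + \mu(k-1))/(1-\mu)$ used for comparison follows from that assumed structure; the helper names are illustrative.

```python
import numpy as np

def equicorrelated(size, mu):
    """Assumed form of the family (27): (1 - mu) * I + mu * 1 1^T."""
    return (1.0 - mu) * np.eye(size) + mu * np.ones((size, size))

def condition_number(block):
    """Ratio of largest to smallest eigenvalue of a symmetric positive definite matrix."""
    eig = np.linalg.eigvalsh(block)  # eigenvalues in ascending order
    return eig[-1] / eig[0]

mu = 0.5
sizes = (2, 8, 32)
# Any k-sized principal submatrix of the equicorrelated matrix is again equicorrelated
conds = [condition_number(equicorrelated(k, mu)) for k in sizes]
closed_form = [(1.0 + mu * (k - 1)) / (1.0 - mu) for k in sizes]
```

Under this assumption the condition number grows linearly in $k$ for fixed $\mu$, and it also diverges as $\mu \to 1$ for fixed $k$.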

Finally, we provide an example to illustrate that an upper bound on the maximum eigenvalue is not required for $\ell_1$-based support recovery.

Example 3: For a given -sized subset , consider the covariance matrix given by

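$$
\Sigma_{ij} = \begin{cases} 1 & \text{if } i = j \\ \mu & \text{if } i \neq j \text{ and } i, j \in S \\ 0 & \text{otherwise.} \end{cases}
$$

(29)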

For this matrix, a simple calculation shows that the maximum eigenvalue , which tends to infinity for any fixed as . On the other hand, since for all outside of the subset , the mutual incoherence condition (25) is satisfied with . Moreover, a little calculation shows that . Therefore, the Lasso will succeed using samples.

IV. PROOF OF THEOREM 1

This section is devoted to the proof of Theorem 1. We begin by setting up some useful notation to be used throughout the remainder of the analysis. Given any subset , we use the notation to denote the -dimensional subvector , and similarly for other vectors (e.g., , , etc.). In an analogous manner, we use to denote the matrix with columns . We use to denote the transpose of a matrix .

A. Exhaustive Search Decoder

Our route to establishing the sufficient conditions in Theorem 1 is by direct analysis of the decoder that searches exhaustively over all possible subsets of size . More specifically, the search decoder obtains its estimate by the following two-step procedure.

(a) For each subset of size , solve the quadratic program

(30)

(b) Return the subset .
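The two-step procedure above can be sketched in code as follows. This is a minimal illustration, assuming that the quadratic program (30) is the least-squares fit restricted to the columns of the candidate subset and that step (b) returns the subset with the smallest residual. The function name `exhaustive_search_decoder` and the toy problem sizes are hypothetical, and the enumeration over all subsets is practical only for very small problems.

```python
from itertools import combinations
import numpy as np

def exhaustive_search_decoder(X, y, k):
    """Return the k-sized column subset whose least-squares fit has the smallest residual."""
    best_subset, best_residual = None, np.inf
    for subset in combinations(range(X.shape[1]), k):
        cols = list(subset)
        # Step (a): least-squares fit restricted to the columns in `subset`
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        residual = float(np.sum((y - X[:, cols] @ coef) ** 2))
        # Step (b): keep the subset with the smallest residual seen so far
        if residual < best_residual:
            best_subset, best_residual = subset, residual
    return set(best_subset), best_residual

# Toy instance (all values hypothetical): standard Gaussian design, low noise
rng = np.random.default_rng(0)
n, p, k, noise_std = 40, 10, 3, 0.1
true_support = {1, 4, 7}
beta = np.zeros(p)
beta[list(true_support)] = 1.0
X = rng.standard_normal((n, p))
y = X @ beta + noise_std * rng.standard_normal(n)

estimate, _ = exhaustive_search_decoder(X, y, k)
```

In this high-SNR toy instance the decoder returns the true support. The number of candidate subsets grows as a binomial coefficient in the dimension, which is why the decoder serves as an analytical benchmark rather than a practical method.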

Of interest to us are various error probabilities associated with this procedure. We begin by bounding the -based error

with the probability taken over and the noise vector , when the underlying support is fixed to . Using this result, we then bound the average error probabilities and the worst case error probabilities , as defined in (6) and (7), respectively.

With and random matrices drawn from a Gaussian ensemble with nondegenerate covariance , each of the submatrices has rank with probability one.² Accordingly, we may define the matrices

and (31a)

(31b)

Note that and are both orthogonal projection matrices, associated with the -dimensional range space and -dimensional nullspace , respectively. For any pair of subsets and , each with elements, define the random variable

(32)

With these definitions, we state the following result.

Lemma 1: For any given vector with support , the exhaustive search decoder prefers to the true underlying if and only if .

Proof: We begin by showing that for any subset for which is full rank, the quantity defined in (30) is equal to . Under the given rank condition, the linear least squares estimator of is given by

. Noting that by the definition (31a) of , we have we substitute into the quadratic norm and expand, thereby obtaining

Failure occurs if and only if , as claimed.
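The identity underlying this proof can be verified numerically. The sketch below assumes that the matrix in (31a) is the usual least-squares projection $X_U (X_U^T X_U)^{-1} X_U^T$ and that the optimal value of (30) is the squared norm of the least-squares residual. The variable names and the random test instance are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 4
X_U = rng.standard_normal((n, k))   # hypothetical submatrix, full column rank with probability one
y = rng.standard_normal(n)

# Orthogonal projection onto range(X_U) and onto its orthogonal complement
proj = X_U @ np.linalg.solve(X_U.T @ X_U, X_U.T)
proj_perp = np.eye(n) - proj

# Residual from solving the least-squares problem directly
coef, *_ = np.linalg.lstsq(X_U, y, rcond=None)
ls_residual = float(np.sum((y - X_U @ coef) ** 2))

# Residual expressed through the projection onto the orthogonal complement
proj_residual = float(np.sum((proj_perp @ y) ** 2))
```

The two residuals agree up to floating-point error. The projection is symmetric and idempotent with trace equal to the subset size, and its complement annihilates every column of the submatrix.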

Overall, the search decoder fails if and only if at least one (with cardinality ) is preferable to ; consequently, the overall probability of error can be written as

(33)

Consequently, assuming that has support , the technical result central to analyzing the error probability (33) is tight control on the probabilities of the events , for all -sized subsets .

B. Large Deviations Bound

Before stating a large deviations bound, we require some notation. Recalling the definition (8) of , we use to denote its symmetric matrix square root. For each , define the quantity

(34)

² That is, the probability that Gaussian random vectors in are linearly dependent is equal to zero, which follows since the Gaussian has a density with respect to Lebesgue measure.

representing a type of SNR, reflecting how distinguishable subset is from . With this notation, we have the following.

Lemma 2: As long as , for any pair of distinct subsets and , we have

(35)

The proof of Lemma 2 is somewhat technical in nature. However, the high-level strategy is straightforward: given some , we define the events

and (36a) (36b)

We now observe that for any , the event implies that at least one of the events or holds. Indeed, supposing that neither nor is true, then the quantity

is lower-bounded by .

Consequently, by union bound, it suffices to control the two probabilities and . This argument applies for any choice of ; a convenient choice turns out to be

.

With this setup, the proof of Lemma 2 is a consequence of the following two results, proved in Appendices A and B, respectively.

Lemma 3: For all

(37)

Moreover, the choice is valid as long as .

Lemma 4: With , we have

(38)

for any pair of distinct subsets and .

Combining these two results yields the claim of Lemma 2.

C. Analysis of Error Probability

Using Lemma 2, we are now equipped to complete the proof of Theorem 1. Denote by the number of subsets with cardinality , such that . (Moreover, note that since both and have cardinality , we have as well.) A counting argument yields that, for each with , there are

(39)

such subsets.
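The counting argument can be checked by brute force on a small instance. The sketch below assumes that subsets are indexed by $t = |T \setminus S|$ and that the count in (39) takes the standard form $\binom{k}{t}\binom{p-k}{t}$: choose which $t$ elements of $S$ to drop and which $t$ elements outside $S$ to add. The function name and problem sizes are hypothetical.

```python
from itertools import combinations
from math import comb

def count_by_overlap(p, k, support):
    """Brute-force count of k-sized subsets T, grouped by t = |T \\ S|."""
    counts = {}
    for T in combinations(range(p), k):
        t = len(set(T) - support)
        counts[t] = counts.get(t, 0) + 1
    return counts

p, k = 9, 4
support = set(range(k))
brute = count_by_overlap(p, k, support)

# Assumed closed form: drop t elements of S, add t elements from outside S
formula = {t: comb(k, t) * comb(p - k, t) for t in range(0, min(k, p - k) + 1)}
```

The case $t = 0$ corresponds to the true support itself, and summing the counts over all $t$ recovers the total number of $k$-sized subsets.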

In order to simplify the statement of the result, we begin by deriving a weaker form of the large deviations bounds from Lemma 2, albeit one that leads to simpler expressions. Observe that the function is increasing on the interval . Consider a pair of subsets and with overlap of size . Using the definition (9) of , we have

Consequently, for any pair with , the bound (35) implies that

(40)

Combining this upper bound with the union bound applied to the expression (33) yields that is upper-bounded by

which is further upper bounded by

Consequently, in order for the error probability to vanish asymptotically, it suffices to take such that is greater than

Let us upper-bound this quantity (denote it ). By our assumption that , we have . Moreover, we have

Overall, we conclude that

(41)

Given that the conditions of Theorem 1 certainly imply that , it suffices to restrict attention to the term involving —namely, to upper-bound the quantity

(42)

Using standard bounds on binomial coefficients (see Appendix C), we have

Returning to the upper bound on the error probability obtained above, we conclude that for a sample size satisfying

(43)

for some , the error probability associated with detecting support set decays exponentially as

thereby establishing the claim of Theorem 1(a).

Turning to Theorem 1(b), if we replace the quantity in the lower bound (43) by , then we can conclude that the average probability of error also vanishes exponentially fast.

Finally, turning to Theorem 1(c), let us consider the case of the worst case error probability taken over all subsets . In this case, in addition to using the worst case measure , we also need the probability of error for any given subset to converge to zero sufficiently fast. In particular, by union bound, we have

Consequently, if for some , we choose a sample size such that

then we have .

V. PROOF OF THEOREM 2

In this section, we prove the necessary conditions stated in Theorem 2. Our method involves two restricted versions of the subset recovery problem, for which the analysis can be reduced to a type of channel decoding problem. We then apply a variant of Fano’s bound [8] to analyze the error probability over these restricted ensembles. We note that the Fano method is a standard approach for obtaining minimax lower bounds in nonparametric statistical problems [18], [20], [35], [34].

A. Basic Setup

We begin by describing the basic setup for the proof of Theorem 2. Let denote a particular subset of the set of all -sized subsets, and let be a set-valued random variable, uniformly distributed over —that is,

for all . Suppose that the decoder is told that the selected subset is a member of , and moreover it is provided with the form of the vectors for all . Its goal is to use the pair to recover the unknown subset . Note that the two forms of side information—namely, that , and the form of for any fixed —cannot harm the decoder’s performance, since the decoder can always choose to ignore this information. Consequently, the error probability of the decoder for the original problem is lower-bounded by the error probability , where is uniform over . This is a multi-way hypothesis testing problem, and we may lower-bound the probability of error of any decoder using Fano’s inequality [8].

We lower-bound this error probability as follows: first, for any fixed and for any decoder

(44)

where is the mutual information between and conditioned on ; explicitly, it is given by

. Taking expectations of both sides of (44), we conclude that

(45)

Consequently, in order to make effective use of the Fano lower bound (44) or its averaged form (45), we need to construct ensembles for which is relatively large while the mutual information is relatively small. In our analysis, we make use of the upper bound on the mutual information

(46)

which follows from the convexity of mutual information [8]. Here, the quantity

(47)

is the Kullback–Leibler divergence between the distributions and .
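The way these ingredients combine can be illustrated with a small numerical sketch. It uses two standard facts that are assumed to correspond to (44) and (46) up to constants and normalization: Fano's inequality in the form $P[\text{error}] \ge 1 - (I + \log 2)/\log M$ for $M$ equiprobable hypotheses, and the bound of the mutual information by the average pairwise Kullback–Leibler divergence. For a fixed design matrix and Gaussian noise of variance $\sigma^2$, the divergence between two hypotheses is $\|X(\beta_S - \beta_T)\|_2^2 / (2\sigma^2)$. All numerical values, the function names, and the choice of equal-amplitude signals are hypothetical.

```python
import math
from itertools import combinations
import numpy as np

def gaussian_kl(mean_a, mean_b, noise_var):
    """KL divergence between N(mean_a, noise_var * I) and N(mean_b, noise_var * I)."""
    return float(np.sum((mean_a - mean_b) ** 2)) / (2.0 * noise_var)

def fano_lower_bound(mutual_info, num_hypotheses):
    """Standard Fano bound: P(error) >= 1 - (I + log 2) / log M, in natural logarithms."""
    return 1.0 - (mutual_info + math.log(2.0)) / math.log(num_hypotheses)

rng = np.random.default_rng(2)
n, p, k, beta_min, noise_var = 5, 12, 2, 0.3, 1.0   # hypothetical toy values
X = rng.standard_normal((n, p))

# One hypothesis per k-sized subset, each with equal-amplitude coefficients
subsets = list(combinations(range(p), k))
means = []
for S in subsets:
    beta = np.zeros(p)
    beta[list(S)] = beta_min
    means.append(X @ beta)

# Average pairwise KL divergence upper-bounds the mutual information given X
M = len(subsets)
avg_kl = sum(gaussian_kl(a, b, noise_var) for a in means for b in means) / M**2
bound = fano_lower_bound(avg_kl, M)
```

The bound is informative only when the average divergence is small relative to $\log M$, which is the trade-off described above: the ensemble should contain many hypotheses that are nevertheless hard to distinguish.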

B. Bound for Bulk Ensemble

The bulk ensemble is defined by the choice , and then setting

(48)

for each . A straightforward computation yields that

so that we have

(49)

where

and

Note that each is a random variable (as a function of the random design matrix ); a little calculation shows that

and, moreover, that

where the final inequality follows by our choice (48) of , and the definition (12) of . Consequently, using the bound (49), we have , and hence, using the bound (45)

Consequently, if the sample size is upper bounded as

then as claimed.

C. Bound for Nearby Subsets

We now describe bounds based on a second ensemble, introduced by Wang et al. [32]. For any subset , let be an index that achieves the minimum in the definition (11) of the function . We then let the ensemble consist of all subsets that contain all indices in the set , and then one more index chosen from the set . Observe that the resulting family of subsets has cardinality , with the property that for each distinct pair of subsets, the Hamming distance is equal to two.
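The structure of this ensemble can be made concrete with a short enumeration. The sketch below assumes that each member keeps the $k-1$ indices of $S \setminus \{j\}$ and adds one index from $S^c \cup \{j\}$, which gives $p - k + 1$ subsets. The index `j` is fixed arbitrarily here, whereas in the construction above it is the minimizer in (11); the function name and sizes are hypothetical.

```python
from itertools import combinations

def nearby_ensemble(p, support, j):
    """Subsets that keep support minus {j} and add one index from the complement or j itself."""
    core = set(support) - {j}
    candidates = (set(range(p)) - set(support)) | {j}
    return [frozenset(core | {i}) for i in sorted(candidates)]

p, support, j = 10, {0, 1, 2, 3}, 2   # hypothetical values
k = len(support)
family = nearby_ensemble(p, support, j)

# Hamming distance between two subsets equals the size of their symmetric difference
distances = {len(a ^ b) for a, b in combinations(family, 2)}
```

Every pair of distinct members differs in exactly one added index, so all pairwise Hamming distances equal two, and the original support is itself a member of the family.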

For each subset , we define its signal vector as follows:

if if

where achieves the minimum in (11). Note that we have by construction.

Now consider a pair of distinct subsets , say with and . With the choices given above, a little calculation shows that

so that we have

Taking expectations and using the definition (11) of the function , we obtain that is equal to

Expanding the expectation yields that is equal to

which is upper-bounded by , using the definition (11). Consequently, by the Fano bound, we obtain

Consequently, if , then the probability of error remains bounded below by , as claimed.

VI. CONCLUSION

In this paper, we have analyzed the information-theoretic limits of the sparsity recovery problem for the linear observation model (2) with measurement vectors drawn from -Gaussian ensembles, including the standard Gaussian one as a special case. We have established both lower and upper bounds on the number of observations as a function of the model dimension , signal sparsity , squared minimum value , and noise variance as well as other parameters of the design covariance that are required for asymptotically reliable recovery. In conjunction with previous work [31] on the limits of the Lasso ($\ell_1$-constrained quadratic programming), this analysis has some consequences.

(a) For signals of bounded norm, the Lasso achieves the information-theoretically optimal order of scaling as a function of and (see Corollary 1), whereas (b) for signals with linear sparsity and squared minimum value , the Lasso is suboptimal (see Corollary 2).

There are a variety of open directions suggested by our analysis. First, while our upper and lower bounds are essentially matching for certain regimes of scaling, it is likely that the analysis can be tightened in other regimes. In particular, the necessary conditions stated in Theorem 2 certainly involve some slack, since they are obtained by analyzing restricted ensembles in which the value of on the subset is known a priori to the decoder. It would be interesting to see if sharper results could be obtained via analysis of less restrictive ensembles. Second, our work has revealed the suboptimality of current practical methods in the linear sparsity regime ( for some ) for sufficiently high SNR (in particular, ). It is possible that multistage methods (e.g., [33], [23]) could be helpful in closing these gaps. Third, our results highlight various differences between the conditions on the design covariance matrix (from which the random measurement matrices are generated) required by $\ell_1$-based methods such as the Lasso, as contrasted with exponential-complexity methods. It would be interesting to see to what extent the mutual incoherence conditions that affect standard methods can be relaxed; see Meinshausen and Yu [23] for some progress in this direction.

APPENDIX

A. Proof of Lemma 3

Using the linear observation model (2), we note that , so that for any , we can write

(50)

where we have adopted the shorthand notation . The following lemma characterizes the distribution of the random variable to be bounded.

Lemma 5: For any two -sized subsets and with overlap , we have

where and are chi-squared variables with degrees of freedom.

Proof: Note that by the Pythagorean Theorem for projections, we have

Again using the Pythagorean Theorem, we can write

where we have used the facts that

. Since there is an analogous decomposition for , we can write

Now the matrix is a projection matrix with rank equal to , and similarly for the matrix . Consequently, is distributed as , and similarly for the second term.

Using this lemma and the decomposition (50), we may write

By triangle inequality, we have ,

so that by union bound

where . As long as , we may apply the chi-squared tail bound (56) to conclude that

as claimed.

To establish the validity of the choice , we note that

Consequently, we have

so that it suffices to have , as claimed.

B. Proof of Lemma 4

We begin by conditioning on and , and showing that the random variable follows a noncentral chi-squared distribution. By conditioning on , we can decompose into a linear prediction based on and a zero-mean error term.

In particular, we have

where is a Gaussian random matrix independent of , with i.i.d. rows drawn from the zero-mean Gaussian distribution with covariance matrix from (8). Using this decomposition, we have

since the orthogonal projection annihilates any terms in the column space .

Let us diagonalize the orthogonal projection matrix , writing it as where is diagonal with ones, and zeros, and is a unitary matrix. With this transformation, we have

since is unitary. The random vector has i.i.d. Gaussian entries with variance , so that multiplication by leaves its distribution unchanged. We conclude that conditioned on and , the rescaled variable

Since the rescaled vector has i.i.d. entries, has a noncentral chi-squared distribution with degrees of freedom, and noncentrality parameter

Now the event can be expressed in terms of and as , where we have introduced the convenient shorthand . For the choice , we have . Consequently, setting in (58b), we obtain , with

(51)

Finally, let us define the event .

For each fixed , the variable is a (central) chi-squared variate with degrees of freedom, so that by the tail bound (57), we have .

Putting together the pieces, we have

Since the event is a function only of and , our earlier tail bound (51) may be applied. Moreover, conditioned on , we have , so that we obtain

so that we conclude that

(52)

as claimed.

C. Bounds on Binomial Coefficients

We make use of the following crude bounds on the binomial coefficients:

(53)

In addition, the following bound is also standard [8]:

(54)

where is the binary entropy function.
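Bounds of this type are easy to check numerically. The sketch below tests two standard pairs of inequalities, which may differ in constants or form from (53) and (54): the crude bounds $(n/k)^k \le \binom{n}{k} \le (en/k)^k$, and the entropy bounds $\frac{1}{n+1} e^{nH(k/n)} \le \binom{n}{k} \le e^{nH(k/n)}$ with $H$ the binary entropy in nats. The function names and the range of tested values are illustrative.

```python
from math import comb, log, e

def binary_entropy(x):
    """Binary entropy in nats, with H(0) = H(1) = 0."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * log(x) - (1.0 - x) * log(1.0 - x)

def check_bounds(n, k, tol=1e-12):
    """Check the crude and entropy-based bounds on log C(n, k) for 0 < k < n."""
    log_binom = log(comb(n, k))
    crude_lower = k * log(n / k)                 # (n/k)^k <= C(n, k)
    crude_upper = k * log(e * n / k)             # C(n, k) <= (e n / k)^k
    entropy_upper = n * binary_entropy(k / n)    # C(n, k) <= exp(n H(k/n))
    entropy_lower = entropy_upper - log(n + 1)   # exp(n H(k/n)) / (n + 1) <= C(n, k)
    crude_ok = crude_lower - tol <= log_binom <= crude_upper + tol
    entropy_ok = entropy_lower - tol <= log_binom <= entropy_upper + tol
    return crude_ok and entropy_ok

all_hold = all(check_bounds(n, k) for n in range(2, 60) for k in range(1, n))
```

The small tolerance only guards against floating-point rounding in the cases where a bound holds with equality, such as $k = 1$ in the crude lower bound.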

D. Tail Bounds for $\chi^2$-Variates

The following large deviations bounds for centralized $\chi^2$ are taken from Laurent and Massart [21]. Given a centralized $\chi^2$-variate with degrees of freedom, then for all

and (55a) (55b)

The following consequences of these bounds are useful in our analysis. First, for , we have

(56)

Starting with the bound (55a), setting yields . Since for , we have for all . On the other hand, for all , we have , so that the claim (56) follows. Secondly, we have

(57)

which follows by setting in (55a).
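The Laurent and Massart bounds can be compared against exact tail probabilities. The sketch below assumes that (55a) and (55b) are the familiar statements $P[X - d \ge 2\sqrt{dx} + 2x] \le e^{-x}$ and $P[d - X \ge 2\sqrt{dx}] \le e^{-x}$ for a central $\chi^2$ variate $X$ with $d$ degrees of freedom. To avoid external dependencies, it uses the closed-form tail available when $d$ is even, $P[X > t] = e^{-t/2} \sum_{j < d/2} (t/2)^j / j!$. The function names and the tested grid are hypothetical.

```python
from math import exp, sqrt, factorial

def chi2_upper_tail(t, d):
    """Exact P[X > t] for X chi-squared with an even number d of degrees of freedom."""
    assert d % 2 == 0 and d > 0
    if t <= 0:
        return 1.0
    half = t / 2.0
    return exp(-half) * sum(half ** j / factorial(j) for j in range(d // 2))

def laurent_massart_holds(d, x):
    """Check the assumed upper and lower tail inequalities at deviation level x > 0."""
    upper = chi2_upper_tail(d + 2.0 * sqrt(d * x) + 2.0 * x, d) <= exp(-x)
    lower = 1.0 - chi2_upper_tail(d - 2.0 * sqrt(d * x), d) <= exp(-x)
    return upper and lower

results = [laurent_massart_holds(d, x)
           for d in (2, 10, 50)
           for x in (0.1, 0.5, 1.0, 3.0, 8.0)]
```

On this grid both inequalities hold with room to spare, which is consistent with their role as convenient rather than sharp bounds.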

More generally, the analogous tail bounds for noncentral $\chi^2$, taken from Birgé [4], can be established via the Chernoff technique, and careful bounding of the moment generating function.

Let be a noncentral $\chi^2$ variable with degrees of freedom and noncentrality parameter . Then for all

(58a) (58b)

ACKNOWLEDGMENT

The author wishes to thank Peter Bickel for helpful discussions and pointers, and the anonymous reviewers for careful reading and helpful comments and suggestions that improved the presentation.

REFERENCES

[1] S. Aeron, M. Zhao, and S. Venkatesh, “Information-theoretic bounds to sensing capacity of sensor networks under fixed SNR,” in Proc. IEEE Information Theory Workshop, San Diego, CA, Sep. 2007.

[2] M. Akcakaya and V. Tarokh, “Shannon Theoretic Limits on Noisy Compressive Sampling,” Harvard Univ., Cambridge, MA, Tech. Rep., Nov. 2007 [Online]. Available: arXiv:cs.IT:0711.0366

[3] P. Bickel, Y. Ritov, and A. Tsybakov, “Simultaneous analysis of Lasso and Dantzig selector,” Ann. Statist., to be published.

[4] L. Birgé, “An alternative point of view on Lepski’s method,” in State of the Art in Probability and Statistics, ser. IMS Lecture Notes, no. 37. Beachwood, OH: Inst. Math. Statist., 2001, pp. 113–133.

[5] E. Candès and T. Tao, “Decoding by linear programming,” IEEE Trans. Inf. Theory, vol. 51, no. 12, pp. 4203–4215, Dec. 2005.

[6] E. Candès and T. Tao, “The Dantzig selector: Statistical estimation when p is much larger than n,” Ann. Statist., vol. 35, no. 6, pp. 2313–2351, 2007.

[7] S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM J. Sci. Comput., vol. 20, no. 1, pp. 33–61, 1998.

[8] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.

[9] K. R. Davidson and S. J. Szarek, “Local operator theory, random matrices, and Banach spaces,” in Handbook of Banach Spaces. Amsterdam, The Netherlands: Elsevier, 2001, vol. 1, pp. 317–336.

[10] R. A. DeVore and G. G. Lorentz, Constructive Approximation. New York: Springer-Verlag, 1993.

[11] D. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006.

[12] D. Donoho, “For most large underdetermined systems of linear equations, the minimal $\ell_1$-norm solution is also the sparsest solution,” Commun. Pure and Appl. Math., vol. 59, no. 6, pp. 797–829, Jun. 2006.

[13] D. L. Donoho, “For most large underdetermined systems of linear equations, the minimal $\ell_1$-norm near-solution approximates the sparsest near-solution,” Commun. Pure and Appl. Math., vol. 59, no. 7, pp. 907–934, Jul. 2006.

[14] D. L. Donoho and J. M. Tanner, “Counting faces of randomly-projected polytopes when the projection radically lowers dimension,” J. Amer. Math. Soc., vol. 22, pp. 1–53, Jul. 2009.

[15] A. K. Fletcher, S. Rangan, and V. K. Goyal, “Necessary and Sufficient Conditions on Sparsity Pattern Recovery,” Univ. Calif., Berkeley, Tech. Rep., Apr. 2008 [Online]. Available: arXiv:cs.IT/0804.1839

[16] A. K. Fletcher, S. Rangan, V. K. Goyal, and K. Ramchandran, “Denoising by sparse approximation: Error bounds based on rate-distortion theory,” J. Appl. Signal Process., vol. 10, pp. 1–19, 2006.

[17] J. J. Fuchs, “Recovery of exact sparse representations in the presence of noise,” in Proc. Int. Conf. Acoustics, Speech and Signal Processing, Montreal, QC, Canada, 2004, vol. 2, pp. 533–536.

[18] R. Z. Has’minskii, “A lower bound on the risks of nonparametric estimates of densities in the uniform metric,” Theory Prob. Appl., vol. 23, pp. 794–798, 1978.

[19] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge, U.K.: Cambridge Univ. Press, 1985.

[20] I. A. Ibragimov and R. Z. Has’minskii, Statistical Estimation: Asymptotic Theory. New York: Springer-Verlag, 1981.

[21] B. Laurent and P. Massart, “Adaptive estimation of a quadratic functional by model selection,” Ann. Statist., vol. 28, no. 5, pp. 1303–1338, 1998.

[22] N. Meinshausen and P. Bühlmann, “High-dimensional graphs and variable selection with the Lasso,” Ann. Statist., vol. 34, pp. 1436–1462, 2006.

[23] N. Meinshausen and B. Yu, “Lasso-type recovery of sparse representations for high-dimensional data,” Ann. Statist., to be published.

[24] A. J. Miller, Subset Selection in Regression. New York: Chapman & Hall, 1990.

[25] B. K. Natarajan, “Sparse approximate solutions to linear systems,” SIAM J. Comput., vol. 24, no. 2, pp. 227–234, 1995.

[26] G. Reeves and M. Gastpar, “Sampling bounds for sparse support recovery in the presence of noise,” in Proc. IEEE Int. Symp. Information Theory, Toronto, ON, Canada, Jul. 2008, pp. 2187–2191.

[27] S. Sarvotham, D. Baron, and R. G. Baraniuk, “Measurements versus bits: Compressed sensing meets information theory,” in Proc. Allerton Conf. Communication, Control, and Computing, Monticello, IL, Sep. 2006.

[28] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Roy. Statist. Soc., ser. B, vol. 58, no. 1, pp. 267–288, 1996.

[29] J. Tropp, “Just relax: Convex programming methods for identifying sparse signals in noise,” IEEE Trans. Inf. Theory, vol. 52, no. 3, pp. 1030–1051, Mar. 2006.

[30] M. J. Wainwright, “Information-Theoretic Bounds for Sparsity Recovery in the High-Dimensional and Noisy Setting,” Dep. Statist., Univ. Calif., Berkeley, Tech. Rep. 725, Jan. 2007 [Online]. Available: arxiv:math.ST/0702301, presented at the IEEE Int. Symp. Information Theory, Nice, France, June 2007

[31] M. J. Wainwright, “Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (Lasso),” IEEE Trans. Inf. Theory, vol. 55, no. 5, pp. 2183–2202, May 2009.

[32] W. Wang, M. J. Wainwright, and K. Ramchandran, “Information-Theoretic Limits on Sparse Signal Recovery: Dense Versus Sparse Measurement Matrices,” Univ. Calif., Berkeley, Tech. Rep., June 2008 [Online]. Available: arXiv:0806.0604, presented at the IEEE Int. Symp. Information Theory, Toronto, ON, Canada, Jul. 2008

[33] L. Wasserman and K. Roeder, “Multi-stage variable selection: Screen and clean,” Ann. Statist., to be published.

[34] Y. Yang and A. Barron, “Information-theoretic determination of minimax rates of convergence,” Ann. Statist., vol. 27, no. 5, pp. 1564–1599, 1999.

[35] B. Yu, “Assouad, Fano, and Le Cam,” in Festschrift for Lucien Le Cam. Berlin, Germany: Springer-Verlag, 1997, pp. 423–435.

[36] P. Zhao and B. Yu, “On model selection consistency of Lasso,” J. Mach. Learn. Res., vol. 7, pp. 2541–2567, 2006.

Martin Wainwright (M’03) received the Ph.D. degree in electrical engineering and computer science from the Massachusetts Institute of Technology (MIT), Cambridge.

He is currently an Associate Professor at University of California at Berkeley, with a joint appointment between the Department of Statistics and the Department of Electrical Engineering and Computer Sciences. His research interests include statistical signal processing, coding and information theory, statistical machine learning, and high-dimensional statistics.

Prof. Wainwright has been awarded an IEEE Signal Processing Society Best Paper Award, an Alfred P. Sloan Foundation Fellowship, an NSF CAREER Award, the George M. Sprowls Prize for his dissertation research, a Natural Sciences and Engineering Research Council of Canada 1967 Fellowship, and several outstanding conference paper awards.