Information-Theoretic Limits on Sparsity Recovery in the High-Dimensional and Noisy Setting
Martin J. Wainwright, Member, IEEE
Abstract: The problem of sparsity pattern or support set recovery refers to estimating the set of nonzero coefficients of an unknown vector $\beta^* \in \mathbb{R}^p$ based on a set of $n$ noisy observations. It arises in a variety of settings, including subset selection in regression, graphical model selection, signal denoising, compressive sensing, and constructive approximation. The sample complexity of a given method for subset recovery refers to the scaling of the required sample size $n$ as a function of the signal dimension $p$, sparsity index $k$ (number of nonzeros in $\beta^*$), as well as the minimum value $\beta_{\min}$ of $\beta^*$ over its support and other parameters of the measurement matrix. This paper studies the information-theoretic limits of sparsity recovery: in particular, for a noisy linear observation model based on random measurement matrices drawn from general Gaussian measurement matrices, we derive both a set of sufficient conditions for exact support recovery using an exhaustive search decoder, as well as a set of necessary conditions that any decoder, regardless of its computational complexity, must satisfy for exact support recovery. This analysis of fundamental limits complements our previous work on sharp thresholds for support set recovery over the same set of random measurement ensembles using the polynomial-time Lasso method ($\ell_1$-constrained quadratic programming).
Index Terms: Compressed sensing, $\ell_1$-relaxation, Fano’s method, high-dimensional statistical inference, information-theoretic bounds, Lasso, model selection, signal denoising, sparsity pattern, sparsity recovery, subset selection, support recovery.
I. INTRODUCTION
Suppose that we are given a set of observations of a fixed but unknown vector $\beta^* \in \mathbb{R}^p$. In a variety of settings, it is known a priori that the vector $\beta^*$ is sparse, meaning that its support set $S$ (corresponding to those indices $i$ for which $\beta^*_i$ is nonzero) is relatively small, say with size $k$. Sparsity recovery refers to the problem of correctly estimating the support set $S$ based on a set of noisy observations. This sparsity recovery problem is of broad interest, arising in various areas, including subset selection in regression [24], structure estimation in graphical models [22], sparse approximation [10], [25], signal denoising [7], and compressive sensing [11], [5].
Manuscript received August 28, 2007; revised April 20, 2009. Current version published November 20, 2009. This work was supported in part by the National Science Foundation under Grants NSF DMS-0528488, CAREER-CCF-0545862, a Microsoft Research Grant, and a Sloan Foundation Fellowship. The material in this paper was presented in part at the IEEE International Symposium on Information Theory (ISIT), Nice, France, June 2007, and was posted on arXiv in February 2007 (math/0702301).
The author is with the Department of Statistics, and Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720 USA.
Communicated by A. Krzyżak, Associate Editor for Pattern Recognition, Statistical Learning and Inference.
Digital Object Identifier 10.1109/TIT.2009.2032816
A great deal of work over the past few years has focused on the performance of computationally tractable methods, many based on $\ell_1$-norm or other convex relaxations, both for recovering the exact sparsity pattern as well as related problems in sparse approximation. We provide a brief overview of those parts of this extensive literature most relevant to our work in Section I-A. Of equal interest and complementary in nature, however, are the information-theoretic limits associated with the performance of any procedure for sparsity recovery. Such understanding of fundamental limitations is crucial in assessing the behavior of computationally tractable methods. In particular, there is little point in proposing novel methods for sparsity recovery, possibly with higher computational complexity, if currently extant and computationally tractable methods achieve the information-theoretic limits. On the other hand, an information-theoretic analysis can reveal where there currently exists a gap between the performance of computationally tractable methods and the fundamental limits. Indeed, the information-theoretic analysis of this paper makes contributions of both types.
With this motivation in mind, the focus of this paper is on the information-theoretic limitations of sparsity recovery. In particular, our analysis focuses on the noisy and high-dimensional setting, meaning that the observations are contaminated by noise, and all three problem parameters (the number of observations $n$, the model dimension $p$, and the sparsity index $k$, defined below) may tend to infinity. Our main results, stated more precisely in Section II, are necessary and sufficient conditions for subset recovery, stated in terms of the triplet $(n, p, k)$ as well as signal-to-noise parameters such as the minimum value $\beta_{\min}$ of the signal and the noise variance $\sigma^2$. More specifically, our analysis applies to the class of random $\Sigma$-Gaussian measurement ensembles, in which each measurement is based on the inner product between $\beta^*$ and a random $p$-vector $x \sim N(0, \Sigma)$. As a special but important case, this model includes the standard Gaussian ensemble in which the entries $X_{ij} \sim N(0, 1)$ are independent and identically distributed (i.i.d.), obtained by setting $\Sigma = I_{p \times p}$. In this paper, we derive a set of sufficient conditions for asymptotically perfect recovery using an exhaustive search decoder, as well as a set of necessary conditions that any decoder must satisfy for perfect recovery.
The analysis given here complements our earlier paper [31] that established precise thresholds on the success/failure of the Lasso (i.e., $\ell_1$-constrained quadratic programming) for sparsity recovery.
The remainder of this paper is organized as follows. In Section I-A, we provide a more precise formulation of the problem, and a brief discussion of past work, whereas Section II provides a precise statement of our main results, and a discussion of
their consequences. Section IV provides the proof of the sufficient conditions, based on analyzing the oracle decoder, whereas Section V provides the proof of the necessary conditions. More technical aspects of these proofs are provided in the Appendices.
We conclude in Section VI with a discussion of open directions.
A. Problem Formulation
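We begin with a more precise formulation of the problem, as well as a discussion of previous work, with emphasis on that most closely related to the results in this paper. Let $\beta^* \in \mathbb{R}^p$ be a fixed but unknown vector; we refer to the ambient dimension $p$ as the model dimension. Define the support set of $\beta^*$ as
$$S(\beta^*) := \{\, i \in \{1, \ldots, p\} \mid \beta^*_i \neq 0 \,\}. \tag{1}$$
We refer to its size $k = |S(\beta^*)|$ as the sparsity index. Consider the observation model
$$Y = X\beta^* + W, \tag{2}$$
where $Y \in \mathbb{R}^n$ is a vector of observations, $X \in \mathbb{R}^{n \times p}$ is the measurement matrix, and $W \in \mathbb{R}^n$ is additive observation noise.
We assume throughout the paper that $W \sim N(0, \sigma^2 I_{n \times n})$.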
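As a concrete illustration of the observation model (2), the following Python sketch draws a design matrix with i.i.d. Gaussian rows and forms noisy observations. It assumes NumPy, zero-mean Gaussian noise with standard deviation `sigma`, and an optional row covariance `cov`; the function name `sample_observations` and its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def sample_observations(beta_star, n, sigma, cov=None, seed=0):
    """Draw (X, Y) from Y = X @ beta_star + W (illustrative sketch only).

    Rows of X are i.i.d. N(0, cov); cov=None means the standard Gaussian
    ensemble (identity covariance). W has i.i.d. N(0, sigma^2) entries.
    """
    rng = np.random.default_rng(seed)
    p = beta_star.shape[0]
    if cov is None:
        cov = np.eye(p)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)  # shape (n, p)
    W = sigma * rng.standard_normal(n)                     # shape (n,)
    Y = X @ beta_star + W
    return X, Y

# A 2-sparse vector in dimension p = 6, observed n = 10 times.
beta_star = np.zeros(6)
beta_star[[1, 4]] = [0.5, -2.0]
X, Y = sample_observations(beta_star, n=10, sigma=0.1)
print(X.shape, Y.shape)
```

Passing a non-identity `cov` gives a correlated design of the kind the paper studies alongside the standard Gaussian ensemble.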
1) Error Metrics: Consider some method that generates the vector as an estimate of the truth . There are various distinct criteria for assessing how close the estimate is to the truth, including
• various norms , especially and , or
• some measurement of predictive power (e.g., , where is the estimate based on ).
Given the abundance of recent results on sparse approximation (not all of which are mutually comparable), it is particularly important to specify up front the choice of error metric. In this paper, we focus exclusively on the sparsity recovery problem, for which the appropriate error metric is simply the 0–1 loss associated with the event of recovering the correct support $S$, viz.
(3)
Of interest are conditions on the triplet $(n, p, k)$ as well as properties of the signal vector $\beta^*$ and design matrix $X$ under which exact support recovery is either possible or impossible.
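The support recovery metric can be sketched in a few lines of Python. The convention below (loss 0 on exact recovery, 1 otherwise) and the names `support_of` and `support_recovery_loss` are assumptions made for illustration. The example also shows why a small norm error does not imply correct support recovery.

```python
def support_of(beta, tol=0.0):
    """Indices whose entries exceed tol in absolute value."""
    return {i for i, b in enumerate(beta) if abs(b) > tol}

def support_recovery_loss(beta_hat, beta_star, tol=0.0):
    """0-1 loss: 0 if the estimated support equals the true support, else 1."""
    return 0 if support_of(beta_hat, tol) == support_of(beta_star) else 1

beta_star = [1.0, 0.0, 0.0]
beta_hat = [1.0, 0.0, 1e-3]  # tiny l2 error, but one spurious nonzero entry
print(support_recovery_loss(beta_hat, beta_star))  # prints 1
```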
2) Past and Ongoing Work: A great deal of recent work has studied the behavior of $\ell_1$-relaxations for sparse approximation, including linear programming techniques [7], [12], [5], [14] and $\ell_1$-constrained quadratic programming [7], [13], [29], known as the Lasso in the statistics literature [22], [28], [36]. Some papers in this growing literature have provided conditions under which estimation of a noise-contaminated vector via the Lasso [29], [13] or other types of convex relaxation [6] is guaranteed to be stable in the $\ell_2$ sense; however, it should be noted that such $\ell_2$-stability does not guarantee exact recovery of the underlying support set. Most directly related to this paper are results, applicable to $\ell_1$-constrained quadratic programming or the Lasso, that provide sufficient conditions [22], [36], [31] or necessary conditions [31] on the amount of data required for subset recovery (i.e., with the error metric (3)). These results isolate a mutual incoherence property [17], [29] of the design matrix that must be satisfied for the Lasso to succeed in recovering the support, and the paper [31] provides sharp scalings on $(n, p, k)$ that demarcate the boundary between success and failure. As we discuss in the sequel, our results show that the exhaustive search decoder can recover the support set for design matrices for which the $\ell_1$-based Lasso fails with high probability (see Section III-B), or for sample sizes in which the Lasso fails with high probability (see Section III-A.2).
Some past work on sparse linear regression [1], [27], [16] shares the information-theoretic motivation of this paper, but focuses on the rate–distortion perspective (i.e., under the $\ell_2$-loss), as opposed to the subset recovery metric (3) of interest here.
Since this paper was first posted [30], a number of papers have followed up on the information-theoretic limits of the subset recovery problem. Akçakaya and Tarokh [2] analyzed the performance of a certain type of “joint typicality decoder,” obtaining similar results for the support recovery problem studied here as well as various results for partial support recovery metrics (e.g., metrics in which it is sufficient to recover a large fraction of the support, as opposed to every element of the support).
Their analysis is based on the same framework and type of partitioning scheme, but uses alternative large deviation bounds based on concentration of empirical entropies (joint typicality).
Reeves and Gastpar [26] analyzed the partial support recovery problem in the regime of linear sparsity (i.e., $k = \alpha p$ for some $\alpha \in (0, 1)$), and showed that the signal-to-noise ratio (SNR) must tend to infinity in order for exhaustive search decoders to succeed. In subsequent work, Fletcher et al. [15] used direct methods to show that for any signal with squared minimum value , any decoder applied to measurement matrices drawn from the standard Gaussian ensemble requires
measurements. Concurrent work by Wang et al. [32] used refinements of the Fano approach from the initial posting of this work [30], to establish the same scaling for general i.i.d. measurement matrices. Although these extensions did not appear in the original posting of this work [30], following the reviewers’ suggestion, we have also included in this revised version some consequences of the refined Fano approach [32] for necessary conditions (Theorem 2) as applied to the non-i.i.d. $\Sigma$-Gaussian ensembles considered here.
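Notation: We use the following standard notation for asymptotics of real sequences $a_n$ and $b_n$: (i) $a_n = O(b_n)$ means that $a_n \le c\, b_n$ for some constant $c \in (0, \infty)$; (ii) $a_n = \Omega(b_n)$ means that $a_n \ge c'\, b_n$ for some constant $c' > 0$; (iii) $a_n = \Theta(b_n)$ is shorthand for $a_n = O(b_n)$ and $a_n = \Omega(b_n)$, and (iv) $a_n = o(b_n)$ means that $a_n / b_n \to 0$.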
II. STATEMENT OF MAIN RESULTS
The analysis of this paper applies to the high-dimensional setting, in that all three elements of the triplet $(n, p, k)$ are permitted to tend to infinity. We provide both positive results (that is, scalings of the triplet $(n, p, k)$ and associated signal/measurement parameters such that an exhaustive search decoder can recover the exact support with high probability) and also converse results, meaning scalings for which the probability of successful support recovery remains bounded away from one for any method. Although we allow for completely general scaling of this triplet, our results can also be specialized to two particular cases of sparsity scaling: (a) the linear sparsity regime, e.g., [5], [12], in which $k = \alpha p$ for some $\alpha \in (0, 1)$; or (b) the sublinear sparsity regime, e.g., [22], [36], in which $k/p$ tends to zero. Depending on the underlying motivation for sparse approximation, both of these sparsity regimes are of independent interest. In covering the full range of scaling, the results given here are complementary to those of our previous paper [31] that provided threshold results, also applicable to general scaling of $(n, p, k)$, for the success/failure of the Lasso when used for sparsity recovery with general Gaussian measurement ensembles.
We focus on the linear observation model (2) in the noisy setting , with the measurement matrix $X$ drawn from the $\Sigma$-Gaussian ensemble, meaning that each row of $X$ is drawn i.i.d. as $N(0, \Sigma)$.
Note that setting $\Sigma = I_{p \times p}$ yields as a special case the standard Gaussian ensemble, for which the entries of $X$ are i.i.d. $N(0, 1)$.
In addition to the three parameters $(n, p, k)$, our results also highlight the importance of some other parameters associated with the signal ensemble. In particular, both the sufficient and necessary conditions require control of the minimum value of the unknown vector $\beta^*$ on its support. Consequently, for a given minimum value $\beta_{\min}$, we consider the class of signals
$$\mathcal{C}(\beta_{\min}) := \{\, \beta^* \in \mathbb{R}^p \mid |\beta^*_i| \ge \beta_{\min} \text{ for all } i \in S \,\}, \tag{4}$$
where $S$ is the support of $\beta^*$. Our results show that the SNR parameter , as opposed to the more traditional measure that would arise in assessing error, is the key quantity that controls subset selection. Indeed, we show that the quantity can be arbitrarily large without having any effect on the difficulty of subset recovery.
A. Decoders and Error Probabilities
A decoder is a mapping from the pair to the family of all $k$-sized subsets of . The output corresponds to the decoder’s best estimate of the unknown underlying subset. The underlying true vector is fixed but unknown. We focus on three different types of error, depending on whether (a) the error probability is taken conditionally on a fixed support set , or (b) the error probability is averaged over a support set chosen uniformly at random (u.a.r.) from all possible $k$-sized subsets, or (c) the error probability is worst case over a support set chosen in an adversarial manner. In particular, in the case that has fixed but unknown support , we define the -based error
(5)
Here the probability is taken over the random measurement , or equivalently, over the observation noise and the random design matrix , with the underlying support being . When is viewed as a random variable, chosen u.a.r., we define the average error probability
(6)
Finally, when is chosen in an adversarial manner, we define the worst case error probability
(7)
B. Design Covariance Parameters
Our second set of parameters involves the covariance matrix $\Sigma$ that defines the $\Sigma$-Gaussian ensemble of design matrices, in which each row of the design matrix is drawn i.i.d. from the $p$-dimensional normal distribution $N(0, \Sigma)$. We begin with the key quantities that arise in our sufficient conditions, as stated in Theorem 1. Given a pair with , we define the matrix
(8)
Note that corresponds to the Schur complement [19] of the matrix with respect to the submatrix . With the support set viewed as fixed, we define the quantity
(9)
As will be clarified in our analysis, this quantity controls the relative distinguishability of subsets and under the exhaustive search decoder. For the case of average and worst case error probabilities, we define the uniform bound
(10)
Note that all of these quantities are extremely simple in the case of the standard Gaussian ensemble ; in particular, we have for all pairs of distinct subsets
, and hence for all , and
moreover .
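The Schur complement mentioned above can be computed numerically. The sketch below is an assumption-laden illustration: it takes the matrix in (8) to be the Schur complement of the block indexed by a subset `S` within the covariance restricted to the union of `S` and `T` (equivalently, the conditional covariance of the coordinates in `T` but not in `S`, given those in `S`). Which block is conditioned on, and the function names, are assumptions rather than the paper's definitions.

```python
import numpy as np

def conditional_covariance(cov, S, T):
    """Schur complement cov[D, D] - cov[D, S] inv(cov[S, S]) cov[S, D],
    where D is the set of indices in T but not in S (assumed convention)."""
    S = sorted(S)
    D = sorted(set(T) - set(S))
    A = cov[np.ix_(D, D)]
    B = cov[np.ix_(D, S)]
    C = cov[np.ix_(S, S)]
    # cov is symmetric, so cov[S, D] equals B.T
    return A - B @ np.linalg.solve(C, B.T)

def min_eigenvalue(M):
    """Smallest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(M)[0])

# Standard Gaussian ensemble: identity covariance.
cov = np.eye(6)
M = conditional_covariance(cov, S={0, 1, 2}, T={2, 3, 4})
print(M)
print(min_eigenvalue(M))
```

For the identity covariance the result is again an identity matrix, which matches the remark that these quantities are especially simple for the standard Gaussian ensemble.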
A closely related set of quantities arise in the statement of our necessary conditions on any algorithm, as stated in Theorem 2. In particular, letting denote a subset chosen uniformly at random from , we define
(11)
As our analysis will demonstrate, this quantity measures the difficulty of distinguishing a subset from the collection of subsets that differ from it in only one position. The second quantity that we define plays a role in specifying the bulk effect of all subsets in
(12)
As with the quantities that arise in the statement of Theorem 1, these quantities are especially simple for the case of the standard Gaussian ensemble ; in
particular, we have and . More
generally, the quantity is closely related to the quantity
; in particular, letting denote the submatrix of indexed by , we have the inequality
(13)
Inequality follows from the definition (11), whereas inequality follows because by choosing subsets and such
that , we have
As mentioned above, these inequalities are met with equality for the standard Gaussian ensemble ; in Section III-B, we provide a more general family of matrices for which (in particular, see Example 2).
C. Statement of Sufficient Conditions
We now have the necessary ingredients to state conditions on the triplet $(n, p, k)$, minimum value $\beta_{\min}$, and design condition parameters or that are sufficient to ensure exact support recovery using an optimal decoder (to be specified later). So as to simplify the statements of our results, we define the function
(14)
Here, the quantity will be set to either or , depending on the error probability under discussion.
Theorem 1 (Sufficient Conditions): Given a problem instance from the $\Sigma$-Gaussian linear observation model (2), there exists a decoder with the following characteristics.
(a) For any fixed vector with fixed support , if the sample size satisfies
(15)
for some , then .
(b) For the support set chosen uniformly at random from , if the sample size satisfies
(16)
for some , then .
(c) For the support set chosen adversarially from , if the sample size satisfies
(17)
for some , then .
Remarks: Note that there are only minor differences in the conditions required for the three different types of error probability. The mildest conditions are required for , corresponding to the error probability associated with a fixed subset. It requires only bounds on from (9), that is, only a uniform lower bound on the eigenvalues of the matrices , as defined previously (8). The error probabilities and involve any possible subset, and so require lower bounds on the quantity , which measures eigenvalues uniformly over for all distinct pairs . In addition, the worst case error probability requires an additional term in the definition of , corresponding to the (logarithm of the) number of possible subsets of cardinality .
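To make the notion of an exhaustive search decoder concrete, here is a minimal Python sketch. It assumes the decoder scores every subset of size `k` by its least-squares residual and returns the best one. The decoder itself is only specified later in the paper, so this criterion, the function name, and the toy problem sizes are assumptions made for illustration. The search visits all subsets of size `k`, which is feasible only for very small problems.

```python
import itertools
import numpy as np

def exhaustive_search_decoder(X, Y, k):
    """Return the k-subset of columns whose least-squares fit to Y
    leaves the smallest residual sum of squares (illustrative sketch)."""
    p = X.shape[1]
    best_subset, best_rss = None, np.inf
    for subset in itertools.combinations(range(p), k):
        cols = list(subset)
        coef, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
        rss = float(np.sum((Y - X[:, cols] @ coef) ** 2))
        if rss < best_rss:
            best_subset, best_rss = set(subset), rss
    return best_subset

# Toy instance: p = 8, k = 2, n = 12, low noise (all values hypothetical).
rng = np.random.default_rng(0)
n, p, k = 12, 8, 2
beta_star = np.zeros(p)
beta_star[[1, 4]] = [1.0, -1.5]
X = rng.standard_normal((n, p))
Y = X @ beta_star + 0.01 * rng.standard_normal(n)
print(exhaustive_search_decoder(X, Y, k))
```

The cost grows with the binomial coefficient of `p` and `k`, which is why this decoder serves as an information-theoretic benchmark rather than a practical method.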
D. Statement of Necessary Conditions
Thus far, we have provided sufficient conditions for an exhaustive search decoder to succeed with high probability in recovering the support set. Of equal interest and complementary in nature are necessary conditions that must be satisfied by any method for reliable recovery to be possible. We state a result of this nature in this subsection.
Before proceeding, note that for any fixed subset , it is not possible to provide any type of lower bound on the probability
, since the trivial decoder
for all always achieves perfect recovery in this setting.
Accordingly, it is necessary to lower-bound either the average probability of error (with drawn uniformly at random from ) or the worst case probability of error. The following result provides lower bounds on the sample size for the average error probability. Since the adversarial setting is not any easier, the following theorem also provides lower bounds for the worst case error.
Theorem 2 (Necessary Conditions): Consider the family of problem instances defined by random $\Sigma$-Gaussian designs and the linear observation model (2), with chosen uniformly at random from . If the sample size is upper-bounded as
(18)
then for any decoding algorithm , there exists
a vector such that
The proof of this claim, given in Section V, is somewhat more indirect in nature, based on the Fano method [8], [18], [20], [35], [34] in order to lower-bound the probability of error for
a restricted ensemble, which can be viewed as a certain type of hypothesis testing or channel decoding problem.
III. SOME CONSEQUENCES OF OUR RESULTS
In this section, we explore some consequences of our results. We begin by discussing two regimes in which Theorems 1 and 2 provide a sharp characterization of the sample complexity of the subset recovery problem. By comparison to known threshold results on the Lasso ($\ell_1$-constrained quadratic programming), these results reveal that the Lasso is information-theoretically optimal in some regimes, while dramatically suboptimal in others. We then discuss conditions on the design covariance matrix $\Sigma$, and show with an explicit construction that the exhaustive search decoder can succeed for designs for which the Lasso fails with high probability.
A. Consequences for Different SNR and Sparsity Regimes
We begin by discussing some regimes of SNR and sparsity in which the results of Theorems 1 and 2 provide a sharp characterization of the sample complexity of the subset recovery problem. In order to make explicit comparisons to the Lasso ($\ell_1$-constrained quadratic programming), we begin by stating its sample complexity. For random design matrices drawn from any $\Sigma$-Gaussian ensemble satisfying a certain mutual incoherence condition (see (25) to follow), Wainwright [31] establishes a sharp phase transition for the success/failure of the Lasso. If we assume that and the incoherence parameter remain bounded away from , the Lasso threshold [31] is of the form
(19)
for constants .
1) Regime of Bounded Norm Vectors: We begin by considering the regime of bounded norm vectors (i.e., ), which implies (due to $k$-sparsity of $\beta^*$) that for some constant . In this regime, we have the following corollary of Theorems 1 and 2.
Corollary 1: Consider a signal with . Then the information-theoretic sample complexity of subset selection is given by
More precisely, there exist constants as follows.
(a) For sequences such that
(20) the exhaustive search decoder has error probability
. (b) For sequences such that
(21) any algorithm fails often—that is, .
Remarks: By comparison to the Lasso threshold (19), Corollary 1 implies that for signals with , the sample complexity of the Lasso is equal, up to constants independent of and , to the information-theoretic capacity.
Although Theorems 1 and 2 provide matching scalings for , it should be noted that the conditions do not match for all scalings of the squared minimum value . For instance, if the squared minimum value is constant (i.e., for some constant ), then Theorem 2 implies that samples are needed, whereas Theorem 1 dictates that samples are sufficient. It remains an open question to determine the sharp order of scaling for such regimes.
2) Consequences for Linear Sparsity: We have seen that the Lasso is information-theoretically optimal for certain regimes of the SNR parameter . In contrast, for other regimes of SNR and sparsity, Theorem 1 reveals a dramatic difference between the $\ell_1$-based Lasso and the performance of the optimal decoder.
This difference appears in the regime of linear sparsity, in which $k = \alpha p$ for some $\alpha \in (0, 1)$. This linear sparsity regime is particularly relevant for compressed sensing [5], [11], where the parameter $\alpha$ corresponds to the fraction of nonzero entries in a signal to be reconstructed. First, note that if , then according to the previously stated Lasso threshold (19), there is a constant such that (for any scaling of the squared minimum value ) the Lasso fails unless the sample size satisfies . Hence, the number of samples required by the Lasso grows faster than linearly (i.e., ). In sharp contrast, as long as the squared minimum value does not decrease too quickly (as made precise below), Theorems 1 and 2 imply that the information-theoretic threshold is observations.
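A quick numerical sketch illustrates the superlinear growth. It assumes, following the cited Lasso analysis [31], that the Lasso threshold is of order $k \log(p - k)$. The leading constant is set to one purely for illustration, and the function name `lasso_order` is hypothetical.

```python
import math

def lasso_order(p, alpha):
    """Order of the Lasso sample size, k * log(p - k) with k = alpha * p.
    The leading constant is set to 1 for illustration only."""
    k = alpha * p
    return k * math.log(p - k)

alpha = 0.1
for p in (10**3, 10**4, 10**5, 10**6):
    per_dimension = lasso_order(p, alpha) / p  # equals alpha * log((1 - alpha) * p)
    print(f"p = {p:>8d}   k log(p - k) / p = {per_dimension:.3f}")
```

The printed ratio keeps growing with `p`, whereas a threshold linear in `p` would keep it constant.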
The following corollary makes these observations precise. To simplify the statement, for , define the function
(22) as well as the function
(23)
Here $H(\alpha) := -\alpha \log \alpha - (1 - \alpha) \log(1 - \alpha)$ is the binary entropy function. With this notation, we have the following.
Corollary 2: Consider a signal with linear sparsity (i.e., $k = \alpha p$ for some $\alpha \in (0, 1)$). Suppose that the minimum value for some . Then the information-theoretic threshold for subset recovery is . More precisely:
(a) Given size , the exhaustive search decoder has error probability .
(b) Conversely, if , then the probability of error of any algorithm is at least .
Remark: Note that we have
Consequently, for any fixed SNR constant and design parameter , for sufficiently small fractions , the optimal decoder can recover with . We note that the constant in the definition (22) of the threshold function is far from optimal (see footnote 1), and it certainly could be improved by more careful control of constants in the large deviations analysis.
Proof: Recall the required sample size from (15) of Theorem 1. Under the stated conditions of the corollary, the ratio
is given by
For and , we have . Moreover,
from standard bounds on binomial coefficients (see bound (54) in Appendix C), for , we have
Combining the pieces yields the stated claim in part (a).
Turning to the claim in part (b), from Theorem 2, we know that at least samples are required. Substituting
in and , we obtain
where the final inequality uses the fact that .
In summary, for a squared minimum value scaling as for some constant , Corollary 2 demonstrates that the Lasso is highly suboptimal in the linear sparsity regime . Regardless of the linear fraction and the squared minimum , success of the Lasso for support recovery [31] requires the number of samples to scale so quickly that .
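The binary entropy function and a standard binomial-coefficient bound of the kind used in the proof of Corollary 2 can be checked numerically. The sketch below assumes natural logarithms and uses the well-known inequality $\log \binom{p}{k} \le p\, H(k/p)$. Whether bound (54) of the paper has exactly this form is an assumption, and the function names are hypothetical.

```python
import math

def binary_entropy(a):
    """H(a) = -a log a - (1 - a) log(1 - a), in nats, with H(0) = H(1) = 0."""
    if a <= 0.0 or a >= 1.0:
        return 0.0
    return -a * math.log(a) - (1.0 - a) * math.log(1.0 - a)

def log_binomial(p, k):
    """Natural log of the binomial coefficient C(p, k), via log-gamma."""
    return math.lgamma(p + 1) - math.lgamma(k + 1) - math.lgamma(p - k + 1)

# Linear sparsity with fraction k / p = 0.1.
p, k = 10_000, 1_000
print(log_binomial(p, k) / p, binary_entropy(k / p))
```

The first printed value sits just below the second, showing that the log-cardinality of the family of subsets of size `k` grows linearly in `p` when `k` is a fixed fraction of `p`.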
As pointed out by one of the reviewers, results by Candès and Tao [6] on the Dantzig selector (an $\ell_1$-based relaxation) apply to the linear–linear regime of Corollary 2. In particular, for measurement matrices drawn from the standard Gaussian ensemble, they establish bounds on the mean-square error (MSE) prediction as well as on the error of the Dantzig selector. These results show that for the case of design matrices drawn from the standard Gaussian ensemble (with i.i.d. $N(0, 1)$ entries), a sample size of
(24)
is sufficient to achieve a squared reconstruction error that is bounded. Related results by Meinshausen and Yu [23] and Bickel et al. [3] have similar consequences for the Lasso.
Footnote 1: As pointed out by a reviewer, it requires that 10 for a meaningful result.
In the context of this paper (which focuses exclusively on support recovery), we note that the criterion of support recovery is related to but distinct from the criteria of prediction error , or of reconstruction error . On one hand, given a procedure that correctly recovers the support of the unknown vector , then of course we can simply restrict our problem to the subset , and use standard methods (e.g., ordinary linear regression) to obtain estimates with good MSE prediction or error. In the opposite direction, however, an estimate can be close to but still have a different support than the true vector . Indeed, as discussed above, for standard Gaussian matrices, the sample size (24) guarantees that the Dantzig selector and Lasso achieve errors that are bounded.
As pointed out by one of the reviewers, if the minimum value were also strictly bounded away from zero and if also had entries bounded above by , then an estimate such that would be sufficient to guarantee support recovery. However, in the regimes of practical interest, the minimum value decreases to zero at some rate (e.g., when has constant norm), so that recovery with constant error bounds is not sufficient. Indeed, a consequence of the results of Wainwright [31] is that the Lasso requires samples to perform support recovery, which scales much more rapidly than in the case of linear sparsity. Theorem 2 demonstrates that when , this scaling (as opposed to ) is unavoidable for subset selection.
Moving on to consideration of arbitrary methods, a consequence of Corollary 2 is that no method can recover the support exactly with observations unless the squared minimum value is lower-bounded as . Nonetheless, it is an interesting question to consider the subset selection performance of other computationally efficient methods.
B. Conditions on the Design Covariance
It is worthwhile comparing the conditions on the design matrix imposed by Theorems 1 and 2 to those conditions imposed in past work on $\ell_1$-based methods. One set of conditions, sufficient for guarantees in terms of or prediction error, is based on restricted isometry properties (RIP) [5], [12], requiring that the condition numbers of various submatrices of the matrix are uniformly very close to one. (For instance, among other conditions, RIP requires the bound , for a suitably small .) By known concentration results in random matrix theory [9], such RIP conditions hold with high probability for design matrices whose columns are close to orthogonal (e.g., for drawn from the standard Gaussian ensemble with and ). It should be noted that these RIP conditions, while sufficient, are far from necessary to obtain bounds on or prediction error; we refer the reader to Bickel et al. [3] for a much weaker set of conditions for and prediction error consistency.
By contrast, the focus of this paper is on the problem of exact support recovery, for which a related but distinct set of conditions on the design covariance are known to be necessary and sufficient. First, successful Lasso-based support recovery requires that the minimal eigenvalue stay bounded away from zero, and second (and more significantly), that a certain mutual incoherence parameter stays bounded strictly away from zero, namely
(25)
Whereas the lower bound on is a mild condition, the mutual incoherence condition (25) is more restrictive. It was first defined in the context of sample design matrices independently by Fuchs [17] and Tropp [29], and also imposed in other high-dimensional analysis of the Lasso [22], [36], [31]. Note that the eigenvalue lower bound and mutual incoherence condition (25) are trivially satisfied for random design matrices drawn from the standard Gaussian ensemble .
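A mutual incoherence check can be sketched numerically. The code below assumes the form that is common in the Lasso literature: for a support `S`, compute the largest $\ell_1$ norm of a row of $\Sigma_{S^c S} (\Sigma_{SS})^{-1}$, which is required to stay strictly below one. Whether (25) is stated in exactly this form is an assumption, and the 3-by-3 covariance used to show a violation is a hypothetical example, not the family from the paper.

```python
import numpy as np

def incoherence(cov, S):
    """Largest l1 norm over rows j outside S of cov[j, S] @ inv(cov[S, S])."""
    S = sorted(S)
    Sc = [j for j in range(cov.shape[0]) if j not in S]
    # cov is symmetric, so cov[Sc, S] inv(cov[S, S]) = (inv(cov[S, S]) cov[S, Sc]).T
    M = np.linalg.solve(cov[np.ix_(S, S)], cov[np.ix_(S, Sc)]).T
    return float(np.max(np.sum(np.abs(M), axis=1)))

# Identity covariance: the quantity is zero, so the condition holds trivially.
print(incoherence(np.eye(4), [0, 1]))

# Hypothetical positive definite covariance in which the third coordinate
# is correlated with both support coordinates.
cov = np.array([[1.0, 0.0, 0.6],
                [0.0, 1.0, 0.6],
                [0.6, 0.6, 1.0]])
print(incoherence(cov, [0, 1]))  # exceeds one, so the condition is violated
```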
It is known [36], [31] that if the Lasso is applied to any ensemble of $\Sigma$-Gaussian measurement matrices for which the incoherence assumption (25) is violated and the noise has a symmetric distribution, it will fail with probability at least $1/2$, regardless of the sample size (see Wainwright [31] for a precise statement). Exploiting this fact, the following examples show that there exist covariance matrix families for which the optimal decoder can succeed while the Lasso fails.
Example 1: Consider the $\Sigma$-Gaussian family with covariance matrices of the form
(26)
for some . In particular, for , it can be verified that we have uniformly for all
.
Consider some -sized subset that does not include the index , and let be another -sized subset. Using the
notation , we have
If also does not include the index , then , so that . In the more interesting case that includes the index , a little calculation shows that
where is a vector of all ones. Consequently, for this family
where the first inequality uses the definition , and the second inequality uses the fact that . Consequently, the
optimal decoder succeeds in recovering the support set with observations.
On the other hand, suppose that the given subset has cardinality . By definition of the covariance matrix (26) and the mutual incoherence parameter in (25), we have
showing that with , the mutual incoherence condition (25) is violated. Consequently, for this ensemble and for any subset with elements that excludes the index , the probability of incorrect support recovery using $\ell_1$-constrained quadratic programming is at least one half [31], regardless of the sample size, whereas the optimal decoder will succeed with high probability for sample sizes .
It is also interesting to consider the quantities and that arise in the necessary and sufficient conditions of Theorems 1 and 2. As previously shown (13), the quantity always lower-bounds the quantity . The following example provides a family of matrices for which this lower bound is met with equality, so that the dependence on the design covariance identified by Theorems 1 and 2 is tight.
Example 2: Letting denote the all-ones vector, consider the family of covariance matrices
(27) In this example, we show that for a squared minimum value of
the order and any , Theorems 1 and 2
predict that the critical sample size scales as . To establish this fact, we begin by calculating the matrices
that define the quantity . For any , any , and any pair of -sized subsets with , a little calculation (using the matrix inversion formula [19]) shows that (28) so that for any subset , and hence
. Consequently, the exhaustive search decoder succeeds with high probability (w.h.p.) as long as the sample size satisfies
for some constant .
Let us compare this sufficient condition to the necessary conditions from Theorem 2. For this particular covariance matrix
, a little calculation shows that
and moreover that . Therefore, for this ensemble with , Theorem 2 implies that if
for some constant , then the error probability of any algorithm is at least . Consequently, for this ensemble of matrices parameterized by , Theorems 1 and 2 provide a set of necessary and sufficient conditions that match up to constant factors independent of .
On the other hand, for any -sized subset , the condition number of the submatrix is given by
which tends to infinity for any fixed .
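As a rough numerical illustration of this last point, the sketch below assumes that the family in (27) has the equicorrelated form $(1-\mu) I + \mu \mathbf{1}\mathbf{1}^T$ (the display is not legible in this copy, so this form is an assumption, as are the names `equicorrelated_cov` and `condition_number_closed_form`). Under that assumption, a $k \times k$ principal submatrix has eigenvalues $1-\mu+\mu k$ (once) and $1-\mu$ ($k-1$ times), so its condition number grows linearly in $k$ for fixed $\mu$.

```python
import numpy as np

def equicorrelated_cov(k: int, mu: float) -> np.ndarray:
    """k x k matrix (1 - mu) * I + mu * 1 1^T (assumed form of the submatrix)."""
    return (1.0 - mu) * np.eye(k) + mu * np.ones((k, k))

def condition_number_closed_form(k: int, mu: float) -> float:
    # Eigenvalues: 1 - mu + mu * k (eigenvector of all ones) and 1 - mu (k - 1 times).
    return (1.0 - mu + mu * k) / (1.0 - mu)

mu = 0.5
for k in (2, 8, 32, 128):
    numeric = np.linalg.cond(equicorrelated_cov(k, mu))
    print(k, numeric, condition_number_closed_form(k, mu))
```

The numerical and closed-form values should agree up to floating-point error, and both grow without bound as the subset size increases.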
Finally, we provide an example to illustrate that an upper bound on the maximum eigenvalue is not required for $\ell_1$-based support recovery.
Example 3: For a given $k$-sized subset , consider the covariance matrix given by
if if
otherwise.
(29)
For this matrix, a simple calculation shows that the maximum
eigenvalue , which tends to infinity
for any fixed as . On the other hand, since for all outside of the subset , the mutual incoherence condition (25) is satisfied with . Moreover, a little
calculation shows that . Therefore, the
Lasso will succeed using samples.
IV. PROOF OF THEOREM 1
This section is devoted to the proof of Theorem 1. We begin by setting up some useful notation to be used throughout the remainder of the analysis. Given any subset $U \subseteq \{1, \ldots, p\}$, we use the notation $\beta_U$ to denote the $|U|$-dimensional subvector $(\beta_i, i \in U)$, and similarly for other vectors. In an analogous manner, we use $X_U$ to denote the $n \times |U|$ matrix with columns $\{X_i, i \in U\}$. We use $A^T$ to denote the transpose of a matrix $A$.
A. Exhaustive Search Decoder
Our route to establishing the sufficient conditions in Theorem 1 is by direct analysis of the decoder that searches exhaustively over all $\binom{p}{k}$ possible subsets of size $k$. More specifically, the search decoder obtains its estimate $\hat{S}$ by the following two-step procedure.
(a) For each subset $U \subset \{1, \ldots, p\}$ of size $k$, solve the quadratic program
$f(U) := \min_{\beta_U \in \mathbb{R}^k} \| Y - X_U \beta_U \|_2^2$. (30)
(b) Return the subset $\hat{S} = \arg\min_{|U| = k} f(U)$.
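The two-step procedure above can be sketched in a few lines. This is a minimal illustration rather than the paper's own implementation: the function name `exhaustive_search_decoder`, the problem sizes, the noise level, and the planted support `{1, 5}` are all hypothetical choices, and the search is only feasible for very small problems since it visits every subset of the given size.

```python
from itertools import combinations
import numpy as np

def exhaustive_search_decoder(X: np.ndarray, y: np.ndarray, k: int) -> tuple:
    """Return the size-k column subset with the smallest least-squares residual."""
    best_subset, best_residual = None, np.inf
    for U in combinations(range(X.shape[1]), k):
        cols = list(U)
        # Step (a): least-squares fit restricted to the columns in U.
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        residual = float(np.sum((y - X[:, cols] @ coef) ** 2))
        # Step (b): keep the subset with the smallest residual so far.
        if residual < best_residual:
            best_subset, best_residual = U, residual
    return best_subset

rng = np.random.default_rng(0)
n, p, k = 30, 8, 2                      # hypothetical toy sizes
X = rng.standard_normal((n, p))         # standard Gaussian design
beta = np.zeros(p)
beta[[1, 5]] = [2.0, -3.0]              # planted support {1, 5}
y = X @ beta + 0.1 * rng.standard_normal(n)
estimate = exhaustive_search_decoder(X, y, k)
print(estimate)
```

With a signal this strong relative to the noise, the decoder is expected to return the planted support; the cost grows as the number of size-$k$ subsets, which is what makes the method a theoretical benchmark rather than a practical algorithm.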
Of interest to us are various error probabilities associated with this procedure. We begin by bounding the -based error
with the probability taken over and the noise vector , when the underlying support is fixed to . Using this result, we then bound the average error probabilities and the worst case error probabilities , as defined in (6) and (7), respectively.
With and random matrices drawn from a Gaussian ensemble with nondegenerate covariance , each of the submatrices $X_U$ has rank $k$ with probability one.² Accordingly, we may define the matrices
$\Pi_U := X_U (X_U^T X_U)^{-1} X_U^T$ and (31a)
$\Pi_U^\perp := I_{n \times n} - \Pi_U$. (31b)
Note that $\Pi_U$ and $\Pi_U^\perp$ are both orthogonal projection matrices, associated with the $k$-dimensional range space of $X_U$ and the $(n-k)$-dimensional nullspace of $X_U^T$, respectively. For any pair of subsets and , each with $k$ elements, define the random variable
(32)
With these definitions, we state the following result.
Lemma 1: For any given vector with support , the exhaustive search decoder prefers to the true underlying if and only if .
Proof: We begin by showing that for any subset for which is full rank, the quantity defined in (30) is equal to . Under the given rank condition, the linear least squares estimator of is given by
. Noting that by the definition (31a) of , we have we substitute into the quadratic norm and expand, thereby obtaining
Failure occurs if and only if , as
claimed.
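The first step of this proof, namely that the optimal value of the least-squares problem (30) equals the squared norm of the observation vector projected onto the orthogonal complement of the column space, can be checked numerically. The sketch below uses a hypothetical random submatrix `X_U` and observation `y`; it is only a sanity check of the identity, not part of the argument.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 20, 3                             # hypothetical toy sizes
X_U = rng.standard_normal((n, k))        # full column rank with probability one
y = rng.standard_normal(n)

# Orthogonal projection onto range(X_U) and onto its orthogonal complement.
proj = X_U @ np.linalg.solve(X_U.T @ X_U, X_U.T)
proj_perp = np.eye(n) - proj

# Optimal value of the least-squares problem over coefficients supported on U.
coef, *_ = np.linalg.lstsq(X_U, y, rcond=None)
ls_residual = float(np.sum((y - X_U @ coef) ** 2))

# Squared norm of y after projecting out the column space of X_U.
projected_norm = float(np.sum((proj_perp @ y) ** 2))
print(ls_residual, projected_norm)
```

The two printed numbers should coincide up to rounding, and `proj` should be idempotent with trace equal to the number of columns.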
Overall, the search decoder fails if and only if at least one (with cardinality ) is preferable to ; consequently, the overall probability of error can be written as
(33)
Consequently, assuming that has support , the technical result central to analyzing the error probability (33) is tight control on the probabilities of the events , for all $k$-sized subsets .
B. Large Deviations Bound
Before stating a large deviations bound, we require some notation. Recalling the definition (8) of , we use to denote its symmetric matrix square root. For each , define the quantity
(34)
² That is, the probability that $k$ Gaussian random vectors in $\mathbb{R}^n$ are linearly dependent is equal to zero, which follows since the Gaussian has a density with respect to Lebesgue measure.
representing a type of SNR, reflecting how distinguishable subset is from . With this notation, we have the following.
Lemma 2: As long as , for any pair of distinct subsets and , we have
(35)
The proof of Lemma 2 is somewhat technical in nature. However, the high-level strategy is straightforward: given some
, we define the events
and (36a) (36b)
We now observe that for any , the event implies that at least one of the events or holds. Indeed, supposing that neither nor is true, then the quantity
is lower-bounded by .
Consequently, by union bound, it suffices to control the two probabilities and . This argument applies for any choice of ; a convenient choice turns out to be
.
With this setup, the proof of Lemma 2 is a consequence of the following two results, proved in Appendices A and B, respectively.
Lemma 3: For all
(37)
Moreover, the choice is valid as long as .
Lemma 4: With , we have
(38)
for any pair of distinct subsets and .
Combining these two results yields the claim of Lemma 2.
C. Analysis of Error Probability
Using Lemma 2, we are now equipped to complete the proof of Theorem 1. Denote by the number of subsets with
cardinality , such that . (Moreover, note that since both and have cardinality , we have as well.) A counting argument yields that, for each with , there are
(39) such subsets.
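The counting argument can be verified by brute force on a small instance. The sketch below assumes the count in (39) is the product of two binomial coefficients, choosing which $t$ elements of the true support are dropped and which $t$ outside elements replace them; the function name `count_subsets_by_difference` and the toy sizes are hypothetical.

```python
from itertools import combinations
from math import comb

def count_subsets_by_difference(p: int, k: int) -> dict:
    """Brute-force count of size-k subsets U, grouped by t = |S minus U|, with S = {0, ..., k-1}."""
    S = set(range(k))
    counts = {}
    for U in combinations(range(p), k):
        t = len(S - set(U))
        counts[t] = counts.get(t, 0) + 1
    return counts

p, k = 9, 4                              # hypothetical toy sizes
counts = count_subsets_by_difference(p, k)
# Closed form: drop t of the k true indices, add t of the p - k remaining ones.
closed_form = {t: comb(k, t) * comb(p - k, t) for t in range(min(k, p - k) + 1)}
print(counts)
print(closed_form)
```

The two dictionaries should agree entry by entry, and their values sum to the total number of size-$k$ subsets.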
In order to simplify the statement of the result, we begin by deriving a weaker form of the large deviations bounds from Lemma 2, albeit one that leads to simpler expressions. Observe that the function is increasing on the interval . Consider a pair of subsets and with overlap of size
. Using the definition (9) of , we have
Consequently, for any pair with , the bound (35) implies that
(40)
Combining this upper bound with the union bound applied to the expression (33) yields that is upper-bounded by
which is further upper bounded by
Consequently, in order for the error probability to vanish asymptotically, it suffices to take such that is greater than
Let us upper-bound this quantity (denote it ). By our assumption that , we have . Moreover, we have
Overall, we conclude that
(41)
Given that the conditions of Theorem 1 certainly imply that , it suffices to restrict attention to the term involving —namely, to upper-bound the quantity
(42)
Using standard bounds on binomial coefficients (see Appendix C), we have
Returning to the upper bound (Section IV-C), we conclude that for a sample size satisfying
(43) for some , the error probability associated with detecting support set decays exponentially as
thereby establishing the claim of Theorem 1(a).
Turning to Theorem 1(b), if we replace the quantity in
the lower bound (43) by , then we
can conclude that the average probability of error
also vanishes exponentially fast.
Finally, turning to Theorem 1(c), let us consider the case of the worst case error probability taken over all subsets . In this case, in addition to using the worst case measure , we also need the probability of error for any given subset to con- verge to zero sufficiently fast. In particular, by union bound, we have
Consequently, if for some , we choose a sample size such that
then we have .
V. PROOF OF THEOREM 2
In this section, we prove the necessary conditions stated in Theorem 2. Our method involves two restricted versions of the subset recovery problem, for which the analysis can be reduced to a type of channel decoding problem. We then apply a variant
of Fano’s bound [8] to analyze the error probability over these restricted ensembles. We note the Fano method is a standard approach for obtaining minimax lower bounds in nonparametric statistical problems [18], [20], [35], [34].
A. Basic Setup
We begin by describing the basic setup for the proof of Theorem 2. Let denote a particular subset of the set of all $k$-sized subsets, and let be a set-valued random variable, uniformly distributed over ; that is,
for all . Suppose that the decoder is told that the selected subset is a member of , and moreover it is provided with the form of the vectors for all . Its goal is to use the pair to recover the unknown subset . Note that the two forms of side information (namely, that , and the form of for any fixed ) cannot harm the decoder’s performance, since the decoder can always choose to ignore this information. Consequently, the error probability of the decoder for the original problem is lower-bounded by the error probability , where is uniform over . This is a multi-way hypothesis testing problem, and we may lower-bound the probability of error of any decoder using Fano’s inequality [8].
We lower-bound this error probability as follows: first, for any fixed and for any decoder
(44) where is the mutual information between and
conditioned on ; explicitly, it is given by
. Taking expectations of both sides (44), we conclude that
(45)
Consequently, in order to make effective use of the Fano lower bound (44) or its averaged form (45), we need to construct ensembles for which is relatively large while the mutual information is relatively small. In our analysis, we make use of the upper bound on the mutual information
(46)
which follows from the convexity of mutual information [8].
Here, the quantity
(47) is the Kullback–Leibler divergence between the distributions
and .
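To make the role of the two competing quantities concrete, the following sketch evaluates the standard (weakened) form of Fano's inequality from [8], in which the error probability under a uniform prior over $M$ hypotheses is at least $1 - (I + \log 2)/\log M$, with $I$ the mutual information in nats. The display (44) is not legible in this copy, so treating it as this standard form is an assumption, and the function name `fano_lower_bound` is hypothetical.

```python
import math

def fano_lower_bound(mutual_information: float, num_hypotheses: int) -> float:
    """Fano lower bound on error probability, uniform prior over M hypotheses (nats)."""
    if num_hypotheses < 2:
        raise ValueError("need at least two hypotheses")
    bound = 1.0 - (mutual_information + math.log(2)) / math.log(num_hypotheses)
    # The bound is vacuous (non-positive) once the information is large enough.
    return max(bound, 0.0)

# More hypotheses, or less information, push the bound toward one.
for m in (4, 64, 1024):
    print(m, fano_lower_bound(1.0, m))
```

The bound is informative only when the mutual information is small relative to the logarithm of the number of hypotheses, which is why the ensembles are chosen to be large while keeping the pairwise Kullback–Leibler divergences small.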
B. Bound for Bulk Ensemble
The bulk ensemble is defined by the choice , and then setting
(48)
for each . A straightforward computation yields that
so that we have
(49)
where
and
Note that each is a random variable (as a function of the random design matrix ); a little calculation shows that
and, moreover, that
where the final inequality follows by our choice (48) of , and the definition (12) of . Consequently, using the bound (49),
we have , and hence, using
the bound (45)
Consequently, if the sample size is upper bounded as
then as claimed.
C. Bound for Nearby Subsets
We now describe bounds based on a second ensemble, introduced by Wang et al. [32]. For any subset , let be an index that achieves the minimum in the definition (11) of the function . We then let the ensemble consist of all subsets that contain all indices in the set , and then one more index chosen from the set . Observe that the resulting family of subsets has cardinality , with the property that for each distinct pair of subsets, the Hamming distance is equal to two.
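The structure of this ensemble can be illustrated by direct enumeration. The sketch below adopts one reading of the construction, in which $k-1$ indices are held fixed and the remaining index ranges over everything outside that fixed set; the set symbols are missing from this copy, so this reading, the name `nearby_ensemble`, and the toy sizes are assumptions. Under this reading the family has $p-k+1$ members, and any two members differ in exactly two positions.

```python
from itertools import combinations

def nearby_ensemble(p: int, base: frozenset) -> list:
    """All subsets formed by adding one index outside `base`, where |base| = k - 1."""
    return [base | {j} for j in range(p) if j not in base]

p, k = 10, 4                             # hypothetical toy sizes
base = frozenset({0, 1, 2})              # hypothetical fixed k - 1 indices
ensemble = nearby_ensemble(p, base)

# Hamming distance between two subsets = size of their symmetric difference.
distances = {len(U ^ V) for U, V in combinations(ensemble, 2)}
print(len(ensemble), distances)
```

Because every pair of hypotheses is at the minimum possible distance, this ensemble isolates the difficulty of distinguishing subsets that differ in a single element.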
For each subset , we define its signal vector as follows:
if if
where achieves the minimum in (11). Note that we have by construction.
Now consider a pair of distinct subsets , say with and . With the choices given above, a little calculation shows that
so that we have
Taking expectations and using the definition (11) of the function , we obtain that is equal to
Expanding the expectation yields that is equal to
which is upper-bounded by , using the definition (11). Consequently, by the Fano bound, we obtain
Consequently, if , then the probability of error remains bounded below by , as claimed.
VI. CONCLUSION
In this paper, we have analyzed the information-theoretic limits of the sparsity recovery problem for the linear observation model (2) with measurement vectors drawn from $\Sigma$-Gaussian ensembles, including the standard Gaussian one as a special case. We have established both lower and upper bounds on the number of observations $n$ as a function of the model dimension $p$, signal sparsity $k$, squared minimum value , and noise variance as well as other parameters of the design covariance that are required for asymptotically reliable recovery. In conjunction with previous work [31] on the limits of the Lasso ($\ell_1$-constrained quadratic programming), this analysis has some consequences.
(a) For signals of bounded norm, the Lasso achieves the information-theoretically optimal order of scaling as a function of and (see Corollary 1), whereas (b) for signals with linear sparsity and squared minimum value , the Lasso is suboptimal (see Corollary 2).
There are a variety of open directions suggested by our analysis. First, while our upper and lower bounds are essentially matching for certain regimes of scaling, it is likely that the analysis can be tightened in other regimes. In particular, the necessary conditions stated in Theorem 2 certainly involve some slack, since they are obtained by analyzing restricted ensembles in which the value of on the subset is known a priori to the decoder. It would be interesting to see if sharper results could be obtained via analysis of less restrictive ensembles.
Second, our work has revealed the suboptimality of current practical methods in the linear sparsity regime ( for some ) for sufficiently high SNR (in particular, ). It is possible that multistage methods (e.g., [33], [23]) could be helpful in closing these gaps. Third, our results highlight various differences between the conditions on the design covariance matrix (from which the random measurement matrices are generated) required by $\ell_1$-based methods such as the Lasso, as contrasted with exponential-complexity methods. It would be interesting to see to what extent the mutual incoherence conditions that affect standard methods can be relaxed; see Meinshausen and Yu [23] for some progress in this direction.
APPENDIX
A. Proof of Lemma 3
Using the linear observation model (2), we note that , so that for any , we can write
(50)
where we have adopted the shorthand notation . The following lemma characterizes the distribution of the random variable to be bounded.
Lemma 5: For any two -sized subsets and with overlap , we have
where and are chi-squared variables with degrees of freedom.
Proof: Note that by the Pythagorean Theorem for projections, we have
Again using the Pythagorean Theorem, we can write
where we have used the facts that
. Since there is an analogous decomposition for , we can write
Now the matrix is a projection matrix with rank
equal to , and similarly for the
matrix . Consequently, is dis-
tributed as , and similarly for the second term.
Using this lemma and the decomposition (50), we may write
By triangle inequality, we have ,
so that by union bound
where . As long as , we may apply the
chi-squared tail bound (56) to conclude that
as claimed.
To establish the validity of the choice , we note that
Consequently, we have
so that it suffices to have , as claimed.
B. Proof of Lemma 4
We begin by conditioning on and , and showing that the random variable follows a noncentral chi-squared distribution. By conditioning on , we can decompose into a linear prediction based on and a zero-mean error term.
In particular, we have
where is a Gaussian random matrix independent of , with i.i.d. rows drawn from the zero-mean Gaussian
distribution with covariance matrix from (8). Using this decomposition, we have
since the orthogonal projection annihilates any terms in the column space .
Let us diagonalize the orthogonal projection matrix , writing it as where is diagonal with
ones, and zeros, and is a unitary matrix. With this transformation, we have
since is unitary. The random vector has i.i.d.
Gaussian entries with variance , so
that multiplication by leaves its distribution unchanged. We conclude that conditioned on and , the rescaled variable
Since the rescaled vector has i.i.d. entries, has a noncentral chi-squared distribution with degrees of freedom, and noncentrality parameter
Now the event can be expressed in terms of and as , where we have introduced the
convenient shorthand . For the choice ,
we have . Consequently, setting
in (58b), we obtain , with
(51)
Finally, let us define the event .
For each fixed , the variable is a (central) chi-squared variate with degrees of freedom, so that by
the tail bound (57), we have .
Putting together the pieces, we have
Since the event is a function only of and , our earlier tail bound (51) may be applied. Moreover, conditioned on
, we have , so that we obtain
so that we conclude that
(52)
as claimed.
C. Bounds on Binomial Coefficients
We make use of the following crude bounds on the binomial coefficients:
(53) In addition, the following bound is also standard [8]:
(54)
where is the binary entropy
function.
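Since the displays (53) and (54) are not legible in this copy, the sketch below checks the usual textbook forms of such bounds, which may differ in detail from the ones the paper states: $(n/k)^k \le \binom{n}{k} \le (ne/k)^k$, and $\binom{n}{k} \le \exp(n H(k/n))$ with the binary entropy $H$ measured in nats. The function names `binary_entropy` and `check_binomial_bounds` are hypothetical.

```python
from math import comb, e, log

def binary_entropy(q: float) -> float:
    """Binary entropy in nats, with H(0) = H(1) = 0."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * log(q) - (1.0 - q) * log(1.0 - q)

def check_binomial_bounds(n: int, k: int, tol: float = 1e-9) -> bool:
    """Check the assumed lower and upper bounds on log C(n, k) for 1 <= k < n."""
    log_binom = log(comb(n, k))
    lower = k * log(n / k)                       # (n/k)^k <= C(n, k)
    upper_crude = k * log(n * e / k)             # C(n, k) <= (n e / k)^k
    upper_entropy = n * binary_entropy(k / n)    # C(n, k) <= exp(n H(k/n))
    return lower <= log_binom + tol and log_binom <= min(upper_crude, upper_entropy) + tol

print(all(check_binomial_bounds(n, k) for n in range(2, 60) for k in range(1, n)))
```

Working with logarithms avoids overflow and matches how such bounds enter sample-size conditions, where only the order of the log-binomial term matters.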
D. Tail Bounds for $\chi^2$-Variates
The following large deviations bounds for centralized $\chi^2$ are taken from Laurent and Massart [21]. Given a centralized $\chi^2$-variate with degrees of freedom, then for all
and (55a) (55b) The following consequences of these bounds are useful in our analysis. First, for , we have
(56)
Starting with the bound (55a), setting yields
, Since for , we
have for all . On the other
hand, for all , we have , so that the
claim (56) follows. Secondly, we have
(57) which follows by setting in (55a).
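As a sanity check on bounds of this type, the sketch below compares empirical tail frequencies with the Laurent and Massart inequalities as they are commonly stated: for a central chi-squared variate $X$ with $d$ degrees of freedom and any $x \ge 0$, both $P[X - d \ge 2\sqrt{dx} + 2x]$ and $P[d - X \ge 2\sqrt{dx}]$ are at most $e^{-x}$. The displays (55a) and (55b) are not legible in this copy, so whether they use exactly this parameterization is an assumption; the function name `laurent_massart_check` and the sample sizes are hypothetical.

```python
import numpy as np

def laurent_massart_check(d: int, x: float, num_samples: int = 200_000, seed: int = 0):
    """Empirical tail frequencies of a chi-squared(d) variate versus the exp(-x) bound."""
    rng = np.random.default_rng(seed)
    samples = rng.chisquare(d, size=num_samples)
    # Upper deviation: X exceeds its mean d by at least 2 sqrt(d x) + 2 x.
    upper_tail = np.mean(samples - d >= 2.0 * np.sqrt(d * x) + 2.0 * x)
    # Lower deviation: X falls below its mean d by at least 2 sqrt(d x).
    lower_tail = np.mean(d - samples >= 2.0 * np.sqrt(d * x))
    return float(upper_tail), float(lower_tail), float(np.exp(-x))

for d, x in ((10, 1.0), (50, 2.0), (200, 3.0)):
    print(d, x, laurent_massart_check(d, x))
```

In each case the two empirical frequencies should sit below the exponential bound, typically with considerable room, since the inequalities hold for all degrees of freedom and are not tight for any particular one.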
More generally, the analogous tail bounds for noncentral $\chi^2$, taken from Birgé [4], can be established via the Chernoff technique, and careful bounding of the moment generating function.
Let be a noncentral $\chi^2$ variable with degrees of freedom and noncentrality parameter . Then for all
(58a) (58b)
ACKNOWLEDGMENT
The author wishes to thank Peter Bickel for helpful discussions and pointers, and the anonymous reviewers for careful reading and helpful comments and suggestions that improved the presentation.
REFERENCES
[1] S. Aeron, M. Zhao, and S. Venkatesh, “Information-theoretic bounds to sensing capacity of sensor networks under fixed SNR,” in Proc. IEEE Information Theory Workshop, San Diego, CA, Sep. 2007.
[2] M. Akcakaya and V. Tarokh, “Shannon Theoretic Limits on Noisy Compressive Sampling,” Harvard Univ., Cambridge, MA, Tech. Rep., Nov. 2007 [Online]. Available: arXiv:cs.IT:0711.0366
[3] P. Bickel, Y. Ritov, and A. Tsybakov, “Simultaneous analysis of Lasso and Dantzig selector,” Ann. Statist., to be published.
[4] L. Birgé, “An alternative point of view on Lepski’s method,” in State of the Art in Probability and Statistics, ser. IMS Lecture Notes, no. 37. Beachwood, OH: Inst. Math. Statist., 2001, pp. 113–133.
[5] E. Candès and T. Tao, “Decoding by linear programming,” IEEE Trans. Inf. Theory, vol. 51, no. 12, pp. 4203–4215, Dec. 2005.
[6] E. Candès and T. Tao, “The Dantzig selector: Statistical estimation when p is much larger than n,” Ann. Statist., vol. 35, no. 6, pp. 2313–2351, 2007.
[7] S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM J. Sci. Comput., vol. 20, no. 1, pp. 33–61, 1998.
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[9] K. R. Davidson and S. J. Szarek, “Local operator theory, random matrices, and Banach spaces,” in Handbook of Banach Spaces. Amsterdam, The Netherlands: Elsevier, 2001, vol. 1, pp. 317–336.
[10] R. A. DeVore and G. G. Lorentz, Constructive Approximation. New York: Springer-Verlag, 1993.
[11] D. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006.
[12] D. Donoho, “For most large underdetermined systems of linear equations, the minimal $\ell_1$-norm solution is also the sparsest solution,” Commun. Pure and Appl. Math., vol. 59, no. 6, pp. 797–829, Jun. 2006.
[13] D. L. Donoho, “For most large underdetermined systems of linear equations, the minimal $\ell_1$-norm near-solution approximates the sparsest near-solution,” Commun. Pure and Appl. Math., vol. 59, no. 7, pp. 907–934, Jul. 2006.
[14] D. L. Donoho and J. M. Tanner, “Counting faces of randomly-projected polytopes when the projection radically lowers dimension,” J. Amer. Math. Soc., vol. 22, pp. 1–53, Jul. 2009.
[15] A. K. Fletcher, S. Rangan, and V. K. Goyal, “Necessary and Sufficient Conditions on Sparsity Pattern Recovery,” Univ. Calif., Berkeley, Tech. Rep., Apr. 2008 [Online]. Available: arXiv:cs.IT/0804.1839
[16] A. K. Fletcher, S. Rangan, V. K. Goyal, and K. Ramchandran, “Denoising by sparse approximation: Error bounds based on rate-distortion theory,” J. Appl. Signal Process., vol. 10, pp. 1–19, 2006.
[17] J. J. Fuchs, “Recovery of exact sparse representations in the presence of noise,” in Proc. Int. Conf. Acoustics, Speech and Signal Processing, Montreal, QC, Canada, 2004, vol. 2, pp. 533–536.
[18] R. Z. Has’minskii, “A lower bound on the risks of nonparametric estimates of densities in the uniform metric,” Theory Prob. Appl., vol. 23, pp. 794–798, 1978.
[19] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge, U.K.: Cambridge Univ. Press, 1985.
[20] I. A. Ibragimov and R. Z. Has’minskii, Statistical Estimation: Asymptotic Theory. New York: Springer-Verlag, 1981.
[21] B. Laurent and P. Massart, “Adaptive estimation of a quadratic functional by model selection,” Ann. Statist., vol. 28, no. 5, pp. 1303–1338, 1998.
[22] N. Meinshausen and P. Bühlmann, “High-dimensional graphs and variable selection with the Lasso,” Ann. Statist., vol. 34, pp. 1436–1462, 2006.
[23] N. Meinshausen and B. Yu, “Lasso-type recovery of sparse representations for high-dimensional data,” Ann. Statist., to be published.
[24] A. J. Miller, Subset Selection in Regression. New York: Chapman & Hall, 1990.
[25] B. K. Natarajan, “Sparse approximate solutions to linear systems,” SIAM J. Comput., vol. 24, no. 2, pp. 227–234, 1995.
[26] G. Reeves and M. Gastpar, “Sampling bounds for sparse support recovery in the presence of noise,” in Proc. IEEE Int. Symp. Information Theory, Toronto, ON, Canada, Jul. 2008, pp. 2187–2191.
[27] S. Sarvotham, D. Baron, and R. G. Baraniuk, “Measurements versus bits: Compressed sensing meets information theory,” in Proc. Allerton Conf. Communication, Control, and Computing, Monticello, IL, Sep. 2006.
[28] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Roy. Statist. Soc., ser. B, vol. 58, no. 1, pp. 267–288, 1996.
[29] J. Tropp, “Just relax: Convex programming methods for identifying sparse signals in noise,” IEEE Trans. Inf. Theory, vol. 52, no. 3, pp. 1030–1051, Mar. 2006.
[30] M. J. Wainwright, “Information-Theoretic Bounds for Sparsity Recovery in the High-Dimensional and Noisy Setting,” Dep. Statist., Univ. Calif., Berkeley, Tech. Rep. 725, Jan. 2007 [Online]. Available: arxiv:math.ST/0702301, presented at the IEEE Int. Symp. Information Theory, Nice, France, June 2007
[31] M. J. Wainwright, “Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (Lasso),” IEEE Trans. Inf. Theory, vol. 55, no. 5, pp. 2183–2202, May 2009.
[32] W. Wang, M. J. Wainwright, and K. Ramchandran, “Information-Theoretic Limits on Sparse Signal Recovery: Dense Versus Sparse Measurement Matrices,” Univ. Calif., Berkeley, Tech. Rep., June 2008 [Online]. Available: arXiv:0806.0604, presented at the IEEE Int. Symp. Information Theory, Toronto, ON, Canada, Jul. 2008
[33] L. Wasserman and K. Roeder, “Multi-stage variable selection: Screen and clean,” Ann. Statist., to be published.
[34] Y. Yang and A. Barron, “Information-theoretic determination of minimax rates of convergence,” Ann. Statist., vol. 27, no. 5, pp. 1564–1599, 1999.
[35] B. Yu, “Assouad, Fano, and Le Cam,” in Festschrift for Lucien Le Cam. Berlin, Germany: Springer-Verlag, 1997, pp. 423–435.
[36] P. Zhao and B. Yu, “On model selection consistency of Lasso,” J. Mach. Learn. Res., vol. 7, pp. 2541–2567, 2006.
Martin Wainwright (M’03) received the Ph.D. degree in electrical engineering and computer science from the Massachusetts Institute of Technology (MIT), Cambridge.
He is currently an Associate Professor at the University of California at Berkeley, with a joint appointment between the Department of Statistics and the Department of Electrical Engineering and Computer Sciences. His research interests include statistical signal processing, coding and information theory, statistical machine learning, and high-dimensional statistics.
Prof. Wainwright has been awarded an IEEE Signal Processing Society Best Paper Award, an Alfred P. Sloan Foundation Fellowship, an NSF CAREER Award, the George M. Sprowls Prize for his dissertation research, a Natural Sciences and Engineering Research Council of Canada 1967 Fellowship, and several outstanding conference paper awards.