
Density estimation by total variation penalized likelihood driven by the sparsity $\ell_1$ information criterion

By SYLVAIN SARDY
Université de Genève and École Polytechnique Fédérale de Lausanne

and PAUL TSENG
University of Washington, U.S.A.

We propose a density estimator based on penalized likelihood and total variation. Driven by a single smoothing parameter, the nonlinear estimator has the properties of being locally adaptive and positive everywhere without a log- or root-transform. For the fast selection of the smoothing parameter we employ the sparsity $\ell_1$ information criterion. Furthermore the estimated density has the advantage of being the solution to a convex programming problem; we solve it by an easy-to-implement relaxation algorithm of which we prove convergence. Finally, we compare the finite sample performance of our estimator to existing ones on a Monte Carlo simulation and on two real data sets. The new estimator is particularly efficient at estimating densities with sharp features from large samples, and is stable to the presence of ties (due to rounding or bootstrapping).

Key Words: Convex programming; Penalized likelihood; Sparsity $\ell_1$ information criterion; Total variation; Universal prior.

Address for correspondence: Faculté des sciences de base, Institut de Mathématiques, Station 8, Swiss Federal Institute of Technology, 1015 Lausanne, Switzerland. Email: Sylvain.Sardy@epfl.ch


1 Introduction

An old problem in statistics (Silverman, 1986; Scott, 1992; Simonoff, 1996) is the estimation of a density function $f$ from a sample of observations $x_1, \ldots, x_M$. Ties might be present due to rounding or bootstrapping: let $x_1, \ldots, x_N$ be the sample with no ties, and let $n_1, \ldots, n_N$ be the number of ties at the order statistics $x_{(1)}, \ldots, x_{(N)}$. Nonparametric methods reduce the possible modeling biases in situations where a parametric model is difficult to guess. Writing the nonparametric log-likelihood function
$$ l(f; x_1, \ldots, x_M) = \sum_{i=1}^{N} n_i \log f(x_i), \qquad (1) $$
and maximizing it over all densities $f$ with the constraint of integrating to unity leads to the degenerate nonparametric maximum likelihood estimate $\hat f(x) = \frac{1}{M}\sum_{i=1}^{N} n_i \delta_{x_i}(x)$, where $\delta_{x_i}(\cdot)$ is the Dirac measure at $x_i$. It is degenerate in that its total variation is unbounded.

Many nonparametric estimators are based on regularization (Tikhonov, 1963), by adding a penalty or, equivalently, a constraint to (1), to obtain a smoother and more useful estimate. For instance, the histogram estimator is defined as the unique maximum likelihood estimate constrained to be piecewise constant on bins. In a pioneering paper, Good and Gaskins (1971) add a roughness penalty functional $\Phi(f)$ to (1) to define the functional nonparametric penalized likelihood estimate $\hat f_\lambda$ as the solution to
$$ \max_{f}\; l(f; x_1, \ldots, x_M) - \lambda\Phi(f) \quad \text{s.t.} \quad \int f(x)\, dx = 1, \qquad (2) $$
("s.t." is short for "subject to"), where $\lambda > 0$ is the smoothing parameter: the estimate tends to the degenerate nonparametric estimate when $\lambda \to 0$, and to the parametric maximum likelihood estimate in the kernel of the penalty $\Phi$ (i.e., the subspace of densities $f$ such that $\Phi(f)$ is minimal) when $\lambda \to \infty$. Various penalty functionals, including $\Phi(f) = 4\int \{\nabla\sqrt{f}\}^2$ (Good and Gaskins, 1971) and $\Phi(f) = \int \{\nabla^3 \log f\}^2$ (Silverman, 1982), have been proposed. They have three important properties. First, the existing penalties intrinsically assume $f$ is differentiable everywhere or that the points of non-differentiability can be ignored, a mathematical assumption which causes practical difficulties when the density has nondifferentiable features, for instance jumps or peaks. Second, it is often argued that basing the penalty on derivatives of $\sqrt{f}$ or $\log f$ has the advantage of avoiding negative estimates, but we contend that it is redundant to ensure positivity. Indeed, the terms $\log f(x_i)$ in (1) are already natural barriers against negative values of $f(x_i)$ at all observations, and $f$ would be positive everywhere if we make the additional mild assumption that $f$ is piecewise monotone between the $x_i$'s.

Third, penalizing smoothness on a square root- or log-scale might cause adverse effects because smoothness at low and high density values is not penalized equally.

Penalized likelihood served as a framework for smoothing spline estimators (Wahba, 1990). O'Sullivan (1988) developed a spline-based estimate to calculate an approximation to the log-density estimate of Silverman (1982). Kooperberg and Stone (1991) derived the logspline estimator, which selects spline knots: the locations of potential knots are recommended to be near order statistics and their number is automatically selected based on an AIC-like criterion. Silverman (1982) and Stone (1990) derived asymptotically optimal convergence properties under smoothness assumptions.

For the estimation of nonsmooth functions, nonlinear wavelet-based estimators pioneered by Waveshrink in regression (Donoho and Johnstone, 1994) have been developed for density estimation as well (see Vidakovic (1999) for a review). To guarantee positivity of the wavelet estimate, Penev and Dechevsky (1997) and Pinheiro and Vidakovic (1997) estimated $\sqrt{f}$ at the cost of losing local adaptivity, showing again that the use of a transform is not innocuous. Wavelet estimators are also sensitive to the choice of the dyadic grid (Renaud, 2002), as the histogram is sensitive to the choice of bins. Recently, Willett and Nowak (2003) proposed to adaptively prune a multiscale partition, Davies and Kovac (2004) proposed taut string, a simple and yet efficient locally adaptive estimator which measures complexity by the number of modes, and Koenker and Mizera (2006) proposed a log-density estimate regularized by a total variation penalty on the first derivative.

In this article, we propose a density estimator based on a total variation penalty, a functional that does not even have first-order differentiability.

Consequently, the new estimator defined in Section 2 has the ability to efficiently estimate nonsmooth densities. To select the smoothing parameter, Section 3 derives an information criterion based on the Gumbel prior for the hyperparameter. Section 4 derives and proves convergence of two relaxation algorithms, based on primal and dual transformations, to calculate the estimate. Section 5 investigates the finite sample properties of the new density estimator in comparison with existing ones on a Monte Carlo simulation.

We then consider two real data sets in Section 6, and draw some conclusions in Section 7.


2 Total variation penalized likelihood

2.1 Univariate case

We propose to regularize the maximum likelihood estimate (1) by penalizing the log-likelihood function with the total variation of the density. So we choose $\Phi(f) = \mathrm{TV}(f)$ in (2) with
$$ \mathrm{TV}(f) = \sup \sum_{j} |f(u_{j+1}) - f(u_j)|, \qquad (3) $$
where the sup is taken over all possible partitions $\{u_j\}$ of the domain of the univariate density. If one assumes that $f$ is absolutely continuous, then total variation matches a more conventional smoothness measure since $\mathrm{TV}(f) = \int |f'(x)|\, dx$. However total variation (3) is not tractable numerically, because looking at all possible partitions (which include the histogram bins) is not computationally feasible, even if restricted to a combinatorial problem on a fine grid. So, as O'Sullivan (1988) developed an approximation to calculate the functional estimate of Silverman (1982), we derive an approximation to total variation under the following assumptions.

Since our primary goal is the estimation of the density $f_i = f(x_{(i)})$ at the $N$ unique order statistics, and since no information is available in between, we avoid unnecessary variance by assuming piecewise monotone interpolation between midpoints of order statistics, and monotone extrapolation to zero outside the range of the observations. This interpolation scheme is reminiscent of logspline, which places knots at order statistics, in the sense that the modeling of the underlying density depends on the sample. We will see however that, for total variation, the "knot" selection has the advantage of being driven by a single smoothing parameter. To avoid any arbitrary specification of the unknown underlying domain, and for reasons linked to the universal rule (see Section 3.1), we consider the total variation function on the range $[x_{(1)}, x_{(N)}]$ of the data. With these assumptions, the total variation of $f$ has a simple vector expression in $\mathbf{f} = (f_1, \ldots, f_N)$:

$$ \mathrm{TV}(\mathbf{f}) = \sum_{i=1}^{N-1} |f(x_{(i+1)}) - f(x_{(i)})| = \sum_{i=1}^{N-1} |f_{i+1} - f_i|. \qquad (4) $$
As a byproduct, piecewise monotone interpolation defines a positive estimate everywhere, provided that all $f_i$'s are positive. We further assume that the interpolation and the extrapolation schemes allow us to write the functional integral constraint in a vector linear form
$$ 1 = \int f(x)\, dx = \mathbf{a}'\mathbf{f}, \qquad (5) $$


where the weights $\mathbf{a} = \mathbf{a}_x$ are functions of the order statistics, but not of $\mathbf{f}$. For instance, we assume in the following that the density is piecewise linear between the $(M-1)$ midpoints, and equals zero outside the range of the observations.
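To make the weights concrete, the following sketch (Python is used for all illustrative snippets in this rewrite; this is not the authors' code) computes quadrature weights consistent with the vector $\mathbf{a}$ given for piecewise linear interpolation in Appendix A; the function name is ours.

```python
import numpy as np

def quadrature_weights(x_sorted):
    """Weights a such that a @ f approximates the integral of a density that is
    piecewise linear between the sorted unique observations and zero outside
    their range.  A sketch consistent with the weights a_1 = (x_(2)-x_(1))/2,
    a_i = (x_(i+1)-x_(i-1))/2, a_N = (x_(N)-x_(N-1))/2 used in Appendix A."""
    x = np.asarray(x_sorted, dtype=float)
    a = np.empty_like(x)
    a[0] = (x[1] - x[0]) / 2.0
    a[1:-1] = (x[2:] - x[:-2]) / 2.0
    a[-1] = (x[-1] - x[-2]) / 2.0
    return a
```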

We are now ready to describe our estimator. Given a smoothing parameter $\lambda$, the total variation penalized likelihood estimate $\hat{\mathbf{f}}_\lambda$ of the density $f$ at the order statistics solves
$$ \min_{\mathbf{f}}\; -\sum_{i=1}^{N} n_i \log f_i + \lambda \|B\mathbf{f}\|_1, \quad \text{s.t.}\ \mathbf{a}'\mathbf{f} = 1, \qquad (6) $$
where $B$ is the sparse $(N-1) \times N$ matrix such that $\|B\mathbf{f}\|_1 = \sum_{i=1}^{N-1} |f_{i+1} - f_i|$ (i.e., $B_{i,i} = -B_{i,i+1} = -1$ for $i = 1, \ldots, N-1$ and zero otherwise). Thanks to the log-barrier likelihood and to the monotone interpolation, the estimate is positive everywhere. Moreover, ties are naturally handled in the likelihood part (by contrast, taut string (Davies and Kovac, 2004) seems unstable when ties are present, as we observe in Section 6). The total variation estimator also has the following properties.
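As an illustration of the discrete problem (6), here is a minimal sketch of the sparse difference matrix $B$ and of the penalized objective; the helper names are ours and the sketch assumes a feasible input $\mathbf{f}$.

```python
import numpy as np
from scipy import sparse

def difference_matrix(N):
    """Sparse (N-1) x N first-difference matrix B with (B f)_i = f_{i+1} - f_i."""
    return sparse.diags([-np.ones(N - 1), np.ones(N - 1)], offsets=[0, 1],
                        shape=(N - 1, N), format="csr")

def tv_objective(f, n, a, lam):
    """Penalized negative log-likelihood of (6); f is assumed to satisfy
    a @ f = 1 and f > 0 (the equality constraint is handled separately)."""
    B = difference_matrix(len(f))
    return -np.sum(n * np.log(f)) + lam * np.sum(np.abs(B @ f))
```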

Property 1. The Karush–Kuhn–Tucker first-order optimality conditions of (6) are
$$ f_i - n_i/\{(B'\mathbf{w})_i + z a_i\} = 0 \quad \forall i, \qquad (7) $$
$$ -1 + \sum_{i=1}^{N} a_i f_i = 0, \qquad (8) $$
$$ \|\mathbf{w}\|_\infty \le \lambda. \qquad (9) $$
So the Uniform density $f_i = c$ with $c = 1/(x_{(N)} - x_{(1)})$ for $i = 1, \ldots, N$ (i.e., the function $f$ in the total variation kernel such that $\|B\mathbf{f}\|_1 = 0$ on the range of the data) is the solution to (6) for a finite $\lambda_x = \|\mathbf{w}\|_\infty < \infty$, where $(\mathbf{w}, z)$ satisfies (7) to (9), i.e., $w_i = N\sum_{j=1}^{i} a_j - ic$ and $z = N$.

At the other end, as $\lambda \to 0^+$, the estimate converges to the empirical estimate $\hat f_{\lambda,i} = n_i/(a_i M)$, since a continuity argument shows that each accumulation point of $\hat{\mathbf{f}}_\lambda$ is a minimum of (6) with $\lambda = 0$.

Property 2. By strict convexity of the cost function and by linearity of the constraint, any local minimizer of (6) is its unique strict global minimizer.

This nice property is similar to the convexity property in O'Sullivan (1988) and contrasts with the multimodality encountered with logspline (Kooperberg and Stone, 1991; Kooperberg and Stone, 2002).


Property 3. For each $1 \le i \le N-1$ with $n_i/a_i \ge n_{i+1}/a_{i+1}$ (resp., $n_i/a_i \le n_{i+1}/a_{i+1}$), then $\hat f_{\lambda,i} \ge \hat f_{\lambda,i+1}$ (resp., $\hat f_{\lambda,i} \le \hat f_{\lambda,i+1}$) for all $\lambda \ge 0$.

For the proof see Appendix E. This shows that the estimate $\hat f_{\lambda,i}$ is ordered (relative to the neighboring values) consistently with the ratio $n_i/a_i$ for all $\lambda$.

Property 4. Under affine transformation of the data $Y = \alpha + \beta X$ with $\beta > 0$, the estimate defined by minimizing (6) satisfies $\hat f^{X}_{\lambda}(x) = \beta\, \hat f^{Y}_{\lambda\beta}(\alpha + \beta x)$.

2.2 Multivariate case

If the sample is large enough to allow efficient estimation, nonparametric techniques can be useful to estimate densities of two or more variables. In this section, we sketch a possible extension to higher dimension. Assuming the multivariate density $f \in L^1(U)$, where $U$ is an open subset of $\mathbb{R}^p$ ($p \ge 1$), the more general multivariate definition of total variation
$$ \mathrm{TV}(f) = \sup\left\{ \int_U f\, \mathrm{div}\,\varphi\, dx \;\Big|\; \varphi \in C^1_c(U; \mathbb{R}^p),\ |\varphi| \le 1 \right\} $$
becomes numerically tractable if we further assume the function is piecewise constant on a partition of the support $U$ of $f(\cdot)$. The Voronoi tessellation is the natural multivariate extension of the univariate midpoint splits used in Section 2.1. The Voronoi polygon assigned to the point $X_i$ consists of all the points in $U$ that are closer to $X_i$ than to any other point $X_j$, $j \ne i$.

Defining Voronoi polygons for the sample $X_1, \ldots, X_N$ results in the Voronoi tessellation. Two points $X_i$ and $X_j$ are said to be Voronoi neighbors if the Voronoi polygons enclosing them share a common edge $E_{i,j}$ of length $\|E_{i,j}\|$. Let $\partial i$ be the index set of all Voronoi neighbors of $X_i$. Assuming $f$ is a piecewise constant function on the Voronoi tessellation, its total variation becomes
$$ \mathrm{TV}(\mathbf{f}) = \sum_{i=1}^{N} \sum_{j \in \partial i \setminus \{1, \ldots, i-1\}} |f_i - f_j| \cdot \|E_{i,j}\|. $$

The corresponding penalized likelihood is similar to (6), where the components of $\mathbf{a}$ are the areas of the Voronoi polygons and where the $\ell_1$ differences are weighted by $\|E_{i,j}\|$. The dual relaxation algorithm of Section 4.2 can therefore be employed to find the unique optimum of the penalized likelihood in the multivariate situation as well.
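A sketch of how the neighbor structure and edge lengths might be computed in two dimensions with scipy's Voronoi routine; truncation of unbounded cells and the cell areas needed for $\mathbf{a}$ are left out, and the function names are ours.

```python
import numpy as np
from scipy.spatial import Voronoi

def voronoi_edges(points):
    """Voronoi-neighbor pairs (i, j) and the lengths ||E_{i,j}|| of their shared
    edges, for a 2-d sample; pairs whose shared edge is unbounded are skipped."""
    vor = Voronoi(points)
    edges = {}
    for (i, j), verts in zip(vor.ridge_points, vor.ridge_vertices):
        if -1 in verts:            # unbounded edge (points on the convex hull)
            continue
        v0, v1 = vor.vertices[verts[0]], vor.vertices[verts[1]]
        edges[(min(i, j), max(i, j))] = np.linalg.norm(v1 - v0)
    return edges

def tv_voronoi(f, edges):
    """Weighted l1 differences sum_{i~j} |f_i - f_j| * ||E_{i,j}||, with f an
    array of density values on the Voronoi cells."""
    return sum(length * abs(f[i] - f[j]) for (i, j), length in edges.items())
```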


3 Selection of the smoothing parameter

The smoothing parameter $\lambda \ge 0$ indexes a continuous class of models. Its selection is crucial to find the model that best fits the data. Model selection is an old problem, for which key contributions are the AIC, $C_p$ and BIC criteria (Akaike, 1973; Mallows, 1973; Schwarz, 1978) in the context of variable selection, that is, for the discrete problem equivalent to $\ell_0$ penalized likelihood. Candidates for selecting the hyperparameter indexing a continuous class of $\ell_1$ penalized likelihood models are cross validation (Stone, 1974), an approximate generalized cross validation (Fu, 1998) derived for the Lasso (Tibshirani, 1995), BIC borrowed from variable selection (Koenker, Ng, and Portnoy, 1994), the empirical Bayes approach (Good, 1965) used for $\ell_1$ Markov random field smoothing (Sardy and Tseng, 2004), and the universal rule (Donoho and Johnstone, 1994). However, cross validation is computationally intensive, while generalized cross validation, originally devised for linear estimators (Craven and Wahba, 1979), requires for our problem an ill-defined linearization of the $\ell_1$-based estimator near the solution. The Bayesian information criterion was intended for variable selection ($\ell_0$), not for $\ell_1$-based penalized likelihood. Finally, empirical Bayes requires maximizing the marginal likelihood of the data with respect to the prior, which is rarely available in a closed form and therefore requires numerical integration tools.

3.1 Sparsity $\ell_1$ information criterion

The universal penalty $\lambda_N = \sqrt{2\log N}$, originally developed in regression for Gaussian wavelet smoothing (Donoho and Johnstone, 1994), has the property of being near minimax (Donoho, Johnstone, Kerkyacharian, and Picard, 1995) and of controlling the maximum amplitude of i.i.d. Gaussian random variables to fit the underlying constant function with a probability tending to one. Direct extension of the latter property to total variation density estimation defined by (6) would give $\lambda^\star_N = \sqrt{N\log(\log N)/2}$ to control the maximum amplitude of a discretized Brownian bridge (see Appendix A) and fit the underlying Uniform density with a probability tending to one. This bound oversmooths in most applications however.

We employ instead the sparsity $\ell_1$ information criterion (Sardy, 2006) based on a Bayesian interpretation of total variation and on the universal prior for $\lambda$ that we derive below. Borrowing from Markov random field smoothing (Besag, 1986; Geman and Geman, 1984), the vector of true density $\mathbf{f}$ at the $N$ samples can be seen as the realization of a first-order pairwise Laplace Markov random field shifted and rescaled to make it a density. We moreover consider the regularization parameter as a random variable with prior $\pi_\lambda(\lambda; \tau)$ that we derive in Proposition 1 below. Hence using Bayes theorem, the posterior joint distribution of $\mathbf{f}$ and $\lambda$ leads to the sparsity $\ell_1$ information criterion for total variation density estimation

$$ \mathrm{SL1IC}(\mathbf{f}, \lambda) = -\sum_{i=1}^{N} n_i \log f_i + \lambda \sum_{i=1}^{N-1} |f_{i+1} - f_i| - (N-1)\log\lambda - \log\pi_\lambda(\lambda; \tau) \quad \text{s.t.}\ \mathbf{a}'\mathbf{f} = 1. \qquad (10) $$
Minimizing SL1IC over $\mathbf{f}$ and $\lambda$ provides an estimate of the density as well as a selection of the hyperparameter at once. Derived in Appendix B, the universal prior $\pi_\lambda(\lambda; \tau)$ defined in Proposition 1 below is based on asymptotic considerations linked to Property 1.

Proposition 1. Let $G_0(x) = \exp(-\exp(-x))$ be the Gumbel distribution and let
$$ G(\lambda; \tau) = G_0[4\{\lambda/\tau - (\log N)/4\}]. \qquad (11) $$
Then the universal prior used to estimate the density $\mathbf{f}$ and the hyperparameter $\lambda$ based on SL1IC (10) is defined as
$$ \pi_\lambda(\lambda; \tau) = G'(\lambda; \tau_N) \quad \text{with} \quad \tau_N = \log(N\log N)/N. \qquad (12) $$
With this prior, SL1IC is both a stable and computationally efficient method to select the regularization parameter, especially by comparison with cross validation.
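A small sketch of the criterion, reading (12) as the derivative with respect to $\lambda$ of the distribution $G$ in (11) evaluated at $\tau_N$; the function names are ours and feasibility of $\mathbf{f}$ ($\mathbf{a}'\mathbf{f} = 1$, $\mathbf{f} > 0$) is assumed rather than enforced.

```python
import numpy as np

def universal_prior(lam, N):
    """Universal prior pi_lambda(lambda; tau_N), read here as the derivative in
    lambda of G(lambda; tau) = G_0[4{lambda/tau - (log N)/4}] of (11), with
    tau_N = log(N log N)/N as in Proposition 1."""
    tau = np.log(N * np.log(N)) / N
    u = 4.0 * (lam / tau - np.log(N) / 4.0)
    return (4.0 / tau) * np.exp(-u - np.exp(-u))

def sl1ic(f, n, a, lam):
    """Sparsity l1 information criterion (10) for a candidate (f, lambda);
    f is assumed feasible, i.e. a @ f = 1 and f > 0."""
    N = len(f)
    tv = np.sum(np.abs(np.diff(f)))
    return (-np.sum(n * np.log(f)) + lam * tv
            - (N - 1) * np.log(lam) - np.log(universal_prior(lam, N)))
```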

3.2 Kullback–Leibler V-fold cross validation

Another approach, though more computationally intensive, would be to turn to cross validation. For a given smoothing parameter λ, the Kullback-Leibler information

$$ \mathrm{KL}(f, \hat f_\lambda) = \int \log\{f(x)/\hat f_\lambda(x)\}\, f(x)\, dx \qquad (13) $$
measures the quality of the fit between the true but unknown density $f$ and the estimated density $\hat f_\lambda$. This quantity can be estimated by cross validation (CV) or by the bootstrap (see Efron and Tibshirani (1993) for a review). We consider V-fold CV, which is computationally feasible as long as $V$ is not too large: it consists of grouping the data set into $V$ randomly selected disjoint sets $x_v = \{x_i, i \in S_v\}$, $v = 1, \ldots, V$, using $x_{-v} = \{x_i, i \notin S_v\}$ to estimate $f$ by $\hat f_{\lambda,-v}$, and using the interpolation scheme to predict $\{f(x_i)\}_{i \in S_v}$ by $\{\hat f_{\lambda,-v}(x_i)\}_{i \in S_v}$. Repeating the same operation $V$ times yields the estimate $\mathrm{CV}(\lambda) = -\sum_{v=1}^{V}\sum_{i \in S_v} \log \hat f_{\lambda,-v}(x_i)$ of (13). Calculating this criterion for several $\lambda$'s, we select the smoothing parameter $\lambda_{\mathrm{CV}}$ which minimizes it.

Van der Laan, Dudoit, and Keles (2004) studied the choice of $V$ and showed asymptotic equivalence with a benchmark selection based on the calibration set as long as $N/V$ goes to infinity. This condition is satisfied by 2-fold CV, but rules out leave-one-out CV, for which $V = N$. They moreover showed asymptotic equivalence with a benchmark selection based on the entire $N$ observations as long as $1/V_N$ converges slowly enough to zero. This rules out 2-fold CV, but is satisfied for instance by $V_N = O(\log N)$.
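A sketch of the V-fold criterion just described; `fit_tv` is a hypothetical stand-in for the total variation estimator of Section 2, and the fold construction and seed are arbitrary choices.

```python
import numpy as np

def cv_score(x, lam, fit_tv, V=5, seed=0):
    """V-fold cross-validation criterion CV(lambda) of Section 3.2.
    `fit_tv(x_train, lam)` is a hypothetical fitting routine that must return a
    function evaluating the fitted density (via the interpolation scheme) at
    new points."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), V)
    score = 0.0
    for held_out in folds:
        mask = np.ones(len(x), dtype=bool)
        mask[held_out] = False
        f_hat = fit_tv(x[mask], lam)          # density fitted without fold v
        score -= np.sum(np.log(f_hat(x[held_out])))
    return score

# lam_cv = min(candidate_lambdas, key=lambda lam: cv_score(x, lam, fit_tv))
```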

4 Computation of total variation estimate

Calculating the total variation penalized likelihood estimate is not a trivial task owing to the nondifferentiability, high dimensionality and the constraint of (6). Using the constraint $\mathbf{a}'\mathbf{f} = 1$, we could express $f_N = (1 - \sum_{i=1}^{N-1} a_i f_i)/a_N$ and rewrite (6) as an unconstrained minimization problem in $f_1, \ldots, f_{N-1}$. However, the resulting objective function remains nondifferentiable, so the Newton–Raphson method cannot be used. The total variation term is moreover non-separable (e.g., $f_2$ appears in two terms: $|f_2 - f_1|$ and $|f_3 - f_2|$), so a block coordinate relaxation (BCR) method could not be directly used either (Tseng, 2001). Koenker and Mizera (2006) use interior point methods. Instead we propose a simple BCR method applied to two transformations of the penalized likelihood into an unconstrained problem of the form
$$ \min_{\mathbf{v} = (v_1, \ldots, v_n)'}\; h_0(\mathbf{v}) + h(\mathbf{v}), $$
where $h_0$ is a differentiable convex function and $h$ is a separable nondifferentiable convex function (i.e., $h(\mathbf{v}) = \sum_{i=1}^{n} h_i(v_i)$). The BCR method is simple to implement as it successively solves subproblems of dimension one until convergence, and is computationally efficient using the dual transformation. In particular, starting with an initial guess $\mathbf{v}$, it chooses an $i \in \{1, \ldots, n\}$ and minimizes $h_0(\mathbf{v}) + h(\mathbf{v})$ with respect to $v_i$ while holding the other components of $\mathbf{v}$ fixed. This is then repeated with the new $\mathbf{v}$ and so on. The rule for choosing $i$ at each iteration is crucial for convergence. In Appendix C, we consider two rules for which we prove convergence of BCR.


4.1 Primal transformation

The obvious transformation, detailed in Appendix C, is to make a change of variables such that the nondifferentiable part $h(\cdot)$ in the transformed problem
$$ \min_{\mathbf{v}}\; g(C\mathbf{v}) + h(\mathbf{v}), \qquad (14) $$
becomes separable. Then Theorem 1 below guarantees convergence of the BCR method. The $C$ matrix is not sparse however, so there is no closed form solution to the convex univariate subproblems. Consequently, the line search required at each iteration slows down the algorithm.

Theorem 1. For any initial $\mathbf{v}$ with $C\mathbf{v} > -1/b_N$, the sequence of iterates generated by the BCR method, using either the essentially cyclic rule or the optimal descent rule, converges to the unique global minimum of (14).

For the proof see Appendix C.

4.2 Dual transformation

Instead we can use an extension of the iterated dual mode (IDM) algorithm (Sardy and Tseng, 2004) to handle constraints. The second transformation uses results from convex programming duality theory (Rockafellar, 1970; Rockafellar, 1984) and has computational advantages over the primal method. Introducing the Lagrange multipliers $\mathbf{w}$ and $z$, the minimization (6) is equivalent to
$$ \min_{\mathbf{f}, \mathbf{u}}\ \max_{\mathbf{w}, z}\; -\sum_{i=1}^{N} n_i \log f_i + \lambda \sum_{i=1}^{N-1} |u_i| + \mathbf{w}'(B\mathbf{f} - \mathbf{u}) + z(\mathbf{a}'\mathbf{f} - 1) $$
$$ = \max_{\mathbf{w}, z}\; -z + \min_{\mathbf{f}} \sum_{i=1}^{N} \big[-n_i \log f_i + f_i\{(B'\mathbf{w})_i + z a_i\}\big] + \min_{\mathbf{u}} \sum_{i=1}^{N-1} \big(\lambda |u_i| - w_i u_i\big) $$
$$ = M - \sum_{i=1}^{N} n_i \log n_i + \max_{\|\mathbf{w}\|_\infty \le \lambda,\, z}\; -z + \sum_{i=1}^{N} n_i \log\{(B'\mathbf{w})_i + z a_i\}, $$
where the first equality uses a strong duality result for monotropic programs, i.e., convex programs with linear constraints and separable cost functions (Rockafellar, 1984, Sec. 11D). Dropping the constant terms, the above maximization problem can be rewritten as
$$ \min_{\mathbf{w}, z}\; z + g(B'\mathbf{w} + z\mathbf{a}) + h(\mathbf{w}), \qquad (15) $$


where $g(v_1, \ldots, v_N) = -\sum_{i=1}^{N} n_i \log v_i$ and $h(\mathbf{w}) = 0$ if $\|\mathbf{w}\|_\infty \le \lambda$ and $+\infty$ otherwise. The nondifferentiable term $h(\mathbf{w})$ is convex and has a separable form. Notice that $(\mathbf{w}, z)$ needs to satisfy $\|\mathbf{w}\|_\infty \le \lambda$ and $B'\mathbf{w} + z\mathbf{a} > 0$. Convergence to the dual solution in $(\mathbf{w}, z)$ is guaranteed by Theorem 2 below, and the primal solution is then $\hat f_i = n_i/\{(B'\mathbf{w})_i + z a_i\}$ for $i = 1, \ldots, N$.

Theorem 2. For any initial $(\mathbf{w}, z)$ with $\|\mathbf{w}\|_\infty \le \lambda$ and $B'\mathbf{w} + z\mathbf{a} > 0$, the sequence of iterates generated by the BCR method applied to (15), using the essentially cyclic rule, converges to the unique global minimum of (15).

For the proof see Appendix D.

Thanks to the sparsity of $B'$, with at most two nonzero entries per column, the univariate subproblems in $w_j$ have a closed form solution. Consequently the dual relaxation algorithm is efficient, even if programmed in a high-level language like R. Note that the dual transformation is still applicable for higher derivatives such as $\Phi(f) = \int |f''(x)|\, dx$.
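The following sketch illustrates one way the dual BCR of this section could be implemented. The closed-form coordinate update for $w_j$ follows from the two log-barrier terms it enters; the update of $z$ by one-dimensional root finding after each sweep, the sweep count and the tolerances are our own choices rather than the authors' IDM implementation.

```python
import numpy as np
from scipy.optimize import brentq

def tv_density_dual_bcr(n, a, lam, n_sweeps=200, eps=1e-10):
    """Sketch of the dual block coordinate relaxation for problem (15).
    n   : tie counts n_1, ..., n_N at the unique order statistics
    a   : quadrature weights with a @ f = 1 (Section 2.1)
    lam : smoothing parameter lambda
    Returns the density values f_1, ..., f_N at the order statistics."""
    n = np.asarray(n, float)
    a = np.asarray(a, float)
    N = len(n)
    w = np.zeros(N - 1)                 # dual variables, feasible: |w_j| <= lam
    Btw = np.zeros(N)                   # B'w, i.e. Btw_k = w_{k-1} - w_k

    def update_z():
        # minimize z - sum n_i log(Btw_i + z a_i): root of 1 - sum n_i a_i/(Btw_i + z a_i)
        phi = lambda z: 1.0 - np.sum(n * a / (Btw + z * a))
        lo = np.max(-Btw / a) + eps     # smallest z keeping all barriers positive
        hi = lo + 1.0
        while phi(hi) <= 0:             # expand bracket until phi changes sign
            hi = lo + 2.0 * (hi - lo)
        return brentq(phi, lo + eps, hi)

    z = update_z()
    for _ in range(n_sweeps):
        for j in range(N - 1):          # essentially cyclic rule over the w_j
            # only components j and j+1 of B'w + z a involve w_j
            r = (w[j - 1] if j > 0 else 0.0) + z * a[j]                # v_j + w_j
            s = -(w[j + 1] if j + 1 < N - 1 else 0.0) + z * a[j + 1]   # v_{j+1} - w_j
            w_star = (n[j + 1] * r - n[j] * s) / (n[j] + n[j + 1])
            w[j] = np.clip(w_star, max(-lam, -s + eps), min(lam, r - eps))
            Btw[j] = r - z * a[j] - w[j]
            Btw[j + 1] = s - z * a[j + 1] + w[j]
        z = update_z()
    return n / (Btw + z * a)            # primal solution f_i = n_i/{(B'w)_i + z a_i}
```

At convergence the returned vector satisfies the integral constraint $\mathbf{a}'\mathbf{f} \approx 1$ by the KKT condition (8).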

5 Simulation

We perform a Monte Carlo simulation to compare the finite sample properties of four estimators: total variation using SL1IC, taut string (Davies and Kovac, 2004), logspline (Kooperberg and Stone, 2002) and a rectangular kernel with global bandwidth (Sheather and Jones, 1991) as a benchmark.

Table 1: Densities used in the simulation, their domain, and the interval Ω on which the mean integrated squared error is calculated on a fine grid.

  Densities              Domain    Ω
  1. Sharp peak †        R         [0, 12]
  2. Claw ‡              R         [−3, 3]
  3. Weighted Uniform    [0, 1]    [0, 1]
  4. Heaviexp            R         [−5, 5]

† Hansen and Kooperberg (2002); ‡ Marron and Wand (1992, #10 p. 720).
The Weighted Uniform density is $f_3(x) = \sum_{i=1}^{13} w_i\, \mathrm{U}(a_i, b_i) / \sum_{i=1}^{13} w_i$ with
  w = (1, 1, 5, 1, 1, .2, 1, 1, 10, .1, 1, 1, 5),
  a = (0, .1, .13, .15, .23, .25, .40, .44, .65, .76, .78, .81, .97),
  b = (.1, .13, .15, .23, .25, .40, .44, .65, .76, .78, .81, .97, 1).
The Heaviexp density is $f_4(x) = \tfrac{1}{4}\{2 + \mathrm{Exp}(5)\} + \tfrac{1}{4}\{-\mathrm{Exp}(1)\} + \tfrac{1}{2}\mathrm{N}(0, 1)$.
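For reference, a small sketch evaluating the Weighted Uniform test density $f_3$ on a grid, using the constants of Table 1; the function name is ours.

```python
import numpy as np

# Weighted Uniform test density f3 of Table 1: a mixture of uniform densities
# on the contiguous intervals (a_i, b_i) with weights w_i.
w3 = np.array([1, 1, 5, 1, 1, .2, 1, 1, 10, .1, 1, 1, 5])
a3 = np.array([0, .1, .13, .15, .23, .25, .40, .44, .65, .76, .78, .81, .97])
b3 = np.array([.1, .13, .15, .23, .25, .40, .44, .65, .76, .78, .81, .97, 1])

def f3(x):
    """Evaluate f_3(x) = sum_i w_i U(a_i, b_i)(x) / sum_i w_i on an array x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    dens = np.zeros_like(x)
    for wi, ai, bi in zip(w3, a3, b3):
        dens += wi * ((x >= ai) & (x < bi)) / (bi - ai)
    return dens / w3.sum()
```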

Figure 1: The four test densities used in the Monte Carlo simulation.

Similarly to Donoho and Johnstone (1994), we use four test densities with heterogeneous features (see Table 1 and Figure 1). As in Kooperberg and Stone (2002), the criterion we use to compare estimators is the mean integrated squared error, $\mathrm{MISE}(\hat f, f) = E\int (\hat f(x) - f(x))^2\, dx$, approximated by a Riemann sum on a fine grid of $2^{13}$ points on the interval Ω given in Table 1. Since the standard error on the estimation of the MISE decreases as the sample size increases, the expectation is taken over $Q \in \{1000, 500, 100\}$ samples of respective sizes $M \in \{200, 800, 3200\}$, ranging from small to large, to evaluate the relative empirical rates of convergence. We also considered the expected Kullback–Leibler information to compare the estimators, and found results correlated with MISE. Note that for logspline, we had to increase the default value of the maximum number of knots to fit the complexity of the Weighted Uniform density; this revealed occasional numerical problems.
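A sketch of the integrated squared error computation by Riemann sum described above; `f_hat` and `f_true` are assumed callable on a grid, and averaging the returned value over Monte Carlo replicates estimates the MISE.

```python
import numpy as np

def ise_riemann(f_hat, f_true, omega, n_grid=2**13):
    """Integrated squared error of one fit, approximated by a Riemann sum on a
    fine grid over the interval Omega = (lo, hi) of Table 1."""
    grid = np.linspace(omega[0], omega[1], n_grid)
    dx = grid[1] - grid[0]
    return np.sum((f_hat(grid) - f_true(grid)) ** 2) * dx
```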


Table 2: Results of the simulation to compare density estimators on the four test functions (average MISE ×100).

                         N      Kernel   Logspline   Taut string   Total variation
  1. Sharp peak          200    4.9      4.1         5.1           3.8
                         800    1.4      1.2         1.7           1.4
                         3200   0.41     0.38        0.62          0.56
  2. Claw                200    5.8      5.3         5.6           3.4
                         800    1.5      1.4         2.1           1.3
                         3200   0.32     0.40        0.70          0.54
  3. Weighted Uniform    200    161      89          93            65
                         800    78       55          36            18
                         3200   36       43          9.0           5.7
  4. Heaviexp            200    7.8      3.8         7.6           4.7
                         800    4.7      1.8         2.5           2.0
                         3200   2.7      0.95        0.62          0.63

NOTE: in bold, the best in each row between taut string and total variation; underlined, the best in each row among all four estimators (standard errors are of the order of the precision reported).

We can draw the following conclusions on the finite sample performance of the estimators considered, based on the simulation results of Table 2. First, the sparsity $\ell_1$ information criterion provides total variation with both a fast and efficient selection of the regularization parameter. Second, total variation is overall better than taut string in terms of mean integrated squared error; if instead the criterion were the correct number of modes detected, then taut string would perform better than total variation, which tends to detect false modes. Finally, kernel and logspline are competitive estimators only when the true density does not have sharp features. So total variation is overall best in terms of mean integrated squared error.

6 Application

We consider two data sets with multimodality: the ‘galaxy data’ (Roeder, 1990) consists of the velocities of 82 distant galaxies diverging from our own galaxy, and the ‘1872 Hidalgo data’ (Izenman and Sommer, 1988) consists of thickness measurements of 485 stamps printed on different types of paper.

Figure 2: Four density estimates for each of the two data sets (the 'galaxy data' and the '1872 Hidalgo data'). From top to bottom: total variation, two histograms with different binning, taut string. The thin lines are bootstrapped pointwise confidence intervals.


Ties are numerous since the precision of the measurements is 0.001. These data sets were respectively used to illustrate a Bayesian analysis of mixtures to test models of varying dimensions using reversible jump MCMC, and to compare the kernel method to parametric likelihood ratio tests for mixtures.

The focus of the analysis is the number of modes. The histograms plotted in the published analyses (see the third line of Figure 2) are based on a subjective bin width and left point values, and might give the false impression of too many modes. To illustrate the histogram's sensitivity, we plotted two other histograms (with different bin width and left point values) on the second line of Figure 2. The models of Richardson and Green (1997) and Izenman and Sommer (1988) lead to the result that a likely number of modes is $k_1 \in \{4, 5, 6, 7\}$ for the 'galaxy data', and $k_2 = 7$ and $k_2 = 3$ with nonparametric and parametric approaches, respectively, for the '1872 Hidalgo data'.

Comparing these results to $k_1 = 3$ and $k_2 = 7$ obtained with total variation, plotted on the first line of Figure 2, and $k_1 = 1$ and $k_2 = 3$ obtained with taut string, plotted on the fourth line of Figure 2, we see that the estimation of the number of modes is consistent with previous results. Taut string, which does not handle ties naturally, gives a significantly smaller number of modes however. The first and fourth lines of Figure 2 also show 90% pointwise confidence intervals based on 100 bootstrapped replicates, except for taut string, which experienced numerical problems for the 'galaxy data' because of the ties; for the '1872 Hidalgo data', the taut string confidence interval shows potential additional modes not revealed by its estimate. By contrast, total variation behaves as well on the bootstrap samples as on the original one.

7 Conclusions

The total variation estimator shows overall better estimation properties than its considered competitors. The efficient estimation of both smooth and nonsmooth densities is especially remarkable considering that only a single regularization parameter has to be selected to accommodate regions of varying smoothness. The downside is a bumpier look than taut string. The good properties of the total variation estimator are due to the stability of solving the convex program (6) and to the method of selection of $\lambda$ using the sparsity $\ell_1$ information criterion SL1IC. Furthermore, total variation handles ties naturally (as opposed to taut string), which is particularly useful for bootstrapping the procedure, or when data have been truncated as in our two applications. Finally, it generalizes to higher dimension as described in Section 2.2.

8 Acknowledgments

This work was partially funded by grant FNS PP001–106465.

A Bound $\lambda^\star_N$

For a given sample $x_1, \ldots, x_N$, Property 1 states the existence of a finite $\lambda_x$ (which depends on the data) such that the estimate $\hat{\mathbf{f}}_{\lambda_x}$ solution to (6) equals the Uniform density $f_0$. The standard universal rule is defined as the smallest smoothing parameter $\lambda^\star_N$ (which depends only on the sample size) such that the estimate $\hat{\mathbf{f}}_{\lambda^\star_N}$ equals the Uniform density $f_0$ with a probability tending to one when the size $N$ of the Uniform sample goes to infinity. Without loss of generality, we can restrict to $f_0 = \mathrm{U}[0, 1]$ thanks to Property 4.

So, given a sample $U_1, \ldots, U_N \overset{\mathrm{i.i.d.}}{\sim} \mathrm{U}[0, 1]$, we work on the rescaled variables $X_i = (U_i - U_{(1)})/(U_{(N)} - U_{(1)})$, such that $X_{(1)} = 0$ and $X_{(N)} = 1$ and the density to estimate is $f = 1$; note that the two samples are asymptotically equivalent. Using the dual formulation of (6) and its KKT first-order optimality conditions (7) to (9), the dual vector $(\mathbf{w}, z)$ is the unique global minimizer of the dual problem (15). So the estimate $\hat{\mathbf{f}}_\lambda$ fits the Uniform (i.e., $\hat f_i = 1$) provided $\lambda$ is at least as large as $\lambda_x = \|\mathbf{w}\|_\infty$, where $\mathbf{w}$ satisfies all the KKT conditions (with $n_i = 1$ with probability one since the Uniform is continuous). Consequently, we have that
$$ P\{\hat{\mathbf{f}}_\lambda = \mathbf{1}\} = P\left\{ \|\mathbf{w}\|_\infty \le \lambda : [B'\ \mathbf{a}] \begin{pmatrix} \mathbf{w} \\ z \end{pmatrix} = \mathbf{1} \right\} = P\left\{ \max_{i=1,\ldots,N-1} \Big| N \sum_{j=1}^{i} a_j - i \Big| \le \lambda \right\}. $$
Assuming piecewise linear interpolation, the vector $\mathbf{a}$ satisfying (5) is $a_1 = X_{(2)}/2$, $a_i = (X_{(i+1)} - X_{(i-1)})/2$ for $i = 2, \ldots, N-1$, and $a_N = (1 - X_{(N-1)})/2$. So $\sum_{j=1}^{i} a_j$ has a simple expression and
$$ P(\hat{\mathbf{f}}_\lambda = \mathbf{1}) = P\Big( \max_{i=1,\ldots,N-1} \big| (N/2)(X_{(i)} + X_{(i+1)}) - i \big| \le \lambda \Big) \xrightarrow{N\to\infty} P\Big( \max_{i=1,\ldots,N} \sqrt{N}\, |X_{(i)} - i/N| \le \lambda/\sqrt{N} \Big) $$
since $\mathrm{cor}(X_{(i)}, X_{(i+1)}) \xrightarrow{N\to\infty} 1$. Moreover $\max_{i=1,\ldots,N} \sqrt{N}(X_{(i)} - i/N)$ is an asymptotic pivot, since the uniform quantile process converges to a Brownian bridge $W$ on $[0, 1]$, so
$$ P(\hat{\mathbf{f}}_\lambda = \mathbf{1}) \xrightarrow{N\to\infty} P\Big( \|W\|_\infty \le \frac{\lambda}{\sqrt{N}} \Big) = 1 - 2 \sum_{k=1}^{\infty} (-1)^{k+1} \exp\Big( -2k^2 \Big(\frac{\lambda}{\sqrt{N}}\Big)^2 \Big). $$
Using bounds on a Brownian bridge (Shorack and Wellner, 1986), we find that the bound $\lambda^\star_N := \sqrt{N\log(\log N)/2}$ controls the convergence to one at the rate $O(1/\log N)$.
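A small Monte Carlo sketch of the maximal statistic behind this bound, checking how often a rescaled Uniform sample satisfies $\max_i |(N/2)(X_{(i)}+X_{(i+1)}) - i| \le \lambda^\star_N$; the replication count and seed are arbitrary choices.

```python
import numpy as np

def prob_fit_uniform(N, n_rep=1000, seed=0):
    """Frequency with which max_i |(N/2)(X_(i) + X_(i+1)) - i| <= lambda*_N,
    with lambda*_N = sqrt(N log(log N)/2), for rescaled Uniform samples; this
    estimates the probability that the total variation estimate with
    lambda = lambda*_N is the Uniform density (Appendix A)."""
    rng = np.random.default_rng(seed)
    lam_star = np.sqrt(N * np.log(np.log(N)) / 2.0)
    i = np.arange(1, N)                           # i = 1, ..., N-1
    hits = 0
    for _ in range(n_rep):
        u = np.sort(rng.uniform(size=N))
        x = (u - u[0]) / (u[-1] - u[0])           # rescaled so X_(1)=0, X_(N)=1
        stat = np.max(np.abs(N / 2.0 * (x[:-1] + x[1:]) - i))
        hits += stat <= lam_star
    return hits / n_rep
```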

B Universal prior

The idea of a universal penalty comes from wavelet smoothing. The first level Haar wavelet coefficients for instance, defined by $\frac{1}{2}(f_{2k} - f_{2k-1})$ for $k = 1, \ldots, N/2$ (we assume $N$ is even), are reminiscent of the successive finite differences $f_{i+1} - f_i$ for $i = 1, \ldots, N-1$ used by the total variation penalty in (6). However, while the Haar wavelets' finite differences are separable, the ones used for total variation are not separable (for instance $f_2$ is used in $|f_2 - f_1|$ and $|f_3 - f_2|$). This causes inflation of the bound that controls the maximum amplitude of the Brownian bridge (see Appendix A), and causes oversmoothing.

So instead of considering all finite order differences, we derive the universal prior by considering problem (6) with the $(N/2) \times N$ matrix $\tilde B$ corresponding to every other finite difference in place of $B$. Following similar derivations as in Appendix A, the smallest hyperparameter $\lambda$ that fits a pairwise constant density based on a sample of size $N$ from $f_0 = \mathrm{U}[0, 1]$ (i.e., that satisfies the KKT conditions (7) to (9) with $f_{2k-1} = f_{2k}$ for $k = 1, \ldots, N/2$ and $B = \tilde B$) is $\lambda_X = \|\mathbf{W}\|_\infty$ with $W_k = (z/2)(a_{2k-1} - a_{2k}) = (z/4)(-X_{(2k-2)} + X_{(2k-1)} + X_{(2k)} - X_{(2k+1)})$ and $z = \sum_{k=1}^{N/2} 2/f_{2k} \approx N$ since $f_0 = \mathrm{U}[0, 1]$. Each $|W_k|$ converges in distribution to an exponential distribution $\mathrm{Exp}(4)$ since the density function of $W_k$ is given by (Ramallingam, 1989)
$$ f_W(w) = \mathbf{1}_{(-N/4, 0]}(w)\, 2\Big(1 + \frac{4w}{N}\Big)^{N-1} + \mathbf{1}_{(0, N/4]}(w)\, 2\Big(1 - \frac{4w}{N}\Big)^{N-1} \;\xrightarrow{N\to\infty}\; 2\exp(-4|w|). $$
Neglecting the correlation between the $W_k$'s, results from extreme value theory guarantee that
$$ 4\Big( \|\mathbf{W}\|_\infty - \frac{\log N}{4} \Big) \xrightarrow{d} G_0, $$
where $G_0(x) = \exp(-\exp(-x))$ is the Gumbel distribution. Finally we calibrate the scale parameter $\tau$ of the universal prior (11) such that the universal penalty $\lambda_N = \log(N\log N)/4$ is the root of the first order optimality condition of (10) with respect to $\lambda$ to fit $f_i = 1$ for all $i = 1, \ldots, N$:
$$ \sum_{i=1}^{N-1} |f_{i+1} - f_i| - \frac{N-1}{\lambda} - \frac{\pi'}{\pi}(\lambda; \tau) = 0 \quad \text{with} \quad \begin{cases} \sum_{i=1}^{N-1} |f_{i+1} - f_i| = 0 \\ \lambda = \lambda_N. \end{cases} $$
The approximate root is $\tau_N = 4\lambda_N/N$.

C Proof of Theorem 1

Letting $u_n = f_n - f_{n+1}$ for $n = 1, \ldots, N-1$ and $u_N = f_N$, the minimization (6) becomes
$$ \min_{\mathbf{u}}\; -\sum_{i=1}^{N} n_i \log\Big( \sum_{j=i}^{N} u_j \Big) + \lambda \sum_{i=1}^{N-1} |u_i|, \quad \text{s.t.}\ \mathbf{b}'\mathbf{u} = 1, \qquad (16) $$
where $\mathbf{b} = T'\mathbf{a}$ and $T$ is the upper triangular matrix of ones. Since $b_N = \sum_{j=1}^{N} a_j$ and $a_j > 0$, we have that $b_N > 0$. From the equality constraint, we also have that $u_N = (1 - \sum_{i=1}^{N-1} b_i u_i)/b_N$, so letting $\mathbf{v} = (u_1, \ldots, u_{N-1})$ and $C_{ni} = 1(i \ge n) - b_i/b_N$ be the entries of the $N \times (N-1)$ matrix $C$, the problem writes as (14) where $g(w_1, \ldots, w_N) = -\sum_{i=1}^{N} n_i \log(1/b_N + w_i)$.

Since $b_N > 0$, then $0 \in \mathrm{dom}\, g = \{\mathbf{w} \mid \mathbf{w} > -1/b_N\}$. So $\mathrm{dom}(g \circ C)$ is also nonempty and open. Moreover, $g \circ C$ is continuously differentiable on its domain. Thus, $g \circ C$ satisfies Assumption 1 in Tseng (2001). Also, $g$ and $h$ are convex functions, so $f = g \circ C + h$ is convex (and hence pseudoconvex). Since $h$ has bounded level sets, then so does $f$ (since $\log(\cdot)$ tends to infinity sublinearly). Thus, the $k$th iterate $\mathbf{v}^k$ ($k = 0, 1, \ldots$) generated by the BCR method is bounded. In fact, the iterates lie in a compact subset of $\mathrm{dom}(g \circ C)$.

The classical cyclic rule chooses $i$ in, for example, increasing order $i = 1, \ldots, n$ and then repeats this. The more general essentially cyclic rule entails that each $i \in \{1, \ldots, N\}$ is chosen at least once every $T$ ($T \ge N$) successive iterations to allow different orders. Lemma 3.1 and Theorem 4.1(a) in Tseng (2001) imply that each accumulation point of $\mathbf{v}^0, \mathbf{v}^1, \ldots$ is a stationary point of $f$, the unique global minimum of $f$ by strict convexity.

The optimal descent rule (Sardy, Bruce, and Tseng, 2000) chooses an $i$ for which the minimum-magnitude partial derivative of the cost function with respect to $v_i$, i.e.,
$$ \min_{\eta \in \partial h_i(v_i)} \left| \frac{\partial h_0(\mathbf{v})}{\partial v_i} + \eta \right|, $$
is the largest. This yields the highest rate of descent at the current $\mathbf{v}$, and often considerably improves the efficiency of the algorithm. Since $\mathbf{v}^0, \mathbf{v}^1, \ldots$ lie in a compact subset of $\mathrm{dom}(g \circ C)$, over which $g \circ C$ is continuously differentiable, and $g \circ C$ is strictly convex, the proof of Theorem 2 in Sardy, Bruce, and Tseng (2000) can be applied with little change to yield that each accumulation point of $\mathbf{v}^0, \mathbf{v}^1, \ldots$ is a stationary point of $f$. Since $f$ is strictly convex, the accumulation point is the unique global minimum of $f$.

D Proof of Theorem 2

The dual problem (15) is of the form $\min_{\mathbf{w},z} f(\mathbf{w}, z) = f_0(\mathbf{w}, z) + h(\mathbf{w})$, where $f_0(\mathbf{w}, z) = z + g(B'\mathbf{w} + \mathbf{a}z)$. Since $\mathbf{a} > 0$, then $(\mathbf{0}, 1) \in \mathrm{dom}\, f_0$, so $\mathrm{dom}\, f_0$ is nonempty. Moreover, $f_0$ is continuously differentiable on $\mathrm{dom}\, f_0$, which is an open set. Thus, $f_0$ satisfies Assumption 1 in Tseng (2001). Also, $f_0$ and $h$ are convex functions, so $f$ is convex (and hence pseudoconvex). The $k$th iterate $(\mathbf{w}^k, z^k)$ generated by the BCR method satisfies $\|\mathbf{w}^k\|_\infty \le \lambda$. Since $f(\mathbf{w}^k, z^k) = f_0(\mathbf{w}^k, z^k)$ is non-increasing with $k$, this and the fact that $\log(\cdot)$ tends to infinity sublinearly imply that $z^k$ is bounded. Thus $(\mathbf{w}^k, z^k)$ lies in a compact subset of $\mathrm{dom}\, f_0$. Then Lemma 3.1 and Theorem 4.1(a) in Tseng (2001) imply that each accumulation point of $(\mathbf{w}^0, z^0), (\mathbf{w}^1, z^1), \ldots$ is a stationary point of $f$.

E Proof of Property 3

Consider any $1 \le i \le N-1$ with $n_i/a_i \ge n_{i+1}/a_{i+1}$. Suppose $\hat f_{\lambda,i} < \hat f_{\lambda,i+1}$ for some $\lambda \ge 0$. We will show that, by increasing $f_i$ and decreasing $f_{i+1}$, we can obtain a lower cost for (6), thus contradicting $\hat{\mathbf{f}}_\lambda$ being a global minimum of (6). In particular, consider moving from $\hat{\mathbf{f}}_\lambda$ in the direction $\mathbf{d}$ with components
$$ d_j = \begin{cases} 1 & \text{if } j = i, \\ -a_i/a_{i+1} & \text{if } j = i+1, \\ 0 & \text{else.} \end{cases} $$
Then $\mathbf{a}'\mathbf{d} = 0$, so that $\mathbf{a}'(\hat{\mathbf{f}}_\lambda + \alpha\mathbf{d}) = 1$ for all $\alpha \in \mathbb{R}$. Moreover $\hat{\mathbf{f}}_\lambda + \alpha\mathbf{d} > 0$ and $\|B(\hat{\mathbf{f}}_\lambda + \alpha\mathbf{d})\|_1 \le \|B\hat{\mathbf{f}}_\lambda\|_1$ (where the matrix $B$ is defined in Section 2.1) for all $\alpha > 0$ sufficiently small, and the directional derivative of $-\sum_{i=1}^{N} n_i \log f_i$ in the direction $\mathbf{d}$ at $\hat{\mathbf{f}}_\lambda$ is $-\frac{n_i}{\hat f_{\lambda,i}} + \frac{n_{i+1}}{\hat f_{\lambda,i+1}}\frac{a_i}{a_{i+1}}$. Since $n_i/a_i \ge n_{i+1}/a_{i+1} > 0$ and $0 < \hat f_{\lambda,i} < \hat f_{\lambda,i+1}$, this directional derivative is negative. Thus $\hat{\mathbf{f}}_\lambda + \alpha\mathbf{d}$ improves (6) strictly compared to $\hat{\mathbf{f}}_\lambda$ for all $\alpha > 0$ sufficiently small. This contradicts $\hat{\mathbf{f}}_\lambda$ being a global minimum.

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory, pp. 267–281. Budapest: Akademiai Kiado. Eds. B. N. Petrov and F. Csaki.

Besag, J. (1986). On the statistical analysis of dirty pictures (with discussion). Journal of the Royal Statistical Society, Series B 48, 192–236.

Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik 31, 377–403.

Davies, P. L. and Kovac, A. (2004). Densities, spectral densities and modality. The Annals of Statistics 32, 1093–1136.

Donoho, D., Johnstone, I., Kerkyacharian, G., and Picard, D. (1995). Wavelet shrinkage: Asymptopia? (with discussion). Journal of the Royal Statistical Society, Series B 57 (2), 301–369.

Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425–455.

Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman & Hall Ltd.

Fu, W. J. (1998). Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics 7, 397–416.

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721–741.

Good, I. (1965). The Estimation of Probabilities: An Essay on Modern Bayesian Methods. Cambridge, MA: M.I.T. Press.

Good, I. J. and Gaskins, R. A. (1971). Nonparametric roughness penalties for probability densities. Biometrika 58, 255–277.

Hansen, M. H. and Kooperberg, C. (2002). Spline adaptation in extended linear models (with discussion). Statistical Science 17 (1), 2–20.

Izenman, A. J. and Sommer, C. J. (1988). Philatelic mixtures and multimodal densities. Journal of the American Statistical Association 83, 941–953.

Koenker, R. and Mizera, I. (2006). Density estimation by total variation regularization. Preprint.

Koenker, R., Ng, P., and Portnoy, S. (1994). Quantile smoothing splines. Biometrika 81, 673–680.

Kooperberg, C. and Stone, C. J. (1991). A study of logspline density estimation. Computational Statistics and Data Analysis 12, 327–347.

Kooperberg, C. and Stone, C. J. (2002). Logspline density estimation with free knots. Computational Statistics and Data Analysis 12, 327–347.

Mallows, C. L. (1973). Some comments on Cp. Technometrics 15, 661–675.

Marron, J. S. and Wand, M. P. (1992). Exact mean integrated squared error. The Annals of Statistics 20, 712–736.

O'Sullivan, F. (1988). Fast computation of fully automated log-density and log-hazard estimators. SIAM Journal on Scientific and Statistical Computing 9, 363–379.

Penev, S. and Dechevsky, L. (1997). On non-negative wavelet-based density estimators. Journal of Nonparametric Statistics 7, 365–394.

Pinheiro, A. and Vidakovic, B. (1997). Estimating the square root of a density via compactly supported wavelets. Computational Statistics and Data Analysis 25, 399–415.

Ramallingam, T. (1989). Symbolic computing the exact distributions of L-statistics from a Uniform distribution. Annals of the Institute of Statistical Mathematics 41, 677–681.

Renaud, O. (2002). Sensitivity and other properties of wavelet regression and density estimators. Statistica Sinica 12 (4), 1275–1290.

Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion; correction: 1998, 60, 661). Journal of the Royal Statistical Society, Series B 59, 731–758.

Rockafellar, R. T. (1970). Convex Analysis. Princeton: Princeton University Press.

Rockafellar, R. T. (1984). Network Flows and Monotropic Programming. New York: Wiley-Interscience; republished by Athena Scientific, Belmont, 1998.

Roeder, K. (1990). Density estimation with confidence sets exemplified by superclusters and voids in the galaxies. Journal of the American Statistical Association 85, 617–624.

Sardy, S. (2006). Estimating the dimension of parametric and nonparametric $\ell_\nu$ penalized likelihood models for adaptive sparsity. Submitted.

Sardy, S., Bruce, A., and Tseng, P. (2000). Block coordinate relaxation methods for nonparametric wavelet denoising. Journal of Computational and Graphical Statistics 9, 361–379.

Sardy, S. and Tseng, P. (2004). On the statistical analysis of smoothing by maximizing dirty Markov random field posterior distributions. Journal of the American Statistical Association 99, 191–204.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6, 461–464.

Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley.

Sheather, S. J. and Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B 53, 683–690.

Shorack, G. and Wellner, J. (1986). Empirical Processes with Applications to Statistics. Wiley.

Silverman, B. W. (1982). On the estimation of a probability density function by the maximum penalized likelihood method. The Annals of Statistics 10, 795–810.

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall.

Simonoff, J. S. (1996). Smoothing Methods in Statistics. New York: Springer-Verlag.

Stone, C. J. (1990). Large sample inference for logspline model. The Annals of Statistics 18, 717–741.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). Journal of the Royal Statistical Society, Series B 36, 111–147.

Tibshirani, R. (1995). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 57, 267–288.

Tikhonov, A. N. (1963). Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl. 4, 1035–1038.

Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications 109, 475–494.

Van der Laan, M. J., Dudoit, S., and Keles, S. (2004). Asymptotic optimality of likelihood-based cross-validation. Statistical Applications in Genetics and Molecular Biology 3, Article 4.

Vidakovic, B. (1999). Statistical Modeling by Wavelets. Wiley.

Wahba, G. (1990). Spline Models for Observational Data. Society for Industrial and Applied Mathematics.

Willett, R. and Nowak, R. D. (2003, August). Multiscale density estimation. Technical report, Rice University.
