廣義線性模型使用懲罰概似函數之模型選取

Model Selection for Generalized Linear Models
Using Penalized Likelihood

Student: Pei-Yun Sung (宋佩芸)
Advisor: Dr. Hsin-Cheng Huang (黃信誠)

National Chiao Tung University
Institute of Statistics
Master's Thesis

A Thesis Submitted to the Institute of Statistics, College of Science,
National Chiao Tung University, in Partial Fulfillment of the Requirements
for the Degree of Master in Statistics

June 2011
Hsinchu, Taiwan, Republic of China

廣義線性模型使用懲罰概似函數之模型選取

Student: Pei-Yun Sung        Advisor: Prof. Hsin-Cheng Huang

Institute of Statistics, National Chiao Tung University

Abstract

As more and more high-dimensional data need to be understood, identifying the important variables in a dataset has become an increasingly important problem in both linear regression and generalized linear regression. Many approaches have been proposed in the literature, including Akaike's information criterion (AIC), the Bayesian information criterion (BIC), and the Lasso, but AIC and BIC are difficult to implement when the number of variables is large, and are therefore usually carried out through some stepwise selection procedure. In this thesis, we extend the approach of Shen et al. (2010), applying penalized least squares with a variant of the L1 penalty to generalized linear regression; this approach can approximate AIC and BIC even when the number of variables is large. We introduce an efficient algorithm that solves the problem using difference convex programming (DCP), iteratively reweighted penalized least squares, and the coordinate descent algorithm. We investigate the properties of the proposed method and provide simulations demonstrating its advantages. Finally, the method is applied to a low birth weight dataset to identify the factors associated with a low birth weight baby.


Model Selection for Generalized Linear Models

Using Penalized Likelihood

Student: Pei-Yun Sung

Advisor: Dr. Hsin-Cheng Huang

Institute of Statistics, National Chiao Tung University

Abstract

With higher and higher dimensional data being available, identifying important variables among many variables has become more and more important in regression and generalized linear regression. Many approaches have been proposed in the literature, including Akaike's information criterion (AIC), the Bayesian information criterion (BIC), the Lasso, etc. However, both AIC and BIC are difficult to implement when the number of variables is large, and hence are usually carried out using some stepwise procedures. In this thesis, we extend an approach of Shen et al. (2010), who considered a penalized least squares method with a truncated L1 penalty for linear regression, to generalized linear regression. This approach enables us to well approximate AIC and BIC even when the number of variables is large. A computationally efficient algorithm is introduced, which utilizes difference convex programming, iteratively reweighted penalized least squares, and the coordinate descent algorithm. An oracle property of the proposed method is established, and some numerical examples are provided to demonstrate the superiority of the proposed method over AIC and BIC. Finally, the proposed method is applied to analyze a low birth weight dataset, in which we identify important variables associated with a low birth weight baby.

Key words: Akaike information criterion, Bayesian information criterion, coordinate descent, difference convex programming, Lasso, L0 penalty, oracle property.

Acknowledgments

During my two years as a master's student at the Institute of Statistics, National Chiao Tung University, I learned a great deal, from the professional knowledge my teachers taught in class to the life lessons shared in conversations between classes; I have truly benefited enormously. Besides my teachers' guidance, I also made many close friends, which made these two years full and happy, and for that I am very grateful.

For the completion of this thesis, I would like to thank my advisor, Prof. Hsin-Cheng Huang. Every discussion gave me a great deal of guidance, and whenever I encountered difficulties in my research he helped me work through them, enabling me to finish this thesis smoothly.

For my oral defense, I thank Prof. 許文郁 and Prof. 鄭少為 of the Institute of Statistics, National Tsing Hua University, and Prof. 王維菁 of our institute, who generously offered guidance and corrections on many details of the thesis and provided many valuable comments and suggestions that helped me complete its revision.

Finally, I thank my family and friends, who talked with me when I was under pressure and helped me through difficulties, allowing me to reach this milestone in my life.

Pei-Yun Sung
Institute of Statistics, National Chiao Tung University
June 2011

Contents

1 Introduction
2 Variable Selection
  2.1 Sequential Hypothesis Testing
  2.2 Information Criteria
  2.3 Lasso
3 The Proposed Method
  3.1 Computation Algorithm
  3.2 Selection of Tuning Parameters
4 Examples
  4.1 Poisson Regression
  4.2 Logistic Regression
5 Oracle Property
6 Numerical Examples
  6.1 Poisson Regression
  6.2 Logistic Regression
  6.3 Data Analysis: Low Birth Weight Data
7 Discussion
References
Appendices
  A Data analysis code
  B Poisson simulation code
  C Logistic simulation code

List of Tables

1  Performance of various methods for Poisson regression with p = 6 and n = 50, where values given in parentheses are the corresponding standard errors.
2  Performance of various methods for Poisson regression with p = 10 and n = 50, where values given in parentheses are the corresponding standard errors.
3  Performance of various methods for Poisson regression with p = 20 and n = 50, where values given in parentheses are the corresponding standard errors.
4  Performance of various methods for Poisson regression with p = 20 and n = 100, where values given in parentheses are the corresponding standard errors.
5  Performance of various methods for Poisson regression with p = 40 and n = 100, where values given in parentheses are the corresponding standard errors.
6  Approximations to AIC and BIC for Poisson regression with p = 6.
7  Approximations to AIC and BIC for Poisson regression with p = 10.
8  Performance of various methods for logistic regression with p = 6 and n = 100, where values given in parentheses are the corresponding standard errors.
9  Performance of various methods for logistic regression with p = 10 and n = 100, where values given in parentheses are the corresponding standard errors.
10 Performance of various methods for logistic regression with p = 20 and n = 100, where values given in parentheses are the corresponding standard errors.
11 Performance of various methods for logistic regression with p = 20 and n = 200, where values given in parentheses are the corresponding standard errors.
12 Performance of various methods for logistic regression with p = 40 and n = 200, where values given in parentheses are the corresponding standard errors.
13 Approximations to AIC and BIC for logistic regression with p = 6.
14 Approximations to AIC and BIC for logistic regression with p = 10.
15 Variables in the low birth weight dataset.
16 Estimated parameters obtained from various methods.

List of Figures

1 Truncated L1 penalty with τ = 1.
2 Decomposition of the penalty term of the objective function into two convex functions with τ = 1.
3 KL losses (left) and the corresponding BIC scores based on PML (right) for Poisson regression with p = 6 and n = 50.
4 KL losses (left) and the corresponding BIC scores based on PML (right) for Poisson regression with p = 10 and n = 50.
5 KL losses (left) and the corresponding BIC scores based on PML (right) for logistic regression with p = 6 and n = 50.
6 KL losses (left) and the corresponding BIC scores based on PML (right) for logistic regression with p = 10 and n = 50.

1 Introduction

Suppose that we observe data $\{(x_i, y_i) : i = 1, \ldots, n\}$, where $y_i$ and $x_i = (1, x_{i1}, \ldots, x_{ip})'$ represent the response and a $(p+1)$-dimensional vector of explanatory variables, respectively. If $y_1, \ldots, y_n$ are continuous variables that are roughly normally distributed, then the linear regression model is usually applied:
$$y_i = x_i'\beta + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2); \quad i = 1, \ldots, n,$$
where $\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$ is a vector of regression parameters. In some situations, $y_1, \ldots, y_n$ deviate from normal distributions or take only discrete values, and the usual linear regression model is no longer valid. Generalized linear models provide a flexible framework for these types of data.

For a generalized linear model, we assume that $y_i$ has the following probability density function (pdf):
$$f(y_i; \theta_i) = \exp\left\{\frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi)\right\}, \qquad (1.1)$$
for some known functions $a(\cdot)$, $b(\cdot)$, and $c(\cdot)$, where $\phi$ is a dispersion parameter. If $\phi$ is known, (1.1) forms an exponential family model with canonical parameter $\theta_i$. For example, the pdf of a normal distribution with mean $\mu_i$ and variance $\sigma_i^2$ can be written in the form of (1.1) with $\theta_i = \mu_i$, $\phi = \sigma_i^2$, $a(\phi) = \phi$, $b(\theta_i) = \theta_i^2/2$, and $c(y_i, \phi) = -(y_i^2/\sigma_i^2 + \log(2\pi\sigma_i^2))/2$. In general, the mean and the variance of $y_i$ can be expressed as:
$$\mu_i = E(y_i) = b'(\theta_i),$$
$$\sigma_i^2 = \mathrm{var}(y_i) = b''(\theta_i)\,a(\phi).$$

In a generalized linear model, there are two parts involved: the random component (response), $y_i$, and the systematic component (linear predictor), $\eta_i = x_i'\beta$, for $i = 1, \ldots, n$. The two parts are related through a link function $g(\cdot)$ as follows:
$$\mu_i = E(y_i \mid x_i), \qquad (1.2)$$
$$g(\mu_i) = g(b'(\theta_i)) = x_i'\beta. \qquad (1.3)$$
A link function $g(\cdot)$ is called the canonical link if it satisfies $g(\mu_i) = \theta_i = x_i'\beta$. For example, the identity function is the canonical link for normal regression, the log function is that for Poisson regression, and the logit function is that for logistic regression.

The log-likelihood function under the canonical link can be written as
$$\ell(\beta) = \sum_{i=1}^n \left\{\frac{y_i x_i'\beta - b(x_i'\beta)}{a(\phi)} + c(y_i, \phi)\right\}, \qquad (1.4)$$
with its first and second derivatives given by
$$\frac{\partial}{\partial\beta}\ell(\beta) = \sum_{i=1}^n x_i\big(y_i - \mu_i(\beta)\big)/a(\phi), \qquad \frac{\partial^2}{\partial\beta\,\partial\beta'}\ell(\beta) = -\sum_{i=1}^n \sigma_i^2(\beta)\,x_i x_i'/a(\phi). \qquad (1.5)$$
The parameter vector $\beta$ is usually estimated by maximum likelihood (ML). The ML estimate of $\beta$, given by $\arg\max_\beta \ell(\beta)$, can be obtained by solving the score equation $\partial\ell/\partial\beta = 0$, with $\partial\ell/\partial\beta$ being the score function.

The rest of this thesis is organized as follows. In Chapter 2, we introduce some commonly used model selection methods for generalized linear models. In Chapter 3, we introduce a penalized likelihood method with a truncated L1 (TL) penalty, which, similar to the Lasso (Tibshirani 1996), allows simultaneous variable selection and parameter estimation. But unlike the Lasso, which tends to produce biased estimates, the proposed method shrinks only small coefficients to zero and not large coefficients. In Chapter 4, we give two examples showing more details about the proposed method. The oracle property is established in Chapter 5, and some numerical examples are provided in Chapter 6. Some brief discussion is given in Chapter 7. Finally, the appendix contains the R code used in the numerical examples.

2 Variable Selection

Consider the log-likelihood function given in (1.4) with $p$ variables. Then there are $2^p$ candidate models to choose from. Basically, a large model tends to produce low bias but high variance, and vice versa. Some commonly used methods are given in the following subsections.

2.1 Sequential Hypothesis Testing

One approach to identify important variables is to perform a sequence of tests using a forward selection procedure, a backward selection procedure, or a stepwise procedure that mixes between the two. Forward selection starts from the null model with only the intercept term and adds one variable at a time until some criterion is met. At each step, the variable with the smallest p-value is added until no p-value is smaller than a pre-specified threshold. Backward selection performs similarly but in the opposite direction by starting from the full model. At each step, the variable corresponding to the largest p-value is removed until all variables have p-values smaller than a pre-specified threshold. These procedures involve multiple tests, which make theoretical justification of this method difficult.

2.2 Information Criteria

Another approach for variable selection is to apply an information criterion, which selects the model by minimizing a cost function of the form
$$-\frac{1}{n}\ell(\beta) + J(\beta), \qquad (2.1)$$
where $J(\beta)$ is a penalty term, which is typically larger for a larger model. By choosing a proper penalty term that suitably controls the trade-off between goodness-of-fit and model parsimony, a good estimate of $\beta$ balancing bias and variance can be obtained. For example, the generalized information criterion (GIC; Nishii 1984) has the following L0 penalty:
$$-\frac{1}{n}\ell(\beta) + \lambda\sum_{j=1}^p I(\beta_j \neq 0), \qquad (2.2)$$
which penalizes the number of parameters. It includes Akaike's information criterion (AIC; Akaike 1974) with $\lambda = 1/n$ and the Bayesian information criterion (BIC; Schwarz 1978) with $\lambda = \log(n)/(2n)$ as special cases. Although asymptotic justification of GIC can be found in Shao (1997), it is difficult to minimize GIC directly, because GIC is nonconvex and discontinuous, which generally requires computing GIC values for $2^p$ candidate models separately, and hence is computationally infeasible when $p$ is large. Consequently, a stepwise procedure similar to those described in Section 2.1 is often applied, resulting in less satisfactory estimates.
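For reference, such a stepwise AIC or BIC search can be carried out in R with the step() function. The following minimal sketch is not part of the thesis; the simulated data, sample size, and variable names are purely illustrative, and k = 2 or k = log(n) switches between the AIC and BIC penalties.

set.seed(1)
n <- 100; p <- 6
x <- matrix(rnorm(n * p), n, p)
y <- rpois(n, exp(1 + 2 * x[, 1] - x[, 2]))
dat <- data.frame(y = y, x)

null <- glm(y ~ 1, data = dat, family = poisson)   # intercept-only model
full <- glm(y ~ ., data = dat, family = poisson)   # model with all candidate variables

fit.aic <- step(null, scope = formula(full), direction = "both", k = 2,      trace = 0)
fit.bic <- step(null, scope = formula(full), direction = "both", k = log(n), trace = 0)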

2.3 Lasso

Instead of considering the penalized likelihood with an L0 penalty as in (2.2), a penalized likelihood with an L1 penalty can be used, called the least absolute shrinkage and selection operator (Lasso), which has been very popular in recent years. The Lasso estimate minimizes
$$-\frac{1}{n}\ell(\beta) + \lambda\sum_{j=1}^p |\beta_j|, \qquad (2.3)$$
where $\lambda$ is a tuning parameter controlling the shrinkage of $\beta$ toward zero, with a large $\lambda$ corresponding to a higher degree of shrinkage. Different from ridge regression, which substitutes an L2 penalty $\sum_{j=1}^p \beta_j^2$ for $\sum_{j=1}^p |\beta_j|$ in (2.3), the Lasso not only does shrinkage, but also estimates some components of $\beta$ as exactly zero. Consequently, it provides parameter estimation and model selection simultaneously. Originally, the Lasso was solved using quadratic programming (Tibshirani 1996), which is not particularly fast. It was not until the introduction of the least angle regression (LARS) algorithm (Efron et al. 2004) that the Lasso became very popular, because the entire Lasso solution path over $\lambda$ can be computed in an order equivalent to solving the ordinary least squares estimate based on the full model. Interestingly, the LARS algorithm is equivalent to a homotopy algorithm introduced earlier by Osborne et al. (2000) for solving the Lasso. Recently, Friedman et al. (2007) demonstrated that the simple coordinate descent algorithm can be even faster than the LARS algorithm.

However, the Lasso is known to produce biased estimates of $\beta$, and hence can only achieve selection consistency under some restricted conditions (Zhao and Yu 2006). Some nonconvex penalties have been proposed to remedy the bias problem of the Lasso in the context of linear regression, including the SCAD penalty of Fan and Li (2001), the MCP penalty of Zhang (2010), and the TL penalty of Shen and Huang (2010) and Shen et al. (2010).
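As a point of reference for the numerical work in Chapter 6 and the appendix, the Lasso path (2.3) for a generalized linear model can be obtained with the R package glmnet, which implements the coordinate descent algorithm of Friedman et al. (2010). This is a minimal sketch only; the simulated data are illustrative.

library(glmnet)

set.seed(1)
n <- 100; p <- 6
x <- matrix(rnorm(n * p), n, p)
y <- rpois(n, exp(1 + 2 * x[, 1] - x[, 2]))

fit <- glmnet(x, y, family = "poisson", standardize = FALSE)  # whole lambda path
cv  <- cv.glmnet(x, y, family = "poisson")                    # lambda chosen by cross-validation
coef(cv, s = "lambda.min")                                    # sparse coefficient vector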

3 The Proposed Method

Consider the generalized linear model given by (1.4). Following the TL penalty introduced by Shen and Huang (2010) and Shen et al. (2010), we propose to estimate $\beta$ by minimizing the following penalized log-likelihood function:
$$S(\beta) = -\frac{1}{n}\ell(\beta) + \lambda\sum_{j=1}^p \min\left\{\frac{|\beta_j|}{\tau}, 1\right\}, \qquad (3.1)$$
where $\lambda \ge 0$ is a tuning parameter controlling the degree of shrinkage, with a larger $\lambda$ corresponding to a higher degree of shrinkage, and $\tau > 0$ is a thresholding parameter.


Figure 1: Truncated L1 penalty with τ = 1.

Since $\min\{|\beta|/\tau, 1\} \to I(\beta \neq 0)$ as $\tau \to 0$, the L0 penalized log-likelihood of (2.2) can be well approximated by $S(\beta)$ if $\tau$ is sufficiently small. By choosing a small $\tau$, we obtain approximate AIC and BIC with $\lambda = 1/n$ and $\lambda = \log(n)/(2n)$, respectively. On the other hand, when $\tau$ is large, $S(\beta)$ can be rewritten as $-\frac{1}{n}\ell(\beta) + \frac{\lambda}{\tau}\sum_{j=1}^p |\beta_j|$, leading to the usual Lasso with tuning parameter $\lambda/\tau$. The penalty function $\min\{|\beta|/\tau, 1\}$ with $\tau = 1$ is shown in Figure 1.
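To visualize these limits, the TL penalty of (3.1) can be coded directly. The following small sketch (illustrative only, not from the thesis) reproduces the shape in Figure 1 and its two limiting cases.

## Truncated L1 penalty of (3.1), applied componentwise.
tl.penalty <- function(beta, tau) pmin(abs(beta) / tau, 1)

beta <- seq(-2, 2, by = 0.01)
plot(beta, tl.penalty(beta, tau = 1), type = "l",
     xlab = expression(beta), ylab = "penalty")       # shape shown in Figure 1
lines(beta, tl.penalty(beta, tau = 0.05), lty = 2)    # small tau: close to the L0 indicator
lines(beta, tl.penalty(beta, tau = 5), lty = 3)       # large tau: proportional to |beta| (Lasso)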

Although from (1.5) the log-likelihood function $\ell(\beta)$ is concave, the objective function $S(\beta)$ of (3.1) is not convex unless $\tau = \infty$, and in general it is difficult to minimize directly. We propose to solve the nonconvex optimization problem of (3.1) using difference convex programming (DCP) (An and Tao 1997), iteratively reweighted penalized least squares, and the coordinate descent algorithm, which will be introduced in the next subsection.

3.1 Computation Algorithm

Since $S(\cdot)$ is not a convex function, we adopt the approach of Shen and Huang (2010) and Shen et al. (2010) by finding a minimizer of (3.1) through DCP, which decomposes $S(\cdot)$ into a difference of two convex functions:
$$S(\beta) = S_1(\beta) - S_2(\beta), \qquad (3.2)$$
where $S_1(\beta) = -\frac{1}{n}\ell(\beta) + \frac{\lambda}{\tau}\sum_{j=1}^p |\beta_j|$ and $S_2(\beta) = \lambda\sum_{j=1}^p \max\{|\beta_j|/\tau - 1, 0\}$.


Figure 2: Decomposition of the penalty term of the objective function into two convex functions with τ = 1.

Figure 2 shows the decomposition of $\min\{|\beta|/\tau, 1\}$ into the difference of two convex functions, $|\beta|/\tau - \max(|\beta| - \tau, 0)/\tau$. The main idea of DCP is to approximate a minimum of $S(\beta)$ iteratively by the minimizers of a sequence of convex functions. Specifically, we successively replace $S_2(\beta)$ at the $m$-th iteration with its first-order approximation
$$S_2^{(m)}(\beta) = S_2\big(\hat\beta^{(m-1)}\big) + \big(|\beta| - |\hat\beta^{(m-1)}|\big)^T \nabla S_2\big(\hat\beta^{(m-1)}\big) = \lambda\sum_{j=1}^p \left(\frac{|\beta_j|}{\tau} - 1\right) I\big(|\hat\beta_j^{(m-1)}| > \tau\big),$$
evaluated at the solution $\hat\beta^{(m-1)} = (\hat\beta_1^{(m-1)}, \ldots, \hat\beta_p^{(m-1)})'$ obtained in the previous step, where the $j$-th element of $\nabla S_2(\beta)$ is $\frac{\lambda}{\tau} I(|\beta_j| > \tau)\,\mathrm{sign}(\beta_j)$, $|\beta| \equiv (|\beta_1|, \ldots, |\beta_p|)'$, and $|\hat\beta^{(m-1)}|$ is defined similarly. Thus we obtain a sequence of upper approximation functions given by
$$S^{(m)}(\beta) = S_1(\beta) - S_2^{(m)}(\beta) = -\frac{1}{n}\ell(\beta) + \frac{\lambda}{\tau}\sum_{j=1}^p |\beta_j|\, I\big(|\hat\beta_j^{(m-1)}| \le \tau\big) - \lambda\sum_{j=1}^p I\big(|\hat\beta_j^{(m-1)}| > \tau\big), \qquad (3.3)$$
where the last term of (3.3), involving no $\beta$, can be removed when solving for $\hat\beta^{(m)}$. Note that $S^{(m)}(\cdot)$ is a convex function and has a form similar to the Lasso in (2.3), except that $\lambda$ is replaced by $\lambda/\tau$ and the L1 penalty is only on those $\beta_j$'s for which $|\hat\beta_j^{(m-1)}| < \tau$. Clearly, $\mathrm{sign}\big(|\hat\beta_j^{(m)}| - \tau\big) = \mathrm{sign}\big(|\hat\beta_j^{(m-1)}| - \tau\big)$ for all $j = 1, \ldots, p$ implies that $\hat\beta^{(m+1)} = \hat\beta^{(m)}$, which happens when the components of $\hat\beta^{(m+1)}$ show no crossing over $\tau$ from $\hat\beta^{(m)}$. Since there are only a finite number of possible combinations of $\big(\mathrm{sign}(|\hat\beta_1^{(m)}| - \tau), \ldots, \mathrm{sign}(|\hat\beta_p^{(m)}| - \tau)\big)'$, the DCP algorithm has the notable feature that it converges in a finite number of steps. We denote the solution after convergence based on tuning parameters $\lambda$ and $\tau$ by $\hat\beta^{(\infty)}(\lambda, \tau)$.

The penalized likelihood $S^{(m)}(\cdot)$ of (3.3) can be minimized efficiently using the iteratively reweighted penalized least squares (IRPLS) method. First, we consider a second-order approximation $\ell_Q(\beta)$ to the log-likelihood $\ell(\beta)$ at $\tilde\beta$:
$$\ell(\beta) \approx \ell_Q(\beta) = \ell(\tilde\beta) + (\beta - \tilde\beta)'\nabla\ell(\tilde\beta) + \frac{1}{2}(\beta - \tilde\beta)'\nabla^2\ell(\tilde\beta)(\beta - \tilde\beta) = -\frac{1}{2a(\phi)}\sum_{i=1}^n \sigma_i^2(\tilde\beta)\left\{x_i'\tilde\beta + \frac{y_i - \mu_i(\tilde\beta)}{\sigma_i^2(\tilde\beta)} - x_i'\beta\right\}^2 + C(\tilde\beta), \qquad (3.4)$$
where $C(\tilde\beta)$ is some constant independent of $\beta$. Then (3.3) can be solved by iteratively minimizing
$$\frac{1}{2n}\sum_{i=1}^n w_i(\tilde\beta)\big\{z_i(\tilde\beta) - x_i'\beta\big\}^2 + \frac{\lambda}{\tau}\sum_{j=1}^p |\beta_j|\, I\big(|\hat\beta_j^{(m-1)}| \le \tau\big), \qquad (3.5)$$
with $\tilde\beta$ being the current estimate of $\beta$ in IRPLS, where $w_i(\tilde\beta) = \sigma_i^2(\tilde\beta)/a(\phi)$ and $z_i(\tilde\beta) = x_i'\tilde\beta + (y_i - \mu_i(\tilde\beta))/\sigma_i^2(\tilde\beta)$.

As in Friedman et al. (2007) and Friedman et al. (2010), we apply the coordinate descent algorithm to solve the Lasso problem of (3.5), which iteratively minimizes over one parameter at a time while holding the others fixed until convergence. The coordinate descent algorithm iteratively updates $\beta_0, \beta_1, \ldots, \beta_p$ by
$$\tilde{\tilde\beta}_0^{(\mathrm{new})} = \frac{\sum_{i=1}^n w_i(\tilde\beta)\Big(z_i(\tilde\beta) - \sum_{j=1}^p x_{ij}\tilde{\tilde\beta}_j^{(\mathrm{old})}\Big)}{\sum_{i=1}^n w_i(\tilde\beta)}, \qquad (3.6)$$
and
$$\tilde{\tilde\beta}_j^{(\mathrm{new})} = \frac{S\Big(\frac{1}{n}\sum_{i=1}^n w_i(\tilde\beta)\,x_{ij}\big\{z_i(\tilde\beta) - x_i'\tilde{\tilde\beta}^{(\mathrm{old})} + x_{ij}\tilde{\tilde\beta}_j^{(\mathrm{old})}\big\},\ \frac{\lambda}{\tau} I\big(|\hat\beta_j^{(m-1)}| \le \tau\big)\Big)}{\frac{1}{n}\sum_{i=1}^n w_i(\tilde\beta)\,x_{ij}^2}, \qquad (3.7)$$
for $j = 1, \ldots, p$, where $S(x, \alpha) \equiv \mathrm{sign}(x)(|x| - \alpha)_+$ is the soft-thresholding operator.

Note that the solution $\hat\beta^{(m)}$ of (3.3) is continuous and piecewise linear in $\lambda$. Therefore, we solve for $\hat\beta^{(m)}$ along a decreasing sequence of $\lambda$ values, with each solution being used as a warm start for the next. Given $\lambda$ and $\tau$, the proposed algorithm for solving (3.1) can be summarized as follows:

• Outer loop: Solve for the DCP solution $\hat\beta^{(m)}(\lambda, \tau)$ of (3.3) until $\mathrm{sign}(|\hat\beta_j^{(m)}| - \tau) = \mathrm{sign}(|\hat\beta_j^{(m-1)}| - \tau)$ for $j = 1, \ldots, p$.

• Middle loop: Update the quadratic approximation expressed in (3.5).

• Inner loop: Apply the coordinate descent algorithm with the updating formulae given by (3.6) and (3.7).
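The inner-loop building blocks can be written compactly in R. The sketch below is illustrative only (the thesis' own implementation is the CoorDecent() function in Appendix A); it shows the soft-thresholding operator S(x, α) and the update (3.7) for a single coefficient, where w and z are the IRPLS weights and working responses, X is the design matrix without the intercept column, and lam.j stands for (λ/τ) I(|β̂_j^(m−1)| ≤ τ).

## Soft-thresholding operator S(x, alpha) = sign(x) (|x| - alpha)_+ .
soft <- function(x, alpha) sign(x) * pmax(abs(x) - alpha, 0)

## One coordinate-descent update of beta[j] for the weighted Lasso problem (3.5).
update.beta.j <- function(j, beta0, beta, X, w, z, lam.j) {
  n <- nrow(X)
  partial <- z - beta0 - as.vector(X %*% beta) + X[, j] * beta[j]  # leave variable j out
  num <- sum(w * X[, j] * partial) / n
  den <- sum(w * X[, j]^2) / n
  soft(num, lam.j) / den
}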

In practice, we compute the penalized ML (PML) estimate $\hat\beta^{(\infty)}(\lambda, \tau)$ of $\beta$ at the evaluation points $(\lambda, \tau) \in \{\lambda_1, \ldots, \lambda_U\} \times \{\tau_1, \ldots, \tau_V\}$, where $\lambda_1 > \lambda_2 > \cdots > \lambda_U$ and $\tau_1 > \tau_2 > \cdots > \tau_V$. Since the solution $\hat\beta^{(m)}(\lambda, \tau)$ of (3.3) is continuous in $\lambda$ for each $m$ and $\tau$, to ensure fast convergence, the proposed algorithm is applied in the following order:

1. Start by computing $\hat\beta^{(\infty)}(\lambda_1, \tau_1)$ with initial value $\hat\beta^{(0)}(\lambda_1, \tau_1) = 0$.

2. Successively compute $\hat\beta^{(\infty)}(\lambda_1, \tau_v)$ with initial value $\hat\beta^{(0)}(\lambda_1, \tau_v) = \hat\beta^{(\infty)}(\lambda_1, \tau_{v-1})$, for $v = 2, \ldots, V$.

3. For each $v = 1, \ldots, V$, successively compute $\hat\beta^{(\infty)}(\lambda_u, \tau_v)$ with initial value $\hat\beta^{(0)}(\lambda_u, \tau_v) = \hat\beta^{(\infty)}(\lambda_{u-1}, \tau_v)$, for $u = 2, \ldots, U$.
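The schedule in steps 1–3 amounts to a double loop with warm starts. In the sketch below, pml.fit() is a hypothetical wrapper (not a function defined in this thesis) for one run of the DCP/IRPLS/coordinate-descent solver that returns a list with component beta, and lambda.grid, tau.grid, y, x, and p are assumed to come from a setup such as the one sketched in Section 6.1.

## Warm-start schedule over the (lambda, tau) grid, following steps 1-3 above.
U <- length(lambda.grid); V <- length(tau.grid)
fit <- replicate(U, vector("list", V), simplify = FALSE)   # fit[[u]][[v]]

for (v in 1:V) {                                           # steps 1-2: sweep tau at lambda_1
  init <- if (v == 1) rep(0, p + 1) else fit[[1]][[v - 1]]$beta
  fit[[1]][[v]] <- pml.fit(y, x, lambda.grid[1], tau.grid[v], beta.init = init)
}
for (v in 1:V) {                                           # step 3: sweep lambda for each tau
  for (u in 2:U) {
    fit[[u]][[v]] <- pml.fit(y, x, lambda.grid[u], tau.grid[v],
                             beta.init = fit[[u - 1]][[v]]$beta)
  }
}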

3.2 Selection of Tuning Parameters

In the previous subsection, we focused on how to obtain the PML solution of (3.1) for a fixed pair $(\lambda, \tau)$. In this subsection, we introduce two methods, $K$-fold cross-validation (CV) and BIC, to select the tuning parameters $\lambda$ and $\tau$. Let $Y = (y_1, \ldots, y_n)'$ denote the response and $X = (x_1, \ldots, x_n)'$ denote the design matrix. For $K$-fold CV, the index set $\{1, \ldots, n\}$ and the data $(Y, X)$ are randomly divided into $K$ sections of roughly the same size. Denote the $k$-th section of data by $(Y_k, X_k)$, $k = 1, \ldots, K$. For each $k$, we obtain the PML estimate $\hat\beta_{-k}^{(\infty)}(\lambda, \tau)$ based on the data $(Y_{-k}, X_{-k})$ of all the other $K-1$ sections. To assess the prediction performance of $\hat\mu_{-k}^{(\infty)}(\lambda, \tau)$ on $Y_k$, we consider the following deviance criterion:
$$\mathrm{Dev}\big(\hat\beta_{-k}^{(\infty)}(\lambda, \tau)\big) \equiv -2\log\frac{L_k\big(\hat\mu_{-k}^{(\infty)}(\lambda, \tau), Y_k\big)}{L_k(Y_k, Y_k)}, \qquad (3.8)$$
where $\hat\mu_{-k}^{(\infty)}(\lambda, \tau)$ is the estimate of $\mu_k$ based on $\hat\beta_{-k}^{(\infty)}(\lambda, \tau)$, and $L_k\big(\hat\mu_{-k}^{(\infty)}(\lambda, \tau), Y_k\big)$ is the likelihood function of $Y_k$ with mean $\hat\mu_{-k}^{(\infty)}(\lambda, \tau)$. Then our CV criterion is defined by the total deviance:
$$\mathrm{CV}(\lambda, \tau) \equiv \sum_{k=1}^K \mathrm{Dev}\big(\hat\beta_{-k}^{(\infty)}(\lambda, \tau)\big). \qquad (3.9)$$
The $K$-fold CV then selects the pair $(\lambda, \tau)$ with the smallest CV value.

The second criterion we consider for tuning parameter selection is similar to the usual BIC criterion for model selection:
$$\mathrm{BIC}(\lambda, \tau) \equiv -2\ell\big(\hat\beta^{(\infty)}(\lambda, \tau)\big) + \log(n)\sum_{j=1}^p I\big(\hat\beta_j^{(\infty)}(\lambda, \tau) \neq 0\big).$$
The BIC criterion selects the tuning parameters $(\hat\lambda, \hat\tau) = \arg\min_{(\lambda, \tau)} \mathrm{BIC}(\lambda, \tau)$. The resulting estimate of $\beta$ is given by $\hat\beta^{(\infty)}(\hat\lambda, \hat\tau)$. We shall apply this criterion for all of our numerical examples, because from our limited experience, it appears to perform better and is computationally more efficient than CV.
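As an illustration, for the Poisson case of Section 4.1 the BIC criterion above can be computed directly from a fitted coefficient vector; the function below is a sketch only (the appendix function lambda_choose_BIC() computes the analogous quantity for the logistic case).

## BIC(lambda, tau) of Section 3.2 for a Poisson fit; beta.hat is the fitted
## coefficient vector (intercept first) at a given (lambda, tau).
bic.pml.poisson <- function(beta.hat, y, x) {
  eta <- as.vector(cbind(1, x) %*% beta.hat)
  loglik <- sum(y * eta - exp(eta) - lgamma(y + 1))   # Poisson log-likelihood (4.1)
  -2 * loglik + log(length(y)) * sum(beta.hat[-1] != 0)
}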

4 Examples

To understand how the proposed PML method works, we provide two examples: Poisson regression and logistic regression, with detailed computation steps in this section.

4.1 Poisson Regression

For Poisson regression with the canonical link function $g(\mu_i) = \log\mu_i = x_i'\beta$, the log-likelihood can be expressed as
$$\ell(\beta) = -\sum_{i=1}^n \exp(x_i'\beta) + \sum_{i=1}^n y_i(x_i'\beta) - \sum_{i=1}^n \log y_i!\,. \qquad (4.1)$$
The first and second derivatives of the log-likelihood function are
$$\nabla\ell(\beta) = -\sum_{i=1}^n \exp(x_i'\beta)\,x_i + \sum_{i=1}^n y_i x_i = \sum_{i=1}^n x_i(y_i - \mu_i),$$
$$\nabla^2\ell(\beta) = -\sum_{i=1}^n \exp(x_i'\beta)\,x_i x_i' = -\sum_{i=1}^n \mu_i x_i x_i'.$$
Thus from (3.4), the quadratic approximation of the log-likelihood function expanded at $\tilde\beta$ is
$$-\frac{1}{2n}\sum_{i=1}^n \mu_i(\tilde\beta)\left\{x_i'\tilde\beta + \frac{y_i - \mu_i(\tilde\beta)}{\mu_i(\tilde\beta)} - x_i'\beta\right\}^2 + C(\tilde\beta). \qquad (4.2)$$
Consequently, the $m$-th step DCP solution $\hat\beta^{(m)}$ of (3.3) can be obtained by iteratively solving
$$\arg\min_\beta\left[\frac{1}{2n}\sum_{i=1}^n \mu_i(\tilde\beta)\left\{x_i'\tilde\beta + \frac{y_i - \mu_i(\tilde\beta)}{\mu_i(\tilde\beta)} - x_i'\beta\right\}^2 + \frac{\lambda}{\tau}\sum_{j=1}^p |\beta_j|\,I\big(|\hat\beta_j^{(m-1)}| \le \tau\big)\right],$$
using the coordinate descent algorithm given in (3.6) and (3.7). From (3.9), the tuning parameters selected by CV satisfy
$$(\hat\lambda, \hat\tau) = \arg\min_{(\lambda,\tau)} 2\sum_{i=1}^n \big\{y_i(\log y_i - \log\hat\mu_i(\lambda,\tau)) - (y_i - \hat\mu_i(\lambda,\tau))\big\}, \qquad (4.3)$$
and the tuning parameters selected by BIC satisfy
$$(\hat\lambda, \hat\tau) = \arg\min_{(\lambda,\tau)}\left[2\sum_{i=1}^n \big\{\exp(x_i'\hat\beta^{(\infty)}(\lambda,\tau)) - y_i(x_i'\hat\beta^{(\infty)}(\lambda,\tau))\big\} + \log(n)\sum_{j=1}^p I\big(\hat\beta_j^{(\infty)}(\lambda,\tau) \neq 0\big)\right].$$
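The Poisson-specific IRPLS quantities used above follow directly from $w_i = \sigma_i^2(\tilde\beta)/a(\phi) = \mu_i(\tilde\beta)$ and $z_i = x_i'\tilde\beta + (y_i - \mu_i(\tilde\beta))/\mu_i(\tilde\beta)$; a minimal sketch (illustrative, not the thesis code):

## IRPLS weights and working responses for Poisson regression with the log link.
irpls.poisson <- function(beta.tilde, y, X) {   # X includes the intercept column
  eta <- as.vector(X %*% beta.tilde)
  mu  <- exp(eta)
  list(w = mu, z = eta + (y - mu) / mu)
}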

4.2 Logistic Regression

The canonical link function for logistic regression is the logit function, $g(p_i) = \log\frac{p_i}{1-p_i} = x_i'\beta$, where $p_i$ is the Bernoulli probability of $y_i$, which equals $\exp(x_i'\beta)/(1+\exp(x_i'\beta))$ for $i = 1, \ldots, n$. With the canonical link function, the log-likelihood can be represented by
$$\ell(\beta) = \sum_{i=1}^n \big\{y_i(x_i'\beta) - \log(1+\exp(x_i'\beta))\big\}. \qquad (4.4)$$
The first and second derivatives of the log-likelihood function are
$$\nabla\ell(\beta) = \sum_{i=1}^n y_i x_i - \sum_{i=1}^n \frac{\exp(x_i'\beta)\,x_i}{1+\exp(x_i'\beta)} = \sum_{i=1}^n x_i(y_i - p_i),$$
$$\nabla^2\ell(\beta) = -\sum_{i=1}^n \frac{\exp(x_i'\beta)\,x_i x_i'}{1+\exp(x_i'\beta)}\left(1 - \frac{\exp(x_i'\beta)}{1+\exp(x_i'\beta)}\right) = -\sum_{i=1}^n p_i(1-p_i)\,x_i x_i'.$$
Thus from (3.4), the quadratic approximation of the log-likelihood function expanded at $\tilde\beta$ is
$$-\frac{1}{2n}\sum_{i=1}^n \tilde p_i(1-\tilde p_i)\left\{x_i'\tilde\beta + \frac{y_i - \tilde p_i}{\tilde p_i(1-\tilde p_i)} - x_i'\beta\right\}^2 + C(\tilde\beta), \qquad (4.5)$$
where $\tilde p_i = \exp(x_i'\tilde\beta)/(1+\exp(x_i'\tilde\beta))$. Consequently, the $m$-th step DCP solution $\hat\beta^{(m)}$ of (3.3) can be obtained by iteratively solving
$$\arg\min_\beta\left[\frac{1}{2n}\sum_{i=1}^n \tilde p_i(1-\tilde p_i)\left\{x_i'\tilde\beta + \frac{y_i - \tilde p_i}{\tilde p_i(1-\tilde p_i)} - x_i'\beta\right\}^2 + \frac{\lambda}{\tau}\sum_{j=1}^p |\beta_j|\,I\big(|\hat\beta_j^{(m-1)}| \le \tau\big)\right],$$
using the coordinate descent algorithm given in (3.6) and (3.7). From (3.9), the tuning parameters selected by CV satisfy
$$(\hat\lambda, \hat\tau) = \arg\min_{(\lambda,\tau)} 2\sum_{i=1}^n \left\{y_i\log\frac{y_i}{\hat p_i(\lambda,\tau)} + (1-y_i)\log\frac{1-y_i}{1-\hat p_i(\lambda,\tau)}\right\}, \qquad (4.6)$$
and the tuning parameters selected by BIC satisfy
$$(\hat\lambda, \hat\tau) = \arg\min_{(\lambda,\tau)}\left[2\sum_{i=1}^n \big\{\log(1+\exp(x_i'\hat\beta^{(\infty)}(\lambda,\tau))) - y_i(x_i'\hat\beta^{(\infty)}(\lambda,\tau))\big\} + \log(n)\sum_{j=1}^p I\big(\hat\beta_j^{(\infty)}(\lambda,\tau) \neq 0\big)\right].$$
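The analogous logistic IRPLS quantities are $w_i = \tilde p_i(1-\tilde p_i)$ and $z_i = x_i'\tilde\beta + (y_i - \tilde p_i)/\{\tilde p_i(1-\tilde p_i)\}$. The sketch below (illustrative only) also clips the fitted probabilities away from 0 and 1, mirroring the safeguard used in the CoorDecent() function of Appendix A.

## IRPLS weights and working responses for logistic regression with the logit link.
irpls.logistic <- function(beta.tilde, y, X) {   # X includes the intercept column
  eta <- as.vector(X %*% beta.tilde)
  p <- 1 / (1 + exp(-eta))
  p <- pmin(pmax(p, 1e-6), 1 - 1e-6)             # keep the weights strictly positive
  w <- p * (1 - p)
  list(w = w, z = eta + (y - p) / w)
}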

5 Oracle Property

Let $\beta_0 \equiv (\beta_{00}, \beta_{10}, \ldots, \beta_{p0})'$ be the true regression coefficients and let $A = \{j : \beta_{j0} \neq 0,\ j = 1, \ldots, p\}$ be the index set of the true model. For notational simplicity, the proposed estimate $\hat\beta^{(\infty)}(\lambda, \tau)$ is written as $\hat\beta^{(\infty)}$. Let $\hat A \equiv \{j = 1, \ldots, p : |\hat\beta_j^{(\infty)}| > 0\}$ be the index set selected by the proposed PML method, and let $\hat\beta^{(ml)} = (\hat\beta_0^{(ml)}, \hat\beta_1^{(ml)}, \ldots, \hat\beta_p^{(ml)})'$ be the maximum likelihood (ML) estimate of $\beta$ based on the true model, where $\hat\beta_j^{(ml)} = 0$ if $j \notin A$. We would like to establish $P\big(\hat\beta^{(\infty)} = \hat\beta^{(ml)}\big) \to 1$ as $n \to \infty$ under some regularity conditions.

We first show that $\hat\beta_j^{(\infty)} \neq \tau$ for $j = 1, \ldots, p$.

Lemma 5.1. Let $h(\cdot)$ be any differentiable function in $\mathbb{R}^p$ and $\beta^* = (\beta_1^*, \ldots, \beta_p^*)'$ be a local minimizer of $f(\beta) = h(\beta) + \lambda\sum_{j=1}^p \min\{|\beta_j|/\tau, 1\}$. Then $\beta_j^* \neq \tau$ for $j = 1, \ldots, p$.

Proof. We prove by contradiction. Suppose $\beta_k^* = \tau$ for some $k \in \{1, \ldots, p\}$ and define $J_j(|\beta_j|) = \min\{|\beta_j|/\tau, 1\}$. Let $f_k(\beta_k) = f(\beta_1^*, \ldots, \beta_k, \ldots, \beta_p^*)$ and $h_k(\beta_k) = h(\beta_1^*, \ldots, \beta_k, \ldots, \beta_p^*)$ with $\beta_j = \beta_j^*$ for $j \neq k$. Then $f_k(\beta_k) = h_k(\beta_k) + \lambda\sum_{j\neq k} J_j(|\beta_j^*|) + \lambda J_k(|\beta_k|)$. The right derivative of $J_k(|\beta_k|)$ at $\beta_k^*$ is 0 and the left derivative of $J_k(|\beta_k|)$ at $\beta_k^*$ is $1/\tau$. Let $D$ denote the derivative of $\sum_{j\neq k} J_j(|\beta_j^*|)$. Since $\beta_k^*$ is a local minimizer of $f_k(\beta_k)$, the right and left derivatives of $f_k(\beta_k)$ at $\beta_k^*$ are larger than 0 and smaller than 0, respectively. Then the right and left derivatives of $h_k(\beta_k)$ at $\beta_k^*$ are larger than $-\lambda D$ and smaller than $-\lambda(D + 1/\tau)$, respectively, which contradicts the fact that $h(\cdot)$ is a differentiable function in $\mathbb{R}^p$. Thus we conclude that $\beta_j^* \neq \tau$ for $j = 1, \ldots, p$.

Differentiating the objective function $S(\cdot)$ in (3.1) with respect to $\beta$ through the concept of subdifferentials, we obtain the local optimality condition
$$-\frac{1}{n}\frac{\partial}{\partial\beta_j}\ell(\beta) + b_j\frac{\lambda}{\tau} = 0, \quad \text{where } b_j \begin{cases} = \mathrm{sign}(\beta_j) & \text{if } 0 < |\beta_j| < \tau, \\ = 0 & \text{if } |\beta_j| > \tau, \\ \in [-1, 1] & \text{if } \beta_j = 0, \end{cases} \qquad (5.1)$$
for $j = 1, \ldots, p$. This local optimality condition is known as the Karush–Kuhn–Tucker (KKT) condition (see Lange 2004). Note that when $\beta_j = 0$, the local optimality condition can be represented by $\big|\frac{\partial}{\partial\beta_j}\ell(\beta)\big| \le n\lambda/\tau$.

We shall establish the oracle property of $\hat\beta^{(\infty)}$ in the theorem below by first showing that both $\hat\beta^{(\infty)}$ and $\hat\beta^{(ml)}$ satisfy the KKT condition of (5.1), and then showing that the solution of (3.1) is unique with probability tending to one under some conditions.

Theorem 5.2. Consider the generalized linear model of (1.4) with $\beta \in \Theta \subset \mathbb{R}^{p+1}$, where $\Theta$ is compact. Let $\hat\beta^{(\infty)}$ be the PML estimate of (3.1). Suppose that (i) $\max\{|A|, |\hat A|\} < q$ for some constant $0 < q < \infty$, (ii) $\inf_{j\in A} |\beta_{j0}| > 2\tau$, (iii) $n^{1/2}\lambda/\tau \to \infty$, and (iv) $\tau^2 c_{\min} - 4\lambda\sqrt{q} > 0$, where $c_{\min}$ denotes the minimum eigenvalue of $-\frac{1}{n}\min_{\beta\in\Theta}\nabla^2\ell(\beta)$. Then $P\big(\hat\beta^{(\infty)} = \hat\beta^{(ml)}\big) \to 1$ as $n \to \infty$.

Proof. Let $F = F_1 \cap F_2$, where
$$F_1 = \left\{\min_{j\in A} |\hat\beta_j^{(ml)}| > \frac{3}{2}\tau\right\} \quad \text{and} \quad F_2 = \left\{\max_{j\notin A}\left|\frac{1}{n}\frac{\partial}{\partial\beta_j}\ell(\hat\beta^{(ml)})\right| \le \frac{\lambda}{\tau}\right\}.$$
First, we show that both $\hat\beta^{(\infty)}$ and $\hat\beta^{(ml)}$ satisfy the KKT condition of (5.1) on $F$. Since $\hat\beta^{(m)}$ minimizes $S^{(m)}(\beta) = -\frac{1}{n}\ell(\beta) + \frac{\lambda}{\tau}\sum_{j=1}^p |\beta_j| I(|\hat\beta_j^{(m-1)}| \le \tau)$, it follows that $\hat\beta^{(\infty)}$ is a minimizer of $S^{(\infty)}(\beta) = -\frac{1}{n}\ell(\beta) + \frac{\lambda}{\tau}\sum_{j=1}^p |\beta_j| I(|\hat\beta_j^{(\infty)}| \le \tau)$. This implies that $\hat\beta^{(\infty)}$ satisfies the KKT condition. Regarding $\hat\beta^{(ml)}$, it clearly satisfies the KKT condition for $j \in A$ under $F_1$. In addition, it satisfies $\big|\frac{1}{n}\frac{\partial}{\partial\beta_j}\ell(\hat\beta^{(ml)})\big| \le \frac{\lambda}{\tau}$, and hence the KKT condition, for $j \notin A$ under $F_2$. Therefore, $\hat\beta^{(ml)}$ also satisfies the KKT condition under $F$.

Next, we show $\hat\beta^{(\infty)} = \hat\beta^{(ml)}$ on $F$. We prove by contradiction. Suppose that $\hat\beta^{(\infty)} \neq \hat\beta^{(ml)}$ on $F$. Consider the two cases below, where we use $\|\cdot\|$ to denote the L2 norm.

(i) $\|\hat\beta^{(\infty)} - \hat\beta^{(ml)}\| \le \tau/2$: For $j \in A$, since $|\hat\beta_j^{(ml)}| > 3\tau/2$, it implies $|\hat\beta_j^{(\infty)}| > \tau$. For $j \notin A$, since $\hat\beta_j^{(ml)} = 0$, it implies $|\hat\beta_j^{(\infty)}| < \tau$. Therefore, both estimates are minimizers of the strictly convex function $-\frac{1}{n}\ell(\beta) + \frac{\lambda}{\tau}\sum_{j\notin A} |\beta_j|$ on $F$, which has a unique minimum on $F$. Thus $\hat\beta^{(\infty)} = \hat\beta^{(ml)}$ on $F$.

(ii) $\|\hat\beta^{(\infty)} - \hat\beta^{(ml)}\| > \tau/2$: We have
$$\left(\frac{\partial}{\partial\beta_A} S(\hat\beta^{(\infty)}) - \frac{\partial}{\partial\beta_A} S(\hat\beta^{(ml)})\right)^T \frac{\hat\beta^{(\infty)} - \hat\beta^{(ml)}}{\|\hat\beta^{(\infty)} - \hat\beta^{(ml)}\|} \ge f_1 - f_2,$$
where
$$f_1 = \left(\frac{1}{n}\nabla\ell(\hat\beta^{(ml)}) - \frac{1}{n}\nabla\ell(\hat\beta^{(\infty)})\right)^T \frac{\hat\beta^{(\infty)} - \hat\beta^{(ml)}}{\|\hat\beta^{(\infty)} - \hat\beta^{(ml)}\|},$$
$$f_2 = \frac{\lambda}{\tau}\left(\mathrm{sign}(\hat\beta^{(\infty)})\, I(|\hat\beta^{(\infty)}| < \tau) - \mathrm{sign}(\hat\beta^{(ml)})\, I(|\hat\beta^{(ml)}| < \tau)\right)^T \frac{\hat\beta^{(\infty)} - \hat\beta^{(ml)}}{\|\hat\beta^{(\infty)} - \hat\beta^{(ml)}\|},$$
with $\mathrm{sign}(\cdot)$ and $I(\cdot)$ applied componentwise. For $f_1$, by the mean value theorem, there exists $\hat\beta^*$ on the line segment between $\hat\beta^{(\infty)}$ and $\hat\beta^{(ml)}$ such that
$$f_1 = -\frac{1}{n}\,\frac{(\hat\beta^{(\infty)} - \hat\beta^{(ml)})^T\,\nabla^2\ell(\hat\beta^*)\,(\hat\beta^{(\infty)} - \hat\beta^{(ml)})}{\|\hat\beta^{(\infty)} - \hat\beta^{(ml)}\|} \ge c_{\min}\|\hat\beta^{(\infty)} - \hat\beta^{(ml)}\| \ge c_{\min}\frac{\tau}{2}.$$
For $f_2$, by the Cauchy–Schwarz inequality,
$$f_2 \le \frac{\lambda}{\tau}\left\|\mathrm{sign}(\hat\beta^{(\infty)})\, I(|\hat\beta^{(\infty)}| < \tau) - \mathrm{sign}(\hat\beta^{(ml)})\, I(|\hat\beta^{(ml)}| < \tau)\right\| \le 2\sqrt{q}\,\lambda/\tau.$$
Consequently,
$$\left(\frac{\partial}{\partial\beta_A} S(\hat\beta^{(\infty)}) - \frac{\partial}{\partial\beta_A} S(\hat\beta^{(ml)})\right)^T \frac{\hat\beta^{(\infty)} - \hat\beta^{(ml)}}{\|\hat\beta^{(\infty)} - \hat\beta^{(ml)}\|} \ge c_{\min}\tau/2 - 2\sqrt{q}\,\lambda/\tau > 0,$$
which contradicts the fact that $0 \in \left(\frac{\partial}{\partial\beta_A} S(\hat\beta^{(\infty)}) - \frac{\partial}{\partial\beta_A} S(\hat\beta^{(ml)})\right)^T (\hat\beta^{(\infty)} - \hat\beta^{(ml)})$. Thus $\hat\beta^{(\infty)} = \hat\beta^{(ml)}$ on $F$.

Finally, it is straightforward to show that $P(F) \to 1$ as $n \to \infty$. This completes the proof.

Note that under some regularity conditions,
$$n^{1/2}\big(\hat\beta^{(ml)} - \beta_0\big) \xrightarrow{d} N\big(0, I^{-1}\big), \qquad (5.2)$$
where $I$ is the Fisher information. It follows that
$$n^{1/2}\big(\hat\beta^{(\infty)} - \beta_0\big) \xrightarrow{d} N\big(0, I^{-1}\big)$$
under (5.2) and the conditions given by the above theorem.

6 Numerical Examples

In this chapter, we consider two simulation experiments, one for Poisson regression and the other for logistic regression. We are interested in the performance of the proposed PML method in comparison with AIC, BIC, and their stepwise versions, in terms of the Kullback–Leibler (KL) risk and the probability of identifying the true model.

6.1 Poisson Regression

We generate data $y_1, \ldots, y_n$ independently from Poisson distributions with means $\mu_1, \ldots, \mu_n$ according to the model
$$\log\mu_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} = x_i'\beta; \quad i = 1, \ldots, n,$$
where the $x_i$'s are generated independently from $N(0, \Sigma)$ with $\rho^{|i-j|}$ being the $(i,j)$-th entry of $\Sigma$, and $\beta = (0, 2, -1, 0, \ldots, 0)'$. We apply the proposed PML method for $\tau = \tau_1, \ldots, \tau_5$ and $\lambda = \lambda_1, \ldots, \lambda_{100}$, which are equally spaced on the log scale. Throughout the simulation, we choose $\tau_1 = 5$ and $\tau_5 = 0.1$. In addition, we choose $\lambda_1$ to be the smallest value such that $\hat\beta_1^{(\infty)}(\lambda, 5) = \cdots = \hat\beta_p^{(\infty)}(\lambda, 5) = 0$, which can be obtained for the Lasso penalty from the R package "glmnet" developed by Friedman et al. (2010), and choose $\lambda_{100} = 0.001$.
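The simulation setup just described can be reproduced along the following lines. This is an illustrative sketch only; Appendix B contains the thesis' own simulation code, and the grid endpoints follow the text above.

library(MASS)     # mvrnorm() for the AR(1)-correlated predictors
library(glmnet)   # lambda_1 is taken from the Lasso path, as described above

set.seed(1)
n <- 50; p <- 6; rho <- 0.5
Sigma <- rho^abs(outer(1:p, 1:p, "-"))
x <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
beta <- c(0, 2, -1, rep(0, p - 2))                       # (beta_0, beta_1, ..., beta_p)'
y <- rpois(n, exp(cbind(1, x) %*% beta))

tau.grid <- 10^seq(log10(5), log10(0.1), length.out = 5)        # tau_1 = 5, ..., tau_5 = 0.1
lambda1 <- glmnet(x, y, family = "poisson", standardize = FALSE)$lambda[1]
lambda.grid <- 10^seq(log10(lambda1), log10(0.001), length.out = 100)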

Table 1: Performance of various methods for Poisson regression with p = 6 and n = 50, where values given in parentheses are the corresponding standard errors.

Method              KL risk (ρ = 0)   Error rate (ρ = 0)   KL risk (ρ = 0.5)   Error rate (ρ = 0.5)
Full model MLE      3.822 (0.273)     1.00                 3.582 (0.319)       1.00
True model MLE      1.610 (0.173)     0.00                 1.603 (0.191)       0.00
AIC                 2.978 (0.298)     0.53                 2.717 (0.325)       0.42
BIC                 2.135 (0.244)     0.17                 2.042 (0.258)       0.14
β̂^(∞)(λ̂, 0.1)       2.045 (0.245)     0.13                 1.983 (0.258)       0.12
β̂^(∞)(λ̂, 0.27)      1.728 (0.181)     0.08                 1.771 (0.212)       0.09
β̂^(∞)(λ̂, 0.71)      1.909 (0.216)     0.14                 2.428 (0.338)       0.23
β̂^(∞)(λ̂, 1.88)      3.393 (0.282)     0.65                 3.386 (0.347)       0.70
β̂^(∞)(λ̂, 5)         3.396 (0.278)     0.74                 3.439 (0.342)       0.73
β̂^(∞)(λ̂, ∞)         3.396 (0.278)     0.74                 3.439 (0.342)       0.73
β̂^(∞)(λ̂, τ̂)         2.030 (0.243)     0.12                 1.975 (0.258)       0.11

For each method, we estimate the Kullback–Leibler (KL) risk, which is the expectation of the KL loss function. The KL loss measures the discrepancy between the estimated and the true parameters. The KL loss for the Poisson case is defined by
$$\sum_{i=1}^n \big\{\mu_i(\log\mu_i - \log\hat\mu_i) - (\mu_i - \hat\mu_i)\big\}.$$
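This loss is straightforward to compute; a minimal sketch (illustrative only), with mu the true means and mu.hat the estimated means:

## Poisson KL loss between true means mu and estimated means mu.hat.
kl.loss.poisson <- function(mu, mu.hat) {
  sum(mu * (log(mu) - log(mu.hat)) - (mu - mu.hat))
}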

From the KL risk, we can see whether our method outperforms the other competitors. Tables 1 and 2 present the results of 100 simulation replicates, showing the KL risk together with its standard error. The error rate is estimated by the proportion of the 100 replicates in which a method chose a wrong model. The tuning parameters are determined by the BIC score, choosing the estimates corresponding to the smallest BIC value.

From the results of Tables 1 and 2, we notice that when there are more zero coefficients than nonzero coefficients, our method performs better than the others. The ML estimate of β based on the full model is more unstable. With a suitable pair of tuning parameters (λ, τ), the PML method performs model selection and estimation as well as AIC/BIC and sometimes even outperforms them. When the number of candidate variables increases, AIC and BIC tend to consume a great deal of computation time. Therefore, PML offers an efficient way to select a suitable model and estimate the coefficients simultaneously, which not only avoids examining every possible model but also provides better estimates with smaller risk and standard error.

Table 2: Performance of various methods for Poisson regression with p = 10 and n = 50, where values given in parentheses are the corresponding standard errors.

Method              KL risk (ρ = 0)   Error rate (ρ = 0)   KL risk (ρ = 0.5)   Error rate (ρ = 0.5)
Full model MLE      5.731 (0.320)     1.00                 5.830 (0.370)       1.00
True model MLE      1.610 (0.173)     0.00                 1.603 (0.191)       0.00
AIC                 3.931 (0.346)     0.75                 3.955 (0.373)       0.73
BIC                 2.529 (0.286)     0.27                 2.522 (0.311)       0.26
β̂^(∞)(λ̂, 0.1)       2.050 (0.243)     0.14                 2.317 (0.306)       0.20
β̂^(∞)(λ̂, 0.27)      1.689 (0.174)     0.08                 1.915 (0.251)       0.15
β̂^(∞)(λ̂, 0.71)      1.957 (0.235)     0.19                 2.812 (0.390)       0.36
β̂^(∞)(λ̂, 1.88)      4.012 (0.327)     0.80                 4.421 (0.366)       0.86
β̂^(∞)(λ̂, 5)         3.902 (0.299)     0.89                 4.405 (0.361)       0.90
β̂^(∞)(λ̂, ∞)         3.902 (0.299)     0.89                 4.405 (0.361)       0.90
β̂^(∞)(λ̂, τ̂)         2.045 (0.242)     0.14                 2.309 (0.306)       0.19

Figures 3 and 4 show how the KL loss and the BIC score behave along the path of λ* on the log scale, where each curve corresponds to one candidate value of τ. The curve for each τ has a minimum, and the two panels present almost the same pattern, so it is natural to use the BIC score as the criterion for determining the tuning parameters. Even for different values of τ, the BIC score helps us find a good λ* for each τ and gives reasonable estimates. Note that, of the two tuning parameters, λ plays a more important role in model selection than τ does.

From the small-p cases, we can see that the PML method provides estimates that are more precise and accurate than the others. However, high-dimensional problems have become increasingly important, so we are interested in whether PML remains feasible as p increases. We therefore try larger p = 20 and p = 40 for Poisson regression. Note that p is too large for the exhaustive AIC/BIC methods, since there are 2^p models to compare; we use stepwise AIC/BIC instead. The results are shown in Tables 3 to 5.

Under high-dimensional settings, the benefits of the PML method are even clearer. Compared with the other methods, our method yields smaller risk and smaller standard errors, and its proportion of selecting a wrong model is lower than that of its competitors.

Figure 3: KL losses (left) and the corresponding BIC scores based on PML (right) for Poisson regression with p = 6 and n = 50.

Figure 4: KL losses (left) and the corresponding BIC scores based on PML (right) for Poisson regression with p = 10 and n = 50.

Table 3: Performance of various methods for Poisson regression with p = 20 and n = 50, where values given in parentheses are the corresponding standard errors.

Method              KL risk (ρ = 0)   Error rate (ρ = 0)   KL risk (ρ = 0.5)   Error rate (ρ = 0.5)
Full model MLE      12.016 (0.610)    1.00                 12.216 (0.528)      1.00
True model MLE      1.610 (0.173)     0.00                 1.603 (0.191)       0.00
Stepwise AIC        8.223 (0.571)     0.94                 8.306 (0.537)       0.97
Stepwise BIC        4.628 (0.502)     0.52                 4.500 (0.505)       0.54
β̂^(∞)(λ̂, 0.1)       2.367 (0.300)     0.20                 2.928 (0.430)       0.29
β̂^(∞)(λ̂, 0.27)      1.746 (0.181)     0.10                 1.846 (0.256)       0.15
β̂^(∞)(λ̂, 0.71)      2.107 (0.251)     0.25                 3.324 (0.425)       0.54
β̂^(∞)(λ̂, 1.88)      5.104 (0.340)     0.93                 6.203 (0.456)       0.98
β̂^(∞)(λ̂, 5)         4.963 (0.325)     0.95                 6.234 (0.455)       1.00
β̂^(∞)(λ̂, ∞)         4.963 (0.325)     0.95                 6.234 (0.455)       1.00
β̂^(∞)(λ̂, τ̂)         2.310 (0.295)     0.18                 2.858 (0.429)       0.26

Table 4: Performance of various methods for Poisson regression with p = 20 and n = 100, where values given in parentheses are the corresponding standard errors.

Method              KL risk (ρ = 0)   Error rate (ρ = 0)   KL risk (ρ = 0.5)   Error rate (ρ = 0.5)
Full model MLE      10.555 (0.317)    1.00                 10.786 (0.310)      1.00
True model MLE      1.652 (0.145)     0.00                 1.725 (0.145)       0.00
Stepwise AIC        7.057 (0.354)     0.92                 7.185 (0.332)       0.94
Stepwise BIC        3.411 (0.268)     0.39                 3.076 (0.257)       0.31
β̂^(∞)(λ̂, 0.1)       1.927 (0.162)     0.11                 1.954 (0.163)       0.08
β̂^(∞)(λ̂, 0.27)      1.758 (0.151)     0.09                 1.766 (0.145)       0.04
β̂^(∞)(λ̂, 0.71)      1.760 (0.151)     0.11                 2.310 (0.281)       0.08
β̂^(∞)(λ̂, 1.88)      5.278 (0.399)     0.89                 6.222 (0.396)       0.88
β̂^(∞)(λ̂, 5)         5.378 (0.389)     0.87                 6.442 (0.390)       0.90
β̂^(∞)(λ̂, ∞)         5.378 (0.389)     0.97                 6.442 (0.390)       0.90
β̂^(∞)(λ̂, τ̂)         1.927 (0.162)     0.11                 1.954 (0.163)       0.08

Table 5: Performance of various methods for Poisson regression with p = 40 and n = 100, where values given in parentheses are the corresponding standard errors.

Method              KL risk (ρ = 0)   Error rate (ρ = 0)   KL risk (ρ = 0.5)   Error rate (ρ = 0.5)
Full model MLE      22.497 (0.554)    1.00                 22.585 (0.519)      1.00
True model MLE      1.652 (0.145)     0.00                 1.725 (0.145)       0.00
Stepwise AIC        14.538 (0.585)    1.00                 14.246 (0.501)      1.00
Stepwise BIC        5.364 (0.435)     0.58                 5.011 (0.400)       0.55
β̂^(∞)(λ̂, 0.1)       1.968 (0.184)     0.13                 2.001 (0.157)       0.11
β̂^(∞)(λ̂, 0.27)      1.864 (0.166)     0.12                 1.750 (0.145)       0.03
β̂^(∞)(λ̂, 0.71)      1.938 (0.187)     0.15                 2.905 (0.375)       0.22
β̂^(∞)(λ̂, 1.88)      6.430 (0.409)     0.92                 8.435 (0.451)       0.98
β̂^(∞)(λ̂, 5)         7.098 (0.399)     1.00                 8.481 (0.444)       1.00
β̂^(∞)(λ̂, ∞)         7.098 (0.399)     1.00                 8.481 (0.444)       1.00
β̂^(∞)(λ̂, τ̂)         1.968 (0.184)     0.13                 1.982 (0.158)       0.10

Note that our objective function (3.1) approximates AIC/BIC when τ is close to zero. We are therefore interested in whether our L0 approach works as an approximation to exhaustive AIC and BIC, that is, whether AIC/BIC and our approach select the same model and whether their AIC/BIC scores are close in the small-p cases. Tables 6 and 7 show the proportion of replicates in which our approach well approximates the results of AIC and BIC. From the results, we can see that a larger sample size leads to a larger well-approximation proportion.

6.2 Logistic Regression

For the logistic regression simulations, the observations $y_1, y_2, \ldots, y_n$ are generated from Bernoulli distributions with probabilities $p_i$ satisfying the model
$$\log\frac{p_i}{1-p_i} = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} = x_i'\beta,$$
where $x_i = (x_{i0}, x_{i1}, \ldots, x_{ip})'$ and $\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$ are $(p+1)$-dimensional vectors. The predictors are generated independently from the normal $N(0, \Sigma_{p\times p})$ distribution, where $\Sigma_{p\times p}$ denotes the correlation matrix of an AR(1) time series model with $(i,j)$-th entry $\rho^{|i-j|}$. We assign $\beta = (0, 3, -2, 0, \ldots, 0)'$ and follow the same procedure as in the Poisson regression case.

Table 6: Approximations to AIC and BIC for Poisson regression with p = 6.

            Proportion selecting    Proportion with        Proportion of
            the same model          the same estimates     both events
            AIC        BIC          AIC        BIC          AIC        BIC
n = 50
τ = 0.5     0.02       0            0          0            0          0
τ = 0.1     0.26       0.63         0.10       0.55         0.10       0.55
τ = 0.01    0.41       0.46         0.41       0.46         0.41       0.46
τ = 0.001   0.05       0.04         0.05       0.04         0.05       0.04
n = 100
τ = 0.5     0          0            0          0            0          0
τ = 0.1     0.05       0.27         0          0.26         0          0.26
τ = 0.01    0.45       0.78         0.44       0.78         0.44       0.78
τ = 0.001   0.14       0.05         0.14       0.05         0.14       0.05
n = 200
τ = 0.5     0          0            0          0            0          0
τ = 0.1     0.03       0.10         0          0.08         0          0.08
τ = 0.01    0.67       0.85         0.64       0.85         0.64       0.85
τ = 0.001   0.43       0.26         0.43       0.26         0.43       0.26
n = 500
τ = 0.5     0          0            0          0            0          0
τ = 0.1     0.02       0            0          0            0          0
τ = 0.01    0.76       0.91         0.59       0.91         0.59       0.91
τ = 0.001   0.55       0.89         0.55       0.89         0.55       0.89

Table 7: Approximations to AIC and BIC for Poisson regression with p = 10.

            Proportion selecting    Proportion with        Proportion of
            the same model          the same estimates     both events
            AIC        BIC          AIC        BIC          AIC        BIC
n = 50
τ = 0.5     0          0            0          0            0          0
τ = 0.1     0.08       0.40         0.04       0.34         0.04       0.34
τ = 0.01    0.16       0.38         0.16       0.38         0.16       0.38
τ = 0.001   0.04       0.01         0.03       0            0.03       0
n = 100
τ = 0.5     0          0            0          0            0          0
τ = 0.1     0          0.10         0          0.05         0          0.05
τ = 0.01    0.21       0.62         0.21       0.62         0.21       0.62
τ = 0.001   0.09       0.05         0.09       0.05         0.09       0.05
n = 200
τ = 0.5     0          0            0          0            0          0
τ = 0.1     0          0.02         0          0.01         0          0.01
τ = 0.01    0.27       0.71         0.26       0.71         0.26       0.71
τ = 0.001   0.30       0.14         0.30       0.14         0.30       0.14
n = 500
τ = 0.5     0          0            0          0            0          0
τ = 0.1     0          0            0          0            0          0
τ = 0.01    0.52       0.81         0.31       0.81         0.31       0.81
τ = 0.001   0.24       0.71         0.24       0.71         0.24       0.71

Table 8: Performance of various methods for logistic regression with p = 6 and n = 100, where values given in parentheses are the corresponding standard errors.

Method              KL risk (ρ = 0)   Error rate (ρ = 0)   KL risk (ρ = 0.5)   Error rate (ρ = 0.5)
Full model MLE      4.906 (0.399)     1.00                 4.133 (0.242)       1.00
True model MLE      2.004 (0.204)     0.00                 1.591 (0.128)       0.00
AIC                 3.727 (0.352)     0.47                 3.074 (0.234)       0.47
BIC                 2.558 (0.255)     0.12                 2.017 (0.187)       0.10
β̂^(∞)(λ̂, 0.1)       2.558 (0.255)     0.12                 2.017 (0.187)       0.10
β̂^(∞)(λ̂, 0.27)      2.561 (0.200)     0.11                 2.017 (0.187)       0.10
β̂^(∞)(λ̂, 0.71)      2.317 (0.236)     0.09                 1.875 (0.188)       0.09
β̂^(∞)(λ̂, 1.88)      3.693 (0.256)     0.56                 3.045 (0.183)       0.74
β̂^(∞)(λ̂, 5)         4.095 (0.312)     0.60                 3.154 (0.166)       0.74
β̂^(∞)(λ̂, ∞)         4.108 (0.313)     0.60                 3.154 (0.166)       0.74
β̂^(∞)(λ̂, τ̂)         2.558 (0.255)     0.12                 2.017 (0.187)       0.10

The KL loss for logistic regression is defined by

$$\sum_{i=1}^n \left\{p_i\log\frac{p_i}{\hat p_i} + (1-p_i)\log\frac{1-p_i}{1-\hat p_i}\right\}.$$

First, we consider the smaller-dimension cases with p = 6 and p = 10, and ρ = 0 and ρ = 0.5. Tables 8 and 9 present the results of 100 simulation replicates and show the KL risk together with its standard error. The error rate is estimated by the proportion of the 100 replicates in which a method chose a wrong model. The tuning parameters are determined by the BIC score method, choosing the estimates with the smallest BIC score.

Compared with the Poisson simulation results, the results for logistic regression are quite unstable. The reason can be seen from Figures 5 and 6: the patterns in the KL loss and the BIC score do not match. We choose the estimates with the smallest BIC value, but the chosen estimates do not attain the point with the smallest loss. The most likely reason is that the sample size is too small to estimate the coefficients precisely and accurately, since the binary responses carry little information. Although we cannot choose the optimal solution, the PML method is still competitive: its KL risk is smaller than that of the other methods, and its proportion of selecting a wrong model is smaller than that of its competitors, so we can still conclude that our method does better.

Table 9: Performance of various methods for logistic regression with p = 10 and n = 100, where values given in parentheses are the corresponding standard errors.

Method              KL risk (ρ = 0)   Error rate (ρ = 0)   KL risk (ρ = 0.5)   Error rate (ρ = 0.5)
Full model MLE      9.528 (0.741)     1.00                 8.442 (0.674)       1.00
True model MLE      1.804 (0.210)     0.00                 2.001 (0.310)       0.00
AIC                 6.405 (0.505)     0.85                 5.899 (0.574)       0.89
BIC                 2.949 (0.288)     0.26                 3.133 (0.388)       0.24
β̂^(∞)(λ̂, 0.1)       2.955 (0.307)     0.25                 3.092 (0.393)       0.22
β̂^(∞)(λ̂, 0.27)      2.680 (0.289)     0.18                 2.849 (0.390)       0.14
β̂^(∞)(λ̂, 0.71)      2.139 (0.237)     0.12                 2.359 (0.340)       0.18
β̂^(∞)(λ̂, 1.88)      4.449 (0.267)     0.82                 4.706 (0.309)       0.82
β̂^(∞)(λ̂, 5)         4.408 (0.210)     0.75                 4.831 (0.254)       0.80
β̂^(∞)(λ̂, ∞)         4.408 (0.210)     0.75                 4.985 (0.299)       0.80
β̂^(∞)(λ̂, τ̂)         2.898 (0.286)     0.25                 3.092 (0.393)       0.22

Figure 5: KL losses (left) and the corresponding BIC scores based on PML (right) for logistic regression with p = 6 and n = 50.

Figure 6: KL losses (left) and the corresponding BIC scores based on PML (right) for logistic regression with p = 10 and n = 50.

We are also interested in high-dimensional logistic regression, so we try p = 20 and p = 40. Exact AIC and BIC are replaced by the stepwise AIC/BIC procedures. Tables 10 to 12 show the simulation results.

From the results in Tables 10 to 12, our method still gives reasonable estimates, with smaller risk and error rate than the others, so we conclude that our method is effective.

Similarly, we are interested in whether our L0 approach works for approximating the exact AIC and BIC methods. Tables 13 and 14 show the proportion of replicates in which our approach well approximates the results of AIC and BIC.

6.3 Data Analysis: Low Birth Weight Data

We apply our method to the low birth weight dataset of Hosmer and Lemeshow (1989). The data on 189 newborn babies were collected at Baystate Medical Center, Springfield, MA, during 1986. The data contain a binary response indicating whether a newborn baby has a low birth weight, together with several risk factors associated with low birth weight, which are used as explanatory variables. We apply the following model:

Table 10: Performance of various methods for logistic regression with p = 20 and n = 100, where values given in parentheses are the corresponding standard errors.

Method              KL risk (ρ = 0)   Error rate (ρ = 0)   KL risk (ρ = 0.5)   Error rate (ρ = 0.5)
Full model MLE      83.310 (8.767)    1.00                 42.549 (6.512)      1.00
True model MLE      1.839 (0.209)     0.00                 2.001 (0.309)       0.00
Stepwise AIC        75.063 (9.184)    0.98                 33.391 (6.673)      0.98
Stepwise BIC        67.630 (9.637)    0.70                 24.598 (6.801)      0.55
β̂^(∞)(λ̂, 0.1)       8.111 (1.527)     0.52                 5.304 (0.595)       0.38
β̂^(∞)(λ̂, 0.27)      6.947 (1.757)     0.27                 4.033 (0.519)       0.22
β̂^(∞)(λ̂, 0.71)      4.777 (1.390)     0.25                 2.690 (0.389)       0.34
β̂^(∞)(λ̂, 1.88)      10.523 (1.966)    0.73                 6.690 (0.403)       0.88
β̂^(∞)(λ̂, 5)         11.642 (2.470)    0.65                 8.335 (1.611)       0.84
β̂^(∞)(λ̂, ∞)         6.624 (0.400)     0.64                 6.733 (0.305)       0.84
β̂^(∞)(λ̂, τ̂)         9.297 (1.976)     0.52                 5.341 (0.594)       0.39

Table 11: Performance of various methods for logistic regression with p = 20 and n = 200, where values given in parentheses are the corresponding standard errors.

Method              KL risk (ρ = 0)   Error rate (ρ = 0)   KL risk (ρ = 0.5)   Error rate (ρ = 0.5)
Full model MLE      15.659 (0.801)    1.00                 13.712 (0.334)      1.00
True model MLE      1.767 (0.127)     0.00                 1.576 (0.090)       0.00
Stepwise AIC        9.683 (0.591)     0.93                 8.632 (0.316)       0.96
Stepwise BIC        3.618 (0.243)     0.36                 3.556 (0.217)       0.38
β̂^(∞)(λ̂, 0.1)       3.456 (0.239)     0.32                 3.286 (0.202)       0.35
β̂^(∞)(λ̂, 0.27)      2.111 (0.164)     0.05                 2.231 (0.169)       0.10
β̂^(∞)(λ̂, 0.71)      1.852 (0.128)     0.05                 1.739 (0.093)       0.11
β̂^(∞)(λ̂, 1.88)      4.835 (0.224)     0.75                 6.290 (0.319)       0.81
β̂^(∞)(λ̂, 5)         6.759 (0.182)     0.63                 7.904 (0.275)       0.81
β̂^(∞)(λ̂, ∞)         6.759 (0.182)     0.63                 7.904 (0.275)       0.81
β̂^(∞)(λ̂, τ̂)         3.456 (0.239)     0.32                 3.286 (0.202)       0.35

Table 12: Performance of various methods for logistic regression with p = 40 and n = 200, where values given in parentheses are the corresponding standard errors.

Method              KL risk (ρ = 0)   Error rate (ρ = 0)   KL risk (ρ = 0.5)   Error rate (ρ = 0.5)
Full model MLE      124.839 (9.943)   1.00                 48.942 (2.483)      1.00
True model MLE      1.767 (0.127)     0.00                 1.576 (0.090)       0.00
Stepwise AIC        93.931 (10.402)   1.00                 26.552 (1.370)      1.00
Stepwise BIC        67.207 (10.908)   0.61                 6.841 (0.472)       0.63
β̂^(∞)(λ̂, 0.1)       6.228 (0.695)     0.43                 5.472 (0.402)       0.51
β̂^(∞)(λ̂, 0.27)      3.266 (0.600)     0.08                 2.042 (0.157)       0.07
β̂^(∞)(λ̂, 0.71)      1.819 (0.127)     0.06                 1.696 (0.094)       0.13
β̂^(∞)(λ̂, 1.88)      7.179 (0.298)     0.77                 8.579 (0.382)       0.87
β̂^(∞)(λ̂, 5)         9.340 (0.198)     0.55                 10.947 (0.291)      0.78
β̂^(∞)(λ̂, ∞)         9.340 (0.198)     0.55                 10.947 (0.291)      0.78
β̂^(∞)(λ̂, τ̂)         6.228 (0.695)     0.43                 5.472 (0.402)       0.51

Table 13: Approximations to AIC and BIC for logistic regression with p = 6.

            Proportion selecting    Proportion with        Proportion of
            the same model          the same estimates     both events
            AIC        BIC          AIC        BIC          AIC        BIC
n = 100
τ = 0.5     0.23       0.67         0.11       0.58         0.11       0.58
τ = 0.1     0.50       0.17         0.50       0.09         0.50       0.09
τ = 0.01    0          0            0          0            0          0
τ = 0.001   0          0            0          0            0          0
n = 200
τ = 0.5     0.09       0.43         0.02       0.37         0.02       0.37
τ = 0.1     0.53       0.87         0.53       0.87         0.53       0.87
τ = 0.01    0          0            0          0            0          0
τ = 0.001   0          0            0          0            0          0
n = 500
τ = 0.5     0.01       0.14         0          0.12         0          0.12
τ = 0.1     0.93       0.99         0.56       0.99         0.56       0.99
τ = 0.01    0.26       0            0.24       0            0.24       0
τ = 0.001   0          0            0          0            0          0
n = 1000
τ = 0.5     0.01       0.09         0          0.09         0          0.09
τ = 0.1     0.54       0.97         0.29       0.97         0.29       0.97
τ = 0.01    0.44       0            0.44       0            0.44       0
τ = 0.001   0          0            0          0            0          0

Table 14: Approximations to AIC and BIC for logistic regression with p = 10.

            Proportion selecting    Proportion with        Proportion of
            the same model          the same estimates     both events
            AIC        BIC          AIC        BIC          AIC        BIC
n = 100
τ = 0.5     0.09       0.43         0          0.33         0          0.33
τ = 0.1     0.27       0.18         0.27       0.09         0.17       0.09
τ = 0.01    0          0            0          0            0          0
τ = 0.001   0          0            0          0            0          0
n = 200
τ = 0.5     0.01       0.17         0          0.16         0          0.16
τ = 0.1     0.25       0.86         0.25       0.86         0.25       0.86
τ = 0.01    0          0            0          0            0          0
τ = 0.001   0          0            0          0            0          0
n = 500
τ = 0.5     0          0            0          0            0          0
τ = 0.1     0.82       0.91         0.33       0.91         0.33       0.91
τ = 0.01    0.13       0            0.12       0            0.12       0
τ = 0.001   0          0            0          0            0          0
n = 1000
τ = 0.5     0          0.01         0          0.01         0          0.01
τ = 0.1     0.32       0.93         0.12       0.93         0.12       0.93
τ = 0.01    0.26       0            0.26       0            0.26       0
τ = 0.001   0          0            0          0            0          0

$$\mathrm{low} = \beta_0 + \beta_1\times\mathrm{age} + \beta_2\times\mathrm{lwt} + \beta_3\times\mathrm{race{:}white} + \beta_4\times\mathrm{race{:}black} + \beta_5\times\mathrm{smoke} + \beta_6\times\mathrm{ht} + \beta_7\times\mathrm{ui} + \beta_8\times\mathrm{ftv} + \beta_9\times\mathrm{ptl},$$

where age and lwt are standardized to have mean 0 and variance 1. The details of these predictors are shown in Table 15, and the selected models and estimates are presented in Table 16. The standard deviations of the estimated coefficients, obtained using a parametric bootstrap method, are shown as "boot.std" in Table 16. We also show the standard deviations "glm.std" obtained from the selected model using the R function glm.
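This analysis can be set up in R along the following lines, assuming the data correspond to the birthwt data frame shipped with the MASS package (the same Baystate Medical Center study). The sketch builds the dummy variables for race, standardizes age and lwt, and fits the full logistic regression with glm(), which is how "glm.std"-type standard deviations of a selected model would be obtained.

library(MASS)     # birthwt: the Hosmer-Lemeshow low birth weight data

bw <- birthwt
bw$age <- as.vector(scale(bw$age))             # standardize age and lwt
bw$lwt <- as.vector(scale(bw$lwt))
bw$race.white <- as.numeric(bw$race == 1)      # race dummies: white and black
bw$race.black <- as.numeric(bw$race == 2)

full <- glm(low ~ age + lwt + race.white + race.black + smoke + ht + ui + ftv + ptl,
            data = bw, family = binomial)
summary(full)$coefficients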

From the results shown in Table 16, we can see that AIC selected more variables than the others. The variables lwt and ptl are clearly important risk factors in this study: it is intuitive that the mother's weight and whether she had previous premature labors directly influence the chance of having a low birth weight baby. One thing worth mentioning is that both AIC and BIC select the variable ht, but PML does not. Note that only 12 of the 189 observations have a history of hypertension, so it is not easy to tell whether this variable is important. Another reason can be seen from the "glm.std" and "boot.std" columns: compared with the standard deviations

Table 15: Variables in the low birth weight dataset.

Name    Description
low     indicator of birth weight less than 2.5 kg
age     mother's age in years
lwt     mother's weight in pounds at last menstrual period
race    mother's race ("white", "black", "other")
smoke   smoking status during pregnancy
ht      history of hypertension
ui      presence of uterine irritability
ftv     number of physician visits during the first trimester
ptl     number of previous premature labors

Table 16: Estimated parameters obtained from various methods.

Variable         PML                   AIC                             BIC
λ̂*               0.0355
τ̂                0.01
                 coeff (boot.std)      coeff (glm.std) (boot.std)      coeff (glm.std) (boot.std)
intercept        -1.116 (0.260)        -1.211 (0.279) (0.399)          -1.225 (0.200) (0.257)
age              -0.322 (0.274)         0.000 (0.000) (0.189)           0.000 (0.000) (0.083)
lwt              -0.321 (0.278)        -0.431 (0.201) (0.294)          -0.523 (0.207) (0.341)
race:white        0.000 (0.069)        -1.011 (0.396) (0.562)           0.000 (0.000) (0.089)
race:black        0.000 (0.171)         0.000 (0.000) (0.478)           0.000 (0.000) (0.223)
smoke             0.000 (0.127)         0.931 (0.399) (0.530)           0.000 (0.000) (0.122)
ht                0.000 (0.254)         1.848 (0.705) (1.123)           1.888 (0.720) (2.200)
ui                0.000 (0.274)         0.739 (0.461) (0.691)           0.000 (0.000) (0.250)
ftv               0.000 (0.110)         0.000 (0.000) (0.332)           0.000 (0.000) (0.192)
ptl               1.523 (0.733)         1.119 (0.451) (0.607)           1.407 (0.428) (0.679)
Size of model     3                     6                               3
Log-likelihood   -107.181              -99.309                         -105.131
BIC score         230.087               230.068                         225.987

obtained from the parametric bootstrap, the standard deviations obtained by glm are much smaller. This is somewhat expected, since they do not take model selection into account and are therefore expected to be underestimated.

7 Discussion

In the context of model selection under generalized linear models, we propose a PML method that enables simultaneous model selection and parameter estimation. Despite using a nonconvex penalty, the proposed estimates can be efficiently computed for high-dimensional data thanks to difference convex programming and the coordinate descent algorithm. In addition, some numerical and theoretical results are provided to demonstrate the effectiveness of the proposed method.

References

[1] Akaike, H. (1973). Information theory and the maximum likelihood principle. In International Symposium on Information Theory. Edited by Petrov, V. and Csáki, F. Akadémiai Kiadó, Budapest.

[2] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716-723.

[3] An, H. L. T. and Tao, P. D. (1997). Solving a class of linearly constrained indefinite quadratic problems by D.C. algorithms. Journal of Global Optimization, 11, 253-285.

[4] Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression (with discussion). The Annals of Statistics, 32, 407-499.

[5] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348-1360.

[6] Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1, 302-332.

[7] Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33.

[8] Hosmer, D. W. and Lemeshow, S. (1989). Applied Logistic Regression. Wiley, New York.

[9] Lange, K. (2004). Optimization. Springer, New York.

[10] Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. The Annals of Statistics, 12, 758-765.

[11] Osborne, M., Presnell, B., and Turlach, B. (2000). On the LASSO and its dual. Journal of Computational and Graphical Statistics, 9, 319-337.

[12] Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461-464.

[13] Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica, 7, 221-242.

[14] Shen, X. and Huang, H.-C. (2010). Grouping pursuit through a regularization solution surface. Journal of the American Statistical Association, 105, 727-739.

[15] Shen, X., Pan, W., Zhu, Y., and Zhou, H. (2010). On L0 regularization in high-dimensional regression, manuscript.

[16] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58, 267-288.

[17] Zhao, P. and Yu, B. (2006). On model selection consistency of LASSO. Journal of Machine Learning Research, 7, 2541-2567.

[18] Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38, 894-942.

Appendices

A Data analysis code

Description

  Fit a generalized linear model with a truncated L1 penalty.

Usage

  glmTLP(y, x, lambda.min, maxtau, mintau, tol)
  TLP.logistic(y, x, lambda.min, maxtau, mintau, tol)

Arguments

  y           Response with binary output 0 and 1.
  x           Input matrix with dimension n × p for n observations and p predictors.
  lambda.min  Smallest value of the lambda path.
  maxtau      Largest thresholding parameter.
  mintau      Smallest thresholding parameter.
  tol         Convergence tolerance for the coordinate descent and iteratively
              reweighted penalized least squares iterations.

# ==========================================================================
# Fit the PML estimator for logistic regression over a (lambda, tau) grid and
# select the tuning parameters by the BIC criterion of Section 3.2.
library(glmnet)

TLP.logistic <- function(y, x, lambda.min, maxtau, mintau, tol, return = FALSE)
{
  n <- nrow(x); p <- ncol(x)
  # Largest lambda on the Lasso path (all coefficients zero), then a log-spaced grid.
  up <- glmnet(x, y, family = "binomial", thresh = 1e-06, standardize = F, maxit = 1e+6)$lambda[1]
  lambda <- 10^(c(0:100) * ((log(up, 10) - log(lambda.min, 10)) / 100) + log(lambda.min, 10))
  lambda <- sort(c(0, lambda), decreasing = T); nlambda <- length(lambda)
  # Lasso estimates along the lambda grid, used as starting values for DCP.
  lasso <- glmnet(x, y, family = "binomial", lambda = lambda, thresh = 1e-06, standardize = F, maxit = 1e+6)
  est.lasso <- cbind(as.vector(lasso$a0), t(as.matrix(lasso$beta)))
  # Log-spaced tau grid from mintau to maxtau, plus a large value (~ Lasso).
  tau <- c(10^(c(0:4) * ((log(maxtau, 10) - log(mintau, 10)) / 4) + log(mintau, 10)), 100); q <- length(tau)
  est.table <- array(0, c(nlambda, (p + 1), q)); est.ch <- array(0, c(q, (p + 1)))
  case.lambda <- BIC.lambda <- rep(0, q); score <- array(0, c(nlambda, q))
  for (j in 1:q)
  {
    est.table[, , j] <- lambda_path(y, x, est.lasso, lambda, tau[j], tol)
    CH <- lambda_choose_BIC(y, x, est.table[, , j])
    case <- CH$case; case.lambda[j] <- case; score[, j] <- CH$BIC
    est.ch[j, ] <- est.table[case, , j]; BIC.lambda[j] <- CH$min.BIC
  }
  case.tau <- which.min(BIC.lambda)
  est.tau <- est.ch[case.tau, ]; BIC.tau <- BIC.lambda[case.tau]
  list(est.tau = est.tau, BIC.tau = BIC.tau, est.lambda = est.ch, BIC.lambda = BIC.lambda,
       lambda = lambda[case.lambda[case.tau]], tau = tau[case.tau])
}

# ==========================================================================
# Estimate the standard deviation of the coefficients using parametric
# bootstrap resampling from the selected model.
glmTLP <- function(y, x, lambda.min, maxtau, mintau, tol)
{
  p <- ncol(x); n <- nrow(x); k <- 100          # k bootstrap replicates
  table <- array(0, c(k, p+1))
  temp <- TLP.logistic(y, x, lambda.min, maxtau, mintau, tol); est <- temp$est.tau
  for(i in 1:k)
  {
    xx <- cbind(1, x)
    prob <- exp(xx %*% est)/(1+exp(xx %*% est))
    new.y <- rbinom(n, 1, prob)
    table[i,] <- TLP.logistic(new.y, x, lambda.min, maxtau, mintau, tol)$est.tau
  }
  std <- apply(table, 2, sd)
  list(estimates=est, std=std, BIC.score=temp$BIC.tau)
}

# ==========================================================================
# Computing the estimates along the lambda path for a fixed tau (logistic)

lambda_path <- function(Y, X, lasso, lambda.star, tau, tol)
{
  nlambda <- length(lambda.star); p <- ncol(X)
  table <- rep(0, (p+1)); initial <- lasso[1,]
  for(i in 1:nlambda)
  {
    est <- DiffConvProg(lasso[i,], initial, Y, X, lambda.star[i], tau, tol)$estimates
    table <- rbind(table, est); initial <- est
  }
  table[(-1),]
}

# ==========================================================================
# Choosing the tuning parameter lambda with the minimum BIC score (logistic)
lambda_choose_BIC <- function(Y, X, table)
{
  nlambda <- nrow(table); n <- length(Y); total <- rep(0, nlambda)
  for(i in 1:nlambda)
  {
    total[i] <- -2*Lfun(table[i,], Y, X) + log(n)*sum(table[i,-1] != 0)
  }
  case <- which.min(total)
  list(estimates=table[case,], case=case, min.BIC=total[case], BIC=total)
}

# ==========================================================================
# Difference convex programming (logistic): iterate the quadratic
# approximation until the set {j : |beta_j| <= tau} stabilizes.
DiffConvProg <- function(estD, estQ, Y, X, lambda, tau, tol)
{
  p <- ncol(X); ans <- estD; check <- (-1); k <- 0
  N.old <- (abs(ans[-1]) <= tau)
  while(check != p)
  {
    est <- Quapprox(ans, estQ, Y, X, lambda, tau, tol)$estimates
    N.new <- (abs(est[-1]) <= tau); check <- sum(N.old == N.new)
    N.old <- N.new; ans <- est; k <- k+1
  }
  list(estimates=ans, times=k)
}

# ==========================================================================
# Quadratic approximation of the log-likelihood function (logistic)
Quapprox <- function(estD, estQ, Y, X, lambda, tau, tol)
{
  check <- T; k <- 0; ans <- estQ
  while(check == T && k < 250)
  {
    est <- CoorDecent(estD, ans, Y, X, lambda, tau, tol)$estimates
    check <- max(abs(ans-est)) > tol; ans <- est; k <- k+1
  }
  list(estimates=ans, times=k)
}

# ==========================================================================
# Coordinate descent algorithm for the penalized weighted least squares
# problem arising from one quadratic approximation (logistic)
CoorDecent <- function(estD, estQ, Y, X, lambda, tau, tol)
{
  n <- nrow(X); p <- ncol(X)
  A <- cbind(1, X); Xans <- XestQ <- A %*% estQ
  # Fitted probabilities, truncated away from 0 and 1 for numerical stability.
  ph <- exp(XestQ)/(1+exp(XestQ))
  ph[XestQ >= 100] <- 1; ph[ph <= (1e-6)] <- 1e-6; ph[ph >= (1-(1e-6))] <- 1-(1e-6)
  w <- as.vector(ph*(1-ph)); Z <- XestQ + (Y-ph)/w; temp <- w*(A^2)
  check <- T; k <- 0; ans <- est <- estQ
  while(check == T && k < 250)
  {
    # Unpenalized intercept update.
    Z0 <- Xans - ans[1]; est[1] <- sum(w*(Z-Z0))/sum(w)
    for(j in 2:(p+1))
    {
      Xans <- Xans - A[,j-1]*(ans[j-1]-est[j-1]); Zj <- Xans - A[,j]*ans[j]
      # Soft-thresholding update; the penalty is active only when |estD[j]| <= tau.
      est[j] <- soft((1/n)*sum(w*A[,j]*(Z-Zj)), lambda*(abs(estD[j]) <= tau))/((1/n)*sum(temp[,j]))
    }
    Xans <- Xans - A[,p+1]*(ans[p+1]-est[p+1])
    check <- max(abs(ans-est)) > tol; k <- k+1; ans <- est
  }
  list(estimates=ans, times=k)
}

# ==========================================================================
# Soft-thresholding operator: soft(z, r) = sign(z) * max(|z| - r, 0)
soft <- function(z, r)
{
  ans <- 0
  if(z > 0 && abs(z) > r) ans <- z - r
  if(z < 0 && abs(z) > r) ans <- z + r
  ans
}

# ==========================================================================
# Log-likelihood function (logistic)
Lfun <- function(beta, Y, X)
{
  X <- cbind(1, X); Xbeta <- X %*% beta
  sum(Y*(Xbeta) - log(1+exp(Xbeta)))
}

# ==========================================================================
# Exact AIC & BIC: exhaustive computation over all 2^p submodels (logistic)
verify <- function(Y, X)
{
  n <- length(Y); p <- ncol(X)
  # Enumerate all 2^p inclusion patterns of the p predictors.
  index <- rep(c(T,F), each=2^(p-1))
  for(k in 1:(p-1))
  {
    temp <- rep(rep(c(T,F), each=2^(p-1-k)), 2^k)
    index <- rbind(index, temp)
  }
  index <- t(index); total <- 2^p; score.aic <- score.bic <- rep(0, total)
  for(i in 1:total)
  {
    subdata <- data.frame(Y)
    for(j in 1:p)
    {
      if(index[i,j] == T) subdata <- data.frame(subdata, X[,j])
    }
    para <- glm(Y~., subdata, family=binomial)$coeff
    subdata <- as.matrix(subdata)
    # Log-likelihood and number of parameters of the fitted submodel
    # (reconstructed here; the corresponding lines were lost at a page break).
    L <- Lfun(para, Y, subdata[, -1, drop=FALSE])
    check <- length(para)
    score.aic[i] <- -2*L + 2*check; score.bic[i] <- -2*L + log(n)*check
  }
  case.aic <- which.min(score.aic); case.bic <- which.min(score.bic)
  # Refit the AIC-best submodel and expand its coefficients to length p+1.
  subdata <- data.frame(Y)
  for(j in 1:p)
  {
    if(index[case.aic,j] == T) subdata <- data.frame(subdata, X[,j])
  }
  tempA <- glm(Y~., subdata, family=binomial); temp.aic <- tempA$coeff; est.aic <- temp.aic[1]
  i <- 1
  for(j in 1:p)
  {
    if(index[case.aic,j] == F) {est.aic <- c(est.aic, 0)}
    if(index[case.aic,j] == T) {i <- i+1; est.aic <- c(est.aic, temp.aic[i])}
  }
  # Refit the BIC-best submodel and expand its coefficients to length p+1.
  subdata <- data.frame(Y)
  for(j in 1:p)
  {
    if(index[case.bic,j] == T) subdata <- data.frame(subdata, X[,j])
  }
  tempB <- glm(Y~., subdata, family=binomial); temp.bic <- tempB$coeff; est.bic <- temp.bic[1]
  i <- 1
  for(j in 1:p)
  {
    if(index[case.bic,j] == F) {est.bic <- c(est.bic, 0)}
    if(index[case.bic,j] == T) {i <- i+1; est.bic <- c(est.bic, temp.bic[i])}
  }
  list(aic.model=index[case.aic,], est.AIC=est.aic, bic.model=index[case.bic,], est.BIC=est.bic,
       coefficientA=(summary(tempA))$coeff, coefficientB=(summary(tempB))$coeff)
}
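
For a small number of predictors, the exhaustive routine above can be used to check that the penalized likelihood fit tracks exact AIC/BIC selection. The snippet below is an illustrative sketch only, not part of the original thesis code; it reuses the simulated objects y, x, and fit from the example given after the Arguments section.

exact <- verify(y, x)
exact$bic.model                  # inclusion pattern of the exact BIC-best submodel
exact$est.BIC                    # its coefficients, with zeros for excluded predictors
which(fit$estimates[-1] != 0)    # predictors selected by glmTLP, for comparison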

B

Poisson simulation code

library(glmnet)

# Simulation settings: m replicates, sample size n, true coefficient vector
# (intercept, beta_1, ..., beta_p) with only the first two predictors active.
m <- 100; n <- 50; beta <- c(0, 2, -1, rep(0, 4))
tau.max <- 5; tau.min <- 0.1; tol <- 1e-6
tau <- c(10^(c(0:4)*((log(tau.max,10)-log(tau.min,10))/4)-1), 100)
p <- length(beta)-1; q <- length(tau); nlambda <- 102
time1 <- Sys.time()
num <- 1
est.mle <- est.true <- estchtau <- est.aic <- est.bic <- array(0, c(m, p+1))
estimate <- array(0, c(nlambda, (p+1), q, m))
estch <- array(0, c(q, p+1, m))
KL <- BIC.value <- array(0, c(nlambda, q, m))
KL.ch <- array(0, c(m, q))
KL.mle <- KL.true <- KL.aic <- KL.bic <- rep(0, m)
KL.chtau <- KL.aic <- KL.bic <- est.model <- rep(0, m)
case.tau <- value.tau <- rep(0, q)
while(num <= m)
{
  # Generate Poisson responses from the true model.
  x <- matrix(rnorm(p*n, 0, 1), ncol=p); mean <- y <- rep(0, n)
  xx <- cbind(1, x); mean <- exp(xx %*% beta); y <- rpois(n, mean)
  # MLE of the full model and of the true model; KLloss() denotes the
  # Kullback-Leibler loss function defined elsewhere in the thesis code.
  data <- data.frame(y, x)
  fit <- glm(y~., data, family=poisson); est.mle[num,] <- fit$coeff
  trueMLE <- glm(y~X1+X2, data, family=poisson)$coeff; est.true[num,] <- c(trueMLE, rep(0, p-2))
  KL.mle[num] <- KLloss(beta, est.mle[num,], y, x); KL.true[num] <- KLloss(beta, est.true[num,], y, x)
  # Lambda grid and lasso starting values.
  up <- glmnet(x, y, family="poisson", thresh=1e-06, standardize=F, maxit=1e+6)$lambda[1]
  lambda.star <- 10^(c(0:100)*((log(up,10)+3)/100)-3)
  lambda.star <- sort(c(0, lambda.star), decreasing=T)
  lasso <- glmnet(x, y, family="poisson", lambda=lambda.star, thresh=1e-06, standardize=F, maxit=1e+6)
  est.lasso <- cbind(as.vector(lasso$a0), t(as.matrix(lasso$beta)))
  for(k in 1:q)
