
Journal of Statistical Planning and Inference 136 (2006) 2020–2034

www.elsevier.com/locate/jspi

On some variable selection procedures based on data for regression models

Deng-Yuan Huang a,∗, Ren-Fen Lee b, S. Panchapakesan c

a Institute of Applied Statistics, Fu-Jen Catholic University, 510 Chung Cheng Road, Hsinchuang, Taipei Hsien, Taiwan, ROC

b Department of Accounting, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan, ROC

c Department of Mathematics, Southern Illinois University, Carbondale, IL 62901-4408, USA

Available online 8 September 2005

Abstract

We discuss variable selection procedures in regression models based on large data sets. Our purpose is to discover the pattern of the association between the dependent variable and the independent variables. The real-valued variable $Y$ is the response (dependent) variable. The variables in the vector $X$ are the predictor (independent) variables, which may be either ordered or categorical. A function $d(X)$ taking real values is defined on the measure space. In regression models, predictors have been constructed using a parametric approach under the assumption that $E(Y \mid X = x) = d(x, \theta)$, where $d$ has a known functional form depending on $x$ and a finite set of parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_m)$. Then $\theta$ is estimated by the least squares method. In practical applications, however, the functional form of $d$ is usually unknown. In such a situation, it is difficult to determine the regression function, and we use Classification And Regression Trees (CART) methods that integrate the data and the model. We propose selection procedures to select important predictor variables in the regression model based on data. Some criteria for selecting the important variables are discussed. An empirical study based on an annual survey of inbound visitors in Taiwan is provided to illustrate the implementation of our multiple decision procedure.

© 2005 Elsevier B.V. All rights reserved.

MSC: Primary 62J02; secondary 62F07; 62J05; 62J20

Keywords: Regression model; CART; Variables selection; Multiple decision

∗ Corresponding author.

E-mail address: [email protected] (D.-Y. Huang).

0378-3758/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.jspi.2005.08.038

1. Introduction

Let $(X, Y)$ be a large data set. The variable $Y$ is the response (dependent) variable. The variables in the vector $X$ are the predictor (independent) variables, which may be either ordered or categorical. A function $d(X)$ taking real values is defined on the measure space $\mathcal{X}$. In regression models, predictors have been constructed by using a parametric approach under the assumption that $E(Y \mid X = x) = d(x, \theta)$, where $d$ has a known functional form depending on $x$ and a finite set of parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_m)$.

Suppose we have a learning sample consisting of $N$ cases: $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$.

This is used to construct a predictor $d(x)$. As a measure of the accuracy of such a predictor, we use the measure $R(d) = E(Y - d(x))^2$. The predictor $d_B$ which minimizes $R(d)$ (or equivalently, maximizes the precision) is

$$d_B(x) = E(Y \mid X = x).$$

This is the well-known method of least squares. The predictor $d_B(X)$ is often referred to as the regression surface of $Y$ on $X$. The error $R(d)$ is usually estimated by

$$R(d) = \frac{1}{N} \sum_{n} (y_n - d(x_n))^2. \tag{1.1}$$
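The estimate in (1.1) can be illustrated with a minimal Python sketch. The function name `resubstitution_error`, the toy learning sample, and the two predictors are assumptions made for illustration only; they do not come from the paper.

```python
def resubstitution_error(predict, xs, ys):
    """Average squared error (1/N) * sum_n (y_n - d(x_n))^2, as in (1.1)."""
    n = len(ys)
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / n


# Hypothetical learning sample of N = 4 cases.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Hypothetical predictor 1: ignore x and always predict the sample mean.
mean_y = sum(ys) / len(ys)
const_error = resubstitution_error(lambda x: mean_y, xs, ys)

# Hypothetical predictor 2: d(x) = 2x, which reproduces this sample exactly.
exact_error = resubstitution_error(lambda x: 2.0 * x, xs, ys)
```

On this toy sample the constant predictor leaves a positive average squared error, while the predictor that reproduces the data leaves none.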

In practical applications, the functional form of d is usually unknown and it is difficult to determine the regression function. We will use the Classification And Regression Trees (CART) methods that integrate the data and the model.

Our focus is on data sets whose dimensionality requires some sort of variables selection. Therefore, we first describe the tree structure regression.

2. Tree structure regression

The binary tree structure (Fig. 1) classifies by repeatedly splitting subsets of the measure space $\mathcal{X}$ into two descending subsets. The procedure, at the first step, begins with $\mathcal{X}$ itself. A tree structure predictor partitions the space by a sequence of binary splits into terminal nodes. The entire construction of a tree involves three elements:

(1) the selection of the splits,

(2) the decision whether to declare a node terminal or to continue splitting it, and

(3) the assignment of each terminal node to a class.

Splitting of a node continues until one of the following occurs (see Answer Tree 2.0: User’s Guide, 1998):

(a) All cases in a node have identical values for all predictors.

(b) The node becomes pure, that is, all cases in the node have the same target value.

(c) The depth of the tree has reached its prespecified maximum value.

(d) The number of cases constituting the node is less than a prespecified minimum parent node size.

(e) The split at the node results in producing a child node whose number of cases is less than a prespecified minimum child node size.

(f) For CART only, the maximum decrease in impurity is less than a prespecified value.

[Fig. 1. Binary tree with nodes t1, ..., t9 produced by splits 1 to 4, with the node values y(t4), ..., y(t8) labeled.]

The value of $d(x_n)$ that minimizes $R(d)$ in (1.1) is the average $\bar{y}(t)$ for all cases $(x_n, y_n)$ falling into the terminal node $t$; i.e., $\bar{y}(t) = (1/N(t)) \sum_{x_n \in t} y_n$, where $N(t)$ is the total number of cases in node $t$. In the tree structure regression, we use instead of $R(d)$ the quantity $R(T)$ defined by

$$R(T) = \frac{1}{N} \sum_{t \in \tilde{T}} \sum_{x_n \in t} (y_n - \bar{y}(t))^2, \tag{2.1}$$

where $\tilde{T}$ denotes the set of all nodes for the splits.
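A minimal Python sketch of (2.1) follows. It assumes each node is represented simply by the list of responses of the cases falling into it; the helper names `node_mean` and `tree_error` and the toy nodes are hypothetical.

```python
def node_mean(ys):
    """Average response y_bar(t) of the cases in a node."""
    return sum(ys) / len(ys)


def tree_error(nodes, n_total):
    """R(T): within-node squared deviations summed over nodes, divided by N."""
    total = 0.0
    for ys in nodes:
        m = node_mean(ys)
        total += sum((y - m) ** 2 for y in ys)
    return total / n_total


# Hypothetical nodes, each holding the responses of its cases (N = 5 in total).
nodes = [[1.0, 3.0], [10.0, 12.0, 14.0]]
r_T = tree_error(nodes, n_total=5)
```

Each node contributes only its own squared deviations about its own mean, which is why a partition into homogeneous nodes drives $R(T)$ toward zero.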

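Consider a node $t$ split into $t_L$ and $t_R$. Define

$$s^2(t) = \frac{1}{N(t)} \sum_{x_n \in t} (y_n - \bar{y}(t))^2. \tag{2.2}$$

Since

$$\sum_{x_n \in t} (y_n - \bar{y}(t))^2 = \sum_{x_n \in t_L} (y_n - \bar{y}(t))^2 + \sum_{x_n \in t_R} (y_n - \bar{y}(t))^2,$$

we get

$$s^2(t) = p(t_L)\, s^2(t_L) + p(t_R)\, s^2(t_R), \tag{2.3}$$

where

$$s^2(t_L) = \frac{1}{N(t_L)} \sum_{x_n \in t_L} (y_n - \bar{y}(t))^2, \qquad s^2(t_R) = \frac{1}{N(t_R)} \sum_{x_n \in t_R} (y_n - \bar{y}(t))^2,$$

$$p(t_L) = \frac{N(t_L)}{N(t)}, \qquad p(t_R) = \frac{N(t_R)}{N(t)}, \qquad \text{and} \qquad N(t) = N(t_L) + N(t_R).$$

Now, let $p(t) = N(t)/N$ for any set $t$. Then, we define

$$R(t) \equiv \frac{1}{N} \sum_{x_n \in t} (y_n - \bar{y}(t))^2 = s^2(t)\, p(t). \tag{2.4}$$

We can rewrite $R(T)$ in (2.1) as

$$R(T) = \sum_{t \in \tilde{T}} R(t). \tag{2.5}$$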
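The identities (2.3) and (2.4) can be checked numerically. The sketch below is an illustration under assumptions: the responses in the two child nodes and the total sample size `n_total` are hypothetical, and, as in the definitions above, the child quantities $s^2(t_L)$ and $s^2(t_R)$ are taken about the parent mean $\bar{y}(t)$.

```python
def s2_about(ys, center):
    """Mean squared deviation of ys about a given center."""
    return sum((y - center) ** 2 for y in ys) / len(ys)


# Hypothetical responses in the two children of a node t.
left = [1.0, 2.0, 3.0]
right = [7.0, 9.0]
parent = left + right
n_total = 10  # hypothetical size N of the whole learning sample

y_bar_t = sum(parent) / len(parent)
s2_t = s2_about(parent, y_bar_t)

# Right-hand side of (2.3): weights p(t_L), p(t_R), deviations about y_bar(t).
p_left = len(left) / len(parent)
p_right = len(right) / len(parent)
rhs = p_left * s2_about(left, y_bar_t) + p_right * s2_about(right, y_bar_t)

# (2.4): R(t) = s^2(t) * p(t) with p(t) = N(t) / N.
r_t = s2_t * len(parent) / n_total
direct = sum((y - y_bar_t) ** 2 for y in parent) / n_total
```

Both sides of each identity agree up to floating-point rounding, since (2.3) only regroups the same squared deviations by child node.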

The above expression in (2.5) has a simple interpretation. For every node $t$, $\sum_{x_n \in t} (y_n - \bar{y}(t))^2$ is the within node sum of squares. That is, it is the total squared deviations of the $y_n$ in $t$ from their average. Summing over $t \in \tilde{T}$ gives the total within node sum of squares and dividing by $N$ gives the average.

Definition 2.1. Given any set of splits $S$ of a current terminal node $t$ in $\tilde{T}$, the best split $s^*$ is that split in $S$ which yields the maximum decrease in $R(T)$. More precisely, for any split $s$ of $t$ into $t_L$ and $t_R$, let $\Delta R(s, t) = R(t) - R(t_L) - R(t_R)$. Then, the best split is $s^*$ for which

$$\Delta R(s^*, t) = \max_{s \in S} \Delta R(s, t). \tag{2.6}$$

For a detailed treatment of the CART techniques, the reader is referred to the book by Breiman et al. (1984).
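As an illustration of Definition 2.1, the following Python sketch searches for the best split of a node on a single ordered predictor. It is a sketch under assumptions rather than the CART implementation: the candidate set $S$ is taken to be all splits of the form $x \le c$ with $c$ a midpoint between adjacent distinct values, and the names `node_risk` and `best_split` and the data are hypothetical.

```python
def node_risk(ys, n_total):
    """R(t) = (1/N) * within-node sum of squares about the node mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / n_total


def best_split(xs, ys, n_total):
    """Return (threshold, decrease) with the largest R(t) - R(t_L) - R(t_R)."""
    parent = node_risk(ys, n_total)
    best = (None, 0.0)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        c = (lo + hi) / 2.0  # assumed candidate: midpoint of adjacent values
        left = [y for x, y in zip(xs, ys) if x <= c]
        right = [y for x, y in zip(xs, ys) if x > c]
        decrease = parent - node_risk(left, n_total) - node_risk(right, n_total)
        if decrease > best[1]:
            best = (c, decrease)
    return best


# Hypothetical node: the response jumps between x = 3 and x = 10.
xs = [1.0, 2.0, 3.0, 10.0, 11.0]
ys = [1.0, 1.0, 1.0, 5.0, 5.0]
threshold, decrease = best_split(xs, ys, n_total=len(ys))
```

In this toy node the chosen threshold separates the two groups of constant responses, so both children have zero within-node sum of squares and the decrease equals $R(t)$ itself.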

3. Preliminary variables selection

Our analysis based on the data begins with a preliminary variables selection by using a measure of the importance of a variable in a node.

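To start with, we have the following obvious algebraic relations among the various sums of squares associated with a node $t$ that is split into $t_L$ and $t_R$:

$$\sum_{x_n \in t} (y_n - \bar{y})^2 = \sum_{x_n \in t_L} (y_n - \bar{y})^2 + \sum_{x_n \in t_R} (y_n - \bar{y})^2,$$

$$\sum_{x_n \in t_L} (y_n - \bar{y})^2 = \sum_{x_n \in t_L} (y_n - \bar{y}(t_L))^2 + \sum_{x_n \in t_L} (\bar{y}(t_L) - \bar{y})^2,$$

$$\sum_{x_n \in t_R} (y_n - \bar{y})^2 = \sum_{x_n \in t_R} (y_n - \bar{y}(t_R))^2 + \sum_{x_n \in t_R} (\bar{y}(t_R) - \bar{y})^2.$$

These relations yield

$$\begin{aligned}
SST_t &= \sum_{x_n \in t} (y_n - \bar{y})^2 = \sum_{x_n \in t_L} (y_n - \bar{y})^2 + \sum_{x_n \in t_R} (y_n - \bar{y})^2 \\
&= \Big\{ \sum_{x_n \in t_L} (y_n - \bar{y}(t_L))^2 + \sum_{x_n \in t_R} (y_n - \bar{y}(t_R))^2 \Big\} + \big\{ N(t_L)(\bar{y}(t_L) - \bar{y})^2 + N(t_R)(\bar{y}(t_R) - \bar{y})^2 \big\} \\
&= SSE_t + SSB_t,
\end{aligned}$$

where $SSE_t$ and $SSB_t$ are the within nodes sum of squares and between nodes sum of squares for the given split of the node $t$.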
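The decomposition $SST_t = SSE_t + SSB_t$ can be verified numerically with a short Python sketch. The function name `sums_of_squares` and the responses are hypothetical. The reference value $\bar{y}$ is passed in explicitly, because the cross terms vanish for any constant $\bar{y}$, so the sketch does not need to assume whether $\bar{y}$ is the node mean or the overall mean.

```python
def mean(ys):
    return sum(ys) / len(ys)


def sums_of_squares(left, right, y_bar):
    """Return (SST_t, SSE_t, SSB_t) for a node split into left and right."""
    sst = sum((y - y_bar) ** 2 for y in left + right)
    sse = (sum((y - mean(left)) ** 2 for y in left)
           + sum((y - mean(right)) ** 2 for y in right))
    ssb = (len(left) * (mean(left) - y_bar) ** 2
           + len(right) * (mean(right) - y_bar) ** 2)
    return sst, sse, ssb


# Hypothetical responses in the two children of a node t.
left, right = [1.0, 2.0, 3.0], [7.0, 9.0]

# Case 1: y_bar taken as the mean of the cases in t.
sst_a, sse_a, ssb_a = sums_of_squares(left, right, mean(left + right))

# Case 2: y_bar taken as some other constant (here 5.0).
sst_b, sse_b, ssb_b = sums_of_squares(left, right, 5.0)
```

In both cases the within and between parts add up to the total, and only the between part changes with the choice of $\bar{y}$.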

Now, let $\{X_1, X_2, \ldots, X_k\}$ denote the set of $k$ independent variables. Suppose the node $t$ contains the subset $\{X_1, X_2, \ldots, X_{p-1}\}$ of the independent variables. Consider adding the variable $X_p$ to this set. Let $\mu = E(Y)$ and let $\mu_{ip} = E_{x_1, \ldots, x_{p-1}}(Y_i \mid X_p = x_p)$, the expectation of $Y_i$ conditioned on $X_p$ in the node $t$. Let

$$SSB_p = \sum_{i=1}^{N(t)} (\mu_{ip} - \mu)^2,$$

$$SS\hat{B}_{x_p}(t) = N(t_L)(\bar{y}(t_L) - \bar{y})^2 + N(t_R)(\bar{y}(t_R) - \bar{y})^2, \tag{3.1}$$

where $SS\hat{B}_{x_p}$ is calculated with $x_p$ included.

A measure of the importance of including $x_p$ in splitting the node $t$ into nodes $t_L$ and $t_R$ is defined by

$$C_{x_p}(t) = \frac{(1/N(t))\, SS\hat{B}_{x_p}(t)}{(1/N)\, SST}, \tag{3.2}$$

where $SST = \sum_{i=1}^{N} (y_i - \bar{y})^2$ and $N(t) = N(t_L) + N(t_R)$.

Substituting (3.1) in (3.2) we have

$$C_{x_i}(t) = \frac{(N(t_L)/N(t))(\bar{y}(t_L) - \bar{y})^2 + (N(t_R)/N(t))(\bar{y}(t_R) - \bar{y})^2}{(1/N) \sum_{i=1}^{N} (y_i - \bar{y})^2}, \quad i = 1, \ldots, k. \tag{3.3}$$

The larger the value of $C_{x_i}(t)$, the more important the variable $x_i$ in node $t$ is.

In a tree, $C_{x_p}(t)$ is computed for each last split in the stem to assess the importance of the stem. If the stem is the most important, then the variables in this stem are all included in the subsequent model.
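The measure in (3.3) can be sketched in Python as follows. The sketch assumes that $\bar{y}$ denotes the mean response over the whole learning sample, consistent with the definition of $SST$; the function name `importance` and the data are hypothetical.

```python
def importance(left, right, all_ys):
    """C_x(t) from (3.3) for a node split into left and right by variable x."""
    n = len(all_ys)
    y_bar = sum(all_ys) / n  # assumed: overall sample mean
    sst_over_n = sum((y - y_bar) ** 2 for y in all_ys) / n
    n_t = len(left) + len(right)
    mean_l = sum(left) / len(left)
    mean_r = sum(right) / len(right)
    between = ((len(left) / n_t) * (mean_l - y_bar) ** 2
               + (len(right) / n_t) * (mean_r - y_bar) ** 2)
    return between / sst_over_n


# Hypothetical learning sample; the node t here is the root.
all_ys = [1.0, 1.0, 5.0, 5.0]

# A split that separates low from high responses.
c_good = importance([1.0, 1.0], [5.0, 5.0], all_ys)

# A split whose children both have the overall mean.
c_bad = importance([1.0, 5.0], [1.0, 5.0], all_ys)
```

The separating split attains a larger value than the uninformative one, matching the rule that a larger $C_{x_i}(t)$ indicates a more important variable in node $t$.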

4. Assessing the variables

After our preliminary selection of important variables, suppose we have $p$ independent variables $x_1, x_2, \ldots, x_p$. Suppose we put all of these variables except one, say $x_i$, into CART, so that $p - 1$ variables are used. We define two measures for assessing the importance of the variable $x_i$ in a node $t$. Once a variable is dropped

(1) Suppose the node $t$ is split into $t_L$ and $t_R$ without $x_i$. Define

$$I_{x_i}(t) = \frac{(N(t_L)/N(t))(\bar{y}(t_L) - \bar{y})^2 + (N(t_R)/N(t))(\bar{y}(t_R) - \bar{y})^2}{(1/N) \sum_{i=1}^{N} (y_i - \bar{y})^2}, \quad i = 1, \ldots, p. \tag{4.1}$$

Let $\tilde{\tilde{T}}_i$ denote the set of all intermediate nodes for the splits without variable $x_i$, and let $\#(\tilde{\tilde{T}}_i)$ denote the number of all intermediate nodes. Define

$$I_{x_i} = \frac{1}{\#(\tilde{\tilde{T}}_i)} \sum_{t \in \tilde{\tilde{T}}_i} I_{x_i}(t). \tag{4.2}$$

A smaller $I_{x_i}$ means that, without $x_i$, the weighted average of the between sum of squares (SSB), weighted inversely by the counts, is affected less in all intermediate nodes. In other words, smaller $I_{x_i}$ means that the variable $x_i$ is more important.
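A minimal Python sketch of (4.1) and (4.2) is given below. It assumes that the tree grown without $x_i$ is already available and that each intermediate node is described by the responses in its two children; the function names, the toy splits, and the reading of $\bar{y}$ as the overall sample mean are assumptions for illustration.

```python
def node_measure(left, right, all_ys):
    """I_x(t) from (4.1) for one intermediate node split into left and right."""
    n = len(all_ys)
    y_bar = sum(all_ys) / n  # assumed: overall sample mean
    sst_over_n = sum((y - y_bar) ** 2 for y in all_ys) / n
    n_t = len(left) + len(right)
    between = ((len(left) / n_t) * (sum(left) / len(left) - y_bar) ** 2
               + (len(right) / n_t) * (sum(right) / len(right) - y_bar) ** 2)
    return between / sst_over_n


def average_measure(splits, all_ys):
    """I_x from (4.2): mean of I_x(t) over the intermediate nodes."""
    return sum(node_measure(l, r, all_ys) for l, r in splits) / len(splits)


# Hypothetical learning sample and a hypothetical tree grown without x_i.
all_ys = [1.0, 1.0, 5.0, 5.0, 9.0, 9.0]
splits = [
    ([1.0, 1.0], [5.0, 5.0, 9.0, 9.0]),  # root split
    ([5.0, 5.0], [9.0, 9.0]),            # split of the right child
]
i_x = average_measure(splits, all_ys)
```

Repeating this computation for each left-out variable gives the values $I_{x_1}, \ldots, I_{x_p}$ that are compared across variables.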

(2) Define

$$R^2_{x_i} = \frac{\sum_{t \in T_i} SSB(t)}{SST} = \frac{\sum_{t \in T_i} N(t)(\bar{y}(t) - \bar{y})^2}{\sum_{j=1}^{N} (y_j - \bar{y})^2}, \tag{4.3}$$

where $T_i$ denotes the set of all terminal nodes for the splits without variable $x_i$. The measure $R^2_{x_i}$ and $R^2$, the coefficient of determination for linear regression models, are similar in the sense that both represent the proportion of the total variation explained by the between sum of squares (SSB). There are, however, some differences between $R^2_{x_i}$ and $R^2$. First, $R^2_{x_i}$ does not contain the independent variable $x_i$. Second, $R^2$ assumes a linear relationship among these variables, whereas $R^2_{x_i}$ makes no such assumption. Also, $R^2_{x_i}$ can measure the interactions among variables and assess the impact of each through the tree structure. The smaller $R^2_{x_i}$ is, the more important $x_i$ is.

We propose the rule: rank the variables according to the ordered $I_{x_i}$ or $R^2_{x_i}$, the smallest indicating the most important.
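The ranking rule based on $R^2_{x_i}$ can be sketched in Python as follows. The sketch assumes that, for each variable, a tree has already been grown with that variable left out and that its terminal nodes partition the whole learning sample; the function name `r_squared_without`, the variable names `x1` and `x2`, and the node contents are hypothetical.

```python
def r_squared_without(terminal_nodes):
    """R^2_{x_i} from (4.3): between sum of squares over terminal nodes / SST."""
    all_ys = [y for node in terminal_nodes for y in node]
    y_bar = sum(all_ys) / len(all_ys)
    sst = sum((y - y_bar) ** 2 for y in all_ys)
    ssb = sum(len(node) * (sum(node) / len(node) - y_bar) ** 2
              for node in terminal_nodes)
    return ssb / sst


# Hypothetical terminal nodes of trees grown with one variable left out.
trees_without = {
    "x1": [[1.0, 5.0], [1.0, 5.0]],  # without x1 the nodes explain nothing
    "x2": [[1.0, 1.0], [5.0, 5.0]],  # without x2 the nodes still separate y
}
scores = {name: r_squared_without(nodes)
          for name, nodes in trees_without.items()}

# Smallest R^2_{x_i} first, i.e. most important variable first.
ranking = sorted(scores, key=scores.get)
```

In this toy setting dropping `x1` removes all explained variation while dropping `x2` costs nothing, so `x1` is ranked as the more important variable.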

All rules presented here are based on the nodes determined by using the stopping rules for a tree (discussed in Section 2). If the stopping rule is changed, then the variables considered may be different and/or the nodes may be different.

5. Model fitting

Once we have selected the important variables as described previously, the next step is to model the association between the dependent variable and the independent variables. Symmetric data are found efficient for model fitting. So we first transform the dependent variable suitably so that the data for the transformed variable $Y$ exhibit symmetry. Suppose the independent variables $x_1, \ldots, x_p$ are the variables from the root to the node $t$ selected by using the criterion in (3.3). We first consider the regression model:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon = \beta^T x + \varepsilon, \tag{5.1}$$

Estimates of parameters

Variable     DF  Parameter estimate  Standard error  T for H0: parameter = 0  Prob > |T|  Tolerance
NIGHT4 (4)    1           16.582290      4.69427046                    3.532      0.0004  0.02629870
OC1 (5)       1          −12.503078      2.62241263                   −4.768      0.0001  0.04998445
OC2 (6)       1           −2.018459      0.68034618                   −2.967      0.0030  0.36638318
R1 (7)        1            5.242674      1.22991467                    4.263      0.0001  0.08667428
R2 (8)        1            6.409203      1.42409266                    4.501      0.0001  0.13211097
R3 (9)        1            1.953004      1.49518191                    1.306      0.1916  0.79668501
R4 (10)       1            1.656168      1.48579646                    1.115      0.2651  0.79809733
R5 (11)       1           −2.948404      1.27713284                   −2.309      0.0210  0.54001625
R6 (12)       1            5.727534      1.97232790                    2.904      0.0037  0.5701644

References

Answer Tree 2.0: User’s Guide, 1998. SPSS Inc.

Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A., 1984. Classification and Regression Trees. Wadsworth, Belmont, California. (reprinted by CRC Press, Boca Raton, Florida).

Chang, Y.K., 1999. On statistical inference for scale parameter with doubly censored data. Ph.D. Thesis, Department of Statistics, National Cheng-Chu University, Taipei.
