On some Variable Selection Procedures Based on Data for Regression Models


Inference 136 (2006) 2020 – 2034

www.elsevier.com/locate/jspi


Deng-Yuan Huang (a,∗), Ren-Fen Lee (b), S. Panchapakesan (c)

(a) Institute of Applied Statistics, Fu-Jen Catholic University, 510 Chung Cheng Road, Hsinchuang, Taipei Hsien, Taiwan, ROC

(b) Department of Accounting, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan, ROC

(c) Department of Mathematics, Southern Illinois University, Carbondale, IL 62901-4408, USA

Available online 8 September 2005

Abstract

We discuss variable selection procedures in regression models based on large data sets. Our purpose is to discover the pattern of the association between the dependent variable and the independent variables. The real-valued variable Y is the response (dependent) variable. The variables in the vector X are the predictor (independent) variables, which may be either ordered or categorical. A real-valued function d(X) is defined on the measure space. In regression models, predictors have been constructed using a parametric approach under the assumption that $E(Y \mid X = x) = d(x, \theta)$, where d has a known functional form depending on x and a finite set of parameters $\theta = (\theta_1, \theta_2, \dots, \theta_m)$. Then $\theta$ is estimated by the least squares method. In practical applications, however, the functional form of d is usually unknown. In such a situation, it is difficult to determine the regression function, and we use Classification And Regression Trees (CART) methods that integrate the data and the model. We propose selection procedures to select important predictor variables in the regression model based on data. Some criteria for selecting the important variables are discussed. An empirical study based on an annual survey of inbound visitors in Taiwan is provided to illustrate the implementation of our multiple decision procedure.

Keywords: Regression model; CART; Variables selection; Multiple decision

∗ Corresponding author.


1. Introduction

Let (X, Y) be a large data set. The variable Y is the response (dependent) variable. The variables in the vector X are the predictor (independent) variables, which may be either ordered or categorical. A real-valued function d(X) is defined on the measure space. In regression models, predictors have been constructed by using a parametric approach under the assumption that $E(Y \mid X = x) = d(x, \theta)$, where d has a known functional form depending on x and a finite set of parameters $\theta = (\theta_1, \theta_2, \dots, \theta_m)$.

Suppose we have a learning sample consisting of N cases: $(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$.

This is used to construct a predictor d(x). As a measure of the accuracy of such a predictor, we use $R^*(d) = E(Y - d(X))^2$. The predictor $d_B$ which minimizes $R^*(d)$ (or equivalently, maximizes the precision) is

$$d_B(x) = E(Y \mid X = x).$$

This is the well-known method of least squares. The predictor $d_B(X)$ is often referred to as the regression surface of Y on X. The error $R^*(d)$ is usually estimated by

$$R(d) = \frac{1}{N}\sum_{n=1}^{N}\bigl(y_n - d(x_n)\bigr)^2. \tag{1.1}$$
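The estimate in (1.1) is simply the mean squared residual of the predictor over the learning sample. A minimal sketch in plain Python follows; the function name `resubstitution_error`, the toy sample, and the toy predictor are illustrative assumptions, not part of the study.

```python
def resubstitution_error(xs, ys, d):
    """Estimate R(d) as in (1.1): mean squared residual of d on the learning sample."""
    if len(xs) != len(ys) or not ys:
        raise ValueError("xs and ys must be non-empty and of equal length")
    return sum((y - d(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Hypothetical learning sample and predictor, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 9.0]
d = lambda x: 2.0 * x          # assumed predictor d(x) = 2x

r_hat = resubstitution_error(xs, ys, d)
print(r_hat)                   # residuals are 0, 0, 0, 1, so R(d) = 1/4
```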

In practical applications, the functional form of d is usually unknown and it is difficult to determine the regression function. We will use the Classification And Regression Trees (CART) methods that integrate the data and the model.

Our focus is on data sets whose dimensionality requires some sort of variable selection. Therefore, we first describe the tree structure regression.

2. Tree structure regression

The binary tree structure (Fig. 1) classifies by repeatedly splitting subsets of the measure space into two descendant subsets. The procedure, at the first step, begins with the measure space itself. A tree structure predictor partitions the space by a sequence of binary splits into terminal nodes. The entire construction of a tree involves three elements:

(1) the selection of the splits,

(2) the decision whether to declare a node terminal or to continue splitting it, and

(3) the assignment of each terminal node to a class.

Splitting of a node continues until one of the following occurs (see Answer Tree 2.0: User's Guide, 1998):

(a) All cases in a node have identical values for all predictors.

(b) The node becomes pure, that is, all cases in the node have the same target value.

(c) The depth of the tree has reached its prespecified maximum value.

(d) The number of cases constituting the node is less than a prespecified minimum parent node size.

(e) The split at the node results in producing a child node whose number of cases is less than a prespecified minimum child node size.

(f) For CART only, the maximum decrease in impurity is less than a prespecified value.

Fig. 1. [Binary tree diagram: nodes $t_1, \dots, t_9$; splits 1 to 4; node values $y(t_4), \dots, y(t_8)$.]

The value of $d(x_n)$ that minimizes R(d) in (1.1) is the average $\bar{y}(t)$ for all cases $(x_n, y_n)$ falling into the terminal node t; i.e., $\bar{y}(t) = (1/N(t))\sum_{x_n \in t} y_n$, where N(t) is the total number of cases in node t. In the tree structure regression, we use instead of R(d) the quantity R(T) defined by

$$R(T) = \frac{1}{N}\sum_{t \in \tilde{T}}\sum_{x_n \in t}\bigl(y_n - \bar{y}(t)\bigr)^2, \tag{2.1}$$

where $\tilde{T}$ denotes the set of all nodes for the splits.

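Consider a node t split into $t_L$ and $t_R$. Define

$$s^2(t) = \frac{1}{N(t)}\sum_{x_n \in t}\bigl(y_n - \bar{y}(t)\bigr)^2. \tag{2.2}$$

Since

$$\sum_{x_n \in t}\bigl(y_n - \bar{y}(t)\bigr)^2 = \sum_{x_n \in t_L}\bigl(y_n - \bar{y}(t)\bigr)^2 + \sum_{x_n \in t_R}\bigl(y_n - \bar{y}(t)\bigr)^2,$$

we get

$$s^2(t) = p(t_L)\,s^2(t_L) + p(t_R)\,s^2(t_R), \tag{2.3}$$

where

$$s^2(t_L) = \frac{1}{N(t_L)}\sum_{x_n \in t_L}\bigl(y_n - \bar{y}(t)\bigr)^2, \qquad s^2(t_R) = \frac{1}{N(t_R)}\sum_{x_n \in t_R}\bigl(y_n - \bar{y}(t)\bigr)^2,$$

$$p(t_L) = \frac{N(t_L)}{N(t)}, \qquad p(t_R) = \frac{N(t_R)}{N(t)}, \qquad N(t) = N(t_L) + N(t_R).$$

Now, let $p(t) = N(t)/N$ for any set t. Then, we define

$$R(t) \equiv \frac{1}{N}\sum_{x_n \in t}\bigl(y_n - \bar{y}(t)\bigr)^2 = s^2(t)\,p(t). \tag{2.4}$$

We can rewrite R(T) in (2.1) as

$$R(T) = \sum_{t \in \tilde{T}} R(t). \tag{2.5}$$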

The expression in (2.5) has a simple interpretation. For every node t, $\sum_{x_n \in t}(y_n - \bar{y}(t))^2$ is the within-node sum of squares; that is, it is the total squared deviation of the $y_n$ in t from their average. Summing over $t \in \tilde{T}$ gives the total within-node sum of squares, and dividing by N gives the average.
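The interpretation of (2.4) and (2.5) can be checked numerically. The sketch below assumes each case has already been assigned a node label; the function names, the labels `"t4"` and `"t5"`, and the response values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
from collections import defaultdict


def node_risk(ys_in_node, n_total):
    """R(t) in (2.4): within-node sum of squares divided by the total sample size N."""
    node_mean = sum(ys_in_node) / len(ys_in_node)
    return sum((y - node_mean) ** 2 for y in ys_in_node) / n_total


def tree_risk(ys, node_ids):
    """R(T) in (2.5): sum of R(t) over the nodes into which the cases fall."""
    groups = defaultdict(list)
    for y, node in zip(ys, node_ids):
        groups[node].append(y)
    n_total = len(ys)
    return sum(node_risk(group, n_total) for group in groups.values())


# Hypothetical responses and node assignments.
ys = [1.0, 3.0, 10.0, 12.0, 14.0]
node_ids = ["t4", "t4", "t5", "t5", "t5"]

risk = tree_risk(ys, node_ids)
print(risk)  # within-node sums of squares are 2 and 8, so R(T) = 10 / 5
```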

Definition 2.1. Given any set of splits S of a current terminal node t in $\tilde{T}$, the best split $s^*$ is that split in S which yields the maximum decrease in R(T). More precisely, for any split s of t into $t_L$ and $t_R$, let $\Delta R(s, t) = R(t) - R(t_L) - R(t_R)$. Then, the best split is the $s^*$ for which

$$\Delta R(s^*, t) = \max_{s \in S} \Delta R(s, t). \tag{2.6}$$

For a detailed treatment of the CART techniques, the reader is referred to the book by Breiman et al. (1984).
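To make Definition 2.1 concrete, the following sketch searches the splits of a single ordered predictor for the one that maximizes the decrease R(t) − R(t_L) − R(t_R). It is an illustration under stated assumptions, not the procedure used in the study: candidate thresholds are taken as midpoints between consecutive distinct values (a common convention), each child uses its own average as in (2.4), and the data are invented.

```python
def within_ss(ys):
    """Sum of squared deviations of ys from their own mean (0 for an empty list)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(xs, ys, n_total=None):
    """Return (threshold, decrease) maximizing R(t) - R(t_L) - R(t_R) over splits x <= c."""
    n_total = n_total or len(ys)       # N; defaults to the node size (root node)
    parent_ss = within_ss(ys)
    values = sorted(set(xs))
    best = (None, 0.0)
    for lo, hi in zip(values, values[1:]):
        c = (lo + hi) / 2.0            # assumed candidate: midpoint between neighbours
        left = [y for x, y in zip(xs, ys) if x <= c]
        right = [y for x, y in zip(xs, ys) if x > c]
        decrease = (parent_ss - within_ss(left) - within_ss(right)) / n_total
        if decrease > best[1]:
            best = (c, decrease)
    return best


# Hypothetical node data: the response jumps between x = 2 and x = 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.0, 5.0, 5.0]
threshold, decrease = best_split(xs, ys)
print(threshold, decrease)
```

With these values the split at 2.5 leaves both children pure, so the whole of R(t) is removed by that split.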

3. Preliminary variables selection

Our analysis based on the data begins with a preliminary variables selection by using a measure of the importance of a variable in a node.

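To start with, we have the following obvious algebraic relations among the various sums of squares associated with a node t that is split into $t_L$ and $t_R$:

$$\sum_{x_n \in t}(y_n - \bar{y})^2 = \sum_{x_n \in t_L}(y_n - \bar{y})^2 + \sum_{x_n \in t_R}(y_n - \bar{y})^2,$$

$$\sum_{x_n \in t_L}(y_n - \bar{y})^2 = \sum_{x_n \in t_L}\bigl(y_n - \bar{y}(t_L)\bigr)^2 + \sum_{x_n \in t_L}\bigl(\bar{y}(t_L) - \bar{y}\bigr)^2,$$

$$\sum_{x_n \in t_R}(y_n - \bar{y})^2 = \sum_{x_n \in t_R}\bigl(y_n - \bar{y}(t_R)\bigr)^2 + \sum_{x_n \in t_R}\bigl(\bar{y}(t_R) - \bar{y}\bigr)^2.$$

These relations yield

$$\mathrm{SST}_t = \sum_{x_n \in t}(y_n - \bar{y})^2 = \sum_{x_n \in t_L}(y_n - \bar{y})^2 + \sum_{x_n \in t_R}(y_n - \bar{y})^2$$

$$= \Bigl\{\sum_{x_n \in t_L}\bigl(y_n - \bar{y}(t_L)\bigr)^2 + \sum_{x_n \in t_R}\bigl(y_n - \bar{y}(t_R)\bigr)^2\Bigr\} + \Bigl\{N(t_L)\bigl(\bar{y}(t_L) - \bar{y}\bigr)^2 + N(t_R)\bigl(\bar{y}(t_R) - \bar{y}\bigr)^2\Bigr\}$$

$$= \mathrm{SSE}_t + \mathrm{SSB}_t,$$

where $\mathrm{SSE}_t$ and $\mathrm{SSB}_t$ are the within-nodes sum of squares and the between-nodes sum of squares for the given split of the node t.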
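The decomposition of the total sum of squares for a node into within-nodes and between-nodes parts can be verified on a small example. The sketch below is illustrative only: the function names and the child-node values are invented, and the reference mean ȳ is taken to be the mean of all cases in the node, as it would be at the root.

```python
def mean(values):
    return sum(values) / len(values)


def decompose(left, right, y_bar):
    """Return (SST_t, SSE_t, SSB_t) for a node split into `left` and `right`."""
    sst = sum((y - y_bar) ** 2 for y in left + right)
    sse = (sum((y - mean(left)) ** 2 for y in left)
           + sum((y - mean(right)) ** 2 for y in right))
    ssb = (len(left) * (mean(left) - y_bar) ** 2
           + len(right) * (mean(right) - y_bar) ** 2)
    return sst, sse, ssb


# Hypothetical child nodes t_L and t_R.
left = [1.0, 3.0]
right = [10.0, 12.0, 14.0]
y_bar = mean(left + right)     # assumption: t is the root, so y_bar is its mean

sst, sse, ssb = decompose(left, right, y_bar)
print(sst, sse, ssb)           # 130 = 10 + 120
```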

Now, let $\{X_1, X_2, \dots, X_k\}$ denote the set of k independent variables. Suppose the node t contains the subset $\{X_1, X_2, \dots, X_{p-1}\}$ of the independent variables. Consider adding the variable $X_p$ to this set. Let $\mu = E(Y)$ and let $\mu_{ip} = E_{x_1, \dots, x_{p-1}}(Y_i \mid X_p = x_p)$, the expectation of $Y_i$ conditioned on $X_p$ in the node t. Let

$$\mathrm{SSB}_p = \sum_{i=1}^{N(t)} (\mu_{ip} - \mu)^2,$$

$$\mathrm{SS}\hat{\mathrm{B}}_{x_p}(t) = N(t_L)\bigl(\bar{y}(t_L) - \bar{y}\bigr)^2 + N(t_R)\bigl(\bar{y}(t_R) - \bar{y}\bigr)^2, \tag{3.1}$$

where $\mathrm{SS}\hat{\mathrm{B}}_{x_p}$ is calculated with $x_p$ included.

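A measure of the importance of including $x_p$ in splitting the node t into nodes $t_L$ and $t_R$ is defined by

$$C_{x_p}(t) = \frac{(1/N(t))\,\mathrm{SS}\hat{\mathrm{B}}_{x_p}(t)}{(1/N)\,\mathrm{SST}}, \tag{3.2}$$

where $\mathrm{SST} = \sum_{n=1}^{N}(y_n - \bar{y})^2$ and $N(t) = N(t_L) + N(t_R)$.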

Substituting (3.1) in (3.2), we have

$$C_{x_i}(t) = \frac{(N(t_L)/N(t))\bigl(\bar{y}(t_L) - \bar{y}\bigr)^2 + (N(t_R)/N(t))\bigl(\bar{y}(t_R) - \bar{y}\bigr)^2}{(1/N)\sum_{n=1}^{N}(y_n - \bar{y})^2}, \qquad i = 1, \dots, k. \tag{3.3}$$

The larger the value of $C_{x_i}(t)$, the more important the variable $x_i$ is in node t.

In a tree, $C_{x_p}(t)$ is computed for the last split in each stem to assess the importance of the stem. If the stem is the most important, then the variables in this stem are all included in the subsequent model.
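As an illustration of the measure in (3.3), the sketch below evaluates it for one split. The function name `split_importance` and the data are hypothetical, and ȳ is assumed to be the mean of all N responses, matching its use in SST.

```python
def mean(values):
    return sum(values) / len(values)


def split_importance(left, right, all_ys):
    """C_x(t) in (3.3): weighted between-node variation over the average total variation."""
    n_t = len(left) + len(right)                 # N(t) = N(t_L) + N(t_R)
    y_bar = mean(all_ys)                         # assumed: overall mean of all N cases
    between = ((len(left) / n_t) * (mean(left) - y_bar) ** 2
               + (len(right) / n_t) * (mean(right) - y_bar) ** 2)
    total = sum((y - y_bar) ** 2 for y in all_ys) / len(all_ys)   # (1/N) SST
    return between / total


# Hypothetical data: the node t is the root, split into t_L and t_R.
all_ys = [1.0, 3.0, 10.0, 12.0, 14.0]
left, right = [1.0, 3.0], [10.0, 12.0, 14.0]

c_value = split_importance(left, right, all_ys)
print(round(c_value, 4))   # numerator 0.4*36 + 0.6*16 = 24, denominator 130/5 = 26
```

A value close to 1 means that the split accounts for most of the average total variation, which is the sense in which a larger value marks a more important variable.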

4. Assessing the variables

After our preliminary selection of important variables, suppose we have p independent variables $x_1, x_2, \dots, x_p$. Suppose we run CART on $p - 1$ of these variables, leaving out one, say $x_i$. We define two measures for assessing the importance of the variable $x_i$ in a node t. Once a variable is dropped

(1) Suppose the node t is split into $t_L$ and $t_R$ without $x_i$. Define

$$I_{x_i}(t) = \frac{(N(t_L)/N(t))\bigl(\bar{y}(t_L) - \bar{y}\bigr)^2 + (N(t_R)/N(t))\bigl(\bar{y}(t_R) - \bar{y}\bigr)^2}{(1/N)\sum_{n=1}^{N}(y_n - \bar{y})^2}, \qquad i = 1, \dots, p. \tag{4.1}$$

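Let $\tilde{\tilde{T}}_i$ denote the set of all intermediate nodes for the splits without variable $x_i$, and let $\#(\tilde{\tilde{T}}_i)$ denote the number of all intermediate nodes. Define

$$I_{x_i} = \frac{1}{\#(\tilde{\tilde{T}}_i)}\sum_{t \in \tilde{\tilde{T}}_i} I_{x_i}(t). \tag{4.2}$$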

A smaller $I_{x_i}$ means that, without $x_i$, the weighted average of the between sum of squares (SSB), weighted inversely by the counts, is affected less in all intermediate nodes. In other words, a smaller $I_{x_i}$ means that the variable $x_i$ is more important.
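The averaging in (4.2) can be sketched as follows. The splits listed are assumed to come from a tree grown without the variable under assessment; the function names, the two intermediate nodes, and their values are invented for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def node_measure(left, right, all_ys):
    """I_x(t) in (4.1) for one intermediate node split into `left` and `right`."""
    n_t = len(left) + len(right)
    y_bar = mean(all_ys)                          # assumed: overall mean of all N cases
    between = ((len(left) / n_t) * (mean(left) - y_bar) ** 2
               + (len(right) / n_t) * (mean(right) - y_bar) ** 2)
    return between / (sum((y - y_bar) ** 2 for y in all_ys) / len(all_ys))


def averaged_measure(splits, all_ys):
    """I_x in (4.2): average of I_x(t) over all intermediate nodes."""
    return sum(node_measure(l, r, all_ys) for l, r in splits) / len(splits)


# Hypothetical tree grown without x_i: two intermediate nodes.
all_ys = [1.0, 3.0, 10.0, 12.0, 14.0]
splits = [
    ([1.0, 3.0], [10.0, 12.0, 14.0]),   # split of the root
    ([10.0], [12.0, 14.0]),             # split of the root's right child
]

i_value = averaged_measure(splits, all_ys)
print(round(i_value, 4))   # (24/26 + 18/26) / 2
```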

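(2) Define

$$R^2_{x_i} = \frac{\sum_{t \in T_i}\mathrm{SSB}(t)}{\mathrm{SST}} = \frac{\sum_{t \in T_i} N(t)\bigl(\bar{y}(t) - \bar{y}\bigr)^2}{\sum_{n=1}^{N}(y_n - \bar{y})^2}, \tag{4.3}$$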

where $T_i$ denotes the set of all terminal nodes for the splits without variable $x_i$. The measures $R^2_{x_i}$ and $R^2$, the latter defined as the coefficient of determination for linear regression models, are similar in the sense that both represent the proportion of the total variation explained by the between sum of squares (SSB). There are, however, some differences between $R^2_{x_i}$ and $R^2$. First, $R^2_{x_i}$ does not involve the independent variable $x_i$. Second, $R^2$ assumes a linear relationship among the variables, whereas $R^2_{x_i}$ makes no such assumption. Also, $R^2_{x_i}$ can measure the interactions among variables and assess the impact of each through the tree structure. The smaller $R^2_{x_i}$ is, the more important $x_i$ is.

We propose the rule: rank the variables according to the ordered $I_{x_i}$ or $R^2_{x_i}$, the smallest indicating the most important.
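A minimal sketch of the ranking rule based on (4.3) is given below. It assumes that, for each variable, a tree has already been grown without that variable and that the responses in its terminal nodes are available; the variable names `x1` and `x2`, the partitions, and the function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def r2_without(terminal_nodes, all_ys):
    """R^2_x in (4.3): between-node sum of squares over SST for a tree grown without x."""
    y_bar = mean(all_ys)
    ssb = sum(len(node) * (mean(node) - y_bar) ** 2 for node in terminal_nodes)
    sst = sum((y - y_bar) ** 2 for y in all_ys)
    return ssb / sst


def rank_variables(scores):
    """Order variable names by ascending score; the first is the most important."""
    return sorted(scores, key=scores.get)


# Hypothetical terminal-node partitions of the same five responses.
all_ys = [1.0, 3.0, 10.0, 12.0, 14.0]
partitions = {
    "x1": [[1.0, 3.0, 10.0], [12.0, 14.0]],   # tree grown without x1
    "x2": [[1.0, 3.0], [10.0, 12.0, 14.0]],   # tree grown without x2
}

scores = {name: r2_without(nodes, all_ys) for name, nodes in partitions.items()}
ranking = rank_variables(scores)
print(ranking)   # the variable whose removal leaves the smaller R^2 comes first
```

In this toy case, dropping `x1` leaves a tree that explains less of the total variation than dropping `x2`, so `x1` is ranked as the more important variable.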

All rules presented here are based on the nodes determined by using the stopping rules for a tree (discussed in Section 2). If the stopping rule is changed, then the variables considered may be different and/or the nodes may be different.

5. Model fitting

Once we have selected the important variables as described previously, the next step is to model the association between the dependent variable and the independent variables. Symmetric data are found to be efficient for model fitting, so we first transform the dependent variable suitably so that the data for the transformed variable Y exhibit symmetry. Suppose the independent variables $x_1, \dots, x_p$ are the variables from the root to the node t selected by using the criterion in (3.3). We first consider the regression model:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon = \beta^{T}x + \varepsilon, \tag{5.1}$$

Estimates of parameters

Variable                       DF   Parameter estimate   Standard error   T for H0: parameter = 0   Prob > |T|   Tolerance
NIGHT4 ($\hat{\beta}_{4}$)     1    16.582290            4.69427046       3.532                     0.0004       0.02629870
OC1 ($\hat{\beta}_{5}$)        1    −12.503078           2.62241263       −4.768                    0.0001       0.04998445
OC2 ($\hat{\beta}_{6}$)        1    −2.018459            0.68034618       −2.967                    0.0030       0.36638318
R1 ($\hat{\beta}_{7}$)         1    5.242674             1.22991467       4.263                     0.0001       0.08667428
R2 ($\hat{\beta}_{8}$)         1    6.409203             1.42409266       4.501                     0.0001       0.13211097
R3 ($\hat{\beta}_{9}$)         1    1.953004             1.49518191       1.306                     0.1916       0.79668501
R4 ($\hat{\beta}_{10}$)        1    1.656168             1.48579646       1.115                     0.2651       0.79809733
R5 ($\hat{\beta}_{11}$)        1    −2.948404            1.27713284       −2.309                    0.0210       0.54001625
R6 ($\hat{\beta}_{12}$)        1    5.727534             1.97232790       2.904                     0.0037       0.5701644

References

Answer Tree 2.0: User's Guide, 1998. SPSS Inc.

Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A., 1984. Classification and Regression Trees. Wadsworth, Belmont, California. (reprinted by CRC Press, Boca Raton, Florida).

Chang, Y.K., 1999. On statistical inference for scale parameter with doubly censored data. Ph.D. Thesis, Department of Statistics, National Cheng-Chu University, Taipei.