Inference 136 (2006) 2020 – 2034
www.elsevier.com/locate/jspi
On some variable selection procedures based on data
for regression models
Deng-Yuan Huang a,∗, Ren-Fen Lee b, S. Panchapakesan c
a Institute of Applied Statistics, Fu-Jen Catholic University, 510 Chung Cheng Road, Hsinchuang, Taipei Hsien, Taiwan, ROC
b Department of Accounting, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan, ROC
c Department of Mathematics, Southern Illinois University, Carbondale, IL 62901-4408, USA
Available online 8 September 2005
Abstract
We discuss variable selection procedures in regression models based on large data sets. Our purpose is to discover the pattern of association between the dependent variable and the independent variables. The real-valued variable Y is the response (dependent) variable. The variables in the vector X are the predictor (independent) variables, which may be either ordered or categorical. A function d(X) is defined on the measure space and takes real values. In regression models, predictors have been constructed using a parametric approach under the assumption that $E(Y \mid X = x) = d(x, \theta)$, where d has a known functional form depending on x and a finite set of parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_m)$. Then $\theta$ is estimated by the least squares method. In practical applications, however, the functional form of d is usually unknown. In such a situation, it is difficult to determine the regression function, and we use Classification And Regression Trees (CART) methods that integrate the data and the model. We propose selection procedures to select important predictor variables in the regression model based on data. Some criteria for selecting the important variables are discussed. An empirical study based on an annual survey of inbound visitors in Taiwan is provided to illustrate the implementation of our multiple decision procedure.
© 2005 Elsevier B.V. All rights reserved.
MSC: Primary 62J02; secondary 62F07; 62J05; 62J20
Keywords: Regression model; CART; Variables selection; Multiple decision
∗Corresponding author.
E-mail address: [email protected] (D.-Y. Huang).
0378-3758/$ - see front matter. doi:10.1016/j.jspi.2005.08.038
1. Introduction
Let (X, Y) be a large data set. The variable Y is the response (dependent) variable. The variables in the vector X are the predictor (independent) variables, which may be either ordered or categorical. A function d(X) is defined on the measure space and takes real values. In regression models, predictors have been constructed by using a parametric approach under the assumption that $E(Y \mid X = x) = d(x, \theta)$, where d has a known functional form depending on x and a finite set of parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_m)$.
Suppose we have a learning sample consisting of N cases: $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$. This is used to construct a predictor $d(x)$. As a measure of the accuracy of such a predictor, we use $R^*(d) = E(Y - d(X))^2$. The predictor $d_B$ which minimizes $R^*(d)$ (or, equivalently, maximizes the precision) is

$$d_B(x) = E(Y \mid X = x).$$

This is the well-known method of least squares. The predictor $d_B(X)$ is often referred to as the regression surface of Y on X. The error $R^*(d)$ is usually estimated by

$$R(d) = \frac{1}{N} \sum_{n=1}^{N} (y_n - d(x_n))^2. \tag{1.1}$$
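As an illustration of the estimate in (1.1), the following minimal Python sketch evaluates R(d) for two candidate predictors on a small learning sample. The function name `resubstitution_error`, the toy data, and the two predictors are hypothetical and serve only to show the arithmetic.

```python
def resubstitution_error(xs, ys, d):
    """Estimate R(d) = (1/N) * sum_n (y_n - d(x_n))^2, as in (1.1)."""
    if len(xs) != len(ys) or not ys:
        raise ValueError("xs and ys must be non-empty and of equal length")
    return sum((y - d(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Hypothetical learning sample with a single ordered predictor.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def constant_predictor(x):
    """Predict the sample mean of ys for every case."""
    return 5.0


def linear_predictor(x):
    """Predict y = 2x, which fits this toy sample exactly."""
    return 2.0 * x


print(resubstitution_error(xs, ys, constant_predictor))  # 5.0
print(resubstitution_error(xs, ys, linear_predictor))    # 0.0
```

The smaller value for the second predictor reflects its better fit to the learning sample; it says nothing about accuracy on new cases.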
In practical applications, the functional form of d is usually unknown and it is difficult to determine the regression function. We will use the Classification And Regression Trees (CART) methods that integrate the data and the model.
Our focus is on data sets whose dimensionality requires some sort of variables selection. Therefore, we first describe the tree structure regression.
2. Tree structure regression
The binary tree structure (Fig. 1) classifies by repeatedly splitting subsets of the measure space into two descendant subsets. The procedure begins, at the first step, with the measure space itself. A tree-structured predictor partitions the space by a sequence of binary splits into terminal nodes. The entire construction of a tree involves three elements:
(1) the selection of the splits,
(2) the decision whether to declare a node terminal or to continue splitting it, and
(3) the assignment of each terminal node to a class.
Splitting of a node continues until one of the following occurs (see Answer Tree 2.0: User’s Guide, 1998):
(a) All cases in a node have identical values for all predictors.
(b) The node becomes pure, that is, all cases in the node have the same target value.
(c) The depth of the tree has reached its prespecified maximum value.
(d) The number of cases constituting the node is less than a prespecified minimum parent node size.
(e) The split at the node would produce a child node whose number of cases is less than a prespecified minimum child node size.
(f) For CART only, the maximum decrease in impurity is less than a prespecified value.

[Fig. 1. Binary tree with nodes t1, ..., t9 produced by splits 1 to 4, with node labels y(t4), ..., y(t8).]

The value of $d(x_n)$ that minimizes $R(d)$ in (1.1) is the average $\bar{y}(t)$ over all cases $(x_n, y_n)$ falling into the terminal node t; i.e., $\bar{y}(t) = (1/N(t)) \sum_{x_n \in t} y_n$, where $N(t)$ is the total number of cases in node t. In tree-structured regression, we use, instead of $R(d)$, the quantity $R(T)$ defined by

$$R(T) = \frac{1}{N} \sum_{t \in \tilde{T}} \sum_{x_n \in t} (y_n - \bar{y}(t))^2, \tag{2.1}$$
where $\tilde{T}$ denotes the set of all terminal nodes produced by the splits.
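Consider a node t split into $t_L$ and $t_R$. Define

$$s^2(t) = \frac{1}{N(t)} \sum_{x_n \in t} (y_n - \bar{y}(t))^2. \tag{2.2}$$

Since

$$\sum_{x_n \in t} (y_n - \bar{y}(t))^2 = \sum_{x_n \in t_L} (y_n - \bar{y}(t))^2 + \sum_{x_n \in t_R} (y_n - \bar{y}(t))^2,$$

we get

$$s^2(t) = p(t_L)\, s^2(t_L) + p(t_R)\, s^2(t_R), \tag{2.3}$$

where

$$s^2(t_L) = \frac{1}{N(t_L)} \sum_{x_n \in t_L} (y_n - \bar{y}(t))^2, \qquad s^2(t_R) = \frac{1}{N(t_R)} \sum_{x_n \in t_R} (y_n - \bar{y}(t))^2,$$

$$p(t_L) = \frac{N(t_L)}{N(t)}, \qquad p(t_R) = \frac{N(t_R)}{N(t)}, \qquad \text{and} \qquad N(t) = N(t_L) + N(t_R).$$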
Now, let $p(t) = N(t)/N$ for any set t. Then, we define

$$R(t) \equiv \frac{1}{N} \sum_{x_n \in t} (y_n - \bar{y}(t))^2 = s^2(t)\, p(t). \tag{2.4}$$

We can rewrite $R(T)$ in (2.1) as

$$R(T) = \sum_{t \in \tilde{T}} R(t). \tag{2.5}$$
The expression in (2.5) has a simple interpretation. For every node t, $\sum_{x_n \in t} (y_n - \bar{y}(t))^2$ is the within-node sum of squares. That is, it is the total squared deviation of the $y_n$ in t from their average. Summing over $t \in \tilde{T}$ gives the total within-node sum of squares, and dividing by N gives the average.
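The interpretation of (2.4) and (2.5) can be checked numerically. The sketch below is only illustrative: it assumes that the nodes summed over are the terminal nodes, represents each node simply by the list of responses falling into it, and uses hypothetical function names and toy data.

```python
def node_mean(ys):
    """Average response ybar(t) of the cases in a node."""
    return sum(ys) / len(ys)


def node_risk(ys, n_total):
    """R(t) = (1/N) * sum_{x_n in t} (y_n - ybar(t))^2, as in (2.4)."""
    ybar = node_mean(ys)
    return sum((y - ybar) ** 2 for y in ys) / n_total


def tree_risk(terminal_nodes):
    """R(T) = sum of R(t) over the nodes, as in (2.5)."""
    n_total = sum(len(ys) for ys in terminal_nodes)
    return sum(node_risk(ys, n_total) for ys in terminal_nodes)


# Hypothetical responses grouped by terminal node (N = 5 cases in total).
terminal_nodes = [[1.0, 3.0], [10.0, 12.0, 14.0]]

# Within-node sums of squares are 2 and 8, so R(T) = (2 + 8) / 5.
print(tree_risk(terminal_nodes))
```

Here the two within-node sums of squares are 2 and 8, so the total divided by N = 5 is 2, in line with the verbal description above.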
Definition 2.1. Given any set of splits S of a current terminal node t in $\tilde{T}$, the best split $s^*$ is that split in S which yields the maximum decrease in $R(T)$. More precisely, for any split s of t into $t_L$ and $t_R$, let $\Delta R(s, t) = R(t) - R(t_L) - R(t_R)$. Then, the best split is the $s^*$ for which

$$\Delta R(s^*, t) = \max_{s \in S} \Delta R(s, t). \tag{2.6}$$
For a detailed treatment of the CART techniques, the reader is referred to the book by Breiman et al. (1984).
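Definition 2.1 can be sketched for the simplest case of one ordered predictor. The code below is a minimal illustration, not the CART implementation: it assumes that the candidate set S consists of thresholds halfway between consecutive distinct predictor values, and the names `risk` and `best_split` and the toy data are hypothetical.

```python
def risk(ys, n_total):
    """R(t) = (1/N) * sum (y_n - ybar(t))^2; an empty node contributes 0."""
    if not ys:
        return 0.0
    ybar = sum(ys) / len(ys)
    return sum((y - ybar) ** 2 for y in ys) / n_total


def best_split(xs, ys, n_total=None):
    """Return (threshold, decrease) maximising R(t) - R(t_L) - R(t_R)."""
    n_total = len(ys) if n_total is None else n_total
    parent = risk(ys, n_total)
    values = sorted(set(xs))
    best = (None, 0.0)
    # Assumed candidate splits: "x <= c" with c midway between distinct values.
    for lo, hi in zip(values, values[1:]):
        c = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= c]
        right = [y for x, y in zip(xs, ys) if x > c]
        decrease = parent - risk(left, n_total) - risk(right, n_total)
        if decrease > best[1]:
            best = (c, decrease)
    return best


# Hypothetical node: the response jumps between x = 2 and x = 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.0, 5.0, 5.0]
threshold, decrease = best_split(xs, ys)
print(threshold, decrease)  # 2.5 4.0
```

With these data the split at 2.5 separates the two response levels exactly, so both child nodes have zero risk and the decrease equals the parent risk.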
3. Preliminary variables selection
Our analysis based on the data begins with a preliminary variables selection by using a measure of the importance of a variable in a node.
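To start with, we have the following obvious algebraic relations among the various sums of squares associated with a node t that is split into $t_L$ and $t_R$:

$$\sum_{x_n \in t} (y_n - \bar{y})^2 = \sum_{x_n \in t_L} (y_n - \bar{y})^2 + \sum_{x_n \in t_R} (y_n - \bar{y})^2,$$

$$\sum_{x_n \in t_L} (y_n - \bar{y})^2 = \sum_{x_n \in t_L} (y_n - \bar{y}(t_L))^2 + \sum_{x_n \in t_L} (\bar{y}(t_L) - \bar{y})^2,$$

$$\sum_{x_n \in t_R} (y_n - \bar{y})^2 = \sum_{x_n \in t_R} (y_n - \bar{y}(t_R))^2 + \sum_{x_n \in t_R} (\bar{y}(t_R) - \bar{y})^2.$$

These relations yield

$$
\begin{aligned}
SST_t &= \sum_{x_n \in t} (y_n - \bar{y})^2 = \sum_{x_n \in t_L} (y_n - \bar{y})^2 + \sum_{x_n \in t_R} (y_n - \bar{y})^2 \\
&= \Bigl\{ \sum_{x_n \in t_L} (y_n - \bar{y}(t_L))^2 + \sum_{x_n \in t_R} (y_n - \bar{y}(t_R))^2 \Bigr\} + \bigl\{ N(t_L)(\bar{y}(t_L) - \bar{y})^2 + N(t_R)(\bar{y}(t_R) - \bar{y})^2 \bigr\} \\
&= SSE_t + SSB_t,
\end{aligned}
$$

where $SSE_t$ and $SSB_t$ are the within-nodes sum of squares and the between-nodes sum of squares for the given split of the node t.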
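The second and third relations hold because the cross term vanishes. A short derivation for the left child is sketched here (the right child is identical); it uses only the definition of the node average, so that the deviations from the node average sum to zero over the node.

```latex
\begin{aligned}
\sum_{x_n \in t_L} (y_n - \bar{y})^2
  &= \sum_{x_n \in t_L} \bigl[ (y_n - \bar{y}(t_L)) + (\bar{y}(t_L) - \bar{y}) \bigr]^2 \\
  &= \sum_{x_n \in t_L} (y_n - \bar{y}(t_L))^2
     + 2\,(\bar{y}(t_L) - \bar{y}) \sum_{x_n \in t_L} (y_n - \bar{y}(t_L))
     + N(t_L)\,(\bar{y}(t_L) - \bar{y})^2 \\
  &= \sum_{x_n \in t_L} (y_n - \bar{y}(t_L))^2 + N(t_L)\,(\bar{y}(t_L) - \bar{y})^2 ,
\end{aligned}
\qquad \text{since } \sum_{x_n \in t_L} (y_n - \bar{y}(t_L)) = 0 .
```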
Now, let $\{X_1, X_2, \ldots, X_k\}$ denote the set of k independent variables. Suppose the node t contains the subset $\{X_1, X_2, \ldots, X_{p-1}\}$ of the independent variables. Consider adding the variable $X_p$ to this set. Let $\mu = E(Y)$ and let $\mu_{ip} = E_{x_1, \ldots, x_{p-1}}(Y_i \mid X_p = x_p)$, the expectation of $Y_i$ conditioned on $X_p$ in the node t. Let

$$SSB_p = \sum_{i=1}^{N(t)} (\mu_{ip} - \mu)^2,$$

$$SS\hat{B}_{x_p}(t) = N(t_L)(\bar{y}(t_L) - \bar{y})^2 + N(t_R)(\bar{y}(t_R) - \bar{y})^2, \tag{3.1}$$

where $SS\hat{B}_{x_p}$ is calculated with $x_p$ included.
A measure of the importance of including $x_p$ in splitting the node t into nodes $t_L$ and $t_R$ is defined by

$$C_{x_p}(t) = \frac{(1/N(t))\, SS\hat{B}_{x_p}(t)}{(1/N)\, SST}, \tag{3.2}$$

where $SST = \sum_{i=1}^{N} (y_i - \bar{y})^2$ and $N(t) = N(t_L) + N(t_R)$.
Substituting (3.1) in (3.2), we have

$$C_{x_i}(t) = \frac{(N(t_L)/N(t))(\bar{y}(t_L) - \bar{y})^2 + (N(t_R)/N(t))(\bar{y}(t_R) - \bar{y})^2}{(1/N) \sum_{i=1}^{N} (y_i - \bar{y})^2}, \qquad i = 1, \ldots, k. \tag{3.3}$$
The larger the value of $C_{x_i}(t)$, the more important the variable $x_i$ is in node t.
In a tree, $C_{x_p}(t)$ is computed for the last split of each stem to assess the importance of the stem. If the stem is the most important, then the variables in this stem are all included in the subsequent model.
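The measure in (3.3) is straightforward to compute once the responses in the two child nodes are known. The following sketch is illustrative only: the function name `split_importance` and the toy data are hypothetical, and the split is described directly by the responses falling into each child rather than by a fitted tree.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def split_importance(y_left, y_right, y_all):
    """C_x(t) from (3.3): node-averaged between SS divided by (1/N) * SST."""
    ybar = mean(y_all)                    # overall mean of the learning sample
    n_t = len(y_left) + len(y_right)      # N(t) = N(t_L) + N(t_R)
    between = (len(y_left) * (mean(y_left) - ybar) ** 2
               + len(y_right) * (mean(y_right) - ybar) ** 2) / n_t
    total = sum((y - ybar) ** 2 for y in y_all) / len(y_all)
    return between / total


# Hypothetical learning sample of N = 6 responses; node t holds 4 of them.
y_all = [1.0, 1.0, 5.0, 5.0, 3.0, 3.0]
c_good = split_importance([1.0, 1.0], [5.0, 5.0], y_all)  # separates the levels
c_poor = split_importance([1.0, 5.0], [1.0, 5.0], y_all)  # child means coincide
print(c_good, c_poor)
```

The split that separates low from high responses receives the larger value, while a split whose child means both equal the overall mean receives zero. Note that the measure is a ratio of two averages and is not bounded above by 1.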
4. Assessing the variables
After our preliminary selection of important variables, suppose we have p independent variables $x_1, x_2, \ldots, x_p$. Suppose we put all of these variables except one, say $x_i$, into CART, that is, $p - 1$ variables. We define two measures for assessing the importance of the variable $x_i$ in a node t. Once a variable is dropped
(1) Suppose the node t is split into $t_L$ and $t_R$ without $x_i$. Define

$$I_{x_i}(t) = \frac{(N(t_L)/N(t))(\bar{y}(t_L) - \bar{y})^2 + (N(t_R)/N(t))(\bar{y}(t_R) - \bar{y})^2}{(1/N) \sum_{i=1}^{N} (y_i - \bar{y})^2}, \qquad i = 1, \ldots, p. \tag{4.1}$$
Let $\tilde{\tilde{T}}_i$ denote the set of all intermediate nodes for the splits without variable $x_i$, and let $\#(\tilde{\tilde{T}}_i)$ denote the number of such nodes. Define

$$I_{x_i} = \frac{1}{\#(\tilde{\tilde{T}}_i)} \sum_{t \in \tilde{\tilde{T}}_i} I_{x_i}(t). \tag{4.2}$$
A smaller $I_{x_i}$ means that, without $x_i$, the weighted average of the between sum of squares (SSB), weighted inversely by the counts, is affected less in all intermediate nodes. In other words, a smaller $I_{x_i}$ means that the variable $x_i$ is more important.
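A minimal sketch of (4.1) and (4.2) follows. It assumes that the tree grown without a given variable has already been fitted and is summarized by the responses in the left and right children of each intermediate node; the dictionary `splits_without`, the variable names, and the data are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def node_measure(y_left, y_right, ybar, total_var):
    """I_{x_i}(t) from (4.1) for one intermediate node split into t_L and t_R."""
    n_t = len(y_left) + len(y_right)
    between = (len(y_left) * (mean(y_left) - ybar) ** 2
               + len(y_right) * (mean(y_right) - ybar) ** 2) / n_t
    return between / total_var


def average_measure(splits, y_all):
    """I_{x_i} from (4.2): mean of I_{x_i}(t) over the intermediate nodes."""
    ybar = mean(y_all)
    total_var = sum((y - ybar) ** 2 for y in y_all) / len(y_all)
    values = [node_measure(left, right, ybar, total_var) for left, right in splits]
    return sum(values) / len(values)


y_all = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
# Hypothetical trees: for each dropped variable, the (left, right) responses
# of every intermediate node of the tree grown without that variable.
splits_without = {
    "x1": [([1.0, 6.0, 3.0], [2.0, 7.0, 8.0])],
    "x2": [([1.0, 2.0, 3.0], [6.0, 7.0, 8.0]), ([1.0], [2.0, 3.0])],
}
scores = {name: average_measure(s, y_all) for name, s in splits_without.items()}
ranking = sorted(scores, key=scores.get)  # smallest value first
print(scores, ranking)
```

In this toy example the tree grown without x1 separates the responses poorly, so its value is the smaller one and x1 is listed first in the ranking.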
(2) Define

$$R^2_{x_i} = \frac{\sum_{t \in T_i} SSB(t)}{SST} = \frac{\sum_{t \in T_i} N_t (\bar{y}(t) - \bar{y})^2}{\sum_{j=1}^{N} (y_j - \bar{y})^2}, \tag{4.3}$$
where $T_i$ denotes the set of all terminal nodes for the splits without variable $x_i$. The measures $R^2_{x_i}$ and $R^2$, the latter defined as the coefficient of determination for linear regression models, are similar in the sense that both represent the proportion of the total variation explained by the between sum of squares (SSB). There are, however, some differences between $R^2_{x_i}$ and $R^2$. First, $R^2_{x_i}$ does not involve the independent variable $x_i$. Second, $R^2$ assumes a linear relationship among the variables, whereas $R^2_{x_i}$ makes no such assumption. Also, $R^2_{x_i}$ can measure the interactions among variables and assess the impact of each through the tree structure. The smaller $R^2_{x_i}$ is, the more important $x_i$ is.
We propose the rule: rank the variables according to the ordered $I_{x_i}$ or $R^2_{x_i}$, the smallest indicating the most important.
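The ranking rule based on (4.3) can be sketched as follows. The code assumes that, for each dropped variable, the terminal nodes of the corresponding tree are available as lists of responses; the dictionary `trees`, the variable names, and the data are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def r2_without(terminal_nodes):
    """R^2_{x_i} from (4.3): sum_t N_t * (ybar(t) - ybar)^2 divided by SST."""
    y_all = [y for node in terminal_nodes for y in node]
    ybar = mean(y_all)
    sst = sum((y - ybar) ** 2 for y in y_all)
    ssb = sum(len(node) * (mean(node) - ybar) ** 2 for node in terminal_nodes)
    return ssb / sst


# Hypothetical terminal nodes of the trees grown without x1 and without x2.
trees = {
    "x1": [[1.0, 6.0, 3.0], [2.0, 7.0, 8.0]],
    "x2": [[1.0], [2.0, 3.0], [6.0, 7.0, 8.0]],
}
r2 = {name: r2_without(nodes) for name, nodes in trees.items()}
ranking = sorted(r2, key=r2.get)  # smallest R^2 first = most important first
print(r2, ranking)
```

Dropping x1 leaves a tree that explains only a small share of the total variation, so x1 is ranked as the more important of the two variables in this toy example.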
All rules presented here are based on the nodes determined by using the stopping rules for a tree (discussed in Section 2). If the stopping rule is changed, then the variables considered may be different and/or the nodes may be different.
5. Model fitting
Once we have selected the important variables as described previously, the next step is to model the association between the dependent variable and the independent variables. Symmetric data are found to be efficient for model fitting. So we first transform the dependent variable suitably so that the data for the transformed variable Y exhibit symmetry. Suppose the independent variables $x_1, \ldots, x_p$ are the variables from the root to the node t selected by using the criterion in (3.3). We first consider the regression model:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon = \beta^T x + \varepsilon, \tag{5.1}$$
Estimates of parameters

Variable                    DF   Parameter estimate   Standard error   T for H0: parameter = 0   Prob > |T|   Tolerance
NIGHT4 ($\hat\beta_4$)      1     16.582290           4.69427046        3.532                    0.0004       0.02629870
OC1 ($\hat\beta_5$)         1    −12.503078           2.62241263       −4.768                    0.0001       0.04998445
OC2 ($\hat\beta_6$)         1     −2.018459           0.68034618       −2.967                    0.0030       0.36638318
R1 ($\hat\beta_7$)          1      5.242674           1.22991467        4.263                    0.0001       0.08667428
R2 ($\hat\beta_8$)          1      6.409203           1.42409266        4.501                    0.0001       0.13211097
R3 ($\hat\beta_9$)          1      1.953004           1.49518191        1.306                    0.1916       0.79668501
R4 ($\hat\beta_{10}$)       1      1.656168           1.48579646        1.115                    0.2651       0.79809733
R5 ($\hat\beta_{11}$)       1     −2.948404           1.27713284       −2.309                    0.0210       0.54001625
R6 ($\hat\beta_{12}$)       1      5.727534           1.97232790        2.904                    0.0037       0.5701644

References
Answer Tree 2.0: User’s Guide, 1998. SPSS Inc.
Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A., 1984. Classification and Regression Trees. Wadsworth, Belmont, California. (reprinted by CRC Press, Boca Raton, Florida).
Chang, Y.K., 1999. On statistical inference for scale parameter with doubly censored data. Ph.D. Thesis, Department of Statistics, National Cheng-Chu University, Taipei.