
Since the dependent variable, Return90, is dichotomous, fitting an ordinary linear regression model is not appropriate. Therefore, we employ a logit model for estimation. Let $Y_i$ denote Return90 for the $i$th EZTable user:

$$Y_i \sim \mathrm{Bernoulli}(P_i) \tag{1}$$

where $P_i$ denotes the probability of a returned visit within 90 days. From a regression perspective, $P_i$ is the expectation of Return90 and can be specified as:

$$E(Y_i \mid X_i) = P(Y_i = 1 \mid X_i) = P_i \tag{2}$$

where $X_i$ is the vector of independent variables. Consequently, we can specify a logit model:

$$P_i = \frac{\exp(\alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki})}{1 + \exp(\alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki})} \tag{3}$$

where $(\beta_1, \beta_2, \ldots, \beta_k)$ are the parameters of the explanatory variables. Equation (3) can be rewritten as

$$\log\!\left(\frac{P_i}{1 - P_i}\right) = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} \tag{4}$$

where $\log\!\left(\frac{P_i}{1 - P_i}\right)$ is the logit of $P_i$, which allows us to form a linear relationship between the independent variables and $\mathrm{logit}(P_i)$.
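For concreteness, a model of this form can be estimated in R with the built-in glm function. The sketch below is illustrative only: the data frame eztable and the covariate names (age, party_size, past_orders) are hypothetical stand-ins, since the study's actual variable names are not listed here.

```r
# Hypothetical data frame: one row per EZTable user, with the binary
# outcome Return90 and illustrative covariates (names are assumptions).
# eztable <- read.csv("eztable_users.csv")

# Fit the logit model of equation (4): logit(P_i) = alpha + beta * X_i
fit_logit <- glm(Return90 ~ age + party_size + past_orders,
                 family = binomial(link = "logit"),
                 data = eztable)

summary(fit_logit)                           # coefficients on the log-odds scale
head(predict(fit_logit, type = "response"))  # fitted P_i, the return probabilities
```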

The right-hand side of this generalized linear model in equation (4) is additive and linear. However, empirical data may arise from various data-generating processes and may require non-linear forms of the independent variables. In other words, the relation between the independent variables on the right-hand side of equation (4) and the response variable may be nonlinear. For example, Figure 4.1 shows that the model fits better when we allow a non-linear form of age (see below). It plots simulated ages against the corresponding wages. The straight line illustrates a linear relation between the independent variable and the dependent variable, while the curve shows a non-linear relation between the two. The non-linear model fits the data better than the linear one: it captures the non-linear effect of age between roughly 20 and 30 years old.


Fig 4.1 Simulated ages and corresponding wages: the straight line is a linear fit and the curve a non-linear fit
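The pattern in Fig 4.1 can be reproduced in a few lines of R. This is a minimal sketch under an assumed data-generating process (the figure's exact simulation is not specified in the text), in which wage rises steeply until about age 30 and then levels off, so a straight line underfits while a smooth curve tracks the bend.

```r
set.seed(42)
age <- runif(500, 18, 65)
# Assumed data-generating process: steep growth until ~30, then a plateau, plus noise
wage <- 20 + 15 * pmin(age, 30) / 30 + rnorm(500, sd = 2)

fit_lin    <- lm(wage ~ age)            # straight-line fit
fit_smooth <- smooth.spline(age, wage)  # non-linear smoothing-spline fit

plot(age, wage, col = "grey", pch = 16, cex = 0.5)
abline(fit_lin, lwd = 2)                # linear: misses the bend near age 30
lines(fit_smooth, lwd = 2, lty = 2)     # smooth: captures it
```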

In order to capture non-linear patterns, we apply the generalized additive model (GAM). A GAM is a generalized linear model that uses smooth functions to incorporate non-linear patterns of the independent variables:

$$\log\!\left(\frac{P_i}{1 - P_i}\right) = \alpha + \beta_1 S_1(X_{1i}) + \beta_2 S_2(X_{2i}) + \cdots + \beta_k S_k(X_{ki}) \tag{5}$$

where $S(\cdot)$ stands for the smooth function of a continuous independent variable.

We use the package gam in R to perform GAM. Through this package, we want to find a function $S$ that fits the data well but is smooth at the same time (James et al., 2013). One natural way is to determine the function $S$ that minimizes the spline objective function

$$\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - S(x_i)\bigr)^2 + \lambda \int \bigl(S''(x)\bigr)^2\,dx \tag{6}$$

With $y$ predicted by the curve $S(x)$, the first term of equation (6) is the mean squared error (MSE), which pushes $S(x)$ to match the data at each $x_i$. The second term measures the curvature of $S$, which controls how wiggly $S(x)$ is. It is modulated by the tuning parameter $\lambda \ge 0$: the higher the value of $\lambda$, the smoother the curve. Equivalently, the smoothing function minimizes the MSE subject to a constraint on the average curvature. The fitted $S(x)$ thus strikes a balance between minimizing the mean squared error and the penalty from increased curvature.
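Equation (6) is essentially the objective minimized by R's base smooth.spline function, whose spar argument (a monotone transform of $\lambda$; a lambda argument also exists) controls the penalty. The sketch below, run on the simulated age-wage data from the earlier sketch, shows how a small penalty yields a wiggly curve while a large one approaches a straight line.

```r
# Larger spar corresponds to larger lambda in equation (6): a smoother S(x)
s_rough  <- smooth.spline(age, wage, spar = 0.2)  # small penalty: wiggly fit
s_smooth <- smooth.spline(age, wage, spar = 1.0)  # large penalty: nearly linear

plot(age, wage, col = "grey", pch = 16, cex = 0.5)
lines(s_rough,  lwd = 2)           # follows the data closely, including noise
lines(s_smooth, lwd = 2, lty = 2)  # heavily penalized, close to a straight line
```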

Such a model extends the form of a generalized linear regression by allowing nonlinear patterns. This is more flexible because the relation between an independent variable and the dependent variable is not necessarily restricted to be linear. In other words, although the regression is not linear in $x$, it is linear in the converted variable $S(x)$. As with Fig 4.1, the transformation can capture the nonlinearity between age and wage that a standard linear regression would miss (Larsen, 2015).
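Putting equations (5) and (6) together, the gam package fits the logistic GAM directly, with s() requesting a smoothing spline for each continuous covariate. As before, the data frame and variable names are hypothetical placeholders; the choice of which terms to smooth is illustrative.

```r
library(gam)

# Logistic GAM of equation (5): smooth terms for continuous covariates,
# a plain linear term for the count-like covariate (an illustrative choice)
fit_gam <- gam(Return90 ~ s(age, df = 4) + s(party_size, df = 4) + past_orders,
               family = binomial,
               data = eztable)

summary(fit_gam)          # significance of parametric and smooth terms
plot(fit_gam, se = TRUE)  # fitted smooth S(x) for each term, with standard errors
```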

The purpose of our study is to enhance the predictive performance of returned visits for EZTable. We explore different model specifications and evaluate their performance with receiver operating characteristic (ROC) analysis. ROC is a common tool for evaluating the prediction accuracy of a binary classification system and is widely used in data mining and machine learning research (James et al., 2013). The ROC curve is the most popular such tool. It is a two-dimensional graph defined by the TP rate (Equation (7)) and the FP rate (Equation (8)). In a binary classification problem, there are four possible outcomes (see Table 7). Given that the outcome is in fact positive, a true positive (TP) occurs when the model predicts a positive result, and a false negative (FN) occurs when the model predicts a negative result. Similarly, given that the outcome is in fact negative, a true negative (TN) occurs when the model predicts a negative result, and a false positive (FP) occurs when the model predicts a positive result.

$$\mathrm{TPR\ (true\ positive\ rate)} = \frac{TP}{TP + FN} \tag{7}$$

$$\mathrm{FPR\ (false\ positive\ rate)} = \frac{FP}{TN + FP} \tag{8}$$

Table 7 Confusion table

                   | Actual Positive     | Actual Negative     | Total
Predict Positive   | True Positive (TP)  | False Positive (FP) | TP+FP
Predict Negative   | False Negative (FN) | True Negative (TN)  | FN+TN
Total              | TP+FN               | FP+TN               | TP+FP+FN+TN

The ROC curve utilizes the TPR (true positive rate) and the FPR (false positive rate). The true positive rate, also known as sensitivity, is the probability of classifying a positive response as positive. The false positive rate, on the other hand, is the probability of classifying a negative response as positive. TPR is plotted on the y-axis and FPR on the x-axis, and each point on the curve represents a combination of true positive rate and false positive rate.
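The quantities in equations (7) and (8) and Table 7 are straightforward to compute from a fitted model's predicted probabilities. The sketch below, continuing with the hypothetical fit_logit and eztable objects from earlier, builds the confusion table at a 0.5 threshold and then traces the ROC curve by sweeping the threshold.

```r
p_hat  <- predict(fit_logit, type = "response")  # fitted P_i from the logit model
actual <- eztable$Return90

# Confusion table at a single threshold of 0.5 (Table 7)
pred <- as.integer(p_hat > 0.5)
table(Predict = pred, Actual = actual)

# ROC curve: sweep the threshold and record (FPR, TPR) pairs
thresholds <- seq(0, 1, by = 0.01)
tpr <- sapply(thresholds, function(t) mean(p_hat[actual == 1] > t))  # equation (7)
fpr <- sapply(thresholds, function(t) mean(p_hat[actual == 0] > t))  # equation (8)

plot(fpr, tpr, type = "l", xlab = "FPR", ylab = "TPR")
abline(0, 1, lty = 2)  # the random-guessing diagonal
```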

In Figure 4.2, the dotted line traces the combinations of TPR and FPR. If the dotted line passes through the point (0, 1), the model classifies all outcomes perfectly, with TPR equal to 1 and FPR equal to 0. In our case, it would correctly predict whether every member returns to place an order within 90 days. On the other hand, if the dotted line passes through the point (1, 0), it wrongly predicts every member's behavior.

The diagonal line $y = x$ divides the ROC space into two parts, A and B. The diagonal represents a random-guessing classification model, for which TPR equals FPR. If the curve lies above the diagonal, the model classifies better than random guessing; if it lies below the diagonal, the model performs worse than random guessing (see Fig 4.2 below).


Fig 4.2 ROC curve; the diagonal divides the ROC space into region A (below the diagonal) and region B (above it)

In addition to comparing the curves from different models, the area under the curve (AUC) is another metric for evaluating the performance of predictive models. The AUC is the area under the ROC curve, with possible values from 0 to 1. The area of region A, which lies under the diagonal line, is 0.5, so a random-guessing model has an AUC of 0.5. If the AUC is larger than 0.5, as for the dotted line, the model performs better than a random-guessing one; that is, the larger the AUC, the better the model performs. In our study, we use the AUC to compare the predictive performance of different models.
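The AUC can be computed from the (FPR, TPR) pairs traced above by numerical integration, or obtained from a package such as pROC (using pROC here is our assumption; the text does not name an AUC implementation).

```r
# Trapezoidal integration of the ROC curve traced in the previous sketch
ord <- order(fpr)
auc_manual <- sum(diff(fpr[ord]) * (head(tpr[ord], -1) + tail(tpr[ord], -1)) / 2)

# Equivalent result via the pROC package (an assumed choice, not from the text)
library(pROC)
auc_proc <- auc(roc(actual, p_hat))

c(manual = auc_manual, pROC = as.numeric(auc_proc))
```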

