
3. Data


EZTable wants to know whether a customer will place another reservation. If EZTable knows in advance which members are likely to return, it can design promotions more effectively. Our goal is therefore to use EZTable's booking records to estimate members' return probabilities accurately.

The data set contains the booking records of over 100 thousand different EZTable users between 2012 and 2014. Each record comes from a different member's booking history, so each reservation record represents one member. EZTable defines 90 days as the meaningful window for recording whether a member places another reservation. Each row holds the information of one member's booking record: member id, restaurant id, booking date, dining date, number of diners, purpose, and status. The records draw on two further data sets: (1) MEMBER data, covering over 620 thousand members with fields such as member id, gender, and birthday, and (2) RESTAURANT data, containing the profiles of 724 registered restaurants, such as nationality, country, and whether WIFI is provided.

There are over 100,000 booking records, with the variables listed in Table 1. As is common in data mining research with a data set of this size, we randomly split the data into two groups: a training set for fitting models and a test set for evaluating the performance of the models.

The training set contains 62083 bookings, about 60% of all cases, and the remaining 40% (41546 cases) are held out for evaluation. The probability of return (placing an order within 90 days, Return90 = 1) is 0.1979 in the training set and 0.20105 in the test set. The probability of no return (not placing an order within 90 days, Return90 = 0) is correspondingly 0.8021 in the training set and 0.79895 in the test set.
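The random split described above can be sketched as follows. This is a minimal illustration only: the function names, the fixed seed, the exact 0.6 fraction, and the toy records are assumptions, not the procedure actually used on the EZTable data.

```python
import random


def split_train_test(records, train_fraction=0.6, seed=0):
    """Randomly split records into a training list and a test list."""
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def return_rate(records):
    """Share of records whose Return90 flag equals 1."""
    return sum(r["Return90"] for r in records) / len(records)


# Toy stand-in for the booking records: 1,000 rows, 20% of them returners.
toy = [{"member_id": i, "Return90": 1 if i % 5 == 0 else 0} for i in range(1000)]
train, test = split_train_test(toy)
print(len(train), len(test))
print(round(return_rate(train), 3), round(return_rate(test), 3))
```

With a purely random split the two return rates should land close to each other, which is the pattern reported for the training and test sets.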

3.2 Measures

Return90 is the binary dependent variable of the data set: 1 denotes that the member places another reservation within 90 days, and 0 otherwise. The independent variables are taken from the member's last reservation record and comprise reservation information and restaurant information.

Reservation information contains the detailed booking information. The member's age is coded as four dummy variables: age16-25 (1 if the member is aged 16-25, 0 otherwise), age26-35 (1 if aged 26-35, 0 otherwise), age36-45 (1 if aged 36-45, 0 otherwise), and ageOther (1 if the age falls outside these three ranges or is missing, 0 otherwise). The remaining variables are the member's gender (gender; 1 for men, 0 for women), the number of days between the day the order is placed and the dining day (timediff), and the size of the party of diners (people). The status dummy variables are recorded by EZTable according to the booking information from the restaurant. If the status is new or ok, the restaurant has no extra status information for the record (status_ok is 1). If the member changes his/her booking, status_changed is 1. Finally, if status_canceled is 1, the member cancelled the reservation.

Restaurant information consists of dummy variables denoting the city area in which the restaurant is located (area), whether the restaurant is situated in a hotel (is_hotel, 1/0), and whether it provides wifi (wifi, 1/0). Table 2 and Table 3 show the summary statistics of the variables for the training data and the test data, respectively.
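The coding of the reservation variables can be illustrated with a short sketch. The function name `encode_booking`, the raw field values ("M", "new", "ok", "changed", "canceled"), and the example booking are hypothetical; only the resulting variable names and their 1/0 meaning follow the definitions above.

```python
from datetime import date

# Age bands that receive their own dummy variable; everything else is ageOther.
AGE_BANDS = {"age16-25": (16, 25), "age26-35": (26, 35), "age36-45": (36, 45)}


def encode_booking(age, gender, status, booking_date, dining_date, people):
    """Turn one raw booking (hypothetical raw fields) into model variables."""
    row = {name: int(age is not None and lo <= age <= hi)
           for name, (lo, hi) in AGE_BANDS.items()}
    # ageOther is 1 when the age is missing or outside the three bands.
    row["ageOther"] = int(not any(row.values()))
    row["gender"] = int(gender == "M")                  # 1: male, 0: female
    row["status_ok"] = int(status in ("new", "ok"))
    row["status_changed"] = int(status == "changed")
    row["status_canceled"] = int(status == "canceled")
    row["timediff"] = (dining_date - booking_date).days  # days booked ahead
    row["people"] = people
    return row


example = encode_booking(29, "F", "canceled", date(2013, 5, 1), date(2013, 5, 10), 2)
print(example)
```

A record with a missing age (`age=None`) gets ageOther = 1, and a record whose raw status matches none of the listed values has all three status dummies equal to 0.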

Fig 3.1 shows how Return90 varies with People and with Timediff, respectively. As the number of People goes up, the return rate decreases: the larger the dining party, the less likely the member is to place another order within 90 days. Generally speaking, people seldom dine out in a large group unless there is an occasion such as a family gathering or a class reunion, so it is hard to motivate these members to place another order in the short term. By contrast, a member booking for a smaller group is more likely to place an order within 90 days. A member who usually dines out with an intimate partner, for instance, is more likely to place a reservation in advance.

The relation between Return90 and Timediff runs the opposite way. When the period between the day the order is placed and the dining day is long, the member has a higher chance of placing another order in the short term. In other words, the earlier a member places an order, the higher the chance that he/she will place another order soon.

Early booking reflects a habit of preparing in advance. Such members are used to reserving early so that they can dine at the time they want even when the restaurant would otherwise be fully booked. Accordingly, the probability of Return90 is higher when Timediff increases. Conversely, if the value of timediff is small, the member probably booked at the last minute, perhaps deciding to dine out on the spur of the moment. Hence, the probability of Return90 is smaller.
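The two trends can be checked directly from the group counts reported in Table 4 and Table 5. The sketch below only recomputes the return probability of each group as returned / (returned + not returned); the dictionary layout and function name are illustrative choices.

```python
# (Return90=True, Return90=False) counts as reported in Table 4 and Table 5.
PEOPLE = {"1-2": (6250, 22659), "3-5": (5156, 21780),
          "6-10": (2105, 9708), "others": (344, 1998)}
TIMEDIFF = {"0-3": (5138, 26731), "4-7": (2620, 11091),
            "8-14": (2311, 7981), "others": (3786, 10342)}


def return_probs(groups):
    """Return probability per group, rounded to three decimals."""
    return {name: round(yes / (yes + no), 3) for name, (yes, no) in groups.items()}


people_probs = return_probs(PEOPLE)
timediff_probs = return_probs(TIMEDIFF)
print(people_probs)    # falls as the party size grows
print(timediff_probs)  # rises as the booking lead time grows
```

The recomputed probabilities decrease monotonically across the People groups and increase monotonically across the Timediff groups, matching the description above.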

Fig 3.2 displays the return rates for the four status groups: status_ok (0.155), status_canceled (0.379), status_changed (0.253), and no status recorded (0.119). If a reservation status is ‘canceled’ (status_canceled), the return probability of its booker is higher than in all the other situations. On the other hand, if a reservation status is ‘ok’ or ‘new’ (status_ok), its booker has a lower chance of placing another order within 90 days. This indicates that a member who cancelled a reservation is more likely to place another order within 90 days: the member may have cancelled for some reason and decided to dine out on another day, and therefore places another reservation within 90 days.

National Chengchi University
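As a small sketch of how the three status dummies define four mutually exclusive groups, the snippet below maps a record's dummies to a group label and ranks the groups by return rate. The group labels and counts are those reported in Table 6; the function name and the ordering of the checks are assumptions.

```python
def status_group(status_ok, status_canceled, status_changed):
    """Map the three status dummies to one of the four groups."""
    if status_canceled:
        return "cancel"
    if status_changed:
        return "change"
    if status_ok:
        return "new"
    return "none"  # no status recorded: all three dummies are 0


# (Return90=True, Return90=False) counts per group as reported in Table 6.
COUNTS = {"new": (7844, 42660), "cancel": (5184, 8485),
          "change": (253, 746), "none": (574, 4254)}

rates = {g: round(yes / (yes + no), 3) for g, (yes, no) in COUNTS.items()}
ranking = sorted(rates, key=rates.get, reverse=True)
print(rates)
print(ranking)  # groups from highest to lowest return rate
```

Cancelled reservations come first in the ranking, followed by changed ones, in line with the rates quoted in the text.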

Table 1 (continued)

Variable                   Description
Gender                     Dummy variable (1: Male, 0: Female)
Timediff                   Number of days between dining day and placing order day
People                     Size of the party of diners
Status                     Status of reservation
  status_ok                No information from the restaurant; dummy variable (1: True, 0: False)
  status_canceled          Reservation cancelled; dummy variable (1: True, 0: False)
  status_changed           Reservation changed; dummy variable (1: True, 0: False)
Restaurant Information
Is_hotel                   Is located in a hotel; dummy variable (1: True, 0: False)
Cityarea                   Area where the restaurant is located
  new_taipei_city          In New Taipei City; dummy variable (1: True, 0: False)
  out_of_greater_taipei    Outside Taipei and New Taipei City; dummy variable (1: True, 0: False)
Wifi                       Providing wifi or not; dummy variable (1: True, 0: False)


Table 2 Statistics of Training Data

Variables                  n       mean    Stdev.   Min     Max
Dependent Variable
Return90                   62083   0.20    0.40     0.00    1.00
Reservation Information
Age
  age16-25                 9314    0.15    0.36     0.00    1.00
  age26-35                 17770   0.29    0.45     0.00    1.00
  age36-45                 9097    0.15    0.35     0.00    1.00
  ageOther                 25902   0.42    0.49     0.00    1.00
Gender                     62083   0.45    0.50     0.00    1.00
Timediff                   62083   9.75    13.57    0.00    142.00
People                     62083   4.05    2.93     1.00    45.00
Status
  status_ok                62083   0.73    0.45     0.00    1.00
  status_canceled          62083   0.19    0.40     0.00    1.00
  status_changed           62083   0.01    0.12     0.00    1.00
Restaurant Information
Is_hotel                   62083   0.36    0.48     0.00    1.00
Cityarea
  new_taipei_city          62083   0.03    0.17     0.00    1.00
  out_of_greater_taipei    62083   0.06    0.23     0.00    1.00
Wifi                       62083   0.61    0.49     0.00    1.00


Table 3 Statistics of Test Data

Variables                  n       mean    Stdev.   Min     Max
Dependent Variable
Return90                   41546   0.20    0.40     0.00    1.00
Reservation Information
Age
  age16-25                 6144    0.15    0.35     0.00    1.00
  age26-35                 11874   0.29    0.45     0.00    1.00
  age36-45                 6046    0.15    0.35     0.00    1.00
  ageOther                 17482   0.42    0.49     0.00    1.00
Gender                     41546   0.45    0.50     0.00    1.00
Timediff                   41546   9.68    13.39    0.00    147.00
People                     41546   4.06    3.00     0.00    39.00
Status
  status_ok                41546   0.72    0.45     0.00    1.00
  status_canceled          41546   0.20    0.40     0.00    1.00
  status_changed           41546   0.02    0.12     0.00    1.00
Restaurant Information
Is_hotel                   41546   0.36    0.48     0.00    1.00
Cityarea
  new_taipei_city          41546   0.03    0.17     0.00    1.00
  out_of_greater_taipei    41546   0.06    0.23     0.00    1.00
Wifi                       41546   0.61    0.49     0.00    1.00


Table 4 Statistics of people

Group            People   Sample size   Return90=True : Return90=False   Return prob.
people1-2        1-2      28909         6250 : 22659                     0.216
people3-5        3-5      26936         5156 : 21780                     0.191
people6-10       6-10     11813         2105 : 9708                      0.178
people_others    others   2342          344 : 1998                       0.147

Table 5 Statistics of timediff

Group            Timediff   Sample size   Return90=True : Return90=False   Return prob.
timediff0-3      0-3        31869         5138 : 26731                     0.161
timediff4-7      4-7        13711         2620 : 11091                     0.191
timediff8-14     8-14       10292         2311 : 7981                      0.225
timediff_others  others     14128         3786 : 10342                     0.268

Fig 3.1a

Fig 3.1b


Table 6 Statistics of status

Group    status                                               Sample size   Return90=True : Return90=False   Return prob.
new      status_ok=1, status_canceled=0, status_changed=0     50504         7844 : 42660                     0.155
cancel   status_ok=0, status_canceled=1, status_changed=0     13669         5184 : 8485                      0.379
change   status_ok=0, status_canceled=0, status_changed=1     999           253 : 746                        0.253
none     status_ok=0, status_canceled=0, status_changed=0     4828          574 : 4254                       0.119

Fig 3.2

Since the dependent variable, Return90, is dichotomous, fitting an ordinary linear regression model is not appropriate. Therefore, we employ logit models for estimation. Let $Y_i$ denote Return90 for the $i$th EZTable user:

$$Y_i \sim \mathrm{Bernoulli}(P_i) \tag{1}$$

where $P_i$ denotes the probability of a return visit within 90 days. From a regression perspective, $P_i$ is the conditional expectation of Return90 and can be specified as

$$E(Y_i \mid X_i) = \Pr(Y_i = 1 \mid X_i) = P_i \tag{2}$$

where $X_i$ is the vector of independent variables. Consequently, we can specify a logit model

$$P_i = \frac{\exp(\alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki})}{1 + \exp(\alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki})} \tag{3}$$

where $(\beta_1, \beta_2, \ldots, \beta_k)$ are the parameters of the explanatory variables. Equation (3) can be written as

$$\log\left(\frac{P_i}{1 - P_i}\right) = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} \tag{4}$$

where $\log\left(\frac{P_i}{1 - P_i}\right)$ is the logit of $P_i$, which allows us to form a linear relationship between the independent variables and $\mathrm{logit}(P_i)$.
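The link between equations (3) and (4) can be made concrete with a few lines of code. This is a minimal sketch using only the standard library; the function names are illustrative, and the intercept-only example is an assumption chosen because it needs no fitted coefficients: with all slopes set to zero, the intercept that reproduces the training return rate of 0.1979 is simply its logit.

```python
import math


def logit(p):
    """Log-odds of a probability, the left-hand side of equation (4)."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Inverse of the logit; algebraically equal to exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + math.exp(-z))


def return_probability(alpha, betas, x):
    """Equation (3): P_i for one member with covariate values x."""
    z = alpha + sum(b * xi for b, xi in zip(betas, x))
    return inv_logit(z)


# Intercept-only illustration: alpha equals the logit of the base rate.
alpha0 = logit(0.1979)
print(round(alpha0, 2))                                          # about -1.40
print(round(return_probability(alpha0, [0.0, 0.0], [3, 1]), 4))  # back to 0.1979
```

Applying `inv_logit` to the linear predictor always yields a value strictly between 0 and 1, which is why the logit form suits a dichotomous outcome where an ordinary linear regression does not.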

The right-hand side of the generalized linear model in equation (4) is additive and linear. However, empirical data may arise from various data-generating processes and may require non-linear forms of the independent variables. In other words, the relation between the independent variables on the right-hand side of equation (4) and the response variable may be nonlinear. For example, Figure 4.1 (see below) shows that a model fits better when a non-linear form of age is allowed. It plots simulated ages against the corresponding wages. The straight line illustrates a linear relation between the independent variable and the dependent variable, and the curve shows a non-linear relation between the two. The non-linear model fits the data better than the linear one: it captures the non-linear effect of age between roughly 20 and 30 years old.
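The gain from allowing a non-linear form can be sketched with simulated data. The example below is not the simulation behind Figure 4.1: the wage curve, the noise level, the seed, and the choice of a quadratic term are all assumptions made for illustration. It fits a straight line and a quadratic by ordinary least squares and compares their sums of squared errors.

```python
import random


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (small problems only)."""
    k = degree + 1
    # Augmented matrix [X'X | X'y] for the powers x^0 .. x^degree.
    a = [[sum(x ** (i + j) for x in xs) for j in range(k)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]  # coefficients, lowest power first


def sse(coefs, xs, ys):
    """Sum of squared errors of a fitted polynomial."""
    return sum((y - sum(c * x ** p for p, c in enumerate(coefs))) ** 2
               for x, y in zip(xs, ys))


rng = random.Random(0)                      # fixed seed (assumption)
ages = list(range(18, 61))
xs = [(age - 40) / 10 for age in ages]      # centred and scaled age
wages = [30 + 4 * x - 5 * x ** 2 + rng.gauss(0, 1) for x in xs]  # hypothetical curve

linear = fit_polynomial(xs, wages, 1)
quadratic = fit_polynomial(xs, wages, 2)
print(sse(linear, xs, wages), sse(quadratic, xs, wages))
```

Because the straight line is a special case of the quadratic, the quadratic fit can never have a larger squared error; when the underlying relation is curved, as here, the reduction is substantial. The same idea carries over to the right-hand side of equation (4), where a variable can enter through additional terms rather than a single linear one.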
