3. Data
EZTable would like to know whether a customer will place another reservation. If EZTable knows in advance which members are likely to return and use its services, it can design promotions more effectively. Our goal is therefore to use EZTable’s booking records to estimate members’ return probabilities accurately.
The data set contains the booking records of over 100 thousand different EZTable users between 2012 and 2014. Each record comes from a different member’s booking history, so each reservation record represents one member. EZTable defines 90 days as the meaningful period over which to record whether a member places another reservation. Each row holds the information of one member’s booking record: member id, restaurant id, booking date, dining date, number of diners, purpose, and status. These records come with two accompanying parts: (1) MEMBER data, which covers over 620 thousand members and includes member id, gender, and birthday; and (2) RESTAURANT data, which contains the profiles of 724 registered restaurants, such as nationality, country, and whether WIFI is provided.
There are over 100,000 booking records, with the variables listed in Table 1. As is common in data mining research with a data set of this size, we randomly split the data into two groups: a training set used to fit the models and a test set used to evaluate their performance.
The training set contains 62,083 bookings, about 60% of all cases, and the remaining 40% (41,546 cases) are held out for evaluation. The probability of return (placing another order within 90 days, Return90 = 1) is 0.1979 in the training set and 0.20105 in the test set. The probability of no return (not placing another order within 90 days, Return90 = 0) is 0.8021 in the training set and 0.79895 in the test set.
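The random 60/40 split described above can be sketched as follows. This is a minimal illustration and not the procedure actually used for the EZTable data: the function names, the fixed seeds, the record layout (a dictionary with a `return90` field), and the toy data are all assumptions made for the example.

```python
import random


def split_train_test(records, train_fraction=0.6, seed=0):
    """Randomly split records into a training set and a test set."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(records)    # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def return_rate(records):
    """Share of records whose member returned within 90 days (Return90 = 1)."""
    return sum(r["return90"] for r in records) / len(records)


# Hypothetical toy data: 1,000 bookings with roughly a 20% return rate.
toy_rng = random.Random(1)
bookings = [
    {"member_id": i, "return90": int(toy_rng.random() < 0.2)}
    for i in range(1000)
]

train, test = split_train_test(bookings)
print(len(train), len(test))
print(round(return_rate(train), 3), round(return_rate(test), 3))
```

Because the split is random, the return rates of the two sets should be close to each other, which is the pattern reported above for the real training and test sets.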
3.2 Measures
Return90 is the binary dependent variable of the data set. It equals 1 if the member places another reservation within 90 days and 0 otherwise. The independent variables are taken from the member’s last reservation record and comprise reservation information and restaurant information.
Reservation information contains the detailed booking information, described by the following variables.
age16-25. 1 represents age 16-25, while 0 represents other ranges.
age26-35. 1 represents age 26-35, while 0 represents other ranges.
age36-45. 1 represents age 36-45, while 0 represents other ranges.
ageOther. 1 represents an age that is outside the former three ranges or is missing, while 0 represents an age within those ranges.
gender. 1 represents a man, while 0 represents a woman.
timediff. The number of days between the day the order is placed and the dining day.
people. The size of the party of diners.
status_ok, status_changed, status_canceled. Dummy variables of status, recorded by EZTable according to the booking information from the restaurant. If the status is new or ok, the restaurant has no extra status information for the record, and status_ok is 1. If the member changed the booking, status_changed is 1. Finally, status_canceled is 1 if the member cancelled the reservation.
Restaurant information consists of dummy variables that denote the city area in which the restaurant is located (area), whether the restaurant is situated in a hotel (is_hotel, 1/0), and whether it provides wifi (wifi, 1/0). Table 2 and Table 3 show the summary statistics of the variables for the training data and the test data, respectively.
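The dummy coding described in this section can be sketched as below. This is only an illustration under stated assumptions: the function names are hypothetical, ages are assumed to be whole years (or `None` when missing), and the raw status labels `"new"`, `"ok"`, `"changed"`, and `"canceled"` are assumed spellings, since the raw EZTable labels are not documented here.

```python
def encode_age(age):
    """Map an age in whole years (None if missing) to the four age dummies."""
    dummies = {"age16-25": 0, "age26-35": 0, "age36-45": 0, "ageOther": 0}
    if age is not None and 16 <= age <= 25:
        dummies["age16-25"] = 1
    elif age is not None and 26 <= age <= 35:
        dummies["age26-35"] = 1
    elif age is not None and 36 <= age <= 45:
        dummies["age36-45"] = 1
    else:
        # Outside the three ranges, or missing.
        dummies["ageOther"] = 1
    return dummies


def encode_status(status):
    """Map a raw status label (assumed spelling) to the three status dummies.

    Any other label, or no label at all, leaves all three dummies at 0,
    which corresponds to a record with no status recorded.
    """
    return {
        "status_ok": int(status in ("new", "ok")),
        "status_changed": int(status == "changed"),
        "status_canceled": int(status == "canceled"),
    }


print(encode_age(30))
print(encode_age(None))
print(encode_status("canceled"))
```

Exactly one age dummy is 1 for every member, whereas all three status dummies can be 0 at the same time.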
Fig 3.1 shows the trend between People and Return90 and between Timediff and Return90, respectively. As the number of People goes up, the value of Return90 decreases. This implies that a member who books for a large party is less likely to place another order within 90 days. Generally speaking, people seldom dine out in a large group unless there is a family gathering or a class reunion, so it is hard to motivate these members to place another order in the short term. By contrast, a member who books for a smaller group is more likely to place another order within 90 days. If the member usually dines out with his/her intimate partner, it is more likely for them to place a reservation in advance.
The relation between Return90 and Timediff runs in the opposite direction. When there is a long period between the day the order is placed and the dining day, the member has a higher chance of placing another order in the short term. This indicates that the earlier the member places an order, the higher the chance that he/she will place another order soon.
Members who place orders early do so out of a habit of preparing in advance. They are used to reserving early to make sure they can dine on time even when the restaurant is fully booked. Accordingly, the probability of Return90 is higher when Timediff increases. Conversely, if the value of Timediff is small, the member may have booked on short notice, wanting to dine out on the spur of the moment. Hence, the probability of Return90 is smaller.
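The return probabilities behind these two trends can be recomputed from the counts in Table 4 and Table 5 as the number of returning members divided by the group size. The sketch below only restates that arithmetic; the variable and function names are chosen for the example.

```python
def return_probability(returned, not_returned):
    """Share of members in a group who placed another order within 90 days."""
    return returned / (returned + not_returned)


# (Return90 = True, Return90 = False) counts copied from Table 4 and Table 5.
people_counts = {
    "people1-2": (6250, 22659),
    "people3-5": (5156, 21780),
    "people6-10": (2105, 9708),
    "people_others": (344, 1998),
}
timediff_counts = {
    "timediff0-3": (5138, 26731),
    "timediff4-7": (2620, 11091),
    "timediff8-14": (2311, 7981),
    "timediff_others": (3786, 10342),
}

people_rates = [round(return_probability(*c), 3) for c in people_counts.values()]
timediff_rates = [round(return_probability(*c), 3) for c in timediff_counts.values()]

print(people_rates)    # falls as the party size grows
print(timediff_rates)  # rises as the booking lead time grows
```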
Fig 3.2 displays the return rates for the four types of status. The four groups are status_ok (0.155), status_canceled (0.379), status_changed (0.253), and no status recorded (0.119). If a reservation status is ‘canceled’ (status_canceled), the return probability of its booker is higher than in all the other situations. On the other hand, if a reservation status is ‘ok’ or ‘new’ (status_ok), its booker has a lower chance of placing another order within 90 days. This indicates that a member who cancelled a reservation is more likely to place another order within 90 days. The member might have cancelled the reservation for some reason and decided to dine out on another day, and therefore places another reservation within 90 days.
National Chengchi University (國立政治大學)
Variable                   Description                                               Coding
Gender                     Gender of the member                                      Dummy variable, 1: Male, 0: Female
Timediff                   Number of days between dining day and placing order day
People                     Size of the party of diners
Status                     Status of reservation
  status_ok                No information from the restaurant                        Dummy variable, 1: True, 0: False
  status_canceled          Reservation cancelled                                     Dummy variable, 1: True, 0: False
  status_changed           Reservation changed                                       Dummy variable, 1: True, 0: False
Restaurant Information
Is_hotel                   Is located in a hotel                                     Dummy variable, 1: True, 0: False
Cityarea                   Area the restaurant located
  new_taipei_city          In New Taipei City                                        Dummy variable, 1: True, 0: False
  out_of_greater_taipei    Exclude Taipei & New Taipei City                          Dummy variable, 1: True, 0: False
Wifi                       Providing wifi or not                                     Dummy variable, 1: True, 0: False
Table 2 Statistics of Training Data
Variables                  n      mean   Stdev.  Min   Max
Dependent Variable
  Return90                 62083  0.20   0.40    0.00  1.00
Reservation Information
  Age
    age16-25               9314   0.15   0.36    0.00  1.00
    age26-35               17770  0.29   0.45    0.00  1.00
    age36-45               9097   0.15   0.35    0.00  1.00
    ageOther               25902  0.42   0.49    0.00  1.00
  Gender                   62083  0.45   0.50    0.00  1.00
  Timediff                 62083  9.75   13.57   0.00  142.00
  People                   62083  4.05   2.93    1.00  45.00
  Status
    status_ok              62083  0.73   0.45    0.00  1.00
    status_canceled        62083  0.19   0.40    0.00  1.00
    status_changed         62083  0.01   0.12    0.00  1.00
Restaurant Information
  Is_hotel                 62083  0.36   0.48    0.00  1.00
  Cityarea
    new_taipei_city        62083  0.03   0.17    0.00  1.00
    out_of_greater_taipei  62083  0.06   0.23    0.00  1.00
  Wifi                     62083  0.61   0.49    0.00  1.00
Table 3 Statistics of Test Data
Variables                  n      mean   Stdev.  Min   Max
Dependent Variable
  Return90                 41546  0.20   0.40    0.00  1.00
Reservation Information
  Age
    age16-25               6144   0.15   0.35    0.00  1.00
    age26-35               11874  0.29   0.45    0.00  1.00
    age36-45               6046   0.15   0.35    0.00  1.00
    ageOther               17482  0.42   0.49    0.00  1.00
  Gender                   41546  0.45   0.50    0.00  1.00
  Timediff                 41546  9.68   13.39   0.00  147.00
  People                   41546  4.06   3.00    0.00  39.00
  Status
    status_ok              41546  0.72   0.45    0.00  1.00
    status_canceled        41546  0.20   0.40    0.00  1.00
    status_changed         41546  0.02   0.12    0.00  1.00
Restaurant Information
  Is_hotel                 41546  0.36   0.48    0.00  1.00
  Cityarea
    new_taipei_city        41546  0.03   0.17    0.00  1.00
    out_of_greater_taipei  41546  0.06   0.23    0.00  1.00
  Wifi                     41546  0.61   0.49    0.00  1.00
Table 4 Statistics of people
Group          People  Sample size  Return90=True : Return90=False  Return prob.
people1-2      1-2     28909        6250 : 22659                    0.216
people3-5      3-5     26936        5156 : 21780                    0.191
people6-10     6-10    11813        2105 : 9708                     0.178
people_others  others  2342         344 : 1998                      0.147
Table 5 Statistics of timediff
Group            Timediff  Sample size  Return90=True : Return90=False  Return prob.
timediff0-3      0-3       31869        5138 : 26731                    0.161
timediff4-7      4-7       13711        2620 : 11091                    0.191
timediff8-14     8-14      10292        2311 : 7981                     0.225
timediff_others  others    14128        3786 : 10342                    0.268
Fig 3.1a
Fig 3.1b
Table 6 Statistics of status
Group   status_ok  status_canceled  status_changed  Sample size  Return90=True : Return90=False  Return prob.
new     1          0                0               50504        7844 : 42660                    0.155
cancel  0          1                0               13669        5184 : 8485                     0.379
change  0          0                1               999          253 : 746                       0.253
none    0          0                0               4828         574 : 4254                      0.119
Fig 3.2
Since the dependent variable, Return90, is dichotomous, fitting an ordinary linear regression model is not appropriate. Therefore, we employ logit models for estimation. Let $Y_i$ denote Return90 for the $i$th EZTable user:
\[ Y_i \sim \mathrm{Bernoulli}(P_i) \tag{1} \]
where $P_i$ denotes the probability of a return visit within 90 days. From a regression perspective, $P_i$ is the expectation of Return90 and can be specified as
\[ E(Y_i \mid X_i) = \Pr(Y_i = 1 \mid X_i) = P_i \tag{2} \]
where $X_i$ is the vector of independent variables. Consequently, we can specify a logit model
\[ P_i = \frac{\exp(\alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki})}{1 + \exp(\alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki})} \tag{3} \]
where $\alpha$ is the intercept and $(\beta_1, \beta_2, \ldots, \beta_k)$ are the parameters of the explanatory variables. Equation (3) can be written as
\[ \log\left(\frac{P_i}{1 - P_i}\right) = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} \tag{4} \]
where $\log\left(\frac{P_i}{1 - P_i}\right)$ is the logit of $P_i$, which allows us to form a linear relationship between the independent variables and $\mathrm{logit}(P_i)$.
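The link between equations (3) and (4) can be checked numerically with a short sketch. The intercept, the two coefficients, and the covariate values below are hypothetical numbers chosen for illustration; they are not estimates from the EZTable data, and the function names are ours.

```python
import math


def linear_predictor(alpha, betas, x):
    """Right-hand side of equation (4): alpha + beta_1 * x_1 + ... + beta_k * x_k."""
    return alpha + sum(b * v for b, v in zip(betas, x))


def logistic(eta):
    """Equation (3): map a linear predictor to a probability in (0, 1)."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)  # equivalent form that avoids overflow for very negative eta
    return e / (1.0 + e)


def logit(p):
    """Left-hand side of equation (4): the log odds of p."""
    return math.log(p / (1.0 - p))


# Hypothetical values: an intercept and coefficients for timediff and people.
alpha = -1.5
betas = [0.02, -0.05]
x = [10, 4]  # booked 10 days ahead, for a party of 4

eta = linear_predictor(alpha, betas, x)
p = logistic(eta)
print(round(eta, 4), round(p, 4), round(logit(p), 4))
```

Applying `logit` to the probability returned by `logistic` recovers the linear predictor, which is the sense in which equations (3) and (4) are two forms of the same model.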
The right-hand side of the generalized linear model in equation (4) is additive and linear. However, empirical data may arise from various data-generating processes and may require non-linear forms of the independent variables. In other words, the relation between the independent variables on the right-hand side of equation (4) and the response variable may be non-linear. For example, Figure 4.1 (see below) shows that a model fits better when a non-linear form of age is allowed. It plots simulated ages against the corresponding wages. The straight line illustrates a linear relation between the independent variable and the dependent variable, and the curve shows a non-linear relation between the two. The non-linear model fits the data better than the linear one because it captures the non-linear pattern of age between roughly 20 and 30 years old.
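The comparison between a linear and a non-linear fit can be reproduced in spirit with simulated data. The sketch below is an assumption-laden illustration, not the simulation behind Figure 4.1: the concave wage profile, the noise level, the sample size, and the use of a quadratic term as the non-linear form are all invented for the example.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_least_squares(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def sum_squared_errors(rows, y, coef):
    """Sum of squared residuals of the fitted model."""
    return sum(
        (yi - sum(c * v for c, v in zip(coef, r))) ** 2
        for r, yi in zip(rows, y)
    )


# Hypothetical simulated data: wages rise quickly at young ages, then flatten.
rng = random.Random(0)
ages = [rng.uniform(18, 65) for _ in range(500)]
wages = [20 + 3.0 * (a - 18) - 0.05 * (a - 18) ** 2 + rng.gauss(0, 5) for a in ages]

z = [(a - 40) / 10 for a in ages]            # centred and scaled age
linear_rows = [[1.0, v] for v in z]          # intercept + age
quadratic_rows = [[1.0, v, v * v] for v in z]  # intercept + age + age squared

linear_coef = fit_least_squares(linear_rows, wages)
quadratic_coef = fit_least_squares(quadratic_rows, wages)
sse_linear = sum_squared_errors(linear_rows, wages, linear_coef)
sse_quadratic = sum_squared_errors(quadratic_rows, wages, quadratic_coef)
print(round(sse_linear), round(sse_quadratic))
```

The quadratic model contains the linear model as a special case, so its sum of squared errors can never be larger on the data used for fitting. With a curved data-generating process such as this one, the reduction is substantial.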