Chapter 4: Methodology
4.2. Data mining process
Table 16: Variables used in regression model
Arrival model dimension | Type | Departure model dimension | Type
Airline | Qualitative | Airline | Qualitative
Aircraft Type | Qualitative | Aircraft Type | Qualitative
Origin | Qualitative | Destination | Qualitative
Year | Qualitative | Year | Qualitative
Month | Qualitative | Month | Qualitative
Weekday | Qualitative | Weekday | Qualitative
Hourgroup | Qualitative | Hourgroup | Qualitative
Load factor | Quantitative | Load factor | Quantitative
Passengers | Quantitative | Passengers | Quantitative
Scheduled turnaround time | Quantitative | Scheduled turnaround time | Quantitative
 | | Arrival delay | Quantitative
After regression analysis using factors for categorical variables, the adjusted R-squared statistics also showed that the combination of factors and variables does not sufficiently explain delays at Taoyuan Airport. Interestingly, after restricting the data to flights with scheduled turnaround times and removing flights with long turnaround times, the adjusted R-squared improved markedly for departure delays, but the model still does not explain much of the variance in delays. The adjusted R-squared values are displayed in Table 15.
4.2. Data mining process
As pointed out by Sternberg et al. (2016), regression may not be the most useful model for determining frequent patterns in flight delays. Although classical statistical models can be used to determine which factors have an effect on flight delays, they cannot show which treatments within the factors are more prone to experiencing a delay. This is possibly due to the fact that dummy variables are used for qualitative factors. In addition, very little correlation exists between the independent variables and the dependent variable, and it is not possible to find correlation effects between the quantitative dependent variable (delay minutes) and qualitative factors.
Data mining methods, and more specifically association rules, can be used to find these hidden patterns within variables and factors, as suggested by Sternberg et al. (2016). Association rules express a set of events, known as the antecedent or the left-hand side, that lead to another, known as the consequent or right-hand side, in conditional probability form (Tan et al., 2006).
Association rules can be expressed as
𝑋 → 𝑌
where 𝑋 and 𝑌 are disjoint itemsets (i.e. 𝑋 ∩ 𝑌 = ∅), and 𝑋 is an itemset which leads to 𝑌; therefore 𝑋 is the antecedent and 𝑌 is the consequent term. Such an association rule implies that there is some relationship between 𝑋 and 𝑌. This can be illustrated through the following simple example:
Transaction Contents
1 {apples, pears, juice}
2 {apples, milk, juice, bread}
3 {milk, apples, cola}
4 {apples, juice, bread}
5 {pears, juice, cola, milk}
From this trivial example, we can conclude that the following is an association rule:
{𝑎𝑝𝑝𝑙𝑒𝑠} → {𝑗𝑢𝑖𝑐𝑒},
because purchasing apples may suggest purchasing juice as well. The above association has a length of two (one item in the antecedent plus one item in the consequent). Perhaps, the following rule, of length three, may be less obvious at first sight but is also an association rule:
{𝑎𝑝𝑝𝑙𝑒𝑠, 𝑗𝑢𝑖𝑐𝑒} → {𝑏𝑟𝑒𝑎𝑑}
Association rules are implication expressions whose strength is measured by support and confidence (Tan et al., 2006). Together, support and confidence can be applied and interpreted in practice through lift, a function of support and confidence that indicates how much more (or less) likely the itemset is to lead to a specified result than would be expected if the two were unrelated.
Support serves as a proxy to eliminate any itemsets that occur by chance (Tan et al., 2006). Setting a high support might result in very few rules of interest, whereas setting a low support will result in too many insignificant rules. Confidence, on the other hand, measures the conditional probability of 𝑌 given 𝑋, which indicates the reliability of the rule (Tan et al., 2006).
Originally used as a technique to determine shopping basket behavior, association rules were used to determine which items were purchased together (Han et al., 2011; Tan et al., 2006). By using association rules, shops could determine which items should be placed together to increase the chance of such products being purchased because of consumer behavior. Given a set of items 𝐼 = {𝑖1, 𝑖2, …, 𝑖𝑚} and a set of transactions 𝑇 = {𝑡1, 𝑡2, …, 𝑡𝑛}, every transaction 𝑡 contains items from 𝐼. Furthermore, a transaction contains 𝑋 if 𝑋 is a subset of 𝑡 (Tan et al., 2006).
Support count, denoted by 𝜎(𝑋) for any itemset 𝑋, is defined as the number of transactions that contain 𝑋 (Tan et al., 2006). This can be mathematically represented as follows:
𝜎(𝑋) = |{𝑡𝑖 | 𝑋 ⊆ 𝑡𝑖, 𝑡𝑖 ∈ 𝑇}|.
Using this definition in the example above, 𝜎(𝑋) = 2 when 𝑋 = {𝑎𝑝𝑝𝑙𝑒𝑠, 𝑏𝑟𝑒𝑎𝑑}, as apples and bread appear together in two of the transactions. Applying this to our original definition of an association rule, we can define support and confidence for 𝑋 → 𝑌 as
𝑠(𝑋 → 𝑌) = 𝜎(𝑋 ∪ 𝑌) / 𝑁
𝑐(𝑋 → 𝑌) = 𝜎(𝑋 ∪ 𝑌) / 𝜎(𝑋),
where 𝑁 is the total number of transactions.
Applying these definitions to the example above, we find that
𝑠({𝑎𝑝𝑝𝑙𝑒𝑠, 𝑗𝑢𝑖𝑐𝑒} → {𝑏𝑟𝑒𝑎𝑑}) = 0.4
𝑐({𝑎𝑝𝑝𝑙𝑒𝑠, 𝑗𝑢𝑖𝑐𝑒} → {𝑏𝑟𝑒𝑎𝑑}) = 0.67.
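The worked values above can be checked with a short script. This is a minimal sketch in Python (not the R tooling used in this study); the function names are illustrative choices, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Transactions from the shopping example above.
transactions = [
    {"apples", "pears", "juice"},
    {"apples", "milk", "juice", "bread"},
    {"milk", "apples", "cola"},
    {"apples", "juice", "bread"},
    {"pears", "juice", "cola", "milk"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent, transactions):
    """s(X -> Y) = sigma(X u Y) / N."""
    both = support_count(antecedent | consequent, transactions)
    return Fraction(both, len(transactions))

def confidence(antecedent, consequent, transactions):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    both = support_count(antecedent | consequent, transactions)
    return Fraction(both, support_count(antecedent, transactions))

X = {"apples", "juice"}
Y = {"bread"}
print(support_count({"apples", "bread"}, transactions))   # 2
print(float(support(X, Y, transactions)))                 # 0.4
print(round(float(confidence(X, Y, transactions)), 2))    # 0.67
```

The confidence is exactly 2/3: three transactions contain both apples and juice, and two of those also contain bread.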
Tan et al. (2006), however, stress that looking only at confidence can be misleading, as it considers the support of the antecedent but not that of the consequent. To address this problem, another metric known as lift is introduced, which can be defined as follows:
𝐿(𝑋 → 𝑌) = 𝑐(𝑋 → 𝑌) / 𝑠(𝑌).
The lift, also known as the interest factor, of an association rule considers the confidence of the rule in relation to the support of the consequent. The lift measure is also able to show whether there is a positive or negative correlation between 𝑋 and 𝑌.
By applying the lift to the example above we find
𝐿({𝑎𝑝𝑝𝑙𝑒𝑠, 𝑗𝑢𝑖𝑐𝑒} → {𝑏𝑟𝑒𝑎𝑑}) ≈ 1.67,
which means that buying apples and juice is positively correlated with buying bread. Furthermore, a lift of about 1.67 shows that shoppers who bought apples and juice are roughly 67% more likely to buy bread than shoppers in general. If a lift is less than one, it would imply that the two events are negatively correlated.
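The lift calculation can be sketched in the same way. The snippet below is an illustrative Python check rather than part of the study's R workflow, and the helper names are assumptions. Working with exact fractions gives (2/3)/(2/5) = 5/3 ≈ 1.667; dividing the rounded confidence 0.67 by 0.4 gives 1.675 instead.

```python
from fractions import Fraction

# Transactions from the shopping example above.
transactions = [
    {"apples", "pears", "juice"},
    {"apples", "milk", "juice", "bread"},
    {"milk", "apples", "cola"},
    {"apples", "juice", "bread"},
    {"pears", "juice", "cola", "milk"},
]

def sigma(itemset):
    """Support count: transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def lift(antecedent, consequent):
    """L(X -> Y) = c(X -> Y) / s(Y)."""
    conf = Fraction(sigma(antecedent | consequent), sigma(antecedent))
    support_y = Fraction(sigma(consequent), len(transactions))
    return conf / support_y

bread_lift = lift({"apples", "juice"}, {"bread"})
pears_lift = lift({"apples"}, {"pears"})
print(bread_lift, round(float(bread_lift), 3))   # 5/3 1.667
print(pears_lift, float(pears_lift))             # 5/8 0.625
```

The second rule, {apples} → {pears}, has a lift below one, which illustrates a negatively correlated pair in the same data.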
Using association rules, we can determine what factors are common and related to on time performance by manipulating the consequent term to contain an on-time performance measure and letting the antecedent term contain the factors that could lead to such performance. Therefore, this study will be analyzing lift rules that would look like:
𝐿 ({𝑓𝑎𝑐𝑡𝑜𝑟𝑠} → {𝑜𝑛 𝑡𝑖𝑚𝑒 𝑝𝑒𝑟𝑓𝑜𝑟𝑚𝑎𝑛𝑐𝑒 𝑚𝑒𝑎𝑠𝑢𝑟𝑒}).
This will return a set of lift values that indicate how strongly certain factors are associated with a given on-time performance outcome. The higher the lift, the more likely a given event leads to the specified on-time performance. This will form the basis of the analysis.
The data mining process consists of (i) data indexing, (ii) rules generation, and (iii) rules analysis. Parts (i) and (ii) will be discussed below and part (iii) in the following section.
Data indexing
As association rules treat each distinct value as a separate item, with around 290,000 flights such rules may not be obvious at first glance, and mining them may be time and resource consuming. When dealing with large amounts of data, it is useful to index values by applying techniques that divide them into manageable groups or bins. This process “standardizes” the data in two ways: continuous variables are discretized, and categorical variables that take on too many values to be analyzed are grouped or clustered (Han et al., 2011; Yaghini, 2008).
Indexing techniques used in this study include conceptual hierarchy, binning, and categorization, as outlined by Han et al. (2011). Conceptual hierarchy is used when detailed observations are too unique to meet the support threshold (Sternberg et al., 2016). These observations are therefore replaced with higher-level intervals, making them more manageable and providing enough support to generate rules. In this study, the scheduled time of flights has been divided into year, month, weekday and time-of-day groups.
Binning is a common technique for grouping continuous values into intervals. Although bin ranges of equal length are traditionally determined using statistical or mathematical techniques, this study uses bin ranges that are typically employed in the airline industry, slightly modifying bins so that they comply with industry analysis. This kind of industrial binning has also been used in past literature concerning this industry (e.g. Abdel-Aty et al., 2007; Perez-Rodriguez, 2017). The dimension that used traditional binning is passengers per flight, whereas dimensions that used modified or industrial binning include route distance, airport size, competition on route, passenger per flight, turnaround time and on-time performance.
Finally, categorization was applied to categorical variables. Categorical attributes can take on a finite (although possibly large) number of values and, unlike discrete variables, have no ordering (Yaghini, 2008). Categorical variables in this study include airline alliances and airline service type. In this study, aircraft type employed a combination of both categorization and concept hierarchy, being divided into the groups listed in Table 17 (Airbus Narrow, Airbus Wide, Boeing Narrow, Boeing Wide and Regional).
A summary of all data indexing is presented in Table 17.
Table 17: Data indexing criteria
Dimension | Indexed factor | Indexed values | Indexing technique
Airline characteristics | Airline | Airline's ICAO 3-letter code | Categorization
 | Alliance | ONE: oneworld |
 | Service type | FSC: Full service carrier; LCC: Low cost carrier | Categorization
Aircraft characteristics | Aircraft type | Airbus Narrow: A319, A320, A321; Airbus Wide: A332, A333, A343, A359, A388; Boeing Narrow: B734, B737, B738, B739, B757; Boeing Wide: B744, B748, B763, B788, B789; Regional: E190, MD89, MD90 | Categorization, conceptual hierarchy
Route characteristics | Airport | Airport IATA 3-letter code | Categorization
 | Airport size | Small: below 10m passengers a year; Medium: 11-30m passengers a year; Large: 31-50m passengers a year; Mega: 50m+ passengers a year | Binning
Scheduled time | Year | 2014-2016 | Concept hierarchy
 | Month | 1-12 | Concept hierarchy
 | Weekday | Monday-Sunday | Concept hierarchy
 | Hour group | Dawn: 00:00-05:59; Early morning: 06:00-08:59 |
Passenger loading | Load factor | Empty: 0-50%; Medium: 51-70% |
… characteristics | Turn type | Very short: 0-59 minutes; Short: 60-119 minutes |
 | … performance | Early: before scheduled time; On time: scheduled time until 15 minutes after; Delay: 16 minutes-1 hour after scheduled time; Overtime: more than 1 hour after scheduled time | Binning
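The on-time performance bins listed in Table 17 can be sketched as a simple lookup. This is an illustrative Python sketch, not the study's code: the function name `otp_bin` is hypothetical, the delay is assumed to be measured in whole minutes relative to the scheduled time (negative when early), and the bin edges at 15 and 60 minutes are assumed to be inclusive upper bounds.

```python
def otp_bin(delay_minutes):
    """Map a delay in minutes (actual minus scheduled time) to a Table 17 bin."""
    if delay_minutes < 0:
        return "Early"       # before scheduled time
    if delay_minutes <= 15:
        return "On time"     # scheduled time until 15 minutes after
    if delay_minutes <= 60:
        return "Delay"       # 16 minutes to 1 hour after scheduled time
    return "Overtime"        # more than 1 hour after scheduled time

# Hypothetical delays, for illustration only.
for minutes in (-5, 0, 15, 16, 60, 61):
    print(minutes, otp_bin(minutes))
```

Each flight then carries one label per indexed factor, which is what allows it to be treated as a transaction of items during rule generation.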
Rules generation
This study uses the R package arules and its implementation of the apriori algorithm. This algorithm finds frequent itemsets and discards rules that do not meet the minimum confidence requirement. To avoid generating too many or too few rules, the support and confidence thresholds need to be chosen carefully. Each flight is represented by one value from each factor, creating the itemset from which antecedents are drawn. Only one value of each factor can be contained in the antecedent. For example, a flight cannot be operated during “dawn” and “midday” simultaneously.
As mentioned before, the consequent term has been restricted to contain the on-time performance variable. To find a relationship with at least one other variable, the minimum length of a rule is two: rules whose antecedents contain no elements (an antecedent length of zero) carry no information on which factors caused the delay.
However, if rules appear whose antecedents are frequently from the same itemset, it would imply that some correlation may be present. Similarly, antecedents that contain more than one element but seldom contain elements from the same itemsets could imply possible interaction effects between the two itemsets that could be investigated. Conversely, if we know that there is interaction between two variables, we would expect to see elements from these itemsets together.
For the confidence threshold, Sternberg et al. (2016) used the probability of a flight being delayed. Applying the same principle here would give a confidence threshold of 31.7%, the weighted average share of delayed flights. However, given that confidence can be interpreted as a conditional probability (Sternberg et al., 2016), this would imply rules that correspond to an on-time performance of 68.3%, much lower than the regional top 10 average, which has fluctuated around 80% in recent years (FlightStats, 2017). Thus, the minimum confidence used for this study is 0.2, in order to generate rules that could aid Taoyuan Airport’s overall performance.
To find the minimum support, Sternberg et al. (2016) considered a flight arriving at least once every day for six years. We extend this convention to one departure and one arrival flight for every day, out of all departure and arrival flights within three years. Therefore, the minimum support for this study is 0.003756 (see footnote 5). The maximum rule length has been left unconstrained, so the algorithm runs until rules no longer meet the minimum support requirement.
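The arithmetic behind this threshold is not spelled out in the text. One reading that gives a value of about 0.00376, in line with the threshold stated above, is sketched below in Python. It assumes the flight counts reported for this study (292,221 departures and 290,883 arrivals), two flights per day, and three 365-day years; the day count is an assumption made for this sketch, since the 2014-2016 calendar period contains 1,096 days.

```python
# Flight counts reported for the 2014-2016 study period.
departures = 292_221
arrivals = 290_883

# Assumptions for this sketch: one departure plus one arrival per day,
# over three 365-day years.
flights_per_day = 2
days = 3 * 365

min_support = flights_per_day * days / (departures + arrivals)
print(round(min_support, 6))   # 0.003756
```

Using 1,096 days instead changes only the sixth decimal place, so the choice of day count has little practical effect on the threshold.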
Using the arules package with set minimum length, minimum support and confidence, we apply the variables outlined in Table 17 over four dimensions, namely airline, temporal, route characteristics and operations. Over the three years covered (2014-2016), 292,221 departure flights and 290,883 arrival flights will be studied. As the variables for arrival and departure flights vary slightly in the operations dimension, rules will be generated separately for arrival and departure flights. The rules obtained were finally sorted by lift rather than by confidence, showing how strongly each antecedent is associated with the respective on-time performance outcome.
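The rule generation described above (consequent restricted to on-time performance, minimum rule length of two, support and confidence thresholds, sorting by lift) can be illustrated with a small brute-force sketch. This is Python, not the arules implementation used in the study, and it enumerates antecedents exhaustively rather than applying Apriori candidate pruning. The flights, item labels, function names and thresholds are all hypothetical.

```python
from itertools import combinations

# Hypothetical indexed flights: each flight is a set of "factor=value" items.
flights = [
    {"hour=Dawn", "service=LCC", "otp=Delay"},
    {"hour=Dawn", "service=LCC", "otp=Delay"},
    {"hour=Dawn", "service=FSC", "otp=On time"},
    {"hour=Midday", "service=FSC", "otp=On time"},
    {"hour=Midday", "service=FSC", "otp=On time"},
    {"hour=Midday", "service=LCC", "otp=Delay"},
    {"hour=Midday", "service=FSC", "otp=Delay"},
    {"hour=Dawn", "service=FSC", "otp=On time"},
]

def count(itemset, data):
    """Number of flights that contain every item of the itemset."""
    return sum(1 for row in data if itemset <= row)

def mine_rules(data, rhs_prefix="otp=", min_support=0.1,
               min_confidence=0.2, max_len=3):
    """Enumerate rules {factors} -> {on-time performance}, sorted by lift."""
    n = len(data)
    items = sorted(set().union(*data))
    lhs_items = [i for i in items if not i.startswith(rhs_prefix)]
    rhs_items = [i for i in items if i.startswith(rhs_prefix)]
    rules = []
    # Antecedent size starts at 1, so every rule has length of at least two.
    for size in range(1, max_len):
        for lhs in combinations(lhs_items, size):
            lhs = frozenset(lhs)
            lhs_count = count(lhs, data)
            if lhs_count == 0:
                continue  # e.g. two values of the same factor never co-occur
            for rhs in rhs_items:
                both = count(lhs | {rhs}, data)
                supp, conf = both / n, both / lhs_count
                if supp < min_support or conf < min_confidence:
                    continue
                lift = conf / (count({rhs}, data) / n)
                rules.append((sorted(lhs), rhs, supp, conf, lift))
    return sorted(rules, key=lambda r: r[4], reverse=True)

rules = mine_rules(flights)
for lhs, rhs, supp, conf, lift in rules:
    print(lhs, "->", rhs,
          f"support={supp:.3f} confidence={conf:.2f} lift={lift:.2f}")
```

Exhaustive enumeration is workable only for toy data; with hundreds of thousands of flights, the pruning of infrequent itemsets is what makes an Apriori-style search tractable.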
Lift rules will first be obtained by setting a maximum rule length of 2 to find correlations between factors and departure delays. Besides on-time performance, a bottom-up approach will be taken for each of the factors, beginning with very detailed antecedents and moving to antecedents containing classification groups. Finally, to determine whether there is any interaction between factors, a top-down approach will be taken by including two or more classification groups in the antecedent. To find rules where factors are correlated with one another, rules will be filtered so that the factor concerned appears in the antecedent.
5 Due to computing constraints, the actual support used in the results was 0.02. This does not make a difference to the results, as it only limits the number of rules returned and restricts them to those of more interest.
As the data sample is very large, only variables with a p-value under 0.001 will be presented in this section. Where coefficients are not significant, blank spaces have been included to represent this fact.
Table 18: Regression results for numerical factors
Variable | Arrival estimate | Arrival sign | Departure estimate | Departure sign
Load Factor | | | |
Passengers | 0.046 | + | 0.065 | +
Arrival delay | n/a | | 0.507 | +
Scheduled Turnaround Time | 0.019 | + | -0.055 | -
When analyzing the numerical factors shown in Table 18, the largest estimated coefficient is that of arrival delay in the departure model. Holding all else constant, for every minute of arrival delay the departure delay increases by 0.507 minutes, roughly 30 seconds.
Although the correlation plot in Figure 13 may suggest a one-to-one relationship, the regression considers all data, most of which lies below 350 minutes. Another interesting observation is that load factor is negatively correlated with arrival delay, opposite to what one might expect. A possible explanation is that high load factors may cause a departure delay, giving pilots a stronger incentive to fly faster to make up for it.
In the case of categorical variables, R automatically treats the first level of each factor as the reference (control) level, and all coefficients are therefore interpreted relative to this level. Although this is not ideal because the effect of the reference level itself is not displayed, it provides a relative analysis of categorical variables and their effect on flight delays.
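The reference-level coding described above can be sketched as follows. This is an illustrative Python sketch of treatment (dummy) coding, not R's own implementation; it assumes the levels are ordered alphabetically, as R does by default for a factor built from character data, and the airline codes are hypothetical placeholders.

```python
def treatment_code(values):
    """Dummy-code a categorical variable, dropping the first level as reference."""
    levels = sorted(set(values))              # alphabetical level order
    reference, others = levels[0], levels[1:]
    rows = [[1 if v == level else 0 for level in others] for v in values]
    return reference, others, rows

airlines = ["CCC", "AAA", "BBB", "AAA"]       # hypothetical placeholder codes
reference, columns, rows = treatment_code(airlines)
print(reference)   # AAA
print(columns)     # ['BBB', 'CCC']
print(rows)        # [[0, 1], [0, 0], [1, 0], [0, 0]]
```

The reference level has no column of its own, so its effect is absorbed into the intercept and each remaining coefficient measures a difference from it.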
Table 19: Regression results for airlines
Airline | Arrival estimate | Arrival sign | Departure estimate | Departure sign
Air Asia | 18.400 | + | -17.250 | -