
4. Data collection and processing

4.2 Data cleansing and pre-processing

The data collected from the web API are not a fully reliable representation of the activity of each station, owing to web server non-response, unexpected programme errors and insufficient memory, as mentioned previously in Table 17. Besides these technical errors, station maintenance work, broken bicycles or parking slots, and observations that do not arrive at the expected 5-minute interval also result in noisy data. A data cleansing process is therefore required to pre-process the data by detecting, correcting and removing errors before analysis. The multi-step cleansing process first proposed by Froehlich et al. (2009) and also used by Lathia et al. (2012) is applied here to detect and eliminate the noise. Some definitions and notations are given below before the data cleansing process is conducted.

4.2.1 Definition and notations

This section defines the terms, notations and intermediate processes used in the analysis, as suggested by Froehlich et al. (2009).

• Station size: The specified size of each station is obtained officially from the web API. However, due to lost bicycles or bicycle maintenance, the actual station size varies; it can be calculated as the sum of available bicycles (denoted $B_t$) and parking slots (denoted $S_t$) at time $t$ at each station;

• Observation normalisation: Normalised available bicycles (NAB) is used to normalise each station's data by dividing each observation by the specified station size (see Equation 1). This adjusts the values to a common scale and thus allows the activity of stations of different sizes to be compared:

$$\mathit{NAB}_{i,t} = \frac{B_{i,t}}{S_{i,t}} \qquad (1)$$

where $S_{i,t}$ and $B_{i,t}$ refer, respectively, to the size of station $i$ at a given time $t$ as obtained from the web API and the number of available bicycles at station $i$ at time $t$. NAB may be interpreted as the proportion of a station's parking slots that are occupied by available bicycles. It can be seen as a key measure for each bike station and has been used in a number of studies (Froehlich et al., 2008; Froehlich et al., 2009; Vogel et al., 2011; Lathia et al., 2012; O’Brien et al., 2014). NAB ranges from 0 (empty) to 1 (full);

• Station activity and event score: The Activity Score (AS) measures how active a station is at a given time, while the Event Score (ES) is a binary version of the Activity Score, indicating whether or not the net flow of bicycles at a station at a given time is greater than zero. The Activity Score is calculated as the absolute value of the difference in available bicycles between the current time $t$ and the previous time window $t-1$:

$$\mathit{AS}(t) = |B_t - B_{t-1}| \qquad (2)$$

where $B_t$ is the number of available bicycles at time $t$.
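As an illustration, the three measures defined above can be sketched in Python as follows. This is a minimal sketch rather than the code used in the study: the function names and the sample counts are hypothetical, and the Event Score is assumed to be 1 whenever the Activity Score is non-zero.

```python
from typing import List


def nab(bikes: int, station_size: int) -> float:
    """Normalised available bicycles (Equation 1): B / S."""
    if station_size <= 0:
        raise ValueError("station size must be positive")
    return bikes / station_size


def activity_scores(bikes: List[int]) -> List[int]:
    """Activity Score (Equation 2): |B_t - B_{t-1}| for each consecutive pair."""
    return [abs(curr - prev) for prev, curr in zip(bikes, bikes[1:])]


def event_scores(bikes: List[int]) -> List[int]:
    """Event Score (assumed): 1 where the Activity Score is non-zero, else 0."""
    return [1 if score > 0 else 0 for score in activity_scores(bikes)]


# Hypothetical 5-minute bicycle counts for one station of specified size 20.
counts = [10, 12, 12, 7]
print([nab(b, 20) for b in counts])
print(activity_scores(counts))
print(event_scores(counts))
```

Note that a series of n observations yields n - 1 scores, because the first observation has no previous time window to compare against.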

4.2.2 Cleansing process and processing

The data cleansing process is carried out in three steps, as discussed below:

1. Station size consistency: The sum of $B_t$ and $S_t$ should remain constant; however, it may fluctuate over time due to temporarily broken bicycles or parking slots, or to station capacity expansion or contraction. Since the size of each station is reported on the official YouBike website, the observed value of $B_t + S_t$ is checked against the specified size of the given station. Observations whose sum exceeds the specified size are considered invalid and removed.

2. Day Data Threshold: If a specific station has a high proportion of invalid or missing observations during a single day, that day's data should be removed. More specifically, for stations with fewer than 70% of the 288 possible observations (i.e., 202 samples) during a single day, the entire data of that day are removed. This accounts for abnormal station behaviour.

3. Station Data Threshold: If a station has less than 45% of the possible weekday data (i.e., 907 samples), the entire data of that station are removed.

In addition, observations obtained from the web API with abnormal values, such as $B_t + S_t \le 0$ or a specified station capacity $\le 0$, are regarded as invalid and removed. This ensures that each station is operating properly.
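The cleansing rules above can be sketched as a small Python pipeline. This is a hedged illustration, not the procedure actually run in the study: the `Obs` record, the `cleanse` function and the `sizes` mapping are hypothetical names, and the Station Data Threshold is assumed to be applied to the count of observations remaining after the first two steps.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Obs(NamedTuple):
    station: str  # station ID
    day: str      # calendar day of the observation
    bikes: int    # available bicycles, B_t
    slots: int    # available parking slots, S_t


DAY_MIN = 202      # 70% of the 288 possible 5-minute observations per day
STATION_MIN = 907  # the 45% threshold quoted in the text


def is_valid(o: Obs, size: int) -> bool:
    """Step 1 plus sanity checks: size > 0 and 0 < B_t + S_t <= size."""
    total = o.bikes + o.slots
    return size > 0 and 0 < total <= size


def cleanse(observations: List[Obs], sizes: Dict[str, int],
            day_min: int = DAY_MIN, station_min: int = STATION_MIN) -> List[Obs]:
    # Step 1: drop observations that are inconsistent with the specified size.
    valid = [o for o in observations if is_valid(o, sizes.get(o.station, 0))]

    # Step 2: drop whole station-days that have too few valid observations.
    per_day = defaultdict(int)
    for o in valid:
        per_day[(o.station, o.day)] += 1
    valid = [o for o in valid if per_day[(o.station, o.day)] >= day_min]

    # Step 3: drop whole stations that have too few remaining observations.
    per_station = defaultdict(int)
    for o in valid:
        per_station[o.station] += 1
    return [o for o in valid if per_station[o.station] >= station_min]
```

The thresholds are exposed as parameters so that the sketch can be tried on a small sample, for example `cleanse(rows, sizes, day_min=2, station_min=3)`.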

At the end of the data cleansing process, it is not surprising that very little data is removed (only 0.277% of all observations, as shown in Table 19), which may be due to the robust web server system and internet connectivity. More importantly, errors were identified and addressed within a short time. Most of the data is removed under the Day Data Threshold criterion and because of an unexpected sum $B_t + S_t$: the Y17 Youth Recreation Centre station alone accounts for 5,042 of the 5,456 removed observations (approximately 92.41%), which may result from maintenance work. In total, 1,960,772 observations remained after the data cleansing process. Nevertheless, it should be noted that while a number of stations have an almost identical number of observations (11,897 samples) during the data collection period, the new station that began operating on 17th May has only 2,957 samples.

While Table 18 lists the errors detected through the data cleansing process, data continuity is also important for the subsequent temporal analysis and clustering. Data supplementation is therefore required to generate reasonable values for analysis and to make the figures easier to read. Missing data are filled with a “null” value when no information is available from neighbouring data, whereas linear interpolation is used to fit the missing data when the neighbouring data are known.
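The supplementation rule can be illustrated with a short Python sketch. It assumes, as a simplification not stated in the text, that a gap is interpolated whenever known values exist on both sides, regardless of the gap length, and that gaps at the start or end of a series stay null (`None`). The function name `fill_gaps` is hypothetical.

```python
from typing import List, Optional


def fill_gaps(values: List[Optional[float]]) -> List[Optional[float]]:
    """Linearly interpolate gaps bounded by known values; leave other gaps as None."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    # Walk over consecutive pairs of known positions and fill what lies between.
    for left, right in zip(known, known[1:]):
        span = right - left
        delta = values[right] - values[left]
        for i in range(left + 1, right):
            filled[i] = values[left] + delta * (i - left) / span
    return filled


# Hypothetical NAB-like series with one interior gap and two edge gaps.
series = [None, 2.0, None, None, 8.0, None]
print(fill_gaps(series))
```

Edge gaps are left as null because there is no neighbour on one side to anchor the line, which matches the case of no additional information from neighbouring data.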

Table 18 Details of removed data

| Station name or station ID | Removed criteria | Time²¹ | Observations |
| --- | --- | --- | --- |
| Xizhi Dist. Office | | | |
| Jianguo & Nongan Intersection | | 19:44 25th April | 1 |
| Dapeng Community | | 10:25 29th April | 1 |
| MRT Zhongshan Elementary School (Exit 4) | | 00:08 to 03:35 17th May | 42 |
| Citizen Square | | 13:44 to 14:44 17th May | 12 |
| Taipei Medical University | | 07:24 to 09:09 20th May | 22 |
| Xinyi Square (Taipei 101) | | 21:10 20th May to 00:50 21st May | 45 |

²¹ Time refers to the last system update time of the invalid data at the time of scraping.

4.2.3 Analysis tool

In this study, several tools are used to carry out and present the analysis. Navicat Premium (version 11.0.18), a database administration tool, is used to manage and process the data in SQLite3 format (filename extension .db). Its query function is also used to filter specific data efficiently, and the results can be exported to common file formats such as txt, csv or Excel for analysis. IBM SPSS Statistics (version 22) is used for plotting usage patterns. The R programming language is also used because of the abundance of packages available for common statistical needs; it is used to plot the average temporal tendency of station activity and to generate high-quality, high-resolution figures. More importantly, these figures owe much to MATLAB, which is used to process the original data, since the data are stored in the database in station ID order, repeated every 5 minutes. With some simple code, MATLAB automatically reorganises the data into time series order for each station, thus reducing the processing time compared with manual processing.
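The reorganisation step described above (from rows stored in station ID order at every 5-minute scrape to one time series per station) can be sketched as follows. The study used MATLAB for this; the Python version below is only an illustrative analogue, and the row layout `(timestamp, station_id, bikes)` and the function name `to_station_series` are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, str, int]  # (timestamp, station ID, available bicycles)


def to_station_series(rows: List[Row]) -> Dict[str, List[Tuple[str, int]]]:
    """Group scrape-ordered rows into one chronologically sorted series per station."""
    series = defaultdict(list)
    for timestamp, station, bikes in rows:
        series[station].append((timestamp, bikes))
    for station in series:
        # ISO-style timestamps sort chronologically when compared as strings.
        series[station].sort()
    return dict(series)


# Hypothetical rows as they might be stored: all stations at 08:00, then all at 08:05.
rows = [
    ("2014-05-17 08:00", "0001", 12),
    ("2014-05-17 08:00", "0002", 3),
    ("2014-05-17 08:05", "0001", 10),
    ("2014-05-17 08:05", "0002", 5),
]
print(to_station_series(rows)["0001"])
```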

Orange is an open-source visual programming environment for data mining developed by the Bioinformatics Laboratory at the University of Ljubljana. Orange not only features visual programming for data analysis and visualisation, but is also capable of developing advanced algorithms and executing complex data analysis procedures through Python scripting (Orange, 2014). While this software is used to perform hierarchical clustering, partitional clustering (i.e., the K-means algorithm) is done in R so that the Silhouette coefficient measure can be computed.
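For readers unfamiliar with the Silhouette coefficient, the following Python sketch computes its standard definition (the mean over all points of (b - a) / max(a, b), where a is the mean distance to the point's own cluster and b the mean distance to the nearest other cluster). It is not the R routine used in the study; the function name, the Euclidean distance and the sample points are assumptions made for illustration.

```python
from math import dist
from typing import Dict, List, Sequence


def silhouette(points: List[Sequence[float]], labels: List[int]) -> float:
    """Mean silhouette coefficient of a clustering, using Euclidean distance."""
    clusters: Dict[int, List[Sequence[float]]] = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical one-dimensional points forming two well-separated groups.
pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(silhouette(pts, [0, 0, 1, 1]))
```

Values close to 1 indicate compact, well-separated clusters, which is why the measure is commonly used to compare K-means solutions with different numbers of clusters.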

This study thus takes a task-oriented approach, using multiple tools to pursue its aims and objectives.
