
4. Data collection and processing

4.1 Data retrieval

The data are collected automatically from a dedicated web Application Programming Interface (API) in JSON (JavaScript Object Notation) format, which is provided free of charge by the Department of Transportation, Taipei City Government. This ease of online retrieval is a result of the government's Open Data policy, which makes data readily available to the public and facilitates further research. The official website also provides an information service for users through the Google Maps API: a map of Taipei City overlaid with small smiley markers that indicate YouBike station locations, together with the number of available bicycles and free slots¹⁷ at each station at a given time.

JSON is designed as a data-interchange language and open standard format that is more human-readable than XML and significantly faster for exchanging data (Nurseitov et al., 2009). Such data streams are often used to transmit data between web servers and web applications, as well as to mobile phone applications or dashboards that monitor the system concerned (O’Brien et al., 2014).

A Python script originally developed by Shane Lynn¹⁸ underpins the main process of data retrieval and storage. It was customised to retrieve the YouBike docking station data from the web API¹⁹ at regular intervals (every 5 minutes), parse it and store it in a SQLite database; retrieval fails only when there is a web server error or a software or computer problem. The script is run in PyCharm version 3.4, a Python IDE (Integrated Development Environment) developed by the Czech company JetBrains. Figure 17 below shows the script retrieving data from the YouBike open data API.
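The retrieval cycle described above can be sketched as follows. This is a minimal illustration rather than the script used in this study: the names `fetch_json` and `poll`, the 30-second timeout, and the decision to skip a failed retrieval and wait for the next cycle are all assumptions.

```python
import json
import time
import urllib.request

POLL_INTERVAL_S = 5 * 60  # the study retrieves data every 5 minutes


def fetch_json(url, timeout=30):
    """Download the URL and decode its body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def poll(fetch, store, n_polls, interval_s=POLL_INTERVAL_S, sleep=time.sleep):
    """Call fetch() n_polls times and pass each result to store().

    A failed retrieval (network or server error, malformed JSON) is
    skipped, which leaves a gap in the stored series.
    Returns the number of failed polls.
    """
    failures = 0
    for i in range(n_polls):
        try:
            store(fetch())
        except (OSError, ValueError):
            # URLError and socket timeouts are OSError subclasses;
            # JSON decoding errors are ValueError subclasses.
            failures += 1
        if i < n_polls - 1:
            sleep(interval_s)
    return failures
```

For example, `poll(lambda: fetch_json(url), snapshots.append, n_polls=12)` would collect roughly one hour of snapshots into a list named `snapshots`. Passing `fetch`, `store` and `sleep` as arguments keeps the loop testable without network access.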

17 A slot is a parking space for one bicycle.

18 The original Python source code is credited to Shane Lynn (http://shanelynn.ie/index.php/scraping-dublin-city-bikes-data-using-python/#more-222) and was adapted to meet the dissertation objectives and the structure of the data provided.

19 Web API URL: http://210.69.61.60:8080/you/gwjs_cityhall.json (before May 27th, 2014)


This JSON API provides several pieces of information, including station name, latitude and longitude, station address, data update time and total slots. However, only some of this information is of interest to this study, so the following specific data are selected for collection:

• Station ID number;

• Station name;

• Total slots;

• Available number of bicycles;

• Free slots;

• Update time; and

• Scrape time.
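Selecting these seven fields and storing them can be sketched as below. The sample payload is hypothetical: the wrapper key `retVal`, the field keys (`sno`, `sna`, `tot`, `sbi`, `bemp`, `mday`, `ar`), and the table name `observations` are assumptions made for illustration and are not taken from the API documentation or from the script used in this study.

```python
import sqlite3

# Hypothetical payload: wrapper key and field keys are assumed names.
SAMPLE = {
    "retVal": {
        "0007": {
            "sno": "0007",
            "sna": "Xinyi Square (Taipei 101)",
            "tot": "80",
            "sbi": "22",
            "bemp": "57",
            "mday": "20140415092426",
            "ar": "...",  # address: present in the feed but not collected
        }
    }
}


def extract_rows(payload, scrape_time):
    """Keep only the seven fields selected for collection."""
    rows = []
    for station in payload["retVal"].values():
        rows.append((
            station["sno"],        # station ID number
            station["sna"],        # station name
            int(station["tot"]),   # total slots
            int(station["sbi"]),   # available number of bicycles
            int(station["bemp"]),  # free slots
            station["mday"],       # update time
            scrape_time,           # scrape time, added by the collector
        ))
    return rows


def store_rows(conn, rows):
    """Append the rows to the observations table, creating it if needed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observations ("
        "station_id TEXT, station_name TEXT, total_slots INTEGER, "
        "available_bikes INTEGER, free_slots INTEGER, "
        "update_time TEXT, scrape_time TEXT)"
    )
    conn.executemany(
        "INSERT INTO observations VALUES (?, ?, ?, ?, ?, ?, ?)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # a file path would persist the data
store_rows(conn, extract_rows(SAMPLE, "20140415092810"))
```

The scrape time is supplied by the collector rather than read from the feed, which is why it is passed in as a separate argument.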

Figure 17 Demonstration of Python script run in PyCharm

Source: image by author

Table 15 Structure of bikesharing ride data

Station ID | Station name | Total slots | Available bikes | Free slots | Update time | Scrape time
0007 | Xinyi Square (Taipei 101) | 80 | 22 | 57 | 20140415092426 | 20140415092810

Source: author
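The update and scrape times in Table 15 appear to be 14-digit timestamps of the form year, month, day, hour, minute, second; this reading is an assumption inferred from the values themselves. Under that assumption, the delay between the station's last update and the moment the record was scraped can be worked out as in this short sketch (`parse_timestamp` is an illustrative helper name):

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # assumed layout of the 14-digit values


def parse_timestamp(text):
    """Convert a 14-digit timestamp string into a datetime."""
    return datetime.strptime(text, TIMESTAMP_FORMAT)


update_time = parse_timestamp("20140415092426")
scrape_time = parse_timestamp("20140415092810")
lag_seconds = (scrape_time - update_time).total_seconds()
print(lag_seconds)  # 224.0
```

For the example row, the record was scraped 3 minutes 44 seconds after the station data were last updated.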


Table 15 above illustrates the overall structure of the bikesharing ride data, taking Xinyi Square (Taipei 101) as an example, whereas Figure 18 below illustrates how the data are stored in a SQLite database managed with the database administration tool Navicat.

Figure 18 Demonstration of part of data stored in the database

Source: image by author

In order to analyse the station activity patterns, we collected data online every 5 minutes from April 15th, 2014 to May 27th, 2014, parsing it and storing all the relevant information in a SQLite database. Note that as the information changes over time, new data are added to the database automatically; for example, records for a new station are added automatically even though the station did not appear in the earlier data. A data cleansing process is therefore needed before the data are analysed. Note also that the script collects data for all stations continuously at 5-minute intervals; however, data may fail to be retrieved and written to the database because of software errors, server downtime and maintenance, or a poor internet connection. Once retrieval succeeds again, the new records are logged in the database directly after the previously recorded data, leaving a gap in the series.
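As one possible step in such a cleansing process, gaps left by failed retrievals can be located by comparing consecutive scrape times. The sketch below is illustrative only: the function name `find_gaps`, the one-minute tolerance and the sample times are assumptions, not part of the procedure used in this study.

```python
from datetime import datetime, timedelta


def find_gaps(scrape_times,
              expected=timedelta(minutes=5),
              tolerance=timedelta(minutes=1)):
    """Return (previous, next) pairs spaced more than expected + tolerance."""
    ordered = sorted(scrape_times)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > expected + tolerance
    ]


# Made-up scrape times: the retrieval due at 09:15 is missing.
times = [datetime(2014, 4, 15, 9, m) for m in (0, 5, 10, 20, 25)]
gaps = find_gaps(times)
```

Here `gaps` holds a single pair, 09:10 and 09:20, marking the interval in which a retrieval was missed.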


4.1.1 Basic quantities of the data collected

Due to some problems that occurred during the data collection period, listed in Table 16 below, some data are missing and cannot be used for the study. For example, data for 24th May from 9:37 pm to 10:51 pm are missing because the software was idle. Notably, on 27th May the web API link was changed²⁰. Therefore, the data collection results are based on the 6 weeks of data collected before 27th May, excluding the missing, unrecorded data.

Overall, the collected data in our study cover 166 stations with a total of 7,280 slots (initially, on 15th April, 165 stations with a total of 7,204 slots). A total of 1,966,228 observations were collected. Note that the capacity of some stations changed during the data collection period. Station size ranges from 26 to 180 slots.

Table 16 Errors during data retrieval period

Time | Duration | Type of error
4/27 18:06 | 120 min | Insufficient memory
5/9 16:54 | 7 hours 15 min | Programme error
5/15 23:58 | 1 min | Programme error
5/24 21:37 | 74 min | Programme error

Source: this study
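The effect of these outages can be quantified with a short calculation. The sketch below assumes that the Time column gives the start of each outage (the table does not state this explicitly) and treats every complete 5-minute interval within an outage as one missed retrieval cycle, which is only a rough estimate.

```python
from datetime import datetime, timedelta

# (start, duration) pairs from Table 16; the start-time reading is assumed.
OUTAGES = [
    (datetime(2014, 4, 27, 18, 6), timedelta(minutes=120)),
    (datetime(2014, 5, 9, 16, 54), timedelta(hours=7, minutes=15)),
    (datetime(2014, 5, 15, 23, 58), timedelta(minutes=1)),
    (datetime(2014, 5, 24, 21, 37), timedelta(minutes=74)),
]
POLL_INTERVAL = timedelta(minutes=5)

# Total time during which no data were recorded.
total_downtime = sum((duration for _, duration in OUTAGES), timedelta())

# Complete 5-minute cycles lost within each outage (a rough estimate).
missed_cycles = sum(duration // POLL_INTERVAL for _, duration in OUTAGES)

# End of the last outage: 21:37 plus 74 minutes.
last_outage_end = OUTAGES[-1][0] + OUTAGES[-1][1]
```

Under these assumptions the four outages add up to 630 minutes (10 hours 30 minutes), or about 125 missed retrieval cycles, and the last outage ends at 22:51 on 24th May.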

According to Table 16, most of the errors are unexpected programme errors, and the longest outage lasted more than 7 hours. The duration of the programme errors varies because an error is not detected until the programme is deliberately checked. Although the Python script runs automatically, human intervention is still needed to resolve unexpected errors while the programme is running. As the programme runs, its memory use grows and the memory is not released automatically. An insufficient-memory error consequently occurs once PyCharm reaches its permitted memory limit. The solution is either to restart the programme or to raise the amount of memory PyCharm is permitted to use. However, memory management is a delicate process, so in this study we chose to restart the programme. These errors lead to inconsistent observations and diminish the quality of the collected data.

20 The new web API URL, since 27th May, 2014: http://opendata.dot.taipei.gov.tw/opendata/gwjs_cityhall.json

The system is expected to expand over time, so new stations are likely to be built and station sizes to change. The collected data record several changes of station size during the data collection period, as shown in Table 17 below. The capacity of MRT Dongmen Station (Exit 4) was reduced to 46, whereas that of MRT S.Y.S Memorial Hall Station was increased to 48. One new station was established, namely MRT Zhongshan Elementary School (Exit 4), with a capacity of 70 bicycles.

Table 17 Changes of station size during data collection period

Station | Pre-size | Post-size | When
MRT S.Y.S Memorial Hall Station | 38 | 48 | 00:24 20th May, 2014
MRT Dongmen Station (Exit 4) | 50 | 46 | 09:42 18th April, 2014
MRT Zhongshan Elementary School (Exit 4) | n.a. | 70 | 00:08 17th May, 2014

Source: this study
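Changes such as those in Table 17 can be found by scanning the records in time order and comparing each station's total slots with its previous value. The following is a minimal sketch under stated assumptions: the function name `track_stations` and the record layout are illustrative, and the timestamps of the records preceding each change are made up.

```python
def track_stations(records):
    """Scan (station, scrape_time, total_slots) records in time order.

    Returns (first_seen, resized): first_seen maps each station to the
    scrape time of its first record, and resized lists
    (station, old_size, new_size, scrape_time) for every change in size.
    """
    first_seen = {}
    last_size = {}
    resized = []
    # Fixed-width timestamp strings sort chronologically.
    for station, scrape_time, total in sorted(records, key=lambda r: r[1]):
        if station not in first_seen:
            first_seen[station] = scrape_time
        elif last_size[station] != total:
            resized.append((station, last_size[station], total, scrape_time))
        last_size[station] = total
    return first_seen, resized


# Sizes follow Table 17; the earlier record of each resized station
# carries a made-up timestamp.
records = [
    ("MRT Dongmen Station (Exit 4)", "20140418093700", 50),
    ("MRT Dongmen Station (Exit 4)", "20140418094200", 46),
    ("MRT Zhongshan Elementary School (Exit 4)", "20140517000800", 70),
    ("MRT S.Y.S Memorial Hall Station", "20140520001900", 38),
    ("MRT S.Y.S Memorial Hall Station", "20140520002400", 48),
]
first_seen, resized = track_stations(records)
```

A station whose first record appears after the start of the collection period can then be treated as newly established. As a consistency check, the net change in Table 17 (+10 − 4 + 70 = +76 slots) equals the difference between the initial total of 7,204 slots and the final total of 7,280.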

4.1.2 Limitation

It may be argued that the data collection frequency (every 5 minutes) is not sufficient to capture station activity patterns accurately throughout the day. Nevertheless, Froehlich et al. (2009) handle their data in 5-minute increments when using a Bayesian network model to predict the number of available bicycles at a specific station at a given time. Our data collection frequency may therefore be regarded as acceptable. Note that the YouBike open data provided by Taipei City Government can be retrieved more frequently, or even in real time, although this requires an application to Taipei City Government. That service is designed to provide real-time information to mobile apps and other applications so that users can query it at any time and anywhere; mobile app developers must also submit annual reports on the application's results and activity to the government. In addition, a fixed IP address is required to retrieve data from that web API. Since this study is for academic research only, retrieving data every 5 minutes, which can be done freely, is appropriate given the convenience of data collection.

It is important to note that the collected data do not include trips, i.e., trip origins and trip destinations. As a result, flows of bicycles between stations cannot be examined, and the data cannot be used to count journeys. The data simply record the available bicycles and parking spaces at a given time, thereby showing the activity patterns of stations at a given time.
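This limitation can be illustrated with a small sketch. What the snapshots do allow is the net change in available bicycles between consecutive observations at a station; the function name `net_changes` and the counts below are made up for illustration.

```python
def net_changes(available_bikes):
    """Differences between consecutive snapshot counts at one station."""
    return [
        later - earlier
        for earlier, later in zip(available_bikes, available_bikes[1:])
    ]


# Made-up counts of available bicycles at 5-minute intervals.
snapshots = [22, 20, 20, 25]
changes = net_changes(snapshots)
print(changes)  # [-2, 0, 5]
```

The zero in the middle may mean that nothing happened in that interval, or that the same number of bicycles were hired and returned. The snapshots cannot distinguish the two cases, which is why they describe station activity but cannot be used to count journeys.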
