

CHAPTER 4 RESULTS AND DISCUSSIONS

4.1 Status of Final Data Set

4.1.1 Data Collection and Data Cleansing

The data in this study were collected from the Patent Integration Software. To ensure the quality of the data, three search key terms were used: Fintech, finance technology, and financial technology.

Data were collected up to the latest database update on May 4, 2017, yielding 3,739 records in total.

According to the cleansing rules in Figure 7 and the data status shown on the Patent Integration Software, records that were withdrawn, expired, or finally rejected were excluded, as were patents that had been published for more than 20 years. In addition, this study excludes records not written in English, because it focuses on English patent data. Finally, based on the title, abstract, and claims of each patent, a total of 1,208 records relevant to the field of Fintech remained for the subsequent analysis, as summarized in Table 7.

Figure 7: Data Cleansing Rules

Table 7: The Summary of Data Cleansing and the Amount of the Remaining Data

Key terms    Original Data    Data Remaining    Data Cleansing
Fintech      6                3                 Withdrawal-1; Irrelevant-2
Finance

4.1.2 Data Integration and Clustering

Based on the historical timing of Fintech's development and the Fintech-related research that Schueffel conducted in 2016, this study divides the patent data into three periods: before 2004, from 2005 to 2010, and after 2010 (Schueffel, 2016). The published years of the collected data ranged from 1998 to 2017, and because the number of effective records before 2000 was small, those records were included in period one. After integrating the data from the three key terms and cleansing duplicated and irrelevant records again, 142 records remained in period one (1998 to 2004), 354 in period two (2005 to 2010), and 463 in period three (2011 to 2017), as summarized in Table 8.
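The period assignment described above can be sketched as follows; the function name is illustrative, not part of the study's actual tooling:

```python
# Sketch of the period clustering rule: patents up to 2004 (including
# the few pre-2000 records) fall into period one, 2005-2010 into
# period two, and 2011 onward into period three.
def assign_period(year):
    if year <= 2004:
        return 1
    if year <= 2010:
        return 2
    return 3

print([assign_period(y) for y in (1998, 2004, 2005, 2010, 2011, 2017)])
# [1, 1, 2, 2, 3, 3]
```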

Table 8: The Summary of Data Clustering

Period          Applied    Issued    Total
1998 to 2004    120        22        142
2005 to 2010    286        68        354
2011 to 2017    322        141       463
Total           728        231       959

Source: Calculated by this study

4.2 Data Analysis

This section follows the six steps of the KeyGraph algorithm mentioned in Chapter 3, shown in Figure 8, to analyze and reorganize the data.

Figure 8: Six Steps of KeyGraph

4.2.1 Step One: Word Preprocessing

The data downloaded from the Patent Integration Software contained no garbled text, so no file conversion was needed. The first task was therefore to cleanse the data and separate them into the three periods. Second, the program described in Chapter 3 was used to extract the terms. Although the program extracts terms in N-gram mode, irrelevant or duplicated terms still appear and must be deleted, and synonyms or variant text forms must be combined. This problem is resolved by running the program over and over again and continually adding such terms to the stop-word lists so that they are deleted.

(Figure 8 lists the six steps: (1) preprocess the words; (2) extract high-frequency words; (3) extract links; (4) calculate the weights and extract the key terms; (5) extract key links; (6) select keywords.)

This study finds that a single word often appears within compound terms and synonyms. Take radio frequency identification technology as an example: fragments such as radio, radio frequency, and frequency identification also appear, so these terms have to be combined into one compound term. As for synonyms, many terms have different spellings and usages, such as two dimensional bar code, two-dimensional bar code, or two dimensional barcode; this study manually unifies such terms and deletes duplicates to avoid repeated calculations and to facilitate the subsequent analyses. The final numbers of key terms are shown in Table 9.
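As an illustration, the stop-word filtering and synonym unification described above could be sketched as follows; the stop-word list and synonym map shown are hypothetical examples, not the actual lists used in this study:

```python
# Sketch of the term normalization step: drop stop-word fragments,
# then map variant spellings/synonyms to one canonical form.
# The lists below are illustrative, not the study's actual lists.

STOP_WORDS = {"radio", "radio frequency", "frequency identification"}

SYNONYMS = {
    "two dimensional bar code": "two-dimensional barcode",
    "two-dimensional bar code": "two-dimensional barcode",
    "two dimensional barcode": "two-dimensional barcode",
}

def normalize_terms(terms):
    """Return a deduplicated list of cleansed, canonical terms."""
    cleaned = []
    for term in terms:
        t = term.lower().strip()
        if t in STOP_WORDS:
            continue                  # delete irrelevant fragments
        t = SYNONYMS.get(t, t)        # unify spelling variants
        if t not in cleaned:          # avoid repeated calculation
            cleaned.append(t)
    return cleaned

print(normalize_terms([
    "radio", "radio frequency identification technology",
    "two dimensional bar code", "two-dimensional barcode",
]))
# ['radio frequency identification technology', 'two-dimensional barcode']
```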

Table 9: Numbers of Key Terms in Each Period

Period          Applied    Issued    Total
1998 to 2004    68         42        110
2005 to 2010    80         48        128
2011 to 2017    105        79        184
Total           253        169       422

4.2.2 Step Two: High-frequency Terms Extraction

After the word preprocessing step, the results are input into the adding lists of the program mentioned in Chapter 3, which outputs the frequency of each term. Table 10 below demonstrates the issued patents of the first period.

Table 10: List of High-frequency Terms (Example: issued patents from 1998 to 2004)

High-frequency Terms                 Frequency
Encryption and Decryption            150
Transaction Card                     102
Identification Technology            87
Wireless Communication Technology    58
…                                    …

A threshold must be set in advance to determine which terms count as high-frequency. The original research recommends no particular threshold (Ohsawa, 1998), so this study uses the Double-Helical model to keep adjusting the threshold for high-frequency words (Ohsawa & McBurney, 2003). Table 11 below summarizes the final thresholds for all the data in this study. Since a key link in the subsequent steps must connect two or more clusters (Chen et al., 2013), it is important that there be at least two groups of high-frequency terms.

Table 11: List of Final Thresholds for All Data

Period          Applied    Issued
1998 to 2004    20         20
2005 to 2010    20         20
2011 to 2017    20         20
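The thresholding described above amounts to partitioning the extracted terms by frequency; a minimal sketch, using the threshold of 20 from Table 11 and illustrative frequency counts:

```python
# Sketch of step two: keep only terms whose frequency meets the
# threshold (20 for every period in this study, per Table 11).
# The frequency counts below are illustrative.
THRESHOLD = 20

frequencies = {
    "encryption and decryption": 150,
    "transaction card": 102,
    "e banking service": 5,   # below threshold -> low-frequency term
}

high_freq = {t: f for t, f in frequencies.items() if f >= THRESHOLD}
low_freq = {t: f for t, f in frequencies.items() if f < THRESHOLD}
print(sorted(high_freq))  # ['encryption and decryption', 'transaction card']
```

Terms falling below the threshold are not discarded; they become the low-frequency candidates used in step four.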

4.2.3 Step Three: Links Extraction

The third step is to extract the associations between the high-frequency terms: it examines the relationships among them and marks a link wherever two terms co-occur in the same sentence.
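The sentence-level co-occurrence marking can be sketched as follows; the sentences and term matching (simple substring tests) are illustrative assumptions:

```python
from itertools import combinations
from collections import Counter

# Sketch of step three: mark a link between two high-frequency terms
# whenever they co-occur in the same sentence.
def extract_links(sentences, high_freq_terms):
    links = Counter()
    for sentence in sentences:
        present = [t for t in high_freq_terms if t in sentence]
        for a, b in combinations(sorted(present), 2):
            links[(a, b)] += 1        # one co-occurrence per sentence
    return links

# Illustrative sentences, not actual patent text.
sentences = [
    "the transaction card uses encryption and decryption",
    "encryption and decryption secures the transaction card reader",
    "identification technology is embedded in the card",
]
links = extract_links(sentences, ["transaction card", "encryption and decryption"])
print(links[("encryption and decryption", "transaction card")])  # 2
```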

4.2.4 Step Four: Key Terms Extraction

The fourth step is to extract the key terms, that is, the terms with low frequency. These low-frequency terms were identified by the program in step two as those that do not meet the threshold for high-frequency terms. The weight of each low-frequency word is then calculated; the higher the value, the more important the keyword. Based on Ohsawa's research and the formula suggested in 1998, with slight adjustments, this study calculates each key term's weight by the following formula:

Weight = (total number of same sentences in which the word co-occurs in the document) / (total number of same sentences in which all words co-occur in the document)

If a word co-occurs with word A in two same sentences, with word B in three same sentences, and with word C in six same sentences in the document, this word has three links, and the counts for each link are 2, 3, and 6 respectively. If the total number of links of all terms in the document is 30, then the weight of this word is (2+3+6)/30 = 0.367. Table 12 shows the list of low-frequency terms, and Table 13 lists and sorts the key terms by their weights.
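The worked example above can be reproduced directly; the function name is illustrative:

```python
# Sketch of the weight formula: a word co-occurs with A in 2 sentences,
# with B in 3, and with C in 6, while all terms together account for
# 30 co-occurrence links in the document.
def keyword_weight(cooccurrence_counts, total_links):
    """Weight = (sum of the word's link counts) / (all terms' link counts)."""
    return sum(cooccurrence_counts) / total_links

w = keyword_weight([2, 3, 6], 30)
print(round(w, 3))  # 0.367
```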

Table 12: List of Low-frequency Terms (Example: issued patents from 1998 to 2004)

Low-frequency Terms    Frequency
…                      …
Location Technology    10
Data Storage           9
Reader Technology      7
Key Cryptosystem       7
E Banking Service      5
…                      …

Table 13: List of Low-frequency Terms' Weights (Example: issued patents from 1998 to 2004)

Low-frequency Terms    Weight
Reader Technology      0.11
E Wallet               0.069
Data Encryption        0.061
Location Technology    0.049
Key Cryptosystem       0.024
E Banking Service      0.012
…                      …

4.2.5 Step Five: Key Links Extraction

Chen's and Ohsawa's research both use the concept of clusters to represent the association between high- and low-frequency terms, and a key link has to run through the clusters. Chen further stated that potential key terms are low-frequency terms that connect at least two groups of high-frequency terms. Thus, to fit the purposes of this research while keeping the basic idea of key terms in the original research, this study adjusts the definition of key links: a low-frequency term must co-occur with at least three different high-frequency terms in the same sentences to be regarded as a potential key term.
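The adjusted key-link rule can be sketched as a simple filter; the term names and link sets below are illustrative:

```python
# Sketch of the adjusted key-link rule: a low-frequency term qualifies
# as a potential key term only if it co-occurs (in the same sentence)
# with at least three different high-frequency terms.
def potential_key_terms(cooccurring_high_freq, minimum=3):
    """cooccurring_high_freq maps each low-frequency term to the set of
    high-frequency terms it shares a sentence with."""
    return [term for term, partners in cooccurring_high_freq.items()
            if len(partners) >= minimum]

# Illustrative data, not the study's actual links.
links = {
    "reader technology": {"transaction card", "identification technology",
                          "encryption and decryption"},
    "e banking service": {"transaction card"},
}
print(potential_key_terms(links))  # ['reader technology']
```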

4.2.6 Step Six: Keyword Selection

The final step selects the key terms based on the links between high-frequency terms and key terms, choosing the top three terms with the highest weights as the key terms of this study.
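The top-three selection amounts to sorting the candidates by weight; a minimal sketch using the weights from Table 13:

```python
# Sketch of the final step: sort candidate key terms by weight and
# keep the top three. Weights are taken from Table 13 for illustration.
weights = {
    "reader technology": 0.11,
    "e wallet": 0.069,
    "data encryption": 0.061,
    "location technology": 0.049,
    "key cryptosystem": 0.024,
}
top_three = sorted(weights, key=weights.get, reverse=True)[:3]
print(top_three)  # ['reader technology', 'e wallet', 'data encryption']
```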

This study analyzes the three periods of data, each covering patents with applied and issued status, and follows the six steps of the KeyGraph algorithm to calculate the frequencies and associations of the terms and to visualize the results of each period.