
CHAPTER 3 RESEARCH METHOD

3.1 Research Method

In this study, patent data from multiple countries were collected for analysis. Because the content of patent documents is extensive, the study uses a lexicon extraction program to extract key terms from the patent data and applies the Chance Discovery tool KeyGraph (Ohsawa et al., 1998) to explore the interrelationships and characteristics of those terms. Using KeyGraph, this study visualizes graphs of Fintech-related patent applications to observe how they differ across application periods and technology fields and to identify the direction of trends. This study also implements a KeyGraph algorithm program to analyze trends in the patent data.

3.1.2 N-gram

The lexicon extraction program extracts text strings from the Title, Abstract, and Claims columns of the patent data in N-gram mode (Manning & Schuetze, 1999), segmenting the text into strings of n consecutive words and treating each as a term. Because important information is usually hidden in noun phrases, it is more useful to extract noun phrases than single words, since a single word is often too general for its meaning to be recognized. Consequently, this study applies N-grams to extract word groups consisting of zero or one adjective followed by one or more nouns.

Take "financial information technology" as an example: the N-grams extracted from this text are "financial information", "information technology", and "financial information technology". To confirm the accuracy and adequacy of the results, the extracted terms were reviewed manually. Two researchers with finance backgrounds were invited to perform this examination. To make sure they understood the study's definition of text selection and its selection rules, they were trained first, so that they could verify the correctness of the extractions and combinations. After completing the verification, they were rewarded, to increase their willingness to help improve the credibility of the study.
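As a minimal sketch of this extraction step (assuming NLTK's tokenizer and part-of-speech tagger; the actual program described in Section 3.2 was written in Python 2.7 and may differ in detail), the adjective-plus-noun N-gram pattern can be implemented as follows.

```python
# Sketch of the N-gram noun-phrase extraction described above.
# Requires NLTK with the 'punkt' and 'averaged_perceptron_tagger' resources.
from collections import Counter
import nltk

def extract_candidate_terms(text, max_n=3):
    """Return adjective*-noun+ N-grams (2 <= n <= max_n) with their frequencies."""
    counts = Counter()
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        for n in range(2, max_n + 1):
            for i in range(len(tagged) - n + 1):
                window = tagged[i:i + n]
                tags = [tag for _, tag in window]
                # zero or one adjective followed by one or more nouns
                body = tags[1:] if tags[0].startswith('JJ') else tags
                if body and all(t.startswith('NN') for t in body):
                    counts[' '.join(w.lower() for w, _ in window)] += 1
    return counts

print(extract_candidate_terms(
    "Financial information technology improves mobile payment services."))
```

Running the sketch on the example sentence yields candidates such as "financial information", "information technology", and "financial information technology", matching the behavior described above.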

3.1.3 KeyGraph

(1) Introduction

The KeyGraph algorithm, proposed by Ohsawa et al. in 1998, is the most widely used tool for visualizing data in Chance Discovery. KeyGraph uses data visualization to gain insight into the future, especially when a rare event co-occurs with high-frequency clusters (Ohsawa, 2002a).

KeyGraph analyzes the collected documents, identifies the high-frequency terms in them, and examines the relationships between highly correlated word clusters and other terms that are not frequent but have a key influence within those clusters. Finally, the results are converted into a visualized graph. Terms that appear with low frequency yet have a vital influence on the word clusters are called key terms with potential opportunity (Hong, 2009), and they are the main object of this research.

The visualized graph that KeyGraph generates contains three kinds of nodes and two types of lines connecting them, as shown in Figure 3 below.

The three kinds of nodes are solid black points, representing high-frequency terms; hollow points, representing important low-frequency terms; and double-circle nodes, representing key terms. The lines between connected nodes represent the correlation between two terms and include solid lines and dashed lines. Solid lines indicate the relevance of two terms within a cluster, while dashed lines indicate that the terms are only slightly related, a relationship called a Weak Tie (Chen et al., 2013). In addition, neither the size of the nodes nor the length of the lines carries any meaning.

Figure 3: KeyGraph Illustration. Source: Ohsawa & Yachida (1998)

(2) KeyGraph Algorithm

The KeyGraph algorithm involves a series of calculations consisting of six steps (Ohsawa & Yachida, 1998), as shown in Figure 4 below.

Figure 4: KeyGraph Calculation Process. Source: Organized by this study

As Figure 4 shows, the calculation proceeds in three stages: preprocessing the data (document preprocessing), extracting words (extracting high-frequency terms and extracting links), and extracting keywords (extracting key terms, key links, and keywords).

First of all, document preprocessing mainly consists of data cleansing and the integration of terms and phrases. In the second step, KeyGraph extracts terms according to the number of times they appear, captures those that appear relatively frequently, and gathers all of these high-frequency terms into a set denoted Nh.
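As a sketch of this second step (assuming each document has already been converted into lists of sentences, each sentence a list of extracted terms; the cutoff top_k is a tuning parameter, not fixed by the text), the set Nh can be obtained by a simple frequency count.

```python
# Sketch of step 2: collect the high-frequency term set Nh.
from collections import Counter

def high_frequency_terms(documents, top_k=30):
    """Return the top_k most frequent terms across all documents as Nh."""
    counts = Counter(term
                     for doc in documents
                     for sentence in doc
                     for term in sentence)
    return [term for term, _ in counts.most_common(top_k)]

documents = [
    [["mobile payment", "blockchain"], ["mobile payment", "bank"]],
    [["blockchain", "smart contract"], ["mobile payment"]],
]
print(high_frequency_terms(documents, top_k=2))  # ['mobile payment', 'blockchain']
```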

The third step, extracting links, uses formula (1) to calculate the co-occurrence between high-frequency terms, which yields the degree of relevance between lexicons. After the calculations based on equation (1) are finished, the values are sorted, and the two terms with high correlation are connected by a link.

$$\mathrm{assoc}(w_i, w_j) = \sum_{s \in D} \min\left(|w_i|_s,\ |w_j|_s\right) \qquad (1)$$

where $w_i$ and $w_j$ are the $i$th and $j$th terms in Nh, $s$ denotes a sentence, and $D$ refers to the set of documents. $\mathrm{assoc}(w_i, w_j)$ measures the relevance between each pair of lexicons in Nh, and $|w_i|_s$ and $|w_j|_s$ represent the numbers of times that $w_i$ and $w_j$ appear in the same sentence $s$.
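A minimal sketch of formula (1), assuming each sentence is represented as a list of extracted terms; the number of links kept is a tuning parameter rather than something fixed by the text.

```python
# Sketch of formula (1): sentence-level co-occurrence between two terms,
# plus the ranking of term pairs in Nh to decide which pairs get a link.
from itertools import combinations

def assoc(w_i, w_j, sentences):
    """Sum over sentences of min(count of w_i, count of w_j) in each sentence."""
    return sum(min(s.count(w_i), s.count(w_j)) for s in sentences)

def top_links(high_freq_terms, sentences, n_links=10):
    """Rank pairs of Nh terms by assoc and keep the strongest pairs as links."""
    pairs = combinations(sorted(set(high_freq_terms)), 2)
    scored = [((wi, wj), assoc(wi, wj, sentences)) for wi, wj in pairs]
    scored = [p for p in scored if p[1] > 0]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:n_links]

sentences = [["mobile payment", "bank", "mobile payment"],
             ["blockchain", "bank"]]
print(assoc("mobile payment", "bank", sentences))  # 1
```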

After the degree of correlation between terms is known, the fourth step is to extract the key terms, that is, to uncover the important terms and phrases hidden in the data: terms that have low frequency but are potentially important. This step uses equation (2) to calculate the weight of every term in the data with respect to the high-frequency terms Nh (Ohsawa, 1998). The higher the weight, the more likely the term is a key term; the resulting set of potential key terms is denoted Kht. In equation (2), key(w) denotes the co-occurrence between each term and the high-frequency terms Nh.

$$\mathrm{key}(w) = 1 - \prod_{g \subset G} \left(1 - \frac{\mathrm{based}(w, g)}{\mathrm{neighbors}(g)}\right) \qquad (2)$$

$$\mathrm{based}(w, g) = \sum_{s \in D} |w|_s\, |g - w|_s \qquad (3)$$

$$\mathrm{neighbors}(g) = \sum_{s \in D} \sum_{w \in s} |w|_s\, |g - w|_s \qquad (4)$$

where

$$|g - w|_s = \begin{cases} |g|_s - |w|_s, & \text{if } w \in g \\ |g|_s, & \text{if } w \notin g \end{cases}$$

In these equations, $w$ ranges over all the terms in the documents, $g$ is a cluster of terms belonging to the set of clusters $G$, and $|g|_s$ is the number of times the terms of $g$ appear in sentence $s$. The keywords Kht have lower frequency than the terms in Nh, and they are added to the clusters $G$ after this fourth step.
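A minimal sketch of formulas (2) to (4), assuming as above that sentences are lists of terms and that the clusters g of linked high-frequency terms have already been formed from the links of step 3.

```python
# Sketch of formulas (2)-(4): the key(w) weight of a term with respect to
# the clusters G. `sentences` is a list of sentences (lists of terms);
# `clusters` is a list of term sets, one per cluster g.

def g_minus_w(sentence, g, w):
    """|g - w|_s: occurrences of cluster g in sentence s, excluding w if w is in g."""
    g_count = sum(sentence.count(term) for term in g)
    return g_count - sentence.count(w) if w in g else g_count

def based(w, g, sentences):                                   # formula (3)
    return sum(s.count(w) * g_minus_w(s, g, w) for s in sentences)

def neighbors(g, sentences):                                  # formula (4)
    return sum(s.count(w) * g_minus_w(s, g, w)
               for s in sentences for w in set(s))

def key(w, clusters, sentences):                              # formula (2)
    product = 1.0
    for g in clusters:
        n = neighbors(g, sentences)
        if n > 0:
            product *= 1.0 - float(based(w, g, sentences)) / n
    return 1.0 - product
```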

The fifth step, extracting key links, uses the weights from formula (2) together with formula (1) to explore the association between the high-frequency terms Nh and the key terms Kht. In this case, $w_i$ and $w_j$ in formula (1) are the $i$th term in Nh and the $j$th term in Kht, respectively, and the links that connect the two word clusters are the key links. The final step is to extract the keywords: the key-link values calculated by formula (1) are sorted and a threshold is set, and the terms that satisfy the threshold are the potential key terms in the documents.
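The following sketch combines steps five and six. It reuses the assoc() co-occurrence of formula (1); the aggregation of link strengths per key term and the threshold value are illustrative choices, since the text only states that the sorted key links are compared against a threshold.

```python
# Sketch of steps 5 and 6: key links between high-frequency terms (Nh) and
# potential key terms (Kht), followed by threshold-based keyword selection.

def assoc(w_i, w_j, sentences):   # formula (1), as in the earlier sketch
    return sum(min(s.count(w_i), s.count(w_j)) for s in sentences)

def key_links(nh_terms, kht_terms, sentences):
    """Score every (Nh term, Kht term) pair by their sentence co-occurrence."""
    links = {}
    for wi in nh_terms:
        for wj in kht_terms:
            score = assoc(wi, wj, sentences)
            if score > 0:
                links[(wi, wj)] = score
    return links

def extract_keywords(links, threshold):
    """Keep the Kht terms whose total key-link strength meets the threshold."""
    strength = {}
    for (_, wj), score in links.items():
        strength[wj] = strength.get(wj, 0) + score
    return [w for w, s in sorted(strength.items(), key=lambda x: -x[1])
            if s >= threshold]
```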

In summary, the patent data analyzed with Chance Discovery first go through the preprocessing procedure, which includes extracting meaningful terms and merging synonyms into a single term, so that the sentences are converted into collections of lexicons. Applying the KeyGraph algorithm step by step then makes it possible not only to understand the relations among the high-frequency terms in the large body of patent data, but also to find the key terms with great potential and to observe the relationships between those key terms and the high-frequency terms. In the end, the output is presented as a visualized graph of term relationships, linking the high-frequency terms and the terms that co-occur with them.

By applying KeyGraph to search for the interrelations among high-frequency terms and key terms and by visualizing these text data, this research hopes to explore the development of financial technology and its potential future trends, making it easy for readers to understand the correlations between the terms.

(3) Double-Helical Model

In addition to using KeyGraph to demonstrate the relevance of the terms, this study uses the Double-Helical Model, shown in Figure 5 below, to assist the data analysis. Ohsawa and McBurney proposed this model in 2002 for the data analysis of Chance Discovery. In the double helix, one strand is operated by a human and the other is carried out by the computer: the operator first inputs the data or a threshold and lets the computer compute the result; once the execution is complete, the operator examines the result and then proceeds to the next round of analysis. Through this continuous interaction between human and computer, the best result can eventually be obtained (a small sketch of this loop follows Figure 5).

Figure 5: Double-Helical Model. Source: Ohsawa & McBurney (2003)
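As an illustrative sketch of this loop, the run_keygraph and review callbacks below are hypothetical placeholders, not part of the model itself: one stands for the computer strand and the other for the human strand.

```python
# Illustrative sketch of the double-helical interaction between analyst and
# computer. run_keygraph builds a graph from the current parameters; review
# lets the analyst inspect it and return adjusted parameters, or None once
# the result is satisfactory.
def double_helix_analysis(run_keygraph, review, params):
    while True:
        graph = run_keygraph(**params)   # computer strand: execute KeyGraph
        new_params = review(graph)       # human strand: inspect and adjust
        if new_params is None:           # analyst accepts the current graph
            return graph
        params = new_params
```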

3.2 Research Process

This study's research framework and process are shown in Figure 6. The research process can be divided into three phases: data collection, word extraction, and result visualization.

Figure 6: Research Framework and Process. Source: Organized by this study

(1) Phase one: data collection

This study analyzes published patent data, and the patent database used is the Patent Integration software, which includes patent information from many countries, such as the United States, Europe, the United Kingdom, France, Germany, Taiwan, China, Japan, Korea, Mexico, and Switzerland, as well as WIPO, and is therefore useful for covering as many Fintech-related fields as possible. Many previous studies of patent analysis mostly used the United States Patent and Trademark Office (USPTO) database (Wu et al., 2015; Tsai, Huang & Yang, 2016) to search for patents. Patent Integration, however, integrates not only the patent data from the USPTO but also the patent data of other countries, which increases the scope and breadth of the study. Moreover, it shows the status of every patent, which makes it much easier to control data quality. Therefore, this study uses Patent Integration to collect patents from the databases of the United States, Europe, the United Kingdom, France, Germany, Taiwan, China, Japan, Korea, Mexico, Switzerland, and WIPO, and the collected data are updated up to May 4th, 2017.

Because the main purpose of this research is to outline the overall development and future trends of Fintech, the key terms used to search for patents are the definition and synonyms of Fintech, namely the three terms Fintech, finance technology, and financial technology. The reason is that the Fintech field consists of many financial areas such as payments, investment, capital raising, and deposits and withdrawals, and each area contains many different innovative technologies; collecting and classifying them field by field would be time-consuming, and the quality of the data would also be hard to control.

After the data collection, five cleansing rules are set up to clean the data. First, patents whose status is withdrawn, expired, or finally rejected are excluded. Second, patents published more than 20 years ago are excluded, because they may be outdated. Third, since the research mainly collects English patent data, data not published in English are excluded. Fourth, duplicate patent data are excluded, because the same patent can be applied for in multiple countries. Fifth, if the text of the title, abstract, or claims contains information irrelevant to financial technology, the record is excluded.
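A minimal sketch of these five rules, assuming each patent record is a dictionary with status, publication_year, language, family_id, title, abstract, and claims fields (assumed names, not the actual Patent Integration export schema), and approximating the fifth, manually judged relevance rule with a simple keyword test.

```python
# Sketch of the five cleansing rules applied to the collected patent records.
EXCLUDED_STATUS = {"withdrawn", "expired", "final rejected"}
RELEVANT_TERMS = ("fintech", "finance technology", "financial technology")

def clean_patents(records, current_year=2017):
    kept, seen = [], set()
    for r in records:
        if r["status"].lower() in EXCLUDED_STATUS:             # rule 1: status
            continue
        if current_year - r["publication_year"] > 20:          # rule 2: too old
            continue
        if r["language"].lower() != "en":                      # rule 3: English only
            continue
        if r["family_id"] in seen:                             # rule 4: duplicates
            continue
        text = " ".join([r["title"], r["abstract"], r["claims"]]).lower()
        if not any(term in text for term in RELEVANT_TERMS):   # rule 5 (approximation)
            continue
        seen.add(r["family_id"])
        kept.append(r)
    return kept
```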

In sum, this study uses the Patent Integration software to collect highly reliable patent data from a variety of countries' databases and sets up data cleansing rules to complete the data collection process, which also increases the credibility of this study.

(2) Phase two: word extraction

This phase extracts terms from three columns of the patent data, Title, Abstract, and Claims, using a program developed in Python 2.7, and each word is then classified manually, which makes it possible to explore the trends of the Fintech field from the classified lexicons with the KeyGraph algorithm. The program has five functions (a sketch of functions 3 to 5 appears after this list):

1. Calculate and sort the frequency of each word.

2. Generate bi-gram results and calculate and sort the frequency of each result.

3. Create a user-defined removing rule list that can erase terms from the original data.

4. Create a user-defined adding rule list and calculate and sort the frequency of its terms.

5. Run a two-word search over the adding rule list, and output the frequency of each pair and its co-occurrence within the same sentences.
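The actual Python 2.7 program is not reproduced in the thesis; the following is only a hedged sketch of functions 3 to 5, treating the removing and adding rule lists as plain term lists and the two-word search as a sentence-level co-occurrence count.

```python
# Sketch of functions 3-5 of the extraction program. `sentences` is a list of
# sentences, each a list of lower-cased words; substring matching is a rough
# approximation good enough for illustration.
from collections import Counter
from itertools import combinations

def apply_removing_rules(sentences, remove_list):
    """Function 3: erase unwanted terms from the original data."""
    remove = set(remove_list)
    return [[w for w in s if w not in remove] for s in sentences]

def adding_rule_frequencies(sentences, add_list):
    """Function 4: count and sort the frequency of each user-defined term."""
    text = [" ".join(s) for s in sentences]
    counts = Counter({term: sum(t.count(term) for t in text) for term in add_list})
    return counts.most_common()

def two_word_search(sentences, add_list):
    """Function 5: co-occurrence of each term pair within the same sentence."""
    text = [" ".join(s) for s in sentences]
    return {(a, b): sum(1 for t in text if a in t and b in t)
            for a, b in combinations(sorted(add_list), 2)}
```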

Because of the complexity of the original terms and phrases, the study combines terms with their synonyms in order to proceed to phase three.

(3) Phase three: result visualization

For each period, term frequencies and weights are calculated through KeyGraph, and high-frequency terms and co-occurrence are analyzed to establish the relationships between terms; a visualization tool is then used to show these relationships for each period, and finally the correlations between the terms and the implied trends are interpreted.

From the perspective of graph interpretation, if the number of terms is insufficient, the terms appear in an isolated state, which does not fully present the relevance of the key terms or the potential future trends this study seeks. On the other hand, if the number of terms is too large, the graph becomes far more complicated and much harder to interpret. Only by readjusting the terms' frequency and the co-occurrence threshold through the human-computer interaction of the Double-Helical Model can the key terms' associations and trend graphs be made easy to interpret and analyze.
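The thesis does not name its plotting tool, so the following is only an illustrative sketch using networkx and matplotlib: high-frequency terms are drawn as black nodes, key terms as white nodes, in-cluster links as solid edges, and weak-tie key links as dashed edges, mirroring the graph conventions described in Section 3.1.3.

```python
# Illustrative sketch of the visualization step (not the tool actually used).
import networkx as nx
import matplotlib.pyplot as plt

def draw_keygraph(links, key_links, high_freq_terms, key_terms,
                  out_file="keygraph.png"):
    g = nx.Graph()
    g.add_edges_from(links)
    g.add_edges_from(key_links)
    pos = nx.spring_layout(g, seed=42)
    nx.draw_networkx_nodes(g, pos,
                           nodelist=[n for n in g if n in high_freq_terms],
                           node_color="black")
    nx.draw_networkx_nodes(g, pos,
                           nodelist=[n for n in g if n in key_terms],
                           node_color="white", edgecolors="black")
    nx.draw_networkx_edges(g, pos, edgelist=links)                     # solid: in-cluster links
    nx.draw_networkx_edges(g, pos, edgelist=key_links, style="dashed") # dashed: weak ties
    nx.draw_networkx_labels(g, pos, font_size=8)
    plt.axis("off")
    plt.savefig(out_file, dpi=200)
```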