National Taiwan University of Science and Technology Institutional Repository NTUSTR: Item 987654321/57096


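Analyzing the Patent Network of the Cloud Computing Industry by Text Mining

Jia-Yen Huang*

Department of Information Management, National Chin-Yi University of Technology

ABSTRACT

With the rise of the cloud computing industry, major IT vendors have rushed into the market, yet very little patent analysis literature on this industry exists. The main contribution of this study is to propose a two-stage patent retrieval strategy and a systematic method for conducting network analysis of cloud computing. Using bibliometrics and network analysis, this study verifies that the two-stage patent retrieval strategy can successfully identify patents belonging to the cloud computing industry. The effectiveness of the systematic method for patent network analysis is confirmed through gray relation-based cluster analysis. By means of visualized networks and technology centrality indexes, this study reveals the overall patent structure of cloud computing and the core patents of the three business models. The approach of combining text mining with patent network analysis can provide an important reference for enterprises intending to enter the cloud computing industry.

Key Words: cloud computing, block building patent retrieval, text mining, patent network analysis, cluster analysis, gray relational analysis.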

PATENT NETWORK ANALYSIS OF CLOUD COMPUTING BY TEXT MINING

Jia-Yen Huang*

Department of Information Management, National Chin-Yi University of Technology

Taichung, Taiwan 41170, R.O.C.

Key Words: cloud computing, block building patent retrieval, text mining, patent network analysis, cluster analysis, gray relational analysis.

ABSTRACT

With the rise of cloud computing technology, many IT manufacturers have rushed to enter the market. However, to date, only a few attempts have been made at the patent analysis of cloud computing. The main contribution of this research is to propose a two-stage retrieval strategy and a systematic method for conducting patent network analysis on cloud computing. By conducting bibliometric and network analyses, we show that a two-stage patent retrieval method can successfully identify patents belonging to cloud computing. The effectiveness of the proposed systematic method for patent network analysis is verified through gray relational-based cluster analysis. With the aid of visual networks and technology centrality indexes, this study seeks to reveal the overall patent structure and identify the key technologies of cloud computing. The results of this study can provide a valuable reference for enterprises intending to enter the cloud computing industry.

*Corresponding author: Jia-Yen Huang, e-mail: [email protected]

I. INTRODUCTION

Currently, the most significant challenges confronting the global community include the energy crisis, the efficient use of resources, medical system integration, and financial risk monitoring. Dealing with these challenges requires a service system that offers rapid expansion, a high volume of operations, data processing, and storage capabilities. Cloud computing could be the best platform because it centralizes computing resources, allocates them effectively, and supports diverse application models [1]. The rising need for commercial service models of cloud computing has promoted the development of software, hardware, and other relevant industries, and it is highly valued by advanced countries.

Regarding cloud computing as a significant development and trend in the wake of Web 2.0, many enterprises in the information and communication technology industry have released relevant innovative applications to increase their market share. For consumers, cloud computing provides access to better services; for enterprises, it can reduce costs and risks. More recently, the number of cloud computing services and platforms has increased dramatically, with notable examples including Google’s File System (GFS), Amazon’s Dynamo, and Microsoft’s Azure. Cloud computing has become central to current discussions about corporate information technology [2].

Many enterprises consider cloud computing the next business opportunity. A patent application strategy in this field should take the industry as a whole into account in order to lower the risk of R&D investment. Patent analysis serves several objectives in technical and economic decision-making, including tracing the trajectory of technological evolution to find important opportunities and formulating strategic plans for technological innovation. However, because of the large number of professional and technical terms, it is not easy for patent engineers to read and analyze patents effectively. Moreover, the increasing volume of technical data presented in patents is difficult to analyze for the various purposes mentioned above. Consequently, engineers can no longer rely completely on their own knowledge and skills to analyze patents, which has led to the use of computer-aided tools for such analysis [3].

Patent analysis is a complex procedure that incorporates a variety of statistical algorithms, many of which are still evolving. Most patent information is now in electronic form, but there is still too much information to read, and the traditional processes are becoming too cumbersome to support iterative exploration. One disadvantage of using patent classifications, such as the International Patent Classification (IPC), to evaluate technological trends is that different techniques may be categorized into the same type. The emerging methods of patent analysis, which have attracted significant attention in recent years, involve text mining and visualization of the analysis results. For example, since 1997, the Japan Patent Office has offered more than 200 kinds of patent map services. The patent offices in other countries, such as Korea and Italy, have also provided similar services to reduce the complexity of patent analysis. Text mining is the cornerstone of patent network analysis. Its aim is to retrieve the hidden information within a large number of documents and to convert unstructured text into structured files that organize the scattered words into valuable information and rules. The advantage of text mining is that it enables us to unearth important specific terminologies in a document without the need for them to have been defined previously. Therefore, text mining can be applied in a variety of areas, reducing the need for professionals to be involved.

The traditional patent map does not make patent analysis easy to understand; it fails to provide intuitive insight into the targeted technologies. Considering patent literature itself as a complex network system, an increasing number of analyzers adopt patent network analysis and present their results visually [3]. In light of the growing application of text mining techniques over the past few years, this study applies such a technique, in combination with network analysis, to conduct a patent analysis of cloud computing. The network-based patent analysis, equipped with a visual display, generates quantitative values of patents in terms of the degree of similarity. The visual results of network analysis enable the analyzer both to grasp the overall structure of the patent set and to determine the positions and relative importance of individual patents. The influential patents selected from the analysis can help managers understand the context and up-to-date trends of cloud computing, enhancing the quality of the decision-making process for establishing patent strategies.

Most studies of cloud computing focus on the technology’s development, security, and commercial application [4, 5]. However, the literature on the patent analysis of cloud computing is limited. Before conducting patent analysis, patent retrieval must first be accomplished, and the classification and scope of cloud patents should be carefully determined. This is no easy task. If the retrieval range is too narrow, important patents may be missed. In contrast, if the retrieval range is too broad, many patents that do not belong to cloud computing may be included. Other difficulties in collecting cloud computing patents arise from the characteristics that cloud computing has inherited from grid computing and distributed computing, which make it difficult to distinguish between their patents during retrieval. Although network-based patent analysis is currently one of the mainstream analytical techniques, existing studies [6] have established the adjacency matrix by using a process that is too subjective and have failed to provide any verification of the results. Considering these weaknesses of network-based patent analysis, this study proposes a two-stage retrieval method and a systematic method for constructing patent networks of cloud computing. This paper specifically focuses on presenting evidence of the effectiveness of the proposed methods.

This paper is organized as follows. In section 2, the status of patent analysis of cloud computing is presented, and the drawbacks of network-based patent analysis are highlighted. The section also introduces the overall process of this study. Section 3 presents the process of the two-stage patent retrieval strategy. In section 4, we discuss the patent network analysis and identify the core technologies of the cloud industry. Furthermore, the verification of the proposed two-stage retrieval strategy and of our systematic method for constructing patent networks is presented. Section 5 contains our concluding remarks.

II. LITERATURE REVIEW

The National Institute of Standards and Technology (NIST) has clearly defined three service models relating to cloud computing (known as the SPI model): Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS) [7]. Although Androcec et al. [8] have put forward a much more complex model than the existing SPI model, their model lacks a comprehensive identity. In general, most enterprises still regard SaaS, PaaS, and IaaS as the three main service models [9].

Consumers of SaaS use the provider's applications, but they do not control the operating system, hardware, or underlying network infrastructure. Software service suppliers provide services on a rental rather than a purchase basis. The common pattern is to provide a username and password, as in the case of Microsoft Dynamics CRM and Salesforce.com CRM. Consumers of PaaS can control the environment in which the application runs. For instance, the users of Google App Engine take over parts of the mainframe, but not the operating system, hardware, or underlying network infrastructure. Consumers of IaaS mainly use the basic computing resources provided by suppliers, such as processing capacity, storage space, network components, or middleware. They can control the operating system, storage space, deployed applications, and network components (such as firewalls and load balancers), but they cannot control the basic structure of the cloud [10]. The best-known IaaS offering is Amazon's Elastic Compute Cloud (EC2).

In the age of competitive knowledge-based economies, patent rights have become the most tangible and influential intellectual asset, and cloud computing is no exception. To enhance their international competitiveness, companies must be able to innovate, research, and develop. Importantly, all the results of such efforts should be under patent protection. A patent database is the most abundant source of technical information. Before developing new techniques, enterprises are advised to conduct patent retrieval and turn the results into business intelligence to improve the quality of business decisions. Tools commonly used for patent retrieval include the patent databases provided by patent and trademark offices, commercial patent databases, and the patent analysis software used by researchers. At present, most of the information about the patent documents of advanced countries and regions can be obtained on the Internet. Through keyword searches, patent information that is deliberately hidden in the database can be retrieved effectively [11].

Even though the SPI model is generally recognized as defining the three major cloud business models, this does not imply that the range of cloud computing patents is already clearly defined. The difficulties of retrieving patents on cloud computing are threefold. First, owing to the characteristics inherited from distributed computing and grid computing, the areas covered by cloud computing are still vague, and the patent retrieval process is difficult [12]. Second, the development of the ontology of cloud computing is still incomplete (this will be discussed further in section 3). Third, there are very few analysis reports on determining the classification and scope of current cloud patents.

Xue et al. [13] studied six target enterprises of cloud computing (Google, IBM, Microsoft, Apple, Amazon, and Intel) in order to understand the mainstream technical fields of cloud computing and the status of patent portfolios. Because cloud computing is inherited from distributed computing, the range of patent retrieval was limited to two IPC classes: G06 (including Electronic Data-Processing, Data Handling System or Data Processing Means, and Information Storage Memory) and H04 (Telecommunication and Communication). However, even though relevant patents of cloud computing should be classified under IPC H04 and/or G06, these are vast classes containing much that cannot be considered cloud computing, and their meaning is open to different interpretations. Thus, the IPC classes and the six enterprises used as an analytical basis in the study of Xue et al. [13] are not sufficient for observing the technology strategy.

Owing to the increasing demand from patent engineers and decision makers for automated patent analysis, some scholars have already applied text mining to this field. For example, to determine R&D trends, Lent [14] analyzed a patent database with the sequence mining method, helping the industry discover areas with niche business opportunities. Sheremetyeva [15] and Shinmori et al. [16] conducted information extraction, summarization, and integration from patent documents through natural language processing. Tseng et al. [17] developed a text mining technique for patent analysis that comprises the following steps: text segmentation, summary extraction, feature selection, term association, cluster generation, topic identification, and information mapping. Hsu et al. [18] proposed a text mining process to identify potential development trends in information security from patent databases. Using text mining analysis and patent content clustering, Taghaboni-Dutta et al. [19] studied the development strategies of two competing factions of radio frequency identification (RFID) technology developers. Their findings support a strong link between intellectual property and competitive advantage.

Although text mining has been applied to patent analysis for quite a while, some argue against the use of this tool [20]. Many professional patent analyzers are rather suspicious of linguistic technologies for patent analysis [21] because they are unable to exert full control over the internal operation of the linguistic tools. On the other hand, analyzers have been paying increasing attention to the use of network analysis for analyzing the structures and relationships between patents. Since a certain correlation exists between similar patents, it is logical to use a network technique for patent analysis. This technique was first applied to sociometric studies of small groups, where the relationships among members were presented in graphical form. In recent years, it has been widely used in mapping and measuring relationships and flows between people, groups, organizations, computers, and other connected information entities [22]. In performing network analysis, the related patent keywords have to be extracted by text mining, and the Euclidean distance between patents is calculated to establish the relationships between them. The visual display of the results enables the analyzer to grasp the global structure of the patent set easily, and it helps to ease the suspicion of analyzers who perceive text mining as a black box.

Considering the serious drawbacks of traditional citation analysis, Yoon and Park [6] proposed a methodology for visual patent network analysis based on text mining. However, as their process for constructing the adjacency matrix is too subjective, we propose a systematic method that addresses this weakness. The visuals of the patent network can be used not only to analyze up-to-date technologies of cloud computing but also to verify the effectiveness of the proposed two-stage retrieval strategy. Detailed descriptions are presented in section 4.

Actors and relationships are the basic elements of network analysis. Actors are the main body of the entire network and are regarded as nodes. In this study, actors stand for patents. We perform the network analysis in UCINET6, a network analysis software package developed by Borgatti, Everett, and Freeman from the University of California. The overall process of conducting patent analysis goes through several steps, as illustrated in Fig. 1.

The underlying theory of patent analysis includes feature extraction and patent network development. The study uses text mining as a tool for data processing and keyword extraction, transforming the original patent data into structured data. Through measures of similarity between patents, the complex relationships among different cloud computing patents can be described more clearly by a visual network. Finally, the study applies the results of the network analysis to patent analysis. The purpose of gray relation-based cluster analysis is to verify the effectiveness of the proposed systematic method of constructing the patent network. A more detailed discussion of each step is presented in the following sections.

Fig. 1 Overall framework of network-based patent analysis (flowchart blocks: two-stage patent retrieval; patents of cloud computing; feature extraction; trend analysis of cloud patent classification; construction of adjacency matrix; systematic tests of network quality; patent network; patent analysis; relationship between the core patents and business; verification; gray relation-based cluster analysis)

III. TWO-STAGE PATENT RETRIEVAL STRATEGY

To support cloud users in finding a cloud service over the Internet, Han and Sim [23] presented a cloud service discovery system (CSDS), which interacts with a cloud ontology to determine the similarities between and among services. However, owing to the capacity limitation of our patent retrieval tool, the Patent Guider, it is not possible to input all these ontology keywords into the software at one time. We therefore developed a two-stage strategy to obtain patents of cloud computing.

Fig. 2 The overall process of two-stage patent retrieval (flowchart. Stage 1, snowball strategy: retrieving patents through the query “cloud computing”; manually screening and classifying the retrieved patents; pearl patents; text mining, extracting keywords by TF-IDF. Stage 2, block building search: constraint of patent granted year; patents of cloud computing)

In general, there are three kinds of patent search strategies: known item search, citation pearl growing, and block building [24]. The first method suits analyzers who already hold some bibliographic information and proceed with retrieval accordingly; its retrieval precision is usually higher. Analyzers may use the second method to identify relevant patents based on citation relationships with the patents on hand; this method can be repeated to expand the search results. The third method, block building, is applicable when retrieving topics that contain a number of concepts; it connects the different concepts with Boolean logic operators.

Our method mainly comprises a snowball strategy and a block building search method, as depicted in Fig. 2. Detailed explanations for each step are provided below. The rationale of the two-stage retrieval strategy is first to find a small number of patents that definitely belong to cloud computing. Next, based on the features of these patents, we obtain more patents via a block building search. Since the scope of the retrieved patents after this extension may be too broad, we apply a constraint on the patent granted year to screen out unwanted patents. Through this process of extension and reduction, the right patents of cloud computing can be retrieved.


We named the first stage of the retrieval process the “snowball strategy”, given its similarity to snowball sampling, a non-probability sampling technique for identifying potential subjects in studies where subjects are hard to locate. We applied the snowball strategy to secure one or more cloud-related patents (called the pearl) in advance, and then used them to acquire more patents from the United States Patent and Trademark Office (USPTO) database. As in a known item search, we used “cloud computing” as the query string and limited the search domain to the title, abstract, and claims of the patents. In total, 104 patents were retrieved. To make sure that the pearl patents are indeed cloud computing patents, we read the 104 patents manually. In the end, 62 patents remained, and we categorized them into the three commercial service models of cloud computing. Among these, 9 patents belong to IaaS, 21 to PaaS, and 45 to SaaS.

This study extracts keywords from the patent data above and uses the concept of TF-IDF to rank them. TF-IDF, normally used to overcome the weakness of evaluating the importance of lexical items by frequency alone, is the product of two terms: Term Frequency (TF) and Inverse Document Frequency (IDF). The formula below shows how to calculate the TF of keyword i in a document set S:

$$TF_i = \sum_{s_j \in S} \mathit{freq}_{ij} \qquad (1)$$

where $\mathit{freq}_{ij}$ is the frequency with which keyword i appears in document $s_j$.

The relative weighting among keywords, IDF, is determined by the ratio of the total number of documents N to the number of documents $n_i$ containing that term, i.e., keyword i [25]. The formula is as follows:

$$IDF_i = \log\left(\frac{N}{n_i}\right) \qquad (2)$$

IDF measures the dispersion of a lexical term across the document set; the smaller the proportion of documents containing that term, the higher the magnitude of this metric.

To retrieve the cloud computing patents accurately, this study selected the top 50 keywords of each service model according to their TF-IDF values. The keywords for each service model were determined after excluding the keywords that the models have in common; this left 7 keywords for IaaS, 5 for PaaS, and 10 for SaaS.
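The ranking and screening steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure actually implemented in this study: the toy pearl documents, the function names, the use of the natural logarithm, and the choice to compute IDF within each service model's own document set are all hypothetical.

```python
import math
from collections import Counter

def tf_idf_scores(docs):
    """Score each keyword by TF_i * IDF_i over a tokenized document set.

    TF_i sums the keyword's frequency over all documents (formula 1);
    IDF_i = log(N / n_i), where n_i counts the documents containing
    the keyword (formula 2). Natural log is an assumption.
    """
    n_docs = len(docs)
    tf = Counter()  # total occurrences of each keyword in the set
    df = Counter()  # number of documents containing each keyword
    for tokens in docs:
        tf.update(tokens)
        df.update(set(tokens))
    return {term: tf[term] * math.log(n_docs / df[term]) for term in tf}

def top_keywords(docs, k):
    """Return the k highest-scoring keywords (ties broken alphabetically)."""
    scores = tf_idf_scores(docs)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

def exclusive_keywords(top_by_model):
    """Drop every keyword that appears in more than one model's list."""
    counts = Counter(t for terms in top_by_model.values() for t in set(terms))
    return {model: [t for t in terms if counts[t] == 1]
            for model, terms in top_by_model.items()}

# Hypothetical pearl documents, already tokenized into keywords.
pearls = {
    "IaaS": [["virtual", "router", "network"], ["router", "storage"]],
    "SaaS": [["hosted", "search", "network"], ["hosted", "storage"]],
}
top = {model: top_keywords(docs, 3) for model, docs in pearls.items()}
print(exclusive_keywords(top))
```

In this toy data, "network" and "storage" rank highly in both models and are therefore excluded, which mirrors how the top 50 keywords per model shrink to a much shorter model-specific list.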

The second stage of the patent retrieval process uses a block building search technique to formulate a search statement. The block building approach starts with single concept (facet) searches, which usually result in a very large number of hits. Once the single concept searches are completed, they are combined using appropriate Boolean operators. This strategy decomposes a complex search task into simple search tasks [26]; in addition, it narrows the topic and reduces the number of hits.

This strategy divides the retrieval problem into numerous concepts and then identifies the major concepts, along with their logical relationships with one another. Moreover, the strategy identifies the search strings for each concept and combines them into a single set representing that concept using the Boolean operator OR. Finally, the strategy combines the facet sets with the Boolean operator AND.

In the first block building retrieval process, the study uses “cloud computing” as the search concept, and the search scope covers all parts of the patents. In the second block building retrieval process, we use the keywords screened by TF-IDF for each cloud computing commercial model as the search base. Then, by connecting the first and second blocks with the Boolean operator “AND”, we retrieve the patents of each cloud computing commercial model. In the end, we acquired 393 patents for PaaS, 167 patents for IaaS, and 650 patents for SaaS.
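The way a block building search statement is assembled can be illustrated with a short Python sketch. The query syntax below is generic Boolean notation, not the actual syntax of the Patent Guider or of the USPTO search interface, and the keywords in the second facet are hypothetical stand-ins for the TF-IDF keywords of one service model.

```python
def or_block(terms):
    """Combine the search strings of one concept (facet) with OR."""
    quoted = ['"{}"'.format(term) for term in terms]
    return "(" + " OR ".join(quoted) + ")"

def block_building_query(facets):
    """Combine the facet blocks with AND."""
    return " AND ".join(or_block(terms) for terms in facets)

# Facet 1: the concept used in the first block building step.
# Facet 2: hypothetical model-specific keywords (illustration only).
query = block_building_query([
    ["cloud computing"],
    ["virtual computer", "network routers"],
])
print(query)
# ("cloud computing") AND ("virtual computer" OR "network routers")
```

Keeping each facet as its own list makes it easy to reuse the first block unchanged while swapping in the keyword block of a different service model.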

Through the two-stage searching process, we were able to retrieve the appropriate cloud computing patents. The effectiveness of the proposed retrieval strategy can reasonably be inferred for three reasons. First, after the second retrieval stage, the proportion of patents among the three models (IaaS:PaaS:SaaS = 167:393:650) is quite close to that of the pearl patents in the first stage (IaaS:PaaS:SaaS = 9:21:45). It may therefore be assumed that, after the extended search, the retrieval strategy does obtain the relevant patents on cloud computing.
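The closeness of the two ratios can be checked by converting each into percentage shares. The counts below are the ones reported in the text; the helper name `shares` is introduced here only for illustration.

```python
def shares(counts):
    """Convert raw patent counts into fractions of the total."""
    total = sum(counts.values())
    return {model: n / total for model, n in counts.items()}

pearl = shares({"IaaS": 9, "PaaS": 21, "SaaS": 45})
retrieved = shares({"IaaS": 167, "PaaS": 393, "SaaS": 650})

for model in pearl:
    print(f"{model}: pearl {pearl[model]:.1%}, retrieved {retrieved[model]:.1%}")
```

The pearl shares are 12.0%, 28.0%, and 60.0% for IaaS, PaaS, and SaaS, against 13.8%, 32.5%, and 53.7% after the second stage, so the ordering of the three models is preserved and each share moves by a few percentage points.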

Second, we demonstrate the effectiveness of our retrieved data from the viewpoint of network analysis. In network analysis, the indicator degree centrality is interpreted as the ratio of the number of tied links to all other patents. If the degree centrality of a patent retrieved in the first stage is high, it remains high after the second stage of retrieval. Detailed results and descriptions of this finding are presented in section 4.5.

Third, we demonstrate the effectiveness from the perspective of bibliographic analysis. Regarding the patent analysis of cloud computing, Chiou [27] undertook a bibliometric analysis of the literature and its evolutionary life cycle and found that it was in its infancy between 1996 and 2007, entered its growth period in 2007, and gradually reached maturity in 2012. The patent analysis of cloud computing is expected to reach the saturation stage in 2016. So far, the research and development relevant to cloud computing techniques is still in the growth stage. According to Gartner’s report [28], the market for cloud computing is still maturing and rapidly evolving.

Table 1 The technological content of cloud patent classification

IPC            Technological content
G06F015/16     Combinations of two or more digital computers, with each having at least an arithmetic unit, a program unit, and a register, e.g., for the simultaneous processing of several programs.
G06F017/30     Information retrieval; Database structures therefor.
G06F015/173    Using an interconnection network, e.g., matrix, shuffle, pyramid, star, or snowflake.
H04L012/28     Characterized by path configuration, e.g., Local Area Networks (LAN) or Wide Area Networks (WAN).
G06F007/00     Methods or arrangements for processing data by operating upon the order or content of the data handled.

Fig. 3 The trend analysis of cloud patents from 2000 to 2012 (line chart; x-axis: years, 2000 to 2012; y-axis: patents amount, 0 to 200; series: G06F015/16, G06F017/30, G06F015/173, H04L012/28, G06F007/00)

To distinguish cloud computing patents from those of distributed computing and grid computing, the retrieved patents are restricted to those granted in 2000 or later. Owing to space restrictions, only the top 5 patent classifications, which account for the highest numbers of patents granted, are plotted in Fig. 3: G06F015/16, G06F017/30, G06F015/173, H04L012/28, and G06F007/00. The definition of each classification is shown in Table 1.

As depicted in Fig. 3, cloud patents began to accumulate around 2000 and grew continuously thereafter, with the exception of the period between 2008 and 2009. This transient decline may be regarded as an outcome of the global financial crisis. It is interesting to note that the growth in cloud patents over the years, as revealed in Fig. 3, is consistent with the literature reports.

IV. NETWORK-BASED PATENT ANALYSIS

The underlying theory of patent analysis includes feature (keyword) extraction and patent network development. Detailed explanations for each step of patent analysis are presented below.

1. Feature extraction

Based on the patent retrieval results, this study conducted keyword extraction, selection, and vectorization of the patent documents. These features are an important clue for users to determine the meaning of a patent. To extract features that represent the significance of the patent text, this study applied the text-mining algorithm proposed by Tseng et al. [17]. First, we pre-processed the data and extracted the preliminary keywords. To keep the dimension of the term-document incidence matrix at a manageable level, the extracted preliminary keywords that have the same meaning were merged. In addition, we eliminated the prefixes and suffixes accompanying the extracted word stems (for example, function words such as “of,” “on,” and “the,” as well as pronouns) and low-meaning and repetitive strings.

Table 2 Top 10 keywords extracted from IaaS

No.  Keywords                      Entropy
1    Cloud computing network       0.17381
2    Routing                       0.21579
3    Plurality                     0.22160
4    Virtual computer              0.23290
5    Computing nodes               0.27154
6    Substrate network             0.29214
7    Network routers               0.29630
8    Network topology              0.30109
9    Networking devices            0.30323
10   Configuration information     0.31080

Table 3 Top 10 keywords extracted from PaaS

No.  Keywords                      Entropy
1    Client connection             0.19897
2    Peer discovery                0.19897
3    Virtual machines              0.20742
4    Substrate network             0.21384
5    Network routers               0.21729
6    Configuring                   0.21884
7    Peer identity                 0.22105
8    Storage nodes                 0.2243
9    Client application processes  0.2256
10   Network interface             0.2276

Table 4 Top 10 keywords extracted from SaaS

No.  Keywords                      Entropy
1    Storage device                0.08002
2    Search engine                 0.15392
3    Hosted                        0.15716
4    Peer discovery                0.15914
5    Bandwidth                     0.17379
6    Memory medium                 0.17679
7    Workload                      0.17939
8    Infrastructure                0.18059
9    Cloud broker                  0.18146
10   Enabling access               0.18208

Before calculating the similarity between patents by correlation analysis, one needs to determine the keywords corresponding to each cloud computing business model. Since keywords that appear frequently in most patents are redundant and carry less information for users, whereas keywords that appear in fewer patents are more informative, this study applied the entropy index to select informative keywords. The concept of entropy was originally used to measure the level of disorder in a thermodynamic system. Shannon [29] was the first researcher to apply this concept in information theory. In 2004, Kao et al. [30] applied the concept of entropy to mine informative web structures and contents. The entropy of keyword $T_i$ is expressed as follows:

$$E(T_i) = -\sum_{j=1}^{m} Z_{ij} \log Z_{ij} \qquad (3)$$

where m is the number of events, and $Z_{ij}$ is the normalized frequency of keyword $T_i$ in the patent set. When a keyword is spread across all of the patent documents, its entropy can be higher; however, the uncertainty can be lower. Accordingly, this study included the keywords with lower entropy in the keyword thesaurus. If a keyword appears only in one specific document, its entropy is 0 and its uncertainty of information might be the lowest; nevertheless, it might be meaningless and ought to be eliminated.
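A minimal Python sketch of the entropy screening in formula (3) is given below. Several points are assumptions rather than details taken from this study: $Z_{ij}$ is taken to be the share of a keyword's total frequency that falls in patent j, the natural logarithm is used, and the cut-off value, the function names, and the toy frequencies are hypothetical.

```python
import math

def keyword_entropy(freqs):
    """E(T_i) = -sum_j Z_ij * log(Z_ij) for one keyword.

    freqs[j] is the keyword's frequency in patent j; Z_ij is assumed to be
    freqs[j] divided by the keyword's total frequency.
    """
    total = sum(freqs)
    entropy = 0.0
    for f in freqs:
        if f > 0:  # the term 0 * log(0) is treated as 0
            z = f / total
            entropy -= z * math.log(z)
    return entropy

def select_keywords(freq_table, threshold):
    """Keep keywords whose entropy is positive and at most `threshold`.

    Zero-entropy keywords (found in a single patent) are eliminated,
    as are widely spread keywords whose entropy exceeds the cut-off.
    """
    scores = {term: keyword_entropy(f) for term, f in freq_table.items()}
    return sorted(t for t, e in scores.items() if 0.0 < e <= threshold)

# Hypothetical frequencies of three keywords across four patents.
freq_table = {
    "plurality": [2, 2, 2, 2],      # spread evenly over every patent
    "routing": [5, 1, 0, 0],        # concentrated in a few patents
    "peer identity": [3, 0, 0, 0],  # appears in one patent only
}
print(select_keywords(freq_table, threshold=1.0))
```

With these toy numbers only "routing" survives: the evenly spread keyword has the maximum entropy log 4, and the single-document keyword has entropy 0.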

By conducting the proposed two-stage patent retrieval process, we extracted 628, 1255, and 1635 keywords from IaaS, PaaS, and SaaS, respectively. Clearly, there are too many keywords for each model. Since most of the keywords appeared in only one document, the study used the entropy method to screen them. Based on the cloud ontology of SaaS proposed by Han and Sim [23], the screening leaves 61, 171, and 233 keywords for IaaS, PaaS, and SaaS, respectively. Owing to space limitations, only the top 10 extracted keywords for IaaS, PaaS, and SaaS are listed in Table 2, Table 3, and Table 4, respectively.

Although the basic structure of the three models of cloud computing is reasonably distinct, there is considerable overlap of cloud provision among the three models [31]; it is therefore not surprising that some keywords overlap among the models, as shown in Table 2, Table 3, and Table 4.

In general, manual reading is the most time-consuming task during the process of patent retrieval. The information provided from the keyword vector offers us a tool to screen cloud-computing patents automatically. If a patent document contains only a few keywords belonging to the keyword data of a certain cloud computing business model, its similarity to other patents in this model will be very low. The results of the network analysis will certainly not show a connection between this patent and others, suggesting that the patent does not belong to this model. That is to say, network analysis can help us to filter out the unrelated patents. Therefore, network analysis can also be considered as an automatic selection tool for patent retrieval to reduce the time spent on manual reading.

2. Adjacency matrix

A keyword vector of a patent can assist us in judging the similarity between patents. After extracting keywords by using the entropy method, we constructed the adjacency matrix directly by inputting the cosine similarity values between patents as its entries. The expression of similarity between two patents $P_m$ and $P_n$ is shown in (4):

$$\mathrm{sim}(P_m, P_n) = \frac{\vec{V}(P_m) \cdot \vec{V}(P_n)}{|\vec{V}(P_m)|\,|\vec{V}(P_n)|} \tag{4}$$

where $\vec{V}(P_m) \cdot \vec{V}(P_n)$ is the inner product between the two vectors of the patent documents, and $|\vec{V}(P_m)|$ and $|\vec{V}(P_n)|$ are the Euclidean lengths of the vectors $\vec{V}(P_m)$ and $\vec{V}(P_n)$, respectively. Cosine similarity may be interpreted as the cosine of the angle between the two vectors. The smaller the angle, the more similar the two vectors will be. Since there are numerous analyzed data, the study establishes the adjacency matrix with the 10 documents having the highest correlation values.
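A minimal sketch of Eq. (4) and of the resulting similarity matrix is given below. The function names are hypothetical, the keyword vectors are plain lists of keyword weights, and leaving the diagonal at zero (no self-links) is an assumption; the restriction to the 10 most correlated documents is not reproduced because its exact rule is not spelled out.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two keyword vectors, as in Eq. (4)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a patent with no keywords is treated as unrelated
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Symmetric matrix of pairwise cosine similarities.

    The diagonal is left at 0.0 so that a patent is not linked to itself
    (an assumption of this sketch).
    """
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for m in range(n):
        for k in range(m + 1, n):
            s = cosine_similarity(vectors[m], vectors[k])
            sim[m][k] = s
            sim[k][m] = s
    return sim


if __name__ == "__main__":
    # Hypothetical keyword vectors for three patents.
    patents = [[1, 2, 0], [2, 4, 0], [0, 0, 3]]
    for row in similarity_matrix(patents):
        print([round(x, 3) for x in row])
```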

3. Network analysis

Network analysis is used to extract patterns of relationships between actors in order to discover the underlying structure. Measuring the network location gives us insight into the various roles and groupings in a network.

Among various measures offered by network analysis, degree centrality, which is interpreted as the ratio of the number of tied links to all other patents [6], is an important indicator for evaluating the importance of patents. A central patent occupies a network location that serves as a source for larger volumes of information exchange and other resource transactions with other patents. In general, the higher the centrality index, the greater the impact on other patents.

Relative to the local centrality, which concerns the relative prominence of a focal point in its neighborhood, global centrality (centralization) is concerned with prominence within the whole structure of the network. According to Wasserman et al.’s [32] definition, centralization provides a measure of the extent to which a whole network has a centralized structure; in contrast, more decentralization implies that there is no apparent central tendency. This suggests that if the level of difference among patents is high, the quality of the network graph is low. On the other hand, the higher the value of network centralization, the lower the difference between patents, and the higher the quality of the network graph.

Yoon and Park [6] proposed the technology centrality index to assess the relationships between patents. This index is equivalent to actor degree centrality in network analysis, and is defined as:

$$C'_D(p_i) = \frac{d(p_i)}{g - 1} \tag{5}$$

where $d(p_i)$ is the number of lines that are incident from patent $i$, and $g$ is the total number of patents.

In this study, “centralization” is also used to quantify the overall cohesion or integration of the graph, and is considered as a criterion for evaluating the structural quality of network analysis. Before performing the network analysis, all the entries in the adjacency matrices should be dichotomized. A “1” in the matrix indicates a direct relationship between two patents, and a “0” indicates that there is no connection between them. To transform the values, a cut-off value of similarity should be determined in advance. Obviously, the structure of the network graph is closely related to the cut-off value and the selection of the keywords. Yoon and Park [6] considered the determination of the cut-off value to be a subjective, trial-and-error task. They constructed the adjacency matrices in a subjective way by simply selecting a reasonable value so that the structure of the network becomes clearly visible. Inspired by the study of Fattori et al. [20], we propose a systematic process for evaluating the impact of these two interacting factors on the structure and visual effect of the network graph.
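The dichotomization step and the two indexes can be sketched as follows. The technology centrality follows Eq. (5). The network centralization formula is not given in the text, so the sketch assumes Freeman's degree centralization (the sum of differences between the largest degree and every degree, divided by $(g-1)(g-2)$); the function names are hypothetical.

```python
def dichotomize(sim, cutoff):
    """Turn a similarity matrix into a 0/1 adjacency matrix.

    An entry becomes 1 when the similarity reaches the cut-off value;
    the diagonal is always 0 (no self-links).
    """
    n = len(sim)
    return [
        [1 if i != j and sim[i][j] >= cutoff else 0 for j in range(n)]
        for i in range(n)
    ]


def technology_centrality(adj):
    """Eq. (5): number of tied links of each patent divided by g - 1."""
    g = len(adj)
    return [sum(row) / (g - 1) for row in adj]


def network_centralization(adj):
    """Freeman degree centralization (assumed definition), for g >= 3.

    Returns 1.0 for a star network and 0.0 when all degrees are equal.
    """
    g = len(adj)
    degrees = [sum(row) for row in adj]
    d_max = max(degrees)
    return sum(d_max - d for d in degrees) / ((g - 1) * (g - 2))


if __name__ == "__main__":
    # Hypothetical similarities: patent 0 is close to all others.
    sim = [
        [0.0, 0.9, 0.9, 0.9],
        [0.9, 0.0, 0.1, 0.1],
        [0.9, 0.1, 0.0, 0.1],
        [0.9, 0.1, 0.1, 0.0],
    ]
    adj = dichotomize(sim, cutoff=0.75)
    print(technology_centrality(adj))
    print(network_centralization(adj))
```

Raising or lowering `cutoff` changes which links survive, which is why the cut-off value and the keyword selection jointly determine the structure of the graph.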

Fattori et al. [20] developed an experimental text mining tool prototype named PackMOLE to implement an application for mining patent information in the packaging field. The software user has the option of customizing a number of different parameters that govern the clustering process, including the maximum number of clusters allowed, the weighting system, the keyword drop threshold, and the minimum domain homogeneity (alpha). Three different values of the weighting system are assumed: large domains, specific domains, and medium domains. By selecting large domains, the algorithm tends to create large clusters based on frequent words; with specific domains, small clusters based on rare words are likely to be created, while medium domains is a compromise between the two. The alpha parameter is the minimum degree of similarity that two documents must possess for them to be grouped within the same cluster. To evaluate a particular combination of clustering parameters, Fattori et al. [20] created three built-in criteria: within, between, and quality. The within criterion is a measure of the internal homogeneity of each cluster; the between criterion measures the degree of similarity between different clusters; and the quality criterion summarizes the two preceding ones.

Table 5 Parameter tests of network quality in IaaS

Domain scale  Min. cosine similarity  Network centralization
Large         0.70                    20.86%
Large         0.75                    17.06%
Large         0.80                    16.28%
Medium        0.70                    20.71%
Medium        0.75                    21.27%
Medium        0.80                    20.33%
Specific      0.70                    19.79%
Specific      0.75                    17.65%
Specific      0.80                    18.06%

Fig. 4 Patent networks with better quality in IaaS

Fig. 5 Patent networks with worse quality in IaaS

Drawing on the research of Fattori et al. [20], this study has divided the keyword data of each business model into three domain scales. For example, with regard to IaaS, we have separated the obtained 61 keywords into three parts according to the rank of the entropy value: the top 20 are set as the large domain, the 20 keywords in the middle as the medium domain, and the rest as the specific domain. By testing combinations of two parameters, the domain scale and the minimum value of similarity, we were able to evaluate the patent network structure according to the corresponding value of centralization. The results and summaries for the business models are presented as follows. This study also presents the network with lower quality in order to highlight the difference between network qualities.

Table 5 presents the network quality of different test parameters in IaaS. The highest network centralization (21.27%) is obtained with the medium domain scale and a minimum cosine similarity of 0.75.
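The parameter test amounts to a small grid search, which can be sketched as follows. The centralization values are copied from Table 5; the helper `split_domains` and its argument names are hypothetical, and it assumes the keyword list has already been ranked by entropy in the order used by the study.

```python
def split_domains(ranked_keywords, size=20):
    """Split an entropy-ranked keyword list into the three domain scales.

    The first `size` keywords form the large domain, the next `size`
    the medium domain, and the remainder the specific domain.
    """
    return {
        "Large": ranked_keywords[:size],
        "Medium": ranked_keywords[size:2 * size],
        "Specific": ranked_keywords[2 * size:],
    }


# Network centralization (%) for each (domain scale, minimum cosine
# similarity) combination, as reported in Table 5 for IaaS.
TABLE_5 = {
    ("Large", 0.70): 20.86,
    ("Large", 0.75): 17.06,
    ("Large", 0.80): 16.28,
    ("Medium", 0.70): 20.71,
    ("Medium", 0.75): 21.27,
    ("Medium", 0.80): 20.33,
    ("Specific", 0.70): 19.79,
    ("Specific", 0.75): 17.65,
    ("Specific", 0.80): 18.06,
}


def best_combination(results):
    """Return the parameter combination with the highest centralization."""
    return max(results, key=results.get)


if __name__ == "__main__":
    domains = split_domains([f"kw{i}" for i in range(61)])
    print({name: len(words) for name, words in domains.items()})
    print(best_combination(TABLE_5))
```

For the 61 IaaS keywords the split yields 20, 20, and 21 keywords, and the selection returns the medium domain with a cut-off of 0.75.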

Fig. 4 presents the patent network with better quality in IaaS. It suggests that the level of difference among patents is lower when patent documents are distributed into two groups, with the density in the left group being higher.

Fig. 5 depicts the patent networks with worse quality in IaaS. They are more decentralized, with no apparent central tendency, which suggests that the level of difference among patents is higher. It may be worth pointing out that several numbers are listed on the left in Fig. 4; these are the patents that have no association with others. In general, a patent has at least some connections with others due to technological continuity or similarity. It is therefore reasonable to infer that those separate patents may not belong to cloud computing. This provides us with an auxiliary mechanism to filter out unwanted patents. This feedback process is shown in Fig. 1.

Table 6 shows the network quality of different test parameters in PaaS. The highest network centralization (13.10%) is obtained with the large domain scale and a minimum cosine similarity of 0.70. The patent networks in PaaS are shown in Fig. 6.

Table 6 Parameter tests of network quality in PaaS

Domain scale  Min. cosine similarity  Network centralization
Large         0.70                    13.10%
Large         0.75                    12.28%
Large         0.80                    8.76%
Medium        0.70                    12.94%
Medium        0.75                    12.62%
Medium        0.80                    11.87%
Specific      0.70                    6.57%
Specific      0.75                    6.59%
Specific      0.80                    5.98%

Fig. 6 Patent networks in PaaS

Table 7 presents the network quality of different test parameters in SaaS. The highest network centralization (25.93%) is obtained with the medium domain scale and a minimum cosine similarity of 0.70. The patent networks in SaaS are shown in Fig. 7.

Table 7 Parameter tests of network quality in SaaS

Domain scale  Min. cosine similarity  Network centralization
Large         0.70                    7.47%
Large         0.75                    5.98%
Large         0.80                    4.44%
Medium        0.70                    25.93%
Medium        0.75                    22.43%
Medium        0.80                    20.38%
Specific      0.70                    5.82%
Specific      0.75                    5.19%
Specific      0.80                    4.76%

Fig. 7 Patent networks in SaaS

4. Patent analysis

With the aid of visual networks and indexes, the cores of cloud computing technologies can be identified. The technology centralities of patent networks in the IaaS model are shown in Table 8. Of all the technology centralities, patent 124 has the highest percentage. This suggests that patent 124 has a greater number of connections with patents in the network and is more influential. The patent number for patent 124 is US08201237; it is titled “Establishing secure remote access to private computer networks.” The content of this patent mainly includes techniques that provide users with access to computer networks, such as those that enable users to interact with a remote configurable network service to create and configure computer networks that are provided by the configurable network service. Given that the business model of IaaS primarily includes consumers using “basic computing resources” such as storage space, network components, or middleware, it is unsurprising that patent 124 has been classified into the IaaS model.

Table 8 Technology centrality of patents in IaaS

Patent  Technology centrality
124     36.75%
127     36.15%
112     35.54%
105     33.74%
40      33.74%
52      32.53%
100     32.53%

Table 9 Technology centrality of patents in PaaS

Patent  Technology centrality
143     13.78%
118     12.50%
122     12.25%
127     11.99%
168     11.22%
42      11.22%
74      10.97%

The technology centralities of patent networks in the PaaS model are presented in Table 9. Patent 231 has the highest percentage. The table shows that patent 231 has considerably more connections with patents in the network and is more influential.

The patent number for patent 231 is US08281046; it is titled “System and method for distributing user interface device configurations.” The content of this patent mainly includes a system with a controller to collect a plurality of user interface (UI) device configurations, receive a request from a computing device to download one or more of such UI device configurations, and transmit to the computing device the downloaded UI device configurations requested to configure one or more UI devices of the computing device. In the PaaS model, consumers' applications are hosted on the provider's computers: consumers control the operation of their applications, but they cannot control the operating system, hardware, or the basic network structure. Therefore, patent 231 should be classified into the PaaS model.

Table 10 Technology centrality of patents in SaaS

Patent  Technology centrality
42      41.29%
529     41.29%
415     41.29%
461     41.29%
46      41.29%
533     41.29%

Table 10 presents part of the technology centralities of patent networks in the SaaS model. The patent with the highest technology centrality is patent 244. This shows that patent 244 is connected with a greater number of patents in the network and is more influential. The patent number for patent 244 is US08336047; it is titled “Provisioning virtual resources using name resolution.” This patent concerns a data string, including a resource identifier and one or more resource attributes, that is parsed at a name resolution module and provided to a computing resource provisioning system. The SaaS model is created for consumers to use applications and is a service-based concept rather than a concept related to the control of the operating system, hardware, or the basic network structure. In addition, the suppliers of software services provide a rental-based rather than a purchase-based customer service; for example, suppliers commonly provide a series of accounts and passwords. Therefore, we decided that patent 244 ought to be classified into the SaaS model.

5. Verification of the retrieval strategy

Since there is no existing literature containing complete data of cloud patents, this study can only prove the validity of the two-stage retrieval strategy indirectly. As mentioned in section 3, the effectiveness of this strategy can be reasonably inferred for three reasons. The second reason is demonstrated with the aid of the calculated data of degree centrality. For example, with regard to the IaaS model, the 9 patents retrieved in the first retrieval stage were divided into three groups according to their value of degree centrality. Since the indicator of degree centrality is the ratio of the number of tied links to all other patents, patents with more tied links also have higher degree centrality. As shown in Table 11, in the group with high degree centrality, patents 1, 3, 4, 8, and 9 each have 4 links with other patents in the first retrieval stage. During the second retrieval stage, a total of 167 patents were retrieved for the IaaS model, and the corresponding numbers of these five patents are 42, 157, 162, 166, and 167. Each of them has more than 50 tied links, i.e., they still have a high value of degree centrality. Similarly, in the group with medium degree centrality, patents 6 and 7 each have 1 tied link in the first stage; during the second retrieval stage, they have approximately 30 tied links. In the group with low degree centrality, patents 2 and 5 have no tied links in the first stage; during the second retrieval stage, they have fewer than 10 tied links. It is evident that if the degree centrality of a patent retrieved in the first stage is high, it maintains a high number of tied links during the second retrieval stage. This consistent pattern of degree centrality across the two retrieval stages supports the effectiveness of the strategy.

Table 11 The variation of degree centrality between two-stage retrieval

Degree      First retrieval stage       Second retrieval stage
centrality  Patent no.   Tied links     Patent no.   Tied links
High        1            4              42           52
High        3            4              157          52
High        4            4              162          51
High        8            4              166          52
High        9            4              167          51
Medium      6            1              164          31
Medium      7            1              165          28
Low         2            0              93           6
Low         5            0              153          7
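The consistency argument can be checked mechanically with the figures in Table 11. The sketch below is only a sanity check under a simple assumption of our own: the grouping is regarded as preserved when every patent in a higher group has more second-stage tied links than every patent in the next lower group. The function name and the row layout are hypothetical.

```python
# Rows of Table 11: (group, first-stage patent no., first-stage tied links,
#                    second-stage patent no., second-stage tied links)
TABLE_11 = [
    ("High", 1, 4, 42, 52),
    ("High", 3, 4, 157, 52),
    ("High", 4, 4, 162, 51),
    ("High", 8, 4, 166, 52),
    ("High", 9, 4, 167, 51),
    ("Medium", 6, 1, 164, 31),
    ("Medium", 7, 1, 165, 28),
    ("Low", 2, 0, 93, 6),
    ("Low", 5, 0, 153, 7),
]


def groups_preserved(rows, order=("High", "Medium", "Low")):
    """Check that the first-stage grouping survives in the second stage.

    For each pair of adjacent groups, the smallest second-stage link count
    of the higher group must exceed the largest one of the lower group.
    """
    for higher, lower in zip(order, order[1:]):
        min_higher = min(r[4] for r in rows if r[0] == higher)
        max_lower = max(r[4] for r in rows if r[0] == lower)
        if min_higher <= max_lower:
            return False
    return True


if __name__ == "__main__":
    print(groups_preserved(TABLE_11))
```

For the nine IaaS patents the check succeeds: the second-stage link counts of the high, medium, and low groups do not overlap.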

6. Gray relational-based cluster analysis

This section presents a gray relational-based cluster analysis in order to verify the effectiveness of the proposed systematic method for constructing patent networks. For example, with regard to IaaS, it is not easy to determine how many clusters the patents should be divided into; however, Fig. 4 displays two clusters with intensive patents. Consequently, we performed gray relational analysis (GRA) to check the patent relationship within and between these two clusters, and compared the results with Fig. 4.

Gray theory is an effective mathematical means to deal with systems characterized by incomplete information, such as architecture, parameters, operation mechanism, and system behavior. One of the major advantages of the gray system theory is the ability to identify major correlations among factors of a system with a relatively small amount of data [33]. The main procedure of GRA consists of four steps: gray relational generation, reference sequence definition, gray relational coefficient calculation, and gray relational grade calculation. At the gray relational generation step, the performances of alternatives, expressed as $Y_i = (y_{i1}, y_{i2}, \ldots, y_{in})$, are translated into comparability sequences $X_i = (x_{i1}, x_{i2}, \ldots, x_{in})$. The gray relational coefficient $\varsigma_{ij}$ can be expressed as:

$$\varsigma_{ij} = \frac{\Delta_{\min} + \zeta\,\Delta_{\max}}{\Delta_{ij} + \zeta\,\Delta_{\max}} \tag{6}$$

where $\Delta_{ij}$ is the deviation sequence of the reference sequence ($X_0$) and the comparability sequence ($X_i$), i.e., $\Delta_{ij} = |x_{0j} - x_{ij}|$. The distinguishing coefficient $\zeta$ is set at 0.3 in this study. $\Delta_{\max}$ is the largest value of $\Delta_{ij}$ and $\Delta_{\min}$ is the smallest value of $\Delta_{ij}$. The gray relational grade $\Gamma(x_{0j}, x_{ij})$ is computed by averaging the gray relational coefficient corresponding to each quality characteristic (Tsao, [33]) and is defined as:

$$\Gamma(x_{0j}, x_{ij}) = \frac{1}{n}\sum_{i=1}^{m} \varsigma_{ij}, \quad \text{for } j = 1, 2, \ldots, n \tag{7}$$
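A minimal sketch of the coefficient and grade calculation is given below. It follows the prose description (the grade of a sequence is the average of its coefficients over the characteristics), takes $\Delta_{\min}$ and $\Delta_{\max}$ over all sequences and characteristics, and assumes the inputs have already gone through gray relational generation. These choices and the function name are assumptions of the sketch; grades computed this way cannot exceed 1, so it is not expected to reproduce the scale of the values reported later in Table 12.

```python
def gray_relational_grades(reference, sequences, zeta=0.3):
    """Gray relational grade of each comparability sequence.

    reference: the reference sequence X_0 (list of numbers)
    sequences: list of comparability sequences X_i of the same length
    zeta:      distinguishing coefficient (0.3 in the study)
    """
    # Deviation sequences: absolute difference to the reference.
    deltas = [
        [abs(r - x) for r, x in zip(reference, seq)]
        for seq in sequences
    ]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0:
        # Every sequence coincides with the reference.
        return [1.0 for _ in sequences]

    grades = []
    for row in deltas:
        # Gray relational coefficient for each characteristic, as in Eq. (6).
        coeffs = [(d_min + zeta * d_max) / (d + zeta * d_max) for d in row]
        # Grade: average coefficient over the characteristics.
        grades.append(sum(coeffs) / len(coeffs))
    return grades


if __name__ == "__main__":
    # Hypothetical keyword vectors: the first matches the reference exactly.
    ref = [1.0, 1.0, 0.0]
    others = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
    print(gray_relational_grades(ref, others))
```

A sequence identical to the reference receives the highest grade, and the grade falls as the deviation grows, which is the property used in the comparison that follows.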

The gray relational grade indicates the degree of similarity between the comparability sequences and the reference sequence. Because Fig. 4 presents two major clusters, one patent was selected from each cluster as the reference sequence to calculate the relational grades of all patents. Patent 124 was selected as the reference due to its high value of technology centrality; the gray relation grades of all patents were then calculated using the keyword vectors without dichotomized transformation. The results, indicated as curve “patent 124” in Fig. 8, are plotted in descending order. This curve shows the existence of many patents with a close relationship to patent 124; therefore, they can be considered as a group with similar techniques. Subsequently, patent 2 was selected as the reference, which is located in a different cluster to patent 124. The dotted line, indicated as curve “patent 2” in Fig. 8, shows the existence of another group of patents with a close relation to patent 2. Due to space limitations, only four patents with the highest gray relational grades of “curve 124” and “curve 2” are presented in Table 12.

Fig. 8 Gray relation grades of all patents by using different patents as reference (horizontal axis: patent number; vertical axis: gray relational grade; curves: patent 124 and patent 2; regions: Group A, Group B, Group C)

When using patent number 124 as the reference, the gray relational values of patent numbers 88, 116, 157, and 113 are high (which appear on the left side of the solid line in Fig. 8); this means they have similar technologies as patent 124. As we can see from Fig. 4, they are actually in the same cluster and are located close to each other. On the other hand, when using patent number 2 as the reference, patent numbers 99, 18, 15, and 34 have a high gray relational grade (they appear on the right side of the dotted line in Fig. 8). Fig. 4 shows that these patents belong to another cluster.

Table 12 Comparison of gray relational grade of patents by using different patents as reference

Patent number  Reference patent 124  Reference patent 2
88             1.517                 1.039
116            1.478                 0.981
157            1.452                 1.012
113            1.405                 1.011
99             0.996                 1.906
18             0.996                 1.884
15             0.996                 1.917
34             0.996                 1.859

It should be noted that patents with a close relationship with patent 124, as shown in Table 12, may in contrast have a low relational grade value when patent number 2 is used as the reference. In other words, patents 88, 116, 157, and 113, illustrated on the left side of the dotted line in Fig. 8, have a low relational grade value with respect to patent 2. Similarly, those patents with a strong relationship with patent 2, as shown on the right side of the solid line in Fig. 8 (patents 99, 18, 15, and 34), have a low relational grade value with respect to patent 124.

We can divide the patents shown in Fig. 8 into three groups: patents having similar technology to patent 124 are categorized into group A; patents having similar technology to patent 2 fall into group C; and patents with technology similar to neither patent 124 nor patent 2 belong to group B. Since group A and group C are composed of patents with apparently different technology, when we use patent 124 as the reference, the relational grades of patents in group A are high; conversely, when we use patent 2 as the reference, their relational grades decrease significantly. The opposite is true for group C. As regards group B, since the patents in this group have nothing to do with patent 124 and patent 2, the relational grades change only slightly regardless of the patent used as reference. By comparing these results with Fig. 4, we can reasonably conclude that the two clusters with intensive patents shown in Fig. 4 are equivalent to group A and group C in Fig. 8, while patents scattered outside these two clusters can be regarded as patents of group B.

It is noteworthy that the establishment of the patent networks shown in Fig. 4 was based on our proposed systematic approach, which uses only part of the keyword vectors during each calculation; however, in this section, we calculated the gray relation grade of all patents by using all the keyword vectors without the process of dichotomized transformation. Since the results shown in Fig. 4 and Fig. 8 are established on different grounds, the consistency of these two figures substantiates the effectiveness of our proposed systematic method.

So far, some researchers have proposed a new classification system for cloud computing, which is unlike the conventional SPI model. For example, Youseff et al. [34] depicted cloud computing as comprising five layers, with three constituents to the cloud infrastructure layer. It is interesting that our results corroborate this argument. This may be regarded as a supplementary contribution of this study.

7. Relationship between the core patents and business

As regards the IaaS model, we attempted to connect the content of the core patents attained above with the current business conditions of the patent owners in the cloud computing industry.

Fig. 9 shows the patent network that has been developed by setting the core patent in the IaaS model, patent 124, at the center; most of the patents in the network are connected with this particular patent.

Amazon owns patent 124. Amazon Web Services (AWS), which has been highly promoted by Amazon, is the overwhelming market share leader [28]. In recent years, Amazon has offered numerous remote Web services that have significantly influenced the IaaS field, such as AWS EC2 (which provides virtual server services) and AWS S3 (which provides storage space services).

Other patents, such as patent 127 (SAP), patent 112 (Microsoft), patent 105 (AT&T), patent 40 (Onemednet), patent 52 (Microsoft), and patent 100 (Google), also performed well on technology centrality in the IaaS network. According to patent-related information published in 2012 and 2013, AT&T, Onemednet, and SAP have quickly entered the market, offering similar products that can provide virtual server services. Microsoft and Google, which are known for the PaaS model, also performed fairly well in the IaaS field. These companies offer services different from those of Amazon, focusing on virtual machine backup as well as the Integrated Development Environment (IDE), which provides network infrastructure. This shows that Microsoft and Google are trying to move quickly into the IaaS market.

Fig. 9 Core patent networks in the IaaS model

V. CONCLUSION

In essence, cloud computing is a general term for anything that involves delivering hosted services over the Internet. An increasing number of enterprises and decision makers worldwide identify the development of cloud computing as an important strategy for the success of their businesses in the immediate future. To take advantage of the substantial benefits offered by cloud computing, strategic planning of this technique through patent analysis is crucial. The emerging methods of patent network analysis, which have become increasingly popular, involve the use of text mining and network techniques. Patent retrieval is the cornerstone of such patent network analysis. However, because there are many difficulties in retrieving cloud patents through conventional methods, we have developed a two-stage patent retrieval strategy. We have validated the accuracy of the retrieved information from the perspectives of bibliometric analysis and network analysis. This study is the first to reveal the full scope of cloud computing patents and their related patent classifications.

A patent network uses the frequency of keywords’ occurrence in patent documents as the input base to illustrate the overall relationship among all patents being studied. Our proposed method, which uses a weighting system and similarity threshold as parameters and sets network centralization as an evaluation criterion, provides an effective systematic formulation that addresses the weakness of previous studies.

The effectiveness of this method is supported by gray relational-based cluster analysis. With the aid of visual networks and indexes, this paper leads to a better understanding of the overall patent structure and identifies the influential patents of cloud computing. This research is aimed at assisting enterprises intending to enter the cloud computing industry to adopt more effective decision-making strategies in order to reduce technology investment risks and create advantages to increase their market share.

ACKNOWLEDGMENTS

The authors would like to acknowledge the support from the Ministry of Science and Technology in Taiwan under Grant MOST-102-2410-H-167-011.

REFERENCES

1. Basmadjian, R., H. De Meer, R. Lent, and G. Giuliani. 2012. “Cloud Computing and Its Interest in Saving Energy: The Use Case of a Private Cloud.” Journal of Cloud Computing: Advances, Systems and Applications 1 (1): 1-25. doi: 10.1186/2192-113X-1-5.

2. Venters, W., and E. A. Whitley. 2012. “A Critical Review of Cloud Computing: Researching Desires and Realities.” Journal of Information Technology 27: 179-197.

3. Abbas, A., L. Zhang, and S. U. Khan. 2014. “A Literature Review on the State-of-the-Art in Patent Analysis.” World