Unsupervised Learning Methods

Chapter 2 Literature Review

2.3 Machine Learning

2.3.2 Unsupervised Learning Methods

In contrast to supervised learning, unsupervised learning methods are not given object class labels in advance. Clustering (or cluster analysis), one common form of unsupervised learning, assigns a set of observations to subsets (called clusters) so that observations in the same cluster are similar in some sense.

Clustering analysis has a wide range of applications, including information retrieval, image processing, business transaction analysis, and pattern recognition. Two major types of clustering analysis are introduced as follows.

• Hierarchical clustering: Hierarchical methods construct a hierarchical decomposition of the given set of data objects using either an agglomerative (also called “bottom-up”) or a divisive (also called “top-down”) approach.

Agglomerative strategies start at the bottom and at each level recursively merge a selected pair of clusters into a single cluster. This produces a grouping at the next higher level with one less cluster. The pair chosen for merging consists of the two groups with the smallest intergroup dissimilarity. Divisive methods start at the top and at each level recursively split one of the existing clusters at that level into two new clusters. The split is chosen to produce two new groups with the largest between-group dissimilarity (Hastie, 2011).
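The agglomerative strategy can be illustrated with a minimal sketch. It assumes one-dimensional points, absolute difference as the dissimilarity, and single linkage (the smallest distance between any two members) as the intergroup dissimilarity. None of these choices comes from the cited work, and the function names are hypothetical.

```python
from itertools import combinations


def single_linkage(a, b):
    """Intergroup dissimilarity: smallest distance between any two members."""
    return min(abs(x - y) for x in a for y in b)


def agglomerate(points, target_clusters):
    """Merge the two closest clusters until target_clusters remain."""
    clusters = [[p] for p in points]  # bottom level: one cluster per point
    while len(clusters) > target_clusters:
        # Select the pair of clusters with the smallest intergroup dissimilarity.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)  # the next level has one fewer cluster
    return clusters


two_groups = agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], target_clusters=2)
three_groups = agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], target_clusters=3)
```

A divisive method would run in the opposite direction, starting from one cluster and splitting it level by level.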



• Partitional clustering: Partitioning methods typically create an initial partition and then refine it with iterative relocation, which moves objects from one group to another to improve the partitioning. K-means clustering, one of the most common partitional clustering methods, aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (Yang et al., 1999). K-means clustering is therefore employed in our experiment as the unsupervised learning approach to author disambiguation.
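The iterative relocation behind K-means can be sketched as follows. This is a minimal illustration of the standard Lloyd iteration, not the implementation used in the experiment. Seeding the clusters with the first k points and running a fixed number of iterations are simplifying assumptions, and the function names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Return (centroids, assignments) after a fixed number of relocation rounds."""
    centroids = list(points[:k])  # assumption: the first k points seed the clusters
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each observation joins the cluster with the nearest mean.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


centroids, assignments = kmeans([(5, 3), (10, 3), (6, 4), (11, 2)], k=2)
```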

The overview above shows that supervised and unsupervised methods have different characteristics, and the previous studies in Table 1 employed both types of machine learning.

Therefore, both supervised and unsupervised approaches are used in our experiment. The methods are described in detail in the next chapter.

Chapter 3 Research Design

In order to investigate different factors, e.g., feature combinations, learning methods, and scalability of datasets, many resources are used and arranged in this study. The research framework is shown in Figure 1. The procedure consists of data collection, data processing, model learning, and performance evaluation. The following subsections explain these stages. In addition, feature encoding, feature combinations, and feature weightings are discussed in detail.

Figure 1: Research Procedure

3.1 Data Collection

The datasets employed in this study are the same DBLP datasets constructed by Han et al. (2005), which contain 8,441 bibliographic records collected from the DBLP database. The datasets consist of 14 popular author names shared by 476 individual authors. To increase the ambiguity, first names were reduced to initials in Han’s design. The DBLP datasets used in this study were provided by Dr. Giles, but the feature information that we analyze consists of five features (i.e., co-authors, article titles, journal titles, year, and number of pages) rather than the three features used by Han et al. (2005).

Therefore, we had to supplement the missing features, i.e., year and number of pages. While supplementing the data, we found problems in the DBLP datasets similar to the failure cases pointed out by Pereira et al. (2009), such as wrong or duplicate author names in a bibliographic record and missing article titles or journal titles. We therefore revised or deleted some bibliographic records in the DBLP datasets. The statistics of the test data used in this study are shown in Table 2.

Table 2: The 14 Ambiguous Author Name Datasets

Name             Number of Different Authors    Number of Bibliographic Records
                 Original    Revised            Original    Revised
A. Gupta (AG)    26          26                 577         572

3.2 Feature Combinations

This study examines the performance of every combination of the features in bibliographic data (i.e., co-authors, article titles, journal titles, year, and number of pages) for disambiguation, because previous literature pointed out that including all features at the same time does not necessarily achieve the best performance. Accordingly, 28 feature combinations are explored to examine the effect of each. The framework combines three commonly used features, Co-author (C), Article title (T), and Journal title (J), with two previously unused features, Year (Y) and Number of pages (P). The possible combinations are shown in Table 3.

Table 3: 28 Feature Combinations

                 7 Combinations    21 Combinations with Features Y and P
One-feature      C; T; J           CY; CP; CYP; TY; TP; TYP; JY; JP; JYP
Two-feature      CT; TJ; CJ        CTY; CTP; CTYP; TJY; TJP; TJYP; CJY; CJP; CJYP
Three-feature    CTJ               CTJY; CTJP; CTJYP
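The count of 28 follows from pairing each of the 7 non-empty subsets of {C, T, J} with one of four options for the new features (none, Y, P, or YP). A short sketch of the enumeration, with hypothetical variable names:

```python
from itertools import combinations

BASE_FEATURES = "CTJ"                   # Co-author, Article title, Journal title
EXTRA_SUFFIXES = ["", "Y", "P", "YP"]   # none, Year, Number of pages, both

# The 7 non-empty subsets of the three commonly used features.
base_subsets = [
    "".join(subset)
    for size in range(1, len(BASE_FEATURES) + 1)
    for subset in combinations(BASE_FEATURES, size)
]

# Each base subset with each option for Y and P: 7 x 4 = 28 combinations.
feature_combinations = [base + extra for base in base_subsets for extra in EXTRA_SUFFIXES]
with_new_features = [name for name in feature_combinations if "Y" in name or "P" in name]

print(len(base_subsets), len(with_new_features), len(feature_combinations))
```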

3.3 Data Processing

A few pre-processing tasks are performed in our study. Porter’s stemmer is applied to article titles (feature T) and journal titles (feature J), and stop words are removed using the stop-word corpus in the Natural Language Toolkit (NLTK). The remaining words in those two features are then treated as meaningful keywords.

In addition, word occurrence is considered for feature weighting, so the TF-IDF scheme is adopted in data processing. Term Frequency (TF) is the frequency of occurrence of a keyword term in the bibliographic record, and Inverse Document Frequency (IDF) is the inverse of the frequency of occurrence of the keyword term in the dataset.
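The weighting can be sketched as follows. The exact TF-IDF variant is not stated here, so the sketch assumes the common form tf × log(N / df), where N is the number of records and df is the number of records containing the term. The function name and the toy records are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(records):
    """records: list of token lists; returns one {term: weight} dict per record."""
    n_records = len(records)
    # Document frequency: number of records in which each term occurs.
    document_frequency = Counter(term for tokens in records for term in set(tokens))
    vectors = []
    for tokens in records:
        term_frequency = Counter(tokens)  # occurrences of each term in this record
        vectors.append({
            term: count * math.log(n_records / document_frequency[term])
            for term, count in term_frequency.items()
        })
    return vectors


records = [["cluster", "author", "name"], ["author", "name", "name"], ["svm", "author"]]
weights = tfidf_vectors(records)
```

Under this variant a term that occurs in every record, such as "author" in the toy data, receives weight zero, while a term confined to one record receives the largest weight.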

3.4 Machine Learning

After data processing, each bibliographic record is transformed into a vector ready for classification or clustering. Both supervised and unsupervised learning methods are employed to examine the performance of author name disambiguation. The two supervised learning methods are Naïve Bayes (from NLTK) and Support Vector Machine (LIBSVM) (Chang & Lin, 2010).

The input format of Naïve Bayes in NLTK is “index = value”. The input format of SVM in LIBSVM is “index: value”, and attributes with a null value are deleted from the records. Both tools automatically generate an accuracy value for evaluation. The ratio of the training set to the testing set is 7:3, and cross-validation is used in the training process.
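The two input formats and the 7:3 split can be sketched without calling either tool. The sketch assumes each record is a list of feature values with None marking a null attribute, uses 1-based indices in ascending order as in LIBSVM's sparse text format, and represents the “index = value” style as a dictionary, which is the feature-set shape NLTK classifiers accept. The helper names are hypothetical, and no shuffling is done before the split.

```python
def to_libsvm_line(label, vector):
    """Format one record as '<label> <index>:<value> ...', skipping null attributes."""
    pairs = [
        f"{index}:{value}"
        for index, value in enumerate(vector, start=1)
        if value is not None
    ]
    return " ".join([str(label)] + pairs)


def to_feature_dict(vector):
    """Format one record as an {index: value} mapping, skipping null attributes."""
    return {
        index: value
        for index, value in enumerate(vector, start=1)
        if value is not None
    }


def split_7_3(records):
    """First 70% of the records for training, the remaining 30% for testing."""
    cut = len(records) * 7 // 10
    return records[:cut], records[cut:]


line = to_libsvm_line(3, [0.5, None, 2])
features = to_feature_dict([0.5, None, 2])
train, test = split_7_3(list(range(10)))
```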

For the unsupervised learning method, K-means clustering is conducted with a cluster module in Python. The input format of the K-means cluster module is a sequence of vector tuples, such as “(5, 3), (10, 3)”. The number of clusters is based on heuristics from our pretest, which used two author name datasets, A. Gupta and C. Chen, and gradually increased the number of clusters from 5 to 150. As a result, when the number of authors in a dataset is fewer than 60, we run K-means clustering with 5 to 60 clusters; when it is 60 or more, we run it with 60 to 125 clusters. After clustering, the label of each cluster is decided by the number of tuples in the cluster: the largest cluster is regarded as one class, the second largest as another class, and so on.
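The cluster-count heuristic and the size-based ordering of clusters can be sketched as below. The step between successive cluster counts is an assumption, since the text only states that the number is increased gradually, and both function names are hypothetical.

```python
from collections import Counter


def candidate_cluster_counts(num_authors, step=1):
    """Values of k to try for a dataset, following the pretest heuristic.

    The step size is an assumption; only the end points come from the text.
    """
    if num_authors < 60:
        return range(5, 60 + 1, step)
    return range(60, 125 + 1, step)


def clusters_by_size(assignments):
    """Cluster ids ordered from the largest cluster to the smallest."""
    return [cluster for cluster, _ in Counter(assignments).most_common()]


small = candidate_cluster_counts(26)   # e.g. a dataset with 26 authors
large = candidate_cluster_counts(60)
order = clusters_by_size([2, 0, 2, 1, 2, 0])
```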

3.5 Performance Evaluation

Like Han et al. (2005) and Yang et al. (2007, 2008), we evaluate performance in terms of disambiguation accuracy, calculated by dividing the number of correctly clustered bibliographic records by the total number of bibliographic records in the dataset:

$$\text{Accuracy} = \frac{\sum_{i \in I} |r_i|}{N}$$

where ‘I’ is the set of individuals in the dataset, ‘r’ is the correct cluster of individual ‘i’ (so that |r_i| counts its correctly clustered records), and ‘N’ is the total number of bibliographic records in the dataset.
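The accuracy measure can be sketched as follows, assuming each cluster has already been mapped to an author label so that every record has a true author and a predicted author. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def disambiguation_accuracy(true_authors, predicted_authors):
    """Correctly clustered records, summed over individuals, divided by N."""
    correct_per_individual = Counter(
        truth
        for truth, predicted in zip(true_authors, predicted_authors)
        if truth == predicted
    )
    return sum(correct_per_individual.values()) / len(true_authors)


accuracy = disambiguation_accuracy(
    ["a", "a", "b", "b", "c"],
    ["a", "b", "b", "b", "a"],
)
```

In the toy data, individual "a" has one correctly clustered record, "b" has two, and "c" has none, so three of the five records count as correct.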

3.6 Settings for Year and Number of Pages

To use features Year (Y) and Number of pages (P) in the study, the year and number of pages in the bibliographic data have to be transformed into meaningful codes.

Table 4: The Length of Regular Papers in Top 15 CS Journals (up to Jan 2011)

Rank  Abbreviated Journal Title  Length of Paper (pages)       5-Year Impact Factor
1     ACM COMPUT SURV            35                            7.667
2     HUM-COMPUT INTERACT        8                             6.190
3     COMPUT INTELL              12 (more than 5,000 words)    5.378
4     IEEE T EVOLUT COMPUT       Not specified                 4.589
5     VLDB J                     25                            4.517
6     MIS QUART                  20                            4.485
7     IEEE T PATTERN ANAL        14                            4.378
8     J AM MED INFORM ASSN       10 (more than 4,000 words)    3.974
9     J CHEM INF MODEL           Not specified                 3.882
10    J COMPUT AID MOL DES       Not specified                 3.835
11    IEEE T SOFTWARE ENG        14                            3.750
12    ACM T GRAPHIC              Not specified                 3.619
13    IEEE T MED IMAGING         8                             3.540
14    INT J COMPUT VISION        Not specified                 3.508
15    J WEB SEMANT               20 (from 15 to 25)            3.412

Average = 16.6 => 17

For feature Year (Y), it is assumed that each author has his or her own period of academic production, so the year distribution of the whole dataset is segmented into intervals. The publication dates of the literature in the DBLP datasets fall mainly between 1975 and 2005. Based on this observation, a time span of 10 years is used in this study.
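The year encoding can be sketched as a mapping from a publication year to the index of its 10-year interval. The interval boundaries are not given in the text, so starting the first interval at 1975 is an assumption, as is the function name.

```python
def encode_year(year, start=1975, span=10):
    """Map a publication year to the index of its 10-year interval.

    start=1975 is an assumed origin; years before it yield negative indices,
    and a missing year is passed through as None.
    """
    if year is None:
        return None
    return (year - start) // span


codes = [encode_year(y) for y in (1975, 1984, 1985, 2005)]
```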

As for number of pages (P), which is influenced by publication type and authors’ preference, the number of pages of each bibliographic record is calculated first, and intervals are set based on the page-count conventions of different types of publications. For example, the average length of papers in the top 15 computer science journals in Journal Citation Reports (Thomson Reuters, 2011) is 16.6 pages (see Table 4). Three segmentation points are used in the study: three pages for poster papers, eight pages for conference papers, and more than 17 pages for journal papers.

Four intervals are then constructed: fewer than 3 pages, 3 to 8 pages, 9 to 17 pages, and more than 17 pages. In addition to the four intervals, two special cases are considered: no page number and one page. In total, six cases for number of pages are considered.
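The six cases can be written as a small encoding function. Computing the page count from first and last page numbers is an assumption about how the counts were obtained, and the function names and code labels are hypothetical.

```python
def page_count(first_page, last_page):
    """Number of pages from first and last page numbers (assumed source of counts)."""
    return last_page - first_page + 1


def encode_pages(num_pages):
    """Map a page count to one of the six cases described in the text."""
    if num_pages is None:
        return "no page number"
    if num_pages == 1:
        return "one page"
    if num_pages < 3:
        return "fewer than 3 pages"
    if num_pages <= 8:
        return "3 to 8 pages"
    if num_pages <= 17:
        return "9 to 17 pages"
    return "more than 17 pages"


codes = [encode_pages(n) for n in (None, 1, 2, 3, 8, 9, 17, 18)]
conference_like = encode_pages(page_count(101, 108))
```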

Chapter 4 Experimental Results

In this study, 14 author names of DBLP datasets are examined (see Table 2 above). Each feature combination is investigated, and the effects of features Y and P are discussed. In addition, the complexity of datasets is also explored. In the end, the features (or feature combinations) achieving best performance in each dataset are highlighted.

4.1 Common Feature Combinations

To begin with, the performance of author disambiguation without features Y and P is described. Because the following comparisons of feature combinations cover three methods, the rank statistics are based on 42 comparisons (14 datasets × three methods).

In the one-feature experiment (C, T, and J), feature J took the lead in 64.2% of the comparisons (see Figure 2). Feature C took the lead in 37.5%, and feature T never did. This indicates that features J and C both perform well in author disambiguation, and that feature J in particular is satisfactory. In the two-feature experiment (CT, TJ, and CJ), feature CJ took the lead in 78.5% of the comparisons (see Figure 3), feature TJ in 19.0%, and feature CT in only 7.1%. Given the one-feature ranking (J > C > T), the two-feature ranking (CJ > TJ > CT) is not surprising.

However, the ranking of each feature combination is to a large extent influenced by the method used. Consider the one-feature ranks in Table 5. Feature J consistently achieves the first rank in K-means clustering (KM) and Naïve Bayes (NB), but not in Support Vector Machine (SVM), where feature C generally performs better than feature J. Likewise, in the two-feature ranks, feature CT is always the worst in KM and NB but not in SVM.

In the three-feature experiment (CTJ), the question is whether CTJ achieves the best accuracy in each dataset, since CTJ is commonly regarded as the “default” feature combination in many previous works. However, feature CTJ leads the other feature combinations only 7 times in the 42 comparisons of best accuracy, and 6 of those 7 leads occur with SVM. As a result, using features C, T, and J together for disambiguation does not necessarily ensure the best performance.

As noted above, the performance of feature combination CTJ in SVM differs from that in KM and NB. In fact, the SVM results match the findings of Han et al. (2004): feature C outperformed features J and T, and the “hybrid scheme” (their name for feature CTJ) was considered outstanding. However, they used only supervised methods, and their datasets were not the same as those used in this study (see Table 1).

Table 5: Statistics of Rank Comparisons in Different Methods

K-means (KM)
Naïve Bayes (NB)
Support Vector Machine (SVM), continued

             Rank of Single-Feature    Rank of Two-Feature    Best Accuracy
             C    T    J               CT   TJ   CJ           CTJ
M. Miller    3    2    1               2    3    1            no
S. Lee       1    2    3               2    3    1            no
Y. Chen      1    2    3               2    3    1            no

Note: 1 = the lead, 2 = the runner-up, 3 = the third; yes / no = whether CTJ achieved the best accuracy in the dataset

Figure 2: Rank Comparisons of Single Feature

Figure 3: Rank Comparisons of Two Features

4.2 Features Year (Y) and Number of Pages (P)

To present the influence of features Y and P, the average performance of each feature combination is shown in Figure 4. The average improvement rates obtained by adding features Y, P, or YP are shown in Figure 5. These results indicate that performance with features Y and P is generally better than performance without them.

However, the performance mentioned above is the average accuracy over the three methods. The performance of each method with features Y and P is therefore discussed separately as follows. The different impacts of including feature Y and feature P in the three methods are shown in Figure 6 and Table 6. The improvement in accuracy, which is the difference between the performance with and without feature Y or feature P, is examined in this section.

First, with the inclusion of feature Y, the average improvement in accuracy is 6.08% (sd = 6.76%) in KM, 0.73% (sd = 1.00%) in NB, and 0.49% (sd = 1.12%) in SVM. Second, with the inclusion of feature P, the average improvement is 3.59% (sd = 4.09%) in KM, 0.59% (sd = 0.82%) in NB, and -0.39% (sd = 0.95%) in SVM. Finally, when features Y and P are included together, the average improvement is 5.21% (sd = 5.28%) in KM, 1.38% (sd = 1.67%) in NB, and 0.33% (sd = 0.98%) in SVM (see Table 6).

Table 6: Improvement Accuracy Rate with the Inclusion of Feature Y and P

        KM                        NB                        SVM
        Y       P       YP        Y       P       YP        Y       P       YP
AG      2.89    3.16    4.99      0.47    0.63    0.60      0.97    -1.43   0.30
AK      -1.24   9.53    8.81      0.07    -0.13   0.17      -1.57   -0.77   0.69
CC      0.43    0.41    0.13      0.10    -0.11   1.19      0.89    0.17    0.24
DJ      5.69    5.69    1.19      0.11    0.01    0.41      1.21    0.59    2.27
JL      3.20    3.16    2.07      -0.27   -0.63   -0.09     0.06    -1.03   -1.29
JM      0.86    -3.73   -0.13     2.70    1.91    6.10      2.87    2.21    2.20
JR      2.97    1.53    4.77      0.86    0.66    2.29      0.19    -1.36   0.43
JS      6.44    5.51    1.09      1.50    0.79    1.91      -0.40   -1.03   -0.31
KT      10.14   9.64    6.33      1.41    0.69    0.56      0.93    -0.77   -0.23
MB      13.64   0.23    14.19     2.67    2.54    3.46      1.24    -0.01   0.29
MJ      3.94    -1.56   1.84      0.56    0.89    1.34      -0.57   -0.54   -0.61
MM      24.79   8.59    17.50     -0.53   0.24    0.36      -0.06   0.00    -0.03
SL      2.23    2.37    3.19      0.23    0.20    0.24      -0.46   -0.99   -0.26
YC      9.16    5.70    6.91      0.37    0.50    0.80      1.61    -0.43   0.86
Avg.    6.08    3.59    5.21      0.73    0.59    1.38      0.49    -0.39   0.33
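As a worked check of how the averages and standard deviations relate to Table 6, the sketch below recomputes them for the KM column of feature Y. It assumes the reported sd is the sample standard deviation (dividing by n - 1), which is the convention that matches the reported 6.08% (sd = 6.76%). The variable names are hypothetical.

```python
from statistics import mean, stdev

# KM improvement rates with feature Y for the 14 datasets, in Table 6 order.
km_year_improvement = [2.89, -1.24, 0.43, 5.69, 3.20, 0.86, 2.97,
                       6.44, 10.14, 13.64, 3.94, 24.79, 2.23, 9.16]

average = mean(km_year_improvement)
spread = stdev(km_year_improvement)  # sample standard deviation (n - 1)

print(f"{average:.2f} {spread:.2f}")
```

The large spread relative to the mean reflects how strongly the MM dataset (24.79) dominates the KM column.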

These findings show that features Y and YP had a positive effect on performance in our datasets. Feature P also had a positive effect, but a less pronounced one. The effect is notably larger in K-means clustering (+4.98% on average) than in the Naïve Bayes model (+0.90% on average) and Support Vector Machine (+0.15% on average); see Figure 6. Features Y and P thus improve performance markedly in K-means clustering, but not in Naïve Bayes or SVM. With K-means clustering, the maximum improvement is 24.79% for feature Y (MM dataset), 9.53% for feature P (AK dataset), and 17.5% for feature YP (also the MM dataset). In contrast, the maximum improvement with feature Y or P under Naïve Bayes and Support Vector Machine is only about 2.5%. It therefore seems worthwhile for future studies to explore whether features Y and P can effectively improve accuracy in other unsupervised approaches.

4.3 Complexity of Datasets

According to their scale, the datasets are divided into two groups: Group A and Group B. Group A contains the more complicated datasets (more than 20 individuals and more than 400 bibliographic records): A. Gupta, C. Chen, J. Lee, J. Smith, S. Lee, and Y. Chen. Group B contains the less complicated datasets (fewer than 20 individuals and fewer than 400 bibliographic records): A. Kumar, D. Johnson, J. Martin, J. Robinson, K. Tanaka, M. Brown, M. Jones, and M. Miller.

As shown in Figure 4, the performance of Group A is not as good as that of Group B: the average performance is 39.14% in Group A but 49.62% in Group B. Moreover, features Y and P help Group A less than Group B: the average improvement rate is 1.28% in Group A but 2.56% in Group B (see Figure 5). These results suggest that the complexity of a dataset does influence performance. In other words, ambiguity increases more easily in larger datasets, as it does in the real world.

Figure 4: The Comparison with and without Features Y and P (Average in Three Methods)

Figure 5: Average Improvement Rate using Features Y and P (Average in Three Methods)

Group A

Group B

Figure 6: Improvement Accuracy Rate using Features Y and P in Different Methods (Average of Y, P and YP)

4.4 Top One Feature Combinations

Feature combinations achieving the best accuracy are explored in this part. Table 7 shows the “top 1 feature combination” for different methods and different author name datasets. Figure 7 displays top 1 distribution for different feature combinations.

As shown in Table 7 and Figure 7 below, the importance of features JYP and CTJ is obvious. Note that J, JY, and CJY take third, fourth, and fifth place, respectively.

Of the 18 top 1 feature combinations in Table 7, 14 include feature Y or feature P. This means that features Y and P play a role in author name disambiguation even though they had not been considered before. In addition, feature J appears in 77.7% of the top 1 feature combinations, followed by feature C with 64.4% (see Figure 8). As mentioned in Section 4.1, feature C and feature combination CTJ perform best when the SVM method is used.

Table 7: Top 1 Feature Combinations

KM NB SVM

Figure 7: Top 1 Distribution of Feature Combinations

Figure 8: Percentage of Features in Top 1 Feature Combinations

Chapter 5 Conclusions and Suggestions

Finally, the conclusions drawn from the findings of the thesis are organized in this chapter, and some research prospects are suggested for future studies.

5.1 Conclusions

According to the experimental results, the following conclusions are drawn:

• Feature combination CTJ does not necessarily ensure the best performance: In previous works, this common feature combination was usually regarded as the normal scheme, and studies often focused on algorithm design or the impact of new resources. Few studies have systematically tested a series of different feature combinations on author disambiguation. This thesis shows that the performance of feature combination JYP is not inferior to that of CTJ, and that feature combinations CJY, JY, and J also perform well in general. The best feature combinations for author disambiguation are therefore built mainly on features C and J. Additionally, the inclusion of features Y and P can substantially enhance performance as well.

• The inclusion of features Year (Y) and Number of pages (P) has a positive influence on disambiguation: The average improvement rate is 2.44% with feature Y, 1.29% with feature P, and 2.30% with YP. As mentioned in Section 4.2, the impact of including features Y and P is notable in K-means clustering (an accuracy improvement of about 5%) but not obvious in Naïve Bayes or Support Vector Machine. It seems worthwhile for future studies to explore whether features Y and P can effectively improve accuracy in other unsupervised approaches. In addition, the settings for year and number of pages ought to depend on the characteristics of the dataset. For example, for datasets of citation records in the humanities or social sciences, the page threshold for journal papers should be higher than the 17 pages used in our experiment.

• Different feature combinations have different effects on author name disambiguation depending on the clustering or learning method used: The performance of feature combinations J and JYP in K-means clustering and the Naïve Bayes model is as good as that of feature combinations C and CTJ in SVM. Moreover, as the previous findings suggested, average improvement
