Settings for Year and Number of Pages

Chapter 3 Research Design

3.6 Settings for Year and Number of Pages

To include the features Year (Y) and Number of Pages (P) in the study, the year and the number of pages in the bibliographic data have to be transformed into meaningful codes.

Table 4: The Length of Regular Papers in the Top 15 CS Journals (up to Jan 2011)

Rank | Abbreviated Journal Title | Length of Paper (pages) | 5-Year Impact Factor
1 | ACM COMPUT SURV | 35 | 7.667
2 | HUM-COMPUT INTERACT | 8 | 6.190
3 | COMPUT INTELL | 12 (more than 5,000 words) | 5.378
4 | IEEE T EVOLUT COMPUT | Not specified | 4.589
5 | VLDB J | 25 | 4.517
6 | MIS QUART | 20 | 4.485
7 | IEEE T PATTERN ANAL | 14 | 4.378
8 | J AM MED INFORM ASSN | 10 (more than 4,000 words) | 3.974
9 | J CHEM INF MODEL | Not specified | 3.882
10 | J COMPUT AID MOL DES | Not specified | 3.835
11 | IEEE T SOFTWARE ENG | 14 | 3.750
12 | ACM T GRAPHIC | Not specified | 3.619
13 | IEEE T MED IMAGING | 8 | 3.540
14 | INT J COMPUT VISION | Not specified | 3.508
15 | J WEB SEMANT | 20 (from 15 to 25) | 3.412

Average length (over the ten journals that state one) = 16.6 => 17
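The average in Table 4 can be checked with a short calculation. The sketch below assumes that the five journals without a stated length are simply left out of the average and that 16.6 is rounded to the nearest whole page; the variable names are illustrative only.

```python
# Page lengths stated in Table 4; journals with no stated length are left out (assumption).
stated_lengths = [35, 8, 12, 25, 20, 14, 10, 14, 8, 20]

average_length = sum(stated_lengths) / len(stated_lengths)

# Round to the nearest whole page to obtain the journal-paper segmentation point.
journal_threshold = round(average_length)

print(average_length, journal_threshold)
```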

For feature Year (Y), it is assumed that each author has his or her own period of academic production, so the year distribution of the whole dataset is segmented into intervals. In the dataset, the publication dates of the literature in DBLP fall mainly between 1975 and 2005. Based on this observation, a time span of 10 years per interval is used in this study.
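The year coding can be sketched in Python as follows. The text specifies only the 10-year span and the 1975 to 2005 concentration of publication dates; anchoring the first interval at 1975, the function name `year_code`, and the label format are assumptions made for illustration.

```python
def year_code(year, start=1975, span=10):
    """Map a publication year to a 10-year interval label such as 'Y1975-1984'.

    `start` (the anchor of the intervals) is an assumed value, not taken from the thesis.
    Floor division keeps the intervals aligned for years before `start` as well.
    """
    if year is None:
        # Records without a year get their own code (assumption).
        return "Y_unknown"
    lower = start + ((year - start) // span) * span
    return f"Y{lower}-{lower + span - 1}"


for sample_year in (1975, 1984, 1985, 2005):
    print(sample_year, year_code(sample_year))
```

With this anchoring, two records receive the same Y code exactly when their years fall in the same decade counted from 1975; a different anchor would shift the interval boundaries.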

As for the number of pages (P), which is influenced by publication type and authors’ preferences, the number of pages of each bibliographic record is calculated first, and intervals are then set according to the page-length conventions of different types of publications. For example, the average length of papers in the top 15 computer science journals in the Journal Citation Reports (Thomson Reuters, 2011) is 16.6 pages (see Table 4). Three segmentation points are used in the study: three pages for poster papers, eight pages for conference papers, and more than 17 pages for journal papers.

From these points, four intervals are constructed: fewer than 3 pages, 3 to 8 pages, 9 to 17 pages, and more than 17 pages. In addition to the four intervals, two special cases are considered: no page number and one page. In total, therefore, six cases for the number of pages are considered.
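The six cases for the number of pages can be sketched in Python as follows. The interval boundaries come from the text above; the helper names `count_pages` and `page_code`, the code labels, and the assumption that the page field looks like "123-130" are illustrative and not taken from the thesis.

```python
import re


def count_pages(pages_field):
    """Return the page count for a field such as '123-130', or None if unknown.

    The 'first-last' field format is an assumption about the bibliographic data.
    """
    if not pages_field:
        return None
    text = pages_field.strip()
    match = re.fullmatch(r"(\d+)\s*-+\s*(\d+)", text)
    if match:
        first, last = int(match.group(1)), int(match.group(2))
        return last - first + 1 if last >= first else None
    if text.isdigit():
        # A single page number is counted as a one-page record.
        return 1
    return None


def page_code(num_pages):
    """Map a page count to one of the six cases described in the text."""
    if num_pages is None or num_pages < 1:
        return "P_none"    # no page number
    if num_pages == 1:
        return "P_one"     # one page
    if num_pages < 3:
        return "P_lt3"     # fewer than 3 pages
    if num_pages <= 8:
        return "P_3to8"    # 3 to 8 pages
    if num_pages <= 17:
        return "P_9to17"   # 9 to 17 pages
    return "P_gt17"        # more than 17 pages


print(page_code(count_pages("123-130")))
```

Keeping the parsing and the coding in separate functions means the thresholds can be changed for datasets from other disciplines without touching the parsing step.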

Chapter 4 Experimental Results

In this study, the datasets of 14 author names from DBLP are examined (see Table 2 above). Each feature combination is investigated, and the effects of features Y and P are discussed. In addition, the complexity of the datasets is explored. Finally, the features (or feature combinations) achieving the best performance in each dataset are highlighted.

4.1 Common Feature Combinations

To begin with, the performance of author disambiguation without features Y and P is described. Because each comparison of feature combinations is carried out with three methods, the rank statistics are based on 42 comparisons (14 datasets combined with three methods).
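The experimental grid behind these counts can be enumerated in a few lines of Python. This is only a sketch of the bookkeeping: the dataset abbreviations are those used in Table 6, and the variable names are hypothetical.

```python
from itertools import combinations, product

BASE_FEATURES = "CTJ"
METHODS = ["KM", "NB", "SVM"]
DATASETS = ["AG", "AK", "CC", "DJ", "JL", "JM", "JR",
            "JS", "KT", "MB", "MJ", "MM", "SL", "YC"]

# One-, two- and three-feature combinations of C, T and J.
base_combinations = [
    "".join(combo)
    for size in (1, 2, 3)
    for combo in combinations(BASE_FEATURES, size)
]

# Each (dataset, method) pair yields one rank comparison.
comparisons = list(product(DATASETS, METHODS))

print(base_combinations)
print(len(comparisons))
```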

In the one-feature experiment (C, T and J), feature J took the lead in 64.2% of the one-feature comparisons (see Figure 2). Feature C took the lead in 37.5%, whereas feature T never took the lead. This indicates that features J and C perform well in author disambiguation, with feature J the stronger of the two. In the two-feature experiment (CT, TJ and CJ), feature combination CJ took the lead in 78.5% of the two-feature comparisons (see Figure 3), TJ in 19.0%, and CT in only 7.1%. Given the one-feature ranking (J > C > T), the two-feature ranking (CJ > TJ > CT) is not surprising.

However, the ranking of the feature combinations is influenced to a large extent by the method used. Consider the one-feature ranks in Table 5. Feature J consistently achieves the first rank in K-means clustering (KM) and Naïve Bayes (NB), but not in the Support Vector Machine (SVM), where feature C generally performs better than feature J. Similarly, in the two-feature ranks, feature combination CT is always the worst in KM and NB, but this is not the case in SVM.

In the three-feature experiment (CTJ), the question is whether CTJ achieves the best accuracy in each dataset, because CTJ is commonly regarded as the “default” feature combination in many previous works. However, CTJ leads the other feature combinations in only 7 of the 42 best-accuracy comparisons, and 6 of those 7 leads were obtained with SVM. Consequently, using features C, T, and J together does not necessarily ensure the best performance.

As noted above, the performance of feature combination CTJ in SVM differs from that in KM and NB. In fact, the SVM results match the findings of Han et al. (2004): feature C outperformed feature J or T, and the “Hybrid scheme” (their term for feature combination CTJ) was considered outstanding. However, the methods they used were supervised only, and their datasets were not the same as those used in this study (see Table 1).

Table 5: Statistics of Rank Comparisons in Different Methods

Panels: K-means (KM), Naïve Bayes (NB), Support Vector Machine (SVM). Each panel has the columns Rank of Single-Feature (C, T, J), Rank of Two-Feature (CT, TJ, CJ) and Best Accuracy (CTJ).

Support Vector Machine (SVM), continued

Dataset | C | T | J | CT | TJ | CJ | CTJ
M. Miller | 3 | 2 | 1 | 2 | 3 | 1 | no
S. Lee | 1 | 2 | 3 | 2 | 3 | 1 | no
Y. Chen | 1 | 2 | 3 | 2 | 3 | 1 | no

Note: 1 = the lead, 2 = the runner-up, 3 = the third; yes / no = whether CTJ achieved the best accuracy in the dataset

Figure 2: Rank Comparisons of Single Feature

Figure 3: Rank Comparisons of Two Features

4.2 Features Year (Y) and Number of Pages (P)

To present the influence of features Y and P, the average performance of each feature combination is shown in Figure 4. The average improvement rates obtained by adding features Y, P or YP are shown in Figure 5. These results indicate that performance with features Y and P is generally better than performance without them.

However, the performance mentioned above is the average accuracy over the three methods. Therefore, the performance with features Y and P is discussed separately for each method as follows. The different impacts of including feature Y and feature P in the three methods are shown in Figure 6 and Table 6. The improvement in accuracy, defined as the difference between the performance with and without feature Y or feature P, is examined in this section.

First, with the inclusion of feature Y, the average improvements in accuracy are 6.08% (sd = 6.76%) in KM, 0.73% (sd = 1.00%) in NB, and 0.49% (sd = 1.12%) in SVM. Second, after adding feature P for author name disambiguation, the average improvements are 3.59% (sd = 4.09%) in KM, 0.59% (sd = 0.82%) in NB, and -0.39% (sd = 0.95%) in SVM. Finally, when features Y and P are included at the same time, the average improvements are 5.21% (sd = 5.28%) in KM, 1.38% (sd = 1.67%) in NB, and 0.33% (sd = 0.98%) in SVM (see Table 6).

Table 6: Improvement Accuracy Rate (%) with the Inclusion of Feature Y and P

Dataset | KM Y | KM P | KM YP | NB Y | NB P | NB YP | SVM Y | SVM P | SVM YP
AG | 2.89 | 3.16 | 4.99 | 0.47 | 0.63 | 0.60 | 0.97 | -1.43 | 0.30
AK | -1.24 | 9.53 | 8.81 | 0.07 | -0.13 | 0.17 | -1.57 | -0.77 | 0.69
CC | 0.43 | 0.41 | 0.13 | 0.10 | -0.11 | 1.19 | 0.89 | 0.17 | 0.24
DJ | 5.69 | 5.69 | 1.19 | 0.11 | 0.01 | 0.41 | 1.21 | 0.59 | 2.27
JL | 3.20 | 3.16 | 2.07 | -0.27 | -0.63 | -0.09 | 0.06 | -1.03 | -1.29
JM | 0.86 | -3.73 | -0.13 | 2.70 | 1.91 | 6.10 | 2.87 | 2.21 | 2.20
JR | 2.97 | 1.53 | 4.77 | 0.86 | 0.66 | 2.29 | 0.19 | -1.36 | 0.43
JS | 6.44 | 5.51 | 1.09 | 1.50 | 0.79 | 1.91 | -0.40 | -1.03 | -0.31
KT | 10.14 | 9.64 | 6.33 | 1.41 | 0.69 | 0.56 | 0.93 | -0.77 | -0.23
MB | 13.64 | 0.23 | 14.19 | 2.67 | 2.54 | 3.46 | 1.24 | -0.01 | 0.29
MJ | 3.94 | -1.56 | 1.84 | 0.56 | 0.89 | 1.34 | -0.57 | -0.54 | -0.61
MM | 24.79 | 8.59 | 17.50 | -0.53 | 0.24 | 0.36 | -0.06 | 0.00 | -0.03
SL | 2.23 | 2.37 | 3.19 | 0.23 | 0.20 | 0.24 | -0.46 | -0.99 | -0.26
YC | 9.16 | 5.70 | 6.91 | 0.37 | 0.50 | 0.80 | 1.61 | -0.43 | 0.86
Avg. | 6.08 | 3.59 | 5.21 | 0.73 | 0.59 | 1.38 | 0.49 | -0.39 | 0.33
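The summary statistics can be recomputed from Table 6. The sketch below does this for one column only, the improvement from feature Y in K-means. The thesis does not say which form of the standard deviation was used; the sample form (n - 1 denominator) is an assumption that happens to agree with the reported value for this column.

```python
from statistics import mean, stdev

# KM / Y column of Table 6, in dataset order AG ... YC (percentage points).
km_year_gain = [2.89, -1.24, 0.43, 5.69, 3.20, 0.86, 2.97,
                6.44, 10.14, 13.64, 3.94, 24.79, 2.23, 9.16]

avg = mean(km_year_gain)
sd = stdev(km_year_gain)  # sample standard deviation (n - 1 denominator)

print(f"mean = {avg:.2f}, sd = {sd:.2f}")
```

The large standard deviation relative to the mean reflects how unevenly the gain is spread over the datasets, from -1.24 in AK to 24.79 in MM.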

These findings show that features Y and YP improved performance on our datasets. The inclusion of feature P also had a positive effect overall, but the influence is less obvious. Notably, the effect is more positive in K-means clustering (+4.98% on average) than in the Naïve Bayes model (+0.90% on average) and the Support Vector Machine (+0.15% on average); see Figure 6. Features Y and P thus enhance performance considerably in K-means clustering, but not obviously in Naïve Bayes and SVM. In the K-means experiments, the improvement with feature Y reaches a maximum of 24.79% in the MM dataset, feature P reaches 9.53% in the AK dataset, and feature YP reaches 17.5%, also in the MM dataset. In contrast, the maximum improvement with feature Y or P in the Naïve Bayes and Support Vector Machine experiments is only about 2.5%. It therefore seems worthwhile to explore in future studies whether features Y and P can effectively enhance accuracy in various unsupervised approaches.

4.3 Complexity of Datasets

According to the scale of the datasets, the datasets are divided into two groups: Group A and Group B. Group A contains the complicated datasets (more than 20 individuals and more than 400 bibliographic records), such as A. Gupta, C. Chen, J. Lee, J. Smith, S. Lee and Y. Chen. Group B includes the less complicated datasets (fewer than 20 individuals and fewer than 400 bibliographic records), such as A. Kumar, D. Johnson, J. Martin, J. Robinson, K. Tanaka, M. Brown, M. Jones and M. Miller.

As shown in Figure 4, the performance of Group A is not as good as that of Group B: the average performance is 39.14% in Group A but 49.62% in Group B. Moreover, the benefit of features Y and P is clearly smaller in Group A than in Group B: the average improvement rate is 1.28% in Group A but 2.56% in Group B (see Figure 5). These results suggest that the complexity of a dataset does influence performance. In other words, ambiguity increases more easily in larger datasets, as it does in the real world.

Figure 4: The Comparison using with/out Features Y and P (Average in Three Methods)

Figure 5: Average Improvement Rate using Features Y and P (Average in Three Methods)

Panels: Group A, Group B

Figure 6: Improvement Accuracy Rate using Features Y and P in Different Methods (Average of Y, P and YP)

4.4 Top One Feature Combinations

Feature combinations achieving the best accuracy are explored in this part. Table 7 shows the “top 1 feature combination” for different methods and different author name datasets. Figure 7 displays top 1 distribution for different feature combinations.

As shown in Table 7 and Figure 7, feature combinations JYP and CTJ stand out. J, JY and CJY take the third, fourth and fifth places, respectively.

Of the 18 top 1 feature combinations in Table 7, 14 include feature Y or feature P. This means that features Y and P have a role in author name disambiguation even though they had not been considered before. In addition, feature J appears in 77.7% of the top 1 feature combinations, followed by feature C with 64.4% (see Figure 8). As mentioned in Section 4.1, feature C and feature combination CTJ achieve their best performance mainly when the SVM method is used.

Table 7: Top 1 Feature Combinations

KM NB SVM

Figure 7: Top 1 Distribution of Feature Combinations

Figure 8: Percentage of Features in Top 1 Feature Combinations

Chapter 5 Conclusions and Suggestions

Finally, the research conclusions drawn from the findings of the thesis are presented in this chapter, and some research prospects are suggested for future studies.

5.1 Conclusions

According to the experimental results, the following conclusions can be drawn:

• Feature combination CTJ does not necessarily ensure the best performance: In previous works, this common feature combination was usually treated as the standard scheme, and studies mostly focused on the design of algorithms or the impact of new resources. Few studies have systematically tested a series of different feature combinations on author disambiguation. This thesis shows that the performance of feature combination JYP is not inferior to that of CTJ, and that feature combinations CJY, JY and J also perform well in general. Therefore, the best feature combinations for author disambiguation are built mainly on features C and J. Additionally, the inclusion of features Y and P can substantially enhance the performance as well.

• The inclusion of year (Y) and number of pages (P) has a positive influence on disambiguation: The average improvement rate is 2.44% for feature Y, 1.29% for feature P, and 2.30% for YP. As mentioned in Section 4.2, the impact of including features Y and P is considerable in K-means clustering (about 5% improvement in accuracy), but not obvious in Naïve Bayes and the Support Vector Machine. It therefore seems worthwhile to explore in future studies whether features Y and P can effectively enhance accuracy in various “unsupervised” approaches. In addition, the settings for year and number of pages ought to depend on the characteristics of the dataset. For example, for datasets consisting of citation records in the humanities or social sciences, the journal-paper threshold for the number of pages should be higher than the 17 pages used in our experiment.

• Feature combinations have different effects on author name disambiguation depending on the clustering or learning method used: The performance of feature combinations J and JYP in K-means clustering and the Naïve Bayes model is as good as that of feature combinations C and CTJ in SVM. Moreover, as the earlier findings suggest, the average improvement rate from using features Y and P in K-means (4.98%) is markedly higher than in Naïve Bayes (0.90%), while the improvement in SVM is negligible (0.15%). In other words, the bibliographic features used for author disambiguation in future work should be selected according to the classification or clustering approach.

• The scale of a dataset probably affects disambiguation because of the different complexity of datasets: The accuracy of disambiguation on larger datasets is usually lower than on smaller ones, and the benefit of adding features Y and P is less obvious. Although this relationship could be anticipated, it clearly points out the limits of the performance achievable with bibliographic data only. Consequently, how to effectively bring in outside resources (e.g., web information) to supplement the accuracy gained from bibliographic features is a critical issue for future studies of name or author disambiguation.

5.2 Suggestions for Future Studies

The objective of this study is to investigate the effects of the complete set of combinations of features contained in bibliographic data, without resorting to outside information. The conclusions shed light on the use of publication date and number of pages.

Although several feature combinations and different tools for classification and clustering were implemented in this study, some suggestions remain for further studies in author disambiguation.

• Exploring the performance of feature combinations on different datasets (rather than DBLP datasets only): The 14 datasets in this study were compiled from the DBLP database by Han (2005). However, the only subject area of the citations in the DBLP database is “Computer Science”. It is therefore worth exploring whether the performance of the features is influenced by authors from different disciplines.

• More sophisticated approaches to classification or clustering: Three existing tools (a K-means clustering model in Python, Naïve Bayes in NLTK, and SVM in LibSVM) were used in this study, but they are not “tailor-made” for disambiguation when compared with Latent Dirichlet allocation (LDA) by Song et al. (2007) or the 3-way and high-order simultaneous comparisons by McCallum et al. (2007). More sophisticated algorithms can therefore be implemented in future studies.

• Enhancing performance with various outside resources: It is challenging to resolve author ambiguity completely using bibliographic information “only”, because bibliographic information still generates a certain degree of “noise” in disambiguation. As a result, the performance generally cannot reach an acceptable standard (more than 90%). Thus, a promising direction is to build an intelligent mechanism that accurately maps outside information onto bibliographic information in order to obtain sufficient information for disambiguation.

References

Bhattacharya, I., & Getoor, L. (2007). Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data, 1, 1-36.

Can, F., & Patton, J. M. (2004). Change of writing style with time. Computers and the Humanities, 38, 61-82.

Chang, C. C., & Lin, C. J. (2010). LIBSVM: A Library for Support Vector Machines (Version 3.0). Retrieved Oct. 4, 2010, from http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Churches, T., Christen, P., Lim, K., & Zhu, J. (2002). Preparation of name and address data for record linkage using hidden Markov models. BMC Medical Informatics and Decision Making, 2, 9.

CiteSeer (n.d.). About CiteSeer^X. Retrieved Jan. 31, 2011 from http://citeseer.ist.psu.edu/about/site

Culotta, A., Kanani, P., Hall, R., Wick, M., & McCallum, A. (2007). Author disambiguation using error-driven machine learning with a ranking loss function. In: Proceedings of the AAAI 6th International Workshop on Information Integration on the Web, 32-37.

Digital Author Identifier (DAI). (2009). DAI-Standard wiki. Retrieved Oct. 4, 2010, from http://www.surffoundation.nl/wiki/display/standards/DAI

DiLauro, T., Choudhury, G. S., Patton, M., Warner, J. W., & Brown, E. W. (2001). Automated name authority control and enhanced searching in the Levy Collection. D-Lib Magazine, 7(4).

Elmagarmid, A. K., Ipeirotis, P. G. & Verykios, V. S. (2007). Duplicate record detection: A survey. TKDE, 19(1), p1–16.

Ferris, M. & Munson, T. (2002). Interior-point methods for massive support vector machines. SIAM Journal on Optimization 13 (3): 783–804.

French, J. C., Powell, A., & Schulman, E. (2000). Using clustering strategies for creating authority files. Journal of the American Society for Information Science, 51, 774-786.

Gale, W. A., Church, K. W., & Yarowsky, D. (1992). A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26, 415-439.

Han, H., Giles, L., & Zha, H. (2005a). Name disambiguation in author citations using a K-way spectral clustering method. In Proceedings of ACM/IEEE Joint Conference on Digital Libraries. Retrieved Oct. 4, 2010, Retrieved Nov. 27, 2009, from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.89.9354&rep=rep1&type=pdf

Han, H., Giles, L., Zha, H., Li, C., Tsioutsiouliklis, K. (2004). Two supervised learning approaches for name disambiguation in author citations. In Proceedings of the 4th ACM/IEEE-CS Joint Conference. Retrieved Oct. 4, 2010, Retrieved Nov. 27, 2009, from http://clgiles.ist.psu.edu/papers/JCDL-2004-author-disambiguation.pdf

Han, H., Giles, L., Zha, H., Xu, W. (2005b). A hierarchical Naïve Bayes mixture model for name disambiguation in author citations. In Proceedings of the 2005 ACM symposium. Retrieved Oct. 4, 2010, Retrieved Nov. 27, 2009, from http://clgiles.ist.psu.edu/papers/SAC-2005-Naïve-Bayes-Mixture.pdf

Hastie, T., Tibshirani, R., Friedman, J. (2011). Hierarchical clustering. The Elements of Statistical Learning (2nd ed.). New York: Springer, 520–528.

Hernandez, M. A., Stolfo, S. J. (1998). Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1), p9–37.

Hill, S., & Provost, F. (2003). The myth of the double-blind review? Author identification using only citations. ACM SIGKDD Explorations, 5, 179-184.

Huang, J., Ertekin., S., & Giles, C. L. (2006). Efficient name disambiguation for large scale databases. In J. Fürnkranz, T. Scheffer, & M. Spiliopoulou (Eds.), Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, 536-544.

International Standard Name Identifier (ISNI). (2009). ISNI Draft ISO 27729. Retrieved Oct. 4, 2010, Retrieved Nov. 27, 2009, from http://www.isni.org/

Jang, J. S. (2011). Data Clustering and Pattern Recognition. Retrieved Jan. 4, 2011, Retrieved Dec. 25, 2010, from http://mirlab.org/jang

Jaro, M. A. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine, 14, 491-498.

Kanani, P., McCallum, A., & Pal, C. (2007). Improving author coreference by resource bounded information gathering from the web. In M. M. Veloso (Ed.), Proceedings of the 20th International Joint Conference on Artificial Intelligence, 429-434.

Koppel, M., Argamon, S., & Shimoni, A. (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17, 401-412.

Malin, B., Airoldi, E., & Carley, K. M. (2005). A network analysis model for disambiguation of names in lists. Computational and Mathematical Organization Theory, 11, 119-139.

Mitchell, T. M. (1997). Machine Learning. New York: McGraw Hill.

Murthy, S. K. (1998). Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery.

Naveman. (2011). Naveman Glossary. Retrieved Jan. 4, 2011, Retrieved Dec. 25, 2010, from http://www.navmanmarine.net/

OCLC. (2009). WorldCat Identity Service. Retrieved Oct. 4, 2010, Retrieved Dec. 25, 2009, from http://orlabs.oclc.org/identities

People Australia. (2010). People Australia Overview. Retrieved Oct. 4, 2010, Retrieved Dec. 25, 2009, from http://www.nla.gov.au/initiatives/peopleaustralia/index.html

Pereira, D. A., Ribeiro-Neto, B. A., Ziviani, N., Laender, A. H. F., Gonçalves, M. A., & Ferreira, A. A. (2009). Using web information for author name disambiguation. In Proc. of JCDL, pp 49–58.

ProQuest. (2009). Scholar Universe. Retrieved Oct. 4, 2010, Retrieved Dec. 25, 2009, from http://www.scholaruniverse.com

Research Name Resolver. (2010). NII Research Name Resolver. Retrieved Oct. 4, 2010, Retrieved Dec. 25, 2009, from http://rns.nii.ac.jp/;jsessionid=372CE9C69AF0745A1597C34DD3ACC420

Safavian, S. R., Landgrebe, D. (1991). A survey of decision tree classifier methodology. IEEE Trans. Systems Man Cybernet. 21, 660-674.

Smalheiser, N. R., Torvik, V. I. (2009). Author Name Disambiguation. Chapter in Annual Review of Information Science and Technology, v.43.

Song, Y., Huang, J., Councill, I. G., Li, J. & Giles, C. L. (2007). Efficient topic-based unsupervised name disambiguation. In E. M. Rasmussen, R. R. Larson, E. Toms, S. Sugimoto (Eds.), Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, 342-351.

Tan, Y. F., Kan, M. Y., & Lee, D. (2006). Search engine driven author disambiguation. In G. Marchionini, M. L. Nelson, & C. C. Marshall (Eds.), Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, 314-315.

Thomson Reuter. (2009). Distinct Author Identification System. Retrieved Oct. 4,
