
Chapter 6 Determination of Impact Research Topics via the Bayesian Estimation of Author-Publication Correlations

In this chapter we apply the research model and validate whether the authors and publications identified by our model really do have impact. We suggest that high impact research topics will come from the authors and publications that possess greater impact power than others, so we must confirm that the authors and publications proposed by our approach are indeed high impact. Section 6.1 describes the experiment validating an author's impact power. Section 6.2 illustrates the experiment validating a publication's impact power. Section 6.3 shows how to find high impact topics using the proposed model.

6.1 Experiment to Validate an Author’s Impact Power

In this section we design an experiment, based on the research model, to validate an author's impact power. We survey previously published related work whose author rankings have already been validated by other methods, and we also survey experts on the selected topic to validate the model. The validation procedure for author impact power is illustrated in Fig. 6-1.

Fig. 6-1 Validation Procedure for Author Impact Power.

6.1.1 Comparing the author’s impact power with previous work

Rosen-Zvi, Chemudugunta, Griffiths, Smyth and Steyvers (2010) developed a model that learns 300 topics. For each topic they combined the top 10 words used by authors writing on that topic with the top 10 most frequently publishing authors.

They used Gibbs sampling to discover the relationship between the texts and the authors. The experimental results can help the reader to predict an author's domain based on the words used in a paper. Although their goal was not to find the authors with the most impact in a topic area, but rather to list the top ten authors who use the words related to the same topic, the results imply that these are the authors with higher impact, because they publish more papers and use more words related to a topic than others.

We collect a dataset and then compare previous work with our work in order to examine the research model. It is difficult to compare two different models entirely, especially when we do not use exactly the same dataset; consequently, we try to collect the most similar dataset possible. Their dataset comes from the well-known CiteSeer digital library, which is also used by other researchers in this area (such as Bolelli, Ertekin, Zhou & Giles, 2009; Liu, Niculescu-Mizil & Gryc, 2009; and Rosen-Zvi, Chemudugunta, Griffiths, Smyth & Steyvers, 2010) and which mainly collects works on computer science. The previous work used papers published between 1990 and 2002. We select one of the 300 topics mentioned in their study to map onto our work, specifically the topic "data mining"; the top two words for this topic are "data" and "mining". We use the abstract as the descriptor of a paper when comparing with their work: if the term "data mining" is found in the abstract, we consider the paper to be one involving data mining. All authors and publications involved in a paper on the topic of data mining are viewed as belonging to the same society. In this study we consider data mining papers from 1990-2002.
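The selection rule described above is simple enough to sketch in code. The following Python fragment is only an illustration of the filtering step, assuming the CiteSeer records have already been exported as a list of dictionaries with hypothetical fields such as "abstract", "year", "authors" and "publication"; it is not the collection procedure actually used in this study.

def select_topic_papers(records, term="data mining", start_year=1990, end_year=2002):
    """Keep papers whose abstract mentions the topic term within the study period."""
    selected = []
    for paper in records:
        year = paper.get("year")
        abstract = (paper.get("abstract") or "").lower()
        if year is not None and start_year <= year <= end_year and term in abstract:
            selected.append(paper)
    return selected

# Example usage with two toy records (illustrative only).
records = [
    {"abstract": "We study data mining of web logs.", "year": 1999,
     "authors": ["A. Author", "B. Author"], "publication": "KDD"},
    {"abstract": "A theorem on graph colouring.", "year": 1995,
     "authors": ["C. Author"], "publication": "STOC"},
]
print(len(select_topic_papers(records)))  # -> 1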

Different research models and goals lead to a focus on different viewpoints during computation. In our model, the number of co-authors influences the impact power. The work we compare with emphasizes the words an author uses and treats co-authorship as an opportunity to discover combinations of crossing domains; in our model, by contrast, the number of co-authors matters because it can lead to counting the power of a paper more than once. Although we treat a paper as the research unit, it may have more than one author, and if we compute the impact power of all of a paper's authors independently, a paper with several co-authors could receive a greater impact power.

As we know, this does not make sense in the real world, so we need assessment criteria. During assessment, we ignore the number of co-authors per paper, viewing all of them as making almost the same contribution, even though this may not exactly fit the real situation. To solve this problem, in this study we use a weighting concept, as discussed in the next paragraph, to model the contribution to a paper and to ensure that the impact power of a paper is not affected by the number of co-authors.


The contribution weight of the co-authors can be based on the sequence in which the authors are listed. We assume that the first author of a paper has made the greatest contribution, and that the contribution decreases with the position in the author list. We use an arithmetic sequence to model the weight ratio. For example, if there are 3 authors on a paper, the contributions to this paper are divided in the ratio 3:2:1, meaning that the 1st author gets 3, the 2nd author gets 2, and the 3rd author gets 1. In order to normalize the total weight so that it is equal to 1, the contributions are weighted 3/6, 2/6 and 1/6. Although the 1st author's contribution may not actually be 3 times that of the 3rd author, so the ratio does not exactly reflect the relative strength among these authors, this strategy does capture the fact that the first author contributes more, and it keeps the total weight equal to 1 regardless of the number of co-authors. Algorithm 6-1 models this weighting.

Algorithm 6-1: Identifying the contributory weight of each author for a single paper
Input:  n, the number of co-authors
        i, the position of an author in the paper's author list
        sum, the summation of 1 to n
Output: weight_i, the contributory weight of the ith author
1  For i = 1 to n
2      weight_i = (n + 1 - i) / sum
3  Next
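Algorithm 6-1 is small enough to state directly in code. The following Python sketch implements the arithmetic-sequence weighting described above; the function name and the way the result is returned are illustrative choices, not part of the original model.

def contribution_weights(n):
    """Return the contributory weight of each of the n co-authors of one paper.

    The ith listed author (1-based) receives (n + 1 - i) / (1 + 2 + ... + n),
    so the weights follow a decreasing arithmetic sequence and sum to 1.
    """
    total = n * (n + 1) // 2            # summation of 1 to n
    return [(n + 1 - i) / total for i in range(1, n + 1)]

print(contribution_weights(3))          # -> [0.5, 0.333..., 0.1666...], i.e. 3/6, 2/6, 1/6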

The contributory weight of each co-author is not only a measure for computing each author's impact power for a paper; it also helps calculate the impact power the author receives from a cited paper. When we want to determine a paper's impact, we combine all of the authors' impact powers according to their contributory weights.

Conversely, if we want to know an author's impact power, we can aggregate all the papers written by that author, weighting each paper's publication volume and citation frequency by the author's contribution weight.
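To make this bookkeeping concrete, the sketch below shows one plausible way the contributory weights could enter the computation: a paper's credit is distributed to its authors by weight, and an author accumulates weighted credit over all of their papers. The exact combination used in this thesis relies on Bayesian estimation, so the simple additive credit of 1 + citation count per paper used here is only an illustrative assumption.

from collections import defaultdict

def contribution_weights(n):
    # Arithmetic-sequence weights from Algorithm 6-1: (n + 1 - i) / (1 + ... + n).
    total = n * (n + 1) // 2
    return [(n + 1 - i) / total for i in range(1, n + 1)]

def author_credits(papers):
    """Distribute each paper's credit to its authors by contributory weight.

    `papers` is assumed to be a list of dicts with hypothetical fields
    "authors" (ordered list of names) and "citations" (count). The per-paper
    credit of 1 + citations is an illustrative stand-in for the Bayesian
    estimation actually used in this study.
    """
    credit = defaultdict(float)
    for paper in papers:
        authors = paper["authors"]
        weights = contribution_weights(len(authors))
        paper_credit = 1 + paper.get("citations", 0)
        for author, w in zip(authors, weights):
            credit[author] += w * paper_credit
    return dict(credit)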

We use the dataset of papers published during 1990-2002 on the topic "data mining", collected from the CiteSeer digital library, and compare the experimental results of our model, both without and with the contribution weight, with those obtained in the previous work. There were 2294 authors who published papers on the topic of data mining during 1990-2002. Table 6-1 shows the top 10 impact authors identified by each model.


Table 6-1 Top 10 Impact Authors in the Topic of Data Mining.

No. | Rosen-Zvi, Chemudugunta, Griffiths, Smyth & Steyvers (2010) ● | Our research model (without contribution weight) ◆ | Our research model (with contribution weight) ▓
1 | Jiawei Han | Jiawei Han | Jiawei Han
2 | Mohammed Javeed Zaki | Rakesh Agrawal | Rakesh Agrawal
3 | Bing Liu | Hannu T. T. Toivonen | Ron Kohavi
4 | David W. Cheung | Heikki Mannila | Mohammed Javeed Zaki
5 | Kyuseok Shim | Salvatore J. Stolfo | Foster J. Provost
6 | Heikki Mannila | Osmar R. Zaïane | George Karypis
7 | Rajeev Rastogi | Mohammed Javeed Zaki | Charu C. Aggarwal
8 | Venkatesh Ganti | Philip S. Yu | Chris Clifton
9 | Hannu T. T. Toivonen | George Karypis | Osmar R. Zaïane
10 | Huan Liu | Ron Kohavi | Ming-syan Chen

We use different symbols to represent the different model lists; an author belonging to more than one model list receives more than one symbol. The results with symbols are shown in Table 6-2. The "●" represents the model of Rosen-Zvi, Chemudugunta, Griffiths, Smyth and Steyvers (2010); the "◆" represents our research model without the contribution weight; and the "▓" represents our research model with the contribution weight. The cover rate of each model is ●=16/30=53.33%, ◆=20/30=66.67%, and ▓=18/30=60%. Each model's list has a cover rate of more than 50%, meaning that these lists are valuable and worth considering when identifying impact authors in the topic of data mining during 1990-2002.
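The cover rates quoted above can be reproduced directly from the three top-10 lists: for each model, count how many of the three lists each of its ten authors appears in and divide the total by 30 (three lists of ten). The short Python sketch below assumes this reading of the cover rate; the list contents are taken from Table 6-1.

lists = {
    "previous work":  ["Jiawei Han", "Mohammed Javeed Zaki", "Bing Liu", "David W. Cheung",
                       "Kyuseok Shim", "Heikki Mannila", "Rajeev Rastogi", "Venkatesh Ganti",
                       "Hannu T. T. Toivonen", "Huan Liu"],
    "without weight": ["Jiawei Han", "Rakesh Agrawal", "Hannu T. T. Toivonen", "Heikki Mannila",
                       "Salvatore J. Stolfo", "Osmar R. Zaïane", "Mohammed Javeed Zaki",
                       "Philip S. Yu", "George Karypis", "Ron Kohavi"],
    "with weight":    ["Jiawei Han", "Rakesh Agrawal", "Ron Kohavi", "Mohammed Javeed Zaki",
                       "Foster J. Provost", "George Karypis", "Charu C. Aggarwal",
                       "Chris Clifton", "Osmar R. Zaïane", "Ming-syan Chen"],
}

def cover_rate(model):
    # For every author on `model`'s list, count the lists that contain that author.
    hits = sum(sum(author in other for other in lists.values()) for author in lists[model])
    return hits / 30

for model in lists:
    print(model, round(cover_rate(model) * 100, 2), "%")   # 53.33, 66.67, 60.0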

The results obtained using our research model without the contribution weight are the most similar to the other lists. One reason might be that the first model considers the words used and the topic-author relationship: it lists the top 10 people who use more words in that topic than others, so a greater publication volume on the same topic largely determines the ordering of the first model. In our model, however, we consider the impact power not only of the words used and the topic but also the citation frequency, which is an endorsement by other authors. The model with the contribution weight is a more extensive approach and reveals the best cover rate among the three models.

Table 6-2 Top 10 Impact Authors within the Topic of Data Mining (during 1990-2002).

(Table body: the 20 distinct authors from the three lists, each marked with ●, ◆ and/or ▓ according to the model lists in which they appear; the body of the table was not recovered in extraction.)

6.1.2 Comparing the impact power of authors with the expert survey

Besides comparing the experimental results with those obtained by the previous model noted in the literature review, we also survey experts who have investigated the topic of data mining. We list all the authors from the three models in alphabetical order without duplication, giving 20 different authors. We survey five experts, asking each to answer yes, no, or no opinion to indicate whether each author has impact power in the topic "data mining". These experts included two assistant professors, one associate professor and two full professors; three are from departments of Management of Information Systems and two from Computer Science and Engineering. For calculation we assigned the three options different grades, with yes counted as one, no as negative one, and no opinion as zero. Although the expected value of a grade is 0, we show grades greater than 2.25 (the average grade) in bold face; only 9 authors exceeded this threshold. The grades of each author are shown in Table 6-3.
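The grading scheme is simple to express in code. The sketch below assumes each expert's answer is recorded as "yes", "no", or "no opinion" per author and that an author's grade is the sum over the five experts; the threshold of 2.25 is the figure quoted above, and the responses shown are placeholders rather than the actual survey data.

SCORE = {"yes": 1, "no": -1, "no opinion": 0}

def author_grade(answers):
    """Sum the five experts' answers for one author (yes=1, no=-1, no opinion=0)."""
    return sum(SCORE[a] for a in answers)

# Placeholder responses for two authors (five experts each), for illustration only.
survey = {
    "Author A": ["yes", "yes", "yes", "no opinion", "yes"],
    "Author B": ["no", "no opinion", "yes", "no", "no opinion"],
}

threshold = 2.25
for author, answers in survey.items():
    grade = author_grade(answers)
    print(author, grade, "bold" if grade > threshold else "")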

We compare the survey result with the three model lists mentioned above; the outcome is given in Table 6-4. The precision and recall of the first model are 40% and 44.44% respectively, which is clearly lower than those obtained with our models, whether with or without the contribution weight. This indicates that our model results conform more closely to the experts' expectations than the results obtained in previous work when determining which authors have the largest impact in the topic area of "data mining".

Table 6-3 Grades of Authors Surveyed by Academic Experts.

No. | Author | Grade

* The bold values indicate authors receiving higher than average grades.

Table 6-4 Precision and Recall of Author Impact Obtained with Each Model.

No. | Rosen-Zvi, Chemudugunta, Griffiths, Smyth & Steyvers (2010) | Our research model (without contribution weight) | Our research model (with contribution weight) | Expert survey
1 | Jiawei Han | Jiawei Han | Jiawei Han | Jiawei Han
2 | Mohammed Javeed Zaki | Rakesh Agrawal | Rakesh Agrawal | Heikki Mannila
3 | Bing Liu | Hannu T. T. Toivonen | Ron Kohavi | Ming-syan Chen
4 | David W. Cheung | Heikki Mannila | Mohammed Javeed Zaki | Rakesh Agrawal
5 | Kyuseok Shim | Salvatore J. Stolfo | Foster J. Provost | Bing Liu
6 | Heikki Mannila | Osmar R. Zaïane | George Karypis | Charu C. Aggarwal
7 | Rajeev Rastogi | Mohammed Javeed Zaki | Charu C. Aggarwal | George Karypis
8 | Venkatesh Ganti | Philip S. Yu | Chris Clifton | Mohammed Javeed Zaki
9 | Hannu T. T. Toivonen | George Karypis | Osmar R. Zaïane | Philip S. Yu
10 | Huan Liu | Ron Kohavi | Ming-syan Chen |
Precision | 40% | 60% | 60% |
Recall | 44.44% | 66.67% | 66.67% |
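The precision and recall figures in Table 6-4 follow from straightforward set arithmetic: for each model, precision is the fraction of its ten recommended authors that the experts confirmed, and recall is the fraction of the expert-confirmed authors that the model recovered. The Python sketch below reproduces the calculation for the first model under that reading; the two name lists are taken from Tables 6-1 and 6-4.

previous_work = {"Jiawei Han", "Mohammed Javeed Zaki", "Bing Liu", "David W. Cheung",
                 "Kyuseok Shim", "Heikki Mannila", "Rajeev Rastogi", "Venkatesh Ganti",
                 "Hannu T. T. Toivonen", "Huan Liu"}

expert_confirmed = {"Jiawei Han", "Heikki Mannila", "Ming-syan Chen", "Rakesh Agrawal",
                    "Bing Liu", "Charu C. Aggarwal", "George Karypis",
                    "Mohammed Javeed Zaki", "Philip S. Yu"}

hits = previous_work & expert_confirmed
precision = len(hits) / len(previous_work)    # 4 / 10 = 40%
recall = len(hits) / len(expert_confirmed)    # 4 / 9  = 44.44%
print(f"precision {precision:.2%}, recall {recall:.2%}")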

6.2 Experiment to Validate the Impact Power of Publications

In this section, we conduct an experiment to validate the impact power of publications using our research model. The most frequently used approach is to compare against the impact factor assigned by an organization such as the Institute for Scientific Information (ISI). The impact factor (IF) can be obtained from the ISI's Journal Citation Reports (JCR) database. The ISI uses the publication volume and citation frequency in SCI/SSCI journals over two years to compute the IF. However, this list does not include all journals: some new journals that discuss new topics are not on the SCI/SSCI list and so receive no IF in the JCR database, and conference proceedings are not included either. We use the IF to validate the list proposed by our model. We also compare the results obtained with 3 different models: that of Rosen-Zvi, Chemudugunta, Griffiths, Smyth and Steyvers (2010); our research model without the contribution weight; and our research model with the contribution weight. Precision and recall rates are the indices used to compare the experimental results. In addition, we survey experts in "data mining" to validate the model results. The procedure for validating the impact power of publications is illustrated in Fig. 6-2.
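For reference, the two-year impact factor published in the JCR is computed as follows; this is the standard definition rather than part of our model:

\[
\mathrm{IF}_Y = \frac{C_Y(Y-1) + C_Y(Y-2)}{N_{Y-1} + N_{Y-2}}
\]

where C_Y(y) is the number of citations received in year Y by the items a journal published in year y, and N_y is the number of citable items it published in year y.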


Fig. 6-2 Procedure for Validating the Impact Power of a Publication. (The lists of publications derived from the previous related work's top-10 authors and from our research model, without and with the contribution weight, are mapped both to the ISI journal lists with impact factors and to a survey of experts investigating the same topic.)

6.2.1 Comparing the impact power of publications using the impact factor

We use Bayesian estimation to model the impact power of publications, as proposed in our research approach, and then compare these results with the ISI impact factor looked up in the JCR database. The two measures are basically similar: both use the volume of publications and the citation frequency in the calculation. The difference is that the impact factor uses only information from the past two years, whereas the Bayesian estimation in our approach is sustained, in the sense that earlier data continue to influence the result. Previous reputation is taken into account, and the posterior distribution becomes the prior distribution for the next step when new information arrives.
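The sequential use of the posterior as the next prior can be illustrated with a deliberately simplified example. The sketch below assumes a Beta-Binomial model in which a publication's "impact rate" is the probability that one of its papers is cited in a given year; the choice of distribution and the yearly counts are purely illustrative assumptions, not the author-publication correlation model estimated in this thesis.

def beta_binomial_updates(yearly_counts, alpha=1.0, beta=1.0):
    """Run sequential Bayesian updates: each year's posterior is the next prior.

    `yearly_counts` is a list of (cited, not_cited) pairs; with a Beta(alpha, beta)
    prior on the citation rate, the posterior after observing one year is again a
    Beta distribution, so it can serve directly as the prior for the next year.
    """
    history = []
    for cited, not_cited in yearly_counts:
        alpha += cited                 # successes update the first shape parameter
        beta += not_cited              # failures update the second
        history.append((alpha, beta, alpha / (alpha + beta)))   # posterior mean so far
    return history

# Illustrative data: (papers cited, papers not cited) for three consecutive years.
for a, b, mean in beta_binomial_updates([(12, 3), (20, 10), (5, 25)]):
    print(f"Beta({a:.0f}, {b:.0f})  posterior mean = {mean:.3f}")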

We collect paper data from CiteSeer on the topic "data mining", as determined from the abstract. We obtain 1389 papers, both conference and journal papers, for which we can identify both the authors and the publications: 281 journal papers and 1108 conference papers. After computation we find that these papers appeared in 583 distinct publications, 131 journals and 452 conferences. The statistics are shown in Table 6-5.

Table 6-5 Statistics for Papers and Publications.

Unit Papers Publications

Volume 1389 583

Type Journals Conferences Journals Conferences

Volume 281 1108 131 452

We have to decide on a recommendation threshold for this large number of papers and publications. We find that about 35 publications have published more than 5 papers on data mining, so we adopt the top 35 publications as the threshold in our model and recommend these to researchers. The impact powers of the top 35 publications are listed in Table 6-6. The top impact publication in data mining is the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), which has an impact power of 67.6986%, almost 4-5 times that of the second best. There are 11 journals and 24 conferences among the top 35 publications. The details are given in the following tables: Table 6-7 lists the top 10 journals; Table 6-8 lists the top 10 conferences; and Table 6-9 shows the top 10 publications regardless of whether they are journals or conferences.

Table 6-6 Top 35 Impact Publications in Topic of Data Mining (during 1990-2002).

Rank | C/J | Title of the publication | Impact Power
1 | C | ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) | 67.6986%
2 | C | SIGMOD | 14.7991%
3 | C | Very Large DataBase (VLDB) | 4.6433%
4 | C | Industrial Conference on Data Mining (ICDM) | 2.2011%
5 | C | Principles of Data Mining and Knowledge Discovery (PKDD) | 2.0353%
6 | C | International Conference on Data Engineering (ICDE) | 1.7043%
7 | J | Data Mining and Knowledge Discovery | 1.3425%
8 | C | Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) | 0.8536%
9 | J | IEEE Transactions on Knowledge and Data Engineering (TKDE) | 0.8119%
10 | C | Lecture Notes in Computer Sciences (LNCS) | 0.7115%
11 | C | International Conference on Information and Knowledge Management (CIKM) | 0.4259%
12 | J | Knowledge and Information Systems (KIS) | 0.1765%
13 | C | SIAM International Conference Proceedings on Data Mining (SDM) | 0.1593%
14 | J | SIGKDD Explorations | 0.1297%
15 | C | ACM Symposium on Principles of Database Systems | 0.1148%
16 | C | International Conference on Data Warehousing and Knowledge Discovery (DaWaK) | 0.1110%
17 | C | Lecture Notes in Artificial Intelligence (LNAI) | 0.0983%
18 | J | IEEE Bulletin of the Technical Committee on Data Engineering | 0.0701%
19 | C | SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD) | 0.0649%
20 | C | ACM Conference on Computers and Security | 0.0626%
21 | J | IEEE Computers | 0.0623%
22 | C | International Conference on Database Theory (ICDT) | 0.0618%
23 | J | SIGMOD Record | 0.0555%
24 | J | Communications of the ACM (CACM) | 0.0548%
25 | C | European Conference on Machine Learning (ECML) | 0.0545%
26 | J | Bioinformatics | 0.0444%
27 | J | Machine Learning | 0.0381%
28 | J | IEEE Transactions on Visualization and Computer Graphics | 0.0372%
29 | C | Genetic and Evolutionary Computation Conference (GECCO) | 0.0311%
30 | C | ACM International Conference on Digital Libraries | 0.0303%
31 | C | Advances in Large Margin Classifiers | 0.0288%
32 | C | Advances in Distributed and Parallel Knowledge Discovery | 0.0271%
33 | C | IEEE International Conference on Tools with Artificial Intelligence (ICTAI) | 0.0268%
34 | C | Advances in Digital Libraries Conference (ADL) | 0.0268%
35 | C | Advances in Neural Information Processing Systems (NIPS) | 0.0264%

Table 6-7 Top 10 Impact Journals in Topic of Data Mining (during 1990-2002).

No. | Title of the publication | IF 2002 | IF 2008
1 | Data Mining and Knowledge Discovery | 1.192 | 2.421
2 | IEEE Transactions on Knowledge and Data Engineering | 1.055 | 2.236
3 | Knowledge and Information Systems | N/A | 1.733
4 | SIGKDD Explorations | N/A | N/A
5 | IEEE Bulletin of the Technical Committee on Data Engineering | N/A | N/A
6 | IEEE Computers | 1.484 | 2.611
7 | SIGMOD Record | 0.228 | 1.620
8 | Communications of the ACM (CACM) | 1.497 | 2.646
9 | Bioinformatics | 4.615 | 4.328
10 | Machine Learning | 1.944 | 2.326

Table 6-8 Top 10 Impact Conferences in Topic of Data Mining (during 1990-2002)

No. | Title of the publication
1 | ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)
2 | SIGMOD
3 | Very Large DataBase (VLDB)
4 | Industrial Conference on Data Mining (ICDM)
5 | Principles of Data Mining and Knowledge Discovery (PKDD)
6 | International Conference on Data Engineering (ICDE)
7 | Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)
8 | Lecture Notes in Computer Sciences (LNCS)
9 | International Conference on Information and Knowledge Management (CIKM)
10 | SIAM International Conference Proceedings on Data Mining (SDM)


Table 6-9 Top 10 Impact Publications in Topic of Data Mining (during 1990-2002).

Rank | C/J | Title of the publication | Impact Power
1 | C | ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) | 67.6986%
2 | C | SIGMOD | 14.7991%
3 | C | Very Large DataBase (VLDB) | 4.6433%
4 | C | Industrial Conference on Data Mining (ICDM) | 2.2011%
5 | C | Principles of Data Mining and Knowledge Discovery (PKDD) | 2.0353%
6 | C | International Conference on Data Engineering (ICDE) | 1.7043%
7 | J | Data Mining and Knowledge Discovery | 1.3425%
8 | C | Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) | 0.8536%
9 | J | IEEE Transactions on Knowledge and Data Engineering | 0.8119%
10 | C | Lecture Notes in Computer Sciences (LNCS) | 0.7115%

We use the ISI impact factor, taking the values from the JCR database as shown in Table 6-7. The JCR database allows us to trace the value from 2002 to 2008. Since the dataset in our research model covers 1990-2002, the impact factor for 2002 is a suitable reference; the latest impact factor available is for 2008, so we consider both values. The journal Bioinformatics has the highest impact factor in both 2002 and 2008, but the medical or biology discipline may not really have the most impact on the data mining topic. This is also a limitation of the ISI approach: it can only rank by category, not by topic. The experts we surveyed state that the journal Bioinformatics may contain some papers about data mining and thus have a high impact according to citation frequency, but it may not be a main journal in the area of data mining. No impact factor is given for the journal Knowledge and Information Systems in 2002 because it was not on the SCI/SSCI list that year. This is also the case for SIGKDD Explorations and the IEEE Bulletin of the Technical Committee on Data Engineering, which appear in our collection alongside the conference papers; neither is given an impact factor in 2002 or 2008. Conference publications are likewise not assigned an impact factor by the ISI, which is another reason why we cannot rely on the impact factor alone to validate the publication list proposed by our model.
