A Reviewer Recommendation System based on Collaborative Intelligence


Kai-Hsiang Yang², Tai-Liang Kuo¹, Hahn-Ming Lee¹,², Jan-Ming Ho²

¹ Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan
Email: {M9615006, hmlee}@mail.ntust.edu.tw

² Institute of Information Science, Academia Sinica, Taipei, Taiwan
Email: {khyang, hmlee, hoho}@iis.sinica.edu.tw

Abstract

In this paper, the expert-finding problem is transformed into a classification problem. We build a knowledge database that represents the expertise characteristics of each domain from web information constructed by collaborative intelligence, and we propose an incremental learning method to update the database. Furthermore, results are ranked by measuring correlation in the concept network of an online encyclopedia. In our experiments, we use a real-world dataset comprising 2,701 experts categorized into 8 expertise domains. Our experimental results show that the expertise knowledge extracted from collaborative intelligence improves the efficiency and effectiveness of classification and increases the precision of expert ranking by at least 20%.

1 Introduction

Reviewer recommendation is an important but complex task [7]. Its key problem is to identify experts for specific topics [5, 10], where an expert is someone who has sufficient expertise in the given topic. The problem of expert finding has been addressed in previous work [1, 2, 5, 7, 10], in which experts are identified by modeling expertise from online communities [11] or from their publications [9]. Some approaches use keyword co-occurrence statistics in documents or publications [1, 5] to find the documents most similar to a query and treat their authors as the experts for that query. Ontology-based approaches to expertise matching are more efficient and effective [3, 9], but their main drawback is that constructing and maintaining an ontology across many domains takes considerable effort. Moreover, new terms keep emerging as research fields develop. Approaches other than keyword co-occurrence consider the degree of activity and the category and type of documents in an online community [2, 11]. We overcome these drawbacks by using an online encyclopedia as the semantic kernel [4, 6] to construct our Expertise Knowledge Database (EKD) with an incremental learning method. The online encyclopedia is Wikipedia, which is built by collaborative intelligence from all over the world. The EKD helps us model the characteristics of domains and classify a proposal into related domains. The Wikipedia category network is used as the Wikipedia Concept Network (WCN) to compute word-level semantic relatedness.

In this paper, we propose an approach to these issues in a real-world task: the peer review process for reviewing proposals. Peer review is an essential but tough task for research councils, journal editors, and conference program chairs [7]. Moreover, many research proposals in the computer science domain are multidisciplinary (e.g., some proposals address stock price prediction using rule-based machine learning technologies). It is a challenge to find suitable experts efficiently, and maintaining expert profiles requires a great deal of information [2]. Expertise knowledge management usually takes considerable effort, and improving it with outside sources such as web information has become a hot topic [6, 8, 9]. In our scenario, however, we have only a very short time to assign reviewers to a proposal.

We focus on the problems of expert finding and expertise knowledge management in the proposed reviewer recommendation system.

Our approach divides the expert-finding problem into three parts. First, it reduces the problem to a multi-domain classification problem, since we want to reduce the complexity of finding experts and improve the efficiency of the recommendation system. Second, it uses the WCN as a knowledge inference database and computes the relatedness between experts and a proposal. Finally, it takes into account the academic contribution of each expert who belongs to the domain of the proposal. The approach thus considers user experience, research relatedness, and academic authority, as a real-world reviewer recommendation system requires.

2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology


2009 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Workshops


1.1 Problem Definition

In this work, we were asked by the Division of Computer Science of the National Science Council of Taiwan (NSC) to help the reviewer recommendation committee find suitable reviewers for research proposals. The NSC dataset contains 8 domains, 668 proposals, 2,701 experts¹ who have one or more expertise domains, 38,468 publications, 71,899 publication keywords, and many newly submitted proposals every year (i.e., 668 this year, used for testing). Each proposal is denoted as $Pro_i$, where $i$ is the index of the proposal. Each expert, whose set of publications represents the concept of the expert's expertise, is denoted as $Expert_k$, where $k$ is the index of the expert. Each publication is denoted as $Pub$, and $Expert_k$ publishes $Pub_{kj}$, where $j$ is the index of the publication. Furthermore, each $Expert_k$ has one or more domains, and each $Pro_i$ belongs to exactly one domain.
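The entities defined in this section can be pictured as a small data model. The following Python sketch is only illustrative: the class names, field names, and the helper `candidates_for` are assumptions made for this example and do not come from the NSC dataset or from the system itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Publication:
    """Pub_kj: the j-th publication of expert k."""
    title: str
    keywords: List[str] = field(default_factory=list)


@dataclass
class Expert:
    """Expert_k: one or more expertise domains plus a set of publications."""
    name: str
    domains: Set[str]
    publications: List[Publication] = field(default_factory=list)


@dataclass
class Proposal:
    """Pro_i: belongs to exactly one domain (None until it is classified)."""
    title: str
    keywords: List[str]
    domain: Optional[str] = None


def candidates_for(proposal: Proposal, experts: List[Expert]) -> List[Expert]:
    """Keep only the experts whose domains include the proposal's domain."""
    return [e for e in experts if proposal.domain in e.domains]
```

Filtering candidates by domain before any relatedness computation mirrors the motivation for treating expert finding as a classification problem: only experts of the proposal's domain need to be scored.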

2 System Architecture

The system architecture of our approach comprises three parts: 1) Domain Modeling, 2) Expert Matching, and 3) Ranking, as shown in Figure 1. Domain Modeling reduces the cost of computation and handles the problem of Expertise Knowledge Management. In the Expert Finding phase, Expert Matching solves the problem of correlation ranking, and Ranking estimates the academic contributions of experts.

[Figure 1: block diagram. Input and output: Proposal, Expert List. Source: Collaborative Intelligence. Phases: Expertise Knowledge Management, Expert Finding. Blocks: Expertise Knowledge Database, Experts of Domain, Domain Modeling (Domain Characteristic Modeller, Domain Classifier), Expert Matching (Wiki-Page-Title Relation Parser, Semantic Relatedness Calculator), Ranking (Academic Authority Estimator, Score of Publication, Score Calculator).]

Figure 1. Reviewer recommendation system architecture.

¹ Experts’ data were retrieved from http://cs.nsc.ncku.edu.tw/introduce/

2.1 Domain Modeling

The goal of Domain Modeling is to find relevant experts quickly and to generate domain knowledge efficiently. Building the EKD is necessary and helpful for finding the suitable domain efficiently. The Domain Characteristic Modeller is an incremental EKD learner that models a specific domain.

Each expert has expertise domains and a set of Wikipedia Page Titles (WPT), mapped from Wikipedia [9], that represent the expert's research topics. Traditionally, finding the relevant experts is slow because correlations are also computed against irrelevant experts. Hence, the proposed system first classifies the queried proposal instead of comparing it with all candidate experts. The domain that we want to identify is denoted as $DPro_n$, where $n$ is the index of the domain. A set of WPT represents the concept of a proposal, and each WPT is denoted as $ProP_{iu}$, where $i$ is the index of the proposal and $u$ is the index of the WPT. This module classifies proposals according to the probabilities of their WPT in each domain. After classification, the unseen terms are labeled with that concept. One function sums the probabilities of the WPT in each domain to obtain a probability for each domain. Another function computes the probabilities by invoking Bayes' theorem, modeling the knowledge of a domain from its associated documents. The probability function is as follows:

$$p(DPro_n \mid Pro_i) = \frac{p(Pro_i \mid DPro_n)\, p(DPro_n)}{p(Pro_i)} \tag{1}$$

To find the related domain, we assume the probability $p(Pro_i)$ to be uniform and focus on $p(Pro_i \mid DPro_n)$ and $p(DPro_n)$. We estimate the probability of a proposal given a domain by representing the domain as a multinomial probability distribution over the keywords of the proposal:

$$p(Pro_i \mid DPro_n) = \prod_{ProP_{iu} \in Pro_i} p(ProP_{iu} \mid \Theta_{DPro_n})^{n(ProP_{iu},\, Pro_i)} \tag{2}$$

Then, we smooth the probability of a proposal's keyword given a domain with the background probabilities:

$$p(ProP_{iu} \mid \Theta_{DPro_n}) = (1 - \lambda)\, p(ProP_{iu} \mid DPro_n) + \lambda\, p(ProP_{iu}) \tag{3}$$

where $\lambda = \tau / (\alpha + \beta)$, $\alpha$ is the average number of keywords per publication, and $\beta$ is the average length of a publication title. In our data, $\alpha = 1.864$ and $\beta = 60.112$. The prior $p(DPro_n)$ is computed as the number of proposals in domain $n$ divided by the number of all queried proposals. Both functions select the domain with the maximum probability as the answer for $Pro_i$.
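The Bayesian classification rule of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's implementation: the class `DomainClassifier`, its method names, and the default value of `lam` are hypothetical; the exponent in the multinomial model is taken to be the number of times a term occurs in the proposal; and the prior is estimated from the training documents seen so far rather than from the queried proposals. Scores are accumulated in log space to avoid numerical underflow, which does not change the arg max.

```python
import math
from collections import Counter, defaultdict


class DomainClassifier:
    """Multinomial naive Bayes with linear background smoothing (a sketch)."""

    def __init__(self, lam=0.1):
        # lam plays the role of lambda; it must be > 0 (illustrative default).
        self.lam = lam
        self.term_counts = defaultdict(Counter)  # domain -> term -> count
        self.background = Counter()              # term -> count over all domains
        self.doc_counts = Counter()              # domain -> number of documents

    def update(self, domain, terms):
        """Incrementally add one labeled document (a bag of terms)."""
        self.term_counts[domain].update(terms)
        self.background.update(terms)
        self.doc_counts[domain] += 1

    def _smoothed(self, term, domain):
        """(1 - lam) * p(term | domain) + lam * p(term) in the background."""
        counts = self.term_counts[domain]
        p_dom = counts[term] / max(sum(counts.values()), 1)
        p_bg = self.background[term] / max(sum(self.background.values()), 1)
        return (1 - self.lam) * p_dom + self.lam * p_bg

    def classify(self, terms):
        """Return the domain with the largest log posterior (p(Pro_i) dropped)."""
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for domain in self.doc_counts:
            score = math.log(self.doc_counts[domain] / total_docs)  # log prior
            for term, n in Counter(terms).items():
                p = self._smoothed(term, domain)
                if p == 0.0:
                    # Term unseen in every domain: it cannot discriminate.
                    continue
                score += n * math.log(p)
            if score > best_score:
                best, best_score = domain, score
        return best
```

The `update` method reflects the incremental nature of the EKD: new labeled documents only add to the stored counts, so the model never has to be rebuilt from scratch.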


2.2 Expert Matching

The goal of Expert Matching is to measure the semantic relatedness between a proposal and the publications of an expert. The Wiki-Page-Title Relation Parser parses the Wikipedia categories of a page as concepts. Since we want to measure the conceptual relation between a publication and a proposal, the relation between terms must be measured first. The distance between each pair of categories is their degree of relation in the WCN, and each pair has a maximum depth from the root. Following previous research [6], we limit the distance to 5: a pair of keywords that is farther apart is treated as unrelated and is not considered.

Once the relations between keywords are found, the semantic relatedness score can be measured. The estimation criterion uses the concept structure built by collaborative intelligence to find semantic relatedness in Wikipedia. The WCN is a collaborative tagging system that allows users to categorize the content of pages. Categories run from general at the top to specific at the bottom, so the more specific a concept is, the deeper its category lies. The distance between categories reflects their correlation, so the more correlated two categories are, the shorter the distance between them. There are many keyword pairs between a proposal and a publication, and each pair can be connected by many paths. A score is computed for each path by considering distance and depth, and the maximum of these scores represents the degree of semantic relatedness of the pair. Finally, the sum of the maximum scores over all pairs is the semantic relatedness score between the proposal and the publication.
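The matching procedure can be illustrated with a toy category graph. The text does not give the exact formula that combines distance and depth, so the scoring function below, `(depth + 1) / (depth + 1 + distance)`, is purely an assumption chosen so that deeper common categories and shorter paths score higher. The graph representation (a dict from category to parent categories), the use of the shortest distance to the root as depth, and all function names are likewise hypothetical; only the distance cutoff of 5, the maximum over paths, and the sum over keyword pairs come from the text.

```python
from collections import deque

MAX_DISTANCE = 5  # distance cutoff stated in the text


def ancestors_with_distance(category, parents):
    """Breadth-first walk upward: map each reachable ancestor to its hop count."""
    dist = {category: 0}
    queue = deque([category])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist


def depth(category, parents, root):
    """Shortest hop count between the root and the category (assumed notion of depth)."""
    return ancestors_with_distance(category, parents).get(root, 0)


def pair_score(cat_a, cat_b, parents, root):
    """Best score over all paths that join two categories via a common ancestor."""
    up_a = ancestors_with_distance(cat_a, parents)
    up_b = ancestors_with_distance(cat_b, parents)
    best = 0.0
    for common in up_a.keys() & up_b.keys():
        distance = up_a[common] + up_b[common]
        if distance > MAX_DISTANCE:
            continue  # too far apart: treated as unrelated
        d = depth(common, parents, root)
        best = max(best, (d + 1) / (d + 1 + distance))  # assumed formula
    return best


def relatedness(proposal_keywords, publication_keywords, categories_of, parents, root):
    """Sum, over all keyword pairs, of the maximum category-pair score."""
    total = 0.0
    for kw_p in proposal_keywords:
        for kw_q in publication_keywords:
            scores = [pair_score(a, b, parents, root)
                      for a in categories_of.get(kw_p, ())
                      for b in categories_of.get(kw_q, ())]
            total += max(scores, default=0.0)
    return total
```

With this toy formula, two keywords in the same category score 1.0, siblings under a deep category score higher than categories that only meet at the root, and keywords without any category contribute nothing.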

2.3 Ranking

The goal of Ranking is to combine the publication scores of each expert and to rank the experts in the output list. Academic contribution can be estimated by the number of publications. The module computes $FinalScore_{Pro_i, Expert_k}$ of $Expert_k$ for $Pro_i$ as follows:

$$FinalScore_{Pro_i, Expert_k} = \sum_{Pub_{kj} \in Expert_k} Score_{Pro_i, Pub_{kj}} \tag{4}$$
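As a small illustration of the ranking step, the sketch below adds up the per-publication scores of each expert and sorts the experts by the total. The function name `rank_experts` and the input layout (a dict from expert name to a list of publication scores for one proposal) are assumptions made for this example.

```python
def rank_experts(publication_scores):
    """Rank experts by the sum of their publication scores for one proposal.

    publication_scores maps an expert name to the list of relatedness
    scores between the proposal and each of that expert's publications.
    """
    final = {expert: sum(scores) for expert, scores in publication_scores.items()}
    # Highest final score first.
    return sorted(final.items(), key=lambda item: item[1], reverse=True)


ranking = rank_experts({
    "expert_a": [0.25, 0.25, 0.25, 0.25, 0.25],  # many weakly related papers
    "expert_b": [1.0],                           # one strongly related paper
})
```

In this toy input, `expert_a` (total 1.25) ranks above `expert_b` (total 1.0), which shows how a plain sum lets the number of publications dominate their individual relatedness.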

3 Experiments

In this paper, we conduct two experiments, on Domain Modeling and on Expert Matching. All combinations of features and methodologies are examined to find the best result for each domain. The domains are “Image and Pattern Recognition” (IPR), “Natural Language and Speech Processing” (NLSP), “Artificial Intelligence” (AI), “Computer Graphics” (CG), “Information System Management” (ISM), “Database” (DB), “Bioinformatics” (Bio), and “Web Technologies” (WT).

3.1 Performance Analysis of Classifier

The performance of the domain classifier depends on the classification methodology and on how the characteristics of domains are modeled, and a large amount of expert data is available for modeling those characteristics. Because we wanted to model domains with fewer instances, we examined several candidate features: the key terms of the proposal (KT), the titles of Wikipedia pages (PT), and the titles of Wikipedia categories (CT). For modeling multiple domains, weighting is either added (W) or not (NW), and the two classification methods are maximum probability (Max_Prob.) and Naïve Bayesian (Naïve). In this experiment, the performance of the domain classifier is represented by the F-measure for each domain, as illustrated in Figure 2. According to the results, Naïve-W-PT is better than the other combinations by about 5% to 25%, especially in the domains of IPR, DB, and Bio.

[Figure 2: plot of the F-measure (axis from 0.00 to 1.00) for each domain. Series: Max_Prob.-NW-KT, Max_Prob.-NW-PT, Max_Prob.-NW-CT, Max_Prob.-W-KT, Max_Prob.-W-PT, Max_Prob.-W-CT, Naive-NW-KT, Naive-NW-PT, Naive-NW-CT, Naive-W-KT, Naive-W-PT, Naive-W-CT.]

Figure 2. The F-measure of proposal classification.

3.2 Correct Rate of Reviewer Recommendation

The evaluation criterion is that one of the expert's domains matches the domain of the proposal, where each labeled proposal corresponds to one domain. In our approach, the experts of the proposal's domain are ranked by their Wikipedia relatedness score. The result, shown in Figure 3, is compared with previous work. The expert data are extracted from the NSC website and differ from the data used by the previous work; however, they are more up to date and fairer than the previous data.

[Figure 3: bar chart, axis from 0.00 to 1.00. Series: P@50 of Previous Work (PW), P@50 of PW using New Persona (PWN), Random, P@50 of Our Approach. Value labels printed in the chart: 0.954, 0.169, 0.706, 0.174, 0.977, 0.747, 0.894, 0.032, 0.748.]

Figure 3. Precision@50 of related expert list.

We wanted the top of the expert list to be correct, but no ground-truth ranked expert list exists. Therefore, we evaluate the result by the precision rate at N (P@N), where N is the number of top results. The function is as follows:

$$Precision(N, d) = \frac{\sum_{i \in Domain_d} \sum_{j=1}^{N} F(Result_{ij})}{PC_d} \tag{5}$$

where $F(Result_{ij})$ is $\frac{1}{j}$ if the expert's domains match the proposal's, and zero otherwise; $d$ is the index of the domain, $PC_d$ is the number of proposals in domain $d$, $i$ is the index of the proposal, and $j$ is the index of the expert in the result list. The P@50 of PWN is almost equal to that of PW, except in the domain of WT; hence, we compare against the result of PWN instead of PW. The precision of the previous work is equal to that of the random test in many domains, such as CG, ISM, Bio, and WT. The average P@50 of our approach is about 40% better than random and about 20% better than PWN.
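For readers who want to reproduce a comparable measurement, the following sketch computes the conventional precision at N per domain: the fraction of the top N experts whose domains include the proposal's domain, averaged over the proposals of that domain. It is an assumption that this conventional form, in which every hit carries the equal weight 1/N, is close to what was measured; the per-hit weight in the definition above may differ. The function name and the three input dictionaries are hypothetical.

```python
def precision_at_n(ranked_lists, proposal_domain, expert_domains, n=50):
    """Mean P@N per domain.

    ranked_lists:    proposal id -> ranked list of expert ids
    proposal_domain: proposal id -> the proposal's single domain
    expert_domains:  expert id -> set of the expert's domains
    """
    totals, counts = {}, {}
    for pid, ranking in ranked_lists.items():
        dom = proposal_domain[pid]
        # A hit is an expert in the top N whose domains contain the proposal's.
        hits = sum(1 for e in ranking[:n] if dom in expert_domains.get(e, set()))
        totals[dom] = totals.get(dom, 0.0) + hits / n
        counts[dom] = counts.get(dom, 0) + 1
    # Average over the proposals of each domain (the role of PC_d).
    return {d: totals[d] / counts[d] for d in totals}
```

Calling it with `n=50` on a full result set yields one value per domain, which is the kind of per-domain quantity plotted in Figure 3.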

4 Conclusion

In this paper, we propose a reviewer recommendation system that assists the commissioners of a national science organization in finding experts suitable for reviewing a proposal. The proposed system uses the publications of experts as the training data for the expertise knowledge database. The keywords of publications are transformed into domain concepts derived from collaborative intelligence, and the correlations between experts and a proposal are computed by parsing the WCN. Our approach improves both the correlation between experts and proposals and the computation time, as our experiments show. The F-measure of the domain classifier is about 78.2%, and the P@50 of the recommended expert list is at least 20% better than that of our previous work. According to the results, many proposals were not classified into any domain, which means that more effort is needed to complete the labeled terms in the EKD. In addition, academic contribution is not modeled very well, so the quantity of publications counts for more than their quality.

Acknowledgements

This work was supported in part by the National Science Council of Taiwan under grants NSC 95-2221-E-001-021-MY3 and NSC 96-2628-E-011-084-MY3.

References

[1] K. Balog, M. de Rijke, and W. Weerkamp. Bloggers as experts: feed distillation using expert retrieval models. In SIGIR ’08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 753–754, New York, NY, USA, 2008. ACM.

[2] G. Demartini. Finding experts using wikipedia. FEWS 2007: Finding Experts on the Web with Semantics Workshop at ISWC 2007 + ASWC 2007, November 2007.

[3] P. Liu, K. Liu, and J. Liu. Ontology-based expertise matching system within academia. pages 5431–5434, Sept. 2007.

[4] M. Mika, M. Ciaramita, H. Zaragoza, and J. Atserias. Learning to tag and tagging to learn: A case study on wikipedia. Intelligent Systems, IEEE, 23(5):26–33, Sept.-Oct. 2008.

[5] D. Mimno and A. McCallum. Expertise modeling for matching papers with reviewers. In KDD ’07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 500–509, New York, NY, USA, 2007. ACM.

[6] S. P. Ponzetto and M. Strube. Knowledge derived from wikipedia for computing semantic relatedness. Journal of Artificial Intelligence Research, 30:181–212, 2007.

[7] F. Wang, B. Chen, and Z. Miao. A survey on reviewer assignment problem. In IEA/AIE ’08: Proceedings of the 21st international conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, pages 718–727, Berlin, Heidelberg, 2008. Springer-Verlag.

[8] P. Wang and C. Domeniconi. Building semantic kernels for text classification using wikipedia. In KDD ’08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 713–721, New York, NY, USA, 2008. ACM.

[9] K.-H. Yang, C.-Y. Chen, H.-M. Lee, and J.-M. Ho. Efs: expert finding system based on wikipedia link pattern analysis. In the 2008 IEEE International Conference on Systems, Man and Cybernetics (SMC 2008), October 2008.

[10] D. Yimam. Expert finding systems for organizations: Domain analysis and the demoir approach. In ECSCW 99 Beyond Knowledge Management: Management Expertise Workshop, pages 276–283. MIT Press, 2000.

[11] J. Zhang, M. S. Ackerman, and L. Adamic. Expertise networks in online communities: structure and algorithms. In WWW ’07: Proceedings of the 16th international conference on World Wide Web, pages 221–230, New York, NY, USA, 2007. ACM.
