4. The experimental procedure

4.4 How to use the emerging topic detection table to predict whether a topic

The CDP, generated from the CNI and CPVI, and the JDP, generated from the JNI and JPVI, are used to produce the VCDP and the VJDP, respectively. This work computes the VDP for each topic based on its PDY. If a topic’s PDY is 5 years, one can calculate the VDP for each of years 3–5. The 3-year VDP uses the first, second and third years and ignores data for the fourth and fifth years. The 4-year VDP is then calculated, and the year 5 data is ignored. This process continues until the full PDY is covered, so that the development of each topic in its lifecycle can be determined for each year.

Each topic has its own start time and emerging time. Consequently, the median VDP is used to generate the baseline for a research topic in each year. For an unknown topic, one cannot tell in advance whether it is worth continued investigation. This study therefore calculates the VDP of a topic for each year of its development. If the VDP is higher than the baseline in any year, the topic should be investigated further, as it may be a novel topic. By contrast, if a topic is old and mature, it will have a low VDP. Furthermore, if a topic never has a VDP higher than the baseline from the beginning to the end of its history, then that topic is considered worthless. Based on a topic’s development, one can thus determine whether the topic is valuable.

Take Virtual Environments (VEs) as an example. The VDP for each year is compared to the baseline, and the emerging topic detection table is used to determine whether VEs should be investigated further. In the third year, the VJDP in the table is 0.583, and the VJDP of VEs is 0.7, which is higher than the baseline, meaning that VEs is a valuable topic in its third year. In the fourth year, the VJDP in the table is 0.465, and the VJDP of VEs is 0.592, again higher than the baseline, meaning that it is still a valuable topic. In year 6, the VJDP of VEs is 0.33, lower than the baseline value of 0.355, indicating that the topic of VEs is mature. Similarly, the VCDP in the table in the third year is 0.572, and the VCDP of VEs is 0.675, which is higher than that in the table. In its eighth year, the VCDP of VEs is lower than that in the table (0.271), indicating that the topic of VEs in conferences is mature by the eighth year.
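The comparison against the detection table can be illustrated with a short Python sketch. It is not the implementation used in this study: the function names (build_baseline, compare_to_baseline, ever_above) are hypothetical, and the fourth-year journal figures are read as a table value of 0.465 and a VEs value of 0.592. The journal figures themselves are those quoted above for VEs.

```python
from statistics import median


def build_baseline(vdp_by_topic):
    """Return {lifecycle_year: median VDP} over all topics that have a value for that year."""
    years = sorted({year for series in vdp_by_topic.values() for year in series})
    return {
        year: median(series[year] for series in vdp_by_topic.values() if year in series)
        for year in years
    }


def compare_to_baseline(topic_vdp, baseline):
    """Label each lifecycle year of one topic as 'above' or 'below' the baseline."""
    return {
        year: "above" if value > baseline[year] else "below"
        for year, value in topic_vdp.items()
        if year in baseline
    }


def ever_above(topic_vdp, baseline):
    """True if the topic exceeds the baseline in at least one lifecycle year."""
    return "above" in compare_to_baseline(topic_vdp, baseline).values()


# Journal figures quoted in the text (lifecycle year -> value).
# Assumption: in year 4 the table value is 0.465 and the VEs value is 0.592.
journal_baseline = {3: 0.583, 4: 0.465, 6: 0.355}
ve_journal_vdp = {3: 0.7, 4: 0.592, 6: 0.33}

labels = compare_to_baseline(ve_journal_vdp, journal_baseline)
print(labels)
print("worth further investigation at some point:", ever_above(ve_journal_vdp, journal_baseline))
```

With these figures, years 3 and 4 are labelled as above the baseline and year 6 as below it, which matches the reading that VEs was valuable early and mature by its sixth year. The build_baseline helper shows how a median baseline could be derived when per-topic VDP series are available.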

5. Validating the accuracy and effectiveness of the emerging topic detection indices

In this section, we describe an experiment using this research model to validate the proposed indices. We survey previously published related works to find information about topics which have already been validated using other methods. We also survey some experts on the topic selected to validate the indices.

5.1 Comparing the experimental results with previous work

In order to validate the accuracy and effectiveness of the emerging topic detection indices, we look for related work where emerging topics have also been detected but within a different research time range and field. The most similar work that we found is that of Jo et al. (2007) in SIGKDD. Their approach is based on the intuition that documents related to a topic should be more cohesively connected in the citation graph than a random selection of documents. They used the Citeseer data which contains 716,771 papers, with 1,740,326 citations. This amounts to 2.43 citations per paper. For each paper, we use the combined title and abstract for documentation. The number of bigrams in the corpus after pruning out the low-frequency bigrams and 35 stop words is 631,839. The majority of papers are from the years 1994 to 2004.

Although the research approach, the database and the time range of the dataset differ from those used in this study, when the same time range is used to examine the research results, our approach accurately predicts the emerging time they proposed for each topic. Their work contains both conference papers and journal papers. In order to map their work onto ours, we selected topics which they claimed to have emerged in their study: image retrieval, sensor network, semantic web, and support vector. The indices are then applied to find the potential developed year and the published volume in each year for these topics in the ACM digital library.

The NI and PVI of each topic proposed by their work are computed afterwards. A comparison of the results is discussed below.

1. Sensor networks

The topic “sensor network” is one of the most common emergent topics in their work, with an emerging time in the year 2004. (Figure: evolution of the topic sensor networks over time, with the original published conference and journal volumes.) The year of the conference detection point is 2004, the same as in Jo et al. (2007), and the year of the journal detection point is 2006.

2. Semantic web

The topic “semantic web”, which emerged in the year 2004, is another common emergent topic discussed in their work. (Figure: evolution of the topic semantic web over time, with the original published conference and journal volumes.) The year of the conference detection point is 2004, the same as in Jo et al. (2007), and the year of the journal detection point is 2007.

3. Support vector

The topic “support vector” is another topic that emerged in the year 2004. However, they noted that the emergence of support vector is not as obvious as that of the previous two topics, sensor network and semantic web: the curve obtained is not as steep. (Figure: evolution of the topic support vector over time.) In the current research, the conference detection point is in 2004, the same as in Jo et al. (2007), and the journal detection point is in 2005.

All four topics considered by Jo et al. (2007) to be emerging topics in 2004 are detected correctly using the emerging topic detection indices.
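The comparison in this subsection can be summarized in a small Python sketch. It uses only the detection-point years reported above for the three topics discussed in detail; the variable and function names are hypothetical and serve illustration only.

```python
# Detection-point years reported in Section 5.1: topic -> (conference, journal).
detection_years = {
    "sensor network": (2004, 2006),
    "semantic web": (2004, 2007),
    "support vector": (2004, 2005),
}

# Emerging year reported by Jo et al. (2007) for these topics.
REFERENCE_YEAR = 2004


def summarize(years, reference_year):
    """For each topic, check agreement with the reference year and compute the conference lead."""
    rows = {}
    for topic, (conference_year, journal_year) in years.items():
        rows[topic] = {
            "matches_reference": conference_year == reference_year,
            "conference_lead_years": journal_year - conference_year,
        }
    return rows


summary = summarize(detection_years, REFERENCE_YEAR)
for topic, row in summary.items():
    print(topic, row)
```

For all three topics the conference detection point coincides with the reference year, and the conference detection point precedes the journal detection point by two, three and one years, respectively.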

5.2 Comparing the experimental results with the expert survey results

Besides comparing the experimental results with those obtained from previous works as noted in the literature review, we also survey five experts who have investigated the topic of data mining. We ask them to answer yes, no or no opinion to indicate whether each topic was emerging during a given period of time.

These experts included two assistant professors, one associate professor and two full professors, three from the department of Management of Information Systems and two from Computer Science and Engineering. E1 means Expert 1, and Grade represents the votes which a topic receives.

We give each topic one grade point each time one of the experts considers it to really be an emerging topic during the given period of time. We find that all topics received more than 3 votes, which is more than half of the experts.

The results for the topic “sensor network”, obviously emerging in 2004, are consistent.

The topic “support vector” did not receive more support because some of the experts were not sure whether support vector referred to the support vector machine. This limitation stems from the ambiguity of the term: we can only extract terms from previous work and cannot be sure of their exact meanings.
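The grading rule used in the expert survey can be sketched in Python as follows. The votes below are purely hypothetical placeholders, since the individual expert answers are not reproduced here, and the names grade and is_supported are chosen for illustration. The sketch assumes that only a yes answer adds to a topic's grade and that a topic is supported when its grade exceeds half the number of experts.

```python
from collections import Counter


def grade(votes):
    """Grade of a topic: the number of experts who answered 'yes'."""
    return Counter(votes)["yes"]


def is_supported(votes):
    """A topic is supported when more than half of the experts answered 'yes'."""
    return grade(votes) > len(votes) / 2


# Hypothetical answers from five experts (E1..E5); not the survey data.
hypothetical_votes = {
    "topic A": ["yes", "yes", "yes", "yes", "no opinion"],
    "topic B": ["yes", "no", "no opinion", "no", "yes"],
}

for topic, votes in hypothetical_votes.items():
    print(topic, grade(votes), is_supported(votes))
```

Treating no opinion as a non-vote is one possible design choice; an alternative would be to exclude such answers from the denominator as well.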

6. Applications and implications

The NI helps researchers to examine research topics from the viewpoint of novelty and aging theory. The concepts of novelty and aging should be considered when discussing emerging topic detection, not only hot topic detection. Emerging implies both new and urgent.

The PVI differs from the simple frequency lines used in the past, such as in the work of Jo et al. (2007), which can only report the frequency in each year. The PVI, however, adopts the concept of accumulated relative frequency for each year based on the volume of different PDYs. One can tell which topic is an emerging topic by the rising curve and which is a mature topic by the falling curve.

Combining the NI and PVI yields the detection point (DP) and the VDP, since the DP marks the point at which the NI and PVI are both at their highest values. The DP therefore has the characteristics of being both novel and hot, so it matches the expectation of an emerging topic. The most important finding is the development of a method and indices which help researchers to construct topic detection tables for their own field and to examine new topics in that field.

The database used in this study includes various computer science journals, transactions and conference proceedings from the ACM database of publications. There were 689 journal papers and 5154 conference papers published in 2007. Conferences account for 1791 research topics, and journals for 311 topics. The intersection is 89 research topics. As a share of each set, the intersection is almost 5% of the conference topics and nearly 29% of the journal topics. Thus, the topics discussed in journals are more convergent, and those discussed in conferences are more divergent.
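The two intersection ratios follow directly from the topic counts given above, as the following short Python check illustrates (variable names are chosen for illustration only).

```python
# Topic counts reported in the text for 2007.
conference_topics = 1791
journal_topics = 311
shared_topics = 89  # topics appearing in both conferences and journals

# Share of each set that falls in the intersection.
conference_ratio = shared_topics / conference_topics
journal_ratio = shared_topics / journal_topics

print(f"conference ratio: {conference_ratio:.2%}")
print(f"journal ratio: {journal_ratio:.2%}")
```

The conference ratio is just under 5% and the journal ratio is between 28% and 29%, consistent with the figures quoted in the text.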

We demonstrate the development of emerging topic detection indices. The YCDP and YJDP can determine the lead relationship. The first DP in the curve, whether for the CDP or the JDP, indicates the maximal value of the NI and PVI. Consequently, we suggest that the first DP marks the leading position. The YDP can help to determine which type of paper is in the lead position. This study also shows that the first publication time is not the critical factor when determining the YDP; the NI and PVI affect the DP directly. In total, for 87.64% of research topics, conferences lead journals in the development of the topic. Using different research methods and different databases from previous works, which focus only on the leading relationship, we obtain a similar lead relationship. The experimental results verify the accuracy and effectiveness of the proposed indices.

This study uses the NI and PVI to develop the emerging topic detection indices. The concepts of terms, topics, and candidate research topics are used to investigate topics, and we also discuss the CNI and CPVI for conferences and the JNI and JPVI for journals. The DP, obtained as either the CDP or the JDP, yields the YDP and VDP, which represent the year and the value at which the topic emerges. The YCDP and YJDP can be used to determine which type of paper stands in the leading position.

The VDP can be used to determine whether to investigate a topic further. Based on the NI and PVI, one can show that when the published volume for the present year is large, the curve rises and the DP appears at an early stage of the topic’s lifecycle, whereas when the volume declines, the DP is delayed. Based on the NI and PVI and the properties of the DP, the emerging topic detection indices can help to determine whether a research topic is worth further investigation. A high VDP indicates that topic novelty is sufficiently high for further research.

Finally, the emerging topic detection indices detect the DP and obtain the YDP and VDP. By comparing conferences and journals, one can determine which reaches the threshold first. However, if a topic has never been important, its DP is useless. The emerging topic detection table can be used to examine whether a topic warrants further research. Each topic has its own value; a topic is worth researching when it has not yet reached the highest point in its lifecycle. When a topic is hot and mature, its potential worth for further research will decrease in the future. The VDP is the basis of the detection table.

Even when the published volume is low, the PVI can be high, since the PVI is determined relative to the topic’s own total volume. A high VDP represents a large development space for additional effort. Hence, a high VDP indicates that a topic warrants further investigation. Consequently, when one does not know whether a topic is important or worthy, one can compute its VDP and then compare this value with that in the emerging topic detection table. If the VDP is lower than the baseline, the topic is mature or worthless. By contrast, if a topic’s VDP is never higher, at any point in its lifecycle, than the VDP for the same year in the detection table, it has never garnered popular attention; thus, the topic is worthless.

The indices for research topics in this work can also be applied in bibliometrics and patent analysis, as in previous related works on tech mining. They can be used by business researchers, organizations or governments. For example, in publishing, novelty is very important. The indices can help a publisher to know whether a series of reports related to some issue will increase or decrease, and whether customers are getting tired of the same issue. The indices can help financial analysts with stock price prediction, showing which area of the market is overheated and where more investment is possible. Organizations and governments can use the indices to assess novelty and the number of inbound competitors in their enterprise. Governments can also use the indices to observe the development of social phenomena such as economic trends and to make sure that their policies balance supply and demand.

7. Concluding remarks

This study addresses the inadequacy of topic detection and tracking by developing a set of novel indices for emerging topic detection. The novelty concept is used in combination with aging theory to develop the novelty index (NI). The published volume index (PVI) is an improvement over traditional frequency methods in reflecting the growth of a discussion topic. The DP and YDP help determine the relationship between conference topics and journal topics and by how long one leads the other. The VDP is created to construct the detection table, which determines whether a topic warrants further research. The NI and PVI can be applied to other fields to determine new trends, for example, in the news or in stock price prediction.

The indices also have some limitations that stem from the units of the data sets which we can collect. For example, the dates of research papers are recorded by year rather than by month, so DPs at year 2002.1 and year 2002.9 will be treated as the same under these indices.

The major objective of this study is to detect what are emerging topics in order to provide research intelligence for academic papers. The value of candidate emerging topics can thus be checked to validate whether these topics are prospective ones or have already been adequately researched. In addition, in order to downsize the huge database, the results of the relationship between conferences and journals will be applied in our investigation of whether the emerging topics proposed by influential authors at conferences and the conference itself will appear in journals in the future.

Future work will extend the NI and PVI with more diversified experiments. The set of novelty indices can be improved using other areas of training and testing models. A more complicated detection table can be generated. The limitations of this study are as follows:

• This study focuses on information and systems of computer science; thus, the leading relationship between conferences and journals would differ from those of other disciplines.

• TextAnalyst was utilized to analyze words used in study titles. Therefore, selected features are restricted to the word dictionary in the TextAnalyst software. Namely, words that appear repeatedly in a corpus but are not included in the dictionary are not located due to software limitations.

• The year of the potential development of topics is based on the ACM datasets; thus, this study only observed published volume and development years in the ACM digital library.

8. References

Allan, J., Carbonell, J., Doddington, G., Yamron, J., & Yang, T. (1998). Topic detection and tracking pilot study: Final report. In: Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop.

Allan, J., Papka, R., & Lavrenko, V. (1998). On-line new event detection and tracking. In: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, (pp. 37-45).

Aurora, P. P., Rafael, B. L., & Jose, R. S. (2007). Topic discovery based on text mining techniques. Information Processing & Management, 43, 742-768.

Berry, M. W. (2004). Survey of text mining: Clustering, classification, and retrieval. Springer, 185-224.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

Bolelli, L., Ertekin, S., Zhou, D., & Giles, C. L. (2009). Finding topic trends in digital libraries. In: Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries, 69-72.

Braun, T., Schubert, A. P., & Kostoff, R. N. (2000). Growth and trends of fullerene research as reflected in its journal literature. Chemical Reviews, 100(1), 23-27.

Chen, K. Y., Luesukprasert, L., & Chou, S. C. (2007). Hot topic extraction based on timeline analysis and multidimensional sentence modeling. IEEE Transactions on Knowledge and Data Engineering, 19(8), 1016-1025.

Chou, T. C., & Chen, M. C. (2008). Using incremental PLSI for threshold-resilient online event analysis. IEEE Transactions on Knowledge and Data Engineering, 20(3), 289-299.

Clifton, C., Cooley, R., & Rennie, J. (2004). Topcat: Data mining for topic identification in a text corpus. IEEE Transactions on Knowledge and Data Engineering, 16(8), 949-964.

Cui, C., & Kitagawa, H. (2005). Topic activation analysis for document streams based on document arrival rate and relevance. In: Proceedings of the 2005 ACM symposium on applied computing, (pp. 1089-1095).

Cunningham, S. W., Porter, A. L., & Newman, N. C. (2006). Introduction – Special issue on tech mining. Technological Forecasting & Social Change, 73, 915-922.

Daim, T. U., Rueda, G., Martin, H., & Gerdsri, P. (2006). Forecasting emerging technologies: Use of bibliometrics and patent analysis. Technological Forecasting & Social Change, 73, 981-1012.

Erten, C., Harding, P. J., Kobourov, S. G., Wampler, K., & Yee, G. (2003). Exploring the computing literature using temporal graph visualization. Technical Report, Department of Computer Science, University of Arizona.

Franz, M., & McCarley, J. C. (2001). Unsupervised and supervised clustering for topic tracking. In: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, (pp. 310-317).

Garfield, E. (1955). Citation indexes for science. Science, 122, 108-111.

Garfield, E. (2006). The history and meaning of the journal impact factor. Journal of the American Medical Association, 293, 90-93.

Giles, C. L., Bollacker, K. D., & Lawrence, S. (1998). Citeseer: An automatic citation indexing system. In: Proceedings of the 3rd ACM Conference on Digital Libraries, 89-98.

Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. In: Proceedings of the National Academy of Sciences, 5228-5235.

Hatzivassiloglou, V., Gravano, L., & Maganti, A. (2000). An investigation of linguistic features and clustering algorithms. In: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, (pp. 224-231).

Hirsch, J. E. (2005). An index to quantify an individual’s scientific research output. In: Proceedings of the National Academy of Sciences of the United States of America, 102(46), 16569-16572.

Hofmann, T. (1999). Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, 50-57.

Jin, Y., Myaeng, S. H., & Jung, Y. (2007). Use of place information for improved event tracking. Information Processing & Management, 43, 365-378.

Jo, Y., Lagoze, C., & Giles, C. L. (2007). Detecting research topics via the correlation between graphs and texts. In: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, (pp. 370-379).

Joachims, T. (1998). Text categorization with Support Vector Machines: learning with many relevant features. In: Proceedings of the EMNLP Conference.

Kautz, H., Selman, B., & Shah, M. (1997). Referral Web: Combining social networks and collaborative filtering. Communications of the ACM, 3, 63-65.

Kleinberg, J. (2002). Bursty and hierarchical structure in streams. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, (pp. 91-101).

Kollios, G., Gunopulos, D., Koudas, N., & Berchtold, S. (2003). Efficient biased sampling for approximate clustering and outlier detection in large data sets. IEEE Transactions on Knowledge and Data Engineering, 15(5), 1170-1187.

Kostoff, R. N. (2008). Literature-related discovery (LRD): Introduction and background. Technological Forecasting and Social Change, 75, 165-185.

Kostoff, R. N., Briggs, M. B., Solka, J. L., & Rushenberg, R. L. (2008). Literature-related discovery (LRD): Methodology. Technological Forecasting and Social Change, 75, 186-202.

Kostoff, R. N., Briggs, M. B., Rushenberg, R. L., Bowles, C. A., Icenhour, A. S., Nikodym, K. F., Barth, R. B., & Pecht, M. (2007). Chinese science and technology - structure and infrastructure. Technological Forecasting and Social Change, 74, 1539-1573.

Kostoff, R. N., Briggs, M. B., Rushenberg, R. L., Bowles, C. A., Pecht, M., Johnson, D., Bhattacharya, S., Icenhour, A. S., Nikodym, K., Barth, R. B., & Dodbele, S. (2007). Comparisons of the structure and infrastructure of Chinese and Indian
