
Chapter 8 The Research Experiment of the Development of Emerging Topic


National Chengchi University

8.1.2 Select the Descriptor

A paper needs descriptors that describe its contents. This study extracts the research topics of papers from these descriptors. Four descriptors refer to paper content.

Researchers use keywords to characterize their papers. Consequently, when a term is a keyword, it becomes a backward term. Although the full text contains the most information, it is a descriptor with low knowledge density. Many words and terms can be used to represent a research topic. Hence, typical information retrieval techniques without human judgment will extract many terms that do not exactly represent a topic. Therefore, two descriptors, the title and the abstract, are used in a pilot study to compare how representative they are. The terms “data mining” and “information retrieval” are used as search terms in the ACM Digital Library to identify the journals and conferences that have the most papers containing each term. The terms extracted from the titles and abstracts of papers from 2007 are used for the comparison. For both journals and conferences, the average ratio found in this study agrees with previous work, which argued that the knowledge density of the title is higher than that of the abstract.

Additionally, this study finds that, because an abstract contains more words, the information embedded in an abstract is more complete than that in a title.

After stop-words are removed, an abstract contains 12 to 25 times as many terms as a title. Therefore, this work uses the abstract as the study descriptor.

8.1.3 Investigate the Extracted Topics

To avoid unnecessarily large term vectors, a word is treated as a term only if it appears in the training data at least three times and is not a “stop-word” (e.g. “and”, “or”) (Joachims, 1998). This study extracts candidate research topics from these terms rather than from frequently used words. A single word is not used as a candidate research topic; only a composite word or an abbreviation is. Although this approach overlooks topics represented by a single word, such as ontology, each single-word topic would have to be identified individually by a human. Without human judgment, candidate research topics comprising composite words or abbreviations have better properties than single-word topics. Conferences and journals each have their own candidate research topics. To determine which of the two leads a trend, this study uses the intersection of the candidate research topics for conferences and journals. This intersection represents the research topics in this study.
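The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list, the sample term counts, and the rule that an abbreviation is an all-uppercase single token are hypothetical choices, not details taken from the study.

```python
# Hypothetical stop-word list; the study's actual list is not given.
STOP_WORDS = {"and", "or", "the", "of", "a", "in"}


def candidate_topics(term_counts, min_count=3):
    """Keep composite terms and abbreviations that occur at least min_count times."""
    topics = set()
    for term, count in term_counts.items():
        if count < min_count:
            continue  # too rare to be treated as a term
        words = term.split()
        if all(word.lower() in STOP_WORDS for word in words):
            continue  # made up of stop-words only
        is_composite = len(words) > 1
        # Assumption: an abbreviation is a single all-uppercase token.
        is_abbreviation = len(words) == 1 and term.isupper() and len(term) >= 2
        if is_composite or is_abbreviation:
            topics.add(term.lower())
    return topics


# Hypothetical term counts for one year of conference and journal abstracts.
conference_terms = {"data mining": 12, "ontology": 9, "SVM": 4,
                    "semantic web": 2, "text categorization": 5}
journal_terms = {"data mining": 6, "SVM": 3, "ontology": 7, "query expansion": 4}

conference_topics = candidate_topics(conference_terms)
journal_topics = candidate_topics(journal_terms)

# The research topics are the candidates shared by both publication types.
research_topics = conference_topics & journal_topics
print(sorted(research_topics))
```

In this toy data the single word "ontology" is dropped from both sets even though it is frequent, which mirrors the trade-off the text acknowledges.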

We assume that a hot or important topic can be found in 2007. When a topic is hot or important, discussion of the topic will increase regardless of when the topic was introduced. The year 2008 is not chosen because it has not yet ended; thus, the collections for conferences and journals will not be complete. Each journal and conference is assumed to have an equal research position, without considering its priority or importance. Furthermore, the database used in this study records only the volume of papers, not the frequency with which terms occur in a document. Thus, regardless of how many times a research topic is mentioned in a paper, the research topic is counted as one paper. In other words, this study does not consider the weight of a topic. The research topics identified by the approaches described above are input into the ACM search engine, and their YDP is recorded separately for conferences and journals. Finally, a table is obtained that has the same format as Table 7-1.
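The counting rule above (one paper counts once per topic, however often the topic is mentioned) can be illustrated with a small Python sketch. The study obtains its counts from the ACM search engine; the local `papers` records, their field names, and the substring match used here are hypothetical stand-ins for that search.

```python
from collections import defaultdict


def yearly_paper_counts(papers, topics, venue):
    """Per topic and year, count papers of one venue type that mention the topic."""
    counts = {topic: defaultdict(int) for topic in topics}
    for paper in papers:
        if paper["venue"] != venue:
            continue
        abstract = paper["abstract"].lower()
        for topic in topics:
            # A paper adds 1 no matter how many times the topic appears in it.
            if topic in abstract:
                counts[topic][paper["year"]] += 1
    return {topic: dict(sorted(by_year.items())) for topic, by_year in counts.items()}


# Hypothetical records standing in for search results.
papers = [
    {"year": 2005, "venue": "conference",
     "abstract": "Data mining for logs. Data mining again."},
    {"year": 2006, "venue": "conference", "abstract": "A data mining pipeline."},
    {"year": 2006, "venue": "conference", "abstract": "Query expansion methods."},
    {"year": 2006, "venue": "journal", "abstract": "Data mining survey."},
]

conference_counts = yearly_paper_counts(papers, ["data mining"], "conference")
journal_counts = yearly_paper_counts(papers, ["data mining"], "journal")
print(conference_counts)
print(journal_counts)
```

The first record mentions the topic twice but contributes only one paper to 2005, which is the unweighted behaviour the text describes.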

After determining the volume of published papers in each year for Table 7-1, Algorithm 7-1 is applied to determine which year is the $C_F$, and Formulas (7-1) and (7-2) are applied to compute the $CNI_k$ and $JNI_k$. The values are listed in the $NI_k$ table, which is formatted the same as Table 7-2. Algorithm 7-2 and Formulas (7-3) and (7-4) are used to compute $CPVI_k$ and $JPVI_k$, which are listed in the $PVI_k$ table, which is formatted the same as Table 7-4. The values can be utilized to generate the emerging topic detection index. Finally, Algorithm 7-3 is used to calculate the JDP using JNI and JPVI and the CDP using CNI and CPVI. Formula (7-5) is then used to compute the VDP of conferences and journals.

The emerging topic detection index helps in detecting the DP of a research topic in the YDP. This study continues using the YDP of conferences and journals to develop the emerging topic detection table, which is used to detect whether a topic warrants further research. Each research topic that has a PDY exceeding 3 years is used to compute the VDP. The median of the VDPs for the same year is used to construct the table. This study does not begin in the second year because, if a topic has developed for only 2 years, its DP must lie between the first year and the second year. After the third year, the VDP and DP vary, and the $NI_k$ and $PVI_k$ fall into different blocks. The emerging topic detection table uses the median VDP for each year. However, if a topic develops for only 3 years, its value is not used for the VDP of the fourth and later years.
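The way the table aggregates topics of different lengths can be sketched as follows. This is a minimal illustration, not the study's procedure: the VDP values are invented numbers, each topic is assumed to supply one VDP per development year, and topics with exactly 3 years are assumed to contribute to year 3 (the text can also be read as requiring more than 3 years).

```python
from statistics import median


def detection_table(vdp_by_topic, first_year=3):
    """Median VDP per development year, starting at first_year.

    vdp_by_topic maps a topic to its VDP values for development years 1, 2, 3, ...
    A topic contributes to a year only if it developed for at least that long.
    """
    longest = max((len(values) for values in vdp_by_topic.values()), default=0)
    table = {}
    for year in range(first_year, longest + 1):
        values = [v[year - 1] for v in vdp_by_topic.values() if len(v) >= year]
        if values:
            table[year] = median(values)
    return table


# Hypothetical VDP series; the real values come from Formula (7-5).
vdp_by_topic = {
    "topic a": [0.0, 0.25, 0.5, 0.5, 1.0],  # developed for 5 years
    "topic b": [0.0, 0.5, 0.75],            # developed for 3 years only
    "topic c": [0.25, 0.25, 0.25, 0.75],    # developed for 4 years
}

table = detection_table(vdp_by_topic)
print(table)
```

Here "topic b" contributes to the year-3 median but is left out of years 4 and 5, matching the rule that a topic's value is not used beyond the years it actually developed.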

8.2 Experimental Results

This study examined 35 journal issues and 137 conferences, representing 689 journal papers and 5154 conference papers from 2007. TextAnalyst is used to extract the terms mentioned more than 3 times in the same publication in a year from conferences and journals. Single-word terms are deleted, and composite words and abbreviations are retained as candidate research topics. The number of candidate research topics from conferences is 1791 and that from journals is 311. The intersection set has 89 topics, which are the research topics. The intersection ratio is almost 5% for conferences and nearly 29% for journals, indicating that journal topics are more convergent. The range of conference topics is broader than that of journals, which makes it easier to discover new topics in conference papers; however, conference topics are also more divergent than journal topics.
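The two intersection ratios follow directly from the counts reported above. The short Python check below uses only those three numbers; the variable names are illustrative.

```python
# Counts reported in the text for 2007.
conference_candidates = 1791
journal_candidates = 311
shared_topics = 89

# Share of each candidate set that also appears in the other publication type.
conference_ratio = shared_topics / conference_candidates
journal_ratio = shared_topics / journal_candidates

print(f"conference: {conference_ratio:.2%}")
print(f"journal: {journal_ratio:.2%}")
```

The same 89 shared topics make up a much larger fraction of the smaller journal set, which is why the journal ratio is the higher of the two.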

This study suggests that journals may concentrate on a set of well-defined topics, whereas conferences in the same year look ahead to new topics, so the two sets cannot reach a very high intersection ratio. Another probable reason is that conferences are more divergent than journals: the topics they discuss are scattered across different fields and cannot cover the topics of journals, since the volume of journal papers in each year is small.

Only one topic, “cutaway illustrations”, started in 2007 in both conferences and journals. We assume that if a research topic is valuable, it will survive into 2007. The proportion of research topics that match this assumption is 98.88%; “cutaway illustrations” is the exception, since it started in 2007 and no data exist before 2007. Thus, cutaway illustrations cannot be viewed as an important and valuable research topic.

Using the CDP produced by CNI and CPVI, and the JDP produced by JNI and JPVI, this study obtains the YCDP, which is the year of the CDP, and the YJDP, which is the year of the JDP. For the same research topic, the CDP and the JDP have a sequential relationship, which shows which type of curve generates the DP first. The YCDP and YJDP can therefore be used to determine which type of paper leads the trend.

Generally, one would assume that whichever type of publication, conference or journal, publishes the first paper on a topic will also have the first DP. However, this is not always correct once the NI and PVI are taken into account.

Of the 89 research topics, only 5 are published in journals before conferences; however, 11 are published first at conferences. Moreover, 1 topic has a DP later than that for conferences, indicating that the first publication year is not the only factor to consider when determining the lead position. When the NI and PVI are also considered, the outcome changes. In total, 87.64% of research topics are published by conferences first. On average, conference papers are published 4.26 years ahead of journal papers; when journals lead a topic, journal papers are published 3.5 years ahead of conference papers.
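As a quick arithmetic check of the share reported above, assuming the percentage is taken over all 89 research topics (the count of 78 is inferred from the percentage and is not stated in the text):

```latex
\[
\frac{78}{89} \approx 0.8764 = 87.64\%, \qquad 89 - 78 = 11
\]
```

Under that assumption, 78 topics are led by conferences and the remaining 11 are not.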

This investigation confirms that researchers can discover new trends for research topics from conference papers. The research findings strongly support the hypothesis.

85.42% of the data nodes collected from 1990 to 2007 show that the topics of conference papers in a given year influence the topics of journal papers in the same year and the following two years. In other words, researchers can mine new issues from conference papers.

This study and previous studies can be used to validate the lead relationship. This study investigates 89 research topics, while previous work focused on similarity over 3 years during 1991–2007. Although the data unit is different, conferences lead journals for 87.64% of the topics in this study, which is comparable to the 85.42% reported in previous work. This agreement supports the effectiveness of this study and indicates that the indices are useful and accurate.

8.3 How to Use the Emerging Topic Detection Table to Predict Whether a Topic