
4. The experimental procedure

4.2 Experimental design

To verify the accuracy and effectiveness of the proposed method, we design an experiment that applies the proposed indices. The results obtained in this study are then compared with those of previous work to determine whether they are consistent.

4.2.1. Choosing the field and data resources

Before determining whether a topic is important, we first choose the data field and database, in this case the ACM Digital Library. The ACM is the largest and oldest academic community in the field of education and computer science. It has provided a platform for exchanging information, innovation and discoveries since 1947. ACM members belong to the information systems and computer science community and include professors, technicians, and students in industry, academia and public services in over 100 countries.

The data range is defined by browsing the journals and transactions published by the ACM that appear in its digital library, excluding magazines; there are 35 in total. The IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) and the IEEE/ACM Transactions on Networking (TON) are not published by the IEEE, and their range of discussion is far from that of the conferences held by the ACM. Conference data published by the ACM are also used; in total, the ACM held 137 conferences. Some conference papers in the database are not formal papers but student papers, short papers, poster papers, keynote speeches, tutorials, and demo abstracts. These are excluded from this study because they do not present new issues.

4.2.2. Selection of the descriptors

A paper needs descriptors that describe its contents. The research topic of a paper is extracted based on these descriptors. We use the following four descriptors referring to paper content:

1. Title: the title of a paper is treated as a condensed description of the entire text. Thus, the essence of the paper is captured concisely within a limited number of words. Words are sometimes coined by the authors themselves, meaning new trends are concealed within titles.

2. Abstract: when papers are reviewed, the abstract gives a rough grasp of the content within a short period of time. Abstracts can therefore illustrate the content of a research paper far more explicitly than titles, but they contain many times more words. Consequently, the weight of each word as an indicator of the paper's meaning is diluted.

3. Keywords: keywords have the highest density in knowledge, but cannot describe a new trend. Authors must identify the keywords from the abstract. The paper can then be searched according to its keywords. In other words, keywords are expressive words that are most widely adopted by researchers for a particular concept within the same domain. Therefore, researchers in a particular research domain take a long time and much effort to reach a consensus that enables concepts to be translated into keywords. This process is usually time-consuming.

Keywords in research papers are therefore understood to achieve a high density of knowledge: they are filtered and crystallized by many researchers into a single accumulated consensus on a concept, which lets them express a paper far more precisely than titles. However, keywords can rarely identify new trends.

This is because keywords relate to well-known concepts, and domain experts need a long time to reach a consensus about a concept. This study therefore concludes that keywords do not describe the content of a paper as well as its title does.

4. Full text: the full text includes every concept the researcher uses concerning the subject, yet individual words embody very little substance. The full text obviously preserves the complete content, but it is a compilation of an immense load of information that far exceeds that of titles, abstracts and keywords. The degree to which each phrase can express the concepts of the paper is therefore small, and using the full text to describe the content of a paper would waste resources and time.

Authors use keywords to characterize their papers. Consequently, once a term has become a keyword, it is a backward term, reflecting an established concept rather than a new trend. Although the full text contains the most information, it is a low knowledge-density descriptor: many words and terms can be used to represent a research topic, so typical information retrieval techniques without human judgment tend to extract many terms that do not exactly represent a topic. Therefore, two descriptors, the title and the abstract, are used in this pilot study.

The descriptors are utilized for data mining and information retrieval. We search the ACM’s digital library to identify journal and conference papers containing a term.

The terms extracted from titles and abstracts in 2007 are used for comparison.

Based on previous work, we argue that knowledge density in the title is higher than that in the abstract. Additionally, this study finds that, because the abstract contains more words, the information embedded in it is more complete than that in a title. Excluding stop-words, the abstract contains 12-25 times more terms than the title. Therefore, this work uses the abstract as the study descriptor.
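The comparison of term counts between abstract and title can be illustrated with a minimal sketch. The stop-word list, the tokenization rule and the function names below are assumptions made for illustration; the text does not specify how the study counted terms.

```python
import re

# Illustrative stop-word list (assumption); the study's actual list is not given.
STOP_WORDS = {"and", "or", "the", "a", "an", "of", "in", "for", "to",
              "on", "with", "is", "are", "we"}


def content_terms(text):
    """Return the lowercased tokens of `text` that are not stop-words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


def abstract_to_title_ratio(title, abstract):
    """Ratio of non-stop-word terms in the abstract to those in the title."""
    n_title = len(content_terms(title))
    if n_title == 0:
        raise ValueError("title contains no terms after stop-word removal")
    return len(content_terms(abstract)) / n_title
```

Applied to each paper's title and abstract, a ratio of this kind is what the reported range of 12-25 describes.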

4.2.3. Investigating extracted topics

To avoid unnecessarily large term vectors, a word is treated as a term only if it appears in the training data at least three times and is not a “stop-word” (e.g. “and”, “or”) (Joachims, 1998). Candidate research topics are extracted from terms rather than from frequently used words. Instead of a single word, a composite word or abbreviation is used as the candidate research topic. Although this approach overlooks topics represented by a single word, such as ontology, a single-word topic must be identified by a person. Candidate research topics composed of composite words or abbreviations therefore have better properties than single-word topics, which require human judgment. Conferences and journals each have their own candidate research topics.
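The filtering rules above can be sketched as follows. The threshold of three occurrences and the exclusion of stop-words come from the text; everything else is an assumption for illustration: the stop-word list, treating a composite word as a pair of adjacent non-stop-word tokens, and treating an abbreviation as an all-uppercase token of two or more characters. The text does not state how composite words and abbreviations were actually recognized.

```python
import re
from collections import Counter

# Illustrative stop-word list (assumption); the study's actual list is not given.
STOP_WORDS = {"and", "or", "the", "a", "an", "of", "in", "for", "to",
              "on", "with", "is", "are"}
MIN_COUNT = 3  # from the text: a term must appear at least three times


def tokenize(text):
    """Split text into alphanumeric tokens, keeping the original case."""
    return re.findall(r"[A-Za-z][A-Za-z0-9]*", text)


def is_abbreviation(token):
    """Assumed rule: an all-uppercase token of two or more characters."""
    return len(token) >= 2 and token.isupper()


def candidate_topics(documents, min_count=MIN_COUNT):
    """Return composite words and abbreviations seen at least `min_count` times."""
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        # Abbreviations are candidates on their own.
        for tok in tokens:
            if is_abbreviation(tok):
                counts[tok] += 1
        # Composite words: adjacent pairs of non-stop-word tokens (assumed rule).
        for first, second in zip(tokens, tokens[1:]):
            if first.lower() in STOP_WORDS or second.lower() in STOP_WORDS:
                continue
            counts[f"{first.lower()} {second.lower()}"] += 1
    return {term for term, n in counts.items() if n >= min_count}
```

Single-word topics such as "ontology" are deliberately not produced by this sketch, which mirrors the limitation acknowledged in the text.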

To determine which one is a leading trend, we examine the intersection of candidate research topics between conferences and journals. This intersection represents the research topics in this study.
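A minimal sketch of this intersection step is given below. It assumes that the candidate topics of all conferences are pooled, the candidate topics of all journals are pooled, and the two pools are then intersected; the text does not say whether the intersection was taken this way or venue by venue. The function name and the input layout are hypothetical.

```python
def research_topics(conference_topics, journal_topics):
    """Intersect pooled conference candidates with pooled journal candidates.

    Both arguments map a venue name to its set of candidate research topics.
    """
    in_conferences = set().union(*conference_topics.values())
    in_journals = set().union(*journal_topics.values())
    return in_conferences & in_journals
```

A topic that appears only in conferences, or only in journals, is dropped by this step.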

We assume that a hot or important topic can be found in 2007. When a topic is hot or important, discussion will increase regardless of when the topic was introduced.

The year 2008 was not chosen because it had not yet ended when this study was carried out, meaning that the collections for conferences and journals would not be complete. We assume that all journals and conferences have equal standing, without considering priority or importance. Furthermore, the database records only the number of papers, not how often a term occurs within a document.

Thus, regardless of how many times it is mentioned in a paper, a research topic is counted as one paper. In other words, we do not consider the weight of a topic. The approach mentioned above is used to identify research topics, which are then input into the ACM search engine where their YDP is recorded based on the type of conference or journal. (After determining the volume of published papers in each year, Algorithm 3-1 is applied to determine which year is the C_F, and Formulas (3-1) and (3-2) are applied to compute the CNI and JNI. The values are listed in the NI_k table, which is formatted the same. Algorithm 3-2 and Formulas (3-3) and (3-4) are used to compute CPVI and JPVI, which are listed in the PVI_k table, which is formatted the same. The values can be utilized to generate the emerging topic detection index. Finally, Algorithm 3-3 is used to calculate the JDP using JNI and JPVI, and the CDP using CNI and CPVI. Formula (3-5) is then used to compute the VDP for conferences and journals.)
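The counting rule stated at the start of this passage (a paper counts once for a topic, however often the topic is mentioned, with conferences and journals recorded separately) can be sketched as follows. The study obtained these counts from the ACM search engine; the local list of paper records, its field names and the substring match used here are stand-ins assumed for illustration only.

```python
from collections import defaultdict


def yearly_paper_counts(papers, topic):
    """Count papers mentioning `topic`, per venue type and per year.

    `papers` is an iterable of dicts with the hypothetical keys
    'year', 'venue_type' ('conference' or 'journal') and 'text'.
    Each paper contributes at most 1, regardless of how often it
    mentions the topic.
    """
    counts = defaultdict(lambda: defaultdict(int))
    needle = topic.lower()
    for paper in papers:
        if needle in paper["text"].lower():
            counts[paper["venue_type"]][paper["year"]] += 1
    return {venue: dict(by_year) for venue, by_year in counts.items()}
```

The per-year volumes produced this way are the kind of input that the algorithms and formulas of Section 3 operate on; those computations are not reproduced here.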

The emerging topic detection index helps in detecting the DP of a research topic in the YDP. We continue by using the YDP for conferences and journals to develop an emerging topic detection table for detecting whether a topic warrants further research. Each research topic that has a PDY exceeding 3 years is used to compute the VDP; the median of VDPs for the same year is used to construct the table. We do not begin in the second year because, if a topic has developed for only 2 years, its DP must lie between the first and second year. After the third year, the VDP and DP will vary, and NI_k and PVI_k fall into different blocks. The emerging topic detection table uses the median VDP for each year. However, if a topic develops for only 3 years, its value is not used for the VDP of the fourth year and later.
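The construction of the table from per-year medians can be sketched as below. The sketch assumes that VDP values have already been computed for each topic and each year of its development (counted from 1), that the table starts at the third year, and that a topic contributes only to the years it actually reached. The function name and the input layout are hypothetical, and the VDP computation itself (Formula (3-5)) is not reproduced.

```python
from statistics import median

FIRST_TABLE_YEAR = 3  # assumption: the table starts at the third year of development


def detection_table(vdp_by_topic):
    """Build {development_year: median VDP} across topics.

    `vdp_by_topic` maps a topic to {development_year: vdp}. A topic that
    developed for only 3 years has no entry for year 4 and later, so it
    does not influence those medians.
    """
    by_year = {}
    for series in vdp_by_topic.values():
        for year, vdp in series.items():
            if year >= FIRST_TABLE_YEAR:
                by_year.setdefault(year, []).append(vdp)
    return {year: median(values) for year, values in sorted(by_year.items())}
```

Using the median rather than the mean keeps a single topic with an unusual VDP from dominating a row of the table.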

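Developing Models and Techniques for Academic Intelligence, Indicators, and Impact in Documents: Information Retrieval as an Example (pp. 30-34)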