
Chapter 7 The Indices for Emerging Topic Detection

7.1 Novelty of Emerging Topics

7.1.2 Novelty Index

National Chengchi University (國立政治大學)


Before defining the NI, we must define the potential development year (PDY). The PDY is the period of a topic from its first year as a research topic to the current year, where the first year is chosen so that none of the following years contains zero papers.

The research topic “XML” is taken as an example because XML is a well-known and established topic, and researchers in the field have reached a consensus on the lifecycle of its emergence. Hence, the study uses XML to illustrate the proposed method throughout the paper.

Table 7-1 presents the datasets collected from the ACM database. The Type column separates journal papers (“J”) from conference papers (“C”). This study records the published volume for each paper type and each year separately. For example, the value of type J in 2008 is 12, indicating that 12 journal papers focused on XML in that year; by comparison, 114 conference papers focused on XML in 2008. Table 7-1 also identifies the first conference paper that referred to XML, which was published in 1989. However, no paper discussed XML during 1990–1993. One paper in 1994 addressed XML, but no paper did so during 1995–1998. This indicates that the first paper did not catch the attention of researchers and therefore cannot be considered the start of the emerging topic.

Table 7-1 The Volume of Published Papers on XML in Each Year: An Example.

Type \ Year   2008  2007  2006  2005  2004  2003  2002  2001  2000  1999
J_j (XML)       12    11    20    15    11     7    11     3     0     1
C_i (XML)      114   152   155   191   179   147   156    78    30    11

Type \ Year   1998  1997  1996  1995  1994  1993  1992  1991  1990  1989
J_j (XML)      N/A   N/A   N/A   N/A   N/A   N/A   N/A   N/A   N/A   N/A
C_i (XML)        0     0     0     0     1     0     0     0     0     1

Hence, taking 1989 as the first year of the PDY would have no substantive meaning for the research. Conversely, from 1999 onward, XML was the main topic of a considerable number of papers. Furthermore, we assert that if a topic is not discussed in any paper during a year, the topic does not have the potential to be an emerging topic at that time.

Therefore, this study defines the first year of the PDY for XML as 1999 for conferences and 2001 for journals. The NI is defined as the inverse of the PDY and indicates whether a topic is novel. For example, if the PDY of a topic is 5, the NI of the topic in the 5th year is 1/5 = 0.2. That is, in its nth development year, the NI is 1/n. This study uses the proposed Algorithm 7-1 to identify the first PDY.

Algorithm 7-1: Identifying which year is the C_F of the research topic

Input: C_i, the published volume of conference papers in the ith year
Output: C_F, the first PDY in conference papers for a research topic

For i = 2008 to 1989
    If C_i > 0 then
        C_F = i
    Else
        Return C_F = i + 1
        Break
    End If
Next

where

C_i is defined as the published volume of conference papers in the ith year. For XML, i = 1989, ..., 2008.

J_j is defined as the published volume of journal papers in the jth year. For XML, j = 1999, ..., 2008.

C_F is defined as the first PDY in conference papers for a research topic.

J_F is defined as the first PDY in journal papers for a research topic.

CNI_k(Topic) is defined as the NI for the kth year for a research topic in conference papers.

JNI_k(Topic) is defined as the NI for the kth year for a research topic in journal papers.
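The backward scan of Algorithm 7-1 can be sketched in Python as follows. This is a minimal illustration rather than the study's implementation: the function name `first_pdy` and the dictionary layout are assumptions, and the volumes are copied from Table 7-1 (years marked N/A are simply omitted).

```python
from typing import Dict


def first_pdy(volumes: Dict[int, int]) -> int:
    """Return the first PDY: scan from the latest year backward and
    stop at the first year whose published volume is zero."""
    years = sorted(volumes, reverse=True)
    first = years[0]
    for year in years:
        if volumes[year] > 0:
            first = year        # provisional first PDY
        else:
            return year + 1     # gap found: the PDY starts the following year
    return first                # no gap: the PDY starts at the earliest year


# Published volumes for XML from Table 7-1 (hypothetical variable names).
CONFERENCE = dict(zip(range(1989, 2009),
                      [1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
                       11, 30, 78, 156, 147, 179, 191, 155, 152, 114]))
JOURNAL = dict(zip(range(1999, 2009),
                   [1, 0, 3, 11, 7, 11, 15, 20, 11, 12]))

print(first_pdy(CONFERENCE))  # 1999
print(first_pdy(JOURNAL))     # 2001
```

Like the pseudocode, the sketch assumes the most recent year has a non-zero volume; otherwise it would return a year beyond the data.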

We assume a research topic is new when it is first published; thus, the NI in the first year should be 1 = 100%. In its second year, the NI should be 1/2 = 50%. The value of the NI is therefore normalized to 0–1. With $C_F$ obtained from Algorithm 7-1, $CNI_k(\mathit{Topic})$ is given by Formula (7-1).

$$CNI_k(\mathit{Topic}) = \frac{1}{k - C_F + 1} \tag{7-1}$$

Furthermore, the formula for $JNI_k(\mathit{Topic})$ is Formula (7-2).

$$JNI_k(\mathit{Topic}) = \frac{1}{k - J_F + 1} \tag{7-2}$$

… $C_F$ is set to 2008 temporarily. The loop then continues with $i = 2007$ and so on down to $i = 1998$, where $C_{1998} = 0$; the loop breaks and returns $C_F = i + 1 = 1999$. That is, no papers in the $i$th year focused on XML, so that year is not the first PDY; the real $C_F$ is the year after $i$, namely $i + 1$.

After determining that $C_F = 1999$ for XML, $CNI_{1999}(XML) = 1$. By comparison, to compute the conference Novelty Index (CNI) of 2008, this study substitutes $k = 2008$ into Formula (7-1) and obtains $CNI_{2008}(XML) = 1/(2008 - 1999 + 1) = 1/10 = 0.1$. Table 7-2 lists the NI for each year, calculated using Algorithm 7-1 and Formulas (7-1) and (7-2).
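As a quick check of the NI calculation, the sketch below evaluates NI = 1/n for every development year of XML. The function name `novelty_index` is an assumption for illustration; the first PDY values (1999 for conferences, 2001 for journals) are those stated in the text.

```python
def novelty_index(year: int, first_pdy_year: int) -> float:
    """NI of a topic in `year`, i.e. 1/n where n is the development year."""
    n = year - first_pdy_year + 1
    if n < 1:
        raise ValueError("year precedes the first PDY")
    return 1.0 / n


# Conference NI (first PDY 1999) and journal NI (first PDY 2001) for XML.
cni = {y: round(novelty_index(y, 1999), 3) for y in range(1999, 2009)}
jni = {y: round(novelty_index(y, 2001), 3) for y in range(2001, 2009)}

print(cni[1999], cni[2008])  # 1.0 0.1
print(jni[2001], jni[2008])  # 1.0 0.125
```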

Table 7-2 The NI of XML Example in Each Year.

Year

7.1.3 Published Volume Index

As the NI measures novelty only, this study determines whether a research topic is emerging or hot based on the volume of papers published in the same period. Conversely, if a topic has been discussed over a long period in a vast number of publications, it is likely mature: the topic is well developed and may cross into another domain, so that the focus turns to its combination with and application in that domain rather than the topic itself. Consequently, if one focuses solely on the volume of published papers for a topic, one cannot determine whether the topic has potential research value. Notably, as the volume of papers increases, topic impact decreases. The conventional method, which uses the frequency curve of published volume, is inadequate for determining whether a topic is hot or emerging: only after the volume declines can one conclude that the topic has been hot. The traditional frequency curve is therefore a backward-looking index.

An index is therefore needed that considers the volume of published papers and still reflects topic hotness over time. The curve of cumulative relative frequency reflects the variation in published volume. Cumulative relative frequency performs better than the raw frequency of a topic because it detects the topic's status by comparing the topic with its own growth, without waiting for the topic to mature. The PVI is defined as the cumulative relative frequency of the kth development year, normalized to 0–1.

Algorithm 7-2 is used to compute the PVI of the kth year, and Formula (7-3) comprises the equations for CPVI.

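Algorithm 7-2: Computes the PVI, taking $JPVI_k(\mathit{Topic})$ as an example

Input: $Sum_J$, $Sum_i$, $J_i$

[…] the same paper type.

$CPVI_k(\mathit{Topic})$ is the PVI of a topic in the kth year published in conferences and is formulated as Formula (7-3).

$$CPVI_i(\mathit{Topic}) = \frac{Sum_i}{Sum_C} \tag{7-3}$$

$JPVI_k(\mathit{Topic})$ is the PVI of a topic in the kth year published in journals and is formulated as Formula (7-4).

$$JPVI_i(\mathit{Topic}) = \frac{Sum_i}{Sum_J} \tag{7-4}$$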

This study takes XML as an example to illustrate the application of the PVI in Table 7-2. As discussed in Section 7.1.1, the question of concern to researchers is when a topic becomes an emerging topic with no break in subsequent years. The first time a research topic is discussed is not necessarily an important point. Although Table 7-1 shows 1999 as the first year in which XML appeared in journals, this study uses Algorithm 7-1 to find $J_F = 2001$, the real first PDY. The PDY $n$ for 2001–2008 is $2008 - 2001 + 1 = 8$. Table 7-3 is used with Algorithm 7-2 to compute the PVI for each year. For 2003, $J_{2003} = 7$ and $Sum_{2003} = Sum_{2002} + J_{2003} = 21$, so that $JPVI_{2003}(XML) = 21/90 = 0.233$, where 90 is the cumulative journal volume up to 2008 (Table 7-4). The PVI of other years is computed in the same manner and recorded in Table 7-3.
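The running-sum computation described above can be sketched as follows. The function name `published_volume_index` is an assumption; the journal volumes are taken from Table 7-1, starting at the first PDY of 2001, and the result is the cumulative volume of each year divided by the cumulative volume of the latest year.

```python
from itertools import accumulate
from typing import Dict


def published_volume_index(volumes: Dict[int, int]) -> Dict[int, float]:
    """Cumulative relative frequency per year, normalized to 0-1."""
    years = sorted(volumes)
    running = list(accumulate(volumes[y] for y in years))  # Sum_i for each year
    total = running[-1]                                    # Sum up to the present year
    if total == 0:
        raise ValueError("no papers published in the period")
    return {y: s / total for y, s in zip(years, running)}


# Journal volumes for XML from the first PDY (2001) to 2008, from Table 7-1.
JOURNAL = {2001: 3, 2002: 11, 2003: 7, 2004: 11,
           2005: 15, 2006: 20, 2007: 11, 2008: 12}

jpvi = published_volume_index(JOURNAL)
print(round(jpvi[2003], 3))  # 0.233
```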

Table 7-3 The PVI of XML Example in Journals for Each Year

Year 2008 2007 2006 2005 2004 2003 2002 2001

According to the two proposed detection indices, NI and PVI, the NI is high when the PDY is early in the topic's lifecycle (Table 7-2). For example, $J_F = 2001$ and $JNI_{2001}(XML) = 1$, compared with $JNI_{2002}(XML) = 1/2 = 0.5$. When the PDY is late in the lifecycle, the NI decreases. Conversely, Table 7-3 indicates that the PVI behaves in the opposite way. As the published volume reveals the amount of discussion of a research topic, the PVI reflects the relative degree of growth in volume. The two indices can be combined, using the values in Table 7-4, to determine the development of XML.

Using the data in Table 7-4, one can draw the curves of JPVI (Journal PVI), JNI (Journal NI), CPVI (Conference PVI) and CNI (Conference NI). This study finds that the NI and PVI curves form a trade-off: a new topic lacks the volume needed to be a hot topic, and once a hot topic has existed for a period, it loses its novelty.

Consequently, the point at which the NI and PVI are jointly maximal is the intersection of the curves, which is called the detection point (DP). Fig. 7-1 shows the detection point of the emerging topic detection index for XML.


Table 7-4 The Values of Indices for XML Example in Emerging topic Detection.

Year          2008   2007   2006   2005   2004   2003   2002   2001   2000   1999
J_j             12     11     20     15     11      7     11      3
Sum_i           90     78     67     47     32     21     14      3
JPVI_j(XML)  1.000  0.867  0.744  0.522  0.356  0.233  0.156  0.033
JNI_j(XML)   0.125  0.143  0.167  0.200  0.250  0.333  0.500  1.000

Year          2008   2007   2006   2005   2004   2003   2002   2001   2000   1999
C_i            114    152    155    191    179    147    156     78     30     11
Sum_i         1213   1099    947    792    601    422    275    119     41     11
CPVI_i(XML)  1.000  0.906  0.781  0.653  0.495  0.348  0.227  0.098  0.034  0.009
CNI_i(XML)   0.100  0.111  0.125  0.143  0.167  0.200  0.250  0.333  0.500  1.000

* The bolded value is where the DP decreases.

The DP is defined as the point at which the NI and PVI curves intersect. We suggest that the DP can be used to determine whether a topic is hot and emerging, as it is where the two indices are jointly maximal and it accounts for both novelty and hotness. The DP is further divided into the conference detection point (CDP) and the journal detection point (JDP). Algorithm 7-3 shows how to compute the DP of JPVI and JNI.

Algorithm 7-3: How to compute the DP of JPVI and JNI

Input: present year, JPVI_i, JNI_i, JPVI_(i+1), JNI_(i+1)
Output: J-detection point

For i = present year To J_F
    If JPVI_i = JNI_i Then
        Return the J-detection point = i
    ElseIf JPVI_i < JNI_i And JPVI_(i+1) > JNI_(i+1) Then
        Return the J-detection point = (2i + 1)/2
    End If
Next
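A minimal Python sketch of the detection-point search follows. It assumes the crossing is located at the year i where the PVI is still below the NI while the PVI exceeds the NI in year i + 1, which is the only ordering possible when the PVI rises and the NI falls over time; the function name `detection_point` and the use of `math.isclose` for the equality test are assumptions. The index values are copied from Table 7-4.

```python
import math
from typing import Dict, Optional


def detection_point(pvi: Dict[int, float], ni: Dict[int, float]) -> Optional[float]:
    """Return the year (possibly a half year) at which the PVI and NI curves cross."""
    for i in sorted(pvi, reverse=True):          # present year back to the first PDY
        if math.isclose(pvi[i], ni[i]):
            return float(i)                      # curves meet exactly in year i
        nxt = i + 1
        if nxt in pvi and pvi[i] < ni[i] and pvi[nxt] > ni[nxt]:
            return (2 * i + 1) / 2               # crossing lies between i and i + 1
    return None                                  # no crossing in the period


# Index values for XML from Table 7-4.
JPVI = dict(zip(range(2001, 2009),
                [0.033, 0.156, 0.233, 0.356, 0.522, 0.744, 0.867, 1.000]))
JNI = dict(zip(range(2001, 2009),
               [1.000, 0.500, 0.333, 0.250, 0.200, 0.167, 0.143, 0.125]))
CPVI = dict(zip(range(1999, 2009),
                [0.009, 0.034, 0.098, 0.227, 0.348, 0.495, 0.653, 0.781, 0.906, 1.000]))
CNI = dict(zip(range(1999, 2009),
               [1.000, 0.500, 0.333, 0.250, 0.200, 0.167, 0.143, 0.125, 0.111, 0.100]))

print(detection_point(JPVI, JNI))  # 2003.5
print(detection_point(CPVI, CNI))  # 2002.5
```

The half-year results mean the journal curves cross between 2003 and 2004 and the conference curves cross between 2002 and 2003.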


Fig. 7-1 The Detection Point of Emerging Topic Detection Index of XML.

7.2 Information Produced by the Emerging Topic Detection Indices

According to the DP indices, if a topic has been published in both conference and journal papers, the published data can be used to draw four curves and obtain the JDP and CDP using Algorithm 7-3. Additionally, the indices generate the year of the DP (YDP) and the value of the DP (VDP).

7.2.1 Year of the Detection Point

The YDP is the X-axis value of the DP. The YDP indicates the year in which a topic reaches the emergence threshold in its development. The YDP is further divided into the year of the CDP (YCDP) and the year of the JDP (YJDP): the YCDP is the X-axis value of the CDP, and the YJDP is the X-axis value of the JDP. For XML, although the CDP lies near 2002, the graph shows that the curves have not yet intersected in 2002; the topic does not become an emerging topic until 2003. Consequently, this study takes 2003 as the YCDP and 2004 as the YJDP.

7.2.2 The Detection Point Value

The value of the DP (VDP) is the Y-axis value of the DP. This value gives the NI and the PVI at the DP at the same time. Since the NI and PVI are both normalized to 0–1 and share the Y-axis, the VDP can be expressed by either index; that is, the NI and PVI are equal at the DP. The VDP is further divided into the value of the CDP (VCDP) and the value of the JDP (VJDP). The VCDP is the Y-axis value of the DP for conferences and equals both the CPVI and the CNI there. The VJDP is the Y-axis value of the DP for journal papers and equals both the JPVI and the JNI there.

[Fig. 7-1 plot: x-axis Year (1999–2008), y-axis Index Value (0.000–1.200); curves JPVI, JNI, CPVI and CNI, with the C-detection point and J-detection point marked.]

Since this study uses the year as its unit, a DP that does not fall exactly on a year must lie between two years. The CDP lies between 2002 and 2003, and the VCDP is affected by $CPVI_{2002} = 0.227$, $CPVI_{2003} = 0.348$, $CNI_{2002} = 0.250$ and $CNI_{2003} = 0.200$ (Fig. 7-1). The exact value of the VCDP is the center of those 4 points and is calculated as follows:

$$VCDP = \frac{CPVI_{2002} + CPVI_{2003} + CNI_{2002} + CNI_{2003}}{4} = \frac{0.227 + 0.348 + 0.250 + 0.200}{4} \approx 0.256$$

Formula (7-5) can be used to compute the VDP, where $i$ and $i + 1$ are the two years between which the DP lies.

$$VDP = \frac{PVI_i + PVI_{i+1} + NI_i + NI_{i+1}}{4} \tag{7-5}$$
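The VCDP arithmetic can be checked with a few lines of Python. This sketch takes the "center of those 4 points" to be the arithmetic mean of the four index values on either side of the crossing; the function name `detection_point_value` is an assumption.

```python
def detection_point_value(pvi_before: float, pvi_after: float,
                          ni_before: float, ni_after: float) -> float:
    """Mean of the PVI and NI values in the two years that bracket the DP."""
    return (pvi_before + pvi_after + ni_before + ni_after) / 4


# CPVI and CNI of XML for 2002 and 2003, as quoted in the text.
vcdp = detection_point_value(0.227, 0.348, 0.250, 0.200)
print(round(vcdp, 3))  # 0.256
```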

7.3 The Properties of Emerging Topic Detection Indices

By constructing the emerging topic detection indices and the detection table from the NI and PVI, this study can analyze academic publications and forecast trends.

7.3.1 Novelty Index Properties

This study defines NI = 1/n, where n is the PDY. We suggest that this curve can be used even though it has not been verified. The study supposes that if a relationship exists between conferences and journals, such as the leading and following relationship (Tu & Seng, 2009), the NI will produce the same result for that relationship as any validated index would. Nevertheless, we assert that the NI is a reasonable and convenient index. To determine novelty over the entire lifecycle of a topic, one must obtain the termination date, at which the volume is 0, and determine the novelty for each year relative to that date. For instance, if one knows that a topic has developed for 10 years, the NI is 1 in the first year and 0.9 in the second year, and this process continues until the last year. However, one cannot determine when a topic terminates until it has terminated. Using NI = 1/n avoids this limitation, and we suggest that novelty decreases as the PDY increases. Hence, regardless of the topic, the NI is 1/n at the nth PDY: 1 in the first year, 1/2 = 0.5 in the second year, and so on.
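The contrast between the two ways of assigning novelty can be made concrete with a short sketch. The linear form below is an assumption inferred from the 10-year example in the text (1 in the first year, 0.9 in the second); it needs the total lifetime in advance, whereas 1/n does not. Both function names are hypothetical.

```python
def reciprocal_ni(n: int) -> float:
    """NI used in this study: depends only on the development year n."""
    return 1.0 / n


def linear_ni(n: int, lifetime: int) -> float:
    """Assumed linear alternative: requires the lifetime to be known beforehand."""
    return 1.0 - (n - 1) / lifetime


# Compare both definitions over a 10-year lifecycle.
for n in range(1, 11):
    print(n, round(reciprocal_ni(n), 3), round(linear_ni(n, 10), 3))
```

The reciprocal form can be evaluated in any year of an ongoing topic, which is the practical advantage the paragraph describes.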


7.3.2 Published Volume Index Properties

As mentioned, compared with the traditional frequency measure, the PVI improves the forward-looking effect. This study uses the XML data in Table 7-5 to illustrate the properties of the PVI and how it reflects the emergence of XML.

Table 7-5 The PVI Table of Different Situations in XML Example.

Year                2001  2002  2003  2004  2005  2006  2007  2008
Original-2006          3    11     7    11    15    20   N/A   N/A
Decrease-2008          3    11     7    11    15    20    10     5
Increase-2008          3    11     7    11    15    20    40    80
PVI-2006            0.04  0.21  0.31  0.48  0.70  1.00   N/A   N/A
PVI-2008-decrease   0.03  0.16  0.23  0.36  0.52  0.74  0.94  1.00
PVI-2008-increase   0.02  0.08  0.13  0.19  0.28  0.40  0.57  1.00

The curve of Original-2006 in Fig. 7-2 is the journal data for XML during 2001–2006. The curve of Decrease-2008 represents the situation in which the amount of data decreases after 2006: the amount in 2007 is half of that in 2006 (10), and the amount in 2008 is half of that in 2007 (5). The other situation is Increase-2008, in which the amount of data increases after 2006: the amount in 2007 is twice that in 2006 (40), and the amount in 2008 is twice that in 2007 (80). PVI-2006, PVI-2008-decrease and PVI-2008-increase are the indices of Original-2006, Decrease-2008 and Increase-2008, respectively.

When the published volume increases relative to the past, as in PVI-2008-increase, the PVI forms a concave curve that opens upward. Conversely, when the volume decreases relative to the past, as in PVI-2008-decrease, the PVI forms a convex curve that opens downward. Consequently, comparing the two cases at 2006, the value of PVI-2008-decrease is comparatively larger and its curve has already risen by 2006, indicating that the topic is becoming a hot topic. Conversely, the proportion accounted for by 2006 in PVI-2008-increase is lower, because the largest volume of the topic occurs after 2006; the curve is relatively flat in 2006, indicating that the topic had not yet become a hot topic in 2006.


Fig. 7-2 The PVI Curves of Different Situations in XML Example.

7.3.3 Detection Point Properties

The DP is the intersection of the NI and PVI, and produces the YDP and VDP.

The properties on which the YDP is based follow from the discussions in Sections 7.2.1 and 7.3.2. The cumulative relative frequency is used to determine the DP properties and validate their effectiveness.

1. As the YDP increases, the DP is delayed

This study compares the curve of PVI-2006 with those of PVI-2008-decrease and PVI-2008-increase. Regardless of whether the amount of data increases or decreases, as long as a topic keeps developing (published volume is not 0), the intersection point is delayed. This makes sense because a later YDP means the topic has not yet reached the highest point in its lifecycle and growth stage. Conversely, for PVI-2006, the curves must intersect before 2006, as the YDP is 2004.

2. As the frequency in a year increases, the entire curve rises

Consider PVI-2008-decrease: the highest volume is produced in 2006, and the curve intersects the NI before the DP of PVI-2008-increase. In this case the topic is in its mature stage, and the volume then falls. Conversely, PVI-2008-increase indicates that 2006 was not the peak year of the lifecycle; the topic does not reach its highest volume until 2008. The delayed DP indicates that the topic is not yet hot.

3. The DP time

[Fig. 7-2 plot: x-axis Year (2001–2008), y-axis Index value (0.00–1.20); curves PVI-2006, PVI-2008-decrease, PVI-2008-increase and Novelty, with the DP of each PVI curve marked.]

When a topic has a high PVI in its early stage, the curve rises early and the DP forms in the earlier part of the curve, indicating that the topic is becoming hot at that time. By comparison, if the PVI becomes high only in a late stage, the DP is delayed.

However, when the PVI curve starts to increase, the topic is being discussed and is an emerging topic. The peak is not the point of concern, as the topic is already mature by then. The DP marks the emergence of the topic: it always precedes the present and is a trade-off between the NI and the PVI.

Based on the relationship between conferences and journals, this study can use the proposed emerging topic detection indices to examine that relationship. If the reasoning is correct, the pattern of topics in conferences and journals will be the same regardless of how the NI is defined. This study detects the DP of XML in our database as 2004, which is before the highest amount of data in 2006. Although this study cannot determine whether XML has reached the highest volume in its history or will exceed it later, the DP in 2004 matches the expected date. Hence, the PVI predicts the time at which a topic emerges better than the traditional frequency method.

The value of the DP, whether read from the NI or the PVI, is the maximum attainable in the trade-off and can be used to detect when a topic is emerging. We assert that the DP must occur before the topic becomes hot. Consequently, the DP must lie in the period from the first PDY to the present. Whether or not a topic becomes hot, the DP can still be calculated using the proposed indices, as long as the PDY is more than 2 years.

Hence, this study uses the YDP and VDP indices to identify the situation in which a topic is hot. The emerging topic detection table is used to detect the value of retaining a research topic.

7.4 The Emerging Topic Detection Table

The emerging topic detection table helps in identifying the DP, which includes the YDP and VDP. By comparing conferences and journals, one can determine which reaches the threshold first. If a topic has never been an important topic, the DP year can still be calculated; however, it is worthless. Hence, this study uses the VDPs of each year for conferences and journals, respectively, to develop the emerging topic detection table. Take XML at conferences as an example. From $C_F = 1999$ to the present date of 2008, the PDY is 10 years, with YCDP = 2003 and VCDP = 0.256. However, had the present year been 2003 or 2004, a different YDP and VDP could have been computed at that time. This study uses this property of the VDP to construct the emerging topic detection table: when a topic develops over 10 years, the VDP for each year can be derived. The study identifies all research topics in the ACM database and computes their VDPs for each year. The median VDP is used to avoid confounding by extreme data and to generate the emerging topic detection table.
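How the median step might be organized is sketched below. Everything in it is illustrative: the topic names and VDP values are hypothetical placeholders, not data from the ACM database, and the assumption that the table is keyed by the development year in which each VDP was computed is ours.

```python
from statistics import median
from typing import Dict

# Hypothetical input: topic -> {development year n: VDP computed with year n as the present}.
vdps_by_topic: Dict[str, Dict[int, float]] = {
    "topic_a": {3: 0.50, 4: 0.42, 5: 0.35},
    "topic_b": {3: 0.45, 4: 0.40, 5: 0.30},
    "topic_c": {3: 0.90, 4: 0.38, 5: 0.33},   # 0.90 plays the role of an extreme value
}


def detection_table(vdps: Dict[str, Dict[int, float]]) -> Dict[int, float]:
    """Median VDP across topics for each development year."""
    by_year: Dict[int, list] = {}
    for per_year in vdps.values():
        for n, vdp in per_year.items():
            by_year.setdefault(n, []).append(vdp)
    return {n: median(values) for n, values in sorted(by_year.items())}


table = detection_table(vdps_by_topic)
print(table)  # {3: 0.5, 4: 0.4, 5: 0.33}
```

Using the median rather than the mean keeps the outlying 0.90 from pulling the year-3 entry upward, which is the motivation the text gives for the choice.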

A high VDP represents a new topic: although the volume is small, the PVI is high.

For instance, if the PDY is 2 years and the volumes in the first and second years are both 1, then $PVI_1 = 0.5$ and $PVI_2 = 1$. The VDP must be between 0.5 and 1, but the topic still has room for discussion. A high VDP indicates that the topic retains the potential for further investigation. As the published volume of the topic increases, the DP easily moves later along the curve, regardless of the NI or PVI. For instance, if two papers are published in the third year, then $PVI_1 = 0.25$, $PVI_2 = 0.5$ and $PVI_3 = 1$, while $NI_1 = 1$, $NI_2 = 0.5$ and $NI_3 = 0.333$; thus, VDP = 0.5. The VDP decreases and the DP is delayed; therefore, as the VDP increases, whether a topic warrants
