
National Science Council, Executive Yuan — Research Project Final Report

The Study of Information Conceptualization and Information Change Tracking over the Internet
(網際網路上資訊涵意探究與資訊變化追蹤之研究)

Project type: individual project
Project number: NSC91-2416-H-110-012
Duration: August 1, 2002 to July 31, 2003
Executing institution: Department of Information Management, National Sun Yat-sen University
Principal investigator: 張德民
Project staff: 賴志明, 葉飛
Report type: concise report
Processing: this project is available for public access

October 7, 2003

National Science Council, Executive Yuan — Subsidized Research Project
[x] Final Report  [ ] Interim Progress Report

The Study of Information Conceptualization and Information Change Tracking over the Internet

Project type: [x] individual project  [ ] integrated project
Project number: NSC 91-2416-H-110-012
Duration: August 1, 2002 to July 31, 2003
Principal investigator: 張德民 (Ph.D.)
Project staff: 賴志明, 葉飛

Report type (as required by the approved budget list): [x] concise report  [ ] complete report

Attachments to be submitted with this report:
[ ] report on an overseas business trip or study visit
[ ] report on a business trip or study visit to Mainland China
[ ] report on attending an international academic conference, with the paper presented
[ ] foreign partner's research report for an international cooperative project

Processing: except for industry-academia cooperation projects, industrial technology upgrading and talent cultivation projects, restricted projects, and the cases below, this report is immediately available for public access
[ ] involves patents or other intellectual property rights; available for public access after [ ] one year [ ] two years

Executing institution: Department of Information Management, National Sun Yat-sen University

October 9, 2003


Abstract (translated from Chinese)

The rapid growth of the Internet has made it one of the most important sources of information. When users want to obtain information on a specific topic from the Internet, the tool they most commonly use is a search engine. The results obtained through a search engine, however, are often voluminous and disorganized, and users cannot easily grasp the meaning the information carries. This research therefore proposes an approach to help users analyze the meaning of search results. We first use a general search engine to retrieve information on a specific topic, and then apply the proposed keyword extraction method, RCBKE, to identify the keywords that represent the information and the relationships among them. From these keyword relationships, users can understand the meaning of the retrieved information. Finally, an example demonstrates the practicality of the proposed method.

Keywords: Internet, information concept extraction, keyword extraction

Abstract

Information acquisition over the Internet has become quite popular recently. Users, however, have difficulty understanding the overall concept conveyed by the information retrieved about a specific topic of interest on the Internet. Therefore, in this research, an approach is proposed to help users make sense of the search results for a topic of interest. To be more specific, we first gather information on a user-specified topic from any search engine. We then analyze the overall meaning represented by those pieces of information using the proposed keyword extraction method, RCBKE. This method identifies keywords and their relationships in the given information. In this manner, users can grasp from the keyword relationships the general concepts that the search results indicate. Finally, an example illustrates how our proposed approach works in practice.

Keywords: Internet, Information Concept Extraction, Keyword Extraction

1. Introduction

Information is an aggregation of processed data that renders meaning useful in decision-making and problem-solving processes. The introduction of the Internet has significantly shaped the way information is disseminated because of its openness, dynamics, and convenience. The Internet has become a popular information source where people can easily acquire, transfer, and exchange information. As a result, the demand from general users for finding useful information on the Internet has increased dramatically. In particular, one can search for and gather information about a specific topic of interest via the Internet.

This task, however, can be tedious and difficult because the amount of information on the Internet has proliferated massively. Obtaining desired information from this huge information pool has become a great challenge for users. Search engines that employ information retrieval (IR) techniques have thus been developed to help users retrieve the information items they want.


However, search engines can only help users organize the desired information in a preliminary manner. After the search, users may still face hundreds of ranked results and have to dig through them one by one until they are satisfied with what they have found. This task again troubles users. A feasible approach to tackling this problem is to further analyze the search results by exploring the overall meaning and the implicit relations among them, and then render users an overview with which they can easily project the whole picture emerging from the information retrieved about the topics they are interested in. Helping users further analyze the search results thus becomes an essential issue.

The purpose of this research is thus to propose an approach that depicts an overall picture of the information retrieved over the Internet based on users' interests. The approach searches for and gathers information on specific topics users are interested in, and analyzes the overall meaning and relations represented by the search results. In this manner, users can grasp the general concepts the search results indicate without browsing through all of the retrieved information.

2. Literature Review

In our approach, document keyword extraction and cluster analysis are used to depict the meaning of the retrieved information. In this section, related work on keyword extraction based on lexical cohesion and on clustering relational data is reviewed.

Lexical cohesion refers to the semantic connections between words (Halliday and Hasan, 1976; Morris and Hirst, 1991). In a text document, a sequence of sentences with related words tends to convey information around a certain subtopic. This is known as cohesion. Skorochod'ko (1972) suggested viewing the cohesion in a document as a sequence of densely interrelated subtopics. A text document can be divided into sentences, among which certain words duplicate themselves. These duplications form a kind of intra-structure of subtopics in the document.

Several researchers (Dumais, 1994; Ohsawa et al., 1998) have further applied this idea to segment a document into semantically coherent portions and find keywords accordingly. Ohsawa et al. (1998) proposed the KeyGraph algorithm for extracting keywords that represent the asserted core idea of a document. Its main claim was that terms occurring rarely in a document could still be significantly representative of it. To find such terms, the KeyGraph algorithm clusters the document's terms based on co-occurrences between any two terms. Each cluster represents a concept on which the document is based, and terms that connect several clusters tightly are identified as keywords.

Relational data refer to numerical values representing the relevance degrees of pairs of objects in a data set. Algorithms that generate partitions of relational data are usually referred to as relational (pairwise) clustering algorithms. Relational clustering algorithms can be applied to cluster the terms in a document if appropriate similarity measures are defined to quantify the degree of resemblance between pairs of terms.

There are several well-known relational clustering algorithms in the literature. One of the most popular is the sequential agglomerative hierarchical nonoverlapping (SAHN) model (Sneath and Sokal, 1973), a bottom-up approach that generates crisp clusters by sequentially merging the pair of clusters that are closest to each other at each step. Depending on how "closeness" between clusters is defined, the SAHN model gives rise to the single, complete, and average linkage algorithms.
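The three linkage variants differ only in how the distance between two clusters is computed from the pairwise dissimilarities. A minimal sketch of that difference (the matrix below is illustrative only, not data from this report):

```python
def linkage_distance(dissim, A, B, mode="single"):
    """SAHN cluster-to-cluster distance: 'single' uses the closest pair,
    'complete' the farthest pair, and 'average' the mean over all pairs."""
    pairs = [dissim[a][b] for a in A for b in B]
    if mode == "single":
        return min(pairs)
    if mode == "complete":
        return max(pairs)
    return sum(pairs) / len(pairs)

# Illustrative 4-object pairwise dissimilarity matrix.
D = [[0, 2, 6, 10],
     [2, 0, 5, 9],
     [6, 5, 0, 4],
     [10, 9, 4, 0]]
print(linkage_distance(D, [0, 1], [2, 3], "single"))    # 5
print(linkage_distance(D, [0, 1], [2, 3], "complete"))  # 10
```

Single linkage tends to produce elongated "chained" clusters, whereas complete linkage favors compact ones; average linkage is a compromise between the two.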

In general, hierarchical clustering methods are not as good as partitional (non-hierarchical) ones in the sense that they depend on previously found clusters and are sensitive to outliers and to varying distance measures. Partitional clustering algorithms divide a data set into a number of clusters by minimizing some criterion or error function; the number of clusters is usually predefined. Among them, the K-means method (MacQueen, 1967; Anderberg, 1973) is probably the most popular algorithm in use.

Partitional clustering algorithms, however, cannot be applied to relational data directly because the data values are relative rather than absolute measures. Hathaway and Bezdek (1994) proposed the non-Euclidean relational hard c-means (NERHCM) clustering algorithm to solve this problem: the traditional K-means method is transformed into a relational version that can be applied to relational data. In this manner, clustering techniques on relational data can be of the partitional type.

3. The RCBKE Approach

In this section, the proposed RCBKE (relational clustering-based keyword extraction) approach to extracting keyword relationships from a document is presented. RCBKE consists of five steps, as shown in Figure 1. Its basic idea is to cluster the primary terms of a document into subtopics. Association strengths between each term and the subtopics are then calculated, and terms that have strong connections with many clusters are extracted as keywords. RCBKE is described in more detail as follows.

Figure 1 Steps of the RCBKE approach (Step 1: Preprocessing Documents; Step 2: Calculating Co-occurrence Frequency; Step 3: Clustering Primary Terms into Subtopics; Step 4: Calculating Distances between Terms and Subtopics; Step 5: Extracting Keyword Relationships)

Step 1. Preprocessing Documents

In the first step, a document is pre-processed to generate individual terms. A document is made up of sentences. Each sentence can be divided into several segments by removing certain words in the sentence. Words to be removed include non-significant words, prepositions, connectives, punctuation marks, and so on. A stop list of such words is prepared to process the sentences. Individual terms are then further extracted from those sentence segments. Terms to be extracted are noun phrases exclusively. Link Grammar Parser (Sleator and Temperley, 1993) and the WordNet database (Fellbaum, 1998) are employed for such a purpose.
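As a minimal sketch of this step: the report relies on a full stop list plus the Link Grammar Parser and WordNet to keep noun phrases only, which we approximate here with a small hand-made stop list (an assumption; real preprocessing would also filter by part of speech):

```python
import re

# A tiny illustrative stop list; the actual approach uses a full stop list
# and keeps noun phrases only via the Link Grammar Parser and WordNet.
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "on",
              "is", "are", "with", "for", "can", "be"}

def preprocess(document):
    """Split a document into sentences, then into lowercase terms,
    dropping stop words and punctuation."""
    sentences = re.split(r"[.!?]+", document.lower())
    result = []
    for sentence in sentences:
        terms = [w for w in re.findall(r"[a-z][a-z-]*", sentence)
                 if w not in STOP_WORDS]
        if terms:
            result.append(terms)
    return result

print(preprocess("The mobile phone is a handheld device. "
                 "Websites run on the wap platform."))
```

Each inner list corresponds to one sentence segment; these term lists feed directly into the co-occurrence computation of Step 2.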

Step 2. Calculating Co-occurrence Frequency

Step 2 calculates the co-occurrence frequency of term pairs within the same sentence. This is based on the idea of lexical cohesion: related words are used to reinforce the concept of a subtopic in a document. The calculated co-occurrence frequencies between pairs of terms are then fed into the next step to form clusters (subtopics).
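Sentence-level co-occurrence counting can be sketched as follows (counting each unordered pair once per sentence is our assumption; the report does not specify the exact counting convention):

```python
from collections import Counter
from itertools import combinations

def cooccurrence(sentences):
    """Count how often each unordered pair of terms appears in the same
    sentence (duplicate terms within a sentence are counted once)."""
    counts = Counter()
    for terms in sentences:
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    return counts

sents = [["wap", "platform", "website"],
         ["wap", "website"],
         ["mobile", "phone"]]
print(cooccurrence(sents)[("wap", "website")])  # 2
```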

Step 3. Clustering Primary Terms into Subtopics

In this step, primary terms are clustered into groups that represent subtopics in the document. Terms with high co-occurrence frequencies are selected as the primary terms on which the clustering is based. We employ the non-Euclidean relational hard c-means (NERHCM) algorithm proposed by Hathaway and Bezdek (1994). The reason for adopting this algorithm is that the co-occurrence frequencies between primary terms constitute relational data over the pairs, and a partitional clustering technique is desired to perform the clustering on such relational data. After NERHCM is applied, clusters of primary terms are generated.
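To make the idea of relational partitional clustering concrete, the sketch below uses a simple relational k-medoids as a stand-in (an assumption: the report uses NERHCM, not k-medoids; both partition objects given only a pairwise dissimilarity matrix). High co-occurrence would be mapped to low dissimilarity before this step.

```python
def relational_kmedoids(dissim, k, iters=20):
    """A simple relational k-medoids: partitions n objects described only
    by an n-by-n pairwise dissimilarity matrix into k crisp clusters.
    Used here as an illustrative stand-in for NERHCM."""
    n = len(dissim)
    # Deterministic farthest-first seeding of the k medoids.
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n),
                           key=lambda i: min(dissim[i][m] for m in medoids)))
    for _ in range(iters):
        # Assign every object to its nearest medoid.
        clusters = [min(range(k), key=lambda c: dissim[i][medoids[c]])
                    for i in range(n)]
        # Re-pick each medoid as the member minimizing within-cluster cost.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if clusters[i] == c] or [medoids[c]]
            new_medoids.append(min(members,
                key=lambda m: sum(dissim[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return clusters

# Illustrative dissimilarities: terms 0-2 co-occur often (low values),
# terms 3-4 co-occur often, and the two groups rarely co-occur (high values).
D = [[0, 1, 1, 9, 9],
     [1, 0, 1, 9, 9],
     [1, 1, 0, 9, 9],
     [9, 9, 9, 0, 1],
     [9, 9, 9, 1, 0]]
print(relational_kmedoids(D, 2))  # [0, 0, 0, 1, 1]
```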

Step 4. Calculating distances between terms and subtopics

In this step, the association distances between the terms of the document (other than the primary terms) and the subtopics derived in the previous step are calculated. The lower the distance value, the stronger the association between the term and the subtopic.
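The report does not give the exact formula for this distance, so the sketch below assumes it is the average dissimilarity between the term and the cluster's primary terms (a plausible but hypothetical choice):

```python
def term_cluster_distance(dissim_row, cluster_members):
    """Association distance between one term and one subtopic, taken here
    (as an assumption) to be the average dissimilarity between the term
    and the cluster's primary terms."""
    return sum(dissim_row[m] for m in cluster_members) / len(cluster_members)

# A term with dissimilarities [2, 4, 9, 9] to four primary terms,
# measured against the cluster containing primary terms 0 and 1.
print(term_cluster_distance([2, 4, 9, 9], [0, 1]))  # 3.0
```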

Step 5. Extracting Keyword Relationships

As mentioned, keywords are terms that connect strongly to several subtopics (clusters) in a document. A term is deemed to have a strong connection with a cluster if it links to at least one primary term in that cluster and the distance value is below a predetermined threshold. Finally, terms are sorted by the number of connected clusters and the average distance value: terms connected to more clusters with smaller average distances are extracted as keywords of the document. The keyword relationships can then be easily constituted from the extracted keywords.
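The sorting criterion above can be sketched as follows (the term names and distance values are illustrative, not taken from the report's experiment):

```python
def extract_keywords(term_cluster_dists, threshold, top_n=3):
    """Rank terms by (number of strongly connected clusters, descending;
    average distance to those clusters, ascending) and return the top ones.
    term_cluster_dists maps term -> list of distances, one per cluster."""
    scored = []
    for term, dists in term_cluster_dists.items():
        strong = [d for d in dists if d < threshold]
        if strong:
            scored.append((term, len(strong), sum(strong) / len(strong)))
    scored.sort(key=lambda t: (-t[1], t[2]))
    return [term for term, _, _ in scored[:top_n]]

dists = {"agent":   [1.0, 9.0, 2.0],   # close to two clusters
         "website": [1.5, 8.0, 8.0],   # close to one cluster only
         "sim":     [7.0, 7.5, 8.0]}   # close to none
print(extract_keywords(dists, threshold=3.0))  # ['agent', 'website']
```

Terms such as "agent" that bridge multiple clusters rank highest, matching the intuition that keywords tie several subtopics together.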

4. An Illustrated Example

In this section, we apply the proposed approach to the ProQuest electronic database and analyze the overall concept of the information retrieved on a user-defined topic. Suppose that users specify the topics "e-commerce" and "wireless communications". Within the time interval from January 1, 2001 to January 31, 2001, ten relevant discussions containing 2,363 terms in total are found. Using RCBKE, we extracted the keyword relationships shown in Figure 2, whose node labels include wap, platform, handheld device, mobile phone, sim, website, wireless, and mobile unit.


Figure 2 The keyword relationships extracted

The black nodes in Figure 2 are the extracted keywords. Keywords within a dotted circle are primary terms that represent a cluster underlying a subtopic. For example, three subtopics are implied in the discussion of "e-commerce" and "wireless communications" during the time period: web applications, wireless devices, and mobile-commerce. Keywords outside the circles are words of low frequency (and thus not primary terms) that are nonetheless essential in the sense that they connect several clusters. For example, "agent" is an associated term when people discuss web applications and mobile-commerce. Generally speaking, the graph reflects the overall meaning of, and the implicit relations in, the discussions of "e-commerce" and "wireless communications" in the papers contained in the ProQuest electronic database. In this manner, users can easily project the overall concepts emerging from the information retrieved about the topics they specify.

5. Conclusions

In this research, we propose an approach that provides users with an overview of the search results for the topics they specify over the Internet. The approach uses the proposed keyword extraction method, RCBKE, to obtain the keyword relationships of the retrieved information. With these keyword relationships, users can easily understand the overall concept of the retrieved information on the topic they are interested in. An example illustrates how the proposed approach works in practice.


It is noted, however, that information on the Internet is not constant but time-variant. As information changes over time, what seems relevant to users' needs may become of little value the next moment, and what seems trivial now may become significant as it develops into important concepts. Therefore, the next issue of this research is to assist users in tracing and investigating changes in the information over a certain time period. In this manner, users can realize the change patterns of the information and further recognize the trend of the topic they specify.

6. Self-evaluation

This research work is consistent with the original idea of the proposed project, i.e., to assist users in understanding the overall concept of the information retrieved over the Internet on the topics they are interested in. Due to the time limit, however, the work on tracking and investigating information change over time is left as future work. Nonetheless, the research so far makes sufficient contributions to both researchers and practitioners. Further evaluations of RCBKE are about to be performed, and this work is being prepared for major journal publication.

7. References

Anderberg, M., Cluster Analysis for Applications, Academic Press, New York, 1973.

Dumais, S. T., "Latent semantic indexing (LSI) and TREC-2," Proceedings of the Text Retrieval Conference, 1994.

Fellbaum, C., WordNet: An Electronic Lexical Database, MIT Press, 1998.

Halliday, M. A. K., and Hasan, R., Cohesion in English, Longman, 1976.

Hathaway, R. J., and Bezdek, J. C., "An iterative procedure for minimizing a generalized sum-of-squared-errors clustering criterion," Neural, Parallel & Scientific Computations, Vol. 2, 1994, pp. 1-16.

MacQueen, J. B., "Some methods for classification and analysis of multivariate observations," Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281-297.

Morris, J., and Hirst, G., "Lexical cohesion computed by thesaural relations as an indicator of the structure of text," Computational Linguistics, Vol. 17, No. 1, 1991.

Ohsawa, Y., Benson, N. E., and Yachida, M., "KeyGraph: Automatic indexing by co-occurrence graph based on building construction metaphor," Proceedings of the Advanced Digital Library Conference, 1998.

Skorochod'ko, E. F., "Adaptive method of automatic abstracting and indexing," Information Processing 71: Proceedings of the IFIP Congress 71, Amsterdam, 1972.

Sleator, D., and Temperley, D., "Parsing English with a Link Grammar," Third International Workshop on Parsing Technologies, 1993.

Sneath, P. H. A., and Sokal, R. R., Numerical Taxonomy: The Principles and Practice of Numerical Classification, W. H. Freeman, San Francisco, 1973.
