Introduction - Applying Natural Language Processing to Automated Information Extraction: The Case of IT Product Specification Extraction

1. Introduction

1.1 Background

The economy has evolved from an agricultural economy to an industrial one, and today most people agree that we are entering the age of the knowledge economy. Michio Kaku claims that

“knowledge and technology” will become the only determining factor in a nation’s competitiveness [1].

The concept of the knowledge economy changes how enterprises are managed and makes knowledge management one of the main sources of business competitiveness.

From a business point of view, information technology plays a crucial role in enhancing the performance of knowledge management activities such as knowledge retrieval, knowledge storage, and especially knowledge sharing.

With a dynamic environment and the coming of the knowledge economy, knowledge is treated as one of the most important assets for enhancing competitive advantage. For a company to lead its competitors, it must ensure that the best corporate knowledge is available and applied to the needs of clients in the right places at the right times [2]. Thus, creating and sharing knowledge to keep the enterprise highly competitive is a critical mission for IT managers.

Knowledge and information extraction and presentation are important processes in knowledge management. Information Extraction (IE) is an important approach to automated information management. IE is the task of converting documents containing fragments of structured information embedded in other extraneous material into a structured template or database-like representation [3]. The major concern of IE is how to locate specific pieces of data in a natural language document and extract structured information from unstructured text.
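As a minimal sketch of what "filling a structured template" means, the following Python fragment pulls a few specification values out of free text. The field names, the regular expressions, and the sample sentence are all hypothetical and chosen only for illustration; they are not the rules or tools used in this research, which relies on NLP components rather than hand-written patterns.

```python
import re

# Hypothetical template: field name -> regex with one capture group.
# These patterns are illustrative assumptions, not the rules used in this research.
SPEC_PATTERNS = {
    "cpu_ghz": re.compile(r"(\d+(?:\.\d+)?)\s*GHz", re.IGNORECASE),
    "ram_mb": re.compile(r"(\d+)\s*MB\s+(?:of\s+)?(?:DDR\s+)?(?:RAM|memory)", re.IGNORECASE),
    "hdd_gb": re.compile(r"(\d+)\s*GB\s+hard\s+(?:disk|drive)", re.IGNORECASE),
}


def extract_spec(text):
    """Fill the template with the first match of each pattern, or None."""
    record = {}
    for field, pattern in SPEC_PATTERNS.items():
        match = pattern.search(text)
        record[field] = match.group(1) if match else None
    return record


# Hypothetical product description with extraneous marketing text around the facts.
sample = ("Great value! This desktop ships with a 3.0 GHz processor, "
          "512 MB of DDR memory and an 80 GB hard drive. Order today.")
print(extract_spec(sample))
```

The extraneous material ("Great value!", "Order today.") is ignored, and only the fragments that fit the template are kept. Fields with no match are left as `None`, so an incomplete description still yields a record.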

The World Wide Web (WWW) is a large repository of information that includes many kinds of resources, such as text documents, images, and multimedia. All of these resources can be retrieved by anyone connected to the Internet. Most of the information is presented as unstructured natural language text and is dispersed across different sites. Because most web documents are written in natural language, computers cannot readily extract knowledge from web pages; an efficient approach is needed to convert natural language documents into a computer-readable format.

The traditional way to extract information from web pages is to use a search engine to select related documents and then annotate those documents manually, which requires much time and effort. How to extract information from the Internet efficiently is the major concern of our research.

However, recent advances in natural language processing (NLP) offer a new option for performing document annotation and information extraction. Through a customized, predefined process flow, NLP tools can extract information accurately from web pages in a specific domain [4].

On the other hand, the concepts of the Semantic Web and ontology provide a new way of thinking about information presentation. The Semantic Web is the representation of data on the World Wide Web. It is a collaborative effort led by the W3C with participation from a large number of researchers and industrial partners.

Traditional Web languages focus on web page presentation, and most information content is still described in natural language. Although the WWW provides a user-friendly, platform-independent client for information exchange, it cannot meet the requirements of automatic processing by software agents. The key point is that a software agent cannot understand the information content of a web page.

Building computer-readable Web pages is one of the ultimate goals of the W3C Semantic Web. The idea was first mentioned by Tim Berners-Lee in the original proposal for the WWW at CERN in 1989. The proposal includes a figure showing how information about a web of relationships amongst named objects could unify a number of information management tasks [5].

An ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents. An ontology defines the common words and concepts used to describe and represent an area of knowledge. This definition is consistent with the usage of ontology as set-of-concept-definitions, but more general. And it is certainly a different sense of the word than its use in philosophy [6].

After more than 10 years of development, the WWW has accumulated millions of web pages containing a very large amount of information. Extracting this information and translating it into a computer-readable data format is a key step in promoting the Semantic Web, and it is also a major concern of our research.

1.2 Motivation and Objectives

For most companies and individuals, collecting product specifications for comparison and evaluation is a necessary step before purchasing an IT product. In general, the life cycle of an IT product is very short, and product specifications change with each new product release.

Although it is easy to get IT product specifications from the WWW, this kind of information is usually dispersed across many different web sites. Traditionally, to collect the specifications of an IT product such as a desktop computer, one needs to find the related web pages with a search engine such as Yahoo or Google and then browse those pages manually to collect the needed information. Collecting information this way takes a great deal of effort.

Most product specifications are described in a specific format and embedded in web pages. Because a browser or software agent cannot identify product specifications in a natural language document, even a simple job such as “product specification collection” cannot be automated. Advances in natural language processing technology make it possible to extract specific pieces of information from web pages written in natural language.

On the other hand, the concepts of the Semantic Web and ontology provide a new model for information presentation. Through an ontology, information can be produced, exchanged, or analyzed by automatic processes to enhance the efficiency of knowledge management.

Therefore, our research tries to build an automatic process that links natural language processing technology with the ontology concept, and applies this process to extract IT product specifications from web pages. The objectives of the research are as follows:

1. Build a prototype system for automatic IT product specification extraction that can extract information from web pages efficiently and accurately.

2. Save the extracted information in an ontology language to support wide application and promote the development of the Semantic Web.
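To make the second objective concrete, the sketch below writes one extracted record as RDF statements in the N-Triples syntax. This is only an illustration under stated assumptions: the text above does not say which ontology language is used, N-Triples is a plain RDF serialization rather than a full ontology language, and the namespace, product identifier, and field names are hypothetical.

```python
# Hypothetical namespace; example.org is reserved for documentation use.
BASE = "http://example.org/itproduct/"


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples string literal.

    Only these two characters are handled; a full serializer would also
    escape newlines and other control characters.
    """
    return str(value).replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(product_id, record):
    """Return one N-Triples line per non-empty field of the record."""
    subject = f"<{BASE}{product_id}>"
    lines = []
    for field, value in record.items():
        if value is None:
            continue  # skip fields the extractor could not fill
        predicate = f"<{BASE}schema#{field}>"
        lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


# Hypothetical extracted record for one desktop computer.
record = {"cpu_ghz": "3.0", "ram_mb": "512", "hdd_gb": None}
for line in to_ntriples("desktop-001", record):
    print(line)
```

Each output line is a subject, predicate, and object triple, which a software agent can load without parsing natural language. Richer ontology languages add class and property definitions on top of statements of this kind.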

1.3 Research Methodology and Progress

We build a prototype system to study an automatic information extraction process that combines natural language processing with the ontology concept. Our research steps are:

1. Define the research topic, objective and scope:

First, we define the research topic, objective, and scope to guide the whole research process.
