Conclusion and Contribution - Applying Natural Language Processing to Automated Information Extraction: A Case Study of Extracting IT Product Specifications

Based on the extraction results and performance evaluation, we summarize the advantages and disadvantages of the rule-based extraction methodology.

Figure 1 shows the research flow and progress.

Figure 1: Research flow and progress

[Flowchart stages, in extracted order: Clarify Research Motivation; Define Research Topic; Define Research Objectives; Define Research Methodology and Literature Review; Prototype NLP Tools Survey; JAPE Developing; Ontology Developing; Analysis and Evaluation; Conclusion]

1.4 Research Scope and Limitation

To extract specification information from Web pages, we have to develop information extraction patterns for a specific knowledge domain. IT products form such a broad category that it is difficult to develop a general pattern covering all of them, so we select a subset of IT products for prototype development. The target IT products are as follows:

● Personal Computer

● Unix Server

● Monitor

● Printer

Although product specifications share a common format, small differences exist between Web sites. To improve the precision of information extraction, we use the HP and IBM Web sites as target templates and develop information extraction rules optimized for HP and IBM IT products. The rules can also be applied to other Web sites, but the precision of information extraction may decrease. We compare and discuss this issue in chapter 5.
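As a rough illustration of what a rule-based extraction pattern for product specifications might look like, the following minimal sketch uses a Python regular expression. It is not the rule set developed in this work; the rule `SPEC_RULE`, the function `extract_specs`, and the sample text are all hypothetical, and the rule assumes each specification appears as "attribute: value" on its own line.

```python
import re

# Hypothetical rule: "<attribute name>: <value>" on a single line of text.
SPEC_RULE = re.compile(r"^\s*(?P<name>[A-Za-z][A-Za-z ]*?)\s*:\s*(?P<value>.+?)\s*$")

def extract_specs(text):
    """Return a dict of attribute name -> value for lines matching SPEC_RULE."""
    specs = {}
    for line in text.splitlines():
        match = SPEC_RULE.match(line)
        if match:
            specs[match.group("name")] = match.group("value")
    return specs

# Illustrative input; the last line has no colon and is ignored by the rule.
sample = """Processor: 3.0 GHz
Memory: 512 MB
Call for pricing"""

print(extract_specs(sample))
```

A rule this rigid is precise on pages that follow the assumed layout and misses anything formatted differently, which is the trade-off described above for rules tuned to particular Web sites.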

2. Literature review

2.1 Natural Language Processing

Natural language processing (NLP) is the area of study that focuses on techniques that enable machines to work with human language. This involves not only the “understanding” or analysis of language, but also the generation or production of language [7].

A “natural language” (NL) is any of the languages naturally used by humans, i.e. not an artificial or man-made language such as a programming language. “Natural language processing” (NLP) is a convenient description for all attempts to use computers to process natural language [8]. NLP includes:

• Speech synthesis:

Although this may not at first sight appear very 'intelligent', the synthesis of natural-sounding speech is technically complex and almost certainly requires some 'understanding' of what is being spoken to ensure, for example, correct intonation.

• Speech recognition:

Converting continuous sound waves into discrete words.

• Natural language understanding:

Here treated as moving from isolated words (either written or determined via speech recognition) to 'meaning'. This may involve complete model systems or 'front-ends', driving other programs by NL commands.

• Natural language generation:

Generating appropriate NL responses to unpredictable inputs.

• Machine translation (MT):

Translating one NL into another.

The use of NLP for information extraction has a long history. In 1992, Tomek Strzalkowski and Barbara Vauthey built a prototype information retrieval system that uses advanced natural language processing techniques to enhance the effectiveness of traditional keyword-based document retrieval [9]. The system consists of a traditional statistical backbone (Harman and Candela, 1989) augmented with various natural language processing components that assist the system in database processing and translate a user's information request into an effective query.

To improve information extraction with natural language processing, Cynthia A. Thompson’s research team applied active learning to reduce annotation effort in 1999. They developed a system that learns rules for information extraction. The goal of an IE system is to find specific pieces of information in a natural-language document [10].

2.2 Information Extraction

Information extraction is the task of converting documents containing fragments of structured information embedded in other extraneous material into a structured template or database-like representation [3]. Since the WWW is a large information repository, our major concern is how to extract information from Web documents.

Claire Cardie proposed an architecture for information extraction systems in 1997. The architecture defines five steps for extracting information from natural language documents [11].

1. Tokenization and Tagging:

Each input text is first divided into sentences and words in a tokenization and tagging step.

2. Sentence Analysis:

It comprises one or more stages of syntactic analysis, or parsing, that together identify noun groups, verb groups, prepositional phrases, and other simple constructs.

3. Extraction:

The extraction phase is the first entirely domain-specific component of the system. During extraction, the system identifies domain-specific relations among relevant entities in the text.

4. Merging:

The main job of the merging phase is co-reference resolution, or anaphora resolution: The system examines each entity encountered in the text and determines whether it refers to an existing entity or whether it is new and must be added to the system's discourse-level representation of the text.

5. Template generation:

The template generation phase determines the number of distinct events in the text, maps the individually extracted pieces of information onto each event and produces output templates.

The process flow is shown in Figure 2:

Figure 2: Architecture for an Information-Extraction System
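The five steps of this architecture can be pictured as a chain of functions, each consuming the output of the previous one. The following is a deliberately simplified sketch under stated assumptions, not the architecture's actual implementation: all function names are hypothetical, tagging is reduced to capitalization and digit checks, the domain rule in `extract` is invented for illustration, and case-insensitive name matching stands in for real co-reference resolution.

```python
import re

def tokenize_and_tag(text):
    """Step 1: split text into sentences and words; tag each word crudely."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tagged = []
    for sentence in sentences:
        words = re.findall(r"[A-Za-z]+|\d+", sentence)
        tagged.append([(w, "NUM" if w.isdigit() else
                           "CAP" if w[0].isupper() else "WORD") for w in words])
    return tagged

def analyze(tagged_sentence):
    """Step 2: group runs of capitalized words into simple noun groups."""
    groups, current = [], []
    for word, tag in tagged_sentence:
        if tag == "CAP":
            current.append(word)
        elif current:
            groups.append(" ".join(current))
            current = []
    if current:
        groups.append(" ".join(current))
    return groups

def extract(tagged_sentence, groups):
    """Step 3: hypothetical domain rule pairing the first noun group with each number."""
    numbers = [w for w, tag in tagged_sentence if tag == "NUM"]
    return [(groups[0], n) for n in numbers] if groups else []

def merge(relations):
    """Step 4: stand-in for co-reference: names equal ignoring case are one entity."""
    merged = {}
    for name, value in relations:
        entry = merged.setdefault(name.lower(), {"name": name, "values": []})
        entry["values"].append(value)
    return merged

def generate_templates(merged):
    """Step 5: emit one output template per distinct entity."""
    return [{"entity": e["name"], "values": e["values"]} for e in merged.values()]

def run_pipeline(text):
    relations = []
    for sentence in tokenize_and_tag(text):
        relations.extend(extract(sentence, analyze(sentence)))
    return generate_templates(merge(relations))

print(run_pipeline("LaserJet prints 20 pages. LASERJET holds 250 sheets."))
```

Only `extract` depends on the domain in this sketch, which mirrors the observation that extraction is the first entirely domain-specific component.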

Generally, there are two major approaches to extracting information from the WWW: one is text mining and the other is Web mining.

2.2.1 Text Mining

Text mining should not be confused with the better-known Internet search engine tools or database management capabilities. Analogous to data mining, which extracts useful information from large quantities of data of any type, text mining is a procedure applied to large volumes of free, unstructured text. After a traditional search for documents is completed, for example over full text, abstracts, or indexed terms, text mining explores the complex relationships among documents [12].

Text mining is about looking for patterns in natural language text, and may be defined as the process of analyzing text to extract information from it for particular purposes. There are three major concerns about text mining [13,14,15]:

(1) Information Retrieval, the foundational step of text mining. It is the extraction of relevant records from the source technical literature or text databases for further processing.

(2) Information Processing, the extraction of patterns from the data retrieved in the previous step. According to Kostoff, it has three components: bibliometrics, computational linguistics and clustering techniques. This step typically provides ordering, classification and quantification to the formerly unstructured material.

(3) Information Integration. It is the combination of the information processing computer output with the human cognitive processes.

In 2002, Raymond J. Mooney and Un Yong Nahm presented a framework for text mining based on the integration of Information Extraction (IE) and Knowledge Discovery [16].

Figure 3: Overview of the IE-based text mining framework

They apply data mining techniques to the automated discovery of useful or interesting information from unstructured text. Several techniques have been proposed for text mining, including conceptual structure, association rule mining, episode rule mining, decision trees, and rule induction methods.

Claire Grover’s research team provided another methodology for text mining in 2004. They proposed a framework for text mining services that applies NLP tools to annotate XML documents and extract information from natural language documents. The workflow involves four major steps:

1. Tokenization: Identifying and marking up words and sentences in the input text.

2. Location Tagging with a classifier: Using a trained maximum entropy classifier to mark up location names.

3. Location Tagging by Lexicon: Using a lexicon of location names to mark up additional locations not identified by the tagger.

4. Gazetteer Query: Sending location names extracted from the text to a gazetteer resource, and presenting the query results in an application-appropriate form.
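The four steps above can be sketched as follows. This is a toy illustration under stated assumptions rather than the framework's actual tooling: the names `LEXICON`, `GAZETTEER`, `classifier_tag`, `lexicon_tag`, and `gazetteer_query` are hypothetical, the gazetteer is an in-memory dictionary with approximate coordinates, and a simple heuristic replaces the trained maximum entropy classifier.

```python
import re

# Hypothetical resources, for illustration only.
LEXICON = {"edinburgh", "glasgow"}            # lexicon of known location names
GAZETTEER = {"edinburgh": (55.95, -3.19)}     # name -> approximate (lat, lon)

def tokenize(text):
    """Step 1: identify the words in the input text."""
    return re.findall(r"[A-Za-z]+", text)

def classifier_tag(tokens):
    """Step 2 stand-in: a trained maximum entropy classifier would go here.
    This heuristic tags a capitalized token that follows the word 'in'."""
    return {tokens[i] for i in range(1, len(tokens))
            if tokens[i - 1].lower() == "in" and tokens[i][0].isupper()}

def lexicon_tag(tokens, already_tagged):
    """Step 3: add lexicon matches that the classifier step missed."""
    return {t for t in tokens if t.lower() in LEXICON and t not in already_tagged}

def gazetteer_query(locations):
    """Step 4: look up each location; None when the gazetteer has no entry."""
    return {loc: GAZETTEER.get(loc.lower()) for loc in sorted(locations)}

def run(text):
    tokens = tokenize(text)
    tagged = classifier_tag(tokens)
    tagged |= lexicon_tag(tokens, tagged)
    return gazetteer_query(tagged)

print(run("The team met in Edinburgh and later visited Glasgow."))
```

In this example the heuristic finds "Edinburgh", the lexicon adds "Glasgow", and the gazetteer lookup returns coordinates only for the name it knows.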

Weiguo Fan’s research team described a generic process model for a text mining application in 2005. Starting with a collection of documents, a text mining tool retrieves a particular document and preprocesses it by checking format and character sets. The document then goes through a text analysis phase, sometimes repeating techniques until information is extracted. Three text analysis techniques are shown in the example, but many other combinations of techniques could be used depending on the goals of the organization. The resulting information can be placed in a management information system, yielding an abundant amount of knowledge for the user of that system [17].

Figure 4: The process flow for text mining

2.2.2 Web mining

Two important and active areas of current research are data mining and the World Wide Web. Web mining, the combination of these two areas, has been the focus of several recent research projects and papers.

Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services. Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web.

This broad definition on the one hand describes the automatic search and retrieval of information and resources available from millions of sites and on-line databases, i.e., Web content mining, and on the other hand, the discovery and analysis of user access patterns from one or more Web servers or on-line services, i.e., Web usage mining.
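To make the idea of discovering user access patterns more concrete, the following minimal sketch counts page-to-page transitions in a server log. It is only one simple form of pattern discovery; the function `transition_counts`, the log format of (user, page) pairs in time order, and the sample data are assumptions made for illustration.

```python
from collections import Counter

def transition_counts(log):
    """Count page-to-page transitions per user from (user, page) events in time order."""
    last_page = {}
    counts = Counter()
    for user, page in log:
        if user in last_page:
            counts[(last_page[user], page)] += 1
        last_page[user] = page
    return counts

# Illustrative access log: two users browsing a hypothetical product site.
log = [("u1", "/home"), ("u2", "/home"), ("u1", "/printers"),
       ("u2", "/printers"), ("u1", "/cart")]

print(transition_counts(log).most_common(1))
```

Here the most frequent transition is from "/home" to "/printers", followed by both users.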

Web mining can be organized into a taxonomy along its two primary dimensions, namely Web content mining and Web usage mining. We also describe and categorize some of the recent work and the related tools or techniques in each area. This taxonomy is depicted in Figure 5 [18].

Figure 5: Taxonomy of Web mining

[Figure 5 outline: Web Mining divides into Web Content Mining and Web Usage Mining. Web Content Mining covers the Agent Based Approach (intelligent search agents; information filtering/categorization; personalized Web agents) and the Database Approach (multilevel databases; Web query systems). Web Usage Mining covers preprocessing, transaction identification, pattern discovery tools, and pattern analysis tools.]

For mining Web content, agent-based tools and database mining techniques are the two major approaches. The agent-based approach to Web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user, to discover and organize Web-based information. The database approaches to Web mining have generally focused on techniques for integrating and organizing the heterogeneous and semi-structured data on the Web into more structured and high-level collections of resources, such as relational databases, and on using standard database querying mechanisms and data mining techniques to access and analyze this information.

2.3 Semantic Web and Ontology

2.3.1 Semantic Web

The Semantic Web is the representation of data on the World Wide Web. It is a collaborative effort led by the W3C with participation from a large number of researchers and industrial partners. It is based on the Resource Description Framework (RDF), which integrates a variety of applications using XML for syntax and URIs for naming.

The Semantic Web tries to build a universal schema to unify the different knowledge schemas on the Web. It proposes an architecture consisting of multiple layers to construct a universal schema framework. Figure 6 shows the layer structure of the Semantic Web.

Figure 6: The layers of the Semantic Web

The basis of the architecture is RDF. It provides a special format for every semantic statement on the web: (Subject, Predicate, Object). It is the basic syntax for the whole web; that is, all programs can recognize this format.
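The (Subject, Predicate, Object) format can be illustrated with plain Python tuples. This is a minimal sketch, not an RDF library: the `Triple` type, the helper `objects_of`, and the `http://example.org/` URIs are hypothetical and serve only to show how statements in this format can be stored and queried.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """One statement in (Subject, Predicate, Object) form."""
    subject: str
    predicate: str
    obj: str

# Hypothetical URIs, used only for illustration.
EX = "http://example.org/"

statements = [
    Triple(EX + "product/pc1", EX + "schema/manufacturer", "HP"),
    Triple(EX + "product/pc1", EX + "schema/memory", "512 MB"),
]

def objects_of(statements, subject, predicate):
    """Return the objects of all statements with the given subject and predicate."""
    return [t.obj for t in statements
            if t.subject == subject and t.predicate == predicate]

print(objects_of(statements, EX + "product/pc1", EX + "schema/memory"))
```

Because every statement has the same three-part shape, any program can read the data without knowing in advance which properties a resource has.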

RDF is the first step (and hence the basis) of the Semantic Web architecture. However, constructing a universal schema for the Web is not easy. RDF itself cannot be the universal schema because the semantics of the data encoded in it are not yet specified.

Hence the RDF Schema, Ontology, and Rules layers help to specify the meanings of the subjects, predicates, and objects used in RDF statements. Together they can precisely define the semantics of RDF statements. Finally, the Logic Framework layer is needed to define how machines work with these layers. Search on the Semantic Web should follow this mechanism. With the Semantic Web approach, machines can understand data on the Web by first recognizing RDF statements and then finding the semantic definitions at the specified URIs (which may link to RDF Schema documents, ontology documents, or documents about rules). Machines can then perform searches or provide services more smoothly, since they can recognize all the data on the Web. There would then be only one schema on the Web.

2.3.2 Ontology

An ontology is an explicit formal specification of how to represent the objects, concepts, and other entities that are assumed to exist in some area of interest and the relationships that hold among them. An ontology defines the common words and concepts used to describe and represent an area of knowledge. It models the vocabulary and meaning of domains of interest in a computer-usable form, so that computers can understand domain knowledge and share it with each other.

We can now clarify the role of an ontology, considered as a set of logical axioms designed to account for the intended meaning of a vocabulary. Given a language L with ontological commitment K, an ontology for L is a set of axioms designed in a way such that the set of its models approximates as best as possible the set of intended models of L according to K (see figure 7). In general, it is not easy to find the right set of axioms, so that an ontology will admit other models besides the intended ones. Therefore, an ontology can “specify” a conceptualization only in a very indirect way, since

(i) It can only approximate a set of intended models;

(ii) Such a set of intended models is only a weak characterization of a conceptualization.

We shall say that an ontology O for a language L approximates a conceptualization C if there exists an ontological commitment K = <C, ℑ> such that the intended models of L according to K are included in the models of O. Here C = <D, W, ℜ> is a conceptualization, where D is a domain, W is a set of relevant states of affairs of that domain, and ℜ is a set of conceptual relations on <D, W>; and ℑ: V → D ∪ ℜ is a function assigning elements of D to constant symbols of the vocabulary V, and elements of ℜ to predicate symbols of V.

An ontology commits to C if:

(iii) It has been designed with the purpose of characterizing C, and

(iv) It approximates C.

A language L commits to an ontology O if it commits to some conceptualization C such that O agrees on C. With these clarifications, we come up to the following definition, which refines Gruber’s definition by making clear the difference between an ontology and a conceptualization [19]:

Figure 7: The intended models of a logical language reflect its commitment to a conceptualization

An ontology is a logical theory accounting for the intended meaning of a formal vocabulary, i.e. its ontological commitment to a particular conceptualization of the world. The intended models of a logical language using such a vocabulary are constrained by its ontological commitment. An ontology indirectly reflects this commitment (and the underlying conceptualization) by approximating these intended models.

The relationships between vocabulary, conceptualization, ontological commitment and ontology are illustrated in Figure 7. It is important to stress that an ontology is language-dependent, while a conceptualization is language-independent.

2.3.3 Ontology Language

Several ontology languages have been developed during the last few years, and they will surely become ontology languages in the context of the Semantic Web. Some of them are based on XML syntax, such as the XML-based Ontology Exchange Language (XOL), SHOE (which was previously based on HTML), and the Ontology Markup Language (OML), whereas the Resource Description Framework (RDF) and RDF Schema are languages created by World Wide Web Consortium (W3C) working groups. Finally, two additional languages are being built on top of RDF(S), the union of RDF and RDF Schema, to improve its features: Ontology Inference Layer (OIL) and DAML+OIL.

Other languages have also been used, traditionally, for building ontologies, but that analysis is out of the scope of this article.

● XML-based Ontology Exchange Language (XOL)

The US bioinformatics community designed XOL for the exchange of ontology definitions among a heterogeneous set of software systems in their domain. Researchers created it after studying the representational needs of experts in bioinformatics. They selected Ontolingua and OML as the basis for creating XOL, merging the high expressiveness of OKBC-Lite, a subset of the Open Knowledge Base Connectivity protocol, and the syntax of OML, based on XML. There are no tools that allow the development of ontologies using XOL. However, since XOL files use XML syntax, we can use an XML editor to author XOL files.

● Simple HTML Ontology Extension (SHOE)

SHOE, developed at the University of Maryland and used to develop OML, was created as an extension of HTML, incorporating machine-readable semantic knowledge in HTML documents or other Web documents [20]. Recently, the University of Maryland has adapted the SHOE syntax to XML. SHOE makes it possible for agents to gather meaningful information about Web pages and documents, improving search mechanisms and knowledge gathering. This process consists of three phases: define an ontology, annotate HTML pages with ontological information to describe themselves and other pages, and have an agent semantically retrieve information by searching all the existing pages and keeping information updated. The Knowledge Annotator annotates ontological information in HTML pages.

● Ontology Markup Language (OML)

OML, developed at the University of Washington, is partially based on SHOE. In fact, it was first considered an XML serialization of SHOE. Hence, OML and SHOE share many features.

Four different levels of OML exist: OML Core is related to logical aspects of the language and is included by the rest of the layers; Simple OML maps directly to RDF(S); Abbreviated OML includes conceptual graphs features; and Standard OML is the most expressive version of OML. We selected Simple OML, because the higher layers don’t provide more components than the ones identified in our framework. These higher layers are tightly related to the representation of conceptual graphs.

There are no tools for authoring OML ontologies other than existing general-purpose XML editing tools.

● Resource Description Framework (RDF) and RDF Schema (RDFS)

RDF, developed by the W3C for describing Web resources, allows the specification of the semantics of data based on XML in a standardized, interoperable manner. It also provides mechanisms to explicitly represent services, processes, and business models, while allowing recognition of non-explicit information.

The RDF data model is equivalent to the semantic networks formalism. It consists of three object types: resources are described by RDF expressions and are always named by URIs plus optional anchor IDs; properties define specific aspects, characteristics, attributes, or relations used to describe a resource; and statements assign a value for a property in a specific resource.

The RDF data model does not provide mechanisms for defining the relationships between properties (attributes) and resources. This is the role of RDFS. RDFS offers primitives for defining knowledge models that are closer to frame-based approaches.

RDF(S) is widely used as a representation format in many tools and projects, such as Amaya, Protégé, Mozilla, SilRI, and so on.
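The way RDFS relates properties to classes of resources can be illustrated with its domain and range declarations: when a property is used in a statement, the subject can be inferred to belong to the property's domain class and the object to its range class. The sketch below is a simplified illustration, not an RDFS reasoner; the `SCHEMA` dictionary, the names `manufacturer`, `Product`, and `Company`, and the function `infer_types` are hypothetical.

```python
# Hypothetical schema: property -> declared domain and range classes.
SCHEMA = {"manufacturer": {"domain": "Product", "range": "Company"}}

def infer_types(statements, schema):
    """Infer rdf:type facts from the domain and range of each property used."""
    inferred = set()
    for subject, prop, obj in statements:
        if prop in schema:
            inferred.add((subject, "rdf:type", schema[prop]["domain"]))
            inferred.add((obj, "rdf:type", schema[prop]["range"]))
    return inferred

# Illustrative statement: resource "pc1" has manufacturer "hp".
statements = [("pc1", "manufacturer", "hp")]

for fact in sorted(infer_types(statements, SCHEMA)):
    print(fact)
```

In this sketch the schema does not reject statements; it adds type information about the resources involved, which is how domain and range declarations behave in RDFS.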

● Ontology Interchange Language (OIL)

OIL, developed in the OntoKnowledge project (www.ontoknowledge.org/OIL), permits semantic interoperability between Web resources. Its syntax and semantics are based on existing proposals (OKBC, XOL, and RDF(S)), providing modeling primitives commonly used in frame-based approaches to ontological engineering (concepts, taxonomies of concepts, relations, and so on), and formal semantics and reasoning support found in description logic approaches (a subset of first order logic that maintains a high expressive power, together with decidability and an efficient inference mechanism).

OIL, built on top of RDF(S), has the following layers: Core OIL groups the OIL primitives that have a direct mapping to RDF(S) primitives; Standard OIL is the complete OIL model, using more primitives than the ones defined in RDF(S); Instance OIL adds instances of concepts and roles to the previous model; and Heavy OIL is the layer for future extensions of OIL.

OILEd, Protégé2000, and WebODE can be used to author OIL ontologies. OIL’s syntax is not only expressed in XML but can also be presented in ASCII.

● DARPA Agent Markup Language (DAML) + OIL

DAML+OIL has been developed by a joint committee from the US and the European Union (IST) in the context of DAML, a DARPA project for allowing
