Automatic Generation of Semantically Enriched Web Pages by a Text Mining Approach

Hsin-Chang Yang ∗

Department of Information Management, National University of Kaohsiung, Kaohsiung, Taiwan

Abstract

Most Web pages today contain little structure or supporting information that can reveal their semantics or meanings. To enable automated processing of Web pages, semantic information such as metadata and tags should be added to each page. Several authoring tools have been developed to help users with this task. However, manual or semi-automatic authoring is impractical when a large number of Web pages must be annotated. In this work, we propose a method to automatically generate descriptive metadata and tags for a Web page. The idea is to apply the self-organizing map algorithm to cluster the Web pages and discover the relationships between these clusters. At the same time, the themes of each cluster are identified. We then use these relationships and themes to tag the Web pages and generate metadata for them. Experimental results show that our method may generate semantically relevant metadata and tags for Web pages.

Key words: Metadata Generation, Semantic Tagging, Text Mining,

1 Introduction

The vision of the Semantic Web is to provide machine-processable metadata that describes the semantics of resources, in order to facilitate the searching, filtering, condensing, or negotiation of knowledge for human users. A core technology for making the Semantic Web happen, and also for supporting application areas such as knowledge management and e-commerce, is semantic annotation, which turns human-understandable content into a machine-understandable form. For newly created Semantic Web resources, the annotation can be done manually or with the help of authoring tools. However, it is not practical to annotate existing Web pages in this way because of their sheer number. To promote the Web to the Semantic Web, we need an automated process that discovers the semantics of a Web page and explicitly adds them to the page, generally in the form of XML-based metadata, for future use. Automatic semantic metadata generation thus plays an important role in ensuring the success of the Semantic Web.

We would like to emphasize that the metadata mentioned here should carry more semantic meaning than common descriptive metadata such as the Dublin Core. Here, the semantics of a Web page refers to the implicit meanings or topics that are difficult to extract using simple syntactic features and rules. That is, a more sophisticated process should be applied to analyze the content of the Web page and discover its implicit semantics. Since most Web pages contain unstructured or semi-structured textual data, the process of extracting implicit knowledge from texts, namely text mining, is a natural choice for generating semantic metadata.

∗ Corresponding author. Address: Department of Information Management, National University of Kaohsiung, Kaohsiung, 811, Taiwan, ROC. Tel. +886 7 5919512 Fax. +886 7 5919328. Email address: [email protected] (Hsin-Chang Yang).

The attempt to annotate existing Web pages with semantic metadata will be impractical if the annotation process requires human intervention or does not scale. In fact, the annotation process is technically trivial and can be easily automated if we omit its semantic extraction part. Generally, the semantics of a Web page is difficult to determine without human judgment. Automating the semantics extraction process is necessary, since even a minor requirement of human effort results in a tremendous time cost given the several billion Web pages that exist on the Web. However, such automation is rather difficult because the recognition of semantics is a high-level cognitive process. A suitable text mining process may provide a solution to this automatic semantics extraction problem if three basic requirements are satisfied. The first is that the process should be fully automatic, with no or only a negligible amount of human intervention. The second is that it should be generalizable and scalable, so that we need not use all existing Web pages for training. The last is that it should be able to extract the real semantics of the Web pages and present it in a human-comprehensible way. The second requirement can be met by a suitable machine learning algorithm. However, the other two requirements are considerably more difficult, since they involve high-level cognitive processes and a sophisticated design of the extraction procedure.

In this work, we present a semantics extraction method using a text mining approach that fits the requirements mentioned above. Our goal is to generate semantic metadata and tags for a Web page. We should emphasize that a Web page mentioned here is generally an ordinary page in HTML or XML format that contains no metadata of interest. To obtain the semantic metadata and tags of the Web pages, we first cluster a set of training Web pages using the self-organizing map (SOM) algorithm (Kohonen, 1997) and generate two feature maps. A semantics extraction process is then applied to these maps to identify various types of semantic descriptions of a page, including thematic terms that describe the theme of the page, pages related to it, and its important keywords. The extraction process may also construct an ontology of these Web pages. These semantic descriptions, as well as the ontology, are then used to form the metadata and to tag some keywords within the page. Specifically, two kinds of semantic annotations are generated and added to a Web page. The first kind is a piece of metadata that accompanies the Web page. Such metadata could exist within the page or be stored outside it. The second kind is a set of tags that mark important terms within the page. These tags generally exist within the page as ordinary HTML or XML tags. We believe that a Web page will be enriched with much semantic information through such metadata and tags. The main contribution of our work is that we incorporate semantic annotation and ontology creation in a unified framework. Furthermore, our method requires neither a predefined ontology nor human intervention, and can be applied to dynamic Web pages. Although the generated metadata is rather primitive, we believe that the method will be beneficial to the success of the Semantic Web once its output is migrated to the existing standards.

The structure of this paper is as follows. We will discuss some related work in Sec. 2. In Sec. 3 we will introduce the clustering process by SOM and the generation of two feature maps. The text mining process for metadata generation that consists of the identification methods for thematic keywords and their relationships is described in Sec. 4. We then show the experimental result for the proposed method in Sec. 5. Finally, we give some conclusions and discussions in Sec. 6.

2 Related Work

Automatic creation of metadata for Web pages resembles the task of semantic annotation in general. The schemes for semantic annotation of Web pages may be divided into two major categories, namely manual annotation and automatic annotation. Most manual annotation schemes are concerned with tools for annotating Web pages and sharing the annotations. The major concerns of these schemes include the representation of annotations, ease of use, the incorporation of ontologies, the design of efficient sharing methodologies, and the evaluation of annotations. Some works in this area can be found in Bechhofer and Gobel (2001); Erdmann et al. (2000); Handschuh and Staab (2002); Koivunen and Swick (2001); Martin and Eklund (1999); Vargas-Vera et al. (2001b). Meanwhile, a number of annotation tools for producing semantic markup have been developed, such as SHOE (Heflin and Hendler, 2000), Protege-2000 (Noy et al., 2001), OntoAnnotate (Handschuh et al., 2002), MnM (Vargas-Vera et al., 2001a), and Annotea (Kahan et al., 2001). On the other hand, automatic annotation aims to create annotations for existing Web pages automatically or semi-automatically by means of machine learning or syntactic analysis. Our work deals with the automatic creation of semantic metadata and falls into this category. In the following we discuss some related works.

In general, most works on semantic annotation require predefined ontologies to extract, define, and relate the annotations, as discussed in the following. Here we further divide the automatic semantic annotation schemes into two models. The first model is called ontology-driven semantic tagging, whose task is to generate a set of tags that semantically describe and tag the original content of a Web page at the sentence or sub-structure level according to some ontologies. Graubitz et al. (2001) developed the DIAsDEM framework to incorporate legacy data and collections of semi-structured documents into an integrated information system that can be queried to support decision processes. They applied a KDD process to derive a preliminary flat XML DTD that serves as a quasi-schema for the document archive and enables database-like querying services on textual data. The framework was further improved by Winkler and Spiliopoulou (2001), who derived structured XML DTDs in order to extend the previously derived flat DTDs. Dill et al. (2003) introduced the SemTag system, which performs automated semantic tagging of large corpora and focuses on detecting the occurrence of particular entities in Web pages. The application is centralized, as in this work, so that knowledge from the entire corpus is used. However, the tagged terms are predefined in a standard ontology rather than automatically identified. Bonino et al. (2003) proposed an architecture for creating and managing annotations using previously defined ontologies; it allows the tagging of documents at different granularity levels, ranging from the whole document to a single paragraph. Li et al. (2001) proposed a machine-learning based automatic annotation approach, implemented in the ALPHA system, to annotate Web pages at the sentence level in RDF format. Kiryakov et al. (2005) describe the semantic annotation scheme of the KIM platform, in which named entities are extracted and tagged according to an upper-level ontology (KIMO) and a knowledge base. All the above-mentioned methods rely on predefined ontologies. Semantic tagging may also be achieved without the use of ontologies. Dingli et al. (2003) proposed the Armadillo architecture, which uses an iterative approach that combines information extraction and machine learning to extract annotations for tagging semantically consistent portions of the Web. The knowledge for extracting annotations comes from Web sites that provide structured knowledge, such as Citeseer¹ and Google². No ontological knowledge is used in their method.

The second model for automatic semantic annotation is called semantic metadata generation, whose task is to generate a section of metadata that semantically describes the content of the annotated page. The generated metadata may incorporate ontologies implicitly or explicitly, where an implicit ontology means that a system defines its own semantic categories as well as their relationships, and an explicit ontology means that a system relies on some predefined ontologies. Most semantic metadata generation schemes use ontologies explicitly, since it is difficult to generate ontologies. Volz et al. (2004) describe an annotation scheme called deep annotation that manually creates relational metadata for the Web presentation or annotates directly using the logical database schema. The CREAM framework (Handschuh and Staab, 2003) applies a metadata re-recognition process to compare existing metadata literals with newly typed or existing text. It also learns a wrapper from given markup in order to automatically annotate similarly structured pages. Finally, systems resembling message extraction may be used to recognize named entities, propose co-references, and extract some relationships from texts. The PANKOW method (Cimiano et al., 2004), a module of the CREAM framework, employs an unsupervised, pattern-based approach to categorize instances with regard to a given ontology. However, the patterns are very limited.

¹ www.citeseer.com ² www.google.com

3 SOM Clustering

To obtain the metadata and tags for a Web page, we first perform a clustering process on a training set of Web pages. We then generate feature maps to reveal the relationships among Web pages as well as keywords. In the following subsections, we start with the preprocessing steps and then describe the clustering process using the SOM learning algorithm. Two labeling processes are then applied to the trained result to construct feature maps that characterize the relationships among keywords and Web pages.

3.1 Preprocessing of Documents

Our approach begins with a standard practice in information retrieval (Salton and McGill, 1983), i.e. the vector space model, to encode documents as vectors in which each element of a document vector corresponds to a different index term. In this work the training corpus contains a set of Chinese news Web pages posted on the CNA (Central News Agency) newswire. We selected these pages for two reasons. The first is that they are publicly available for research purposes and have been used as a test set in many works. The second is that these pages contain few tags and little metadata, which meets our need. Although these Web pages are mostly written in Chinese, our method can be applied to Web pages in any language, as long as they can be segmented into lists of keywords. To encode a Web page into a vector, we first extract index terms (or keywords) from the page. Traditionally there are two schemes for extracting terms from Chinese texts. One is the character-based scheme and the other is the word-based scheme (Huang and Robertson, 1997). We adopt the second scheme because individual Chinese characters generally carry no context-specific meaning. A word in Chinese is composed of two or more Chinese characters. After extracting words from all training Web pages, we collect all extracted keywords and obtain a vocabulary V for this corpus. It is then used to encode a Web page into a binary vector. In this vector, an element with value 1 indicates the presence of its corresponding word in the Web page, and a value of 0 indicates its absence. We use the binary vector scheme to encode the Web pages because we intend to cluster Web pages according to the co-occurrence of words, which is independent of the weights of the individual words. Note that we only keep words that belong to the content of the pages, i.e. we discard words that appear inside HTML or XML tags in the Web pages.
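The binary encoding described above can be illustrated with a minimal Python sketch. It assumes the pages have already been segmented into keyword lists; the function names and the two sample pages are hypothetical and only serve the illustration.

```python
def build_vocabulary(pages):
    """Collect the sorted set of all keywords seen in the training pages."""
    return sorted({word for words in pages for word in words})


def encode_binary(words, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the page, else 0."""
    present = set(words)
    return [1 if term in present else 0 for term in vocabulary]


# Hypothetical, already-segmented pages (each page is a list of keywords).
pages = [
    ["neural", "network", "training"],
    ["network", "protocol"],
]
vocabulary = build_vocabulary(pages)
vectors = [encode_binary(page, vocabulary) for page in pages]
print(vocabulary)
print(vectors)
```

Repeated occurrences of a word do not change the vector, which matches the stated intent of clustering by co-occurrence rather than by term weight.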

A problem with this encoding method is that if the vocabulary is very large, the dimensionality of the vector is also high. In general, dimensions on the order of 1000 to 10000 are very common even for reasonably small collections of Web pages. As a result, techniques for controlling the dimensionality of the vector space are required. In information retrieval, several techniques are widely used to reduce the number of index terms. Unfortunately, these techniques are not fully applicable to Chinese documents. For example, stemming is generally not necessary for Chinese texts. On the other hand, we can use stop words and noun group selection to reduce the number of index terms. In this work, we adopt a simple approach that allows only nouns to be index terms. In our experiments, this approach is able to reduce the vocabulary to a reasonable size.

3.2 Clustering and Feature Maps Generation

In this subsection we describe how to organize Web pages and keywords into clusters by the co-occurrence similarities of keywords. The Web pages in the corpus are first encoded into a set of document vectors as described in Sec. 3.1. We intend to organize these Web pages into a set of clusters such that similar Web pages fall into the same cluster. Moreover, similar clusters should be 'close' in some manner. That is, we should be able to organize the clusters such that clusters containing similar Web pages are close in some measurement space. The unsupervised learning algorithm of SOM networks (Kohonen, 1997) meets our needs. The SOM algorithm maps a set of high-dimensional vectors to a two-dimensional map of neurons according to the similarities among the vectors. Similar vectors, i.e. vectors with a small distance between them, will map to the same or nearby neurons after the training (or learning) process. That is, the similarity between vectors in the original space is preserved in the mapped space. By applying the SOM algorithm to the document vectors, we in effect perform a clustering process on the corpus. A neuron in the trained map can be considered a cluster. Similar Web pages will fall into the same or neighboring neurons (clusters). Moreover, the similarity of two clusters can be measured by the geometrical distance between their corresponding neurons. To decide the cluster to which a Web page or a keyword belongs, we apply two labeling processes to the Web pages and keywords, respectively. After the Web page labeling process, each Web page is associated with a neuron in the map. We record these associations and obtain the document cluster map (DCM). Similarly, each neuron is labeled by a set of keywords after the keyword labeling process, which gives the keyword cluster map (KCM). We then use these maps to generate the semantic metadata and tags.

We define some notation and describe the clustering process here. Let $x_i = \{x_{in} \mid 1 \le n \le N\}$, $1 \le i \le M$, be the document vector of the $i$th Web page in the corpus, where $N$ is the number of keywords and $M$ is the number of Web pages in the corpus. We use these vectors as the training inputs to the SOM network. The network consists of a regular grid of neurons. Each neuron in the network has $N$ synapses. Let $w_j = \{w_{jn} \mid 1 \le n \le N\}$, $1 \le j \le J$, be the synaptic weight vector of the $j$th neuron in the network, where $J$ is the number of neurons in the network. We trained the network by the following SOM algorithm:

Step 1 Randomly select a training vector $x_i$.

Step 2 Find the neuron $j$ whose synaptic weight vector $w_j$ is closest to $x_i$, i.e.

$$\|x_i - w_j\| = \min_{1 \le k \le J} \|x_i - w_k\|. \qquad (1)$$

Step 3 For every neuron $l$ in the neighborhood of neuron $j$, update its synaptic weights by

$$w_l^{\mathrm{new}} = w_l^{\mathrm{old}} + \alpha(t)\,(x_i - w_l^{\mathrm{old}}), \qquad (2)$$

where $\alpha(t)$ is the training gain at epoch number $t$.

Step 4 Repeat Steps 1 through 3 until all training vectors have been selected, then go to Step 5.

Step 5 Increase the epoch number $t$. If $t$ reaches the preset maximum training epoch number $T$, halt the training process; otherwise decrease $\alpha(t)$ and the neighborhood size, and go to Step 1.
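The five steps above can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions: the text does not specify the decay schedules or the neighborhood shape, so the linear decay of the gain and of a circular neighborhood radius, the random initial weights, and all function and parameter names are assumptions made for the example.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def best_matching_neuron(x, weights):
    """Step 2 (Eq. 1): index of the neuron whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda k: squared_distance(x, weights[k]))


def train_som(vectors, rows, cols, epochs=20, alpha0=0.5, seed=0):
    """Train a rows x cols SOM on the given vectors and return the weight vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    n_neurons = rows * cols
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_neurons)]
    radius0 = max(rows, cols) / 2.0
    for t in range(epochs):
        # Step 5: assumed linear decay of gain and neighborhood radius per epoch.
        frac = 1.0 - t / epochs
        alpha = alpha0 * frac
        radius = radius0 * frac
        # Steps 1 and 4: visit every training vector once, in random order.
        order = list(range(len(vectors)))
        rng.shuffle(order)
        for i in order:
            x = vectors[i]
            j = best_matching_neuron(x, weights)
            jr, jc = divmod(j, cols)
            # Step 3 (Eq. 2): move the winner and its grid neighbors toward x.
            for l in range(n_neurons):
                lr, lc = divmod(l, cols)
                if math.hypot(lr - jr, lc - jc) <= radius:
                    w = weights[l]
                    for n in range(dim):
                        w[n] += alpha * (x[n] - w[n])
    return weights


# Hypothetical binary document vectors.
data = [[1, 0, 0, 1], [1, 0, 1, 1], [0, 1, 0, 0], [0, 1, 1, 0]]
trained = train_som(data, rows=2, cols=2, epochs=10)
```

Because every update is a convex combination of the old weight and a 0/1 input, the weights stay between 0 and 1, which is the property the keyword labeling process relies on.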

To obtain the DCM, each document vector is compared with every neuron in the trained map. A Web page is labeled to a neuron if the document vector of this page is closest to the synaptic weight vector of this neuron. Formally, the $i$th Web page is labeled to the $j$th neuron if Eq. 1 holds. When all Web pages have been labeled to the map, we record the Web pages labeled on each neuron and obtain the DCM. In the DCM, each neuron is labeled by a set of Web pages that have similar document vectors, i.e. vectors with many overlapping elements. Such similarity makes a neuron a cluster of Web pages that are similar in terms of their keyword co-occurrences. Web pages in the same cluster can thus be considered semantically related, since Web pages that share many words should be similar in context. Note that some neurons may have no document labeled on them. We call these neurons the unlabeled neurons.
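As a small illustration of the Web page labeling process, the following Python sketch assigns each document vector to its closest neuron and collects the unlabeled neurons. The dictionary representation of the DCM, the function names, and the sample weights are assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def build_dcm(vectors, weights):
    """Map each neuron index to the list of page indices labeled to it (Eq. 1)."""
    dcm = {j: [] for j in range(len(weights))}
    for i, x in enumerate(vectors):
        j = min(range(len(weights)), key=lambda k: squared_distance(x, weights[k]))
        dcm[j].append(i)
    return dcm


# Hypothetical trained weights for three neurons and three document vectors.
weights = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
vectors = [[1, 0], [0, 1], [1, 0]]
dcm = build_dcm(vectors, weights)
unlabeled = [j for j, pages in dcm.items() if not pages]
print(dcm)
print(unlabeled)
```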

We obtain the KCM by labeling each neuron with certain keywords, which can be achieved by examining the neurons' synaptic weight vectors. We design the keyword labeling process based on the following observations. Since we use a binary representation for the document vectors, the elements in the weight vectors will move toward either 0 or 1. Ideally, all elements should be either 0 or 1 after the training. In practice this is unlikely to happen, since the training vectors are different. However, an element will have a value near 1 if it is repeatedly moved toward 1 during the training. Since an element with value 1 in a document vector represents the presence of its corresponding keyword in that Web page, an element with a value near 1 in a synaptic weight vector shows that the neuron has been repeatedly told to 'learn' the corresponding keyword during the training process. This keyword should be important and should occur often in the Web pages that are labeled to this neuron. Thus we label a neuron with those keywords whose corresponding elements have values near 1. According to this interpretation, we design the following keyword labeling process. First we calculate the agglomerative weight vector of a neuron by aggregating the weights of its neighboring neurons as follows:

$$w_j = \frac{1}{|N_c(j)|} \sum_{k \in N_c(j)} w_k, \qquad (3)$$

where $N_c(j)$ is the set of neighboring neurons of neuron $j$. Unlabeled neurons are not used in Eq. 3. If the $n$th element of $w_j$ exceeds a predetermined threshold $\tau_K$, the corresponding keyword of that element is labeled to neuron $j$. To achieve a better result, the threshold is a real value near 1. We aggregate neighboring neurons to suppress noisy weight values that may be caused by imperfect convergence. After the labeling process, a neuron may be labeled by several keywords that often co-occur in a set of Web pages. Thus a neuron forms a keyword cluster. The KCM autonomously clusters keywords according to their similarity of co-occurrence. Keywords that tend to occur simultaneously in the same Web page will be mapped to neighboring neurons in the map. For example, the translated Chinese words for "neural" and "network" often occur simultaneously in a Web page. They will map to the same neuron, or to neighboring neurons, in the map because their corresponding elements in the encoded document vector are both set to 1. Thus a neuron will try to learn these two keywords simultaneously. As a result, their corresponding elements will have a good chance of having large values and of being labeled on this neuron. Thus we can reveal the relationship between two keywords according to their corresponding neurons in the KCM.

The determination of the threshold is an important issue in labeling keywords. Large threshold values allow fewer keywords to be labeled. These keywords, however, may be more specific in describing the theme of the underlying Web pages. Small threshold values label more keywords to a neuron, but some of them may be erroneous. It is difficult to determine the optimal threshold value. Here we devise a scheme to determine the threshold value for each neuron. Since the values of the elements of the synaptic weight vector reflect the importance of the keywords that the neuron recognizes, we may set the threshold value according to the norm of the weight vector. When a neuron recognizes many keywords as important, many elements of its weight vector have large values. This results in a large norm. The threshold should then also be large to prevent too many keywords from being labeled on this neuron. On the other hand, a small norm indicates that the neuron recognizes only a few keywords as important. We should then set the threshold to a smaller value to allow an adequate number of keywords to be labeled. Based on this reasoning, we use the following equation to determine the threshold value:

$$\tau_K = \beta \|w_j\|, \qquad (4)$$

where $\beta$ is a scale factor and $\|w_j\|$ is the norm of $w_j$. Note that the threshold $\tau_K$ may differ from neuron to neuron.
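The keyword labeling process of Eq. 3 and Eq. 4 might be sketched in Python as follows. Several details are assumptions, since the text leaves them open: the neighborhood $N_c(j)$ is taken to be the neurons within one grid step of neuron $j$, including $j$ itself; the norm in Eq. 4 is taken over the neuron's own synaptic weight vector; and the function names, sample weights, and the value of $\beta$ are hypothetical.

```python
import math


def grid_neighbors(j, rows, cols, radius=1):
    """Neurons within `radius` grid steps of neuron j, including j (an assumption)."""
    jr, jc = divmod(j, cols)
    return [r * cols + c
            for r in range(rows) for c in range(cols)
            if max(abs(r - jr), abs(c - jc)) <= radius]


def build_kcm(weights, rows, cols, vocabulary, labeled, beta):
    """Label each neuron with the keywords whose aggregated weight exceeds tau_K."""
    kcm = {}
    for j in range(rows * cols):
        # Eq. 3: average the neighbors' weights, skipping unlabeled neurons.
        members = [k for k in grid_neighbors(j, rows, cols) if k in labeled]
        if not members:
            kcm[j] = []
            continue
        agg = [sum(weights[k][n] for k in members) / len(members)
               for n in range(len(vocabulary))]
        # Eq. 4: per-neuron threshold, here from the norm of the neuron's own weights.
        tau = beta * math.sqrt(sum(v * v for v in weights[j]))
        kcm[j] = [vocabulary[n] for n, v in enumerate(agg) if v > tau]
    return kcm


vocabulary = ["neural", "network", "weather"]
weights = [[0.9, 0.8, 0.1], [0.7, 0.9, 0.2]]  # hypothetical trained 1 x 2 map
kcm = build_kcm(weights, rows=1, cols=2, vocabulary=vocabulary,
                labeled={0, 1}, beta=0.6)
print(kcm)
```

In this toy map both neurons end up labeled with "neural" and "network", while the low-valued "weather" element stays below the threshold.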

4 Metadata and Tags Generation

In this work, two kinds of annotations, namely the metadata and the tags, are added to an ordinary Web page. In this section we describe the methods used to generate such annotations. We first develop a method to generate an ontology from the feature maps generated as described in Sec. 3.

4.1 Ontology Generation

To enrich a Web page with semantic information, it helps considerably to understand the meaning of the page itself and its relationships to other Web pages. In many cases the meaning (or semantics) of a Web page can be represented by its themes, which are generally represented by one or more keywords. Meanwhile, the relationships between Web pages can be approximated by the relationships between their themes. Therefore, identifying the themes of a Web page and the relationships between these themes is the foundation of annotating Web pages. Once these kinds of knowledge have been identified, we have in effect obtained an ontology. In this subsection, we develop a method to construct an ontology that is used to generate metadata and tags for Web pages.

An ontology consists of a set of concepts along with the relationships between them. Here we let the concepts be the keywords labeled on the KCM, since these keywords are considered important. The relationships between them can be revealed by the locations of their labeled neurons. Since the SOM preserves the topology of the training feature space, nearby neurons in the trained map should be similar. Therefore, keywords labeled on nearby neurons should be related. We build relationships among neurons that are close enough, i.e. within a certain distance. The distance between two neurons is composed of their geometrical distance and their semantic distance. The geometrical distance is the Euclidean distance between the coordinates of these neurons in the map. For example, for a square formation of $J$ neurons, the coordinate of neuron $j$ is $(j_x, j_y) = (j / \sqrt{J},\; j \,\%\, \sqrt{J})$, where $j_x$ and $j_y$ denote the x and y coordinates of this neuron in the map, with $/$ read as integer division and $\%$ as the remainder. Similar definitions of coordinates apply to other types of map formation. The geometrical distance between neurons $j$ and $k$ is defined as follows:

$$D_G(j, k) = \sqrt{(j_x - k_x)^2 + (j_y - k_y)^2}. \qquad (5)$$

On the other hand, the semantic distance between neurons $j$ and $k$ is the difference between their corresponding synaptic weight vectors, which is defined as follows:

$$D_S(j, k) = \|w_j - w_k\|. \qquad (6)$$

The distance between neurons $j$ and $k$ is then calculated as a weighted sum of $D_G$ and $D_S$:

$$D(j, k) = D_G(j, k) + \gamma D_S(j, k), \qquad (7)$$

where $\gamma$ is a weight factor that scales the contribution of each component. A common setting is to let the two components contribute equally. Note that $D_G(j, k)$ ranges from 0 to $\sqrt{2J}$ for a square formation with $J$ neurons, whereas $D_S(j, k)$ ranges from 0 to $|w|$, i.e. $N$. In this case we let $\gamma = \sqrt{2J}/N$ to balance the contributions of $D_G$ and $D_S$. The weight factor for other formations can be determined in a similar way. According to the above distance definitions, we relate neurons $j$ and $k$ if their distance $D(j, k)$ is smaller than some threshold $\tau_D$. Note that $0 \le \tau_D \le \sqrt{\ldots}$

When two neurons are related, all the keywords labeled on these neurons are also related. The relationships among keywords can then be established and an ontology is constructed.
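The ontology construction can be sketched in Python as follows, using the geometrical distance (Eq. 5), the semantic distance (Eq. 6), and their weighted sum (Eq. 7) with $\gamma = \sqrt{2J}/N$ for a square map. This is an illustrative sketch: the representation of the ontology as a set of keyword pairs, the choice to relate keywords on the same neuron (whose distance is 0), and the sample map are assumptions.

```python
import math


def neuron_distance(j, k, weights, side):
    """D(j, k) = D_G + gamma * D_S for a square map with side * side neurons."""
    jx, jy = divmod(j, side)  # coordinates as (j / sqrt(J), j % sqrt(J))
    kx, ky = divmod(k, side)
    d_g = math.hypot(jx - kx, jy - ky)                                # Eq. 5
    d_s = math.sqrt(sum((a - b) ** 2
                        for a, b in zip(weights[j], weights[k])))     # Eq. 6
    gamma = math.sqrt(2 * side * side) / len(weights[j])              # sqrt(2J) / N
    return d_g + gamma * d_s                                          # Eq. 7


def related_keywords(kcm, weights, side, tau_d):
    """Return the set of keyword pairs whose neurons are closer than tau_d."""
    relations = set()
    neurons = sorted(kcm)
    for a, j in enumerate(neurons):
        for k in neurons[a:]:  # includes k == j, whose distance is 0
            if neuron_distance(j, k, weights, side) < tau_d:
                for u in kcm[j]:
                    for v in kcm[k]:
                        if u != v:
                            relations.add(tuple(sorted((u, v))))
    return relations


# Hypothetical 2 x 2 map with two-dimensional weight vectors.
weights = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
kcm = {0: ["neural"], 1: ["network"], 2: ["weather"], 3: []}
ontology = related_keywords(kcm, weights, side=2, tau_d=2.0)
print(ontology)
```

Here neurons 0 and 1 are adjacent and have similar weights, so their keywords become related, while the keyword on neuron 2 stays unrelated at this threshold.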

4.2 Generating Metadata

The metadata of a Web page in this work consists of four parts. The first part is the important keywords section, which contains a set of high-frequency keywords that appear in the Web page. The second part is the important themes section, which is a set of identified themes that reflect the main interest of the Web page. The third part is the related themes section, which lists themes related to those in the second part together with the relationships among the themes. Finally, we include a related pages section, which contains a set of Web pages that are similar to the annotated one. The important keywords are selected according to the standard tf·idf weighting scheme. We select those keywords whose tf·idf weight exceeds a threshold. These keywords, together with their weights, are recorded in the important keywords section as follows:

<KEYWORDS>keyword_1 : weight_1, ..., keyword_n : weight_n</KEYWORDS>.

This kind of information can be used for simple search and content highlighting.
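A minimal Python sketch of the important keywords section is given here. The text only says that the standard tf·idf scheme is used, so the particular variant (raw term frequency times the natural logarithm of the inverse document frequency), the two-decimal formatting, and the sample pages are assumptions.

```python
import math
from collections import Counter


def tfidf_weights(page_words, all_pages):
    """tf * log(M / df) for every keyword of one page (one common tf-idf variant)."""
    tf = Counter(page_words)
    n_pages = len(all_pages)
    weights = {}
    for term, count in tf.items():
        df = sum(1 for words in all_pages if term in words)
        weights[term] = count * math.log(n_pages / df)
    return weights


def keywords_section(weights, threshold):
    """Render the keywords whose weight exceeds the threshold as a KEYWORDS element."""
    kept = sorted(term for term, w in weights.items() if w > threshold)
    body = ", ".join(f"{term} : {weights[term]:.2f}" for term in kept)
    return f"<KEYWORDS>{body}</KEYWORDS>"


# Hypothetical segmented pages; the first one is the page being annotated.
pages = [
    ["neural", "network", "neural"],
    ["network", "protocol"],
    ["weather", "network"],
]
page_weights = tfidf_weights(pages[0], pages)
section = keywords_section(page_weights, threshold=1.0)
print(section)
```

A keyword such as "network" that occurs in every page receives weight 0 under this variant and is therefore never selected.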

The second part contains a set of themes, which have been discovered in the KCM. In the KCM, each neuron is labeled by a set of keywords that are important to the Web pages labeled on the same neuron in the DCM. Thus it is straightforward to obtain the themes of a Web page by examining the KCM and the DCM. We first find the neuron to which the Web page is labeled in the DCM. Let it be neuron $j$. The themes of this Web page are then the keywords that are labeled to neuron $j$ in the KCM. The corresponding element values of these keywords in the agglomerative weight vector $w_j$ are used as the weights of these keywords. We record these themes and their weights in the important themes section as follows:

<THEMES>keyword_1 : weight_1, ..., keyword_n : weight_n</THEMES>.

Note that the identified themes may not appear in this Web page.

To obtain the third part, which depicts the relationships among the themes, we use the ontology constructed in Sec. 4.1. Once the themes of a Web page have been identified as described above, their related themes can be derived from the ontology. Let $\{K_1, K_2, \ldots, K_t\}$ be the set of themes of this Web page. The related themes consist of all keywords that are related to these themes. That is,

$$R = \bigcup_{1 \le k \le t} R_{K_k}, \qquad (8)$$

where $R$ is the set of related themes and $R_{K_k}$ is the set of keywords related to $K_k$ as defined in the ontology. We use the distance defined in Eq. 7 as the weight of a related keyword. These related keywords and their weights are recorded in the related themes section as follows:

<RELATED THEMES>keyword_1 : weight_1, ..., keyword_n : weight_n</RELATED THEMES>.
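The lookup of themes and the union in Eq. 8 can be illustrated with the following Python sketch. The data structures are hypothetical: the DCM and KCM are dictionaries keyed by neuron index, and the ontology maps each keyword to its related keywords with the Eq. 7 distance as weight. The text does not say which weight to keep when a keyword is related to several themes, so keeping the smallest distance is an assumption.

```python
def page_themes(page, dcm, kcm):
    """Find the neuron a page is labeled to in the DCM and return its KCM keywords."""
    neuron = next(j for j, pages in dcm.items() if page in pages)
    return neuron, list(kcm[neuron])


def related_themes(themes, ontology):
    """Eq. 8: union of the keywords related to each theme, with distance weights."""
    related = {}
    for theme in themes:
        for keyword, distance in ontology.get(theme, {}).items():
            # Assumption: keep the smallest distance if a keyword appears twice.
            if keyword not in related or distance < related[keyword]:
                related[keyword] = distance
    return related


def related_themes_section(related):
    """Render the related themes and their weights as a RELATED THEMES element."""
    body = ", ".join(f"{kw} : {related[kw]:.2f}" for kw in sorted(related))
    return f"<RELATED THEMES>{body}</RELATED THEMES>"


# Hypothetical maps and ontology.
dcm = {0: ["page_a"], 1: ["page_b"]}
kcm = {0: ["neural", "network"], 1: ["weather"]}
ontology = {
    "neural": {"learning": 1.2, "brain": 1.5},
    "network": {"learning": 0.8, "protocol": 1.1},
}
neuron, themes = page_themes("page_a", dcm, kcm)
related = related_themes(themes, ontology)
print(related_themes_section(related))
```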

Finally, we obtain the related pages of a Web page from the DCM. We define the related pages of a Web page as those pages that are labeled to the same neuron. We may also include the pages labeled to nearby neurons as related pages, since these neurons are considered similar by virtue of the SOM. When Web page $j$ is related to Web page $i$, the weight of Web page $j$ is the normalized difference between their document vectors, i.e. $\|x_i - x_j\| / N$, multiplied by the geometrical distance between their corresponding neurons. Note that a value of 1 is added to the geometrical distance defined in Eq. 5 to avoid zero distances. We then record these related pages together with their weights as follows:

<RELATED PAGES>page_1 : weight_1, ..., page_n : weight_n</RELATED PAGES>.
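The weight of a related page can be computed as in the following Python sketch, which multiplies the normalized difference between the two document vectors (written $x_i$ in Sec. 3.2) by the geometrical distance of Eq. 5 plus 1. A square map is assumed, and the function name and sample vectors are hypothetical.

```python
import math


def related_page_weight(x_i, x_j, neuron_i, neuron_j, side):
    """Normalized vector difference times (geometrical neuron distance + 1)."""
    n = len(x_i)
    difference = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))
    ix, iy = divmod(neuron_i, side)
    jx, jy = divmod(neuron_j, side)
    d_g = math.hypot(ix - jx, iy - jy)  # Eq. 5
    return (difference / n) * (d_g + 1.0)


# Hypothetical binary document vectors on a 2 x 2 map.
page_i = [1, 1, 0, 0]
page_j = [1, 0, 0, 0]
same_neuron = related_page_weight(page_i, page_j, neuron_i=0, neuron_j=0, side=2)
adjacent = related_page_weight(page_i, page_j, neuron_i=0, neuron_j=1, side=2)
print(same_neuron, adjacent)
```

Under this definition a smaller weight indicates a closer page, since both factors are distances.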

4.3 Generating Tags

We define tags as pieces of information that accompany pieces of text in a Web page. In a plain Web page, several types of tags have been defined. These tags are generally syntactic and are used for typesetting or for the inclusion of multimedia objects. In this subsection we develop several types of semantic tags that provide related information about the tagged text when a user browses the page. Unlike syntactic tags, semantic tags provide semantic information about the tagged text. A common type of semantic tag is the part-of-speech tag, although it is often considered a syntactic tag. It is difficult to draw a strict line between semantic tagging and syntactic tagging, since they often overlap. We adopt a broader view of semantics here. A semantic tag, in this work, is a piece of information that describes the semantics of the tagged text. Such information can have various levels of abstraction. For example, it could be a metaphor behind the text, the conclusion given in the text, or the causality of the text. However, such high-level information is rather difficult to derive. Thus, the tags assigned to texts are often simple categorical labels in practice. For example, semantic tagging Web sites such as Flickr³, Del.icio.us⁴, and Technorati⁵ allow users to tag a Web page with user-defined categories. On these sites, users can add tags to photos, posts, Web sites, and so on to help searching. Similarly, our goal is to generate tags that can guide a user to related Web pages. We tag a keyword in the Web page with its theme. It is then simple to find the related pages and related themes through the ontology developed in Sec. 4.1.

In a Web page, we tag all keywords that appear in the KCM. The tagged information of a keyword is the index of its labeled neuron in the KCM. For example, let the keyword 'knowledge' be labeled to neuron 17 in the KCM. The keyword should then be tagged as follows:

...<MAP index="17">knowledge</MAP>...

When there are multiple choices in the KCM for the keyword, we simply select the neuron which is closest to the one to which this page is labeled. We can then find related themes and Web pages according to this neuron index. To resolve such an index, it is necessary to refer to both the KCM and the DCM. A plausible approach is to include additional metadata that point to the KCM and DCM, as the KCM and DCM elements of the sample page in Fig. 6 do.

³ http://flickr.com
⁴ http://del.icio.us
⁵ http://technorati.com


An agent or a Web browser that uses the keyword tags should refer to these maps to retrieve the necessary information. In fact, the metadata generated in Sec. 4.2, except the KEYWORDS part, could be abbreviated in this way. However, we resolve the references during the generation of metadata to increase its comprehensibility and to alleviate the burden on the agent or browser. On the other hand, many keywords need to be tagged and not all of them will be used by users. Accompanying every keyword with all of its related themes and pages would dramatically increase the size of the page. The index approach conserves space and, at the same time, provides a way to find helpful information regarding a keyword.
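The tagging rule of this subsection can be sketched as follows. The sketch is not the implementation used in the experiments. It assumes that the KCM is available as a dictionary from each keyword to the list of neuron indices it is labeled to, that neuron indices are row-major on a grid of known width, and that "closest" means Euclidean distance on the grid. The names `nearest_neuron` and `tag_keywords` are hypothetical.

```python
import re


def nearest_neuron(candidates, page_neuron, grid_width):
    """Pick the candidate neuron closest to the neuron the page is labeled to."""
    def coords(index):
        return divmod(index, grid_width)  # assumption: row-major indices

    row, col = coords(page_neuron)
    return min(
        candidates,
        key=lambda n: (coords(n)[0] - row) ** 2 + (coords(n)[1] - col) ** 2,
    )


def tag_keywords(text, kcm, page_neuron, grid_width=8):
    """Wrap every KCM keyword that occurs in text in a <MAP index="..."> tag."""
    if not kcm:
        return text
    # Try longer keywords first so that a long keyword is not split by a shorter one.
    ordered = sorted(kcm, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in ordered))

    def replace(match):
        word = match.group(0)
        index = nearest_neuron(kcm[word], page_neuron, grid_width)
        return f'<MAP index="{index}">{word}</MAP>'

    return pattern.sub(replace, text)
```

For instance, with `kcm = {"knowledge": [17]}` the call `tag_keywords("knowledge", kcm, page_neuron=57)` yields the tagged form shown above for neuron 17.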

4.4 Application Scenario

We give a scenario to demonstrate the usage and feasibility of the generated metadata and tags. A search engine sends crawlers to gather pages and indexes them using the keywords within these pages. With the generated metadata, the search engine can not only retrieve pages by themes rather than keywords, but also provide users with the related themes and pages that accompany each page. For example, when a user sends the query word 'SOM', the search engine may show a list of pages containing 'SOM', 'self-organizing maps', or 'self-organizing feature maps' since they all have the same theme. Note that these pages may not contain the query word. Together with each page, the search engine also shows links to its related themes and related pages. A fictitious result is depicted in Fig. 1 to demonstrate possible outcomes of the search engine. The user can click on the Related Pages button to view other related pages, or on the Related Themes button to browse a list of themes that are related to this page.
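Theme-based retrieval of this kind can be illustrated with a small inverted index over the themes metadata. The sketch below is hypothetical: the page names, the theme lists, and the functions `build_theme_index` and `search_by_theme` are assumptions made for illustration, and a real search engine would combine this with its keyword index.

```python
from collections import defaultdict


def build_theme_index(page_themes):
    """Map each theme (lower-cased) to the set of pages that carry it."""
    index = defaultdict(set)
    for page, themes in page_themes.items():
        for theme in themes:
            index[theme.lower()].add(page)
    return index


def search_by_theme(query, page_themes):
    """Return the pages whose themes metadata contains the query word."""
    index = build_theme_index(page_themes)
    return sorted(index.get(query.lower(), set()))


# Hypothetical themes metadata; the text of a.htm need not contain 'SOM'.
pages = {
    "a.htm": ["SOM", "self-organizing map", "self-organizing feature map"],
    "b.htm": ["typhoon", "disaster"],
}
print(search_by_theme("som", pages))  # prints ['a.htm']
```

Because the lookup goes through the themes of a page rather than its text, a page about 'self-organizing maps' is returned for the query 'SOM' even when that word never occurs in it.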

When a user browses a page, some keywords are highlighted by double underlines to distinguish them from normal hyperlinks. These keywords are tagged as described in Sec. 4.3. When the user moves the cursor over a tagged keyword, a dialog window appears and shows the semantic annotations of this keyword. This is depicted in Fig. 2. The user can then move the cursor to the dialog box and click on a keyword for a list of pages that are related to it, or click on the 'Find related pages' link to obtain a list of pages that are related to the double-underlined keyword, which is 'SOM' in the figure.

5 Experimental Result

We applied our method to the Chinese news articles posted daily on the Web by CNA (Central News Agency, Taiwan). Two corpora were constructed in our experiments. The first corpus (CORPUS-1) contains 100 news articles posted on Aug. 1, 2, and 3, 1996. The second corpus (CORPUS-2) contains 3268 Web pages (we use 'pages' and 'documents' interchangeably) posted from Oct. 1 to Oct. 9, 1996. CORPUS-1 is rather small and is used for explanatory purposes only. A word extraction process was applied to the corpora to extract Chinese words. A total of 1475 and 10937 words were extracted from CORPUS-1 and CORPUS-2, respectively. To reduce the dimensionality of the feature vectors we discarded those words which occur only once in a page. We also discarded the words that appear in a manually constructed stoplist. Finally, we discarded all words other than nouns. These processes reduce the number of words to 414 and 1567 for CORPUS-1 and CORPUS-2, respectively, which correspond to reduction rates of 72% and 86%.
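The reported reduction rates follow directly from the vocabulary sizes before and after filtering. The short check below recomputes them; the function name `reduction_rate` is ours.

```python
def reduction_rate(before, after):
    """Fraction of the vocabulary removed by the filtering steps."""
    return 1.0 - after / before


print(round(reduction_rate(1475, 414) * 100))    # CORPUS-1, prints 72
print(round(reduction_rate(10937, 1567) * 100))  # CORPUS-2, prints 86
```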

To train on CORPUS-1, we constructed a self-organizing map which contains 64 neurons in an 8 × 8 grid. The number of neurons was determined experimentally such that a better clustering can be achieved. Each neuron in the map contains 563 synapses. The initial training gain is set to 0.4 and the maximal number of training epochs is set to 100. These settings were also determined experimentally. We tried gain values ranging from 0.1 to 1.0 and numbers of training epochs ranging from 50 to 200, and simply adopted the setting which achieves the most satisfying result. After training we labeled the map with keywords and Web pages by the methods described in Sec. 3.2, and obtained the KCM and DCM, respectively, for CORPUS-1. The same process was applied to CORPUS-2 to obtain its KCM and DCM. The thresholds for labeling keywords in constructing the KCMs were set to 0.7 for both corpora. Fig. 3 shows the KCM of CORPUS-1.

To verify the effectiveness of the generated maps, we performed two evaluation processes on these maps. The first process evaluates the goodness of the DCM, which clusters similar documents together. We first categorized the training Web pages in CORPUS-1 and CORPUS-2 manually and created 27 and 229 classes, respectively. The categorization is based on semantic similarity as judged by 3 examiners. We do not assign labels to these categories since we only need to know the relationships among the documents. A strict distinction between categories is adopted. That is, we consider documents in different categories to be semantically different, although such a distinction is always fuzzy in practice. Such a strict distinction simplifies the evaluation process. When two documents in the same category are labeled to different neurons in the DCM, we give a score of 1 to this pair of documents. Otherwise, the pair has score 0. We denote the score between documents $d_i$ and $d_j$ as $s_{i,j}$. The average score over every pair of documents is calculated as follows:

$$S = \frac{1}{\binom{M}{2}} \sum_{1 \le i < j \le M} s_{i,j}. \qquad (9)$$

When documents are clustered well in the DCM, S should approach 0. On the other hand, a large value of S shows that similar documents are often labeled to different neurons. In our experiments, the values of S are 0.13 and 0.19 for CORPUS-1 and CORPUS-2, respectively. We also conducted experiments using the traditional k-means algorithm for comparison. Here the values of k are set to the numbers of manually created classes, which are 27 and 229 for CORPUS-1 and CORPUS-2, respectively. The values of S using k-means are 0.2… and 0.35 for CORPUS-1 and CORPUS-2, respectively. SOM thus outperforms k-means in our experiments.
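The score of Eq. 9 is straightforward to compute once every document has a manual category and a neuron (or cluster) label. The following is a minimal sketch, assuming M is the number of documents and that both labelings are given as lists in the same document order; the function name `clustering_score` is hypothetical.

```python
from itertools import combinations


def clustering_score(categories, neurons):
    """Eq. 9: share of all document pairs that are in the same manual
    category but are labeled to different neurons."""
    m = len(categories)
    if m < 2:
        return 0.0
    total_pairs = m * (m - 1) // 2  # binomial coefficient C(M, 2)
    penalized = sum(
        1
        for i, j in combinations(range(m), 2)
        if categories[i] == categories[j] and neurons[i] != neurons[j]
    )
    return penalized / total_pairs
```

The same function applies to the k-means comparison when the cluster assignments are passed in place of the neuron labels.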

We performed the second evaluation process to verify the effectiveness of the KCM. Since the KCM reveals the important thematic keywords that occur in the underlying set of Web pages, we should verify that these keywords are truly important in these Web pages. Generally the importance of a keyword can be measured by the number of its occurrences, i.e. the term frequency (tf), within a document. Therefore, we measure the average term frequency of a keyword $k_i$ in the KCM as follows:

$$tf_i = \frac{1}{|K_i|} \sum_{d_j \in K_i} tf_{i,j}, \qquad (10)$$

where $tf_{i,j} \in (0, 1)$ is the normalized term frequency of $k_i$ in document $d_j$ and $K_i$ is the set of documents that are labeled to the same neuron as $k_i$. We then calculate the average value of $tf_i$ over all keywords in the KCM. When this average value approaches 1, we consider the identified keywords important in their underlying documents. In our experiments, the average values are 0.59 and 0.67 for CORPUS-1 and CORPUS-2, respectively. These values are difficult to judge in isolation. Thus we computed a baseline for comparison by replacing each keyword in the KCM with a randomly selected keyword. The baseline values for CORPUS-1 and CORPUS-2 are 0.21 and 0.23, respectively. Our scheme thus produces values nearly three times the baseline values.
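The KCM evaluation can be sketched as below. The data layout is an assumption made for illustration: `tf` maps a (keyword, document) pair to its normalized term frequency, `dcm_labels` maps a neuron index to the documents labeled to it, and `kcm_labels` is a list of (keyword, neuron) labels. How the term frequencies are normalized is not specified here, so the sketch takes them as given.

```python
def mean_term_frequency(keyword, docs_on_neuron, tf):
    """Eq. 10: average normalized tf of a keyword over the documents
    labeled to the same neuron as the keyword."""
    if not docs_on_neuron:
        return 0.0
    total = sum(tf.get((keyword, doc), 0.0) for doc in docs_on_neuron)
    return total / len(docs_on_neuron)


def kcm_quality(kcm_labels, dcm_labels, tf):
    """Average of Eq. 10 over all (keyword, neuron) labels in the KCM."""
    values = [
        mean_term_frequency(keyword, dcm_labels.get(neuron, []), tf)
        for keyword, neuron in kcm_labels
    ]
    return sum(values) / len(values) if values else 0.0
```

The baseline described above would be obtained by calling `kcm_quality` again after replacing each keyword in `kcm_labels` with a randomly selected one.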

We demonstrate the result of our method with an example to show the effectiveness of the theme identification process. The themes of the 57-th neuron were identified as "the Olympic Games," "Kenya," "silver medal," "steeplechase," "tennis," "Spain," "table tennis," and "final." The titles of the Web pages that are labeled to the same neuron in the DCM are listed in Fig. 4. We also translate these titles into English for ease of comprehension. We may observe that these Web pages are all closely related to their themes.

We applied the metadata generation process to each of these corpora to obtain the metadata-enriched version of the training Web pages. In the following we take only CORPUS-1 as an example due to space limitations. First, we calculate the tf·idf value for each keyword in a page and select those keywords whose values exceed 0.5 as important keywords. The important themes section can be generated from the KCM and DCM as described in Sec. 4.2. To create the other parts of the metadata, we first generate an ontology according to the method described in Sec. 4.1. Fig. 5 depicts the generated ontology. A circle in the figure depicts a neuron's location in the map. The gray circles depict unlabeled neurons. When two neurons are found to be related, a link is created between them. Note that a self-link exists for each neuron since we consider each neuron self-related. We omit the self-links here for clarity. In this figure, two types of links exist between pairs of neurons. The dashed links, together with the solid links, depict the relationships revealed when we set the value of $\gamma$ to $128/563$, which balances the contributions of geometric distance and semantic distance. The threshold $\tau_D$ is set to 3 in this case. However, we find that this setting creates many inadequate links. This is because the geometric distance contributes too much to the distance calculation. As a result, close neurons are always related, no matter what their semantic distances are. Therefore, we increase the weight of the semantic distance to avoid creating such inadequate links. The solid links in Fig. 5 are the links created by setting the value of $\gamma$ to $512/563$, i.e. we quadruple the contribution of semantic distance. We also let $\tau_D = 3$ here. We found that this setting produces more reasonable results.

The created ontology is then used to identify the related themes of a page. Finally, we find the related pages of a page according to the DCM. The obtained metadata and the original content of a Web page are both transformed and integrated into XML format. We also tagged each page according to the method described in Sec. 4.3. An example page with generated metadata and tags is shown in Fig. 6. As we can see in the figure, various kinds of useful information have been added to a normal page. These kinds of information can be used by autonomous agents and applied to different scenarios such as those described in Sec. 4.4.


6 Conclusions and Discussions

One major strength of the Semantic Web is its ability to represent and use the semantics of Web pages. However, this strength relies on a precise understanding of the semantics of Web pages. Unfortunately, most existing Web pages do not provide such semantic information in an explicit, comprehensive, and easy-to-access way. Therefore, we must either omit these pages or extract their semantics online to meet the requirements of the Semantic Web, and both options are implausible. Semantically annotating the Web pages will help to remedy this lack of semantics. However, most existing annotation techniques require human intervention, which makes the annotation process time-consuming and inconsistent. To remedy these difficulties, we proposed an automated technique to generate metadata that semantically annotate a Web page. The proposed method applies a text mining process that uses self-organizing maps on a set of training Web pages to create an ontology and discover four kinds of semantic information, namely the important keywords, the important themes, the related themes, and the related pages. These kinds of information are aggregated into a piece of metadata and used to tag keywords of a Web page to help other applications understand the meaning of this page. We believe such metadata and tags will contribute semantic information about a page to possible applications on the Semantic Web.

We discuss some limitations of the proposed method here. First, there are several parameters that need to be determined in our method, such as $\tau_K$ in labeling keywords, $\tau_D$ in relating neurons, and $T$ in training the SOM. The values of these parameters were mostly determined experimentally and are unlikely to be optimal. Several trials should be performed before deciding on appropriate values. However, we can derive some guidelines from experience or theoretical deductions. For example, $\tau_K$ should be as close to 1 as possible to allow precise keyword labeling. This may, however, label too few keywords to a neuron. Thus we often decrease $\tau_K$ to obtain an adequate number of keywords on a neuron. The value of $\tau_K$ should not be smaller than, say, 0.5, to prevent irrelevant keywords from being labeled. Guidelines for other parameters could be derived likewise. We will investigate this topic further and try to establish rules for determining the values of these parameters.

In this work we construct four types of metadata, namely important keywords, important themes, related themes, and related pages. These types of metadata, however, were chosen solely at the author's discretion. They are neither complete nor sufficient to represent the semantics of a Web page. What we add is just a set of derived knowledge that may capture the semantics of the Web pages. Although semantics extraction is the ultimate goal of our method, we should emphasize that we did not intend to develop a method that can extract the real or full semantics of Web pages. Instead of extracting full semantics, we are concerned with the development of a fully automatic method that extracts information which is useful for understanding Web pages. It is impractical and unnecessary to know the real semantics of a Web page when we already have enough knowledge about this page. This is not to say that the four types of metadata are enough to describe or understand the Web pages. What we did is merely a start at extracting semantics from Web pages. Other types of metadata should be derived for further applications.

This work focuses on describing a method for generating metadata and tags. We did not discuss the ways to represent such derived information. Although we depicted a sample Web page using XML format in Fig. 6, the sample Web page does not comply with any standard used in practice, such as RDF and OWL. The metadata and tags, however, can easily be represented in these standards to fit real-world applications.

Acknowledgements

This work was partially supported by National Science Council under grant NSC94-2213-E-309-012.

References

Bechhofer, S., Gobel, C., 2001. Towards annotation using daml+oil. In: Proc. First International Conference on Knowledge Capture (K-CAP 2001) Work-shop on Knowledge Markup and Semantic Annotation. Victoria, B. C., Canada.

Bonino, D., Corno, F., Farinetti, L., 2003. Semantic annotation and search at the document substructure level. In: Proceedings of the 2nd International Semantic Web Conference. Florida, USA.

Cimiano, P., Handschuh, S., Staab, S., 2004. Towards the self-annotating web. In: Proceedings of The 13th International Conference on World Wide Web. New York, NY, USA, pp. 462–471.

Dill, S., Eiron, N., Gibson, D., Gruhl, D., Guha, R., Jhingran, A., Kanungo, T., McCurley, K. S., Rajagopalan, S., Tomkins, A., Tomlin, J. A., Zien, J. Y., 2003. A case for automated large-scale semantic annotation. Web

(30)

Semantics: Science, Services and Agents on the World Wide Web 1 (1), 115–132.

Dingli, A., Ciravegna, F., Wilks, Y., 2003. Automatic semantic annotation using unsupervised information extraction and integration. In: Proceedings of the 2nd International Conference on Knowledge Capture (K-CAP 2003). Florida, USA.

Erdmann, M., Maedche, A., Schnurr, H., Staab, S., 2000. From manual to semi-automatic semantic annotation: About ontology-based text annota-tion tools. In: Proceedings of the COLING 2000 Workshop on Semantic Annotation and Intelligent Content. Luxembourg.

Graubitz, H., Winkler, K., Spiliopoulou, M., 2001. Semantic tagging of domain-specific text documents with diasdem. In: Proceedings of the 1st International Workshop on Databases, Documents, and Information Fusion (DBFusion 2001). Gommern, Germany, pp. 61–72.

Handschuh, S., Staab, S., 2003. Cream: Creating metadata for the semantic web. Computer Networks 42 (5), 579–598.

Handschuh, S., Stabb, S., 2002. Authoring and annotation of web pages in cream. In: Proceedings of the 11th International World Wide Web Confer-ence (WWW 2002). Honolulu, Hawaii, USA, pp. 462–473.

Handschuh, S., Stabb, S., Ciravegna, F., 2002. S-cream–semi-automatic cre-ation of metadata. In: Proceedings of the 13th Interncre-ational Conference on Knowledge Engineering and Management (EKAW 2002). Siguenza, Spain, pp. 358–372.

Heflin, J., Hendler, J., 2000. Searching the web with shoe. In: Artificial In-telligence for Web Search. Papers from the AAAI Workshop. AAAI Press, Menlo Park, CA, pp. 35–40.

(31)

with probabilistic approaches to chinese text retrieval. In: Proc. the 2nd Int’l Workshop on Information Retrieval With Asian Languages. Tsukuba, Japan, pp. 129–140.

Kahan, J., Koivunen, M.-R., Prud´Hommeaux, E., Swick, R. R., 2001. An-notea: an open rdf infrastructure for shared web annotations. In: Proceed-ings of the 10th International World Wide Web Conference. Siguenza, Spain, pp. 623–632.

Kiryakov, A., Popov, B., Terziev, I., Manov, D., Ognyanoff, D., 2005. Semantic annotation, indexing, and retrieval. Web Semantics: Science, Services and Agents on the World Wide Web 2 (1).

Kohonen, T., 1997. Self-Organizing Maps. Springer-Verlag, Berlin.

Koivunen, M.-R., Swick, R., 2001. Metadata based annotation infrastructure offer flexibility and extensibility for collaborative applications and beyond. In: Proc. First International Conference on Knowledge Capture (K-CAP 2001) Workshop on Knowledge Markup and Semantic Annotation. Victoria, B. C., Canada, pp. 1–4.

Li, J., Zhang, L., Yu, Y., 2001. Learning to generate semantic annotation for domain specific sentences. In: Proc. First International Conference on Knowledge Capture (K-CAP 2001) Workshop on Knowledge Markup and Semantic Annotation. Victoria, B. C., Canada.

Martin, P., Eklund, P., 1999. Embedding knowledge in web documents. Com-puter Networks 31, 1403–1419.

Noy, N., Sintek, M., Decker, S., Crubezy, M., Fergerson, R., Musen, M., 2001. Creating semantic web contents with protege-2000. IEEE Intelligent Sys-tems 2 (16), 60–71.

Salton, G., McGill, M. J., 1983. Introduction to Modern Information Retrieval. McGraw-Hill, New York.

(32)

Vargas-Vera, M., Motta, E., Domingue, J., Lanzoni, M., Stutt, A., Ciravegna, F., 2001a. Mnm: ontology driven semi-automatic and automatic support for semantic markup. In: Proceedings of The 13th International Conference on Knowledge Engineering and Management (EKAW 2002). Victoria, B. C., Canada, pp. 5–12.

Vargas-Vera, M., Motta, E., Domingue, J., Shum, S. B., Lanzoni, M., 2001b. Knowledge extraction by using an ontology-based annotation tool. In: Pro-ceedings of The First International Conference on Knowledge Capture (K-CAP 2001) Workshop on Knowledge Markup and Semantic Annotation. Victoria, B. C., Canada, pp. 5–12.

Volz, R., Handschuh, S., Staab, S., Stojanovic, L., Stojanovic, N., 2004. Un-veiling the hidden bride: deep annotation for mapping and migrating legacy data to the semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 2 (1), 187–206.

Winkler, K., Spiliopoulou, M., 2001. Extraction of semantic xml dtds from texts using data mining techniques. In: Proceedings of The First Inter-national Conference on Knowledge Capture (K-CAP 2001) Workshop on Knowledge Markup and Semantic Annotation. Victoria, B. C., Canada.


List of Figures

1 A fictitious result of searching with 'SOM'

2 A fictitious result of using tag of 'SOM'

3 The keyword cluster map of CORPUS-1. We had translated each keyword into English.

4 The titles of the Web pages that have been labeled to the 57-th neuron of the KCM for CORPUS-1. We had translated them into English.

5 The created ontology. We omit self-related links in this figure.

6 A sample Web page with automatically generated metadata and tags. We translated necessary keywords into English in the parentheses succeeding the keywords.

Introduction to the self-organizing maps model
…introduce the self-organizing map (SOM) model developed by…
THEMES: SOM, self-organizing map, self-organizing feature map
www.som.xyz
Related Themes | Related Pages

Fig. 1. A fictitious result of searching with 'SOM'

In this article we will introduce the self-organizing map (SOM) model developed by Teuvo Kohonen…
Related keywords: self-organizing map, self-organizing feature map
Find related pages

Fig. 2. A fictitious result of using tag of 'SOM'

[Figure: an 8 × 8 keyword cluster map in which each cell lists the keywords labeled to one neuron; for example, the 57-th neuron carries Olympic Games, Kenya, silver medal, steeplechase, tennis, Spain, table tennis, and final.]

Fig. 3. The keyword cluster map of CORPUS-1. We had translated each keyword into English.

Kenya won both the Gold and Silver medals in the 3000 meters steeplechase in the Olympics.

Spanish tennis player Arantxa Sanchez-Vicario competes for the Gold medal in the final of women’s singles in the Olympic Tennis.

Chen Jung won the Silver medal in the women's singles in the Olympic Table Tennis and asked for the recognition of Taiwan's people.

Den Yaping and Chen Jung won the Gold and Silver medals, respectively, in the women's singles in the Olympic Table Tennis.

Fig. 4. The titles of the Web pages that have been labeled to the 57-th neuron of the KCM for CORPUS-1. We had translated them into English.

Fig. 5. The created ontology. We omit self-related links in this figure.

<?xml version="1.0" encoding="Big5"?>
<article>
<keywords>公尺(meter):1.73,肯亞(Kenya):1.98,奧運(Olympic Games):0.64,銀牌(silver medal):1.24,障礙賽(Steeplechase):1.87</keywords>
<themes>奧運(Olympic Games):0.93,肯亞(Kenya):0.81,銀牌(silver medal):0.82,障礙賽(steeplechase):0.81,網球(tennis):0.71,西班牙(Spain):0.72,桌球(table tennis):0.84,決賽(final):0.76</themes>
<related_themes>冠軍(champion):2.78,亞特蘭大(Atlanta):2.23,公斤(kilogram):2.98,角力(wrestling):2.54,律師(lawyer):2.23,籌委會(organizing committee):2.12,奧運會(Olympic Games):1.39,公尺(meter):1.78,巴西(Brazil):2.45,紀錄(record):1.98,選手(player):1.56,體操(gymnastics):2.65,簽約(sign):2.76,經紀人(manager):2.47</related_themes>
<related_pages>19960801073.htm:0.85,19960801012.htm:0.79,19960801177.htm:0.89</related_pages>
<KCM>http://.../kcm1.htm</KCM>
<DCM>http://.../dcm1.htm</DCM>
<publish_date>03 Aug 1996 10:00:34 +800(CST)</publish_date>
<subject>肯亞在奧運三千公尺障礙賽中獲金銀牌</subject>
<text>
（中央社<MAP index="49">亞特蘭大(Atlanta)</MAP>二日美聯電）<MAP index="57">肯亞(Kenya)</MAP><MAP index="58">選手(player)</MAP>基特爾二日晚間在本屆<MAP index="57">奧運(Olympic Games)</MAP>三千<MAP index="50">公尺(meter)</MAP><MAP index="57">障礙賽(steeplechase)</MAP>中以八分零七秒一二的成績獲得<MAP index="44">金牌(gold medal)</MAP>，另一位<MAP index="57">肯亞(Kenya)</MAP><MAP index="58">選手(player)</MAP>基普坦優以八分零八秒三三獲<MAP index="57">銀牌(silver medal)</MAP>，義大利的藍魯斯奇尼以八分十一秒二八獲銅牌。
</text>
</article>

Fig. 6. A sample Web page with automatically generated metadata and tags. We translated necessary keywords into English in the parentheses succeeding the keywords.
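An agent could read such a page with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a simplified, English-only sample modeled on Fig. 6. The sample text and the helper names `parse_weighted_list` and `read_page` are assumptions made for illustration, and the MAP attributes are assumed to be quoted so that the page is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Simplified sample modeled on Fig. 6; the text content is illustrative only.
SAMPLE = """<article>
<themes>Olympic Games:0.93,Kenya:0.81</themes>
<related_pages>19960801073.htm:0.85,19960801012.htm:0.79</related_pages>
<text>A <MAP index="57">Kenya</MAP> <MAP index="58">player</MAP> won the <MAP index="57">steeplechase</MAP>.</text>
</article>"""


def parse_weighted_list(text):
    """Turn 'name1:weight1,name2:weight2' into a {name: weight} dictionary."""
    items = {}
    for entry in text.split(","):
        # Split on the last colon so that names containing ':' stay intact.
        name, _, weight = entry.rpartition(":")
        items[name.strip()] = float(weight)
    return items


def read_page(xml_text):
    """Return the themes, related pages, and (keyword, neuron index) tags."""
    root = ET.fromstring(xml_text)
    themes = parse_weighted_list(root.findtext("themes"))
    related = parse_weighted_list(root.findtext("related_pages"))
    tags = [(m.text, int(m.get("index"))) for m in root.iter("MAP")]
    return themes, related, tags


themes, related, tags = read_page(SAMPLE)
print(tags)  # prints [('Kenya', 57), ('player', 58), ('steeplechase', 57)]
```

The neuron indices collected in `tags` are the values an agent would look up in the KCM and DCM to find related themes and pages.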