Clustering for Web Information Hierarchy Mining
Hung-Yu Kao, Jan-Ming Ho*, and Ming-Syan Chen
Electrical Engineering Department National Taiwan University
Taipei, Taiwan, ROC E-Mail: {[email protected],
Institute of Information Science* Academia Sinica Taipei, Taiwan, ROC E-Mail: [email protected]
Abstract
Benefiting from the growth of dynamic page generation techniques, the number and complexity of Web pages are increasing explosively. The structures of Web pages that are dynamically generated from the same templates are thus similar to one another and are usually assembled from a set of fundamental information clusters. Neighboring information clusters usually carry similar semantics and form a larger cluster with more generalized information. The hierarchical structure generated from information clusters in a bottom-up manner is called the information hierarchy of a page. In this paper, we study the problem of mining the information hierarchies of pages in Web sites to recognize the information distribution of pages within multi-level, multi-granularity configurations. Specifically, we propose an information clustering system that applies a top-down information centroid searching algorithm and a multi-granularity centroid converging process to the document object model (DOM) trees of pages to build their information hierarchies. Experiments on several real news Web sites show that the proposed method achieves high precision and recall rates in determining the information clusters of pages, and they also validate its practical applicability to real Web sites.
1. Introduction
Benefiting from the growth of dynamic page generation techniques, the number and complexity of Web pages are increasing explosively. Many Web pages are generated online for the purposes of maintenance, flexibility, and scalability of Web sites. Such Web sites are referred to as systematic Web sites in [3]. The structures of most pages in systematic Web sites are dynamically generated from the same templates. These structures are therefore similar to one another and are usually assembled from a set of fundamental information clusters. An information cluster is defined as a sub-structure of a page which provides a unique semantic representation to users among the pages in a Web site and is composed of information elements or smaller information clusters, where an information element is a context or an anchor with a non-zero length. In this paper, an information element that provides good information for users is called an information authority. In contrast, an information element is called an information hub if it contains information linking to information authorities. The definitions of the information authority and the information hub are similar to those of the authority and the hub in [4], but differ from them in that the information authorities and hubs here are not specific to any topic.
We define the information scale of an information element as the amount of information it provides to users. Note that in a Web site, the information scales and characteristics of neighboring information clusters are usually similar to one another. In view of this, we can merge them into a larger block to represent more generalized information. Such a merged structure is considered the high-level information cluster corresponding to the merged clusters, and such merging is referred to as information clustering. After the clustering, a page is composed of several disjoint information clusters. We define the configuration formed by the set of information clusters after the k-th level of information clustering as the k-th level clustering, denoted by $L_k$. The information hierarchy is then built from Clusterings $L_0, L_1, \ldots, L_n$, where Clustering $L_0$ corresponds to the configuration of the sets of all information elements, whereas Clustering $L_n$ is the configuration of the converged clustering, i.e., $L_n = L_{n+1} = L_{n+2}$ and so forth.
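The fixed-point definition of the hierarchy can be illustrated with a minimal Python sketch. It assumes that L0 places every information element in its own cluster and that one level of clustering is supplied as a function. The names `build_hierarchy`, `merge_level`, `merge_adjacent_pairs`, and `max_levels` are hypothetical and introduced only for illustration; the toy merge rule is not the MGCC process.

```python
from typing import Callable, FrozenSet, List

# A clustering is a set of disjoint clusters; a cluster is a set of element ids.
Clustering = FrozenSet[FrozenSet[str]]


def build_hierarchy(
    elements: List[str],
    merge_level: Callable[[Clustering, int], Clustering],
    max_levels: int = 50,
) -> List[Clustering]:
    """Return [L0, L1, ..., Ln], stopping once a level no longer changes."""
    # Assumption: L0 holds each information element as a singleton cluster.
    current: Clustering = frozenset(frozenset([e]) for e in elements)
    hierarchy = [current]
    for level in range(1, max_levels + 1):
        merged = merge_level(current, level)
        if merged == current:  # converged: L_n = L_{n+1}
            break
        hierarchy.append(merged)
        current = merged
    return hierarchy


def merge_adjacent_pairs(clustering: Clustering, level: int) -> Clustering:
    """Toy merge rule (hypothetical): pair up clusters in sorted order."""
    ordered = sorted(clustering, key=lambda c: min(c))
    merged = [
        ordered[i] | ordered[i + 1] if i + 1 < len(ordered) else ordered[i]
        for i in range(0, len(ordered), 2)
    ]
    return frozenset(merged)
```

With four elements and the toy rule, the loop yields three clusterings (four singletons, two pairs, one cluster of four) and then stops because a further merge leaves the configuration unchanged.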
In an HTML document, tags are inserted for the purposes of page layout, content presentation, and providing interactive functions. In this paper, we extract and utilize the knowledge in the tagging tree structure of a Web page, referred to as the Document Object Model (DOM) [8], and apply information theory to mine the information hierarchy. Specifically, we propose an information clustering system which automatically builds the information hierarchy of each page in a Web site according to both the information contained in the pages and their structure.
The main mining flow in the proposed system is to use information theory to evaluate the amount of information in content and sub-structures, and then to construct the information hierarchy by applying specific clustering methods. We first scatter the original DOM tree into several small, non-overlapping sub-trees. The system then applies a top-down searching algorithm to select a set of top-n sub-trees, referred to as the information centroids in this paper. Information clusters at different levels are built by converging the centroids using the proposed bottom-up multi-granularity centroid converging (MGCC) process. The information hierarchy is then constructed from the set of all configurations, which contain information clusters at different levels. Experiments on several real news Web sites show the high precision and recall rates of the proposed MGCC method and also validate its practical applicability to real Web sites.
Proceedings of the IEEE/WIC International Conference on Web Intelligence (WI’03) 0-7695-1932-6/03 $17.00 © 2003 IEEE
The remainder of this paper is organized as follows. In Section 2, we describe related work. The proposed system is described in Section 3. In Section 4, we empirically evaluate the performance of the proposed system on several real news Web sites. The paper concludes with Section 5.
2. Related works
The research in [5] provides a mechanism to construct multi-granularity and topic-focused Web site maps. The constructed site map can be considered a site-level information hierarchy, which differs from the page-level information hierarchy proposed in this paper.
Works in [1][2] also provide auxiliary systems to help with the information extraction of semistructured documents. However, they need either a pre-marked training set or a considerable amount of human labor to process the information extraction semi-automatically. There are also works on mining informative structures [3][6]; these, however, differ from our work in that they mainly deal with mining blocks delimited by <TABLE> tags. In contrast, we aim to mine fine-grained blocks using the DOM tree.
3. The information clustering on the DOM
We develop a DOM-based information clustering system to automatically build the information hierarchy of each page in a Web site according to the knowledge in the tree structures of pages. The mining flow of the system, shown in Figure 1, consists of two main phases, i.e., (1) the information coverage tree building phase, and (2) the multi-granularity information clustering phase. We describe these two phases in more detail in Section 3.1 and Section 3.2, respectively.
Figure 1: The system flow of information clustering
3.1 Phase 1: Information Coverage Tree Building
According to the innerText, i.e., the context delimited by a tag, and the structure of a page, we utilize several features of nodes to indicate the scales of the information authority and the information hub, namely (1) the content length (CLEN), (2) the content information index (CII), (3) the anchor precision index (API), and (4) the structure information index (SII). We then define a tree with bottom-up aggregated features as an information coverage tree (abbreviated as ICT). In the proposed system, we aggregate the node features CLEN and API to obtain the corresponding aggregated features, denoted by $CLEN_A$ and $API_A$. We use these two normalized aggregated values to indicate the aggregated scales of the information authority and the information hub of a node.
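The bottom-up aggregation that turns a DOM tree into an ICT can be sketched as follows. The text does not state how values are combined or normalized, so this sketch assumes that a node's aggregate is its own value plus the aggregates of its children, and that normalization divides by the root totals. The `Node` class and the functions `aggregate` and `normalize` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    clen: float = 0.0          # CLEN: content length of the node's own text
    api: float = 0.0           # API: anchor precision index of the node itself
    children: List["Node"] = field(default_factory=list)
    clen_a: float = 0.0        # aggregated CLEN, filled in by aggregate()
    api_a: float = 0.0         # aggregated API, filled in by aggregate()


def aggregate(node: Node) -> None:
    """Post-order pass: own value plus the aggregates of all children (assumed rule)."""
    node.clen_a, node.api_a = node.clen, node.api
    for child in node.children:
        aggregate(child)
        node.clen_a += child.clen_a
        node.api_a += child.api_a


def normalize(node: Node, clen_total: float, api_total: float) -> None:
    """Divide by the root totals so the root's aggregates equal 1 (assumed rule)."""
    node.clen_a = node.clen_a / clen_total if clen_total else 0.0
    node.api_a = node.api_a / api_total if api_total else 0.0
    for child in node.children:
        normalize(child, clen_total, api_total)
```

Under these assumptions the aggregated value of a node is never smaller than that of any of its descendants, which is the kind of covering property a top-down search can exploit.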
For the calculations of these features, we parse the innerText of the root node to extract meaningful terms. A term corresponds to a meaningful keyword or phrase. After extracting the terms in all crawled pages, we calculate the entropy value of each term according to its term frequency. From Shannon's information entropy [7], the entropy of term $term_i$ can be formulated as:

$$EN(term_i) = -\sum_{j=1}^{n} w_{ij} \log_n w_{ij}, \quad \text{where } n = |D| \text{ and } D \text{ is the set of pages},$$

in which $w_{ij}$ is the value of the normalized term frequency in the page set. When the entropy values of terms are calculated, we average the entropy values of the terms in the innerText of node N to get CII(N), the content information index of N, i.e.,

$$CII(N) = \frac{1}{k} \sum_{j=1}^{k} EN(term_j), \quad \text{where } term_j,\ j = 1, \ldots, k, \text{ are the terms in the innerText of } N.$$

The CII value of node N represents the amount of information carried in the sub-tree rooted at N. We also define the anchor precision index to indicate the correlation between an anchor and its linked page. We use the anchor text to evaluate the value of API. The correlation index API is defined as

$$API(N) = \sum_{j=1}^{m} \frac{1}{EN(term_j)},$$

where $term_j$ is a term appearing in both the anchor text of N and the linked page, and m is the number of matched terms.
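The three term-based quantities above can be sketched in Python as follows. The sketch assumes that the normalized term frequency of a term in a page is its raw frequency divided by its total frequency over all pages, and it guards the reciprocal in API with a small `eps`; both choices, as well as the function names `term_entropy`, `cii`, and `api`, are assumptions made for illustration rather than details given in the text.

```python
import math
from typing import Dict, Iterable, List


def term_entropy(term_freqs: List[float]) -> float:
    """EN(term_i) = -sum_j w_ij * log_n(w_ij), with n = number of pages."""
    n = len(term_freqs)
    total = sum(term_freqs)
    if n < 2 or total == 0:
        return 0.0  # log base n is undefined for n < 2; treated as zero here
    entropy = 0.0
    for tf in term_freqs:
        if tf > 0:
            w = tf / total  # assumed normalization of the term frequency
            entropy -= w * math.log(w, n)
    return entropy


def cii(terms: Iterable[str], en: Dict[str, float]) -> float:
    """CII(N): mean entropy of the terms found in the innerText of N."""
    values = [en[t] for t in terms if t in en]
    return sum(values) / len(values) if values else 0.0


def api(anchor_terms: Iterable[str], page_terms: Iterable[str],
        en: Dict[str, float], eps: float = 1e-6) -> float:
    """API(N): sum of 1 / EN(term_j) over terms shared by anchor and linked page."""
    matched = set(anchor_terms) & set(page_terms)
    # eps (an assumption) avoids division by zero for zero-entropy terms.
    return sum(1.0 / max(en[t], eps) for t in matched if t in en)
```

A term spread evenly over all pages reaches the maximum entropy of 1, whereas a term concentrated in a single page has entropy 0 and therefore contributes most strongly to API.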
Finally, the index SII of a node is calculated according to the distribution of the feature values of the node's children. We define children(N) as the set of all non-dummy children of node N. We define the SII value of node N with children $n_0, n_1, \ldots, n_{m-1}$ for feature $f_i$ as:

$$SII(N, f_i) = -\sum_{j=0}^{m-1} w_{ij} \log_m w_{ij}, \quad \text{where } w_{ij} = \frac{f_i(n_j)}{\sum_{k=0}^{m-1} f_i(n_k)} \text{ and } n_j \in children(N).$$
We apply the entropy calculation here to represent the distribution of the children's feature values for any node with more than one child. The value of SII indicates the degree to which the feature values of the node are dispersed among its children. The higher the value of $SII(N, f_i)$, the closer the $f_i$ values of all children are to being equal.
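As an illustration of how SII behaves, the following minimal Python sketch computes the normalized entropy of a list of children's feature values. The function name `sii` and the convention of returning 0 for nodes with fewer than two children or with a zero feature total are assumptions made for this example.

```python
import math
from typing import List


def sii(child_values: List[float]) -> float:
    """SII(N, f_i): entropy (log base m) of the children's shares of feature f_i."""
    m = len(child_values)
    total = sum(child_values)
    if m < 2 or total <= 0:
        return 0.0  # assumed convention: undefined cases map to 0
    result = 0.0
    for value in child_values:
        if value > 0:
            w = value / total  # w_ij: share of child n_j in the feature total
            result -= w * math.log(w, m)
    return result
```

Equal shares such as `[10, 10, 10]` give a value of 1, a single dominant child such as `[30, 0, 0]` gives 0, and skewed shares fall in between.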
3.2 Phase 2: Multi-Granularity Information Clustering
In this phase, we first apply the proposed information centroid searching algorithm to select the top-k information clusters of a page, which are called information centroids. The bottom-up centroid converging method is then applied to these top-k centroids to build the multi-granularity information clusterings in different levels. After scattering a DOM tree into a set of sub-trees by the given SII constraint, i.e., SII Threshold (ST), we can find the set of scattered centroid candidates. According to the covering characteristic of the aggregated features, a top-down, greedy searching algorithm can extract information centroids with the top-k information scales among these candidates.
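One possible reading of the top-down, greedy search is a best-first traversal that relies on the covering characteristic, i.e., on a parent's aggregated value being at least as large as that of any descendant. The Python sketch below follows this reading; it is not the authors' algorithm. It assumes that the scattering step has already marked the roots of the candidate sub-trees, and `TreeNode`, `scale`, `is_candidate`, and `top_k_centroids` are hypothetical names.

```python
import heapq
from dataclasses import dataclass, field
from typing import List


@dataclass
class TreeNode:
    name: str
    scale: float                  # aggregated feature value of the sub-tree
    is_candidate: bool = False    # assumed to be set by the scattering step
    children: List["TreeNode"] = field(default_factory=list)


def top_k_centroids(root: TreeNode, k: int) -> List[TreeNode]:
    """Best-first search: always expand the node with the largest scale."""
    heap = [(-root.scale, 0, root)]  # max-heap via negated scale; counter breaks ties
    counter = 1
    selected: List[TreeNode] = []
    while heap and len(selected) < k:
        _, _, node = heapq.heappop(heap)
        if node.is_candidate:
            selected.append(node)    # do not descend: candidates do not overlap
            continue
        for child in node.children:
            heapq.heappush(heap, (-child.scale, counter, child))
            counter += 1
    return selected
```

Because no unexplored candidate can exceed the scale of its ancestor still waiting in the heap, the candidates are popped in non-increasing order of scale, so the search can stop as soon as k of them have been found.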
After the top-k information centroids are extracted, we apply a node merging process, called centroid converging, to merge neighboring and similar information centroids and clusters into a more generalized cluster. One iteration of the basic centroid converging process in the i-th level of converging contains two steps: (1) verifying the incremental cluster constraint, i.e., $CInfo_{base} + i \cdot CInfo_{inc}$, and (2) finding the new converged centroid. The values in the tuple ($CInfo_{base}$, $CInfo_{inc}$) are pre-assigned thresholds used to judge whether to cluster a set of information centroids and whether to continue converging within a level of converging. Note that one level of the converging process may contain more than one iteration of the basic process, as shown in Figure 2.
We define a converging scope (or simply scope) as a set of sibling nodes in the DOM tree which contains at least one centroid. Each node N in the converging scopes of the MGCC process can be expressed as a point in the 2-dimensional space of the scales of information authority and information hub with the tuple value $(CII(N),\ API_A(N) / AC(N))$, where AC(N) is the count of anchors contained in T(N). For a set of centroids in scope k, we calculate the value $CInfo_k$, which is equal to the geometric average of the maximum difference of information authorities, i.e., $CInfoAuth_k$, and the maximum difference of information hubs, i.e., $CInfoHub_k$, among the set of centroids and the converged centroid in the scope, as shown in Figure 3. We use the value $CInfo_k$ to measure the information diversity between centroids in the same scope. When there is only one centroid in the scope, $CInfo_k$ is equal to the distance between the centroid and the converged centroid in the 2-dimensional space.
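The diversity measure and the incremental cluster constraint can be sketched together in Python. This sketch takes "geometric average" literally as the square root of the product of the two maximum differences, measures each difference against the converged centroid, and assumes that a scope may converge only while its diversity does not exceed the level threshold. All three points are interpretations rather than details confirmed by the text, and `cinfo`, `level_threshold`, and `may_converge` are hypothetical names.

```python
import math
from typing import List, Tuple

# A point is (information authority scale, information hub scale).
Point = Tuple[float, float]


def cinfo(centroids: List[Point], converged: Point) -> float:
    """Geometric mean of the largest authority gap and the largest hub gap."""
    auth_gap = max(abs(a - converged[0]) for a, _ in centroids)  # CInfoAuth_k
    hub_gap = max(abs(h - converged[1]) for _, h in centroids)   # CInfoHub_k
    return math.sqrt(auth_gap * hub_gap)


def level_threshold(level: int, base: float, inc: float) -> float:
    """Incremental cluster constraint for level i: CInfo_base + i * CInfo_inc."""
    return base + level * inc


def may_converge(centroids: List[Point], converged: Point,
                 level: int, base: float, inc: float) -> bool:
    """Assumed test: converge only if the diversity stays within the threshold."""
    return cinfo(centroids, converged) <= level_threshold(level, base, inc)
```

Since the threshold grows with the level index, scopes that are too diverse to converge at a low level can still be merged at a higher level, which is what produces clusters of coarser granularity.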
Figure 2: The centroid converging in the DOM tree
Figure 3: The different converging cases of the scopes in Figure 2
4. Clustering results and evaluations
News Web sites are typical systematic Web sites. The structures of their TOC (table of contents) pages and article pages are distinct and are well suited to evaluating the proposed method of mining the information hierarchy. We therefore conduct our experiments on pages in the datasets used in [3]. The datasets contain several commercial news Web sites, as described in Table 1.
Table 1: Datasets for experiments and evaluations of information clusters
Site Abbr.  URL                   Total pages  TOC pages  TOC answer  Article answer
CDN         www.cdn.com.tw        261          25         22*/38      60#/63
TIMES       news.chinatimes.com   3747         79         69/313      66/68
CNA         www.cna.com.tw        1400         33         29/106      50/50
CNET        taiwan.cnet.com       4331         78         38/84       37/86
CTS         www.cts.com.tw        1316         31         19/21       53/80
TVBS        www.tvbs.com.tw       740          13         12/25       50/50
TTV         www.ttv.com.tw        861          22         18/20       42/75
UDN         udnnews.com           4676         252        243/674     52/106
TOTAL                             12035        530        450/1281    411/579
#: The domain experts selected the article pages with different and distinctive tagging styles to be the article answer set.
4.1 The results of information clustering
We apply the proposed information clustering method to the pages with marked types to build their information hierarchies. After building the ICTs, we apply the information centroid searching algorithm under different SII thresholds. We use different ST values to control the number and granularity of the information centroids, as shown in Figure 4.
Figure 4: The distribution of the number of information centroids and their corresponding sizes under different STs. [Two panels, each plotting article pages and TOC pages against ST from 0.7 to 0.9: the number of nodes in a centroid, and the number of information centroids.]
4.2 Evaluations on informative information clusters
To assess the proposed process, we use two answer sets, i.e., TOC blocks and article blocks, to evaluate the precision and recall rates of Clustering $L_1$. Figure 5 shows the average improvement of the first-level converging. We use two evaluation methods, i.e., significant node coverage (SNC) and information coverage (IC), to evaluate the precision and recall rates on TOC and article pages, respectively. Specifically, SNC evaluates the precision (P) and recall (R) rates by matching anchor nodes, and IC does so by matching innerText. We also use the F-measure, which is the harmonic mean of precision and recall and is formulated as $2 \cdot P \cdot R / (P + R)$, to evaluate the results with a single efficiency measure. It can be observed from this figure that the enhanced performance of the information hierarchy is prominent when k=1.
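For reference, the three measures can be computed as in the following minimal Python sketch, which matches extracted items against an answer set by identity, in the spirit of SNC matching anchor nodes. The function name `precision_recall_f` and the string identifiers are hypothetical; an IC-style evaluation would match innerText instead.

```python
from typing import Set, Tuple


def precision_recall_f(extracted: Set[str], answer: Set[str]) -> Tuple[float, float, float]:
    """Return (P, R, F) where F = 2 * P * R / (P + R)."""
    hits = len(extracted & answer)                     # items found in both sets
    p = hits / len(extracted) if extracted else 0.0    # precision
    r = hits / len(answer) if answer else 0.0          # recall
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0    # harmonic mean of P and R
    return p, r, f
```

For example, if a cluster contains four anchors of which three belong to a five-anchor answer block, then P = 0.75, R = 0.6, and F is about 0.667.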
5. Conclusion
In this paper, we propose an information hierarchy mining system that applies multi-granularity centroid converging information clustering to the DOM trees of pages. After the aggregated features are computed, a DOM tree is scattered into many small sub-trees by a dynamic threshold on the structure information, and the system applies a top-down information centroid searching algorithm to select a set of sub-structures. The information hierarchy is then built from the multi-level configurations generated by expanding and merging the centroids using the proposed multi-granularity centroid converging (MGCC) method. The attained information hierarchy is useful not only for search engines, inter-media information agents, and crawlers to index, extract, and navigate significant information from a Web site, but also for providing the hierarchical configurations of a page according to the amount of information contained. The clustering results show that the proposed process can effectively extract the information hierarchies of pages. Experiments on several real news Web sites show the high precision and recall rates of the proposed system in finding the information clusters of pages and also validate its practical applicability to real Web sites.
Figure 5: The evaluation of the informative clusters in Clustering $L_1$. [Two panels plotting the F-measure against ST from 0.70 to 0.90, with series k = 1, 3, 5 and k = 1, 3, 5 with L1: Article [CLEN_A, SII(CLEN_A)], TC = 0.8; and TOC [API_A, SII(API_A)], TC = 1.25.]
Acknowledgement
The authors are supported in part by the Ministry of Education Project No.89-E-FA06-2-4, and the National Science Council Project No. NSC 91-2213-E-002-034 and NSC 91-2213-E-002-045, Taiwan, Republic of China.
References
[1] B. Adelberg. NoDoSE - a tool for semi-automatically extracting structured and semistructured data from text documents. Proc. of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD'98), 1998.
[2] C. N. Hsu and M. T. Dung. Generating Finite-state Transducers for Semi-structured Data Extraction from the Web. Information Systems, 23(8):521-538, 1998.
[3] H.-Y. Kao, S.-H. Lin, J.-M. Ho and M.-S. Chen. Entropy-Based Link Analysis for Mining Web Informative Structures. Proc. of the ACM 11th International Conf. on Information and Knowledge Management (CIKM-02), Nov. 4-9, 2002.
[4] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. ACM-SIAM Symposium on Discrete Algorithms, 1998.
[5] W. S. Li, N. F. Ayan, O. Kolak and Q. Vu. Constructing Multi-Granular and Topic-Focused Web Site Maps. Proc. of the 10th World Wide Web Conference, 2001.
[6] S.-H. Lin and J.-M. Ho. Discovering Informative Content Blocks from Web Documents. The 8th ACM SIGKDD, 2002.
[7] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:398-403, 1948.
[8] W3C DOM. Document Object Model (DOM). http://www.w3.org/DOM/.