Chapter 4 Webpage Information Extraction
Li Fang
Dept. of Computer Science, Shanghai Jiao Tong University
Contents
• Overview of Information Extraction tools from Web pages
• Wrapper Induction
• Wrapper Maintenance
Two kinds of webpages
Multiple-record page extraction (left)
One-record page extraction (right)
SD:semi-structured
Languages for wrapper development (for manually constructed IE systems)
Minerva: combines a declarative grammar-based approach with features typical of procedural programming languages.
Tsimmis: includes wrappers that can be configured through specification files written by the user.
Web-OQL: originally aimed at performing SQL-like queries over the Web.
Web-OQL
• Hypertrees are arc-labeled ordered trees.
• Tag name
• The piece of HTML code
• Text excluding markup
Web-OQL (cont.)
• Query: extracts the reviewer names “Jeff” and “Jane” from page pe2.
Overview of Web data extraction tools (cont.)
• HTML-aware Tools
• W4F(world wide web wrapper factory): a toolkit for building wrappers.
• XWRAP: a component library that provides basic building blocks for wrapper development.
• NLP-based tools
RAPIER (job postings), SRV, WHISK: suitable for Web pages consisting of grammatical text, such as job listings, apartment rental advertisements and seminar announcements.
NLP-based tools: RAPIER
• Extraction rule for the book title:
• Preceded by words “Book”, “Name”, and “</b>”
• Followed by the word “<b>”.
• The “Filler pattern” specifies that the title consists of at most two words that were labeled as “nn” or “nns” by the POS tagger (i.e., one or two singular or plural common nouns).
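The rule above can be sketched as a pre-filler / filler / post-filler match over POS-tagged tokens. This is only an illustration under stated assumptions: the function `extract_filler`, the tag `sym` for markup tokens, and the tagged sample sentence are all hypothetical, and a real system would obtain the tags from a POS tagger.

```python
from typing import List, Optional, Tuple

Tagged = List[Tuple[str, str]]  # (word, POS tag) pairs; tags are assumed given


def extract_filler(tagged: Tagged, pre: List[str], post: List[str],
                   filler_tags: set, max_len: int) -> Optional[List[str]]:
    """Return the first filler that sits between `pre` and `post`.

    The filler must be 1..max_len tokens long and every token must carry
    one of `filler_tags`. Hypothetical helper, not an API of any tool.
    """
    words = [w for w, _ in tagged]
    for i in range(len(words) - len(pre) + 1):
        if words[i:i + len(pre)] != pre:
            continue
        start = i + len(pre)
        for n in range(1, max_len + 1):
            end = start + n
            if end + len(post) > len(words):
                break
            if any(tag not in filler_tags for _, tag in tagged[start:end]):
                break
            if words[end:end + len(post)] == post:
                return words[start:end]
    return None


# Hypothetical pre-tagged page fragment ("sym" marks markup tokens).
page = [("<b>", "sym"), ("Book", "nn"), ("Name", "nn"), ("</b>", "sym"),
        ("Data", "nn"), ("Mining", "nn"), ("<b>", "sym"),
        ("Reviews", "nns"), ("</b>", "sym")]

title = extract_filler(page, pre=["Book", "Name", "</b>"], post=["<b>"],
                       filler_tags={"nn", "nns"}, max_len=2)
print(title)
```

A three-noun title would not match here, because the filler pattern allows at most two words.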
Overview of Web data extraction tools (cont.)
• Modeling-based Tools
• NoDoSE: an interactive tool for semi-automatically determining the structure of a Web page.
• DEByE: an interactive tool to extract page contents based on a set of example objects.
• Ontology-based Tools
Ontologies are previously constructed to describe the data of interest, including relationships, lexical appearance and context keywords.
Overview of Web data extraction tools
• Wrapper induction tools
WIEN, SoftMealy, Stalker
Wrapper Technologies
• What is a wrapper
• Wrapper induction
• Wrapper maintenance
What is a Wrapper?
• For information integration
A procedure designed to extract the content of a particular information source and deliver the content of interest in a self-describing representation (e.g. XML)
• For Web application
– A program that extracts the desired information from Web pages.
Semi-structured document → wrapper → structured information
For Web Applications:
• Given a Web page S containing a set of implicit objects, determine a mapping W that populates a data repository R with the objects in S.
[Figure: the wrapper applies mapping W to page S (and to pages similar to S) to populate repository R.]
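As a minimal sketch of such a mapping W, the hand-written wrapper below turns one hypothetical page layout into a list of records (the repository R). The page markup, the field names `name` and `phone`, and the function `wrapper` are assumptions made for illustration only.

```python
import re
from typing import Dict, List

# One record per <li>; the layout is hypothetical.
RECORD = re.compile(
    r"<li>\s*<b>(?P<name>[^<]+)</b>\s*:\s*<i>(?P<phone>[^<]+)</i>\s*</li>"
)


def wrapper(page: str) -> List[Dict[str, str]]:
    """Mapping W: extract every implicit object in page S as a record."""
    return [m.groupdict() for m in RECORD.finditer(page)]


page_s = ("<ul><li><b>Yala</b>: <i>(800) 111-1717</i></li>"
          "<li><b>Kim</b>: <i>(310) 222-1234</i></li></ul>")
repository_r = wrapper(page_s)
print(repository_r)

# The same wrapper applied to a page whose markup changed (<b> became <strong>).
changed = page_s.replace("<b>", "<strong>").replace("</b>", "</strong>")
print(wrapper(changed))
```

The second call returns no records, which shows how closely a wrapper of this kind depends on the page structure.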
An example for a wrapper
Wrapper Induction
• Web wrappers wrap...
– “Query-able” or “Search-able” Web sites
– Web pages with large itemized lists
• The primary issues are:
– How to build the extractor quickly?
– Wrapper induction algorithms search a hypothesis space of possible wrapper programs for a wrapper that has high extraction accuracy on a set of training pages.
Wrapper Induction: Methods
• Manually writing wrappers
– Tedious, time-consuming task, e.g. TSIMMIS, Minerva, …
• Wrapper programming languages
– Florid (a logic-programming formalism), Pillow (an HTML/XML programming library for logic programming systems), …
• Machine learning methods
– Stalker, SoftMealy, WIEN, …
• Supervised interactive wrapper generation
– W4F (uses an SQL-like query language called HEL), XWRAP (uses a procedural rule system), …
Wrapper Induction Tools
• WIEN:
• Input: a set of pages where data of interest is labeled to serve as examples
• Output: a wrapper that is consistent with each labeled page.
• SoftMealy
• Uses a finite-state transducer (FST) that takes a sequence of tokens as input and matches the context separators against contextual rules to determine state transitions
• Stalker
• The wrapper induction techniques used in WIEN and SoftMealy are further developed in Stalker
Wrapper Induction: machine learning methods (Stalker)
The lifecycle of a wrapper
Our focus here
Learning Extraction Rules from Pages
• Aim:
Defining a set of extraction rules that precisely define how to locate the information on the page.
How to describe the content of a page?
Describing the content of a page:
Embedded Catalog Tree
• Embedded catalog (EC): a tree-like structure to represent a Web page.
• Leaves: items of interest for the user
• Internal nodes: lists of k-tuples where each item in the k-tuple can be either a leaf or another list L.
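One possible in-memory form of an EC tree is sketched below. The class names `Leaf` and `ListNode`, the helper `leaf_paths` and the restaurant-guide schema are hypothetical; they only illustrate that leaves hold items of interest while list nodes describe the tuple layout they repeat.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Leaf:
    """An item of interest for the user."""
    name: str


@dataclass
class ListNode:
    """A list of k-tuples; `tuple_items` describes one tuple of the list."""
    name: str
    tuple_items: List[Union[Leaf, "ListNode"]] = field(default_factory=list)


def leaf_paths(node, prefix: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Enumerate the root-to-leaf paths of the EC description."""
    path = prefix + (node.name,)
    if isinstance(node, Leaf):
        return [path]
    paths = []
    for item in node.tuple_items:
        paths.extend(leaf_paths(item, path))
    return paths


# Hypothetical schema: each restaurant has a name, a cuisine and a
# nested list of locations, each with an address and a phone number.
guide = ListNode("Restaurants", [
    Leaf("Name"),
    Leaf("Cuisine"),
    ListNode("Locations", [Leaf("Address"), Leaf("Phone")]),
])

for p in leaf_paths(guide):
    print(" / ".join(p))
```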
Embedded Catalog Tree (for example)
A list of five tuples
Extracting Rule based on EC
• A rule: for each node x in the EC tree, the wrapper needs a rule r that extracts that node from its parent. An item is extracted by following the path p from the root to the corresponding leaf and applying the rule of each node along p.
• A list iteration rule: for each list node, decomposes the list into individual tuples; r is then applied to each extracted tuple.
Example for extraction rules
To extract the address:
R1: start from the beginning of the document and skip every token until you find a landmark consisting of the word Address, and then ignore everything until you find the landmark <I>.
R1 = SkipTo(Address) SkipTo(<I>)
R2 = SkipTo(Address : <I>)
R3 = SkipTo(Cuisine : <I>) SkipTo(Address : <I>)
Example for extraction rules (cont.)
R4 = SkipTo(Cuisine : <I> _Capitalized_ </I> <p> Address : <I>)
R4 is defined based on a 9-token landmark that uses the wildcard _Capitalized_, which is a placeholder for any capitalized alphabetic string.
Disjunctive rules: either R1 or R2
To deal with variations in the format of the documents, disjunctions of rules are allowed.
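The SkipTo rules above can be sketched as follows. The tokenizer, the function names and the two sample pages are assumptions for illustration; a rule is represented as a list of landmarks, and each landmark as a list of tokens or wildcards.

```python
import re

# Wildcards stand for classes of tokens (names follow the slides).
WILDCARDS = {
    "_Capitalized_": re.compile(r"[A-Z][a-z]*"),
    "_Number_": re.compile(r"\d+"),
}


def tokenize(text):
    """Split into HTML tags, words/numbers and single punctuation marks."""
    return re.findall(r"</?\w+>|\w+|[^\w\s]", text)


def token_matches(pattern, token):
    wildcard = WILDCARDS.get(pattern)
    if wildcard is not None:
        return wildcard.fullmatch(token) is not None
    return pattern == token


def skip_to(tokens, start, landmark):
    """Return the index just after the first match of `landmark`, or None."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if all(token_matches(p, t) for p, t in zip(landmark, tokens[i:i + n])):
            return i + n
    return None


def apply_rule(tokens, rule):
    """Apply the SkipTo() steps in order; None means the rule fails."""
    pos = 0
    for landmark in rule:
        pos = skip_to(tokens, pos, landmark)
        if pos is None:
            return None
    return pos


def apply_disjunction(tokens, rules):
    """Return the result of the first rule that matches."""
    for rule in rules:
        pos = apply_rule(tokens, rule)
        if pos is not None:
            return pos
    return None


R1 = [["Address"], ["<I>"]]
R2 = [["Address", ":", "<I>"]]
R3 = [["Cuisine", ":", "<I>"], ["Address", ":", "<I>"]]
R4 = [["Cuisine", ":", "<I>", "_Capitalized_", "</I>", "<p>", "Address", ":", "<I>"]]

# Two hypothetical page layouts.
page_a = tokenize("<p> Name: <b>Yala</b> <p> Cuisine: Thai <p> "
                  "Address : <I>4000 Colfax, Phoenix, AZ 85258</I>")
page_b = tokenize("<p> Cuisine : <I>Thai</I> <p> Address : <I>12 Pico St.</I>")

print(page_a[apply_rule(page_a, R1)])             # first token of the address
print(apply_rule(page_a, R3))                     # R3 does not match page_a
print(page_a[apply_disjunction(page_a, [R3, R2])])
print(page_b[apply_rule(page_b, R4)])
```

On `page_a` the cuisine is not in italics, so R3 fails and the disjunction falls back to R2.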
Extraction Rules as Finite Automata
• Landmarks: each argument of a SkipTo()
– A sequence of tokens and wildcards
• Landmark automata
– A non-deterministic finite automaton in which each transition from state S_i to state S_j is labeled by a landmark l_{i,j}
– The transition takes place if the automaton is in the state S_i and the landmark l_{i,j} matches the sequence of tokens at the input
Landmark Automata (linear LA)
• A linear LA has one accepting state.
• From each non-accepting state, there are exactly two possible transitions: a loop to itself, and a transition to the next state;
• each non-looping transition is labeled by a landmark;
• all looping transitions have the meaning “consume all tokens until you encounter the landmark that leads to the next state”.
R1 ::= SkipTo( ( )
R2 ::= SkipTo(phone) SkipTo(<b>)
Disjunctive rule: either R1 or R2
Rules and their automata
•The initial state S0 has a branching-factor of k.
•It has exactly k accepting states. (one per branch)
Learning Extraction Rules
[Figure: the user marks sample pages; STALKER learns extraction rules from the marked samples.]
R1 ::= SkipTo( ( )
– It accepts the positive examples E2 and E4.
– It rejects both E1 and E3 because R1 cannot be matched on them; R2 can.
Process of the example (STALKER)
1. Generate a linear LA that covers as many as possible of the four positive examples.
2. Create another linear LA for the remaining examples, and so on.
3. Once STALKER has covered all examples, it returns the disjunction of all the induced LAs.
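The covering loop in the three steps above can be sketched as below. This is a deliberately simplified stand-in, not the STALKER algorithm itself: candidates are limited to a single landmark made of the last one to three tokens before the item in a seed example, and there is no refinement step. The four example strings, the tokenizer and all function names are assumptions for illustration.

```python
import re


def tokenize(text):
    return re.findall(r"</?\w+>|\w+|[^\w\s]", text)


def apply_rule(tokens, rule):
    """rule = list of landmarks (tuples of tokens); returns item start or None."""
    pos = 0
    for landmark in rule:
        n = len(landmark)
        hits = [i for i in range(pos, len(tokens) - n + 1)
                if tuple(tokens[i:i + n]) == landmark]
        if not hits:
            return None
        pos = hits[0] + n
    return pos


def learn_disjunct(examples, max_len=3):
    """Pick the candidate that covers most examples and is wrong on none."""
    seed_tokens, seed_target = min(examples, key=lambda e: e[1])
    best_rule, best_covered = None, []
    for n in range(1, min(max_len, seed_target) + 1):
        rule = [tuple(seed_tokens[seed_target - n:seed_target])]
        results = [apply_rule(tokens, rule) for tokens, _ in examples]
        if any(pos is not None and pos != target
               for pos, (_, target) in zip(results, examples)):
            continue  # matches at a wrong position somewhere: discard
        covered = [ex for ex, pos in zip(examples, results) if pos == ex[1]]
        if len(covered) > len(best_covered):
            best_rule, best_covered = rule, covered
    return best_rule, best_covered


def learn_rule(examples):
    """Sequential covering: learn disjuncts until every example is covered."""
    remaining, disjuncts = list(examples), []
    while remaining:
        rule, covered = learn_disjunct(remaining)
        if rule is None:
            raise ValueError("no consistent candidate in this simplified search")
        disjuncts.append(rule)
        remaining = [ex for ex in remaining if ex not in covered]
    return disjuncts


def example(text, item):
    tokens = tokenize(text)
    return tokens, tokens.index(item)  # the item to extract starts here


# Hypothetical training examples: the item to extract is the area code.
examples = [
    example("513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515", "800"),
    example("90 Colfax, <b>Palms</b>, Phone (800) 508-1570", "800"),
    example("523 1st St., <b>LA</b>, Phone 1-<b>888</b>-578-2293", "888"),
    example("403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008", "310"),
]

for disjunct in learn_rule(examples):
    print(" ".join("SkipTo(" + " ".join(lm) + ")" for lm in disjunct))
```

On these strings the first disjunct covers the two examples with a parenthesized area code and the second covers the other two, mirroring the "cover, then learn another LA for the rest" loop.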
STALKER Algorithm
See example
STALKER Algorithm (cont.)
• Refine() function: obtains better disjuncts either by making the landmarks more specific (landmark refinements), or by adding new states in the automaton (topology refinements).
• Landmark refinements
• Topology refinements
Landmark Refinement
• R4 = SkipTo(<b>)
Refined as:
Topology Refinements
• R4 = SkipTo(<b>)
Refined as:
Each initial candidate is a 2-state landmark automaton whose landmark is either a token t that ends a prefix p or a wildcard that matches such a t.
Example of rule induction
During the second iteration, with examples E1 and E3, the initial candidate rules are R5 and R6.
Refinement: tries to obtain better disjuncts either by making the landmarks more specific (landmark refinement) or by adding new states to the automaton (topology refinement).
A perfect rule, which matches the examples
Seed examples
Identifying highly informative examples
• The most informative examples illustrate exceptional cases
• Active learning: analyzes the set of unlabeled examples to automatically select examples for the user to label
• Forward and backward rules:
Fwd R1 = SkipTo(Address) SkipTo(<I>)
Bwd R1 = BackTo(Phone) BackTo(_Number_)
If the two rules disagree on an example, that example is highly informative and is selected for the user to label.
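The selection step can be sketched as follows: both rules are applied to each unlabeled page, and pages on which they point to different start positions are queued for labeling. The sample pages, the single-token landmarks and the function names are assumptions; BackTo is taken here to stop at the matched token itself while scanning right to left.

```python
import re


def tokenize(text):
    return re.findall(r"</?\w+>|\w+|[^\w\s]", text)


def matches(pattern, token):
    # "_Number_" is the only wildcard needed for this sketch.
    return token.isdigit() if pattern == "_Number_" else pattern == token


def forward(tokens, rule):
    """SkipTo(...) steps: the item starts just after the last landmark."""
    pos = 0
    for landmark in rule:
        hits = [i for i in range(pos, len(tokens)) if matches(landmark, tokens[i])]
        if not hits:
            return None
        pos = hits[0] + 1
    return pos


def backward(tokens, rule):
    """BackTo(...) steps: scan from the end; the item starts at the last landmark."""
    pos = len(tokens)
    for landmark in rule:
        hits = [i for i in range(pos - 1, -1, -1) if matches(landmark, tokens[i])]
        if not hits:
            return None
        pos = hits[0]
    return pos


FWD = ["Address", "<I>"]
BWD = ["Phone", "_Number_"]


def select_queries(pages):
    """Indices of unlabeled pages on which the two rules disagree."""
    return [i for i, page in enumerate(pages)
            if forward(tokenize(page), FWD) != backward(tokenize(page), BWD)]


# Hypothetical unlabeled pages; the second address contains an extra number.
unlabeled = [
    "Name : <b>Yala</b> <p> Address : <I>12 Pico St.</I> <p> Phone : <I>(800) 111-1717</I>",
    "Name : <b>Kim</b> <p> Address : <I>512 Oak Blvd., Suite 7</I> <p> Phone : <I>(800) 222-1234</I>",
]
print(select_queries(unlabeled))
```

Only the second page is selected: the forward rule starts the address at 512, the backward rule at 7.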
Results reported from STALKER
• From 28 sources, 206 extraction rules: 182 rules are 100% correct, 18 rules are >90% correct, and 3% of the rules are <90% correct.
• Active learning:
Average accuracy improved from 85.7% to 94.2%
STALKER features
• The ability to wrap a larger variety of sources.
• Capable of learning most of the extraction rules from just a couple of examples.
• Keeps high accuracy while using single-slot rules.
• Improves efficiency on the hardest items through active learning.
Other Wrappers
WIEN: learns the landmarks by searching for common prefixes at the character level; needs more training data.
SoftMealy: its extraction rules are less expressive than STALKER's; handling missing items and various orderings of items is complex.
Test page
Quote Server: Tabular style document
Test Pages
Internet Address Finder: Tagged-list style document
Result Comparison
Quote Server
Stalker: 10 example tuples, 79%, 500 test
WIEN: the collection is beyond the learner's capability
SoftMealy: multi-pass 85%, single-pass 97%
Internet Address Finder
Stalker: 85% ~ 100%, 500 test
WIEN: the collection is beyond the learner's capability
SoftMealy: multi-pass 68%, single-pass 41%
Result Comparison (cont.)
Okra (tabular pages)
Stalker: 97%, 1 example tuple
WIEN: 100%, 13 example tuples, 30 test
SoftMealy: single-pass 100%, 1 example tuple, 30 test
Big-book (tagged-list pages)
Stalker: 97%, 8 example tuples
WIEN: perfect, 18 example tuples, 30 test
SoftMealy: single-pass 97%, 4 examples, 30 test; multi-pass 100%, 6 examples, 30 test
A General View of Wrapper (as Summarization)
• Machine learning method for Wrapper Induction
DeLa, RoadRunner,…
IEPAD, OLERA, …
Overall Comparison
Conclusion:
• Template-based pages allow a high degree of automation.
• For IE across sites and from free text, semantic features are required.
• Manual IE systems can be applied to all kinds of inputs.
• Semi-supervised and unsupervised IE systems can be applied only to template-based pages.
• Unsupervised systems usually apply superficial …
Three dimensions: the difficulty of the IE task, the techniques used, and the effort made by the user for the training process and the necessity to port IE across different domains.
Problem?
• The Web is very dynamic: contents and page structures change.
• Original wrappers can stop working because they rely on Web page structures.
• Re-generating wrappers is not easy: it places a heavy workload on system developers.
[Figure: applied to changed documents, the original wrapper extracts nothing or returns incomplete results.]
Example
The original wrapper fails due to the structure change.
How to solve it? (discussion)
• Monitoring a set of generic features
• Machine learning techniques to learn a set of patterns that describe the information that is being extracted from each of the relevant fields.
•…
Wrapper Maintenance in STALKER
• DataProg algorithm: learns structural information (patterns) about a data field from a set of examples of the field.
• Wrapper verification: is the wrapper operating correctly? That is, detecting when a wrapper stops extracting data correctly from a Web source.
• Wrapper maintenance: how can a wrapper be modified automatically when the pages have changed?
Street address examples: 12 Pico St., 512 Oak Blvd., 416 Main St. and 97 Adams Blvd.
Learned pattern: (_Number_ _Capitalized_) followed by (Blvd.) or (St.)
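How such start-of-field patterns could be used for verification is sketched below. The patterns are written by hand here rather than learned (learning them is DataProg's job), and the function names and the "extracted after a page change" values are hypothetical.

```python
import re


def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)


def element_matches(pattern, token):
    if pattern == "_Number_":
        return token.isdigit()
    if pattern == "_Capitalized_":
        return token.isalpha() and token[0].isupper()
    return pattern == token  # literal token


def starts_with(pattern, value):
    """True if the field value begins with the token pattern."""
    tokens = tokenize(value)
    return (len(tokens) >= len(pattern)
            and all(element_matches(p, t) for p, t in zip(pattern, tokens)))


# Hand-written stand-ins for the learned street-address patterns.
PATTERNS = [
    ("_Number_", "_Capitalized_", "Blvd"),
    ("_Number_", "_Capitalized_", "St"),
]


def fraction_matching(values):
    """Share of extracted values that fit at least one pattern."""
    hits = sum(any(starts_with(p, v) for p in PATTERNS) for v in values)
    return hits / len(values)


training = ["12 Pico St.", "512 Oak Blvd.", "416 Main St.", "97 Adams Blvd."]
after_change = ["Phone: 555-1212", "12 Pico St."]  # hypothetical wrapper output

print(fraction_matching(training))
print(fraction_matching(after_change))
```

A sharp drop in the matching fraction compared with the training data is a signal that the wrapper may no longer be extracting the right field.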
identify new examples of the
Example for Wrapper Maintenance in STALKER
An example of the original site, the extraction rule for a book title, and the results extracted from the example page.
The source and the incorrectly extracted result after the title's font and color were changed.
Wrapper Maintenance Methods (Kushmerick’s method)
• Each data field is described by a collection of global features, such as word count, average word length, and density of types.
• The mean and variance of each feature's distribution are calculated over the training examples.
• Individual feature probabilities are then combined to produce a value.
• If the value exceeds a threshold, the wrapper is judged correct; otherwise, it is judged to have failed.
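A rough sketch of this verification scheme follows. Several choices are assumptions rather than details from the slides: the three features, the normal-distribution tail probability used per feature, the product used to combine them, the variance floor `MIN_SIGMA`, and the threshold value.

```python
import math
from statistics import mean, pvariance

MIN_SIGMA = 0.05   # assumed floor so zero-variance features stay usable
THRESHOLD = 0.01   # assumed decision threshold


def features(value):
    """Global features of one extracted field value."""
    words = value.split()
    return {
        "word_count": float(len(words)),
        "mean_word_length": mean([len(w) for w in words]) if words else 0.0,
        "digit_density": sum(c.isdigit() for c in value) / len(value) if value else 0.0,
    }


def fit(training_values):
    """Mean and variance of every feature over the training examples."""
    rows = [features(v) for v in training_values]
    return {name: (mean([r[name] for r in rows]), pvariance([r[name] for r in rows]))
            for name in rows[0]}


def feature_probability(x, mu, var):
    """Two-sided tail probability of x under a normal model of the feature."""
    sigma = max(math.sqrt(var), MIN_SIGMA)
    return math.erfc(abs(x - mu) / sigma / math.sqrt(2))


def score(values, model):
    """Combine the per-feature probabilities (here: a simple product)."""
    rows = [features(v) for v in values]
    p = 1.0
    for name, (mu, var) in model.items():
        p *= feature_probability(mean([r[name] for r in rows]), mu, var)
    return p


def wrapper_is_working(values, model):
    return score(values, model) >= THRESHOLD


training = ["12 Pico St.", "512 Oak Blvd.", "416 Main St.", "97 Adams Blvd."]
model = fit(training)

# Hypothetical output of the wrapper after the page layout changed.
broken_output = ["Click here for a map and driving directions",
                 "Add to favorites and share this page"]

print(wrapper_is_working(training, model))
print(wrapper_is_working(broken_output, model))
```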
A prototype for tracking changes to webpages – Microsoft Research
Diff-IE is a prototype Internet Explorer add-on that:
• Highlights the changes to a webpage since the last time you visited it.
• Enables you to view and compare previously cached versions of a page.
Download DIFF-IE
From: Microsoft research
http://research.microsoft.com/en-us/projects/diffie/default.aspx
• How was it implemented?
• Cache: stores the previous versions of the page, in order to highlight how a page has changed.
• Comparison component: is responsible for detecting and highlighting the changes.
• Toolbar component: is the portion of the application with
Comparison Component (1)
Web page representation
• DiffIE identifies changes to text-based Web content at the Document Object Model (DOM) level. Pages are represented internally as a tree of hash values to support this DOM-level comparison of text across pages.
• The text nodes of a Web page: the leaves of the DOM tree.
• The content of these nodes is hashed using the MD5 algorithm.
MD5: a message of arbitrary length → a 128-bit fingerprint
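A quick check of the fixed-size fingerprint, using Python's standard `hashlib`. The helper `text_fingerprint` and the sample strings are hypothetical; MD5 is adequate for spotting changed text, though it is not suitable for security purposes.

```python
import hashlib


def text_fingerprint(text: str) -> str:
    """MD5 digest of a text node, as 32 hexadecimal digits."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


short = text_fingerprint("ACME 101.2")
long_ = text_fingerprint("ACME 101.2 " * 1000)

# Each hex digit encodes 4 bits, so both digests are 128 bits long.
print(len(short) * 4, len(long_) * 4)
print(short == text_fingerprint("ACME 101.2"))   # same text, same fingerprint
print(short == text_fingerprint("ACME 102.5"))   # changed text, different fingerprint
```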
Comparison Component (2)
Detecting Differences:
• Starting at the root node, DiffIE compares the precomputed subtree hash of the live version and the cached version.
• If they are the same, DiffIE terminates comparison of the corresponding subtree, since identical hashes imply that the content has not changed.
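The pruning idea can be sketched with a small hash tree. This is an illustration under simplifying assumptions, not DiffIE's implementation: the `Node` class merges elements and their text, children are paired by position (a real DOM diff needs an alignment step), and the two sample pages are made up.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    tag: str
    text: Optional[str] = None
    children: List["Node"] = field(default_factory=list)
    digest: str = ""


def compute_hashes(node: Node) -> str:
    """Precompute the subtree hash: own tag and text plus the children's hashes."""
    h = hashlib.md5()
    h.update(node.tag.encode("utf-8"))
    if node.text is not None:
        h.update(node.text.encode("utf-8"))
    for child in node.children:
        h.update(compute_hashes(child).encode("ascii"))
    node.digest = h.hexdigest()
    return node.digest


def diff(cached: Optional[Node], live: Node,
         path: Tuple[int, ...] = ()) -> List[Tuple[Tuple[int, ...], Optional[str]]]:
    """Return (child-index path, text) for changed or added leaves of `live`."""
    if cached is not None and cached.digest == live.digest:
        return []                      # identical subtree: stop descending
    if not live.children:
        return [(path, live.text)]     # a changed or newly added leaf
    changes = []
    for i, child in enumerate(live.children):
        old = cached.children[i] if cached is not None and i < len(cached.children) else None
        changes.extend(diff(old, child, path + (i,)))
    return changes


cached_page = Node("body", children=[Node("h1", "Quotes"), Node("p", "ACME 101.2")])
live_page = Node("body", children=[Node("h1", "Quotes"), Node("p", "ACME 102.5"),
                                   Node("p", "New: INIT 7.0")])
compute_hashes(cached_page)
compute_hashes(live_page)

print(diff(cached_page, live_page))
```

The unchanged heading is skipped after a single hash comparison; only the changed paragraph and the added paragraph are reported.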
Comparison Component (3)
• 4 types of differences: only additions and changes are highlighted.
Application 1
• Monitoring a page for change, to keep track of the latest stock prices or the latest updates on the page.
Application 2
• See new or different search results.
Application 3
Application 4
References
• Kushmerick, N. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1-2):15-68, 2000 (special issue on Intelligent Internet Systems).
• Hsu, C.-N. and Dung, M.-T. Generating finite-state transducers for semistructured data extraction from the Web. Information Systems, 23(8):521-538, 1998 (special issue on Semistructured Data).
• Muslea, I., Minton, S. and Knoblock, C. Hierarchical Wrapper Induction for Semistructured Information Sources. Journal of Autonomous Agents and Multi-Agent Systems, 4:93-114, 2001.