Information Extraction

(1)

Chapter 4 Webpage Information Extraction

Li Fang

Dept. of Computer Science, Shanghai Jiao Tong University

(2)

Web-OQL (cont.)

• Query: extracts the reviewer names “Jeff” and “Jane” from page pe2.

(8)

Overview of Web data extraction tools (cont.)

• HTML-aware Tools

• W4F (World Wide Web Wrapper Factory): a toolkit for building wrappers.

• XWRAP: a component library that provides basic building blocks for wrapper development.

• NLP-based tools

RAPIER (job postings), SRV, WHISK: suitable for Web pages consisting of grammatical text, such as job listings, apartment rental advertisements, and seminar announcements.

(9)

NLP-based tools: RAPIER

• Extraction rule for the book title:

• Preceded by words “Book”, “Name”, and “”

• Followed by the word “”.

• The “Filler pattern” specifies that the title consists of at most two words that were labeled as “nn” or “nns” by the POS tagger (i.e., one or two singular or plural common nouns).

(10)

Overview of Web data extraction tools (cont.)

• Modeling-based Tools

• NoDoSE: an interactive tool for semi-automatically determining the structure of a Web page.

• DEByE: an interactive tool to extract page contents based on a set of example objects.

• Ontology-based Tools

Ontologies are previously constructed to describe the data of interest, including relationships, lexical appearance and context keywords.

(11)

Overview of Web data extraction tools

Wrapper induction tools

WIEN, SoftMealy, Stalker

(12)

Wrapper Technologies

• What is a wrapper

• Wrapper induction

• Wrapper maintenance

(13)

What is a Wrapper?

• For information integration

A procedure that is designed for extracting the content of a particular information source and delivering the content of interest in a self-describing representation (e.g., XML)

• For Web application

– An extracting program to extract desired information from Web pages.

Semi-structured document → wrapper → structured information

(14)

For Web Applications:

• Given a Web page S containing a set of implicit objects, determine a mapping W that populates a data repository R with the objects in S.

[Figure: a wrapper applies the mapping W to page S, and to pages similar to S, to populate the repository R.]
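As a rough illustration of a wrapper as a mapping W from a page S to a repository R, here is a minimal hand-written sketch. The page markup, the field names `name` and `cuisine`, and the regular expression are all assumptions made for this example; they are not taken from the lecture.

```python
import re

# Hypothetical page S: two implicit "restaurant" objects in a tagged list.
PAGE_S = """
<li><b>Yala</b> Cuisine: <i>Thai</i></li>
<li><b>Casa Luna</b> Cuisine: <i>Mexican</i></li>
"""

# The mapping W, written by hand as a regular expression over this layout.
RECORD = re.compile(
    r"<li><b>(?P<name>.*?)</b> Cuisine: <i>(?P<cuisine>.*?)</i></li>"
)


def wrapper(page):
    """Apply W to a page and return the objects it finds as dictionaries."""
    return [match.groupdict() for match in RECORD.finditer(page)]


# Repository R, populated with the objects found in S.
repository = wrapper(PAGE_S)
```

The same `wrapper` function would work on any page that shares this layout, which is the sense in which a wrapper serves pages "similar to S".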

(15)

An example for a wrapper

(16)

Wrapper Induction

• Web wrappers wrap...

– “Query-able” or “Search-able” Web sites

– Web pages with large itemized lists

• The primary issues are:

– How to build the extractor quickly?

– Wrapper induction algorithms search a hypothesis space of possible wrapper programs for a wrapper that has high extraction accuracy on a set of training pages.

(17)

Wrapper Induction: Methods

• Manually writing wrappers

• Tedious, time-consuming task, e.g., TSIMMIS, Minerva, …

• Wrapper programming languages

• Florid (a logic-programming formalism), Pillow (an HTML/XML programming library for logic programming systems), …

• Machine learning methods

• Stalker, SoftMealy, WIEN, …

• Supervised interactive wrapper generation

• W4F (uses an SQL-like query language called HEL), XWRAP (uses a procedural rule system), …

(18)

Wrapper Induction Tools

• WIEN:

• Input: a set of pages where data of interest is labeled to serve as examples

• Output: a wrapper that is consistent with each labeled page.

• SoftMealy

• Uses a finite-state transducer (FST), which takes a sequence of tokens as input and matches the context separators against contextual rules to determine state transitions

• Stalker

• The wrapper induction techniques used in WIEN and SoftMealy are further developed in Stalker

(19)

Wrapper Induction: machine learning methods (Stalker)

The lifecycle of a wrapper

Our focus here

(20)

Learning Extraction Rules ---from pages

• Aim:

Defining a set of extraction rules that precisely define how to locate the information on the page.

→ How to describe the content of a page?

(21)

Describing the content of a page:

Embedded Catalog Tree

• Embedded catalog (EC): a tree-like structure to represent a Web page.

• Leaves: items of interest for the user

• Internal nodes: lists of k-tuples where each item in the k-tuple can be either a leaf or another list L.
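The EC description can be sketched as a small data structure. This is only an illustration: the class names `Leaf` and `ListNode`, the helper `leaf_paths`, and the restaurant example are assumptions, not part of the EC formalism itself.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Leaf:
    """A leaf of the EC tree: one item of interest for the user."""
    name: str


@dataclass
class ListNode:
    """An internal node: a list of k-tuples.

    tuple_schema describes one k-tuple; each slot is a Leaf or another ListNode.
    """
    name: str
    tuple_schema: List[Union["Leaf", "ListNode"]] = field(default_factory=list)


def leaf_paths(node, prefix=()):
    """Return the path (as node names) from the root to every leaf."""
    path = prefix + (node.name,)
    if isinstance(node, Leaf):
        return [path]
    paths = []
    for child in node.tuple_schema:
        paths.extend(leaf_paths(child, path))
    return paths


# Hypothetical page: a list of restaurants, each with a name and a nested
# list of (street, phone) tuples.
ec_tree = ListNode("restaurants", [
    Leaf("name"),
    ListNode("addresses", [Leaf("street"), Leaf("phone")]),
])
```

Calling `leaf_paths(ec_tree)` lists the root-to-leaf paths along which extraction rules would be applied.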

(22)

Embedded Catalog Tree (for example)

A list of five tuples

(23)

Extracting Rule based on EC

• A rule: for each node x in the EC tree, the wrapper needs a rule r that extracts that particular node from its parent; an item is reached by following the path p from the root to the corresponding node.

• A list iteration rule: for each list node, decomposes the list into individual tuples, and then r is applied to each extracted tuple.

(24)

Example for extraction rules

To extract the address:

R1: start from the beginning of the document and skip every token until you find a landmark consisting of the word Address, and then ignore everything until you find the landmark <i>.

R1 = SkipTo(Address) SkipTo(<i>)

R2 = SkipTo(Address : <i>)

R3 = SkipTo(Cuisine : <i>) SkipTo(Address : <i>)

(25)

Example for extraction rules (cont.)

R4 = SkipTo(Cuisine : <i> _Capitalized_ </i> <p> Address : <i>)

R4 is defined based on a 9-token landmark that uses the wildcard _Capitalized_, which is a placeholder for any capitalized alphabetic string.

Disjunctive rules: either R1 or R2

To deal with variations in the format of the documents, disjunctive rules are allowed.
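The SkipTo rules and their disjunction can be sketched as follows. This is a minimal illustration and not the STALKER implementation: the tokenizer, the sample document with its `<i>` and `<p>` markup, and the function names are assumptions made for the example.

```python
import re

# Wildcards are placeholders for classes of tokens.
WILDCARDS = {
    "_Capitalized_": lambda tok: tok.isalpha() and tok[0].isupper(),
    "_Number_": lambda tok: tok.isdigit(),
}


def tokenize(text):
    """Split text into HTML tags, words/numbers, and single punctuation marks."""
    return re.findall(r"</?\w+>|\w+|[^\w\s]", text)


def matches(symbol, token):
    """A landmark symbol matches a token literally or through a wildcard."""
    test = WILDCARDS.get(symbol)
    return test(token) if test else symbol == token


def skip_to(tokens, start, landmark):
    """Return the index just past the first match of landmark, or None."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if all(matches(s, t) for s, t in zip(landmark, tokens[i:i + n])):
            return i + n
    return None


def apply_rule(tokens, rule):
    """Apply a sequence of SkipTo landmarks, starting at the document start."""
    pos = 0
    for landmark in rule:
        pos = skip_to(tokens, pos, landmark)
        if pos is None:
            return None
    return pos


def apply_disjunction(tokens, rules):
    """Return the result of the first rule that matches (either R1 or R2 ...)."""
    for rule in rules:
        pos = apply_rule(tokens, rule)
        if pos is not None:
            return pos
    return None


# Hypothetical document fragment.
sample = ("Name : <b> Yala </b> <p> Cuisine : <i> Thai </i> <p> "
          "Address : <i> 4000 Colfax , Phoenix </i>")
R1 = [["Address"], ["<i>"]]
R4 = [["Cuisine", ":", "<i>", "_Capitalized_", "</i>", "<p>",
       "Address", ":", "<i>"]]
```

Each rule returns the token index at which the address starts; a rule whose landmarks cannot be found returns `None`, which is what makes the disjunction fall through to the next rule.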

(26)

Extraction Rules as Finite Automata

• Landmarks: each argument of a SkipTo()

• A sequence of tokens and wildcards

• Landmark automata

• A non-deterministic finite automaton in which each transition Sⁱ → Sʲ is labeled by a landmark l^(i,j)

The transition takes place if the automaton is in the state Sⁱ and the landmark l^(i,j) matches the sequence of tokens at the input.

(27)

Landmark Automata (linear LA)

• A linear LA has one accepting state.

• From each non-accepting state, there are exactly two possible transitions: a loop to itself, and a transition to the next state;

• each non-looping transition is labeled by a landmark;

• all looping transitions have the meaning “consume all tokens until you encounter the landmark that leads to the next state”.

(28)

Rules and their automaton

R1 ::= skipTo( ( )

R2 ::= skipTo(phone) skipTo()

Disjunctive rule: either R1 or R2

•The initial state S⁰ has a branching-factor of k.

•It has exactly k accepting states. (one per branch)

(29)

Learning Extraction Rules

[Figure: the user marks samples; the STALKER algorithm takes the marked samples and outputs extraction rules.]

R1 ::= skipTo( ( )

- It accepts the positive examples in E2 and E4.

- It rejects both E1 and E3 because R1 cannot be matched on them; R2 can match them.

(30)

Process of the example (STALKER)

1. Generate a linear LA that covers as many as possible of the four positive examples.

2. Create another linear LA for the remaining examples, and so on.

3. Once STALKER covers all examples, it returns the disjunction of all the induced LAs.
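The three steps above follow a sequential covering loop, which can be sketched as below. This is a simplified illustration, not the STALKER algorithm: the learner here only tries one-token landmarks, and the four token lists are hypothetical examples in the spirit of E1 to E4.

```python
def skip_to(tokens, landmark):
    """Index just past the first occurrence of a one-token landmark, or None."""
    return tokens.index(landmark) + 1 if landmark in tokens else None


def covers(landmark, example):
    """A rule covers an example if it stops exactly at the marked position."""
    tokens, target = example
    return skip_to(tokens, landmark) == target


def learn_one(examples):
    """Toy learner (assumption): try the token right before each target and
    keep the candidate that covers the most examples."""
    candidates = {tokens[target - 1] for tokens, target in examples if target > 0}
    if not candidates:
        return None
    return max(sorted(candidates),
               key=lambda lm: sum(covers(lm, ex) for ex in examples))


def sequential_covering(examples):
    """Learn one rule at a time until every example is covered."""
    remaining = list(examples)
    disjuncts = []
    while remaining:
        rule = learn_one(remaining)
        if rule is None or not any(covers(rule, ex) for ex in remaining):
            break  # no progress is possible with this toy learner
        disjuncts.append(rule)
        remaining = [ex for ex in remaining if not covers(rule, ex)]
    return disjuncts


# Hypothetical examples: (tokens, index where the area code starts).
examples = [
    (["513", "Pico", ",", "Phone", "1", "-", "<b>", "800", "</b>"], 7),
    (["90", "Colfax", ",", "Phone", "(", "800", ")"], 5),
    (["523", "1st", "St", ",", "Phone", "1", "-", "<b>", "310", "</b>"], 8),
    (["403", "Vine", ",", "Phone", "(", "310", ")"], 5),
]
```

The returned list plays the role of the disjunction: each entry covers part of the examples, and together they cover all of them.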

(31)

STALKER Algorithm

See example

(32)

STALKER Algorithm (cont.)

(33)

STALKER Algorithm (cont.)

• Refine() function: obtains better disjuncts either by making their landmarks more specific (landmark refinements) or by adding new states to the automaton (topology refinements).

• Landmark refinements

• Topology refinements

(34)

Landmark Refinement

• R4 = SkipTo

Refine as :

(35)

Topology Refinements

• R4 = skipTo

Refine as :

(36)

Example of rule induction

Each initial candidate is a 2-state landmark automaton whose landmark is either a token t that ends one prefix (p) or a wildcard that matches such a t.

(37)

During the second iteration, with the E1 and E3 examples, the initial candidate rules are R5 and R6.

Refinement: tries to obtain better disjuncts either by making their landmarks more specific (landmark refinement) or by adding new states to the automaton (topology refinement).

A perfect rule, which matches the examples

(38)

Seed examples → Identifying highly informative examples

• The most informative examples illustrate exceptional cases

• Active learning: analyzes the set of unlabeled examples to automatically select examples for the user to label

• Forward and backward rules:

Fwd R1 = SkipTo(Address) SkipTo()

Bwd R1 = BackTo(Phone) BackTo(_Number_)

If the two rules disagree on an example, that example is selected for the user to label, because it is highly informative.
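Selecting examples on which a forward rule and a backward rule disagree can be sketched as below. The two rules are simplified stand-ins: the exact landmarks, the assumption that a backward rule stops on the matched `_Number_` token, and the sample token lists are all made up for the illustration.

```python
def forward_start(tokens):
    """Forward rule (assumed form): skip to 'Address', then to the next ':'."""
    try:
        i = tokens.index("Address")
        return tokens.index(":", i) + 1
    except ValueError:
        return None


def backward_start(tokens):
    """Backward rule (assumed form): from the end, go back to 'Phone', then
    back to the nearest number, and treat that number as the item start."""
    if "Phone" not in tokens:
        return None
    phone = len(tokens) - 1 - tokens[::-1].index("Phone")
    for j in range(phone - 1, -1, -1):
        if tokens[j].isdigit():
            return j
    return None


def select_queries(unlabeled):
    """Return the examples on which the two rules disagree."""
    return [t for t in unlabeled if forward_start(t) != backward_start(t)]


# Hypothetical unlabeled examples.
regular = ["Address", ":", "12", "Pico", "St", ".", "Phone", ":", "555"]
unusual = ["Address", ":", "Suite", "5", ",", "97", "Adams", "Blvd", ".",
           "Phone", ":", "555"]
```

On `regular` both rules point at the same token, so nothing is learned by labeling it; on `unusual` they point at different tokens, so it is the one worth showing to the user.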

(39)

Results reported from STALKER

• From 28 sources, 206 extraction rules: 182 rules are 100% correct, 18 rules are more than 90% accurate, and 3% of the rules are below 90%.

• Active learning:

Average accuracy improves from 85.7% to 94.2%

(40)

STALKER features

• The ability to wrap a larger variety of sources.

• Capable of learning most of the extraction rules from just a couple of examples.

• Keeps high accuracy while using single-slot rules.

• Improves efficiency by using active learning for the hardest items.

(41)

Other Wrappers

WIEN: learns the landmarks by searching for common prefixes at the character level; needs more training data.

SoftMealy: its extraction rules are less expressive than STALKER's; it is complex to deal with missing items and various orderings of items.

(42)

Test page

Quote Server: Tabular style document

(43)

Test Pages

Internet Address Finder: Tagged-list style document

(44)

Result Comparison

Quote Server

• Stalker: 10 example tuples, 79%, 500 test

• WIEN: the collection is beyond the learner’s capability

• SoftMealy: multi-pass 85%, single-pass 97%

Internet Address Finder

• Stalker: 85% ~ 100%, 500 test

• WIEN: the collection is beyond the learner’s capability

• SoftMealy: multi-pass 68%, single-pass 41%

(45)

Result Comparison (cont.)

Okra (tabular pages)

• Stalker: 97%, 1 example tuple

• WIEN: 100%, 13 example tuples, 30 test

• SoftMealy: single-pass 100%, 1 example tuple, 30 test

Big-book (tagged-list pages)

• Stalker: 97%, 8 example tuples

• WIEN: perfect, 18 example tuples, 30 test

• SoftMealy: single-pass 97%, 4 examples, 30 test; multi-pass 100%, 6 examples, 30 test

(46)

A General View of Wrapper (as Summarization)

• Machine learning method for Wrapper Induction

DeLa, RoadRunner,…

Iepad, Olera, …

(47)

Overall Comparison

Three dimensions: the difficulty of the IE task, the techniques used, and the effort made by the user for the training process and the necessity to port IE across different domains.

Conclusion:

• Template-based pages allow a high degree of automation.

• For IE from cross-site pages and free text, semantic features are required.

• Manual IE systems can be applied to all kinds of inputs.

• Semi-supervised and unsupervised IE systems can be applied only to template-based pages.

• Unsupervised systems usually apply superficial …

(48)

Problem?

The Web is very dynamic: both contents and page structures change.

Original wrappers can stop working, because they rely on Web page structures.

Re-generating wrappers is not easy: it is a heavy workload for system developers.

[Figure: changed documents passed through the original wrapper yield either nothing or incomplete results.]

(49)

Example

The original wrapper fails due to the structure change.

How to solve it? (discussion)

•Monitoring a set of generic features

•Machine learning techniques to learn a set of patterns that describe the information that is being extracted from each of the relevant fields.

•…

(50)

Wrapper Maintenance in STALKER

• DataProg algorithm, which learns structural information (patterns) about a data field from a set of examples of the field

Street addresses: 12 Pico St., 512 Oak Blvd., 416 Main St. and 97 Adams Blvd.

Pattern: (_Number_ _Capitalized_), (Blvd.) or (St.)

• Wrapper verification: is a wrapper operating correctly? Detecting when a wrapper stops extracting data correctly from a Web source.

• Wrapper maintenance: how to automatically modify a wrapper when the pages have changed? Identify new examples of the …
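Checking extracted values against a pattern of this kind can be sketched as follows. This only illustrates applying an already-known pattern; it does not show how DataProg learns one. The tokenizer, the reading of the pattern as a start part and an end part, and the function names are assumptions.

```python
import re

TOKEN_TYPES = {
    "_Number_": lambda tok: tok.isdigit(),
    "_Capitalized_": lambda tok: tok.isalpha() and tok[0].isupper(),
}

START = ["_Number_", "_Capitalized_"]   # how the field begins
END = {"Blvd.", "St."}                  # how the field ends


def tokenize(value):
    """Split into words (keeping a trailing period) and punctuation marks."""
    return re.findall(r"\w+\.?|[^\w\s]", value)


def matches_field(value):
    """True if the value starts and ends the way the pattern describes."""
    tokens = tokenize(value)
    if len(tokens) < len(START):
        return False
    starts = all(TOKEN_TYPES[s](t) for s, t in zip(START, tokens))
    return starts and tokens[-1] in END


addresses = ["12 Pico St.", "512 Oak Blvd.", "416 Main St.", "97 Adams Blvd."]
```

A wrapper whose output suddenly stops matching, for instance because it now returns markup around the address, would be flagged by a check like this.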

(51)

Example for Wrapper Maintenance in STALKER

An example of the original site, the extraction rule for a book title, and the results extracted from the example page.

The source and the incorrectly extracted result after the titles' font and color were changed.

(52)

Wrapper Maintenance Methods (Kushmerick’s method)

• Each data field is described by a collection of global features, such as word count, average word length, and density of types.

• The mean and variance of each feature’s distribution are calculated over the training examples.

• Individual feature probabilities are then combined to produce a value.

• If the value exceeds a threshold, the wrapper is correct; otherwise, it has failed.
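The verification steps above can be sketched as below. This is a rough illustration under stated assumptions, not Kushmerick's exact method: the three features, the use of a normal two-sided tail probability per feature, the product as the combination, and the threshold of 0.05 are all choices made for the example.

```python
import math
from statistics import mean, pvariance


def features(value):
    """Global features of one extracted value."""
    words = value.split()
    return {
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "digit_density": sum(c.isdigit() for c in value) / len(value) if value else 0.0,
    }


def fit(training_values):
    """Mean and variance of each feature over the training examples."""
    rows = [features(v) for v in training_values]
    return {
        name: (mean([r[name] for r in rows]), pvariance([r[name] for r in rows]))
        for name in rows[0]
    }


def feature_probability(x, mu, var):
    """Probability of a deviation at least this large under a normal model."""
    if var == 0:
        return 1.0 if x == mu else 0.0
    z = abs(x - mu) / math.sqrt(var)
    return math.erfc(z / math.sqrt(2))


def score(value, model):
    """Combine the individual feature probabilities into one value."""
    f = features(value)
    return math.prod(feature_probability(f[n], mu, var)
                     for n, (mu, var) in model.items())


def verify(values, model, threshold=0.05):
    """The wrapper is judged correct if the average score reaches the threshold."""
    return sum(score(v, model) for v in values) / len(values) >= threshold


model = fit(["12 Pico St.", "512 Oak Blvd.", "416 Main St.", "97 Adams Blvd."])
```

With this model, a value shaped like the training addresses scores well, while a value with a different word count scores zero because the training word counts have no variance.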

(53)

A prototype for tracking changes to webpages – Microsoft Research

Diff-IE is a prototype Internet Explorer add-on that:

• Highlights the changes to a webpage since the last time you visited it.

• Enables you to view and compare previously cached versions of a page.

(54)

Download DIFF-IE

From: Microsoft Research

http://research.microsoft.com/en-us/projects/diffie/default.aspx

• How was it implemented?

(55)

• Cache: stores the previous versions of the page, in order to highlight how a page has changed.

• Comparison component: is responsible for detecting and highlighting the changes.

• Toolbar component: is the portion of the application with

(56)

Comparison Component (1)

Web page representation

• DiffIE identifies changes to text-based Web content at the Document Object Model (DOM) level. Pages are represented internally as a tree of hash values to support this DOM-level comparison of text across pages.

• The text nodes of a Web page: the leaves of the DOM tree.

• The content of these nodes is hashed using the MD5 algorithm. MD5: a message of arbitrary length → 128-bit fingerprint

(57)

Comparison Component (2)

Detecting Differences:

• Starting at the root node, DiffIE compares the pre-computed subtree hash of the live version and the cached version.

• If they are the same, DiffIE terminates comparison of the corresponding subtree, since identical hashes imply that the content has not changed.
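The hash-tree comparison can be sketched as below. This is an illustration of the idea and not DiffIE's code: the `Node` class, hashing an internal node as the MD5 of its children's concatenated hashes, and aligning children by position are assumptions.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    tag: str
    text: Optional[str] = None                 # set only on text leaves
    children: List["Node"] = field(default_factory=list)
    digest: str = ""                           # filled in by compute_hashes


def md5(s):
    """128-bit MD5 fingerprint of a string, as 32 hex characters."""
    return hashlib.md5(s.encode("utf-8")).hexdigest()


def compute_hashes(node):
    """Pre-compute the subtree hash of every node, bottom-up."""
    if node.text is not None:
        node.digest = md5(node.text)
    else:
        node.digest = md5("".join(compute_hashes(c) for c in node.children))
    return node.digest


def changed_text(cached, live, out=None):
    """Collect the text of live leaves that are new or changed."""
    out = [] if out is None else out
    if cached is not None and cached.digest == live.digest:
        return out                             # identical hash: skip the subtree
    if live.text is not None:
        out.append(live.text)
        return out
    old_children = cached.children if cached is not None else []
    for i, child in enumerate(live.children):
        old = old_children[i] if i < len(old_children) else None
        changed_text(old, child, out)
    return out


def page(*paragraphs):
    """Build a tiny DOM: <body> with one <p> per text string."""
    return Node("body", children=[
        Node("p", children=[Node("#text", text=t)]) for t in paragraphs
    ])


cached = page("IBM 120.5", "About us")
live = page("IBM 121.0", "About us", "New item")
compute_hashes(cached)
compute_hashes(live)
```

Because the unchanged "About us" paragraph has the same hash in both versions, the comparison never descends into it; only the changed price and the added paragraph are reported.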

(58)

Comparison Component (3)

• Four types of differences: only additions and changes are highlighted.

(59)

Application 1

• Monitoring a page for change, to keep track

of the latest stock prices, or latest updates

on the page.

(60)

Application 2

• See new or different search results.

(61)

Application 3

(62)

Application 4

(63)

References

• Kushmerick, N. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1-2):15-68, special issue on Intelligent Internet Systems, 2000.

• Hsu, C.-N. and Dung, M.-T. Generating finite-state transducers for semistructured data extraction from the Web. Information Systems, 23(8):521-538, special issue on Semistructured Data, 1998.

• Muslea, I., Minton, S. and Knoblock, C. Hierarchical Wrapper Induction for Semistructured Information Sources. Journal of Autonomous Agents and Multi-Agent Systems, 4:93-114, 2001.

(64)


Contents

• Overview of Information Extraction tools from Web pages

• Wrapper Induction

• Wrapper Maintenance

Two kinds of webpages

Multiple-record page extraction (left)

One-record page extraction (right)

Language for wrapper development—for manually constructed IE systems

Minerva: combines a declarative grammar-based approach with features typical of procedural programming languages.

Tsimmis: includes wrappers that can be configured through specification files written by the user.

Web-OQL: originally aimed at performing SQL-like queries over the Web.

Web-OQL

• Hypertrees are arc-labeled ordered trees.
