IE Systems

(1)

Fang Li

(2)

What is an IE system?

Some limitations:

Information is defined in advance

Domain-specific

(3)

Types of IE systems introduced

Aim: Extract information from the Internet.

Wrapper Systems

NLP based extraction system

Open-domain extraction system

(4)

Wrapper-based IE systems:

• Lixto system (semi-automatic)

• RoadRunner system (automatic)

• Never-Ending Learning (NELL)

(5)

Discussion

Given a page containing a lot of data, how can the data be extracted automatically?

(6)

Lixto System

A system and method for the visual and interactive generation of wrappers for web pages under the supervision of a human developer, for automatically extracting information from web pages using such wrappers, and for translating the extracted content into XML.

www.lixto.com (the company was founded in 2001 as a spin-off of the Vienna Technical University)

(7)

(8)

Features of Lixto

Very high expressive power:

• of defining sophisticated extraction patterns

Excellent visual support

• for marking extraction patterns

Good learnability

• No extraction language needs to be learned

Sample parsimony

• Very few sample pages are needed in order to define robust wrappers

Simple and smooth XML translation mechanism

(9)

Architecture of Lixto System

(10)

Architecture of Lixto System (cont.)

Interactive pattern builder: provides the visual UI that allows a user to specify the desired extraction patterns, and the basic algorithm for creating a corresponding Elog wrapper as output.

Extractor: Elog program interpreter that performs the actual extraction based on a given Elog program.

The controller of the XML Generator: the user chooses how to map extracted information to XML.

(11)

About extraction language: Elog

Elog: a system-internal, datalog-like, rule-based language specially designed for hierarchical and modular data extraction.

Datalog Rule:

Happy(d) <- Frequents(d,bar) AND Likes(d,beer) AND Sells(bar,beer,p)

Head predicate: Happy(d)

Rule body: the atoms to the right of “<-”. If all atoms of the body are true, the head is true.
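The Datalog rule above can be evaluated by hand as a join over facts. The following is a minimal sketch, not part of Lixto or Elog; the fact tables and every name in them are made up for illustration.

```python
# Toy facts; all names and values are hypothetical.
FREQUENTS = {("alice", "joes"), ("bob", "sues")}
LIKES = {("alice", "lager"), ("bob", "stout")}
SELLS = {("joes", "lager", 5.0), ("sues", "lager", 4.5)}


def happy(frequents, likes, sells):
    """Derive Happy(d): d is happy if some bar, beer and price p
    make all three body atoms true at the same time."""
    result = set()
    for d, bar in frequents:
        for d2, beer in likes:
            if d2 != d:
                continue  # the variable d must be bound consistently
            if any(b == bar and br == beer for b, br, _p in sells):
                result.add(d)
    return result


HAPPY = happy(FREQUENTS, LIKES, SELLS)
print(HAPPY)
```

Here only alice satisfies the body: she frequents a bar that sells a beer she likes, whereas bob's bar does not sell stout.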

(12)

Extraction language: Elog

The head of a rule r is of the form p(S,X):

p is a pattern name,

S is a variable which is bound in the body of the rule to the parent-pattern instances of the filter corresponding to r,

X is the target variable which, at extraction time, is bound to some target pattern instance (a tree region or string) to be extracted.

(13)

Extraction language: Elog (cont.)

A standard extraction rule:

New(S,X) <- Par(_,S), Ex(S,X), Co(S,X,…)[a,b]

Par(_,S): parent pattern predicate

Ex(S,X): extraction definition predicate

Co(S,X,…): further imposed conditions

[a,b]: optional range parameters

(14)

Rule example

record(S, X) <- tableseq(_, S), subelem(S, :table, X)

If S is an instance of tableseq, X is a tree region contained in S, and the root of X matches table, then X is a record (a table contained in S).

The first atom: the parent pattern is an instance of <tableseq>.

The second atom: looks for subelements that qualify as tables inside the unique tableseq instance and instantiates X with them.
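The reading of this rule can be mimicked on a small, well-formed page. This is only a sketch under assumptions: the page, the choice of a div with id "tableseq" as the parent pattern, and the helper names are all hypothetical, and Python's xml.etree stands in for Lixto's HTML tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """
<html><body>
  <div id="tableseq">
    <table><tr><td>item 1</td></tr></table>
    <table><tr><td>item 2</td></tr></table>
  </div>
  <table><tr><td>footer</td></tr></table>
</body></html>
""".strip()


def tableseq(root):
    """Parent pattern: return the instances S (here, the div with id 'tableseq')."""
    return [e for e in root.iter("div") if e.get("id") == "tableseq"]


def subelem(s, tag):
    """Return the tree regions X contained in S whose root element matches tag."""
    return [x for x in s.iter(tag) if x is not s]


def record(root):
    """record(S, X) <- tableseq(_, S), subelem(S, table, X)"""
    return [(s, x) for s in tableseq(root) for x in subelem(s, "table")]


ROOT = ET.fromstring(PAGE)
TEXTS = ["".join(x.itertext()).strip() for _s, x in record(ROOT)]
print(TEXTS)
```

The footer table is not extracted because it is not contained in the parent-pattern instance.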

(15)

Extraction language: Elog (body of the rule)

Attribute conditions: impose restrictions on matched elements, e.g., the value is in italics.

Element characterizations: the value is a concept like “isCity”.

Tree Extraction Definition Predicates: a variable should be instantiated with a node in the HTML tree which matches an element path definition.

(16)

Extraction language: Elog (body of the rule)

String extraction definition predicates: every node n of the parse tree is associated with a string, obtained by concatenating all strings corresponding to the leaves of the subtree rooted at n.

Contextual conditions: some other elements must or must not appear either before or after some instances.

Internal conditions: some characteristic feature must or must not appear with an instance.

(17)

Extraction language: Elog (body of the rule)

Concept conditions: predicates like isEmail(X), isCurrency(X)

Comparison conditions: e.g., compare two dates.

Pattern References: parent pattern defines the context of a rule.

Range conditions: a range condition such as “[3,7]” can be added to any rule.
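A concept condition and a range condition can be illustrated together. The sketch below is not Elog: the regular expression behind is_email is a deliberately simple assumption, and the reading of [a,b] as "keep the a-th through b-th matched instances" is an assumed interpretation for illustration.

```python
import re

# Simplistic e-mail pattern, assumed for illustration only.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")


def is_email(x):
    """Concept condition in the spirit of isEmail(X)."""
    return bool(EMAIL_RE.match(x))


def apply_range(instances, a, b):
    """Range condition [a,b]: keep the a-th through b-th instances (1-based, inclusive)."""
    return instances[a - 1:b]


CANDIDATES = ["n/a", "ann@example.org", "call us", "bob@example.com", "eve@example.net"]
EMAILS = [c for c in CANDIDATES if is_email(c)]
SELECTED = apply_range(EMAILS, 2, 3)
print(SELECTED)
```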

(18)

(19)

Elog Extraction Program for a single eBay page

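[Figure: annotated Elog extraction program for a single eBay page; callouts mark the element characterization, the string extraction definition predicates, and the context condition.]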

(20)

How to build the extraction rules?

Pattern: a set of rules defining the same head.

Rule: a rule defines many extraction conditions, such as attribute conditions, element characterizations, …

Filter: corresponds to a rule; each rule of a pattern defines one filter.

(21)

How to Build Wrapper

I: interactive; A: automatic

Interactively generate a new pattern

(22)

Recursive Wrapping

“$1” is interpreted as a constant whose value is the URL of the start document.

(23)

Recursively extract pages that are connected to each other via a “next page” link.
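The idea of following “next page” links can be sketched without any network access. In this hypothetical example an in-memory dictionary plays the role of the web site, and all URLs and items are invented; it illustrates the recursion only, not Lixto's actual mechanism.

```python
# Hypothetical site: URL -> (items on the page, URL of the next page or None).
SITE = {
    "http://example.org/list?page=1": (["a", "b"], "http://example.org/list?page=2"),
    "http://example.org/list?page=2": (["c"], "http://example.org/list?page=3"),
    "http://example.org/list?page=3": (["d", "e"], None),
}


def extract_all(start_url, site, seen=None):
    """Extract items from start_url, then recurse on its 'next page' link."""
    seen = set() if seen is None else seen
    if start_url is None or start_url in seen or start_url not in site:
        return []  # no further page, a link cycle, or an unknown URL
    seen.add(start_url)
    items, next_url = site[start_url]
    return list(items) + extract_all(next_url, site, seen)


ALL_ITEMS = extract_all("http://example.org/list?page=1", SITE)
print(ALL_ITEMS)
```

The seen set guards against link cycles, which a real crawl would also need to handle.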

(24)

Results reported from Lixto

How many example pages are necessary to get 100 percent of correctly matched pattern instances

The time needed for constructing the initial wrapper based on one example page

(25)

Question

How does the company profit from the data extraction program?

http://www.lixto.com

(26)

RoadRunner System

Aim:

• Extract data from data-intensive web sites.

• Data is stored in a back-end DBMS; HTML pages are dynamically generated using scripts.

Methods:

• Unsupervised wrapper generation

• Does not assume that sample pages are manually selected: the system is able to automatically cluster the pages of a site into homogeneous classes.

• Does not rely on user-specified labeled examples: wrappers are generated and data are extracted in a completely automatic way.

• Does not assume any a priori knowledge about the target schema: deals with flat records and also nested structures.

(27)

Overview of RoadRunner System

Given a set of HTML pages, find a schema for the content of these pages.

A set of extraction rules parses the HTML code and retrieves the data according to the discovered schema.

Pattern discovery can be based on the study of similarities and dissimilarities between the pages.

(28)

A running example

Fig. a: a nested dataset obtained by querying a database.

Fig. b: each author’s book information is presented in the same style.

Method: compares the HTML code of the two pages, infers a common structure and a wrapper, and uses that to extract the source dataset.
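The comparison of two pages can be sketched in a much-simplified form. This is not the RoadRunner matching algorithm: it uses Python's difflib to align token sequences, handles only string mismatches (no optional or repeated structures), and the sample pages are invented. Mismatching strings are generalized to a "#PCDATA" field.

```python
import re
from difflib import SequenceMatcher

# Hypothetical pages generated from the same template.
PAGE_A = "<html><b>Name:</b>John Smith<b>Book:</b>DB Primer</html>"
PAGE_B = "<html><b>Name:</b>Paul Jones<b>Book:</b>XML at Work</html>"


def tokenize(html):
    """Split HTML into tag tokens and non-empty text tokens."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", html) if t.strip()]


def infer_wrapper(page_a, page_b):
    """Keep tokens shared by both pages; turn each mismatch into a #PCDATA field."""
    a, b = tokenize(page_a), tokenize(page_b)
    wrapper = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            wrapper.extend(a[i1:i2])
        else:
            wrapper.append("#PCDATA")
    return wrapper


def extract(wrapper, page):
    """Apply the wrapper position by position and return the field values."""
    tokens = tokenize(page)
    if len(tokens) != len(wrapper):
        raise ValueError("page does not match the wrapper")
    fields = []
    for slot, tok in zip(wrapper, tokens):
        if slot == "#PCDATA":
            fields.append(tok)
        elif slot != tok:
            raise ValueError("page does not match the wrapper")
    return fields


WRAPPER = infer_wrapper(PAGE_A, PAGE_B)
print(WRAPPER)
print(extract(WRAPPER, PAGE_B))
```

Tag mismatches, which in the real system lead to discovering optional and iterated parts of the template, are outside the scope of this sketch.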

(29)

Result of the extraction in the example

(30)

The Architecture of the System

• Classifier: analyzes pages from the target site and collects them into clusters with a homogeneous structure.

• Aligner: compares the HTML sources of some sample pages to infer a grammar to be used as a wrapper.

(31)

How to identify different page classes in the target site? (Classifier: mapping a sample to the feature space)

Tag Probability: it is reasonable to assume that pages complying with the same grammar have a similar “distribution” of tags, i.e., tags appear in the pages with similar probability.

Tag Periodicity: there are cases in which tag probabilities may be misleading, since they do not give information about the relative positions of tags. Tag periodicity is used to complement tag probability.

Distance from the Home Page: if navigation paths in the site are well organized, it is reasonable to assume that pages containing homogeneous information are approximately at the same distance from the home page in the site graph.
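The tag probability feature can be sketched as a normalized tag histogram per page, compared with a simple distance. The L1 distance and the sample pages are assumptions made for illustration; the slides do not say which distance or clustering method the classifier uses.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags while parsing a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_probabilities(html):
    """Map each tag to its relative frequency among all opening tags."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts.values())
    return {tag: n / total for tag, n in parser.counts.items()} if total else {}


def distance(p, q):
    """L1 distance between two tag distributions (0 = identical, 2 = disjoint)."""
    return sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in set(p) | set(q))


# Hypothetical pages: two "book list" pages and one index page.
BOOKS_1 = tag_probabilities("<html><body><h1>A</h1><ul><li>x</li><li>y</li></ul></body></html>")
BOOKS_2 = tag_probabilities("<html><body><h1>B</h1><ul><li>x</li><li>y</li><li>z</li></ul></body></html>")
INDEX = tag_probabilities("<html><body><a href='1'>1</a><a href='2'>2</a><a href='3'>3</a></body></html>")

print(distance(BOOKS_1, BOOKS_2), distance(BOOKS_1, INDEX))
```

The two book pages end up much closer to each other than either is to the index page, which is the behaviour a clustering step would rely on.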

(32)

Architecture of the System (cont.)

Expander: infers a wrapper for singleton pages. Most singleton pages are indices or links to other pages.

Labeler: associates a semantic meaning to the data fields that can be extracted by running the wrappers generated by the above modules.

Discussion:

How to label the data items extracted from the page?

(33)

The Labeler (methods)

To be done manually.

Adoption of knowledge representation techniques, e.g., a domain ontology.

Based on a generalized notion of closeness between the wrapper’s tokens and non-terminal symbols.

(34)

The Labeler

Check whether the pattern sub-tree is adjacent to some isomorphic sub-tree.

The leaves of the discovered tree can be selected as names for the non-terminals of the pattern tree. In the example, the strings “name” and “phone” are candidates to be used as names for the non-terminals $A and $B, respectively.

(35)

The Labeler (cont.)

Richness of the Web itself: it is possible that in some page a given data item is associated with some information describing its meaning. It is reasonable to expect that in some of the pages retrieved by the search engine, the input value is explicitly associated with some descriptive text.

(36)

Further Research:

(37)

Summarization

• Lixto system (interactive wrapper generation, semi-supervised)

• RoadRunner system (data-intensive page extraction, unsupervised)

(38)

References

Robert Baumgartner et al. “Visual Web Information Extraction with Lixto.” Proceedings of the 27th VLDB Conference, 2001.

Valter Crescenzi et al. “Automatic Web Information Extraction in the ROADRUNNER System.” LNCS 2465, pp. 264–277, 2002.

Jun Zhu et al. “Simultaneous Record Detection and Attribute Labeling in Web Data Extraction.” KDD 2006.

(39)

Never-Ending Learning (NELL)

Has been reading the web 24 hours/day since January 2010.

Acquired a knowledge base with 80 million confidence-weighted beliefs.

http://rtw.ml.cmu.edu

(40)

Problem Statement

A set L = {L_i} of learning tasks, where the i-th learning task L_i = (T_i, P_i, E_i) is to improve performance, as measured by performance metric P_i, on a given performance task T_i, through a given type of experience E_i.

A set of coupling constraints C = {(φ_k, V_ki)}, where:

φ_k is a real-valued function over two or more learning tasks, specifying the degree of satisfaction of the constraint;

V_ki is a vector of indices over learning tasks.
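The two sets in this problem statement map naturally onto two small record types. The sketch below is only one possible encoding: the class and field names, the example tasks, and the agreement function used as φ are all hypothetical and are not taken from NELL.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class LearningTask:
    task: str        # T_i: the performance task
    metric: str      # P_i: the performance metric
    experience: str  # E_i: the type of experience


@dataclass
class CouplingConstraint:
    phi: Callable[..., float]  # real-valued degree of satisfaction
    indices: Sequence[int]     # vector of indices over learning tasks

    def satisfaction(self, solutions):
        """Evaluate phi on the solutions of the tasks selected by the index vector."""
        return self.phi(*(solutions[i] for i in self.indices))


# Hypothetical example: two tasks that should agree on the same noun phrase.
TASKS = [
    LearningTask("classify 'city' from text context", "precision", "web text"),
    LearningTask("classify 'city' from spelling", "precision", "web text"),
]
AGREEMENT = CouplingConstraint(phi=lambda a, b: 1.0 - abs(a - b), indices=(0, 1))

# Assumed outputs of the two tasks for one noun phrase.
SOLUTIONS = [0.9, 0.8]
SCORE = AGREEMENT.satisfaction(SOLUTIONS)
print(SCORE)
```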

(41)

Problem Statement (cont.)

(42)

Input of the System

Input

• Ontology and binary relations (~800 categories and relations)

• 10-20 labeled training examples for each category and relation

• The web and access to 100,000 Google API search queries

• Occasional interaction with humans

What the system does

• Read (extract) more beliefs from the web

• Remove old incorrect beliefs

• Populate a growing knowledge base containing a confidence and provenance for each belief

• Learn to read better than the previous day

Result: KB with over 90,000,000 extracted beliefs (at different levels of confidence)

(43)

Output of the system

(44)

NELL architecture

(45)

Techniques used for learning tasks

Category classification: NELL learns different boolean functions for each of the 280 categories in its ontology, allowing noun phrases to refer to entities in multiple semantic categories.

Relation classification: NELL learns distinct boolean-valued classification functions for each of the 327 relations in its ontology.

Entity Resolution: Functions that classify noun phrase pairs by whether they are synonyms.

Inference rules among belief triples: functions that map from NELL’s current KB to new beliefs it should add to its KB.

(46)

Techniques used for coupling constraints

Multi-view co-training coupling.

Subset/superset coupling.

Multi-label mutual exclusion coupling.

Coupling relations to their argument types.

Horn clause coupling.
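Three of the coupling types listed above can be illustrated as consistency checks over a toy knowledge base. This is a sketch under assumptions: the beliefs, category names and helper functions are invented, and the checks here are hard tests, whereas a learning system would more plausibly use such constraints as soft signals during training.

```python
# Hypothetical category beliefs: category -> set of entities.
BELIEFS = {
    "athlete": {"Serena Williams", "Paris"},
    "person": {"Serena Williams"},
    "city": {"Paris"},
    "sport": {"tennis"},
}


def mutual_exclusion_violations(beliefs, exclusive_pairs):
    """Entities believed to belong to two mutually exclusive categories."""
    return sorted(
        (e, c1, c2)
        for c1, c2 in exclusive_pairs
        for e in beliefs[c1] & beliefs[c2]
    )


def subset_violations(beliefs, subset_of):
    """Entities in a sub-category that are missing from its super-category."""
    return sorted(
        (e, sub, sup)
        for sub, sup in subset_of.items()
        for e in beliefs[sub] - beliefs[sup]
    )


def type_violations(pairs, beliefs, signature):
    """Relation instances whose arguments do not have the declared types."""
    dom, rng = signature
    return sorted(
        (x, y) for x, y in pairs
        if x not in beliefs[dom] or y not in beliefs[rng]
    )


ME = mutual_exclusion_violations(BELIEFS, [("person", "city"), ("athlete", "city")])
SS = subset_violations(BELIEFS, {"athlete": "person"})
TV = type_violations(
    {("Serena Williams", "tennis"), ("Paris", "tennis")}, BELIEFS, ("person", "sport")
)
print(ME, SS, TV)
```

All three checks flag the same doubtful belief, that "Paris" is an athlete, which shows how coupled constraints can reinforce each other.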

(47)

Coupled semi-supervised training of many functions

(48)

Co-training, Multiview

(49)

Learn functions with the same input but different outputs, where some constraints are known.

(50)

(51)

(52)

Advantages of NELL

To achieve successful semi-supervised learning, couple the training of many different learning tasks.

Allow the agent to learn additional coupling constraints.

Learn new representations that cover relevant phenomena beyond the initial representation.

Organize the set of learning tasks into an easy-to-increasingly-difficult curriculum.
