IE Systems
Fang Li
What is an IE system?
Some limitations:
Information is defined in advance
Domain-specific
Types of IE systems introduced
Aim: Extract information from the Internet.
Wrapper Systems
NLP-based extraction system
Open-domain extraction system
Contents
Wrapper-based IE systems:
• Lixto system (semi-automatic)
• RoadRunner system (automatic)
• Never-Ending Learning (NELL)
Discussion
Given a page containing a lot of data, how can the data be extracted automatically?
Lixto System
A system and method for the visual and interactive generation of wrappers for web pages under the supervision of a human developer, for automatically extracting information from web pages using such wrappers, and for translating the extracted content into XML.
www.lixto.com (the company was founded in 2001 as a spin-off of the Vienna University of Technology)
Features of Lixto
Very high expressive power:
• for defining sophisticated extraction patterns
Excellent visual support
• for marking extraction patterns
Good learnability
• No extraction language needs to be learned
Sample parsimony
• Very few sample pages are needed in order to define robust wrappers
Simple and smooth XML translation mechanism
Architecture of Lixto System
Architecture of Lixto System (cont.)
Interactive pattern builder: provides the visual UI that allows a user to specify the desired extraction patterns, and the basic algorithm for creating a corresponding Elog wrapper as output.
Extractor: the Elog program interpreter that performs the actual extraction based on a given Elog program.
The controller of the XML Generator: the user chooses how to map extracted information to XML.
About extraction language: Elog
Elog: a system-internal, Datalog-like, rule-based language specially designed for hierarchical and modular data extraction.
Datalog Rule:
Happy(d) <- Frequents(d,bar) AND Likes(d,beer) AND Sells(bar,beer,p)
Head predicate: Happy(d)
Rule body: Frequents(d,bar) AND Likes(d,beer) AND Sells(bar,beer,p). If all atoms of the body are true, the head is true.
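The rule above can be evaluated by hand as a join over three fact tables. The following is a minimal sketch, not a Datalog engine; the facts (drinkers, bars, beers, prices) are made-up sample data and do not come from the slides.

```python
# Hypothetical sample facts; each set plays the role of one Datalog relation.
frequents = {("ann", "joes"), ("bob", "sues")}          # Frequents(d, bar)
likes = {("ann", "lager"), ("bob", "stout")}            # Likes(d, beer)
sells = {("joes", "lager", 3.0), ("sues", "lager", 2.5)}  # Sells(bar, beer, p)


def happy(frequents, likes, sells):
    """Derive Happy(d) for every d for which all three body atoms hold."""
    result = set()
    for d, bar in frequents:
        for d2, beer in likes:
            if d2 != d:
                continue  # the variable d must be bound to the same drinker
            for bar2, beer2, _price in sells:
                if bar2 == bar and beer2 == beer:
                    result.add(d)
    return result


happy_drinkers = happy(frequents, likes, sells)
print(happy_drinkers)
```

Here only "ann" satisfies every body atom: she frequents a bar that sells a beer she likes. "bob" likes stout, which the bar he frequents does not sell, so Happy(bob) is not derived.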
Extraction language: Elog
The head of a rule r is of the form p(S,X):
p is a pattern name,
S is a variable which is bound in the body of the rule to the parent-pattern instances of the filter corresponding to r,
X is the target variable which, at extraction time, is bound to some target pattern instance (a tree region or string) to be extracted.
Extraction language: Elog (cont.)
A standard extraction rule:
New(S,X) <- Par(_,S), Ex(S,X), Co(S,X,…)[a,b]
Par(_,S): parent pattern predicate
Ex(S,X): extraction definition predicate
Co(S,X,…): further imposed conditions
[a,b]: optional range parameters
Rule example
record(S, X) <- tableseq(_, S), subelem(S, :table, X)
If S is an instance of tableseq, X is a tree region contained in S, and the root of X matches table, then X is a record: a table contained in S.
The first atom: the parent pattern instance S is an instance of <tableseq>.
The second atom: looks for subelements that qualify as tables inside the unique tableseq instance and instantiates X with them.
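To make the two atoms concrete, the rule can be imitated in Python on a small tree. This is only an illustrative sketch: the page fragment, the `id="tableseq"` marker, and the helper names `tableseq`, `subelem`, and `record` are assumptions made for the example, and `xml.etree.ElementTree` stands in for Lixto's HTML tree (it needs well-formed markup).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<body>
  <div id="tableseq">
    <table><tr><td>item 1</td></tr></table>
    <p>separator</p>
    <table><tr><td>item 2</td></tr></table>
  </div>
  <table><tr><td>footer</td></tr></table>
</body>
""".strip()


def tableseq(root):
    """Stand-in for the parent pattern: return its instances in the page."""
    return root.findall(".//div[@id='tableseq']")


def subelem(parent, tag):
    """All tree regions strictly inside `parent` whose root element is `tag`."""
    return [node for node in parent.iter(tag) if node is not parent]


def record(root):
    """record(S, X) <- tableseq(_, S), subelem(S, table, X)."""
    return [(s, x) for s in tableseq(root) for x in subelem(s, "table")]


root = ET.fromstring(PAGE)
records = record(root)
texts = ["".join(x.itertext()).strip() for _, x in records]
print(texts)
```

The footer table is not extracted because it lies outside the parent pattern instance, which is the point of binding S first.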
Extraction language: Elog (body of the rule)
Attribute conditions: impose restrictions on matched elements, e.g., the value is in italics.
Element characterizations: the value is a concept like “isCity”.
Tree Extraction Definition Predicates: a variable should be instantiated with a node in the HTML tree which matches an element path definition.
Extraction language: Elog (body of the rule)
String extraction definition predicates: a string is associated with every node n of the parse tree by concatenating all strings corresponding to the leaves of the subtree rooted in n.
Contextual conditions: some other elements must or must not appear either before or after some instances.
Internal conditions: some characteristic feature must or must not appear within an instance.
Extraction language: Elog (body of the rule)
Concept conditions: predicates like isEmail(X), isCurrency(X)
Comparison conditions: compare values, e.g., two dates.
Pattern references: the parent pattern defines the context of a rule.
Range conditions: a range condition such as “[3,7]” can be added to any rule.
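Concept and range conditions can be pictured as simple filters over candidate matches. The sketch below is an assumption-laden illustration, not Elog's implementation: the regular expression, the currency list, and the reading of "[3,7]" as "third through seventh match, 1-based and inclusive" are all choices made for the example.

```python
import re


def is_email(s):
    """Concept condition sketch: a deliberately simple e-mail check."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", s) is not None


def is_currency(s):
    """Concept condition sketch: a small, hypothetical list of currency markers."""
    return s.strip() in {"$", "EUR", "USD", "GBP"}


def apply_range(instances, a, b):
    """Range condition [a,b]: keep the a-th through b-th matches (assumed 1-based, inclusive)."""
    return instances[a - 1:b]


candidates = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
selected = apply_range(candidates, 3, 7)
print(selected)
```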
Elog Extraction Program for a single eBay page
[Figure: Elog program for the eBay page, annotated with element characterization, string extraction definition predicates, and context condition]
How to build the extraction rules?
Pattern: a set of rules defining the same head.
Rule: a rule combines extraction conditions, such as attribute conditions, element characterizations, …
Filter: the counterpart of a rule; each filter corresponds to one rule.
How to Build a Wrapper
I: interactive; A: automatic
Interactively generate a new pattern
Recursive Wrapping
“$1” is interpreted as a constant whose value is the URL of the start document.
Recursively extract pages which are connected to each other via a “next page” link.
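The idea of recursive wrapping can be sketched as a function that extracts the items of one page and then calls itself on the page behind the "next page" link. Everything below is hypothetical: the URLs, the in-memory `SITE` dictionary that replaces real HTTP requests, and the function name `extract_all`.

```python
# Hypothetical in-memory "site": URL -> (items on the page, URL of the next page or None).
SITE = {
    "http://example.org/list?page=1": (["item A", "item B"], "http://example.org/list?page=2"),
    "http://example.org/list?page=2": (["item C"], "http://example.org/list?page=3"),
    "http://example.org/list?page=3": (["item D"], None),
}


def extract_all(start_url, site, seen=None):
    """Extract items from start_url, then recurse along the 'next page' link."""
    seen = set() if seen is None else seen
    if start_url is None or start_url in seen or start_url not in site:
        return []  # end of the chain, a cycle, or an unknown page
    seen.add(start_url)
    items, next_url = site[start_url]
    return items + extract_all(next_url, site, seen)


all_items = extract_all("http://example.org/list?page=1", SITE)
print(all_items)
```

The `seen` set is a precaution added for the sketch so that a "next page" link pointing back to an earlier page does not cause infinite recursion.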
Results reported from Lixto
How many example pages are necessary to get 100 percent correctly matched pattern instances
The time needed to construct the initial wrapper based on one example page
Question
How does the company profit from the data extraction program?
http://www.lixto.com
RoadRunner System
Aim:
• Extract data from data-intensive web sites.
• Data is stored in a back-end DBMS; HTML pages are dynamically generated using scripts.
Methods:
• Unsupervised wrapper generation
• Does not assume that sample pages are manually selected: the system is able to automatically cluster pages in a site into homogeneous classes.
• Does not rely on user-specified labeled examples: wrappers are generated and data are extracted in a completely automatic way.
• Does not assume any a priori knowledge about the target schema: deals with flat records and also nested structures.
Overview of RoadRunner System
Given a set of HTML pages, find a schema for the content of these pages.
A set of extraction rules parses the HTML code and retrieves the data according to the discovered schema.
Pattern discovery can be based on the study of similarities and dissimilarities between the pages.
A running example
Fig. a: a nested dataset obtained by querying a database.
Fig. b: each author’s book information is presented in the same style.
Method: compare the HTML code of the two pages, infer a common structure and a wrapper, and use the wrapper to extract the source dataset.
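A heavily simplified sketch of the comparison step is shown below. It assumes the two pages have exactly the same tag structure and differ only in text, so every string mismatch becomes a data field (written `#PCDATA` here). The actual RoadRunner system also handles tag mismatches (optional and repeated parts), which this sketch does not attempt. The sample pages and function names are invented for the example.

```python
import re


def tokenize(html):
    """Split a page into tags and non-empty text strings."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", html) if t.strip()]


def infer_wrapper(page_a, page_b):
    """Align two pages token by token; differing strings become data fields."""
    a, b = tokenize(page_a), tokenize(page_b)
    if len(a) != len(b):
        raise ValueError("this sketch only handles pages with identical tag structure")
    wrapper = []
    for ta, tb in zip(a, b):
        if ta == tb:
            wrapper.append(ta)  # common template token
        elif not ta.startswith("<") and not tb.startswith("<"):
            wrapper.append("#PCDATA")  # string mismatch: a data field
        else:
            raise ValueError("tag mismatch: not handled in this sketch")
    return wrapper


def extract(wrapper, page):
    """Apply the wrapper to a page and return the values of its data fields."""
    return [tok for w, tok in zip(wrapper, tokenize(page)) if w == "#PCDATA"]


PAGE_A = "<html><b>Books of:</b> John Smith <ul><li>Database Primer</li></ul></html>"
PAGE_B = "<html><b>Books of:</b> Paul Jones <ul><li>XML at Work</li></ul></html>"

wrapper = infer_wrapper(PAGE_A, PAGE_B)
fields = extract(wrapper, PAGE_B)
print(wrapper)
print(fields)
```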
Result of the extraction in the example
The Architecture of the System
• Classifier: analyzes pages from the target site and collects them into clusters with a homogeneous structure.
• Aligner: compares the HTML sources of some sample pages to infer a grammar to be used as a wrapper.
How to identify different page classes in the target site?
(Classifier: mapping a sample to the feature space)
Tag probability: it is reasonable to assume that pages complying with the same grammar have a similar “distribution” of tags, i.e., tags appear in the pages with similar probability.
Tag periodicity: there are cases in which tag probabilities may be misleading, since they do not give information about the relative positions of tags. Tag periodicity is used to complement tag probability.
Distance from the home page: if navigation paths in the site are well organized, it is reasonable to assume that pages containing homogeneous information are approximately at the same distance from the home page in the site graph.
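The tag-probability feature can be sketched as a normalized tag histogram per page, with a distance between histograms used to decide whether two pages look alike. This is an illustrative assumption, not the classifier described in the RoadRunner paper: the Euclidean distance, the sample pages, and the helper names are all choices made for the example.

```python
from collections import Counter
from html.parser import HTMLParser
import math


class TagCounter(HTMLParser):
    """Counts how often each start tag occurs in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_probabilities(html):
    """Map each tag to its relative frequency among all start tags of the page."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts.values())
    return {tag: n / total for tag, n in parser.counts.items()} if total else {}


def distance(p, q):
    """Euclidean distance between two tag distributions (an assumed choice)."""
    tags = set(p) | set(q)
    return math.sqrt(sum((p.get(t, 0.0) - q.get(t, 0.0)) ** 2 for t in tags))


# Hypothetical pages: two author pages with the same layout and one index page.
AUTHOR_1 = "<html><body><h1>A</h1><ul><li>x</li><li>y</li></ul></body></html>"
AUTHOR_2 = "<html><body><h1>B</h1><ul><li>z</li><li>w</li></ul></body></html>"
INDEX = "<html><body><a href='a'>A</a><a href='b'>B</a><a href='c'>C</a></body></html>"

d_same = distance(tag_probabilities(AUTHOR_1), tag_probabilities(AUTHOR_2))
d_diff = distance(tag_probabilities(AUTHOR_1), tag_probabilities(INDEX))
print(d_same, d_diff)
```

The two author pages have identical tag distributions, so their distance is zero, while the index page lies further away. As the slide notes, such a histogram ignores tag positions, which is why it is complemented by other features.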
Architecture of the System (cont.)
Expander: infers a wrapper for singleton pages. Most singleton pages are indices or links to other pages.
Labeler: associates a semantic meaning to the data fields that can be extracted by running the wrappers generated by the above modules.
Discussion:
How to label the data items extracted from the page?
The Labeler (methods)
Done manually.
Adoption of knowledge representation techniques, using some domain ontology.
Based on a generalized notion of closeness between the wrapper’s tokens and non-terminal symbols.
The Labeler
Check whether the pattern sub-tree is adjacent to some isomorphic sub-tree.
The leaves of the discovered tree can be selected as names for the non-terminals of the pattern tree.
For example, the strings “name” and “phone” are candidates to be used as names for the non-terminals $A and $B, respectively.
The Labeler (cont.)
Richness of the Web itself:
It is possible that in some page a given data item is associated with some information describing its meaning. It is reasonable to expect that in some of the pages retrieved by a search engine, the input value is explicitly associated with some descriptive text.
Further Research:
Summary
• Lixto system (interactive wrapper generation, semi-supervised)
• RoadRunner system (data-intensive page extraction, unsupervised)
References
Robert Baumgartner et al. “Visual Web Information Extraction with Lixto.” Proceedings of the 27th VLDB Conference, 2001.
Valter Crescenzi et al. “Automatic Web Information Extraction in the ROADRUNNER System.” LNCS 2465, pp. 264–277, 2002.
Jun Zhu et al. “Simultaneous Record Detection and Attribute Labeling in Web Data Extraction.” KDD 2006.
Never-Ending Learning (NELL)
Has been reading the web 24 hours a day since January 2010.
Has acquired a knowledge base with 80 million confidence-weighted beliefs.
http://rtw.ml.cmu.edu
Problem Statement
A set L = {L_i} of learning tasks, where the i-th task L_i = (T_i, P_i, E_i) is to improve the performance metric P_i on a given performance task T_i through a given type of experience E_i.
A set of coupling constraints C = {(φ_k, V_k)}:
φ_k is a real-valued function over two or more learning tasks, specifying the degree of satisfaction of the constraint.
V_k is a vector of indices over learning tasks.
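The two sets in the problem statement map naturally onto small data structures. The sketch below is one possible reading, not NELL's code: the class names, the example tasks, and the particular function `mutually_exclusive` are hypothetical and chosen only to show how φ_k is applied to the tasks selected by V_k.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class LearningTask:
    performance_task: str  # T_i
    metric: str            # P_i
    experience: str        # E_i


@dataclass(frozen=True)
class CouplingConstraint:
    phi: Callable[..., float]  # phi_k: returns the degree of satisfaction
    indices: Tuple[int, ...]   # V_k: indices of the learning tasks it couples


def satisfaction(constraint, predictions):
    """Apply phi_k to the predictions of the tasks listed in V_k."""
    args = [predictions[i] for i in constraint.indices]
    return constraint.phi(*args)


# Hypothetical tasks: two category classifiers over noun phrases.
tasks = [
    LearningTask("classify noun phrases as 'city'", "accuracy", "labeled examples + web text"),
    LearningTask("classify noun phrases as 'person'", "accuracy", "labeled examples + web text"),
]


def mutually_exclusive(p_city, p_person):
    """Assumed phi: high when at most one of the two categories is predicted."""
    return 1.0 - min(p_city, p_person)


constraint = CouplingConstraint(phi=mutually_exclusive, indices=(0, 1))
score_ok = satisfaction(constraint, [0.9, 0.1])   # confident 'city', not 'person'
score_bad = satisfaction(constraint, [0.9, 0.8])  # both predicted: constraint strained
print(score_ok, score_bad)
```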
Problem Statement (cont.)
Input of the System
An ontology of categories and binary relations (~800 categories and relations)
10-20 labeled training examples for each category and relation
The web, and access to 100,000 Google API search queries
Occasional interaction with humans
What the System Does
Reads (extracts) more beliefs from the web
Removes old incorrect beliefs
Populates a growing knowledge base containing a confidence and provenance for each belief
Learns to read better than the previous day
Result: a KB with over 90,000,000 extracted beliefs (at different levels of confidence)
Output of the System
NELL architecture
Techniques used for learning tasks
Category classification: NELL learns different boolean functions for each of the 280 categories in its ontology, allowing noun phrases to refer to entities in multiple semantic categories.
Relation classification: NELL learns distinct boolean-valued classification functions for each of the 327 relations in its ontology.
Entity resolution: functions that classify noun phrase pairs by whether they are synonyms.
Inference rules among belief triples: functions that map from NELL’s current KB to new beliefs it should add to its KB.
Techniques used for coupling constraints
Multi-view co-training coupling.
Subset/superset coupling.
Multi-label mutual exclusion coupling.
Coupling relations to their argument types.
Horn clause coupling.
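Three of the coupling types listed above can be illustrated as consistency checks on candidate beliefs. This is a sketch under stated assumptions, not NELL's implementation: the tiny ontology (`athlete`, `person`, `city`, `team`, `playsFor`) and the helper names are hypothetical, and real coupling is applied softly during training rather than as hard filters.

```python
# Hypothetical miniature ontology used only for this example.
SUBSET_OF = {"athlete": "person"}                     # subset/superset coupling
MUTUALLY_EXCLUSIVE = {frozenset({"person", "city"})}  # mutual exclusion coupling
ARGUMENT_TYPES = {"playsFor": ("athlete", "team")}    # relation argument types


def violates_category_constraints(categories):
    """True if a set of predicted categories breaks a subset or exclusion constraint."""
    for sub, sup in SUBSET_OF.items():
        if sub in categories and sup not in categories:
            return True  # e.g. an athlete must also be a person
    return any(pair <= categories for pair in MUTUALLY_EXCLUSIVE)


def relation_is_well_typed(relation, arg1_categories, arg2_categories):
    """True if both arguments carry the category the relation expects."""
    type1, type2 = ARGUMENT_TYPES[relation]
    return type1 in arg1_categories and type2 in arg2_categories


consistent = violates_category_constraints({"athlete", "person"})
missing_superset = violates_category_constraints({"athlete"})
clash = violates_category_constraints({"city", "person"})
typed = relation_is_well_typed("playsFor", {"athlete", "person"}, {"team"})
print(consistent, missing_superset, clash, typed)
```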
Coupled semi-supervised training of many functions
Co-training, multi-view
Learn functions with the same input and different outputs, where some constraints are known.
Advantages of NELL
To achieve successful semi-supervised learning, couple the training of many different learning tasks.
Allow the agent to learn additional coupling constraints.
Learn new representations that cover relevant phenomena beyond the initial representation.
Organize the set of learning tasks into an easy-to-increasingly-difficult curriculum.