
IE Systems


Academic year: 2022


(1)

IE Systems

Fang Li

(2)

What is an IE system?

Some limitations:

Information is defined in advance

Domain-specific

(3)

Types of IE systems introduced

Aim: Extract information from the Internet.

Wrapper Systems

NLP based extraction system

Open-domain extraction system

(4)

Contents

Wrapper-based IE systems:

Lixto system (semi-automatic)

RoadRunner system (automatic)

Never-Ending Learning (NELL)

(5)

Discussion

Given a page containing a lot of data, how can it be extracted automatically?

(6)

Lixto System

A system and method for the visual and interactive generation of wrappers for web pages under the supervision of a human developer, for automatically extracting information from web pages using such wrappers, and for translating the extracted content into XML.

www.lixto.com (the company was founded in 2001 as a spin-off of the Vienna Technical University)

(7)
(8)

Features of Lixto

Very high expressive power:

for defining sophisticated extraction patterns

Excellent visual support

for marking extraction patterns

Good learnability

No extraction language needs to be learned

Sample parsimony

Very few sample pages are needed in order to define robust wrappers

Simple and smooth XML translation mechanism

(9)

Architecture of Lixto System

(10)

Architecture of Lixto System (cont.)

Interactive pattern builder: provides the visual UI that allows a user to specify the desired extraction patterns, and the basic algorithm for creating a corresponding Elog wrapper as output.

Extractor: the Elog program interpreter that performs the actual extraction based on a given Elog program.

Controller of the XML Generator: the user chooses how to map extracted information to XML.
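The mapping from extracted pattern instances to XML can be sketched roughly as follows. This is only an illustration, not Lixto's XML Generator: the tuple layout `(pattern_name, value, children)` and the sample values are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical extraction result: each pattern instance is
# (pattern_name, value, children). The layout is an assumption.
extracted = ("record", None, [
    ("itemdes", "Used laptop", []),
    ("price", "120.00", []),
])

def to_xml(instance):
    """Map one pattern instance to an XML element named after its pattern."""
    name, value, children = instance
    elem = ET.Element(name)
    if value is not None:
        elem.text = value
    for child in children:
        elem.append(to_xml(child))
    return elem

xml_text = ET.tostring(to_xml(extracted), encoding="unicode")
print(xml_text)
```

Using the pattern name as the element name mirrors the idea that the pattern hierarchy already gives the nesting of the output document.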

(11)

About extraction language: Elog

Elog: a system-internal, datalog-like, rule-based language specially designed for hierarchical and modular data extraction.

Datalog rule:

Happy(d) <- Frequents(d,bar) AND Likes(d,beer) AND Sells(bar,beer,p)

Head predicate:

Happy(d)

Rule body: the atoms to the right of the arrow. When all atoms of the body are true, the head is true.
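A minimal sketch of how such a rule could be evaluated over a small fact base. The facts and names (`ann`, `joes`, and so on) are made up for illustration, and the nested loops stand in for what a real datalog engine would do with joins.

```python
# Toy fact base (all names and values are illustrative assumptions).
frequents = {("ann", "joes"), ("bob", "sues")}
likes = {("ann", "lager"), ("bob", "stout")}
sells = {("joes", "lager", 3.0), ("sues", "lager", 2.5)}

def happy():
    """Happy(d) <- Frequents(d,bar) AND Likes(d,beer) AND Sells(bar,beer,p)."""
    result = set()
    for d, bar in frequents:
        for d2, beer in likes:
            if d2 != d:
                continue  # both atoms must bind the same drinker d
            # Sells(bar, beer, p) must hold for some price p
            if any(b == bar and be == beer for b, be, _p in sells):
                result.add(d)
    return result

happy_drinkers = happy()
print(happy_drinkers)
```

Here only `ann` satisfies all three body atoms, because the bar `bob` frequents does not sell the beer he likes.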

(12)

Extraction language: Elog

The head of a rule r is of the form p(S,X):

p is a pattern name,

S is a variable which is bound in the body of the rule to the parent-pattern instances of the filter corresponding to r,

X is the target variable which, at extraction time, is bound to some target pattern instance (a tree region or string) to be extracted.

(13)

Extraction language: Elog (cont.)

A standard extraction rule:

New(S,X) <- Par(_,S), Ex(S,X), Co(S,X,…)[a,b]

Par(_,S): parent pattern predicate

Ex(S,X): extraction definition predicate

Co(S,X,…): further imposed conditions

[a,b]: optional range parameters

(14)

Rule example

record(S, X) <- tableseq(_, S), subelem(S, :table, X)

If S is an instance of tableseq, X is a tree region contained in S, and the root of X matches table, then X is a table contained in S.

The first atom: the parent pattern is an instance of <tableseq>.

The second atom: looks for subelements that qualify as tables inside the unique tableseq instance and instantiates X with them.
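The reading of this rule can be sketched with Python's standard `xml.etree.ElementTree`. This is not the Elog interpreter: the sample markup, the choice of a `div` with `id='seq'` as the tableseq instance, and the decision to match tables at any depth inside the parent are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup (an assumption; real HTML usually needs a tolerant parser).
page = ET.fromstring(
    "<body><div id='seq'><table><tr><td>A</td></tr></table>"
    "<p>note</p><table><tr><td>B</td></tr></table></div></body>"
)

def tableseq(root):
    """Parent pattern: here, hypothetically, the div that wraps the tables."""
    return [e for e in root.iter("div") if e.get("id") == "seq"]

def subelem(parent, tag):
    """Tree regions inside `parent` whose root element matches `tag`."""
    return [e for e in parent.iter(tag) if e is not parent]

# record(S, X) <- tableseq(_, S), subelem(S, :table, X)
records = [(s, x) for s in tableseq(page) for x in subelem(s, "table")]
record_texts = ["".join(x.itertext()) for _s, x in records]
print(record_texts)
```

Each pair `(s, x)` corresponds to one binding of the variables S and X in the rule head.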

(15)

Extraction language: Elog (body of the rule)

Attribute conditions: impose restrictions on matched elements, e.g., the value is in italics.

Element characterizations: the value is a concept like “isCity”.

Tree Extraction Definition Predicates: a variable should be instantiated with a node in the HTML tree which matches an element path definition.

(16)

Extraction language: Elog (body of the rule)

String extraction definition predicates: a string is associated with every node n of the parse tree by concatenating all strings corresponding to the leaves of the subtree rooted in n.

Contextual conditions: some other elements must or must not appear either before or after some instances.

Internal conditions: some characteristic feature must or must not appear with an instance.
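The string associated with a node and a simple "before" condition can be sketched as below. The helper names `node_string` and `after_label` are hypothetical, and the contextual condition is reduced to "the immediately preceding sibling reads a given label", which is an assumed simplification.

```python
import xml.etree.ElementTree as ET

row = ET.fromstring(
    "<tr><td><b>Price:</b></td><td>EUR <i>12</i>.50</td><td>new</td></tr>"
)

def node_string(n):
    """String of node n: concatenation of the text at the leaves of n's subtree."""
    return "".join(n.itertext())

def after_label(cells, label):
    """Contextual condition sketch: keep cells whose preceding sibling reads `label`."""
    return [cur for prev, cur in zip(cells, cells[1:])
            if node_string(prev).strip() == label]

cells = list(row)
price_cells = after_label(cells, "Price:")
price_string = node_string(price_cells[0])
print(price_string)
```

Note that the price text is split across three leaves of the second cell, and concatenation recovers it as one string.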

(17)

Extraction language: Elog (body of the rule)

Concept conditions: predicates like isEmail(X), isCurrency(X)

Comparison conditions: e.g., compare two dates.

Pattern references: the parent pattern defines the context of a rule.

Range conditions: a range condition such as “[3,7]” can be added to any rule.
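Concept and range conditions can be illustrated with a short sketch. The regular expressions are rough stand-ins, not Lixto's built-in concept definitions, and reading `[a,b]` as "keep the a-th through b-th matched instances, 1-based and inclusive" is an assumption.

```python
import re

# Rough, illustrative patterns (assumptions, not Lixto's concept definitions).
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
CURRENCY_RE = re.compile(r"^(EUR|USD|GBP|\$|€|£)$")

def is_email(x):
    return EMAIL_RE.match(x) is not None

def is_currency(x):
    return CURRENCY_RE.match(x) is not None

def apply_range(instances, a, b):
    """Keep the a-th through b-th matched instances (1-based, inclusive; assumed reading)."""
    return instances[a - 1:b]

candidates = ["a@x.org", "n/a", "b@y.com", "c@z.net", "USD"]
emails = [c for c in candidates if is_email(c)]
second_and_third = apply_range(emails, 2, 3)
print(second_and_third)
```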

(18)
(19)

Elog Extraction Program for a single eBay page

[Figure: Elog program listing with callouts marking an element characterization, string extraction definition predicates, and a context condition]

(20)

How to build the extraction rules?

Pattern: a set of rules defining the same head.

Rule: a rule defines many extraction conditions, such as attribute conditions, element characterizations, …

Filter: corresponds to a rule.

(21)

How to Build Wrapper

I: interactive; A: automatic

Interactively generate a new pattern

(22)

Recursive Wrapping

“$1” is interpreted as a constant whose value is the URL of the start document.

(23)

Recursively extract pages that are connected to each other via a “next page” link.
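The idea of following “next page” links can be sketched without any network access. The `SITE` dictionary, its URLs, and `extract_page` are hypothetical stand-ins for fetching a document and running the wrapper on it.

```python
# Hypothetical in-memory "site": URL -> (items on the page, URL of the next page or None).
SITE = {
    "http://example.org/list?page=1": (["item 1", "item 2"], "http://example.org/list?page=2"),
    "http://example.org/list?page=2": (["item 3"], "http://example.org/list?page=3"),
    "http://example.org/list?page=3": (["item 4"], None),
}

def extract_page(url):
    """Stand-in for running the wrapper on one document."""
    return SITE[url]

def crawl(url, seen=None):
    """Apply the same wrapper to a page and, recursively, to its 'next page'."""
    seen = set() if seen is None else seen
    if url is None or url in seen:
        return []  # stop at the last page or on a link cycle
    seen.add(url)
    items, next_url = extract_page(url)
    return items + crawl(next_url, seen)

all_items = crawl("http://example.org/list?page=1")
print(all_items)
```

The start URL plays the role of the “$1” constant; the `seen` set is an added safeguard against cyclic links.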

(24)

Results reported from Lixto

How many example pages are necessary to get 100 percent of correctly matched pattern instances

The time needed for constructing the initial wrapper based on one example page

(25)

Question

How does the company profit from the data extraction program?

http://www.lixto.com

(26)

RoadRunner System

Aim:

Extract data from data-intensive web sites.

Data is stored in a back-end DBMS, HTML pages are dynamically generated using scripts

Methods:

Unsupervised wrapper generation

Does not assume that sample pages are manually selected: the system is able to automatically cluster the pages in a site into homogeneous classes.

Does not rely on user-specified labeled examples: wrappers are generated and data are extracted in a completely automatic way.

Does not assume any a priori knowledge about the target schema; deals with flat records and also nested structures.

(27)

Overview of RoadRunner System

Given a set of HTML pages, find a schema for the content of these pages.

A set of extraction rules parse the HTML code and retrieve the data according to the discovered schema.

Pattern discovery can be based on the study of similarities and dissimilarities between the pages

(28)

A running example

Fig. a: a nested dataset obtained by querying a database.

Fig. b: each author’s book information is presented in the same style.

Method: compares the HTML code of the two pages, infers a common structure and a wrapper, and uses that to extract the source dataset.
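A heavily simplified sketch of comparing two pages to infer a wrapper. It is not RoadRunner's matching algorithm: alignment is delegated to the standard `difflib.SequenceMatcher`, only text mismatches are generalized (into `#PCDATA` fields), and tag mismatches, which would call for optional or repeated parts, are rejected. The sample pages are invented.

```python
import re
from difflib import SequenceMatcher

TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")

def tokenize(html):
    """Split a page into tag tokens and non-empty text tokens."""
    return [t.strip() for t in TOKEN_RE.findall(html) if t.strip()]

def is_tag(token):
    return token.startswith("<")

def infer_wrapper(page_a, page_b):
    """Keep tokens common to both pages; turn differing text into #PCDATA fields."""
    a, b = tokenize(page_a), tokenize(page_b)
    wrapper = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            wrapper.extend(a[i1:i2])
        elif op == "replace" and not any(is_tag(t) for t in a[i1:i2] + b[j1:j2]):
            wrapper.append("#PCDATA")
        else:
            raise ValueError("tag mismatch: optionals/iterators are not handled here")
    return wrapper

def extract(wrapper, page):
    """Read the values that line up with #PCDATA fields."""
    tokens = tokenize(page)
    if len(tokens) != len(wrapper):
        raise ValueError("page does not match wrapper")
    return [t for w, t in zip(wrapper, tokens) if w == "#PCDATA"]

page_a = "<html><b>Books of:</b><i>John Smith</i><ul><li>DB Primer</li></ul></html>"
page_b = "<html><b>Books of:</b><i>Paul Jones</i><ul><li>XML at Work</li></ul></html>"
wrapper = infer_wrapper(page_a, page_b)
print(extract(wrapper, page_b))
```

Text that is identical on both pages, such as "Books of:", stays in the wrapper as part of the template, while text that differs is treated as data.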

(29)

Result of the extraction in the example

(30)

The Architecture of the System

•Classifier: analyzes pages from the target site and collects them into clusters with a homogeneous structure.

•Aligner: compares the HTML sources of some sample pages to infer a grammar to be used as a wrapper.

(31)

How to identify different page classes in the target site?

(Classifier: mapping a sample to the feature space)

Tag Probability: it is reasonable to assume that pages complying with the same grammar have a similar “distribution” of tags, i.e., tags appear in the pages with similar probability.

Tag Periodicity: there are cases in which tag probabilities may be misleading, since they do not give information about the relative positions of tags. Tag frequency is used to complement tag probability.

Distance from the Home Page: if navigation paths in the site are well organized, it is reasonable to assume that pages containing homogeneous information are approximately at the same distance from the home page in the site graph.
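The tag-probability feature can be sketched with the standard `html.parser` module. Comparing pages by cosine similarity is an assumption made for this example, not something the slides specify, and the three sample pages are invented.

```python
import math
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Counts start tags seen while parsing a page."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1

def tag_probabilities(html):
    """Map each tag to its relative frequency in the page."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts.values())
    return {tag: n / total for tag, n in parser.counts.items()} if total else {}

def cosine(p, q):
    """Cosine similarity between two sparse tag-probability vectors."""
    dot = sum(v * q.get(t, 0.0) for t, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

book_a = "<html><body><h1>A</h1><ul><li>x</li><li>y</li></ul></body></html>"
book_b = "<html><body><h1>B</h1><ul><li>x</li><li>y</li><li>z</li></ul></body></html>"
index = "<html><body><a href='1'>1</a><a href='2'>2</a><a href='3'>3</a></body></html>"

sim_same = cosine(tag_probabilities(book_a), tag_probabilities(book_b))
sim_diff = cosine(tag_probabilities(book_a), tag_probabilities(index))
print(sim_same > sim_diff)
```

A clustering step could then group pages whose feature vectors are close; other features such as distance from the home page would be added as further vector components.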

(32)

Architecture of the System (cont.)

Expander: infers a wrapper for singleton pages. Most singleton pages are indices or links to other pages.

Labeler: associates a semantic meaning to the data fields that can be extracted by running the wrappers generated by the above modules.

Discussion:

How to label the data items extracted from the page?

(33)

The Labeler (methods)

To be done manually.

Adoption of knowledge representation techniques, using some domain ontology.

Based on a generalized notion of closeness between the wrapper’s tokens and non-terminal symbols.

(34)

The Labeler

Check whether the pattern sub-tree is adjacent to some isomorphic sub-tree.

The leaves of the discovered tree can be selected as names for the non-terminals of the pattern tree.

For example, the strings “name” and “phone” are candidates to be used as names for the non-terminals $A and $B, respectively.

(35)

The Labeler (cont.)

Richness of the Web itself:

It is possible that in some page a given data item is associated with some information describing its meaning. It is reasonable to expect that, in some of the pages retrieved by the search engine, the input value is explicitly associated with some descriptive text.

(36)

Further Research:

(37)

Summary

Lixto system (interactive wrapper generation, semi-supervised)

RoadRunner system (data-intensive page extraction, unsupervised)

(38)

References

Robert Baumgartner, et al. “Visual Web Information Extraction with Lixto.” Proceedings of the 27th VLDB Conference, 2001.

Valter Crescenzi, et al. “Automatic Web Information Extraction in the RoadRunner System.” LNCS 2465, pp. 264–277, 2002.

Jun Zhu, et al. “Simultaneous Record Detection and Attribute Labeling in Web Data Extraction.” KDD 2006.

(39)

Never-Ending Learning (NELL)

Has been reading the web 24 hours a day since January 2010.

Acquired a knowledge base with 80 million confidence-weighted beliefs.

http://rtw.ml.cmu.edu

(40)

Problem Statement

A set L = {L_i} of learning tasks,

where the i-th task L_i = (T_i, P_i, E_i) is to improve performance, as measured by performance metric P_i, on a given performance task T_i, through a given type of experience E_i.

A set of coupling constraints C = {(φ_k, V_k)}, where:

φ_k is a real-valued function over two or more learning tasks, specifying the degree of satisfaction of the constraint;

V_k is a vector of indices over learning tasks, specifying the arguments to φ_k.
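The two ingredients of the problem statement can be written down as simple data structures. This is only a sketch: the class and field names, the sample tasks, and the mutual-exclusion scoring function are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class LearningTask:
    """One learning task: a performance task, a metric, and a type of experience."""
    task: str        # what to perform
    metric: str      # how performance is measured
    experience: str  # the kind of experience used for learning

@dataclass
class CouplingConstraint:
    """A real-valued function plus the indices of the tasks it couples."""
    phi: Callable[..., float]
    task_indices: Sequence[int]

    def satisfaction(self, outputs):
        """Degree of satisfaction given one output per learning task."""
        return self.phi(*(outputs[i] for i in self.task_indices))

tasks = [
    LearningTask("classify noun phrase as city", "precision", "web text"),
    LearningTask("classify noun phrase as person", "precision", "web text"),
]

# Hypothetical constraint: 'city' and 'person' should not both be likely
# for the same noun phrase.
mutual_exclusion = CouplingConstraint(
    lambda p_city, p_person: 1.0 - min(p_city, p_person), [0, 1]
)

score = mutual_exclusion.satisfaction([0.9, 0.2])
print(score)
```

A low score would signal that the two classifiers disagree with the constraint, which is the kind of signal coupled training can exploit.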

(41)

Problem Statement (cont.)

(42)

Input of the System

Input

Ontology of categories and binary relations (~800 categories and relations)

10-20 labeled training examples for each category and relation

The web and access to 100,000 Google API search queries

Occasional interaction with humans

What the system does

Reads (extracts) more beliefs from the web

Removes old incorrect beliefs

Populates a growing knowledge base containing a confidence and provenance for each belief

Learns to read better than the previous day

Result: a KB with more than 90,000,000 extracted beliefs (at different levels of confidence)

(43)

Output of the system

(44)

NELL architecture

(45)

Techniques used for learning tasks

Category classification: NELL learns different boolean functions for each of the 280 categories in its ontology, allowing noun phrases to refer to entities in multiple semantic categories.

Relation classification: NELL learns distinct boolean-valued classification functions for each of the 327 relations in its ontology.

Entity Resolution: Functions that classify noun phrase pairs by whether they are synonyms.

Inference rules among belief triples: Functions that map from NELL’s current KB, to new beliefs it should add to its KB.
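An inference rule that maps the current KB to new candidate beliefs can be sketched as follows. The triples, relation names, and the way confidences are combined (rule confidence times the confidences of the premises) are all assumptions for illustration, not NELL's actual scoring.

```python
# Toy KB of belief triples -> confidence (all values are made up).
kb = {
    ("player_a", "playsForTeam", "team_x"): 0.95,
    ("team_x", "teamPlaysSport", "basketball"): 0.90,
}

def infer_plays_sport(kb, rule_confidence=0.8):
    """playsSport(x, s) <- playsForTeam(x, t), teamPlaysSport(t, s)."""
    new = {}
    for (x, r1, t), c1 in kb.items():
        if r1 != "playsForTeam":
            continue
        for (t2, r2, s), c2 in kb.items():
            if r2 == "teamPlaysSport" and t2 == t:
                triple = (x, "playsSport", s)
                if triple not in kb:
                    # Assumed combination: product of rule and premise confidences.
                    new[triple] = rule_confidence * c1 * c2
    return new

new_beliefs = infer_plays_sport(kb)
print(new_beliefs)
```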

(46)

Techniques used for coupling constraints

Multi-view co-training coupling.

Subset/superset coupling.

Multi-label mutual exclusion coupling.

Coupling relations to their argument types.

Horn clause coupling.
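Three of these couplings can be illustrated as consistency checks on candidate beliefs. The ontology fragments (`SUBSET_OF`, `MUTEX`, `ARG_TYPES`) and function names are hypothetical, and treating the constraints as hard yes/no checks is a simplification of confidence-weighted coupling.

```python
# Hypothetical ontology fragments (assumptions for illustration only).
SUBSET_OF = {"athlete": "person", "coach": "person"}
MUTEX = {frozenset({"person", "city"}), frozenset({"person", "sport"})}
ARG_TYPES = {"playsSport": ("athlete", "sport")}

def ancestors(cat):
    """A category plus all of its supersets."""
    out = {cat}
    while cat in SUBSET_OF:
        cat = SUBSET_OF[cat]
        out.add(cat)
    return out

def closed(categories):
    """Subset/superset coupling: believing 'athlete' implies 'person'."""
    result = set()
    for c in categories:
        result |= ancestors(c)
    return result

def violates_mutex(categories):
    """Mutual exclusion coupling: no entity may hold two exclusive categories."""
    cats = closed(categories)
    return any(pair <= cats for pair in MUTEX)

def relation_ok(relation, arg1_cats, arg2_cats):
    """Argument-type coupling: each argument must belong to the declared category."""
    t1, t2 = ARG_TYPES[relation]
    return t1 in closed(arg1_cats) and t2 in closed(arg2_cats)

print(violates_mutex({"athlete", "city"}))
print(relation_ok("playsSport", {"athlete"}, {"sport"}))
```

A candidate belief that fails such a check is evidence that at least one of the coupled classifiers made an error.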

(47)

Coupled semi-supervised

training of many functions

(48)

Co-training, Multiview

(49)

Learn functions with the same input and different outputs, where some constraints are known.

(50)
(51)
(52)

Advantages of NELL

To achieve successful semi-supervised learning, couple the training of many different learning tasks.

Allow the agent to learn additional coupling constraints.

Learn new representations that cover relevant phenomena beyond the initial representation.

Organize the set of learning tasks into an easy-to-increasingly-difficult curriculum.
