An Overview of Web Retrieval and Mining

(1)

An Overview of

Web Retrieval and Mining

Instructor: Pu-Jen Cheng (鄭卜壬)

Department of Computer Science & Information Engineering National Taiwan University

(2)

Instructor

Postdoc. in Academia Sinica

Dr. Lee-Feng Chien

(Former Director of Google TW)

Univ. of Illinois at Urbana–Champaign Visiting Scholar

(DAIS Lab)

Google Research Award

Microsoft Research Awards

PI of Web Mining and Information Retrieval Lab

(3)

World-Wide Web

Initiated at CERN

(European Organization for Nuclear Research)

– By Tim Berners-Lee (1989)

– 1945: Vannevar Bush, "As We May Think"

– 1965: Ted Nelson, 'hypertext': non-sequential writing

Mosaic (1993)

– A hypertext GUI for the X-window system

– CERN HTTPD: server of hypertext documents

Tim Berners-Lee

(4)

1994: the Landmark Year

Foundation of the “Mosaic Communications Corporation”

First World-wide Web conference http://www.iw3c2.org/conferences/

MIT and CERN agreed to set up the World-Wide Web Consortium (W3C).

(5)

Crawling and Indexing

Crawler/Spiders/Web robots/Bots

Purpose of crawling and indexing

– Quick fetching of large number of Web pages into a local repository

– Indexing based on keywords

– Ordering responses to maximize user’s chances of the first few responses satisfying her information need.

One of the earliest search engines: Lycos (Jan 1994)

Followed by…

– Alta Vista (1995), HotBot and Inktomi, Excite
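The keyword-based indexing step above can be sketched with a small inverted index. This is a minimal illustration, not how any particular search engine works: the `pages` repository, its page ids, and the function names `build_index` and `search` are all hypothetical, and tokenization is plain whitespace splitting.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase keyword to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages containing every query keyword (boolean AND)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical local repository of already-fetched pages.
pages = {
    "p1": "web search engines index pages",
    "p2": "crawlers fetch web pages",
    "p3": "topic directories organize sites",
}
index = build_index(pages)
matches = search(index, "web pages")
```

Here `matches` holds the pages that contain both keywords. Ordering those matches so the first few satisfy the user is a separate ranking step that this sketch does not cover.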

(6)

Topic Directories

Yahoo! directory

– To locate useful Web sites

– Jerry Yang & David Filo (Ph.D. students at Stanford University), 1994

Efforts for organizing knowledge into ontologies

– Centralized: Yahoo!

– Decentralized: About.COM and the Open Directory

(7)

Hyperlink Analysis

Take advantage of the structure of the Web graph

– Indicators of prestige of a page (e.g. citations)

– HITS (Kleinberg 1998) & PageRank (Brin & Page 1998)

Bibliometry

– Bibliographic citation graph of academic papers

Topic distillation

– The process of finding quality documents on a query topic

– Adapting to idioms of Web authorship and linking styles
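As a rough illustration of link-based prestige, the sketch below runs a PageRank-style power iteration on a toy graph. The damping factor of 0.85, the fixed iteration count, the handling of pages with no out-links, and the four-page `graph` are assumptions made for this example, not details taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps a page to the list of pages it links to; returns page -> score."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same small "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page without out-links: spread its rank over all pages.
                share = damping * rank[p] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy Web graph: "c" is linked to by three pages, "d" by none.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
best = max(scores, key=scores.get)
```

In this toy graph the most-cited page "c" ends up with the highest score and the uncited page "d" with the lowest, which mirrors the citation-count intuition from bibliometry.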

(8)

Paid Placement Ranking

Goto.com (→ Overture.com → Yahoo!)

– Search ranking depended on how much you paid

– Auction for keywords: casino was expensive!

Result: Google added paid-placement “ads” to the side, independent of search results

– Yahoo follows suit, acquiring Overture (for paid placement) and Inktomi (for search)

(9)

Web 2.0

– Human-centered, social value, participation & co-creation

Explosive growth of multimedia data

Huge incentives in search business

Applications

Web 2.0

(10)

What’s going on?

More searches from mobile devices

More data from users

More integrated services for users

From passive to active services

More semantics from learning

(11)

From World-Wide Web to Now

1989 Initiated at CERN

1990 Machine Learning (Data-driven)

1993 Mosaic (First GUI-based Browser); Data Mining

1994 Mosaic Communications Corporation (Commercialized); W3C, WWW (Standard); Lycos Search Engine; Yahoo! Directory

1997 Deep Blue (Computer > Human: Chess (10^123))

1998 Web Structure Analysis (PageRank); Pay-for-placement (Goto.com)

Tim Berners-Lee

(12)

From World-Wide Web to Now (Cont.)

2001 Google (Getting noticed)

2004 Web 2.0 (Human-centered); Web as Platform (Services); Facebook

2008 Mobile Apps (Apple App Store, Android Market); Fintech (Bitcoin)

2011 Deep Learning (Big Data + GPU)

2016 AlphaGo (Computer > Human: Go (10^360))

2021 ?

(13)

The Problem of

Information Overload

(14)

Web Retrieval/Search

[Diagram: Web (logs, texts, images, …); Search Engine; Information Seeking; Billions of Users]

(15)

Without Search Engines the Web Wouldn’t Scale

No incentive in creating content unless it can be easily found

Web is both a technology artifact and a social environment

– The Web has become the “new normal” in the way of life

– Those who don’t go online constitute an ever-shrinking minority

Search engines make aggregation of interest possible

– Create incentives for very specialized niche players

• Economical – specialized stores, providers, etc

• Social – narrow interests, specialized communities, etc

The acceptance of search interaction makes “unlimited selection” stores possible

Search turned out to be the best mechanism for advertising on the web

(16)

Information Retrieval

Information retrieval (IR) deals with the representation, storage, organization of, and access to information items.

Converting information need to information items

– Information need: full description, keyword-based query

– Information item: text documents (often unstructured), Web pages (semi-structured), images, audio, video, …
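One way to picture the step from an information need to information items is a keyword query scored against a few text documents. The sketch below uses a simple TF-IDF sum, chosen here only for illustration; the function name `rank_documents`, the three documents in `docs`, and the whitespace tokenization are all assumptions for this example.

```python
import math
from collections import Counter


def rank_documents(query, docs):
    """Return doc ids ordered by the summed TF-IDF weight of the query terms."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for words in tokenized.values() for term in set(words))
    scores = {}
    for d, words in tokenized.items():
        tf = Counter(words)
        score = sum(
            tf[t] * math.log(n / df[t])
            for t in query.lower().split()
            if t in tf
        )
        if score > 0:
            scores[d] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical information items (unstructured text documents).
docs = {
    "d1": "web mining finds patterns in usage data",
    "d2": "information retrieval for the web",
    "d3": "image retrieval",
}
ranking = rank_documents("web retrieval", docs)
```

The document that matches both query terms ("d2") is ranked first. A term that occurs in every document gets weight zero under this scheme, so it contributes nothing to the ordering.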

(17)

The Niche of

Information Overload

(18)

Web Mining

[Diagram: Web (logs, texts, images, …); Search Engine; Knowledge Discovery; Millions of Users]

(19)

Taxonomy of Web Mining [R. Cooley]

Web Mining

Web Content Mining

Web Structure Mining

Web Usage Mining

Social Network Mining

– Web Content Mining (web page/text, search-result page, multimedia, tags, …)

– Web Usage Mining (query log analysis, user gap, community, …)

– Web Structure Mining (hyperlink, anchor text, web site, …)

– Social Network Mining (blog, Wikipedia, email, instant messaging, …)

(20)

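Document Query Logs

(Each row: query term, number of occurrences, share of all queries. Each list ends with the total number of queries.)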

sex 1551477 0.27%

applet 1169031 0.20%

porno 712790 0.12%

mp3 613902 0.11%

chat 406014 0.07%

warez 398953 0.07%

yahoo 377025 0.07%

playboy 356556 0.06%

xxx 324923 0.06%

hotmail 321267 0.06%

[non-ASCII query] 263760 0.05%

pamela anderson 256559 0.04%

p**** 234037 0.04%

sexo 226705 0.04%

porn 212161 0.04%

nude 190641 0.03%

lolita 179629 0.03%

games 166781 0.03%

spice girls 162272 0.03%

beastiality 152143 0.03%

575,244,993

MP3 42561 1.95%

24970 1.14%

24363 1.12%

sex 20182 0.92%

15071 0.69%

icq 13899 0.64%

13622 0.62%

12210 0.56%

12092 0.55%

11680 0.53%

11640 0.53%

10000 0.46%

9817 0.45%

9530 0.44%

9328 0.43%

bbs 8613 0.39%

kimo 8166 0.37%

104 7943 0.36%

7456 0.34%

7217 0.33%

2,183,506

Top 20 AltaVista query terms and their share of all queries
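A table like the one above can be produced by counting normalized query strings and dividing each count by the total number of queries (for instance, 1551477 / 575244993 ≈ 0.27%). The sketch below is a minimal version of that arithmetic. The function name `top_queries`, the lowercase and whitespace normalization, and the tiny `log` list are assumptions made for illustration, not the procedure used to build the original table.

```python
from collections import Counter


def top_queries(log, k=20):
    """Return (query, count, percent of all queries) for the k most frequent queries."""
    counts = Counter(q.strip().lower() for q in log if q.strip())
    total = sum(counts.values())
    return [
        (query, count, round(100.0 * count / total, 2))
        for query, count in counts.most_common(k)
    ]


# Hypothetical query log, one entry per submitted query.
log = ["mp3", "games", "MP3", "chat", "mp3 ", "games", "yahoo", "hotmail"]
table = top_queries(log, k=3)
for query, count, percent in table:
    print(f"{query}\t{count}\t{percent}%")
```

Because case and surrounding whitespace are normalized first, "mp3", "MP3", and "mp3 " are counted as the same query.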

(21)

Image Query Log

(22)

Common Interests in Web Pages

(23)

Common Interests in Web Images

[Chart; labeled shares: 34.4%, 23.4%, 8.0%]

(24)

Course Summary

(25)

Web Mining & Information Retrieval

[Diagram: Query Space, Doc Space, User Space, Author Space; Information Needs, Document, Authority; Search Engine; Searching; Mining]

(26)

Two Topics of this Course

– Access: select information from the Web

– Mining: create knowledge from the Web

(27)

Related Areas

– Information Retrieval

– Databases

– Library & Info Science

– Machine Learning, Pattern Recognition, Data Mining

– Natural Language Processing

– Applications: Web, Bioinformatics…

– Statistics, Optimization

– Software engineering, Computer systems

[Diagram layers: Models, Algorithms, Applications, Systems]

(28)

What to Read?

– Info Retrieval: ACM SIGIR, CIKM, TREC, ICTIR, ECIR, NTCIR, AIRS

– Databases: ACM SIGMOD, VLDB, PODS, ICDE

– Info. Science: ASIS, JCDL

– Learning/Mining: ICML, NIPS, UAI, AAAI, KDD, ICDM, SDM

– NLP: ACL, COLING, EMNLP, ANLP, HLT

– Applications: WWW, WSDM, WI

– Statistics: ??

– Software/systems: ??

(29)

922 U3640

Web Retrieval and Mining

(Spring 2021)

(30)

Goal & Design

Introduce “Web Search” and “Web Mining”

Prepare students for doing research/development in related fields

Targeted at (senior) undergraduate students and graduate students with computer science background

(31)

Schedule

Part I: Information Retrieval

– Indexing & Query Optimization

– Retrieval Model & User Interaction

– Evaluation

– Link Analysis

– Machine Learning for IR

– Deep Learning for IR

Part II: SIG Study

– Recommendation

– Opinion Mining, Sentiment Analysis (tentative)

– Information Extraction & Filtering (tentative)

(32)

Some Relevant NTU CSIE Courses

Information Retrieval

Natural Language Processing

Machine / Deep Learning

Data Mining

Social Network

Statistical Artificial Intelligence

(33)

Format

Handwritten Assignments (individual work)

2 Programming Assignments (individual work)

– Programming + Report

Midterm Exam

Final Project

– Team work (3~4 people, depending on the number of students)

– Programming

– Presentation & 2-page Report (including idea, literature review, method & experiment)

(34)

Grading

Assignments: 50% (hand-written, programming)

Midterm Exam: 20%

Term Project: 30%

(35)

Readings

Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. (Selected Chapters) Available online!

Information Retrieval: Implementing and Evaluating Search Engines, by Stefan Büttcher, Charles L.A. Clarke, Gordon V. Cormack. (Selected Chapters)

Modern Information Retrieval, by Ricardo Baeza-Yates, Berthier Ribeiro-Neto. (Selected Chapters)

Search Engines: Information Retrieval in Practice, by W. Bruce Croft, Donald Metzler, Trevor Strohman. (Selected Chapters)

Mining the Web: Discovering Knowledge from Hypertext Data, by Soumen Chakrabarti, Morgan Kaufmann. (Selected Chapters)

Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, by Bing Liu, Springer, 2006. (Selected Chapters) Available online!

Additional readings will be available online.

(36)

In response to COVID-19

Friday Morning Class + Asynchronous Online Learning

– Video (Screencast): NTU COOL

– Slides and reading materials: download from http://www.csie.ntu.edu.tw/~pjcheng/course/wm2021

– Hand-written and Programming Assignments: download from http://www.csie.ntu.edu.tw/~pjcheng/course/wm2021 & NTU COOL

All materials are for teaching only (copyright issue)

You cannot distribute anything (video, slide, reading,…)

NTU COOL: https://cool.ntu.edu.tw/courses/5426

(37)

Questions?

Related Websites:

http://www.csie.ntu.edu.tw/~pjcheng/course/wm2021

NTU COOL (https://cool.ntu.edu.tw/courses/5426)

Office hours:

Tuesday 9:00am-11:00am, R323

(38)

Good Luck!
