An Overview of Web Retrieval and Mining

(1)

Instructor: Pu-Jen Cheng (鄭卜壬)

Department of Computer Science & Information Engineering National Taiwan University

(2)

Instructor

• Postdoc. in Academia Sinica

– Dr. Lee-Feng Chien (Former Director of Google TW)

• Univ. of Illinois at Urbana–Champaign Visiting Scholar (DAIS Lab)

• Google Research Award, Microsoft Research Awards

• PI of Web Mining and Information Retrieval Lab

(3)

World-Wide Web

• Initiated at CERN

(European Organization for Nuclear Research)

– By Tim Berners-Lee (1989)

– 1945: Vannevar Bush, “As We May Think”

– 1965: Ted Nelson, 'hypertext': non-sequential writing

• Mosaic (1993)

– A hypertext GUI for the X-window system

– CERN HTTPD: server of hypertext documents

Tim Berners-Lee

(4)

1994: the Landmark Year

• Founding of the “Mosaic Communications Corporation”

• First World-wide Web conference http://www.iw3c2.org/conferences/

• MIT and CERN agreed to set up the World-wide Web Consortium (W3C).

(5)

Crawling and Indexing

• Crawlers/Spiders/Web robots/Bots

• Purpose of crawling and indexing

– Quick fetching of a large number of Web pages into a local repository

– Indexing based on keywords

– Ordering responses so that the first few are most likely to satisfy the user’s information need

• Among the earliest Web search engines: Lycos (1994)

• Followed by…

– AltaVista (1995), HotBot and Inktomi, Excite
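The indexing and ordering steps listed above can be sketched as a tiny keyword index. This is only an illustration under simplifying assumptions: the names `tokenize`, `build_index`, and `search`, the toy pages, and the term-frequency scoring are all hypothetical and are not taken from any particular search engine.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (real systems also handle
    punctuation, stemming, stop words, and so on)."""
    return text.lower().split()


def build_index(pages):
    """Build an inverted index: keyword -> {page_id: term frequency}."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][page_id] = tf
    return index


def search(index, query, k=3):
    """Return up to k page ids, ordered by the summed term frequency
    of the query keywords (an assumed, deliberately naive score)."""
    scores = Counter()
    for term in tokenize(query):
        for page_id, tf in index.get(term, {}).items():
            scores[page_id] += tf
    return [page_id for page_id, _ in scores.most_common(k)]


# Toy "local repository" standing in for crawled pages (made-up data).
pages = {
    "p1": "web search engines crawl the web",
    "p2": "mining knowledge from web usage logs",
    "p3": "chess and go programs",
}
index = build_index(pages)
results = search(index, "web search")
print(results)
```

The index is built once over the local repository, so answering a query only touches the posting lists of the query keywords rather than every stored page.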

(6)

Topic Directories

• Yahoo! directory

– To locate useful Web sites

– Jerry Yang & David Filo (Ph.D. students at Stanford University), 1994

• Efforts for organizing knowledge into ontologies

– Centralized: Yahoo!

– Decentralized: About.com and the Open Directory

(7)

Hyperlink Analysis

• Take advantage of the structure of the Web graph

– Indicators of prestige of a page (e.g. citations)

– HITS (Kleinberg 1998) & PageRank (Page et al. 1998)

• Bibliometrics

– Bibliographic citation graph of academic papers

• Topic distillation

– The process of finding quality documents on a query topic

– Adapting to idioms of Web authorship and linking styles
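As an illustration of using the Web graph as a prestige indicator, the following is a minimal power-iteration sketch of the PageRank idea. The function name `pagerank`, the damping factor 0.85, the fixed iteration count, the uniform handling of pages without out-links, and the toy graph are all assumptions made for this example, not details given in the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to.
    Returns a dict page -> score; the scores sum to 1."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same small "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # Dangling page (assumption): spread its rank over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web graph: C is linked to by A, B, and D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the page with the most in-links ends up with the highest score, and the page nobody links to ends up with the lowest, which mirrors the citation-count intuition from bibliometrics.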

(8)

Paid Placement Ranking

• Goto.com (→ Overture.com → Yahoo!)

– Search ranking depended on how much you paid

– Auction for keywords: casino was expensive!

• Result: Google added paid-placement “ads” to the side, independent of search results

– Yahoo! followed suit, acquiring Overture (for paid placement) and Inktomi (for search)

(9)

Web 2.0

• Web 2.0

– Human-centered, social value, participation & co-creation

• Explosive growth of multimedia data

• Huge incentives in the search business

• Applications

(10)

What’s going on?

• More searches from mobile devices

• More data from users

• More integrated services for users

• From passive to active services

• More semantics from learning

(11)

From World-Wide Web to Now

• 1989 Initiated at CERN

• 1990 Machine Learning (Data-driven)

• 1993 Mosaic (First GUI-based Browser); Data Mining

• 1994 Mosaic Communications Corporation (Commercialized); W3C, WWW (Standard); Lycos Search Engine; Yahoo! Directory

• 1997 Deep Blue (Computer > Human: Chess (10¹²³))

• 1998 Web Structure Analysis (PageRank); Pay-for-placement (Goto.com)

Tim Berners-Lee

(12)

From World-Wide Web to Now (Cont.)

• 2001 Google (Getting noticed)

• 2004 Web 2.0 (Human-centered); Web as Platform (Services); Facebook

• 2008 Mobile Apps (Apple App Store, Android Market); Fintech (Bitcoin)

• 2011 Deep Learning (Big Data + GPU)

• 2016 AlphaGo (Computer > Human: Go (10³⁶⁰))

• 2021 ?

(13)

The Problem of Information Overload

(14)

Web Retrieval/Search

[Figure labels: Web (logs, texts, images, …); Search Engine; Information Seeking; Billions of Users]

(15)

Without Search Engines the Web Wouldn’t Scale

• No incentive to create content unless it can be easily found

• The Web is both a technology artifact and a social environment

– The Web has become the “new normal” in the way of life

– Those who don’t go online constitute an ever-shrinking minority

• Search engines make aggregation of interest possible

– Create incentives for very specialized niche players

• Economic – specialized stores, providers, etc.

• Social – narrow interests, specialized communities, etc.

• The acceptance of search interaction makes “unlimited selection” stores possible

• Search turned out to be the best mechanism for advertising on the Web

(16)

Information Retrieval

• Information retrieval (IR) deals with the representation, storage, organization of, and access to information items.

• Converting information need to information items

– Information need: full description, keyword-based query

– Information item: text documents (often unstructured), Web pages (semi-structured), images, audio, videos, …
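One common way to turn a keyword-based information need into a ranked list of information items is the vector space model. The sketch below is illustrative only: the names `tfidf_vectors`, `cosine`, and `retrieve`, the toy documents, and the particular weighting `tf * (log(N/df) + 1)` are assumptions chosen for this example; many other weighting variants exist.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (doc_id -> sparse TF-IDF vector, term -> idf weight)."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for tokens in tokenized.values() for t in set(tokens))
    idf = {t: math.log(n / df_t) + 1.0 for t, df_t in df.items()}
    vectors = {
        d: {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for d, tokens in tokenized.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, docs):
    """Rank documents by cosine similarity to the query; drop zero scores."""
    vectors, idf = tfidf_vectors(docs)
    q_vec = {t: tf * idf.get(t, 0.0)
             for t, tf in Counter(query.lower().split()).items()}
    scored = [(d, cosine(q_vec, vec)) for d, vec in vectors.items()]
    scored = [(d, s) for d, s in scored if s > 0.0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up information items (short text documents).
docs = {
    "d1": "information retrieval and web search",
    "d2": "deep learning for image recognition",
    "d3": "web mining and knowledge discovery",
}
ranking = retrieve("web search", docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Here both the query and the documents are represented in the same term space, so "access" reduces to comparing vectors; documents sharing no term with the query are not returned at all.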

(17)

The Niche of Information Overload

(18)

Web Mining

[Figure labels: Web (logs, texts, images, …); Search Engine; Knowledge Discovery; Millions of Users]

(19)

Taxonomy of Web Mining [R. Cooley]

Web Mining

Web Content Mining

Web Structure Mining

Web Usage Mining

Social Network Mining

• Web Content Mining (web page/text, search-result page, multimedia, tags, …)

• Web Usage Mining (query log analysis, user gap, community, …)

• Web Structure Mining (hyperlink, anchor text, web site, …)

• Social Network Mining (blog, Wikipedia, email, instant messaging, …)

(20)

Document Query Logs

AltaVista: top 20 query terms and their share of all queries

Query                  Count    Share
sex                  1551477    0.27%
applet               1169031    0.20%
porno                 712790    0.12%
mp3                   613902    0.11%
chat                  406014    0.07%
warez                 398953    0.07%
yahoo                 377025    0.07%
playboy               356556    0.06%
xxx                   324923    0.06%
hotmail               321267    0.06%
[non-ASCII query]     263760    0.05%
pamela anderson       256559    0.04%
p****                 234037    0.04%
sexo                  226705    0.04%
porn                  212161    0.04%
nude                  190641    0.03%
lolita                179629    0.03%
games                 166781    0.03%
spice girls           162272    0.03%
beastiality           152143    0.03%
Total queries: 575,244,993

Second query log (engine not named on the slide; [missing] marks query text that did not survive)

Query                  Count    Share
MP3                    42561    1.95%
[missing]              24970    1.14%
[missing]              24363    1.12%
sex                    20182    0.92%
[missing]              15071    0.69%
icq                    13899    0.64%
[missing]              13622    0.62%
[missing]              12210    0.56%
[missing]              12092    0.55%
[missing]              11680    0.53%
[missing]              11640    0.53%
[missing]              10000    0.46%
[missing]               9817    0.45%
[missing]               9530    0.44%
[missing]               9328    0.43%
bbs                     8613    0.39%
kimo                    8166    0.37%
104                     7943    0.36%
[missing]               7456    0.34%
[missing]               7217    0.33%
Total queries: 2,183,506
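A table of this kind (query, count, share of all queries) can be produced from a raw query log with a simple frequency count. The sketch below is a hedged illustration: the function name `top_queries`, the normalization by lowercasing and trimming, and the toy log are assumptions for this example and say nothing about how the logs on the slide were actually processed.

```python
from collections import Counter


def top_queries(queries, k=20):
    """Return the k most frequent queries as (query, count, share in %).
    Empty queries are skipped; case and surrounding spaces are ignored."""
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    total = sum(counts.values())
    return [(q, c, 100.0 * c / total) for q, c in counts.most_common(k)]


# Made-up query log: one entry per submitted query.
log = ["mp3", "games", "MP3", "chat", "mp3 ", "games", "yahoo", ""]
rows = top_queries(log, k=3)
for query, count, share in rows:
    print(f"{query:<12}{count:>8}{share:>9.2f}%")
```

The share column is simply the count divided by the total number of non-empty queries, which is how a count of 1551477 out of 575,244,993 comes out at about 0.27%.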

(21)

Image Query Log

(22)

Common Interests in Web Pages

(23)

Common Interests in Web Images

[Figure: chart of image interest categories; surviving value labels: 34.4%, 23.4%, 8.0%]

(24)

Course Summary

(25)

Web Mining & Information Retrieval

[Figure labels: Query Space, Doc Space, User Space, Author Space; Information Needs, Document Authority; Search Engine; Searching, Mining]

(26)

Two Topics of this Course

[Figure labels: Web; Access (select information); Mining (create knowledge)]

(27)

Related Areas

• Information Retrieval

• Databases

• Library & Info Science

• Machine Learning, Pattern Recognition, Data Mining

• Natural Language Processing

• Applications: Web, Bioinformatics…

• Statistics, Optimization

• Software engineering, Computer systems

[Figure labels: Models, Algorithms, Applications, Systems]

(28)

What to Read?

• Info Retrieval: ACM SIGIR, CIKM, TREC, ICTIR, ECIR, NTCIR, AIRS

• Databases: ACM SIGMOD, VLDB, PODS, ICDE

• Info. Science: ASIS, JCDL

• Learning/Mining: ICML, NIPS, UAI, AAAI, KDD, ICDM, SDM

• NLP: ACL, COLING, EMNLP, ANLP, HLT

• Applications: WWW, WSDM, WI

• Statistics: ??

• Software/systems: ??

(29)

922 U3640

Web Retrieval and Mining

(Spring 2021)

(30)

Goal & Design

• Introduce “Web Search” and “Web Mining”

• Prepare students for doing research/development in related fields

• Targeted at (senior) undergraduate students and graduate students with a computer science background

(31)

Schedule

• Part I: Information Retrieval

– Indexing & Query Optimization

– Retrieval Model & User Interaction

– Evaluation

– Link Analysis

– Machine Learning for IR

– Deep Learning for IR

• Part II: SIG Study

– Recommendation

– Opinion Mining, Sentiment Analysis (tentative)

– Information Extraction & Filtering (tentative)

(32)

Some Relevant NTU CSIE Courses

• Information Retrieval

• Natural Language Processing

• Machine / Deep Learning

• Data Mining

• Social Network

• Statistical Artificial Intelligence

(33)

Format

• Handwritten Assignments (individual work)

• 2 Programming Assignments (individual work)

– Programming + Report

• Midterm Exam

• Final Project

– Team work (3~4 people, depending on the number of students)

– Programming

– Presentation & 2-page Report (including idea, literature review, method & experiment)

(34)

Grading

• Assignments: 50% (hand-written, programming)

• Midterm Exam: 20%

• Term Project: 30%

(35)

Readings

• Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. (Selected Chapters) Available online!

• Information Retrieval: Implementing and Evaluating Search Engines, by Stefan Büttcher, Charles L.A. Clarke, Gordon V. Cormack. (Selected Chapters)

• Modern Information Retrieval, by Ricardo Baeza-Yates, Berthier Ribeiro-Neto. (Selected Chapters)

• Search Engines: Information Retrieval in Practice, by W. Bruce Croft, Donald Metzler, Trevor Strohman. (Selected Chapters)

• Mining the Web: Discovering Knowledge from Hypertext Data, by Soumen Chakrabarti, Morgan Kaufmann. (Selected Chapters)

• Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, by Bing Liu, Springer, 2006. (Selected Chapters) Available online!

• Additional readings will be available online.

(36)

In response to COVID-19

• Friday Morning Class + Asynchronous Online Learning

– Video (Screencast): NTU COOL

– Slides and reading materials: download from http://www.csie.ntu.edu.tw/~pjcheng/course/wm2021

– Hand-written and Programming Assignments: download from http://www.csie.ntu.edu.tw/~pjcheng/course/wm2021 & NTU COOL

• All materials are for teaching only (copyright issue)

• You cannot distribute anything (video, slide, reading, …)

NTU COOL: https://cool.ntu.edu.tw/courses/5426

(37)

Questions?

Related Websites:

http://www.csie.ntu.edu.tw/~pjcheng/course/wm2021

NTU COOL (https://cool.ntu.edu.tw/courses/5426)

Office hours: Tuesday 9:00am-11:00am, R323
