An Overview of
Web Retrieval and Mining
Instructor: Pu-Jen Cheng (鄭卜壬)
Department of Computer Science & Information Engineering National Taiwan University
Instructor
• Postdoc at Academia Sinica
– Dr. Lee-Feng Chien (Former Director of Google TW)
• Visiting Scholar, Univ. of Illinois at Urbana–Champaign (DAIS Lab)
• Google Research Award, Microsoft Research Awards
• PI of Web Mining and Information Retrieval Lab
World-Wide Web
• Initiated at CERN (European Organization for Nuclear Research)
– By Tim Berners-Lee (1989)
– 1945: Vannevar Bush, "As We May Think"
– 1965: Ted Nelson, 'hypertext': non-sequential writing
• Mosaic (1993)
– A hypertext GUI for the X-window system
– CERN HTTPD: server of hypertext documents
1994: the Landmark Year
• Foundation of the “Mosaic Communications Corporation”
• First World-wide Web conference http://www.iw3c2.org/conferences/
• MIT and CERN agreed to set up the World-wide Web Consortium (W3C).
Crawling and Indexing
• Crawlers/Spiders/Web robots/Bots
• Purpose of crawling and indexing
– Quick fetching of a large number of Web pages into a local repository
– Indexing based on keywords
– Ordering responses to maximize user’s chances of the first few responses satisfying her information need.
• One of the earliest Web search engines: Lycos (Jan 1994)
• Followed by…
– Alta Vista (1995), HotBot and Inktomi, Excite
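A minimal sketch of the "indexing based on keywords" step, assuming the crawler has already stored page text in a local repository. The `pages` dictionary, the example.org URLs, and the function names are hypothetical, and the AND-only matching is a simplification that does no ranking.

```python
from collections import defaultdict
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each keyword to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index

def search(index, query):
    """Return the sorted URLs that contain every keyword in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))

# Hypothetical local repository standing in for pages fetched by a crawler.
pages = {
    "http://example.org/a": "Web search engines crawl and index pages",
    "http://example.org/b": "Topic directories organize Web sites",
    "http://example.org/c": "Crawlers fetch pages for search engines",
}
index = build_index(pages)
print(search(index, "search engines"))
```

A real engine would also order the matching pages so that the first few responses are the most likely to satisfy the user; this sketch only returns the matching set.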
Topic Directories
• Yahoo! directory
– To locate useful Web sites
– Jerry Yang & David Filo (Ph.D. students at Stanford University), 1994
• Efforts for organizing knowledge into ontologies
– Centralized: Yahoo!
– Decentralized: About.COM and the Open Directory
Hyperlink Analysis
• Take advantage of the structure of the Web graph
– Indicators of prestige of a page (e.g. citations)
– HITS (Kleinberg 1998) & PageRank (Page 1998)
• Bibliometry
– Bibliographic citation graph of academic papers
• Topic distillation
– The process of finding quality documents on a query topic
– Adapting to idioms of Web authorship and linking styles
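To make the idea of link-based prestige concrete, here is a simplified power-iteration sketch in the spirit of PageRank. The four-page graph is hypothetical, the damping value 0.85 is a commonly used default rather than something stated on the slide, and spreading the score of pages without out-links evenly over all pages is one possible convention among several.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank-style scores by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the same "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly among the pages it cites.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its score over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page Web graph: A, B and D all cite C; C cites A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(graph)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

Page C is cited by the most pages and ends up with the highest score, while D, which nobody cites, keeps only the random-jump share.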
Paid Placement Ranking
• Goto.com (→ Overture.com → Yahoo!)
– Search ranking depended on how much you paid
– Auction for keywords: casino was expensive!
• Result: Google added paid-placement “ads” to the side, independent of search results
– Yahoo followed suit, acquiring Overture (for paid placement) and Inktomi (for search)
Web 2.0
• Web 2.0
– Human-centered, social value, participation & co-creation
• Explosive growth of multimedia data
• Huge incentives in search business
• Applications
What’s going on?
• More searches from mobile devices
• More data from users
• More integrated services for users
• From passive to active services
• More semantics from learning
From World-Wide Web to Now
• 1989 Initiated at CERN
• 1990 Machine Learning (Data-driven)
• 1993 Mosaic (First GUI-based Browser), Data Mining
• 1994 Mosaic Communications Corporation (Commercialized), W3C, WWW (Standard), Lycos Search Engine, Yahoo! Directory
• 1997 Deep Blue (Computer > Human: Chess (10^123))
• 1998 Web Structure Analysis (PageRank), Pay-for-placement (Goto.com)
• 2001 Google (Getting noticed)
• 2004 Web 2.0 (Human-centered), Web as Platform (Services), Facebook
• 2008 Mobile Apps (Apple App Store, Android Market), Fintech (Bitcoin)
• 2011 Deep Learning (Big Data + GPU)
• 2016 AlphaGo (Computer > Human: Go (10^360))
• 2021 ?
The Problem of Information Overload
[Diagram labels: Web (logs, texts, images, …); Search Engine; Information Seeking; Billions of Users; Web Retrieval/Search]
Without Search Engines the Web Wouldn’t Scale
• No incentive in creating content unless it can be easily found
• Web is both a technology artifact and a social environment
– The Web has become the “new normal” in the way of life
– Those who don’t go online constitute an ever-shrinking minority
• Search engines make aggregation of interest possible
– Create incentives for very specialized niche players
  • Economic – specialized stores, providers, etc.
  • Social – narrow interests, specialized communities, etc.
• The acceptance of search interaction makes “unlimited selection” stores possible
• Search turned out to be the best mechanism for advertising on the web
Information Retrieval
• Information retrieval (IR) deals with the representation, storage, organization of, and access to information items.
• Converting information need to information items
– Information need: full description, keyword-based query
– Information item: text documents (often unstructured), Web pages (semi-structured), images, audio, videos, …
The Niche of Information Overload
[Diagram labels: Web (logs, texts, images, …); Search Engine; Knowledge Discovery; Millions of Users; Web Mining]
Taxonomy of Web Mining [R. Cooley]
Web Mining
Web Content Mining
Web Structure Mining
Web Usage Mining
Social Network Mining
• Web Content Mining (web page/text, search-result page, multimedia, tags, …)
• Web Usage Mining (query log analysis, user gap, community, …)
• Web Structure Mining (hyperlink, anchor text, web site, …)
• Social Network Mining (blog, Wikipedia, email, instant messaging, …)
Document Query Logs
sex 1551477 0.27%
applet 1169031 0.20%
porno 712790 0.12%
mp3 613902 0.11%
chat 406014 0.07%
warez 398953 0.07%
yahoo 377025 0.07%
playboy 356556 0.06%
xxx 324923 0.06%
hotmail 321267 0.06%
[non-ASCII query] 263760 0.05%
pamela anderson 256559 0.04%
p**** 234037 0.04%
sexo 226705 0.04%
porn 212161 0.04%
nude 190641 0.03%
lolita 179629 0.03%
games 166781 0.03%
spice girls 162272 0.03%
beastiality 152143 0.03%
Total: 575,244,993 queries
MP3 42561 1.95%
24970 1.14%
24363 1.12%
sex 20182 0.92%
15071 0.69%
icq 13899 0.64%
13622 0.62%
12210 0.56%
12092 0.55%
11680 0.53%
11640 0.53%
10000 0.46%
9817 0.45%
9530 0.44%
9328 0.43%
bbs 8613 0.39%
kimo 8166 0.37%
104 7943 0.36%
7456 0.34%
7217 0.33%
Total: 2,183,506 queries
Top 20 query terms on AltaVista and their proportions
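The percentage column in these logs can be reproduced by dividing each query's count by the size of the log. The sketch below does this for a few rows of the first table; it assumes the figure 575,244,993 printed under that table is the total number of queries, and the function name is hypothetical.

```python
def query_shares(counts, total_queries):
    """Return each query's share of the whole log, in percent."""
    return {query: 100.0 * count / total_queries
            for query, count in counts.items()}

# A few rows copied from the first table above; 575,244,993 is the
# figure printed under that table, assumed here to be the log size.
counts = {"applet": 1169031, "mp3": 613902, "chat": 406014, "games": 166781}
shares = query_shares(counts, 575_244_993)
for query, share in shares.items():
    print(f"{query:<8}{share:.2f}%")
```

Even the most frequent queries account for well under 1% of the log, which illustrates how widely user queries are spread.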
Image Query Log
[Figure: Common Interests in Web Pages vs. Common Interests in Web Images; labeled shares: 34.4%, 23.4%, 8.0%]
Course Summary
Web Mining & Information Retrieval
[Diagram labels: Query Space; Doc Space; User Space; Author Space; Information Needs; Document Authority; Search Engine; Searching; Mining]
Two Topics of this Course
[Diagram: Web; Access (select information); Mining (create knowledge)]
Related Areas
[Diagram relating Information Retrieval to: Databases; Library & Info Science; Machine Learning; Pattern Recognition; Data Mining; Natural Language Processing; Statistics; Optimization; Software engineering; Computer systems; Applications (Web, Bioinformatics…)]
[Layer labels: Models; Algorithms; Applications; Systems]
What to Read?
• Info Retrieval: ACM SIGIR, CIKM, TREC, ICTIR, ECIR, NTCIR, AIRS
• Databases: ACM SIGMOD, VLDB, PODS, ICDE
• Info. Science: ASIS, JCDL
• NLP: ACL, COLING, EMNLP, ANLP, HLT
• Learning/Mining: ICML, NIPS, UAI, AAAI, KDD, ICDM, SDM
• Applications: WWW, WSDM, WI
• Statistics: ??
• Software/systems: ??
922 U3640
Web Retrieval and Mining
(Spring 2021)
Goal & Design
• Introduce “Web Search” and “Web Mining”
• Prepare students for doing research/development in related fields
• Targeted at (senior) undergraduate students and graduate students with a computer science background
Schedule
• Part I: Information Retrieval
– Indexing & Query Optimization
– Retrieval Model & User Interaction
– Evaluation
– Link Analysis
– Machine Learning for IR
– Deep Learning for IR
• Part II: SIG Study
– Recommendation
– Opinion Mining, Sentiment Analysis (tentative)
– Information Extraction & Filtering (tentative)
Some Relevant NTU CSIE Courses
• Information Retrieval
• Natural Language Processing
• Machine / Deep Learning
• Data Mining
• Social Network
• Statistical Artificial Intelligence
Format
• Handwritten Assignments (individual work)
• 2 Programming Assignments (individual work)
– Programming + Report
• Midterm Exam
• Final Project
– Team work (3~4 people, depending on the number of students)
– Programming
– Presentation & 2-page Report (including idea, literature review, method & experiment)
Grading
• Assignments: 50% (hand-written, programming)
• Midterm Exam: 20%
• Term Project: 30%
Readings
• Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. (Selected Chapters) Available online!
• Information Retrieval: Implementing and Evaluating Search Engines, by Stefan Büttcher, Charles L. A. Clarke, Gordon V. Cormack. (Selected Chapters)
• Modern Information Retrieval, by Ricardo Baeza-Yates, Berthier Ribeiro-Neto. (Selected Chapters)
• Search Engines: Information Retrieval in Practice, by W. Bruce Croft, Donald Metzler, Trevor Strohman. (Selected Chapters)
• Mining the Web: Discovering Knowledge from Hypertext Data, by Soumen Chakrabarti, Morgan Kaufmann. (Selected Chapters)
• Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, by Bing Liu, Springer, 2006. (Selected Chapters) Available online!
• Additional readings will be available online.
In response to COVID-19
• Friday Morning Class + Asynchronous Online Learning
– Video (Screencast): NTU COOL
– Slides and reading materials: download from http://www.csie.ntu.edu.tw/~pjcheng/course/wm2021
– Hand-written and Programming Assignments: download from http://www.csie.ntu.edu.tw/~pjcheng/course/wm2021 & NTU COOL
• All materials are for teaching only (copyright issue)
• You cannot distribute anything (video, slide, reading, …)
NTU COOL: https://cool.ntu.edu.tw/courses/5426