On the Social Network
Analysis and Mining:
A Brief Introduction
Leon S.L. Wang (王學亮)
Department of Information Management National University of Kaohsiung
Outline
• What is a Social Network?
• Social Network Analysis
• Social Network Extraction &
Construction
• Social Network Applications
• Challenges in Mining Social
Network Data
Introduction (1)
• What is a Social Network?
– A social network is a social structure that
describes social relations (Wikipedia)
– The history of social networks is older
than anyone in this room
• More than 100 years (Cooley 1909, Durkheim 1893)
• Focusing on small groups
– Information technology gives it new life
– From Sociology to Computer Science
Introduction (2)
• Topics about Social Networking
– Social Networking: Analyzing and
Constructing Social Network
(Churchill & Halverson 2005)
• Social Network Extraction and Construction
• Social Network Analysis
• Online Social Networking
Social Network Analysis (1)
• Social Network Analysis
– A simple social network diagram
(Scott 1991)
• Roles
• Relationships
– One way – Two way
– Positive & Negative
– Self-defined relationships
• Visualization
– Why visualization?
» Providing as much information as possible in a social network
» Human can easily and roughly
Social Network Analysis (2)
• Relational Data
Adjacency matrix: companies-by-companies

     1  2  3  4
1    -  3  3  1
2    3  -  2  2
3    3  2  -  1
4    1  2  1  -

Adjacency matrix: directors-by-directors

     A  B  C  D  E
A    -  2  2  1  1
B    2  -  3  2  1

Incidence matrix: companies-by-directors

            Directors
Companies   A  B  C  D  E
    1       1  1  1  1  0
    2       1  1  1  0  1
    3       0  1  1  1  0
    4       0  0  1  0  1
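As a rough illustration of how the two adjacency matrices relate to the companies-by-directors data, the sketch below counts shared memberships for every pair of rows. The function names `co_membership` and `transpose` are hypothetical helpers introduced here, not part of the original slides.

```python
# Companies-by-directors data from the slide:
# rows are companies 1-4, columns are directors A-E.
incidence = [
    [1, 1, 1, 1, 0],  # company 1
    [1, 1, 1, 0, 1],  # company 2
    [0, 1, 1, 1, 0],  # company 3
    [0, 0, 1, 0, 1],  # company 4
]


def co_membership(rows):
    """Count shared columns for every pair of rows (diagonal left as None)."""
    n = len(rows)
    return [
        [
            None if i == j else sum(a * b for a, b in zip(rows[i], rows[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]


# Companies are tied by the number of directors they share.
companies = co_membership(incidence)
# Directors are tied by the number of boards they share.
directors = co_membership(transpose(incidence))

for row in companies:
    print(row)
for row in directors:
    print(row)
```

Under these assumptions, the first two rows of `directors` reproduce rows A and B of the directors-by-directors matrix, and `companies` reproduces the companies-by-companies matrix.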
Social Network Analysis (3)-1
• Measurement
– Size
• Density
• Geodesic distance
• Diameter
• Closeness
– Centrality
• Degree
• Betweenness
• Closeness
Social Network Analysis (3)-2
• Density Measurement-An example
Connected points   4    4    4    3    2    0
Inclusiveness      1.0  1.0  1.0  0.7  0.5  0
Sum of degrees     12   8    6    4    2    0
No. of lines       6    4    3    2    1    0
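The figures in this example can be recomputed with a short sketch. It assumes each example graph has 4 points in total (inferred from inclusiveness being 1.0 when 4 points are connected) and uses the standard undirected density, lines divided by n(n-1)/2. The density row itself does not appear in the extracted slide, so the values computed here are an addition.

```python
N_POINTS = 4  # assumption: total number of points in each example graph


def density(num_lines, num_points):
    """Density of an undirected graph: actual lines / possible lines."""
    possible = num_points * (num_points - 1) / 2
    return num_lines / possible if possible else 0.0


def inclusiveness(connected_points, num_points):
    """Share of points that are connected to at least one other point."""
    return connected_points / num_points


connected = [4, 4, 4, 3, 2, 0]  # "Connected points" row
lines = [6, 4, 3, 2, 1, 0]      # "No. of lines" row

# One tuple per example graph: (inclusiveness, sum of degrees, density).
# Every line contributes to two degrees, so the degree sum is 2 * lines.
table = [
    (inclusiveness(c, N_POINTS), 2 * m, density(m, N_POINTS))
    for c, m in zip(connected, lines)
]

for incl, degree_sum, dens in table:
    print(f"{incl:.2f}  {degree_sum:2d}  {dens:.2f}")
```

Note that 3 connected points out of 4 gives 0.75, which the slide shows truncated as 0.7.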
Social Network Analysis (3)-3
• Centrality
– Degree, Betweenness, Closeness, Eigenvector
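A minimal sketch of two of these measures, together with geodesic distance and diameter from the earlier measurement slide. The five-node graph is a hypothetical example, the graph is assumed to be connected and undirected, and the normalizations (dividing by n-1) are one common convention rather than the only one.

```python
from collections import deque

# Hypothetical undirected graph: A is a hub, E hangs off D.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A", "E"},
    "E": {"D"},
}


def bfs_distances(graph, source):
    """Geodesic (shortest-path) distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def degree_centrality(graph):
    """Degree divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    return {v: len(neighbors) / (n - 1) for v, neighbors in graph.items()}


def closeness_centrality(graph):
    """(n - 1) divided by the sum of geodesic distances to all other nodes."""
    n = len(graph)
    result = {}
    for v in graph:
        total = sum(bfs_distances(graph, v).values())
        result[v] = (n - 1) / total if total else 0.0
    return result


def diameter(graph):
    """Longest geodesic distance between any pair of nodes."""
    return max(max(bfs_distances(graph, v).values()) for v in graph)


degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
print(degree["A"], closeness["A"], diameter(graph))
```

In this example the hub A scores highest on both measures, while the peripheral node E scores lowest on closeness.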
Social Network Analysis (3)-4
• Measurement
– Clustering Coefficient
– Path Length, Trail, Walk
– Reachability (digraph)
– Structural Hole
– Reciprocity
– K-Clique
– Position
Social Network Analysis (3)-5
• Measurement
– Clustering Coefficient
• Local
• Global
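The sketch below shows one way to compute both variants on a hypothetical four-node graph. "Global" is read here in two common senses, the average of the local coefficients and the transitivity ratio (closed triplets over all triplets); which one the slide intends is not stated, so both are shown as assumptions.

```python
from itertools import combinations

# Hypothetical undirected graph: triangle A-B-C with a pendant node D on C.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}


def local_clustering(graph, node):
    """Fraction of pairs of a node's neighbors that are themselves linked."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    return links / (k * (k - 1) / 2)


def average_clustering(graph):
    """Mean of the local coefficients over all nodes."""
    return sum(local_clustering(graph, v) for v in graph) / len(graph)


def transitivity(graph):
    """Closed triplets divided by all triplets (each centered on one node)."""
    closed = 0
    triplets = 0
    for neighbors in graph.values():
        for u, v in combinations(neighbors, 2):
            triplets += 1
            if v in graph[u]:
                closed += 1
    return closed / triplets if triplets else 0.0


print(local_clustering(graph, "C"), average_clustering(graph), transitivity(graph))
```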
Social Network Analysis (3)-6
• Measurement
Social Network Analysis (3)-7
• Measurement
– Reciprocity
• the number of ties that are involved in reciprocal
relations relative to the total number of actual ties
• |(AB, BA)| / |(AB, BA, AC)|
• = 2/3
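The 2/3 figure can be checked with a small sketch that treats each directed tie as a (source, target) pair. The function name `reciprocity` is a hypothetical helper.

```python
def reciprocity(ties):
    """Share of directed ties whose reverse tie is also present."""
    tie_set = set(ties)
    if not tie_set:
        return 0.0
    mutual = sum(1 for (u, v) in tie_set if (v, u) in tie_set)
    return mutual / len(tie_set)


# Ties from the slide: A->B, B->A and A->C.
ties = [("A", "B"), ("B", "A"), ("A", "C")]
value = reciprocity(ties)
print(value)
```

AB and BA reciprocate each other while AC does not, so two of the three ties count.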
Social Network Analysis (3)-8
• Measurement
– Clique
• Complete subgraph
• Maximum clique
– {1,2,5}
• Maximal cliques
– {1,2,5}
– {2,3}, {3,4}, {4,5}, {4,6}
• K-clique
– Clique of size k
Social Network Analysis (5)
• Sociologists only focus on small social
networks
– 50~100 nodes in a social network
• The advent of Internet communications
has greatly increased SNA's popularity
– Computer & Information Technologies
become essential tools for SNA
(Churchill & Halverson 2005)
Can You Analyze and Construct this Social Network
Diagram by hand??
Social Networking
• Social Network = Computer Network
– Next Target: Mobile Phone
• Facebook is collaborating with Cingular
Wireless, Sprint Nextel & Verizon Wireless
• Killer Applications are Needed
– E-commerce (Wallop)
– Job Finder
• On-line Social Networking Websites
– Using people to find content
Social Network Extraction & Construction
• Extracting & Constructing Social Networks from
Contents
– Using content to find people
– Contents
• Web
• Event-logs
• On-line Chat
• Papers & Theses
Social Network Extraction & Construction (2)
• Extracting Social Networks from Web
– Extracting from web contents (Personal Homepage)
– Semantic Analysis (Ontology) & NLP (Natural Language Processing) (Jin et al. 2007)
– Contact information is the focus (Culotta et al. 2005)
• E-mail address
• Phone Number
• Names
– Network Analysis
• Appearance
• Connectivity
• URL Similarity
Social Network Extraction & Construction (3)
• Extracting Social
Networks from E-mail
– One of the most used on-line
communication applications
– E-mail is a semi-structured
document
(Bird et al. 2006)
• Header for sender identification
– From: "Bill Stoddard" <reddrum@attglobal.net>
• Subject
• Receiver
• Date & Time
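A minimal sketch of how sender and receiver headers could be turned into weighted network edges, using Python's standard `email` package. Apart from the Bill Stoddard address taken from the slide, every message, name and address below is a made-up example, and counting one edge per sender-recipient pair is an assumption rather than the method of the cited work.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses, parseaddr

# Hypothetical raw messages; only the first From: line comes from the slide.
RAW_MESSAGES = [
    'From: "Bill Stoddard" <reddrum@attglobal.net>\n'
    'To: "Alice Example" <alice@example.org>, bob@example.org\n'
    "Subject: hello\n"
    "Date: Mon, 02 Jan 2006 10:00:00 +0000\n"
    "\n"
    "body text\n",
    "From: alice@example.org\n"
    "To: reddrum@attglobal.net\n"
    "Subject: Re: hello\n"
    "\n"
    "reply\n",
    'From: "Bill Stoddard" <reddrum@attglobal.net>\n'
    "To: alice@example.org\n"
    "Subject: Re: Re: hello\n"
    "\n"
    "another reply\n",
]


def email_edges(raw_messages):
    """Count directed (sender, receiver) pairs found in To: and Cc: headers."""
    edges = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        # parseaddr splits '"Name" <addr>' into (name, addr).
        _, sender = parseaddr(msg.get("From", ""))
        recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
        for _, receiver in recipients:
            if sender and receiver:
                edges[(sender.lower(), receiver.lower())] += 1
    return edges


edges = email_edges(RAW_MESSAGES)
for (sender, receiver), count in sorted(edges.items()):
    print(sender, "->", receiver, count)
```

Real mail archives usually also need alias resolution, since one person may appear under several names and addresses; that step is left out here.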
Social Network Extraction & Construction (4)
• Extracting Social Networks from Chat
– Internet Relay Chat (Chat Room)
(Muttons 2006)
– Instant Messenger
• MSN Messenger, ICQ, Yahoo! Messenger, ...
• MSN Messenger provides XML-based, structured communication logs
– Date & Time
– Sender
– Receiver
– Messages
• Network Analysis
– Communication Frequency & Closeness
– Contact Sharing (who may also be your friends)
– Automatic Grouping and Blocking
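To make communication frequency and contact sharing concrete, here is a small sketch over chat records with the four fields listed above. The tuple layout and the sample log are hypothetical and do not reflect the actual MSN Messenger log format. Suggesting friends-of-friends is one simple reading of contact sharing.

```python
from collections import Counter, defaultdict

# Hypothetical records: (date & time, sender, receiver, message).
LOG = [
    ("2006-05-01 09:00", "ann", "ben", "hi"),
    ("2006-05-01 09:01", "ben", "ann", "hello"),
    ("2006-05-01 09:05", "ann", "ben", "lunch?"),
    ("2006-05-02 14:00", "ben", "cat", "meeting at 3"),
]


def pair_frequency(log):
    """Number of messages exchanged by each unordered pair of users."""
    freq = Counter()
    for _time, sender, receiver, _text in log:
        freq[frozenset((sender, receiver))] += 1
    return freq


def suggest_contacts(log, user):
    """Contacts of the user's contacts who are not yet the user's contacts."""
    contacts = defaultdict(set)
    for _time, sender, receiver, _text in log:
        contacts[sender].add(receiver)
        contacts[receiver].add(sender)
    suggestions = set()
    for friend in contacts[user]:
        suggestions |= contacts[friend]
    return suggestions - contacts[user] - {user}


freq = pair_frequency(LOG)
print(freq[frozenset(("ann", "ben"))])
print(suggest_contacts(LOG, "ann"))
```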
Social Networking Applications (1)
• Marketing & E-commerce
– Target Marketing
– Collaborative Recommendation
• Terrorist & Crime Detection
– Ipswich's Jack the Ripper, England 2006
• Medical Network
– Finding Blood
– Organ
• Knowledge Management
Social Networking Applications (2)
• Learning
• Organizational Social Network Analysis
– Optimice
• Politics & Elections
• Academic Social Networking
– Family Tree
• Game AI
– On-line Game
– Game with Social Network (Game 2.0)
• Second Life
• And Much More………
Some Link Mining Tasks (1)
• Object-Related Tasks
– Link-based Object Ranking
• PageRank, HITS, Centrality, Tagommender
– Link-based Object Classification
• News items classification, Folksonomy
– Object Clustering (Group Detection)
• Finding positions, a set of people with similar links
– Object identification
• Same name, different people
• Link-Related Tasks
Some Link Mining Tasks (2)
• Graph-Related Tasks
– Subgraph Discovery
• K-Cliques, K-Clans, K-Plexes
– Graph Classification
– Generative Models for Graphs
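As an illustration of link-based object ranking from the object-related tasks, the following is a simplified power-iteration sketch of PageRank on a hypothetical three-page link graph. The damping factor of 0.85 and the even redistribution of rank from pages without out-links are common conventions assumed here, not details given in the slides.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank by power iteration over a dict of out-links."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node receives the same base share from random jumps.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Node without out-links: spread its rank over all nodes.
                share = damping * rank[v] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical link graph: a and b both link to c, and c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"]}
rank = pagerank(links)
for node, score in sorted(rank.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```

Page c ranks highest because it collects links from both other pages, and b ranks lowest because nothing links to it.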
Challenges in Mining Social
Network Data
Adapted and modified from the talk by Jon M. Kleinberg
Challenge 1: Splitting Network
Challenge 2: A Matter of Scale
• 436-node network of e-mail exchange over 3 months at a corporate research lab (Adamic-Adar 2003)
• 43,553-node network of e-mail exchange over 2 years at a large university (Kossinets-Watts 2006)
• 4.4-million-node network of declared friendships on blogging community LiveJournal (Liben-Nowell et al. 2005, Backstrom et al. 2006)
• 240-million-node network of all IM communication over one month on Microsoft Instant Messenger (Leskovec-Horvitz '07)
Challenge 2: A Matter of Scale
• Currently, massive network datasets give
you both more and less:
– More: can observe global phenomena that are
genuine, but literally invisible at smaller
scales.
– Less: Don't really know what any one node or
link means. Easy to measure things; hard to
pose nuanced questions.
– Goal: Find the point where the lines of
research converge.
Challenge 3: Geographic Data
• Liben-Nowell, Kumar, Novak, Raghavan, Tomkins (2005) studied LiveJournal, an on-line blogging community with friendship links
• Large-scale social network with geographical embedding:
– 500,000 members with U.S. Zip codes, 4 million links.
Challenge 4: Diffusion in Social Networks
• Diffusion, another fundamental social process:
Behaviors that cascade from node to node like an
epidemic.
– News, opinions, rumors, fads, urban legends, ...
– Viral marketing [Domingos-Richardson 2001]
– Public health (e.g. obesity [Christakis-Fowler 2007])
– Cascading failures in financial markets
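One simple way to picture such a cascade is a threshold model, sketched below: a node adopts a behavior once a given fraction of its neighbors has adopted it. The model choice, the 0.5 threshold and the six-node graph are all assumptions made for illustration. The slides do not commit to a particular diffusion model.

```python
def threshold_cascade(graph, seeds, threshold=0.5):
    """Spread adoption until no further node meets its neighbor threshold."""
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for node, neighbors in graph.items():
            if node in active or not neighbors:
                continue
            share = sum(1 for nb in neighbors if nb in active) / len(neighbors)
            if share >= threshold:
                active.add(node)
                changed = True
    return active


# Hypothetical graph: two triangles (a, b, c) and (d, e, f) joined by edge c-d.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}

one_seed = threshold_cascade(graph, {"a"})
two_seeds = threshold_cascade(graph, {"a", "e"})
print(sorted(one_seed))
print(sorted(two_seeds))
```

With a single seed the cascade fills the first triangle and stalls at the bridge, because only one of d's three neighbors has adopted. Adding a seed inside the second triangle lets it cover the whole graph.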
Challenge 5: Protecting Privacy in Social Network Data
• Many large datasets based on communication (e-mail, IM, voice) where users have strong privacy expectations.
– Current safeguards based on anonymization: replace node names with random IDs.
• With more detailed data, anonymization has run into
trouble:
– Identifying on-line pseudonyms by textual analysis
– De-anonymizing Netflix ratings via time series
– Search engine query logs: identifying users from their queries.
• Does this make things safer?
– E.g. no text, time-stamps, or node attributes
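The anonymization safeguard described above, replacing node names with random IDs, can be sketched as follows. The edge list and function names are hypothetical. The point of the example is that the graph structure, here the degree sequence, survives the renaming unchanged.

```python
import random
from collections import Counter


def anonymize(edges, seed=None):
    """Replace every node name with a randomly assigned integer ID."""
    rng = random.Random(seed)
    names = sorted({v for edge in edges for v in edge})
    ids = list(range(len(names)))
    rng.shuffle(ids)
    mapping = dict(zip(names, ids))
    anon_edges = [(mapping[u], mapping[v]) for u, v in edges]
    return anon_edges, mapping


def degree_sequence(edges):
    """Sorted list of node degrees for an undirected edge list."""
    degrees = Counter()
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    return sorted(degrees.values())


# Hypothetical communication graph with real names.
EDGES = [("ann", "ben"), ("ann", "cat"), ("ann", "dan"), ("ben", "cat")]

anon_edges, mapping = anonymize(EDGES, seed=42)
print(anon_edges)
print(degree_sequence(EDGES) == degree_sequence(anon_edges))
```

Because only the labels change, anyone who knows something distinctive about a node's connections may still be able to locate it in the released graph.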
Challenge 6: Attacking an Anonymized Network
• What we learn from this:
• Attacker may have extra power if they are part of the system. In a large e-mail/IM network, you can easily add yourself to the system.
• But “finding yourself” when there are 100 million nodes is going to be more subtle than when there are 34 nodes.
• Template for an active attack on an anonymized network
– Attacker can create (before the data is released):
• nodes (e.g. by registering an e-mail account)
• edges incident to these nodes (by sending mail)
– Privacy breach: learning whether there is an edge between two existing nodes in the network.