Big Data and Internet Thinking

Academic year: 2022

Chentao Wu

Associate Professor

Dept. of Computer Science and Engineering

[email protected]

(2)

Download lectures

• ftp://public.sjtu.edu.cn

• User: wuct

• Password: wuct123456

• http://www.cs.sjtu.edu.cn/~wuct/bdit/

(3)

Schedule

• lec1: Introduction on big data, cloud computing & IoT

• lec2: Parallel processing framework (e.g., MapReduce)

• lec3: Advanced parallel processing techniques (e.g., YARN, Spark)

• lec4: Cloud & Fog/Edge Computing

• lec5: Data reliability & data consistency

• lec6: Distributed file system & object-based storage

• lec7: Metadata management & NoSQL Database

• lec8: Big Data Analytics

(4)

Collaborators

(5)

Contents

1 Big Data Analytics

(6)

Big Data Challenges

(7)

It’s not just about the data…

Big Data + Big Data Analytics

• Big Data: Refers to the DATA only
• Big Data Analytics: Methods of using Big Data to generate insight

1. Machine Learning/Deep Learning: Leveraging a computer’s ability to learn without being explicitly programmed to solve business problems
2. IoT (Internet of Things) & Sensor Analytics: Understanding value drivers from the ever-growing network of connected physical objects and the communication between them
3. Modeling Willingness-to-Pay: Mining product reviews to estimate willingness-to-pay for product features
4. Natural Language Processing: Understanding human speech as it is spoken through application of computer science, AI, and computational linguistics
5. Analyzing Data @ Scale: Using distributed computing and machine learning tools to analyze hundreds of gigabytes of data
6. Creating a Lake Streaming Consumer Behavior Data: Mining social data in real time to understand when and where consumers are making choices

• It is important to understand the distinction between Big Data sets (large, unstructured, fast, and uncertain data) and ‘Big Data Analytics’.

(8)

It’s also about what, how, and why you use it

• Big Data Analytics – the process of harnessing Big Data to yield actionable insights – is a combination of five key elements:

• Decisions: The value of Big Data Analytics is driven by the unique decisions facing leaders, companies, and countries today. In turn, the type, frequency, speed, and complexity of decisions drive how Big Data Analytics is deployed.

• Analytics: To leverage the variety and volume of Big Data while managing its volatility, advanced analytical approaches are necessary, such as natural language processing, network analysis, simulative modeling, artificial intelligence, etc.

• Data: Big Data Analytics is about operationalizing new and more data, but it is also about data quality, data interoperability, data disaggregation, and the ability to modularize data structures to quickly absorb new data and new types of data.

• Technology: To store, manage, and use Big Data often requires investments in new technologies and data processing methods, such as distributed processing (e.g., Hadoop), NoSQL storage, and Cloud computing.

• Mindset & Skills: Big Data Analytics requires firm commitment to using analytics in decision-making; a decisive mentality capable of employing in-the-moment intelligence; and investment in analytical technology, resources, and skills.

(9)

Big Data Analytical Capabilities

• Continuing increases in processing capacity have opened the door to a range of advanced algorithms and modeling techniques that can produce valuable insights from Big Data.

[Figure: analytical capabilities arranged from Traditional to Emerging and from Structured to Unstructured]

• A/B/N Testing: Experiment to find the most effective variation of a website, product, etc.
• Sentiment Analysis: Extract consumer reactions based on social media behavior
• Complex Event Processing: Combine data sources to recognize events
• Predictive Modeling: Use data to forecast or infer behavior
• Regression: Discover relationships between variables
• Time Series Analysis: Discover relationships over time
• Classification: Organize data points into known categories
• Simulation Modeling: Experiment with a system virtually
• Spatial Analysis: Extract geographic or topological information
• Cluster Analysis: Discover meaningful groupings of data points
• Signal Analysis: Distinguish between noise and meaningful information
• Visualization: Use visual representations of data to find and communicate info
• Network Analysis: Discover meaningful nodes and relationships on networks
• Optimization: Improve a process or function based on criteria
• Deep QA: Find answers to human questions using artificial intelligence
• Natural Language Processing: Extract meaning from human speech or writing

(10)

Forward-Looking vs. Rear-View Analytics

• Big Data Analytics improves the speed and efficiency with which we understand the past, and opens up entirely new avenues for preparing for and adapting to the future.

[Figure: from Rear-view to Forward-looking analytics; axes: Increasing Sophistication of Data & Analytics, Increasing Business Value]

• Descriptive Analytics: What happened? Describe, summarize and analyze historical data
• Diagnostic Analytics: Why did it happen? Identify causes of trends and outcomes
• Predictive Analytics: What could happen? Predict future outcomes based on the past
• Prescriptive Analytics: What should be done? Recommend ‘right’ or optimal actions or decisions
• Continuous Analytics: How do we adapt to change? Monitor, decide, and act autonomously or semi-autonomously

Data sources and techniques:

• Observed behavior or events
• Non-traditional data sources such as social listening and web crawling
• Statistical and regression analysis
• Dynamic visualization
• Forward-looking view of current and future value
• Sentiment Scoring
• Graph analysis and Natural Language Processing to identify hidden relationships and themes
• Dual objective models
• Behavioral economics
• Real-time product and service propositions (graph analysis, entity resolution on data lakes to infer present customer need)
• Rapid evaluation of multiple ‘what-if’ scenarios
• Optimization decisions and actions
• Monitor results on a continuous basis
• Dynamically adjust strategies based on changing environment and improved predictions
• Agent-based and dynamic simulation models, time-series analysis

(11)

Examples of Big Data Analytics in Action

• Market Leaders are leveraging Big Data Analytics to generate value by starting with a business need and focusing on implementing actionable insights quickly and decisively.

Business Need → Big Data Analytics → Impact

Example 1
• Business Need: Greater tailoring of credit card offers to fit customer needs
• Data and Analytics: Statistical model based on public credit and demographic data to target customized products to customers
• Impact: Net revenue grew at a CAGR of 32% from 1994 to 2003; prompted competitors to shift focus to data and analytics

Example 2
• Business Need: Data-enabled engine prognostics, monitoring, maintenance and repair
• Data and Analytics: Analysis of sensor data from hundreds of sensors in 4,000 engines to identify and solve issues weeks in advance
• Impact: Over 70% annual revenue from the aircraft engine division attributable to this service

Example 3
• Business Need: Search-to-purchase conversion by anticipating intent of a shopper’s search and delivering relevant results
• Data and Analytics: Semantic search, which enables discovery using algorithms that rank results via social signals from around the web
• Impact: Increases by 10-15% the likelihood that a customer will complete their purchase, translating to millions of dollars in revenue

Example 4
• Business Need: Transformation from subscription streaming service to original content producer
• Data and Analytics: Analysis of data from 66 million subscribers’ viewing habits and preferences
• Impact: Revenue and subscriber base increased by 15% and 9% respectively in 2013

Example 5
• Business Need: Leverage Internet of Things (IoT) by connecting machines to facilitate data-enabled prognostics, increase efficiency and reduce downtime
• Data and Analytics: Launched software to help airlines and railroads move their data to the cloud and predict mechanical malfunctions, improve safety, and reduce trip cancellations and cost
• Impact: Estimated 1% reduction in fuel costs, projected to save the airline industry $30 billion over 15 years

(12)

Big Data Analytics in Development

• Big Data Analytics is making an equally impressive impact on Development interventions – allowing decision-makers to reach and serve previously neglected populations.

Business Need → Big Data Analytics → Impact

Example 1
• Business Need: More transparent, reliable, and low-cost method to track inflation in Argentina
• Data and Analytics: Web scraping of online price data used to produce price indices, and econometric analysis used to model disaggregated impacts of policies
• Impact: Government statistical offices shifting to accept Big Data. Central banks using Big Data to see day-to-day volatility.

Example 2
• Business Need: Understand how migrants act as arbitrageurs to bring labor markets into equilibrium
• Data and Analytics: Iterative analysis of call detail records (CDRs) to track movement of migrants in response to local shocks to labor demand (weather, economy, conflict, etc.)
• Impact: Informing labor policy design in low-income countries to incentivize or disincentivize migratory behavior

Example 3
• Business Need: The city of Rio de Janeiro wanted to improve its emergency response by better predicting heavy rainfall and subsequent severe landslides and flooding
• Data and Analytics: The city combines data from 30 city agencies – including weather, satellite, video, GPS, historic rainfall, and topographic survey data – in a central Operations Center
• Impact: Rio has improved emergency response time by 30%, catalogued 200+ flood points, and can now predict heavy rains 48 hours in advance on a half-km basis

Example 4
• Business Need: Create a better ecosystem for mobile services in the agricultural sectors of Kenya, Tanzania, and Mozambique
• Data and Analytics: Remote crowdsourced data gathered via cell phones used to connect farmers to markets, assess farmers’ creditworthiness, and incubate new mobile businesses with greater predictors of success
• Impact: M-PESA is being used to lower costs for farmers to receive loans and perform transactions with distributors and buyers, as well as to provide geography-specific market information

(13)

Big Data Landscape

(14)

Creating a Big Data Organization

Step 1: Be Yourself

• Beginning with a clear understanding of the specific questions you intend to use Big Data Analytics to address can help guide where and which data solutions are deployed.

[Figure: Strategic, Tactical, and Operational levels, spanning value enablement to value enhancement]

Strategic: Delivering future value
• Data-driven decision-making in real time
• Use analytics to develop new programs/opportunities
• Relies heavily on data supplied by others
• Often struggles to move away from exclusively intuitive decision-making

Tactical: Enabling strategy and improving performance
• Use analytics to reduce political divergence and drive consensus
• Real-time analytics to enable quick responses to events
• Use data to develop personalized services
• Need for more objective and higher quality data

Operational: Day to day operations
• Struggle to move from narrow focus on reactive operations to more proactive, comprehensive management of daily operations
• High value for digitization of operational processes across program units
• Often already proficient in traditional business intelligence

(15)

Creating a Big Data Organization

Step 2: Secure People & Skills

• The competencies required of “data scientists” within an analytics organization or project converge from multiple skill domains.

• Subject Area or Domain Expertise: Deep understanding of industry, subject area, or research domain to help determine which questions need answering and on what frequency, specificity, or geography

• Computer Science & Programming: Comfort in programming across various languages, a thorough understanding of external and internal data sources, data gathering, storing, and retrieving methods which help combine disparate data sources to generate unique insights

• Statistical & Mathematical: Expertise in statistical techniques, tools and languages used to run analyses that generate insights to effectively determine and communicate actionable insights

• Organization-specific Information Knowledge: Organization-specific knowledge about data assets – including enterprise “metadata” – their location and appropriate business context for use in advanced analytics

(16)

Creating a Big Data Organization

Step 3: Let objectives dictate structure, not vice versa

• How analytics efforts or organizations are structured – whether reporting is vertically or horizontally aligned, how interconnected or autonomous separate units are, how resources and successes are shared – can influence efficiency and impact.

[Figure: Distributed Analytics, Federated Analytics, and Centralized Analytics, showing which of the ETL, Data Warehouse, BI Applications, Metadata Repository, and Data Mart components sit in LOCAL units and which in the CENTRAL Analytics Competency Center]

Objectives
• Adopt previously proven practices
• Highly focused analytics support
• Subject area-specific innovations
• Repeatable models
• Governance
• Aligning analytics to organization-wide strategy

Data Warehouses, Marts, etc.
• Distributed: Deployed locally
• Federated: Deployed locally; some data and models shared across groups
• Centralized: Deployed and managed centrally

Analytics Tools
• Distributed: Managed locally
• Federated: Managed locally, but connected to group framework
• Centralized: Controlled centrally, with units having access to shared resources

Analytics Staff/Competencies
• Distributed: Placed within individual units
• Federated: Placed within individual units; skills tailored to specific region or subject matter
• Centralized: Placed within central analytics team, available as needed to support individual units

(17)

The ‘Hub-Spoke’ operating model often serves as a well-synchronized, connected system

[Figure: Sample Hub-Spoke Interaction Model]

1. Central Decision Hub
2. Competency Center (‘Standards’, ‘Standardization’)
3. Centers of Excellence (Regional)
4. Local ‘Spoke’

Other labels in the figure: Global Business Strategy; Local Adoption of Practices; Local Business Operations

(18)

Creating a Big Data Organization

Step 4: Invest in Appropriate Infrastructure

• Big Data introduces challenges related to data volume and variety, processing constraints, and new data structures that traditional data infrastructure is not equipped to support

Analytics Capabilities
Objective: Identify the type of analysis that will be conducted and define which analytics capabilities will be employed
Considerations and impact:
• Analysis Type: Dictates performance needs along with data structures and processing architecture
• Analysis Flexibility: Interface could restrict the ability to perform analysis ad hoc and restrict ability to update
• Analysis Structures: Support for analysis-specific data structures can improve performance and reduce analysis effort

Data Variety
Objective: Define the data set that will be used for the analysis including its sources, size, and structure
Considerations and impact:
• Size: Size of data sets introduces need for scalable infrastructure and performance
• Structure: Variability of source data models and data set structure requires data model flexibility
• Sources: Diverse sources will require scalability, model flexibility, and flexible interfaces

Application
Objective: Define the timeliness and frequency of the analysis results for reporting and downstream systems
Considerations and impact:
• Frequency: Frequency of analysis will dictate the processing architecture (batch or real time)
• Speed: The timeliness of the analysis will impact the need for scalability and performance
• Interfaces: Inbound and outbound interfaces are defined by the use of data and required flexibility

(19)

Contents

2 Architecture Design

(20)

Emerging Infrastructure Options

• To harness Big Data, storage solutions must be able to support targeted analytics capabilities, data diversity and performance needs

• Distributed Processing: Hadoop and similar solutions that provide scalable distributed storage and distributed computation on commodity hardware

• NoSQL: Embedded and persisted storage that implements data models through document, graph, and dictionary structures

• Cloud Computing: Cloud computing can improve flexibility, scalability and cost management and enable a cohesive business strategy across an organization

Traditional challenges being addressed:

• Scalability issues
• Big Data set information extraction and queries require large volumes of processing cycles that can quickly scale
• Data storage solutions need to provide flexible data models to better ingest unstructured and semi-structured data
• Need to combine and link multiple data sources

(21)

Building an Analytics Organization: Critical Components

Emerging Infrastructure – Computing/Storage Options

Distributed Processing: Hadoop and similar solutions that provide scalable distributed storage and distributed computation on commodity hardware

Introduction to Hadoop

• Hadoop is based on work done by Google in the early 2000s (a combination of Google File System (GFS) and MapReduce)
• Useful for analyzing copious amounts of complex data across multiple data sources
• Distributes data as it is initially stored in the system
• Applications are written in high-level code
• Computation happens where data is stored, whenever possible
• Data is replicated multiple times on the system for increased availability and reliability

• Faster and Lower Cost Analysis
• Linear Scalability
• Greater flexibility
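To make the MapReduce idea behind Hadoop concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps for a word count. It is plain Python, not Hadoop code; the function names and the sample input are illustrative assumptions.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical two-line input
counts = word_count(["big data", "big cluster"])
```

In a real cluster the map calls would run on the nodes that already hold each block of data, which is the "computation happens where data is stored" point above.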

(22)

Building an Analytics Organization: Critical Components

Emerging Infrastructure – Storage Options

NoSQL: Embedded and persisted storage that implements data models through document, graph, and dictionary structures

NoSQL - Storage Types: Document Store, Key-Value Store, Columnar Store, Graph Store

[Figure: Solution Examples]

Increasing Data Complexity:
• Pros: Simplicity & Scalability. Cons: Lack of advanced features/queries
• Pros: Scalability & Flexibility. Cons: Complexity
• Pros: Easy to Use. Cons: Scalability
• Pros: Graph Joins. Cons: Flexibility

(23)

Building an Analytics Organization: Critical Components

Emerging Infrastructure – System Options

Cloud Computing: The model is compelling; cloud computing can improve flexibility, scalability and cost management. Businesses best able to realize the potential will establish a cohesive business strategy, as cloud computing can transform your entire organization: people, processes, and systems

Source: PwC, “Digital IQ Snapshot: Cloud”; PwC, “FS Viewpoint: Clouds is the forecast”

Cloud transformation begins at the infrastructure level and leads to more agile applications, resulting in faster speed to market and more flexibility to meet client needs.

The key benefits, beyond consolidation, include standardized application and development environments, resulting in better controlled and more efficient application lifecycles.

(24)

Relational Reference Architecture

(25)

Extended Relational Reference Architecture

(26)

Non-Relational Reference Architecture

(27)

Data Discovery: Non-Relational Architecture

(28)

Business Reporting: Hybrid Architecture

(29)

Contents

3 Big Data Algorithms

(30)

Key components of Mahout in Hadoop (1)

(31)

Key components of Mahout in Hadoop (2)

(32)

Key Components of Spark MLlib

(33)

Spark ML Basic Statistics

◼ Correlation: Calculating the correlation between two series of data is a common operation in Statistics

➢ Pearson’s Correlation

➢ Spearman’s Correlation
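As an illustration of the two measures, the sketch below computes Pearson's correlation directly and Spearman's correlation as Pearson's correlation on ranks, with tied values sharing their average rank. It is plain Python rather than Spark code, and the function names are assumptions made for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length series (nan if a series is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values all receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank series."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because Spearman only looks at ranks, any strictly increasing relationship (for example y = x squared on positive x) scores 1, even when the Pearson value on the raw data is lower.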

(34)

Example of Popular Similarity Measurements

◆Pearson Correlation Similarity

◆Euclidean Distance Similarity

◆Cosine Measure Similarity

◆Spearman Correlation Similarity

◆Tanimoto Coefficient Similarity (Jaccard coefficient)

◆Log-Likelihood Similarity

(35)

Pearson Correlation Similarity

[Figure: example preference data, including missing data]

(36)

On Pearson Similarity

Three problems with the Pearson Similarity:

1. It does not take into account the number of items on which two users’ preferences overlap. (E.g., 2 overlapping items ==> a correlation of 1; more overlapping items do not necessarily give a better score.)

2. If two users overlap on only one item, no correlation can be computed.

3. The correlation is undefined if all preference values in either series are identical.

Adding weighting: passing WEIGHTED as the 2nd parameter of the constructor can cause the resulting correlation to be pushed towards 1.0, or -1.0, depending on how many points are used.
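The three problems can be reproduced with a small sketch. This is plain Python over hypothetical user preference dictionaries, not the recommender library's own implementation; returning None for the undefined cases is a choice made for this example.

```python
import math

def pearson_on_overlap(prefs_a, prefs_b):
    """Pearson correlation over the items both users rated; None if undefined."""
    common = sorted(set(prefs_a) & set(prefs_b))
    if len(common) < 2:
        return None  # problem 2: a single shared item gives no correlation
    x = [prefs_a[i] for i in common]
    y = [prefs_b[i] for i in common]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # problem 3: one user's ratings are all identical
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical users and ratings
alice = {"item1": 5.0, "item2": 3.0, "item3": 4.0}
bob = {"item1": 4.0, "item2": 2.5}
carol = {"item1": 3.0}
dave = {"item1": 4.0, "item2": 4.0, "item3": 4.0}

two_items = pearson_on_overlap(alice, bob)    # problem 1: only 2 shared items
one_item = pearson_on_overlap(alice, carol)   # problem 2
constant = pearson_on_overlap(alice, dave)    # problem 3
```

With only two shared items the two points always lie on a line, so the value is +1 or -1 regardless of how similar the users really are.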

(37)

Spearman Correlation Similarity

Example for ties

Pearson value on the relative ranks

(38)

Basic Spark Data Format

Data: 1.0, 0.0, 3.0 // straightforward

// number of parameters, location of non-zero indices, and non-zero values

// number of parameters, sequence of non-zero values as (index, value) pairs
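The three descriptions above correspond to a dense vector and two ways of writing the same sparse vector. The sketch below mimics those layouts with plain Python lists and tuples; it is not the Spark vector API, and the helper names are assumptions for illustration.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple back to a full list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

def from_pairs(size, pairs):
    """Build the same sparse triple from a sequence of (index, value) pairs."""
    pairs = sorted(pairs)
    return size, [i for i, _ in pairs], [v for _, v in pairs]

sparse = to_sparse([1.0, 0.0, 3.0])
```

The sparse form pays off when most entries are zero, since only the non-zero indices and values are stored.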

(39)

Correlation Example in Spark

1.0, 0.0, 0.0, -2.0

4.0, 5.0, 0.0, 3.0

6.0, 7.0, 0.0, 8.0

9.0, 0.0, 0.0, 1.0

(40)

Euclidean Distance Similarity

Similarity = 1 / ( 1 + d )
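A minimal sketch of this formula, where d is taken to be the Euclidean distance between two users' preference vectors over the same items (the function name is an assumption):

```python
import math

def euclidean_similarity(x, y):
    """Map Euclidean distance d to a similarity in (0, 1] via 1 / (1 + d)."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + d)
```

Identical vectors give d = 0 and a similarity of 1; the similarity shrinks towards 0 as the distance grows.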

(41)

Cosine Similarity

Cosine similarity and Pearson similarity give the same results if the data are normalized (mean == 0).
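The equivalence can be checked numerically. The sketch below is plain Python with assumed function names, and it assumes non-zero, non-constant vectors so that no division by zero occurs.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def center(x):
    """Subtract the mean so the series has mean 0."""
    m = sum(x) / len(x)
    return [a - m for a in x]

def pearson(x, y):
    """Textbook Pearson correlation, written independently of cosine()."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

x, y = [1.0, 2.0, 3.0], [1.0, 3.0, 2.0]
raw_cosine = cosine(x, y)                       # on uncentered data
centered_cosine = cosine(center(x), center(y))  # on mean-centered data
correlation = pearson(x, y)
```

On this sample the raw cosine is much higher than the Pearson value, while the cosine of the centered vectors matches it.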

(42)

Caching User Similarity

Spearman Correlation Similarity is time consuming.

Need to use caching ==> remember user-user similarities that were previously computed.
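The caching idea can be sketched with Python's standard functools.lru_cache. The ratings table, the stand-in similarity function, and the call counter are all hypothetical and only serve to show that the expensive computation runs once per user pair.

```python
from functools import lru_cache

# Hypothetical ratings: user -> {item: rating}
RATINGS = {
    "u1": {"a": 5.0, "b": 3.0, "c": 4.0},
    "u2": {"a": 4.0, "b": 2.0, "c": 5.0},
}
calls = {"count": 0}  # counts how often the expensive function really runs

def expensive_similarity(u, v):
    """Stand-in for a costly measure; here 1 / (1 + Euclidean distance)."""
    calls["count"] += 1
    common = set(RATINGS[u]) & set(RATINGS[v])
    d = sum((RATINGS[u][i] - RATINGS[v][i]) ** 2 for i in common) ** 0.5
    return 1.0 / (1.0 + d)

@lru_cache(maxsize=None)
def _cached(pair):
    u, v = pair
    return expensive_similarity(u, v)

def cached_similarity(u, v):
    """Similarity is symmetric, so (u, v) and (v, u) share one cache entry."""
    return _cached(tuple(sorted((u, v))))

first = cached_similarity("u1", "u2")
second = cached_similarity("u2", "u1")  # served from the cache
```

Sorting the pair before the lookup halves the number of cache entries for a symmetric measure.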

(43)

Tanimoto (Jaccard) Coefficient Similarity

Discard preference values
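A minimal sketch of the coefficient: only the sets of items each user has expressed a preference for are compared, and the rating values themselves are ignored. The function name and the choice to return 0.0 for two empty sets are assumptions.

```python
def tanimoto(items_a, items_b):
    """Size of the intersection divided by size of the union of two item sets."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    return len(a & b) / len(union)
```

For example, users who share 2 of 4 distinct items get a similarity of 0.5.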

(44)

Log-Likelihood Similarity

Assess how unlikely it is that the overlap between the two users is just due to chance.
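One way to sketch this is the log-likelihood ratio (G² statistic) on a 2x2 table of item counts: items both users have, items only one has, and items neither has. The mapping from the ratio to a similarity via 1 - 1/(1 + LLR) is an assumption made for this example, as are the function names.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of counts."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = (
        (k11, rows[0], cols[0]), (k12, rows[0], cols[1]),
        (k21, rows[1], cols[0]), (k22, rows[1], cols[1]),
    )
    g2 = 0.0
    for observed, row, col in cells:
        if observed > 0:  # empty cells contribute nothing
            g2 += observed * math.log(observed * n / (row * col))
    return 2.0 * g2

def log_likelihood_similarity(items_a, items_b, n_items):
    """Assumed mapping of the ratio into [0, 1): larger means less likely by chance."""
    a, b = set(items_a), set(items_b)
    k11 = len(a & b)               # items both users have
    k12 = len(a - b)               # items only user A has
    k21 = len(b - a)               # items only user B has
    k22 = n_items - len(a | b)     # items neither user has
    llr = log_likelihood_ratio(k11, k12, k21, k22)
    return 1.0 - 1.0 / (1.0 + llr)
```

When the overlap is exactly what independence would predict, the ratio is 0 and so is the similarity.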

(45)

Performance Measurements

Using GroupLens data (http://grouplens.org): 10 million rating MovieLens dataset.

• Spearman: 0.8

• Tanimoto: 0.82

• Log-Likelihood: 0.73

• Euclidean: 0.75

• Pearson (weighted): 0.77

• Pearson: 0.89

(46)

Spark ML Basic Statistics

• Hypothesis testing: Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically significant. Spark ML currently supports Pearson’s Chi-squared (χ²) tests for independence.

• ChiSquareTest conducts Pearson’s independence test for every feature against the label.
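To show what the independence test computes, the sketch below derives the χ² statistic and degrees of freedom from a contingency table of observed counts. It is plain Python, not the Spark ChiSquareTest API, and it assumes every row and column total is non-zero.

```python
def chi_square_statistic(table):
    """Pearson's chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical 2x2 table of observed counts
stat, dof = chi_square_statistic([[10, 20], [20, 10]])
```

The statistic is then compared against the χ² distribution with dof degrees of freedom; a large value leads to rejecting the hypothesis of independence.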

(47)

Chi-Square Tests (1)

(48)

Chi-Square Tests (2)

(49)

Chi-Square Tests (3)

We would reject the null hypothesis that there is no relationship between location and type of malaria. Our data tell us there is a relationship between type of malaria and location.

(50)

Chi-Square Tests in Spark

(51)

Spark ML Clustering

(52)

Example: Clustering

Feature

Space

(53)

Clustering

(54)

Clustering – on feature plane

(55)

Clustering example

(56)

Steps on Clustering

(57)

Making Initial Cluster Centers

(58)

K-means Clustering
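A minimal plain-Python sketch of the k-means loop (assign each point to its nearest center, then move each center to the mean of its points). The function name, sample points, and initial centers are illustrative assumptions; this is not Spark code.

```python
import math

def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm with Euclidean distance; returns (centers, clusters)."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its members
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(center)  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.0), (8.0, 9.0)]
centers, clusters = kmeans(points, centers=[(1.0, 1.0), (9.0, 9.0)])
```

The result depends on the initial centers, which is why choosing them well is treated as a separate step.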

(59)

HelloWorld Clustering Scenario Result

(60)

Testing different distance measures

(61)

Manhattan and Cosine distances

(62)

Tanimoto distance and weighted distance

(63)

Results Comparison

(64)

Sample Code of K-Means Clustering in Spark

(65)

Vectorization Example

0: Weight

1: Color

2: Size

(66)

Canopy Clustering (estimate the number of clusters)

Tell what size clusters to look for. The algorithm will find the number of clusters that have approximately that size. The algorithm uses two distance thresholds. This method prevents all points close to an already existing canopy from being the center of a new canopy.
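The two-threshold idea can be sketched as follows, assuming a loose threshold t1 and a tight threshold t2 with t1 > t2 and Euclidean distance. Points within t1 of a center join its canopy; points within t2 can no longer become centers. The names and sample data are illustrative assumptions.

```python
import math

def canopy_centers(points, t1, t2):
    """Return a list of (center, members) canopies; requires t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 (loose) must be larger than t2 (tight)")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        # Every point within the loose threshold belongs to this canopy
        members = [p for p in points if math.dist(p, center) < t1]
        canopies.append((center, members))
        # Points within the tight threshold may not start a new canopy
        remaining = [p for p in remaining if math.dist(p, center) >= t2]
    return canopies

# Hypothetical 2-D points forming two well-separated groups
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (10.0, 10.0), (10.5, 10.0)]
canopies = canopy_centers(points, t1=3.0, t2=1.0)
```

The number of canopies found (two here) can then serve as an estimate of k, and the canopy centers as initial centers, for k-means.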

(67)

Other Clustering Algorithms

Hierarchical clustering

(68)

Different Clustering Algorithms

https://github.com/HewlettPackard/cacti

(69)

Spark ML Classification

(70)

Spark ML Classification

(71)

Classification - definition

(72)

Classification example: using SVM to recognize a Toyota Camry

(73)

Classification example: using SVM to recognize a Toyota Camry

(74)

When to use Big Data System for Classification?

(75)

The advantage of using Big Data System for Classification

(76)

How does a classification system work?

(77)

Key Terminology for Classification

(78)

Input and Output of a classification model

(79)

Four types of values for predictor variables

(80)

Sample data that illustrates all four values

(81)

Supervised vs. Unsupervised Learning

(82)

Work flow in a typical classification project

(83)

Classification Example – Color-Fill

Position looks promising, especially the x-axis ==> predictor variable.

Shape seems to be irrelevant. Target variable is “color-fill” label.

(84)

Classification Example – Color-Fill (another feature)

(85)

Fundamental classification algorithm

Example of fundamental classification algorithms:

• Naive Bayesian

• Complementary Naive Bayesian

• Stochastic Gradient Descent (SGD)

• Random Forest

• Support Vector Machines

(86)

Choose algorithm

(87)

Support Vector Machine (SVM)

maximize boundary distances; remembering “support vectors”

nonlinear kernels

(88)

Example SVM code in Spark

(89)

Contents

4 Tools Support

(90)

Data Mining, Text Mining, and Natural Language Processing

• Data Mining: Extraction of implicit, previously unknown, and potentially useful information from data

• Text Mining: Analysis of large quantities of natural language text and detecting lexical or linguistic usage patterns to extract probably useful information

• Natural Language Processing: NLP is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications.

(91)

NLP Tools

Tool Description Analysis Type

OpenNLP
A machine learning based toolkit for the processing of natural language text.
• Tokenization
• Sentence segmentation
• Part-of-speech tagging
• Named entity extraction
• Chunking, parsing
• Coreference resolution

GATE
A Java suite of tools that can perform natural language processing tasks for multiple languages.
• Information extraction
• Part-of-speech tagging
• Tokenizer
• Sentence splitter

NLTK
A suite of libraries and programs for symbolic and statistical natural language processing for Python.
• Information extraction
• Part-of-speech tagging
• Tokenizer
• Word categorization
• Text classification

Stanford NLP
Statistical NLP toolkits for various computational linguistics problems that can be incorporated into applications with human language technology needs.
• Tokenization
• Part-of-speech tagging
• Named entity recognition
• Parsing
• Classification
• Segmentation
• Coreference resolution

LingPipe
A tool kit for processing text using computational linguistics.
• Sentiment analysis
• Entity recognition
• Clustering
• Topic classification
• Part-of-speech tagging
• Sentence detection
• Disambiguation

MontyLingua
A suite of libraries and programs for symbolic and statistical natural language processing for both Python and Java.
• Information extraction
• Part-of-speech tagging
• Tokenizer
• Word categorization
• Text generation
• Stemming
• Phrase chunking

Rosetta Linguistic Platform
A suite of linguistic analysis components that integrate into applications for mining unstructured data.
• Language identification
• Name, places, and key concept extraction
• Name matching
• Name translation

(92)

Text Mining/Analytics Tools

Tool Description Analysis Type

RapidMiner
An open source environment for machine learning, data mining, text mining, predictive analytics, and business analytics.
• Document classification
• Sentiment analysis
• Topic tracking
• Data mining
• Traditional analytics

SAS Text Miner
A suite of text processing and analysis tools.
• Text parsing
• Filtering
• Feature extraction
• Topic clustering

VisualText
Integrated development environment for building information extraction systems, natural language processing systems, and text analyzers.
• Information extraction
• Summarization
• Categorization
• Data mining
• Document filtering
• Natural language search

SAS Sentiment Analysis
Commercial tool that is dedicated to customer sentiment analysis.
• Customer sentiment monitoring
• Sentiment discovery

Textifier
Tool for sorting large amounts of unstructured text with The Public Comment Analysis Toolkit (PCAT).
• Topic modeling
• Information retrieval
• Document analysis
• Social media analysis

Infinite Insight
System for automatically preparing and transforming unstructured text attributes into a structured representation.
• Term frequency
• Term frequency inverse
• Document frequency
• Root word coding
• Synonym identification
• Customization of stop words
• Stemming rules
• Concepts merging

Clustify
Software for grouping related documents into clusters, providing an overview of the document set and aiding with categorization.
• Document clustering

(93)

Text Mining/Analytics Tools Cont.

Tool Description Analysis Type

Attensity Analyze
Customer analytics applications that help analyze high volumes of customer conversations across multiple channels.
• Unstructured communication analysis
• Sentiment analysis
• Consumer profiling

ReVerb
A program that automatically identifies and extracts binary relationships from English sentences.
• Information extraction
• Topic identification
• Topic linking

Open text summarizer
Open source tool for summarizing texts.
• Document summarization

Open Calais
Web based API that is used to analyze content and extract topics or information.
• Attribute/feature extraction
• Fact identification

Knowledge Search
Family of techniques and tools for searching and organizing large data collections.
• Semantic analysis

KH Coder
Free software for Quantitative Content Analysis or Text Mining.
• Text parsing
• Document search
• Network analysis

(94)

Image Analytics Overview

Overview

• The process of pulling relevant information from an image or sets of images for advanced classification and traditional analysis

• Applies image capture, image processing, and machine learning techniques to extract, quantify, and structure image information

Advantages

• Provides a method to structure, organize, and search information that is stored within images

• Offers an additional data set that can be applied to understanding consumer behavior, automating business processes, and discovering knowledge in enterprise content

(95)

Image Analytics Tools

Columns: Tool, Overview, Image Processing, Computer Vision, Machine Learning (X marks a supported capability)

• OpenCV: Open source library of computer vision functions that is accessible via C, Java, and Python (X X X)
• PAXit Image Analysis: Integrated image analysis platform that provides basic feature identification functions (X X)
• ImageJ: Java based image processing platform that can be accessed via an API and expanded with custom plugins (X)
• PIL: Python image processing library (X)
• PyBrain: A modular machine learning library for Python (X)

(96)

Audio Analytics Overview

Overview

• The process of capturing audio and analyzing its features so as to extract the content and context of an event

• Applies speech analysis and signal processing principles to structure audio information for analysis via NLP or traditional analytics techniques

Advantages

• Provides a method for identifying events or common patterns within sound bites

• Offers a way of capturing not only the content and topics within a conversation, but also the emotions and context

(97)

Audio Analytics Tools

Tool | Overview | Audio Processing / Information Retrieval

Clam | A C++ library that provides varying levels of audio processing and information retrieval capabilities | X X

CallMiner | A tool that is capable of translating calls into a more structured text data set and combining it with other communication forms | X

Nuance | Logs calls and structures audio for text based search and retrieval | X

yaafe | Audio feature extraction toolkit with wrappers for several languages | X

PRAAT | Multiple platform audio analysis toolkit | X
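To make "analyzing its features" concrete, the sketch below computes two common low-level audio features, RMS energy and zero-crossing rate, on synthetic signals. It uses only the standard library rather than any tool from the table. The sample rate, the 440 Hz test tone, and the function names are assumptions chosen for illustration.

```python
import math


def rms_energy(samples):
    """Root-mean-square amplitude: a rough measure of loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose sign differs."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return crossings / (len(samples) - 1)


SAMPLE_RATE = 8000  # assumed sampling rate in Hz
# One second of a 440 Hz sine tone and one second of silence.
tone = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
silence = [0.0] * SAMPLE_RATE

for name, signal in (("tone", tone), ("silence", silence)):
    print(name, round(rms_energy(signal), 3), round(zero_crossing_rate(signal), 3))
```

Features like these can be computed per short frame and stored as rows, which is one way audio becomes structured input for traditional analytics.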

(98)

Social Network → Applications (1)

Analysis Objectives

Collaboration Analysis

Evaluate team structures, information flows among team members, and information exchanges with other teams to improve working structures

• Identify team structures that are not effective

• Identify informal organizational structures

• Identify individuals/roles or groups that are influential to collaborative work environments

Content/Knowledge Management

Evaluate how knowledge or content is diffused and accessed within an organization

• Improve content and knowledge distribution

• Identify content bottlenecks, open communication flows, and establish channels

• Explore impact of new communication methods

Community Mining

Identify groups or informal teams that share knowledge, communicate frequently, solve problems, or work together to perform specific tasks

• Improve structures for key organizational functions

• Improve information flows

• Identify potential bottlenecks for organizational functions

• Identify cultural patterns to build other communities

Organization Development

Explore formal and informal organization structures and how individuals work with one another to improve the design of the organization

• Improve hierarchy and structure of organization to better align with the informal practices

• Identify team members that are effective leaders and would impact the organization if promoted
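One simple way to approach "identify individuals that are influential" is degree centrality: count how many distinct colleagues each person exchanges information with. The sketch below is a standard-library illustration; the edge list, the names, and the function `degree_centrality` are hypothetical, and real studies would typically use one of the network libraries listed later in the deck.

```python
from collections import defaultdict

# Hypothetical "who exchanges information with whom" edges (illustrative names).
EDGES = [
    ("ana", "bo"), ("ana", "chen"), ("ana", "dee"),
    ("bo", "chen"), ("dee", "eli"),
]


def degree_centrality(edges):
    """Return each node's degree divided by (n - 1), the largest possible degree."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


centrality = degree_centrality(EDGES)
most_connected = max(centrality, key=centrality.get)
print(most_connected, centrality[most_connected])
```

Degree centrality is only one notion of influence; betweenness or eigenvector centrality may be a better fit when the question is about brokers or bottlenecks.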

(99)

Social Network → Applications (2)

Analysis Objectives

Disaster recovery planning

Assess organizational structures and communication patterns as they relate to the groups that play a role in disaster recovery plans

• Identify communication improvements to disaster recovery teams

• Identify weak links among functional groups to improve collaboration during recovery plan execution

Data/Information Dissemination

Assess how data points or information sets originate or are distributed across the enterprise to their intended targets

• Identify overlapping information sets and bottlenecks for information dissemination

• Assess how organization structures or information architecture impact the flow of information to its targets

Fraud Detection/Prevention

Assess the organization or external network to identify communication or collaboration patterns that align with known fraudulent activity

• Identify network agents that collaborate with known fraudulent agents

• Identify activities that align with known fraudulent behavior

Process Discovery / Improvement

Analyze the organization structure and communication patterns to uncover process improvements or identify new processes

• Identify process improvements through discovery of hidden process steps, communication flows, and actors

• Discover undocumented or informal processes that are hidden within frequent collaboration and communication paths

Supply Chain Analysis

Evaluate the structure of a supply network and the interactions among the entities that comprise the network to identify gaps, bottlenecks and sourcing strategies

• Identify communication gaps that could impact dependent processes or operations

• Identify strategic relationships to optimize the supply network

• Identify supply nodes that create inefficiencies
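The fraud detection objective "identify network agents that collaborate with known fraudulent agents" can be sketched as a neighbor lookup on the interaction graph. The account names, the edge list, and the function `contacts_of_known_fraud` below are hypothetical and serve only as an illustration; being flagged here means "worth reviewing", not "fraudulent".

```python
from collections import defaultdict

# Hypothetical interaction edges between agents (illustrative data only).
EDGES = [
    ("acct1", "acct2"), ("acct2", "acct3"),
    ("acct3", "acct4"), ("acct5", "acct6"),
]
KNOWN_FRAUD = {"acct2"}


def contacts_of_known_fraud(edges, known_fraud):
    """Return agents directly connected to any known fraudulent agent."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    flagged = set()
    for agent in known_fraud:
        flagged |= neighbors.get(agent, set())
    # Do not re-report agents that are already known.
    return flagged - known_fraud


print(sorted(contacts_of_known_fraud(EDGES, KNOWN_FRAUD)))
```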

(100)

Social Network → Applications (3)

Analysis Objectives

Novelty/Sentiment Diffusion Analysis

Observe how a specific topic, news article, or sentiment diffuses through a consumer network

• Assess how target consumers/market will react to a piece of news or campaign

• Evaluate how long news, data, or sentiment will be retained within a system and how far it will spread

Market Influencer Identification

Monitor and analyze connections within social media networks to identify markets or consumers that are influential within communities

• Identify individuals or groups that influence markets and adoption

• Identify untapped markets

• Identify market segments as targets for ad campaigns to improve product/service adoption

Consumer Segmentation

Analyze the connections and consumer attributes within the target market to discover communities or groups with common characteristics

• Improve product or service offerings based on attributes that connect the consumer market

• Develop strategies to target new or existing consumers based on identified segmentation characteristics

Product or Brand Diffusion Analysis

Analyze the flow of communication or ideas through a market segment to evaluate how a product may diffuse

• Identify segments or individuals that will be likely early adopters

• Identify incentives or campaigns that will improve product/service adoption

Recommendation Systems

Analyze consumer network connections and common features among consumers to develop recommendations

• Identify new feature sets for products and services

• Assess new markets for selling similar or new products

• Target consumers with specific products or services
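For the diffusion analyses above, a first approximation of "how far it will spread" is a breadth-first traversal that records how many hops each consumer is from the original poster. This is a deterministic reachability sketch, not a probabilistic diffusion model; the graph, user names, and the function `reach_by_hop` are assumptions made for illustration.

```python
from collections import deque

# Hypothetical network: each key's posts are seen by the users in its list.
FOLLOWERS = {
    "seed": ["u1", "u2"],
    "u1": ["u3"],
    "u2": ["u3", "u4"],
    "u3": ["u5"],
    "u4": [],
    "u5": [],
}


def reach_by_hop(graph, start, max_hops):
    """Breadth-first spread: map each reached user to its hop distance."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if hops[user] == max_hops:
            continue  # do not spread beyond the hop limit
        for nxt in graph.get(user, []):
            if nxt not in hops:
                hops[nxt] = hops[user] + 1
                queue.append(nxt)
    return hops


print(reach_by_hop(FOLLOWERS, "seed", max_hops=2))
```

A more realistic model would attach a sharing probability to each edge and average over many simulated runs.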

(101)

Social Network → Tools (1)

Tool | Overview | Network Analysis / Network Visual / Network Manipulation

SNAP | A general purpose network analysis and graph mining library for C++ | X X

Statnet | A package for R that provides capabilities for social network statistical analysis | X

libSNA, graphTool, networkX | Python libraries for network analysis and manipulation | X X

JUNG | Java package for network analysis and modeling | X X X

NodeXL | Excel plug-in that provides an easy to use and interactive interface to explore and visualize networks | X X

(102)

Social Network → Tools (2)

Tool | Overview | Network Analysis / Network Visual / Network Manipulation

GEPHI | Interactive open source platform for network analysis and visualization | X X X

Ucinet | Commercial social network analysis tool with separate visualization component | X X

Graphviz | Open source graph visualization package | X

NetMiner | Proprietary package that provides the ability to develop and implement custom algorithms | X X X

kxen SNA | Network analysis package that provides predictive analytics and customer MDM integration | X X X

ProM | Open source package for mining business process networks | X X X

Cytoscape | Open source tool for network modeling and analysis. Can connect to external data sources | X X X

Network Workbench | Large-Scale Network Analysis, Modeling and Visualization Toolkit for Biomedical, Social Science and Physics Research | X X X

(103)

Contents

Deep QA/Mind/Brain Systems

5

(104)

DeepQA/Mind/Brain

What is DeepQA?

• DeepQA forms the core of Watson, the open domain question analysis and answering system

• The DeepQA stack comprises a set of search, NLP, learning, and scoring algorithms

• DeepQA operates on a distributed computing infrastructure that leverages MapReduce and the Unstructured Information Management Architecture (UIMA)

What is the target problem set?

• Understanding the meaning and context of human language

• Searching and retrieving information from a large library of unstructured information

• Identifying accurate and precise answers to questions that are complex and must be sourced from a large knowledge set
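The MapReduce pattern mentioned above can be illustrated in a single process with a word count, the usual introductory example. This is a sketch of the programming model only; it is not Hadoop code and does not reflect how DeepQA itself is implemented. The documents and the function names `map_phase`, `shuffle`, and `reduce_phase` are assumptions.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in doc.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values for each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


DOCS = ["big data analytics", "big data tools", "data"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in DOCS)))
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the framework performs the shuffle over the network.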

(105)

DeepQA Infrastructure Technology

Data Management and Search

Function | Technology

Unstructured Information Architecture | UIMA

SQL Server | MySQL, Apache Derby

Java Natural Language Toolkit | Open NLP, Stanford NLP

Map/Reduce | Apache Hadoop

Commonsense Knowledgebase | OpenCYC, Open Mind Common Sense

Triple Store | Apache Jena, OpenAnzo

Text Search | Lucene, Open FTS

(106)

DeepQA Infrastructure Technology

Platform and Administration

Function | Technology

Web Server | Apache

Virtualization Host | VMWare, Xen

Distributed File System | Apache Hadoop, OpenAFS

File Management/Archival | rsync

OS | Fedora

Cloud Management | Extreme Cloud Administration, Open Nebula

(107)

Business Applications

Overview Objectives

Knowledge Discovery

Search internal and external unstructured/structured information assets to uncover previously unknown knowledge

• Identify information about a subject through deep analysis of internal and external information sources

• Answer questions about a business problem or trend that may be difficult to analyze within traditional data sources

E-Discovery

Search documents and communications to uncover relevant information associated with a specific topic

• Identify business topics and trends within communication and documents

• Search for noncompliance activities within internal and external data sources

Contract Evaluations

Search through single or multiple contracts to answer specific questions about the nature of the contract

• Identify key facts or issues that comprise a contract or sets of contracts

• Identify contracts or legal documents that contain similar entities or features

Relationship Management

Provide the ability to interact with consumers providing precise responses to technical and open domain questions

• Provide a platform for automatically answering consumer questions about products or services

• Reduce reliance on call centers and improve interaction with consumers

Consumer Discovery

Search consumer communications, social media, and sales information to identify opportunities and demographics

• Identify background information about consumers

• Identify consumer qualities that create risks or represent opportunities

Technical Troubleshooting

Find answers to technical and process problems through

• Utilize unstructured data and communications to identify solutions or root causes to system and process problems

(108)

Areas for Further Research

Infrastructure/Tools and Search Technologies/Concepts

Topic Research

Tools

Hadoop Map/Reduce

The tool is used to distribute queries, analysis, and other processing activities across multiple CPUs. Further research is required to understand the tool's architecture and how to integrate it with other toolkits (OpenNLP, UIMA, Lucene, etc.).

OpenNLP A Java library for NLP tasks. Need to evaluate the tool's capabilities and gaps as well as how it can be incorporated into UIMA

OpenCYC An open common sense reasoning platform. Need to better understand the tool's role as well as how it fits with the other technologies

UIMA An architecture for managing unstructured data. Further research is needed to understand how to run it in parallel and how the SDK can be applied to NLP activities

Lucene A text search platform. Further research is needed to understand the library and how to incorporate it into UIMA

Search

Text Search Scoring

Algorithms are used to score search results based on their alignment with the question.

Further research is needed to understand what models and scoring metrics can be applied to search results at various phases of DeepQA.

Triple Store Search

Triple stores maintain data in a subject-predicate-object structure and are used for returning quick facts. Further research is needed to understand the philosophy and technologies behind these data storage mechanisms

Commonsense Reasoning

Research is required to understand this branch of AI, its technologies, and its role within DeepQA.

Document/Information Retrieval

Generate research on information and document retrieval practices. Technologies and algorithms need to be reviewed. Falls within a broader research topic for enterprise search.
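The subject-predicate-object structure of a triple store can be sketched with a list of tuples and a wildcard pattern match. This is a toy illustration only and is not the API of Apache Jena or OpenAnzo. The predicate names and the function `match` are hypothetical; the facts encoded are the ones stated on the DeepQA overview slide.

```python
# Each fact is a (subject, predicate, object) tuple.
TRIPLES = [
    ("Watson", "built_on", "DeepQA"),
    ("DeepQA", "uses", "UIMA"),
    ("DeepQA", "uses", "MapReduce"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# "What does DeepQA use?" becomes a pattern with an unbound object.
print([t[2] for t in match(TRIPLES, subject="DeepQA", predicate="uses")])
```

Production triple stores index the three positions so that such patterns resolve without scanning every fact, which is what makes them suitable for quick fact lookups.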

(109)

Areas for Further Research

Machine Learning and Natural Language Processing

Topic Description

Machine Learning

MetaLearners

Research the concept and how they are used to evaluate learning models and assign a confidence score based on the learning models that are used to rank search results

Question Classification

Identify techniques and models that can be employed to analyze and classify questions

Search Ranking Models

Research which models are available for ranking search results based on the various search and recall techniques that are employed for a question

NLP

Logical Form Analysis

Research how SNA is used to discover logical relationships within text and produce an understanding about the information within the text

Semantic Structure Analysis

Identify tools and algorithms that are employed to uncover semantic relationships within texts/phrases and how these relationships can be applied to extract relevant information for question analysis and search

Relationship Analysis

Research techniques and tools for uncovering temporal, geospatial and spatial relationships within a knowledge set

Feature Extraction

Evaluate tools and algorithms that are used to extract features of entities from text and identify methods for structuring the data for search

Phrase Analysis Identify algorithms and tools that can be applied to extract key phrases from text based on a search context
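As one possible starting point for the "Search Ranking Models" topic, the sketch below ranks candidate passages against a question using a smoothed TF-IDF score. It is a generic baseline and not the scoring used by DeepQA; the passages, the question, and the function names `tokenize`, `idf`, `score`, and `rank` are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical candidate passages (illustrative text only).
PASSAGES = [
    "hadoop distributes processing across many machines",
    "lucene is a text search library",
    "a triple store keeps subject predicate object facts",
]


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def idf(term, docs):
    """Smoothed inverse document frequency; rarer terms weigh more."""
    df = sum(1 for d in docs if term in d)
    return math.log((1 + len(docs)) / (1 + df)) + 1.0


def score(question, passage_tokens, docs):
    """Sum of term frequency times IDF over the distinct question terms."""
    tf = Counter(passage_tokens)
    return sum(tf[t] * idf(t, docs) for t in set(tokenize(question)))


def rank(question, passages):
    """Return passages ordered from best to worst match."""
    docs = [tokenize(p) for p in passages]
    scored = [(score(question, d, docs), p) for d, p in zip(docs, passages)]
    return [p for _, p in sorted(scored, key=lambda pair: pair[0], reverse=True)]


print(rank("which library does text search", PASSAGES)[0])
```

A learned ranking model would replace the hand-written `score` with a function fitted to labeled question and answer pairs, which is where the MetaLearners topic above comes in.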

(110)

Thank you!
