Machine Intelligence for Large-Scale Image/Video Data Streams –

(1)

Winston H. Hsu (徐宏民)

National Taiwan University, Taipei

Machine Intelligence for Large-Scale Image/Video Data Streams

– Advancing Deep Neural Networks for Emerging Applications

Office: R512, CSIE Building

November, 2016

(2)

Dr. Winston Hsu (徐宏民) – Short Bio

§ Professor in NTU CSIE and GINM, since Feb. 1, 2007

– Affiliated with Communication and Multimedia Lab (CMLab)

§ PhD from Columbia University, New York, 2007

§ 4 years in (startup-period) CyberLink Corp. (訊連科技)

– Founding Engineer, Project Leader, and R&D Manager

§ Recognitions & Awards

– 3500+ Google citations; H-index: 27; i10-index: 51

– Director for NVIDIA AI Lab (NTU), AE for IEEE Trans. on Multimedia; AE for IEEE Multimedia Mag., Organizing Committee for ACM Multimedia 2010/2013/2015/2016, IEEE/ACM Senior Member, MSR Visiting Researcher (2014), Visiting Researcher IBM Watson (2016)

– Awards: 2011 Ta-You Wu Memorial Award (Young Researcher), FIRST PRIZE in ACM Multimedia Grand Challenge 2011, FIRST PLACE in MSR-Bing Image Retrieval Challenge 2013, Microsoft Research Award 2009/2012/2014/2015, 2013 National Outstanding IT Elite Award, 2012 NTU EECS Academic Contribution Award (top 3%), etc.


(3)

Globally Competitive for Our Research Team

§ Recent report by Wealth Magazine (財訊雙週刊)

§ Research developments in AI (data learning in large-scale multimodal data streams)

§ How we have worked to keep our group competitive in the global research community

§ Our PhD alumni have received offers from US-based research labs

(4)

Awarded “NVIDIA AI LAB” – The 1st in Asia, the 4th in the World (GTC Taipei, September 21, 2016)

§ Video announcement by NVIDIA CEO/Co-Founder Jen-Hsun Huang

– https://www.youtube.com/watch?v=yjhj7bAj9hs#t=57m16s

§ For the project “DeepTutor” – question answering over large-scale multimodal data streams

§ The 4th NVIDIA AI Lab in the world, right after Stanford, Berkeley, and OpenAI


(5)

Motivations – Numerous Cameras in Different Forms, and Still Growing …

Enterprises/Governments

(6)

Ongoing Research Projects (Selected) – More Details and Demos at http://www.csie.ntu.edu.tw/~winston/

§ facial/clothing attribute detection/search

§ web-scale indexing & feature learning

§ large-scale photo/video recognition

§ web-scale facial image retrieval

§ mobile visual recognition

§ multimodal deep neural network

§ social media mining

§ big data analytics and visualization

§ first-person/wearable cameras

§ consumer photo retrieval

(7)

Image Search by Semantic Understanding – First Place in MSR-Bing Image Retrieval Challenge 2013

§ Task: an online system (< 12 seconds) that scores each image-query pair by how well the query describes the given image

§ Hosted by Microsoft Research (Redmond) and Bing

§ Dataset: 23M click logs (query, image, #click) for the training set and 77,450 image-query pairs for the online test

http://web-ngram.research.microsoft.com/GrandChallenge

Example queries: “dollar bill”, “suri and katie cruise”, “drones”
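To make the click-log format (query, image, #click) concrete, here is a minimal sketch of a click-share baseline for scoring an image-query pair. It is an illustration only, not the method used in the winning entry: the function names, the lowercase query normalization, and the `smoothing` parameter are all assumptions, and the baseline can only score pairs that already appear in the log.

```python
from collections import defaultdict


def build_click_index(click_logs):
    """Aggregate (query, image_id, clicks) triples into click totals.

    Returns per-pair totals and per-query totals. Queries are lowercased
    and stripped (an assumed normalization, not part of the challenge spec).
    """
    pair_clicks = defaultdict(int)
    query_clicks = defaultdict(int)
    for query, image_id, clicks in click_logs:
        q = query.strip().lower()
        pair_clicks[(q, image_id)] += clicks
        query_clicks[q] += clicks
    return pair_clicks, query_clicks


def relevance_score(query, image_id, pair_clicks, query_clicks, smoothing=1.0):
    """Smoothed share of the query's clicks that landed on this image.

    The score lies in [0, 1); unseen pairs and unseen queries score 0.0.
    """
    q = query.strip().lower()
    clicks = pair_clicks.get((q, image_id), 0)
    total = query_clicks.get(q, 0)
    return clicks / (total + smoothing)


if __name__ == "__main__":
    # Hypothetical log entries, for illustration only.
    logs = [
        ("dollar bill", "img1", 8),
        ("dollar bill", "img2", 2),
        ("drones", "img3", 5),
    ]
    pairs, totals = build_click_index(logs)
    for image_id in ("img1", "img2", "img3"):
        print(image_id, relevance_score("dollar bill", image_id, pairs, totals))
```

A pair that never occurs in the log always scores zero here, so scoring new images would need visual and textual features on top of the click statistics.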

(8)

Product Inquiry/Recommendation by Mobiles (2009)

§ Product price/information inquiry by mobile phones

§ Experienced with indexing high-dimensional & large-scale data

demo


Amazon Flow, Google Goggles, Pinterest Visual Search, Alibaba Pailitao

[Lin et al., ICIP’09, Chen et al., JVCI’10]


(9)

Large-Scale Attribute-based People Search – Search by Impression

§ Search by impression

– searching people-related photos by graphically describing the search intentions

§ FIRST PRIZE in ACM Multimedia Grand Challenge 2011

demo

[Lei&Hsu, ACM MM 2011]

[Lei&Hsu, SIGIR 2012]


(10)

Ongoing Projects in Image/Video Analytics with Deep Convolutional Neural Networks

§ Goal – Devise effective and efficient learning methods for scalable visual analytics platforms, applicable to emerging industry applications

[Figure labels: scene categorization (Playground); photo annotation; object detection (person, bottle, dog); clothing attributes; facial attributes; vehicle attributes; video events; fine-grained recognition (Corgi, Pembroke Welsh Corgi); drone AR; automatic training data for CNN]

(11)

DeepTutor for Multimodal Question and Answering

Supervised QA, Proactive QA, Self-taught QA, Deep Tutoring (diverse media streams)

Travel, Sentiment, Shopping, Smart City, …

Education, Healthcare, Surveillance, Automobile, Robotics

Scalable Deep Learning Framework

• Multimodal and joint

• Semi-/unsupervised

• Video learning

• Transfer learning

• Scalable platform

• …

• Hashing for memory networks

• Multimodal memory networks

• Captioning

• Zero-shot query

• Auto. training data acquisition

• …

QA Interface (Reinforced + Augmented)

Multimedia QA Engine

Efficient, Large-Scale, and Multimodal Memory Representations

• Memory networks

• Reinforcement

• Attention model for AR

• Deep segmentation

• Deep user modeling

• …


(12)

Visiting Scientist – Cognitive Computation for IBM Watson AI (New York, USA)

§ The first movie trailer generated by an AI system (Watson)

– One of three researchers on the team

§ Demo video: https://www.youtube.com/watch?v=gJEzuYynaiw

§ News

– “Watson helped make a trailer for a horror movie about AI,” Engadget

• https://www.engadget.com/2016/09/01/ibm-watson-movie-trailer-morgan/

– “A computer built this trailer for a horror movie about an evil AI,” Mashable

• http://mashable.com/2016/09/01/morgan-watson-ai-trailer

– …

(13)

Image/Video Cognition (Machine Perception)

§ Problem definition: Given a video (image), describe it in natural language

§ Motivations

– Understanding high-level semantics and intention from video collections

– Leveraging multiple modalities such as video, time, text, etc., in a unified deep learning framework

– Enabling technology for video event detection, surveillance, live content filtering, robotics, social media mining, HCI, question answering, etc.

A man and a woman are


(14)

Image/Video Cognition (Machine Perception) – Tentative Results


A woman is pouring a bowl of dough and another woman is making something <eos>

(15)

Image-based 3D Model Retrieval – Retrieving Semantically Related 3D Models by Image

§ Novel proposal – End-to-end deep neural networks for cross-domain and cross-view learning and ranking

§ Impact: a brand-new problem, with results significantly outperforming prior neural networks

[Lee et al., submitted, 2016]

[Figure: network architecture. Labels: Query Image, Image-CNN, Adaptation Layer, Image representation; 3D Shapes, Rendered Views, View-CNN, Cross-View Convolution, Shape representation; Rank by L2 distance; Top Ranked 3D Shapes]
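The "Rank by L2 distance" step can be sketched as follows, assuming the image and each 3D shape have already been mapped to fixed-length vectors in a shared space. The embedding values, shape names, and function names are hypothetical. This shows only the ranking step, not the networks that produce the representations.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_shapes(image_embedding, shape_embeddings, top_k=3):
    """Return up to top_k (shape_id, distance) pairs, closest first.

    shape_embeddings maps a shape id to its embedding vector.
    """
    scored = [
        (l2_distance(image_embedding, embedding), shape_id)
        for shape_id, embedding in shape_embeddings.items()
    ]
    scored.sort()  # ascending distance: smaller means more similar
    return [(shape_id, dist) for dist, shape_id in scored[:top_k]]


if __name__ == "__main__":
    # Hypothetical 2-D embeddings; real representations would be much larger.
    query = [0.0, 1.0]
    shapes = {
        "chair": [0.0, 0.9],
        "table": [1.0, 1.0],
        "lamp": [3.0, 4.0],
    }
    for shape_id, dist in rank_shapes(query, shapes):
        print(shape_id, round(dist, 3))
```

A linear scan like this is fine for a small shape collection; a larger one would typically need an approximate nearest-neighbor index.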

(16)

3D Medical Segmentation by Deep Neural Networks

§ Novel proposal – Utilizing cross-modal learning in sequential and convolutional neural networks

§ Impact: significantly outperforms prior work (e.g., U-Net) on open benchmarks

[Tseng et al., submitted, 2016]


(17)

Social Media Mining – Huge Numbers of Photos/Videos Shared about Human Activities

(18)

Discovering the City by Mining Diverse and Multimodal Data Streams – IBM Grand Challenge: New York City 360

§ Exploring and integrating multiple contents and sources for NYC life

§ ACM Multimedia 2014 Grand Challenge Multimodal Award


[Kuo, ACM MM’14]


(19)

Understand Human Activities from Social Media (e.g., Instagram): Time + Photos + Tags


[Figure labels: convolutional neural network; sequential neural network (RNN, LSTM)]

§ Why: Huge needs in location-based services: advertisement, location understanding, recommendation, city planning, etc.

§ Problem definition: location classification, given a collection of photos and associated metadata

§ Location Categories (10): Arts & Entertainment, College & University, Event, Food,

(20)

Fashion Mining from Social Media by Clothing Attributes – Huge Interest from Fashion Industry

§ Confirmed the influence of fashion shows in daily life

– 60 clothing attributes

§ Widely discussed in social media and news media (NY Post, MIT Tech. Review, Science News, etc.)


[Chen et al., ACMMM’15]

08/28/2015


(21)

Drone AR – Understanding the Context from Drone Views (Ongoing Project)


(22)

@NTU, November 2016 – Winston Hsu

Recent Student Awards (selected) – Working on Essential and Emerging Problems

§ FIRST PLACE in MSR-Bing Image Retrieval Challenge 2013

§ First Prize for ACM Multimedia Grand Challenge 2011

§ ACM Multimedia 2013 Grand Challenge Multimodal Award

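§ 陳殷盈: ACM Multimedia 2012 Doctoral Symposium Best Paper Award

§ 郭盈希: Microsoft Research Asia Fellowship 2012

§ 朱冠宇: Third Place, Chinese Institute of Electrical Engineering (中國電機工程學會) Youth Paper Award, 2013

§ PhD students 陳冠婷 (2013), 陳殷盈 (2012), and 林彥良 (2012) received the "Subsidy for PhD Students to Conduct Research Abroad" (千里馬) grant

§ 陳柏村: 2012 Master's Thesis Award, Taiwanese Association for Artificial Intelligence (中華民國人工智慧學會)

§ Second Place, Cloud Applications (Campus Division), Chunghwa Telecom 2011 Telecom Innovation Application Contest

§ 鄭安容: Second Place, Chinese Institute of Electrical Engineering Youth Paper Award, 2011

§ 李文瑜: Google Fellowship for Women at SIGIR 2011, a top international conference

§ 陳殷盈: Google Fellowship for Women at WWW 2011, a top international conference

§ 郭盈希: Second Place, Chinese Institute of Electrical Engineering Youth Paper Award, 2010

§ Students won First Place, Flora Expo Applications Division, Chunghwa Telecom 2010 "Telecom Oscar" contest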

(23)

Hearty Contributions from

Our Research Members

(24)

Acknowledgements for Research Sponsors
