SEOUL | Oct. 7, 2016

DEEP LEARNING BASED VISION-TO-LANGUAGE APPLICATIONS:

CAPTIONING OF PHOTO STREAMS, VIDEOS, AND ONLINE POSTS

Gunhee Kim
Computer Science and Engineering, Seoul National University
October 7, 2016

AGENDA

• Photo stream captioning
• Video captioning

Cesc C. Park and Gunhee Kim. “Expressing an Image Stream with a Sequence of Natural Sentences”. NIPS 2015.

GENERAL USERS’ PHOTO STREAM

Suppose that you and your family visit NYC.

• A photo stream is a thread of the user’s story.
• Users do not organize their photo streams for later use.
• Can we write a travelogue for a given photo stream?


PREVIOUS WORK – IMAGE CAPTIONING

Retrieve or generate a descriptive natural-language sentence for a given image.

[Socher et al., TACL 2013], [Gong et al., ECCV 2014], [Karpathy et al., CVPR 2015], [Vinyals et al., CVPR 2015], [Mao et al., ICLR 2015], and many more!


LIMITATION OF PREVIOUS WORK

Much of the previous work mainly discusses the relation between a single image and a single sentence.

Example: “A kid is smiling …”

Absence of correlation, coherence, and story across a stream of images.

Our goal: extend both the input and the output to a sequence of images and a sequence of sentences.


PROBLEM STATEMENT

Objective: express an image stream as a coherent sequence of sentences.

Input: a query image stream.

Output: a coherent sequence of sentences, e.g., “We took a couple days for family vacation in NYC to get away… Empire state building right off the bat. Caeden is checking out the view. Caeden's first MLB game and my first in a while… He might be a mets fan… Shake Shack…”


IMAGE-TEXT PARALLEL TRAINING DATA

Use a set of blog posts to learn the relation between an image stream and a sequence of sentences: 19K blog posts with 150K images.

Blogs are written in a storytelling manner:
• Blog pictures are selected as the most canonical ones out of photo albums.
• Informative sentences are associated with the pictures, covering locations, sentiments, actors, …


OUR SOLUTION – CRCN

Coherence Recurrent Convolutional Network:
• (1) Convolutional neural networks for image description
• (2) Bidirectional recurrent neural networks for the language model
• (3) A coherence model for a smooth flow across multiple sentences

A minimal sketch of how these components could fit together is shown below.
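To make the three components concrete, here is a minimal PyTorch-style sketch of how they could be wired together. This is my own illustration, not the paper's released implementation: the dimensions, the GRU cells, and the use of a second recurrent pass as a stand-in for the coherence model are all assumptions.

```python
import torch
import torch.nn as nn

class CRCNSketch(nn.Module):
    """Hypothetical sketch of the three CRCN components."""
    def __init__(self, img_dim=4096, txt_dim=300, hid_dim=512):
        super().__init__()
        # (1) CNN image features are assumed precomputed (e.g., 4096-d fc7).
        self.img_proj = nn.Linear(img_dim, hid_dim)
        # (2) Bidirectional RNN over the sentence sequence (language model).
        self.brnn = nn.GRU(txt_dim, hid_dim, bidirectional=True, batch_first=True)
        # (3) Stand-in coherence model: another recurrent pass over sentences.
        self.coherence = nn.GRU(txt_dim, hid_dim, batch_first=True)
        self.fuse = nn.Linear(3 * hid_dim, hid_dim)

    def forward(self, img_feats, sent_embs):
        # img_feats: (N, img_dim); sent_embs: (N, txt_dim) for N aligned pairs.
        h_txt, _ = self.brnn(sent_embs.unsqueeze(0))       # (1, N, 2*hid_dim)
        h_coh, _ = self.coherence(sent_embs.unsqueeze(0))  # (1, N, hid_dim)
        s_txt = self.fuse(torch.cat([h_txt, h_coh], dim=-1)).squeeze(0)
        s_img = self.img_proj(img_feats)                   # (N, hid_dim)
        # Compatibility score: sum of per-position dot products.
        return (s_img * s_txt).sum()
```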


OVERVIEW OF ALGORITHM


CRCN ARCHITECTURE

The model is trained on a compatibility score between an image stream and a sentence sequence: aligned (ground-truth) pairs should score high, while misaligned pairs should score low. A generic sketch of such a ranking objective follows.
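This criterion resembles a standard max-margin ranking loss; the following is a generic sketch of such an objective (an assumption on my part, not necessarily the paper's exact formulation).

```python
import torch

def ranking_loss(scores, margin=1.0):
    # scores: (B, B) matrix where scores[i, j] is the compatibility of
    # image stream i with sentence sequence j; diagonal entries are aligned.
    aligned = scores.diag().unsqueeze(1)              # (B, 1)
    # Penalize misaligned pairs that come within `margin` of an aligned pair.
    cost = (margin + scores - aligned).clamp(min=0)   # (B, B) hinge costs
    cost.fill_diagonal_(0)                            # aligned pairs cost nothing
    return cost.sum()
```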


RETRIEVAL OF SENTENCE SEQUENCES

Retrieve the best sentence sequences for a query image stream with a divide-and-conquer search strategy (sketched below):
• Almost optimal!
• Local fluency and coherence are prerequisites for global fluency and coherence.
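A hedged sketch of such a divide-and-conquer retrieval follows; the per-image candidate pool, the midpoint split, and the beam width K are illustrative assumptions rather than the paper's exact procedure.

```python
def retrieve_sequences(images, candidates, score_fn, K=10):
    """Approximately find the best sentence sequences for an image stream.

    images:     list of image features
    candidates: per-image lists of candidate sentences (e.g., from kNN search)
    score_fn:   scores a (sub)stream against a sentence (sub)sequence (e.g., CRCN)
    Returns the top-K (score, sequence) pairs.
    """
    if len(images) == 1:
        scored = [(score_fn(images, [s]), [s]) for s in candidates[0]]
        return sorted(scored, key=lambda t: t[0], reverse=True)[:K]
    # Divide: solve each half independently, keeping K candidates per half.
    mid = len(images) // 2
    left = retrieve_sequences(images[:mid], candidates[:mid], score_fn, K)
    right = retrieve_sequences(images[mid:], candidates[mid:], score_fn, K)
    # Conquer: rescore the K*K cross combinations on the full stream, so that
    # locally fluent, coherent halves compete for global coherence.
    merged = [(score_fn(images, ls + rs), ls + rs)
              for _, ls in left for _, rs in right]
    return sorted(merged, key=lambda t: t[0], reverse=True)[:K]
```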


USER STUDIES VIA AMT

Goal: find general users’ preferences among the text sequences produced by different methods for a given photo stream.
• Randomly select 100 test streams of 5 images each.
• Our method and one baseline each predict a text sequence.
• Pairwise preference test via Amazon Mechanical Turk.

Quantitative results:
• A number higher than 50% validates our approach (CRCN).
• Coherence becomes more critical as the passage gets longer (4th vs. 5th columns).


RESULTS FOR NYC DATASET

[Figure: five query images from the NYC dataset, numbered (1)–(5)]

(CRCN)
(1) One of the hallway arches inside of the library
(2) As we walked through the library I noticed an exhibit called lunch hour nyc it captured my attention as I had also taken a tour of nyc food carts during my trip
(3) Here is the top of the Chrysler building everyone's favorite skyscraper in new york.
(4) After leaving the nypl we walked along 42nd st.
(5) We walked down fifth avenue from rockefeller centre checking out the windows in saks the designer stores and eventually making our way to the impressive new york public library.

(RCN)
(1) As you walk along in some spots it looks like the buildings are sprouting up out of the high line plants
(2) Charlie and his aunt donna relax on the high line after a steamy stroll
(3) However navigating the new york subway system can be like trying to find your way through the amazon jungle sans guide
(4) We loved nyc!
(5) Getting ready for the sunny day...putting sunscreen on.


AGENDA

• Photo stream captioning
• Video captioning


WON LSMDC 2016!

Large Scale Movie Description and Understanding Challenge (LSMDC 2016), held at ECCV 2016 and ACM MM 2016.
• https://sites.google.com/site/describingmovies/lsmdc-2016
• Team members: Youngjae Yu, Hyungjin Ko, Jongwook Choi, Gunhee Kim

Four tracks: movie description, movie multiple-choice, movie fill-in-the-blank, and movie retrieval (e.g., “His vanity license plate reads 732.”).


ATTENTION MECHANISMS IN DEEP LEARNING

The machine decides where to attend by itself, sequentially focusing on the most relevant part of the input over time.

Image captioning [Xu et al., ICML 2015]

Reading comprehension [Hermann et al., NIPS 2015]

K. Xu et al. “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”. ICML 2015.
K. M. Hermann et al. “Teaching Machines to Read and Comprehend”. NIPS 2015.
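As a concrete illustration of soft attention, here is a generic single-step sketch in the style of Xu et al.; the additive scoring function and all dimensions are assumptions, not a reproduction of either cited model.

```python
import torch
import torch.nn.functional as F

def soft_attention(features, query, W_f, W_q, w):
    # features: (L, D) region/frame features; query: (H,) current decoder state.
    # W_f: (D, A), W_q: (H, A), w: (A,) are learned projections (additive attention).
    e = torch.tanh(features @ W_f + query @ W_q) @ w   # (L,) relevance scores
    alpha = F.softmax(e, dim=0)                        # attention weights, sum to 1
    context = alpha @ features                         # (D,) attended summary
    return context, alpha
```

At each decoding step the generator consumes `context` and emits the next word, so the weights `alpha` shift over regions (or frames) as the sentence unfolds.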


ATTENTION MECHANISMS IN DEEP LEARNING

Action recognition [Sharma et al., arXiv 2015]: attention over frames for actions such as “golf swinging” and “trampoline jumping”.

Video captioning [Sharma, MS thesis 2016]:
• Generated sentence: “A woman is slicing a onion.”
• Ground truth: “A woman is slicing a shrimp.”

S. Sharma et al. “Action Recognition Using Visual Attention”. arXiv 2015.
S. Sharma. “Action Recognition and Video Description Using Visual Attention”. MS thesis, U. Toronto, 2016.


PROBLEM STATEMENT

Although attention models simulate human attention, there has been no attempt to explicitly use human gaze labels.
• Attention weights are implicitly learned in an end-to-end manner.

Does human attention improve model performance? If so, how can we inject such supervision into the attention model?

Target task: video captioning.


PROBLEM STATEMENT

Objective: supervise a caption generation model to attend where humans focus.

Pipeline: a short movie clip → predicted human attention → caption.

A human-gaze-assisted caption: “A little boy flies in the air by riding a bicycle.”
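One straightforward way to inject such supervision is to add a loss term pulling the model's attention distribution toward the human gaze map. The KL form and the weighting below are my assumptions for illustration, not necessarily the method used in this work.

```python
import torch
import torch.nn.functional as F

def gaze_supervised_loss(caption_nll, attn_weights, gaze_maps, lam=0.5):
    # caption_nll:  negative log-likelihood of the ground-truth caption
    # attn_weights: (T, L) model attention over L regions at each of T steps
    # gaze_maps:    (T, L) human gaze distributions (each row sums to 1)
    attn_log = torch.log(attn_weights + 1e-8)          # log-probabilities for KL
    gaze_loss = F.kl_div(attn_log, gaze_maps, reduction='batchmean')
    return caption_nll + lam * gaze_loss               # joint training objective
```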


RESULTS (CAPTION GENERATION)

[1] Subhashini Venugopalan et al. “Sequence to Sequence – Video to Text”. ICCV 2015.
[2] Li Yao et al. “Describing Videos by Exploiting Temporal Structure”. ICCV 2015.



CONCLUSION

Joint understanding of multiple data modalities:
• Visual data (images/videos) + textual data

Deep learning models excel at jointly representing multiple modalities and tasks:
• In the areas of robotics, VR, security, speech analysis, and more

Many possible applications for online services


SEOUL | Oct. 7, 2016

THANK YOU