Visual Relationship Detection with Language Priors

Cewu Lu*, Ranjay Krishna*, Michael Bernstein, Li Fei-Fei

Stanford University

* = equal contribution

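[Figure: image #1 and image #2 each show a llama and a person. The object labels are the same in both images; the relationship differs: next to in one image, chasing in the other.]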

Problem formulation

Input (image only)

Output: objects (person, person, horse, horse) and the relationships between them (riding, riding, in front of)


Related work

Spatial relationships: cup on top of table

Action relationships: person kick ball

Common relationships: person wear shirt

Roger et al. ICCV 2008; Galleguillos, CVPR 2008

Yao et al. CVPR 2012; Maji et al. CVPR 2011; Rohrbach et al. ICCV 2013; Gupta et al. PAMI 2009

Gupta et al. ECCV 2008; Kumar et al. CVPR 2010; Wang et al. ECCV 2016; Sadeghi et al. CVPR 2011

Visual Genome dataset (Krishna et al. IJCV 2016)

33K object categories

42K relationship categories

The dataset also contains descriptions, question answers and attributes.

Observation #1

Quadratic explosion: N objects and K relationships lead to N²K detectors

Visual Genome dataset: N = 33K, K = 42K

[Figure: example relationships: ride, next to, lying, drag, falling off, carry, resting on, throw]

Observation #2

Long tail distribution of relationships: makes supervised training difficult

[Figure: plot of # of occurrences against relationships. Labeled examples include car on street, dog ride skateboard, elephant drink milk and dog ride surfboard.]

Visual module

Tackles: Quadratic explosion of N²K detectors

Language module

Tackles: Long tail distribution of relationships

[Diagram: Input, visual module and language module, Output]

Visual module

Proposals: Uijlings et al. IJCV 2013

[Diagram: a pair of proposals is sampled from the input; the outputs of an object detector, a relationship detector and a second object detector are multiplied (⋅). Example output: person in horse]

Definitions:

b1, b2 are object proposals

o1, o2 ∈ [person, horse, …]

r ∈ [on, in, ride, front of, …]

T is a triple
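The product of detector outputs on this slide can be sketched as below. This is a minimal sketch rather than the authors' implementation: the arrays `p_obj1`, `p_rel` and `p_obj2` are made-up numbers standing in for detector outputs on one pair of proposals b1, b2, and the helpers `triple_scores` and `best_triple` are hypothetical names.

```python
import numpy as np

OBJECTS = ["person", "horse"]
RELATIONSHIPS = ["on", "in", "ride", "front of"]

def triple_scores(p_obj1, p_rel, p_obj2):
    """Score every (o1, r, o2) as score(o1) * score(r) * score(o2)."""
    return p_obj1[:, None, None] * p_rel[None, :, None] * p_obj2[None, None, :]

def best_triple(p_obj1, p_rel, p_obj2):
    """Return the labels of the highest-scoring triple."""
    scores = triple_scores(p_obj1, p_rel, p_obj2)
    i, k, j = np.unravel_index(np.argmax(scores), scores.shape)
    return OBJECTS[i], RELATIONSHIPS[k], OBJECTS[j]

# Made-up detector outputs for one pair of proposals (assumption, not real data).
p_obj1 = np.array([0.9, 0.1])            # b1 looks like a person
p_rel = np.array([0.2, 0.4, 0.3, 0.1])   # relationship detector prefers "in"
p_obj2 = np.array([0.2, 0.8])            # b2 looks like a horse

print(best_triple(p_obj1, p_rel, p_obj2))
```

With these toy numbers the highest-scoring triple is person in horse, the same kind of visually plausible but semantically odd output the slide shows.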

Visual module and language module

Proposals: Uijlings et al. IJCV 2013

Language module example: o1: man, r: ride, o2: horse

[Diagram: a pair of proposals is sampled from the input and scored by the object detector, relationship detector and object detector. Before the language module score is multiplied in (⋅), the example output is person in horse; once it is multiplied in, the example output is person riding horse.]
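How multiplying in a language score can change the predicted relationship is sketched below. The numbers are invented for illustration, and `rerank` is a hypothetical helper; the slides do not give the actual scores or the exact form of the language module.

```python
RELATIONSHIPS = ["on", "in", "ride", "front of"]

def rerank(visual_scores, language_scores):
    """Multiply per-relationship visual and language scores and pick the best."""
    combined = [v * f for v, f in zip(visual_scores, language_scores)]
    best = max(range(len(combined)), key=combined.__getitem__)
    return RELATIONSHIPS[best], combined

# Made-up scores for the pair (person, horse); assumptions, not real outputs.
visual = [0.2, 0.4, 0.3, 0.1]      # visual module alone prefers "in"
language = [0.3, 0.05, 0.9, 0.2]   # language score favours "ride"

visual_only = RELATIONSHIPS[max(range(len(visual)), key=visual.__getitem__)]
with_language, _ = rerank(visual, language)
print(visual_only, "->", with_language)
```

In this toy case the visual-only choice is "in" and the combined choice is "ride", matching the change from person in horse to person riding horse on the slide.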

Visual module

Proposals: Uijlings et al. IJCV 2013

Tackles: Quadratic explosion. Only requires N+K detectors (object detector, relationship detector).

Language module

Tackles: Long tail distribution. Can predict rare relationships.

Example: o1: man, r: ride, o2: horse
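The difference between N²K and N+K can be made concrete with the Visual Genome figures quoted earlier (N = 33K, K = 42K). The function names below are hypothetical and the code is only arithmetic.

```python
def detectors_per_triple(n_objects, n_relationships):
    """One detector per (object, relationship, object) combination: N^2 * K."""
    return n_objects ** 2 * n_relationships

def detectors_factored(n_objects, n_relationships):
    """Separate object and relationship detectors: N + K."""
    return n_objects + n_relationships

N = 33_000  # object categories
K = 42_000  # relationship categories

print(detectors_per_triple(N, K))  # 45738000000000
print(detectors_factored(N, K))    # 75000
```

That is roughly 4.6 × 10¹³ detectors against 75,000.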

Training the visual module

1. Pre-train using ImageNet (Deng et al. 2009)
2. Train object detector (Girshick et al. CVPR 2014)
3. Train relationship detector
4. Fine-tune both jointly (ranking loss)
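The slide names a ranking loss for joint fine-tuning but does not spell it out. A common margin-based form, shown here only as an assumption about what such a loss might look like, asks the annotated triple to outscore every other candidate by a margin. The function `ranking_loss` and the example scores are hypothetical.

```python
import numpy as np

def ranking_loss(scores, true_index, margin=1.0):
    """Hinge ranking loss: zero when the true candidate beats the best
    other candidate by at least `margin`, positive otherwise."""
    others = np.delete(scores, true_index)
    return max(0.0, margin - scores[true_index] + others.max())

# Made-up scores for three candidate triples (assumption, not real data).
scores = np.array([0.2, 2.0, 0.5])

print(ranking_loss(scores, true_index=1))  # true triple already ranked first
print(ranking_loss(scores, true_index=0))  # true triple ranked last
```

The first call returns zero because the true triple leads by more than the margin; the second returns a positive penalty.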

Training the language module

Example relationships: dog ride skateboard; dog ride surfboard

[Equation: relates the two relationships, where cos is the cosine distance]

[Equation: objective to minimize]
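The slide says only that cos is the cosine distance. One way to compare two relationships with it, shown here as an assumption, is to sum the cosine distances between corresponding words. The 3-dimensional vectors in `EMBED` are invented toy values, not real word embeddings, and `relationship_distance` is a hypothetical helper.

```python
import numpy as np

def cosine_distance(u, v):
    """1 minus the cosine similarity of two vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def relationship_distance(rel_a, rel_b, embed):
    """Sum of cosine distances between corresponding words of two triples."""
    return sum(cosine_distance(embed[a], embed[b]) for a, b in zip(rel_a, rel_b))

# Toy unit vectors (assumption): similar words get similar directions.
EMBED = {
    "dog": np.array([1.0, 0.0, 0.0]),
    "elephant": np.array([0.8, 0.6, 0.0]),
    "ride": np.array([0.0, 1.0, 0.0]),
    "drink": np.array([0.0, 0.0, 1.0]),
    "skateboard": np.array([0.0, 0.6, 0.8]),
    "surfboard": np.array([0.0, 0.8, 0.6]),
    "milk": np.array([0.6, 0.0, 0.8]),
}

near = relationship_distance(("dog", "ride", "skateboard"),
                             ("dog", "ride", "surfboard"), EMBED)
far = relationship_distance(("dog", "ride", "skateboard"),
                            ("elephant", "drink", "milk"), EMBED)
print(near, far)
```

Under these toy vectors, dog ride skateboard is much closer to dog ride surfboard than to elephant drink milk, which is the kind of similarity a language prior could use to transfer knowledge from a frequent relationship to a rare one.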

Training both modules iteratively

[Diagram: visual module and language module]

Our results

Relationship types: spatial, comparative, asymmetrical, verb, prepositional

[Figure: example detections. Relationship labels: taller than, left of, wear, on. Object labels: person, shirt, snow, ski.]

Ablation study

Method                 Recall @ 50   Recall @ 100   mAP
Sadeghi et al. 2011    0.07          0.09           0.04
Visual only            1.58          1.85           0.84
Visual + language      13.86         14.76          1.52

Example predictions:

Sadeghi et al. 2011: person wear shirt, person wear shirt

Visual only: person in horse, person in shirt

Visual + language: person ride horse, person near horse
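Recall @ K in the table can be read as the fraction of annotated relationships that appear among the K highest-scoring predictions. The sketch below illustrates that reading only: it matches triples by label and ignores bounding-box overlap, which a real evaluation would also have to check. The function `recall_at_k` and the example data are hypothetical.

```python
def recall_at_k(ranked_predictions, ground_truth, k):
    """Fraction of ground-truth triples found in the top-k predictions.

    `ranked_predictions` must already be sorted by descending score.
    """
    top_k = set(ranked_predictions[:k])
    hits = sum(1 for triple in ground_truth if triple in top_k)
    return hits / len(ground_truth)

# Made-up predictions for one image, best first (assumption, not real data).
predictions = [
    ("person", "ride", "horse"),
    ("person", "in", "shirt"),
    ("person", "wear", "shirt"),
]
annotated = [("person", "ride", "horse"), ("person", "wear", "shirt")]

print(recall_at_k(predictions, annotated, k=1))  # 0.5
print(recall_at_k(predictions, annotated, k=3))  # 1.0
```

Recall is a natural fit when annotations are incomplete: a correct prediction that nobody annotated does not count against the model.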

[Figure: example detection: person ride bicycle]

[Figure: example detections: person throw frisbee, person throw frisbee]

Zero shot detection

person sit chair: 948 training examples

hydrant on ground: 29 training examples

person sit hydrant: 0 training examples

person ride horse: 578 training examples

person wear hat: 1023 training examples

horse wear hat: 0 training examples
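The examples above suggest what zero shot means here: the full triple never occurs in training, although its objects and its relationship each occur in other triples. The check below encodes that reading as an assumption, using the training counts from the slides; `is_zero_shot` is a hypothetical helper.

```python
from collections import Counter

# Training counts as listed on the slides.
TRAIN_COUNTS = Counter({
    ("person", "sit", "chair"): 948,
    ("hydrant", "on", "ground"): 29,
    ("person", "ride", "horse"): 578,
    ("person", "wear", "hat"): 1023,
})

def is_zero_shot(triple, train_counts):
    """True if the triple is unseen but all of its words were seen in training."""
    objects = {o for (o1, _, o2) in train_counts for o in (o1, o2)}
    relationships = {r for (_, r, _) in train_counts}
    o1, r, o2 = triple
    known_words = o1 in objects and o2 in objects and r in relationships
    return known_words and train_counts[triple] == 0

print(is_zero_shot(("person", "sit", "hydrant"), TRAIN_COUNTS))  # True
print(is_zero_shot(("horse", "wear", "hat"), TRAIN_COUNTS))      # True
print(is_zero_shot(("person", "ride", "horse"), TRAIN_COUNTS))   # False
```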

Poster #4

Questions?