Context-Aware Correlation Filter Tracker (Supplementary Material)

0 downloads 150 Views 430KB Size Report
context image patches below the target image patch forming a new data matrix ... window) in the next frame, where Z deno
Context-Aware Correlation Filter Tracker (Supplementary Material) Matthias Mueller, Neil Smith, and Bernard Ghanem King Abdullah University of Science and Technology (KAUST), Saudi Arabia {matthias.mueller.2, neil.smith, bernard.ghanem}@kaust.edu.sa

1

Formulation

1.1

Single-Channel Features

Solution in the Primal Domain. min ||A0 w − y||22 + λ1 ||w||22 + λ2 w

k X

||Ai w||22

(1)

i=1

The primal objective function fp in Eq. (1) can be rewritten by stacking the context image patches below the target image patch forming a new data matrix B ∈ R(k+1)n×n . The new regression target y ¯ ∈ R(k+1)n concatenates y with zeros.

    2

y

√ A0

 λ2 A1  0

    2 fp (w, B) =  .  w −  .  + λ1 ||w||22 = kBw − y ¯k2 + λ1 ||w||22

 ..   .. 



λ 2 Ak 0

(2)

2

where w ∈ Rn , B ∈ R(k+1)n×n , y ¯ ∈ R(k+1)n Since fp (w, B) is convex, it can be minimized by setting the gradient to zero, yielding: (3) ∇w f (w) = 2BT (Bw − y ¯) + 2λ1 w = 0 Solving for w: w = (BT B + λ1 I)−1 BT y ¯

(4)

Identity for circulant matrices (F is the FFT matrix): X = F diag(ˆ x ) FH XT = F diag(ˆ x ∗ ) FH

) XT X = F diag(ˆ x∗ x ˆ ) FH

(5)

2

Matthias Mueller, Neil Smith, and Bernard Ghanem

Therefore: BT B = AT0 A0 + . . . + λ2 ATk Ak = AT0 A0 + λ2

k X

ATi Ai

i=1

= F diag(ˆ a∗0 ˆ a0 ) FH + λ2

k X

F diag(ˆ a∗i ˆ a i ) FH

(6)

i=1

= F diag ˆ a∗0 ˆ a0 + λ 2

k X

! ˆ a∗i ˆ ai

FH

i=1

 √ BT y ¯ = AT0 λ2 AT1

  y 0  √   ... λ2 ATk  .   .. 

(7)

0 = AT0 y = F diag (ˆ a∗0 ) FH y = F diag (ˆ a∗0 y ˆ) Substituting into Eq. 4: " w = F diag

ˆ a∗0

ˆ a0 + λ1 + λ2

k X

#−1

! ˆ a∗i

ˆ ai

F

F diag (ˆ a∗0 y ˆ)

H

(8)

i=1

" F

H

w = diag

ˆ a∗0

ˆ a0 + λ1 + λ2

k X

!#−1 ˆ a∗i

ˆ ai

diag (ˆ a∗0 y ˆ)

(9)

i=1

w ˆ =

ˆ a∗0 y ˆ Pk ∗ a∗i ˆ ai ˆ a0 ˆ a0 + λ1 + λ2 i=1 ˆ

(10)

Detection formula. The learned filter w is convolved with image patch z (search window) in the next frame, where Z denotes its circulant matrix. The location of the maximum response is the target location within the search window. The primal detection formula in time and frequency domains is given by: rp (w, Z) = Z w ⇔ ˆ rp = ˆ z w ˆ

(11)

Solution in the Dual Domain. Note that the solution in the primal domain in Eq. (4) has the exact same form as the solution of the standard ridge regression problem [2]. Hence, the solution in the dual domain is given by: α = BBT + λ1 I where α ∈ R(k+1)n

−1

y ¯

(12)

Context-Aware Correlation Filter Tracker - Supplementary Material

3

 √ A0  λ2 A1    √ √   BBT =  .  AT0 λ2 AT1 . . . λ2 ATk  ..  √ λ2 Ak √ √   T T λ2 A0 ATk √ A0 A0 T λ2 A0 AT1 . . .  λ2 A1 A0 λ2 A1 A1 . . . λ2 A1 ATk    =  .. .. .. ..   . . . √ . T T T λ2 Ak A0 λ2 Ak A1 . . . λ2 Ak Ak √     a0 ˆ a∗k ) FH . . . 0 diag(ˆ a0 ˆ a∗0 ) . . . diag( λ2 ˆ F ... 0   .. . .   .  .. .. .. =  ... . . . ...    . . ..  . . . √ 0 ... F 0 . . . FH ak ˆ a∗0 ) . . . diag(λ2 ˆ ak ˆ a∗k ) diag( λ2 ˆ ¯ F ¯H = FD (13) Substituting into Eq. 12: 

  ¯ (D + λ1 I) F ¯ H −1 y ¯ (D + λ1 I)−1 F ¯H y α= F ¯=F ¯ −1   y ˆ diag(d00 ) . . . diag(d0k )    ..  .. .. .. ˆ y ¯=  . . . . 0 diag(dk0 ) . . . diag(dkk )

(14)



ˆ = (D + λ1 I)−1 α

where vectors djl with j, l ∈ {1, ..., k} are given by:  a0 ˆ a∗0 + λ1   d00 = ˆ djj = λ2 (ˆ aj ˆ a∗j ) + λ1 , j 6= 0  p  djl = λ2 (ˆ aj ˆ a∗l ), j 6= l

(15)

(16)

Note that the kernel trick can be applied, since all interactions between the image patches occur as bi-products. Hence, the linear correlation can simply be replaced by one of the kernel correlations as derived for conventional kernelized CF trackers [1]. Since all blocks are diagonal, the system can be decomposed into n smaller systems of dimension R(k+1)×(k+1) . This significantly reduces complexity and allows for parallelization. Instead of solving one large system of dimension R(k+1)n×(k+1)n to compute α, a separate system is solved for each pixel p ∈ {1, ..., n} of α, as follows:  −1  y ˆ(p) d00 (p) . . . d0k (p)  ..   ..  .. α(p) ˆ =  ... . .   .  dk0 (p) . . . dkk (p) 0 

(17)

4

Matthias Mueller, Neil Smith, and Bernard Ghanem

Detection Formula. Note that B contains the context patches in addition to the target. Consequently, α ∈ R(k+1)n is composed of a concatenation of dual variables {α0 , . . . , αk }. (18) rd (α, B, Z) = Z BT α 

 √ rd = Z AT0 λ2 AT1

 α0   √  α1  ... λ2 ATk  .   .. 

(19)

αk  α0  ..   .     √ z ˆ a∗0 ) . . . λ2 diag(ˆ z ˆ a∗k ) FH rd = F diag(ˆ 

(20)

αk 

ˆ0 α   .  √ ∗ ∗ z ˆ a0 ) . . . λ2 diag(ˆ z ˆ ak )  ..  ˆ rd = diag(ˆ ˆk α ˆ rd = ˆ z ˆ a∗0 α ˆ0 +

(21)

k X p λ2 ˆ z ˆ a∗i α ˆi

(22)

i=1

1.2

Multi-Channel Features

Solution in the Primal Domain. Now we want to solve the same problem for multi-channel features and effectively learn a joint filter for all feature dimension.

min

w1 ,...,wm

m

2

m

2 m k X

X

X X

2 A0i wi − y + λ1 kwi k2 + λ2 Aji wi



(23)

2  

2

w1 m

X



 . 

A0i wi − y = A01 . . . A0m  ..  − y



i=1

2 wm

(24)

i=1

2

i=1

j=1

i=1

2

Note that:

2

 

w1 2 m

X

  2 2 ¯ 2 kwi k2 =  ...  = kwk

i=1

wm 2

   2

2 A01 . . . A0m w1 k m

X X

 ..   ..  Aji wi =  ... . . .

   . .

j=1 i=1

2 Ak1 . . . Akm wm 2

(25)

(26)

Context-Aware Correlation Filter Tracker - Supplementary Material

5

¯ in Eq. (23) can be rewritten in a Therefore, the objective function fp (w; ¯ B) similar fashion as in the case of single-channel features (Eq. (2)) with the differ¯ ∈ R(k+1)n×nm contains the base and context image patches as rows ence that B and their corresponding features columns. The filters for the different feature dimensions are stacked into w ¯ ∈ Rnm .

    2

    . . . √ A0m y

√ A01

w1 2 w1 

 λ2 A11 . . .

  λ A 0 2 1m   . 

    ¯ = ..  −  + λ1  ...  fp (w; ¯ B) 



   .. . . . .. .



  .. 

wm (27)

√ . √ . wm

2 0 2 λ2 Ak1 . . . λ2 Akm

2 ¯w = B ¯ −y ¯ 2 + λ1 ||w|| ¯ 22 ¯ ∈ R(k+1)n×nm , y where w ¯ ∈ Rnm , B ¯ ∈ R(k+1)n ¯ in Eq. (27) is convex. Hence, this Note that the objective function fp (w; ¯ B) optimization problem can be solved by setting the gradient to zero: ¯ T (B ¯w ∇w f (w) ¯ = 2B ¯ −y ¯) + 2λ1 w ¯ =0

(28)

¯TB ¯ + λ1 I)−1 B ¯Ty w ¯ = (B ¯

(29)

Solving for w: ¯ Applying the identity for circulant matrices (Eq. (5)) yields:   √  y  T √ λ2 ATk1   A01 λ2 AT11 . . .   0 .. .. .. ¯Ty B ¯ =  ...   ..  . . . . √ √ T T T λ2 Akm A0m λ2 A1m . . . 0  T      A01 y F diag(ˆ a∗01 ) FH y F diag(ˆ a∗01 y ˆ)       .. .. =  ...  =  =  . . F diag(ˆ a∗0m ) FH y

AT0m y √



(30)

ˆ) F diag(ˆ a∗0m y

 A01 . . . √ A0m √ ...     λ2 A11 . . . λ2 A1m  .. .. .. .. ¯TB ¯ = B   .  .. . . . . . .. .   √ √ √ . √ . λ2 ATkm AT0m λ2 AT1m . . . λ2 Ak1 . . . λ2 Akm  T  Pk Pk T A01 A01 + λ2 i=1 Ai1 Ai1 . . . AT01 A0m + λ2 i=1 ATi1 Aim   .. .. .. =  . . . P P k k T T T T A0m A01 + λ2 i=1 Aim Ai1 . . . A0m A0m + λ2 i=1 Aim Aim  H    F ... 0 F ... 0 diag(¯ c11 ) . . . diag(¯ c1m )     .. . . .  ¯C .. .. .. ¯F ¯H =  ... . . . ...    . . . ..  = F . . 

AT01

λ2 AT11

0 ... F

λ2 ATk1





diag(¯ cm1 ) . . . diag(¯ cmm )

0 . . . FH

(31)

6

Matthias Mueller, Neil Smith, and Bernard Ghanem

where: ¯ c11 = ˆ a∗01 ˆ a01 + λ2

k X

ˆ a∗i1 ˆ ai1

i=1

¯ c1m = ˆ a∗01 ˆ a0m + λ2

k X

ˆ a∗i1 ˆ aim

i=1

¯ cm1 = ˆ a∗0m ˆ a01 + λ2

k X

(32) ˆ a∗im ˆ ai1

i=1

¯ cmm =

ˆ a∗0m

ˆ a0m + λ2

k X

ˆ a∗im ˆ aim

i=1

Substituting into Eq. 29:   ¯ (C ¯ + λ1 I) F ¯ H −1 B ¯Ty w ¯ = F ¯

(33)

¯H w ¯ + λ1 I)−1 F ¯H B ¯Ty F ¯ = (C ¯

(34)

 diag(¯ c11 ) + λ1 I . . .  .. .. ˆ w ¯ = . .

diag(¯ c1m ) .. .

−1   

diag(¯ cm1 ) ¯ 11 C  .. = . ¯ m1 C 

. . . diag(¯ cmm ) + λ1 I  −1  ¯ 1m diag(ˆ a∗01 y ˆ) ... C  .. ..   ..  . . .   ¯ mm diag(ˆ a∗0m y ˆ) ... C

 ˆ) diag(ˆ a∗01 y   ..   . ˆ) diag(ˆ a∗0m y

(35)

The target and context image patches for each feature dimension j, l ∈ {1, ..., m} ¯ + λ1 I)−1 are defined are denoted by a0j and aij respectively. The blocks of (C as:    Pk C ¯ jj = diag ˆ a∗0j ˆ a0j + λ2 i=1 ˆ a∗ij ˆ aij + λ1 I   (36) Pk C ¯ jl = diag ˆ a∗0j ˆ a0l + λ2 i=1 ˆ a∗ij ˆ ail , j 6= l Unfortunately, this system cannot be inverted as efficiently as in the singlechannel case. However, since all of the blocks are diagonal, the system can be decomposed into n smaller systems of dimension Rm×m . This reduces the complexity significantly and allows for parallelization. Similar to Eq. (17), a separate ˆ system is solved for each pixel p ∈ {1, ..., n} of the filter w. ¯  ¯ c11 (p) + λ1 . . .  .. .. ˆ w(p) ¯ = . . ¯ cm1 (p)

¯ c1m (p) .. .

... ¯ cmm (p) + λ1

−1   

 [ˆ a∗01 y ˆ](1)   ..   . [ˆ a∗0m

(37)

y ˆ](1)

Detection formula. It is almost the same as in the single-channel case in Eq. (11) with the difference that the image patch z and the learned filter w are m-dimensional.

Context-Aware Correlation Filter Tracker - Supplementary Material

7

Solution in the Dual Domain. Just like in the case of single-channel features, the multi-channel primal solution (Eq. (29)) also has the exact same form as the solution of the standard ridge regression problem [1] yielding the following solution in the dual domain: ¯B ¯ T + λ1 I ¯= B α

−1

y ¯

(38)

¯ ∈ Rkn where α  √  T √  T √ A01 . . . √ A0m λ2 ATk1  λ2 A11 . . . λ2 A1m  A01 λ2 A11 . . .   ..  .. .. .. ¯B ¯T =  B  .   .. .. . .. . .   √ √ . √ . √ . λ2 ATkm AT0m λ2 AT1m . . . λ2 Ak1 . . . λ2 Akm √ Pm √ Pm  Pm  T λ2 i=1 A0i A1i T . . . λ2 i=1 A0i Aki T i=1 A0i A0i √ P P P  λ2 m A1i AT λ2 m A1i A1i T . . . λ2 m A1i Aki T  0i i=1 i=1 i=1   =  .. .. .. ..   . . . √ Pm. P P m m T T T λ2 i=1 Aki A0i λ2 i=1 Aki A1i ... λ2 i=1 Aki Aki √ Pm   Pm ∗ diag( i=1 ˆ a0i ˆ a∗ki ) a0i ˆ a0i ) . . . diag( λ2 i=1 ˆ  ¯H .. .. .. ¯ =F  F . . . √ Pm P m ∗ ∗ aki ˆ a0i ) . . . diag(λ2 i=1 ˆ diag( λ2 i=1 ˆ aki ˆ aki ) H ¯ ¯ ¯ = FD F (39) Substituting into Eq. 38: 

  ¯H y ¯ (D ¯ + λ1 I) F ¯ H −1 y ¯ (D ¯ + λ1 I)−1 F ¯= F ¯ α ¯=F    ¯ 00 ) . . . diag(d ¯ 0k ) −1 y diag(d ˆ    . . . . ˆ .. .. .. y ¯=   ..  ¯ k0 ) . . . diag(d ¯ kk ) 0 diag(d

(40)



¯ + λ1 I)−1 ˆ ¯ = (D α

¯ jl with j, l ∈ {1, ..., k} are given by: where vectors d  Pm ¯  (ˆ a0i ˆ a∗0i ) + λ1 d00 = i=1 P m ¯ jj = λ2 d aji ˆ a∗ji ) + λ1 , j 6= 0 i=1 (ˆ  √ P ¯ m aji ˆ a∗li ), j 6= l djl = λ2 i=1 (ˆ

(41)

(42)

Note that the linear system is the same as in case of the dual domain solution for single-channel features with the exception that there is now a sum along the feature dimension m. This solution also permits the use of kernels and the linear system can be solved in the same fashion as the single-channel case (Eq. (17)). Since all blocks are diagonal, the system can be decomposed into n smaller systems of dimension R(k+1)×(k+1) . This significantly reduces complexity and allows for

8

Matthias Mueller, Neil Smith, and Bernard Ghanem

parallelization. Instead of solving one large system of dimension R(k+1)n×(k+1)n ˆ ˆ¯ as ¯ a separate system is solved for each pixel p ∈ {1, ..., n} of α, to compute α, follows:    ¯ 00 (p) . . . d ¯ 0k (p) −1 y ˆ(p) d  ..   ..  .. ˆ¯ α(p) =  ... . .   .  ¯ k0 (p) . . . d ¯ kk (p) 0 d 

(43)

Detection Formula. It follows the single-channel feature case with the difference ¯ ∈ Rnm×n and B ¯ ∈ R(k+1)n×nm now have multiple feature dimensions as that Z columns: ¯ Z) ¯ =Z ¯B ¯Tα (44) ¯ rd (α, ¯ B, AT    .01 rd = Z1 . . . Zm  .. 

AT0m

√ √



   α ¯ λ2 ATk1  0  ¯   α1  ..   ..  .  .  √ T λ2 Akm ... ¯k α

λ2 AT11 . . . .. .. . .

λ2 AT1m

 rd =

Pm

i=1

Zi AT0i . . .



λ2

Pm

i=1

 Zi Aki T FH

 ¯0 α  ..   . 

(45)

(46)

¯k α

rd = F

Pm

i=1

diag(ˆ zi ˆ a∗0i ) . . .



λ2

  ¯0 α  H  ..  ∗ zi ˆ aki ) F  .  (47) i=1 diag(ˆ ¯k α

Pm

 ˆ¯ 0 α    √ Pm Pm zi ˆ a∗ki )  ...  zi ˆ a∗0i ) . . . λ2 diag ( i=1 ˆ ˆ rd = diag ( i=1 ˆ ¯ˆ k α 

ˆ rd =

m X

! ˆ zi ˆ a∗0i

ˆ ¯0 + α

p

λ2

i=1

1.3

k m X X j=1

(48)

! ˆ zi ˆ a∗ji

ˆ¯ j α

(49)

i=1

Energy of data term ||A0 w − y||22 = (A0 w − y)T (A0 w − y) = (A0 w)T (A0 w) − (A0 w)T y − yT (A0 w) + (y)T (y) =

||A0 w||22

where A0 w is the response

T

− 2(A0 w) y +

||y||22

(50)

Context-Aware Correlation Filter Tracker - Supplementary Material

2

9

Experiments

We also evaluate trackers with the same parameters on two additional datasets to show that our framework improves tracking performance consistently. Figures 1a and 1b show the results on TC-128 [3] and Figures 2a and 2b show the results on UAV-123 [4]. While the results are lower for all trackers due to the higher difficulty of these datasets, the context-aware CF trackers consistently outperform the corresponding baseline CF trackers by a margin. Also note that none of the parameters was adjusted and that the sampling strategy for the context patches is very naive. With further parameter tuning and a smarter sampling strategy the results can be much further improved.

Precision plots of OPE on TC128 - All Sequences

0.9

0.8

0.8

0.7

0.7

0.6

0.6

0.5

STAPLE CA [0.679] - 31.4fps

0.4

STAPLE [0.667] - 49.5fps SAMF CA [0.647] - 9.69fps

Success rate

Precision

0.9

SAMF AT [0.636] - 5.12fps

SAMF [0.635] - 12.3fps DCF CA [0.575] - 116fps

0.3

STAPLE CA [0.505] - 31.4fps

0.4

STAPLE [0.497] - 49.5fps SAMF CA [0.479] - 9.69fps SAMF AT [0.465] - 5.12fps

0.2

MOSSE CA [0.496] - 212fps

0.1

0.5

0.3

DCF [0.550] - 423fps DCF AT [0.523] - 57.4fps

0.2

0.1

MOSSE AT [0.434] - 29.2fps

MOSSE [0.360] - 466fps

0

0

10

20 30 40 Location error threshold

(a) TC-128 - Precision Plot

Success plots of OPE on TC128 - All Sequences

SAMF [0.462] - 12.3fps DCF CA [0.415] - 116fps DCF [0.385] - 423fps DCF AT [0.382] - 57.4fps MOSSE CA [0.361] - 212fps MOSSE AT [0.325] - 29.2fps

MOSSE [0.268] - 466fps

50

0

0

0.2

0.4 0.6 Overlap threshold

0.8

(b) TC-128 - Success Plot

Fig. 1: Average precision and success on TC-128 for all sequences

1

10

Matthias Mueller, Neil Smith, and Bernard Ghanem

OPE Precision plots on UAV123 - All Sequences

0.9

0.8

0.8

0.7

0.7

0.6

0.6

Success rate

Precision

0.9

0.5 STAPLE CA [0.672] - 29.4fps

0.4

STAPLE [0.666] - 47.5fps SAMF CA [0.605] - 4.39fps

0.3

SAMF [0.592] - 5.26fps SAMF AT [0.583] - 1.79fps

0.5 STAPLE CA [0.454] - 29.4fps

0.4

STAPLE [0.450] - 47.5fps SAMF CA [0.415] - 4.39fps

0.3

SAMF [0.396] - 5.26fps SAMF AT [0.385] - 1.79fps

DCF CA [0.575] - 127fps

DCF CA [0.362] - 127fps

DCF AT [0.565] - 55.3fps

0.2

0.2

MOSSE CA [0.546] - 181fps

0.1

DCF [0.526] - 457fps MOSSE [0.466] - 512fps

0

0

10

20 30 40 Location error threshold

(a) UAV-123 - Precision Plot

DCF AT [0.359] - 55.3fps MOSSE AT [0.348] - 29.1fps

MOSSE AT [0.528] - 29.1fps

0.1

OPE Success plots on UAV123 - All Sequences

MOSSE CA [0.347] - 181fps DCF [0.332] - 457fps MOSSE [0.297] - 512fps

50

0

0

0.2

0.4 0.6 Overlap threshold

0.8

(b) UAV-123 - Success Plot

Fig. 2: Average precision and success on UAV-123 for all sequences

The following Figures show the performance for each attribute on OTB-100 [5].

1

Context-Aware Correlation Filter Tracker - Supplementary Material

OPE Precision plots on OTB100 - Aspect Ratio Change (16)

0.8

0.8

0.7

0.7

0.6

0.6

0.5

STAPLE CA [0.652] - 35.2fps SAMF CA [0.640] - 13fps

0.4

SAMF AT [0.626] - 6.11fps

OPE Success plots on OTB100 - Aspect Ratio Change (16)

0.9

Success rate

Precision

0.9

0.5

STAPLE CA [0.467] - 35.2fps SAMF CA [0.450] - 13fps

0.4

STAPLE [0.414] - 59.8fps SAMF AT [0.402] - 6.11fps

0.3

STAPLE [0.589] - 59.8fps SAMF [0.543] - 16.8fps DCF CA [0.527] - 82.3fps

0.3

DCF CA [0.384] - 82.3fps

0.2

DCF [0.506] - 333fps DCF AT [0.492] - 34.2fps

0.2

SAMF [0.383] - 16.8fps DCF [0.355] - 333fps DCF AT [0.352] - 34.2fps

MOSSE CA [0.246] - 115fps

0.1 0

0

10

20 30 40 Location error threshold

MOSSE CA [0.270] - 115fps

0.1

MOSSE [0.227] - 355fps MOSSE AT [0.225] - 18.9fps

11

MOSSE AT [0.253] - 18.9fps MOSSE [0.215] - 355fps

0

50

0

0.2

0.4 0.6 Overlap threshold

0.8

1

Fig. 3: Average precision and success on OTB-100 for videos with the attribute Aspect Ratio Change

OPE Precision plots on OTB100 - Background Clutter (31)

0.8

0.8

0.7

0.7

0.6

0.6

0.5

SAMF CA [0.801] - 13fps STAPLE CA [0.799] - 35.2fps

0.4

OPE Success plots on OTB100 - Background Clutter (31)

0.9

Success rate

Precision

0.9

SAMF AT [0.766] - 6.11fps

0.5

STAPLE CA [0.593] - 35.2fps SAMF CA [0.584] - 13fps

0.4

STAPLE [0.561] - 59.8fps SAMF AT [0.554] - 6.11fps DCF CA [0.536] - 82.3fps

DCF CA [0.764] - 82.3fps

0.3

STAPLE [0.749] - 59.8fps DCF AT [0.711] - 34.2fps

0.3

0.2

DCF [0.686] - 333fps SAMF [0.674] - 16.8fps MOSSE CA [0.551] - 115fps

0.2

0.1

DCF AT [0.512] - 34.2fps SAMF [0.491] - 16.8fps DCF [0.487] - 333fps MOSSE CA [0.404] - 115fps

0.1

MOSSE AT [0.468] - 18.9fps

MOSSE AT [0.356] - 18.9fps

MOSSE [0.410] - 355fps

0

0

10

20 30 40 Location error threshold

MOSSE [0.302] - 355fps

50

0

0

0.2

0.4 0.6 Overlap threshold

0.8

1

Fig. 4: Average precision and success on OTB-100 for videos with the attribute Background Clutter

12

Matthias Mueller, Neil Smith, and Bernard Ghanem

OPE Precision plots on OTB100 - Deformation (44)

0.8

0.8

0.7

0.7

0.6

0.6

0.5

STAPLE CA [0.777] - 35.2fps SAMF CA [0.752] - 13fps

0.4

STAPLE [0.751] - 59.8fps SAMF AT [0.731] - 6.11fps

0.3

SAMF [0.707] - 16.8fps DCF CA [0.693] - 82.3fps

STAPLE CA [0.572] - 35.2fps

0.4

STAPLE [0.550] - 59.8fps SAMF CA [0.533] - 13fps

0.3

SAMF [0.516] - 16.8fps SAMF AT [0.494] - 6.11fps DCF AT [0.443] - 34.2fps

0.2

DCF [0.621] - 333fps MOSSE CA [0.486] - 115fps

0.1

0.5

DCF CA [0.477] - 82.3fps

DCF AT [0.639] - 34.2fps

0.2

OPE Success plots on OTB100 - Deformation (44)

0.9

Success rate

Precision

0.9

DCF [0.439] - 333fps MOSSE CA [0.376] - 115fps

0.1

MOSSE AT [0.424] - 18.9fps

MOSSE AT [0.332] - 18.9fps

MOSSE [0.329] - 355fps

0

0

10

20 30 40 Location error threshold

MOSSE [0.263] - 355fps

0

50

0

0.2

0.4 0.6 Overlap threshold

0.8

1

Fig. 5: Average precision and success on OTB-100 for videos with the attribute Deformation

OPE Precision plots on OTB100 - Fast Motion (39)

0.8

0.8

0.7

0.7

0.6

0.6

0.5

STAPLE CA [0.757] - 35.2fps SAMF AT [0.738] - 6.11fps

0.4

OPE Success plots on OTB100 - Fast Motion (39)

0.9

Success rate

Precision

0.9

DCF CA [0.727] - 82.3fps

0.5

STAPLE CA [0.583] - 35.2fps SAMF AT [0.546] - 6.11fps

0.4

SAMF CA [0.543] - 13fps

0.3

STAPLE [0.541] - 59.8fps DCF CA [0.519] - 82.3fps

0.2

SAMF [0.519] - 16.8fps DCF AT [0.498] - 34.2fps

SAMF CA [0.723] - 13fps

0.3

STAPLE [0.710] - 59.8fps SAMF [0.684] - 16.8fps DCF AT [0.669] - 34.2fps

0.2

DCF [0.603] - 333fps MOSSE CA [0.545] - 115fps

0.1

DCF [0.454] - 333fps MOSSE CA [0.446] - 115fps

0.1

MOSSE AT [0.531] - 18.9fps

MOSSE AT [0.423] - 18.9fps

MOSSE [0.273] - 355fps

0

0

10

20 30 40 Location error threshold

MOSSE [0.237] - 355fps

50

0

0

0.2

0.4 0.6 Overlap threshold

0.8

1

Fig. 6: Average precision and success on OTB-100 for videos with the attribute Fast Motion

Context-Aware Correlation Filter Tracker - Supplementary Material

OPE Precision plots on OTB100 - In-Plane Rotation (51)

0.8

0.8

0.7

0.7

0.6

0.6

0.5

STAPLE CA [0.796] - 35.2fps SAMF AT [0.771] - 6.11fps

0.4

STAPLE [0.768] - 59.8fps DCF CA [0.751] - 82.3fps

0.3

SAMF CA [0.743] - 13fps

OPE Success plots on OTB100 - In-Plane Rotation (51)

0.9

Success rate

Precision

0.9

0.5

STAPLE CA [0.565] - 35.2fps STAPLE [0.549] - 59.8fps SAMF CA [0.533] - 13fps

0.4

DCF CA [0.517] - 82.3fps SAMF AT [0.516] - 6.11fps

0.3

DCF AT [0.725] - 34.2fps

DCF AT [0.507] - 34.2fps

SAMF [0.707] - 16.8fps DCF [0.686] - 333fps MOSSE CA [0.570] - 115fps

0.2 0.1

SAMF [0.498] - 16.8fps DCF [0.464] - 333fps MOSSE CA [0.430] - 115fps

0.2 0.1

MOSSE AT [0.559] - 18.9fps

MOSSE AT [0.412] - 18.9fps

MOSSE [0.404] - 355fps

0

0

10

20 30 40 Location error threshold

13

MOSSE [0.300] - 355fps

0

50

0

0.2

0.4 0.6 Overlap threshold

0.8

1

Fig. 7: Average precision and success on OTB-100 for videos with the attribute In-Plane Rotation

OPE Precision plots on OTB100 - Illumination Variation (38)

0.8

0.8

0.7

0.7

0.6

0.6

0.5

STAPLE CA [0.830] - 35.2fps STAPLE [0.782] - 59.8fps SAMF CA [0.776] - 13fps

0.4

DCF CA [0.759] - 82.3fps SAMF AT [0.756] - 6.11fps

0.3

OPE Success plots on OTB100 - Illumination Variation (38)

0.9

Success rate

Precision

0.9

0.5

STAPLE CA [0.618] - 35.2fps STAPLE [0.595] - 59.8fps SAMF CA [0.571] - 13fps

0.4

SAMF AT [0.541] - 6.11fps SAMF [0.516] - 16.8fps DCF CA [0.504] - 82.3fps

0.3

DCF AT [0.728] - 34.2fps SAMF [0.713] - 16.8fps DCF [0.693] - 333fps MOSSE CA [0.510] - 115fps

0.2 0.1

DCF AT [0.491] - 34.2fps

0.2

DCF [0.466] - 333fps MOSSE CA [0.396] - 115fps

0.1

MOSSE AT [0.466] - 18.9fps

MOSSE AT [0.371] - 18.9fps

MOSSE [0.344] - 355fps

0

0

10

20 30 40 Location error threshold

MOSSE [0.273] - 355fps

50

0

0

0.2

0.4 0.6 Overlap threshold

0.8

1

Fig. 8: Average precision and success on OTB-100 for videos with the attribute Illumination Variation

14

Matthias Mueller, Neil Smith, and Bernard Ghanem

OPE Precision plots on OTB100 - Low Resolution (9)

0.8

0.8

0.7

0.7

0.6

0.6

0.5

STAPLE CA [0.819] - 35.2fps SAMF AT [0.797] - 6.11fps

0.4

OPE Success plots on OTB100 - Low Resolution (9)

0.9

Success rate

Precision

0.9

DCF AT [0.744] - 34.2fps

0.5

STAPLE CA [0.451] - 35.2fps SAMF AT [0.443] - 6.11fps

0.4

SAMF CA [0.405] - 13fps

0.3

SAMF CA [0.733] - 13fps

0.3

0.2

STAPLE [0.695] - 59.8fps DCF [0.694] - 333fps MOSSE AT [0.677] - 18.9fps

STAPLE [0.399] - 59.8fps SAMF [0.348] - 16.8fps MOSSE AT [0.342] - 18.9fps

0.2

DCF AT [0.340] - 34.2fps

0.1

SAMF [0.632] - 16.8fps MOSSE CA [0.601] - 115fps

0.1

DCF CA [0.735] - 82.3fps

DCF CA [0.339] - 82.3fps DCF [0.304] - 333fps MOSSE CA [0.302] - 115fps

MOSSE [0.509] - 355fps

0

0

10

20 30 40 Location error threshold

MOSSE [0.253] - 355fps

0

50

0

0.2

0.4 0.6 Overlap threshold

0.8

1

Fig. 9: Average precision and success on OTB-100 for videos with the attribute Low Resolution

OPE Precision plots on OTB100 - Motion Blur (29)

0.8

0.8

0.7

0.7

0.6

0.6

0.5

STAPLE CA [0.749] - 35.2fps SAMF AT [0.748] - 6.11fps

0.4

OPE Success plots on OTB100 - Motion Blur (29)

0.9

Success rate

Precision

0.9

SAMF CA [0.742] - 13fps

0.5

SAMF CA [0.583] - 13fps SAMF AT [0.582] - 6.11fps

0.4

STAPLE CA [0.575] - 35.2fps STAPLE [0.540] - 59.8fps DCF AT [0.523] - 34.2fps

DCF CA [0.711] - 82.3fps

0.3

STAPLE [0.699] - 59.8fps DCF AT [0.684] - 34.2fps

0.3

0.2

SAMF [0.658] - 16.8fps DCF [0.576] - 333fps MOSSE CA [0.547] - 115fps

0.2

0.1

DCF CA [0.523] - 82.3fps SAMF [0.511] - 16.8fps MOSSE CA [0.454] - 115fps DCF [0.453] - 333fps MOSSE AT [0.443] - 18.9fps

0.1

MOSSE AT [0.538] - 18.9fps MOSSE [0.297] - 355fps

0

0

10

20 30 40 Location error threshold

MOSSE [0.263] - 355fps

50

0

0

0.2

0.4 0.6 Overlap threshold

0.8

1

Fig. 10: Average precision and success on OTB-100 for videos with the attribute Motion Blur

Context-Aware Correlation Filter Tracker - Supplementary Material

OPE Precision plots on OTB100 - Occlusion (49)

0.8

0.8

0.7

0.7

0.6

0.6

0.5

SAMF CA [0.750] - 13fps STAPLE CA [0.739] - 35.2fps

0.5

STAPLE CA [0.558] - 35.2fps SAMF CA [0.550] - 13fps

0.4

SAMF AT [0.738] - 6.11fps

0.3

STAPLE [0.728] - 59.8fps SAMF [0.724] - 16.8fps DCF CA [0.670] - 82.3fps

0.3

0.2

DCF AT [0.648] - 34.2fps

0.4

0.2

STAPLE [0.543] - 59.8fps SAMF [0.529] - 16.8fps SAMF AT [0.515] - 6.11fps DCF CA [0.468] - 82.3fps

DCF [0.610] - 333fps MOSSE CA [0.511] - 115fps

0.1

OPE Success plots on OTB100 - Occlusion (49)

0.9

Success rate

Precision

0.9

DCF AT [0.450] - 34.2fps DCF [0.433] - 333fps MOSSE CA [0.393] - 115fps

0.1

MOSSE AT [0.481] - 18.9fps

MOSSE AT [0.371] - 18.9fps

MOSSE [0.351] - 355fps

0

0

10

20 30 40 Location error threshold

15

MOSSE [0.270] - 355fps

0

50

0

0.2

0.4 0.6 Overlap threshold

0.8

1

Fig. 11: Average precision and success on OTB-100 for videos with the attribute Occlusion

OPE Precision plots on OTB100 - Out-of-Plane Rotation (63)

0.9

0.8

0.8

0.7

0.7

0.6

0.6

0.5

Success rate

Precision

0.9

SAMF CA [0.771] - 13fps STAPLE CA [0.764] - 35.2fps

0.5

SAMF CA [0.560] - 13fps STAPLE CA [0.554] - 35.2fps

0.4

SAMF AT [0.747] - 6.11fps

0.3

STAPLE [0.738] - 59.8fps SAMF [0.732] - 16.8fps DCF AT [0.693] - 34.2fps

0.3

0.2

DCF CA [0.692] - 82.3fps

0.4

0.2

STAPLE [0.534] - 59.8fps SAMF [0.518] - 16.8fps SAMF AT [0.510] - 6.11fps DCF CA [0.477] - 82.3fps

DCF [0.665] - 333fps MOSSE CA [0.525] - 115fps

0.1

OPE Success plots on OTB100 - Out-of-Plane Rotation (63)

DCF AT [0.472] - 34.2fps DCF [0.448] - 333fps MOSSE CA [0.397] - 115fps

0.1

MOSSE AT [0.486] - 18.9fps

MOSSE AT [0.367] - 18.9fps

MOSSE [0.385] - 355fps

0

0

10

20 30 40 Location error threshold

MOSSE [0.277] - 355fps

50

0

0

0.2

0.4 0.6 Overlap threshold

0.8

1

Fig. 12: Average precision and success on OTB-100 for videos with the attribute Out-of-Plane Rotation

16

Matthias Mueller, Neil Smith, and Bernard Ghanem

OPE Precision plots on OTB100 - Out-of-View (14)

0.8

0.8

0.7

0.7

0.6

0.6

0.5

SAMF AT [0.709] - 6.11fps SAMF CA [0.689] - 13fps

0.4

STAPLE [0.668] - 59.8fps STAPLE CA [0.666] - 35.2fps

0.3

SAMF [0.584] - 16.8fps DCF CA [0.578] - 82.3fps DCF AT [0.561] - 34.2fps

0.2

DCF [0.487] - 333fps MOSSE AT [0.389] - 18.9fps

0.1

OPE Success plots on OTB100 - Out-of-View (14)

0.9

Success rate

Precision

0.9

0.5

SAMF AT [0.524] - 6.11fps SAMF CA [0.505] - 13fps

0.4

STAPLE CA [0.485] - 35.2fps

0.3

STAPLE [0.476] - 59.8fps SAMF [0.438] - 16.8fps DCF AT [0.429] - 34.2fps

0.2

DCF CA [0.428] - 82.3fps DCF [0.390] - 333fps MOSSE AT [0.314] - 18.9fps

0.1

MOSSE CA [0.349] - 115fps

MOSSE CA [0.305] - 115fps

MOSSE [0.261] - 355fps

0

0

10

20 30 40 Location error threshold

MOSSE [0.204] - 355fps

0

50

0

0.2

0.4 0.6 Overlap threshold

0.8

1

Fig. 13: Average precision and success on OTB-100 for videos with the attribute Out-of-View

OPE Precision plots on OTB100 - Scale Variation (64)

0.8

0.8

0.7

0.7

0.6

0.6

0.5

STAPLE CA [0.762] - 35.2fps SAMF CA [0.758] - 13fps

0.4

OPE Success plots on OTB100 - Scale Variation (64)

0.9

Success rate

Precision

0.9

0.5

STAPLE CA [0.545] - 35.2fps SAMF CA [0.525] - 13fps

0.4

STAPLE [0.520] - 59.8fps SAMF AT [0.495] - 6.11fps

0.3

STAPLE [0.727] - 59.8fps DCF CA [0.695] - 82.3fps

0.3

SAMF [0.471] - 16.8fps DCF CA [0.436] - 82.3fps

0.2

SAMF [0.693] - 16.8fps DCF AT [0.682] - 34.2fps

0.2

SAMF AT [0.751] - 6.11fps

DCF [0.626] - 333fps MOSSE CA [0.501] - 115fps

0.1

DCF AT [0.430] - 34.2fps DCF [0.394] - 333fps MOSSE CA [0.366] - 115fps

0.1

MOSSE AT [0.469] - 18.9fps

MOSSE AT [0.346] - 18.9fps

MOSSE [0.349] - 355fps

0

0

10

20 30 40 Location error threshold

MOSSE [0.256] - 355fps

50

0

0

0.2

0.4 0.6 Overlap threshold

0.8

1

Fig. 14: Average precision and success on OTB-100 for videos with the attribute Scale Variation

Context-Aware Correlation Filter Tracker - Supplementary Material

17

References 1. J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In European Conference on Computer Vision, ECCV, 2012. 2. J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI, 2015., 2015. 3. P. Liang, E. Blasch, and H. Ling. Encoding color information for visual tracking: Algorithms and benchmark. Image Processing, IEEE . . . , pages 1–14, 2015. 4. M. Mueller, N. Smith, and B. Ghanem. A Benchmark and Simulator for UAV Tracking, pages 445–461. Springer International Publishing, Cham, 2016. 5. Y. Wu, J. Lim, and M. H. Yang. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1834–1848, Sept 2015.