SMILe: Leveraging Submodular Mutual Information for Robust Few-Shot Object Detection

CARAML Lab, The University of Texas at Dallas
ECCV 2024

SMILe introduces a family of Submodular combinatorial objectives based on Submodular Mutual Information designed to tackle the challenge of Class Confusion and Catastrophic Forgetting in Few-Shot Object Detection (FSOD) tasks.

Abstract

Confusion and forgetting of object classes have been challenges of prime interest in Few-Shot Object Detection (FSOD). To overcome these pitfalls in metric learning based FSOD techniques, we introduce a novel Submodular Mutual Information Learning (SMILe) framework for loss functions which adopts combinatorial mutual information functions as learning objectives to enforce learning of well-separated feature clusters between the base and novel classes. Additionally, the joint objective in SMILE minimizes the total submodular information contained in a class leading to discriminative feature clusters. The combined effect of this joint objective demonstrates significant improvements in class confusion and forgetting in FSOD. Further, we show that SMILe generalizes to several existing approaches in FSOD, improving their performance, agnostic of the backbone architecture. Experiments on popular FSOD benchmarks, PASCAL-VOC and MS-COCO, show that our approach generalizes to State-of-the-Art (SoTA) approaches improving their novel class performance by up to 5.7% (3.3 mAP points) and 5.4% (2.6 mAP points) on the 10-shot setting of VOC (split 3) and 30-shot setting of COCO datasets respectively. Our experiments also demonstrate better retention of base class performance and up to 2Γ— faster convergence over existing approaches, agnostic of the underlying architecture.

The SMILe Framework

SMILe introduces a paradigm shift in FSOD by adopting a combinatorial viewpoint: the base dataset \(D_{base} = \{A_1^b, A_2^b, \ldots, A_{|C_b|}^b\}\) contains abundant training examples from the \(|C_b|\) base classes, while the novel dataset \(D_{novel} = \{A_1^n, A_2^n, \ldots, A_{|C_n|}^n\}\) contains only K-shot (\(|A_i^n| = K\) for \(i \in [1, |C_n|]\)) training examples from the \(|C_n|\) novel classes.
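As a hedged illustration of this K-shot construction, the sketch below samples exactly \(K\) annotated instances per novel class to form \(D_{novel}\). The class names, pool sizes, and function names are our own placeholders, not the benchmark's:

```python
import random

# Hypothetical annotation pools: class name -> list of annotated instances.
# Counts are illustrative only, not from PASCAL-VOC or MS-COCO.
base_pool = {"car": list(range(500)), "person": list(range(800))}
novel_pool = {"cow": list(range(40)), "sofa": list(range(25))}

def build_few_shot_split(pool, k, seed=0):
    """Sample exactly K annotated instances per class, so |A_i^n| = K."""
    rng = random.Random(seed)
    return {cls: rng.sample(insts, k) for cls, insts in pool.items()}

D_novel = build_few_shot_split(novel_pool, k=10)
assert all(len(v) == 10 for v in D_novel.values())
```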
The stark natural imbalance between the number of samples in the base and novel classes leads to confusion between existing (base) and newly added (novel) few-shot classes. We trace the root cause of this confusion to a large inter-class bias between the base and novel (K-shot) classes, which results in one or more base classes being misclassified as novel ones.

Further, in the quest to learn the novel classes \(C_n\), the model tends to forget feature representations corresponding to the previously learnt base classes. Even though several FSOD techniques adopt a replay strategy (providing K-shot examples of the base classes during few-shot adaptation), the lack of discriminative features still results in catastrophic forgetting of the base classes.

Overcoming Forgetting and Confusion

Figure 1: Applying SMILe to any existing approach that exhibits (a) confusion and forgetting enforces (b) inter-cluster separation, removing class confusion, and (c) intra-class compactness, mitigating forgetting.



SMILe exploits the diminishing-returns property of submodular functions to define a novel family of combinatorial objective (loss) functions \(L_{comb}(\theta)\) that enforce orthogonality in the feature space when applied to Region-of-Interest (RoI) features in FSOD models. The loss \(L_{comb}(\theta)\) decomposes into two major components: \(L_{comb}^{inter}\), which minimizes the inter-class bias between base and novel classes, and \(L_{comb}^{intra}\), which maximizes intra-class compactness within the abundant classes.

\(𝑳_{π’„π’π’Žπ’ƒ}^{π’Šπ’π’•π’†π’“}\) minimizes the mutual information between classes in \(𝐢_𝑛\), minimizing inter-cluster overlaps between the novel classes. This has been shown to be effective in mitigating class confusion in FSOD.

\[ L_{comb}^{inter}(\theta) = \underset{\substack{b \in C_b \\ n \in C_n}}{\sum} I_f(A_b, A_n; \theta) + \underset{\substack{i, j \in C_n \\ i \neq j}}{\sum} I_f(A_i, A_j; \theta) = \underset{\substack{i \in (C_b \cup C_n) \\ j \in C_n : i \neq j}}{\sum}I_f(A_i, A_j; \theta)\]
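As a hedged illustration (not necessarily the instantiation used in the paper), under a graph-cut submodular function with a cosine-similarity kernel \(s_{ij}\), the SMI between two disjoint classes reduces to \(I_f(A, B) = 2\lambda \sum_{i \in A, j \in B} s_{ij}\). The sketch below evaluates \(L_{comb}^{inter}\) on per-class RoI feature matrices; all function names and the dict-of-arrays layout are our own placeholders:

```python
import numpy as np

def cosine_sim(X, Y):
    """Pairwise cosine-similarity kernel s_ij between two feature sets."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T

def graph_cut_smi(A, B, lam=1.0):
    """I_f(A, B) = 2 * lam * sum_{i in A, j in B} s_ij for the graph-cut
    function; minimizing it pushes the two feature clusters apart."""
    return 2.0 * lam * cosine_sim(A, B).sum()

def l_inter(class_feats, base_ids, novel_ids, lam=1.0):
    """Sum I_f(A_i, A_j) over i in C_b U C_n and j in C_n with i != j."""
    total = 0.0
    for i in base_ids + novel_ids:
        for j in novel_ids:
            if i != j:
                total += graph_cut_smi(class_feats[i], class_feats[j], lam)
    return total
```

Because \(I_f\) is symmetric, the novel-novel pairs are counted once per ordered pair, matching the combined summation on the right-hand side above.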

\(𝐋_{πœπ¨π¦π›}^{𝐒𝐧𝐭𝐫𝐚}\) minimizes the total submodular information within samples in each class in \(𝐢_𝑏 \cup C_𝑛\) boosting base class performance asserting the mitigation of catastrophic forgetting.

\[L_{comb}^{intra}(\theta) = {\underset{b \in C_b}{\sum}} f(A_b, \theta) + {\underset{n \in C_n}{\sum}} f(A_n, \theta) = {\underset{k \in (C_b \cup C_n)}{\sum}} f(A_k, \theta)\]
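Continuing the hedged graph-cut illustration, the total information of a class \(A\) within a batch \(V\) takes the form \(f(A) = \sum_{i \in V, j \in A} s_{ij} - \lambda \sum_{i, j \in A} s_{ij}\); the \(-\lambda\) term rewards mutually similar (compact) features within the class. The sketch below is one possible instantiation, with our own placeholder names:

```python
import numpy as np

def cosine_sim(X, Y):
    """Pairwise cosine-similarity kernel s_ij between two feature sets."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T

def graph_cut_total_info(A, V, lam=1.0):
    """f(A) = sum_{i in V, j in A} s_ij - lam * sum_{i, j in A} s_ij.
    Minimizing the -lam term fosters intra-class compactness."""
    return cosine_sim(V, A).sum() - lam * cosine_sim(A, A).sum()

def l_intra(class_feats, lam=1.0):
    """Sum f(A_k) over all classes k in C_b U C_n; V is the full batch."""
    V = np.concatenate(list(class_feats.values()), axis=0)
    return sum(graph_cut_total_info(A, V, lam) for A in class_feats.values())
```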

Encapsulating \(𝐿_{π‘π‘œπ‘šπ‘}^{π‘–π‘›π‘‘π‘Ÿπ‘Ž}\) and \(𝐿_{π‘π‘œπ‘šπ‘}^{π‘–π‘›π‘‘π‘’π‘Ÿ}\) we define a joint objective \(𝐿_{π‘π‘œπ‘šπ‘}(\theta)\) which tackles both the challenges of confusion and forgetting.

\[\begin{split} L_{comb}(\theta) =& (1 - \eta) L_{comb}^{intra}(\theta) + \eta L_{comb}^{inter}(\theta) \\ =& {\underset{i \in C_b \cup C_n}{\sum}} \Biggl[(1 - \eta) f(A_i, \theta) + \eta \underset{\substack{j \in C_n \\ i \neq j}}{\sum}I_f(A_i, A_j; \theta) \Biggr] \end{split}\]
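The \(\eta\)-weighted combination can be sketched end to end under the same hedged graph-cut assumption (again, a surrogate instantiation with our own placeholder names, not the paper's exact implementation):

```python
import numpy as np

def cosine_sim(X, Y):
    """Pairwise cosine-similarity kernel s_ij between two feature sets."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T

def l_comb(class_feats, novel_ids, eta=0.5, lam=1.0):
    """(1 - eta) * L_intra + eta * L_inter under a graph-cut instantiation."""
    V = np.concatenate(list(class_feats.values()), axis=0)
    # Intra term: total graph-cut information per class (compactness).
    intra = sum(cosine_sim(V, A).sum() - lam * cosine_sim(A, A).sum()
                for A in class_feats.values())
    # Inter term: graph-cut SMI over ordered pairs (i, j), j novel, i != j.
    inter = sum(2.0 * lam * cosine_sim(class_feats[i], class_feats[j]).sum()
                for i in class_feats for j in novel_ids if i != j)
    return (1.0 - eta) * intra + eta * inter
```

By linearity, \(\eta\) interpolates smoothly between the pure compactness objective (\(\eta = 0\)) and the pure separation objective (\(\eta = 1\)).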

The combined effect of \(L_{comb}^{inter}\) and \(L_{comb}^{intra}\) minimizes inter-class bias and intra-class variance, reducing both class confusion and catastrophic forgetting.

Does SMILe Overcome Catastrophic Forgetting?


One of the most significant challenges in FSOD is eliminating catastrophic forgetting, which manifests as a degradation in performance on classes in \(C_b\) while learning classes in \(C_n\). This occurs primarily due to the lack of discriminative feature representations for instances in \(D_{base}\) during the few-shot adaptation stage (stage 2). In Figure 2(a) we plot base-class performance against the number of training iterations for the existing SoTA methods AGCM and DiGeo. Applying SMILe to SoTA approaches such as DiGeo yields base-class performance that even exceeds the roofline, establishing the effectiveness of \(L_{comb}\) in overcoming catastrophic forgetting.

Forgetting and Convergence

Figure 2: Resilience to catastrophic forgetting and faster convergence in SMILe over SoTA approaches. (a) shows that the combinatorial losses in SMILe are robust to catastrophic forgetting, while (b) shows that the objectives in SMILe result in faster convergence over SoTA FSOD methods (AGCM and DiGeo).

Additionally, in Figure 2(b) we observe that applying SMILe objectives to existing methods results in up to 2Γ— faster convergence. This is primarily due to the adoption of a combinatorial viewpoint, which replaces the instance-based objectives popular in the literature.

Resilience to Class-Confusion by Objectives in SMILe

Figure 3 highlights the effectiveness of the proposed SMILe framework in mitigating class confusion through confusion-matrix plots. We compare the confusion between classes in \(C_b \cup C_n\) for the SoTA approaches AGCM and DiGeo, before and after introducing the combinatorial objectives in SMILe.

We show that AGCM+SMILe exhibits 11% lower confusion than AGCM, and DiGeo+SMILe shows 4% lower confusion than DiGeo. This demonstrates the efficacy of the combinatorial objective \(L_{comb}^{inter}\) in mitigating inter-class bias, thereby reducing confusion between classes.

Confusion Matrix plots across methods

Figure 3: SMILe demonstrates 11% lower confusion over AGCM (a,b) and 4% lower confusion over DiGeo (c,d). Only significant numbers are highlighted.


Further, the empirical results reported in our paper show that SMILe significantly reduces the class confusion arising from large inter-class bias between base and novel classes.

Object Detection Results
Quantitative results on the PASCAL VOC 1-, 5-, and 10-shot settings and the MS-COCO 10- and 30-shot settings are provided in our paper.

Citation

If you like our work or use it in your research, please feel free to cite it as below.

        @inproceedings{smile,
          title = {SMILe: Leveraging Submodular Mutual Information for Robust Few-Shot Object Detection},
          author = {Anay Majee and Ryan Sharp and Rishabh Iyer},
          booktitle = {European Conference on Computer Vision (ECCV)},
          year = {2024},
        }