SMILe: Leveraging Submodular Mutual Information for Robust Few-Shot Object Detection

CARAML Lab, The University of Texas at Dallas
ECCV 2024

SMILe introduces a family of Submodular combinatorial objectives based on Submodular Mutual Information designed to tackle the challenge of Class Confusion and Catastrophic Forgetting in Few-Shot Object Detection (FSOD) tasks.

Abstract

Confusion and forgetting of object classes have been challenges of prime interest in Few-Shot Object Detection (FSOD). To overcome these pitfalls in metric learning based FSOD techniques, we introduce a novel Submodular Mutual Information Learning (SMILe) framework for loss functions which adopts combinatorial mutual information functions as learning objectives to enforce learning of well-separated feature clusters between the base and novel classes. Additionally, the joint objective in SMILE minimizes the total submodular information contained in a class leading to discriminative feature clusters. The combined effect of this joint objective demonstrates significant improvements in class confusion and forgetting in FSOD. Further, we show that SMILe generalizes to several existing approaches in FSOD, improving their performance, agnostic of the backbone architecture. Experiments on popular FSOD benchmarks, PASCAL-VOC and MS-COCO, show that our approach generalizes to State-of-the-Art (SoTA) approaches improving their novel class performance by up to 5.7% (3.3 mAP points) and 5.4% (2.6 mAP points) on the 10-shot setting of VOC (split 3) and 30-shot setting of COCO datasets respectively. Our experiments also demonstrate better retention of base class performance and up to 2Γ— faster convergence over existing approaches, agnostic of the underlying architecture.

The SMILe Framework

SMILe introduces a paradigm shift in FSOD by adopting a combinatorial viewpoint: the base dataset \(D_{base} = \{A_1^b, A_2^b, \ldots, A_{|C_b|}^b\}\) contains abundant training examples from the \(|C_b|\) base classes, while the novel dataset \(D_{novel} = \{A_1^n, A_2^n, \ldots, A_{|C_n|}^n\}\) contains only K-shot (\(|A_i^n| = K\) for \(i \in [1, |C_n|]\)) training examples from the \(|C_n|\) novel classes.
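As a hedged illustration of this K-shot construction, the sketch below samples exactly \(K\) annotated instances per novel class to form \(D_{novel}\). The class names, pool sizes, and function names are our own placeholders, not the benchmark's:

```python
import random

# Hypothetical annotation pools: class name -> list of annotated instances.
# Counts are illustrative only, not from PASCAL-VOC or MS-COCO.
base_pool = {"car": list(range(500)), "person": list(range(800))}
novel_pool = {"cow": list(range(40)), "sofa": list(range(25))}

def build_few_shot_split(pool, k, seed=0):
    """Sample exactly K annotated instances per class, so |A_i^n| = K."""
    rng = random.Random(seed)
    return {cls: rng.sample(insts, k) for cls, insts in pool.items()}

D_novel = build_few_shot_split(novel_pool, k=10)
assert all(len(v) == 10 for v in D_novel.values())
```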
The stark natural imbalance between the number of samples in the base and novel classes leads to confusion between existing (base) and newly added (novel) few-shot classes. We trace the root cause of this confusion to a large inter-class bias between the base and novel (K-shot) classes, which results in one or more base classes being misclassified as novel ones.

Further, in the quest to learn the novel classes \(C_n\), the model tends to forget feature representations corresponding to the previously learnt base classes. Even though several FSOD techniques adopt a replay strategy (providing K-shot examples of the base classes during few-shot adaptation), the lack of discriminative features still results in catastrophic forgetting of the base classes.

Overcoming Forgetting and Confusion

Figure 1: Applying SMILe to any existing approach that exhibits (a) confusion and forgetting enforces (b) inter-cluster separation, removing class confusion, and (c) intra-class compactness, mitigating forgetting.



SMILe exploits the diminishing-returns property of submodular functions to define a novel family of combinatorial objective (loss) functions \(L_{comb}(\theta)\) that enforce orthogonality in the feature space when applied to Region-of-Interest (RoI) features in FSOD models. The loss \(L_{comb}(\theta)\) decomposes into two major components: \(L_{comb}^{inter}\), which minimizes the inter-class bias between base and novel classes, and \(L_{comb}^{intra}\), which maximizes intra-class compactness within the abundant classes.

\(𝑳_{π’„π’π’Žπ’ƒ}^{π’Šπ’π’•π’†π’“}\) minimizes the mutual information between classes in \(𝐢_𝑛\), minimizing inter-cluster overlaps between the novel classes. This has been shown to be effective in mitigating class confusion in FSOD.

\[ L_{comb}^{inter}(\theta) = \underset{\substack{b \in C_b \\ n \in C_n}}{\sum} I_f(A_b, A_n; \theta) + \underset{\substack{i, j \in C_n \\ i \neq j}}{\sum} I_f(A_i, A_j; \theta) = \underset{\substack{i \in (C_b \cup C_n) \\ j \in C_n : i \neq j}}{\sum}I_f(A_i, A_j; \theta)\]
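As a hedged illustration (not necessarily the instantiation used in the paper), under a graph-cut submodular function with a cosine-similarity kernel \(s_{ij}\), the SMI between two disjoint classes reduces to \(I_f(A, B) = 2\lambda \sum_{i \in A, j \in B} s_{ij}\). The sketch below evaluates \(L_{comb}^{inter}\) on per-class RoI feature matrices; all function names and the dict-of-arrays layout are our own placeholders:

```python
import numpy as np

def cosine_sim(X, Y):
    """Pairwise cosine-similarity kernel s_ij between two feature sets."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T

def graph_cut_smi(A, B, lam=1.0):
    """I_f(A, B) = 2 * lam * sum_{i in A, j in B} s_ij for the graph-cut
    function; minimizing it pushes the two feature clusters apart."""
    return 2.0 * lam * cosine_sim(A, B).sum()

def l_inter(class_feats, base_ids, novel_ids, lam=1.0):
    """Sum I_f(A_i, A_j) over i in C_b U C_n and j in C_n with i != j."""
    total = 0.0
    for i in base_ids + novel_ids:
        for j in novel_ids:
            if i != j:
                total += graph_cut_smi(class_feats[i], class_feats[j], lam)
    return total
```

Because \(I_f\) is symmetric, the novel-novel pairs are counted once per ordered pair, matching the combined summation on the right-hand side above.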

\(𝐋_{πœπ¨π¦π›}^{𝐒𝐧𝐭𝐫𝐚}\) minimizes the total submodular information within samples in each class in \(𝐢_𝑏 \cup C_𝑛\) boosting base class performance asserting the mitigation of catastrophic forgetting.

\[L_{comb}^{intra}(\theta) = {\underset{b \in C_b}{\sum}} f(A_b, \theta) + {\underset{n \in C_n}{\sum}} f(A_n, \theta) = {\underset{k \in (C_b \cup C_n)}{\sum}} f(A_k, \theta)\]
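Continuing the hedged graph-cut illustration, the total information of a class \(A\) within a batch \(V\) takes the form \(f(A) = \sum_{i \in V, j \in A} s_{ij} - \lambda \sum_{i, j \in A} s_{ij}\); the \(-\lambda\) term rewards mutually similar (compact) features within the class. The sketch below is one possible instantiation, with our own placeholder names:

```python
import numpy as np

def cosine_sim(X, Y):
    """Pairwise cosine-similarity kernel s_ij between two feature sets."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T

def graph_cut_total_info(A, V, lam=1.0):
    """f(A) = sum_{i in V, j in A} s_ij - lam * sum_{i, j in A} s_ij.
    Minimizing the -lam term fosters intra-class compactness."""
    return cosine_sim(V, A).sum() - lam * cosine_sim(A, A).sum()

def l_intra(class_feats, lam=1.0):
    """Sum f(A_k) over all classes k in C_b U C_n; V is the full batch."""
    V = np.concatenate(list(class_feats.values()), axis=0)
    return sum(graph_cut_total_info(A, V, lam) for A in class_feats.values())
```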

Encapsulating \(𝐿_{π‘π‘œπ‘šπ‘}^{π‘–π‘›π‘‘π‘Ÿπ‘Ž}\) and \(𝐿_{π‘π‘œπ‘šπ‘}^{π‘–π‘›π‘‘π‘’π‘Ÿ}\) we define a joint objective \(𝐿_{π‘π‘œπ‘šπ‘}(\theta)\) which tackles both the challenges of confusion and forgetting.

\[\begin{split} L_{comb}(\theta) =& (1 - \eta) L_{comb}^{intra}(\theta) + \eta L_{comb}^{inter}(\theta) \\ =& {\underset{i \in C_b \cup C_n}{\sum}} \Biggl[(1 - \eta) f(A_i, \theta) + \eta \underset{\substack{j \in C_n \\ i \neq j}}{\sum}I_f(A_i, A_j; \theta) \Biggr] \end{split}\]
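The \(\eta\)-weighted combination can be sketched end to end under the same hedged graph-cut assumption (again, a surrogate instantiation with our own placeholder names, not the paper's exact implementation):

```python
import numpy as np

def cosine_sim(X, Y):
    """Pairwise cosine-similarity kernel s_ij between two feature sets."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T

def l_comb(class_feats, novel_ids, eta=0.5, lam=1.0):
    """(1 - eta) * L_intra + eta * L_inter under a graph-cut instantiation."""
    V = np.concatenate(list(class_feats.values()), axis=0)
    # Intra term: total graph-cut information per class (compactness).
    intra = sum(cosine_sim(V, A).sum() - lam * cosine_sim(A, A).sum()
                for A in class_feats.values())
    # Inter term: graph-cut SMI over ordered pairs (i, j), j novel, i != j.
    inter = sum(2.0 * lam * cosine_sim(class_feats[i], class_feats[j]).sum()
                for i in class_feats for j in novel_ids if i != j)
    return (1.0 - eta) * intra + eta * inter
```

By linearity, \(\eta\) interpolates smoothly between the pure compactness objective (\(\eta = 0\)) and the pure separation objective (\(\eta = 1\)).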

The combined effect of \(L_{comb}^{inter}\) and \(L_{comb}^{intra}\) minimizes inter-class bias and intra-class variance, reducing both class confusion and catastrophic forgetting.

Does SMILe Overcome Catastrophic Forgetting?


One of the most significant challenges in FSOD is eliminating catastrophic forgetting, which manifests as a degradation in performance on classes in \(C_b\) while learning classes in \(C_n\). This occurs primarily due to the lack of discriminative feature representations for instances in \(D_{base}\) during the few-shot adaptation stage (stage 2). In Figure 2(a) we plot base-class performance against the number of training iterations for the existing SoTA methods AGCM and DiGeo. Applying SMILe to SoTA approaches such as DiGeo yields base-class performance that even exceeds the roofline, establishing the effectiveness of \(L_{comb}\) in overcoming catastrophic forgetting.

Forgetting and Convergence

Figure 2: Resilience to catastrophic forgetting and faster convergence in SMILe over SoTA approaches. (a) shows that the combinatorial losses in SMILe are robust to catastrophic forgetting, while (b) shows that the objectives in SMILe result in faster convergence over SoTA FSOD methods (AGCM and DiGeo).

Additionally, in Figure 2(b) we observe that applying SMILe objectives to existing methods results in up to 2Γ— faster convergence. This is primarily due to the adoption of a combinatorial viewpoint, which replaces the instance-based objectives popular in the literature.

Resilience to Class-Confusion by Objectives in SMILe

Figure 3 highlights the effectiveness of the proposed SMILe framework in mitigating class confusion through confusion-matrix plots. We compare the confusion between classes in \(C_b \cup C_n\) for the SoTA approaches AGCM and DiGeo, before and after introducing the combinatorial objectives in SMILe.

We show that AGCM+SMILe exhibits 11% lower confusion than AGCM, and DiGeo+SMILe shows 4% lower confusion than DiGeo. This demonstrates the efficacy of the combinatorial objective \(L_{comb}^{inter}\) in mitigating inter-class bias, thereby reducing confusion between classes.

Confusion Matrix plots across methods

Figure 3: SMILe demonstrates 11% lower confusion over AGCM (a,b) and 4% lower confusion over DiGeo (c,d). Only significant numbers are highlighted.


Further, the empirical results reported in our paper show that SMILe significantly reduces the class confusion arising from large inter-class bias between base and novel classes.

Object Detection Results
Quantitative results on the PASCAL VOC 1-, 5-, and 10-shot settings and the MS-COCO 10- and 30-shot settings are provided in our paper.

Citation

If you like our work or use it in your research, please feel free to cite it as below.

        @inproceedings{smile,
          title = {SMILe: Leveraging Submodular Mutual Information for Robust Few-Shot Object Detection},
          author = {Anay Majee and Ryan Sharp and Rishabh Iyer},
          booktitle = {European Conference on Computer Vision (ECCV)},
          year = {2024},
        }