Looking Beyond the Known: Towards a Data Discovery Guided Open-World Object Detection

CARAML Lab, The University of Texas at Dallas
NeurIPS 2025
CROWD pipeline

Overall Architecture of CROWD showing our novel combinatorial data-discovery guided representation learning approach to (a) identify unknown objects and (b) learn distinguishable representations of both known and unknown objects.

Abstract

Open-World Object Detection (OWOD) enriches traditional object detectors by enabling continual discovery and integration of unknown objects via human guidance. However, existing OWOD approaches frequently suffer from semantic confusion between known and unknown classes, alongside catastrophic forgetting, leading to diminished unknown recall and degraded known-class accuracy. To overcome these challenges, we propose Combinatorial Open-World Detection (CROWD), a unified framework reformulating unknown object discovery and adaptation as an interwoven combinatorial (set-based) data-discovery (CROWD-Discover) and representation learning (CROWD-Learn) task. CROWD-Discover strategically mines unknown instances by maximizing Submodular Conditional Gain (SCG) functions, selecting representative examples distinctly dissimilar from known objects. Subsequently, CROWD-Learn employs novel combinatorial objectives that jointly disentangle known and unknown representations while maintaining discriminative coherence among known classes, thus mitigating confusion and forgetting. Extensive evaluations on OWOD benchmarks illustrate that CROWD achieves improvements of 2.83% and 2.05% in known-class accuracy on M-OWODB and S-OWODB, respectively, and nearly 2.4x the unknown recall of leading baselines.

The CROWD Framework

CROWD casts Open-World Object Detection (OWOD) as a set-based discovery and learning problem. For each task \(t\), we view the known object classes as a collection of sets \(K^t\) and aim to group all candidate unknowns into a single pseudo-labeled set \(U^t\), while treating everything else as background \(B^t\). This viewpoint surfaces two unique challenges in OWOD:

  1. How to identify instances of unlabeled unknown objects \(U^t\) given labeled examples of only the known ones in \(K^t\)?

  2. How to effectively learn representations of the currently known objects without forgetting the previously known ones (classes introduced in \(T_i\), where \(i < t\))?


CROWD Flow Diagram

Figure 1: Interleaved Data-Discovery and Representation Learning in CROWD on an incoming task \(T_t\). CROWD takes as input the model weights from \(T_{t-1}\) and a small replay buffer of previously known classes \(\hat{K}^{t-1}\), applies (a) CROWD-D to discover unknown RoIs and (b) CROWD-L to learn discriminative features of both known and unknown instances, and returns an updated model \(h^{t+1}\) and the current-task replay buffer \(\hat{K}^t\).


CROWD achieves this in two stages, namely CROWD-Discover (a.k.a. CROWD-D) and CROWD-Learn (a.k.a. CROWD-L), as shown in Figure 1. Given an incoming task \(T_t\), we first train \(h^t(\cdot; \theta)\) on the currently known classes in \(D^t\). At this point, CROWD-D operates on the frozen weights of \(h^t\) and a small replay buffer (typically containing examples from both previously known and currently introduced objects) to discover highly representative proposals of the unknown classes \(U^t\).

Subsequently, CROWD-L introduces a novel combinatorial learning strategy to rapidly finetune \(h^t\) on this replay buffer (we adopt the predefined buffer from Joseph et al., 2021), distinguishing between the known classes \(K^t\) and the unknowns \(U^t\) while preserving discriminative features of the previously known classes.
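For intuition, the sketch below outlines this interleaved per-task flow in plain Python. The stage functions are passed in as callables and all names are illustrative placeholders, not the released implementation.

```python
from typing import Any, Callable, Tuple

def crowd_task_step(
    model: Any,                       # h^t carried over from task T_{t-1}
    task_data: Any,                   # annotations and RoI proposals for task T_t
    replay_buffer: Any,               # exemplars of previously known classes
    train_detector: Callable,         # standard detector training on the knowns in D^t
    crowd_discover: Callable,         # CROWD-D: SCG-guided mining of unknown RoIs U^t
    crowd_learn: Callable,            # CROWD-L: combinatorial finetuning on the buffer
    update_replay_buffer: Callable,   # exemplar selection for the current task
) -> Tuple[Any, Any]:
    """One incremental task of CROWD (Figure 1), expressed as a control-flow sketch."""
    # 1) Train h^t on the currently known classes.
    model = train_detector(model, task_data)
    # 2) With h^t frozen, mine representative unknown RoIs that are dissimilar
    #    to the known (and background) instances.
    unknowns = crowd_discover(model, replay_buffer, task_data)
    # 3) Briefly finetune on the replay buffer so known classes stay separable
    #    from each other and from the newly mined unknown set.
    model = crowd_learn(model, replay_buffer, unknowns)
    # 4) Refresh the replay buffer with exemplars from the current task.
    replay_buffer = update_replay_buffer(replay_buffer, task_data)
    return model, replay_buffer
```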



Discovering Unknown Backgrounds: CROWD-D


CROWD introduces a novel combinatorial viewpoint in OWOD by modeling the identification of unknown instances of a given task as a data-discovery problem (CROWD-D), selecting unknown RoIs that maximize the SCG between the selected set and the known object instances.

CROWD-D Selection

Figure 2: Illustration of the data-discovery pipeline in CROWD-D on a synthetic dataset with \(|\mathtt{R}| = 500\), a budget \(\mathtt{k} = 10\), and Graph-Cut as the underlying submodular function. CROWD-D selects \(U^t\) whose instances are dissimilar to both background \(B^t\) and known \(K^t\) instances.



CROWD algorithm

Algorithm 1: CROWD-D, the data-discovery algorithm in CROWD.

Given a set of RoI proposals \(\mathtt{R}\) and a submodular function \(f\), we define the data-discovery task (Algorithm 1) as a targeted selection problem that selects a set of unknown instances \(U^t\) from \(\mathcal{V} = \mathtt{R} \setminus K^t\) by maximizing the SCG \(H_f\) given a query set comprising the known \(K^t\) and background \(B^t\) instances (line 8 in Algorithm 1). This proceeds in two key steps:

  1. Identify True Backgrounds: This step follows the intuition that exemplars largely different from \(K^t\) are potential true backgrounds. We denote these as \(B^t\) (shown in red in Figure 2(b)); they are identified by maximizing the SCG between examples in \(\mathtt{R} \setminus K^t\) (denoted \(\mathcal{V}\)) and the knowns \(K^t\).

  2. Mine Unknowns: By the definition of SCG, the examples selected into \(U^t\) are largely dissimilar to examples in \(K^t \cup B^t\), indicating that they are neither background objects nor visually similar to known objects, as shown in Figure 2(c).


The exclusion thresholds \(\tau_e\) and \(\tau_b\) are empirically set to 0.2 and 30%, respectively, and the underlying submodular function in our experiments is Graph-Cut, which has been shown in Kothawade et al., 2023 (PRISM) to model both representation and diversity among the selected examples.
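For intuition, here is a minimal NumPy sketch of the targeted selection underlying Algorithm 1, assuming precomputed RoI and known-class embeddings and a Graph-Cut style conditional gain; the greedy maximization, budgets, parameter values, and function names are illustrative assumptions rather than the paper's released code.

```python
import numpy as np

def cosine_sim(X, Y):
    """Pairwise cosine similarities between rows of X (n x d) and Y (m x d)."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    Yn = Y / (np.linalg.norm(Y, axis=1, keepdims=True) + 1e-12)
    return Xn @ Yn.T

def greedy_scg_select(cand_emb, query_emb, budget, lam=0.5, nu=1.0):
    """Greedily pick `budget` candidates maximizing a Graph-Cut style conditional
    gain H_f(A | Q): representative of the candidate pool, diverse among
    themselves, and dissimilar to the query (conditioning) set Q."""
    s_vv = cosine_sim(cand_emb, cand_emb)   # candidate-candidate similarities
    s_vq = cosine_sim(cand_emb, query_emb)  # candidate-query similarities
    selected, remaining = [], list(range(len(cand_emb)))
    for _ in range(min(budget, len(remaining))):
        best, best_gain = None, -np.inf
        for e in remaining:
            gain = (s_vv[e].sum()                                   # representativeness
                    - lam * (2.0 * s_vv[e, selected].sum() + 1.0)   # diversity within A
                    - 2.0 * nu * s_vq[e].sum())                     # dissimilarity to Q
            if gain > best_gain:
                best, best_gain = e, gain
        selected.append(best)
        remaining.remove(best)
    return selected

# Illustrative usage mirroring the two steps above (all names are placeholders;
# in practice step 2 would also exclude the selected backgrounds from the pool):
# backgrounds = greedy_scg_select(roi_emb, known_emb, budget_b)                 # step 1: B^t
# unknowns    = greedy_scg_select(roi_emb,
#                                 np.vstack([known_emb, roi_emb[backgrounds]]),
#                                 budget_u)                                     # step 2: U^t
```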

CROWD-L: Learning the Unknown Set


CROWD also introduces CROWD-L, a novel set-based learning paradigm based on SCG functions that minimizes the cluster overlap between embeddings of known and unknown objects while retaining discriminative feature information from the known classes.

Given the known classes \(K^t \cup \hat{K}^{t-1}\), the unknown set \(U^t\), and a submodular function \(f\), we define a learning objective \(L_{\text{CROWD}}(\theta)\), shown in the equations below, which jointly minimizes the Submodular Total Information (\(L_{\text{CROWD}}^{self}\)) over each known class \(K_i^t \in \{K^t \cup \hat{K}^{t-1}\}\) while maximizing the SCG between each known class \(K^t_i\) and the unknown set \(U^t\) (\(L_{\text{CROWD}}^{cross}\)).

    \[L_{\text{CROWD}}^{self}(\theta) = \sum_{i = 1}^{C^t} f(K^t_i; \theta) \]
    \[L_{\text{CROWD}}^{cross}(\theta) = \sum_{i = 1}^{C^t} H_f(K^t_i | U^t; \theta) = \sum_{i = 1}^{C^t} \left[ f(K^t_i \cup U^t; \theta) - f(U^t; \theta) \right]\]
    \[L_{\text{CROWD}}(\theta) = L_{\text{CROWD}}^{self}(\theta) - \eta \, L_{\text{CROWD}}^{cross}(\theta)\]
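As a concrete illustration (assuming disjoint sets and a symmetric similarity such as cosine), instantiating \(f\) as the Graph-Cut function over a ground set \(\mathcal{V}\) (e.g., all RoI embeddings in a batch) with similarity \(s_{ij}(\theta)\) and diversity parameter \(\lambda\) gives

    \[f_{\text{GC}}(A; \theta) = \sum_{i \in A}\sum_{j \in \mathcal{V}} s_{ij}(\theta) - \lambda \sum_{i \in A}\sum_{j \in A} s_{ij}(\theta),\]
    \[H_{f_{\text{GC}}}(K^t_i | U^t; \theta) = f_{\text{GC}}(K^t_i; \theta) - 2\lambda \sum_{k \in K^t_i}\sum_{u \in U^t} s_{ku}(\theta),\]

so that minimizing \(L_{\text{CROWD}}\) (with its \(-\eta\) weight on the cross term) explicitly drives down the pairwise similarity between known-class and unknown embeddings.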

CROWD algorithm

Figure 3: The learning strategy in CROWD-L, which creates a family of learning objectives for OWOD.


Note that \(f\) relies on the pairwise interactions between examples in a batch, which we represent using the cosine similarity \(s_{ku}(\theta) = \frac{h^t(x_{k}; \theta)^{\text{T}} \cdot h^t(x_{u}; \theta)}{||h^t(x_{k}; \theta)|| \cdot ||h^t(x_{u}; \theta)||}\); the choice of \(f\) here can differ from the one used in CROWD-D.

By varying the choice of \(f\) among popular submodular functions, namely Facility-Location (FL), Graph-Cut (GC), and Log-Determinant (LogDet), we introduce a family of loss functions characterized in Figure 4 below. \(L_{\text{CROWD}}\) is applied to the classification head of the \(h^t(\cdot; \theta)\) model during all training stages.
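To make the objective concrete, below is a minimal PyTorch sketch of \(L_{\text{CROWD}}\) instantiated with a Graph-Cut \(f\) over the batch, operating on classification-head embeddings; the batch layout, the hypothetical `unknown_label` convention, and the values of \(\lambda\) and \(\eta\) are assumptions for the sketch, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def graph_cut(sim_AV, sim_AA, lam=0.5):
    """Graph-Cut set function f(A) = sum s(A, V) - lam * sum s(A, A),
    written on similarity sub-matrices (one common parameterization)."""
    return sim_AV.sum() - lam * sim_AA.sum()

def crowd_loss(embeddings, labels, unknown_label=-1, lam=0.5, eta=0.5):
    """L_CROWD = sum_i f(K_i) - eta * sum_i [ f(K_i U U) - f(U) ].

    embeddings: (N, d) RoI features from the classification head h^t(. ; theta)
    labels:     (N,) integer class ids; `unknown_label` marks the mined unknown set U^t.
    """
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t()                                   # pairwise cosine similarities s_ku
    unk = labels == unknown_label
    f_u = graph_cut(sim[unk], sim[unk][:, unk], lam)  # f(U^t)

    loss_self, loss_cross = z.new_zeros(()), z.new_zeros(())
    for c in labels[~unk].unique():
        k = labels == c                               # members of known class K_i
        ku = k | unk                                  # K_i union U^t
        loss_self = loss_self + graph_cut(sim[k], sim[k][:, k], lam)             # f(K_i)
        loss_cross = loss_cross + graph_cut(sim[ku], sim[ku][:, ku], lam) - f_u  # H_f(K_i | U)
    return loss_self - eta * loss_cross
```

In practice this scalar term would simply be added to the standard detection losses on the classification head, consistent with the description above.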

Characteristics of CROWD-L

Figure 4: Characterization of losses in CROWD-L on a synthetic two-cluster imbalanced dataset, obtained by increasing the known vs. unknown class separation (cases 1 through 3), mimicking the RoI embedding space of \(h^t(\cdot; \theta)\). The synthetic dataset is generated with the same seed in all cases.

Experimental Results on OWOD Benchmarks

We evaluate our approach on two well-established benchmarks: M-OWODB and S-OWODB. M-OWODB (Superclass-Mixed OWOD Benchmark) consists of images from both MS-COCO and PASCAL VOC, depicting 80 classes grouped into 4 tasks (20 classes per task).

S-OWODB (Superclass-Separated OWOD Benchmark), on the other hand, consists of images from the MS-COCO dataset alone. Both benchmarks split the underlying data points into four distinct (non-overlapping) tasks \(T_t\), where \(t \in [1, 4]\).

CROWD demonstrates a \(\sim2.4\times\) increase in unknown recall per task, alongside up to a 2.8% improvement on M-OWODB and a 2.1% improvement on S-OWODB in known-class performance (measured as mAP), over several existing OWOD baselines.

Quantitative Results in CROWD

Further, from the empirical results shown below, we observe that CROWD significantly reduces confusion between known and unknown classes while improving performance on previously learnt object classes.

Open-World Object Detection Qualitative Results

Qualitative results from CROWD contrasted against OrthogonalDet (the previous SoTA method), showing that our approach (a) mitigates confusion, (b) generalizes to unknowns, and (c) reduces forgetting.

Citation

If you like our work or use it in your research, please cite it using the BibTeX entry below.

        @inproceedings{majee2025crowd,
          title = {Looking Beyond the Known: Towards a Data Discovery Guided Open-World Object Detection},
          author = {Anay Majee and Amitesh Gangrade and Rishabh Iyer},
          booktitle = {The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS)},
          year = {2025},
        }