Looking Beyond the Known: Towards a Data Discovery Guided Open-World Object Detection

CARAML Lab, The University of Texas at Dallas
NeurIPS 2025
CROWD pipeline

Overall Architecture of CROWD showing our novel combinatorial data-discovery guided representation learning approach to (a) identify unknown objects and (b) learn distinguishable representations of both known and unknown objects.

Abstract

Open-World Object Detection (OWOD) enriches traditional object detectors by enabling continual discovery and integration of unknown objects via human guidance. However, existing OWOD approaches frequently suffer from semantic confusion between known and unknown classes, alongside catastrophic forgetting, leading to diminished unknown recall and degraded known-class accuracy. To overcome these challenges, we propose Combinatorial Open-World Detection (CROWD), a unified framework reformulating unknown object discovery and adaptation as an interwoven combinatorial (set-based) data-discovery (CROWD-Discover) and representation learning (CROWD-Learn) task. CROWD-Discover strategically mines unknown instances by maximizing Submodular Conditional Gain (SCG) functions, selecting representative examples distinctly dissimilar from known objects. Subsequently, CROWD-Learn employs novel combinatorial objectives that jointly disentangle known and unknown representations while maintaining discriminative coherence among known classes, thus mitigating confusion and forgetting. Extensive evaluations on OWOD benchmarks illustrate that CROWD achieves improvements of 2.83% and 2.05% in known-class accuracy on M-OWODB and S-OWODB, respectively, and nearly 2.4x the unknown recall of leading baselines.

The CROWD Framework

CROWD casts Open-World Object Detection (OWOD) as a set-based discovery and learning problem. For each task \(t\), we view the known object classes as a collection of sets \(K^t\) and aim to group all candidate unknowns into a single pseudo-labeled set \(U^t\), while treating everything else as background \(B^t\). This viewpoint surfaces two unique challenges in OWOD:

  1. How to identify instances of unlabeled unknown objects \(U^t\) given labeled examples of only the known ones in \(K^t\)?

  2. How to effectively learn representations of the currently known objects without forgetting the previously known ones (classes introduced in \(T_i\), where \(i < t\))?


CROWD Flow Diagram

Figure 1: Interleaved Data-Discovery and Representation Learning in CROWD on an incoming task \(T_t\). CROWD takes as input the model weights from \(T_{t-1}\) and a small replay buffer of previously known classes \(\hat{K}^{t-1}\), applies (a) CROWD-D to discover unknown RoIs and (b) CROWD-L to learn discriminative features of both known and unknown instances, and returns an updated model \(h^{t+1}\) and the current-task replay buffer \(\hat{K}^t\).


CROWD achieves this in two stages, namely CROWD-Discover (a.k.a. CROWD-D) and CROWD-Learn (a.k.a. CROWD-L), as shown in Figure 1. Given an incoming task \(T_t\), we first train \(h^t(\cdot; \theta)\) on the currently known classes in \(D^t\). At this point, CROWD-D operates on the frozen weights of \(h^t\) and a small replay buffer (typically containing examples from both previously known and currently introduced objects) to discover highly representative proposals of the unknown classes \(U^t\).

Subsequently, CROWD-L introduces a novel combinatorial learning strategy to rapidly finetune \(h^t\) on this replay buffer (we adopt the predefined buffer from Joseph et al., 2021), distinguishing between the known classes \(K^t\) and the unknowns \(U^t\) while preserving discriminative features of the previously known classes.
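For intuition, the sketch below outlines this interleaved per-task flow in plain Python. The stage functions are passed in as callables and all names are illustrative placeholders, not the released implementation.

```python
from typing import Any, Callable, Tuple

def crowd_task_step(
    model: Any,                       # h^t carried over from task T_{t-1}
    task_data: Any,                   # annotations and RoI proposals for task T_t
    replay_buffer: Any,               # exemplars of previously known classes
    train_detector: Callable,         # standard detector training on the knowns in D^t
    crowd_discover: Callable,         # CROWD-D: SCG-guided mining of unknown RoIs U^t
    crowd_learn: Callable,            # CROWD-L: combinatorial finetuning on the buffer
    update_replay_buffer: Callable,   # exemplar selection for the current task
) -> Tuple[Any, Any]:
    """One incremental task of CROWD (Figure 1), expressed as a control-flow sketch."""
    # 1) Train h^t on the currently known classes.
    model = train_detector(model, task_data)
    # 2) With h^t frozen, mine representative unknown RoIs that are dissimilar
    #    to the known (and background) instances.
    unknowns = crowd_discover(model, replay_buffer, task_data)
    # 3) Briefly finetune on the replay buffer so known classes stay separable
    #    from each other and from the newly mined unknown set.
    model = crowd_learn(model, replay_buffer, unknowns)
    # 4) Refresh the replay buffer with exemplars from the current task.
    replay_buffer = update_replay_buffer(replay_buffer, task_data)
    return model, replay_buffer
```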



Discovering Unknown Backgrounds: CROWD-D


CROWD introduces a novel combinatorial viewpoint in OWOD by modeling the identification of unknown instances of a given task as a data-discovery problem (CROWD-D), selecting unknown RoIs that maximize the SCG between the selected set and the known object instances.

CROWD-D Selection

Figure 2: Illustration of the data-discovery pipeline in CROWD-D on a synthetic dataset with \(|\mathtt{R}| = 500\), a budget \(\mathtt{k} = 10\), and Graph-Cut as the underlying submodular function. CROWD-D selects \(U^t\) whose instances are dissimilar to both background \(B^t\) and known \(K^t\) instances.



CROWD algorithm

Algorithm 1: CROWD-D, the data-discovery algorithm in CROWD.

Given a set of RoI proposals \(\mathtt{R}\) and a submodular function \(f\), we define the data-discovery task (Algorithm 1) as a targeted selection problem that selects a set of unknown instances \(U^t\) from \(\mathcal{V} = \mathtt{R} \setminus K^t\) by maximizing the SCG \(H_f\) given a query set comprising the known \(K^t\) and background \(B^t\) instances (line 8 in Algorithm 1). This proceeds in two key steps:

  1. Identify True Backgrounds: This step follows the intuition that exemplars largely different from \(K^t\) are potential true backgrounds. We denote these as \(B^t\) (shown in red in Figure 2(b)); they are identified by maximizing the SCG between examples in \(\mathtt{R} \setminus K^t\) (denoted \(\mathcal{V}\)) and the knowns \(K^t\).

  2. Mine Unknowns: By the definition of SCG, the examples selected into \(U^t\) are largely dissimilar to examples in \(K^t \cup B^t\), indicating that they are neither background objects nor visually similar to known objects, as shown in Figure 2(c).


The exclusion thresholds \(\tau_e\) and \(\tau_b\) are empirically set to 0.2 and 30%, respectively, and the underlying submodular function in our experiments is Graph-Cut, which has been shown in Kothawade et al., 2023 (PRISM) to model both representation and diversity among the selected examples.
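For intuition, here is a minimal NumPy sketch of the targeted selection underlying Algorithm 1, assuming precomputed RoI and known-class embeddings and a Graph-Cut style conditional gain; the greedy maximization, budgets, parameter values, and function names are illustrative assumptions rather than the paper's released code.

```python
import numpy as np

def cosine_sim(X, Y):
    """Pairwise cosine similarities between rows of X (n x d) and Y (m x d)."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    Yn = Y / (np.linalg.norm(Y, axis=1, keepdims=True) + 1e-12)
    return Xn @ Yn.T

def greedy_scg_select(cand_emb, query_emb, budget, lam=0.5, nu=1.0):
    """Greedily pick `budget` candidates maximizing a Graph-Cut style conditional
    gain H_f(A | Q): representative of the candidate pool, diverse among
    themselves, and dissimilar to the query (conditioning) set Q."""
    s_vv = cosine_sim(cand_emb, cand_emb)   # candidate-candidate similarities
    s_vq = cosine_sim(cand_emb, query_emb)  # candidate-query similarities
    selected, remaining = [], list(range(len(cand_emb)))
    for _ in range(min(budget, len(remaining))):
        best, best_gain = None, -np.inf
        for e in remaining:
            gain = (s_vv[e].sum()                                   # representativeness
                    - lam * (2.0 * s_vv[e, selected].sum() + 1.0)   # diversity within A
                    - 2.0 * nu * s_vq[e].sum())                     # dissimilarity to Q
            if gain > best_gain:
                best, best_gain = e, gain
        selected.append(best)
        remaining.remove(best)
    return selected

# Illustrative usage mirroring the two steps above (all names are placeholders;
# in practice step 2 would also exclude the selected backgrounds from the pool):
# backgrounds = greedy_scg_select(roi_emb, known_emb, budget_b)                 # step 1: B^t
# unknowns    = greedy_scg_select(roi_emb,
#                                 np.vstack([known_emb, roi_emb[backgrounds]]),
#                                 budget_u)                                     # step 2: U^t
```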

CROWD-L: Learning the Unknown Set


CROWD also introduces CROWD-L, a novel set-based learning paradigm based on SCG functions that minimizes the cluster overlap between embeddings of known and unknown objects while retaining discriminative feature information from the known classes.

Given the known classes \(K^t \cup \hat{K}^{t-1}\), the unknown set \(U^t\), and a submodular function \(f\), we define a learning objective \(L_{\text{CROWD}}(\theta)\), shown in the equations below, which jointly minimizes the Submodular Total Information (\(L_{\text{CROWD}}^{self}\)) over each known class \(K_i^t \in \{K^t \cup \hat{K}^{t-1}\}\) while maximizing the SCG between each known class \(K^t_i\) and the unknown set \(U^t\) (\(L_{\text{CROWD}}^{cross}\)).

    \[L_{\text{CROWD}}^{self}(\theta) = \sum_{i = 1}^{C^t} f(K^t_i; \theta) \]
    \[L_{\text{CROWD}}^{cross}(\theta) = \sum_{i = 1}^{C^t} H_f(K^t_i | U^t; \theta) = \sum_{i = 1}^{C^t} \left[ f(K^t_i \cup U^t; \theta) - f(U^t; \theta) \right]\]
    \[L_{\text{CROWD}}(\theta) = L_{\text{CROWD}}^{self}(\theta) - \eta \, L_{\text{CROWD}}^{cross}(\theta)\]
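As a concrete illustration (assuming disjoint sets and a symmetric similarity such as cosine), instantiating \(f\) as the Graph-Cut function over a ground set \(\mathcal{V}\) (e.g., all RoI embeddings in a batch) with similarity \(s_{ij}(\theta)\) and diversity parameter \(\lambda\) gives

    \[f_{\text{GC}}(A; \theta) = \sum_{i \in A}\sum_{j \in \mathcal{V}} s_{ij}(\theta) - \lambda \sum_{i \in A}\sum_{j \in A} s_{ij}(\theta),\]
    \[H_{f_{\text{GC}}}(K^t_i | U^t; \theta) = f_{\text{GC}}(K^t_i; \theta) - 2\lambda \sum_{k \in K^t_i}\sum_{u \in U^t} s_{ku}(\theta),\]

so that minimizing \(L_{\text{CROWD}}\) (with its \(-\eta\) weight on the cross term) explicitly drives down the pairwise similarity between known-class and unknown embeddings.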

CROWD algorithm

Figure 3: The learning strategy in CROWD-L, which creates a family of learning objectives for OWOD.


Note that \(f\) relies on the pairwise interactions between examples in a batch, which we represent using the cosine similarity \(s_{ku}(\theta) = \frac{h^t(x_{k}; \theta)^{\text{T}} \cdot h^t(x_{u}; \theta)}{||h^t(x_{k}; \theta)|| \cdot ||h^t(x_{u}; \theta)||}\); the choice of \(f\) here can differ from the one used in CROWD-D.

By varying the choice of \(f\) among popular submodular functions, namely Facility-Location (FL), Graph-Cut (GC), and Log-Determinant (LogDet), we introduce a family of loss functions characterized in Figure 4 below. \(L_{\text{CROWD}}\) is applied to the classification head of the \(h^t(\cdot; \theta)\) model during all training stages.
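To make the objective concrete, below is a minimal PyTorch sketch of \(L_{\text{CROWD}}\) instantiated with a Graph-Cut \(f\) over the batch, operating on classification-head embeddings; the batch layout, the hypothetical `unknown_label` convention, and the values of \(\lambda\) and \(\eta\) are assumptions for the sketch, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def graph_cut(sim_AV, sim_AA, lam=0.5):
    """Graph-Cut set function f(A) = sum s(A, V) - lam * sum s(A, A),
    written on similarity sub-matrices (one common parameterization)."""
    return sim_AV.sum() - lam * sim_AA.sum()

def crowd_loss(embeddings, labels, unknown_label=-1, lam=0.5, eta=0.5):
    """L_CROWD = sum_i f(K_i) - eta * sum_i [ f(K_i U U) - f(U) ].

    embeddings: (N, d) RoI features from the classification head h^t(. ; theta)
    labels:     (N,) integer class ids; `unknown_label` marks the mined unknown set U^t.
    """
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t()                                   # pairwise cosine similarities s_ku
    unk = labels == unknown_label
    f_u = graph_cut(sim[unk], sim[unk][:, unk], lam)  # f(U^t)

    loss_self, loss_cross = z.new_zeros(()), z.new_zeros(())
    for c in labels[~unk].unique():
        k = labels == c                               # members of known class K_i
        ku = k | unk                                  # K_i union U^t
        loss_self = loss_self + graph_cut(sim[k], sim[k][:, k], lam)             # f(K_i)
        loss_cross = loss_cross + graph_cut(sim[ku], sim[ku][:, ku], lam) - f_u  # H_f(K_i | U)
    return loss_self - eta * loss_cross
```

In practice this scalar term would simply be added to the standard detection losses on the classification head, consistent with the description above.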

Characteristics of CROWD-L

Figure 4: Characterization of losses in CROWD-L on a synthetic two-cluster imbalanced dataset, obtained by increasing the known vs. unknown class separation (cases 1 through 3), mimicking the RoI embedding space of \(h^t(\cdot; \theta)\). The synthetic dataset is generated with the same seed in all cases.

Experimental Results on OWOD Benchmarks

We evaluate our approach on two well-established benchmarks: M-OWODB and S-OWODB. M-OWODB (Superclass-Mixed OWOD Benchmark) consists of images from both MS-COCO and PASCAL VOC, depicting 80 classes grouped into 4 tasks (20 classes per task).

S-OWODB (Superclass-Separated OWOD Benchmark), on the other hand, consists of images from the MS-COCO dataset alone. Both benchmarks split the underlying data points into four distinct (non-overlapping) tasks \(T_t\), where \(t \in [1, 4]\).

CROWD demonstrates a \(\sim2.4\times\) increase in unknown recall per task, alongside up to a 2.8% improvement on M-OWODB and a 2.1% improvement on S-OWODB in known-class performance (measured as mAP), over several existing OWOD baselines.

Quantitative Results in CROWD

Further, from the empirical results shown below, we observe that CROWD significantly reduces confusion between known and unknown classes while improving performance on previously learnt object classes.

Open-World Object Detection Qualitative Results

Qualitative results from CROWD contrasted against OrthogonalDet (the previous SoTA method), showing that our approach (a) mitigates confusion, (b) generalizes to unknowns, and (c) reduces forgetting.

Citation

If you like our work or use it in your research, please cite it using the BibTeX entry below.

        @inproceedings{majee2025crowd,
          title = {Looking Beyond the Known: Towards a Data Discovery Guided Open-World Object Detection},
          author = {Anay Majee and Amitesh Gangrade and Rishabh Iyer},
          booktitle = {The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS)},
          year = {2025},
        }