ImageNet3D: Towards General-Purpose
Object-Level 3D Understanding

Technical Report

Wufei Ma1, Guanning Zeng2, Guofeng Zhang1, Qihao Liu1, Letian Zhang3,
Adam Kortylewski4,5, Yaoyao Liu1, Alan Yuille1

1Johns Hopkins University, 2Tsinghua University, 3UC Santa Cruz,
4University of Freiburg, 5Max Planck Institute for Informatics

We present ImageNet3D, a large dataset for general-purpose object-level 3D understanding. ImageNet3D augments 200 categories from the ImageNet dataset with 2D bounding boxes, 3D poses, 3D locations, and image captions interleaved with 3D information. With the new annotations available in ImageNet3D, we can (i) analyze the object-level 3D awareness of visual foundation models, (ii) study and develop general-purpose models that infer both 2D and 3D information for arbitrary rigid objects in natural images, and (iii) integrate unified 3D models with large language models for 3D-related reasoning. Besides standard classification and pose estimation, we consider two new tasks: probing of object-level 3D awareness and open-vocabulary pose estimation.

Figure 1. Overview of our ImageNet3D dataset.

ImageNet3D

Motivation

Despite the importance of object-level 3D understanding, previous datasets in this area were limited to a small number of categories or to specific domains. How to develop unified 3D models capable of inferring 2D and 3D information for all common rigid objects in natural images remains largely understudied.

We consider two types of unified 3D models.

Overview

We choose the ImageNet21k dataset as the source of our image data. We start by annotating 2D bounding boxes for the object instances in each image. We then collect 3D CAD models from Objaverse as representative shapes for each object category. Finally, we recruit a total of 30 annotators to label 6D poses for the objects, as well as the scene density and object visual quality.
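To make the resulting annotations concrete, the following is a minimal Python sketch of how one annotated object instance could be represented. The field names and types are illustrative assumptions based on the annotation types described above (2D bounding box, 3D pose, 3D location, CAD model, scene density, and visual quality); they are not the dataset's actual file format.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class ObjectAnnotation3D:
    """One annotated object instance (hypothetical schema; names are illustrative)."""
    category: str                                  # one of the 200 rigid categories
    bbox_xyxy: Tuple[float, float, float, float]   # 2D bounding box in pixels
    azimuth: float                                 # 3D pose angles in radians
    elevation: float                               #   (assumed parameterization)
    theta: float                                   # in-plane rotation
    distance: float                                # object-camera distance (3D location)
    cad_index: int                                 # representative CAD model from Objaverse
    scene_density: str                             # annotator-rated scene density
    visual_quality: str                            # annotator-rated object visual quality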

Our dataset features three key designs: (i) a large number of categories and instances, (ii) cross-category 3D alignment, and (iii) natural captions interleaved with 3D information.

Cross-Category 3D Alignment

In previous datasets such as ObjectNet3D, canonical poses of different categories are not necessarily aligned. However, as we scale up the number of categories in 3D-annotated datasets, cross-category 3D alignment becomes a crucial design for studying general-purpose object-level 3D understanding. Correctly aligning the canonical poses (i) allows models to exploit semantic similarities between parts of different categories and benefit from joint learning across categories, and (ii) enables generalization to novel categories by inferring 3D viewpoints from semantic parts that the model has seen in other categories during training.

We manually align the canonical poses of all 200 categories in ImageNet3D, based on the following three rules: (i) semantic parts, (ii) similar shapes, and (iii) common knowledge.

Figure 2. Mis-aligned canonical poses in ObjectNet3D.

Figure 3. Meta classes and cross-category 3D alignment in our ImageNet3D.

Natural Captions with 3D Information

An important application of general-purpose object-level 3D understanding models is to integrate them with large language models (LLMs) and benefit downstream multi-modal reasoning. Hence, we provide image captions interleaved with 3D information, which can be used to develop multi-modal large language models (MLLMs) with 3D reasoning capabilities, similar to previous approaches.

We adopt a GPT-assisted approach to produce natural captions with 3D information. By feeding our 2D and 3D annotations into textual prompts, GPT-4V integrates this information and produces a coherent image caption interleaved with 3D annotations represented by a special <pose6D> token.
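As an illustration only, the sketch below shows how such a caption could be requested with the OpenAI Python client. The prompt wording, the model name, and the handling of the <pose6D> placeholder are assumptions for demonstration; they are not the exact prompts used to build ImageNet3D.

from openai import OpenAI  # assumes the official OpenAI Python client

client = OpenAI()

def caption_with_pose(image_url, category, bbox, azimuth, elevation, theta):
    """Ask a vision-language model to weave 2D/3D annotations into a caption."""
    prompt = (
        f"Describe this image naturally. It contains a {category} with 2D bounding box "
        f"{bbox} and 3D pose (azimuth={azimuth:.2f}, elevation={elevation:.2f}, "
        f"theta={theta:.2f}). Wherever you would mention the pose, insert the special "
        f"token <pose6D> instead of the raw numbers."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content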

Tasks

Besides standard 3D/6D pose estimation and image classification, as studied in prior works, we further consider two new tasks: probing of object-level 3D awareness and open-vocabulary pose estimation.

Linear Probing of Object-Level 3D Awareness

Recent developments in large-scale pretraining have yielded visual foundation models with strong capabilities. Self-supervised approaches such as MAE and DINO provide strong and generalizable feature representations that benefit downstream recognition and localization. Are these visual foundation models aware of object-level 3D structure? Can their feature representations distinguish objects observed from different 3D viewpoints, or retrieve objects observed from similar 3D viewpoints?

We evaluate object-level 3D awareness by linear probing frozen feature representations on a 3D viewpoint classification task. Specifically, we train one linear classifier for each of the three parameters encoding the 3D viewpoint.
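A minimal sketch of this probing protocol is given below, assuming pre-extracted frozen class-embedding features and viewpoint angles discretized into bins. The bin count, the angle parameterization (azimuth, elevation, in-plane rotation), and the use of scikit-learn logistic regression as the linear classifier are illustrative choices, not the exact evaluation code.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_viewpoint_probes(feats_train, angles_train, feats_val, angles_val, n_bins=24):
    """Linear probing of frozen features for 3D viewpoint classification.

    feats_*  : (N, D) frozen class-embedding features from a vision backbone.
    angles_* : (N, 3) viewpoint parameters in radians (assumed to be azimuth,
               elevation, and in-plane rotation).
    One linear classifier is trained per viewpoint parameter; angles are
    discretized into n_bins classes (the bin count here is illustrative).
    """
    accuracies = []
    for k in range(3):
        bins_train = ((angles_train[:, k] % (2 * np.pi)) // (2 * np.pi / n_bins)).astype(int)
        bins_val = ((angles_val[:, k] % (2 * np.pi)) // (2 * np.pi / n_bins)).astype(int)
        clf = LogisticRegression(max_iter=1000).fit(feats_train, bins_train)
        accuracies.append(clf.score(feats_val, bins_val))
    return accuracies  # per-parameter linear-probe accuracy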

We compute the pose error as the angle between the predicted rotation matrix and the ground-truth rotation matrix, and report pose estimation accuracy, i.e., the percentage of samples with pose errors smaller than a pre-defined threshold.
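For reference, a small sketch of this metric is shown below, using the standard geodesic distance between rotation matrices, arccos((tr(R_pred^T R_gt) - 1) / 2), and a pi/6 accuracy threshold; the function names are illustrative.

import numpy as np

def pose_error(R_pred, R_gt):
    """Geodesic angle (radians) between predicted and ground-truth rotation matrices."""
    cos_angle = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def pose_accuracy(R_preds, R_gts, threshold=np.pi / 6):
    """Fraction of samples whose pose error falls below the threshold."""
    errors = [pose_error(Rp, Rg) for Rp, Rg in zip(R_preds, R_gts)]
    return float(np.mean([e < threshold for e in errors]))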

Open-Vocabulary Pose Estimation

In this setting, we study how 3D models generalize to novel categories. Models may utilize semantic parts that are shared between novel categories and categories seen during training. Additionally, open-vocabulary pose estimation models may utilize large-scale 2D pre-training data or vision-language supervision to learn useful semantic information. Lastly, we provide detailed descriptions of object shape, part structure, and how humans interact with these objects for all categories in ImageNet3D.

Figure 4. Illustration of open-vocabulary pose estimation.

Baseline Results

Task 1: Linear Probing of Object-Level 3D Awareness

We measure the object-level 3D awareness of a range of general-purpose vision models designed for representation learning, multi-modal learning, and depth estimation. These models adopt standard transformer architectures, and we train linear probes on their frozen class-embedding features.

Model      Arch      Supervision     Dataset      Pose Accuracy @ pi/6
                                                  Avg.   Elec.  Furn.  Hou.   Mus.   Spo.   Veh.   Work
DeiT III   ViT-B/16  classification  ImageNet21k  36.6   47.9   48.2   36.8   21.5   16.6   35.0   25.3
MAE        ViT-B/16  SSL             ImageNet1k   46.6   57.6   67.8   40.2   29.0   20.2   58.4   25.6
DINO       ViT-B/16  SSL             ImageNet1k   42.0   53.1   57.0   39.8   28.0   19.3   45.3   27.0
DINOv2     ViT-B/14  SSL             LVD-142M     56.3   64.0   75.3   47.9   32.9   23.5   74.7   38.1
CLIP       ViT-B/16  VLM             private      39.7   50.3   52.8   39.7   23.1   19.3   39.8   26.4
MiDaS      ViT-L/16  depth           MIX-6        40.5   50.9   56.7   40.2   26.7   18.9   39.2   28.1
Table 1. Quantitative results of linear probing of object-level 3D awareness.

Task 2: Open-Vocabulary Pose Estimation

For baseline results, we consider models that learn category-agnostic features that generalize to novel categories and instances. Specifically, we consider (i) classification-based methods that formulate pose estimation as a classification problem, and (ii) 3D compositional models that learn neural mesh models with contrastive features and perform analysis-by-synthesis during inference. Our implementation of the 3D compositional models extends from prior work.
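To make the first family of baselines concrete, the following is a minimal PyTorch sketch of a category-agnostic, classification-based pose estimator: a shared backbone followed by three classification heads over binned viewpoint angles. The ResNet-50 backbone, the bin count, and the class and module names are assumptions for illustration, not the exact baseline configuration.

import torch
import torch.nn as nn
from torchvision.models import resnet50

class PoseClassifier(nn.Module):
    """Classification-based pose estimation (illustrative sketch).

    A shared backbone embeds an object crop from any category; three heads
    predict binned azimuth, elevation, and in-plane rotation.
    """
    def __init__(self, n_bins=40):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V2")
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # drop the final fc layer
        self.heads = nn.ModuleList([nn.Linear(2048, n_bins) for _ in range(3)])

    def forward(self, x):
        feats = self.backbone(x).flatten(1)          # (B, 2048)
        return [head(feats) for head in self.heads]  # 3 x (B, n_bins) logits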

Model                 Pose Accuracy @ pi/6
                      Avg.   Elec.  Furn.  Hou.   Mus.   Spo.   Veh.   Work
Oracle Model
ResNet50-General      53.6   49.2   52.4   45.8   26.0   65.2   56.5   58.5
Open-Vocabulary Models
ResNet50-General      37.1   30.1   35.6   28.1   11.8   51.7   36.7   40.9
SwinTrans-T-General   35.8   30.9   34.3   26.1   12.2   46.2   34.4   39.2
NMM-Sphere            29.5   31.7   25.4   21.7   25.6   19.8   33.4   19.3
Table 2. Quantitative results of open-vocabulary pose estimation. Note that the oracle model is trained on both known and novel categories.

Task 3: Joint Image Classification and Pose Estimation

Similar to Task 2, we consider two types of models: (i) classification-based methods and (ii) 3D compositional models. We adopt a 3D-aware classification accuracy, where a prediction is correct only if the predicted class label is correct and the predicted pose error is below a given threshold.
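A short sketch of this metric, under the same assumptions as the pose-error code above, might look as follows (function and argument names are illustrative):

import numpy as np

def pose_aware_accuracy(pred_labels, gt_labels, pose_errors, threshold=np.pi / 6):
    """3D-aware classification accuracy: a sample counts as correct only if the
    predicted class label matches and its pose error is below the threshold."""
    correct = [
        (p == g) and (e < threshold)
        for p, g, e in zip(pred_labels, gt_labels, pose_errors)
    ]
    return float(np.mean(correct))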

Model                 3D-Aware Pose Accuracy @ pi/6
                      Avg.   Elec.  Furn.  Hou.   Mus.   Spo.   Veh.   Work
ResNet50-General      50.9   60.0   67.2   43.0   43.8   27.7   64.1   33.8
SwinTrans-T-General   53.2   63.1   71.6   44.8   45.3   30.4   66.2   35.0
LLaVA-pose            49.1   58.0   65.6   41.6   41.0   26.1   61.8   32.1
NOVUM                 56.2   59.6   65.6   52.5   41.9   30.6   69.6   39.3
NMM-Sphere            57.4   61.3   65.9   52.4   51.7   40.5   67.9   43.4
Table 3. Quantitative results of joint image classification and pose estimation.

Discussion

Conclusion. In this paper, we present ImageNet3D, a large dataset for general-purpose object-level 3D understanding. ImageNet3D greatly extends the number of rigid categories and object instances compared to previous datasets with 3D annotations. Moreover, ImageNet3D improves the quality of 3D annotations by enforcing cross-category 3D alignment, and provides new types of annotations, such as object visual quality and image captions interleaved with 3D information, that enable new research problems. We provide baseline results on standard 3D tasks, as well as on novel tasks such as probing of object-level 3D awareness and open-vocabulary pose estimation.

Experimental results show that with ImageNet3D, we can develop general-purpose models capable of inferring 3D information for a wide range of rigid categories. We also identify limitations of existing 3D models from our baseline experiments and discuss new problems and challenges for future studies.

Ethics. We followed ethics guidelines and obtained Institutional Review Board (IRB) approval prior to the start of our work. We described potential risks to the annotators, such as exposure to inappropriate images from the ImageNet21k dataset, and explained the purpose of the study and how the collected data would be used. All annotators were paid fairly, as required by our institution.

BibTeX

@article{ma2024imagenet3d,
  title={ImageNet3D: Towards General-Purpose Object-Level 3D Understanding},
  author={Ma, Wufei and Zeng, Guanning and Zhang, Guofeng and Liu, Qihao and Zhang, Letian and Kortylewski, Adam and Liu, Yaoyao and Yuille, Alan},
  journal={arXiv preprint arXiv:2406.09613},
  year={2024}
}

Notes

This website template is adapted from Image Sculpting.