Poster

Reconstructing Animals and the Wild

Peter Kulits · Michael J. Black · Silvia Zuffi


Abstract:

The idea of 3D reconstruction as scene understanding is foundational in computer vision. Reconstructing 3D scenes from 2D visual observations requires strong priors to disambiguate structure. Much work has focused on the anthropocentric domain, which, characterized by smooth surfaces, coherent normals, and regular edges, allows for the integration of strong geometric inductive biases. Here we consider a more challenging problem where such assumptions do not hold: the reconstruction of natural scenes composed of trees, bushes, boulders, and animals. While numerous works have attempted to tackle the problem of reconstructing animals in the wild, they have focused solely on the animal, neglecting important environmental context. This limits their usefulness for analysis tasks, as animals inherently exist within the 3D world, and information is lost when environmental factors are disregarded. We propose a method to reconstruct a natural scene from a single image. We base our approach on recent advances leveraging the strong world priors ingrained in Large Language Models, and train an autoregressive model to decode a CLIP embedding into a structured compositional scene representation, encompassing both animals and the wild (RAW). To enable this, we propose a synthetic dataset comprising one million images and thousands of assets. Our approach, trained exclusively on synthetic data, generalizes to the task of reconstructing animals and their environments in real-world images. We will release our dataset and code to encourage future research.
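To make the notion of a "structured compositional scene representation" concrete, here is a minimal sketch of how a flat token stream, as an autoregressive decoder would emit it, might be parsed into a list of placed assets. The token grammar (`ASSET`, `POS`, `SCALE`), the field names, and the asset names are illustrative assumptions for this sketch, not the paper's actual representation.

```python
# Hypothetical sketch: parse an autoregressively decoded token stream into a
# compositional scene, i.e. a list of assets with placements. The grammar and
# fields below are assumptions made for illustration only.

def parse_scene(tokens):
    """Turn a flat token sequence into a list of placed-asset records."""
    scene, i = [], 0
    while i < len(tokens):
        if tokens[i] == "ASSET":
            # Start a new asset record; default scale if none is emitted.
            asset = {"name": tokens[i + 1], "position": None, "scale": 1.0}
            i += 2
            # Consume this asset's attribute tokens until the next ASSET.
            while i < len(tokens) and tokens[i] != "ASSET":
                if tokens[i] == "POS":
                    asset["position"] = tuple(float(t) for t in tokens[i + 1:i + 4])
                    i += 4
                elif tokens[i] == "SCALE":
                    asset["scale"] = float(tokens[i + 1])
                    i += 2
                else:
                    i += 1  # skip unrecognized tokens
            scene.append(asset)
        else:
            i += 1
    return scene

# Example token stream a decoder might produce for a two-asset scene.
tokens = ["ASSET", "zebra", "POS", "0.0", "0.0", "3.0", "SCALE", "1.1",
          "ASSET", "acacia_tree", "POS", "-2.0", "0.0", "5.0"]
scene = parse_scene(tokens)
```

In the actual system, such tokens would be sampled step by step from a model conditioned on the image's CLIP embedding; the sketch only shows the structured endpoint of that decoding.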
