"I don't paint things. I only paint the difference between things."
— Henri Matisse


Recently, 3D generative models have made impressive progress, enabling the generation of almost arbitrary 3D assets from text or image inputs. However, these approaches generate objects in isolation, without any consideration for the scene where they will eventually be placed. In this paper, we propose a framework that stylizes an existing 3D asset to fit a given 2D scene and produces a photorealistic composition, as if the asset were placed within the environment. This not only opens up a new level of control for object stylization (for example, the same asset can be restyled to reflect changes in the environment, such as summer to winter or fantasy versus futuristic settings) but also makes the object-scene composition more controllable. We achieve this by modeling and jointly optimizing the object's texture and the environmental lighting through differentiable ray tracing, guided by image priors from pre-trained text-to-image diffusion models. We demonstrate that our method is applicable to a wide variety of indoor and outdoor scenes and arbitrary objects.


In this paper, we are motivated by a practical question: what happens when a 3D object is placed into a 2D scene? We propose a novel framework that allows:
  • Stylizing the object with an adapted texture that aligns with the given scene; and
  • Achieving a photorealistic scene composition with the aid of estimated environmental lighting.
We formulate the problem through the lens of a creative tool, illustrated in the animation below.

Texture Adaptation

Achieving realistic texture adaptation for 3D objects placed into 2D scenes requires three components:
  • Environmental Influence adjusts the texture to realistically reflect the impact of the environment;
  • Identity Preservation retains the object's unique visual characteristics; and
  • Blending guides the texture adaptation toward the visual characteristics of the scene, ensuring seamless integration with the surroundings and avoiding stark contrasts.
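The three components above can be thought of as weighted terms of a single objective. The sketch below is a hypothetical illustration, not the paper's actual loss: the function name, the weights, and the use of simple color statistics for the blending term are all our own assumptions (the paper's environmental-influence term comes from a diffusion prior, stood in for here by a generic residual).

```python
import numpy as np

def adaptation_loss(rendered, identity_ref, scene_img,
                    w_env=1.0, w_id=0.5, w_blend=0.1, env_residual=None):
    """Hypothetical combined objective for texture adaptation.

    rendered, identity_ref: (H, W, 3) renders of the adapted and original textures.
    scene_img: (H, W, 3) crop of the background scene.
    env_residual: stand-in for a diffusion-prior residual (e.g. an SDS/VSD-style signal).
    """
    # Environmental influence: residual supplied by a frozen image prior (placeholder).
    env_term = float(np.mean(env_residual ** 2)) if env_residual is not None else 0.0
    # Identity preservation: keep the adapted render close to the original object.
    id_term = float(np.mean((rendered - identity_ref) ** 2))
    # Blending: match first- and second-order color statistics of the scene.
    mu_r, mu_s = rendered.mean(axis=(0, 1)), scene_img.mean(axis=(0, 1))
    sd_r, sd_s = rendered.std(axis=(0, 1)), scene_img.std(axis=(0, 1))
    blend_term = float(np.sum((mu_r - mu_s) ** 2) + np.sum((sd_r - sd_s) ** 2))
    return w_env * env_term + w_id * id_term + w_blend * blend_term
```

The relative weights control the trade-off the text describes: a larger `w_id` keeps the object recognizable, while a larger `w_blend` pulls its appearance toward the scene.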
We visualize its training dynamics, showcasing the rendered images, the positive and negative supervisory targets, the PBR texture maps, and the augmented HDR environment map (from top-left to bottom-right).

Light Estimation

To light the object, we utilize a high dynamic range (HDR) environment map, which is well-suited for representing natural illumination. Since the given 2D scene image captures only a small portion of the full 360-degree environment, we first use the image as input to estimate a low dynamic range (LDR) environment map. This process depends on the scene type (i.e., indoor versus outdoor) and is explained below.
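Once an environment map is available, shading a surface point amounts to looking up radiance along a direction. A minimal sketch of this lookup for the common equirectangular parameterization is shown below; the function name and the y-up, nearest-neighbor conventions are our own assumptions, not details from the paper.

```python
import numpy as np

def sample_envmap(env, direction):
    """Nearest-neighbor radiance lookup in an equirectangular environment map.

    env: (H, W, 3) HDR map; direction: 3-vector (x, y, z), y pointing up.
    """
    x, y, z = direction / np.linalg.norm(direction)
    # Azimuth maps to the horizontal axis, polar angle to the vertical axis.
    u = 0.5 + np.arctan2(x, -z) / (2.0 * np.pi)   # [0, 1)
    v = np.arccos(np.clip(y, -1.0, 1.0)) / np.pi  # [0, 1]
    h, w = env.shape[:2]
    col = min(int(u * w), w - 1)
    row = min(int(v * h), h - 1)
    return env[row, col]
```

Because HDR values are unclamped, bright sources (the sun, lamps) retain intensities far above 1.0, which is what makes cast shadows and highlights on the inserted object plausible.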

Inspired by the traditional inverse rendering setup where the environment map can almost be perfectly reconstructed from objects' reflections, we introduce a novel concept that incorporates a virtual light-capturing apparatus alongside the object of interest during the optimization process.

Visual Results

Our method successfully blends the objects into various environments, achieving photorealistic adaptation for both appearance and lighting. This includes scenarios drawn from both real-world and fantasy settings.

Case Study

We conduct a case study where we specifically place a leather sofa into a diverse array of scenes:

Scene-Agnostic Texturing

We compare our framework in the scene-agnostic texture generation setup against mesh texturing methods: Prolific Dreamer, Fantasia3D, and TEXTure. Leveraging VSD, our method demonstrates superior photorealistic texture generation.

Additionally, we consider a scene-agnostic texture editing setup and compare against a self-baseline that uses InstructPix2Pix. Instead of using instructions as text prompts, we directly use appearance descriptions. The experiments reveal that our method is a performant alternative for general instruction-following 3D editing tasks, providing much finer-grained and more accurate control.


The website template was borrowed from Michaël Gharbi and Instruct-NeRF2NeRF.