The rapid growth of Generative AI has captured the attention of organizations and researchers alike, with its potential to create novel and original content. Large Language Models (LLMs) have made a wide range of language tasks easy to automate, and OpenAI's DALL-E, a text-to-image model that creates realistic images from textual prompts, has already amassed over a million users. OpenAI has now expanded its portfolio with the release of Shap-E, a conditional generative model for 3D assets.
Shap-E stands apart from traditional models that generate a single output representation. Instead, it produces the parameters of implicit functions, which can be rendered either as textured meshes or as neural radiance fields (NeRFs), enabling versatile and realistic 3D asset generation.
Researchers first trained an encoder that takes 3D assets as input and deterministically maps each one to the parameters of an implicit function, giving the model a compact latent representation of the asset. A conditional diffusion model was then trained on the encoder's outputs: it learns the conditional distribution of implicit-function parameters given the conditioning input (such as a text description) and generates diverse, complex 3D assets by sampling from that distribution. The model was trained on a large dataset of 3D assets paired with corresponding textual descriptions.
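To make the two-stage recipe concrete, here is a minimal, hypothetical sketch in PyTorch. The module names, dimensions, and simple linear noising rule are illustrative assumptions for exposition only, not the actual Shap-E implementation (the paper uses transformer-based models for both stages).

```python
# Hypothetical sketch of the two-stage recipe described above.
# All names, shapes, and the noising rule are illustrative assumptions.
import torch
import torch.nn as nn

LATENT_DIM = 1024  # assumed size of the implicit-function parameter vector

class AssetEncoder(nn.Module):
    """Stage 1: map a 3D asset (here, a colored point cloud) to a latent."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(6, 256), nn.ReLU(),
                                 nn.Linear(256, LATENT_DIM))

    def forward(self, points):                # (B, N, 6): xyz + rgb
        return self.net(points).mean(dim=1)   # pool to (B, LATENT_DIM)

class LatentDenoiser(nn.Module):
    """Stage 2: predict the noise added to a latent, given text features."""
    def __init__(self, text_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM + text_dim + 1, 2048),
                                 nn.ReLU(), nn.Linear(2048, LATENT_DIM))

    def forward(self, noisy_latent, t, text_emb):
        return self.net(torch.cat([noisy_latent, text_emb, t[:, None]], dim=-1))

encoder, denoiser = AssetEncoder(), LatentDenoiser()
points = torch.randn(2, 4096, 6)    # dummy batch of colored point clouds
text_emb = torch.randn(2, 512)      # dummy text embeddings (e.g. from CLIP)

with torch.no_grad():               # stage 1 is assumed already trained
    z0 = encoder(points)

t = torch.rand(2)                   # random diffusion timesteps in [0, 1]
noise = torch.randn_like(z0)
zt = (1 - t)[:, None] * z0 + t[:, None] * noise  # toy linear noising rule
loss = ((denoiser(zt, t, text_emb) - noise) ** 2).mean()
loss.backward()                     # gradients flow to the denoiser only
```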
Shap-E represents 3D assets with implicit neural representations (INRs), which provide a versatile and flexible framework for capturing the detailed geometric properties of 3D assets. The two types of INRs used in Shap-E are Neural Radiance Fields (NeRF) and DMTet together with its extension GET3D.
NeRF maps coordinates and viewing directions to densities and RGB colors, enabling realistic and high-fidelity rendering from arbitrary viewpoints. DMTet and GET3D represent textured 3D meshes by mapping coordinates to colors, signed distances, and vertex offsets, allowing the construction of 3D triangle meshes in a differentiable manner.
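The two INR types can be summarized by their input/output signatures. The sketch below is for illustration only: the tiny networks and output parameterizations are assumptions, not Shap-E's actual interfaces.

```python
# Illustrative signatures for the two INR types described above.
import torch
import torch.nn as nn

class NeRFField(nn.Module):
    """NeRF-style field: coordinates + viewing direction -> density, RGB."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(6, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 4))  # (density, r, g, b)

    def forward(self, coords, view_dirs):
        out = self.net(torch.cat([coords, view_dirs], dim=-1))
        return out[..., :1], torch.sigmoid(out[..., 1:])

class MeshField(nn.Module):
    """DMTet/GET3D-style field: coordinates -> SDF, color, vertex offset."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 7))  # (sdf, rgb, offset)

    def forward(self, coords):
        out = self.net(coords)
        return out[..., :1], torch.sigmoid(out[..., 1:4]), out[..., 4:]

x = torch.rand(1024, 3)             # query points
d = torch.randn(1024, 3)            # viewing directions
density, rgb = NeRFField()(x, d)    # volumetric rendering path
sdf, color, offsets = MeshField()(x)  # differentiable mesh-extraction path
```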
The Shap-E model has demonstrated its ability to produce high-quality outputs in seconds. Example results include 3D assets for textual prompts such as a bowl of food, a penguin, a voxelized dog, a campfire, and an avocado-shaped chair. Compared to Point·E, an explicit generative model over point clouds, Shap-E converged faster and achieved comparable or better sample quality, despite modeling a higher-dimensional, multi-representation output space.
For random samples on selected prompts, see samples.md.
Shap-E is a promising and significant addition to the world of Generative AI, offering an efficient and effective generative model for 3D assets. Its capacity to generate versatile and realistic 3D assets is poised to revolutionize the industry, opening up new possibilities for content creators and researchers alike.
To get started, see the example notebooks in the Shap-E repository.
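For a quick look outside the notebooks, the script below follows the text-to-3D flow of the repository's sampling utilities. It assumes the shap-e package is installed, and the sampling parameters (guidance scale, Karras step count, noise range) mirror the published example but are tunable defaults rather than requirements.

```python
import torch

from shap_e.diffusion.sample import sample_latents
from shap_e.diffusion.gaussian_diffusion import diffusion_from_config
from shap_e.models.download import load_model, load_config
from shap_e.util.notebooks import create_pan_cameras, decode_latent_images

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the latent decoder ("transmitter"), the text-conditional
# diffusion model, and the diffusion schedule configuration.
xm = load_model('transmitter', device=device)
model = load_model('text300M', device=device)
diffusion = diffusion_from_config(load_config('diffusion'))

prompt = 'a penguin'
batch_size = 1

# Sample implicit-function latents conditioned on the text prompt.
latents = sample_latents(
    batch_size=batch_size,
    model=model,
    diffusion=diffusion,
    guidance_scale=15.0,
    model_kwargs=dict(texts=[prompt] * batch_size),
    progress=True,
    clip_denoised=True,
    use_fp16=True,
    use_karras=True,
    karras_steps=64,
    sigma_min=1e-3,
    sigma_max=160,
    s_churn=0,
)

# Render each latent from a ring of cameras and save a turntable GIF.
cameras = create_pan_cameras(64, device)  # 64x64 renders; larger is slower
for i, latent in enumerate(latents):
    images = decode_latent_images(xm, latent, cameras, rendering_mode='nerf')
    images[0].save(f'sample_{i}.gif', save_all=True,
                   append_images=images[1:], duration=100, loop=0)
```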