GANverse3D Turns a Single Image Into 3D Objects | Hacker Noon

Louis Bouchard (@whatsai)

I explain Artificial Intelligence terms and news to non-experts.

This promising model called GANverse3D only needs an image to create a 3D figure that can be customized and animated!

How cool would it be to take a picture of an object on the internet, let’s say a car, and automatically have the 3D object ready to insert into your game in less than a second?

This is cool, right? Well, imagine that within a few seconds you could even animate this car: making the wheels turn, flashing the lights, and so on. Would you believe me if I told you that an AI can already do that? And if video games weren’t enough, this new application works for any 3D scene you are working on: illustrations, movies, architecture, design, and more!

Watch the video

References

  1. Video demo: https://youtu.be/dvjwRBZ3Hnw
  2. Karras et al. (2019), “StyleGAN”: https://arxiv.org/pdf/1812.04948.pdf
  3. Chen et al. (2019), “DIB-R”: https://arxiv.org/pdf/1908.01210.pdf
  4. NVIDIA (2021), Omniverse: https://www.nvidia.com/en-us/omniverse/
  5. Zhang et al. (2020), “Image GANs Meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering”: https://arxiv.org/pdf/2010.09125.pdf
  6. GANverse3D official NVIDIA video: https://youtu.be/0PQnrnUIBlU
  7. NVIDIA’s GANverse3D blog article: https://blogs.nvidia.com/blog/2021/04/16/gan-research-knight-rider-ai-omniverse/

Video transcript

00:00

What you see here is someone carefully creating a scene for a video game. It takes many hours of work by a professional just for a single object like this one. How cool would it be to take a picture of an object on the internet, let’s say a car, and automatically have the 3D object ready to insert into your game in less than a second?

00:21

This is cool, right? Well, imagine that within a few seconds you could even animate this car: making the wheels turn, flashing the lights, and so on. Would you believe me if I told you that an AI can already do that?

00:34

And if video games weren’t enough, this new application works for any 3D scene you are working on: illustrations, movies, architecture, design, and more! It removes hundreds if not thousands of hours of long, iterative testing by professional designers, allowing small businesses to produce quick simulations far more cheaply!

00:55

By the time you take a sip of coffee, this model will have processed an image of a car and generated a whole 3D animated version of it with realistic headlights, taillights, and blinkers! Moreover, you can even drive it around in a virtual environment platform like Omniverse, as you can see here.

01:13

Omniverse, where this new tool was presented at the recent GTC event, was designed for creators who rely on virtual environments to test new ideas and visualize prototypes before creating their final products. You can use it to simulate complex virtual worlds with real-time ray tracing. Since this video isn’t about Omniverse, which is awesome in its own right, I won’t dive further into the platform’s details; I linked more resources about it in the description.

01:43

Here, I want to focus on the algorithm behind the 3D model generation technique NVIDIA published at ICLR and CVPR 2021. Indeed, this promising model, called GANverse3D, only needs an image to create a 3D figure that can be customized and animated! Given its name, it won’t surprise you that it uses a GAN to achieve this. I won’t go into how GANs work here, since I have covered them many times on my channel, where you can find many videos explaining them, like the one appearing in the top-right corner right now.

02:20

Generative networks are relatively new to building 3D models from 2D images, a task also called “inverse graphics” because of its complexity: it requires understanding depth, textures, and lighting, usually from multiple viewpoints of an object, to produce an accurate 3D model. Well, the researchers discovered that generative adversarial networks implicitly acquire such knowledge during training, meaning that information about the shapes, lighting, and textures of objects is already encoded inside the GAN model’s latent code.

02:56

This latent code is produced by the encoder part of the architecture and is then fed into the generator, or decoder, to produce a new image with specific attributes. As observed in previous research, we know that different layers control different attributes within the images, which is why you saw so many different and cool GAN applications in the past year: some could control the style of a face to generate cartoon images, while others could make your head move, all from a single picture of yourself.

03:30

In this case, they used the well-known StyleGAN architecture, a powerful generator behind many of the viral applications you have seen on the internet and on my channel.

03:41

The researchers experimentally found that, with the remaining layers fixed, the first four layers control the camera viewpoint. Thus, by manipulating this property of the StyleGAN architecture, they could use these first four layers to automatically generate novel viewpoints for the rendering task from only one picture! Similarly, as you can see in the first two rows, doing the opposite and fixing these first four layers, they could produce images of different objects sharing the same viewpoint. This property, coupled with different loss functions, let them control not only the shape and viewpoint of the images but also the texture and background!
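This per-layer style mixing is easy to sketch in code. The toy NumPy “generator” below is purely illustrative, not StyleGAN or NVIDIA’s code: the network, dimensions, and style vectors are made up. It only shows the mechanism the paper exploits, namely that each layer is modulated by its own style vector, so two latent codes can be spliced at the four-layer boundary to recombine coarse and fine attributes.

```python
import numpy as np

# Toy "style-based generator": each layer's features are modulated by a
# per-layer style vector, loosely mimicking how StyleGAN injects its
# latent code at every layer. Illustrative only -- not NVIDIA's code.
rng = np.random.default_rng(0)
N_LAYERS, DIM = 8, 16
WEIGHTS = [rng.standard_normal((DIM, DIM)) * 0.3 for _ in range(N_LAYERS)]

def generate(styles):
    """styles: list of N_LAYERS vectors, one style per layer."""
    x = np.ones(DIM)
    for w, s in zip(WEIGHTS, styles):
        x = np.tanh(w @ x) * s   # this layer's style modulates its features
    return x

w_a = [rng.standard_normal(DIM) for _ in range(N_LAYERS)]  # "image A" styles
w_b = [rng.standard_normal(DIM) for _ in range(N_LAYERS)]  # "image B" styles

original = generate(w_a)
# Swap only the first four (coarse) styles: in the paper, a new "viewpoint".
new_view = generate(w_b[:4] + w_a[4:])
# Keep the first four fixed and vary the rest: same "viewpoint", new "object".
same_view = generate(w_a[:4] + w_b[4:])
```

In the real model, the analogous operation samples fresh codes for the first four layers to synthesize the multi-view images used to train the renderer.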

04:21

This discovery is very innovative, since most work on inverse graphics uses 3D labels, or at least multi-view images of the same object, to train the rendering network. That type of data is typically difficult to obtain and therefore very limited, and because of this lack of training data, such approaches struggle on real photographs due to the domain gap between the synthetic training images and real images. As you can see, this model needs only one picture to generate these amazing, realistic-looking transformations, reducing the need for data annotation by over 10,000 times.

04:59

Of course, the GAN architecture that generates these crucial novel viewpoints also needs to be trained on a lot of data to make this possible. Fortunately, this is far less costly, since it only needs many examples of the object category itself rather than multiple viewpoints of the same object, but it is still a limitation on which objects we can model with this technique. As you can see here, StyleGAN is used as a multi-view generator to build the missing data needed to train the rendering architecture.

05:30

Before getting to the renderer, let’s step back a little to understand the whole process. You can see here that the architecture doesn’t start from a regular image but from a latent code. This latent code is basically what is learned during training. The CNNs and MLPs you see here are just basic convolutional neural networks and multi-layer perceptrons used to create a code that disentangles the shape, texture, and background of the image, meaning that this code independently contains each of these characteristics for use in the rendering model. During training, this code is updated to control these features by playing with the different StyleGAN layers, as we just saw.

06:12

When you use this model and send it an image, the image passes through the StyleGAN encoder to create the latent code containing all the information we need. Then the disentangling module we just discussed extracts the camera viewpoint, the 3D mesh, the texture, and the background of your image. These characteristics are individually sent to the renderer, which produces the final model.
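The disentangling step can be pictured as a set of small prediction heads reading from one shared code. Below is a minimal NumPy sketch; the head names, layer counts, and output sizes are all invented for illustration (the paper’s modules are trained CNNs and MLPs, not random one-layer maps):

```python
import numpy as np

rng = np.random.default_rng(1)
LATENT = 32  # size of the shared latent code (illustrative)

def mlp_head(in_dim, out_dim):
    """A one-layer 'MLP head' -- a toy stand-in for the paper's modules."""
    w, b = rng.standard_normal((out_dim, in_dim)) * 0.1, np.zeros(out_dim)
    return lambda z: np.tanh(w @ z + b)

# One head per disentangled factor; the output sizes are made up.
heads = {
    "viewpoint":  mlp_head(LATENT, 3),        # e.g. azimuth/elevation/distance
    "mesh":       mlp_head(LATENT, 642 * 3),  # per-vertex xyz offsets
    "texture":    mlp_head(LATENT, 64),       # texture-map code
    "background": mlp_head(LATENT, 64),       # background code
}

def disentangle(latent_code):
    """Split one shared latent code into independent factors."""
    return {name: head(latent_code) for name, head in heads.items()}

factors = disentangle(rng.standard_normal(LATENT))
```

Each factor is then consumed separately by the renderer, which is what makes the mesh, texture, and viewpoint independently controllable.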

06:37

In this architecture, the renderer is a state-of-the-art differentiable renderer called DIB-R, here

06:44

referred to as DIFFGraphicsRenderer.

06:46

It is called a differentiable renderer because this technique, also developed by NVIDIA,

06:52

just like StyleGAN and this very paper, was one of the first to allow the gradient to

06:57

be analytically computed over the entire images making it possible to train a neural network

07:03

to generate the 3D shape.
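To see what “differentiable” buys you, here is a deliberately tiny 1-D analogue (my own toy, not DIB-R): a soft-edged “renderer” turns a single shape parameter into pixel intensities, so the gradient of the pixel loss with respect to the shape exists in closed form, and a wrong initial shape can be fitted to a target image by plain gradient descent through the renderer.

```python
import numpy as np

# Toy 1-D "differentiable renderer": a shape parameter r (a disk radius)
# is rendered onto a row of pixels with a soft sigmoid edge, so
# d(pixel)/dr exists everywhere. This is the property DIB-R provides
# for full 3D meshes; everything here is an illustrative stand-in.
xs = np.linspace(0.0, 1.0, 64)   # pixel coordinates
SHARP = 40.0                     # sharpness of the rendered edge

def render(r):
    return 1.0 / (1.0 + np.exp(SHARP * (xs - r)))  # ~1 inside r, ~0 outside

def grad_wrt_r(r, target):
    """Analytic d(mean pixel loss)/dr -- no finite differences needed."""
    img = render(r)
    dimg_dr = SHARP * img * (1.0 - img)             # sigmoid derivative
    return np.mean(2.0 * (img - target) * dimg_dr)

target = render(0.7)             # the "photograph" was made with r = 0.7
r = 0.3                          # start from a wrong shape estimate
for _ in range(200):
    r -= 0.1 * grad_wrt_r(r, target)  # gradient descent *through* the renderer
```

A hard-edged rasterizer would make `render` a step function whose gradient is zero almost everywhere; DIB-R’s contribution is providing usable analytic gradients like this for full 3D mesh rendering.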

07:05

You can see that they mostly used state-of-the-art models for each individual task, because the overall architecture is much more important and innovative than these component models, which are already extremely powerful on their own.

07:19

This is how this new paper, coupled with NVIDIA’s new 3D platform, Omniverse, will allow architects, creators, game developers, and designers around the world to easily add new animated objects to their mockups without needing any expertise in 3D modeling or a large budget to spend on renderings. Note that this application currently only exists for cars, horses, and birds, because of the amount of data GANs need to perform well, but it is extremely promising.

07:49

I just want to come back in one year and see how powerful it will have become. Who would’ve thought 10 or 20 years ago that creating a controllable, realistically animated version of your car on your computer screen could take less than one second, and that to do so you would only need the shiny little gadget in your pocket to take a picture of it and upload it? This is just crazy. I can’t wait to see what researchers will come up with in another 10 to 20 years!

08:17

Before ending this video, I just wanted to announce that I have created a Patreon, if you would like to support my work. It would help me improve the quality of the videos and keep making them. Regardless of what you decide to do, I’ll take this opportunity to thank you for watching the videos. That is, of course, already more than enough!
