2D to 3D Reconstruction of Furniture Objects




December 05, 2019 • 15 minute read


Ever thought about how good a piece of furniture would look in your desired space before actually buying it?

We help you figure that out by reconstructing a 3D model of the furniture from just a single 2D image, so you can visualize how well it fits in your environment with the help of an Augmented Reality (AR) application on your device.

In this project, we build and examine model-free and model-based deep learning methods for 3D reconstruction. In the model-free approach, the reconstruction is done directly in a 128×128×128 voxel space, which makes it computationally expensive. In the model-based approach, we use a pre-computed parametric shape representation to reduce the computational requirements. We compare the performance of the two approaches and discuss the trade-offs.

Let’s dive in!

Model-Free Approach



Figure [1] : Model-free approach pipeline

Traditionally, multiple 2D images taken from different views provide the extra information needed to solve this reconstruction problem. The problem becomes considerably harder when the reconstruction has to be done from just a single 2D image. We tackle it by leveraging deep learning techniques.


Figure [2] : 2.5D Sketch Estimation Network

In our framework, shown in Figure [1], we first train a 2.5D sketch estimation network that takes in a 2D image and predicts the prior knowledge required for 3D reconstruction. The outputs of this network are 2.5D sketches: a depth image, a surface normal image, and a silhouette image. It is a ResNet18-based encoder-decoder architecture trained with mean squared error as the loss function.
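To make this concrete, below is a minimal PyTorch sketch of what such a network could look like: a ResNet18 encoder followed by a shared transposed-convolution decoder whose five output channels are split into depth (1), surface normals (3), and silhouette (1). The exact layer configuration, image resolution, and hyper-parameters are illustrative assumptions, not our exact implementation.

```python
import torch
import torch.nn as nn
import torchvision


class SketchEstimator(nn.Module):
    """Illustrative 2.5D sketch estimator: ResNet18 encoder + up-convolutional
    decoder with 5 output channels (depth | surface normals | silhouette)."""

    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet18()
        # Keep everything up to the final conv feature map: (B, 512, H/32, W/32).
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])
        # Five transposed convolutions upsample back to the input resolution.
        channels = [512, 256, 128, 64, 32]
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                       nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
        layers += [nn.ConvTranspose2d(32, 5, 4, stride=2, padding=1)]
        self.decoder = nn.Sequential(*layers)

    def forward(self, image):
        out = self.decoder(self.encoder(image))
        depth, normals, silhouette = out[:, :1], out[:, 1:4], out[:, 4:]
        return depth, normals, silhouette


# Supervised training step with MSE on all three 2.5D targets.
model = SketchEstimator()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

images = torch.randn(2, 3, 256, 256)      # dummy batch of rendered 2D images
targets = torch.randn(2, 5, 256, 256)     # ground-truth depth | normals | silhouette
depth, normals, silhouette = model(images)
loss = criterion(torch.cat([depth, normals, silhouette], dim=1), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```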


Figure [3] : 3D Shape completion network

These 2.5D sketches are fed into a shape completion network that reconstructs the 3D object in a 128×128×128 voxel space. It is also a ResNet18-based architecture, trained with binary cross-entropy as the loss function. Both networks are trained independently on a custom dataset of synthetically rendered 2D images from ShapeNet and Pix3D; each image comes with corresponding 2.5D sketches and a 3D model to enable supervised learning. The synthetic data provides additional viewpoints and alleviates the shortage of training data.
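The shape completion step could be sketched as follows: a ResNet18 encoder over the stacked 2.5D sketches (5 input channels), followed by a 3D transposed-convolution decoder that emits occupancy logits for the 128×128×128 grid, trained with binary cross-entropy. The layer sizes and the decoder design are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torchvision


class ShapeCompletion(nn.Module):
    """Illustrative shape completion network: 2.5D sketches in, voxel logits out."""

    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet18()
        # Accept the 5-channel 2.5D input (depth + normals + silhouette) instead of RGB.
        resnet.conv1 = nn.Conv2d(5, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])  # -> (B, 512, 1, 1)
        self.fc = nn.Linear(512, 256 * 4 * 4 * 4)                    # seed a 4^3 volume
        channels = [256, 128, 64, 32, 16]
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [nn.ConvTranspose3d(c_in, c_out, 4, stride=2, padding=1),
                       nn.BatchNorm3d(c_out), nn.ReLU(inplace=True)]
        layers += [nn.ConvTranspose3d(16, 1, 4, stride=2, padding=1)]  # 4^3 -> 128^3
        self.decoder = nn.Sequential(*layers)

    def forward(self, sketches):
        feat = self.encoder(sketches).flatten(1)
        volume = self.fc(feat).view(-1, 256, 4, 4, 4)
        return self.decoder(volume)  # occupancy logits over the 128^3 voxel grid


model = ShapeCompletion()
criterion = nn.BCEWithLogitsLoss()
sketches = torch.randn(1, 5, 256, 256)                          # dummy 2.5D input
gt_voxels = torch.randint(0, 2, (1, 1, 128, 128, 128)).float()  # dummy occupancy grid
loss = criterion(model(sketches), gt_voxels)
```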

Since the networks are trained on rendered images, it is important to evaluate their performance on real images with different lighting conditions. One key contribution is a comparison of the model's performance on real-world images versus synthetically rendered images. We take the real-world images from the Extended IKEA dataset.

Results

The results below show the input 2D image, the predicted 2.5D sketches, and two views of the reconstructed 3D model.


Figure [4] : Output for a synthetic image

In Figure [4], we can see that, for the given synthetic 2D image, our network learns distinct features of the chair, such as the “handles”.


Figure [5] : Output for a real image

In Figure [5], the input is a real image with non-uniform lighting. The lighting clearly affects the 2.5D predictions; in particular, the silhouette and normal images are distorted.


Figure [6] : Output for a real image (failure case)

Figure [6] shows one of the failure cases we observed in our experiments. Here the 2.5D predictions are distorted, resulting in an imperfect 3D reconstruction, most likely because the background and the chair are of the same colour.

Model-Based Approach

In the previous approach, the network predicts occupancy for 128×128×128 points in voxel space. Training a network to predict such a huge voxel grid is both time-consuming and computationally intensive. Hence, we built a model-based pipeline to reduce both the training time and the computational power needed for 3D reconstruction.



Figure [7] : Model-based approach pipeline


In this approach, the shape is parameterized by a base model, and per-instance deformations are calculated relative to that base model. To compute these deformations, the Iterative Closest Point (ICP) algorithm is used to estimate the best alignment between two point clouds, using both point-to-point and point-to-plane variants.
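As an illustration, the alignment step could be implemented with an off-the-shelf ICP such as Open3D's registration module. The library choice, correspondence threshold, and identity initialization below are assumptions, not necessarily the settings of our pipeline.

```python
import numpy as np
import open3d as o3d


def align_to_base(source_pts, base_pts, threshold=0.05):
    """Align one training shape's points to the base model with ICP and return
    the better of the point-to-point and point-to-plane alignments."""
    source = o3d.geometry.PointCloud()
    source.points = o3d.utility.Vector3dVector(source_pts)   # (N, 3) array
    target = o3d.geometry.PointCloud()
    target.points = o3d.utility.Vector3dVector(base_pts)
    target.estimate_normals()                 # normals are needed for point-to-plane
    init = np.eye(4)                          # identity initial transform

    point_to_point = o3d.pipelines.registration.registration_icp(
        source, target, threshold, init,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    point_to_plane = o3d.pipelines.registration.registration_icp(
        source, target, threshold, init,
        o3d.pipelines.registration.TransformationEstimationPointToPlane())

    # Keep whichever variant fits better (lower RMSE over inlier correspondences).
    best = min([point_to_point, point_to_plane], key=lambda r: r.inlier_rmse)
    return best.transformation
```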

Applying ICP to the vertices of the training objects is challenging because the ground-truth 3D meshes have varying numbers of vertices, edges, and faces. The meshes are therefore down-sampled so that all training meshes have the same number of vertices; this down-sampling was done manually in Blender.

Now that we have a one-to-one correspondence between the vertices, each mesh's vertices are flattened and its deformation is calculated as given below:

Deformation = Flattened Input Vertices - Base Vertices


This deformation is calculated for every training example, and the resulting vectors are stacked as columns to build a deformation matrix. Principal Component Analysis (PCA) is then performed on the deformation matrix to obtain the directions of maximum variation in deformation.
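A minimal NumPy / scikit-learn sketch of these two steps is shown below. The variable names (`aligned_vertices`, `base_vertices`) are hypothetical; the 15 components correspond to the 15 coefficients mentioned later.

```python
import numpy as np
from sklearn.decomposition import PCA


def deformation_basis(aligned_vertices, base_vertices, n_components=15):
    """aligned_vertices: list of (V, 3) arrays (ICP-aligned, down-sampled meshes);
    base_vertices: (V, 3) base model with the same vertex ordering."""
    base = base_vertices.reshape(-1)                                        # (3V,)
    # Deformation = flattened input vertices - base vertices
    D = np.stack([v.reshape(-1) - base for v in aligned_vertices], axis=1)  # (3V, N)
    # PCA over the N training deformations: the components are the directions
    # of maximum variation in deformation.
    pca = PCA(n_components=n_components)
    pca.fit(D.T)                               # one row per training deformation
    U = pca.components_.T                      # (3V, n_components) deformation basis
    return U, pca.explained_variance_ratio_
```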

We can then represent the deformation of every training example as shown below.


Figure [8]


Thus, we can obtain the deformation coefficients of every training model by taking the pseudo-inverse of the matrix of principal deformation directions and multiplying it with that model's deformation vector.


Figure [9]
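In code, using stand-in values for the deformation basis and a deformation vector, this step is just a pseudo-inverse (the names below are hypothetical):

```python
import numpy as np

V = 1000                                          # stand-in vertex count
U = np.linalg.qr(np.random.randn(3 * V, 15))[0]   # stand-in (3V, 15) deformation basis
d = np.random.randn(3 * V)                        # stand-in (3V,) deformation vector

alpha = np.linalg.pinv(U) @ d                     # 15 deformation coefficients
# Since the PCA directions are orthonormal, U.T @ d gives the same coefficients.
```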


For a new 2D image, these deformation coefficients are predicted by a trained CNN. The CNN is trained in a supervised manner with mean squared error as the loss function.


Figure [10]
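A minimal PyTorch sketch of such a regressor is given below. The ResNet18 backbone and the input resolution are assumptions for illustration; only the supervision (mean squared error on the 15 coefficients) is specified above.

```python
import torch
import torch.nn as nn
import torchvision


class CoefficientRegressor(nn.Module):
    """Illustrative CNN that regresses 15 deformation coefficients from one image."""

    def __init__(self, n_coefficients=15):
        super().__init__()
        self.backbone = torchvision.models.resnet18()
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, n_coefficients)

    def forward(self, image):
        return self.backbone(image)


model = CoefficientRegressor()
criterion = nn.MSELoss()                      # supervised regression of the coefficients
images = torch.randn(4, 3, 224, 224)          # dummy batch of 2D images
gt_alpha = torch.randn(4, 15)                 # coefficients obtained via the pseudo-inverse
loss = criterion(model(images), gt_alpha)
```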


Now that our model predicts the deformation coefficients, we can obtain the instance-specific deformation by a simple matrix multiplication, as shown in Figure [8]. We obtain the final predicted shape by adding this deformation to the base vertices.
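Continuing with stand-in values (hypothetical names, as above), assembling the final shape is a single matrix multiplication followed by an addition to the base vertices:

```python
import numpy as np

V = 1000
U = np.linalg.qr(np.random.randn(3 * V, 15))[0]   # stand-in deformation basis from PCA
base_vertices = np.random.randn(V, 3)             # stand-in base model vertices
alpha_predicted = np.random.randn(15)             # stand-in CNN output

predicted_deformation = (U @ alpha_predicted).reshape(-1, 3)   # (V, 3) deformation
predicted_vertices = base_vertices + predicted_deformation     # final predicted shape
```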

Results

Figure [11]

Here we see that, by adding the relevant deformations to the base model, we predict the output 3D shape. The predicted output is very close to the ground-truth model: it captures the curvature along the sides to an extent, fills in the gap present in the base model, and removes the extra support between the base model's legs. The outcome is somewhat noisy, but given that only 15 coefficients have to be predicted, the pipeline does a remarkably good job of predicting 3D models.

Below, let us have a look at how our model learns the deformation.


Figure [12]

Here we see how our model learns deformations that transform the base model toward the ground truth. The first principal direction explains 59.19% of the variance in deformation. To visualize the transformation the model applies, we gradually increase the coefficient along this principal direction; as it increases, the added deformation brings the shape closer to the actual 3D model.


Limitations

Figure [13]

Here the ground-truth model is quite different from the base model, and even though the network tries to learn the deformation, the noise in the final predicted shape is too high. So, while the approach is computationally cheap, its results depend heavily on the base model we choose.


Performance Analysis


Figure [14]

For real-world testing, we evaluate both approaches on a real 2D image of a chair. The model-free approach reconstructs the shape of the chair but fails to capture specific features such as its thickness; this can be attributed to the lighting and background differences between synthetic and real images. The model-based approach captures the overall shape of the chair, but the result is noisy.

Also, for the model-based approach to give good results, the base model should be very close to the output shape we want to predict, whereas the model-free approach has no such constraint. On the other hand, the model-free approach requires a lot of computational power to predict a 128×128×128 voxel grid, whereas the model-based approach only has to predict 15 coefficients!

Life comes with trade-offs! :)


Demo



Scope for Improvement

  • Choosing a better base model by computing a mean model that generalizes well for specific object categories.


  • Employing a Generative Adversarial Network (GAN) for improving the naturalness of the generated object surfaces.


  • Adding texture to the generated object shapes using style transfer techniques.

Code

You can find the code for our project here.

References

[1] MarrNet: 3D Shape Reconstruction via 2.5D Sketches, Wu J., Wang Y., Xue T., Sun X., Freeman W.T. & Tenenbaum J.B., NIPS (2017)

[2] Deep Residual Learning for Image Recognition, He K., Zhang X., Ren S. & Sun J., CVPR (2015)

[3] Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling, Wu J., Zhang C., Xue T., Freeman W.T. & Tenenbaum J.B., NIPS (2016)

[4] Learning Category-Specific Mesh Reconstruction from Image Collections, Kanazawa A., Tulsiani S., Efros A.A. & Malik J., ECCV (2018)

[5] Learning 3D Shape Priors for Shape Completion and Reconstruction, Wu J., Zhang C., Zhang X., Zhang Z., Freeman W.T. & Tenenbaum J.B., ECCV (2018)

[6] 3D Menagerie: Modeling the 3D Shape and Pose of Animals, Zuffi S., Kanazawa A., Jacobs D. & Black M.J., CVPR (2017)

[7] Neural 3D Mesh Renderer, Hiroharu K., Yoshitaka U. & Tatsuya H., CVPR (2018)


Special thanks to Karl Pertsch and Prof. Joseph Lim for their support and guidance.