DECEMBER 05, 2019 • 15 MINUTE READ
Ever wondered how a piece of furniture would look in your space before actually buying it? We help you figure that out by reconstructing a 3D model of the furniture from just a single 2D image, so you can visualize how well it fits in your environment with the help of an Augmented Reality (AR) application on your device.
In this project, we build and examine model-free and model-based deep learning methods for 3D reconstruction. In the model-free approach, reconstruction is done directly in a 128×128×128 voxel space, which makes it computationally expensive. In the model-based approach, we use a pre-computed parametric shape representation to reduce the computational requirements. We compare the performance of both approaches and discuss the trade-offs.
Let’s dive in!
Figure [1]
Traditionally, multiple 2D images taken from different views provide the extra information needed to solve this reconstruction problem. The problem becomes far more challenging when the reconstruction has to be done from just a single 2D image; we tackle it by leveraging deep learning techniques.
Figure [2] : 2.5D Sketch Estimation Network
In our framework, shown in Figure [1], we first train a 2.5D sketch estimation network that takes a 2D image and predicts the prior knowledge required for 3D reconstruction. The outputs of this network are 2.5D images: a depth image, a surface normal image, and a silhouette image. It is a ResNet-18 based encoder-decoder architecture trained with a mean squared error (MSE) loss.
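To make the architecture more concrete, below is a minimal PyTorch sketch of such a network. It assumes a torchvision ResNet-18 backbone as the encoder and a plain transposed-convolution decoder; the exact layer sizes and decoder design are illustrative assumptions, not the precise configuration used in our experiments.

```python
# Minimal sketch of a ResNet-18 based encoder-decoder for 2.5D sketch estimation.
# The decoder design and channel sizes are illustrative assumptions.
import torch.nn as nn
from torchvision.models import resnet18

class SketchEstimationNet(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18()
        # Keep everything up to the last residual block (drop avgpool + fc).
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        # Upsample the 512-channel feature map back to the input resolution (x32).
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 5, 4, stride=2, padding=1),  # depth(1) + normals(3) + silhouette(1)
        )

    def forward(self, image):                 # image: (B, 3, H, W)
        features = self.encoder(image)        # (B, 512, H/32, W/32)
        sketches = self.decoder(features)     # (B, 5, H, W)
        depth, normals, silhouette = sketches[:, :1], sketches[:, 1:4], sketches[:, 4:]
        return depth, normals, silhouette

# All three outputs are supervised with a mean squared error loss.
criterion = nn.MSELoss()
```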
Figure [3] : 3D Shape completion network
These 2.5D images are fed into a shape completion network that reconstructs the 3D object in a 128×128×128 voxel space. This is also a ResNet-18 based architecture, trained with a binary cross-entropy loss. Both networks are trained independently on a custom dataset containing synthetically rendered 2D images from ShapeNet and Pix3D. Each image has corresponding 2.5D images and a 3D model to facilitate supervised learning, and this synthetic data provides additional viewpoints to compensate for the limited amount of real training data.
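A similar sketch of the shape completion step follows, assuming a ResNet-18 encoder adapted to the 5-channel 2.5D input and a 3D transposed-convolution decoder that upsamples a coarse latent grid to the 128×128×128 occupancy volume; the layer configuration is again an assumption.

```python
# Minimal sketch of the shape completion network: 2.5D sketches in, a 128^3 occupancy
# grid out. Layer sizes are illustrative assumptions.
import torch.nn as nn
from torchvision.models import resnet18

class ShapeCompletionNet(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18()
        # Accept 5 input channels: depth(1) + normals(3) + silhouette(1).
        backbone.conv1 = nn.Conv2d(5, 64, 7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Linear(backbone.fc.in_features, 512)
        self.encoder = backbone
        # Decode a coarse 8x4x4x4 latent grid into the 128^3 volume (five 2x upsamplings).
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(8, 64, 4, stride=2, padding=1), nn.ReLU(),   # 4  -> 8
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 8  -> 16
            nn.ConvTranspose3d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
            nn.ConvTranspose3d(16, 8, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 64
            nn.ConvTranspose3d(8, 1, 4, stride=2, padding=1),               # 64 -> 128
        )

    def forward(self, sketches):                  # (B, 5, H, W)
        latent = self.encoder(sketches)           # (B, 512)
        volume = latent.view(-1, 8, 4, 4, 4)      # reshape into a coarse 3D grid
        return self.decoder(volume)               # (B, 1, 128, 128, 128) occupancy logits

# Per-voxel occupancy is supervised with binary cross-entropy (logits version here).
criterion = nn.BCEWithLogitsLoss()
```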
Since the networks are trained on rendered images, it is necessary to observe their performance on real images, which have different lighting conditions. One key contribution of this work is a comparison of the model's performance on real-world images versus synthetically rendered images. We take the real-world images from the Extended IKEA dataset.
The results below show the input 2D images, the predicted 2.5D images, and the reconstructed 3D models from two views.
Figure [4] : Output of synthetic image
In Figure [4], we can see that our network tries to learn distinctive features of the chair, such as its "handles", from the given synthetic 2D image.
Figure [5] : Output of real image
In Figure [5], the input is a real image with non-uniform lighting. We can see that the lighting affects the 2.5D predictions; in particular, the silhouette and normal images are distorted.
Figure [6] : Output of real image - failed case
Figure [6] shows one of the failure cases we observed in our experiments. Here, the 2.5D predictions are distorted, resulting in an imperfect 3D reconstruction. This may be because the background and the chair are the same colour.
In the previous approach, the network predicts 128×128×128 points in voxel space. Training a network to predict such a large voxel grid is both time-consuming and computationally intensive. Hence, we also trained a model-based pipeline to reduce both the training time and the computational power needed for 3D reconstruction.
Figure [7] : Model-based approach pipeline
In this approach, the shape is parameterized by a base model, and per-instance deformations are calculated relative to that base model. To compute these deformations, the Iterative Closest Point (ICP) algorithm is used to estimate the best alignment between two point clouds (both point-to-point and point-to-plane variants).
Applying ICP to the vertices of the training objects is quite challenging. Because the ground-truth 3D meshes have varying numbers of vertices, edges, and faces, the meshes are down-sampled so that all training meshes have the same number of vertices. This down-sampling was done manually in Blender.
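As an illustration, the alignment could be done with Open3D, which provides both point-to-point and point-to-plane ICP; the distance threshold and identity initialization below are placeholder assumptions.

```python
# Sketch of aligning a (down-sampled) training mesh's vertices to the base model with ICP,
# using Open3D (the registration module lives under o3d.pipelines in Open3D >= 0.10).
# The threshold and identity initialization are placeholder assumptions.
import numpy as np
import open3d as o3d

def align_to_base(source_vertices, base_vertices, threshold=0.05, point_to_plane=False):
    source = o3d.geometry.PointCloud()
    source.points = o3d.utility.Vector3dVector(source_vertices)
    target = o3d.geometry.PointCloud()
    target.points = o3d.utility.Vector3dVector(base_vertices)
    if point_to_plane:
        target.estimate_normals()  # point-to-plane ICP needs target normals
        estimation = o3d.pipelines.registration.TransformationEstimationPointToPlane()
    else:
        estimation = o3d.pipelines.registration.TransformationEstimationPointToPoint()
    result = o3d.pipelines.registration.registration_icp(
        source, target, threshold, np.identity(4), estimation)
    # Apply the estimated rigid transform so every training shape lives in the base frame.
    source.transform(result.transformation)
    return np.asarray(source.points)
```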
Now that we have a one-to-one correspondence between all the vertices, the vertices are flattened and the deformation is calculated as:
Deformation = Flattened Input Vertices - Base Vertices
This deformation is computed for every training model, and the resulting vectors are stacked as columns to build a deformation matrix. Principal Component Analysis (PCA) is then performed on the deformation matrix to obtain the directions of maximum variation in deformation.
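A small NumPy/scikit-learn sketch of these two steps, keeping 15 components (the number used later in the post); variable and function names are illustrative.

```python
# Sketch: build the deformation matrix (one column per training model) and run PCA on it.
import numpy as np
from sklearn.decomposition import PCA

def build_deformation_matrix(aligned_vertex_sets, base_vertices):
    base = base_vertices.reshape(-1)                          # flatten (V, 3) -> (3V,)
    columns = [v.reshape(-1) - base for v in aligned_vertex_sets]
    return np.stack(columns, axis=1)                          # (3V, N) deformation matrix

def principal_deformations(deformation_matrix, n_components=15):
    # scikit-learn expects samples as rows, so transpose the (3V, N) matrix.
    pca = PCA(n_components=n_components).fit(deformation_matrix.T)
    basis = pca.components_.T                                 # (3V, n_components) principal directions
    return basis, pca.explained_variance_ratio_               # variance explained per direction
```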
We can represent the deformation of every training model as shown below.
Figure [8]
Thus, we can obtain the deformation coefficients of every training model by taking the pseudo-inverse of the matrix of principal deformation directions and multiplying it with that model's deformation.
Figure [9]
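In code, assuming U is the matrix whose columns are the principal deformation directions from the previous step, the least-squares coefficients for a model's deformation vector d are pinv(U) @ d:

```python
import numpy as np

def deformation_coefficients(principal_directions, deformation):
    # c = pinv(U) @ d: the coefficients that best reproduce this model's deformation
    # as a linear combination of the principal directions.
    return np.linalg.pinv(principal_directions) @ deformation   # (n_components,)
```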
For a new 2D image, these deformation coefficients are predicted with a trained CNN. The CNN is trained in a supervised way with a mean squared error loss.
Figure [10]
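A sketch of such a coefficient-regression network; reusing a ResNet-18 backbone here mirrors the model-free pipeline and is our assumption rather than a requirement.

```python
# Sketch of the coefficient-regression CNN: an image encoder whose final layer is
# replaced to output the 15 deformation coefficients, trained with MSE.
import torch.nn as nn
from torchvision.models import resnet18

class DeformationCoefficientNet(nn.Module):
    def __init__(self, n_coefficients=15):
        super().__init__()
        self.backbone = resnet18()
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, n_coefficients)

    def forward(self, image):                 # image: (B, 3, H, W)
        return self.backbone(image)           # (B, 15) predicted deformation coefficients

# Supervised regression against the ground-truth coefficients computed above.
criterion = nn.MSELoss()
```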
Now that our model predicts the deformation coefficients, we can obtain the instance-specific deformation through a simple matrix multiplication, as shown in Figure [8]. We get the final predicted shape by adding this deformation to the base vertices.
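Putting the last two steps together, the reconstruction is a single matrix multiplication followed by an addition (names as in the earlier sketches):

```python
import numpy as np

def reconstruct_vertices(base_vertices, principal_directions, coefficients):
    deformation = principal_directions @ coefficients      # (3V,) instance-specific deformation
    return base_vertices + deformation.reshape(-1, 3)      # (V, 3) predicted vertices
```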
Figure [11]
Here we see that, by adding the relevant deformations to the base model, we predict the output 3D shape. The predicted output is very close to the ground-truth model: it learns the curvature along the side to an extent, fills in the gap that our base model had, and tries to eliminate the extra support in the base model's legs. The outcome is noisy, but given that the network only has to predict 15 coefficients, the pipeline does an excellent job of predicting 3D models.
Below, let us have a look at how our model learns the deformation.
Figure [12]
Here we see how our model learns the deformations that transform the base model towards the ground truth. The principal direction explains 59.19% of the variance in deformation. To visualize the transformation our model applies, we increase the coefficient along the principal direction of deformation; as we keep increasing it, the model keeps adding deformation, bringing the shape closer to the actual 3D model.
Figure [13]
Here, the ground-truth model is quite different from the base model, so even though the network tries to learn the deformation, the noise in the final predicted shape is too high. Hence, even though the model-based approach is computationally cheap, its results depend heavily on the base model we choose.
Figure [14]
For real-world testing, we evaluate both approaches on a real 2D image of a chair. The model-free approach reconstructs the shape of the chair but fails to capture specific features such as thickness; this can be attributed to the differences in lighting and background between synthetic and real images. The model-based approach captures the overall shape of the chair but produces a noisy result. Also, for the model-based approach to give good results, the base model should be very close to the output shape we want to predict, whereas the model-free approach has no such constraint. On the other hand, the model-free approach requires a lot of computational power to predict 128×128×128 voxels, whereas the model-based approach only has to predict 15 coefficients!
Life comes with trade-offs! :)
[1] MarrNet : 3D Shape Reconstruction via 2.5D Sketches, Wu J., Wang Y., Xue T., Sun X., Freeman W.T. & Tenenbaum J.B., NIPS (2017)
[2] Deep Residual Learning for Image Recognition, He K., Zhang X., Ren S. & Sun J., CVPR (2015)
[3] Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling, Wu J., Zhang C., Xue T., Freeman W.T. & Tenenbaum J.B., NIPS (2016)
[4] Learning Category-Specific Mesh Reconstruction from Image Collections, Kanazawa A., Tulsiani S., Efros A.A. & Malik J., ECCV (2018)
[5] Learning 3D Shape Priors for Shape Completion and Reconstruction, Wu J., Zhang C., Zhang X., Zhang Z., Freeman W.T. & Tenenbaum J.B., ECCV (2018)
[6] 3D Menagerie: Modeling the 3D Shape and Pose of Animals, Zuffi S., Kanazawa A., Jacobs D. & Black M.J., CVPR (2017)
[7] Neural 3D Mesh Renderer, Hiroharu K., Yoshitaka U. & Tatsuya H., CVPR (2018)