Depth completion of a single RGB-D image

Home page: deepcompletion.cs.princeton.edu/

GitHub: github.com/yindaz/DeepCompletionRelease

Paper: deepcompletion.cs.princeton.edu/paper.pdf

Goal: complete the depth channel of an RGB-D image.

Problem: commercial depth cameras usually cannot sense the depth of shiny, bright, transparent, or distant surfaces.

Method: take the RGB image as input and predict dense surface normals and occlusion boundaries. These predictions are then combined with the raw depth observations from the RGB-D camera to solve for the depth at all pixels, including those missing from the raw observation.
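As a minimal sketch of this two-stage pipeline (the functions below are hypothetical stand-ins, not the paper's implementation; the real trained models live in yindaz/DeepCompletionRelease):

```python
import numpy as np

def predict_normals(rgb):
    # Hypothetical stand-in for the paper's surface-normal network.
    h, w, _ = rgb.shape
    normals = np.zeros((h, w, 3))
    normals[..., 2] = 1.0  # dummy output: every normal faces the camera
    return normals

def predict_boundaries(rgb):
    # Hypothetical stand-in for the occlusion-boundary network.
    return np.zeros(rgb.shape[:2])  # dummy output: no boundaries anywhere

def solve_depth(raw_depth, normals, boundaries):
    # Placeholder for the global optimization described later in these
    # notes; here it just passes the raw depth through.
    return raw_depth

def complete_depth(rgb, raw_depth):
    # Stage 1: predict local surface properties from the color image ONLY.
    normals = predict_normals(rgb)
    boundaries = predict_boundaries(rgb)
    # Stage 2: combine the predictions with the raw depth observations
    # via global optimization to fill every pixel.
    return solve_depth(raw_depth, normals, boundaries)
```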

Goal: complete the depth channel of RGB-D images captured by commercial cameras, i.e., fill all the holes in the depth map.

Previous depth inpainting methods rely on hand-tuned approaches, e.g., extrapolating boundary surfaces or Markov image synthesis, to fill the holes.

Deep networks have been used for depth estimation, but not yet for depth completion, due to the following difficulties:

Large-scale training data pairing captured RGB-D images with completed depth maps is not easy to obtain.

Trained that way, depth estimation merely reproduces the observed depth and cannot estimate the unobserved depth.

This paper introduces a new dataset of 105,432 RGB-D images, each paired with a completed depth image computed from large-scale surface reconstructions of 72 real-world environments.

Depth representation

Estimating absolute depth from a monocular color image is very difficult, even for humans, especially for large missing regions like those in Figure 1.

Therefore, this paper first uses a network to predict local differential properties of the depth: surface normals and occlusion boundaries.

No one had previously trained an end-to-end network to complete depth from RGB-D images.

One idea is to extend previous color-based networks to take depth as an additional input, but this runs into a misalignment problem.

What exactly does the misalignment here mean? Is it a spatial misalignment, i.e., pixels with color information do not necessarily have depth information?

This paper uses only the color image as network input, first predicting local surface normals and occlusion boundaries with supervision, since deep networks are well suited to predicting such local features from color. Depth is then completed by a global optimization that combines these predictions with the input depth.

Main insights

Benefit: by doing this, the network is independent of the observed depth and does not need to be retrained for a new depth sensor(?).

Depth estimation from monocular color images

Shape from defocus

Others

- Old methods


Previous methods have not studied depth image inpainting with deep networks; it is a difficult problem because depth images lack robust, discriminative features and large-scale training data.

Markov random field

Shape from shading

Segmentation

Dictionary method

Although some of these methods could be used for depth completion, their focus is different.

Other work studies depth reconstruction from color images augmented with sparse sets of depth measurements.

But the motivation of that line of work is to reduce sensing cost in certain settings (e.g., to cut costs on robots), not depth completion.

Corresponding to the three difficulties mentioned in the introduction, this paper focuses on the following three issues:

However, capturing such paired data (e.g., with a higher-quality depth sensor) is expensive and time-consuming, and public datasets of this kind contain only a few indoor scenes.

Example: Matterport3D

This yields a dataset of paired RGB-D and D* (completed depth) images!
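As an illustration only (not the authors' actual pipeline), rendering a completed depth image D* from a reconstructed mesh could look like the sketch below; the file name, intrinsics, and camera pose are hypothetical placeholders:

```python
import numpy as np
import trimesh
import pyrender

# Hypothetical inputs: a reconstructed mesh and the camera pose of the
# original RGB-D view (in the paper these come from Matterport3D data).
mesh = trimesh.load("reconstruction.ply")
camera_pose = np.eye(4)  # placeholder; the real pose comes from the dataset

scene = pyrender.Scene()
scene.add(pyrender.Mesh.from_trimesh(mesh))
# Placeholder intrinsics roughly matching a 640x480 depth camera.
camera = pyrender.IntrinsicsCamera(fx=577.0, fy=577.0, cx=320.0, cy=240.0)
scene.add(camera, pose=camera_pose)

renderer = pyrender.OffscreenRenderer(viewport_width=640, viewport_height=480)
color, completed_depth = renderer.render(scene)  # depth is 0 where no surface
```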

Question: combining multi-view RGB-D images requires registration between the images, right? Is the mesh from the original dataset already available? (The global surface reconstructions come with the existing dataset.)


Since the surface reconstruction is built at a 3D grid size comparable to the resolution of the depth camera, the completed depth image usually loses no resolution. However, when projected onto the view plane, that same 3D resolution yields an effectively higher pixel resolution for surfaces far from the camera. Therefore, when rendering the high-resolution mesh, the completed depth image can achieve finer-than-original resolution via sub-pixel antialiasing (note the detail in the furniture in Figure 3). (Why?)

The dataset in this paper consists of 117,516 rendered RGB-D images:

Training set: 105,432; test set: 12,084

However, unlike works that predict absolute depth from a single image, this paper predicts local per-pixel properties: surface normals and occlusion boundaries.

Why use surface normals and occlusion boundaries:

Because normals and boundaries are local differential properties that relate closely to cues directly observable in a color image (e.g., local shading variations), dense prediction from color images to surface normals works well.

So, how do we compute depth from the predicted surface normals and occlusion boundaries? (That is handled by the optimization below; first, two questions about training the prediction network.)

A) What loss should be used to train the network?

Two options: compute the loss only on holes, or on all pixels.

Train against rendered normals or raw normals?

See the appendix for details.

Comparative experimental results:

B) Which image channels should be input to the network?

Experiments show that if RGB-D is used as the input for predicting normals, prediction is very poor for pixels inside holes (although it works for observed pixels). Presumably the network predicts normals mainly from the depth channel of the RGB-D input, so it fails in the holes.

The result in Figure 5 motivates the authors to predict surface normals from color images alone.

Separating "prediction without depth" from "optimization with depth" is advantageous for two reasons.

By this point, the network has predicted the surface normal image N and the occlusion boundary image B (==what do these look like?==).

Solving equations

The objective function is a weighted sum of four squared-error terms.

$E_D$: the distance between the estimated depth and the raw observed depth.

$E_N$: consistency between the estimated depth and the predicted surface normals, measured via the dot product between the surface tangent and the normal.

$E_S$: encourages adjacent pixels to have similar depth values.

B: $B \in [0, 1]$ down-weights the normal term according to the predicted probability $B(p)$ that pixel $p$ lies on an occlusion boundary.
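Assembling the four terms above, a plausible reconstruction of the objective (the exact weights and neighborhood definitions follow the paper) is:

$$E = \lambda_D E_D + \lambda_N E_N + \lambda_S E_S$$

$$E_D = \sum_{p \in T_{obs}} \big(D(p) - D_0(p)\big)^2 \qquad E_S = \sum_{(p,q) \in \mathcal{N}} \big(D(p) - D(q)\big)^2$$

$$E_N = \sum_{(p,q) \in \mathcal{N}} b(p)\,\big|\langle v(p,q),\, N(p)\rangle\big|^2$$

Here $D_0$ is the raw observed depth, $T_{obs}$ the set of observed pixels, $\mathcal{N}$ the set of neighboring pixel pairs, $v(p,q)$ the tangent vector between the 3D points back-projected from $p$ and $q$, and $b(p) \in [0,1]$ a weight that decreases where the predicted boundary probability $B(p)$ is high. Since the 3D point at $p$ is $D(p)\,K^{-1}\tilde{p}$ (back-projection through the camera intrinsics $K$), each residual is linear in the unknown depths, so the squared objective is quadratic and its minimization reduces to a sparse linear solve.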

==Question: at an occlusion boundary, the normal-perpendicular-to-tangent constraint does not actually hold, so its weight is reduced? In the extreme case, $E_N$ is simply not considered at occlusion boundaries.==

==Question: isn't the squared error already nonlinear?==

The matrix form of the objective is sparse, symmetric, and positive definite, so ==a sparse Cholesky factorization [11]== can be used to solve for the minimum of the approximated objective function.
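As a rough illustration of that kind of solve (not the paper's implementation), here is a minimal Python/SciPy sketch using only the data and smoothness terms; the normal and boundary terms are omitted, and `spsolve` stands in for a true sparse Cholesky factorization (e.g., CHOLMOD via scikit-sparse, which would exploit the SPD structure):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def complete_depth(d0, observed, lam_d=1.0, lam_s=0.1):
    """Fill holes in depth map d0 (H x W), given a boolean mask of observed
    pixels, by minimizing
        lam_d * sum_obs (D(p) - D0(p))^2 + lam_s * sum_nbrs (D(p) - D(q))^2."""
    h, w = d0.shape
    n = h * w
    idx = np.arange(n).reshape(h, w)

    rows, cols, vals, rhs = [], [], [], []
    r = 0
    # Data term: sqrt(lam_d) * (D(p) - D0(p)) = 0 on observed pixels.
    for p in idx[observed]:
        rows.append(r); cols.append(p); vals.append(np.sqrt(lam_d))
        rhs.append(np.sqrt(lam_d) * d0.flat[p]); r += 1
    # Smoothness term: sqrt(lam_s) * (D(p) - D(q)) = 0 on 4-neighbor pairs.
    for pa, pb in ((idx[:, :-1], idx[:, 1:]), (idx[:-1, :], idx[1:, :])):
        for p, q in zip(pa.ravel(), pb.ravel()):
            rows += [r, r]; cols += [p, q]
            vals += [np.sqrt(lam_s), -np.sqrt(lam_s)]
            rhs.append(0.0); r += 1

    A = sp.csr_matrix((vals, (rows, cols)), shape=(r, n))
    b = np.asarray(rhs)
    # Normal equations: A^T A is sparse, symmetric, positive definite
    # (given at least one observed pixel on a connected grid).
    AtA = (A.T @ A).tocsc()
    Atb = A.T @ b
    return spla.spsolve(AtA, Atb).reshape(h, w)

# Tiny usage example: a 4x4 depth map with a 2x2 hole in the middle.
d0 = np.ones((4, 4)); d0[1:3, 1:3] = 0.0
print(complete_depth(d0, observed=d0 > 0))
```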

Evaluation metrics

(The upper part measures depth error; the lower part measures surface-normal error.)
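For reference, here are the standard definitions of these depth metrics ($d$ is predicted depth, $d^*$ ground truth, over $N$ evaluated pixels); I assume the tables follow these usual conventions:

$$\text{Rel} = \frac{1}{N} \sum_p \frac{|d_p - d^*_p|}{d^*_p} \qquad \text{RMSE} = \sqrt{\frac{1}{N} \sum_p \big(d_p - d^*_p\big)^2}$$

$$\delta_t = \frac{\big|\{p : \max(d_p/d^*_p,\; d^*_p/d_p) < t\}\big|}{N}, \quad t \in \{1.05,\ 1.10,\ 1.25,\ 1.25^2,\ 1.25^3\}$$

For normals, the per-pixel error is the angle $\arccos\langle n_p, n^*_p\rangle$; the tables report its mean and median, plus the fraction of pixels with error below 11.25°, 22.5°, and 30°.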

Table 1 shows results under different inputs (in the table, ↑ means larger is better and ↓ means smaller is better).

For example, the median normal error is 17.28°.

The ==supplementary material== also shows that this advantage persists under different loss settings (observed-only vs. unobserved-only).

The authors believe that when given observed depth as input, the network learns to interpolate rather than synthesize new depth in the holes.

This experimental result motivates splitting the method into two steps: a two-stage system!

Note that in Table 2, D here means predicting depth directly.

Take Rel as an example: 0.089.

The authors argue that surface normals are the best quantity to predict because they represent only the orientation of the surface (see [31] for details), and ==they do not change with depth and are consistent across different views==.

Table 2: "yes" means the boundary weight B is used, "no" means no down-weighting; compare against 0.089.

==Are the surface normals near occlusion boundaries noisy and inaccurate?== See Figure 6.

In Figure 6, the second column shows the network's predicted normals and occlusion boundaries. The third and fourth columns of row 2 compare results with and without boundary weighting; the third and fourth columns of row 1 show surface normals computed from the output depth maps. The occlusion boundaries ==provide depth-discontinuity information, which helps keep boundaries sharp== (see the normal maps computed from the depth).

Figure 7

The horizontal axis is the number of pixels with input depth (unmasked) in the image. The left plot shows predicted-depth accuracy on observed pixels; the right plot, on unobserved pixels.

Unsurprisingly, accuracy on unobserved pixels is lower than on observed pixels. However, only a small amount of input depth is needed (==2000 depth samples is only about 2.5% of all pixels==), which suggests that other depth-sensor designs with sparse measurements could still obtain reasonable predictions ==without retraining the network (its input is color only)==. Note, though, that the ground-truth normals used to train the network come from rendered depth images; it is only at test time that the method is truly independent of the raw depth.
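A quick sanity check on that percentage, assuming the 320×256 resolution mentioned in the limitations below: $2000 / (320 \times 256) = 2000 / 81920 \approx 2.4\%$, consistent with the ~2.5% figure.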

Table 3

The baselines in the table include joint bilateral filtering, the fast bilateral solver, and global edge-aware energy optimization.

The proposed method attains the smallest Rel of all methods.

Figure 8 shows a comparison with joint bilateral filtering.

As shown in Figure 8, this method produces more accurate depth-map boundaries.

Comparison with color-to-depth estimation methods

Table 4

This paper's method is best on all metrics, with improvements of 23-40%. Y denotes evaluation on observed depth; N denotes unobserved depth.

This also shows that predicting normals first is a good approach for depth estimation.

Note that not only is the predicted depth more accurate: comparing the surface normals computed from it shows that this method learns better scene geometry.

It builds a bridge between the color image and the depth map, and the information bridge is the surface normal!

Clearly, this method trades time for image quality.

1. It is slow.

For an image at 320×256 resolution, it takes about 0.3 seconds on an NVIDIA TITAN X GPU and about 1.5 seconds on an Intel Xeon 2.4 GHz CPU.

2. It relies on high-performance hardware, making cost hard to control.