Meta Reality Labs shares the inspiration and challenges behind MVP, its dynamic volumetric 3D rendering technology
(Yingwei.com, December 21, 2021) Meta proposed MVP, a dynamic rendering technology for volumetric 3D content, in a paper titled "Mixture of Volumetric Primitives for Efficient Neural Rendering". The technique combines the advantages of volumetric and primitive-based methods to achieve high-performance decoding and efficient rendering. Recently, Stephen Lombardi, a research scientist at Meta's Reality Labs, gave an exclusive interview to SIGGRAPH and discussed the research in more detail.

In the earlier Neural Volumes work, the team proposed a method for reconstructing moving objects and rendering them from novel viewpoints in real time, given only multi-view image data. This is a very exciting research field, because it will enable compelling interactive content in virtual reality and augmented reality.

The main idea behind Neural Volumes is to model the scene with a volumetric 3D representation, which stores an RGB color and an opacity value for every point in space. In that paper, the team explored a voxel-based volumetric representation. The voxel-based approach has several advantages. First, 3D convolutions can generate the voxel grid in real time, which makes it possible to model dynamic scenes. Second, trilinear interpolation can quickly sample color and opacity values from the volume. Together, these advantages allow a Neural Volumes model to be rendered in real time. However, Neural Volumes distributes voxels uniformly over the 3D extent of the scene, which makes it difficult to model objects at high resolution.
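
To make the sampling step concrete, here is a minimal sketch of trilinear interpolation over an RGBA voxel grid, written in plain Python. The function names, the nested-list grid layout and the clamping behavior at the borders are assumptions made for illustration; they are not taken from the paper or from any Meta code.

```python
def _cell(coord, size):
    """Return (lower index, upper index, blend weight) along one axis."""
    # Clamp the coordinate so the lookup never leaves the grid.
    coord = min(max(coord, 0.0), size - 1.0)
    lo = min(int(coord), max(size - 2, 0))
    hi = min(lo + 1, size - 1)
    return lo, hi, coord - lo


def trilinear_sample(grid, x, y, z):
    """Sample grid[i][j][k] = (r, g, b, alpha) at continuous voxel coordinates."""
    i0, i1, tx = _cell(x, len(grid))
    j0, j1, ty = _cell(y, len(grid[0]))
    k0, k1, tz = _cell(z, len(grid[0][0]))
    out = [0.0, 0.0, 0.0, 0.0]
    # Blend the 8 surrounding voxels, each weighted by its distance to the sample.
    for i, wx in ((i0, 1.0 - tx), (i1, tx)):
        for j, wy in ((j0, 1.0 - ty), (j1, ty)):
            for k, wz in ((k0, 1.0 - tz), (k1, tz)):
                w = wx * wy * wz
                for c in range(4):
                    out[c] += w * grid[i][j][k][c]
    return tuple(out)


# A hypothetical 2x2x2 grid whose channels vary linearly with the voxel index.
grid = [[[(float(i), float(j), float(k), float(i + j + k))
          for k in range(2)] for j in range(2)] for i in range(2)]
sample = trilinear_sample(grid, 0.5, 0.25, 0.75)
```

Because each lookup touches only eight neighboring voxels, its cost is constant regardless of grid size, which is one reason this kind of sampling is fast enough for real-time use.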

To solve this problem, the team proposed Mixture of Volumetric Primitives (MVP). Instead of modeling the scene with one large 3D voxel grid, MVP uses a set of smaller, moving voxel grids. Because the model has finer control over voxel density in different parts of the scene, and because the motion of the primitives models the motion of the scene, MVP can model dynamic scenes at higher resolution and at a faster frame rate than Neural Volumes.
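
The idea of a scene built from small moving voxel grids can be sketched as follows. This is only an illustration under stated assumptions: the `Primitive` class, the nearest-voxel lookup and the opacity-weighted rule for combining overlapping primitives are hypothetical simplifications, not the formulation used in the paper.

```python
from dataclasses import dataclass


@dataclass
class Primitive:
    center: tuple      # world-space position of the cube's centre
    rotation: tuple    # 3x3 row-major rotation matrix, local -> world
    half_size: float   # half the cube's edge length, in world units
    voxels: list       # voxels[i][j][k] = (r, g, b, alpha)


def world_to_local(p, prim):
    """Map a world point into the primitive's [-1, 1]^3 local cube."""
    d = [p[a] - prim.center[a] for a in range(3)]
    # The inverse of a rotation is its transpose: local = R^T d / half_size.
    return tuple(sum(prim.rotation[r][c] * d[r] for r in range(3)) / prim.half_size
                 for c in range(3))


def sample_primitive(p, prim):
    """Nearest-voxel lookup (an assumed simplification); None if p is outside."""
    local = world_to_local(p, prim)
    if any(abs(u) > 1.0 for u in local):
        return None
    n = len(prim.voxels)
    i, j, k = (min(int((u + 1.0) * 0.5 * n), n - 1) for u in local)
    return prim.voxels[i][j][k]


def sample_scene(p, prims):
    """Combine all primitives covering p (assumed rule: opacity-weighted color)."""
    r = g = b = a = 0.0
    for prim in prims:
        s = sample_primitive(p, prim)
        if s is None:
            continue
        r, g, b, a = r + s[0] * s[3], g + s[1] * s[3], b + s[2] * s[3], a + s[3]
    if a == 0.0:
        return (0.0, 0.0, 0.0, 0.0)  # empty space: no primitive covers p
    return (r / a, g / a, b / a, min(a, 1.0))


# Hypothetical primitive: a 2x2x2 grid centred at (1, 0, 0), unrotated.
identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
voxels = [[[(float(i), float(j), float(k), 1.0)
            for k in range(2)] for j in range(2)] for i in range(2)]
prim = Primitive(center=(1.0, 0.0, 0.0), rotation=identity, half_size=0.5, voxels=voxels)
inside = sample_scene((1.2, 0.1, -0.1), [prim])
outside = sample_scene((0.0, 0.0, 0.0), [prim])
```

Moving or rotating a primitive changes only its `center` and `rotation`, not its voxel contents, which is how the motion of the primitives can stand in for the motion of the scene. Space not covered by any primitive costs no voxels at all.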

Creating a set of 3D primitives for moving objects involves two main parts: the initialization of primitives and the learning framework for training the system from multi-view video data.

For initialization, classical face modeling techniques (such as keypoint detection, 3D reconstruction and blendshape tracking) are used to generate a dynamic triangle mesh of the face. The primitives are then simply placed on the surface of the triangle mesh, distributed evenly in the UV space of the mesh. This initialization is very important for obtaining truly high-quality results, because the learning framework can otherwise get stuck in a local minimum during training. Initializing the primitives evenly over the face surface ensures that all primitives are used and that resolution is roughly uniform across the face model.
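
One possible way to place primitives evenly in UV space is sketched below: lay out an n-by-n grid of UV points, find the mesh triangle that contains each point, and interpolate the 3D vertex positions with barycentric weights. The function names, the brute-force triangle search and the toy quad mesh are assumptions for illustration only; the interview does not describe the actual implementation.

```python
def barycentric(p, a, b, c):
    """Barycentric weights of 2D point p in triangle (a, b, c), or None if degenerate."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    if det == 0:
        return None
    w0 = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    w1 = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return w0, w1, 1.0 - w0 - w1


def init_primitive_centers(vertices, uvs, triangles, n):
    """Return up to n*n 3D centres, one per cell of an n-by-n grid in UV space."""
    centers = []
    for row in range(n):
        for col in range(n):
            uv = ((col + 0.5) / n, (row + 0.5) / n)  # centre of the UV cell
            for i, j, k in triangles:                # brute-force search
                w = barycentric(uv, uvs[i], uvs[j], uvs[k])
                if w is not None and min(w) >= -1e-9:
                    # Interpolate the triangle's 3D vertices with the UV weights.
                    centers.append(tuple(
                        w[0] * vertices[i][a] + w[1] * vertices[j][a] + w[2] * vertices[k][a]
                        for a in range(3)))
                    break  # UV cells not covered by any triangle are skipped
    return centers


# Hypothetical mesh: a single tilted quad split into two triangles.
vertices = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 2.0, 1.0), (0.0, 2.0, 1.0)]
uvs = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
triangles = [(0, 1, 2), (0, 2, 3)]
centers = init_primitive_centers(vertices, uvs, triangles, 2)
```

Spacing the primitives by UV coordinates rather than by 3D distance gives every part of the surface parameterization a primitive, which matches the goal stated above of having all primitives used from the start.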

Although initialization gives many primitives a suitable starting position (especially on the face), it is usually wrong for other regions such as hair and shoulders. To address this, the model is trained to generate the primitive positions, orientations and contents that best match the images captured by the multi-view capture system. This training process allows high-quality rendering of a person from any angle.

The biggest challenge was deciding which research direction to explore. Although learnable 3D modeling and rendering has become very popular in the past few years, at the time we did not know how successful this approach would be. Even now, we are still working to improve the real-time performance of MVP so that it can compete with more traditional representations such as triangle meshes. Given the complexity of the model, that is very difficult.

Real-time performance matters so much because Meta's mission is to create realistic avatars in virtual reality and, ultimately, to deliver a sense of shared presence in AR, so that people can easily communicate their thoughts and emotions to each other, not only through words but also through facial expressions and body movements.

As you can imagine, all of this requires a large number of people. In addition to the research team that develops the algorithms, there is a large team responsible for managing the hardware and software of the capture system, the data capture process, and the storage and preprocessing of the data (for example, developing and running the classical face tracking algorithms). In fact, this paper is the result of years of hard work by the team in Pittsburgh, Pennsylvania.