(e.g., He et al., 2016) gradually reduces the resolution of the feature maps. To make pixel-level predictions, the decoder must therefore upsample the representations back to the pixel level. Most state-of-the-art semantic segmentation models do not pretrain the additional parameters introduced by the decoder, but instead initialize them randomly. In this paper, we argue that random initialization of the decoder is far from optimal, and that performance can be improved significantly by pretraining the decoder weights with a simple but effective denoising approach.
Denoising autoencoders have a long and rich history in machine learning (Vincent et al., 2008; 2010). The general approach is to add noise to clean data and train a model to separate the noisy data back into its clean and noise components, which requires the model to learn the data distribution. Denoising objectives are well suited to training dense prediction models because they are naturally defined at the pixel level. Although the idea of denoising has a long history, denoising objectives have recently attracted renewed interest in the context of denoising diffusion probabilistic models (DDPMs) (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020). Through a sequence of iterative denoising steps, DDPMs map Gaussian noise to a target distribution and can thus approximate complex empirical distributions. This approach has achieved impressive results in image and audio synthesis (Nichol & Dhariwal, 2021; Dhariwal & Nichol, 2021; Saharia et al., 2021b; Ho et al., 2021; Chen et al., 2021b), surpassing strong GAN and autoregressive baselines in sample quality scores.
Inspired by the renewed interest in, and success of, denoising in diffusion models, we study the effectiveness of denoising autoencoder representations for semantic segmentation, in particular for pretraining the decoder weights that are usually randomly initialized.
In summary, this paper studies pretraining of the decoder in semantic segmentation architectures and finds that it yields significant gains over random initialization, especially in limited-label settings. We propose to pretrain the decoder with denoising, and draw on connections between denoising autoencoders and diffusion probabilistic models to improve several aspects of denoising pretraining, such as predicting the noise rather than the image as the denoising target and scaling the image before adding Gaussian noise. This leads to significant improvements over standard supervised pretraining of the encoder on three datasets.
Section 2 briefly summarizes our setup and conventional denoising pretraining, which we then study in depth. Sections 3 and 4 cover denoising pretraining of the full model and of the decoder only, respectively. Section 5 presents empirical comparisons with state-of-the-art methods.
2 Method
Our goal is to learn image representations that transfer well to dense visual prediction tasks. We consider an architecture consisting of an encoder $f_\theta$ and a decoder $g_\phi$, parameterized by two sets of parameters $\theta$ and $\phi$. The model takes an image $x \in \mathbb{R}^{H \times W \times C}$ as input and converts it into a dense representation $y \in \mathbb{R}^{h \times w \times c}$, e.g., a semantic segmentation mask.
We aim to find initializations of the parameters $\theta$ and $\phi$ such that the model can be fine-tuned effectively for semantic segmentation from few labeled examples. For the encoder parameters $\theta$, we can follow standard practice and initialize them with weights pretrained on a classification task. Our main contribution concerns the decoder parameters $\phi$, which are usually randomly initialized. We propose to pretrain these parameters as a denoising autoencoder (Vincent et al., 2008; 2010): given an unlabeled image $x$, we obtain a noisy image $\tilde{x}$ by adding Gaussian noise $\sigma\epsilon$ with a fixed standard deviation $\sigma$ to $x$, and then train the model $g_\phi \circ f_\theta$ as an autoencoder to minimize the reconstruction error $\|g_\phi(f_\theta(\tilde{x})) - x\|_2^2$ (optimizing only $\phi$ while keeping $\theta$ fixed). We call this approach decoder denoising pretraining (DDeP). Alternatively, both $\phi$ and $\theta$ can be trained with the denoising objective (denoising pretraining; DeP). Next, we discuss several important modifications to the standard autoencoder formulation that significantly improve the quality of the representations.
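To make the setup concrete, the following is a minimal PyTorch sketch of decoder denoising pretraining, assuming hypothetical `encoder` and `decoder` modules standing in for $f_\theta$ and $g_\phi$; the noise level and module interfaces are illustrative, not the exact architecture or hyperparameters used in the paper.

```python
import torch
import torch.nn.functional as F

def ddep_step(encoder, decoder, images, sigma=0.2, optimizer=None):
    """One decoder-denoising-pretraining (DDeP) step: corrupt the image,
    reconstruct it with g_phi(f_theta(x_noisy)), and update only the decoder."""
    encoder.eval()                              # encoder f_theta stays frozen
    for p in encoder.parameters():
        p.requires_grad_(False)

    noise = sigma * torch.randn_like(images)    # Gaussian noise with std sigma
    noisy = images + noise                      # standard additive corruption

    with torch.no_grad():
        features = encoder(noisy)               # f_theta(x_noisy), no encoder gradients
    recon = decoder(features)                   # g_phi(f_theta(x_noisy))

    loss = F.mse_loss(recon, images)            # || g_phi(f_theta(x_noisy)) - x ||^2
    if optimizer is not None:
        optimizer.zero_grad()
        loss.backward()                         # gradients flow into the decoder only
        optimizer.step()
    return loss
```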
As our experimental setup, we use TransUNet (Chen et al., 2021a; Figure 2). The encoder is pretrained on ImageNet-21k classification (Deng et al., 2009), while the decoder is pretrained with our denoising approach, also using unlabeled ImageNet-21k images. After pretraining, the model is fine-tuned on the Cityscapes, Pascal Context, or ADE20K semantic segmentation datasets (Cordts et al., 2016; Mottaghi et al., 2014; Zhou et al., 2018). We report the mean intersection-over-union (mIoU) across all semantic classes. Further implementation details are given in Section 5.1.
Figure 1 shows that our DDeP approach clearly outperforms encoder-only pretraining, especially in the low-data regime. Figure 6 shows that even DeP, i.e., pretraining the whole model (encoder and decoder) by denoising without any supervised pretraining, is competitive with supervised pretraining. Our results show that, despite its simplicity, denoising pretraining is an effective method for learning semantic segmentation representations.
3 Denoising Pretraining of the Encoder and Decoder
As noted above, our goal is to learn effective visual representations that transfer well to semantic segmentation and other dense visual prediction tasks. We revisit the denoising objective for this purpose. We first present the standard denoising autoencoder formulation (for the encoder and decoder jointly). We then propose several modifications to this standard formulation, inspired by the recent success of diffusion models in image generation (Ho et al., 2020; Nichol & Dhariwal, 2021; Saharia et al., 2021b).
3.1 Standard Denoising Objective
In the standard denoising autoencoder formulation, given an unlabeled image $x$, we obtain a noisy image $\tilde{x}$ by adding Gaussian noise with a fixed standard deviation $\sigma$ to $x$,

$$\tilde{x} = x + \sigma\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I). \tag{1}$$

We then train the autoencoder $g_\phi \circ f_\theta$ to minimize the reconstruction error $\|g_\phi(f_\theta(\tilde{x})) - x\|_2^2$. The objective function thus takes the form

$$\mathbb{E}_{x}\,\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\,\big\|\,g_\phi(f_\theta(x + \sigma\epsilon)) - x\,\big\|_2^2.$$
While this objective already yields useful representations for semantic segmentation, we find that a few key modifications can significantly improve representation quality.
3.2 Choice of Prediction Target
The standard denoising autoencoder objective trains the model to predict the noiseless image $x$. Diffusion models, however, are typically trained to predict the noise vector $\epsilon$ instead (Vincent, 2011; Ho et al., 2020):

$$\min_{\theta,\phi}\;\mathbb{E}_{x}\,\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\,\big\|\,g_\phi(f_\theta(x + \sigma\epsilon)) - \epsilon\,\big\|_2^2.$$
For models with skip connections from the input $\tilde{x}$ to the output, the two formulations behave similarly: the model can easily combine its estimate of $\epsilon$ with the input $\tilde{x}$ to recover $x$.
In the absence of explicit skip connections, however, our experiments show that predicting the noise vector is clearly better than predicting the noise-free image (Table 1).
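A minimal sketch of the two prediction targets, assuming the same hypothetical `encoder`/`decoder` modules as above; the only difference is which tensor the mean-squared error is computed against.

```python
import torch
import torch.nn.functional as F

def denoising_loss(encoder, decoder, images, sigma=0.2, predict_noise=True):
    """Denoising loss with either the noise vector or the clean image as target."""
    noise = torch.randn_like(images)
    noisy = images + sigma * noise
    pred = decoder(encoder(noisy))
    target = noise if predict_noise else images  # diffusion-style noise target vs. image target
    return F.mse_loss(pred, target)
```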
3.3 Scalability of Denoising as a Pretraining Objective
Unsupervised pretraining methods are ultimately limited by the mismatch between the representations learned under the pretraining objective and the representations required by the final target task. An important sanity check for any unsupervised objective is that this limit is not reached too quickly, i.e., that the objective is well aligned with the target task. We find that the quality of representations learned by denoising continues to improve up to our maximum feasible pretraining compute budget (Figure 3). This suggests that denoising is a scalable approach whose quality improves as the compute budget grows.
3.4 Denoising versus Supervised Pretraining
In the standard denoising autoencoder formulation, the entire model (encoder and decoder) is trained with the denoising objective. However, at least when fine-tuning data is plentiful, denoising pretraining of the whole model underperforms standard supervised pretraining of the encoder (Table 2). In the next section, we explore combining denoising and supervised pretraining to obtain the benefits of both.
4 Denoising Pretraining of the Decoder Only
Since strong and scalable methods already exist for pretraining the encoder weights, the main potential of denoising lies in pretraining the decoder weights. We therefore fix the encoder parameters $\theta$ to the values obtained by supervised pretraining on ImageNet-21k and pretrain only the decoder parameters $\phi$ with the denoising objective:

$$\min_{\phi}\;\mathbb{E}_{x}\,\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\,\big\|\,g_\phi(f_\theta(\tilde{x})) - \epsilon\,\big\|_2^2.$$
We call this pretraining scheme decoder denoising pretraining (DDeP). As shown below, DDeP outperforms both purely supervised and purely denoising-based pretraining across all label-efficiency regimes. Before presenting benchmark results in Section 5, we study the key design decisions of DDeP in this section, such as the noise formulation and the optimal noise level.
4.1 Noise Magnitude and Relative Scaling of Image and Noise
The key hyperparameter of decoder denoising pretraining is the magnitude of the noise added to the image. The noise standard deviation $\sigma$ must be large enough that the network has to learn meaningful image representations to remove it, but not so large that it causes a large distribution shift between clean and noisy images. Figure 4 shows examples for several values of $\sigma$ for visual inspection.
Beyond the absolute magnitude of the noise, we find that the relative scaling of the clean image and the noise also plays an important role, and different denoising formulations differ in this respect. Specifically, DDPMs generate the noisy image $\tilde{x}$ as

$$\tilde{x} = \sqrt{\gamma}\,x + \sqrt{1-\gamma}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I). \tag{5}$$

This differs from the standard denoising formulation in Equation (1) in that $x$ is attenuated by $\sqrt{\gamma}$ and $\epsilon$ by $\sqrt{1-\gamma}$, which ensures that the random variable $\tilde{x}$ has unit variance if $x$ has unit variance. With this formulation, our denoising pretraining objective becomes

$$\min_{\phi}\;\mathbb{E}_{x}\,\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\,\big\|\,g_\phi\big(f_\theta(\sqrt{\gamma}\,x + \sqrt{1-\gamma}\,\epsilon)\big) - \epsilon\,\big\|_2^2. \tag{6}$$
In Figure 5, we compare this scaled additive noise formulation with the simple additive noise formulation (Equation (1)) and find that scaling the image significantly improves downstream semantic segmentation performance. We conjecture that decoupling the variance of the noisy image from the noise magnitude reduces the distribution shift between clean and noisy images, and thus improves the transfer of the pretrained representations to the final task. We therefore use this formulation in the remainder of the paper. For the scaled additive noise formulation, we find an optimal noise magnitude of 0.22 (Figure 5) and use this value in the following experiments.
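A small sketch of the scaled additive corruption (Equation (5)) next to the simple additive one (Equation (1)); how the reported noise magnitude of 0.22 maps onto $\gamma$ is an assumption made for illustration.

```python
import torch

def corrupt_simple(x, sigma=0.22):
    """Simple additive corruption (Equation (1)): x_noisy = x + sigma * eps."""
    eps = torch.randn_like(x)
    return x + sigma * eps, eps

def corrupt_scaled(x, sigma=0.22):
    """Scaled additive corruption (Equation (5)): sqrt(gamma)*x + sqrt(1-gamma)*eps.
    The mapping sqrt(1 - gamma) = sigma is an assumption about how the reported
    noise magnitude is parameterized, used here only for illustration."""
    gamma = 1.0 - sigma ** 2
    eps = torch.randn_like(x)
    return gamma ** 0.5 * x + (1.0 - gamma) ** 0.5 * eps, eps
```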
4.2 Choice of Pretraining Dataset
In principle, any image dataset can be used for denoising pretraining. Ideally, we would like to use a large dataset such as ImageNet, but this raises the concern that a distribution shift between the pretraining data and the target data could hurt performance on the target task. To test this, we compare pretraining the decoder on several datasets, while the encoder is always pretrained on ImageNet-21k with the classification objective. We find that, for all tested target datasets (Cityscapes, Pascal Context, and ADE20K; Table 3), pretraining the decoder on ImageNet-21k works better than pretraining it on the target dataset itself. Notably, this holds even for Cityscapes, whose image distribution differs markedly from ImageNet-21k. Models pretrained with DDeP on a generic image dataset are therefore broadly applicable across target datasets.
4.3 Decoder Variants
Given that decoder denoising pretraining significantly improves over random initialization of the decoder, we hypothesize that it may enable scaling up the decoder beyond the point at which returns diminish under random initialization. We test this by varying the number of feature maps at each stage of the decoder, as sketched below. The default (1×) decoder configuration in all of our experiments is [1024, 512, 256, 128, 64], where the value at index $i$ is the number of feature maps in the $i$-th decoder block; this is the configuration shown in Figure 2. On Cityscapes, we additionally double (2×) the default width of all decoder layers, while on Pascal Context and ADE20K we triple (3×) the width. Although larger decoders generally improve performance, DDeP provides additional gains in all cases, even over randomly initialized wider decoders. DDeP may therefore enable new decoder-heavy architectures. Section 5 reports the main results for the 1× decoder as well as the 2× and 3× decoders.
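A sketch of how a decoder width multiplier might be applied to the default configuration; the simple upsample-and-convolve blocks below are illustrative stand-ins, not the exact TransUNet decoder.

```python
import torch.nn as nn

DEFAULT_WIDTHS = [1024, 512, 256, 128, 64]   # 1x decoder configuration

def build_decoder(in_channels, width_mult=1, num_classes=None):
    """Stack of upsample + conv blocks whose channel counts scale with width_mult
    (1x, 2x, or 3x as in the experiments); a final 1x1 conv maps to classes if given."""
    widths = [w * width_mult for w in DEFAULT_WIDTHS]
    layers, prev = [], in_channels
    for w in widths:
        layers += [
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(prev, w, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        ]
        prev = w
    if num_classes is not None:
        layers.append(nn.Conv2d(prev, num_classes, kernel_size=1))
    return nn.Sequential(*layers)
```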
4.4 Extensions toward the Full Diffusion Process
As described above, adjusting certain aspects of the standard autoencoder formulation, such as the choice of prediction target and the relative scaling of image and noise, to make it more similar to diffusion models improves the pretrained representations. This raises the question of whether the representations could be improved further by pretraining with the full diffusion process. Here, we study extensions that bring the method closer to the full diffusion process used in DDPMs, but find that they do not improve over the simple approach described above.
Variable noise schedule.
Because it uses a single fixed noise level ($\gamma$ in Equation (6)), our method corresponds to a single step of the diffusion process. Full DDPMs model the complete diffusion process from clean image to pure noise (and back) by sampling the noise magnitude $\gamma$ uniformly at random from $[0, 1]$ for each training example (Ho et al., 2020). We therefore also experiment with randomly sampled $\gamma$, but find that a fixed $\gamma$ performs best (Table 4).
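For illustration, a sketch of per-example noise-level sampling as used in DDPMs versus the fixed level used here; the fixed value shown and the mapping from noise magnitude to $\gamma$ are assumptions carried over from the earlier sketch.

```python
import torch

def sample_gamma(batch_size, fixed_gamma=None, device="cpu"):
    """Fixed noise level (our default) or per-example gamma ~ Uniform(0, 1) as in DDPMs."""
    if fixed_gamma is not None:
        return torch.full((batch_size, 1, 1, 1), fixed_gamma, device=device)
    return torch.rand(batch_size, 1, 1, 1, device=device)

# Usage sketch: broadcast gamma over spatial dimensions when corrupting a batch.
# gamma = sample_gamma(images.size(0), fixed_gamma=1 - 0.22 ** 2, device=images.device)
# noisy = gamma.sqrt() * images + (1 - gamma).sqrt() * torch.randn_like(images)
```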
Conditioning on the noise level.
In the diffusion formulation, the model represents the (reverse) transition function from one noise level to the next and is therefore conditioned on the current noise level. In practice, this is implemented by providing the sampled $\gamma$ for each training example as an additional model input, e.g., to the normalization layers. Since we generally use a fixed noise level, our method does not require such conditioning. We also find that conditioning provides no improvement when a variable noise schedule is used.
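As an illustration of noise-level conditioning through normalization layers, a sketch of a group-normalization layer modulated by an embedding of $\gamma$; this FiLM-style mechanism is one common way such conditioning is implemented and is an assumption, not necessarily the exact scheme used in any particular DDPM.

```python
import torch
import torch.nn as nn

class NoiseConditionedNorm(nn.Module):
    """GroupNorm whose per-channel scale and shift are predicted from the noise level gamma."""
    def __init__(self, num_channels, num_groups=32, embed_dim=128):
        super().__init__()
        self.norm = nn.GroupNorm(num_groups, num_channels, affine=False)
        self.embed = nn.Sequential(
            nn.Linear(1, embed_dim), nn.SiLU(), nn.Linear(embed_dim, 2 * num_channels)
        )

    def forward(self, h, gamma):
        # gamma: (batch, 1) noise level; predict per-channel scale and shift
        scale, shift = self.embed(gamma).chunk(2, dim=-1)
        h = self.norm(h)
        return h * (1 + scale[..., None, None]) + shift[..., None, None]
```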
Noise level weighting.
In DDPMs, the relative weighting of different noise levels in the loss has a large impact on sample quality (Ho et al., 2020). Since our experiments indicate that multiple noise levels are not required for learning transferable representations, we did not experiment with weighting schemes for different noise levels, but note that this may be an interesting direction for future work.
5 Benchmark Results
We evaluate the effectiveness of the proposed decoder denoising pretraining (DDeP) on several semantic segmentation datasets and conduct label-efficiency experiments.
5.1 Implementation Details
For downstream fine-tuning of the pretrained models on semantic segmentation, we use the standard per-pixel cross-entropy loss. We use the Adam optimizer (Kingma & Ba, 2015) with a cosine learning rate decay schedule. For decoder denoising pretraining (DDeP), we train for 100 epochs with a batch size of 512. The learning rate is 6e-5 for the 1× and 3× width decoders and 1e-4 for the 2× width decoder.
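A minimal sketch of the optimizer and schedule setup described above, using PyTorch equivalents; the exact training framework and scheduler implementation are not specified in the paper, so this is only an approximation with a placeholder `steps_per_epoch`.

```python
import torch

def make_optimizer(decoder, width_mult=1, epochs=100, steps_per_epoch=1000):
    """Adam with cosine decay; learning rates follow the reported values
    (6e-5 for 1x/3x decoders, 1e-4 for the 2x decoder)."""
    lr = 1e-4 if width_mult == 2 else 6e-5
    optimizer = torch.optim.Adam(decoder.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs * steps_per_epoch
    )
    return optimizer, scheduler
```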
When fine-tuning the pretrained models on the target semantic segmentation task, we sweep over weight decay and learning rate values in [1e-5, 3e-4] and choose the best combination for each task. For the 100% labeled-fraction setting, we report the mean of 10 runs on all datasets. On Pascal Context and ADE20K, we also report the mean of 10 runs (over different subsets) for the 1%, 5%, and 10% labeled fractions, and the mean of 5 runs for the 20% setting. On Cityscapes, we report the mean of 10 runs for the 1/30 setting, the mean of 6 runs for the 1/8 setting, and the mean of 4 runs for the 1/4 setting.
During training, random cropping and random left-right flipping are applied to the images and their corresponding segmentation masks. For Cityscapes, we randomly crop images to a fixed size of 1024×1024, while for ADE20K and Pascal Context we crop to 512×512. All decoder denoising pretraining is carried out at a resolution of 224×224.
For inference on Cityscapes, the full-resolution 1024×2048 images are split into two 1024×1024 input patches for evaluation. We use horizontal flipping and average the results for each half; the two halves are then concatenated to produce the full-resolution output. For Pascal Context and ADE20K, in addition to horizontal flipping, we use multi-scale evaluation over rescaled versions of the images, with scale factors (0.5, 0.75, 1.0, 1.25, 1.5, 1.75).
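A sketch of the Cityscapes-style evaluation described above (two halves, horizontal-flip averaging), assuming a hypothetical `model` that returns per-pixel class logits; multi-scale evaluation for the other datasets would additionally loop over rescaled inputs.

```python
import torch

@torch.no_grad()
def predict_cityscapes(model, image):
    """image: (1, 3, 1024, 2048). Split into two 1024x1024 halves, average the
    predictions for the original and horizontally flipped inputs, then concatenate."""
    halves = [image[..., :1024], image[..., 1024:]]
    outputs = []
    for half in halves:
        logits = model(half)
        flipped = model(torch.flip(half, dims=[-1]))
        outputs.append(0.5 * (logits + torch.flip(flipped, dims=[-1])))
    return torch.cat(outputs, dim=-1)   # (1, num_classes, 1024, 2048)
```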
5.2 Performance Gains from Decoder Pretraining
On Cityscapes, DDeP outperforms both DeP and supervised pretraining. In Figure 6, we report the results of DeP and DDeP on Cityscapes and compare them with training from random initialization or from an ImageNet-21k-pretrained encoder. The DeP results use the scaled additive noise formulation (Equation (5)), which significantly improves performance over the standard denoising objective.
As shown in Figure 6, DeP outperforms the supervised baseline in the 1% and 5% labeled-image settings. For both the 1× and 2× decoder variants, decoder denoising pretraining (DDeP) improves further over both DeP and ImageNet-21k supervised pretraining (Table 6).
As shown in Table 5.2, DDeP outperforms previously proposed label-efficient semantic segmentation methods on Cityscapes across all labeled fractions. With only 25% of the training data, DDeP produces better segmentations than the strongest baseline, PC2Seg (Zhong et al., 2021), trained on the full dataset. Unlike recent work, we do not use multi-scale evaluation on Cityscapes, which would likely lead to further improvements.
DDeP also improves over supervised pretraining on the Pascal Context dataset. Figure 1 compares the performance of DDeP with the supervised baseline on Pascal Context with 1%, 5%, 10%, 20%, and 100% of the training data, and Table 5.2 compares these results with those obtained using the 3× decoder. For both the 1× and 3× decoders, DDeP clearly outperforms architecturally identical supervised models, achieving a 4-12% mIoU improvement across all semi-supervised settings. Notably, DDeP with only 10% of the labels outperforms the supervised model trained with 20% of the labels.
Figure 7 shows similar improvements from DDeP on the ADE20K dataset. Again, we see gains of more than 10 points in the 5% and 10% settings and more than 5 points in the 1% setting. These consistent results demonstrate the effectiveness of DDeP across datasets and training-set sizes.
We use TransUNet (Chen et al., 2021a; Figure 2), a maximum-performance architecture, for the results above, but DDeP is backbone-agnostic and also provides gains when used with simpler backbone architectures. In Table 7, we train a standard U-Net with a ResNet-50 encoder, with and without DDeP (no multi-scale evaluation), on Pascal Context. DDeP outperforms the supervised baseline in all settings, showing that the benefits of our method extend beyond transformer architectures.
6 Related Work
Because collecting detailed pixel-level annotations for semantic segmentation is expensive, time-consuming, and error-prone, many methods have been proposed for learning semantic segmentation from fewer annotated samples (Tarvainen & Valpola, 2017; Hung et al., 2018; Hong et al., 2018; Mittal et al., 2021; French et al., 2019; Ouali et al., 2020; Zou et al., 2021; Feng et al., 2020b; Ke et al., 2020; Olsson et al., 2021; Zhong et al., 2021). These methods typically resort to semi-supervised learning (SSL) (Chapelle et al., 2006; Van Engelen & Hoos, 2020), in which access to a large dataset of unlabeled images is assumed in addition to the labeled training data. In the following, we discuss prior work on the role of strong data augmentation, generative models, self-training, and self-supervised learning in label-efficient semantic segmentation. Although this work focuses on self-supervised pretraining, we believe that strong data augmentation and self-training could be combined with the proposed denoising pretraining to further improve results.
Data augmentation.
French et al. (2019) showed that strong data augmentation techniques such as Cutout (DeVries & Taylor, 2017) and CutMix (Yun et al., 2019) are particularly effective for semantic segmentation from few labeled samples. Ghiasi et al. (2021) found that simple copy-paste augmentation helps instance segmentation. Earlier work (Remez et al., 2018; Chen et al., 2019; Bielski & Favaro, 2019; Arandjelović & Zisserman, 2019) also explored fully unsupervised semantic segmentation by compositing different foreground and background regions with GANs (Goodfellow et al., 2014). We use relatively simple data augmentation consisting of horizontal flipping and random inception-style cropping (Szegedy et al., 2015). Using stronger data augmentation is left to future work.
Generative models.
Early work on label-efficient semantic segmentation used GANs to generate synthetic training data (Souly et al., 2017) and to discriminate between real and predicted segmentation masks (Hung et al., 2018; Mittal et al., 2021). DatasetGAN (Zhang et al., 2021) showed that modern GAN architectures (Karras et al., 2019) are effective at generating synthetic data for pixel-level image understanding when only a few labeled images are available. Our approach is closely related to diffusion and score-based generative models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020), a new family of generative models that outperform GANs (Dhariwal & Nichol, 2021; Ho et al., 2021). These models are connected to denoising autoencoders through denoising score matching (Vincent, 2011), which can be viewed as a way of training energy-based models (Hyvärinen & Dayan, 2005). Denoising diffusion models have recently been applied to conditional generation tasks such as super-resolution, colorization, and inpainting (Li et al., 2021; Saharia et al., 2021b; Song et al., 2021; Saharia et al., 2021a), suggesting that they may learn useful image representations. We are inspired by the success of DDPMs, but find that many of their components are not necessary and that simple denoising pretraining works well. Diffusion models have also been used to iteratively refine semantic segmentation masks (Amit et al., 2021; Hoogeboom et al., 2021). Baranchuk et al. (2021) show that features learned by diffusion models are effective for semantic segmentation from few labeled samples. In contrast, we use simple denoising pretraining for representation learning and study full fine-tuning of the encoder-decoder architecture rather than extracting fixed features. In addition, we compare our results with prior work on well-established benchmarks.
Self-training and consistency regularization.
Self-training (also known as self-learning or pseudo-labeling) is one of the oldest SSL algorithms (Scudder, 1965; Fralick, 1967; Agrawala, 1970; Yarowsky, 1995). It works by using an initial supervised model to annotate unlabeled data with so-called pseudo-labels, and then training an improved model on a mixture of pseudo-labeled and human-labeled data; this iterative process can be repeated multiple times. Self-training has been used to improve object detection (Rosenberg et al., 2005; Zoph et al., 2020) and semantic segmentation (Zhu et al., 2020; Zou et al., 2021; Feng et al., 2020a; Chen et al., 2020a). Consistency regularization, which is closely related to self-training, enforces consistent predictions across image augmentations (French et al., 2019; Jin et al., 2020; Ouali et al., 2020). These methods often require careful hyperparameter tuning and a reasonable initial model to avoid propagating noise. Combining self-training with denoising pretraining could further improve results.
Self-supervised learning.
Self-supervised learning methods construct predictive pretext tasks that are easy to build from unlabeled data and benefit downstream discriminative tasks. In natural language processing (NLP), masked language modeling (Devlin et al., 2019; Liu et al., 2019; Raffel et al., 2020) has become the de facto standard, showing impressive results across NLP tasks. In computer vision, various pretext tasks have been proposed for self-supervised learning, including predicting the relative position of neighboring patches in an image (Doersch et al., 2015), inpainting (Pathak et al., 2016), solving jigsaw puzzles (Noroozi & Favaro, 2016), image colorization (Zhang et al., 2016; Larsson et al., 2016), rotation prediction (Gidaris et al., 2018), and other tasks (Zhang et al., 2017; Caron et al., 2018; Kolesnikov et al., 2019). More recently, methods based on instance discrimination and contrastive learning have shown promising results on image classification (Oord et al., 2018; Hjelm et al., 2018; He et al., 2020; Chen et al., 2020b;c; Grill et al., 2020). These methods have been used to successfully pretrain backbones for object detection and segmentation (He et al., 2020; Chen et al., 2020d), but unlike this work, they typically initialize the decoder parameters randomly. A family of masking-based autoencoding methods has also emerged recently, such as BEiT (Bao et al., 2021), MAE (He et al., 2021), and others (Zhou et al., 2021; Dong et al., 2021; Chen et al., 2022). Our method was developed concurrently with these masked image modeling approaches, and our technique is orthogonal, since we focus on decoder pretraining, which is not the focus of those papers.
Self-supervised learning for dense prediction.
Pinheiro et al. (2020) and Wang et al. (2021) proposed dense contrastive learning, a self-supervised pretraining approach for dense prediction tasks in which contrastive learning is applied to patch- and pixel-level features rather than image-level features. This is reminiscent of AMDIM (Bachman et al., 2019) and CPC v2 (Hénaff et al., 2019). Zhong et al. (2021) take this idea further and combine consistency between the segmentation masks predicted for differently augmented (possibly unlabeled) images with pixel-level feature consistency across augmentations.
Vision transformers.
Inspired by the success of Transformers in NLP (Vaswani et al., 2017), prior work has studied combinations of convolution and self-attention for object detection (Carion et al., 2020), semantic segmentation (Wang et al., 2018; 2020b), and panoptic segmentation (Wang et al., 2020a). Vision Transformer (ViT) (Dosovitskiy et al., 2021) showed that a convolution-free approach can produce impressive results when large labeled datasets are available. Recent studies have explored ViT as a backbone for semantic segmentation (Zheng et al., 2020; Liu et al., 2021; Strudel et al., 2021). These approaches differ in the structure of the decoder, but all demonstrate ViT-based semantic segmentation capabilities. We use a hybrid ViT (Dosovitskiy et al., 2021) as the backbone, in which the patch embedding projection is applied to patches extracted from a convolutional feature map. We study the size of the decoder and find that wider decoders generally improve semantic segmentation results.
7 Conclusion
Inspired by the recent popularity of diffusion probabilistic models for image synthesis, we study the effectiveness of their denoising objective for learning useful, transferable representations for semantic segmentation. Surprisingly, we find that pretraining a semantic segmentation model as a denoising autoencoder can substantially improve semantic segmentation performance, especially when the number of labeled examples is limited. Building on this observation, we propose a two-stage pretraining approach in which supervised pretraining of the encoder is combined with denoising pretraining of the decoder. This yields consistent gains across datasets and training-set sizes, resulting in a practical pretraining method. Exploring the application of denoising pretraining to other dense prediction tasks is an interesting direction for future work.