A Decoder-free Transformer-like Architecture for High-efficiency Single Image Deraining
Xiao Wu, Ting-Zhu Huang∗, Liang-Jian Deng∗ and Tian-Jing Zhang
University of Electronic Science and Technology of China, Chengdu, 611731
[email protected], [email protected], [email protected], [email protected]
Abstract
Despite the success of vision Transformers for the image deraining task, they are limited by heavy computation and slow runtime. In this work, we find that the Transformer decoder is not necessary and incurs huge computational costs. Therefore, we revisit the standard vision Transformer as well as its successful variants and propose a novel Decoder-Free Transformer-Like (DFTL) architecture for fast and accurate single image deraining. Specifically, we adopt a cheap linear projection to represent visual information with a lower computational cost than previous linear projections. Then we replace standard Transformer decoder blocks with the designed Progressive Patch Merging (PPM), which attains comparable performance and efficiency. Finally, DFTL significantly alleviates the computation and GPU memory requirements through the proposed modules. Extensive experiments demonstrate the superiority of DFTL compared with competitive Transformer architectures, e.g., ViT, DETR, IPT, Uformer, and Restormer. The code is available at https://github.com/XiaoXiao-Woo/derain.
1 Introduction
Image deraining is a classical problem in computer vision, which is highly desired in consumer photography and image processing. However, it is a challenging task since distant rain streaks usually combine with water droplets to form a veil over the backdrop, substantially reducing the visibility of the image. Recently, with the emergence of convolutional neural networks (CNNs), the state-of-the-art (SOTA) methods for image deraining are dominated by CNNs. However, the receptive field of convolution operations is limited, and CNNs usually have bulky structures to boost performance.
∗ Corresponding authors.

[Figure 1: bar chart comparing IPT, IPT w/o decoder, Restormer, and Restormer with PPM in terms of MaxBatch, training time, inference time, PSNR, and SSIM.]

Figure 1: Ablation of decoders for two Transformer architectures. Transformer decoders bring limited performance gains at higher complexity than a decoder-free Transformer-like architecture. Our methods achieve competitive results compared with standard Transformers, with less GPU memory and faster runtime for image deraining.

Transformer, first applied in natural language processing, is increasingly being used to replace CNNs in various vision tasks. Pioneering works on Transformers [Vaswani et al., 2017], [Liu et al., 2021] widely adopt self-attention mechanisms as basic blocks to realize a strong feature representation ability. Most Transformer-based methods consist of three parts: a) patch generator; b) MLP; c) multi-head self-attention (MSA). In general, Transformer can be written in the following form:
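$$
\begin{aligned}
X &= [X_1; X_2; \cdots; X_N],\\
Q &= X W_Q^T,\quad K = X W_K^T,\quad V = X W_V^T,\\
Z &= \mathrm{Softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right)V + X,\\
Z &= \mathrm{FFN}(Z) + Z,
\end{aligned}
\tag{1}
$$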
where $X \in \mathbb{R}^{N\times C}$ denotes the input features, $N$ is the number of patches and $C$ is the number of channels. $W_Q$, $W_K$, $W_V$ are the weights of query ($Q$), key ($K$) and value ($V$) in the linear projections (LP). FFN(·) is the feed-forward network.
The complete Transformer is obtained by stacking Eq. 1 to build the Transformer encoder block (TEB) and the Transformer decoder block (TDB). Therefore, the computational complexity of the vanilla Transformer is $O(N^2 + NC)$, which is more cumbersome than that of CNNs. Meanwhile, we could build cross-covariance attention (XCA) by the matrix multiplication $V^T K Q^T$, which is $O(NC + C^2)$. Its computational complexity is quadratic in the number of channels of the input rather than in the number of patches.
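As a reading aid, Eq. 1 can be sketched in a few lines of NumPy. This is only a minimal single-head illustration, not the implementation used in our experiments: the function names, the two-layer ReLU form of FFN, and the choice of $d$ as the projected feature width are assumptions made for the example.

```python
import numpy as np

def softmax(a, axis=-1):
    # Numerically stable softmax along the given axis.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def transformer_block(X, W_q, W_k, W_v, W1, W2):
    """Single-head sketch of Eq. 1 for X of shape (N, C)."""
    Q, K, V = X @ W_q.T, X @ W_k.T, X @ W_v.T   # linear projections (LP)
    d = Q.shape[-1]                              # assumed scaling: feature width
    A = softmax(Q @ K.T / np.sqrt(d))            # (N, N) attention over patches
    Z = A @ V + X                                # attention + residual
    ffn = np.maximum(Z @ W1, 0.0) @ W2           # assumed FFN: Linear-ReLU-Linear
    return ffn + Z, A                            # FFN + residual

rng = np.random.default_rng(0)
N, C = 6, 4                                      # hypothetical sizes
X = rng.normal(size=(N, C))
W_q, W_k, W_v = (rng.normal(size=(C, C)) for _ in range(3))
W1, W2 = rng.normal(size=(C, 8)), rng.normal(size=(8, C))
out, A = transformer_block(X, W_q, W_k, W_v, W1, W2)
```

The attention matrix `A` has shape (N, N), which is where the quadratic dependence on the number of patches comes from.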
By analyzing previous works such as IPT [Chen et al., 2021], Uformer [Wang et al., 2021b], and Restormer [Zamir et al., 2021] (please refer to Fig. 1), we find a redundancy: they simply apply a TEB-to-TEB Transformer as the encoder-decoder architecture, so the decoders perform feature interaction across the inputs of the encoders rather than between encoders and TDB, as shown in Fig. 2 (a). In other words, these decoders bring slight or no improvement in performance, but at high costs in terms of parameter number, FLOPs, and training/inference time. This motivates us to build a decoder-free Transformer-like (DFTL) framework.
In this paper, we first adopt a cheap linear projection (LP) to generate intrinsic feature maps with lower complexity. Then, we introduce the Transformer encoder block (TEB) to represent the desired features. Finally, Progressive Patch Merging (PPM) is proposed to restore rich features with different spatial resolution representations. Meanwhile, the architecture still relies on the self-attention mechanism, which retains the powerful contextual representation ability of the vanilla Transformer. The advances of DFTL can be summarized as 1) Efficient Transformer: we propose cost-effective operations for better results. 2) Flexible structure: a scalable framework for the restoration of rainy images.
The contributions of this work are as follows:
1. We propose a new DFTL architecture that achieves competitive performance with less GPU memory and holds a satisfying complexity/performance trade-off compared to other self-attention-based techniques.
2. We adopt a Cheap LP and MSA/XCA to capture multi-scale contextual details. Based on these modules optimized for Transformer, we develop two variants named DFTL-W and DFTL-X, which consistently achieve performance comparable to the prior arts.
3. A new hybrid loss is designed for more effective training, which favors convergence and improves the final testing performance.
2 Related Works
In general, existing single image deraining methods can be roughly divided into two categories, i.e., optimization-based and deep learning-based methods.
2.1 Optimization-based Deraining Methods
Optimization-based deraining methods view the rainy image $Y$ as the combination of the background image $B$ and the rain streaks $R$. The whole process can be expressed as the following formula:

$$ Y = B + R. \tag{2} $$

To remove $R$ and obtain a clean image $B$, several works have been proposed for single image deraining by building effective optimization models based on image priors, e.g., the directional sparsity prior [Jiang et al., 2019; Jiang et al., 2017; Deng et al., 2018]. With the development of deep learning, optimization-based techniques may be insufficient, since they are only capable of dealing with particular situations.
2.2 Deep Learning-based Deraining Methods
CNN-based techniques. With the powerful representation and extraction ability of CNNs, diverse CNN-based structures have been designed to improve the performance of image deraining. In [Li et al., 2018], the authors utilize dilated CNNs and squeeze-and-excitation blocks to deal with the task of image deraining, obtaining superior outcomes compared to traditional optimization-based methods. PReNet [Ren et al., 2019] progressively deepens a shallow ResBlock via recurrent operations, achieving competitive results. Wang et al. introduce a convolutional dictionary learning mechanism to remove rain streaks effectively [Wang et al., 2020]. In addition, Fu et al. first attempt to use a graph convolution network to extract relation-aware features and exploit pixel-level global spatial relationships [Fu et al., 2021], obtaining SOTA deraining results. However, all the techniques mentioned above are based on the characteristics of CNNs, which require stacking several convolution blocks to enlarge the receptive field and improve performance; they lack long-range interactions and ignore the extraction of geometric details.

Vision Transformer techniques. In [Dosovitskiy et al., 2021], the authors introduce the Vision Transformer (ViT), which treats the input image as 16×16 words and attains excellent results on image recognition. Besides, IPT [Chen et al., 2021] pretrains the model on the ImageNet dataset and achieves SOTA on several low-level vision tasks. SwinIR [Liang et al., 2021] builds a single-way structure based on Swin Transformer to achieve image restoration. More recently, Restormer [Zamir et al., 2021] designs a Transformer with convolution projections and cross-covariance across feature channels, further improving the structure of Transformer.
3 Proposed Method
3.1 Pipeline
An overview of the DFTL structure is presented in Fig. 2 (b).
Given an image $I \in \mathbb{R}^{H\times W\times 3}$, we first progressively divide it into 4-level patches with the Patchify module. These multi-scale patches have scales {1, 1/2, 1/4, 1/8} relative to $H$, $W$, termed $H_l$, $W_l$ and abbreviated as $P_l$, which stands for the patch size $P$ at the specific scale $l$, obtained by the following Eq. 3:

$$ N = \left\lfloor \frac{P - k + 2\cdot pad}{s} + 1 \right\rfloor, \tag{3} $$

where $N$ is the number of patches, $k$ is the patch/kernel size, $pad$ means padding, and $s$ denotes the stride. In what follows, we describe the proposed modules: (a) Cheap Linear Projection (LP); (b) Transformer Encoder Block (TEB); (c) Progressive Patch Merging (PPM).
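The patch count of Eq. 3 can be checked with a small helper. This is an illustrative sketch only: the function names and the example sizes (a 256-pixel side, 16×16 patches, a 3×3 kernel) are hypothetical and not taken from our configuration.

```python
def num_patches(P, k, pad=0, s=1):
    """Number of patches along one axis, following Eq. 3 (integer floor)."""
    return (P - k + 2 * pad) // s + 1

def multiscale_sizes(H, levels=4):
    """Side lengths at scales {1, 1/2, 1/4, 1/8} of H (assumes H divisible by 8)."""
    return [H // (2 ** l) for l in range(levels)]

# Non-overlapping 16x16 patches on a 256-pixel side: (256 - 16) // 16 + 1 = 16.
print(num_patches(256, k=16, pad=0, s=16))   # 16
# A 3x3 kernel with padding 1 and stride 2 halves a 64-pixel side: 63 // 2 + 1 = 32.
print(num_patches(64, k=3, pad=1, s=2))      # 32
print(multiscale_sizes(256))                 # [256, 128, 64, 32]
```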
3.2 Cheap Linear Projection
[Figure 2: block diagrams. (a) Transformer Architecture: input image, encoders (TEB), and decoders (TDB with object queries). (b) Decoder-free Transformer-like Architecture: Stage 1 to Stage i, each with Patchify, Linear Projection, and TEBs (Layer Norm, Attention, FFN), followed by PPM with Conv2D layers.]

Figure 2: Comparisons of different architectures, where Conv2D layer is convolution, and TEB/TDB stands for Transformer encoder/decoder block. (a) Many vision Transformers employ an encoder-decoder structure for various vision tasks. However, recent vision Transformers adopt TEB-to-TEB to construct the encoder-decoder for the restoration of rainy images. They do not calculate TDB by object queries. (b) A decoder-free Transformer-like architecture (DFTL). Progressive Patch Merging (PPM) is used to replace standard Transformer decoders. The input channel and the output channel are C and D, respectively.

Linear projection (LP) maps patch-level features into a representation that captures long-range dependencies. However, according to Eq. 1, LP is one of the sources of complexity in Transformer. Thus, we sort out several classic LP variants and show Cheap LP in Fig. 3. Convolution LP embeds spatial relationships into feature maps based on strided convolution operations. ViT
LP projects the flattened patches, and window-based LP represents window-based self-attention (W-MSA) along the channel direction. Compared to depthwise convolution, they occupy more memory and FLOPs. Hence, given the input $X \in \mathbb{R}^{C\times H\times W}$, $X$ is partitioned into patches $\{X_p \in \mathbb{R}^{k\times k\times C}\}_{p=1}^{P}$ by Eq. 3, where $X_p$ is the $p$-th patch ($p = 1, \cdots, P$). Cheap LP can be expressed as

$$ X_p = \mathrm{Concat}\!\left(\sum_{c=0}^{C}\sum_{i=0}^{k}\sum_{j=0}^{k} X_p \cdot W^T,\; Y_p\right), \tag{4} $$

where $W \in \mathbb{R}^{Ck^2\times D}$ denotes learnable weights and $Y_p$ is the output of ViT LP. As shown in Fig. 3 (d), Cheap LP generates patches with the same channels, then applies linear depthwise operations with kernel size $k\times k$ to expand the channels.
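To give a feel for why depthwise operations are cheaper than a dense projection, the weight counts can be compared directly. The sketch below is a generic count, not the exact Cheap LP: it assumes a dense projection with $Ck^2 \times D$ weights (the shape of $W$ above) and a depthwise $k\times k$ operation with one filter per output channel; biases are ignored, and the sizes C = 48, k = 3, D = 96 are hypothetical.

```python
def dense_lp_weights(C, k, D):
    """Weights of a dense projection of a flattened k x k x C patch to D channels."""
    return C * k * k * D

def depthwise_lp_weights(k, D):
    """Weights of a depthwise k x k operation with one k x k filter per output channel."""
    return D * k * k

C, k, D = 48, 3, 96                      # hypothetical sizes for illustration
dense = dense_lp_weights(C, k, D)        # 48 * 9 * 96 = 41472
depthwise = depthwise_lp_weights(k, D)   # 96 * 9 = 864
print(dense, depthwise, dense // depthwise)   # 41472 864 48
```

Under these assumptions the dense projection needs C times more weights than the depthwise one, which is the kind of saving the depthwise part of Cheap LP targets.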
[Figure 3: flowcharts of four linear projections, panels (a) to (d), showing reshape, matrix multiplication, and convolution steps with filter weights K applied to patches/windows and their neighbors.]

Figure 3: Flowchart of LP in PVT [Wang et al., 2021a], ViT, SwinT and Cheap LP. (a) Convolution LP; (b) ViT LP; (c) Window-based LP; (d) Cheap LP. Cheap LP is used to extract features in DFTL. Note that the dimension of the input is H×W×C and the LPs have different ways to share the weights K. P, k, C, D denote patch size, kernel size (window size), input channels, and output channels, respectively.
3.3 Transformer Encoder Block
Self-attention (SA) is the core of Transformer but incurs heavy GPU overhead. Thus, we analyze two SA mechanisms in this section: 1) W-MSA, which introduces a local window to balance efficiency and performance, and 2) XCA, which uses channel correlation at a lower spatial cost.

Window Based Self-Attention (W-MSA). To reduce the cost in spatial resolution, W-MSA encodes local pixel similarity in a window of size $M\times M$, whose computational complexity is $O(M^2 + MC)$, which reduces the computational/memory overhead. With W-MSA followed by the FFN module, TEB can be computed as follows,

$$
\begin{aligned}
Z_p &= \text{W-MSA}(Q, K, V) + X_p = \mathrm{Softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right)V + X_p,\\
Y_p &= \mathrm{FFN}(Z_p) + Z_p,
\end{aligned}
\tag{5}
$$

where $X_p$ and $Y_p$ denote the input and output of TEB, $Z_p$ indicates the intermediate feature maps, and $d$ is a scalar. DFTL-W is built by the sequential form of LP and W-MSA.
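The windowed attention of Eq. 5 can be illustrated with a short NumPy sketch. It is a single-head simplification under assumptions of our own choosing for the example: non-overlapping windows, H and W divisible by M, $d$ set to the channel count, and no FFN; it is not the DFTL-W implementation.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x, M):
    """(H, W, C) -> (num_windows, M*M, C) with non-overlapping M x M windows."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)

def window_merge(w, M, H, W):
    """Inverse of window_partition: (num_windows, M*M, C) -> (H, W, C)."""
    C = w.shape[-1]
    x = w.reshape(H // M, W // M, M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

def w_msa(x, M, W_q, W_k, W_v):
    """Attention restricted to each window, plus the residual of Eq. 5."""
    H, W, C = x.shape
    win = window_partition(x, M)
    Q, K, V = win @ W_q.T, win @ W_k.T, win @ W_v.T
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(C))   # (num_windows, M*M, M*M)
    return window_merge(A @ V + win, M, H, W)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4, 3))        # hypothetical 4x4 feature map with 3 channels
I3 = np.eye(3)                        # identity projections keep the example small
y = w_msa(x, 2, I3, I3, I3)
```

Each attention matrix is only (M·M)×(M·M), so pixels in different windows do not interact within a single W-MSA layer.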
Cross-Covariance Attention (XCA). Considering that the features may have more channels, DFTL-W may produce limited performance because W-MSA only encodes the feature vector at each pixel. We introduce XCA to compute correlations along the channel dimension:

$$ Z_p = \mathrm{XCA}(Q, K, V) + X_p = V^T\,\mathrm{Softmax}\!\left(\frac{K^TQ}{\sqrt{d}}\right) + X_p, \tag{6} $$

where $K^TQ \in \mathbb{R}^{C\times C}$ is also termed the attention matrix. The model with XCA is named DFTL-X, which is the sequential form of TEB and LP.
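A minimal NumPy sketch of Eq. 6 follows. It makes several assumptions that should be read as illustrative choices rather than the DFTL-X implementation: tokens are stored as rows (shape N×C), so the value matrix multiplies the C×C attention map from the left as `V @ A` to keep the output shape equal to the input; the softmax is taken over the last axis; and $d$ defaults to the channel count.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def xca(X, W_q, W_k, W_v, d=None):
    """Cross-covariance attention sketch for X of shape (N, C)."""
    Q, K, V = X @ W_q.T, X @ W_k.T, X @ W_v.T
    d = X.shape[-1] if d is None else d          # assumed scaling constant
    A = softmax(K.T @ Q / np.sqrt(d))            # (C, C) channel attention map
    return V @ A + X, A                          # channel mixing + residual

rng = np.random.default_rng(0)
C = 4                                            # hypothetical channel count
W_q, W_k, W_v = (rng.normal(size=(C, C)) for _ in range(3))
Z_small, A_small = xca(rng.normal(size=(10, C)), W_q, W_k, W_v)
Z_large, A_large = xca(rng.normal(size=(40, C)), W_q, W_k, W_v)
```

The attention map stays C×C whether there are 10 or 40 tokens, which is why the cost grows only linearly with the number of patches.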
| Methods | Rain12 PSNR | Rain12 SSIM | Rain200L PSNR | Rain200L SSIM | Rain200H PSNR | Rain200H SSIM | DID-Data PSNR | DID-Data SSIM | DDN-Data PSNR | DDN-Data SSIM | MaxBatch |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Input | 30.14 | 0.8553 | 26.71 | 0.8438 | 13.08 | 0.3733 | 23.63 | 0.7324 | 25.23 | 0.7901 | - |
| DSC [Luo et al., 2015] | 30.07 | 0.8664 | 27.16 | 0.8663 | 14.73 | 0.3815 | 24.24 | 0.8279 | 27.31 | 0.8373 | - |
| GMM [Li et al., 2016] | 32.14 | 0.9145 | 28.66 | 0.8652 | 14.50 | 0.4164 | 25.81 | 0.8344 | 27.55 | 0.8479 | - |
| JCAS [Gu et al., 2017] | 33.10 | 0.9302 | 31.42 | 0.9173 | 14.69 | 0.4999 | 25.16 | 0.8509 | 26.81 | 0.8632 | - |
| Clear [Fu et al., 2017b] | 31.15 | 0.9358 | 30.51 | 0.9361 | 13.90 | 0.7091 | 24.10 | 0.8518 | 25.86 | 0.8781 | 100 |
| DDN [Fu et al., 2017a] | 32.71 | 0.9291 | 34.37 | 0.9578 | 25.99 | 0.8006 | 28.95 | 0.8619 | 29.64 | 0.8913 | 240 |
| RESCAN [Li et al., 2018] | 36.63 | 0.9527 | 38.01 | 0.9796 | 27.46 | 0.8485 | 34.15 | 0.9294 | 32.87 | 0.9294 | 180 |
| PReNet [Ren et al., 2019] | 36.54 | 0.9606 | 36.81 | 0.9767 | 27.78 | 0.8717 | 33.47 | 0.9252 | 32.24 | 0.9257 | 96 |
| FBL [Yang et al., 2020] | 37.69 | 0.9651 | 39.02 | 0.9827 | 30.07 | 0.9021 | 34.26 | 0.9320 | 33.05 | 0.9334 | 8 |
| RCDNet [Wang et al., 2020] | 37.59 | 0.9608 | 39.17 | 0.9885 | 30.24 | 0.9048 | 34.08 | 0.9532 | 33.04 | 0.9472 | 21 |
| FuGCN [Fu et al., 2021] | 37.38 | 0.9674 | 39.61 | 0.9860 | 29.77 | 0.8991 | 34.37 | 0.9620 | 33.01 | 0.9489 | 32 |
| IPT [Chen et al., 2021] | 37.12 | 0.9629 | 39.36 | 0.9850 | 29.41 | 0.8909 | 34.32 | 0.9354 | 33.21 | 0.9366 | 9 |
| DFTL-W | 37.60 | 0.9632 | 39.99 | 0.9873 | 30.02 | 0.9050 | 34.37 | 0.9574 | 33.27 | 0.9375 | 180 |
| DFTL-X | 38.09 | 0.9670 | 41.27 | 0.9890 | 31.81 | 0.9271 | 34.73 | 0.9604 | 33.70 | 0.9424 | 16 |
| Ideal value | +∞ | 1 | +∞ | 1 | +∞ | 1 | +∞ | 1 | +∞ | 1 | +∞ |

Table 1: Quantitative experiments evaluated on the five datasets. The best and the second best results are boldfaced and underlined.
3.4 Progressive Patch Merging (PPM)
It is redundant for decoders to perform feature interaction across the Query, Key and Value from encoders. Considering the generation of patch embeddings, we develop PPM to replace the decoders of Transformer. Specifically, we use bilinear interpolation to upsample the patch-level features. Then we restore detailed information and shrink the channels to obtain the outputs by two convolution layers. In this process, we use skip-connections to concatenate patches progressively. The outputs of PPM are then added to the original inputs to remove rain streaks and produce clean images. The whole process is presented as follows:

$$
\begin{aligned}
F_l &= \mathrm{Upsample}_{(P_l\times P_l)}(F_{l-1}),\\
F_l &= \mathrm{Conv}_{(C_l,\, C_l/2)}(\mathrm{Concat}(F_l)),\\
O &= \mathrm{Conv}_{(C_l,\, 3)}(F_l) + I,
\end{aligned}
\tag{7}
$$

where $F_l$ represents the $l$-th level feature maps, and $I$ and $O$ are the input and output of our DFTL, respectively.
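One merging step of Eq. 7 can be sketched in NumPy as upsample, concatenate with the skip feature, and shrink the channels. The sketch is illustrative only: the convolution is assumed to be 1×1 (written as a channel-mixing matrix), the bilinear interpolation uses corner-aligned sampling, and the function names and shapes are hypothetical.

```python
import numpy as np

def bilinear_upsample(x, out_h, out_w):
    """Bilinear interpolation of x with shape (C, H, W), corners aligned."""
    C, H, W = x.shape
    ys, xs = np.linspace(0, H - 1, out_h), np.linspace(0, W - 1, out_w)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
    wy, wx = (ys - y0)[:, None], xs - x0          # fractional offsets
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def ppm_step(prev, skip, weight):
    """Upsample prev to the size of skip, concatenate, and mix channels (1x1 conv)."""
    up = bilinear_upsample(prev, skip.shape[1], skip.shape[2])
    cat = np.concatenate([up, skip], axis=0)       # (C_l, H, W) after concatenation
    return np.einsum('dc,chw->dhw', weight, cat)   # (C_l / 2, H, W)

prev = np.ones((4, 2, 2))            # coarse feature map F_{l-1}
skip = np.ones((4, 4, 4))            # skip-connected feature at level l
weight = np.full((4, 8), 1.0 / 8)    # averaging 1x1 conv: 8 channels -> 4
merged = ppm_step(prev, skip, weight)
```

Repeating this step from the coarsest to the finest level, followed by a final convolution to 3 channels and the addition of the input image, gives the overall flow described by Eq. 7.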
3.5 Hybrid Loss Function
Our work proposes a novel gradient-based hybrid loss function (GBHL) to achieve better results. We empirically present Eq. 8, in which the dominant numerical value is provided by SSIM [Wang et al., 2004] or mean square error (MSE), etc. Mathematically, we impose a regularizer as a constraint for SSIM or MSE as follows:

$$
\begin{aligned}
L_{ssim} &= L_{ssim} \odot \sum_{k\neq ssim}\left|\frac{L_k}{\|L_k\|_F}\right|,\\
L_{mse} &= L_{mse} \odot \sum_{k\neq mse}\left|\frac{L_k}{\|L_k\|_F}\right|.
\end{aligned}
\tag{8}
$$

Here $\odot$ represents the Hadamard product. $\|L_k\|_F$ is set to requires_grad = False. $\|\cdot\|_F$ is the Frobenius norm, and $k = [ssim, mae, mse]$.
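To see what Eq. 8 does to the gradients, it may help to work through the scalar case. The derivation below is our reading under explicit assumptions: each $L_k$ is a positive scalar loss, so that $\|L_k\|_F = L_k$, and $\mathrm{sg}(\cdot)$ denotes the stop-gradient implied by requires_grad = False.

```latex
% Assumptions: each L_k > 0 is a scalar; sg(.) blocks gradients.
\tilde{L}_{ssim}
  = L_{ssim}\sum_{k\neq ssim}\frac{L_k}{\mathrm{sg}(\|L_k\|_F)},
  \qquad k \in \{ssim,\ mae,\ mse\}.

% Value: every ratio equals 1, and two terms remain in the sum.
\tilde{L}_{ssim} = 2\,L_{ssim}.

% Gradient: the product rule with the denominators held constant gives
\nabla_\theta \tilde{L}_{ssim}
  = 2\,\nabla_\theta L_{ssim}
  + L_{ssim}\sum_{k\neq ssim}\frac{\nabla_\theta L_k}{L_k}.
```

Under these assumptions the other losses do not change the reported value beyond a constant factor, but their gradients enter rescaled by $L_{ssim}/L_k$, i.e., on the numerical scale of $L_{ssim}$, without hand-tuned weights.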
4 Experiments
In this section, we demonstrate the advantages of the proposed method via comprehensive experiments on both synthetic and real datasets. In particular, we also compare our DFTL with other Transformer architectures to demonstrate the efficiency of our methods. Please refer to the supplementary materials for more details, e.g., datasets, implementation details, and discussion of model structures and complexity.
4.1 Comparison on Synthetic and Real Datasets
We evaluate our model on five synthetic datasets to compare quantitative results. Besides, we list the maximum batch size (MaxBatch) of each model that can be trained simultaneously on a single GPU. Also, we perform tests on two real scenarios.

Synthetic cases. The performance of all compared methods on five synthetic datasets is reported in Table 1. It can be observed that DFTL-X achieves the highest PSNR on all five datasets. Besides, we show visual comparisons on the Rain200H, DDN, and DID datasets, see Fig. 4. For rain streaks in different directions, our methods preserve more details and attain better disentanglement from complex scenes. We also visualize the trade-off between latency and performance among these deraining models in Fig. 5.

Real cases. To demonstrate the generalization ability, we perform a visual evaluation on real-world rainy images in Fig. 6. It can be observed that DFTL-W/-X remove more rain streaks than other methods and have better visual quality on the whole image.
4.2 Comparison with other Transformers
In this section, we perform a comparison with existing vision Transformers to verify the effectiveness of DFTL on the specific image deraining task. We select six classic Transformer architectures, i.e., ViT256, DETR [Carion et al., 2020], DeformDETR [Zhu et al., 2021], Uformer [Wang et al., 2021b], IPT [Chen et al., 2021] and Restormer [Zamir et al., 2021]. All models are trained in the same framework with the default settings of their original codes. In Table 2, we summarize the PSNR and SSIM of all outcomes produced by different architectures on the Rain200L dataset. Besides, we report the parameter number (Param) and FLOPs, demonstrating that our DFTL can significantly reduce computational costs. Furthermore, we show the comparisons on PSNR and latency in Fig. 5. Our methods obtain better PSNR in an efficient manner.

[Figure 4: three rows of image panels, each showing GT, Input, DSC, JCAS, Clear, DDN, RESCAN, PReNet, RCDNet, FBL, DualGCN, IPT, DFTL-W, and DFTL-X.]

Figure 4: Visual comparisons on the Rain200H, DDN, and DID datasets with different synthetic rain streaks. The first row is from the Rain200H dataset, and the second/last row shows results on the DDN/DID dataset.

[Figure 5: scatter plot of PSNR (dB) against latency (s/it) for Clear, DDN, RESCAN, PReNet, FBL, FuGCN, RCDNet, IPT, ViT, Uformer, Restormer, DFTL-W, and DFTL-X.]

Figure 5: Comparison between latency and performance.
4.3 Ablation Study
Ablation on Cheap LP. In Table 3, we compare different locations for replacing LP with Cheap LP in DFTL-X on the Rain200H dataset. Cheap LP can achieve competitive performance and high efficiency. On the one hand, benefiting from the depthwise operation, Patchify with Cheap LP generates more feature maps to improve performance with lower complexity. On the other hand, using Cheap LP throughout the network leads to weaker performance because of poor feature extraction.

| Methods | PSNR | SSIM | Param | FLOPs |
|---|---|---|---|---|
| Input | 26.71 | 0.8438 | - | - |
| ViT256 [Dosovitskiy et al., 2021] | 36.84 | 0.9652 | 159.9M | 15.3G |
| DETR [Carion et al., 2020] | 27.84 | 0.7144 | 39.4M | 1.6G |
| DeformDETR [Zhu et al., 2021] | 25.64 | 0.7092 | 40.1M | 1.5G |
| Uformer [Wang et al., 2021b] | 38.53 | 0.9816 | 146.7M | 148.1G |
| IPT [Chen et al., 2021] | 39.36 | 0.9850 | 64.3M | 18.0G |
| Restormer [Zamir et al., 2021] | 41.69 | 0.9903 | 35.9M | 9.7G |
| DFTL-W | 39.99 | 0.9873 | 17.6M | 1.8G |
| DFTL-X | 41.27 | 0.9890 | 29.3M | 6.8G |

Table 2: Comparison with various Transformer architectures.

| DFTL-X | PSNR | SSIM | Param | FLOPs |
|---|---|---|---|---|
| Network + LP | 31.76 | 0.9243 | 30.2M | 6.9G |
| Patchify + Cheap LP | 31.81 | 0.9271 | 29.3M | 6.8G |
| Network + Cheap LP | 31.56 | 0.9238 | 20.8M | 4.6G |

Table 3: Comparison of different locations of Cheap LP in DFTL-X.

| Model | Structure | PSNR | SSIM | Latency | Memory | Param | FLOPs |
|---|---|---|---|---|---|---|---|
| SA | Single-Scale | - | - | - | Crack | 1.7G | 229.9G |
| XCA | Single-Scale | 41.39 | 0.9902 | 6.4 | 14187M | 60.9M | 51.7G |
| SA | Multi-Scale | 41.30 | 0.9890 | 5.1 | 14845M | 161.8M | 25.6G |
| W-MSA | Multi-Scale | 40.47 | 0.9871 | 1.9 | 7459M | 161.8M | 5.3G |
| XCA | Multi-Scale | 41.18 | 0.9890 | 3.0 | 5365M | 30.2M | 6.9G |

Table 4: Comparison of different self-attention in DFTL-X.

[Figure 6: two groups of image panels, each showing Input, DSC, GMM, JCAS, Clear, DDN, RESCAN, PReNet, RCDNet, FBL, DualGCN, IPT, DFTL-W, and DFTL-X.]

Figure 6: Visual comparisons on two real-world datasets obtained by [Fu et al., 2017a], [Zhang et al., 2017].
Ablation on DFTL. We compare the effects of different self-attention mechanisms in DFTL-X on the Rain200L dataset. In Table 4, SA and XCA significantly surpass the other compared counterparts. However, SA has higher computational loads and better results than XCA. W-MSA has the same parameter number as SA and larger memory than XCA. In contrast, DFTL-W avoids this problem by employing the sequential form of LP and W-MSA (see more details in the supplementary material). Thus, it attains competitive results and handles a larger MaxBatch with limited resources, as shown in Table 1.

Ablation on PPM. This part investigates the effects of conventional decoders and the proposed PPM on two representative Transformers, i.e., IPT and Restormer. Since IPT is a single-way architecture for image restoration while PPM is designed for multi-scale architectures, PPM cannot be directly integrated into IPT. Thus we only ablate the decoders of IPT. Restormer's decoders are replaced with PPM to validate the effectiveness. Through the comparisons in Fig. 1, the decoders of Transformer bring a slight performance gain but lead to colossal GPU memory occupation (2x∼3x) and latency costs (∼1.5x). Thus, it is significant for an efficient Transformer to build a decoder-free Transformer-like architecture.
Ablation on Hybrid Loss. We compare the results between different weighted combinations and the proposed GBHL. Since $L_{MAE}$ and $L_{SSIM}$ differ by about a factor of 10 in the initial training phase, α is set to 10. The two weighted groups shown in Table 5 are adopted to conduct a comparison on Rain200H. Experimental results show that our GBHL yields the highest values for both PSNR and SSIM.

| Loss combinations | Weights {α,β,γ} | PSNR | SSIM |
|---|---|---|---|
| MAE + 1-SSIM | {10,1} | 29.76 | 0.8955 |
| MAE + MSE + 1-SSIM | {10,1,1} | 29.77 | 0.8968 |
| GBHL | - | 30.02 | 0.9050 |

Table 5: Performance of DFTL-W for different combinations of weights evaluated on the Rain200H dataset.
5 Conclusion
This paper presents a decoder-free Transformer-like architecture (DFTL) for image deraining and analyzes popular Transformer architectures from a new perspective. It reveals that decoders are redundant in Transformers for image deraining. The proposed modules are more computationally efficient than standard Transformer modules. The comparisons with several competitive Transformers show that our methods have good feature representation ability at low computational costs. Moreover, we propose a novel gradient-based hybrid loss function to produce more reasonable results. Extensive experiments demonstrate that DFTL can achieve performance comparable to SOTA methods.
Acknowledgments
This research is supported by NSFC (12171072, 61876203, 61702083), Key Projects of Applied Basic Research in Sichuan Province (Grant No. 2020YJ0216), and National Key Research and Development Program of China (Grant No. 2020YFA0714001).
References
[Carion et al., 2020] N Carion, F Massa, G Synnaeve, N Usunier, A Kirillov, and S Zagoruyko. End-to-end object detection with transformers. In ECCV, pages 213–229, 2020.
[Chen et al., 2021] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In CVPR, pages 12299–12310, 2021.
[Deng et al., 2018] Liang-Jian Deng, Ting-Zhu Huang, Xi-Le Zhao, and Tai-Xiang Jiang. A directional global sparse model for single image rain removal. Applied Mathematical Modelling, 59:662–679, 2018.
[Dosovitskiy et al., 2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[Fu et al., 2017a] Xue-Yang Fu, Jia-Bin Huang, Zeng De-Lu, Yue Huang, Xing-Hao Ding, and John Paisley. Removing rain from single images via a deep detail network. In CVPR, pages 1715–1723, 2017.
[Fu et al., 2017b] Xue-Yang Fu, Jia-Bin Huang, Xing-Hao Ding, Ying-Hao Liao, and John Paisley. Clearing the skies: A deep network architecture for single-image rain removal. IEEE Transactions on Image Processing, 26(6):2944–2956, 2017.
[Fu et al., 2021] Xue-Yang Fu, Qi Qi, Zheng-Jun Zha, Yu-Rui Zhu, and Xing-Hao Ding. Rain streak removal via dual graph convolutional network. In AAAI, pages 1352–1360, 2021.
[Gu et al., 2017] Shu-Hang Gu, De-Yu Meng, Wang-Meng Zuo, and Lei Zhang. Joint convolutional analysis and synthesis sparse representation for single image layer separation. In ICCV, pages 1717–1725, 2017.
[Jiang et al., 2017] Tai-Xiang Jiang, Ting-Zhu Huang, Xi-Le Zhao, Liang-Jian Deng, and Yao Wang. A novel tensor-based video rain streaks removal approach via utilizing discriminatively intrinsic priors. In CVPR, pages 2818–2827, 2017.
[Jiang et al., 2019] Tai-Xiang Jiang, Ting-Zhu Huang, Xi-Le Zhao, Liang-Jian Deng, and Yao Wang. Fastderain: A novel video rain streak removal method using directional gradient priors. IEEE Transactions on Image Processing, 28(4):2089–2102, 2019.
[Li et al., 2016] Yu Li, Robby T Tan, Guo Xiao-Jie, Lu Jiang-Bo, and Michael S Brown. Rain streak removal using layer priors. In CVPR, pages 2736–2744, 2016.
[Li et al., 2018] Xia Li, Jianlong Wu, Zhouchen Lin, Hong Liu, and Hongbin Zha. Recurrent squeeze-and-excitation context aggregation net for single image deraining. In ECCV, pages 262–277, 2018.
[Liang et al., 2021] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In ICCV, pages 1833–1844, 2021.
[Liu et al., 2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pages 10012–10022, 2021.
[Luo et al., 2015] Yu Luo, Yong Xu, and Hui Ji. Removing rain from a single image via discriminative sparse coding. In ICCV, pages 3397–3405, 2015.
[Ren et al., 2019] Dong-Wei Ren, Wang-Meng Zuo, Qing-Hua Hu, Peng-Fei Zhu, and De-Yu Meng. Progressive image deraining networks: A better and simpler baseline. In CVPR, pages 3932–3941, 2019.
[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 6000–6010, 2017.
[Wang et al., 2004] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[Wang et al., 2020] Hong Wang, Qi Xie, Qian Zhao, and De-Yu Meng. A model-driven deep neural network for single image rain removal. In CVPR, pages 3100–3109, 2020.
[Wang et al., 2021a] Wen-Hai Wang, En-Ze Xie, Xiang Li, Deng-Ping Fan, Kai-Tao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021.
[Wang et al., 2021b] Zhen-Dong Wang, Xiao-Dong Cun, Jian-Min Bao, and Jian-Zhuang Liu. Uformer: A general u-shaped transformer for image restoration. arXiv preprint arXiv:2106.03106, 2021.
[Yang et al., 2020] Wenhan Yang, Shiqi Wang, Dejia Xu, Xiaodong Wang, and Jiaying Liu. Towards scale-free rain streak removal via self-supervised fractal band learning. In AAAI, volume 34, pages 12629–12636, 2020.
[Zamir et al., 2021] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. arXiv preprint arXiv:2111.09881, 2021.
[Zhang et al., 2017] He Zhang, Sindagi Vishwanath, and Vishal M Patel. Image de-raining using a conditional generative adversarial network. IEEE Transactions on Circuits and Systems for Video Technology, 2017.
[Zhu et al., 2021] Xi-Zhou Zhu, Wei-Jie Su, Le-Wei Lu, Bin Li, Xiao-Gang Wang, and Ji-Feng Dai. Deformable detr: Deformable transformers for end-to-end object detection. In ICLR, 2021.