VO + Net: An Adaptive Approach Using Variational Optimization and Deep Learning for

(1)

VO + Net: An Adaptive Approach Using Variational Optimization and Deep Learning for

Panchromatic Sharpening

Zhong-Cheng Wu , Ting-Zhu Huang , Member, IEEE, Liang-Jian Deng , Member, IEEE, Jin-Fan Hu, and Gemine Vivone , Senior Member, IEEE

Abstract— Pansharpening refers to a spatio-spectral fusion of a lower spatial resolution multispectral (MS) image with a high spatial resolution panchromatic image, aiming at obtaining an image with a corresponding high resolution both in the domains.

In this article, we propose a generic fusion framework that is able to weightedly combine variational optimization (VO) with deep learning (DL) for the task of pansharpening, where these crucial weights directly determining the relative contribution of DL to each pixel are estimated adaptively. This framework can benefit from both VO and DL approaches, e.g., the good modeling explanation and data generalization of a VO approach with the high accuracy of a DL technique thanks to massive data training. The proposed method can be divided into three parts: 1) for the VO modeling, a general details injection term inspired by the classical multiresolution analysis is proposed as a spatial fidelity term and a spectral fidelity employing the MS sensor’s modulation transfer functions is also incorporated; 2) for the DL injection, a weighted regularization term is designed to introduce deep learning into the variational model; and 3) the final convex optimization problem is efficiently solved by the designed alternating direction method of multipliers. Extensive experiments both at reduced and full-resolution demonstrate that the proposed method outperforms recent state-of-the-art pansharpening methods, especially showing a higher accuracy and a significant generalization ability.

Index Terms— Adaptive fusion, deep learning, image fusion, multiresolution analysis, pansharpening, remote sensing, variational models.

I. INTRODUCTION

S

ATELLITE systems, such as IKONOS, QuickBird, Pl ´eiades, GeoEye, WorldView-3, and WorldView-2, capture two types of images for richer spectral and spatial

Manuscript received November 5, 2020; revised January 19, 2021;

accepted February 24, 2021. This work was supported in part by NSFC under Grant 61772003, Grant 61702083, and Grant 61876203; in part by the Key Projects of Applied Basic Research in Sichuan Province under Grant 2020YJ0216; and in part by the National Key Research and Devel- opment Program of China under Grant 2020YFA0714001. (Corresponding authors: Ting-Zhu Huang; Liang-Jian Deng.)

Zhong-Cheng Wu, Ting-Zhu Huang, Liang-Jian Deng, and Jin-Fan Hu are with the School of Mathematical Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China (e-mail:

[email protected]; [email protected]; [email protected];

[email protected]).

Gemine Vivone is with the National Research Council, Institute of Method- ologies for Environmental Analysis, CNR-IMAA, 85050 Tito Scalo, Italy (e-mail: [email protected]; [email protected]).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TGRS.2021.3066425.

Digital Object Identifier 10.1109/TGRS.2021.3066425

Fig. 1. Schematic illustration of pansharpening for the Rio data set at reduced resolution (source: WorldView-3). (a) MSI interpolated at PAN scale, (b) PAN image, and (c) pansharpened result obtained by our approach. The 5th, the 3rd, and the 2nd spectral channels are extracted for rendering the shown true-color images.

information. A multispectral (MS) image with a lower spatial resolution (LR-MSI) is acquired in order to retain spectral information, instead, a panchromatic (PAN) image accounts for greater spatial content. Thus, approaches for reconstructing an MSI with a higher spatial resolution (HR-MSI) driven by the simultaneously acquired PAN image are developed. In particular, the so-called pansharpening, a research line inside the class of the spatio-spectral fusion methods, is dedicated to the study of this issue. For an intuitive perspective, we present a representative example in Fig. 1.

Pansharpening plays a crucial role as a preliminary step for subsequent applications, e.g., target recognition [1] and change detection [2]. Furthermore, some commercial products, such as Bing Maps and Google Earth, exploit fused images in order to show high-resolution products. Thus, the increasing demand for pansharpened data has led to the growing number of commercial products in recent years [3] with an increment of the related academic research.

A. Related Works

Pansharpening is arousing widespread interest and a huge literature from different perspectives and methodologies can easily be found, see [4]–[6]. A general classification divides the approaches addressing the pansharpening issue into four main categories [7], [8]: 1) component substitution (CS) methods; 2) multiresolution analysis (MRA) methods; 3) variational optimization-based (VO) approaches; and 4) deep learning (DL) techniques.

The principle of CS methods is the replacement of a spatial component of the MSI, which is extracted thanks to a

See https://www.ieee.org/publications/rights/index.html for more information.

(2)

spectral transformation of the LR-MSI, with the PAN image.

Thus, the final product can be obtained through the relative inverse spectral transformation. Some well-known instances of methods into this class are the Brovey transform [9], the prin- cipal component analysis (PCA) [10], the partial replacement adaptive component substitution (PRACS) [11], the intensity- hue-saturation (IHS) [12], the Gram–Schmidt (GS) spectral sharpening [13], and the band dependent spatial detail (BDSD) methods [14], [15]. In general, the CS approaches are characterized by lower time-consumption, whereas they usually show severe spectral distortion [16].

Different from CS methods, the MRA approaches are based on injecting the spatial structure information extracted from the PAN image via spatial filtering into the upsampled LR-MSI. Powerful examples classified into this family are the “`a-trous” wavelet transform (ATWT) [17], the smoothing filter-based intensity modulation (SFIM) [18], and general- ized Laplacian Pyramid [19]. Context-based solutions have also been proposed for this class of algorithms [20], [21].

Compared with CS techniques, the pansharpened images using MRA methods mainly suffer from spatial distortion, whereas spectral contents are preserved well.

The VO approaches in pansharpening have been success- fully attracted more attention for their capability of settling usual ill-posedness inverse problems [22]–[26]. These latter are committed to describing an exact link among the PAN image, LR-MSI, and the unknown HR-MSI, thus formulating an energy function. Therefore, the unknown (pansharpened) image will be get via minimizing this known function. In 2006, Ballester et al. [27] proposed a pioneering pansharpening method, named P+XS, based on variational regularization, mainly relying upon the assumption that the PAN image can be represented as a linear degradation of the target HR-MSI along the spectral dimension. However, the assumption is unrealistic [28] so that this method sometimes suffers from spectral distortion. To strongly avoid these drawbacks, Fu et al. [29] take the gradient similarity constraint between the underlying HR-MSI and the PAN image into account from the local rather than global perspective, enforcing a structural similarity and ensuring better fitting of the relation between the sensors’ responses. Besides, pansharpening methods based on unmixing [30] and dictionary learning [31] are also turned out to be effective techniques. In general, better results, both spatially and spectrally, can be theoretically produced via variational methodology with a solid mathematical foundation [8].

Unfortunately, these methods usually produce many unpre- dictable deviations once making some unreasonable assump- tions, as in [27]. In addition, the misalignment among channels is also considered as another issue, usually resulting in artifacts on final products.

Recently, the DL class has rapidly developed for computer vision problems. Many convolutional neural networks (CNNs) based approaches have been designed for image fusion showing excellent capabilities for feature extraction and nonlinear mapping learning [8], [32]–[38]. They can perfectly compen- sate for the aforementioned deficiencies reflected by VO methods getting state-of-the-art performance [39]. Nevertheless, CNN-based methods are often over-dependent on the training

data [29], thus leading to a weakened generalization, i.e., these methods have excellent performance only on data similar to the ones exploited in the training set. Specifically, since the network parameters are fixed once the training is fin- ished, the accuracy of CNN-based methods cannot be further improved [29]. Although some exceptions, see e.g. [40], where a self-supervised parameter tuning phase has been incorporated in the inference engine, have recently been implemented, the vast majority of the remaining DL-based methods still present this problem.

Moreover, hybrid approaches sharing some of the concepts of CS, MRA, VO, and DL have also been proposed, e.g., the guided filter PCA (GFPCA) [41], the additive wavelet luminance proportional (AWLP) approaches, see [42], where the first two concepts are integrated. Furthermore, Shen et al. [8] designed a groundbreaking architecture for remote sensing image fusion problem via combining a VO method and DL, in which HR-MSI’s gradient priors learned through a residual CNN are integrated into their image observation variational models, getting state-of-the-art results compared to traditional methods. Given the above considerations, we explored in this article, a hybrid fusion framework (detailed in Section III), which can leverage the favorable merits of both MRA and VO and DL methodologies, thus expecting to achieve higher performance on various data.

Different from the work [8], we develop in this article a new strategy, which can adaptively distinguish the relative contribution of deep networks to each pixel, i.e., the weight of characteristics closer to ground-truth (GT) image will be elevated via the proposed estimator. These kinds of algorithms cannot be characterized by the behavior of any single class of the above rigid taxonomy. For instance, our scheme can simultaneously be classified as in the MRA, the VO, and the DL categories.

B. Contributions

In this article, we propose a novel “VO+Net” scheme to merge the LR-MSIs and PAN images, which can weighted combine a generic variational model and a DL technique, thus achieving higher accuracy and generalization. In particular, the designed model consists of the spectral and spatial data fitting terms and a CNN-based proximal term with pixel-wise weights that are automatically estimated (we will call this term WPDI from hereon). The spectral fidelity using the 2

norm is imposed on the LR-MSI and motivated by analyzing the degradation from an underlying HR-MSI to the LR-MSI.

To capture spatial details in a proper way, the spatial coun- terpart is imposed by exploiting a details injection framework encouraging sharp edges of the unknown HR-MSI. Moreover, we introduce the prior knowledge contained in the CNNs into the proposed variational model through the WPDI, which is considered a regularizer. This new hybrid problem is solved by designing an alternating direction method of multipliers (ADMM)-based algorithm, which is guaranteed to efficiently converge to the global optimum. Extensive experiments both at reduced and full resolution, exploiting real data acquired by WorldView-3, WorldView-2, and QuickBird sensors, verify the

(3)

Fig. 2. Flowchart of our scheme. The spectral matching [43] denotes the pre-processing operation, see Section II-A. The acronym CNN stands for deep convolutional neural network.

superiority of the proposed framework compared to other state- of-the-art methods belonging to different categories. Finally, discussions about the parameters analysis, the ablation study, the effects of the new weighting term, the generalization ability, and the sensitivity analysis of computational load, are also provided to the readers. The flowchart of the proposed scheme is provided in Fig. 2.

Thecontributionsof this article are summed up as follows.

1) A details injection framework inspired by the MRA methodology has been introduced in our scheme as a spatial fidelity term;

2) A proximal injection term with adaptive gains in order to combine the prior information from various CNNs with our variational model has been designed resulting in an approach with a better generalization ability and a higher accuracy than the two approaches taken alone;

3) An ADMM-based algorithm has been designed and developed to solve the formulated hybrid problem;

4) A huge experimental analysis based on real data has been conducted to assess the performance of the proposed framework against the benchmark consisting of state-of-the-art pansharpening approaches.

C. Organization

The remaining of this article is organized as follows. The notation and motivations are briefly presented in Section II.

In Section III, the proposed regularization-based framework is detailed. Afterward, the algorithm we designed is provided in Section IV. In Section V, the experimental analysis is shown comparing the proposed approach with some state-of-the-art techniques belonging to different classes of pansharpening algorithms. Finally, conclusions are drawn in Section VI.

II. NOTATION ANDMOTIVATIONS

A. Related Notation

Lowercase letters denote scalars, matrices and vectors are denoted by uppercase and lowercase bold letters, respectively, and calligraphic letters represent tensor. We denote by 1 the

all-one matrix, whose dimension, when not given explicitly, shall be inferred from the context. Furthermore, the main acronyms and symbols used in this article are reported below fori =1,2, . . . ,S.

1) HR-MSI: X ∈ R^H^×^W^×^S with S spectral images;

X⁽ⁱ⁾∈R^H^×^W, whereH andW are the number of pixels in the row and column dimensions, respectively.

2) LR-MSI: Y ∈ R^h^×w×^S with S spectral images;

Y⁽ⁱ⁾ ∈ R^h^×w, where H = h ×r, W = w×r, and r stands for the scale factor.

3) PAN: the known single channel imageP∈R^H^×^W. 4) Reference panchromatic image (RePAN): the extended

PAN image P ∈R^H^×^W^×^S with S bands P⁽ⁱ⁾ ∈R^H^×^W, used as reference for the proposed spatial fidelity term in Section III-B.

P⁽ⁱ⁾can be regarded as a function of LR-MSI bands and the PAN image and it is generated through some pre-processing operations, e.g., spectral matching [43]. Their mode-3 unfold- ing (refer to [44] for more details) can be marked by the sibling uppercase bold letters, namely, by X=Unfold(X), Y=Unfold(Y)and P=Unfold(P), respectively. Given l=(l1,l2)∈, whereis the related image domain,X⁽ⁱ⁾[l]

denotes the pixel value of X⁽ⁱ⁾ at location (l1,l2). The other symbols will be defined along this article, where needed.

B. Motivations

Thanks to the remarkable nonlinear mapping ability of DL, CNN-based methods have recently attracted attention, yielding competitive results even for image fusion. Nevertheless, these methods have also been revealed some fatal defects, as analyzed in Section I, e.g., the enormous data dependence, owing to the use of network parameters that can hardly be adjusted on data very different from the ones shown in the training phase.

As a result, an optimization strategy that can make up for these disadvantages needs to be developed to further improving the performance. For this reason, we propose the concept of WPDI, aiming to introduce the output of a CNN-based method into a VO framework. The WPDI term establishes a relation between this output and the target HR-MSI, thus transferring

(4)

the required prior information from CNN into a variational fusion framework for further optimization. In this model, the WPDI plays the generic role of a regularizer avoiding using other additional prior information often used in the related literature, see, e.g., low-rank prior [45] and sparse prior [46].

Furthermore, a variational model that includes the spectral and spatial data in input fitting terms needs to be completed.

To the best of our knowledge, the spectral fidelity is designed almost in the same way for most VO techniques, even if rare innovations can be found in the related literature, e.g., [5].

Instead, a crucial choice regards the spatial fidelity data fitting term accounting for the PAN image’s spatial content, considering the clear difficulty in modeling the underlying nonlinear even indirect relation. Hence, a new scheme to boost the spatial structure is developed based on the multi-resolution analysis philosophy characterized by a superior spectral fidelity.

III. PROPOSEDMODEL

This section is devoted to presenting the proposed model and the related spatial, spectral, and WPDI terms.

A. Proposed VO+Net Fusion Framework

In this work, we propose a general fusion framework linking the traditional VO and the burgeoning DL techniques for pansharpening, aiming to benefit from both the VO and DL approaches, e.g., the good modeling explanation and data generalization of VO approaches as well as the high accuracy of DL techniques. In particular, this uniformed framework is built up thanks to an adaptive tensor-format weight W. The optimization model is as follows:

minX fVO(X,Y,P)+αfWPDI(X,Xnet,W) (1) where fVO(X,Y,P) is the VO model involving the underlying image X and the observed images Y and P;

fWPDI(X,Xnet,W), i.e., the WPDI term, plays the role of a prior term,αis a positive parameter to balance the VO and the WPDI terms, Xnet ∈ R^H^×^W^×^S indicates the output of a generic CNN-based method and W ∈ R^H^×^W^×^S is the tensor-format weight that is viewed as a bridge connecting VO and DL in element-wise manner.

More specifically, f_VO(X,Y,P) in (1) can be substi- tuted by two popular data fitting terms in pansharpening, i.e., fspec(X,Y)(the spectral fidelity term) and fspat(X,P)(the spatial fidelity term). Afterward, the adopted pansharpening model is given by

minX fspec(X,Y)+λfspat(X,P)+αfWPDI(X,Xnet,W) (2) whereλ is a positive regularization parameter.

For simplicity, the problem in (2) can be simply rewritten as

minX fspec(X,Y)+λfspat(X,P)+αfWPDI(X,Xnet,W) (3) where X, Xnet, W ∈ R^S^×^{H W} and Y ∈ R^S^×^h^w are the Unfold(X), the Unfold(Xnet), the Unfold(W) and the Unfold(Y), respectively.

B. Spectral Fidelity Term

Many existing spatio-spectral fusion methods, see e.g. [9], [11], [14], upsample the LR-MSI to the PAN scale. Nev- ertheless, imprecise information could be carried using this (usually biased) interpolation procedure, thus impacting on the performance. Accordingly, the downsampled version of the underlying HR-MSI is considered in this article. We perform this operation according to the point spread function (PSF) of the MSI sensor [47], [48], aiming to design the blurring operation. Thus, we model that

Y=XBS+ξ1 (4) whereB∈R^{H W}^×^{H W} is the matrixization of blurring operators, S ∈ R^{H W}^×^h^w denotes the decimation matrix consisting of sparse components, and ξ1 indicates a zero-mean Gaussian noise assumed to follow a normal distributionN(0, σ₁²)with a standard deviation σ1. From (4), we paradigmize a spectral fidelity term as

fspec(X,Y)= XBS−Y²F (5) where · F is the Frobenius norm. In some previous articles, researchers used a uniform or a fixed Gaussian kernel to define Bwithout taking into account of the possible differences along the spectral domain. Indeed, kernels designed to match the MS sensor’s modulation transfer functions (MTFs) are advis- able [16], [49]. These usually assume a Gaussian shape with band-dependent standard deviations that are set thanks to the above-mentioned information [16], [49]. Thus, these values are also sensor-dependent invalidating the use of simple solutions based on uniform or fixed Gaussian kernels. In this work, we defineBvia the above-mentioned prior information based on the MS sensor’s MTFs for a more accurate implementation of the deblurring operation. See [47], [48], [50]–[52] for more information about the matrices BandS.

C. Proposed Spatial Fidelity Term

Many spatial fidelity models, see [53], prefer choosing intensity similarity or structure similarity as a constraint, forc- ing pixels or gradient values between the unknown HR-MSI and the PAN image to be consistent. However, the spectral accuracy of the results is often affected and tends to get worse with the increase of the channels to be fused (see e.g. approaches into the CS class). Considering the analysis drawn in Section I, an alternative technique for obtaining the critical details is constituted by the multiresolution analysis (MRA), which is generally achieved by linear decomposition procedures, such as wavelet [54] and Laplacian pyramids [55].

These often yield a high fidelity and consistency in rendering the spatial and spectral features in the final image retain- ing a good tradeoff, especially when the characteristics of the MS and PAN acquisition systems are considered [16], [56]. Besides, their main advantages also consist in tempo- ral coherence [57] and robustness to aliasing under proper conditions [58]. Inspired by the idea under MRA approaches, we exploit the MRA framework presented in [16] and [57]

defining a new spatial improved term based on the extraction first and then injection of spatial details.

(5)

The general MRA framework [16], [57] can be formulated as

X⁽ⁱ⁾=Y⁽_Hⁱ⁾+G⁽ⁱ⁾◦

P⁽ⁱ⁾−P⁽_Lⁱ⁾

(6) for i = 1,2, . . . ,S, where Y⁽_Hⁱ⁾,G⁽ⁱ⁾ ∈ R^H^×^W denote the upsampled version of Y⁽ⁱ⁾ at the scale of the P⁽ⁱ⁾ and the injection coefficients, respectively, the symbol◦represents the element-wise multiplication, andP⁽_Lⁱ⁾is a low-pass version of P⁽ⁱ⁾ via filtering with impulse response hi, i.e., P⁽_Lⁱ⁾ = P⁽ⁱ⁾∗hi [59], where ∗ indicates the convolution operation.

Moreover, P⁽ⁱ⁾ −P⁽_Lⁱ⁾ are the image details regarded as a high-pass version of P⁽ⁱ⁾. Two common options for defining the injection gains areG⁽ⁱ⁾=1, namely the so-calledadditive injection scheme, and G⁽ⁱ⁾=Y⁽_Hⁱ⁾P⁽_Lⁱ⁾, which is referred to as themultiplicativeinjection scheme orhigh-pass modulation (HPM) scheme, where denotes an element-wise division.

The latter, adopted in this article, can reproduce the local intensity contrast of the PAN image in the fused image [60]

and turns out to be better suited than the former, being explained by its greater flexibility in configuring the local weights [16].

Analogously to Section III-A, we make the substitution of YH with XB, to avoid the introduction of the LR-MSI, Y, thus independently modeling the relation from the unknown HR-MSI to the PAN image. Hence, considering the matrix form, we have

G◦ P−PL

=X−XB+ξ2 (7)

where ξ2 is an error assumed to follow a zero-mean normal distributionN(0, σ₂²)with a standard deviationσ2. This is an implicit expression about the histogram-matched PAN image, contributing to a componentwise mapping from an unknown HR-MSI to the details image closely related to the PAN image. It is worth to be remarked that by comparing with the structure-similarity models employing the hypothesis of linear combination, see [3], the spatial fidelity model in (7) can improve spatial details while preserving the spectral content to the greatest extent, benefiting from the dual characteristics of the MRA and of the element by element mapping in the high-frequency domain (using HPM injection models).

Accordingly, an MRA-based details injection model for a spatial fidelity can be obtained as follows:

f_spta(X,P)=X−

XB+G◦

P−PL²

F. (8)

In this article, we adopt a HPM scheme to deal with the details extraction. Namely, we define G=XBPL (directly using XBinstead ofYH), thus having

XB+G◦ P−PL

=XB◦ PPL

(9) where PL is defined exploiting MTF-based filters as in Section III-A.

Accordingly, the final spatial fidelity term relying upon an HPM scheme can be formalized as follows:

fspta(X,P)=X−XB◦

PPL²

F. (10)

The advantages of the formulation in (10) are an enhancement of the spatial details while preserving spectral content and the absence of any interpolation operation, unlike in (6).

D. WPDI Term

In [61] the following proximal deep injection (PDI) term modulated by a constant parameter α as a regularizer is introduced

fPDI(X,Xnet)= X−Xnet²F. (11) By using this proximal term, it is interesting to note in [61] that the proposed technique gets always better performance than the CNN-based outcomeX_net, through controlling the parame- terα. If the input data significantly differs from the training data of the CNN-based method, one could select a smaller α, even zero, to weaken the PDI term. However, from the element-wise perspective, we argue that, even in the worst case where the CNN result,Xnet, seems so far from the optimality, some pixel values in Xnet are very close to that of the GT image. This means that it is better to have a spatial-dependent weight, aiming to locally differentiate the relative contributions of Xnet. Thus, we propose a pixel-based weight to mini- mize the distance between the unknown HR-MSI and the CNN-based outcome Xnet, named weighted proximal deep injection (WPDI) term. Besides, a novel strategy is designed to adaptively estimate this weight. The proximal term is defined as follows:

f_WPDI(X,X_net)= W◦(X−X_net)²F (12) where W ∈ R^S^×^{H W} denotes a coefficient matrix defining a weight for each pixel. A better result can be expected through a set of weighting factors that are inversely proportional to the absolute error between GT image and Xnet. Thus, still considering the parameter α, we propose a new weighting procedure (absorbing α into the paradigm to simplify the mathematical formula) as follows:

W_α =√

αW=√

αwf() (13)

whereα is a balance parameter,wf is strictly defined by an element-wise monotone decreasing function, and denotes the absolute residual matrix. In Section V, we definewfas

wf()^def=√

1− (14)

in which 1− is selected as a correction to W = 1 (i.e., Unweighted), and √

(·) plays a role to offset the quadratic effect, as in sub-solution (34) of Section IV. Note that, outliers exceeding 1 inare forced to be 1. This definition suggests that the undetermined W has a simple dependence on the reconstructed accuracy of the network or, more precisely, on the dispersion of the reconstructed residuals. The degree of dispersion is measured by the standard deviation of , i.e., Std(), abbreviated as STDR from hereon. The procedure in (13) is committed to distinguishing the effectiveness of each element, thus referring to as differential weighting based on the dispersion of residuals (DWDR). The higher the STDR, the more significant the effect of the DWDR.

Section V-E experimentally demonstrates the validity of the

(6)

proposed weighting scheme. However, the reference GT image is unavailable, which leads to the next estimation step based on (7).

E. Estimating

Combining (7) with (9), we have that X=G◦

P−PL

+XB−ξ2

=XB◦ PPL

−ξ2. (15) Thus, we can redefine the expression above usingXnetand the GT image, say it Xgt, as follows:

X_net =X_netB◦ PPL

−ξ₂ Xgt =XgtB ◦

PPL

−ξ₂ (16) where ξ2 and ξ2 are the two reconstruction errors assumed to be equivalent when inducing. Then, the absolute residual matrix can be estimated as

= ₊

Xnet−Xgt

= +

XnetB−XgtB

◦ PPL

−

ξ₂−ξ₂

≈ +

(X_netB−YH)◦ PPL

(17) where ₊indicates the componentwise projection operator on the set of nonnegative real numbers, that is,

+(X)[l]^def=

−X[l], IfX[l]<0

X[l], IfX[l]≥0. (18) Accordingly, the final optimization model consisting of three energy terms is expressed as

minX

1

2XBS−Y²

F+λX−XB◦

PPL²

F

+W_α◦(X−Xnet)²

F. (19) The proposed scheme explores the obtainable spatial details and spectral contents via the first two fitting terms for data in the input. Furthermore, with the WPDI term treated as a regularizer, the prior knowledge provided by any CNN is incorporated into the proposed VO model to optimize further.

Although the objective function (19) is convex and differ- entiable, it is inadvisable to directly generate the derivative because of a heavy computational and storage burden. In the next section, a new ADMM-based algorithm will be explored to pursue an iterative solution.

IV. PROPOSEDALGORITHM

This section applies an ADMM methodology [62] designed for structured convex optimization problems to the proposed model, aiming to separate the problem in three independent subproblems, all having a closed-form solution. By introduc- ing two auxiliary variables U andV, we rewrite (19) as the following equivalent constrained problem:

Xmin,U,V

1 2

US−Y²F +λX−U◦

PPL²

F

+W_α◦V−W_α◦X_net²

F

s.t. U=XB, V=X. (20)

The augmented Lagrangian function of this constrained mini- mization problem (20) can be expressed as

L_η1,η2(X,U,V, 1, 2)

= 1

2US−Y²F

+λX−U◦

PPL²

F+ W_α◦V−W_α◦Xnet²F

+η1

2

XB−U+1

η1

²

F+η2

2

X−V+2

η2

²

F+const (21) where 1, 2 denote the Lagrange multipliers, η1, η2 > 0 are two penalty parameters, and const represents a generic constant. Afterward, (21) can be alternatively and iteratively solved via updating the following simpler subproblems.

A. UpdatingX

By fixing U,V, 1, 2, then, the X-subproblem is con- verted to

arg min

X

λX−U◦

PPL²

F+η1

2

XB−U+1

η1

²

F

+η2

2

X−V+2

η2

²

F (22)

which is a simple least squares problem and has a closed-form solution under the condition of a periodic boundary, as

X:=F⁻¹ M+N

(2λ+η2)1+η1F(B)◦F(B)^†

(23) with

M =2λF U◦

PPL

+η1F(U)◦F(B)^† (24) N= −F(1)◦F(B)^†+η2F(V)−F(2) (25) whereF(·) and F⁻¹(·) are the fast Fourier transform (FFT) and its inverse operators, respectively, the superscript † denotes the complex conjugate, and the division is componentwise, as well. Exploiting MTF-based filters (generating different blurring kernels for each spectral band), this solution can be rearranged as a single channel-wise expression, appearing as

X⁽ⁱ⁾:=F⁻¹

M⁽ⁱ⁾+N⁽ⁱ⁾ (2λ+η2)1+η1O⁽ⁱ⁾◦

O⁽ⁱ⁾_†

(26) fori=1,2, . . . ,S, in which

M⁽ⁱ⁾=2λF U⁽ⁱ⁾◦

P⁽ⁱ⁾P⁽_Lⁱ⁾

+η1F U⁽ⁱ⁾

◦ O⁽ⁱ⁾†

(27) N⁽ⁱ⁾= −F(⁽₁ⁱ⁾)◦(O⁽ⁱ⁾)^†+η2F(V⁽ⁱ⁾)−F(⁽₂ⁱ⁾) (28) where O⁽ⁱ⁾ ∈ R^H^×^W denotes the optical transfer function (OTF) obtained by applying the FFT to hi ∈ R^r^×^r matching the MS sensor’s MTF at theith spectral channel.

(7)

Fig. 3. Graphical illustration of (30) whenr = 4. White squares with a solid line and a blank content are zero values.

B. UpdatingU

Similarly, the variable U can be updated by solving the following function:

arg min

U

1

2US−Y²F+λX−U◦

PPL²

F

+η1

2

XB−U+1

η1

²

F. (29) Before going into the details, we provide to the readers a definition of a novel concept, saying it decimation mask.

Definition 4.1 (Decimation Mask): Given a scale factor r, the decimation mask is defined as a sparse matrix K∈R^r^×^r, whose entry is 1 only in one position determined by the decimation scheme, e.g., K[(3,3)] = 1 when r = 4 (as in our article) whereasK[l]is 0 for all the other cases.

Afterward, the following equation holds, also interpreted as in Fig. 3

USS^T =U◦DS S^T (30) with

DS S^T =[vec(K⊗1),vec(K⊗1), . . . ,vec(K⊗1)]^T (31) where the symbol ⊗ denotes the Kronecker product, vec(·) implies the vectorization operator, and 1 ∈ R^h^×w. Based on (30) and (31), the solution of the least squares problem (29) can be denoted as

U:= YS^T +η1XB+2λX◦ PPL

+1

DS S^T +η11+2λ PPL

◦ PPL

. (32)

Note that the element-wise division is needed, as well.

C. UpdatingV

Different from the above-mentioned procedures, the V-subproblem is simpler and has a straightforward solution according to

arg min

V

W_α◦V−W_α◦X_net²F +η2

2

X−V+2

η2

²

F

(33)

Algorithm 1 ADMM-Based Solver for the Proposed Pan- sharpening Model (19)

Input:LR-MSIY, PAN image P,Xnet,λ, α,η1,η2,r,kmit. Initialization: X⁰=(Y,r),U⁰=V⁰=⁰₁=⁰₂=0

1: whilek<kmit andr elcha> ε do

2: Update Xvia (26) - (28).

3: Update Uvia (31) - (32).

4: Update Vvia (34).

5: Update Lagrange multiplier 1 via (35).

6: Update Lagrange multiplier 2 via (36).

7: end while

Output:Fused HR-MSI X

correspondingly, the closed-form solution can be directly given by

V:= 2W_α◦W_α◦Xnet+η2X+2

2W_α◦W_α+η21 . (34) Likewise, the division is element-wise, as well.

D. Updating1,2

According to the ADMM framework, multipliers 1, 2

can be updated via

1:=1+η1(XB−U) (35) and

2:=2+η2(X − V). (36) The algorithm considers the termination criterion as that the relative change (r elcha) between two successive pansharpened results less than a tolerance value, ε, that is,

relcha= X^k⁺¹−X^kF/X^kF < ε. (37) The detailed algorithm solving the problem (19) is summa- rized in Algorithm 1, in which,k_mitis the maximum iteration of execution, and indicates the bicubic interpolation for accelerating convergence. The convergence of the designed iterative scheme is guaranteed [63].

V. EXPERIMENTALRESULTS

This section is developed to compare the proposed scheme with some state-of-the-art approaches using several data sets collected by different sensors. Two types of validation, i.e., qualitative and quantitative, are conducted at reduced and full resolution, respectively. Reduced resolution data are obtained according to Wald’s protocol [60] under the scale invariance assumption, i.e., filtering with filters designed to match the MS sensor’s MTFs and decimation [16], [49]. All the experiments are implemented in MATLAB on a computer of 16Gb RAM and Intel(R) Core(TM) i7-5960X CPU:

at 3.00 GHz.

For quantitative evaluation at reduced resolution, some popular indexes exploiting a reference image are adopted, including the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) [64], which are widely used to evalu- ate the similarity between two images in the image processing

(8)

literature; the spectral angle mapper (SAM) index [65] and spatial correlation coefficient (SCC) [66], which are utilized for assessing the spectral and spatial distortions, respectively; the erreur relative globale adimensionnelle de synth `ese (ERGAS) index [67] and the Q2ⁿ index (Q4 and Q8 for 4 and 8 bands fused outcomes, respectively) [68], [69], which are employed to measure the global quality. The ideal values for SSIM, SCC, and Q2ⁿ are 1, for SAM (degree,^◦) and ERGAS are 0, whereas+∞ for PSNR. Thequality with no reference (QNR)[70] index is applied to assess the performance at full resolution. It consists of a spectral distortion index D_λ and a spatial distortion index Ds. In particular, the highest QNR (i.e., 1) is obtained when bothD_λandDs are 0 indicating the best-fused image quality. Furthermore, scale factors for all the experiments are 4, i.e., r = 4, the tolerance value is preset to ε = 2×10⁻⁵, and kmit is empirically fixed to 200. The optimum parameters are set to provide the best compromise between the spectral and spatial distortions, measured by the SAM and the SCC, respectively.

For a broader comparative experimental analysis, some CNN approaches generating the Xnet are needed. Many state-of-the-art approaches for pansharpening can be found, see [33]–[36]. In our experiments, the PanNet proposed by Yang et al. [33] and the DiCNN proposed by He et al. [36]

are selected to determinate the WPDI term. Then, these WPDI terms are employed to form our final schemes, called OursPanNet and OursDiCNN, respectively.

A. Data Sets

The satellite data sets, some of them freely available online,¹ are exploited to examine the effectiveness of the proposed pansharpening method, including: 1) the Tripoli and the Rio data sets both acquired by the WorldView-3 sensor including an MS image with 8 spectral bands and a PAN image with a spatial resolution of 0.3 m; 2) the Stockholm and the WashingtonDC data sets captured by the WorldView-2 satellite characterized by an MS image with 8 spectral bands and a PAN image at a spatial resolution of 0.4 m; and 3) the Indianapolis data set collected by the QuickBird sensor included an MS image with four spectral bands and a PAN image with a 0.61-m resolution at spatial.

B. Benchmark

We compare the proposed algorithm with some recent state-of-the-art techniques belonging to the CS, MRA, VO, and DL families. In greater details, the following algorithms are considered: EXP [55], PCA [10], GIHS [12], GSA [71], SFIM [18], MTF-GLP-HPM [72], GLP-Reg- FS [59], 18’TIP [3], 19’CVPR [29], PanNet [33], and DiCNN [36]. It is rather remarkable that the source codes of competitors are available at either of the website² or the authors’ homepages. For fairness, we fine-tune the parameters of the 18’TIP and the 19’CVPR to get their best performance.

It is worth noting that the PanNet and the DiCNN are with consistent parameter configurations as that in [33] and [36],

1http://www.digitalglobe.com/samples?search=Imagery

2http://openremotesensing.net/kb/codes/pansharpening/

respectively, and both of them are mainly re-trained on two different training data sets. Following Wald’s protocol [37], the training data sets are simulated, as follows.

1) Training data from WorldView-3: 8806 {PAN, MS, GT}

training samples with the size of 64×64, 16×16×8, and 64×64×8, respectively, are contained.

2) Training data from QuickBird: 20685 {PAN, MS, GT}

image pairs with the size of 64×64, 16×16×4, and 64×64×4, respectively, are encompassed.

The PanNet and DiCNN trained based on the former are employed for the WorldView-3 experiments and WorldView-2 (verifying generalization ability), and trained to exploit the latter for the QuickBird experiments. Note that, all testing samples employed in Section V are spatially disjoint from the adopted training patches.

C. Qualitative Comparison

A true or pseudo color representation is selected for the qualitative analysis of the pansharpened results.

1) Reduced Resolution Assessment: As aforementioned, we give the comparative results at reduced resolution, as shown via an RGB rendering in Figs. 4–6. Inspecting Figs. 4 and 6, only the results of PanNet, DiCNN, and ours (including OursPanNetand OursDiCNN) can clearly retain details. Having a look at the error maps depicted in Figs. 4 and 6, we can clearly see that the PanNet, the DiCNN, and the proposed methods obtain the best performance both from a spectral and a spatial points of view compared to the GT image.

The performance in Fig. 5 differs from the ones in Fig. 4.

Specifically, the PanNet and the DiCNN achieve poorer visual appearance with respect to many traditional methods, e.g., MTF-GLP-HPM [72] and GLP-Reg-FS [59], on the reduced resolution WashingtonDC data set, since the two networks are both trained on WorldView-3 data set (i.e., because of an unsatisfactory network generalization). In contrast, the results obtained by our framework show high visual quality as depicted in the related close-ups and having a look at the difference maps.

2) Full Resolution Assessment: Then, we corroborate the proposed models on full resolution images in order to further support the obtained results at reduced resolution. For these experiments, the images used for testing are extracted from the full resolution Rio and Stockholm data sets. All the visual inspections are displayed in Figs. 7 and 8. It is worth noting that the near-infrared (NIR) channel is selected to highlight the differences in the vegetated areas. As shown in Figs. 7 and 8, many results performed at full resolution are in line with those obtained on the reduced resolution Washington DC data set and our methods always obtain the details closest to the PAN image regarded as a reference for the spatial enhancement.

From the close-ups in the two experiments, we clearly observe that the highways and cars, which are blurred in the EXP images but recognizable in the PAN images, are super-resolved by our algorithms and without both visible artifacts and spectral distortion. Among the compared approaches, the GIHS often provides outcomes with severe spectral distortion, clearly visible to an easy inspection of the fused

(9)

Fig. 4. Visual results on the reduced resolution Tripoli data set (source: WorldView-3). The scale of the PAN image is 256×256. Top united row: the visual inspection of the GT image and the close-ups of images fused by EXP, PCA, GIHS, GSA, SFIM, MTF-GLP-HPM, GLP-Reg-FS, 18’TIP, 19’CVPR, PanNet, DiCNN, OursPanNet, OursDiCNN, and of the GT image, respectively. Bottom united row: the corresponding residual maps using the GT image as reference.

For a better visualization, 0.4 has been added to all channels.

Fig. 5. Visual results on the reduced resolution Washington DC data set (source: WorldView-2). The scale of the PAN image is 256×256. Top united row: the visual inspection of the GT image and the close-ups of images fused by EXP, PCA, GIHS, GSA, SFIM, MTF-GLP-HPM, GLP-Reg-FS, 18’TIP, 19’CVPR, PanNet, DiCNN, OursPanNet, OursDiCNN, and of the GT image, respectively. Bottom united row: the corresponding residual maps using the GT image as reference. For a better visualization, 0.4 has been added to all channels.

products. Thus, the superiority of the proposed framework is corroborated at full resolution.

D. Quantitative Comparison

To further assess the performance of the proposed method, we provide the quantitative comparisons across 75, 75, 5, 55, and 55 images with a corresponding PAN size on the five data sets in Figs. 4–8. These images are extracted in order to cover all the possible features in a scene, e.g., vegetation, water, buildings, and so forth. The statistical results for all the metrics and the execution times (in seconds, s) are shown in Tables I–III, IV-(a), IV-(b). We can clearly observe that our models consistently achieve much better average values (except for D_λ) than the other methods on both reduced and

full-resolution data sets, implying that our framework can get competitive results. Moreover, our schemes can also achieve comparable efficiency with respect to variation methods.

To sum up, both qualitative and quantitative experiments are conducted indicating that the proposed scheme can get very high performance at reduced and full resolution, with respect to the benchmark. In particular, for the reduced resolution Tripoli and Indianapolis data sets, although the PanNet and the DiCNN methods also achieve excellent performance, our schemes improve them. Instead, for the reduced resolution WashingtonDC and the full resolution Rio and Stockholm data sets, where the effects of these networks are relatively lower, state-of-the-art results are still obtained by our framework.

It is noteworthy that better results can probably be produced

(10)

Fig. 6. Visual results on the reduced resolution Indianapolis data set (source: QuickBird). (Top united row) Visual inspection of the GT image and the close-ups of images fused by EXP, PCA, GIHS, GSA, SFIM, MTF-GLP-HPM, GLP-Reg-FS, 18’TIP, 19’CVPR, PanNet, DiCNN, OursPanNet, OursDiCNN, and of the GT image, respectively. (Bottom united row) Corresponding residual maps using the GT image as reference. For a better visualization, 0.4 has been added to all channels.

TABLE I

QUANTITATIVERESULTS FORALL THECOMPAREDMETHODS ON75 IMAGESFROM THEREDUCEDRESOLUTIONTRIPOLIDATASET (SOURCE: WORLDVIEW-3). ALL THECNN METHODSAREEXECUTED ONGPU (G), WHEREAS THEOTHERAPPROACHES

USE THECPU (C). (BOLD: BEST; UNDERLINE: SECONDBEST)

TABLE II

QUANTITATIVERESULTS FORALL THECOMPAREDAPPROACHES ON75 IMAGESFROM THEREDUCEDRESOLUTIONWASHINGTON DC DATASET(SOURCE: WORLDVIEW-2). ALL THECNN METHODSAREEXECUTED ONGPU (G), WHEREAS THEOTHER

APPROACHESUSE THECPU (C). (BOLD: BEST; UNDERLINE: SECONDBEST)

(11)

TABLE III

QUANTITATIVERESULTS FORALL THECOMPAREDAPPROACHES ON5 IMAGESFROM THEREDUCEDRESOLUTIONINDIANAPOLIS DATASET(SOURCE: QUICKBIRD). ALL THECNN METHODSAREEXECUTED ONGPU (G), WHEREAS THE

OTHERAPPROACHESUSE THECPU (C). (BOLD: BEST; UNDERLINE: SECONDBEST)

Fig. 7. Pseudo color (NIR(7th), G(3rd), and B(2nd) bands as R, G, and B channels) compositions for the full resolution Rio data set (source: WorldView-3).

The scale of the PAN image is 400×400. Close-ups are exhibited in the down-left corners, zooming in to see the details.

Fig. 8. False color [NIR(7th), G(3rd), and B(2nd) bands as R, G, and B channels] compositions for the full resolution Stockholm data set (source:

WorldView-2). The scale of the PAN image is 400×400. Close-ups are exhibited in the down-left corners, zooming in to see the details.

via fine-tuning the parameters of our schemes for different experiments on the same data set, but a fixed set of parameters is adopted for showing the algorithm robustness by varying the test cases.

E. Discussions

In this section, some discussions about the proposed scheme, see e.g., the parameters analysis and the ablation study, are given. All the discussions are about the proposed

(12)

TABLE IV

QUANTITATIVERESULTS FORALL THECOMPAREDMETHODS ON(A) 55 IMAGESFROM THEFULLRESOLUTIONRIODATASET(SOURCE: WORLDVIEW- 3). (B) 55 IMAGESFROM THEFULLRESOLUTIONSTOCKHOLMDATA SET(SOURCE: WORLDVIEW-2), RESPECTIVELY. ALL THECNN METHODS

AREEXECUTED ONGPU (G), WHEREAS THEOTHERAPPROACHESUSECPU (C). (BOLD: BEST; UNDERLINE: SECONDBEST)

Fig. 9. SAM, SCC, ERGAS, Q8 curves for: the regularization parameters (a) λ and (b)α, the penalty parameters (c) η1 and (d)η2 on a reduced resolution Tripoli data (source: WorldView-3). The best points are pointed out with a black star. Note that, for better distinguishing the slight difference and reflecting the changing trend, we process the obtained indexes by (index−Mean(index))/Std(index), whereMean(·)andStd(·)represent the mean and standard deviation operations, respectively. Besides, the mean and standard deviation for SAM, SCC, ERGAS and Q8 are (a) 3.0783±0.2293;

0.9847±0.0041; 2.0113±0.2824; 0.9670±0.0091; (b) 2.9458±0.0185;

0.9870±0.0003; 1.8529±0.0350; 0.9718±0.0007; (c) 3.0243±0.1015;

0.9858±0.0017; 1.9360±0.1164; 0.9695±0.0030; (d) 3.0390±0.1583;

0.9854±0.0028; 2.1918±0.6798; 0.9682±0.0060.

OursPanNet approach using the reduced resolution Tripoli data (as well as other images if needed) for brevity.

1) Parameters Analysis: Four parameters are involved in the proposed model, i.e., λ, α, η1, η2. Fig. 9 depicts the performance varying these parameters on a reduced resolution Tripoli data. We fix all the parameters except for the one to be analyzed, aiming to make the parameter selection step simpler.

It is clear that the combination ofλ=3×10⁻⁴,α=1.1×10⁻³, η1=10⁻² andη2=3×10⁻² is the best choice whatever the adopted quality metric. Figs. 9(a)–(d) only show slight changes

TABLE V

QUANTITATIVERESULTS FOR THEABLATIONEXPERIMENTUSING THE REDUCED RESOLUTION TRIPOLI DATA (SOURCE: WORLDVIEW-3).

(BOLD: BEST; UNDERLINE: SECONDBEST)

for all the considered indexes, thus confirming the robustness of our scheme related to its parameters. The same policy can be applied to other remote sensing data sets in order to determine the corresponding optimal parameter configuration.

2) Ablation Study: We conduct an ablation study on our model for a deeper insights, and then, three sub-models are generated as follows:

Submodel-I minX

1

2XBS−Y²

F+λX−XB◦

PPL²

F. (38) Submodel-II

minX

1

2XBS−Y²

F+W_α◦(X−Xnet)²

F. (39)

Submodel-III minX

1 2

X−XB◦

PPL²

F +W_α◦(X−Xnet)²

F. (40) We perform the tests corresponding to Fig. 4, again, using these models with optimal parameters. The quantitative results are reported in Table V. We can observe that the models involving the WPDI term (i.e., Submodels-II and -III) get better performance than the ones of the PanNet, which is used to determinate the WPDI term. Furthermore, the fusion performance of the Submodel-II is comparable to that of the OursPanNet (see Table V) indicating the dominant role of the