FINE: A Framework for Distributed Learning on Incomplete Observations for Heterogeneous

(1)

FINE: A Framework for Distributed Learning on Incomplete Observations for Heterogeneous

Crowdsensing Networks

Luoyi Fu¹, Songjun Ma², Lingkun Kong¹, Shiyu Liang², Xinbing Wang^1,2 Dept. of{Computer Science and Engineering¹, Electronic Engineering²}

Shanghai Jiao Tong University, Shanghai, China

Email:{yiluofu,masongjun,klk316980786,lsy18602808513,xwang8}@sjtu.edu.cn

Abstract—In recent years there has been a wide range of applications of crowdsensing in mobile social networks and vehicle networks. As centralized learning methods lead to un- reliabitlity of data collection, high cost of central server and concern of privacy, one important problem is how to carry out an accurate distributed learning process to estimate parameters of an unknown model in crowdsensing. Motivated by this, we present the design, analysis and evaluation of FINE, a distributed learning Framework for Incomplete-data and Non- smooth Estimation. Our design, devoted to develop a feasible framework that efficiently and accurately learns the parameters in crowdsensing networks, well generalizes the previous learning methods in that it supports heterogeneous dimensions of data records observed by different nodes, as well as minimisation based on non-smooth error functions. In particular, FINE uses a novel Distributed Record Completion algorithm that allows each node to obtain the global consensus by an efficient communication with neighbours, and a Distributed Dual Average algorithm that achieves the efficiency of minimizing non-smooth error functions.

Our analysis shows that all these algorithms converge, of which the convergence rates are also derived to confirm their efficiency.

We evaluate the performance of our framework with experiments on synthetic and real world networks.

I. INTRODUCTION

Recently, there emerge massive applications of crowdsensing/participatory sensing in mobile social networks and vehicle networks [1]–[7]. The crowd acquire some (potentially high dimensional) data from the environment and each user in the crowd can exploit the cooperatively acquired data to perform a learning process for an accurate estimation of the parameters of some specific models. This, in turn, leads to an accurate prediction of future events and correct decision of the following action.

In this paper, we aim to address the issue of the accurate learning in the undirected-static-random crowdsensing networks. In order to solve this problem, there have been various proposed approaches [8]–[10] whose learning processes are usually formulated as optimizations of the total training error, likelihood function and etc [11] (e.g.,, liner regression, support vector machines or expectation-maximization [12]).

However, these methods usually employ centralized learning algorithms, which leads to three major problems. First, in real world crowdsensing settings, mobile devices are likely to be located over an enormous space, which makes it both energy

consuming and prone to error for central server collecting data from all mobile devices, especially those who are distributed far away from the server. Second, dealing with large volume of data by centralized algorithms requires an expensive high- configuration data center that possesses huge memory for data storage and processing. Third, managing data by central servers make the private information of users more likely to be exposed to the adversary [13]–[15], which might cause severe information leakage.

The three problems above imply the necessity of a distributed realization of parameter learning in crowdsensing environments. However, when applying existing distributed learning methods to our scenarios, restrictions of two charac- teristics in the common distributed framework spawn addition- al problems. To illustrate, first, for mathematical tractability, the error function is usually assumed to be smooth and convex for the design of efficient algorithms; while as the emerging crowdsensing applications may incorporate different properties, the training error functions may be non-smooth in nature [16], [17]. For instance, in distributed detections [18], source intensity functions may be non-smooth, resulting in non-smoothness in training error as well. Second, the common framework requires each terminal to acquire a set of complete records, i.e., each record with data elements in all dimensions to ensure the accuracy of the learning process. This implicitly assumes that the terminals are homogeneous in functionalities so that each of them should record the same types of signals (e.g., each mobile phone can record traveling speed, waiting time as well as ambient noise at every position). In contrast, it is impossible for each terminal to record full-dimension data in the crowdsensing applications. For example, one mobile phone can only be responsible for data acquisition at its own position, leaving the observation of elements in other positions (i.e., dimensions) a job of other mobile phones. Moreover, mobile phones may hold different types of sensors, and therefore are unable to acquire all kinds of signals.

We are thus motivated to propose a distributed learning Framework of Incomplete-data and Non-smooth Estimation (FINE), which aims to exhibit high compatibility to learning applications in crowdsensing environments. There are two major challenges in the design: 1) It is difficult for each node to supplement the unknown dimensions of the observed vector,

(2)

especially when users hesitate to upload the acquired data to the central server for privacy concerns.¹ 2) Due to the huge amount of data collected and the high dimensionality of each record, it is often required that the learning process should be operated by each terminal in a distributed fashion.

The efficiency of minimizing a single non-smooth function is notoriously low [19], yet a distributed processing is even more challenging due to intricate interdependency among the multiple non-smooth optimizations processed by each respective terminal.

We overcome the difficulties above by designing two algorithms in FINE. First, we design a Distributed Record Completion (DRC) algorithm to allow each node to obtain global consensus. Specifically, each terminal consistently obtains incomplete record and completes the missing elements from its neighbors. The combined successive observation and consensus design ensure that each node can acquire unbiased and accurate multidimensional global parameters in spite of the originally fragmentary inputs. Second, we design a novel Distributed Dual Average (DDA) algorithm to solve the non- smooth convex optimization problems with efficiency. To sum up, our major contributions are listed as follows:

• We propose FINE, a novel framework addressing a class of distributed learning problems in heterogeneous crowdsensing networks. FINE is robust to observation noises, capable of handling fragmentary data inputs as well as non-smooth objective functions, and efficient to solve distributed learning problems with a convergence rate² of O(^log√

|V|

1−C ).

• We design two important algorithms in FINE: a DRC algorithm to ensure each node to acquire complete information based on its incomplete data acquisition, and a DDA algorithm to solve non-smooth convex optimizations with efficiency.

• We formally prove the convergence of the above two algorithms, and further derive their convergence rates.

We provide the insights on the relationship between the convergence rate and the network topology, and reveal important design principles for such networks.

The rest of the paper is organized as follows. In SectionII, we provide problem description and the design insights of our framework FINE. In section III, we propose two important algorithms and demonstrate the implementation of such a framework. SectionIVpresents a detailed analysis of proposed algorithms. SectionVprovides detailed proofs of lemmas and theorems. We give performance evaluation of our framework in Section VI and literature review in Section VII. Section VIIIis contributed to the concluding remarks.

II. SYSTEMMODEL ANDPROBLEMFORMULATION

In this section, we first present the system model and then

1The privacy concern of information leakage is not new in crowdsensing networks, which often stems from the centralized management of crowdsensing systems [13]–[15]. Therefore, users might be reluctant to upload the acquired data to the central server..

2|V| is the number of agents and 1 − C represents the spectral gap of the network.

formulate our problems, followed by giving insights of our algorithm design.

A. System Model

We consider a heterogeneous sensor network described by an undirected graphG = (V, E) consisting of |V| nodes. We use V and E to denote the set of nodes and edges, respec- tively. The undirected graph implies that the sensor network allows each pair of connected sensors can communicate in bi-direction.

The heterogeneity refers to the fact that terminals observe different dimensions of data. We let y_k(i) denote an incom- plete data vector observed by each terminal k at time i. For practical consideration, we allow the observation to include noise εk(i). Assume that y^∗ is an M -dimensional complete correct data vector, the incomplete observations of the k-th agent, y_k(i), could be expressed by a linear mapping, Hk, from the global data vector y^∗:

y_k(i) = Hky^∗+ εk(i)∈ R^M, (1) where Hk is a matrix inR^M^×M.

Each terminal requires the full dimensional data to conduct learning process. Assume that terminals have various error functions fk. We allow each local error function to be non- smooth. The local training error ϵ_k at node k is a function of global tunable parameter vector θ and global data y, i.e., ϵ_k = f_k(θ, y), which is convex and non-smooth on θ.

B. Problem Formulation

In consideration of the above settings, we should first guarantee that each node can obtain an accurate estimation y on the global data vector. This encourages us to design theˆ DRC algorithm. In short, the goal of the DRC is to make each node k hold the complete estimation which satisfies:

lim

i→∞yˆ_k(i) = y^∗.

Then, with the estimations and the settings of error func- tions, we find the optimal parameter θ^∗ to minimize the total training error:

θ^∗:= arg min θ

∑|V|

k=1

fk(θ, ˆy), s.t., θ∈ Θ, ˆy ∈ Y,

where Θ⊂ R^Q,Y ⊂ R^M are both convex sets. This encourage us to design the DDA algorithm. Note that the solution needs information exchange between a node and its neighbors, which involves the consideration of the network topology.

To sum up, our formulation deals with the non-smooth learning. We extend the previous dual averaging approach [20], [21] by taking the incomplete observation and observation noise into account.

Remark: As a contrast, the traditional formulation [8]–[10], [22], [23] considers the smooth optimization problem based on a homogeneous sensor network. The homogeneity implies that each agent k should record the same types of signals, and

(3)

full dimensional observations y_k. The learning problem is to find θ^∗ which minimizes the summation of the local training error:

θ^∗ := arg min θ

∑|V|

k=1

f (θ, y_k), s.t., θ∈ Θ, yk∈ Y, where Θ andY are both convex sets. θ is a globally tunable parameter vector and the function f is assumed to be convex and smooth on the parameter θ.

C. Algorithm Design Insights

In this section, we briefly describe the insights of algorithm design and explain how FINE can efficiently and accurately deal with the challenging distributed learning problems in heterogeneous crowdsensing networks. FINE uses a Distributed Record Completion (DRC) algorithm (SectionIII-A) to ensure each node to obtain global consensus in spite of the incomplete local observations and a Distributed Dual Average (DDA) algorithm (Section III-B) to efficiently solve the non-smooth optimization problems.

1) DRC Algorithm: Our DRC allows each node to com- municate with each other to complete the data and achieve consensus, on the condition that each node’s successive observations are required. This combined successive observation and consensus design ensure that each node can acquire unbiased and accurate multidimensional global parameters y^∗. On the one hand, if nodes cooperate but only process the initial noisy observations (i.e., only {yk(0)} are employed), as in traditional consensus [24]–[26], they all converge to the average of initial estimates which might result in severe bias.

Therefore, the requirement of both successive observations and consensus in DRC ensures that each node converges to an unbiased and accurate estimate of the global vector y^∗. Remark: The communication overhead, which quantifies the weight of communication among sensors or mobile phone users during reaching to a consensus, is an important performance metric of consensus-based learning algorithms [27]–

[30]. However, in our paper, we are interested in measuring the time it takes for the whole network to discover the convergence of optimization, i.e., the optimal parameter θ^∗. Here the time refers to communication complexity in reaching convergence of optimization, while neglecting local computation complexity inside mobile devices as it is highly determined by the sensors required in a crowdsensing task and is considerably small – can be completed during clock ticks.

2) DDA Algorithm: A well-known sub-gradient method [31] used for solving a single non-smooth optimization has a time efficiency of O(_ε¹2), where ε is the tolerable error.

For solving distributed non-smooth optimization problem consisting of multiple interdependent functions, the efficiency of the sub-gradient would be O(^|V|_ε2³) [21], [32], [33], where|V|

is the network size. This order does not reflect the explicit influence of the network topology, but only reflects the worst case of efficiency. References [23], [34], [35] provide the convergence rate of the optimization error under some special network topologies. Aiming at the general network topology

with the network size|V| and networking topological param- eter C, literature [36]’s method can achieve the convergence rate of the error ε withO(^log₁_−C^|V|).

In the context of crowdsensing, more practical factors, i.e., partial observation and observation noise, need to be considered, and these factors will lead to the increase of the time complexity. Therefore, it is worthwhile to propose a non-smooth algorithm to improve the efficiency. Towards this end, we build our algorithm on an efficient non-smooth optimization methodology called the dual averaging method introduced in [20]. The dual averaging in nature is an efficient non-smooth convex optimization method. By realizing the dual averaging method into a distributed manner, as will be prescribed in later sections, we will show that our algorithm can return a consistent estimate θ^∗for each node. Additionally, as we will also provably demonstrate later, the convergence rate of our distributed algorithm can also achieveO(^log√

|V|

1−C ).

Remark: The DDA algorithm, as we will unfold in sequel, is designed by extending the centralized dual averaging method into a distributed form, requiring that each node performs local information exchange that follows a weighting process (where each edge is assigned a weight). Thus, intuitively the process is heavily correlated with the network topology. Compared to [36], the DDA algorithm provides consistent efficiency despite that it is applied to solve more complicated distributed learning problems that take practical factors like partial observation and observation noise into account.

Up till now, we have presented our problem, and articulate all insights which inform FINE’s design. In the sequel, we will detail the design of DRC and DDA in FINE.

III. ALGORITHMS

In this section, we describe the details of the design of DRC and DDA algorithms used in FINE. We firstly introduce the notations. Let En denote n× n identity matrix; let 1n, 0n

denote the column vectors of ones and zeros, defined in Rⁿ respectively; let∥·∥ denote the standard Euclidean 2-norm for a vector and the induced 2-norm for matrixes, equavelent to the matrix spectral radius for symmetric matrixes. Besides, we use ∥·∥_∗ to denote the dual norm to ∥·∥,defined by ∥u∥_∗ :=

sup_∥v∥⟨u, v⟩, which refer to the value of the linear function u∈ X^∗ at v ∈ X (X is a vector space and X^∗ is its dual space). We further letP[·] and E[·] denote the probability and the expectation operators, respectively. We use the symbol⊗ to denote the Kronecker product manipulation, commonly used in matrix manipulations. For instance, the Kronecker product of the n×n matrix L and Emis an nm×nm matrix, denoted by L⊗ Em.

In the undirected graphG we consider, we denote the neigh- bor of an arbitrary node v∈ V by Nv ={u ∈ V|euv ∈ E}, the number of edges incident to v by the degree dv=|Nv| of node v, and the degree matrix byD = diag(d1, ..., d_|V|). Then, we use an adjacent matrix,A = [Auv]_|V|×|V|, to describe the network connectivity. We set Auv = 1, if euv ∈ E; or 0

(4)

otherwise. Further, we define the graph Laplacian matrix³ L as L = D − A, a positive semidefinite matrix with ordered eigenvalues 0≤ λ1(L) ≤ λ2(L) ≤ ... ≤ λ_|V|(L). Moreover, for an n× n matrix B, it has a series of order singular values σ1(B)≥ σ2(B)≥ ... ≥ σn(B)≥ 0. Interested readers may refer to [37], [38] for more details on spectral graph theory.

A. Distributed Record Completion

The Distributed Record Completion, or DRC algorithm, allows each node k to obtain an accurate and complete global data vector ˆy_k. DRC algorithm is an iteratively updating process as described below.

Algorithm 1 Framework of DRC for Each Terminal.

Start:

Initial non-random observation y_k(0) ∈ R^M, and let u_k(0) = y_k(0).

Output:

Estimation on the global data, ˆy_k.

1: for i = 1 to T do

2: Receiving neigbors’ estimation on the global data, i.e., {uv(i), v∈ Nk}.

3: Comparing its own estimation with its incomplete ob- servation, i.e., Hkuk(i)− yk(i);

4: Comparing its own estimation with its neighbors esti- mation, i.e.,∑

v∈Nk(uk(i)− uv(i));

5: Updating estimation uk(i + 1) based on Eq. (2);

6: end for

7: return ˆy_k= u_k(T );

Alg.1summarizes the outline of DRC. To illustrate, the sequence {uk}k≥0 is defined to represent the estimated records on all dimensions of data, generated by each node k as follows.

Starting from the initial non-random observation y(0)∈ R^M, at each iteration i, after observing an incomplete data y_k(i), each node k updates uk(i) by a distributed iterative algorithm.

In this algorithm, each node compares its estimated record u_k(i) with its neighbors’, and also with the observation y_k(i).

Then he determines the estimated record of the next time slot with the difference between uk(i) and the deviations, as shown in what follows:

uk(i + 1) = uk(i)− α(i) ∑

v∈Nk

(uk(i)− uv(i))

− β(i)Hk^T(H_ku_k(i)− yk(i)),

(2)

where α(i)∑

v∈Nk(uk(i)− uv(i)) is the deviation of records from neighbors and β(i)H_k^T(H_ku_k(i)− yk(i)) implies the deviation from observations. Since uk is node k’s estimation on all dimensions, for the comparison between the record and the observation, we use the linear mapping H_k. Both the positive weight sequence{α(i)}i≥0 and{β(i)}i≥0 satisfy the persistence condition C.5 given in AppendixA. For the ease

3Numerical natures of the graph can be investigated with the graph Laplacian matrix, for example, connectivity, expanding properties, diameter and etc. In this paper, we define the network connectivity using the Laplacian spectrum, which will be illustrated in the following assumption A.2.

of notation, we rewrite iterations in Equation (2) in a compact form, which can describe the consensus process of all nodes.

To begin with, we store the incomplete observations of all n- odes at iteration i in a long vector y(i) = [y^T₁(i), ..., y^T_|V|(i)]^T, store updates of the i-th iteration in another long vector u(i) = [u^T₁(i), ..., u^T_|V|(i)]^T, and define the following two matrices:

H = diag¯ [

H₁^T, ..., H_|V|^T ]

, (3)

H = diag˜ [

H₁^TH1, ..., H_|V|^T H_V ]

. (4)

Then, using the Kronecker product symbol, we can rewrite the Equation (2) in a compact form as

u(i + 1) = u(i)− α(i)(L ⊗ EM)u(i)

− β(i) ¯H( ¯H^Tu(i)− y(i)). (5) Given the total number of iteration steps T , each node k will obtain a data vector uk(T ) in the end. Let ˆy_k = uk(T ), in SectionIV-A, we will show ˆy_k is in fact an unbiased estimate on y. As now we have the detailed updating process used in DRC algorithm, we will solve the distributed non-smooth minimization problem based on ˆy_k.

B. Distributed Dual Average

Based on ˆy_k, we now use Distributed Dual Average, or DDA algorithm to provide an accurate estimate ˆθ^∗_k on the optimal parameter θ^∗ for each node in a distributed style.

Formally, we need to solve the following minimization:

min θ

∑|V|

k=1

f_k(θ, ˆy_k), s.t., θ∈ Θ, ˆyk∈ Y. (6)

Note that fk is a non-smooth function, where non- smoothness implies that the function does not have a continuous gradient, which makes solving such function more difficult than the smooth function. To deal with the non-smooth function, the sub-gradient method should be employed, while a slow convergence has to be endured [39]. For example, solving a single non-smooth optimization has an efficiency estimate of O(_ε¹2) [40], where ε is the desired accuracy of the approximation solution, while minimizing a single smooth function only requires an efficiency estimate of the order O(√

1

ε) [31]. Furthermore, solving the distributed non-smooth optimization problem consisting of multiple interdependent non-smooth functions has even lower efficiency. Therefore, we propose the DDA algorithm to improve the efficiency of the distributed non-smooth optimization of Eq. (6) in crowdsensing networks.

Distributed Dual Averaging (DDA) Algorithm:

In Alg.2, we summarizes the outline of DDA. It is designed by extending the centralized dual averaging method [20] into a distributed form. Now we provide the details of the algorithm.

The DDA algorithm requires each node to exchange information with its neighbors, and the exchange follows a weighting process, where the edge is assigned a weight. Thus, the process is strongly correlated with the network topology.

(5)

Algorithm 2 Framework of DDA for Each Terminal.

Start:

Initial pair of vectors (θk(0), µ_k(0))∈ Θ × R^M, and let µ_k(0) = ˆy_k.

Output:

Estimation on the optimal parameter θ^∗.

1: for i = 1 to T do

2: Computing the sub-gradient g_k(t)∈ ∇θf_k(θ_k(t), ˆy_k);

3: Receiving estimated information from neigobors, i.e., {µj(t), j∈ Nk};

4: Updating (θk(t), µ_k(t)) with Eq. (7) and (8);

5: end for

6: return ˆθk(T ) with Eq. (9);

At each iteration t, each node k maintains a pair of vectors (θk(t), µ_k(t)) ∈ Θ × R^M, computes its own sub- gradient g_k(t) ∈ ∇θfk(θk(t), ˆy_k), and receives information on sequences from its neighbor nodes, i.e., {µj(t), j ∈ Nk}.

Next, at each iteration, each node updates its maintained vector (θk(t), µ_k(t)) by weighting values from its neighbors. To model this weighting process, we use P ∈ R^|V|×|V| to denote the edge weights matrix of the graphG. Thus, Pkl> 0 if and only if ekl ∈ E and k ̸= l. This matrix represents the weight of each link, which can capture some natures of the link.

For instance, the value can represent the intimacy between two nodes. A higher value implies the neighbor on this link will contribute more in the information exchange. Note that P_kl> 0 only if (k, l)∈ E, and Pkl > 0 only if (k, l)∈ E, the weight update is described with the following equations:

µ_k(t + 1) = ∑

l∈Nk

Pklµ_l(t) + gk(t), (7) θ_k(t + 1) = π_ω(t+1)(−µk(t + 1)), (8) where the function πω(u) is defined by

πω(µ) = arg min

ξ∈Θ{− ⟨µ, ξ⟩ + ωϕ(ξ)},

and {ω(t)} is the non-decreasing sequence of positive step- sizes.

Specifically, we assume the matrix P is a doubly stochastic matrix, so

∑|V|

l=1

Pkl= ∑

l∈Nk

Pkl = 1 for all k∈ V;

∑|V|

k=1

Pkl = ∑

k∈Nl

Pkl= 1 for all l∈ V.

To sum up, each node k computes its new dual sequence µ_k(t + 1) by weighting both its own sub-gradient gk(t) and the sequences{µl(t), l∈ Nk} stored in its neighborhood Nk, and the node also computes its next local primal parameters θk(t + 1) by a projection defined by the proximal function ϕ and step-size ω(t) > 0.

The intuition behind this method is: based on its current iteration (θk(t), µ_k(t)), each node k chooses its next iteration

θ_k(t+1) so as to minimize an averaged first-order approxima- tion to the function f =∑

kf_k, while the proximal function ϕ and step-size ω(t) > 0 ensure that{θk(t)} does not oscillate wildly during iterations.

At the end of iteration T , each node k has obtained a sequence{θk(t)}1≤t≤T. We run a local average for each node as follows:

ˆθk(T ) = 1 T

∑T t=1

θk(t). (9)

This means if we let ˆθ^∗_k = ˆθ_k(T ) at the end of iteration T , we will have limT→∞f ( ˆθ^∗_k) = f (θ^∗). Thus, with this iteration, each node k can obtain an estimate of the optimal parameter with any desired accuracy. We will prove the convergence of DDA in SectionIV.

To sum up, in order to solve distributed non-smooth minimization problems in heterogeneous crowdsensing networks, we first present a DRC algorithm to allow each heterogeneous node to obtain an accurate estimate on the globally required data vector y. Based on this, we design a DDA algorithm to ensure that each node obtains an accurate estimate on the optimal parameter θ^∗. In the next section, we will present a formal analysis on the convergence and convergent rates of both algorithms.

IV. MAINPROPERTIES OFDRCANDDDA

In this section, we present the main properties of the DRC and DDA algorithms. We defer detailed proofs in SectionV.

A. Main Properties of DRC

To begin with, we present main properties with regard to the asymptotic unbiasedness and the consistency of the DRC algorithm. Furthermore, we carry out a convergence rate analysis by studying the deviation characteristic of DRC.

The results rely on the following three assumptions:

(A.1) Observation Noise: For i ≤ 0, the noise {

ε(i) = [ε^T₁(i), ..., ε^T_|V|(i)]^T }

i≥0 is i.i.d. zero mean. More- over, at each node k, the noise sequence {εk(i)}1≤k≤|V|,i≥0

is independent with each other, and the covariance of the observing noise, Sε, is independent over time i, i.e.,

E[ε(i)ε(j)^T] = S_εδ_ij,∀i, j ≥ 0, (10) where δij = 1 iff i = j or 0 otherwise.

(A.2) Networking Connectivity: The second eigenvalue of graph Laplacian L is non-negative, i.e., λ2(L) ≥ 0. We require the graph to be connected to allow communication among nodes. This can be guaranteed if λ2(L) > 0. See [37], [38] for details.

Before presenting the final assumption, we first give the following definition:

Definition 1. The observations formulated by Equation (1) is distributedly observable if the matrix H, defined by H =

∑_|V|

k=1H_k^TH_k, is of full rank.

Remark: This distributed observability is essentially an extension of the observability condition for the centralized

(6)

observing system which is designed to obtain consistent and complete observation on the vector y.

Now let us present the last assumption.

(A.3) Observability: The observations formulated by Equa- tion (1) is distributedly observable defined by Definition1.

1) Unbiasedness and consistency of DRC: In this part, we show the unbiasedness and the consistency of DRC algorithm, and we provide two theorems to illustrate them respectively.

Theorem 1. Consider the DRC algorithm is under the assumptions A.1-A.3 (Section III-A), the record sequence {uk(i)}i≥0 at node k is asymptotic unbiased

ilim→∞E[uk(i)] = y^∗,∀1 ≤ k ≤ |V|. (11) We defer the proof in Section V-A. Theorem 1 shows the unbiasedness of the algorithm. It indicates that each node’s estimation on the global data would be correct on the average in the long run. The consistency of DRC algorithm is guaranteed by the following theorem.

Theorem 2. Consider the DRC algorithm is under the assumptions A.1-A.3 (Section III-A),the records sequence {uk(i)}i≥0 at node k is consistent

P[

ilim→∞uk(i) = y^∗,∀1 ≤ k ≤ |V|]

= 1.

We provide the proof in AppendixB. Based on Theorem2, record sequence{uk(i)}i≥1 at every node, with probability 1, converges to the true vector y^∗.

2) Convergence rate analysis: We now analyze the conver- gence rate of the DRC algorithm via its deviation characteristic. We first present a relative definition which is used to characterized the convergence rate of sequential process.

Definition 2. A sequence of records {u(i)}i≥0 is asymptoti- cally normal if a positive semidefinite matrix S(y) exists and satisfies that

lim

i→∞

√i(u_k(i)− y^∗)→ N (0M, S_kk(y(i))),∀1 ≤ k ≤ n.

The matrix S(y(i)) is called the asymptotic variance of the observing sequence{y(i)}i≥0, and Skk(y)∈ R^M^×M denotes the k-th principal block of S(y(i)).

In the following part, we analyze the asymptotic normality of the DRC algorithm. Let λmin(γL ⊗ EM+ ˜H) denote the smallest eigenvalue of [γL ⊗ EM + ˜H]. Recalling the noise covariance Sϵ in (10), we present the following theorem to establish the asymptotic normality of the DRC algorithm.

Theorem 3. Consider the DRC algorithm is under the assumptions A.1-A.3 (Section III-A), with weight sequence {α(i)}i≥0 and {β(i)}i≥0 that are given by

α(i) = a i + 1, lim

i→∞

α(i)

β(i) = γ > 0,

for some a > 0. Let the record sequence{u(i)}i≥0be the state sequence generated by (5). Then, for a > ¹

2λmin(γL⊗EM+ ˜H), we obtain

√(i)(u(i)− 1_|V|⊗ y^∗) =⇒ N (0, S(y(i))),

where

S(y(i)) = a²

∫ _∞

0

e^ΣvS0e^Σvdv, (12) Σ = −a[γL ⊗ EM + ˜H] +1

2E_M_|V|, (13) and

S0 = HS¯ ϵH¯^T. (14) Especially, the record sequence{uk(i)}i≥0 at any node k is asymptotically normal:

√(i)(uk(i)− y^∗) =⇒ N (0, Skk(y(i))).

We provide the proof in Appendix B. Therefore, the error sequence {uk(i)− y^∗}i≥0 at each node can be regarded as being convergent to a normal distribution with a rate of √¹

i. Up until now, we have presented asymptotic unbiasedness, the consistence and the asymptotic normality of the DRC algorithm. In the next section, we present main properties of the DDA algorithm.

B. Main Properties of DDA

In this section, we prove the convergency of running average θˆk(T ) to the optimal parameter θ^∗and derive the convergence rate of the DDA algorithm.

Now we present the following theorem.

Theorem 4. The random family {θk(t)}^∞t=0and {µk(t)}^∞t=0

are generated by iteration (8) and (7), with the positive non- decreasing step-size sequence{ω(t)}^∞t=0, where ϕ is strongly convex with respect to the norm ∥·∥ with dual norm ∥·∥_∗. Let the record error E[ˆy_k]− 1|V|⊗ y^∗ be bounded by an arbitrary small constant C_err. For any θ^∗∈ Θ and each node k∈ V, we have

f (ˆθ_k(T ), ˆy_k)− f(θ^∗, y^∗)≤ OPT + NET + SAMP, where

OP T = ω(T )

T ϕ(θ^∗) + L² 2T τ

∑T t=1

1 ω(t), N ET = L

T

∑T t=1

1 ω(t)E

[ 2

|V|

∑|V|

j=1

¯µ(t)− µj(t)

∗

+∥¯µ(t) − µk(t)∥ ]

, SAM P = LCerr,

¯

µ(t) = 1

|V|

∑|V|

k=1

µ_k(t).

Recall that τ is the convexity parameter.

Theorem 4 explicitly shows the difference between the estimated results from the true optimality. It is bounded by a value which is a sum of three types of errors: (1) The OPT error can be viewed as the optimization error; (2) the NET error is induced by various estimations of nodes; and (3) the SAMP error is incurred on account of the input noisy. The

(7)

theorem indicates the relationship between the difference and T , which will help us understand the convergency of the DDA algorithm. The detailed proof will be given in Section V.

We next investigate the relationship between the convergence rates and the spectral property of the network. For a given graph G, we assume that communications between nodes are controlled by a double stochastic matrix P . In the following, we show that the spectral gap of the network, i.e., γ(P ) = 1− σ2(P ) of P severely influences the convergence rate of DDA, where σ2(P ) is the second largest singular value of P .

Theorem 5. Following Theorem4and recalling that ϕ(θ^∗)≤ A², if we define the step-size ω(t) and the record error E[ˆy_k]− 1_|V|⊗ y^∗ as:

ω(t) = A√

t and Cerr= 2L A√

T · ln(T√

|V|) 1− σ2(P ), we will have

f (ˆθ_k(T ), y)− f(θ, y^∗)≤ 16L² A√

T ln(T√

|V|)

1− σ2(P ), for all k∈ V.

We defer the proof in Section V-C. Theorem 5 shows that the convergence rate of distributed subgradient methods heavily relies on the graph spectral property. The dependence on the spectral quantity 1−σ2(P ) is quite natural, since lots of work have noticed that the propagation of information severely relies on the spectral property of the underlying network.

As we have presented all main properties of our algorithms, we will next turn to the detailed proof of each theorem.

V. PROOF OFTHEOREMS

A. Proof of Theorem 1

Proof: Taking the expectation of both sides of Eq. (5), it follows

E[u(i + 1)] = E[u(i)] − α(i)(L ⊗ EM)E[u(i)]

+ β(i) ¯HE[y(i)] − β(i) ¯H ¯H^TE[u(i)]. (15) Given that

(L ⊗ EM)(1_|V|⊗ y^∗) = 0_|V|M, (16) H(1˜ _|V|⊗ y^∗) = HE[y(i)],¯ (17) subtracting both sides of Eq. (15) by 1_|V|⊗ y^∗, we have

E[u(i + 1)] − 1_|V|⊗ y^∗= [E_|V|M− α(i)L ⊗ EM

− β(i) ˜H][E[u(i)] − 1_|V|⊗ y^∗]. (18) Continuing the iteration in (18), we have, for each i ≥ i0= max{i1, i₂},

E[u(i)]− 1_|V|⊗ y^∗ ≤(_i₋₁

∏

j=i0

E_M_|V|− α(j)L ⊗ EM

−β(j) ˜H

)

× E[x(i0)]− 1_|V|⊗ y^∗] . (19) To further derive the above formulation, we have the following facts.

First, since ^α(i)_β(i) → γ, we have

∃i1∋: γ 2 ≤α(i)

β(i) ≤ 2γ, ∀i ≥ i1. (20) Second, let λmin(γL⊗EM+ ˜H) be the smallest eigenvalue of the positive definite matrix⁴ [γL ⊗ EM+ ˜H]. Since α(i) → 0, we have

∃i2∋: α(i) ≤ 1

λ_min(γL ⊗ EM+ ˜H).∀i ≥ i2 (21) Third, the other facts include: 1) λmin(A + B) ≥ λmin(A)+λmin(B) (Courant-Fischer Minimax Theorem [41]), 2) λmin(L ⊗ EM) = λmin(L) ≥ 0.

Based on above facts, the multiplicand of Equation (19) follows from (21), for each j≥ i0

||EM|V|− α(j)L ⊗ EM− β(j) ˜H||

=

E^M^|V|− β(j)(α(j)

β(j)L ⊗ EM + ˜H)

= 1− β(j)λmin(α(j)

β(j)L ⊗ EM+ ˜H)

= 1− β(j)λmin((α(j) β(j)−γ

2)L ⊗ EM+γ

2L ⊗ EM+ ˜H)

≤ 1 − β(j)λmin(γ

2L ⊗ EM+ ˜H).

(22) From (19) and (22), we now have for each i > i0,

E[u(i)]− 1|V|⊗ y ≤(_i−1

∏

j=i₀

(1− β(j)λmin(γ

2L ⊗ EM+ ˜H)) )

× E[u(i0)]− 1|V|⊗ y^∗] .

(23) Finally, from the inequality 1− a ≤ e^−a, 0≤ a ≤ 1, we get E[u(i)]− 1_|V|⊗ y^∗ ≤exp

[

−λmin(γ

2L ⊗ EM + ˜H)

i−1

∑

j=i0

β(j) ]

× E[u(i0)]− 1_|V|⊗ y^∗] , i > i0. (24) With the facts that λmin(γL ⊗ EM+ ˜H) > 0 and the sum of β(j) approaches to infinity, we have

ilim→∞ E[u(i)]− 1_|V|⊗ y^∗ = 0.

Thus we complete the proof.

B. Proof of Theorem 4

Before proving the theorem of algorithm convergency, we present here some basic assumptions and necessary lemmas.

(A.4) A prox-function ϕ : Θ → R exists to be τ-strongly convex with respect to the norm∥·∥, i.e.,

ϕ(θ₁)≥ ϕ(θ2) +⟨∇θϕ(θ₂), θ₁− θ2⟩ +τ

2∥θ1− θ2∥², for θ1, θ2∈ Θ. Function ϕ is non-negative over Θ and ϕ(0) = 0. The prox-center of Θ is given by θ0= arg minθ{ϕ(θ) : θ ∈

4Due to the page limit, we omit the proof of the positive semidefiniteness of the matrix.

(8)

Θ}. Moreover, we assume that for the optimal parameter θ^∗, ϕ(θ^∗)≤ A².

(A.5) The error function fk at each node k is L-Lipschitz with respect to the norm ∥·∥, i.e., for θ1, θ2∈ Θ, we have

|fk(θ1, ˆy_k)− fk(θ2, ˆy_k)| ≤ L ∥θ1− θ2∥ . Lemma 1. Define the function

V_ω(θ) = max

ζ_∈Θ{⟨θ, ζ − θ0⟩ − ωϕ(ζ)}.

Then function Vω(·) is convex and differentiable on Θ. More- over, its gradient is L-Lipschitz continuous with respect to the norm∥·∥

∥∇Vω(u)− ∇Vω(v)∥ ≤ 1

ωτ ∥u − v∥ , ∀u, v ∈ Θ, where the gradient is defined as follows

∇Vω(u) = πω(u)− u0, πω(u) = arg min

v∈Θ{− ⟨u, v⟩ + ωϕ(v)}.

(25) Note that u₀= π_ω(0).

Lemma 2. Let{g(t)}^∞t=1be an arbitrary sequence of vectors, and consider the sequence{θ(t)}^∞t=1 generated by

θ(t + 1) = arg min θ∈Θ

{ _t

∑

r=1

⟨g(r), θ⟩ + ω(t)ϕ(θ) }

= π_ω(t) (

−

∑t r=1

g(r) )

.

For any non-decreasing positive step-sizes{ω(t)}^∞t=0, and any θˆ∈ ΘC, we have

∑T t=1

⟨

g(t), θ(t)− ˆθ⟩

≤ 1 2τ

∑T t=1

∥g(t)∥²_∗

ω(t) + ω(T )C.

For any ˆθ∈ ΘC⊂ Θ^∗, we have

∑T t=1

⟨

g(t), θ(t)− ˆθ⟩

≤ 1 2τ

∑T t=1

∥g(t)∥²_∗

ω(t) + ω(T )ϕ(θ^∗).

In addition, we establish the convergency of algorithm via two auxiliary sequences

φ(t + 1) = π_ω(t)( ¯µ(t + 1)) (26) and present the following lemma.

Lemma 3. With definitions of the random family {θk(t)}^∞t=0, {µk(t)}^∞t=0 and{φk(t)}^∞t=1in Eq. (7), (8) and (26), and the L-Lipschitz condition of each f_k, for each node k ∈ V, we have

∑T t=1

[f (θk(t), y_k)− f(θ^∗, y^∗)]≤

∑T t=1

[f (φ(t), y_k)− f(θ^∗, y_k)]

+

∑T t=1

[L∥θk(t)− φ(t)∥ + L ∥yk− y^∗∥].

Similarly, defining ˆφ(T ) = _T¹ ∑T

t=1φ(t) and θˆk(T ) = _T¹∑T

t=1θk(t), we have

f (ˆθ_k(t), y_k)− f(θ^∗, x^∗)≤ f(ˆφ(t), yk)− f(θ^∗, y_k)

+ L

ω(t)T

∑T t=1

∥θk(t)− φ(t)∥ + L ∥yk− y^∗∥.

Based on above lemmas, we now present the proof of Theorem4.

Proof: We perform our proof by analyzing the random family{φ(t)}^∞t=0. Given an arbitrary θ^∗∈ Θ, we have

∑T t=1

[f (φ(t), ˆy_k)− f(θ^∗, ˆy_k)]

= 1

|V|

∑T t=1

∑|V|

k=1

[fk(φ(t), ˆy_k)− fk(θ^∗, ˆy_k)]

≤

∑T t=1

1

|V|



∑^|V|

k=1

[L∥φ(t) − θk(t)∥ + fk(φ(t))− fk(θ_k(t))]



 .

The inequality of the above equation is resulted by the L- Lipschitz condition on fk.

Let g_k(t) ∈ ∂fk(θk(t)) and use the convexity of the function, then we will obtain the following bound:

∑|V|

k=1

[f_k(θ_k(t))− fk(θ^∗)]≤

∑|V|

k=1

⟨gk(t), θ_k(t)− θ^∗⟩

=

∑|V|

k=1

⟨ˆgk(t), φ(t)− θ^∗⟩ +

∑|V|

k=1

⟨ˆgk(t), θk(t)− φ(t)⟩

+

∑|V|

k=1

⟨gk(t)− ˆgk(t), θk(t)− θ^∗⟩ (27)

For the first term in the right hand side of Equation (27), from the Lemma 2, it follows that

1

|V|

∑T t=1

⟨_|V|

∑

k=1

ˆ

g_k(t), φ(t)− θ^∗

⟩

≤ 1 2τ

∑T t=1

1 ω(t)

1

|V|

∑|V|

k=1

gˆ_k(t)

2

∗

+ ω(T )ϕ(θ^∗).

(28)

Holder’s inequality implies that E[∥ˆgl(t)∥_∗∥ˆgk(s)∥_∗] ≤ L² and E[∥ˆgk(t)∥_∗]≤ L² since ∥ˆgk(t)∥_∗ ≤ L for any k, l, s, t.

We use these two inequalities to bound (28), E

1

|V|

∑|V|

k=1

ˆ g_k(t)

2

∗

≤ 1

|V|²

∑|V|

k,l=1

E[∥ˆgk(t)∥_∗∥ˆgl(t)∥_∗]≤ L².

For the second term in the right hand side of Equation (27), θ_k∈ Ft−1 and φ(t)∈ Ft−1 by assumption, so

E ⟨ˆgk(t), θ_k(t)− φ(t)⟩ ≤ E ∥ˆgk(t)∥ ∥θk(t)− φ(t)∥

=E(E[∥ˆgk(t)∥ |Ft−1]∥θk(t)− φ(t)∥)

≤LE ∥θk(t)− φ(t)∥

≤ L

ω(t)τE ∥¯µ(t) − µk(t)∥_∗.

(29)