MixLasso: Generalized Mixed Regression via Convex Atomic-Norm Regularization

(1)

MixLasso: Generalized Mixed Regression via Convex Atomic-Norm Regularization

Anonymous Author(s) Affiliation

Address email

Abstract

We consider a generalization of mixed regression where the response is an additive

1

combination of several mixture components, so that standard mixed regression is

2

the special case where each response is generated from exactly one component.

3

Typical approaches to the mixture regression problem employ local search methods

4

such as Expectation Maximization (EM) that are prone to spurious local optima. On

5

the other hand, a number of recent theoretically-motivated Tensor-based methods

6

either have high sample complexity, or require the knowledge of probability density

7

function of the input distribution, which might not be available in practice. In

8

this work, we study a novel convex estimator MixLasso for generalized mixed

9

regression based on an atomic norm that is specifically constructed in order to

10

regularize the number of mixture components. Our analysis of the algorithm gives a

11

risk bound that trades off prediction accuracy and model sparsity without imposing

12

stringent assumptions on the input/output distribution beyond that of boundedness.

13

Our approach is also applicable to both linear and non-linear mixture regression

14

settings. In our numerical experiments with mixtures of linear as well as nonlinear

15

regression functions, the proposed method yields high-quality solutions in a wider

16

range of settings than existing approaches.

17

1 Introduction

18

The Mixed Regression (MR) problem considers the estimation of K regression functions, from a

19

collection of input-output samples, where in each sample the output conditioned on the input is

20

specified by one of the K regression functions. When fitting linear functions in a noiseless setting,

21

this is equivalent to solving K linear systems without knowing which system each equation belongs

22

to. The MR formulation can be also employed as an approach to decompose a complicated (possibly

23

multi-valued) function into K simpler ones, by splitting the observations into K classes. Variants of

24

regression families such as piecewise-linear regression can be viewed as special cases of MR.

25

The MR problem is NP-hard in general [1] due to the simultaneous fitting of the discrete class labels

26

as well as the regression functions. Standard approaches to the mixture problem employ local search

27

methods such as Expectation Maximization (EM) [2] and Variational Bayes [3] that are prone to

28

spurious local optima. There have thus been several lines of recent work studying estimation of

29

mixed regression models with strong statistical guarantees under additional statistical assumptions.

30

For the special case of linear function with K=2 components, [4] propose a convex nuclear norm

31

minimization formulation that is guaranteed to estimate the two functions with minimax-optimal rates

32

when given a sub-Gaussian design matrix. With the additional conditions of zero noise and isotropic

33

Gaussian inputs, [1] propose an initialization for the EM algorithm to guarantee exact recovery of

34

the true parameters. However, in addition to the stringent statistical assumptions, these methods and

35

results are specialized to the case of two components, and seem non-trivial to generalize.

36

Submitted to 31st Conference on Neural Information Processing Systems (NIPS 2017). Do not distribute.

(2)

For problems with more than two components, most of the existing approaches [5, 6, 7, 8] rely

37

on the Tensor Methods. In particular, for a D-dimensional linear MR problem, [6] propose a

38

convex optimization formulation using a third-order tensor, which results in a computational cost

39

of O(N D¹²) and a sample complexity of O(D⁶/²), limiting its application to problems of small

40

dimension. The Tensor Decomposition approach proposed in [5] has a sample complexity of only

41

O(D³K⁴/²) and is computationally efficient. However, it requires the knowledge of the input

42

probability distribution in order to derive the score function used in their algorithm, which might not

43

be available, and estimating the density over the D-dimensional input variables could be an even

44

harder problem than MR itself. Other recent work [7, 8] show that in the noiseless setting with

45

isotropic Gaussian inputs, an Alternating Minimization algorithm initialized with the Tensor Method

46

leads to exact recovery of the true parameters. These latter methods have sample complexities linear

47

in D, but with O(K^K), O(K¹⁰) dependencies in K respectively. Finally, [9] observed that, under the

48

assumption of well-separated data, one can use a guaranteed clustering algorithm to find the mixture

49

assignment of each observation, and thus solves the MR problem as a by-product. However, the data

50

distribution considered in MR, such as those assumed in [5, 6, 7, 8], are usually not well-separated

51

(see our Figure 3 as an example).

52

In this work, we address a generalized version of Mixed Regression where the output can be an

53

additive combination of several mixture components. Our approach follows the general meta-approach

54

emerging in the recent years of addressing latent-variable model estimation from the perspective of

55

high-dimensional sparse estimation [10, 11, 12]. We propose a novel convex estimator MixLasso for

56

the mixed regression problem, which enforces the mixture structure through minimizing a carefully

57

constructed atomic norm that acts as a surrogate function for the number of mixture components.

58

We then propose a greedy algorithm that generates a steepest-descent component at each iteration

59

through solving a sub-problem similar to MAX-CUT. Our analysis of the algorithm gives a risk

60

bound that trades off prediction accuracy and model sparsity, with a sample complexity that is linear

61

in both D and K, and without imposing any stringent assumptions on, or assuming knowledge of, the

62

input/output distribution beyond that of boundedness, and even allowing for model mis-specification.

63

This makes our MixLasso algorithm a theoretically sound method for a wide range of practical

64

settings. Moreover, we also show how our proposed method can be easily extended to the nonlinear

65

regression setting, to regression functions lying in a Reproducing Kernel Hilbert Space (RKHS).

66

Our experiments with both generalized MR and standard MR show that the proposed method finds

67

high-quality solutions in a wider range of settings when compared to existing approaches.

68

2 Generalized Mixed Regression

69

In Generalized Mixed Regression, the response y ∈ R, given covariates x ∈ X , is specified as:

70

y =

K

X

k=1

z_kf_k(x) + ω (1)

where zk ∈ {0, 1}, k = 1, . . . , K is a latent binary vector indicating the presence or absence of each

71

component, and fk(xi) : R^D→ R is the regression function of k-th component. The standard mixed

72

regression is a special case of (1) with additional constraint kzk0= 1. Here ω ∈ R is a noise term

73

with both bias and variance. In other words, we consider the very general setting where we allow

74

for model mis-specification, and in general E[ω|x, z] 6= 0. This makes our problem setting in (1)

75

very practically plausible, especially when the regression functions {fk(x)}^K_k=1lie in some restricted

76

family such as linear functions.

77

Our goal is to find F := {fk(x)}^K_k=1minimizing the risk

78

r(F ) := E

"

min

z∈{0,1}^K

1 2(y −

K

X

k=1

zkfk(x))²

#

, (2)

while keeping the number of components K as small as possible. This yields a trade-off between

79

r(F ) and K. While one can always have a small risk with K → ∞, we would like to find the

80

smallest K that achieves such risk.

81

(3)

3 MixLasso: Convex Estimation via Atomic Norm

82

In the following, we will first focus on the linear case fk(x) := hwk, xi and consider extension

83

to nonlinear functions in Section 4.2. Given a collection of i.i.d. samples {(xi, yi)}^N_i=1, the `2-

84

regularized Empirical Risk Minimization (ERM) problem for our task (2) is

85

min

W ∈R^K×D,z_i∈{0,1}^K

1 2N

N

X

i=1

(y_i− z^T_iW x_i)²+τ

2kW k²_F. (3)

(3) is a hard optimization problem in general due to the simultaneous minimization w.r.t. parameters

86

W and binary hidden variables {zi}^N_i=1[1]. However, given hidden variables, the problem is convex

87

w.r.t. W , and thus, from the duality theory (3) is equivalent to

88

min

Z∈{0,1}^{N ×K}

max

α∈R^N

−1 N

N

X

i=1

L^∗(yi, −αi) − 1

2N²τtr(D(α)XX^TD(α)ZZ^T) (4) where Z := (zi)^N_i=1, D(α) is a diagonal matrix formed by vector α, and L^∗(y, α) = y^Tα +¹₂kαk²

89

is the convex conjugate of square loss L(y, ξ) = ¹₂(y − ξ)². The maximizer α^∗of (4) and minimizer

90

W^∗of (3) are related by W^∗= _{N τ}¹ PN

i=1α^∗_i(zix^T_i) = _{N τ}¹ Z^TD(α^∗)X.

91

A key observation for our formulation is that, although (4) is non-convex w.r.t. Z, it is a convex

92

function of M := ZZ^T (since it is a maximum over linear functions of M ). Therefore, the

93

intractability of (4) only lies in the combinatorial constraint M = ZZ^Tfor some Z ∈ {0, 1}^{N ×K}.

94

To relax such constraint, we introduce an atomic norm [13] of the form

95

kM k_S := min

c≥0

X

a∈S

ca s.t. M =X

a∈S

caa. (5)

where S := {zz^T|z ∈ {0, 1}^N}. Note if c_a takes integer values {0, 1}, M =P

a∈Sc_aa = ZZ^T

96

for some Z ∈ {0, 1}^{N ×K}and kM kS = K. When cais allowed to be any nonnegative number, (5)

97

serves as a convex approximation to the number of components K in a sense similar to `1-norm

98

as a convex approximation for the number of non-zero elements in Lasso [14]. Then the MixLasso

99

estimator is defined as

100

min

M ∈R^{N ×N}+

g(M ) + λkM kS (6)

where g(M ) := max

α∈R^N

− 1

2N²τtr(D(α)XX^TD(α)M ) − 1 N

N

X

i=1

L^∗(yi, −αi)

! . (7)

4 Algorithm

101

The convex formulation (6) is still a challenging optimization problem since it involves an atomic

102

norm defined over ¯K := 2^N atoms. An equivalent formulation expresses (6) as the minimizatioin of

103

F (c) := g

^K^¯ X

k=1

ckz^kz^{k T}

+ λkck1 (8)

w.r.t. c ∈ R^K+^¯, where {z^k}^K_k=1^¯ enumerates ∀z ∈ {0, 1}^N. We introduce a greedy algorithm

104

(Algorithm 1) for MixLasso, which maintains a sparse set of active components and adds one more

105

active component z^kz^{k T} at each iteration corresponding to the steepest descent direction

106

min

z∈{0,1}^N

h∇g(M ), zz^Ti = − 1

2N²τ max

z∈{0,1}^N

hD(α^∗)XX^TD(α^∗), zz^Ti, (9) where α^∗ is the maximizer in (7). As we show in Section 4.1, (9) is equivalent to a MAX-CUT

107

like problem that can be solved efficiently with a constant-ratio approximation guarantee. Then

108

we minimize (8) w.r.t. coefficients corresponding to the active components through a sequence of

109

proximal gradient updates:

110

c^s+1_k ←

c^s_k− 1

γ|A|(z^{k T}∇g(M^s)z^k+ λ)

+

, k ∈ A, s = 1 . . . S. (10)

(4)

Algorithm 1 A Greedy Algorithm for MixLasso (6) Initialize A = ∅, c = 0.

for t = 1...T do

1. Find a greedy component zz^Tby solving (9).

2. Add zz^Tto the active set A.

3. Minimize (8) w.r.t. coordinates cAin the active set A through updates (10).

4. Eliminate {z^kz^{k T}|ck= 0} from A.

end for.

where γ is the Lipschitz-continuous parameter of the coordinate-wise gradient z^{k T}∇g(M )z^k. The

111

evaluation of ∇g(M^s) involves finding the maximizer α^∗, which can be obtained by solving the

112

least-square problem:

113

W^∗:= argmin

W ∈R^|A|×D

1 2N

N

X

i=1

(yi− z^T_iW xi)²+τ

2tr(W^TD⁻¹(cA)W ) (11) and compute α^∗_i = (yi− z^T_iW^∗xi). Let E be the N × (|A|D) design matrix of the least-square

114

problem (11). By maintaining E, E^TE whenever the active set A changes, solving the least-square

115

problem (11) costs O(D³|A|³) amortizedly.

116

4.1 Greedy Generation of Components

117

Problem (9) for finding the steepest descent direction is a convex maximization problem with binary-

118

valued variables and is hard in general. However, we show that it is equivalent to a Boolean Quadratic

119

Maximizationproblem similar to MAX-CUT, where constant-ratio approximate algorithm exists

120

through a Semidefinite Relaxation [15]. Furthermore, the Semidefinie Relaxation of this type has

121

scalable solver that requires only complexity linear to the coefficient matrix [16, 17].

122

Let C = D(α^∗)XX^TD(α^∗). The greedy step (9) solves a problem of the form

123

max

z∈{0,1}^NhC, zz^Ti, (12)

which can be reduced to a problem of binary variables v ∈ {−1, 1}^N via a transformation v = 2z −1:

124 125

max

v∈{−1,1}^N

1

4 hC, vv^Ti + 2hC, 1v^Ti + hC, 11^Ti . (13) where 1 denotes N -dimensional vector of all 1s. By introducing a dummy variable v0, (13) is

126

equivalent to

127

max

(v₀;v)∈{−1,1}^{N +1}

1 4

v0

v

T

1^TC1 1^TC

C1 C

v0

v

. (14)

Note one can always find a solution of v0 = 1 by flipping signs of the solution since this does not

128

change the objective value. Let the matrix in (14) be ˆC. Problem of form (14) is a Boolean Quadratic

129

problem similar to MAX-CUT, for which there is Semidefinite relaxation of the form

130

max

V ∈S^N

h ˆC, V i

s.t. V 0, diag(V ) = 1

(15) and rounding from which guarantees a solution ˆv to (14) satisfying h − h(ˆv) ≤ ρ(h − h) with

131

ρ = 2/5 [15], where h(v) denotes the objective function of (14) and h, h denote the maximum and

132

minimum of the objective in (14) respectively. Note this result holds for any symmetric matrix ˆC.

133

Since our problem has a positive-semidefinite matrix ˆC, we have h = 0 and therefore the component

134

z^kfound this way satisfies

135

−z^{k T}∇g(M )z^k = h(ˆv) ≥ µh = µ max

z∈{0,1}^N−z^T∇g(M )z (16) with µ = 1 − ρ = 3/5. Semidefinite Programming of the form (15) allows specialized solver with

136

iteration cost linear to the matrix size nnz( ˆC) [16, 17]. And it is worth mentioning that, since our

137

matrix ˆC has low-rank structure (9), our implementation of the SDP solver [17] can further reduce

138

the complexity per iteration from nnz( ˆC) to nnz(X).

139

(5)

4.2 Nonlinear Extension

140

A simple way to consider a nonlinear version of the MixLasso estimator is to consider each component

141

fk(x) lying in a Reproducing Kernel Hilbert Space (RKHS) H with respect to some Mercer kernel

142

K(·, ·). In this setting, the inner minimizer {f_k^∗}^K_k=1of

143

min

zi∈{0,1}^K min

f_k∈H

1 2N

N

X

i=1

y_i−

K

X

k=1

z_ikf_k(x_i)

!² +τ

2

K

X

k=1

kf_kk²_H (17) satisfies the condition of the Representer Theorem that ensures an expression of the form

144

f_k^∗(x) =

N

X

i=1

αizikK(xi, x), k ∈ [K], (18) for the minimizer, and results in a MixLasso estimator (6) with

145

g(M ) := max

α∈R^N

− 1

2N²τtr(D(α)QD(α)M ) − 1 N

N

X

i=1

L^∗(yi, −αi) (19) where Q : N × N is the kernel matrix with Qij= K(xi, xj). Then Algorithm 1 can be applied with

146

the only difference on the evaluation of gradient ∇g(M ), which requires finding the maximizer α^∗

147

of (19) by solving the following linear system:

148

( 1

N τQ ◦ M + I)α = y. (20)

where ◦ denotes the elementwise product.

149

4.3 Rounding Procedure for Generalized & Standard MR

150

While the atomic-norm regularization λkM kS is a good convex relaxation of the number of compo-

151

nents, the number of non-zero components getting from estimator (6) cannot be precisely specified

152

apriori by the hyper-parameter λ directly. In practice, it is often useful to obtain a solution c with

153

exactly kck0= K non-zeros. This can be achieved by setting the K coefficients of largest magnitude

154

to 1 and all the other coefficients to 0. This results in a N × K matrix of hidden assignments ˆZ as the

155

output of Algorithm 1. Then, starting from ˆZ, we can perform a number of alternating minimization

156

steps between model parameters W (or {fk}^K_k=1in general) and hidden assignments {zi}^N_i=1until

157

convergence, as in a standard EM algorithm (with MAP hard assignment on zi).

158

While we have proposed a solution of the generalized version (1), in some applications, it might be of interest to solve the special case of standard mixed regression, where each observation belongs to exactly one mixture component. One approach to convert a generalized mixture solution with K components to a standard mixture of J components is to find the most frequent J patterns z₁, z₂, ..., z_J from the estimated hidden assignments {ˆz_i}^N_i=1, and then force each observation to choose their hidden assignments {zi}^N_i=1from the set {z_j}^J_j=1instead of arbitrary 0-1 patterns {0, 1}^K. This results in J functions {fj}^J_j=1of the form

fj(x) =

K

X

k=1

zjkfk(x), j ∈ [J ],

being actually used in the training observation, and thus gives a valid model {f_j}^J_j=1of standard

159

mixed regression with J components. Then as noted previously, one can further refine this rounded

160

solution through EM iterates of standard mixed regression, initialized with component functions

161

{fj}^J_j=1.

162

5 Analysis

163

5.1 Convergence Analysis

164

We assume y and x are bounded such that |y| ≤ Ry, kxk2≤ R_x. And without loss of generality, we

165

assume the data are scaled such that Ry= Rx= 1. Then the following theorem guarantees the rate

166

(6)

of convergence for Algorithm 1 up to a certain precision determined by the approximation ratio given

167

in (16).

168

Theorem 1. Let F (c) be the objective (8). The greedy algorithm (Algorithm 1) satisfies

169

F (c^T) − F (c^∗) ≤2γkc^∗k²₁ µ²

1 T

. (21)

for any iterateT satisfying F (c^T) − F (c^∗) ≥ ^2(1−µ)_µ λkc^∗k1, where c^∗ is any reference solution,

170

µ = 3/5 is the approximation ratio given by (16) and γ is the Lipschitz-continuous constant of the

171

coordinate-wise gradientz^{k T}∇g(M )z^k, ∀k ∈ [ ¯K].

172

Then the following lemma shows that, with the additional assumption that F (c) is strongly convex

173

over a restricted support set A^∗, one can get a bound in terms of the `0-norm of the reference solution.

174

Lemma 1. Let A^∗∈ [ ¯K] be a support set and c^∗:= arg minc:supp(c)=A^∗F (c^∗). Suppose F (c) is

175

strongly convex onA^∗with parameterβ. We have

176

kc^∗k₁≤ s

2kc^∗k₀(F (0) − F (c^∗))

β . (22)

Since F (0) − F (c^∗) ≤ _2N¹ PN

i=1y_i²≤ 1 , from (21) and (22), we have

177

F (c^T) − F (c^∗) ≤4γkc^∗k0

βµ²

1 T

+2(1 − µ)λ µ

s 2kc^∗k0

β . (23)

for any c^∗:= arg minc:supp(c)=A^∗F (c).

178

5.2 Generalization Analysis

179

In this section, we investigate the performance of output from Algorithm 1 in terms of the risk (2).

180

Given a coefficients c with support A, we can construct the weight matrix by ˆW (c) = D(√ c_A)W

181

with W = Z_A^TD(α^∗)X, where ZA= (z^k)k∈Aand α^∗is the maximizer in (7) as a function of c.

182

From the duality between (3) and (4), ˆW satisfies

183

F (c_A) = 1 2N

N

X

i=1

(yi− z^T_iW xˆ i)²+τ

2kW k²_F+ λkc_Ak1. (24) The following theorem gives a risk bound for the output weight matrix ˆW (c) obtained from Algorithm

184

1.

185

Theorem 2. Let A, ˆc, ˆW be the set of active components, coefficients and corresponding weight matrix obtained fromT iterations of Algorithm 1, and ¯W be the minimizer of the population risk (2) withK components and k ¯W kF ≤ R. We have r( ˆW ) ≤ r( ¯W ) + with probability 1 − ρ for

T ≥ 4γ µ²β(K

) and N = Ω(DK

³ log(RK

ρ )) withλ, τ chosen appropriately as functions of N .

186

Note the output of Algorithm 1 has number of components ˆK ≤ T . Therefore, Theorem 2 gives

187

a trade-off between the suboptimality of risk r( ˆW ) − r( ¯W ) ≤ and number of components

188

K = O(K/). Note the result of Theorem (2) is obtained without distributional assumption onˆ

189

the input/output (except boundedness), so it is in general not possible to guarantee convergence to

190

an optimal risk with exactly K components, since finding such optimal solution is NP-hard even

191

measured by the empirical risk [1]. It remains open if one can give a tighter result for the estimator (6)

192

that achieves -suboptimal risk with number of components being a constant multiple of K, or derive

193

a bound on the parameter estimation error, possibly with additional assumptions on the observations.

194

(7)

0 500 1000 1500 2000 2500 3000 3500 4000 N

0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2

Parameter-Error

Normal-D=100-K=3

ALT-random ALT-tensor EM-random EM-tensor MixLasso

0 0.5 1 1.5 2

N ×10⁴

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4

Parameter-Error

Normal-D=20-K=10 ALT-random ALT-tensor EM-random EM-tensor MixLasso

0 500 1000 1500

N 0

0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5

Parameter-Error

Generalized-Normal-D=20-K=3

EM-random MixLasso

0 500 1000 1500 2000 2500 3000 3500 4000

N 0

0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2

Parameter-Error

Uniform-D=100-K=3 ALT-random ALT-tensor EM-random EM-tensor MixLasso

0 0.5 1 1.5 2

N ×10⁴

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4

Parameter-Error

Uniform-D=20-K=10 ALT-random ALT-tensor EM-random EM-tensor MixLasso

0 500 1000 1500

N 0

0.1 0.2 0.3 0.4 0.5 0.6

Parameter-Error

Generalized-Uniform-D=20-K=3 EM-random MixLasso

Figure 1: Results for Noiseless Mixture of Linear Regression with N (0, I) input distribution (Top) and U (−1, 1) input distribution (Bottom), where (Left) D=100, K=3, (Middle) D=20, K=10, and (Right) Generalized Mixture of Regression with D=20, K=3.

0 500 1000 1500 2000 2500 3000 3500 4000

N 0

0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2

Parameter-Error

Normal-Noisy-D=100-K=3 ALT-random ALT-tensor EM-random EM-tensor MixLasso

0 0.5 1 1.5 2

N ×10⁴

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4

Parameter-Error

Normal-Noisy-D=20-K=10 ALT-random ALT-tensor EM-random EM-tensor MixLasso

0 500 1000 1500

N 0

0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5

Parameter-Error

Generalized-Normal-Noisy-D=20-K=3 EM-random MixLasso

0 500 1000 1500 2000 2500 3000 3500 4000

N 0

0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2

Parameter-Error

Uniform-Noisy-D=100-K=3 ALT-random ALT-tensor EM-random EM-tensor MixLasso

0 0.5 1 1.5 2

N ×10⁴

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4

Parameter-Error

Uniform-Noisy-D=20-K=10 ALT-random ALT-tensor EM-random EM-tensor MixLasso

0 500 1000 1500

N 0

0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5

Parameter-Error

Generalized-Uniform-Noisy-D=20-K=3 EM-random MixLasso

Figure 2: Results for Noisy (σ = 0.1) Mixture of Linear Regression with N (0, I) input distribution (Top) and U (−1, 1) input distribution (Bottom), where (Left) D=100, K=3, (Middle) D=20, K=10, and (Right) Generalized Mixture of Regression with D=20, K=3.

6 Experiments

195

In this section, we compare the proposed MixLasso method with other state-of-the-art approaches

196

listed as follows:

197

• EM-Random: A standard EM algorithm that alternates between minimizing {zi}^N_i=1and

198

{fk(x)}^K_k=1until convergence, with random initialized W ~N (0, I) in the linear case and

199

random initialized Z~M ultinoulli(1/K) in the nonlinear case. Each point in the figures is

200

the best result out of 100 random trials.

201

• EM-Tensor: The EM algorithm initialized with Tensor Method proposed in [8]. The

202

formula of Tensor Method is derived assuming xi~N (0, I). We adopt implementation

203

provided by the author of [7].

204

• ALT-Random: An Alternating Minimization algorithm proposed in [7] with the same

205

initialization strategy and number of trails as EM-Random.

206

(8)

-1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 -10

-5 0 5 10 15

-1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1

-10 -5 0 5 10 15

0 100 200 300 400 500 600 700 800 900

N 0

0.1 0.2 0.3 0.4 0.5 0.6 0.7

RMSE

PolyKernel-Deg=6-K=4 EM-random Lasso

-1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1

-10 -5 0 5 10 15 20

-1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1

-10 -5 0 5 10 15

0 100 200 300 400 500 600 700 800 900

N 0

0.1 0.2 0.3 0.4 0.5 0.6 0.7

RMSE

PolyKernel-Noisy-Deg=6-K=4 EM-random MixLasso

Figure 3: Results on Mixture of 6th-order Polynomial Regression of K=4 components with noise (Bottom) and without noise (Top). (Left) The best result of EM out of 100 random initialization.

(Middle) Solution from MixLasso followed by fine-tuning EM iterates. (Right) Comparison in terms of RMSE.

• ALT-Tensor: The Alternating Minimization algorithm initialized with Tensor Method

207

proposed in [7]. The formula of Tensor Method is derived assuming xi~N (0, I). We adopt

208

implementation provided by the author of [7].

209

• MixLasso: The proposed estimator with Algorithm 1. We round our solution to exact K

210

components according to the rounding procedure described in Section 4.3 for generalized

211

MRand standard MR respectively. The rounded solution is further refined by EM iterates.

212

For the linear case, we compare methods using the root mean square error on the learned parameters

213

W compared to the ground-truth parameters W^∗of size K × D: min_S:|S|=K ^kW^S,:^√^−W^∗^k^F

DK , where

214

S denotes a multiset that selects the best matched row in W for each row in W^∗. For the nonlinear

215

case, we compare methods using RMSE between the predicted value and the ground-truth function

216

value:

q1 N

PN i=1(PK

k=1z_ikf_k(x_i) −PK

k=1z_ik^∗f_k^∗(x_i))². Figure 1 and 2 give experimental results

217

of linear model in the noiseless and noisy case (N (0, σ) noise with σ = 0.1) respectively, where

218

the ground-truth hidden assignments are generated by M ultinoulli(1/K). In the case of Normal

219

input distribution (top row), both the Tensor-initialized methods and MixLasso consistently improve

220

upon random-initialized EM/ALT (even with 100 trials) in terms of the number of samples required

221

to achieve a good performance, where ALT performs better than EM in higher dimensional case

222

(D = 100, K = 3) and EM performs better for cases of more components (D = 20, K = 10);

223

meanwhile, MixLasso leads to significant improvements in both cases.

224

On the other hand, when input distribution becomes Uniform(-1,1), the tensor-initialized method

225

becomes even worse than the random-initialized ones, presumably due to the model mis-specification,

226

while MixLasso still consistently improve upon random initialized EM/ALT. Note we test Tensor

227

Method derived from Normal inputon data with Uniform input on purpose. The goal is to see the

228

effect of model misspecification on the Tensor approach, since in practice one would always have

229

model misspecification to some degree.

230

The rightmost columns of Figure 1, 2 show the results on data generated from generalized mixed

231

regressionmodel with {zik}^K_k=1following Bernoulli(0.5), where the MixLasso improves upon EM-

232

Random by a large margin. Finally, Figure 3 gives a comparison of EM-Random and MixLasso on

233

Mixture of Kernel Regression with polynomial kernel K(xi, x_j) = (ax^T_ix_j+ b)^d, where we generate

234

K=4 random 6th-degree polynomial functions {f_k^∗}^K_k=1by uniform sampling their coefficients from

235

U (−10, 10). In this setting, we found EM-Random has a hard time converging to the ground-truth

236

solution even when we increase the sample size N , while MixLasso obtains solution close to the

237

ground truth with a small number of samples.

238

(9)

References

239

[1] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Alternating minimization for mixed

240

linear regression. In ICML, pages 613–621, 2014.

241

[2] Lei Xu, Michael I Jordan, and Geoffrey E Hinton. An alternative model for mixtures of experts.

242

In Advances in neural information processing systems, pages 633–640, 1995.

243

[3] Christopher M Bishop and Markus Svenskn. Bayesian hierarchical mixtures of experts. In

244

Proceedings of the Nineteenth conference on Uncertainty in Artificial Intelligence, pages 57–64.

245

Morgan Kaufmann Publishers Inc., 2002.

246

[4] Yudong Chen, Xinyang Yi, and Constantine Caramanis. A convex formulation for mixed

247

regression with two components: Minimax optimal rates. In COLT, pages 560–604, 2014.

248

[5] Hanie Sedghi, Majid Janzamin, and Anima Anandkumar. Provable tensor methods for learning

249

mixtures of generalized linear models. In Artificial Intelligence and Statistics, pages 1223–1231,

250

2016.

251

[6] Arun T Chaganty and Percy Liang. Spectral experts for estimating mixtures of linear regressions.

252

In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages

253

1040–1048, 2013.

254

[7] Kai Zhong, Prateek Jain, and Inderjit S Dhillon. Mixed linear regression with multiple compo-

255

nents. In Advances in Neural Information Processing Systems, pages 2190–2198, 2016.

256

[8] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Solving a mixture of many ran-

257

dom linear equations by tensor decomposition and alternating minimization. arXiv preprint

258

arXiv:1608.05749, 2016.

259

[9] Paul Hand and Babhru Joshi. A convex program for mixed linear regression with a recovery

260

guarantee for well-separated data. arXiv preprint arXiv:1612.06067, 2016.

261

[10] Ian En-Hsu Yen, Xin Lin, Kai Zhong, Pradeep Ravikumar, and Inderjit Dhillon. A convex

262

exemplar-based approach to mad-bayes dirichlet process mixture models. In International

263

Conference on Machine Learning, pages 2418–2426, 2015.

264

[11] Ian En-Hsu Yen, Xin Lin, Jiong Zhang, Pradeep Ravikumar, and Inderjit Dhillon. A convex

265

atomic-norm approach to multiple sequence alignment and motif discovery. In International

266

Conference on Machine Learning, pages 2272–2280, 2016.

267

[12] Ian En-Hsu Yen, Wei-Chen Chang, Sung-En Chang, Arun Sai Suggula, Shou-De Lin, and

268

Pradeep. Ravikumar. Latent feature lasso. International Conference on Machine Learning,

269

2017.

270

[13] Venkat Chandrasekaran, Benjamin Recht, Pablo A Parrilo, and Alan S Willsky. The convex

271

geometry of linear inverse problems. Foundations of Computational mathematics, 12(6):805–

272

849, 2012.

273

[14] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal

274

Statistical Society. Series B (Methodological), pages 267–288, 1996.

275

[15] Yurii Nesterov et al. Quality of semidefinite relaxation for nonconvex quadratic optimization.

276

Université Catholique de Louvain. Center for Operations Research and Econometrics [CORE],

277

1997.

278

[16] Nicolas Boumal, Vlad Voroninski, and Afonso Bandeira. The non-convex burer-monteiro ap-

279

proach works on smooth semidefinite programs. In Advances in Neural Information Processing

280

Systems, pages 2757–2765, 2016.

281

[17] Po-Wei Wang and J Zico Kolter. The mixing method for maxcut-sdp problem. NIPS LHDS

282

Workshop., 2016.

283

(10)

7 Appendix

284

7.1 Proof for Theorem 1

285

Let g(M ) be a smooth function such that ∇g(M ) is Lipschitz-continuous with parameter ρ, that is, g(M⁰) − g(M ) − h∇g(M ), M⁰− M i ≤ρ

2kM⁰− M k²_F.

Then ∇jf (c) = z_j^T∇g(M )zj is Lipschitz-continuous with parameter γ, a number of order O(1)

286

when g(.) is an empirical risk normalized by N . Let A be the active set before adding a component ˆj.

287

Consider the descent amount produced by minimizing F (c) w.r.t. the cˆjgiven that 0 ∈ ∂jF (c) for

288

all j ∈ A due to the subproblem in the previous iteration. Let j = ˆj, for any ηjwe have

289

F (c + η_je_j) − F (c) ≤ ∇_jf (c)η_j+ λ|η_j| +γ

2η²_j ≤ µ∇j^∗f (c)η_j+ λ|η_j| + γ 2η_j²

≤ µ∇_j^∗f (c)η_j+ λ|η_j| +γ 2η²_j Minimize w.r.t ηjgives

290

minηj

F (c + ηjej) − F (c) ≤ min

ηj

µ∇j^∗f (c)ηj+ λ|ηj| +γ 2η²_j

≤ min

η_j µ∇_j^∗f (c)η_j+ λ|η_j| + γ

2η_j²= min

ηk:k /∈A

X

k /∈A

µ∇_kf (c)η_k+ λ|η_k|

+γ

2 X

k /∈A

|η_k|

!2

≤ min

ηk:k /∈A µX

k /∈A

∇_kf (c)η_k+ λ|η_k|

+γ

2 X

k /∈A

|η_k|

!2

+ (1 − µ)λX

k /∈A

|η_k|

where the last equality is justified by Lemma 2 provided later. For k ∈ A, we have

291

0 = min

η_k:k∈A µX

k∈A

(∇kf (c)ηk+ λ|ck+ ηk| − λ|ck|)

Combining cases for k /∈ A and k ∈ A, we can obtain a global estimate of descent amount compared

292

to some optimal solution x^∗as follows

293

min

ηˆj

F (c + η_ˆ_je_ˆ_j) − F (c)

≤ min

η µ

h∇f (c), ηi + λkc + ηk1− λkck1

+γ

2 X

k /∈A

|ηk|

!2

+ (1 − µ)λX

k /∈A

|ηk|

≤ min

η µ

F (c + η) − F (c)

+γ

2 X

k /∈A

|ηk|

!2

+ (1 − µ)λX

k /∈A

|ηk|

≤ min

α∈[0,1]

µ

F (c + α(c^∗− c)) − F (c)

+αγ

2 kc^∗k²₁+ α(1 − µ)λkc^∗k1

≤ min

α∈[0,1]

−αµ

F (c) − F (c^∗)

+α²γ

2 kc^∗k²₁+ α(1 − µ)λkc^∗k1. It means we can always choose an α small enough to guarantee descent if

294

F (c) − F (c^∗) >(1 − µ)

µ λkc^∗k1. (25)

Then for

295

F (c) − F (c^∗) ≥2(1 − µ)

µ λkc^∗k1, (26)

we have

296

minη_ˆ_j F (c + η_ˆ_je_ˆ_j) − F (c) ≤ min

α∈[0,1] −αµ 2

F (c) − F (c^∗)

+α²γ

2 kc^∗k²₁.

(11)

Minimizing w.r.t. to α gives the convergence guarantee F (c^t) − F (c^∗) ≤ 2γkc^∗k²₁

µ² 1 t. for any iterate with F (c^t) − F (c^∗) ≥^2(1−µ)_µ λkc^∗k1.

297

Lemma 2.

minη_j µ∇_j^∗f (c)η_j+ λ|η_j| +γ

2η_j² (27)

= min

η_k:k /∈A

X

k /∈A

µ∇kf (c)ηk+ λ|ηk|

+γ

2 X

k /∈A

|ηk|

!²

(28)

Proof. The minimization (33) is equivalent to

298

min

ηk:k /∈A

X

k /∈A

µ∇_kf (c)η_k

s.t. X

k /∈A

|η_k|

!²

≤ C₁, X

k /∈A

|η_k| ≤ C₂.

and therefore is equivalent to

299

min

η_k:k /∈A µX

k /∈A

∇kf (c)ηk

s.t. X

k /∈A

|ηk| ≤ min{p C₁, C₂}

which is a linear objective subject to a convex set and thus always has solution that lies on the corner

300

point with only one non-zero coordinate ηj^∗, which then gives the same minimum as (32).

301

7.2 Proof for Lemma 1

302

Since supp(c^∗) = A^∗, and c^∗is optimal when restricted on the support, we have hη, c^∗i = 0 for

303

some η ∈ ∂F (c^∗). And since F (c) is strongly convex on the support A^∗with parameter β, we have

304

F (0) − F (c^∗) = F (0) − F (c^∗) − hη, 0 − c^∗i

≥ β

2kc^∗− 0k²₂, which gives us

kc^∗k²₂≤ 2(F (0) − F (c^∗))

β .

Combining above with the fact for any c, kck²₁≤ kck0kck²₂, we obtain the result.

305

7.3 Proof for Theorem 2

306

Lemma 3. Let r(W ) and rN(W ) be the risk (2) and the empirical risk respectively, we have

sup

W ∈R^K×D:kW k_F≤R

|r(W ) − rN(W )| ≤ s

2DK log(4RKN )

N + 1

N log(1 ρ) with probability1 − ρ.

307

Proof. Since min_z∈{0,1}N 1

2(y − z^TW x)²≤ |y|²≤ 1 for a given W , by Hoeffding inequality, P (|rN(W ) − r(W )| ≥ ) ≤ exp(−2N ²).

(12)

Let N (δ) be a δ-covering of the set W := {W ∈ R^K×D| kW kF ≤ R} with |N (δ)| ≤ ^4R_δ ^DK .

308

Then for any W ∈ W, we have ˜W ∈ N (δ) with kW − ˜W k ≤ δ. Applying a union bound, we have

309

P

sup

W ∈N (δ)˜

|rN( ˜W ) − r( ˜W )| ≥

≤ 4R δ

^DK

exp(−2N ²). (29) Then for ∆W := W − ˜W satisfying k∆W k ≤ δ, we can bound the difference of square loss of W

310

and ˜W by

311

min

z∈{0,1}^K

1

2(y − z^TW x)²− min

z∈{0,1}^K

1

2(y − z^TW x)˜ ²≤ 1

2(y − ˜z^TW x)²−1

2(y − ˜z^TW x)˜ ²

≤ k∆W kFk˜zk + 2Rk˜zk²k∆W kF ≤ 3RK

(30) where ˜z = arg min_z∈{0,1}K 1

2(y − z^TW x)² and we used the fact that kxk ≤ 1 and |y| ≤ 1.

312

By symmetry, we have

min_z∈{0,1}K 1

2(y − z^TW x)˜ ²− min_z∈{0,1}K 1

2(y − z^TW x)²

≤ 3RK.

313

Combining (29) with (30), we have

314

sup

W ∈W

|r_N(W ) − r_N(W )| ≤ 6RKδ + s

DK

2N log(4R δ ) + 1

2N log(1

ρ). (31)

with probability 1 − ρ. Setting δ = 1/(6RK√

N ) and apply Jennen’s inequality gives the result.

315

Then the following gives the proof for Theorem 2.

316

Proof. Let ¯zi = arg min_z_i_∈{0,1}K (yi− z^T_iW x¯ i)²for i ∈ [N ]. Denote ¯Z as the N × K matrix

317

stacked from (¯z^T_i)^N_i=1. Let {¯z^k}^K_k=1be the columns of ¯Z and ¯A be the indexes of atoms in the atomic

318

set (5) that have the same 0-1 patterns to those columns. Denote ¯c as the coefficient vector with

319

¯

ck = 1 for k ∈ ¯A and ¯ck = 0 for k /∈ ¯A. By the definition of F (c), we have

320

F (¯c) ≤ r_N( ¯W ) +τ

2k ¯W k²_F+ λk¯ck₁. (32) where rN( ¯W ) := _2N¹ PN

i=1min_z∈{0,1}K(y_i− z^TW x¯ _i)²is the empirical risk of ¯W . Let c^∗ :=

321

arg minc:supp(c)= ¯A F (c). We have F (c^∗) ≤ F (¯c). Then from (23),

322

F (ˆc) − F (¯c) ≤ F (ˆc) − F (c^∗) ≤4γK βµ²

1 T

+2(1 − µ)λ µ

s 2K

β . (33)

In addition, the risk of ˆW satisfies

323

r_N( ˆW ) +τ

2k ˆW k²_F+ λkˆck₁≤ F (ˆc) (34) by the definition of the empirical risk rN(.) (since it is minimized w.r.t. the hidden assignments).

324

Combining (32), (33) and (34), we obtain a bound on the difference of empirical risk

325

rN( ˆW ) − rN( ¯W ) ≤ τ

2k ¯W k²+ λK

| {z }

bias of regularization

+4γK βµ²

1 T

+2(1 − µ)λ µ

s 2K

β

| {z }

optimization error

(35)

The remaining task is to bound the estimation error |r(W ) − r_N(W )|. Since Algorithm 1 is a descent

326

algorithm w.r.t. F (c) and in the beginning F (0) ≤ 1/2, we have kck1 ≤ 1/λ and kW k² ≤ 1/τ

327

at any iterate. Then we can bound the estimation error by Lemma 3 for ˆW belonging to the set

328

W(T ) := { ˆW ∈ R^{T ×D}| k ˆW kF ≤p1/λτ }, giving

329

|r( ˆW ) − rN( ˆW )| ≤ s

2DT log(4T N/√ λτ )

N + 1

N log(1

ρ). (36)

Combining (35) and (36), and choosing λ = 1/(N K), τ = 1/(N R²), we obtain r( ˆW ) − r( ¯W ) ≤

330

with T ≥ _µ^4γ2β(^K), and N ≥ ^DT2 (2 log(^4RKT ) + log(¹_ρ)) for any 0 < < 1.

331