MixLasso: Generalized Mixed Regression via Convex Atomic-Norm Regularization
Anonymous Author(s) Affiliation
Address email
Abstract
We consider a generalization of mixed regression where the response is an additive
1
combination of several mixture components, so that standard mixed regression is
2
the special case where each response is generated from exactly one component.
3
Typical approaches to the mixture regression problem employ local search methods
4
such as Expectation Maximization (EM) that are prone to spurious local optima. On
5
the other hand, a number of recent theoretically-motivated Tensor-based methods
6
either have high sample complexity, or require the knowledge of probability density
7
function of the input distribution, which might not be available in practice. In
8
this work, we study a novel convex estimator MixLasso for generalized mixed
9
regression based on an atomic norm that is specifically constructed in order to
10
regularize the number of mixture components. Our analysis of the algorithm gives a
11
risk bound that trades off prediction accuracy and model sparsity without imposing
12
stringent assumptions on the input/output distribution beyond that of boundedness.
13
Our approach is also applicable to both linear and non-linear mixture regression
14
settings. In our numerical experiments with mixtures of linear as well as nonlinear
15
regression functions, the proposed method yields high-quality solutions in a wider
16
range of settings than existing approaches.
17
1 Introduction
18
The Mixed Regression (MR) problem considers the estimation of K regression functions, from a
19
collection of input-output samples, where in each sample the output conditioned on the input is
20
specified by one of the K regression functions. When fitting linear functions in a noiseless setting,
21
this is equivalent to solving K linear systems without knowing which system each equation belongs
22
to. The MR formulation can be also employed as an approach to decompose a complicated (possibly
23
multi-valued) function into K simpler ones, by splitting the observations into K classes. Variants of
24
regression families such as piecewise-linear regression can be viewed as special cases of MR.
25
The MR problem is NP-hard in general [1] due to the simultaneous fitting of the discrete class labels
26
as well as the regression functions. Standard approaches to the mixture problem employ local search
27
methods such as Expectation Maximization (EM) [2] and Variational Bayes [3] that are prone to
28
spurious local optima. There have thus been several lines of recent work studying estimation of
29
mixed regression models with strong statistical guarantees under additional statistical assumptions.
30
For the special case of linear function with K=2 components, [4] propose a convex nuclear norm
31
minimization formulation that is guaranteed to estimate the two functions with minimax-optimal rates
32
when given a sub-Gaussian design matrix. With the additional conditions of zero noise and isotropic
33
Gaussian inputs, [1] propose an initialization for the EM algorithm to guarantee exact recovery of
34
the true parameters. However, in addition to the stringent statistical assumptions, these methods and
35
results are specialized to the case of two components, and seem non-trivial to generalize.
36
Submitted to 31st Conference on Neural Information Processing Systems (NIPS 2017). Do not distribute.
For problems with more than two components, most of the existing approaches [5, 6, 7, 8] rely
37
on the Tensor Methods. In particular, for a D-dimensional linear MR problem, [6] propose a
38
convex optimization formulation using a third-order tensor, which results in a computational cost
39
of O(N D12) and a sample complexity of O(D6/2), limiting its application to problems of small
40
dimension. The Tensor Decomposition approach proposed in [5] has a sample complexity of only
41
O(D3K4/2) and is computationally efficient. However, it requires the knowledge of the input
42
probability distribution in order to derive the score function used in their algorithm, which might not
43
be available, and estimating the density over the D-dimensional input variables could be an even
44
harder problem than MR itself. Other recent work [7, 8] show that in the noiseless setting with
45
isotropic Gaussian inputs, an Alternating Minimization algorithm initialized with the Tensor Method
46
leads to exact recovery of the true parameters. These latter methods have sample complexities linear
47
in D, but with O(KK), O(K10) dependencies in K respectively. Finally, [9] observed that, under the
48
assumption of well-separated data, one can use a guaranteed clustering algorithm to find the mixture
49
assignment of each observation, and thus solves the MR problem as a by-product. However, the data
50
distribution considered in MR, such as those assumed in [5, 6, 7, 8], are usually not well-separated
51
(see our Figure 3 as an example).
52
In this work, we address a generalized version of Mixed Regression where the output can be an
53
additive combination of several mixture components. Our approach follows the general meta-approach
54
emerging in the recent years of addressing latent-variable model estimation from the perspective of
55
high-dimensional sparse estimation [10, 11, 12]. We propose a novel convex estimator MixLasso for
56
the mixed regression problem, which enforces the mixture structure through minimizing a carefully
57
constructed atomic norm that acts as a surrogate function for the number of mixture components.
58
We then propose a greedy algorithm that generates a steepest-descent component at each iteration
59
through solving a sub-problem similar to MAX-CUT. Our analysis of the algorithm gives a risk
60
bound that trades off prediction accuracy and model sparsity, with a sample complexity that is linear
61
in both D and K, and without imposing any stringent assumptions on, or assuming knowledge of, the
62
input/output distribution beyond that of boundedness, and even allowing for model mis-specification.
63
This makes our MixLasso algorithm a theoretically sound method for a wide range of practical
64
settings. Moreover, we also show how our proposed method can be easily extended to the nonlinear
65
regression setting, to regression functions lying in a Reproducing Kernel Hilbert Space (RKHS).
66
Our experiments with both generalized MR and standard MR show that the proposed method finds
67
high-quality solutions in a wider range of settings when compared to existing approaches.
68
2 Generalized Mixed Regression
69
In Generalized Mixed Regression, the response y ∈ R, given covariates x ∈ X , is specified as:
70
y =
K
X
k=1
zkfk(x) + ω (1)
where zk ∈ {0, 1}, k = 1, . . . , K is a latent binary vector indicating the presence or absence of each
71
component, and fk(xi) : RD→ R is the regression function of k-th component. The standard mixed
72
regression is a special case of (1) with additional constraint kzk0= 1. Here ω ∈ R is a noise term
73
with both bias and variance. In other words, we consider the very general setting where we allow
74
for model mis-specification, and in general E[ω|x, z] 6= 0. This makes our problem setting in (1)
75
very practically plausible, especially when the regression functions {fk(x)}Kk=1lie in some restricted
76
family such as linear functions.
77
Our goal is to find F := {fk(x)}Kk=1minimizing the risk
78
r(F ) := E
"
min
z∈{0,1}K
1 2(y −
K
X
k=1
zkfk(x))2
#
, (2)
while keeping the number of components K as small as possible. This yields a trade-off between
79
r(F ) and K. While one can always have a small risk with K → ∞, we would like to find the
80
smallest K that achieves such risk.
81
3 MixLasso: Convex Estimation via Atomic Norm
82
In the following, we will first focus on the linear case fk(x) := hwk, xi and consider extension
83
to nonlinear functions in Section 4.2. Given a collection of i.i.d. samples {(xi, yi)}Ni=1, the `2-
84
regularized Empirical Risk Minimization (ERM) problem for our task (2) is
85
min
W ∈RK×D,zi∈{0,1}K
1 2N
N
X
i=1
(yi− zTiW xi)2+τ
2kW k2F. (3)
(3) is a hard optimization problem in general due to the simultaneous minimization w.r.t. parameters
86
W and binary hidden variables {zi}Ni=1[1]. However, given hidden variables, the problem is convex
87
w.r.t. W , and thus, from the duality theory (3) is equivalent to
88
min
Z∈{0,1}N ×K
max
α∈RN
−1 N
N
X
i=1
L∗(yi, −αi) − 1
2N2τtr(D(α)XXTD(α)ZZT) (4) where Z := (zi)Ni=1, D(α) is a diagonal matrix formed by vector α, and L∗(y, α) = yTα +12kαk2
89
is the convex conjugate of square loss L(y, ξ) = 12(y − ξ)2. The maximizer α∗of (4) and minimizer
90
W∗of (3) are related by W∗= N τ1 PN
i=1α∗i(zixTi) = N τ1 ZTD(α∗)X.
91
A key observation for our formulation is that, although (4) is non-convex w.r.t. Z, it is a convex
92
function of M := ZZT (since it is a maximum over linear functions of M ). Therefore, the
93
intractability of (4) only lies in the combinatorial constraint M = ZZTfor some Z ∈ {0, 1}N ×K.
94
To relax such constraint, we introduce an atomic norm [13] of the form
95
kM kS := min
c≥0
X
a∈S
ca s.t. M =X
a∈S
caa. (5)
where S := {zzT|z ∈ {0, 1}N}. Note if ca takes integer values {0, 1}, M =P
a∈Scaa = ZZT
96
for some Z ∈ {0, 1}N ×Kand kM kS = K. When cais allowed to be any nonnegative number, (5)
97
serves as a convex approximation to the number of components K in a sense similar to `1-norm
98
as a convex approximation for the number of non-zero elements in Lasso [14]. Then the MixLasso
99
estimator is defined as
100
min
M ∈RN ×N+
g(M ) + λkM kS (6)
where g(M ) := max
α∈RN
− 1
2N2τtr(D(α)XXTD(α)M ) − 1 N
N
X
i=1
L∗(yi, −αi)
! . (7)
4 Algorithm
101
The convex formulation (6) is still a challenging optimization problem since it involves an atomic
102
norm defined over ¯K := 2N atoms. An equivalent formulation expresses (6) as the minimizatioin of
103
F (c) := g
K¯ X
k=1
ckzkzk T
+ λkck1 (8)
w.r.t. c ∈ RK+¯, where {zk}Kk=1¯ enumerates ∀z ∈ {0, 1}N. We introduce a greedy algorithm
104
(Algorithm 1) for MixLasso, which maintains a sparse set of active components and adds one more
105
active component zkzk T at each iteration corresponding to the steepest descent direction
106
min
z∈{0,1}N
h∇g(M ), zzTi = − 1
2N2τ max
z∈{0,1}N
hD(α∗)XXTD(α∗), zzTi, (9) where α∗ is the maximizer in (7). As we show in Section 4.1, (9) is equivalent to a MAX-CUT
107
like problem that can be solved efficiently with a constant-ratio approximation guarantee. Then
108
we minimize (8) w.r.t. coefficients corresponding to the active components through a sequence of
109
proximal gradient updates:
110
cs+1k ←
csk− 1
γ|A|(zk T∇g(Ms)zk+ λ)
+
, k ∈ A, s = 1 . . . S. (10)
Algorithm 1 A Greedy Algorithm for MixLasso (6) Initialize A = ∅, c = 0.
for t = 1...T do
1. Find a greedy component zzTby solving (9).
2. Add zzTto the active set A.
3. Minimize (8) w.r.t. coordinates cAin the active set A through updates (10).
4. Eliminate {zkzk T|ck= 0} from A.
end for.
where γ is the Lipschitz-continuous parameter of the coordinate-wise gradient zk T∇g(M )zk. The
111
evaluation of ∇g(Ms) involves finding the maximizer α∗, which can be obtained by solving the
112
least-square problem:
113
W∗:= argmin
W ∈R|A|×D
1 2N
N
X
i=1
(yi− zTiW xi)2+τ
2tr(WTD−1(cA)W ) (11) and compute α∗i = (yi− zTiW∗xi). Let E be the N × (|A|D) design matrix of the least-square
114
problem (11). By maintaining E, ETE whenever the active set A changes, solving the least-square
115
problem (11) costs O(D3|A|3) amortizedly.
116
4.1 Greedy Generation of Components
117
Problem (9) for finding the steepest descent direction is a convex maximization problem with binary-
118
valued variables and is hard in general. However, we show that it is equivalent to a Boolean Quadratic
119
Maximizationproblem similar to MAX-CUT, where constant-ratio approximate algorithm exists
120
through a Semidefinite Relaxation [15]. Furthermore, the Semidefinie Relaxation of this type has
121
scalable solver that requires only complexity linear to the coefficient matrix [16, 17].
122
Let C = D(α∗)XXTD(α∗). The greedy step (9) solves a problem of the form
123
max
z∈{0,1}NhC, zzTi, (12)
which can be reduced to a problem of binary variables v ∈ {−1, 1}N via a transformation v = 2z −1:
124 125
max
v∈{−1,1}N
1
4 hC, vvTi + 2hC, 1vTi + hC, 11Ti . (13) where 1 denotes N -dimensional vector of all 1s. By introducing a dummy variable v0, (13) is
126
equivalent to
127
max
(v0;v)∈{−1,1}N +1
1 4
v0
v
T
1TC1 1TC
C1 C
v0
v
. (14)
Note one can always find a solution of v0 = 1 by flipping signs of the solution since this does not
128
change the objective value. Let the matrix in (14) be ˆC. Problem of form (14) is a Boolean Quadratic
129
problem similar to MAX-CUT, for which there is Semidefinite relaxation of the form
130
max
V ∈SN
h ˆC, V i
s.t. V 0, diag(V ) = 1
(15) and rounding from which guarantees a solution ˆv to (14) satisfying h − h(ˆv) ≤ ρ(h − h) with
131
ρ = 2/5 [15], where h(v) denotes the objective function of (14) and h, h denote the maximum and
132
minimum of the objective in (14) respectively. Note this result holds for any symmetric matrix ˆC.
133
Since our problem has a positive-semidefinite matrix ˆC, we have h = 0 and therefore the component
134
zkfound this way satisfies
135
−zk T∇g(M )zk = h(ˆv) ≥ µh = µ max
z∈{0,1}N−zT∇g(M )z (16) with µ = 1 − ρ = 3/5. Semidefinite Programming of the form (15) allows specialized solver with
136
iteration cost linear to the matrix size nnz( ˆC) [16, 17]. And it is worth mentioning that, since our
137
matrix ˆC has low-rank structure (9), our implementation of the SDP solver [17] can further reduce
138
the complexity per iteration from nnz( ˆC) to nnz(X).
139
4.2 Nonlinear Extension
140
A simple way to consider a nonlinear version of the MixLasso estimator is to consider each component
141
fk(x) lying in a Reproducing Kernel Hilbert Space (RKHS) H with respect to some Mercer kernel
142
K(·, ·). In this setting, the inner minimizer {fk∗}Kk=1of
143
min
zi∈{0,1}K min
fk∈H
1 2N
N
X
i=1
yi−
K
X
k=1
zikfk(xi)
!2 +τ
2
K
X
k=1
kfkk2H (17) satisfies the condition of the Representer Theorem that ensures an expression of the form
144
fk∗(x) =
N
X
i=1
αizikK(xi, x), k ∈ [K], (18) for the minimizer, and results in a MixLasso estimator (6) with
145
g(M ) := max
α∈RN
− 1
2N2τtr(D(α)QD(α)M ) − 1 N
N
X
i=1
L∗(yi, −αi) (19) where Q : N × N is the kernel matrix with Qij= K(xi, xj). Then Algorithm 1 can be applied with
146
the only difference on the evaluation of gradient ∇g(M ), which requires finding the maximizer α∗
147
of (19) by solving the following linear system:
148
( 1
N τQ ◦ M + I)α = y. (20)
where ◦ denotes the elementwise product.
149
4.3 Rounding Procedure for Generalized & Standard MR
150
While the atomic-norm regularization λkM kS is a good convex relaxation of the number of compo-
151
nents, the number of non-zero components getting from estimator (6) cannot be precisely specified
152
apriori by the hyper-parameter λ directly. In practice, it is often useful to obtain a solution c with
153
exactly kck0= K non-zeros. This can be achieved by setting the K coefficients of largest magnitude
154
to 1 and all the other coefficients to 0. This results in a N × K matrix of hidden assignments ˆZ as the
155
output of Algorithm 1. Then, starting from ˆZ, we can perform a number of alternating minimization
156
steps between model parameters W (or {fk}Kk=1in general) and hidden assignments {zi}Ni=1until
157
convergence, as in a standard EM algorithm (with MAP hard assignment on zi).
158
While we have proposed a solution of the generalized version (1), in some applications, it might be of interest to solve the special case of standard mixed regression, where each observation belongs to exactly one mixture component. One approach to convert a generalized mixture solution with K components to a standard mixture of J components is to find the most frequent J patterns z1, z2, ..., zJ from the estimated hidden assignments {ˆzi}Ni=1, and then force each observation to choose their hidden assignments {zi}Ni=1from the set {zj}Jj=1instead of arbitrary 0-1 patterns {0, 1}K. This results in J functions {fj}Jj=1of the form
fj(x) =
K
X
k=1
zjkfk(x), j ∈ [J ],
being actually used in the training observation, and thus gives a valid model {fj}Jj=1of standard
159
mixed regression with J components. Then as noted previously, one can further refine this rounded
160
solution through EM iterates of standard mixed regression, initialized with component functions
161
{fj}Jj=1.
162
5 Analysis
163
5.1 Convergence Analysis
164
We assume y and x are bounded such that |y| ≤ Ry, kxk2≤ Rx. And without loss of generality, we
165
assume the data are scaled such that Ry= Rx= 1. Then the following theorem guarantees the rate
166
of convergence for Algorithm 1 up to a certain precision determined by the approximation ratio given
167
in (16).
168
Theorem 1. Let F (c) be the objective (8). The greedy algorithm (Algorithm 1) satisfies
169
F (cT) − F (c∗) ≤2γkc∗k21 µ2
1 T
. (21)
for any iterateT satisfying F (cT) − F (c∗) ≥ 2(1−µ)µ λkc∗k1, where c∗ is any reference solution,
170
µ = 3/5 is the approximation ratio given by (16) and γ is the Lipschitz-continuous constant of the
171
coordinate-wise gradientzk T∇g(M )zk, ∀k ∈ [ ¯K].
172
Then the following lemma shows that, with the additional assumption that F (c) is strongly convex
173
over a restricted support set A∗, one can get a bound in terms of the `0-norm of the reference solution.
174
Lemma 1. Let A∗∈ [ ¯K] be a support set and c∗:= arg minc:supp(c)=A∗F (c∗). Suppose F (c) is
175
strongly convex onA∗with parameterβ. We have
176
kc∗k1≤ s
2kc∗k0(F (0) − F (c∗))
β . (22)
Since F (0) − F (c∗) ≤ 2N1 PN
i=1yi2≤ 1 , from (21) and (22), we have
177
F (cT) − F (c∗) ≤4γkc∗k0
βµ2
1 T
+2(1 − µ)λ µ
s 2kc∗k0
β . (23)
for any c∗:= arg minc:supp(c)=A∗F (c).
178
5.2 Generalization Analysis
179
In this section, we investigate the performance of output from Algorithm 1 in terms of the risk (2).
180
Given a coefficients c with support A, we can construct the weight matrix by ˆW (c) = D(√ cA)W
181
with W = ZATD(α∗)X, where ZA= (zk)k∈Aand α∗is the maximizer in (7) as a function of c.
182
From the duality between (3) and (4), ˆW satisfies
183
F (cA) = 1 2N
N
X
i=1
(yi− zTiW xˆ i)2+τ
2kW k2F+ λkcAk1. (24) The following theorem gives a risk bound for the output weight matrix ˆW (c) obtained from Algorithm
184
1.
185
Theorem 2. Let A, ˆc, ˆW be the set of active components, coefficients and corresponding weight matrix obtained fromT iterations of Algorithm 1, and ¯W be the minimizer of the population risk (2) withK components and k ¯W kF ≤ R. We have r( ˆW ) ≤ r( ¯W ) + with probability 1 − ρ for
T ≥ 4γ µ2β(K
) and N = Ω(DK
3 log(RK
ρ )) withλ, τ chosen appropriately as functions of N .
186
Note the output of Algorithm 1 has number of components ˆK ≤ T . Therefore, Theorem 2 gives
187
a trade-off between the suboptimality of risk r( ˆW ) − r( ¯W ) ≤ and number of components
188
K = O(K/). Note the result of Theorem (2) is obtained without distributional assumption onˆ
189
the input/output (except boundedness), so it is in general not possible to guarantee convergence to
190
an optimal risk with exactly K components, since finding such optimal solution is NP-hard even
191
measured by the empirical risk [1]. It remains open if one can give a tighter result for the estimator (6)
192
that achieves -suboptimal risk with number of components being a constant multiple of K, or derive
193
a bound on the parameter estimation error, possibly with additional assumptions on the observations.
194
0 500 1000 1500 2000 2500 3000 3500 4000 N
0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2
Parameter-Error
Normal-D=100-K=3
ALT-random ALT-tensor EM-random EM-tensor MixLasso
0 0.5 1 1.5 2
N ×104
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4
Parameter-Error
Normal-D=20-K=10 ALT-random ALT-tensor EM-random EM-tensor MixLasso
0 500 1000 1500
N 0
0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5
Parameter-Error
Generalized-Normal-D=20-K=3
EM-random MixLasso
0 500 1000 1500 2000 2500 3000 3500 4000
N 0
0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2
Parameter-Error
Uniform-D=100-K=3 ALT-random ALT-tensor EM-random EM-tensor MixLasso
0 0.5 1 1.5 2
N ×104
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4
Parameter-Error
Uniform-D=20-K=10 ALT-random ALT-tensor EM-random EM-tensor MixLasso
0 500 1000 1500
N 0
0.1 0.2 0.3 0.4 0.5 0.6
Parameter-Error
Generalized-Uniform-D=20-K=3 EM-random MixLasso
Figure 1: Results for Noiseless Mixture of Linear Regression with N (0, I) input distribution (Top) and U (−1, 1) input distribution (Bottom), where (Left) D=100, K=3, (Middle) D=20, K=10, and (Right) Generalized Mixture of Regression with D=20, K=3.
0 500 1000 1500 2000 2500 3000 3500 4000
N 0
0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2
Parameter-Error
Normal-Noisy-D=100-K=3 ALT-random ALT-tensor EM-random EM-tensor MixLasso
0 0.5 1 1.5 2
N ×104
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4
Parameter-Error
Normal-Noisy-D=20-K=10 ALT-random ALT-tensor EM-random EM-tensor MixLasso
0 500 1000 1500
N 0
0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5
Parameter-Error
Generalized-Normal-Noisy-D=20-K=3 EM-random MixLasso
0 500 1000 1500 2000 2500 3000 3500 4000
N 0
0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2
Parameter-Error
Uniform-Noisy-D=100-K=3 ALT-random ALT-tensor EM-random EM-tensor MixLasso
0 0.5 1 1.5 2
N ×104
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4
Parameter-Error
Uniform-Noisy-D=20-K=10 ALT-random ALT-tensor EM-random EM-tensor MixLasso
0 500 1000 1500
N 0
0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5
Parameter-Error
Generalized-Uniform-Noisy-D=20-K=3 EM-random MixLasso
Figure 2: Results for Noisy (σ = 0.1) Mixture of Linear Regression with N (0, I) input distribution (Top) and U (−1, 1) input distribution (Bottom), where (Left) D=100, K=3, (Middle) D=20, K=10, and (Right) Generalized Mixture of Regression with D=20, K=3.
6 Experiments
195
In this section, we compare the proposed MixLasso method with other state-of-the-art approaches
196
listed as follows:
197
• EM-Random: A standard EM algorithm that alternates between minimizing {zi}Ni=1and
198
{fk(x)}Kk=1until convergence, with random initialized W ~N (0, I) in the linear case and
199
random initialized Z~M ultinoulli(1/K) in the nonlinear case. Each point in the figures is
200
the best result out of 100 random trials.
201
• EM-Tensor: The EM algorithm initialized with Tensor Method proposed in [8]. The
202
formula of Tensor Method is derived assuming xi~N (0, I). We adopt implementation
203
provided by the author of [7].
204
• ALT-Random: An Alternating Minimization algorithm proposed in [7] with the same
205
initialization strategy and number of trails as EM-Random.
206
-1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 -10
-5 0 5 10 15
-1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1
-10 -5 0 5 10 15
0 100 200 300 400 500 600 700 800 900
N 0
0.1 0.2 0.3 0.4 0.5 0.6 0.7
RMSE
PolyKernel-Deg=6-K=4 EM-random Lasso
-1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1
-10 -5 0 5 10 15 20
-1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1
-10 -5 0 5 10 15
0 100 200 300 400 500 600 700 800 900
N 0
0.1 0.2 0.3 0.4 0.5 0.6 0.7
RMSE
PolyKernel-Noisy-Deg=6-K=4 EM-random MixLasso
Figure 3: Results on Mixture of 6th-order Polynomial Regression of K=4 components with noise (Bottom) and without noise (Top). (Left) The best result of EM out of 100 random initialization.
(Middle) Solution from MixLasso followed by fine-tuning EM iterates. (Right) Comparison in terms of RMSE.
• ALT-Tensor: The Alternating Minimization algorithm initialized with Tensor Method
207
proposed in [7]. The formula of Tensor Method is derived assuming xi~N (0, I). We adopt
208
implementation provided by the author of [7].
209
• MixLasso: The proposed estimator with Algorithm 1. We round our solution to exact K
210
components according to the rounding procedure described in Section 4.3 for generalized
211
MRand standard MR respectively. The rounded solution is further refined by EM iterates.
212
For the linear case, we compare methods using the root mean square error on the learned parameters
213
W compared to the ground-truth parameters W∗of size K × D: minS:|S|=K kWS,:√−W∗kF
DK , where
214
S denotes a multiset that selects the best matched row in W for each row in W∗. For the nonlinear
215
case, we compare methods using RMSE between the predicted value and the ground-truth function
216
value:
q1 N
PN i=1(PK
k=1zikfk(xi) −PK
k=1zik∗fk∗(xi))2. Figure 1 and 2 give experimental results
217
of linear model in the noiseless and noisy case (N (0, σ) noise with σ = 0.1) respectively, where
218
the ground-truth hidden assignments are generated by M ultinoulli(1/K). In the case of Normal
219
input distribution (top row), both the Tensor-initialized methods and MixLasso consistently improve
220
upon random-initialized EM/ALT (even with 100 trials) in terms of the number of samples required
221
to achieve a good performance, where ALT performs better than EM in higher dimensional case
222
(D = 100, K = 3) and EM performs better for cases of more components (D = 20, K = 10);
223
meanwhile, MixLasso leads to significant improvements in both cases.
224
On the other hand, when input distribution becomes Uniform(-1,1), the tensor-initialized method
225
becomes even worse than the random-initialized ones, presumably due to the model mis-specification,
226
while MixLasso still consistently improve upon random initialized EM/ALT. Note we test Tensor
227
Method derived from Normal inputon data with Uniform input on purpose. The goal is to see the
228
effect of model misspecification on the Tensor approach, since in practice one would always have
229
model misspecification to some degree.
230
The rightmost columns of Figure 1, 2 show the results on data generated from generalized mixed
231
regressionmodel with {zik}Kk=1following Bernoulli(0.5), where the MixLasso improves upon EM-
232
Random by a large margin. Finally, Figure 3 gives a comparison of EM-Random and MixLasso on
233
Mixture of Kernel Regression with polynomial kernel K(xi, xj) = (axTixj+ b)d, where we generate
234
K=4 random 6th-degree polynomial functions {fk∗}Kk=1by uniform sampling their coefficients from
235
U (−10, 10). In this setting, we found EM-Random has a hard time converging to the ground-truth
236
solution even when we increase the sample size N , while MixLasso obtains solution close to the
237
ground truth with a small number of samples.
238
References
239
[1] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Alternating minimization for mixed
240
linear regression. In ICML, pages 613–621, 2014.
241
[2] Lei Xu, Michael I Jordan, and Geoffrey E Hinton. An alternative model for mixtures of experts.
242
In Advances in neural information processing systems, pages 633–640, 1995.
243
[3] Christopher M Bishop and Markus Svenskn. Bayesian hierarchical mixtures of experts. In
244
Proceedings of the Nineteenth conference on Uncertainty in Artificial Intelligence, pages 57–64.
245
Morgan Kaufmann Publishers Inc., 2002.
246
[4] Yudong Chen, Xinyang Yi, and Constantine Caramanis. A convex formulation for mixed
247
regression with two components: Minimax optimal rates. In COLT, pages 560–604, 2014.
248
[5] Hanie Sedghi, Majid Janzamin, and Anima Anandkumar. Provable tensor methods for learning
249
mixtures of generalized linear models. In Artificial Intelligence and Statistics, pages 1223–1231,
250
2016.
251
[6] Arun T Chaganty and Percy Liang. Spectral experts for estimating mixtures of linear regressions.
252
In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages
253
1040–1048, 2013.
254
[7] Kai Zhong, Prateek Jain, and Inderjit S Dhillon. Mixed linear regression with multiple compo-
255
nents. In Advances in Neural Information Processing Systems, pages 2190–2198, 2016.
256
[8] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Solving a mixture of many ran-
257
dom linear equations by tensor decomposition and alternating minimization. arXiv preprint
258
arXiv:1608.05749, 2016.
259
[9] Paul Hand and Babhru Joshi. A convex program for mixed linear regression with a recovery
260
guarantee for well-separated data. arXiv preprint arXiv:1612.06067, 2016.
261
[10] Ian En-Hsu Yen, Xin Lin, Kai Zhong, Pradeep Ravikumar, and Inderjit Dhillon. A convex
262
exemplar-based approach to mad-bayes dirichlet process mixture models. In International
263
Conference on Machine Learning, pages 2418–2426, 2015.
264
[11] Ian En-Hsu Yen, Xin Lin, Jiong Zhang, Pradeep Ravikumar, and Inderjit Dhillon. A convex
265
atomic-norm approach to multiple sequence alignment and motif discovery. In International
266
Conference on Machine Learning, pages 2272–2280, 2016.
267
[12] Ian En-Hsu Yen, Wei-Chen Chang, Sung-En Chang, Arun Sai Suggula, Shou-De Lin, and
268
Pradeep. Ravikumar. Latent feature lasso. International Conference on Machine Learning,
269
2017.
270
[13] Venkat Chandrasekaran, Benjamin Recht, Pablo A Parrilo, and Alan S Willsky. The convex
271
geometry of linear inverse problems. Foundations of Computational mathematics, 12(6):805–
272
849, 2012.
273
[14] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
274
Statistical Society. Series B (Methodological), pages 267–288, 1996.
275
[15] Yurii Nesterov et al. Quality of semidefinite relaxation for nonconvex quadratic optimization.
276
Université Catholique de Louvain. Center for Operations Research and Econometrics [CORE],
277
1997.
278
[16] Nicolas Boumal, Vlad Voroninski, and Afonso Bandeira. The non-convex burer-monteiro ap-
279
proach works on smooth semidefinite programs. In Advances in Neural Information Processing
280
Systems, pages 2757–2765, 2016.
281
[17] Po-Wei Wang and J Zico Kolter. The mixing method for maxcut-sdp problem. NIPS LHDS
282
Workshop., 2016.
283
7 Appendix
284
7.1 Proof for Theorem 1
285
Let g(M ) be a smooth function such that ∇g(M ) is Lipschitz-continuous with parameter ρ, that is, g(M0) − g(M ) − h∇g(M ), M0− M i ≤ρ
2kM0− M k2F.
Then ∇jf (c) = zjT∇g(M )zj is Lipschitz-continuous with parameter γ, a number of order O(1)
286
when g(.) is an empirical risk normalized by N . Let A be the active set before adding a component ˆj.
287
Consider the descent amount produced by minimizing F (c) w.r.t. the cˆjgiven that 0 ∈ ∂jF (c) for
288
all j ∈ A due to the subproblem in the previous iteration. Let j = ˆj, for any ηjwe have
289
F (c + ηjej) − F (c) ≤ ∇jf (c)ηj+ λ|ηj| +γ
2η2j ≤ µ∇j∗f (c)ηj+ λ|ηj| + γ 2ηj2
≤ µ∇j∗f (c)ηj+ λ|ηj| +γ 2η2j Minimize w.r.t ηjgives
290
minηj
F (c + ηjej) − F (c) ≤ min
ηj
µ∇j∗f (c)ηj+ λ|ηj| +γ 2η2j
≤ min
ηj µ∇j∗f (c)ηj+ λ|ηj| + γ
2ηj2= min
ηk:k /∈A
X
k /∈A
µ∇kf (c)ηk+ λ|ηk|
+γ
2 X
k /∈A
|ηk|
!2
≤ min
ηk:k /∈A µX
k /∈A
∇kf (c)ηk+ λ|ηk|
+γ
2 X
k /∈A
|ηk|
!2
+ (1 − µ)λX
k /∈A
|ηk|
where the last equality is justified by Lemma 2 provided later. For k ∈ A, we have
291
0 = min
ηk:k∈A µX
k∈A
(∇kf (c)ηk+ λ|ck+ ηk| − λ|ck|)
Combining cases for k /∈ A and k ∈ A, we can obtain a global estimate of descent amount compared
292
to some optimal solution x∗as follows
293
min
ηˆj
F (c + ηˆjeˆj) − F (c)
≤ min
η µ
h∇f (c), ηi + λkc + ηk1− λkck1
+γ
2 X
k /∈A
|ηk|
!2
+ (1 − µ)λX
k /∈A
|ηk|
≤ min
η µ
F (c + η) − F (c)
+γ
2 X
k /∈A
|ηk|
!2
+ (1 − µ)λX
k /∈A
|ηk|
≤ min
α∈[0,1]
µ
F (c + α(c∗− c)) − F (c)
+αγ
2 kc∗k21+ α(1 − µ)λkc∗k1
≤ min
α∈[0,1]
−αµ
F (c) − F (c∗)
+α2γ
2 kc∗k21+ α(1 − µ)λkc∗k1. It means we can always choose an α small enough to guarantee descent if
294
F (c) − F (c∗) >(1 − µ)
µ λkc∗k1. (25)
Then for
295
F (c) − F (c∗) ≥2(1 − µ)
µ λkc∗k1, (26)
we have
296
minηˆj F (c + ηˆjeˆj) − F (c) ≤ min
α∈[0,1] −αµ 2
F (c) − F (c∗)
+α2γ
2 kc∗k21.
Minimizing w.r.t. to α gives the convergence guarantee F (ct) − F (c∗) ≤ 2γkc∗k21
µ2 1 t. for any iterate with F (ct) − F (c∗) ≥2(1−µ)µ λkc∗k1.
297
Lemma 2.
minηj µ∇j∗f (c)ηj+ λ|ηj| +γ
2ηj2 (27)
= min
ηk:k /∈A
X
k /∈A
µ∇kf (c)ηk+ λ|ηk|
+γ
2 X
k /∈A
|ηk|
!2
(28)
Proof. The minimization (33) is equivalent to
298
min
ηk:k /∈A
X
k /∈A
µ∇kf (c)ηk
s.t. X
k /∈A
|ηk|
!2
≤ C1, X
k /∈A
|ηk| ≤ C2.
and therefore is equivalent to
299
min
ηk:k /∈A µX
k /∈A
∇kf (c)ηk
s.t. X
k /∈A
|ηk| ≤ min{p C1, C2}
which is a linear objective subject to a convex set and thus always has solution that lies on the corner
300
point with only one non-zero coordinate ηj∗, which then gives the same minimum as (32).
301
7.2 Proof for Lemma 1
302
Since supp(c∗) = A∗, and c∗is optimal when restricted on the support, we have hη, c∗i = 0 for
303
some η ∈ ∂F (c∗). And since F (c) is strongly convex on the support A∗with parameter β, we have
304
F (0) − F (c∗) = F (0) − F (c∗) − hη, 0 − c∗i
≥ β
2kc∗− 0k22, which gives us
kc∗k22≤ 2(F (0) − F (c∗))
β .
Combining above with the fact for any c, kck21≤ kck0kck22, we obtain the result.
305
7.3 Proof for Theorem 2
306
Lemma 3. Let r(W ) and rN(W ) be the risk (2) and the empirical risk respectively, we have
sup
W ∈RK×D:kW kF≤R
|r(W ) − rN(W )| ≤ s
2DK log(4RKN )
N + 1
N log(1 ρ) with probability1 − ρ.
307
Proof. Since minz∈{0,1}N 1
2(y − zTW x)2≤ |y|2≤ 1 for a given W , by Hoeffding inequality, P (|rN(W ) − r(W )| ≥ ) ≤ exp(−2N 2).
Let N (δ) be a δ-covering of the set W := {W ∈ RK×D| kW kF ≤ R} with |N (δ)| ≤ 4Rδ DK .
308
Then for any W ∈ W, we have ˜W ∈ N (δ) with kW − ˜W k ≤ δ. Applying a union bound, we have
309
P
sup
W ∈N (δ)˜
|rN( ˜W ) − r( ˜W )| ≥
≤ 4R δ
DK
exp(−2N 2). (29) Then for ∆W := W − ˜W satisfying k∆W k ≤ δ, we can bound the difference of square loss of W
310
and ˜W by
311
min
z∈{0,1}K
1
2(y − zTW x)2− min
z∈{0,1}K
1
2(y − zTW x)˜ 2≤ 1
2(y − ˜zTW x)2−1
2(y − ˜zTW x)˜ 2
≤ k∆W kFk˜zk + 2Rk˜zk2k∆W kF ≤ 3RK
(30) where ˜z = arg minz∈{0,1}K 1
2(y − zTW x)2 and we used the fact that kxk ≤ 1 and |y| ≤ 1.
312
By symmetry, we have
minz∈{0,1}K 1
2(y − zTW x)˜ 2− minz∈{0,1}K 1
2(y − zTW x)2
≤ 3RK.
313
Combining (29) with (30), we have
314
sup
W ∈W
|rN(W ) − rN(W )| ≤ 6RKδ + s
DK
2N log(4R δ ) + 1
2N log(1
ρ). (31)
with probability 1 − ρ. Setting δ = 1/(6RK√
N ) and apply Jennen’s inequality gives the result.
315
Then the following gives the proof for Theorem 2.
316
Proof. Let ¯zi = arg minzi∈{0,1}K (yi− zTiW x¯ i)2for i ∈ [N ]. Denote ¯Z as the N × K matrix
317
stacked from (¯zTi)Ni=1. Let {¯zk}Kk=1be the columns of ¯Z and ¯A be the indexes of atoms in the atomic
318
set (5) that have the same 0-1 patterns to those columns. Denote ¯c as the coefficient vector with
319
¯
ck = 1 for k ∈ ¯A and ¯ck = 0 for k /∈ ¯A. By the definition of F (c), we have
320
F (¯c) ≤ rN( ¯W ) +τ
2k ¯W k2F+ λk¯ck1. (32) where rN( ¯W ) := 2N1 PN
i=1minz∈{0,1}K(yi− zTW x¯ i)2is the empirical risk of ¯W . Let c∗ :=
321
arg minc:supp(c)= ¯A F (c). We have F (c∗) ≤ F (¯c). Then from (23),
322
F (ˆc) − F (¯c) ≤ F (ˆc) − F (c∗) ≤4γK βµ2
1 T
+2(1 − µ)λ µ
s 2K
β . (33)
In addition, the risk of ˆW satisfies
323
rN( ˆW ) +τ
2k ˆW k2F+ λkˆck1≤ F (ˆc) (34) by the definition of the empirical risk rN(.) (since it is minimized w.r.t. the hidden assignments).
324
Combining (32), (33) and (34), we obtain a bound on the difference of empirical risk
325
rN( ˆW ) − rN( ¯W ) ≤ τ
2k ¯W k2+ λK
| {z }
bias of regularization
+4γK βµ2
1 T
+2(1 − µ)λ µ
s 2K
β
| {z }
optimization error
(35)
The remaining task is to bound the estimation error |r(W ) − rN(W )|. Since Algorithm 1 is a descent
326
algorithm w.r.t. F (c) and in the beginning F (0) ≤ 1/2, we have kck1 ≤ 1/λ and kW k2 ≤ 1/τ
327
at any iterate. Then we can bound the estimation error by Lemma 3 for ˆW belonging to the set
328
W(T ) := { ˆW ∈ RT ×D| k ˆW kF ≤p1/λτ }, giving
329
|r( ˆW ) − rN( ˆW )| ≤ s
2DT log(4T N/√ λτ )
N + 1
N log(1
ρ). (36)
Combining (35) and (36), and choosing λ = 1/(N K), τ = 1/(N R2), we obtain r( ˆW ) − r( ¯W ) ≤
330
with T ≥ µ4γ2β(K), and N ≥ DT2 (2 log(4RKT ) + log(1ρ)) for any 0 < < 1.
331