Procedure: StaticCTA(A, {Rn}Nn=1, C)
Input: An N -th order tensor A ∈ RI1×I2×···×IN, a set of reduced ranks {Rn}Nn=1, and the number of clusters C for the clustered mode m.
Output: The core tensor and basis matrices of each clustern
Zc,Un,c
i=1into C disjoint regions using traditional clustering methods for c ← 1 to C do
Procedure: IterativeCTA(A, {Rn}Nn=1, C)
Input: An N -th order tensor A ∈ RI1×I2×···×IN, a set of reduced ranks {Rn}Nn=1, and the number of clusters C for the clustered mode m.
Output: The core tensor and basis matrices of each cluster n
Re-classify Ahmiiby solving Equation 5.8 end
from the core tensor of each cluster to solve this issue. In the update stage, the core tensor and the basis matrices of each cluster are then updated using N -SVD, while the cluster membership is fixed to satisfy the constraints in Equation 5.6. The above two stages are repeated until the sum of the Frobenius norm of each cluster core tensor converges, or a pre-defined maximum iteration count is reached. In Algorithm 5.2, the pseudo-code of the proposed static and iterative CTA algorithms is presented in details.
5.3.2 Clustering Stage
In this stage, we would like to re-classify each mode-m sub-tensor into the cluster with the minimum approximation error from the core tensor and basis matrices of each cluster. To facil-itate the computation of approximation errors for the mode-m sub-tensors outside cluster c, the dual mode-m basis matrix Vm,c ∈ RRm×(Rm+1Rm+2···RNR1R2···Rm−1) of cluster c is additionally
Since the rows of ufmZc are orthogonal, Equation 5.7 is equivalent to the normalization of each row of ufm Zc.
With the help of dual mode-m basis matrices, we propose the following theorem to re-classify each mode-m sub-tensor:
Theorem 5.3: Let Vm,c ∈ RRm×(Rm+1Rm+2···RNR1R2···Rm−1)be the dual mode-m basis matrix of clusterc as defined in Equation 5.7. The cluster ci with the minimum approximation error for thei-th mode-m sub-tensor Ahmiiis obtained by solving the following constrained integer optimization problem:
wherek·kF denotes the Frobenius norm of a matrix.
Theorem 5.3 states that Ahmiishould be classified into a cluster whose basis matrices ade-quately preserve its projected norm, so that the mode-m sub-tensors within a cluster are corre-lated with each other. Interested readers may refer to Section 5.5 for the mathematical proof of Theorem 5.3.
5.3.3 Update Stage
Given the cluster membership of each mode-m sub-tensor, the core tensor and the basis matri-ces of each cluster are updated using N -SVD in this stage. Since the partitioned clusters are independent of each other, their decomposed results can be separately computed.
Neverthe-less, to satisfy the orthonormal constraints on mode-m basis matrices in Equation 5.6, only the member sub-tensors of a cluster should be considered during tensor decomposition.
More formally, let Mcbe the membership index set of cluster c defined as
Mc =i | i ∈ {1, 2, . . . , Im}, Ahmiiis a member of cluster c , (5.9) and Mc∈ RIm×|Mc| denotes the membership matrix of cluster c, whose entries are
∀i1, ∀i2, Mc
i1i2 =
1, if i1 = (Mc)i2, 0, otherwise,
(5.10)
where | · | denotes the cardinal of a set, and (Mc)i2 is the i2-th element of the membership index set Mc. We then extract the member sub-tensors of cluster c into an N -th order tensor Ac∈ RI1×I2×···×Im−1×|Mc|×Im+1×Im+2×···×IN by the following equation:
Ac= A ×mMTc. (5.11)
When applying tensor approximation to the shrunken tensor Ac, non-members of cluster c are excluded from the decomposition. ZcandUn,c N
n=1are then updated by the decomposed core tensor and basis matrices of Ac. Note that since Accontains only members of cluster c, Um,c is further updated by the multiplication McUm,c to satisfy the constraints in Equation 5.6.
5.4 Implementation Issues
5.4.1 Initial Guess
One significant implementation issue of static and iterative CTA is the initial guess of cluster membership of each mode-m sub-tensor. According to the experiments, we have found that applying some advanced clustering methods, such as enhanced LBG [133] and local (or clus-tered) principal component analysis [75, 161], to determine the initial cluster membership is often much better than using conventional K-means clustering. Although this scheme will push the problem back to the initial seeds for clustering methods, various techniques for generating and fixing the initial cluster seeds usually provide satisfactory results in our experience.
The best approach may vary with the given data sets, but here we present a general method as a guideline to determine the initial cluster seeds. For a real-world data set, each mode of the input tensor is associated with a parametric space that describes its physical conditions, for example, different illumination or view directions for a reflectance function. Based on the assumption that the observations from nearby parameters in the mode-m parametric space are expected to be highly correlated, we can perform initial clustering in the parametric space in-stead. This is computationally efficient since the dimensionality of the mode-m parametric
space is frequently lower than that of the observations. Sophisticated or even exhaustive meth-ods therefore can be employed to generate appropriate cluster seeds. In our experiments, this scheme generally improves the approximation errors of CTA by 3% ∼ 8% when compared with directly performing K-means clustering on mode-m sub-tensors to determine the initial cluster seeds for CTA.
5.4.2 Global Basis Matrices
Sometimes a single global mode-n basis matrix for all clusters is preferred rather than an in-dividual local mode-n basis matrix for each cluster. The preference for global basis matrices may be due to computational costs, storage space, or the special purpose of an application. To account for this issue, the global basis matrices are computed by decomposing an N -th order tensor A before applying CTA.
Let G be the index set of global basis matricesUn
n∈G, where the subscript c of a local mode-n basis matrix is omitted to denote a global one. After extractingUn
n∈Gby applying tensor approximation to A, we project A ontoUn
n∈Gto obtain an N -th order reduced tensor AGas
AG = Aą
n n∈G
UTn. (5.12)
CTA is then performed on AG to compute the core tensor and the local basis matrices of each cluster, while the global basis matrices are fixed.
Although an iterative algorithm can be employed to alternately update the global basis matri-ces and the decomposed results of CTA, it is computationally too expensive and only improves approximation errors by a small amount. We therefore just performed the initial tensor decom-position on A to derive the global basis matrices and did not update them after CTA for all experimental results in this dissertation.
5.5 Mathematical Proofs
Re-Classification in the Clustering Stage
In the clustering stage of CTA, the mode-m sub-tensors of A are re-classified into C disjoint clusters, while the core tensor and basis matrices, other than the mode-m basis matrix, of each cluster are fixed. We thus introduced Theorem 5.3 in Section 5.3.1 to determine the cluster membership of each mode-m sub-tensor with strong intra-cluster correlations. In this section, the mathematical proof of Theorem 5.3 is presented in details to justify the correctness of the proposed CTA algorithm.
Proof of Theorem 5.3: Since the cluster membership of a mode-m sub-tensor does not depend
on that of others, Equation 5.6, without the orthonormal constraints on Um,1, Um,2, . . . , Um,C, can be separated into Imdistinct constrained least-squares optimization sub-problems as
min
= Rm (disjoint constraint),
∀c,
∈ {0, Rm} (membership constraints),
(5.13)
for i = 1, 2, . . . , Im. After solving all the optimization sub-problems, we should finally trans-form each derived mode-m basis matrix by a matrix of size Rm×Rm to satisfy the orthonormal constraints in Equation 5.6. Since this transformation actually resembles a rotation in the clus-ter sub-space associated with the m-th mode, one can always find such a transformation matrix without changing the final approximation errors. Therefore, the remaining issue of this proof is to show that the solution to Equation 5.13 will implicitly provide a solution to Equation 5.8.
Without losing generality, suppose that Ahmii is re-classified into cluster ci. Since the con-straints in Equation 5.13 can be enforced by setting Um,c
i∗to zeros for all c 6= ci, the objective function in Equation 5.13 becomes
1
where the right side comes from the definitions of the mode-n unfolded matrix and the mode-n product. Furthermore, the definition of Frobenius norm allows us to rewrite Equation 5.14 into a standard least-squares objective function as follows:
1
Therefore, the optimal solution of Um,ci
i∗to Equation 5.15 is the least-squares estimation of the solution to the following linear equation:
ufm Ahmii = Um,ci
where
denotes a series of Kronecker products, and the symbol ⊗ represents the Kronecker product operator. Equation 5.16 comes from the relation between the mode-n product and the Kronecker product [33]. From Equation 5.7, we can further rewrite Equation 5.16 as
ufm Ahmii = Um,ci row of ufmZci. Since each basis matrix has orthonormal columns and the dual mode-m basis matrix has orthonormal rows, the least-squares solution to Equation 5.15 is
Um,ci
After substituting Equation 5.19 for Um,ci
i∗in Equation 5.15, we have 1
+ 1
From Equation 5.20 to Equation 5.21, we use the facts that the Kronecker product of two column-orthonormal matrices is another column-orthonormal matrix, and the `2 norm of a row vector vT is unchanged under the transformation vTU for a row-orthonormal matrix U, namely vTU
2 = vT
2. Since the first term in Equation 5.22 is a constant, the minimization of Equation 5.22 is equivalent to the maximization of the second term in Equation 5.22. Theo-rem 5.3 thus is proved by identifying that the cluster membership is implicitly specified in the solution to Equation 5.13.
Chapter 6
K-Clustered Tensor Approximation
Recently, there has been a growing interest in modeling real-world observations as sparse lin-ear combinations of atoms (or basis functions) in an over-complete dictionary. Although the underlying physical process of a natural phenomenon may be a complex function or mixture of heterogeneous elements, it is frequently desirable to represent observations in a sparse form that allows efficient data analysis. However, this intuitive concept is far from easy to achieve in prac-tice. Even with a fixed dictionary, searching for an optimal solution in which each signal exactly depends on a given number of atoms was proved to be NP-hard [31]. Therefore, many practical algorithms instead consider sub-optimal solutions, such as matching pursuit [110, 148], basis pursuit [19], and Bayesian models [83, 102].
In addition to pursuit algorithms and Bayesian models, previous studies have also reported a close connection between sparse representation and vector quantization [83, 177]. K-singular value decomposition(K-SVD) [3, 4] thus generalized K-means clustering to seek sparse rep-resentations by alternating between the pursuit process and dictionary learning. Nevertheless, rather than using Bayesian inference, it applied singular value decomposition (SVD) to simulta-neously update dictionary atoms and non-zero basis coefficients in the dictionary learning stage so that convergence rate could be improved. Interested readers may refer to [3] for more details about K-SVD and a comprehensive review on recent progress of sparse representation.
In this chapter, we introduce a novel sparse multi-linear model, namely K-clustered tensor approximation (K-CTA), which combines the advantages of clustered tensor approximation (CTA) and K-SVD to bridge the gap between sparse representation and tensor approximation.
Several important theorems of K-CTA are discussed and proved to demonstrate that K-CTA in fact resembles the behaviors of K-SVD in the tensor space and is a natural extension of CTA.
As a result, CTA is further generalized to provide a sparse multi-dimensional data analysis tool in which the sparsity is under user control.
6.1 Mathematical Formulation
One major drawback of CTA for decomposing an N -th order tensor A ∈ RI1×I2×···×IN along the clustered mode m is that it enforces hard clustering in which each mode-m sub-tensor is classified into just one cluster. The subsequent tensor approximation within each cluster thus can only exploit intra-cluster coherence. In addition, the performance of CTA heavily depends on the initial guess of cluster membership, but estimating an appropriate initial guess is a non-trivial problem at all. Even if the globally optimal solution to hard clustering could be easily found, the decomposed core tensors and basis matrices of different clusters may still have strong correlations.
As illustrated in Figure 6.1, our solution to this issue is to relax the hard clustering constraint into a soft one. Each mode-m sub-tensor now can be classified into more than one cluster and approximated by mixing the decomposed results of these clusters. To reduce run-time reconstruction costs, this soft constraint should also be a sparse one in which each mode-m sub-tensor belongs to just a few, say Km, clusters. This not only permits K-CTA to exploit the inter-cluster coherence that can not be analyzed by CTA, but also alleviates the influence of an inappropriate initial guess by breaking the hard cluster boundaries. K-CTA thus can be regarded as a sparse extension of CTA and a multi-linear generalization of K-SVD [3, 4].
To allow soft and sparse clustering for K-CTA, Equation 5.6 is reformulated into the follow-ing constrained least-squares optimization problem:
min
where Km specifies the number of mixture clusters for a mode-m sub-tensor. For K-CTA, the mode-m basis matrix Um,cis also called the mixing matrix of cluster c whose i-th row contains the mixing coefficients of the i-th mode-m sub-tensor Ahmiiwith respect to cluster c. The only difference between Equation 5.6 and Equation 6.1 is that the sparse and membership constraints of K-CTA enforce that each mode-m sub-tensor belongs to exact Km clusters instead of only one. As a result, the decomposed mode-m basis matrices are still sparse for K-CTA, and their sparsity is totally controllable by the value of Km. Note that Equation 6.1 will be exactly equivalent to Equation 5.6 when Km= 1. In this case, K-CTA should derive the same results as CTA to permit it as a natural generalization of CTA.
. . .
. . .
. . .
Input Tensor
Mode 1
Mode 2 Mode 3
. . .
≈ +
+ ...
Mode-3 Sub-Tensor Partial Sub-Tensor 1 Partial Sub-Tensor 2
Km Partial Sub-Tensors Cluster 1
Mode 1
Mode 2
Mode 3
. . . . . .
Cluster 2
Cluster 3
Cluster C
Figure 6.1: Decompose a third order tensor using K-CTA. Suppose that we intend to classify the mode-3 sub-tensors of a third order tensor (left top) into total C clusters. A mode-3 sub-tensor (enclosed in the red rectangle) is approximated as the sum of Km partial sub-tensors, each of which is classified into one different cluster. All the partial sub-tensors within a cluster thus can be concatenated along the third mode and decomposed using traditional tensor approximation.
6.2 Algorithm
6.2.1 Overview
Although the mathematical formulation of K-CTA is almost identical to that of CTA, the prac-tical algorithm for solving Equation 6.1 is substantially different from static and iterative CTA.
For a small total number of clusters, while an exhaustive approach for finding the optimal so-lution to a hard clustering problem may be efficient in practice, a brute-force method for soft clustering is frequently time-consuming and only feasible when the number of mixture clusters is rather small. It is therefore difficult to extend CTA into K-CTA with just minor modifications, but the fundamental alternating framework of CTA is still applicable.
Similar to CTA, the proposed iterative K-CTA algorithm also consists of two stages: the clustering stage (Section 6.2.2) and the update stage (Section 6.2.3). After initializing the core tensor and basis matrices of each cluster, all variables in Equation 6.1 are fixed in the clustering
stage, except for the mode-m basis matrix of each cluster. For each mode-m sub-tensor, a greedy approach is then applied to sequentially search for the best Km mixture clusters that minimize its approximation error. The mixing coefficients, namely the rows of the mode-m basis matrix of each cluster, are also updated by the proposed optimal projection method. Since the mode-m basis matrices derived by the optimal projection method do not have orthonormal columns, we should additionally update them, together with the core tensor of each cluster, to satisfy the orthonormal constraints in Equation 6.1. In the update stage, the core tensor and basis matrices of each cluster are then iteratively updated using N -SVD, one cluster at a time, while those of other clusters are unchanged. Note that we also fix the cluster membership in the update stage to simplify the proposed algorithm. Finally, the above two stages are iteratively executed until the sum of the Frobenius norm of each cluster core tensor converges, or a user-specified maximum iteration count is reached. The whole process of iterative K-CTA is summarized in Algorithm 6.1. Here, we omit the description of static K-CTA, since it just follows the process of iterative K-CTA without repeating the clustering stage and the update stage.
6.2.2 Clustering Stage
Given the core tensor and basis matrices of each cluster, we would like to find the best Km mixture clusters for each mode-m sub-tensor and derive the corresponding mixing coefficients in this stage, so that the sparse and membership constraints in Equation 6.1 are satisfied. This issue can be considered as a multi-linear counterpart of the pursuit problem for sparse repre-sentation [3, 4, 19, 110, 148], which was proved to be NP-hard [31]. Indeed, it is difficult to solve the cluster membership and the non-zero mixing coefficients of a mode-m sub-tensor at the same time, since the decomposed results of different clusters are frequently correlated. Even a single change in the membership of a mode-m sub-tensor, either adding the sub-tensor to or removing it from a cluster, will affect its mixing coefficients for other clusters. Nevertheless, greedy approaches often provide a satisfactory approximate solution to the pursuit problem in both theory and practice [31, 176]. We thus sequentially update the cluster membership and sparse mixing coefficients of all the mode-m sub-tensors, one cluster at a time.
Greedy Search
First mixture cluster: To allow K-CTA as a natural generalization of CTA, the first mixture cluster of a mode-m sub-tensor is obtained by using the same method in the clustering stage of CTA. From Theorem 5.3, we can determine the first mixture cluster ci1 of the i-th mode-m sub-tensor Ahmiiby solving the following constrained integer optimization problem:
maxci1