
A unifying theorem for spectral embedding and clustering

Matthew Brand and Kun Huang

Mitsubishi Electric Research Labs, Cambridge, Massachusetts, USA

Electrical & Computer Engineering, University of Illinois at Urbana-Champaign, USA

Abstract

Spectral methods use selected eigenvectors of a data affinity matrix to obtain a data representation that can be trivially clustered or embedded in a low-dimensional space. We present a theorem that explains, for broad classes of affinity matrices and eigenbases, why this works: For successively smaller eigenbases (i.e., using fewer and fewer of the affinity matrix's dominant eigenvalues and eigenvectors), the angles between "similar" vectors in the new representation shrink while the angles between "dissimilar" vectors grow. Specifically, the sum of the squared cosines of the angles is strictly increasing as the dimensionality of the representation decreases. Thus spectral methods work because the truncated eigenbasis amplifies structure in the data, so that any heuristic post-processing is more likely to succeed. We use this result to construct a nonlinear dimensionality reduction (NLDR) algorithm for data sampled from manifolds whose intrinsic coordinate system has linear and cyclic axes, and a novel clustering-by-projections algorithm that requires no post-processing and gives superior performance on "challenge problems" from the recent literature.

1 Introduction

Spectral methods for multivariate data analysis are notable both for their practical successes and for their rapidly developing theoretical underpinnings. A spectral algorithm typically begins with an "affinity matrix" of pairwise relationships between the samples or the variates, and derives a more useful representation of the data from its eigenvalue decomposition (EVD), often using just one or a few eigenvectors (a truncated eigenbasis). Many classic dimensionality reduction and nonlinear embedding algorithms have this character: Principal components analysis (PCA) [11] uses the variates' covariance matrix; multidimensional scaling (MDS) [16] uses the samples' pairwise distance matrix; kernel PCA [1,24] uses a kernel matrix where the kernel function κ(x_i, x_j) represents the dot product of samples in an unknown "feature space"¹; and locally linear embedding (LLE) [23] uses a matrix containing correlations of samples' barycentric coordinates.

Spectral methods have been even more successful for data clusterings and graph partitionings: Spectral bipartitioning [9,6] cuts a graph in two by thresholding the second eigenvector of the graph's normalized Laplacian matrix, and numerous clustering algorithms use selected eigenvectors of dot-product or kernel matrices to re-represent the data for clustering by simpler heuristics such as thresholding or K-means [25,2,5,7,27,21,29,13,3,18,19,4,20,22].

While the statistical basis and optimality of PCA are well understood, virtually all other spectral methods are motivated by imperfect analogies between data-derived graphs and physical problems (e.g., harmonic analysis² and random walks³), or as approximations to other problems (e.g., vector quantization [2], min-cut [27], or max-flow [6]).

Underlying all this work is the notion that the truncated eigenvector basis somehow makes the problem simpler for the subsequent analysis. Our theoretical goal is to explain how and why this works.

Embeddings and clusterings imply loss of information, but there has been little effort to bound or even quantify what is lost and characterize what is conserved. This is acutely true for the vast majority of algorithms in which the spectral analysis is just a prelude to further information-lossy data analysis. Promising steps in the right direction include work by Alpert & Yao [2] that equates spectral partitioning with vector quantization (thereby implying an objective function), and analyses by Fiedler, Perona & Freeman, Shi & Malik, Meila & Shi, and Weiss et al. [9,21,27,29,19,20] that justify using one or a few eigenvectors as a cluster indicator when the data is already clustered or nearly so. In particular, if the affinity matrix already has a block structure, then some of its eigenvectors will be piece-wise constant, such that if items i and j share cluster membership, elements i and j in these eigenvectors will have the same value [29,19,20]. For nearly clustered data, the eigenvectors are approximately piecewise constant.

¹ Feature space is the "linearization space" of Aizerman et al. [1], in which Euclidean relationships between points are consistent with the kernel's similarity measure.

² The eigenvectors of a graph's normalized Laplacian matrix are analogous to the modes of vibration the graph would exhibit if shaken [8].

³ The eigenvectors of a stochastic matrix describe the steady-state properties of an infinite random walk on the graph [6,19].

However, this leaves open many questions, particularly when the data is not nearly clustered: What special properties should the affinity matrix have? Stochasticity? Unit diagonal? Positive definiteness? Unit spectral radius? Which and how many eigenvectors should be used? What information is conserved in a truncated eigenbasis? Obviously, the answers to these questions should inform the post-processing of the eigenvectors.

In this paper we develop a unifying view of spectral methods that answers many of these questions and gives guidance for the construction of clustering and nonlinear dimensionality reduction algorithms. For most of our discussion, it will be useful to think of the affinity matrix in terms of a (possibly unknown) kernel: Affinity value A_ij = κ(x_i, x_j) = Φ(x_i)^⊤ Φ(x_j) is the dot product of two vectors representing the (usually unknown) locations of points i and j in a high-dimensional feature space associated with the kernel. Spectral analysis gives a new data representation derived from the eigenvalues and eigenvectors of the symmetric affinity matrix A. To summarize the main theoretical result:

(1) An eigenvalue-scaled eigenvector representation of the data encodes angles (equivalently, correlations) between points embedded on the surface of a hypersphere.

(2) When the representation is truncated by suppressing the smallest magnitude eigenvalues, the angles (equiv., correlations) between high-affinity points are least distorted, highlighting the manifold structure of the data.

(3) As the representation is further truncated, the angles (equiv., correlations) decrease between points having high affinity and increase between points having low affinity, highlighting the cluster structure of the data.

In short, nonlinear dimensionality reduction and clustering can be obtained from the same process. The theorem is limited to symmetric non-negative definite affinity matrices, but a corollary establishes relevance to non-positive matrices as well, and to asymmetric matrices (e.g., B) via their Grams (B^⊤B or BB^⊤).

In the remainder of the paper we leverage this theorem into novel methods for nonlinear dimensionality reduction (NLDR) and clustering. The NLDR algorithm maps the data to a mixed vector and toric space, with the linear or cyclic nature of each axis determined from statistical tests. The clustering algorithm works entirely by projections, whose information loss is easily characterized and minimized or bounded at each step. Experiments show that it produces high-quality clusterings of a wide variety of "challenge problems" exhibited in the recent literature. We also use it to solve an unusually difficult visual segmentation problem.
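As a concrete illustration of this setup (a sketch of ours, not from the paper), the following Python/NumPy fragment builds a kernel affinity matrix A_ij = κ(x_i, x_j)—using a Gaussian kernel as a stand-in for the possibly unknown kernel—and the eigenvalue-scaled eigenvector representation analyzed in the next section; the function names are hypothetical.

```python
import numpy as np

def affinity_matrix(X, sigma=1.0):
    """A_ij = kappa(x_i, x_j); a Gaussian kernel stands in for the (possibly unknown) kernel."""
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # pairwise squared distances
    return np.exp(-D2 / (2.0 * sigma ** 2))

def scaled_eigenvector_representation(A):
    """Return X = Lambda^{1/2} V^T from the EVD of the symmetric affinity matrix A."""
    lam, V = np.linalg.eigh(A)                 # eigh returns ascending eigenvalues
    lam, V = lam[::-1], V[:, ::-1]             # re-sort so eigenvalues are descending
    return np.sqrt(np.clip(lam, 0.0, None))[:, None] * V.T   # row k scaled by sqrt(lambda_k)

# Hypothetical usage on toy data; columns of X then represent the points:
# points = np.random.default_rng(0).normal(size=(30, 2))
# X = scaled_eigenvector_representation(affinity_matrix(points))
```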

2 The polarization theorem

Let A ∈ R^{D×D} be a non-negative definite symmetric matrix having eigenvalue decomposition (EVD) VΛV^⊤ = A with eigenvalues sorted in descending order on the diagonal of Λ. Define the representation X ≐ Λ^{1/2}V^⊤ and let the truncated representation X_(d) be the top d rows of X—the d principal eigenvectors scaled by the square roots of their associated eigenvalues. A well-known property of such truncated EVDs is that A_(d) ≐ X_(d)^⊤ X_(d) is the best rank-d approximation to A with respect to the Frobenius norm; equivalently, it is the most energy-preserving projection to rank d.

Let Y_(d) be an angle-preserving projection of the column vectors of X_(d) onto the surface of a d-dimensional hypersphere, obtained by scaling each column of X_(d) to unit norm. The angle between two column vectors x_i, x_j ∈ X_(d) (equivalently y_i, y_j ∈ Y_(d)) is

    θ_ij ≐ ∠(x_i, x_j) = arccos( x_i^⊤ x_j / (‖x_i‖·‖x_j‖) ),    (1)

and the correlation between two vectors x_i, x_j ∈ X_(d) (equivalently y_i, y_j ∈ Y_(d)) is

    corr(x_i, x_j) = y_i^⊤ y_j = cos θ_ij.    (2)

We may now state the main result:

Theorem (polarization): As positive (resp., non-negative) A is projected to successively lower ranks A_(D−1), A_(D−2), …, A_(d), …, A_(2), A_(1), the sum of squared angle-cosines ∑_{i≠j} (cos θ_ij)² (equivalently squared correlations ‖Y_(d)^⊤ Y_(d)‖_F²) is strictly increasing (resp., non-decreasing).

In short, as the dimensionality of the representation is reduced, the distribution of cosines migrates away from 0 toward the two poles ±1, such that angles migrate from θ_ij = π/2 to θ_ij ∈ {0, π}.

The full proof requires a large number of lemmas and runs to several pages; because of page limits it will be published separately. The following proof sketch gives the flavor of the argument: The identity diag(Λ) = (V ∘ V) diag(A) allows one to derive the distribution of the nonzero eigenvalues [γ_1, …, γ_d] of the cosine matrix Y_(d)^⊤ Y_(d) = diag(diag(A_(d)))^{−1/2} A_(d) diag(diag(A_(d)))^{−1/2}. One can then show that as d ↓ 1 the variance of the eigenvalues grows. However, the projection onto the hypersphere keeps the mean root-eigenvalue constant at (1/d) ∑_i^d γ_i^{1/2} = 1. Therefore the sum ∑_i^d γ_i = trace(Y_(d)^⊤ Y_(d)) = ‖Y_(d)^⊤ Y_(d)‖_F² = ∑_{i≠j} (cos θ_ij)² + D grows monotonically.
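As a purely numerical illustration of the theorem (ours, not part of the proof), the sketch below evaluates ∑_{i≠j}(cos θ_ij)² on a toy Gaussian-kernel affinity matrix for a decreasing sequence of ranks d; the printed values should not decrease as d shrinks.

```python
import numpy as np

def cosine_energy(A, d):
    """Sum of squared off-diagonal cosines for the truncated representation X_(d)."""
    lam, V = np.linalg.eigh(A)
    lam, V = lam[::-1], V[:, ::-1]                                    # descending eigenvalues
    X = np.sqrt(np.clip(lam[:d], 0.0, None))[:, None] * V[:, :d].T    # X_(d)
    Y = X / np.linalg.norm(X, axis=0, keepdims=True)                  # project columns onto the hypersphere
    C = Y.T @ Y                                                       # cosines of the pairwise angles
    return np.sum(C ** 2) - np.sum(np.diag(C) ** 2)                   # drop the diagonal terms (each = 1)

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])  # two toy clusters
D2 = np.sum((pts[:, None] - pts[None]) ** 2, axis=-1)
A = np.exp(-D2 / 2.0)                                                 # positive Gaussian-kernel affinity

for d in range(pts.shape[0], 0, -5):
    print(d, round(cosine_energy(A, d), 3))                           # non-decreasing as d decreases
```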

The following two corollaries will be developed into algorithms in the remainder of the paper:

Corollary (embedding): Suppressing the smallest-magnitude eigenvalues of A gives a d < D dimensional embedding in which small angles are least distorted.


This unsurprising corollary is quite similar to the motivation for PCA. In our case, the mass-preserving embedding spreads the data out on the hypersphere surface, preserving small angles most accurately because their cosines comprise most of the energy in the affinity matrix. This means that local relations between nearby points are well preserved. In the next section we show that this allows one to construct a relatively low-dimensional embedding for affinity data, with the unusual feature that the embedding space may have both linear and cyclic degrees of freedom.

Corollary (clustering): Truncation of the eigenbasis amplifies any unevenness in the distribution of points on the d-dimensional hypersphere by causing points of high affinity to move toward each other and others to move apart.

In short, the distribution approaches a clustering for small d ≪ D. This explains many, if not all, spectral clustering methods: Using a subset of all the eigenvectors emphasizes the data's cluster structure, improving the output of any heuristic clustering procedure. This does not mean that the lowest-dimensional embedding is the best one for clustering; there is a tradeoff between amplifying cluster structure and losing information. In section 4, we show that by using a large subset of eigenvectors one can depend entirely on projections and EVDs to do the clustering, removing the need for heuristic post-processing.

Although the theorem is limited to non-negative symmetric affinity matrices, it also has explanatory value for spectral methods that employ selected eigenvectors of non-positive matrices: Every real symmetric matrix can be written C = A − B, where positive semi-definite matrices A and B satisfy rank(A) + rank(B) = rank(C) and A, constructed from the positive part of C's spectrum, is the best (least-squares) Gram approximation of C (one offering a real-valued decomposition A = X^⊤X). The theorem applies to A. For example, Weiss [29] showed that clustering methods related to the min-cut problem (e.g., [21,27,18]) are of this nature: Although posed as generalized eigenvalue problems with non-positive Laplacian matrices, these algorithms ultimately consult a single eigenvector from the positive part of the spectrum of the normalized Laplacian matrix. It is worth noting that it follows from basic properties of the normalized Laplacian [6] that either (1) this eigenvector is only approximately piecewise constant, presenting some uncertainty for the final clustering, or (2) eigenvalue multiplicity makes the choice of eigenvector ambiguous. The polarization theorem suggests that additional relevant information lies in the remaining eigenvectors; below we construct an algorithm that exploits this information and eliminates the abovementioned ambiguities.
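For concreteness, here is a minimal sketch (ours) of the split just described: given a real symmetric C, build A from the positive part of its spectrum, so that C = A − B with A the least-squares Gram approximation to which the theorem applies.

```python
import numpy as np

def positive_gram_part(C):
    """Split a real symmetric C into C = A - B, with A built from the positive eigenvalues.
    A is positive semi-definite and admits a real-valued decomposition A = X^T X."""
    lam, V = np.linalg.eigh(C)
    pos = lam > 0
    A = (V[:, pos] * lam[pos]) @ V[:, pos].T          # least-squares Gram approximation of C
    B = -(V[:, ~pos] * lam[~pos]) @ V[:, ~pos].T      # negated negative part, so C = A - B
    return A, B
```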

3 Dimensionality reduction

Motivated by the first corollary above, we observe that Y_(d) is a low-dimensional nonlinear embedding of the data onto the surface of a d-dimensional hypersphere, with the arc-length between two points inversely related to their affinity score. Let us choose a dimensionality d that truncates only the lesser eigenvalues of the affinity matrix A. As in PCA, this choice is usually a matter of eyeballing the eigenvalue spectrum, unless one has prior knowledge about the true noise levels in the data and how they affect the kernel.

Since it is difficult to work with spherical embeddings, our goal is to "re-embed" the data in a vector space, where possible. Let p be the point on the hypersphere surface having smallest arc-length to all points in Y and let [u_1, …, u_{d−1}] be an orthogonal basis of the hyperplane tangent to the surface at p. (Note that U ≐ [u_1, …, u_{d−1}, p] satisfies U^⊤U = I.) Each u_i specifies a direction around the hypersphere, which can be visualized as a great circle parallel to u_i and passing through p. Each u_i can be interpreted as one of the axes of a projection to R^a × T^b—the Cartesian product of an a-dimensional vector space and a b-dimensional toric⁴ space—with a + b = d − 1. If the data wraps fully around the hypersphere along direction u_i, then axis i is cyclic; if the data wraps only partway around the hypersphere along direction u_j, then axis j is linear.

Much as PCA axes are statistically motivated from the multivariate gaussian distribution, the tangent point p and axes U can be estimated by fitting a gaussian distribution to the surface of a hypersphere. The complex Bingham distribution [15] is a multivariate gaussian density on y ∈ CS^{d−1} ⊂ C^d conditioned on the fact that all vectors are unit-length (y*y = 1, where y* denotes the complex conjugate transpose):

    p(y | Σ) = C(Σ)^{−1} exp(y*Σy).    (3)

The complex Bingham is parameterized by the hermitian matrix Σ = U diag([κ_1, …, κ_d]) U*, whose eigenvalues κ_1 < κ_2 < … < κ_{d−1} < κ_d = 0 are the concentration parameters of the density. A strongly negative κ_i ≪ 0 indicates that the density has little extent along direction u_i. The last eigenvector u_d points to the mode of the distribution, thus p = u_d. The normalizing constant C(Σ) is calculated via the matrix confluent hypergeometric function ₁F₁ [14], which in this case has a compact form discovered by Kent [15]:

    C(Σ) = ₁F₁(d/2, 1/2, diag([κ_1, …, κ_d]))    (4)
         = 2π^{d−1} ∑_{j=1}^{d−1} e^{κ_j} / ∏_{i≠j} (κ_j − κ_i).    (5)

Jupp and Mardia [12] showed that the direction vectors in U and the concentration parameters [κ_1, …, κ_{d−1}] are related to the scatter of Y through its EVD U diag([λ_1, …, λ_d]) U^⊤ = YY^⊤, with 0 < λ_1 < λ_2 < ⋯ < λ_d satisfying

    λ_j = ∂ log C(Σ) / ∂κ_j.    (6)

⁴ A toric space T^n = (S¹)^n has n cyclic axes but, unlike spherical space S^n, every point has a unique set of ordinates modulo 2π; there are no poles presenting singularities. E.g.: when walking, leg and arm phase are two cyclic variates in T²; but the orientation of a featureless cone in 3-space is described by two variates in S² (a.k.a. Euler angles).

For large sample sizes with concentrated density, κ_j ≈ −c/λ_j for some constant c. Numerical solution for the concentration parameters in higher dimensions is feasible but nontrivial. Fortunately, for dimensionality reduction, knowing the EVD of the scatter suffices: Its eigenvectors U = [u_1, …, u_{d−1}, p] are exactly the maximum likelihood (ML) estimate of the point of tangency (a.k.a. the modal direction) and the axes of the tangent space, while its eigenvalues [λ_1, …, λ_d] give a rough indication of which axes are cyclic: A small eigenvalue (λ_i ≈ 0) indicates that the data is approximately linear in direction u_i; a large eigenvalue (λ_i ≈ λ_d) indicates the axis is cyclic.
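A minimal sketch (ours) of this step: the EVD of the real scatter YY^⊤ supplies the modal direction and tangent axes, and the eigenvalue magnitudes give a rough cyclic-vs-linear indication. The ratio threshold below is a heuristic of this sketch only; the paper makes the precise decision with the uniformity test of Eq. (7) below.

```python
import numpy as np

def bingham_axes(Y, cyclic_ratio=0.5):
    """Modal direction and tangent axes from the scatter of the unit columns of Y (d x N)."""
    lam, U = np.linalg.eigh(Y @ Y.T)                  # ascending eigenvalues of the scatter
    p = U[:, -1]                                      # modal direction (largest eigenvalue)
    axes = U[:, :-1]                                  # tangent directions u_1, ..., u_{d-1}
    looks_cyclic = lam[:-1] > cyclic_ratio * lam[-1]  # large eigenvalue -> axis likely wraps around
    return p, axes, looks_cyclic
```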

Figure 1: Dimensionality reduction of points distributed on a 3D cylinder (in 10D space) to the 2D space R¹ × T¹—one linear axis and one cyclic axis. 500 points are randomly generated on the surface of a 3D cylinder embedded in 10D space and contaminated with isotropic 10D gaussian noise. The top two images show a 3D and a 2D projection of the points. The bottom two images show an embedding of the affinity matrix on the surface of a 3D sphere (side view, with the mode and the Bingham axes—linear and cyclic—marked). The embedding forms a wide belt around the equator. The arrows show the modal direction and two degrees of freedom of a Bingham distribution fitted to the embedding. Statistical tests for uniformity indicate that the data is cyclic around the equator but linear in the other direction.

Figure 2: Dimensionality reduction of points distributed on a curled plane in 10D space to the 2D vector space R². The data is analyzed as in figure 1, but the spherical embedding does not wrap all the way around the equator. The statistical test indicates that the data is Bingham-distributed in both directions, so the sphere's surface is azimuthally mapped to a 2D vector space, recovering the original planar coordinate system. Since the original data manifold passes close by itself, some points, particularly at the ends, have affinity for nonlocal neighbors, resulting in some distortion in the recovered coordinates.

To test more precisely whether the data is cyclic in direction u_i around the surface of the hypersphere, consider the projection of the data onto the great circle parallel to u_i and passing through the mode u_d. A uniform distribution on this circle (= S¹ = T¹) implies that the data is cyclic in direction u_i. The projection is Z ≐ [z_1, …, z_N] with z_j ∝ [u_d, u_i]^⊤ y_j, ‖z_j‖ = 1. With this projection, we can apply a result of Mardia [17] which gives a statistical test to assess the hypothesis that a distribution on the (perimeter of the) unit circle is uniform rather than complex Bingham: Let γ_1, γ_2 be the eigenvalues of the normalized scatter ZZ^⊤/n. Then, for large data-sets (n ≫ 0),

    3n(γ_1 − γ_2)² ∼ χ²₃,    (7)

where χ²₃ is a chi-squared distribution with three degrees of freedom. Thus we may reject the hypothesis that axis u_i is uniform (cyclic) with confidence Pr(χ²₃ > 3n(γ_1 − γ_2)²).
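A minimal sketch (ours) of this test, assuming SciPy for the χ²₃ tail probability; Z holds the unit-norm circular projections as 2 × n columns and the function name is hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def circle_uniformity_tail(Z):
    """Compute Pr(chi2_3 > 3n(gamma_1 - gamma_2)^2) for the circular projections Z (2 x n)."""
    n = Z.shape[1]
    gammas = np.linalg.eigvalsh(Z @ Z.T / n)          # eigenvalues of the normalized scatter
    stat = 3.0 * n * (gammas[1] - gammas[0]) ** 2     # 3n(gamma_1 - gamma_2)^2
    return chi2.sf(stat, df=3)                        # upper-tail probability of chi-squared(3)
```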

Once identified, the non-cyclic axes are isolated by projecting Y onto the union of the non-cyclic axes and p, then rescaling the resulting vectors to unit norm. After projection onto this reduced hypersphere, the modal point is [0, …, 0, 1] and the Bingham axes are similarly axis-aligned. An azimuthal equidistant mapping at the modal point then takes the points into a vector space. The process is illustrated in figure 1 for 10D points noisily sampled from a 2D nondevelopable⁵ manifold having genus 1 (a cylinder) and in figure 2 for 10D points noisily sampled from a 2D developable manifold having genus 0 (a rectangular plane curled into a "swiss roll").

⁵ A "developable" d-dimensional manifold embedded in R^n can be mapped to R^d without internal distortions, e.g., a developable surface can be unrolled and flattened without stretching.
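The azimuthal equidistant mapping mentioned above has a simple closed form once the mode has been rotated to [0, …, 0, 1]; a minimal sketch (ours):

```python
import numpy as np

def azimuthal_equidistant(Y):
    """Map unit columns of Y (m x N, mode at [0,...,0,1]) into an (m-1)-dimensional vector space,
    preserving each point's geodesic distance to the mode and its direction in the tangent plane."""
    alpha = np.arccos(np.clip(Y[-1], -1.0, 1.0))      # arc-length from the modal point
    tangent = Y[:-1]
    norms = np.maximum(np.linalg.norm(tangent, axis=0), 1e-12)
    return tangent * (alpha / norms)                  # (m-1) x N planar coordinates
```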


The affinity matrix for all data-sets in this paper is A_ij = κ(x_i, x_j) ∝ exp(−‖x_i − x_j‖²/(2σ²)), where σ is taken to be the average of the distances between each point and its closest neighbor (which we denote σ_s). For simplicity of analysis, we normalize the affinity matrix by projecting it to the nearest doubly stochastic matrix P ≐ diag(d) A diag(d), using a fast modification of the Sinkhorn procedure [28] to solve for d satisfying P1 = 1 (and P^⊤1 = 1 since P^⊤ = P). While not crucial to this method, a doubly stochastic matrix has two properties that make it appealing as a model of the data: P can be interpreted as the transition probabilities of a random walk on the data; and P's largest eigenvalue λ_max = 1 has corresponding eigenvector u_1 ∝ 1, which implies that the stationary distribution of the random walk is uniform (every point is equally probable). To obtain the embeddings in the figures, we discarded the totally uninformative u_1 and constructed Y_(d) as in section 2 from eigenvectors 2–4 of P, then fitted Bingham densities to the results to determine the appropriate embedding in R^a × T^b.
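A minimal sketch (ours) of this normalization: the Gaussian affinity uses σ_s (the mean nearest-neighbor distance), and the balancing is a plain damped Sinkhorn-style iteration rather than the paper's fast modification of [28]; the function names are hypothetical.

```python
import numpy as np

def gaussian_affinity(X, sigma_scale=1.0):
    """A_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) with sigma = sigma_scale * sigma_s,
    where sigma_s is the mean distance from each point to its nearest neighbor."""
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    nn = np.sqrt(np.where(np.eye(len(X), dtype=bool), np.inf, D2)).min(axis=1)
    sigma = sigma_scale * nn.mean()
    return np.exp(-D2 / (2.0 * sigma ** 2))

def sinkhorn_symmetric(A, iters=500):
    """Balance a symmetric non-negative A to a doubly stochastic P = diag(d) A diag(d).
    Plain damped fixed-point iteration for d_i = 1/(A d)_i; the paper uses a faster
    modification of the Sinkhorn procedure [28]."""
    d = 1.0 / A.sum(axis=1)
    for _ in range(iters):
        d = np.sqrt(d / (A @ d))       # damped update toward the balancing fixed point
    return d[:, None] * A * d[None, :]

# Hypothetical usage: P = sinkhorn_symmetric(gaussian_affinity(points))
```

The sigma_scale argument stands in for the multiples of σ_s (e.g., σ_s/2, 2σ_s, 10σ_s) explored in Figure 4 below.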

One could also make embeddings in R^a × T^b × S^c, though testing uniformity hypotheses on S^c for c > 1 requires explicit calculation of the Bingham concentration parameters.

In practice, we find that the ML modal estimator for the complex Bingham distribution is rather sensitive to noise; many samples may be necessary to get a good estimate. Fisher or Watson distributions may be better behaved, but currently they are less tractable analytically and computationally. We turn now to the problem of clustering, where we obtain an easily analyzed, highly competitive algorithm.

4 Clustering

Spectral methods have been extensively studied in graph partitioning and clustering problems. Fiedler [9] first showed that the eigenvector of the Laplacian matrix corresponding to the second eigenvalue gives an embedding of the graph in a real line; cutting this embedding at the origin gives a bipartitioning of the graph. This was extended to k-way partitioning, where the feature points are mapped into a k-dimensional space with the new coordinates being the normalized row vectors of the matrix formed by the first k eigenvectors of the affinity matrix [25,26]. Similarly, in Ng et al. [20], the normalized row vectors of the matrix formed by the first k weighted eigenvectors are used as the input to a k-means clusterer, and a perturbational analysis was used to show that the results should be stable if the data is already "nearly clustered". In Chan et al. [5], the directional angle between the row vectors of the first k eigenvectors of the Laplacian matrix was used as a new distance measure for partitioning. Alpert & Yao [2] equated partitioning with the problem of clustering these row-vectors, and found that the more eigenvectors used, the better. Spectral bipartitioning methods were adapted for visual clustering problems by Perona and Freeman [21] and Shi & Malik [27]. Analyses by Weiss [29] and Meila & Shi [18,19] showed that normalizing a nearly block-structured affinity matrix makes its eigenvectors approximately piecewise constant, and therefore easy to interpret as cluster assignments. However, such structure is not guaranteed for real problems, and some post-processing is necessary.

Figure 3: Spectral clustering of data distributed in three rings. (a) Cluster assignments are indicated by different markers. (b) The initial log-affinity matrix, sorted by true clusters (the algorithm is blind to such orderings). (c) The P matrix at convergence after 4 iterations. (d) In the converged representation, all the points belonging to any one cluster are co-located in a "corner" of a 3D sphere.

Our goal in visiting this already crowded field is to eliminate heuristic post-processing steps. Based on our theoretical result, we constructed one (of many possible) spectral clustering algorithms in which there is no post-EVD clustering or thresholding; instead, the stochastic eigenvectors form a discrete indicator matrix showing the membership of each point.

Our basic strategy is to cast clustering as two alternating projections: projection to low rank, and projection to the set of zero-diagonal doubly stochastic matrices. In both cases it is easy to characterize what is conserved and quantify what is lost. The projection to lower rank, A → A_(d) (or P → A_(d)), is exactly the process characterized by the polarization theorem: We polarize the distribution of angles with minimal loss of energy ‖A − A_(d)‖_F². The projection to a zero-diagonal doubly stochastic matrix, A_(d) → P = diag(d)(A_(d) − diag(diag(A_(d)))) diag(d), suppresses any differences in the stationary probability of points induced by the projection to low rank. Disregarding the suppressed diagonal, this projection is simply an angle-preserving rescaling of the embedding vectors, because diag(d) A_(d) diag(d) = diag(d) X_(d)^⊤ X_(d) diag(d) = (X_(d) diag(d))^⊤ (X_(d) diag(d)). Suppressing the diagonal induces negative eigenvalues in the spectrum of P (associated with removing energy placed on the diagonal by the positive eigenvalues); these eigenvalues account for less than half the energy ‖P‖_F² in P. We project to lower rank by suppressing the negative eigenvalues and the uninformative unit eigenvalue. This gives an automatic determination of d and a bound on the loss of variance. The alternating projections terminate when the resulting P matrix has two or more stochastic (unit) eigenvalues, implying reducibility. (A reducible matrix P can be row- and column-permuted into block-diagonal form, e.g., figure 5c.)

Figure 4: A gallery of "challenge problems" adapted from [20] and successfully clustered by our method. Cluster membership is indicated by marker symbol. As with all radial kernel methods, clustering reflects connectivity and scale similarity; thus the size of the kernel has an effect on the results: The same data-set is differently clustered with σ = 10σ_s in (c) and σ = 2σ_s in (d) (setting σ = σ_s breaks the sides from the corners). A very similar data-set was only bipartitioned after a search over σ in [20]. Similar results can be observed for (e) and (f), where σ = σ_s/2 for (e) and σ = σ_s for (f).

Analysis: It can be shown that (1) if P has unique positive nonstochastic eigenvalues, alternating the projections P → A_(d) and A_(d) → P will drive the leading eigenvalues up toward the positive bound λ_i ≤ +1 and all other eigenvalues down toward the negative bound λ_i ≥ −1. Formally, the vector norm of the eigenvalues increases while their sum remains constant. (2) Once P becomes reducible (has multiple stochastic eigenvalues λ_i = +1), the matrix X_(k) whose k columns are P's stochastic eigenvectors has exactly k unique rows. (3) Let Z_(k) be a matrix formed from these rows. The product Z_(k) Z_(k)^⊤ is diagonal, and the product X_(k) Z_(k)^⊤ (Z_(k) Z_(k)^⊤)^{−1} = X_(k) Z_(k)^{−1} is a binary (0/1) cluster indicator matrix that maps all points in a single cluster to a unique positive axis of R^k (e.g., figure 3d).

It remains to be shown whether the conditions in (1) are sufficient to guarantee absolute convergence. All our experiments, including those of figures 3–6, have converged quite quickly.
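Putting the two projections and the indicator readout together, a minimal sketch (ours); the unit-eigenvalue tolerance, the clipping of small negative entries before balancing, and the farthest-row selection of Z_(k) are heuristics of this sketch rather than the paper's exact procedure.

```python
import numpy as np

def _balance(A, iters=500):
    """Symmetric Sinkhorn-style balancing to doubly stochastic (cf. the section 3 sketch)."""
    A = np.clip(A, 0.0, None)                      # balancing assumes non-negative entries
    d = 1.0 / A.sum(axis=1)
    for _ in range(iters):
        d = np.sqrt(d / (A @ d))
    return d[:, None] * A * d[None, :]

def cluster_by_projections(A, tol=1e-6, max_iters=100):
    """Alternate low-rank projection (drop negative and unit eigenvalues) with projection to a
    zero-diagonal doubly stochastic matrix until P has two or more (near-)unit eigenvalues."""
    P = _balance(A - np.diag(np.diag(A)))
    for _ in range(max_iters):
        lam, V = np.linalg.eigh(P)
        unit = lam > 1.0 - tol                     # stochastic (unit) eigenvalues
        if unit.sum() >= 2:                        # reducible: cluster structure revealed
            break
        keep = (lam > tol) & ~unit                 # suppress negative and unit eigenvalues
        A_d = (V[:, keep] * lam[keep]) @ V[:, keep].T
        P = _balance(A_d - np.diag(np.diag(A_d)))
    # Readout: rows of the stochastic eigenvectors X_(k) coincide within a cluster; pick k
    # mutually distant rows as Z_(k) and map through its inverse (Analysis, item 3).
    X_k = V[:, unit]
    k = X_k.shape[1]
    idx = [int(np.argmax(np.linalg.norm(X_k, axis=1)))]
    for _ in range(k - 1):
        dists = np.linalg.norm(X_k[:, None, :] - X_k[idx][None, :, :], axis=-1).min(axis=1)
        idx.append(int(np.argmax(dists)))
    indicator = X_k @ np.linalg.inv(X_k[idx])      # approximately a 0/1 cluster indicator
    return np.argmax(indicator, axis=1)            # cluster label for each point
```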

When the affinity matrix is produced by a gaussian kernel, this procedure groups points that all have similar affinity values, essentially creating clusters where the inter-point distances are all on the same scale. Clusters may be non-convex and wrap around each other (e.g., figure 4a, figure 4b,e,f; figure 5a,b), but the results are generally in agreement with human judgment; some authors have drawn connections between gaussian kernel clustering and human perceptual gestalts [21,20].

This procedure automatically produces multi-way partitions without prior knowledge of the number of clusters. However, if the data contains clusters at different scales, the P matrix may become reducible before all the clusters have been revealed, giving only a partial clustering. Often we find that some remaining non-stochastic eigenvalues are close to 1, indicating that the clusters can be further subpartitioned. We partition the P matrix according to its stochastic eigenvectors and continue the alternating projections on its submatrices, thereby obtaining a hierarchical clustering. Figure 5 treats a well-known problem this way.
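A rough sketch (ours) of this recursion, reusing the hypothetical cluster_by_projections helper above; it simply restarts from each cluster's affinity submatrix and recurses to a fixed depth, whereas the paper continues the alternating projections on submatrices of P and recurses when remaining non-stochastic eigenvalues are close to 1.

```python
import numpy as np

def hierarchical_clusters(A, depth=1):
    """Cluster with the alternating projections, then subpartition each cluster's submatrix."""
    labels = cluster_by_projections(A)
    if depth <= 0:
        return labels
    out = np.zeros_like(labels)
    next_id = 0
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        if len(members) < 3:                       # too small to subpartition
            out[members] = next_id
            next_id += 1
            continue
        sub = hierarchical_clusters(A[np.ix_(members, members)], depth - 1)
        out[members] = next_id + sub
        next_id += int(sub.max()) + 1
    return out
```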

4.1 Application to motion segmentation

Figure 6: Nonrigid motion segmentation by spectral clustering. 500 frames of 2D tracking data from three people are scaled, centered, and superimposed to remove spatial cues that might aid grouping. The tracking matrix is factored P → M̃S̃ via SVD; spectral clustering of the columns of S̃ correctly groups the points on the basis of correlated motion through time. The grayscale images depict the top ten eigenvectors of the affinity matrix at initialization (far left) and of the P matrix at convergence (far right) after 11 iterations. At initialization, there is a noisy hint of the clustering in eigenvectors 2 and 5; at convergence, the clustering is clear in piece-wise constant eigenvectors 1–3, which are associated with unit eigenvalues. Other eigenvalues are near 1, indicating that the faces can be subpartitioned; this results in segmentation of the jaws. (Panels: (a) all points, frame 101; (b) segmentation; (c) individual clusters, separated to show correctness of segmentation.)

Figure 5: Hierarchical clustering of a problem treated in [18] and [20]. Our algorithm first gives a bipartitioning (a), then recursively analyzes the two subsets, using submatrices of P from the first partitioning. In the second round one subset is immediately bipartitioned to give (b). One point appears mis-classified, but its assignment may be consistent with the kernel, since its distances to other points are more consistent with the scale of inter-point distances on the line (◦'s) than of those in the tight cluster (♦'s).

Spectral clustering has become the preferred method for segmentation problems in computer vision [25,7,27,21,18,29,10,4,22]. Motion segmentation (e.g., [7]) takes 2D tracking data for a number of points in the scene, and seeks to group those points into independently moving 3D objects on the basis of correlated motion. The 2D projection of N points on the rigid 3D surface of object j in image I_f is given by P_fj = M_fj S_j ∈ R^{2×N}, where motion matrix M_fj encodes the position of the object relative to the camera, and shape matrix S_j ∈ R^{3×N} or ∈ R^{4×N} gives the 3D location of the points in object-centered or homogeneous coordinates. With multiple frames and multiple objects,

    P = MS ∈ R^{2F×(N_1+···+N_J)}                         (8)

        ⎡ M_11  ⋯  M_1J ⎤ ⎡ S_1  0    0  ⎤
      ≐ ⎢  ⋮    ⋱   ⋮   ⎥ ⎢  0   ⋱    0  ⎥                 (9)
        ⎣ M_F1  ⋯  M_FJ ⎦ ⎣  0   0   S_J ⎦

Each column of S describes one point; columns representing points from two different objects are orthogonal (two columns from the same object may be orthogonal as well, but no column can be orthogonal to all columns from the same object). Unfortunately, S cannot be recovered directly from P. Instead, a "thin" SVD can be used to factor P → M̃S̃ ≐ (MG^{−1})(GS), but the unknown matrix G may arbitrarily re-order or mix the columns of S, destroying the orthogonal structure. However, if the motions of the objects are approximately independent (they are never perfectly so), two columns in S̃ belonging to the same object will be more similar than two columns belonging to different objects, in the sense that their inner product will be more positive because the motions of the corresponding points are highly correlated. Therefore, spectrally clustering the columns of S̃ has been found to be highly successful for segmenting the set of points into objects (e.g., [4]).
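A minimal sketch (ours) of this pipeline, assuming a 2F × N tracking matrix P, the hypothetical cluster_by_projections helper from the clustering sketch above, and a Gaussian kernel on unit-normalized columns of S̃ (so that exp(−‖c_i − c_j‖²/(2σ²)) = exp((c_i^⊤c_j − 1)/σ²)); the rank argument and the kernel choice are illustrative, not prescribed by the paper.

```python
import numpy as np

def segment_tracks(P, rank, sigma=1.0):
    """Thin SVD of the tracking matrix, then spectral clustering of the columns of S_tilde."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    S_tilde = np.diag(s[:rank]) @ Vt[:rank]            # ~ G S, columns mixed by the unknown G
    C = S_tilde / np.linalg.norm(S_tilde, axis=0)      # unit-normalize columns before comparing
    A = np.exp((C.T @ C - 1.0) / sigma ** 2)           # Gaussian kernel on normalized columns
    return cluster_by_projections(A)                   # group columns (points) by correlated motion
```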

Two objections have been leveled at this approach: (1) It has only been applied to sequences in which the points can be segmented easily using simpler criteria such as spatial grouping. This is understandable; no one has tracking data for overlapping objects. (2) It is not clear that the method would extend to nonrigid objects, where the motions of points on one object are more weakly correlated.

To satisfy these objections, we constructed an unrealistically hard segmentation problem by superimposing dense face-tracking data from three "talking-heads" videos. The motion is highly nonrigid. To remove spatial separation cues, the data for each head was centered on the origin in each frame, so that when the data is combined all three faces overlap and there are no spatial or translational cues for segmentation. After centering, the motion of some points on a face is actually anti-correlated with most of the other points of the face (e.g., lower-lip versus upper-lip points). To accommodate the nonrigid motion, we increased the rank of the SVD by a factor of 5 (S_i ∈ R^{15×N}), which allows 5 modes of deformation per object but yields a harder, higher-dimensional clustering problem. Figure 6 shows that S̃ is perfectly clustered, yielding the correct motion segmentation. The tracking data, the evolution of the dominant eigenvectors, and the extracted clusters are shown in the accompanying video.


5 Summary

Spectral methods practitioners have long understood that a representation derived from selected eigenvectors of the affinity matrix somehow makes embedding and clustering problems easier for subsequent heuristic algorithms. To date, formal analyses have justified this approach only for problems with very obvious cluster structure and for certain kinds of affinity matrix. The polarization theorem of section 2 provides a unified explanation for virtually all the algorithms and affinity matrices in the cited literature: There exists an eigenvector representation which matches the angles between data-points in feature space; as the dimensionality of this representation is reduced, angles between similar points shrink while angles between dissimilar points grow. This highlights the cluster structure of the data and makes segmentation by heuristic methods significantly more likely to succeed. This theorem invites us to look at the representation as an embedding of the data on the surface of a hypersphere, where the inner product of two vectors gives the cosine of their angle. That insight led us to two algorithms: One finds nonlinear low-dimensional embeddings of data in spaces having a mixture of linear and cyclic axes; the other performs clusterings by repeated projections of the data, eliminating heuristic "post-clustering."

The clustering algorithm has the appeal that all steps are well characterized in terms of what information about the distribution is preserved or lost, and the amount of information loss can be bounded and/or minimized. It also performs very well in practice on both synthetic "challenge problems" from the literature and a real-world motion segmentation problem that is considerably harder than those contemplated in the computer vision literature.

We are currently exploring better distributions for spherical data, bounds on convergence rates, and bounds on the rate at which angles change as dimensionality is reduced.

In our work, we have benefitted from conversations with Sue Whitesides, Yoav Freund, Josh Tannenbaum, and Paul Viola, to whom we extend thanks.

References

[1] M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.

[2] C. Alpert and S. Yao. Spectral partitioning: The more eigenvectors, the better. In Proc. ACM/IEEE Design Automation Conference, 1995.

[3] Y. Azar, A. Fiat, A. Karlin, F. McSherry, and J. Saia. Spectral analysis of data. In Proceedings of the 33rd Symposium on Theory of Computing, pp. 619–626, 2001.

[4] M. Carcassoni and E. Hancock. A hierarchical framework for spectral correspondence. In Proc. Euro. Conf. Computer Vision, pp. 266–281, 2002.

[5] P. Chan, D. Schlag, and J. Zien. Spectral K-way ratio-cut partitioning and clustering. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 13(9):1088–1096, 1994.

[6] F. R. Chung. Spectral graph theory, volume 92 of CBMS Regional Conference Series in Mathematics. American Mathematical Society, 1997.

[7] J. Costeira and T. Kanade. A multi-body factorization method for motion analysis. In Proc. Int. Conf. Computer Vision, pp. 1071–1076, 1995.

[8] J. Demmel. Lecture notes on graph partitioning, part 2, April 1999. Notes for UC Berkeley CS267.

[9] M. Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematics Journal, 23:298–305, 1973.

[10] Y. Gdalyahu, D. Weinshall, and M. Werman. Self organization in vision: Stochastic clustering for image segmentation, perceptual grouping, and image database organization. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(10):1053–1074, 2001.

[11] H. Hotelling. Analysis of complex statistical variables in principal components. J. Educational Psychology, 24:417–441, 498–520, 1933.

[12] P. Jupp and K. Mardia. Maximum likelihood estimators for the matrix von Mises-Fisher and Bingham distributions. Annals of Statistics, 7:599–606, 1979.

[13] R. Kannan, S. Vempala, and A. Vetta. On clusterings — good, bad and spectral. In 41st Symposium on the Foundations of Computer Science, 2000.

[14] J. T. Kent. The Fisher-Bingham distribution on the sphere. J. Royal Statistical Society, B, 44:71–80, 1982.

[15] J. T. Kent. The complex Bingham distribution and shape analysis. J. Royal Statistical Society, B, 56:285–299, 1994.

[16] J. B. Kruskal and M. Wish. Multidimensional Scaling. Sage Publications, Beverly Hills, 1978.

[17] K. Mardia. Directional statistics and shape analysis. Technical Report 24, U. Leeds Department of Statistics, 1995.

[18] M. Meila and J. Shi. Learning segmentation by random walks. In Proc. Adv. Neural Info. Processing Systems, pp. 873–879, 2000.

[19] M. Meila and J. Shi. A random walks view of spectral segmentation. In AI and STATISTICS (AISTATS) 2001, 2001.

[20] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Proc. Adv. Neural Info. Processing Systems, 2002.

[21] P. Perona and W. Freeman. A factorization approach to grouping. In Proc. Euro. Conf. Computer Vision, pp. 655–670, 1998.

[22] A. Robles-Kelly and E. Hancock. Pairwise clustering with matrix factorisation and the EM algorithm. In Proc. Euro. Conf. Computer Vision, pp. 63–77, 2002.

[23] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, December 22 2000.

[24] B. Schölkopf, A. Smola, and K. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.

[25] G. Scott and H. Longuet-Higgins. Feature grouping by relocalisation of eigenvectors of the proximity matrix. In Proc. British Machine Vision Conference, pp. 103–108, 1990.

[26] G. Scott and H. Longuet-Higgins. An algorithm for associating the features of two patterns. Proc. Royal Society London, B244:21–26, 1991.

[27] J. Shi and J. Malik. Motion segmentation and tracking using normalized cuts. In Proc. Int. Conf. Computer Vision, pp. 1154–1160, 1998.

[28] R. Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. Annals of Mathematical Statistics, 35:876–879, 1964.

[29] Y. Weiss. Segmentation using eigenvectors: A unifying view. In Proc. Int. Conf. Computer Vision, pp. 975–982, 1999.
