
A unifying theorem for spectral embedding and clustering

Matthew Brand and Kun Huang

Mitsubishi Electric Research Labs, Cambridge, Massachusetts, USA

Electrical & Computer Engineering, University of Illinois at Urbana-Champaign, USA

Abstract

Spectral methods use selected eigenvectors of a data affinity matrix to obtain a data representation that can be trivially clustered or embedded in a low-dimensional space. We present a theorem that explains, for broad classes of affinity matrices and eigenbases, why this works: For successively smaller eigenbases (i.e., using fewer and fewer of the affinity matrix's dominant eigenvalues and eigenvectors), the angles between "similar" vectors in the new representation shrink while the angles between "dissimilar" vectors grow. Specifically, the sum of the squared cosines of the angles is strictly increasing as the dimensionality of the representation decreases. Thus spectral methods work because the truncated eigenbasis amplifies structure in the data, so that any heuristic post-processing is more likely to succeed. We use this result to construct a nonlinear dimensionality reduction (NLDR) algorithm for data sampled from manifolds whose intrinsic coordinate system has linear and cyclic axes, and a novel clustering-by-projections algorithm that requires no post-processing and gives superior performance on "challenge problems" from the recent literature.

1 Introduction

Spectral methods for multivariate data analysis are notable both for their practical successes and for their rapidly developing theoretical underpinnings. A spectral algorithm typically begins with an "affinity matrix" of pairwise relationships between the samples or the variates, and derives a more useful representation of the data from its eigenvalue decomposition (EVD), often using just one or a few eigenvectors (a truncated eigenbasis). Many classic dimensionality reduction and nonlinear embedding algorithms have this character: Principal components analysis (PCA) [11] uses the variates' covariance matrix; multidimensional scaling (MDS) [16] uses the samples' pairwise distance matrix; kernel PCA [1,24] uses a kernel matrix where the kernel function κ(x_i, x_j) represents the dot product of samples in an unknown "feature space"¹; and locally linear embedding (LLE) [23] uses a matrix containing correlations of samples' barycentric coordinates.

Spectral methods have been even more successful for data clusterings and graph partitionings: Spectral bipartitioning [9,6] cuts a graph in two by thresholding the second eigenvector of the graph's normalized Laplacian matrix, and numerous clustering algorithms use selected eigenvectors of dot-product or kernel matrices to re-represent the data for clustering by simpler heuristics such as thresholding or K-means [25,2,5,7,27,21,29,13,3,18,19,4,20,22].

While the statistical basis and optimality of PCA are well understood, virtually all other spectral methods are motivated by imperfect analogies between data-derived graphs and physical problems (e.g., harmonic analysis² and random walks³), or as approximations to other problems (e.g., vector quantization [2], min-cut [27], or max-flow [6]).

Underlying all this work is the notion that the truncated eigenvector basis somehow makes the problem simpler for the subsequent analysis. Our theoretical goal is to explain how and why this works.

Embeddings and clusterings imply loss of information, but there has been little effort to bound or even quantify what is lost and characterize what is conserved. This is acutely true for the vast majority of algorithms in which the spectral analysis is just a prelude to further information-lossy data analysis. Promising steps in the right direction include work by Alpert & Yao [2] that equates spectral partitioning with vector quantization (thereby implying an objective function), and analyses by Fiedler, Perona & Freeman, Shi & Malik, Meila & Shi, and Weiss et al. [9,21,27,29,19,20] that justify using one or a few eigenvectors as a cluster indicator when the data is already clustered or nearly so. In particular, if the affinity matrix already has a block structure, then some of its eigenvectors will be piece-wise constant, such that if items i and j share cluster membership, elements i and j in these eigenvectors will have the same value [29,19,20]. For nearly clustered data, the eigenvectors are approximately piecewise constant.

¹ Feature space is the "linearization space" of Aizerman et al. [1], in which Euclidean relationships between points are consistent with the kernel's similarity measure.

² The eigenvectors of a graph's normalized Laplacian matrix are analogous to the modes of vibration the graph would exhibit if shaken [8].

³ The eigenvectors of a stochastic matrix describe the steady-state properties of an infinite random walk on the graph [6,19].

However, this leaves open many questions, particularly when the data is not nearly clustered: What special properties should the affinity matrix have? Stochasticity? Unit diagonal? Positive definiteness? Unit spectral radius? Which and how many eigenvectors should be used? What information is conserved in a truncated eigenbasis? Obviously, the answers to these questions should inform the post-processing of the eigenvectors.

In this paper we develop a unifying view of spectral methods that answers many of these questions and gives guidance for the construction of clustering and nonlinear dimensionality reduction algorithms. For most of our discussion, it will be useful to think of the affinity matrix in terms of a (possibly unknown) kernel: Affinity value A_ij = κ(x_i, x_j) = Φ(x_i)^⊤ Φ(x_j) is the dot product of two vectors representing the (usually unknown) locations of points i and j in a high-dimensional feature space associated with the kernel. Spectral analysis gives a new data representation derived from the eigenvalues and eigenvectors of the symmetric affinity matrix A. To summarize the main theoretical result:

(1) An eigenvalue-scaled eigenvector representation of the data encodes angles (equivalently, correlations) between points embedded on the surface of a hypersphere.

(2) When the representation is truncated by suppressing the smallest magnitude eigenvalues, the angles (equiv., correlations) between high-affinity points are least distorted, highlighting the manifold structure of the data.

(3) As the representation is further truncated, the angles (equiv., correlations) decrease between points having high affinity and increase between points having low affinity, highlighting the cluster structure of the data.

In short, nonlinear dimensionality reduction and clustering can be obtained from the same process. The theorem is limited to symmetric non-negative definite affinity matrices, but a corollary establishes relevance to non-positive matrices as well, and to asymmetric matrices (e.g., B) via their Grams (B^⊤B or BB^⊤).

In the remainder of the paper we leverage this theorem into novel methods for nonlinear dimensionality reduction (NLDR) and clustering. The NLDR algorithm maps the data to a mixed vector and toric space, with the linear or cyclic nature of each axis determined from statistical tests. The clustering algorithm works entirely by projections, whose information loss is easily characterized and minimized or bounded at each step. Experiments show that it produces high-quality clusterings of a wide variety of "challenge problems" exhibited in the recent literature. We also use it to solve an unusually difficult visual segmentation problem.
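As a concrete illustration of this setup (a sketch of ours, not from the paper), the following Python/NumPy fragment builds a kernel affinity matrix A_ij = κ(x_i, x_j)—using a Gaussian kernel as a stand-in for the possibly unknown kernel—and the eigenvalue-scaled eigenvector representation analyzed in the next section; the function names are hypothetical.

```python
import numpy as np

def affinity_matrix(X, sigma=1.0):
    """A_ij = kappa(x_i, x_j); a Gaussian kernel stands in for the (possibly unknown) kernel."""
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # pairwise squared distances
    return np.exp(-D2 / (2.0 * sigma ** 2))

def scaled_eigenvector_representation(A):
    """Return X = Lambda^{1/2} V^T from the EVD of the symmetric affinity matrix A."""
    lam, V = np.linalg.eigh(A)                 # eigh returns ascending eigenvalues
    lam, V = lam[::-1], V[:, ::-1]             # re-sort so eigenvalues are descending
    return np.sqrt(np.clip(lam, 0.0, None))[:, None] * V.T   # row k scaled by sqrt(lambda_k)

# Hypothetical usage on toy data; columns of X then represent the points:
# points = np.random.default_rng(0).normal(size=(30, 2))
# X = scaled_eigenvector_representation(affinity_matrix(points))
```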

2 The polarization theorem

Let A ∈ R^{D×D} be a non-negative definite symmetric matrix having eigenvalue decomposition (EVD) VΛV^⊤ = A with eigenvalues sorted in descending order on the diagonal of Λ. Define the representation X ≐ Λ^{1/2}V^⊤ and let the truncated representation X_(d) be the top d rows of X—the d principal eigenvectors scaled by the square roots of their associated eigenvalues. A well-known property of such truncated EVDs is that A_(d) ≐ X_(d)^⊤ X_(d) is the best rank-d approximation to A with respect to the Frobenius norm; equivalently, it is the most energy-preserving projection to rank d.

Let Y_(d) be an angle-preserving projection of the column vectors of X_(d) onto the surface of a d-dimensional hypersphere, obtained by scaling each column of X_(d) to unit norm. The angle between two column vectors x_i, x_j ∈ X_(d) (equivalently y_i, y_j ∈ Y_(d)) is

    θ_ij ≐ ∠(x_i, x_j) = arccos( x_i^⊤ x_j / (‖x_i‖·‖x_j‖) ),    (1)

and the correlation between two vectors x_i, x_j ∈ X_(d) (equivalently y_i, y_j ∈ Y_(d)) is

    corr(x_i, x_j) = y_i^⊤ y_j = cos θ_ij.    (2)

We may now state the main result:

Theorem (polarization): As positive (resp., non-negative) A is projected to successively lower ranks A_(D−1), A_(D−2), …, A_(d), …, A_(2), A_(1), the sum of squared angle-cosines ∑_{i≠j} (cos θ_ij)² (equivalently squared correlations ‖Y_(d)^⊤ Y_(d)‖_F²) is strictly increasing (resp., non-decreasing).

In short, as the dimensionality of the representation is reduced, the distribution of cosines migrates away from 0 toward the two poles ±1, such that angles migrate from θ_ij = π/2 to θ_ij ∈ {0, π}.

The full proof requires a large number of lemmas and runs to several pages; because of page limits it will be published separately. The following proof sketch gives the flavor of the argument: The identity diag(Λ) = (V ∘ V) diag(A) allows one to derive the distribution of the nonzero eigenvalues [γ_1, …, γ_d] of the cosine matrix Y_(d)^⊤ Y_(d) = diag(diag(A_(d)))^{−1/2} A_(d) diag(diag(A_(d)))^{−1/2}. One can then show that as d ↓ 1 the variance of the eigenvalues grows. However, the projection onto the hypersphere keeps the mean root-eigenvalue constant at (1/d) ∑_i^d γ_i^{1/2} = 1. Therefore the sum ∑_i^d γ_i = trace(Y_(d)^⊤ Y_(d)) = ‖Y_(d)^⊤ Y_(d)‖_F² = ∑_{i≠j} (cos θ_ij)² + D grows monotonically.
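As a purely numerical illustration of the theorem (ours, not part of the proof), the sketch below evaluates ∑_{i≠j}(cos θ_ij)² on a toy Gaussian-kernel affinity matrix for a decreasing sequence of ranks d; the printed values should not decrease as d shrinks.

```python
import numpy as np

def cosine_energy(A, d):
    """Sum of squared off-diagonal cosines for the truncated representation X_(d)."""
    lam, V = np.linalg.eigh(A)
    lam, V = lam[::-1], V[:, ::-1]                                    # descending eigenvalues
    X = np.sqrt(np.clip(lam[:d], 0.0, None))[:, None] * V[:, :d].T    # X_(d)
    Y = X / np.linalg.norm(X, axis=0, keepdims=True)                  # project columns onto the hypersphere
    C = Y.T @ Y                                                       # cosines of the pairwise angles
    return np.sum(C ** 2) - np.sum(np.diag(C) ** 2)                   # drop the diagonal terms (each = 1)

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])  # two toy clusters
D2 = np.sum((pts[:, None] - pts[None]) ** 2, axis=-1)
A = np.exp(-D2 / 2.0)                                                 # positive Gaussian-kernel affinity

for d in range(pts.shape[0], 0, -5):
    print(d, round(cosine_energy(A, d), 3))                           # non-decreasing as d decreases
```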

The following two corollaries will be developed into algorithms in the remainder of the paper:

Corollary (embedding): Suppressing the smallest-magnitude eigenvalues of A gives a d < D dimensional embedding in which small angles are least distorted.


This unsurprising corollary is quite similar to the motivation for PCA. In our case, the mass-preserving embedding spreads the data out on the hypersphere surface, preserving small angles most accurately because their cosines comprise most of the energy in the affinity matrix. This means that local relations between nearby points are well preserved. In the next section we show that this allows one to construct a relatively low-dimensional embedding for affinity data, with the unusual feature that the embedding space may have both linear and cyclic degrees of freedom.

Corollary (clustering): Truncation of the eigenbasis amplifies any unevenness in the distribution of points on the d-dimensional hypersphere by causing points of high affinity to move toward each other and others to move apart.

In short, the distribution approaches a clustering for small d ≪ D. This explains many, if not all, spectral clustering methods: Using a subset of all the eigenvectors emphasizes the data's cluster structure, improving the output of any heuristic clustering procedure. This does not mean that the lowest-dimensional embedding is the best one for clustering; there is a tradeoff between amplifying cluster structure and losing information. In section 4, we show that by using a large subset of eigenvectors one can depend entirely on projections and EVDs to do the clustering, removing the need for heuristic post-processing.

Although the theorem is limited to non-negative symmetric affinity matrices, it also has explanatory value for spectral methods that employ selected eigenvectors of non-positive matrices: Every real symmetric matrix can be written C = A − B, where positive semi-definite matrices A and B satisfy rank(A) + rank(B) = rank(C) and A, constructed from the positive part of C's spectrum, is the best (least-squares) Gram approximation of C (one offering a real-valued decomposition A = X^⊤X). The theorem applies to A. For example, Weiss [29] showed that clustering methods related to the min-cut problem (e.g., [21,27,18]) are of this nature: Although posed as generalized eigenvalue problems with non-positive Laplacian matrices, these algorithms ultimately consult a single eigenvector from the positive part of the spectrum of the normalized Laplacian matrix. It is worth noting that it follows from basic properties of the normalized Laplacian [6] that either (1) this eigenvector is only approximately piecewise constant, presenting some uncertainty for the final clustering, or (2) eigenvalue multiplicity makes the choice of eigenvector ambiguous. The polarization theorem suggests that additional relevant information lies in the remaining eigenvectors; below we construct an algorithm that exploits this information and eliminates the abovementioned ambiguities.
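For concreteness, here is a minimal sketch (ours) of the split just described: given a real symmetric C, build A from the positive part of its spectrum, so that C = A − B with A the least-squares Gram approximation to which the theorem applies.

```python
import numpy as np

def positive_gram_part(C):
    """Split a real symmetric C into C = A - B, with A built from the positive eigenvalues.
    A is positive semi-definite and admits a real-valued decomposition A = X^T X."""
    lam, V = np.linalg.eigh(C)
    pos = lam > 0
    A = (V[:, pos] * lam[pos]) @ V[:, pos].T          # least-squares Gram approximation of C
    B = -(V[:, ~pos] * lam[~pos]) @ V[:, ~pos].T      # negated negative part, so C = A - B
    return A, B
```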

3 Dimensionality reduction

Motivated by the first corollary above, we observe that Y_(d) is a low-dimensional nonlinear embedding of the data onto the surface of a d-dimensional hypersphere, with the arc-length between two points inversely related to their affinity score. Let us choose a dimensionality d that truncates only the lesser eigenvalues of the affinity matrix A. As in PCA, this choice is usually a matter of eyeballing the eigenvalue spectrum, unless one has prior knowledge about the true noise levels in the data and how they affect the kernel.

Since it is difficult to work with spherical embeddings, our goal is to "re-embed" the data in a vector space, where possible. Let p be the point on the hypersphere surface having smallest arc-length to all points in Y and let [u_1, …, u_{d−1}] be an orthogonal basis of the hyperplane tangent to the surface at p. (Note that U ≐ [u_1, …, u_{d−1}, p] satisfies U^⊤U = I.) Each u_i specifies a direction around the hypersphere, which can be visualized as a great circle parallel to u_i and passing through p. Each u_i can be interpreted as one of the axes of a projection to R^a × T^b—the Cartesian product of an a-dimensional vector space and a b-dimensional toric⁴ space—with a + b = d − 1. If the data wraps fully around the hypersphere along direction u_i, then axis i is cyclic; if the data wraps only partway around the hypersphere along direction u_j, then axis j is linear.

Much as PCA axes are statistically motivated from the multivariate gaussian distribution, the tangent point p and axes U can be estimated by fitting a gaussian distribution to the surface of a hypersphere. The complex Bingham distribution [15] is a multivariate gaussian density on y ∈ CS^{d−1} ⊂ C^d conditioned on the fact that all vectors are unit-length (y*y = 1, where y* denotes the complex conjugate transpose):

    p(y | Σ) = C(Σ)^{−1} exp(y*Σy).    (3)

The complex Bingham is parameterized by the hermitian matrix Σ = U diag([κ_1, …, κ_d]) U*, whose eigenvalues κ_1 < κ_2 < … < κ_{d−1} < κ_d = 0 are the concentration parameters of the density. A strongly negative κ_i ≪ 0 indicates that the density has little extent along direction u_i. The last eigenvector u_d points to the mode of the distribution, thus p = u_d. The normalizing constant C(Σ) is calculated via the matrix confluent hypergeometric function ₁F₁ [14], which in this case has a compact form discovered by Kent [15]:

    C(Σ) = ₁F₁(d/2, 1/2, diag([κ_1, …, κ_d]))    (4)
         = 2π^{d−1} ∑_{j=1}^{d−1} e^{κ_j} / ∏_{i≠j} (κ_j − κ_i).    (5)

Jupp and Mardia [12] showed that the direction vectors in U and the concentration parameters [κ_1, …, κ_{d−1}] are related to the scatter of Y through its EVD U diag([λ_1, …, λ_d]) U^⊤ = YY^⊤, with 0 < λ_1 < λ_2 < ⋯ < λ_d satisfying

    λ_j = ∂ log C(Σ) / ∂κ_j.    (6)

⁴ A toric space T^n = (S¹)^n has n cyclic axes but, unlike spherical space S^n, every point has a unique set of ordinates modulo 2π; there are no poles presenting singularities. E.g.: when walking, leg and arm phase are two cyclic variates in T²; but the orientation of a featureless cone in 3-space is described by two variates in S² (a.k.a. Euler angles).

For large sample sizes with concentrated density, κ_j ≈ −c/λ_j for some constant c. Numerical solution for the concentration parameters in higher dimensions is feasible but nontrivial. Fortunately, for dimensionality reduction, knowing the EVD of the scatter suffices: Its eigenvectors U = [u_1, …, u_{d−1}, p] are exactly the maximum likelihood (ML) estimate of the point of tangency (a.k.a. the modal direction) and the axes of the tangent space, while its eigenvalues [λ_1, …, λ_d] give a rough indication of which axes are cyclic: A small eigenvalue (λ_i ≈ 0) indicates that the data is approximately linear in direction u_i; a large eigenvalue (λ_i ≈ λ_d) indicates the axis is cyclic.
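A minimal sketch (ours) of this step: the EVD of the real scatter YY^⊤ supplies the modal direction and tangent axes, and the eigenvalue magnitudes give a rough cyclic-vs-linear indication. The ratio threshold below is a heuristic of this sketch only; the paper makes the precise decision with the uniformity test of Eq. (7) below.

```python
import numpy as np

def bingham_axes(Y, cyclic_ratio=0.5):
    """Modal direction and tangent axes from the scatter of the unit columns of Y (d x N)."""
    lam, U = np.linalg.eigh(Y @ Y.T)                  # ascending eigenvalues of the scatter
    p = U[:, -1]                                      # modal direction (largest eigenvalue)
    axes = U[:, :-1]                                  # tangent directions u_1, ..., u_{d-1}
    looks_cyclic = lam[:-1] > cyclic_ratio * lam[-1]  # large eigenvalue -> axis likely wraps around
    return p, axes, looks_cyclic
```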

Figure 1: Dimensionality reduction of points distributed on a 3D cylinder (in 10D space) to the 2D space R¹ × T¹—one linear axis and one cyclic axis. 500 points are randomly generated on the surface of a 3D cylinder embedded in 10D space and contaminated with isotropic 10D gaussian noise. The top two images show a 3D and a 2D projection of the points. The bottom two images show an embedding of the affinity matrix on the surface of a 3D sphere (side view, with the mode and the Bingham axes—linear and cyclic—marked). The embedding forms a wide belt around the equator. The arrows show the modal direction and two degrees of freedom of a Bingham distribution fitted to the embedding. Statistical tests for uniformity indicate that the data is cyclic around the equator but linear in the other direction.

Figure 2: Dimensionality reduction of points distributed on a curled plane in 10D space to the 2D vector space R². The data is analyzed as in figure 1, but the spherical embedding does not wrap all the way around the equator. The statistical test indicates that the data is Bingham-distributed in both directions, so the sphere's surface is azimuthally mapped to a 2D vector space, recovering the original planar coordinate system. Since the original data manifold passes close by itself, some points, particularly at the ends, have affinity for nonlocal neighbors, resulting in some distortion in the recovered coordinates.

To test more precisely whether the data is cyclic in direction u_i around the surface of the hypersphere, consider the projection of the data onto the great circle parallel to u_i and passing through the mode u_d. A uniform distribution on this circle (= S¹ = T¹) implies that the data is cyclic in direction u_i. The projection is Z ≐ [z_1, …, z_N] with z_j ∝ [u_d, u_i]^⊤ y_j, ‖z_j‖ = 1. With this projection, we can apply a result of Mardia [17] which gives a statistical test to assess the hypothesis that a distribution on the (perimeter of the) unit circle is uniform rather than complex Bingham: Let γ_1, γ_2 be the eigenvalues of the normalized scatter ZZ^⊤/n. Then, for large data-sets (n ≫ 0),

    3n(γ_1 − γ_2)² ∼ χ²₃,    (7)

where χ²₃ is a chi-squared distribution with three degrees of freedom. Thus we may reject the hypothesis that axis u_i is uniform (cyclic) with confidence Pr(χ²₃ > 3n(γ_1 − γ_2)²).
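A minimal sketch (ours) of this test, assuming SciPy for the χ²₃ tail probability; Z holds the unit-norm circular projections as 2 × n columns and the function name is hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def circle_uniformity_tail(Z):
    """Compute Pr(chi2_3 > 3n(gamma_1 - gamma_2)^2) for the circular projections Z (2 x n)."""
    n = Z.shape[1]
    gammas = np.linalg.eigvalsh(Z @ Z.T / n)          # eigenvalues of the normalized scatter
    stat = 3.0 * n * (gammas[1] - gammas[0]) ** 2     # 3n(gamma_1 - gamma_2)^2
    return chi2.sf(stat, df=3)                        # upper-tail probability of chi-squared(3)
```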

Once identified, the non-cyclic axes are isolated by projecting Y onto the union of the non-cyclic axes and p, then rescaling the resulting vectors to unit norm. After projection onto this reduced hypersphere, the modal point is [0, …, 0, 1] and the Bingham axes are similarly axis-aligned. An azimuthal equidistant mapping at the modal point then takes the points into a vector space. The process is illustrated in figure 1 for 10D points noisily sampled from a 2D nondevelopable⁵ manifold having genus 1 (a cylinder) and in figure 2 for 10D points noisily sampled from a 2D developable manifold having genus 0 (a rectangular plane curled into a "swiss roll").

⁵ A "developable" d-dimensional manifold embedded in R^n can be mapped to R^d without internal distortions, e.g., a developable surface can be unrolled and flattened without stretching.
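The azimuthal equidistant mapping mentioned above has a simple closed form once the mode has been rotated to [0, …, 0, 1]; a minimal sketch (ours):

```python
import numpy as np

def azimuthal_equidistant(Y):
    """Map unit columns of Y (m x N, mode at [0,...,0,1]) into an (m-1)-dimensional vector space,
    preserving each point's geodesic distance to the mode and its direction in the tangent plane."""
    alpha = np.arccos(np.clip(Y[-1], -1.0, 1.0))      # arc-length from the modal point
    tangent = Y[:-1]
    norms = np.maximum(np.linalg.norm(tangent, axis=0), 1e-12)
    return tangent * (alpha / norms)                  # (m-1) x N planar coordinates
```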


The affinity matrix for all data-sets in this paper is A_ij = κ(x_i, x_j) ∝ exp(−‖x_i − x_j‖²/(2σ²)), where σ is taken to be the average of the distances between each point and its closest neighbor (which we denote σ_s). For simplicity of analysis, we normalize the affinity matrix by projecting it to the nearest doubly stochastic matrix P ≐ diag(d) A diag(d), using a fast modification of the Sinkhorn procedure [28] to solve for d satisfying P1 = 1 (and P^⊤1 = 1 since P^⊤ = P). While not crucial to this method, a doubly stochastic matrix has two properties that make it appealing as a model of the data: P can be interpreted as the transition probabilities of a random walk on the data; and P's largest eigenvalue λ_max = 1 has corresponding eigenvector u_1 ∝ 1, which implies that the stationary distribution of the random walk is uniform (every point is equally probable). To obtain the embeddings in the figures, we discarded the totally uninformative u_1 and constructed Y_(d) as in section 2 from eigenvectors 2–4 of P, then fitted Bingham densities to the results to determine the appropriate embedding in R^a × T^b.
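A minimal sketch (ours) of this normalization: the Gaussian affinity uses σ_s (the mean nearest-neighbor distance), and the balancing is a plain damped Sinkhorn-style iteration rather than the paper's fast modification of [28]; the function names are hypothetical.

```python
import numpy as np

def gaussian_affinity(X, sigma_scale=1.0):
    """A_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) with sigma = sigma_scale * sigma_s,
    where sigma_s is the mean distance from each point to its nearest neighbor."""
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    nn = np.sqrt(np.where(np.eye(len(X), dtype=bool), np.inf, D2)).min(axis=1)
    sigma = sigma_scale * nn.mean()
    return np.exp(-D2 / (2.0 * sigma ** 2))

def sinkhorn_symmetric(A, iters=500):
    """Balance a symmetric non-negative A to a doubly stochastic P = diag(d) A diag(d).
    Plain damped fixed-point iteration for d_i = 1/(A d)_i; the paper uses a faster
    modification of the Sinkhorn procedure [28]."""
    d = 1.0 / A.sum(axis=1)
    for _ in range(iters):
        d = np.sqrt(d / (A @ d))       # damped update toward the balancing fixed point
    return d[:, None] * A * d[None, :]

# Hypothetical usage: P = sinkhorn_symmetric(gaussian_affinity(points))
```

The sigma_scale argument stands in for the multiples of σ_s (e.g., σ_s/2, 2σ_s, 10σ_s) explored in Figure 4 below.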

One could also make embeddings in R^a × T^b × S^c, though testing uniformity hypotheses on S^c for c > 1 requires explicit calculation of the Bingham concentration parameters.

In practice, we find that the ML modal estimator for the complex Bingham distribution is rather sensitive to noise; many samples may be necessary to get a good estimate. Fisher or Watson distributions may be better behaved, but currently they are less tractable analytically and computationally. We turn now to the problem of clustering, where we obtain an easily analyzed, highly competitive algorithm.

4 Clustering

Spectral methods have been extensively studied in graph partitioning and clustering problems. Fiedler [9] first showed that the eigenvector of the Laplacian matrix corresponding to the second eigenvalue gives an embedding of the graph in a real line; cutting this embedding at the origin gives a bipartitioning of the graph. This was extended to k-way partitioning, where the feature points are mapped into a k-dimensional space with the new coordinates being the normalized row vectors of the matrix formed by the first k eigenvectors of the affinity matrix [25,26]. Similarly, in Ng et al. [20], the normalized row vectors of the matrix formed by the first k weighted eigenvectors are used as the input to a k-means clusterer, and a perturbational analysis was used to show that the results should be stable if the data is already "nearly clustered". In Chan et al. [5], the directional angle between the row vectors of the first k eigenvectors of the Laplacian matrix was used as a new distance measure for partitioning. Alpert & Yao [2] equated partitioning with the problem of clustering these row-vectors, and found that the more eigenvectors used, the better. Spectral bipartitioning methods were adapted for visual clustering problems by Perona and Freeman [21] and Shi & Malik [27]. Analyses by Weiss [29] and Meila & Shi [18,19] showed that normalizing a nearly block-structured affinity matrix makes its eigenvectors approximately piecewise constant, and therefore easy to interpret as cluster assignments. However, such structure is not guaranteed for real problems, and some post-processing is necessary.

Figure 3: Spectral clustering of data distributed in three rings. (a) Cluster assignments are indicated by different markers. (b) The initial log-affinity matrix, sorted by true clusters (the algorithm is blind to such orderings). (c) The P matrix at convergence after 4 iterations. (d) In the converged representation, all the points belonging to any one cluster are co-located in a "corner" of a 3D sphere.

Our goal in visiting this already crowded field is to eliminate heuristic post-processing steps. Based on our theoretical result, we constructed one (of many possible) spectral clustering algorithms in which there is no post-EVD clustering or thresholding; instead, the stochastic eigenvectors form a discrete indicator matrix showing the membership of each point.

Our basic strategy is to cast clustering as two alternating projections: projection to low rank, and projection to the set of zero-diagonal doubly stochastic matrices. In both cases it is easy to characterize what is conserved and quantify what is lost. The projection to lower rank, A → A_(d) (or P → A_(d)), is exactly the process characterized by the polarization theorem: We polarize the distribution of angles with minimal loss of energy ‖A − A_(d)‖_F². The projection to a zero-diagonal doubly stochastic matrix, A_(d) → P = diag(d)(A_(d) − diag(diag(A_(d)))) diag(d), suppresses any differences in the stationary probability of points induced by the projection to low rank. Disregarding the suppressed diagonal, this projection is simply an angle-preserving rescaling of the embedding vectors, because diag(d) A_(d) diag(d) = diag(d) X_(d)^⊤ X_(d) diag(d) = (X_(d) diag(d))^⊤ (X_(d) diag(d)). Suppressing the diagonal induces negative eigenvalues in the spectrum of P (associated with removing energy placed on the diagonal by the positive eigenvalues); these eigenvalues account for less than half the energy ‖P‖_F² in P. We project to lower rank by suppressing the negative eigenvalues and the uninformative unit eigenvalue. This gives an automatic determination of d and a bound on the loss of variance. The alternating projections terminate when the resulting P matrix has two or more stochastic (unit) eigenvalues, implying reducibility. (A reducible matrix P can be row- and column-permuted into block-diagonal form, e.g., figure 5c.)

Figure 4: A gallery of "challenge problems" adapted from [20] and successfully clustered by our method. Cluster membership is indicated by marker symbol. As with all radial kernel methods, clustering reflects connectivity and scale similarity; thus the size of the kernel has an effect on the results: The same data-set is differently clustered with σ = 10σ_s in (c) and σ = 2σ_s in (d) (setting σ = σ_s breaks the sides from the corners). A very similar data-set was only bipartitioned after a search over σ in [20]. Similar results can be observed for (e) and (f), where σ = σ_s/2 for (e) and σ = σ_s for (f).

Analysis: It can be shown that (1) if P has unique positive nonstochastic eigenvalues, alternating the projections P → A_(d) and A_(d) → P will drive the leading eigenvalues up toward the positive bound λ_i ≤ +1 and all other eigenvalues down toward the negative bound λ_i ≥ −1. Formally, the vector norm of the eigenvalues increases while their sum remains constant. (2) Once P becomes reducible (has multiple stochastic eigenvalues λ_i = +1), the matrix X_(k) whose k columns are P's stochastic eigenvectors has exactly k unique rows. (3) Let Z_(k) be a matrix formed from these rows. The product Z_(k) Z_(k)^⊤ is diagonal, and the product X_(k) Z_(k)^⊤ (Z_(k) Z_(k)^⊤)^{−1} = X_(k) Z_(k)^{−1} is a binary (0/1) cluster indicator matrix that maps all points in a single cluster to a unique positive axis of R^k (e.g., figure 3d).

It remains to be shown whether the conditions in (1) are sufficient to guarantee absolute convergence. All our experiments, including those of figures 3–6, have converged quite quickly.
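Putting the two projections and the indicator readout together, a minimal sketch (ours); the unit-eigenvalue tolerance, the clipping of small negative entries before balancing, and the farthest-row selection of Z_(k) are heuristics of this sketch rather than the paper's exact procedure.

```python
import numpy as np

def _balance(A, iters=500):
    """Symmetric Sinkhorn-style balancing to doubly stochastic (cf. the section 3 sketch)."""
    A = np.clip(A, 0.0, None)                      # balancing assumes non-negative entries
    d = 1.0 / A.sum(axis=1)
    for _ in range(iters):
        d = np.sqrt(d / (A @ d))
    return d[:, None] * A * d[None, :]

def cluster_by_projections(A, tol=1e-6, max_iters=100):
    """Alternate low-rank projection (drop negative and unit eigenvalues) with projection to a
    zero-diagonal doubly stochastic matrix until P has two or more (near-)unit eigenvalues."""
    P = _balance(A - np.diag(np.diag(A)))
    for _ in range(max_iters):
        lam, V = np.linalg.eigh(P)
        unit = lam > 1.0 - tol                     # stochastic (unit) eigenvalues
        if unit.sum() >= 2:                        # reducible: cluster structure revealed
            break
        keep = (lam > tol) & ~unit                 # suppress negative and unit eigenvalues
        A_d = (V[:, keep] * lam[keep]) @ V[:, keep].T
        P = _balance(A_d - np.diag(np.diag(A_d)))
    # Readout: rows of the stochastic eigenvectors X_(k) coincide within a cluster; pick k
    # mutually distant rows as Z_(k) and map through its inverse (Analysis, item 3).
    X_k = V[:, unit]
    k = X_k.shape[1]
    idx = [int(np.argmax(np.linalg.norm(X_k, axis=1)))]
    for _ in range(k - 1):
        dists = np.linalg.norm(X_k[:, None, :] - X_k[idx][None, :, :], axis=-1).min(axis=1)
        idx.append(int(np.argmax(dists)))
    indicator = X_k @ np.linalg.inv(X_k[idx])      # approximately a 0/1 cluster indicator
    return np.argmax(indicator, axis=1)            # cluster label for each point
```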

When the affinity matrix is produced by a gaussian kernel, this procedure groups points that all have similar affinity values, essentially creating clusters where the inter-point distances are all on the same scale. Clusters may be non-convex and wrap around each other (e.g., figure 4a, figure 4b,e,f; figure 5a,b), but the results are generally in agreement with human judgment; some authors have drawn connections between gaussian kernel clustering and human perceptual gestalts [21,20].

This procedure automatically produces multi-way partitions without prior knowledge of the number of clusters. However, if the data contains clusters at different scales, the P matrix may become reducible before all the clusters have been revealed, giving only a partial clustering. Often we find that some remaining non-stochastic eigenvalues are close to 1, indicating that the clusters can be further subpartitioned. We partition the P matrix according to its stochastic eigenvectors and continue the alternating projections on its submatrices, thereby obtaining a hierarchical clustering. Figure 5 treats a well-known problem this way.
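A rough sketch (ours) of this recursion, reusing the hypothetical cluster_by_projections helper above; it simply restarts from each cluster's affinity submatrix and recurses to a fixed depth, whereas the paper continues the alternating projections on submatrices of P and recurses when remaining non-stochastic eigenvalues are close to 1.

```python
import numpy as np

def hierarchical_clusters(A, depth=1):
    """Cluster with the alternating projections, then subpartition each cluster's submatrix."""
    labels = cluster_by_projections(A)
    if depth <= 0:
        return labels
    out = np.zeros_like(labels)
    next_id = 0
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        if len(members) < 3:                       # too small to subpartition
            out[members] = next_id
            next_id += 1
            continue
        sub = hierarchical_clusters(A[np.ix_(members, members)], depth - 1)
        out[members] = next_id + sub
        next_id += int(sub.max()) + 1
    return out
```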

4.1 Application to motion segmentation

Figure 6: Nonrigid motion segmentation by spectral clustering. 500 frames of 2D tracking data from three people are scaled, centered, and superimposed to remove spatial cues that might aid grouping. The tracking matrix is factored P → M̃S̃ via SVD; spectral clustering of the columns of S̃ correctly groups the points on the basis of correlated motion through time. The grayscale images depict the top ten eigenvectors of the affinity matrix at initialization (far left) and of the P matrix at convergence (far right) after 11 iterations. At initialization, there is a noisy hint of the clustering in eigenvectors 2 and 5; at convergence, the clustering is clear in piece-wise constant eigenvectors 1–3, which are associated with unit eigenvalues. Other eigenvalues are near 1, indicating that the faces can be subpartitioned; this results in segmentation of the jaws. (Panels: (a) all points, frame 101; (b) segmentation; (c) individual clusters, separated to show correctness of segmentation.)

Figure 5: Hierarchical clustering of a problem treated in [18] and [20]. Our algorithm first gives a bipartitioning (a), then recursively analyzes the two subsets, using submatrices of P from the first partitioning. In the second round one subset is immediately bipartitioned to give (b). One point appears mis-classified, but its assignment may be consistent with the kernel, since its distances to other points are more consistent with the scale of inter-point distances on the line (◦'s) than of those in the tight cluster (♦'s).

Spectral clustering has become the preferred method for segmentation problems in computer vision [25,7,27,21,18,29,10,4,22]. Motion segmentation (e.g., [7]) takes 2D tracking data for a number of points in the scene, and seeks to group those points into independently moving 3D objects on the basis of correlated motion. The 2D projection of N points on the rigid 3D surface of object j in image I_f is given by P_fj = M_fj S_j ∈ R^{2×N}, where motion matrix M_fj encodes the position of the object relative to the camera, and shape matrix S_j ∈ R^{3×N} or ∈ R^{4×N} gives the 3D location of the points in object-centered or homogeneous coordinates. With multiple frames and multiple objects,

    P = MS ∈ R^{2F×(N_1+···+N_J)}                         (8)

        ⎡ M_11  ⋯  M_1J ⎤ ⎡ S_1  0    0  ⎤
      ≐ ⎢  ⋮    ⋱   ⋮   ⎥ ⎢  0   ⋱    0  ⎥                 (9)
        ⎣ M_F1  ⋯  M_FJ ⎦ ⎣  0   0   S_J ⎦

Each column of S describes one point; columns representing points from two different objects are orthogonal (two columns from the same object may be orthogonal as well, but no column can be orthogonal to all columns from the same object). Unfortunately, S cannot be recovered directly from P. Instead, a "thin" SVD can be used to factor P → M̃S̃ ≐ (MG^{−1})(GS), but the unknown matrix G may arbitrarily re-order or mix the columns of S, destroying the orthogonal structure. However, if the motions of the objects are approximately independent (they are never perfectly so), two columns in S̃ belonging to the same object will be more similar than two columns belonging to different objects, in the sense that their inner product will be more positive because the motions of the corresponding points are highly correlated. Therefore, spectrally clustering the columns of S̃ has been found to be highly successful for segmenting the set of points into objects (e.g., [4]).
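A minimal sketch (ours) of this pipeline, assuming a 2F × N tracking matrix P, the hypothetical cluster_by_projections helper from the clustering sketch above, and a Gaussian kernel on unit-normalized columns of S̃ (so that exp(−‖c_i − c_j‖²/(2σ²)) = exp((c_i^⊤c_j − 1)/σ²)); the rank argument and the kernel choice are illustrative, not prescribed by the paper.

```python
import numpy as np

def segment_tracks(P, rank, sigma=1.0):
    """Thin SVD of the tracking matrix, then spectral clustering of the columns of S_tilde."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    S_tilde = np.diag(s[:rank]) @ Vt[:rank]            # ~ G S, columns mixed by the unknown G
    C = S_tilde / np.linalg.norm(S_tilde, axis=0)      # unit-normalize columns before comparing
    A = np.exp((C.T @ C - 1.0) / sigma ** 2)           # Gaussian kernel on normalized columns
    return cluster_by_projections(A)                   # group columns (points) by correlated motion
```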

Two objections have been leveled at this approach: (1) It has only been applied to sequences in which the points can be segmented easily using simpler criteria such as spatial grouping. This is understandable; no one has tracking data for overlapping objects. (2) It is not clear that the method would extend to nonrigid objects, where the motions of points on one object are more weakly correlated.

To satisfy these objections, we constructed an unrealistically hard segmentation problem by superimposing dense face-tracking data from three "talking-heads" videos. The motion is highly nonrigid. To remove spatial separation cues, the data for each head was centered on the origin in each frame, so that when the data is combined all three faces overlap and there are no spatial or translational cues for segmentation. After centering, the motion of some points on a face is actually anti-correlated with most of the other points of the face (e.g., lower-lip versus upper-lip points). To accommodate the nonrigid motion, we increased the rank of the SVD by a factor of 5 (S_i ∈ R^{15×N}), which allows 5 modes of deformation per object but yields a harder, higher-dimensional clustering problem. Figure 6 shows that S̃ is perfectly clustered, yielding the correct motion segmentation. The tracking data, the evolution of the dominant eigenvectors, and the extracted clusters are shown in the accompanying video.


5 Summary

Spectral methods practitioners have long understood that a representation derived from selected eigenvectors of the affinity matrix somehow makes embedding and clustering problems easier for subsequent heuristic algorithms. To date, formal analyses have justified this approach only for problems with very obvious cluster structure and for certain kinds of affinity matrix. The polarization theorem of section 2 provides a unified explanation for virtually all the algorithms and affinity matrices in the cited literature: There exists an eigenvector representation which matches the angles between data-points in feature space; as the dimensionality of this representation is reduced, angles between similar points shrink while angles between dissimilar points grow. This highlights the cluster structure of the data and makes segmentation by heuristic methods significantly more likely to succeed. This theorem invites us to look at the representation as an embedding of the data on the surface of a hypersphere, where the inner product of two vectors gives the cosine of their angle. That insight led us to two algorithms: One finds nonlinear low-dimensional embeddings of data in spaces having a mixture of linear and cyclic axes; the other performs clusterings by repeated projections of the data, eliminating heuristic "post-clustering."

The clustering algorithm has the appeal that all steps are well characterized in terms of what information about the distribution is preserved or lost, and the amount of information loss can be bounded and/or minimized. It also performs very well in practice on both synthetic "challenge problems" from the literature and a real-world motion segmentation problem that is considerably harder than those contemplated in the computer vision literature.

We are currently exploring better distributions for spherical data, bounds on convergence rates, and bounds on the rate at which angles change as dimensionality is reduced.

In our work, we have benefitted from conversations with Sue Whitesides, Yoav Freund, Josh Tannenbaum, and Paul Viola, to whom we extend thanks.

References

[1] M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.

[2] C. Alpert and S. Yao. Spectral partitioning: The more eigenvectors, the better. In Proc. ACM/IEEE Design Automation Conference, 1995.

[3] Y. Azar, A. Fiat, A. Karlin, F. McSherry, and J. Saia. Spectral analysis of data. In Proceedings of the 33rd Symposium on Theory of Computing, pp. 619–626, 2001.

[4] M. Carcassoni and E. Hancock. A hierarchical framework for spectral correspondence. In Proc. Euro. Conf. Computer Vision, pp. 266–281, 2002.

[5] P. Chan, D. Schlag, and J. Zien. Spectral K-way ratio-cut partitioning and clustering. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 13(9):1088–1096, 1994.

[6] F. R. Chung. Spectral graph theory, volume 92 of CBMS Regional Conference Series in Mathematics. American Mathematical Society, 1997.

[7] J. Costeira and T. Kanade. A multi-body factorization method for motion analysis. In Proc. Int. Conf. Computer Vision, pp. 1071–1076, 1995.

[8] J. Demmel. Lecture notes on graph partitioning, part 2, April 1999. Notes for UC Berkeley CS267.

[9] M. Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematics Journal, 23:298–305, 1973.

[10] Y. Gdalyahu, D. Weinshall, and M. Werman. Self organization in vision: Stochastic clustering for image segmentation, perceptual grouping, and image database organization. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(10):1053–1074, 2001.

[11] H. Hotelling. Analysis of complex statistical variables in principal components. J. Educational Psychology, 24:417–441, 498–520, 1933.

[12] P. Jupp and K. Mardia. Maximum likelihood estimators for the matrix von Mises-Fisher and Bingham distributions. Annals of Statistics, 7:599–606, 1979.

[13] R. Kannan, S. Vempala, and A. Vetta. On clusterings — good, bad and spectral. In 41st Symposium on the Foundations of Computer Science, 2000.

[14] J. T. Kent. The Fisher-Bingham distribution on the sphere. J. Royal Statistical Society, B, 44:71–80, 1982.

[15] J. T. Kent. The complex Bingham distribution and shape analysis. J. Royal Statistical Society, B, 56:285–299, 1994.

[16] J. B. Kruskal and M. Wish. Multidimensional Scaling. Sage Publications, Beverly Hills, 1978.

[17] K. Mardia. Directional statistics and shape analysis. Technical Report 24, U. Leeds Department of Statistics, 1995.

[18] M. Meila and J. Shi. Learning segmentation by random walks. In Proc. Adv. Neural Info. Processing Systems, pp. 873–879, 2000.

[19] M. Meila and J. Shi. A random walks view of spectral segmentation. In AI and STATISTICS (AISTATS) 2001, 2001.

[20] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Proc. Adv. Neural Info. Processing Systems, 2002.

[21] P. Perona and W. Freeman. A factorization approach to grouping. In Proc. Euro. Conf. Computer Vision, pp. 655–670, 1998.

[22] A. Robles-Kelly and E. Hancock. Pairwise clustering with matrix factorisation and the EM algorithm. In Proc. Euro. Conf. Computer Vision, pp. 63–77, 2002.

[23] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, December 22 2000.

[24] B. Schölkopf, A. Smola, and K. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.

[25] G. Scott and H. Longuet-Higgins. Feature grouping by relocalisation of eigenvectors of the proximity matrix. In Proc. British Machine Vision Conference, pp. 103–108, 1990.

[26] G. Scott and H. Longuet-Higgins. An algorithm for associating the features of two patterns. Proc. Royal Society London, B244:21–26, 1991.

[27] J. Shi and J. Malik. Motion segmentation and tracking using normalized cuts. In Proc. Int. Conf. Computer Vision, pp. 1154–1160, 1998.

[28] R. Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. Annals of Mathematical Statistics, 35:876–879, 1964.

[29] Y. Weiss. Segmentation using eigenvectors: A unifying view. In Proc. Int. Conf. Computer Vision, pp. 975–982, 1999.
