K-means as a matrix optimization problem

I came across this formulation of k-means while trying to figure out what variants of nonnegative matrix factorization are used in practice:

The basic kernel k-means problem, given datapoints \{x_i\}_{i=1}^n, is to find clusters C_1, \ldots, C_k which minimize the sum of the distances of the points in the clusters from the cluster centers:
 \displaystyle \min_{C_1, \ldots, C_k} \sum_{i=1}^r \sum_{x_i\in C_r} \left\|\phi(x_i)-\frac{1}{|C_r|}\sum_{x_j \in C_r} \phi(x_j)\right\|^2

Here \phi maps the input points into some feature space; the kernel k(x, y) = \langle \phi(x), \phi(y) \rangle measures the similarity of the features of its inputs. Using the kernel, we can rewrite the above as
 \displaystyle \min_{C_1, \ldots, C_k} \sum_{i=1}^r \left( \sum_{x_i \in C_r} k(x_i, x_i)-\frac{1}{|C_r|} \sum_{x_i \in C_r} \sum_{x_j \in C_r} k(x_i, x_j) \right),
and since the first term is a constant, this is equivalent to
 \displaystyle \max_{C_1, \ldots, C_k} \sum_{i=1}^r \frac{1}{|C_r|} \sum_{x_i, x_j \in C_r} k(x_i, x_j).

Now let F_{ij} = 1/|C_r| if i,j \in C_r and 0 otherwise, and let K_{ij} = k(x_i, x_j) and write the above as
 \displaystyle \max_{\text{appropriate } F} \sum_{ij} F_{ij} K_{ij} = \max_{\text{appropriate } F} \text{Tr}(FK).
The question is, what is the appropriate set to maximize over?

The paper shows that F = GG^t where G is a rank k matrix with orthonormal columns and entries all nonnegative and F1 = 1. This is easy to see: by reordering the points x_i in a manner corresponding to the optimal cluster choice, so that each cluster is represented by a contiguous subsequence of the ordering, we find an equation relating the corresponding \hat{F} to F:
\displaystyle \sum_{ij} F_{ij} K_{ij} = \sum_{ij} \hat{F}_{ij}K_{\sigma(i), \sigma(j)} \Leftrightarrow F = P_\sigma \hat{F} P_\sigma^t
where P_\sigma is the permuation matrix that maps the ith standard basis vector to the \sigma(i)th.

By construction \hat{F} is block diagonal with k blocks each of the form \hat{F}_i = \lambda_i 11^t such that \hat{F}_i 1 = 1. Now take G to be the matrix whose ith column is the indicator vector for C_r scaled by 1/\sqrt{|C_r|}. It’s clear that \hat{F} = GG^t satisfies the above characterization, and conjugation with a permutation matrix doesn’t change this, so F satisfies the above characterization.

What I don’t see is how they justify the statement that this is the set to optimize over. Sure, each clustering is represented by some F in this set, but how do we know that the set of clustering matrices isn’t a proper subset of the set? Unless I’m missing something in their argument, it’s incomplete until we know that if F=GG^t where G is a rank k matrix with orthonormal columns and all nonnegative entries and F1 = 1, then F = P\hat{F}P^t for some permutation matrix P and block diagonal \hat{F} that has k blocks, each of which satisfies \hat{F}_i = \lambda_i 11^t and  \hat{F}_i1 = 1. (Well this, or that if you have a matrix in this set which doesn’t represent a clustering, there’s always a matrix in the set which represents a clustering and has a higher value of the objective)

Update It turns out that these sets are equivalent. It follows easily from the fact that G‘s entries are all nonnegative and G has orthogonal columns that there is a permutation matrix P so that (PG)(PG)^t is block diagonal (PG swaps the rows of G, and from the fact G^tG = I, we know there’s only one nonzero entry on each row of G). From F1 = GG^t 1=1, we know that (PG)(PG)^t1 = 1, so the blocks on the diagonal are doubly stochastic (since the matrix is symmetric). Now notice that each block matrix in (PG)(PG)^t is of the form xx^t where  x >0 and xx^t 1 = 1. This implies x is a multiple of 1 — otherwise, since the rows look like x_i ( [x_1, \ldots, x_n] ) , the row corresponding the smallest x_i would have a smaller sum than the others, so xx^t 1 would not be 1.

It would’ve been nice if the paper spelled this out.

Possibly relevant posts:

Tags: , , , ,
No comments yet.

Leave a comment

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="">
CommentLuv Enabled