K-means as a matrix optimization problem
I came across this formulation of k-means while trying to figure out what variants of nonnegative matrix factorization are used in practice:
The basic kernel k-means problem, given datapoints
, is to find clusters
which minimize the sum of the distances of the points in the clusters from the cluster centers:

Here
maps the input points into some feature space; the kernel
measures the similarity of the features of its inputs. Using the kernel, we can rewrite the above as

and since the first term is a constant, this is equivalent to

Now let
if
and 0 otherwise, and let
and write the above as

The question is, what is the appropriate set to maximize over?
The paper shows that
where
is a rank
matrix with orthonormal columns and entries all nonnegative and
. This is easy to see: by reordering the points
in a manner corresponding to the optimal cluster choice, so that each cluster is represented by a contiguous subsequence of the ordering, we find an equation relating the corresponding
to
:

where
is the permuation matrix that maps the
th standard basis vector to the
th.
By construction
is block diagonal with
blocks each of the form
such that
. Now take
to be the matrix whose
th column is the indicator vector for
scaled by
. It’s clear that
satisfies the above characterization, and conjugation with a permutation matrix doesn’t change this, so
satisfies the above characterization.
What I don’t see is how they justify the statement that this is the set to optimize over. Sure, each clustering is represented by some
in this set, but how do we know that the set of clustering matrices isn’t a proper subset of the set? Unless I’m missing something in their argument, it’s incomplete until we know that if
where
is a rank
matrix with orthonormal columns and all nonnegative entries and
, then
for some permutation matrix
and block diagonal
that has
blocks, each of which satisfies
and
. (Well this, or that if you have a matrix in this set which doesn’t represent a clustering, there’s always a matrix in the set which represents a clustering and has a higher value of the objective)
Update It turns out that these sets are equivalent. It follows easily from the fact that
’s entries are all nonnegative and
has orthogonal columns that there is a permutation matrix
so that
is block diagonal (
swaps the rows of
, and from the fact
, we know there’s only one nonzero entry on each row of
). From
, we know that
, so the blocks on the diagonal are doubly stochastic (since the matrix is symmetric). Now notice that each block matrix in
is of the form
where
and
. This implies
is a multiple of
— otherwise, since the rows look like
, the row corresponding the smallest
would have a smaller sum than the others, so
would not be
.
It would’ve been nice if the paper spelled this out.