Nyström method (for low rank approximation)

I’ve been hard at work, mostly verifying that uniform column sampling + the SVD algorithm in Finding Structure with Randomness gives an even faster method with reasonable error bounds. As usual, I am stuck on the analysis.

Anyhoo, Joel suggested I look at the Nyström method for low rank approximation, which is apparently popular among machine learning folk who use kernel-based methods. You can read more about the origin of the method — it was inspired by the Nyström method for solving Fredholm integral equations of the second kind, or at least named after it because it can be ‘derived’ from this method — in Matrix Compression using the Nyström method. The most straightforward, useable bound I’ve seen is given by Talwalkar in Matrix Coherence and the Nyström Method, where he shows that if you construct your Nyström approximation from a reasonable number of columns and rows (depending upon the coherence of the right singular vectors of the matrix) and the matrix is PSD, then with large probability, the Nyström approximation is the original matrix.

So what is the Nyström method? If


\displaystyle
M = \begin{pmatrix} A & B \\ C & D
\end{pmatrix}

where A is k \times k, then the Nyström approximant is


\displaystyle
\tilde{M} = \begin{pmatrix}A \\ C \end{pmatrix} A^{\dagger} \begin{pmatrix} A & B \end{pmatrix}

In practice you’d want to sample rows and columns from through out the matrix, but it turns out that you get the same formula for the approximant (just plug in some permutation matrices and everything works out). This is a magical formula: if A happens to have the same rank as M, then you have exact recovery. This suggests that randomly sampling from the rows and columns of M may give you good results.

In fact, it looks like in the general case (meaning: when you apply the method to the datasets I’ve been working with), you get something on the order of 100-1000 times the optimal approximation error \sigma_{k+1}(M) even if you take A to be about k \log(k) . This is a lot worse than what I can get using the uniform column sampling, and additionally a lot slower, since you have to take the pseudoinverse of A and QR decompositions of the two matrices bordering A^{\dagger} in the formula for \tilde{M}, which may have a lot of rows/columns.

Why did I mention the Nyström method then? Well, for one thing, it’s extremely cool, both intrinsically and in terms of its derivation from the PDE Nyström method. But another is that it made me consider the question of getting low rank approximations to the pseudoinverses of matrices, rather than to the matrices themselves. In the case where the matrix is PSD and you have a good estimate \gamma for the norm, then you can do this by finding a low rank approximation to \gamma I - M, but in the general case, I don’t see any obvious reduction.

I imagine this has applications, so it might be an interesting future research problem.

update: Ok, the results look better if I use a comparatively large (6768×6768) dataset; then the error seems to be around 10\sigma_{k+1} on average. It takes a while to compute the relevant quantities, so I’ve only done three trials. Since the ideal is to work with huge datasets, maybe this Nyström method would be useful— but unlike the column sparsification method, I can’t see the benefits for matrices of which are conveniently sized for experiments?

Possibly relevant posts:

Tags: , , ,
No comments yet.

Leave a comment

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="">
CommentLuv Enabled