Would decoupling be useful for this?

A colleague and I are now looking at a generalization of the problem from the previous post: Let \{X_1, \ldots, X_k \} be a collection of rank one sign matrices. How large does k need to be for \{X_1, \ldots, X_k\} to span a r-dimensional subspace of \mathbb{R}^{n \times n}?

Our idea was to note this is equivalent to asking when

 U = \begin{pmatrix} \text{vec}(X_1) \\ \vdots \\ \text{vec}(X_k) \end{pmatrix}

has rank r, then use the relation  \text{rank}(U) \geq \|U\|_F^2/\|U\|^2 = kn^2/\|U\|^2.

Accordingly, we need to know what k needs to be so it is extremely unlikely that  \|U\|^2 \geq kn^2/r. From two different directions we can get an idea of how large k should be:

  • If we approximated U with a matrix of Rademachers, we know that \|U\| \approx \sqrt{k} + n because it’s a subgaussian matrix in \R^{k \times n^2}. Thus we need k so \sqrt{\frac{k}{r}}n \geq \sqrt{k} + n. That is, k should be larger than r \frac{n^2}{(n - \sqrt{r})^2}.
  • Note that if we take a unit length vector a \in \R^{n^2} and partition it as a = (a_1, \ldots, a_n)^t, and furthermore write X_i = u_i v_i^t where u_i and v_i are Rademacher vectors of length n, then

     \displaystyle \|U a \|^2 = \sum_{i=1} \langle \text{vec}(X_i), a \rangle^2 =  \sum_{i=1}^k \left (\sum_{j=1}^n v_{i,j} \langle u_i, a_j \rangle \right)^2.

    The terms being indexed by i are i.i.d., so consider the behavior of \langle \text{vec}(X), a \rangle^2 = \sum_{j=1}^n v_j \langle u, a_j \rangle^2. Let’s say that by the CLT,  \langle u, a_j \rangle is distributed \mathcal{N}(0,\|a_j\|^2), then this sum is distributed \mathcal{N}(0,1). Thus \|U a\|^2 is a \chi^2(k) r.v., so a subexponential r.v. concentrated around k. From a covering set argument, it looks like the deviations are on the order of n^2, so we’d need kn^2/r \approx k + n^2, or  k to be larger than r \frac{n^2}{n^2 - r}.

The second intuitive argument seems to offer a route to actually finding a concrete relation k must satisfy: we need to show that \langle \text{vec}(X), a \rangle is subgaussian with explicit constants, then use that to show that \|Ua\|^2 is subexponential with explicit constants.

The big hurdle here is that the entries in \text{vec}(X) are not independent, so we can’t just use Hoeffding’s inequality. What we can say is that

 \displaystyle
\langle \text{vec}(X), a \rangle = \sum_{j=1}^n \langle u, a_j \rangle v_j,

where each term in the sum is subgaussian with tails like e^{-t^2/\|a_j\|^2}, but the fact that the terms aren’t independent prevents us from saying that the whole thing is subgaussian with a tail like e^{-t^2}.

So I’m wondering if there’s some kind of decoupling result I can use in this case? I know nothing about decoupling …

Update: \langle \text{vec}(X), a\rangle = \sum_{ij} \epsilon_i \epsilon_j^\prime a_{ij} where the \epsilon_i, \epsilon^\prime_i are Rademacher sequences and I’m considering a as a matrix in the natural manner. You can relate the moments of this to the moments of  \sum_{ij} \epsilon_i \epsilon_j a_{ij} where a_{ii} = 0. I found a nasty moment bound for this in the literature, without explicit constants. See Tail and moment estimates for some types of chaos. Good ol’ Latala. I refuse to believe that this problem requires so much complication, though.

Possibly relevant posts:

Mar 11th, 2010 | Filed under Mathematics
Tags:

The Noncommutative Bernstein inequality: the right tool for the job

I spent several days trying to compute how many random rank one sign matrices (each entry in \{-1, 1\}) you need to ensure that the projection of a fixed matrix onto their span, A_P = P_{\{X_1, \ldots, X_N\}}A satisfies \|A_P\|_F \leq c \|A\|_F with high probability for some  0 < c \leq 1 fixed.

First I noticed that the easiest way to approach this is to ask what N needs to be for \{X_1, \ldots,X_N\} to be full rank; equivalently, take  c = 1. Then I noticed that since \text{rank}(AA^t) = \text{rank}(A) (by the SVD), this is equivalent to asking when  \sum_{i=1}^N \text{vec}(X_i) \otimes \text{vec}(X_i) is full rank.

My first approach, to bound the moments of  \sum_{i=1}^N \text{vec}(X_i) \otimes \text{vec}(X_i) - I using symmetrization and the trace method to reduce to estimation of \mathbb{E}\text{Tr}((\sum_{i=1}^N \varepsilon_i\text{vec}(X_i) \otimes \text{vec}(X_i))^{2k}), then Markov's inequality to get a tail bound, was promising at first. Since each  X_i = u_i v_i^t where u_i and v_i are uniformly distributed random sign vectors, a lot of cancellation occurs, and basically you're left with a counting problem (count which terms in the expectation don't vanish). Unfortunately, this counting problem's a tough nut to crack, at least for me--- attempting to do so is what led me to the nasty recursion I mentioned in a previous post.

Then I came across Recht's paper, "A Simpler Approach to Matrix Completion", in which he proves the following Noncommutative Bernstein inequality:

Let S_1, \ldots, S_N be independent zero-mean random matrices of dimension d_1 \times d_2. Suppose \rho_k = \max\{\|\mathbb{E}(S_kS_k^\star)\|, \|\mathbb{E}(S_k^\star S_k)\| \} and \|S_k\| \leq M almost surely for all k. Then for any \tau > 0,


\displaystyle
\mathbb{P} \left(\left\|\sum_{k=1}^N S_k \right\| > \tau \right) \leq (d_1 + d_2) \exp \left( \frac{-\tau^2/2}{\sum_{k=1}^N \rho_k + M \tau/3} \right)

(apparently related Operator Bernstein and Vector Bernstein inequalities are provided in Gross’s “Recovering low-rank matrices from few coefficients in any basis.”)

With this tool, the problem’s trivial: take S_i = \text{vec}(X_i) \otimes \text{vec}(X_i) - I . Then \rho = \rho_k \leq n^2, M \leq n^2, and by plugging through the calculations, when  N = \text{O}(n^2 \log n),

 \displaystyle
\mathbb{P}\left( \left\| \frac{1}{N} \sum_{i=1}^N \text{vec}(X_i) \otimes \text{vec}(X_i) - I \right\| > \frac{1}{2} \right)

is small, so \{X_i\}_{i=1}^N is full rank with high probability.

I’m satisfied with the sharpness of this estimate for N: the smallest any collection of matrices spanning \R^{n \times n} could be is n^2, and it seems unlikely that with the restriction of them being rank sign matrices, you’d easily find such a collection of random matrices. The \log n factor seems like a reasonable price to pay; in fact, because of the coupon collector phenomena, maybe it’s the same price you’d need to pay for a collection of random matrices selected from the standard euclidean basis matrices. In that light, this is an unexpectedly good result.

Possibly relevant posts:

Mar 10th, 2010 | Filed under Mathematics
Tags:

Gil Scott-Heron: Me and The Devil

Came across this at Raving Black Lunatic. What does it say about me that the first thing the sung portion made me think was that this would be perfect on the soundtrack of one of the Faustian Supernatural episodes?

Possibly relevant posts:

Mar 10th, 2010 | Filed under General
Tags:

Why the eigenvalues of a PSD matrix are continuous w.r.t. its entries

A fellow student asked me that question today, and I came up with the following answer: Clearly the top eigenvalue is continuous, as it’s just the norm. Then by writing f_k(A) = \sum_{i=1}^k \lambda_i(A) , the sum of the top k eigenvalues of A, as the SDP


\begin{aligned}
\max & \text{ Trace}(AB) \\
\text{subject to} & \\
 & \|B\| \leq 1 \\
 & \text{Trace}(B) = k \\
 & B \succeq 0
\end{aligned}

we see that f_k(A) is convex (since a function of the form f(A) = \sup_{B \in \mathcal{B}} \langle A, B \rangle , i.e. the supremum of linear functions, is convex). Since a convex function on an open set is continuous, we see that f_k is continuous on the positive semidefinite cone, so \lambda_k(A) = f_k(A) - f_{k-1}(A) is continuous on the same set also.

Another interesting question: are the eigenvalues of a general square matrix continuous w.r.t. to its entries? I believe that’s the case, because we’re talking about roots of polynomials, after all, but I don’t know a simple argument off the top of my head.

Possibly relevant posts:

Mar 9th, 2010 | Filed under Mathematics
Tags:

A logistic (in the nonmathematical sense) aside

I’m within \epsilon of having a new roommate, so naturally I’m planning what I’ll do with the extra $625/month I’ll be saving. Guffaw. The first two months are spoken for already— it’s past time for new glasses, and I haven’t paid off my Caltech dining account in several months—, then I’ll start saving up to get a car, since it looks like I’m going to be here another 3 years or so. After that, I dream of actually adding to my savings account for the first time in over a year.

Possibly relevant posts:

Mar 7th, 2010 | Filed under General
Tags:

Solve this nasty recursion

I’ve “reduced” a problem to solving this recursion:


\displaystyle
\begin{cases}
\Omega_{k,m} = \sum_{\omega = 0}^k {2k \choose 2\omega} \Omega_{k-\omega, m-1} & k \geq 1 \text{ and } m \geq 1 \\
\Omega_{0,m} = 1 & m \geq 1 \\
\Omega_{k,0} = 0 & k \geq 1
\end{cases}

This looks nasty, and certainly beyond my experience. Any ideas on how to proceed? I guess I could try Z-transforms, but … just yuck.

Possibly relevant posts:

Mar 4th, 2010 | Filed under Mathematics
Tags:

CVXOPT + Pythonika = convex optimization from Mathematica

One of the reasons why I use Matlab more than Mathematica is Matlab supports convenient convex optimization via the open source CVX package, while Mathematica doesn’t seem to have any support even via (open source) third party packages for convex optimization.

Now it seems I’ve found a relatively easy way to rectify this, by interfacing Mathematica with the CVXOPT Python package using Pythonika. At least under Ubuntu, this was a hasslefree process: apt-get python-cvxopt, then download and compile Pythonika. The only modifications that I needed to make to the Pythonika makefile were to add a ‘-lrt’ linker flag and point to the correct python version and the correct Mathematica Linker library. Then I was able to use CVXOPT to solve a linear program from Mathematica!

CVXOPT doesn’t seem to provide the same type of natural mathematical language interface as CVX does, and I’m not sure how flexible/convenient the interface between Python and Mathematica provided by Pyhtonika is, but this is an interesting development.

Possibly relevant posts:

Feb 27th, 2010 | Filed under Mathematics, Programming
Tags:

A way not to get sign decompositions

One idea for getting sign decompositions of a matrix, inspired by Grothendieck’s inequality (the dual formulation) is to get a factorization A = UV^t which realizes  \gamma_2(A) and then round the entries of U,V to signs somehow, take D to be an appropriate scaling matrix, and use UDV^t as your approximate sign decomposition of A.

You can use this as the primitive step in a greedy algorithm, and get pretty good results (if! you use least squares to calculate D)– I implemented this method, simply replacing the entries of U,V with their signs–, but it’s not a particularly smart way of getting sign decompositions. For one thing, you could probably equally as well use random sign matrices U,V, because if you use least squares projection, you’re guaranteed that \|A - UDV^t\|_F < \|A\|_F with reasonable probability at each step (take this with a grain of salt: I think this is easy to show, but I've not checked that it's true), which is all that you can hope to say even if you use the U,V from the \gamma_2 decomposition. That is to say, there's nothing inherently special or useful about the \gamma_2 decomposition.

To drive home the point, first notice that 1 = \|I\|_{1 \rightarrow \infty} \leq \gamma_2(I) , so \gamma_2(I) = 1. Therefore we can find a wide range of factorizations achieving \gamma_2(I): take U = V to be any orthonormal matrix. So there's definitely not anything special about the factors, or the signs you'd round them to, in the general case.

Second, and more importantly, it seems difficult to prove anything about the rate of convergence of these types of greedy algorithms. And (this is just a gripe, because even a factorization corresponding to \|A\|_{\infty \rightarrow 1}^\star can be far from sparse), although the factorizations I've seen returned with my particular greedy scheme have reasonably many terms, there's no kind of guarantee of sparseness.

Possibly relevant posts:

Feb 26th, 2010 | Filed under Mathematics
Tags:

Dual of an inequality and equality constrained SDP

Update: The dual SDP for \gamma_2(A) can be more simply written as

 \max 2 \langle \lambda, A \rangle subject to
 \text{diag}(s) \geq 0 ,
 s^t 1 \leq 1
 \text{diag}(s) - \begin{pmatrix} 0 & \lambda \\ \lambda^t & 0 \end{pmatrix} \succeq 0.

So it’s surprising that the \lambda which solves this seems to be in the \gamma_2^\star 1/2-ball. It’s easy to see that this ball is included in the feasible points, but no reason to think that it is the entire set of feasible points.

I spend mucho time deriving and checking this duality, then implementing it in CVX, so I record it here so hopefully I don’t have to redo this (I realize I could’ve looked this up, but the point is to derive it so I have a better chance of remembering it myself):

The dual of the SDP  \min \langle C, X \rangle s.t. X \succeq 0 ,  \mathcal{A}(X) \preceq 0 .  \mathcal{B}(X) = M is the SDP \max \langle \lambda, M \rangle s.t. \mathcal{A}^\star(s) - \mathcal{B}^\star(\lambda) + C - Z = 0 .  s \succeq 0 ,  Z \succeq 0.

In particular, this implies the dual of the SDP I mentioned before gives \gamma_2(A) is the SDP

 \max \left\langle \lambda, \begin{pmatrix} 0 & A &  \\ A^t & 0 &  \\  &  & 0 \end{pmatrix} \right\rangle
s.t. \begin{pmatrix} \text{diag}(s) & \\ & -s^t 1 \end{pmatrix} - \begin{pmatrix} 0 & \lambda_2 & \\ \lambda_3 & 0 & \lambda_5 \\ & \lambda_6 &  \end{pmatrix} + \begin{pmatrix} 0 & 0 & \\ 0 & 0 & \\ & & 1 \end{pmatrix} - Z = 0 ,  s \succeq 0 ,  Z \succeq 0

where I decomposed \lambda as follows

 \lambda = \begin{pmatrix} \lambda_1 & \lambda_2 & \\ \lambda_3 & \lambda_4 & \lambda_5 \\ & \lambda_6 & 0 \end{pmatrix}.

If A \in \R^{n \times m}, then \lambda \in \R^{(n+m+1) \times (n+m+1)} and its components break down as follows: \lambda_1 \in \R^{n \times n} , \lambda_2 \in \R^{n\times m}, \lambda_3 \in \R^{m \times n}, \lambda_4 \in \R^{m \times m} , \lambda_5 \in \R^{(n+m) \times 1}, and \lambda_6 \in \R^{1 \times (n+m)}.

As you can see, this is an unwieldy SDP to write out, but it has its benefits as a method for calculating \gamma_2(A). Specifically, I believe that 2\lambda_2 is a matrix in the unit \gamma_2^\star ball which achieves \gamma_2(A):

 \displaystyle \gamma_2(A) = \max_{\gamma_2^\star(B) \leq 1} \langle A, B \rangle = \langle A, 2\lambda_2 \rangle.

Numerical experiments indicate this is so, but I’d like to establish it. Here’s some CVX code to test this out (mostly this is just here– as is this entire post and all posts relating to my research– for archival purposes):

% solve the dual SDP for a random matrix
A = randn(20,20);
[n,m] = size(A);
cvx_begin
    variable s(n+m);
    variable lambda(n+m+1, n+m+1);
    variable Z(n+m+1, n+m+1) symmetric;
    maximize( trace( lambda * [zeros(n) A zeros(n,1); A.' zeros(m,m+1); zeros(1,n+m+1)] ));
    subject to
        s >= 0;
        Z == semidefinite(n+m+1);
        [diag(s) zeros(n+m,1); zeros(1,n+m) -ones(n+m,1).'*s] - ...
            [zeros(n) lambda(1:n, n+1:n+m+1); ...
             lambda(n+1:n+m, 1:n) zeros(m) lambda(n+1:n+m, n+m+1); ...
             lambda(n+m+1, 1:n+m) 0] + diag([zeros(1,n+m) 1]) - Z == 0; 
cvx_end
B = lambda(1:n, n+1:n+m);
% the solution to the dual SDP, the primal SDP, and  (lambda_2, A) (inner product) should
% all be the same
[cvx_optval gamma2norm(A) trace(2*B.'*A)] 
gamma2dualnorm(2*B) % check 2lambda_2 is in the gamma2dual norm ball

Possibly relevant posts:

Feb 26th, 2010 | Filed under Mathematics
Tags: