Raw versus central matrix moments

It turns out that the question I’m looking at now is strongly connected the question of what difference there is between the raw and central matrix moments of random self-adjoint matrices.

Remember the raw moments of a centered scalar random variable are r_p = (\mathbb{E} X^p)^{1/p} while the central moments are c_p = (\mathbb{E}(X - \mathbb{E}X)^p)^{1/p}. The same definitions hold in the matrix case, except X is a random matrix and we use matrix powers instead (and ignore the root-taking!).

In the scalar case, I’ve been assured that the central moments and raw moments of the same r.v. may grow at different rates (i.e. there’re no constants so  c r_p \leq c_p \leq C r_p ).
The general idea is if \mathbb{E} X = \mu \neq 0 , then X^p will vary about \mu^p instead of 0, so the raw moment may be much larger. Looking at wikipedia’s list of the first several raw and central moments of a \mathcal{N}(\mu, \sigma^2) variable, it looks like a \mathcal{N}(\sigma, \sigma^2) random variable may provide an example of this phenomenon in action.

The question is, what happens in the matrix case? It’s not clear to me how to compare the two moments in this case: are they even guaranteed to be comparable in the Loewner ordering?

Possibly relevant posts:

Tags: , , , ,

The gist of the Laplace transform method for bounding eigenvalues of random matrices

This post is a brief overview of the technique introduced in Joel’s User Friendly technical report (the User Friendly preprint is more intricate, but also more powerful, because it generalizes the results in the technical report to the case of martingales). I realize I’ve mentioned this paper in passing in several posts. I’m pretty obsessed with it at this point— not only is it the major tool in my current research project, but it’s a really powerful and simple idea.

Recall the Laplace transform method (aka the Chernoff bounding method) for bounding sums of scalar random variables: use Markov’s inequality and the fact that the moment generating function of the sum splits into the product of the moment generating functions of the individual summands:


\displaystyle
\mathbb{P}( \sum\nolimits_i X_i \geq t) \leq \inf_{\theta > 0} e^{-\theta t} \mathbb{E}\exp\left(\sum\nolimits \theta X_i \right) = \inf_{\theta > 0} e^{-\theta t} \prod_i \mathbb{E} \exp(\theta X_i)

Assuming that the calculations are tractable, from here you simply optimize over \theta to get the best such tail bound.

Attempting to generalize this to the case where the random variables are independent random self-adjoint matrices, and considering events of the type \lambda_{\text{max}}\left(\sum\nolimits_i X_i \right) > t, you immediately run into the problem that the ersatz moment generating function \mathbb{E} \exp(\lambda_{\text{max}}( \cdot) ) does not turn sums into products.


\displaystyle
\mathbb{P}\left( \lambda_{\text{max}} \left( \sum\nolimits_i X_i \right) > t \right) \leq \inf_{\theta >0} e^{-\theta t} \mathbb{E} \exp\left(\lambda_{\text{max}} \left( \sum\nolimits_i \theta X_i \right) \right) \leq \, ?

Where do we go from here?

Well, it turns out that the right thing to do is apply the spectral mapping theorem to see that \exp(\lambda_{\text{max}}(X)) = \lambda_{\text{max}}(\exp(X)) and then dominate the maximum eigenvalue with the trace; this gives


\displaystyle
\mathbb{P}\left( \lambda_{\text{max}} \left( \sum\nolimits_i X_i \right) > t \right) \leq \inf_{\theta >0} e^{-\theta t} \mathbb{E} \text{tr}\,\exp\left( \sum\nolimits_i \theta X_i \right)

At this point, the clever realization of the paper is that we can get good results by invoking a deep technical result due to Lieb (see Appendix A of Rusaki’s review of inequalities for quantum entropy for a flawed but conceptually enlightening proof) : if A is a fixed self-adjoint matrix then the function B \mapsto \text{tr}\,\exp(A + \log B) is concave on the positive definite cone, so if X is a random positive-definite matrix, then


\displaymath
\mathbb{E} \text{tr}\,\exp(A + \log X) \leq \text{tr}\,\exp(A + \log \mathbb{E}X)

Of course this fact remains true if we use conditional expectation.

Now assume we have bounds on the cumulant generating functions of the summands, of the form \log \mathbb{E} \exp(\theta X_i) \preceq V_i(\theta), where V_i(\theta) is a psd matrix for each \theta. This is equivalent to having bounds on the moment generating functions of the summands, \mathbb{E} \exp(\theta X_i) \preceq \exp(V_i(\theta)) .

Using conditional expectation and the technical result above, Joel shows that we have \mathbb{E} \text{tr}\,\exp\left(\sum\nolimits_i \theta_i X_i \right) \leq \tr \exp\left( \sum_i V_i(\theta) \right): let \mathbb{E}_k denote expectation conditional on X_1, \ldots, X_k. Then


\displaystyle
\begin{aligned*}
\mathbb{E} \text{tr}\,\exp\left( \sum\nolimits_i \theta X_i \right) & = \mathbb{E} \mathbb{E}_1 \cdots \mathbb{E}_{n-1} \text{tr} \, \exp( \theta X_1 + \ldots + \theta X_n) \\
& = \mathbb{E} \mathbb{E}_1 \cdots \mathbb{E}_{n-1} \text{tr} \, \exp( \theta X_1 + \ldots + \theta X_{n-1} + \log \exp(\theta X_n) ) \\
& \leq \mathbb{E} \mathbb{E}_1 \cdots \mathbb{E}_{n-2} \text{tr}\,\exp(\theta X_1 + \ldots + \theta X_{n-1} + \log \mathbb{E}_{n-1} \exp(\theta X_n) ) \\
& = \mathbb{E} \mathbb{E}_1 \cdots \mathbb{E}_{n-2} \text{tr}\,\exp(\theta X_1 + \ldots + \theta X_{n-1} + \log \mathbb{E} \exp(\theta X_n) ) \\
&  \leq \mathbb{E} \mathbb{E}_1 \cdots \mathbb{E}_{n-2} \text{tr}\,\exp(\theta X_1 + \ldots + \theta X_{n-1} + V_{n}(\theta)  ) \\
& \cdots \leq \text{tr}\,\exp\left(\sum\nolimits_i V_i(\theta) \right).
\end{aligned*}

If we overestimate the trace with the maximum eigenvalue and apply the spectral mapping theorem once more, we get the generic bound


\displaystyle
\mathbb{P}\left( \lambda_{\text{max}} \left( \sum\nolimits_i X_i \right) > t \right) \leq d \inf_{\theta >0} e^{-\theta t} \exp\left( \sum\nolimits_i \lambda_{\text{max}}(V_i(\theta)) \right),

where d is the size of the summands.

To apply this generic bound, you have to find V_i(\theta)– a simple procedure for the cases considered in the technical report–, which depend upon assumptions you make about the summands: e.g. if they’re bounded or how their moments grow. Then you can optimize over \theta to find the best such bounds.

See the technical report for the generalization of several standard scalar probability inequalities, like upper and lower Chernoff bounds for sums of bounded random matrices, and the Bernstein bound for sums of bounded random matrices with finite “variance”.

Possibly relevant posts:

No tags for this post.
Aug 27th, 2010 | Filed under Mathematics
Tags:

Musical Interlude: Duets

Aug 24th, 2010 | Filed under General
Tags:

inbetweeness

I had a thought very much like this today at lunch, as I looked around and noticed how young all the students seem. Well, at least the thought in the first paragraph… except I’m in that nonexistent transitional period right now: somewhere on the curve interpolating Undergraduate with Professor. Slowly, ever so slowly, moving. Why does this make me think of homotopy? Talk about crappy metaphors.

My late friend Stan Ulam used to remark that his life was sharply divided into two halves. In the first half, he was always the youngest person in the group; in the second half, he was always the oldest. There was no transitional period.

I now realize how right he was. The etiquette of old age does not seem to have been written up, and we have to learn it the hard way. It depends on a basic realization, which takes time to adjust to. You must realize that after reaching a certain age you are no longer viewed as a person. You become an institution, and you are treated the way institutions are treated. You are expected to behave like a piece of period furniture, an architectural landmark, or an incunabulum.

Possibly relevant posts:

No tags for this post.
Aug 24th, 2010 | Filed under General
Tags:

Positivity of a certain integral

Let a,l,v be strictly positive integers. I’d like to know that


\displaystyle
\int_0^1 \frac{\log (x) x^{a (l v+v-1)-v-1} \left((a l+a+l) x^{(l+1) v}-x (a l+a-1)\right)}{x-1} \, dx > 0

Any ideas? It looks like the integrand is pretty flat near 0, then it dips below zero, then shoots up above zero again strongly enough that the integral is always positive. The integral as a function of v also seems to be decreasing.

Equivalently, I’d like to know that


\displaystyle
\sum_{i=1}^{v l + v - 1} \frac{1}{a(vl + v -1) - v + i}

is increasing as a function of v.

Possibly relevant posts:

Tags: , ,
Aug 13th, 2010 | Filed under Mathematics

Comparing products of gaussian moments with one gaussian moment

I found this result while trying to prove that the moments of a sum of scaled chi-squared random variables grow like p:

Let \gamma be a standard Gaussian random variable. Then for any positive integers k_i,


\displaystyle
\mathbb{E}\gamma^{k_1} \mathbb{E}\gamma^{k_2} \cdots \mathbb{E}\gamma^{k_n} \leq \mathbb{E} \gamma^{k_1 + \cdots + k_n}.

(I may have missed some way in which this is trivially true anyhow, but I don’t think so)
(Update: in the comments, Sohail gave a shorter method using the AM-GM inequality that gives a tighter bound which is particularly useful if you know what \|k\|_2 and \|k\|_1 are)

The proof is pretty straightforward, but first I’ll show why this implies that the moments of a sum of scaled chi-squared random variables grow like p. Let \lambda_i be positive, and \gamma_i be independent standard Gaussian random variables, then


\displaystyle
\mathbb{E}\left(\sum\nolimits_i \lambda_i \gamma_i^2 \right)^p = \sum_{k_1 + \cdots + k_n = p} \left({p \atop k_1, \ldots, k_n}\right) \lambda_1^{k_1} \cdots \lambda_n^{k_n} \mathbb{E}\gamma_1^{2k_1}\cdots \mathbb{E}\gamma_n^{2k_n}

\displaystyle
\leq \mathbb{E}\gamma^{2p} \sum_{k_1 + \cdots k_n = p} \left({p \atop k_1, \ldots, k_n}\right) \lambda_1^{k_1}\cdots\lambda_n^{k_n} = \left(\sum\nolimits_i \lambda_i\right)^p \mathbb{E} \gamma^{2p}.

The fact about the moment growth follows, because \mathbb{E} \gamma^{2p} grows like p^p.

Now, the proof. First note that we can assume that all the k_i = 2p_i are even, because if any are odd, the left-hand side is zero, and the right hand side is always nonnegative. Let  p = \sum_i p_i. Using the fact that \mathbb{E}\gamma^{2p_i} = \tfrac{2^{p_i}}{\sqrt{\pi}} \Gamma(p + \tfrac{1}{2}),


\displaystyle
\mathbb{E}\gamma^{2p_1} \cdots \mathbb{E}\gamma^{2p_n} = \frac{2^p}{\pi^{n/2}} \Gamma(p_1 + \tfrac{1}{2}) \cdots \Gamma(p_n + \tfrac{1}{2}) \leq \frac{2^p}{\pi^{n/2}} \max_{x_1 + \ldots + x_n = p} \Gamma(x_1 + \tfrac{1}{2}) \cdots \Gamma(x_n + \tfrac{1}{2})

The quantity being maximized is always positive, so we can consider instead the problem of maximizing its log,


\displaystyle
\max_{x_1 + \ldots + x_n = p} \sum\nolimits_i \log\Gamma(x_i + \tfrac{1}{2}).

Since \Gamma is log convex, some extreme point of the set being maximized over must be a maximizer. The extreme points of this set consist of (x_i) where all entries but one are zeros. Thus we see that


\displaystyle
\mathbb{E}\gamma^{2p_1} \cdots \mathbb{E}\gamma^{2p_n} \leq \frac{2^p}{\pi^{n/2}} \Gamma(p + \tfrac{1}{2})\Gamma(\tfrac{1}{2})^{n-1} = \frac{2^p}{\sqrt{\pi}} \Gamma(p + \tfrac{1}{2}) = \mathbb{E}\gamma^{2p}.

Possibly relevant posts:

Tags: , , , , ,

A Matrix Cauchy-Schwarz Inequality via Schur Complements

Here’s an interesting proof of the CS-ineq that uses Schur Complements. Let f,g be square-integrable functions. Note that \begin{pmatrix} f \\ g \end{pmatrix} \begin{pmatrix}f & g \end{pmatrix} \succeq 0 , so the expectation of this matrix is positive. This implies the Schur complement \mathbb{E}f^2 - \mathbb{E}fg (\mathbb{E}g^2)^{-1} \mathbb{E}fg is positive. This is the Cauchy-Schwarz inequality.

I mention this because exactly the same trick gives you a matrix-valued Cauchy-Schwarz inequality:


\displaystyle
\mathbb{E}(A^\star A) \succeq \mathbb{E}(A^\star B) (\mathbb{E}(B^\star B))^{-1} \mathbb{E}(B^\star A).

Possibly relevant posts:

Tags: , ,
Aug 12th, 2010 | Filed under Mathematics

Nasty matrix moment!

Ok, I’ll put this out there. I’m try to calculate the matrix moments of a Wishart matrix with one degree of freedom: I have \eta \sim \mathcal{N}_n(0, C) and W = \eta \eta^t \sim W_{n}(C, 1). What is \mathbb{E} W^p in terms of C? I was thinking of applying Isserlis’ theorem and the relation \mathbb{E}((xx^t)^p)_{ij} = \mathbb{E} x_i x_j (x^tx)^{p-1}, but good God I hate dealing with the combinatorics that would involve, and I don’t think I’d get a tractable expression out of it. It seems like this should be known. Darn it, Wishart matrices are supposed to be friendly!

Possibly relevant posts:

Tags: , , , ,
Aug 10th, 2010 | Filed under Mathematics

Convergence of eigenvalues of sample covariance matrices

I’m thinking about how to measure the convergence of a sample covariance matrix to the covariance matrix. To be specific, let \eta_j \sim \mathcal{N}(0, C), then we have S_m \equiv \frac{1}{m} \sum_j \eta_j \eta_j^t \rightarrow C by the law of large numbers.

Both the sample covariance matrix and the covariance matrices are positive semidefinite, so we can bound the \ell_2 error between the spectra of the two matrices using the spectral norm of the difference matrix (Hoffman-Wielandt):


\displaystyle
\min_{\sigma \in S_n} \left(\sum_{i=1}^n |\lambda_i(S_m) - \lambda_{\sigma(i)}(C)|^2 \right)^{1/2} \leq \| S_m - C\|_2.

This implies that in particular, |\lambda_1(S_m) - \lambda_1(C)| \leq \lambda_1(S_m - C).

It’d be nice if we could assert something like |\lambda_k(S_m) - \lambda_k(C)| \leq \lambda_k(S_m - C), but it’s easy to see that this type of relationship doesn’t hold for general positive semidefinite matrices: e.g. take


\displaystyle
A = \begin{pmatrix} 1 & 0 \\ 0 & 0.5 \end{pmatrix}

and


\displaystyle
B = \begin{pmatrix} 0.5 & 0 \\ 0 & 1 \end{pmatrix},

then 0 = \lambda_2(A) - \lambda_2(B) \not\leq \lambda_2(A - B) = -0.5.

What I’m pondering is exactly how useful the quantity \lambda_k(S_m - C) is. It doesn’t have that nice metric property, but it still seems like it measures something. Maybe if we use the relationship between S_m and C, this quantity means something more than in the general case?

Update: It turns out that you can use something very similar to \lambda_k(A -B) to control the size of \lambda(A) - \lambda(B). I can’t share it here, as it would give the entire game away, unfortunately. It’s not clear how good the bounds you get using this method are, yet.

Possibly relevant posts:

Tags: , , ,