somewhere near the beginning.

One direction in Khintchine’s inequality for Rademacher sums

Filed under: Mathematics — Alex @ 12:11 am 9/26/2008

The Khintchine inequality for Rademacher sums says that given a real vector a = (a_1, \ldots, a_n), the p-th moment of the random sum  \sum_{i=1}^n \epsilon_i a_i is equivalent to the length of a:

 \displaystyle A_p \|a\|_2 \leq \left( \mathbb{E} \left| \sum_{i=1}^n \epsilon_i a_i \right|^p \right)^{1/p} \leq B_p \|a\|_2,

where A_p, B_p are constants which depend on p. Here the expectation is taken w.r.t to the \epsilon_i, which are i.i.d Bernoulli \pm 1 r.v.s — i.e. Rademacher r.v.s

Today I’ll derive the upper bound; when/if I figure out the lower bound, I’ll post on it later. There are two approaches to the upper bound that I know of. The simplest cheats, in that it appeals to Khintchine’s inequality for Gaussian sums, but at the same time, it illustrates a connection between gaussian r.v.s and Rademacher r.v.s that is often useful. Khintchine’s inequality for Gaussian sums is

 \displaystyle A_p^\prime \|a\|_2 \leq \left( \mathbb{E} \left| \sum_{i=1}^n \nu_i a_i \right|^p \right)^{1/p} \leq B_p^\prime \|a\|_2,

where the \nu_i are \mathcal{N}(0,1).

Note that since Gaussian r.v.s are symmetric, \nu_i and \epsilon_i |\nu_i| have the same distribution; therefore we can replace each \nu_i with \epsilon_i |\nu_i| in the inequality, and apply Jensen’s inequality to bring the conditional expectation with respect to the |\nu_i| inside the absolute values. That is,

 \displaystyle \mathbb{E} \left| \sum_{i=1}^n \nu_i a_i \right|^p = \mathbb{E}_\epsilon \mathbb{E}_{|\nu|} \left| \sum_{i=1}^n \epsilon_i |\nu_i| a_i \right|^p \geq \sqrt{\pi/2} \mathbb{E} \left| \sum_{i=1}^n \epsilon_i a_i \right|^p ,

where we use the fact that \mathbb{E} |\nu_i| = \sqrt{\pi/2} . This implies that the upper Khintchine bound holds for Rademacher variables with B_p at most \sqrt{2/\pi} B_p^\prime. The Rademacher-Gaussian connection is a powerful tool: e.g. Latala uses it to extend a bound on the norm of a random matrix with independent mean-zero gaussian entries to one that applies to virtually *any* random matrix with independent mean-zero entries.

I suspect, but haven’t given it any thought, that Khintchine’s inequality probably holds for any sequence of i.i.d. subgaussian variables.

The second approach, the one from scratch, is to note that you have a Chernoff bound which is a differential form of the upper Khintchine bound (I think this form is also sometimes called a Khintchine inequality):

P\left(\left|\sum_{i=1}^n \epsilon_i a_i \right| > C\right) \leq 2 \exp(-C^2/(2 \|a\|_2^2))

The derivation of this inequality is standard: note P\left(\left|\sum_{i=1}^n \epsilon_i a_i \right| > C\right) = 2 P\left(\sum_{i=1}^n \epsilon_i a_i > C\right) then use Markov’s inequality to bound this by  2 \mathbb{E} \exp\left( \lambda \left(\sum_{i=1}^n \epsilon_i a_i\right)\right)/ \exp(\lambda C) where the \lambda is to be determined. By independence, the numerator factors:

 2 \displaystyle \prod_{i=1}^n \mathbb{E} \exp(\lambda \epsilon_i a_i) = 2 \prod_{i=1}^n \cosh(\lambda a_i) \leq 2 \prod_{i=1}^n (1+(\lambda a_i)^2/2)

which we dominate by \exp(\lambda^2 \|a\|_2^2). Plugging this back in and finding the \lambda which minimizes the bound gives us the desired Chernoff bound.

From this Chernoff bound, we have

P\left(\left|\sum_{i=1}^n \epsilon_i a_i \right|^p > C\right) \leq 2 \exp(-C^{2/p}/(2 \|a\|_2^2)),

so using the integral expression for expectation, \mathbb{E}|X|^p = \int_0^\infty P(|X|^p > C) \,dC, and doing some simplifications of the integral, we wind up with

 \left( \mathbb{E} \left| \sum_{i=1}^n \epsilon_i a_i \right|^p \right)^{1/p} \leq 2^{1/2} \left( \Gamma(1+p/2) \right)^{1/p} \|a\|_2.

Possibly relevant posts:

Prediction (a really cool game)

Filed under: Mathematics — Alex @ 6:16 pm 9/12/2008

Prediction, Learning, and Games caught my eye a while ago– besides the title suggesting that it’d be a mathematical smorgasborg purpose-made to satisfy my particular appetite, I flipped through it and was intrigued by the seeming uniqueness of the presentation (not that I’m anything near an expert on this, but other books on prediction and learning I’ve seen tend to work from a probabilistic perspective from the start, which this book doesn’t; they also don’t draw connections with game theory the way this book claims to). Today I started reading the introduction; I came across the following delicious problem, and had to stop and puzzle it out.

On to the aforementioned delicious problem:

Consider the problem of predicting an unknown sequence y_1, y_2, \ldots of bits y_t \in \{0,1\}. At each time t the forecaster first makes his guess \hat{p_t} \in \{0,1\} for y_t. Then the true bit is revealed and the forecaster finds out whether his prediction was correct. To compute \hat{p_t} the forecaster listens to the advice of N experts. This advice takes the form of a binary vector (f_{1,t}, \ldots, f_{N,t}), where f_{i,t} \in \{0,1\} is the prediction that expert i makes for the next bit. Assume we are told in advance that, on this particular sequence of outcomes, there is an expert i that makes no mistakes. That is, we know that f_{i,t} = y_t for some i and for all t, but we do not know for which i this holds. Our goal is to bound the number of time steps t in which \hat{p_t} \neq y_t, that is, to bound the number of mistakes made by the forecaster.

Can you find a good upper bound? It is clear that after N-1 mistakes, you know who the prescient expert is, so it is less than this. In fact, it is also much less than \sqrt{N}– as one of the people I shared this problem with said, that really only leaves one choice :)

Possibly relevant posts:

\ell_{\log n} approximates \ell_\infty

Filed under: Mathematics — Alex @ 1:45 pm

It’s common knowledge that in \R^n the \ell_p norms approach the \ell_\infty norm as p goes to infinity. As was apparent to everyone but me, this approximation gets really good at about p = \log n; here’s a quick justification.

We’re wondering for what value p is \|a\|_{p}/\|a\|_\infty = \left( \sum_i \left(\frac{|a_i|}{\|a\|_\infty}\right)^p \right)^{1/p} approximately one; I took ‘approximate’ to mean bounded above and below by constants. Each summand is lesser than or equal to one, so the most the quotient can be is n^{1/p}, and the least it can be is one. So really, we just want to find a small p so n^{1/p} is a bounded by a constant; the easiest choice is p = \log n , in which case we have  1 \leq \|a\|_{\log n} / \|a\|_\infty \leq e.

Possibly relevant posts:

Sweet success!

Filed under: Mathematics — Alex @ 11:23 pm 9/10/2008

I’ve been thinking about this counting problem off-and-on throughout the day: what is \sum_{k=1}^n (n-k)2^{k-1}?

Give it a shot, then see my argument after the cut
(more…)

Possibly relevant posts:

Matrix Structure

Filed under: Mathematics — Alex @ 1:26 am 9/9/2008

Given a square n \times n matrix A, we call the directed graph G(A) = \operatorname{struct}(A) on \{1, \ldots, n\} which contains edge (i,j) iff A_{ij} \neq 0 the structure of A. This notion is useful in designing algorithms for sparse matrix computation: given only the structure of the input, you’d like to determine the worst-case structure of the output, so you know which entries you need to compute. Or in graph theoretical terms, given \operatorname{struct}(A) and a function f on matrices, you’d like to compute
 S_f = \bigcup \{ \operatorname{struct}(f(B)) \,:\, G(B) = \operatorname{struct}(A) \}.

For some functions f and structures G(A), S_f is the complete graph, which means f completely destroys the sparsity of that class of inputs. Individual matrices with that structure may give outputs with sparse structures, but there is no overarching sparsity pattern to the output.

We call the transitive closure G^\star(A) of G(A), minus the self-loops, the transitive closure of A. That is, (i,j) \in G^\star(A) iff there is a path from i to j in G(A) and i \neq j.

Lots of interesting questions abound, but here’s my first. Take some simple random matrix A, then calculate the probability that an edge is in its transitive closure. For example, let A be a matrix where the entries are i.i.d. distributed evenly over -1,1,0. Just from simulations, I’m getting that the transitive closure is with high probability the complete directed graph.

Incidentally, I came across this notion of a matrix’s structure in a paper entitled “Predicting structure in sparse matrix computations” when I was trying to determine whether there are algorithms which exploit sparsity in general non-structured sparse matrices– it’s clear that if you have a particularly nice sparsity pattern (diagonal, in one extreme), specific algorithms can be tailored to take advantage of that structure, but it’s not at all clear to me that just knowing the specific but arbitrary sparsity pattern of a fixed matrix affords you any algorithmic benefits.

Possibly relevant posts:

Probabilistic Tools

Filed under: Mathematics — Alex @ 10:36 pm 8/25/2008

Today one of the first years studying for his qualification exams asked me a question that he came across on a previous year’s probability qual: If M people are removed from N couples, what is the expected number of remaining couples?

Give it a shot, then see under the cut for my solution

(more…)

Possibly relevant posts:

Complexification of the Rademacher comparison theorem?

Filed under: Mathematics — Alex @ 12:00 pm 8/18/2008

Recall the Rademacher comparison theorem:

Let F: \R^+ \rightarrow \R^+ be an increasing, convex function, T \subset \R^n be a compact set, and \varphi_i be real contractions such that \varphi_i(0) = 0, then

\displaystyle \mathbb{E} F\left(\frac{1}{2} \sup_{t \in T} \left| \sum_{i=1}^n \epsilon_i \varphi(t_i) \right| \right) \leq \mathbb{E} F\left( \sup_{t \in T} \left| \sum_{i=1}^n \epsilon_i t_i \right| \right),

where \epsilon_i are independent Rademacher (Bernoulli \pm 1) variables.

Intuitively, replacing F with the identity, this says that if we shrink the coordinates and then take random Bernoulli averages, the expected value of the maximum deviation is going to be less than that for the original coordinates. Therefore, it makes sense to expect that an appropriately modified version of this theorem holds for the case where T \subset \C^n and the \varphi_i are complex contractions.

In one version of a proof (see “Comparison Theorems, Random Geometry and Some Limit Theorems for Empirical Processes” by Ledoux and Talagrand), an intermediate step is to show that

 \displaystyle \mathbb{E} F\left( \sup_{t \in T} \left| \sum_{i=1}^n \epsilon_i |t_i| \right| \right) \leq 2 \mathbb{E} F \left( \sup_{t \in T} \left| \sum_{i=1}^n \epsilon_i t_i \right| \right)

and their main tool for doing this is a bijection \theta_t : \{\pm 1\}^n \rightarrow \{\pm 1\}^n which satisfies, for a given t \in T

\displaystyle \left| \sum_{i=1}^n \epsilon_i |t_i| \right| \leq \left| \sum_{i=1}^n \theta(\epsilon)_i t_i \right|.

They prove such a \theta_t exists without explicitly specifying it, using the marriage theorem.

If I could somehow generalize this to the complex case, i.e. find a bijection \theta_t : \{\pm 1\}^n \rightarrow \{\pm 1\}^n which satisfies the above, I’d have at least this intermediate inequality. (As it turns out, this inequality is all I need for my applications, since we only use \varphi_i = | \cdot | .) But the question is, is it even reasonable to expect this?

Possibly relevant posts:

A variational problem

Filed under: Mathematics — Alex @ 12:20 pm 8/7/2008

Something to ponder: Fix X, is \min_D \|D^{-1} X\|_{1 \rightarrow 2} \geq \|X\|_F, where D is a positive diagonal matrix satisfying \text{tr} D^2 = 1?

In the case where  D = \text{diag}(1/\sqrt{n}, \ldots, 1/\sqrt{n}) , the inequality is true, which is why it might be reasonable to expect it in the general case.

Possibly relevant posts:

Expected norm of matrices with randomly signed entries

Filed under: Mathematics — Alex @ 11:58 am 8/6/2008

Woohoo! I think I’ve finished constructing examples that prove the optimality of two of our error bounds– that the various log and sqrt(log) factors are needed, etc.– tomorrow I’ll see what my advisor thinks of them.

Here’s an interesting question I came across during the process (I haven’t yet attempted to solve it, because it turns out I can avoid needing to know the answer for the general case by choosing a particular matrix for my example): Given a matrix X, multiply all its entries by independent Rademacher (+/-1 Bernoulli) r.v.s — what does the expected norm of this random matrix look like? The answer depends on the norm (e.g. for the Frobenius norm, it’s trivially the norm of X); I’m interested in the spectral norm.

Just for aesthetic reasons, I’d like to say that it’s (maybe only, close to) the norm of X.

Possibly relevant posts:

« Newer PostsOlder Posts »