Schur Complement and Quadratic Forms

This came up in a math reading group, so I figured I’d write a note. This is relatively computational, but I will try to keep it accessible to someone who has had a first course in linear algebra.

The Schur Complement comes up a lot when talking about a block matrix

M = \left[\begin{array}{cc} A & B \\ C & D \end{array}\right]

where A is an m\times m block and D is n\times n , (thus M is n+m\times n+m. Then the Schur complement, if D is invertible, is A - BD^{-1}C.

The reason for how often it appears, is most readily appears explained by the factorization
M = \left[\begin{array}{cc} I_m & BD^{-1} \\ 0 & I_n \end{array}\right] \left[\begin{array}{cc} A - BD^{-1}C & 0 \\ C & D \end{array}\right] = \left[\begin{array}{cc} I_m & BD^{-1} \\ 0 & I_n \end{array}\right] \left[\begin{array}{cc} A - BD^{-1}C & 0 \\ 0 & D \end{array}\right] \left[\begin{array}{cc} I_m & 0 \\ D^{-1}C & I_n \end{array}\right]

which simply comes from first (upside-down) row reducing and then (upside down) column reducing.

This factorization has a few immediate results. One of the most common is that \det M = \det (A-BD^{-1}C) \det D. Compare with the formula for 2\times 2 matrices, by setting m=n=1. So M is invertible if and only if D and A-BD^{-1}C are both invertible.

This also has the benefit of giving a factorization of

M^{-1} = \left[\begin{array}{cc} I_m & 0 \\ -D^{-1}C & I_n \end{array}\right] \left[\begin{array}{cc} (A - BD^{-1}C)^{-1} & 0 \\ 0 & D^{-1} \end{array}\right] \left[\begin{array}{cc} I_m & -BD^{-1} \\ 0 & I_n \end{array}\right]
So the Schur complement comes up pretty much whenever you want to invert a block matrix.

Another less intuitive place where this comes up is in quadratic forms. For this case, assume that M is symmetric, and hence A and D are symmetric and C=B^T. If we want to write a quadratic form Q(\mathbf x) = \mathbf x^T M^{-1} \mathbf x, in terms of a block vector

\mathbf x = \left[\begin{array}{c} \mathbf x^{(1)} \\ \mathbf x^{(2)} \end{array}\right]

for a fixed x^{(2)}. Then we have the following formula

Q(\mathbf x^{(1)}~|~\mathbf x^{(2)}) := Q(\mathbf x) - \mathbf x^{(2)}D^{-1} \mathbf x^{(2)} = (\mathbf x^{(1)} - D^{-1}B^T\mathbf x^{(2)})^T (A-BD^{-1}B^T)^{-1}(\mathbf x^{(1)} - D^{-1}B^T\mathbf x^{(2)})

To prove this formula for symmetric M, inverting the decomposition above
M^{-1} = \left[\begin{array}{cc} I_m & -BD^{-1} \\ 0 & I_n \end{array}\right]^{T} \left[\begin{array}{cc} (A - BD^{-1}B^{T})^{-1} & 0 \\ 0 & D^{-1} \end{array}\right] \left[\begin{array}{cc} I_m & -BD^{-1} \\ 0 & I_n \end{array}\right]
\left[\begin{array}{cc} I_m & -BD^{-1} \\ 0 & I_n \end{array}\right] \left[\begin{array}{c} \mathbf x^{(1)} \\ \mathbf x^{(2)} \end{array}\right] = \left[\begin{array}{c} \mathbf x^{(1)} - BD^{-1}\mathbf x^{(2)}\\ \mathbf x^{(2)} \end{array}\right]
\mathbf x^T M^{-1} \mathbf x = \left[\begin{array}{c} \mathbf x^{(1)} - BD^{-1}\mathbf x^{(2)}\\ \mathbf x^{(2)} \end{array}\right]^T \left[\begin{array}{cc} (A - BD^{-1}B^{T})^{-1} & 0 \\ 0 & D^{-1} \end{array}\right] \left[\begin{array}{c} \mathbf x^{(1)} - BD^{-1}\mathbf x^{(2)}\\ \mathbf x^{(2)} \end{array}\right]

And so
\mathbf x^T M^{-1} \mathbf x = (\mathbf x^{(1)} - D^{-1}B^T\mathbf x^{(2)})^T (A-BD^{-1}B^T)^{-1}(\mathbf x^{(1)} - D^{-1}B^T\mathbf x^{(2)})+\mathbf x^{(2)}D^{-1} \mathbf x


Some Markov Chain Explainers

My class is starting to cover Markov Chains, and I thought that it would be good to share some videos which have some gentle introductions  with examples and questions that go along with it.

I am a fan of PBS’ Infinite Series,  a YouTube channel which features a lot of explanations of interesting math topics that you probably don’t see in school. Even if you have a higher STEM degree.

The first video that I want to share talks about random walks.

The second talks about more general Markov Chains, and has a nice trick for calculating the stationary distribution.

Reasonable Sets and Banach Tarski

In the past few lectures I have found myself saying “reasonable sets” a few times. (“The probability that X is in any ‘reasonable set’ A is \iint_A f(x,y) \ dxdy“) I thought I would take some time to explain what I mean by this.

It turns out that if we want to define a uniform random variable X (or any continuous random variable), there are unreasonable sets — sets which is it impossible to even define the probability that X is in that set. Perhaps the most mind-blowing way to see this can’t be the case is the Banach–Tarski Paradox. This paradox illustrates a way to rearrange a sphere into two spheres of the same size… thus if we could define the probability that our uniform random variable X was in those sets, there would be two disjoint sets where this probability was 1. This breaks the axioms of probability. So for the axioms of probability to hold we need to restrict ourselves to measurable sets (what mathematicians call reasonable sets).

Banach–Tarski is not the easiest example of an unreasoable set, but it is the most striking. Also, VSauce has a beautiful and fairly comprehensive video explaining the paradox…

Bayes’ Formula and Predicting the Future

Last week, a student came to me after class and asked if I could explain the idea behind Bayes Formula. While I would like to say I gave an eloquent explanation, in truth, my account was a bit, um, ramble-y. So I thought that I would write down what I was trying to say in hopes of attaining my ideal.

My ramblings placed the meaning of Bayes’ formula into two buckets. The first being reversing a conditional probability…i.e. converting from the the probability of B given A to the probability of A given B. That is a good explanation for the following formula.

P(B|A) = \frac{P(A|B)P(B)}{P(A)} and P(B|A) = \frac{P(A|B)P(B)}{P(A|B)P(B)+ P(A|B^c) P(B^c)}

On the left we are given that event A has happened, and on the right we are conditioning on the (\sigma-algebra generated by) B.

Thinking of it as the formula for switching conditionals highlights this formula’s role in many misconceptions — while, when written in the language of math, it is obvious P(A|B) \neq P(B|A), our language can easily conflate the two. For example, saying a test for a disease is 99% accurate is not the same as saying that 99% of those who test positive actually have the disease. Translating into math, the former may correspond to either P(test~positive |are~positive) = .99 where as the latter would be $P(are~positive|test~positive) = .99$. This mistake pops up everywhere once you start looking for it.

The other main use of the formula is predicting the future based on the past.  Bayes said the formula was necessary for a “…sure foundation for all our reasonings based on past facts…”. Here, Bayes’ formula tells you how to accurately incorporate a new observation into previous knowledge.  This strays a little from mathematics in that we need to pick Bayesian priors, which, to varying extents, need to be guessed at. The example that I most often use for this is how one should factor in a new piece of evidence at a trial…. although be careful, that could get you in trouble.

A good source for examples on this is in Nate Silvers’ book, which is aimed at an audience which is interested in math, but does not need more than some basic statistics knowledge.

What do you think? Does this reflect how you have used Bayes’ theorem, or how you have seen it used? Would it be helpful if I worked through some specific examples?