Monday, July 31, 2017

Notes on a Cosmology - Part 8, Bayesian Inference

Inference is short for "inductive inference" in this context and is roughly synonymous with causal reasoning. "A caused B" is an inference about the causal relationship between events A and B. Ideally, we would like to be able to give some reason to believe this inference. Classical probability theory faces serious obstacles in aiding inference.

The first major problem is the problem of mistaking correlation for causation. Suppose every time you see your neighbor pull into his driveway, his lawn sprinklers come on. "He must have some kind of controller in his car that switches on the sprinklers," you surmise. On closer examination, you find out that he has the sprinklers on a timer and has configured the timer to coincide with the times he is scheduled to be home so that, if the sprinklers malfunction, he will be present to intervene. There are many different kinds of reasoning errors related to false generalization from correlations.

The problem of induction

The second major problem facing classical probability in performing inference is the problem of induction. There are a variety of ways to illustrate the problem of induction. Essentially, the problem of induction has to do with generalizing from particular facts of knowledge. Specifically, the problem of induction arises when trying to generalize generalization itself.

For example, suppose that every spring throughout your long lifetime, the first daffodil blooms appear before March 31st. One year, when March 31st comes around, you find that the daffodils have not bloomed. You had no guarantee that they would bloom by that date, but you had formed a belief, based on considerable empirical evidence, that daffodils bloom no later than March 31st.

This is not such a big deal because, after all, the world does not turn on the blooming of daffodils. But the world does turn on its axis. And the reason you believe that the world will continue to turn on its axis is that it has always turned on its axis. Yet you have no more evidence that the world will continue turning on its axis than you had that daffodils always bloom before March 31st.

Probability: frequency or degree of belief?

The first place you might be tempted to turn for help is probability theory. Surely, we can prove that the probability of daffodils blooming before March 31st is greater than the probability that they will not, given that we know they have always bloomed before March 31st until now. There might be individual exceptions, but they are improbable. Thus, we will have rescued the Earth's rotation and the daily rising of the Sun. Sure, we can conceive of the possibility of the Earth stopping on its axis, but the fact that it has never happened makes the probability of such a world-shattering event very low.

Unfortunately, this line of reasoning does not work because it is a circular argument. You have to assume something about the probability of the future behavior of daffodils or planets before you can claim to know the long-run probabilities from the observational evidence. In short, the standard definition of probability concerns only the frequency of recurrent processes (or repeatable events). While we believe that daffodil blooms and planetary rotation are recurrent processes, we cannot use a probability argument to increase our confidence in this belief in any way.

There are two dominant views of probability. The first is called the frequentist view and is the "naive" or default view: the probability of an event is its long-run relative frequency, that is, the number of times it has been observed divided by the total number of observations. The second is called the Bayesian or subjective view. In this view, a probability is not just a frequency; it is a degree of belief, or confidence, about some event.

Conditional probabilities

If you're comfortable with conditional probabilities, feel free to skip this section. But for anyone who wants a refresher, I've provided an example to explain the concept. In Table 1, we have tabulated the number of guests attending a masquerade ball, with the requirement that each party-goer's mask must be either red or blue. We've also tabulated party-goers by their sex. The four inner cells of the table capture this information. On the margins of the table are the sums of each row and column, respectively. These marginal counts will be useful in calculating the conditional probabilities. The sum of all guests at the ball is 144, in the lower-right corner.

Table 1 - Party-goers by sex and mask-color

          Red    Blue    Total
Male       29      15       44
Female     83      17      100
Total     112      32      144

With this data, we can answer several different kinds of questions.
What is the probability that any given party-goer is male and wearing a red mask?
P(M,R) = 29 / 144 ≈ 0.20 = 20% (by abuse of notation)
What is the probability that any given party-goer is female and wearing a blue mask?
P(F,B) = 17 / 144 ≈ 0.118 = 11.8%
What is the probability that any given party-goer is male?
P(M) = 44 / 144 ≈ 0.306 = 30.6%
Note: The notation P(A,B) is read, "the probability of A and B together".

So far, the questions we have asked have only used 144 in the denominator, because that is the total number of guests attending the ball. But what if we ask a different kind of question:
What is the probability that a party-goer is wearing a red mask given that she is female? 
P(R|F) = 83 / 100 = 83%
Note: The notation P(A|B) is read, "the probability of A given that B".

The reason for this is straightforward - we already know that the party-goer is female, so there are only 100 possible people to consider, rather than all 144 guests. Of those 100, 83 are wearing red masks and 17 are wearing blue masks. Thus, the probability that a woman is wearing a red mask is 83%. But when we ask the same question about men, we get a different result:
P(R|M) = 29 / 44 ≈ 65.9%
This shows that we have different amounts of information available to us, depending on what we already know, that is, based on our prior knowledge. If we are interested in whether someone is wearing a red mask, knowing whether they are male or female changes that probability. Thus, the prior knowledge of one property can convey information about another property.
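To make these calculations concrete, here is a minimal Python sketch (my own illustration, not part of the original example) that reproduces the numbers above directly from the Table 1 counts:

```python
# Counts from Table 1: party-goers by sex and mask color.
counts = {
    ("male", "red"): 29,   ("male", "blue"): 15,
    ("female", "red"): 83, ("female", "blue"): 17,
}
total = sum(counts.values())  # 144 guests in all

def p_joint(sex, color):
    """P(sex, color): joint probability over all 144 guests."""
    return counts[(sex, color)] / total

def p_marginal(sex):
    """P(sex): sum one row of the table, then divide by the total."""
    return sum(counts[(sex, c)] for c in ("red", "blue")) / total

def p_conditional(color, sex):
    """P(color | sex): restrict the sample space to one row."""
    row = sum(counts[(sex, c)] for c in ("red", "blue"))
    return counts[(sex, color)] / row

print(p_joint("male", "red"))          # 29/144 ~ 0.20
print(p_marginal("male"))              # 44/144 ~ 0.306
print(p_conditional("red", "female"))  # 83/100 = 0.83
print(p_conditional("red", "male"))    # 29/44  ~ 0.659
```

Notice that "conditioning" amounts to nothing more than shrinking the denominator from the whole table to a single row.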

First attempt at deriving Bayes' theorem

The following formula shows how the standard probability, conditional probability and joint probability are related:
P(A,B) = P(A) • P(B|A)    (Eq. 1)
In the above table:
P(M,R) = P(M) • P(R|M) = 0.306 • 0.659 ≈ 0.20
This agrees with our prior calculations. But note that the joint probability does not depend on the order of the operands: P(R,M) = P(M,R). Thus,
P(A,B) = P(B,A)    (Eq. 2)
Rewriting Eq. 1 in terms of B instead of A gives:
P(B,A) = P(B) • P(A|B)    (Eq. 3)
Substituting Eq.1 and Eq.3 into Eq. 2 gives:
P(B) • P(A|B) = P(A) • P(B|A)
Dividing by P(B) gives Bayes' theorem:
P(A|B) = P(A) • P(B|A) / P(B)    (Eq. 4)
Note that we have relied only on basic algebra and the intuitive relations between probabilities illustrated in the previous section on conditional probabilities in order to derive Bayes' theorem. The theorem itself is indisputable, but its meaning and applications remain subjects of disagreement among the various schools of probability within mathematics.
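As a sanity check, we can verify Eq. 4 numerically against Table 1. The Python sketch below (reusing the same counts; purely illustrative) computes P(M|R) both directly from the table and via Bayes' theorem:

```python
# Numeric check of Bayes' theorem (Eq. 4) using the Table 1 counts:
# P(M|R) should equal P(M) * P(R|M) / P(R).
p_m = 44 / 144           # P(M): 44 of 144 guests are male
p_r = 112 / 144          # P(R): 29 + 83 = 112 red masks in total
p_r_given_m = 29 / 44    # P(R|M): red masks among the 44 men
p_m_given_r = 29 / 112   # P(M|R): men among the 112 red masks, read directly

rhs = p_m * p_r_given_m / p_r   # right-hand side of Eq. 4
assert abs(rhs - p_m_given_r) < 1e-12
print(p_m_given_r, rhs)  # both ~ 0.259
```

The point is that Bayes' theorem recovers P(M|R) given only P(M), P(R|M) and P(R), without reading that conditional off the table directly.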

Induction and Bayes' theorem

The standard view of Bayesian probability is that a probability reflects a degree of belief, or confidence, that something has happened, is the case, or will occur. Betting theory uses Bayesian analysis, as do polling, weather modeling, and other methods for predicting "chaotic" systems.

Let's look at another derivation of Bayes' theorem. I recommend the PDF at this link; it contains a clear and direct 2-page derivation of Bayes' theorem across a space of mutually exclusive possibilities (a partition of the event space). This form of Bayes' theorem is closer to the way we need to think about what is happening in scientific hypothesizing. We need to look at Bayes' theorem as a way of organizing the formal structure of a scientific experiment - hidden in Bayes' theorem is an organized way to relate scientific hypotheses and scientific evidence. This is called Bayesian inference.

Suppose we have a hypothesis H and we want to test this hypothesis. We perform an experiment, generating evidence E. We can organize this information using Bayes' theorem by expressing Eq. 4 in terms of H and E:
P(H|E) = P(H) • P(E|H) / P(E)    (Eq. 5)
In itself, Eq. 5 is not very enlightening since it is not clear what any one of these terms means. What is the "probability of the evidence", for example? We can plug numbers into a table like the one above, but we will quickly find that there is one term - P(H) - for which it appears that no reasonable number can be chosen! The marginal probability table above does not help us understand Bayes' theorem when applied to scientific reasoning, either. Instead, let's abandon this approach and express Bayes' theorem in terms of a partition across the hypothesis space:
P(Hi|E) = P(Hi) • P(E|Hi) / [P(H0) • P(E|H0) + P(H1) • P(E|H1)]    (Eq. 6)
H0 is the null hypothesis and H1 is the alternative hypothesis. We select H1 when E is consistent with rejecting H0; we retain H0 when E does not support rejecting it. In this equation, i can be 0 or 1, depending on which hypothesis we want to calculate for. There are more sophisticated frameworks for hypothesis testing, but this framework is sufficiently general that any more sophisticated form of hypothesis testing can be reduced to it. Read the Wikipedia articles on Bayesian inference and hypothesis testing for more information on the categories of hypothesis selection, hypothesis rejection and errors.

The meanings of the terms in Eq. 6 are as follows. P(Hi|E) is the "degree of belief"[1] we have in hypothesis Hi, given observational evidence E - the posterior probability. P(E|Hi) is the probability that we will observe evidence E given hypothesis Hi - the likelihood. The sum in the denominator expresses the weighted partition of the hypothesis "space". The entire fraction expresses the posterior probability of a particular hypothesis chosen from the space of all possible hypotheses. In this case, there are just two possibilities but, arithmetically, this is the same as applying Bayes' theorem to any partitioning of a probability space, that is, when there are multiple hypotheses. The only terms we have not explained are P(Hi), i = 0, 1.
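To see the partition form in action, here is a small Python sketch of a two-hypothesis update. The priors and likelihoods are invented numbers, chosen only to show the arithmetic of Eq. 6, not drawn from any real experiment:

```python
# Two-hypothesis illustration of Eq. 6. H0 is the null hypothesis,
# H1 the alternative; all numbers are invented for illustration.
priors = {"H0": 0.5, "H1": 0.5}        # P(Hi): background knowledge
likelihoods = {"H0": 0.1, "H1": 0.4}   # P(E|Hi): chance of seeing E under Hi

# Denominator of Eq. 6: total probability of the evidence,
# summed over the partition of the hypothesis space.
p_evidence = sum(priors[h] * likelihoods[h] for h in priors)

# Posterior for each hypothesis: prior times likelihood, normalized.
posteriors = {h: priors[h] * likelihoods[h] / p_evidence for h in priors}
print(posteriors)  # {'H0': 0.2, 'H1': 0.8} - posteriors sum to 1
```

The same loop works unchanged for any number of hypotheses, which is the point of the partition form.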

Prior probability

The term P(Hi) is read "the prior probability of hypothesis Hi." What is a prior probability distribution? Intuitively, the prior probability is a mathematical placeholder for "background knowledge". Suppose you measure the probability that it will rain on any given day in your region in September and find that it is 21%. When predicting the probability of rain on any given day in September of the next year, the most you can say on the basis of the data you have collected is that there is a 21% chance, each day, that it will rain. But suppose you know that it has rained the day before. Is the chance of rain today still 21%? Of course not - we know this because we know that rain tends to occur in "streaks", that is, in multi-day groups separated by multi-day groups of non-rainy days. This kind of knowledge about the structural behavior of the problem under examination is expressed in the prior probability. The probability of a hypothesis is a reflection of background knowledge about the problem. "The probability that it rains today, knowing that it rained yesterday, is the same as it would otherwise have been" is a possible hypothesis but it is improbable given the background knowledge we have about the nature of weather.
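To make the rain example concrete, here is a toy Python sketch. The transition counts below are invented (hypothetical figures for ninety September days), chosen only to show how knowing yesterday's weather changes today's probability:

```python
# Hypothetical record of ninety September days, tabulated as
# (yesterday, today) transition counts. All numbers are invented;
# they encode the "streaky" structure of rain described above.
transitions = {
    ("rain", "rain"): 14, ("rain", "dry"): 5,
    ("dry", "rain"): 5,   ("dry", "dry"): 66,
}
total_days = sum(transitions.values())  # 90

# Unconditional chance of rain on any given day: ~21%.
p_rain = sum(n for (_, today), n in transitions.items()
             if today == "rain") / total_days

# Chance of rain today *given* rain yesterday: much higher,
# because rainy days come in streaks.
rainy_yesterdays = sum(n for (yday, _), n in transitions.items()
                       if yday == "rain")
p_rain_given_rain = transitions[("rain", "rain")] / rainy_yesterdays

print(round(p_rain, 2))             # ~0.21
print(round(p_rain_given_rain, 2))  # ~0.74
```

The streak structure lives in the prior, in other words; the bare 21% figure throws that structure away.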

The trouble with the intuitive notion of prior probability, however, is that it is so squishy. The prior probability distribution is numerical and numbers should, ideally, measure something. This problem is very important because the choice of a prior can have as much, or more, influence on the outcome of Bayesian inference as the experimental evidence. The overarching goal of the scientific method is to reason about the world-as-it-is, not the world-as-we-suppose-it-is. The danger of using methods of inference that rely on a prior is that we can smuggle our suppositions into a scientific theory under the guise of real, experimental evidence.

Fortunately, there is a rigorous method for addressing this danger, putting prior probability on an objective footing. We will begin investigating this topic in the next post.

Next: Part 9, Algorithmic Inference

---

1. Note that the stochastic theory of hypothesis testing does not interpret the probability as a "degree of belief" or "confidence" in the hypothesis. There are a variety of interpretations in the stochastic theory - I will not venture to comment on any of them since I do not find them compelling. As we have formulated it in this post, the Bayesian interpretation is also unjustified. We will rectify this in the next post.
