Conceptualizing Bayes' Theorem
Introduction
People who know me well enough have probably heard me say, at one time or another: "I love math. Why? Because I believe math is truth! Well ... unless it's statistics ..." Of course I say that jokingly. In fact, quite the opposite: statistics is really about getting at truth in spite of our inherent biases, misattributions, and misconceptions.
Back when I was in grad school I spent a lot of time studying -- and later teaching -- statistics (mainly as applied to psychological research). But recently, as my interest in consciousness and Artificial Intelligence has been rekindled, I've been revisiting statistical approaches and methodologies. And over the course of doing so, I got to thinking about Bayes' Theorem.
Bayes' Theorem (and, more generally, Bayesian statistics) is a statistical approach that delves into the nature of belief as much as it explicates mathematical principles for probabilistic inference. Consequently, over and above adding more tools to the statistical-analysis tool belt, conceptualizing Bayes' Theorem has profound implications for our general understanding of probability.
Bayes' Theorem
Bayes' theorem is all about likelihood. Mathematically, it is expressed in terms of conditional probabilities. Given that it's all about belief, I like to express Bayes' theorem in terms of events and hypotheses.
$$ P ( Hyp | Event ) = \frac {P ( Event | Hyp ) \times P ( Hyp ) } { P ( Event ) } $$
Where:
- $P(Hyp | Event)$ represents the probability of a hypothesis being true given that an event has occurred. This is the posterior probability.
- $P(Event | Hyp)$ represents the probability of the event occurring given that the hypothesis is true (which may be referred to as the likelihood).
- $P(Hyp)$ represents the prior probability of the hypothesis irrespective of the event.
- $P(Event)$ represents the probability of the event irrespective of the hypothesis. It may be considered as the marginal likelihood of the event.
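Since the theorem itself is a single line of arithmetic, it translates directly into code. Here's a minimal Python sketch (the function and argument names are my own, not from any library):

```python
def posterior(likelihood: float, prior: float, marginal: float) -> float:
    """Bayes' Theorem: P(Hyp | Event) = P(Event | Hyp) * P(Hyp) / P(Event)."""
    return likelihood * prior / marginal

# Illustrative, made-up numbers: P(Event | Hyp) = 0.9, P(Hyp) = 0.3, P(Event) = 0.5
print(posterior(0.9, 0.3, 0.5))  # -> 0.54 (modulo floating-point rounding)
```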
Examples
Forecasting the Weather
As with anything math, I always like examples. Being a New Englander and an avid outdoorsperson (and given that my father was a meteorologist), I often worry about the weather. Since, as I write this, it's January and quite cloudy outside my window, let's consider, in Bayesian terms, whether I should be concerned about snow.
- Let's assume the probability that it will snow in my location on a given day in January ($P(Hyp)$) is 25% (based on historical data). This is the prior probability.
- But I've also observed an event that should impact the prior: it's cloudy. The probability of an overcast day occurring on a January day in New England is $P(Event) = 50\%$.
- But in Bayesian terms that's not the whole story. Whether I should be concerned about snow given that it's cloudy is also impacted by the likelihood that it will be overcast if it's snowing. Given snow, it may be cloudy 99% of the time, but on rare occasions I've seen snow when it's not completely overcast. So let $P( Event | Hyp ) = 99\%$.
Now let's do the math and calculate the probability of snow in Boston on January 9th (the hypothesis) given that it's overcast in the morning (the event):
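Substituting the three values above into the theorem:

$$ P ( Hyp | Event ) = \frac { 0.99 \times 0.25 } { 0.50 } = 0.495 $$

So observing the clouds nearly doubles the estimate, from the 25% prior to roughly a 49.5% chance of snow.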
So, Bayesian analysis enables us to add information to the determination of a probability, often greatly enhancing the estimate. In the weather example, adding observational information to the equation raised the probability of snow significantly above the baseline (prior) probability.
Medical Diagnostics
Just for fun let's work through a slightly more complex example. Imagine a scenario where you want to determine the probability of a patient presenting with a particular rare condition. The condition can be detected by a test which has 99% accuracy. In other words, if the condition is present the probability that the test is positive is 99% (there is a 1% chance of a false negative). If the condition is not present the test will be positive 1% of the time (a false positive).
- Let's say the condition occurs in 1% of the population ($P(Cond) = 0.01$).
- We know the test is 99% accurate:
  - If the condition is present, the test is positive 99% of the time ($P(Pos | Cond) = 0.99$).
  - If the condition is not present, the test is positive only 1% of the time ($P( Pos | \neg Cond) = 0.01$).
Let's say we have a case where a patient tests positive for the condition. The question is: what is the probability that the patient actually has the condition? In Bayesian terms we ask: "What is the probability that the patient has the condition given the evidence of a positive test result?"
At this point we have all the information we need to apply Bayes' Theorem, but not quite in the form given above. That is, we don't have a number for $P( Pos )$ (the probability of getting a positive test result irrespective of the condition). But we can determine that probability and expand the theorem to address our question.
We can obtain $P(Pos)$ by summing over both possibilities -- condition present or absent -- across the entire population (the law of total probability):
$$ P( Pos ) = P( Pos | Cond ) \times P( Cond ) + P( Pos | \neg Cond ) \times P( \neg Cond ) $$
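With the numbers from our example, that works out to:

$$ P( Pos ) = 0.99 \times 0.01 + 0.01 \times 0.99 = 0.0099 + 0.0099 = 0.0198 $$

In other words, roughly 2% of all tests come back positive, and exactly half of those positives come from people who don't have the condition.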
Given that, we can expand our original formulation of Bayes' Theorem. If the hypothesis (Hyp) put to the test is that the patient has the condition, and the event (Event) is testing positive, then:
$$ P ( Hyp | Event ) = \frac {P ( Event | Hyp ) \times P ( Hyp ) } { P( Event | Hyp ) \times P( Hyp ) + P( Event | \neg Hyp ) \times P( \neg Hyp ) } $$
Now the calculation boils down to simple arithmetic:
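Using the prevalence as the prior and the test's sensitivity as the likelihood:

$$ P( Cond | Pos ) = \frac { 0.99 \times 0.01 } { 0.99 \times 0.01 + 0.01 \times 0.99 } = \frac { 0.0099 } { 0.0198 } = 0.50 $$

Despite the 99% test accuracy, a positive result means only a 50% chance that the patient actually has the condition -- a coin flip.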
So, with this example, we see that it's not enough to consider only the test's accuracy when gauging the probability of a correct diagnosis. Over and above that, we need to consider the frequency of the condition in the population at large.
This example demonstrates a crucial point about Bayesian probability: the prior probability (the prevalence of the condition in the population) greatly influences the posterior probability (the probability of having the condition given a positive test result). Failing to bring the prior probability to bear on the assessment (referred to as base-rate neglect) is a well-known fallacy in statistical reasoning -- a form of cognitive bias that can lead to errors in judgment when estimating probabilities associated with uncertain events.
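To see base-rate neglect in action, here's a short Python sketch that recomputes the posterior for a range of priors using the expanded form of the theorem. The function and variable names are my own, and every prevalence value other than the 1% from our example is hypothetical:

```python
def p_cond_given_pos(prevalence: float, sensitivity: float = 0.99,
                     false_positive_rate: float = 0.01) -> float:
    """Expanded Bayes' Theorem: P(Cond | Pos), with P(Pos) computed
    via the law of total probability."""
    p_pos = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
    return sensitivity * prevalence / p_pos

# Sweep the prior; only the 1% prevalence comes from the example above.
for prevalence in (0.001, 0.01, 0.10, 0.50):
    print(f"prevalence {prevalence:6.1%} -> P(Cond | Pos) = {p_cond_given_pos(prevalence):.1%}")
```

For a condition affecting one person in a thousand, the same 99%-accurate test yields only about a 9% chance that a positive result is a true positive; for a condition half the population has, a positive result is a near certainty.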
Summary
In summary, Bayes' theorem defines probability in terms of evidence for a hypothesis. Key concepts include:
- Prior Probability: Our initial belief in the hypothesis before observing any evidence.
- Likelihood: How likely the observed evidence would be if the hypothesis were true.
- Posterior Probability: Our updated belief in the hypothesis after considering the evidence.
In other words, the Bayesian interpretation of probability is one where probability expresses a degree of belief in an event: a level of certainty.
Discussion: Understanding Statistics
All this bears thinking about, as much for deepening our understanding of statistics as for computational applications. Bayesian analysis largely predates the formalization of what some statisticians term "classical" hypothesis testing. Bayes' theorem was originally formulated by Thomas Bayes in the 18th century. Back then many philosophers were concerned with the nature of belief (and those concerns are as relevant today as ever)! Subsequent statistical approaches shifted toward methods aimed at "proving" the "truth" of hypotheses through sampling from larger populations. Back when I was in grad school, Bayesian analysis was reemerging as an analytic method, often posed in contrast to formal hypothesis testing based on sampling distributions.
I suspect much of the confusion people feel around statistics stems from the tendency to conflate belief with fact. To fully understand the concept of probability, it's critical to understand that nothing is fully certain until after the fact.
In other words, to me statistics is all about belief. At the heart of statistical analysis lies the notion that nothing is ever 100% certain. Frequentists have historically been concerned with drawing conclusions about populations based on evidence present in samples. But conclusions based on aggregate data can't be applied to individuals. Going back to the diagnosis example, it's tempting to say something like, "Oh, based on your symptoms you have a 95% chance of having the disease". The fact is, you either have the disease or you don't. That's a constant. Probabilistic assessments rely on variability over populations, and you can't confuse statements based on aggregate sample statistics with individual assessments. "95% of the people in this group have the disease" does not mean the same thing as "You have a 95% chance of having the disease". The distinction is subtle, but it's not just arguing semantics. There are very real consequences of statistical fallacies!
So in conclusion, I'd have to say that going back and revisiting the Bayesian approach has been worth the effort. I feel I've gained some new insights into the nature of statistical reasoning and the understanding of probability. Bayesian inference is a key methodological approach in many machine learning applications. But key to them all is embracing uncertainty and understanding levels of certainty in any sort of classification problem!