Probability Theory in One Page

Basic Concepts

This is a summary of some principles of probability theory. The context is the types of reasoning that artificial (and human) agents perform. It’s not a mathematical treatise, and it’s not deep.

Probability is intrinsically related to randomness, which in turn is easier to understand intuitively than it is to pin down precisely. In fact, if you do some research, you won’t have to go further than Wikipedia to find that there are alternative models of randomness. For our purposes, it refers to anything that can’t be known with certainty. If information theory is what lets us measure information accurately, probability theory is what lets us deal with information that can’t be measured accurately.

You can think of probability theory as a way of dividing a bit of information into smaller units. If a bit’s value of “1” means “yes” and the value “0” means “no,” then “maybe” could be thought of as a probability with a numerical value somewhere between 0 and 1.

If you flip a normal coin (called a “fair” coin in probability parlance), you normally have no way to predict whether it will come up heads or tails. Both outcomes are equally likely. There is one bit of uncertainty; the probability of a head, written p(h), is 0.5 and the probability of a tail (p(t)) is 0.5. The sum of the probabilities of all the possible outcomes is always 1.0, and for a fair coin there is exactly one bit of uncertainty about the outcome before the flip.

But what if we build a coin that’s weighted so it comes up heads 3/4 of the time? That is, instead of p(h) = 0.5 = p(t), we have p(h) = 0.75 and p(t) = 0.25. We still don’t know in advance which of the two outcomes will occur (strictly speaking, the uncertainty is now a little less than one full bit), but the unequal probabilities allow us to adjust our behavior (by placing bets, for example) more intelligently than if the coin were fair.
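To make that concrete, here is a small Python sketch (mine, not part of the original page) that computes the uncertainty of a single flip in bits, using the standard Shannon entropy formula. The fair coin comes out at exactly one bit; the weighted coin comes out a little lower.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits: -sum over outcomes of p * log2(p)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy_bits([0.5, 0.5]))    # fair coin: 1.0 bit of uncertainty
print(entropy_bits([0.75, 0.25]))  # weighted coin: about 0.811 bits
```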

Multiple Events

What happens if you flip a coin more than once? If it is fair and you flip it two times, there are four equally likely possible outcomes: tt (tails on the first flip followed by tails on the second flip), th (tails followed by heads), ht, and hh. This pattern should look familiar: if you substitute “0” for “t” and “1” for “h” you get the four binary combinations of two bits, in numerical order: 00, 01, 10, and 11.

Since exactly one of the four outcomes has to happen, the sum of the probabilities for the four possibilities has to be 1.0. To relate this to information theory, this is like saying there are two bits of uncertainty about which of the four outcomes will happen before each pair of coin flips. And since each combination is equally likely, the probability of each outcome is 1/4 = 0.25.

This is a case of an “AND” relationship: the probability of heads on the first flip AND tails on the second flip (ht) is 0.25. When there is an AND relationship, you multiply the component probabilities to get the probability of the sequence: multiply p(h), which is 0.5, by p(t), which is also 0.5, to get the probability of heads on the first flip AND tails on the second flip: 0.25.

If you ask an “OR” question, you add probabilities instead of multiplying, but you have to be careful not to add the same value multiple times. If the question is, “with a fair coin, what is the probability of getting heads on the first flip OR the second flip?” you add the probabilities for all the combinations that start with h (ht and hh; 0.25 + 0.25 = 0.5) and all the combinations that end with h (th and hh; 0.25 + 0.25 = 0.5), but that counts the hh combination twice, so you have to subtract one of them out of the answer: (0.25 + 0.25) + (0.25 + 0.25) - 0.25 = 0.75. It’s usually easier to convert OR questions to the negation of an AND question, and to subtract the result from 1.0. The probability of getting a head on either the first or the second flip is 1.0 minus the probability of not getting heads on the first flip and not getting heads on the second flip: 1.0 - (0.5 × 0.5) = 1.0 - 0.25 = 0.75.
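As a sanity check, here is a short Python sketch (again mine, not the article’s) that enumerates the four two-flip outcomes of a fair coin, builds the AND probabilities by multiplication, and computes the OR probability both by inclusion-exclusion and by the complement trick.

```python
from itertools import product

p = {"h": 0.5, "t": 0.5}  # fair coin

# All four two-flip sequences and their probabilities (AND rule: multiply).
outcomes = {a + b: p[a] * p[b] for a, b in product("ht", repeat=2)}
print(outcomes)  # {'hh': 0.25, 'ht': 0.25, 'th': 0.25, 'tt': 0.25}

# P(heads on the first flip OR the second flip), by inclusion-exclusion ...
p_first = outcomes["hh"] + outcomes["ht"]
p_second = outcomes["hh"] + outcomes["th"]
print(p_first + p_second - outcomes["hh"])  # 0.75 (hh was counted twice)

# ... and by the complement: 1 minus P(no heads at all).
print(1.0 - outcomes["tt"])  # 0.75
```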

Sometimes you have to play with the words to turn probability questions into AND or OR questions. For example, “what is the probability of two heads in a row?” has to be turned into “what is the probability of heads on the first flip and heads on the second flip?” But once you do that, you don’t need any rules beyond the ones above.

To keep the link to information theory intact when talking about multiple events, remember that it’s the whole sequence of events that is being reduced to a single bit. That is, the information theory version of the “probability of at least one head” question has to be turned into a single question that is answered “yes” or “no:” “Is the outcome th, ht, or hh?”

Conditional Probability

Conditional probability refers to the situation when multiple events occur in sequence and the events are linked in some way such that knowledge of prior events can change probabilities of subsequent events. This is distinguished from the coin flip case, where the first flip tells you nothing about what will happen on the second flip. If you know that it’s always cold when it snows, then observing that it is snowing lets you say with certainty that it is cold, because the probability of cold given snow is 1.0. Note that this can work both ways, but (usually) not symmetrically: the probability of it being cold when it is snowing is different from the probability that it is snowing when it is cold.

As an aside, let’s quickly point out that conditional probability is related to correlation, but not necessarily to causation. If you go for a walk and see that a lot of tall people are wearing yellow socks, it doesn’t mean that yellow socks cause tallness, nor that tallness causes yellow socks. But they are correlated (at least during that walk), and the conditional probability of seeing yellow socks given that someone is tall is higher than if the person is not tall, and vice-versa.

The notation for conditional probability uses the ‘|’ symbol: p(a | b) is “the probability of a given b.” In the coin-flipping case, p(h | t) is the probability that the second flip is heads given that the first flip came up tails. For a fair coin, the value would be 0.5. For the weighted coin, the value would be 0.75. That is, in both cases, the outcome of one flip doesn’t affect the probability of the next outcome, and the conditional probabilities don’t tell you anything that the regular probabilities don’t. But p(snow | cold) gives you more information than p(snow) alone.
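To see the asymmetry with numbers, here is a small sketch using an invented joint probability table for a day’s weather (the numbers are made up purely for illustration, not taken from the text). It reproduces the situation above: p(cold | snow) is 1.0, while p(snow | cold) is much smaller.

```python
# Invented joint probabilities for a day's weather (they sum to 1.0).
joint = {
    ("cold", "snow"): 0.10,
    ("cold", "no snow"): 0.30,
    ("warm", "snow"): 0.00,   # it never snows when it's warm, in this made-up town
    ("warm", "no snow"): 0.60,
}

p_snow = sum(p for (temp, precip), p in joint.items() if precip == "snow")
p_cold = sum(p for (temp, precip), p in joint.items() if temp == "cold")

# Conditional probability: p(a | b) = p(a and b) / p(b).
p_cold_given_snow = joint[("cold", "snow")] / p_snow   # 1.0
p_snow_given_cold = joint[("cold", "snow")] / p_cold   # 0.25
print(p_cold_given_snow, p_snow_given_cold)
```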

With conditional probabilities, you can start to answer other kinds of questions than you could with regular probabilities. For example, if you flip a coin 100 times and it comes up heads 100 times, you are pretty certain it’s not a fair coin. The chance that a fair coin would produce that run is not zero; it’s 0.5^100, roughly 8 × 10^-31. But if you had to bet on the outcome of the next flip, you’d be stupid not to pick heads.
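For scale, here is a back-of-the-envelope check (not part of the original page) of just how unlikely that run is for a fair coin, compared with the 3/4-weighted coin from earlier:

```python
p_fair = 0.5 ** 100       # about 7.9e-31: the fair coin's chance of 100 heads
p_weighted = 0.75 ** 100  # about 3.2e-13: still tiny, but vastly more likely
print(p_fair, p_weighted, p_weighted / p_fair)  # the ratio is about 4e17
```

Neither coin makes 100 straight heads likely, but the weighted coin is hundreds of quadrillions of times more plausible as an explanation, which is why you’d bet on heads.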

Bayes’ Theorem

Conditional probabilities play an important role in how we (or artificial intelligence agents) make decisions based on observations of the real world. A small bit of algebra leads to a key equation called Bayes’ Theorem. First, a basic principle: p(a · b) = p(a | b) · p(b). The probability of both a and b happening is equal to the probability of a given b times the probability of b. A way to see this is with a Venn Diagram:

Here, the rectangle represents the universe of all possible outcomes: its area represents the total probability that some event occurs, which is 1.0 by definition. The area of the red circle, which we’ll call a, represents the probability that event a happens by chance; the area of the green circle, which we’ll call b, represents the probability that event b happens by chance; the area of the overlapping region shows the joint probability, the chance that a random event will be both a and b. Depending on the particular universe of events being depicted, the areas of a, b, and their overlapping region could each represent anything from 0 to 1.0.

In the figure, the rectangle is 300 pixels wide and 150 pixels tall, and the red and green circles both have a radius (r) of 60 pixels; since the area of a circle is π × r², you can work out p(a) and p(b) for this particular picture.

The same reasoning applies to multiple events that are somehow dependent on one another. It’s a bit contrived, but picture a magic dart-throwing device that lands darts at uniformly random points in the rectangle: pretend that when a dart lands inside the green circle, everything else disappears so that the green circle becomes the “universe” for a second dart throw. The probability that the second dart lands in the overlapping region is p(a | b), so the overall probability of ending up in the overlapping region, p(a · b), is the probability of landing there given that the universe has been reduced to the green circle, times the probability of landing in the green circle in the first place: p(a | b) · p(b).
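If you want to check the relationship p(a · b) = p(a | b) · p(b) against the picture’s own numbers, here is a Monte Carlo sketch of the dart-throwing device. The figure doesn’t give the circle centers, so the centers below are my own assumption (two radius-60 circles side by side inside the 300 × 150 rectangle); the point is the relationship between the estimated probabilities, not their exact values.

```python
import random

WIDTH, HEIGHT, R = 300, 150, 60
CENTER_A = (120, 75)   # red circle  -- assumed position, not given in the figure
CENTER_B = (180, 75)   # green circle -- assumed position, not given in the figure

def in_circle(x, y, cx, cy, r=R):
    return (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2

random.seed(0)
n = 1_000_000
hits_a = hits_b = hits_ab = 0
for _ in range(n):
    x, y = random.uniform(0, WIDTH), random.uniform(0, HEIGHT)  # one random dart
    a = in_circle(x, y, *CENTER_A)
    b = in_circle(x, y, *CENTER_B)
    hits_a += a
    hits_b += b
    hits_ab += a and b

p_a, p_b, p_ab = hits_a / n, hits_b / n, hits_ab / n
p_a_given_b = hits_ab / hits_b   # restrict the "universe" to darts that landed in b
print(f"p(a)={p_a:.3f}  p(b)={p_b:.3f}  p(a·b)={p_ab:.3f}  p(a|b)·p(b)={p_a_given_b * p_b:.3f}")
```

The last two printed values agree, which is exactly what the equation claims: the joint probability is the conditional probability scaled by the probability of the conditioning event.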

The “small bit of algebra” mentioned above is this: p(a · b) is the same whether you think of it as a portion of a or as a portion of b:

p(a · b) = p(b | a) · p(a)
and
p(a · b) = p(a | b) · p(b)
That is,
p(b | a) · p(a) = p(a | b) · p(b)

You do the algebra: divide both sides by p(a), or by p(b), and see what equations you get for p(b | a) and p(a | b).
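As a quick numeric check of that identity, here is a sketch that reuses the invented weather numbers from the earlier example (with a standing for cold and b for snow); both orderings give back the same joint probability.

```python
p_cold_and_snow = 0.10          # invented joint probability from the earlier sketch
p_cold, p_snow = 0.40, 0.10

p_snow_given_cold = p_cold_and_snow / p_cold   # p(b | a) = 0.25
p_cold_given_snow = p_cold_and_snow / p_snow   # p(a | b) = 1.0

# Both sides of p(b | a) · p(a) = p(a | b) · p(b) equal the joint probability, 0.1:
print(p_snow_given_cold * p_cold)
print(p_cold_given_snow * p_snow)
```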

Causal and Diagnostic Reasoning

We mentioned above that correlation is not the same as causation. But what if two events are causally related? What if a has some probability of causing b? Bayes’ Theorem says we can reason causally (given the cause, a, what is the probability that the effect, b, will occur?) and we can reason diagnostically (if we observe a known effect, b, what is the probability that it was caused by a?). The term “diagnostic” is used because of the analogy with how doctors diagnose a patient’s disease state (the cause) based on the symptoms presented, the results of lab tests, etc.
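To put numbers on the two directions, here is a completely made-up diagnostic example (the page doesn’t give one, so every figure below is a labeled assumption): suppose a disease affects 1% of patients, causes a particular symptom 90% of the time, and the symptom also appears in 5% of healthy patients. The causal direction, p(symptom | disease), is given directly; the diagnostic direction, p(disease | symptom), comes from Bayes’ Theorem.

```python
# Hypothetical numbers, chosen only for illustration.
p_disease = 0.01                    # prior: 1% of patients have the disease
p_symptom_given_disease = 0.90      # causal direction: the disease usually causes the symptom
p_symptom_given_healthy = 0.05      # the symptom also occurs without the disease

# Total probability of seeing the symptom at all.
p_symptom = (p_symptom_given_disease * p_disease
             + p_symptom_given_healthy * (1 - p_disease))

# Diagnostic direction via Bayes' Theorem: p(a | b) = p(b | a) * p(a) / p(b).
p_disease_given_symptom = p_symptom_given_disease * p_disease / p_symptom
print(p_disease_given_symptom)   # about 0.15: observing the symptom raises 1% to roughly 15%
```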

Unfortunately, we’re at the bottom of this web page and don’t have room to see how Bayes’ Theorem can actually be used by intelligent agents to make decisions …