Discrete Probability

Notes based on Sections 7.1 and 7.2 of Ertel.

We will begin with some basic coverage of concepts and terminology from discrete probability.

A discrete event space, Ω, is an enumerable set of mutually exclusive elementary events, Ω = { ω1, ω2, ... }. We will often focus on finite event spaces.

More generally, we can define compound events that are sets of elementary events. For example, when rolling a standard die, we can consider the event A to correspond to rolling an even value, thus (die = 2) or (die = 4) or (die = 6). Compound events can be described using notation from set theory, or propositional logic. If we let A and B denote events, then we can define the following

Set Notation   Logic Notation   Description
A ∩ B          A ∧ B            intersection
A ∪ B          A ∨ B            union
Ω              true             certain event
∅              false            impossible event

We let the notation P(A) represent the probability that event A occurs. We then have the following facts:

P(Ω) = 1 and P(∅) = 0
0 ≤ P(A) ≤ 1 for any event A
P(¬A) = 1 - P(A)
If A and B are mutually exclusive, then P(A ∨ B) = P(A) + P(B)

Joint Probability Distribution
We will often consider two or more random variables and their outcomes. It is common to use notation

P(A,B)
to mean P(A ∧ B), that is, the probability of both A and B occurring.

For Boolean-valued events, the joint distribution can be described using four possible outcomes: P(A,B), P(A,¬B), P(¬A,B), and P(¬A,¬B). A common notation is to denote the entire distribution as P(A,B), with a bold P rather than the usual P. A similar approach can be used for multivalued discrete random variables. For example, we could consider the 12 outcomes that arise from combining one coin flip with one roll of a die.

Example: Consider a population of only those who come to a doctor with acute stomach pain. Let event App represent that a patient has acute appendicitis, and event Leuko represent that a patient's leukocyte value is greater than some threshold. We can describe the joint probability distribution as
P(App,Leuko) App ¬App
Leuko 0.23 0.31
¬Leuko 0.05 0.41

Marginalization
Given the entire joint distribution, it is easy to eliminate a variable by summing the probabilities over all possible values of that variable.

For example, we can determine P(Leuko) = P(Leuko,App) + P(Leuko,¬App) = 0.23 + 0.05 = 0.28

P(App,Leuko) App ¬App Total
Leuko 0.23 0.31 0.54
¬Leuko 0.05 0.41 0.46
Total 0.28 0.72 1
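
As a concrete sketch in Python (the dictionary representation and function names are ours, chosen for illustration), marginalization over the table above is just a sum over the values of the eliminated variable:

# Joint distribution P(App, Leuko) from the table above,
# keyed by (app, leuko) truth values.
joint = {
    (True,  True):  0.23,
    (True,  False): 0.05,
    (False, True):  0.31,
    (False, False): 0.41,
}

def marginal_app(app):
    # P(App = app): sum over both values of Leuko
    return sum(joint[(app, leuko)] for leuko in (True, False))

def marginal_leuko(leuko):
    # P(Leuko = leuko): sum over both values of App
    return sum(joint[(app, leuko)] for app in (True, False))

print(round(marginal_app(True), 2))     # P(App)   = 0.23 + 0.05 = 0.28
print(round(marginal_leuko(True), 2))   # P(Leuko) = 0.23 + 0.31 = 0.54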

Conditional Probability
P(A) represents what is known as the a priori probability of event A; that is, the probability of A occurring, without any additional information. However, it is common to consider a conditional probability P(A | B), which is the probability that A occurs, given knowledge that event B occurs.

That is, we restrict the event space to only those elementary events for which B is true, and then we consider the probability of A. (If B occurs with probability 0, then the conditional probability is undefined).

P(A | B) = P(A ∧ B) / P(B).

Example: The speeds of 100 vehicles on a particular road were measured, along with whether each driver was a student. The outcomes were:

Event Frequency Relative Frequency
Vehicle Observed 100 1
Driver is student (S) 30 0.3
Car is speeding (G) 10 0.1
Speeding Student (S ∧ G) 5 0.05
What is P(G | S)? That is, what is the probability of speeding, given that the driver is a student?

P(G | S) = P(S ∧ G) / P(S) = 0.05 / 0.3 ≈ 0.17
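
A minimal sketch of the same computation from the raw counts (variable names are illustrative):

# Relative frequencies from the traffic example.
total = 100             # vehicles observed
students = 30           # driver is a student (S)
speeding_students = 5   # speeding student (S ∧ G)

p_s = students / total                  # P(S) = 0.3
p_s_and_g = speeding_students / total   # P(S ∧ G) = 0.05
p_g_given_s = p_s_and_g / p_s           # P(G | S) = P(S ∧ G) / P(S)

print(round(p_g_given_s, 2))            # 0.17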

Another example: let us revisit appendicitis and leukocytes. The joint distribution that the above table describes is actually
P(App,Leuko | StomachPain)
as those statistics were gathered only over that segment of the population. But for our discussion, we will just consider StomachPain as a presumption of our event space.

What is P(Leuko | App)?

P(Leuko | App) = P(Leuko ∧ App) / P(App) = 0.23 / 0.28 ≈ 0.82
Thus, 82% of appendicitis cases demonstrate elevated leukocytes.

A doctor would be more interested in diagnosis. What is P(App | Leuko)?

P(App | Leuko) = P(Leuko ∧ App) / P(Leuko) = 0.23 / 0.54 ≈ 0.43
So even for a patient exhibiting stomach pain and elevated leukocytes, appendicitis is still less likely than not.

Independence
Note well that a conditional probability does not describe a causal effect between two events; it is purely a statistical measurement. With that said, we say that two events A and B are independent (again, in a statistical sense) if P(A | B) = P(A).

Note that P(A | B) = P(A) implies P(B | A) = P(B) (assuming P(A) and P(B) are nonzero). This can be seen by noting that
P(A) = P(A | B) = P(A ∧ B) / P(B), and therefore,
P(A ∧ B) = P(A) * P(B).
We can now compute
P(B | A) = P(A ∧ B) / P(A) = P(A) * P(B) / P(A) = P(B).
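
As a small sketch, a purely numerical independence check for two Boolean events can use the factorization P(A ∧ B) = P(A) * P(B) derived above (the function and table names are ours):

def is_independent(joint, tol=1e-9):
    # joint is keyed by (a, b) truth values; for two Boolean events it
    # suffices to check the single factorization P(A ∧ B) = P(A) * P(B).
    p_a = joint[(True, True)] + joint[(True, False)]
    p_b = joint[(True, True)] + joint[(False, True)]
    return abs(joint[(True, True)] - p_a * p_b) < tol

# The appendicitis/leukocyte table: P(Leuko | App) ≈ 0.82 while P(Leuko) = 0.54,
# so App and Leuko are far from independent.
app_leuko = {(True, True): 0.23, (True, False): 0.05,
             (False, True): 0.31, (False, False): 0.41}
print(is_independent(app_leuko))   # False

# By contrast, a fair coin flip combined with a fair die roll being even:
coin_die = {(True, True): 0.25, (True, False): 0.25,
            (False, True): 0.25, (False, False): 0.25}
print(is_independent(coin_die))    # True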

Chain Rule
P(A,B,C,D) = P(A | B,C,D) * P(B | C,D) * P(C | D) * P(D)
The chain can be expanded in any order. Of course, if all of the events are mutually independent, then
P(A,B,C,D) = P(A) * P(B) * P(C) * P(D)
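
As a small numerical check of the chain rule on a made-up joint distribution over three Boolean variables (the numbers below are purely illustrative and sum to 1):

# Hypothetical joint distribution over Boolean A, B, C, keyed by (a, b, c).
joint = {
    (True,  True,  True):  0.10, (True,  True,  False): 0.15,
    (True,  False, True):  0.05, (True,  False, False): 0.20,
    (False, True,  True):  0.12, (False, True,  False): 0.08,
    (False, False, True):  0.03, (False, False, False): 0.27,
}

def p(a=None, b=None, c=None):
    # Marginal probability of a partial assignment (None means summed out).
    return sum(v for (ka, kb, kc), v in joint.items()
               if (a is None or ka == a)
               and (b is None or kb == b)
               and (c is None or kc == c))

# Chain rule: P(A,B,C) = P(A | B,C) * P(B | C) * P(C)
lhs = joint[(True, True, True)]
rhs = (p(True, True, True) / p(b=True, c=True)) \
      * (p(b=True, c=True) / p(c=True)) * p(c=True)
print(lhs, round(rhs, 10))   # both 0.1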

Bayes Theorem
For any A and B with P(B) > 0,

P(A | B) = P(B | A) * P(A) / P(B)

This turns out to be a very useful fact for computing conditional probabilities in practice, when P(B | A) is well understood, but P(A | B) is not.

Example: consider again the diagnosis question, P(App | Leuko).

In the real world, we do not expect that a high leukocyte level causes appendicitis; it is instead likely that appendicitis often causes a high leukocyte level. But there are also many other things beyond appendicitis that could cause a high leukocyte level. Furthermore, those factors could vary greatly across local communities and over time. So the value of P(App | Leuko) is not likely to be stable across communities or over time.

On the other hand, there may be a more universal understanding of P(Leuko | App), that is, how likely someone with acute appendicitis is to exhibit a high leukocyte level. Perhaps there is also a reasonably predictable value for P(App).

Locally, there may be a more general population that can be used to produce current statistics for P(Leuko) in a community. A doctor can then combine these quantities to compute

P(App | Leuko) = P(Leuko | App) * P(App) / P(Leuko)
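
A minimal sketch of that diagnostic computation, using the values measured above (variable names are illustrative):

p_leuko_given_app = 0.23 / 0.28   # P(Leuko | App), the well-understood quantity
p_app = 0.28                      # P(App), prior probability of appendicitis
p_leuko = 0.54                    # P(Leuko), measured in the local population

# Bayes theorem: P(App | Leuko) = P(Leuko | App) * P(App) / P(Leuko)
p_app_given_leuko = p_leuko_given_app * p_app / p_leuko
print(round(p_app_given_leuko, 2))   # ≈ 0.43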
Bayes theorem is also used to compute more reliable probability estimates for events that may be relatively rare (and thus have less reliable empirical evidence), based only on better-evidenced quantities.

Probabilistic Reasoning

Question: If you know that P(A) = 0.3 and you know that P(B | A) = 0.6, what would you predict for P(B)?

Answer: There is not enough information to determine P(B).

More generally, assume that
P(A) = α
P(B | A) = β

Consider the full joint distribution
P(A,B) = p1
P(A,¬B) = p2
P(¬A,B) = p3
P(¬A,¬B) = p4

What do we know?
p1 + p2 = α     as P(A) = P(A,B) + P(A,¬B)
p1 = αβ as P(A,B) = P(B | A) * P(A)
p1 + p2 + p3 + p4 = 1 as P(Ω) = 1
There are 3 equations and 4 unknowns. We can rearrange to have
p1 = αβ
p2 = α - αβ = α(1 - β)
p3 + p4 = (1 - α)
But what we were interested in is estimating P(B) = p1 + p3 = ?

Strategy: Maximize Entropy

When a system of equations involving a joint probability distribution is under-constrained, yet there is desire to estimate that distribution, one strategy is based on picking a consistent solution that maximizes an information theoretic measure known as entropy.

Informally, the entropy of a probability distribution is a measure of the amount of randomness in the system. For a discrete probability distribution p, entropy H(p) is defined as

H(p) = Σi -pi ln pi
where the sum ranges over all outcomes i, with the convention that 0 ln 0 = 0.
Since each 0 ≤ pi ≤ 1, the term ln pi ≤ 0 (for pi > 0), and thus -pi ln pi ≥ 0. Therefore the entropy is non-negative.
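
A direct translation of this definition into Python (a sketch; zero terms are skipped, matching the convention 0 ln 0 = 0):

import math

def entropy(p):
    # H(p) = Σ -p_i ln p_i, treating 0 ln 0 as 0 by skipping zero entries.
    return sum(-pi * math.log(pi) for pi in p if pi > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # ln 4 ≈ 1.386, maximal for 4 outcomes
print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0, no randomness at all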

The idea is that if some probabilities are known but others are not, a practical technique for estimating the unknowns is to select them so as to maximize the entropy of the system (informally, hoping to insert as little artificial information as possible).

In general, it is a non-trivial mathematical problem to determine the solution that maximizes entropy (requiring calculus-based optimization techniques that we will not consider in this class). But there are systems that perform such computations.

Returning to the above example, we know the correct values for p1 = αβ and p2 = α(1 - β), but not for p3 and p4. It turns out that the maximum entropy prediction is achieved (due to symmetry) with p3 = p4 = (1 - α)/2. Returning to our original goal of estimating P(B), we have
P(B) = p1 + p3 = αβ + (1 - α)/2 = α(β - 0.5) + 0.5 = 0.3(0.1) + 0.5 = 0.53
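
A crude numeric check of this claim (a sketch only, not how maximum entropy systems actually solve the problem): fix p1 and p2 at their known values, sweep p3 over its feasible range with p4 = (1 - α) - p3, and confirm that the entropy peaks at p3 = p4 = (1 - α)/2.

import math

alpha, beta = 0.3, 0.6
p1 = alpha * beta          # P(A,B)  = 0.18
p2 = alpha * (1 - beta)    # P(A,¬B) = 0.12

def entropy(p):
    # H(p) = Σ -p_i ln p_i, skipping zero entries
    return sum(-pi * math.log(pi) for pi in p if pi > 0)

# p3 + p4 must equal 1 - alpha; sweep p3 and keep the split with largest entropy.
best_p3, best_h = None, -1.0
for step in range(1, 1000):
    p3 = (1 - alpha) * step / 1000
    p4 = (1 - alpha) - p3
    h = entropy([p1, p2, p3, p4])
    if h > best_h:
        best_p3, best_h = p3, h

print(round(best_p3, 3))        # 0.35, i.e. (1 - alpha) / 2
print(round(p1 + best_p3, 3))   # P(B) = p1 + p3 ≈ 0.53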


Michael Goldwasser
Last modified: Monday, 28 October 2013