Consider a joint probability distribution over n discrete multivalued
random variables. If we knew the actual probability for each
combination of values, we could answer any query about the
distribution. The catch is that the full table has an entry for every
combination of values, which is exponential in n (2^n entries if the
variables are Boolean).
If we knew that the random variables were mutually independent, then we would need to store only the n probabilities of the form P(A). All other quantities, such as P(A,B,C), could be easily computed as products. Of course, many interesting distributions are not based on mutually independent variables.
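As a quick illustration, here is a minimal Python sketch of the storage contrast (the count of five variables anticipates the alarm example below; the probabilities p1, p2, p3 are made up):

    # Storage needed for a distribution over n Boolean variables.
    n = 5                 # five variables, as in the alarm example below
    print(2 ** n - 1)     # full joint distribution: 31 free parameters
    print(n)              # mutually independent variables: just n numbers P(X)

    # Under independence, any joint entry is a product of stored values,
    # e.g., with made-up probabilities p1, p2, p3:
    p1, p2, p3 = 0.1, 0.2, 0.3
    print(p1 * p2 * (1 - p3))   # P(X1, X2, not X3)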
We wish to again consider how to model a joint distribution for which we have some partial information, and some presumption of conditional dependencies, while still being able to answer queries about various conditional probabilities. We look at a very prominent approach known as a Bayesian Network, which uses basic domain knowledge about variable independence to produce a sparser representation of a distribution.
Classic Example: Bob has a house alarm, and has neighbors John and Mary who will call him if they think they hear the alarm. The alarm might be set off by a burglar, or by a (rare) Earthquake. We introduce the following Boolean variables as notation for the model:

B = a burglary occurs
E = an earthquake occurs
A = the alarm goes off
J = John calls
M = Mary calls
Let's further assume that the following probabilities are known based on experience:

P(B) = 0.001
P(E) = 0.002
P(A | B, E) = 0.95
P(A | B, ¬E) = 0.94
P(A | ¬B, ¬E) = 0.001
P(J | A) = 0.90
P(J | ¬A) = 0.05
P(M | A) = 0.70
What if we are interested in answering queries such as the following: how likely is John to call if there is a burglary, P(J | B)? Or how likely is a burglary given that both John and Mary call, P(B | J, M)?
But what if we add additional assumptions based on knowledge of the domain:

Burglaries and earthquakes occur independently of each other.
Whether John calls depends only on whether the alarm goes off; he does not observe burglaries or earthquakes directly, and he does not know whether Mary has called.
Likewise, whether Mary calls depends only on the alarm.
Conditional Independence
We originally defined A and B to be (pairwise) independent if

P(A, B) = P(A) * P(B), or equivalently, P(A | B) = P(A).
For example, we are assuming that earthquakes and burglaries are
independent, thus
P(B | E) = P(B) = 0.001
However, the assumption that John and Mary do not know about each other's actions does not mean that those variables are independent.
P(M) ≠ P(M | J)    (Why is that?)
To express our assumption about John and Mary,
we need a new definition of
conditional independence, as follows.
Variables A and B are conditionally independent, given C, if

P(A, B | C) = P(A | C) * P(B | C)

which is equivalent to each of the following statements:

P(A | B, C) = P(A | C)
P(B | A, C) = P(B | C)

(The equivalence follows from the chain rule: P(A, B | C) = P(A | B, C) * P(B | C), so the product form holds exactly when P(A | B, C) = P(A | C).)
We can now use this definition to describe what we want to say about
John and Mary, which is
P(J | M, A) = P(J | A)

That is, once we know whether the alarm went off, learning whether Mary called tells us nothing more about whether John will call.
Bayesian Network
We will use a graph to compactly represent our knowledge about the
domain, with a vertex for each variable, and a directed edge whenever
one variable directly influences another.
(Figure 7.11 in text is much prettier than I'm willing to draw)
B       E
 \     /
  \   /
   v v
    A
   / \
  /   \
 v     v
J       M
With each node, we specify a conditional probability table giving the probability of that variable for each combination of values of its parents (for a root node, this is just its unconditional probability).
The beauty of a Bayesian Network is that it gives a compact representation of the entire joint probability distribution (provided the independence assumptions and the conditional probabilities are valid). In particular, we can compute any specific value of the joint probability distribution as:
P(x1, x2, ..., xn) = Πi P(xi | parents(Xi))

This equation is effectively a result of the chain rule, and a theorem about conditional independence (given below).
Example: What is the probability that the alarm goes off and both John and Mary call, while there is neither a burglary nor an earthquake?

P(J, M, A, ¬B, ¬E) = P(J | A) * P(M | A) * P(A | ¬B,¬E) * P(¬B) * P(¬E)
                   = 0.90 * 0.70 * 0.001 * 0.999 * 0.998 ≈ 0.000628
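As a sanity check, here is a minimal Python sketch (not from the text) that encodes the CPT values quoted above and evaluates the product formula. The two entries marked as assumed, P(A | ¬B, E) and P(M | ¬A), are not used in these notes and are taken from the standard textbook version of the example:

    # Minimal sketch: evaluate P(x1,...,xn) = prod_i P(xi | parents(Xi))
    # for the alarm network, using the CPT values quoted in these notes.

    def P_B(b):              # P(Burglary)
        return 0.001 if b else 0.999

    def P_E(e):              # P(Earthquake)
        return 0.002 if e else 0.998

    def P_A(a, b, e):        # P(Alarm | Burglary, Earthquake)
        p = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29,      # assumed (standard textbook value)
             (False, False): 0.001}[(b, e)]
        return p if a else 1 - p

    def P_J(j, a):           # P(JohnCalls | Alarm)
        p = 0.90 if a else 0.05
        return p if j else 1 - p

    def P_M(m, a):           # P(MaryCalls | Alarm)
        p = 0.70 if a else 0.01       # 0.01 assumed (standard textbook value)
        return p if m else 1 - p

    def joint(b, e, a, j, m):
        # Each variable is conditioned only on its parents in the network.
        return P_B(b) * P_E(e) * P_A(a, b, e) * P_J(j, a) * P_M(m, a)

    # The worked example above: P(J, M, A, not B, not E)
    print(joint(b=False, e=False, a=True, j=True, m=True))   # ~0.000628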
Once we have the ability to compute atomic probabilities in the joint distribution, we can marginalize and condition to compute any probability of interest. Often this can be done far more directly than by working bottom-up from the atomic probabilities in the joint distribution.
Example: Compute P(J | B).
By definition this is P(J,B) / P(B).
We are given P(B) as an unconditional probability. For P(J,B), we marginalize over the alarm:

P(J,B) = P(J,B,A) + P(J,B,¬A)
We can further apply the chain rule to the term P(J,B,A):

P(J,B,A) = P(J | B,A) * P(A | B) * P(B)

We note that P(J | B,A) = P(J | A), due to conditional independence of J and B given A. Applying the same steps to P(J,B,¬A) and combining, we have
P(J | B) = ( P(J | A) * P(A | B) * P(B) + P(J | ¬A) * P(¬A | B) * P(B) ) / P(B)
= P(J | A) * P(A | B) + P(J | ¬A) * P(¬A | B)
We still need P(A | B). We can marginalize over Earthquake to get

P(A | B) = ( P(A,B,E) + P(A,B,¬E) ) / P(B)
         = ( P(A | B,E)*P(B)*P(E) + P(A | B,¬E)*P(B)*P(¬E) ) / P(B)    (using independence of B and E)
         = P(A | B,E)*P(E) + P(A | B,¬E)*P(¬E)
         = 0.95 * 0.002 + 0.94 * 0.998 ≈ 0.94

We can similarly get P(¬A | B) ≈ 0.06, and the original conclusion that

P(J | B) = 0.9 * 0.94 + 0.05 * 0.06 ≈ 0.849
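For a quick numerical check, here is a small Python sketch (not from the text) that mirrors the derivation above, first computing P(A | B) and then P(J | B):

    # Sketch: reproduce the derivation of P(J | B) using the values in these notes.
    P_E          = 0.002                        # P(Earthquake)
    P_A_given_BE = {True: 0.95, False: 0.94}    # P(Alarm | Burglary, Earthquake=key)
    P_J_given_A  = {True: 0.90, False: 0.05}    # P(JohnCalls | Alarm=key)

    # P(A | B) = P(A | B,E) * P(E) + P(A | B,~E) * P(~E)
    P_A_given_B = P_A_given_BE[True] * P_E + P_A_given_BE[False] * (1 - P_E)

    # P(J | B) = P(J | A) * P(A | B) + P(J | ~A) * P(~A | B)
    P_J_given_B = (P_J_given_A[True] * P_A_given_B
                   + P_J_given_A[False] * (1 - P_A_given_B))

    print(round(P_A_given_B, 3))   # ~0.94
    print(round(P_J_given_B, 3))   # ~0.849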
Theorem: The variable for a node Xi is conditionally independent of all its non-descendant nodes, given its parents. That is, for any non-descendant Xj,

P(xi, xj | parents(Xi)) = P(xi | parents(Xi)) * P(xj | parents(Xi))
Markov blanket: We define the Markov blanket of a node to be a node's parents, children, and children's parents.
Theorem: A variable is conditionally independent of all other nodes in the network, given its Markov blanket.
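As a small illustration of the definition, here is a Python sketch (not from the text) that computes a Markov blanket from the alarm network's structure, encoded as a dictionary mapping each node to its parents:

    # Sketch: Markov blanket = parents + children + children's other parents.
    parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}

    def markov_blanket(node):
        children = [v for v, ps in parents.items() if node in ps]
        blanket = set(parents[node]) | set(children)
        for c in children:
            blanket |= set(parents[c])   # co-parents of each child
        blanket.discard(node)
        return blanket

    print(markov_blanket("A"))   # B, E, J, M
    print(markov_blanket("B"))   # A and E (E is a co-parent through child A)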