Introduction to Probability
Frequentism vs. Bayesianism
Frequentists' viewpoint of probability
The probability that some specific outcome of a process will be obtained can be interpreted to mean the relative frequency with which that outcome would be obtained if the process were repeated a large number of times under similar conditions.
Bayesian
Each person has their own beliefs about a random procedure, and these beliefs are presumed to differ from person to person. It is an inferential point of view.
The difference is that in the frequentist view, samples come from a single unknown true distribution, whereas in the Bayesian view, every unknown quantity must be given a probability structure.
Sample Spaces, Experiments, and Events
An experiment is any process, real or hypothetical, in which the possible outcomes can be identified ahead of time. The collection of all possible outcomes of an experiment is called the sample space of the experiment which is often denoted by S.
An event is a well-defined subset of the sample space.
For example, an experiment can be "tossing a fair coin 3 times". Then, the sample space \(S = \{TTT, TTH, THT, THH, HTT, HTH, HHT, HHH\}\) is the collection of all outcomes, and \(E = \{THH, HTH, HHT\} \subseteq S\) is the event that exactly two heads are tossed.
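The sample space and event above can be enumerated directly; a minimal sketch in Python (not part of the original notes):

```python
from itertools import product

# Sample space for tossing a coin 3 times: all length-3 sequences of H/T
S = set(product("HT", repeat=3))
assert len(S) == 8

# Event E: exactly two heads are tossed
E = {outcome for outcome in S if outcome.count("H") == 2}
print(sorted("".join(o) for o in E))  # ['HHT', 'HTH', 'THH']
```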
Probability
Probability is a function \(P: \mathbb P(S)\rightarrow \mathbb R\) such that
- Non-negativity \(\forall E\subseteq S. P(E)\geq 0\)
- \(P(S) = 1\)
- Countable additivity: for any sequence of pairwise disjoint events \(E_1, E_2, \dots \subseteq S\) (\(E_i\cap E_j = \emptyset\) for \(i\neq j\)), \(P(\bigcup_{i=1}^\infty E_i) = \sum_{i=1}^\infty P(E_i)\).
Theorem 1 \(P(\emptyset) = 0\).
proof. Let \(E_1 = E_2 = E_3 = \dots = \emptyset\), so that they are pairwise disjoint. By countable additivity, \(P(\emptyset) = P(\bigcup_{i=1}^\infty E_i) = \sum_{i=1}^\infty P(\emptyset)\), and the series can converge only if \(P(\emptyset) \leq 0\). Combined with non-negativity, \(P(\emptyset) = 0\).
Theorem 2 (Finite additivity) For any disjoint events \(E_1,...,E_n\), \(P(\bigcup^n E_i) = \sum^n P(E_i)\)
proof. Define \(E_i = \emptyset\) for \(i>n\). Then by countable additivity and Theorem 1, \(P(\bigcup_{i=1}^n E_i) = P(\bigcup_{i=1}^\infty E_i) = \sum_{i=1}^\infty P(E_i) = \sum_{i=1}^n P(E_i)\).
Theorem 3 \(\forall A\subseteq S. P(A^c) = 1-P(A)\)
proof. By finite additivity, \(P(S) = P(A^c\cup A) = P(A^c)+P(A) = 1\), so \(P(A^c) = 1-P(A)\).
Theorem 4 \(\forall A, B \subseteq S. A\subseteq B \Rightarrow P(A)\leq P(B)\).
proof. By non-negativity and finite additivity, \(P(B)=P(A\cup(B-A))=P(A)+P(B-A)\geq P(A)\).
Theorem 5 \(\forall A\subseteq S. 0\leq P(A)\leq 1\).
proof. \(P(A)\geq 0\) by non-negativity, and \(A\subseteq S\Rightarrow P(A)\leq P(S) = 1\) by Theorem 4.
Theorem 6 \(P(A-B)=P(A)-P(A\cap B)\).
proof. By finite additivity, \(P(A) = P((A-B)\cup (A\cap B)) = P(A-B)+P(A\cap B)\); rearranging gives the result.
Theorem 7 \(P(A\cup B) = P(A) + P(B) - P(A\cap B)\). proof. \(P(A\cup B) = P(A\cup (B-A)) = P(A) + P(B-A) = P(A)+P(B)-P(A\cap B)\), using finite additivity and Theorem 6 (with the roles of \(A\) and \(B\) swapped).
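Theorems 6 and 7 can be checked by enumeration on the coin-tossing sample space from earlier; a sketch in Python assuming the uniform distribution (the choice of events is an arbitrary illustration):

```python
from fractions import Fraction
from itertools import product

# Uniform probability on the 8 outcomes of three coin tosses
S = set(product("HT", repeat=3))
def P(event):
    return Fraction(len(event & S), len(S))

A = {o for o in S if o[0] == "H"}          # first toss is heads
B = {o for o in S if o.count("H") == 2}    # exactly two heads

# Theorem 6: P(A - B) = P(A) - P(A ∩ B)
assert P(A - B) == P(A) - P(A & B)
# Theorem 7: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
assert P(A | B) == P(A) + P(B) - P(A & B)
```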
Theorem 8 (sub-additivity) For \(E_1,...,E_n\). \(P(\bigcup^n E_i) \leq \sum^n P(E_i)\).
proof. Define \(A_1 = E_1\) and \(A_i = E_i - \bigcup^{i-1}_{j=1} E_j\), so that \(\bigcup^n A_i = \bigcup^n E_i\), the \(A_i\) are pairwise disjoint, and \(A_i\subseteq E_i\). Therefore, by finite additivity and Theorem 4, \(P(\bigcup^n E_i) = \sum^n P(A_i) \leq \sum^n P(E_i)\).
Continuity of Probability
For a sequence of events \(A_1, A_2,...\)
Continuity from below If \(A_1, A_2, \dots\) is increasing to \(A\) (\(A_1\subset A_2\subset \dots\), \(\bigcup A_i = A\)) then \(\lim_{n\rightarrow\infty}P(A_n) = P(A)\).
proof. Since \(A_j\subset A_{j+1}\), \(A_j\cap A_{j+1} = A_j\). Define \(C_1 = A_1\) and \(C_{j+1} = A_{j+1}\cap A_j^c\), so that \(C_{j+1}\) is the part of \(A_{j+1}\) that is not in \(A_j\) and the \(C_j\) are pairwise disjoint. By construction, \(\bigcup_{j=1}^n C_j = A_n\) and \(\bigcup_{j=1}^\infty C_j = A\). Countable additivity then gives \(P(A) = \sum_{j=1}^\infty P(C_j) = \lim_{n\rightarrow\infty}\sum_{j=1}^n P(C_j) = \lim_{n\rightarrow\infty}P(A_n)\).
Continuity from above If \(A_1, A_2, \dots\) is decreasing to \(A\) (\(A_1\supset A_2\supset \dots\), \(\bigcap A_i = A\)) then \(\lim_{n\rightarrow\infty}P(A_n) = P(A)\).
proof. Note that \(A_1 - A_n\) is increasing to \(A_1-A\), so continuity from below gives \(\lim_{n\rightarrow\infty}P(A_1-A_n) = P(A_1-A)\).
Note that \(P(A_1-A_n) = P(A_1)-P(A_n)\) by Theorem 6 since \(A_n\subset A_1\), and similarly \(P(A_1-A) = P(A_1)-P(A)\). Cancelling \(P(A_1)\) from both sides yields \(\lim_{n\rightarrow\infty}P(A_n) = P(A)\).
Combinatorics
Permutation When there are \(n\) distinct elements, the number of ordered arrangements of \(k\) elements selected from them is called the number of permutations of \(n\) elements taken \(k\) at a time and is denoted by \(nPk = \frac{n!}{(n-k)!}\).
Combination The number of combinations of \(n\) elements taken \(k\) at a time (unordered selections of \(k\) elements) is denoted by \(nCk\) or \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\).
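For illustration, Python's standard library exposes both quantities directly; a small sketch (not part of the original notes):

```python
import math

n, k = 5, 2
# nPk = n! / (n-k)!: ordered selections
assert math.perm(n, k) == math.factorial(n) // math.factorial(n - k) == 20
# nCk = n! / (k! (n-k)!): unordered selections; each combination
# corresponds to k! orderings, so nPk = nCk * k!
assert math.comb(n, k) == math.perm(n, k) // math.factorial(k) == 10
```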
Binomial coefficients
\((x+y)^n = \sum_{k=0}^{n}\binom{n}{k}x^{k}y^{n-k}\)
Note that the coefficient \(\binom{n}{k}\) is determined by the number of combinations choosing the \(x\)-term from \(k\) of the \(n\) \((x + y)\) factors.
This result can be extended to non-integer exponents, with an infinite series, as Newton's generalized binomial theorem.
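The binomial theorem can be sanity-checked numerically; a small sketch in Python (the specific numbers are arbitrary):

```python
import math

x, y, n = 3, 5, 7
# Expand (x + y)^n term by term and compare with direct evaluation
lhs = (x + y) ** n
rhs = sum(math.comb(n, k) * x**k * y**(n - k) for k in range(n + 1))
assert lhs == rhs  # both equal 8**7 = 2097152
```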
Multinomial Coefficients
\((x_1+x_2+\cdots+x_k)^n = \sum_{n_1+\cdots+n_k=n}\binom{n}{n_1,\dots,n_k}x_1^{n_1}\cdots x_k^{n_k}\)
where the coefficient is \(\binom{n}{n_1,\dots,n_k} = \frac{n!}{n_1!\cdots n_k!}\) and the sum runs over all non-negative integers with \(n_1+\cdots+n_k=n\).
Inclusion Exclusion Formula
\(P(\bigcup_{i=1}^n A_i) = \sum_{i} P(A_i) - \sum_{i<j}P(A_i\cap A_j) + \sum_{i<j<k}P(A_i\cap A_j\cap A_k) - \cdots + (-1)^{n+1}P(A_1\cap\cdots\cap A_n)\)
proof Using induction on the number of events \(n\): the base case \(n=2\) is Theorem 7, and the inductive step applies Theorem 7 to \(\bigcup_{i=1}^{n-1} A_i\) and \(A_n\).
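The alternating sum can be verified by enumeration; a sketch in Python using two dice (the choice of events is an arbitrary illustration):

```python
from fractions import Fraction
from itertools import combinations, product

S = set(product(range(1, 7), repeat=2))  # all outcomes of rolling two dice
def P(event):
    return Fraction(len(event), len(S))

A1 = {o for o in S if o[0] == 6}       # first die shows 6
A2 = {o for o in S if o[1] == 6}       # second die shows 6
A3 = {o for o in S if sum(o) >= 10}    # total is at least 10
events = [A1, A2, A3]

# Inclusion-exclusion: alternating sum over all non-empty subcollections
total = Fraction(0)
for r in range(1, len(events) + 1):
    sign = (-1) ** (r + 1)
    for subset in combinations(events, r):
        total += sign * P(set.intersection(*subset))

assert total == P(A1 | A2 | A3)
```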
Conditional Probability
The conditional probability of an event \(A\) given \(B\) is defined by \(P(A\mid B) = \frac{P(A\cap B)}{P(B)}\), if \(P(B) > 0\).
Independent Events
\(A\) and \(B\) are independent if \(P(A\cap B) = P(A)P(B)\).
A collection of events \(\{A_i\}\) is (mutually) independent if for every finite subcollection \(A_{i_1},\dots,A_{i_k}\), \(P(\bigcap_{j=1}^k A_{i_j}) = \prod_{j=1}^k P(A_{i_j})\);
it is pairwise independent if every pair \(A_i, A_j\) with \(i\neq j\) is independent. Mutual independence implies pairwise independence, but not conversely.
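The gap between the two notions can be seen in a classic example with two fair coin tosses; a sketch in Python (the events are the standard textbook choice, not from the notes):

```python
from fractions import Fraction
from itertools import product

S = set(product("HT", repeat=2))  # two fair coin tosses, uniform
def P(event):
    return Fraction(len(event), len(S))

A = {o for o in S if o[0] == "H"}    # first toss is heads
B = {o for o in S if o[1] == "H"}    # second toss is heads
C = {o for o in S if o[0] == o[1]}   # the two tosses agree

# Each pair is independent...
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)
# ...but the collection is not mutually independent
assert P(A & B & C) == Fraction(1, 4) != P(A) * P(B) * P(C)
```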
Theorem \(A\perp B\) IFF \(A\perp B^c\).
proof.
\(\Rightarrow\) \(P(A\cap B^c) = P(A) - P(A\cap B) = P(A) - P(A)P(B) = P(A)(1-P(B)) = P(A)P(B^c)\), using Theorem 6 and independence.
\(\Leftarrow\) Apply the forward direction with \(B^c\) in place of \(B\), since \((B^c)^c = B\).
\(A\) and \(B\) are conditionally independent given \(C\) if \(P(A\cap B\mid C) = P(A\mid C)P(B\mid C)\).
Bayes Theorem
Theorem (Law of total probability) Let \(B_1,...,B_k\) be a partition of the sample space \(S\) with \(P(B_i) > 0\) for all \(i\). For any event \(A\), \(P(A) = \sum_{i=1}^k P(A\mid B_i)P(B_i)\).
proof. Note that \(A = A\cap S = \bigcup_{i=1}^k (A\cap B_i)\),
and the \(A\cap B_i\) are disjoint since the \(B_i\)'s are disjoint, so finite additivity and the definition of conditional probability give \(P(A) = \sum_{i=1}^k P(A\cap B_i) = \sum_{i=1}^k P(A\mid B_i)P(B_i)\).
Theorem (Bayes Theorem) \(P(B\mid A) = \frac{P(A\mid B)P(B)}{P(A\mid B)P(B) + P(A\mid B^c)P(B^c)}\).
proof. Direct from law of total probability.
Therefore, Bayes' theorem says that the posterior probability \(P(B\mid A)\) can be expressed in terms of the prior probability \(P(B)\) and the likelihood \(P(A\mid B)\).
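As a numeric illustration, consider a hypothetical diagnostic test (the prevalence, sensitivity, and false-positive rate below are made up for the example):

```python
from fractions import Fraction

# Hypothetical diagnostic-test numbers (illustration only):
prior = Fraction(1, 100)    # P(B): prevalence of the condition
sens  = Fraction(95, 100)   # P(A|B): test positive given the condition
fpr   = Fraction(5, 100)    # P(A|B^c): false-positive rate

# Bayes' theorem: P(B|A) = P(A|B)P(B) / (P(A|B)P(B) + P(A|B^c)P(B^c))
posterior = sens * prior / (sens * prior + fpr * (1 - prior))
print(posterior)  # 19/118, roughly 0.16
```

Even with a fairly accurate test, the posterior stays modest because the prior is small; the denominator is exactly the law of total probability applied to the partition \(\{B, B^c\}\).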