STAT 230 - Probability

Lecture - 1

Lecturer: Diana Skrzydlo (Skryzlo)
Office Hours: MF 3:30-5 in M3 3106 & W 11:30-12:30 in M3 3144

What is probability?
The likelihood of an event occurring

The 3 Definitions of Probability

  1. Classical - Probability = (# of outcomes in the event) / (total # of possible outcomes) (relies on outcomes being equally likely)
  2. Relative Frequency - Probability = proportion of time (in the long run) that the event occurs
  3. Subjective - Probability = how certain the person is that it will occur

How does prob relate to CS?

Definitions to set up a probability model

Lecture 2 - Sept 12

Review Question:

What is essential to the classical definition of probability?
Outcomes are equally likely

Definitions

Experiment
A process that can be repeated multiple times, with multiple results
Trial
One iteration of an experiment
Outcome
The result on one trial of an experiment
Sample Space
A set of all possible outcomes for a single trial of an experiment. Notes: S is a set, i.e. an unordered list of unique elements. S must contain ALL possible outcomes. Two types: discrete (finite or countably infinite) -> can list/count the elements; continuous -> uncountably many elements
Example
Experiment: Roll a 6-sided die repeatedly
Trial: One roll
Outcomes: 1, 2, … , 5, 6
S = {1, 2, 3, 4, 5, 6} (most versatile)
S = {even, odd} (also works)
Event
A subset of a sample space (including the empty set ∅ and the entire sample space S)
Example
A = “a 1 is rolled” -> A = {1}
B = “the number is odd” -> B = {1, 3, 5}
Note: events consisting of one sample point are simple events; otherwise they are called compound events. On any trial, if the outcome is in the event A, we say A has occurred.
E.g. if a 1 is rolled -> A occurs, B occurs
3 is rolled -> B occurs, A does not
6 is rolled -> neither occur
A Probability

is a function that maps events in a sample space to numbers such that

  1. 0 ≤ P(A) ≤ 1 for any A
    • P(A) = 0 -> “impossible”
    • P(A) = 1 -> “guaranteed”
  2. If a_1, a_2, … , a_n are the elements of S, then P({a_1}) + P({a_2}) + … + P({a_n}) = 1
    • where each {a_i} is a simple event
    • Example: Roll a 6-sided die twice. S = {(1,1), (1,2), … , (1,6), … , (6, 1), … , (6,6)}
    • Example: BST. A Binary Search Tree is degenerate if each node has at most one child
    • Find the prob. that a tree with 3 distinct elements (inserted in random order) is degenerate:
    • We drew out all 6 insertion orders; 4 of them give a degenerate tree, so the prob is 4/6 = 2/3 (see the sketch below)

Lecture 3

Recall

P(A) is the probability that A occurs. A is a subset of S. If A is compound, its prob is the sum of the probs of the simple events that make up A.

Counting Techniques

Two basic counting rules:

  1. Addition rule - if a task can be done in p ways OR in q ways (and not both at once), there are p + q ways total
  2. Multiplication rule - if a process has one step with p choices AND a second step with q choices, there are p × q ways total

Sampling

“with replacement” - it is possible to obtain the same result more than once. With m spaces and k options for each space, there are k^m arrangements.

Permutations

A permutation is an ordering of k objects selected from n objects without replacement. Let n = number of objects, k = number of spaces. The order matters! n^(k) = n(n-1)(n-2)⋯(n-k+1) = “n to k factors” = n!/(n-k)!
Also, when k = n this is just n!, the number of ways to order all n objects.

Tip: if computing a certain probability directly is hard, try 1 - P(opposite of what you want to find). E.g. P(at least one repeat) = 1 - P(no repeats), which is easier.

Lecture - 4

Combinations

A combination is a selection of k objects from n objects without replacement. The order does NOT matter. E.g. drawing a hand of cards. Denoted by (n choose k) = n!/(k!(n-k)!). Pascal's triangle: (n choose k) is the kth entry in the nth row (counting from 0).

Some shit about pascal’s triangle that is super obvious

Pascal’s Identity

(n choose k) = (n-1 choose k-1) + (n-1 choose k)
We won’t prove this mathematically since it’s messy to prove

oh lool, we proved this in MATH 239.

Lecture - 5

Example:
A box of 10 computer chips containing 3 defective ones in the box. Four of the chips are tested. What is the probability that 1 is found defective?
Solution:
W/o rep & don’t care about order, so use combinations:
P(exactly 1 defective) = (3 choose 1)(7 choose 3) / (10 choose 4) = 105/210 = 0.5

The whole time, we don’t care about order (i.e. we don’t need to worry if the defective one is first, last, etc.)

Note: “1 is found” means exactly 1, not at least one.

Arrangements where some objects are alike (indistinguishable)

n objects, with n_1 of type 1, n_2 of type 2, ..., n_k of type k (n_1 + n_2 + ... + n_k = n), can be arranged in n!/(n_1! n_2! ⋯ n_k!) ways.

Lecture - 6

Recall - an event is a subset of S, and for any event A, 0 ≤ P(A) ≤ 1
Rules for probabilities
If an event A is contained in another event B (A subset B), then
P(A) ≤ P(B)
Also P(∅) = 0 and P(S) = 1,

thus, for any event A, 0 = P(∅) ≤ P(A) ≤ P(S) = 1
A_bar, A^c, A' all mean NOT A (the complement). A ∪ B (union) = or. A ∩ B (intersect) = and.

De Morgan’s Laws

(A ∪ B)^c = A^c ∩ B^c   (not (A or B) = (not A) and (not B))
(A ∩ B)^c = A^c ∪ B^c   (not (A and B) = (not A) or (not B))

P(A ∪ B) = P(A) + P(B) - P(AB)

Lecture - 7

Multiplication Rule

applies when events are independent of each other (don't affect one another): P(AB) = P(A)P(B)

Lecture - 8

Recall: Independent: P(AB) = P(A)P(B)
Conditional prob of A given B: P(A|B) = P(AB)/P(B), provided P(B) > 0

Last time we had (rolling a small die and a large die):
A = “small die shows 3”
C = “total of the two dice is 8”
P(A|C) = 1/5
P(A) = 1/6

C occurring makes A more likely.

P(C|A) = 1/6, P(C) = 5/36

A occurring makes C more likely. Dependence is a two-way relationship

Alternative defn of independence: P(A|B) = P(A), i.e. knowing B occurred has no effect on the probability of A occurring. Equivalently, P(B|A) = P(B).

Example:

Diana has 2 kids. Each independently has 0.5 chance of having red hair. At least one has red hair. What is the probability the other does? Let A = at least one red

(joint probabilities; rows = Kid 1, columns = Kid 2)
              Kid 2 red   Kid 2 not red
Kid 1 red        0.25          0.25
Kid 1 not red    0.25          0.25

P(A) = 0.75

P(both|A) = P(both and A) / P(A) = 0.25/0.75 = 1/3

Product Rule (4.5)

If we rearrange the definition of conditional prob, we get:
P(AB) = P(B)P(A|B)
P(AB) = P(A)P(B|A)

Intuitively, think of “checking” whether A occurred. If it did, “check” whether B occurred (knowing A did). (or you check B first, then A) Works for both dependent and independent events.

Extends nicely to multiple events P(ABC) = P(B)P(C|B)P(A|BC)
The order doesn’t matter as long as each probability is conditional on what we know.

Law of Total Probability

Let A_1, A_2, ..., A_k be a collection of events that are mutually exclusive and cover the sample space

(break S up into k disjoint pieces) This is a partition of S
e.g. A and A_bar is a partition of size 2.
For any event B in S, we can express it as B = (A_1 ∩ B) ∪ (A_2 ∩ B) ∪ ... ∪ (A_k ∩ B)
So P(B) = P(A_1)P(B|A_1) + P(A_2)P(B|A_2) + ... + P(A_k)P(B|A_k), using the product rule

Example:
Three people A,B,C are writing code together. A produces twice as much as B or C

1% of A’s code has errors
2% of B’s
5% of C’s
What is the probability that a random line of code has errors?

Let E = line has error
A = line written by A
B = line written by B
C = line written by C

We want P(E)

Using the partition {A, B, C}, P(A) = 0.5 , P(B) = P(C) = 0.25
We have P(E|A) = 0.01
P(E|B) = 0.02
P(E|C) = 0.05

So P(E) = P(A)P(E|A) + P(B)P(E|B) + P(C)P(E|C) = (0.5)(0.01) + (0.25)(0.02) + (0.25)(0.05) = 0.0225

Lecture - 9

We found P(E) = 0.0225 using law of total prob. We would like to know, if we find an error, whose fault is it?
We had to reverse the direction of the conditioning: we have P(E|author). We want P(author|E).

Bayes’ Rule

P(A|B) = P(A)P(B|A) / P(B), also written as P(A|B) = P(A)P(B|A) / [P(A)P(B|A) + P(A^c)P(B|A^c)] (expanding P(B) with the law of total probability)

For this example: P(A|E) = P(A)P(E|A)/P(E) = (0.5)(0.01)/0.0225 ≈ 0.22, and similarly P(B|E) ≈ 0.22, P(C|E) ≈ 0.56, so an error is most likely C's fault.

Example

Your spam filter has 90% detection rate but 1% false positive rate. Suppose 25% of messages are spam

Let S = “message is spam” P(S) = 0.25 P(S’) = 0.75
I = “Identified as spam”
P(I|S) = 0.9 -> P(I’|S) = 0.1, and P(I|S’) = 0.01 (false positive rate)

(photo of the worked solution here)

Machine Learning

Supervised learning -> give some training data
Unsupervised learning -> no training data

Types of Problems
Classification
spam vs non-spam email
cancerous vs benign tumor
fraudulent vs legit banking transactions
Regression
predict the price of house/stock
length of recovery time
Clustering
grouping products together
making recommendations

Bayesian Classifier
P(Category 1 | evidence) = P(Category 1) P(evidence | Category 1) / P(evidence), i.e. proportional to prior × likelihood

“Monty Hall” problem

3 doors, one hiding a prize. You pick one; the host then reveals one of the other doors to be a dud. Do you switch doors?

Yes - switching wins with probability 2/3, staying wins with only 1/3

Important Sums

  1. Geometric - sum over x = 0, 1, 2, … of t^x = 1/(1-t), for |t| < 1
  2. Binomial - sum over x = 0, 1, …, n of (n choose x) a^x b^(n-x) = (a+b)^n

Lecture - 10

Bayes’ Rule:

P(A|B) = P(A)P(B|A) / [P(A)P(B|A) + P(A')P(B|A')]
Product rule + law of total prob. A and A' is a partition

Ch 5 Random Variables

Def. A random variable (rv) is a function that maps points in a sample space to Real numbers

The values that the rv can take on are called the RANGE of the rv. By convention, we call rvs X,Y,Z and the values, x,y,z
We are interested in finding the prob. that a rv X takes on one of its particular values, i.e. P(X = x) (the outcome is a sample point that maps to x)

There are two types of rvs: discrete -> finite or countably infinite range, vs continuous -> uncountably infinite range

More than one rv can be defined on the same S

eg roll 3 fair 6-sided dice.
X = sum on 3 dice {3, … , 18}
Y = avg value {1, … , 6}
Z = the 3 digit number created by the dice
W = the product of the 3 {1, 2, … , 216}
U = number of 3s {0, 1, 2, 3}
etc


Def. The probability function (pf) of a discrete rv X is f(x) = P(X = x) and is only defined for x in the range of X. For x not in the range, f(x) is taken to be 0 (or left undefined).

Properties of f(x)
1. 0 ≤ f(x) ≤ 1 for all x in the range of X. Why? It's a probability.
2. The sum of f(x) over all x in the range is 1. (The events “X = x” are mutually exclusive for each x, and together they cover S.)

Def. The cumulative distribution function (cdf) of the rv X is F(x) = P(X ≤ x) and is defined for ALL Real x.

Lecture - 11

Recall: Random variable (X) maps sample points to Real numbers
Probability function f(x) = P(X = x) for x in the range
Cumulative distribution function F(x) = P(X ≤ x) for all Real x

Properties of F(x)

  1. 0 ≤ F(x) ≤ 1. Why? it’s a probability
  2. F(x) is non-decreasing with respect to x. Why? Can’t lose probability when increasing x
    1. Proof: the event “X ≤ x_1” is contained in “X ≤ x_2” where x_1 ≤ x_2, so F(x_1) ≤ F(x_2)
  3. lim as x -> -∞ of F(x) = 0, where 0 is impossible
  4. lim as x -> +∞ of F(x) = 1, where 1 is guaranteed

Example:

x 0 1 2 3
f(x) 125/216 75/216 15/216 1/216
F(x) 125/216 200/216 215/216 1

Let’s graph it to show what happens at other Real values of x.

Relationship between F and f
f(x) = the size of the jump in F(x) at the point x, i.e. f(x) = F(x) - F(x⁻), where x⁻ is the next smallest value in the range (for an integer-valued X this is F(x) - F(x-1))


We can use F(x) to find P(a < X ≤ b) = F(b) - F(a)

Discrete Uniform (5.2)

Let X be a rv on the range {a, a+1, …, b} where a ≤ b are integers and each value is equally likely. Then we say X has a discrete uniform distribution on [a, b], in other words “X~DU[a,b]”: X has a DU distribution on [a,b]

e.g. the roll of a fair 6-sided die is DU[1, 6]

Find the pf: f(x) = P(X = x) = c, which must be constant since all values are equally likely
We need the b - a + 1 values of f(x) to sum to 1,
so f(x) = 1/(b - a + 1) for x = a, a+1, ..., b

Find the cdf: F(x) = (floor(x) - a + 1)/(b - a + 1) for a ≤ x ≤ b (0 below a, 1 above b)

Example

You are looking through a linked list of 100 items for one particular item. Let X = # of comparisons to find it. Claim: X ~ DU [1, 100]

Lecture 12

Hypergeometric rv (5.3)

Set up: we have N objects -> r “successes”, N-r “Failures”
We choose n without replacement

Let X = “# of S’s chosen”
eg. if x=# of winning numbers on a lotto 6/49 ticket
X ~ Hyp(N,r,n)
N=49, r = 6, n =6
Find the pf: f(x) = P(X=x) = P(we get x S’s and n-x F’s) = (r choose x)(N-r choose n-x) / (N choose n)


range of x?
lower bound: x ≥ max(0, n - (N - r)) - it's possible to run out of F’s, so the remaining picks will all be S’s

upper bound: x ≤ n (all S’s), x ≤ r (could run out of S’s), so x ≤ min(n, r)
so the range of X is messy and depends on the relationship btwn N, r, and n.

Also, there is no closed form expression for F(x). You have to add up the f(x) values which is computationally intensive when N is large.
Example:
You have 10 cards - 7 treasure, 3 non-treasure.
Draw 5 without replacement. Let X = “# of treasure cards”. X ~ Hyp(N=10, r=7, n=5)
f(x) = P(X = x) = (7 choose x)(3 choose 5-x) / (10 choose 5) for x = 2,3,4,5
All the values of f(x) will add up to 1
(proven using Hypergeometric series result)

Binomial rv (5.4)

Set up: Bernoulli trials - independent trials, 2 outcomes on each (S or F), P(S) = p is constant for all trials

We do n trials.
Let X = # of S in n trials
(You can imagine it as a Hypergeometric selecting with replacement instead of without.)
We write X ~ Bin(n, p)
Eg. Flip a fair coin 10 times, x = # heads, X ~ Bin (n = 10, p = 0.5)
Find the pf: f(x) = P(X=x) = P(we get x S’s and n-x F’s) = (n choose x) p^x (1-p)^(n-x) for x = 0, 1, ..., n

Tut 1

  1. The letters of PROBABILITY are arranged at random in a row. Find the probability that:
    1. Y is in the last position
      • Total # is 9979200
      • # to have Y last = 907200
      • so Prob = 1/11 (also can just consider that 1/11 chance last letter is y)
    2. The two B’s are consecutive
      • treat “BB as one letter
      • P = (10!/2!)/(11!/(2!2!)) = 0.182
    3. The two B’s are consecutive and Y is in the last position
      • (9!/2!)/(11!/(2!2!)) = 0.018 = P(AB)
      • Note:
    4. Y is not in the last position and the two B’s are not consecutive
      • P(A_bar B_bar) = P((AUB)_bar) = 1-P(AUB) = 1-(P(A)+P(B)-P(AB)) = 0.745
    5. Y is not in the last position or the two B’s are not consecutive
      • P((AB)_bar) = P(A_bar U B_bar) = 1-P(AB) = 0.982
  2. You are trying to guess your friend’s Quest password. You know it must be 8 characters chosen from digits 0-9, lower case letters a-z, and upper case letters A-Z and is not allowed to be all letters or all numbers
    1. how many possible valid passwords are there?
      • 62 possible characters
      • 62^8 - 52^8 - 10^8
    2. You happen to know that your friend is a bit lazy with respect to password security and will only use the letters a,s,d and f (both upper and lower) and the number 1. How many possible valid passwords could your friend have?
      • now 1 + 4 + 4 = 9 characters
      • So 9^8 - 8^8 - 1 = 26269504
    3. What is the probability your friend’s password has no repeated characters?
      • no repeats - selecting w/o replacement, so order still matters: use falling factorials
      • So 9^(8) - 8^(8) - 0 = 362880 - 40320 = 322560 (9P8 total, minus the 8P8 all-letter ones; all-numbers is impossible with no repeats)
      • P = 322560/26269504 = 0.012
    4. What is the probability your friend’s password contains at least one “a” or “A”?
      • 1 - P(no “a” or “A”s)
      • # w/o = 7^8 -6^8 -1^8
      • P = 1 - (7^8 -6^8 -1^8)/( 9^8 - 8^8 - 1) = 0.844
  3. Consider the machine learning problem of classifying incoming messages as spam. We define: A_1 = message fails rdns check (i.e. the “from” domain does not match), A_2 = message is sent to over 100 people, A_3 = message contains a link with the url not matching the alt text. We will assume that the A_i’s are independent events, given that a message is spam, and that they are independent events, given that a message is regular. This is known as the “Naive Bayes Classifier” and is the simplest of the machine learning classification algorithms. We estimate P(A_1|Spam) = 0.3, P(A_2|Spam) = 0.2, P(A_3|Spam) = 0.1, P(A_1|Not Spam) = 0.005, P(A_2|Not Spam) = 0.04, P(A_3|Not Spam) = 0.05 and P(Spam) = 0.25
    1. Suppose a message has all of features 1,2,and 3 present. Det P(Spam|A_1,A_2,A_3)
      • = (P(Spam)P(A_1A_2A_3|Spam))/(P(Spam)P(A_1A_2A_3|Spam) + P(Spam_bar)P(A_1A_2A_3|Spam_bar)) = (P(S)P(A_1|S)P(A_2|S)P(A_3|S))/(P(S)P(A_1|S)P(A_2|S)P(A_3|S) + P(S_bar)P(A_1|S_bar)P(A_2|S_bar)P(A_3|S_bar)) = 0.995
    2. Suppose a message has features 1 and 2 present, but feature 3 is not present. Determine P(Spam | A_1A_2(A_3)bar).
      • = (P(S)P(A_1|S)P(A_2|S)P(A_3_bar|S))/(P(S)P(A_1|S)P(A_2|S)P(A_3_bar|S) + P(S_bar)P(A_1|S_bar)P(A_2|S_bar)P(A_3_bar|S_bar)) = 0.9895
    3. If you declared as spam any message with one or more of features 1,2, or 3 present, what fraction of spam emails would you detect?
      • P(A_1UA_2UA_3|Spam) = 1 - P(none of features) = 1-P(A_1_barA_2_barA_3_bar|spam) = 1- P(A_1_bar|Spam) … P(A_3_bar|Spam) = 0.496
    4. Given that a message is declared as spam (according to the rule in (c)), what is the probability that it actually is spam?
      • similar to e. ans = 0.8508
  4. Given that a message is declared as spam, (according to the rule in (c)), what is the probability that feature 1 is present?
  5. Let X represent the number of days in Feb. with temp below -24C. The probability function (pf) of X, f(x) = P(X=x), is given in a table (photo of the pf table here)
    1. 0.0625
    2. look at one note
    3. F(3.5) - F(0.5) = 0.375
    4. f(2)/0.375 = 0.8333…

Lecture 13

Recall:
X~Hyp(N, r, n)

X = # S’s when choosing n objects w/o rep
X~Bin(n,p): X = # S’s in n independent Bernoulli trials with P(S) = p

Example:
Want to send a 4-bit message. Each bit is independently flipped (0->1 or 1->0)
Probability = 0.1
P(message received correctly?)
Let X = # of bits flipped
X~Bin(4, 0.1)
P(X=0) = (0.9)^4 = 0.6561

Now add 3 “parity bits” to the message, which allows the receiver to detect and fix up to 1 error.
Let Y = # bits flipped ~ Bin(7, 0.1)
P(message ok) = P(Y=0) + P(Y=1) = (0.9)^7 + 7(0.1)(0.9)^6 ≈ 0.850 (both cases are ok) - better than 0.656

Bin approx to Hyp.

If we only sample a small fraction of the population (i.e. n << N), then it doesn't make a big difference to the probabilities if you sample with or without replacement. If we did it with replacement, the number of S’s we get is
X~Bin(n, p = r/N)
So when N is large and n is small, we can use a Bin(n, r/N) to approximate a Hyp(N,r,n). (Guideline: works well when n/N is less than 0.05.)

Lecture 14

recall: Negative Binomial (5.5)
Bernoulli trials (indep, S or F, P(S) = p); X = # of F’s before the kth S is obtained
pf and examples of NB
Geometric rv (5.6)
How to tell when to use distributions

Bin vs NB:
Bin - know the # of trials (n); the random # of successes is modelled by X
NB - know the # of successes (k); the random # of trials is modelled by k + X

We write X ~ NB(k, p)
range: {0, 1, 2, …}
f(x) = P(X = x) = (x + k - 1 choose x) p^k (1-p)^x

We can show that the f(x) sum to 1 (using the negative binomial series)
But there is no closed form expression for F(x)

Example:
A startup is looking for 5 investors. Each investor will independently say yes with probability 0.2. Founders will ask investors one at a time until they get 5 yes. Let X=total # of investors asked. Find f(x) and f(10).

Let Y = # who say no (Y+5=X), so Y ~ NB(k=5, p=0.2)

So f(x) = P(X = x) = P(Y = x - 5) = (x - 1 choose x - 5)(0.2)^5(0.8)^(x-5) for x = 5, 6, 7, ... and f(10) = (9 choose 5)(0.2)^5(0.8)^5 ≈ 0.0132

Geometric rv (5.6)

(nothing to do with hypergeometric)
Special case of NB with k=1.
X= # of F’s before obtaining the first S in Bernoulli trails
X~Geo(p)
range: {0, 1, 2, …}
f(x) = P(X = x) = p(1-p)^x

no orderings to worry about since it’s just FFFF….FS (x F’s)
Easy to show the f(x) sum to 1 using the infinite geometric series formula. (The geometric also has a closed-form cdf: F(x) = 1 - (1-p)^(x+1) for x = 0, 1, 2, ….)

Summary chart (one distribution per line):
Discrete Uniform: f(x) = 1/(b-a+1), range {a, a+1, .., b}, closed-form cdf; use when all values are equally likely
Hypergeometric: f(x) = (r choose x)(N-r choose n-x)/(N choose n), range is weird!, no closed-form cdf; use for # of S’s drawn without replacement
Bin: f(x) = (n choose x)p^x(1-p)^(n-x), range 0,…,n, no closed-form cdf; use for # of S’s in n independent trials
NB: f(x) = (x+k-1 choose x)p^k(1-p)^x, range 0,1,…, no closed-form cdf; use for # of F’s before the kth S
Geometric: f(x) = p(1-p)^x, range 0,1,2,…, F(x) = 1-(1-p)^(x+1); use for # of F’s before the 1st S
Poisson: f(x) = e^(-μ)μ^x/x!, range 0,1,2,…, no closed-form cdf; use as an approx to Bin with large n, small p

Lecture - 15

Recall: uniform, hypergeometric, binomial, NB, geometric
Poisson dist from Bin (5.7)
Poisson process (5.8)

Poisson rv (5.7)

We say X has a Poisson distribution with parameter μ > 0 if f(x) = e^(-μ) μ^x / x! for x = 0, 1, 2, …
Easy to show the f(x) sum to 1, since the sum of μ^x/x! is e^μ
The Poisson is a limiting case of the Binomial when n -> ∞ and p -> 0 such that the product np remains constant

Let np = μ
then, as n -> ∞, (n choose x) p^x (1-p)^(n-x) -> e^(-μ) μ^x / x!

So if we have a Bin(n,p) with large n and small p, we can approx it with a Poisson rv with parameter μ = np. Guideline: n large and p close to 0 (e.g. p < 0.05) works well!

Example:
Roll up the Rim - “1 in 9 cups win!” You buy 100 cups (treat as independent). Find prob you get 10 or fewer winning cups.

X = # winning ~ Bin(100, 1/9)

So: P(X ≤ 10) = sum over x = 0, ..., 10 of (100 choose x)(1/9)^x(8/9)^(100-x)

Try the Poisson approx with μ = np = 100/9 ≈ 11.1: P(X ≤ 10) ≈ sum over x = 0, ..., 10 of e^(-11.1)(11.1)^x/x!

Not a great approx since p = 1/9 ≈ 0.11 was a bit too high to be “close to 0”

Clicker question
Suppose you type at exactly 90 words per minute and on each word have a 1% chance of making an error. After 1 minute, what is the probability you have made NO errors?
0.405 = (0.99)^90 -> from bin. 0.407 = e^(-0.9) -> from poisson

You can also use the Poisson approximation when p is close to 1, by instead modelling the number of F's instead of S's.

Poisson Process (5.8)

Consider “events” occurring randomly throughout time/space according to 3 conditions:

  1. Independence
    1. (events have no impact on each other)
    2. # of events in non-overlapping time intervals are indep.
  2. Individuality
    1. (events occur one at a time)
    2. Cannot have two or more events at the exact same time
  3. Homogeneity/Uniformity
    1. (events occur at a constant rate λ)
    2. Prob of an event occurring in a short time interval (t, t + Δt) is proportional to Δt
    3. Can’t have periods of higher activity

E.g. emails into an inbox
Cars through an intersection
Births in a large population

Lecture - 16

Imagine we observe a Poisson process (with rate λ) for t units of time.
Let X = # of events that occur.
X is a discrete rv with no maximum.
It turns out that:
X ~ Poi(μ = λt)
ie. f(x) = e^(-λt)(λt)^x/x! for x = 0, 1, 2, …

Proof: See course notes
Add one more column to your chart!

Poisson:
f(x) = e^(-μ)μ^x/x!
When to use? # of events in a Poisson process (μ = λt), or Bin with large n, small p (approx)
range 0,1,2,…

When NOT to use it?
When we can specify a maximum # of events
Or if it makes sense to ask how many times the event did not occur

Example:

requests to a web server follow the conditions of a Poisson process with rate 100 per minute.
Find prob of 1 request in 1 sec. 90 requests in 1 min.

Let X = # requests in 1 sec
X ~ Poi(μ = 100/60 ≈ 1.67)
P(X=1) = e^(-100/60)(100/60) ≈ 0.31

Let Y = # requests in 1 min
Y ~ Poi(μ = 100)
P(Y=90) = e^(-100)(100)^90 / 90!

Combining Models (5.9)

Many problems may combine more than one distribution together. Your task is to identify the distribution needed, depending on the probability requested.

Example: Server requests 100/min
A 1-second period is “quiet” if it contains no requests.

a) Find the Prob of a “quiet” second.
X = # requests in 1 sec. X ~ Poi(100/60)
P(X = 0) = e^(-100/60) ≈ 0.189

b) Prob of 10 “quiet” seconds in a minute (60 non-overlapping sec)
Let Y = # “quiet” in 60 sec.
Y~Bin(60,0.189)
P(Y=10) = (60 choose 10)(0.189)^10(0.811)^50

c) Prob of having to wait 30 non-overlapping sec to get 2 “quiet”
Let Z = # non-quiet sec before 2 “quiet”
Z ~ NB(2, 0.189)
P(Z = 28) = (29 choose 28)(0.189)^2(0.811)^28

d) Given (c), the prob there is 1 “quiet” sec in the first 15 sec.
P(1 Q in 15 sec | wait 30 for 2 Q)


Use binomial for 1st prob. Geometric for 2nd.

Lecture 17

Ch 7. Expected Value + Variance Summarizing Data

Let X = # of kids in a family

To summarize the data, we can use:
1. A frequency distribution

x frequency
1 10
2 14
3 10
4 1
5 2

2. a frequency histogram (different graph here because I was lazy and didn’t want to draw one in ms paint)

3. A single number representing the average or sample mean

4. median - the middle value: 2 in this case
5. mode - most common/frequent value (in this case, 2)

Expectation of a r.v. (7.2)
We had the sample mean of the kids be (1(10) + 2(14) + 3(10) + 4(1) + 5(2))/37 = 82/37 ≈ 2.2

We can replace the observed relative frequency with the theoretical probability of the r.v. equalling x -> theoretical mean

2011 census:

x 1 2 3 4 5
f(x) 0.43 0.4 0.12 0.04 0.01

so the theoretical mean of X is: 1(0.43) + 2(0.4) + 3(0.12) + 4(0.04) + 5(0.01) = 1.8

and median is 2 since F(2) > 0.5 and F(1) < 0.5
and mode is 1

Def: The expected value or expectation or mean of a discrete r.v X is: E[X] = μ = sum over the range of x f(x)

Lecture 18

Recall - Expected value (aka mean) of X is E[X] = μ = sum of x f(x) over the range

Can think of it as a weighted average of the values X can take with weights = probabilities or balance point of the histogram of f(x). Often we may be interested in the average value of the function of x. Eg. x = usage on phone. g(X) = cost of that usage.

Def: the expected value of g(X) for a discrete r.v. X is E[g(X)] = sum of g(x) f(x),
the weighted average of the g(x) values that can occur. (e.g. g(X) = 1000 + 250X)

What if g is non-linear, e.g. g(X) = X²?
Here E[g(X)] ≠ g(E[X]), because g(x) is a non-linear function of X. Expectation is a linear operator.

SO ONLY USE E[g(X)] = g(E[X]) IF g(X) IS LINEAR
Also: E[aX + b] = aE[X] + b

Applications of Expectation (7.3)

If we have the distribution of X and we let Y = g(X), we can find E[Y] either by E[g(X)] = sum of g(x) f_X(x), or by finding the range and pf of Y and using E[Y] = sum of y f_Y(y)

Example
Suppose the time to finish a coding question is 10 minutes if you make no errors. Syntax errors take 2 extra mins to fix. Logic errors take 10 extra min. Assume P(syntax error) = 0.1, P(logic error) = 0.2, independently.

Find the average(expected) time to finish this question.
Let X = 0 (no errors), 1(syntax), 2(logic), 3(both),

x 0 1 2 3
f(y) -> f(x) 0.72 =(0.9*0.8) (0.1)*(0.8) = .08 (0.9)*(0.2)=0.18 0.02
y -> g(x) 10 12 20 22

So E[time] = 10(0.72) + 12(0.08) + 20(0.18) + 22(0.02) = 12.2 minutes

Example:
A web server has a cache. 20% chance that the request is found in the cache (cache hit) -> 10 ms. If it’s not found (cache miss) then it takes 50(send msg) + 70(lookup) + 50(return answer) ms. Find the expected time with and without the cache.
Without: time is always 50 + 70 + 50 = 170 ms, so E[T] = 170.
With: Let X = 0 if found, 1 if not found

x 0 1
f(x) 0.2 0.8
time=g(x) 10 180

Lecture 19

Recall: E[g(X)] = sum of g(x) f(x); E[aX + b] = aE[X] + b

Means (and variances) of named distributions (7.4)

  1. Binomial: X~Bin(n,p), E[X] = np
  2. Poisson: X~Poi(μ), E[X] = μ
  3. Similarly:
    1. X~NB(k,p) , E[X] = k(1-p)/p
    2. X~Hyp(N,r,n) , E[X] = nr/N
    3. X~DU[a,b] , E[X] = (a+b)/2
      these results can be proven from first principles (E[X] = sum of x f(x)) but there are other easier ways to show them

The mean of X, E[X], tells us where the distribution is centered, on average. But in practice we also care about how widely spread out the distribution is around that mean.
E.g. determining the number of servers for an online system. need to know spread.
How to measure?
E[X - μ] = 0 always (not helpful!)
E[|X - μ|] - average absolute distance from mean - is awkward to work with

So we use:
E[(X - μ)²] - average squared distance from mean

Lecture 20

Def: Variance of a r.v. X is Var(X) = σ² = E[(X - μ)²],
the weighted average squared distance from the mean. If X is discrete, Var(X) = sum of (x - μ)² f(x)

Calc form of variance: Var(X) = E[X²] - μ²

Where E[X²] = sum of x² f(x)

Example:
X has pf

x 10 12 20 22
f(x) 0.72 0.08 0.18 0.02

We found E[X] = 12.2 min
Now E[X²] = 100(0.72) + 144(0.08) + 400(0.18) + 484(0.02) = 165.2, so Var(X) = 165.2 - (12.2)² = 16.36

Note that variance is measured in units squared rather than the original units of X. Taking the square root of the variance makes more sense.

Defn: the standard deviation of X is SD(X) = σ = sqrt(Var(X))

In our example, SD(X) = sqrt(16.36) ≈ 4.04 min
Remember, Var(X) will always be non-negative because it’s a weighted average of squared (non-negative) values
So SD(X) will also be non-negative and a Real number

Mean + Var of a linear f’n of X

Let Y = aX + b. Then E[Y] = aE[X] + b,
by linearity of expectation

and Var(Y) = a² Var(X).
Note: b does not affect the variance since shifting the distribution doesn’t affect the spread. We increase all the distances by a factor of |a|, so the squared distances increase by a²

Finally, SD(Y) = |a| SD(X)

Example
Suppose X has pf

x 0 1 2 3 4
f(x) 0.1 0.1 0.1 0.5 0.2
y 1 3 5 7 9

Let Y = 2X + 1 (the y row above)

E[X] = 0(0.1) + 1(0.1) + 2(0.1) + 3(0.5) + 4(0.2) = 2.6
E[X²] = 0 + 0.1 + 0.4 + 4.5 + 3.2 = 8.2, so Var(X) = 8.2 - (2.6)² = 1.44
So E[Y] = 2(2.6) + 1 = 6.2 and Var(Y) = 2² Var(X) = 5.76

But! We can calculate the Var(Y) in a different way!
Verify E[Y] = sum of y f(y) = 6.2 and Var(Y) = E[Y²] - (6.2)² = 44.2 - 38.44 = 5.76

Variances of named distributions

  1. Poisson - X~Poi(μ), Var(X) = μ
  2. Binomial - X~Bin(n,p), Var(X) = np(1-p)

Lecture 22

Recall: a continuous rv has an uncountable range, and for any x, P(X=x) = 0
Properties of F(x)
Probability density function f(x)
expectation + variance, percentiles
transformations

Since P(X = x) = 0, P(X ≤ x) and P(X < x) are the same, so the cdf F(x) = P(X ≤ x) is unaffected.
Claim: F(x) is continuous for a cts rv X. Proof idea: P(X = x) = 0 means there is no jump at x, i.e. F(x) is left cts; similarly F(x) is right cts. So F(x) is everywhere continuous (not necessarily everywhere differentiable). As before, it is non-decreasing.
We also want to know how X behaves locally.
Def: The probability density function (pdf) f(x) for a continuous r.v. X is f(x) = F’(x) (where the derivative exists). It represents a relative likelihood of X being “near” the value x.
Properties of f(x): f(x) ≥ 0, and the total area under f(x) over the range is 1.

Note: f(x) is NOT a probability!

Imagine the area of a rectangle of height f(x) and small width Δx: P(x < X < x + Δx) ≈ f(x)Δx for a small interval. So f(x) can be thought of as a multiplier that tells us the approx prob that X is within a small interval near the value x.
Expectation. Def: The mean of a cts rv X is E[X] = ∫ x f(x) dx, integrating over the range.
Def: The pth percentile of X is the value x_p such that F(x_p) = p/100; eg. the 50th percentile is the median. Find the median for the previous example.

Stat 230 Tutorial 2 Problems - Section 102

  1. Friends add you to Facebook according to a Poisson process with rate λ per day
    a. On any given day, the probability that nobody adds you is 0.1353. Find λ
    b. Given that 5 friends added you in 3 days, what is the probability that 2 of them were on the first day?
    c. A bad day is when 1 or fewer friends add you. Show that the probability of a bad day is 0.41. Calculate. Use the rounded value in the rest.
    d. What is the probability of having 2 bad days in a week?
    e. What is the prob of having to wait at least 5 days (total) to have one bad day?

a) Let X = # who add in 1 day
X ~ Poi(λ)
We know P(X=0) = 0.1353
But P(X=0) = e^(-λ)
Equating and solving, λ = -ln(0.1353) ≈ 2

b) P(2 on day 1 | 5 in 3 days) = P(2 on day 1 AND 3 in days 2-3) / P(5 in 3 days)
= [e^(-2)2²/2!][e^(-4)4³/3!] / [e^(-6)6⁵/5!] = (5 choose 2)(1/3)²(2/3)³ ≈ 0.33

looks like binomial. We could have, at the beginning, used the binomial and gotten the same result.

c) P(bad day) = P(X ≤ 1) = P(X=0) + P(X=1) = e^(-2) + 2e^(-2) = 3e^(-2) ≈ 0.41

d) Y = # bad days out of 7
Y~Bin(7,0.41)
P(Y=2) = (7 choose 2)(0.41)^2(0.59)^5 ≈ 0.25

e) Let Z = # good before first bad
Z ~ Geo(0.41)
P(wait at least 5 days total) = P(first 4 days are all good) = P(Z ≥ 4) = (0.59)^4 ≈ 0.12

  1. Suppose X ~ Geometric(p)
    a) Find the probability that X is an odd number
    b) Find the probability that X is divisible by 3. What about divisible by k?
    c) Find the probability function of the random variable R, where R is the remainder when X is divided by 4. What about the remainder when divided by m?
    d) show that the mean of X is (1-p)/p and provide a logical explination for the relationship between the mean and the size of p.

a) P(X odd) = sum over odd x of p(1-p)^x = p(1-p)/(1 - (1-p)²) = (1-p)/(2-p)

b) P(X divisible by 3) = sum over j ≥ 0 of p(1-p)^(3j) = p / (1 - (1-p)³)

general: P(X divisible by k) = p / (1 - (1-p)^k)

c) range of R: {0, 1, 2, 3}
from before, P(R = 0) = P(X divisible by 4) = p / (1 - (1-p)⁴)

Similarly, P(R = 1) = p(1-p) / (1 - (1-p)⁴), and P(R = 2), P(R = 3) have (1-p)², (1-p)³ in the numerator

So in general f_R(r) = p(1-p)^r / (1 - (1-p)^m) for r = 0, 1, ..., m-1 (with m = 4 here)

d) we know the sum over x ≥ 0 of (1-p)^x = 1/p (geometric series)
take d/dp of both sides: the sum of x(1-p)^(x-1) = 1/p²

Now E[X] = sum of x p(1-p)^x = p(1-p) × 1/p² = (1-p)/p

Why?

Waiting time is inversely proportional to the prob of success.

  1. According to the clicker data collected in our class, the probability function X = the number of courses you are taking is:
x 3 4 5 6
f(x) 0.02 0.13 0.75 0.1

suppose the hours of schoolwork you have each week are Y = 20√X and your stress level is Z = 2X²
a) Find the expected schoolwork hours and the expected stress level of a random student.
b) Find the variance of these quantities

a) Let Y = 20√X and Z = 2X²

x    f(x)   y=20√x   z=2x²   y²     z²
3    0.02   34.64    18      1200   324
4    0.13   40       32      1600   1024
5    0.75   44.72    50      2000   2500
6    0.1    48.99    72      2400   5184

E[Y] = 0.02(34.64) + 0.13(40) + 0.75(44.72) + 0.1(48.99) ≈ 44.33
E[Z] = 0.02(18) + 0.13(32) + 0.75(50) + 0.1(72) = 49.22

b) E[Y²] = 400E[X] = 400(4.93) = 1972 -> Var(Y) = 1972 - (44.33)² ≈ 6.60
E[Z²] = 0.02(324) + 0.13(1024) + 0.75(2500) + 0.1(5184) = 2533 -> Var(Z) = 2533 - (49.22)² ≈ 110.4

  1. The download speed of a decent connection, X (measured in units of 10MBps) has prob density function (pdf)

f(x) = k√x for 0 < x < 1
f(x) = (k/2)(3-x) for 1 < x < 3
f(x) = 0 otherwise
see learn

a) we need the total area under f(x) to be 1:
∫ from 0 to 1 of k√x dx + ∫ from 1 to 3 of (k/2)(3-x) dx = 1
k(2/3) + (k/2)(2) = 1
(5/3)k = 1

So k = 3/5

Lecture 23

recall: cdf, pdf, mean, percentiles of cts X
Uniform distribution (8.2)
transformations / change of variable (8.1)
SWAG: computer-generated random numbers

Example:
A cont rv X has pdf f(x) = for 0 < x < 2 and 0 otherwise
Find c:

f(x) = F’(x)

all x



so



So F(x) = 0 if x < 0

F(x) = 1 if x > 2

Find P(X>1)

Find E[X]
E[X] =

8.2 Continuous Uniform rv

Def: If a cts rv X takes values on (a,b), where a < b are real numbers, such that all subintervals of fixed length are equally likely, we say X~U(a,b). The endpoints a and b can be included or excluded; it doesn’t matter.

Find f(x) = c (a constant)

but we need the total area to be 1:
∫ from a to b of c dx = c(b - a) = 1
so c = 1/(b - a)
f(x) = 1/(b - a) for a < x < b, 0 otherwise

F(x) = ∫ from a to x of 1/(b-a) dt = (x - a)/(b - a) for a ≤ x ≤ b

So F(x) = 0 for x < a

F(x) = 1 for x > b

So F(x) is not diff at a,b.



E[X] = ∫ from a to b of x/(b-a) dx = (a + b)/2 (average of end points)

Similarly Var(X) = (b - a)²/12 (proof as exercise. Maybe on final :^))

Makes sense that it’s proportional to the square of the range.

Special case:
a = 0, b = 1: U~U(0,1)
f(u) = 1 for 0 < u < 1 and F(u) = u for 0 ≤ u ≤ 1; both are 0
otherwise

Change of Variable (8.1)

If we have the distribution of X but we want Y = h(X), we can find the cdf/pdf of Y by changing the variable in 3 steps.

  1. write the cdf of Y, in terms of an expression using the cdf of X
  2. Use the cdf of X to evaluate it and if we desired, differentiate to get the pdf
  3. Find the range of Y (will depend on h and the range of X)

Example:
Let X=# a spinner lands on between 0 and 4.
X~U(0,4)
Let and find

Step 1:


Step 2: We know for
for
for

So

and

Step 3: range of Y:

So for
otherwise

Note: if h(X) is invertible (1-1) we can differentiate after step 1 using the chain rule.




Lecture 24

8.3 Exponential rv

def: The exponential distribution is a cts rv defined for all positive Real numbers. It represents the waiting time between events in a Poisson process.

Find the cdf of X = time until the next event in a Poisson process with rate λ.
F(x) = P(X ≤ x) = 1 - P(X > x) = 1 - P(next event occurs more than x time from now)
= 1 - P(0 events in (0,x))
but # events in (0,x) in a P.p is Poi(λx)
= 1 - e^(-λx) for x > 0

Thus f(x) = F'(x) = λe^(-λx) for x > 0

Alternate parameterization: let θ = 1/λ (the mean).
f(x) = (1/θ)e^(-x/θ) for x > 0.
(F(x) = 1 - e^(-x/θ))
We say X~Exp(θ)

E[X] = ∫ from 0 to ∞ of x(1/θ)e^(-x/θ) dx
use integration by parts OR we can use a trick called the Gamma function
Def: Γ(α) = ∫ from 0 to ∞ of y^(α-1) e^(-y) dy, for α > 0
Properties:

  1. Γ(α) = (α - 1)Γ(α - 1) for α > 1,
    and Γ(1) = 1

  2. If α is a positive integer, Γ(α) = (α - 1)!

So, using the

substitution:
Let y = x/θ, so x = θy and dx = θ dy

So: E[X] = ∫ from 0 to ∞ of θy (1/θ) e^(-y) θ dy = θ ∫ y e^(-y) dy = θΓ(2) = θ


Remember λ = rate of events in the Poisson process,
θ = 1/λ = average waiting time between events
So it makes sense that they are inversely related:
E[X] = θ = 1/λ








E[X²] = θ²Γ(3) = 2θ² (same gamma trick), so Var(X) = E[X²] - (E[X])² = 2θ² - θ² = θ²

i.e. SD(X) = θ
For Poisson, mean = variance = parameter μ.
For exponential, mean = sd = parameter θ

Example:
Requests arrive to a server with an exp. distribution of wait times. 50% of the time, the wait is at least 15 sec.
a) Find avg time between requests
Let X = time between requests ~ Exp(θ)
We know P(X > 15) = 0.5
So 1 - F(15) = 0.5
So e^(-15/θ) = 0.5
So θ = 15/ln(2) ≈ 21.6 sec (this is the average time between requests)

b) If a request has just come in, find the prob. another one comes in within 5 sec.
P(X < 5) = F(5) = 1 - e^(-5/21.6) ≈ 0.21
Could also do with Poisson rv.

The memoryless property
Example:
Suppose buses arrive according to a Poisson process with rate 5/hr.
Find P(wait > 15 mins)
θ = 60/5 = 12 min

X = waiting time ~ Exp(12)
P(X > 15) = 1 - (1 - e^(-15/12)) = e^(-15/12) ≈ 0.287

C) given you’ve already waited 6 min, find prob you wait > 15 more min.
i.e. P(X > 21 | X > 6)
= P(X > 21) / P(X > 6)
= e^(-21/12) / e^(-6/12)
= e^(-15/12) ≈ 0.287 = P(X > 15)

Waiting already done doesn’t change the future wait time

Lecture 25

Recall: X~Exp(θ) is the waiting time between events in a Poisson process with rate λ = 1/θ (cts rv)
E[X] = θ, Var(X) = θ²

Memoryless property
If X~Exp(θ),
P(X > t + s | X > t) = P(X > s)

Any waiting time already occurred is irrelevant in determining the remaining waiting time. This makes exp a bad choice for modelling human lifetimes.

Despite this we often use it in the short term for lifetimes of electronic or mechanical components.

8.5 Normal Distribution

Many real life phenomena seem to follow a Normal distribution.
- heights/weights
- test scores on large standardized tests
- logs of stock returns
- measurement errors

Def: We say X has a Normal Distribution, X~N(μ, σ²), if its pdf is:

f(x) = (1/(σ√(2π))) e^(-(x-μ)²/(2σ²)) for all Real x

What does it look like?
- symmetric around μ
- both left and right “tails” go to 0 quickly
- highest when x = μ
- area under curve = 1

We can show E[X] = μ using a change of variable and the gamma function trick, and also the median and mode are μ
And Var(X) = σ² (proof again uses the gamma f’n)

the cdf F(x) = ∫ from -∞ to x of f(t) dt

We cannot evaluate the integral in closed form! numerical methods are needed.
Note: the Normal distribution is also called the Gaussian: X~N(μ, σ²) = X~G(μ, σ) <- where σ is the SD

Standard Normal rv (special case)
Let μ = 0, σ = 1:
Z~N(0, 1)
standard Normal pdf: φ(z) = (1/√(2π)) e^(-z²/2) for all Real z

The standard Normal cdf Φ(z) = P(Z ≤ z)

still cannot be integrated analytically but tables of values of Φ(z) are readily available.

Using N(0,1) tables
row -> ones column and first decimal place
column -> second decimal place
e.g. P(z < 3.14) = 0.99916
to find F(z) where z < 0, use the fact that the Normal dist is symmetric: F(-z) = 1 - F(z).

Example:
Suppose a voltage of +2 or -2 is sent down a wire to represent 1 or 0. The connection is noisy and adds a N(0,1) amount to the voltage sent. The receiver interprets any voltage > +0.5 as a 1, and 0 otherwise.
Find P(error| 1 sent)
Let R = voltage received = 2 + Z where Z~N(0,1) is the noise
P(R < 0.5) = P(2 + Z < 0.5) = P(Z < -1.5)
=1 - P( Z < +1.5)
= 1 - 0.93319
= 0.06681
P(error | 0 sent) R = -2 + Z
P(R > 0.5) = P(Z > 2.5) = 1 - P(Z < 2.5) = 1 - 0.99379 = 0.00621

Finding percentiles of N(0,1): the pth percentile z_p is the value such that F(z_p) = p

F(z_p) = p cannot be solved analytically, so we use the F(z) table. Say we want a particular p: look for it in the body of the table.

Lecture 26

Recall: Normal Dist, probabilities + percentiles of N(0,1)
Transforming a Normal variable
Probabilities + percentiles of N(μ, σ²)
Also Recall: X~N(μ, σ²)

Standard Normal Z~N(0,1)
F(z) = P(Z ≤ z) in tables (use symmetry for negative values)

Also finding percentiles of N(0,1)
-can look for p in the main table and find the row + column closest
P(Z ≤ z_p) = p
-or can look up in the reverse Normal table (at the bottom)

Example:
Find c such that P(-c < Z < c) = 0.8

Since the Normal Dist is symmetric, we look for the c that covers 90% below it (because the 20% outside the interval splits as 10% in each tail)
So c is the 90th percentile of N(0,1)
looking it up, c = 1.2816
In practice we want probabilities for a general X~N(μ, σ²), not just N(0,1)

Suppose X~N(μ, σ²)
Let Z = (X - μ)/σ (a linear function of X)
Claim: Z~N(0,1)
Proof: F_Z(z) = P(Z ≤ z) = P(X ≤ μ + σz) = F_X(μ + σz); differentiating, f_Z(z) = σ f_X(μ + σz) = (1/√(2π)) e^(-z²/2)

Which is the pdf of a N(0,1) rv
Examples
Heights of adult men are N(68, σ²) (inches). Find the prob of a person being at least 77 in.
Let X = height of a random person from this population
P(X ≥ 77) = P((X - 68)/σ ≥ (77 - 68)/σ) = P(Z ≥ 9/σ) = 1 - F(9/σ), then look this up in the table

Example 2
Scores on the SAT reading test follow a N(504, σ²). What score is required to be in the top 10%?
Find s s.t. P(X ≥ s) = 0.10, i.e. P(X ≤ s) = 0.90

Standardize: P(Z ≤ (s - 504)/σ) = 0.90
The 90th percentile of N(0,1) is 1.2816

So set (s - 504)/σ = 1.2816, i.e. s = 504 + 1.2816σ


In general, if X~N(μ, σ²),
the pth percentile of X is μ + σ z_p, where z_p is the pth
percentile of Z~N(0,1).

Lecture 27

Ch 9 Discrete Multivariate RVs

We have models (both discrete and continuous) for a single random quantity
But we are often interested in two or more rvs at the same time.

Eg. returns on two stocks X and Y
heights and weights
in a board game, # of a suit and # of a rank
treatment vs recovery
runtime vs algorithm used
In this course we’ll focus on the discrete case, but the continuous case is similar

Def:
For two discrete rvs X and Y, the joint pf is f(x,y)=P(X=x , Y=y)
i.e. the prob that X takes the value x AND Y takes the value y at the same time. Defined where x is in the range of X and y is in the range of Y. In general, the joint pf of X_1, …, X_n is f(x_1, …, x_n) = P(X_1 = x_1, …, X_n = x_n)

Similarly, to the single var case, f(x,y) can be displayed as a table or as a function of x and y.
Example:
Experiment: toss a fair coin 3 times
Let X = # heads
Y = 1 if the first toss is heads, 0 otherwise
find f(x,y)

f(x,y)   x=0    x=1    x=2    x=3
y=0      1/8    2/8    1/8    0
y=1      0      1/8    2/8    1/8

Note:

  1. f(x,y) ≥ 0

  2. the sum of f(x,y) over all x and y is 1. why? 1 -> it’s a probability, 2 -> X and Y must take one of their values, but no double counting or missing

what if we only want info about X?
e.g.
P(X=0) = P(X=0, Y=0) + P(X=0, Y=1) = f(0,0) + f(0,1) = 1/8 + 0 = 1/8
etc

We get the pf of X by summing over the values of Y.
Similarly we could find the pf of Y by summing over the values of X.

Def: The marginal pf of X is f_X(x) = sum over y of f(x,y), and the marginal pf of Y is f_Y(y) = sum over x of f(x,y)

Note: f_X(x) and f_Y(y) will automatically satisfy the conditions for a pf
In general for X1, …, Xn, the marginal pf of X1 is: f_1(x_1) = sum over all the other variables of f(x_1, …, x_n)

Recall: two events A and B are independent iff P(AB) = P(A)P(B). We extend this idea to rvs.
Def: Two discrete rvs X and Y are independent iff f(x,y) = f_X(x) f_Y(y) for all possible pairs (x,y)
Note: to show dependence, it suffices to find one pair (x,y) where this doesn't hold; to show independence, check all pairs.

In general, X1,…,Xn are independent of each other iff f(x1, …, xn) = f_1(x_1) f_2(x_2) ⋯ f_n(x_n) for all n-tuples

In our example, f(0,1) = 0, but f_X(0) f_Y(1) = (1/8)(1/2) = 1/16 != 0
therefore X and Y are not independent (they are dependent)

A quick way to check for dependence is to find 0’s in the table
In other words, if the range is not a rectangle (formally, a Cartesian product), then we know that the variables are dependent.

Lecture 28

Remember conditional probability: P(A|B) = P(AB)/P(B)

We can extend to rvs

Def: The conditional pf of X given Y=y is: f(x|y) = P(X=x | Y=y) = f(x,y)/f_Y(y), provided f_Y(y) > 0

Similarly f(y|x) = f(x,y)/f_X(x)

Functions of multiple rvs

We may often be interested in the sum of X and Y or any function of any number of rvs.
eg. T = X + Y, U = 2(Y-X)
We can find the pf of the new rv based on the joint pf as follows:

  1. find the range of T by calculating the value of t for each pair (x,y)
  2. P(T=t) = the sum of f(x,y) over all pairs (x,y) with x + y = t (and similarly for any other function of X and Y)