STAT 230 - Probability

Lecture - 1

Lecturer: Diana Skrzydlo (Skryzlo)
Office Hours: MF 3:30-5 in M3 3106 & W 11:30-12:30 in M3 3144

What is probability?
The likelihood of an event occurring

The 3 Definitions of Probability

  1. Classical - Probability = (# of outcomes in the event) / (total # of possible outcomes) (relies on outcomes being equally likely)
  2. Relative Frequency - Probability = proportion of time (in the long run) that the event occurs
  3. Subjective - Probability = how certain the person is that it will occur

How does prob relate to CS?

Definitions to set up a probability model

Lecture 2 - Sept 12

Review Question:

What is essential to the classical definition of probability?
Outcomes are equally likely

Definitions

Experiment
A process that can be repeated multiple times, with multiple results
Trial
One iteration of an experiment
Outcome
The result on one trial of an experiment
Sample Space
A set of all possible outcomes for a single trial of an experiment. Notes: S is a set, i.e. an unordered list of unique elements. S must contain ALL possible outcomes. Two types: discrete (finite or countably infinite) -> can list/count the elements; continuous -> uncountably many elements
Example
Experiment: Roll a 6-sided die repeatedly
Trial: One roll
Outcomes: 1, 2, … , 5, 6
S = {1, 2, 3, 4, 5, 6} (most versatile)
S = {even, odd} (also works)
Event
A subset of a sample space (including the empty set ∅ and the entire sample space S)
Example
A = “a 1 is rolled” -> A = {1}
B = “the number is odd” -> B = {1, 3, 5}
Note: events consisting of one sample point are simple events; otherwise they are called compound events. On any trial, if the outcome is in the event A, we say A has occurred.
E.g. if a 1 is rolled -> A occurs, B occurs
3 is rolled -> B occurs, A does not
6 is rolled -> neither occur
A Probability

is a function that maps events in a sample space to numbers such that

  1. 0 ≤ P(A) ≤ 1 for any A
    • P(A) = 0 -> “impossible”
    • P(A) = 1 -> “guaranteed”
  2. If a_1, a_2, … , a_n are the elements of S, then P({a_1}) + P({a_2}) + … + P({a_n}) = 1
    • where each {a_i} is a simple event
    • Example: Roll a 6-sided die twice. S = {(1,1), (1,2), … , (1,6), … , (6, 1), … , (6,6)}
    • Example: BST. A Binary Search Tree is degenerate if each node has at most one child
    • Find the prob. that a tree with 3 distinct elements (inserted in random order) is degenerate:
    • We drew out all 6 insertion orders; 4 of them give a degenerate tree, so the prob is 4/6 = 2/3 (see the sketch below)

Lecture 3

Recall

P(A) is the probability that A occurs. A is a subset of S. If A is compound, its prob is the sum of the probs of the simple events that make up A.

Counting Techniques

Two basic counting rules:

  1. Addition rule - if a task can be done in p ways OR in q ways (and not both at once), there are p + q ways total
  2. Multiplication rule - if a process has one step with p choices AND a second step with q choices, there are p × q ways total

Sampling

“with replacement” - it is possible to obtain the same result more than once. With m spaces and k options for each space, there are k^m arrangements.

Permutations

A permutation is an ordering of k objects selected from n objects without replacement. Let n = number of objects, k = number of spaces. The order matters! n^(k) = n(n-1)(n-2)⋯(n-k+1) = “n to k factors” = n!/(n-k)!
Also, when k = n this is just n!, the number of ways to order all n objects.

Tip: if computing a certain probability directly is hard, try 1 - P(opposite of what you want to find). E.g. P(at least one repeat) = 1 - P(no repeats), which is easier.

Lecture - 4

Combinations

A combination is a selection of k objects from n objects without replacement. The order does NOT matter. E.g. drawing a hand of cards. Denoted by (n choose k) = n!/(k!(n-k)!). Pascal's triangle: (n choose k) is the kth entry in the nth row (counting from 0).

Some shit about pascal’s triangle that is super obvious

Pascal’s Identity

(n choose k) = (n-1 choose k-1) + (n-1 choose k)
We won’t prove this mathematically since it’s messy to prove

oh lool, we proved this in MATH 239.

Lecture - 5

Example:
A box of 10 computer chips containing 3 defective ones in the box. Four of the chips are tested. What is the probability that 1 is found defective?
Solution:
W/o rep & don’t care about order, so use combinations:
P(exactly 1 defective) = (3 choose 1)(7 choose 3) / (10 choose 4) = 105/210 = 0.5

The whole time, we don’t care about order (i.e. we don’t need to worry if the defective one is first, last, etc.)

Note: “1 is found” means exactly 1, not at least one.

Arrangements where some objects are alike (indistinguishable)

n objects, with n_1 of type 1, n_2 of type 2, ..., n_k of type k (n_1 + n_2 + ... + n_k = n), can be arranged in n!/(n_1! n_2! ⋯ n_k!) ways.

Lecture - 6

Recall - an event is a subset of S, and for any event A, 0 ≤ P(A) ≤ 1
Rules for probabilities
If an event A is contained in another event B (A subset B), then
P(A) ≤ P(B)
Also P(∅) = 0 and P(S) = 1,

thus, for any event A, 0 = P(∅) ≤ P(A) ≤ P(S) = 1
A_bar, A^c, A' all mean NOT A (the complement). A ∪ B (union) = or. A ∩ B (intersect) = and.

De Morgan’s Laws

(A ∪ B)^c = A^c ∩ B^c   (not (A or B) = (not A) and (not B))
(A ∩ B)^c = A^c ∪ B^c   (not (A and B) = (not A) or (not B))

P(A ∪ B) = P(A) + P(B) - P(AB)

Lecture - 7

Multiplication Rule

applies when events are independent of each other (don't affect one another): P(AB) = P(A)P(B)

Lecture - 8

Recall: Independent: P(AB) = P(A)P(B)
Conditional prob of A given B: P(A|B) = P(AB)/P(B), provided P(B) > 0

Last time we had (rolling a small die and a large die):
A = “small die shows 3”
C = “total of the two dice is 8”
P(A|C) = 1/5
P(A) = 1/6

C occurring makes A more likely.

P(C|A) = 1/6, P(C) = 5/36

A occurring makes C more likely. Dependence is a two-way relationship

Alternative defn of independence: P(A|B) = P(A), i.e. knowing B occurred has no effect on the probability of A occurring. Equivalently, P(B|A) = P(B).

Example:

Diana has 2 kids. Each independently has 0.5 chance of having red hair. At least one has red hair. What is the probability the other does? Let A = at least one red

(joint probabilities; rows = Kid 1, columns = Kid 2)
              Kid 2 red   Kid 2 not red
Kid 1 red        0.25          0.25
Kid 1 not red    0.25          0.25

P(A) = 0.75

P(both|A) = P(both and A) / P(A) = 0.25/0.75 = 1/3

Product Rule (4.5)

If we rearrange the definition of conditional prob, we get:
P(AB) = P(B)P(A|B)
P(AB) = P(A)P(B|A)

Intuitively, think of “checking” whether A occurred. If it did, “check” whether B occurred (knowing A did). (or you check B first, then A) Works for both dependent and independent events.

Extends nicely to multiple events P(ABC) = P(B)P(C|B)P(A|BC)
The order doesn’t matter as long as each probability is conditional on what we know.

Law of Total Probability

Let A_1, A_2, ..., A_k be a collection of events that are mutually exclusive and cover the sample space

(break S up into k disjoint pieces) This is a partition of S
e.g. A and A_bar is a partition of size 2.
For any event B in S, we can express it as B = (A_1 ∩ B) ∪ (A_2 ∩ B) ∪ ... ∪ (A_k ∩ B)
So P(B) = P(A_1)P(B|A_1) + P(A_2)P(B|A_2) + ... + P(A_k)P(B|A_k), using the product rule

Example:
Three people A,B,C are writing code together. A produces twice as much as B or C

1% of A’s code has errors
2% of B’s
5% of C’s
What is the probability that a random line of code has errors?

Let E = line has error
A = line written by A
B = line written by B
C = line written by C

We want P(E)

Using the partition {A, B, C}, P(A) = 0.5 , P(B) = P(C) = 0.25
We have P(E|A) = 0.01
P(E|B) = 0.02
P(E|C) = 0.05

So P(E) = P(A)P(E|A) + P(B)P(E|B) + P(C)P(E|C) = (0.5)(0.01) + (0.25)(0.02) + (0.25)(0.05) = 0.0225

Lecture - 9

We found P(E) = 0.0225 using law of total prob. We would like to know, if we find an error, whose fault is it?
We had to reverse the direction of the conditioning: we have P(E|author). We want P(author|E).

Bayes’ Rule

P(A|B) = P(A)P(B|A) / P(B), also written as P(A|B) = P(A)P(B|A) / [P(A)P(B|A) + P(A^c)P(B|A^c)] (expanding P(B) with the law of total probability)

For this example: P(A|E) = P(A)P(E|A)/P(E) = (0.5)(0.01)/0.0225 ≈ 0.22, and similarly P(B|E) ≈ 0.22, P(C|E) ≈ 0.56, so an error is most likely C's fault.

Example

Your spam filter has 90% detection rate but 1% false positive rate. Suppose 25% of messages are spam

Let S = “message is spam” P(S) = 0.25 P(S’) = 0.75
I = “Identified as spam”
P(I|S) = 0.9 -> P(I’|S) = 0.1, and P(I|S’) = 0.01 (false positive rate)

(photo of the worked solution here)

Machine Learning

Supervised learning -> give some training data
Unsupervised learning -> no training data

Types of Problems
Classification
spam vs non-spam email
cancerous vs benign tumor
fraudulent vs legit banking transactions
Regression
predict the price of house/stock
length of recovery time
Clustering
grouping products together
making recommendations

Bayesian Classifier
P(Category 1 | evidence) = P(Category 1) P(evidence | Category 1) / P(evidence), i.e. proportional to prior × likelihood

“Monty Hall” problem

3 doors, one hiding a prize. You pick one; the host then reveals one of the other doors to be a dud. Do you switch doors?

Yes - switching wins with probability 2/3, staying wins with only 1/3

Important Sums

  1. Geometric - sum over x = 0, 1, 2, … of t^x = 1/(1-t), for |t| < 1
  2. Binomial - sum over x = 0, 1, …, n of (n choose x) a^x b^(n-x) = (a+b)^n

Lecture - 10

Bayes’ Rule:

P(A|B) = P(A)P(B|A) / [P(A)P(B|A) + P(A')P(B|A')]
Product rule + law of total prob. A and A' is a partition

Ch 5 Random Variables

Def. A random variable (rv) is a function that maps points in a sample space to Real numbers

The values that the rv can take on are called the RANGE of the rv. By convention, we call rvs X,Y,Z and the values, x,y,z
We are interested in finding the prob. that a rv X takes on one of its particular values, i.e. P(X = x) (the outcome is a sample point that maps to x)

There are two types of rvs: discrete -> finite or countably infinite range, vs continuous -> uncountably infinite range

More than one rv can be defined on the same S

eg roll 3 fair 6-sided dice.
X = sum on 3 dice {3, … , 18}
Y = avg value {1, … , 6}
Z = the 3 digit number created by the dice
W = the product of the 3 {1, 2, … , 216}
U = number of 3s {0, 1, 2, 3}
etc


Def. The probability function (pf) of a discrete rv X is f(x) = P(X = x) and is only defined for x in the range of X. For x not in the range, f(x) is taken to be 0 (or left undefined).

Properties of f(x)
1. 0 ≤ f(x) ≤ 1 for all x in the range of X. Why? It's a probability.
2. The sum of f(x) over all x in the range is 1. (The events “X = x” are mutually exclusive for each x, and together they cover S.)

Def. The cumulative distribution function (cdf) of the rv X is F(x) = P(X ≤ x) and is defined for ALL Real x.

Lecture - 11

Recall: Random variable (X) maps sample points to Real numbers
Probability function f(x) = P(X = x) for x in the range
Cumulative distribution function F(x) = P(X ≤ x) for all Real x

Properties of F(x)

  1. 0 ≤ F(x) ≤ 1. Why? it’s a probability
  2. F(x) is non-decreasing with respect to x. Why? Can’t lose probability when increasing x
    1. Proof: the event “X ≤ x_1” is contained in “X ≤ x_2” where x_1 ≤ x_2, so F(x_1) ≤ F(x_2)
  3. lim as x -> -∞ of F(x) = 0, where 0 is impossible
  4. lim as x -> +∞ of F(x) = 1, where 1 is guaranteed

Example:

x 0 1 2 3
f(x) 125/216 75/216 15/216 1/216
F(x) 125/216 200/216 215/216 1

Let’s graph it to show what happens at other Real values of x.

Relationship between F and f
f(x) = the size of the jump in F(x) at the point x, i.e. f(x) = F(x) - F(x⁻), where x⁻ is the next smallest value in the range (for an integer-valued X this is F(x) - F(x-1))


We can use F(x) to find P(a < X ≤ b) = F(b) - F(a)

Discrete Uniform (5.2)

Let X be a rv on the range {a, a+1, …, b} where a ≤ b are integers and each value is equally likely. Then we say X has a discrete uniform distribution on [a, b], in other words “X~DU[a,b]”: X has a DU distribution on [a,b]

e.g. the roll of a fair 6-sided die is DU[1, 6]

Find the pf: f(x) = P(X = x) = c, which must be constant since all values are equally likely
We need the b - a + 1 values of f(x) to sum to 1,
so f(x) = 1/(b - a + 1) for x = a, a+1, ..., b

Find the cdf: F(x) = (floor(x) - a + 1)/(b - a + 1) for a ≤ x ≤ b (0 below a, 1 above b)

Example

You are looking through a linked list of 100 items for one particular item. Let X = # of comparisons to find it. Claim: X ~ DU [1, 100]

Lecture 12

Hypergeometric rv (5.3)

Set up: we have N objects -> r “successes”, N-r “Failures”
We choose n without replacement

Let X = “# of S’s chosen”
eg. if x=# of winning numbers on a lotto 6/49 ticket
X ~ Hyp(N,r,n)
N=49, r = 6, n =6
Find the pf: f(x) = P(X=x) = P(we get x S’s and n-x F’s) = (r choose x)(N-r choose n-x) / (N choose n)


range of x?
lower bound: x ≥ max(0, n - (N - r)) - it's possible to run out of F’s, so the remaining picks will all be S’s

upper bound: x ≤ n (all S’s), x ≤ r (could run out of S’s), so x ≤ min(n, r)
so the range of X is messy and depends on the relationship btwn N, r, and n.

Also, there is no closed form expression for F(x). You have to add up the f(x) values which is computationally intensive when N is large.
Example:
You have 10 cards - 7 treasure, 3 non-treasure.
Draw 5 without replacement. Let X = “# of treasure cards”. X ~ Hyp(N=10, r=7, n=5)
f(x) = P(X = x) = (7 choose x)(3 choose 5-x) / (10 choose 5) for x = 2,3,4,5
All the values of f(x) will add up to 1
(proven using Hypergeometric series result)

Binomial rv (5.4)

Set up: Bernoulli trials - independent trials, 2 outcomes on each (S or F), P(S) = p is constant for all trials

We do n trials.
Let X = # of S in n trials
(You can imagine it as a Hypergeometric selecting with replacement instead of without.)
We write X ~ Bin(n, p)
Eg. Flip a fair coin 10 times, x = # heads, X ~ Bin (n = 10, p = 0.5)
Find the pf: f(x) = P(X=x) = P(we get x S’s and n-x F’s) = (n choose x) p^x (1-p)^(n-x) for x = 0, 1, ..., n

Tut 1

  1. The letters of PROBABILITY are arranged at random in a row. Find the probability that:
    1. Y is in the last position
      • Total # is 9979200
      • # to have Y last = 907200
      • so Prob = 1/11 (also can just consider that 1/11 chance last letter is y)
    2. The two B’s are consecutive
      • treat “BB as one letter
      • P = (10!/2!)/(11!/(2!2!)) = 0.182
    3. The two B’s are consecutive and Y is in the last position
      • (9!/2!)/(11!/(2!2!)) = 0.018 = P(AB)
      • Note:
    4. Y is not in the last position and the two B’s are not consecutive
      • P(A_bar B_bar) = P((AUB)_bar) = 1-P(AUB) = 1-(P(A)+P(B)-P(AB)) = 0.745
    5. Y is not in the last position or the two B’s are not consecutive
      • P((AB)_bar) = P(A_bar U B_bar) = 1-P(AB) = 0.982
  2. You are trying to guess your friend’s Quest password. You know it must be 8 characters chosen from digits 0-9, lower case letters a-z, and upper case letters A-Z and is not allowed to be all letters or all numbers
    1. how many possible valid passwords are there?
      • 62 possible characters
      • 62^8 - 52^8 - 10^8
    2. You happen to know that your friend is a bit lazy with respect to password security and will only use the letters a,s,d and f (both upper and lower) and the number 1. How many possible valid passwords could your friend have?
      • now 1 + 4 + 4 = 9 characters
      • So 9^8 - 8^8 - 1 = 26269504
    3. What is the probability your friend’s password has no repeated characters?
      • no repeats - selecting w/o replacement, so order still matters: use falling factorials
      • So 9^(8) - 8^(8) - 0 = 362880 - 40320 = 322560 (9P8 total, minus the 8P8 all-letter ones; all-numbers is impossible with no repeats)
      • P = 322560/26269504 = 0.012
    4. What is the probability your friend’s password contains at least one “a” or “A”?
      • 1 - P(no “a” or “A”s)
      • # w/o = 7^8 -6^8 -1^8
      • P = 1 - (7^8 -6^8 -1^8)/( 9^8 - 8^8 - 1) = 0.844
  3. Consider the machine learning problem of classifying incoming messages as spam. We define: A_1 = message fails rdns check (i.e. the “from” domain does not match), A_2 = message is sent to over 100 people, A_3 = message contains a link with the url not matching the alt text. We will assume that the A_i’s are independent events, given that a message is spam, and that they are independent events, given that a message is regular. This is known as the “Naive Bayes Classifier” and is the simplest of the machine learning classification algorithms. We estimate P(A_1|Spam) = 0.3, P(A_2|Spam) = 0.2, P(A_3|Spam) = 0.1, P(A_1|Not Spam) = 0.005, P(A_2|Not Spam) = 0.04, P(A_3|Not Spam) = 0.05 and P(Spam) = 0.25
    1. Suppose a message has all of features 1,2,and 3 present. Det P(Spam|A_1,A_2,A_3)
      • = (P(Spam)P(A_1A_2A_3|Spam))/(P(Spam)P(A_1A_2A_3|Spam) + P(Spam_bar)P(A_1A_2A_3|Spam_bar)) = (P(S)P(A_1|S)P(A_2|S)P(A_3|S))/(P(S)P(A_1|S)P(A_2|S)P(A_3|S) + P(S_bar)P(A_1|S_bar)P(A_2|S_bar)P(A_3|S_bar)) = 0.995
    2. Suppose a message has features 1 and 2 present, but feature 3 is not present. Determine P(Spam | A_1A_2(A_3)bar).
      • = (P(S)P(A_1|S)P(A_2|S)P(A_3_bar|S))/(P(S)P(A_1|S)P(A_2|S)P(A_3_bar|S) + P(S_bar)P(A_1|S_bar)P(A_2|S_bar)P(A_3_bar|S_bar)) = 0.9895
    3. If you declared as spam any message with one or more of features 1,2, or 3 present, what fraction of spam emails would you detect?
      • P(A_1UA_2UA_3|Spam) = 1 - P(none of features) = 1-P(A_1_barA_2_barA_3_bar|spam) = 1- P(A_1_bar|Spam) … P(A_3_bar|Spam) = 0.496
    4. Given that a message is declared as spam (according to the rule in (c)), what is the probability that it actually is spam?
      • similar to e. ans = 0.8508
  4. Given that a message is declared as spam, (according to the rule in (c)), what is the probability that feature 1 is present?
  5. Let X represent the number of days in Feb. with temp below -24C. The probability function (pf) of X, f(x) = P(X=x), is given in a table (photo of the pf table here)
    1. 0.0625
    2. look at one note
    3. F(3.5) - F(0.5) = 0.375
    4. f(2)/0.375 = 0.8333…

Lecture 13

Recall:
X~Hyp(N, r, n)

X = # S’s when choosing n objects w/o rep
X~Bin(n,p): X = # S’s in n independent Bernoulli trials with P(S) = p

Example:
Want to send a 4-bit message. Each bit is independently flipped (0->1 or 1->0)
Probability = 0.1
P(message received correctly?)
Let X = # of bits flipped
X~Bin(4, 0.1)
P(X=0) = (0.9)^4 = 0.6561

Now add 3 “parity bits” to the message, which allows the receiver to detect and fix up to 1 error.
Let Y = # bits flipped ~ Bin(7, 0.1)
P(message ok) = P(Y=0) + P(Y=1) = (0.9)^7 + 7(0.1)(0.9)^6 ≈ 0.850 (both cases are ok) - better than 0.656

Bin approx to Hyp.

If we only sample a small fraction of the population (i.e. n << N), then it doesn't make a big difference to the probabilities if you sample with or without replacement. If we did it with replacement, the number of S’s we get is
X~Bin(n, p = r/N)
So when N is large and n is small, we can use a Bin(n, r/N) to approximate a Hyp(N,r,n). (Guideline: works well when n/N is less than 0.05.)

Lecture 14

recall: Negative Binomial (5.5)
Bernoulli trials (indep, S or F, P(S) = p); X = # of F’s before the kth S is obtained
pf and examples of NB
Geometric rv (5.6)
How to tell when to use distributions

Bin vs NB:
Bin - know the # of trials (n); the random # of successes is modelled by X
NB - know the # of successes (k); the random # of trials is modelled by k + X

We write X ~ NB(k, p)
range: {0, 1, 2, …}
f(x) = P(X = x) = (x + k - 1 choose x) p^k (1-p)^x

We can show that the f(x) sum to 1 (using the negative binomial series)
But there is no closed form expression for F(x)

Example:
A startup is looking for 5 investors. Each investor will independently say yes with probability 0.2. Founders will ask investors one at a time until they get 5 yes. Let X=total # of investors asked. Find f(x) and f(10).

Let Y = # who say no (Y+5=X), so Y ~ NB(k=5, p=0.2)

So f(x) = P(X = x) = P(Y = x - 5) = (x - 1 choose x - 5)(0.2)^5(0.8)^(x-5) for x = 5, 6, 7, ... and f(10) = (9 choose 5)(0.2)^5(0.8)^5 ≈ 0.0132

Geometric rv (5.6)

(nothing to do with hypergeometric)
Special case of NB with k=1.
X= # of F’s before obtaining the first S in Bernoulli trails
X~Geo(p)
range: {0, 1, 2, …}
f(x) = P(X = x) = p(1-p)^x

no orderings to worry about since it’s just FFFF….FS (x F’s)
Easy to show the f(x) sum to 1 using the infinite geometric series formula. (The geometric also has a closed-form cdf: F(x) = 1 - (1-p)^(x+1) for x = 0, 1, 2, ….)

Summary chart (one distribution per line):
Discrete Uniform: f(x) = 1/(b-a+1), range {a, a+1, .., b}, closed-form cdf; use when all values are equally likely
Hypergeometric: f(x) = (r choose x)(N-r choose n-x)/(N choose n), range is weird!, no closed-form cdf; use for # of S’s drawn without replacement
Bin: f(x) = (n choose x)p^x(1-p)^(n-x), range 0,…,n, no closed-form cdf; use for # of S’s in n independent trials
NB: f(x) = (x+k-1 choose x)p^k(1-p)^x, range 0,1,…, no closed-form cdf; use for # of F’s before the kth S
Geometric: f(x) = p(1-p)^x, range 0,1,2,…, F(x) = 1-(1-p)^(x+1); use for # of F’s before the 1st S
Poisson: f(x) = e^(-μ)μ^x/x!, range 0,1,2,…, no closed-form cdf; use as an approx to Bin with large n, small p

Lecture - 15

Recall: uniform, hypergeometric, binomial, NB, geometric
Poisson dist from Bin (5.7)
Poisson process (5.8)

Poisson rv (5.7)

We say X has a Poisson distribution with parameter μ > 0 if f(x) = e^(-μ) μ^x / x! for x = 0, 1, 2, …
Easy to show the f(x) sum to 1, since the sum of μ^x/x! is e^μ
The Poisson is a limiting case of the Binomial when n -> ∞ and p -> 0 such that the product np remains constant

Let np = μ
then, as n -> ∞, (n choose x) p^x (1-p)^(n-x) -> e^(-μ) μ^x / x!

So if we have a Bin(n,p) with large n and small p, we can approx it with a Poisson rv with parameter μ = np. Guideline: n large and p close to 0 (e.g. p < 0.05) works well!

Example:
Roll up the Rim - “1 in 9 cups win!” You buy 100 cups (treat as independent). Find prob you get 10 or fewer winning cups.

X = # winning ~ Bin(100, 1/9)

So: P(X ≤ 10) = sum over x = 0, ..., 10 of (100 choose x)(1/9)^x(8/9)^(100-x)

Try the Poisson approx with μ = np = 100/9 ≈ 11.1: P(X ≤ 10) ≈ sum over x = 0, ..., 10 of e^(-11.1)(11.1)^x/x!

Not a great approx since p = 1/9 ≈ 0.11 was a bit too high to be “close to 0”

Clicker question
Suppose you type at exactly 90 words per minute and on each word have a 1% chance of making an error. After 1 minute, what is the probability you have made NO errors?
0.405 = (0.99)^90 -> from bin. 0.407 = e^(-0.9) -> from poisson

You can also use the Poisson approximation when p is close to 1, by instead modelling the number of F's instead of S's.

Poisson Process (5.8)

Consider “events” occurring randomly throughout time/space according to 3 conditions:

  1. Independence
    1. (events have no impact on each other)
    2. # of events in non-overlapping time intervals are indep.
  2. Individuality
    1. (events occur one at a time)
    2. Cannot have two or more events at the exact same time
  3. Homogeneity/Uniformity
    1. (events occur at a constant rate λ)
    2. Prob of an event occurring in a short time interval (t, t + Δt) is proportional to Δt
    3. Can’t have periods of higher activity

E.g. emails into an inbox
Cars through an intersection
Births in a large population

Lecture - 16

Imagine we observe a Poisson process (with rate λ) for t units of time.
Let X = # of events that occur.
X is a discrete rv with no maximum.
It turns out that:
X ~ Poi(μ = λt)
ie. f(x) = e^(-λt)(λt)^x/x! for x = 0, 1, 2, …

Proof: See course notes
Add one more column to your chart!

Poisson:
f(x) = e^(-μ)μ^x/x!
When to use? # of events in a Poisson process (μ = λt), or Bin with large n, small p (approx)
range 0,1,2,…

When NOT to use it?
When we can specify a maximum # of events
Or if it makes sense to ask how many times the event did not occur

Example:

requests to a web server follow the conditions of a Poisson process with rate 100 per minute.
Find prob of 1 request in 1 sec. 90 requests in 1 min.

Let X = # requests in 1 sec
X ~ Poi(μ = 100/60 ≈ 1.67)
P(X=1) = e^(-100/60)(100/60) ≈ 0.31

Let Y = # requests in 1 min
Y ~ Poi(μ = 100)
P(Y=90) = e^(-100)(100)^90 / 90!

Combining Models (5.9)

Many problems may combine more than one distribution together. Your task is to identify the distribution needed, depending on the probability requested.

Example: Server requests 100/min
A 1-second period is “quiet” if it contains no requests.

a) Find the Prob of a “quiet” second.
X = # requests in 1 sec. X ~ Poi(100/60)
P(X = 0) = e^(-100/60) ≈ 0.189

b) Prob of 10 “quiet” seconds in a minute (60 non-overlapping sec)
Let Y = # “quiet” in 60 sec.
Y~Bin(60,0.189)
P(Y=10) = (60 choose 10)(0.189)^10(0.811)^50

c) Prob of having to wait 30 non-overlapping sec to get 2 “quiet”
Let Z = # non-quiet sec before 2 “quiet”
Z ~ NB(2, 0.189)
P(Z = 28) = (29 choose 28)(0.189)^2(0.811)^28

d) Given (c), the prob there is 1 “quiet” sec in the first 15 sec.
P(1 Q in 15 sec | wait 30 for 2 Q)


Use binomial for 1st prob. Geometric for 2nd.

Lecture 17

Ch 7. Expected Value + Variance Summarizing Data

Let X = # of kids in a family

To summarize the data, we can use:
1. A frequency distribution

x frequency
1 10
2 14
3 10
4 1
5 2

2. a frequency histogram (different graph here because I was lazy and didn’t want to draw one in ms paint)

3. A single number representing the average or sample mean

4. median - the middle value: 2 in this case
5. mode - most common/frequent value (in this case, 2)

Expectation of a r.v. (7.2)
We had the sample mean of the kids be (1(10) + 2(14) + 3(10) + 4(1) + 5(2))/37 = 82/37 ≈ 2.2

We can replace the observed relative frequency with the theoretical probability of the r.v. equalling x -> theoretical mean

2011 census:

x 1 2 3 4 5
f(x) 0.43 0.4 0.12 0.04 0.01

so the theoretical mean of X is: 1(0.43) + 2(0.4) + 3(0.12) + 4(0.04) + 5(0.01) = 1.8

and median is 2 since F(2) > 0.5 and F(1) < 0.5
and mode is 1

Def: The expected value or expectation or mean of a discrete r.v X is: E[X] = μ = sum over the range of x f(x)

Lecture 18

Recall - Expected value (aka mean) of X is E[X] = μ = sum of x f(x) over the range

Can think of it as a weighted average of the values X can take with weights = probabilities or balance point of the histogram of f(x). Often we may be interested in the average value of the function of x. Eg. x = usage on phone. g(X) = cost of that usage.

Def: the expected value of g(X) for a discrete r.v. X is E[g(X)] = sum of g(x) f(x),
the weighted average of the g(x) values that can occur. (e.g. g(X) = 1000 + 250X)

What if g is non-linear, e.g. g(X) = X²?
Here E[g(X)] ≠ g(E[X]), because g(x) is a non-linear function of X. Expectation is a linear operator.

SO ONLY USE E[g(X)] = g(E[X]) IF g(X) IS LINEAR
Also: E[aX + b] = aE[X] + b

Applications of Expectation (7.3)

If we have the distribution of X and we let Y = g(X), we can find E[Y] either by E[g(X)] = sum of g(x) f_X(x), or by finding the range and pf of Y and using E[Y] = sum of y f_Y(y)

Example
Suppose the time to finish a coding question is 10 minutes if you make no errors. Syntax errors take 2 extra mins to fix. Logic errors take 10 extra min. Assume P(syntax error) = 0.1, P(logic error) = 0.2, independently.

Find the average(expected) time to finish this question.
Let X = 0 (no errors), 1(syntax), 2(logic), 3(both),

x 0 1 2 3
f(y) -> f(x) 0.72 =(0.9*0.8) (0.1)*(0.8) = .08 (0.9)*(0.2)=0.18 0.02
y -> g(x) 10 12 20 22

So E[time] = 10(0.72) + 12(0.08) + 20(0.18) + 22(0.02) = 12.2 minutes

Example:
A web server has a cache. 20% chance that the request is found in the cache (cache hit) -> 10 ms. If it’s not found (cache miss) then it takes 50(send msg) + 70(lookup) + 50(return answer) ms. Find the expected time with and without the cache.
Without: time is always 50 + 70 + 50 = 170 ms, so E[T] = 170.
With: Let X = 0 if found, 1 if not found

x 0 1
f(x) 0.2 0.8
time=g(x) 10 180

Lecture 19

Recall: E[g(X)] = sum of g(x) f(x); E[aX + b] = aE[X] + b

Means (and variances) of named distributions (7.4)

  1. Binomial: X~Bin(n,p), E[X] = np
  2. Poisson: X~Poi(μ), E[X] = μ
  3. Similarly:
    1. X~NB(k,p) , E[X] = k(1-p)/p
    2. X~Hyp(N,r,n) , E[X] = nr/N
    3. X~DU[a,b] , E[X] = (a+b)/2
      these results can be proven from first principles (E[X] = sum of x f(x)) but there are other easier ways to show them

The mean of X, E[X], tells us where the distribution is centered, on average. But in practice we also care about how widely spread out the distribution is around that mean.
E.g. determining the number of servers for an online system. need to know spread.
How to measure?
E[X - μ] = 0 always (not helpful!)
E[|X - μ|] - average absolute distance from mean - is awkward to work with

So we use:
E[(X - μ)²] - average squared distance from mean

Lecture 20

Def: Variance of a r.v. X is Var(X) = σ² = E[(X - μ)²],
the weighted average squared distance from the mean. If X is discrete, Var(X) = sum of (x - μ)² f(x)

Calc form of variance: Var(X) = E[X²] - μ²

Where E[X²] = sum of x² f(x)

Example:
X has pf

x 10 12 20 22
f(x) 0.72 0.08 0.18 0.02

We found E[X] = 12.2 min
Now E[X²] = 100(0.72) + 144(0.08) + 400(0.18) + 484(0.02) = 165.2, so Var(X) = 165.2 - (12.2)² = 16.36

Note that variance is measured in units squared rather than the original units of X. Taking the square root of the variance makes more sense.

Defn: the standard deviation of X is SD(X) = σ = sqrt(Var(X))

In our example, SD(X) = sqrt(16.36) ≈ 4.04 min
Remember, Var(X) will always be non-negative because it’s a weighted average of squared (non-negative) values
So SD(X) will also be non-negative and a Real number

Mean + Var of a linear f’n of X

Let Y = aX + b. Then E[Y] = aE[X] + b,
by linearity of expectation

and Var(Y) = a² Var(X).
Note: b does not affect the variance since shifting the distribution doesn’t affect the spread. We increase all the distances by a factor of |a|, so the squared distances increase by a²

Finally, SD(Y) = |a| SD(X)

Example
Suppose X has pf

x 0 1 2 3 4
f(x) 0.1 0.1 0.1 0.5 0.2
y 1 3 5 7 9

Let Y = 2X + 1 (the y row above)

E[X] = 0(0.1) + 1(0.1) + 2(0.1) + 3(0.5) + 4(0.2) = 2.6
E[X²] = 0 + 0.1 + 0.4 + 4.5 + 3.2 = 8.2, so Var(X) = 8.2 - (2.6)² = 1.44
So E[Y] = 2(2.6) + 1 = 6.2 and Var(Y) = 2² Var(X) = 5.76

But! We can calculate the Var(Y) in a different way!
Verify E[Y] = sum of y f(y) = 6.2 and Var(Y) = E[Y²] - (6.2)² = 44.2 - 38.44 = 5.76

Variances of named distributions

  1. Poisson - X~Poi(μ), Var(X) = μ
  2. Binomial - X~Bin(n,p), Var(X) = np(1-p)

Lecture 22

Recall: a continuous rv has an uncountable range, and for any x, P(X=x) = 0
Properties of F(x)
Probability density function f(x)
expectation + variance, percentiles
transformations

Since P(X = x) = 0, P(X ≤ x) and P(X < x) are the same, so the cdf F(x) = P(X ≤ x) is unaffected.
Claim: F(x) is continuous for a cts rv X. Proof idea: P(X = x) = 0 means there is no jump at x, i.e. F(x) is left cts; similarly F(x) is right cts. So F(x) is everywhere continuous (not necessarily everywhere differentiable). As before, it is non-decreasing.
We also want to know how X behaves locally.
Def: The probability density function (pdf) f(x) for a continuous r.v. X is f(x) = F’(x) (where the derivative exists). It represents a relative likelihood of X being “near” the value x.
Properties of f(x): f(x) ≥ 0, and the total area under f(x) over the range is 1.

Note: f(x) is NOT a probability!

Imagine the area of a rectangle of height f(x) and small width Δx: P(x < X < x + Δx) ≈ f(x)Δx for a small interval. So f(x) can be thought of as a multiplier that tells us the approx prob that X is within a small interval near the value x.
Expectation. Def: The mean of a cts rv X is E[X] = ∫ x f(x) dx, integrating over the range.
Def: The pth percentile of X is the value x_p such that F(x_p) = p/100; eg. the 50th percentile is the median. Find the median for the previous example.

Stat 230 Tutorial 2 Problems - Section 102

  1. Friends add you to Facebook according to a Poisson process with rate λ per day
    a. On any given day, the probability that nobody adds you is 0.1353. Find λ
    b. Given that 5 friends added you in 3 days, what is the probability that 2 of them were on the first day?
    c. A bad day is when 1 or fewer friends add you. Show that the probability of a bad day is 0.41. Calculate. Use the rounded value in the rest.
    d. What is the probability of having 2 bad days in a week?
    e. What is the prob of having to wait at least 5 days (total) to have one bad day?

a) Let X = # who add in 1 day
X ~ Poi(λ)
We know P(X=0) = 0.1353
But P(X=0) = e^(-λ)
Equating and solving, λ = -ln(0.1353) ≈ 2

b) P(2 on day 1 | 5 in 3 days) = P(2 on day 1 AND 3 in days 2-3) / P(5 in 3 days)
= [e^(-2)2²/2!][e^(-4)4³/3!] / [e^(-6)6⁵/5!] = (5 choose 2)(1/3)²(2/3)³ ≈ 0.33

looks like binomial. We could have, at the beginning, used the binomial and gotten the same result.

c) P(bad day) = P(X ≤ 1) = P(X=0) + P(X=1) = e^(-2) + 2e^(-2) = 3e^(-2) ≈ 0.41

d) Y = # bad days out of 7
Y~Bin(7,0.41)
P(Y=2) = (7 choose 2)(0.41)^2(0.59)^5 ≈ 0.25

e) Let Z = # good before first bad
Z ~ Geo(0.41)
P(wait at least 5 days total) = P(first 4 days are all good) = P(Z ≥ 4) = (0.59)^4 ≈ 0.12

  1. Suppose X ~ Geometric(p)
    a) Find the probability that X is an odd number
    b) Find the probability that X is divisible by 3. What about divisible by k?
    c) Find the probability function of the random variable R, where R is the remainder when X is divided by 4. What about the remainder when divided by m?
    d) show that the mean of X is (1-p)/p and provide a logical explination for the relationship between the mean and the size of p.

a) P(X odd) = sum over odd x of p(1-p)^x = p(1-p)/(1 - (1-p)²) = (1-p)/(2-p)

b) P(X divisible by 3) = sum over j ≥ 0 of p(1-p)^(3j) = p / (1 - (1-p)³)

general: P(X divisible by k) = p / (1 - (1-p)^k)

c) range of R: {0, 1, 2, 3}
from before, P(R = 0) = P(X divisible by 4) = p / (1 - (1-p)⁴)

Similarly, P(R = 1) = p(1-p) / (1 - (1-p)⁴), and P(R = 2), P(R = 3) have (1-p)², (1-p)³ in the numerator

So in general f_R(r) = p(1-p)^r / (1 - (1-p)^m) for r = 0, 1, ..., m-1 (with m = 4 here)

d) we know the sum over x ≥ 0 of (1-p)^x = 1/p (geometric series)
take d/dp of both sides: the sum of x(1-p)^(x-1) = 1/p²

Now E[X] = sum of x p(1-p)^x = p(1-p) × 1/p² = (1-p)/p

Why?

Waiting time is inversely proportional to the prob of success.

  1. According to the clicker data collected in our class, the probability function X = the number of courses you are taking is:
x 3 4 5 6
f(x) 0.02 0.13 0.75 0.1

suppose the hours of schoolwork you have each week are Y = 20√X and your stress level is Z = 2X²
a) Find the expected schoolwork hours and the expected stress level of a random student.
b) Find the variance of these quantities

a) Let Y = 20√X and Z = 2X²

x    f(x)   y=20√x   z=2x²   y²     z²
3    0.02   34.64    18      1200   324
4    0.13   40       32      1600   1024
5    0.75   44.72    50      2000   2500
6    0.1    48.99    72      2400   5184

E[Y] = 0.02(34.64) + 0.13(40) + 0.75(44.72) + 0.1(48.99) ≈ 44.33
E[Z] = 0.02(18) + 0.13(32) + 0.75(50) + 0.1(72) = 49.22

b) E[Y²] = 400E[X] = 400(4.93) = 1972 -> Var(Y) = 1972 - (44.33)² ≈ 6.60
E[Z²] = 0.02(324) + 0.13(1024) + 0.75(2500) + 0.1(5184) = 2533 -> Var(Z) = 2533 - (49.22)² ≈ 110.4

  1. The download speed of a decent connection, X (measured in units of 10MBps) has prob density function (pdf)

f(x) = k√x for 0 < x < 1
f(x) = (k/2)(3-x) for 1 < x < 3
f(x) = 0 otherwise
see learn

a) we need the total area under f(x) to be 1:
∫ from 0 to 1 of k√x dx + ∫ from 1 to 3 of (k/2)(3-x) dx = 1
k(2/3) + (k/2)(2) = 1
(5/3)k = 1

So k = 3/5

Lecture 23

recall: cdf, pdf, mean, percentiles of cts X
Uniform distribution (8.2)
transformations / change of variable (8.1)
SWAG: computer-generated random numbers

Example:
A cont rv X has pdf f(x) = for 0 < x < 2 and 0 otherwise
Find c:

f(x) = F’(x)

all x



so



So F(x) = 0 if x < 0

F(x) = 1 if x > 2

Find P(X>1)

Find E[X]
E[X] =

8.2 Continuous Uniform rv

Def: If a cts rv X takes values on (a,b), where a < b are real numbers, such that all subintervals of fixed length are equally likely, we say X~U(a,b). The endpoints a and b can be included or excluded; it doesn’t matter.

Find f(x) = c (a constant)

but we need the total area to be 1:
∫ from a to b of c dx = c(b - a) = 1
so c = 1/(b - a)
f(x) = 1/(b - a) for a < x < b, 0 otherwise

F(x) = ∫ from a to x of 1/(b-a) dt = (x - a)/(b - a) for a ≤ x ≤ b

So F(x) = 0 for x < a

F(x) = 1 for x > b

So F(x) is not diff at a,b.



E[X] = ∫ from a to b of x/(b-a) dx = (a + b)/2 (average of end points)

Similarly Var(X) = (b - a)²/12 (proof as exercise. Maybe on final :^))

Makes sense that it’s proportional to the square of the range.

Special case:
a = 0, b = 1: U~U(0,1)
f(u) = 1 for 0 < u < 1 and F(u) = u for 0 ≤ u ≤ 1; both are 0
otherwise

Change of Variable (8.1)

If we have the distribution of X but we want Y = h(X), we can find the cdf/pdf of Y by changing the variable in 3 steps.

  1. write the cdf of Y, in terms of an expression using the cdf of X
  2. Use the cdf of X to evaluate it and if we desired, differentiate to get the pdf
  3. Find the range of Y (will depend on h and the range of X)

Example:
Let X=# a spinner lands on between 0 and 4.
X~U(0,4)
Let and find

Step 1:


Step 2: We know for
for
for

So

and

Step 3: range of Y:

So for
otherwise

Note: if h(X) is invertible (1-1) we can differentiate after step 1 using the chain rule.




Lecture 24

8.3 Exponential rv

def: The exponential distribution is a cts rv defined for all positive Real numbers. It represents the waiting time between events in a Poisson process.

Find the cdf of X = time until the next event in a Poisson process with rate λ.
F(x) = P(X ≤ x) = 1 - P(X > x) = 1 - P(next event occurs more than x time from now)
= 1 - P(0 events in (0,x))
but # events in (0,x) in a P.p is Poi(λx)
= 1 - e^(-λx) for x > 0

Thus f(x) = F'(x) = λe^(-λx) for x > 0

Alternate parameterization: let θ = 1/λ (the mean).
f(x) = (1/θ)e^(-x/θ) for x > 0.
(F(x) = 1 - e^(-x/θ))
We say X~Exp(θ)

E[X] = ∫ from 0 to ∞ of x(1/θ)e^(-x/θ) dx
use integration by parts OR we can use a trick called the Gamma function
Def: Γ(α) = ∫ from 0 to ∞ of y^(α-1) e^(-y) dy, for α > 0
Properties:

  1. Γ(α) = (α - 1)Γ(α - 1) for α > 1,
    and Γ(1) = 1

  2. If α is a positive integer, Γ(α) = (α - 1)!

So, using the

substitution:
Let y = x/θ, so x = θy and dx = θ dy

So: E[X] = ∫ from 0 to ∞ of θy (1/θ) e^(-y) θ dy = θ ∫ y e^(-y) dy = θΓ(2) = θ


Remember λ = rate of events in the Poisson process,
θ = 1/λ = average waiting time between events
So it makes sense that they are inversely related:
E[X] = θ = 1/λ








E[X²] = θ²Γ(3) = 2θ² (same gamma trick), so Var(X) = E[X²] - (E[X])² = 2θ² - θ² = θ²

i.e. SD(X) = θ
For Poisson, mean = variance = parameter μ.
For exponential, mean = sd = parameter θ

Example:
Requests arrive to a server with an exp. distribution of wait times. 50% of the time, the wait is at least 15 sec.
a) Find avg time between requests
Let X = time between requests ~ Exp(θ)
We know P(X > 15) = 0.5
So 1 - F(15) = 0.5
So e^(-15/θ) = 0.5
So θ = 15/ln(2) ≈ 21.6 sec (this is the average time between requests)

b) If a request has just come in, find the prob. another one comes in within 5 sec.
P(X < 5) = F(5) = 1 - e^(-5/21.6) ≈ 0.21
Could also do with Poisson rv.

The memoryless property
Example:
Suppose buses arrive according to a Poisson process with rate 5/hr.
Find P(wait > 15 mins)
θ = 60/5 = 12 min

X = waiting time ~ Exp(12)
P(X > 15) = 1 - (1 - e^(-15/12)) = e^(-15/12) ≈ 0.287

C) given you’ve already waited 6 min, find prob you wait > 15 more min.
i.e. P(X > 21 | X > 6)
= P(X > 21) / P(X > 6)
= e^(-21/12) / e^(-6/12)
= e^(-15/12) ≈ 0.287 = P(X > 15)

Waiting already done doesn’t change the future wait time

Lecture 25

Recall: X~Exp(θ) is the waiting time between events in a Poisson process with rate λ = 1/θ (cts rv)
E[X] = θ, Var(X) = θ²

Memoryless property
If X~Exp(θ),
P(X > t + s | X > t) = P(X > s)

Any waiting time already occurred is irrelevant in determining the remaining waiting time. This makes exp a bad choice for modelling human lifetimes.

Despite this we often use it in the short term for lifetimes of electronic or mechanical components.

8.5 Normal Distribution

Many real life phenomena seem to follow a Normal distribution.
- heights/weights
- test scores on large standardized tests
- logs of stock returns
- measurement errors

Def: We say X has a Normal Distribution, X~N(μ, σ²), if its pdf is:

f(x) = (1/(σ√(2π))) e^(-(x-μ)²/(2σ²)) for all Real x

What does it look like?
- symmetric around μ
- both left and right “tails” go to 0 quickly
- highest when x = μ
- area under curve = 1

We can show E[X] = μ using a change of variable and the gamma function trick, and also the median and mode are μ
And Var(X) = σ² (proof again uses the gamma f’n)

the cdf F(x) = ∫ from -∞ to x of f(t) dt

We cannot evaluate the integral in closed form! numerical methods are needed.
Note: the Normal distribution is also called the Gaussian: X~N(μ, σ²) = X~G(μ, σ) <- where σ is the SD

Standard Normal rv (special case)
Let μ = 0, σ = 1:
Z~N(0, 1)
standard Normal pdf: φ(z) = (1/√(2π)) e^(-z²/2) for all Real z

The standard Normal cdf Φ(z) = P(Z ≤ z)

still cannot be integrated analytically but tables of values of Φ(z) are readily available.

Using N(0,1) tables
row -> ones column and first decimal place
column -> second decimal place
e.g. P(z < 3.14) = 0.99916
to find F(z) where z < 0, use the fact that the Normal dist is symmetric: F(-z) = 1 - F(z).

Example:
Suppose a voltage of +2 or -2 is sent down a wire to represent 1 or 0. The connection is noisy and adds a N(0,1) amount to the voltage sent. The receiver interprets any voltage > +0.5 as a 1, and 0 otherwise.
Find P(error| 1 sent)
Let R = voltage received = 2 + Z where Z~N(0,1) is the noise
P(R < 0.5) = P(2 + Z < 0.5) = P(Z < -1.5)
=1 - P( Z < +1.5)
= 1 - 0.93319
= 0.06681
P(error | 0 sent) R = -2 + Z
P(R > 0.5) = P(Z > 2.5) = 1 - P(Z < 2.5) = 1 - 0.99379 = 0.00621

Finding percentiles of N(0,1): the pth percentile z_p is the value such that F(z_p) = p

F(z_p) = p cannot be solved analytically, so we use the F(z) table. Say we want a particular p: look for it in the body of the table.

Lecture 26

Recall: Normal Dist, probabilities + percentiles of N(0,1)
Transforming a Normal variable
Probabilities + percentiles of N(μ, σ²)
Also Recall: X~N(μ, σ²)

Standard Normal Z~N(0,1)
F(z) = P(Z ≤ z) in tables (use symmetry for negative values)

Also finding percentiles of N(0,1)
-can look for p in the main table and find the row + column closest
P(Z ≤ z_p) = p
-or can look up in the reverse Normal table (at the bottom)

Example:
Find c such that P(-c < Z < c) = 0.8

Since the Normal Dist is symmetric, we look for the c that covers 90% below it (because the 20% outside the interval splits as 10% in each tail)
So c is the 90th percentile of N(0,1)
looking it up, c = 1.2816
In practice we want probabilities for a general X~N(μ, σ²), not just N(0,1)

Suppose X~N(μ, σ²)
Let Z = (X - μ)/σ (a linear function of X)
Claim: Z~N(0,1)
Proof: F_Z(z) = P(Z ≤ z) = P(X ≤ μ + σz) = F_X(μ + σz); differentiating, f_Z(z) = σ f_X(μ + σz) = (1/√(2π)) e^(-z²/2)

Which is the pdf of a N(0,1) rv
Examples
Heights of adult men are N(68, σ²) (inches). Find the prob of a person being at least 77 in.
Let X = height of a random person from this population
P(X ≥ 77) = P((X - 68)/σ ≥ (77 - 68)/σ) = P(Z ≥ 9/σ) = 1 - F(9/σ), then look this up in the table

Example 2
Scores on the SAT reading test follow a N(504, σ²). What score is required to be in the top 10%?
Find s s.t. P(X ≥ s) = 0.10, i.e. P(X ≤ s) = 0.90

Standardize: P(Z ≤ (s - 504)/σ) = 0.90
The 90th percentile of N(0,1) is 1.2816

So set (s - 504)/σ = 1.2816, i.e. s = 504 + 1.2816σ


In general, if X~N(μ, σ²),
the pth percentile of X is μ + σ z_p, where z_p is the pth
percentile of Z~N(0,1).

Lecture 27

Ch 9 Discrete Multivariate RVs

We have models (both discrete and continuous) for a single random quantity
But we are often interested in two or more rvs at the same time.

Eg. returns on two stocks X and Y
heights and weights
in a board game, # of a suit and # of a rank
treatment vs recovery
runtime vs algorithm used
In this course we’ll focus on the discrete case, but the continuous case is similar

Def:
For two discrete rvs X and Y, the joint pf is f(x,y)=P(X=x , Y=y)
i.e. the prob that X takes the value x AND Y takes the value y at the same time. Defined where x is in the range of X and y is in the range of Y. In general, the joint pf of X_1, …, X_n is f(x_1, …, x_n) = P(X_1 = x_1, …, X_n = x_n)

Similarly, to the single var case, f(x,y) can be displayed as a table or as a function of x and y.
Example:
Experiment: toss a fair coin 3 times
Let X = # heads
Y = 1 if the first toss is heads, 0 otherwise
find f(x,y)

f(x,y)   x=0    x=1    x=2    x=3
y=0      1/8    2/8    1/8    0
y=1      0      1/8    2/8    1/8

Note:

  1. f(x,y) ≥ 0

  2. the sum of f(x,y) over all x and y is 1. why? 1 -> it’s a probability, 2 -> X and Y must take one of their values, but no double counting or missing

what if we only want info about X?
e.g.
P(X=0) = P(X=0, Y=0) + P(X=0, Y=1) = f(0,0) + f(0,1) = 1/8 + 0 = 1/8
etc

We get the pf of X by summing over the values of Y.
Similarly we could find the pf of Y by summing over the values of X.

Def: The marginal pf of X is f_X(x) = sum over y of f(x,y), and the marginal pf of Y is f_Y(y) = sum over x of f(x,y)

Note: f_X(x) and f_Y(y) will automatically satisfy the conditions for a pf
In general for X1, …, Xn, the marginal pf of X1 is: f_1(x_1) = sum over all the other variables of f(x_1, …, x_n)

Recall: two events A and B are independent iff P(AB) = P(A)P(B). We extend this idea to rvs.
Def: Two discrete rvs X and Y are independent iff f(x,y) = f_X(x) f_Y(y) for all possible pairs (x,y)
Note: to show dependence, it suffices to find one pair (x,y) where this doesn't hold; to show independence, check all pairs.

In general, X1,…,Xn are independent of each other iff f(x1, …, xn) = f_1(x_1) f_2(x_2) ⋯ f_n(x_n) for all n-tuples

In our example, f(0,1) = 0, but f_X(0) f_Y(1) = (1/8)(1/2) = 1/16 != 0
therefore X and Y are not independent (they are dependent)

A quick way to check for dependence is to find 0’s in the table
In other words, if the range is not a rectangle (formally, a Cartesian product), then we know that the variables are dependent.

Lecture 28

Remember conditional probability: P(A|B) = P(AB)/P(B)

We can extend to rvs

Def: The conditional pf of X given Y=y is: f(x|y) = P(X=x | Y=y) = f(x,y)/f_Y(y), provided f_Y(y) > 0

Similarly f(y|x) = f(x,y)/f_X(x)

Functions of multiple rvs

We may often be interested in the sum of X and Y or any function of any number of rvs.
eg. T = X + Y, U = 2(Y-X)
We can find the pf of the new rv based on the joint pf as follows:

  1. find the range of T by calculating the value of t for each pair (x,y)
  2. P(T=t) = the sum of f(x,y) over all pairs (x,y) with x + y = t (and similarly for any other function of X and Y)