Estimation with Large Sample of Data: Part I
Ba Chu
E-mail: ba [email protected]
Web: http://www.carleton.ca/∼bchu
(Note that this is a lecture note. Please refer to the textbooks suggested in the course outline for
details. Examples will be given and explained in the class.)
1 Objectives
At the end of Lecture 04, I emphasized that, because of sampling variations and small sample size,
the sample estimates of population quantities are rarely precise. The present lecture is an early step
in exploring this phenomenon in greater depth. To ease students into this rather complicated topic,
I will first introduce a number of important theorems that govern the sampling distribution of the
sample average, $\bar{X}_n$: 1) the weak law of large numbers (WLLN) and 2) the central limit theorem (CLT).
The main assumption that I have imposed throughout this lecture is that data in our sample are
IID.
2 (Weak) Law of Large Numbers (WLLN)
We know from classical probability that if a coin is tossed one time, we cannot predict the outcome,
but the probability of getting a head is 1/2 and the probability of getting a tail is 1/2 if everything
is fair. But what happens if we toss the coin 100 times? Will we get 50 heads? Common sense tells
us that most of the time, we will not get exactly 50 heads, but we should get close to 50 heads.
What will happen if we toss a coin 1000 times? Will we get exactly 500 heads? Probably not.
However, as the number of tosses increases, the ratio of the number of heads to the total number
of tosses will get closer to 1/2. This phenomenon is known as the law of large numbers. This law
holds for any type of gambling game such as rolling dice, playing roulette, etc.
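The coin-tossing claim above can be illustrated with a minimal Python sketch. It is not part of the original lecture: the function name `head_ratio`, the seed, and the chosen numbers of tosses are assumptions made for illustration, and Python's pseudo-random generator stands in for a fair coin.

```python
import random

def head_ratio(n_tosses, seed=0):
    """Simulate n_tosses fair-coin tosses and return the fraction of heads."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# Compare the ratio of heads with 1/2 for increasingly long runs of tosses.
for n in (100, 1_000, 10_000, 100_000):
    ratio = head_ratio(n)
    print(f"n = {n:>7}: ratio of heads = {ratio:.4f}, |ratio - 1/2| = {abs(ratio - 0.5):.4f}")
```

The distance from 1/2 need not shrink at every step, but it tends to be much smaller for long runs than for short ones.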
Suppose that a random variable, X, follows a distribution, P , and that a scientist wants to
estimate the population mean, µ = E[X]. To do so, the scientist draws n independent copies of X,
say $X_1, \ldots, X_n$, from $P$, then computes
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i.$$
The random variable $\bar{X}_n$ is called the sample mean. The question is how large $n$ must be for the sample mean to be sufficiently close to the true mean, $\mu$. The answer is simple: large samples are more
reliable than small or moderate samples.
Suppose that $X_1, \ldots, X_n$ have a finite mean $E[X_i] = \mu$ and a finite variance $Var(X_i) = \sigma^2$. We
shall study the behaviour of $\bar{X}_n$ as $n$ increases.
We can notice at the outset that, although the expectations of $X$ and $\bar{X}_n$ are the same, i.e., $E[\bar{X}_n] = \mu$, the variances are different, i.e.,
$$Var(\bar{X}_n) = \frac{1}{n^2} \sum_{i=1}^{n} Var(X_i) = \frac{\sigma^2}{n},$$
where the first equality uses the independence of the $X_i$. Hence, the sample mean has
less variability than any of the individual random variables that are being averaged. Averaging
decreases variation, i.e., as $n \to \infty$, $Var(\bar{X}_n) \to 0$. This suggests that, if the population mean,
$\mu$, is unknown, we can draw inferences about it by observing the behaviour of the sample mean,
$\bar{X}_n$. To do this, we need to study the WLLN and CLT. We shall begin with a definition:
Definition 1. A sequence of random variables $X_1, \ldots, X_n$ converges in prob. to a constant, $c$,
written as $X_n \xrightarrow{p} c$, if and only if, for every $\epsilon > 0$, $\lim_{n \to \infty} P(X_n \in (c - \epsilon, c + \epsilon)) = 1$.
[ Figure 1 is about here.]
The above concept allows us to state an important result.
Theorem 1 (WLLN). Let $X_1, \ldots, X_n$ be any sequence of IID random variables having finite mean
$\mu$ and finite variance $\sigma^2$. Then
$$\bar{X}_n \xrightarrow{p} \mu.$$
Corollary 2 (Law of Averages). Let $A$ be any event and consider a sequence of IID experiments
in which we observe whether or not $A$ occurs. Let $p = P(A)$ and define IID random variables by
$$X_i = \begin{cases} 1, & A \text{ occurs}, \\ 0, & A^c \text{ occurs}. \end{cases}$$
Then $X_i \sim \text{Bernoulli}(p)$, $\bar{X}_n$ is the observed frequency with which $A$ occurs in $n$ trials, and $\mu =
E[X_i] = p$ is the theoretical prob. of $A$. The WLLN states that the former tends to the latter as the
number of trials increases.
Remark 2.1. The WLLN formalizes our common experience that “things tend to average out in the
long run.” For instance, we might be surprised if we tossed a fair coin $n = 10$ times and observed
$\bar{X}_{10} = 0.9$; however, if we knew that the coin was indeed fair ($p = 0.5$), then we would remain
confident that, as $n$ increased, $\bar{X}_n$ would eventually tend to 0.5.
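As a rough numerical check of Definition 1 and the WLLN for a fair coin, the sketch below estimates the probability that the observed frequency lies within a tolerance of $p = 0.5$ for several sample sizes. It is an illustration only: the function name `prob_mean_within`, the tolerance 0.05, the number of replications, and the seed are all assumptions, not values taken from the lecture.

```python
import random

def prob_mean_within(n, eps, p=0.5, reps=2_000, seed=2):
    """Monte Carlo estimate of P(|sample mean - p| < eps) for n IID Bernoulli(p) draws."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x_bar = sum(rng.random() < p for _ in range(n)) / n  # observed frequency of successes
        if abs(x_bar - p) < eps:
            hits += 1
    return hits / reps

# The estimated probability should move toward 1 as n grows.
for n in (10, 100, 1_000):
    print(f"n = {n:>5}: estimated P(|mean - 0.5| < 0.05) = {prob_mean_within(n, 0.05):.3f}")
```

The estimates are themselves subject to simulation error, so they only suggest, rather than prove, that the probability tends to 1.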
3 The Central Limit Theorem (CLT)
The WLLN states a precise fact: the distribution of values of the sample mean collapses to
the population mean as the sample size increases. However, it leaves several obvious questions
unanswered:
1. How rapidly does the sample mean tend toward the population mean?
2. How does the shape of the sample mean’s distribution change as the sample mean tends
toward the population mean?
To answer the above questions, we need to convert the random variables to standardized random
variables. This can be done in the following table:
[Table 1 is about here.]
Notice that standardizing a random variable does not change the shape of its distribution. Let
$$Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}$$
denote this standardized random variable, on which we shall focus our attention.
We begin by observing that $Var(\bar{X}_n - \mu) = (\sigma/\sqrt{n})^2$. The WLLN states that $\bar{X}_n - \mu \xrightarrow{p} 0$, so
the factor $1/\sqrt{n}$ measures how rapidly the sample mean tends toward the population mean.
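The $1/\sqrt{n}$ rate can be checked by simulation. The sketch below is a hedged illustration rather than part of the lecture: it assumes Uniform(0, 1) draws (for which $\sigma^2 = 1/12$), and the function name `sd_of_sample_mean`, the number of replications, and the seed are hypothetical choices.

```python
import math
import random
import statistics

SIGMA = math.sqrt(1 / 12)  # standard deviation of a single Uniform(0, 1) draw

def sd_of_sample_mean(n, reps=5_000, seed=1):
    """Estimate the standard deviation of the mean of n IID Uniform(0, 1) draws."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.random() for _ in range(n)) for _ in range(reps)]
    return statistics.pstdev(means)

# Quadrupling n should roughly halve the standard deviation of the sample mean.
for n in (1, 4, 16, 64):
    print(f"n = {n:>2}: simulated sd = {sd_of_sample_mean(n):.4f}, "
          f"sigma/sqrt(n) = {SIGMA / math.sqrt(n):.4f}")
```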
Now, we can answer the second question mentioned above by studying the behaviour of Zn as
n becomes large. The following theorem is one of the most remarkable and useful results in all of
mathematics. It is fundamental to the study of statistics.
Theorem 3 (CLT). Let $X_1, \ldots, X_n$ be any sequence of IID random variables having finite mean $\mu$
and finite variance $\sigma^2$. Let $F_n$ denote the cdf of $Z_n$, and let $\Phi$ denote the cdf of the standard normal
distribution. Then, for any fixed value $z \in \mathbb{R}$, we have
$$F_n(z) = P(Z_n \le z) \to \Phi(z),$$
as $n \to \infty$.
The CLT states that the behaviour of the standardized average of a large number of IID random variables
will resemble the behaviour of a standard normal random variable. This is true regardless of
the distribution of the random variables that are being averaged, provided that the mean and variance are finite. Thus, the CLT allows us to
approximate a variety of probability distributions with the standard normal distribution. A common rule of thumb regards this
approximation as sufficiently precise if $n > 30$, although strongly skewed distributions may require larger samples.
Now, I present some simulation studies to assess the accuracy of the CLT. I assume that
$X_1, \ldots, X_n$ are IID $\chi^2(2)$ with mean $\mu = 2$ and variance $\sigma^2 = 4$. If I draw $n$ values from this
distribution 25,000 times, compute the mean and the standard deviation of these 25,000 draws, and
plot a histogram of the results, I obtain the following graphs:
[Figure 2 is about here.]
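A simulation in the spirit of the one described above can be sketched in Python. This is not the code used to produce Figure 2: the function names `standardized_means` and `empirical_cdf`, the sample sizes, and the seed are assumptions, and the sketch compares the empirical cdf of $Z_n$ with $\Phi$ at a few points instead of drawing histograms.

```python
import math
import random
from statistics import NormalDist

MU, SIGMA = 2.0, 2.0  # mean and standard deviation of the chi-square(2) distribution

def standardized_means(n, reps=25_000, seed=3):
    """Return `reps` simulated values of Z_n = (sample mean - MU) / (SIGMA / sqrt(n))."""
    rng = random.Random(seed)
    z_values = []
    for _ in range(reps):
        # A chi-square(2) draw is the sum of two squared standard normal draws.
        total = sum(rng.gauss(0, 1) ** 2 + rng.gauss(0, 1) ** 2 for _ in range(n))
        z_values.append((total / n - MU) / (SIGMA / math.sqrt(n)))
    return z_values

def empirical_cdf(values, z):
    """Fraction of simulated values that are less than or equal to z."""
    return sum(v <= z for v in values) / len(values)

# Compare the simulated F_n(z) with Phi(z) at a few points for growing n.
for n in (1, 5, 30):
    zs = standardized_means(n)
    for z in (-1.0, 0.0, 1.0):
        print(f"n = {n:>2}, z = {z:+.1f}: F_n(z) = {empirical_cdf(zs, z):.3f}, "
              f"Phi(z) = {NormalDist().cdf(z):.3f}")
```

For small $n$ the skewness of the $\chi^2(2)$ distribution should be visible as a gap between the two columns; the gap is expected to narrow as $n$ grows.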
In the case where $X_1, \ldots, X_n$ are Bernoulli trials with success prob. $E[X_i] = p$ and $Var(X_i) = p(1 - p)$, the count of successes, $\sum_{i=1}^{n} X_i = n\bar{X}_n$, satisfies $E[n\bar{X}_n] = np$ and $Var(n\bar{X}_n) = np(1 - p)$, and
$$\frac{n\bar{X}_n - np}{\sqrt{np(1 - p)}}$$
is approximately distributed as $N(0, 1)$ as $n \to \infty$. This result was first established by De Moivre
in the 1730s.
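To see how close the normal approximation to a Bernoulli count can be, the sketch below compares an exact binomial probability with its CLT approximation. It is an illustration under stated assumptions: the function names, the choice $p = 0.5$, and the evaluation point (about one standard deviation above the mean) are hypothetical, and no continuity correction is applied.

```python
import math
from statistics import NormalDist

def binomial_cdf(k, n, p):
    """Exact P(S <= k) for a count S of successes in n Bernoulli(p) trials."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))

def normal_approx_cdf(k, n, p):
    """CLT approximation to P(S <= k), without a continuity correction."""
    z = (k - n * p) / math.sqrt(n * p * (1 - p))
    return NormalDist().cdf(z)

p = 0.5
for n in (10, 100, 1_000):
    k = int(n * p + math.sqrt(n * p * (1 - p)))  # roughly one sd above the mean
    print(f"n = {n:>4}, k = {k:>3}: exact = {binomial_cdf(k, n, p):.4f}, "
          f"normal approx = {normal_approx_cdf(k, n, p):.4f}")
```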
Example 1. My friend, John, is attempting to have formal dates. He is very confident that his
success probability is 50%. If he replicates his dating experience 36 times, then what is the probability
that his sample success probability will fall within 0.1 of the true success prob.?
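Answer: Let $X_i$ denote the dating result obtained from replication $i$, for $i = 1, \ldots, 36$. This
is a Bernoulli random variable. Its expectation is $E[X_i] = \mu = 0.5$, and its variance is $Var(X_i) = 0.5(1 - 0.5) = 0.5^2$, so the standard deviation of $\bar{X}_{36}$ is $0.5/\sqrt{36} = 0.5/6$.
Let $Z \sim N(0, 1)$. Then, applying the CLT,
$$\begin{aligned}
P(\mu - 0.1 < \bar{X}_{36} < \mu + 0.1) &= P\left(-\frac{0.1}{0.5/6} < \frac{\bar{X}_{36} - \mu}{0.5/6} < \frac{0.1}{0.5/6}\right) \\
&\approx P(-1.2 < Z < 1.2) \\
&= \Phi(1.2) - \Phi(-1.2) = 0.7698.
\end{aligned}$$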
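The arithmetic in Example 1 can be reproduced with Python's standard library. The sketch below is only a check of the calculation; the variable names are assumptions, and the exact binomial probability is an extra comparison that does not appear in the lecture.

```python
import math
from statistics import NormalDist

p, n, tol = 0.5, 36, 0.1

se = math.sqrt(p * (1 - p) / n)  # standard deviation of the sample mean, 0.5 / 6
z = tol / se                     # standardized tolerance, 1.2 up to rounding
clt_prob = NormalDist().cdf(z) - NormalDist().cdf(-z)

# Extra check (not in the lecture): exact probability from the binomial distribution,
# summing over the success counts k whose frequency k / n lies within tol of p.
exact_prob = sum(
    math.comb(n, k) * p**k * (1 - p) ** (n - k)
    for k in range(n + 1)
    if abs(k / n - p) < tol
)

print(f"z = {z:.4f}")
print(f"CLT approximation: {clt_prob:.4f}")
print(f"Exact binomial:    {exact_prob:.4f}")
```

Any gap between the two printed probabilities reflects the fact that the CLT is an approximation at $n = 36$.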
Example 2. Suppose that John will try to replicate his dating experience an additional 10 times.
What is the prob. that his sample success prob. with respect to 36 dates will fall within 0.1 of that
with respect to 46 dates?
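Answer: Note that, from Theorem 3, we have $\bar{X}_n \approx N(\mu, \sigma^2/n)$. Thus, treating $\bar{X}_{36}$ and $\bar{X}_{46}$ as independent (an approximation, since the 46 dates include the first 36), it is straightforward to
obtain $\bar{X}_{36} - \bar{X}_{46} \approx N(0, 0.111262^2)$, where $0.111262^2 = 0.25/36 + 0.25/46$. Standardizing, it follows that
$$\begin{aligned}
P(-0.1 < \bar{X}_{36} - \bar{X}_{46} < 0.1) &= P\left(-\frac{0.1}{0.111262} < \frac{\bar{X}_{36} - \bar{X}_{46}}{0.111262} < \frac{0.1}{0.111262}\right) \\
&\approx P(-0.89878 < Z < 0.89878) \\
&= \Phi(0.89878) - \Phi(-0.89878).
\end{aligned}$$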
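The final expression in Example 2 can be evaluated numerically. The sketch below first follows the calculation above, which adds the two variances as if the sample means were independent. As a hedged aside that is not in the lecture, it also evaluates the case in which the 46 dates contain the original 36, where the covariance between the two means gives $Var(\bar{X}_{36} - \bar{X}_{46}) = \sigma^2/36 - \sigma^2/46$. All names in the code are assumptions.

```python
import math
from statistics import NormalDist

sigma2 = 0.25          # variance of one Bernoulli(0.5) outcome
n1, n2, tol = 36, 46, 0.1

def two_sided_normal_prob(tol, sd):
    """P(-tol < W < tol) for W ~ N(0, sd^2)."""
    z = tol / sd
    return NormalDist().cdf(z) - NormalDist().cdf(-z)

# Calculation as in the example: variances added as if the two means were independent.
sd_indep = math.sqrt(sigma2 / n1 + sigma2 / n2)
prob_indep = two_sided_normal_prob(tol, sd_indep)

# Aside (assumption: the 46 dates include the first 36, so the two means overlap).
sd_overlap = math.sqrt(sigma2 / n1 - sigma2 / n2)
prob_overlap = two_sided_normal_prob(tol, sd_overlap)

print(f"independent means: sd = {sd_indep:.6f}, probability = {prob_indep:.4f}")
print(f"overlapping means: sd = {sd_overlap:.6f}, probability = {prob_overlap:.4f}")
```

Which of the two calculations is appropriate depends on how the two sets of dates are related, so the second line should be read as a sensitivity check rather than a correction.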
I conclude this section with a warning. Statisticians usually apply the CLT in order to approximate the distribution of a sum or an average of random variables, $X_i$, that are observed in an
experiment. These random variables need not be normally distributed themselves; indeed, the
glamour of the CLT is that it does not assume the normality of $X_i$.
4 Exercises
1. Suppose that I toss a fair coin 100 times and observe 60 Heads. Now, I decide to toss the
same coin another 100 times. Does the WLLN or the Law of Averages imply that I should
expect to observe another 40 Heads?
2. Suppose that a die has the following probabilities of producing the 4 possible uppermost
faces: $P(1) = P(6) = 0.1$, $P(3) = P(4) = 0.4$. This die is to be thrown 100 times. Let $X_i$
denote the value of the uppermost face that results from throw $i$.
(a) Compute the expected value and the variance of Xi .
(b) Compute the prob. that the average value of the 100 throws will exceed 3.6.
3. It has been found that 2% of the tools produced by a certain machine are defective. What is
the probability that in a shipment of 400 such tools (a) 4% or more and (b) 2% or less will
be defective?
4. A financial theory posits that daily fluctuations in stock prices are independent random variables. Suppose that the daily price fluctuations of a certain blue-chip stock¹ are IID random
variables X1, X2, ..., with E[Xi] = 0.01 and Var(Xi) = 0.01. (Thus, if today’s price of this
stock is $50, then tomorrow’s price is $50 + X1, etc.) Suppose that the daily price fluctuations of a certain internet stock are IID random variables Y1, Y2, ..., with E[Yj] = 0 and
Var(Yj) = 0.25.
Now suppose that both stocks are currently selling for $50/share and you wish to invest $50
in one of these two stocks for a period of 400 days. Assume that the costs of purchasing and
selling a share of either stock are zero.
(a) Approximate the prob. that you will make a profit on your investment if you purchase
a share of the blue-chip stock.
(b) Approximate the prob. that you will make a profit on your investment if you purchase
a share of the internet stock.
(c) Approximate the prob. that you will make a profit of at least $20 if you purchase a share
of the blue-chip stock.
(d) Approximate the prob. that you will make a profit of at least $20 if you purchase a share
of the internet stock.
(e) Assuming that the internet stock fluctuations and the blue-chip stock fluctuations are
independent, approximate the prob. that, after 400 days, the price of the internet stock
will exceed the price of the blue-chip stock.
¹ Please check the link: http://en.wikipedia.org/wiki/Blue_chip_(stock_market) if you do not know about
blue-chip stocks.
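Table 1: Standardizing random variables
random variable          expected value    standard deviation    standardized random variable
$X_i$                    $\mu$             $\sigma$              $(X_i - \mu)/\sigma$
$\sum_{i=1}^{n} X_i$     $n\mu$            $\sqrt{n}\,\sigma$    $(\sum_{i=1}^{n} X_i - n\mu)/(\sqrt{n}\,\sigma)$
$\bar{X}_n$              $\mu$             $\sigma/\sqrt{n}$     $(\bar{X}_n - \mu)/(\sigma/\sqrt{n})$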
Figure 1: An example of convergence in prob.
Figure 2: An illustration of the CLT. Note that the shapes approach the bell shape.