Chapter 1 - De Anza College

MATH 10: Elementary Statistics and Probability
Chapter 1: Sampling and Data
Tony Pourmohamad
Department of Mathematics
De Anza College
Spring 2015
Introduction
Population and Sample
Data Types
What is Statistics?
Statistics
The collection of methods for planning experiments, obtaining data
and then organizing, summarizing, analyzing, interpreting, presenting
and drawing conclusions based on data.
• Statistics is the study of data!
• Statistics helps us answer questions that arise in several fields:
. Ecology: How many animals of a given species live in a particular
area?
. Health: Is a new drug effective against a disease? Is drug A more
effective than drug B?
. Environmental Sciences: Weather forecasts and prediction of
extreme environmental patterns.
Steps in Learning From Data
We will learn about:
• Designing the data collection process (experimental design)
• Preparing and analyzing the data (descriptive statistics, models,
hypothesis testing, prediction)
• Reporting conclusions
We will study:
• Probability: Theoretical aspects of randomness! What is the
probability that I select an ace of spades out of a deck of cards?
• Inference: Learning from data, taking randomness into account
using probability! Given the heights of the people in this room,
what is the average height of ALL statistics college students?
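The ace-of-spades question can be checked directly. The sketch below builds a standard 52-card deck, computes the exact probability, and then estimates it by simulation. The rank and suit labels, the seed, and the number of trials are illustrative choices, not part of the slides.

```python
import random
from fractions import Fraction

# Build a standard 52-card deck (labels are illustrative).
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = [(rank, suit) for suit in suits for rank in ranks]

# Exact probability: one favorable card out of 52 equally likely cards.
exact = Fraction(deck.count(("A", "spades")), len(deck))

# Simulation: draw one card at random many times and count the hits.
rng = random.Random(0)  # fixed seed, only so the run is repeatable
trials = 100_000
hits = sum(rng.choice(deck) == ("A", "spades") for _ in range(trials))
estimate = hits / trials

print(exact)     # 1/52
print(estimate)  # should land near 1/52, roughly 0.019
```

The exact value is probability; using the simulated draws to say something about the deck would be inference.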
Example: Nitrogen Fertilizers
• Study of the effects of nitrogen fertilizer in wheat production: 15
fields and 5 nitrogen fertilizers. Three fields were randomly assigned
to each of the nitrogen fertilizers. The same variety of wheat was
planted in all fields and they were cultivated in the same manner.
The number of pounds of wheat per acre was recorded.
• Goal: determine the optimal level of nitrogen to apply to any
wheat field.
• After determining the amount of nitrogen that yielded the largest
production of wheat, the farmer concluded that similar results
would hold for fields with the same characteristics as the ones in
the study.
• Is this conclusion justified?
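The random assignment described in this study (three of the 15 fields to each of the 5 fertilizers) could be carried out as in the following sketch. The field and fertilizer labels and the seed are hypothetical; the slides do not say how the assignment was actually made.

```python
import random

fields = [f"field_{i}" for i in range(1, 16)]   # 15 fields (hypothetical labels)
fertilizers = ["N1", "N2", "N3", "N4", "N5"]     # 5 fertilizers (hypothetical labels)

rng = random.Random(2015)  # fixed seed, only so the example is repeatable
shuffled = fields[:]
rng.shuffle(shuffled)

# Deal the shuffled fields out in blocks of three, one block per fertilizer.
assignment = {
    fert: sorted(shuffled[3 * i : 3 * i + 3])
    for i, fert in enumerate(fertilizers)
}

for fert, assigned in assignment.items():
    print(fert, assigned)
```

Because chance alone decides which field gets which fertilizer, differences between fields are not systematically tied to any one treatment.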
Population and Sample
• In statistics, we generally want to study a population.
Population
You can think of a population as a collection of persons, things, or
objects under study.
. How do we study the population?
• We study a population by selecting a sample.
Sample
A sample is a portion (or subset) of members from the population.
Normally collected using a random method.
Parameters and Statistics
Parameter
A parameter is a number that is a descriptive measurement of the
population.
• Typically parameters are unknown. Why is that?
Statistic
A statistic is a number that is learned/observed/computed from a
sample.
Example: Parameters and Statistics
• Examples of a parameter:
. Proportion: In a recent quarter, 39% of all De Anza College
students were over age 25.
. Average: In a recent quarter, the average age of all De Anza
College students was 27.1 years.
. Median: In a recent quarter, half of all De Anza College students
were 22 years old or younger.
• Examples of a statistic:
. Proportion: 41% of a random sample of 200 students were over
age 25.
. Average: The average age of a random sample of 200 De Anza
College students was 28.3 years.
. Median: In a sample of 200 De Anza College students, 50% of the
students were 22 years old or younger.
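To make the distinction concrete, the sketch below computes the same three summaries twice: once on a whole population (parameters) and once on a random sample of 200 (statistics). The population is made up for illustration; it is not De Anza College data, and the helper name `summarize` is hypothetical.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population: ages of 20,000 students (made-up numbers,
# not De Anza College records).
population = [rng.randint(17, 60) for _ in range(20_000)]

# A simple random sample of 200 students, as in the examples above.
sample = rng.sample(population, 200)

def summarize(ages):
    """Return the proportion over 25, the mean age, and the median age."""
    return {
        "prop_over_25": sum(age > 25 for age in ages) / len(ages),
        "mean": statistics.mean(ages),
        "median": statistics.median(ages),
    }

parameters = summarize(population)  # describes the whole population
stats = summarize(sample)           # describes only the sample

print("parameters:", parameters)
print("statistics:", stats)
```

In practice the first call is usually impossible, because the full population is not available; only the second can be computed.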
Questions to Ask Yourself
• Does a sample statistic always have the same numerical value as
the population parameter? Why or why not?
• Are sample statistics equal for all samples taken from the same
population? Why or why not?
• Does a sample statistic always have a different numerical value from
the population parameter?
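One way to explore these questions is to draw several samples from the same population and compare the results. The sketch below does this for the mean; the population values, the sample size, and the seed are all assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical population of 10,000 ages (made-up values).
population = [rng.randint(17, 60) for _ in range(10_000)]
parameter = statistics.mean(population)

# Five independent random samples of 200, each giving its own statistic.
sample_means = [
    statistics.mean(rng.sample(population, 200)) for _ in range(5)
]

print("population mean:", round(parameter, 2))
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i} mean:", round(m, 2))
```

Changing the seed changes the samples, so the printed sample means change too, while the population mean stays fixed.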
Variables and Data
Variables
A variable is the description in words of the characteristic of interest.
• A variable will be a sentence, not a number.
Data
The data are the information collected about the variable for
individuals in the population or sample.
• The variable is a sentence explaining the question you are asking
in order to obtain information.
• The data are the information obtained as answers to the question.
Example #1
Suppose we are studying the commutes of De Anza College students to
school from home.
• What is the population?
• What is one possible example of a sample?
• Some examples of variables and data:
. Variable: "the distance that a student commutes to De Anza
College"
. Data: 2.5 miles, 8.4 miles, 0.25 miles, 52 miles, ...
or
. Variable: "how a De Anza student commutes to school"
. Data: car, car, bus, bike, walk, car, bus, bus, car, bike, car
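The two variables above produce different kinds of data, and that affects what can be computed from them. A minimal sketch, using the four distances listed in the slide and reading the commute list with "bike" and "walk" as separate answers (an assumption about the original slide):

```python
from collections import Counter
from statistics import mean

# Variable: "the distance that a student commutes to De Anza College"
# Data: the four values listed in the slide, in miles.
distances_miles = [2.5, 8.4, 0.25, 52]

# Variable: "how a De Anza student commutes to school"
# Data: the answers from the slide, with bike and walk taken as two answers.
modes = ["car", "car", "bus", "bike", "walk", "car",
         "bus", "bus", "car", "bike", "car"]

average_distance = mean(distances_miles)  # numbers can be averaged
mode_counts = Counter(modes)              # categories can only be tallied

print(average_distance)
print(mode_counts.most_common())
```

Averaging the commute modes would be meaningless, which is why the second variable is summarized by counts.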
Data Types -- Quantitative Data
• Quantitative Data: Consist of numbers representing counts or
measurements.
• There are two types of quantitative data:
. Discrete data: the number of possible values is either finite or
countable. E.g., the number of animals of a given species.
. Continuous data: result from infinitely many possible values on a
continuous scale, not restricted to certain specified values (such as
integers). E.g., height.
Data Types -- Qualitative Data
• Qualitative Data: Also called categorical or attribute data. These
can be separated into different categories that are distinguished
by some non-numeric characteristics.
• There are two types of qualitative data:
. Nominal data: Data that consist of names, labels or categories
only. These data cannot be arranged in an ordering scheme. E.g.,
blood type.
. Ordinal data: Data that can be arranged in some order but
differences between data values either cannot be determined or
are meaningless. E.g., shirt size.
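The practical difference between nominal and ordinal data shows up in what can be done with them. In the sketch below, shirt sizes can be sorted and have a middle value once their order is spelled out, while blood types can only be counted. The size labels and all observations are made up for illustration.

```python
from collections import Counter

# Ordinal data: shirt sizes have a natural order, made explicit here.
SIZE_ORDER = ["S", "M", "L", "XL"]            # hypothetical size labels
sizes = ["M", "S", "XL", "M", "L", "S", "M"]  # made-up observations

rank = {size: i for i, size in enumerate(SIZE_ORDER)}
ordered = sorted(sizes, key=rank.__getitem__)
median_size = ordered[len(ordered) // 2]      # a middle value exists for ordinal data

# Nominal data: blood types have no order, so only counts are meaningful.
blood_types = ["O", "A", "B", "O", "AB", "A", "O"]  # made-up observations
counts = Counter(blood_types)
most_common_type = counts.most_common(1)[0][0]

print(ordered)           # ['S', 'S', 'M', 'M', 'M', 'L', 'XL']
print(median_size)       # M
print(most_common_type)  # O
```

Note that the code never subtracts one size from another: the order is meaningful, the differences are not.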
Examples
Consider causes of death in the US in 1992. Below is a list of causes and
the number of lives that each one claimed. We ordered the causes
and assigned consecutive integers.

Rank  Cause                       Total
1     Heart Diseases              717,706
2     Malignant neoplasms         520,578
3     Cerebrovascular Diseases    143,769
4     Pulmonary Diseases           91,938
5     Accidents                    86,777
6     Pneumonia and influenza      75,719
7     Diabetes                     50,067
8     HIV                          33,566
9     Suicide                      30,484
10    Homicide                     25,488
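Each column of this list holds a different type of data. The sketch below stores the counts listed above and prints each cause's share of the ten causes shown; the shares are relative to these ten causes only, not to all deaths in 1992.

```python
# Counts copied from the list above (US, 1992).
deaths = {
    "Heart Diseases": 717_706,
    "Malignant neoplasms": 520_578,
    "Cerebrovascular Diseases": 143_769,
    "Pulmonary Diseases": 91_938,
    "Accidents": 86_777,
    "Pneumonia and influenza": 75_719,
    "Diabetes": 50_067,
    "HIV": 33_566,
    "Suicide": 30_484,
    "Homicide": 25_488,
}

total_listed = sum(deaths.values())

# Rank is ordinal, cause is nominal, and the count is discrete quantitative.
ranked = sorted(deaths.items(), key=lambda item: item[1], reverse=True)
for rank, (cause, count) in enumerate(ranked, start=1):
    share = count / total_listed
    print(f"{rank:>2}  {cause:<26}{count:>9,}  {share:6.1%}")
```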
Examples
• Discrete data examples:
. The number of new cases of breast cancer reported yearly from
1995 to 2002.
. The number of cows in a field
. The number of students in this room
• Continuous data examples:
. Temperature
. Age
. Weight
Examples
• Nominal data examples:
. Colors: blue, red, green, yellow, etc.
. Cars: Toyota, Honda, Subaru, Ferrari, etc.
. Feelings: Sad, happy, mad, etc.
• Ordinal data examples:
. Letter Grades: A, B, C, D, F
. Weight: Small, medium, large
. Speed: Slow, average, fast
Data Gathering
There are many ways to gather data, such as:
• Census: Collection of data from every member of the population.
• Sampling: Collecting data from a sub-collection of members of the
population. Normally collected using a random method.
• Observational Study: Collect data with NO CONTROL over possible
affecting factors.
• Designed Experiment: Data are collected by means of an
experiment where the most important factors are subject to
control.
• Examples?
Designed Experiments vs. Observational Studies
In a New York Times article about hormone therapy for women, a
reporter wrote that "researchers say observational studies painted a
falsely rosy picture of hormone replacement because women who opt
for the treatments are healthier and have better habits to begin with
than women who do not."
Designed Experiments
• Treatments: Different values or components of the explanatory
variable applied in an experiment.
• Response: The dependent variable in an experiment; the value
measured for change at the end of an experiment.
• Control Group: A group that receives an inactive treatment but is
otherwise managed exactly the same as the other groups.
• Placebo: An inactive treatment that has no real effect on the
response variable.
• Blinding: Not telling participants what treatment a subject is
receiving.
• Double Blind: The act of blinding both the subjects of an
experiment and the researchers who work with the subjects.
Big Problem of Observational Studies
Confounding
Confounding occurs when the effects of variables are mixed such that
the individual effects are indeterminable. When the effects of multiple
factors on a response cannot be separated, it becomes difficult or
impossible to draw valid conclusions about the effect of each factor.
• Example: If only the people in a particular age group are given a
particular drug, the drug may look effective/ineffective.
• Designed experiments can be constructed to avoid confounding
variables.
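The age-group example can be simulated. In the sketch below the drug does nothing at all: recovery depends only on age. The recovery chances (0.8 for young, 0.4 for old), the group sizes, and the function names are all hypothetical, chosen only to make the effect visible.

```python
import random

rng = random.Random(3)

def recovers(age_group):
    """Hypothetical model: recovery depends on age only; the drug does nothing."""
    chance = 0.8 if age_group == "young" else 0.4
    return rng.random() < chance

people = ["young"] * 1000 + ["old"] * 1000

def recovery_rates(assign):
    """Return the recovery rate of the treated (True) and untreated (False) groups."""
    outcomes = {True: [], False: []}
    for age_group in people:
        outcomes[assign(age_group)].append(recovers(age_group))
    return {treated: sum(r) / len(r) for treated, r in outcomes.items()}

# Only the young get the drug, so age and drug are confounded.
confounded = recovery_rates(lambda age_group: age_group == "young")

# Designed experiment: a coin flip decides who gets the drug, regardless of age.
randomized = recovery_rates(lambda age_group: rng.random() < 0.5)

print("confounded:", confounded)  # the drug looks effective although it does nothing
print("randomized:", randomized)  # the two groups recover at similar rates
```

In the confounded comparison there is no way to tell the effect of the drug from the effect of age; random assignment balances age across the two groups.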
Other Problems in Data Gathering
• Problems with Samples:
. A sample should be representative of the population.
. A sample that is not representative of the population is called
"biased".
. Non-response or refusal of subject to participate OR Self-Selected
Samples.
. Sample Size Issues.
• Collecting data or asking questions in a way that influences the
response.
• Causality: A relationship between two variables does not
necessarily imply that one causes the other. They may both be
affected by some other variable.
• Self-Funded or Self-Interest Studies
• Misleading Use of Data: improperly displayed graphs, incomplete
data, lack of context.
Example #2
• Study I: Employees of a company are randomly divided into two
groups. Group A gets classroom training from an instructor who is
available to help and answer questions; Group B gets training via
online software with an online discussion board available to get
help and answers to questions.
• Study II: Researchers are studying whether retirement age affects
the rate of memory problems in senior citizens. A survey of retired
senior citizens showed that those who had retired earlier tended to
have a higher incidence of memory problems after retirement
than those who had retired at an older age.
1. For each of the above, what type of study is it?
2. What problem can you see in Study II?
Example #2 Continued
• Study III: 300 randomly selected individuals are asked if they had
been on a diet in the last 8 weeks and how much their weight has
changed over the last 8 weeks. Weight changes for dieters and
non-dieters are compared.
• Study IV: 100 individuals are put on a low fat diet, 100 on a low
carb diet and 100 eat their normal diet. Their weight change over
an 8 week period is recorded.
1. Which weight loss study (III or IV) do you think would give the best
information about the effect of diet on weight loss? Why?
Example #3
A large city is proposing a parcel tax to support education. Each
property owner would be assessed a tax of $100 per property per year.
The parcel tax will be voted on by voters in the next election. It will pass
if 2/3 of the voters vote in favor of the tax.
• I. A group of parents and teachers supporting the parcel tax
randomly select and call residents in the city. They identify
themselves as members of the Parent Teachers Association for the
school system and ask the person who answers the telephone call
if they support the parcel tax.
• II. A TV news station in the city conducts a "Facebook" survey.
Viewers are asked whether they favor or oppose the tax and are
given instructions to visit the TV station’s Facebook page to respond
about their opinion. The poll is publicized and responses are
solicited by announcements on the TV station’s evening news.
Example #3 Continued
• III. A professional polling organization conducts a survey by
randomly calling selected residents in the city. If the resident is a
registered voter, he or she is asked his or her opinion about the
proposed parcel tax. They are asked whether they favor the tax,
oppose the tax, or have no opinion. These three choices are
presented to the individual in random order, so that not all
respondents hear the choices in the same order.
1. Which survey do you think would produce the most accurate
prediction of the election results?
2. For each of the other two surveys, what problems do you think
there might be with the information obtained?
Randomization and Sampling
• Simple random sample of size n: All samples of n members from
the population have the same chance of being chosen.
• Systematic Sampling: We randomly select a starting point and then
select every k-th element of the population.
• Convenience Sampling: Collect results that are convenient to get.
• Stratified Sampling: The population is subdivided into at least two
different subgroups (strata) whose members share a characteristic
(e.g., age bracket), and then a sample is drawn from each subgroup.
• Cluster Sampling: Divide the population into sections (or clusters),
then randomly select some of those clusters, and then choose all
the members from those clusters.
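Four of the methods above can be sketched in a few lines each (convenience sampling involves no randomization, so it is left out). The population, its field names, and the function names are hypothetical; the systematic version assumes a random start among the first k members followed by every k-th member.

```python
import random

rng = random.Random(42)

# Hypothetical population: 100 people, each with an ID, an age bracket
# (used as a stratum) and a room number (used as a cluster).
population = [
    {"id": i, "bracket": ["18-25", "26-40", "41+"][i % 3], "room": i // 10}
    for i in range(100)
]

def simple_random_sample(pop, n):
    """Every set of n members is equally likely to be chosen."""
    return rng.sample(pop, n)

def systematic_sample(pop, k):
    """Random starting point among the first k members, then every k-th member."""
    start = rng.randrange(k)
    return pop[start::k]

def stratified_sample(pop, key, n_per_stratum):
    """Draw a simple random sample from each stratum."""
    strata = {}
    for member in pop:
        strata.setdefault(member[key], []).append(member)
    return [m for group in strata.values() for m in rng.sample(group, n_per_stratum)]

def cluster_sample(pop, key, n_clusters):
    """Randomly choose whole clusters and keep every member of them."""
    clusters = sorted({member[key] for member in pop})
    chosen = set(rng.sample(clusters, n_clusters))
    return [m for m in pop if m[key] in chosen]

srs = simple_random_sample(population, 10)
systematic = systematic_sample(population, 10)
stratified = stratified_sample(population, "bracket", 5)
clustered = cluster_sample(population, "room", 2)

print([m["id"] for m in systematic])
```

Stratified sampling takes a few members from every group, while cluster sampling takes every member from a few groups.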
Example #4
Determine the type of sampling method used:
1. To form a recreational soccer team, a soccer coach randomly
selects 6 players from a group of boys ages 8 to 10, 7 players from
a group of boys ages 11 to 12, and 3 players from a group of boys
ages 13 to 14.
2. For a survey of human resource (HR) personnel at high tech
companies, a pollster interviews all HR personnel in each of 5
randomly selected high tech companies.
3. In a survey of engineering salaries, a researcher selects engineers
to interview by randomly selecting 50 women engineers and
randomly selecting 50 men engineers.
4. A medical researcher for a hospital interviews every third cancer
patient from a list of cancer patients at that local hospital.
Example #4 Continued
Determine the type of sampling method used:
1. A high school counselor uses a computer to generate 50 random
numbers and then selects students whose names correspond to
the numbers.
2. A student interviews classmates in his algebra class to determine
how many pairs of jeans a student owns, on average.
3. In a study to learn what types of after school child care are used in
their district, a school district administrator randomly selects 6
classes at each school and surveys all parents with children in the
selected classes.
Remember...
Data may be useless if not collected in an appropriate way.