A DATA

DATA ANALYSIS – CORE MATERIAL
When information for a statistical investigation is collected and recorded, the information is
referred to as data.
WHAT IS A STATISTICAL INVESTIGATION?
There are four processes involved in a statistical investigation:
Collection of data (information)
Data for a statistical investigation can be collected from records, from surveys (either
face-to-face, telephone, or postal), by direct observation or by measuring or counting. Unless the
correct data is collected, valid conclusions cannot be made.
Organisation and display of data
Data can be organised into tables and displayed on a graph. This allows us to identify features
of the data more easily.
Calculation of descriptive statistics
Some statistics used to describe a set of data are the centre and the spread of the data. These
give us a picture of the sample or population under investigation.
Interpretation of statistics
This process involves explaining the meaning of the table, graph or descriptive statistics in
terms of the variable, or theory, being investigated.
COLLECTION OF DATA
The variable is the subject that we are investigating.
The entire group of objects from which information is required is called the population.
Gathering statistical information properly is vitally important. If gathered incorrectly then any
resulting analysis of the data would almost certainly lead to incorrect conclusions about the
population.
The gathering of statistical data may take the form of:
• a census, where information is collected from the whole population, or
• a survey, where information is collected from a much smaller group of the
  population, called a sample.
For example:
• The Australian Bureau of Statistics conducts a census of the whole population of
  Australia every five years.
• In opinion polls before an election, a survey is conducted to see which way a
  sample of the population will vote.
• The students in a school are to vote for a new school captain. If 20 students from
  the school are asked how they will vote, then the population is all the students
  who attend the school, and the 20 students are a sample.
Chapter 1
UNIVARIATE DATA
When taking a sample it is hoped that the information gathered is representative of the entire
population.
For accurate information when sampling, it is essential that:
• the number of individuals in the sample is large enough
• the individuals involved in the survey are randomly chosen from the
  population. This means that every member of the population has an equal
  chance of being chosen.
If the individuals are not randomly chosen or the sample is too small, the data collected may
be biased towards a particular outcome.
For example:
If the purpose of a survey is to investigate how the population of Melbourne will vote at the
next election, then surveying the residents of only one suburb would not provide information
that represents all of Melbourne.
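The idea of a randomly chosen sample can be sketched in a few lines of Python. This is only an illustration: the school size of 500 students and the use of roll numbers to identify students are assumptions, not figures from the text.

```python
import random

# Hypothetical population: 500 students identified by roll numbers 1 to 500.
# (The school size is an assumption made for this sketch.)
population = list(range(1, 501))

# A fixed seed only makes the illustration repeatable.
rng = random.Random(1)

# random.sample draws without replacement, so no student is chosen twice
# and every student has the same chance of being included.
sample = rng.sample(population, 20)

print(sorted(sample))
```

Because every roll number is equally likely to be drawn, the sample is not tied to one class or year level, which is the kind of bias the Melbourne suburb example warns about.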
TYPES OF DATA
Data are individual observations of a variable. A variable is a quantity that can have a value
recorded for it or to which we can assign an attribute or quality.
Two types of variable that we commonly deal with are categorical variables and numerical
variables.
CATEGORICAL VARIABLES
A quality or category is recorded for this type of variable. The information collected is
called categorical data.
Examples of categorical variables and their possible categories include:
Colour of eyes: blue, brown, hazel, green and violet
Continent of birth: Europe, Asia, North America, South America, Africa, Australia and Antarctica
Gender: male or female
Type of car: General Motors, Toyota, Ford, Mazda, BMW, Subaru, etc.
NUMERICAL VARIABLES
A number is recorded for this type of variable. The information collected is called numerical
data.
There are two types of numerical variables:
Discrete numerical variables
A discrete variable can only take distinct values and these values are often obtained by
counting.
Examples of discrete numerical variables and their possible values include:
The number of children in a family: 0, 1, 2, 3, ...
The score on a test, out of 30 marks: 0, 1, 2, ..., 29, 30.
Continuous numerical variables
A continuous numerical variable can theoretically take any value on a part of the number line.
Its value often has to be measured.
Examples of continuous numerical variables and their possible values include:
The height of Year 9 students: any value from about 120 cm to 200 cm
The speed of cars on a stretch of highway: any value from 0 km/h to the fastest speed that a car
can travel, but most likely in the range 30 km/h to 120 km/h
The weight of newborn babies: any value from 0 kg to 10 kg but most likely in the range
0.5 kg to 5 kg
The time taken to run 100 m: any value from 9 seconds to 30 seconds.
EXERCISE 1A
1 40 students, from a school with 820 students, are randomly selected to complete a survey
on their school uniform. In this situation:
a what is the population size
b what is the size of the sample?
2 A television station is conducting a viewer telephone-into-the-station poll on the question ‘Should Australia become a republic?’
a What is the population being surveyed in this situation?
b How is the data biased if it is used to represent the views of all Australians?
3 A polling agency is employed to survey the voting intention of residents of a particular
electorate in the next election. From the data collected they are to predict the election
result in that electorate.
Explain why each of the following situations would produce a biased sample.
a A random selection of people in the local large shopping complex is surveyed
between 1 pm and 3 pm on a weekday.
b All the members of the local golf club are surveyed.
c A random sample of people on the local train station between 7 am and 9 am are
surveyed.
d A doorknock is undertaken, surveying every voter in a particular street.
4 Classify the following data as categorical, discrete numerical or continuous numerical:
a the quantity of soil in a particular size of potplant
b the number of pages in a daily newspaper
c the number of cousins a person has
d the speed of cars on a particular stretch of highway
e the state of Australia where a person was born
f the maximum daily temperature in Melbourne
g the manufacturer of a car
h the preferred football code
i the position taken by a player on a football field
j the time it takes 12-year-olds to run one kilometre
k the length of feet
l the number of goals shot by a netballer
m the amount spent weekly, by an individual, at the supermarket.
5 A sample of public trees in a municipality was surveyed for the following data:
a the diameter of the tree (in centimetres) measured 1 metre above the ground
b the type of tree
c the location of the tree (nature strip, park, reserve, roundabout)
d the height of the tree, in metres
e the time (in months) since the last inspection
f the number of inspections since planting
g the condition of the tree (very good, good, fair, unsatisfactory).
Classify the data collected as categorical, discrete numerical or continuous numerical.
B ORGANISING AND DISPLAYING DATA
CATEGORICAL DATA
Tally and frequency tables are used to organise categorical data and there are several types
of graphs that can be used to effectively display the data.
For example:
A centrally-located school is investigating how their students get to school. This is of interest
to them because of local traffic problems. A sample of 50 students was asked which of the
following five categories they used most.
The results were:
B  B  C  W  Tn    Tn Tn Tm C  C     W  C  C  B  C     C  W  B  B  Tn    Tm C  B  W  Tn
W  W  Tn Tn C     Tm Tn C  C  Tm    B  B  B  B  W     C  C  B  W  C     Tn B  C  B  B
(Tn = train, Tm = tram, B = bus, W = walk, C = private car)
The variable ‘mode of transport to school’ is a categorical variable.
We can organise the data using a tally and frequency table.
One stroke for each data value is recorded in the tally column. A group of four strokes with a
fifth stroke drawn across it (shown here as ||||/) represents a tally of five.

Mode of transport   Tally                 Frequency
Train               ||||/ ||||            9
Tram                ||||                  4
Bus                 ||||/ ||||/ ||||      14
Walk                ||||/ |||             8
Private car         ||||/ ||||/ ||||/     15
Total                                     50
From the frequency table we can see:
• The most favoured ‘mode of transport’ in the sample was ‘Private car’.
• 9 + 4 + 14 = 27 of the 50 students came by public transport (train, tram, or bus).
• Only 8 of the 50 students (16%) walked to school.
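The tallying step can also be sketched in Python. The responses are the 50 codes listed above; the variable names are illustrative choices, and `collections.Counter` simply counts how often each code occurs.

```python
from collections import Counter

# The 50 survey responses, in the order listed in the text.
responses = (
    "B B C W Tn Tn Tn Tm C C W C C B C C W B B Tn Tm C B W Tn "
    "W W Tn Tn C Tm Tn C C Tm B B B B W C C B W C Tn B C B B"
).split()

names = {"Tn": "Train", "Tm": "Tram", "B": "Bus", "W": "Walk", "C": "Private car"}

# Counter plays the role of the tally column: one count per data value.
counts = Counter(responses)

for code, name in names.items():
    print(f"{name:<12}{counts[code]:>3}")
print(f"{'Total':<12}{sum(counts.values()):>3}")

# Students who came by public transport (train, tram or bus).
public = counts["Tn"] + counts["Tm"] + counts["B"]
print("Public transport:", public)
```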
GRAPHS TO DISPLAY CATEGORICAL DATA
1 A barchart (or column graph) is usually drawn with the categories along the horizontal
axis and the frequency on the vertical axis.
Each bar (or column) is drawn with height equal to the frequency of its category.
The ‘bars’ are equally spaced (not joined together) and are of the same width.
Below is a barchart for the example.
Note: A barchart can also be drawn
with horizontal bars.
[Figure: two barcharts titled ‘Mode of transport to school’. One has vertical bars, with the
categories train, tram, bus, walk and private car on the horizontal axis and frequency (0 to 16)
on the vertical axis. The other shows the same data with horizontal bars.]
2 A segmented barchart is a single ‘bar’ divided into segments so that the length of each
segment is proportional to the frequency.
A percentaged segmented barchart can also be produced.
The percentage for each category is calculated using
$\frac{\text{frequency of category}}{\text{total}} \times 100\%$.
For example, for the traffic data shown previously:
The category with the highest frequency of 15 was Private car.
So, $\frac{15}{50} \times \frac{100}{1} = 30\%$ of the students came by private car.
27 students came by public transport.
So the percentage who came by public transport was $\frac{27}{50} \times \frac{100}{1} = 54\%$.
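The percentage calculation for every category can be sketched as follows. The frequencies are those of the transport frequency table; the running total is an illustrative way of finding where each segment of a percentaged segmented barchart would end, assuming the segments are stacked in the order listed.

```python
# Frequencies from the 'mode of transport' frequency table.
frequencies = {"Train": 9, "Tram": 4, "Bus": 14, "Walk": 8, "Private car": 15}

total = sum(frequencies.values())

# percentage = frequency of category / total x 100
percentages = {mode: 100 * f / total for mode, f in frequencies.items()}

# Running total: the point on the 0% to 100% scale where each segment ends.
running = 0.0
for mode, pct in percentages.items():
    running += pct
    print(f"{mode:<12}{pct:5.1f}%   segment ends at {running:5.1f}%")

public_pct = percentages["Train"] + percentages["Tram"] + percentages["Bus"]
print(f"Public transport: {public_pct:.0f}%")
```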
Following is a segmented barchart and a percentaged segmented barchart for the above
example.
The segments can be labelled, or shaded including a legend.
[Figure: a segmented barchart (frequency axis from 0 to 50) and a percentaged segmented barchart
(% frequency axis from 0% to 100%) for the transport data, with segments for train, tram, bus,
walk and private car.]
EXERCISE 1B.1
1 55 randomly selected year eight students were asked to nominate their favourite subject
studied at school.
The results of the survey are displayed in the barchart alongside.
[Figure: horizontal barchart with the subjects English, Mathematics, Science, Language, History,
Geography, Music and Art on the vertical axis and frequency (0 to 10) on the horizontal axis.]
a Which subject was the most favoured?
b How many students chose Art as their favourite subject?
c What percentage of the students nominated Mathematics as their favourite subject?
d What percentage of the students chose either Music or Art as their favourite subject?
2 A randomly selected sample of adults was asked to nominate the evening television news
service that they watched. The results alongside were obtained:

News Service   Frequency
ABC            40
Channel 7      45
Channel 9      64
Channel 10     25
SBS            23
None           3

a Construct a barchart for this data.
b Use the table and graph to answer the following questions about the data.
i How many adults were surveyed?
ii Which news service is the most popular?
iii What percentage of those surveyed watched the most popular news service?
iv What percentage of those surveyed watched the news service on Channel 7?
3 Construct a percentaged segmented barchart for the following categorical data, shading
the categories and including a legend.

Expenditure item   Weekly household expenditure ($)
Food               60
Clothing           30
Rent               120
Travel             15
Utilities          30
Entertainment      45
DISCRETE NUMERICAL DATA
A discrete numerical variable can take only distinct values.
The data is often obtained by counting.
For example, a farmer has a crop of peas and wishes to investigate the number of peas in
the pods. He takes a random sample of 50 pods and counts the number of peas in each pod,
obtaining the following data:
6 6 5 4 9 8 7 7 7 6 5 6 7 8 8 8 7 5 2 4 7 7 6 7 8
8 7 8 6 6 4 2 9 1 3 3 5 9 8 8 7 7 6 7 7 6 8 4 5 5
The variable in this situation is the discrete numerical variable ‘the number of peas in a pod’.
The data could only take the discrete numerical values 0, 1, 2, 3, 4, ....
TABLES AND GRAPHS
To organise his data the farmer could use the tally and frequency table shown.

No. peas in pod   Tally                  Frequency
1                 |                      1
2                 ||                     2
3                 ||                     2
4                 ||||                   4
5                 ||||/ |                6
6                 ||||/ ||||             9
7                 ||||/ ||||/ |||        13
8                 ||||/ ||||/            10
9                 |||                    3
Total                                    50
(||||/ denotes a tally of five.)

A barchart could be used to display the results.
[Figure: barchart with number of peas in pod (0 to 9) on the horizontal axis and frequency
(0 to 14) on the vertical axis.]
Alternatively, the farmer could use a dot plot which is a convenient method of tallying the
data and at the same time displaying the frequencies.
To draw a dot plot:
1 Draw a horizontal axis and mark it with the values that the variable can take. For this
example, the variable took values from 1 to 9, so we mark the axis from 0 to 10.
2 Label the axis with a description, in this case: number of peas in pod.
3 Systematically go through the data, placing a dot or cross above the appropriate position
on the axis.
The dot plot for this example is:
[Figure: dot plot with a horizontal axis marked from 0 to 10 and labelled ‘number of peas in pod’.]
Notice that the dots are evenly spaced so the final plot looks similar to the barchart.
From both the barchart and the dot plot it can be seen that:
• Seven was the most frequently occurring number of peas in a pod.
• $\frac{35}{50} \times \frac{100}{1} = 70\%$ of the pods yielded six or more peas.
• 10% of the pods had fewer than 4 peas in them.
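The three steps for drawing a dot plot can be imitated with a short Python sketch. It prints a text version of the plot turned on its side (values down the page, one star per pod), which is a simplification of the drawn plot; the variable names are illustrative.

```python
from collections import Counter

# The 50 pod counts from the farmer's sample, one digit per pod.
digits = "6654987776567888752477678" "8786642913359887767768455"
peas = [int(d) for d in digits]

counts = Counter(peas)

# Step 1: mark the axis from 0 to 10.  Step 3: one mark per data value.
for value in range(0, 11):
    print(f"{value:>2} | {'*' * counts[value]}")
# Step 2: label the axis.
print("number of peas in pod")

six_or_more = sum(1 for p in peas if p >= 6)
fewer_than_four = sum(1 for p in peas if p < 4)
print(f"{100 * six_or_more / len(peas):.0f}% of pods had six or more peas")
print(f"{100 * fewer_than_four / len(peas):.0f}% of pods had fewer than 4 peas")
```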
DESCRIBING THE DISTRIBUTION OF A SET OF DATA
The distribution of a set of data is the pattern or shape of its graph.
For the example above, the graph has the general shape shown alongside:
[Figure: curve with a long tail stretched to the left.]
This distribution of the data is said to be negatively skewed because it is stretched to the
left (the negative direction).
A positively skewed distribution of data would have a shape:
[Figure: curve with a long tail stretched to the right.]
A symmetrical distribution of data is neither positively nor negatively skewed, but is
symmetrical about a central value.
A set of data whose graph has two peaks is said to be bimodal.
Note that the horizontal axis is a number line with numbers in ascending order from left to
right.
Outliers are data values that are either much larger or much smaller than the general body
of data. Outliers appear separated from the body of data on a frequency graph.
For the example, if the farmer found one pod in his sample contained 13 peas then the data
value 13 would be considered an outlier. It is much larger than the other data in the sample.
On the column graph it appears separated.
[Figure: column graph with number of peas in pod (0 to 13) on the horizontal axis and frequency
(0 to 12) on the vertical axis; the column at 13 is labelled ‘outlier’.]
EXERCISE 1B.2
1 A randomly selected sample of households has been asked, “How many people live
in your household?” A column graph has been constructed for the results.
[Figure: column graph titled ‘Size of households’, with number of people in the household
(1 to 10) on the horizontal axis and frequency (0 to 8) on the vertical axis.]
a How many households were surveyed?
b How many households had only one or two occupants?
c What percentage of the households had five or more occupants?
d Describe the distribution of the data.
2
a Construct a barchart for the discrete numerical data alongside.
b Comment on the distribution of the data (positively or negatively skewed
or symmetric).

Number of toothpicks   Frequency
33                     1
34                     5
35                     7
36                     13
37                     12
38                     8
39                     2
3 A bowler has recorded the number of wickets he has taken in each of the last 30 innings
he has played:
1 1 3 2 0 0 4 2 2 4 3 1 0 1 0 2 1 5 1 3 7 2 2 2 4 3 1 1 0 3
a Construct a dot plot for the raw data.
b Comment on the distribution of the data, noting any outliers.
4 For an investigation into the number of phonecalls made by teenagers, a sample of
50 fifteen-year-olds were asked the question, “How many phonecalls did you make
yesterday?” The following dot plot was constructed for the data.
[Figure: dot plot titled ‘The number of phone calls made in a day by a sample of 50 fifteen
year olds’, with a horizontal axis marked from 0 to 11 and labelled ‘number of phone calls’.]
a What is the variable in this investigation?
b Explain why the data is discrete numerical data.
c What percentage of the fifteen-year-olds did not make any phonecalls?
d What percentage of the fifteen-year-olds made 5 or more phonecalls?
e Copy and complete: “The most frequent number of phonecalls made was .........”.
f Describe the distribution of the data.
g How would you describe the data value ‘11’?
5 The number of matches in a box is stated as 50, but the actual number of matches has
been found to vary. To investigate this, the number of matches in a box is counted for
a sample of 60 boxes:
51 50 50 51 52 49 50 48 51 50 47 50 52 48 50 49 51 50 50 52
52 51 50 50 52 50 53 48 50 51 50 50 49 48 51 49 52 50 49 50
50 52 50 51 49 52 52 50 49 50 49 51 50 50 51 50 53 48 49 49
a What is the variable in this investigation?
b Is the data continuous or discrete numerical data?
c Construct a dot plot for this data.
d Describe the distribution of the data.
e What percentage of the boxes contained exactly 50 matches?
CONTINUOUS NUMERICAL DATA
The height of 14-year-old children is being investigated. The variable ‘height of 14-year-old
children’ is a continuous numerical variable because the values recorded for the variable
could, theoretically, be any value on the number line. They are most likely to fall between
120 and 190 centimetres.
The heights of thirty children are measured in centimetres. The measurements are rounded to
one decimal place, and the values recorded below:
163.0 154.2 152.8 160.5 148.3 149.2 154.7 172.7 171.3 162.5
165.0 160.2 166.2 175.3 143.4 174.6 180.9 162.4 167.3 158.4
159.4 164.5 163.7 183.8 150.8 163.4 181.9 158.3 165.0 156.8
Note that these rounded values are actually discrete. However, when we tally them, we use
continuous class intervals as follows:
The smallest height is 143.4 cm and the largest is 183.8 cm so we will use class intervals 140
up to 150 (this does not include 150), 150 up to 160, 160 up to 170, 170 up to 180, 180 up
to 190. Note that we choose class intervals of the same width.
These class intervals are written as 140 - , 150 - , 160 - , etc. in the frequency table. The
final class interval is written as 180 - < 190 which means 180 cm up to a height that is less
than 190 cm.
A tally-frequency table for this example is:

Height (cm)    Tally                Frequency
140 -          |||                  3
150 -          ||||/ |||            8
160 -          ||||/ ||||/ ||       12
170 -          ||||                 4
180 - < 190    |||                  3
Total                               30
(||||/ denotes a tally of five.)
A histogram is used to display continuous numerical data. This is similar to a barchart but
because of the continuous nature of the variable, the ‘bars’ are joined together. The frequency
is represented by the height of the ‘bars’.
A histogram for this example is shown opposite:
[Figure: histogram titled ‘Heights of a sample of fourteen-year-old children’, with height (cm)
from 140 to 190 on the horizontal axis and frequency (0 to 12) on the vertical axis.]
Note: The two oblique lines that cross the horizontal axis indicate that the numbers on
this axis are not starting at zero. This can also be shown using
.
A relative frequency table and histogram can also be drawn:

Height (cm)    Frequency   Relative %
140 -          3           $\frac{3}{30} \times 100 = 10\%$
150 -          8           26.7%
160 -          12          40%
170 -          4           13.3%
180 - < 190    3           10%
Total          30          100%
[Figure: relative frequency histogram with height (cm) from 140 to 190 on the horizontal axis
and relative frequency % (0 to 40) on the vertical axis.]
From the tables and graphs we can see:
• More children had a height in the class interval 160 up to 170 cm than any other
  class interval. This class interval is called the modal class.
  $\frac{12}{30} \times 100 = 40\%$ of the children had a height in this class.
• Three of the children ($\frac{3}{30} \times 100 = 10\%$) had a height less than 150 cm.
• Three of the children (10%) were 180 cm or more tall.
• The distribution of heights was approximately symmetrical.
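Grouping the heights into class intervals of equal width can be sketched in Python as below. The sketch assumes the same intervals as the text (width 10 cm, starting at 140 cm); the row of `#` characters is only a rough text stand-in for the bars of the histogram.

```python
heights = [
    163.0, 154.2, 152.8, 160.5, 148.3, 149.2, 154.7, 172.7, 171.3, 162.5,
    165.0, 160.2, 166.2, 175.3, 143.4, 174.6, 180.9, 162.4, 167.3, 158.4,
    159.4, 164.5, 163.7, 183.8, 150.8, 163.4, 181.9, 158.3, 165.0, 156.8,
]

width = 10  # every class interval has the same width

# Class intervals 140 -, 150 -, ..., 180 - < 190, keyed by their lower bound.
freq = {lower: 0 for lower in range(140, 190, width)}
for h in heights:
    lower = int(h // width) * width  # e.g. 154.2 falls in the class starting at 150
    freq[lower] += 1

n = len(heights)
for lower, f in freq.items():
    relative = 100 * f / n
    print(f"{lower} - < {lower + width}: {f:>2}  {relative:5.1f}%  {'#' * f}")

modal_class = max(freq, key=freq.get)
print(f"Modal class: {modal_class} up to {modal_class + width} cm")
```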
EXERCISE 1B.3
1 Construct a histogram for the following continuous numerical data.

Time to complete 100 m swim (secs)   Number of swimmers
50 -                                 3
55 -                                 6
60 -                                 16
65 -                                 11
70 -                                 2
75 - < 80                            2
2 The speed of vehicles travelling along a section of highway has been recorded and
displayed using the histogram alongside.
[Figure: histogram with speed (km/h), marked from 50 to 130, on the horizontal axis and number
of vehicles (0 to 200) on the vertical axis.]
a How many vehicles were included in this survey?
b What percentage of the vehicles were travelling at speeds equal to or
greater than 100 km/h?
c What percentage of the vehicles were travelling at a speed from 100 up to
110 km/h?
d What percentage of the vehicles were travelling at a speed less than 80 km/h?
e If the owners of the vehicles travelling at 110 km/h or more were fined $165 each,
what amount would be collected in fines?
3 The daily maximum temperature (°C) to the nearest degree, in Melbourne, for each day
in January 2001, is recorded below:
34 38 31 38 23 24 25 26 29 35 41 23 32 36 22 21
24 26 35 36 25 32 27 30 34 30 27 25 26 23 25
a Using class intervals of 5 degrees construct a tally and frequency table for the data.
b Construct a histogram to display the data.
c Describe the distribution of Melbourne’s daily maximum temperatures in January
2001.
4 The height of each member of a basketball squad has been measured and the results are
displayed using the frequency table alongside.

Height (cm)    Frequency
165 -          1
170 -          3
175 -          5
180 -          12
185 -          7
190 -          5
195 -          2
200 - < 205    1

a Calculate the relative frequencies and construct a relative frequency histogram for the data.
b Comment on the distribution of the heights.
c Find the percentage of members of the squad whose height is
i greater than 180 cm
ii less than 170 cm
iii between 175 and 190 cm.
C STEM-AND-LEAF PLOTS (STEMPLOTS)
Constructing a stem-and-leaf plot, commonly called a stemplot, is often a convenient method
to organise and display a set of numerical data. A stemplot groups the data and shows the
relative frequencies but has the added advantage of retaining the actual data values.
CONSTRUCTING A STEMPLOT
Data values such as 25 36 38 49 23 46 47 15 28 38 34 are all two digit numbers, so
the first digit will be the ‘stem’ and the last digit the ‘leaf’ for each of the numbers. The
stems will be 1, 2, 3, 4 to allow for numbers from 10 to 49.
The stemplot for the data is shown alongside.

Stem | Leaf
   1 | 5
   2 | 3 5 8
   3 | 4 6 8 8
   4 | 6 7 9
2 | 3 means 23

Notice that:
• 1 | 5 represents 15
• 2 | 3 5 8 represents 23, 25 and 28
• the data in the leaves is evenly spaced with no commas
• the leaves are placed in increasing order, so this stemplot is ordered
• the scale (sometimes called the key) tells us the place value of each leaf.
  If the scale was 2 | 3 means 2.3, then 4 | 6 7 9 would represent 4.6, 4.7 and 4.9.
For data values such as 195 199 207 183 201 ...... the first two digits are the stem and
the last digit is the leaf.
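The construction of an ordered stemplot can be sketched in Python. The function name `stemplot` is a hypothetical helper written for this illustration; it assumes non-negative whole numbers, with the last digit as the leaf and the remaining digits as the stem.

```python
from collections import defaultdict


def stemplot(values):
    """Return the rows of an ordered stemplot for non-negative whole numbers."""
    leaves = defaultdict(list)
    for v in sorted(values):           # sorting first gives ordered leaves
        stem, leaf = divmod(v, 10)     # 25 -> stem 2, leaf 5
        leaves[stem].append(leaf)
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves[stem])
        rows.append(f"{stem:>2} | {row}".rstrip())  # stems with no leaves still get a row
    return rows


data = [25, 36, 38, 49, 23, 46, 47, 15, 28, 38, 34]
for row in stemplot(data):
    print(row)
print("2 | 3 means 23")
```

For data such as 195, 199, 207 the same function gives the stems 19 and 20, since everything except the last digit is treated as the stem.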
Example 1
The score, out of 50, on a test was recorded for 36 students.
25 36 38 49 23 46 47 15 28 38 34 9
30 24 27 27 42 16 28 31 24 46 25 31
37 35 32 39 43 40 50 47 29 36 35 33
a Organise the data using a stemplot.
b Comment on the distribution of the data.

a Recording the data from the list gives an unordered stemplot:

Stem | Leaf
   0 | 9
   1 | 5 6
   2 | 5 3 8 4 7 7 8 4 5 9
   3 | 6 8 8 4 0 1 1 7 5 2 9 6 5 3
   4 | 9 6 7 2 6 3 0 7
   5 | 0
2 | 4 means 24 marks

Ordering the data from smallest to largest for each stem gives an ordered stemplot:

Stem | Leaf
   0 | 9
   1 | 5 6
   2 | 3 4 4 5 5 7 7 8 8 9
   3 | 0 1 1 2 3 4 5 5 6 6 7 8 8 9
   4 | 0 2 3 6 6 7 7 9
   5 | 0
2 | 4 means 24 marks

b The shape of the distribution can be seen when the stemplot is rotated:
[Figure: the ordered stemplot rotated so that the stems 0 to 5 run along the horizontal axis.]
The data is slightly negatively skewed.
We also observe these important features:
• The minimum (smallest) test score is 9.
• The maximum (largest) test score is 50.
• The modal class is 30 - 39.
SPLIT STEMS
Consider the following example:
The residue that results when a cigarette is smoked collects in the filter. This residue has
been weighed for twenty cigarettes, giving the following data, in milligrams.
1.62 1.55 1.59 1.56 1.56 1.55 1.63
1.59 1.56 1.69 1.61 1.57 1.56 1.55
1.62 1.61 1.52 1.58 1.63 1.58
Scanning the data reveals that there will be only two ‘stems’, i.e., 15 and 16. In cases like
this we will need to split the stems.
If we use the stem 15 to represent data with values 1.50 to 1.54 and 15* to represent data
with values 1.55 to 1.59 etc., we can construct a stemplot with four stems:

Stem | Leaf
15   | 2
15*  | 5 5 5 6 6 6 6 7 8 8 9 9
16   | 1 1 2 2 3 3
16*  | 9
15 | 2 means 1.52

If we split the stems five ways, where 15₀ represents data with values 1.50 and 1.51, 15₂
represents data with values 1.52 and 1.53 etc., the stemplot becomes:

Stem | Leaf
15₀  |
15₂  | 2
15₄  | 5 5 5
15₆  | 6 6 6 6 7
15₈  | 8 8 9 9
16₀  | 1 1
16₂  | 2 2 3 3
16₄  |
16₆  |
16₈  | 9

The stemplot with the stems split five ways clearly gives a better view of the distribution
of the data. The value 1.69 appears as an outlier in this graph.
The stemplot with the stems split two ways was not sensitive enough to show this.
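Splitting stems can be sketched in Python as follows. The helper `split_stemplot` and its row labels are assumptions made for this illustration: each row is labelled with its stem followed, in brackets, by the smallest leaf digit that row can hold, so `15(5)` corresponds to 15* and `15(2)` to 15₂. The sketch starts at the first row that contains data, so leading empty rows are not printed.

```python
def split_stemplot(values, parts):
    """Ordered stemplot for whole numbers, each stem split into `parts` rows (2 or 5)."""
    span = 10 // parts                      # leaf digits per row: 5 (two-way) or 2 (five-way)
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows.setdefault((stem, leaf // span), []).append(leaf)
    last = max(rows)
    stem, part = min(rows)
    lines = []
    while (stem, part) <= last:
        leaves = " ".join(str(leaf) for leaf in rows.get((stem, part), []))
        lines.append(f"{stem}({part * span}) | {leaves}".rstrip())
        part += 1
        if part == parts:                   # move on to the next stem
            stem, part = stem + 1, 0
    return lines


readings = [1.62, 1.55, 1.59, 1.56, 1.56, 1.55, 1.63, 1.59, 1.56, 1.69,
            1.61, 1.57, 1.56, 1.55, 1.62, 1.61, 1.52, 1.58, 1.63, 1.58]
# Work in whole hundredths of a milligram so that 1.52 becomes stem 15, leaf 2.
hundredths = [round(x * 100) for x in readings]

two_way = split_stemplot(hundredths, 2)
five_way = split_stemplot(hundredths, 5)
print("\n".join(two_way))
print()
print("\n".join(five_way))
```

In the five-way output the rows `16(4)` and `16(6)` are empty, which is what makes 1.69 stand apart from the rest of the data.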
EXERCISE 1C
1 A school has conducted a survey of 60 of their students to investigate the time it takes
for students to travel to school. The following data gives the travel time to the nearest
minute.
12 15 16 8 10 17 25 34 42 18 24 18 45 33 38 45 40 3 20 12
10 10 27 16 37 45 15 16 26 32 35 8 14 18 15 27 19 32 6 12
14 20 10 16 14 28 31 21 25 8 32 46 14 15 20 18 8 10 25 22
a Is travel time a discrete or continuous variable?
b Construct a stemplot for the data using stems 0, 1, 2, ....
c Describe the distribution of the data.
d Copy and complete: “Most students spent between ...... and ...... minutes travelling
to school.”
2 The weight of 900 g loaves of bread varies
slightly from loaf to loaf. A manufacturer of
bread is concerned that he may be producing
too many underweight loaves of bread in his
900 gram range. He weighs a sample of sixty
900 g loaves and records their weight to the
nearest gram. Construct a stemplot for the
following data and comment on the distribution of the data.
901 907 898 893 904 904 895 894 913 892 904 903
924 888 908 900 921 905 913 906 893 907 924 910
894 901 927 928 895 915 885 901 878 901 898 896
885 909 903 886 896 917 903 897 910 889 913 899
901 891 916 908 903 894 931 904 907 894 882 889
3 A taxi driver has recorded the fares, to the
nearest dollar, of 60 passengers that he has
collected from Melbourne airport:
25 32 35 16 39 18 19 25 16 41 40 43 16
13 9 48 42 20 20 22 23 33 35 24 23 14
34 37 36 36 44 51 22 48 55 13 16 20 26
30 12 30 33 35 41 17 22 54 24 20 21 35
42 43 54 28 38 37 46 25
a Construct a stemplot with stems 0, 1, 2, 3, ...... Comment on the distribution of
the data.
b Construct a stemplot with two-way split stems. Comment on the feature of the
distribution that is revealed by this split-stem stemplot.
4 The time spent (minutes) by 20 people in a queue at a bank, waiting to be attended by
a teller, has been recorded:
3.4 2.1 3.8 2.2 4.5 1.4 0 0 1.6 4.8 1.5 1.9 0 3.6 5.2 2.7 3.0 0.8 3.8 5.2
Construct a stemplot for this data (include a legend). Comment on the distribution of
the data.
D SAMPLE SUMMARY STATISTICS: MEASURES OF CENTRE
MEASURES OF CENTRE
A picture of a data set can be obtained if we have an indication of the centre of the data and
the spread of the data.
Three statistics that provide a measure of the centre of a set of data are:
• the mean
• the median
• the mode.
THE MEAN
The mean $\bar{x}$ is the statistical name for ‘average’. The mean is calculated by adding all the
data values $x$ then dividing this sum by the number of data $n$.
mean = $\frac{\text{sum of the data values}}{\text{number of data values}}$, denoted $\bar{x} = \frac{\sum x}{n}$
Note: The Greek letter sigma, $\sum$, means ‘the sum of’.
• The mean involves all the data values.
• If you are told that the mean mark for a test is 65% then there will be some marks
  higher than 65% and some marks lower than 65%.
• The mean does not have to be one of the data values.
For example:
The mean number of children per family is 1.8 in Melbourne.
A family cannot have 1.8 children, but this statistic suggests that a typical family has
either 1 or 2 children, with 2 children being more common than 1.
Example 2
Find the mean of the following data:
5 5 7 3 8 2 3 4 6 5 7 6 4
There are 13 data values in this set, so n = 13.
Mean = $\frac{5+5+7+3+8+2+3+4+6+5+7+6+4}{13} = \frac{65}{13} = 5$
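The same calculation can be checked with a few lines of Python. The sketch follows the formula directly (sum of the data values divided by the number of data values) and then compares the result with `statistics.mean` from the standard library.

```python
import statistics

data = [5, 5, 7, 3, 8, 2, 3, 4, 6, 5, 7, 6, 4]

n = len(data)          # number of data values
total = sum(data)      # sum of the data values
mean = total / n       # x-bar = (sum of x) / n

print(f"n = {n}, sum = {total}, mean = {mean}")

# The standard library gives the same answer.
library_mean = statistics.mean(data)
print(library_mean)
```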
Example 3
Megan has had three Maths tests and her mean (average) mark is 78.
a What is the total of Megan’s marks for the three tests?
b She scores 82 marks for her next test. What is the mean mark for the four tests?
c How many marks did she need to score for the fourth test so that her overall
mean mark would increase to 80?
a The total number of marks for the three tests is 78 × 3 = 234.
b The average of her marks for the four tests is $\frac{234 + 82}{4} = 79$.
c To get an average mark of 80 in four tests, Megan needed to score a total of
4 × 80 = 320 marks.
Hence she needed to score 320 − 234 = 86 marks on the fourth test to bring
her overall mean mark to 80.
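The reasoning in Example 3 rests on one fact: total = mean × number of values. A minimal Python sketch of the three parts, with variable names chosen only for illustration:

```python
# Part a: total = mean x number of tests
mean_three = 78
total_three = mean_three * 3

# Part b: add the fourth mark, then divide by the new number of tests
mean_four = (total_three + 82) / 4

# Part c: total required for a mean of 80 over four tests, less what she already has
needed = 80 * 4 - total_three

print(total_three, mean_four, needed)
```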
THE MEDIAN
The median is the middle value of an ordered set of data.
An ordered set of data is the data listed from smallest to largest value (or largest to smallest).
The median splits the data set into two halves: half of the data have values less than or equal
to the median and half have values greater than or equal to the median.
For example, if the median mark for a test is 65%, then half the marks scored are greater
than or equal to 65% and half the marks scored are lower than or equal to 65%.
To find the median:
1 Order the data by rearranging the values from smallest to largest.
2 Locate the middle of the data values.
• If there is an odd number of data then the median will be one of the data values.
  The median is the $\frac{n+1}{2}$th value in a data set of $n$ values.
• If there is an even number of data then the median is the average of the two
  middle values and may not be equal to any of the data values.
Example 4
Find the median for the following data sets:
a 5 5 7 3 8 2 3 4 6 5 7 6 4
b 3 5 5 5 5 6 6 6 7 7 7 8 8 8 9 10

a The data set is ordered (arranged from smallest to largest):
2 3 3 4 4 5 [5] 5 6 6 7 7 8
The median is the $\frac{13 + 1}{2}$ = 7th value (shown in brackets).
The median is 5.
b There are 16 data values so the median is the average of the 8th and 9th values
(shown in brackets):
3 5 5 5 5 6 6 [6] [7] 7 7 8 8 8 9 10
The median is $\frac{6 + 7}{2} = 6.5$
(Note: This is not one of the data values.)
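The two steps for finding the median translate into a short Python function. This is a sketch written for illustration; in practice `statistics.median` from the standard library does the same job.

```python
def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)           # step 1: order the data
    n = len(ordered)
    middle = n // 2                    # step 2: locate the middle
    if n % 2 == 1:
        # Odd n: the (n + 1)/2-th value, which is index n // 2 when counting from 0.
        return ordered[middle]
    # Even n: average of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


set_a = [5, 5, 7, 3, 8, 2, 3, 4, 6, 5, 7, 6, 4]
set_b = [3, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9, 10]

print(median(set_a))
print(median(set_b))
```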
THE MODE
The mode is the most frequently occurring value in the data set.
This statistic can usually be found easily from a frequency table, barchart or dot plot.
If there are two modes in a data set then the data can be described as bimodal.
If there are more than two modes then it is said that “the mode is not distinct” and the mode
is not useful as a descriptive statistic.
For continuous data, the class interval with the highest frequency is the modal class.
Example 5
Find the mode for the following data:
5 5 7 3 8 2 3 4 6 5 7 6 4
The mode is the most frequently occurring value.
There are three 5s and the most we have of any other number is two.
So, the mode is 5.
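Finding the mode amounts to building a frequency table and picking out the highest frequency. The sketch below uses a hypothetical helper `modes` that returns every value sharing the highest frequency, so a bimodal data set gives two values.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the highest frequency, smallest first."""
    counts = Counter(values)               # frequency of each value
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


data = [5, 5, 7, 3, 8, 2, 3, 4, 6, 5, 7, 6, 4]
print(modes(data))

# A made-up bimodal example, not taken from the text.
bimodal = [1, 1, 2, 2, 3]
print(modes(bimodal))
```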
EXERCISE 1D.1
1 Find the i mean ii median iii mode for each of the following data sets:
a 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9
b 10, 12, 12, 15, 15, 16, 16, 17, 18, 18, 18, 18, 19, 20, 21
c 22.4, 24.6, 21.8, 26.4, 24.9, 25.0, 23.5, 26.1, 25.3, 29.5, 23.5
d 127, 123, 115, 105, 145, 133, 142, 115, 135, 148, 129, 127, 103, 130, 146, 140,
125, 124, 119, 128, 141.
2 Consider the following two data sets:
Data set A: 3, 4, 4, 5, 6, 6, 7, 7, 7, 8, 8, 9, 10
Data set B: 3, 4, 4, 5, 6, 6, 7, 7, 7, 8, 8, 9, 15
a Find the mean for both Data set A and Data set B.
b Find the median of both Data set A and Data set B.
c Explain why the mean of Data set A is less than the mean of Data set B.
d Explain why the median of Data set A is the same as the median of Data set B.
3 A cricketer has scored an average of 25.4 runs in his last 10 innings. He scores 58 and
16 runs in his next two innings. What is his new batting average?
4 On the first five days of his holiday David drove an average of 256 kilometres per day
and on the next three days he drove an average of 172 kilometres per day.
a What is the total distance that David drove in the first five days?
b What is the total distance that David drove in the next three days?
c What is the mean distance travelled per day over the eight days?
5 A basketball team scored 43, 55, 41 and 37 goals in their first four matches.
a What is the mean number of goals scored for the first four matches?
b What score will the team need to shoot in the next match so that they maintain the
same mean score?
c The team shoots only 25 goals in the fifth match. What is the mean number of
goals scored for the five matches?
d The team shoots 41 goals in their sixth and final match. Will this increase or
decrease their previous mean score? What is the mean score for all six matches?
COMPARING MEASURES OF CENTRE
Consider the data set 5 5 7 3 8 2 3 4 6 5 7 6 4 used in Examples 4a and 5. For this data
set the mean, median and mode all had the same value, 5, and this fact indicates that the
distribution of data in this set is symmetrical.
A dot plot of the data confirms this:
[Dot plot: data values from 2 to 8 on the horizontal axis, with the mean, median and mode marked at the same position.]
When the distribution of data is not symmetrical the measures of centre can have different values.
CALCULATING MEASURES OF CENTRE FROM A FREQUENCY TABLE
When the same data appear several times
we often summarise the data in table form.
Consider the data of the given table.
We can find the measures of the centre
directly from the table.
The mode
The mode is 7. There are 15 of data value
7 which is more than any other data value.
Data value   Frequency   Data value × frequency
3            1           3 × 1 = 3
4            1           4 × 1 = 4
5            3           5 × 3 = 15
6            7           6 × 7 = 42
7            15          7 × 15 = 105
8            8           8 × 8 = 64
9            5           9 × 5 = 45
Total        40          278
The mean
There are 40 data in this set, made up of one 3, one 4, three 5s, seven 6s and so on.
The data in an ordered list would look like
3 4 5 5 5 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 ......
To add these numbers we could say
3 × 1 + 4 × 1 + 5 × 3 + 6 × 7 + 7 × 15 + ......
so it is not necessary to write out all the data values.
Adding a ‘Data value × frequency’ column to the table helps to add all the scores. For
example, there are 15 data of value 7 and these add to 7 × 15 = 105.
Since the total of the 40 data values is 278, the mean = $\frac{278}{40} = 6.95$.
The median
Since there are 40 data in this set, if the data is written out in order from smallest to largest
then the median will be the average of the two middle values, i.e., the 20th and 21st values.
The median can be found by counting down the frequency table.
Data value   Frequency   Accumulated
3            1           1     one number is 3
4            1           2     two numbers are 4 or less
5            3           5     five numbers are 5 or less
6            7           12    12 numbers are 6 or less
7            15          27    27 numbers are 7 or less
8            8
9            5
Total        40
In the table, the numbers in the third column show us accumulated values. We can see that the 20th and
21st data values (in order) must both be 7s; the median = $\frac{7+7}{2} = 7$.
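The frequency-table calculations above can be sketched in Python. This is a minimal illustration, assuming the table is stored as a dictionary mapping each data value to its frequency; the function names are our own.

```python
def mean_from_table(table):
    """Mean = (sum of value x frequency) / (sum of frequencies)."""
    total = sum(value * freq for value, freq in table.items())
    n = sum(table.values())
    return total / n

def value_at_position(table, position):
    """Return the data value at a 1-based position in the ordered data,
    found by accumulating frequencies down the table."""
    accumulated = 0
    for value in sorted(table):
        accumulated += table[value]
        if position <= accumulated:
            return value
    raise ValueError("position is beyond the end of the data")

def median_from_table(table):
    n = sum(table.values())
    if n % 2 == 1:
        return value_at_position(table, (n + 1) // 2)
    # Even n: average the two middle values.
    return (value_at_position(table, n // 2)
            + value_at_position(table, n // 2 + 1)) / 2

table = {3: 1, 4: 1, 5: 3, 6: 7, 7: 15, 8: 8, 9: 5}
print(mean_from_table(table))    # 6.95
print(median_from_table(table))  # 7.0
```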
Example 6
Find the mean, median and mode for the data given in the following frequency table.
Data value   Frequency
2            1
4            1
5            2
6            3
7            4
8            9
9            6
Total        26
Adding a Data value × Frequency column, we get:
Data value   Freq   Data value × Freq
2            1      2 × 1 = 2
4            1      4 × 1 = 4
5            2      5 × 2 = 10
6            3      6 × 3 = 18
7            4      7 × 4 = 28
8            9      8 × 9 = 72
9            6      9 × 6 = 54
Total        26     188
the mean = $\frac{188}{26} \approx 7.23$
There are 26 data in this set, so the median will be the average of the 13th and 14th values.
Data value   Freq   Position in the ordered data
2            1      1st value
4            1      2nd value
5            2      3rd and 4th values
6            3      5th, 6th and 7th values
7            4      8th, 9th, 10th and 11th values
8            9      12th, 13th, 14th, 15th to 20th values
9            6      21st to 26th values
Total        26
The 13th and 14th values are both 8 so their average is 8.
The median is 8.
8 is the data value with the highest frequency of 9, so the mode is 8.
Which measure of centre is the most suitable to use?
In Example 6, the mean (7.23) is less than the median (8) and mode (8).
A dot plot shows the distribution of the data:
[Dot plot: data values from 2 to 9 on the horizontal axis, with the low values labelled as outliers and the positions of the mean, median and mode marked.]
The data is negatively skewed and the data values 2 and 4 are much smaller than most of the
data values.
The mean depends on the actual values of the data so it has been ‘dragged’ towards these
outliers.
If the data value ‘2’ was replaced by a ‘7’ then the overall total would increase by 5 and
hence the mean would increase.
The median is not influenced by extreme values because it depends on the position of data
rather than their value. If the data value ‘2’ was replaced by a ‘7’ then the median would not
change; the middle values would remain the same.
In cases where there are outliers in one direction so the distribution is skewed, the most
suitable measure of centre to use is the median or the mode. In this case the mode has the
same value as the median and would be a suitable measure of centre for the data.
However, because the mode does not take all the data values into account, in some situations
it is not representative of a data set.
For example, the data set 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 9, 9 has a
mode of 2 and this is not representative of the data set.
A more suitable measure of centre for this data set would be the median 4 or the mean 4.5.
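The effect of an outlier on the mean and the median can be checked with a short Python sketch. It rebuilds the Example 6 data as a list and uses the standard library `statistics` module; the variable names are illustrative only.

```python
from statistics import mean, median

# Example 6 data written out in full (26 values).
data = [2, 4, 5, 5, 6, 6, 6, 7, 7, 7, 7] + [8] * 9 + [9] * 6

# Replace the data value 2 by a 7, as discussed above.
changed = [7 if x == 2 else x for x in data]

print(round(mean(data), 2), median(data))        # 7.23 8.0
print(round(mean(changed), 2), median(changed))  # 7.42 8.0
```

The mean moves when the extreme value changes, while the median stays at 8.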
MEASURES OF CENTRE FROM A STEMPLOT
Example 7
Find the mean, median and mode
from the ordered stemplot shown.
Stem | Leaf
1    | 6 7 8 8
2    | 2 3 3 4 4 6 (7) 8 9     median, the 11th value
3    | 1 2 5 8
4    | 0 4 6
5    | 1
The mean is found by dividing the sum of all the data values by the number of
data. We must make sure that the ‘stem’ is included with the ‘leaf’.
Mean = $\frac{16 + 17 + 18 + 18 + 22 + 23 + 23 + \dots + 51}{21} = 29.14$
The median is the middle value, the 11th value in this ordered data set.
Counting the leaves from the beginning gives a median of 27.
The mode is the most frequently occurring value; there are two 18s, two 23s and
two 24s in this set of data. We can say that the mode is not distinct in this case and
is not useful as a measure of centre.
Note: The mean of 29.14 is larger than the median of 27, indicating that the distribution is positively skewed. This can be seen from the stemplot.
USING A CALCULATOR TO FIND THE MEAN AND MEDIAN
Consider the data 2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9.
The data is entered into the calculator under the [STAT] menu.
Choose 1:Edit [ENTER].
Use List 1 (L1), and after checking that the cursor is in the
first position of List 1 we can type the first data value. This
value will appear at the bottom of the screen as L1(1)=2.
Press [ENTER] and ‘2’ appears in the list.
Continue in a similar way through the list of data, pressing
[ENTER] after each data entry to move the cursor to the next
position.
To find the descriptive statistics for the data:
[STAT] [►] CALC will get you into the menus for finding
descriptive statistics.
We are dealing with only one variable so we choose
1:1-Var Stats [ENTER].
1-Var Stats appears on the home screen. We need to
tell the calculator which list our data is entered in, so type
[2nd] [1] [ENTER].
All the available descriptive statistics for this variable appear
on the screen:
The first statistic, $\bar{x}$, is the mean.
The mean of the data is 4.867 (to 3 decimal places).
The second statistic, $\sum x = 73$, means that the sum of all
the data values is 73.
The next three statistics we will consider in Section 1E.
‘n=15’ indicates that there are 15 data values in the set.
The arrow ↓ beside n=15 means that there are other entries
for this screen. Scroll down using [▼].
Med=5 means the median is 5.
The other statistics on this part of the screen give the statistics of the five-number summary
which is also covered in Section 1E.
EXERCISE 1D.2
1 Find the mean, median and mode for each of the following data sets given as frequency
tables:
a
Data value   Frequency
1            2
2            5
3            8
4            6
5            4
b
Number of rooms   Frequency
2                 1
3                 4
4                 12
5                 15
6                 2
7                 4
8                 2
2 The test scores, out of 30 marks, for a class of twenty-two students are:
15, 16, 18, 23, 22, 28, 29, 25, 25, 24, 27, 18, 11, 20, 23, 26, 26, 30, 25, 18, 15, 17
a Find the i mean ii median iii mode for the data.
b Explain why the mean is not the most suitable measure of centre for this set of data.
c Explain why the mode is not the most suitable measure of centre for this set of data.
3 a Find the i mean ii median iii mode
for the data displayed in the following stem-and-leaf plot:
Stem | Leaf
5    | 3 5 6
6    | 0 1 2 4 6 7 9
7    | 3 3 6 8
8    | 4 7
9    | 1
b Which measure of centre would be the best representative for this set of data?
4 The following data is the daily rainfall (to the nearest millimetre) for the month of
October 2000 in Melbourne:
3, 1, 0, 0, 0, 0, 0, 2, 0, 0, 3, 0, 0, 0, 7, 1, 1, 0, 3, 8, 0, 0, 0, 32, 38, 3, 0, 3, 1, 0, 0
a Find the i mean ii median iii mode for this data.
b Explain why the median is not the most suitable measure of centre for this data.
c Explain why the mode is not the most suitable measure of centre for this data.
5 The frequency table below records the number of phonecalls made in a day by 50
fifteen-year-olds.
Number of phonecalls   0   1   2    3   4   5   6   7   8   9   10   11
Frequency              5   8   13   8   6   3   3   2   1   0   0    1
a Find the: i mean ii median iii mode for this data.
b Construct a barchart for the data and show the position of the measures of
centre (mean, median and mode) on the horizontal axis.
c Describe the distribution of the data.
d Why is the mean larger than the median for this data?
e Which measure of centre would be the most suitable for this data set?
6 Which one of the following will always be true for the mean, median and mode of a set
of discrete numerical data, assuming a distinct mode exists?
A The mean always equals one of the data values in the set.
B The median always equals one of the data values in the set.
C The mode always equals one of the data values in the set.
D The median is distorted by extreme values.
E In a positively skewed set of data, the median will be greater than the mean.
SAMPLE SUMMARY STATISTICS: MEASURES OF SPREAD
E
MEASURES OF SPREAD
Three commonly used statistics that indicate the spread of a set of data are:
• the range
• the interquartile range
• the standard deviation.
THE RANGE AND INTERQUARTILE RANGE
The range is the difference between the maximum (largest) data value and the minimum
(smallest) data value.
Range = maximum data value − minimum data value
Example 8
Find the range for the data set: 5 5 7 3 8 2 3 4 6 5 7 6 4.
Scanning the data we can see that the minimum is 2 and the maximum is 8.
Hence the range is 8 − 2 = 6.
Now the median divides an ordered data set into two halves. These halves are divided in half
again by the quartiles. The median is denoted Q2.
The middle value of the lower half is called the lower quartile, denoted Q1. One quarter
(25%) of the data have values less than or equal to the lower quartile. Three quarters (75%)
of the data have values greater than or equal to the lower quartile.
The middle value of the upper half is called the upper quartile, denoted Q3. One quarter
(25%) of the data have values greater than or equal to the upper quartile. Three quarters
(75%) of the data have values less than or equal to the upper quartile.
The interquartile range (IQR) is the spread of the middle half (50%) of the data.
Interquartile range (IQR) = upper quartile − lower quartile
= Q3 − Q1
Example 9
For the data set 5 5 7 3 8 2 3 4 6 5 7 6 4 find the:
a median b lower quartile c upper quartile d interquartile range
The ordered data set is 2 3 3 4 4 5 5 5 6 6 7 7 8
a There are 13 data values so the median is the 7th value, which is 5.
There is an odd number of data and the median is one of the values so it
divides the data into two halves of six values each.
Note: For an odd number of data the median data value is not included in
the lower or upper half for the calculation of the quartiles.
b The middle value of the lower half is the average of the 3rd and 4th values.
Lower half (6 values): 2 3 3 4 4 5     Median: 5     Upper half (6 values): 5 6 6 7 7 8
Lower quartile = $\frac{3+4}{2} = 3.5$
c Similarly, the middle value of the upper half is the average of the 10th and
11th values:
Upper quartile = $\frac{6+7}{2} = 6.5$
d Interquartile range = upper quartile − lower quartile
= 6.5 − 3.5
= 3
So, the middle half of the data has a spread of 3.
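The halving method used in Example 9 can be sketched in Python as follows. The function names are our own, and the sketch assumes at least two data values. Statistical software often uses other quartile conventions, so values from other tools may differ slightly from this method.

```python
def median(ordered):
    """Median of an already ordered, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the halving method described above."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # the median itself is left out when n is odd
    upper = ordered[n - half:]
    return median(lower), median(ordered), median(upper)

q1, q2, q3 = quartiles([5, 5, 7, 3, 8, 2, 3, 4, 6, 5, 7, 6, 4])
print(q1, q2, q3)   # 3.5 5 6.5
print(q3 - q1)      # 3.0, the interquartile range
```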
A summary for the set of data in Example 9 is:
2 3 3 4 4 5 5 5 6 6 7 7 8
Range = 8 − 2 = 6
Lower quartile = 3.5
Median = 5
Upper quartile = 6.5
Interquartile range = 3
The data has a spread of 6 (range = 6),
centred around the value 5 (median = 5).
The middle half of the data has a spread of 3
(interquartile range = 3).
Example 10
Find the range and the interquartile range and describe the distribution of the data:
8, 4, 3, 9, 6, 5, 5, 10, 3, 6, 7, 9, 11, 14, 9, 8, 7, 12
The ordered data set (there are 18 data values) is:
3, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 9, 10, 11, 12, 14
The range = 14 − 3 = 11
The median will be the average of the 9th and 10th values:
Median = $\frac{7+8}{2} = 7.5$
The median divides the data set into two sets of 9 values:
Lower half (9 values): 3, 3, 4, 5, 5, 6, 6, 7, 7
Upper half (9 values): 8, 8, 9, 9, 9, 10, 11, 12, 14
The lower quartile is the middle value of the lower half, 5, and the upper quartile is
the middle value of the upper half, 9.
The interquartile range = 9 − 5 = 4
The data is centred at 7.5 (median) and has a spread of 11 (range).
The middle half of the data has a spread of 4 (interquartile range).
USING THE CALCULATOR TO FIND THE RANGE AND INTERQUARTILE RANGE
Key the data into a list. The data does not have to be
ordered.
Enter [STAT] [►] CALC and choose 1:1-Var Stats [ENTER].
Press [2nd] [1] [ENTER] to select the list L1.
The calculator screens show all the statistics for the data. Use [▼] to scroll down and reveal the
lower part of the screen.
The range is maxX − minX = 14 − 3 = 11
The IQR = Q3 − Q1 = 9 − 5 = 4
MEASURES OF SPREAD FROM A STEMPLOT
The median, range, and interquartile range can be found easily from an ordered stemplot.
Example 11
The number of cars travelling along a particular road was counted for 21 days and the data
was recorded in this ordered stemplot.
Stem | Leaf
1    | 6 7 8 8
2    | 2 4 7 8 9
3    | 0 2 3 3 4 5 6 8
4    | 0 4 6
5    | 1
Find the median, range and interquartile range for this data.
The data is ordered so we can read from the
smallest value to the largest value.
Combining the ‘stem’ with the ‘leaf’, we get:
16, 17, 18, 18, 22, 24, 27, ......, 40, 44, 46, 51.
The minimum is 16 and the maximum is 51,
so the range = 51 − 16 = 35.
The median is the middle value (the 11th data value in a list of 21) and counting
from the beginning, the median = 32.
The median divides the data into two groups of 10 data values.
The average of the middle values of these groups gives the lower and upper
quartiles.
Lower quartile = $\frac{22 + 24}{2} = 23$
Upper quartile = $\frac{36 + 38}{2} = 37$
Interquartile range = Upper quartile − Lower quartile
= 37 − 23
= 14
EXERCISE 1E.1
1 For each of the following data sets, find:
i the median (make sure the data is ordered)
ii the upper and lower quartiles
iii the range
iv the interquartile range.
a 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9
b 10, 12, 15, 12, 24, 18, 19, 18, 18, 15, 16, 20, 21, 17, 18, 16, 22, 14
c 21.8, 22.4, 23.5, 23.5, 24.6, 24.9, 25, 25.3, 26.1, 26.4, 29.5
d 127, 123, 115, 105, 145, 133, 142, 115, 135, 148, 129, 127, 103, 130, 146, 140,
125, 124, 119, 128, 141.
2 For the data given in the following ordered stem-and-leaf plot, find the:
a median
b upper quartile
c lower quartile
d range
e interquartile range
Stem | Leaf
0    | 3 4 7 9
1    | 0 3 4 6 7 8
2    | 0 0 3 5 6 9 9 9
3    | 1 3 7 8
4    | 2
3 The time spent (in minutes) by 20 people in a queue at a bank has been recorded:
3.4, 2.1, 3.8, 2.2, 4.5, 1.4, 0, 0, 1.6, 4.8, 1.5, 1.9, 0, 3.6, 5.2, 2.7, 3.0, 0.8, 3.8, 5.2
a Find the median waiting time and the upper and lower quartiles.
b Find the range and interquartile range of the waiting times.
c Copy and complete the following statements:
i “50% of the waiting times were greater than ...... minutes.”
ii “75% of the waiting times were less than or equal to ...... minutes.”
iii “The minimum waiting time was ...... minutes and the maximum waiting time
was ...... minutes. The waiting times were spread over ...... minutes.”
4 The following data gives the number of novels counted
in 30 households.
Stem | Leaf
2    | 0 2 5 5 8 9 9
3    | 0 1 3 5 6 6 8 9
4    | 2 2 4 7 7 8 9
5    | 0 0 1 2 6
6    | 2 5
7    | 2
a Find the median number of novels per household
and the upper and lower quartiles of the data.
b Copy and complete the following statements:
i “Half of the households have more than ......
novels.”
ii “75% of the households have at least ......
novels.”
c Find the
i range
ii interquartile range
for the number of novels per household.
d Describe the distribution of the data using the
statistics found.
5 The height (to the nearest centimetre) of 20 ten year olds
is recorded in the following stemplot.
Stem | Leaf
10   | 9
11   | 1 3 4 4 8 9
12   | 2 2 4 4 6 8 9 9
13   | 1 2 5 8 8
a Find the i median height
ii upper and lower quartiles of the data.
b Copy and complete the following statements:
i “Half of the children are less than or ...... cm tall.”
ii “75% of the children are less than ...... cm tall.”
iii “The middle 50% of the children have heights spread over ...... cm.”
THE VARIANCE AND STANDARD DEVIATION
Now the range and IQR both only use two values in their calculation. It is sometimes better
to use a measure of spread that includes all of the data values in its calculation. One such
statistic is the variance, which measures the average of the squared deviations of each data
value from the mean. The deviation of a data value $x$ from the mean $\bar{x}$ is given by $x - \bar{x}$.
For a sample, i.e., when we have surveyed a portion of the population:
• the variance is $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$, where $n$ is the sample size
• the standard deviation $s$ is the square root of the variance, $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$.
Note: The variance and standard deviation for a whole population have slightly different
formulae. However, we do not use these in this course.
Example 12
Use the formula to find the variance and the standard deviation of the sample data:
3, 4, 4, 8, 7, 6, 10
The mean, $\bar{x}$, of the data is $\frac{3 + 4 + 4 + 8 + 7 + 6 + 10}{7} = \frac{42}{7} = 6$
Using a table for the calculations:
$x$      $x - \bar{x}$    $(x - \bar{x})^2$
3        −3               9
4        −2               4
4        −2               4
8        2                4
7        1                1
6        0                0
10       4                16
Total                     38
variance $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{38}{6} = 6.3333....$
standard deviation $s = \sqrt{\text{variance}} = \sqrt{\frac{38}{6}} = 2.5166$ (4 d.p.)
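The table calculation in Example 12 can be reproduced step by step in Python. This is only a sketch with illustrative variable names; the last line cross-checks the result against the standard library functions `statistics.variance` and `statistics.stdev`, which also divide by n − 1.

```python
from math import sqrt
from statistics import stdev, variance

data = [3, 4, 4, 8, 7, 6, 10]
n = len(data)
mean = sum(data) / n                                  # 6.0

# Squared deviation of each data value from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]  # 9, 4, 4, 4, 1, 0, 16

sample_variance = sum(squared_deviations) / (n - 1)   # 38 / 6
sample_sd = sqrt(sample_variance)

print(round(sample_variance, 4), round(sample_sd, 4))   # 6.3333 2.5166
print(round(variance(data), 4), round(stdev(data), 4))  # 6.3333 2.5166
```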
Using a table to calculate the standard deviation is an interesting exercise, but you will
normally use your calculator to find this statistic.
USING THE CALCULATOR TO FIND THE STANDARD DEVIATION
Press [STAT] and choose 1:Edit.
Key data into list L1.
Press [STAT] [►] to choose CALC, then choose 1:1-Var Stats.
Press [2nd] [1] [ENTER] to choose list L1.
[Calculator screen with the sample standard deviation labelled.]
Note: The variance is not given on the screen, but it can be found by squaring the standard
deviation.
STANDARD DEVIATION FOR GROUPED DATA
Example 13
The frequency table below shows data collected from a random sample of 50
households in a particular suburb, investigating the number of people in the household.
Number of people in the household   1   2   3    4    5   6
Frequency                           5   8   13   14   7   3
Use the calculator to find the standard deviation of the number of people in a household
for this sample.
Press [STAT] and choose 1:Edit.
Key the variable values into L1 and the frequency values into L2.
Press [STAT] [►] to choose CALC, then choose 1:1-Var Stats.
Enter L1, L2 by pressing [2nd] [1] [,] [2nd] [2] [ENTER].
The sample standard deviation is 1.3536 ....
Note: If you do not include L2 then you will still get a screen of statistics, but they
will be for L1 only.
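As a check on the calculator result, the sample standard deviation can also be computed directly from the frequency table. The sketch below assumes the table is held as a dictionary of value-to-frequency pairs; the function name is hypothetical.

```python
from math import sqrt

def sample_sd_from_table(table):
    """Sample standard deviation of data given as {value: frequency}."""
    n = sum(table.values())
    mean = sum(value * freq for value, freq in table.items()) / n
    # Each squared deviation is counted once for every occurrence of the value.
    sum_sq = sum(freq * (value - mean) ** 2 for value, freq in table.items())
    return sqrt(sum_sq / (n - 1))

households = {1: 5, 2: 8, 3: 13, 4: 14, 5: 7, 6: 3}
print(round(sample_sd_from_table(households), 4))  # 1.3536
```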
GIVING MEANING TO THE STANDARD DEVIATION
Many data sets have frequency distributions that are ‘bell-shaped’ and symmetrical about the
mean.
For example, the histogram alongside exhibits this typical ‘bell-shape’. The
data represents the heights of a group of adult women and has a mean of 165
and a standard deviation of 8.
The data is centred about the mean and spreads from 140 to 190. However,
most of the data have values between 155 and 170 and not many have values
more than 180 or less than 150.
[Histogram: frequency (0 to 25) against height (cm), from 140 to 190 in steps of 5.]
The Normal distribution is an important bell-shaped distribution.
For the Normal distribution it can be shown that:
• 68% of the data will have values within one standard deviation of the mean.
• 95% of the data will have values within two standard deviations of the mean.
• 99.7% of the data will have values within three standard deviations of the mean.
Graphically this can be summarised:
[Three bell curves centred on the mean $\bar{x}$: 68% of data between $\bar{x} - s$ and $\bar{x} + s$; 95% of data between $\bar{x} - 2s$ and $\bar{x} + 2s$; 99.7% of data between $\bar{x} - 3s$ and $\bar{x} + 3s$.]
If we model the bell-shaped data above using the Normal distribution:
• 68% of the heights will have values between 165 − 8 = 157 and
165 + 8 = 173, i.e., between 157 and 173 cm.
68% of the data values will be in the interval [$\bar{x} - s$, $\bar{x} + s$].
• 95% of the heights will have values between 165 − 2 × 8 = 149 and
165 + 2 × 8 = 181, i.e., between 149 and 181 cm.
95% of the data values will be in the interval [$\bar{x} - 2s$, $\bar{x} + 2s$].
• 99.7% of the heights will have values between 165 − 3 × 8 = 141 and
165 + 3 × 8 = 189, i.e., between 141 and 189 cm.
99.7% of the data values will be in the interval [$\bar{x} - 3s$, $\bar{x} + 3s$].
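The three intervals above can be generated with a short Python sketch. The function name `empirical_intervals` is our own; it simply applies the 68%, 95% and 99.7% figures to a given mean and standard deviation.

```python
def empirical_intervals(mean, sd):
    """Map each percentage to the interval [mean - k*sd, mean + k*sd]."""
    rule = ((1, 68), (2, 95), (3, 99.7))   # (number of sds, % of data)
    return {pct: (mean - k * sd, mean + k * sd) for k, pct in rule}

# Heights of adult women: mean 165 cm, standard deviation 8 cm.
for pct, (low, high) in empirical_intervals(165, 8).items():
    print(f"about {pct}% of heights lie between {low} and {high} cm")
```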
Example 14
A set of data has a Normal distribution with a mean $\bar{x} = 30$ and a standard
deviation of $s = 7$. What percentage of the data is:
a greater than 30
b between 23 and 37
c more than 37
d between 16 and 44
e more than 44
f between 37 and 44?
a The distribution of data is symmetrical about the
mean, so 50% of the data have a value greater
than 30.
b Now $\bar{x} - s$ = 30 − 7 = 23
and $\bar{x} + s$ = 30 + 7 = 37
68% of the data are between 23 and 37.
c Since 68% of scores are between 23 and 37,
32% are outside this interval. The distribution
of scores is symmetrical, so 16% are greater
than 37.
d Now $\bar{x} + 2s$ = 30 + 14 = 44
and $\bar{x} - 2s$ = 30 − 14 = 16
95% of the data fall between these two values.
e Since 95% of the data are between 16 and 44,
5% are outside this interval. The distribution
is symmetric so 2.5% of the data are greater
than 44.
f From c, we know that 16% of the data are
greater than 37, and from e, we know 2.5%
of the data are greater than 44.
16% − 2.5% = 13.5% of the data lie
between 37 and 44.
Example 15
The contents of a sample of two hundred ‘800 gram packets’ of muesli were weighed
and the weights were found to have a bell-shaped distribution with a mean of 800
grams and a standard deviation of 8 grams. How many of the packets in the sample
would be expected to have a weight of more than 792 grams?
We model the bell-shaped distribution using the Normal distribution.
Now 792 = 800 − 8. So, 792 g is one standard deviation less than the mean.
Since 68% of the weights are within
one standard deviation of the mean,
32% are outside this range.
Since the distribution is symmetric,
$\frac{32}{2}$ = 16% of the weights are lower
than 792 g.
68% + 16% = 84% of the weights are above 792 g.
84% of 200 = $\frac{84}{100}$ × 200 = 168.
So 168 of the 200 packets in the sample would be expected to have a weight greater
than 792 grams.
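The reasoning in Examples 14 and 15 can be sketched in Python using a small lookup table. The table below is an assumption of this sketch: it is built only from the 68%, 95% and 99.7% figures and symmetry, so it covers only whole numbers of standard deviations from −3 to 3. The names are illustrative.

```python
# Approximate percentage of Normal data lying below (mean + z * sd).
PERCENT_BELOW = {-3: 0.15, -2: 2.5, -1: 16.0, 0: 50.0, 1: 84.0, 2: 97.5, 3: 99.85}

def percent_between(z_low, z_high):
    """Percentage of data between two whole-number z positions."""
    return round(PERCENT_BELOW[z_high] - PERCENT_BELOW[z_low], 2)

def percent_above(z):
    """Percentage of data above a whole-number z position."""
    return round(100 - PERCENT_BELOW[z], 2)

# Example 14 f: between 37 and 44 is between 1 and 2 standard deviations above 30.
print(percent_between(1, 2))                 # 13.5
# Example 15: 792 g is one standard deviation below the mean of 800 g.
print(percent_above(-1))                     # 84.0
print(round(percent_above(-1) / 100 * 200))  # 168 packets
```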
EXERCISE 1E.2
1 a Use the formula to find the standard deviation of the following set of data:
3 3 4 4 5 6 6 7 8 8 9 9
b Check your answer to a using your calculator.
2 Use your calculator to find the standard deviation and variance of the following data:
25.6, 32.8, 24.7, 36.0, 32.1, 30.9, 34.4, 27.5
3 Find the standard deviation of the data given in the frequency table below.
Number of cars owned by the business   0   1   2   3   4    5    6    7   8   9   10   11
Frequency                              3   4   6   9   12   10   10   8   5   2   1    0
4 The following data are the heights, to the nearest centimetre, of the thirty footballers that
belong to an AFL club.
192 185 189 183 189 191 190 192 198 187 191 194 198 181 189
191 190 187 189 194 198 191 187 196 181 193 187 196 192 178
a Find the i mean, $\bar{x}$ ii standard deviation, $s$ of the height of the footballers
in this club.
b
i Calculate the interval [$\bar{x} - s$, $\bar{x} + s$].
ii What percentage of the heights would be expected to fall in this interval?
iii What percentage of the actual heights fall in this interval?
c What percentage of the actual heights fall in the interval [$\bar{x} - 2s$, $\bar{x} + 2s$]?
What percentage would you expect to fall in this interval?
5 The distribution of weights of 600 g loaves of bread is bell-shaped with a mean weight
of 605 g and a standard deviation of 8 g. What percentage of the loaves can be expected
to have a weight between 597 g and 613 g? (Use the Normal distribution as a model.)
6 [1997 FM CAT 2 Q4]
The distribution of the weight of ice-cream served in a single scoop of Danish Delight is
known to be bell-shaped with a mean of 104 grams and a standard deviation of 2 grams.
The percentage of single scoops of Danish Delight containing less than 100 grams will
be closest to:
A 0%     B 2.5%     C 5%     D 16%     E 95%
7 The diameters of washers produced by a machine have a bell-shaped distribution with
a mean diameter of 10 mm and a standard deviation of 0.3 mm. Using the Normal
distribution as a model, find the percentage of the washers that would have a diameter:
a between 9.7 mm and 10.3 mm
b greater than 10 mm
c greater than 10.6 mm
d between 9.4 and 9.7 mm
e greater than 9.7 mm?
8 The distribution of exam scores for 780 students who sat an exam is Normal with a mean
of 55 and a standard deviation of 15.
a Find the number of students who would be expected to obtain a score:
i greater than 70
ii less than 55
iii less than 25
iv between 70 and 85
b If the pass mark for the exam was 40, then how many students are expected to pass
the exam?
9 The distribution of times taken to swim 50 metres by a group of 16 year-olds is bell-shaped
with a mean of 38 seconds and a standard deviation of 3 seconds. The slowest
16% of the students would be expected to have a swim-time:
A greater than 32 seconds but less than 35 seconds
B less than 32 seconds
C greater than 35 seconds but less than 38 seconds
D greater than 35 seconds
E greater than 41 seconds.
STANDARD SCORES (z-SCORES)
The relative significance of a particular data value can be considered in terms of the number
of standard deviations that it differs from the mean.
This is called the standard score or z-score of the data value, and the process of finding the
standard score is called standardisation. Non-standardised data are often referred to as raw
scores.
CALCULATING STANDARD SCORES
Standard score (z-score) = $\frac{\text{raw score} - \text{mean}}{\text{standard deviation}}$
Example 16
The mean percentage on a mathematics exam is 60 and the standard deviation is 13.
a Find the standard scores for students who, on the exam, scored:
i 82%
ii 45%
iii 73%
b Find the raw score of a student whose standardised score was 0.61.
a Using the formula for standard score:
i standard score = $\frac{82 - 60}{13}$ = 1.69 (2 dec. pl.)
ii standard score = $\frac{45 - 60}{13}$ = −1.15 (2 dec. pl.)
iii standard score = $\frac{73 - 60}{13}$ = 1
b z-score = $\frac{\text{raw score} - \text{mean}}{\text{standard deviation}}$
0.61 = $\frac{\text{raw score} - 60}{13}$
0.61 × 13 = raw score − 60
7.93 + 60 = raw score
raw score = 67.93
So, the student’s raw score would have been 68%.
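Both directions of the calculation in Example 16 can be sketched in Python. The function names `z_score` and `raw_score` are our own; the second function is simply the formula rearranged to make the raw score the subject.

```python
def z_score(raw, mean, sd):
    """Standard score = (raw score - mean) / standard deviation."""
    return (raw - mean) / sd

def raw_score(z, mean, sd):
    """Rearranged formula: raw score = mean + z * standard deviation."""
    return mean + z * sd

# Exam with mean 60 and standard deviation 13.
for raw in (82, 45, 73):
    print(raw, round(z_score(raw, 60, 13), 2))   # 1.69, then -1.15, then 1.0

print(round(raw_score(0.61, 60, 13), 2))         # 67.93
```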
Consider the following example:
The bell-shaped distribution alongside has
mean 35 and standard deviation 10.
[Bell curve of the raw data against x, showing the central 68% and 95% of the data.]
The distribution of the standardised data is
shown alongside.
[Bell curve of the standardised data against z, from −2 to 2, showing the central 68% and 95% of the data.]
Notice that:
• the shape of the distribution is unchanged
• the values on the x-axis have been scaled so that:
  – the 68% of the data within one standard deviation of the mean have z-scores
    between −1 and 1
  – the 95% of the data within two standard deviations of the mean have z-scores
    between −2 and 2
• a standard score of 0 represents a raw score of the same value as the mean
• a positive standard score represents a raw score that is greater than the mean
• a negative standard score represents a raw score that is less than the mean.
These facts are always true when we standardise a bell-shaped distribution.
Example 17
Find the percentage of scores that come from a Normal distribution that will
have a z-score:
a greater than 0
b between −2 and 2
c between 1 and 2
d less than −2
e more than 3.
a A z-score of 0 corresponds to a raw score
of the mean and 50% of the data will have
a value greater than the mean.
b 95% of the data will have a z-score between
−2 and 2.
c $\frac{95 - 68}{2}$ = 13.5% of raw scores will have
a z-score between 1 and 2.
d If 95% of the raw scores have a z-score
between −2 and 2 then 2.5% ($\frac{1}{2}$ of 5%)
will have a z-score less than −2.
e If 99.7% of the raw scores have a z-score between −3 and 3 then
$\frac{100 - 99.7}{2} = \frac{0.3}{2}$ = 0.15% will have a z-score more than 3.
COMPARING RAW SCORES FROM DIFFERENT DATA SETS
Since standard scores:
• keep the relative value of raw scores within a data set
• scale the x-axis of distributions in terms of their standard deviations,
standard scores are useful for comparing scores from different data sets.
Example 18
Archie scored 62% on his Mathematics exam. This exam had a mean of 57 and a
standard deviation of 5. In his English exam Archie scored 75% and this exam had
a mean of 70 and a standard deviation of 6.
In which subject was his relative performance better?
In Maths: standard score = $\frac{62 - 57}{5} = \frac{5}{5} = 1$
In English: standard score = $\frac{75 - 70}{6} = \frac{5}{6} \approx 0.83$
Since Archie’s standard score for Maths was greater than his standard score for
English, his Maths result was further to the right in the distribution of the scores of
the class.
Archie’s relative performance was better in Maths.
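The comparison in Example 18 can be sketched in Python by standardising each result and picking the larger z-score. The dictionary layout and names are illustrative assumptions.

```python
# subject: (raw score, class mean, class standard deviation)
results = {"Maths": (62, 57, 5), "English": (75, 70, 6)}

# Standard score for each subject.
z = {subject: (raw - mean) / sd for subject, (raw, mean, sd) in results.items()}

best = max(z, key=z.get)   # subject with the highest standard score
print({subject: round(score, 2) for subject, score in z.items()})
# {'Maths': 1.0, 'English': 0.83}
print(best)                # Maths
```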
EXERCISE 1E.3
1 Find the standard scores for the following raw scores that come from a set of data that
has a mean of 6.4 and a standard deviation of 2.
a 10     b 5.2     c 12     d 6.5
2 A raw score from a data set has a z-score of −0.85. If the data set has a mean of 50
and a standard deviation of 5.6, find the value of the raw score.
3 A raw score of 72 has a z-score of 1.25. If the standard deviation from the data set is
8, find the mean of the data.
4 A raw score of 20 has a z-score of −1.6. If the mean of the data set is 28, find the
standard deviation.
5 Peter has had four Mathematics tests for the year and his results and the class averages
and standard deviations are given in the table below.
Test    Peter’s mark    Class average    Standard deviation
1       58              60               7
2       72              65               12
3       68              60               10
4       78              72               9
a Calculate Peter’s standard score for each test.
b In which test did Peter perform best?
6 The semester English exam results for four students are given in the table below.
Student    Semester 1    Semester 2
David      70            65
Rodney     54            58
Gavan      92            75
Daniel     75            70
If the mean was 60 for both exams and the standard deviation was 15 for the
Semester 1 exam and 8 for the Semester 2 exam:
a Which of the students improved their performance from Semester 1 to Semester 2?
b Which student improved the most?
c Which student’s performance was the most consistent for the year?
7 For a set of data that has a bell-shaped distribution, find the percentage of raw scores
that have a z-score:
a less than 0
b between −1 and 1
c greater than 2
d between −1 and 0
e between −1 and −2
f between 0 and 3
THE BOXPLOT (BOX-AND-WHISKER PLOT)
F
A boxplot is a visual display of some of the descriptive statistics of a set of data, namely its
minimum and maximum values, the median and the upper and lower quartiles. These five
statistics form what is called the five-number summary of the data set.
CONSTRUCTING A BOXPLOT
A boxplot (box-and-whisker plot) is constructed above a number line (labelled and scaled)
which is drawn so that it covers all the data values in the data set.
The boxplot is drawn with a rectangular ‘box’ representing the middle half of the data. The
‘box’ goes from the lower quartile to the upper quartile.
The ‘whiskers’ extend from the ‘box’ to the maximum value and to the minimum value.
A vertical line marks the position of the median in the ‘box’.
For example, for the data set 2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9:
The ordered data set is 1, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 7, 7, 8, 9 (15 data).
There are 7 values on either side of the median (Q2); Q1 is the middle of the lower 7 values
and Q3 is the middle of the upper 7 values.
The minimum is 1.
The lower quartile (Q1) is the 4th value, 3.
The median (Q2) is the 8th value, 5.
The upper quartile (Q3) is the 12th value, 7.
The maximum is 9.
These 5 statistics form the five-number summary.
[Figure: boxplot of the data above a number line from 1 to 9, with the minimum, lower quartile,
median, upper quartile, maximum and the two whiskers labelled.]
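The five-number summary above can be sketched in Python. This sketch assumes the method used in the example: the quartiles are the medians of the lower and upper halves of the ordered data, leaving out the middle value when the number of data is odd. Other software may use a different quartile rule and give slightly different values. The function name `five_number_summary` is illustrative.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]                                   # values below the median position
    upper = values[half + 1:] if n % 2 else values[half:]   # values above the median position
    return (values[0], median(lower), median(values), median(upper), values[-1])

data = [2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9]
print(five_number_summary(data))
```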
Using the graphics calculator to find descriptive statistics and construct a boxplot
Press STAT and choose 1:Edit.
Enter the data from the example above into L1: 2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9
Statistical graphs are drawn using STAT PLOT,
which is located above the Y= key.
Press 2nd Y= to use it.
Press ENTER to use Plot 1.
Turn the plot On by pressing ENTER, then use the arrow
keys to choose the boxplot icon and press ENTER.
Press ZOOM 9 to draw the boxplot.
TRACE can be used to locate the statistics of the
five-number summary. The arrow keys move backwards and forwards between them.
[Calculator screen: the cursor is on the median.]
INTERPRETING A BOXPLOT
A set of data with a symmetric distribution will have a symmetric boxplot.
For example:
[Figure: histogram (y from 0 to 8) and boxplot of a symmetric distribution, both on an x-axis from 10 to 20.]
The whiskers of the boxplot are the same length and the median line is in the centre of the
box.
A set of data which is positively skewed will have a positively skewed boxplot.
For example:
[Figure: histogram (y from 0 to 10) and boxplot of a positively skewed distribution, both on an x-axis from 1 to 8.]
The right whisker is longer than the left whisker and the median line is to the left of the
centre of the box.
A set of data which is negatively skewed will have a boxplot that appears stretched to the
left.
For example:
[Figure: histogram and boxplot of a negatively skewed distribution, both on an x-axis from 1 to 9.]
The left whisker is longer than the right and the median line is to the right of the centre of the box.
Example 19
A boxplot has been drawn to show the distribution of marks (out of 100) in a test
for a particular class:
[Figure: boxplot of the marks above an axis labelled ‘score on test’, scaled from 0 to 100.]
a What was the highest mark scored for this test?
b What was the median test score for the class?
c What is the range of marks scored for this test?
d What percentage of students scored 60 or more for the test?
e What was the lowest mark scored?
f What is the interquartile range for this test?
g The top 25% of students scored a mark between ...... and ......
h If you scored 70 on this test, would you be in the top 50% of students in the
class?
i Comment on the symmetry of the distribution of marks.
a The highest score corresponds to the end of the upper whisker, so the
highest mark scored was 98.
b The median corresponds to the vertical line inside the box, which is at 73.
c The range = maximum score − minimum score = 98 − 30 = 68
d The score of 60 corresponds to the lower quartile.
25% of the students have a score less than or equal to the lower quartile so 75%
scored 60 or more.
e The lowest score corresponds to the end of the lower whisker, so the lowest
score was 30.
f The interquartile range = upper quartile − lower quartile = 82 − 60 = 22
g The top 25% of scores correspond to the upper whisker.
So, the top 25% of students scored a mark between 82 and 98.
h The top 50% of students had a mark greater than or equal to the median of 73.
You would not be in the top 50% of students if you scored 70 for the test.
i The distribution of test scores is stretched to the left, and is therefore negatively
skewed. The lower whisker is longer than the upper whisker and the median is
not in the centre of the box but further towards the upper end.
The distribution is therefore not symmetrical.
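The readings in Example 19 can be turned into a small Python sketch. The five values are those read from the boxplot in the example; the rule used to label the shape (compare the whisker lengths and the position of the median in the box) is a rough illustration of the comments in part i, not a formal test.

```python
# Five-number summary read from the boxplot in Example 19.
minimum, q1, med, q3, maximum = 30, 60, 73, 82, 98

data_range = maximum - minimum   # part c
iqr = q3 - q1                    # part f
lower_whisker = q1 - minimum     # length of the left whisker
upper_whisker = maximum - q3     # length of the right whisker

# Rough shape check (an assumption for illustration only).
if lower_whisker > upper_whisker and (med - q1) > (q3 - med):
    shape = "negatively skewed"
elif upper_whisker > lower_whisker and (q3 - med) > (med - q1):
    shape = "positively skewed"
else:
    shape = "roughly symmetric or unclear"

print("range:", data_range, "IQR:", iqr, "shape:", shape)
print("Is a score of 70 in the top 50%?", 70 >= med)
```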
TESTING FOR OUTLIERS
Outliers are extraordinary data that are either much larger or much smaller than the main
body of the data.
There are several tests that identify outliers. One commonly used test involves the following
calculation of ‘boundaries’:
The upper boundary = upper quartile + 1.5 × IQR.
Any data larger than this number is an outlier.
The lower boundary = lower quartile − 1.5 × IQR.
Any data smaller than this value is an outlier.
When outliers exist, the ‘whiskers’ of a boxplot extend to the last value that is not an outlier.
Each outlier is marked with an asterisk; it is possible to have more than one outlier at either end.
Example 20
Draw a boxplot for the following data, identifying any outliers.
1, 3, 7, 8, 8, 5, 9, 9, 12, 14, 7, 1, 4, 8, 16, 8, 7, 9, 10, 13, 7, 6, 8, 11, 17, 7
The ordered data is:
1, 1, 3, 4, 5, 6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 10, 11, 12, 13, 14, 16, 17
There are 26 values, so there are 13 values in each half of the ordered data.
lower quartile = 7
median = 8
upper quartile = 10
The five-number summary is:
minimum value is 1
lower quartile is 7
median is 8
upper quartile is 10
maximum value is 17
These statistics can also be found using the calculator.
IQR = 10 − 7 = 3
The upper boundary = upper quartile + 1.5 × IQR = 10 + 1.5 × 3 = 14.5
The lower boundary = lower quartile − 1.5 × IQR = 7 − 1.5 × 3 = 2.5
Values outside the interval [2.5, 14.5] are outliers. Hence the two outliers at the
upper end are the data values 16 and 17, and the two at the lower end are both the
data value 1.
We now have all the information to draw the boxplot:
[Figure: boxplot above an axis labelled ‘variable’, scaled from 0 to 17. The figure shows how two
outliers of the same value are marked, and each whisker is drawn to the last value that is not
an outlier.]
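The outlier test in Example 20 can be sketched in Python as follows. The quartiles are found as the medians of the lower and upper halves of the ordered data, as in the example; the function names `quartiles` and `outlier_boundaries` are illustrative only.

```python
from statistics import median

def quartiles(data):
    """Return (lower quartile, median, upper quartile) using the median-of-halves method."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    upper = values[half + 1:] if n % 2 else values[half:]
    return median(lower), median(values), median(upper)

def outlier_boundaries(q1, q3):
    """Return (lower boundary, upper boundary) using 1.5 x IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [1, 3, 7, 8, 8, 5, 9, 9, 12, 14, 7, 1, 4, 8, 16, 8, 7, 9,
        10, 13, 7, 6, 8, 11, 17, 7]
q1, med, q3 = quartiles(data)
low, high = outlier_boundaries(q1, q3)

outliers = sorted(x for x in data if x < low or x > high)
inside = [x for x in data if low <= x <= high]

print("boundaries:", low, high)
print("outliers:", outliers)
print("whiskers end at:", min(inside), "and", max(inside))
```

The last line gives the values to which the whiskers are drawn, that is, the last values at each end that are not outliers.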
Using the calculator to draw the boxplot in Example 20 above, we begin by entering the
data in L1.
Use STAT PLOT by pressing 2nd Y=.
Press ENTER to use Plot 1.
Turn the plot On then use the arrow keys to choose the
‘boxplot with outliers’ icon.
Then press ENTER.
Press ZOOM 9 to draw the boxplot.
Note that only one of the outliers at 1 appears on the
screen.
Press TRACE and use the arrow keys to move the cursor
through the summary statistics. Note that both values at 1
are included.
Note: You may wonder why we would need both the boxplot and the stemplot or
histogram. Each complements the other and shows slightly different things.
Boxplots provide an excellent display of the summary statistics, while stemplots and
histograms illustrate the shape of the distribution more accurately.
Consider the following example:
[Figure: a boxplot and a stemplot of the same set of heights (cm); stemplot leaf unit: 0.1 cm.]
These graphs display the same distribution.
The boxplot displays the summary statistics, while the stemplot
reveals the bimodal nature of the distribution. Hence both
graphics are of value.
EXERCISE 1F
1 The following boxplot summarises the heights of the players in an AFL team.
[Figure: boxplot with one outlier, above an axis of height (cm) scaled from 165 to 215.]
Use the boxplot to find:
a the median height of the team
b the range of heights of the team (ignoring the outlier)
c the height that 75% of the team are taller than
d the height of the player that is an outlier
e the interquartile range of the heights.
2 Find the five-number summary (minimum, lower quartile, median, upper quartile, maximum) for each of the following data sets, and construct a boxplot for the data.
a Essendon’s game scores for the year 2000 (not including the finals):
156, 130, 124, 137, 123, 144, 140, 127, 106, 132, 145, 169, 119, 89, 108,
89, 167, 159, 165, 109, 81, 97
b
Number of toothpicks    33    34    35    36    37    38    39
Frequency                1     5     7    13    12     8     2
c The daily maximum temperature (°C) in Melbourne for the month of March 2001:
Stem    Leaf
1
1*      7 8
2       0 0 0 2 2 2 2 3 3 3 4 4
2*      5 5 6 7 8 8
3       0 0 1 2 2 3
3*      5 8 8 9 9
2 | 4 represents 24°C
3 A set of data has a lower quartile of 31.5, median of 37, and upper quartile of 43.5.
a Calculate the interquartile range for this data set.
b Calculate the boundaries that identify outliers.
c Which of the data 22, 13.2, 60, 65 would be outliers?
4 The boxplot below shows the distribution of weights of a sample of Jack Russell terriers:
[Figure: boxplot above an axis of weight (kg) scaled from 4 to 10.]
Which one of the following would not be true for this data?
A The interquartile range is more than 1.5 kg.
B The heaviest 25% of the dogs all weighed more than 8 kg.
C The median weight was 7 kg.
D At least 75% of the weights were more than 6 kg.
E The lightest 25% weighed less than or equal to 6.2 kg.
5 The boxplot below shows the distribution of taxi fares for 50 trips taken from Melbourne
Airport.
[Figure: boxplot above an axis of fare ($) scaled from 15 to 45.]
a Find:
i the median fare
ii the range of fares
iii the IQR of fares.
b Write a sentence describing the distribution of the data, mentioning each of the
statistics from a.
c Complete the following:
i Approximately ...... % of fares were greater than $32.
ii The minimum fare was $ ......
iii 75% of the fares were greater than $ ......
6 Match the histograms A, B, C , D and E to the boxplots I, II , III , IV and V .
[Figures: the displays A to E (histograms of frequency against x, and a stemplot with leaf unit 0.1)
and the boxplots I to V.]
RANDOM SAMPLES
G
When we conduct a statistical survey, it is important that our data reflects the whole population.
If data is to be collected from a sample then the sample must accurately represent the population. Otherwise, reliable conclusions about the population cannot be made. Samples must
be chosen so that the results will not show bias towards a particular outcome.
The sample size is also an important feature to be considered if conclusions about the population are to be made from the sample.
For example:
Measuring a group of three fifteen-year-olds would not give a very reliable estimate of the
height of fifteen-year-olds all over the world. We therefore need to choose a random sample
that is large enough to represent the population. Note that conclusions based on a sample
will never be as accurate as conclusions made from the whole population, but if we choose
our sample carefully, they will be a good representation.
CHOOSING A RANDOM SAMPLE
In a simple random sample, every member of the population has an equal chance of being
chosen, and each member is chosen independently of any other member.
Random samples can be chosen using coins, dice, numbered tokens, random number tables,
or random number generators on computers or calculators.
For example:
Suppose you wish to choose Tattslotto numbers. The population of numbers is the integers 1
to 45 inclusive and you are going to choose a ‘sample’ of six different numbers. How could
you choose these numbers randomly?
Three possible methods:
1 Number forty five pieces of paper, place them in a container and select six pieces of
paper without looking.
2 Use a random number table (Table 1).
Table 1
39634 62349 74088 65564 16379 19713 39153 69459 17986 24537
14595 35050 40469 27478 44526 67331 93365 54526 22356 93208
30734 71571 83722 79712 25775 65178 07763 82928 31131 30196
64628 89126 91254 24090 25752 03091 39411 73146 06089 15630
42831 95113 43511 42082 15140 34733 68076 18292 69486 80468
80583 70361 41047 26792 78466 03395 17635 09697 82447 31405
00209 90404 99457 72570 42194 49043 24330 14939 09865 45906
05409 20830 01911 60767 55248 79253 12317 84120 77772 50103
95836 22530 91785 80210 34361 52228 33869 94332 83868 61672
15358 70469 87149 89509 72176 18103 55169 79954 72002 20582
The digits in the table are generated by computer in groups of five for easy reading. You
can start anywhere in the table and move across or down.
To choose numbers between 1 and 45, you need to look at two digits at a time. If the
digits are 04 then the chosen number is ‘4’. If the digits give a number greater than 45
then you ignore it. If you get a repeat of a number then you will also ignore it.
Starting in the top left hand corner and going across, (crossing out the inappropriate
numbers until you have six numbers) the numbers are:
39, 63, 46, 23, 49, 74, 08, 86, 55, 64, 16, 37, 91, 97, 13
Your chosen numbers would be 39, 23, 8, 16, 37 and 13.
3 Use the random number generator on the
calculator.
This can be found in the MATH menu as follows:
Press MATH then the left arrow key to select PRB.
Choose 5:randInt(.
This will bring randInt( to the screen.
We need to type in the range of random
integers that we are considering, i.e., 1 to 45.
Press 1 , 4 5 ).
Pressing ENTER repeatedly will give random integers
between 1 and 45.
In this case the first six numbers were all different
numbers, so these are the randomly chosen Tattslotto
numbers. If numbers were repeated, we would generate
more until we had six different ones.
You could also type in the sample size of 6, i.e., randInt(1,45,6).
However, if this gave repeats in the sample,
you would need to repeat the procedure.
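Methods 2 and 3 can be imitated in Python. The sketch below is an illustration under stated assumptions: `pick_from_digits` is a hypothetical helper that reads a row of the random number table two digits at a time, ignoring numbers outside the range and repeats, and `random.sample` stands in for the calculator's generator (unlike pressing ENTER repeatedly, it never returns a repeat).

```python
import random

def pick_from_digits(groups, low, high, count):
    """Read the digit groups two digits at a time; keep numbers from low to high, no repeats."""
    digits = "".join(groups)
    chosen = []
    for i in range(0, len(digits) - 1, 2):
        number = int(digits[i:i + 2])       # e.g. "08" gives 8
        if low <= number <= high and number not in chosen:
            chosen.append(number)
            if len(chosen) == count:
                break
    return chosen

# Start of the top row of the random number table, reading across.
top_row = ["39634", "62349", "74088", "65564", "16379", "19713"]
table_choice = pick_from_digits(top_row, 1, 45, 6)
print(table_choice)

# Random number generator: six different integers from 1 to 45.
generator_choice = random.sample(range(1, 46), 6)
print(sorted(generator_choice))
```

The first result matches the numbers chosen from the table above; the second changes each time the code is run.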
Example 21
The table below gives the monthly sales figures, in thousands of dollars, for a shop over
a six year period.
Month        2000   2001   2002   2003   2004   2005
January      43.1   48.7   45.7   44.0   48.6   46.3
February     38.2   35.3   36.4   38.3   37.7   40.2
March        38.6   36.0   36.2   34.8   35.3   33.3
April        40.2   40.9   42.4   42.5   43.8   35.7
May          43.2   44.2   47.0   48.7   50.3   52.4
June         27.8   32.3   33.5   34.1   32.2   35.8
July         26.4   27.2   23.5   27.2   27.7   28.1
August       23.8   24.9   24.8   27.6   26.1   28.2
September    27.4   30.8   32.7   33.6   34.9   35.1
October      40.4   39.3   38.7   41.3   42.4   44.9
November     68.3   67.4   67.3   69.8   70.4   72.6
December     81.2   83.9   84.6   85.5   88.3   87.2
a Choose a year at random.
b Choose a month at random.
c Choose three consecutive years.
d Choose a period of three consecutive years (36 months) starting with any month.
a There are six years from which to choose. We
could use a die to randomly choose one of these
years; the year 2000 would be represented by 1,
2001 by 2, ......, 2005 by 6.
Alternatively, we could use the random number generator
on a calculator. The randomly chosen year is
2004.
b There are twelve months from which we need
to choose one month.
We use the calculator, with 1 representing
January, 2 representing February, etc.
The randomly chosen month is November.
c To choose three consecutive years, we need to establish the number of sets of
three consecutive years that are possible:
1 2000 - 2002
2 2001 - 2003
3 2002 - 2004
4 2003 - 2005
There are four possibilities, from which we
have to choose one. Using the calculator, the
randomly chosen period is number 3, 2002 to 2004.
d To choose a period of three consecutive years starting with any month, we need
to establish the number of sets that are possible:
1 Jan 2000 - Dec 2002
2 Feb 2000 - Jan 2003
3 Mar 2000 - Feb 2003
...
37 Jan 2003 - Dec 2005
There are thirty seven possibilities, from which we have to choose one.
Using the calculator, the randomly chosen period is number 11, November 2000 to
October 2003.
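The counting in parts c and d of Example 21 can be checked with a short Python sketch. It lists every possible period, numbers them from 1, and then picks one number at random. The helper name `consecutive_periods` is illustrative, and `random.randint` stands in for the calculator's generator.

```python
import random

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
YEARS = range(2000, 2006)  # 2000 to 2005 inclusive

def consecutive_periods(n_months=36):
    """List (first month, last month) for every run of n_months consecutive months."""
    labels = [(month, year) for year in YEARS for month in MONTHS]  # 72 months
    return [(labels[i], labels[i + n_months - 1])
            for i in range(len(labels) - n_months + 1)]

year_periods = [(y, y + 2) for y in range(2000, 2004)]  # part c
periods = consecutive_periods()                         # part d
print(len(year_periods), "sets of three consecutive years")
print(len(periods), "periods of 36 consecutive months")
print("period 11:", periods[10])  # list index 10 is period number 11

chosen = random.randint(1, len(periods))  # a random period number from 1 to 37
print("randomly chosen period", chosen, ":", periods[chosen - 1])
```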
TO CHOOSE A SIMPLE RANDOM SAMPLE:
1 State the sample size needed.
2 State the number of possibilities from which you can choose, and number them if necessary.
3 State the random number generator that you are using.
4 Explain what you will do if repeated random numbers are not applicable.
5 State the random number(s) chosen and the data that is now in your sample.
EXERCISE 1G
1 Use the random number table (Table 1), starting at the top left corner and working
down, to:
a select a random sample of six different numbers between 1 and 45 inclusive
b select a random sample of 5 different numbers between 100 and 499 inclusive.
2 Use your calculator to:
a select a random sample of six different numbers between 5 and 25 inclusive
b select a random sample of 10 different numbers between 1 and 25 inclusive.
3 The following calendar for 2006 shows the weeks of the year. Each of the days is
numbered.
Using a random number generator, choose a sample from the calendar of:
a five different dates
b a complete week starting with a Monday
c a month
d three different months
e three consecutive months
f a four week period starting on a Saturday
g a four week period starting on any day.
Explain your method of selection in each case.
[Calendar for 2006, summarised. In the calendar each date is listed with its day of the week and,
in brackets, its day number in the year, and the weeks are labelled Wk 1 to Wk 53.]
Month        Starts on    Day numbers
January      Su           (1) to (31)
February     We           (32) to (59)
March        We           (60) to (90)
April        Sa           (91) to (120)
May          Mo           (121) to (151)
June         Th           (152) to (181)
July         Sa           (182) to (212)
August       Tu           (213) to (243)
September    Fr           (244) to (273)
October      Su           (274) to (304)
November     We           (305) to (334)
December     Fr           (335) to (365)
BIVARIATE DATA
Many statistical investigations involve analysing the relationship between two variables. We
call the data in these investigations bivariate data. The way that bivariate data is analysed
depends on whether the data is categorical or numerical.
In this chapter we will study the display and analysis of bivariate data where:
• one variable is a categorical variable and the other is a numerical variable
• both variables are categorical
• both variables are numerical.
For any pair of variables, one of the pair is described as the dependent or response variable,
while the other is the independent or explanatory variable.
The dependent variable responds to changes in the independent variable.
The independent variable explains the changes in the dependent variable.
For example, the number of children in a family influences the type of car they have, but not
the other way around. The type of car is therefore the dependent variable and the number of
children is the independent variable.
COMPARING ONE CATEGORICAL
AND ONE NUMERICAL VARIABLE
A
BACK-TO-BACK STEMPLOTS
If the categorical variable has only two categories then a back-to-back stemplot is useful. It
is a visual display that enables easy analysis and comparison of the data.
Consider this example:
An office worker has the choice of travelling to work by tram or train. He has recorded the
travel times from recent journeys on both of these types of transport. He wishes to know
which type of transport is quicker and which is the more reliable.
Recent tram journey times (minutes):
21, 25, 18, 13, 33, 27, 28, 14, 18, 43, 19, 22, 30, 22, 24
Recent train journey times (minutes):
23, 18, 16, 16, 30, 20, 21, 18, 18, 17, 20, 21, 28, 17, 16
A back-to-back stemplot could be used to display the relationship between the categorical variable type of transport which has two categories (or levels), and the numerical variable travel time.
The type of transport is the independent variable and the travel time is the dependent variable,
because the travel time depends on the type of transport.
A back-to-back stemplot is constructed with only one stem. The leaves are
grouped on either side of this central stem. The ordered back-to-back stemplot
for the data is shown below:
Train leaf | Stem | Tram leaf
  88877666 |  1   | 34889
    831100 |  2   | 1224578
         0 |  3   | 03
           |  4   | 3
The most frequently occurring travel times by train were between 10 and 20 minutes whereas
the most frequently occurring travel times by tram were between 20 and 30 minutes.
The median train travel time is 18 minutes and the median tram travel time is 22 minutes.
This supports the observation that train journeys are generally shorter.
The range of the train travel times is 30 − 16 = 14 minutes while the range of the tram
travel times is 43 − 13 = 30 minutes.
The interquartile range of travel times for the train is 21 − 17 = 4 minutes, while the IQR
for tram travel times is 28 − 18 = 10 minutes.
Comparison of these measures of spread indicates that the train travel times are less ‘spread
out’ than the tram travel times. The train travel times are therefore more predictable or
reliable.
In conclusion, it is generally quicker and the travel times are more reliable if the worker
travels by train to work.
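The statistics quoted for the tram and train data can be reproduced with a Python sketch. It assumes the quartiles are the medians of the lower and upper halves of the ordered data, which agrees with the values above; the function name `five_number_summary` is illustrative.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    upper = values[half + 1:] if n % 2 else values[half:]
    return (values[0], median(lower), median(values), median(upper), values[-1])

tram = [21, 25, 18, 13, 33, 27, 28, 14, 18, 43, 19, 22, 30, 22, 24]
train = [23, 18, 16, 16, 30, 20, 21, 18, 18, 17, 20, 21, 28, 17, 16]

results = {}
for name, times in (("tram", tram), ("train", train)):
    low, q1, med, q3, high = five_number_summary(times)
    results[name] = {"median": med, "range": high - low, "IQR": q3 - q1}
    print(name, results[name])
```

A smaller median points to the quicker type of transport, and a smaller range and IQR point to the more reliable one.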
EXERCISE 2A.1
1 The heights (to the nearest centimetre) of Year 10 boys and girls in a school are being
investigated. The sample data are as follows:
Boys:  164 168 175 169 172 171 171 180 168 168 166 168 170 165 171 173 187
       179 181 175 174 165 167 163 160 169 167 172 174 177 188 177 185 167
       160
Girls: 165 170 158 166 168 163 170 171 177 169 168 165 156 159 165 164 154
       170 171 172 166 152 169 170 163 162 165 163 168 155 175 176 170 166
a What are the two variables in this investigation? Classify the variables as categorical
or numerical, dependent or independent.
b Construct a back-to-back stemplot for the data.
c Find the statistics in the five-number summaries for each of the data sets.
d Compare and comment on the distributions of the data, mentioning the shape, centre
and spread and quoting statistics to support your statements.
2 A new cancer drug is being developed and is being tested on rats. Two groups of twenty
rats with cancer were formed; one group was given the drug while the other was not.
The survival time of each rat in the experiment was recorded up to a maximum of 192
days.
Survival times of rats that were given the drug:
64 78 106 106 106 127 127 134 148 186
78 106 106 192* 192* 192* 192* 192* 192* 64
Survival times of rats that were not given the drug:
37 38 42 43 43 43 43 43 51 51
55 57 59 62 66 69 48 86 49 37
* denotes that the rat was still alive at the end of the experiment
a What are the variables in this investigation? Classify the variables as categorical or
numerical, dependent or independent.
b Construct a back-to-back stemplot for the data and find the statistics that make up
the 5-number summaries.
c Compare and comment on the distributions of the data, mentioning the shape, centre
and spread and quoting statistics to support your statements.
3 Peter and John are competing taxi-drivers who wish to know who earns more money.
They have recorded the amount of money (in dollars) collected per hour for five hours
over five days:
Peter: 17.3 11.3 15.7 18.9 9.6 13 19.1 18.3 22.8 16.7 11.7 15.8
12.8 24 15 13 12.3 21.1 18.6 18.9 13.9 11.7 15.5 15.2 18.6
John: 23.7 10.1 8.8 13.3 12.2 11.1 12.2 13.5 12.3 14.2 18.6 18.9
15.7 13.3 20.1 14 12.7 13.8 10.1 13.5 14.6 13.3 13.4 13.6 14.2
a Construct a back-to-back stemplot for the data and find the statistics that make up
the 5-number summaries.
b Compare and comment on the distributions of the data, mentioning the shape, centre
and spread and quoting statistics to support your statements.
4 The residue that results when a cigarette is smoked collects in the filter. The residue
from twenty cigarettes from the two different brands was measured, giving the following
data, in milligrams:
Brand X: 1.62 1.55 1.59 1.56 1.56 1.55 1.63 1.59 1.56 1.69
1.61 1.57 1.56 1.55 1.62 1.61 1.52 1.58 1.63 1.58
Brand Y: 1.61 1.62 1.69 1.62 1.60 1.59 1.66 1.55 1.61 1.62
1.64 1.61 1.58 1.57 1.57 1.57 1.58 1.60 1.63 1.59
a Copy and complete the back-to-back stemplot for this data:
Brand Y | Stem | Brand X
        | 150  |
        | 152  | 2
        | 154  | 555
        | 156  | 66667
        | 158  | 8899
        | 160  | 11
        | 162  | 2233
        | 164  |
        | 166  |
        | 168  | 9
156 includes values 1.56 and 1.57
b Comment on and compare the shape of the distributions.
PARALLEL BOXPLOTS
Parallel boxplots are used to display and compare data where one of the variables is numerical
and the other is a categorical variable with two or more categories.
For example:
If additional car travel time data is available to the office worker in the tram and train example above,
we can use parallel boxplots to compare the data. They help us decide which type of transport is the quickest to get him to work and which is the most reliable.
Car travel times (minutes): 30, 21, 19, 17, 24, 28, 23, 25, 25, 16, 18, 19, 29, 22
The categorical variable type of transport now has three categories and is the independent
variable.
Ordering the car travel times we get: 16, 17, 18, 19, 19, 21, 22, 23, 24, 25, 25, 28, 29, 30
The 5-number summary is:
min = 16, max = 30, median = $\frac{22 + 23}{2} = 22.5$, lower quartile = 19, upper quartile = 25
The three boxplots are drawn on the one axis:
[Figure: parallel boxplots for car, train and tram (a categorical variable with three categories)
above a single axis of travel time in minutes (the numerical variable), scaled from 10 to 45.]
The car travel times have almost the same spread (range = 14 mins, IQR = 6 mins) as the
train travel times (range = 14 mins, IQR = 4 mins), suggesting that the car travel time is as
reliable as the train travel time.
However, the train travel times include two outliers which may be due to extraordinary events.
If these are ignored then the range of travel times for the train would be 7 minutes, which is
considerably less than the ranges for the car and tram.
The median car travel time is 22.5 minutes, compared to 18 minutes for the train and 22
minutes for the tram, so it is still generally quicker to travel by train.
In conclusion: From the data given, it is generally quicker and more reliable to travel by
train than it is by either tram or car.
Using the graphing calculator to graph parallel boxplots
Press STAT and choose 1:Edit. Press ENTER.
The data for each of the transport types is entered
in separate lists.
Press 2nd Y= to select STAT PLOT.
The three boxplots can be drawn on the screen at
the same time by turning each of them On.
Make sure that the ‘boxplot with outliers’ icon
and the correct list are selected for each plot.
ZOOM 9 will bring the graphs to the screen.
TRACE, then the arrows, can be used to find
‘5-number summary’ values on the screen.
General rules for interpreting and comparing the distribution of bivariate data:
1 Comment on the shape of the distributions (symmetric, positively skewed, negatively
skewed, outliers).
2 Comment on and compare the centres of the data (median and mean).
3 Comment on and compare the spread of the data (range, interquartile range).
EXERCISE 2A.2
1 The percentage scores on a SAC for three classes of Further Mathematics students have
been recorded and the distribution of results for the three classes are summarised on the
graph below:
[Figure: parallel boxplots for class A, class B and class C; horizontal axis: score on SAC (%), from 0 to 100.]
a In which class was:
i the highest mark scored
ii the lowest mark scored?
b Comment on the shape of the distribution of marks in each of the classes.
c Comment on and compare the centre of the scores for the classes.
d Comment on and compare the spread of the scores for the classes.
2 [VCAA FM 2001 Q6]
[Figure: parallel boxplots of age (years), on an axis from 0 to 45, for female (n = 26) and male (n = 23) elephants.]
A conservation park in Thailand is home to 49 elephants, of which 26 are females and
23 are males. The parallel boxplots above show the distribution of their ages by sex.
Based on the information contained in the parallel boxplots, which one of the following
statements is incorrect?
A The youngest elephant is male.
B There are fewer female elephants under the age of 15 years than male elephants
under the age of 15 years.
C There are no female elephants over the age of 40 years.
D The median age of the female elephants is approximately the same as the median
age of the male elephants.
E Approximately 25% of the male elephants are 30 years of age or older.
3 The daily maximum temperatures in Melbourne for June 21st and December 21st (the
solstices) are being compared. The data for the 20 years from 1981 to 2000 is given
below:
June 21st:     13.6, 10.6, 19.1, 14.2, 12.2, 11.9, 18.3, 14.9, 14.6, 15.1,
               17.4, 13.5, 16.7, 14.0, 11.1, 17.0, 15.4, 16.3, 15.6, 16.3
December 21st: 24.2, 19.4, 21.4, 22.7, 21.4, 20.0, 22.3, 21.1, 18.9, 23.5,
               21.3, 23.0, 28.1, 20.3, 17.2, 35.0, 33.7, 21.9, 21.4, 38.6
a What are the variables in this investigation? Classify the variables as categorical or
numerical, dependent or independent.
b Find the statistics that make up the 5-number summaries and construct parallel
boxplots for the data.
c Compare and comment on the distributions of the data, mentioning the shape, centre
and spread and quoting statistics to support your statements.
4 Using the data from question 4, Exercise 2A.1, find five-number summaries and construct
parallel boxplots to summarise the distributions of residue for the two types of cigarettes.
What conclusions can be made from comparing the boxplots? Support your statements
with statistics.
5 Plant fertilisers come in many different brands, but there are essentially two types:
organic and inorganic. A student was interested to discover whether radish plants responded better to organic or inorganic fertiliser. He prepared three identical plots of
ground, named plots A, B and C, in his mother’s garden, and planted 40 radish seeds
in each plot. After planting, each plot was treated in an identical manner, except for the
way they were fertilised. Cost prevented him using a variety of fertilisers, so he chose
one organic and one inorganic fertiliser. Plot A received no fertiliser, plot B received the
organic fertiliser as prescribed on the packet, and plot C received the inorganic fertiliser
as prescribed on the packet. The student was interested in the weight of the root that
forms under the ground.
The data supplied below is the weight of the root (measured to the nearest gram) of the
individual plants:
Data from plot A:
27 39 32 29 38 30  9 50 34 10
34 22  8 41 36 39 36 40 42 12
32 14 32 35 32 35 30 42 38 25
Data from plot B:
51 47 45 54 58 58 56 56 34 41
63 50 66 47 54 47 48 46 48 48
53 52 47 34 29 20 46 28 33
Data from plot C:
55 69 68 76 70 68 65 76 63 61
43 54 67 70 61 69 62 72 68 60
58 64 58 77 76 79 66 59 65 65
56 75 47 79 60 50 70 39
a Produce parallel boxplots for the data.
b Compare and comment on the distributions of the weights of the root for each
plot, mentioning the shape, centre and spread and quoting statistics to support your
statements.
B TWO CATEGORICAL VARIABLES
Two-way frequency tables are used to demonstrate the relationship between two categorical
variables. Percentaged segmented barcharts give a visual display of the data.
TWO-WAY FREQUENCY TABLES
In two-way frequency tables, the independent variable fills the columns.
Example 1
A town council is considering bringing in a rule banning the drinking of alcohol in
public places. A random survey of 60 residents gave the following results:
Of the 35 women surveyed, 20 were in favour of the rule. However only 11 of the
men were in favour of it.
a Construct a two-way frequency table to summarise these findings.
b Construct a two-way percentaged frequency table and answer the following:
i What percentage of those surveyed were female?
ii What percentage of those surveyed were in favour of the proposal?
iii What percentage of the females surveyed were in favour of the proposal?
c Do the results of the survey support the theory that females would be more in
favour of this rule than males?
The two categorical variables involved in this question are:
Gender: Male or Female
Opinion about rule: In favour or Against
Opinion about rule depends on gender so the variable gender is the independent
variable.
a
                     Gender
Opinion        Male   Female   Total
In favour       11      20       31
Against         14      15       29
Total           25      35       60
b The two-way percentaged frequency table is:
                               Gender
Opinion        Male                    Female
In favour      11/25 × 100 = 44%       20/35 × 100 = 57%
Against        14/25 × 100 = 56%       15/35 × 100 = 43%
Total          100%                    100%
i 35/60 × 100 = 58.33% of those surveyed were female.
ii 31/60 × 100 = 51.67% of those surveyed were in favour of the rule.
iii 57% of the females were in favour of the rule.
c 57% of the females surveyed were in favour of the proposed rule compared with
44% of the males, a difference of 13 percentage points. The results support the theory.
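The percentages in Example 1 can be reproduced with a short Python sketch. The dictionary layout and the function name `column_percentages` are our own choices for illustration; each column (the independent variable, gender) is percentaged against its own total.

```python
# Two-way frequency table from Example 1: one inner dict per gender (column)
counts = {
    "Male": {"In favour": 11, "Against": 14},
    "Female": {"In favour": 20, "Against": 15},
}

def column_percentages(table):
    """Convert each column of counts to whole-number percentages of its total."""
    result = {}
    for group, cells in table.items():
        total = sum(cells.values())
        result[group] = {
            answer: round(100 * count / total) for answer, count in cells.items()
        }
    return result

percent = column_percentages(counts)
print(percent["Male"])     # {'In favour': 44, 'Against': 56}
print(percent["Female"])   # {'In favour': 57, 'Against': 43}

# Percentages of the whole sample, as in parts i and ii
surveyed = sum(sum(cells.values()) for cells in counts.values())
female_share = round(100 * sum(counts["Female"].values()) / surveyed, 2)
in_favour_share = round(
    100 * sum(cells["In favour"] for cells in counts.values()) / surveyed, 2
)
print(female_share, in_favour_share)   # 58.33 51.67
```

Percentaging down the columns is what allows the male and female responses to be compared fairly even though the two groups are of different sizes.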
PERCENTAGED SEGMENTED BARCHARTS
The percentaged frequency table in Example 1 can be graphed using a percentaged segmented barchart:
[Figure: percentaged segmented barchart; horizontal axis: gender (male, female); vertical axis: percentage, from 0 to 100; segments: in favour, against.]
EXERCISE 2B
1 A survey of Victorians was recently conducted to ascertain their interest in AFL
football. The data was presented in the following two-way percentaged frequency table:
                          Gender
Level of interest    Male   Female   Total
Very interested       28      18       22
Somewhat              25      19       21
Not very              19      20       20
Not at all            28      43       37
Total                100     100      100
a Use the table to find:
i the percentage of those surveyed who are very interested in football
ii the percentage of women who are either very or somewhat interested in football.
b Construct a percentaged segmented barchart that compares the interest in Australian
Rules for men and women.
c Does the data support the theory that gender influences the level of interest in AFL
football? Quote percentages to support your statement.
2 A survey of sixteen-year-old students revealed that 32 of the 48 boys and 23 of the 37
girls played a team sport outside school.
a Copy and complete the two-way frequency table shown:
                                        Gender
Play team sport outside school?    Boys   Girls   Total
Yes
No
Total
b Find the percentage of all the students who play a team sport outside school.
c Find the percentage of girls who play a team sport outside school.
d Construct a two-way percentaged frequency table.
e Do the figures support the theory that more boys than girls play a team sport outside
school? Quote some percentages to support your statement.
3 A market research company is contracted to investigate the age of people
who listen to the three radio stations, A, B or C, in a city. The results of their
survey are given in the table alongside:
                    Age group
Station    < 30   30 - 60   > 60   Total
A           35      30      200
B           40      83       68
C          175      37      132
Total
a Complete the Totals row and column in the table alongside.
b Why do we need a two-way percentaged frequency table to help analyse the data?
c Construct the two-way percentaged frequency table.
d Compare and comment on which age groups listen to which radio station.
4 The two-way percentaged frequency table below was produced to show
the labour force status of parents from one-parent families.
Labour force status        Father   Mother
Employed full-time          48.6     16.8
Employed part-time          13.3     27.2
Unemployed                   8.3      8.9
Not in the labour force     29.8     47.0
Total                      100.0    100.0
(Source: ABS June 2002 Labour Force Survey)
a What are the variables in this survey? Classify them as categorical
or numerical, independent or dependent.
b Construct a percentaged segmented barchart to illustrate the data.
c What conclusions can be made from this table and graph?
Support your statements with percentages from the table.
5 A polling agency wants to test the theory that in a particular municipality, “more of the
female residents vote for female candidates”. A random sample of eighty residents in the
municipality were asked their voting preference, either Smith the female candidate, or
Jones the male candidate. Of the 35 female residents in the sample, 20 said they would
vote for Smith, whereas 25 of the male residents said they would vote for Jones.
a Fill in the missing values on the two-way frequency table below.
                           Gender
Voting intention    Male   Female   Total
Smith                        20
Jones                25
Total                        35       80
b Construct a two-way percentaged frequency table for the data.
c Use the figures in the table to comment on the validity of the theory.
C TWO NUMERICAL VARIABLES
Scatterplots are used to demonstrate and visualise the relationship between two numerical
variables.
The data is plotted as points on a graph where the independent variable is the horizontal axis
and the dependent variable is the vertical axis.
CONSTRUCTING A SCATTERPLOT
The pattern formed by the points on a scatterplot indicates the strength of the relationship
between the two variables.
For example:
The relationship between weight and height of members
of an AFL football team is being investigated.
We expect there to be a fairly strong association between
these variables as it is generally perceived that the taller
a person is, the more they will weigh.
The height and weight of each of the players in the team
is recorded and these values form a coordinate pair for
each of the players:
Player   Height (cm)   Weight (kg)
  1         203           106
  2         189            93
  3         193            95
  4         187            86
  5         186            85
  6         197            92
  7         180            78
  8         186            84
  9         188            93
 10         181            84
 11         179            86
 12         191            92
 13         178            80
 14         178            77
 15         186            90
 16         190            86
 17         189            95
 18         193            89
Before a scatterplot is constructed you need to establish which of the variables is the independent variable and which is the dependent variable.
In this case we assume that weight depends on height and so weight is the dependent variable
and height is the independent variable.
The points are therefore plotted as coordinate pairs (height, weight) for
the individuals in the investigation.
[Figure: scatterplot ‘Weight versus Height’; horizontal axis: height (cm), from 175 to 205; vertical axis: weight (kg), from 75 to 105.]
Using the calculator to construct a scatterplot
Press STAT and choose 1:Edit. Press ENTER.
Enter the data into lists. The independent variable
should be L1 and the dependent variable should be L2.
Press 2nd Y= to select STAT PLOT.
Press ENTER.
Turn the plot On and select the scatterplot icon.
The XList is for the independent variable L1 and
the YList is for the dependent variable L2.
Press ZOOM 9 to view the scatterplot.
You can press TRACE and use the arrow keys to
identify the points.
INTERPRETATION OF A SCATTERPLOT
There are four aspects that we need to consider:
1 Direction
Positive association
The points generally go up as x increases,
similar to a straight line with positive gradient.
“As the independent variable (x) increases,
the dependent variable (y) also increases.”
Negative association
The points generally go down as x increases,
similar to a straight line with negative gradient.
“As the independent variable (x) increases,
the dependent variable (y) decreases.”
2 Form
In the scatterplots above, the points are generally in a straight line. The relationship
between the variables is said to be linear.
These scatterplots show relationships which are not linear.
3 Strength
If the points form a well-ordered pattern then the strength of the association is said to
be strong.
For example:
Strong positive
Strong negative
Strong non-linear
If the points form a pattern which is less well defined, then the strength is said to be
moderate.
For example:
Moderate positive
Moderate negative
If the points are scattered but a general pattern is still discernible then the association is
said to be weak.
For example:
Weak positive
Weak negative
If the points appear to be randomly scattered then
there is no association between the variables.
An example of this is shown opposite.
4 Outliers
Outliers stand out from the general body of data.
The example opposite shows a “moderate positive
association with one outlier”.
Outliers should be checked to ensure they are genuine outstanding data and not errors
in the data or errors in plotting. A decision can be made to ignore them as they will
influence correlation measures and models fitted to the data, but this should only be done
after careful consideration.
We can interpret the Weight versus Height scatterplot
from earlier as follows:
“There is a moderate positive association between the
variables height and weight. This means that as height
increases, weight increases. The relationship appears
linear and there are no obvious outliers.”
[Figure: the Weight versus Height scatterplot; horizontal axis: height (cm); vertical axis: weight (kg).]
EXERCISE 2C
1 For each of the following, state whether you would expect to find positive, negative,
or no association between the following variables. Indicate the strength (none, weak,
moderate or strong) of the association.
a Shoe size and height.
b Speed and time taken for a journey.
c The number of occupants in a household and the water consumption of the household.
d Maximum daily temperature and the number of newspapers sold.
e Age and hearing ability.
2 Copy and complete the following:
a If the variables x and y are positively associated then as x increases, y ..........
b If there is negative association between the variables m and n then as m increases,
n ..........
c If there is no association between two variables then the points on the scatterplot
appear to be .......... ..........
3 For each of the scatterplots below, state:
i whether there is positive, negative or no association between the variables
ii the strength of the association between the variables (zero, weak, moderate or
strong)
iii whether the relationship between the variables appears to be linear or not
iv the presence of outliers.
[Figures: six scatterplots, labelled a to f, each plotting y against x.]
4 Consider the data:
x    1   2   3   4   5   6   7   8   9  10
y    2   1   4   3   5   6   5   5   7   8
a Construct a scatterplot for the data.
b State whether the association between the variables is:
i positive, negative or no association
ii weak, moderate or strong
iii linear or not.
5 The following data was collected by a milkbar owner over fifteen consecutive days:
Max. daily temp. (°C):    29  40  35  30  34  34  27  27  19  37  22  19  25  36  23
No. of ice-creams sold:  119 164 131 152 206 169 122 143  63 208 155  96 125 248 139
a Which of the two variables is the independent variable?
b Construct a scatterplot of the data.
c Interpret the scatterplot in terms of the variables, mentioning direction, strength,
linearity and outliers.
6 A class of 25 students was asked to record their times (in minutes) spent preparing for
a test. The table below gives the score that they achieved on the test and the recorded
preparation time.
Score:                     25  31  30  38  55  20  39  47  35  45  32  33  34
Minutes spent preparing:   75  30  35  65 110  60  40  80  56  70  50 110  18
Score:                     38  17  38  17  17  26  41  50  30  45  36  23
Minutes spent preparing:   80  22  30  15  10  85 100  60  55  80  50  75
a Which of the two variables is the independent variable?
b Construct a scatterplot of the data.
c Interpret the scatterplot in terms of the variables, mentioning direction, strength,
linearity and outliers.
D CORRELATION
Correlation is a statistical word that means relationship or association. We can talk about
the correlation/relationship/association between two variables and mean the same thing.
PEARSON’S CORRELATION COEFFICIENT r
The correlation between two numerical variables can be measured by a correlation coefficient.
There are several correlation coefficients that can be used, but the most widely used coefficient
is Pearson’s correlation coefficient, named after the statistician Karl Pearson who developed
it. Its full name is Pearson’s product-moment correlation coefficient, and it is denoted r.
For a set of n bivariate numerical data with variables x and y, Pearson’s correlation
coefficient is:
$$r = \frac{1}{n-1} \sum \left( \frac{x - \bar{x}}{s_x} \right) \left( \frac{y - \bar{y}}{s_y} \right)$$
where $\bar{x}$ and $\bar{y}$ are the means of the x and y data respectively and $s_x$ and $s_y$ are
their standard deviations.
This formula is tedious to use, so in all situations you will be using your calculator to find r.
INTERPRETATION OF PEARSON’S CORRELATION COEFFICIENT
Pearson’s correlation coefficient gives a measure of the relationship between two variables on
a scale from −1 to 1. Word descriptors based on r-values are open to debate,
and the majority of texts on this subject do not include them. Texts and Internet sites
vary in the advice they give. Here is one possible interpretation.
r               Description                       r                 Description
1               perfect positive correlation      −1                perfect negative correlation
0.75 to 1       strong positive correlation       −1 to −0.75       strong negative correlation
0.50 to 0.75    moderate positive correlation     −0.75 to −0.50    moderate negative correlation
0.25 to 0.50    weak positive correlation         −0.50 to −0.25    weak negative correlation
0 to 0.25       almost no correlation             −0.25 to 0        almost no correlation
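The word descriptors above can be written as a small lookup function. This is only a sketch of the one possible interpretation given in the table; the function name `describe_r` is our own, and because the bands in the table share their end values, the sketch assumes that a boundary value such as 0.75 belongs to the stronger band.

```python
def describe_r(r):
    """Return the word descriptor for a value of Pearson's r (one possible scheme)."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    if size < 0.25:
        return "almost no correlation"
    if size == 1:
        strength = "perfect"
    elif size >= 0.75:      # assumption: boundary values go to the stronger band
        strength = "strong"
    elif size >= 0.50:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"

print(describe_r(0.8))     # strong positive correlation
print(describe_r(-0.6))    # moderate negative correlation
print(describe_r(0.1))     # almost no correlation
```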
Notes about Pearson’s correlation coefficient:
• It is designed for linear data only.
• It should be used with caution if there are outliers.
For example, the data in the two scatterplots below both have a correlation coefficient
of r = 0.8. The presence of the outlier in the second graph has greatly reduced the
r value. Without this point, r would equal 1.
[Figure: two scatterplots of y against x. In the second, the points lie on a straight line apart from one point labelled ‘outlier’.]
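The effect of a single outlier on r can be explored with the following Python sketch. The data are made up for illustration and are not read from the scatterplots in the text. The function uses the form r = Sxy / √(Sxx × Syy), which is algebraically equivalent to the formula given earlier.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r written as Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: six points lying exactly on the line y = x + 1
xs = [2, 4, 6, 8, 10, 12]
ys = [3, 5, 7, 9, 11, 13]
r_line = pearson_r(xs, ys)

# The same points with one made-up outlier, (14, 2), added
r_with_outlier = pearson_r(xs + [14], ys + [2])

print(round(r_line, 4))
print(round(r_with_outlier, 4))
```

Running the sketch shows r equal to 1 for the points on the line and a much smaller value once the outlier is included, which is why a scatterplot should always be checked before r is quoted.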
Using the calculator to find Pearson’s correlation coefficient
The first step is to activate the diagnostic tools on the calculator. Once turned on these will
remain on, but if the memory is cleared or battery changed then the calculator will revert
back to the default functions that do not include r.
To activate the diagnostic tools:
Locate the menu CATALOG using 2nd 0.
Use the arrow keys to scroll down to
DiagnosticOn and press ENTER.
DiagnosticOn will appear on the screen.
Press ENTER and you will have turned the diagnostic tools on.
We consider finding Pearson’s correlation
coefficient for the following data:
x    1   2   3   4   5   6   7   8   9  10
y    2   1   4   3   5   6   5   5   7   8
Enter the data into lists, the x-data into L1
and the y-data into L2.
We check the scatterplot at this stage as it will
reveal any errors made in entering the data, and
any outliers. It will also indicate whether the
data is linear.
Press STAT then the right arrow to select CALC and choose
4:LinReg(ax+b).
(This means we are fitting a linear model or
linear regression of the form y = ax + b to
the data.) Regression will be discussed in greater
detail in Chapter 3.
LinReg(ax+b) appears on the screen. You
need to tell the calculator where your data is:
Enter L1, L2 by pressing 2nd 1 , 2nd 2
ENTER.
The linear regression screen appears and the last
figure r = 0.9130... is Pearson’s correlation
coefficient for this data set.
The r value indicates a strong positive correlation, which agrees with the scatterplot.
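As a check on the calculator result, the formula for r can be applied directly in Python. This is a minimal sketch using the ten (x, y) pairs from this worked example; the function name `pearson_r` is our own, and `statistics.stdev` is used because it gives the sample standard deviation (divisor n − 1) that the formula requires.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r from the definition: average product of standardised scores."""
    n = len(xs)
    mean_x, mean_y = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations
    total = sum(
        ((x - mean_x) / s_x) * ((y - mean_y) / s_y) for x, y in zip(xs, ys)
    )
    return total / (n - 1)

x = list(range(1, 11))               # 1, 2, ..., 10
y = [2, 1, 4, 3, 5, 6, 5, 5, 7, 8]
r = pearson_r(x, y)
print(round(r, 4))                   # 0.913
```

The value agrees with the figure shown on the linear regression screen.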
CAUSATION
When analysing data, we must be aware of causation. A high degree of correlation between two
variables does not necessarily imply that a change in one variable causes the other to change.
For example:
1 The heights and reading speeds of children were measured and a strong positive correlation was found. Does this mean that increasing height makes you read faster or that
increasing your reading speed will cause you to grow? These suggestions are obviously
not sensible. The strong correlation results because both variables are closely associated
with age. As age increases, both the variables height and reading speed increase. It is
age which causes height and reading speed to increase.
2 The number of television sets sold in Ballarat and
the number of stray dogs collected in Bendigo were
recorded over several years and a strong positive
association was found between the variables.
Obviously the number of television sets sold in
Ballarat was not influencing the number of stray
dogs collected in Bendigo. Both variables have
simply been increasing over the period of time that
their numbers were recorded.
If a change in one variable causes a change in the other variable then we say that a causal
relationship exists between them.
For example:
The age and height of a group of children is measured and there is a strong positive correlation
between these variables. This will be a causal relationship because an increase in age will
cause an increase in height.
EXERCISE 2D
1
a Use your calculator to find Pearson’s correlation coefficient for the data given in
question 5, Exercise 2C.
Max. daily temp. (°C):    29  40  35  30  34  34  27  27  19  37  22  19  25  36  23
No. of ice-creams sold:  119 164 131 152 206 169 122 143  63 208 155  96 125 248 139
b Interpret the value of r in terms of strength and direction.
c Does the value of the correlation coefficient confirm your observations from the
scatterplot? Was it appropriate to find r for this data? Explain.
2
a Use your calculator to find Pearson’s correlation coefficient for the data given in
question 6, Exercise 2C:
Minutes spent preparing:   75  30  35  65 110  60  40  80  56  70  50 110  18
Score:                     25  31  30  38  55  20  39  47  35  45  32  33  34
Minutes spent preparing:   80  22  30  15  10  85 100  60  55  80  50  75
Score:                     38  17  38  17  17  26  41  50  30  45  36  23
b Interpret the value of r in terms of strength and direction.
c Does the value of the correlation coefficient confirm your observations from the
scatterplot? Was it appropriate to find r for this data? Explain.
3 [VCAA FM 2000 Q5]
The scatterplot alongside shows the birth
rate and the average food intake for 14
different countries.
[Figure: scatterplot of birth rate (per 100 000), from 20 to 50, against average food intake (1000 calories per person), from 1.7 to 2.7.]
The value of the product moment correlation coefficient, r, for this data is closest
to:
A −0.6
B −0.2
C 0.2
D 0.6
E 0.9
4 Which one of the following is true for Pearson’s correlation coefficient r?
A The addition of an outlier to a set of data would always result in a lesser value
of r.
B An r value of 1 represents a stronger relationship between the variables than an
r value of −1.
C A high value of r means that one variable is causing the other variable to change.
D An r value of −0.8 means that as the independent variable increases, the dependent variable will tend to decrease.
E It can take values between 0 and 1 inclusive.
5 The following pairs of variables were measured and a strong positive correlation between
them was found. Discuss whether a causal relationship exists between the variables. If
not, suggest a third variable to which they may both be related.
a The lengths of one’s left and right feet.
b The damage caused by a fire and the number of firemen who attend it.
c Company expenditure on advertising, and sales.
d The height of parents and the height of their adult children.
e The number of hotels and the number of churches in rural towns.
E THE COEFFICIENT OF DETERMINATION
In a bivariate set of numerical data, the coefficient of determination gives us a means of
measuring the influence that one variable has over the other variable.
Coefficient of determination = r² = (Pearson’s correlation coefficient)²
CALCULATION OF THE COEFFICIENT OF DETERMINATION
r² is found on the linear regression screen of
your calculator as shown opposite.
Alternatively, if the value of r is known, then
this can simply be squared.
INTERPRETATION OF THE COEFFICIENT OF DETERMINATION
r² indicates the strength of association between the dependent or response
variable and the independent or explanatory variable.
If there is a causal relationship then r² indicates the degree to which change in the explanatory
variable explains change in the response variable.
For example: An investigation into many different brands of muesli found that there is strong
positive correlation between the variables fat content and kilojoule content.
Pearson’s correlation coefficient, r, was found to be 0.8625.
The coefficient of determination for this study is (0.8625)² = 0.7439.
An interpretation of this r² value is “the proportion of variation in kilojoule content that can
be explained by the variation in fat content of muesli is 0.7439.”
It is usual to quote the coefficient of determination as a percentage. A proportion of 0.7439 is
equivalent to 0.7439 × 100 = 74.39%.
The interpretation becomes:
74.39% of the variation in kilojoule content of muesli (the dependent variable) can be explained
by the variation in fat content of muesli (the independent variable).
If 74.4% of the variation in kilojoule content of muesli can be explained by the fat content of
muesli then we can assume that the other 25.6% (100% − 74.4%) of the variation in kilojoule
content of muesli can be explained by other factors (which may or may not be known).
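The arithmetic in the muesli example can be reproduced in a few lines of Python. This is only a sketch of the calculation; the variable names are our own.

```python
r = 0.8625                      # Pearson's correlation coefficient from the example

r_squared = r ** 2              # coefficient of determination
explained = 100 * r_squared     # as a percentage
unexplained = 100 - explained   # variation left to other factors

print(round(r_squared, 4))      # 0.7439
print(round(explained, 2))      # 74.39
print(round(unexplained, 1))    # 25.6
```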
Note:
• Since −1 ≤ r ≤ 1, 0 ≤ r² ≤ 1.
• If r = −0.625 then r² = (−0.625)² = 0.3906, a positive value.
• It is only appropriate to use r² values, like r values, in situations where there is a
linear relationship between the two variables.
• r² values of 10% or more are worth mentioning.
• If you are finding an r value from an r² value then you must consider that the r
value can be positive or negative. The solutions to r² = a are r = √a and
r = −√a. Your calculator will only give you a positive value.
Example 2
A study has found that 45% of the variation in selling price can be explained by the
variation in age of a used car.
If this statement was based on the coefficient of determination then what would be the
value of Pearson’s correlation coefficient for this study?
We are told that r² = 0.45, so r = √0.45 or r = −√0.45 (√0.45 ≈ 0.6708).
At this point we need to consider the variables involved: selling price and age of a
car. We would assume that as the age of a car increases then the selling price of a car
would decrease, i.e., there is negative correlation between the variables.
Hence we can conclude for this study that Pearson’s correlation coefficient, r,
will be −0.6708.
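The reasoning in Example 2 can be sketched in Python. The square root only gives the size of r, so the direction of the association has to be supplied from the context; the function name `r_from_r_squared` and its `negative_association` parameter are our own.

```python
from math import sqrt

def r_from_r_squared(r_squared, negative_association):
    """Recover r from r squared, taking the sign from the direction of association."""
    magnitude = sqrt(r_squared)          # always the positive root
    return -magnitude if negative_association else magnitude

# Selling price tends to fall as age rises, so the association is negative
r = r_from_r_squared(0.45, negative_association=True)
print(round(r, 4))                       # -0.6708
```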
EXERCISE 2E
1 In an investigation the coefficient of determination for the variables preparation time
and exam score is found to be 0.5624. Complete the following interpretation of the
coefficient of determination:
...... % of the variation in .......... can be explained by the .......... in preparation time.
2 For each of the following find the value of the coefficient of determination correct to
four decimal places, and interpret it in terms of the variables.
a An investigation has found the association between the variables time spent gambling
and money lost has an r value of 0.4732.
b For a group of children a product-moment correlation coefficient of −0.365 is found
between the variables heart rate and age.
c In a study of a sample of countries, Pearson’s correlation coefficient for the variables
female literacy and gross domestic product is found to be 0.7723.
3 A study of the relationship between stress levels and productivity has produced a product-moment
correlation coefficient of 0.5629. Which one of the following would be an
interpretation that could be made from this study?
A 56.3% of the variation in productivity can be explained by the variation in stress
levels.
B 75% of the variation in productivity can be explained by the variation in stress
levels.
C 31.7% of the variation in productivity is caused by the variation in stress levels.
D 56.3% of the variation in productivity is caused by the variation in stress levels.
E 31.7% of the variation in productivity can be explained by the variation in stress
levels.
4 A rural school has investigated the relationship between the time spent travelling to
school (minutes) and a student’s year ten average (%) for a sample of students.
The results are given in the table below:
Travel time (mins):    10 33 18 43 34 30 24 47 44 41 17 45 39 31 23 11 14 25 16 17
Year 10 average (%):   51 78 97 56 90 70 64 67 37 46 95 67 31 57 43 99 98 82 40 67
a Construct a scatterplot of the data and interpret the scatterplot.
b Find Pearson’s correlation coefficient for the data and interpret.
c Calculate the coefficient of determination and interpret this in terms of the variables.