Ultimate GCSE Statistics Revision Guide

Updated for the 2010 exam
Contents
Exam questions by topic 2004-2009
Types of data
Sampling
Scatter graphs
Averages & standard deviation from a table
Interquartile range & Outliers & Box plots
Cumulative frequency curves
Histograms
Time series
Index numbers
Spearman’s Rank correlation coefficient
Misleading graphs
Reading and interpreting from a table of Statistics
Questionnaires
Odds
Venn diagrams
Simulation
Experimental probability
Probability tree diagrams
Conditional probability
Binomial distribution
Standardised scores
Normal distribution
Choropleth graphs
Comparative Pie Charts/Diagrams
Examination Questions
2004
Section A
Q | Topic | Marks
1(a) | Pie charts | 1
1(b) | Misleading graphs (pie charts) | 1
2(a) | Advantages & Disadvantages of a random sample | 2
2(b) | Description of systematic sample | 2
3 | Reading / interpreting data | 4
4(a) | Rounding errors | 1
4(b) | Composite bar charts | 3
5(a) | Spearman’s rank | 3
5(b) | Interpretation of rank | 2
6 | Index numbers | 4
7 | Standardised scores | 6
8 | Choropleth graphs | 6
TOTAL | | 35
Section B
Q | Topic | Marks
9 | Scatter graphs | 11
10 | Stem and leaf diagrams & outliers | 9
11 | Venn diagrams & conditional probability | 6
12(a) | Census | 1
12(b) | Sample over census | 2
12(c) | Selecting a sample (stratified) | 2
12(d) | Pilot survey | 2
12(e) | Biased questions | 1
12(f) | Advantages & Disadvantages of interviewing | 2
13 | Normal distribution & s.d. limits | 8
14(ai) | Mean from a grouped frequency table | 2
14(aii) | Standard deviation from a grouped frequency table | 3
14(b) | Histograms | 3
14(c/d) | Suitable distribution | 6
15 | Binomial distribution | 7
TOTAL | | 65
TOTAL FOR PAPER | | 100
2005
Section A
Q | Topic | Marks
1(a) | Types of data (Quantitative, continuous..) | 2
1(b) | Taking a stratified sample | 2
2(a) | Sampling frames | 1
2(b) | Suitable sampling method, convenience, cluster, systematic | 1
3(ai) | Two way table probability | 1
3(aii) | Two way table conditional probability | 1
3(b) | Interpreting probabilities | 2
4(a) | Composite bar charts | 3
4(b) | Interpreting a composite bar chart | 2
5 | Misleading graphs | 3
6 | Weighted index numbers | 7
7 | Simulation & random numbers | 4
8(a) | Census | 1
8(b) | Determining what the population is | 1
8(c) | Advantage of closed questions | 1
8(d) | Questionnaires | 3
TOTAL | | 35
Section B
Q | Topic | Marks
1 | Spearman's rank | 6
2 | Interpreting graphs | 5
3 | Time series - moving averages | 8
4(a) | IQR | 2
4(b) | Outliers | 2
4(c) | Box plots | 2
4(d) | Suitable distribution | 2
4(e) | Interpreting data | 1
5(a) | Normal distribution & s.d. limits | 1
5(b) | Normal distribution & s.d. limits | 1
5(c-f) | Quality assurance charts | 4
6 | Drawing pie charts | 6
7 | Probability tree diagrams & conditional probability | 8
8 | Binomial distribution | 8
9(a) | Histogram | 3
9(b) | Mean from a grouped frequency table | 3
9(c) | Standard deviation from a grouped frequency table | 3
TOTAL | | 65
TOTAL FOR PAPER | | 100
2006
Section A
Q | Topic | Marks
1(a-b) | Grouped frequency tables & finding mean | 3
1(c) | Suitable average | 1
2 | Two way table probability | 6
3(a-b) | Odds - probability | 2
3(c) | Estimation - probability | 1
4(a) | Random sample description | 1
4(b-c) | Random numbers & simulation | 4
5 | Interpreting data | 4
6 | Comparative pie charts | 3
7 | Index numbers | 7
8(a) | Normal distribution & s.d. limits | 1
8(b) | Quality assurance | 2
TOTAL | | 35
Section B
Q | Topic | Marks
1 | Spearman's rank | 6
2(a) | Sample over census | 2
2(b) | Suitable sampling method | 1
2(c) | Closed questions | 2
2(d) | Pilot survey | 2
2(e) | Questionnaires | 2
3(a) | Mean & S.D. from a table | 3
3(b) | Interpreting data | 2
4 | Probability tree diagrams, Conditional probability | 8
5 | Scatter diagrams | 11
6(a-b) | Stem and leaf diagrams | 3
6(c) | Outliers | 3
6(d-e) | Box plots | 5
6(f) | Improving a study - accuracy | 1
7 | Binomial distribution | 7
8 | Time series & moving averages | 7
TOTAL | | 65
TOTAL FOR PAPER | | 100
2007
Section A
Q | Topic | Marks
1 | Interpreting data | 4
2(a) | Comparative pie charts | 2
2(b) | Stratified sample - reasons | 1
2(c) | Stratified sample | 1
3 | Probability estimation | 3
4(a) | Advantage of census | 1
4(b) | Advantage of closed questionnaires | 1
4(c) | Pilot surveys | 1
4(d) | Questionnaires | 2
5 | Spearman’s rank | 5
6 | Types of data (Quantitative, continuous..) | 4
7 | Standardised scores | 5
8(a) | Suitable sampling method | 1
8(b) | Normal distribution & s.d. limits | 1
8(c) | Normal distribution & s.d. limits | 3
TOTAL | | 35
Section B
Q | Topic | Marks
1 | Time series | 6
2 | Box plots, skewness, outliers | 13
3 | Scatter diagrams | 14
4(a-c) | Probability tree diagrams | 8
4(d-f) | Binomial distribution | 5
5(a) | Mean from a frequency table | 3
5(b) | Histogram | 3
5(c) | Median from a histogram (interpolation) | 2
5(d) | Suitable distribution | 2
6 | Index numbers & geometric mean | 9
TOTAL | | 65
TOTAL FOR PAPER | | 100
2008
Section A
Q | Topic | Marks
1 | Misleading graphs - pie charts | 2
2 | Interpreting data | 5
3 | Population pyramids | 3
4 | Averages from a table | 6
5(a) | Advantages of a pilot study | 2
5(b) | Questionnaires | 1
5(c) | Random sample - description | 2
5(d) | Types of data (Quantitative, continuous..) | 2
6 | Index numbers | 3
7 | Mean & S.D. from a table | 5
8 | Normal distribution & s.d. limits | 4
TOTAL | | 35
Section B
Q | Topic | Marks
1 | Box plots & skewness | 9
2 | Venn diagrams & conditional probability | 7
3 | Scatter diagrams | 7
4(a) | Disadvantages of a census | 2
4(b) | Suitable sample | 2
4(c) | Closed questionnaires | 2
5 | Binomial distribution | 7
6 | Standardised scores | 7
7 | Spearman’s rank | 6
8(a-b) | Histograms | 4
8(c) | Median from histogram (interpolation) | 3
8(d) | Suitable distribution | 1
9 | Time series | 8
TOTAL | | 65
TOTAL FOR PAPER | | 100
2009
Section A
Q | Topic | Marks
1 | Misleading graphs - pie charts & calculating an angle | 3
2 | Interpreting data | 5
3 | Random numbers & simulation | 4
4 | Interpreting data | 5
5 | Time series | 5
6 | Odds - probability & estimation | 3
7 | Standardised scores | 5
8 | Mean & S.D. from a table | 5
TOTAL | | 35
Section B
Q | Topic | Marks
1(a) | Advantage of sample over census | 2
1(b) | Suitable sample | 1
1(c) | Questionnaires | 3
1(d) | Advantages of a pilot study | 2
2 | Scatter diagrams | 9
3 | Spearman’s rank | 7
4(a)(b) | Composite bar charts | 5
4(c)-(e) | Chain Index numbers & geometric mean | 7
5 | Stem and Leaf & Box plots & skewness, Outliers | 12
6 | Normal distribution & s.d. limits | 9
7 | Tree diagrams & Binomial distribution | 8
TOTAL | | 65
TOTAL FOR PAPER | | 100
2010 likely topics
• Venn diagrams
• Spearman’s rank
• Normal distribution
• Time series (inc average seasonal variation)
• Choropleth graphs
• Standardised scores
• Binomial distribution
• Outliers
• Scatter diagrams
• Weighted index numbers (see q6 2005 A)
Types of data
Quantitative – numerical data such as time, age, height
Qualitative – non-numerical data such as opinions, favourite subjects, gender
Numerical data can be either discrete or continuous.
Discrete data jumps from one value to the next, and the values in between have no
meaning. Examples are shoe size and the number of goals scored at a football match.
Continuous data does not jump from one value to the next, but passes smoothly
through all the values in between. Examples are time and height.
Data that is collected by or for the person who is going to use it is called primary
data.
Data that is not collected by or for the person who is going to use it is called
secondary data.
Sampling
When organisations require data they either use data collected by somebody else
(secondary data), or collect it themselves (primary data). Primary data is usually
collected by SAMPLING, that is, collecting data from a representative SAMPLE of the
population they are interested in.
A POPULATION need not be human. In statistics we define a population as the
collection of ALL the items about which we want to know some characteristics.
Examples of populations are hospital patients, road accidents, pet owners, unoccupied
property or bridges. It is usually far too expensive and too time consuming to
collect information from every member of the population (known as taking a census),
exceptions being the General Election and The Census, so instead we collect it from a
sample.
If it is to be of any use the sample must represent the whole of the population we are
interested in, and not be biased in any way. This is where the skill in sampling lies: in
choosing a sample that will be as representative as possible.
The basis for selecting any sample is the list of all the subjects from which the sample
is to be chosen - this is the SAMPLING FRAME. Examples are the Postcode Address
File, the Electoral register, telephone directories, membership lists, lists created by
credit rating agencies and others, and maps. A problem, of course, is that the list may
not be up to date. In some cases a list may not even exist.
Simple random sampling
A simple random sample gives each member of the population an equal chance of being
chosen. This can be achieved using random number tables.
Systematic sampling
This is random sampling with a system! From the sampling frame, a starting point is
chosen at random, and items are then chosen at regular intervals thereafter. For
example, suppose you want to sample 8 houses from a street of 120 houses. 120/8 = 15,
so every 15th house is chosen after a random starting point between 1 and 15. If the
random starting point is 11, then the houses selected are 11, 26, 41, 56, 71, 86, 101,
and 116.
Cluster sampling
In cluster sampling the units sampled are chosen in clusters, close to each other.
Examples are households in the same street, or successive items off a production line.
The population is divided into clusters, and some of these are then chosen at random.
Within each cluster units are then chosen by simple random sampling or some other
method. Ideally the clusters chosen should be dissimilar so that the sample is as
representative of the population as possible.
Quota sampling
In quota sampling the selection of the sample is made by the interviewer, who has been
given quotas to fill from specified sub-groups of the population. For example, an
interviewer may be told to sample 50 females between the age of 45 and 60.
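The systematic sampling example (8 houses from a street of 120) can be sketched in Python. This is only an illustration: the function name `systematic_sample` and its optional `start` argument are assumptions made for the sketch, not something taken from the guide.

```python
import random

def systematic_sample(population_size, sample_size, start=None):
    """Choose every k-th item after a random start, where k = population_size // sample_size."""
    interval = population_size // sample_size      # 120 // 8 = 15
    if start is None:
        start = random.randint(1, interval)        # random starting point from 1 to the interval
    return [start + i * interval for i in range(sample_size)]

# Fixing the start at 11 reproduces the houses listed in the example.
houses = systematic_sample(120, 8, start=11)
print(houses)  # [11, 26, 41, 56, 71, 86, 101, 116]
```

Leaving `start` unset picks the starting house at random, which is what would be done in practice.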
Stratified sampling
A stratified sample gives a sample in which each stratum is represented in proportion
to its size. We use the formula
$$\text{number sampled from a stratum} = \frac{\text{no. in stratum}}{\text{total no. in population}} \times \text{sample size}$$
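As a rough sketch, the stratified sampling formula can be applied to every stratum at once. The school, its year-group sizes and the function name `stratified_allocation` below are all hypothetical, chosen so that the answers come out as whole numbers.

```python
def stratified_allocation(strata_sizes, sample_size):
    """Apply (no. in stratum / total no. in population) x sample size to each stratum."""
    total = sum(strata_sizes.values())
    return {name: round(size * sample_size / total) for name, size in strata_sizes.items()}

# Hypothetical school of 1000 pupils; a sample of 50 is wanted.
year_groups = {"Year 9": 400, "Year 10": 360, "Year 11": 240}
allocation = stratified_allocation(year_groups, 50)
print(allocation)  # {'Year 9': 20, 'Year 10': 18, 'Year 11': 12}
```

With less convenient numbers the rounded values may not add up to the sample size exactly, so the total should be checked and adjusted by hand.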
Scatter graphs
A typical GCSE Statistics question on scatter graphs will have the following structure:
• Plot some missing points on a scatter graph
• Describe the relationship between the variables
• Draw a line of best fit (through $(\bar{x}, \bar{y})$)
• Use the line of best fit to estimate one variable if given the other. If inside the
data range this is known as interpolation and if outside the data range, this is
known as extrapolation and may not be suitable (as trends may not continue)
• Find the equation of the line of best fit in the form y = ax + b
• State what a and b represent in the context of the question
Finding $(\bar{x}, \bar{y})$, a and b
$(\bar{x}, \bar{y})$ is the average x value and the average y value, so add all the x values
together and divide by how many you have and do the same for y.
To find a, the gradient of the line, pick two points that lie on your LOBF, call them
$(x_1, y_1)$ and $(x_2, y_2)$, then find the difference between the y’s over the difference
between the x’s, i.e.
$$a = \frac{y_2 - y_1}{x_2 - x_1}$$
To find b, look at the y value where your LOBF crosses the y axis
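The calculations for the mean point and the gradient can be sketched in Python. The data and the two points "read off" the line of best fit are hypothetical. As an assumption for the sketch, b is calculated from the mean point rather than read from the graph, which gives the same value when the line passes through $(\bar{x}, \bar{y})$.

```python
def mean_point(xs, ys):
    """Return (x-bar, y-bar), the point the line of best fit should pass through."""
    return sum(xs) / len(xs), sum(ys) / len(ys)

def gradient(point1, point2):
    """Difference in the y's over the difference in the x's for two points on the line."""
    (x1, y1), (x2, y2) = point1, point2
    return (y2 - y1) / (x2 - x1)

# Hypothetical data set.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 6, 9, 12]
x_bar, y_bar = mean_point(xs, ys)      # (3.0, 7.0)

# Two hypothetical points read off a hand-drawn line of best fit.
a = gradient((1, 2.6), (5, 11.4))      # gradient, approximately 2.2
b = y_bar - a * x_bar                  # intercept, approximately 0.4

print(f"y = {a:.1f}x + {b:.1f}")
```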
When a linear model (straight line of best fit) is not appropriate, another model may
be suitable.
Suitable models could be;
Averages & standard deviation from a table
You must be able to find the mean, median, modal class interval, range and standard
deviation from a frequency distribution & a grouped frequency distribution.
The median occurs at the $\frac{n + 1}{2}$ position for a set of n numbers.
The modal class interval is the interval with the largest frequency.
The range is the largest value in the distribution minus the smallest.
How to find the mean and standard deviation using the calculator for a
list of numbers.
Press mode
Select option 2 : STAT
Select option 1 : 1-VAR
Enter your data in the X column
Press SHIFT then 1
and then 5 : VAR
Option 2 will give you the mean, $\bar{x}$,
and option 3 will give you the
standard deviation, xσn.
You can verify this method for finding the mean and standard deviation using the
example on the following page.
How to find the mean and standard deviation using the calculator for a
frequency distribution
You must first switch the FREQ mode on in your calculator.
To do this, go to
Press SHIFT, then Setup
Press the down arrow
Select option 3 : STAT
Select option 1 : ON
You can now follow the same steps as you did for finding the mean and standard
deviation of a list of numbers, the only difference being, you can now also enter the
frequency.
How to find the mean and standard deviation using the calculator for a
grouped frequency distribution
Check – Make sure FREQ is switched on in your calculator (see the steps above).
When you are faced with this screen, you
must enter the midpoints of the intervals
in place of X, you must also write these
down on the exam paper to gain full
marks.
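The midpoint method can be checked away from the calculator with a short Python sketch. The grouped table below is hypothetical, and the function name `grouped_mean_sd` is an assumption. The standard deviation calculated is the one that divides by n, matching xσn on the calculator.

```python
from math import sqrt

def grouped_mean_sd(intervals, frequencies):
    """Estimate the mean and standard deviation of grouped data using interval midpoints."""
    midpoints = [(low + high) / 2 for low, high in intervals]
    n = sum(frequencies)
    mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n
    mean_of_squares = sum(f * x * x for f, x in zip(frequencies, midpoints)) / n
    return mean, sqrt(mean_of_squares - mean ** 2)

# Hypothetical grouped frequency table: midpoints are 5, 15 and 25.
intervals = [(0, 10), (10, 20), (20, 30)]
frequencies = [2, 5, 3]
mean, sd = grouped_mean_sd(intervals, frequencies)
print(mean, sd)
```

For this table the estimated mean is 16 and the estimated standard deviation is 7.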
Interquartile range & Outliers & Boxplots
The IQR is calculated as follows : IQR = UQ – LQ.
The UQ is found ¾ of the way through the data, i.e. at position $\frac{3}{4}(n + 1)$.
The LQ is found ¼ of the way through the data, i.e. at position $\frac{1}{4}(n + 1)$.
To find outliers we work out 1.5 × IQR, then subtract it from the LQ and add it to the
UQ. Any item below LQ − 1.5 × IQR or above UQ + 1.5 × IQR is considered an outlier.
This data can also be shown on a box plot.
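A minimal Python sketch of the quartile and outlier rules is shown here. The data set is hypothetical, and treating a fractional position by interpolating between the two neighbouring values is an assumption of the sketch. With n = 11 the positions are whole numbers, so no interpolation is needed.

```python
def value_at_position(sorted_data, position):
    """Value at a 1-based position; fractional positions are linearly interpolated (assumption)."""
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        return sorted_data[whole - 1]
    return sorted_data[whole - 1] + fraction * (sorted_data[whole] - sorted_data[whole - 1])

def quartiles_and_outliers(data):
    values = sorted(data)
    n = len(values)
    lq = value_at_position(values, (n + 1) / 4)        # position 1/4 (n + 1)
    uq = value_at_position(values, 3 * (n + 1) / 4)    # position 3/4 (n + 1)
    iqr = uq - lq
    low_limit, high_limit = lq - 1.5 * iqr, uq + 1.5 * iqr
    outliers = [v for v in values if v < low_limit or v > high_limit]
    return lq, uq, iqr, outliers

# Hypothetical data, n = 11, so the LQ is the 3rd value and the UQ is the 9th value.
data = [2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 30]
lq, uq, iqr, outliers = quartiles_and_outliers(data)
print(lq, uq, iqr, outliers)  # 5 11 6 [30]
```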
Cumulative frequency curves
Cumulative frequency is a running total. It is calculated by adding up the frequencies
up to that point. Note that the first point that is plotted is the lower boundary of the
first class interval which has a cumulative frequency of 0. Notice also the
characteristic S-shape of the cumulative frequency curve. Draw lines up to the c.f
curve where necessary.
Histograms
With a histogram, it is the area of the bar that represents the frequency.
Along the y axis, frequency density is plotted. The formula can be found in the box
below.
$$\text{Frequency Density} = \frac{\text{Frequency}}{\text{Class Width}}$$
You may need to rearrange this formula to get Frequency as the subject.
$$\text{Frequency} = \text{Frequency Density} \times \text{Class Width}$$
Usually an examination question will have part of the table filled in and part of the
histogram drawn. If you look at the information for a bar that is shown on the
histogram and where the frequency is given in the table, you can work out the
frequency density and hence the scale on the y axis.
In the question shown on the following page, the interval for h from 10 to 15 had its
frequency given in the table as well as its bar drawn, so the frequency density could
be worked out. The scale was then easy to figure out and the rest straightforward to
complete.
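Both directions of the frequency density formula can be sketched in Python. The frequencies and bar heights used here are hypothetical, since the exam question itself is not reproduced in the text.

```python
def frequency_density(frequency, class_width):
    """Height of the bar: frequency divided by class width."""
    return frequency / class_width

def frequency_from_bar(density, class_width):
    """Area of the bar: frequency density multiplied by class width."""
    return density * class_width

# Hypothetical: an interval from 10 to 15 with a frequency of 20.
height = frequency_density(20, 15 - 10)        # 4.0
# Hypothetical: a bar of height 2.5 drawn over an interval from 15 to 25.
missing = frequency_from_bar(2.5, 25 - 15)     # 25.0
print(height, missing)
```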
Finding the median from a histogram (interpolation)
Consider this example.
Previously (in Unit 1) you were asked to find the class interval the median lies in.
The total frequency in this case is 200. Using $\frac{n + 1}{2}$, we find the median occurs at
position $\frac{200 + 1}{2} = 100.5$. If we work out the cumulative frequencies, we find that
by the end of the interval from t = 5 to t = 6 our running total is 94, so 100.5 must be
in the next interval, which runs from t = 6 to t = 8.
If we were drawing a cumulative frequency curve, the points we would plot for these
two intervals would be (6, 94) and (8, 154). We want to find the time (t) when the
cumulative value is 100.5.
Consider how far through the interval the median lies. Between t = 6 and t = 8 the
cumulative frequency rises from 94 to 154, a rise of 60. The value 100.5 is 6.5 above
94, so 100.5 occurs $\frac{6.5}{60}$ of the way through the interval.
We need to find what value of t is $\frac{6.5}{60}$ of the way through the time interval,
i.e. $\frac{6.5}{60}$ of the way between 6 and 8.
$\frac{6.5}{60} \times 2 = 0.216...$, add this to 6 to find our value of t,
6 + 0.216... = 6.216... i.e. about 6.2
Verify the median area for
this question is 64.1 (3s.f)
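The interpolation for the worked example (points (6, 94) and (8, 154), median position 100.5) can be sketched as a small Python function. The function name `interpolate_median` is an assumption made for the illustration.

```python
def interpolate_median(lower_bound, upper_bound, cf_before, cf_after, position):
    """Find the value that lies the same fraction through the interval as the position
    lies through the rise in cumulative frequency."""
    fraction = (position - cf_before) / (cf_after - cf_before)   # 6.5 / 60
    return lower_bound + fraction * (upper_bound - lower_bound)

median_position = (200 + 1) / 2                                  # 100.5
median_time = interpolate_median(6, 8, 94, 154, median_position)
print(round(median_time, 1))  # 6.2
```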
Time series
You will be required in a GCSE Statistics exam to:
• Calculate an n-point moving average
• Plot the moving averages on a time series graph
• Draw a trend line (possibly find the equation of it)
• Describe the trend
• Calculate the mean seasonal variation for a particular quarter
• Use the mean seasonal variation and your trend line to calculate an estimate for
that quarter in the following year
A trend line should go through as many of the moving averages as possible and only
go within the data range (you may have to extend it in a later part of the question).
The trend should be described as increasing, decreasing, fluctuating or no real trend.
Once you have calculated the mean seasonal variation for a given quarter, you can
use it to predict the sales for that quarter in the next year. Your trend line will give an
estimate of what the sales should be and then you just add the mean seasonal
variation and you have your answer.
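The moving average and seasonal variation steps can be sketched in Python. All of the figures below (the quarterly sales and the readings from the trend line) are hypothetical, and the sketch assumes the seasonal variation for a quarter is the actual value minus the trend-line value.

```python
def moving_averages(values, n):
    """n-point moving averages: the mean of each run of n consecutive values."""
    return [sum(values[i:i + n]) / n for i in range(len(values) - n + 1)]

def mean_seasonal_variation(actual_values, trend_values):
    """Average of (actual - trend) for the same quarter in each year."""
    variations = [actual - trend for actual, trend in zip(actual_values, trend_values)]
    return sum(variations) / len(variations)

# Hypothetical quarterly sales for two years.
sales = [10, 14, 20, 12, 12, 16, 22, 14]
averages = moving_averages(sales, 4)
print(averages)  # [14.0, 14.5, 15.0, 15.5, 16.0]

# Hypothetical quarter 3 figures: actual sales and trend-line readings for each year.
q3_variation = mean_seasonal_variation([20, 22], [14.5, 16.5])   # 5.5
# Hypothetical trend-line reading for quarter 3 of the following year.
prediction = 18.5 + q3_variation                                  # 24.0
print(q3_variation, prediction)
```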
Index numbers
What are Index Numbers ? An index number is a statistical measure designed to
follow or track changes over a period of time in the price, quantity or value of an item
or group of items.
Types of Index Numbers
1. PRICE RELATIVE
The Price Relative is the ratio of the price of a commodity at a given time to its price
at a different time - either before or after the given time.
E.g. In January 1980 the price of a bar of soap was 40p, whilst in January 1985 its
price was 60p. If we take January 1980 as the base year, the index for January 1985
is calculated as follows:
$$\text{Index number} = \frac{\text{quantity}}{\text{quantity in base year}} \times 100$$
$$\text{Index number} = \frac{60}{40} \times 100 = 150$$
The percentage sign is usually omitted, and we say that the index is 150 based on
January 1980 which is 100. This indicates that the price of the soap has increased by
50% over the five year period.
If January 1985 was taken as the base period, then the price relative index is now:
$$\text{Index number} = \frac{40}{60} \times 100 = 66.6...$$
i.e. the index is 66.6…% based on January 1985 (which is 100). This indicates that
the price of soap in January 1980 was 66.6… % of the price in January 1985, or
alternatively, the price of soap was 33.3… % less in January 1980.
If information for a series of years is given, then any year can be used as the base
period but it is usually specified in the examination paper.
2. CHAIN BASE INDEX
This is where each year's index number is calculated using the preceding year's value
as the base.
e.g. The prices of a commodity in the years 1994 - 1999 are given below:
Year | Price (pence per kg)
1994 | 150
1995 | 170
1996 | 160
1997 | 180
1998 | 225
1999 | 260
For each year the index number can be calculated - using 1994 = 100
Year | Calculation | Index Number
1994 | | 100
1995 | 170/150 × 100 | 113.3
1996 | 160/170 × 100 | 94.1
1997 | 180/160 × 100 | 112.5
1998 | 225/180 × 100 | 125
1999 | 260/225 × 100 | 115.6
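The price relative and chain base calculations can be reproduced with a short Python sketch. The function names are assumptions made for the illustration; the prices are the ones given in the examples.

```python
def price_relative(price, base_price):
    """Index number = price / price in base year x 100."""
    return price / base_price * 100

def chain_base_indices(prices):
    """Each year's index uses the preceding year's price as the base (rounded to 1 d.p.)."""
    return [round(price_relative(current, previous), 1)
            for previous, current in zip(prices, prices[1:])]

soap_index = price_relative(60, 40)
print(soap_index)  # 150.0

prices = [150, 170, 160, 180, 225, 260]       # 1994 to 1999
indices = chain_base_indices(prices)          # 1995 to 1999
print(indices)  # [113.3, 94.1, 112.5, 125.0, 115.6]
```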
3. WEIGHTED INDEX NUMBER
If you have a product which is made of different materials, each differing in
proportion, then we can calculate an accurate index number for the product based on
the weightings of the materials.
We can use the formula
$$\text{Weighted Index Number} = \frac{\sum (\text{weighting} \times \text{index})}{\sum \text{weighting}}$$
E.g.
Product | Index in 2008 | Weighting | Index in 2009
Product A | 100 | 71% | 102
Product B | 100 | 29% | 109
Find the weighted index number in 2009.
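One way to work through the weighted index example is sketched below in Python, using the weightings 71 and 29 and the 2009 indices 102 and 109. The function name `weighted_index` is an assumption.

```python
def weighted_index(indices, weightings):
    """Sum of (weighting x index) divided by the sum of the weightings."""
    return sum(w * i for w, i in zip(weightings, indices)) / sum(weightings)

# (71 x 102 + 29 x 109) / (71 + 29) = 10403 / 100
index_2009 = weighted_index([102, 109], [71, 29])
print(round(index_2009, 2))  # 104.03
```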
Spearman’s Rank correlation coefficient
This is used when a comparison needs to be made between two sets of data to see if
there is any connection or relationship between the data.
e.g. You may wish to see if two different groups of people - boys/girls, Year 7/Year
11, children/adults - have the same “preferences” or “likes/dislikes”, whether they
are completely different, or whether there is no connection between the two groups.
e.g. You may wish to see if two people judging at an event award marks consistently
or not, or whether people mark work consistently or not.
Each set of data is ranked in order, giving the largest value rank 1 then the next
largest rank 2 and so on.
e.g. Two judges rank the eight photographs in a competition as follows:
Photograph | Rank (Judge A) | Rank (Judge B) | Difference d (A - B) | d²
A | 2 | 4 | -2 | (-2)² = 4
B | 5 | 3 | 2 | (2)² = 4
C | 3 | 2 | 1 | (1)² = 1
D | 6 | 6 | 0 | (0)² = 0
E | 1 | 1 | 0 | (0)² = 0
F | 4 | 8 | -4 | (-4)² = 16
G | 7 | 5 | 2 | (2)² = 4
H | 8 | 7 | 1 | (1)² = 1
Σd² = 30
To work out the correlation between the two judges, the following formula is used:
$$\text{SRCC} = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$
In this example, “n” is the number of photographs, which equals 8.
$$\text{SRCC} = 1 - \frac{6 \times 30}{8(8^2 - 1)} = 0.64 \text{ (2 d.p.)}$$
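The photograph example can be checked with a few lines of Python. The function name `spearman_rank` is an assumption, and the sketch assumes there are no tied ranks, as in the example.

```python
def spearman_rank(ranks_a, ranks_b):
    """Spearman's rank correlation coefficient: 1 - 6 * sum(d^2) / (n(n^2 - 1))."""
    n = len(ranks_a)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

judge_a = [2, 5, 3, 6, 1, 4, 7, 8]    # ranks for photographs A to H
judge_b = [4, 3, 2, 6, 1, 8, 5, 7]
srcc = spearman_rank(judge_a, judge_b)
print(round(srcc, 2))  # 0.64
```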
Interpretation
The coefficient will lie between -1 and 1. The closer the value is to 1, the stronger
the positive correlation. The closer the value is to -1, the stronger the negative
correlation. If the value is around 0, then there is no correlation. A rough guide can
be found on the following page.
Misleading graphs
Always check the scales to see if the graph is misleading. For pie charts, a 3D effect
distorts the apparent size of the segments, and shading can do the same.
Reading and interpreting from a table of Statistics
Be careful when reading from a table: make sure you look at the right column, and use
a ruler to make sure you are reading the right line.
If the percentages don’t add up to exactly 100%, this is usually due to rounding errors.
Questionnaires
Designing a questionnaire – The question you ask must have a timeframe
specified, for instance, How many hours of T.V. do you watch per week?
The response boxes you provide for the question must cater for every single person.
Some of these boxes may be appropriate: 0, More than …, Other, Don’t Know.
Open questions – Have no suggested answers and give people the chance to reply as
they wish
Advantage – Allows for a range of answers
Disadvantage – Range of responses may be too broad, making them hard to analyse
Closed questions – Give a set of answers for the person to choose from
Advantage – Restricts responses, making them easy to analyse
Disadvantage – Will not necessarily cover all responses
Pilot survey (pre-test) – A preliminary test to see if there is a line of enquiry to
investigate further. It is a small scale replica of the survey / study. It can identify any
problems with the wording of questions, likely responses etc.
Reasons to do a pilot study
• Shows if questions are understandable / clear
• Indicates likely answers
• Gives an indication of how long it takes to complete
• Finds errors
• Gives feedback so alterations can perhaps be made
Leading questions – Avoid questions that suggest an opinion, such as “Smoking is bad
for you. Do you agree?”
Interviews – A one-to-one conversation, which allows any difficulties the interviewee
has in understanding the questions to be resolved.
Advantage – All questions are answered
Disadvantage – Time consuming and expensive
Odds
Odds are another way of expressing probability. Odds are given as a ratio between
the estimated number of failures and the estimated number of successful outcomes.
The ratio, failures : successes is the odds against an event happening.
The ratio, successes : failures is the odds for/on an event happening.
Odds may be changed into probabilities.
For example, if the odds against an event are 2 : 1, there are 2 chances of failure to
every 1 of success, hence for every (2 + 1) = 3 attempts there will be 1 success, so
the probability of success is 1/3.
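Changing odds into a probability can be sketched in Python as follows. The function name `probability_from_odds_against` is an assumption; `Fraction` is used so the answer stays as an exact fraction.

```python
from fractions import Fraction

def probability_from_odds_against(failures, successes):
    """Odds against of failures : successes give P(success) = successes / (failures + successes)."""
    return Fraction(successes, failures + successes)

# 2 chances of failure to every 1 of success.
p_success = probability_from_odds_against(2, 1)
print(p_success)  # 1/3
```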
Venn diagrams
A Venn diagram may be used to calculate probabilities. Each region of a Venn diagram
represents a different set of data.
Goes outside the set for Radio
and outside the set for
Television
Simulation
It may not be possible to carry out an experiment in order to estimate the probability
of an event happening. This may be because it is too complex or just undesirable.
In such cases you can imitate or simulate the problem.
Simulation is quick and cheap, easily altered and repeatable. There are several ways
of introducing randomness to a simulation. You could use; coins, dice or random
numbers (from your calculator or published tables).
Usually you have 100 numbers available to you; 00 – 99 inclusive. If the probability of
an event happening is ½, then you can use half of the numbers to simulate it,
i.e. 00- 49.
If the probability of an event happening was 1/10 then you could use a tenth of the
100 numbers, i.e. 00-09.
To improve the results of a simulation, just do more simulations.
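A simulation using the two-digit numbers 00 to 99 can be sketched in Python. The function name `simulate` and the use of a seed (so that the run can be repeated) are assumptions made for the illustration.

```python
import random

def simulate(probability, trials, seed=None):
    """Estimate a probability by drawing random numbers from 00 to 99.
    Numbers below the cutoff count as the event happening, e.g. 00-49 for a probability of 1/2."""
    rng = random.Random(seed)
    cutoff = round(probability * 100)
    successes = sum(1 for _ in range(trials) if rng.randint(0, 99) < cutoff)
    return successes / trials

# Probability 1/10 uses the numbers 00-09; more trials should give a better estimate.
estimate = simulate(0.1, 1000, seed=1)
print(estimate)
```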
Random number tables can also be used to aid simulations.
Ignore all
numbers
>79
Random numbers can also be generated on your calculator or you could put numbers
00 – 99 in a bag and select.
Type 100 then SHIFT RAN# (above the decimal point).
Experimental probability
When estimating the number of times an event might occur, multiply the number of
trials by the probability of it occurring.
Estimate for no. of times an event may occur = no. of trials × probability of it occurring
Probability tree diagrams
Recall for probability tree diagrams: the probabilities on each set of branches MUST
add to 1; you multiply when moving along branches, but add when selecting the
outcomes in the final column that are relevant to a given event.
A tip for the exam, do NOT simplify your fractions. The calculator may do this for you
so work out the fractions manually; there is then less chance of you making a
mistake.
The probability of choosing a red card is 0.4. I pick two cards.
Fill in the probability tree diagram below.
The various outcomes are listed on the right hand side. I can see the probability of
getting a red and another red is 0.16.
Getting cards of different colours can happen in two ways, black then red or red then
black, so the probability of getting different colours is
0.24 + 0.24 = 0.48.
In a GCSE Statistics exam it is always the case that the second event will depend on
the first.
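The tree diagram for the example can be sketched in Python by listing every pair of branches. The sketch assumes the only other colour is black, with probability 1 − 0.4 = 0.6, and that the probabilities are the same for the second pick, which is what the figures 0.16 and 0.24 imply.

```python
from itertools import product

# Probabilities on one set of branches; they must add to 1.
branch = {"red": 0.4, "black": 0.6}

# Multiply along the branches to get the probability of each final outcome.
outcomes = {(first, second): branch[first] * branch[second]
            for first, second in product(branch, repeat=2)}

both_red = outcomes[("red", "red")]                                   # about 0.16
# Add the outcomes that are relevant to the event "different colours".
different = outcomes[("red", "black")] + outcomes[("black", "red")]   # about 0.48
print(round(both_red, 2), round(different, 2))
```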
Conditional probability
Conditional probability is the probability of some event A, given the occurrence of
some other event B.
The formula to be used is
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
and is said, "the probability of A given B is equal to the probability of A AND B over
the probability of B".
You should always put the “given that…” probability in the denominator.
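As a small illustration, the formula can be written as a Python function. The probabilities used (rain and being late) are hypothetical and are not taken from any of the exam questions.

```python
def conditional_probability(p_a_and_b, p_b):
    """P(A | B) = P(A and B) / P(B); the 'given that' probability goes in the denominator."""
    if p_b == 0:
        raise ValueError("P(B) must be greater than 0")
    return p_a_and_b / p_b

# Hypothetical: P(late and raining) = 0.06 and P(raining) = 0.3.
p_late_given_rain = conditional_probability(0.06, 0.3)
print(round(p_late_given_rain, 2))  # 0.2
```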
Some questions from exam papers are shown below to show how to answer a
question on conditional probability.
In this example, the previous part of the question asked, what is the probability Joan
was late for work. The answer was 0.24.
In the previous part of this question from a tree diagram you could read off the
probability of a person having tooth decay was (0.02 + 0.09 =) 0.11.
In the previous part of this question, it was worked out that the probability of going
to France was 131/200.
Binomial distribution
Consider these examples.
1. In a series of 5 Test Matches the England cricket captain only won the toss once.
2. I bought 8 pens from a shop and 1 of them did not work.
3. In a random sample of 100 people, 8 said they would vote for the Green party.
4. Over several years a woman gives birth to 6 children, of which 5 were girls.
Each situation involves an unpredictable event with two possible outcomes. It is
traditional to label one outcome as "success" and the other as "failure". The captain
may guess correctly (success) or incorrectly; the pen may work (success) or may not
work; the person may vote Green (success) or for some other party; the child may be
a girl (success) or a boy.
In each situation there are a given number of trials of this event. We call this number
n. Thus there were n = 5 matches, n = 8 pens, n = 100 voters and n = 6 children.
Each trial of the event is independent (the outcome of one doesn’t affect the
outcome of the other) of the others. The fact that the captain guessed wrongly in the
first three matches does not make it more (or less) likely that he will guess correctly
in the fourth match. Provided the pens were all the same brand, why should the fact
that one works have any effect on another? The voters were selected at random and
could not therefore influence each other. As several English kings discovered, having
several princesses does not make it more likely that the next child will be a prince!
Since the trials are independent the probability of each outcome remains constant. We
call the probability of a success p and the probability of failure q.
There is a 50% chance that the captain wins any toss so p = 0.5 and q = 0.5.
There is perhaps a 1% chance that any pen does not work so p = 0.99 and q = 0.01.
Maybe 5% of people vote Green so p = 0.05 and q = 0.95. Approximately 50% of
children born are girls so p = 0.5 and q = 0.5. Notice that q = 1 − p.
It is important to realise that the number of successes we get in our n trials depends
on chance. The fact that 50% of children born are girls does not mean that in every 6
children 3 will be girls. It is possible to get no girls, or all girls, or indeed any number
in between. Common-sense tells us that 3 girls are more likely than 5 girls, but the
question is how can we calculate the probabilities of getting 0, 1, 2 … 6 girls in 6
births. This is where the Binomial Distribution comes in.
Suppose that we have a situation where:
• there are n repetitions or "trials" of a random event
• each trial has two possible outcomes, "success" or "failure" (p and q)
• trials are independent (one doesn’t affect the next)
• the probability of a "success", p, remains constant from trial to trial
What is the probability of obtaining r successes in n trials?
The case n = 1
When n = 1 there is only one trial. There is a probability p that the trial results in
success (S) and probability q = 1 − p that the trial results in failure (F). We can show
this in two ways – by a tree diagram and a table.
[Tree diagram: one trial with branches S (probability p) and F (probability q)]
Successes | Probability
0 | q
1 | p
Total | p + q = 1
The case n = 2
There are four possible results from two trials: SS, SF, FS and FF. Because the trials
are independent the probabilities obey the multiplication rule:
P(S first and S second) = P(S first) × P(S second) = P(S) × P(S) = p × p = p²
[Tree diagram: two trials giving the outcomes SS (p²), SF (pq), FS (pq) and FF (q²)]
Successes | Probability
0 | q²
1 | 2pq
2 | p²
Total | p² + 2pq + q²
In fact for n binomial trials, the probabilities of the possible numbers of successes
are the terms of the expansion of $(p + q)^n$. At GCSE Statistics level this is all you are
required to know; beyond this, a new formula using combinations is introduced.
In a GCSE Statistics exam, the expansion of $(p + q)^n$ for the relevant value of n in the
question will be given to you.
Whenever you start a binomial distribution question always write down what p and q
are equal to. Remember that p is the probability of success and q is 1  p .
Let's look at an example. Each term in the expansion is explained. Notice the power of
p and the explanation.
$$(p + q)^4 = p^4 + 4p^3q + 6p^2q^2 + 4pq^3 + q^4$$
• $p^4$ : exactly 4 successes
• $4p^3q$ : exactly 3 successes
• $6p^2q^2$ : exactly 2 successes
• $4pq^3$ : exactly 1 success
• $q^4$ : no (0) successes
So if you were asked in a question to work out the probability of exactly 3 successes,
you would use the term $4p^3q$ (p would be given in the question and q = 1 − p).
If you were asked to find the probability of fewer than 2 successes, you would interpret
this as 1 success or 0 successes, so you would work out $4pq^3$ and $q^4$ and add the
results together.
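The terms of the expansion can be evaluated with a short Python sketch. It uses `math.comb` to produce the coefficients 1, 4, 6, 4, 1, which goes slightly beyond GCSE, where the expansion is given to you. The value p = 0.5 is hypothetical.

```python
from math import comb

def binomial_term(n, r, p):
    """The term of (p + q)^n giving the probability of exactly r successes in n trials."""
    q = 1 - p
    return comb(n, r) * p ** r * q ** (n - r)

p = 0.5  # hypothetical probability of success
exactly_three = binomial_term(4, 3, p)                              # 4p^3q
fewer_than_two = binomial_term(4, 1, p) + binomial_term(4, 0, p)    # 4pq^3 + q^4
print(exactly_three, fewer_than_two)  # 0.25 0.3125
```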
Standardised scores
To compare values from different data sets you usually need to set up standardised
scores. For this you will need to know the mean and standard deviation.
The formula to work out these scores is given as follows,
$$\text{Standardised score} = \frac{\text{score} - \text{mean}}{\text{standard deviation}}$$
The mean and standard deviation will be given in the examination question.
The standardised score indicates how many standard deviations a score is above or
below the mean. This is very useful when comparing two sets of data.
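A comparison using standardised scores might be sketched in Python as follows. The marks, means and standard deviations are hypothetical.

```python
def standardised_score(score, mean, standard_deviation):
    """Number of standard deviations a score lies above (+) or below (-) the mean."""
    return (score - mean) / standard_deviation

# Hypothetical: 70 in maths (mean 60, s.d. 5) and 65 in English (mean 50, s.d. 10).
maths = standardised_score(70, 60, 5)       # 2.0
english = standardised_score(65, 50, 10)    # 1.5
print(maths, english)
```

Here the maths mark is the better performance relative to the group, because its standardised score is higher.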
Normal distribution
The normal distribution is used with data where the mean = median = mode. The
normal distribution is known as a continuous probability distribution. It takes the
shape of a bell and is symmetrical about the mean.
The width of the curve shows how spread out the data is.
In the picture above, the two distributions have the same mean but the blue curve is
less spread out.
Properties you need to learn for the exam.
Mean ± 1 standard deviation will contain 68% of the data.
Mean ± 2 standard deviations will contain 95% of the data.
Mean ± 3 standard deviations will contain 99.8% of the data.
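The three sets of limits can be worked out with a small Python sketch. The mean of 50 and standard deviation of 4 are hypothetical, and the percentages are the ones quoted in this guide.

```python
def normal_limits(mean, sd):
    """Lower and upper limits for mean +/- 1, 2 and 3 standard deviations."""
    coverage = {1: "68%", 2: "95%", 3: "99.8%"}
    return {percentage: (mean - k * sd, mean + k * sd) for k, percentage in coverage.items()}

limits = normal_limits(50, 4)
print(limits)  # {'68%': (46, 54), '95%': (42, 58), '99.8%': (38, 62)}
```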
Choropleth graphs
A choropleth map shows information as a series of graduated shadings - this can
either be shades of grey or colour.
They are designed to show statistical data in a series of "multi-coloured" values
moving from smallest to largest, with the lightest colours for the smaller values and,
as the values get larger, the colours become darker.
These maps are useful for showing the following types of information:
• Average rainfall in inches over a county in a state.
• Average numbers of cattle per farm across a county.
• Population density across a constituency.
There are advantages and disadvantages in these maps.
Advantages
They take statistical information and change it into averages that can be understood
graphically.
In this example, the shading changes progressively from 0% black to 100% black to
reflect the amount of rainfall across a country.
Shading | Rainfall
0% | Up to 5 cm
20% | Up to 10 cm
40% | Up to 15 cm
60% | Up to 20 cm
80% | Up to 25 cm
100% | Over 25 cm
Disadvantages
A person's attention can be focussed on the size of the area, rather than the data.
Large areas tend to dominate a map, but they are usually the least densely
populated areas.
The second disadvantage lies with us - we have difficulty in distinguishing between
shades of grey or colour.
Comparative Pie Charts/Diagrams
These are used when two sets of data with differing totals are to be compared.
Examples could include comparing the costs from one year with the preceding year,
or sales in consecutive years etc. As in a normal Pie Chart, the angle of each sector is
determined by the fraction of the total, where the total is represented by 360°. The
different totals are represented by differently sized circles. The ratio of the areas of
the circles is equal to the ratio of the totals (the area factor), so the ratio of the radii
is the square root of this ratio (the scale factor).
e.g. The following agricultural statistics refer to land use, in hectares, of three
parishes. Draw three pie diagrams to compare this data.
Parish | Barley | Wheat | Woodland | Total Land (hectares)
Appleford | 1830 | 1640 | 550 | 4020
Burnford | 645 | 435 | 120 | 1200
Carnford | 320 | 160 | 150 | 630
Let Carnford be represented by a circle of radius 4 cm.
To Calculate The Radius of the Circle for Burnford
The area of the circle for Burnford has to be enlarged by an area factor which is found
as follows:
$$\text{Area factor} = \frac{\text{Area of Burnford}}{\text{Area of Carnford}} = \frac{1200}{630} = 1.9047...$$
$$\text{Scale factor} = \sqrt{\text{Area factor}} = 1.380...$$
$$\text{Radius of new circle} = 4 \times 1.380... = 5.52... = 5.5 \text{ cm}$$
To Calculate The Radius of the Circle for Appleford
By a similar method, the area factor is first found, and then the scale factor.
$$\text{Area factor} = \frac{\text{Area of Appleford}}{\text{Area of Carnford}} = \frac{4020}{630} = 6.3809...$$
$$\text{Scale factor} = \sqrt{\text{Area factor}} = 2.526...$$
$$\text{Radius of new circle} = 4 \times 2.526... = 10.104... = 10.1 \text{ cm}$$
In this question any parish can be used as a starting point, and the resulting radii will
be based on this parish.
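The radius calculation for the parish example can be sketched in Python, taking Carnford (630 hectares, radius 4 cm) as the starting circle. The function name `comparative_radius` is an assumption.

```python
from math import sqrt

def comparative_radius(base_radius, base_total, other_total):
    """Scale the radius by the square root of the area factor (other total / base total)."""
    area_factor = other_total / base_total
    scale_factor = sqrt(area_factor)
    return base_radius * scale_factor

burnford_radius = comparative_radius(4, 630, 1200)
appleford_radius = comparative_radius(4, 630, 4020)
print(round(burnford_radius, 1), round(appleford_radius, 1))  # 5.5 10.1
```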
Firstly find the ratio of the areas, $\frac{196400}{107000}$. Then, to find the scale factor by
which to multiply the radius (3 cm), find the square root of the ratio of the areas and
multiply it by the radius (3 cm):
$$\sqrt{\frac{196400}{107000}} \times 3 = 4.06... \text{ cm}$$
A pie chart shows proportions but not frequencies.
Comparative pie charts can be used to compare two sets of data of different sizes.
The areas of the two circles should be in the same ratio as the two frequencies.