Level 2 Computing — Project 2: Linear Regression Analysis

Introduction
This exercise will be quite different from the first. We are going to explore regression
analysis which is commonly used in geology to quantify associations between variables.
For years scientific hand calculators and later spreadsheet programs like MS Excel and
many graphing programs have had the capability of performing regression analyses of
data sets. We (collective) tend to take it on faith that these calculators/programs are
handling the data we input in an appropriate way.
In project 2, we will cast a critical eye on exactly how MS Excel and the graphing
program Deltagraph perform linear regressions of geochemical data.
Why would one want to quantify associations among variables? Because for valid
correlations, knowledge of one variable can be used to predict the amount of another.
Example: Assume you are doing geochemical prospecting for element X and you knew
from earlier detailed work that Xʼs concentration correlates positively with element Y.
Further assume that measuring element X directly is costly and time-consuming, but
measuring element Y is fast and cheap. The correlation would give you a cost-effective
method to identify concentrations of X by using measurements of Y.
Linear Regressions
A linear regression is simply a line fitted to a set of data points. In geochemistry these data
are generally X-Y plots of amounts of chemical elements obtained by some analytical technique.
There are several methods to fit lines to data points. One method that is commonly used in
geochemistry is the least squares method, and there are two variants of this method. The Y on X
variant fits the line for which the sum of the squares of the Y deviations (the vertical
distances between the data points and the line) is minimized (Fig. 1).
The other variant of the ordinary least squares method is X on Y. The line fitted by this method
minimizes the sum of the squares of the X deviations (the horizontal distances between the data
points and the line) (Fig. 2).
Figure 1. Ordinary least squares: Y on X.
Figure 2. Ordinary least squares: X on Y.
Figure 3. Ordinary least squares: the Y on X and X on Y lines cross at the mean point (x̄, ȳ).
As a result, the ordinary least squares method can yield two regression lines from a single
data set. Their intersection point gives the mean x and y values (x̄ and ȳ) of the entire
data set (Fig. 3). The angle between the two regression lines is larger for poorly correlated
data and becomes smaller as the correlation coefficient (usually denoted r) approaches 1 or -1.
The two lines coincide when r = 1 or -1.
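The link between the angle and r can be checked numerically. The short Python sketch below is only an illustration: the two data sets are made up (they are not the course data), and the function name angle_between_ols_lines is hypothetical. It expresses both least squares lines as dY/dX slopes and reports the angle between them.

```python
import math

def angle_between_ols_lines(x, y):
    """Return (r, angle in degrees) between the Y on X and X on Y lines.

    Assumes the sum of cross products is non-zero, so both slopes exist.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    ss_x = sum((xi - mx) ** 2 for xi in x)
    ss_y = sum((yi - my) ** 2 for yi in y)
    sp_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

    r = sp_xy / math.sqrt(ss_x * ss_y)
    slope_y_on_x = sp_xy / ss_x              # dY/dX of the Y on X line
    slope_x_on_y_transposed = ss_y / sp_xy   # X on Y line rewritten as dY/dX
    angle = abs(math.atan(slope_x_on_y_transposed) - math.atan(slope_y_on_x))
    return r, math.degrees(angle)

# Made-up illustrative data: one tightly correlated set, one scattered set.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_tight = [2.1, 2.9, 4.2, 4.8, 6.0]
y_scattered = [3.0, 1.5, 4.5, 3.0, 5.0]

r_tight, angle_tight = angle_between_ols_lines(x, y_tight)
r_scattered, angle_scattered = angle_between_ols_lines(x, y_scattered)
print(f"tight:     r = {r_tight:.3f}, angle = {angle_tight:.2f} degrees")
print(f"scattered: r = {r_scattered:.3f}, angle = {angle_scattered:.2f} degrees")
```

The scattered set should give the smaller r and the larger angle, in line with the statement above.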
The reduced major axis (RMA) method minimizes the areas of the triangles between the points
and the best-fit line (Fig. 4). This method is less widely applied than the least squares
method, but is more appropriate for geochemical data because the fitted line is independent of
the correlation coefficient. The regression line for the RMA method will lie between the lines
for the least squares Y on X and X on Y methods. As mentioned above, as the correlation
coefficient approaches 1 or -1 the lines produced by all three of these regression methods will
converge.
Figure 4. Reduced major axis.
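That the RMA line lies between the two least squares lines can be seen by comparing the three slopes once they are all written as dY/dX. The Python sketch below uses made-up data (not the course data) and hypothetical variable names. The RMA slope comes out as the geometric mean of the other two slopes, which is why it always falls between them.

```python
import math

# Made-up illustrative data (not the course data sets).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 1.5, 4.5, 3.0, 5.0]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
ss_x = sum((xi - mean_x) ** 2 for xi in x)
ss_y = sum((yi - mean_y) ** 2 for yi in y)
sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# All three slopes are expressed as dY/dX so they can be compared directly.
slope_y_on_x = sp_xy / ss_x
slope_x_on_y_transposed = ss_y / sp_xy  # X on Y line rearranged to Y as a function of X
# Assumption: the RMA slope takes the sign of the sum of cross products.
slope_rma = math.copysign(math.sqrt(ss_y / ss_x), sp_xy)

print(f"Y on X: {slope_y_on_x:.4f}")
print(f"RMA:    {slope_rma:.4f}")
print(f"X on Y: {slope_x_on_y_transposed:.4f}")
```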
Basic Excel Formulae
Calculating the Regression Lines
In order to calculate the regression lines, one needs to be familiar with the equation for a
straight line. The basic equation is given below.
equation for a straight line: Y = m · X + by
Y = vertical axis, X = horizontal axis, by = Y intercept (value of Y where X = 0)
and m = slope = ΔY/ΔX
[Figure: a straight line on X-Y axes, with the intercept by marked on the Y axis and the slope
shown as ΔY over ΔX.]
The main objective of project 2 is to calculate the regression lines for the three methods of
linear regression mentioned at the beginning of this document. Since the regression lines are
derived from the same data set of X and Y values, the only differences among the three methods
will be the way in which the slope (m) and the y-intercept (by) are calculated.
Assume that we have used the Name... item in the Excel Insert menu to name cells and ranges of
cells in the Excel worksheet. Fe2O3, CaO, Xsq, Ysq and cPxy are ranges of cells, and the
remaining Names are values in individual cells. The cross product for X and Y must be
calculated for each pair of values; assume x is in B4 and y is in C4 below.

Quantity | Name | Math expression/value | Excel formula for math expression
number of samples | n | 31 |
X data set | Fe2O3 | |
Y data set | CaO | |
each X value squared | Xsq | X^2 |
each Y value squared | Ysq | Y^2 |
mean of X values | mX | ΣX/n | = Sum(Fe2O3)/n
mean of Y values | mY | ΣY/n |
sum of squares about X | SSx | Σ(X^2) - [(ΣX)^2]/n | = Sum(Xsq)-Sum(Fe2O3)^2/n
sum of squares about Y | SSy | Σ(Y^2) - [(ΣY)^2]/n |
each cross product for X and Y | cPxy | (X-mX)·(Y-mY) |
sum of cross products for X and Y | SPxy | Σ[(X-mX)·(Y-mY)] |
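For readers who want to cross-check their spreadsheet, the quantities defined above can be sketched in a few lines of Python. This is an illustration only: the numbers are made up (they are not the Sheet 1 data), and the variable names simply mirror the Excel Names used in this exercise.

```python
# Made-up stand-ins for the named ranges Fe2O3 (X) and CaO (Y); not the course data.
fe2o3 = [1.0, 2.0, 3.0, 4.0, 5.0]
cao = [2.1, 2.9, 4.2, 4.8, 6.0]

n = len(fe2o3)                            # number of samples
x_sq = [x ** 2 for x in fe2o3]            # Xsq: each X value squared
y_sq = [y ** 2 for y in cao]              # Ysq: each Y value squared
m_x = sum(fe2o3) / n                      # mX = ΣX/n
m_y = sum(cao) / n                        # mY = ΣY/n
ss_x = sum(x_sq) - sum(fe2o3) ** 2 / n    # SSx = Σ(X^2) - [(ΣX)^2]/n
ss_y = sum(y_sq) - sum(cao) ** 2 / n      # SSy = Σ(Y^2) - [(ΣY)^2]/n
cp_xy = [(x - m_x) * (y - m_y) for x, y in zip(fe2o3, cao)]  # cPxy, one per sample
sp_xy = sum(cp_xy)                        # SPxy: sum of cross products

print(f"mX = {m_x:.3f}, mY = {m_y:.3f}")
print(f"SSx = {ss_x:.3f}, SSy = {ss_y:.3f}, SPxy = {sp_xy:.3f}")
```

The expression used for SSx is algebraically the same as summing the squared deviations (X-mX)^2, so either form can be used to check the other.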
Notation & Calculation – Example
Fill in the blanks

Method | Quantity | Symbol | Math expression | Excel formula
least squares (Y on X) | intercept | b(0) | mY - b(1)·mX |
least squares (Y on X) | slope | b(1) | SPxy/SSx |
least squares (X on Y) | intercept | b(0) | mX - b(1)·mY |
least squares (X on Y) | slope | b(1) | SPxy/SSy |
reduced major axis | intercept | b(0) | mY - b(1)·mX |
reduced major axis | slope | b(1) | SQRT(SSy/SSx) |
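The six expressions above can be gathered into one small function as a cross-check on the spreadsheet. The Python sketch below is illustrative only: the function name regression_lines and the input values are hypothetical, and copying the sign of SPxy onto the RMA slope is an added assumption (SQRT(SSy/SSx) on its own is always positive).

```python
import math

def regression_lines(m_x, m_y, ss_x, ss_y, sp_xy):
    """Return {method: (b0, b1)} from the summary quantities."""
    b1_y_on_x = sp_xy / ss_x
    b1_x_on_y = sp_xy / ss_y
    # Assumption beyond SQRT(SSy/SSx): give the RMA slope the sign of SPxy
    # so that negatively correlated data produce a negative slope.
    b1_rma = math.copysign(math.sqrt(ss_y / ss_x), sp_xy)
    return {
        "Y on X": (m_y - b1_y_on_x * m_x, b1_y_on_x),  # y = b0 + b1 * x
        "X on Y": (m_x - b1_x_on_y * m_y, b1_x_on_y),  # x = b0 + b1 * y
        "RMA": (m_y - b1_rma * m_x, b1_rma),           # y = b0 + b1 * x
    }

# Hypothetical summary values, not taken from Sheet 1 or Sheet 2.
lines = regression_lines(m_x=3.0, m_y=4.0, ss_x=10.0, ss_y=9.5, sp_xy=9.7)
for method, (b0, b1) in lines.items():
    print(f"{method}: b(0) = {b0:.4f}, b(1) = {b1:.4f}")
```

Note that the X on Y pair describes x as a function of y, so it cannot be compared with the other two pairs until it is rearranged.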
Calculation - Exercise
Fill in the blanks
Using the data in Excel Sheet 1 (web link) and the formulae you have just written out, calculate the
values indicated below. Note that Sheet 1 is locked. You will need to copy these values into a new
worksheet.
mX
mY
SSx
SSy
SPxy
for Y on X:
b(0)
b(1)
for X on Y:
b(0)
b(1)
for RMA:
b(0)
b(1)
The values below are the correct ones (in no particular order). Use them to check your calculations:
0.5988
57.779
0.7280
3.610
1.1298
0.7636
3.909
96.492
1.2688
-0.1689
51.141
Calculation – Exercise
Using the data in Excel Sheet 2 (web link), repeat the calculations you just did. Sheet 2 is also
locked. You will need to copy these values into a new worksheet.
mX
mY
SSx
SSy
SPxy
for Y on X:
b(0)
b(1)
for X on Y:
b(0)
b(1)
for RMA:
b(0)
b(1)
Plotting the data
Use the intercepts [b(0)] and slopes [b(1)] to calculate the three regression lines. These can be
obtained from the equation for a line, here y = b(1) · x + b(0), by choosing two values of x. For this
example use X = 0 and X = 6 and calculate the y-values. Plot the data points alone in one chart, and
the data points together with the three regression lines in another.
Important – The slope and intercept values that were determined above for the (Y on X) and (X on Y)
least squares regression lines do not use the same X and Y axes. These axes are transposed. In order to
plot all the lines on the same diagram with the same X and Y orientation, the (X on Y) linear equation
needs to be rearranged. See the explanation below.
To plot Y on X and RMA regression lines, use the slopey [= b(1)] and by
[= b(0)] in equation 1 to obtain values of Y where X = 0 and X = 6.
Y = slopey · X + by          (1)
The slopex [= b(1)] and bx [= b(0)] for the X on Y regression line cannot be
used in equation 1; equation 2 shows the form of the X on Y regression line
for the values of slopex and bx that you have determined.
X = slopex · Y + bx          (2)
Equation 2 must be rearranged in order to plot the X on Y regression line on
the same x-y plot as the RMA and Y on X regression lines.
X = slopex · Y + bx                    (2)
-slopex · Y = bx - X                   (rearranging)
Y = (1/slopex) · X - bx/slopex         (3)
Use the slopex [= b(1)] and bx [= b(0)] in equation 3 to obtain values of Y
where X = 0 and X = 6. These values can be plotted on the same x-y plot as
the RMA and Y on X regression lines (CaO is the ordinate; Al2O3 is the
abscissa).
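The rearrangement in equation 3 and the two plotting points can be sketched as follows. The Python below is only an illustration: the function name x_on_y_as_y_of_x and the slopex and bx values are hypothetical, not results from Sheet 2.

```python
def x_on_y_as_y_of_x(slopex, bx):
    """Rearrange X = slopex * Y + bx (equation 2) into Y = slope * X + intercept (equation 3)."""
    return 1.0 / slopex, -bx / slopex

# Hypothetical X on Y results, chosen only to keep the arithmetic simple.
slopex, bx = 0.5, -5.0
slope, intercept = x_on_y_as_y_of_x(slopex, bx)

# Two points are enough to draw the line: evaluate Y at X = 0 and X = 6.
endpoints = [(x, slope * x + intercept) for x in (0.0, 6.0)]
print(endpoints)
```

Substituting each computed Y back into equation 2 should return the X value it started from, which is a quick way to confirm the rearrangement.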
Plotting the data with MS Excel
Make two plots. One plot should contain the CaO and Al2O3 data, and the other plot should contain
data points and (Y on X), (X on Y) and RMA regression lines. Size both plots such that they can be
printed on a single A4 page.
To start plotting, use this XL tool
On your plots set the lower and upper values of the X axis to 0 and 6, and set the lower and upper values of
Y to 12 and 18. Your plots should look roughly like this:
[Two example plots, each with the X axis running from 0 to 6 and the Y axis from 12 to 18.]
If you have succeeded you will have noticed that plotting in Excel is neither an intuitive nor a fun
experience.
Things to think about
Do the two least squares regressions nearly coincide or are they far apart? What does this tell us?
In the introduction it was stated that as the correlation coefficient (r) approaches 1 or -1 the two least
squares lines will approach coincidence. Do you think the value of r is closer to 1 or closer to 0?
Determining r in Excel
Select an empty cell and in the Insert menu choose Function... Select the function PEARSON; array1
and array2 are the ranges of cells containing the X and Y data.
The result is the correlation coefficient (r). So what is it, closer to 1 or closer to 0?
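Excel's PEARSON function returns the Pearson product-moment correlation coefficient, which in the notation of this exercise is r = SPxy / SQRT(SSx · SSy). The Python sketch below computes the same quantity by hand as a cross-check. The function name pearson_r and the data are hypothetical, not the Sheet 2 values.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r = SPxy / sqrt(SSx * SSy)."""
    n = len(xs)
    m_x, m_y = sum(xs) / n, sum(ys) / n
    ss_x = sum((x - m_x) ** 2 for x in xs)
    ss_y = sum((y - m_y) ** 2 for y in ys)
    sp_xy = sum((x - m_x) * (y - m_y) for x, y in zip(xs, ys))
    return sp_xy / math.sqrt(ss_x * ss_y)

# Made-up illustrative data (not the course data sets).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 4.2, 4.8, 6.0]
r = pearson_r(x, y)
print(f"r = {r:.4f}")
```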
DeltaGraph
In the final part of this exercise you will plot the same data in DeltaGraph. This should be a fairly
quick exercise.
Open DeltaGraph (Launcher). Copy the data out of Sheet 2 and paste it into the DeltaGraph worksheet.
Select the data and plot it (see below).
Shortcut: Clicking here selects all the data in the worksheet.
Opens the chart type and plot windows.
This is the simplest.
I suggest using it.
Make sure you have selected a paired scatter diagram.
Select Scatters to reduce the choices.
You can set the lower and upper bounds and the axis length with this menu. Use the same bounds
as in the Excel plot, but here set the length of each axis to 10 cm.
Toggle these buttons and examine this formula. Are these values similar to those you previously
calculated in Excel?
DeltaGraph plots can be edited in Adobe Illustrator. In the File menu there is an item
called Export... If you export as EPSF (Encapsulated PostScript file), the file can be
read into and edited in Illustrator via the Open command in the File menu.
Make and export the following three plots.
[Three example plots, each with the X axis from 0 to 6, the Y axis from 12 to 18, and both axes
10 cm long: one showing the X on Y line, one the Y on X line, and one the RMA line.]
You will need to plot the RMA regression line separately (as before, 2 points are sufficient).
Using Adobe Illustrator, combine these three plots into a single diagram containing the
X on Y, Y on X and RMA regression lines (all labeled). Turn in only one DeltaGraph plot on
an A4 page.
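[Example of the final diagram: X axis from 0 to 6, Y axis from 12 to 18, both axes 10 cm long,
with the Y on X, X on Y and RMA lines labeled.]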