How to use this manual

Transcription

How to use this manual
1
How to use this manual
This manual provides a set of data analysis laboratories and exercises
which will familiarize you with a number of multivariate (and other)
procedures as they are implemented in SYSTAT 8. You should work
through these exercises at your own pace, which initially may be rather
slow but will accelerate as you become more familiar with SYSTAT.
Some points to bear in mind:
1.
It is assumed that you have some familiarity with SYSTAT 8. As
such, the exercises only sketch the sequence of menu commands
required to do certain basic things: they do not necessarily provide
step-by-step instructions for each and every analysis. If you are
completely unfamiliar with SYSTAT, you can consult the
Laboratory exercises for the course BIO 4118, available in .pdf
format on the BIO 4118 Course Web: while these labs are written for
SYSTAT 7, they provide step-by-step instructions about how to do
certain things, and in the vast majority of cases, these instructions
apply equally well to SYSTAT 8.
2.
Each lab has one or more associated command files (.syc). These list
the set of commands which I used to do each exercise. If you get
stuck on an exercise, you can refer to the appropriate command file
to see the sequence of commands that you should use. In fact, you
can simply File/Open/Command the appropriate command file in
the COMMAND pane (Untitled), and run the file as a command
file (using File/Submit from current line to end). In fact, for
each laboratory, you could simply submit the appropriate command
file(s) and generate exactly the same output as I have done. This
would take at most a couple of minutes, but you will learn virtually
nothing about multivariate analysis in general and SYTAT 8’s
implementation of various procedures in particular. So I strongly
recommend that you refer to my command files only when you are
stuck or you want to check your output. Also, be aware that because
these commands were run on my own machines, you will have to
change the paths for both input and output files.
3.
The exception to the above occurs when the output requires
considerable programming either in SYSTAT or in BASIC. In these
instances, it would be very difficult for you to do the exercises by
yourself unless you have this expertise already. So these exercises
consist primarily of interpreting the output and figuring out how the
command files work.
4.
Most exercises include one or more questions. Consider these
questions carefully! They are designed to (hopefully) illuminate
important aspects, limitations, constraints and issues in multivariate
analysis.
5.
As you work through this manual, please bring any errors or
infelicities to my attention, and I will correct them in future versions.
2
HOW TO USE THIS MANUAL
I would also encourage you to bring to my attention any suggestions
about how the manual might be improved.
3
Displaying multivariate data
Description
This exercise is designed to familiarize you with the different options for
displaying multivariate data in SYSTAT 8.
The data
The file STURGDAT.SYD contains data on individual sturgeon captured in
the commercial harvest at two locations (The Pas and Cumberland
House) over the past several decades. The data include variables relating
to size (e.g. FKLNGTH, RDWGHT), age (AGE), sex (SEX$), location of capture
(LOCATION$) and year of capture (YEAR$). The file ALSC8.SYD contains
data on a variety of characteristics of Adirondack lakes censused in
1984-1986, including dissolved summer oxygen (DO), pH (PH), acid
neutralizing capacity (ANC), surface area (SA), mean depth (MEANDEPTH).
Finally, the file DOG.SYD contains 5 different teeth measurements of
several different kinds (GROUP$) of fossil dog.
Exercises
1.
Using STURGDAT.SYS, make a SPLOM (Graphics/SPLOM) using
FKLNGTH, TOTLNGTH, RDWGHT and AGE. What inferences would you
draw from these plots, and how would they affect subsequent
analysis? Repeat this procedure for each sex.
2.
Using the same data, do some 3-D plots using the Scatterplot
command. While your at it, play around with various options and
look at the results.
3.
Using ALSC8, make a SPLOM using DO, PH, ACIDNC and surface
conductivity (SCONDUCT). What inferences do you draw from these
plots, and how would they affect subsequent analysis? Redo the plot
using SA, MEANDEPTH and VOLUME. What inferences do you draw?
4.
Using ALSC8, do some 3-D plots with various options. Experiment!
5.
Using the file DOG.SYD, do Icon and Profile plots, using GROUP$ as
the Grouping variable and the other five variables as the Feature
variables. Repeat this procedure but before doing so, use
Data/Standardize to convert values to standardized values. How
does this change the output, and why? Repeat using Andrew’s
Fourier Plot, enabling the Overlay graphs into single frame
option.
Output and Command files
The output and command files for this laboratory are LAB2.SYO and
LAB2.SYC respectively.
4
DISPLAYING MULTIVARIATE DATA
5
The bootstrap and jackknife
Description
This exercise is designed to familiarize you with bootstrapping and
jackknifing SYSTAT 8. Here we consider these procedures in the context
of two simple univariate procedures: the t-test and simple linear
regression. You should, however, be aware that SYSTAT 8 has
bootstrapping/jackknifing routines implemented for most of its statistical
procedures, including most of the multivariate procedures discussed in
class. In the exercises that follow, I will not explicitly deal with how these
resampling methods can be implemented in the context of a specific
procedure, but the following exercises provide models that you can in
principle adapt to other more complex analyses, should you feel so
inclined.
The data
The data file GJACKAL.SYD contains length of skull measurements on
male and female golden jackals. The file islands.syd was kindly provided
by Attila Kamar, and consists of data on the number of non-marine bird
species (BIRD) present on 346 oceanic islands. Also included are various
island attributes, including size (AREA), precipitation (PRECEPT), mean
annual temperature (TEMPERATURE), etc.
Exercises
1.
Open the Systat command file BOOTT-TEST.SYC. This file contains a
set of commands for producing the 95% upper and lower confidence
intervals of the bootstrapped average difference between male and
female golden jackal skull sizes, as well as a histogram and
summary statistics. To begin with, open the file in the Untitled
(Command) pane using File/Open/Command. Then use
File/Submit from current line to end. What is the output from
this file?
2.
Run a standard t-test (see lab 3) comparing the mean skull sizes of
males and females, and compare this result with the bootstrapped
result. What do you conclude? Note that while the means are quite
similar, the 95% confidence intervals are quite different. Why?
3.
Examine the BOOTT-TEST command file. This set of commands does
as follows: (1) opens the input data file (GJACKAL.SYD); (2) sets the
random seed (RSEED); (3) runs the bootstrapped t-test procedure
1000 times using a sample size of 20, and writes the output to a file
called TEMP; (4) the program then uses a bunch of BASIC commands
to (i) read in the TEMP file; (ii) check through the temp file to find
lines whose first two fields (A$,B$) are “Pooled” and “Variance”; (iii)
skip from these lines to the next line down using the LAG function,
and assign the value in the fifth field (E$) of this line to the variable
R1 (Note that from the t-test output, this field of this line contains
the value of the average difference between males and females using
a pooled variance t-test.); (iv) uses the VAL function to convert this
6
THE BOOTSTRAP AND JACKKNIFE
character variable to a numeric value; (v) delete any lines that do not
have bona fide values for R1; (vi) writes the set of R1 values to an
output file (CC). This output file is then sorted in ascending order,
with the 25th and 975th values representing the lower and upper
95% confidence intervals. This same file is then used to produce a
histogram of the 1000 values as well as some summary statistics
(Note that the latter is done using SYSTAT, not BASIC, commands.
Note also that unless you specify a path name for the output files,
they will be saved in the default SYSTAT directory.)
4.
Using the command file JACKREGRESS, run jackknifed multiple
regression fitting a model where number of bird species is modelled
as a linear combination of island area, precipitation, temperature and
elevation. The set of commands in this file: (1) opens the islands data
file and using GLM, runs the linear regression, saving the jackknifed
regression coefficients to the file BOOTREG; (2) computes the
“standard” multiple regression coefficients (i.e. using the normal
approximation); (3) opens the file of jackknifed coefficients and
computes descriptive statistics, as well as producing histograms of
the coefficients for each variable. How do the standard coefficients
compare to the mean bootstrapped coefficients? Compare the
standard errors of the coefficients generated using the normal
approximation with the standard deviation of the bootstrapped
coefficients. What does this tell you? (Note: this is a common
occurrence: resampled measures of precision in regression are often
larger than those obtained under the normal approximation!)
Output and Command files
The command files for these exercises are BOOTT-TEST.CMD and
JACKREGRESS.CMD. There are no output files: to get the output, run the
command files with the appropriate modifications (i.e. paths for
input/output files).
7
Two-sample comparisons
Description
This exercise is designed to familiarize you with two-sample
comparisons in SYSTAT 8. Specifically, this exercise considers the case
where there are two samples which we want to compare with respect to
their multivariate means or variances.
The data
The file BUMPUS.SYD contains morphological measurements of 49 female
sparrows collected by Herman Bumpus in February, 1898 in Rhode
Island, including total length, alar extent, head length, length of humerus,
length of keel and sternum, all in mm. A brief description of
STURDAT.SYS is given in Laboratory 2. The file ALSC8 gives the
characteristics of a large sample of lakes in the Adirondacks, including
dissolved oxygen (DO), pH (PH), acid neutralizing capacity (ACIDNC), area
(AREA), elevation (ELEVATION), etc. and the presence/absence of a whole
bunch of fish species. In the first set of exercises, we compare the
characteristics of sparrows which did and didn’t survive a tropical storm.
In a second set of exercises, we compare the physiochemical properties
of lakes which have and do not have brook trout present (with the overall
objective of trying to predict the influence of physiochemical properties
on the likelihood of success of brook trout stocking programs).
Exercises
1.
Using BUMPUS.SYS, run Analysis of variance
(ANOVA)/Estimate using all five variables as the dependent
variables and SURVIVAL$ as the factor variable, with Save
File/Residuals/data enabled. (The default file name here is ANOVA
- keep it as it is.) Then run ANOVA/Hypothesis test entering
SURVIVAL$ as the effect. What do you conclude?
2.
Using the residual file ANOVA.SYD created in the step above, select
Graph/Probability, Graph/Scatterplot and Statistics/Time
series/ACF plot to create normal probability plots, plots of
residuals versus estimates, and autocorrelation plots respectively for
the residuals of all variables for each group separately (by using
SELECT statements). (Note: in the residual file, residual(1)
corresponds to the first listed variable in the ANOVA, residual(2) the
next listed variable, etc. (To make your life easier, you may want to
rename these variables in the file.). Recall that the ACF plots test for
serial autocorrelation of residuals, in this case within each group.
Do you think the assumptions of multivariate normality,
homogeneity of variances and independence of residuals are met?
3.
Using ANOVA.SYD, run Levene’s test for each set of residuals. To do
so, first use Data/Transform/Let, and create a new variable
ABSRES = ABS(RESIDUAL). This is the absolute value of the
difference between an observed value and the mean for the group
(Recall that the sample means are the least-squares estimates of the
8
TWO-SAMPLE COMPARISONS
population means!). Then compare the mean ABSRES for each
variable between survivors and non-survivors. What do you
conclude?
4.
Using BUMPUS.SYS, generate the sample covariance matrices for
both survivors and non-survivors by first enabling Data/By
groups, and then running Statistics/Correlation/Simple,
entering all 5 variables in the Variables dialog box and making sure
the Type Continuous/Covariance option is enabled. Based on
your inspection of the covariance matrices for each group, do you
think the assumption of compound symmetry holds? (Note: we shall
learn how to test the assumption of equality of covariance matrices
in a later laboratory exercise.)
5.
Test the one-tailed null hypothesis that survivors are less variable in
body size than non-survivors using Levene’s test. First, standardize
each variable in each group (survivors and non-survivors) using the
Data/Transform menu with the By groups (SURVIVAL$) enabled.
The use Statistics/Descriptive statistics/Basic statistics to
calculate the medians for each standardized variable for each group.
Then use a series of Data/Transform/If then let... statements
(disabling BY groups before doing so!) to calculate the absolute
value of the difference between the standardized values for each
observation and the sample medians. Finally, compare the vector of
mean absolute deviations of the two groups (survivors and
non-survivors) using Statistics/Analysis of variance as
described above. What do you conclude? (Note: if you are having
trouble figuring out how to do this, consult the SYSTAT command
file LAB4.SYC.)
6.
Repeat the exercise above, but this time instead of calculating the
absolute deviation of standardized values from the median, calculate
Van Valen’s measure (See Lecture notes, lecture 5). Now what do
you conclude? How do you explain the difference between this
result and the previous one? (Note: again, if you are having trouble
figuring out how to do this, consult the SYSTAT command file
LAB4.SYC.)
7.
Using File/Open/Command, import the command file
MVNORMALTEST into the command pane (Untitled). Then run this file
using File/Submit from current line to end. This file produces
a plot of the Mahalanobis distances of each lake from the mean of
the sample from which it is drawn (either the lakes with or without
brook trout) versus its chi-square approximation. Recall that if the
data are multivariate normal, this relationship should be a straight
line. The commands in this file (i) run a discriminant function
analysis using five variables (DO, pH, ACIDNC, elevation and log
surface area) and produce an output file ( DISCRIM.SYD) of the
Mahalanobis distances for each observation (Don’t worry about
what the DISCRIM procedure is doing here - we use it simply to get
the distances. This will be the subject of a later laboratory); (2)
creates a subset of the data file consisting of lakes where brook trout
were present (a total of N = 250 lakes); (3) uses this file to rank the
Mahalanobis distances; (4) computes the chi-square percentiles
9
using the ranked values and the inverse chi-square function (XIF)
which returns a critical value of the chi-square distribution given a
certain percentile and a degrees of freedom (which in this case, is p,
the number of variables); (5) plots the distances against the
chi-square values. This plot gives us an indication of whether within
the sample of lakes with brook trout, the five variables follow a
multivariate normal distribution. What do you conclude? Can you
find a transformation that improves the situation?
8.
the command file BOX2 is designed to test the equality of two
covariance matrices using the Box test (see Lecture Notes). This is
quite a long program, which prints various and sundry things
throughout. Most of the calculations involve matrices, so they are
done within SYSTAT’s MATRIX program. The commands: (i)
produce a set of covariance matrices (COV1, COV2) for the two
samples of lakes (with and without brook trout), as well as the
logarithm of the determinants of each matrix (LNS1, LNS2); (ii)
calculates a pooled covariance matrix (POOLC) and the logarithm of
its determinant (LNPOOLC) using the degrees of freedom for both
COV1 and COV2 (DF1, DF2), as well as the total degrees of freedom
(DFN); (iii) calculates the Box M and C values (BOXM, BOXC); (iv) uses
these values to compute the chi-squared approximation
(XAPPROX)and degrees of freedom (DF). On the basis of these
calculations, are the covariance matrices equal?
Output and Command files
The command files used in this lab up until the exercise testing
multivariate normality is LAB4.SYC; the output is LAB4.SYO. For the other
exercises, the command files are as indicated in the exercises themselves.
For these exercises, there are no output files: to generate output, run the
command files.
10
TWO-SAMPLE COMPARISONS
11
Correlation, covariance and distance
matrices
Description
This exercise is designed to familiarize you the calculation of correlation,
covariance, SSCP and distance matrices in SYSTAT.
The data
The data file DOG.SYD contains average measurements for six different
skull characters for 7 species of dog, both past and present. Manly
described this data set in chapter 1 of his book. The file SKULL.SYD
contains four measurements of human skulls dating from 5 periods in the
past. See Manly for more details.
Exercises
1.
Open DOG.SYD. Then use Data/Transpose to create a new data file
which transposes rows and columns, so that in this new file, the
columns now represent dog species. You can now change the names
of these column variables (which be default are called COL(1),
COL(2) etc.) to the names of the species by double-clicking on the
variable name in the data sheet and renaming it. Then choose
Statistics/Correlation/Simple to generate SSCP, Covariance
and Correlation matrices for the seven species. How is the
covariance matrix generated from the SSCP matrix? What do you
conclude from these matrices? Why is the procedure NOT a good
one to use to generate these matrices? (Note: if you run the DISTANCE
command file, it will bomb because the TRANSDOG file created by
standardizing and transposing doesn’t have the new variable names
(pre-dog, etc.). You have to do this manually in the data sheet.)
2.
Repeat the above procedure except before transposing, use
Data/Transform/Standardize/SD to standardize all variables to
zero mean and unit standard deviation. How does the covariance
matrix generated here compare to the correlation matrix generated
using the raw data?
3.
Repeat one or the other of the above exercises, but now write the
correlation or covariance or distance matrix to an output file using
Save file. Then open this file and see what it looks like.
4.
SYSTAT does not have a “canned” algorithm for calculating
Penrose and Mahalanobis distances between multivariate groups.
However, I have written a short command file (MAHALANOBIS.SYC)
which (in this particular case), calculates the Mahalanobis distance
between the mean vectors for the sample of skulls contained in the
data file SKULL.SYD (this is the same data given in Manly, Table 1.2).
The procedures are mostly executing in SYSTAT’s MATRIX
command. This set of commands (1) inputs the skull data file; (2)
selects for a certain period (e.g. early predynastic) and computes the
12
CORRELATION, COVARIANCE AND DISTANCE MATRICES
(unstandardized) covariance matrix for each period (C1, C2, etc.) as
well as the degrees of freedom on which these matrices are based
(df1, df2, etc.); (3) calculates the pooled covariance matrix; (4)
calculates the Mahalanobis distance between the mean vectors for
early and late pre-dynastic skulls by multiplying various matrices
together. Run this command file, then look at the output that’s
produced and try and figure out what’s going on. (Note that the mean
vectors and covariance matrices for some periods are slightly
different from what Manly reports on pp. 65 - 67 - either his mistake
or mine, I’m not sure which!). How would you modify this
command file to generate the distance between Ptolemaic and
Roman skulls? (This is easy!) Note that this file only produces one
distance at a time. How would you modify it to produce the entire
distance matrix? (This is a substantially harder!)
Output and Command files
The command files for these exercises are DISTANCE and
MAHALANOBIS.CMD. There are no output files: to generate them, run the
command files with the appropriate modifications
13
Single classification MANOVA
(k-groups MANOVA)
Description
This exercise is designed to familarize you running single classification
MANOVA in SYSTAT, including unplanned comparisons and checking
assumptions. The question of interest is: for Adirondack lakes which do
not have brook trout, are there differences in the physiochemical
properties of lakes with different numbers of top piscivore species (e.g.
northern pike, largemouth bass, smallmouth bass, etc.)
The data
The data file is ALSC8.syd, described briefly in previous laboratories
Exercises
1.
Open ALSC.SYD. Then use Data/Transform to create a new
variable (PSR$) which gives the number of top piscivore species
present in the lake. This variable can have three categories: “none”,
“one” or “two or more”.Then choose ANOVA/Estimate model
and run a single classification ANOVA with PSR$ as the category
and PH, DO, ELEVATION and surface area (SA) as the Dependent
variables. make sure you save the residuals and data together using
the Save File/Residuals/Data option. The output file (if you have
Long output enabled) includes: (1) the estimate of effects for each
category and each variable (note that because effects sum to zero
across all categories, only two groups are included here); (2) the
total SSCP matrix; (3) the residual SSCP matrix (what we called the
W matrix, i.e.the within-group SSCP matrix); (4) the residual
covariance matrix (i.e. the covariance matrix of residual values); (5)
the least-squares means for each group for each variable (which, as
you recall from the lecture on least-squares estimation, are just the
sample means); (6) a test of the hypothesis that the constant in the
fitted ANOVA model is zero; (7) a test of the null hypothesis that the
group effects in the model are simultaneously zero. For (6) and (7),
the univariate F-tests for each variable are printed, as well as the
multivariate test statistics (Wilk’s lambda, etc.) Don’t worry about
what the canonical correlations etc. mean - we will address this in a
later exercise. Based on these results, what would you conclude?
2.
Open the file of residuals you saved from the previous analysis. By
using SELECT statements, run ACF (Time Series/ACF) plots on
all 4 residuals for each group. (Note: in this file, residual(1)
corresponds to the first listed variable in the ANOVA, i.e., pH;
residual(2) the next listed variable, etc. (To make your life easier,
you may want to rename these variables in the file.). Recall that the
ACF plots test for serial autocorrelation of residuals, in this case
within each group. What do you conclude? Using the same file, test
for univariate normality of each variable separately by doing a
14
SINGLE CLASSIFICATION MANOVA (K-GROUPS MANOVA)
normal probability plot (Graph/Probability/Normal) for each
variable in each group. What do conclude?
3.
Using the univariate F tables from the first analysis, calculate the
intraclass correlation (see Lecture 5.17) for each of the four
variables. What do you conclude? On the basis of this analysis and
the preceding one, do you think that the assumption of independence
is valid? What do you think would/should be your next step?
4.
Open the command file MVNORMALTEST and try adapting it to check
multivariate normality within each PSR$ category. What do you
conclude? (If you get stuck, the command file MVNORM2 will do it for
you. The output is given in MVNORM2.SYO)
5.
Adapt the command file BOXTEST2 to test for equality of the three
covariance matrices (one for each PSR$ category). Here you can use
the two of the three files extracted above, namely PSR_0 and PSR_1
as inputs, but you can’t use PSR_2+ because MATRIX doesn’t like
“+” in matrix names. Do you will have to name this file something
else.)What do you conclude? (Note: if you get stuck here, the
command file BOXTEST3 will do this for you.)
6.
At this stage, it should be apparent that (1) at least for some
variables, the assumption of independence within groups is not
valid; (2) there are some indications that the data are not multivariate
normal; and (3) the assumption of equality of covariance matrices is
also unlikely to be true. So, what do you do? How might you try and
resolve the first problem (hint: see Lecture 5.19)? How about
problems (2) and (3)? Which problems (if any) do you consider
particularly worrisome in this specific example?
7.
For this exercise, let’s assume that in fact the assumptions did hold,
and we wish to find out which pairs of groups differ from one
another. In MANOVA, SYSTAT 8 does not allow automatic
post-hoc pairwise contrasts (note that when you ran the original
ANOVA above, the Post-hoc Tests dialog box was blanked out).
Instead you must specify particular hypotheses to test following the
original model estimation, by choosing ANOVA/Hypothesis
test, and filling in the resulting dialog box. Effect means the
grouping variable(s), which in this case is PSR$. You then must
specify a Contrast matrix (C matrix). To compare the first
(PSR$=’none’) and second (PSR$=’one’) groups, you must type in ‘1
-1 0’; to compare the first and third groups, enter ‘1 0 -1’ (without
the single quotes). There are in total three such pairwise contrasts: 1
-1 0, 1 0 -1, 0 -1 1. Run these contrasts. Which groups differ from
which others, and with respect to which variables? (Note that
regardless of the contrast, the univariate F and multivariate tests
always use the same error term (either the pooled (over all three
groups) error MS for univariate tests, or the determinant of the total
(pooled) SSCP matrix for the multivariate case.) This is because
these contrasts assume that the within-group variances/covariances
are in fact simply random samples of the “population”
variances/covariances. That is, the contrasts assume that
within-group variances/covariances are equal, so that the best
estimate of the “true” population variance/covariance is simply the
15
pooled variances and covariances. Thus, if you were to simply run a
two-group comparison (see Lab 4 above) of, say, PSR$=’none’ and
PSR$=’one’, the error MS appearing in the univariate ANOVA tables
would not be the same as appear here in the -1 1 0 contrast. If you
don’t believe me, try it!)
Output and Command files
The command files for these exercises are SCLASSMANOVA, MVNORM2 and
BOXTEST3. The output file is LAB6.SYO. For the output to MVNORM2 and
BOXTEST3, run the command files with the appropriate modifications.
16
SINGLE CLASSIFICATION MANOVA (K-GROUPS MANOVA)
17
Principal component and factor
analysis
Description
These laboratory exercises are designed to familarize you running
principal component analysis and factor analysis in SYSTAT. While
mathematically rather similar, these two procedures have quite different
theoretical underpinnings, and whether one or the other is more
appropriate depends not so much on the nature of the data, but rather on
the nature of the question. Remember, principal components are
weighted linear combinations of observed variables, whereas common
factors are unobserved (indeed, unobservable) variables that are
hypothesized to account for correlations among observed variables. (For
more details on this, see Lecture notes chapter 7). In SYSTAT, factor
analysis and principal component analysis are included in the same
statistical module (Statistics/Data Reduction/Factor).
The data
is a classic data file used to test various computer algorithms for
principal component and factor analysis, and consists of morphological
measurements of 305 girls, including height, arm span, forearm length,
lower leg length, weight, upper thigh diameter (BITRO), chest girth and
chest width. Since all of these measurements are an index of size, we
expect them to be highly correlated. The data file does not list the raw
data but rather the correlation matrix.
GIRLS
Exercises
1.
Open GIRLS.SYD. Then choose Data/Data Reduction/Factor,
enter all eight variables as the Model Variables, and make sure the
PCA button is highlighted. Then click on OK. From the resulting
output, answer the following questions: (1) what do the eigenvalues
(latent roots) represent? (2) what do the eigenvectors 1 and 2
represent, and why are there only 2 of them? (3) what do the
(unrotated) component loadings represent? (4) where do the values
for “Variance explained by component” come from?; (5) where does
the “Percent of Total Variance explained by component” come
from? (6) How would you interpret these results. In particular, which
components and loadings are “significant”?
2.
Repeat the above procedure, but this time make sure that Sort
Loadings is enabled, and click on Rotation... and choose
Varimax. What does the “Rotated Loading Matrix” represent, and
how do you interpret it? Does this rotation make for easier
interpretation?
3.
Repeat the above procedure, trying a few different rotations. Which,
in your view, is the best, and why? (Note: when you do Oblimin
rotations, the output also includes a panel about “Direct and Indirect
contributions of factors to Variance”. These are useful for
18
PRINCIPAL COMPONENT AND FACTOR ANALYSIS
determining whether part of a factor’s contribution to the explained
variance is due to its correlation with other factors. The reason this
occurs here and not in the other rotations is that in oblique rotations
(like Oblimin), factors need not be uncorrelated, whereas in other
rotations, the assumption is that they are orthogonal (uncorrelated).
4.
For Orthomax and Oblimin rotations, you can set a parameter
called Gamma. In Orthomax rotations, Gamma (ranging from
zero to one) determines the “family” of rotations used. Recall that
Orthomax rotations are a sort of hybrid between Varimax and
Quartimax rotations, and Gamma can be considered as determining
the degree of hybridization: Gamma = 0 is equivalent to a Varimax
rotation, while Gamma = 1 is equivalent to a Quartimax rotation.
Thus, varying Gamma from zero to one causes a change in the
family of rotations from those which minimize the number of
variables that have high loadings on each factor (Varimax) to those
that minimize the number of factors that load highly on each
variable (Orthomax).
5.
For Oblimin rotations, Gamma again determines the family of
rotations used. In general, the “best” oblimin rotation depends on the
nature of the correlation matrix. if correlations are generally
moderate, Gamma near zero will usually give the best results. If the
correlation includes many high correlations, the positive values for
Gamma will often give the best results. In this case, many of the
correlations are quite high, but a number are quite low, so a small
positive Gamma (e.g. 0.1) gives good results.
6.
Repeat the above procedures, except this time use Statistics/Data
reduction/Factor and select Maximum Likelihood (ML).
Choosing either IPA or ML causes the module to invoke factor
analysis rather than principal components, and IPA or ML
determines the analytical method by which factors (which
remember, are unobservable) are extracted. In ML estimation, for
every estimate of the communalities for each factor, the negative
log-likelihood (see Lecture Notes, L1.10 - L1.17) is computed, and
the process continues over a series of iterations until the
convergence criterion is satisfied (note that you can set both the
maximum number of iterations and the convergence criterion in
the Factor Analysis dialog box.) From the output, (1) what do the
“initial communality estimates” and “final communality estimates”
represent, and why are there eight of them? (2) what does the “Factor
Pattern” matrix represent?; (3) what do the canonical correlations
represent? (To answer this question, check out Lecture Notes Ch.
11). The “Common Variance” is the sum of the communalities: if A
is the maximum likelihood factor pattern matrix, then the common
variance Vc = tr(ATA).
7.
Note that in the above example, there are eight variables but the
factor pattern matrix lists only four factors, even though we did not
specify the number of factors in the Factor Analysis dialog box.
This is due to a well-known (in some quarters at least) theorem that
in factor analysis, the maximum number of factors for which
19
reasonably accurate estimates of loadings can be obtained equals
half the number of variables.
8.
Repeat the above procedure using IPA. In IPA (and ML), the initial
communality estimates for each variable is simply the multiple
squared correlation of that variable with all other variables. At each
iteration, communalities are estimated from the factor pattern matrix
A. Iterations continue until the largest change in an communality is
less than that specified in Convergence. Replacing the diagonal
elements of the original correlation (or covariance) matrix with these
final communality estimates and computing the eigenvalues gives
the eigenvalues reported in the next panel of the output.
9.
PCA on the Girls data set results is quite effective, insofar as (1)
most of the variation in the data set is explained by only a couple of
components/factors; (2) the resulting factors/components are easily
interpreted. But this is not always the case. For a counter example,
run PCA (or factor analysis) on the skulls data set. How do you
interpret these results? Save the scores and data (by enabling Save
/Factor scores). Then using Data/Merge, merge the period$ variable
in the original skull file with the factor scores, and use the SPLOM
procedure to produce a plot of factors(1) - (3) against one another,
with period$ as the grouping variable. You may also want to define
a confidence interval for each group using confidence ellipses (say,
80%) (Options/Scatterplot matrix options/Confidence
ellipse). What do these plots tell you?
10. PCA and factor analysis assume that the variables used in extracting
components or factors follow a multivariate normal distribution.
Using the procedures outlined in Laboratories 4 and 6, do you think
that this assumption is justified for the skulls data set? (Note:
unfortunately, SYSTAT has no general “stand alone” procedure for
calculating Mahalanobis distances, but it will generate such
distances for individual observations in the context of the DISCRIM
procedure (see Lab 4). In this exercise, we are not concerned with
variation in skull dimensions among periods, but rather with doing
PCA/Factor analysis for the sample as a whole. So really we have
only one group, and if we try and run DISCRIM, SYSTAT will
complain bitterly because it cannot run discriminant analysis when
there is only group. However, we can take the skull data file, and
create a dummy second group by simply copying the 150 “real”
cases to create a file of 300 cases, and creating another character
variable group$ which has two value: “real” for the first 150 cases,
and “dummy” for the last 150 cases. But DISCRIM won’t work if
the two groups are identical, so for all cases in the dummy category,
we need to change the values of the various skull variables, e.g. by
adding 5.0 to every value. Once we have done this, we can now run
DISCRIM on the set of 300 cases, and use only the calculated
distances for the “real” cases. Once we have the distances, we can
check for multivariate normality as we did in Laboratories 4 and 6.
(If you get stuck on this, I have already created the dummy file
(DUMMYSKULL.SYD), and the modified command file for testing
multivariate normality is MVNORM3.)
20
PRINCIPAL COMPONENT AND FACTOR ANALYSIS
Output and Command files
The command file for this lab is LAB7.The output file is LAB7.SYO. The
command file to check multivariate normality in the skulls data set is
MVNORM3.SYO
21
Discriminant function analysis
Description
These laboratory exercises are designed to familarize you with running
discriminant function analysis in SYSTAT, using both linear and
quadratic analysis functions. The variables in these functions can be
selected in a forward or backwards stepwise fashion, either interactively
by the user or automatically by SYSTAT. Remember that discriminant
analysis is rather like a hybrid between single classification MANOVA
and multiple regression: cases are grouped into “treatments” (groups) as
in MANOVA, and the predictor variables are used to generate functions
(equations) which predict group membership, as in multiple regression.
The data
IRIS.SYD is a
another classic data set from Sir Ronald Fisher, one of the
fathers of modern biostatistical analysis, and dates back to 1936. The file
consists of measurements (sepal length, sepal width, petal length and
petal width, in cm) made on 150 iris flowers from three species.
Exercises
1.
Open IRIS.SYD. Then choose
Statistics/Classification/Discriminant function and enter all
four flower variables as the Variable(s), and species as the
Grouping Variable. Click on Options... and make sure the
Estimation/Complete is enabled. Then click on Continue and
then OK.
2.
The Between-groups F-matrix tests for equality of group means for
a specified pair of groups. These values are the result of pairwise
“univariate” F-tests comparing the average within-group
Mahalanobis distances to the average between-group distances.
Because the degrees of freedom are the same for each pairwise
comparison, these F values are essentially measures of distances
between group centroids. Which pair of groups are farthest apart?
3.
The F-to-remove statistics are a measure of how important a
particular variable is to the discriminant model. These values are
equivalent to univariate ANOVAs for each variable among groups.
For all variables, the denominator df is the same: total sample size number of groups - number of variables in the model + 1 (i.e., in this
case, 144). On the basis of this result, which variable is the least
useful in distinguishing among species? Which is the most useful?
4.
What are the canonical discriminant functions, standardized
canonical discriminant functions, canonical group means, the raw
classification matrix and the jackknifed classification matrix. On the
basis of the canonical function plot, how important is the second
canonical function in discriminating among species?
22
DISCRIMINANT FUNCTION ANALYSIS
5.
Repeat the above, but this time under Statistics..., make sure the
Long/Mahal is enabled. As we have seen before, this calculates the
Mahalanobis distance of each observation from the multivariate
mean of each group. Based on these distances, it calculates an a
posteriori probability of group membership for each observation in
each group (so that the row sum for a given observation is one),
based on the Mahalanobis distances: the closer an observation is to
a group mean, the greater the posterior probability. An arrow (-->)
marks misclassified observations.
6.
On the basis of the above analysis, we might conclude that PETALLEN
is not very important in discriminating among species. Rerun the
above analysis, but this time drop PETALLEN from the variable list.
Compare the classification matrices obtained here with those
obtained using all four variables. Do you think that including
PETALLEN results in a significantly better classification?
7.
Consider now a more complex situation. Our goal is to build a model
which predicts whether we will find brook trout in Adirondack
lakes. Thus, we can ask: what is the set of functions that best
discriminates between lakes that do and do not have brook trout?
Using the file ALSC8B.SYD, run forward and backwards automatic
stepwise discriminant function analysis, beginning with the
predictors DO PH ACIDNC PHOSPHOROUS SCONDUCT ELEVATION LOGSA
LOGLITT MAXDEPTH MEANDEPTH SHORE FLUSRATE INLET OUTLET
BTSTOCK. Can you figure out how these stepwise processes are
working (see Lecture Notes, ch. 9). What do you conclude?
8.
From the stepwise procedures above it should be clear that some of
the initial variables are not very useful in discriminating among
groups. But based on the F-to-enter/remove statistics for each
variable, which ones do you think are the most important? Try
running a discriminant function analysis using only these variables
(Hint: there are 5 of them - which are they?). Compare the
jackknifed classification matrices obtained from this analysis with
those obtained using forward and backwards stepping above. What
do you conclude?
9.
In the analyses above, we assumed that the prior likelihood of a lake
having brook trout or not was the same. In reality, this is not the case.
In fact, we know that only about 30% of the lakes in the Adirondacks
have brook trout. Thus, it is appropriate to modify the priors by
including in the Model command an option / Priors = 0.70, 0.30.
(Note: there is no menu option to do this; it must be specified in the
command file.) Rerun the forward stepping analysis with these
priors, and compare the results with those obtained using equal
priors. Why the difference? In particular, why does the
misclassification rate for “absent” decrease substantially while the
misclassification rate for “present” increases dramatically?
10. Like most multivariate procedures, discriminant function procedure
analysis relies on multivariate normality within groups and equality
of covariance matrices among groups. In previous laboratories (4, 6,
and 7), we have seen how these assumptions can be checked.
Suppose we consider only the variables PH SCONDUCT ELEVATION
23
FLUSRATE BTSTOCK1. As a “quick and dirty” check of the equality
of covariance assumption, we can do a SPLOM of these 5 variables
for each group of lakes. If the sample size is about the same for each
group (or, as in this case, large enough that differences in sample size
are largely irrelevant), if the equality of covariance assumption
holds, then the ellipses for each pair of variables should have
approximately the same tilt and shape across groups. Do a SLPOM
to check whether this assumption is likely to be true. What do you
conclude?
11. On the basis of the above, you should conclude that the equality of
covariance assumption probably doesn’t hold (Big surprise!). In
such cases, quadratic discriminant function analysis may be of value
since it does not assume equality of covariances (Recall, however,
that quadratic discriminant functions generally have lower
reliability!). Rerun the above analysis using the five variables PH
SCONDUCT ELEVATION FLUSRATE BTSTOCK, Quadratic and
Complete estimation. How do the results compare to a simple
linear discrimination using the same variables?
12. Recall that jackknifed cross-validation is not a particularly good
way to assess model reliability. A better way is to use some of the
data as a training set, the rest as a test set. In SYSTAT, the way to do
this is to assign a random number (from zero to 1) to each case in the
data file, and when this number is less than, say, 0.66, the value 1.0
is stored in a new variable CASE_USE. On the other hand, if the
random number is equal to or greater than 0.66, the CASE_USE
variable is set to zero. CASE_USE then determines the value of another
variable WEIGHT (To see precisely how this is done using the
SYSTAT URN function, consult the command file for this
laboratory). In running a procedure, SYSTAT will then use only
those cases which have a WEIGHT equal to 1.0. In this way, about
66% of the cases in the file are used to (in this example) generate the
discriminant model. (i.e., these are the “training” data). Use this data
splitting procedure to evaluate the reliability of the linear and
quadratic discriminant models derived above.
13. Using the knowledge gained from Laboratory 4 and the present one,
can you produce a command file to do bootstrapped data-splitting
for discriminant function analysis (This takes some thinking about,
but is not particular difficult.)
Output and Command files
The command and output files for this lab are LAB8.SYC and
LAB8.SYO.
24
DISCRIMINANT FUNCTION ANALYSIS
25
Cluster analysis
Description
These laboratory exercises are designed to familarize you with running
cluster analysis in SYSTAT. Within SYSTAT, one can cluster cases
(usually the objects), variables or both, depending on the ultimate goal.
SYSTAT has procedures for producing hierarchical clusters as well as
partitioned clusters (additive trees and k-means clusters).
The data
shows the abundances of the 25 most abundant plant
species on 17 plots from a gazed meadow in Steneryd Nature Reserve in
Sweden. For each plot, the value for a particular species is the sum of
cover values (range 0-5) over 9 subplots, so that a score of 45 indicates
complete coverage of the entire plot.
PLANTS2.SYD
Exercises
1.
Open PLANT2.SYD. As indicated in the Lecture Notes (Ch. 10), when
clustering objects it is important to make sure that the variables (in
this case, species counts) are all on the same scales. So before
beginning, make sure you standardize the data. Then choose
Statistics/Classification/Hierarchical clustering and enter
all 25 plant species (columns in the data file) as the Variable(s), and
enable Join/Rows. This will insure that plots are clustered on the
basis of their dissimilarities in the abundances of the 25 plant
species. Make sure that Linkage is set to single and Distance to
Euclidean. Then click on OK. What conclusions do you draw from
the resulting tree (Notice which plots are grouped together). What
does this suggest?
2.
Still using row joining, try building trees based on a number of
different joining methods (e.g. single, complete, average, etc.).
What, if anything, differs among the trees built using these different
methods? Is there one method which you feel gives the “best”
results?
3.
Rerun the above analyses, but instead of using Euclidean distances,
use distances based on the correlation matrix. How do these results
compare with those obtained using Euclidean distances? Overall,
how many “major” clusters of plots do there appear to be?
4.
Suppose that instead of being measures of coverage, the data in the
matrix represented counts of individuals per plot, do you think that
constructing trees based on quantitative distance metrics (Euclidean,
correlation, etc.) would be appropriate? Why or why not? If not,
what do you think would be an appropriate distance metric to use?
Try it. Do the results change if you use this distance metric? (Note:
single linkage will not work for distance metrics for count data, e.g.
chi-square. Why not?)
26
CLUSTER ANALYSIS
5.
In the above exercise, you tried various joining methods assuming
that the data were in fact frequency data (which they are not). But,
inspection of the data matrix should immediately tell you that in this
case, clustering using frequency metrics is unlikely to be very
reliable. Why not?
6.
In many practical applications, measured attributes of objects take
only one of several values: in plant ecology, for example, a species
may simply be scored as present (1) or absent (0). PLANT3.SYD is the
same file as PLANT2.SYD, except that each species has been scored as
either present or absent. Given the form of these data, what is an
appropriate distance metric to use? Use these metrics to construct a
tree for the 17 plots. How do the trees obtained compare with those
obtained using quantitative metrics. Are the “major” clusters the
same? What does this tell you?
7.
From the above analyses it appears that there is one cluster (of plots
1 - 5) which consistently seems to form a cluster regardless of the
joining method used and regardless of the distance metric. Even if
we coarse-grain the data and use presence/absence, these plots still
form a tight cluster. This suggests that there are at least two major
groups of plots, which in turn suggests that a reasonable starting
place for k-means clustering is with 2 groups. Rerun the above
analyses using Statistics/Classification/k Means Clustering,
varying the number of groups and the distance metric. Based on
these analyses, how many “major” groups do you think you have?
Which species do you think are the most important for
distinguishing among these major clusters? Suppose you were to
consider only the data given in the original data matrix: on the basis
of this information alone, which species do you think would be more
likely to be useful in distinguishing among plots, and why?
8.
In all of the above exercises, our objective has been to cluster plots.
But we can also cluster variables (i.e. species) by using
Join/Column. Repeat the above exercises and study the
relationships among the species. Which species show the closest
relationships?
Output and Command files
The command and output files for this lab are LAB9.SYC and
LAB9.SYO.
27
Canonical correlation analysis