What is Multivariate Analysis

•
Multivariate analysis is the best way to summarize a data table with many
variables by creating a few new variables containing most of the information.
These new variables are then used for problem solving and display, e.g.,
classification, relationships, control charts, and more.
•
The new variables, the scores, denoted by t, are created as weighted linear
combinations of the original variables. Each observation has its own t-values.
•
PCA, the basic MV method, summarizes one data table.
•
Plotting the scores (t’s) gives an overview of the observations (objects)
•
PLS simultaneously summarizes two data tables, X (the predictor variables) and
Y (the response variables), in order to develop a relationship between them
•
PCA and PLS are called Projection methods
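As a rough illustration of the scores being weighted linear combinations of the original variables, here is a minimal NumPy sketch of PCA. The data table is made up, and using the SVD is an assumption made for brevity; it is not a description of how SIMCA-P computes its components.

```python
import numpy as np

# Hypothetical data table: 6 observations (rows) x 4 variables (columns).
X = np.array([
    [2.0, 4.1, 1.0, 0.5],
    [3.0, 6.2, 1.5, 0.4],
    [4.0, 7.9, 2.1, 0.6],
    [5.0, 10.1, 2.4, 0.5],
    [6.0, 12.0, 3.1, 0.3],
    [7.0, 13.8, 3.4, 0.6],
])

# Center each variable and scale it to unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via SVD: the rows of Vt are the weight (loading) vectors p.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
A = 2                  # number of components kept
P = Vt[:A].T           # loadings, shape (variables, A)
T = Z @ P              # scores: weighted linear combinations of the variables

print(T.shape)         # one row of t-values per observation
```

Plotting the first column of T against the second gives the kind of score plot referred to above.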
www.umetrics.com
05-08-17
SIMCA-P Getting started.ppt
1 (29)
What is a Projection?
Reduction of dimensionality, model in latent variables
•
Algebraically
– Summarizes the information in
the observations as a few new
(latent) variables
•
Geometrically
– The swarm of points in a K
dimensional space
(K = number of variables) is
approximated by a
(hyper)plane and the points
are projected on that plane.
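The geometric view can be sketched in a few lines of NumPy. The point swarm below is simulated (an assumption for illustration only): 50 points in K = 3 dimensions lying close to a plane, which is then fitted and used for the projection.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical swarm: 50 points in K = 3 dimensions lying close to a plane.
basis = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
X = rng.normal(size=(50, 2)) @ basis + 0.05 * rng.normal(size=(50, 3))

Xc = X - X.mean(axis=0)                 # move the origin to the center of the swarm
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2].T                            # two directions spanning the fitted plane

T = Xc @ P                              # coordinates of each point within the plane
X_proj = T @ P.T                        # the projected points, back in 3-D space
E = Xc - X_proj                         # the part the plane does not capture

print(np.linalg.norm(E, axis=1).max())  # largest distance from a point to the plane
```

The residual E is perpendicular to the plane, which is what makes this a projection rather than an arbitrary approximation.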
Notation
Each obs has values of t (and u) – Each variable has values of p (and w and c)
•
t: the X scores; the new summarizing variables (coordinates in the
hyperplane of X-space)
•
u: the Y scores in PLS; the new summarizing variables (coordinates in the
hyperplane of Y-space, when Y is multidimensional)
•
p: the PC loadings. These are the weights that in PCA combine the original
variables in X to form the new variables, scores t.
•
w*: the PLS weights. These are the weights that in PLS combine the
original variables in X to form the new variables, scores t.
•
c: the weights used to combine the Y's to form the scores u.
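To make the PLS notation concrete, the following is a minimal sketch of one PLS component computed with the classical NIPALS iteration. The data are simulated, and the code is an illustration under those assumptions, not SIMCA-P's implementation. For the first component the weights w and w* coincide, so only w appears here.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 20 observations, 5 predictors (X), 2 responses (Y).
X = rng.normal(size=(20, 5))
Y = X[:, :2] @ np.array([[1.0, 0.5], [-0.5, 1.0]]) + 0.1 * rng.normal(size=(20, 2))
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)

u = Y[:, 0].copy()                      # start from any Y column
for _ in range(500):
    w = X.T @ u / (u @ u)               # X weights
    w /= np.linalg.norm(w)
    t = X @ w                           # X scores
    c = Y.T @ t / (t @ t)               # Y weights
    u_new = Y @ c / (c @ c)             # Y scores
    converged = np.linalg.norm(u_new - u) < 1e-12 * np.linalg.norm(u_new)
    u = u_new
    if converged:
        break

p = X.T @ t / (t @ t)                   # X loadings
print(np.corrcoef(t, u)[0, 1])          # correlation between the X and Y scores
```

Each observation ends up with one value of t and one of u; each X variable with one value of w and p; each Y variable with one value of c.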
Notation
Each obs has values of t (and u) – Each variable has values of p (and w and c)
•
One Component consists of one t and one p (PCA) or t, p, w, u, c (PLS).
The total number of components is A.
•
Model: The data are approximated by a plane or hyperplane (the model)
with as many dimensions as components extracted.
•
DModX: the distance to the model, i.e., the distance of a given
observation to the model plane.
•
T2: Hotelling’s T2 is a combination of all the scores (t) of all A components.
T2 measures how far away an observation is from the center of a PC or PLS
model.
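A simplified sketch of both diagnostics for a PCA model follows. The data are simulated, with one observation made deliberately extreme. The distance computed here is a plain residual standard deviation per observation; the normalization and correction factors that SIMCA-P applies to DModX are not reproduced, so treat `dist_to_model` as a stand-in.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 6))            # hypothetical workset, 30 obs x 6 variables
X[0] += 6.0                             # make observation 0 deliberately extreme
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

A = 2
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
P = Vt[:A].T
T = Z @ P                               # scores
E = Z - T @ P.T                         # residuals

# Hotelling's T2: squared scores, each divided by that component's score variance.
t2 = np.sum(T**2 / T.var(axis=0, ddof=1), axis=1)

# Residual distance per observation (an unnormalized stand-in for DModX).
K = Z.shape[1]
dist_to_model = np.sqrt(np.sum(E**2, axis=1) / (K - A))

print(int(np.argmax(t2)), t2.max())
```

T2 looks at how far an observation lies from the center within the model plane, while the residual distance looks at how far it lies from the plane itself, so the two can flag different observations.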
Notation
•
R2X: The fraction of the variation of the X variables explained by the model.
•
R2Y: The fraction of the variation of the Y variables explained by the model.
•
Q2X: The fraction of the variation of the X variables predicted by the model
according to cross-validation.
•
Q2Y: The fraction of the variation of the Y variables predicted by the model
according to cross-validation.
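The difference between "explained" and "predicted" can be sketched for a single response as follows. The data are simulated, the model is a one-component PLS written out by hand, and leave-one-out is used as the cross-validation scheme; all of these are assumptions for illustration, and SIMCA-P's own cross-validation settings may differ.

```python
import numpy as np

def fit_pls1(X, y):
    """One-component PLS for a single response; returns centering values and weights."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = Xc.T @ yc
    w /= np.linalg.norm(w)
    t = Xc @ w
    c = (yc @ t) / (t @ t)
    return x_mean, y_mean, w, c

def predict_pls1(model, X):
    """Apply a fitted one-component model to (possibly new) observations."""
    x_mean, y_mean, w, c = model
    return y_mean + ((X - x_mean) @ w) * c

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 4))                        # hypothetical predictors
y = X @ np.array([1.0, 0.8, 0.0, 0.0]) + 0.3 * rng.normal(size=25)

ss_total = np.sum((y - y.mean())**2)

# R2Y: fit on all observations, then compare fitted and observed values.
fitted = predict_pls1(fit_pls1(X, y), X)
r2y = 1.0 - np.sum((y - fitted)**2) / ss_total

# Q2Y: predict each observation from a model fitted without it (leave-one-out).
press = 0.0
for i in range(len(y)):
    keep = np.arange(len(y)) != i
    model = fit_pls1(X[keep], y[keep])
    press += (y[i] - predict_pls1(model, X[i:i + 1])[0])**2
q2y = 1.0 - press / ss_total

print(round(r2y, 3), round(q2y, 3))
```

R2 is computed from the fit to the data used to build the model, while Q2 is computed from predictions of data left out of the fit.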
MVA – SIMCA Road Map
Methods available
•
Preprocessing; trimming and Winsorizing (take away extremes)
•
Principal Components Analysis (PCA; overview of data)
•
Projection to Latent Structures (PLS; relationships X↔Y)
•
Simca classification
•
PLS-discriminant analysis (classification)
•
Hierarchical PCA and PLS
•
Predictions and classification of new data using any model
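As an illustration of the preprocessing item, here is a minimal sketch of Winsorizing and trimming under their common definitions: Winsorizing pulls extreme values in to a limit, trimming drops them (here by marking them as missing). The percentile limits and the function names are assumptions; the options offered in SIMCA-P's dialog are not reproduced.

```python
import numpy as np

def winsorize(x, lower_pct=5.0, upper_pct=95.0):
    """Pull values outside the given percentiles in to the percentile limits."""
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)

def trim(x, lower_pct=5.0, upper_pct=95.0):
    """Replace values outside the given percentiles with NaN (treated as missing)."""
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.where((x < lo) | (x > hi), np.nan, x)

# Hypothetical variable with one obvious extreme value.
values = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 100.0])
print(winsorize(values))
print(trim(values))
```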
MVA – SIMCA Road Map
Data set = all data; Workset = working copy of data
1. Start a project
– File New
– Read Data File
– Specify Label Cols & Rows
2. Look at the data
– Data set: Quick Info; Variables or Obs.
– Preprocessing, Trim, etc.
3. Prepare a work copy
– Workset: variables, observations
– Preprocessing, Class spec.
4. Fit the model
– Analysis: Autofit, or fast button
5. Plot results
– Analysis: Scores, Loadings, Distance to Model
6a. Outliers in scores
– Polish data
– Prepare new workset, graphically or via Workset
6b. No outliers in scores
– Continue
– Interpret model (plots)
– Relate to Objective
7. New data
– Predictions: Select Pred.set (observations)
– T_pred, Y_pred, DModX, etc.
Work the main menus from left to right, and the pop-up menus from top to bottom.
Plot / List allows you to plot or list anything non-standard, not found under Analysis.
Steps in Using SIMCA-P with the Wizard
•
Start a new project and import the data set
•
Use the workset wizard to guide you through building the workset and fitting the
model
•
Use the report writer to walk through the model results and
interpretation
•
When displaying SIMCA-P plots, always use the Analysis Advisor to guide
you.
Workset wizard switched ON
Workset wizard
Autotransform variables
To transform all variables that need it, mark the check box
Automatic creation of classes for classification or
discrimination
Selection and Fit of model
Report writer
Walks you through the model results with interpretation: File | Generate Report
Steps in Using SIMCA-P, Advanced Mode
•
Start a new project and import the data set
•
Explore and preprocess the data
•
Make working copy of selected data (workset) for model building
•
Specify model type and fit it to the workset
•
Review fit (plots, diagnostics, coefficients, etc.)
•
Predictions
•
Generate Report
1a. File New
Starting a new project
•
Select the data file containing the raw data of the project
– directory, file type (XLS, DIF, TXT, …), file name
•
A Wizard opens (see next page) allowing you to specify (optionally) the
row containing the Variable names, and (optionally) the columns with
the Obs. Numbers and Names
•
Here (Commands) you can also do additional things such as
– transposing the input data matrix
•
Use simple mode with workset wizard
•
At the last Wizard page, you can (optionally) specify another name and
directory for the project.
•
A map of the missing data is shown
•
The Wizard finishes and puts you in the Simca-window
•
A starting workset (M1, all data, all variables as X, UV-scaled) is ready
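Two of the items above, the missing-data map and UV (unit variance) scaling, can be sketched as follows. The small table is hypothetical, and representing missing values as NaN is an assumption for illustration.

```python
import numpy as np

# Hypothetical raw table with two missing values (NaN).
X = np.array([
    [1.0, 100.0, 0.5],
    [2.0, np.nan, 0.7],
    [3.0, 200.0, np.nan],
    [4.0, 400.0, 0.9],
])

# Missing-data map: True where a value is absent.
missing_map = np.isnan(X)

# UV scaling: center each variable and divide by its standard deviation,
# computed from the values that are present.
mean = np.nanmean(X, axis=0)
std = np.nanstd(X, axis=0, ddof=1)
Z = (X - mean) / std

print(missing_map.astype(int))
print(np.nanstd(Z, axis=0, ddof=1))
```

After UV scaling every variable has the same variance, so variables measured in large units do not dominate the model simply because of their scale.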
1b. The second screen of the Wizard
2. Looking at the data
•
With the data set table open (Data set edit):
•
Quick Info (both var and obs windows can be open)
– variables
– observations
•
Moving the cursor in the data set table up and down, or sideways, changes
the displayed variable and observation
•
In the quick info options you can specify what you want to look at
(histograms, auto-correlations, …), as well as which items should be the
basis for the plots
View variables or Observations, Trim, etc.
Quick Info
3. Prepare a work copy: The Workset
Simple Mode with guidance, or Advanced Mode
•
In Workset, you prepare a working copy of the part of the data you will
analyze, i.e., use as the basis of your model.
•
Here you specify transformation, scaling, and roles of variables (X or Y or
excluded).
•
Also, you select the observations (your “training set”).
•
You can start with the previous workset (Workset / New as model xx) and
then modify it, e.g., excluding observations.
•
Whatever you do in Workset does NOT touch the raw data
•
Note that outliers are just specified as “not included” in the next workset (the
“polished” data). Outliers are NEVER removed from the raw data set.
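The idea that a workset is a selection on top of untouched raw data can be sketched with a boolean mask. The table, the observation names, and the choice of which observation to exclude are all hypothetical.

```python
import numpy as np

# Hypothetical raw data set: 5 observations x 2 variables.
raw = np.array([
    [5.1, 3.5],
    [4.9, 3.0],
    [4.7, 3.2],
    [9.9, 9.9],   # suspected outlier
    [5.0, 3.6],
])
obs_names = ["obs1", "obs2", "obs3", "obs4", "obs5"]

# The workset only records which observations are included.
include = np.ones(len(raw), dtype=bool)
include[obs_names.index("obs4")] = False    # mark as "not included"

workset = raw[include]                      # boolean indexing returns a copy

print(workset.shape, raw.shape)             # the raw data keep all 5 observations
```

Re-including the observation later is just a matter of flipping the mask entry back, since nothing was deleted.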
Workset: two Modes, Simple and Advanced
4. Analysis
Fit the Model to the Workset Data
•
Either menu “Analysis / Autofit” or Fast Button
•
A model with an appropriate number of components is found
– If nothing happens, get the first two components
(also menu or fast button)
•
A table appears showing the model, component by component.
•
More components can be added (menu or fast button)
•
Double click on a model to specify a title
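A component-by-component summary of a PCA model can be sketched as below, showing the fraction of X variation explained by each component and cumulatively. The data are simulated. The rule Autofit uses to decide how many components are appropriate is not reproduced here, and the table SIMCA-P displays contains more columns than this.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical workset: 30 observations x 5 variables with some shared structure.
latent = rng.normal(size=(30, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.3 * rng.normal(size=(30, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Singular values give the variation captured by each successive component.
s = np.linalg.svd(Z, compute_uv=False)
r2x = s**2 / np.sum(s**2)
r2x_cum = np.cumsum(r2x)

print("Comp      R2X  R2X(cum)")
for a, (r, rc) in enumerate(zip(r2x, r2x_cum), start=1):
    print(f"{a:>4} {r:8.3f} {rc:9.3f}")
```

Adding a component never lowers the cumulative R2X, which is why the explained fraction alone cannot decide how many components to keep.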
5. Plot results
Analysis / menu (or fast buttons)
•
Summary / X/Y-Overview shows R2 and Q2 for all variables
•
Scores – scatter plot, t1-t2 and t1-u1 & t2-u2 (PLS)
•
Loadings – scatter plot (p1-p2 for PCA, wc1-wc2 for PLS)
•
Distance to Model – line plot
•
Contribution plots to interpret interesting observations, e.g., outliers, jumps, …
•
For all plots, right mouse button / Properties allows a choice of plot
markers, and more
•
The graphical tool box allows further modifications
6a. Outliers were seen in the score plot
(well outside the Hotelling ellipse)
•
Start another workset
(either from Workset / New as model xx, or using the graphical tool-box to
remove outliers from the score plot)
•
Note that outliers should NOT be deleted from the data by Edit/Data set
•
When the new workset is all right, return to “4. Analysis” to fit a new model
to the new workset
(fast button or Analysis/Autofit)
6b. No outliers were seen in the score plots
(or they have been excluded, and the score plots now look all right)
•
Now, interpret the model
•
Look at “patterns”, trends, etc., in the score plots
•
Inspect the loading plots to interpret the above patterns
•
Look at DModX
•
What do these patterns say about the objective of the investigation?
Analysis Advisor to understand and interpret model results
7. Predictions
New Data, Prediction Set
•
Under Predictions, specify the set of observations for which predictions will
be made, the prediction set
•
New data can be read in as a secondary data set
(File / Import) and predictions can be made for these
•
Prediction set / Complement WS, gives a prediction set with those
observations that were not in the training set
•
Predictions / Y-predicted, T-predicted, etc., calculates and displays the
predicted values accordingly
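What predicting for new observations involves can be sketched for a PCA model as follows. The training and prediction sets are simulated, and the residual distance is an unnormalized stand-in for DModX. The point of the sketch is that the scaling parameters and loadings come from the training set only and are then applied unchanged to the new data.

```python
import numpy as np

rng = np.random.default_rng(5)
train = rng.normal(size=(40, 5))        # hypothetical training set (workset)
new = rng.normal(size=(3, 5))           # hypothetical prediction set

# Fit: scaling parameters and loadings come from the training set only.
mean, std = train.mean(axis=0), train.std(axis=0, ddof=1)
_, _, Vt = np.linalg.svd((train - mean) / std, full_matrices=False)
A = 2
P = Vt[:A].T

# Predict: apply the stored scaling and loadings to the new observations.
Z_new = (new - mean) / std
T_pred = Z_new @ P                      # predicted scores
E_new = Z_new - T_pred @ P.T            # what the model does not capture
dist_new = np.sqrt(np.sum(E_new**2, axis=1) / (Z_new.shape[1] - A))

print(T_pred)
print(dist_new)
```

A new observation with an unusually large residual distance does not resemble the training set, so its predictions deserve less trust.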
8. Generate the report, with customizable templates
Use of these slides
•
You may use any or all of these slides in your own presentations, provided
that you keep (and do not modify) the Umetrics logo and web reference
•
If you have any problems with the software, or with understanding of the
material, please e-mail us at
[email protected]