What is Multivariate Analysis
• Multivariate analysis is the best way to summarize a data table with many variables by creating a few new variables containing most of the information. These new variables are then used for problem solving and display, e.g., classification, relationships, control charts, and more.
• The new variables, the scores, denoted by t, are created as weighted linear combinations of the original variables. Each observation has t-values.
• PCA, the basic MV method, summarizes one data table.
• Plotting the scores (t's) gives an overview of the observations (objects).
• PLS simultaneously summarizes two data tables, X (the predictor variables) and Y (the response variables), in order to develop a relationship between them.
• PCA and PLS are called projection methods.

www.umetrics.com 05-08-17 SIMCA-P Getting started.ppt 1 (29)

What is a Projection?
Reduction of dimensionality, model in latent variables
• Algebraically
  – Summarizes the information in the observations as a few new (latent) variables
• Geometrically
  – The swarm of points in a K-dimensional space (K = number of variables) is approximated by a (hyper)plane, and the points are projected onto that plane.

Notation
Each observation has values of t (and u); each variable has values of p (and w and c)
• t: the X scores; the new summarizing variables (coordinates in the hyperplane of X-space)
• u: the Y scores in PLS; the new summarizing variables (coordinates in the hyperplane of Y-space, when Y is multidimensional)
• p: the PC loadings. These are the weights that in PCA combine the original variables in X to form the new variables, the scores t.
• w*: the PLS weights. These are the weights that in PLS combine the original variables in X to form the new variables, the scores t.
• c: the weights used to combine the Y's to form the scores u.
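The idea of scores as weighted linear combinations, and of projecting the point swarm onto a plane, can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not SIMCA-P's algorithm: the data are random and hypothetical, the variables are centered and scaled to unit variance, and the loadings are taken from a singular value decomposition. The function name `pca_scores` is made up for this example.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Return scaled data, scores T, loadings P, and the projection onto the model plane."""
    # Center each variable and scale it to unit variance (an assumed preprocessing choice).
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # The leading right singular vectors of the scaled data serve as the loadings p.
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    P = Vt[:n_components].T      # K x A: one weight per variable and component
    T = Xs @ P                   # N x A: scores, weighted sums of the original variables
    X_hat = T @ P.T              # each observation projected onto the (hyper)plane
    return Xs, T, P, X_hat

# Hypothetical data: 50 observations, 6 variables driven by 2 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(50, 6))

Xs, T, P, X_hat = pca_scores(X, n_components=2)

# Fraction of the variation in the scaled data that the two new variables retain.
explained = 1.0 - np.sum((Xs - X_hat) ** 2) / np.sum(Xs ** 2)
print("scores:", T.shape, "loadings:", P.shape, "fraction retained:", round(explained, 3))
```

Each row of `T` holds the t-values of one observation, and each row of `P` holds the p-values of one variable, which matches the notation above. Plotting the first column of `T` against the second gives the kind of score plot used for an overview of the observations.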
Notation
• One component consists of one t and one p (PCA), or of t, p, w, u, c (PLS). The total number of components is A.
• Model: The data are approximated by a plane or hyperplane (the model) with as many dimensions as components extracted.
• DModX: also called Distance to the Model; the distance of a given observation to the model plane.
• T2: Hotelling's T2 is a combination of all the scores (t) of all A components. T2 measures how far away an observation is from the center of a PC or PLS model.

Notation
• R2X: The fraction of the variation of the X variables explained by the model.
• R2Y: The fraction of the variation of the Y variables explained by the model.
• Q2X: The fraction of the variation of the X variables predicted by the model according to cross-validation.
• Q2Y: The fraction of the variation of the Y variables predicted by the model according to cross-validation.

MVA – SIMCA Road Map
Methods available
• Preprocessing; trimming and Winsorizing (take away extremes)
• Principal Components Analysis (PCA; overview of data)
• Projection to Latent Structures (PLS; relationships X↔Y)
• SIMCA classification
• PLS-discriminant analysis (classification)
• Hierarchical PCA and PLS
• Predictions and classification of new data using any model

MVA – SIMCA Road Map
Data set = all data; Work set = working copy of data
1. Start a project. File / New: read the data file; specify label columns and rows.
2. Look at the data. Data set / Quick Info (Variables or Obs.): preprocessing, trim, etc.
3. Prepare a work copy. Workset: variables, observations, preprocessing, class specification.
4. Fit the model. Analysis: Autofit or fast button.
5. Plot results. Analysis: Scores, Loadings, Distance to Model.
6a. Outliers in scores. Polish data: prepare a new workset, graphically or via Workset.
6b. No outliers in scores. Continue: interpret the model (plots); relate to the objective.
7. New data. Predictions: select the prediction set (observations); T_pred, Y_pred, DModX, etc.
Work main menus from left to right and pop-up menus from top to bottom.
Plot / List allows you to plot or list anything non-standard that is not found under Analysis.

Steps in using SIMCA-P with the wizard
• Start a new project and import the data set
• Use the workset wizard to guide you through building the workset and fitting the model
• Use the report writer to walk through the model results and interpretation
• When displaying SIMCA-P plots, always use the Analysis Advisor to guide you.

Workset wizard set to ON

Workset wizard

Autotransform variables
To transform all variables if any of them needs transformation, mark the check box.

Automatic creation of classes for classification or discrimination

Selection and Fit of model

Report writer
Walks you through the model results with interpretation: File | Generate Report

Steps in Using SIMCA-P, Advanced Mode
• Start a new project and import the data set
• Explore and preprocess the data
• Make a working copy of selected data (workset) for model building
• Specify the model type and fit it to the workset
• Review the fit (plots, diagnostics, coefficients, etc.)
• Predictions
• Generate Report

1a. File / New: Starting a new project
• Select the data file containing the raw data of the project
  – directory, file type (XLS, DIF, TXT, …), file name
• A Wizard opens (see next slide) allowing you to specify (optionally) the row containing the variable names, and (optionally) the columns with the observation numbers and names
• Here (Commands) you can also do additional things, such as
  – transposing the input data matrix
• Use simple mode with the workset wizard
• On the last Wizard page, you can (optionally) specify another name and directory for the project.
• A map of the missing data is shown
• The Wizard finishes and puts you in the SIMCA window
• A starting workset (M1: all data, all variables as X, UV-scaled, i.e., scaled to unit variance) is ready

1b. The second screen of the Wizard

2. Looking at the data
• With the data set table open (Data set edit):
• Quick Info (both the variable and observation windows can be open)
  – variables
  – observations
• Moving the cursor in the data set table up and down, or sideways, changes the displayed variable and observation
• In the Quick Info options you can specify what you want to look at (histograms, auto-correlations, …), as well as which items should be the basis for the plots

View variables or observations, Trim, etc.
Quick Info

3. Prepare a work copy: The Workset
Simple Mode with guidance, or Advanced Mode
• In Workset, you prepare a working copy of the part of the data you will analyze, i.e., use as the basis of your model.
• Here you specify transformation, scaling, and roles of variables (X or Y or excluded).
• Also, you select the observations (your “training set”).
• You can start with the previous workset (Workset / New as model xx) and then modify it, e.g., by excluding observations.
• Whatever you do in Workset does NOT touch the raw data
• Note that outliers are just specified as “not included” in the next workset (the “polished” data). Outliers are NEVER removed from the raw data set.

Workset: two Modes, Simple and Advanced

4. Analysis: Fit the Model to the Workset Data
• Either menu “Analysis / Autofit” or fast button
• A model with an appropriate number of components is found
  – If nothing happens, calculate the first two components (also via menu or fast button)
• A table appears showing the model, component by component.
• More components can be added (menu or fast button)
• Double-click on a model to specify a title

5. Plot results: Analysis menu (or fast buttons)
• Summary / X/Y-Overview shows R2 and Q2 for all variables
• Scores: scatter plot, t1-t2, and t1-u1 and t2-u2 (PLS)
• Loadings: scatter plot (p1-p2 for PCA, wc1-wc2 for PLS)
• Distance to Model: line plot
• Contribution plots to interpret interesting observations, e.g., outliers, jumps, …
• For all plots, right mouse button / Properties allows a choice of plot markers and more
• The graphical tool box allows further modifications

6a. Outliers were seen in the score plot (well outside the Hotelling ellipse)
• Start another workset (either from Workset / New as model xx, or by using the graphical tool box to remove outliers from the score plot)
• Note that outliers should NOT be deleted from the data by Edit / Data set
• When the new workset is all right, return to “4. Analysis” to fit a new model to the new workset (fast button or Analysis / Autofit)

6b. No outliers were seen in the score plots (or they have been excluded, and the score plots now look all right)
• Now, interpret the model
• Look at “patterns”, trends, etc., in the score plots
• Inspect the loading plots to interpret the above patterns
• Look at DModX
• What do these patterns say about the objective of the investigation?

Analysis Advisor to understand and interpret model results

7. Predictions: New Data, Prediction Set
• Under Predictions, specify the set of observations for which predictions will be made, the prediction set
• New data can be read in as a secondary data set (File / Import), and predictions can be made for these
• Prediction set / Complement WS gives a prediction set with those observations that were not in the training set
• Predictions / Y-predicted, T-predicted, etc., calculates and displays the predicted values accordingly

8. Generate the report, with customizable templates

Use of these slides
• You may use any or all of these slides in your own presentations, provided that you keep (and do not modify) the Umetrics logo and web reference
• If you have any problems with the software, or with understanding the material, please e-mail us at [email protected]
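
For step 7 (Predictions), the quantities named in the slides (predicted scores, distance to the model, Hotelling's T2) can be illustrated with a small NumPy sketch. It rests on assumptions that are not in the slides: the data are random and hypothetical, `fit_pca` and `predict` are made-up helper names, the distance is a plain unnormalized residual length, and T2 is the sum of squared scores divided by the training score variances. SIMCA-P's own DModX and T2 scaling may differ.

```python
import numpy as np

def fit_pca(X, n_components):
    """Fit a PCA model on the training set (the workset observations)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    Xs = (X - mean) / std                       # centered, unit-variance scaled
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    P = Vt[:n_components].T                     # loadings
    score_var = (Xs @ P).var(axis=0, ddof=1)    # variance of each training score
    return {"mean": mean, "std": std, "P": P, "score_var": score_var}

def predict(model, X_new):
    """Return predicted scores, residual distance to the plane, and a T2-like statistic."""
    Xs = (X_new - model["mean"]) / model["std"]   # same scaling as the training set
    T_pred = Xs @ model["P"]
    residual = Xs - T_pred @ model["P"].T         # the part the model plane does not capture
    dist = np.sqrt((residual ** 2).sum(axis=1))   # unnormalized distance to the model plane
    t2 = ((T_pred ** 2) / model["score_var"]).sum(axis=1)
    return T_pred, dist, t2

# Hypothetical training set: 40 observations, 6 variables, 2 underlying factors.
rng = np.random.default_rng(1)
mixing = rng.normal(size=(2, 6))
X_train = rng.normal(size=(40, 2)) @ mixing + 0.05 * rng.normal(size=(40, 6))

# Hypothetical prediction set: 5 new observations; the last one is pushed off the plane.
X_new = rng.normal(size=(5, 2)) @ mixing + 0.05 * rng.normal(size=(5, 6))
X_new[-1, 0] += 10 * X_train[:, 0].std(ddof=1)

model = fit_pca(X_train, n_components=2)
T_pred, dist, t2 = predict(model, X_new)
print("distance to model:", np.round(dist, 2))
print("T2-like statistic:", np.round(t2, 2))
```

The new observations are scaled with the training set's means and standard deviations, so the training data are never modified. In this sketch the perturbed observation shows a much larger distance to the model than the others, which is the kind of pattern the Distance to Model plot is meant to reveal.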