Multiple Linear Regression
Objective:
This example shows how to perform a multiple linear regression analysis on the basis of existing process data (field data). After completing the example, you will be able to carry out such regression analyses on your own and to evaluate the results using different statistical values and graphical charts.
Main points: Performing a regression analysis
Pre-Requisites: Basics of regression
Important:
Please switch to the qs-STAT Analysis of Regression/Variance program module in order to be
able to access the function as described below (menu
item Module|qs-STAT Analysis of Regression/Variance).
Initial situation:
Ammonium sulphate is filled into sacks. During this process, agglutinations frequently occur and block the filling system. Observing (measuring) possible causes should indicate which influence factors the response, the flow rate of the filling system, depends on most. The following potential influence factors were analyzed:
x1 = humidity of the ammonium sulphate (in 0.01%),
x2 = ratio of length to width of the crystals, and
x3 = contamination in the ammonium sulphate (in 0.01%).
48 data sets were recorded.
Task:
The recorded data shall be analyzed using a multiple
linear regression analysis in order to determine the
significance of the individual influence factors.
Remark:
You can find the data in the FLOWRATE_REGRESSION.DFQ file. Alternatively, you can also create a new file using the data in the table below.
Procedure:
1. Select the File|Open menu function and open the FLOWRATE_REGRESSION.DFQ file.
2. Select the REGRESSION ANALYSIS function from the ANALYSIS / PROCEDURE menu item and then choose MULTIPLE REGRESSION / LINEAR REGRESSION.
Version: 1
© 2008 Q-DAS GmbH & Co. KG, 69469 Weinheim
Doc:
S-FB 154 E
3. The characteristic selection dialog can be opened with a mouse click on “Linear regression” or on the Next button. The response and the influence factors are selected by setting “crosses” with mouse clicks.
The interpretation of the results is discussed in the following sections. This example covers the most important aspects.
Parameter estimation
The first results are the estimations for the regression
coefficients.
The coefficient of determination is not very high with R² = 57.493%. It indicates how well the variation of the response can be explained by the influence factors.
Behind the descriptions of the response and influence magnitudes, the regression coefficients (bi) and their confidence intervals are listed. For the evaluation of the coefficients, their standard deviation (sci) and the t-statistic are output as well. Analogous to the t-values, the bar chart shows whether the effect of an influence magnitude is significant. In addition, the VIF values shown are a measure of the dependency of the individual influence factors on each other.
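The parameter estimation described above can also be retraced outside qs-STAT. The following is a minimal sketch in plain Python, assuming an ordinary least-squares fit of the model y = b0 + b1*x1 + b2*x2 + b3*x3 via the normal equations; the tutorial does not state which numerical method qs-STAT uses, so small differences from the displayed values are possible. The data are the 48 data sets listed at the end of this document (decimal commas written as points). The helper name solve and all variable names are illustrative only.

```python
import math

# The 48 data sets from the table at the end of this document.
flow = [
    5.00, 4.81, 4.46, 4.81, 4.46, 3.85, 3.21, 3.25, 4.55, 4.85, 4.00, 3.62,
    5.15, 3.76, 4.90, 4.13, 5.10, 5.05, 4.27, 4.90, 4.55, 5.32, 4.39, 4.85,
    4.59, 5.00, 3.82, 3.68, 5.15, 2.94, 3.18, 2.28, 5.00, 2.43, 0.00, 4.10,
    3.70, 3.36, 3.79, 3.40, 1.51, 0.00, 1.72, 2.33, 2.38, 3.68, 4.20, 5.00,
]
humidity = [
    21, 20, 16, 18, 16, 18, 12, 12, 13, 13, 17, 24,
    11, 10, 17, 14, 14, 14, 20, 12, 11, 10, 10, 16,
    17, 17, 17, 15, 17, 21, 23, 22, 21, 24, 37, 21,
    28, 29, 23, 32, 26, 28, 21, 22, 34, 29, 17, 11,
]
ratio = [
    2.4, 2.4, 2.4, 2.5, 3.2, 3.1, 3.2, 2.7, 2.7, 2.7, 2.7, 2.8,
    2.5, 2.6, 2.0, 2.0, 2.0, 1.9, 2.1, 1.9, 2.0, 2.0, 2.0, 2.0,
    2.2, 2.4, 2.4, 2.4, 2.2, 2.2, 2.2, 2.0, 1.9, 2.1, 2.3, 2.4,
    2.4, 2.4, 3.6, 3.3, 3.5, 3.5, 3.0, 3.0, 3.0, 3.5, 3.5, 3.2,
]
contamination = [
    0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 1, 0, 2, 1, 2, 7, 2, 2,
    3, 4, 0, 2, 3, 4, 10, 7, 4, 8, 14, 2,
    5, 7, 7, 8, 4, 12, 3, 6, 8, 5, 3, 2,
]


def solve(a, b):
    """Solve the linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


# Design matrix with an intercept column.
X = [[1.0, h, r, c] for h, r, c in zip(humidity, ratio, contamination)]
n, p = len(X), len(X[0])

# Normal equations: (X'X) b = X'y.
xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
xty = [sum(row[i] * y for row, y in zip(X, flow)) for i in range(p)]
b = solve(xtx, xty)

fitted = [sum(bj * xj for bj, xj in zip(b, row)) for row in X]
residuals = [y - f for y, f in zip(flow, fitted)]
mean_y = sum(flow) / n
sse = sum(e * e for e in residuals)
sst = sum((y - mean_y) ** 2 for y in flow)
r_squared = 1.0 - sse / sst

# Standard deviation of each coefficient: sqrt(s^2 * diag((X'X)^-1)).
s2 = sse / (n - p)
std_err = []
for j in range(p):
    unit = [1.0 if i == j else 0.0 for i in range(p)]
    std_err.append(math.sqrt(s2 * solve(xtx, unit)[j]))
t_values = [bj / sj for bj, sj in zip(b, std_err)]

names = ["intercept", "humidity", "ratio", "contamination"]
for name, bj, sj, tj in zip(names, b, std_err, t_values):
    print(f"{name:14s} b={bj:8.4f}  s={sj:7.4f}  t={tj:7.3f}")
print(f"R^2 = {r_squared:.3%}")
```

The absolute t-values play the same role as the bar chart: the larger the value, the stronger the evidence that the corresponding influence factor has an effect.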
Interpretation:
The influence factors explain only 57.5 % of the variation of the flow rate. This suggests that important effects have not been considered.
Contamination is the most important influence factor, followed by the ratio of length to width.
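The VIF values mentioned in the parameter estimation can be sketched as follows. This assumes the common textbook definition VIF_j = 1 / (1 - R_j²), where R_j² is the coefficient of determination obtained when influence factor j is regressed on the remaining influence factors; whether qs-STAT uses exactly this definition is not stated in this document. The data are the influence-factor columns of the table at the end of this document, and the helper names solve and r_squared_of are illustrative.

```python
# Influence-factor columns from the table at the end of this document.
humidity = [
    21, 20, 16, 18, 16, 18, 12, 12, 13, 13, 17, 24,
    11, 10, 17, 14, 14, 14, 20, 12, 11, 10, 10, 16,
    17, 17, 17, 15, 17, 21, 23, 22, 21, 24, 37, 21,
    28, 29, 23, 32, 26, 28, 21, 22, 34, 29, 17, 11,
]
ratio = [
    2.4, 2.4, 2.4, 2.5, 3.2, 3.1, 3.2, 2.7, 2.7, 2.7, 2.7, 2.8,
    2.5, 2.6, 2.0, 2.0, 2.0, 1.9, 2.1, 1.9, 2.0, 2.0, 2.0, 2.0,
    2.2, 2.4, 2.4, 2.4, 2.2, 2.2, 2.2, 2.0, 1.9, 2.1, 2.3, 2.4,
    2.4, 2.4, 3.6, 3.3, 3.5, 3.5, 3.0, 3.0, 3.0, 3.5, 3.5, 3.2,
]
contamination = [
    0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 1, 0, 2, 1, 2, 7, 2, 2,
    3, 4, 0, 2, 3, 4, 10, 7, 4, 8, 14, 2,
    5, 7, 7, 8, 4, 12, 3, 6, 8, 5, 3, 2,
]


def solve(a, b):
    """Solve the linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def r_squared_of(X, y):
    """R^2 of a least-squares fit of y on the columns of X (X includes the intercept)."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    b = solve(xtx, xty)
    mean_y = sum(y) / len(y)
    sse = sum((yi - sum(bj * xj for bj, xj in zip(b, row))) ** 2 for row, yi in zip(X, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst


factors = {"humidity": humidity, "ratio": ratio, "contamination": contamination}
vif = {}
for name, target in factors.items():
    # Regress this factor on all other factors (plus intercept).
    others = [col for other, col in factors.items() if other != name]
    X = [[1.0] + [col[i] for col in others] for i in range(len(target))]
    vif[name] = 1.0 / (1.0 - r_squared_of(X, target))
    print(f"VIF({name}) = {vif[name]:.2f}")
```

A VIF of 1 means the factor is uncorrelated with the other factors; larger values indicate that the influence factors depend on each other, which makes the individual coefficients harder to interpret.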
Model evaluation
The next results serve for the evaluation of the overall
model approach. It is tested whether the basic requirements for the model are met.
The first test is used to double-check the (quasi-) linear relation of the selected approach.
The second test checks whether all influence factors
together have a significant effect on the response.
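The second test can be illustrated with a short calculation. The sketch below assumes the usual textbook form of the overall F statistic, F = (R² / k) / ((1 - R²) / (n - k - 1)), and uses only values stated in this example (R² = 57.493 %, n = 48 data sets, k = 3 influence factors). The exact test procedure used by qs-STAT is not described in this document.

```python
# Values taken from this example.
r_squared = 0.57493   # coefficient of determination
n = 48                # number of data sets
k = 3                 # number of influence factors

# Degrees of freedom of the model and of the residuals.
df_model = k
df_error = n - k - 1

# Overall F statistic (assumed textbook form).
f_stat = (r_squared / df_model) / ((1.0 - r_squared) / df_error)
print(f"F({df_model}, {df_error}) = {f_stat:.2f}")  # prints: F(3, 44) = 19.84
```

A tabulated 5 % critical value for (3, 44) degrees of freedom is roughly 2.8, so a statistic of this size would indicate a significant overall effect of the influence factors.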
Interpretation:
The selected linear relation can be used for this problem. The hypothesis for the linear relation is not rejected (green color).
The overall influence of all influence factors on the response is significant (red color). This means the selected regression approach is usable.
Response Surface Plots Overview
This chart allows a meaningful graphical interpretation of the relations. It shows the effect of two influence factors on the response for a fixed setting of the remaining influence factors.
Interpretation:
The chart shows the effect of the humidity [%] and the
length-width ratio under a given contamination of ca. 2
%. The surface changes depending on the setting for
the contamination [%]. The highest flow rate for the
selected setting can be detected in the top left corner
(red coloring).
Single-factor plots overview
The Single-factor plots overview shows which influence factors have which effect, and what prognosis
value can be expected for a user-defined setting of the
influence factors.
Interpretation:
For the selected setting for the influence factors (red
lines), we can expect a flow rate of ca. 4, with a prediction interval of ± 1.7 which indicates the accuracy of
the prognosis.
Analysis of the residuals
Using the residuals (the difference between the calculated value and the measured value of the response), we can determine graphically whether the assumptions of the regression approach are fulfilled.
1. The value chart of the residuals (chart up top) displays the behavior of the residuals per value number. Ideally, the behavior of residuals is random.
2. The probability plot (chart in the lower left corner)
serves for the assessment of the assumption that
the residuals are normally distributed.
3. The chart in the lower right corner compares the
residuals with the estimated values (fitted values) in
a scatter plot. Ideally, the data points are distributed
randomly in the coordinate-system.
Interpretation:
The residuals do not seem to be random, at least not after the 35th value. It should be double-checked whether something special happened during the data recording process.
The assumption of a normal distribution of the residuals cannot be rejected based on the probability plot.
The scatter chart of the fitted values and the residuals
does not show anything special either.
Other graphical evaluations of the model
Leverage
The leverage measures how far the individual values of the influence factors deviate from their average (average vector). These values are used to check whether individual value sets of the influence factors can be interpreted as outliers. When working with small sample sizes, a few extreme values can strongly influence the estimation of the regression coefficients.
Interpretation:
Value numbers 22, 35 and 42 can be regarded as outliers that could influence the evaluations. For testing purposes, the regression could be re-evaluated without these values and compared to the existing approach. In this example, this would lead to a slightly smaller coefficient of determination, but the importance of the influence factors would not change.
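The leverage values discussed above can be sketched in plain Python. This assumes the standard definition of leverage as the diagonal of the hat matrix, h_ii = x_i' (X'X)^-1 x_i, for a model with an intercept; the cut-off 2p/n used for flagging is a common rule of thumb and an assumption here, not a threshold taken from qs-STAT. Only the influence-factor columns of the table at the end of this document are needed, and the helper name solve is illustrative.

```python
# Influence-factor columns from the table at the end of this document.
humidity = [
    21, 20, 16, 18, 16, 18, 12, 12, 13, 13, 17, 24,
    11, 10, 17, 14, 14, 14, 20, 12, 11, 10, 10, 16,
    17, 17, 17, 15, 17, 21, 23, 22, 21, 24, 37, 21,
    28, 29, 23, 32, 26, 28, 21, 22, 34, 29, 17, 11,
]
ratio = [
    2.4, 2.4, 2.4, 2.5, 3.2, 3.1, 3.2, 2.7, 2.7, 2.7, 2.7, 2.8,
    2.5, 2.6, 2.0, 2.0, 2.0, 1.9, 2.1, 1.9, 2.0, 2.0, 2.0, 2.0,
    2.2, 2.4, 2.4, 2.4, 2.2, 2.2, 2.2, 2.0, 1.9, 2.1, 2.3, 2.4,
    2.4, 2.4, 3.6, 3.3, 3.5, 3.5, 3.0, 3.0, 3.0, 3.5, 3.5, 3.2,
]
contamination = [
    0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 1, 0, 2, 1, 2, 7, 2, 2,
    3, 4, 0, 2, 3, 4, 10, 7, 4, 8, 14, 2,
    5, 7, 7, 8, 4, 12, 3, 6, 8, 5, 3, 2,
]


def solve(a, b):
    """Solve the linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


# Design matrix with an intercept column.
X = [[1.0, h, r, c] for h, r, c in zip(humidity, ratio, contamination)]
n, p = len(X), len(X[0])
xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]

# Leverage of each data set: h_ii = x_i' (X'X)^-1 x_i.
leverage = []
for row in X:
    z = solve(xtx, row)  # z = (X'X)^-1 x_i
    leverage.append(sum(zi * xi for zi, xi in zip(z, row)))

# Rule-of-thumb cut-off (assumption): flag data sets with h_ii > 2p/n.
threshold = 2.0 * p / n
flagged = [i + 1 for i, h in enumerate(leverage) if h > threshold]
print("threshold:", round(threshold, 3))
print("flagged data sets:", flagged)
```

The leverage values always sum to the number of model parameters p, which is a convenient check that the calculation is consistent.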
Cook’s distance
The Cook’s distances evaluate the significance of individual data sets for the estimation of the model parameters. It is tested how much the estimation of the response changes if a data set is removed from the sample.
Comparing the Cook’s distances to the leverage values in a scatter plot can indicate whether certain outliers of the influence factors strongly influence the estimation of the parameters.
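The relation between Cook’s distance and leverage can be sketched as follows. This assumes the standard textbook formula D_i = e_i² / (p * s²) * h_ii / (1 - h_ii)², where e_i is the residual, h_ii the leverage, p the number of model parameters and s² the residual variance; the exact scaling used by qs-STAT is not stated in this document. The data are the 48 data sets listed at the end of this document, and the helper name solve is illustrative.

```python
# The 48 data sets from the table at the end of this document.
flow = [
    5.00, 4.81, 4.46, 4.81, 4.46, 3.85, 3.21, 3.25, 4.55, 4.85, 4.00, 3.62,
    5.15, 3.76, 4.90, 4.13, 5.10, 5.05, 4.27, 4.90, 4.55, 5.32, 4.39, 4.85,
    4.59, 5.00, 3.82, 3.68, 5.15, 2.94, 3.18, 2.28, 5.00, 2.43, 0.00, 4.10,
    3.70, 3.36, 3.79, 3.40, 1.51, 0.00, 1.72, 2.33, 2.38, 3.68, 4.20, 5.00,
]
humidity = [
    21, 20, 16, 18, 16, 18, 12, 12, 13, 13, 17, 24,
    11, 10, 17, 14, 14, 14, 20, 12, 11, 10, 10, 16,
    17, 17, 17, 15, 17, 21, 23, 22, 21, 24, 37, 21,
    28, 29, 23, 32, 26, 28, 21, 22, 34, 29, 17, 11,
]
ratio = [
    2.4, 2.4, 2.4, 2.5, 3.2, 3.1, 3.2, 2.7, 2.7, 2.7, 2.7, 2.8,
    2.5, 2.6, 2.0, 2.0, 2.0, 1.9, 2.1, 1.9, 2.0, 2.0, 2.0, 2.0,
    2.2, 2.4, 2.4, 2.4, 2.2, 2.2, 2.2, 2.0, 1.9, 2.1, 2.3, 2.4,
    2.4, 2.4, 3.6, 3.3, 3.5, 3.5, 3.0, 3.0, 3.0, 3.5, 3.5, 3.2,
]
contamination = [
    0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 1, 0, 2, 1, 2, 7, 2, 2,
    3, 4, 0, 2, 3, 4, 10, 7, 4, 8, 14, 2,
    5, 7, 7, 8, 4, 12, 3, 6, 8, 5, 3, 2,
]


def solve(a, b):
    """Solve the linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


# Least-squares fit with an intercept column.
X = [[1.0, h, r, c] for h, r, c in zip(humidity, ratio, contamination)]
n, p = len(X), len(X[0])
xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
xty = [sum(row[i] * y for row, y in zip(X, flow)) for i in range(p)]
b = solve(xtx, xty)

residuals = [y - sum(bj * xj for bj, xj in zip(b, row)) for row, y in zip(X, flow)]
s2 = sum(e * e for e in residuals) / (n - p)  # residual variance
leverage = [sum(zi * xi for zi, xi in zip(solve(xtx, row), row)) for row in X]

# Cook's distance: D_i = e_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2.
cooks = [e * e / (p * s2) * h / (1.0 - h) ** 2 for e, h in zip(residuals, leverage)]

# List the five data sets with the largest Cook's distance.
ranked = sorted(range(n), key=lambda i: cooks[i], reverse=True)
for i in ranked[:5]:
    print(f"data set {i + 1:2d}: leverage={leverage[i]:.3f}  Cook's D={cooks[i]:.3f}")
```

Because the formula multiplies the squared residual by a term that grows with the leverage, a data set receives a large Cook’s distance only if it is both poorly fitted and far from the average of the influence factors.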
Interpretation:
Nothing out of the ordinary can be spotted in the chart
up top. The individual Cook’s distances are small.
The extreme outliers (high leverage-value: data sets
no. 22, 35 and 42) show a relatively large Cook’s distance. Therefore, they have an important influence on
the estimation of the model parameters.
Data sets:
No.   Flow rate   Humidity   Length to width   Contamination
1     5,00        21         2,40              0
2     4,81        20         2,40              0
3     4,46        16         2,40              0
4     4,81        18         2,50              0
5     4,46        16         3,20              0
6     3,85        18         3,10              1
7     3,21        12         3,20              1
8     3,25        12         2,70              0
9     4,55        13         2,70              0
10    4,85        13         2,70              0
11    4,00        17         2,70              0
12    3,62        24         2,80              0
13    5,15        11         2,50              0
14    3,76        10         2,60              0
15    4,90        17         2,00              0
16    4,13        14         2,00              0
17    5,10        14         2,00              1
18    5,05        14         1,90              0
19    4,27        20         2,10              2
20    4,90        12         1,90              1
21    4,55        11         2,00              2
22    5,32        10         2,00              7
23    4,39        10         2,00              2
24    4,85        16         2,00              2
25    4,59        17         2,20              3
26    5,00        17         2,40              4
27    3,82        17         2,40              0
28    3,68        15         2,40              2
29    5,15        17         2,20              3
30    2,94        21         2,20              4
31    3,18        23         2,20              10
32    2,28        22         2,00              7
33    5,00        21         1,90              4
34    2,43        24         2,10              8
35    0,00        37         2,30              14
36    4,10        21         2,40              2
37    3,70        28         2,40              5
38    3,36        29         2,40              7
39    3,79        23         3,60              7
40    3,40        32         3,30              8
41    1,51        26         3,50              4
42    0,00        28         3,50              12
43    1,72        21         3,00              3
44    2,33        22         3,00              6
45    2,38        34         3,00              8
46    3,68        29         3,50              5
47    4,20        17         3,50              3
48    5,00        11         3,20              2