How to avoid over-fitting in multivariate calibration—The

Transcription

How to avoid over-fitting in multivariate calibration—The
Analytica Chimica Acta 595 (2007) 98–106
How to avoid over-fitting in multivariate calibration—The
conventional validation approach and an alternative
N.M. Faber a,∗ , R. Rajk´o b
a
b
Chemometry Consultancy, Rubensstraat 7, 6717 VD Ede, The Netherlands
Department of Unit Operations and Food Engineering, Szeged College of Food Engineering, University of Szeged, H-6701 Szeged, POB 433, Hungary
Received 27 September 2006; received in revised form 17 May 2007; accepted 21 May 2007
Available online 25 May 2007
Abstract
This paper critically reviews the problem of over-fitting in multivariate calibration and the conventional validation-based approach to avoid it.
It proposes a randomization test that enables one to assess the statistical significance of each component that enters the model. This alternative is
compared with cross-validation and independent test set validation for the calibration of a near-infrared spectral data set using partial least squares
(PLS) regression. The results indicate that the alternative approach is more objective, since, unlike the validation-based approach, it does not require
the use of ‘soft’ decision rules. The alternative approach therefore appears to be a useful addition to the chemometrician’s toolbox.
© 2007 Elsevier B.V. All rights reserved.
Keywords: Multivariate calibration; PLS; Component selection; Cross-validation; Test set validation; Randomization test; Near-infrared spectroscopy
“. . .I personally have not been aware of clear unambiguous
automated warnings starting to appear when data was being
over-fitted . . .”
A.N. Davies, Spectroscopy Europe (2004).
1. Introduction
Multivariate calibration models play an important role in
various technical fields. These models are not only applied in
particular in the chemical, petrochemical, pharmaceutical, cosmetic, coloring, plastics, paper, rubber and foodstuffs industries,
but also in forensic, environmental, medical, sensory and marketing research. As an illustration, consider near-infrared (NIR)
spectroscopy, which is increasingly used for the characterization
of solid, semi-solid, fluid and vapor samples [1]. Frequently, the
objective with this characterization is to determine the value
of one or several concentrations in future unknown samples.
Multivariate calibration is then used to develop a quantitative
relation, i.e., a model, between the digitized spectra, stored in a
data matrix X, and the concentrations, stored in a data matrix Y,
as reviewed by Martens and Næs [2]. NIR spectroscopy is also
∗
Corresponding author. Tel.: +31 318 641985; fax: +31 318 642150.
E-mail address: [email protected] (N.M. Faber).
0003-2670/$ – see front matter © 2007 Elsevier B.V. All rights reserved.
doi:10.1016/j.aca.2007.05.030
increasingly used to infer other properties than concentrations,
e.g., the strength and viscosity of polymers, the thickness of a
tablet coating, and the octane rating of gasoline. It is important
to note that precise and accurate quantification on the basis of
highly non-selective NIR spectra is one of the major success
stories of chemometrics.
Various methods have been developed for building a multivariate calibration model. The three most common ones are
multiple linear regression (MLR), which is also known as ordinary least squares (OLS), principal component regression (PCR)
and partial least squares (PLS) regression. While MLR requires
more samples, denoted by N, than spectral channels, denoted
by K, PCR and PLS can handle the opposite case as well,
i.e., K > N. For that reason, they are often referred to as fullspectrum methods. PCR and PLS are able to cope with an
arbitrarily large number of spectral channels by compressing
the X-data into a relatively small number, denoted by A, of socalled t-scores—usually less than ten. The score matrix T of
size N × A then replaces the original X-matrix of size N × K
in the subsequent regression step, i.e., Y is regressed onto T
instead of X. The regression step amounts to solving a system of
equations where each sample represents an equation and each
t-score can be regarded as an unknown. Consequently, the strict
mathematical requirement follows that the number of samples
must exceed the number of t-scores, i.e. N > A. This require-
N.M. Faber, R. Rajk´o / Analytica Chimica Acta 595 (2007) 98–106
ment is easily fulfilled in practice. PCR constructs t-scores that
successively describe the maximum amount of variation in X
while being orthogonal to each other. PLS can be seen as a
further development of PCR because the Y-data contribute to
the construction of the t-scores [3]. The full-spectrum methods
are generally preferred since the additional wavelength selection step required for application of MLR, to ensure that N > K,
is problematic in itself. Moreover, the compression to a small
number of t-scores acts as an effective noise filter. PLS is currently the de-facto standard in chemometrics because it has often
been reported to exhibit a slight edge over PCR in applied work.
For this reason, we will in the remainder restrict ourselves to
PLS, although the proposed methodology should be equally
suited for use with other score-based multivariate calibration
methods.
2. Background
2.1. The problem of over-fitting
Usually, the first step towards constructing a PLS model is to
remove undesirable features from the X-data by pre-treatment
techniques such as filtering [1] or differentiation [4]. When the
data have been made appropriate for the actual modeling process, the next critical step serves to select the optimum model
dimensionality (also known as model rank), which is the number
of PLS components (also known as factors or latent variables)
that constitute the multivariate model. This step is equivalent
to determining the optimum degree of a polynomial for fitting
univariate (x,y)-data pairs. However, it is a much harder problem
to solve for multivariate calibration, owing to the larger amount
of input data at hand, with possibly intricate signal and noise
characteristics, and the consequently increased complexity of
the calibration method deployed. The state of the art concerning
commercially available software has been recently criticized by
Davies [5]: “Back in 1998 more advanced chemometric tools
were being made available as standard in spectrometer control
packages. This had, however, raised fears that the inherent dangers of over-fitting data were not being sufficiently addressed
99
in order to help inexperienced spectroscopists handle the additional computing power that was becoming available. I must
admit that the work of my co-column Editor in pushing for
“Good Chemometrics Practice” has hopefully raised awareness in the community of the potential pitfalls in using these
packages without due consideration, but I personally have not
been aware of clear unambiguous automated warnings starting to appear when data was being over-fitted.” (Our italics.)
Over-fitting causes harm because one not only incorporates predictive features of the data in the model, but also noise. The
implication is degraded model performance in the prediction
stage.
An example of over-fitting that is conveniently visualized
occurs when a (two-dimensional) plane is fitted using two scores,
while a (one-dimensional) line, using a single score, would be
appropriate (Fig. 1). It is readily observed that prediction is
still reliable in a restricted region, namely sufficiently close
to the line. The concept of a ‘correct’ predictive region while
effectively over-fitting is further illustrated for a univariate polynomial fit in Fig. 2. For interpolating points one distinguishes
a small but statistically significant increase of prediction uncertainty. By contrast, a large increase of prediction uncertainty is
clearly observed for extrapolating points. This is all the more
surprising because the fitted relationship is almost the same for
this particular example.
It is generally good advice to avoid extrapolation when
deploying an empirical, entirely data-driven, ‘soft’ calibration
model, since in a strict sense the estimated relationship is only
supported in a region close to the calibration points. However,
extrapolation is often implied (to some degree) by the goal of the
application. Apart from genuine prediction in time or forecasting
as in Fig. 2, important examples of unavoidable extrapolation
are:
• the detection of lower analyte concentrations in trace analysis;
• the determination of analyte content using the method of
standard additions;
• the development of a product with higher consumer appreciation in sensory and marketing research; and
Fig. 1. Two collinear X-variables onto which the Y-data ( ) are regressed. Note that an extremely high collinearity is the rule for adjacent channels in molecular
spectroscopy, e.g., NIR. The X-variables allow for the construction of only one stable component using the first score, t1 . By contrast, the plane spanned by the first
two scores, t1 and t2 , is unstable. The spread of the fitted Y-data points () around the first axis is caused by noise and should therefore be ignored. The model based
on the first score is an effective noise filter, whereas the plane is over-fitting the data.
100
N.M. Faber, R. Rajk´o / Analytica Chimica Acta 595 (2007) 98–106
(independent) validation samples and averaging the squared prediction errors, i.e., the differences between model prediction and
the associated ‘known’ reference value.1 The square root of this
average squared error is known as the root mean squared error of
prediction (RMSEP). In equation form, for increasing number
of components (A),
Nval
1 2
RMSEP(A) = (Yˆ A,n − Yref,n )
(1)
Nval n
Fig. 2. The population of the USA in the period 1900–1990 (䊉). This data
set is available through the built-in function census of Matlab (The Mathworks,
Natick, MA, USA). The data are fitted using a polynomial of degree 2 (
)
and 3 (
), respectively. The associated uncertainty bands are given by the
same line type.
• the search for a molecule with higher biological activity using
a quantitative structure activity relationship (QSAR) model.
One should further bear in mind that many calibration
samples are required to cover a high-dimensional space. Assuming that, for example, five (non-replicated) calibration points
are sufficient to fit a straight line; then surely many more
points are required to obtain the same coverage for a plane,
a three-dimensional space, etc. It is easily verified that a highdimensional space can be virtually empty (S. de Jong, personal
communication). It follows that with increasing model dimensionality (A) it becomes increasingly difficult to achieve the same
degree of interpolation for a multivariate model as for its familiar
univariate counterparts. In summary, the consequences of overfitting are likely to be much more disturbing for multivariate
calibration models and this is the very reason why component
selection is to be regarded as a critical step in multivariate predictive modeling.
2.2. The conventional validation approach to avoid
over-fitting
Many methods have been developed to tackle the problem
of over-fitting, of which model validation is the most frequently
applied one in practice. In the context of multivariate calibration,
validation amounts to assessing the ability of a model to predict
the property of interest for future samples, from the same type.
This assessment can be performed in two essentially different
modes, namely externally and internally. The adjective ‘external’ refers to the requirement that the validation samples (also
known as test samples) be independent of the samples used for
constructing the model, i.e., the calibration set; otherwise one
does not properly assess the ability to predict for truly unknown
future samples. For example, simple replicates are not allowed.
The predictive ability is estimated by applying the model to these
where Nval is the number of validation samples and Yˆ A,n and
Yref,n denote the model prediction with A components and
‘known’ reference value for sample n (n = 1, . . ., Nval ), respectively. Ideally, the results of this calculation lead to a clear (i.e.,
not too broad and shallow) minimum RMSEP for the optimum
model dimensionality. This is achieved in case of a favorable
bias-variance trade-off, see Fig. 3.
Internal validation differs from external validation in the
sense that the validation samples are taken from the calibration
set itself, i.e., the validation samples are not truly independent.
To execute an internal validation, one has the choice between (1)
cross-validation, (2) bootstrapping, (3) leverage correction and
(4) criteria originally intended in particular for variable selection
in connection with MLR (e.g., Mallow’s Cp ). In cross-validation,
one constructs models after judiciously leaving out segments of
(calibration) samples. Then an estimate of RMSEP follows by
averaging squared prediction errors for the left-out samples, as in
external validation. To emphasize that this estimate is not based
on truly independent validation samples, it will be denoted as
root mean squared error of cross-validation (RMSECV) in the
remainder of this paper. Cross-validation can be quite computerintensive, depending on the size of the calibration set and the
number of segments. Bootstrapping performs similarly to crossvalidation [7,8]. Leverage correction is only a ‘quick and dirty’
alternative when applied to PCR and PLS [9]. Finally, the criteria like Mallow’s Cp are seldom used. In the remainder we will
therefore focus on internal validation using cross-validation.
2.3. Problems with the conventional validation approach
Validation-based component selection is problematic for various general and specific reasons. Three general reasons are:
• Eq. (1) will only provide a correct estimate of RMSEP if the
reference values are ‘known’ with sufficient precision. This
condition is, however, often not fulfilled in practice. DiFoggio
[10] has coined the term apparent RMSEP to emphasize that
the result of Eq. (1) is a pessimistic estimate (i.e., biased high)
of the actual RMSEP because it contains a spurious error com1
To estimate the predictive ability of a multivariate calibration model one
does not want to rely on theoretical formulas such as the ones underlying the
uncertainty bands displayed in Fig. 2, although it is important to note that significant advances have been made in terms of characterizing the uncertainty in
multivariate model results, see Part III of “Guidelines for calibration in analytical
chemistry” of the International Union of Pure and Applied Chemistry (IUPAC)
[6].
N.M. Faber, R. Rajk´o / Analytica Chimica Acta 595 (2007) 98–106
101
Fig. 3. Schematic presentations of bias and variance contribution to RMSEP as a function of model dimensionality, e.g., the number of PLS components: (left panel)
) increases rapidly and bias (
) gives a substantial contribution to RMSEP (—) for the optimum model, and
standard presentation where variance (
(right panel) alternative presentation where variance increases slowly (when interpolating) and bias is relatively small for the optimum model. The latter asymmetric
presentation is usually more realistic in practice and illustrates why under-fitting is seldom a concern.
ponent, namely the reference error. It is clear that this spurious
contribution to RMSEP does not depend on model dimensionality (A). Consequently, it will tend to obscure the sought-for
global minimum by making it broader and shallower.
• Martens and Dardenne [11] have shown that many validation
samples are required to obtain a sufficiently precise estimate
of RMSEP. Simulations [12] inspired by this study have led to
the suggestion that a rule of thumb holding for a plain standard
deviation also works for estimates of RMSEP, i.e.,
σ(RMSEP)
1
=√
RMSEP
2Nval
(2)
where σ(·) denotes the standard error of the associated quantity. This expression is an example of the law of diminishing
returns. For example, to have a relative uncertainty of less
than 20% requires about 13 validation samples (that spread
out reasonably well in calibration space). To further reduce
this uncertainty to less than 10% one has to quadruple the
number of validation samples. Eq. (2) intends to enable the
analyst to calculate the number of validation samples (s)he
needs to report an RMSEP estimate in sufficient (significant)
decimal digits—usually two.
• Often, the RMSEP estimates do not exhibit a clear global
minimum, as in Fig. 3. This is a direct consequence of the
previous issues. As a result, one often has to resort to ‘soft’
decision rules like ‘the first local minimum’ or ‘the start of a
plateau’, which is highly unsatisfactory both from a practical
as well as a scientific point of view. It is important to note that
the previous issues have led researchers to develop error indicator functions that do not require possibly noisy reference
values [13,14].
Specific problems with the conventional approach are:
• External validation, i.e., test set validation, is best in the sense
that a closer assessment of RMSEP is possible (‘test is best’).
However, it is wasteful because the validation samples are not
available for the construction of the model.
• Cross-validation, on the other hand, ensures a more economic
use of the available data, but it has two major drawbacks. First,
it cannot be used if the data are designed. This can be under-
stood as follows. Design points are special in the sense that
they should have a large impact on the model. Consequently,
the actual prediction uncertainty should be correspondingly
small for these points. However, when leaving out these
points, the model constructed for the remaining points may
be very different from the ‘full’ model hence it may generate
an unduly large prediction residual for the left-out samples.
Depending on the type of design, this drawback can be ignored
if the calibration set is large enough to have some redundancy,
but it certainly precludes the use of cross-validation in many
sensory and QSAR applications, where the calibration set can
be as small as ten samples (and, in principle, redundancy is
avoided because of the high cost of sampling). Similar reasoning holds when the calibration model must be updated for
new sources of variation, with few samples (X. Capron, personal communication). Obviously, in such cases one will not
have recourse to a sufficiently large independent validation
set either. Second, many variants of cross-validation have a
tendency to select too many components, because they do
not compensate for the fact that the same samples are used
for both calibration and validation. In other words, with crossvalidation one is vulnerable to over-fitting the calibration data
(‘false positive’ components). So-called Monte Carlo crossvalidation has recently been introduced in chemometrics to
reduce the risk of over-fitting [15]. For simulated data, the
risk was reduced from 25% to about 14%. However, the latter
risk is still fairly large, and what is even more disturbing: the
procedure does not provide any hint about this risk. Finally,
the simple fact that a different implementation can easily lead
to a different advice constitutes an ambiguity that is confusing
to the analyst.
2.4. The proposed alternative
The proposed alternative assesses the statistical significance
of each individual component that enters the model. Theoretical approaches to achieve this goal (using a t- or F-test) have
been put forth but they are all based on unrealistic assumptions
about the data, e.g., the absence of spectral noise, see [16] for
examples. A pragmatic data-driven approach is therefore called
for. A so-called randomization test is a data-driven approach and
102
N.M. Faber, R. Rajk´o / Analytica Chimica Acta 595 (2007) 98–106
Fig. 4. Generating the distribution under the null-hypothesis (Ho ) by building a series of PLS models after pairing up the observations for predictor (X) and response
(Y) variables at random. Any result obtained by PLS modeling after randomization must be due to chance. Consequently, the statistical significance of the value
obtained for the original data follows from a comparison with the corresponding randomization results.
therefore ideally suited for avoiding unrealistic assumptions. For
an excellent description of this methodology, see van der Voet
[17]. The rationale behind the randomization test in the context
of regression modeling is illustrated in Fig. 4. Randomization
amounts to permuting indices. For that reason, the randomization test is often referred to as a permutation test. In QSAR
applications it is known as ‘Y-scrambling’. Clearly, ‘scrambling’
the elements of Y, while keeping the corresponding numbers in
X fixed, destroys any relationship that might exist between the
X- and Y-variables. Randomization therefore yields PLS regression models that should reflect the absence of a real association
between the X- and Y-variables – in other words: purely random
models. For each of these random models, a test statistic is calculated. We have opted for the covariance between the t-score
and the Y-values because it is a natural measure of association,
see [16] for more details. Geometrically, it is the inner product
of the t-score vector and the Y-vector in Fig. 1. Clearly, the value
for a test statistic obtained after randomization should be indistinguishable from a chance fluctuation. For this reason, it will be
referred to as a ‘noise value’. Repeating this calculation a number of times generates a histogram for the null-distribution, i.e.,
the distribution that holds when the component under scrutiny is
due to chance—the null-hypothesis (Ho ). Next, a critical value
is derived from the null-distribution as the value exceeded by a
certain percentage of ‘noise values’ (say 5% or 10%). Finally,
the statistic obtained for the original data – the value under test
– is compared with the critical value. The (only) difference with
a conventional statistical test is that the critical value follows as
a percentage point of a data-driven histogram of ‘noise values’
instead of a theoretical distribution that is tabulated, e.g., t or F.
It is important to note that ‘Y-scrambling’ has become a standard for assessing the significance of a (complete) QSAR model
[18]. One may therefore feel tempted to apply this test to counter
over-fitting but this will not work as intended because a significant model can either over- or under-fit. It follows that the
resulting significance levels will be misleading at best. For example, the grand mean in analysis of variance invariably comes out
as significant in a test, but it (usually) under-fits. It appears that
the testing of complete models, instead of individual components, must lead to trouble.
3. Experimental
3.1. The example data set
A NIR spectral data set will serve to illustrate the problems
with the conventional validation approach to avoid over-fitting.
This type of spectral data provides critical test cases for PLS
component selection procedures because tiny substructures may
have predictive value. The example data set (F. Wahl, Institut Franc¸ais du P´etrole, personal communication) contains
NIR spectra (X) for 239 gas oil samples measured between
4900–9000 cm−1 (Fig. 5). The property of interest (Y) is the
hydrogen content. The reference values were determined by
nuclear magnetic resonance, which has an estimated measure-
Fig. 5. NIR spectra of the example data set.
N.M. Faber, R. Rajk´o / Analytica Chimica Acta 595 (2007) 98–106
Fig. 6. Validation results for the example data set: (top panels) internal RMSECV ( ) for the 84 calibration samples and (bottom panels) external RMSEP (
the 155 independent validation samples. To better exploit the vertical scale, the first point is omitted in panels (b) and (d).
ment error standard deviation σ ref = 0.025 g (100 g)−1 . The 239
samples were split into a calibration and validation set by using
the duplex algorithm. This method starts by selecting the two
points furthest from each other and puts them both in a first
set (calibration). Then the next two points furthest from each
other are put in a second set (validation), and the procedure is
continued by alternately placing pairs of points in the first or
second set. As a result, 84 samples were used for calibration
and 155 samples for validation. It is noted that the majority
of the available samples was selected for (external) validation,
which is unusual in practice. However, Fern´andez Pierna et al.
had chosen this particular data split to test expressions for multivariate sample-specific prediction uncertainty [19]. In other
words: focus was more on assessing the predictive ability of a
model than on obtaining the best model. Also for the current
study it should be useful to have a relatively large validation
set because external validation is generally considered to be the
‘golden standard’.
3.2. Calculations
The proposed randomization test has been implemented in
Matlab 7.0 (The Mathworks, Natick, MA, USA) and the program
is available from the first author. Histograms of ‘noise values’
were generated using 1000 permutations. Although as few as
100 permutations can be used [17], this relatively large number ensures that the resulting histograms are fairly smooth. For
the current example data set (84 samples × 2128 wavelengths),
the computations were completed within seven CPU seconds
on a 3.4 GHz personal computer. To calculate the risk of overfitting when, in fact, none of the ‘noise values’ exceeds the value
under test, the so-called inverse Gaussian function is fit to the
‘noise values’. This function is often suited for modeling positive
and/or positively skewed data [20].
103
) for
4. Results and discussion
4.1. The conventional validation approach
Both internal and external validation – the ‘golden standard’
– lead to a rather subjective decision process, see Fig. 6. The
five-dimensional model achieves the first local minimum in
RMSECV (see top panels). By contrast, the external RMSEP
estimates continue to decrease until eight components have
been fitted (see bottom panels). The analyst faces major difficulties to objectively decide whether the further decrease
of RMSEP is worthwhile or merely results from ‘statistical
fluctuations’.2 We suspect that to obtain a clear minimum as
in the schematic presentations of Fig. 3, many more samples are
required since the law of diminishing returns is in force—Eq.
(2). However, the currently available total number of samples
(239) is already quite favorable. It is noted that the unscrambler (CAMO, Trondheim, Norway) and SIMCA (Umetrics,
Ume˚a, Sweden) packages have a decision rule implemented to
assess whether a decrease of RMSEP or RMSECV resulting
from the fit of an additional component is worthwhile or not.
The underlying idea, which makes good sense, is that small
‘improvements’ of RMSEP or RMSECV should be regarded
with caution, because larger models are inherently less robust.
Interestingly, both packages suggest on the basis of (internal)
RMSECV that as few as two components are sufficient. This
would, however, lead to a serious under-fitting of the data since
the (external) RMSEP further decreases from 0.01 g (100 g)−1 to
0.045 g (100 g)−1 —the latter value being quite close to the refer2 The decision process is subjective in the sense that different analysts may
easily arrive at different conclusions about the optimal number of components.
For the current example data set, the conclusion may even depend on the use of
scale for the ordinate, cf. Fig. 6c and d.
104
N.M. Faber, R. Rajk´o / Analytica Chimica Acta 595 (2007) 98–106
ence error (σ ref = 0.025 g (100 g)−1 ). The fundamental problem
with these decision rules is that the actual significance of a further ‘improvement’ of RMSEP estimates depends on the size
and quality of the data set at hand, as well as the validation procedure applied. Consequently, general purpose rules may easily
fail in specific cases.
It has been attempted to rationalize the validation-based
selection of model dimensionality by comparing competing
models in a pair-wise fashion [17,21]. However, the initial choice
of competing model dimensionalities to be further scrutinized
is left to the practitioner. For example, for the current example data set, likely initial choices are five, in combination with
cross-validation, or eight, when test set validation is deployed.
As a result, a major source of subjectivity is not eliminated.
For example, the finally selected model could still be either
under- or over-fitting the data. After all, each model entering
this stage could either under- or over-fit. It stands to reason that
the chain cannot be stronger than its weakest link. Finally, it
is noted that the method developed by van der Voet [17] has
been implemented in SAS® (SAS® Institute, Cary, NC, USA),
as thoroughly reviewed by Reeves and Delwiche [22].
4.2. The proposed alternative
Histograms of ‘noise values’ generated for components 1–8
are presented in Fig. 7. It is observed that the probability that the
Fig. 7. Randomization results for the example data set: histogram of 1000 ‘noise values’, fit using the inverse Gaussian function (
). The symbol α stands for the significance level.
(
) and value under test
N.M. Faber, R. Rajk´o / Analytica Chimica Acta 595 (2007) 98–106
105
Fig. 8. Comparison of current practice of multivariate predictive modeling (left) and the one enabled by the proposed alternative (right).
value under test is due to chance (α) is extremely small for components 1 (0.0009%), 2 (0.02%), 4 (0.0006%) and 5 (0.002%).
Interestingly, the significance of component 3 is only 3.3%. We
speculate this to be due to component 3 taking care, with some
difficulty, of subtle non-linearities in the spectra, after which
the remaining linear contributions are conveniently handled by
components 4 and 5. The high α-values for components 6–8
constitute a clear unambiguous warning that over-fitting starts
after the fifth component.
5. Recommendations
The proposed randomization test enables a different scheme
for calibration modeling (Fig. 8). The essential difference is that
the two critical steps preceding the actual modeling process are
disentangled. The best data pre-treatment depends highly on the
type and quality of the input data. Expert knowledge is valuable
in this stage and allowing for subjectivity here is to be understood in a favorable sense: it may help reducing the inherent
black-box character of ‘soft’ modeling procedures. By contrast,
the selection of optimum model dimensionality should be kept
fully objective since a human expert cannot judge the observed
modeling power of an additional component to be genuine, i.e.,
not due to chance. An adequate validation step – of course one
must validate! – then constitutes the justification of the overall
trial-and-error procedure. It is stressed that not completely relying on validation for component selection has an added bonus
in the sense that the RMSEP estimate can be reported with
more confidence, simply because it has not guided component
selection.
Until now the discussion focused completely on quantitative
aspects of multivariate calibration modeling. However, in application areas such as sensory research, qualitative aspects such
as the interpretation of the individual components can be even
more important. Standard practice is to visualize the lowestnumbered components in score and loading plots (components
1 versus 2, components 1 versus 3, etc.) to discover patterns and
trends. These observations then may lead to the development
of a better product. A tacit assumption is that the components
included in the model are ordered according to their importance
for describing the Y-variable – the property of interest. It has
been observed, however, that non-significant components can
be preceded and followed by (highly) significant ones [16,23].
This phenomenon has been termed ‘sandwiching’ and can often
be rationalized (see [24–27] for in-depth discussions of this
aspect). An early component can, for example, take care of a
background in the X-data and it consequently bears no relationship with the Y-variable. (Recall that PLS component 3 of the
current example data set is close to being non-significant, while
the preceding and following ones are highly significant.) It is
clear that one should be cautious when attempting to interpret
these ‘sandwiched’ components. We therefore recommend displaying the statistical significance of a component in score and
loading plots to avoid the interpretation of patterns and trends
that have no significant relationship with the Y-variable—the
property of interest.
6. Concluding remarks
The conventional validation approach to component selection
is problematic in practice because often the RMSEP estimates do
not yield a clear global minimum. In such a case, the analyst has
to resort to ‘visual inspection’ and its associated ‘soft’ decision
rules. This all leads to a rather subjective decision process, which
makes the proposed statistical alternative rather attractive. The
following concluding remarks seem to be in order:
106
N.M. Faber, R. Rajk´o / Analytica Chimica Acta 595 (2007) 98–106
• The alternative enables one to scrutinize individual components without making strong assumptions about the data.
• It is user-friendly because it only requires (1) the number
of permutations and (2) the critical significance level to be
selected. The first requirement constitutes the only practical
difference between a randomization test and a conventional
statistical test.
• The result is often consistent with the one obtained using
validation (e.g., unscrambler or SIMCA advice), but now it is
fully objective – ‘visual inspection’ does not play a role.
• It can replace validation for component selection, but it can
also supplement the common plot (RMSEP estimates vs. components) with an advice.
• The applicability of the randomization test is not restricted
to PLS regression. Moreover, it is easily verified that it also
applies to multiway calibration. One only needs to replace the
(1-way) rows of the X-matrix in Fig. 4 by the appropriate data
object (2-way matrices or in general N-way arrays).
• Compression may be necessary to handle large data
sets. However, once the compression is done, other
computer-intensive methods such as bootstrap, jack-knife and
cross-validation can be entertained almost for free as well. The
only requirement is that the compression should not introduce
dependencies among the samples.
• The currently described randomization test operates on the
calibration set. A purpose could be to add objectivity to crossvalidation, cf. Fig. 6a and b. There is no reason, however, why
it cannot be adapted to add objectivity to test set validation,
cf. Fig. 6c and d.
In summary, the proposed randomization test appears to be a
useful addition to the chemometrician’s toolbox.
Acknowledgements
The thoughtful comments by Waltraud Kessler (Reutlingen University), Randy Pell (The Dow Chemical Company),
Michael Sj¨ostr¨om (Ume˚a University) and Svante Wold (Ume˚a
University) are appreciated by the authors. We further thank
Chris Brown (InLight Solutions) for supplying the function for
the inverse Gaussian fit and Alejandro Olivieri (Universidad
Nacional de Rosario) for pointing out a numerical problem in
the calculations. The relationship of the proposed alternative to
the following patent is acknowledged: N.M. Faber, “Method and
system for selection of calibration model dimensionality, and use
of such a calibration model”, PCT/NL2005/000124. Part of this
work was supported by the Hungarian Scientific Research Fund
(OTKA T-046484) and was completed when one of the authors
(RR) spent a sabbatical year from the University of Szeged. The
critical comments by the reviewer are appreciated by the authors.
References
¨
[1] S. Wold, H. Antti, F. Lindgren, J. Ohman,
Chemom. Intell. Lab. Syst. 44
(1998) 175.
[2] H. Martens, T. Næs, Multivariate Calibration, Wiley, New York, 1989.
[3] S. Wold, A. Ruhe, H. Wold, W.J. Dunn III, SIAM J. Sci. Statist. Comput.
5 (1984) 735.
[4] A. Savitzky, M.J.E. Golay, Anal. Chem. 36 (1964) 1627.
[5] A.N. Davies, Spectrosc. Eur. 16 (3) (2004) 26.
[6] A.C. Olivieri, N.M. Faber, J. Ferr´e, R. Boqu´e, J.H. Kalivas, H. Mark, Pure
Appl. Chem. 78 (2006) 633.
[7] M.C. Denham, J. Chemom. 14 (2000) 351.
[8] R. Wehrens, H. Putter, L.M.C. Buydens, Chemom. Intell. Lab. Syst. 54
(2000) 35.
[9] A. Lorber, B.R. Kowalski, Appl. Spectrosc. 44 (1990) 1464.
[10] R. DiFoggio, Appl. Spectroscp. 49 (1995) 67.
[11] H.A. Martens, P. Dardenne, Chemom. Intell. Lab. Syst. 44 (1998) 99.
[12] N.M. Faber, Chemom. Intell. Lab. Syst. 49 (1999) 79.
[13] L. Xu, I. Schechter, Anal. Chem. 68 (1996) 2392.
[14] E.T.S. Skibsted, H.F.M. Boelens, J.A. Westerhuis, D.T. Witte, A.K. Smilde,
Anal. Chem. 58 (2004) 264.
[15] Q.-S. Xu, Y.-Z. Liang, Chemom. Intell. Lab. Syst. 56 (2001) 1.
[16] S. Wiklund, D. Nilsson, L. Eriksson, M. Sj¨ostr¨om, S. Wold, K. Faber, J.
Chemom., submitted.
[17] H. van der Voet, Chemom. Intell. Lab. Syst. 25 (1994) 313.
[18] S.-S. So, M. Karplus, J. Med. Chem. 40 (1997) 4347.
[19] J.A. Fern´andez Pierna, L. Jin, F. Wahl, N.M. Faber, D.L. Massart, Chemom.
Intell. Lab. Syst. 65 (2003) 281.
[20] R.S. Chhikara, J.L. Folks, The Inverse Gaussian Distribution: Theory,
Methodology, and Applications, Marcel Dekker, New York, 1989.
[21] E.V. Thomas, J. Chemom. 17 (2003) 653.
[22] J.B. Reeves III, S.R. Delwiche, J. Near Infrared Spectrosc. 11 (2003) 415.
[23] M.P. G´omez-Carracedo, J.M. Andrade, D.N. Rutledge, N.M. Faber, Anal.
Chim. Acta 585 (2007) 253.
[24] H.R. Keller, D.L. Massart, Y.-Z. Liang, O.M. Kvalheim, Anal. Chim. Acta
263 (1992) 29.
[25] Y.-L. Xie, J.H. Kalivas, Anal. Chim. Acta 348 (1997) 19.
[26] Y.-L. Xie, J.H. Kalivas, Anal. Chim. Acta 348 (1997) 29.
[27] U. Depczynski, V.J. Frost, K. Molt, Anal. Chim. Acta 420 (2000) 217.