How to avoid over-fitting in multivariate calibration—The
Analytica Chimica Acta 595 (2007) 98–106

How to avoid over-fitting in multivariate calibration—The conventional validation approach and an alternative

N.M. Faber a,∗, R. Rajkó b

a Chemometry Consultancy, Rubensstraat 7, 6717 VD Ede, The Netherlands
b Department of Unit Operations and Food Engineering, Szeged College of Food Engineering, University of Szeged, H-6701 Szeged, POB 433, Hungary

Received 27 September 2006; received in revised form 17 May 2007; accepted 21 May 2007
Available online 25 May 2007

Abstract

This paper critically reviews the problem of over-fitting in multivariate calibration and the conventional validation-based approach to avoid it. It proposes a randomization test that enables one to assess the statistical significance of each component that enters the model. This alternative is compared with cross-validation and independent test set validation for the calibration of a near-infrared spectral data set using partial least squares (PLS) regression. The results indicate that the alternative approach is more objective, since, unlike the validation-based approach, it does not require the use of ‘soft’ decision rules. The alternative approach therefore appears to be a useful addition to the chemometrician’s toolbox. © 2007 Elsevier B.V. All rights reserved.

Keywords: Multivariate calibration; PLS; Component selection; Cross-validation; Test set validation; Randomization test; Near-infrared spectroscopy

“...I personally have not been aware of clear unambiguous automated warnings starting to appear when data was being over-fitted...” A.N. Davies, Spectroscopy Europe (2004).

1. Introduction

Multivariate calibration models play an important role in various technical fields. These models are not only applied in particular in the chemical, petrochemical, pharmaceutical, cosmetic, coloring, plastics, paper, rubber and foodstuffs industries, but also in forensic, environmental, medical, sensory and marketing research.
∗ Corresponding author. Tel.: +31 318 641985; fax: +31 318 642150. E-mail address: [email protected] (N.M. Faber). 0003-2670/$ – see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.aca.2007.05.030

As an illustration, consider near-infrared (NIR) spectroscopy, which is increasingly used for the characterization of solid, semi-solid, fluid and vapor samples [1]. Frequently, the objective with this characterization is to determine the value of one or several concentrations in future unknown samples. Multivariate calibration is then used to develop a quantitative relation, i.e., a model, between the digitized spectra, stored in a data matrix X, and the concentrations, stored in a data matrix Y, as reviewed by Martens and Næs [2]. NIR spectroscopy is also increasingly used to infer other properties than concentrations, e.g., the strength and viscosity of polymers, the thickness of a tablet coating, and the octane rating of gasoline. It is important to note that precise and accurate quantification on the basis of highly non-selective NIR spectra is one of the major success stories of chemometrics. Various methods have been developed for building a multivariate calibration model. The three most common ones are multiple linear regression (MLR), which is also known as ordinary least squares (OLS), principal component regression (PCR) and partial least squares (PLS) regression. While MLR requires more samples, denoted by N, than spectral channels, denoted by K, PCR and PLS can handle the opposite case as well, i.e., K > N. For that reason, they are often referred to as full-spectrum methods. PCR and PLS are able to cope with an arbitrarily large number of spectral channels by compressing the X-data into a relatively small number, denoted by A, of so-called t-scores, usually less than ten. The score matrix T of size N × A then replaces the original X-matrix of size N × K in the subsequent regression step, i.e., Y is regressed onto T instead of X.
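The compression of X into a few t-scores, followed by regression of Y onto those scores, can be illustrated with a short sketch. This is not the authors' implementation: it assumes a textbook NIPALS-style loop for a single Y-variable (PLS1), uses only the Python standard library, and the names `pls1_fit` and `fitted_values` as well as the toy data are made up for the illustration.

```python
import math
import random

def pls1_fit(X, y, n_components):
    """Compress X (N x K) into A orthogonal t-scores and regress y onto them.

    Returns (scores, q, y_mean): `scores` is a list of A score vectors of
    length N and `q` holds the regression coefficient of y on each score.
    """
    n, k = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(k)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(k)] for row in X]  # centered X
    f = [v - y_mean for v in y]                                # centered y
    scores, q = [], []
    for _ in range(n_components):
        # Weight vector: direction in X-space with maximal covariance with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(k)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(k)) for i in range(n)]  # t-score
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(k)]
        qa = sum(t[i] * f[i] for i in range(n)) / tt  # y regressed onto this score
        # Deflate: remove what this component explains from X and from y.
        E = [[E[i][j] - t[i] * p[j] for j in range(k)] for i in range(n)]
        f = [f[i] - qa * t[i] for i in range(n)]
        scores.append(t)
        q.append(qa)
    return scores, q, y_mean

def fitted_values(scores, q, y_mean):
    """Fitted y-values from the regression of y onto the t-scores."""
    n = len(scores[0])
    return [y_mean + sum(qa * t[i] for t, qa in zip(scores, q)) for i in range(n)]

# Toy data with more channels than samples (K = 12 > N = 6).
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(12)] for _ in range(6)]
y = [row[0] + 0.5 * row[1] + rng.gauss(0, 0.1) for row in X]
scores, q, y_mean = pls1_fit(X, y, n_components=2)
y_fit = fitted_values(scores, q, y_mean)
```

Because the scores are mutually orthogonal, each regression coefficient can be computed one score at a time, which is why the sketch never has to invert a matrix even though K > N.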
The regression step amounts to solving a system of equations where each sample represents an equation and each t-score can be regarded as an unknown. Consequently, the strict mathematical requirement follows that the number of samples must exceed the number of t-scores, i.e., N > A. This requirement is easily fulfilled in practice. PCR constructs t-scores that successively describe the maximum amount of variation in X while being orthogonal to each other. PLS can be seen as a further development of PCR because the Y-data contribute to the construction of the t-scores [3]. The full-spectrum methods are generally preferred since the additional wavelength selection step required for application of MLR, to ensure that N > K, is problematic in itself. Moreover, the compression to a small number of t-scores acts as an effective noise filter. PLS is currently the de-facto standard in chemometrics because it has often been reported to exhibit a slight edge over PCR in applied work. For this reason, we will in the remainder restrict ourselves to PLS, although the proposed methodology should be equally suited for use with other score-based multivariate calibration methods.

2. Background

2.1. The problem of over-fitting

Usually, the first step towards constructing a PLS model is to remove undesirable features from the X-data by pre-treatment techniques such as filtering [1] or differentiation [4]. When the data have been made appropriate for the actual modeling process, the next critical step serves to select the optimum model dimensionality (also known as model rank), which is the number of PLS components (also known as factors or latent variables) that constitute the multivariate model. This step is equivalent to determining the optimum degree of a polynomial for fitting univariate (x,y)-data pairs.
However, it is a much harder problem to solve for multivariate calibration, owing to the larger amount of input data at hand, with possibly intricate signal and noise characteristics, and the consequently increased complexity of the calibration method deployed. The state of the art concerning commercially available software has been recently criticized by Davies [5]:

“Back in 1998 more advanced chemometric tools were being made available as standard in spectrometer control packages. This had, however, raised fears that the inherent dangers of over-fitting data were not being sufficiently addressed in order to help inexperienced spectroscopists handle the additional computing power that was becoming available. I must admit that the work of my co-column Editor in pushing for “Good Chemometrics Practice” has hopefully raised awareness in the community of the potential pitfalls in using these packages without due consideration, but I personally have not been aware of clear unambiguous automated warnings starting to appear when data was being over-fitted.” (Our italics.)

Over-fitting causes harm because one not only incorporates predictive features of the data in the model, but also noise. The implication is degraded model performance in the prediction stage. An example of over-fitting that is conveniently visualized occurs when a (two-dimensional) plane is fitted using two scores, while a (one-dimensional) line, using a single score, would be appropriate (Fig. 1). It is readily observed that prediction is still reliable in a restricted region, namely sufficiently close to the line. The concept of a ‘correct’ predictive region while effectively over-fitting is further illustrated for a univariate polynomial fit in Fig. 2. For interpolating points one distinguishes a small but statistically significant increase of prediction uncertainty. By contrast, a large increase of prediction uncertainty is clearly observed for extrapolating points.
This is all the more surprising because the fitted relationship is almost the same for this particular example. It is generally good advice to avoid extrapolation when deploying an empirical, entirely data-driven, ‘soft’ calibration model, since in a strict sense the estimated relationship is only supported in a region close to the calibration points. However, extrapolation is often implied (to some degree) by the goal of the application. Apart from genuine prediction in time or forecasting as in Fig. 2, important examples of unavoidable extrapolation are:

• the detection of lower analyte concentrations in trace analysis;
• the determination of analyte content using the method of standard additions;
• the development of a product with higher consumer appreciation in sensory and marketing research; and
• the search for a molecule with higher biological activity using a quantitative structure activity relationship (QSAR) model.

Fig. 1. Two collinear X-variables onto which the Y-data are regressed. Note that an extremely high collinearity is the rule for adjacent channels in molecular spectroscopy, e.g., NIR. The X-variables allow for the construction of only one stable component using the first score, t1. By contrast, the plane spanned by the first two scores, t1 and t2, is unstable. The spread of the fitted Y-data points around the first axis is caused by noise and should therefore be ignored. The model based on the first score is an effective noise filter, whereas the plane is over-fitting the data.

Fig. 2. The population of the USA in the period 1900–1990 (●). This data set is available through the built-in function census of Matlab (The Mathworks, Natick, MA, USA). The data are fitted using a polynomial of degree 2 and 3, respectively. The associated uncertainty bands are given by the same line type.

One should further bear in mind that many calibration samples are required to cover a high-dimensional space. Assuming that, for example, five (non-replicated) calibration points are sufficient to fit a straight line, then surely many more points are required to obtain the same coverage for a plane, a three-dimensional space, etc. It is easily verified that a high-dimensional space can be virtually empty (S. de Jong, personal communication). It follows that with increasing model dimensionality (A) it becomes increasingly difficult to achieve the same degree of interpolation for a multivariate model as for its familiar univariate counterparts. In summary, the consequences of over-fitting are likely to be much more disturbing for multivariate calibration models and this is the very reason why component selection is to be regarded as a critical step in multivariate predictive modeling.

2.2. The conventional validation approach to avoid over-fitting

Many methods have been developed to tackle the problem of over-fitting, of which model validation is the most frequently applied one in practice. In the context of multivariate calibration, validation amounts to assessing the ability of a model to predict the property of interest for future samples of the same type. This assessment can be performed in two essentially different modes, namely externally and internally.

The adjective ‘external’ refers to the requirement that the validation samples (also known as test samples) be independent of the samples used for constructing the model, i.e., the calibration set; otherwise one does not properly assess the ability to predict for truly unknown future samples. For example, simple replicates are not allowed. The predictive ability is estimated by applying the model to these (independent) validation samples and averaging the squared prediction errors, i.e., the differences between model prediction and the associated ‘known’ reference value.¹ The square root of this average squared error is known as the root mean squared error of prediction (RMSEP). In equation form, for increasing number of components (A),

\[ \mathrm{RMSEP}(A) = \sqrt{\frac{1}{N_{\mathrm{val}}} \sum_{n=1}^{N_{\mathrm{val}}} \left( \hat{Y}_{A,n} - Y_{\mathrm{ref},n} \right)^{2}} \qquad (1) \]

where $N_{\mathrm{val}}$ is the number of validation samples and $\hat{Y}_{A,n}$ and $Y_{\mathrm{ref},n}$ denote the model prediction with A components and the ‘known’ reference value for sample n (n = 1, ..., $N_{\mathrm{val}}$), respectively. Ideally, the results of this calculation lead to a clear (i.e., not too broad and shallow) minimum RMSEP for the optimum model dimensionality. This is achieved in case of a favorable bias-variance trade-off, see Fig. 3.

Internal validation differs from external validation in the sense that the validation samples are taken from the calibration set itself, i.e., the validation samples are not truly independent. To execute an internal validation, one has the choice between (1) cross-validation, (2) bootstrapping, (3) leverage correction and (4) criteria originally intended in particular for variable selection in connection with MLR (e.g., Mallows' Cp). In cross-validation, one constructs models after judiciously leaving out segments of (calibration) samples. Then an estimate of RMSEP follows by averaging squared prediction errors for the left-out samples, as in external validation. To emphasize that this estimate is not based on truly independent validation samples, it will be denoted as root mean squared error of cross-validation (RMSECV) in the remainder of this paper. Cross-validation can be quite computer-intensive, depending on the size of the calibration set and the number of segments. Bootstrapping performs similarly to cross-validation [7,8]. Leverage correction is only a ‘quick and dirty’ alternative when applied to PCR and PLS [9].
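The two error estimates just described, RMSEP as in Eq. (1) and RMSECV from segmented cross-validation, can be written down in a few lines. The sketch below is only an illustration under stated assumptions: the interleaved way of forming segments, the `fit`/`predict` callables and the straight-line toy model are choices made here for brevity, not something prescribed by the text, and a real application would plug in a PLS model with A components instead.

```python
import math

def rmsep(y_pred, y_ref):
    """Eq. (1): root of the mean squared difference between prediction and reference."""
    n = len(y_ref)
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(y_pred, y_ref)) / n)

def rmsecv(x, y, fit, predict, n_segments):
    """Leave out each segment in turn, predict it, and pool the squared errors."""
    n = len(y)
    preds = [None] * n
    for s in range(n_segments):
        left_out = set(range(s, n, n_segments))  # interleaved segments (an assumption)
        train = [i for i in range(n) if i not in left_out]
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in left_out:
            preds[i] = predict(model, x[i])
    return rmsep(preds, y)

# Hypothetical stand-in model: a univariate least-squares straight line.
def fit_line(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

def predict_line(model, x_new):
    slope, intercept = model
    return slope * x_new + intercept

x_cal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_cal = [2.0 * v + 1.0 for v in x_cal]  # noise-free toy data
cv_error = rmsecv(x_cal, y_cal, fit_line, predict_line, n_segments=3)
```

Repeating the `rmsecv` call for each candidate number of components gives the curve of RMSECV versus A that the validation-based approach inspects for a minimum.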
Finally, the criteria like Mallows' Cp are seldom used. In the remainder we will therefore focus on internal validation using cross-validation.

¹ To estimate the predictive ability of a multivariate calibration model one does not want to rely on theoretical formulas such as the ones underlying the uncertainty bands displayed in Fig. 2, although it is important to note that significant advances have been made in terms of characterizing the uncertainty in multivariate model results, see Part III of “Guidelines for calibration in analytical chemistry” of the International Union of Pure and Applied Chemistry (IUPAC) [6].

Fig. 3. Schematic presentations of bias and variance contribution to RMSEP as a function of model dimensionality, e.g., the number of PLS components: (left panel) standard presentation where variance increases rapidly and bias gives a substantial contribution to RMSEP (solid line) for the optimum model, and (right panel) alternative presentation where variance increases slowly (when interpolating) and bias is relatively small for the optimum model. The latter asymmetric presentation is usually more realistic in practice and illustrates why under-fitting is seldom a concern.

2.3. Problems with the conventional validation approach

Validation-based component selection is problematic for various general and specific reasons. Three general reasons are:

• Eq. (1) will only provide a correct estimate of RMSEP if the reference values are ‘known’ with sufficient precision. This condition is, however, often not fulfilled in practice. DiFoggio [10] has coined the term apparent RMSEP to emphasize that the result of Eq. (1) is a pessimistic estimate (i.e., biased high) of the actual RMSEP because it contains a spurious error component, namely the reference error. It is clear that this spurious contribution to RMSEP does not depend on model dimensionality (A).
Consequently, it will tend to obscure the sought-for global minimum by making it broader and shallower.
• Martens and Dardenne [11] have shown that many validation samples are required to obtain a sufficiently precise estimate of RMSEP. Simulations [12] inspired by this study have led to the suggestion that a rule of thumb holding for a plain standard deviation also works for estimates of RMSEP, i.e.,

\[ \frac{\sigma(\mathrm{RMSEP})}{\mathrm{RMSEP}} = \frac{1}{\sqrt{2 N_{\mathrm{val}}}} \qquad (2) \]

where σ(·) denotes the standard error of the associated quantity. This expression is an example of the law of diminishing returns. For example, to have a relative uncertainty of less than 20% requires about 13 validation samples (that spread out reasonably well in calibration space). To further reduce this uncertainty to less than 10% one has to quadruple the number of validation samples. Eq. (2) intends to enable the analyst to calculate the number of validation samples (s)he needs to report an RMSEP estimate in sufficient (significant) decimal digits, usually two.
• Often, the RMSEP estimates do not exhibit a clear global minimum like the one sketched in Fig. 3. This is a direct consequence of the previous issues. As a result, one often has to resort to ‘soft’ decision rules like ‘the first local minimum’ or ‘the start of a plateau’, which is highly unsatisfactory both from a practical as well as a scientific point of view.

It is important to note that the previous issues have led researchers to develop error indicator functions that do not require possibly noisy reference values [13,14].

Specific problems with the conventional approach are:

• External validation, i.e., test set validation, is best in the sense that a closer assessment of RMSEP is possible (‘test is best’). However, it is wasteful because the validation samples are not available for the construction of the model.
• Cross-validation, on the other hand, ensures a more economic use of the available data, but it has two major drawbacks. First, it cannot be used if the data are designed.
This can be understood as follows. Design points are special in the sense that they should have a large impact on the model. Consequently, the actual prediction uncertainty should be correspondingly small for these points. However, when leaving out these points, the model constructed for the remaining points may be very different from the ‘full’ model; hence it may generate an unduly large prediction residual for the left-out samples. Depending on the type of design, this drawback can be ignored if the calibration set is large enough to have some redundancy, but it certainly precludes the use of cross-validation in many sensory and QSAR applications, where the calibration set can be as small as ten samples (and, in principle, redundancy is avoided because of the high cost of sampling). Similar reasoning holds when the calibration model must be updated for new sources of variation, with few samples (X. Capron, personal communication). Obviously, in such cases one will not have recourse to a sufficiently large independent validation set either. Second, many variants of cross-validation have a tendency to select too many components, because they do not compensate for the fact that the same samples are used for both calibration and validation. In other words, with cross-validation one is vulnerable to over-fitting the calibration data (‘false positive’ components). So-called Monte Carlo cross-validation has recently been introduced in chemometrics to reduce the risk of over-fitting [15]. For simulated data, the risk was reduced from 25% to about 14%. However, the latter risk is still fairly large, and what is even more disturbing: the procedure does not provide any hint about this risk.

Finally, the simple fact that a different implementation can easily lead to different advice constitutes an ambiguity that is confusing to the analyst.

2.4. The proposed alternative

The proposed alternative assesses the statistical significance of each individual component that enters the model. Theoretical approaches to achieve this goal (using a t- or F-test) have been put forth but they are all based on unrealistic assumptions about the data, e.g., the absence of spectral noise, see [16] for examples. A pragmatic data-driven approach is therefore called for. A so-called randomization test is a data-driven approach and therefore ideally suited for avoiding unrealistic assumptions. For an excellent description of this methodology, see van der Voet [17].

Fig. 4. Generating the distribution under the null-hypothesis (H0) by building a series of PLS models after pairing up the observations for predictor (X) and response (Y) variables at random. Any result obtained by PLS modeling after randomization must be due to chance. Consequently, the statistical significance of the value obtained for the original data follows from a comparison with the corresponding randomization results.

The rationale behind the randomization test in the context of regression modeling is illustrated in Fig. 4. Randomization amounts to permuting indices. For that reason, the randomization test is often referred to as a permutation test. In QSAR applications it is known as ‘Y-scrambling’. Clearly, ‘scrambling’ the elements of Y, while keeping the corresponding numbers in X fixed, destroys any relationship that might exist between the X- and Y-variables. Randomization therefore yields PLS regression models that should reflect the absence of a real association between the X- and Y-variables – in other words: purely random models. For each of these random models, a test statistic is calculated. We have opted for the covariance between the t-score and the Y-values because it is a natural measure of association, see [16] for more details.
Geometrically, it is the inner product of the t-score vector and the Y-vector in Fig. 1. Clearly, the value for a test statistic obtained after randomization should be indistinguishable from a chance fluctuation. For this reason, it will be referred to as a ‘noise value’. Repeating this calculation a number of times generates a histogram for the null-distribution, i.e., the distribution that holds when the component under scrutiny is due to chance, which is the null-hypothesis (H0). Next, a critical value is derived from the null-distribution as the value exceeded by a certain percentage of ‘noise values’ (say 5% or 10%). Finally, the statistic obtained for the original data – the value under test – is compared with the critical value. The (only) difference with a conventional statistical test is that the critical value follows as a percentage point of a data-driven histogram of ‘noise values’ instead of a theoretical distribution that is tabulated, e.g., t or F.

It is important to note that ‘Y-scrambling’ has become a standard for assessing the significance of a (complete) QSAR model [18]. One may therefore feel tempted to apply this test to counter over-fitting but this will not work as intended because a significant model can either over- or under-fit. It follows that the resulting significance levels will be misleading at best. For example, the grand mean in analysis of variance invariably comes out as significant in a test, but it (usually) under-fits. It appears that the testing of complete models, instead of individual components, must lead to trouble.

3. Experimental

3.1. The example data set

A NIR spectral data set will serve to illustrate the problems with the conventional validation approach to avoid over-fitting. This type of spectral data provides critical test cases for PLS component selection procedures because tiny substructures may have predictive value. The example data set (F.
Wahl, Institut Français du Pétrole, personal communication) contains NIR spectra (X) for 239 gas oil samples measured between 4900 and 9000 cm⁻¹ (Fig. 5). The property of interest (Y) is the hydrogen content. The reference values were determined by nuclear magnetic resonance, which has an estimated measurement error standard deviation σref = 0.025 g (100 g)⁻¹. The 239 samples were split into a calibration and validation set by using the duplex algorithm. This method starts by selecting the two points furthest from each other and puts them both in a first set (calibration). Then the next two points furthest from each other are put in a second set (validation), and the procedure is continued by alternately placing pairs of points in the first or second set. As a result, 84 samples were used for calibration and 155 samples for validation. It is noted that the majority of the available samples was selected for (external) validation, which is unusual in practice. However, Fernández Pierna et al. had chosen this particular data split to test expressions for multivariate sample-specific prediction uncertainty [19]. In other words: focus was more on assessing the predictive ability of a model than on obtaining the best model. Also for the current study it should be useful to have a relatively large validation set because external validation is generally considered to be the ‘golden standard’.

Fig. 5. NIR spectra of the example data set.

3.2. Calculations

The proposed randomization test has been implemented in Matlab 7.0 (The Mathworks, Natick, MA, USA) and the program is available from the first author. Histograms of ‘noise values’ were generated using 1000 permutations. Although as few as 100 permutations can be used [17], this relatively large number ensures that the resulting histograms are fairly smooth. For the current example data set (84 samples × 2128 wavelengths), the computations were completed within seven CPU seconds on a 3.4 GHz personal computer. To calculate the risk of over-fitting when, in fact, none of the ‘noise values’ exceeds the value under test, the so-called inverse Gaussian function is fit to the ‘noise values’. This function is often suited for modeling positive and/or positively skewed data [20].

4. Results and discussion

4.1. The conventional validation approach

Both internal and external validation – the ‘golden standard’ – lead to a rather subjective decision process, see Fig. 6. The five-dimensional model achieves the first local minimum in RMSECV (see top panels). By contrast, the external RMSEP estimates continue to decrease until eight components have been fitted (see bottom panels). The analyst faces major difficulties in deciding objectively whether the further decrease of RMSEP is worthwhile or merely results from ‘statistical fluctuations’.² We suspect that to obtain a clear minimum as in the schematic presentations of Fig. 3, many more samples are required since the law of diminishing returns is in force, see Eq. (2). However, the currently available total number of samples (239) is already quite favorable.

Fig. 6. Validation results for the example data set: (top panels) internal RMSECV for the 84 calibration samples and (bottom panels) external RMSEP for the 155 independent validation samples. To better exploit the vertical scale, the first point is omitted in panels (b) and (d).

It is noted that the Unscrambler (CAMO, Trondheim, Norway) and SIMCA (Umetrics, Umeå, Sweden) packages have a decision rule implemented to assess whether a decrease of RMSEP or RMSECV resulting from the fit of an additional component is worthwhile or not. The underlying idea, which makes good sense, is that small ‘improvements’ of RMSEP or RMSECV should be regarded with caution, because larger models are inherently less robust.
Interestingly, both packages suggest on the basis of (internal) RMSECV that as few as two components are sufficient. This would, however, lead to a serious under-fitting of the data since the (external) RMSEP further decreases from 0.01 g (100 g)⁻¹ to 0.045 g (100 g)⁻¹, the latter value being quite close to the reference error (σref = 0.025 g (100 g)⁻¹). The fundamental problem with these decision rules is that the actual significance of a further ‘improvement’ of RMSEP estimates depends on the size and quality of the data set at hand, as well as the validation procedure applied. Consequently, general purpose rules may easily fail in specific cases. It has been attempted to rationalize the validation-based selection of model dimensionality by comparing competing models in a pair-wise fashion [17,21]. However, the initial choice of competing model dimensionalities to be further scrutinized is left to the practitioner. For example, for the current example data set, likely initial choices are five, in combination with cross-validation, or eight, when test set validation is deployed. As a result, a major source of subjectivity is not eliminated. For example, the finally selected model could still be either under- or over-fitting the data. After all, each model entering this stage could either under- or over-fit. It stands to reason that the chain cannot be stronger than its weakest link. Finally, it is noted that the method developed by van der Voet [17] has been implemented in SAS® (SAS® Institute, Cary, NC, USA), as thoroughly reviewed by Reeves and Delwiche [22].

² The decision process is subjective in the sense that different analysts may easily arrive at different conclusions about the optimal number of components. For the current example data set, the conclusion may even depend on the use of scale for the ordinate, cf. Fig. 6c and d.

4.2. The proposed alternative

Histograms of ‘noise values’ generated for components 1–8 are presented in Fig. 7. It is observed that the probability that the value under test is due to chance (α) is extremely small for components 1 (0.0009%), 2 (0.02%), 4 (0.0006%) and 5 (0.002%). Interestingly, the significance of component 3 is only 3.3%. We speculate this to be due to component 3 taking care, with some difficulty, of subtle non-linearities in the spectra, after which the remaining linear contributions are conveniently handled by components 4 and 5. The high α-values for components 6–8 constitute a clear unambiguous warning that over-fitting starts after the fifth component.

Fig. 7. Randomization results for the example data set: histogram of 1000 ‘noise values’, fit using the inverse Gaussian function, and value under test. The symbol α stands for the significance level.

Fig. 8. Comparison of current practice of multivariate predictive modeling (left) and the one enabled by the proposed alternative (right).

5. Recommendations

The proposed randomization test enables a different scheme for calibration modeling (Fig. 8). The essential difference is that the two critical steps preceding the actual modeling process are disentangled. The best data pre-treatment depends highly on the type and quality of the input data. Expert knowledge is valuable in this stage and allowing for subjectivity here is to be understood in a favorable sense: it may help reduce the inherent black-box character of ‘soft’ modeling procedures. By contrast, the selection of optimum model dimensionality should be kept fully objective since a human expert cannot judge the observed modeling power of an additional component to be genuine, i.e., not due to chance. An adequate validation step – of course one must validate! – then constitutes the justification of the overall trial-and-error procedure.
It is stressed that not completely relying on validation for component selection has an added bonus in the sense that the RMSEP estimate can be reported with more confidence, simply because it has not guided component selection.

Until now the discussion focused completely on quantitative aspects of multivariate calibration modeling. However, in application areas such as sensory research, qualitative aspects such as the interpretation of the individual components can be even more important. Standard practice is to visualize the lowest-numbered components in score and loading plots (components 1 versus 2, components 1 versus 3, etc.) to discover patterns and trends. These observations then may lead to the development of a better product. A tacit assumption is that the components included in the model are ordered according to their importance for describing the Y-variable – the property of interest. It has been observed, however, that non-significant components can be preceded and followed by (highly) significant ones [16,23]. This phenomenon has been termed ‘sandwiching’ and can often be rationalized (see [24–27] for in-depth discussions of this aspect). An early component can, for example, take care of a background in the X-data and it consequently bears no relationship with the Y-variable. (Recall that PLS component 3 of the current example data set is close to being non-significant, while the preceding and following ones are highly significant.) It is clear that one should be cautious when attempting to interpret these ‘sandwiched’ components. We therefore recommend displaying the statistical significance of a component in score and loading plots to avoid the interpretation of patterns and trends that have no significant relationship with the Y-variable, the property of interest.

6. Concluding remarks

The conventional validation approach to component selection is problematic in practice because often the RMSEP estimates do not yield a clear global minimum.
In such a case, the analyst has to resort to ‘visual inspection’ and its associated ‘soft’ decision rules. This leads to a rather subjective decision process, which makes the proposed statistical alternative attractive. The following concluding remarks seem to be in order:

• The alternative enables one to scrutinize individual components without making strong assumptions about the data.
• It is user-friendly because it only requires (1) the number of permutations and (2) the critical significance level to be selected. The first requirement constitutes the only practical difference between a randomization test and a conventional statistical test.
• The result is often consistent with the one obtained using validation (e.g., the advice given by Unscrambler or SIMCA), but now it is fully objective: ‘visual inspection’ does not play a role.
• It can replace validation for component selection, but it can also supplement the common plot (RMSEP estimates vs. components) with a recommendation.
• The applicability of the randomization test is not restricted to PLS regression. Moreover, it is easily verified that it also applies to multiway calibration. One only needs to replace the (1-way) rows of the X-matrix in Fig. 4 by the appropriate data object (2-way matrices or, in general, N-way arrays).
• Compression may be necessary to handle large data sets. However, once the compression is done, other computer-intensive methods such as bootstrap, jack-knife and cross-validation can be entertained almost for free as well. The only requirement is that the compression should not introduce dependencies among the samples.
• The currently described randomization test operates on the calibration set. One purpose could be to add objectivity to cross-validation, cf. Fig. 6a and b. There is no reason, however, why it cannot be adapted to add objectivity to test set validation, cf. Fig. 6c and d.
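The two user choices named in the list above (number of permutations and critical significance level) can be illustrated with a minimal sketch of a randomization test for PLS1 components. It is not the authors' implementation. The function name `randomization_test`, the test statistic (the norm of X'y on the deflated data), the permutation of the deflated y-residual, and the counting-based estimate of α are all assumptions made here for illustration. The very small α-values reported for Fig. 7 appear to come from the inverse Gaussian fit to the noise values rather than from counting, so this sketch cannot resolve α below 1/(number of permutations + 1).

```python
import numpy as np


def randomization_test(X, y, n_components, n_permutations=1000, seed=0):
    """Estimate a significance level (alpha) for each PLS1 component.

    Assumption (not taken from the paper): the statistic is ||X'y|| on the
    deflated data, which is proportional to the covariance between the
    component's X-score and y.
    """
    rng = np.random.default_rng(seed)
    Xr = X - X.mean(axis=0)  # column-centre X
    yr = y - y.mean()        # centre y
    alphas = []
    for _ in range(n_components):
        observed = np.linalg.norm(Xr.T @ yr)  # 'value under test'
        # 'Noise values': same statistic after destroying the X-y pairing.
        noise = np.array([
            np.linalg.norm(Xr.T @ rng.permutation(yr))
            for _ in range(n_permutations)
        ])
        # Fraction of noise values at least as large as the observed value;
        # the +1 terms keep the estimate away from exactly zero.
        alphas.append((np.sum(noise >= observed) + 1) / (n_permutations + 1))
        # Deflate X and y with the genuine component before the next test.
        w = Xr.T @ yr / observed
        t = Xr @ w
        tt = t @ t
        Xr = Xr - np.outer(t, Xr.T @ t / tt)
        yr = yr - t * (t @ yr) / tt
    return np.array(alphas)


# Hypothetical toy data: a single latent factor drives both X and y.
rng = np.random.default_rng(1)
c = rng.normal(size=40)
X = np.outer(c, rng.normal(size=20)) + 0.05 * rng.normal(size=(40, 20))
y = c + 0.05 * rng.normal(size=40)
alphas = randomization_test(X, y, n_components=3, n_permutations=199)
print(alphas)  # the first entry should be small; later ones need not be
```

Each α would then be compared with the chosen critical significance level. Because of the ‘sandwiching’ phenomenon discussed in Section 5, a non-significant component is not necessarily the last useful one, so the values are best inspected component by component.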
In summary, the proposed randomization test appears to be a useful addition to the chemometrician’s toolbox.

Acknowledgements

The thoughtful comments by Waltraud Kessler (Reutlingen University), Randy Pell (The Dow Chemical Company), Michael Sjöström (Umeå University) and Svante Wold (Umeå University) are appreciated by the authors. We further thank Chris Brown (InLight Solutions) for supplying the function for the inverse Gaussian fit and Alejandro Olivieri (Universidad Nacional de Rosario) for pointing out a numerical problem in the calculations. The relationship of the proposed alternative to the following patent is acknowledged: N.M. Faber, “Method and system for selection of calibration model dimensionality, and use of such a calibration model”, PCT/NL2005/000124. Part of this work was supported by the Hungarian Scientific Research Fund (OTKA T-046484) and was completed when one of the authors (RR) spent a sabbatical year from the University of Szeged. The critical comments by the reviewer are appreciated by the authors.

References

[1] S. Wold, H. Antti, F. Lindgren, J. Öhman, Chemom. Intell. Lab. Syst. 44 (1998) 175.
[2] H. Martens, T. Næs, Multivariate Calibration, Wiley, New York, 1989.
[3] S. Wold, A. Ruhe, H. Wold, W.J. Dunn III, SIAM J. Sci. Statist. Comput. 5 (1984) 735.
[4] A. Savitzky, M.J.E. Golay, Anal. Chem. 36 (1964) 1627.
[5] A.N. Davies, Spectrosc. Eur. 16 (3) (2004) 26.
[6] A.C. Olivieri, N.M. Faber, J. Ferré, R. Boqué, J.H. Kalivas, H. Mark, Pure Appl. Chem. 78 (2006) 633.
[7] M.C. Denham, J. Chemom. 14 (2000) 351.
[8] R. Wehrens, H. Putter, L.M.C. Buydens, Chemom. Intell. Lab. Syst. 54 (2000) 35.
[9] A. Lorber, B.R. Kowalski, Appl. Spectrosc. 44 (1990) 1464.
[10] R. DiFoggio, Appl. Spectrosc. 49 (1995) 67.
[11] H.A. Martens, P. Dardenne, Chemom. Intell. Lab. Syst. 44 (1998) 99.
[12] N.M. Faber, Chemom. Intell. Lab. Syst. 49 (1999) 79.
[13] L. Xu, I. Schechter, Anal. Chem. 68 (1996) 2392.
[14] E.T.S. Skibsted, H.F.M. Boelens, J.A. Westerhuis, D.T. Witte, A.K. Smilde, Anal. Chem. 58 (2004) 264.
[15] Q.-S. Xu, Y.-Z. Liang, Chemom. Intell. Lab. Syst. 56 (2001) 1.
[16] S. Wiklund, D. Nilsson, L. Eriksson, M. Sjöström, S. Wold, K. Faber, J. Chemom., submitted.
[17] H. van der Voet, Chemom. Intell. Lab. Syst. 25 (1994) 313.
[18] S.-S. So, M. Karplus, J. Med. Chem. 40 (1997) 4347.
[19] J.A. Fernández Pierna, L. Jin, F. Wahl, N.M. Faber, D.L. Massart, Chemom. Intell. Lab. Syst. 65 (2003) 281.
[20] R.S. Chhikara, J.L. Folks, The Inverse Gaussian Distribution: Theory, Methodology, and Applications, Marcel Dekker, New York, 1989.
[21] E.V. Thomas, J. Chemom. 17 (2003) 653.
[22] J.B. Reeves III, S.R. Delwiche, J. Near Infrared Spectrosc. 11 (2003) 415.
[23] M.P. Gómez-Carracedo, J.M. Andrade, D.N. Rutledge, N.M. Faber, Anal. Chim. Acta 585 (2007) 253.
[24] H.R. Keller, D.L. Massart, Y.-Z. Liang, O.M. Kvalheim, Anal. Chim. Acta 263 (1992) 29.
[25] Y.-L. Xie, J.H. Kalivas, Anal. Chim. Acta 348 (1997) 19.
[26] Y.-L. Xie, J.H. Kalivas, Anal. Chim. Acta 348 (1997) 29.
[27] U. Depczynski, V.J. Frost, K. Molt, Anal. Chim. Acta 420 (2000) 217.