Chemometrics and Intelligent Laboratory Systems 57 (2001) 1–14
www.elsevier.com/locate/chemometrics

Conditional Fisher's exact test as a selection criterion for pair-correlation method. Type I and Type II errors

Róbert Rajkó a,*, Károly Héberger b,*

a Department of Unit Operations and Environmental Engineering, Institute of Food Industry College, University of Szeged, P.O. Box 433, H-6701 Szeged, Hungary
b Institute of Chemistry, Chemical Research Center, Hungarian Academy of Sciences, P.O. Box 17, H-1525 Budapest, Hungary

Received 1 February 2000; accepted 20 December 2000

* Corresponding authors. E-mail addresses: [email protected] (R. Rajkó), [email protected] (K. Héberger).

Abstract

The pair-correlation method (PCM) has been developed recently for discrimination between two variables. PCM can be used to identify the decisive (fundamental, basic) factor from among correlated variables even in cases when all other statistical criteria fail to indicate a significant difference. These decisions are needed frequently in QSAR studies and/or chemical model building. The conditional Fisher's exact test, based on testing significance in 2 × 2 contingency tables, is a suitable selection criterion for PCM. The test statistic provides a probabilistic aid for accepting the hypothesis of significant differences between two factors that are almost equally correlated with the response (dependent variable). Differentiating between factors can lead to alternative models at any arbitrary significance level. The power function of the test statistic has also been deduced theoretically. A similar derivation was undertaken to describe the influence of Type I (false-positive conclusion, error of the first kind) and Type II (false-negative conclusion, error of the second kind) errors. The appropriate decision is indicated by low probability levels of both false conclusions.

© 2001 Elsevier Science B.V. All rights reserved. PII: S0169-7439(01)00101-0

Keywords: Variable (or feature) selection; Pair-correlation method (PCM)

1. Introduction

Variable selection (subset selection, feature selection) is one of the key issues in chemometrics. The selection process is more or less solved for linear relationships. Unfortunately, the algorithm for variable selection and the selection criteria are often not clearly distinguished. The same algorithm can lead to the selection of different variables using different criteria and vice versa. It is relatively easy to construct data sets for which even the accepted criteria (forward selection, backward elimination, stepwise) lead to different conclusions [1a,1b]. Other algorithms based on principal component analysis (PCA), partial least squares (PLS), genetic algorithms and artificial neural networks (ANN) increase the uncertainty concerning the selection of the best subset.

The increasing use of the ANN technique has forced chemometricians to develop non-linear variable selection methods. The approaches are often heuristic and lack any firm theoretical basis. Centner et al. [2] emphasised that "... weakness of all these methods is the estimation of a suitable number of variables (cut-off level). No explicit rule exists up to now. As a result all approaches work with a user-defined number of variables or with a user-defined critical value for the considered selection criterion".

All the above-mentioned methods build models for prediction. However, prediction is not necessarily the goal of the model building process. Models that have a theoretical basis and physical relevance are superior to empirical ones. However, there are no algorithmic ways to select important or basic factors.
The connection to physical significance has to be examined individually, again and again.

The present paper introduces a new technique that uses a different portion of the information present in the data than is usual. The technique is able to select "superior" factors if the superiority exists. Consider the following example based on the correlation coefficient. Two independent variables ($X_1$ and $X_2$) can be discriminated according to the magnitude of the correlation coefficients $r_{Y\,\mathrm{vs.}\,X_1}$ and $r_{Y\,\mathrm{vs.}\,X_2}$. The discrimination can be formulated as an F-test to identify significant differences at a given probability level. The classical Pearson product moment correlation coefficient is not the only measure of correlation. There are also non-parametric measures, e.g., Spearman's rho and Kendall's tau [3]. They are, however, not yet used for variable selection.

The pair-correlation method (PCM) provides an alternative way to characterise different correlations without using the correlation coefficient. PCM [4–6] has been developed recently for the discrimination of variables as a non-parametric method, in contrast with methods that require the assumption of normality. PCM can be used to choose the decisive (fundamental, basic) factor from among correlated (collinear) variables, even if none of the classical statistical criteria indicates a significant difference. PCM, however, needs a test statistic as a selection criterion, i.e., a probabilistic aid for accepting the hypothesis that a significant difference exists between the two factors at any arbitrary significance level.

Two hypotheses must be specified in any statistical testing procedure [7–9]: the null hypothesis, denoted $H_0$, and the alternative hypothesis, denoted $H_A$. The task is to accept or reject the null hypothesis. However, statistical hypothesis testing is based on sample information, so nobody can be sure that the decision is correct.
When $H_0$ is true but, by chance, the sample data wrongly suggest that it is false, this is referred to as a Type I error or the error of the first kind (the probability of this event is $\varepsilon$). When $H_0$ is false but, by bad luck, the sample data mistakenly suggest that it could be true, this is called a Type II error or the error of the second kind (the probability of this event is $\beta$). The power of a test (equal to $1 - \beta$) is a measure of how good the test is at rejecting a false null hypothesis.

PCM is used to choose between two factors $X_1$ and $X_2$ that are approximately equally correlated with the dependent variable $Y$. Hence, the determination of $\beta$ is of crucial importance (PCM can only discriminate between $X_1$ and $X_2$ if the null hypothesis can be rejected). Low levels of both $\varepsilon$ and $\beta$ indicate that the correct decision has been made.

Our aim in this paper was to develop a selection criterion for PCM. The theoretical deduction of Type I and Type II errors will justify the use of the method. Moreover, we would like to communicate an improvement of the algorithm for PCM, which is summarised in Appendix A. Finally, we present some examples to validate the method and to show how it works.

2. Theoretical principles of PCM

PCM is based on non-parametric, i.e., distribution-free (combinatorial) analysis. The initial task is formulated as follows. Let us define three vectors as the dependent variable ($Y$) and the independent variables ($X_1$ and $X_2$). The task is to choose the superior one from the coequal $X_1$ and $X_2$. Both independent variables are assumed to correlate positively with the dependent variable. The case when one or both of them do not correlate with $Y$ does not cause a serious limitation; this will be discussed in the validation part of the paper. Likewise, a negative correlation does not limit the use of the method.

Consider all the possible element pairs of the $Y$ vector that can occur when the differences $\Delta X_1$ for $Y$ vs. $X_1$ and $\Delta X_2$ for $Y$ vs. $X_2$ are determined. Only the signs of the differences are important:

$$
\left.
\begin{aligned}
\Delta X_1 &= (X_{1i} - X_{1j})\,\operatorname{sgn}(Y_i - Y_j)\\
\Delta X_2 &= (X_{2i} - X_{2j})\,\operatorname{sgn}(Y_i - Y_j)
\end{aligned}
\right\}
\quad 1 \le i < j \le m, \tag{1}
$$

$$
\operatorname{sgn}(Y_i - Y_j) =
\begin{cases}
0, & \text{if } Y_i = Y_j,\\[4pt]
\dfrac{|Y_i - Y_j|}{Y_i - Y_j}, & \text{otherwise,}
\end{cases}
$$

where $m$ is the number of measurements. There will be $\binom{m}{2} = m(m-1)/2 = n$ point pairs and differences $\Delta X_1$ and $\Delta X_2$. There are only four possible combinations of signs of $\Delta X_1$ and $\Delta X_2$. They are termed A, B, C and D. Table 1 summarises the four possibilities (events). The frequencies of the events A, B, C and D ($k_A$, $k_B$, $k_C$ and $k_D$, respectively) are counted and ordered (see Table 1). Fig. 1 represents the fundamental nature of the four events as the basis of PCM.

Table 1
Distribution of events A, B, C, and D; frequencies obtained using PCM

                   ΔX2 > 0      ΔX2 < 0
    ΔX1 > 0        A: k_A       B: k_B
    ΔX1 < 0        C: k_C       D: k_D

Pairs with $Y_i = Y_j$ are ignored. This does not cause any limitation, because these pairs hold no information on the differences in the independent variables.

Because of the initial assumption (both correlations, $Y$ vs. $X_1$ and $Y$ vs. $X_2$, are positive), the frequency of event A should be the largest: both $X_1$ and $X_2$ must change in the same direction as $Y$. Event D shows how the correlation tends to be reduced by chance, so its frequency is expected to be the lowest. If the frequency of event A is not the highest, then one or both $X$ variables correlate with $Y$ negatively. The rearrangement of boxes is equivalent to multiplying $X_1$ or $X_2$, or both, by minus one to obtain a positive correlation between $Y$ and $X_1$ as well as between $Y$ and $X_2$. This can be seen from the formulas in brackets in Appendix A, where the rearrangement procedure is given in detail. Events A and D carry no direct information for choosing between $X_1$ and $X_2$.
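The counting step of Eq. (1) and Table 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' program: the function name `pcm_counts` is hypothetical, and skipping pairs in which $\Delta X_1$ or $\Delta X_2$ is exactly zero is an assumption, since the text only specifies that pairs with $Y_i = Y_j$ are ignored.

```python
from itertools import combinations


def sign(v):
    """Return -1, 0 or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def pcm_counts(y, x1, x2):
    """Count events A, B, C and D of Table 1 over all pairs i < j.

    Pairs with Y_i == Y_j are ignored, as in the text.
    Assumption of this sketch: pairs with a zero X difference are skipped too.
    """
    counts = {"A": 0, "B": 0, "C": 0, "D": 0}
    for i, j in combinations(range(len(y)), 2):
        s = sign(y[i] - y[j])
        if s == 0:
            continue  # tie in Y: no information
        d1 = (x1[i] - x1[j]) * s  # Delta X1 of Eq. (1)
        d2 = (x2[i] - x2[j]) * s  # Delta X2 of Eq. (1)
        if d1 == 0 or d2 == 0:
            continue  # assumption: zero differences are uninformative
        if d1 > 0 and d2 > 0:
            counts["A"] += 1
        elif d1 > 0 and d2 < 0:
            counts["B"] += 1
        elif d1 < 0 and d2 > 0:
            counts["C"] += 1
        else:
            counts["D"] += 1
    return counts


if __name__ == "__main__":
    # Toy data: X1 follows Y exactly, X2 has one swapped pair.
    print(pcm_counts([1, 2, 3, 4], [1, 2, 3, 4], [1, 3, 2, 4]))
```

In the toy data only the pair (2, 3) of the $Y$ values disagrees with $X_2$ while agreeing with $X_1$, so it is counted as event B; swapping the roles of the two $X$ vectors turns it into event C.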
If the frequency $k_B$ belonging to event B is larger than $k_C$ for event C, then $X_1$ overrides $X_2$, and vice versa. Further details of the properties of PCM are given in [6].

The word "larger" can be interpreted statistically. Thus, a test statistic is required to determine whether the frequency associated with event B is significantly larger than that for event C (or vice versa). This paper describes a test statistic based on testing the significance of a 2 × 2 contingency table. The power function of this test statistic and the influence of Type I and Type II errors are also investigated and described.

Fig. 1. Graphical representation of four possible events as the basis of PCM.

3. Conditional Fisher's exact test

3.1. Type I error

Consider Table 1 as an example of a 2 × 2 contingency table. Similar contingency tables are frequently used, e.g., in medical sciences, so several tests have been developed and investigated. The most important one is Fisher's exact test [3,9–14]. The contingency table shown in Table 2 can be created by applying PCM to the data. If $k_B$ is significantly larger than $k_C$, then variable $X_1$ is more strongly correlated with $Y$ than variable $X_2$, and vice versa.

Table 2
The 2 × 2 contingency table to help to test the discrimination between variables X1 and X2 based on calculations by PCM

                                        Frequencies with       Frequencies without
                                        information to         information to
                                        discriminate           discriminate
                                        between variables      between variables      Marginal sum
    X1 may have stronger correlation    k_B                    (n/2) − k_B            n/2
    X2 may have stronger correlation    k_C                    (n/2) − k_C            n/2
    Marginal sum                        k_B + k_C              k_A + k_D              n

The null hypothesis assumes that $X_1$ and $X_2$ are equally correlated with $Y$:

$$H_0: k_B = k_C. \tag{2}$$

Consider the following alternative hypothesis:

$$H_A: k_B \ne k_C. \tag{3}$$

If $H_0$ is rejected, then $X_1$ (or $X_2$) is more strongly correlated with $Y$ than the other $X$ variable. If $H_0$ is not rejected, then it can be supposed that the probabilities of events B and C are equal, i.e., the two variables have, in addition to the same predictive property, the same correlation. It can be further supposed that event B appears only in one half of $n$, and event C only in the other half of $n$. The test statistic is based on the probability of the 2 × 2 contingency table being realised with the factual values $k_A$, $k_B$, $k_C$, $k_D$ [3,9,13]:

$$P\!\left(k_B \,\middle|\, k_B + k_C, \tfrac{n}{2}, \tfrac{n}{2}\right) = \frac{\dbinom{n/2}{k_B}\dbinom{n/2}{k_C}}{\dbinom{n}{k_B + k_C}}. \tag{4}$$

Alternative developments of this hypergeometric formula can be found in Ref. [15]. The formula has recently been used for determining the optimal sample size in forensic casework [16]. Thus, the cumulative distribution function of the test statistic is hypergeometric:

$$F(t, K) = \sum_{k=0}^{t} P\!\left(k \,\middle|\, K, \tfrac{n}{2}, \tfrac{n}{2}\right) = \sum_{k=0}^{t} \frac{\dbinom{n/2}{k}\dbinom{n/2}{K-k}}{\dbinom{n}{K}}, \tag{5}$$

where $K = k_B + k_C$. The following equation must hold:

$$\sum_{k=0}^{k'_\varepsilon} \frac{\dbinom{n/2}{k}\dbinom{n/2}{K-k}}{\dbinom{n}{K}} + \sum_{k=k''_\varepsilon}^{K} \frac{\dbinom{n/2}{k}\dbinom{n/2}{K-k}}{\dbinom{n}{K}} = \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon, \tag{6}$$

where $k'_\varepsilon$ and $k''_\varepsilon$ are chosen according to $\varepsilon$, which is the probability of Type I error, so that $k''_\varepsilon = K - k'_\varepsilon$. Because of the symmetry, Eq. (6) can be reduced to

$$2F(k'_\varepsilon, K) = \varepsilon. \tag{7}$$

The above part of the paper describes Fisher's test. As stated by Massart et al. [14], this is the best choice for testing hypotheses on 2 × 2 contingency tables. All the statistical tools are now available to make a decision for discrimination between the two variables $X_1$ and $X_2$. If $k_B < k'_\varepsilon$ or $k_B > k''_\varepsilon$, then the null hypothesis $H_0$ should be rejected at a confidence level $(1 - \varepsilon)$, and again, if $k_B$ is larger than $k_C$, then $X_1$ correlates with $Y$ more strongly than $X_2$, and vice versa. The procedure is visualised in Fig. 2. To apply Eq. (6), the binomial coefficients $\binom{n}{k} = n!/(k!\,(n-k)!)$ must be calculated via the factorials $n!$, $k!$ and $(n-k)!$. This can preferably be done using the approximation known as the Stirling formula; see Appendix B.

Fig. 2. Hypergeometric distribution function helping the acceptance or rejection of the null hypothesis $H_0$ based on the test statistic described in the text ($k_A = 40$, $K = k_B + k_C = 20$, $k_D = 2$, $\varepsilon = 0.02$, $k'_\varepsilon = 6$, $k''_\varepsilon = 14$).

3.2. Type II error, a strictly conditional approximation

The power function Pow(.) of the previously described test can be deduced by taking into account the wrong acceptance of the null hypothesis $H_0: k_B = k_C$. First, an approximation is investigated. The approximation assumes that the alternative hypothesis $H_A: k_B = k_3$ is true. Let

$$K_\varepsilon = k_B + k_C \ne K_\beta = k_B + k_3 \quad (k_C \ne k_3), \tag{8}$$

and the probability under $H_A$ becomes

$$P\!\left(k_B \,\middle|\, k_B + k_3, \tfrac{n'}{2}, \tfrac{n'}{2}\right) = \frac{\dbinom{n'/2}{k_B}\dbinom{n'/2}{k_3}}{\dbinom{n'}{k_B + k_3}}, \tag{9}$$

where $n' = k_A + k_B + k_3 + k_D$. Fig. 3 shows an example of making a Type II error and its probability. The crosshatched area in Fig. 3 can be calculated with the help of Eq. (5):

$$\beta = F(k''_\varepsilon, K_\beta) - F(k'_\varepsilon, K_\beta) = F(K_\varepsilon - k'_\varepsilon, K_\beta) - F(k'_\varepsilon, K_\beta), \tag{10}$$

so the power function will be

$$\mathrm{Pow}\!\left(\frac{K_\beta}{2} = \frac{k_B + k_3}{2}\right) = 1 - \beta = 1 + \sum_{k=0}^{k'_\varepsilon} \frac{\dbinom{n'/2}{k}\dbinom{n'/2}{K_\beta - k}}{\dbinom{n'}{K_\beta}} - \sum_{k=0}^{K_\varepsilon - k'_\varepsilon} \frac{\dbinom{n'/2}{k}\dbinom{n'/2}{K_\beta - k}}{\dbinom{n'}{K_\beta}}. \tag{11}$$

Fig. 3. Representation of the probability of Type II error ($\beta$) by two possible distribution functions according to $H_0$ and $H_A$.

The value of $k'_\varepsilon$ can be calculated from

$$2\sum_{k=0}^{k'_\varepsilon} \frac{\dbinom{n/2}{k}\dbinom{n/2}{K_\varepsilon - k}}{\dbinom{n}{K_\varepsilon}} \le \varepsilon \tag{12}$$

for a given $\varepsilon$. For the difference $z$, the power function will be

$$\mathrm{Pow}\!\left(z = \frac{K_\varepsilon}{2} - \frac{K_\beta}{2}\right) = 1 - \beta = 1 + \sum_{k=0}^{k'_\varepsilon} \frac{\dbinom{n'/2}{k}\dbinom{n'/2}{k_B + k_C - 2z - k}}{\dbinom{n'}{k_B + k_C - 2z}} - \sum_{k=0}^{K_\varepsilon - k'_\varepsilon} \frac{\dbinom{n'/2}{k}\dbinom{n'/2}{k_B + k_C - 2z - k}}{\dbinom{n'}{k_B + k_C - 2z}}, \tag{13}$$

and its graph can be seen in Fig. 4. If the graph is steeper around $z = 0$, and if $\mathrm{Pow}(z)$ is close to 1, the false null hypothesis is accepted with smaller probability.

Fig. 4. Power function of the test statistic for PCM.

3.3. Type II error, a complex unconditional deduction

In Eq. (8) of the previous section, a specific $K_\beta$ was considered as the real value for $K$ instead of $K_\varepsilon$; the latter could appear only by some random effect. If the value of $K_\beta$ is not known in advance, all possible values have to be taken into account. Two independent random binomial samples of size $n/2$ are assumed in conformity with Table 2, i.e., the binomial events are the following: $X_1$ or $X_2$ has the stronger correlation with $Y$. Consider the odds ratio defined as

$$\psi = \frac{p_B/(1 - p_B)}{p_C/(1 - p_C)}, \tag{14}$$

where $p_B = k_B/(n/2)$ and $p_C = k_C/(n/2)$ are the probabilities of the two events according to the rows of Table 2.

The conditional distribution of $k_B$, at a given $K = k_B + k_C$ and $\psi$, is according to Ref. [17]

$$f(k_B \mid K; \psi) = \frac{\dbinom{n/2}{k_B}\dbinom{n/2}{K - k_B}\,\psi^{k_B}}{\displaystyle\sum_{i=\max(0,\,K - n/2)}^{\min(K,\,n/2)} \dbinom{n/2}{i}\dbinom{n/2}{K - i}\,\psi^{i}}. \tag{15}$$

The conditional ($W_{\mathrm{cond}}$) and unconditional ($W_{\mathrm{uncond}}$) critical regions are defined as in Ref. [18], but they have been modified according to the two-sided test:

$$W_{\mathrm{cond}}(\varepsilon, K) = \left\{ (k_b, k_c) : k_b \ne k_c;\ \sum_{x=\max(0,\,K - n/2)}^{\min(k_b,\,k_c)} f(x \mid K; \psi = 1) \le \frac{\varepsilon}{2},\ \sum_{x=\max(k_b,\,k_c)}^{\min(K,\,n/2)} f(x \mid K; \psi = 1) \le \frac{\varepsilon}{2} \right\}, \tag{16}$$

$$W_{\mathrm{uncond}}(\varepsilon) = \bigcup_{K=0}^{n} W_{\mathrm{cond}}(\varepsilon, K). \tag{17}$$

$W_{\mathrm{uncond}}(\varepsilon)$ is the union of $n + 1$ mutually exclusive conditional critical regions, some of which might be empty. Let

$$f(K) = \sum_{i=\max(0,\,K - n/2)}^{\min(K,\,n/2)} \binom{n/2}{i} p_B^{\,i} (1 - p_B)^{n/2 - i} \times \binom{n/2}{K - i} p_C^{\,K - i} (1 - p_C)^{n/2 - K + i} \tag{18}$$

indicate the marginal distribution of $K$. The unconditional power can be given based on $W(\varepsilon)$ as

$$\mathrm{Pow}(p_B, p_C) = P\{(k_b, k_c) \in W_{\mathrm{uncond}}(\varepsilon)\} = \sum_{K=0}^{n} P\{(k_b, k_c) \in W_{\mathrm{cond}}(\varepsilon, K)\} = \sum_{K=0}^{n} f(K) \sum_{W_{\mathrm{cond}}(\varepsilon, K)} f(k_b \mid K; \psi), \tag{19}$$

where $k_b$ and $k_c$ are variables whose values can vary from 0 to $n/2$. The sets $W_{\mathrm{cond}}(\varepsilon, K)$ and $W_{\mathrm{uncond}}(\varepsilon)$ can be constructed using the tables introduced by Finney et al. [13].

For example, for $n = 20$ and $\varepsilon = 0.05$ the rejection region is shown in Table 3. The values in this table are calculated using Eq. (20) if $k_b < k_c$ and Eq. (21) if $k_b > k_c$, following the form of Eq. (5):

$$\sum_{x=\max(0,\,k_b + k_c - n/2)}^{k_b} f(x \mid (k_b + k_c); \psi = 1), \tag{20}$$

$$\sum_{x=\max(k_b,\,k_c)}^{\min(k_b + k_c,\,n/2)} f(x \mid (k_b + k_c); \psi = 1). \tag{21}$$

If the value is less than $\varepsilon/2 = 0.025$, then the $(k_b, k_c)$ pair belongs to the conditional critical region $W_{\mathrm{cond}}(\varepsilon, K = k_b + k_c)$. For easier interpretation of Table 3, numbers belonging to one cumulative distribution, e.g., at $K = 5$ (that is, $k_b = 5 - k_c$), are displayed with a grey background. Only half of the cumulative distributions were calculated, because of the symmetry of the distribution. The two-sided approach means a summation from 0 to the lower value of $k_b$ and $k_c$, and also a summation from the higher value of $k_b$ and $k_c$ to $n/2$. The other cumulative distributions are situated perpendicularly (and diagonally) to the designated one (Table 3).

Table 3
Critical region (bolded values) for n = 20 and ε = 0.05

According to the above example, the unconditional critical region is given by Eq. (17), i.e., by unifying the $(k_b, k_c)$ pairs for which the value of the cumulative distribution function is not larger than $\varepsilon/2 = 0.025$. Thus, $W_{\mathrm{uncond}}$ will be the union ($\cup$) of the appropriate sets $W_{\mathrm{cond}}$ (e.g., $W_{\mathrm{cond}}(\varepsilon, 5) = \{(5,0), (0,5)\}$) according to Eq. (17), and the detailed enumeration is the following:

$$W_{\mathrm{uncond}}(\varepsilon) = W_{\mathrm{cond}}(\varepsilon, 5) \cup W_{\mathrm{cond}}(\varepsilon, 6) \cup W_{\mathrm{cond}}(\varepsilon, 7) \cup W_{\mathrm{cond}}(\varepsilon, 8) \cup W_{\mathrm{cond}}(\varepsilon, 9) \cup W_{\mathrm{cond}}(\varepsilon, 10) \cup W_{\mathrm{cond}}(\varepsilon, 11) \cup W_{\mathrm{cond}}(\varepsilon, 12) \cup W_{\mathrm{cond}}(\varepsilon, 13) \cup W_{\mathrm{cond}}(\varepsilon, 14) \cup W_{\mathrm{cond}}(\varepsilon, 15),$$

because $W_{\mathrm{cond}}(\varepsilon, r) = \emptyset$ for $r \in \{0, 1, 2, 3, 4, 16, 17, 18, 19, 20\}$. Thus,

$$
\begin{aligned}
W_{\mathrm{uncond}}(\varepsilon) ={} & \{(5,0),(0,5)\} \cup \{(6,0),(0,6)\} \cup \{(7,0),(0,7)\} \\
& \cup \{(8,0),(7,1),(1,7),(0,8)\} \cup \{(9,0),(8,1),(1,8),(0,9)\} \\
& \cup \{(10,0),(9,1),(8,2),(2,8),(1,9),(0,10)\} \cup \{(10,1),(9,2),(2,9),(1,10)\} \\
& \cup \{(10,2),(9,3),(3,9),(2,10)\} \cup \{(10,3),(3,10)\} \cup \{(10,4),(4,10)\} \cup \{(10,5),(5,10)\}.
\end{aligned}
$$

Several algorithms are available for fast and reliable calculation of the power function [19–21]. A new, easily programmable algorithm was, however, developed and used for PCM, based on the Stirling formula described in Appendix B.

4. Discrimination between two variables

This section summarises the methods compared to PCM. Correlation means a stochastic relationship between two random variables $\eta$, $\xi$, i.e., $P(\eta < y \mid \xi = x) \ne P(\eta < y)$, where $P(\cdot)$ means the probability of the event in the brackets. The numerical measure in the well-known parametric treatment of correlation may be the Pearson product moment correlation coefficient, $r$:

$$r = r(\xi, \eta) = \frac{M\big[(\xi - M(\xi))(\eta - M(\eta))\big]}{D(\xi)\,D(\eta)}. \tag{22}$$

It can only be calculated for distribution functions with finite means $M[\eta]$, $M[\xi]$ and finite non-zero variances $V[\eta] = D^2[\eta]$, $V[\xi]$. For more about the correlation coefficient see, e.g., Falk and Well [22]. If the assumptions are not fulfilled exactly, it is expedient to use robust and fuzzy procedures (see Ref. [23] and the references therein).

In correlation analysis, the observations are drawn from the joint distribution of $X$ and $Y$, which are assumed to have a bivariate normal distribution, and inferences concerning the correlation between $X$ and $Y$ can be made. The sample multiple correlation coefficient of $Y$ with $X_1$ and $X_2$ is defined as the simple correlation coefficient between $Y$ and its predicted value $\hat{Y}$ ($Y$ and $\hat{Y}$ have a bivariate normal distribution in most of the practically relevant cases) [24]. It is denoted by $r$:

$$r = r_{y\hat{y}} = \frac{\sum_{i=1}^{m} y_i \hat{y}_i}{\sqrt{\sum_{i=1}^{m} y_i^2 \,\sum_{i=1}^{m} \hat{y}_i^2}}, \tag{23}$$

where $y_i = Y_i - \bar{Y}$, $\hat{y}_i = \hat{Y}_i - \bar{Y}$, $\bar{Y} = \frac{1}{m}\sum_{i=1}^{m} Y_i$ and $\hat{Y}_i = a + b_1 X_{1i} + b_2 X_{2i}$.

It is possible to test whether the multiple correlation coefficient $r_{Y\,\mathrm{vs.}\,X_1,X_2}$ of $Y$ on $X_1$ and $X_2$ equals the simple correlation coefficient $r_{Y\,\mathrm{vs.}\,X_1}$ of $Y$ on only one variable, say, $X_1$. For testing the null hypothesis $H_0: \rho_{Y\,\mathrm{vs.}\,X_1,X_2} = \rho_{Y\,\mathrm{vs.}\,X_1}$ (the Greek letter $\rho$ means the expected value of $r$, i.e., $\rho$ is the population correlation), the test statistic will be

$$F = \frac{\dfrac{r^2_{Y\,\mathrm{vs.}\,X_1,X_2} - r^2_{Y\,\mathrm{vs.}\,X_1}}{(m-1) - (m-2)}}{\dfrac{1 - r^2_{Y\,\mathrm{vs.}\,X_1,X_2}}{m - 2 - 1}}, \tag{24}$$

and this $F$ value should be compared to the critical value $F[1 - \varepsilon, 1, m - 3]$ from the table of the F distribution for a significance level $\varepsilon$ and the given degrees of freedom. This test is the same as the test for $H_0: \beta_2 = 0$ (the Greek letter $\beta$ means the expected value of $b$), thus one can decide whether the variable $X_2$ is significant or not. This procedure gives a simple selection criterion for choosing between the two variables $X_1$ and $X_2$.

The selection criterion based on Eq. (24) can only be used if one of the two variables is non-significant compared to the other. If $X_1$ and $X_2$ are approximately equally correlated with $Y$, another test statistic is required. It is based on Fisher's $z$ statistic [24,25]:

$$z = \frac{1}{2}\ln\frac{1 + r}{1 - r}. \tag{25}$$

It has been shown [25] that $z$ is approximately normally distributed with mean $\frac{1}{2}\ln[(1 + \rho)/(1 - \rho)]$ and variance $1/(m - 3)$. The null hypothesis is $H_0: \rho_{Y\,\mathrm{vs.}\,X_1} = \rho_{Y\,\mathrm{vs.}\,X_2}$. The test statistic $(z_{Y\,\mathrm{vs.}\,X_1} - z_{Y\,\mathrm{vs.}\,X_2})$ has a normal distribution with mean 0 and variance $1/(m_1 - 3) + 1/(m_2 - 3)$ when the sample correlations $r_{Y\,\mathrm{vs.}\,X_1}$ and $r_{Y\,\mathrm{vs.}\,X_2}$ are calculated from independent samples of size $m_1$ and $m_2$, respectively. One can use normal tables for testing whether

$$\frac{1}{2}\ln\frac{(1 + r_{Y\,\mathrm{vs.}\,X_1})(1 - r_{Y\,\mathrm{vs.}\,X_2})}{(1 - r_{Y\,\mathrm{vs.}\,X_1})(1 + r_{Y\,\mathrm{vs.}\,X_2})} \tag{26}$$

equals zero.

5. Discussion

It is always a difficult task to validate a new method. In particular, if the new method is more exact and precise, validation against accepted methods of lower precision is impossible. However, there are some ways to test and justify a new method.

First, the theoretical deduction should be mentioned. The derivation of Type I and Type II errors in the preceding section is a clear indication of the correctness of the method.

Secondly, Monte Carlo simulations should be mentioned. They were designed to take the opposite approach: if there is no correlation between $Y$ and $X_1$ or between $Y$ and $X_2$, then PCM should not produce any artefacts. Extended Monte Carlo simulations were made using various vector lengths, different distributions and many thousands of repetitions. The results are still being evaluated and will be published in due course. Here we mention only that (i) the results of the Monte Carlo simulations are not in contradiction with the theoretical deduction presented here; (ii) the artefact rate was 4.6% ($N = 20$, $\varepsilon = 0.05$), using random numbers from a uniform distribution. The method found differences between the factors $X_1$ and $X_2$ with about the same probability as that with which such differences were generated at random.

Thirdly, empirical experience should be mentioned. Many hundreds of applications of PCM, and comparisons of the results with classical methods, suggest that PCM is able to find a significant difference between factors more frequently than any of the classical methods. On the other hand, whenever classical methods find a factor to be superior, PCM finds the same superiority.
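The conditional test of Section 3.1 and the critical regions of Section 3.3 can be retraced with a short Python sketch. It is only an illustration under stated assumptions: the function names (`hyper_cdf`, `pcm_fisher_p`, `w_cond`) are hypothetical, exact binomial coefficients from `math.comb` are used instead of the Stirling approximation of Appendix B, and an even $n$ is assumed so that $n/2$ is an integer.

```python
from math import comb


def hyper_cdf(t, K, half):
    """Cumulative hypergeometric probability F(t, K) of Eq. (5).

    `half` stands for n/2, the size of each row margin in Table 2.
    """
    if t < 0:
        return 0.0
    total = comb(2 * half, K)
    upper = min(t, K)
    return sum(comb(half, k) * comb(half, K - k) for k in range(upper + 1)) / total


def pcm_fisher_p(k_a, k_b, k_c, k_d):
    """Two-sided value 2F(min(k_B, k_C), K), the quantity compared with epsilon."""
    n = k_a + k_b + k_c + k_d
    if n % 2:
        # Assumption of this sketch: the text works with n/2 as an integer.
        raise ValueError("n must be even in this sketch")
    K = k_b + k_c
    return min(1.0, 2.0 * hyper_cdf(min(k_b, k_c), K, n // 2))


def w_cond(eps, K, half):
    """Conditional critical region W_cond(eps, K) of Eq. (16) for psi = 1.

    By symmetry, a pair belongs to the region when the tail that starts at
    the smaller of (k_b, k_c) does not exceed eps / 2.
    """
    region = set()
    for k_b in range(max(0, K - half), min(K, half) + 1):
        k_c = K - k_b
        if k_b != k_c and hyper_cdf(min(k_b, k_c), K, half) <= eps / 2:
            region.add((k_b, k_c))
    return region


if __name__ == "__main__":
    # Frequencies k_A = 50, k_B = 6, k_C = 2, k_D = 0 (n = 58, K = 8).
    print(round(pcm_fisher_p(50, 6, 2, 0), 3))  # about 0.253
    # n = 20 (half = 10), eps = 0.05, K = 5.
    print(sorted(w_cond(0.05, 5, 10)))  # [(0, 5), (5, 0)]
```

With $n = 20$ and $\varepsilon = 0.05$ the sketch returns $\{(5,0), (0,5)\}$ for $K = 5$ and an empty set for $K = 4$, in line with the enumeration of $W_{\mathrm{uncond}}(\varepsilon)$ given in Section 3.3.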
5.1. Case study 1 (simulated data)

First, the results derived from simulated data were investigated. Two vectors ($X_1$ and $X_2$) were created by a random number generator. The dependent variable ($Y$) was defined as the sum of $X_1$ and $X_2$. Thus, the correlations between $Y$ and $X_1$ as well as between $Y$ and $X_2$ were ensured, whereas there was no correlation between $X_1$ and $X_2$. The results below were chosen from hundreds of simulations to show that a seemingly large ratio is not necessarily accompanied by a significant difference between variables. Using PCM, the frequencies summarised in Table 4 are obtained.

Table 4
Frequencies obtained by applying PCM for the simulated data

                   ΔX2 > 0        ΔX2 < 0
    ΔX1 > 0        k_A = 50       k_B = 6
    ΔX1 < 0        k_C = 2        k_D = 0

To make a decision, the following values of the distribution function were calculated from Eq. (5):

$$2F(k = k_C = 2,\ K = 2 + 6) = 0.253 > 0.1 > 0.05 > 0.01, \tag{27}$$

$$2F(k = 1,\ K = 1 + 7) = 0.0517 < 0.1 \ \ (\text{but} > 0.05 > 0.01), \tag{28}$$

$$2F(k = 0,\ K = 0 + 8) = 0.00448 < 0.01 < 0.05 < 0.1. \tag{29}$$

According to Eq. (27), the null hypothesis, i.e., that the two variables are equivalently correlated, cannot be rejected at the 1%, 5% or even the 10% level. This is because the data for $X_1$ and $X_2$ were generated by the same uniform random number generator. The classical F test of correlation coefficients gives the same result: the two variables $X_1$ and $X_2$ are indistinguishable.

If the null hypothesis is accepted, then the alternative hypothesis has to be rejected, which means that a Type II error can occur. The question remains: what is the probability of this event? The probability of Type II error, $\beta$, is 0.82 at a significance level of $\varepsilon = 0.05$. This value of $\beta$ is very high. However, $\varepsilon = 0.05$ is only a nominal value; the decision can be made at a level as high as $\varepsilon = 0.253$. The Type II error will then be 0.46, which is much smaller than before. Increasing the number of samples can also reduce the false negative error and increase the power of the test, as seen in Table 5.

Table 5
Reducing the false negative error at the nominal value ε = 0.05

    2 × 2 table (k_A, k_B, k_C, k_D)      β         Power
     50,   6,   2,  0                     0.822     0.178
    100,  12,   4,  0                     0.518     0.482
    125,  15,   5,  0                     0.390     0.610
    150,  18,   6,  0                     0.292     0.708
    175,  21,   7,  0                     0.213     0.787
    200,  24,   8,  0                     0.163     0.837
    225,  27,   9,  0                     0.118     0.882
    250,  30,  10,  0                     0.082     0.918

The conclusions from this case study are the following. (i) PCM does not indicate a difference between factors if no difference was generated in the data, even if a seemingly large ratio exists between the frequencies $k_B$ and $k_C$. (ii) The method is conservative enough not to signal a difference (see the probability of Type II error). (iii) To keep both types of error low, the number of vector elements (degrees of freedom) should be increased.

5.2. Case study 2 (real data)

It is difficult to find an example for which the classical tests do not indicate a significant difference between factors although it is known for sure that such a difference exists. As an experimental example mentioned before [4], the 2-cyano-2-propyl radical addition to vinyl-type alkenes is considered, based on preliminary results in Ref. [6]. The logarithms of the addition rate constants of some alkenes were investigated as functions of the reaction enthalpies ($\Delta H_r$) and the electron affinities of the alkenes (EA). The F-test according to Eq. (24) could not help to choose between the variables EA and $\Delta H_r$ ($F = 4.054$ at $p = 0.0612 > 0.05$). The absolute values of the correlation coefficients were very close to each other ($r_{\log k\,\mathrm{vs.}\,\mathrm{EA}} = 0.8427$ and $r_{\log k\,\mathrm{vs.}\,\Delta H_r} = -0.8554$), and the z-statistic according to Eq. (26) could not differentiate between them ($\Delta z = z_{\log k\,\mathrm{vs.}\,\Delta H_r} - z_{\log k\,\mathrm{vs.}\,\mathrm{EA}} = -0.04582$, $\mathrm{Var}[\Delta z] = 0.125$, $p = 0.2758 \gg 0.05$). Heuristically, a possible dominance of the reaction enthalpy was predicted by the pair-correlation method, but it was not proved statistically. Table 6 shows the results of PCM calculated with a realistic additive error level of 0.2 [6].

Table 6
Frequencies obtained by applying PCM to discriminate between EA and ΔHr

                    ΔEA > 0         ΔEA < 0
    ΔΔHr > 0        k_A = 109       k_B = 22
    ΔΔHr < 0        k_C = 6         k_D = 3

$$2F(k = k_C = 6,\ K = 6 + 22) = 0.00123 < 0.01 < 0.05 < 0.1, \tag{30}$$

$$2F(k = 7,\ K = 7 + 21) = 0.00537 < 0.01 < 0.05 < 0.1, \tag{31}$$

$$2F(k = 8,\ K = 8 + 20) = 0.0190 < 0.05 < 0.1 \ \ (\text{but} > 0.001), \tag{32}$$

$$2F(k = 9,\ K = 9 + 19) = 0.0560 < 0.1 \ \ (\text{but} > 0.05 > 0.001), \tag{33}$$

$$2F(k = 10,\ K = 10 + 18) = 0.138 > 0.1 > 0.05 > 0.01. \tag{34}$$

It is obvious that the reaction enthalpy $\Delta H_r$ has a stronger correlation with $\log k$ than the electron affinity EA. Thus, the dominance of the variable $\Delta H_r$ is proven statistically at a confidence level of 99.9% according to Eq. (30). The null hypothesis was rejected at the nominal value of $\varepsilon = 0.05$, meaning that one can make a false positive conclusion with probability 0.05, i.e., one fails in 5% of the occurrences. On the other hand, $H_0$ can be accepted only with a probability of less than 0.00123. At that level the probability of a false positive decision is only 1.23 per thousand events, but the false negative error is $\beta = 0.49$, i.e., a mistake is made in almost half of the occurrences. At the nominal value of $\varepsilon = 0.05$ the power of the test is 0.914, which is rather reassuring, so in this example it is highly recommended to reject $H_0$ and to accept $H_A$.
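The classical comparison of Eqs. (25) and (26) quoted in this case study can be retraced with the small Python sketch below. It is an illustration only: the function name `fisher_z_difference` is hypothetical, the magnitudes of the reported correlation coefficients are used because the text compares their absolute values, and the sample size $m = 19$ is not stated in the text but inferred here from the reported $\mathrm{Var}[\Delta z] = 0.125 = 2/(m - 3)$. The tail probability is deliberately left to normal tables, as in the text.

```python
from math import atanh, sqrt


def fisher_z_difference(r1, r2, m1, m2):
    """Difference of Fisher z values (Eq. 26), its variance and standard score.

    atanh(r) equals 0.5 * ln((1 + r) / (1 - r)), i.e. Eq. (25).
    The variance 1/(m1 - 3) + 1/(m2 - 3) assumes independent samples.
    """
    dz = atanh(r1) - atanh(r2)
    var = 1.0 / (m1 - 3) + 1.0 / (m2 - 3)
    return dz, var, dz / sqrt(var)


if __name__ == "__main__":
    # |r| for log k vs. EA and log k vs. Delta H_r as reported in the text;
    # m = 19 is an inferred value (assumption), not given in the source.
    dz, var, score = fisher_z_difference(0.8427, 0.8554, 19, 19)
    print(dz, var, score)
```

The difference obtained from the rounded correlation coefficients is close to the reported $|\Delta z| = 0.04582$, and the standard score is small compared with the usual normal critical values, which agrees with the statement that the z-statistic could not differentiate between the two variables.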
6. Conclusions

Some improvements over the earlier version of PCM [6] were made to the algorithm in order to avoid the use of correlation coefficients as parametric characteristics. The discrimination of variables is based on a 2 × 2 contingency table. The conditional Fisher's exact test was introduced to test the hypothesis. The power function of the test statistic and a description of the influence of Type I (false-positive conclusion, error of the first kind) and Type II (false-negative conclusion, error of the second kind) errors have been given by theoretical deduction. To the best of our knowledge, this is the first report in the literature on using the concept of Type I and Type II errors to validate a variable selection method.

The consequences of Type I and Type II errors were shown in detail in two case studies. They can help to understand the principles of the algorithm and to avoid the pitfalls of hypothesis testing. Example 1 stressed the importance of increasing the sample size when $\beta$ is too high to accept $H_0$ at low risk. Example 2 showed a situation in which one can reject $H_0$, and the hazard of choosing a very low value for $\varepsilon$.

The results shown in this paper are not only related to PCM; they can be used generally for any 2 × 2 table and for Fisher's exact test in chemometric problems. This is all the more relevant because the Type II error of Fisher's exact test is not mentioned at all in chemometric papers and handbooks.

The advantages of the pair-correlation method with Fisher's exact test as a selection criterion can be summarised as follows. The method is able to find significant differences between factors (models), even if other statistical criteria cannot indicate them. PCM does not need the assumption of a Gaussian or other fixed type of distribution. In contrast to the classical methods, PCM can work with correlated variables. PCM can easily be generalised to variable selection among more than two variables.
The comparison of factors can be made pair-wise in all possible combinations. Every comparison can mark a factor as superior or inferior, or no decision can be made. The factors are then ordered according to the number of times they were found superior. Moreover, PCM can be generalised for any fixed nonlinear model. In such cases, $\hat{Y}_1$ and $\hat{Y}_2$ should be used instead of $X_1$ and $X_2$.

Consistency is also important in developing the first non-parametric variable selection method. Hence, the rearrangement of boxes (summarised in Appendix A) is introduced, which has the advantage of using non-parametric methods consistently.

As a disadvantage, we can mention only that PCM is not sensitive to prediction purposes. Prediction lies outside the scope of the method. A data set can easily be constructed for which a more correlated (superior) factor has a lower predictive ability as measured by classical tests, e.g., by the residual error. However, this happens rarely during normal usage of PCM. Generally, if a factor correlates better with the dependent variable than another factor (i.e., it is superior by PCM), it also has the better prediction performance. In the vast majority of cases, PCM can select the better variable for prediction as well.

The novelty of the paper can be summarised as follows: (i) this is the first approach to discriminate between seemingly equivalent factors, i.e., to select variables in a non-parametric way, (ii) this is the first presentation of the application of a statistically correct selection criterion to a non-parametric variable selection method and (iii) this is the first appearance of the derivation of Type I and Type II errors for the two-sided Fisher's exact test in the chemometric literature.

A user-friendly program [28] is available from the authors upon request.

Acknowledgements

The Academic Research Project (No. AKP 98-51 2,4/19) and the Hungarian Science Foundation (No.
OTKA F-025287) supported this scientific research. The authors would like to acknowledge the helpful critical comments of an anonymous referee.

Appendix A

There are 75 different outputs from the raw use of PCM. The following list summarises all possible outcomes:

(1) $k_A = k_D > k_B > k_C$: Situation −1
(2) $k_A = k_D > k_C > k_B$: Situation −1
(3) $k_B = k_C > k_A > k_D$: Situation −1
(4) $k_B = k_C > k_D > k_A$: Situation −1
(5) $k_A = k_B > k_C > k_D$: Situation 0
(6) $k_A = k_B > k_C = k_D$: Situation 0
(7) $k_A = k_B = k_C > k_D$: Situation 0
(8) $k_A = k_B = k_C = k_D$: Situation 0
(9) $k_A > k_B > k_C > k_D$: Situation 0
(10) $k_A > k_B > k_C = k_D$: Situation 0
(11) $k_A > k_B = k_C > k_D$: Situation 0
(12) $k_A > k_B = k_C = k_D$: Situation 0
(13) $k_A > k_B > k_D > k_C$: Situation 0
(14) $k_A > k_B = k_D > k_C$: Situation 0
(15) $k_A = k_C > k_B > k_D$: Situation 0
(16) $k_A = k_C > k_B = k_D$: Situation 0
(17) $k_A > k_C > k_B > k_D$: Situation 0
(18) $k_A > k_C > k_B = k_D$: Situation 0
(19) $k_A > k_C > k_D > k_B$: Situation 0
(20) $k_A > k_C = k_D > k_B$: Situation 0
(21) $k_A = k_D > k_B = k_C$: Situation 0
(22) $k_A > k_D > k_B > k_C$: Situation 0
(23) $k_A > k_D > k_B = k_C$: Situation 0
(24) $k_A > k_D > k_C > k_B$: Situation 0
(25) $k_A = k_B > k_D > k_C$: Situation 1
(26) $k_A = k_B = k_D > k_C$: Situation 1
(27) $k_B > k_A > k_C > k_D$: Situation 1
(28) $k_B > k_A > k_C = k_D$: Situation 1
(29) $k_B > k_A = k_C > k_D$: Situation 1
(30) $k_B > k_A = k_C = k_D$: Situation 1
(31) $k_B > k_A > k_D > k_C$: Situation 1
(32) $k_B > k_A = k_D > k_C$: Situation 1
(33) $k_B = k_C > k_A = k_D$: Situation 1
(34) $k_B > k_C > k_A > k_D$: Situation 1
(35) $k_B > k_C > k_A = k_D$: Situation 1
(36) $k_B > k_C > k_D > k_A$: Situation 1
(37) $k_B > k_C = k_D > k_A$: Situation 1
(38) $k_B = k_D > k_A > k_C$: Situation 1
(39) $k_B = k_D > k_A = k_C$: Situation 1
(40) $k_B > k_D > k_A > k_C$: Situation 1
(41) $k_B > k_D > k_A = k_C$: Situation 1
(42) $k_B > k_D > k_C > k_A$: Situation 1
(43) $k_A = k_C > k_D > k_B$: Situation 2
(44) $k_A = k_C = k_D > k_B$: Situation 2
(45) $k_C > k_A > k_B > k_D$: Situation 2
(46) $k_C > k_A > k_B = k_D$: Situation 2
(47) $k_C > k_A = k_B > k_D$: Situation 2
(48) $k_C > k_A = k_B = k_D$: Situation 2
(49) $k_C > k_A > k_D > k_B$: Situation 2
(50) $k_C > k_A = k_D > k_B$: Situation 2
(51) $k_C > k_B > k_A > k_D$: Situation 2
(52) $k_C > k_B > k_A = k_D$: Situation 2
(53) $k_C > k_B > k_D > k_A$: Situation 2
(54) $k_C > k_B = k_D > k_A$: Situation 2
(55) $k_C = k_D > k_A > k_B$: Situation 2
(56) $k_C = k_D > k_A = k_B$: Situation 2
(57) $k_C > k_D > k_A > k_B$: Situation 2
(58) $k_C > k_D > k_A = k_B$: Situation 2
(59) $k_C > k_D > k_B > k_A$: Situation 2
(60) $k_B = k_C = k_D > k_A$: Situation 3
(61) $k_B = k_D > k_C > k_A$: Situation 3
(62) $k_C = k_D > k_B > k_A$: Situation 3
(63) $k_D > k_A > k_B > k_C$: Situation 3
(64) $k_D > k_A > k_B = k_C$: Situation 3
(65) $k_D > k_A = k_B > k_C$: Situation 3
(66) $k_D > k_A = k_B = k_C$: Situation 3
(67) $k_D > k_A > k_C > k_B$: Situation 3
(68) $k_D > k_A = k_C > k_B$: Situation 3
(69) $k_D > k_B > k_A > k_C$: Situation 3
(70) $k_D > k_B > k_A = k_C$: Situation 3
(71) $k_D > k_B > k_C > k_A$: Situation 3
(72) $k_D > k_B = k_C > k_A$: Situation 3
(73) $k_D > k_C > k_A > k_B$: Situation 3
(74) $k_D > k_C > k_A = k_B$: Situation 3
(75) $k_D > k_C > k_B > k_A$: Situation 3

If $k_A$ is not the largest one, then the boxes A, B, C and D must be rearranged. The situations above are interpreted as follows:

Situation −1    ambiguous situation, PCM cannot make a distinction
Situation 0     no change needed
Situation 1     changes needed like $k_A \leftrightarrow k_B$ and $k_C \leftrightarrow k_D$ ($Y$ vs. $X_1$ and $Y$ vs. $-X_2$)
Situation 2     changes needed like $k_A \leftrightarrow k_C$ and $k_B \leftrightarrow k_D$ ($Y$ vs. $-X_1$ and $Y$ vs. $X_2$)
Situation 3     changes needed like $k_A \leftrightarrow k_D$ and $k_B \leftrightarrow k_C$ ($Y$ vs. $-X_1$ and $Y$ vs. $-X_2$)

The ambiguous situation does not occur frequently. Even if an ambiguous situation occurs, it does not cause any problem. If the output is number 1 or 2 of the list above, then $k_A = k_D$. It means that the correlation is enhanced and weakened equally. For outputs 3 and 4, the realignment of boxes can be done as prescribed for either Situation 1 or Situation 2. In those cases, PCM finds that $X_1$ and $X_2$ are equally correlated with $Y$, as if $k_B$ and $k_C$ were statistically indistinguishable.

It can happen that $k_D$ will not be the smallest one after reshuffling, because the main rule is that $k_A$ has to be the largest. The modification is based on the properties of the criterion function $CF(i,j)$ and avoids the use of correlation coefficients, a parametric characteristic that was used in the previously presented algorithm of PCM [6]. The criterion function for rearrangement is given by:

\[
CF(i,j) =
\begin{cases}
0, & \text{if } \Delta_1 \cdot \Delta_2 = 0 \\
(-2\Delta_1 - \Delta_2 + 5)/2, & \text{otherwise,}
\end{cases}
\tag{35}
\]
where $\Delta_k = \operatorname{sgn}(X_{ki} - X_{kj}) \cdot \operatorname{sgn}(Y_i - Y_j)$, $k = 1$ or 2. Thus, the number of cases in box A is $\#\{CF(i,j) = 1;\ 1 \le i < j \le n\}$, in box B is $\#\{CF(i,j) = 2;\ 1 \le i < j \le n\}$, in box C is $\#\{CF(i,j) = 3;\ 1 \le i < j \le n\}$, in box D is $\#\{CF(i,j) = 4;\ 1 \le i < j \le n\}$, and there will be $\#\{CF(i,j) = 0;\ 1 \le i < j \le n\}$ ignored cases.

An easy derivation of $CF(i,j)$ is based on finding the simplest function that depends only on $\Delta_k = \operatorname{sgn}(X_{ki} - X_{kj}) \cdot \operatorname{sgn}(Y_i - Y_j)$, $k = 1$ or 2:

\[ CF(i,j) = a\Delta_1 + b\Delta_2 + c\Delta_1\Delta_2 + d. \tag{36} \]

Four equations can be created to determine the four unknown coefficients $a$, $b$, $c$ and $d$ (here $\Delta_k$ takes only the values 1 and −1):

\[
\begin{aligned}
CF(i,j) = 1 &= a \cdot 1 + b \cdot 1 + c \cdot 1 + d \\
CF(i,j) = 2 &= a \cdot 1 + b(-1) + c(-1) + d \\
CF(i,j) = 3 &= a(-1) + b \cdot 1 + c(-1) + d \\
CF(i,j) = 4 &= a(-1) + b(-1) + c \cdot 1 + d.
\end{aligned}
\tag{37}
\]

The solutions of this system of linear equations are $a = -1$, $b = -1/2$, $c = 0$ and $d = 5/2$. Substitution of these results into Eq. (36) yields Eq. (35).

Appendix B

To calculate factorials the following Stirling approximation can be used [26,27]:

\[
n! = \left(\frac{n}{e}\right)^{n} \sqrt{2\pi n}\; e^{B^{*}_{n,h}},
\qquad
B^{*}_{n,h} = \sum_{j=1}^{h} \frac{B_{2j}}{(2j-1)\,2j\,n^{2j-1}},
\tag{38}
\]

where the Bernoulli numbers $B_j$ are defined as

\[
B_0 = 1,
\qquad
B_j = -\frac{\sum_{i=0}^{j-1} \binom{j+1}{i} B_i}{\binom{j+1}{j}},
\qquad j = 1, 2, 4, 6, 8, 10, \ldots
\tag{39}
\]

because $B_3 = B_5 = B_7 = B_9 = \ldots = 0$.

The binomial coefficient $\binom{n}{k}$ can be calculated using Eq. (38):

\[
\binom{n}{k} = \frac{1}{\sqrt{2\pi}}\,
\frac{n^{n+\frac{1}{2}}}{k^{k+\frac{1}{2}}\,(n-k)^{n-k+\frac{1}{2}}}\;
e^{\left(B^{*}_{n,h} - B^{*}_{k,h} - B^{*}_{n-k,h}\right)}.
\tag{40}
\]

The value of $h$ may be chosen as any integer from 0 upwards. Sufficiently precise results have been obtained, however, already at $h = 3$.

References

[1] (a) M.L. Thompson, Int. Stat. Rev. 46 (1978) 1–19; (b) M.L. Thompson, Int. Stat. Rev. 46 (1978) 129–146.
[2] V. Centner, D.L. Massart, O.E. de Noord, S. de Jong, B.M. Vandeginste, C. Sterna, Anal. Chem. 68 (1996) 3851–3858.
[3] W.J.
Conover, Practical Nonparametric Statistics. 2nd edn., Wiley, New York, 1980 Chap. 4.
[4] K. Héberger, H. Fischer, Int. J. Chem. Kinet. 25 (1993) 249–263.
[5] K. Héberger, H. Fischer, Int. J. Chem. Kinet. 25 (1993) 913–920.
[6] K. Héberger, R. Rajkó, Discrimination of statistically equivalent variables in quantitative structure-activity relationships, in: F. Chen, G. Schüürmann (Eds.), Quantitative Structure-Activity Relationships (QSAR) in Environmental Sciences-VII. SETAC Press, Pensacola, Florida, 1997, pp. 423–431, Chap. 29.
[7] I. Vincze, Mathematische Statistik mit industriellen Anwendungen. Akadémiai Kiadó, Budapest, 1971 (in German).
[8] R.L. Mason, R.F. Gunst, J.L. Hess, Statistical Design and Analysis of Experiments with Applications to Engineering and Science. Wiley, New York, 1989.
[9] E.L. Lehmann, Testing Statistical Hypotheses. 2nd edn., Chapman and Hall, New York, 1993.
[10] W.H. Robertson, Technometrics 2 (1960) 103–107.
[11] G.J.G. Upton, J. R. Stat. Soc. A 145 (1982) 86–105.
[12] F. Yates, J. R. Stat. Soc. A 145 (1984) 426–463.
[13] D.J. Finney, R. Latsa, B.M. Bennett, P. Hsu, Tables for Testing Significance in a 2 × 2 Contingency Table. Cambridge Univ. Press, Cambridge, 1963.
[14] D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, S. de Jong, P.J. Lewi, J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics: Part A and Part B. Elsevier, Amsterdam, 1998.
[15] R.C. Serlin, J.R. Levin, J. Stat. Edu. 4 (2) (1996) http://www.stat.unipg.it/ncsu/info/jse/v4n2/serlin.html.
[16] N.M. Faber, M. Sjerps, H.A.L. Leijenhorst, S.E. Maljaars, Sci. Justice 39 (2) (1999) 113–122.
[17] R.A. Fisher, J. R. Stat. Soc. A 98 (1935) 39–54.
[18] M. Gail, J.J. Gart, Biometrics 29 (1973) 441–448.
[19] J.T. Casagrande, M.C. Pike, P.G. Smith, Appl. Stat. 27 (1978) 212–219.
[20] R.G. Thomas, M. Conlon, Technical Report 382, University of Florida, Gainesville, 1991.
[21] M. Conlon, R.G. Thomas, Appl. Stat. 42 (1993) 258–260.
[22] R. Falk, A.D. Well, J.
Stat. Edu. 5 (3) (1997) http://www.stat.unipg.it/ncsu/info/jse/v5n3/falk.html.
[23] R. Rajkó, Anal. Lett. 27 (1994) 215–228.
[24] O.J. Dunn, V.A. Clark, Applied Statistics: Analysis of Variance and Regression. 2nd edn., Wiley, New York, 1987.
[25] G.S. Mudholkar, Fisher's z-distribution, in: S. Kotz, N.L. Johnson (Eds.), Encyclopedia of Statistical Sciences, vol. 3. Wiley, New York, 1983.
[26] M. Abramowitz, C.A. Stegun, Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover, New York, 1972, 9th printing.
[27] P. Szász, Elements of Differential and Integral Calculus. Közoktatásügyi Kiadó, Budapest, 1951 (in Hungarian).
[28] R. Rajkó, K. Héberger, Program for Pair-Correlation Method (PCM) V1.0a written in Visual Basic for Applications of MS Excel V7.0, 1998.
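
Supplementary sketch. The formulas of Appendices A and B lend themselves to a short computational illustration. The following Python sketch is not the authors' program [28]; it is a minimal illustration under stated assumptions. The function names (`criterion_function`, `count_boxes`, `bernoulli_numbers`, `stirling_correction`, `log_factorial`, `binomial`) are hypothetical, the evaluation of the Stirling formula in logarithmic form is a choice made here for numerical convenience, and the default h = 3 simply follows the remark at the end of Appendix B.

```python
import math
from fractions import Fraction
from itertools import combinations


def sgn(v):
    """Sign of v: -1, 0 or 1."""
    return (v > 0) - (v < 0)


def criterion_function(d1, d2):
    """Eq. (35): 0 if d1*d2 == 0, otherwise (-2*d1 - d2 + 5)/2 (1..4 for boxes A..D)."""
    if d1 * d2 == 0:
        return 0
    return (-2 * d1 - d2 + 5) // 2  # numerator is always even here


def count_boxes(x1, x2, y):
    """Count point pairs (i < j) per box; key 0 holds the ignored pairs."""
    counts = {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}
    for i, j in combinations(range(len(y)), 2):
        sy = sgn(y[i] - y[j])
        d1 = sgn(x1[i] - x1[j]) * sy
        d2 = sgn(x2[i] - x2[j]) * sy
        counts[criterion_function(d1, d2)] += 1
    return counts


def bernoulli_numbers(m):
    """B_0..B_m from the recursion of Eq. (39), as exact fractions."""
    b = [Fraction(1)]
    for j in range(1, m + 1):
        s = sum(math.comb(j + 1, i) * b[i] for i in range(j))
        b.append(-s / (j + 1))  # binom(j+1, j) = j + 1
    return b


def stirling_correction(n, h):
    """B*_{n,h} of Eq. (38): sum over j = 1..h of B_2j / ((2j-1) * 2j * n^(2j-1))."""
    b = bernoulli_numbers(2 * h)
    return sum(
        float(b[2 * j]) / ((2 * j - 1) * 2 * j * n ** (2 * j - 1))
        for j in range(1, h + 1)
    )


def log_factorial(n, h=3):
    """ln n! according to Eq. (38), written in logarithmic form; needs n >= 1."""
    return (n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)
            + stirling_correction(n, h))


def binomial(n, k, h=3):
    """Approximate binomial coefficient in the spirit of Eq. (40); needs 0 < k < n."""
    return math.exp(log_factorial(n, h) - log_factorial(k, h)
                    - log_factorial(n - k, h))


if __name__ == "__main__":
    # Small made-up data set (an assumption for illustration only).
    print(count_boxes([1, 2, 3, 4], [1, 3, 2, 4], [1, 2, 3, 4]))
    print(binomial(28, 6), math.comb(28, 6))
```

In this sketch the keys 1 to 4 of the returned dictionary correspond to boxes A to D, so the counts can be compared directly with the orderings listed in Appendix A. Working with logarithms of the factorials avoids overflow for large n, which is one plausible reason for preferring Eq. (40) over a direct evaluation of factorials.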