International Conference on Mathematical and Statistical Modeling
in Honor of Enrique Castillo. June 28-30, 2006

Distance-based Multi-sample Tests for General Multivariate Data

Carles M. Cuadras*
Departament d'Estadística, Universitat de Barcelona

Abstract

Most multivariate tests are based on the hypothesis of multinormality. But often this hypothesis fails, or we have variables that are not quantitative. On the other hand, we may have to deal with a large number of variables. Defining probabilistic models with mixed data is not easy. However, it is always possible to define a measure of distance between two observations. We show that the use of distances can provide alternative tests for comparing several populations when the data are of general type. This approach is illustrated with three real data examples. We also define and study a measure of association between two data sets and extend the so-called distance-based discriminant rule.

Key Words: Statistical distances, multivariate association, discriminant analysis, MANOVA, ANOQE, permutation test, large data sets.

* Correspondence to: Carles M. Cuadras, Department of Statistics, University of Barcelona, Spain.

1 Introduction

Let Ω = {ω_1, ω_2, ..., ω_n} be a finite set with n individuals and let δ_{ii'} = δ(ω_i, ω_{i'}) = δ(ω_{i'}, ω_i) ≥ δ(ω_i, ω_i) = 0 be a distance or dissimilarity function defined on Ω. We suppose that the n × n distance matrix ∆ = (δ_{ii'}) is Euclidean. Then there exists a configuration x_1, ..., x_n ∈ R^p, with x_i = (x_{i1}, ..., x_{ip})', i = 1, ..., n, such that

    δ²_{ii'} = Σ_{j=1}^{p} (x_{ij} − x_{i'j})² = (x_i − x_{i'})'(x_i − x_{i'}).    (1.1)

These coordinates constitute an n × p matrix X = (x_{ij}) such that the Euclidean distance between rows i and i' equals δ_{ii'}. A way of obtaining X from ∆ is as follows. Write A = −(1/2)∆^(2) and G = HAH, where ∆^(2) = (δ²_{ii'}) and H = I_n − n^{-1} 1_n 1_n' is the centering matrix. Then ∆ is Euclidean with dimension p = rank(G) if and only if G ≥ 0 (G positive semi-definite). The spectral decomposition G = UΛ²U' gives X = UΛ. Thus G = XX' and the relation between ∆^(2) and G is given by

    ∆^(2) = 1_n g' + g 1_n' − 2G,

where the n × 1 vector g contains the diagonal entries of G. Note that if S = (s_{ii'}) is a similarity matrix and we define the squared distance δ²_{ii'} = s_{ii} + s_{i'i'} − 2s_{ii'}, then G = HSH. The matrices X and U contain the principal and standard coordinates, respectively. This method is called classic multidimensional scaling or principal coordinate analysis (Cox and Cox, 1994; Gower, 1966; Mardia et al., 1979).

Classic multivariate inference is mainly based on the hypothesis of normality. But often this hypothesis fails, or we have non-quantitative variables. On the other hand, we may have to deal with a large number of variables. The main aim of this paper is to present three distance-based methods, valid for general data (quantitative, qualitative, binary, nominal, mixed), for comparing several populations. This distance-based approach extends some results by Cuadras and Fortiana (2004) and is in the line of Cuadras (1989, 1992), Cuadras and Arenas (1990), Cuadras et al. (1996, 1997), Rao (1982) and Liu and Rao (1995). First, let us comment on some distance-based aspects of multivariate association and discrimination.
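The principal coordinate construction described above can be sketched numerically as follows. This is a minimal illustration, assuming numpy; the function name principal_coordinates is ours and does not come from the paper.

```python
import numpy as np

def principal_coordinates(D, tol=1e-9):
    """Classic MDS / principal coordinate analysis from a distance matrix D.

    D is an n x n matrix of distances delta_{ii'}, assumed Euclidean.
    Returns the principal coordinates X = U Lambda and the eigenvalues of G.
    """
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix H = I_n - n^{-1} 1 1'
    G = -0.5 * H @ (D ** 2) @ H              # inner-product matrix G = -1/2 H Delta^(2) H
    evals, U = np.linalg.eigh(G)             # spectral decomposition G = U Lambda^2 U'
    order = np.argsort(evals)[::-1]
    evals, U = evals[order], U[:, order]
    keep = evals > tol                       # Delta is Euclidean iff no negative eigenvalues
    X = U[:, keep] * np.sqrt(evals[keep])    # principal coordinates X = U Lambda
    return X, evals
```

The rows of X reproduce the original distances up to numerical error, and X (or the standard coordinates U) is all that the methods below require.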
2 Multivariate association

Suppose that we have two data sets D1 and D2 on the same Ω. The task of associating D1 and D2 has been well studied when two quantitative data matrices are available. Thus, if X and Y are two centered data matrices of orders n × p and n × q, Escoufier (1973) introduced the generalized correlation

    RV(X, Y) = tr(S_{12}S_{21}) / [tr(S_{11}²) tr(S_{22}²)]^{1/2},

where S_{11} = X'X, S_{22} = Y'Y, S_{12} = X'Y, S_{21} = Y'X. This correlation is closely related to the Procrustes statistic (Cox and Cox, 1994)

    R² = 1 − {tr[(X'YY'X)^{1/2}]}² / {tr(X'X) tr(Y'Y)}.

Yanai et al. (2006) employ determinants of rectangular matrices to introduce a measure Re(X, Y) of association, with

    Re(X, Y)² = det( [X'X  X'Y; Y'X  Y'Y] ) / [det(X'X) det(Y'Y)].

We have RV(X, Y) = 1 and R² = Re(X, Y)² = 0 if X = TY (T orthogonal), and RV(X, Y) = 0 and R² = Re(X, Y)² = 1 if X'Y = 0.

We show that we can use principal coordinates to define association with general data when a distance function is available. In this case we may obtain X, Y by considering a distance between observations, which provides the n × n distance matrices ∆_x and ∆_y. By spectral decomposition of the corresponding inner product matrices G_x = UΛ_x²U', G_y = VΛ_y²V', we obtain the standard coordinates U, V and the principal coordinates X = UΛ_x, Y = VΛ_y. We define the association between D1 and D2 by

    η²(D1, D2) = det(U'VV'U).    (2.1)

For the sake of simplicity, we write η²(X, Y). This association coefficient satisfies the following properties:

1) 0 ≤ η²(X, Y) = η²(Y, X) ≤ 1.
2) η²(X, Y) = det(X'YY'X)/[det(X'X) det(Y'Y)].
3) η²(X, Y) does not depend on the particular configuration matrices X, Y.
4) If y is a vector and X is a matrix, both quantitative, then R²(y, X) = η²(y, X), where R is the multiple correlation coefficient.
5) If x, y are quantitative vectors, then r²(y, x) = η²(y, x), where r is the ordinary correlation coefficient.
6) If r_j, j = 1, ..., q, are the canonical correlation coefficients between X and Y, then η²(X, Y) = ∏_{j=1}^{q} r_j².

We outline the proof. Write W = U'V, where the columns of U and V are orthonormal. Then 1) follows from 0 ≤ det(WW') = det(W'W) ≤ 1. As Λ_x, Λ_y are diagonal, 2) reduces to det(U'VV'U). Similarly, if S and T are nonsingular matrices, then η²(X, Y) = η²(XS, YT); in particular S and T can be orthogonal matrices, and then XS, YT define the same distance matrices ∆_x, ∆_y, which proves 3). To prove 4), note that Λ_x/√n contains the standard deviations of the columns of X = UΛ_x. If y = √n s_y v and r = (√n s_y)^{-1} Λ_x^{-1} X'y is the vector of simple correlations between y and X, then, as R_{xx} = I,

    R²(y, X) = r'R_{xx}^{-1} r = (n s_y²)^{-1} y'X Λ_x^{-2} X'y = v'UU'v.

Finally, the canonical correlations satisfy det(R_{xy} R_{yy}^{-1} R_{yx} − r_j² R_{xx}) = 0, with R_{xx} = R_{yy} = I and R_{xy} = R_{yx}' = U'V, and 6) follows.

The association measure (2.1) is similar to the measure used in Arenas and Cuadras (2004) for studying the agreement between two representations of the same data. That measure can be computed using only distances and is given by

    θ(X, Y) = 2[1 − tr(G_{xy})/tr(G_x + G_y)],

where G_{xy} = G_x + G_y − (G_x^{1/2} G_y^{1/2} + G_y^{1/2} G_x^{1/2})/2. Normalizing G_x, G_y so that tr(G_x) = tr(G_y) = tr(Λ_x²) = tr(Λ_y²) = 1, this measure reduces to θ(X, Y) = tr(UΛ_xU'VΛ_yV').
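The coefficient (2.1) can be computed directly from the two distance matrices. The following minimal sketch (the function names are ours, assuming numpy) obtains the standard coordinates of each data set and evaluates η²; in Example 2.1 below all principal dimensions are retained.

```python
import numpy as np

def eta_squared(Dx, Dy, tol=1e-9):
    """Association eta^2(D1, D2) = det(U'VV'U) between two distance matrices.

    U and V are the standard coordinates: orthonormal eigenvectors of the
    inner-product matrices Gx, Gy associated with positive eigenvalues.
    """
    def standard_coords(D):
        n = D.shape[0]
        H = np.eye(n) - np.ones((n, n)) / n
        G = -0.5 * H @ (D ** 2) @ H
        evals, U = np.linalg.eigh(G)
        return U[:, evals > tol]             # keep dimensions with positive eigenvalues

    U, V = standard_coords(Dx), standard_coords(Dy)
    W = U.T @ V                              # W = U'V
    return float(np.linalg.det(W @ W.T))     # det(U'VV'U) = det(WW')
```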
Example 2.1. Is there a relation between trade and science? Table 1 reports trade and scientific relations between ten countries. The lower triangle contains 1 if significant trade occurred between two countries and 0 otherwise. The diagonal and upper triangle contain, for every pair of countries, the number of papers (mathematics and statistics, period 1996-2002) published in collaboration. Thus, Spain published 8597 papers without collaboration, 692 in collaboration with the USA, 473 with France, etc. The upper matrix Q is standardized to S = D^{-1}Q, where D = diag(Q), so that S has ones on the diagonal. The lower matrix is transformed into a matrix of similarities using the Jaccard coefficient, as explained in Cox and Cox (1994, p. 73). Then, for each similarity matrix, we obtain the inner product matrix G = HSH and its spectral decomposition and, considering all principal dimensions (i.e., nine), we get η²(D1, D2) = 0.0816. This coefficient reveals a weak association between trade and science.

Table 1: Trade (1 if significant, 0 otherwise) and scientific relation (number of papers in mathematics and statistics in collaboration during 1996-2002) among ten countries.

            USA    Spa    Fra   U.K.   Ital    Ger    Can    Jap    Chi    Rus
USA       63446    692   2281   2507   1642   2812   2733   1039   1773    893
Spain         1   8597    473    347    352    278    163     69    104    177
France        1      1  17155    532    916    884    496    269    167    606
U.K.          1      1      1  12585    490    810    480    213    339    365
Italy         0      1      1      0  13197    677    290    169    120    512
Germany       1      1      1      1      1  16588    499    350    408    984
Canada        1      0      0      1      0      0   7927    228    601    204
Japan         1      0      0      0      0      0      1  20001    371    193
China         1      0      0      0      0      0      1      1  39140     64
Russia        0      0      0      0      0      0      0      0      1  18213

3 The proximity function

Let X be a random vector with pdf f(x) with respect to a suitable measure and with support S. Since the results below can be generalized easily, we may suppose the Lebesgue measure. If δ is a distance or dissimilarity function between the observations of X, we define the geometric variability of X with respect to δ as

    V_δ(X) = (1/2) ∫∫_{S×S} δ²(x, y) f(x) f(y) dx dy.    (3.1)

The proximity function of an observation x to the population Π represented by X is defined by

    φ²_δ(x, Π) = ∫_S δ²(x, y) f(y) dy − V_δ(X).    (3.2)

Suppose that ψ: S → L is a representation of S in a Euclidean (or separable Hilbert) space L such that δ²(x, y) = ||ψ(x) − ψ(y)||². The interest of V_δ(X) and φ²_δ(x, Π) comes from the following properties.

1. We can interpret V_δ(X) as a generalized variance and φ²_δ(x, Π) as the squared distance from x to an ideal mean of X:

    V_δ(X) = E||ψ(X)||² − ||E(ψ(X))||²,    φ²_δ(x, Π) = ||ψ(x) − E(ψ(X))||².    (3.3)

In fact, if δ is the ordinary Euclidean distance, then V_δ(X) = tr(Σ).

2. If we transform the distance as δ̃² = aδ² + b, then V_δ̃ = aV_δ + b/2 and φ²_δ̃ = aφ²_δ + b/2.

3. If δ² = δ_1² + δ_2², then φ²_δ = φ²_{δ_1} + φ²_{δ_2}.

4. By suitable choices of a, b we may transform δ and generate the probability density f_δ(x) = exp(−φ²_δ(x, Π)). Then I(f||f_δ) = V_δ(X) − H(f) ≥ 0, where I(f||f_δ) is the Kullback-Leibler divergence and H(f) is the Shannon entropy.

5. Given G populations Π_1, ..., Π_G, where X has pdf f_g(x) when x comes from Π_g, we can allocate an individual ω ∈ Π_1 ∪ ... ∪ Π_G by using the rule:

    allocate ω to Π_i if φ²_δ(x, Π_i) = min_{1≤g≤G} φ²_δ(x, Π_g).

This is the distance-based (DB) discriminant rule (Cuadras et al., 1997). In the next section we extend this rule.
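The geometric variability and the proximity function have sampling counterparts that use only pairwise distances (they are made explicit in Section 5.3), so the DB rule of property 5 can be applied without estimating any density. The following sketch is a minimal illustration under that reading; the helper names are ours, assuming numpy.

```python
import numpy as np

def geometric_variability(D):
    """Sample geometric variability V_delta = (1/(2 n^2)) sum_{i,i'} delta_{ii'}^2."""
    n = D.shape[0]
    return (D ** 2).sum() / (2 * n ** 2)

def proximity(d_x_to_sample, V_within):
    """Sample proximity phi^2(x) = (1/n) sum_i delta^2(x, x_i) - V_delta(within)."""
    return (d_x_to_sample ** 2).mean() - V_within

def db_allocate(dists_to_groups, V_within_groups):
    """DB rule: allocate x to the group with the smallest proximity phi^2(x, Pi_g).

    dists_to_groups: one array per group, holding the distances from the new
    observation to that group's observations.
    """
    phis = [proximity(d, v) for d, v in zip(dists_to_groups, V_within_groups)]
    return int(np.argmin(phis))
```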
4 The distance-based Bayes rule

Suppose that Π_1, ..., Π_G have prior probabilities P(Π_j) = q_j, with Σ_j q_j = 1. The DB discriminant rule is equivalent to the Bayes rule if we use the dissimilarity

    δ²(x, y) = −log[f_j(x) f_j(y)] − 2 log q_j    if x, y come from Π_j.

Then φ²_δ(x, Π_j) = −log f_j(x) − log q_j, and the Bayes rule

    allocate ω to Π_i if q_i f_i(x) = max_{1≤g≤G} q_g f_g(x)

is equivalent to the DB rule

    allocate ω to Π_i if −[log f_i(x) + log q_i] = min_{1≤g≤G} {−[log f_g(x) + log q_g]}.

However, the DB rule is of interest when we can define a proper distance between observations without using the pdf. For example, suppose that Π_j is multivariate normal N_p(µ_j, Σ). The Mahalanobis distance M²(x, y) = (x − y)'Σ^{-1}(x − y) between observations provides V_M = p and φ²_M(x, Π_j) = (x − µ_j)'Σ^{-1}(x − µ_j). Adding a constant,

    M̃²(x, y) = (x − y)'Σ^{-1}(x − y) − 4 log q_j    if x, y come from Π_j,

gives φ²_M̃(x, Π_j) = (x − µ_j)'Σ^{-1}(x − µ_j) − 2 log q_j, and the DB rule

    allocate ω to Π_i if d(x, µ_i, q_i) = min_{1≤g≤G} d(x, µ_g, q_g),

where d(x, µ_j, q_j) = (x − µ_j)'Σ^{-1}(x − µ_j) − 2 log q_j, is equivalent to the Bayes rule.

5 Multivariate multiple-sample tests

The comparison of several populations can be approached under parametric models. A non-parametric general method, which extends that of Cuadras and Fortiana (2004), is proposed next. Suppose that D1, ..., DG are G ≥ 2 independent data sets coming from the populations Π_1, ..., Π_G. These data can be general (quantitative, qualitative, nominal, mixed). We wish to test

    H0: Π_1 = ... = Π_G.

Under H0 all data come from the same underlying distribution. First, we assume that, by means of a distance function between observations, we can obtain the intra-distance matrices ∆_{11}, ..., ∆_{GG} and the inter-distance matrices ∆_{12}, ..., ∆_{G−1,G}. Thus we have the n × n super-distance matrix

    ∆ = [∆_{11} ... ∆_{1G}; ... ; ∆_{G1} ... ∆_{GG}],

where ∆_{ij} is n_i × n_j. Next, we compute, via principal coordinate analysis, the matrices G and X such that G = XX'. We write the full X, partitioned by rows, as X = [X_1; ...; X_G]. The Euclidean distances between the rows of X_i and X_{i'} give ∆_{ii'}. Thus the matrices X_1, ..., X_G represent the G data sets as quantitative data, which can be compared for testing H0.

5.1 Partitioning the geometric variability

The rows x_1, ..., x_N of any N × p multivariate data matrix X satisfy

    Σ_{i=1}^{N} Σ_{i'=1}^{N} (x_i − x_{i'})(x_i − x_{i'})' = 2N Σ_{i=1}^{N} (x_i − x̄)(x_i − x̄)',    (5.1)

where x̄ = N^{-1} Σ_{i=1}^{N} x_i is the mean vector. Now suppose G data matrices X_1, ..., X_G, where each X_g has order n_g × p. Recall the identity T = B + W, with

    B = Σ_{g=1}^{G} n_g (x̄_g − x̄)(x̄_g − x̄)',
    W = Σ_{g=1}^{G} Σ_{i=1}^{n_g} (x_{gi} − x̄_g)(x_{gi} − x̄_g)',
    T = Σ_{g=1}^{G} Σ_{i=1}^{n_g} (x_{gi} − x̄)(x_{gi} − x̄)',

where n = n_1 + ... + n_G, x_{gi} is a row of X_g with mean x̄_g, and x̄ is the overall mean. The matrices T, B, W are the total, between-samples and within-samples matrices, respectively.

From (5.1) we obtain

    T̃ = Σ_{g,h=1}^{G} Σ_{i=1}^{n_g} Σ_{i'=1}^{n_h} (x_{gi} − x_{hi'})(x_{gi} − x_{hi'})' = 2n Σ_{g=1}^{G} Σ_{i=1}^{n_g} (x_{gi} − x̄)(x_{gi} − x̄)' = 2nT,
    B̃ = Σ_{g,h=1}^{G} n_g n_h (x̄_g − x̄_h)(x̄_g − x̄_h)' = 2n Σ_{g=1}^{G} n_g (x̄_g − x̄)(x̄_g − x̄)' = 2nB,
    W̃_g = Σ_{i,i'=1}^{n_g} (x_{gi} − x_{gi'})(x_{gi} − x_{gi'})' = 2n_g Σ_{i=1}^{n_g} (x_{gi} − x̄_g)(x_{gi} − x̄_g)' = 2n_g W_g,

which shows the following identity concerning the p × p matrices built with differences between observations and between means:

    T̃ = B̃ + n Σ_{g=1}^{G} n_g^{-1} W̃_g.    (5.2)

We can partition the geometric variability in a similar way. The geometric variability of an n × n distance matrix ∆ = (δ_{ii'}), with related inner product matrix G, is defined by

    V_δ = (1/(2n²)) Σ_{i=1}^{n} Σ_{i'=1}^{n} δ²_{ii'} = tr(G)/n,

where δ²_{ii'} = (x_i − x_{i'})'(x_i − x_{i'}); V_δ is the sampling version of (3.1). Taking traces in (5.2), dividing by 2n² and noting that V_δ(total) = tr(T̃)/(2n²), V_δ(between) = tr(B̃)/(2n²) and V_δ(within g) = tr(W̃_g)/(2n_g²), we obtain

    V_δ(total) = V_δ(between) + n^{-1} Σ_{g=1}^{G} n_g V_δ(within g).    (5.3)
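The identities (5.2) and (5.3) are easy to verify numerically. The following sketch (the variable and function names are ours) checks both on simulated coordinates, assuming numpy.

```python
import numpy as np

# Numerical check of (5.2) and (5.3) on simulated coordinates.
rng = np.random.default_rng(0)
sizes = [4, 6, 5]
groups = [rng.normal(size=(ng, 3)) for ng in sizes]
X = np.vstack(groups)
n = X.shape[0]

def diff_matrix(A, B):
    """Sum over all pairs of rows (a, b) of the outer products (a - b)(a - b)'."""
    d = A[:, None, :] - B[None, :, :]
    return np.einsum('ijk,ijl->kl', d, d)

T = diff_matrix(X, X)                                  # differences between all observations
means = np.array([g.mean(axis=0) for g in groups])
ng = np.array(sizes, dtype=float)
dm = means[:, None, :] - means[None, :, :]
B = np.einsum('g,h,ghk,ghl->kl', ng, ng, dm, dm)       # weighted differences between group means
W = [diff_matrix(g, g) for g in groups]                # within-group differences

rhs = B + n * sum(Wg / n_g for Wg, n_g in zip(W, ng))  # right-hand side of identity (5.2)
print(np.allclose(T, rhs))                             # True

# Trace version (5.3): V(total) = V(between) + n^{-1} sum_g n_g V(within g)
V_total = np.trace(T) / (2 * n ** 2)
V_between = np.trace(B) / (2 * n ** 2)
V_within = [np.trace(Wg) / (2 * n_g ** 2) for Wg, n_g in zip(W, ng)]
print(np.isclose(V_total, V_between + sum(n_g * Vg for n_g, Vg in zip(ng, V_within)) / n))
```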
5.2 Tests with principal coordinates

The identities (5.2) and (5.3) can be used for comparing populations. Given G independent data sets D1, ..., DG, we may obtain the super-distance matrix ∆ and the principal coordinates X_1, ..., X_G. Then we can obtain B and T and compute two statistics for testing H0:

a) γ_1 = det(T − B)/det(T),
b) γ_2 = V_δ(between)/V_δ(total).

Both statistics lie between 0 and 1. Small values of γ_1 and large values of γ_2, respectively, give evidence for the alternative hypothesis. Note that γ_2 = tr(B)/tr(T) is a statistic based on quadratic entropy; it is used in ANOQE (analysis of quadratic entropy), a generalization of ANOVA (analysis of variance) proposed by C. R. Rao in several papers. Also note that the distribution of γ_1 is Wilks' lambda if the populations are multivariate normal with the same covariance matrix and we choose the Euclidean distance.

Except for multinormal data and a few other distributions, the sampling distribution of γ_1 and γ_2 is unknown. The asymptotic distribution involves sequences of nuisance parameters, which have been found only for very specific distances and distributions (see Cuadras and Fortiana, 1995; Cuadras and Lahlou, 2000; Cuadras et al., 2006). Liu and Rao (1995) derive the bootstrap distribution of V_δ(between). Indeed, the use of resampling methods, as described in Flury (1997), may overcome this difficulty.

5.3 Tests with proximity functions

Another test, which avoids resampling procedures, can be derived by using proximity functions and non-parametric statistics. First note that, with quantitative data and the Mahalanobis distance, the proximity functions are

    φ²_δ(x, Π_g) = (x − µ_g)'Σ^{-1}(x − µ_g),    g = 1, ..., G.

These functions are equal under H0: µ_1 = ... = µ_G. Suppose, in general, that x_{11}, ..., x_{1n_1} represent the n_1 observations coming from Π_1 and that ω is a new individual with coordinates x. The sampling counterpart of the proximity function (3.2) is

    φ²_1(x) = (1/n_1) Σ_{i=1}^{n_1} δ²(x, x_{1i}) − V_δ(within 1),

where δ(ω, ω_{1i}) = δ(x, x_{1i}). Note that we do not need to find the x vectors. We similarly obtain φ²_2(x), etc. However, under H0 all the population proximity functions are the same, see (3.3). Thus, we may work with the full proximity function

    φ²(x) = (1/n) Σ_{g=1}^{G} Σ_{i=1}^{n_g} δ²(x, x_{gi}) − V_δ(total).

If a_{gi} = φ²(x_{gi}) = ||x_{gi} − x̄||², where x_{gi} comes from Π_g, we obtain the proximity values (computed using only distances)

    Π_1: a_{11}, ..., a_{1n_1};  ... ;  Π_G: a_{G1}, ..., a_{Gn_G}.

Under H0 the a_{gi} follow (approximately) the same distribution. Then a Kruskal-Wallis test can be performed to accept or reject H0. This test is based on

    H = [12/(n(n+1))] Σ_{g=1}^{G} R_g²/n_g − 3(n + 1),

where the n values a_{gi} are ranked and R_g is the sum of the ranks of the values in Π_g. This statistic H is asymptotically chi-square with G − 1 degrees of freedom.
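Both γ_2 and H can be computed from the distance matrix and the group labels alone; γ_1 additionally requires the principal coordinates of Section 1. The following minimal sketch (function names are ours, assuming numpy) computes γ_2 with a permutation P-value and the statistic H on the proximity values; ties in the ranks are ignored for simplicity.

```python
import numpy as np

def multisample_test(D, labels, n_perm=1000, seed=0):
    """Distance-based comparison of G samples from an n x n distance matrix D.

    Returns gamma2 = V_delta(between)/V_delta(total) with a permutation
    P-value, and the Kruskal-Wallis statistic H on the proximity values
    a_gi = phi^2(x_gi); both need only distances, not coordinates.
    """
    labels = np.asarray(labels)
    n = D.shape[0]
    D2 = D ** 2

    def gamma2(lab):
        V_total = D2.sum() / (2 * n ** 2)
        V_within = 0.0
        for g in np.unique(lab):
            idx = np.where(lab == g)[0]
            V_g = D2[np.ix_(idx, idx)].sum() / (2 * idx.size ** 2)
            V_within += idx.size * V_g / n
        return (V_total - V_within) / V_total       # V_between / V_total, by (5.3)

    # Kruskal-Wallis H on a_gi = (1/n) sum_j delta^2(x_gi, x_j) - V_delta(total)
    a = D2.mean(axis=1) - D2.sum() / (2 * n ** 2)
    ranks = np.empty(n)
    ranks[np.argsort(a)] = np.arange(1, n + 1)
    H = 12 / (n * (n + 1)) * sum(
        ranks[labels == g].sum() ** 2 / np.sum(labels == g)
        for g in np.unique(labels)) - 3 * (n + 1)

    g2_obs = gamma2(labels)
    rng = np.random.default_rng(seed)
    perms = [gamma2(rng.permutation(labels)) for _ in range(n_perm)]
    p_value = float(np.mean([g >= g2_obs for g in perms]))  # large gamma2 favours H1
    return g2_obs, p_value, H
```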
Example 5.1. We consider three data sets covering the quantitative, mixed and nominal cases. The first data set is the well-known Fisher Iris data, with p = 4 quantitative variables, G = 3 species and n_g = 50; we use the city-block distance. The student mixed data are taken from Mardia et al. (1979, p. 294). We only consider G = 3 groups, with n_1 = 25, n_2 = 117, n_3 = 26. There is a quantitative variable and a qualitative variable, and we use the distance δ_{ij} = 1 − s_{ij}, where s_{ij} is Gower's similarity coefficient for mixed variables (Gower and Legendre, 1986). The DNA data, used in Cuadras et al. (1997), consist of sequences of length 360 base pairs taken from a segment of mitochondrial DNA for a set of 120 individuals belonging to G = 4 human groups, with n_1 = 25, n_2 = 41, n_3 = 37, n_4 = 17 individuals. Since these data are large strings of A, C, G, T, the standard methods fail miserably, whereas the DB approach provides a solution.

Table 2: Features and sizes of the data sets used to illustrate three distance-based multi-sample tests.

Data      Type          Groups  Sizes                     Distance
Iris      Quantitative  3       50 + 50 + 50 = 150        City-block
Students  Mixed         3       25 + 117 + 26 = 168       Gower
DNA       Nominal       4       25 + 41 + 37 + 17 = 120   Matching

Table 3: Some results of three distance-based multi-sample tests on real data.

Data      H       d.f.  P-value   γ_1      P-value   γ_2      P-value
Iris      42.85   2     0.000     0.0056   0.000     0.8051   0.000
Students  4.498   2     0.105     0.5788   0.001     0.0424   0.003
DNA       27.49   3     0.000     0.0071   0.000     0.7800   0.000

We obtain the randomization distribution of γ_1 and γ_2 for N = 10000 partitions into subsets of sizes n_1, ..., n_G and estimate the P-values. In contrast, note that the statistic H can be obtained without resampling. Table 2 describes the data and distances used. Note that the Euclidean distance for the Iris data gives γ_1 = 0.0234, which under normality and H0 is distributed as Wilks' Λ(4, 147, 2). Table 3 reports the results obtained. There are significant differences among the Iris species and among the DNA groups. The statistic H suggests a non-significant difference among the student groups.

Finally, we test the performance of this method by comparing three artificial populations. Suppose the bivariate normal populations N_2(c1, Σ), N_2(2c1, Σ), N_2(3c1, Σ). We simulate samples of sizes n_1 = n_2 = n_3 for c = 0 and c = 1, respectively. We then choose the Euclidean distance and compute the Wilks statistic, its exact P-value and the empirical P-value after N = 10000 permutations. The results, summarized in Table 4, show that the conclusions (accept H0 for c = 0, reject H0 for c = 1) at the significance level α = 0.05 are the same.

Table 4: Distance-based comparison of three (simulated) bivariate normal populations using the Euclidean distance.

c = 0
Size  Wilks    Exact P-value  Empirical P-value
3     0.3364   0.203          0.221
5     0.7609   0.538          0.527
10    0.8735   0.465          0.470

c = 1
Size  Wilks    Exact P-value  Empirical P-value
3     0.2243   0.086          0.078
5     0.1835   0.001          0.001
10    0.4608   0.000          0.000
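As a usage sketch of the same permutation idea, the hypothetical multisample_test helper from Section 5 can be applied to simulated data of this kind. It uses the distance-only statistics γ_2 and H rather than the Wilks statistic reported in Table 4, so the numbers are not directly comparable, but the accept/reject behaviour for c = 0 versus c = 1 can be explored in the same way.

```python
import numpy as np

# Hypothetical usage: N2(c*1, I), N2(2c*1, I), N2(3c*1, I); c = 0 corresponds to H0.
rng = np.random.default_rng(1)
c, n_g = 1.0, 10
X = np.vstack([rng.normal(loc=k * c, scale=1.0, size=(n_g, 2)) for k in (1, 2, 3)])
labels = np.repeat([1, 2, 3], n_g)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # Euclidean distance matrix

gamma2, p_perm, H = multisample_test(D, labels, n_perm=10000)
print(gamma2, p_perm, H)
```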
References

Cox, T.F. and Cox, M.A.A. (1994) Multidimensional Scaling. Chapman and Hall, London.

Cuadras, C.M. (1989) Distance analysis in discrimination and classification using both continuous and categorical variables. In: Y. Dodge (Ed.), Statistical Data Analysis and Inference, pp. 459-473. Elsevier Science Publishers B.V. (North-Holland), Amsterdam.

Cuadras, C.M. (1992) Some examples of distance based discrimination. Biometrical Letters, 29, 3-20.

Cuadras, C.M. and Arenas, C. (1990) A distance based regression model for prediction with mixed data. Communications in Statistics, Theory and Methods, 19, 2261-2279.

Cuadras, C.M. and Fortiana, J. (1995) A continuous metric scaling solution for a random variable. Journal of Multivariate Analysis, 52, 1-14.

Cuadras, C.M., Arenas, C. and Fortiana, J. (1996) Some computational aspects of a distance-based model for prediction. Communications in Statistics, Simulation and Computation, 25, 593-609.

Cuadras, C.M., Atkinson, R.A. and Fortiana, J. (1997) Probability densities from distances and discriminant analysis. Statistics and Probability Letters, 33, 405-411.

Cuadras, C.M. and Fortiana, J. (2004) Distance-based multivariate two sample tests. In: M.S. Nikulin, N. Balakrishnan, M. Mesbah and N. Limnios (Eds.), Parametric and Semiparametric Models with Applications to Reliability, Survival Analysis and Quality of Life, pp. 273-290. Birkhäuser, Boston.

Cuadras, C.M., Fortiana, J. and Oliva, F. (1997) The proximity of an individual to a population with applications in discriminant analysis. Journal of Classification, 14, 117-136.

Cuadras, C.M., Cuadras, D. and Lahlou, Y. (2006) Principal directions for the general Pareto distribution. Journal of Statistical Planning and Inference, 136, 2572-2583.

Cuadras, C.M. and Lahlou, Y. (2000) Some orthogonal expansions for the logistic distribution. Communications in Statistics, Theory and Methods, 29, 2643-2663.

Escoufier, Y. (1973) Le traitement des variables vectorielles. Biometrics, 29, 751-760.

Flury, B. (1997) A First Course in Multivariate Statistics. Springer-Verlag, New York.

Gower, J.C. (1966) Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53, 315-328.

Gower, J.C. and Legendre, P. (1986) Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3, 5-48.

Liu, Z.J. and Rao, C.R. (1995) Asymptotic distribution of statistics based on quadratic entropy and bootstrapping. Journal of Statistical Planning and Inference, 43, 1-18.

Mardia, K.V., Kent, J.T. and Bibby, J.M. (1979) Multivariate Analysis. Academic Press, London.

Rao, C.R. (1982) Diversity: its measurement, decomposition, apportionment and analysis. Sankhya A, 44, 1-21.

Yanai, H., Takane, Y. and Ishii, Y. (2006) Nonnegative determinant of a rectangular matrix: its definition and applications to multivariate analysis. Linear Algebra and Its Applications, in press.