Symbolic Data Analysis - Institute of Statistical Science, Academia

Advances and directions of research in Symbolic Data Analysis
E. Diday
CEREMADE. Paris–Dauphine University
June 14, 2014 SDA Workshop – Tutorial
Academia Sinica
OUTLINE
• PART 1
BUILDING SYMBOLIC DATA
• PART 2
OPEN DIRECTIONS OF RESEARCH
• PART 3
AN ILLUSTRATIVE EXAMPLE : TRACHOMA STUDY
PART 1
Building Symbolic data:
. Some principles
. Ten kinds of Symbolic Variables
Some principles
Symbolic Data are not given or found like standard or complex data.
They are built from classes of individuals in the case of standard data, or
from classes of several kinds of individuals in the case of complex data.
Symbolic data are not only distributions.
Ten examples of Symbolic variables
PART 2: OPEN DIRECTIONS OF RESEARCH
• Building Symbolic Data.
• Extending methods to Symbolic Data
• Four convergence theorems to be proved for any method
extended to Symbolic Data
• Models of models
• Laws of parameters of laws and laws of vectors of laws.
• The need for copulas.
• Optimisation in unsupervised learning (hierarchical and
pyramidal clustering).
BUILDING SYMBOLIC DATA
The discretization of the initial classical variables has to be
done so as to optimize at least three kinds of criteria:
1) The quality of the obtained distribution.
It can be measured by model selection criteria such as BIC, MDL,
AIC or MML, or by another criterion of this kind based on
likelihood estimation.
Flat distributions are not interesting, so an
“information” criterion such as Sum of pi Log(pi) can be used.
2) The level of discrimination between the obtained
symbolic descriptions. It can be measured by the sum of
their pairwise dissimilarities.
3) The correlation between the bins associated with the
different symbolic variables (metabins).
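As a minimal sketch of criteria 1) and 2), the functions below compute the “information” Sum of pi Log(pi) of a histogram and the sum of pairwise dissimilarities between descriptions. The function names and the choice of an L1 distance between histograms are assumptions made for illustration; the slides do not fix a particular dissimilarity.

```python
import math
from itertools import combinations

def information(p):
    """Sum of p_i * log(p_i) over non-empty bins (the negative of Shannon entropy)."""
    return sum(pi * math.log(pi) for pi in p if pi > 0)

def l1(p, q):
    """L1 distance between two histograms on the same bins (an assumed choice)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def discrimination(descriptions, dissimilarity=l1):
    """Sum of the dissimilarities between all pairs of symbolic descriptions."""
    return sum(dissimilarity(p, q) for p, q in combinations(descriptions, 2))

flat = [0.25, 0.25, 0.25, 0.25]      # uninformative histogram
peaked = [0.85, 0.05, 0.05, 0.05]    # concentrated histogram

# The flat histogram gets the lowest possible score, the peaked one a higher score.
print(information(flat), information(peaked))
print(discrimination([flat, peaked]))
```

A discretization could then be scored by combining these quantities over candidate bin boundaries; how to weight them is left open here.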
EXTENDING METHODS TO SYMBOLIC DATA:
MUCH REMAINS TO BE DONE
- Graphical visualisation of Symbolic Data
- Correlation, mean, mean square and distribution of a symbolic variable
- Dissimilarities between symbolic descriptions, K-nearest neighbours
- Clustering, spatial hierarchies and pyramids of symbolic descriptions, S-Kohonen mappings
- S-Decision Trees
- S-Principal Component Analysis, Discriminant Factorial Analysis
- S-Canonical Analysis, Regression
- S-Bayesian trees, Multilevel Analysis, Analysis of Variance, Support Vector
Machines, Mixture decomposition, Learning machine by
FOUR THEOREMS TO BE PROVED FOR ANY METHOD EXTENDED
TO SYMBOLIC DATA
M(n, k) is assumed to be an SDA method, where k is the number of classes
obtained from n initial individuals.
THEOREM 1: If the k classes are fixed and n tends towards infinity, then
M(n, k) converges towards a stable position.
THEOREM 2: If k increases until each class contains a single individual, then
M(n, k) converges towards a standard method.
THEOREM 3: If k and n increase simultaneously towards infinity, then M(n,
k) converges towards a stable position.
THEOREM 4: If the k laws associated with the k classes are considered as a
sample from a law of laws, then M(n, k) applied to this sample converges to
M(n, k) applied to this law.
Examples:
Theorem 1: it was proved in Diday, Emilion (CRAS, Choquet 1998) for Galois lattices: as the population size increases, the
classes (described by vectors of distributions) organize themselves into a Galois lattice that converges. Emilion (CRAS, 2002) also gives a theorem
for mixtures of laws of laws, using martingales and a Dirichlet model.
Theorem 2: for example, the classical PCA MO is a special case of the PCA denoted M(n, k) built on vectors of intervals.
Theorem 3: this is the setting of data arriving sequentially (of “Data Stream” type) and of one-pass algorithms (see e.g. Diday,
Murty (2005)).
Theorem 4: for a hierarchical or pyramidal clustering (2D, 3D, etc.), convergence means that the main levels and their structure
stabilize. For a PCA, convergence means that the factorial axes stabilize.
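The data-side intuition behind Theorem 2 can be sketched as follows: when each class holds a single individual, an interval description collapses to a point, so an interval method is fed classical data. This is only an illustration with hypothetical helper names and toy values, not a proof of the theorem.

```python
def interval_description(values):
    """Describe a class by the interval [min, max] of its values."""
    return (min(values), max(values))

def build_intervals(data, labels):
    """Group the values by class label and describe each class by an interval."""
    classes = {}
    for x, c in zip(data, labels):
        classes.setdefault(c, []).append(x)
    return {c: interval_description(v) for c, v in classes.items()}

ages = [21, 35, 27, 30]  # toy values

# k = 2 classes: genuine intervals.
two_classes = build_intervals(ages, ["A", "A", "B", "B"])

# k = n classes (one individual per class): every interval is degenerate.
singletons = build_intervals(ages, range(len(ages)))
print(two_classes, singletons)
```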
MODELS OF MODELS ARE NEEDED
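[Figure: Table 1 lists the individuals ind1, ..., indn described by the standard variables X1, ..., Xj; a cell Xij holds a number (e.g. the age of Messi). Table 2 lists the classes C1, ..., Ci, ..., Ck (e.g. teams) described by the symbolic variables X’1, ..., X’j; a cell holds a symbolic data (e.g. the age of Messi's team).]
Xj is a standard random numerical variable.
X’j is a random variable with histogram values.
Question: if the law of Xj is given, what is
the law of X’j? (Dirichlet models are useful.)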
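The passage from Table 1 (individuals) to Table 2 (classes) can be sketched as an aggregation of a numerical variable into one histogram per class. The table contents, bin edges and function names below are invented for illustration.

```python
from collections import defaultdict

def histogram(values, edges):
    """Relative frequencies over bins [edges[b], edges[b+1]); the last bin also includes its right edge.
    Values outside the edges are ignored."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for b in range(len(counts)):
            is_last = b == len(counts) - 1
            if edges[b] <= v < edges[b + 1] or (is_last and v == edges[-1]):
                counts[b] += 1
                break
    total = sum(counts)
    return [c / total for c in counts] if total else counts

def aggregate(rows, edges):
    """Turn (class, value) rows into one histogram-valued cell per class."""
    by_class = defaultdict(list)
    for cls, value in rows:
        by_class[cls].append(value)
    return {cls: histogram(vals, edges) for cls, vals in by_class.items()}

# Table 1: one row per player (team, age). Toy values.
table1 = [("T1", 22), ("T1", 27), ("T1", 31), ("T1", 24), ("T2", 29), ("T2", 33)]

# Table 2: one histogram of ages per team.
table2 = aggregate(table1, edges=[20, 25, 30, 35])
print(table2)
```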
Law of parameters of laws
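[Figure: a table whose rows are the classes C1, ..., Ci, ..., Ck and whose columns are the symbolic variables Y1, ..., Yj, ..., Yp. The cell (i, j) holds Parij, the estimated parameters of the law Xij of the class Ci. Below each column Yj is the law of its parameters, Law(Parj), up to Law(Parp).]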
Find the law of the parameters for
each symbolic variable Yj and the
law of the associated vector of
laws of parameters.
Example: Parij = (μij, σij)
Example: If f is the density of the parameters of the
uniform law of intervals and g the law of intervals, then:
g(y) = 6^p f(x) / \prod_{j=1}^{p} (x_j^{\max} - x_j^{\min})
(Diday, SFC 2011, Orléans).
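A minimal sketch of the “law of parameters of laws” idea, assuming the parameters of each class are a mean and a standard deviation: the parameters are first estimated inside each class, and their own distribution across classes is then summarised. The data and function names are hypothetical.

```python
from statistics import mean, pstdev

def class_parameters(values_by_class):
    """Par_i = (mean, standard deviation) estimated inside each class."""
    return {c: (mean(v), pstdev(v)) for c, v in values_by_class.items()}

def law_of_parameters(params):
    """Summarise the empirical law of each parameter across classes by its own mean and spread."""
    mus = [m for m, _ in params.values()]
    sigmas = [s for _, s in params.values()]
    return {"mu": (mean(mus), pstdev(mus)), "sigma": (mean(sigmas), pstdev(sigmas))}

# Toy values of one variable Y inside three classes.
data = {"C1": [1.0, 3.0], "C2": [4.0, 8.0], "C3": [10.0, 10.0]}

params = class_parameters(data)   # one (mu, sigma) pair per class
law = law_of_parameters(params)   # how those pairs vary over the classes
print(params, law)
```

A parametric model (for instance a Dirichlet or Gaussian model on the parameters) could replace the simple mean and spread summary used here.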
The need for copulas in Symbolic Data
Analysis
In each cell of the symbolic data table, we suppose that we have a
density function f(i, j).
f(i, j, j’) is the joint probability of the variables j and j’ for the
individual i.
In case of independence, we have
f(i, j, j’) = f(i, j) . f(i, j’).
If there is no independence:
f(i, j, j’) = Copula(f(i, j), f(i, j’)).
Aim of the copula model in SDA:
find the copula which minimises the difference with the joint,
in order to avoid restricting to independence hypotheses
and to reduce the cost of computing f(i, j, j’).
In this way we can obtain copula-based PCA, regression,
canonical analysis, ...
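The aim stated above (finding the copula that minimises the difference with the joint) can be sketched on cumulative distribution functions, which is where copulas formally link marginals to the joint. The Clayton family, the least-squares criterion and the grid of candidate parameters are assumptions chosen for brevity; the slides do not name a family.

```python
import random

def clayton(u, v, theta):
    """Clayton copula C(u, v), used here for theta > 0 and u, v in (0, 1]."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

def ranks_to_unit(xs):
    """Empirical marginal CDF evaluated at each observation (values in (0, 1])."""
    n = len(xs)
    return [sum(1 for z in xs if z <= x) / n for x in xs]

def empirical_joint(xs, ys):
    """Empirical joint CDF evaluated at each observed pair."""
    n = len(xs)
    return [sum(1 for a, b in zip(xs, ys) if a <= x and b <= y) / n
            for x, y in zip(xs, ys)]

def fit_clayton(xs, ys, thetas):
    """Return the candidate theta whose copula is closest (least squares) to the empirical joint."""
    u, v, joint = ranks_to_unit(xs), ranks_to_unit(ys), empirical_joint(xs, ys)

    def error(theta):
        return sum((clayton(a, b, theta) - h) ** 2 for a, b, h in zip(u, v, joint))

    return min(thetas, key=error)

rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
dependent = [x + 0.1 * rng.random() for x in xs]   # strongly tied to xs
independent = [rng.random() for _ in range(200)]   # unrelated to xs
grid = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]

theta_dep = fit_clayton(xs, dependent, grid)
theta_ind = fit_clayton(xs, independent, grid)
print(theta_dep, theta_ind)
```

A larger fitted theta indicates stronger positive dependence; as theta approaches 0 the Clayton copula approaches the independence case u . v.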
Bi-plot of histogram variables
• The joint probability can be inferred by a
copula model
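[Figure: bi-plot of two histogram variables Y1 and Y2 for the classes C1, ..., Ci, ..., Ck; their joint distribution is modelled by a copula.]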
Optimisation in clustering
d is the given dissimilarity. Each class is described by symbolic data.
Hierarchies: ultrametric dissimilarity = U, W = |d - U|
Pyramids: Robinsonian dissimilarity = R, W = |d - R|
3D Spatial Pyramid: Yadidean dissimilarity = Y, W = |d - Y|
[Figure: a hierarchy and a pyramid on the individuals x1, ..., x5, and a 3D spatial pyramid.]
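For the hierarchical case, the criterion W = |d - U| can be sketched by building one particular ultrametric from d and measuring the gap. The sketch uses the subdominant ultrametric (the one induced by single-linkage clustering), which is a feasible U but not in general the minimiser of W; reading |.| as a sum of absolute differences over pairs is also an assumption.

```python
def subdominant_ultrametric(d):
    """Largest ultrametric below d: U[i][j] is the smallest achievable
    'largest step' over all paths from i to j (minimax path)."""
    n = len(d)
    u = [row[:] for row in d]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                u[i][j] = min(u[i][j], max(u[i][k], u[k][j]))
    return u

def fit_cost(d, u):
    """W as the sum over pairs i < j of |d(i, j) - U(i, j)| (assumed norm)."""
    n = len(d)
    return sum(abs(d[i][j] - u[i][j]) for i in range(n) for j in range(i + 1, n))

# Toy dissimilarity between three individuals.
d = [[0, 1, 5],
     [1, 0, 2],
     [5, 2, 0]]

u = subdominant_ultrametric(d)
w = fit_cost(d, u)
print(u, w)
```

An optimisation method would search over ultrametrics (or Robinsonian and Yadidean dissimilarities for pyramids and spatial pyramids) to make W as small as possible.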
PART 3
ILLUSTRATIVE EXAMPLE ON TRACHOMA
Trachoma, caused by repeated ocular infections with
Chlamydia trachomatis, whose vector is a fly, is an
important cause of blindness in the world. This
study was conducted in Mali.
The first aim was to choose, among three antibiotic
strategies, the one with the best cost-effectiveness
ratio.
The second aim was to find the demographic and
environmental parameters on which we could try to
intervene.
Symbolic Table of Degradation
The classes 0x0, 0x1, 1x0 and 1x1 of degradation (0 = healthy, 1 = ill, at the
beginning x end of the one-year study). These classes come directly from
the given data and not from a clustering process.
INTERPRETATION: The THIRD STRATEGY is the most frequent in the worst class (0x1).
Nevertheless we cannot conclude that it is the worst strategy, as the degradation can
come from the environment of this class 0x1.
The third strategy remains the worst in three homogeneous
environmental conditions obtained by clustering.
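How such a symbolic table of degradation might be built can be sketched as follows: each individual is assigned to a class from its status at the beginning and end of the study, and each class is then described by the relative frequencies of the strategies. The records below are invented toy values, not data from the Mali study, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def degradation_class(ill_at_start, ill_at_end):
    """Label such as '0x1': healthy at the beginning, ill at the end."""
    return f"{int(ill_at_start)}x{int(ill_at_end)}"

def strategy_profiles(records):
    """Map each degradation class to the relative frequencies of the strategies in it."""
    counts = defaultdict(Counter)
    for start, end, strategy in records:
        counts[degradation_class(start, end)][strategy] += 1
    return {cls: {s: n / sum(c.values()) for s, n in c.items()}
            for cls, c in counts.items()}

# Invented toy records: (ill at start, ill at end, strategy number).
records = [(0, 1, 3), (0, 1, 3), (0, 1, 1), (0, 0, 1),
           (0, 0, 2), (1, 0, 2), (1, 1, 3)]

profiles = strategy_profiles(records)
print(profiles)
```

Each row of the resulting table is a bar-chart-valued description of one degradation class.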
PCA OF THE SYMBOLIC DATA TABLE OF DEGRADATION
A standard PCA is applied to the categories of the symbolic variables (considered as numerical variables)
of the “degradation symbolic data table”, on which the pie charts of the strategies are projected.
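A minimal sketch of this step, assuming a small bar-chart table whose bins are treated as numerical columns: the columns are centred, their covariance matrix is formed, and the first principal axis is found by power iteration. The table values are invented and the power-iteration approach is chosen only to keep the sketch dependency-free.

```python
import math

def center_columns(table):
    """Subtract each column mean."""
    n, m = len(table), len(table[0])
    means = [sum(row[j] for row in table) / n for j in range(m)]
    return [[row[j] - means[j] for j in range(m)] for row in table]

def covariance(x):
    """Covariance matrix of already centred rows."""
    n, m = len(x), len(x[0])
    return [[sum(r[a] * r[b] for r in x) / n for b in range(m)] for a in range(m)]

def first_axis(cov, iterations=500):
    """Leading eigenvector of cov by power iteration (unit length)."""
    m = len(cov)
    # Non-constant start: a constant vector lies in the null space when bins sum to 1.
    v = [1.0 / (j + 1) for j in range(m)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return v

# Rows: four classes; columns: the three bins of one bar-chart variable (toy values).
table = [[0.5, 0.3, 0.2],
         [0.2, 0.3, 0.5],
         [0.4, 0.4, 0.2],
         [0.3, 0.2, 0.5]]

centered = center_columns(table)
cov = covariance(centered)
axis = first_axis(cov)
scores = [sum(a * b for a, b in zip(row, axis)) for row in centered]
print(axis, scores)
```

The scores give the coordinates of the classes on the first factorial axis; further axes would require deflation or a full eigen-decomposition.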
ANY PIE CHART OF A SYMBOLIC VARIABLE CAN BE SEEN: Borehole well
CORRELATION CIRCLE OF ALL THE CATEGORIES (i.e. BINS) OF THE SYMBOLIC VARIABLES
SYMBOLIC VARIABLES PROJECTION IN HYPERCUBE
QUADRANT SYMBOLIC VARIABLES PROJECTION
THE SDA STRATEGY
The classes are generally not obtained from a clustering
process. The classes 0x0, 0x1, 1x0 and 1x1 of degradation
come directly from the given data.
In SDA, clustering is not much used to build
the classes to be studied; it is mainly used to
show dependencies or independencies between groups of
symbolic variables. Here the environmental conditions
CONCLUSION
• Classical, Complex and Big Data are GIVEN.
• Symbolic data are BUILT.
• Complex and Big Data can be simplified and reduced
into Symbolic Data.
• The quality of the obtained Symbolic Data can be improved
by optimizing several criteria.
• There are still few papers on building Symbolic Data.
Much remains to be done in this direction.
• Symbolic data are not only distributions.
SYMBOLIC DATA ARE THE NUMBERS OF THE FUTURE.
Basic books and papers
• Bock H.H., Diday E. (editors and co-authors) (2000): Analysis of Symbolic
Data. Exploratory methods for extracting statistical information from
complex data. Springer Verlag, Heidelberg, 425 pages, ISBN 3-540-66619-2.
• L. Billard, E. Diday (2003) "From the statistics of data to the statistics of
knowledge: Symbolic Data Analysis". JASA, Journal of the American
Statistical Association. June, Vol. 98, No. 462.
• Billard, L. and Diday, E. (2006). Symbolic Data Analysis: Conceptual
Statistics and Data Mining. 321 pages. Wiley series in computational
statistics. Wiley, Chichester, ISBN 0-470-09016-2.
• E. Diday, M. Noirhomme (eds and co-authors) (2008) “Symbolic Data
Analysis and the SODAS software”. 457 pages. Wiley. ISBN 978-0-470-01883-5.
• Noirhomme-Fraiture, M. and Brito, P. (2012) Far beyond the classical data
models: symbolic data analysis. Statistical Analysis and Data Mining 4 (2),
157-170.
• Lazare N. (2013) "Symbolic Data Analysis". CHANCE magazine. Editor’s
Letter – Vol. 26, No. 3.
In Building Symbolic Data
• Stéphan V., Hébrail G., Lechevallier Y. (2000) “Generation of
symbolic objects from relational databases”. Chapter in: Analysis of
Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data
(eds. H.-H. Bock and E. Diday). Springer-Verlag, Berlin, 103-124.
• Chiun-How, K., Chih-Wen, O., Yin-Jing, T., Chuan-kai, Yang, Chun-houh, Chen
(2012) “A Symbolic Database for TIMSS”. Arroyo J., Maté C., Brito P., Noirhomme M. eds, 3rd
Workshop in Symbolic Data Analysis. Universidad Complutense de Madrid. http://www.sdaworkshop.org/.
• E. Diday, F. Afonso, R. Haddad (2013) : “The symbolic data analysis
paradigm, discriminate discretization and financial application”. In
Advances in Theory and Applications of High Dimensional and Symbolic Data Analysis,
HDSDA 2013. Revue des Nouvelles Technologies de l'Information vol. RNTI-E-25, pp. 1-14
IN SYMBOLIC DATA ANALYSIS
In Principal Component Analysis
Cazes P., Chouakria A., Diday E., Schektman Y. (1997). Extension de l’analyse en composantes
principales à des données de type intervalle, Revue de Statistique Appliquée, Vol. XLV, No. 3, pp. 5-24,
France.
Cazes P. (2002) Analyse factorielle d’un tableau de lois de probabilité. Revue de statistique appliquée,
tome 50, n° 3.
Diday E. (2013) "Principal Component Analysis for bar charts and Metabins tables". Statistical
Analysis and Data Mining, 6(5), 403-430. Wiley. First published online: 20 May 2013.
DOI: 10.1002/sam.11188.
Ichino, M. (2011). The quantile method for symbolic principal component analysis. Statistical
Analysis and Data Mining, Wiley. 184-198.
Makosso-Kallyth S. and Diday E. (2012) Adaptation of interval PCA to symbolic histogram variables.
Advances in Data Analysis and Classification (ADAC). July, Volume 6, Issue 2, pp 147-159.
Le-Rademacher, J., Billard, L. (2012) Principal component analysis for interval data. Wiley
Interdisciplinary Reviews: Computational Statistics. Volume 4, Issue 6, pp. 535–540.
Shimizu N., Nakano J. (2012) Histograms Principal Component Analysis. Arroyo J., Maté C., Brito
P., Noirhomme M. eds, 3rd Workshop in Symbolic Data Analysis. Universidad Complutense de
Madrid. http://www.sda-workshop.org/
Wang H., Guan R., Wu J. (2012a). CIPCA: Complete-Information-based Principal Component
Analysis for interval-valued data, Neurocomputing, Volume 86, Pages 158-169.
In Symbolic Forecasting
Arroyo, J. and Maté, C. (2009). Forecasting histogram time series with k-nearest
neighbors' methods. International Journal of Forecasting 25, 192–207.
García-Ascanio, C.; Maté, C. (2010). Electric power demand forecasting using
interval time series: A comparison between VAR and iMLP. Energy Policy 38,
715-725
Han, A., Hong, Y., Lai, K.K., Wang, S. (2008). Interval time series analysis with an
application to the sterling-dollar exchange rate. Journal of Systems Science and
Complexity, 21 (4), 550-565.
He, L.T. and C. Hu (2009). Impacts of Interval Computing on Stock Market
Variability Forecasting. Computational Economics 33, 263-276.
In Symbolic rule extraction
Afonso, F. and Diday, E. (2005). Extension de l’algorithme Apriori et des règles
d’association aux cas des données symboliques diagrammes et intervalles. Revue RNTI,
Extraction et Gestion des Connaissances (EGC 2005), Vol. 1, pp 205-210, Cépaduès.
In Symbolic Decision Tree
Ciampi, A., Diday, E., Lebbe, J., Perinel, E. and Vignes, R. (2000). Growing a
tree classifier with imprecise data. Pattern Recognition Letters 21: 787-803.
Mballo C., Diday E. (2006) The criterion of Smirnov-Kolmogorov for
binary decision tree : application to interval valued variables.
Intelligent Data Analysis. Volume 10, Number 4 . pp 325 – 341
Winsberg S., Diday E., Limam M. (2006). A tree structured classifier
for symbolic class description. Compstat 2006. Physica-Verlag.
Bravo, M. and Garcia-Santesmases, J. (2000). Symbolic Object
Description of Strata by Segmentation Trees, Computational
Statistics, 15:13-24, Physica-Verlag.
In Clustering
• De Carvalho F., Souza R., Chavent M., and Lechevallier Y. (2006) Adaptive Hausdorff distances and
dynamic clustering of symbolic interval data. Pattern Recognition Letters, Volume 27, Issue 3, February
2006, Pages 167-179.
• De Souza R.M.C.R, De Carvalho F.A.T. (2004). Clustering of interval data based on City-Block distances.
Pattern Recognition Letters, 25, 353–365.
• Diday E. (2008) Spatial classification. DAM (Discrete Applied Mathematics) Volume 156, Issue 8, Pages
1271-1294.
• Diday, E., Murty, N. (2005) "Symbolic Data Clustering" in Encyclopedia of Data Warehousing and Mining.
John Wong editor. Idea Group Reference Publisher.
• Irpino, A. and Verde, R. (2008): Dynamic clustering of interval data using a Wasserstein-based distance.
Pattern Recognition Letters 29, 1648-1658.
In Multidimensional Scaling
• Terada, Y., Yadohisa, H. (2011) Multidimensional scaling with hyperbox model for percentile dissimilarities.
In: Watada, J., Phillips-Wren, G., Jain, L. C., and Howlett, R. J. (Eds.): Intelligent Decision Technologies,
Springer Verlag, 779–788.
• Groenen, P.J.F., Winsberg, S., Rodriguez, O., Diday, E. (2006). I-Scal: Multidimensional scaling of interval
dissimilarities. Computational Statistics and Data Analysis 51, 360–378.
Some Symbolic Data Analysis references
In Self Organizing map
• Hajjar C., Hamdan H. (2011). Self-organizing map based on L2 distance for interval-valued data. In SACI 2011, 6th
IEEE International Symposium on Applied Computational Intelligence and Informatics (Timisoara, Romania), pp. 317–322.
In Dissimilarities between Symbolic Data
• Kim, J. and Billard, L. (2013): Dissimilarity measures for histogram-valued observations, Communications in Statistics - Theory and Methods, 42, 283-303.
• Verde, R., Irpino, A. (2010). Ordinary Least Squares for Histogram Data Based on
Wasserstein Distance, in: Proc. COMPSTAT’2010, Y. Lechevallier and G. Saporta (Eds.), pp. 581-589.
Physica Verlag Heidelberg.
In Regression and Canonical analysis extended
to Symbolic Data
Dias, S., Brito, P. (2011). A New Linear Regression Model for Histogram-Valued
Variables. In Proceedings of the 58th ISI World Statistics Congress
(Dublin, Ireland).
Lauro, C., Verde, R., Irpino, A. (2008). Generalized canonical analysis, in:
Symbolic Data Analysis and the Sodas Software, E. Diday and M.
Noirhomme-Fraiture (Eds.), 313-330, Wiley, Chichester.
Tenenhaus A., Diday E., Emilion R., Afonso F. (2013) Regularized General
Canonical Correlation Analysis Extended To Symbolic Data. ADAC
(publication on the way).
Neto, E.A, De Carvalho F.A.T. (2010). Constrained linear regression
models for symbolic interval-valued variables. Computational
Statistics and Data Analysis 54, 333-347.
Wang H., Guan R., Wu J. (2012c). Linear regression of interval-valued
data based on complete information in hypercubes, Journal of
Systems Science and Systems Engineering, Volume 21, Issue 4, Page
422-442.
In Symbolic Data Models references
• P. Bertrand, F. Goupil (2000) “Descriptive Statistics for symbolic data”. In H.H. Bock, E.
Diday (Eds) “Analysis of Symbolic Data”. Springer-Verlag, pp. 106-124.
• Brito, P. and Duarte Silva, A.P. (2012). Modelling interval data with Normal and Skew-Normal
distributions. Journal of Applied Statistics, 39 (1), 3-20.
• E. Diday, M. Vrac (2005) "Mixture decomposition of distributions by Copulas in the
symbolic data analysis framework". Discrete Applied Mathematics (DAM). Volume 147,
Issue 1, 1 April, pp. 27-41.
• E. Diday (2011) Modélisation de données symboliques et application au cas des
intervalles. Journées Nationales de la Société Francophone de Classification. Orléans.
• E. Diday (2002) “From Schweizer to Dempster: mixture decomposition of distributions by
copulas in the symbolic data analysis framework”. IPMU 2002, July, Annecy, France.
• Diday E., Emilion R. (1997) "Treillis de Galois Maximaux et Capacités de Choquet". C.R.
Acad. Sc. t.325, Série 1, p 261-266. Présenté par G. Choquet en Analyse Mathématique.
• Diday E., R. Emilion (2003) Maximal and stochastic Galois lattices. Discrete Applied Math.
Journal. Vol. 27 (2), pp. 271-284.
• Emilion R., Classification et mélanges de processus. C.R. Acad. Sci. Paris, 335, série I,
189-193 (2002).
• Emilion R., Unsupervised Classification and Analysis of objects described by
nonparametric probability distributions. Statistical Analysis and Data Mining (SAM), Vol
5, 5, 388-398 (2012).
• J. Le-Rademacher, L. Billard (2011) “Likelihood functions and some maximum likelihood
estimators for symbolic data”. Journal of Statistical Planning and Inference 141, 1593–
1602. Elsevier.
• T. Soubdhan, R. Emilion, R. Calif (2009) “Classification of daily solar radiation
distributions”. Solar Energy 83 (2009) 1056–1063. Elsevier.
In SDA Industrial Applications
• Afonso F., Diday E., Badez N., Genest Y. (2010) Symbolic Data Analysis of Complex Data:
Application to nuclear power plant. COMPSTAT’2010, Paris.
• Bezerra B., Carvalho F. (2011) Symbolic data analysis tools for recommendation systems.
Knowl. Inf. Syst 01/2011; 26:385-418. DOI:10.1007/s10115-009-0282-3.
• Bouteiller V., Toque C., A., Cherrier J-F., Diday E., Cremona C. (2011) Non-destructive
electrochemical characterizations of reinforced concrete corrosion: basic and symbolic data
analysis. Corros Rev. Walter de Gruyter, Berlin, Boston. DOI 10.1515/corrrev-2011-002.
• Courtois, A., Genest, G., Afonso, F., Diday, E., Orcesi, A. (2012) In service inspection of
reinforced concrete cooling towers – EDF’s feedback, IALCCE 2012, Vienna, Austria.
• Cury, A., Crémona, C., Diday, E. (2010). Application of symbolic data analysis for structural
modification assessment. Engineering Structures Journal. Vol 32, pp 762-775.
• Christelle Fablet, Edwin Diday, Stephanie Bougeard, Carole Toque, Lynne Billard (2010).
Classification of Hierarchical-Structured Data with Symbolic Analysis. Application to
Veterinary Epidemiology. COMPSTAT’2010, Paris.
• Haddad R., Afonso F., Diday E. (2011) Approche symbolique pour l'extraction de
thématiques: Application à un corpus issu d'appels téléphoniques. In actes des XVIIIèmes
Rencontres de la Société francophone de Classification. Université d'Orléans.
• Laaksonen, S. (2008). People’s Life Values and Trust Components in Europe - Symbolic Data
Analysis for 20-22 Countries. In: Edwin Diday and Monique Noirhomme-Fraiture, “Symbolic
Data Analysis and the SODAS Software", Chapter 22, pp. 405-419. Wiley and Sons:
Chichester, UK.
• Quantin C., Billard L., Touati M., Andreu N., Cottin Y., Zeller M., Afonso F., Battaglia G., Seck
D., Le Teuff G., and Diday E. (2011) Classification and Regression Trees on Aggregate Data
Modeling: An Application in Acute Myocardial Infarction. Journal of Probability and Statistics
Volume 2011 (2011), 19 pages.
• Terraza V., Toque C. (2013) Mutual Fund Rating: A Symbolic Data Approach. In "Understanding
Investment Funds: Insights from Performance and Risk Analysis". Edited by Virginie Terraza and Hery
Razafitombo. Economics & Finance Collection 2013. Palgrave Macmillan, UK.
• He, L.T. and C. Hu (2009). Impacts of Interval Computing on Stock Market Variability Forecasting.
Computational Economics 33, 263-276.
• E. Diday, F. Afonso, R. Haddad (2013): The symbolic data analysis paradigm, discriminate
discretization and financial application, in Advances in Theory and Applications of High Dimensional
and Symbolic Data Analysis, HDSDA 2013. Revue des Nouvelles Technologies de l'Information vol.
RNTI-E-25, pp. 1-14.
• Han, A., Hong, Y., Lai, K.K., Wang, S. (2008). Interval time series analysis with an application
to the sterling-dollar exchange rate. Journal of Systems Science and Complexity, 21 (4), 550-565.