DOCUMENT: tutorial001
Illustration of running DIOGENE for processing a simple Factorial Correspondence Analysis

Introduction

The documents of the “tutorial series”, such as this one, deal only with the very practical problems that users face before they can use the results of data processing (thesis, publication, selection of the best genotypes, etc.). Other documents, the “notice series”, give general information about the biometrical and genetic models and about the precautions required to draw conclusions from experimental results.

These practical problems fall into three categories:

(1) Data organization: how to prepare the data files to be processed, covering the measured or graded traits (observations), the codes that describe the experimental structure the user wants to analyze (indicators), and the alphanumeric labelling of these two categories, which we shall call “trait labels”, “indicator labels” or, more compactly, “labels”;

(2) Processing selection: how to select the program (or sequence of programs) that best fits the user’s aim, whatever kind of result is wanted (data management, estimation of parameters, exploration of data structure, etc.);

(3) Importation of results: after data processing, the final results must be obtained in a well-edited form that anybody can integrate into a variety of documents.

For point (1), although it is not the only way to prepare files for DIOGENE, data in Excel format will be favoured because this spreadsheet is the de facto standard tool used by researchers. For point (3), and for the same reasons, Word will also be favoured, even though “paper” edition of the results is possible without it, for instance via an A3 network printer.
Of course, Excel and Word may be replaced by programs of the same categories, for instance the equivalent spreadsheets and word processors available for Linux.

The examples given in this document were obtained on a Sun Microsystems Enterprise 450 server running the Solaris 9 version of Unix (for running DIOGENE) and on Windows XP (Professional edition) with Office XP, for preparation of the Excel data files and importation/edition of results. The standard programs connecting the PC to the server were Tera Term Pro (as alphanumeric terminal emulator) and WS-FTP95 LE, for file transfer. These programs can be freely downloaded from the internet.

Lastly, although DIOGENE now enables 2-D and 3-D graphics (through an interface with Gnuplot), this aspect will not be developed in the basic “tutorials”, because it requires graphic terminal emulators such as X-Windows. The points where the programs create the files specialized for these graphics will be indicated. Note that these files can also be used to produce graphics with Excel (or an equivalent spreadsheet). The edited graphics will be alphanumeric. Specialized tutorials will later be devoted to high-definition graphics.

The above Excel table (lebart.xls file) displays data from the book “Traitement des données statistiques” by Lebart, Morineau and Fénelon (1979, Dunod Editor), selected from the Correspondence Analysis section (pp. 306-326). It crosses 10 categories of profession (rows) with 8 kinds of housing (columns). Each cell holds the count of individuals who belong to that particular combination. It is therefore a contingency table from which one can describe the association between rows (arbitrarily considered as “populations”) and columns, which we shall consider (again arbitrarily) as “traits”.
DIOGENE requires numeric data represented in “binary” form. Here these are the within-cell counts and, additionally for this analysis, one indicator, also binary, which is the “population code” (column B). On the margins, you can find the trait labels (row 1) and the population labels (column A). Note that each population label is followed by a semicolon. We shall see that this is mandatory at this stage.

The “data organization” step is fulfilled by the Excel table, but two points remain to be solved before obtaining a file really suitable for processing by DIOGENE:
- Conversion of labels into ASCII format;
- Conversion of numerical codes and of counts into binary representation.

For this purpose, we first need to translate the Excel file into ASCII format using spaces as separators, which is called a “.prn” file. As the above display shows, this is achieved by selecting this sub-option after the “write according to format…” Excel command.

The next step is to transfer the “lebart.prn” file from the PC to the server (into the “tutorial” sub-directory, for this example). It is mandatory to do this transfer in ASCII mode, as shown by the above screen. If the transfer is done in binary mode, the control codes at the end of each line (CR) will cause errors when the transcoding program (presented below) runs, and no transcoded file will be obtained.

When the transfer is complete, you can view and modify the lebart.prn file using vi (as here) or another text editor available under Unix (emacs, xemacs, etc.). You must verify that there is at least one space character between two different columns and that each “population label” is followed by a semicolon (otherwise an error will occur when the transcoding program is run). A space character is not required (but may be present) between the last character of a label and the semicolon. However, spaces are not allowed within a label.
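The checks described above lend themselves to a small script. The following is a minimal Python sketch, not part of DIOGENE, that flags the problems mentioned so far on a single data line of a .prn file: a stray carriage return left by a binary-mode transfer, a population label that is not followed by a semicolon, and a label that contains a space. The function name check_prn_line and the sample lines are hypothetical, and the sketch assumes that the label is the first item of each data line (the row of trait labels is not handled).

```python
def check_prn_line(line):
    """Return a list of problems found in one data line of a .prn file.

    Assumption: the population label is the first item of the line and
    must be followed by a semicolon, as described in the tutorial.
    """
    problems = []
    # A carriage return usually means the file was transferred in binary mode.
    if "\r" in line:
        problems.append("carriage return present (binary-mode transfer?)")
    text = line.rstrip("\r\n")
    if ";" not in text:
        problems.append("no semicolon after the population label")
        return problems
    label, rest = text.split(";", 1)
    # Spaces between the label and the semicolon are allowed, so strip them.
    label = label.strip()
    if not label:
        problems.append("empty population label")
    elif any(ch.isspace() for ch in label):
        problems.append("space inside the population label")
    # At least one data column must follow the label.
    if not rest.split():
        problems.append("no data columns after the label")
    return problems
```

Run over every data line of the file, an empty list for each line suggests that the file meets these particular requirements; it says nothing about the other checks that ASCBIN may perform.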
Note that perfect alignment of the items of a column is not required: the only requirement is at least one space character between two adjacent columns. Note also that the maximum length of a label (for populations as well as traits) is 10 characters (the semicolon excluded). Longer labels (such as the first one in the above example) are truncated to their first ten characters. If the listed requirements are not fulfilled, correct the file, either by going back to Excel or under Unix.

The next step is to run the transcoding program, ASCBIN, updated to process the two categories of labels. DIOGENE can be used in two ways: an “expert mode” for those who know the programs they need and their functions, and a “beotian mode” (following Dr Arbez’s famous expression) for people who use it for the first time and/or are not sufficiently familiar with the software. The first way is much quicker, because it allows many shortcuts. The second way, the only one we consider here, uses the scrolling menu manager (DIOGENE), in which the user is not supposed to know the programs’ names and functions.

Type “diogene” (in lower-case letters) followed by [CR]. The following “first level menu” is displayed.

The “p” branch that we have selected for this example means that the program(s) will not be run immediately (interactive mode) but via a “script” (or “procedure”), an executable file which runs the program(s) once all the parameters have been entered (the detailed mechanism will be explained in the “notice” series). At this point, we answer the question “name of the script” (not shown here). For this example, this name was “transcode”.

We have now reached the “second level” of the menu hierarchy and we select the “file management” branch.

This choice leads to a third level of the hierarchy, where we can select the kind of management operation that we need.
Note that this branch extends to software administration (language translation, compilation, link-editing, etc.). At this point, we select the “transcodage, copy-again” heading, as we have to transcode an ASCII file.

This gives access to the fourth level of the hierarchy, where we select the “suited for Excel” program, ASCBIN. Note that this program automatically detects the presence of “population” labels and runs a program specialised in managing these labels. We shall comment on this in more depth below. Note that ASCBIN automatically links two other programs involved in data file conversion:
- NORMEX (addition of the minimum and maximum values of the indicators to the parameter file; see its definition later);
- CREFAM (restructuring of the “population” label file into an appropriate format).
These programs do not need additional parameters. Moreover, they can be run separately in other particular situations, which will be examined later.

After validation of the choice of ASCBIN (Y key [CR]), the menu manager, DIOGENE, runs the OPEP2 supervisor, which is designed to generate the script and the associated parameter file. Note that the TRIPROG option is devoted to selecting a sub-sample within a chain of programs in complex treatments; we do not use it here.

The supervisor confirms that the parameters to enter concern ASCBIN and begins to display the corresponding sequence of questions. Note that all the choices are “contextual”: the supervisor performs logical tests so as to ask only for the parameters made necessary by the choices already made.

On the above screen display, the supervisor asks for the description of the “record format”. A record may be considered as a row of the Excel table (considering only the binary values). DIOGENE processes these records one at a time (saving memory space) and extends their definition to several “individuals”, the number of which may vary from one record to another.
More explanations will be given in the “notice” series.

On the above screen, we select the “column headings” option because our file does include labels for traits. The “reaffectation option” is not relevant here.

The first of the last two questions concerns the possibility of redefining the “null” values, which mean “absence of observation for this trait”. For instance, the conventional value may be “-1” for one software package and “-5” for another. DIOGENE processes two conventional values of this kind differently: “-5”, which means “no observation”, and “-9”, which means “the given individual was dead when the corresponding trait was observed”. More explanations will be given in the “notice” series. The second question is about the presence of spaces as data delimiters. This is always the case, except for old-fashioned files created in the sixties or seventies.

At this point, all the necessary parameters have been entered, and the above screen offers the possibility of checking and/or correcting the parameters via a system call from the supervisor to the vi text editor. The “v” (validate) choice ends the script building. However, it is always possible to correct a script later by changing some of the parameters in the corresponding parameter file. In this example, and if you use vi, type:
vi transcode.don*
Then select the appropriate file (“:n” to get the next parameter file associated with the script), correct the parameters, and type “:w” to store the corrected file and “:q” to leave vi.

The above screen is displayed after selecting the “Z” key, when the menu manager (DIOGENE) takes over from the supervisor (OPEP2), or when the user has decided to delay the execution of the script by selecting the “W” key.
In the latter case, the user has to type:
gene
which is the name of a C-shell command that runs the script after a check/creation stage ensuring that all the “service files” needed to run the DIOGENE library are actually present in the working directory. Selecting the “Z” key directly runs the gene command. Note that we have deliberately omitted to indicate that the “carriage return” [CR] key has to be pressed after each command or parameter.

The above screen and the screen shown below show the final result of the file processing. The original lebart.prn file gave rise to three associated files:
- lebart (binary data including the population codes and the observed “traits”);
- lebart.p (a mixture of data in numeric, integer, format and in alphanumeric format) which describes the lebart file (maximum number of indicators, number of individuals per record, number of traits, names of indicators and labels of the traits);
- lebart.fam (a purely alphanumeric file of “population labels”).
From a computing point of view, and for reasons which will be explained in the notice series, the first and the last are “direct access” files whereas the second is an “unformatted sequential” file.

We can note that the file of “population labels”, lebart.fam, is formatted in such a way that the rank (row number) of a particular label corresponds to the value of the numeric population code in the lebart binary file. The reason is very simple: during data processing, these labels are stored in column vectors and are directly addressed by the value of the numeric population codes. We can also note that character strings of ten “*” fill the places where there is no population label. These “dummy” characters will be useful for managing the cases where there is not just one set of labels but several sets, corresponding to the m factors of a more or less complex experiment. An example will be given in the next tutorial.
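The layout of lebart.fam described above can be illustrated by a minimal Python sketch. It is not the CREFAM program, only an illustration under stated assumptions: the function fam_rows is hypothetical, labels shorter than ten characters are assumed to be padded with spaces, and the codes and labels are those that appear in the results listed at the end of this tutorial.

```python
FILLER = "*" * 10  # ten asterisks mark a code that has no population label


def fam_rows(labels_by_code):
    """Build the label rows so that row number == numeric population code.

    Assumption: labels are cut to 10 characters (as stated in the tutorial)
    and padded with spaces to a fixed width (padding is a guess).
    """
    highest_code = max(labels_by_code)
    rows = []
    for code in range(1, highest_code + 1):
        label = labels_by_code.get(code, FILLER)
        rows.append(label[:10].ljust(10))
    return rows


# Population codes and labels as they appear in the CHAIX output.
LEBART_LABELS = {
    1: "agri_owner", 3: "agri_emplo", 5: "factory_ow", 7: "big_manag",
    9: "med_manag", 11: "trade_empl", 14: "labours", 15: "serv_empl",
    16: "misce_act", 17: "retired",
}

rows = fam_rows(LEBART_LABELS)
```

With this layout, the label of population code 14 is simply rows[13] (Python lists start at 0), which is the direct addressing by code that the text refers to; codes 2, 4, 6 and so on, which have no label, hold the filler string.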
Note that the “.fam” file is required to obtain the labels of the factors but not to process the data: it is an optional file.

Now, our aim is to view the data. As it is a very small file, we can choose the “screen display” option. For this, we first select the “interactive” mode in the first-level branch of the DIOGENE menu manager (see above).

Then, as before, we select the “file management” section.

Then, we select the “edition” section.

Then, we select the “screen edition” of a data file (LIRE program).

LIRE is run directly and, after input of the file name, gives the file’s characteristics (obtained from the lebart.p file) and asks for the ranks of the first and last records to be displayed.

After input of the last record’s rank, the edition is performed, and the program asks for the parameters of another section to be displayed, or for 0 to close the edition.

Now, we wish to process our data by Correspondence Analysis. We run the menu manager, which has now become familiar. This time we select the “script” option, which is mandatory for at least the first program of the chain. (This is always the case for biometrical processing of data files that follow the “extended ANTAR normalization”, such as the lebart file in three parts: lebart, lebart.p and lebart.fam.)

At the second level of the menu hierarchy, we select the “biometry-quantitative genetics” branch.

Then, we select, for instance, the “complete” built-in chain including dendrograms, underlined above (more complex chains are possible, for instance “Discriminant Analysis of Qualitative data”, which is partly built on Correspondence Analysis, or “Multiple Factorial Correspondence Analysis”).

As before, the OPEP2 supervisor is activated by the menu manager and we ignore the “triprog” option.
The supervisor displays the ordered list of the programs to be chained and begins to ask for the parameters of two “supplementary” programs, ANTAR and DEFCAR, which are not explicitly included in the chain but are necessary to define the data file and the traits to be processed. For the second aspect, DIOGENE supplies a syntactic analyser able to translate general formulae into a FIFO “pile” of ordered elementary operations. This means that users can enter their own trait definitions involving the traits present in the file. We shall illustrate this possibility in a later tutorial. Here, we have to enter the data file’s name (lebart). The lebart.p file is assumed to exist; the existence of the lebart.fam file is tested and, if it is available, the labels are used in the appropriate parts of the edition of results.

On the above screen, there is a choice of somewhat “exotic” options which are of no use here. A by-pass makes it possible to skip this choice.

DIOGENE uses the three modes of data processing listed above. The qualitative modes (1 and 2) have to be selected to derive, from the observed data, elementary frequencies or derived ones, using a disjunction process via an incidence matrix. An example is the derivation of the frequencies of the five categories of an insect attack scored from 0 to 5. Here, as the elementary data are directly counts, we select the “quantitative” mode.

When one wants to select the first n traits of a file, without any transformation, DIOGENE offers the shortcut “-n”, which was used here.

As before, after entry of the last parameter, the supervisor offers the possibility to check and, if necessary, correct the entered parameters.

The parameters specific to the first module, CHAIX, are the minimum expectation of a cell, required for a valid Chi-square test of the row x column association, and the number of dimensions retained to compute the distances and the projections of the points on the two-by-two combinations of axes.
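To make the notion of “minimum expectation of a cell” concrete, here is a minimal Python sketch of the textbook Chi-square test of row x column association. It is not the CHAIX code: the function chi_square and the small table are hypothetical, all row and column totals are assumed to be non-zero, and the way the threshold is used here (a simple pass/fail flag) is only one possible reading of the parameter. The expected count of a cell is (row total x column total) / grand total, and the number of degrees of freedom is (rows - 1) x (columns - 1).

```python
def chi_square(table, min_expected=0.0):
    """Return (chi2, degrees_of_freedom, all_cells_meet_minimum).

    table is a list of rows of counts; every row and column total is
    assumed to be strictly positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    smallest_expected = float("inf")
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            smallest_expected = min(smallest_expected, expected)
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof, smallest_expected >= min_expected


# Hypothetical 2 x 2 table, used only to exercise the function.
example = [[10, 20], [30, 40]]
chi2_value, dof_value, valid = chi_square(example, min_expected=5.0)
```

For the 10 x 8 table of this tutorial, the same formula gives (10 - 1) x (8 - 1) = 63 degrees of freedom, which is the value printed by CHAIX in the results.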
Lastly, as is the rule for all the “leading” programs of the biometrical chains, it is possible to filter simultaneously on two indicators and two traits. The “by-pass” value “0” replaces the non-entered upper- and lower-bound values with dummy values which lead to no selection. If a great number of simultaneous filters is needed, a utility available in the DIOGENE library (the CRECAR program) allows an unlimited number of filters. Two other multi-purpose utilities are also available for this kind of use (TROUVE and COPIE).

The ANACOR program computes the eigenvectors and eigenvalues of the row x column associations. Its only parameters concern the edition of the coordinates of the points and of their absolute and relative contributions.

The parameters of the DAUGEY program, which computes the Chi-square distances and the two-dimensional graphics, concern the “normalization” of distances (using the square roots of the eigenvalues) and the choice of the entities to be retained (populations, traits, or both).

The POSKI1 program plots the histograms of the coordinates of the population and/or trait points (of no great interest here). We retain 5 classes.

The last program of the chain, PRADET, is shared with a variety of other kinds of computations, such as Discriminant Analysis and Principal Components Analysis. Its purpose is to represent the between-point relationships as dendrograms, after transformation of distances into similarities (see their definitions in later notices). The parameters concern the edition of the results preceding the dendrograms, such as the similarity matrix, and the choice of one of the three available algorithms to build the clusters.

The last entry of PRADET makes it possible to build further dendrograms from the same similarity matrix with another combination of parameters.
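The Chi-square distance mentioned above for DAUGEY can be illustrated with the usual textbook definition: the distance between two rows compares their profiles (counts divided by the row total), and each squared difference is weighted by the inverse of the column mass. The following Python sketch assumes this standard definition; whether DAUGEY computes it on the raw profiles or on the retained factorial axes is not stated here, and the function chi2_distance and the small table are hypothetical.

```python
import math


def chi2_distance(table, i, k):
    """Chi-square distance between the profiles of rows i and k.

    Standard definition (an assumption as far as DIOGENE is concerned):
    d^2 = sum over columns of (profile_i - profile_k)^2 / column_mass.
    """
    grand_total = sum(sum(row) for row in table)
    col_mass = [sum(col) / grand_total for col in zip(*table)]
    row_i, row_k = table[i], table[k]
    total_i, total_k = sum(row_i), sum(row_k)
    squared = sum(
        (a / total_i - b / total_k) ** 2 / mass
        for a, b, mass in zip(row_i, row_k, col_mass)
    )
    return math.sqrt(squared)


# Hypothetical 2 x 2 table, used only to exercise the function.
example = [[10, 20], [30, 40]]
distance = chi2_distance(example, 0, 1)
```

Two rows with proportional counts have identical profiles and therefore a distance of zero, whatever their totals; this is why the analysis compares populations of very different sizes on an equal footing.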
When the menu manager again displays the first level of the “biometry-quantitative genetics” branch from which we selected the chain of programs, we select the “Z” key to run the script…

…and the gene command is again activated and starts when the name of the script is entered (afc here).

Short comments about the afc script. The script shows the sequence of the seven programs involved in processing the lebart file. The first line means that the commands are written in C-shell and may be executed by the corresponding interpreter. Lines 2 and 3 test for the existence of two service files: erreur, for error reports, and sortie, for output of results. If they exist, these files are destroyed. Line 4 does the same for the output files labelled with the script’s name (see later). The following lines alternate between statements that run each program, specifying the “parameter file” from which it has to read its own parameters, and tests that redirect the script toward the “echec” label in case of fatal error. After each test, a line redirects the content of the sortie file toward a “.out” file identified by the script’s name and a three-digit number. In case of success, the message “end of procedure…” is displayed.

After the script has run successfully, all the results can be found in five “.out” files: afc.out.001 (output of CHAIX), afc.out.002 (output of ANACOR), afc.out.003 (output of DAUGEY), afc.out.004 (output of POSKI1) and afc.out.005 (output of PRADET). The GENE utility automatically concatenates these five files into a single one: afc.out. This file may be viewed using Tera Term with vi or another text editor.

The above screen shows the importation of the afc.out file from the server to the PC. Note that this may be done in either binary or “ASCII” mode. NB: disregard the name of the file shown (afcout in place of afc.out), as the example was built before the last modification of GENE.

Now, we shall deal with the importation of results into Word.
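Before moving on, the bookkeeping of output files described above (numbered files afc.out.001, afc.out.002 and so on, concatenated into afc.out) can be mimicked by a minimal Python sketch. It only illustrates the file naming and is in no way the GENE utility, which is a C-shell command; the functions concatenate_outputs and demo, and the file contents used in demo, are hypothetical.

```python
import tempfile
from pathlib import Path


def concatenate_outputs(script_name, directory="."):
    """Concatenate <script>.out.NNN files, in numeric order, into <script>.out.

    Returns the names of the files that were concatenated.
    """
    directory = Path(directory)
    # Three-digit suffixes sort correctly as plain strings.
    parts = sorted(directory.glob(f"{script_name}.out.[0-9][0-9][0-9]"))
    target = directory / f"{script_name}.out"
    with target.open("w") as out:
        for part in parts:
            out.write(part.read_text())
    return [part.name for part in parts]


def demo():
    """Create two dummy numbered outputs in a temporary directory and merge them."""
    with tempfile.TemporaryDirectory() as tmp:
        for number, text in [(1, "CHAIX\n"), (2, "ANACOR\n")]:
            (Path(tmp) / f"afc.out.{number:03d}").write_text(text)
        names = concatenate_outputs("afc", tmp)
        return names, (Path(tmp) / "afc.out").read_text()
```

Keeping one numbered file per program, as the script does, makes it easy to inspect the output of a single step, while the concatenated file is the one that is convenient to import into a word processor.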
Note that DIOGENE’s edition in alphanumeric format does not exceed a width of 155 columns, and that a simple terminal emulator such as Tera Term can easily be configured to accommodate this format. To import the results into Word and see any page of results in full, it is mandatory to select character size “8” and the “Courier New” font, which gives a fixed symbol width. Do not select a proportional font such as “Times New Roman”, because problems would occur in within-column alignment! The above screen shows the selection of character size.

The above display shows the classical selection for numbering the pages in Word, which everybody knows. It makes it easier to select the parts of the results which are of interest to the researcher, especially if there are several hundred pages.

For an adequate format of results, it is mandatory to use the “landscape” presentation and to set the top and bottom margins as narrow as possible (this is important to avoid truncating the graphics of point positions on two-axis combinations, for instance). The above selection of parameters is adequate for all the outputs of DIOGENE.

The results edited according to the chosen parameters are listed hereafter. Our purpose is not to comment on these results, from which you will learn that people who earn more money usually live in the more pleasant houses. Note that the trait labels are managed automatically all along these outputs, with the appropriate format choices (right or left justification, optimal position on the graphics and so on).

RESULTS

$*$*$*$*$* 24 hours a day, DIOGENE 2004 at your service! *$*$*$*$*$*

Biometry of file: lebart
----------------------------------
housing/profession relationshi

names of the 8 traits studied:
---------------------------------
y 1 = hotel
y 2 = rent_house
y 3 = priv_house
y 4 = parents
y 5 = friends
y 6 = mob_house
y 7 = holi_vill
y 8 = miscellan
..............................................................................
definition of the 8 traits studied:
y 1 = x1
y 2 = x2
y 3 = x3
y 4 = x4
y 5 = x5
y 6 = x6
y 7 = x7
y 8 = x8
..............................................................................

CHAIX: preparation of Factorial Correspondence Analysis

number of indicators/record: 1
option 0 (0 = quantitative traits, 1 = qualitative traits)
records number 1 to 10
number of the first individual processed/record: 1, last = 1, step = 1
8 traits observed, 8 studied

constraints        lower limit    upper limit
indicator  1            -99999          99999
indicator  1            -99999          99999
trait      1        -99999.000      99999.000
trait      1        -99999.000      99999.000

Number of populations retained: 10
(imposing a minimum expectation/cell of 0.0000E+00)

Profile of populations num_code (% of row total): 0 = mean profile
(takes into account all the variables: main & supplementary)

code  label       count        y 1        y 2        y 3        y 4        y 5        y 6        y 7        y 8
                             hotel rent_house priv_house    parents    friends  mob_house  holi_vill  miscellan
   1  agri_owner    796     20.101      3.518      0.000     40.327      4.523     17.714      5.653      8.166
   3  agri_emplo    260     13.462     13.077      0.385     68.462      3.077      0.000      1.538      0.000
   5  factory_ow   2978     23.506     11.887      7.690     32.203      6.212      9.805      3.996      4.701
   7  big_manag    4620     20.801     10.195     13.701     34.199      6.602      7.792      3.506      3.203
   9  med_manag    4298     13.309     12.494      6.491     39.297      4.793     17.403      3.606      2.606
  11  trade_empl   2972     14.838     13.594      5.585     36.306      5.989     14.603      5.989      3.096
  14  labours      9209      8.503     12.097      4.202     44.000      5.397     15.897      5.701      4.202
  15  serv_empl     583     11.149      7.376      3.602     50.429     13.551      9.777      3.087      1.029
  16  misce_act    1423      5.411      4.216     13.282     58.960      3.725      8.714      1.968      3.725
  17  retired      3940     18.807      8.426      8.299     45.406      7.893      5.990      2.589      2.589
   0  ensemble    31079     14.592     10.866      7.182     41.121      5.978     12.407      4.299      3.555

Profile of traits (% of column total): 0 = mean profile
(takes into account all the variables: main & supplementary)

code  label       count        p 1        p 3        p 5        p 7        p 9       p 11       p 14       p 15       p 16       p 17
                        agri_owner agri_emplo factory_ow  big_manag  med_manag trade_empl    labours  serv_empl  misce_act    retired
   1  hotel        4535      3.528      0.772     15.436     21.191     12.613      9.724     17.266      1.433      1.698     16.340
   2  rent_house   3377      0.829      1.007     10.483     13.947     15.902     11.963     32.988      1.273      1.777      9.831
   3  priv_house   2232      0.000      0.045     10.260     28.360     12.500      7.437     17.339      0.941      8.468     14.651
   4  parents     12780      2.512      1.393      7.504     12.363     13.216      8.443     31.706      2.300      6.565     13.998
   5  friends      1858      1.938      0.431      9.957     16.416     11.087      9.580     26.749      4.252      2.853     16.738
   6  mob_house    3856      3.657      0.000      7.573      9.336     19.398     11.255     37.967      1.478      3.216      6.120
   7  holi_vill    1336      3.368      0.299      8.907     12.126     11.602     13.323     39.296      1.347      2.096      7.635
   8  miscellan    1105      5.882      0.000     12.670     13.394     10.136      8.326     35.023      0.543      4.796      9.231
   0  ensemble    31079      2.561      0.837      9.582     14.865     13.829      9.563     29.631      1.876      4.579     12.677

Chi-2 test of the row*column association (63 d.f.) = 2598.273
Significance level of this test (in %) = 0.000
(the test only takes the main variables into account)

Matrix of associations between traits (columns)
(this matrix only takes the main variables into account)

                       y 1    y 2    y 3    y 4    y 5    y 6    y 7    y 8
y 1: hotel           0.168
y 2: rent_house      0.125  0.114
y 3: priv_house      0.111  0.085  0.090
y 4: parents         0.235  0.208  0.169  0.422
y 5: friends         0.097  0.080  0.067  0.156  0.064
y 6: mob_house       0.124  0.121  0.083  0.225  0.082  0.140
y 7: holi_vill       0.075  0.071  0.050  0.131  0.050  0.079  0.047
y 8: miscellan       0.072  0.061  0.048  0.120  0.045  0.069  0.041  0.039

ANACOR: Factorial Correspondence Analysis

number of dimensions (axes with relative contribution > .0001) = 7

eigenvalues (in decreasing order; the largest one, equal to 1, is eliminated), eigenvalues in % of the total variance, and cumulated eigenvalues in % of the total variance:

            eigenvalue   % of variance   cumulated %
lambda 1    4.4316E-02          53.008        53.008
lambda 2    2.0249E-02          24.221        77.229
lambda 3    8.8049E-03          10.532        87.760
lambda 4    5.4653E-03           6.537        94.298
lambda 5    2.3815E-03           2.849        97.146
lambda 6    2.1451E-03           2.566        99.712
lambda 7    2.4065E-04           0.288       100.000

eigenvectors (in the order of the eigenvalues):

              y 1          y 2          y 3          y 4          y 5          y 6          y 7          y 8
ve 1   5.9769E-01  -1.2159E-01   5.1898E-01  -1.9374E-01   1.3107E-01  -5.0434E-01  -2.0999E-01  -7.3932E-02
ve 2  -4.6878E-01  -3.2009E-01   2.4708E-01   6.4607E-01   1.4905E-02  -3.4815E-01  -2.3696E-01  -1.4751E-01
ve 3  -3.5370E-01   2.0959E-01   7.5007E-01  -2.6290E-01  -3.0118E-01   3.2069E-01   3.3817E-02  -6.7504E-02
ve 4  -1.2801E-01   6.0326E-01  -8.5547E-02  -2.8660E-02   3.5209E-01  -1.6636E-01  -4.1019E-02  -6.7692E-01
ve 5  -2.9275E-01  -2.3328E-01   1.6478E-01  -2.2969E-01   7.4047E-01  -5.0467E-02   4.3122E-01   2.0778E-01
ve 6  -1.1988E-01   4.1439E-01  -9.5594E-03   5.7492E-02  -3.1282E-01  -6.0434E-01   4.7595E-01   3.4771E-01
ve 7  -1.8849E-01   3.7431E-01  -2.4959E-02  -8.8577E-02   2.4898E-01  -4.5595E-02  -6.6443E-01   5.5710E-01

analysis of the profile of populations (rows)

coordinates on each of the 4 axes retained

            axe 1        axe 2        axe 3        axe 4
p  1  -1.3896E-01  -1.7535E-01  -2.4353E-01  -3.4388E-01
p  3  -3.6085E-02   3.8541E-01  -3.4680E-01   2.0365E-01
p  5   2.0957E-01  -1.8412E-01  -5.6425E-02  -3.3691E-02
p  7   3.2562E-01  -2.1507E-02   9.9371E-02  -5.8301E-03
p  9  -1.0110E-01  -5.9565E-02   7.2752E-02   3.1912E-02
p 11  -6.9267E-02  -1.3017E-01   1.4386E-02   5.9292E-02
p 14  -2.3605E-01   8.3748E-03   7.9274E-03   3.0155E-04
p 15  -3.8013E-02   2.0113E-01  -2.3882E-01   1.6948E-01
p 16   8.8608E-03   4.7365E-01   1.3006E-01  -1.3484E-01
p 17   2.0691E-01   1.1714E-01  -1.2217E-01   3.1716E-02

absolute contributions of the main populations

            axe 1        axe 2        axe 3        axe 4
p  1   1.1160E-02   3.8891E-02   1.7252E-01   5.5416E-01
p  3   2.4581E-04   6.1370E-02   1.1427E-01   6.3482E-02
p  5   9.4964E-02   1.6041E-01   3.4647E-02   1.9901E-02
p  7   3.5566E-01   3.3958E-03   1.6671E-01   9.2452E-04
p  9   3.1898E-02   2.4231E-02   8.3130E-02   2.5769E-02
p 11   1.0353E-02   8.0016E-02   2.2477E-03   6.1513E-02
p 14   3.7255E-01   1.0263E-03   2.1149E-03   4.9300E-06
p 15   6.1167E-04   3.7476E-02   1.2151E-01   9.8591E-02
p 16   8.1120E-05   5.0728E-01   8.7962E-02   1.5232E-01
p 17   1.2247E-01   8.5903E-02   2.1488E-01   2.3333E-02

relative contributions of the main populations

       axe 1  axe 2  axe 3  axe 4
p  1   0.083  0.132  0.255  0.508
p  3   0.003  0.335  0.272  0.094
p  5   0.519  0.401  0.038  0.013
p  7   0.904  0.004  0.084  0.000
p  9   0.328  0.114  0.170  0.033
p 11   0.176  0.623  0.008  0.129
p 14   0.972  0.001  0.001  0.000
p 15   0.008  0.223  0.315  0.159
p 16   0.000  0.863  0.065  0.070
p 17   0.590  0.189  0.206  0.014

analysis of the profile of traits (columns)

coordinates on each of the 4 axes retained

            axe 1        axe 2        axe 3        axe 4
y 1    3.2938E-01  -1.7463E-01  -8.6884E-02  -2.4774E-02
y 2   -7.7651E-02  -1.3818E-01   5.9662E-02   1.3529E-01
y 3    4.0768E-01   1.3120E-01   2.6263E-01  -2.3599E-02
y 4   -6.3601E-02   1.4337E-01  -3.8470E-02  -3.3040E-03
y 5    1.1285E-01   8.6746E-03  -1.1558E-01   1.0646E-01
y 6   -3.0142E-01  -1.4065E-01   8.5430E-02  -3.4915E-02
y 7   -2.1321E-01  -1.6263E-01   1.5305E-02  -1.4626E-02
y 8   -8.2540E-02  -1.1132E-01  -3.3593E-02  -2.6540E-01

absolute contributions of the main traits

            axe 1        axe 2        axe 3        axe 4
y 1    3.5724E-01   2.1975E-01   1.2510E-01   1.6387E-02
y 2    1.4784E-02   1.0246E-01   4.3927E-02   3.6393E-01
y 3    2.6934E-01   6.1049E-02   5.6260E-01   7.3182E-03
y 4    3.7535E-02   4.1741E-01   6.9118E-02   8.2137E-04
y 5    1.7179E-02   2.2216E-04   9.0707E-02   1.2397E-01
y 6    2.5436E-01   1.2121E-01   1.0284E-01   2.7674E-02
y 7    4.4096E-02   5.6149E-02   1.1436E-03   1.6826E-03
y 8    5.4660E-03   2.1761E-02   4.5568E-03   4.5822E-01

relative contributions of the main traits

       axe 1  axe 2  axe 3  axe 4
y 1    0.729  0.205  0.051  0.004
y 2    0.116  0.368  0.069  0.353
y 3    0.655  0.068  0.272  0.002
y 4    0.153  0.778  0.056  0.000
y 5    0.202  0.001  0.212  0.180
y 6    0.724  0.158  0.058  0.010
y 7    0.471  0.274  0.002  0.002
y 8    0.066  0.120  0.011  0.683

DAUGEY: graphics of the results from ANACOR with identification of the populations or of the traits.

Graphics by factor coordinates on each pair of axes
For the Constellation Analysis, the reference space is that of the first 4 axes
Option DISTANCE 0 (0 = not normalized, 1 = normalized)
Graphics of the population points & of the trait points
The DUMAS file contains 18 records giving the coordinates of each of the points (populations and/or traits) on the 4 axes chosen.

****************************************************************
Graph of the factors by coordinates: 52 line spacings * 100 columns
(coordinates positive or null, because expressed as deviation from the minimum)
(* = population point, + = trait point, # = double or multiple point)
Population identifier = number of 1 to 4 digits
Trait identifier = number of 1 to 3 digits preceded by a y
If more than 10 points on 1 line, the identifiers concern the first 10.
Les autres indicatifs seront alors reportes a la page suivante

[Alphanumeric plot: Axe 1 (vertical, 0.0000E+00 to 7.0909E-01) against Axe 2 (horizontal, 0.0000E+00 to 6.5777E-01), showing the 10 population points (*) and the 8 trait points (+) with their labels and codes.]

[Alphanumeric plot: Axe 1 (vertical, 0.0000E+00 to 7.0909E-01) against Axe 3 (horizontal, 0.0000E+00 to 6.0944E-01), same points and symbols.]

[Alphanumeric plot: Axe 2 (vertical, 0.0000E+00 to 6.5777E-01) against Axe 3 (horizontal, 0.0000E+00 to 6.0944E-01), same points and symbols.]

NORMEX : conversion d'un fichier de norme ANTAR a la norme ANTAR etendue
Le fichier dumas a ete converti a la norme ANTAR etendue
-----------------------------------------------------------------------
NORMEX : conversion d'un fichier de norme ANTAR a la norme ANTAR etendue
Le fichier dista a ete converti a la norme ANTAR etendue
-----------------------------------------------------------------------
Fichier DUMAS - graphiques par coordonnees - maintenu et recopie sous le nom de daugey1002.gnu
Fichier DISTA - distances - maintenu et recopie sous le nom de daugey2002.gnu

POSKI1
Histogrammes de Composantes Principales (issues de SARTO ou de facteurs d'Analyse des Correspondances (issus de DAUGEY)

valeurs-limites des variables synthetiques :

                               minimum   maximum
variable synthetique vs   1    -0.3014    0.7091
variable synthetique vs   2    -0.1841    0.6578
variable synthetique vs   3    -0.3468    0.6094
variable synthetique vs   4    -0.3439    0.5475

nombre de populations ou facteurs retenus : 18

histogramme de la variable synthetique vs   1
5 classes d'intervalle 0.202 depuis -0.301 jusqu'a 0.709
limite inf incluse limite sup exclue (sauf pour derniere classe)
moyenne : 0.305 , ecart-type : 0.222
coefficient de symetrie : 0.274 , coefficient d'aplatissement : 1.761

table des frequences

classe  lim inf  lim sup  nbre  pourcent    cumul
   1     -0.301   -0.099     0     0.000    0.000
   2     -0.099    0.103     3    16.667   16.667
   3      0.103    0.305     8    44.444   61.111
   4      0.305    0.507     2    11.111   72.222
   5      0.507    0.709     5    27.778  100.000

graphe de l'histogramme (pourcent): $ = 0.601 % arrondi au plus pres
surface distribution normale en surimpression = surface histogramme
numerotation des classes a la base des rectangles (voir page suivante)

[Alphanumeric histogram of vs 1: one bar of $ signs per class (classes 1 to 5), with the normal curve overprinted as *.]
histogramme de la variable synthetique vs   2
5 classes d'intervalle 0.168 depuis -0.184 jusqu'a 0.658
limite inf incluse limite sup exclue (sauf pour derniere classe)
moyenne : 0.209 , ecart-type : 0.176
coefficient de symetrie : 0.978 , coefficient d'aplatissement : 2.741

table des frequences

classe  lim inf  lim sup  nbre  pourcent    cumul
   1     -0.184   -0.016     0     0.000    0.000
   2     -0.016    0.153     9    50.000   50.000
   3      0.153    0.321     5    27.778   77.778
   4      0.321    0.489     2    11.111   88.889
   5      0.489    0.658     2    11.111  100.000

graphe de l'histogramme (pourcent): $ = 0.676 % arrondi au plus pres
surface distribution normale en surimpression = surface histogramme
numerotation des classes a la base des rectangles (voir page suivante)

[Alphanumeric histogram of vs 2: one bar of $ signs per class (classes 1 to 5), with the normal curve overprinted as *.]

histogramme de la variable synthetique vs   3
5 classes d'intervalle 0.191 depuis -0.347 jusqu'a 0.609
limite inf incluse limite sup exclue (sauf pour derniere classe)
moyenne : 0.323 , ecart-type : 0.161
coefficient de symetrie : -0.612 , coefficient d'aplatissement : 3.000

table des frequences

classe  lim inf  lim sup  nbre  pourcent    cumul
   1     -0.347   -0.156     0     0.000    0.000
   2     -0.156    0.036     1     5.556    5.556
   3      0.036    0.227     3    16.667   22.222
   4      0.227    0.418     9    50.000   72.222
   5      0.418    0.609     5    27.778  100.000

graphe de l'histogramme (pourcent): $ = 0.676 % arrondi au plus pres
surface distribution normale en surimpression = surface histogramme
numerotation des classes a la base des rectangles (voir page suivante)

[Alphanumeric histogram of vs 3: one bar of $ signs per class (classes 1 to 5), with the normal curve overprinted as *.]

histogramme de la variable synthetique vs   4
5 classes d'intervalle 0.178 depuis -0.344 jusqu'a 0.548
limite inf incluse limite sup exclue (sauf pour derniere classe)
moyenne : 0.320 , ecart-type : 0.144
coefficient de symetrie : -1.100 , coefficient d'aplatissement : 4.258

table des frequences

classe  lim inf  lim sup  nbre  pourcent    cumul
   1     -0.344   -0.166     0     0.000    0.000
   2     -0.166    0.013     1     5.556    5.556
   3      0.013    0.191     1     5.556   11.111
   4      0.191    0.369     9    50.000   61.111
   5      0.369    0.548     7    38.889  100.000

graphe de l'histogramme (pourcent): $ = 0.676 % arrondi au plus pres
surface distribution normale en surimpression = surface histogramme
numerotation des classes a la base des rectangles (voir page suivante)

[Alphanumeric histogram of vs 4: one bar of $ signs per class (classes 1 to 5), with the normal curve overprinted as *.]

Fichier poski1002.gnu cree en vue des graphiques ( 5 enregistrements)

PRADET : analyse des constellations sur matrice de similarite issue de DAG, MAHAL, DAUGEY ou EUCLID
Matrices issues de DAUGEY
Similarites des facteurs ligne et colonne (populations + caracteres)
Option MATSUP = 0 , GRSUP = 0 , ZOOM = 3
Algorithme d'agglomeration = 0 (0 = lien simple , 1 = lien moyen , 2 = lien complet)
Nombre total de groupes formes : 27

Dendrogramme (similarite au .01 plus proche si <0.995 et a 0.99 sinon)
----------------------------------------------------------------------
1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100 0.000
+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
p   9 : med_manag*====>+-* |
p  11 :trade_empl*====>+-*--* |
y   2 :rent_house*====>+----* |
p  14 :   labours*====>+----** |
y   6 : mob_house*====>+-* * | |
y   7 : holi_vill*====>+-*---** |
y   4 :   parents*====>+------*----* |
p  17 :   retired*====>+------* * | |
y   5 :   friends*====>+------*----*------* |
p   5 :factory_ow*====>+* * | |
y   1 :     hotel*====>+*-----------------*-* |
p   7 : big_manag*====>+--------------------* |
y   3 :priv_house*====>+--------------------** |
p   3 :agri_emplo*====>+--------------* * | |
p  15 : serv_empl*====>+--------------*------*----------------------------* |
y   8 : miscellan*====>+----------------------------* * | |
p   1 :agri_owner*====>+----------------------------*---------------------* |
p  16 : misce_act*====>+--------------------------------------------------*-------
echelle :
+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
1.000 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100 0.000

Miscellaneous

In the example above, we assumed that everything worked perfectly and that no corrections were needed. As the reader is (we hope) now familiar with the DIOGENE menu manager, he should be able to find by himself any additional programs that he needs for corrections. For instance, he can view and correct the “.fam” file with the EDIFAM and CORFAM programs, which he can run directly by simply typing their names (don’t forget the [CR] key!). If you use your own data, you can also correct the data section of your file (lebart in our example) with the CORREC program, or its parameter file (lebart.p in our example) with the CREPAR program. Moreover, it is possible to create DIOGENE data files without using Excel, in a somewhat faster way (STOCK program). Note that if you create DIOGENE data files directly, it is easy to perform the reverse of the ASCBIN conversion (from DIOGENE format to Excel format): use the TOTEM and TOTEMG programs, both of which can be run directly or via the menu manager.

If you read the above outputs, you will see that, at several points, output files are created whose names combine a three-digit number with the “.gnu” suffix (for instance daugey1002.gnu, daugey2002.gnu and poski1002.gnu in the listings above). These files are intended for Gnuplot graphics, through a particular interface that we shall describe later. However, you can also use them to produce Excel graphics after running the TOTEM and TOTEMG format translators.
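As an aid to reading the ANACOR tables listed earlier (coordinates, absolute contributions, relative contributions), here is a minimal Python sketch of a correspondence analysis on a small table. It is not DIOGENE code: Python and NumPy, the function name correspondence_analysis and the 4 x 3 table of counts are all illustrative assumptions, and the formulas are the usual textbook definitions, which this tutorial does not spell out. These definitions do appear consistent with the listing: the ten absolute contributions printed for axis 1 add up to about 1.000, and the relative contributions of a point add up to slightly less than 1 (0.978 for population 1) because only the 4 retained axes are printed.

```python
import numpy as np


def correspondence_analysis(counts):
    """Simple correspondence analysis of a two-way contingency table.

    Returns the eigenvalues, the row and column coordinates, and the
    absolute and relative contributions of the rows.
    """
    n = np.asarray(counts, dtype=float)
    p = n / n.sum()                      # relative frequencies
    r = p.sum(axis=1)                    # row masses
    c = p.sum(axis=0)                    # column masses
    # Standardized residuals from the independence model r * c.
    s = (p - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    u, sv, vt = np.linalg.svd(s, full_matrices=False)
    k = min(n.shape) - 1                 # number of non-trivial axes
    u, sv, vt = u[:, :k], sv[:k], vt[:k, :]
    eig = sv ** 2                        # inertia carried by each axis
    rows = u * sv / np.sqrt(r)[:, None]  # row coordinates on the axes
    cols = vt.T * sv / np.sqrt(c)[:, None]
    # Absolute contribution: share of an axis' inertia due to each row.
    abs_contrib = r[:, None] * rows ** 2 / eig
    # Relative contribution: share of a row's squared distance to the
    # centroid carried by each axis (a squared cosine).
    rel_contrib = rows ** 2 / (rows ** 2).sum(axis=1, keepdims=True)
    return eig, rows, cols, abs_contrib, rel_contrib


# Hypothetical 4 x 3 table of counts (NOT the Lebart housing data).
table = [[20, 5, 10],
         [8, 15, 6],
         [4, 9, 18],
         [12, 7, 3]]
eig, rows, cols, abs_contrib, rel_contrib = correspondence_analysis(table)
print(np.round(abs_contrib.sum(axis=0), 6))  # sum over rows, per axis
print(np.round(rel_contrib.sum(axis=1), 6))  # sum over axes, per row
```

The column contributions are obtained in the same way by exchanging the roles of rows and columns. The sketch assumes that every retained axis has a non-zero eigenvalue and that no row profile coincides exactly with the mean profile; otherwise the two divisions would need guarding.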
The last two points that I shall mention are the following:

(a) You can delete all the “service” files and the “.gnu” files from the working directory by typing the “reset” command, which can also be accessed from the menu manager, as you can see from the above screens. This automatic cleaning is useful to avoid accumulating, in the working directory, files that need not be archived…and to preserve good relations with your system manager.

(b) The alphanumeric labelling of “populations” (or, more generally, “modalities of factors”) allows a more general format than the one described in this example. The number of corresponding labels is generally smaller than the number of data lines. Moreover, this number may vary from one column to another. We shall see in the next tutorial how the labelling system that we used here may easily be extended to every case, by properly using the “;” and “*” symbols. This system, which results in the creation of a normalised “.fam” file and the addition of the indicators’ min-max values to the “.p” file, extends the ANTAR standard, defined in the late eighties by Thierry Labbé, whose nickname is now “labbé@Pierr”, due to his present internet address ([email protected]). We now name this system “EAN” (Extended ANTAR Normalization) or “NAE” (Norme ANTAR Etendue). This first tutorial is dedicated to him. My own internet address is private, as I am a poor retired guy, a simple (and quite old) disciple of Thierry Labbé: [email protected].

Montpellier, March 16, 2004
Ph. Baradat