DOCUMENT: tutorial001

Illustration of running DIOGENE for processing a simple Factorial Correspondence Analysis
Introduction
The documents of the “tutorial series”, like this one, deal only with the practical problems that a user faces before being able to use the
results of data processing (thesis, publication, selection of the best genotypes etc.). Other documents, the “notice series”, give general
information about the biometrical and genetic models and about the precautions required to draw conclusions from experimental results.
These practical problems fall into three categories:
(1) Data organization: how to prepare the data files to be processed, i.e. the measured or graded traits (observations), the codes which
describe the experimental structure that the user wants to analyze (indicators), and the alphanumeric labels of these two categories,
which we shall call “trait labels”, “indicator labels” or, more compactly, “labels”;
(2) Processing selection: how to select the program (or sequence of programs) which best fits the user’s aim, whatever the kind of results
expected (data management, estimation of parameters, exploration of data structure etc.);
(3) Importation of results: after data processing, the final results must be obtained in a well-edited form that anybody can integrate
into a variety of documents.
For point (1), although it is not the only way to prepare files for DIOGENE, data in Excel format are favoured here because this spreadsheet is
the de facto standard tool used by researchers. For point (3), and for the same reasons, Word is also favoured, even if a “paper” edition of the
results is possible without it, for instance via an A3 network printer. Of course, Excel and Word may be replaced by programs of the same
categories, for instance the equivalent spreadsheets and word processors available for Linux.
The examples given in this document were obtained on a Sun Microsystems Enterprise 450 server running the Solaris 9 version of Unix (for running
DIOGENE) and on Windows XP (Professional edition) with Office XP (for preparation of the Excel data files and importation/edition of results).
The standard programs connecting the PC to the server were Tera Term Pro (as alphanumeric terminal emulator) and WS-FTP95 LE (for file transfer).
These programs can be freely downloaded from the Internet.
Lastly, although DIOGENE now supports 2-D and 3-D graphics (through an interface with Gnuplot), this aspect will not be developed in the basic
“tutorials”, because it requires graphic terminal emulators such as X-Window. The points where the programs create the files dedicated to these
graphics will be indicated. Note that these files can also be used to produce graphics with Excel (or an equivalent spreadsheet).
The graphics edited here will be alphanumeric. Specialized tutorials will later be devoted to high-definition graphics.
The above Excel table (lebart.xls file) displays data from the book “Traitement des données statistiques” by Lebart, Morineau and Fénelon (1979,
Dunod), taken from the Correspondence Analysis section (pp. 306-326). It crosses 10 categories of profession (rows) with 8 kinds of housing (columns).
Each cell contains the count of individuals who belong to that particular combination. It is therefore a contingency table from which one can describe
the association between rows (arbitrarily considered as “populations”) and columns, which we shall consider (again arbitrarily) as “traits”. DIOGENE
requires numeric data represented in “binary” form: here, these are the within-cell counts and, additionally for this analysis, one indicator, also
binary, which is the “population code” (column B). On the margins are the trait labels (row 1) and the population labels (column A). Note that each
population label is followed by a semicolon. We shall see that this is mandatory at this stage.
The “data organization” step is fulfilled by the Excel table, but two points remain to be solved before obtaining a file really suitable for
processing by DIOGENE:
- Conversion of labels into ASCII format;
- Conversion of numerical codes and of counts into binary representation.
For this purpose, we first need to translate the Excel file into ASCII format using spaces as separators, which is called a “.prn” file. As shown
by the above display, this is achieved by selecting this sub-option after the “write according to format…” Excel command.
The next step is to transfer the “lebart.prn” file from the PC to the server (into the “tutorial” sub-directory, for this example). This transfer
must be done in ASCII mode, as shown by the above screen. If it is done in binary mode, the control codes at the end of each line (CR) will cause
errors when the transcoding program (presented below) is run: in that case, no transcoded file will be obtained.
When the transfer is complete, you can view and modify the lebart.prn file using vi (as here) or another text editor available under Unix (emacs,
xemacs etc.). You must check that there is at least one space character between two adjacent columns and that each “population label” is followed
by a semicolon (otherwise, an error will occur when the transcoding program is run). A space is not required (but may be present) between the last
character of a label and the semicolon, but spaces are not allowed within a label. Note that perfect alignment of the items of a column is not
required: the only requirement is at least one space character between two adjacent columns. Note also that the maximum length of a label (for
populations as well as traits) is 10 characters (the semicolon excluded). Longer labels (such as the first one in the above example) will be
truncated to their first ten characters. If these requirements are not fulfilled, correct the file, either by going back to Excel or directly under Unix.
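The checks listed above can be automated before running the transcoding program. The following Python sketch is only an illustration and is not part of DIOGENE: the function name check_prn_row is hypothetical, and it assumes that each data row starts with the population label and its semicolon, followed by numeric fields. It also reports stray CR characters, which would point to a transfer done in binary mode.

```python
def check_prn_row(line):
    """Return the list of problems found in one data row of a .prn file.

    Hypothetical helper: assumes the row is "<label>; <numeric fields...>".
    """
    problems = []
    if "\r" in line:
        problems.append("carriage return present (transfer not done in ASCII mode?)")
    text = line.rstrip("\r\n")
    label, semicolon, rest = text.partition(";")
    if not semicolon:
        problems.append("population label is not followed by a semicolon")
        return problems
    label = label.strip()  # spaces before the semicolon are tolerated
    if not label:
        problems.append("empty population label")
    if any(ch.isspace() for ch in label):
        problems.append("space inside the population label")
    if len(label) > 10:
        problems.append(f"label longer than 10 characters, would be truncated to {label[:10]!r}")
    for field in rest.split():  # any run of spaces separates two columns
        try:
            float(field)
        except ValueError:
            problems.append(f"non-numeric field {field!r}")
    return problems


if __name__ == "__main__":
    for row in ("agri_owner; 1 160 28\n", "agri owner; 1 160 28\n"):
        print(repr(row), check_prn_row(row))
```

An empty list means that the row satisfies the requirements stated above; the header row with the trait labels would need its own check, since its exact layout is not described here.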
The next step is to run the transcoding program, ASCBIN, updated to process the two categories of labels. DIOGENE can be used in two ways: an
“expert mode” for those who know the programs they need and their functions, and a “beotian mode” (following Dr Arbez’s famous expression), i.e. a
novice mode, for people who use it for the first time and/or are not sufficiently familiar with the software. The first way is much quicker, because
it allows many shortcuts. The second way, the only one considered here, uses the scrolling menu manager (DIOGENE), where the user is not supposed
to know the programs’ names and functions.
Type “diogene” (in lower-case letters) followed by [CR]. The following “first level menu” is displayed. The “p” branch that we have selected for
this example means that the program(s) will not be run immediately (interactive mode) but via a “script” (or “procedure”), an executable file
which runs the program(s) once all the parameters have been entered (the detailed mechanism will be explained in the “notice” series). At this
point, we answer the question “name of the script” (not shown here). For this example, this name was “transcode”.
We have now reached the “second level” of the menu hierarchy, where we select the “file management” branch.
This choice leads to a third level of the hierarchy, where we can select the kind of management operation that we need. Note that this branch
extends to software administration (language translation, compilation, link-editing etc.). At this point, we select the “transcodage, copy-again”
heading, as we have to transcode an ASCII file.
This gives access to the fourth level of the hierarchy, where we select the “suited for Excel” program, ASCBIN. Note that this program
automatically detects the presence of “population” labels and runs a program specialised in managing these labels. We shall comment on this in
more detail below. Note that ASCBIN automatically chains two other programs involved in data file conversion:
- NORMEX (addition of the minimum and maximum values of the indicators to the parameter file; see its definition later);
- CREFAM (restructuring of the “population” label file into an appropriate format).
These programs do not need additional parameters. Moreover, they can be run separately in other particular situations which will be examined later.
After validation of the choice of ASCBIN (Y key [CR]), the menu manager, DIOGENE, runs the OPEP2 supervisor, which is designed to generate the
script and the associated parameter file. Note that the TRIPROG option, which serves to select a subset of a chain of programs in complex
treatments, is not used here.
The supervisor confirms that the parameters to be entered concern ASCBIN and begins to display the corresponding sequence of questions. Note that
all the choices are “contextual”: the supervisor performs logical tests so as to ask only for the parameters made necessary by the choices made
previously.
On the above screen display, the supervisor asks for the description of the “record format”. A record may be considered as a row of the Excel table
(considering only the binary values). DIOGENE processes these records one at a time (saving memory space) and extends their definition to several
“individuals”, the number of which may vary from one record to another. More explanations will be given in the “notice” series.
On the above screen, we select the “column headings” option because our file does include labels for traits. The “reaffectation option” is not
relevant here.
The first of the last two questions concerns the possibility of redefining “null” values, which mean “absence of observation for this trait”. For
instance, the conventional value may be “-1” for one piece of software and “-5” for another. DIOGENE distinguishes two conventional values of this
kind: “-5”, which means “no observation”, and “-9”, which means “the given individual was dead when the corresponding trait was observed”.
More explanations will be given in the “notice” series. The second question is about the presence of spaces as data delimiters. This is always the
case, except for old-fashioned files created in the sixties or seventies.
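To make the idea of redefining “null” values concrete, here is a minimal Python sketch. It is not ASCBIN’s code: the function recode_missing and its arguments are hypothetical, and only the two conventional values “-5” and “-9” come from the text above.

```python
NO_OBSERVATION = -5  # DIOGENE convention: no observation for this trait
DEAD = -9            # DIOGENE convention: individual dead when the trait was observed


def recode_missing(values, foreign_code, diogene_code=NO_OBSERVATION):
    """Replace the missing-value code of another software by a DIOGENE code."""
    return [diogene_code if v == foreign_code else v for v in values]


if __name__ == "__main__":
    # A file produced elsewhere that uses -1 for "no observation".
    print(recode_missing([12, -1, 7, -1], foreign_code=-1))
```

Such a recoding is only safe when the foreign code cannot occur as a genuine observation of the trait.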
At this point, all the necessary parameters have been entered, and the above screen offers the possibility of checking and/or correcting them
through a system call from the supervisor to the vi text editor. The “v” (validate) choice ends the building of the script. However, it is always
possible to correct a script later by changing some of the parameters in the corresponding parameter file. In this example, if you use vi, type:
vi transcode.don*
Then select the appropriate file (“:n” to get the next parameter file associated with the script), correct the parameters, and type “:w” to store
the corrected file and “:q” to leave vi.
The above screen is displayed after selecting the “Z” key, when the menu manager (DIOGENE) takes over from the supervisor (OPEP2), or when the user
has decided to delay the execution of the script by selecting the “W” key. In the latter case, the user has to type:
gene
which is the name of a c-shell command that runs the script after a check/creation stage ensuring that all the “service files” needed to run the
DIOGENE library are actually present in the working directory. Selecting the “Z” key runs the gene command directly. Note that we have
deliberately omitted to mention that the “carriage return” [CR] key has to be pressed after each command or parameter.
The above screen and the screen shown below show the final result of the file processing. The original lebart.prn file gave rise to three
associated files:
- lebart (binary data including population codes and observed “traits”);
- lebart.p (a mixture of numeric data, in integer format, and alphanumeric data) which describes the lebart file (maximum
number of indicators, number of individuals/record, number of traits, names of indicators and labels of the traits);
- lebart.fam (purely alphanumeric file of “population labels”).
From a computing point of view, and for reasons which will be explained in the notice series, the first and the last are “direct access” files,
whereas the second is an “unformatted sequential” file.
Note that the file of “population labels”, lebart.fam, is formatted in such a way that the rank (row number) of a particular label corresponds
to the value of the numeric population code found in the lebart binary file. The reason is very simple: during data processing, these labels are
stored in column vectors and are directly addressed by the value of the numeric population codes. Note also that character strings of ten
“*” fill the places where there is no population label. These “dummy” characters will be useful to manage the cases where there is not one set of
labels but several sets, corresponding to m factors of a more or less complex experiment. An example will be given in the next tutorial.
Note that the “.fam” file is required to obtain the labels of the factors but not to process the data: it is an optional file.
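The addressing scheme described above (label rank = numeric population code, ten “*” where no label exists) can be sketched as follows in Python. This is an illustration of the principle only, not the actual layout of the direct-access “.fam” file; the function build_label_table is hypothetical, while the codes and labels are those that appear in the results listed at the end of this tutorial.

```python
FILLER = "*" * 10  # dummy string used where a code has no label


def build_label_table(labels_by_code):
    """Return a list in which entry (code - 1) holds the label of that code."""
    if not labels_by_code:
        return []
    table = [FILLER] * max(labels_by_code)
    for code, label in labels_by_code.items():
        if code < 1:
            raise ValueError("population codes are assumed to start at 1")
        table[code - 1] = label[:10]  # labels are limited to 10 characters
    return table


if __name__ == "__main__":
    labels = {1: "agri_owner", 3: "agri_emplo", 5: "factory_ow", 7: "big_manag",
              9: "med_manag", 11: "trade_empl", 14: "labours", 15: "serv_empl",
              16: "misce_act", 17: "retired"}
    for rank, label in enumerate(build_label_table(labels), start=1):
        print(rank, label)
```

With this layout, finding the label of population code k is a single direct access to row k, which is the reason given above for the format.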
Now, our aim is to view the data. As it is a very small file, we can choose the “screen display” option. For this, we first select the
“interactive” mode in the first-level branch of the DIOGENE menu manager (see above).
Then, as before, we select the “file management” section.
Then, we select the “edition” section.
Then, we select the “screen edition” of a data file (LIRE program).
LIRE is run directly and, once the file name has been entered, displays the characteristics of the file (obtained from the lebart.p file) and asks
for the ranks of the first and last records to be displayed.
After the rank of the last record has been entered, the edition is completed and the program asks for the parameters of another section to be
displayed, or for the closing of the edition (0).
Now, we wish to process our data by Correspondence Analysis. We run the menu manager, which has now become familiar. This time we select the
“script” option, which is mandatory for at least the first program of the chain (this is always the case for biometrical processing of data
files following the “extended ANTAR normalization”, such as the lebart file in three parts: lebart, lebart.p and lebart.fam).
At the second level of the menu hierarchy, we select the “biometry-quantitative genetics” branch.
Then, we select, for instance, the “complete” built-in chain including dendrograms, underlined above (more complex chains are possible, for
instance “Discriminant Analysis of Qualitative data”, which is partly built on Correspondence Analysis, or “Multiple Factorial Correspondence Analysis”).
As before, the OPEP2 supervisor is activated by the menu manager and we ignore the “triprog” option.
The supervisor displays the ordered list of the programs to be chained and begins to ask for the parameters of two “supplementary” programs, ANTAR
and DEFCAR, which are not explicitly included in the chain but are necessary to define the data file and the traits to be processed. For the latter,
DIOGENE supplies a syntactic analyser able to translate general formulae into a FIFO “pile” of ordered elementary operations. This means that
users can enter their own trait definitions involving the traits present in the file. We shall illustrate this possibility in a later tutorial. Here,
we have to enter the name of the data file (lebart). The lebart.p file is assumed to exist; the existence of the lebart.fam file is tested and, if
it is available, the labels are used in the appropriate parts of the edition of results.
On the above screen, there is a choice of somewhat “exotic” options which are of no use here. A by-pass makes it possible to skip this choice.
DIOGENE uses the three modes of data processing listed above. The qualitative modes (1 and 2) have to be selected to derive, from the observed
data, elementary frequencies or derived ones, using a disjunction process via an incidence matrix; for instance, to derive the frequencies of the
five categories of an insect attack scored from 0 to 5. Here, as the elementary data are directly counts, we select the “quantitative” mode.
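The disjunction process mentioned above can be pictured with a short Python sketch. It shows the general principle of disjunctive coding (one 0/1 column per category), not DIOGENE’s own implementation; the function category_frequencies and the example scores are hypothetical.

```python
def category_frequencies(scores, categories):
    """Build the 0/1 incidence matrix of the scores and the category frequencies."""
    # One row per individual, one column per category.
    incidence = [[1 if score == cat else 0 for cat in categories] for score in scores]
    # Column sums of the incidence matrix are the counts per category.
    counts = [sum(column) for column in zip(*incidence)]
    frequencies = [count / len(scores) for count in counts]
    return incidence, frequencies


if __name__ == "__main__":
    scores = [0, 2, 2, 5]  # hypothetical attack scores of four individuals
    incidence, freqs = category_frequencies(scores, categories=[0, 1, 2, 3, 4, 5])
    for row in incidence:
        print(row)
    print(freqs)
```

When the file already contains counts per category, as in the lebart table, this step is unnecessary, which is why the “quantitative” mode is chosen here.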
When one wants to select the first n traits of a file, without any transformation, DIOGENE offers the shortcut “-n”, which was used here.
As before, after entry of the last parameter, the supervisor offers the possibility to check and, if necessary, correct the entered parameters.
The parameters specific to the first module, CHAIX, are the minimum expected count of a cell required for a valid Chi-square test of the row x
column association, and the number of dimensions retained to compute the distances and the projections of the points on the two-by-two
combinations of axes.
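For readers who want to see what the test and the “minimum expectation” refer to, here is a minimal Python sketch of the usual Pearson Chi-square statistic of a contingency table. It follows the textbook definition and is not CHAIX’s code; the function chi_square and the small table are hypothetical. For the lebart table (10 rows, 8 columns), the number of degrees of freedom is (10 - 1) x (8 - 1) = 63, as printed in the results.

```python
def chi_square(table):
    """Return (statistic, degrees of freedom, smallest expected count)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    min_expected = float("inf")
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            min_expected = min(min_expected, expected)
            statistic += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, min_expected


if __name__ == "__main__":
    print(chi_square([[10, 20], [20, 10]]))
```

A user-defined threshold on the smallest expected count, as asked by CHAIX, would simply be compared with the third value returned.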
Lastly, as is the rule for all the “leading” programs of the biometrical chains, it is possible to filter simultaneously on two indicators and
two traits. The “by-pass” value “0” replaces the upper and lower bounds that were not entered by dummy values which lead to no selection. If a
great number of simultaneous filters is needed, a utility is available in the DIOGENE library (CRECAR program) which allows an unlimited number
of filters. Two other multi-purpose utilities are also available for this kind of use (TROUVE and COPIE).
The ANACOR program computes the eigenvectors and eigenvalues of the row x column associations. The only parameters concern the edition of the
coordinates of the points and of their absolute and relative contributions.
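As a reading aid for the ANACOR output, the following Python sketch recomputes the percentages of the total variance from the seven eigenvalues printed in the RESULTS section. The function inertia_shares is hypothetical; the eigenvalues and the grand total (31079) are copied from the results. It also illustrates a standard property of Correspondence Analysis: the sum of the non-trivial eigenvalues multiplied by the grand total gives the Chi-square statistic of the table, up to rounding of the printed values.

```python
# Eigenvalues and grand total copied from the RESULTS section of this tutorial.
EIGENVALUES = [4.4316e-02, 2.0249e-02, 8.8049e-03, 5.4653e-03,
               2.3815e-03, 2.1451e-03, 2.4065e-04]
GRAND_TOTAL = 31079


def inertia_shares(eigenvalues):
    """Return (total inertia, % per axis, cumulated % per axis)."""
    total = sum(eigenvalues)
    shares = [100.0 * value / total for value in eigenvalues]
    cumulated = []
    running = 0.0
    for share in shares:
        running += share
        cumulated.append(running)
    return total, shares, cumulated


if __name__ == "__main__":
    total, shares, cumulated = inertia_shares(EIGENVALUES)
    for axis, (share, cum) in enumerate(zip(shares, cumulated), start=1):
        print(f"lambda {axis}: {share:7.3f} %  (cumulated {cum:7.3f} %)")
    print(f"total inertia x grand total = {total * GRAND_TOTAL:.3f}")
```

The first two axes together carry about 77 % of the total variance, which is why the two-dimensional graphics of axes 1 and 2 summarize most of the association.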
The parameters of the DAUGEY program, which computes the Chi-square distances and the two-dimensional graphics, concern the “normalization” of
distances (using the square roots of the eigenvalues) and the choice of the entities to be represented (populations, traits, or both).
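The Chi-square distance between two rows of a contingency table compares their profiles (row percentages), each column being weighted by the inverse of its relative weight in the whole table. The Python sketch below follows this textbook definition; it is not DAUGEY’s code, does not include the optional “normalization” by the square roots of the eigenvalues, and the function names and the small table are hypothetical.

```python
def column_masses(table):
    """Relative weight of each column in the whole table."""
    grand_total = sum(sum(row) for row in table)
    return [sum(col) / grand_total for col in zip(*table)]


def chi2_distance(row_a, row_b, masses):
    """Chi-square distance between the profiles of two rows of counts."""
    total_a, total_b = sum(row_a), sum(row_b)
    squared = sum((a / total_a - b / total_b) ** 2 / m
                  for a, b, m in zip(row_a, row_b, masses))
    return squared ** 0.5


if __name__ == "__main__":
    table = [[10, 20], [20, 10]]
    masses = column_masses(table)
    print(chi2_distance(table[0], table[1], masses))
```

Two populations with proportional rows have identical profiles and therefore a distance of zero, whatever their sizes.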
The POSKI1 program plots the histograms of the coordinates of the population and/or trait points (of no great interest here). We retain 5 classes.
The last program of the chain, PRADET, is shared with a variety of other kinds of computations, such as Discriminant and Principal Components
Analysis. Its purpose is to represent the relationships between points as dendrograms, after transformation of distances into similarities (see
their definitions in later notices). The parameters concern the edition of the results preceding the dendrograms, such as the similarity matrix,
and the choice of one of the three available algorithms for building the clusters.
The last entry of PRADET makes it possible to repeat the dendrograms from the same similarity matrix with another combination of parameters.
When the menu manager again displays the first level of the “biometry-quantitative genetics” branch from which we selected the chain of programs,
we select the “Z” key to run the script; the gene command is again activated and starts when the name of the script is entered (afc here).
Short comments about the afc script.
The script shows the sequence of the seven programs involved in processing the lebart file. The first line means that the commands are written in
c-shell and are to be executed by the corresponding interpreter. Lines 2 and 3 test the existence of two service files: erreur, for error reports,
and sortie, for the output of results. If they exist, these files are deleted. Line 4 does the same for the output files labelled with the script’s
name (see later). The following lines alternate between statements that run each program, specifying the “parameter file” from which it has to read
its own parameters, and tests that redirect the script toward the “echec” label in case of fatal error. After each test, a line redirects the
content of the sortie file toward a “.out” file identified by the script’s name and by a three-digit number. In case of success, the message
“end of procedure…” is displayed.
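The control flow of such a script can be summarized by the following Python sketch. It is only a transposition of the description above, under stated assumptions: the real script is written in c-shell and is not reproduced here, the function run_chain and the callable run_program are hypothetical stand-ins, and the rule that only programs having written a sortie file receive a numbered “.out” file is an assumption made to match the five output files mentioned in this tutorial.

```python
from pathlib import Path


def run_chain(script, programs, run_program, workdir="."):
    """Mimic the control flow described for a DIOGENE script (illustrative only).

    run_program(name, workdir) is a hypothetical callable standing in for one
    program of the chain; it returns True on success and may write 'sortie'.
    """
    workdir = Path(workdir)
    # Start of the script: delete old service files and old output files.
    for name in ("erreur", "sortie"):
        (workdir / name).unlink(missing_ok=True)
    for old in workdir.glob(f"{script}.out*"):
        old.unlink()
    outputs = []
    for program in programs:
        if not run_program(program, workdir):
            return False  # plays the role of the jump to the "echec" label
        sortie = workdir / "sortie"
        if sortie.exists():
            # Assumption: only programs that wrote results get a numbered file.
            target = workdir / f"{script}.out.{len(outputs) + 1:03d}"
            sortie.replace(target)
            outputs.append(target)
    # Concatenation into a single file (a step performed by the GENE utility).
    combined = workdir / f"{script}.out"
    combined.write_text("".join(p.read_text() for p in outputs))
    print("end of procedure")
    return True
```

Calling run_chain("afc", [...], run_program) with the seven program names would leave afc.out.001, afc.out.002 and so on in the working directory, together with the concatenated afc.out file.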
After the successful run of the script, all the results can be found in five “.out” files:
afc.out.001 (output of CHAIX), afc.out.002 (output of ANACOR), afc.out.003 (output of DAUGEY),
afc.out.004 (output of POSKI1) and afc.out.005 (output of PRADET). The GENE utility automatically concatenates these five files into a single one,
afc.out. This file can be viewed using Tera Term with vi or another text editor.
The above screen shows the importation of the afc.out file from the server to the PC. Note that it may be done in either binary or “ASCII”
mode. NB: Disregard the name of the file on the screen (afcout in place of afc.out), as the example was built before the last modification of GENE.
Now, we shall deal with the importation of results into Word. Note that DIOGENE’s alphanumeric output does not exceed 155 columns in width and
that a simple terminal emulator such as Tera Term can easily be configured to accommodate this format. To import the results into Word and view any
page of results completely, it is mandatory to select character size “8” and the “Courier New” font, which gives a fixed character width.
Do not select a proportional font such as “Times New Roman”, because alignment problems would occur within columns! The above screen shows
the selection of the character size.
The above display shows the classical, well-known selection for numbering the pages in Word. Page numbers make it easier to select the parts of
the results which are of interest to the researcher, especially if there are several hundred pages.
For an adequate format of the results, it is mandatory to use the “landscape” orientation and to set the top and bottom margins as narrow as
possible (this is important to avoid truncating the graphics of the positions of the points on the two-axis combinations, for instance). The above
selection of parameters is adequate for all outputs of DIOGENE. The results edited according to the chosen parameters are listed hereafter. Our
purpose is not to comment on these results, from which you will learn that people who earn more money usually live in the more pleasant houses.
Note that the trait labels are automatically managed throughout these outputs with the appropriate format choices (right or left justification,
optimal position on the graphics and so on).
RESULTS
$*$*$*$*$* 24 heures sur 24, DIOGENE 2004 a votre service ! *$*$*$*$*$*
Biometrie du fichier : lebart
---------------------------------
housing/profession relationshi
noms des 8 caracteres etudies :
---------------------------------
y 1 = hotel
y 2 = rent_house
y 3 = priv_house
y 4 = parents
y 5 = friends
y 6 = mob_house
y 7 = holi_vill
y 8 = miscellan
..............................................................................
definition des
8 caracteres etudies :
y 1 = x1
y 2 = x2
y 3 = x3
y 4 = x4
y 5 = x5
y 6 = x6
y 7 = x7
y 8 = x8
..............................................................................
CHAIX : preparation d'Analyse Factorielle des Correspondances
nombre d'indicatifs/enregistrement : 1
option 0 (0 = caracteres quantitatifs, 1 = caracteres qualitatifs)
enregistrements numeros 1 a 10
numero du premier individu traite/enregistrement : 1 , dernier = 1 , saut = 1
8 caracteres observes, 8 etudies
contraintes       lim. inf.     lim. sup.
indicatif  1         -99999         99999
indicatif  1         -99999         99999
caractere  1     -99999.000     99999.000
caractere  1     -99999.000     99999.000
Nombre de populations retenues : 10
(en imposant une esperance minimum/cellule de 0.0000E+00)
Profil des populations num_code (% total/ligne) : 0 = profil moyen
(prend en compte l'ensemble des variables : principales & supplementaires)
population  libelle     nombre        y 1        y 2        y 3        y 4        y 5        y 6        y 7        y 8
                                    hotel rent_house priv_house    parents    friends  mob_house  holi_vill  miscellan
         1  agri_owner     796     20.101      3.518      0.000     40.327      4.523     17.714      5.653      8.166
         3  agri_emplo     260     13.462     13.077      0.385     68.462      3.077      0.000      1.538      0.000
         5  factory_ow    2978     23.506     11.887      7.690     32.203      6.212      9.805      3.996      4.701
         7  big_manag     4620     20.801     10.195     13.701     34.199      6.602      7.792      3.506      3.203
         9  med_manag     4298     13.309     12.494      6.491     39.297      4.793     17.403      3.606      2.606
        11  trade_empl    2972     14.838     13.594      5.585     36.306      5.989     14.603      5.989      3.096
        14  labours       9209      8.503     12.097      4.202     44.000      5.397     15.897      5.701      4.202
        15  serv_empl      583     11.149      7.376      3.602     50.429     13.551      9.777      3.087      1.029
        16  misce_act     1423      5.411      4.216     13.282     58.960      3.725      8.714      1.968      3.725
        17  retired       3940     18.807      8.426      8.299     45.406      7.893      5.990      2.589      2.589
         0  ensemble     31079     14.592     10.866      7.182     41.121      5.978     12.407      4.299      3.555
Profil des caracteres (% du total par colonne) : 0 = profil moyen
(prend en compte l'ensemble des variables : principales & supplementaires)
 caractere  libelle     nombre        p 1        p 3        p 5        p 7        p 9       p 11       p 14       p 15       p 16       p 17
                               agri_owner agri_emplo factory_ow  big_manag  med_manag trade_empl    labours  serv_empl  misce_act    retired
         1  hotel         4535      3.528      0.772     15.436     21.191     12.613      9.724     17.266      1.433      1.698     16.340
         2  rent_house    3377      0.829      1.007     10.483     13.947     15.902     11.963     32.988      1.273      1.777      9.831
         3  priv_house    2232      0.000      0.045     10.260     28.360     12.500      7.437     17.339      0.941      8.468     14.651
         4  parents      12780      2.512      1.393      7.504     12.363     13.216      8.443     31.706      2.300      6.565     13.998
         5  friends       1858      1.938      0.431      9.957     16.416     11.087      9.580     26.749      4.252      2.853     16.738
         6  mob_house     3856      3.657      0.000      7.573      9.336     19.398     11.255     37.967      1.478      3.216      6.120
         7  holi_vill     1336      3.368      0.299      8.907     12.126     11.602     13.323     39.296      1.347      2.096      7.635
         8  miscellan     1105      5.882      0.000     12.670     13.394     10.136      8.326     35.023      0.543      4.796      9.231
         0  ensemble     31079      2.561      0.837      9.582     14.865     13.829      9.563     29.631      1.876      4.579     12.677
Test Chi-2 de l'association ligne*colonne ( 63 d.l.) = 2598.273
Seuil de signification de ce test (en %) = 0.000
(le test ne prend en compte que les variables principales)
Matrice des associations entre caracteres (colonnes)
(cette matrice ne prend en compte que les variables principales)
                       y 1        y 2        y 3        y 4        y 5        y 6        y 7        y 8
                     hotel rent_house priv_house    parents    friends  mob_house  holi_vill  miscellan
1: hotel             0.168
2: rent_house        0.125      0.114
3: priv_house        0.111      0.085      0.090
4: parents           0.235      0.208      0.169      0.422
5: friends           0.097      0.080      0.067      0.156      0.064
6: mob_house         0.124      0.121      0.083      0.225      0.082      0.140
7: holi_vill         0.075      0.071      0.050      0.131      0.050      0.079      0.047
8: miscellan         0.072      0.061      0.048      0.120      0.045      0.069      0.041      0.039
ANACOR : Analyse Factorielle des Correspondances
nombre de dimensions (axes a contribution relatives >.0001) = 7
valeurs propres (par ordre decroissant):
  lambda 1    lambda 2    lambda 3    lambda 4    lambda 5    lambda 6    lambda 7
4.4316E-02  2.0249E-02  8.8049E-03  5.4653E-03  2.3815E-03  2.1451E-03  2.4065E-04
valeurs propres en % de la variance totale:
(la plus gande -egale a 1- est eliminee)
  lambda 1    lambda 2    lambda 3    lambda 4    lambda 5    lambda 6    lambda 7
    53.008      24.221      10.532       6.537       2.849       2.566       0.288
valeurs propres cumulees en % de la variance totale:
  lambda 1    lambda 2    lambda 3    lambda 4    lambda 5    lambda 6    lambda 7
    53.008      77.229      87.760      94.298      97.146      99.712     100.000
vecteurs propres (dans l'ordre des valeurs propres):
              y 1         y 2         y 3         y 4         y 5         y 6         y 7         y 8
ve 1   5.9769E-01 -1.2159E-01  5.1898E-01 -1.9374E-01  1.3107E-01 -5.0434E-01 -2.0999E-01 -7.3932E-02
ve 2  -4.6878E-01 -3.2009E-01  2.4708E-01  6.4607E-01  1.4905E-02 -3.4815E-01 -2.3696E-01 -1.4751E-01
ve 3  -3.5370E-01  2.0959E-01  7.5007E-01 -2.6290E-01 -3.0118E-01  3.2069E-01  3.3817E-02 -6.7504E-02
ve 4  -1.2801E-01  6.0326E-01 -8.5547E-02 -2.8660E-02  3.5209E-01 -1.6636E-01 -4.1019E-02 -6.7692E-01
ve 5  -2.9275E-01 -2.3328E-01  1.6478E-01 -2.2969E-01  7.4047E-01 -5.0467E-02  4.3122E-01  2.0778E-01
ve 6  -1.1988E-01  4.1439E-01 -9.5594E-03  5.7492E-02 -3.1282E-01 -6.0434E-01  4.7595E-01  3.4771E-01
ve 7  -1.8849E-01  3.7431E-01 -2.4959E-02 -8.8577E-02  2.4898E-01 -4.5595E-02 -6.6443E-01  5.5710E-01
analyse du profil des populations (lignes)
coordonnees sur chacun des 4 axes retenus
            axe 1       axe 2       axe 3       axe 4
p  1  -1.3896E-01 -1.7535E-01 -2.4353E-01 -3.4388E-01
p  3  -3.6085E-02  3.8541E-01 -3.4680E-01  2.0365E-01
p  5   2.0957E-01 -1.8412E-01 -5.6425E-02 -3.3691E-02
p  7   3.2562E-01 -2.1507E-02  9.9371E-02 -5.8301E-03
p  9  -1.0110E-01 -5.9565E-02  7.2752E-02  3.1912E-02
p 11  -6.9267E-02 -1.3017E-01  1.4386E-02  5.9292E-02
p 14  -2.3605E-01  8.3748E-03  7.9274E-03  3.0155E-04
p 15  -3.8013E-02  2.0113E-01 -2.3882E-01  1.6948E-01
p 16   8.8608E-03  4.7365E-01  1.3006E-01 -1.3484E-01
p 17   2.0691E-01  1.1714E-01 -1.2217E-01  3.1716E-02
contributions absolues des populations principales
             axe 1        axe 2        axe 3        axe 4
p   1   1.1160E-02   3.8891E-02   1.7252E-01   5.5416E-01
p   3   2.4581E-04   6.1370E-02   1.1427E-01   6.3482E-02
p   5   9.4964E-02   1.6041E-01   3.4647E-02   1.9901E-02
p   7   3.5566E-01   3.3958E-03   1.6671E-01   9.2452E-04
p   9   3.1898E-02   2.4231E-02   8.3130E-02   2.5769E-02
p  11   1.0353E-02   8.0016E-02   2.2477E-03   6.1513E-02
p  14   3.7255E-01   1.0263E-03   2.1149E-03   4.9300E-06
p  15   6.1167E-04   3.7476E-02   1.2151E-01   9.8591E-02
p  16   8.1120E-05   5.0728E-01   8.7962E-02   1.5232E-01
p  17   1.2247E-01   8.5903E-02   2.1488E-01   2.3333E-02
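The absolute contributions listed above can be checked and screened with a few lines of Python. This is only a sketch: the values are copied from the listing, while the function names and the "above the uniform share 1/n" threshold are assumptions (a common rule of thumb for reading a correspondence analysis, not something DIOGENE prints).

```python
# Absolute contributions of the populations to each axis, copied from the
# listing (one list per axis, populations in the printed order).
populations = [1, 3, 5, 7, 9, 11, 14, 15, 16, 17]
abs_contrib = {
    1: [1.1160e-02, 2.4581e-04, 9.4964e-02, 3.5566e-01, 3.1898e-02,
        1.0353e-02, 3.7255e-01, 6.1167e-04, 8.1120e-05, 1.2247e-01],
    2: [3.8891e-02, 6.1370e-02, 1.6041e-01, 3.3958e-03, 2.4231e-02,
        8.0016e-02, 1.0263e-03, 3.7476e-02, 5.0728e-01, 8.5903e-02],
    3: [1.7252e-01, 1.1427e-01, 3.4647e-02, 1.6671e-01, 8.3130e-02,
        2.2477e-03, 2.1149e-03, 1.2151e-01, 8.7962e-02, 2.1488e-01],
    4: [5.5416e-01, 6.3482e-02, 1.9901e-02, 9.2452e-04, 2.5769e-02,
        6.1513e-02, 4.9300e-06, 9.8591e-02, 1.5232e-01, 2.3333e-02],
}


def sums_to_one(values, tol=1e-3):
    """True when the contributions on one axis add up to 1 within tol."""
    return abs(sum(values) - 1.0) < tol


def above_average(values, labels):
    """Labels whose contribution exceeds the uniform share 1/n.

    The 1/n threshold is an assumed rule of thumb, not a DIOGENE output.
    """
    threshold = 1.0 / len(values)
    return [lab for lab, v in zip(labels, values) if v > threshold]


for axis, values in abs_contrib.items():
    print(axis, sums_to_one(values), above_average(values, populations))
```

Under this threshold, axis 1 is carried mainly by populations 7, 14 and 17, which is also what the first graph suggests.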
contributions relatives des populations principales
       axe 1   axe 2   axe 3   axe 4
p   1  0.083   0.132   0.255   0.508
p   3  0.003   0.335   0.272   0.094
p   5  0.519   0.401   0.038   0.013
p   7  0.904   0.004   0.084   0.000
p   9  0.328   0.114   0.170   0.033
p  11  0.176   0.623   0.008   0.129
p  14  0.972   0.001   0.001   0.000
p  15  0.008   0.223   0.315   0.159
p  16  0.000   0.863   0.065   0.070
p  17  0.590   0.189   0.206   0.014
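In correspondence analysis the relative contributions are usually read as squared cosines, so their sum over the retained axes measures how well a point is represented in the retained subspace. The sketch below applies that standard reading to the values of the listing; the names `quality` and `best_axis` are illustrative and are not DIOGENE programs.

```python
# Relative contributions (squared cosines) of each population on the four
# retained axes, copied from the listing: population -> [axe 1 .. axe 4].
rel_contrib = {
    1:  [0.083, 0.132, 0.255, 0.508],
    3:  [0.003, 0.335, 0.272, 0.094],
    5:  [0.519, 0.401, 0.038, 0.013],
    7:  [0.904, 0.004, 0.084, 0.000],
    9:  [0.328, 0.114, 0.170, 0.033],
    11: [0.176, 0.623, 0.008, 0.129],
    14: [0.972, 0.001, 0.001, 0.000],
    15: [0.008, 0.223, 0.315, 0.159],
    16: [0.000, 0.863, 0.065, 0.070],
    17: [0.590, 0.189, 0.206, 0.014],
}


def quality(cos2):
    """Share of a point's inertia captured by the retained axes."""
    return sum(cos2)


def best_axis(cos2):
    """1-based index of the axis on which the point is best represented."""
    return 1 + max(range(len(cos2)), key=lambda k: cos2[k])


for pop, cos2 in rel_contrib.items():
    print(pop, round(quality(cos2), 3), best_axis(cos2))
```

A quality well below 1, as for population 15, indicates that a noticeable part of the point's profile lies outside the four retained axes.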
analyse du profil des caracteres (colonnes)
coordonnees sur chacun des 4 axes retenus
             axe 1        axe 2        axe 3        axe 4
y   1   3.2938E-01  -1.7463E-01  -8.6884E-02  -2.4774E-02
y   2  -7.7651E-02  -1.3818E-01   5.9662E-02   1.3529E-01
y   3   4.0768E-01   1.3120E-01   2.6263E-01  -2.3599E-02
y   4  -6.3601E-02   1.4337E-01  -3.8470E-02  -3.3040E-03
y   5   1.1285E-01   8.6746E-03  -1.1558E-01   1.0646E-01
y   6  -3.0142E-01  -1.4065E-01   8.5430E-02  -3.4915E-02
y   7  -2.1321E-01  -1.6263E-01   1.5305E-02  -1.4626E-02
y   8  -8.2540E-02  -1.1132E-01  -3.3593E-02  -2.6540E-01
contributions absolues des caracteres principaux
             axe 1        axe 2        axe 3        axe 4
y   1   3.5724E-01   2.1975E-01   1.2510E-01   1.6387E-02
y   2   1.4784E-02   1.0246E-01   4.3927E-02   3.6393E-01
y   3   2.6934E-01   6.1049E-02   5.6260E-01   7.3182E-03
y   4   3.7535E-02   4.1741E-01   6.9118E-02   8.2137E-04
y   5   1.7179E-02   2.2216E-04   9.0707E-02   1.2397E-01
y   6   2.5436E-01   1.2121E-01   1.0284E-01   2.7674E-02
y   7   4.4096E-02   5.6149E-02   1.1436E-03   1.6826E-03
y   8   5.4660E-03   2.1761E-02   4.5568E-03   4.5822E-01
contributions relatives des caracteres principaux
       axe 1   axe 2   axe 3   axe 4
y   1  0.729   0.205   0.051   0.004
y   2  0.116   0.368   0.069   0.353
y   3  0.655   0.068   0.272   0.002
y   4  0.153   0.778   0.056   0.000
y   5  0.202   0.001   0.212   0.180
y   6  0.724   0.158   0.058   0.010
y   7  0.471   0.274   0.002   0.002
y   8  0.066   0.120   0.011   0.683
DAUGEY : graphique des resultats issus d'ANACOR avec reperage
des populations ou des caracteres.
Graphique par coordonnees des facteurs sur chaque couple d'axes
Pour l'Analyse des Constellations, l'espace de reference est celui des 4 premiers axes
Option DISTANCE 0 (0 = non normees, 1 = normees)
Graphique des points-population & des points-caractere
Le fichier DUMAS comprend 18 enregistrements
donnant les coordonnees de chacun des points (populations
et/ou caracteres) sur les 4 axes choisis.
****************************************************************
Graphe des facteurs par coordonnees : 52 interlignes*100 colonnes
(coordonnees positives ou nulles, car exprimees en deviation au minimum)
(* = point population, + = point caractere, # = point double ou multiple)
Indicatif de population = nombre de 1 a 4 chiffres
Indicatif de caractere = nombre de 1 a 3 chiffres precede d'un y
Si plus de 10 points sur 1 ligne, les indicatifs concernent les 10 premiers.
Les autres indicatifs seront alors reportes a la page suivante
[Graph of Axe 1 (vertical, 0.0000E+00 at the bottom to 7.0909E-01 at the top) against Axe 2 (horizontal, 0.0000E+00 to 6.5777E-01).
Points by printed row, from 1 (top) to 52 (bottom); * = population, + = trait:
row  1: +priv_house (y003)
row  6: +hotel (y001)
row  7: *big_manag (7)
row 15: *factory_ow (5), *retired (17)
row 22: +friends (y005)
row 30: *misce_act (16)
row 33: *serv_empl (15), *agri_emplo (3)
row 35: *trade_empl (11), +parents (y004)
row 36: +rent_house (y002), +miscellan (y008)
row 38: *med_manag (9)
row 41: *agri_owner (1)
row 46: +holi_vill (y007)
row 48: *labours (14)
row 52: +mob_house (y006)]
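The row on which each label is printed can be recovered from its coordinate on the vertical axis. The sketch below uses a rule inferred from this listing only (52 rows, coordinates taken as deviations from the maximum, integer part plus one); it is an assumption that happens to agree with the printed rows, not the DAUGEY source code, and `plot_row` is a hypothetical name.

```python
import math

N_ROWS = 52  # "52 interlignes" in the DAUGEY header

# Coordinates on axis 1, copied from the population and trait tables.
axis1 = {
    "p1": -1.3896e-01, "p3": -3.6085e-02, "p5": 2.0957e-01,
    "p7": 3.2562e-01, "p9": -1.0110e-01, "p11": -6.9267e-02,
    "p14": -2.3605e-01, "p15": -3.8013e-02, "p16": 8.8608e-03,
    "p17": 2.0691e-01,
    "y1": 3.2938e-01, "y2": -7.7651e-02, "y3": 4.0768e-01,
    "y4": -6.3601e-02, "y5": 1.1285e-01, "y6": -3.0142e-01,
    "y7": -2.1321e-01, "y8": -8.2540e-02,
}


def plot_row(x, lo, hi, n_rows=N_ROWS):
    """Row (1 = top) on which a point of coordinate x would be printed.

    Assumed rule: the axis range is cut into n_rows equal bands counted
    from the maximum, and the minimum falls on the last row.
    """
    frac = (hi - x) / (hi - lo)
    return min(n_rows, math.floor(frac * n_rows) + 1)


lo, hi = min(axis1.values()), max(axis1.values())
rows = {name: plot_row(x, lo, hi) for name, x in axis1.items()}
for name in sorted(rows, key=rows.get):
    print(f"row {rows[name]:2d}: {name}")
```

The same function applied to the axis 2 coordinates reproduces the rows of the Axe 2 against Axe 3 graph, which supports the assumed rule without proving it.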
[Graph of Axe 1 (vertical, 0.0000E+00 at the bottom to 7.0909E-01 at the top) against Axe 3 (horizontal, 0.0000E+00 to 6.0944E-01).
Points by printed row, from 1 (top) to 52 (bottom); * = population, + = trait:
row  1: +priv_house (y003)
row  6: +hotel (y001)
row  7: *big_manag (7)
row 15: *retired (17), *factory_ow (5)
row 22: +friends (y005)
row 30: *misce_act (16)
row 33: *agri_emplo (3), *serv_empl (15)
row 35: +parents (y004), *trade_empl (11)
row 36: +miscellan (y008), +rent_house (y002)
row 38: *med_manag (9)
row 41: *agri_owner (1)
row 46: +holi_vill (y007)
row 48: *labours (14)
row 52: +mob_house (y006)]
[Graph of Axe 2 (vertical, 0.0000E+00 at the bottom to 6.5777E-01 at the top) against Axe 3 (horizontal, 0.0000E+00 to 6.0944E-01).
Points by printed row, from 1 (top) to 52 (bottom); * = population, + = trait:
row  1: *misce_act (16)
row  7: *agri_emplo (3)
row 22: *serv_empl (15)
row 27: +parents (y004)
row 28: +priv_house (y003)
row 29: *retired (17)
row 37: +friends (y005), *labours (14)
row 40: *big_manag (7)
row 43: *med_manag (9)
row 47: +miscellan (y008)
row 48: *trade_empl (11)
row 49: +rent_house (y002), +mob_house (y006)
row 51: +holi_vill (y007)
row 52: *agri_owner (1), +hotel (y001), *factory_ow (5)]
NORMEX : conversion d'un fichier de norme ANTAR a la norme ANTAR etendue
Le fichier dumas a ete converti a la norme ANTAR etendue
-----------------------------------------------------------------------
NORMEX : conversion d'un fichier de norme ANTAR a la norme ANTAR etendue
Le fichier dista a ete converti a la norme ANTAR etendue
-----------------------------------------------------------------------
Fichier DUMAS - graphiques par coordonnees - maintenu et recopie sous le nom de daugey1002.gnu
Fichier DISTA - distances - maintenu et recopie sous le nom de daugey2002.gnu
POSKI1
Histogrammes de Composantes Principales (issues de SARTO)
ou de facteurs d'Analyse des Correspondances (issus de DAUGEY)
valeurs-limites des variables synthetiques :
                              minimum   maximum
variable synthetique vs 1     -0.3014    0.7091
variable synthetique vs 2     -0.1841    0.6578
variable synthetique vs 3     -0.3468    0.6094
variable synthetique vs 4     -0.3439    0.5475
nombre de populations ou facteurs retenus : 18
histogramme de la variable synthetique vs 1
5 classes d'intervalle 0.202 depuis -0.301 jusqu'a 0.709
limite inf incluse limite sup exclue (sauf pour derniere classe)
moyenne : 0.305 , ecart-type : 0.222
coefficient de symetrie : 0.274 , coefficient d'aplatissement : 1.761
table des frequences
classe   lim inf   lim sup   nbre   pourcent     cumul
     1    -0.301    -0.099      0      0.000     0.000
     2    -0.099     0.103      3     16.667    16.667
     3     0.103     0.305      8     44.444    61.111
     4     0.305     0.507      2     11.111    72.222
     5     0.507     0.709      5     27.778   100.000
graphe de l'histogramme (pourcent): $ = 0.601 % arrondi au plus pres
surface distribution normale en surimpression = surface histogramme
numerotation des classes a la base des rectangles
[Histogram of vs 1: one horizontal bar of $ per class (classes 1 to 5), with the normal curve overprinted as *.]
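The summary statistics printed for the first synthetic variable can be recomputed from the frequency table alone. The sketch below assumes grouped-data formulas: class midpoints weighted by the counts, a standard deviation with divisor n - 1, and symmetry and flatness coefficients built on moments with divisor n. These choices are inferred because they reproduce the printed figures; they are not taken from the POSKI1 documentation.

```python
import math

# Class limits and counts of the first histogram (variable vs 1).
limits = [-0.301, -0.099, 0.103, 0.305, 0.507, 0.709]
counts = [0, 3, 8, 2, 5]


def grouped_stats(limits, counts):
    """Mean, standard deviation, skewness and kurtosis from grouped data."""
    n = sum(counts)
    mids = [(a + b) / 2 for a, b in zip(limits, limits[1:])]
    mean = sum(f * m for f, m in zip(counts, mids)) / n

    def moment(k):
        # Central moment of order k, divisor n.
        return sum(f * (m - mean) ** k for f, m in zip(counts, mids)) / n

    m2, m3, m4 = moment(2), moment(3), moment(4)
    sd = math.sqrt(m2 * n / (n - 1))   # divisor n - 1 (assumed)
    skewness = m3 / m2 ** 1.5          # "coefficient de symetrie"
    kurtosis = m4 / m2 ** 2            # "coefficient d'aplatissement"
    return mean, sd, skewness, kurtosis


mean, sd, skewness, kurtosis = grouped_stats(limits, counts)
print(f"{mean:.3f} {sd:.3f} {skewness:.3f} {kurtosis:.3f}")
```

Because only class midpoints are used, these figures describe the grouped distribution and may differ slightly from statistics computed on the 18 individual coordinates.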
histogramme de la variable synthetique vs 2
5 classes d'intervalle 0.168 depuis -0.184 jusqu'a 0.658
ecart-type : 0.176
coefficient de symetrie : 0.978 , coefficient d'aplatissement : 2.741
classe   lim inf   lim sup   nbre   pourcent     cumul
     1    -0.184    -0.016      0      0.000     0.000
     2    -0.016     0.153      9     50.000    50.000
     3     0.153     0.321      5     27.778    77.778
     4     0.321     0.489      2     11.111    88.889
     5     0.489     0.658      2     11.111   100.000
[Histogram of vs 2: one horizontal bar of $ per class (classes 1 to 5), with the normal curve overprinted as *.]
histogramme de la variable synthetique vs 3
5 classes d'intervalle 0.191 depuis -0.347 jusqu'a 0.609
ecart-type : 0.161
coefficient de symetrie : -0.612 , coefficient d'aplatissement : 3.000
classe   lim inf   lim sup   nbre   pourcent     cumul
     1    -0.347    -0.156      0      0.000     0.000
     2    -0.156     0.036      1      5.556     5.556
     3     0.036     0.227      3     16.667    22.222
     4     0.227     0.418      9     50.000    72.222
     5     0.418     0.609      5     27.778   100.000
[Histogram of vs 3: one horizontal bar of $ per class (classes 1 to 5), with the normal curve overprinted as *.]
histogramme de la variable synthetique vs 4
5 classes d'intervalle 0.178 depuis -0.344 jusqu'a 0.548
ecart-type : 0.144
coefficient de symetrie : -1.100 , coefficient d'aplatissement : 4.258
classe   lim inf   lim sup   nbre   pourcent     cumul
     1    -0.344    -0.166      0      0.000     0.000
     2    -0.166     0.013      1      5.556     5.556
     3     0.013     0.191      1      5.556    11.111
     4     0.191     0.369      9     50.000    61.111
     5     0.369     0.548      7     38.889   100.000
[Histogram of vs 4: one horizontal bar of $ per class (classes 1 to 5), with the normal curve overprinted as *.]
Fichier poski1002.gnu cree en vue des graphiques ( 5 enregistrements)
PRADET : analyse des constellations
sur matrice de similarite issue de DAG, MAHAL, DAUGEY ou EUCLID
Matrices issues de DAUGEY
Similarites des facteurs ligne et colonne (populations + caracteres)
Option MATSUP = 0 , GRSUP = 0 , ZOOM = 3
Algorithme d'agglomeration = 0
(0 = lien simple , 1 = lien moyen , 2 = lien complet)
Nombre total de groupes formes : 27
Dendrogramme (similarite au .01 plus proche si <0.995 et a 0.99 sinon)
---------------------------------------------------------------------
                      1.000     0.900     0.800     0.700     0.600     0.500     0.400     0.300     0.200     0.100     0.000
                      +---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
p  9 : med_manag*====>+-*
p 11 :trade_empl*====>+-*--*
y  2 :rent_house*====>+----*
p 14 :   labours*====>+----**
y  6 : mob_house*====>+-*
y  7 : holi_vill*====>+-*---**
y  4 :   parents*====>+------*----*
p 17 :   retired*====>+------*
y  5 :   friends*====>+------*----*------*
p  5 :factory_ow*====>+*
y  1 :     hotel*====>+*-----------------*-*
p  7 : big_manag*====>+--------------------*
y  3 :priv_house*====>+--------------------**
p  3 :agri_emplo*====>+--------------*
p 15 : serv_empl*====>+--------------*------*----------------------------*
y  8 : miscellan*====>+----------------------------*
p  1 :agri_owner*====>+----------------------------*---------------------*
p 16 : misce_act*====>+--------------------------------------------------*-------
echelle :
                      +---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
                      1.000     0.900     0.800     0.700     0.600     0.500     0.400     0.300     0.200     0.100     0.000
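The listing reports "Algorithme d'agglomeration = 0", that is, single linkage ("lien simple"). As a reminder of what that rule does, here is a minimal sketch of single-linkage agglomeration on a similarity matrix. It is a textbook version and not the PRADET implementation, and the three-point matrix and its labels A, B, C are purely hypothetical.

```python
def single_linkage(sim, labels):
    """Agglomerate by single linkage on a symmetric similarity matrix.

    Returns the merges in order as (similarity, members_a, members_b).
    With single linkage, the similarity between two groups is the
    largest similarity between any pair of their members.
    """
    clusters = [[i] for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = max(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        merges.append((s,
                       sorted(labels[i] for i in clusters[a]),
                       sorted(labels[i] for i in clusters[b])))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges


# Hypothetical similarities between three points A, B and C.
labels = ["A", "B", "C"]
sim = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.5],
    [0.2, 0.5, 1.0],
]
merges = single_linkage(sim, labels)
for step in merges:
    print(step)
```

In the dendrogram above, each merge similarity corresponds to the position of a star on the scale, from 1.000 on the left to 0.000 on the right.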
Miscellaneous
In the example given, we assumed that everything worked perfectly and that no corrections were needed. As (we hope that) the reader is now
familiar with the DIOGENE menu manager, he should be able to find by himself any additional programs that he needs for corrections.
For instance, he can view and correct the “.fam” file using the EDIFAM and CORFAM programs, which he can run directly by simply typing their
names (don’t forget the [CR] key!).
If you use your own data, you can also correct the data section of your file (lebart in our example) using the CORREC program, or its parameter
file (lebart.p in our example) using the CREPAR program.
Moreover, it is possible to create DIOGENE data files without using Excel, and somewhat faster, with the STOCK program. Note that if you
create DIOGENE data files directly, it is easy to perform the reverse of the conversion that ASCBIN does (from DIOGENE format to Excel format): in
that case, use the TOTEM and TOTEMG programs, both of which you can run directly or via the menu manager.
If you read the above outputs, you will see that, at several points, output files are created with names suffixed by “.gnuxxx” (the suffix includes a three-digit number). These files are intended for GNUPLOT graphics through a particular interface that we shall describe later. However, you can also use them to produce
Excel graphics after running the TOTEM and TOTEMG format translators.
The last two points that I shall mention are the following:
(a) You can delete all the “service” files and the “.gnu” files from the working directory by typing the “reset” command, which can also be
accessed from the menu manager, as you can see from the screens above. This automatic cleaning is useful to avoid
accumulating files that need not be archived in the working directory…and to preserve good relations with your system manager.
(b) The alphanumeric labelling of “populations” (or, more generally, “modalities of factors”) allows a more general format than the one described in
this example. The number of corresponding labels is generally smaller than the number of data lines. Moreover, this number may vary from
one column to another. We shall see in the next tutorial how the labelling system that we used here may easily be extended to every case by
properly using the “;” and “*” symbols. This system, which results in the creation of a normalised “.fam” file and the addition of the indicators’
min-max to the “.p” file, extends the ANTAR standard, defined in the late eighties by Thierry Labbé, whose nickname is now “labbé@Pierr”
because of his present internet address ([email protected]). We now call this system “EAN” (Extended ANTAR Normalization) or “NAE”
(Norme ANTAR Etendue). This first tutorial is dedicated to him. My own internet address is private as I am a poor retired guy, a simple (and
quite old) disciple of Thierry Labbé: [email protected].
Montpellier, March 16, 2004
Ph. Baradat