Stemming and lemmatisation: improving
knowledge management through language
processing techniques
Joan-Josep Vallbé ∗
Aleks Jakulin §
M. Antònia Martí †
Dunja Mladenič ¶
Blaž Fortuna ‡
Pompeu Casanovas ‖
Abstract
In this article we show how some aspects of knowledge management can be improved through the distinction between stemming and lemmatisation techniques. To do so, we apply both techniques to the same corpus, in the framework of the building of a legal ontology, and analyse it with Alceste and TextGarden. In the final part of the article we discuss whether the results of the experiment are significant with regard to the analysis done by these software programs. Our discussion focuses on how useful the results are for the ontology building process within the European project SEKT.
1 Introduction
To perform a quantitative analysis of text we must choose a way of representing the words as mere numbers, stripped of meaning. Even if this operation may appear brusque, the patterns of words co-appearing with one another reveal much structure in a corpus of text. Furthermore, some words (judge, attorney, court) tend to appear in the same document or paragraph, so associations can be established directly from the patterns of co-appearance.
∗ [email protected], IDT - Institut de Dret i Tecnologia [Institute of Law and Technology], Universitat Autònoma de Barcelona (UAB).
† [email protected], CLIC - Centre de Llenguatge i Computació [Center of Language and Computation], Universitat de Barcelona (UB).
‡ § ¶ J. Stefan Institute, Ljubljana, Slovenia.
‖ [email protected], IDT - Institut de Dret i Tecnologia [Institute of Law and Technology], Universitat Autònoma de Barcelona (UAB).
As part of the SEKT project (Semantically Enabled Knowledge Technologies, EU-IST Project IST-2003-506826, http://sekt-project.org/), we are developing tools that attempt to move beyond these simple patterns of co-appearance and extend the basic numerical representations with additional information. Such information can be understood as ‘semantic’, as it is a formalization of certain aspects of meaning. The final goal of this research is to improve the quality of access to judicial information, as part of one of the case studies developed in the SEKT project.
Stemming and lemmatisation both represent related words as a single word. For example, the default quantification treats the words ‘am’, ‘are’ and ‘is’ as separate. However, all these words are forms of the same verb, which we denote as ‘be’. By mapping them to one form we simplify the representation and capture patterns of identity. Stemming and lemmatisation are two approaches to this operation.
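As a minimal illustration of this reduction, the sketch below counts word forms with and without mapping them to a shared form. The lookup table `CANONICAL` and the helper `count_forms` are hypothetical names introduced here for illustration; they are not part of Alceste or Text Garden.

```python
from collections import Counter

# Hypothetical lookup table: maps surface forms to one shared form.
CANONICAL = {"am": "be", "are": "be", "is": "be"}

def count_forms(tokens, canonical=None):
    """Count tokens, optionally mapping each one to its shared form first."""
    canonical = canonical or {}
    return Counter(canonical.get(t.lower(), t.lower()) for t in tokens)

tokens = "I am here you are there she is away".split()
raw_counts = count_forms(tokens)                 # 'am', 'are', 'is' stay separate
reduced_counts = count_forms(tokens, CANONICAL)  # all three are counted as 'be'
```

With the mapping applied, the three separate forms collapse into a single entry, so the vocabulary shrinks while the number of occurrences stays the same.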
The rest of this paper is structured as follows. Section 2 describes the two software programs used. Section 3 describes the collection of questions from the legal domain, provided in Spanish, and the results of the experiments. Some potential problems caused by using stemming instead of a more sophisticated lemmatisation are pointed out in Section 4. Section 5 briefly describes the lemmatisation software used and provides the results of experiments on the lemmatised corpus. Section 6 concludes the paper with a discussion of the experiments and findings.
2 The software
2.1 Alceste
Alceste stands for Analyse des Lexèmes Co-occurrents dans les Énoncés Simples d’un TExte [Analysis of the co-occurring lexemes within the simple statements of a text]; its algorithm was created by Max Reinert at the CNRS, partly based on Benzécri’s contributions on textual statistics [2].
The aim of Alceste is to quantify the text in order to extract its most significant structures. Its creator conceives discourse not in terms of its representation but in terms of the activity that takes place in it [14]. The program performs a particular analysis of the “topography of discourse” by creating, confronting and representing different “lexical worlds” from it. Thus we may say that the trace of the meaning of a text can be modelled as the trace of a discourse activity (production and repetition of signs) [1].
Alceste’s method is known as Hierarchical Decreasing Classification (HDC). The corpus to be analysed is successively split into chunks (see Figure 1); the program then observes the distribution of the most significant words within every segment and extracts the most representative words from the text.
Figure 1: Schema of the way Alceste works [9].
It has to be said, however, that what the program extracts are not really words but word reductions. Thus, a “most significant word” is not a lexeme and a segment is not necessarily a sentence. Lexical identification is made by means of a dictionary. Each word is then reduced to a root which is shared with other words; for instance, the different forms of a verb should be reduced to a single “lemma”. The creator of the program calls this reduction process lemmatisation, though throughout this article we will show that it fits better with what is usually referred to as stemming.
Finally, the program classifies the chunks of the corpus, called elemental context units (ECU). The chunking is done according to the punctuation of the text, if any, and to the number of words. Then the program classifies the ECU according to the distribution of the vocabulary that appears within them. Alceste finds the vocabulary in the different ECU and relates them; in other words, it connects the ECU that have common vocabulary, identifies the strongest vocabulary oppositions and extracts some categories of representative segments.¹
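The chunking step can be pictured with a rough sketch. This is not Alceste's actual procedure, whose details are not given here: it only assumes that units are cut at punctuation marks and capped at a maximum number of words, and that two units are related when they share vocabulary. The function names and the `max_words` parameter are hypothetical.

```python
import re

def split_into_units(text, max_words=12):
    """Cut text at punctuation marks, then cap each unit at max_words words."""
    units = []
    for piece in re.split(r"[.;:!?]+", text):
        words = piece.split()
        for start in range(0, len(words), max_words):
            units.append(words[start:start + max_words])
    return units

def shared_vocabulary(unit_a, unit_b):
    """Return the set of word forms that two units have in common."""
    return set(unit_a) & set(unit_b)

text = "el juez acude al juzgado. el juez llama al medico; la orden de proteccion."
units = split_into_units(text, max_words=4)
common = shared_vocabulary(units[0], units[2])
```

Units with a large shared vocabulary would end up close together in a classification, while units with disjoint vocabulary would be opposed.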
2.2 Text Garden Document Atlas
Text Garden is a library of software components enabling complex operations on text data that are usually needed in text mining (http://www.textmining.net/). Document Atlas is one of the components that provides different kinds of document corpus visualization based on Latent Semantic Indexing (LSI) [8] and Multidimensional scaling [4].
LSI is a technique for extracting background knowledge from text documents. It uses a technique from linear algebra called Singular Value Decomposition (SVD), applied to the bag-of-words [16] representation of the text documents, to extract words with similar meanings. This can be viewed as the extraction of hidden semantic concepts from the document corpus. Multidimensional scaling is then applied to the documents, represented using the concepts from LSI, to position them on a 2D plane. The intuition behind applying LSI first is to describe the documents using only the main concepts that appear in them before doing the radical reduction of the space to a 2D plane. After the documents are mapped onto the plane, some other techniques can be used to make the structure of the documents more explicit for the user:
• Landscape generation: a landscape can be generated from the density of points. This is done by estimating the distribution of documents on the plane using a mixture of Gaussians. The landscape is then drawn as a background for the document map.
• Keywords: each point on the plane can be assigned a set of keywords by averaging the TFIDF vectors of the documents that appear within a circle centred on this point with some radius R. These keywords are used to fill the map with words describing different areas of the map. This offers the user a quick flavour of the topics covered in the documents. For a more detailed list of keywords the user can move the mouse to the desired area.
These features can be used to make visualizations more descriptive and informative.
See Figures 3 and 5 for examples of visualization using Text Garden
Document Atlas.
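The LSI and multidimensional scaling steps described in this section can be sketched with NumPy. This is an illustrative sketch only, not the Text Garden implementation: the TF-IDF weighting variant, the use of classical (Torgerson) scaling, the toy documents and all function names are assumptions made here.

```python
import numpy as np

def tfidf_matrix(docs):
    """Build a documents-by-terms TF-IDF matrix from tokenised documents."""
    vocab = sorted({w for d in docs for w in d})
    index = {w: j for j, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d:
            tf[i, index[w]] += 1.0
    df = (tf > 0).sum(axis=0)
    idf = np.log(len(docs) / df) + 1.0  # one common IDF variant (assumption)
    return tf * idf, vocab

def lsi(X, k):
    """Project documents onto the first k LSI concepts via truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k], s

def classical_mds(points, dims=2):
    """Classical MDS: embed points so that pairwise distances are preserved."""
    sq = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    n = sq.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centring matrix
    B = -0.5 * J @ sq @ J                 # double-centred squared distances
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:dims]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

# Toy documents (two about protection orders, two about removal of a body).
docs = [
    "orden proteccion alejamiento".split(),
    "orden alejamiento victima".split(),
    "levantamiento cadaver forense".split(),
    "levantamiento cadaver juzgado".split(),
]
X, vocab = tfidf_matrix(docs)
concepts, singular_values = lsi(X, k=2)
coords = classical_mds(concepts, dims=2)
```

In this toy example the two documents about protection orders land close to each other on the plane and far from the two documents about the removal of a body. The singular values returned by `lsi` are the component weights referred to later in the comparison of the two corpora.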
¹ For more information about Alceste applications, see, among others, [15] and [12].
3 Original corpus
3.1 Type of corpus
In the framework of the ethnographic fieldwork campaign carried out by UAB researchers across Spain [5], in one part of the interviews the researcher asked the judge to formulate concrete questions about the main problems of his daily work, concerning the civil and criminal jurisdictions and his on-duty period. Some of the judges did formulate short and concrete questions; others preferred to explain their doubts or problems in a more detailed way. In either case, UAB researchers extracted all the questions the judges formulated, and these constitute the corpus to be analysed (756 questions posed by the judges).
3.2 Interpretation of the Alceste results
With respect to this corpus, Alceste classifies 431 of the 500 ECU that were created (86.20%) and creates 4 stable classes, which are represented in Figure 2.
Figure 2: Graphical representation of the factorial analysis on the original
corpus.
Following some conclusions regarding other related corpora [5], the X axis of the class projection may be interpreted as representing the path from the private world (family law, left side of the graphic) to the public world (legal system, right side of the graphic). The Y axis somehow represents the boundary between both fields. In particular, word distribution along the Y axis refers (a) to the interlocutory decisions the judge has to make during the “on-duty” period, (b) to the proceedings (lower part of the graphic), and (c) to the legal qualification, trial and judgment (upper part of the graphic). Table 1 shows the most representative words of each class.
Class 1: Proceedings and trial
juicio+(37), verbal+(16), cuantia+(9), instruccion(13), admit+(7), circunstancia+(8), fase+(8), inform+(9), ordinario+(5), pericial+(6), procedimiento+(18),
vista+(13), acto+(7), celebr+(5), dia+(11), escrit+(6), interrogatorio(3), oral+(5),
parte+(25), procurador+(8), prueba+(11), recurso+(5), resolver(6), suspenderse(4),
decision+(3)
Class 2: Enforcement (judgments)
ejecucion+(15), ejecut+(15), embarg+(10), finca+(9), depositario(5), notificacion+(7), pago+(5), demand+(14), interes+(7), quiebra+(6), sentencia(10), accion+(4), acordarse(4), archiv+(6), bienes(4), cabo(12), coche(3), condenad+(3), deposit+(6), edicto+(2), entrega+(4), fallo(3), gasto+(5)
Class 3: Family law (gender violence, divorce, separation,. . . )
alejamiento(21), malo+(22), medida+(14), mujer+(15), orden+(24), proteccion(17),
senor+(14), tratos(22), victima+(10), domestic+(9), padre+(7), violencia(9), denunci+(13), madre(7), marido(6), pension(5), agresor(4), alimentos(4), hijo+(4), lesion+(7), maltratada+(4), nino+(5)
Class 4: Actions during “on-duty” period
juzgado+(64), juez(48), cadaver+(19), detenido+(18), funcionario+(20), guardia+(30), internamiento+(18), judicial+(38), llam+(26), medico+(23), policia(29),
actuacion+(10), acud+(20), asunto+(21), causa+(8), comision(8), competente+(10),
deten+(11), disposicion(9), funcion+(10), hacer(91), hospital(8), informacion+(8),
levantamiento+(11)
Table 1: Specific vocabulary of each class of the original corpus.
Class 1 represents 18.10% of the ECU and has been entitled “Proceedings and trial”, because it mostly refers to the final part of the legal proceeding in the context of the legal system (not in the private field), as may be observed above.
Class 2 represents 14.39% of the ECU and occupies more or less the same place as class 1 in the word projection graphic. However, class 2 expresses a lexical world of its own, as it refers to a specific final part of the legal proceeding which raises doubts and problems for less experienced judges: the enforcement of judgments. Its vocabulary is especially illustrative. Class 3 represents 13.46% of the ECU and rather clearly refers to the intermediate steps of the proceeding in family law cases. In general, it contains some external elements of this type of proceedings.
Finally, Class 4 represents 54.06% of the ECU and expresses specific actions the judge has to perform during his “on-duty” period. This class is strongly opposed to classes 1 and 2, as it refers to the very initial steps of any incoming case during the “on-duty” period, and it is also opposed (although more weakly) to class 3 because we go from the private field (family law) to the public field (legal system). The lexical forms that represent this class include the typical “on-duty” activities, their main actors, and the physical places where these actions take place.
3.3 Interpretation of the Text Garden Document Atlas results
In order to process the text using Text Garden Document Atlas, we first extracted the stems from the Alceste reports, which have the form of, say, ‘funcionario+’. We then used these patterns to perform a kind of stemming on the original corpus. Furthermore, a few stems and stop words that were thought not to be representative of the topic were removed: es, un, una, lo, los, ni, ido, ha, las, debe, viene, hace, puedo, dos, son, y, el, se, deber, hacer, etc. Without this step, it is quite foreseeable that we would obtain a partitioning of the questions into one group that starts with ‘Why?’, another with ‘What?’ and yet another with ‘Who?’. While such a taxonomy would be interesting for a linguist, it is not interesting in the present context of the legal case.
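A minimal sketch of this pre-processing step follows. It assumes that a pattern such as ‘funcionario+’ stands for "any word starting with funcionario", which is an interpretation made here rather than a documented Alceste convention; the function name, the short pattern list and the reduced stop-word set are illustrative.

```python
STEM_PATTERNS = ["funcionario+", "juzgado+", "deten+"]          # illustrative subset
STOP_WORDS = {"es", "un", "una", "lo", "los", "el", "se", "y"}  # part of the list in the text

def apply_stems(tokens, patterns, stop_words):
    """Replace tokens by the longest matching stem pattern and drop stop words."""
    prefixes = sorted((p.rstrip("+") for p in patterns), key=len, reverse=True)
    result = []
    for token in tokens:
        word = token.lower()
        if word in stop_words:
            continue
        match = next((p for p in prefixes if word.startswith(p)), None)
        result.append(match + "+" if match else word)
    return result

question = "el funcionario y los funcionarios se van al juzgado"
stemmed = apply_stems(question.split(), STEM_PATTERNS, STOP_WORDS)
```

Trying the longest prefix first avoids a short pattern capturing a word that a more specific pattern should claim.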
Each question is represented as a ‘document’ in the TextGarden nomenclature, and the corpus was then analysed and visualized by the system (see Figure 3). We used the 125 most important LSI components. Each component is described by two sets of keywords representing the two extremes of the dimension spanned by this component. For instance, the first two components are the following:
• levantamiento, cadaver, proced, llev cabo, forense, / orden, proteccion, alejamiento, tratos, malo, senor, denunci
• juicio rapido, fiscal, parte, policia, ministerio fiscal, / levantamiento cadaver, orden proteccion, llev cabo, alejamiento
The first dimension captures most of the documents in the corpus. It can be interpreted as follows: the content of most documents is captured by placing them along a dimension whose one end is described by words such as “levantamiento, cadaver, proced, llev cabo, forense” and whose opposite end is described by words such as “orden, proteccion, alejamiento, tratos, malo, senor, denunci”.
Figure 3: Visualization of stemmed corpus. Each yellow cross is a particular
question, and the representative words are dispersed among the questions.
The lighter color indicates a higher density of documents in that area.
4 Problems with stemming
If, instead of looking at the different classes and their relations in order to create a “lexical world”, we pay attention to the forms that make up each class (i.e. the “lemmas”), we will observe that these word roots are not precisely lemmas. Table 2 shows some examples.
4.1 Stemming vs. lemmatisation
Stemming works by applying a set of rules that transform a word into its stem, usually by cutting off the word’s suffix. For instance, a stemmer would put all the variants of love (verb and noun), loving (verb), lover (noun), lovely (adjective), etc., behind the reduced form “lov+”. A lemmatiser, on the other hand, would assign the above words to three different lemmas (love, lover and lovely), also taking their grammatical form into account.
Stems          Lemmas
acumulacion    acumulación
amumularse     acumular
acumul+        -
admision       admisión
admit+         admitir
celebracion    celebración
celebr+        celebrar
misma+         mismo
mismo+         -
suspenderse    suspender
suspend+       -
Table 2: Difference between stems and lemmas.
If a stemming process is applied to richly inflected languages such as Spanish, Catalan or Slovenian (where a verb can have 60 forms, not counting compound forms), a lot of information remains hidden and the stemming-based reduction often produces results that are not refined enough. In weakly inflected languages, such as English, most verbs have only four forms: love, loves, loving, loved. The future tense is formed with “will” and the conditional with “would”, plus the infinitive. Thus variation is minimal. Furthermore, many nominal, verbal and/or adjectival forms are identical; for instance, hammer can be a verb or a noun and can build adjectival forms through the addition of other elements, as in hammer-shaped.
Moreover, stemming may put many different forms behind the same stem. For instance, in Spanish it may put estafa [swindle] and estafeta [post office] behind the root “estaf+”, although these two forms are not even related, so the distinction remains hidden. Lemmatisation would give two different forms: estafa and estafeta. Conversely, in some cases stemming gives different stems where there should be a single one. For example, verbal forms like resolver [to resolve] and resuelto [resolved] may give two stems, “resol+” and “resuel+”, which is wrong because they are two forms of the same verb, and the only valid lemma is resolver. Some real examples from our experience can be seen in Table 2.
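Both failure modes can be reproduced with a toy suffix stripper set against a toy dictionary lookup. This is a sketch under strong assumptions: the suffix list is hypothetical and was chosen only to reproduce the stems quoted above, and the small `LEMMAS` dictionary merely stands in for a real morphological analyser.

```python
# Hypothetical suffix list, tried in this order.
SUFFIXES = ["eta", "ver", "to", "a"]

# Hypothetical hand-written dictionary standing in for a real analyser.
LEMMAS = {
    "estafa": "estafa",
    "estafeta": "estafeta",
    "resolver": "resolver",
    "resuelto": "resolver",
}

def naive_stem(word):
    """Strip the first matching suffix and mark the cut with '+'."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)] + "+"
    return word

def lemmatise(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["estafa", "estafeta", "resolver", "resuelto"]:
    print(w, naive_stem(w), lemmatise(w))
```

The stripper merges the unrelated estafa and estafeta under one stem and splits resolver and resuelto into two, while the dictionary lookup keeps the first pair apart and joins the second.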
5 Lemmatisation
Following the initial experiments with the stemmed corpus, we identified the problems mentioned above and repeated the experiments using the lemmatised corpus, as described in this section.
5.1 MACO - Morphological Analyser
In order to better handle natural language properties, we decided to pre-process the data using a morphological analyser, without resorting to the more sophisticated tools developed for corpus analysis in Spanish [7]. Namely, we need neither morphosyntactic tagging nor disambiguation (for which RELAX [11] would be a good solution, since the accuracy of its output varies between 94% and 96%), nor syntactic chunking (as TACAT does). All these tools are organised in a pipeline-like process.
MACO [3] is a Morphological Analyser for Spanish (and Catalan) which provides both lemma(s) and POS-tag(s) for each word and whose output has the following form:
word tag_1-lemma_1 ... tag_n-lemma_n
Tags codify 13 part-of-speech categories (noun, verb, adjective, adverb, pronoun, determiner, preposition, conjunction, interjection, dates, punctuation, numbers and abbreviations) as well as subcategories and morphological features [6], as proposed by EAGLES [10]. The total number of tags is 2853.²
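Since no disambiguation is performed, each word may come with several candidate readings. The sketch below shows one way such output could be read, assuming one word per line followed by whitespace-separated tag-lemma pairs joined by a hyphen. The exact file layout, the function names and the sample line (including its tag strings) are assumptions made for illustration and are not actual MACO output.

```python
def parse_analysis(line):
    """Split one analyser line into the word and its (tag, lemma) readings."""
    word, *pairs = line.split()
    readings = []
    for pair in pairs:
        tag, lemma = pair.split("-", 1)  # split on the first hyphen only
        readings.append((tag, lemma))
    return word, readings

def candidate_lemmas(line):
    """Return the distinct lemmas proposed for a word, in order of appearance."""
    _, readings = parse_analysis(line)
    lemmas = []
    for _, lemma in readings:
        if lemma not in lemmas:
            lemmas.append(lemma)
    return lemmas

# Hypothetical sample line: 'vista' read as a noun or as a participle of 'ver'.
sample = "vista NCFS000-vista VMP00SF-ver"
word, readings = parse_analysis(sample)
lemmas = candidate_lemmas(sample)
```

Splitting on the first hyphen only keeps hyphenated lemmas intact, as long as the tags themselves contain no hyphen.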
5.2 Description of the lemmatized corpus
                          Original corpus (stemming)   Lemmatised corpus
No. of different forms    3074                         2064
No. of occurrences        19861                        19946
Max. freq. of a form      1230                         2208
Hapax                     1666                         934
Table 3: Comparative table of results.
While in the original corpus the program identified 3074 different forms (and 19861 occurrences), in the lemmatised corpus we have 2064 different forms and 19946 occurrences. The difference in the number of different forms arises because the stemming system for Spanish that Alceste uses tends to count some forms as different when it should reduce them to a single one. Thus, while the number of forms is reduced, the number of occurrences increases. This increase, although not important (it could be neglected), can be explained because, on the one hand, Alceste recognizes some types of locution (e.g. a-pesar-de) and, on the other hand, it unifies some words as if they were locutions when they are not (e.g. para-que, lo-que). Indeed, this process seems quite arbitrary.
² For a description of the full set of tags, see [7] and [6].
Moreover, the average frequency per form is higher in the lemmatised corpus (about 10) than in the stemmed or original corpus (about 6), because the number of forms is lower while the number of occurrences is almost unchanged, as the word reduction has been performed in a more refined way. Accordingly, the maximum frequency of a form is also higher in the second corpus (2208) than in the first one (1230). Finally, a relevant question is the difference between the two results in the number of forms that appear only once (hapax). While the first analysis identifies 1666 hapax, the refined analysis through lemmatisation reduces this number to 934. This may be significant in the sense that the dictionary of Alceste may not recognize some forms that could be reduced to a lemma, and counts them as isolated forms that, by chance, occur only once.
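The quantities compared in Table 3 are simple counts over the list of reduced forms. A minimal sketch, with a hypothetical helper `corpus_statistics` and two made-up token lists, shows how they are obtained and why merging forms raises the average frequency and lowers the number of hapax.

```python
from collections import Counter

def corpus_statistics(tokens):
    """Compute the descriptive counts compared in Table 3 for a token list."""
    counts = Counter(tokens)
    occurrences = sum(counts.values())
    return {
        "forms": len(counts),
        "occurrences": occurrences,
        "max_freq": max(counts.values()),
        "hapax": sum(1 for c in counts.values() if c == 1),
        "mean_freq": occurrences / len(counts),
    }

# Made-up examples: the same text reduced in two different ways.
stemmed_like = "resol+ resuel+ el juez resol+ el".split()
lemmatised_like = "resolver resolver el juez resolver el".split()

stats_stemmed = corpus_statistics(stemmed_like)
stats_lemmatised = corpus_statistics(lemmatised_like)

# Average frequency per form, computed from the figures reported in Table 3.
mean_original = round(19861 / 3074, 2)
mean_lemmatised = round(19946 / 2064, 2)
```

The two averages computed from Table 3 come out at roughly 6.5 and 9.7 occurrences per form, in line with the rounded values of 6 and 10 quoted in the text.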
5.3 Alceste results
Once the corpus has been lemmatised, it is submitted to Alceste again. The main differences that we can observe in these new results affect the first steps Alceste performs, that is to say, the calculation of its dictionary (step A2 in the program). Regarding the creation of ECU, in this second analysis the program classifies 378 ECU out of 502 (75.30%), a considerably lower percentage than in the first analysis. Nevertheless, the second analysis creates 6 stable classes, while the first one created just 4. The first aspect we have to consider, as will be seen, is that each of the 6 classes of the second analysis represents a percentage of ECU that cannot be neglected (even class 5).
Regarding the content of the classes, the results are obviously not very different from the former analysis, because we are in fact dealing with the same corpus. Thus we will not repeat the general interpretation here, as it holds for both corpora. However, we can affirm that through the lemmatised corpus we have helped the program to refine its analysis, so the interpretation turns out to be easier and better. As shown in Table 4, part of the representative vocabulary of Class 1 belonged to the former Class 4 (Actions during the “on-duty” period). However, this Class 1, due to its lexical composition, corresponds to the world of the judicial unit and represents 20.1% of the ECU. Alceste has also created another class which is strictly about the actions during the “on-duty” period (Class 6), which represents 27.2% of the ECU. This is hardly surprising if we remember that Class 4 of the former corpus represented more than 54% of the classified ECU. Therefore, here we could find one of the main reasons for the multiplication of classes in the second analysis.
Figure 4: Graphical representation of the factorial analysis of the lemmatised corpus.
Regarding Class 2, Alceste has not changed its results. It has created this class in both corpora, and consequently the two classes contain the same type of specific vocabulary. Moreover, both classes represent a similar percentage of the ECU. Thus we think this is a class that has remained stable and unchanged throughout the process we have described.
With respect to classes 3 and 5 we have a case similar to the first one. While in the former analysis we had one class which contained all the questions referring to proceedings and trial (see Table 1), the new analysis has created two different classes that enable us to distinguish the two aspects. First, Class 3 contains specific vocabulary referring to proceedings; second, Class 5 shows specific lexical elements of a trial. Finally, another element of stability between the first and the second analysis is Class 4: as in the case of the questions related to family law, the program has kept a specific class referring to the enforcement of judgments.
5.4 TextGarden Document Atlas results
The same procedure as was applied to the stemmed corpus was used to visualize the lemmatised corpus. We again removed the words not directly related to the topics. Because the words were in their canonical form, this was much easier than in the case of stemming. Again we used the first 125 LSI components (see Figure 5). It can be seen that the documents are much more smoothly distributed around the plane in this visualization of the lemmatised corpus than in the visualization of the stemmed corpus (Figure 3). In the visualization of the stemmed corpus, there is a dense cluster of documents in the centre. The content of these documents is too ambiguous and the system cannot place them properly. However, when these documents are preprocessed using lemmatisation instead of stemming, this cluster of ambiguous documents is much smaller.
Class 1: Judicial unit
funcionar+ (21), juzgar(26), oficina(11), trabaj+(13), decir(26), llam+(16),
mand+(12), acudir(11), adjunto(4), busc+(4), consult+(4), dato(6), hablar(4), jurisprudencia(3), local+(3), material(6), necesit+(7), policia(14), prensa(4), sala(4),
funerari+(2), hurto(3), informacion(5), miedo(3), robo(3), servicio+(7), sustitu+(4),
tecnico(2), venir(15)
Class 2: Family law (gender violence, divorce, . . . )
alejamiento(22), malo(22), medida(16), orden+(23), proteccion(17), senor+(13),
trat+(22), victima(11), mujer(11), padre(7), denunci+(12), domestico(8), violencia(8), agresor(4), dict+(10), madre(7), marido(6), nino(5), pension(4), psicolog+(5),
separacion(5), abus+(5), alimento(3), ayud+(4), casa(3), cautelar+(3), divorcio(2),
empresa(3), hijo(4), lesion+(6)
Class 3: Proceedings
escrit+(9), fiscal+(13), instruccion(9), ordinario(5), seguir(11), acumular(5),
audiencia-provincia(2), conform+(2), contradictori+(3), criterio+(10), cuantia(5),
falt+(7), injusto(3), interpretacion(3), ley(6), motiv+(3), pendiente(2), perito(5)
Class 4: Enforcement (judgment)
ejecucion(14), ejecut+(15), embarg+(11), finca+(9), depositar+(6), interes+(6),
pago(6), suspension(5), deposito(6), entreg+(6), quiebra(5), sentencia(9), solicit+(9),
vehiculo(4), acreedor(3), administracion(4), cantidad(4), conden+(4), cost+(4),
dinero(4), edicto(2), imposibilidad(3), multa(3), notificacion(4), pagar+(4)
Class 5: Trial
aport+(8), circunstancia(8), demand+(15), juicio(21), verbal(10), vist+(12), pericial(6), prueba(10), acto(5), admitir(6), interrogatorio(3), part+(18), abogado(9),
celebrar(5), letr+(2), suspender(5), testigo(5), citado(2), senalamiento(3), celebracion(2), comparecer(3), dia(5), discutir(2), presencia(2), present+(5), procurador(5)
Class 6: “On duty” actions
juez(36), compet+(11), ser(67), cadaver(14), centro(8), competencia(9), disposicion(8), extranjero(15), guardia(19), internamiento(14), autopsia(6), comision(7), deten+(18), hospital(7), investigacion(5), juzgado+(13), levantamiento+(8), pais(7),
prision(12), proced+(19), rogatorio(7), traslado+(8), urgencia(6), autorizar(4), capital(4), civil(10), delito(11), expulsion(4), forense(8), incapaz(4), ingres+(6), libertad(7)
Table 4: Specific vocabulary of each class of the lemmatised corpus.
Figure 5: Visualization of the lemmatized corpus.
We have also checked whether the LSI components obtained from the lemmatised corpus are more informative than the ones obtained from the stemmed corpus. Each component is assigned a weight, known as the singular value. We have checked the ratio between the weights from the lemmatised corpus and the weights from the stemmed corpus: the sum of the first n LSI component weights of the lemmatised (and stemmed) corpus was divided by the sum of the first n LSI component weights of the stemmed corpus (Figure 6). In this way we can see how many times cleaner and more informative the lemmatised components are compared to the stemmed components. For instance, the first two lemmatised components are 1.164 times more informative (enabling better reconstruction of the corpus) than the first two stemmed components.
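The comparison just described amounts to a ratio of cumulative sums of singular values. A minimal NumPy sketch follows; the function name is hypothetical and the singular values are made up for illustration, they are not the values measured on the two corpora.

```python
import numpy as np

def cumulative_weight_ratio(lemmatised_weights, stemmed_weights):
    """For each n, divide the sum of the first n lemmatised weights
    by the sum of the first n stemmed weights."""
    lem = np.cumsum(np.asarray(lemmatised_weights, dtype=float))
    stem = np.cumsum(np.asarray(stemmed_weights, dtype=float))
    n = min(len(lem), len(stem))
    return lem[:n] / stem[:n]

# Made-up singular values, for illustration only (not the paper's data).
lemmatised = [12.0, 9.0, 6.0]
stemmed = [10.0, 8.0, 6.0]

ratios = cumulative_weight_ratio(lemmatised, stemmed)
baseline = cumulative_weight_ratio(stemmed, stemmed)  # constant 1 by construction
```

A ratio above 1 at position n means that the first n lemmatised components carry more weight than the first n stemmed ones; the stemmed curve is the constant baseline at 1.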
6 Discussion
In this article we have shown how to improve the working of a knowledge management program through the distinction between stemming and lemmatisation techniques. In Alceste’s lexical reduction process we observed some irregularities because the program applies a reduction technique (stemming) which turns out to be inadequate for our data in the Spanish language. We have provided some discussion on the reasons for that. We have shown that the pre-processing of the original data can affect the final results in terms of stable class conformation (lexical worlds). Moreover, this has been shown by applying two different systems for data processing, Alceste and Text Garden Document Atlas, to the original data pre-processed using stemming (for the first set of experiments) and lemmatisation (for the second set of experiments).
Figure 6: Plot of LSI components weights for the stemmed and lemmatised corpus relative to the stemmed weights.
The most relevant result has been that the programs have been able to perform a more refined analysis of the vocabulary (as the corpus was “clean”). In the case of Alceste, this has resulted in the creation of two more classes (in addition to the four classes of the former analysis). The mere creation of more classes is not a good sign by itself; it is important, however, that these classes are the result of the division of two former classes. This creates more uniform, coherent and homogeneous lexical worlds, and makes the analyst’s interpretation easier (or even more elegant).
The analysis with the TextGarden system mirrors the findings with the Alceste system. The lemmatised version of the corpus could be described in a cleaner way. The latent semantic indexing components of the lemmatised corpus are consistently more informative than the components obtained from the stemmed corpus. Thus, we have demonstrated a solid quantitative improvement in the ability to model the corpus using LSI. For the visualization, however, it was helpful to remove the stop words that were not representative of the natural taxonomy of the questions in the corpus.
In conclusion, we can say that, as far as these programs are concerned, an improvement of the input has turned into an improvement of the output.
Acknowledgments
This work was supported by the Slovenian Ministry of Education, Science and Sport, the Spanish Ministry of Education and Science (Project Observatorio de Cultura Judicial del Consejo General del Poder Judicial SEC2001-2581-C02 (01-02) and Project CESS-ECE (HUM2004-21127-E) Corpus etiquetados sintáctico-semánticamente del español, catalán y euskera), the IST Programme of the European Community, under SEKT Semantically Enabled Knowledge Technologies (IST-1-506826-IP), and PASCAL Network of Excellence (IST-2002-506778).
References
[1] Bastin, Gilles (2002): “Note sur la méthode Alceste: mondes sociaux et mondes lexicaux”, November 28. http://www.melissa.ens-cachan.fr/imprimer.php3?id_article=200.
[2] Benzécri, Jean-Paul (1982): L’analyse des données. 4th ed. 2 Vols. Paris: Dunod.
[3] Carmona, J., Cervell, S., Márquez, L., Martí, M.A., Padró, L., Placer, R., Rodríguez, H., Taulé, M. & Turmo, J. (1998): “An Environment for Morphosyntactic Processing of Unrestricted Spanish Text”. In Proceedings of the First Conference on Language Resources and Evaluation, LREC’98, pages 915–922, Granada.
[4] Carroll, J.D. & Arabie, P. (1980): “Multidimensional scaling”. In M.R. Rosenzweig & L.W. Porter (eds.). Annual Review of Psychology 31: 607-649.
[5] Casanovas, Pompeu; Poblet, Marta; Casellas, Núria; Vallbé, Joan Josep; Ramos, Francisco; Benjamins, Richard; Blázquez, Mercedes; Rodrigo, Luis; Contreras, Jesús; Gorroñogoitia, Jesús (2005): D 10.2.1 Legal Scenario. Deliverable WP10 Case Study: Intelligent integrated decision support for legal professionals. SEKT Project (www.sekt-project.com).
[6] Civit, M. (2000): Guía para la anotación morfológica de corpus. Technical Report X-Tract WP-00/06, Universitat de Barcelona, 2000. Available: http://www.lsi.upc.es/?civit/publicacions.html.
[7] Civit, Montserrat & Martí, M. Antònia (2002): “Design Principles for a Spanish Treebank”. The First Workshop on Treebanks and Linguistic Theories. TLT2002, Bulgaria.
[8] Deerwester, S., Dumais, S., Furnas, G., Landauer, T. & Harshman, R. (1990): “Indexing by Latent Semantic Analysis”, Journal of the American Society for Information Science 41(6): 391-407.
[9] Folch, Helka & Habert, Benoît (2000): “Constructing a Navigable Topic Map by Inductive Semantic Acquisition Methods”. www.gca.org/attend/2000_conferences/Extreme_2000/Papers/Folch/Visuals/EXTREM.PPT.
[10] Monachini, M. & Calzolari, N. (1996): Synopsis and comparison of morphosyntactic phenomena encoded in lexicons and corpora. A common proposal and applications to European languages. Technical report, EAGLES. Available: http://www.ilc.pi.cnr.it/EAGLES96/browse.html.
[11] Padró, Lluís (1997): A Hybrid Environment for Syntax-Semantic Tagging. PhD thesis, Software Department (LSI). Technical University of Catalonia (UPC).
[12] Peyrat-Guillart, Dominique (2000): “Une application de la statistique textuelle à la gestion des ressources humaines : appréhender le concept d’implication au travail de façon alternative”. JADT 2000. 5es Journées Internationales d’Analyse Statistique des Données Textuelles.
[13] Reinert, Max (2002): “La tresse du sens et la méthode Alceste : Application aux Rêveries du promeneur solitaire”. JADT 2000: 5es Journées Internationales d’Analyse Statistique des Données Textuelles.
[14] Reinert, Max (2003): “Le rôle de la répétition dans la représentation du sens et son approche statistique par la méthode ALCESTE”, Semiotica 147 – 164, pp. 389-420.
[15] Saadi, Lahlou (1995): “Vers une théorie de l’interprétation en analyse des données textuelles”, JADT 1995. 3rd International Conference on Statistical Analysis of Textual Data. S. Bolasco, L. Lebart, A. Salem (eds.). CISU, Roma, 1995, Vol. I, pp. 221-228.
[16] Salton, G. (1991): “Developments in Automatic Text Retrieval”, Science 253: 974-979.