Stemming and lemmatisation: improving knowledge management through language processing techniques

Joan-Josep Vallbé ∗, Aleks Jakulin §, M. Antònia Martí †, Dunja Mladenič ¶, Blaž Fortuna ‡, Pompeu Casanovas ‖

∗ [email protected], IDT - Institut de Dret i Tecnologia [Institute of Law and Technology], Universitat Autònoma de Barcelona (UAB).
† [email protected], CLIC - Centre de Llenguatge i Computació [Center of Language and Computation], Universitat de Barcelona (UB).
‡ J.Stefan Institute, Ljubljana, Slovenia.
§ J.Stefan Institute, Ljubljana, Slovenia.
¶ J.Stefan Institute, Ljubljana, Slovenia.
‖ [email protected], IDT - Institut de Dret i Tecnologia [Institute of Law and Technology], Universitat Autònoma de Barcelona (UAB).

Abstract

In this article we show how some aspects of knowledge management can be improved through the distinction between stemming and lemmatisation techniques. To do that we apply both techniques to the same corpus, in the framework of the building of a legal ontology, which is analysed by Alceste and TextGarden. In the final part of the article we discuss whether the results of the experiment are significant with regard to the analysis done by these software programs. Our discussion focuses on whether the results are useful for the ontology building process within the European project SEKT.

1 Introduction

To perform a quantitative analysis of text we must choose a way of representing words as mere numbers, stripped of meaning. Even if this operation may appear brusque, the patterns of words co-appearing with one another reveal much structure in a corpus of text. Furthermore, some words (judge, attorney, court) tend to appear in the same document or paragraph, so associations can be established directly from the patterns of co-appearance.
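The idea of reading associations off patterns of co-appearance can be sketched in a few lines. The following is only an illustrative sketch, not the procedure used by either program in this article: the function name `cooccurrence_counts` and the three toy documents are hypothetical, and whitespace tokenisation is an assumption made for brevity.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each unordered pair of distinct words, the number of
    documents in which both words appear."""
    counts = Counter()
    for text in documents:
        # A set is used so that a word repeated inside one document
        # contributes only once to each pair.
        vocabulary = sorted(set(text.lower().split()))
        for pair in combinations(vocabulary, 2):
            counts[pair] += 1
    return counts


# Hypothetical toy documents, not taken from the corpus.
documents = [
    "the judge heard the attorney in court",
    "the attorney addressed the court",
    "the judge left the court",
]

counts = cooccurrence_counts(documents)
# Pairs are stored in alphabetical order, e.g. ("court", "judge").
print(counts[("court", "judge")])
print(counts[("attorney", "judge")])
```

Words that share many documents (here "court" with "judge" or "attorney") receive higher counts than words that rarely meet, which is the kind of raw signal that the richer representations discussed in this article build upon.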
As a part of SEKT (Semantically Enabled Knowledge Technologies, EU-IST Project IST-2003-506826, http://sekt-project.org/), we are developing tools that attempt to move beyond these simple patterns of co-appearance and extend the basic numerical representations with additional information. Such information can be understood as ‘semantic’, as it is a formalization of certain aspects of meaning. The final goal of this research is to improve the quality of access to judicial information as a part of one case study developed in the SEKT project.

Stemming and lemmatisation both represent related words as a single word. For example, the default quantification is to treat the words ‘am’, ‘are’ and ‘is’ as separate. However, all these words are forms of the same verb, which we denote as ‘be’. By mapping them to a single form we simplify the representation and capture patterns of identity. Stemming and lemmatisation are two approaches to this operation.

The rest of this paper is structured as follows. Section 2 describes the two software programs used. Section 3 describes the collection of questions from the legal domain, provided in Spanish, and the results of the experiments on it. Some potential problems caused by using stemming instead of a more sophisticated lemmatisation are pointed out in Section 4. Section 5 briefly describes the lemmatisation software used and provides the results of experiments on the lemmatised corpus. Section 6 concludes the paper with a discussion of the experiments and findings.

2 The software

2.1 Alceste

Alceste stands for Analyse des Lexèmes Co-occurrents dans les Énoncés Simples d’un TExte [Analysis of the co-occurring lexemes within the simple statements of a text]. Its algorithm was created by Max Reinert at the CNRS, partly based on Benzécri’s contributions on textual statistics [2]. The aim of Alceste is to quantify the text in order to extract its most significant structures.
Its creator Max Reinert conceives discourse not in terms of what it represents but in terms of the activity that takes place in it [14]. The program performs a particular analysis of the “topography of discourse” by creating, confronting and representing different “lexical worlds” from it. Thus we may say that we can model the trace of the meaning of a text as the trace of a discourse activity (production and repetition of signs) [1].

Alceste’s method is known as Hierarchical Decreasing Classification (HDC). The corpus to be analysed is successively split into chunks (see Figure 1); the program then observes the distribution of the most significant words within every segment and extracts the most representative words from the text.

Figure 1: Schema of the way Alceste works [9].

In fact, what the program extracts are not really words but word reductions. Thus, a “most significant word” is not a lexeme and a segment is not necessarily a sentence. Lexical identification is made by a dictionary. Each word is reduced to a root which is shared with other words; for instance, the different forms of a verb should be reduced to a single “lemma”. The creator of the program calls this reduction process lemmatisation, though throughout this article we will show that it better fits what is usually referred to as stemming.

Finally, the program classifies the chunks of the corpus (called elemental context units [ECU]). This chunking is done according to the punctuation of the text, if any, and to the number of words. Then, the program classifies the ECU according to the distribution of the vocabulary that appears within these context units.
Alceste finds the vocabulary in the different ECU and relates them; in other words, it connects the ECU that have common vocabulary, identifies the strongest vocabulary oppositions and extracts some categories of representative segments.¹

2.2 Text Garden Document Atlas

Text Garden is a library of software components enabling complex operations on text data that are usually needed in text mining (http://www.textmining.net/). Document Atlas is one of the components; it provides different kinds of document corpus visualization based on Latent Semantic Indexing (LSI) [8] and multidimensional scaling [4].

LSI is a technique for extracting background knowledge from text documents. It applies a technique from linear algebra called Singular Value Decomposition (SVD) to the bag-of-words [16] representation of text documents in order to extract words with similar meanings. This can be viewed as the extraction of hidden semantic concepts from the document corpus.

Multidimensional scaling is applied to the documents, represented using the concepts from LSI, to position them on a 2D plane. The intuition behind applying LSI first is to describe the documents using only the main concepts that appear in them before performing the radical reduction of the space to a 2D plane.

After the documents are mapped onto the plane, some other techniques can be used to make the structure of the documents more explicit for the user:

• Landscape generation: a landscape can be generated by using the density of points. This is done by estimating the distribution of documents on the plane using a mixture of Gaussians. The landscape is then drawn as a background for the document map.

• Keywords: each point on the plane can be assigned a set of keywords by averaging the TF-IDF vectors of the documents which appear within a circle centred on this point with some radius R. These keywords are used to fill the map with words describing different areas of the map.
This offers the user a quick flavour of the topics covered in the documents. For a more detailed list of keywords the user can move the mouse to the desired area.

These features can be used to make visualizations more descriptive and informative. See Figures 3 and 5 for examples of visualization using Text Garden Document Atlas.

¹ For more information about Alceste applications, see, among others, [15] and [12].

3 Original corpus

3.1 Type of corpus

In the framework of the ethnographic fieldwork campaign carried out by UAB researchers throughout Spain [5], one part of the interviews with judges consisted of the researcher asking the judge to formulate concrete questions about the main problems of his daily work, concerning the civil and criminal jurisdictions and his on-duty period. Some of the judges did formulate short and concrete questions; others preferred to explain their doubts or problems in a more detailed way. In either case, UAB researchers extracted all the questions the judges formulated, and these constitute the corpus to be analysed (756 questions posed by the judges).

3.2 Interpretation of the Alceste results

With respect to this corpus, Alceste classifies 431 ECU out of the 500 that were created (86.20%) and creates 4 stable classes, which are represented in Figure 2.

Figure 2: Graphical representation of the factorial analysis on the original corpus.

Following some conclusions regarding other related corpora [5], the X axis of the class projection may be interpreted as representing the path from the private world (family law, left side of the graphic) to the public world (legal system, right side of the graphic). The Y axis somehow represents the
In particular, word distribution along the Y axis refers (a) to the interlocutory decisions the judge has to make during the “on-duty” period, (b) to the proceedings (inferior part of the graphic), and (c) to the legal qualification, trial and judgment (superior part of the graphic). Table 1 shows the most representative words of each class. Class 1: Proceedings and trial juicio+(37), verbal+(16), cuantia+(9), instruccion(13), admit+(7), circunstancia+(8), fase+(8), inform+(9), ordinario+(5), pericial+(6), procedimiento+(18), vista+(13), acto+(7), celebr+(5), dia+(11), escrit+(6), interrogatorio(3), oral+(5), parte+(25), procurador+(8), prueba+(11), recurso+(5), resolver(6), suspenderse(4), decision+(3) Class 2: Enforcement (judgments) ejecucion+(15), ejecut+(15), embarg+(10), finca+(9), depositario(5), notificacion+(7), pago+(5), demand+(14), interes+(7), quiebra+(6), sentencia(10), accion+(4), acordarse(4), archiv+(6), bienes(4), cabo(12), coche(3), condenad+(3), deposit+(6), edicto+(2), entrega+(4), fallo(3), gasto+(5) Class 3: Family law (gender violence, divorce, separation,. . . ) alejamiento(21), malo+(22), medida+(14), mujer+(15), orden+(24), proteccion(17), senor+(14), tratos(22), victima+(10), domestic+(9), padre+(7), violencia(9), denunci+(13), madre(7), marido(6), pension(5), agresor(4), alimentos(4), hijo+(4), lesion+(7), maltratada+(4), nino+(5) Class 4: Actions during “on-duty” period juzgado+(64), juez(48), cadaver+(19), detenido+(18), funcionario+(20), guardia+(30), internamiento+(18), judicial+(38), llam+(26), medico+(23), policia(29), actuacion+(10), acud+(20), asunto+(21), causa+(8), comision(8), competente+(10), deten+(11), disposicion(9), funcion+(10), hacer(91), hospital(8), informacion+(8), levantamiento+(11) Table 1: Specific vocabulary of each class of the original corpus. 
Class 1 represents 18.10% of the ECU and has been entitled “Proceedings and trial”, because it mostly refers to the final part of the legal proceeding in the context of the legal system (not in the private field), as may be observed above.

Class 2 represents 14.39% of the ECU and occupies more or less the same place as Class 1 in the word projection graphic. However, Class 2 expresses a lexical world of its own, as it refers to a specific final part of the legal proceeding which raises doubts and problems for judges with less experience: the enforcement of judgments. Its vocabulary is especially illustrative.

Class 3 represents 13.46% of the ECU and rather clearly refers to the intermediate steps of the proceeding in family law cases. In general, it contains some external elements of this type of proceedings.

Finally, Class 4 represents 54.06% of the ECU and expresses specific actions the judge has to perform during his “on-duty” period. This class is strongly opposed to Classes 1 and 2, as it refers to the very initial steps of any incoming case during the “on-duty” period, and it is also opposed (although more weakly) to Class 3, because we go from the private field (family law) to the public field (legal system). The lexical forms that represent this class include the typical “on-duty” activities, its main actors, and the physical places where its actions take place.

3.3 Interpretation of the Text Garden Document Atlas results

In order to process the text using Text Garden Document Atlas, we first extracted the stems from the Alceste reports in the form of, say, ‘funcionario+’. We then used these patterns to perform a kind of stemming on the original corpus. Furthermore, a few stems and stop words that were thought not to be representative of the topic were removed: es, un, una, lo, los, ni, ido, ha, las, debe, viene, hace, puedo, dos, son, y, el, se, deber, hacer, etc.
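The pattern-based stemming and stop-word removal described above might look roughly as follows. This is a minimal sketch under stated assumptions rather than the actual script used for the experiments: it assumes that a pattern ending in ‘+’ matches any token beginning with that root, that the longest matching root wins, and that stop words are dropped before stemming. The names `build_stemmer` and `preprocess` and the sample question are hypothetical; the patterns and stop words are a small subset of those listed in this section.

```python
def build_stemmer(patterns):
    """Return a function that maps a token to its Alceste-style stem.

    Assumption: a pattern such as 'funcionario+' covers every token
    that starts with 'funcionario'; patterns without '+' match only
    the exact form.
    """
    roots = sorted(
        (p[:-1] for p in patterns if p.endswith("+")),
        key=len,
        reverse=True,  # try longer roots first, e.g. 'detenido' before 'deten'
    )
    exact = {p for p in patterns if not p.endswith("+")}

    def stem(token):
        if token in exact:
            return token
        for root in roots:
            if token.startswith(root):
                return root + "+"
        return token  # unknown tokens are left unchanged

    return stem


def preprocess(question, stem, stop_words):
    """Lower-case, drop stop words, then stem the remaining tokens."""
    tokens = question.lower().split()
    return [stem(t) for t in tokens if t not in stop_words]


# A few patterns from Table 1 and a few stop words from the list above.
PATTERNS = ["funcionario+", "juzgado+", "detenido+", "deten+", "juez"]
STOP_WORDS = {"es", "un", "una", "lo", "los", "las", "y", "el", "se"}

stem = build_stemmer(PATTERNS)
# Hypothetical question fragment, not taken from the corpus.
print(preprocess("los funcionarios y el juez", stem, STOP_WORDS))
```

Trying longer roots first matters when two patterns overlap, as with ‘deten+’ and ‘detenido+’ in Table 1; a different tie-breaking rule would assign some tokens to a different stem.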
Otherwise, it is quite foreseeable that we would obtain a partitioning of the questions into a group that starts with ‘Why?’, another group with ‘What?’ and yet another group with ‘Who?’. While such a taxonomy would be interesting for a linguist, it is not interesting in the present context of the legal case.

Each question is represented as a ‘document’ in the TextGarden nomenclature, and the corpus was then analysed and visualized by the system (see Figure 3). We have used the first 125 most important LSI components. Each component is described by two sets of keywords representing the two extremes of the dimension spanned by this component. For instance, the first two components are the following:

• levantamiento, cadaver, proced, llev cabo, forense, / orden, proteccion, alejamiento, tratos, malo, senor, denunci

• juicio rapido, fiscal, parte, policia, ministerio fiscal, / levantamiento cadaver, orden proteccion, llev cabo, alejamiento

The first dimension captures most of the documents in the corpus. It can be interpreted as follows: the content of most documents is captured by placing them along a dimension one end of which can be described by words such as “levantamiento, cadaver, proced, llev cabo, forense” and the opposite end by words such as “orden, proteccion, alejamiento, tratos, malo, senor, denunci”.

Figure 3: Visualization of stemmed corpus. Each yellow cross is a particular question, and the representative words are dispersed among the questions. The lighter color indicates a higher density of documents in that area.

4 Problems with stemming

If we pay attention to the forms that make up each class (i.e. the “lemmas”) rather than to the different classes and their relations, which together create a “lexical world”, we will observe that these word roots are not precisely lemmas. Table 2 shows some examples.

4.1 Stemming vs. lemmatisation

A stemmer works by applying rules that transform a word into its stem, usually by cutting off the word suffix. For instance, a stemmer would put all the variants of love (verb and noun), loving (verb), lover (noun), lovely (adjective), etc., behind the reduced form “lov+”. A lemmatiser, on the other hand, would assign the above words to three different lemmas (love, lover and lovely), also based on their grammatical form.

Stems          Lemmas
acumulacion    acumulación
acumularse     acumular
acumul+        -
admision       admisión
admit+         admitir
celebracion    celebración
celebr+        celebrar
misma+         mismo
mismo+         -
suspenderse    suspender
suspend+       -

Table 2: Difference between stems and lemmas.

If a stemming process is applied to languages with rich inflection such as Spanish, Catalan or Slovenian (where a verb can have 60 forms, not counting compound forms), a lot of information remains hidden, and the reduction process based on stemming often produces results that are not refined enough. In poorly inflected languages, such as English, most verbal forms are reduced to only four forms for each verb: love, loves, loving, loved. The future tense is formed with “will” and the conditional with “would”, plus the infinitive form. Thus variation is minimal. Furthermore, a lot of nominal, verbal and/or adjectival forms are identical; for instance, hammer can be a verb and a noun, and can form adjectival forms through the addition of some other particles, as in hammer-shaped.

Moreover, stemming may put many different forms behind the same stem. For instance, in Spanish it may put estafa [swindle] and estafeta [post office] behind the root “estaf+”, although these two forms are obviously not even related. So this information remains hidden. Lemmatisation would give two different forms: estafa and estafeta. Moreover, in some cases stemming gives different stems when there should be the same stem.
As an example, verbal forms like resolver [to resolve] and resuelto [resolved] may be given two stems, “resol+” and “resuel+”. This would be wrong, because they are two forms of the same verb (resolver), and the only valid lemma would be resolver. Some real examples from our experience can be seen in Table 2.

5 Lemmatisation

Following the initial experiments using the stemmed corpus, we identified the problems mentioned above and repeated the experiments using the lemmatised corpus, as described in this section.

5.1 MACO: Morphological Analyser

In order to better handle natural language properties, we decided to pre-process the data using a morphological analyser and to avoid using the more sophisticated tools developed for corpus analysis in Spanish [7]. Namely, we need neither morphosyntactic tagging nor disambiguation (in that case, RELAX [11] would be a good solution, since the accuracy of its output varies between 94% and 96%), nor syntactic chunking (as TACAT does). All these tools are organised in a pipeline-like process.

MACO [3] is a Morphological Analyser for Spanish (and Catalan) which provides both lemma(s) and POS-tag(s) for each word and whose output has the following form:

word tag_1-lemma_1 ... tag_n-lemma_n

Tags codify 13 part-of-speech categories (noun, verb, adjective, adverb, pronoun, determiner, preposition, conjunction, interjection, dates, punctuation, numbers and abbreviations) as well as subcategories and morphological features [6], as proposed by EAGLES [10]. The total number of tags is 2853.²

5.2 Description of the lemmatized corpus

                          Original corpus (stemming)   Lemmatised corpus
No. of different forms    3074                         2064
No. of occurrences        19861                        19946
Max. freq. of a form      1230                         2208
Hapax                     1666                         934

Table 3: Comparative table of results.
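The figures in Table 3 can be turned into per-form averages and hapax shares with simple arithmetic. The sketch below only restates the numbers from the table; the dictionary layout and the helper names `average_frequency` and `hapax_share` are illustrative choices, not part of either program.

```python
# Values copied from Table 3.
TABLE_3 = {
    "original (stemming)": {"forms": 3074, "occurrences": 19861, "hapax": 1666},
    "lemmatised": {"forms": 2064, "occurrences": 19946, "hapax": 934},
}


def average_frequency(stats):
    """Mean number of occurrences per distinct form."""
    return stats["occurrences"] / stats["forms"]


def hapax_share(stats):
    """Fraction of distinct forms that occur exactly once."""
    return stats["hapax"] / stats["forms"]


for name, stats in TABLE_3.items():
    print(
        f"{name}: {average_frequency(stats):.2f} occurrences per form, "
        f"{hapax_share(stats):.1%} of forms are hapax"
    )
```

The averages come out at roughly 6.5 occurrences per form for the original corpus and 9.7 for the lemmatised one, and the share of forms that occur only once falls from about 54% to about 45%.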
While in the original corpus the program identified 3074 different forms (and the number of occurrences was 19861), in the lemmatised corpus we have 2064 different forms and 19946 occurrences. The difference in the number of different forms arises because the stemming system for Spanish that Alceste uses tends to count some forms as different when it should reduce them to a single one. Thus, while the number of forms is reduced, the number of occurrences increases. This fact, although minor (it could be neglected), can be explained because, on the one hand, Alceste recognises some types of locution (e.g. a-pesar-de) and, on the other hand, it unifies some words as if they were locutions when they are not (e.g. para-que, lo-que). Indeed, this process seems quite arbitrary.

Moreover, the average frequency per form is higher in the lemmatised corpus (10) than in the stemmed or original corpus (6), because the number of forms is lower, as the word reduction has been performed in a more refined way. Accordingly, the maximum frequency of a form (2208) is also higher in the second corpus than in the first one (1230).

Finally, a relevant question is the difference between both results regarding the forms that appear only once (hapax). While the first analysis identifies 1666 hapax, the refined analysis through lemmatisation reduces this number to 934. This may be significant in the sense that the dictionary of Alceste may not recognise some forms that could be reduced to a lemma, and so counts them as isolated forms that, by chance, occur only once.

² For a description of the full set of tags, see [7] and [6].

5.3 Alceste results

Once the corpus has been lemmatised, it is submitted again to Alceste. The main differences that we can observe in these new results affect the first steps Alceste makes, that is to say, the computation of its dictionary (step A2 in the program).
Regarding the creation of ECU, in this second analysis the program classifies 378 ECU out of 502 (75.30%), a considerably lower percentage than in the first analysis. Nevertheless, the second analysis creates 6 stable classes, while the first one created just 4. The first aspect we have to consider, as will be seen, is that the 6 classes of the second analysis represent ECU percentages which cannot be neglected (even Class 5).

Regarding the content of the classes, the results are naturally not very different from the former analysis, since we are dealing with the same corpus. Thus we will not repeat here the terms of the general interpretation, as it holds for both corpora. However, we can affirm that through the lemmatised corpus we have helped the program to refine its analysis, so the interpretation turns out to be easier and better.

Figure 4: Graphical representation of the factorial analysis of the lemmatised corpus.

As shown in Table 4, a part of the representative vocabulary of Class 1 belonged to the former Class 4 (Actions during the “on-duty” period). However, this Class 1, due to its lexical composition, corresponds to the world of the judicial unit and represents 20.1% of the ECU. Alceste has also created another class which is strictly about the actions during the “on-duty” period (Class 6), which represents 27.2% of the ECU. This is hardly surprising if we remember that Class 4 of the former corpus represented more than 54% of the classified ECU (see Section 3.2). Here we find one of the main reasons for the multiplication of classes in the second analysis.

Regarding Class 2, Alceste has not changed its results. In both corpora it has created this class, and consequently both versions contain the same type of specific vocabulary. Finally, both classes represent a similar percentage of the ECU.
Thus we think this is a class that has remained stable and unchanged throughout the process we have described.

With respect to Classes 3 and 5 we have a case similar to the first one. While in the former analysis we had a single class which contained all the questions that referred to proceedings and trial (see Table 1), the new analysis has created two different classes that enable us to distinguish both aspects. First, Class 3 contains specific vocabulary referring to proceeding questions; second, Class 5 shows specific lexical elements of a trial.

Finally, another element of stability between the first and the second analysis is Class 4. As in the case of the questions related to family law, the program has kept a specific class referring to the enforcement of judgments.

5.4 TextGarden Document Atlas results

The same procedure as applied to the stemmed corpus was used to visualize the lemmatised corpus. We again removed the words not directly related to the topics. Because the words were in their canonical form, this was much easier than in the case of stemming. Again we used the first 125 LSI components (see Figure 5).

It can be seen that the documents are much more smoothly distributed around the plane in this visualization of the lemmatised corpus than in the visualization of the stemmed corpus (Figure 3). In the visualization of the stemmed corpus, there is a dense cluster of documents in the centre. The content of these documents is too ambiguous and the system cannot place them properly. However, when these documents are preprocessed using lemmatisation instead of stemming, this cluster of ambiguous documents is much smaller.

We have also checked whether the LSI components obtained from the lemmatised corpus are more informative than the ones obtained from the stemmed corpus. Each component is assigned a weight, known as the singular value. We have checked the ratio between the weights from the lemmatised corpus and the weights from the stemmed corpus.
The sum of the first n LSI component weights of the lemmatised corpus (and, as a baseline, of the stemmed corpus) was divided by the sum of the first n LSI component weights of the stemmed corpus (Figure 6). In this way we can see how many times cleaner and more informative the lemmatised components are compared to the stemmed components. For instance, the first two lemmatised components are 1.164 times more informative (enabling better reconstruction of the corpus) than the first two stemmed components.

Class 1: Judicial unit
funcionar+ (21), juzgar(26), oficina(11), trabaj+(13), decir(26), llam+(16), mand+(12), acudir(11), adjunto(4), busc+(4), consult+(4), dato(6), hablar(4), jurisprudencia(3), local+(3), material(6), necesit+(7), policia(14), prensa(4), sala(4), funerari+(2), hurto(3), informacion(5), miedo(3), robo(3), servicio+(7), sustitu+(4), tecnico(2), venir(15)

Class 2: Family law (gender violence, divorce, ...)
alejamiento(22), malo(22), medida(16), orden+(23), proteccion(17), senor+(13), trat+(22), victima(11), mujer(11), padre(7), denunci+(12), domestico(8), violencia(8), agresor(4), dict+(10), madre(7), marido(6), nino(5), pension(4), psicolog+(5), separacion(5), abus+(5), alimento(3), ayud+(4), casa(3), cautelar+(3), divorcio(2), empresa(3), hijo(4), lesion+(6)

Class 3: Proceedings
escrit+(9), fiscal+(13), instruccion(9), ordinario(5), seguir(11), acumular(5), audiencia-provincia(2), conform+(2), contradictori+(3), criterio+(10), cuantia(5), falt+(7), injusto(3), interpretacion(3), ley(6), motiv+(3), pendiente(2), perito(5)

Class 4: Enforcement (judgment)
ejecucion(14), ejecut+(15), embarg+(11), finca+(9), depositar+(6), interes+(6), pago(6), suspension(5), deposito(6), entreg+(6), quiebra(5), sentencia(9), solicit+(9), vehiculo(4), acreedor(3), administracion(4), cantidad(4), conden+(4), cost+(4), dinero(4), edicto(2), imposibilidad(3), multa(3), notificacion(4), pagar+(4)

Class 5: Trial
aport+(8), circunstancia(8), demand+(15), juicio(21), verbal(10), vist+(12), pericial(6), prueba(10), acto(5), admitir(6), interrogatorio(3), part+(18), abogado(9), celebrar(5), letr+(2), suspender(5), testigo(5), citado(2), senalamiento(3), celebracion(2), comparecer(3), dia(5), discutir(2), presencia(2), present+(5), procurador(5)

Class 6: “On duty” actions
juez(36), compet+(11), ser(67), cadaver(14), centro(8), competencia(9), disposicion(8), extranjero(15), guardia(19), internamiento(14), autopsia(6), comision(7), deten+(18), hospital(7), investigacion(5), juzgado+(13), levantamiento+(8), pais(7), prision(12), proced+(19), rogatorio(7), traslado+(8), urgencia(6), autorizar(4), capital(4), civil(10), delito(11), expulsion(4), forense(8), incapaz(4), ingres+(6), libertad(7)

Table 4: Specific vocabulary of each class of the lemmatised corpus.

Figure 5: Visualization of the lemmatized corpus.

6 Discussion

In this article we have shown how to improve the working of a knowledge management program through the distinction between stemming and lemmatisation techniques. In Alceste’s lexical reduction process we observed some irregularities, because the program applies a reduction technique (stemming) which turns out to be inadequate for our data, provided in the Spanish language. We have provided some discussion of the reasons for that. We have shown that the pre-processing of the original data can affect the final results in terms of stable class formation (lexical worlds). Moreover, this has been shown by applying two different systems for data processing, Alceste and Text Garden Document Atlas, to the original data pre-processed using stemming (for the first set of experiments) and lemmatisation (for the second set of experiments).

Figure 6: Plot of LSI component weights for the stemmed and lemmatised corpus, relative to the stemmed weights.
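The quantity plotted in Figure 6 is a ratio of cumulative singular values, which can be sketched as follows. This is an illustrative sketch only: the function name `cumulative_weight_ratio` is hypothetical, and the singular values below are made-up numbers chosen for readability, not the values obtained from the corpus.

```python
from itertools import accumulate


def cumulative_weight_ratio(weights, baseline):
    """For n = 1, 2, ..., return the sum of the first n weights divided
    by the sum of the first n baseline weights.

    Both sequences are assumed to hold singular values sorted in
    decreasing order, as produced by an SVD.
    """
    return [w / b for w, b in zip(accumulate(weights), accumulate(baseline))]


# Hypothetical singular values, NOT the ones measured in the experiments.
lemmatised = [4.0, 2.0, 1.0]
stemmed = [3.0, 2.0, 1.0]

ratios = cumulative_weight_ratio(lemmatised, stemmed)
print(ratios)

# Dividing the stemmed weights by themselves gives the flat reference
# line at 1 against which the lemmatised curve is compared.
print(cumulative_weight_ratio(stemmed, stemmed))
```

A ratio above 1 for a given n means that the first n components of the lemmatised corpus carry more total weight than the first n components of the stemmed corpus, which is how the value of 1.164 reported for n = 2 should be read.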
The most relevant result has been that the programs were able to perform a more refined analysis of the vocabulary (as the corpus was “clean”). In the case of Alceste, this has resulted in the creation of two more classes (in addition to the four classes of the former analysis). The mere creation of more classes is not a good sign by itself; it is important, however, that these classes result from the division of two former classes. This creates more uniform, coherent and homogeneous lexical worlds, which makes the interpretation made by the analyst easier (or even more elegant).

The analysis with the TextGarden system mirrors the findings with the Alceste system. The lemmatised version of the corpus could be described in a cleaner way. The latent semantic indexing components of the lemmatised corpus are consistently more informative than the components obtained from the stemmed corpus. Thus, we have demonstrated a solid quantitative improvement in the ability to model the corpus using LSI. For the visualization, however, it was helpful to remove the stop words that were not representative of the natural taxonomy of the questions in the corpus.

In conclusion, we can say that, as far as the programs are concerned, an improvement of the input has resulted in an improvement of the output.

Acknowledgments

This work was supported by the Slovenian Ministry of Education, Science and Sport, the Spanish Ministry of Education and Science (Project Observatorio de Cultura Judicial del Consejo General del Poder Judicial SEC20012581-C02 (01-02) and Project CESS-ECE (HUM2004-21127-E) Corpus etiquetados sintactico-semánticamente del español, catalán y euskera), the IST Programme of the European Community, under SEKT Semantically Enabled Knowledge Technologies (IST-1-506826-IP), and PASCAL Network of Excellence (IST-2002-506778).

References

[1] Bastin, Gilles (2002): “Note sur la méthode Alceste: mondes sociaux et mondes lexicaux”, November 28. http://www.melissa.ens-cachan.fr/imprimer.php3?id_article=200.

[2] Benzécri, Jean Paul (1982): L’analyse des données. 4th ed. 2 Vols. Paris: Dunod.

[3] Carmona, J., Cervell, S., Márquez, L., Martí, M.A., Padró, L., Placer, R., Rodríguez, H., Taulé, M. & Turmo, J. (1998): “An Environment for Morphosyntactic Processing of Unrestricted Spanish Text”. In Proceedings of the First Conference on Language Resources and Evaluation, LREC’98, pages 915-922, Granada.

[4] Carroll, J.D. & Arabie, P. (1980): “Multidimensional scaling”. In M.R. Rosenzweig & L.W. Porter (eds.), Annual Review of Psychology 31: 607-649.

[5] Casanovas, Pompeu; Poblet, Marta; Casellas, Núria; Vallbé, Joan Josep; Ramos, Francisco; Benjamins, Richard; Blázquez, Mercedes; Rodrigo, Luis; Contreras, Jesús; Gorroñogoitia, Jesús (2005): D10.2.1 Legal Scenario. Deliverable WP10 Case Study: Intelligent integrated decision support for legal professionals. SEKT Project (www.sekt-project.com).

[6] Civit, M. (2000): Guía para la anotación morfológica de corpus. Technical Report X-Tract WP-00/06, Universitat de Barcelona. Available: http://www.lsi.upc.es/?civit/publicacions.html.

[7] Civit, Montserrat & Martí, M. Antònia (2002): “Design Principles for a Spanish Treebank”. The First Workshop on Treebanks and Linguistic Theories, TLT2002, Bulgaria.

[8] Deerwester, S., Dumais, S., Furnas, G., Landauer, T. & Harshman, R. (1990): “Indexing by Latent Semantic Analysis”, Journal of the American Society for Information Science 41(6): 391-407.

[9] Folch, Helka & Habert, Benoît (2000): “Constructing a Navigable Topic Map by Inductive Semantic Acquisition Methods”. www.gca.org/attend/2000_conferences/Extreme_2000/Papers/Folch/Visuals/EXTREM.PPT.

[10] Monachini, M. & Calzolari, N. (1996): Synopsis and comparison of morphosyntactic phenomena encoded in lexicons and corpora. A common proposal and applications to European languages. Technical report, EAGLES. Available: http://www.ilc.pi.cnr.it/EAGLES96/browse.html.
[11] Padró, Lluís (1997): A Hybrid Environment for Syntax-Semantic Tagging. PhD thesis, Software Department (LSI), Technical University of Catalonia (UPC).

[12] Peyrat-Guillart, Dominique (2000): “Une application de la statistique textuelle à la gestion des ressources humaines : appréhender le concept d’implication au travail de façon alternative”. JADT 2000: 5es Journées Internationales d’Analyse Statistique des Données Textuelles.

[13] Reinert, Max (2002): “La tresse du sens et la méthode Alceste : Application aux Rêveries du promeneur solitaire”. JADT 2000: 5es Journées Internationales d’Analyse Statistique des Données Textuelles.

[14] Reinert, Max (2003): “Le rôle de la répétition dans la représentation du sens et son approche statistique par la méthode ALCESTE”, Semiotica 147 – 164, pp. 389-420.

[15] Saadi, Lahlou (1995): “Vers une théorie de l’interprétation en analyse des données textuelles”, JADT 1995. 3rd International Conference on Statistical Analysis of Textual Data. S. Bolasco, L. Lebart, A. Salem (eds.). CISU, Roma, 1995, Vol. I, pp. 221-228.

[16] Salton, G. (1991): “Developments in Automatic Text Retrieval”, Science 253: 974-979.