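Tekstin tallennus- ja hakumenetelmien
kehittäminen suomen kielen
tulkintaohjelmien avulla:
FULLTEXT-projektin loppuraportti
[Development of text storage and retrieval methods using Finnish language analysis programs:
final report of the FULLTEXT project]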
Riitta Alkula & Timo Honkela
ABSTRACT
The project, Linguistic processing and retrieval techniques in Finnish fulltext databases
(FULLTEXT), dealt with the special problems of fulltext databases in the Finnish language. Finnish
has a rich inflectional and derivational morphology. Another typical characteristic is the use of
compounds; in English these compounds would be multi-word terms. The characteristics
of Finnish result in poor system performance when commercial information retrieval systems
developed for English are used. To decrease the size of the inverted file and to improve retrieval
efficiency, it is reasonable to normalize the inflectional variants of a word to the basic form.
In the FULLTEXT project, natural language analysis modules for Finnish were incorporated into
the BASIS and APL-MINTTU retrieval systems, and several test databases were produced.
When word forms were normalized to their basic form, the index file was
smaller than a traditional index, in which the words are stored in their inflected forms. Even when
the components of the compound words were added to the basic form index, it still remained smaller
than the traditional index.
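The effect on index size can be illustrated with a minimal Python sketch. It is not the BASIS or APL-MINTTU implementation: the lookup tables BASE_FORMS and COMPOUND_PARTS, the function build_index, and the three toy documents are all hypothetical stand-ins for a real morphological analysis module and a real collection.

```python
from collections import defaultdict

# Hypothetical toy analyzer: inflected word form -> basic form.
# A real system would call a morphological analysis module instead.
BASE_FORMS = {
    "talossa": "talo",
    "talon": "talo",
    "talot": "talo",
    "kirjastossa": "kirjasto",
    "kirjaston": "kirjasto",
    "tietokannassa": "tietokanta",
    "tietokannan": "tietokanta",
}

# Hypothetical compound table: basic form -> its components.
COMPOUND_PARTS = {"tietokanta": ["tieto", "kanta"]}

# Toy documents (assumed data, for illustration only).
DOCS = {
    1: "talossa kirjaston tietokannassa",
    2: "talon kirjastossa tietokannan",
    3: "talot tietokanta",
}


def build_index(docs, normalize=False, split_compounds=False):
    """Build an inverted index: index term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            # Traditional index keeps the inflected form as it occurs.
            term = BASE_FORMS.get(token, token) if normalize else token
            index[term].add(doc_id)
            if split_compounds:
                # Also index each component of a known compound.
                for part in COMPOUND_PARTS.get(term, []):
                    index[part].add(doc_id)
    return dict(index)


traditional = build_index(DOCS)
basic = build_index(DOCS, normalize=True)
basic_with_parts = build_index(DOCS, normalize=True, split_compounds=True)

# Number of distinct index terms in each variant.
print(len(traditional), len(basic), len(basic_with_parts))  # 8 3 5
```

In this toy collection the basic form index with compound components (5 terms) is still smaller than the traditional index (8 terms), which mirrors the direction of the reported result but says nothing about its magnitude.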
In the retrieval tests, the best recall was achieved with the index that contained the basic word forms and
the components of compound words. Good recall did not come at the cost of precision: the
precision ratio was about as good as with the other indexes.
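For readers unfamiliar with the two measures, the following Python sketch computes recall and precision in their standard set-based form. The function name recall_precision and the example document ids are assumptions for illustration; the abstract does not state exactly how the project computed its ratios.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query.

    recall    = relevant documents retrieved / all relevant documents
    precision = relevant documents retrieved / all retrieved documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical query result: 4 documents retrieved, 3 documents relevant.
recall, precision = recall_precision(retrieved=[1, 2, 3, 4], relevant=[2, 3, 5])
print(round(recall, 2), precision)  # 0.67 0.5
```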
Queries had the best precision in a database where automatically truncated terms were searched in a
traditional index and the retrieved index terms were then analyzed and filtered with natural language
analysis modules. Unfortunately, in this case the recall ratio was lower than in the other test databases.
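The truncate-then-filter approach can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the project's code: the index contents, the analyzer table BASE_FORMS, and the functions truncated_search and filtered_search are hypothetical. The form "taloissa" is deliberately left out of the toy analyzer to show one possible way filtering could lose a relevant term; the abstract itself does not say why recall was lower.

```python
# Hypothetical traditional index: inflected index term -> document ids.
TRADITIONAL_INDEX = {
    "talossa": {1},    # talo (house), inessive
    "talon": {2},      # talo, genitive
    "talvella": {3},   # talvi (winter)
    "talouden": {4},   # talous (economy)
    "taloissa": {5},   # talo, plural inessive
}

# Hypothetical toy analyzer; "taloissa" is intentionally missing.
BASE_FORMS = {
    "talossa": "talo",
    "talon": "talo",
    "talvella": "talvi",
    "talouden": "talous",
}


def truncated_search(index, stem):
    """Return all index terms that start with the truncated search term."""
    return sorted(term for term in index if term.startswith(stem))


def filtered_search(index, stem, base_form):
    """Truncated search followed by filtering on the analyzed basic form."""
    docs = set()
    for term in truncated_search(index, stem):
        # Keep only index terms the analyzer maps to the wanted basic form.
        if BASE_FORMS.get(term) == base_form:
            docs |= index[term]
    return docs


matched_terms = truncated_search(TRADITIONAL_INDEX, "tal")
filtered_docs = filtered_search(TRADITIONAL_INDEX, "tal", "talo")

print(matched_terms)  # ['taloissa', 'talon', 'talossa', 'talouden', 'talvella']
print(filtered_docs)  # {1, 2}
```

Truncation alone matches unrelated words such as "talvella" and "talouden"; filtering removes them, which helps precision, while document 5 is lost in this toy setup because its index term is not recognized.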
Problems in the use of natural language modules were also investigated. When the search terms are
given in their basic form, the searcher must be more aware of derivatives and compounds than
when using truncated search terms in traditional indexes. Methods to transform the search terms to
their correct basic form should be further developed.
Remarks
The scanned original full text report starts at the 4th page of this document.
References
Bain, Malcolm, Richard Bland, Lou Burnard, Jon Duke, Colin Edwards, David Lindsey, Nicholas Rossiter, and
Peter Willett. Free text retrieval systems: a review and evaluation. Taylor Graham Publishing, 1989.
Blair, David C. Language and representation in information retrieval. Elsevier North-Holland, Inc., 1990.
Doszkocs, Tamas E., James Reggia, and Xia Lin. "Connectionist models and information retrieval." Annual review
of information science and technology 25 (1990): 209-262.
Lehti, Merja, and Pirkko Eskola. Suorakäyttöisten tiedonhakujärjestelmien käyttö Suomessa 1985. Valtion
teknillinen tutkimuskeskus. Informaatiopalvelulaitos, 1987.
Harter, Stephen P. Online information retrieval: concepts, principles, and techniques. Academic Press
Professional, Inc., 1986.
Heimbürger, Anneli, Riitta Alkula, and Taru Kuhanen. Hyperteksti ja hypermedia. Valtion teknillinen
tutkimuskeskus, informaatiopalvelulaitos, 1990.
Honkela, Timo, and Ari M. Vepsäläinen. "Interpreting imprecise expressions: Experiments with Kohonen's
self-organizing maps and associative memory." In Proceedings of ICANN-91, vol. 1, pp. 897-902. 1991.
Jäppinen, Harri, Aarno Lehtola, Esa Nelimarkka, and Matti Ylilammi. "Knowledge engineering approach to
morphological analysis." In Proceedings of the first conference on European chapter of the Association for
Computational Linguistics, pp. 49-51. Association for Computational Linguistics, 1983.
Karetnyk, David, Fred Karlsson, and Godfrey Smart. "Knowledge-based indexing of morpho-syntactically
analysed language." International Journal of Applied Expert Systems 4, no. 1 (1991): 1-29.
Karlsson, Fred. "Morphological tagging of Finnish." Computational Morphosyntax, Publica 13 (1985): 115-136.
Koskenniemi, Kimmo. "An application of the two-level model to Finnish." Computational morphosyntax: Report
on research 1984 (1981): 19-41.
Koskenniemi, Kimmo. "FINSTEMS: a module for information retrieval." Computational Morphosyntax: Report
on Research 84 (1981): 81-92.
Kotzias, Klaus. "How to respond to different language particularities by indexing texts using automatic text
analysis." In International online information meeting, pp. 61-68. 1990.
Laalo, Klaus. Säkeistä patoihin: suomen kielen monitulkintaiset sananmuodot. Suomalaisen kirjallisuuden seura,
1990.
Lin, Xia, Dagobert Soergel, and Gary Marchionini. "A self-organizing semantic map for information retrieval." In
Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information
retrieval, pp. 262-269. ACM, 1991.
Newton, Steve J. Text filing and retrieval systems: a practical evaluation guide. National computing centre, 1983.
Peters, Thomas A. "When Smart People Fail: An Analysis of the Transaction Log of an Online Public Access
Catalog." Journal of academic librarianship 15, no. 5 (1989): 267-73.
Ritter, Helge, and Teuvo Kohonen. "Self-organizing semantic maps." Biological cybernetics 61, no. 4 (1989):
241-254.
Saffady, William. Text storage and retrieval systems: A technology survey and product directory. Meckler, 1989.
Salton, Gerard. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by
Computer. Addison-Wesley, 1989.
Tenopir, Carol, and Jung Soon Ro. Full text databases. Greenwood Press, 1990.
Thönssen, Barbara. "Automatische Indexierung und Schnittstellen zu Thesauri [Interfaces Between Automatic
Indexing and Thesauri]." Nachrichten für Dokumentation (West Germany) 39, no. 4 (1988): 227-230.
[The list of references has been reproduced to support search system operations. Errors are possible. Please check the original.]
Keywords and search terms:
Named entities: VTT, Valtion teknillinen tutkimuskeskus, TEKES, VTKK, KTA-Papyrus,
Aamulehti, Länsiväylä-lehti, Tampereen yliopisto, Eeva Palosuo, Juhani Virtanen, Matti Sihto,
Kimmo Koskenniemi, Mika Herpiö, Pekka Vuorio, Harri Arnola, Sauli Laitinen, Eero Sormunen,
Taru Kuhanen, Sanna Hätönen, Raili Salminen, Markku Kuokkala, Markku Ylinen, Tarja Hjorth,
Kaarina Nazarenko, Jaakko Anttila, Kari Martiskainen, Irma Salovaara, Pirjo Valpas, Tarja
Heinivaho, Klaus Nurmi, Tuija Tuominen, Kalervo Järvelin, Olli Paavola.
Finnish terms: Tiedonhakujärjestelmä, hakujärjestelmä, suomen kieli, taivutusmuoto, johdos,
yhdyssana, homografia, sanaliitto, hakusana, taivutusvartalo, perusmuoto, MINTTU, BASIS,
testikysely, hakemisto, TWOL, hakutulos, käyttäjä, perusmuotohaku, automaattinen katkaisu
English terms: Information retrieval system, database, free-text retrieval, inverted index, index term,
stop word, query, Finnish language, inflectional word forms, compound word, automatic truncation,
morphological analysis, APL language, C language
The full report follows in scanned form.