Language Technology Support for Latvian - META-NORD
www.meta-net.eu Tel: +49 30 3949 1833 Fax: +49 30 3949 1810

META-NET White Paper Series
Languages in the European Information Society - Latvia

Preface

This series of language white papers is for journalists, politicians, language communities, language teachers and others who want to establish a truly multilingual Europe. This series promotes knowledge about language technology (LT) and its potential. The coverage and use of language technology in Europe varies from language to language. Consequently, required actions to support research and development vary, and the necessary steps depend on many factors, such as the complexity of the language or the size of its community.

META-NET has faced this challenge by initiating an analysis of the current state of affairs for language resources and technologies. The analysis focuses on the 23 official European languages and several important regional languages. The results of the analysis suggest that there are many significant gaps for each language. Detailed expert analysis and assessment of the situation for each language will help maximise the impact of language technology and minimise any associated risks.

META-NET is a European Commission Network of Excellence that consists of 44 research centres from 31 countries. META-NET is working with stakeholders from many areas of society, industry and research to generate strategic visions and produce a strategic research agenda that shows how language technology applications can address any gaps by 2020.

Imprint

Authors/Editors: Dr. Aljoscha Burchardt, DFKI; Prof. Dr. Markus Egg, Humboldt-Universität zu Berlin; Kathrin Eichler, DFKI; Dr. Georg Rehm, DFKI; Prof. Dr. Manfred Stede, Universität Potsdam; Prof. Dr. Hans Uszkoreit, Universität des Saarlandes and DFKI; Prof. Dr. Inguna Skadiņa, Tilde; Prof. Dr. Andrejs Veisbergs, University of Latvia; Dr. Andrejs Vasiļjevs, Tilde; Dr.
Tatiana Gornostay, Tilde; Iveta Keiša, Tilde; Alda Rudzīte, Tilde

The development of this white paper has been funded by the Seventh Framework Programme and the ICT Policy Support Programme of the European Commission through the contracts T4ME (grant agreement no.: 249119), CESAR (grant agreement no.: 271022), METANET4U (grant agreement no.: 270893), and META-NORD (grant agreement no.: 270899).

Table of Contents

Preface
Imprint
Table of Contents
Executive Summary
A Risk for Our Languages and a Challenge for Language Technology
  Language Borders Hinder the European Information Society
  Our Languages at Risk
  Language Technology is a Key Enabling Technology
  Opportunities for Language Technology
  Challenges Facing Language Technology
  Language Acquisition
Latvian in the European Information Society
  General Facts
  Particularities of the Latvian Language
  Recent developments
  Language cultivation in Latvia
  Language in Education
  International aspects
  Latvian on the Internet
  Selected Further Reading
Language Technology Support for Latvian
  Language Technologies
  Language Technology Application Architectures
  Core application areas
  Language checking
  Web search
  Speech interaction
  Machine Translation
  Language Technology "Behind the Scenes"
  Language Technology in Education
  Language Technology Programs
  Availability of Tools and Resources for Latvian
  Status of Tools and Resources for Latvian
  Conclusions
References
META-NET
  META-NET’s Three Lines of Action
  Composition of the META-NET Network of Excellence
  How to Participate?

Executive Summary

Many European languages run the risk of becoming victims of the digital age because they are underrepresented and under-resourced online. Huge regional market opportunities remain untapped today because of language barriers.
If we do not take action now, many European citizens will become socially and economically disadvantaged because they speak their native language. Innovative language technology (LT) is an intermediary that will enable European citizens to participate in an egalitarian, inclusive and economically successful knowledge and information society. Multilingual language technology will be a gateway for instantaneous, cheap and effortless communication and interaction across language boundaries.

Today, language services are primarily offered by commercial providers from the US. Google Translate, a free service, is just one example. The recent success of Watson, an IBM computer system that won an episode of the Jeopardy game show against human candidates, illustrates the immense potential of language technology. As Europeans, we have to ask ourselves several urgent questions: Should our communications and knowledge infrastructure be dependent upon monopolistic companies? Can we truly rely on language-related services that can be immediately switched off by others? Are we actively competing in the global market for research and development in language technology? Are third parties from other continents willing to address our translation problems and other issues that relate to European multilingualism? Can our European cultural background help shape the knowledge society by offering better, more secure, more precise, more innovative and more robust high-quality technology?

The present white paper focuses on the Latvian language, which is the sole state language in the Republic of Latvia, one of the official languages of the European Union, and one of the oldest European languages, with about 1.5 million native speakers worldwide. While a number of basic language technologies and resources have been developed for the Latvian language, there are substantial gaps that should be filled urgently to ensure sustainable development of the language.
Examples include semantic analysis and discourse processing, summarization, question answering, speech recognition, advanced information access technologies and dialogue management systems, as well as discourse, multimedia and multimodal corpora and a wordnet. Moreover, as with several other languages of the European Union, some of the existing tools and resources for Latvian are either fragmented and not interoperable or not freely available to the community.

META-NET contributes to building a strong, multilingual European digital information space. By realising this goal, a multicultural union of nations can prosper and become a role model for peaceful and egalitarian international cooperation. If this goal cannot be achieved, Europe will have to choose between sacrificing its cultural identities or suffering economic defeat.

A Risk for Our Languages and a Challenge for Language Technology

As recent events in North Africa illustrate, we are witnesses to a digital revolution that is dramatically impacting communication and society. Recent developments in digitised and network communication technology are sometimes compared to Gutenberg’s invention of the printing press. What can this analogy tell us about the future of the European information society and our languages in particular? After Gutenberg’s invention, real breakthroughs in communication and knowledge exchange were accomplished by efforts like Luther’s translation of the Bible into common language.
In subsequent centuries, cultural techniques have been developed to better handle language processing and knowledge exchange: the orthographic and grammatical standardisation of major languages enabled the rapid dissemination of new scientific and intellectual ideas; the development of official languages made it possible for citizens to communicate within certain (often political) boundaries; the teaching and translation of languages enabled an exchange across languages; the creation of journalistic and bibliographic guidelines assured the quality and availability of printed material; the creation of different media like newspapers, radio, television, books, and other formats satisfied different communication needs.

In the past twenty years, information technology has helped to automate and facilitate many of these processes: desktop publishing software replaces typewriting and typesetting; Microsoft PowerPoint replaces overhead projector transparencies; e-mail sends and receives documents, often faster than a fax machine; Skype makes Internet phone calls and hosts virtual meetings; audio and video encoding formats make it easy to exchange multimedia content; search engines provide keyword-based access to web pages; online services like Google Translate produce quick and approximate translations; social media platforms facilitate collaboration and information sharing.

Although such tools and applications are helpful, can they sufficiently implement a sustainable, multilingual European information society, a modern and inclusive society where information and goods can flow freely?

Language Borders Hinder the European Information Society

We cannot precisely know what the future information society will look like.
When it comes to discussing a common European energy strategy or foreign policy, we might want to listen to European foreign ministers speak in their native language. We might want a platform where people who speak many different languages and who have varying language proficiency can discuss a particular subject while technology automatically gathers their opinions and generates brief summaries. We also might want to speak with a health insurance help desk that is located in a foreign country.

It is clear that communicative needs have a different quality than they had a few years ago. In a global economy and information space, more languages, speakers and content confront us and require us to quickly interact with new types of media. The current popularity of social media (Wikipedia, Facebook, Twitter and YouTube) is only the tip of the iceberg.

Today, we can transmit gigabytes of text around the world in a few seconds before we recognize that it is in a language we do not understand. According to a recent report requested by the European Commission, 57% of Internet users in Europe purchase goods and services in languages that are not their native language. (English is the most common foreign language followed by French, German and Spanish.) 55% of users read content in a foreign language, while only 35% use another language to write e-mails or post comments on the web.[1] A few years ago, English might have been the lingua franca of the web, since the vast majority of content on the web was in English. The situation has now changed drastically. The amount of online content in other languages (particularly Asian and Arabic languages) has exploded.
A ubiquitous digital divide that is caused by language borders has surprisingly not gained much attention in the public discourse; yet, it raises a very pressing question: "Which European languages will thrive and persist in the networked information and knowledge society?"

Our Languages at Risk

The printing press contributed to an invaluable exchange of information in Europe, but it also led to the extinction of many European languages. Regional and minority languages were rarely printed. As a result, many languages like Cornish or Dalmatian were often confined to oral forms of transmission, which limited their continued adoption, spread and use.

The approximately 60 languages of Europe are one of its richest and most important cultural assets. Europe’s multitude of languages is also a vital part of its social success.[2] While popular languages like English or Chinese will certainly maintain their presence in the emerging digital society and market, many European languages could be cut off by digital communications and become irrelevant for the Internet society. Such developments would certainly be unwelcome. On the one hand, a strategic opportunity would be lost that would weaken Europe’s global standing. On the other hand, such developments would conflict with the goal of equal participation for every European citizen regardless of language. According to a UNESCO report on multilingualism, languages are an essential medium for the enjoyment of fundamental rights, such as political expression, education and participation in society.[3]

[1] European Commission Directorate-General Information Society and Media, User language preferences online, Flash Eurobarometer #313, 2011 (http://ec.europa.eu/public_opinion/flash/fl_313_en.pdf).
[2] European Commission, Multilingualism: an asset for Europe and a shared commitment, Brussels, 2008 (http://ec.europa.eu/education/languages/pdf/com/2008_0566_en.pdf).

Language Technology is a Key Enabling Technology

In the past, investment efforts have focused on language education and translation. For example, according to some estimates, the European market for translation, interpretation, software localisation and website globalisation was € 8.4 billion in 2008 and was expected to grow by 10% per annum.[4] Yet, this existing capacity is not enough to satisfy current and future needs. Language technology is a key enabling technology that can protect and foster European languages.

Language technology helps people collaborate, conduct business, share knowledge and participate in social and political debates regardless of language barriers or computer skills. Language technology already assists everyday tasks, such as writing e-mails, searching for information online or booking a flight. We benefit from language technology when we: search for and translate web pages; use the spelling and grammar checking features in a word processor; view product recommendations at an online shop; hear the verbal instructions of a synthetic voice in a navigation system; translate web pages with an online service.

The language technologies detailed in this paper are an essential part of innovative future applications. Language technology is typically an enabling technology within a larger application framework like a navigation system or a search engine. These white papers focus on the readiness of core technologies in each language. In the near future, we need language technology for all European languages that is available, affordable and tightly integrated within larger software environments.
An interactive, multimedia and multilingual user experience is not possible without language technology.

Opportunities for Language Technology

Language technology can make automatic translation, content production, information processing and knowledge management possible for all European languages. Language technology can also further the development of intuitive language-based interfaces for household electronics, machinery, vehicles, computers and robots. Although many prototypes already exist, commercial and industrial applications are still in the early stages of development. Research has progressed steadily over the last few years, and the current rate of progress creates a genuine window of opportunity. For example, machine translation (MT) already delivers reasonable accuracy within specific domains, and experimental applications provide multilingual information and knowledge management as well as content production in many European languages.

[3] UNESCO Director-General, Intersectoral mid-term strategy on languages and multilingualism, Paris, 2007 (http://unesdoc.unesco.org/images/0015/001503/150335e.pdf).
[4] European Commission Directorate-General for Translation, Size of the language industry in the EU, Kingston Upon Thames, 2009 (http://ec.europa.eu/dgs/translation/publications/studies).

One can think of language technology as the operating system for the content and user interaction.

Language applications, voice-based user interfaces and dialogue systems are traditionally found in highly specialised domains, and they often exhibit limited performance. One active field of research is the use of language technology for rescue operations in disaster areas.
In such high-risk environments, translation accuracy can be a matter of life or death. The same reasoning applies to the use of language technology in the health care industry. Intelligent robots with cross-lingual language capabilities have the potential to save lives.

There are huge market opportunities in the education and entertainment industries for the integration of language technologies in games, edutainment offerings, simulation environments or training programmes. Mobile information services, computer-assisted language learning software, eLearning environments, self-assessment tools and plagiarism detection software are just a few more examples where language technology can play an important role. The popularity of social media applications like Twitter and Facebook suggests a further need for sophisticated language technologies that can monitor posts, summarise discussions, suggest opinion trends, detect emotional responses, identify copyright infringements or track misuse.

Language technology represents a tremendous opportunity for the European Union that makes both economic and cultural sense. Multilingualism in Europe has become the rule. European businesses, organisations and schools are multinational and diverse. Citizens want to communicate across the language borders that still exist in the European Common Market. Language technology can help overcome such remaining barriers while supporting the free and open use of language. Furthermore, innovative, multilingual language technology for European languages can also help us communicate with our global partners and their multilingual communities. Language technologies support a wealth of international economic opportunities.

Challenges Facing Language Technology

Although language technology has made considerable progress in the last few years, the current pace of technological development and product innovation is too slow.
We cannot wait ten or twenty years for significant improvements to be made that can further communication and productivity in our multilingual environment.

Language technologies with broad use, such as the grammar and spell checking features in word processors, are typically monolingual, and they are only available for a handful of languages. Applications for multilingual communication require a certain level of sophistication. Machine translation and online services like Google Translate or Bing Translator are excellent at creating a good approximation of a document’s contents. But such online services and professional MT applications are fraught with various difficulties when highly accurate and complete translations are required. There are many well-known examples of funny-sounding mistranslations, for example, literal translations of the names Bush or Kohl, that illustrate the challenges language technology must still face.

Language Acquisition

To illustrate how computers handle language and why language acquisition is a very difficult task, we take a brief look at the way humans acquire first and second languages, and then we sketch how machine translation systems work. There is a reason why the field of language technology is closely linked to the field of artificial intelligence.

Humans acquire language skills in two different ways. First, a baby learns its native language via examples. Exposure to concrete linguistic specimens produced by language users, such as parents, siblings and other family members, helps babies from about the age of two produce their first words and short phrases. This is only possible because of a special genetic disposition humans have for learning their first language.
Learning a second language usually requires much more effort. At school age, foreign languages are usually acquired by learning their grammatical structure, vocabulary and orthography from books and educational materials that describe linguistic knowledge in terms of abstract rules, tables and example texts. Learning a foreign language takes a lot of time and effort, and it gets more difficult with age.

The two main types of language technology systems acquire language capabilities in a manner similar to humans. Statistical approaches obtain linguistic knowledge from vast collections of concrete example texts in a single language or in so-called parallel texts that are available in two or more languages. Machine learning algorithms model some kind of language faculty that can derive patterns of how words, short phrases and complete sentences are correctly used in a single language or translated from one language to another. The sheer number of sentences that statistical approaches require is huge. Performance quality increases as the number of analysed texts increases. It is not uncommon to train such systems on texts that comprise millions of sentences. This is one of the reasons why search engine providers are eager to collect as much written material as possible. Spelling correction in word processors and online information and translation services such as Google Search and Google Translate rely on a statistical (data-driven) approach.

Rule-based systems are the second major type of language technology. Experts from linguistics, computational linguistics and computer science encode grammatical analysis (translation rules) and compile vocabulary lists (lexicons). The establishment of a rule-based system is very time consuming and labour intensive. Rule-based systems also require highly specialised experts.
Some of the leading rule-based machine translation systems have been under constant development for more than twenty years. The advantage of rule-based systems is that the experts have more detailed control over the language processing. This makes it possible to systematically correct mistakes in the software and give detailed feedback to the user, especially when rule-based systems are used for language learning. Due to financial constraints, rule-based language technology is only feasible for major languages.

Latvian in the European Information Society

General Facts

Latvian is the sole state language in the Republic of Latvia and one of the official languages of the European Union. There are about 1.5 million native Latvian speakers worldwide, of whom 1.38 million live in Latvia, while the others are scattered across the USA, Russia, Australia, Canada, UK, Germany, Ireland, as well as Lithuania, Estonia, Sweden, Brazil and other countries. Latvian, though apparently small, is in fact approximately the 150th most spoken of the world’s roughly 6,900 languages. At least 500,000 non-Latvians speak Latvian in addition to their own native language.

Since Latvia regained independence in 1990, Latvian has had state language status, which extends to all spheres of language use. Accordingly, more and more minority language speakers in Latvia also speak Latvian. The 1989 population census data showed that 23% of Latvia’s national minorities spoke Latvian. According to the 2000 population census data, the share of Latvian speakers among national minorities had increased to 53%. However, due to low birth rates, the number of Latvian speakers decreases by approximately 5,000 people (0.3%) annually. Latvian is the native language of 95.6% of Latvians.
Among national minorities, Latvian is most often considered the native language by Lithuanians (42.5%), Estonians (39.2%) and Germans (24.6%). For comparison, 39.6% of Latvia’s citizens are native speakers of Russian. For a large number of other national minorities (Jews, Belarusians, Ukrainians, Poles), Russian is their mother tongue and language of everyday communication.

Although often referred to as a new language of a new republic, Latvian is in fact one of the oldest European languages, with numerous similarities to Sanskrit, the language closest to the original Indo-European language. The Latvian language belongs to the Baltic branch of the Indo-European language family. The Baltic languages are divided into East Baltic and West Baltic languages. There are only two living Baltic languages today: Latvian and Lithuanian, both of which belong to the East Baltic languages. Although Latvian is related to Lithuanian, speakers of the two languages cannot communicate with each other freely. The similarity between them is comparable to that between Spanish and Italian or Russian and Polish.

The Latvian language has 3 dialects (the Central dialect, Tamian and the High Latvian dialect) and more than 500 vernaculars or sub-dialects. These dialects are influenced by standardization and by social and cultural-historical factors, and they are subject to a process of improvement and accommodation to the literary standard language. The literary standard language has been developed on the basis of the Central dialect.

The written form of the Latvian language has existed for about 400 years. The first written monuments of Latvian are writings in Gothic script from the 16th century, when, under the influence of the ideas of the Reformation, the clergy attempted to bridge the divide between the local peasants and the landlords of Teutonic descent. The first great landmark of Latvian is the translation of the Bible (1689).
Thus Latvian obtained a powerful literary document whose language was to affect the development of written Latvian (the so-called Old Writing) for centuries. It imposed a standard on the written language and was also important as a recognition of the language. It should be noted that the first texts in Latvian were written by Baltic Germans and were mostly translations. Baltic Germans also produced Latvian grammars and dictionaries, collected and recorded folk songs, and controlled and dominated the language scene in general. Real writing in Latvian started only in the 19th century, when national literature and cultural aspirations emerged and Latvian linguistics came into the hands of native speakers.

As a result of centuries of foreign domination, one can trace numerous lexical and morphological influences in modern Latvian: loanwords, calques and borrowed idioms that have been fully assimilated. In spite of extensive and varied contacts with other languages (German, Polish, Swedish, Russian, English), the inner system of Latvian has survived and the language maintains its stability. Latvian is characterized by a complex grammatical system and a certain linguistic conservatism, yet also by openness to outside influences.

Latvian orthography underwent a gradual reform from Gothic script to Latin script (with diacritics) at the beginning of the 20th century. There have been two orthographic traditions (with minor differences) since World War II: the orthography used by Latvians in Latvia and the orthography used by émigré Latvians abroad. In addition, a Latgalian orthographic tradition exists in the eastern part of Latvia.

Particularities of the Latvian Language

The high linguistic quality and rich means of expression of the Latvian language are among the prerequisites for the stability and competitiveness of the language.
The Latvian language exhibits some specific characteristics, including:
pronunciation that corresponds almost fully to the writing;
a large number of grammatical forms and endings due to inflection;
a large number of derived words and derivational means;
free word order;
punctuation principles based on grammar and intonation.

The Latvian language uses a phono-morphological basis of orthography. Latvian orthography corresponds almost fully to the pronunciation (diacritical marks are used to indicate the length of a sound, palatalization and sibilants); therefore, it is considered to be one of the best orthographic systems. The new orthography (at the end of the 19th century) was created by the first Latvian intellectuals, who, in their search for the most suitable means of representing the Latvian sound system in writing, found ideas in other languages (for example, letters of the Czech language were selected for sibilants).

The first requirement for correct spelling is correct pronunciation (orthoepy). In Latvian, as a general rule, each sound is represented by its own letter. In some cases one sound is represented by two letters (dz, dž); in others, one letter represents more than one sound (the letter e represents the narrow e sound and the broad e sound; the letter ē represents the narrow and the broad [ē] sound; the letter o represents three sounds: the short vowel [o], the long vowel [ō] and the diphthong [uo]).

Standard Latvian, with a few minor exceptions, has fixed initial stress. Long vowels and diphthongs have a tone regardless of their position in the word. Syllable tones, or intonations (three types), are one of the rarities of the Latvian language, preserved from the ancient syllable tone system of Indo-European; they are also present in Lithuanian, Slovenian and Serbian (for comparison, tones are also important in other languages, such as Chinese).
However, the tones may make it difficult to learn the language and may frequently cause misunderstandings, because a length mark or even just a tone may differentiate the meanings of words (for example, ‘kazas’ (goats) and ‘kāzas’ (wedding); ‘zāle’ with level tone (hall) and ‘zāle’ with broken tone (grass, herb)). Context-dependent pronunciation must be taken into account not only by language learners but also by language technology developers.

The Latvian language is a synthetic, inflected language. Its words change their form according to their grammatical function. This means that the endings of nouns, pronouns, adjectives, numerals and verbs change depending on certain features. The main features in Latvian are gender, number, case, tense, voice, degree of comparison, person, definiteness, mood and reflexivity. Words belonging to different parts of speech have different sets of features. Word forms are distinguished not only by endings; there is also a rich system of derivational affixes. For instance, Latvian nouns have 29 graphically different endings, adjectives have 24 and verbs have 28. Across all three word types, only half of the endings are unambiguous; for the rest, multiple base forms may be derived from the inflected form.

The Latvian language does not have definite or indefinite articles. Definiteness can be indicated by the endings of adjectives. They can be either definite (‘-ais’ for singular nominative masculine, e.g. ‘lielais’, ‘garais’, and ‘-ā’ for singular nominative feminine, e.g. ‘lielā’, ‘garā’) or indefinite (‘-s’ or ‘-š’ for singular nominative masculine, e.g. ‘liels’, ‘garš’, and ‘-a’ for singular nominative feminine, e.g. ‘liela’, ‘gara’).

Owing to its structure, the Latvian language has very rich word-building potential.
Mostly, words are built morphologically, by adding affixes (word components) to the stem of the word; less often, new words are built as compound words; there are also other methods of building new words. New technologies make it possible to gain an accurate view of the options for building words and word forms: computations have shown that, in combinations with about 40 word-building affixes, the number of possible items might be about 40 million.

The order of sentence parts is relatively free; the grammatical means for marking syntactic relations are mainly endings. For instance, the sentence ‘kaķis ķer peli’ (a cat is catching a mouse), with the direct word order SVO (subject, verb, object), could also be formed with OVS word order (‘peli ķer kaķis’), VSO (‘ķer kaķis peli’) or VOS (‘ķer peli kaķis’). There is a tendency to place the word that carries the more important information at the end of the sentence. The most common orders of sentence parts are subject, predicate, complement (‘Māsa lasa grāmatu’ – ‘The sister is reading a book’) and subject, predicate, adverbial (‘Zēns mācās labi’ – ‘The boy learns well’).

Latvian punctuation rules are so complicated that it is almost impossible to write without a thorough knowledge of grammar. Latvian punctuation is based on the grammatical punctuation principle. This means that punctuation marks mainly indicate the grammatical links and divisions between parts of the text and of the sentence. According to this rule, punctuation marks are used to separate sentences, parts of a compound sentence, equal parts of a sentence, etc. Besides the grammatical principle, the intonational principle is also important in Latvian punctuation. Based on the latter, punctuation marks are used to mark pauses and the emphasis of word groups. The intonational principle supplements the grammatical principle to provide a better representation of the nuances of the content of a text or sentence.
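The way endings, rather than position, mark syntactic relations in the ‘kaķis ķer peli’ example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mini-lexicon `FORMS` and the helper functions `roles` and `order_label` are hypothetical names invented for this example, cover just the three word forms of the example sentence, and do not describe how any actual Latvian analyser works.

```python
# Hypothetical mini-lexicon (an assumption for illustration): each word form
# maps to its base form and the syntactic role signalled by its ending.
FORMS = {
    "kaķis": ("kaķis", "subject"),  # nominative singular
    "peli": ("pele", "object"),     # accusative singular
    "ķer": ("ķert", "verb"),        # third person, present tense
}

# The four word orders listed in the text.
SENTENCES = [
    "kaķis ķer peli",
    "peli ķer kaķis",
    "ķer kaķis peli",
    "ķer peli kaķis",
]


def roles(sentence):
    """Map each syntactic role to the base form of the word filling it."""
    return {FORMS[word][1]: FORMS[word][0] for word in sentence.split()}


def order_label(sentence):
    """Return the word-order pattern of the sentence, e.g. 'SVO'."""
    return "".join(FORMS[word][1][0].upper() for word in sentence.split())


if __name__ == "__main__":
    for sentence in SENTENCES:
        print(order_label(sentence), "|", sentence, "|", roles(sentence))
```

Every word order yields the same role assignment, because the roles are read from the word forms and not from their positions. A real system would additionally have to cope with the ambiguous endings mentioned earlier in this section, which this sketch deliberately ignores.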
Recent developments

Although more than ten contact languages have left their traces during the development of the Latvian language in different historical periods, the most significant language competition has come from German, Russian and English. Over the last decade, there has been a significant increase in English influence. Steady English borrowing has been present in Latvian for a century, at first through Russian. The latest growth of borrowing has affected such areas as electronics, information technologies, music, sports, medicine, administration and politics, as well as colloquial and slang Latvian. This fast expansion came with the slackening of ideological barriers, the diminishing of Russian influence and the openness of the country to the West. The language aspect changed with new incentives. In the past, although English was the major foreign language in Latvian schools (after Russian), English teaching nevertheless resembled that of Latin, as there were no opportunities ever to use it. The political openness starting in the late 1980s changed this immediately.

Currently, adverse trends occur in higher education and research. As in many European countries, there is a tendency to switch to English, which poses a threat to the development of the Latvian language. This trend can lead to an inability to communicate in one's native language in certain professional fields due to a deficit of appropriate linguistic means of expression. Negative trends also appear in other fields, such as the entertainment industry and the banking and finance sector.

Concerns about the language in the Latvian community do not cease. They focus not only on language usage but also on language quality. Changes in traditional culture and exposure to global trends have also affected the language.
In the global and digitalized environment of new technologies, language must function in an accelerated mode and the consequences are apparent: ambiguous standards of spoken and written language, a lack of authoritative recommendations, etc. The speed of social and political life and the dynamic nature of the mass media require new expressions for new concepts. The easiest way is often to select haphazard clichés created in haste. The developments are not regulated by official procedures, and term builders are not efficient enough to propose, in a timely manner, terms and words that are correct from the point of view of Latvian linguistic norms. But anyone who adopts buzzwords haphazardly runs the risk of being misunderstood.

Although in practice redundant foreign words can be successfully replaced by national coinages or appropriate borrowings (e.g. ‘ofšors’ – ‘ārzona’, ‘kompjūters’ – ‘dators’), the percentage of full loans has constantly been very high. With the growth of information in foreign languages, there is an increasing trend simply to transcribe words of other languages and add Latvian endings. In fact, it is a return to the 19th century, when the use of Germanisms was widespread. One can even assume that the proportion of foreign words is constantly increasing as new concepts emerge and the vocabulary grows. There are concerns that too many foreign words are used in Latvian, although no study supports this opinion.

Language cultivation in Latvia

Latvian is the only state language in the Republic of Latvia as provided for by law: Article 4 of the 1922 Constitution, which states that Latvian is the official language of the Republic of Latvia, was revised by the 1989 Law on Language, which was amended in 1992. In order to better understand the strategy of the Latvian language policy, some knowledge of the historical background is required.
In the 16th–19th centuries, German served the key sociolinguistic functions. After the Great Northern War (1700–1721), the territory of Latvia was subjugated by Russia; however, a special agreement was signed on the use of German in administration and culture. From the end of the 18th century, the Latvian language developed against a background of increased competition from the German and Russian languages. The speakers of Latvian were subject to covert or overt Germanization and Russification. Russification grew in strength at the turn of the 19th and 20th centuries and became threatening during the Soviet period, when Latvia was annexed by the USSR. As a result, the Latvian language was very close to becoming an endangered language, with Russian dominating all public spheres except Latvian culture and education, and the Latvian population almost becoming a minority in its own land. Now, thanks to the state language policy, the situation of the Latvian language is slowly improving.

Latvian language policy is complex and difficult to implement due to the extremely high proportion of ethnic minorities (about 40% of the population). These include Russians, Belarusians, Ukrainians, Poles, Lithuanians, Jews, Roma, Germans, Tatars, Armenians, Estonians and other nationalities. The Slavic minorities were Russified during the Soviet occupation, when according to communist dogma only two languages could exist in Latvia: Latvian and Russian. As the ethnic situation was so unfavourable and explosive, Latvian language policy was developed by aligning it as much as possible with international instruments on human rights. Recommendations of international experts on minority rights were carefully followed.

The National Program “Integration of Society in Latvia” (2001) promotes the development of a consolidated civil society and the harmonious integration of all ethnic minorities.
Among the major tasks are support for Latvian language training and reform of the education system, which was segregated into Russian and Latvian schools during Soviet rule, as well as protection of language rights for minorities in Latvia. Article 5 of the 1999 State Language Law does, however, add that any language other than Latvian will be considered to be a foreign language. A special status applies to the Liv language: the Livonians are the only ethnic minority of Latvia with indigenous status (only about 20 speakers of Liv remain).

Today the Latvian language functions in all spheres of life. The law regulates the use of the state language in state, municipal, judicial and educational institutions, as well as in other agencies and businesses. Official, business and legal meetings, and those which take place in public service institutions, must be carried out in Latvian or provide for interpretation of the discussion into the state language if at least one participant requests it. The same provisions apply to the private sector “to the level which is considered to be necessary”, an expression which leaves a large margin for manoeuvre in practice. The law does not apply to private communications, languages used in a religious context or internal exchanges within certain ethnic groups.

A strong step towards strengthening the Latvian language was the assignment of language proficiency grades to certain professions and jobs. In 1995, the Latvian Government set up the ‘National Latvian Language Learning Program’, and in 2004 it created the National Agency for Learning Latvian, which offered free language lessons to professionals for whom knowledge of Latvian is imperative, such as police and medical staff, but also to large sections of the working population.
The latest initiative was adopted by the Riga City Council, which proposed a project competition, «Organizing and implementing Latvian language learning courses for residents of Riga city», providing an opportunity for residents of Riga whose native language is not Latvian to learn it free of charge.

The institutions in charge of language policy are the Saeima (the Parliament), the Cabinet of Ministers, the Ministry of Education and Science, municipalities, universities and schools. The Latvian Language Agency is the state regulatory authority, supervised by the Minister of Education and Science, which focuses on language policy and its implementation and also provides consultation services on language issues. The State Language Centre was created in 1992 to monitor the observance of the language laws and is now also responsible for the translation of European Union and NATO documents. It includes the Commission of Latvian Language Experts, which is authorized to decide on spelling issues. The Terminology Commission of the Latvian Academy of Sciences is the main institution for the development of unified, coordinated and harmonized terminology. New terms are coined and terminology issues are discussed in sub-commissions for specific domains.

A kind of umbrella function is assigned to the State Language Commission, operating under the President of Latvia. The members of the Commission are the heads of all of the above institutions, representatives of universities and several representatives of the community; the resolutions of the Commission are only advisory in nature. There are six subcommissions under the State Language Commission for topical issues. The subcommission “Latvian in the new technologies” addresses the development of language technologies and their widespread usage.

Since the restoration of independence in 1990, the Latvian language has changed considerably, and the changes never cease.
There is a trend of a clearly negative approach to the current changes in the language (represented by general expressions such as “the language is declining” or “the language is being cluttered up”, as well as by the identification of specific unwelcome phenomena). Representatives of this trend (purists) would like to stabilize the vocabulary of the literary language by using solely the resources of Latvian to build new words. However, words borrowed from other languages adapt faster and more easily in language circulation than native neologisms. For example, ‘tirgzinība’, the Latvian coinage for ‘marketing’, failed to gain acceptance because the mass media preferred ‘mārketings’, which became popular in colloquial speech.

In order to promote the formation of words for new concepts in Latvian that are linguistically correct and at the same time widely accepted by users, volunteer language enthusiasts organize an annual survey, “Word and Antiword of the year”. Some successful neologisms (e.g. ‘mēstule’ - spam, ‘zīmols’ - brand, ‘vingrums’ - fitness) have gained wide appreciation. However, only a few words are highlighted annually, while the number of new concepts waiting for their Latvian designations runs into thousands.

Language in Education

Language policy in education was defined by the Law on Education of 1991, which stated that any language other than Latvian has the status of a foreign language. Without special dispensation, diplomas issued by the Latvian State and professional qualification exams can only be offered in the national language. Since September 2004, it has been compulsory for at least five disciplines to be taught in Latvian from the 10th grade throughout the public school system (including minority language schools). Indeed, in autumn 2006, 73.5% of students in the 11th grade studied in Latvian programmes. Unfortunately, the legal framework and the actual situation do not always entirely fit together.
The situation in the Russian minority schools is unusual. Most lessons are given in Russian, with some teaching in Latvian. These schools are finding it difficult to work towards the 60% of lessons taught in Latvian as required by law. In the academic year 2010/2011 the total number of students in general full-time education programs was 216,307. Latvian was the language of instruction for 158,137 students (73.11%), Russian for 56,636 students (26.18%), and another language of instruction for only 1,534 students (0.71%).

Article 41 of the 1998 Law on Education states that educational institutions may offer programs adapted for national minorities as long as they are in accordance with the Ministry's regulations on education, but that these programs must be accompanied by subjects taught in the national language. The Russian community in Latvia has reservations about these provisions. Currently, there is a controversy regarding the submitted initiative for amendments to Article 112 of the Constitution to achieve a gradual transfer to Latvian as the language of instruction in all schools starting from 1 September 2012. If national minorities' schools were to transfer to Latvian as the language of instruction, a uniform and cheaper system for language teaching would be among the benefits. However, representatives of national minorities claim that their children have the right to education in their native language.

According to the Education Law and the Law on Institutions of Higher Education, Latvian must be the only language of instruction in public institutions of higher education. The choice of the language of tuition in private universities is not regulated.
However, there are several requirements:
1) the examinations of professional qualifications must be taken in the state language;
2) the works and papers for academic and research degrees must be developed and presented in the state language unless there are other requirements provided for in the law;
3) the improvement of professional skills and retraining financed by state or municipal budget funds is carried out in the state language.

The language situation in higher education is directly dependent on the language and education policy in Latvia and in the EU. In the context of language policy, there are two essential objectives:
to provide higher education that is able to prepare specialists, researchers and scholars who are competitive on a global scale, which means that these professionals must have a very good command of foreign languages;
to ensure, as every country must, the comprehensive functioning of the national language in higher education and science.

We can say that the laws and regulations adopted in Latvia ensure the retention of the dominant role of the official language in higher education in Latvia while providing opportunities to master professional knowledge at a competitive level in other EU languages as well (mostly English). However, with the increase in the number of exchange study programs and the need to obtain and provide professional information in foreign languages, higher education and science in Latvia, just as in many other European countries, tend to switch to English.

International aspects

Latvian is one of the official languages of the European Union. Every resident is entitled to apply to the EU institutions in Latvian and receive a reply in Latvian. The position of Latvian also gains strength due to the state language policy: foreign residents who come to Latvia for work or study must learn Latvian.
In addition, due to its rich folklore heritage and its complex and ancient language system, Latvian is used by linguists from other countries for research. The detailed rules and principles of Latvian grammar may serve as a base for research on machine translation systems and other language technology products targeted at minor languages.

Support for the Latvian language abroad is provided in two areas:
support for learning Latvian as a foreign language at universities abroad (Latvian can be learned at 22 universities worldwide);
support for the Latvian language among the various diaspora communities.

Several Latvian institutions of higher education and the Latvian Language Agency cooperate with foreign universities regarding the learning of Latvian. The latest accomplishment is the opening of a lecturer position at the Beijing Foreign Studies University in China for the academic year 2011/2012 to organize Latvian language courses and to teach a course on Latvian cultural history (in English).

There are many possibilities to learn Latvian in neighbouring Lithuania. Since 1995, the Letonika Centre has operated at Vytautas Magnus University in Kaunas. The accession of Lithuania and Latvia to the European Union enlarged the range of opportunities to develop academic connections: Socrates/Erasmus agreements were signed with a number of universities, covering the exchange not only of students but also of teachers and providing sufficient financial support. Note that since 2008 Latvian has been included in the curricula of Lithuanian secondary schools as an optional third foreign language. It can be learned in several secondary schools located near the Latvian border.

Political and socio-economic factors have contributed to the spread of the Latvian diaspora all over the world over the last 150 years. Preliminary data show that more than 1/10 [...] currently.
Among the tasks of the long-term program approved by the Latvian government are the supply of study aids and manuals to associations of the Latvian diaspora, the strengthening of Sunday school networks, the provision of teachers of the Latvian language and literature, opportunities for the younger generation of the Latvian diaspora to study at Latvian universities, and support for persons who wish to repatriate. The 2009 survey “Usage of Language in Diaspora: Evaluation of Policy of Latvia and Experience of Other Countries”, performed by the Latvian Language Agency with support from the Norwegian government, urges proactive action to prevent the gap between the government (the state) and representatives of the new diaspora from widening, and steps to avert a negative attitude towards Latvia among those who have left recently.

The Latvian Language Agency has supported various teaching activities in the Russian Federation and Ireland. It has prepared two programs: a Latvian language learning program for the diaspora and a further education program for teachers who work in the diaspora. The training program for teachers involved 61 participants from 14 countries.

Latvian on the Internet

To foster synergy between the Latvian language and technology, the State Language Commission's subcommission on Latvian in the new technologies has set the following key goal: Latvian shall be provided with full software support in all popular technologies; this support shall be of high quality, shall be maintained and developed in pace with new technologies, and shall be widely accessible and applied. To reach this goal, the subcommission has set the following priority tasks: to develop language computer technologies, to ensure the availability and application of these technologies in widely used systems, to develop regulations for the use of Latvian in computer systems, and to promote the development and implementation of IT and telecommunications terminology.
According to a survey on the Discovery News website, there are 1,369,600 Internet users in Latvian; the number is lower than that for Lithuanian, but higher than for Estonian. TNS Latvia, a market, public opinion and media research agency, has gathered the latest results of the Internet audience survey for winter 2011. On average, 64% or 1,123,000 residents of Latvia between the ages of 15 and 74 have used the Internet in the last six months; this is 4 percentage points more than in winter 2010. The fastest growth in the number of Internet users is among residents of Latvia aged between 20 and 29.

The role of the Internet in business is confirmed by a survey carried out by GARM Technologies in cooperation with the Latvian Internet Association. According to the survey, the disappearance of the Internet would have an adverse effect on the operation of 37% of companies and would halt the operation of 4%.

The language used on the Internet is specific; it has certain traditions and may show characteristics of linguistic impunity. There are services on the Internet where the language used is edited. However, there are also extensive materials available to the public where the language is not edited. Internet communication introduces methods and vocabulary not used before: graphical characters like “smileys” for the expression of emotions, omission of diacritic marks, unusual abbreviations, colloquialisms and slang. The Internet, just like other means of communication, is a source of language facts reflecting the development trends of a language.

For language technology, the growing spread of the Internet is important in two ways. On the one hand, the large amount of digitally available language data represents a rich source for analysing the usage of natural language, in particular by collecting statistical information. On the other hand, the Internet offers a wide range of application areas for language technology.
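The omission of diacritic marks in Internet communication has a concrete consequence for language technology that collects statistics from such text: distinct Latvian words can collapse into a single string, as with ‘kazas’ (goats) and ‘kāzas’ (wedding). A minimal Python sketch of this effect follows. It uses only the standard `unicodedata` module; the helper names `strip_diacritics` and `collisions` and the three-word list are assumptions made purely for illustration, not part of any existing tool.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Illustrative word list (an assumption, not a real lexicon).
WORDS = ["kazas", "kāzas", "zāle"]


def collisions(words):
    """Group words that become identical once diacritics are dropped."""
    groups = defaultdict(list)
    for word in words:
        groups[strip_diacritics(word)].append(word)
    return {key: forms for key, forms in groups.items() if len(forms) > 1}


if __name__ == "__main__":
    print(strip_diacritics("kaķis ķer peli"))
    print(collisions(WORDS))
```

Restoring the lost marks is the harder direction: a tool that reads ‘kazas’ in unedited text would need context to decide which of the two words was meant.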
It is important to ensure that content in Latvian is well represented on the Internet. The National Library of Latvia is creating the Latvian National Digital Library “Letonica”, which includes digitised collections of newspapers, pictures, maps, books, sheet music and audio recordings. Its aim is to digitise the collections of libraries and make them accessible on the web. The collection Periodicals [5] offers 40 newspapers and magazines in Latvian, German and Russian from 1895 to 1957 (more than 350,000 pages).

Online encyclopaedias, dictionaries, literary works and language tools are provided at the portal Letonika.LV, developed by Tilde. Letonika.LV includes numerous general and specialized dictionaries for 20 translation directions: from English, French, German and Russian into Latvian and vice versa, Latvian-Lithuanian, Lithuanian-Latvian and Estonian-Latvian, as well as more than 40 terminological dictionaries. The online collection of Latvian literature includes 200 full-text works and collections by 22 authors, with a total volume of 22,000 digitised pages.

The Institute of Mathematics and Computer Science (IMCS) of the University of Latvia offers a large collection of digital content, including lexical resources, texts and corpora, and computer-assisted teaching aids. Most of the resources are available on the Web [6] and are used in humanities research and education. Among the corpora collected by IMCS are the Balanced Corpus of Modern Latvian [7] (~3.5 million running words), the Latvian Web Corpus (~100 million running words), the Corpus of the Transcripts of the Saeima's (Parliament of Latvia) Sessions (more than 20 million running words), a corpus of early written Latvian texts [8] (Andronova 2007) and a collection of classical Latvian literature.

IMCS has collected numerous Latvian dictionaries, mainly explanatory dictionaries and dictionaries of terminology.
Main resources are: electronic version of Mülenbach-Endzelin ―Lettisch-deutsches Wörterbuch‖9, the Dictionary of Standard Latvian Language (~64 000 entries) and the Explanatory Dictionary which contains more than 150 000 entries from about 120 Latvian dictionaries of different times and domains. E-learning materials developed by IMCS comprise e-courses, e-books, teaching aids, exercises and tests for different levels of language learners, starting 5 http://periodika.lv http://www.ailab.lv 7 http://www.korpuss.lv 8 http://www.ailab.lv/senie 9 http://www.ailab.lv/mev 6 19 Language Technology Support for Latvian from elementary school and ending with secondary school. In order to assist deaf children a sign language dictionary has been developed. Most of elearning materials are included in Latvian Education Information System LIIS. The Terminology Commission of the Latvian Academy of Sciences publishes official terminology in two large online databases: www.termnet.lv (about 150 000 terms) and termini.lza.lv/akadterm. The former database is also integrated with the largest European terminology portal EuroTermBank. The extensive online collection of Latvian folklore resources is created by Institute of Literature, Folklore and Art of University of Latvia and the Archives of Latvian Folklore10, including numerous audio and video recordings. Collection of Latvian folk songs Dainu skapis collected by Krišjānis Barons is included in the UNESCO Memory of the World list and its digitised version is accessible online11. Dialect materials are collected by Latvian Language Institute and regional universities, such as The Courland‘s Folklore and Language Centre12. In CLARIN project Latvian language resources and tools were identified and registered at CLARIN Repository13 which currently lists 34 resources and 9 tools. Selected Further Reading Break-out of Latvian. 2008. State Language Commission, Rīga, Zinātne, ISBN 978-9984-808-39-0 Gunta Kļava, V. Līcīte, K. Motivāne et al. 
2009. Usage of Language in Diaspora: Evaluation of Policy of Latvia and Experience of Other Countries. Riga: Latviešu valodas aģentūra, 24 p.
The influence of language proficiency on the standard of living of economically active part of population. Survey of the Sociolinguistic Research Data Serviss. 2006. Valsts valodas aģentūra, Riga, 24 p., ISBN 9984-9836-6-8
Guideline of the State Language Policy for 2005-2014. 2007. Valsts valodas aģentūra, Riga: Valsts valodas aģentūras bibliotēka, 32 p., ISBN 978-9984-9958-0-9
The State Language Policy Programme for 2006-2010 (http://www.valoda.lv/images/stories/Valsts_valodas_politikas_programma.pdf)
Digital Library “Letonica” (http://www.lnb.lv/en/digital-library)
Digital resources and tools at the Artificial Intelligence Laboratory of the Institute of Mathematics and Computer Science, University of Latvia (www.ailab.lv)
Digitized collection of folk songs collected by Krišjānis Barons (http://www.dainuskapis.lv/)
European Federation of National Institutions for Language, information about Latvian (http://www.efnil.org/documents/language-legislation-version-2007/latvia/latvia)
The Latvian Education Informatization System (LIIS) project (http://www.liis.lv)
The Latvian Institute (http://www.li.lv/)
The Latvian Language Agency (http://www.valoda.lv/)
Legislation of the Republic of Latvia translated by the State Language Center (http://www.vvc.gov.lv/advantagecms/LV/tulkojumi/index.html)
Tilde Latvian language and encyclopaedia portal Letonika.LV (http://www.letonika.lv)
Project Modern Latvian on the Web (http://www.lu.lv/filol/valoda/)
The State Language Commission (http://www.vvk.lv/ and http://www.president.lv/pk/content/?cat_id=8)
The State Language Law (http://www.likumi.lv/doc.php?id=14740)

[10] http://www.lfk.lv/kratuve/records.jsp?lg=en
[11] http://www.dainuskapis.lv/
[12] http://kfvc.liepu.lv/
[13] http://www.clarin.eu/view_resources

Language Technology Support for Latvian

Language Technologies

Language technologies are information technologies that are specialized for dealing with human language. Therefore, these technologies are also often subsumed under the term Human Language Technology. Human language occurs in spoken and written form. While speech is the oldest and most natural mode of language communication, complex information and the bulk of human knowledge are recorded and transmitted in written texts. Speech and text technologies process or produce language in these two forms. Language also has aspects common to both forms, such as dictionaries, most of the grammar, and the meaning of sentences. Thus, large parts of Language Technology cannot be subsumed under either speech or text technologies. Knowledge technologies include technologies that link language to knowledge. Figure 1 illustrates the Language Technology landscape.

In our communication, we mix language with other modes of communication and other information media. We combine speech with gestures and facial expressions. Texts can be combined with pictures and sounds. Movies may contain language in spoken and written form. Thus, speech and text technologies overlap and interact with many other technologies that facilitate the processing of multimodal communication and multimedia documents.

Language Technology Application Architectures

Typical software applications for language processing consist of several components that mirror different aspects of language and of the task they implement. Figure 2 displays a highly simplified architecture that can be found in a text processing system. The first three modules deal with the structure and meaning of the text input:

Pre-processing: cleaning up the data, removing formatting, detecting the input language if necessary, breaking text into sentences and tokens, etc.
Grammatical analysis: finding the verb and its objects, modifiers, etc.; detecting the sentence structure.
Semantic analysis: disambiguation (Which meaning of apple is the right one in the given context?), resolving anaphora and referring expressions like she, the car, etc.; representing the meaning of the sentence in a machine-readable way.

Task-specific modules then perform many different operations, such as automatic summarization of an input text, database look-ups and many others. Below, we illustrate core application areas and highlight certain modules of the different architectures in each section. Again, the architectures are highly simplified and idealized; they serve to illustrate the complexity of language technology applications in a generally understandable way. The most important tools and resources involved are underlined in the text and can also be found in the table at the end of the chapter.

After introducing the core application areas and briefly describing the activities in the respective field in Latvia, we give a short overview of the situation in LT research and education, concluding with an overview of (past) funding programs. At the end of this section, we present an expert estimation of the situation regarding core LT tools and resources along a number of dimensions such as availability, maturity, or quality. This table gives a good overview of the situation of LT for Latvian.

[Figure 1: The Language Technology Landscape]
[Figure 2: A Typical Text Processing Application Architecture (Input Text, Pre-processing, Grammatical Analysis, Semantic Analysis, Task-Specific Modules, Output)]

Core application areas

Language checking

Anyone using a word processing tool such as Microsoft Word has come across a spell checking component that indicates spelling mistakes and proposes corrections.
Forty years after the first spelling correction program by Ralph Gorin, language checkers no longer simply compare the list of extracted words against a dictionary of correctly spelled words, but have become increasingly sophisticated. In addition to language-dependent algorithms for handling morphology (e.g. plural formation or palatalization), some are now capable of recognizing syntax-related errors, such as a missing verb or a verb that does not agree with its subject in person and number, e.g. in “She *write a letter.” However, for other common error types the methods described above are not sufficient. For example, take a look at the following first verse of a poem by Jerrold H. Zar (1992):

Eye have a spelling chequer,
It came with my Pea Sea.
It plane lee marks four my revue
Miss Steaks I can knot sea.

Most available spell checkers (including Microsoft Word) will find no errors in this poem because they mostly look at words in isolation. However, for detecting so-called homophone errors (e.g. “Eye” instead of “I”), the language checker needs to consider the context in which a word occurs. This requires either the formulation of language-specific grammar rules, i.e. a high degree of expertise and manual labour, or the use of a statistical language model to calculate the probability of a particular word occurring along with the preceding and following words. For a statistical approach, usually based on n-grams, a large amount of language data (i.e. a corpus) is required to obtain sufficient statistical information. Up to now, these approaches have mostly been developed and evaluated on English language data. However, they do not necessarily transfer well to other languages, e.g. highly inflectional languages with a flexible word order like Latvian.
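The statistical approach described above can be illustrated with a minimal sketch of a bigram language model that chooses between homophones by looking at the neighbouring words. Everything in it is an assumption made for illustration: the toy corpus, the confusion sets, add-one smoothing, and the function names are hypothetical and do not describe any particular product. A real checker would need a corpus of many millions of words, and for Latvian it would also have to cope with the large number of inflected forms.

```python
from collections import Counter

# Toy training corpus (an assumption; a real checker needs millions of words).
CORPUS = [
    "i have a spelling checker",
    "i can not see the sea",
    "it came with my pc",
    "i have a dog",
    "the eye can see",
]

# Hypothetical confusion sets of words that sound alike.
CONFUSION_SETS = [{"eye", "i"}, {"sea", "see"}, {"knot", "not"}]


def train_bigrams(sentences):
    """Count unigrams and bigrams, using <s> and </s> as sentence boundaries."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def correct(sentence, unigrams, bigrams):
    """Replace each confusable word by the set member that best fits its neighbours."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    for i in range(1, len(tokens) - 1):
        for conf_set in CONFUSION_SETS:
            if tokens[i] in conf_set:
                # Score = P(candidate | previous word) * P(next word | candidate).
                tokens[i] = max(
                    sorted(conf_set),
                    key=lambda c: bigram_prob(tokens[i - 1], c, unigrams, bigrams)
                    * bigram_prob(c, tokens[i + 1], unigrams, bigrams),
                )
    return " ".join(tokens[1:-1])


unigrams, bigrams = train_bigrams(CORPUS)
print(correct("Eye have a spelling checker", unigrams, bigrams))  # i have a spelling checker
print(correct("i can knot sea", unigrams, bigrams))               # i can not see
```

The sketch only considers the immediately preceding and following word and ignores punctuation and capitalisation; with so little data the probabilities are dominated by smoothing, which is exactly why a large corpus is needed in practice.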
For these more complex languages, an advanced high-precision language checker may require more sophisticated methods involving deeper linguistic analysis.

The use of language checking is not limited to word processing tools; it is also applied in authoring support systems. As the number of technical products has risen, the amount of technical documentation has rapidly increased over the last decades. Fearing customer complaints about wrong usage and damage resulting from bad or badly understood instructions, companies have begun to focus more and more on the quality of this technical documentation. Further, as technical products reached the international market, more and more readers were non-native English speakers. This led to the first projects on developing a controlled simplified technical English that should make it easier for native and non-native readers to understand the instructional text. This controlled language contains a fixed vocabulary in a limited domain and rules for simplifying sentence structures. Advances in natural language processing led to the development of authoring support software, which assists the writer of technical documentation in using vocabulary and sentence structures consistent with these rules and terminology restrictions.

[Figure 3: Language Checking (left: rule-based; right: statistical)]

The first spelling checker for Latvian was developed by Tilde in 1995. The spelling checker verifies the spelling of every word and offers to replace a misspelled word with the correct one. It automatically changes words that are unambiguously misspelled. Every year Tilde's team improves the spelling checker by including new lexical items, adding new features (e.g. Intelligent AutoCorrect) and integrating it into the latest software applications.
The Latvian spelling checker now recognizes more than 22 million forms generated from more than 130 thousand lemmas. Microsoft has licensed the Latvian spelling checker from Tilde and includes it in the Microsoft Office software suite. Tilde has also integrated its spelling checker into the OpenOffice and LibreOffice software suites.

Tilde has also developed a hyphenation tool for Latvian. It inserts hyphens into Latvian words in a text according to the Latvian hyphenation rules. It uses both rules that define the usual hyphenation process and an exception list (words that cannot be hyphenated using rules alone). Microsoft has licensed the Latvian hyphenator from Tilde and provides it in the Microsoft Office suite.

A convenient tool to assist in writing texts is the Latvian Thesaurus created by Tilde. With the help of the Thesaurus, repetition of the same words can be avoided in order to improve a document's language. The Thesaurus not only offers synonyms for a chosen word but also generates the correct inflectional form for the replacement. It is integrated in the Microsoft Office environment.

A grammar checker verifies sentence structure and punctuation. In 2004 Tilde developed the first grammar checker for Latvian. It was implemented using advanced pattern matching, which allowed it to recognize and correct several frequent types of errors: incorrect use of capital letters, incorrect punctuation in some types of syntactic structures, incorrect abbreviations, incorrect multiword compounds and different types of agreement errors. Recently Tilde released a new version of the grammar checker that is based on a full syntactic analysis of the text. The improved grammar checker identifies the most common grammar mistakes, including agreement errors between words, punctuation and comma errors, as well as numerous stylistic errors.
The new approach allows the program to find long-distance syntactic errors between different parts of a sentence. In addition, calques, slang and some other undesirable words or language constructions are identified. The grammar checker is integrated in the Microsoft Word and OpenOffice text editors.

Besides spelling checkers and authoring support, language checking is also important in the field of computer-assisted language learning and is applied to automatically correct queries sent to web search engines, e.g. Google's “Did you mean…” suggestions.

Web search

Search on the web, in intranets, or in digital libraries is probably the most widely used and yet underdeveloped Language Technology today. The search engine Google, which started in 1998, is currently used for about 80% of all search queries world-wide [14]. Neither the search interface nor the presentation of the retrieved results has significantly changed since the first version. In the current version, Google offers spelling correction for misspelled words, and in 2009 it incorporated basic semantic search capabilities into its algorithmic mix [15], which can improve search accuracy by analyzing the meaning of the query terms in context. The success story of Google shows that with a lot of data at hand and efficient techniques for indexing these data, a mainly statistically-based approach can lead to satisfactory results.

However, for a more sophisticated request for information, integrating deeper linguistic knowledge is essential. In the research labs, experiments using machine-readable thesauri and ontological language resources have shown improvements by making it possible to find a page on the basis of synonyms of the search terms or even more loosely related terms. The next generation of search engines will have to include much more sophisticated Language Technology.
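The idea of finding a page through synonyms of the search terms can be illustrated with a small sketch: a toy inverted index whose lookup is optionally expanded with entries from a thesaurus. The documents, the thesaurus and the function names are hypothetical and chosen only for illustration; production search engines additionally rank results and, for a language like Latvian, would have to map inflected word forms to a common lemma before indexing.

```python
from collections import defaultdict

# Toy document collection and thesaurus; both are illustrative assumptions.
DOCUMENTS = {
    1: "cheap flights to riga",
    2: "inexpensive hotels in riga",
    3: "history of the latvian language",
}
THESAURUS = {"cheap": {"inexpensive", "affordable"}}


def build_index(documents):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():
            index[token].add(doc_id)
    return index


def search(query, index, thesaurus=None):
    """Return ids of documents matching every query term or one of its synonyms."""
    result = None
    for term in query.lower().split():
        # Without a thesaurus only the literal term is looked up.
        variants = {term} | (thesaurus or {}).get(term, set())
        matches = set().union(*(index.get(v, set()) for v in variants))
        result = matches if result is None else result & matches
    return sorted(result or [])


index = build_index(DOCUMENTS)
print(search("cheap riga", index))             # [1]
print(search("cheap riga", index, THESAURUS))  # [1, 2]
```

Keyword matching alone misses document 2, although it is relevant to the query; the thesaurus lookup recovers it. Expanding queries in this way tends to improve recall at some cost in precision, which is one reason why ranking matters.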
If a search query consists of a question or another type of sentence rather than a list of keywords, retrieving relevant answers requires an analysis of this sentence on a syntactic and semantic level, as well as the availability of an index that allows for fast retrieval of relevant documents. For example, imagine a user inputs the query ‘Give me a list of all companies that were taken over by other companies in the last five years’. For a satisfactory answer, syntactic parsing needs to be applied to analyze the grammatical structure of the sentence and determine that the user is looking for companies that have been taken over and not companies that took over others. Also, the expression last five years needs to be processed in order to find out which years it refers to.

Finally, the processed query needs to be matched against a huge amount of unstructured data in order to find the piece or pieces of information the user is looking for. This is commonly referred to as information retrieval and involves the search for and ranking of relevant documents. In addition, to generate a list of companies, we also need to extract the information that a particular string of words in a document refers to a company name. This kind of information is made available by so-called named-entity recognizers.

Even more demanding is the attempt to match a query to documents written in a different language. For cross-lingual information retrieval, we have to automatically translate the query into all possible source languages and transfer the retrieved information back to the target language.

Prototypes of Latvian and Lithuanian information retrieval engines were developed as part of the FP5 project CLARITY: A proposal for cross language information retrieval and organization of text and audio documents.
The CLARITY cross-language information retrieval system was developed for the following language pairs: English-Latvian, Latvian-English, German-Latvian, Latvian-German, Russian-Latvian, Latvian-Russian, Lithuanian-English, English-Lithuanian, German-Lithuanian, Lithuanian-German, Lithuanian-Russian and Russian-German. With respect to Baltic languages, the results for document retrieval using direct query translation indicate that the average precision can reach a level of more than 70% compared to monolingual retrieval. In the case of transitive (pivot) translation the precision is lower, around 40%, but still at a reasonable level compared to monolingual retrieval (Demetriou et al. 2004).

[14] http://www.spiegel.de/netzwelt/web/0,1518,619398,00.html
[15] http://www.pcworld.com/businesscenter/article/161869/google_rolls_out_semantic_search_capabilities.html

[Figure 4: A Web Search Architecture (User query, Web pages, Pre-processing, Query Analysis, Semantic Processing, Indexing, Matching & Relevance, Search Results)]

The increasing percentage of data available in non-textual formats drives the demand for services enabling multimedia information retrieval, i.e., information search on images, audio, and video data. For audio and video files, this involves a speech recognition module to convert speech content into text or a phonetic representation, to which user queries can be matched.

Speech interaction

Speech interaction technology is the basis for the creation of interfaces that allow a user to interact with machines using spoken language rather than a graphical display, a keyboard, or a mouse. Today, such voice user interfaces (VUIs) are usually employed for partially or fully automating service offerings provided by companies to their customers, employees, or partners via telephone. Business domains that rely heavily on VUIs are banking, logistics, public transportation, and telecommunications.
Other uses of speech interaction technology are interfaces to particular devices, e.g. in-car navigation systems, and the use of spoken language as an alternative to the input/output modalities of graphical user interfaces, e.g. in smartphones.

At its core, speech interaction comprises the following four technologies:

Automatic speech recognition (ASR) is responsible for determining which words were actually spoken, given a sequence of sounds uttered by a user.
Syntactic analysis and semantic interpretation deal with analyzing the syntactic structure of a user's utterance and interpreting the latter according to the purpose of the respective system.
Dialogue management is required for determining, on the part of the system the user interacts with, which action shall be taken given the user's input and the functionality of the system.
Speech synthesis (Text-to-Speech, TTS) technology is employed for transforming the wording of the system's utterance into sounds that will be output to the user.

One of the major challenges is to have the ASR system recognize the words uttered by a user as precisely as possible. This requires either a restriction of the range of possible user utterances to a limited set of keywords, or the manual creation of language models that cover a large range of natural language user utterances. Whereas the former results in a rather rigid and inflexible usage of a VUI and possibly causes poor user acceptance, the creation, tuning and maintenance of language models may increase the costs significantly. However, VUIs that employ language models and initially allow users to express their intent flexibly (evoked, e.g., by a How may I help you greeting) show both a higher automation rate and a higher user acceptance, and may therefore be considered advantageous over a less flexible directed dialogue approach.
For the output part of a VUI, companies mostly tend to use pre-recorded utterances of professional (ideally corporate) speakers. For static utterances in which the wording does not depend on the particular contexts of use or the personal data of the given user, this will result in a rich user experience. However, the more dynamic content an utterance needs to consider, the more the user experience may suffer from poor prosody resulting from concatenating single audio files. In contrast, today's TTS systems prove superior, though optimizable, regarding the prosodic naturalness of dynamic utterances.

[Figure 5: A Simple Speech-based Dialogue Architecture (Speech input, Signal processing, Recognition, Natural language understanding & dialogue, Speech synthesis, Phonetic lookup & Intonation planning, Speech output)]

Regarding the market for speech interaction technology, the last decade saw strong standardization of the interfaces between the different technology components, as well as of standards for creating particular software artefacts for a given application. There has also been strong market consolidation within the last ten years, particularly in the field of ASR and TTS. Here, the national markets in the G20 countries (i.e. economically strong countries with a considerable population) are dominated by fewer than five players worldwide, with Nuance and Loquendo being the most prominent ones in Europe.

Several research projects in speech technologies have been carried out in Latvia, resulting in three speech synthesis systems that have achieved the level of practical usability: Tilde TTS (Tilde), T2S (IMCS) and Balss (SIA Rubuls & Co). In 2005 Tilde, together with The Association of Blind People, started a project to develop a Latvian text-to-speech (TTS) system (Goba and Vasiļjevs 2007; Goba 2007) with the primary goal of addressing the needs of visually impaired people using computers in Latvian.
The architecture of the system covers the traditional TTS transformation: text normalization, grapheme-to-phoneme conversion, prosody generation, and waveform synthesis. The optimal compromise between the speed and effectiveness of speech synthesis and the quality of the produced speech is achieved by a combined approach of synthesis and selection of speech units of variable length.

The Institute of Mathematics and Computer Science of the University of Latvia (IMCS) has had several projects devoted to experimental TTS (Auziņa 2004; Pinnis and Auziņa 2010) and speech recognition systems. A demonstration version of the TTS system has been developed by IMCS [16]. The speech synthesis system was improved and an experimental speech recognition module for isolated words was created in the project “Applications of Latvian Language Speech Synthesis and Analysis in Call Centers”, financed by Lattelecom BPO.

The TTS engine Balss (for Windows) reads aloud Latvian text passed to it from any text processor via the copy function. The SDK for the creation of new voices (languages) and the source code are commercially available.

For the Latvian language, with its relatively small number of speakers, commercially employable ASR products do not exist. There has not been any serious research in Latvian speech recognition, but some individual experiments in sound recognition and isolated word recognition have been performed by IMCS.

Regarding dialogue management technology and know-how, markets are strongly dominated by national players, which are usually SMEs. Rather than relying exclusively on a product business based on software licenses, these companies have positioned themselves mostly as full-service providers that offer the creation of VUIs as a system integration service. Finally, within the domain of speech interaction, a genuine market for the linguistic core technologies for syntactic and semantic analysis does not exist yet.

[16] http://runa.ailab.lv/tts2

Looking beyond today's state of technology, there will be significant changes due to the spread of smartphones as a new platform for managing customer relationships in addition to the telephone, internet, and email channels. This tendency will also affect the employment of technology for speech interaction. On the one hand, demand for telephony-based VUIs will decrease in the long run. On the other hand, the use of spoken language as a user-friendly input modality for smartphones will gain significant importance. This tendency is supported by the observable improvement of speaker-independent speech recognition accuracy for speech dictation services that are already offered as centralized services to smartphone users. Given this ‘outsourcing’ of the recognition task to the infrastructure of applications, the application-specific employment of linguistic core technologies will supposedly gain importance compared to the present situation.

Machine Translation

The idea of using digital computers to translate natural languages was put forward in 1946 by A. D. Booth and was followed by substantial funding for research in this area in the 1950s and again beginning in the 1980s. Nevertheless, Machine Translation (MT) still fails to fulfil the high expectations it raised in its early years.

At its basic level, MT simply substitutes words in one natural language by words in another. This can be useful in subject domains with a very restricted, formulaic language, e.g., weather reports. However, for a good translation of less standardized texts, larger text units (phrases, sentences, or even whole passages) need to be matched to their closest counterparts in the target language.
The major difficulty here lies in the fact that human language is ambiguous, which yields challenges on multiple levels, e.g., word sense disambiguation on the lexical level (‘Jaguar’ can mean a car or an animal) or the attachment of prepositional phrases on the syntactic level as in:

Policists novēroja vīru ar teleskopu. [The policeman observed the man with the telescope.]
Policists novēroja vīru ar revolveri. [The policeman observed the man with the revolver.]

[Figure 6: Machine Translation (top: statistical; bottom: rule-based). Labels: Source text, Text analysis (formatting, morphology, syntax, etc.), Translation rules, Target text, Post-editing (formatting, context, etc.)]

One way of approaching the task is based on linguistic rules. For translations between closely related languages, a direct word-for-word substitution may be feasible, but rule-based (or knowledge-driven) systems often analyze the input text and create an intermediary, symbolic representation, from which the text in the target language is generated. The success of these methods is highly dependent on the availability of extensive lexicons with morphological, syntactic, and semantic information, and of large sets of grammar rules carefully designed by a skilled linguist.

Beginning in the late 1980s, as computational power increased and became less expensive, more interest was shown in statistical models for MT. The parameters of these statistical models are derived from the analysis of bilingual text corpora, such as the JRC-Acquis multilingual parallel corpus (Steinberger et al. 2006), the total body of European Union (EU) law applicable in the EU Member States in 27 European languages. Given enough data, statistical MT works well enough to derive an approximate meaning of a foreign language text. Still, the current methods do not work equally well for all language pairs.
With respect to European languages, good translation performance can be obtained for English and the Romance languages, but the quality is much worse for Germanic, Slavic, Finno-Ugric and Baltic languages (Koehn et al. 2009). Unlike knowledge-driven systems, statistical (or data-driven) MT often generates ungrammatical output. On the other hand, besides the advantage that less human effort is required for grammar writing, data-driven MT can also cover particularities of the language that go missing in knowledge-driven systems, for example idiomatic expressions.

As the strengths and weaknesses of knowledge- and data-driven MT are complementary, researchers nowadays unanimously target hybrid approaches that combine the methodologies of both. This can be done in several ways. One is to use both knowledge- and data-driven systems and have a selection module decide on the best output for each sentence. However, for longer sentences, no result will be perfect. A better solution is to combine the best parts of each sentence from multiple outputs, which can be fairly complex, as corresponding parts of multiple alternatives are not always obvious and need to be aligned.

The rule-based approach has been dominant in Latvia since the mid-1990s, when the experimental interlingua MT system LATRA was created at IMCS (Greitāne 1997). Research on rule-based systems continued at IMCS until 2004, extending LATRA with semantic properties and adapting it to new domains. Tilde has also worked on the rule-based approach, aiming at the development of a commercial system for users who have poor or no foreign language skills. The MT system Tildes Tulkotājs (Skadiņa et al. 2008) was released in 2007 as part of the Tildes Birojs 2008 software suite. The system translates texts from English into Latvian and from Latvian into Russian.

For Latvian, MT, and especially Statistical Machine Translation (SMT), is particularly challenging because of the free word order and extensive inflection.
Also, Latvian is a so-called under-resourced language, i.e., only a few parallel corpora are available for Latvian. Therefore work on SMT started only in 2005 at IMCS, through the project “Evaluation of Statistical Machine Translation Methods for English-Latvian Translation System” (2005-2008) funded by the Latvian Council of Sciences, in which a baseline English-Latvian system was created (Skadiņa and Brālītis 2008). The system's performance in BLEU points was similar to that of other systems for inflected languages at the time. IMCS research on SMT continues with the project “Application of Factored Methods in English-Latvian SMT System” (Skadiņa and Brālītis 2009); the latest version of the system is available on the Web [17].

Current developments at Tilde are focused on combining data-driven statistical MT with knowledge-based models to achieve the optimal quality of translation. In addition to publicly available resources, internal resources collected over a long period of time were used for SMT training. Tilde Translator currently provides English-Latvian and Latvian-English SMT systems and is expanding to other translation directions. Tilde Translator is publicly available on the web [18] (Skadiņš et al. 2010), as part of the Tildes Birojs suite of desktop software, and also as mobile applications for the most commonly used platforms, such as Android and iOS.

[17] http://eksperimenti.ailab.lv/smt
[18] http://translate.tilde.com

Several EC co-funded collaborative projects were undertaken for advanced research and development of machine translation for under-resourced languages, including Latvian. The CIP ICT PSP project LetsMT! [19] and the FP7 project ACCURAT [20], coordinated by Tilde, developed innovative methods for making it easier to gather data for MT and to create customized MT systems for different domains and usage scenarios.
The ACCURAT project researches novel methods that exploit comparable corpora to compensate for the shortage of linguistic resources, in order to improve MT quality for under-resourced languages and narrow domains (Eisele and Xu 2010; Skadiņa et al. 2010). The project's target is to achieve a strong improvement in translation quality for a number of new EU official languages and languages of associated countries (Croatian, Estonian, Greek, Latvian, Lithuanian and Romanian), and to propose novel approaches for adapting existing MT technologies to specific narrow domains, significantly increasing the language and domain coverage of automated translation.

The LetsMT! project (Vasiļjevs et al. 2010) builds an innovative online collaborative platform for data sharing and MT generation. This cloud-based platform provides all categories of users with an opportunity to upload their proprietary resources to the repository and receive a tailored statistical MT system trained on such resources. The latter can be shared with other users, who can exploit them further.

[Figure: platform architecture diagram. Labels: SMT training and SMT web service; upload and sharing of training data; SMT Resource Repository; SMT Resource Directory; SMT System Directory; SMT Multi-Model Repository (trained SMT models); Giza++; Moses SMT toolkit; Moses decoders; processing, evaluation; news translation; web page translation widget; web browser plug-ins; interfaces for CAT tools; MT web page; API; System Core Services (system management, user authentication, access rights control, translation web page, integration)]

The translation services of the LetsMT! project can be used in several ways: through the web portal, through a widget provided for free inclusion in a web page, through browser plug-ins, and through integration in computer-assisted translation (CAT) tools and different online and offline applications.

The quality of MT systems is still considered to have huge improvement potential.
Challenges include the adaptability of the language resources to a given subject domain or user area and the integration into existing workflows with term bases and translation memories.
[19] http://www.letsmt.eu
[20] http://www.accurat-project.eu
Provided there is good adaptation in terms of user-specific terminology and workflow integration, the use of MT can significantly increase the productivity of translation work. Tilde recently performed an experiment on the application of English-Latvian SMT in localization by integrating MT into the SDL Trados translation environment (Skadiņš et al. 2011). The results clearly demonstrated that it is feasible to integrate current state-of-the-art SMT systems for highly inflected languages into the localization process. Using English-Latvian SMT suggestions in addition to the translation memories in the SDL Trados tool led to an increase in translation performance of 32.9% while maintaining acceptable translation quality. Even better performance is achieved with a customized SMT system trained on parallel data from a specific domain and/or the same customer. Evaluation campaigns make it possible to compare the quality of MT systems, the various approaches and the status of MT for different languages. The following table [21], presented within the EC Euromatrix+ project, shows the pairwise performance obtained for 22 official EU languages (Irish Gaelic is missing) in terms of BLEU score [22]. The best results (shown in green and blue) were achieved by languages that benefit from considerable research efforts within coordinated programs and from the existence of many parallel corpora (e.g., English, French, Dutch, Spanish, German), the worst (in red) by languages that are very different from other languages (e.g., Hungarian, Maltese, Finnish).
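To make the BLEU scores used in this comparison more concrete, the following is a minimal sketch of a simplified, sentence-level BLEU with a single reference translation. It is an illustration only and not the evaluation tooling used in the Euromatrix+ study: the function names `ngram_counts` and `simple_bleu` are hypothetical, no smoothing is applied, and the score is returned on a 0 to 1 scale, whereas the scores discussed here are reported on a 0 to 100 scale.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) multiplied by a brevity penalty.
    Assumes whitespace tokenization and one reference (illustrative only)."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            # Without smoothing, one empty precision makes the score zero
            # (this also covers candidates shorter than max_n tokens).
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(simple_bleu("the cat sat on the mat", reference))
    print(simple_bleu("the cat sat on the red mat", reference))
```

Real evaluations compute BLEU over a whole test corpus rather than a single sentence, which is one reason a smoothing-free version like this one is only suitable for illustration.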
Language Technology "Behind the Scenes"
Building Language Technology applications involves a range of subtasks that do not always surface at the level of interaction with the user, but provide significant service functionalities "under the hood" of the system. Therefore, they represent important research issues that have become individual subdisciplines of Computational Linguistics in academia.
[21] Ph. Koehn, A. Birch and R. Steinberger. 462 Machine Translation Systems for Europe. Machine Translation Summit XII, pp. 65-72, 2009.
[22] The higher the score, the better the translation; a human translator would get around 80. K. Papineni, S. Roukos, T. Ward, W.-J. Zhu. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of ACL, Philadelphia, PA.
Question answering has become an active area of research, for which annotated corpora have been built and scientific competitions have started. The idea is to move from keyword-based search (to which the engine responds with a whole collection of potentially relevant documents) to the scenario of a user asking a concrete question and the system providing a single answer: "At what age did Neil Armstrong step on the moon?" - "38". While this is obviously related to the aforementioned core area of web search, question answering nowadays is primarily an umbrella term for research questions such as: what types of questions should be distinguished and how should they be handled; how can a set of documents that potentially contain the answer be analyzed and compared (do they give conflicting answers?); and how can specific information, the answer, be reliably extracted from a document without unduly ignoring the context. This is in turn related to the information extraction (IE) task, an area that was extremely popular and influential at the time of the "statistical turn" in Computational Linguistics in the early 1990s.
IE aims at identifying specific pieces of information in specific classes of documents; this could be, e.g., the detection of the key players in company takeovers as reported in newspaper stories. Another scenario that has been worked on is reports on terrorist incidents, where the problem is to map the text to a template specifying the perpetrator, the target, the time and location of the incident, and the results of the incident. Domain-specific template-filling is the central characteristic of IE, which for this reason is another example of a "behind the scenes" technology that constitutes a well-demarcated research area but for practical purposes then needs to be embedded into a suitable application environment. Two "borderline" areas, which sometimes play the role of a standalone application and sometimes that of a supportive, "under the hood" component, are text summarization and text generation. Summarization, obviously, refers to the task of making a long text short, and is offered for instance as a functionality within MS Word. It works largely on a statistical basis, by first identifying "important" words in a text (that is, for example, words that are highly frequent in this text but markedly less frequent in general language use) and then determining sentences that contain many important words. Such sentences are then marked in the document, or extracted from it, and are taken to constitute the summary. In this scenario, which is by far the most popular one, summarization equals sentence extraction: the text is reduced to a subset of its sentences. All commercial summarizers make use of this idea. An alternative approach, to which some research is devoted, is to actually synthesize new sentences, i.e., to build a summary of sentences that need not show up in that form in the source text. This requires a certain amount of deeper understanding of the text and therefore is much less robust.
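The sentence-extraction idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any particular commercial summarizer: the function name `extract_summary`, the add-one smoothing of general-language frequencies, and the toy background counts are all hypothetical choices made for the example.

```python
import re
from collections import Counter


def extract_summary(text, background_counts, background_total,
                    top_k=3, num_sentences=2):
    """Extractive summary: pick the sentences containing the most
    'important' words, i.e. words that are frequent in this text but
    comparatively rare in general language use (illustrative sketch)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    doc_counts = Counter(word for tokens in tokenized for word in tokens)
    doc_total = sum(doc_counts.values())
    if doc_total == 0:
        return []

    def importance(word):
        # Ratio of relative frequency in the text to (smoothed) relative
        # frequency in general language; add-one smoothing is an assumption.
        doc_freq = doc_counts[word] / doc_total
        general_freq = (background_counts.get(word, 0) + 1) / (background_total + 1)
        return doc_freq / general_freq

    # Step 1: identify the top_k important words (ties broken alphabetically).
    ranked_words = sorted(doc_counts, key=lambda w: (-importance(w), w))
    important = set(ranked_words[:top_k])

    # Step 2: score each sentence by how many important words it contains.
    scores = [sum(1 for w in tokens if w in important) for tokens in tokenized]

    # Step 3: keep the best sentences, restored to their original order.
    best = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:num_sentences]
    return [sentences[i] for i in sorted(best)]


if __name__ == "__main__":
    # Toy general-language counts (hypothetical numbers for illustration).
    background = {"the": 50, "is": 30, "are": 20, "weather": 5,
                  "nice": 5, "small": 3, "need": 3, "funding": 2}
    text = ("Latvian corpora are small. The weather is nice. "
            "Latvian corpora need funding.")
    for sentence in extract_summary(text, background, background_total=1000):
        print(sentence)
```

Because the output is always a subset of the original sentences, this approach is robust but cannot rephrase or merge information, which is exactly the limitation that the synthesis-based approach tries to overcome.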
All in all, a text generator is in most cases not a stand-alone application but embedded into a larger software environment, such as a clinical information system where patient data is collected, stored and processed, and report generation is just one of many functionalities. For Latvian, the situation in all the above-mentioned research areas is much less developed than it is for English. Some experiments have been performed only on Latvian text summarization.

Language Technology in Education
Language Technology is a highly interdisciplinary field, involving the expertise of linguists, computer scientists, mathematicians, philosophers, psycholinguists, and neuroscientists, among others. As such, it has not yet acquired a fixed place in the Latvian faculty system. Some courses related to language technology have been taught at Liepaja University since 2003, including Natural Language Processing for master's degree students in information technologies and Computational Linguistics for master's degree students in Latvian philology. Several courses related to Computational Linguistics are planned to start at the University of Latvia in the autumn semester of 2011. One course is planned for bachelor students in Computer Science; deeper studies in this field are planned for master's degree students in Cognitive Sciences and Communication. An important contribution to CL education was the opportunity for doctoral students from Latvia to participate in the Nordic Graduate School of Language Technology, NGSLT [23]. The majority of the students who attended NGSLT have successfully defended their PhD theses or are currently PhD candidates. New opportunities for young researchers are provided through CLARA [24], an Initial Training Network in the Marie Curie Actions.
The CLARA project aims to train a new generation of researchers who will be able to cooperate across national boundaries on the establishment of a common language resources infrastructure and its exploitation for the construction of the next generation of language models with wide theoretical and applied significance.
[23] http://ngslt.org
[24] http://clara.uib.no
[25] http://www.semti-kamols.lv

Language Technology Programs
In Latvia, activities for collecting language resources were initiated at the end of the 1980s at IMCS (Milčonoka et al. 2004). In 2004, the State Language Commission initiated the development of the Latvian National Corpus. As different resources have been collected in a number of institutions, the Latvian National Corpus Initiative envisions the establishment of an umbrella for all the available corpora of the Latvian language. An Agreement of Intention between the main language resource developers and holders, both academic and industrial, has been signed, and the next practical steps are being discussed. Most research activities in Latvia are funded by the Latvian Council of Science (LCS). In 2005-2009, two LT-related projects of IMCS were supported by LCS as part of the State Research Programs "Scientific Foundations of Information Technology" and "Latvian Studies (Letonica): Culture, Language and History". The SemTi-Kamols project [25] aimed at the development and adaptation of semantic web technologies for semantic analysis in Latvian. The project "Database of Latvian Explanatory Dictionaries and Recent Loanwords" mainly dealt with the semi-automatic transformation of the Dictionary of Standard Latvian Language into a machine-readable format. Work on semantic technologies continues in two large projects: "Novel information technologies based on ontologies and model transformations" of the State Research Program and "Semantic database platform for domain specialists" funded by the European Structural Funds.
In addition, a few smaller LT-related projects of IMCS have been funded by LCS in the last six years: "Evaluation of Statistical Machine Translation Methods for English-Latvian Translation System" (2005-2008), "Modeling of Universal Lexicon System for the Latvian Language" (2005-2008), "Historical Dictionary of the Latvian Language (16-18th centuries)" (2005-2008), "Methods for Latvian-English Computer Aided Lexicography" (2008), and "Application of Factored Methods in English-Latvian Statistical Machine Translation System" (2009-2012). Latvia is a partner in the CLARIN project, a pan-European effort to create a language resource infrastructure for researchers in the humanities and social sciences. Latvia's participation is financed by the Ministry of Education and Science. The advancement of CLARIN is mentioned in the strategic document "Action Plan for Implementation of Guidelines for Science and Technology Development" approved by the Cabinet of Ministers in 2010. The CLARIN National Advisory Board was established and approved by the Ministry of Education and Science to prioritize the goals and tasks of CLARIN in Latvia and to facilitate the creation of the CLARIN infrastructure. As the market for language technologies is very small in Latvia, there are only a few industry players providing solutions in this field. Tilde, established in 1991, is the major language technology company in Latvia. Tilde's key experience is in three language technology areas: translation tools, proofing tools, and terminology management. Language software by Tilde is widely used in the Baltic countries, with more than 270 000 licensed users of Latvian language translation and proofreading tools. Tilde develops online and mobile machine translation and terminology systems for Latvian and other European languages. Tilde actively participates in EU research and development collaboration, coordinating several large-scale projects: EuroTermBank (eContent), ACCURAT (FP7), LetsMT!
(ICT-PSP) and META-NORD (ICT-PSP). Another company developing machine translation solutions is Trident MT, the recently opened Latvian branch of the Ukrainian company Trident. This company participates in the ICT-PSP project itranslate4.eu [26]. The company Algorego develops solutions for processing and structuring information from digitized documents. The company Datorzinību Centrs develops e-learning applications, including solutions for language learning. Taking into account the importance of LT in ensuring the sustainable development of Latvian and other smaller languages, the Language Shore initiative was launched in 2009 under the patronage of the President of Latvia, Valdis Zatlers. This initiative fosters the creation of a partnership between government, academia and industry to develop an international expertise cluster in language technology. Language Shore aims to achieve international leadership in technologies for smaller languages. To ensure the successful development of the initiative at government level, a Language Shore Steering Group composed of five sector ministers has been established. The first Language Shore pilot projects have been successfully completed in cooperation between Tilde and Microsoft Research: advancing Latvian machine translation for Bing Translator, developing a new crowd-sourcing model for MT data collection, and establishing cooperation in terminology data sharing. Further development of Language Shore is supported by the Competence Centre Programme funded by EU Structural Funds. The state support for the competence centres aims to stimulate business research and promote sectoral cooperation between companies and research institutions to develop innovative products and technologies that improve the competitiveness of enterprises. The Latvian ICT Competence Centre was established in 2010 to carry out R&D activities in language technologies and business process analysis.
[26] http://itranslate4.eu/project/
Major Latvian IT companies and universities will cooperate in the ICT Competence Centre to develop advanced technologies for machine translation, speech processing and semantic analysis. Despite several achievements in language technology research and industrial development, Latvia lacks a dedicated national program on language technologies. Current research activities are fragmented and mostly organized around short-term projects, which complicates long-term inter-institutional cooperation and the development of larger resources. Public funding for LT in Europe is relatively low compared to the expenditures for language translation and multilingual information access by the USA [27]. In Latvia, public funding is even lower than in many other European countries, including the neighboring countries Estonia and Lithuania.
[27] Gianni Lazzari: „Sprachtechnologien für Europa“, 2006: http://tcstar.org/pubblicazioni/D17_HLT_DE.pdf

Availability of Tools and Resources for Latvian
The following table provides an overview of the current situation in language technology support for Latvian. The rating of existing technologies and resources is based on educated estimations by several leading experts using the following criteria (each ranging from 0 to 6).
1. Quantity: Does a tool/resource exist for the language at hand? The more technologies/resources exist, the higher the rating.
0: no tools/resources whatsoever
6: many technologies/resources, large variety
2. Availability: Are technologies/resources accessible, i.e., are they Open Source, freely usable on any platform or only available for a high price or under very restricted conditions?
0: practically all technologies/resources are only available for a high price
6: a large amount of technologies/resources is freely, openly available under sensible Open Source or Creative Commons licenses that allow re-use and re-purposing
3. Quality: How well are the respective performance criteria of technologies and quality indicators of resources met by the best available tools, applications or resources? Are these technologies/resources current and also actively maintained?
0: toy resource/technology
6: high-quality technology, human-quality annotations in a resource
4. Coverage: To which degree do the best technologies meet the respective coverage criteria (styles, genres, text sorts, linguistic phenomena, types of input/output, number of languages supported by an MT system etc.)? To which degree are resources representative of the targeted language or sublanguages?
0: special-purpose resource or technology, specific case, very small coverage, only to be used for very specific, non-general use cases
6: very broad coverage resource, very robust technology, widely applicable, many languages supported
5. Maturity: Can the technology/resource be considered mature, stable, ready for the market? Can the best available technologies/resources be used out-of-the-box or do they have to be adapted? Is the performance of such a technology adequate and ready for production use or is it only a prototype that cannot be used for production systems? An indicator may be whether resources/technologies are accepted by the community and successfully used in LT systems.
0: preliminary prototype, toy system, proof-of-concept, example resource exercise
6: immediately integratable/applicable component
6. Sustainability: How well can the technology/resource be maintained/integrated into current IT systems? Does the technology/resource fulfil a determined level of sustainability concerning documentation/manuals, explanation of use cases, front-ends, GUIs etc.? Does it use/employ standard/best-practice programming environments (such as Java EE)? Do industry/research standards/quasi-standards exist and if so, is the technology/resource compliant (data formats etc.)?
0: completely proprietary, ad hoc data formats and APIs
6: full standard-compliance, fully documented
7. Adaptability: How well can the best technologies or resources be adapted/extended to new tasks/domains/genres/text types/use cases etc.?
0: practically impossible to adapt a technology/resource to another task, impossible even with large amounts of resources or person months at hand
6: very high level of adaptability; adaptation also very easy and efficiently possible

Status of Tools and Resources for Latvian
Adaptability Sustainability Maturity Coverage Quality Availability Quantity
Language Technology (Tools, Technologies, Applications)
Tokenization, Morphology (tokenization, POS tagging, morphological analysis/generation)
Parsing (shallow or deep syntactic analysis)
Sentence Semantics (WSD, argument structure, semantic roles)
Text Semantics (coreference resolution, context, pragmatics, inference)
3 4 4 5 5 5 4 2 4 3 2 5 1 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 3 3 3 2 3 1 3 1 3 3 1 2 0 0 0 0 0 0 0 3 0 2 6 0 4 4 0 4 4 0 3 5 0 5 2 0 5 4 0 4 0 0 0 0 0 0 0 4 1 1 0 4 4 1 0 0 4 5 4 0 0 4
Advanced Discourse Processing (text structure, coherence, rhetorical structure/RST, argumentative zoning, argumentation, text patterns, text types etc.)
Information Retrieval (text indexing, multimedia IR, crosslingual IR)
Information Extraction (named entity recognition, event/relation extraction, opinion/sentiment recognition, text mining/analytics)
Language Generation (sentence generation, report generation, text generation)
Summarization, Question Answering, advanced Information Access Technologies
Machine Translation
Speech Recognition
Speech Synthesis
Dialogue Management (dialogue capabilities and user modelling)
Language Resources (Resources, Data, Knowledge Bases)
Reference Corpora 2 5 4 5
Syntax-Corpora (treebanks, dependency banks) 1 4 1 1
Semantics-Corpora 1 2 1 1
Discourse-Corpora 0 0 0 0
Parallel Corpora, Translation Memories 2 4 4 3
Speech-Corpora (raw speech data, labeled/annotated speech data, speech dialogue data)
Multimedia and multimodal data (text data combined with audio/video)
Language Models
Lexicons, Terminologies
Grammars
Thesauri, WordNets
Ontological Resources for World Knowledge (e.g. upper models, Linked Data)
2 1 1 1 1 1 3 0 0 0 0 0 0 0 3 4 2 2 2 5 1 2 3 5 4 3 4 5 3 2 3 5 4 4 5 5 4 4 3 5 3 4 1 2 1 1 1 0 0

Conclusions
1) Interpretation of the table.
For Latvian, key results regarding technologies and resources include the following:
- While several basic language resources and tools are rather well represented for the Latvian language, more advanced resources and tools are missing; establishing a Language Technology Program that coordinates and supports the LT field in Latvia is the most important task in resolving this issue;
- Reasonably good results have been achieved in machine translation; quality depends on the availability of language resources, which is rather limited for a language as small as Latvian;
- Semantics is more difficult to process, so only a few research prototypes have been created;
- The creation of speech and multimodal resources is in an initial phase; most of these resources are not available for the Latvian language;
- Tools and resources for more advanced language technology such as discourse processing, information retrieval, summarization and dialogue management do not exist;
- Many of the resources lack standardization, i.e., even if they exist, sustainability is not given; concerted programs and initiatives are needed to standardize data and interchange formats.
2) Where do we stand and what needs to be done?
Language technology in Latvia has a rather long history, starting at the end of the 1950s. However, language technology has never been a priority research field in Latvia and has therefore been supported only with very limited funding. This situation has resulted in rather large gaps in the language resources and tools needed for the sustainable development of the Latvian language. Moreover, Latvia lacks a dedicated national programme for LT research and development, and current research activities are fragmented and mostly organized around short-term projects, which complicates long-term inter-institutional cooperation and the development of larger resources. Targeted national research and development activities, e.g. a Language Technology Programme and a National Corpus Project, are urgently needed to fill these gaps.
Another urgent problem is the lack of educational programmes in computational linguistics at Latvian universities. Currently only one semester-long course is taught, at Liepaja University.

References
Auziņa Ilze, 2004. Latvian Text-to-Speech System. Proceedings of the First Baltic Conference "Human Language Technologies – the Baltic Perspective", 21-26.
Deksne Daiga, I. Skadiņa, R. Skadiņš, A. Vasiļjevs, 2005. Foreign Language Reading Tool – First Step Towards English-Latvian Commercial Machine Translation. Proceedings of the Second Baltic Conference "Human Language Technologies – the Baltic Perspective", Tallinn.
Demetriou G., I. Skadiņa, H. Keskustalo, J. Karlgren, D. Deksne, D. Petrelli, P. Hansen, R. Gaizauskas, M. Sanderson, 2004. Cross Lingual Document Retrieval, Categorisation and Navigation Based on Distributed Services. Proceedings of the First Baltic Conference "Human Language Technologies – the Baltic Perspective", Riga, 107-114.
Eisele Andreas, J. Xu, 2010. Improving Machine Translation Performance Using Comparable Corpora. Proceedings of the 3rd Workshop on Building and Using Comparable Corpora, European Language Resources Association (ELRA), La Valletta, Malta, 35-41.
Ethnologue: Lewis, M. Paul (ed.), 2009. Ethnologue: Languages of the World, Sixteenth edition. Dallas, Tex.: SIL International. Online version: http://www.ethnologue.com.
EUROMAP study, 2003. "Benchmarking HLT progress in Europe". EUROMAP study.
Goba Kārlis, A. Vasiļjevs, 2007. Development of Text-To-Speech System for Latvian. Proceedings of the 16th Nordic Conference of Computational Linguistics NODALIDA-2007, 67-72.
Goba Kārlis, 2007. Development of a Prosody Model for Latvian TTS. Proceedings of the Third Baltic Conference HLT 2007, October 4-5, 2007, Kaunas, Lithuania.
Greitāne Inguna, 1997. Mašīntulkošanas sistēma LATRA. LZA Vēstis Nr. 3/4 (1997), 1-6.
Internet World Stats, 2010. http://www.internetworldstats.com Copyright © Miniwatts Marketing Group.
All rights reserved.
Koehn Philipp, A. Birch, R. Steinberger, 2009. 462 Machine Translation Systems for Europe. Proceedings of the Twelfth Machine Translation Summit (MT Summit XII). International Association for Machine Translation, 2009.
Milčonoka Everita, N. Grūzītis, A. Spektors, 2004. Natural Language Processing at the Institute of Mathematics and Computer Science: 10 Years Later. Proceedings of the First Baltic Conference "Human Language Technologies – the Baltic Perspective", 6-12.
Pinnis Mārcis, I. Auziņa, 2010. Latvian Text-to-Speech Synthesizer. Proceedings of the Fourth International Conference Baltic HLT 2010, IOS Press, Frontiers in Artificial Intelligence and Applications, Vol. 219, 69-72.
Skadiņa Inguna, A. Vasiļjevs, D. Deksne, R. Skadiņš, L. Goldberga, 2007. Comprehension Assistant for Languages of Baltic States. Proceedings of the 16th Nordic Conference of Computational Linguistics NODALIDA-2007, Tartu, 2007, 167-174.
Skadiņa Inguna, E. Brālītis, 2008. Experimental Statistical Machine Translation System for Latvian. Proceedings of the 3rd Baltic Conference on HLT, 281-286.
Skadiņa Inguna, E. Brālītis, 2009. English-Latvian SMT: knowledge or data? Proceedings of the 17th Nordic Conference on Computational Linguistics NODALIDA, May 14-16, 2009, Odense, Denmark, NEALT Proceedings Series, Vol. 4, 242-245.
Skadiņa Inguna, A. Vasiļjevs, R. Skadiņš, R. Gaizauskas, D. Tufis, T. Gornostay, 2010. Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation. Proceedings of the 3rd Workshop on Building and Using Comparable Corpora. European Language Resources Association (ELRA), La Valletta, Malta, 6-14.
Skadiņa Inguna, I. Auziņa, N. Grūzītis, K. Levāne-Petrova, G. Nešpore, R. Skadiņš, A. Vasiļjevs, 2010. Language Resources and Technology for the Humanities in Latvia (2004–2010).
Proceedings of the Fourth International Conference Baltic HLT 2010, IOS Press, Frontiers in Artificial Intelligence and Applications, Vol. 219, 15-22.
Skadiņš Raivis, I. Skadiņa, D. Deksne, T. Gornostay, 2008. English/Russian-Latvian Machine Translation System. Proceedings of the 3rd Baltic Conference on HLT, 287-296.
Skadiņš Raivis, K. Goba, V. Šics, 2010. Improving SMT for Baltic languages with factored models. Proceedings of the Fourth Baltic Conference "Human Language Technologies – the Baltic Perspective".
Skadiņš Raivis, M. Puriņš, I. Skadiņa, A. Vasiļjevs, 2011. Evaluation of SMT in localization to under-resourced inflected language. Proceedings of EAMT-2011.
Steinberger Ralf, B. Pouliquen, A. Widiger, C. Ignat, T. Erjavec, D. Tufiş, D. Varga, 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006). Genoa, Italy, 24-26 May 2006.
Vasiļjevs Andrejs, T. Gornostay, R. Skadiņš, 2010. LetsMT! – Online Platform for Sharing Training Data and Building User Tailored Machine Translation. Proceedings of the Fourth Baltic Conference "Human Language Technologies – the Baltic Perspective".

Appendix: About META-NET
META-NET is a Network of Excellence funded by the European Union. It currently consists of 44 members, representing 31 European countries, which are listed below. META-NET is fostering the Multilingual Europe Technology Alliance (META), a growing community of language technology professionals and organisations in Europe.
Figure 6: Countries Represented in META-NET
META-NET cooperates with a dozen other large initiatives like CLARIN, which is helping to establish the field of digital humanities in Europe.
META-NET is dedicated to fostering the technological foundations for establishing and maintaining a truly multilingual European information society that makes communication and cooperation across languages possible, safeguards equal access to information and knowledge for users of any language, and offers advanced functionalities of networked information technology to all citizens at affordable costs. META-NET stimulates and promotes multilingual technologies for all European languages. The technologies enable automatic translation, content production, information processing and knowledge management for a wide variety of applications and subject domains. The network wants to improve current approaches so that better communication and cooperation across languages can take place. Europeans have an equal right to information and knowledge regardless of language.

META-NET's Three Lines of Action
META-NET launched on 1 February 2010 with the goal of advancing research in language technology. The initiative supports a Europe that unites as a single digital market and information space. META-NET has conducted several activities that further its goals. META-VISION, META-SHARE and META-RESEARCH are the network's three lines of action.
Figure 7: Three Lines of Action in META-NET
META-VISION fosters a dynamic and influential stakeholder community that unites around a shared vision and a common strategic research agenda (SRA). The main focus of this activity is to build a coherent and cohesive LT community in Europe by bringing together representatives from highly fragmented and diverse groups of stakeholders. In META-NET's first year, presentations at the FLaReNet Forum (Spain), Language Technology Days (Luxembourg), JIAMCATT 2010 (Luxembourg), LREC 2010 (Malta), EAMT 2010 (France) and ICT 2010 (Belgium) centred on public outreach.
According to initial estimates, META-NET has already contacted more than 2,500 LT professionals to share its goals and visions with them. At the META-FORUM 2010 event in Brussels, META-NET shared the initial results of its vision building process with more than 250 participants. In a series of interactive sessions, the participants provided feedback on the visions presented by the network. META-SHARE creates an open, distributed facility for exchanging and sharing resources. The peer-to-peer network of repositories will contain language data, tools and web services that are documented with high-quality metadata and organised in standardised categories. The resources can be readily accessed and uniformly searched. The available resources include free, open source materials as well as restricted, commercially available, fee-based items. META-SHARE targets existing language data, tools and systems as well as new and emerging products that are required for building and evaluating new technologies, products and services. The reuse, combination, repurposing and re-engineering of language data and tools play a crucial role. META-SHARE will eventually become a critical part of the LT marketplace for developers, localisation experts, researchers, translators and language professionals from small, mid-sized and large enterprises. META-SHARE addresses the full development cycle of LT, from research to innovative products and services. A key aspect of this activity is establishing META-SHARE as an important and valuable part of a European and global infrastructure for the LT community. META-RESEARCH builds bridges to related technology fields. This activity seeks to leverage advances in other fields and to capitalise on innovative research that can benefit language technology.
In particular, this activity wants to bring more semantics into machine translation (MT), optimise the division of labour in hybrid MT, exploit context when computing automatic translations and prepare an empirical base for MT. META-RESEARCH is working with other fields and disciplines, such as machine learning and the Semantic Web community. META-RESEARCH focuses on collecting data, preparing data sets and organising language resources for evaluation purposes; compiling inventories of tools and methods; and organising workshops and training events for members of the community. This activity has already clearly identified aspects of MT where semantics can impact current best practices. In addition, the activity has created recommendations on how to approach the problem of integrating semantic information in MT. META-RESEARCH is also finalising a new language resource for MT, the Annotated Hybrid Sample MT Corpus, which provides data for English-German, English-Spanish and English-Czech language pairs. META-RESEARCH has also developed software that collects multilingual corpora that are hidden on the web.

Composition of the META-NET Network of Excellence
Country | Member (Affiliation) | Contacts
Austria | Universität Wien | Gerhard Budin
Belgium | University of Antwerp | Walter Daelemans
Belgium | University of Leuven | Dirk van Compernolle
Bulgaria | Bulgarian Academy of Sciences | Svetla Koeva
Croatia | Zagreb University | Marko Tadic
Cyprus | University of Cyprus | Jack Burston
Czech Rep. | Charles University in Prague* | Jan Hajic
Denmark | University of Copenhagen | Bente Maegaard, Bolette Sandford Pedersen
Estonia | University of Tartu | Tiit Roosmaa
Finland | Aalto University* | Timo Honkela
Finland | University of Helsinki | Kimmo Koskenniemi, Krister Linden
France | CNRS, LIMSI* | Joseph Mariani
France | ELDA* | Khalid Choukri
Germany | DFKI* | Hans Uszkoreit, Georg Rehm
Germany | RWTH Aachen* | Hermann Ney
Greece | ILSP, R.C. "Athena"* | Stelios Piperidis
Hungary | Hungarian Academy of Sciences | Tamás Váradi
Hungary | Budapest Technical University | Géza Németh, Gábor Olaszy
Iceland | University of Iceland | Eirikur Rögnvaldsson
Ireland | Dublin City University* | Josef van Genabith
Italy | Consiglio Nazionale Ricerche* | Nicoletta Calzolari
Italy | Fondazione Bruno Kessler* | Bernardo Magnini
Latvia | Tilde | Andrejs Vasiljevs
Latvia | Institute of Mathematics and Computer Science, University of Latvia | Inguna Skadina
Lithuania | Institute of the Lithuanian Language | Jolanta Zabarskaitë
Luxembourg | Arax Ltd. | Vartkes Goetcherian
Malta | University of Malta | Mike Rosner
Netherlands | Universiteit Utrecht* | Jan Odijk
Norway | University of Bergen | Koenraad De Smedt
Poland | Polish Academy of Sciences | Adam Przepiórkowski
Poland | University of Łódź | Piotr Pezik
Portugal | University of Lisbon | Antonio Branco
Portugal | Inst. for Systems Engineering and Computers | Isabel Trancoso
Romania | Romanian Academy of Sciences | Dan Tufis
Romania | University Alexandru Ioan Cuza | Dan Cristea
Serbia | Belgrade University | Dusko Vitas, Cvetana Krstev, Ivan Obradovic
Serbia | Pupin Institute | Sanja Vranes
Slovakia | Slovak Academy of Sciences | Radovan Garabik
Slovenia | Jozef Stefan Institute* | Marko Grobelnik
Spain | Barcelona Media* | Toni Badia
Spain | Technical University of Catalonia | Asunción Moreno
Spain | University Pompeu Fabra | Núria Bel
Sweden | University of Gothenburg | Lars Borin
UK | University of Manchester | Sophia Ananiadou
An * represents the founding members.

How to Participate?
META-NET and META offer many opportunities for participation. Please check out www.meta-net.eu for information on upcoming events and activities.