Language Technology Support for Latvian - META-NORD

www.meta-net.eu
Tel: +49 30 3949 1833
Fax: +49 30 3949 1810
META-NET
White Paper Series
Languages in the European
Information Society – Latvia
Preface
This series of language white papers is for journalists, politicians, language
communities, language teachers and others who want to establish a truly multilingual Europe.
This series promotes knowledge about language technology (LT) and its
potential. The coverage and use of language technology in Europe vary from
language to language. Consequently, required actions to support research and
development vary, and the necessary steps depend on many factors, such as
the complexity of the language or the size of its community.
META-NET has faced this challenge by initiating an analysis of the current
state of affairs for language resources and technologies. The analysis focuses
on the 23 official European languages and several important regional languages. The results of the analysis suggest that there are many significant
gaps for each language. Detailed expert analysis and assessment of the situation for each language will help maximise the impact of language technology
and minimise any associated risks.
META-NET is a European Commission Network of Excellence that consists
of 44 research centres from 31 countries. META-NET is working with stakeholders from many areas of society, industry and research to generate strategic
visions and produce a strategic research agenda that shows how language
technology applications can address any gaps by 2020.
Imprint
Authors/Editors:
Dr. Aljoscha Burchardt, DFKI
Prof. Dr. Markus Egg, Humboldt-Universität zu Berlin
Kathrin Eichler, DFKI
Dr. Georg Rehm, DFKI
Prof. Dr. Manfred Stede, Universität Potsdam
Prof. Dr. Hans Uszkoreit, Universität des Saarlandes and DFKI
Prof. Dr. Inguna Skadiņa, Tilde
Prof. Dr. Andrejs Veisbergs, University of Latvia
Dr. Andrejs Vasiļjevs, Tilde
Dr. Tatiana Gornostay, Tilde
Iveta Keiša, Tilde
Alda Rudzīte, Tilde
The development of this white paper has been funded by the Seventh Framework Programme
and the ICT Policy Support Programme of the European Commission through the contracts
T4ME (grant agreement no.: 249119), CESAR (grant agreement no.: 271022), METANET4U
(grant agreement no.: 270893), and META-NORD (grant agreement no.: 270899).
Table of Contents
Preface
Imprint
Table of Contents
Executive Summary
A Risk for Our Languages and a Challenge for Language Technology
Language Borders Hinder the European Information Society
Our Languages at Risk
Language Technology is a Key Enabling Technology
Opportunities for Language Technology
Challenges Facing Language Technology
Language Acquisition
Latvian in the European Information Society
General Facts
Particularities of the Latvian Language
Recent developments
Language cultivation in Latvia
Language in Education
International aspects
Latvian on the Internet
Selected Further Reading
Language Technology Support for Latvian
Language Technologies
Language Technology Application Architectures
Core application areas
Language checking
Web search
Speech interaction
Machine Translation
Language Technology ‘Behind the Scenes’
Language Technology in Education
Language Technology Programs
Availability of Tools and Resources for Latvian
Status of Tools and Resources for Latvian
Conclusions
References
META-NET
META-NET’s Three Lines of Action
Composition of the META-NET Network of Excellence
How to Participate?
Executive Summary
Many European languages run the risk of becoming victims of the digital age
because they are underrepresented and under-resourced online. Huge regional
market opportunities remain untapped today because of language barriers. If
we do not take action now, many European citizens will become socially and
economically disadvantaged because they speak their native language.
Innovative language technology (LT) is an intermediary that will enable European citizens to participate in an egalitarian, inclusive and economically
successful knowledge and information society. Multilingual language technology will be a gateway for instantaneous, cheap and effortless communication
and interaction across language boundaries.
Today, language services are primarily offered by commercial providers from
the US. Google Translate, a free service, is just one example. The recent success of Watson, an IBM computer system that won an episode of the Jeopardy
game show against human candidates, illustrates the immense potential of
language technology. As Europeans, we have to ask ourselves several urgent
questions:
• Should our communications and knowledge infrastructure be dependent upon monopolistic companies?
• Can we truly rely on language-related services that can be immediately switched off by others?
• Are we actively competing in the global market for research and development in language technology?
• Are third parties from other continents willing to address our translation problems and other issues that relate to European multilingualism?
• Can our European cultural background help shape the knowledge society by offering better, more secure, more precise, more innovative
and more robust high-quality technology?
The present white paper focuses on the Latvian language, which is the sole state
language in the Republic of Latvia, one of the official languages of the European Union, and one of the oldest European languages, with about 1.5 million
native speakers worldwide. While a number of basic language technologies
and resources have been developed for the Latvian language, there are still
substantial gaps that should be filled urgently to ensure sustainable development of
the language. These gaps include semantic analysis and discourse processing, summarisation, question answering, speech recognition, advanced information
access technologies and dialogue management systems, as well as discourse,
multimedia and multimodal corpora and a wordnet. Moreover, as with
several other languages of the European Union, some of the existing tools and
resources for Latvian are either fragmented and not interoperable or not freely
available to the community.
META-NET contributes to building a strong, multilingual European digital
information space. By realising this goal, a multicultural union of nations can
prosper and become a role model for peaceful and egalitarian international
cooperation. If this goal cannot be achieved, Europe will have to choose between sacrificing its cultural identities or suffering economic defeat.
A Risk for Our Languages and a Challenge for Language
Technology
As recent events in North Africa illustrate, we are witnesses to a digital revolution that is dramatically impacting communication and society. Recent developments in digitised and network communication technology are sometimes compared to Gutenberg’s invention of the printing press. What can this
analogy tell us about the future of the European information society and our
languages in particular?
After Gutenberg’s invention, real breakthroughs in communication and
knowledge exchange were accomplished by efforts like Luther’s translation of
the Bible into common language. In subsequent centuries, cultural techniques
have been developed to better handle language processing and knowledge
exchange:
• the orthographic and grammatical standardisation of major languages
enabled the rapid dissemination of new scientific and intellectual ideas;
• the development of official languages made it possible for citizens to
communicate within certain (often political) boundaries;
• the teaching and translation of languages enabled an exchange across
languages;
• the creation of journalistic and bibliographic guidelines assured the
quality and availability of printed material;
• the creation of different media like newspapers, radio, television,
books, and other formats satisfied different communication needs.
In the past twenty years, information technology has helped to automate and facilitate many of these processes:
• desktop publishing software replaces typewriting and typesetting;
• Microsoft PowerPoint replaces overhead projector transparencies;
• e-mail sends and receives documents often faster than with a fax machine;
• Skype makes Internet phone calls and hosts virtual meetings;
• audio and video encoding formats make it easy to exchange multimedia content;
• search engines provide keyword-based access to web pages;
• online services like Google Translate produce quick and approximate
translations;
• social media platforms facilitate collaboration and information sharing.
Although such tools and applications are helpful, can they sufficiently implement a sustainable, multilingual European information society, a modern and
inclusive society where information and goods can flow freely?
Language Borders Hinder the European Information Society
We cannot precisely know what the future information society will look like.
When it comes to discussing a common European energy strategy or foreign
policy, we might want to listen to European foreign ministers speak in their
native language. We might want a platform where people, who speak many
different languages and who have varying language proficiency, can discuss a
particular subject while technology automatically gathers their opinions and
generates brief summaries. We also might want to speak with a health insurance help desk that is located in a foreign country.
It is clear that communicative needs have a different quality than they did a
few years ago. In a global economy and information space, more languages,
speakers and content confront us and require us to quickly interact with new
types of media. The current popularity of social media (Wikipedia, Facebook,
Twitter and YouTube) is only the tip of the iceberg.
Today, we can transmit gigabytes of text around the world in a few seconds
before we recognize that it is in a language we do not understand. According
to a recent report requested by the European Commission, 57% of Internet
users in Europe purchase goods and services in languages that are not their
native language. (English is the most common foreign language followed by
French, German and Spanish.) 55% of users read content in a foreign language while only 35% use another language to write e-mails or post comments on the web.[1] A few years ago, English might have been the lingua franca of the web, as the vast majority of content on the web was in English. The
situation has now changed drastically. The amount of online content in other
languages (particularly Asian and Arabic languages) has exploded.
A ubiquitous digital divide caused by language borders has, surprisingly,
not gained much attention in public discourse. Yet it raises a very pressing
question: “Which European languages will thrive and persist in the networked
information and knowledge society?”
Our Languages at Risk
The printing press contributed to an invaluable exchange of information in
Europe, but it also led to the extinction of many European languages. Regional and minority languages were rarely printed. As a result, many languages like Cornish or Dalmatian were often limited to oral forms of transmission, which limited their continued adoption, spread and use.
The approximately 60 languages of Europe are one of its richest and most
important cultural assets. Europe’s multitude of languages is also a vital part
of its social success.[2] While popular languages like English or Chinese will
certainly maintain their presence in the emerging digital society and market,
many European languages could be cut off from digital communications and
become irrelevant for the Internet society. Such developments would certainly
be unwelcome. On the one hand, a strategic opportunity would be lost, which would
weaken Europe’s global standing. On the other hand, such developments
would conflict with the goal of equal participation for every European citizen
regardless of language. According to a UNESCO report on multilingualism,
languages are an essential medium for the enjoyment of fundamental rights,
such as political expression, education and participation in society.[3]
[1] European Commission Directorate-General Information Society and Media, User language preferences online, Flash Eurobarometer #313, 2011 (http://ec.europa.eu/public_opinion/flash/fl_313_en.pdf).
[2] European Commission, Multilingualism: an asset for Europe and a shared commitment, Brussels, 2008 (http://ec.europa.eu/education/languages/pdf/com/2008_0566_en.pdf).
Language Technology is a Key Enabling Technology
In the past, investment efforts have focused on language education and translation. For example, according to some estimates, the European market for
translation, interpretation, software localisation and website globalisation was
€8.4 billion in 2008 and was expected to grow by 10% per annum.[4] Yet, this
existing capacity is not enough to satisfy current and future needs.
Language technology is a key enabling technology that can protect and foster
European languages. Language technology helps people collaborate, conduct
business, share knowledge and participate in social and political debates regardless of language barriers or computer skills. Language technology already
assists everyday tasks, such as writing e-mails, searching for information
online or booking a flight. We benefit from language technology when we:
• search for and translate web pages;
• use the spelling and grammar checking features in a word processor;
• view product recommendations at an online shop;
• hear the verbal instructions of a synthetic voice in a navigation system;
• translate web pages with an online service.
The language technologies detailed in this paper are an essential part of innovative future applications. Language technology is typically an enabling technology within a larger application framework like a navigation system or a
search engine. These white papers focus on the readiness of core technologies
in each language.
In the near future, we need language technology for all European languages
that is available, affordable and tightly integrated within larger software environments. An interactive, multimedia and multilingual user experience is not
possible without language technology.
Opportunities for Language Technology
Language technology can make automatic translation, content production,
information processing and knowledge management possible for all European
languages. Language technology can also further the development of intuitive
language-based interfaces for household electronics, machinery, vehicles,
computers and robots. Although many prototypes already exist, commercial
and industrial applications are still in the early stages of development. The
current rate of progress creates a genuine window of opportunity with research steadily progressing during the last few years. For example, machine
translation (MT) already delivers a reasonable amount of accuracy within
specific domains, and experimental applications provide multilingual information and knowledge management as well as content production in many
European languages.
[3] UNESCO Director-General, Intersectoral mid-term strategy on languages and multilingualism, Paris, 2007 (http://unesdoc.unesco.org/images/0015/001503/150335e.pdf).
[4] European Commission Directorate-General for Translation, Size of the language industry in the EU, Kingston Upon Thames, 2009 (http://ec.europa.eu/dgs/translation/publications/studies).
Language applications, voice-based user interfaces and dialogue systems are
traditionally found in highly specialised domains, and they often exhibit limited performance. One active field of research is the use of language technology for rescue operations in disaster areas. In such high-risk environments,
translation accuracy can be a matter of life or death. The same reasoning applies to the use of language technology in the health care industry. Intelligent
robots with cross-lingual language capabilities have the potential to save lives.
There are huge market opportunities in the education and entertainment industries for the integration of language technologies in games, edutainment offerings, simulation environments or training programmes. Mobile information
services, computer-assisted language learning software, eLearning environments, self-assessment tools and plagiarism detection software are just a few
more examples where language technology can play an important role. The
popularity of social media applications like Twitter and Facebook suggests a
further need for sophisticated language technologies that can monitor posts,
summarise discussions, suggest opinion trends, detect emotional responses,
identify copyright infringements or track misuse.
Language technology represents a tremendous opportunity for the European
Union that makes both economic and cultural sense. Multilingualism in Europe has become the rule. European businesses, organisations and schools are
multinational and diverse. Citizens want to communicate across the language
borders that still exist in the European Common Market. Language technology can help overcome such remaining barriers while supporting the free and
open use of language. Furthermore, innovative, multilingual language technology for Europe can also help us communicate with our global partners
and their multilingual communities. Language technologies support a wealth
of international economic opportunities.
Challenges Facing Language Technology
Although language technology has made considerable progress in the last few
years, the current pace of technological development and product innovation
is too slow. We cannot wait ten or twenty years for significant improvements
to be made that can further communication and productivity in our multilingual environment.
Language technologies with broad use, such as the grammar and spell checking features in word processors, are typically monolingual, and they are only
available for a handful of languages. Applications for multilingual communication require a certain level of sophistication. Machine translation and online
services like Google Translate or Bing Translator are excellent at creating a
good approximation of a document’s contents. But such online services and
professional MT applications are fraught with various difficulties when highly
accurate and complete translations are required. There are many well-known
examples of funny-sounding mistranslations, for example, literal translations
of the names Bush or Kohl, that illustrate the challenges language technology
must still face.
Language Acquisition
To illustrate how computers handle language and why language acquisition is
a very difficult task, we take a brief look at the way humans acquire first and
second languages, and then we sketch how machine translation systems
work. There is a reason why the field of language technology is closely linked
to the field of artificial intelligence.
Humans acquire language skills in two different ways. First, a baby learns its
native language via examples. Exposure to concrete, linguistic specimens by
language users, such as parents, siblings and other family members, helps
babies from the age of about two or so produce their first words and short
phrases. This is only possible because of a special genetic disposition humans
have for learning their first language.
Learning a second language usually requires much more effort. At school age,
foreign languages are usually acquired by learning their grammatical structure, vocabulary and orthography from books and educational materials that
describe linguistic knowledge in terms of abstract rules, tables and example
texts. Learning a foreign language takes a lot of time and effort, and it gets
more difficult with age.
The two main types of language technology systems acquire language capabilities in a manner similar to humans. Statistical approaches obtain linguistic
knowledge from vast collections of concrete example texts in a single language or in so-called parallel texts that are available in two or more languages. Machine learning algorithms model some kind of language faculty
that can derive patterns of how words, short phrases and complete sentences
are correctly used in a single language or translated from one language to
another. The sheer number of sentences that statistical approaches require is
huge. Performance quality increases as the number of analysed texts increases. It is not uncommon to train such systems on texts that comprise millions of
sentences. This is one of the reasons why search engine providers are eager to
collect as much written material as possible. Spelling correction in word processors and online information and translation services such as Google
Search and Google Translate rely on a statistical (data-driven) approach.
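The statistical idea described above can be illustrated with a minimal sketch. It assumes a toy corpus of three invented sentences and simply counts which word follows which; the function names `train_bigrams` and `most_likely_next` are hypothetical and chosen for illustration only. Real systems are trained on millions of sentences and use far richer models, so this is a simplified stand-in rather than the method of any particular product.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows another in the example sentences."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def most_likely_next(follows, word):
    """Return the word seen most often after `word`, or None if `word` is unseen."""
    candidates = follows.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Toy corpus invented for illustration; it is not taken from any real resource.
corpus = [
    "the language is old",
    "the language is spoken in latvia",
    "the language has three dialects",
]
model = train_bigrams(corpus)
print(most_likely_next(model, "language"))  # "is" follows "language" twice, "has" once
```

The more example sentences the model sees, the more reliable its counts become, which mirrors the observation that performance improves as the number of analysed texts increases.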
Rule-based systems are the second major type of language technology. Experts from linguistics, computational linguistics and computer science encode
grammatical analysis (translation rules) and compile vocabulary lists (lexicons). The establishment of a rule-based system is very time consuming and
labour intensive. Rule-based systems also require highly specialised experts.
Some of the leading rule-based machine translation systems have been under
constant development for more than twenty years. The advantage of rule-based systems is that the experts have more detailed control over the language
processing. This makes it possible to systematically correct mistakes in the
software and give detailed feedback to the user, especially when rule-based
systems are used for language learning. Due to financial constraints, rule-based language technology is only feasible for major languages.
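By way of contrast, a rule-based system can be sketched as a hand-compiled lexicon plus explicitly written rules. The example below is a deliberately tiny, hypothetical checker for English: the `LEXICON` contents, the `check` function and both rules are assumptions made for illustration, and the article rule is a simplified, spelling-based approximation. It shows why such systems can give precise feedback: every message comes from a rule that an expert wrote.

```python
# Hand-compiled lexicon (hypothetical and tiny): the only words the system knows.
LEXICON = {"a", "an", "the", "old", "language", "is", "apple", "this"}


def check(sentence):
    """Apply the hand-written rules and return a list of feedback messages."""
    words = sentence.lower().rstrip(".").split()
    feedback = []
    for i, word in enumerate(words):
        # Rule 1: every word must be listed in the lexicon.
        if word not in LEXICON:
            feedback.append(f"unknown word: '{word}'")
        # Rule 2 (simplified): 'a' directly before a vowel letter should be 'an'.
        if word == "a" and i + 1 < len(words) and words[i + 1][0] in "aeiou":
            feedback.append(f"use 'an' before '{words[i + 1]}'")
    return feedback


print(check("This is a old language."))
```

Correcting a mistake in such a system means editing one rule or one lexicon entry, which is the kind of detailed control described above; the cost is that every rule and entry has to be written by hand.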
Latvian in the European Information Society
General Facts
Latvian is the sole state language in the Republic of Latvia and one of the
official languages of the European Union. There are about 1.5 million native
Latvian speakers worldwide, of whom 1.38 million live in Latvia,
while the others are scattered across the USA, Russia, Australia, Canada, the UK, Germany, Ireland, as well as Lithuania, Estonia, Sweden, Brazil and other countries.
Latvian, though apparently small, is in fact approximately the 150th most spoken language of the roughly 6,900 languages of the world. At least 500,000 non-Latvians speak Latvian in addition to their own native language. Since Latvia regained
independence in 1990, Latvian has had state language status, which extends to all
spheres of language use. Accordingly, more and more minority language speakers in Latvia also speak Latvian. The 1989 population census data showed that
23% of Latvia's national minorities spoke Latvian. According to
the 2000 population census data, the share of Latvian speakers among national
minorities had increased to 53%. However, due to low birth rates, the number of Latvian speakers decreases by approximately 5,000 people (0.3%) annually.
Latvian is the native language of 95.6% of Latvians. Among national minorities, Latvian is most often considered the native language by Lithuanians
(42.5%), Estonians (39.2%) and Germans (24.6%). For comparison, 39.6% of
Latvia's citizens are native speakers of Russian. For a large number of other
national minorities (Jews, Belarusians, Ukrainians, Poles) Russian is their
mother tongue and everyday communication language.
Although often referred to as a new language of a new republic, Latvian, in
fact, is one of the oldest European languages with numerous similarities to
Sanskrit, the language closest to the original Indo-European language. The
Latvian language belongs to the Baltic branch of the Indo-European language
family. The Baltic languages are divided into East Baltic and West Baltic
languages. There are only two living Baltic languages nowadays: Latvian and
Lithuanian, both of which belong to the East Baltic languages. Although Latvian is closely related to Lithuanian, speakers of the two languages cannot communicate
with each other freely. The similarity between the two languages is comparable to that between Spanish and Italian or Russian and Polish. The Latvian language
has three dialects: the Central dialect, Tamian and the High Latvian dialect,
and more than 500 vernaculars or sub-dialects. These dialects are
influenced by standardisation and by social and cultural-historical factors, and
they are subject to a process of accommodation to the literary
standard language. The literary standard language has been developed on the
basis of the Central dialect.
The written form of the Latvian language has existed for about 400 years. The
first written monuments of Latvian are writings in the Gothic script from the 16th
century, when, under the influence of Reformation ideas, the clergy attempted to bridge
the divide between the local peasants and the landlords of Teutonic descent.
The first great landmark of Latvian is the translation of the Bible (1689). Thus
Latvian obtained a powerful literary document the language of which was to
affect the development of written Latvian (so-called Old Writing) for centuries. It imposed a standard on the written language and was also important as
recognition of the language. It should be noted that the first scripts in Latvian
were made by Baltic Germans and were mostly translations. Baltic Germans
also produced Latvian grammars and dictionaries, collected and recorded folk songs,
and controlled and dominated the language scene in general. Real writing
in Latvian started only in the 19th century, when national literature and cultural
aspirations emerged and Latvian linguistics came into the hands of native
speakers. As a result of centuries of foreign domination, one can trace numerous lexical and morphological influences in modern Latvian: loanwords,
calques and borrowed idioms which have been fully assimilated. In spite of
extensive and varied contacts with other languages (German, Polish, Swedish, Russian, English), the inner system of Latvian has survived and the language maintains its stability. Latvian is characterized by a complex grammatical system and a certain linguistic conservatism, yet also by openness to outside influences.
Latvian orthography underwent a gradual reform from Gothic script to Latin script
(with diacritics) at the beginning of the 20th century. Since World War II there have been two
orthographic traditions (with minor differences): the orthography used by Latvians in Latvia and the orthography used by émigré
Latvians abroad. In addition, a Latgalian orthographic tradition exists in the eastern
part of Latvia.
Particularities of the Latvian Language
The high linguistic quality and rich means of expression of the Latvian language are among the prerequisites for the stability and competitiveness of the
language. The Latvian language exhibits some specific characteristics, including:
- Pronunciation almost fully corresponds to the writing
- Plenty of grammatical forms and endings due to inflection
- Large number of derived words and derivational means
- Free word order
- Punctuation principles: grammar and intonation
The Latvian language uses a phono-morphological basis of orthography.
Latvian orthography almost fully corresponds to the pronunciation (diacritical marks are used for identifying the length of a sound, palatalization and
sibilants); therefore, it is considered to be one of the best orthography systems. The new orthography was created at the end of the 19th century by the
first Latvian intellectuals, who, in their search for the most suitable means for
the written representation of the Latvian sound system, found ideas in other languages (for example, letters of the Czech language were selected for sibilants). The first requirement for correct spelling is correct pronunciation (orthoepy). In Latvian, as a general rule, each sound is represented by its own letter;
in some cases one sound is represented by two letters (dz, dž), and in some cases one
letter represents two or more sounds (the letter e represents the narrow e sound and
the broad e sound; the letter ē, the narrow and broad [ē] sounds; the letter o represents three sounds: the short vowel [o], the long vowel [ō] and the diphthong [uo]). Standard Latvian, with a few minor exceptions, has fixed initial
stress. Long vowels and diphthongs have a tone regardless of their position in
the word. Syllable tones (of three types) are one of the rarities
present in the Latvian language, preserved from the ancient syllable tone system of Indo-European and also present in Lithuanian, Slovenian and
Serbian (for comparison, tones are also important in other languages, such as
Chinese). However, the tones may make the language difficult to learn and
may frequently cause misunderstandings, because a length mark or even
just a tone may differentiate the meanings of a word (for example, ‘kazas’ (goats)
and ‘kāzas’ (wedding); ‘zāle’ with level tone (hall) and ‘zāle’ with broken
tone (grass, herb)). The context-dependent pronunciation of words must be
taken into account not only by language learners but also by language technology developers.
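The role of the length mark can be illustrated with a minimal Python sketch. It assumes a tiny hand-picked word list taken from the examples in this chapter; the helper names strip_diacritics and collisions are hypothetical and do not belong to any existing Latvian language tool. The sketch only shows that dropping diacritics makes ‘kāzas’ indistinguishable from ‘kazas’; tone differences such as the two readings of ‘zāle’ are not visible in writing at all and are therefore not modelled.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def collisions(words):
    """Group words that become identical once diacritics are dropped."""
    groups = defaultdict(list)
    for word in words:
        groups[strip_diacritics(word)].append(word)
    # Keep only the groups in which at least two distinct words collide.
    return {key: forms for key, forms in groups.items() if len(forms) > 1}


# Illustrative word list (assumption): words quoted in this chapter.
WORDS = ["kazas", "kāzas", "zāle", "peli"]

if __name__ == "__main__":
    print(collisions(WORDS))
```

A tool that processes text written without diacritics would have to resolve such collisions from context.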
The Latvian language is a synthetic, inflected language. Its words change
their form according to their grammatical function. This means that the endings of
nouns, pronouns, adjectives, numerals, and verbs change depending on certain
features. The main features in Latvian are gender, number, case, tense, voice,
degree of comparison, person, definiteness of the ending, mood and reflexivity.
Words belonging to different parts of speech have different sets of features.
Different forms are marked not only by different endings; there is also
a rich system of derivational affixes. For instance, in Latvian, nouns have
29 graphically different endings, adjectives have 24 and verbs have 28.
Across all three word classes, only half of the endings are unambiguous; for the
rest, multiple base forms may be derived from the inflected form.
The Latvian language does not have definite or indefinite articles. Definiteness can be indicated by the endings of adjectives. They can be either definite
(‘-ais’ for the singular nominative masculine, e.g. ‘lielais’, ‘garais’, and ‘-ā’ for
the singular nominative feminine, e.g. ‘lielā’, ‘garā’) or indefinite (‘-s’ or ‘-š’
for the singular nominative masculine, e.g. ‘liels’, ‘garš’, and ‘-a’ for the singular
nominative feminine, e.g. ‘liela’, ‘gara’).
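The alternation between indefinite and definite endings described above can be written down as a small rule, sketched here in Python. This is an illustration only: the function name make_definite is hypothetical, and the rule covers nothing beyond the singular nominative endings named in the text (‘-s’/‘-š’ to ‘-ais’, ‘-a’ to ‘-ā’). Other cases, numbers and irregular adjectives are outside its scope.

```python
def make_definite(adjective: str) -> str:
    """Turn an indefinite singular nominative adjective into its definite form.

    Only the endings quoted in the text are handled (assumption):
    masculine -s / -š -> -ais, feminine -a -> -ā.
    """
    if adjective.endswith(("s", "š")):
        return adjective[:-1] + "ais"
    if adjective.endswith("a"):
        return adjective[:-1] + "ā"
    raise ValueError(f"ending not covered by this sketch: {adjective!r}")


if __name__ == "__main__":
    for word in ["liels", "garš", "liela", "gara"]:
        print(word, "->", make_definite(word))
```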
Owing to its structure, the Latvian language has a very rich word-building
potential. Mostly, words are built morphologically, by adding affixes (word
components) to the stem of the word; less often, new words are built as compounds; there are also other methods of building new words. New technologies have made it possible to obtain an accurate view of the options for building words and word forms: computations have shown that, in combination with about 40 word-building affixes, the number of possible items
might be about 40 million.
The order of sentence parts is relatively free; the grammatical means for
marking syntactic relations are mainly endings. For instance, the sentence ‘kaķis
ķer peli’ (a cat is catching a mouse), with the direct word order SVO (subject,
verb, object), could also be formed with OVS word order, ‘peli ķer kaķis’, with
VSO, ‘ķer kaķis peli’, or with VOS, ‘ķer peli kaķis’. There is a tendency to place the
word which carries the more important information at the end of the sentence.
Subject, predicate, complement tends to be the most common order of sentence parts (‘Māsa lasa grāmatu’, ‘The sister is reading a book’), or subject,
predicate, adverbial (‘Zēns mācās labi’, ‘The boy learns well’).
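Because the roles of subject and object are carried by the endings rather than by position, the word-order variants of the example sentence can be generated mechanically. The following Python sketch does this for ‘kaķis ķer peli’. The names ROLES and word_orders are hypothetical. The text lists only SVO, OVS, VSO and VOS; the remaining two orders (SOV and OSV) are produced here purely by permutation, and no claim is made about how natural they sound.

```python
from itertools import permutations

# Role labels mapped to the inflected word forms from the example sentence.
ROLES = {"S": "kaķis", "V": "ķer", "O": "peli"}


def word_orders(roles):
    """Return every ordering of subject, verb and object, keyed by its label."""
    result = {}
    for order in permutations("SVO"):
        label = "".join(order)
        result[label] = " ".join(roles[role] for role in order)
    return result


if __name__ == "__main__":
    for label, sentence in word_orders(ROLES).items():
        print(label, sentence)
```

For a language technology tool this means that the subject cannot be identified by position alone; the case ending (‘kaķis’ nominative, ‘peli’ accusative) has to be analysed.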
Latvian punctuation rules are so complicated that it is almost impossible to
write without a thorough knowledge of grammar. Latvian punctuation is
based on the grammatical punctuation principle. It means that punctuation
marks mainly indicate the grammatical link and division between the text and
sentence parts. According to the above rule, punctuation marks are used to
separate sentences, parts of a compound sentence, equal parts of a sentence,
etc.
Besides the grammatical principle, the intonational principle is also important
in Latvian punctuation. Based on the latter, punctuation marks are used to
mark pauses and the emphasis of word groups. The intonational principle supplements
the grammatical principle to provide a better representation of nuances in the content of a text or sentence.
Recent developments
Although more than ten contact languages have left their traces on the
Latvian language in different historical periods, the
strongest competition has come from German, Russian and
English.
Over the last decade, there has been a significant increase in English influence. Steady borrowing from English has been present in Latvian for a century, at
first through Russian. The latest growth of borrowing has affected such areas
as electronics, information technologies, music, sports, medicine, administration and politics, as well as colloquial and slang Latvian. This fast expansion came with
the slackening of ideological barriers, the diminishing of Russian influence and
the openness of the country to the West. The language situation changed with new
incentives. In the past, although English was the major foreign language in Latvian
schools (after Russian), English teaching nevertheless resembled that of Latin,
as there were no opportunities of ever using the language. The political openness starting
in the late 1980s changed this immediately.
Currently, adverse trends are occurring in higher education and research. As in
many European countries, there is a tendency to switch to English, which poses
a threat to the development of the Latvian language. This trend can lead to an
inability to communicate in one’s native language in certain professional
fields due to a deficit of appropriate linguistic means of expression. Negative trends also appear in other fields, such as the entertainment industry and the banking
and finance sector.
Concerns about the language in the Latvian community do not cease. They
focus not only on language usage but also on language quality.
Changes in traditional culture and exposure to global trends have also affected
the language. In the global and digitalized environment of new technologies,
language must function in an accelerated mode and the consequences are apparent: ambiguous standards of spoken and written language, a lack of authoritative recommendations, etc. The speed of social and political life
and the dynamic nature of the mass media require new expressions for
new concepts. The easiest way is often to select haphazard clichés created in
haste. These developments are not regulated by official procedures, and term
builders are not efficient enough to propose, in a timely manner, terms and words that are
correct from the point of view of Latvian linguistic norms. But anyone who
adopts buzzwords haphazardly runs the risk of being misunderstood.
Although in practice redundant foreign words can be successfully replaced by
national coinages or appropriate borrowings (e.g. ‘ofšors’ by ‘ārzona’,
‘kompjūters’ by ‘dators’), the percentage of full loans has constantly been very
high. With the growth of information in foreign languages, there is an increasing trend to simply transcribe words of other languages and add Latvian endings. In fact, this is a return to the 19th century, when the use of Germanisms was widespread. One can even assume that the proportion of foreign
words is constantly increasing with the speed of emergence of new concepts
and the growth of vocabulary. There are concerns that too many foreign words
are used in Latvian, although there is no study supporting this opinion.
Language cultivation in Latvia
Latvian is the only state language in the Republic of Latvia as provided for by
law: Article 4 of the 1922 Constitution, which states that Latvian is the official language of the Republic of Latvia, was revised by the 1989 Law on Language, which was amended in 1992.
In order to better understand the strategy of the Latvian language policy, some
knowledge of the historical background is required. In the 16th–19th centuries,
German served the key sociolinguistic functions. After the Great Northern
War (1700–1721), the territory of Latvia was subjugated by Russia; however,
a special agreement was signed on the use of German in the administrative
and cultural spheres. From the end of the 18th century, the Latvian language
developed against a background of increased competition from the German
and Russian languages. The speakers of Latvian were subject to covert or
overt Germanization and Russification. Russification grew in strength at the
turn of the 19th and 20th centuries and became threatening during the Soviet
period, when Latvia was annexed by the USSR. As a result, the Latvian language
was very close to becoming an endangered language, with Russian dominating
all public spheres except Latvian culture and education, and the Latvian population almost becoming a minority in its own land. Now, thanks to the state
language policy, the situation of the Latvian language is slowly improving.
Latvian language policy is complex and difficult to implement due to the extremely high proportion of ethnic minorities (about 40% of the population).
These include Russians, Belarusians, Ukrainians, Poles, Lithuanians, Jews,
Roma, Germans, Tatars, Armenians, Estonians and other nationalities. The
Slavic minorities were Russified during the Soviet occupation, when according
to communist dogma only two languages could exist in Latvia: Latvian
and Russian. As the ethnic situation was so unfavourable and explosive, Latvian language policy was developed by aligning it as much as possible with
international instruments on human rights. Recommendations of international
experts on minority rights were carefully followed.
The National Program “Integration of Society in Latvia” (2001) promotes the development of a consolidated civil society and the harmonious integration of all ethnic
minorities. Among the major tasks are support for Latvian language training
and reform of the education system, which was segregated into Russian and Latvian
schools during Soviet rule, as well as protection of language rights for
minorities in Latvia.
Article 5 of the 1999 State Language Law does, however, add that any language other than Latvian will be considered to be a foreign language. An exclusive status is applied to the Liv language: the Livonians are the only ethnic
minority of Latvia with indigenous status (only about 20 speakers of Liv have
remained).
Today the Latvian language functions in all spheres of life. The law regulates
the use of the state language in State, municipal, judicial and educational institutions, as well as in other agencies and businesses. Official, business and
legal meetings, and those which take place in public service institutions must
be carried out in Latvian or provide for an interpretation of the discussion in
the state language, if at least one participant requests it. The same provisions
apply to the private sector “to the level which is considered to be necessary”,
an expression which leaves a large margin for manoeuvre in practice. The law
does not apply to private communications, languages used in a religious context or internal exchanges between certain ethnic groups.
A strong step towards strengthening the Latvian language was the assignment of certain professions and jobs to grades of language proficiency.
In 1995, the Latvian Government set up the ‘National Latvian Language
Learning Program’ and in 2004 it created the National Agency for Learning
Latvian, which offered free language lessons to professionals for whom
knowledge of Latvian is imperative, such as police and medical staff, but also
for large sections of the working population. The latest initiative was passed
by the Riga City Council which proposed a project competition «Organizing
and implementing Latvian language learning courses for residents of Riga
city», providing an opportunity for residents of Riga whose native language is
not Latvian to learn it free of charge.
The institutions in charge of the language policy are the Saeima (the Parliament), the Cabinet of Ministers, the Ministry of Education and Science, municipalities, universities, schools. The Latvian Language Agency is the state
regulatory authority, supervised by the Minister of Education and Science,
which focuses on the language policy and its implementation and also provides consultation services on language issues.
The State Language Centre was created in 1992 to monitor compliance with the language laws and is now also responsible for the translation
of European Union and NATO documents. It includes the Commission of
Latvian language experts, which is authorized to decide on spelling issues.
The Terminology Commission of the Latvian Academy of Sciences is the main
institution for the development of unified, coordinated and harmonized terminology. New terms are coined and terminology issues are discussed in the
sub-commissions for the specific domains.
A kind of umbrella function is assigned to the State Language Commission,
operating under the President of Latvia. Members of the Commission are
heads of all of the above institutions, representatives of universities and several representatives of community, and the resolutions of the Commission are
only advisory in nature. There are six subcommissions under the State Language Commission for topical issues. The subcommission “Latvian in the new
technologies” addresses the development of language technologies and their
widespread usage.
Since the restoration of independence in 1990, the Latvian language has
changed considerably, and the changes never cease. There is a trend of a
clearly negative approach to the current changes in the language (represented by
general expressions such as “the language is declining” or “the language is
being cluttered up”, and also by specifically identifying unwelcome phenomena).
Representatives of this trend (purists) would like to stabilize the vocabulary of
the literary language by using solely the resources of Latvian to build new
words. However, words borrowed from other languages enter circulation faster and more easily than native neologisms. For example, the translation
of ‘marketing’ as ‘tirgzinība’ failed to be accepted, because the mass media preferred ‘mārketings’, which became popular in colloquial speech.
In order to promote the formation of words for new concepts in Latvian that are
linguistically correct and at the same time widely accepted by users, volunteer
language enthusiasts organize an annual survey, “Word and Antiword of
the year”. Some successful neologisms (e.g., ‘mēstule’ for spam, ‘zīmols’ for brand, ‘vingrums’ for fitness) have gained wide appreciation. However, only
a few words are highlighted annually, while the number of new concepts waiting for their Latvian designations runs into thousands.
Language in Education
Language policy in education was defined by the Law on Education of 1991,
in which it was stated that any language other than Latvian has the status of a
foreign language. Without special dispensation, diplomas issued by the Latvian State and professional qualification exams can only be offered in the national language.
Since September 2004, it has been compulsory for at least five disciplines to
be taught in Latvian from the 10th grade throughout the public school system
(including minority language schools). Indeed, in autumn 2006, 73.5% of students in the 11th grade studied in Latvian programmes.
Unfortunately, the legal framework and the actual situation do not always
entirely fit together. The situation in the Russian minority schools is unusual.
Most lessons are given in Russian, with some teaching in Latvian. These
schools are finding it difficult to work towards the 60% of lessons taught in
Latvian as required by law.
In the academic year 2010/2011, the total number of students in general full-time
education programs was 216,307. Latvian was the language of instruction for
158,137 students (73.11%), Russian for 56,636 students (26.18%), and other
languages of instruction for only 1,534 students (0.71%).
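The shares quoted above can be reproduced from the student counts with a few lines of Python. The counts are those given in the text; the names TOTAL, BY_LANGUAGE and shares are chosen here for illustration only.

```python
# Student counts for the academic year 2010/2011, as quoted in the text.
TOTAL = 216_307
BY_LANGUAGE = {"Latvian": 158_137, "Russian": 56_636, "other": 1_534}


def shares(counts, total):
    """Return each group's share of the total in percent, rounded to 2 places."""
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


if __name__ == "__main__":
    # The three groups add up to the reported total.
    assert sum(BY_LANGUAGE.values()) == TOTAL
    print(shares(BY_LANGUAGE, TOTAL))
```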
Article 41 of the 1998 Law on Education states that educational institutions
may offer programs adapted for national minorities as long as they are in accordance with the Ministry’s regulations on education, but that these programs must be accompanied by subjects taught in the national language. The
Russian community in Latvia has reservations about these provisions.
Currently, there is a controversy regarding the submitted initiative for
amendments in Article 112 of the Constitution to achieve a gradual transfer to
the Latvian language as a language of instruction in all schools starting from 1
September 2012. If national minorities’ schools were to transfer to Latvian
as the language of instruction, a uniform and cheaper system for language teaching
would be among the benefits. However, representatives of national minorities
claim that their children have the right to be educated in their native language.
According to the Education Law and Law on Institutions of Higher Education,
Latvian must be the only language of instruction in public institutions of higher education. The choice of language of tuition in private universities is not
regulated. However, there are several requirements: 1) the examinations of
professional qualifications must be taken in the state language; 2) the works
and papers for the academic and research degree must be developed and presented in the state language unless there are other requirements provided for
in the law; 3) the improvement of professional skills and retraining financed
by the state or municipal budget funds is carried out in the state language.
The language situation in the higher education is directly dependent on the
language and education policy in Latvia and in the EU. In the context of language policy, there are two essential objectives:
 provide higher education that is able to prepare specialists, researchers and scholars who are competitive on a global scale; it means that
these professionals must have a very good command of foreign languages;
 every country must be committed to ensure comprehensive functioning of its national language in the higher education and science.
We can say that laws and regulations adopted in Latvia ensure the retention of
the dominant role of the official language in the higher education in Latvia
while providing opportunities to master professional qualifications knowledge
on a competitive level also in other EU languages (mostly English). However,
with the increase of the number of exchange studies programs and the need to
get and provide professional information in foreign languages, higher education and science in Latvia, just like in many other European countries, tend to
switch to English.
International aspects
Latvian is one of the official languages of the European Union. Every resident
is entitled to apply to the EU institutions in Latvian and receive a reply in
Latvian. The position of Latvian gains strength also due to the state language
policy - foreign residents who come to Latvia for work or study must learn
Latvian. In addition, due to its rich folklore heritage and the complex and
ancient language system, Latvian is used by linguists from other countries for
research. The detailed rules and principles of Latvian grammar may serve as a
base for research on machine translation systems and other language technology products targeted for minor languages.
Support to the Latvian language abroad is provided in two areas:
 support to learning Latvian as a foreign language at universities
abroad (Latvian can be learned in 22 universities worldwide);
 support to the Latvian language among the various diaspora.
Several Latvian institutions of higher education and the Latvian Language
Agency cooperate with foreign universities regarding the learning of Latvian.
The latest accomplishment is opening of a lecturer position in the Beijing
Foreign Studies University in China for the academic year 2011/2012 to organize Latvian language courses and to teach a course on Latvian cultural
history (in English).
There are many possibilities to learn Latvian in neighbouring Lithuania.
Since 1995, the Letonika Centre has operated at Vytautas Magnus University in
Kaunas. The accession of Lithuania and Latvia to the European Union enlarged the range of opportunities to develop academic connections: Socrates/Erasmus agreements were signed with a number of universities, covering
the exchange not only of students but also of teachers and providing sufficient financial support. Note that since 2008 Latvian has been included in the curricula
of Lithuanian secondary schools as an optional third foreign language. It
can be learned in several secondary schools located near the Latvian border.
Political and socio-economic factors have contributed to the spread of the Latvian
diaspora all over the world over the last 150 years. Preliminary
data show that more than 1/10 [...] currently. Among the tasks of the long-term program approved by the Latvian government are the supply of study aids and manuals to associations of the Latvian
diaspora, strengthening of Sunday Schools networks, provision of teachers of
the Latvian language and literature, opportunities for younger generation of
Latvian diaspora to study in Latvian universities and support to persons who
wish to repatriate.
The 2009 survey “Usage of Language in Diaspora: Evaluation of Policy of Latvia
and Experience of Other Countries”, performed by the Latvian Language Agency with support from the Norwegian government, urges proactive action to prevent the widening of the gap between the government (the
state) and representatives of the new diaspora and steps to avert the negative
attitude towards Latvia by those who have left recently.
The Latvian Language Agency has supported various teaching activities in the Russian Federation and Ireland. It has prepared two programs: a Latvian language learning program for the diaspora and a further education program for
teachers who work in the diaspora. The training program for teachers involved
61 participants from 14 countries.
Latvian on the Internet
To foster the synergy of the Latvian language and technologies, the State Language Commission subcommission on the Latvian language in new technologies has set
the following key goal: Latvian shall be provided with full software support in all
popular technologies, and this support shall be of high quality, maintained and
developed in pace with new technologies, and
widely accessible and applied. To reach this goal, the subcommission has
set the following priority tasks: to develop language computer technologies,
to ensure the availability and application of these technologies in widely used
systems, to develop regulations for the use of Latvian in computer systems, and to promote the development and implementation of IT and telecommunications terminology.
According to the survey in the Discovery News website, there are
1,369,600 Internet users in Latvian; the number is lower than that for Lithuanian, but higher than for Estonian.
TNS Latvia, a market, public opinion and media research agency, has gathered
the latest results of the Internet audience survey for winter 2011. On average
64% or 1,123,000 residents of Latvia between the ages of 15 and 74 have used
the Internet in the last six months; it is 4 percentage points more than in winter 2010. The fastest growth in the number of Internet users is among residents of Latvia aged between 20 and 29.
The role of the Internet in business is confirmed by a survey carried out by
GARM Technologies in cooperation with the Latvian Internet Association.
According to the survey, the disappearance of the Internet would have an
adverse effect on the operation of 37% of companies, and would bring
the operation of 4% to a halt.
The language used on the Internet is specific; it has certain traditions and
may show characteristics of linguistic impunity. There are services on the
Internet where the language used is edited. However, there are extensive
materials available to the public where the language is not edited. Internet communication introduces methods and vocabulary not used before: graphical
characters like “smileys” for the expression of emotions, omission of diacritic
marks, unusual abbreviations, colloquialisms and slang. The Internet, just like
other means of communication, is a source of language facts reflecting the
development trends of a language.
For language technology, the growing spread of the Internet is important in
two ways. On the one hand, the large amount of digitally available language
data represents a rich source for analysing the usage of natural language, in
particular by collecting statistical information. On the other hand, the Internet
offers a wide range of application areas for language technology.
It is important to ensure that content in Latvian is well represented on the
Internet. The National Library of Latvia is creating the Latvian National Digital
Library “Letonica”, including digitised collections of newspapers, pictures,
maps, books, sheet music and audio recordings. Its aim is to digitise
the collections of libraries and make them accessible on the web. The collection
Periodicals [5] offers 40 newspapers and magazines in Latvian, German, and
Russian from 1895 to 1957 (more than 350,000 pages).
Online encyclopaedias, dictionaries, literary works and language tools are
provided at the portal Letonika.LV developed by Tilde. Letonika.LV includes
numerous general and specialized dictionaries for 20 translation routes: from
English, French, German and Russian into Latvian and vice versa, Latvian-Lithuanian, Lithuanian-Latvian and Estonian-Latvian, as well as more than 40
terminological dictionaries. The online collection of Latvian literature includes
200 full text works and collections of 22 authors with a total volume of
22,000 digitised pages.
The Institute of Mathematics and Computer Science (IMCS) of the University
of Latvia offers a large collection of digital content, including lexical resources,
texts and corpora, and computer-assisted teaching aids. Most of the resources are
available on the Web [6] and are used in humanities research and education.
Among the corpora collected by IMCS are the Balanced Corpus of Modern Latvian [7] (~3.5 million running words), the Latvian Web Corpus (~100 million
running words), the Corpus of the Transcripts of the Saeima’s (Parliament
of Latvia) Sessions (more than 20 million running words), a corpus of early
written Latvian texts [8] (Andronova 2007) and a collection of classical Latvian
literature.
IMCS has collected numerous Latvian dictionaries, mainly explanatory dictionaries and dictionaries of terminology. The main resources are an electronic version of the Mülenbach-Endzelin “Lettisch-deutsches Wörterbuch” [9], the Dictionary of Standard Latvian Language (~64 000 entries) and the Explanatory Dictionary, which contains more than 150 000 entries from about 120 Latvian
dictionaries of different times and domains.
E-learning materials developed by IMCS comprise e-courses, e-books, teaching aids, exercises and tests for different levels of language learners, starting
from elementary school and ending with secondary school. In order to assist
deaf children, a sign language dictionary has been developed. Most of the e-learning materials are included in the Latvian Education Information System LIIS.
[5] http://periodika.lv
[6] http://www.ailab.lv
[7] http://www.korpuss.lv
[8] http://www.ailab.lv/senie
[9] http://www.ailab.lv/mev
The Terminology Commission of the Latvian Academy of Sciences publishes official terminology in two large online databases: www.termnet.lv (about 150 000 terms) and termini.lza.lv/akadterm. The former database is also integrated with the largest European terminology portal, EuroTermBank.
An extensive online collection of Latvian folklore resources, including numerous audio and video recordings, has been created by the Institute of Literature, Folklore and Art of the University of Latvia and the Archives of Latvian Folklore10. The collection of Latvian folk songs Dainu skapis, collected by Krišjānis Barons, is included in the UNESCO Memory of the World list, and its digitised version is accessible online11. Dialect materials are collected by the Latvian Language Institute and regional universities, such as the Courland’s Folklore and Language Centre12.
In the CLARIN project, Latvian language resources and tools were identified and registered in the CLARIN Repository13, which currently lists 34 resources and 9 tools.
Selected Further Reading

Break-out of Latvian. 2008. State Language Commission, Rīga, Zinātne, ISBN 978-9984-808-39-0

Gunta Kļava, V. Līcīte, K. Motivāne et al. 2009. Usage of Language in Diaspora: Evaluation of Policy of Latvia
and Experience of Other Countries. Riga: Latviešu valodas aģentūra, 24 p.

The influence of language proficiency on the standard of living of economically active part of population. Survey
of the Sociolinguistic Research Data Serviss. 2006. Valsts valodas aģentūra, Riga, 24 p., ISBN 9984-9836-6-8

Guideline of the State Language Policy for 2005-2014. 2007. Valsts valodas aģentūra, Riga: Valsts valodas
aģentūras bibliotēka, 32 p., ISBN 978-9984-9958-0-9

The State Language Policy Programme for 2006-2010
(http://www.valoda.lv/images/stories/Valsts_valodas_politikas_programma.pdf)

Digital Library “Letonica” (http://www.lnb.lv/en/digital-library)

Digital resources and tools at Artificial Intelligence Laboratory of the Institute of Mathematics and Computer Science, University of Latvia (www.ailab.lv)

Digitized collection of folk songs collected by Krišjānis Barons (http://www.dainuskapis.lv/)

European Federation of National Institutions for Language, information about Latvian
(http://www.efnil.org/documents/language-legislation-version-2007/latvia/latvia)

The Latvian Education Informatization System (LIIS) project (http://www.liis.lv)

The Latvian Institute (http://www.li.lv/)

The Latvian Language Agency (http://www.valoda.lv/)

Legislation of Republic of Latvia translated by the State Language Center
(http://www.vvc.gov.lv/advantagecms/LV/tulkojumi/index.html)
10 http://www.lfk.lv/kratuve/records.jsp?lg=en
11 http://www.dainuskapis.lv/
12 http://kfvc.liepu.lv/
13 http://www.clarin.eu/view_resources

Tilde Latvian language and encyclopaedia portal Letonika.LV (http://www.letonika.lv)

Project Modern Latvian on the Web (http://www.lu.lv/filol/valoda/)

The State Language Commission (http://www.vvk.lv/ and http://www.president.lv/pk/content/?cat_id=8)

The State Language Law (http://www.likumi.lv/doc.php?id=14740)
Language Technology Support for Latvian
Language Technologies
Language technologies are information technologies that are specialized for
dealing with human language. Therefore, these technologies are also often
subsumed under the term Human Language Technology. Human language
occurs in spoken and written form. While speech is the oldest and most natural mode of language communication, complex information and the bulk of
human knowledge are recorded and transmitted in written texts. Speech and
text technologies process or produce language in these two forms. Language
also has aspects common to both forms such as dictionaries, most of the
grammar, and the meaning of sentences. Thus, large parts of Language Technology cannot be subsumed under either speech or text technologies.
Knowledge technologies include technologies that link language to
knowledge. Figure 1 illustrates the Language Technology landscape. In our
communication, we mix language with other modes of communication and
other information media. We combine speech with gestures and facial expressions. Texts can be combined with pictures and sounds. Movies may contain
language in spoken and written form. Thus, speech and text technologies
overlap and interact with many other technologies that facilitate the processing of multimodal communication and multimedia documents.
Language Technology Application Architectures
Typical software applications for language processing consist of several components that mirror different aspects of language and of the task they implement. Figure 2 displays a highly simplified architecture that can be found in a
text processing system. The first three modules deal with the structure and meaning of the text input:
• Pre-processing: cleaning up the data, removing formatting, detecting the input language if necessary, breaking text into sentences and tokens, etc.
• Grammatical analysis: finding the verb and its objects, modifiers, etc.; detecting the sentence structure.
• Semantic analysis: disambiguation (Which meaning of apple is the right one in the given context?), resolving anaphora and referring expressions like she, the car, etc.; representing the meaning of the sentence in a machine-readable way.
Task-specific modules then perform many different operations such as automatic summarization of an input text, database look-ups and many others.
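The three analysis stages listed above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the tiny verb list TOY_VERBS and the sense inventory TOY_SENSES are hypothetical and do not describe any existing Latvian tool; a real system would use trained models and full lexicons.

```python
import re

# Hypothetical toy resources; a real system would use large lexicons or models.
TOY_VERBS = {"reads", "writes", "eats"}
TOY_SENSES = {"apple": {"fruit": {"eats", "pie"}, "company": {"buys", "shares"}}}


def preprocess(text):
    """Pre-processing: strip simple markup, split into sentences and tokens."""
    clean = re.sub(r"<[^>]+>", " ", text).strip()
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", clean) if s]
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences]


def grammatical_analysis(tokens):
    """Grammatical analysis (toy): find the first token listed as a verb."""
    verb = next((t for t in tokens if t.lower() in TOY_VERBS), None)
    return {"tokens": tokens, "verb": verb}


def semantic_analysis(parse):
    """Semantic analysis (toy): pick the sense whose cue words overlap most
    with the other words of the sentence."""
    words = {t.lower() for t in parse["tokens"]}
    senses = {}
    for word, options in TOY_SENSES.items():
        if word in words:
            senses[word] = max(options, key=lambda s: len(options[s] & words))
    return {**parse, "senses": senses}


def run_pipeline(text):
    """Chain the three stages; task-specific modules would consume the result."""
    return [semantic_analysis(grammatical_analysis(t)) for t in preprocess(text)]
```

Each stage only consumes the output of the previous one, which is the property the simplified architecture is meant to convey.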
Below, we will illustrate core application areas and highlight certain modules of the different architectures in each section. Again, the architectures are highly simplified and idealized; they serve to illustrate the complexity of language technology applications in a generally understandable way. The most
important tools and resources involved are underlined in the text and can also
be found in the table at the end of the chapter.
After introducing the core application areas and briefly describing the activities in the respective field in Latvia, we will give a short overview of the situation in LT research and education, concluding with an overview of (past) funding programs. At the end of this section, we will present an expert estimation of the situation regarding core LT tools and resources in a number of dimensions such as availability, maturity, or quality. The resulting table gives a good overview of the situation of LT for Latvian.
Figure 1: The Language Technology Landscape
Figure 2: A Typical Text Processing Application Architecture
[Components: input text, pre-processing, grammatical analysis, semantic analysis, task-specific modules, output]
Core application areas
Language checking
Anyone using a word processing tool such as Microsoft Word has come
across a spell checking component that indicates spelling mistakes and proposes corrections. 40 years after the first spelling correction program by Ralph
Gorin, language checkers nowadays do not simply compare the list of extracted words against a dictionary of correctly spelled words, but have become
increasingly sophisticated. In addition to language-dependent algorithms for
handling morphology (e.g. plural formation or palatalization), some are now
capable of recognizing syntax–related errors, such as a missing verb or a verb
that does not agree with its subject in person and number, e.g. in “She *write a
letter.” However, for other common error types the above described methods
are not sufficient. For example, take a look at the following first verse of a
poem by Jerrold H. Zar (1992):
Eye have a spelling chequer,
It came with my Pea Sea.
It plane lee marks four my revue
Miss Steaks I can knot sea.
Most available spell checkers (including Microsoft Word) will find no errors in this poem because they mostly look at words in isolation. However, for detecting so-called homophone errors (e.g. “Eye” instead of “I”), the language checker needs to consider the context in which a word occurs. This either requires the formulation of language-specific grammar rules, i.e. a high degree of expertise and manual labour, or the use of a statistical language model to calculate the probability of a particular word occurring along with the preceding and following words. For a statistical approach, usually based on n-grams, a large amount of language data (i.e. a corpus) is required to obtain sufficient statistical information.
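The statistical approach can be illustrated with a small bigram model. The sketch below is only an illustration under stated assumptions: the corpus CORPUS is a hypothetical toy corpus of three sentences, the confusion sets are hand-picked, and add-one smoothing stands in for the more careful estimation a real checker would need.

```python
from collections import Counter

# Hypothetical toy corpus; a real model needs millions of running words.
CORPUS = ("i have a spelling checker . it came with my pc . "
          "i can not see the mistakes .").split()

# Hand-picked sets of words that sound alike (assumption for illustration).
CONFUSION_SETS = [{"eye", "i"}, {"sea", "see"}, {"knot", "not"}]

unigrams = Counter(CORPUS)
bigrams = Counter(zip(CORPUS, CORPUS[1:]))


def bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing over the toy vocabulary."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams))


def correct(tokens):
    """Replace each token by the member of its confusion set that fits best
    between the previous (already corrected) and the following word."""
    out = []
    for i, tok in enumerate(tokens):
        candidates = next((s for s in CONFUSION_SETS if tok in s), {tok})
        prev = out[-1] if out else "."
        nxt = tokens[i + 1] if i + 1 < len(tokens) else "."
        best = max(sorted(candidates),
                   key=lambda c: bigram_prob(prev, c) * bigram_prob(c, nxt))
        out.append(best)
    return out
```

With this toy corpus, `correct("eye have a spelling checker".split())` replaces "eye" by "i" because "i have" was observed and "eye have" was not. The same idea scales only if the corpus is large enough, which is exactly the data requirement described above.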
Up to now, these approaches have mostly been developed and evaluated on
English language data. However, they do not necessarily transfer well to other
languages, e.g. highly inflectional languages with a flexible word order like
Latvian. For these more complex languages, an advanced high-precision language checker may require the development of more sophisticated methods,
involving a deeper linguistic analysis.
The use of language checking is not limited to word processing tools, but it is
also applied in authoring support systems. Accompanying the rising number
of technical products, the amount of technical documentation has rapidly increased over the last decades. Fearing customer complaints about wrong usage and damages resulting from bad or badly understood instructions, companies have begun to focus more and more on the quality of this technical documentation. Further, as technical products came to the international market,
more and more readers were non-native English speakers. This led to the first projects on developing a controlled simplified technical English that should make it easier for native and non-native readers to understand the instructional text. This controlled language contains a fixed vocabulary in a limited domain and rules for simplifying the sentence structures. Advances in natural language processing led to the development of authoring support software, which assists the writer of technical documentation in using vocabulary and sentence structures consistent with these rules and terminology restrictions.
Figure 3: Language Checking (left: rule-based; right: statistical)
[Components: input text, spelling check, grammar check, statistical language model, correction proposals]
The first spelling checker for Latvian was developed by Tilde in 1995. The spelling checker verifies the spelling of every word and offers to replace a misspelled word with the correct one. It automatically corrects words that are unambiguously misspelled. Every year Tilde’s team improves the spelling checker by including new lexical items, adding new features (e.g. Intelligent AutoCorrect) and integrating it into the latest software applications. The Latvian spelling checker now recognizes more than 22 million forms generated from more than 130 thousand lemmas. Microsoft licensed the Latvian spelling checker from Tilde and includes it in the Microsoft Office software suite. Tilde has also integrated its spelling checker into the OpenOffice and LibreOffice software suites.
Tilde has also developed a hyphenation tool for Latvian. It inserts hyphens into Latvian words in the text according to the Latvian hyphenation rules. It uses both rules that define the usual hyphenation process and an exception list (words which cannot be hyphenated using rules alone). Microsoft licensed the Latvian hyphenator from Tilde and provides it in the Microsoft Office suite.
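The combination of general rules and an exception list can be sketched as follows. This is not Tilde’s implementation and not the actual Latvian hyphenation rules: the single rule used here (break before a consonant that stands between two vowels) and the entry in EXCEPTIONS are hypothetical placeholders chosen only to show the lookup order.

```python
# Latvian short and long vowels.
VOWELS = set("aāeēiīouū")

# Hypothetical exception list: words whose break points the rule would miss.
EXCEPTIONS = {"datorzinātne": "dator-zi-nāt-ne"}


def hyphenate(word):
    """Return the word with possible break points marked by '-'.

    The exception list is consulted first; otherwise a single toy rule is
    applied: break before a consonant that stands between two vowels.
    """
    lower = word.lower()
    if lower in EXCEPTIONS:
        return EXCEPTIONS[lower]
    parts, start = [], 0
    for i in range(1, len(lower) - 1):
        between_vowels = lower[i - 1] in VOWELS and lower[i + 1] in VOWELS
        if between_vowels and lower[i] not in VOWELS:
            parts.append(word[start:i])
            start = i
    parts.append(word[start:])
    return "-".join(parts)
```

For example, `hyphenate("valoda")` yields `va-lo-da` under the toy rule, while the listed exception bypasses the rule entirely.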
A convenient tool to assist in writing texts is the Latvian Thesaurus created by Tilde. With the help of the Thesaurus, repetition of the same words can be avoided in order to improve the document’s language. The Thesaurus not only offers synonyms for a chosen word but also generates the correct inflectional form for replacement. It is integrated in the Microsoft Office environment.
A grammar checker verifies sentence structure and punctuation. In 2004 Tilde developed the first grammar checker for Latvian. It was implemented using advanced pattern matching, which allowed it to recognize and correct several frequent types of errors: incorrect use of capital letters, incorrect punctuation in some types of syntactic structures, incorrect abbreviations, incorrect multiword compounds and different types of agreement errors. Recently Tilde released a new version of the grammar checker that is based on a full syntactic analysis of the text. The improved grammar checker identifies the most common grammar mistakes, including agreement errors between words, punctuation and comma errors, as well as numerous stylistic errors. The new approach allows the program to find long-distance syntactic errors between different parts of the sentence. In addition, calques, slang and some other undesirable words or language constructions are identified. The grammar checker is integrated in the Microsoft Word and OpenOffice text editors.
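A pattern-matching checker of the kind described for the first version can be approximated with a list of regular expressions. The three rules below are illustrative assumptions, not the rules of the Tilde grammar checker; they also produce false positives (for instance after abbreviations), which is one reason why later versions rely on syntactic analysis.

```python
import re

# Each rule pairs a pattern with a message (hypothetical toy rules).
RULES = [
    (re.compile(r"(?:^|[.!?]\s+)[a-zāčēģīķļņšūž]"),
     "sentence should start with a capital letter"),
    (re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE),
     "repeated word"),
    (re.compile(r",(?=\S)"),
     "missing space after comma"),
]


def check(text):
    """Return (position, message) pairs for every rule match in the text."""
    return [(m.start(), message)
            for pattern, message in RULES
            for m in pattern.finditer(text)]
```

Rules like these look only at the surface string. Agreement errors between distant words cannot be expressed this way, which matches the motivation given for the move to full syntactic analysis.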
Besides spelling checkers and authoring support, language checking is also important in the field of computer-assisted language learning and is applied to automatically correct queries sent to web search engines, e.g. Google’s “Did you mean…” suggestions.
Web search
Search on the web, in intranets, or in digital libraries is probably the most widely used and yet underdeveloped Language Technology today. The search engine Google, which started in 1998, is currently used for about 80% of all search queries world-wide14. Neither the search interface nor the presentation of the retrieved results has significantly changed since the first version. In the current version, Google offers spelling correction for misspelled words, and in 2009 it incorporated basic semantic search capabilities into its algorithmic mix15, which can improve search accuracy by analyzing the meaning of the query terms in context. The success story of Google shows that with a lot of data at hand and efficient techniques for indexing these data, a mainly statistically-based approach can lead to satisfactory results.
However, for a more sophisticated request for information, integrating deeper linguistic knowledge is essential. In the research labs, experiments using machine-readable thesauri and ontological language resources have shown improvements by making it possible to find a page on the basis of synonyms of the search terms or even more loosely related terms.
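Synonym-based retrieval can be sketched with an inverted index and a small thesaurus. The documents in DOCS and the entries in THESAURUS are hypothetical toy data, and the function names are assumptions made for this illustration.

```python
from collections import defaultdict

# Hypothetical toy document collection and thesaurus.
DOCS = {
    1: "cheap car rental in riga",
    2: "automobile hire for tourists",
    3: "latvian folk songs",
}
THESAURUS = {"car": {"automobile"}, "rental": {"hire"}}


def build_index(docs):
    """Inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


def search(query, index, thesaurus=None):
    """Return ids of documents matching every query term or one of its synonyms."""
    result = None
    for term in query.split():
        variants = {term} | (thesaurus or {}).get(term, set())
        hits = set().union(*(index.get(v, set()) for v in variants))
        result = hits if result is None else result & hits
    return sorted(result or [])
```

Without the thesaurus, the query "car rental" finds only document 1; with it, document 2 is also retrieved although it shares no word with the query.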
The next generation of search engines will have to include much more sophisticated Language Technology. If a search query consists of a question or another type of sentence rather than a list of keywords, retrieving relevant answers to this query requires an analysis of this sentence on a syntactic and semantic level, as well as the availability of an index that allows for fast retrieval of relevant documents. For example, imagine a user inputs the query ‘Give me a list of all companies that were taken over by other companies in the last five years’. For a satisfactory answer, syntactic parsing needs to be applied to analyze the grammatical structure of the sentence and determine that the user is looking for companies that have been taken over and not companies that took over others. Also, the expression last five years needs to be processed in order to find out which years it refers to.
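Resolving an expression such as last five years means anchoring it to the date of the query. The following sketch shows one possible interpretation; the function name, the small NUMBER_WORDS table and the decision to return an inclusive range of calendar years are assumptions made for illustration.

```python
import re
from datetime import date

# Hypothetical mapping of number words; digits are handled separately.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "ten": 10}


def resolve_last_n_years(expression, today):
    """Map 'last N years' to a (first_year, last_year) range relative to today.

    Returns None if the expression does not match the pattern.
    """
    match = re.search(r"last (\w+) years?", expression.lower())
    if not match:
        return None
    token = match.group(1)
    n = int(token) if token.isdigit() else NUMBER_WORDS.get(token)
    if n is None:
        return None
    return (today.year - n, today.year)
```

A query issued on 1 June 2011 would thus be restricted to the years 2006 to 2011. Whether the range should start at 2006 or 2007 is a design decision that a real system would have to settle.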
Finally, the processed query needs to be matched against a huge amount of unstructured data in order to find the piece or pieces of information the user is looking for. This is commonly referred to as information retrieval and involves the search for and ranking of relevant documents. In addition, to generate a list of companies, we need to extract the information that a particular string of words in a document refers to a company name. This kind of information is made available by so-called named-entity recognizers.
Even more demanding is the attempt to match a query to documents written in
a different language. For cross-lingual information retrieval, we have to automatically translate the query to all possible source languages and transfer the
retrieved information back to the target language.
Prototypes of Latvian and Lithuanian information retrieval engines were developed as part of the FP5 project CLARITY: A proposal for cross language information retrieval and organization of text and audio documents. The CLARITY cross-language information retrieval system was developed for the following language pairs: English-Latvian, Latvian-English, German-Latvian, Latvian-German, Russian-Latvian, Latvian-Russian, Lithuanian-English, English-Lithuanian, German-Lithuanian, Lithuanian-German, Lithuanian-Russian and Russian-German. With respect to Baltic languages, the results for document retrieval using direct query translation indicate that the average precision can reach a level of more than 70% compared to monolingual retrieval. In the case of transitive (pivot) translation the precision is lower, around 40%, but still at reasonable levels compared to monolingual retrieval (Demetrioua et al. 2004).
Figure 4: A Web Search Architecture
[Components: user query, web pages, pre-processing, query analysis, semantic processing, indexing, matching & relevance, search results]
14 http://www.spiegel.de/netzwelt/web/0,1518,619398,00.html
15 http://www.pcworld.com/businesscenter/article/161869/google_rolls_out_semantic_search_capabilities.html
The increasing percentage of data available in non-textual formats drives the
demand for services enabling multimedia information retrieval, i.e., information search on images, audio, and video data. For audio and video files, this
involves a speech recognition module to convert speech content into text or a
phonetic representation, to which user queries can be matched.
Speech interaction
Speech interaction technology is the basis for the creation of interfaces that
allow a user to interact with machines using spoken language rather than a
graphical display, a keyboard, or a mouse. Today, such voice user interfaces
(VUIs) are usually employed for partially or fully automating service offerings provided by companies to their customers, employees, or partners via
telephone. Business domains that rely heavily on VUIs are banking, logistics,
public transportation, and telecommunications. Other usages of speech interaction technology are interfaces to particular devices, e.g. in-car navigation
systems, and the employment of spoken language as an alternative to the input/output modalities of graphical user interfaces, e.g. in smartphones.
At its core, speech interaction comprises the following four technologies:
• Automatic speech recognition (ASR) is responsible for determining which words were actually spoken given a sequence of sounds uttered by a user.
• Syntactic analysis and semantic interpretation deal with analyzing the syntactic structure of a user’s utterance and interpreting the latter according to the purpose of the respective system.
• Dialogue management is required for determining, on the part of the system the user interacts with, which action shall be taken given the user’s input and the functionality of the system.
• Speech synthesis (Text-to-Speech, TTS) technology is employed for transforming the wording of the system’s utterance into sounds that will be output to the user.
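How the four technologies cooperate in a single dialogue turn can be shown with stand-in functions. Nothing below performs real speech processing: recognize and synthesize are placeholders that pass text through, and the intents in INTENT_KEYWORDS are hypothetical examples from a banking scenario.

```python
# Hypothetical keyword-to-intent table for a banking VUI.
INTENT_KEYWORDS = {"balance": "check_balance", "transfer": "make_transfer"}


def recognize(audio):
    """Stand-in for ASR: the 'audio' is assumed to be a transcript already."""
    return audio.lower().strip()


def interpret(utterance):
    """Stand-in for syntactic and semantic analysis: keyword spotting."""
    words = utterance.split()
    for keyword, intent in INTENT_KEYWORDS.items():
        if keyword in words:
            return intent
    return "unknown"


def manage_dialogue(intent, state):
    """Dialogue management: choose the system's next utterance."""
    if intent == "check_balance":
        return "Your balance is {} euros.".format(state["balance"])
    if intent == "make_transfer":
        return "To whom would you like to transfer money?"
    return "Sorry, I did not understand. How may I help you?"


def synthesize(text):
    """Stand-in for TTS: wrap the text instead of producing audio."""
    return "<speech>" + text + "</speech>"


def handle_turn(audio, state):
    """One dialogue turn: ASR -> interpretation -> dialogue manager -> TTS."""
    return synthesize(manage_dialogue(interpret(recognize(audio)), state))
```

The keyword table corresponds to the restricted, directed style of interaction; replacing `interpret` with a component backed by a language model is what allows the more flexible dialogues discussed below.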
One of the major challenges is to have the ASR system recognize the words
uttered by a user as precisely as possible. This requires either a restriction of
the range of possible user utterances to a limited set of keywords, or the manual creation of language models that cover a large range of natural language
user utterances. Whereas the former results in a rather rigid and inflexible
usage of a VUI and possibly causes poor user acceptance, the creation, tuning
and maintenance of language models may increase the costs significantly.
However, VUIs that employ language models and initially allow a user to
flexibly express their intent – evoked, e.g., by a How may I help you greeting
– show both a higher automation rate and a higher user acceptance and may
therefore be considered as advantageous over a less flexible directed dialogue
approach.
For the output part of a VUI, companies mostly tend to use pre-recorded utterances of professional, ideally corporate, speakers. For static utterances in which the wording does not depend on the particular contexts of use or the personal data of the given user, this will result in a rich user experience. However, the more dynamic content an utterance needs to consider, the more the user experience may suffer from poor prosody resulting from concatenating single audio files. In contrast, today’s TTS systems prove superior, though still improvable, regarding the prosodic naturalness of dynamic utterances.
Figure 5: A Simple Speech-based Dialogue Architecture
[Components: speech input, signal processing, recognition, natural language understanding & dialogue, phonetic lookup & intonation planning, speech synthesis, speech output]
Regarding the market for speech interaction technology, the last decade saw a strong standardization of the interfaces between the different technology components, as well as the emergence of standards for creating particular software artefacts for a given application. There also has been strong market consolidation within the last ten years, particularly in the field of ASR and TTS. Here, the national markets in the G20 countries, i.e. economically strong countries with a considerable population, are dominated by fewer than five players worldwide, with Nuance and Loquendo being the most prominent ones in Europe.
Several research projects in speech technologies have been carried out in Latvia resulting in three speech synthesis systems that have achieved the level of
practical usability: Tilde TTS (Tilde), T2S (IMCS) and Balss (SIA Rubuls &
Co).
In 2005 Tilde, together with The Association of Blind People, started a project to develop a Latvian text-to-speech (TTS) system (Goba and Vasiļjevs 2007; Goba 2007) whose primary goal was to address the needs of visually impaired people using computers in Latvian. The architecture of the system covers the traditional TTS transformation, performing text normalization, grapheme-to-phoneme conversion, prosody generation, and waveform synthesis. The optimal compromise between the speed and effectiveness of speech synthesis and the quality of the produced speech is achieved by a combined approach of synthesis and selection of speech units of variable lengths.
The Institute of Mathematics and Computer Science of the University of Latvia (IMCS) had several projects devoted to experimental TTS (Auziņa 2004; Pinnis and Auziņa 2010) and speech recognition systems. A demonstration version of the TTS system has been developed by IMCS16. The speech synthesis system was improved and an experimental speech recognition module for isolated words was created in the project “Applications of Latvian Language Speech Synthesis and Analysis in Call Centers” financed by Lattelecom BPO.
The TTS engine Balss (for Windows) converts Latvian text from any text processor that supports the copy function into speech. The SDK for the creation of new voices (languages) and the source code are commercially available.
For the Latvian language and its relatively small number of speakers, commercially employable ASR products do not exist. There has not been any
serious research in Latvian language speech recognition, but some individual
experiments in sound recognition and isolated word recognition have been
performed by IMCS.
Regarding dialogue management technology and know-how, markets are strongly dominated by national players, which are usually SMEs. Rather than exclusively relying on a product business based on software licenses, these companies have positioned themselves mostly as full-service providers that offer the creation of VUIs as a system integration service. Finally, within the domain of speech interaction, a genuine market for the linguistic core technologies for syntactic and semantic analysis does not exist yet.
16 http://runa.ailab.lv/tts2
Looking beyond today‘s state of technology, there will be significant changes
due to the spread of smartphones as a new platform for managing customer
relationships in addition to the telephone, internet, and email channels. This
tendency will also affect the employment of technology for speech interaction. On the one hand, demand for telephony-based VUIs will decrease in the
long run. On the other hand, the usage of spoken language as a user-friendly
input modality for smartphones will gain significant importance. This tendency is supported by the observable improvement of speaker independent speech
recognition accuracy for speech dictation services that are already offered as
centralized services to smartphone users. Given this ‘outsourcing’ of the
recognition task to the infrastructure of applications, the application-specific
employment of linguistic core technologies will supposedly gain importance
compared to the present situation.
Machine Translation
The idea of using digital computers for the translation of natural languages was put forward in 1946 by A. D. Booth and was followed by substantial funding for research in this area in the 1950s and again beginning in the 1980s. Nevertheless, Machine Translation (MT) still fails to fulfil the high expectations it raised in its early years.
At its basic level, MT simply substitutes words in one natural language by
words in another. This can be useful in subject domains with a very restricted,
formulaic language, e.g., weather reports. However, for a good translation of
less standardized texts, larger text units (phrases, sentences, or even whole
passages) need to be matched to their closest counterparts in the target language. The major difficulty here lies in the fact that human language is ambiguous, which yields challenges on multiple levels, e.g., word sense disambiguation on the lexical level (‘Jaguar’ can mean a car or an animal) or the attachment of prepositional phrases on the syntactic level as in:
Policists novēroja vīru ar teleskopu.
[The policeman observed the man with the telescope.]
Policists novēroja vīru ar revolveri.
[The policeman observed the man with the revolver.]
Figure 6: Machine Translation (top: statistical; bottom: rule-based)
[Components: source text, text analysis (formatting, morphology, syntax, etc.), statistical machine translation, translation rules, post-editing (formatting, context, etc.), target text]
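The word-for-word substitution described as the most basic level of MT can be sketched in a few lines. The lexicon TOY_LEXICON is a hypothetical toy dictionary built from the glosses of the two example sentences; it is not part of any system mentioned in this paper.

```python
# Hypothetical Latvian-English toy lexicon taken from the example glosses.
TOY_LEXICON = {
    "policists": "policeman",
    "novēroja": "observed",
    "vīru": "man",
    "ar": "with",
    "teleskopu": "telescope",
    "revolveri": "revolver",
}


def translate_word_by_word(sentence):
    """Substitute each word independently; unknown words are copied through."""
    words = sentence.lower().rstrip(".").split()
    return " ".join(TOY_LEXICON.get(word, word) for word in words)
```

The output for the first example is "policeman observed man with telescope": the articles are missing, and the substitution gives no indication of whether the telescope belongs to the policeman or to the man. Both problems require the deeper analysis discussed in the following paragraphs.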
One way of approaching the task is based on linguistic rules. For translations between closely related languages, a direct word-by-word translation may be feasible. Often, however, rule-based (or knowledge-driven) systems analyze the input text and create an intermediary, symbolic representation, from which the text in the target language is generated. The success of these methods is highly dependent on the availability of extensive lexicons with morphological, syntactic, and semantic information, and large sets of grammar rules carefully designed by a skilled linguist.
Beginning in the late 1980s, as computational power increased and became less expensive, more interest was shown in statistical models for MT. The parameters of these statistical models are derived from the analysis of bilingual text corpora, such as the JRC-Acquis multilingual parallel corpus (Steinberger et al. 2006), the total body of European Union (EU) law applicable in the EU Member States in 27 European languages. Given enough data, statistical MT works well enough to derive an approximate meaning of a foreign language text. Still, the current methods do not work equally well for all language pairs. With respect to European languages, good translation performance can be obtained for English and the Romance languages, but the quality is much worse for Germanic, Slavic, Finno-Ugric and Baltic languages (Koehn et al. 2009).
However, unlike knowledge-driven systems, statistical (or data-driven) MT
often generates ungrammatical output. On the other hand, besides the advantage that less human effort is required for grammar writing, data-driven
MT can also cover particularities of the language that go missing in
knowledge-driven systems, for example idiomatic expressions.
As the strengths and weaknesses of knowledge- and data-driven MT are complementary, researchers nowadays unanimously target hybrid approaches
combining methodologies of both. This can be done in several ways. One is to
use both knowledge- and data-driven systems and have a selection module
decide on the best output for each sentence. However, for longer sentences, no
result will be perfect. A better solution is to combine the best parts of each
sentence from multiple outputs, which can be fairly complex, as corresponding parts of multiple alternatives are not always obvious and need to be
aligned.
The rule-based approach has been dominant in Latvia since the mid-1990s, when the experimental interlingua MT system LATRA was created at IMCS (Greitāne 1997). Research on rule-based systems continued at IMCS until 2004 by elaborating LATRA with semantic properties and by adapting it to new domains. Tilde has also worked on the rule-based approach, aiming at the development of a commercial system for users who have poor or no foreign language skills. The MT system Tildes Tulkotājs (Skadiņa et al. 2008) was released in 2007 as part of the Tildes Birojs 2008 software suite. The system translates texts from English into Latvian and from Latvian into Russian.
For Latvian, MT, especially Statistical Machine Translation (SMT), is particularly challenging because of the free word order and extensive inflection. Also, Latvian is a so-called under-resourced language, i.e., only a few parallel corpora are available for Latvian. Therefore work on SMT started only in 2005 at IMCS through a project funded by the Latvian Council of Sciences, “Evaluation of Statistical Machine Translation Methods for English-Latvian Translation System” (2005-2008), in which the baseline English-Latvian system was created (Skadiņa and Brālītis 2008). The system’s performance in BLEU points was similar to other systems for inflected languages of that time. IMCS research on SMT continues with the project “Application of Factored Methods in English-Latvian SMT System” (Skadiņa and Brālītis 2009); the latest version of the system is available on the Web17.
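BLEU compares the n-grams of a system translation with those of a reference translation. The sketch below is a simplified sentence-level variant written for illustration only: it uses a single reference, n-grams up to length 2 by default and no smoothing, whereas published BLEU scores are normally computed over a whole test corpus with n-grams up to length 4. The function names are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU with one reference and no smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

For a highly inflected language, a translation with the right lemma but the wrong ending counts as a complete mismatch under this measure, which is one reason why scores for such languages tend to be lower.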
Current developments at Tilde are focused on combining data-driven statistical MT with knowledge-based models to achieve the optimal quality of translation. In addition to publicly available resources, internal resources collected over a long period of time were used for SMT training. Tilde Translator currently provides English-Latvian and Latvian-English SMT systems and is expanding to other translation directions. Tilde Translator is publicly available on the web18 (Skadiņš et al. 2010), as part of the Tildes Birojs suite of desktop software, and as mobile applications for the most commonly used platforms such as Android and iOS.
17 http://eksperimenti.ailab.lv/smt
18 http://translate.tilde.com
Several EC co-funded collaborative projects were undertaken for advanced
research and development of machine translation for under-resourced languages, including Latvian. The CIP ICT PSP project LetsMT!19 and FP7 project ACCURAT20, coordinated by Tilde, developed innovative methods for
making it easier to gather data for MT and to create customized MT systems
for different domains and usage scenarios.
The ACCURAT project researches novel methods that exploit comparable
corpora to compensate for the shortage of linguistic resources to improve MT
quality for under-resourced languages and narrow domains (Eisele and Xu
2010, Skadiņa et al. 2010). The ACCURAT project‘s target is to achieve
strong improvement in translation quality for a number of new EU official
languages and languages of associated countries (Croatian, Estonian, Greek,
Latvian, Lithuanian and Romanian), and propose novel approaches for adapting existing MT technologies to specific narrow domains, significantly increasing language and domain coverage of automated translation.
The LetsMT! project (Vasiļjevs et al. 2010) builds an innovative online collaborative platform for data sharing and MT generation. This cloud-based
platform provides all categories of users with an opportunity to upload their
proprietary resources to the repository and receive a tailored statistical MT
system trained on such resources. The trained systems can be shared with other users,
who can exploit them further.
[Figure: architecture diagram of the SMT platform. Recoverable labels: SMT Training and SMT Web Service (Giza++, Moses SMT toolkit, Moses decoders); SMT Resource Repository (upload, processing, evaluation, sharing of training data); SMT Resource Directory; SMT System Directory; SMT Multi-Model Repository (trained SMT models); interfaces (MT web page, web page translation widget, web browser plug-ins, interfaces for CAT tools, news translation, API); System Core Services (system management, user authentication, access rights control, translation web page, integration).]
The translation services of the LetsMT! project can be used in several ways:
through the web portal, through a widget provided for free inclusion in a webpage, through browser plug-ins, and through integration in computer-assisted
translation (CAT) tools and different online and offline applications.
The quality of MT systems is still considered to have huge improvement potential. Challenges include the adaptability of the language resources to a given subject domain or user area and the integration into existing workflows
with term bases and translation memories.
19 http://www.letsmt.eu
20 http://www.accurat-project.eu
Provided there is good adaptation in terms of user-specific terminology and workflow
integration, the use of MT can significantly increase the productivity of translation work. Recently, Tilde performed an experiment on the application of English-Latvian SMT in localization through the integration of MT into the SDL
Trados translation environment (Skadiņš et al. 2011). The results of the experiment clearly demonstrated that it is feasible to integrate current state-of-the-art
SMT systems for highly inflected languages into the localization process. The use of English-Latvian SMT suggestions in addition to the
translation memories in the SDL Trados tool led to an increase in translation
performance of 32.9% while maintaining an acceptable quality of translation. Even better performance results are achieved when using a customized
SMT system that is trained on data from a specific domain and/or on parallel data from the same customer.
Evaluation campaigns make it possible to compare the quality of MT systems, the various
approaches and the status of MT systems for different languages. The following table21, presented within the EC Euromatrix+ project, shows the pairwise
performances obtained for 22 official EU languages (Irish Gaelic is missing)
in terms of BLEU score22.
The best results (shown in green and blue) were achieved by languages which
benefit from considerable research efforts, within coordinated programs, and
from the existence of many parallel corpora (e.g., English, French, Dutch,
Spanish, German), the worst (in red) by languages that are very different from
other languages (e.g., Hungarian, Maltese, Finnish).
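To make the BLEU metric mentioned above more concrete, the following is a minimal sketch of a simplified, single-reference, sentence-level BLEU score: the geometric mean of clipped n-gram precisions (n = 1 to 4) multiplied by a brevity penalty. It is an illustration only, not the evaluation setup used in the Euromatrix+ study; the function names `ngrams` and `bleu`, the whitespace tokenization and the lack of smoothing are assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified single-reference, sentence-level BLEU on a 0-100 scale."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: a candidate n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matches == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: penalizes candidates shorter than the reference.
    if len(cand) > len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(cand))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(bleu("the cat sat on the mat", reference))
    print(bleu("the cat sat on a mat", reference))
```

Published comparisons are normally computed at corpus level over a whole test set, so scores from a sentence-level sketch like this one are not directly comparable with the figures in such tables.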
Language Technology ‘Behind the Scenes’
Building Language Technology applications involves a range of subtasks that
do not always surface at the level of interaction with the user, but provide
significant service functionalities ‘under the hood’ of the system. Therefore,
they represent important research issues that have become individual subdisciplines of Computational Linguistics in academia.
21 Ph. Koehn, A. Birch and R. Steinberger. 462 Machine Translation Systems for Europe, Machine Translation Summit XII, p. 65-72, 2009.
22 The higher the score, the better the translation; a human translator would get around 80. K. Papineni, S. Roukos, T. Ward, W.-J. Zhu. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of ACL, Philadelphia, PA.
Question answering has become an active area of research, for which annotated corpora have been built and scientific competitions have been started. The idea is
to move from keyword-based search (to which the engine responds with a
whole collection of potentially relevant documents) to the scenario of a user
asking a concrete question and the system providing a single answer: ‘At what
age did Neil Armstrong step on the moon?’ - ‘38’. While this is obviously
related to the aforementioned core area of web search, question answering nowadays is primarily an umbrella term for research questions such as what types
of questions should be distinguished and how they should be handled, how
a set of documents that potentially contain the answer can be analyzed and
compared (do they give conflicting answers?), and how specific information - the answer - can be reliably extracted from a document without unduly
ignoring the context.
This is in turn related to the information extraction (IE) task, an area that was
extremely popular and influential at the time of the ‘statistical turn’ in Computational Linguistics, in the early 1990s. IE aims at identifying specific pieces of information in specific classes of documents; this could be, e.g., the detection of the key players in company takeovers as reported in newspaper
stories. Another scenario that has been worked on is reports on terrorist incidents, where the problem is to map the text to a template specifying the perpetrator, the target, time and location of the incident, and the results of the incident. Domain-specific template-filling is the central characteristic of IE,
which for this reason is another example of a ‘behind the scenes’ technology
that constitutes a well-demarcated research area but for practical purposes
then needs to be embedded into a suitable application environment.
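The template-filling idea described above can be illustrated with a small sketch for the company takeover scenario. Everything in it is hypothetical: the `TakeoverTemplate` fields, the hand-written regular expression and the example sentence are assumptions chosen for illustration, and real IE systems rely on far richer linguistic analysis than a single surface pattern.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class TakeoverTemplate:
    """Hypothetical template for a company takeover event."""
    acquirer: Optional[str] = None
    target: Optional[str] = None
    price: Optional[str] = None


# A company name is approximated as a run of capitalized words.
COMPANY = r"[A-Z][\w&]*(?: [A-Z][\w&]*)*"

# Hand-written pattern for sentences such as
# "<Acquirer> acquired <Target> for <amount> million euros".
PATTERN = re.compile(
    rf"(?P<acquirer>{COMPANY}) (?:acquired|bought|took over) "
    rf"(?P<target>{COMPANY})"
    r"(?: for (?P<price>[\d.,]+ (?:million|billion) (?:euros|dollars)))?"
)


def fill_template(sentence: str) -> Optional[TakeoverTemplate]:
    """Map a sentence to a filled template, or None if no takeover is found."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return TakeoverTemplate(
        acquirer=match.group("acquirer"),
        target=match.group("target"),
        price=match.group("price"),  # stays None when no price is mentioned
    )


if __name__ == "__main__":
    text = "Yesterday it was reported that Alpha Systems acquired Beta Soft for 12 million euros."
    print(fill_template(text))
```

The point of the sketch is the output shape: free text goes in, and a fixed, domain-specific record comes out that a surrounding application can store or query.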
Two ‘borderline’ areas, which sometimes play the role of a standalone application and sometimes that of a supportive, ‘under the hood’ component, are text
summarization and text generation. Summarization, obviously, refers to the
task of making a long text short, and is offered for instance as a functionality
within MS Word. It works largely on a statistical basis, by first identifying
‘important’ words in a text (that is, for example, words that are highly frequent in this text but markedly less frequent in general language use) and then
determining sentences that contain many important words. Such sentences are
then marked in the document, or extracted from it, and are taken to constitute
the summary. In this scenario, which is by far the most popular one, summarization equals sentence extraction: the text is reduced to a subset of its sentences. All commercial summarizers make use of this idea. An alternative
approach, to which some research is devoted, is to actually synthesize new
sentences, i.e., to build a summary of sentences that need not show up in that
form in the source text. This requires a certain amount of deeper understanding of the text and therefore is much less robust. All in all, a text generator is
in most cases not a stand-alone application but embedded into a larger software environment, such as the clinical information system where patient
data is collected, stored and processed, and report generation is just one of
many functionalities.
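The sentence-extraction approach to summarization described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any particular product works: the small `STOPWORDS` set stands in for "words that are frequent in general language use" (a real system would compare against frequencies from a large reference corpus), and the function name `summarize` and the regex-based sentence splitting are choices made for this example.

```python
import re
from collections import Counter

# Tiny stand-in for words that are frequent in general language use.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that",
             "for", "on", "with", "as", "are", "was", "by", "this", "be"}


def summarize(text, num_sentences=2):
    """Return the num_sentences highest-scoring sentences in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    # Step 1: 'important' words are frequent in this text and not stopwords.
    freq = Counter(w for words in tokenized for w in words if w not in STOPWORDS)
    # Step 2: score each sentence by the summed frequency of its words
    # (stopwords contribute 0 because they are absent from the Counter).
    scores = [sum(freq[w] for w in words) for words in tokenized]
    # Step 3: pick the best sentences, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:num_sentences])
    return [sentences[i] for i in keep]


if __name__ == "__main__":
    doc = ("Machine translation helps translators. The weather is nice. "
           "Translation quality depends on translation data. Cats sleep.")
    for sentence in summarize(doc, num_sentences=2):
        print(sentence)
```

Because scores are summed rather than averaged, longer sentences are favoured; normalizing by sentence length is a common variation.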
For Latvian, all of the above-mentioned research areas are much
less developed than they are for English. So far, experiments have been performed
only on Latvian text summarization.
Language Technology in Education
Language Technology is a highly interdisciplinary field, involving the expertise of linguists, computer scientists, mathematicians, philosophers, psycholinguists, and neuroscientists, among others. As such, it has not yet acquired a
fixed place in the Latvian faculty system.
Some courses related to language technology have been taught at Liepaja University since 2003, including Natural Language Processing for master's degree students in information technologies and Computational Linguistics for
master's degree students in Latvian philology.
Several courses related to Computational Linguistics are planned to start at
the University of Latvia in the autumn semester of 2011. One course is planned
for bachelor students in Computer Science; deeper studies in this field are
planned for master's degree students in Cognitive Sciences and Communication.
An important contribution to CL education was the opportunity for doctoral students from Latvia to participate in the Nordic Graduate School of Language
Technology, NGSLT23. The majority of students who attended NGSLT have successfully defended their PhD theses or are currently PhD candidates.
New opportunities for young researchers are provided through CLARA24, an Initial Training
Network in the Marie Curie Actions. The CLARA project aims to train
a new generation of researchers who will be able to cooperate across national
boundaries on the establishment of a common language resources infrastructure and its exploitation for the construction of the next generation of language models with wide theoretical and applied significance.
Language Technology Programs
In Latvia, activities for collecting language resources were initiated at the end
of the 1980s at the IMCS (Milčonoka et al. 2004). In 2004, the State Language
Commission initiated the development of the Latvian National Corpus. As different resources have been collected in a number of institutions, the Latvian National
Corpus Initiative envisions the establishment of an umbrella for all the available corpora of the Latvian language. The Agreement of Intention between the
main language resource developers and holders, both academic and industrial,
has been signed and the next practical steps are being discussed.
Most research activities in Latvia are funded by the Latvian Council of Science (LCS). In 2005-2009, two LT-related projects of IMCS were supported by
LCS as part of the State Research Programs “Scientific Foundations of Information Technology” and “Latvian Studies (Letonica): Culture, Language and
History”. The SemTi-Kamols project25 aimed at the development and adaptation of
semantic web technologies for semantic analysis in Latvian. The project
“Database of Latvian Explanatory Dictionaries and Recent Loanwords” 
mainly dealt with the semi-automatic transformation of the Dictionary of Standard Latvian Language into a machine-readable format. Work on semantic technologies continues in two large projects: “Novel information technologies
based on ontologies and model transformations” of the State Research Program and “Semantic database platform for domain specialists” funded by the
European Structural Funds.
23 http://ngslt.org
24 http://clara.uib.no
25 http://www.semti-kamols.lv
In addition, a few smaller LT-related projects of IMCS have been funded by
LCS in the last six years: “Evaluation of Statistical Machine Translation
Methods for English-Latvian Translation System” (2005-2008), “Modeling of
Universal Lexicon System for the Latvian Language” (2005-2008), “Historical
Dictionary of the Latvian Language (16-18th centuries)” (2005-2008), “Methods for Latvian-English Computer Aided Lexicography” (2008), and “Application
of Factored Methods in English-Latvian Statistical Machine Translation System” (2009-2012).
Latvia is a partner in the CLARIN project, a pan-European effort to create
a language resource infrastructure for researchers in the humanities and social sciences. Latvia’s participation is financed by the Ministry of Education and Science. The advancement of CLARIN is mentioned in the strategic document
“Action Plan for Implementation of Guidelines for Science and Technology
Development” approved by the Cabinet of Ministers in 2010. The CLARIN
National Advisory Board was established and approved by the Ministry of
Education and Science to prioritize the goals and tasks of CLARIN in Latvia and to facilitate the creation of the CLARIN infrastructure.
As the market for language technologies is very small in Latvia, there are only
a few industry players providing solutions in this field. Tilde, established in
1991, is the major language technology company in Latvia. Tilde's key experience
is in three language technology areas: translation tools, proofing tools,
and terminology management. Language software by Tilde is widely used in
the Baltic countries, with more than 270 000 licensed users of Latvian language
translation and proofreading tools. Tilde develops online and mobile machine
translation and terminology systems for Latvian and other European languages. Tilde actively participates in EU research and development collaboration, coordinating several large-scale projects: EuroTermBank (eContent),
ACCURAT (FP7), LetsMT! (ICT-PSP) and META-NORD (ICT-PSP).
Another company developing machine translation solutions is Trident MT, a
recently opened Latvian branch of the Ukrainian company Trident. This company
participates in the ICT-PSP project itranslate4.eu26. The company Algorego develops solutions for processing and structuring information in digitized documents. The company Datorzinību Centrs develops e-learning applications, including solutions for language learning.
Taking into account the importance of LT in ensuring the sustainable development
of Latvian and other smaller languages, the Language Shore initiative was
launched in 2009 under the patronage of the President of Latvia, Valdis Zatlers.
This initiative fosters the creation of a partnership between government, academia and industry to develop an international expertise cluster in language
technology. Language Shore aims to achieve international leadership in
technologies for smaller languages.
To ensure the successful development of the initiative at the government
level, a Language Shore Steering Group has been established, composed of five sector
ministers. The first Language Shore pilot projects have been successfully completed
in cooperation between Tilde and Microsoft Research: advancing Latvian machine
translation for Bing Translator, developing a new crowd-sourcing model for
MT data collection, and establishing cooperation in terminology data sharing.
26 http://itranslate4.eu/project/
Further development of Language Shore is supported by the Competence
Centre Programme funded by EU Structural Funds. State support for the
competence centres aims to stimulate business research and promote sectoral
cooperation between companies and research institutions in order to develop innovative products and technologies that improve the competitiveness of enterprises.
The Latvian ICT Competence Centre was established in 2010 to carry out R&D
activities in language technologies and business process analysis. Major Latvian IT companies and universities will cooperate in the ICT Competence
Centre to develop advanced technologies for machine translation, speech processing and semantic analysis.
Despite several achievements in language technology research and industrial development, Latvia lacks a dedicated national program on language
technologies. Current research activities are fragmented and mostly organized
around short-term projects, which complicates long-term inter-institutional
cooperation and the development of larger resources. Public funding for LT in
Europe is relatively low compared to the expenditures for language translation
and multilingual information access by the USA27. In Latvia, public funding is
even lower than in many other European countries, including the neighboring
countries Estonia and Lithuania.
27 Gianni Lazzari: „Sprachtechnologien für Europa“, 2006: http://tcstar.org/pubblicazioni/D17_HLT_DE.pdf
Availability of Tools and Resources for Latvian
The following table provides an overview of the current situation in language
technology support for Latvian. The rating of existing technologies and resources is based on educated estimations by several leading experts using the
following criteria (each ranging from 0 to 6).
1. Quantity: Does a tool/resource exist for the language at hand? The
more technologies/resources exist, the higher the rating.
   • 0: no tools/resources whatsoever
   • 6: many technologies/resources, large variety
2. Availability: Are technologies/resources accessible, i.e., are they
Open Source, freely usable on any platform or only available for a
high price or under very restricted conditions?
   • 0: practically all technologies/resources are only available for
   a high price
   • 6: a large amount of technologies/resources is freely, openly
   available under sensible Open Source or Creative Commons
   licenses that allow re-use and re-purposing
3. Quality: How well are the respective performance criteria of technologies and quality indicators of resources met by the best available
tools, applications or resources? Are these technologies/resources current and also actively maintained?
   • 0: toy resource/technology
   • 6: high-quality technology, human-quality annotations in a resource
4. Coverage: To which degree do the best technologies meet the respective coverage criteria (styles, genres, text sorts, linguistic phenomena,
types of input/output, number of languages supported by an MT system
etc.)? To which degree are resources representative of the targeted
language or sublanguages?
   • 0: special-purpose resource or technology, specific case, very
   small coverage, only to be used for very specific, non-general
   use cases
   • 6: very broad coverage resource, very robust technology,
   widely applicable, many languages supported
5. Maturity: Can the technology/resource be considered mature, stable,
ready for the market? Can the best available technologies/resources be
used out-of-the-box or do they have to be adapted? Is the performance
of such a technology adequate and ready for production use or is it
only a prototype that cannot be used for production systems? An indicator may be whether resources/technologies are accepted by the
community and successfully used in LT systems.
   • 0: preliminary prototype, toy system, proof-of-concept, example resource exercise
   • 6: immediately integratable/applicable component
6. Sustainability: How well can the technology/resource be maintained/integrated into current IT systems? Does the technology/resource fulfil a determined level of sustainability concerning documentation/manuals, explanation of use cases, front-ends, GUIs etc.?
Does it use/employ standard/best-practice programming environments
(such as Java EE)? Do industry/research standards/quasi-standards exist and if so, is the technology/resource compliant (data formats etc.)?
   • 0: completely proprietary, ad hoc data formats and APIs
   • 6: full standard-compliance, fully documented
7. Adaptability: How well can the best technologies or resources be
adapted/extended to new tasks/domains/genres/text types/use cases
etc.?
   • 0: practically impossible to adapt a technology/resource to
   another task, impossible even with large amounts of resources
   or person months at hand
   • 6: very high level of adaptability; adaptation also very easy
   and efficiently possible
Adaptability
Sustainability
Maturity
Coverage
Quality
Availability
Quantity
Status of Tools and Resources for Latvian
Language Technology (Tools, Technologies, Applications)
Tokenization, Morphology (tokenization, POS tagging, morphological analysis/generation)
Parsing (shallow or deep syntactic analysis)
Sentence Semantics (WSD, argument structure, semantic
roles)
Text Semantics (coreference resolution, context, pragmatics,
inference)
3
4
4
5
5
5
4
2
4
3
2
5
1
4
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
2
3
3
3
2
3
1
3
1
3
3
1
2
0
0
0
0
0
0
0
3
0
2
6
0
4
4
0
4
4
0
3
5
0
5
2
0
5
4
0
4
0
0
0
0
0
0
0
4
1
1
0
4
4
1
0
0
4
5
4
0
0
4
Advanced Discourse Processing (text structure, coherence,
rhetorical structure/RST, argumentative zoning, argumentation, text
patterns, text types etc.)
Information Retrieval (text indexing, multimedia IR, crosslingual IR)
Information Extraction (named entity recognition,
event/relation extraction, opinion/sentiment recognition, text mining/analytics)
Language Generation (sentence generation, report generation, text generation)
Summarization, Question Answering, advanced
Information Access Technologies
Machine Translation
Speech Recognition
Speech Synthesis
Dialogue Management (dialogue capabilities and user modelling)
Language Resources (Resources, Data, Knowledge Bases)
Reference Corpora
2
5
4
5
Syntax-Corpora (treebanks, dependency banks)
1
4
1
1
Semantics-Corpora
1
2
1
1
Discourse-Corpora
0
0
0
0
Parallel Corpora, Translation Memories
2
4
4
3
Speech-Corpora (raw speech data, labeled/annotated speech
data, speech dialogue data)
Multimedia and multimodal data
(text data combined with audio/video)
Language Models
Lexicons, Terminologies
Grammars
Thesauri, WordNets
Ontological Resources for World Knowledge (e.g.
upper models, Linked Data)
2
1
1
1
1
1
3
0
0
0
0
0
0
0
3
4
2
2
2
5
1
2
3
5
4
3
4
5
3
2
3
5
4
4
5
5
4
4
3
5
3
4
1
2
1
1
1
0
0
Conclusions
1) Interpretation of the table.
For Latvian, key results regarding technologies and resources include the following:
• While several basic language resources and tools are rather well represented for the Latvian language, more advanced resources and tools are
missing; the establishment of a Language Technology Program coordinating and supporting the LT field in Latvia is the most important task
for resolving this issue;
• Reasonably good results have been achieved in machine translation;
quality depends on the availability of language resources,
which is rather limited for a language as small as Latvian;
• Semantics is more difficult to process, so only a few research prototypes have been created;
• The creation of speech and multimodal resources is in an initial phase;
most of these resources are not available for the Latvian language;
• Tools and resources for more advanced language technology such as
discourse processing, information retrieval, summarization and dialogue management do not exist;
• Many of the resources lack standardization, i.e., even if they exist,
sustainability is not given; concerted programs and initiatives are
needed to standardize data and interchange formats.
2) Where do we stand and what needs to be done?
Language technology in Latvia has a rather long history, starting at the end
of the 1950s. However, language technology has never been a priority research field in Latvia and has thus been supported with only very limited funding.
This situation has resulted in rather big gaps in the language resources and tools
needed for the sustainable development of the Latvian language.
Moreover, Latvia lacks a dedicated national programme for LT research and
development, and current research activities are fragmented and mostly organized around short-term projects, which complicates long-term inter-institutional cooperation and the development of larger resources. Targeted national research and development activities, e.g. a Language Technology Programme and a National Corpus Project, are urgently needed to fill these gaps.
Another urgent problem is the lack of educational programmes in computational linguistics at Latvian universities. Currently only one semester-long
course is taught at Liepaja University.
References
Auziņa Ilze, 2004. Latvian Text-to-Speech System. Proceedings of the first
Baltic conference “Human Language Technologies – the Baltic Perspective”,
21-26.
Deksne Daiga, I. Skadiņa, R. Skadiņš, A. Vasiljevs, 2005. Foreign Language
Reading Tool – First Step Towards English-Latvian Commercial Machine
Translation. Proceedings of Second Baltic Conference “Human Language
Technologies – the Baltic Perspective”, Tallinn.
Demetriou G., I. Skadiņa, H. Keskustalo, J. Karlgren, D. Deksne, D. Petrelli, P. Hansen, R. Gaizauskas, M. Sanderson, 2004. Cross Lingual Document Retrieval, Categorisation and Navigation Based on Distributed Services. Proceedings of First Baltic Conference “Human Language
Technologies – the Baltic Perspective”, Riga, 107-114.
Eisele Andreas, J. Xu, 2010. Improving Machine Translation Performance
Using Comparable Corpora. Proceedings of the 3rd Workshop on Building
and Using Comparable Corpora, European Language Resources Association
(ELRA), La Valletta, Malta, 35-41.
Ethnologue Lewis, M. Paul (ed.), 2009. Ethnologue: Languages of the World,
Sixteenth edition. Dallas, Tex.: SIL International. Online version:
http://www.ethnologue.com.
EUROMAP study, 2003. “Benchmarking HLT progress in Europe”, EUROMAP study.
Goba Kārlis, A. Vasiļjevs, 2007. Development of Text-To-Speech System for
Latvian. Proceedings of the 16th Nordic Conference of Computational Linguistics NODALIDA-2007, 67-72.
Goba Kārlis, 2007. Development of a Prosody Model for Latvian TTS. Proceedings of Third Baltic Conference HLT’2007, October 4-5, 2007, Kaunas,
Lithuania.
Greitāne Inguna, 1997. Mašīntulkošanas sistēma LATRA. LZA Vēstis Nr.3./4
(1997), 1-6.
Internet World Stats, 2010. http://www.internetworldstats.com Copyright ©
Miniwatts Marketing Group. All rights reserved.
Koehn Philipp, A. Birch, and Ralf Steinberger, 2009. 462 machine translation
systems for Europe. Proceedings of the Twelfth
Machine Translation Summit (MT Summit XII). International Association for
Machine Translation, 2009.
Milčonoka Everita, N. Grūzītis, A. Spektors, 2004. Natural Language Processing at the Institute of Mathematics and Computer Science: 10 Years Later. Proceedings of the first Baltic conference “Human Language Technologies – the Baltic Perspective”, 6–12.
Pinnis Mārcis, I. Auziņa, 2010. Latvian Text-to-Speech Synthesizer. Proceedings of the Fourth International Conference Baltic HLT 2010, IOS Press,
Frontiers in Artificial Intelligence and Applications, Vol. 219, 69-72.
Skadiņa Inguna, A. Vasiļjevs, D. Deksne, R. Skadiņš, L. Goldberga,
2007. Comprehension Assistant for Languages of Baltic States. Proceedings of
the 16th Nordic Conference of Computational Linguistics NODALIDA-2007,
Tartu, 2007, 167-174.
Skadiņa Inguna, E. Brālītis, 2008. Experimental Statistical Machine Translation System for Latvian. Proceedings of the 3rd Baltic Conference on HLT,
281-286.
Skadiņa Inguna, E. Brālītis, 2009. English-Latvian SMT: knowledge or data?
Proceedings of the 17th Nordic Conference on Computational Linguistics
NODALIDA, May 14-16, 2009, Odense, Denmark, NEALT Proceedings Series, Vol. 4, 242–245.
Skadiņa Inguna, A. Vasiļjevs, R. Skadiņš, R. Gaizauskas, D. Tufis, T. Gornostay, 2010. Analysis and Evaluation of Comparable Corpora for Under
Resourced Areas of Machine Translation. Proceedings of the 3rd Workshop
on Building and Using Comparable Corpora. European Language Resources
Association (ELRA), La Valletta, Malta, 6-14.
Skadiņa Inguna, I. Auziņa, N. Grūzītis, K. Levāne-Petrova, G. Nešpore, R.
Skadiņš, A. Vasiļjevs, 2010. Language Resources and Technology for the
Humanities in Latvia (2004–2010). Proceedings of the Fourth International
Conference Baltic HLT 2010, IOS Press, Frontiers in Artificial Intelligence
and Applications, Vol. 219, pp. 15-22.
Skadiņš Raivis, I. Skadiņa, D. Deksne, T. Gornostay, 2008. English/Russian-Latvian Machine Translation System. Proceedings of the 3rd Baltic Conference on HLT, 287-296.
Skadiņš Raivis, K. Goba, V. Šics, 2010. Improving SMT for Baltic languages
with factored models. Proceedings of the Fourth Baltic conference “Human
Language Technologies – the Baltic Perspective”.
Skadiņš Raivis, M. Puriņš, I. Skadiņa and A. Vasiļjevs, 2011. Evaluation of
SMT in localization to under-resourced inflected language. Proceedings of
EAMT-2011.
Steinberger Ralf, B. Pouliquen, A. Widiger, C. Ignat, T. Erjavec, D. Tufiş,
D. Varga, 2006. The JRC-Acquis: A multilingual aligned parallel corpus with
20+ languages. Proceedings of the 5th International Conference on Language
Resources and Evaluation (LREC'2006). Genoa, Italy, 24-26 May 2006.
Vasiļjevs Andrejs, T. Gornostay and R. Skadiņš, 2010. LetsMT! – Online
Platform for Sharing Training Data and Building User Tailored Machine
Translation. Proceedings of the Fourth Baltic conference “Human Language
Technologies – the Baltic Perspective”.
Appendix
META-NET
META-NET is a Network of Excellence funded by the European Union. It
currently consists of 44 members, representing 31 European countries, which
are listed below. META-NET is fostering the Multilingual Europe Technology Alliance (META), a growing community of language technology professionals and organisations in Europe.
META – The Multilingual Europe Technology Alliance
Figure 6: Countries Represented in META-NET
META-NET cooperates with a dozen other large initiatives like CLARIN,
which is helping social sciences to establish the field Digital Humanities in
Europe. META-NET is dedicated to fostering the technological foundations
for establishing and maintaining a truly multilingual European information
society that
• makes possible communication and cooperation across languages,
• safeguards equal access to information and knowledge for users of
any language,
• offers advanced functionalities of networked information technology
to all citizens at affordable costs.
META-NET stimulates and promotes multilingual technologies for all European languages. The technologies enable automatic translation, content production, information processing and knowledge management for a wide variety of applications and subject domains. The network wants to improve current
approaches, so better communication and cooperation across languages can
take place. Europeans have an equal right to information and knowledge regardless of language.
META-NET’s Three Lines of Action
META-NET launched on 1 February 2010 with the goal of advancing research in language technology. The initiative supports a Europe that unites as
a single, digital market and information space. META-NET has conducted
several activities that further its goals. META-VISION, META-SHARE and
META-RESEARCH are the network’s three lines of action.
About META-NET
Figure 7: Three Lines of Action in META-NET
META-VISION fosters a dynamic and influential stakeholder community
that unites around a shared vision and a common strategic research agenda
(SRA). The main focus of this activity is to build a coherent and cohesive LT
community in Europe by bringing together representatives from highly fragmented and diverse groups of stakeholders. In META-NET’s first year,
presentations at the FLaReNet Forum (Spain), Language Technology Days
(Luxembourg), JIAMCATT 2010 (Luxembourg), LREC 2010 (Malta),
EAMT 2010 (France) and ICT 2010 (Belgium) centred on public outreach.
According to initial estimates, META-NET has already contacted more than
2,500 LT professionals to share its goals and visions with them. At the META-FORUM 2010 event in Brussels, META-NET shared the initial results of
its vision building process with more than 250 participants. In a series of interactive sessions, the participants provided feedback on the visions presented by
the network.
META-SHARE creates an open, distributed facility for exchanging and sharing resources. The peer-to-peer network of repositories will contain language
data, tools and web services that are documented with high-quality metadata
and organised in standardised categories. The resources can be readily accessed and uniformly searched. The available resources include free, open
source materials as well as restricted, commercially available, fee-based
items. META-SHARE targets existing language data, tools and systems as
well as new and emerging products that are required for building and evaluating new technologies, products and services. The reuse, combination, repurposing and re-engineering of language data and tools play a crucial role.
META-SHARE will eventually become a critical part of the LT marketplace
for developers, localisation experts, researchers, translators and language professionals from small, mid-sized and large enterprises. META-SHARE addresses the full development cycle of LT, from research to innovative products and services. A key aspect of this activity is establishing META-SHARE
as an important and valuable part of a European and global infrastructure for
the LT community.
META-RESEARCH builds bridges to related technology fields. This activity seeks to leverage advances in other fields and to capitalise on innovative
research that can benefit language technology. In particular, this activity
aims to bring more semantics into machine translation (MT), optimise the
division of labour in hybrid MT, exploit context when computing automatic
translations and prepare an empirical base for MT. META-RESEARCH is
working with other fields and disciplines, such as machine learning and the
Semantic Web community. META-RESEARCH focuses on collecting data,
preparing data sets and organising language resources for evaluation purposes; compiling inventories of tools and methods; and organising workshops and
training events for members of the community. This activity has already clearly identified aspects of MT where semantics can impact current best practices.
In addition, the activity has created recommendations on how to approach the
problem of integrating semantic information in MT. META-RESEARCH is
also finalising a new language resource for MT, the Annotated Hybrid Sample
MT Corpus, which provides data for English-German, English-Spanish and
English-Czech language pairs. META-RESEARCH has also developed software that collects multilingual corpora that are hidden on the web.
Composition of the META-NET Network of Excellence
Country      Member (Affiliation)                         Contacts
Austria      Universität Wien                             Gerhard Budin
Belgium      University of Antwerp                        Walter Daelemans
             University of Leuven                         Dirk van Compernolle
Bulgaria     Bulgarian Academy of Sciences                Svetla Koeva
Croatia      Zagreb University                            Marko Tadic
Cyprus       University of Cyprus                         Jack Burston
Czech Rep.   Charles University in Prague*                Jan Hajic
Denmark      University of Copenhagen                     Bente Maegaard, Bolette Sandford Pedersen
Estonia      University of Tartu                          Tiit Roosmaa
Finland      Aalto University*                            Timo Honkela
             University of Helsinki                       Kimmo Koskenniemi, Krister Linden
France       CNRS, LIMSI*                                 Joseph Mariani
             ELDA*                                        Khalid Choukri
Germany      DFKI*                                        Hans Uszkoreit, Georg Rehm
             RWTH Aachen*                                 Hermann Ney
Greece       ILSP, R.C. "Athena"*                         Stelios Piperidis
Hungary      Hungarian Academy of Sciences                Tamás Váradi
             Budapest Technical University                Géza Németh, Gábor Olaszy
Iceland      University of Iceland                        Eirikur Rögnvaldsson
Ireland      Dublin City University*                      Josef van Genabith
Italy        Consiglio Nazionale Ricerche*                Nicoletta Calzolari
             Fondazione Bruno Kessler*                    Bernardo Magnini
Latvia       Tilde                                        Andrejs Vasiljevs
             Institute of Mathematics and Computer        Inguna Skadina
             Science, University of Latvia
Lithuania    Institute of the Lithuanian Language         Jolanta Zabarskaitė
Luxembourg   Arax Ltd.                                    Vartkes Goetcherian
Malta        University of Malta                          Mike Rosner
Netherlands  Universiteit Utrecht*                        Jan Odijk
Norway       University of Bergen                         Koenraad De Smedt
Poland       Polish Academy of Sciences                   Adam Przepiórkowski
             University of Łódź                           Piotr Pezik
Portugal     University of Lisbon                         Antonio Branco
             Inst. for Systems Engineering and Computers  Isabel Trancoso
Romania      Romanian Academy of Sciences                 Dan Tufis
             University Alexandru Ioan Cuza               Dan Cristea
Serbia       Belgrade University                          Dusko Vitas, Cvetana Krstev, Ivan Obradovic
             Pupin Institute                              Sanja Vranes
Slovakia     Slovak Academy of Sciences                   Radovan Garabik
Slovenia     Jozef Stefan Institute*                      Marko Grobelnik
Spain        Barcelona Media*                             Toni Badia
             Technical University of Catalonia            Asunción Moreno
             University Pompeu Fabra                      Núria Bel
Sweden       University of Gothenburg                     Lars Borin
UK           University of Manchester                     Sophia Ananiadou
An asterisk (*) marks a founding member.
How to Participate?
META-NET and META offer many opportunities for participation.
Please check out www.meta-net.eu for information on upcoming
events and activities.