How the Computer Translates

Svetlana Sokolova
President and CEO of PROMT, PhD.
Machine translation is a special field of computing in which almost
everyone considers themselves a specialist.
Firstly, everybody understands that the larger the dictionary, the better
the translation will be, so the first problem is to create large dictionaries for the
systems.
Secondly, it is clear that the system should be able to translate sentences like HI,
HOW ARE YOU DOING? So another problem is to teach the system to
recognize common collocations.
Thirdly, it is obvious that a sentence to be translated is written according to
certain rules and should be translated according to certain rules, so
there is one more problem: to store all these rules as a program. That's it.
The point is that these problems really are essential to the development of machine
translation systems; however, the methods for solving them are not commonly known
and are not as simple as they may seem.
Machine translation systems of the PROMT family are well suited to
illustrate effective solutions to these problems.
Dictionary
Methods for organizing large databases are well developed, but for
translation, in order to retrieve database elements correctly, it may be
more important to know how to structure the information assigned to each
element. For example, how many dictionary entries should correspond to a
common Russian word such as "program"? What is more, is a large dictionary a
dictionary that contains many entries, or a dictionary that allows many
words in a text to be recognized?
A closer look reveals that, for example, Russian nouns change in case and
number, i.e. up to 12 different forms can exist for a noun, and, as a rule, an even
greater number of different forms can exist for verbs and adjectives (more than
30). Therefore, in order to translate sentences containing Russian declinable
words like "program", "about program", "programs", etc., it is useful to
implement a technique for matching the "program" entry
contained in the computer dictionary with the corresponding word form in the text.
So, in order to describe both source and target languages, the system should use a
formal method of morphology description that serves as the basis for dictionary unit
retrieval.
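The matching of a word form to a dictionary entry described above can be sketched as follows. This is only an illustration under stated assumptions, not the PROMT implementation: the paradigm name noun_f_a, the feature labels and the function analyze are hypothetical, and the table holds the 12 case and number slots of the Russian noun "программа" ("program").

```python
# Hypothetical paradigm table: ending -> list of (case, number) readings.
PARADIGMS = {
    "noun_f_a": {
        "а": [("nom", "sg")],
        "ы": [("gen", "sg"), ("nom", "pl"), ("acc", "pl")],
        "е": [("dat", "sg"), ("prep", "sg")],
        "у": [("acc", "sg")],
        "ой": [("ins", "sg")],
        "": [("gen", "pl")],
        "ам": [("dat", "pl")],
        "ами": [("ins", "pl")],
        "ах": [("prep", "pl")],
    }
}

# One dictionary entry covers every form: stem -> (headword, paradigm name).
DICTIONARY = {"программ": ("программа", "noun_f_a")}


def analyze(word_form):
    """Return every (headword, case, number) reading of word_form."""
    readings = []
    # Try every split into stem + ending, starting with the empty ending.
    for cut in range(len(word_form), 0, -1):
        stem, ending = word_form[:cut], word_form[cut:]
        entry = DICTIONARY.get(stem)
        if entry is None:
            continue
        headword, paradigm = entry
        for case, number in PARADIGMS[paradigm].get(ending, []):
            readings.append((headword, case, number))
    return readings
```

For example, analyze("программе") yields two readings (dative and prepositional singular) for the single entry "программа", which shows why the number of recognizable word forms can far exceed the number of dictionary entries.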
PROMT Ltd (Headquarters)
16 Birzhevaya Liniya
Vasilyevsky Island
Saint-Petersburg
199034 Russia
PROMT (Moscow office)
6a Letnikovskaja Street
Moscow
115114 Russia
PROMT GmbH
Eiffestr. 632
20537 Hamburg
Tel.: +7-812-331 75 40
Fax: +7-812-327 44 83
E-Mail: [email protected]
Internet: www.e-promt.com
Tel.: +7-095-580 48 48
+7-095-509 35 94
E-Mail: [email protected]
Internet: www.promt.ru
Tel.: +49-40-219 01 140
Fax: +49-40-219 01 143
E-Mail: [email protected]
Internet: www.promt.de
Actually, every system that claims to be a translation system solves the problem of
representing morphological models in some way. But some systems
can recognize 1,000,000 word forms on the basis of a dictionary containing
50,000 entries, while other systems with a dictionary of 100,000 entries can
recognize just those 100,000.
In PROMT family systems, a morphological description has been developed for all
languages handled by the systems. This description is almost unique in
its completeness. It contains 800 types of word change for the Russian language,
more than 300 types for the German and French languages, and even for English,
which is not an inflectional language, over 250 types of word change are defined.
The set of endings in each language is stored as tree structures, which provides
not only an efficient means of storage but also an efficient algorithm for
morphological analysis.
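One way to picture endings stored as a tree is a trie read from the last character of the word backwards. The sketch below is an assumption about the general technique, not a description of PROMT's actual data structures; the English endings, the tag names and both function names are hypothetical.

```python
def build_ending_tree(endings):
    """Build a trie keyed by the characters of each ending, read right to left."""
    root = {}
    for ending, tag in endings.items():
        node = root
        for ch in reversed(ending):
            node = node.setdefault(ch, {})
        node["$"] = tag  # "$" marks a node where a complete ending stops
    return root


def split_candidates(word, tree):
    """Walk the word from its last character; collect (stem, ending, tag)."""
    results = []
    node = tree
    if "$" in node:  # the empty ending always matches first
        results.append((word, "", node["$"]))
    for i in range(len(word) - 1, 0, -1):  # keep at least one stem character
        node = node.get(word[i])
        if node is None:  # no stored ending continues with this character
            break
        if "$" in node:
            results.append((word[:i], word[i:], node["$"]))
    return results


# Hypothetical English endings, for illustration only.
ENDINGS = {"": "base", "s": "s-form", "es": "s-form", "ed": "past", "ing": "ing-form"}
TREE = build_ending_tree(ENDINGS)
```

A single pass over the tail of the word finds all candidate endings at once, and the walk stops as soon as no stored ending can match, so the cost depends on the length of the longest ending rather than on the number of endings.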
Furthermore, this morphology model was used to develop an
advisory system for users who create dictionaries themselves. This system
effectively automates stem extraction and the determination of the word
change type when new dictionary entries are entered.
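How such an advisory step might work can be sketched roughly as follows. This is a guess at the general idea, not PROMT's algorithm: the two inflection types, their names, and the functions suggest and generate are hypothetical, and a real system would choose among several hundred types.

```python
# Hypothetical inflection types: name -> (citation-form ending, all endings).
TYPES = {
    "noun_f_a": ("а", ["а", "ы", "е", "у", "ой", "", "ам", "ами", "ах"]),
    "noun_m_hard": ("", ["", "а", "у", "ом", "е", "ы", "ов", "ам", "ами", "ах"]),
}


def suggest(headword):
    """Propose (stem, type) pairs, longest citation ending first."""
    proposals = []
    for name, (citation_ending, _all_endings) in TYPES.items():
        fits = headword.endswith(citation_ending)
        if fits and len(headword) > len(citation_ending):
            stem = headword[: len(headword) - len(citation_ending)]
            proposals.append((len(citation_ending), stem, name))
    proposals.sort(key=lambda p: -p[0])  # stable sort keeps ties in table order
    return [(stem, name) for _, stem, name in proposals]


def generate(stem, type_name):
    """List the distinct word forms implied by a proposal, for user review."""
    return sorted({stem + ending for ending in TYPES[type_name][1]})
```

The user would then confirm one proposal after looking at the generated forms, instead of declining or conjugating the word by hand.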
There is no such feature in other existing machine translation systems, even in
such well-known systems as Power Translator (Globalink, USA), Language
Assistant (MicroTac, USA) and TRANSEND (Intergraph, USA), where users have to
conjugate and decline words manually in order to define a morphological model.
Nevertheless, a morphology description solves only
one problem, namely determining the dictionary entry header,
which is used to match a text unit with a dictionary unit. But
the correspondence between a word in the text and a dictionary entry is
needed not only for identification, as in spell checkers
or electronic dictionaries, but also so that the software can carry out
translation procedures. So what information should a dictionary entry contain, and how should
translation rules be described in order to make the software translate?
Dictionary
Here a historical digression is needed, because machine translation, as a part of
applied linguistics, has a very dramatic history. In the 1950s, along with the
development of the first computers, the idea of machine translation appeared.
Incidentally, the term "machine translation" has existed since that time. The task seemed
quite easy to perform. This caused a kind of linguistic euphoria, and several
global projects to create translation systems for different languages were
launched.
None of these projects produced a working system, and the commission
specially established by the US National Academy of Sciences stated in 1966 that
machine translation projects had no future and should not be financed. Only at
the beginning of the 1980s did linguists recover enough from the consequences of this
verdict to resume research and development in this field. Certainly, in many
respects this revival was connected with the overall development of the computer
industry and, more particularly, with growing interest in "artificial intelligence"
as a field of computer application.
Nevertheless, in the 1980s history almost repeated itself, except that in addition to
global projects such as EUROTRA (European Economic Community), ARIANE
(France), METAL (USA and Germany), KANT (USA) and SUSY (Germany), many
local projects with less ambitious goals were launched.
The global projects were still aimed at solving the translation problem in general.
Within these projects, the description of lexical units for the
dictionary and the development of translation algorithms were treated as
separate tasks. A variety of linguistic papers appeared proposing structures for
describing the properties of real words in a computer dictionary entry. At the same
time, a number of independent studies were published on topics such as
"The Structure of Noun Phrase" or "Representation of Direct
Objects of Verbs of Saying". However, real commercial systems that actually
implemented the results of these studies did not reach the market. Each
system developed carried the modest label "experimental" or "prototype".
In practice, none of these systems was ever finished or could be
considered a consumer product. The reason was that the methods used to
describe translation, once transferred to a real environment
(i.e. applied to arbitrary texts), proved inconsistent with the
methods proposed for creating dictionary entries.
The exception, perhaps, is the METAL project. Although this project did not
ultimately result in a real commercial product, during its development it was
redirected toward creating a system capable of translating from German
into English and from English into German and of handling specialized dictionaries
for specific subject areas.
At the same time, local projects were oriented toward narrow-scope solutions.
The developers' goal was to obtain some usable result. In these projects, dictionary
description and the description of algorithms were treated as integral parts of one
problem, but the solution, as a rule, was found by restricting the domain
analyzed, either grammatically or semantically. For example, on the basis of the
"part of speech" attribute, grammars of the following type were
described:
· a noun phrase is a noun
· a noun phrase is an adjective + a noun phrase
· a verbal phrase is a verb + a noun phrase
· a sentence is a noun phrase + a verbal phrase
It is clear that some natural-language sentences can be described by such a
grammar, but their number is insignificant and insufficient for correct analysis
and translation of real text. On the other hand, such a grammar makes it possible to use
efficient methods for constructing a converter or, at worst, to
write a program that builds dependency trees for a limited set of sentences by
means of linear search. Such systems, too, were called "experimental".
Though neither of these approaches resulted in commercial systems, the research
conducted in this area helped to reveal the complexity of the task and,
at least, to identify the bottlenecks in such developments. In any case, these local
projects became the platform that allowed the creation of the translation systems
now offered to end users. Power Translator (Globalink), Language
Assistant (MicroTac) and TRANSEND (Intergraph) are
among these systems.
Systems of the STYLUS and PROMT families are no exception, as many
specialists of the PROMT Company were involved in similar development
projects. Nevertheless, a revolutionary approach was applied for the first time in the
development of PROMT systems, and it led to impressive results. Translation
systems of the PROMT family are designed on the basis of
cybernetic rather than linguistic methods.
It proved very effective to regard the translation system not as
a translator whose task is to translate a text that is valid with respect
to the source grammar, but rather as a complex system whose
task is to produce a result for arbitrary input data, including texts that are
not correct with respect to the grammar the system uses.
Instead of the accepted linguistic approach, which assumes
sequential processes of sentence analysis and synthesis, the architecture of the
system represents translation procedures as an "object-oriented"
process founded on a hierarchy of the sentence components to be
processed. This made PROMT systems stable and open.
In addition, this approach made it possible to apply different formalisms for describing
translation at different levels. The systems employ network grammars,
similar in type to augmented transition networks, as well as procedural
algorithms for filling and transforming frame structures for the analysis of
complex predicates.
The description of a lexical unit within a dictionary entry, which is in practice unlimited in
size and can contain a number of different attributes, is closely
tied to the structure of the system's algorithms and is organized not around
the age-old opposition of syntax and semantics, but rather around the
levels of text components.
Thus the systems can work with incompletely described dictionary entries,
which is very important for opening the dictionaries to users who cannot be
regarded as highly experienced specialists in linguistics.
The very first machine translation system, released by the PROMT Company in
1991, was able to translate specialized texts relating to computer software from
English into Russian. The system employed a small dictionary (about 17,000
words and expressions), ran under DOS and had no tools for
customization. But even this first system was soundly designed, and the
technology that the PROMT Company uses today to develop machine translation
algorithms has not required major modification since. Moreover, the
approach found during that phase of development proved to be very effective for
many different languages.
First, some definitions. As machine translation developed as a part of
applied linguistics, classifications of systems also
appeared, and a division of translation systems into TRANSFER systems and
INTERLINGUA systems was adopted. This division is based on the
architecture of the linguistic algorithms.
Translation algorithms for TRANSFER systems are built as a composition of
three processes: analysis of the input sentence in terms of source language
structures, conversion of the resulting structure into a similar target language structure
(TRANSFER), and, finally, synthesis of the output sentence according to the
constructed structure.
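The three processes can be illustrated with a deliberately tiny example that translates English adjective + noun pairs into French, where the adjective follows the noun. Everything here is an assumption made for illustration (the NounPhrase structure, the four-word dictionary, the function names); agreement and all other real difficulties are ignored.

```python
from dataclasses import dataclass


@dataclass
class NounPhrase:
    adjective: str
    noun: str
    adjective_first: bool  # word order is part of the structure


# Hypothetical bilingual dictionary with gender-invariant adjectives only.
EN_FR = {"red": "rouge", "yellow": "jaune", "car": "voiture", "cat": "chat"}


def analyze(text):
    """Analysis: build a source-language structure from the input."""
    adjective, noun = text.lower().split()
    return NounPhrase(adjective, noun, adjective_first=True)


def transfer(source):
    """TRANSFER: convert the source structure into a target structure."""
    return NounPhrase(EN_FR[source.adjective], EN_FR[source.noun],
                      adjective_first=False)


def synthesize(target):
    """Synthesis: linearize the target structure into output text."""
    words = [target.adjective, target.noun]
    if not target.adjective_first:
        words.reverse()
    return " ".join(words)


def translate(text):
    return synthesize(transfer(analyze(text)))
```

With these definitions, translate("red car") gives "voiture rouge": the word order changes in the transfer step, not in the dictionary.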
INTERLINGUA systems assume a priori that a certain structural metalanguage
(INTERLINGUA) is available, which, in principle, can be used to describe
any structure of both source and target languages. It is therefore supposed that
the translation algorithm employed in INTERLINGUA systems is simpler:
analysis of the input sentence in terms of the metalanguage and then synthesis of
a corresponding target language sentence from the metastructure. In this case,
the "only" difficulty is developing the metalanguage itself and
describing the natural language in its terms.
Although this classification does exist, and among machine
translation developers it is considered good form to ask which type
a system belongs to, no real system has yet been developed on the
INTERLINGUA principle.
Our system is no exception, and we answer this question as follows: our
system performs translation of the TRANSFER type. But this answer is too
simple, and it does not actually reflect any particular feature of the PROMT system
architecture. The special feature of the system is that the TRANSFER method
is not applied according to the standard linguistic approach.
As a matter of fact, a translation system generally has to operate
with incomplete data, because language is a living, fast-evolving system: new
words, new functions of old words and, along with new entities, new meanings
are constantly appearing. In this situation, the key structural property of
translation algorithms is the stability of the system with respect to arbitrary input
data. The translation algorithms of the PROMT system are based not on sequential
TRANSFER procedures, but on a hierarchical approach that divides
the translation process into interconnected TRANSFER procedures for different
units of analysis.
The following levels are distinguished in the system: the lexical unit level, the
group level, the simple sentence level and the compound sentence level. The
processes at all these levels are interconnected and interact hierarchically according to
the hierarchy of text units, exchanging synthesized and inherited attributes. This
arrangement of the algorithms makes it possible to use different formal methods to
describe the algorithms at different levels.
Now let's look at the lexical unit level. A lexical unit is a word or collocation,
the unit of the lowest level. Each word is described as the combination of a stem
and an ending, in both source and target languages. On the one hand, this makes it
possible to recognize the source word and analyze its morphology, and, on
the other hand, to synthesize the target word conveniently from the
relevant morphological data (the stem, the type of change and the address of the
ending in the array of endings of this type). Thus, if rules for converting source
morphological data into target morphological data are available, it is possible to
carry out TRANSFER procedures on the morphological level.
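A minimal sketch of TRANSFER at this level, assuming English as the source and Russian as the target: the target word is synthesized from a stem, a type of change and an address into that type's array of endings. The type name noun_f_a, the one-entry lexicon, the feature labels and the function names are hypothetical, and the target case is passed in by hand here, whereas in a full system it would be inherited from a higher level.

```python
# Hypothetical array of endings for one type of change.
# Addresses 0-5: nom, gen, dat, acc, ins, prep singular; 6-11: same, plural.
TARGET_ENDINGS = {
    "noun_f_a": ["а", "ы", "е", "у", "ой", "е",
                 "ы", "", "ам", "ы", "ами", "ах"],
}
CASES = ["nom", "gen", "dat", "acc", "ins", "prep"]

# Source morphology: English noun ending -> number.
SOURCE_ENDINGS = {"": "sg", "s": "pl"}

# Bilingual entry: source stem -> (target stem, target type of change).
LEXICON = {"program": ("программ", "noun_f_a")}


def ending_address(case, number):
    """Convert morphological data into an address in the array of endings."""
    return CASES.index(case) + (len(CASES) if number == "pl" else 0)


def transfer_word(word, case="nom"):
    """Analyze the source form, then synthesize the matching target form."""
    for ending, number in SOURCE_ENDINGS.items():
        stem = word[: len(word) - len(ending)]
        if word.endswith(ending) and stem in LEXICON:
            target_stem, change_type = LEXICON[stem]
            address = ending_address(case, number)
            return target_stem + TARGET_ENDINGS[change_type][address]
    return None  # unknown word: left for a fallback strategy
```

For instance, transfer_word("programs") produces "программы", and transfer_word("program", case="acc") produces "программу".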
The group level corresponds to more complex structures: groups of nouns,
adjectives, adverbs and complex verbal forms. This level is based on formal
network grammars, and during analysis it allows groups to be combined into
syntactic units. Each unit is characterized by synthesized structural
data and by the main unit of the group. The target group is created from the
source structure, which is composed in terms of immediate constituents, together
with the synthesized attributes; it is a set of lexical units whose morphological
attribute values can be inherited according to the results of group analysis.
In this way, the TRANSFER procedures are implemented on the group level.
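As a rough illustration of a network grammar for a noun group, the sketch below uses a plain finite transition network, which is much simpler than the grammars the text refers to. The states, the categories DET, ADJ and N, and the function name are assumptions; the group's number is synthesized from its head noun.

```python
# Transition network for a noun group: state -> {category: next state}.
NOUN_GROUP_NET = {
    "start": {"DET": "mods", "ADJ": "mods", "N": "done"},
    "mods": {"ADJ": "mods", "N": "done"},
    "done": {},
}


def match_noun_group(tokens, begin=0):
    """Match one noun group starting at index begin.

    tokens is a list of (word, category, attributes) triples.
    Return (end_index, group) on success, or None if no group starts here.
    """
    state, i, words, head = "start", begin, [], None
    while i < len(tokens):
        word, category, attrs = tokens[i]
        next_state = NOUN_GROUP_NET[state].get(category)
        if next_state is None:  # no arc for this category: stop here
            break
        words.append(word)
        if category == "N":
            head = (word, attrs)  # the noun is the main unit of the group
        state, i = next_state, i + 1
    if state != "done":
        return None
    # Attributes of the whole group are synthesized from its head.
    return i, {"head": head[0], "number": head[1]["number"], "words": words}
```

Given tokens for "the large dictionaries help", the network consumes the first three words and returns a group whose head is "dictionaries" and whose number is plural; these attributes could then be passed down to the target words when the group is synthesized.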
Simple sentences are treated as structures consisting of
syntactic units and are analyzed on the basis of frame predicate structures, which
provide effective conversions. In a simple sentence, the verb is considered the main
element, and its valences determine how the corresponding frame is filled. For each
type of frame, there is a conversion rule for creating the target frame and
forming the actants. In this way, the TRANSFER procedures are implemented on
the sentence level. The analysis of compound sentences is required when it is
necessary to establish tense agreement and provide correct translation of
conjunctions.
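Frame filling and frame conversion might be sketched as follows. The frames, slot names and conversion table are invented for the example and are not taken from PROMT; the English verb "like" is used because its Russian counterpart "нравиться" assigns the actants to different roles.

```python
# Hypothetical source frames: verb -> the valence slots it opens, in order.
SOURCE_FRAMES = {
    "like": ["subject", "object"],
    "give": ["subject", "object", "recipient"],
}

# Hypothetical conversion rules: source verb -> (target verb, slot mapping).
CONVERSION = {
    "like": ("нравиться",
             {"subject": "dative_experiencer", "object": "subject"}),
    "give": ("давать",
             {"subject": "subject", "object": "object",
              "recipient": "dative_recipient"}),
}


def fill_frame(verb, groups):
    """Fill the verb's slots with already-analyzed syntactic units."""
    slots = SOURCE_FRAMES[verb]
    if len(groups) != len(slots):
        raise ValueError(f"{verb!r} expects {len(slots)} actants")
    return {"verb": verb, "actants": dict(zip(slots, groups))}


def convert_frame(frame):
    """Create the target frame and reassign the actants to target slots."""
    target_verb, mapping = CONVERSION[frame["verb"]]
    actants = {mapping[slot]: group
               for slot, group in frame["actants"].items()}
    return {"verb": target_verb, "actants": actants}
```

For "I like music", the source subject ends up in the dative experiencer slot of the target frame and the source object becomes the target subject, which is the kind of restructuring a word-by-word substitution cannot express.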
Conclusion
We hope that this information will allow potential users of translation systems to
understand that creating a machine translation system is not a simple task but
rather a knowledge-intensive one. Therefore, the number of real, ready-to-operate
translation systems that can appear per unit of time is inherently limited.
Svetlana Sokolova
President and CEO of PROMT (www.e-promt.com), PhD.