How the Computer Translates
Svetlana Sokolova, PhD, President and CEO of PROMT

Machine translation is a special field of computer application in which almost everyone considers themselves a specialist. First, everybody understands that the larger the dictionary, the better the translation, so the first problem is to create large dictionaries for the systems. Second, it is clear that the system should be able to translate sentences like HI, HOW ARE YOU DOING?, so another problem is to teach the system to recognize common collocations. Third, it is obvious that a sentence to be translated is written according to certain rules and must be translated according to certain rules, so there is one more problem: to store all these rules as a program. That's it.

The point is that these problems really are essential for the development of machine translation systems; however, the methods for solving them are not commonly known and are not as simple as they may seem. Machine translation systems of the PROMT family are good examples of effective solutions to these problems.

Dictionary

Methods for organizing large databases are well developed, but in translation, correct retrieval of database elements depends more on how the information assigned to each element is structured. For example, how many dictionary entries should correspond to the common Russian word "program"? Moreover, is a large dictionary one that contains many entries, or one that allows many words in a text to be recognized? A closer look reveals that Russian nouns, for example, inflect for case and number, so a noun can have up to 12 different forms, and verbs and adjectives as a rule have even more (over 30).
Therefore, to translate sentences containing inflected forms of a Russian word, such as "program", "about program", "programs", etc., the system needs a technique for matching the "program" entry in the computer dictionary with the corresponding word form in the text. So, to describe both source and target languages, the system should use a formal method of morphological description, which is the basis for retrieving dictionary units.

PROMT Ltd (Headquarters), 16 Birzhevaya Liniya, Vasilyevsky Island, Saint-Petersburg 199034, Russia. Tel.: +7-812-331 75 40, Fax: +7-812-327 44 83, E-Mail: [email protected], Internet: www.e-promt.com
PROMT (Moscow office), 6a Letnikovskaja Street, Moscow 115114, Russia. Tel.: +7-095-580 48 48, +7-095-509 35 94, E-Mail: [email protected], Internet: www.promt.ru
PROMT GmbH, Eiffestr. 632, 20537 Hamburg. Tel.: +49-40-219 01 140, Fax: +49-40-219 01 143, E-Mail: [email protected], Internet: www.promt.de

In fact, every system that claims to be a translation system solves the problem of representing morphological models in some way. But some systems can recognize 1,000,000 word forms on the basis of a dictionary containing 50,000 entries, while other systems with a dictionary of 100,000 entries can recognize only those 100,000 forms. In PROMT family systems, a morphological description has been developed for every language the systems handle. This description is almost unique in its completeness. It contains 800 types of word change for Russian, more than 300 types for German and French, and even for English, which is only weakly inflected, over 250 types of word change are defined. The set of endings for each language is stored as tree structures, which provides not only efficient storage but also an efficient algorithm for morphological analysis.
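The idea of storing endings as a tree and using it to match a word form with its dictionary entry can be illustrated with a minimal sketch. It is not PROMT's implementation: the transliterated forms of "programma", the paradigm name `noun_fem_a`, the tag labels and all function names are assumptions made for illustration. The endings of one type of word change are stored in a trie that is read from the last letter of the word backwards, and each ending found is checked against a stem dictionary.

```python
# Hypothetical paradigm table: type of word change -> (ending, grammatical tags).
# The 12 cells correspond to 6 cases x 2 numbers of a transliterated Russian noun.
PARADIGMS = {
    "noun_fem_a": [
        ("a", "nom.sg"), ("y", "gen.sg"), ("e", "dat.sg"), ("u", "acc.sg"),
        ("oj", "ins.sg"), ("e", "prep.sg"),
        ("y", "nom.pl"), ("", "gen.pl"), ("am", "dat.pl"), ("y", "acc.pl"),
        ("ami", "ins.pl"), ("ax", "prep.pl"),
    ],
}

# Hypothetical dictionary: stem -> list of (headword, type of word change).
STEMS = {"programm": [("programma", "noun_fem_a")]}


def build_ending_trie(paradigms):
    """Store every ending in a trie keyed from the last letter backwards."""
    root = {"children": {}, "hits": []}
    for ptype, cells in paradigms.items():
        for ending, tags in cells:
            node = root
            for ch in reversed(ending):
                node = node["children"].setdefault(ch, {"children": {}, "hits": []})
            node["hits"].append((ptype, tags))
    return root


def analyze(word, trie, stems):
    """Return every (headword, tags) reading of a word form."""
    readings = []
    node = trie
    cut = len(word)  # everything before `cut` is the candidate stem
    while node is not None:
        stem = word[:cut]
        for ptype, tags in node["hits"]:
            for headword, stem_type in stems.get(stem, []):
                if stem_type == ptype:
                    readings.append((headword, tags))
        if cut == 0:
            break
        cut -= 1
        # Step one letter further from the end of the word into the trie.
        node = node["children"].get(word[cut])
    return readings


TRIE = build_ending_trie(PARADIGMS)
print(analyze("programmami", TRIE, STEMS))
print(analyze("programmy", TRIE, STEMS))
```

With a single dictionary entry, the sketch recognizes every form of the paradigm, and an ambiguous form such as "programmy" yields several readings. This is the effect described above: the number of recognizable word forms is many times the number of dictionary entries.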
Furthermore, this morphological model was used to develop an advisory system for users who create dictionaries themselves. The system automates stem extraction and determination of the word change type when new dictionary entries are entered. No other existing machine translation system has such a feature, not even well-known systems such as Power Translator (Globalink, USA), Language Assistant (MicroTac, USA) and TRANSEND (Intergraph, USA), where users must conjugate and decline words manually in order to define a morphological model.

Nevertheless, morphological description solves only one problem: determining the dictionary entry header, which is used to match a text unit with a dictionary unit. But a word in the text is matched with a dictionary entry not only for identification, as in spell checkers or electronic dictionaries, but also so that the software can carry out translation procedures. So what information should a dictionary entry contain, and how should translation rules be described, to make the software translate?

Dictionary

Here a historical digression is needed, because machine translation, as a part of applied linguistics, has a very dramatic history. The idea of machine translation appeared in the 1950s, along with the development of the first computers; the term "machine translation" dates from that time. The task seemed quite easy. This caused a kind of linguistic euphoria, and several global projects to create translation systems for different languages were launched.
None of these projects produced a working system, and the commission specially established by the US National Academy of Sciences concluded in its 1966 report that machine translation projects had no future and should not be financed. Only in the early 1980s did linguists recover enough from the consequences of this verdict to resume research and development in the field. In many respects this revival was connected with the overall development of the computer industry and, more particularly, with growing interest in "artificial intelligence" as a field of computer application.

Nevertheless, in the 1980s history almost repeated itself, except that in addition to global projects such as EUROTRA (European Economic Community), ARIANE (France), METAL (USA and Germany), KANT (USA) and SUSY (Germany), many local projects with less ambitious goals were launched.

The global projects were still aimed at solving the translation problem in general. Within these projects, describing lexical units for the dictionary and developing translation algorithms were treated as separate tasks. A variety of linguistic papers appeared proposing structures for describing the properties of real words in a computer dictionary entry. At the same time, a number of independent studies were published on issues such as "The Structure of Noun Phrase" or "Representation of Direct Objects of Verbs of Saying". However, real commercial systems implementing the results of these studies never reached the market. Each system that was developed carried the modest label "experimental" or "prototype".
In practice, however, none of these systems was ever finished or could be considered a consumer product. The reason was that the methods used to describe translation, once transferred to a real environment (i.e. applied to arbitrary texts), proved inconsistent with the methods proposed for creating dictionary entries. The exception, perhaps, is the METAL project. Although it did not ultimately result in a real commercial product, during its development it was redirected toward a system capable of translating from German into English and from English into German and of handling specialized dictionaries for specific subject areas.

At the same time, local projects were oriented toward narrow-scope solutions. The developers' goal was to obtain any valuable result. In these projects, dictionary description and algorithm description were treated as integral parts of one problem, but the solution, as a rule, was found by restricting the analyzed environment, either grammatically or semantically. For example, on the basis of the "part of speech" attribute alone, a grammar of the following type was described:

· a noun phrase is a noun
· a noun phrase is an adjective + a noun phrase
· a verbal phrase is a verb + a noun phrase
· a sentence is a noun phrase + a verbal phrase
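The four rules above are small enough to be written out as a recognizer. The following is a minimal sketch, assuming a hypothetical part-of-speech lexicon and hypothetical function names; it accepts only sentences that the four rules can derive, which makes the narrowness of such a grammar easy to see.

```python
# Hypothetical part-of-speech lexicon; the only attribute the grammar uses.
POS = {
    "old": "Adj", "long": "Adj",
    "programs": "N", "texts": "N",
    "translate": "V",
    "quickly": "Adv",
}


def parse_np(tags, i):
    """noun phrase = noun | adjective + noun phrase.
    Return the index just past the phrase, or None if no phrase starts at i."""
    if i < len(tags) and tags[i] == "Adj":
        return parse_np(tags, i + 1)
    if i < len(tags) and tags[i] == "N":
        return i + 1
    return None


def parse_vp(tags, i):
    """verbal phrase = verb + noun phrase."""
    if i < len(tags) and tags[i] == "V":
        return parse_np(tags, i + 1)
    return None


def is_sentence(text):
    """sentence = noun phrase + verbal phrase, covering the whole input."""
    tags = [POS.get(word) for word in text.lower().split()]
    after_np = parse_np(tags, 0)
    if after_np is None:
        return False
    return parse_vp(tags, after_np) == len(tags)


print(is_sentence("old programs translate long texts"))
print(is_sentence("programs translate texts quickly"))
```

The first sentence is accepted; the second is rejected because the grammar has no rule for adverbs, even though it is an ordinary sentence.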
It is clear that some natural-language sentences can be described with such a grammar, but they are too few for correct analysis and translation of a real text. On the other hand, effective methods exist for constructing a converter on the basis of a specific grammar or, at worst, for writing a program that builds dependency trees for a limited set of sentences by linear search. Such systems, too, were called "experimental".

Though neither approach resulted in commercial systems, the research conducted in this area helped to reveal the complexity of the task and, at least, to identify bottlenecks in similar developments. In any case, these local projects became the platform on which the translation systems now offered to end users were created. Power Translator (Globalink), Language Assistant (MicroTac) and TRANSEND (Intergraph) are among these systems. Systems of the STYLUS and PROMT families are no exception, as many specialists of the PROMT Company took part in similar development projects.

Nevertheless, a revolutionary approach was applied for the first time in the development of PROMT systems, and it led to impressive results. Translation systems of the PROMT family are designed on the basis of cybernetic rather than linguistic methods.
It proved very effective to consider the translation system not as a translator whose task is to translate a text that is admissible from the point of view of the source grammar, but rather as a complex system whose task is to produce a result for arbitrary input data, including texts that are incorrect from the point of view of the grammar the system uses. Instead of the accepted linguistic approach, which assumes sequential processes of sentence analysis and synthesis, the architecture of the system represents translation procedures as an "object-oriented" process founded on a hierarchy of the sentence components to be processed. This made PROMT systems stable and open. In addition, this approach made it possible to apply different formalisms for describing translation at different levels. The systems employ network grammars similar in type to augmented transition networks, as well as algorithms for filling and transforming frame structures for the analysis of complex predicates.

The description of a lexical unit within a dictionary entry, which is practically unlimited in size and can contain many different attributes, is closely tied to the structure of the system algorithms and is organized not around the age-old "syntax versus semantics" antithesis, but around the levels of text components. Thus the systems can work with incompletely described dictionary entries, which is very important for opening dictionaries to users who cannot be regarded as highly experienced specialists in linguistics.
The very first machine translation system released by the PROMT Company, in 1991, could translate specialized texts on computer software from English into Russian. The system used a small dictionary (about 17,000 words and expressions), ran under DOS and had no customization tools. But even this first system was designed correctly, and the technology that the PROMT Company uses today to develop machine translation algorithms has not undergone major modification since. Moreover, the approach found during that phase of development proved very effective for many different languages.

First, let us explain some definitions. As machine translation developed as a part of applied linguistics, classifications of systems also appeared, and the division of translation systems into TRANSFER systems and INTERLINGUA systems was adopted. This division is based on the architecture of the linguistic algorithms. Translation algorithms for TRANSFER systems are built as a composition of three processes: analysis of the input sentence in terms of source language structures, conversion of this structure into a similar target language structure (TRANSFER), and, finally, synthesis of the output sentence from the constructed structure. INTERLINGUA systems assume a priori that a certain structural metalanguage (INTERLINGUA) is available, which can in principle describe any structure of both the source and the target language.
Therefore, the translation algorithm in INTERLINGUA systems is supposed to be simpler: analysis of the input sentence in terms of the metalanguage, followed by synthesis of the corresponding target language sentence from the metastructure. In this case, the "only" difficulty is developing the metalanguage itself and describing the natural language in its terms.

Although this classification exists, and although among machine translation developers it is considered good form to ask which type your system belongs to, no real system based on the INTERLINGUA principle has been developed so far. Our system is no exception, and we answer the question as follows: our system performs translation of the TRANSFER type. But this answer is too simple and does not reflect any specific feature of the PROMT system architecture.

The special feature of the system is that the TRANSFER method is not applied according to the standard linguistic approach. A translation system generally has to operate on incomplete data, because language is a living, fast-evolving system: new words, new functions of old words and, along with new entities, new meanings are constantly emerging. In this situation, the main structural requirement for translation algorithms is stability of the system with respect to arbitrary input data. PROMT translation algorithms are based not on sequential TRANSFER procedures, but on a hierarchical approach that divides the translation process into interconnected TRANSFER procedures for different units of analysis.
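For contrast, the conventional sequential TRANSFER scheme (analysis, then transfer, then synthesis) can be sketched in a few lines. This is a toy under stated assumptions, not a description of any real system: the two-word English-French lexicon, the single reordering rule and all function names are hypothetical.

```python
# Hypothetical bilingual lexicon: English word -> (part of speech, French word).
LEXICON = {"red": ("Adj", "rouge"), "car": ("N", "voiture")}


def analyze(sentence):
    """Analysis: describe the input in terms of source-language structure."""
    return [(LEXICON[word][0], word) for word in sentence.lower().split()]


def transfer(source_structure):
    """Transfer: convert the source structure into a target-language structure."""
    target = [(pos, LEXICON[word][1]) for pos, word in source_structure]
    # Toy structural rule: English Adj + N corresponds to French N + Adj.
    if [pos for pos, _ in target] == ["Adj", "N"]:
        target.reverse()
    return target


def synthesize(target_structure):
    """Synthesis: build the output sentence from the target structure."""
    return " ".join(word for _, word in target_structure)


def translate(sentence):
    return synthesize(transfer(analyze(sentence)))


print(translate("red car"))  # prints "voiture rouge"
```

In this sketch, a single word missing from the lexicon makes `analyze` raise a `KeyError`, and the whole sequence produces nothing. That brittleness on unexpected input is what the stability requirement above is meant to rule out.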
The following levels are distinguished in the system: the lexical unit level, the group level, the simple sentence level and the compound sentence level. The processes at all these levels are interconnected, interact hierarchically according to the hierarchy of text units, and exchange synthesized and inherited attributes. This arrangement makes it possible to use different formal methods for describing the algorithms at different levels.

Now let us look at the lexical unit level. A lexical unit is a word or collocation and is the unit of the lowest level. Each word is described as the composition of a stem and an ending, in both the source and the target language. On the one hand, this makes it possible to recognize the source word and analyze its morphology; on the other hand, it allows convenient synthesis of the target word from the relevant morphological data (the stem, the type of change and the address of the ending in the array of endings of this type). Thus, if rules for converting source morphological data into target morphological data are available, TRANSFER procedures can be carried out at the morphological level.

The group level corresponds to more complex structures: groups of nouns, adjectives, adverbs and complex verbal forms. This level is based on formal network grammars and, during analysis, allows groups to be combined into syntactic units. Each unit is characterized by synthesized structural data and by the main unit of the group.
The target group is created in correspondence with the source structure, which is composed in terms of immediate constituents, and with the synthesized attributes; it is a set of lexical units whose morphological attribute values can be inherited according to the results of group analysis. In this way, the TRANSFER procedures are implemented at the group level.

Simple sentences are treated as structures consisting of syntactic units, and their analysis is based on frame predicate structures, which provide effective conversions. In a simple sentence, the verb is considered the main element, and its valences determine how the corresponding frame is filled. For every type of frame there is a conversion rule for creating the target frame and forming the actants. In this way, the TRANSFER procedures are implemented at the sentence level.

The analysis of compound sentences is required when it is necessary to establish the sequence of tenses and to translate conjunctions correctly.

Conclusion

We hope that this information will help potential users of translation systems to understand that creating a machine translation system is not a simple task but a knowledge-intensive one. Therefore, the number of real, ready-to-operate translation systems that can appear per unit of time is essentially limited.

Svetlana Sokolova, PhD, President and CEO of PROMT (www.e-promt.com)