Experience with language implementations in ATAMIRI
Presented at the Global Symposium
on Promoting the Multilingual Internet
(International Centre of Geneva, 9-11 May 2006)
Iván Guzmán de Rojas
Abstract
ATAMIRI is a non-commercial system that operates on the Web as a truly multilingual machine translator: one program and one lexical and grammatical data base support several languages, each able to serve as either source or target language, with simultaneous translation from any source language into several target languages. The key aspect of this MT technology is its genuinely multilingual property. When an N-th language is implemented, it is immediately related to the other (N-1) languages in the system. Implementation costs are therefore proportional only to N. This is an economically significant difference from other systems that try to meet multilingual demand with multiple programs and dictionaries developed per language pair, and whose implementation costs are therefore proportional to the N(N-1) translation directions in the language set.
This paper describes our operational experience with nine language implementations in ATAMIRI's translator engine: the Latin languages Spanish, French, Portuguese, Italian, Romanian and Catalan, together with English, German and Dutch. The resulting 72 translation directions show various levels of translation quality. Both language engineering and economic aspects are discussed. A project plan outline is suggested so that ATAMIRI technology can be exploited to its full potential.
Multilingual translation demand in the Internet
Although the usefulness of automatic translation on the Internet is now quite well understood and such translation is widely available through various commercial on-line services, there remains a large gap between the small number of languages supported and the demand for multilingual translation. For example, the Google Directory lists 75 different languages in the web pages that it indexes, suggesting the need in principle to support 5,550 (75 x 74) different translation directions, but even the most advanced systems currently support at most 45 language pairs, with English and French as preferred source or target languages.
As an illustration, consider that the market now offers machine translation systems for English and German (the two languages with the greatest presence on the Web) that are capable of handling the six Latin languages present on the Internet; however, many translation directions are still lacking, as shown in the following table:
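The arithmetic behind these figures can be checked with a short sketch. The function names are illustrative only and are not part of ATAMIRI; the sketch simply contrasts the N(N-1) directed pairs of a pair-based approach with the N implementations needed when every language is added once to a shared system.

```python
def pairwise_directions(n: int) -> int:
    """Directed translation pairs among n languages: n * (n - 1)."""
    return n * (n - 1)


def shared_system_implementations(n: int) -> int:
    """Implementations needed when each language is added once to one shared system."""
    return n


for n in (9, 75):
    print(
        f"{n} languages: {pairwise_directions(n)} directions, "
        f"{shared_system_implementations(n)} language implementations"
    )
```

For the nine languages discussed in this paper the first function gives 72 directions, and for the 75 languages of the Google Directory it gives 5,550.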
| Source \ Target | English | German | Spanish | French | Italian | Portuguese | Romanian |
|---|---|---|---|---|---|---|---|
| English | - | YES | YES | YES | YES | YES | NO |
| German | YES | - | YES | YES | NO | NO | NO |
| Spanish | YES | YES | - | YES | NO | YES | NO |
| French | YES | YES | YES | - | NO | NO | NO |
| Italian | YES | NO | NO | NO | - | NO | NO |
| Portuguese | YES | NO | YES | NO | NO | - | NO |
| Romanian | NO | NO | NO | NO | NO | NO | - |
| Catalan | NO | NO | YES | NO | NO | NO | NO |
Rows are source languages, columns are target languages. YES indicates a translation direction offered by at least one translation web service or software supplier. Currently there are only 19 YES cases out of the 56 possibilities, all of which are now supported by ATAMIRI in its experimental operation.
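As a cross-check of the count of YES cases, the table can be encoded and tallied in a few lines. This is only a sketch: the data structure and variable names are assumptions made for illustration, and the cell values are copied from the table as given.

```python
TARGETS = ["English", "German", "Spanish", "French", "Italian", "Portuguese", "Romanian"]

# One string per source language; Y = offered commercially, N = not offered, - = same language.
AVAILABILITY = {
    "English":    "- Y Y Y Y Y N",
    "German":     "Y - Y Y N N N",
    "Spanish":    "Y Y - Y N Y N",
    "French":     "Y Y Y - N N N",
    "Italian":    "Y N N N - N N",
    "Portuguese": "Y N Y N N - N",
    "Romanian":   "N N N N N N -",
    "Catalan":    "N N Y N N N N",
}

covered = []  # (source, target) pairs marked YES
missing = []  # (source, target) pairs marked NO
for source, row in AVAILABILITY.items():
    for target, cell in zip(TARGETS, row.split()):
        if cell == "Y":
            covered.append((source, target))
        elif cell == "N":
            missing.append((source, target))

yes_count = len(covered)
print(f"YES cases: {yes_count}")
print(f"Directions with no commercial offer: {len(missing)}")
```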
The development of an automated translation system using the language pair approach needs enormous financial resources and it demands a considerable R&D effort. For this reason, most of the current developments focus on English.
Translation between Latin languages is not an interesting market for commercial companies; some languages, such as Romanian or Catalan, are not even translated into English. Yet Latin languages share characteristics, such as grammatical similarity, that make MT development for them less costly.
Language implementation methodology
Lexicographic enrichment costs are low, since the data base can be managed centrally via the Internet while new terms are introduced in a decentralized way, from practically any personal computer connected to the Internet. Under the auspices of Unión Latina, terminology has been entered from Paris; in collaboration with "Atlas de la Diversidad", the initial basic Catalan lexicon is being introduced from Barcelona; and some Romanian entries are being introduced by a volunteer from Bucharest.
The lexical coding system allows adding lexemes simultaneously in various languages with lexicographic consistency and ensuring integrity of the data base.
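The idea of one shared lexical data base can be illustrated with a minimal sketch. The actual ATAMIRI coding system is not described here, so everything below is an assumption for illustration: the class name `SharedLexicon`, the use of a concept identifier as the key, and the lookup method are all hypothetical.

```python
from collections import defaultdict


class SharedLexicon:
    """Hypothetical shared lexicon: one record per concept, one lexeme per language."""

    def __init__(self):
        # concept id -> {language code: lexeme}
        self._entries = defaultdict(dict)

    def add(self, concept_id, language, lexeme):
        """Attach a lexeme in one language to an existing or new concept."""
        self._entries[concept_id][language] = lexeme

    def translate(self, lexeme, source, target):
        """Return the target-language lexeme sharing a concept with the source lexeme."""
        for forms in self._entries.values():
            if forms.get(source) == lexeme:
                return forms.get(target)
        return None


lexicon = SharedLexicon()
lexicon.add("HOUSE", "es", "casa")
lexicon.add("HOUSE", "en", "house")
lexicon.add("HOUSE", "fr", "maison")

# Adding a single Catalan entry relates it to every language already present.
lexicon.add("HOUSE", "ca", "casa")
print(lexicon.translate("casa", "ca", "en"))
print(lexicon.translate("maison", "fr", "ca"))
```

In this sketch, one new entry for a language makes it reachable from and to all other languages of the same record, which is the property that keeps the effort proportional to the number of languages rather than to the number of pairs.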
Current number of lexical entries (March 2006)
Spanish 28,106
French 22,574
Italian 15,205
Portuguese 13,660
Romanian 10,109
Catalan 1,564
English 27,387
German 15,836
Dutch 11,466
The lexicographic data base also has entries in other languages that have not yet been implemented in the translator engine or are at a very preliminary implementation stage: Aymara (6,393), Russian (9,774), Swedish (2,639) and Hungarian (2,026).
The lexicographic data base can be viewed at: www.atamiri.cc/arunqera
Word inflection (conjugation of verbs, declension of nouns, adjectives and articles) is handled by ATAMIRI using special morphological tables for each language. For languages with complex inflection rules, such as Romanian, developing complete tables may take two to three months. Contraction rules are also handled by tables.
ATAMIRI uses a matrix language representation in which the syntagmas of a language are contained in multi-level tables. Our experience with nine languages shows that no more than 2,000 syntagmas per language are required to generate well-formed sentences of any kind. Syntagmas are introduced manually as they occur during translations. The experiment of using the same syntagmatic tables for Spanish, French, Portuguese and Italian has proven successful.
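A table-driven treatment of inflection and contraction can be sketched as follows. The format of ATAMIRI's morphological tables is not given in this paper, so the tables and function names below are hypothetical; they cover only regular Spanish -ar verbs in the present indicative and two Spanish contractions.

```python
# Hypothetical ending table: (person, number) -> ending for regular -ar verbs.
PRESENT_AR = {
    (1, "sg"): "o",
    (2, "sg"): "as",
    (3, "sg"): "a",
    (1, "pl"): "amos",
    (2, "pl"): "áis",
    (3, "pl"): "an",
}

# Hypothetical contraction table: pair of adjacent words -> contracted form.
CONTRACTIONS = {("de", "el"): "del", ("a", "el"): "al"}


def conjugate(infinitive, person, number):
    """Conjugate a regular -ar verb by looking up its ending in the table."""
    if not infinitive.endswith("ar"):
        raise ValueError("this sketch covers regular -ar verbs only")
    return infinitive[:-2] + PRESENT_AR[(person, number)]


def apply_contractions(tokens):
    """Replace adjacent word pairs that appear in the contraction table."""
    result = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in CONTRACTIONS:
            result.append(CONTRACTIONS[pair])
            i += 2
        else:
            result.append(tokens[i])
            i += 1
    return result


print(conjugate("hablar", 1, "pl"))
print(apply_contractions(["voy", "a", "el", "mercado"]))
```

Keeping the rules in data tables rather than in program logic means a new language needs new tables but no new code, which is consistent with the single-program design described in this paper.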
An explanation of this matrix language representation can be found in:
New Directions in Machine Translation, Conference Proceedings, Budapest, 18-19 August 1988 (John von Neumann Society for Computing Sciences / Dordrecht / Providence: Foris Publishers).
Any method for evaluating translation quality has to be designed according to the main purpose of the evaluation. We want to follow the progress achieved during an implementation process by verifying how usable the generated translations are. For this simple approach it is enough to consider the intelligibility factor of a translated text, obtained as the average of the factors assigned to each sentence of the text after reading it. This method is explained with examples at the Aynisiwi forum: www.atamiri.cc/aynisiwi in the lexicographic group section.
Please see:
Aynisiwi -> Lexicographical workgroup -> Catalan preliminary implementation
-> L’AMETLLER -> Evaluation of the English translation
Aynisiwi -> Lexicographical workgroup -> Evaluation Results
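The averaging step of this evaluation method can be written as a short sketch. The scale of the per-sentence factors is not stated in this paper; the values between 0 and 1 used below are an assumption for illustration, as is the function name.

```python
def text_intelligibility(sentence_factors):
    """Average the intelligibility factors assigned to each sentence of a text."""
    if not sentence_factors:
        raise ValueError("at least one sentence factor is required")
    return sum(sentence_factors) / len(sentence_factors)


# Hypothetical factors assigned by a reader to three translated sentences.
factors = [1.0, 0.5, 0.75]
print(text_intelligibility(factors))
```

Repeating the same measurement on the same text after each round of lexical or syntagmatic additions gives a simple way to follow progress during an implementation.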
Conclusions
Experience with nine language implementations in ATAMIRI's translator engine shows the following facts:
Project Plan Outline
Worldwide cooperation is required to mobilize the competencies needed to address the multilingualism issue; otherwise the Internet will remain a collection of isolated linguistic islands, missing the opportunity for direct communication within cultural diversity. As the creator of ATAMIRI, I urge leaders of institutions and corporations that promote Language Engineering projects, and government authorities concerned with the problems of Human Language Technologies, to support a thorough ATAMIRI assessment operation to test its multilingual technology and verify its capacity for improving translation quality.
ATAMIRI's R&D is still advancing, albeit too slowly. Its technological potential is waiting to be exploited on a large scale. To achieve this, capital investment is required, under equitable conditions that recognize the value of this unique technology. We propose to: