LANGUAGE ACCESS: AN INFORMATION BASED APPROACH

Akshar Bharati
Vineet Chaitanya
Amba P. Kulkarni
Rajeev Sangal

Language Technologies Research Centre
Indian Institute of Information Technology
Hyderabad
{vineet,amba,sangal}@iiit.net

ABSTRACT

The anusaaraka system (a kind of machine translation system) makes text in one Indian language accessible through another Indian language. The machine presents an image of the source text in a language close to the target language. In the image, some constructions of the source language (which do not have equivalents in the target language) spill over to the output. Some special notation is also devised. Anusaarakas have been built for five language pairs: from Telugu, Kannada, Marathi, Bengali and Punjabi to Hindi. They are available for use through Email servers.

Anusaarakas follow the principles of substitutivity and reversibility of the strings produced. This implies preservation of information while going from a source language to a target language.

For narrow subject areas, specialized modules can be built by putting subject domain knowledge into the system, which produce good quality grammatical output. However, it should be remembered that such modules will work only in narrow areas, and will sometimes go wrong. In such a situation, anusaaraka output will still remain useful.

1. INTRODUCTION

Fully-automatic general purpose high quality machine translation systems (FGH-MT) are extremely difficult to build. In fact, there is no system in the world for any pair of languages which qualifies to be called FGH-MT. The reasons are not far to seek. Translation is a creative process which involves interpretation of the given text by the translator. Translation would also vary depending on the audience and the purpose for which it is meant. Since, at present, the machine cannot interpret a general text with sufficient accuracy automatically, let alone re-express it for a given audience, it fails to perform as FGH-MT.
The main difficulty that the machine faces pertains to dealing with ambiguity. A given text *codes* only a part of the *information*. Ambiguity is resolved (by guessing) using world knowledge, domain specific knowledge, etc., a task which turns out to be very difficult for the machine.

1.1 INFORMATION CODING

To understand the idea that the text expresses only a part of the information, let us consider an example. In Indian languages, which have relatively free word-order, information that relates an action (verb) to its participants (nouns) is primarily expressed by means of post-positions or case endings of nouns (collectively called vibhaktis of the noun). For example, in the following sentence in Hindi:

    rAma ne   roTI  khAI                 (1)
    Ram  erg. bread ate
    Ram ate the bread.

The ergative (erg.) post-position marker ('ne') after 'rAma' indicates that Ram is the *karta* of eat, which here means that Ram is the *agent* of eating. (Note that in English, the primary device for expressing the same information is word order.)

Noun-verb agreement also helps in identifying the karta. For example, in the following sentence:

    rAma    roTI      khAtA HE           (2)
    Ram(m.) bread(f.) eats(m.)
    Ram eats bread.

the masculine (m.) ending of the verb indicates that the karta is masculine, which in this sentence unambiguously means Ram. However, this is not always unambiguous; consider the following sentence:

    chAvala   rAma     khAtA hE          (3)
    rice (m.) Ram (m.) eats (m.)
    Ram eats rice.

in which the agreement does not help in identifying the karta unambiguously, because there are two masculine nouns (Ram and rice), one of which is the karta. Translation to English, say, would be quite different depending on which one is the karta.

In language, there is a tension between brevity and ambiguity. If everything were explicitly stated, the text would be less ambiguous but would be long. Brevity also helps in focusing attention on the relevant parts.
Ambiguity seems to be a necessary price for conciseness and focus.

1.2 FAITHFULNESS vs. NATURALNESS

To build a practical MT system, the load has to be shared between man and machine. A clean way to share the load is for the machine to take up the task of language related processing, and to leave the processing related to background knowledge to the reader. Language related processing consists of analysis of the input source language text such as morphological processing, use of a bilingual dictionary, and any other language related analysis or generation. These are the primary sources of difficulty for the reader. These are also the tasks which are relatively easier for the machine. On the other hand, world knowledge related aspects are left to the reader, who is naturally adept at them.

In translation, two opposing forces are at work: faithfulness and naturalness. The translator must choose between faithfulness to the original text and naturalness to the reader. Most translations that we come across are weighted towards naturalness to the reader. Anusaaraka is at the other extreme: it tries to be as faithful to the original text as possible. In fact, its output must contain all the information in the source language text, and should have no other new information.

There is a problem in coding "exactly" the same information (with 100% fidelity) from one language to another, particularly if we want to generate sentences of about equal length, paralleling the sentence constructions wherever possible. (In this sense, translation is sometimes said to be an impossible task.) FOOTNOTE{This also suggests the incommensurability of information, discussed later in this paper.}

1.3 ANUSAARAKA or LANGUAGE ACCESS

The anusaaraka answer lies in deviating from the target language in a systematic manner whenever necessary. This new language is something like a dialect of the target language.
The anusaaraka output can be said to be the image of the source text, much like what the camera produces. Reading the image of the source text is like reading the original text. It will have the same flavour. Translation, on the other hand, is like a painting. The translator interprets the original in the source language, and "paints" a text in the target language with approximately the same meaning and nuance.

Readers will usually require some learning of the dialect of the target language (discussed in the next section). This learning time will be negligible compared to the learning time of the source language.

Indian languages have relatively free word order: the noun groups can come in any order, followed generally by the verb group. (The order conveys emphasis etc. but not the information about karaka relationships or theta roles.) If we take a sentence in a source language, and substitute the word groups in it by appropriate word groups in the target language, it works well because the languages make similar use of order to convey emphasis etc.

The vibhaktis for the word groups (that is, case endings and post-position markers for nouns, and TAM for the verb groups) must be mapped from the source language to the target language carefully, as they contain important karaka information regarding the verb and the nouns. Again the languages behave in a similar way.

Besides the above, there are similarities in the meanings of words. Many words in the languages have a shared origin (from Sanskrit), and because of shared culture, they usually also share meanings. This implies that for a source language word, the bilingual dictionary provides a unique answer in the target language for a large percentage of words (80% for Kannada to Hindi).

Now, we will discuss some problems that arise because the two languages differ, and see how these problems can be handled. We will take examples from Hindi, Telugu and Kannada, three of the major languages of India.
Hindi is spoken by more than half a billion people, Telugu by about 80 million, and Kannada by about 50 million people. Telugu and Kannada differ from Hindi, as they are Dravidian languages, and are further apart from Hindi compared to other Indian languages. Even then, except for agreement, there are only three major syntactic differences between Hindi and Kannada. Surprisingly, all of these can be taken care of by enriching Hindi with a few additional functional particles or suffixes as shown below. Thus, they can be viewed as lexical gaps or function word gaps.

2. ANUSAARAKA - LANGUAGE BRIDGES

We now take up some important constructions in the south Indian languages, which differ from Hindi, and show how they have been bridged in anusaaraka.

2.1 COMP ("ki") CONSTRUCTION

In case of embedded sentences in Hindi, the subordinate sentence is put after the main verb, unlike in Kannada. For example (where label H is for Hindi and !E is for English gloss):

H:  rAma ne   kahA ki   mEM ghara ko   jAUMgA.     (4)
!E: Ram  erg. said that I   home  acc. will_go
E:  (Ram said that he will go home.)

There is a construction in Kannada using 'eneMdare' or 'that' which is similar, but is seldom used. Kannada uses another construction, for which the anusaaraka Hindi is given below.

K:  mohana nALe     baruvanu eMdu rAma heLidanu.   (5)
@H: mohana kala     AyegA    EsA  rAma kahA.
!E: Mohana tomorrow come-fut this Rama said.

Although the 'EsA' construction is a proper construction in Hindi, it is seldom used. In the dialect of Hindi produced by anusaaraka from south Indian languages, however, this will be the normal construction used.

2.2 RELATIVE-CLAUSE ("jo") CONSTRUCTION

In this section, we will discuss how anusaaraka handles participle verbs (behaving as adjectives) in Telugu to produce the same information in Hindi. The solution works for all south Indian languages which display this phenomenon.
We will first try to arrive at the information contained in TAM labels which stand for adjectival participle, in a mathematically precise way. Let us take the following Telugu example sentence:

T:  rAmuDu tinina camacA veVMDidi.      (6)
    ------ --- -- ------ --------
      1    2a  2b   3       4
!E: Ram *eaten spoon of-silver
E:  The spoon with which Ram ate is of silver.

(* 'eaten' is only an approximation; 'tinina' is a past-participle form of 'tina' or 'eat'.)

We are interested in finding the meaning of the TAM label or suffix 'ina' in 'tinina' above. Let us name it 2b; the rest of the words are also named for easy reference. If a Telugu-Hindi bilingual person is asked to translate the sentence, he is likely to write down the following in Hindi:

H:  rAma ne jisa cammaca se khAyA, vaHa cAMdI kA HE.
    ---- ++      -------    ---         -------- ++
     1              3       2a             4
!E: Ram erg. which spoon instr. ate, that silver_of is
E:  The spoon with which Ram ate is of silver.

Here the Hindi words are marked corresponding to the Telugu words (other than 2b, whose value we want to find out). '++' is used to denote words that have been put in by the translator but which are not there in the original Telugu sentence. 'ne' corresponds to the ergative marker, which is an idiosyncrasy of Hindi. Also it is known that 'HE' at the end (copula) is mandatory in the Hindi sentence but is absent in the given Telugu sentence.

We can rephrase the sentence in Hindi to get the words in the same order:

H:  rAma ne jisa se khAyA HE vaHa cammaca cAMdI kA HE.
    ---- ++         ---           ------- -------- ++
     1              2a               3       4

or better still, we may rewrite the above as:

H:  rAma ne khAyA HE jisa se vaHa cammaca cAMdI kA HE.   (7)
    ---- ++ ---                   ------- -------- ++
     1      2a                       3       4
!E: Ram erg. eaten has which instr. that spoon silver_of is

wherein the order of the words including the parts of words (2a and 2b) is exactly the same as the order in the original sentence. Now the part which remains unassigned stands for 2b.
Therefore, we get the equation:

    ina = yA_HE_jisa_se_vaHa
          -en_is_which_instr_that        (English gloss)
          has_VERB_en_with_which_that    (English explanation)

But a closer scrutiny reveals an assumption: the 'se' or instrumental marker is not there in the Telugu sentence. For example, consider the following sentence:

T:  rAmuDu tinina pleTu veVMDidi        (8)
    ------ --- -- ----- --------
      1    2a  2b   3      4
!E: Ram eaten plate silver-of
E:  The plate in which Ram ate is of silver.

Its equivalent Hindi sentence is:

H:  rAma ne khAyA HE jisa meM vaHa pleTa cAMdI kI HE.   (9)
    ---- ++ ---                    ----- -------- ++
     1      2a                       3      4

The above sentence yields the following equality:

    ina = yA_HE_jisa_meM_vaHa
          -en_is_which_loc_that          (English gloss)
          has_VERB_en_in_which_that      (English explanation)

The two different equalities for 'ina', and similar other examples, lead us to conclude that the 'se' or 'meM' markers are not there in the 'ina' but are supplied by the reader based on world knowledge. Therefore, the equality becomes:

    ina = yA_HE_jo_*_vaHa

where '*' stands for an unspecified post-position to be supplied based on context. After further refinement (not discussed here), it becomes:

    ina = yA_[HE/tHA]_jo_*_vaHa-
          -en_[is/was]_which_*_that      (English gloss)

The claim is that the above is a mathematically precise equivalence between the 'ina' Telugu TAM and anusaaraka Hindi.

The above can be restated as follows: it shows the equivalence between the adjectival participle in Telugu and the relative clause in Hindi, which has been known, but which the above equation makes precise.

Although Hindi also has participial phrases, it has only two TAMs: yA and tA_HuA (with perfective and continuous aspects, respectively).

H:  khAyA HuA phala       (10)
    eaten fruit
H:  khAtA HuA hiraNa      (11)
    eating deer

As a result, these are not sufficient to capture other TAMs which might occur in Telugu. (There are syntactic holes in participles in Hindi.) There is another problem, too, as we have seen.
The two participial phrases in Hindi have coding for karaka relations (theta roles) which is absent in Telugu. TAM 'tA_HuA' codes karta karaka (roughly agent), and sentence (11) refers to the deer who is eating (not the one who is being eaten). Similarly, 'yA' codes karma as in sentence (10) (the fruit being eaten, and not the fruit who is eating). FOOTNOTE{More correctly, 'yA' codes karma in case of sakarmaka or transitive verbs, and karta in case of intransitive verbs.} Thus, Hindi is poorer than Telugu in coding tense, aspect, modality information, while richer in coding karaka information. But this creates another difficulty for anusaaraka. Using these constructions in Hindi would mean putting in something that is not contained in the source language sentence, and the information equivalence would be lost.

Unlike the 'ki' construction (Section 2.1), this idea takes some time and effort for the Hindi reader to get used to.

Another construction, not discussed here for want of space, is the "ne" construction or ergative marker, which is a peculiarity of only the Western belt languages in India. Therefore, while building the anusaaraka from a south Indian language to Hindi, such a construction would not occur in the output.

2.3 PRE-EDITING AND POST-EDITING

The anusaaraka system has been designed so that the combination of man and machine together can perform translations, etc. The user can help in pre-editing the input and post-editing the output.

In the pre-editing task, the input text is corrected and edited by the user: words spelt with non-standard spellings are changed to their standard spellings, external sandhi between words is broken (unless it changes meaning), etc. This is an important task for Indian languages because of lack of standardization and consequent variation.

Similarly, post-editing can be carried out on the output produced by the machine. There are three levels of post-editing.
The first level of post-editing seeks to make the output grammatically correct. The emphasis is on speed and low cost. In the second level of post-editing, the raw output is corrected not only grammatically but also stylistically. For example, the 'EsA' construction would be changed to 'ki' (see Section 2.1). In the third level of post-editing, the post-editor might change the setting and the events in the story to convey the same meaning to a reader who has a different cultural and social milieu. This is really trans-creation, and a creative post-editor (who can even be mono-lingual) can go all the way up to this level.

3. ANUSAARAKA PROCESSING

Anusaaraka processing could also be viewed as a series of *information preserving* transformations that bring the source language close to the target language. Information preserving transformations follow two properties: A. substitutivity, and B. reversibility.

3.1 SUBSTITUTIVITY

This basically takes care of one-to-many mapping. When a word or a phrase has two equivalent meanings, both alternatives are put in the output, unless one is ruled out by local word grouping. For example, the Hindi word 'khAtA' can be replaced with two possible English words:

    khAtA -> eats/ledger

If the Hindi morphological analyser replaces 'khAtA' with 'eats' because it is the more frequent usage, then substitutivity will be violated in the following sentence:

H:  rAma ne bEnka meM apanA khAtA kholA.   (12)
!E: Ram erg. bank in his ACCOUNT opened.
E:  Ram opened his account in the bank.

The basic idea is that all possible substitutions must be exhaustively enumerated. (Original ambiguities must be carried over.) A substitution rule can use context, but the rule should be universal, i.e., it should work in all possible contexts. (No guessing.)

Non-trivial example: participles in south Indian languages, already discussed in detail in Section 2.2:

    _ina -> _yA_[hE/thA]_jo_*_vaha-   (Telugu to Hindi)

3.2 REVERSIBILITY

The transformation should also be reversible.
It should be possible to go back from the transformed string to the initial string. This takes care of many-to-one mapping, and the basic idea behind this principle is that information should not be thrown away. We illustrate it by an example from Telugu to Hindi:

    A/adi/vADu/AmeV -> vaHa
                       he/she/that (English)

A single 'vaHa' for all four might seem natural and appealing. In fact, most MT systems in such a setting would be happy that they did not have to choose among alternatives, and would simply use 'vaHa'.

However, throwing away information is NOT good. For example, the sequence 'vaHa ghara' might stand for a single phrase, or for two noun phrases:

vaHa ghara --> that house
H:  vaHa ghara acchA HE.       (13)
!E: That house good  is
E:  (That house is good.)

vaHa ghara gayA --> He went home.
H:  vaHa ghara gayA.           (14)
!E: He   home  went.
E:  (He went home.)

While in the above examples the difference in meaning is readily apparent, it might not always be so. On the other hand, there is a different source word in Telugu for the two cases above (A and adi etc.). In fact, they would be shown differently, as below:

    A    -> vaHa-                        that (demonstrative pronoun)
    adi  -> vaHa{non-masculine}`         she/it/they
    vADu -> vaHa{masculine,singular}`    he
    AmeV -> vaHa{fem.,singular}`         she

3.3 METHOD

Several levels of analysis and substitution are carried out in anusaaraka. They are at the levels given below:

- morpheme level
- word level
- word group level
- sentence level analysis

For want of space, the processing per se is not discussed any further.

4. CONCLUSION

We have discussed the anusaaraka approach to building language access systems. It allows rapid development of systems by separating the analysis based on language from that requiring world knowledge. It takes the view that language encodes information, and that the information can be extracted and re-expressed in the target language by enhancing it with additional notation. It tries to preserve information in the transfer.
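The two information-preserving properties of Sections 3.1 and 3.2 can be made concrete with a small sketch. It is not the anusaaraka implementation: the dictionary layout and the function names (`substitute`, `invert`, `transform`) are assumptions made for illustration, the lexicon entries are limited to the paper's own examples, and the tag notation is simplified. The sketch emits all alternatives for a one-to-many word, and it checks that a many-to-one mapping can be undone only when the targets are kept distinct.

```python
from typing import Dict, List

# Toy lexicons built only from the paper's own examples; the data layout and
# all function names here are hypothetical, not the anusaaraka implementation.

# Section 3.1: one-to-many mapping (Hindi word -> English alternatives).
ALTERNATIVES: Dict[str, List[str]] = {"khAtA": ["eats", "ledger"]}

# Section 3.2: Telugu pronouns -> tagged Hindi 'vaHa' (notation simplified).
TAGGED: Dict[str, str] = {
    "A": "vaHa-",
    "adi": "vaHa{non-masculine}",
    "vADu": "vaHa{masculine,singular}",
    "AmeV": "vaHa{fem.,singular}",
}


def substitute(word: str) -> str:
    """Substitutivity: emit every alternative, never a guessed single one."""
    return "/".join(ALTERNATIVES.get(word, [word]))


def invert(mapping: Dict[str, str]) -> Dict[str, str]:
    """Build the reverse mapping, failing if two sources share one target."""
    inverse: Dict[str, str] = {}
    for source, target in mapping.items():
        if target in inverse:
            raise ValueError(
                f"not reversible: {inverse[target]!r} and {source!r} "
                f"both map to {target!r}"
            )
        inverse[target] = source
    return inverse


def transform(tokens: List[str], mapping: Dict[str, str]) -> List[str]:
    """Replace each token found in the mapping; pass other tokens through."""
    return [mapping.get(token, token) for token in tokens]


if __name__ == "__main__":
    # Both readings of 'khAtA' are carried over to the output.
    print(substitute("khAtA"))

    # Tagged targets are distinct, so the source string can be recovered.
    source = ["A", "adi", "vADu", "AmeV"]
    image = transform(source, TAGGED)
    restored = transform(image, invert(TAGGED))
    print(image, restored == source)

    # Collapsing all four pronouns to a bare 'vaHa' discards information.
    lossy = {word: "vaHa" for word in TAGGED}
    try:
        invert(lossy)
    except ValueError as error:
        print(error)
```

In this sketch the reversibility check is done once, when the inverse table is built, so a lossy mapping is rejected before any text is processed. A real system would work on morphological analyses and word groups rather than on bare tokens.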
The user, after some training, learns to read and understand the text in this "new dialect" of the target language. The output can also be post-edited by a trained user to make it grammatically correct and stylistically better.

The anusaaraka approach has been successfully used in building systems between five pairs of Indian languages. Work is going on in building an English to Hindi anusaaraka system, which will be a test of building a system between two languages which are far apart.

Anusaarakas follow the principles of substitutivity and reversibility of the strings produced, which is nothing but preservation of information while going from a source language to a target language.

[Because of severe shortage of space, information dynamics has been taken out of this paper. Another attempt will be made to somehow fit the main results of information dynamics with a short introduction, in the final paper, if accepted.]

For narrow subject areas, specialized modules can be built by putting subject domain knowledge into the system, which produce good quality grammatical output. However, it should be remembered that such modules will work only in narrow areas, and will sometimes go wrong. In such a situation, anusaaraka output will still remain useful.

As part of future work, work is underway on building an English to Hindi anusaaraka. It will be a further test of the principles and ideas presented here, because they will get applied to two languages which are very different.

ACKNOWLEDGEMENT

Anusaarakas among Indian languages were built with funding from the Ministry of Information Technology, under their program for Technology Development for Indian Languages (TDIL) during 1991-1998. The work was done when the authors were at I.I.T. Kanpur. Currently Satyam Computers Pvt. Ltd. is supporting the authors and the activity for building anusaaraka from English to Hindi.
The system so developed will also be available (like the earlier anusaarakas) as "free" open-source software under GPL.

GLOSSARY

acc.          - Accusative marker
erg.          - Ergative marker
karaka role   - Relation between verb and its arguments (approximately like theta role)
karta karaka  - approx. agent role
TAM           - Tense Aspect and Modality
@H            - Label to indicate that the ensuing sentence is the anusaaraka output (result of information preserving)
!E            - English gloss

REFERENCES

Natural Language Processing: A Paninian Perspective, Akshar Bharati, Vineet Chaitanya, Rajeev Sangal, Prentice-Hall of India, 1995.

Anusaraka: A Device to Overcome the Language Barrier, V.N. Narayana, Ph.D. thesis, Dept. of CSE, I.I.T. Kanpur, 1994.