LERIL : Collaborative Effort for Creating Lexical Resources

Akshar Bharati, Dipti M Sharma, Vineet Chaitanya, Amba P Kulkarni, Rajeev Sangal
Language Technologies Research Centre
International Institute of Information Technology Hyderabad
{dipti,vc,amba,sangal}@iiit.net

Durgesh D Rao
National Centre for Software Technology, Mumbai
durgesh@ncst.ernet.in

[To appear in Proceedings of Workshop on Language Resources in Asia, along with NLPRS-2001, Tokyo, 27-30 November 2001]

Abstract

The paper reports on efforts taken to create lexical resources pertaining to Indian languages, using the collaborative model. The lexical resources being developed are:
(1) Transfer lexicon and grammar from English to several Indian languages.
(2) Dependency tree bank of annotated corpora for several Indian languages. The dependency trees are based on the Paninian model.
(3) Bilingual dictionary of 'core meanings'.

1. Introduction

Non-availability of lexical resources in electronic form is a major bottleneck for anyone working in the field of NLP on Indian languages. It was decided to take some measures which would remove this bottleneck in a quick and efficient way. As a first step in this direction, a collaborative effort was undertaken to develop a bilingual electronic dictionary in the open source model. The interesting aspect of this effort was that the work was carried out by school children, teachers, housewives, and retired people, among others. People in about 8 cities were involved in the exercise. The school teachers participated, to some extent, in correcting and refining the work. This was later edited by a small core team of two.

The development of the dictionary resource took advantage of the bilingual ability of the contributors. The contributors provided the basic data:
a) A number of Hindi equivalents required to cover various senses of the English lexical item in different contexts.
b) An English example sentence for every Hindi equivalent.
The developed resource is now available as an "open resource" under the General Public License (GPL, 1991).

It might appear difficult to create a major resource like a dictionary in this way, with a diverse set of people working on it. Admittedly, there are variations in quality at present, but the coverage is already quite exhaustive. A number of factors made this possible:

1. The contributors were advised to consult various mono- and bi-lingual dictionaries. Many contributors, including students working in a classroom setting with a local teacher, consulted monolingual advanced learner's dictionaries for English. (However, they did not copy the entries, which in any case were in English alone. Instead, they supplied Hindi equivalents for the available detailed differentiation of English senses, wherever the Hindi equivalents were different, or represented different meanings in their judgement.)
2. The initial information that was to be incorporated in the dictionary was kept to a minimum so that anyone who is sufficiently bilingual could participate in the activity. This is why even school children could contribute to the effort.
3. Some amount of editing was carried out by a small central team. (However, in future we would like this also to be carried out in a distributed way, perhaps by a few tens or hundreds of better trained people selected from the hundreds who participated in the initial exercise. Only the final output would be corrected by the small centralized team.)

Modern technology permits the incremental improvement and enhancement of the basic resource over a period of time. This was a basic consideration in embarking on such an exercise.

This effort has led to the rapid creation of the present dictionary (Shabdaanjali, 2000), which is available as an open resource under the General Public License (GPL, 1991). The dictionary consists of more than 25000 headwords, with fairly detailed differentiation of senses.
Here is an example entry from Shabdaanjali, which gives the senses as well as example sentences illustrating the senses:

"go","V",
--"1.jAnA"
I go to school.
--"2.rakhA~jAnA"
These clothes go into that suitcase.
--"3.samAnA[i phala/k2->j kATakara/kr:j->i
- Ram_erg. fruit having-cut
pAnI/k2->i piyA::v:i
water drank

This linear notation is powerful enough to represent arbitrary dependency trees. Several defaults are used to reduce typing, as shown below.

- rAma_ne/k1->i phala/k2 kATakara/kr
- Ram_erg. fruit having-cut
pAnI/k2 piyA::v:i
water drank

For example, the word 'phala', marked by the k2 karaka, by default attaches to the nearest verb 'kATakara'.

This work requires a higher level of expertise, and is planned to be done in a collaborative model by Sanskritists and other linguists trained in Paninian analysis. It is expected to draw people trained in the traditional Sanskrit shastras, particularly vyakarana shastra (grammar).

To maintain consistency, the same tagsets are provided to all the contributors. Also, most of the annotators come from a strong tradition of Sanskrit grammar, which provides principled analysis of syntactico-semantic issues, thus resulting in consistent output.

5. Task 3: Shabda-Sutra - Bilingual dictionary of core meaning

The third task involves semantic analysis of words across languages. Polysemy is a major problem that one has to deal with while building bilingual lexical resources for machine translation. The concept of 'Shabdasutra' is an attempt to capture the underlying thread which relates the various meanings of a polysemous word. The term 'sutra' in 'Shabdasutra' is used at two levels.

A. At the first level the term 'Shabdasutra' means 'a formula' which encodes the basic semantic concept of a word and how it gets extended to varying usages. For example:
The English word 'issue' has several meanings, as available from Shabdaanjali. The Shabdasutra or formula for the English word 'issue' is

viSaya[~~ < niSpAdana]

or its rough gloss:

topic[~~ < to come into existence]

The notational symbol '<' means 'is derived from' and the symbol '~' means that the sense has taken several turns in its evolution. Thus, the above notation says that the meaning of 'issue' is 'topic', which has arisen from 'to-come-into-existence' after taking many turns in its evolution. This 'sutra' is a formula which expresses that 'niSpAdana' appears to be the basic sense or the 'core' meaning of the English word 'issue'. From this 'core sense' various other meanings have evolved.

B. The second sense in which the term is used is that of an 'underlying thread' which connects all the senses to which the meaning of a particular word gets extended. To continue the example of 'issue', the formula given above has the following underlying thread:

niSpAdana (astitwa meM lAnA/AnA) --> niSpatti kA srota --> niSpatti (santAna, sansakaraNa etc)

which means:

'bring into existence' --> 'point of origin' --> 'the thing that comes into existence' (child, edition etc).

The relation between the various senses of the word 'issue' can be seen through this 'sutra' with the help of the following examples from English:

niSpAdana  e.g. "issue orders"
--> niSpatti kA srota  e.g. "point of issue of a river"
--> niSpatti  e.g. "has no issue after marriage", "latest issue is out"

The way the 'underlying thread' is compressed notationally into a 'sutra' (formula) can vary depending on the complexity of the sense it is encoding.

Following are the steps in this task:
- Begin with a bilingual dictionary of English to Indian languages which contains different senses, and example sentences for each sense
- Identify commonality of meanings for a word
- Come up with the core meaning or word-thread or shabda-sutra

This is an intricate task, and has been completed by a group of dedicated researchers for 5000 words.
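
To make the sutra notation of this section concrete, here is a minimal Python sketch that splits a formula such as viSaya[~~ < niSpAdana] into its parts. It is an illustration only, not part of the LERIL tooling: the names Sutra and parse_sutra are hypothetical, the pattern covers only the single-step form shown for 'issue', and recording the number of '~' symbols is an assumption, since the text says only that '~' marks turns in the evolution of the sense.

```python
import re
from dataclasses import dataclass


@dataclass
class Sutra:
    """One parsed formula of the form  surface[~~ < core]  (hypothetical type)."""
    surface: str  # the sense written before the bracket, e.g. 'viSaya'
    core: str     # the sense it 'is derived from', e.g. 'niSpAdana'
    turns: int    # how many '~' symbols were written (interpretation assumed)


# surface [ optional tildes  <  core ]
_SUTRA = re.compile(
    r"^\s*(?P<surface>[^\[\]]+?)\s*"
    r"\[\s*(?P<tildes>~*)\s*<\s*(?P<core>[^\[\]]+?)\s*\]\s*$"
)


def parse_sutra(text: str) -> Sutra:
    """Parse a single-step sutra; raise ValueError for any other shape."""
    m = _SUTRA.match(text)
    if m is None:
        raise ValueError(f"not a single-step sutra: {text!r}")
    return Sutra(m["surface"], m["core"], len(m["tildes"]))


if __name__ == "__main__":
    print(parse_sutra("viSaya[~~ < niSpAdana]"))
    print(parse_sutra("topic[~~ < to come into existence]"))
```

Longer threads with several '-->' steps, as in sense B above, would need a richer representation than this single pair of senses.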
For all the above tasks, a basic list of 5000 words based on high frequency is being used. The initial target is to complete 5000 high frequency words for all the Indian languages. In case some group wants to go further and work on a larger dictionary, they can cover the whole dictionary (with about 25000 headwords). The target is to complete the first phase of work (5000 words) in several Indian languages by the end of November, 2001.

6. Policy for Distribution

The resources so developed would be available to people at no cost or low cost. These are like infrastructure, which everyone uses but finds difficult to pay for. Most importantly, the above resources would be "open source" under GPL. This is to allow others to work on the resource, modify or refine it, and then redistribute it.

7. Conclusions

This paper reports on some efforts which have created or are creating lexical resources pertaining to Indian languages, using the voluntary collaborative model. One of the novel ideas was to involve several hundred school children spread over several cities, to yield a detailed bilingual dictionary, which is now not only available for consultation by the general public, but is also being used as a stepping stone for building several other kinds of lexical resources, namely, (1) Transfer lexicon and grammar, (2) Annotation of corpora and (3) Bilingual dictionary of core senses. These resources are being developed with machine translation and information retrieval in mind. The lexical resources so produced will be distributed as "open" or "free" resources under GPL.

Acknowledgements

The frameworks for TransLexGram and AnnCorra are the result of discussions with several people such as Prof. Aravind K Joshi, Dr. B. Srinivas, Dr. K.V. Rama Krishnamacharyulu, Dr. Thakur Dass and Dr. V.P. Jain, among others. Many from LRNLP-2001 contributed to the framework through discussions. From that Workshop resulted the LERIL effort: Lexical Resources for Indian Languages.

References

1. Bharati, Akshar, Vineet Chaitanya, Rajeev Sangal, Natural Language Processing: A Paninian Perspective, Prentice-Hall of India, 1995.
2. Sharma, Dipti M, Building Lexical Resources, in Proc. of Symposium in Information Revolution in Indian Languages, Osmania University, 13-15 November 1999.
3. Bharati, Akshar, Dipti M Sharma, Rajeev Sangal, TransLexGram : An Introduction, Technical Report no: TR-LTRC-011, LTRC, IIIT Hyderabad, Jan 2001.
4. Bharati, Akshar, Dipti M Sharma, Rajeev Sangal, TransLexGram : Guidelines for Verb Frames, Technical Report no: TR-LTRC-013, LTRC, IIIT Hyderabad, Jan 2001.
5. Bharati, Akshar, Dipti M Sharma, Rajeev Sangal, AnnCorra : An Introduction, Technical Report no: TR-LTRC-014, LTRC, IIIT Hyderabad, Mar 2001.
6. LRNLP-2001: Workshop on Lexical Resources for Natural Language Processing for Indian Languages, Hyderabad, January 2001. (lr_egroup@iiit.net)
7. Shabdaanjali: English - Hindi e-Dictionary ver.0.2, 2000, http://www.iiit.net (click on 'Resources').
8. GPL, GNU General Public License, 1991, http://www.gnu.org/licenses/licenses.html

Appendix-I: 'FIELD NAMES' Provided in Task 1 (TransLexGram)

HEADWORD      - The lexical item for which the entry is being made
MEANING       - Indian language equivalent for the headword
ENG_EXP       - Example sentence in English
TR_NAT        - Natural translation
TR_ENG-INFLNC - Translation having English influence
FRAME_E       - Frame for the English sentence
FRAME_I       - Frame for the Indian language translation
ERR           - Error (this column is for human use)
COMNT         - Comment (this column is for human use)

Appendix-II: Tagsets

The tagsets used here have been divided into two categories:
1) TAGSET-1 - Tags which express relationships are marked by a preceding '/'. For example, karakas are grammatical relationships, thus they are marked '/k1', '/k2', '/k3' etc.
2) TAGSET-2 - Tags expressing the type of node are marked by a preceding '::'. Verbs etc. are nodes, so they will be marked '::v'.

Some example tags:

TAGSET-1 (expressing relationship labels), marked '/'

s  : Sentence
     Example - [rAma ne khIra khAyI]
               [rAma postp milk-rice ate_fem]
k1 : karta
     Example - [rAma_ne/k1 khIra khAyI]
k2 : karma
     Example - [rAma_ne khIra/k2 khAyI]
k3 : karana
     Example - [rAma_ne cammaca_se/k3 khIra khAyI]
               [rAma_postp spoon_with/k3 milk-rice ate_f]

TAGSET-2 (for nodes), marked '::'

v  : Verb
Kr : Gerund
vH : Verb-BE
     Example - rAma adhyApaka HE
               Ram teacher is
yo : Conjunct

The total number of tags is around 35. Since the task is ongoing, this may be revised if needed.
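
As an illustration of how the '/' and '::' markers might be processed mechanically, the following Python sketch reads the defaults-based example line from Task 2 and fills in the missing attachments. It is a sketch under stated assumptions, not the AnnCorra implementation. The token grammar (word, optional '/relation', optional '::node-type', optional ':name', optional '->target') is inferred from the examples in this paper. Reading "nearest verb" as "nearest following verbal token" and treating the 'kr' tag as verbal are both assumptions. The names Node, parse_tokens and resolve_heads are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# word [/relation] [::node-type] [:name] [->target-name]   (grammar inferred)
TOKEN = re.compile(
    r"^(?P<word>[^/:>\s]+)"
    r"(?:/(?P<rel>\w+))?"
    r"(?:::(?P<node>\w+))?"
    r"(?::(?P<label>\w+))?"
    r"(?:->(?P<head>\w+))?$"
)


@dataclass
class Node:
    word: str
    rel: Optional[str] = None         # TAGSET-1 tag, e.g. 'k1', 'k2'
    node_type: Optional[str] = None   # TAGSET-2 tag, e.g. 'v'
    label: Optional[str] = None       # name other tokens can point to, e.g. 'i'
    head_label: Optional[str] = None  # explicit '->' target, if written
    head: Optional[int] = None        # index of the head token, once resolved


def parse_tokens(line: str) -> List[Node]:
    """Split an annotated line on whitespace and parse each token."""
    nodes = []
    for tok in line.split():
        m = TOKEN.match(tok)
        if m is None:
            raise ValueError(f"cannot parse token: {tok!r}")
        nodes.append(Node(m["word"], m["rel"], m["node"], m["label"], m["head"]))
    return nodes


def is_verbal(node: Node) -> bool:
    # Assumption: '::v' and '::vH' nodes, and tokens tagged 'kr', act as verbs.
    return node.node_type in {"v", "vH"} or node.rel == "kr"


def resolve_heads(nodes: List[Node]) -> List[Node]:
    """Use explicit '->' targets; otherwise attach to the next verbal token."""
    by_label = {n.label: i for i, n in enumerate(nodes) if n.label}
    for i, n in enumerate(nodes):
        if n.head_label is not None:
            n.head = by_label[n.head_label]
        elif n.rel is not None:
            n.head = next(
                (j for j in range(i + 1, len(nodes)) if is_verbal(nodes[j])),
                None,
            )
    return nodes


if __name__ == "__main__":
    line = "rAma_ne/k1->i phala/k2 kATakara/kr pAnI/k2 piyA::v:i"
    for n in resolve_heads(parse_tokens(line)):
        target = "ROOT" if n.head is None else str(n.head)
        print(f"{n.word:10s} rel={n.rel} type={n.node_type} head={target}")
```

Under these assumptions 'phala' attaches to 'kATakara', while 'rAma_ne', 'kATakara' and 'pAnI' attach to 'piyA', which is left without a head as the root of the tree.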