IndoWordNet

IndoWordNet[1] is a linked lexical knowledge base of wordnets of 18 scheduled languages of India, viz., Assamese, Bangla, Bodo, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Meitei (Manipuri), Marathi, Nepali, Odia, Punjabi, Sanskrit, Tamil, Telugu and Urdu.

Background

In the early 1990s, the wordnet for English, called Princeton WordNet, was created at Princeton University by George Miller and Christiane Fellbaum, who went on to receive the prestigious Zampolli Prize in 2006.[2] It was followed by EuroWordNet, a conglomeration of wordnets for European languages, which was created in 1998.[3] Wordnets are now essential resources for Natural Language Processing, Information Extraction, Word Sense Disambiguation and other computations involving text.

Importance of Indian languages

Indian languages form a significant part of the world's linguistic landscape. Four language families are present in the Indian subcontinent: Indo-European, Dravidian, Tibeto-Burman and Austro-Asiatic.[4] Several Indian languages rank among the most widely spoken in the world in terms of the population speaking them, e.g., Hindi-Urdu 5th, Bangla 7th and Marathi 12th, as per the List of languages by number of native speakers. Creating wordnets of Indian languages is therefore an important techno-scientific and linguistic project.

Genesis of Indian language wordnets

Such a project took off in 2000, with the Hindi WordNet being created by the Natural Language Processing group at the Center for Indian Language Technology (CFILT) in the Computer Science and Engineering Department at IIT Bombay.[5] It was made publicly available in 2006 under the GNU license. The Hindi WordNet was created with support from the TDIL project of the Ministry of Communication and Information Technology, India, and partially from the Ministry of Human Resources Development, India.

Wordnets of other languages of India then followed. The large nationwide project of building Indian language wordnets was called the IndoWordNet project; IndoWordNet[1] links the wordnets of the 18 scheduled languages listed above. These wordnets are being created using the expansion approach from the Hindi WordNet, that is, each language's synsets are built by starting from the corresponding Hindi synsets. The Hindi WordNet itself was created from first principles (described below) and was the first wordnet for an Indian language. The method adopted was the same as that used for the Princeton WordNet for English.

Polish WordNet is being mapped to Princeton WordNet based on the strategy followed by IndoWordNet.[6]

Principles of wordnet construction

The wordnets follow the principles of minimality, coverage and replaceability for the synsets. Minimality means that the synset should contain at least a 'core' set of lexemes that uniquely identify the concept it represents, e.g., {house, family} standing for the concept of 'family' ("she is from a noble house"). Coverage means that the synset should include all the words in the language that represent the concept, e.g., the word 'menage' has to appear in the 'family' synset, although towards the end of the synset, since its usage is rare. Finally, replaceability means that the words towards the beginning of the synset should be able to replace one another in a reasonable number of corpus contexts, e.g., 'house' and 'family' can replace each other in the sentence "she is from a noble house".
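The three principles can be illustrated with a minimal Python sketch. It is not how IndoWordNet stores synsets: the `Synset` class, the `core_size` field and the `substitutions` helper are hypothetical names introduced here, and the sketch assumes that the 'core' is simply the first few words of an ordered list.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Synset:
    """Hypothetical synset model: words are ordered, most common first."""
    concept: str
    words: List[str]
    core_size: int = 2  # assumption: the leading words form the 'core'

    def core(self) -> List[str]:
        # Minimality: the smallest leading set that identifies the concept.
        return self.words[: self.core_size]

    def covers(self, word: str) -> bool:
        # Coverage: every word expressing the concept belongs to the synset,
        # including rare ones placed towards the end.
        return word in self.words

    def substitutions(self, sentence: str, word: str) -> List[str]:
        # Replaceability: core words can stand in for one another.
        if word not in self.core():
            return []
        pattern = re.compile(r"\b" + re.escape(word) + r"\b")
        return [
            pattern.sub(other, sentence)
            for other in self.core()
            if other != word
        ]


# Example synset taken from the text above.
family = Synset(concept="family", words=["house", "family", "menage"])
print(family.core())
print(family.covers("menage"))
print(family.substitutions("she is from a noble house", "house"))
```

Here 'menage' is covered by the synset but lies outside the core, so the sketch does not offer it as a substitute, mirroring its rare usage.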

Statistics of Indian language wordnets

The number of synsets in each language (as of August 2014) and the institutes creating the language wordnets are as follows:

  No.  Language   Synsets  Institute
  1.   Assamese   14958    Guwahati University, Guwahati, Assam
  2.   Bengali    36346    Indian Statistical Institute, Kolkata, West Bengal
  3.   Bodo       15785    Guwahati University, Guwahati, Assam
  4.   Gujarati   35599    Dharamsinh Desai University, Nadiad, Gujarat
  5.   Hindi      38607    IIT Bombay, Mumbai, Maharashtra
  6.   Kannada    20033    Mysore University, Mysore, Karnataka
  7.   Kashmiri   29469    Kashmir University, Srinagar, Jammu and Kashmir
  8.   Konkani    32370    Goa University, Taleigao, Goa
  9.   Malayalam  30060    Amrita University, Coimbatore, Tamil Nadu
  10.  Meitei     16351    Manipur University, Imphal, Manipur
  11.  Marathi    29674    IIT Bombay, Mumbai, Maharashtra
  12.  Nepali     11713    Assam University, Silchar, Assam
  13.  Oriya      35284    Hyderabad Central University, Hyderabad, Andhra Pradesh
  14.  Punjabi    32364    Thapar University and Punjabi University, Patiala, Punjab
  15.  Sanskrit   23140    IIT Bombay, Mumbai, Maharashtra
  16.  Tamil      25431    Tamil University, Thanjavur, Tamil Nadu
  17.  Telugu     21925    Dravidian University, Kuppam, Andhra Pradesh
  18.  Urdu       34280    Jawaharlal Nehru University, New Delhi

Summary

IndoWordNet is highly similar to EuroWordNet. However, its pivot language is Hindi, which in turn is linked to the English WordNet. IndoWordNet also captures phenomena typical of Indian languages, such as complex predicates and causative verbs.
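The pivot arrangement can be sketched in Python as follows. This is a toy illustration rather than the IndoWordNet data format or API: the synset identifiers (`H1`, `E1`), the romanized word entries and the function names are all invented for the example, and the sketch assumes that every Indian-language wordnet reuses the identifier of the Hindi synset it was expanded from.

```python
from typing import Dict, List, Optional

# Toy data (invented): Hindi synset id -> linked English synset id.
HINDI_TO_ENGLISH: Dict[str, str] = {"H1": "E1"}

# Toy data (invented): language -> synset id -> ordered words.
WORDNETS: Dict[str, Dict[str, List[str]]] = {
    "hindi": {"H1": ["parivaar", "kutumb"]},
    "marathi": {"H1": ["kutumb", "parivar"]},  # reuses the Hindi id
    "english": {"E1": ["house", "family"]},
}


def to_pivot_id(lang: str, synset_id: str) -> Optional[str]:
    """Map a synset id in `lang` to the Hindi (pivot) synset id."""
    if lang != "english":
        return synset_id  # assumption: shared ids across Indian languages
    for hindi_id, english_id in HINDI_TO_ENGLISH.items():
        if english_id == synset_id:
            return hindi_id
    return None


def from_pivot_id(lang: str, hindi_id: str) -> Optional[str]:
    """Map a Hindi (pivot) synset id to the synset id used in `lang`."""
    if lang != "english":
        return hindi_id
    return HINDI_TO_ENGLISH.get(hindi_id)


def equivalents(word: str, source: str, target: str) -> List[str]:
    """Words in `target` sharing a concept with `word` in `source`."""
    results: List[str] = []
    for synset_id, words in WORDNETS[source].items():
        if word not in words:
            continue
        pivot = to_pivot_id(source, synset_id)
        if pivot is None:
            continue
        target_id = from_pivot_id(target, pivot)
        results.extend(WORDNETS[target].get(target_id, []))
    return results


print(equivalents("kutumb", "marathi", "english"))
print(equivalents("family", "english", "hindi"))
```

Every lookup passes through the Hindi identifier, so in this sketch a new language only needs to be linked to Hindi in order to reach all other languages, including English.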

IndoWordNet is publicly browsable. The Indian language wordnet-building efforts that form the subcomponents of the IndoWordNet project are the North East WordNet project, the Dravidian WordNet project and the Indradhanush project, all of which are funded by the TDIL project.

References

  1. Pushpak Bhattacharyya, IndoWordNet, Language Resources and Evaluation Conference 2010 (LREC 2010), Malta, May 2010.
  2. Christiane Fellbaum (ed.), WordNet: An Electronic Lexical Database, MIT Press, 1998.
  3. P. Vossen (ed.), EuroWordNet: A Multilingual Database with Lexical Semantic Networks, Kluwer Pub., 1998.
  4. Joseph E. Schwartzberg, Encyclopædia Britannica, India: Linguistic Composition, 2007.
  5. Dipak Narayan, Debasri Chakrabarty, Prabhakar Pande and P. Bhattacharyya, An Experience in Building the Indo WordNet - a WordNet for Hindi, International Conference on Global WordNet (GWC 02), Mysore, India, January 2002.
  6. E. Rudnicka, M. Maziarz, M. Piasecki and S. Szpakowicz, Mapping plWordNet onto Princeton WordNet, 24th International Conference on Computational Linguistics (COLING), India, December 2012.