This is a 13-million-word corpus of Nepali. The corpus consists of written texts from 15 different genres, with 2,000 words each, published between 1990 and 1992, and texts from a wide range of sources such as websites, newspapers and books. Words are lemmatised and tagged. Thanks to Bal Krishna Bal from Language Technology Kendra and Andrew Hardie from Lancaster University.
Corpus homepage: The Nepali National Corpus at NeLRaLEC project pages.
Corpus publication: Yadava, Yogendra P., Andrew Hardie, Ram Raj Lohani, Bhim N. Regmi, Srishtee Gurung, Amar Gurung, Tony McEnery, Jens Allwood, and Pat Hall. Construction and annotation of a corpus of contemporary Nepali. Corpora 3, no. 2 (2008): 213-225.
Tagset documentation: Hardie, A, Lohani, R, Regmi, B and Yadava, Y (2005). Categorisation for automated morphosyntactic analysis of Nepali: introducing the Nelralec Tagset (NT-01). Nelralec/Bhasha Sanchar Working Paper 2, pp. 171–198.