This is a 13-million-word corpus of Nepali. The corpus consists of written texts from 15 different genres, with 2000 words each, published between 1990 and 1992, together with texts from a wide range of sources such as websites, newspapers and books. Words are lemmatised and tagged. Thanks to Bal Krishna Bal from Language Technology Kendra and Andrew Hardie from Lancaster University.

Corpus homepage: The Nepali National Corpus at NeLRaLEC project pages.

Corpus publication: Yadava, Yogendra P., Andrew Hardie, Ram Raj Lohani, Bhim N. Regmi, Srishtee Gurung, Amar Gurung, Tony McEnery, Jens Allwood, and Pat Hall. Construction and annotation of a corpus of contemporary Nepali. Corpora 3, no. 2 (2008): 213-225.

Tagset documentation: Hardie, A, Lohani, R, Regmi, B and Yadava, Y (2005). Categorisation for automated morphosyntactic analysis of Nepali: introducing the Nelralec Tagset (NT-01). Nelralec/Bhasha Sanchar Working Paper 2, pp. 171–198.


(Taken from the Wayback Machine copy of the original web page at http://www.bhashasanchar.org/ncorpus_written.php)

The written corpus has two parts: the core corpus and the general collections.

Core corpus

The core corpus is a collection of Nepali written texts that matches, as far as possible, the dates, size and genres of the international FLOB and FROWN corpora, which consist of 500 texts from 15 different genres, with 2000 words each, published between 1990 and 1992. The framework is as follows:

Table 1: Core sample framework (based on FLOB/FROWN corpora)

Category label   Category title                         Number of samples
A                Press: Reportage                       44
B                Press: Editorial                       27
C                Press: Review                          17
D                Religion                               17
E                Skills, Trades and Hobbies             38
F                Popular Lore                           44
G                Belles Lettres, Biographies, Essays    77
H                Miscellaneous                          30
I                Science                                80
J                General Fiction                        29
K                Mystery and Detective Fiction          24
L                Science Fiction                        6
M                Adventure and Western                  29
N                Romance and Love Story                 29
O                Humour                                 9
TOTAL                                                   500

The primary purpose of the Core Sample was to provide a match to other corpora created from the same sampling frame. However, some adaptations were made in selecting genres, since not all genres found in English writing (e.g. science fiction) exist in Nepali, owing to cultural and other East-West differences. In addition, only 398 (instead of 500) texts could be collected for the Nepali core corpus, since texts in some genres were not available from the 1991/92 time frame, when writing in Nepali was very restricted and had only just started to broaden with the advent of liberalism after the restoration of democracy in the country.

The collected core corpus is presented in Table 2.

Table 2: Collected core corpus (target numbers from the core sample framework in parentheses)

Category name                             No. of files   No. of words
A (Press reportage)                       33 (44)        66800
B (Press editorial)                       23 (27)        46520
C (Press review)                          6 (17)         12095
D (Religion)                              13 (17)        26412
E (Skills, Trades and Hobbies)            29 (38)        58935
F (Popular lore)                          32 (44)        64878
G (Belles Lettres, Biographies, Essays)   68 (77)        137873
H (Miscellaneous)                         28 (30)        56680
J (Science)                               56 (80)        113507
S (Fiction)                               110 (126)      220874
Grand total                               398 (500)      804574
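
As a quick consistency check on the collected core corpus figures, the following minimal Python sketch recomputes the totals and the share of the target reached in each category. The numbers are copied from the table; the dictionary layout and the function name `coverage` are illustrative assumptions, not part of the corpus distribution.

```python
# Figures copied from the collected core corpus table:
# category -> (files collected, files targeted, words collected)
core = {
    "A (Press reportage)": (33, 44, 66800),
    "B (Press editorial)": (23, 27, 46520),
    "C (Press review)": (6, 17, 12095),
    "D (Religion)": (13, 17, 26412),
    "E (Skills, Trades and Hobbies)": (29, 38, 58935),
    "F (Popular lore)": (32, 44, 64878),
    "G (Belles Lettres, Biographies, Essays)": (68, 77, 137873),
    "H (Miscellaneous)": (28, 30, 56680),
    "J (Science)": (56, 80, 113507),
    "S (Fiction)": (110, 126, 220874),
}


def coverage(table):
    """Return {category: fraction of the targeted files actually collected}."""
    return {name: got / target for name, (got, target, _) in table.items()}


total_files = sum(got for got, _, _ in core.values())
total_target = sum(target for _, target, _ in core.values())
total_words = sum(words for _, _, words in core.values())
mean_words_per_file = total_words / total_files

for name, share in coverage(core).items():
    print(f"{name}: {share:.0%} of target")
print(total_files, total_target, total_words, round(mean_words_per_file))
```

The recomputed totals agree with the grand total row (398 of 500 files, 804574 words), and the mean of a little over 2000 words per file is in line with the 2000-word sample size of the framework.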

The internal structure of the core corpus is as follows:

Press editorial
  Daily: 5
    From Kathmandu: 6
    From outside:
  Weekly: 17
    From Kathmandu: 15
    From outside:
  Halfweekly: 1
    From Kathmandu:
    From outside: 1
  Total: 23

Religion
  From book: 10 (one text translated from Hindi and one text based on Sanskrit)
  From article: 3
  Total: 13

Fiction
  Novel: 66
  Short story: 44
  Total: 110

Science
  From book: 37
  From periodicals: 19
  Total: 56
  Sub-category
    Science and technology: 3
    Criticism: 20
    Anthropology / culture: 8
    History / Archeology: 5
    Language and grammar: 3
    Law / politics: 5
    Psychology: 1
    Philosophy: 4
    Business / economics / administration: 6
    Unclassified: 1

These 398 texts (nominally 1 million words; 804,574 words as collected, per Table 2), extracted from various books, journals, magazines and newspapers, were digitized in Nepali Unicode. For computer processing, the texts were then manually marked up in XML for the body, paragraphs, sentences and foreign words. Each text was given metadata or bibliographical details (book/article/issue title, author, publisher, publication date, publication place, name of the typist, etc.) in an XML header. Additional relevant XML tags were added automatically.
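
To make the markup description concrete, here is a minimal Python sketch that reads one text with a header and a body marked up for paragraphs, sentences and foreign words. The element names (`text`, `header`, `title`, `author`, `publisher`, `pubdate`, `body`, `p`, `s`, `foreign`) and the sample content are hypothetical placeholders chosen for illustration; the actual NNC schema may use different names and should be checked against the corpus files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the element names and content are illustrative
# assumptions, not the documented NNC schema.
SAMPLE = """<text>
  <header>
    <title>Example title</title>
    <author>Example author</author>
    <publisher>Example publisher</publisher>
    <pubdate>1991</pubdate>
  </header>
  <body>
    <p>
      <s>पहिलो वाक्य ।</s>
      <s>दोस्रो वाक्य <foreign>computer</foreign> ।</s>
    </p>
  </body>
</text>"""


def summarise(xml_string):
    """Collect header metadata and simple counts from one marked-up text."""
    root = ET.fromstring(xml_string)
    # Header: one child element per bibliographical field.
    metadata = {child.tag: (child.text or "").strip() for child in root.find("header")}
    body = root.find("body")
    return {
        "metadata": metadata,
        "paragraphs": len(body.findall("p")),
        "sentences": len(body.findall(".//s")),
        "foreign_words": [elem.text for elem in body.iter("foreign")],
    }


summary = summarise(SAMPLE)
print(summary)
```

Keeping the bibliographical details in a header separate from the body means the metadata can be read without touching the running text, which is the main point of the header/body split described above.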

A set of 112 part-of-speech (POS) tags was developed empirically to annotate the core corpus (for details see POS Tagset). This tagset was first used to annotate 160 files in the core corpus manually. Based on this manually tagged corpus, an automatic tagger called Unitag was developed at LU and has been used to tag the whole of the text corpus automatically, using a lexicon, rules and probabilistic generalizations. However, in line with our policy of technology transfer, we have been building our own part-of-speech tagger at MPP as part of a general morphological analyser for a range of uses.

General collections

The general collections in the NNC contain digitized written texts collected opportunistically from a wide range of sources such as websites, newspapers, books, publishers and authors. These texts, nearly 14 million words keyboarded in various fonts, have been converted to Unicode with a program called ‘Font Converter’, developed at the Bhashasanchar Project to convert non-Unicode fonts such as Kantipur, Preeti, Jag Himali, etc. into Unicode, and then tagged using XML markup and an automatic POS tagger.

The texts in the general collections are arranged according to their types.

1. Web-texts (collected from March 2005 to May 2006)

These texts are classified by web address, and then by text type (for example, anthropology, art, business, crime, criticism, education, editorial, health, news, law, opinion, sport, politics, etc.) and publication date, e.g. kantipur-editorial-2061-12-15.

2. Books (69 books of different genres and sizes)

Books are identified by genre, title and publication date. For example, alikhit by Dhruva Chandra Gautam is named ‘book-fiction-alikhit-2058’.

3. Newspaper/journal (complete text of a newspaper or a journal without classification)

In this class we have texts from 94 issues of himalkhabar patrika. Each file is named after the publication and its publication date, e.g. himalkhabarpatrika-2057-05-01.
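
The three naming patterns illustrated above can be told apart mechanically. The following is a minimal Python sketch, assuming that every file name follows exactly the shape of the three examples given (site-texttype-date for web texts, book-genre-title-year for books, name-date for newspapers and journals). The regular expressions, field names and the function `parse_name` are illustrative assumptions, not part of the corpus tooling. The years in the examples (2057, 2058, 2061) are presumably Bikram Sambat dates, so they are kept as strings rather than converted to calendar dates.

```python
import re

# Patterns inferred from the three example names only (an assumption).
BOOK = re.compile(r"^book-(?P<genre>[^-]+)-(?P<title>.+)-(?P<year>\d{4})$")
WEB = re.compile(
    r"^(?P<site>[^-]+)-(?P<text_type>[^-]+)"
    r"-(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})$"
)
PERIODICAL = re.compile(
    r"^(?P<name>[^-]+)-(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})$"
)


def parse_name(name):
    """Return the collection type and name fields, or None if nothing matches."""
    for kind, pattern in (("book", BOOK), ("web", WEB), ("periodical", PERIODICAL)):
        match = pattern.match(name)
        if match:
            return {"kind": kind, **match.groupdict()}
    return None


for example in (
    "kantipur-editorial-2061-12-15",
    "book-fiction-alikhit-2058",
    "himalkhabarpatrika-2057-05-01",
):
    print(parse_name(example))
```

The book pattern is tried first because its fixed `book-` prefix is the most specific; names that match none of the three shapes return None instead of being guessed at.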