Nepali National Corpus search

NNC: Nepali National corpus

The Nepali National Corpus (NNC) is a Nepali corpus made up of 13 million words that is lemmatised and part-of-speech tagged. The NNC comprises three types of corpora: a written corpus, a parallel corpus and a spoken corpus. The spoken corpus is not part of the NNC in Sketch Engine. The corpus was created within the NeLRaLEC project, funded by the Asia IT&C Programme of the European Commission. The corpus texts were PoS tagged and later lemmatised by Bal Krishna Bal from Language Technology Kendra and Andrew Hardie from Lancaster University.

Corpus homepage: Language Technology Kendra (a cross-institutional center which created the corpus), Nepali National Corpus (available only via Internet Archive) project pages which i)

Part-of-speech tagset

The NNC is part-of-speech annotated using the Nelralec tagset.

Content in detail

The version of the Nepali National Corpus in Sketch Engine consists of two corpora.

1. written corpus (ca 11 million words), made up of two collections: the core corpus and the general collections

Core corpus

The core corpus is a collection of written Nepali texts that matches, as far as possible, the dates, size and genres of the international FLOB and FROWN corpora: 500 texts of 2000 words each, in 15 genres, published between 1990 and 1992. The sampling framework is as follows:

Table 1: Core sample framework (based on FLOB/FROWN corpora)
Category Label	Category Title	Number of samples
A	Press: Reportage	44
B	Press: Editorial	27
C	Press: Review	17
D	Religion	17
E	Skills, Trades and Hobbies	38
F	Popular Lore	44
G	Belles Lettres, Biographies, Essays	77
H	Miscellaneous	30
I	Science	80
J	General Fiction	29
K	Mystery and Detective Fiction	24
L	Science fiction	6
M	Adventure and Western	29
N	Romance and Love story	29
O	Humour	9
	TOTAL	500

The primary purpose of the core sample was to provide a match to other corpora created from the same sampling frame. However, some adaptations were made in selecting genres, because not all genres found in English writing (e.g. science fiction) existed in Nepali, owing to cultural and other East-West differences. In addition, only 398 (instead of 500) texts could be collected for the Nepali core corpus, since texts from some genres were not available for the 1991/92 time frame. At that time writing in Nepali was very restricted and had only just started to broaden with the advent of liberalism after the restoration of democracy in the country.

The collected core corpus is presented in Table 2.

Table 2: Texts collected for the core corpus (number of files planned in the sampling framework in parentheses)
Category name	No of files	No of words
A (Press reportage)	33 (44)	66800
B (Press editorial)	23 (27)	46520
C (Press review)	6 (17)	12095
D (Religion)	13 (17)	26412
E (Skills, Trades and Hobbies)	29 (38)	58935
F (Popular lore)	32 (44)	64878
G (Belles Lettres, Biographies, Essays)	68 (77)	137873
H (Miscellaneous)	28 (30)	56680
J (Science)	56 (80)	113507
S (Fiction)	110 (126)	220874
Grand total	398 (500)	804574
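
As a quick check of the arithmetic in Table 2, the following minimal sketch sums the per-category figures and computes how much of the planned sample was collected. The numbers are copied from the table; the dictionary layout and variable names are only illustrative.

```python
# (collected files, planned files, words) per category, copied from Table 2.
CORE_CORPUS = {
    "A (Press reportage)": (33, 44, 66800),
    "B (Press editorial)": (23, 27, 46520),
    "C (Press review)": (6, 17, 12095),
    "D (Religion)": (13, 17, 26412),
    "E (Skills, Trades and Hobbies)": (29, 38, 58935),
    "F (Popular lore)": (32, 44, 64878),
    "G (Belles Lettres, Biographies, Essays)": (68, 77, 137873),
    "H (Miscellaneous)": (28, 30, 56680),
    "J (Science)": (56, 80, 113507),
    "S (Fiction)": (110, 126, 220874),
}

collected = sum(row[0] for row in CORE_CORPUS.values())
planned = sum(row[1] for row in CORE_CORPUS.values())
words = sum(row[2] for row in CORE_CORPUS.values())

# Share of the planned 500-text sample that was actually collected.
coverage = collected / planned

# Category with the lowest share of its planned files collected.
lowest = min(CORE_CORPUS, key=lambda name: CORE_CORPUS[name][0] / CORE_CORPUS[name][1])

print(f"files: {collected} of {planned} ({coverage:.1%}), words: {words}")
print(f"lowest coverage: {lowest}")
```

The totals agree with the "Grand total" row of the table (398 of 500 files, 804574 words).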

The internal structure of the core corpus is as follows:

Press editorial
Daily:	5
  From Kathmandu:	6
  From outside:	–
Weekly:	17
  From Kathmandu:	15
  From outside:	–
Half-weekly:	1
  From Kathmandu:	–
  From outside:	1
Total	23

Religion
From book:	10
  (one text translated from Hindi and one text based on Sanskrit)
From article:	3
Total	13

Fiction
Novel	66
Short story	44
Total	110

Science
From book	37
From periodicals	19
Total	56

Science sub-categories
Science and technology	3
Criticism	20
Anthropology / culture	8
History / Archeology	5
Language and grammar	3
Law / politics	5
Psychology	1
Philosophy	4
Business / economics / administration	6
Unclassified	1

These 1 million words, in 398 texts extracted from various books, journals, magazines and newspapers, were digitized in Nepali Unicode. For computer processing, the texts were then formatted manually with XML tags marking the body, paragraphs, sentences and any foreign words. Each text was given an XML header with metadata or bibliographical details such as the book/article/issue title, author, publisher, publication date, publication place, name of the typist, etc. Additional relevant XML tags were added automatically.
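
To illustrate the kind of markup described above (a header with metadata, and a body divided into paragraphs, sentences and foreign words), here is a minimal sketch using Python's standard library. The element names (`text`, `header`, `title`, `author`, `body`, `p`, `s`, `foreign`) and the sample content are hypothetical placeholders chosen for this example; they are not the actual NNC schema.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; all element names are hypothetical, not the NNC schema.
SAMPLE = """<text>
  <header>
    <title>Example title</title>
    <author>Example author</author>
  </header>
  <body>
    <p>
      <s>First sentence with a <foreign>foreign</foreign> word.</s>
      <s>Second sentence.</s>
    </p>
    <p>
      <s>Third sentence.</s>
    </p>
  </body>
</text>"""


def summarise(xml_string):
    """Return header metadata and simple structural counts for one text."""
    root = ET.fromstring(xml_string)
    header = root.find("header")
    body = root.find("body")
    return {
        # Each child element of the header becomes one metadata field.
        "metadata": {child.tag: (child.text or "").strip() for child in header},
        "paragraphs": len(body.findall("p")),
        "sentences": len(body.findall(".//s")),
        "foreign_words": [element.text for element in body.iter("foreign")],
    }


summary = summarise(SAMPLE)
print(summary)
```

With real corpus files, the element names would need to be replaced by those actually used in the NNC markup.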

A set of 112 part-of-speech (POS) tags was developed empirically to annotate the core corpus (for details see POS Tagset). This tagset was first used to annotate 160 files of the core corpus manually. Based on this manually tagged corpus, an automatic tagger called Unitag was developed at LU, and it has been used to tag the whole of the text corpus automatically using a lexicon, rules and probabilistic generalizations. However, in line with our policy of technology transfer, we have been building our own part-of-speech tagger at MPP as part of a general morphological analyser for a range of uses.

General collections

The general collections in the NNC contain digitized written texts collected opportunistically from a wide range of sources such as websites, newspapers, books, publishers and authors. These texts, nearly 14 million words keyboarded in various fonts, have been converted to Unicode with a program called ‘Font Converter’, developed at the Bhashasanchar Project to convert non-Unicode fonts such as Kantipur, Preeti, Jag Himali, etc. into Unicode. They have then been marked up in XML and tagged with the automatic POS tagger.

The texts in the general collections are arranged according to their types.

1. Web texts (collected between March 2005 and May 2006)

These texts are classified according to their web addresses and are further classified as per their text types (for example, anthropology, art, business, crime, criticism, education, editorial, health, news, law, opinion, sport, politics etc.) and publication date, e.g. kantipur-editorial-2061-12-15.

2. Books (69 books of different genres and sizes)

Books are identified by their genre, title and publication date. For example, alikhit by Dhruva Chandra Gautam has been named ‘book-fiction-alikhit-2058’.

3. Newspaper/journal (complete text of a newspaper or a journal without classification)

In this class we have texts from 94 issues of himalkhabar patrika. Each file has been named after the publication and its publication date, e.g. himalkhabarpatrika-2057-05-01.
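
The three naming schemes above can be told apart mechanically. The following minimal sketch parses the three example names quoted in this section; the regular expressions, field names and the `parse_name` helper are assumptions made for illustration and have not been checked against the full corpus. The dates (presumably Bikram Sambat years) are kept as plain strings.

```python
import re

# A four-digit year, optionally followed by -MM-DD.
DATE = r"(?P<date>\d{4}(?:-\d{2}-\d{2})?)"

# Patterns are tried in order; the first one that matches wins.
PATTERNS = [
    ("book", re.compile(rf"^book-(?P<genre>[^-]+)-(?P<title>.+)-{DATE}$")),
    ("web", re.compile(rf"^(?P<source>[^-]+)-(?P<text_type>[^-]+)-{DATE}$")),
    ("periodical", re.compile(rf"^(?P<source>[^-]+)-{DATE}$")),
]


def parse_name(name):
    """Split a general-collection file name into its parts, or return None."""
    for kind, pattern in PATTERNS:
        match = pattern.match(name)
        if match:
            return {"kind": kind, **match.groupdict()}
    return None


for example in (
    "kantipur-editorial-2061-12-15",
    "book-fiction-alikhit-2058",
    "himalkhabarpatrika-2057-05-01",
):
    print(example, "->", parse_name(example))
```

The order of the patterns matters: a periodical name has no text-type field, so it fails the web pattern and falls through to the last one.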

2. parallel corpus (two genres: computing and national development; ca 3 million words)

Tools to work with the Nepali National corpus

A complete set of Sketch Engine tools is available to work with this Nepali National corpus to generate:

word sketch – Nepali collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
word lists – lists of Nepali nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
keywords – terminology extraction of one-word units
text type analysis – statistics of metadata in the corpus

Bibliography

Corpus publication

Yadava, Yogendra P., Andrew Hardie, Ram Raj Lohani, Bhim N. Regmi, Srishtee Gurung, Amar Gurung, Tony McEnery, Jens Allwood, and Pat Hall. Construction and annotation of a corpus of contemporary Nepali. Corpora 3, no. 2 (2008): 213-225.

Part-of-speech tagset documentation

Hardie, A, Lohani, R, Regmi, B and Yadava, Y (2005). Categorisation for automated morphosyntactic analysis of Nepali: introducing the Nelralec Tagset (NT-01). Nelralec/Bhasha Sanchar Working Paper 2, pp. 171–198.

Part-of-speech tagging and lemmatisation

Hardie, A, Lohani, R and Yadava, YP (2011) Extending corpus annotation of Nepali: advances in tokenisation and lemmatisation. Himalayan Linguistics 10 (1): 151-165.

Yadava, Y.P., Hardie, A., Lohani R.R., Regmi B.N., Gurung, S., Gurung, A., McEnery, T., Allwood, J., and Hall, P. (2008). Construction and annotation of a corpus of contemporary Nepali. Corpora 3(2): 213-225.

Search the Nepali National corpus

Sketch Engine offers a range of tools to work with the Nepali National corpus.

Other text corpora in Sketch Engine

Sketch Engine offers 800+ language corpora.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in contexts, n-grams or extracting terms is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.
