Bulgarian National Corpus

BulgarianNC: Bulgarian National Corpus

The Bulgarian National Corpus (BulgarianNC) is a corpus of Bulgarian texts collected from a variety of sources, including scanned books, transcribed data, and internet texts. The texts are classified by genre, domain, and source type. The corpus contains 419 million words in total, counting both the web and non-web parts.

In Sketch Engine, BulgarianNC is organised hierarchically as follows:

BulgarianNC_web: The web part of BulgarianNC
BulgarianNC_nonweb: Everything except the web part
BulgarianNC_all: BulgarianNC_web + BulgarianNC_nonweb. This is a test case of our new feature, the Virtual Corpus (or Super Corpus).
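
The idea behind a virtual corpus can be illustrated with a small sketch: a combined corpus that stores no text of its own and simply iterates over its parts. The class names `Corpus` and `VirtualCorpus` and the toy tokens are hypothetical, chosen only for illustration. This is not a description of how Sketch Engine implements the feature.

```python
from dataclasses import dataclass
from itertools import chain
from typing import Iterator, List


@dataclass
class Corpus:
    """A hypothetical in-memory corpus: a name plus a list of tokens."""
    name: str
    tokens: List[str]

    def __iter__(self) -> Iterator[str]:
        return iter(self.tokens)

    def size(self) -> int:
        return len(self.tokens)


@dataclass
class VirtualCorpus:
    """A hypothetical virtual corpus: references to its parts, no text of its own."""
    name: str
    parts: List[Corpus]

    def __iter__(self) -> Iterator[str]:
        # Walk through each part in turn without copying any tokens.
        return chain.from_iterable(self.parts)

    def size(self) -> int:
        # The combined size is the sum of the sizes of the parts.
        return sum(part.size() for part in self.parts)


# Toy data standing in for the real subcorpora.
web = Corpus("BulgarianNC_web", ["това", "е", "тест"])
nonweb = Corpus("BulgarianNC_nonweb", ["книга", "е"])
combined = VirtualCorpus("BulgarianNC_all", [web, nonweb])
```

In this sketch, anything that works on a plain corpus by iterating over tokens also works unchanged on the combined one, which is the practical appeal of defining BulgarianNC_all in terms of its two parts.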

Part-of-speech tagset

The Bulgarian National Corpus is part-of-speech (PoS) tagged using the Bulgarian tagset.

Tools to work with the Bulgarian corpus

A complete set of tools is available for working with this Bulgarian corpus. They generate:

word sketch – Bulgarian collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
word lists – lists of Bulgarian nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
keywords – terminology extraction of one-word units
text type analysis – statistics of metadata in the corpus
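
To make the word list and n-gram outputs more concrete, here is a minimal sketch of how frequency lists of single words and of multi-word units could be computed from a tokenized text. The function names `ngrams` and `frequency_list` and the toy sentence are assumptions made for illustration. The sketch uses only the Python standard library and does not reflect Sketch Engine's own implementation or API.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def ngrams(tokens: List[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every run of n consecutive tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def frequency_list(items: Iterable) -> List[tuple]:
    """Return (item, count) pairs, most frequent first."""
    return Counter(items).most_common()


# Toy tokenized text; a real corpus would supply the tokens.
tokens = "на масата има книга и на масата има чаша".split()

word_list = frequency_list(tokens)               # single-word frequency list
bigram_list = frequency_list(ngrams(tokens, 2))  # frequency list of 2-grams
```

A real word list would normally be filtered by part of speech (for example nouns only), which requires the PoS tags stored in the corpus. The sketch counts raw tokens only.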