bgTenTen – Bulgarian corpus from the web

bgTenTen: Corpus of the Bulgarian Web

The Bulgarian Web Corpus (bgTenTen) is a Bulgarian corpus made up of texts collected from the Internet. The corpus belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.

Detailed information about TenTen corpora is available on the separate page Common TenTen corpora attributes.

Part-of-speech tagset

The bgTenTen corpus is annotated with TreeTagger, trained on the Bulgarian TreeBank part-of-speech tagset.

Tools to work with the Bulgarian Web corpus

A complete set of Sketch Engine tools is available to work with this Bulgarian corpus to generate:

word sketch – Bulgarian collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of Bulgarian nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus
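
To make the idea of an n-gram frequency list concrete, here is a minimal Python sketch. It is not how Sketch Engine computes n-grams, and it does not call any Sketch Engine API; the function name ngram_frequencies and the toy token list are assumptions made up for illustration, and the sample text is not taken from bgTenTen.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every contiguous run of n tokens (hypothetical helper)."""
    # zip over n shifted copies of the token list yields all n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Toy, whitespace-tokenized sample; a real corpus would be tokenized
# and lemmatized by the corpus pipeline, not by str.split().
tokens = "добър ден и добър вечер и добър ден".split()

bigrams = ngram_frequencies(tokens, 2)

# Print the bigrams from most to least frequent.
for gram, count in bigrams.most_common():
    print(count, gram)
```

The same function with n set to 1 gives a plain word frequency list, which is the idea behind the word lists tool, although the real tool also filters by part of speech and other attributes.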

Changelog

version 1 (bgTenTen12)

initial version, obtained from the web in 2012

Bibliography

TenTen corpora

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).

Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).

Search the Bulgarian corpus

Sketch Engine offers a range of tools to work with this Bulgarian corpus.

open in Sketch Engine

about Sketch Engine

Other text corpora

Sketch Engine offers 800+ language corpora.

available corpora

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in context, and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.

Quick Start Guide
