BengaliWaC: Bangla web corpus

The Bengali Web Corpus (BengaliWaC) is a Bengali corpus made up of texts collected from the Internet. It was compiled using the Corpus Factory method, an approach for building large general-language corpora that can be applied to many languages (Kilgarriff et al. 2010). The corpus texts are lemmatized and part-of-speech tagged.

Part-of-speech tagset

BengaliWaC is annotated with a part-of-speech tagset for Bengali, created using an annotation tool developed at Microsoft Research India.

Tools to work with the BengaliWaC corpus

A complete set of Sketch Engine tools is available to work with this Bengali web corpus to generate:

  • word lists – lists of Bengali nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
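To make the three outputs above concrete, here is a minimal Python sketch of what a frequency word list, an n-gram list, and a concordance contain. It is an illustration only and not how Sketch Engine computes them: the function names (`word_list`, `ngrams`, `concordance`) and the toy sentence are assumptions introduced for this example, and the sample is simply split on whitespace rather than processed with the corpus tokenizer or lemmatizer.

```python
from collections import Counter


def word_list(tokens):
    """Return (word, count) pairs sorted by descending frequency."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Return (n-gram, count) pairs for every run of n consecutive tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common()


def concordance(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Toy whitespace-tokenized sample (hypothetical, not taken from BengaliWaC).
tokens = "আমি ভাত খাই তুমি ভাত খাও আমি জল খাই".split()

if __name__ == "__main__":
    print(word_list(tokens)[:3])          # most frequent words first
    print(ngrams(tokens, 2)[:3])          # bigram frequency list
    for left, kw, right in concordance(tokens, tokens[1]):
        print(f"{left} [{kw}] {right}")   # keyword shown in its context
```

In the real corpus, the same queries can also be restricted by lemma or part-of-speech tag, which this sketch does not model.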

Changelog

version 2 (April 2017)

  • applied new tokenizer and lemmatizer
  • corpus tagged using an annotation tool developed at Microsoft Research India

(2012)

  • shallow tagging (no part-of-speech classification)
  • universal word sketch grammar

version 1 (2010)

  • initial version

Bibliography

Indian Language Part-of-Speech Tagset: Bengali

Bali, K., Choudhury, M., & Biswas, P. Indian Language Part-of-Speech Tagset: Bengali (LDC2010T16). Web download. Philadelphia: Linguistic Data Consortium, 2010.

Web corpora

Baroni, M., & Kilgarriff, A. Large linguistically-processed web corpora for multiple languages. In Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations, pp. 87–90. Association for Computational Linguistics, 2006.

Kilgarriff, A., Reddy, S., Pomikálek, J., & Avinesh, P. V. S. A corpus factory for many languages. In LREC, May 2010.

Search the Bengali web corpus

Sketch Engine offers a range of tools to work with the Bengali web corpus.

Other text corpora in Sketch Engine

Sketch Engine offers 350+ language corpora.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in context, and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.