Serbo-Croatian web corpora

The Serbo-Croatian web corpora are language corpora made up of texts collected from the Internet. Sketch Engine offers Bosnian, Croatian and Serbian corpora obtained from the web by Nikola Ljubešić and Filip Klubička in 2011 and 2013. The corpora were built using the following steps:

  • data obtained from the web using the Brno web corpus processing pipeline (SpiderLing, chared, jusText, onion);
  • lemmatisation with the CST lemmatiser (Jongejan and Dalianis, 2009);
  • morphosyntactic tagging with HunPos (Halácsy et al., 2007);
  • all models trained on the Croatian 90k-token annotated corpus SETimes.HR (Agić and Ljubešić, 2014).

Part-of-speech tagset

Each corpus is annotated according to the MULTEXT-East Morphosyntactic Specifications, version 5, with small modifications for each language.
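
MULTEXT-East tags are positional: the first character gives the word category and each following character encodes one attribute. The Python sketch below illustrates how such a tag could be unpacked. It is only an illustration under stated assumptions: the function name `decode_noun_msd` and the lookup table are hypothetical, the table covers nouns only, and it does not reflect the small per-language modifications mentioned above.

```python
# Hypothetical, partial lookup table: noun attributes in positional order.
# It is an illustrative assumption, not the full MULTEXT-East specification.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural"}),
    ("Case", {
        "n": "nominative", "g": "genitive", "d": "dative",
        "a": "accusative", "v": "vocative", "l": "locative",
        "i": "instrumental",
    }),
]


def decode_noun_msd(tag):
    """Decode a positional noun tag such as 'Ncmsn' into named features."""
    if not tag or tag[0] != "N":
        raise ValueError("this sketch only handles noun tags (starting with 'N')")
    features = {"Category": "Noun"}
    # Pair each remaining character with its attribute; positions beyond
    # the table (for example further noun attributes) are ignored here.
    for (name, values), code in zip(NOUN_ATTRIBUTES, tag[1:]):
        if code == "-":
            continue  # '-' marks an attribute that does not apply
        # Unknown codes are kept as-is rather than guessed.
        features[name] = values.get(code, code)
    return features


print(decode_noun_msd("Ncmsn"))
```

A full decoder would need one such table per word category, taken from the specification for the language in question.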

A complete set of Sketch Engine tools is available to work with these Bosnian, Croatian and Serbian web corpora to generate:

  • word sketch – collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
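
To make two of these outputs concrete, the Python sketch below builds an n-gram frequency list and a simple keyword-in-context concordance from a list of tokens. It is a minimal illustration, not Sketch Engine's implementation: the function names `ngram_frequencies` and `concordance` are hypothetical, and the token list is an invented toy example rather than corpus data.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n shifted copies of the list yields each n-gram as a tuple.
    return Counter(zip(*(tokens[i:] for i in range(n))))


def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Invented toy input; a real corpus would be tokenised text from the web.
tokens = "dobar dan svima i dobar dan tebi".split()

bigrams = ngram_frequencies(tokens, 2)
print(bigrams.most_common(1))

for left, keyword, right in concordance(tokens, "dan", width=1):
    print(f"{left} [{keyword}] {right}")
```

Sorting the counter by frequency gives the kind of ranked list described above; a word list is the same idea with n set to 1.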

Changelog

6 May 2015

  • bsWaC 1.2, hrWaC 2.2, slWaC 2.1, srWaC 1.2 with MULTEXT-East tagset version 5

12 May 2014

  • initial version: bsWaC 1.0, hrWaC 2.0, srWaC 1.0

Bibliography

Ljubešić, N., & Klubička, F. (2014). {bs, hr, sr} WaC – web corpora of Bosnian, Croatian and Serbian. In Proceedings of the 9th Web as Corpus Workshop (WaC-9), pp. 29–35.
