Bosnian, Croatian and Serbian web corpora compiled by Nikola Ljubešić and Filip Klubička in 2011 and 2013:

  • data obtained from the web using the Brno web corpus processing pipeline (SpiderLing, chared, justext, onion);
  • lemmatised with CST’s Lemmatiser (Jongejan and Dalianis, 2009);
  • morphosyntactically tagged with HunPos (Halácsy et al., 2007);
  • all models trained on SETimes.HR, a Croatian annotated corpus of 90k tokens (Agić and Ljubešić, 2014).

More information can be found in the original WaC-9 workshop paper cited below.

Structural attributes

Common web corpora attributes


  • cyrillic_num (number of Cyrillic characters)
  • cyrillic_perc (percentage of Cyrillic characters)
  • diacr_perc (percentage of characters with diacritics)
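
As a rough illustration of what the three attributes above measure, the following Python sketch computes comparable values for a single text. It is not the code used to build the corpora. The function name char_stats, the choice of alphabetic characters as the denominator, and the set of letters counted as diacritics are all assumptions made for this example; the paper should be consulted for the actual definitions.

```python
import unicodedata

# Assumption: "diacritic characters" are the Latin letters with diacritics
# used in Bosnian/Croatian/Serbian orthography.
DIACRITIC_CHARS = set("čćžšđČĆŽŠĐ")


def is_cyrillic(ch: str) -> bool:
    """Return True if the character's Unicode name marks it as Cyrillic."""
    return "CYRILLIC" in unicodedata.name(ch, "")


def char_stats(text: str) -> dict:
    """Compute illustrative values for the three attributes (hypothetical helper)."""
    # Normalise so that letters such as 'č' are single precomposed code points.
    text = unicodedata.normalize("NFC", text)
    # Assumption: percentages are taken over alphabetic characters only.
    letters = [ch for ch in text if ch.isalpha()]
    total = len(letters)
    cyrillic_num = sum(1 for ch in letters if is_cyrillic(ch))
    diacr_num = sum(1 for ch in letters if ch in DIACRITIC_CHARS)
    return {
        "cyrillic_num": cyrillic_num,
        "cyrillic_perc": 100.0 * cyrillic_num / total if total else 0.0,
        "diacr_perc": 100.0 * diacr_num / total if total else 0.0,
    }
```

Values of this kind could be used, for instance, to separate documents written in Cyrillic from those in Latin script, or to spot Latin-script text typed without diacritics.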

Other attributes (described in the paper) may be added at your request.



6 May 2015

12 May 2014

  • initial version: bsWaC 1.0, hrWaC 2.0, srWaC 1.0


Ljubešić, N., & Klubička, F. (2014). {bs, hr, sr} WaC – web corpora of Bosnian, Croatian and Serbian. In Proceedings of the 9th Web as Corpus Workshop (WaC-9), pp. 29–35.