Kannada WaC (Web as Corpus). The corpus was prepared using the Corpus Factory method. Full details are described in Kilgarriff et al. (LREC 2010).


v2.0 (17th Jan 2012)

The corpus is tagged using a new POS tagger (77.63% accuracy), lemmatizer, and morphological analyser downloaded from http://sivareddy.in/downloads

The tagset is described in the POS guidelines for Indian languages (retrieved from the Web Archive copy of http://ltrc.iiit.ac.in/tr031/posguidelines.pdf)

We wrote a simple sketch grammar for Kannada and generated word sketches and a distributional thesaurus. If you would like to contribute, please contact us.

Reference for the corpus and tagger

REDDY, Siva; SHAROFF, Serge. Cross language POS taggers (and other tools) for Indian languages: An experiment with Kannada using Telugu resources. Cross Lingual Information Access, 2011, 11.