LithuanianWaC: Corpus of the Lithuanian Web
The Lithuanian Web Corpus (LithuanianWaC) is a Lithuanian language corpus made up of texts collected from the Internet. The corpus was prepared according to the standards described in the paper "A Corpus Factory for Many Languages" (Kilgarriff et al., LREC 2010).
The text data was provided by Andrius Utka and contains 48 million words in total. The corpus texts are lemmatized and POS-tagged.
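As a rough illustration of what lemmatized, POS-tagged text makes possible, the sketch below reads a few tokens and counts lemmas and tags. It assumes a tab-separated, one-token-per-line layout (word, lemma, tag), which is a common convention for annotated corpora but is not confirmed here for LithuanianWaC. The sample sentence, the `Token` type, the `parse_tokens` helper, and the tag labels (`NOUN`, `VERB`, `ADJ`) are all hypothetical placeholders and do not reflect the corpus's actual tagset.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    """One annotated token: surface form, lemma, and POS tag."""
    word: str
    lemma: str
    tag: str


# Hypothetical sample: one token per line, tab-separated word, lemma, tag.
# The tag labels are placeholders, NOT the real LithuanianWaC tagset.
SAMPLE = (
    "namai\tnamas\tNOUN\n"
    "yra\tbūti\tVERB\n"
    "dideli\tdidelis\tADJ\n"
    "namas\tnamas\tNOUN\n"
)


def parse_tokens(text: str) -> list[Token]:
    """Parse tab-separated token lines, skipping blanks and markup lines."""
    tokens = []
    for line in text.splitlines():
        # Skip empty lines and structural markup such as <s> or <doc ...>.
        if not line.strip() or line.startswith("<"):
            continue
        parts = line.split("\t")
        # Ignore lines that do not have exactly three columns.
        if len(parts) != 3:
            continue
        tokens.append(Token(*parts))
    return tokens


tokens = parse_tokens(SAMPLE)

# Lemmatization lets different word forms be counted under one headword.
lemma_counts = Counter(t.lemma for t in tokens)

# POS tags allow frequency counts per word class.
tag_counts = Counter(t.tag for t in tokens)
```

Here the forms "namai" and "namas" are both counted under the lemma "namas", which is the practical benefit of lemmatization for a highly inflected language such as Lithuanian.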
The part-of-speech tags in the LithuanianWaC corpus are explained in the following POS tagset legend.