Dutch Web Corpus

This corpus was created within the Corpus Factory project as…

Lithuanian WaC

(version 2) This corpus was created using the Corpus Factory method…

Indonesian WaC

The corpus was prepared using the Corpus Factory method described here.…

Croatian Web Corpus

(version 1.1) Tagset: MULTEXT-East Morphosyntactic Specifications,…

Kannada WaC

Kannada WaC (web as corpus). The corpus is prepared by Corpus…

Yoruba WaC corpus

Yoruba web as corpus. It was compiled in June 2015 with encoding…

Hebrew web corpora

Hebrew General corpus: This corpus was crawled from the Internet…

TatarWaC corpus

The Tatar sample corpus is ca. 200 thousand words crawled from the…

Urdu

A web corpus containing 53 million words, built with Corpus…

Russian Web Corpus

This corpus was gathered by Serge Sharoff at the University of…

pukWaC

The same as ukWaC, but with a further layer of annotation added,…

Romanian WaC (RoWaC) corpus

This Romanian web as corpus was gathered by Monica Macoveiciuc,…

Polish Web Corpus

The Polish web as corpus has 103 million words and the encoding is…

Internet-ZH corpus

Internet-ZH is a Chinese web corpus collected by Serge Sharoff.…

Domain Web Corpus

The corpora available here have been collected using the WebBootCat…