deWaC – German corpus from the .de domain
The German web corpus (deWaC) is a German-language corpus made up of texts collected from the Internet. It was prepared according to the methods described in the paper A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010).
The data was crawled by the SpiderLing web spider in 2009 and comprises more than 1.34 billion words.
sdeWaC: Parsed German web corpus
The sdeWaC corpus is a 750-million-word subset of the German web corpus, prepared by Janina Kopp and Niels Ott in 2012. The sdeWaC contains no duplicates. It was tagged with TreeTagger and parsed with the FSPar dependency parser, which added syntactic annotations to each sentence.
The corpus is tagged with TreeTagger using the Stuttgart-Tübingen tagset (STTS).
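As a rough illustration of working with STTS-tagged data, the sketch below counts part-of-speech tags in text laid out one token per line with tab-separated token, tag, and lemma columns, which is the layout TreeTagger prints when lemma output is enabled. The column order, the function names parse_vertical and count_tags, and the sample sentence are assumptions made for this example; they do not describe how deWaC or sdeWaC is actually distributed.

```python
from collections import Counter


def parse_vertical(lines):
    """Yield (token, tag, lemma) tuples from tab-separated lines.

    Assumes three columns per token line. Lines without exactly three
    columns (blank lines, structural markup such as <s>) are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        yield tuple(parts)


def count_tags(lines):
    """Count how often each part-of-speech tag occurs."""
    return Counter(tag for _, tag, _ in parse_vertical(lines))


# Hand-written sample for illustration only, not taken from the corpus.
sample = [
    "<s>",
    "Das\tART\tdie",
    "Korpus\tNN\tKorpus",
    "ist\tVAFIN\tsein",
    "groß\tADJD\tgroß",
    ".\t$.\t.",
    "</s>",
]

tag_counts = count_tags(sample)
```

Because both functions accept any iterable of lines, the same code could be pointed at an open file handle instead of the in-memory sample.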