This web corpus contains 53 million words and was built with the Corpus Factory method. The corpus has structure attributes for sentences, paragraphs, and documents. It is encoded in UTF-8. Girish Duvuru is the author and Vít Baisa is the current maintainer. For more information about the Corpus Factory method, see the reference below.


Adam Kilgarriff, Siva Reddy, Jan Pomikálek, and Avinesh PVS. A corpus factory for many languages. In LREC workshop on Web Services and Processing Pipelines, Malta, May 2010.