Crawled and deduplicated web corpus covering multiple domains: blog posts, newspapers, commercial pages, …

See a full description.