French TenTen corpus. A French web corpus crawled by SpiderLing in February 2012. Encoded in UTF-8, cleaned, and deduplicated. Tagged with TreeTagger.

List of tags is here.


v. 2.0 (February 18, 2015)

  • retokenised and processed with up-to-date versions of the tools
  • Spanish documents filtered out

v. 1.0 (April 2012)

  • initial version – 10.7 billion words