American Spanish TenTen corpus. Crawled by SpiderLing in December 2011. Encoded in UTF-8, cleaned and deduplicated.

The corpus is tagged with TreeTagger (using the Spanish UTF-8 parameter file) or with Freeling 3.1.

Changelog

11 February 2015

  • re-tagged using Freeling 3.1

2 April 2014

  • fixed encoding issues

29 July 2013

  • re-tagged using Freeling 3.0

March 2013

  • global subcorpora by country (the first ca. 10 million tokens remain the same as in the previous version; the remaining documents follow, sorted by country)

April 2012

  • initial version – 7.5 billion words