American Spanish TenTen corpus. Crawled by SpiderLing in December 2011. Encoded in UTF-8, cleaned and deduplicated.
The corpus is tagged either with TreeTagger, using the Spanish parameter file (UTF-8), or with Freeling 3.1.
11 February 2015
- re-tagged using Freeling 3.1
2 April 2014
29 July 2013
- re-tagged using Freeling 3.0
- global subcorpora by country (the first ca. 10 million tokens remain the same as in the previous version; the remaining documents follow, sorted by country)
- initial version – 7.5 billion words