esTenTen is a Spanish TenTen corpus. The source data was crawled from the internet in 2011, so the documents date mostly from 2011 and the preceding years.
The data was cleaned (re-encoded to UTF-8, stripped of boilerplate, and de-duplicated) and tokenised using Corpus tools. Part-of-speech tagging and lemmatisation were performed with Freeling 3.1, using its Spanish configuration and data and the Spanish Freeling tagset.
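As a rough illustration of two of the cleaning steps named above (re-encoding to UTF-8 and de-duplication), the following Python sketch may help. It is not the code used to build esTenTen: the function names, the Latin-1 fallback encoding, and the choice of exact-match hashing on normalised text are all assumptions made for this example, and the actual pipeline may differ.

```python
import hashlib
import unicodedata


def to_utf8_text(raw: bytes, fallback_encoding: str = "latin-1") -> str:
    """Decode raw bytes as UTF-8, falling back to an assumed legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption for this sketch: non-UTF-8 pages are Latin-1.
        return raw.decode(fallback_encoding)


def fingerprint(text: str) -> str:
    """Hash a document after Unicode, whitespace, and case normalisation."""
    normalised = unicodedata.normalize("NFC", text)
    normalised = " ".join(normalised.split()).lower()
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def deduplicate(documents):
    """Keep the first occurrence of each document, dropping exact repeats."""
    seen = set()
    unique = []
    for doc in documents:
        key = fingerprint(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

This only catches documents that are identical after normalisation; near-duplicate detection, which web corpora usually also need, would require a different technique.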
The corpus consists of two subcorpora, European Spanish and American Spanish, each downloaded from web domains in the respective continent. The subcorpus therefore effectively determines the language variety. To limit a query to a single variety of Spanish, select the desired subcorpus in the corpus query interface.
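The idea of assigning a document to a subcorpus by its web domain can be sketched as follows. This is only an illustration: the function name `subcorpus_for`, the use of country-code top-level domains, and the specific domain lists are assumptions for this example, not the rule actually used to build esTenTen.

```python
from urllib.parse import urlparse

# Hypothetical, incomplete domain lists chosen for illustration only.
EUROPEAN_TLDS = {"es"}
AMERICAN_TLDS = {"mx", "ar", "co", "cl", "pe", "ve", "uy"}


def subcorpus_for(url):
    """Return the subcorpus name for a URL, or None if the domain is unknown."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    if tld in EUROPEAN_TLDS:
        return "European Spanish"
    if tld in AMERICAN_TLDS:
        return "American Spanish"
    return None
```

For example, `subcorpus_for("http://www.ejemplo.es/pagina")` gives `"European Spanish"`, while a URL on a generic domain such as `.com` gives `None` under these assumed lists.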
11 February 2015
- re-tagged using Freeling 3.1
2 April 2014
- fixed encoding issues
29 July 2013
- re-tagged using Freeling 3.0
17 October 2012
- American and European parts from 2011 put together
- subcorpora can be used to query the parts separately now
12 January 2012
- American Spanish data crawled by web crawler SpiderLing in December 2011
- these documents were put into a separate corpus “esAmTenTen”
30 September 2011
- removed Catalan and Galician texts
- corpus size reduced by 79 million tokens