esTenTen is a Spanish TenTen corpus. The source data was crawled from the web in 2011, so the corpus mostly contains documents from 2011 and the preceding years.

The data was cleaned (re-encoded to UTF-8, boilerplate removed, de-duplicated) and tokenised using Corpus tools. Part-of-speech tagging and lemmatisation were performed with Freeling 3.1, using its Spanish configuration and data and the Spanish Freeling tagset.
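As a rough illustration of two of the cleaning steps described above (re-encoding to UTF-8 and de-duplication), the following minimal sketch shows one way they could be done. It is not the pipeline used to build esTenTen: the function names, the Latin-1 fallback encoding and the exact-match hashing strategy are all assumptions made for this example, and boilerplate removal is not covered.

```python
import hashlib
import unicodedata


def to_utf8_text(raw: bytes, source_encoding: str = "latin-1") -> str:
    """Decode bytes as UTF-8, falling back to an assumed legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 pages are Latin-1; a real crawler would
        # detect the encoding per document.
        return raw.decode(source_encoding)


def normalise(text: str) -> str:
    """Normalise Unicode form and whitespace so trivial variants compare equal."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def deduplicate(documents):
    """Keep the first occurrence of each document, compared by content hash."""
    seen = set()
    unique = []
    for raw in documents:
        text = normalise(to_utf8_text(raw))
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


docs = [
    "El corpus es grande.".encode("utf-8"),
    "El  corpus es   grande.".encode("utf-8"),  # same text, extra spaces
    "Año nuevo".encode("latin-1"),              # legacy-encoded page
]
result = deduplicate(docs)
```

This sketch only catches exact duplicates after whitespace normalisation; detecting near-duplicate documents would need a different technique.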

The corpus consists of two subcorpora, European Spanish and American Spanish, each downloaded from web domains on the respective continent. A subcorpus thus effectively determines the language variety. To limit a query to a single variety of Spanish, select the desired subcorpus in the corpus query interface.



January 2015

  • re-tagged using Freeling 4

11 February 2015

  • re-tagged using Freeling 3.1

2 April 2014

  • fixed encoding issues

29 July 2013

  • re-tagged using Freeling 3.0

17 October 2012

  • American and European parts from 2011 put together
  • subcorpora can be used to query the parts separately now

12 January 2012

  • American Spanish data crawled by web crawler SpiderLing in December 2011
  • these documents were put into a separate corpus “esAmTenTen”

30 September 2011

  • removed Catalan and Galician texts
  • corpus size reduced by 79 million tokens

13 April 2011