Search the Spanish esTenTen corpus

esTenTen is a Spanish corpus from the TenTen family. The source data was crawled from the web in 2011, so the documents date mostly from 2011 and the preceding years.

The data was cleaned (re-encoded to UTF-8, stripped of boilerplate, and de-duplicated) and tokenised using Corpus tools. Part-of-speech tagging and lemmatisation were performed with Freeling 3.1, using its Spanish configuration and data and applying the Spanish Freeling tagset.
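The cleaning steps named above can be illustrated with a minimal Python sketch. This is not the actual tooling used to build esTenTen: the function names, the Latin-1 fallback encoding, the exact-match de-duplication, and the regex tokeniser are all assumptions chosen for illustration, and boilerplate removal and Freeling tagging are left out.

```python
import hashlib
import re


def to_utf8_text(raw: bytes, fallback_encoding: str = "latin-1") -> str:
    """Decode bytes as UTF-8, falling back to an assumed legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: pages that are not valid UTF-8 are Latin-1.
        return raw.decode(fallback_encoding)


def deduplicate(docs):
    """Drop exact duplicates (after whitespace normalisation), keeping the first copy."""
    seen = set()
    unique = []
    for doc in docs:
        normalised = " ".join(doc.split())
        key = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Hypothetical tokeniser: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenise(text: str):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)
```

A production pipeline would typically also detect near-duplicates rather than only exact copies; the sketch keeps to exact matching for brevity.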

The corpus consists of two subcorpora, European Spanish and American Spanish, each downloaded from web domains on the respective continent. A subcorpus therefore effectively determines the language variety. To limit a query to a single variety of Spanish, select the desired subcorpus in the corpus query interface.
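As a rough illustration of how a document's web domain could determine its subcorpus, the Python sketch below maps a URL's top-level domain to a variety. The TLD sets and the function name `subcorpus_for` are hypothetical; the actual list of domains used for esTenTen is not given here.

```python
from urllib.parse import urlsplit

# Assumed, illustrative TLD lists; not the corpus's real domain inventory.
EUROPEAN_TLDS = {"es"}
AMERICAN_TLDS = {"mx", "ar", "cl", "co", "pe"}


def subcorpus_for(url: str):
    """Return the subcorpus name implied by the URL's top-level domain, or None."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    if tld in EUROPEAN_TLDS:
        return "European Spanish"
    if tld in AMERICAN_TLDS:
        return "American Spanish"
    return None
```

For example, `subcorpus_for("http://www.example.es/page")` yields the European subcorpus under these assumed lists, while a `.org` URL maps to no subcorpus.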

Sketch Engine offers a range of tools to work with this Spanish corpus.

[Figure: Spanish esTenTen text corpus from the web]

[Figure: Spanish web corpus esTenTen by country]

Changelog

January 2015

  • re-tagged using Freeling 4

11 February 2015

  • re-tagged using Freeling 3.1

2 April 2014

  • fixed encoding issues

29 July 2013

  • re-tagged using Freeling 3.0

17 October 2012

  • American and European parts from 2011 put together
  • subcorpora can be used to query the parts separately now

12 January 2012

  • American Spanish data crawled by web crawler SpiderLing in December 2011
  • these documents were put into a separate corpus “esAmTenTen”

30 September 2011

  • removed Catalan and Galician texts
  • corpus size reduced by 79 million tokens

13 April 2011
