Search the Spanish esTenTen corpus
The Spanish Web corpus (esTenTen) is a text corpus built from texts collected from the internet. It belongs to the TenTen corpus family, a set of web corpora that are all processed in the same way and have a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.
The corpus consists of two subcorpora, European Spanish and American Spanish, downloaded from web domains in the respective continents. Two separate corpora will also be prepared from these subcorpora:
- European Spanish Web corpus (eseuTenTen)
- American Spanish Web corpus (esamTenTen)
Thanks to this approach, users can reliably determine the language variety of a text. They can select a specific subcorpus (or corpus) and so limit a search to a single variety of Spanish.
The data was cleaned (re-encoded to UTF-8, stripped of boilerplate, and de-duplicated) and tokenised using Corpus tools. Detailed information about TenTen corpora is available on the separate page Common TenTen corpora attributes.
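The cleaning steps named above can be illustrated with a minimal Python sketch. This is not the actual Corpus tools pipeline: the function names, the Latin-1 fallback encoding, the exact-match de-duplication, and the regex tokeniser are all assumptions chosen for illustration, and boilerplate removal is left out because it depends on page structure.

```python
import hashlib
import re


def to_utf8_text(raw: bytes, fallback_encoding: str = "latin-1") -> str:
    """Decode bytes as UTF-8, falling back to an assumed legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 pages are Latin-1; a real pipeline would detect this.
        return raw.decode(fallback_encoding, errors="replace")


def deduplicate(paragraphs):
    """Drop paragraphs that repeat an earlier one, ignoring case and spacing."""
    seen = set()
    unique = []
    for paragraph in paragraphs:
        normalised = " ".join(paragraph.split()).lower()
        key = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(paragraph)
    return unique


# Hypothetical tokeniser: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenise(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def clean(raw_documents):
    """Re-encode, split into paragraphs, de-duplicate, and tokenise."""
    paragraphs = []
    for raw in raw_documents:
        text = to_utf8_text(raw)
        paragraphs.extend(line.strip() for line in text.splitlines() if line.strip())
    return [tokenise(paragraph) for paragraph in deduplicate(paragraphs)]
```

Calling `clean` on a list of raw page contents (as `bytes`) returns one token list per unique paragraph. Production web corpora typically use near-duplicate detection rather than the exact matching shown here.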