TenTen is a new generation of Web corpora. These corpora are created by Web crawling and processed with our latest boilerplate cleaning and de-duplication tools. The “TenTen” in the name refers to the target size of the corpora, which is 10^10 (10 billion) words.

Available corpora:

Description of preparing TenTen corpora

Corpora are crawled from the Internet by Spiderling (a web spider for linguistics). The downloaded texts are first processed by jusText (a heuristic-based boilerplate removal tool), which removes irrelevant content such as navigation links, advertisements, headers, and footers. Next, onion (a tool for removing duplicate text) deduplicates the data at the paragraph level: a paragraph is removed if more than 50 % of its word 7-tuples have already been encountered in previously processed data. Finally, the corpus is tokenized into words, lemmatised (each word form is assigned its base form, or lemma), and part-of-speech tagged by suitable taggers.
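The paragraph-level deduplication rule can be illustrated with a minimal Python sketch. This is not the onion implementation; it only mirrors the rule stated above (a paragraph is dropped when more than 50 % of its word 7-tuples were seen earlier). The function names, whitespace tokenization, and the choices to keep paragraphs shorter than seven words and to record the 7-tuples of dropped paragraphs are all assumptions made for the example.

```python
from typing import List, Set, Tuple


def word_ngrams(words: List[str], n: int = 7) -> List[Tuple[str, ...]]:
    """Return all consecutive word n-tuples of a token list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def deduplicate_paragraphs(paragraphs: List[str], n: int = 7,
                           threshold: float = 0.5) -> List[str]:
    """Keep paragraphs whose share of already-seen n-tuples is <= threshold.

    Illustrative only: whitespace tokenization stands in for a real tokenizer.
    """
    seen: Set[Tuple[str, ...]] = set()
    kept: List[str] = []
    for paragraph in paragraphs:
        grams = word_ngrams(paragraph.split(), n)
        # Assumption: paragraphs too short to form any n-tuple are kept.
        duplicate = False
        if grams:
            seen_count = sum(1 for g in grams if g in seen)
            duplicate = seen_count / len(grams) > threshold
        # Assumption: n-tuples of dropped paragraphs also count as "seen".
        seen.update(grams)
        if not duplicate:
            kept.append(paragraph)
    return kept


if __name__ == "__main__":
    sample = [
        "a b c d e f g h i j",
        "a b c d e f g h i j",   # exact repeat, dropped
        "k l m n o p q r s t",
    ]
    print(deduplicate_paragraphs(sample))
```

Holding every 7-tuple in an in-memory set is only practical for small inputs; at the scale of a 10-billion-word corpus a more compact representation would be needed.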

The tools used are described on the Language resources tools documentation page.

The full process of preparing the corpora is described in the papers listed below.

Structural attributes

A list of structural attributes shared by all TenTen corpora follows. Attributes specific to a corpus can be found on the information page of the corpus.

Document

  • Top level domain (e.g. “com”)
  • Web site (e.g. “wikipedia.org”)
  • Web domain (e.g. “nytimes.com”)
  • Crawl date = date of downloading the document from the web
  • url = URL of the source document
  • wordcount
  • length
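As a rough illustration of how the URL-based document attributes relate to each other, the following Python sketch derives a top-level domain and a web domain from a document URL. The function name and dictionary keys are hypothetical and are not the attribute names used in the corpora. Taking the last two host labels as the web domain is a naive assumption that fails for suffixes such as “co.uk”; the sketch does not describe how the TenTen pipeline actually computes these values.

```python
from typing import Dict
from urllib.parse import urlsplit


def url_attributes(url: str) -> Dict[str, str]:
    """Derive illustrative document attributes from a source URL."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".") if host else []
    return {
        "url": url,
        # Last host label, e.g. "com".
        "top_level_domain": labels[-1] if labels else "",
        # Naive assumption: the last two labels form the web domain.
        "web_domain": ".".join(labels[-2:]) if len(labels) >= 2 else host,
    }


if __name__ == "__main__":
    print(url_attributes("https://www.nytimes.com/section/world"))
```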

Paragraph

  • heading = 1 if the paragraph is a heading, 0 otherwise

Related papers

The TenTen Corpus Family by Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel at 7th International Corpus Linguistics Conference, Lancaster, July 2013.

Efficient Web Crawling for Large Text Corpora by Jan Pomikálek, Vít Suchomel at ACL SIGWAC Web as Corpus (at conference WWW), Lyon, April 2012.