TenTen Corpus Family
The TenTen Corpus Family (TenTen corpora) is a collection of text corpora created from the Web. All TenTen corpora are prepared according to the same criteria, which helps to ensure the quality of the resulting texts and makes the corpora comparable with each other.
Each corpus belongs to the TenTen corpus family, a set of web corpora processed in the same way, with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.
Description of preparing TenTen corpora
- Corpora are crawled from the Internet with the Spiderling tool, a web spider (crawler) built for linguistic purposes that collects texts from websites.
- The web download is followed by text cleaning: the texts are processed by jusText, a heuristic-based boilerplate removal tool that removes irrelevant (non-text or poor-text) content such as navigation links, advertisements, headers, and footers.
- The next step is tokenization, which divides the texts into tokens.
- Afterward, onion (a tool for removing duplicate parts) performs deduplication at the paragraph level.
- Finally, corpus texts are lemmatized and part-of-speech tagged for languages for which tagger and lemmatizer tools are available.
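To make the tokenization and paragraph-level deduplication steps above more concrete, here is a minimal Python sketch. It is an illustration only, not the implementation used for TenTen corpora: the function names `tokenize` and `deduplicate_paragraphs`, the regular expression, and the sample texts are all assumptions made for this example. The real tools are considerably more sophisticated than the exact-match comparison shown here.

```python
import hashlib
import re

# Toy tokenizer (assumption): a token is a run of word characters
# or a single punctuation mark. Real tokenizers are language-specific.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split a text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def deduplicate_paragraphs(documents):
    """Keep only the first occurrence of each paragraph across all documents.

    Paragraphs are compared after lowercasing and collapsing whitespace.
    This exact-match approach is a simplification for illustration; it is
    not how the onion tool works internally.
    """
    seen = set()
    result = []
    for doc in documents:
        kept = []
        for para in doc:
            normalized = " ".join(para.lower().split())
            key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(para)
        if kept:  # drop documents that became empty
            result.append(kept)
    return result


# Hypothetical sample data: each document is a list of paragraphs.
docs = [
    ["Corpora are built from the Web.", "Cookie policy: accept all."],
    ["Cookie policy:  accept all.", "Each text is tokenized."],
]

cleaned = deduplicate_paragraphs(docs)
tokens = [[tokenize(p) for p in doc] for doc in cleaned]
```

In this sketch the repeated cookie notice in the second document is dropped, while each remaining paragraph is split into word and punctuation tokens.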
Detailed information about these tools is available on the corpus.tools website, and the building of TenTen corpora is described in the bibliography (below).
The following corpus metadata (structural attributes in corpus linguistics) are shared by all TenTen corpora: