You must select a corpus before you can use any Sketch Engine feature. If you are a new user, it may not be clear which corpus to choose. Here are a few tips for beginners.

Featured corpora

Featured corpora are a good starting point for monolingual corpora. They have been pre-selected for their size and because they support the largest number of features.

No featured corpus?

If there is no featured corpus in your language, switch to All and use the drop-down to select the language and pick the largest corpus.


TenTen corpora are excellent general-purpose corpora. Their main advantage is their large size, typically several billion words.

TenTen is a new generation of web corpora, created using sophisticated web crawling. The downloaded texts go through several processing steps before they are included in the corpus: non-text material such as navigation menus, legal text or small print is stripped out, duplicate text is removed, and texts that are too short or contain too much content unsuitable for a corpus are discarded. TenTen stands for 10^10 (10 billion) words. TenTen corpora in detail»
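For readers curious about what such cleaning involves, the following is a minimal sketch of the three steps described above: stripping non-text lines, dropping short texts and removing duplicates. It is not the actual TenTen pipeline, which is considerably more sophisticated. The function name `clean_documents`, the `MIN_WORDS` threshold and the `BOILERPLATE_MARKERS` list are all hypothetical and chosen only for illustration.

```python
import hashlib

# Hypothetical values for illustration; not documented TenTen settings.
MIN_WORDS = 5
BOILERPLATE_MARKERS = ("cookie", "all rights reserved", "terms of use")


def clean_documents(raw_texts, min_words=MIN_WORDS):
    """Return cleaned texts: boilerplate lines stripped, short texts
    and exact duplicates removed."""
    seen = set()
    kept = []
    for text in raw_texts:
        # Drop empty lines and lines that look like navigation or small print.
        lines = [
            line.strip()
            for line in text.splitlines()
            if line.strip()
            and not any(marker in line.lower() for marker in BOILERPLATE_MARKERS)
        ]
        body = " ".join(lines)
        # Discard texts that are too short to be useful.
        if len(body.split()) < min_words:
            continue
        # Remove exact duplicates by hashing the lower-cased text.
        digest = hashlib.sha1(body.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(body)
    return kept
```

Calling `clean_documents` on a list of downloaded page texts returns only the texts that survive all three filters. This sketch catches exact duplicates only; detecting near-duplicates would need a different technique.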

Parallel corpora

Most parallel corpora in Sketch Engine are multilingual, i.e. they consist of the same text in many languages. Each language can also be used separately as a monolingual corpus.

Selecting a parallel corpus

You cannot select a parallel corpus as such. Instead:

  • select the first language
  • go to a feature (e.g. concordance search or bilingual word sketch)
  • when setting the criteria, select the second language
    (in a concordance search, you can even select more than one language)

OPUS corpora (recommended)

OPUS is a collection of translated texts from the web. It covers a wide selection of subjects and topics and is available in the largest number of languages, so it should be your first choice for parallel corpora. more on OPUS»

EUROPARL corpora

The corpus is created from the proceedings of the European Parliament and is available in 21 European languages. The nature of the corpus makes it a great resource for topics discussed in the European Parliament and for general formal language. Searches for language from topic areas that are rarely discussed in the European Parliament may not produce good results. more on EUROPARL»

EUR-Lex Corpus

This corpus is created from translated documents of the European Union and is available in the 24 official EU languages. It is recommended for general formal language and for subject areas covered in EU documents. Since EU documentation relates to many areas, it is suitable for general use too. more on EUR-Lex»

Corpus information

Detailed corpus information can be displayed by clicking the (i) info button next to each corpus.

More corpora

For a complete list of corpora, refer to the list of corpora.