The Europarl parallel corpus

The Europarl corpus is a parallel corpus built from the proceedings of the European Parliament in the official languages of the EU. It covers 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek. The corpus has been expanded several times, reaching a final size of around 60 million words per language.

Most languages in the Europarl corpus were processed with the TreeTagger tool, so lemmas and part-of-speech tags are available in those corpora.

Corpus data and more information can be found on the official website http://www.statmt.org/europarl/
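
Because the corpus is parallel, a typical first step with the downloaded data is to pair each sentence with its translation. The sketch below assumes a line-aligned layout: two plain-text inputs with one sentence per line, where line i of one corresponds to line i of the other. The function name aligned_pairs and the sample sentences are illustrative only and are not part of any official tooling.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def aligned_pairs(source: Iterable[str], target: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned inputs.

    Assumption: line i of `source` is the translation of line i of `target`.
    """
    for src, tgt in zip_longest(source, target):
        if src is None or tgt is None:
            # One input ended before the other, so the alignment is broken.
            raise ValueError("inputs have different numbers of lines")
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:  # skip pairs where one side is empty
            yield src, tgt


# Tiny in-memory stand-in for two aligned files (hypothetical sample data).
english = ["Resumption of the session\n", "\n", "Thank you.\n"]
french = ["Reprise de la session\n", "Texte\n", "Merci.\n"]
pairs = list(aligned_pairs(english, french))
```

The same function accepts open file objects in place of the lists, since both are iterables of lines.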

Tools to work with Europarl corpora

The complete set of Sketch Engine tools is available for the Europarl corpora. The tools generate:

  • word sketch – collocations categorized by grammatical relations (requires POS tagging)
  • thesaurus – synonyms and similar words for every word (requires POS tagging)
  • word lists – lists of nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
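
To make two of the outputs listed above concrete, here is a minimal sketch of an n-gram frequency list and a keyword-in-context concordance. It is not Sketch Engine's implementation: the function names are hypothetical, and lowercased whitespace splitting stands in for real tokenization.

```python
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Simplified tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()


def ngram_frequencies(tokens: List[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


def concordance(tokens: List[str], keyword: str, width: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


tokens = tokenize("the session is open and the session is closed")
bigrams = ngram_frequencies(tokens, 2)
kwic = concordance(tokens, "session", width=2)
```

Word sketches and the thesaurus depend on part-of-speech tags, so they cannot be approximated from raw tokens in the same way.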

Changelog

version 7 TreeTagger (spring 2015)

  • corpus tagged by TreeTagger

version 7.0 (May 2012)

  • A further expanded and improved version of the corpus was released on 15th May 2012.

version 5.0 (May 2010)

  • A further expanded and improved version of the corpus was released on 20th January 2010.

Bibliography

Koehn, P. (2005, September). Europarl: A parallel corpus for statistical machine translation. In MT summit (Vol. 5, pp. 79-86).

Other parallel corpora

EUR-Lex Corpora – public documents of the European Union (available in EUR-Lex database)

EUR-Lex judgments corpus – judgments of the Court of Justice

OPUS 2 parallel corpora

DGT-Translation Memory corpora

Polish-Swahili Bible corpora

Search the Europarl corpus

Sketch Engine offers a range of tools to work with the Europarl corpus.

Other text corpora in Sketch Engine

Sketch Engine offers 350+ language corpora.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in contexts, n-grams or extracting terms is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.