Europarl spoken parallel corpus

The Europarl parallel corpus

The Europarl corpus is a parallel corpus created from the proceedings of the European Parliament in the official languages of the EU. It includes 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek. The corpus was expanded repeatedly, reaching a final size of around 60 million words per language. The texts date from April 1996 to November 2011 (depending on the specific language pair) and correspond to version 7 of the Europarl corpus.

Most languages of the Europarl corpus were processed with the TreeTagger tool, so lemmas and part-of-speech tags are available in those corpora.

Corpus data and more information can be found on the official website http://www.statmt.org/europarl/

Tools to work with Europarl parallel corpora

A complete set of Sketch Engine tools is available to work with the Europarl spoken parallel corpora to generate:

word sketch – collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus
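To make two of the listed outputs concrete, here is a minimal Python sketch of an n-gram frequency list and a keyword-in-context concordance. It is an illustration only and does not reflect how Sketch Engine computes these results: the function names `ngram_counts` and `concordance`, the whitespace tokenization, and the toy sentence are all assumptions made for this example.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines with up to `width` tokens per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Toy sentence invented for the example, not taken from the corpus.
tokens = "the committee adopted the report and the committee closed the debate".split()

bigrams = ngram_counts(tokens, 2)
kwic = concordance(tokens, "committee", width=2)

for gram, freq in bigrams.most_common(3):
    print(freq, gram)
for line in kwic:
    print(line)
```

A real corpus query would work on lemmas and part-of-speech tags rather than raw whitespace tokens, which is why the tagged version of the corpus matters for the tools listed above.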

Changelog

version 7 TreeTagger (spring 2015)

corpus tagged by TreeTagger

version 7.0 (May 2012)

A further expanded and improved version of the corpus was released on 15th May 2012.

version 5.0 (May 2010)

A further expanded and improved version of the corpus was released on 20th January 2010.

Bibliography

Koehn, P. (2005, September). Europarl: A parallel corpus for statistical machine translation. In MT summit (Vol. 5, pp. 79-86).

Search the Europarl spoken parallel corpus

Sketch Engine offers a range of tools to work with this spoken parallel corpus.

Tip

Create your own multilingual or parallel corpora in Sketch Engine.

See our user guide.

More parallel corpora

DGT Translation Memory parallel corpora – European Union’s legislative documents

EUR-Lex 2/2016 parallel corpora – texts from the EUR-Lex database containing public EU documents

Eur-Lex judgments 12/2016 parallel corpora – judgments of the Court of Justice of the European Union

Open Parallel Corpus (OPUS) – translated texts from various sources, e.g. medical documents, subtitles, technical documentation, etc.

OpenSubtitles 2018 parallel corpora – movie subtitles from the OpenSubtitles database

United Nations Parallel Corpus (UNPC) – official records and other parliamentary documents of the United Nations


Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in context and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.

Quick Start Guide
