A general purpose multilingual corpus available in Sketch Engine

THe EUR-Lex Corpus is a multilingual corpus in all the official languages of the European Union. The corpus has been built from HTML files available in EUR-Lex database. Thanks to the coverage of a vast area of subjects, the corpus is an excellent general purpose resource for anyone looking for translation examples in many languages.

A substantial part of the documents is translated into all official languages of the European Union (currently 24). Languages which joined the EU later are represented by smaller corpora proportional to the length of their membership.

Structure

Technically speaking, the documents are segmented and aligned on paragraph level. This means that the user can search for a matching paragraph containing the translation. The paragraphs are, however,  fine-grained and usually correspond with sentences which means that the user is able to search for matching sentences or very short paragraphs.

Sketch Engine offers also the smaller corpus of judgments of the European Parliament, see more.

How to get the data

Academic and research institutions

The EUR-Lex corpus is released under CC-BY-NC-SA licence. Because of the file size, please email us at support@sketchengine.co.uk first and we will set up a temporary download link for you. For the original documents, see the official EUR-Lex website.

For commercial use

Please contact us for a quote.

How to cite

Please, consider mentioning Lexical Computing Ltd in Acknowledgements and referring to the original paper (below) if you use EUR-Lex corpus.

Vít Baisa, Jan Michelfeit, Marek Medveď, Miloš Jakubíček: European Union Language Resources in Sketch Engine. In The Proceedings of tenth International Conference on Language Resources and Evaluation (LREC’16). European Language Resources Association (ELRA). Portorož, Slovenia. 2016.

Important copyright notice

© European Union, 1998-2016

Except where otherwise stated, reuse of the EUR-Lex data for commercial or non-commercial purposes is authorised provided the source is acknowledged (see above). The reuse policy of the European Commission is implemented by the Commission Decision of 12 December 2011. Some documents, like the International Accounting Standards, may be subject to special conditions of use, which are mentioned in the respective Official Journal. For all other copyright issues regarding EUR-Lex, please contact op-info-copyright@publications.europa.eu.

A list of metadata in Eur-Lex corpora

  • Author (institution or person) of document: an author who created the document
  • CELLAR: a string storing in a single place all metadata and digital content managed by the Publications Office in a standardized way
  • CELEX number: number composed of the number of the sector, then 4 digits for the year, then one or two letters for the type of document and finally 2-4 digits for the number of the document (e.g. 42011L0019 – 4 is the sector, legislation; 2011 is the year of publication in the OJ*; L represents EU directives and 0019 is the number under which the directives was published in the OJ)
  • Date of document: specific date when the document was created
  • Date of publication in Official Journal: specific date when the document was publicized in the Official Journal of the European Union
  • Document category (CELEX sector): grouped more document types into one superior category
  • Document title: a title of the document
  • Document type: the main topic of the document, e.g. Financial Regulation Budget, etc.
  • EuroVoc thesaurus term: a term from the EUROVOC THESAURUS TERM
  • Last modification date: last date when the document was modified
  • Year of document: year when the document was created
  • Year of publication in Official Journal: year when the document was publicized in the Official Journal of the European Union

* the Official Journal of the European Union

EUR-Lex
number of aligned structures

The table shows the number of aligned structures (paragraphs in the case of EUR-Lex) for each pair of languages. In millions, click to enlarge.

The number of words is typically hundreds of millions of words for the largest languages .

Tip

Learn to work with multilingual and parallel corpora in Sketch Engine. Refer to the user guide.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in contexts, n-grams or extracting terms is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.