The corpus collection of 40 languages

The OPUS parallel corpus is a set of text corpora whose sentences are aligned, so that each sentence corresponds to its translation in the other languages. The OPUS project covers 40 languages, so users can look up translated sentence pairs across many languages.

The parallel corpora available here have been collected, prepared and aligned by Joerg Tiedemann in the OPUS project (see http://opus.lingfil.uu.se/). We are most grateful to him for his great work and co-operation. The data was prepared for the Sketch Engine using a range of lemmatisers, part-of-speech taggers and Sketch Grammars.

Unlike the first version, the alignment is now m:n, which allows for just one corpus per language.
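
To make the m:n notation concrete: an m:n link pairs m sentences on one side with n sentences on the other, so a sentence that a translator split or merged still lines up. The following Python sketch shows one possible way to represent such links. The class name `AlignmentLink`, the helper `aligned_text` and the example sentences are hypothetical and do not reflect how OPUS or Sketch Engine actually store alignments.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AlignmentLink:
    """One alignment link: m source sentences paired with n target sentences."""
    source_ids: Tuple[int, ...]
    target_ids: Tuple[int, ...]


def aligned_text(link: AlignmentLink,
                 source_sentences: List[str],
                 target_sentences: List[str]) -> Tuple[str, str]:
    """Return the joined source text and joined target text for one link."""
    src = " ".join(source_sentences[i] for i in link.source_ids)
    tgt = " ".join(target_sentences[j] for j in link.target_ids)
    return src, tgt


# Hypothetical example data (not taken from OPUS).
english = ["He arrived late.", "He apologised.", "Nobody minded."]
german = ["Er kam zu spät und entschuldigte sich.", "Niemand störte sich daran."]

links = [
    AlignmentLink(source_ids=(0, 1), target_ids=(0,)),  # 2:1 link
    AlignmentLink(source_ids=(2,), target_ids=(1,)),    # 1:1 link
]

pairs = [aligned_text(link, english, german) for link in links]
for src, tgt in pairs:
    print(f"{src}  <->  {tgt}")
```

In this sketch a 1:1 link is simply the special case where both tuples hold a single sentence index.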

OPUS, an open-source parallel corpus, lets you search bilingual and multilingual data in many languages and find concordances, collocations, word lists and more.

The OPUS project in Sketch Engine contains 40 languages: Afrikaans, Albanian, Arabic, Bosnian, Bulgarian, Chinese Simplified, Chinese Traditional, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Italian, Japanese, Korean, Latvian, Lithuanian, Macedonian, Norwegian, Persian, Polish, Portuguese, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Turkish, Ukrainian.

List of Sub-corpora:

  • ECB – European Central Bank corpus (v.0.1)
  • EMEA – European Medicines Agency documents (v.0.3)
  • EUconst – The European constitution (v.0.1)
  • EUROPARL – European Parliament Proceedings (v.3)
  • OpenOffice 3 corpus
  • Opensubs – Open Subtitles corpus (v.2)
  • KDE4 – KDE4 localization files (v.2)
  • KDEdoc – KDE manual corpus
  • MultiUN – Translated UN documents
  • OpenOffice
  • OpenOffice (v. 3)
  • OpenSubtitles2011 – Open Subtitles corpus (2011 version)
  • RF – Regeringsförklaringen – Declarations of Government Policy by the Swedish Government
  • SETIMES2 – A parallel corpus of the Balkan languages (v.2)
  • SPC – Stockholm Parallel Corpora (v.1)
  • TEP – The Tehran English-Persian subtitle corpus (v.0.1)
  • Tatoeba
  • TedTalks
  • UN – Translated UN documents
  • hrenWaC

List of tools & grammar used for various languages

| Language | Tools used | Grammar |
| --- | --- | --- |
| Arabic | Stanford tagger using Faster Arabic model trained on the Penn Arabic Treebank with Tagset | Universal Sketch Grammar with AMIRA Tagset |
| Bulgarian | TreeTagger using UTF-8 Bulgarian parameter file trained on Tagset | Yes |
| Chinese (Traditional, Simplified, mixed) | Segmented using Stanford Segmenter modelled on segmentation standards by Chinese Penn Treebank. Tagged using Stanford tagger trained on a combination of Chinese Treebank texts from Chinese and Hong Kong sources with Tagset | Universal Sketch Grammar with tags |
| Dutch | TreeTagger using UTF-8 Italian parameter file trained on Tagset | Dutch Sketch Grammar (NLWAC tagset) v4.0 by Carole Tiberius |
| English | TreeTagger using UTF-8 English parameter file trained on Tagset | English Sketch Grammar v.2.0 (Penn Treebank tagset) by Niels Ott |
| Estonian | TreeTagger using UTF-8 Estonian parameter file trained on Tagset | Estonian Sketch Grammar v1.2 by Maria and Jelena |
| French | TreeTagger using UTF-8 French parameter file trained on Tagset | French Sketch Grammar v1.0 by Adam Kilgarriff |
| German | TreeTagger using UTF-8 German parameter file trained on Tagset | Sketch Grammar for German by Matej Durco v3.3 |
| Italian | TreeTagger using UTF-8 Italian parameter file trained on Tagset | Sketch Grammar for Italian v1.2 by Marco Baroni |
| Portuguese | TreeTagger using Pablo Gamalo’s parameter file | Portuguese Wordsketches (Linguateca parsed data) v1.0 by Adam Kilgarriff & DP |
| Russian | TreeTagger using Serge Sharoff’s Russian parameter file trained on Tagset | Russian Wordsketches v1.0 by Maria Khokhlova |
| Spanish | TreeTagger using UTF-8 Spanish parameter file trained on Tagset | Spanish Wordsketches v1.0 by Nuria Bel and Hada Ross Salazar (Pompeu Fabra University, Barcelona) |

All other languages, which have no tagging tools, were only tokenised, using the Universal tokenizer built by Jan Pomikalek and inspired by Laurent Pointal’s TreeTagger wrapper.

Tools to work with the OPUS parallel corpora

A complete set of Sketch Engine tools is available to work with OPUS parallel corpora to generate:

  • word sketch – collocations categorized by grammatical relations (this function requires part-of-speech tagging)
  • thesaurus – synonyms and similar words for every word (this function requires part-of-speech tagging)
  • word lists – lists of nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
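
To illustrate what two of the outputs listed above contain, here is a minimal Python sketch of an n-gram frequency list and a keyword-in-context concordance over a tokenised text. It is not Sketch Engine's implementation. The function names `ngram_frequencies` and `concordance`, the whitespace tokenisation and the sample sentence are assumptions made for illustration only.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every contiguous sequence of n tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


def concordance(tokens, keyword, window=3):
    """Return (left context, hit, right context) for each keyword occurrence."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


# Hypothetical sample text, split on whitespace for simplicity.
text = ("the council adopted the regulation and "
        "the council published the regulation")
tokens = text.split()

bigrams = ngram_frequencies(tokens, 2)
kwic = concordance(tokens, "council", window=2)

for gram, count in bigrams.most_common(3):
    print(" ".join(gram), count)
for left, hit, right in kwic:
    print(f"{left:>10} [{hit}] {right}")
```

A word list is the same idea with n set to 1. Word sketches and the thesaurus additionally rely on part-of-speech tags, which this sketch does not model.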

Bibliographic references

For a more detailed description of the DGT-TM, including more statistics on the resource, see the following publication. When making reference to DGT-TM in scientific publications, please refer to:

Steinberger, R., Eisele, A., Klocek, S., Pilos, S., & Schlüter, P. (2013). DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

For a contrastive overview of DGT-TM and the other multilingual text resources offered for download on this site, you can read the following journal article:

Steinberger, R., Ebrahim, M., Poulis, A., Carrasco-Benitez, M., Schlüter, P., Przybyszewski, M., & Gilbro, S. (2014). An overview of the European Union’s highly multilingual parallel corpora. Language Resources and Evaluation, 48(4), 679-707.

Search the OPUS parallel corpus

Sketch Engine offers a range of tools to work with the OPUS parallel corpus.

Other text corpora in Sketch Engine

Sketch Engine offers 350+ language corpora.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in context and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.