The parallel corpora available here have been collected, prepared and aligned by Joerg Tiedemann in the OPUS project (see http://opus.lingfil.uu.se/). We are most grateful to him for his great work and co-operation. The data was prepared for the Sketch Engine using a range of lemmatisers, part-of-speech taggers and Sketch Grammars.
Unlike in the first version, the alignment is now m:n, meaning that a group of m segments in one language can be linked to a group of n segments in another. As a result, just one corpus per language is needed.
OPUS, an open source parallel corpus, lets you search bilingual and multilingual data in many languages and find concordances, collocations, word lists and more.
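As an illustration of what an m:n alignment means, here is a minimal Python sketch. The sentences, the `alignment` structure and the `aligned_pairs` helper are all hypothetical and chosen only for this example; they do not reflect the actual storage format used by OPUS or the Sketch Engine.

```python
# Hypothetical example data: three English segments, two German segments.
english = ["It was late.", "We went home.", "Nobody spoke on the way."]
german = ["Es war spät, und wir gingen nach Hause.", "Unterwegs sprach niemand."]

# Each link pairs a group of source segment indices with a group of
# target segment indices. Groups may differ in size, hence "m:n".
alignment = [
    ((0, 1), (0,)),  # 2:1 link: two English segments, one German segment
    ((2,), (1,)),    # 1:1 link
]


def aligned_pairs(source, target, links):
    """Return one (source_text, target_text) pair per alignment link."""
    pairs = []
    for src_ids, tgt_ids in links:
        src_text = " ".join(source[i] for i in src_ids)
        tgt_text = " ".join(target[j] for j in tgt_ids)
        pairs.append((src_text, tgt_text))
    return pairs


for src, tgt in aligned_pairs(english, german, alignment):
    print(f"{src}  |||  {tgt}")
```

Because the links refer to segments by position and do not require the two texts to be segmented identically, the same English corpus could in principle be linked to any number of other languages.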
The OPUS project in Sketch Engine contains 40 languages: Afrikaans, Albanian, Arabic, Bosnian, Bulgarian, Chinese Simplified, Chinese Traditional, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Italian, Japanese, Korean, Latvian, Lithuanian, Macedonian, Norwegian, Persian, Polish, Portuguese, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Turkish, Ukrainian.
List of Sub-corpora:
- ECB – European Central Bank corpus (v.0.1)
- EMEA – European Medicines Agency documents (v.0.3)
- EUconst – The European constitution (v.0.1)
- EUROPARL – European Parliament Proceedings (v.3)
- OpenOffice 3 corpus
- Opensubs – Open Subtitles corpus (v.2)
- KDE4 – KDE4 localization files (v.2)
- KDEdoc – KDE manual corpus
- MultiUN – Translated UN documents
- OpenOffice (v. 3)
- OpenSubtitles2011 – Open Subtitles corpus (2011 version)
- RF – Regeringsförklaringen – Declarations of Government Policy by the Swedish Government
- SETIMES2 – A parallel corpus of the Balkan languages (v.2)
- SPC – Stockholm Parallel Corpora (v.1)
- TEP – The Tehran English-Persian subtitle corpus (v.0.1)
- UN – Translated UN documents
List of tools & grammar used for various languages
All other languages, for which no tagging tools were available, were only tokenised, using the Universal tokenizer built by Jan Pomikalek and inspired by Laurent Pointal’s TreeTagger wrapper.