MATAS: the Morphologically Annotated Lithuanian Corpus

The Morphologically Annotated Lithuanian Corpus (MATAS) is a corpus of Lithuanian texts from different genres. It was compiled and prepared by the Center of Computational Linguistics (CCL) at Vytautas Magnus University. The corpus consists of 739,176 words with manual annotation that indicates detailed grammatical categories. The texts are extracted from the Corpus of the Contemporary Lithuanian Language at CCL (a 100-million-word corpus).

For more information see https://clarin.vdu.lt/xmlui/handle/20.500.11821/9?show=full

Part-of-speech tagset

The MATAS corpus is manually annotated at the morphological level with the following POS tagset.

Access policy

Access to the corpus is limited to academic use. To gain access, send an email to support@sketchengine.co.uk with proof of your academic affiliation.

Distribution of text genres

Tools to work with the Morphologically Annotated Lithuanian Corpus

A complete set of Sketch Engine tools is available to work with this Lithuanian corpus to generate:

  • word lists – lists of Lithuanian nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency lists of multi-word units
  • concordance – examples in context
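
To make the three outputs listed above concrete, here is a minimal Python sketch of what a word list, an n-gram frequency list and a concordance contain. It is only an illustration under stated assumptions: the sample sentence is made up and not taken from MATAS, the helper names `ngrams` and `kwic` are hypothetical, and this is not how Sketch Engine itself computes these results.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each occurrence."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


# Made-up sample text (an assumption for illustration, not MATAS data).
tokens = "mano namas yra didelis ir mano namas yra senas".split()

# Word list: tokens ordered by frequency.
word_list = Counter(tokens).most_common()

# N-grams: two-word units ordered by frequency.
bigram_list = Counter(ngrams(tokens, 2)).most_common()

# Concordance: each occurrence of "namas" with two words of context.
concordance = kwic(tokens, "namas")

for left, word, right in concordance:
    print(f"{left:>10} [{word}] {right}")
```

A real corpus query would also use the lemma and POS annotation, for example to restrict a word list to nouns; the sketch works on raw word forms only.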

Bibliography

Rimkutė, E. (2014). Lithuanian morphologically annotated corpus-MATAS, CLARIN-LT digital library in the Republic of Lithuania, http://hdl.handle.net/20.500.11821/9.

Search MATAS corpus

Sketch Engine offers a range of tools to work with the Morphologically Annotated Lithuanian Corpus.
