Corpus of Classical Tibetan

The Annotated Corpus of Classical Tibetan (ACTib) is a corpus containing 80 million words of Classical Tibetan. It has been available in Sketch Engine since spring 2017.

The corpus texts were taken from the e-text collection of the Buddhist Digital Resource Center. The ACTib corpus was lemmatized and part-of-speech tagged. A word sketch grammar for the Tibetan language was also prepared, enabling users to explore the grammatical and collocational behavior of Tibetan words.

The corpus was built as part of the Tibetan in Digital Communication project. More information about the project and the authors’ contact details can be found on the project page.

Part-of-speech tagging

The morphosyntactic annotation was produced with the Memory-Based Tagger (Hill & Garrett 2017), trained on 300,000 words drawn from three distinct collections (Classical corpus, Saint Petersburg corpus, Tibetan catalogue of the Berlin State Library). See the Tibetan part-of-speech tagset.

Availability

The corpus is accessible in Sketch Engine to all users, including trial users, and can also be downloaded in its entirety from Zenodo.

DOI for Segmented version: forthcoming

DOI for POS-tagged version: https://doi.org/10.5281/zenodo.822537
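
For readers who want to script the Zenodo download, the following is a minimal sketch of how the DOI above can be turned into a metadata request. It assumes Zenodo's public REST API serves record metadata at https://zenodo.org/api/records/<id>, where <id> is the number at the end of the DOI. The function names are illustrative only and are not part of any ACTib or Sketch Engine tooling.

```python
import json
import re
import urllib.request

# DOI of the POS-tagged version, as listed above.
POS_TAGGED_DOI = "https://doi.org/10.5281/zenodo.822537"


def zenodo_record_id(doi):
    """Return the numeric record ID at the end of a Zenodo DOI or DOI URL."""
    match = re.search(r"10\.5281/zenodo\.(\d+)$", doi.strip())
    if match is None:
        raise ValueError("not a Zenodo DOI: %r" % doi)
    return match.group(1)


def zenodo_api_url(doi):
    """Build the metadata URL for a record (assumed endpoint layout)."""
    return "https://zenodo.org/api/records/" + zenodo_record_id(doi)


def fetch_record(doi):
    """Download and decode the record metadata; requires network access."""
    with urllib.request.urlopen(zenodo_api_url(doi)) as response:
        return json.load(response)
```

Calling fetch_record(POS_TAGGED_DOI) returns the record metadata as a dictionary. The sketch makes no assumption about the layout of that dictionary, so inspect it before relying on specific keys such as file links.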

Changelog

ACTib 2.1

  • forthcoming

ACTib 2.0

  • 80 million words automatically segmented and POS-tagged (no manual correction)
  • added a word sketch grammar for the Tibetan language

ACTib 1.0

  • initial size 21 million words automatically segmented and POS-tagged (no manual correction)

Bibliography

Garrett, Edward and Hill, Nathan W. and Kilgarriff, Adam and Vadlapudi, Ravikiran and Zadoks, Abel (2015) ‘The contribution of corpus linguistics to lexicography and the future of Tibetan dictionaries.’ Revue d’Etudes Tibétaines, 32. pp. 51-86.

Hill, Nathan W. and Garrett, Edward (2017) ‘A part-of-speech (POS) tagged corpus of Classical Tibetan’ [Data set]. Zenodo. http://doi.org/10.5281/zenodo.574878

Meelen, Marieke and Hill, Nathan W. (forthcoming) ‘Segmenting and POS tagging Classical Tibetan’ in Himalayan Linguistics.

Hill, Nathan W. and Meelen, Marieke (forthcoming) ‘Creating an Annotated Corpus of Classical Tibetan (ACTib)’.

How to cite?

Meelen, Marieke; Hill, Nathan; Handy, Christopher (2017b), The Annotated Corpus of Classical Tibetan (ACTib), Part II – POS-tagged version, based on the BDRC digitised text collection, tagged with the Memory-Based Tagger from TiMBL. (https://doi.org/10.5281/zenodo.822537).

Search the Tibetan corpus

Sketch Engine offers a range of tools to work with the Tibetan corpus.
