Corpus of Classical Tibetan

The Annotated Corpus of Classical Tibetan (ACTib) version 2.0 is a Tibetan corpus containing 170 million words. It consists of Classical Tibetan electronic texts compiled by the Buddhist Digital Resource Center and was built as part of the Tibetan in Digital Communication project (2012-2015). The corpus can be downloaded from its Zenodo repository.

Part-of-speech tagging

The corpus is lemmatized and PoS tagged using the TreeTagger tool created by Helmut Schmid. The TreeTagger model for Tibetan was trained by Yeshe Tenley (the parameter file and training corpus can be found here). The lexicon, corpus, and enumeration of tags in the training data come from Dr. Nathan Hill.
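As a rough illustration of what lemmatized, PoS-tagged output looks like to a program, the sketch below reads TreeTagger-style lines, where each token sits on its own line as word, tag and lemma separated by tabs. Whether the downloadable ACTib files use exactly this layout is an assumption, and the `NOUN` and `VERB` tags and the two sample tokens are placeholders, not entries from the actual tagset or corpus.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    word: str
    tag: str
    lemma: str


def parse_treetagger(lines: Iterable[str]) -> Iterator[Token]:
    """Yield Token records from tab-separated word/tag/lemma lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip markup or malformed lines
        yield Token(*parts)


# Placeholder sample: tags are hypothetical, not the ACTib tagset.
sample = "བླ་མ\tNOUN\tབླ་མ\nཡིན\tVERB\tཡིན\n"
tokens = list(parse_treetagger(sample.splitlines()))
for tok in tokens:
    print(tok.word, tok.tag, tok.lemma)
```

Lines that do not split into exactly three fields are skipped rather than raising an error, which keeps the reader tolerant of structural markup that may be interleaved with the tokens.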

Availability

The corpus is accessible to all users, including trial users, in Sketch Engine, or it can be downloaded in its entirety from the Zenodo repository.

DOI for part-of-speech-tagged version: 10.5281/zenodo.3785070

Tools to work with the Corpus of Classical Tibetan

A complete set of Sketch Engine tools is available to work with this Tibetan corpus to generate:

  • word sketch – Tibetan collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word units
  • word lists – lists of Tibetan nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • text type analysis – statistics of metadata in the corpus
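To make a few of the outputs listed above concrete, here is a minimal sketch of how a word list, an n-gram frequency list and a concordance can be computed from an already segmented token sequence. It is not Sketch Engine's implementation; the function names are hypothetical and the tokens are placeholders, not corpus data.

```python
from collections import Counter
from typing import List, Tuple


def word_list(tokens: List[str]) -> List[Tuple[str, int]]:
    """Return (token, frequency) pairs, most frequent first."""
    return Counter(tokens).most_common()


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens: List[str], keyword: str, width: int = 2):
    """Return (left context, keyword, right context) for each hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    return hits


# Placeholder tokens standing in for segmented text.
tokens = ["ka", "kha", "ka", "kha", "ga"]
print(word_list(tokens))
print(ngrams(tokens, 2).most_common())
print(concordance(tokens, "kha", width=1))
```

All three functions assume segmentation has already been done, which is why a segmented corpus is the starting point for these tools.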

Meelen, Marieke, & Roux, Élie. (2020). The Annotated Corpus of Classical Tibetan (ACTib) – Version 2.0 (Segmented & POS-tagged) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3951503

Garrett, Edward, Hill, Nathan W., Kilgarriff, Adam, Vadlapudi, Ravikiran, & Zadoks, Abel. (2015). The contribution of corpus linguistics to lexicography and the future of Tibetan dictionaries. Revue d’Etudes Tibétaines, 32, pp. 51-86.

Hill, Nathan W., & Garrett, Edward. (2017). A part-of-speech (POS) tagged corpus of Classical Tibetan [Data set]. Zenodo. http://doi.org/10.5281/zenodo.574878

Meelen, Marieke, & Hill, Nathan W. (forthcoming). Segmenting and POS tagging Classical Tibetan. Himalayan Linguistics.

Hill, Nathan W., & Meelen, Marieke. (forthcoming). Creating an Annotated Corpus of Classical Tibetan (ACTib).

ACTib 2.0

  • 197 million tokens

ACTib 1.0

  • 90 million tokens automatically segmented and POS-tagged (no manual correction)
  • word sketch grammar created for the Tibetan language

Initial version of ACTib

  • 21 million words automatically segmented and POS-tagged (no manual correction)
