Tibetan corpus | Sketch Engine

Corpus of Classical Tibetan

The Annotated Corpus of Classical Tibetan (ACTib) version 2.0 is a Tibetan corpus containing 170 million words. The corpus consists of Classical Tibetan texts and was built as part of the Tibetan in Digital Communication project (2012-2015). ACTib draws on a collection of Tibetan electronic texts compiled by the Buddhist Digital Resource Center and can be downloaded from the Zenodo repository.

Part-of-speech tagging

The corpus is lemmatized and PoS tagged using the TreeTagger tool created by Helmut Schmid. The TreeTagger model for Tibetan was trained by Yeshe Tenley (the parameter file and training corpus can be found here). The lexicon, corpus, and enumeration of tags in the training data come from Dr. Nathan Hill.
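As a rough illustration of working with tagged and lemmatized output, the sketch below reads a three-column, tab-separated layout (token, tag, lemma), which is what TreeTagger prints when token and lemma output are both enabled. It is an assumption that a given download of the corpus uses exactly this layout. The sample rows and the tag names `NOUN` and `VERB` are placeholders for illustration only, not the tagset used in ACTib.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class TaggedToken(NamedTuple):
    token: str
    tag: str
    lemma: str


def parse_vertical(lines: Iterable[str]) -> list[TaggedToken]:
    """Parse token<TAB>tag<TAB>lemma lines, skipping blank or malformed rows."""
    parsed = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # blank line or a row that does not have three columns
        parsed.append(TaggedToken(*parts))
    return parsed


# Illustrative rows in Wylie transliteration; tags are placeholders,
# not the actual ACTib tagset.
SAMPLE = [
    "chos\tNOUN\tchos\n",
    "bstan\tVERB\tston\n",
    "\n",
    "chos\tNOUN\tchos\n",
]

tokens = parse_vertical(SAMPLE)
tag_counts = Counter(t.tag for t in tokens)    # frequency of each tag
lemma_counts = Counter(t.lemma for t in tokens)  # frequency of each lemma
```

Counting by lemma rather than by surface token groups inflected forms together, which is the main practical benefit of a lemmatized corpus.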

Availability

The corpus is accessible to all Sketch Engine users, including trial users, and can be downloaded in its entirety from the Zenodo repository.

DOI for part-of-speech-tagged version: 10.5281/zenodo.3785070

Tools to work with the Corpus of Classical Tibetan

A complete set of Sketch Engine tools is available to work with this Tibetan corpus to generate:

word sketch – Tibetan collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word units
word lists – lists of Tibetan nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus
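To make three of the outputs listed above more concrete (word lists, n-grams and concordance), here is a toy sketch over an already segmented token list. It is not how Sketch Engine implements these tools. The function names and the placeholder tokens `w1`, `w2`, `w3` are hypothetical and stand in for real segmented Tibetan words.

```python
from collections import Counter


def word_list(tokens):
    """Return (token, frequency) pairs, most frequent first."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each hit."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            rows.append((left, tok, right))
    return rows


# Placeholder tokens standing in for a segmented text.
sample = ["w1", "w2", "w1", "w2", "w3"]

frequencies = word_list(sample)
bigrams = ngrams(sample, 2)
hits = concordance(sample, "w2", width=1)
```

All three functions assume the text is already split into words, which is why automatic segmentation of the corpus is a prerequisite for these tools.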

How to cite?

Meelen, Marieke, & Roux, Élie. (2020). The Annotated Corpus of Classical Tibetan (ACTib) – Version 2.0 (Segmented & POS-tagged) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3951503

Bibliography

Garrett, Edward and Hill, Nathan W. and Kilgarriff, Adam and Vadlapudi, Ravikiran and Zadoks, Abel (2015) ‘The contribution of corpus linguistics to lexicography and the future of Tibetan dictionaries.’ Revue d’Etudes Tibétaines, 32. pp. 51-86.

Hill, Nathan W., & Garrett, Edward. (2017). A part-of-speech (POS) tagged corpus of Classical Tibetan [Data set]. Zenodo. http://doi.org/10.5281/zenodo.574878

Meelen, Marieke and Hill, Nathan W. (forthcoming) ‘Segmenting and POS tagging Classical Tibetan’ in Himalayan Linguistics.

Hill, Nathan W. and Meelen, Marieke (forthcoming) ‘Creating an Annotated Corpus of Classical Tibetan (ACTib)’.

Changelog

ACTib 2.0

197 million tokens

ACTib 1.0

90 million tokens automatically segmented and POS-tagged (no manual correction)
created word sketch grammar for the Tibetan language

initial version of ACTib

initial size of 21 million words automatically segmented and POS-tagged (no manual correction)

Search the ACTib corpus

Sketch Engine offers a range of tools to work with the Tibetan corpus.


Other text corpora in Sketch Engine

Sketch Engine offers 800+ language corpora.


Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in context, and n-grams, or extract terms with Sketch Engine. Use our Quick Start Guide to learn it in minutes.

