Tajik part-of-speech tagset

A tagset is a list of part-of-speech tags (POS tags for short), i.e. labels used to indicate the part of speech and sometimes also other grammatical categories (case, tense etc.) of each token in a text corpus.

The Tajik part-of-speech tagset is used in the Tajik corpus. Each tag entry consists of a lemma and a two-digit number identifying the part of speech. The numbers run from 01 to 16, one for each of the 16 categories. The part-of-speech tagging has not been disambiguated, which means a single word can carry more than one tag.

The Tajik tagger was developed by Gulshan Dovudov, Vít Suchomel and Pavel Šmerk and introduced at the Workshop on Language Technology for Normalisation of Less-Resourced Languages in 2012.


An example of a tag query in the CQL concordance search box: [tag=".+01.*"] finds all nouns, e.g. сол, дар. (Note: make sure to use straight double quotation marks.)
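To see why this pattern picks out nouns, the regular expression can be tried against a few tag values outside the corpus. The sketch below uses Python's `re.fullmatch`, on the assumption that CQL matches a regular expression against the whole attribute value. The sample tag strings are taken from the examples in the tagset table on this page, and the helper name `matches_tag_query` is hypothetical.

```python
import re

# Pattern taken from the CQL query [tag=".+01.*"].
NOUN_PATTERN = re.compile(r".+01.*")


def matches_tag_query(tag_value: str, pattern: re.Pattern = NOUN_PATTERN) -> bool:
    """Return True if the pattern matches the entire tag value.

    Assumption: CQL anchors the regular expression to the whole
    attribute value, which re.fullmatch imitates here.
    """
    return pattern.fullmatch(tag_value) is not None


# Tag values as they appear in the tagset table (lemma:number).
samples = ["сол:01", "нав:02", "нафар:01,нафар:16", "аст:05"]
noun_hits = [t for t in samples if matches_tag_query(t)]
print(noun_hits)  # ['сол:01', 'нафар:01,нафар:16']
```

Because the tagging is not disambiguated, a word such as нафар, whose tag value lists both 01 and 16, is also returned by the noun query.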

Tagset

Tag	Description	Example (lemma+tag)
01	nouns	сол:01
02	adjectives	нав:02
03	numerals	як:03
04	pronouns	он:04
05	verbs	аст:05
06	infinitives	кардан:06
07	adjectival participles	карда:07
08	adverbial participles	ситез:08
09	adverbs	пас:09
10	prepositions	ба:10
11	postpositions	қатӣ:11
12	conjunctions	аммо:12
13	particles	низ:13
14	interjections	а:14
15	onomatopoeia	шарр:15
16	numeratives	нафар:01,нафар:16
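When working with exported corpus data, a tag value can be split back into its lemma and category. The following is a minimal sketch, assuming that alternative analyses are separated by commas and that each one has the form lemma:number, as the examples in the table suggest. The names `CATEGORIES` and `parse_tag` are hypothetical and not part of any Sketch Engine API.

```python
# Tag numbers and descriptions copied from the tagset table.
CATEGORIES = {
    "01": "nouns",
    "02": "adjectives",
    "03": "numerals",
    "04": "pronouns",
    "05": "verbs",
    "06": "infinitives",
    "07": "adjectival participles",
    "08": "adverbial participles",
    "09": "adverbs",
    "10": "prepositions",
    "11": "postpositions",
    "12": "conjunctions",
    "13": "particles",
    "14": "interjections",
    "15": "onomatopoeia",
    "16": "numeratives",
}


def parse_tag(tag_value: str) -> list[tuple[str, str, str]]:
    """Split a tag value into (lemma, number, description) triples.

    Assumption: alternative analyses are comma-separated and each one
    is written as lemma:number, as in the table's examples.
    """
    readings = []
    for entry in tag_value.split(","):
        # Split on the last colon so the number is always the final part.
        lemma, _, number = entry.rpartition(":")
        readings.append((lemma, number, CATEGORIES.get(number, "unknown")))
    return readings


print(parse_tag("нафар:01,нафар:16"))
# [('нафар', '01', 'nouns'), ('нафар', '16', 'numeratives')]
```

An ambiguous word yields one triple per possible analysis, which reflects the fact that the tagging has not been disambiguated.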

Reference

DOVUDOV, Gulshan, Vít SUCHOMEL and Pavel ŠMERK. POS Annotated 50M Corpus of Tajik Language. In Proceedings of the Workshop on Language Technology for Normalisation of Less-Resourced Languages (SALTMIL 8/AfLaT 2012). Istanbul: European Language Resources Association (ELRA), 2012. pp. 93-98. ISBN 978-2-9517408-7-7.
