A tagset is a list of part-of-speech tags (POS tags for short), i.e. labels used to indicate the part of speech and sometimes also other grammatical categories (case, tense etc.) of each token in a text corpus.

The English CLAWS part-of-speech tagset, version 5, is used in English corpora annotated with CLAWS (the Constituent Likelihood Automatic Word-tagging System), a tagger developed by UCREL (the University Centre for Computer Corpus Research on Language) at Lancaster University. This page lists the fifth version of the CLAWS tagset.

An example of a tag in the CQL concordance search box: [tag="VBD"] finds all past forms of the verb “be”: was, were. (Note: make sure that you use straight double quotation marks in the query.)
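The effect of such a tag query can be imitated outside the search box. The following is a minimal Python sketch, not the corpus tool's own implementation. It assumes that the tag value is treated as a regular expression that must match the whole tag. The token list and the helper name find_by_tag are hypothetical and serve only as illustration.

```python
import re

# Hypothetical sample of tagged tokens as (word, tag) pairs.
# The tags follow the table on this page; the sentence is made up.
tokens = [
    ("It", "PNP"), ("was", "VBD"), ("cold", "AJ0"), ("and", "CJC"),
    ("they", "PNP"), ("were", "VBD"), ("being", "VBG"), ("careful", "AJ0"),
    (".", "PUN"),
]

def find_by_tag(tokens, pattern):
    """Return the words whose tag fully matches the regular expression."""
    tag_re = re.compile(pattern)
    return [word for word, tag in tokens if tag_re.fullmatch(tag)]

past_be = find_by_tag(tokens, "VBD")  # only past forms of "be"
any_be = find_by_tag(tokens, "VB.")   # any "be" tag: VBB, VBD, VBG, ...
```

Because the whole tag must match, a pattern such as "VB." covers every form of “be” listed in the tagset, while "VBD" covers only was and were.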

TAGSET

POS Tag Description
AJ0 adjective (unmarked) (e.g. GOOD, OLD)
AJC comparative adjective (e.g. BETTER, OLDER)
AJS superlative adjective (e.g. BEST, OLDEST)
AT0 article (e.g. THE, A, AN)
AV0 adverb (unmarked) (e.g. OFTEN, WELL, LONGER, FURTHEST)
AVP adverb particle (e.g. UP, OFF, OUT)
AVQ wh-adverb (e.g. WHEN, HOW, WHY)
CJC coordinating conjunction (e.g. AND, OR)
CJS subordinating conjunction (e.g. ALTHOUGH, WHEN)
CJT the conjunction THAT
CRD cardinal numeral (e.g. 3, FIFTY-FIVE, 6609) (excluding ONE)
DPS possessive determiner form (e.g. YOUR, THEIR)
DT0 general determiner (e.g. THESE, SOME)
DTQ wh-determiner (e.g. WHOSE, WHICH)
EX0 existential THERE
ITJ interjection or other isolate (e.g. OH, YES, MHM)
NN0 noun (neutral for number) (e.g. AIRCRAFT, DATA)
NN1 singular noun (e.g. PENCIL, GOOSE)
NN2 plural noun (e.g. PENCILS, GEESE)
NP0 proper noun (e.g. LONDON, MICHAEL, MARS)
NULL the null tag (for items not to be tagged)
ORD ordinal (e.g. SIXTH, 77TH, LAST)
PNI indefinite pronoun (e.g. NONE, EVERYTHING)
PNP personal pronoun (e.g. YOU, THEM, OURS)
PNQ wh-pronoun (e.g. WHO, WHOEVER)
PNX reflexive pronoun (e.g. ITSELF, OURSELVES)
POS the possessive (or genitive morpheme) 'S or '
PRF the preposition OF
PRP preposition (except for OF) (e.g. FOR, ABOVE, TO)
PUL punctuation – left bracket (i.e. ( or [ )
PUN punctuation – general mark (i.e. . ! , : ; – ? … )
PUQ punctuation – quotation mark (i.e. ` ‘ ” )
PUR punctuation – right bracket (i.e. ) or ] )
TO0 infinitive marker TO
UNC “unclassified” items which are not words of the English lexicon
VBB the “base forms” of the verb “BE” (except the infinitive), i.e. AM, ARE
VBD past form of the verb “BE”, i.e. WAS, WERE
VBG -ing form of the verb “BE”, i.e. BEING
VBI infinitive of the verb “BE”
VBN past participle of the verb “BE”, i.e. BEEN
VBZ -s form of the verb “BE”, i.e. IS, 'S
VDB base form of the verb “DO” (except the infinitive), i.e. DO
VDD past form of the verb “DO”, i.e. DID
VDG -ing form of the verb “DO”, i.e. DOING
VDI infinitive of the verb “DO”
VDN past participle of the verb “DO”, i.e. DONE
VDZ -s form of the verb “DO”, i.e. DOES
VHB base form of the verb “HAVE” (except the infinitive), i.e. HAVE
VHD past tense form of the verb “HAVE”, i.e. HAD, 'D
VHG -ing form of the verb “HAVE”, i.e. HAVING
VHI infinitive of the verb “HAVE”
VHN past participle of the verb “HAVE”, i.e. HAD
VHZ -s form of the verb “HAVE”, i.e. HAS, 'S
VM0 modal auxiliary verb (e.g. CAN, COULD, WILL, 'LL)
VVB base form of lexical verb (except the infinitive) (e.g. TAKE, LIVE)
VVD past tense form of lexical verb (e.g. TOOK, LIVED)
VVG -ing form of lexical verb (e.g. TAKING, LIVING)
VVI infinitive of lexical verb
VVN past participle form of lexical verb (e.g. TAKEN, LIVED)
VVZ -s form of lexical verb (e.g. TAKES, LIVES)
XX0 the negative NOT or N’T
ZZ0 alphabetical symbol (e.g. A, B, c, d)

NOTE: “DITTO TAGS”

Any of the tags listed above may, in theory, be modified by the addition of a pair of digits, e.g. DD21, DD22. This signifies that the tag occurs as part of a sequence of similar tags, representing a sequence of words which for grammatical purposes are treated as a single unit. (The examples in this note use tags from the larger CLAWS7 tagset, such as II for a general preposition.) For example, the expression “in terms of” is treated as a single preposition, receiving the tags:

		 in_II31 terms_II32 of_II33 

The first of the two digits indicates the number of words/tags in the sequence, and the second digit the position of each word within that sequence.
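The two-digit convention can be made concrete with a short Python sketch. It is an illustration only, not part of CLAWS. It assumes tokens are written as word_TAG and that no plain tag ends in two digits, which holds for the tags listed on this page. The function names parse_ditto and group_units are hypothetical.

```python
import re

# A ditto tag is a base tag followed by two digits:
# sequence length, then position within the sequence.
DITTO_RE = re.compile(r"(.+)(\d)(\d)")

def parse_ditto(tagged):
    """Split 'word_TAG' into (word, base_tag, length, position).

    For a plain tag, length and position are None.
    """
    word, _, tag = tagged.rpartition("_")
    match = DITTO_RE.fullmatch(tag)
    if match is None:
        return word, tag, None, None
    base, length, position = match.group(1), int(match.group(2)), int(match.group(3))
    if not 1 <= position <= length:
        # Not a valid ditto suffix, so treat it as a plain tag.
        return word, tag, None, None
    return word, base, length, position

def group_units(tagged_tokens):
    """Merge ditto-tagged sequences into single (words, base_tag) units."""
    units, current = [], []
    for item in tagged_tokens:
        word, base, length, position = parse_ditto(item)
        if length is None:
            units.append(([word], base))
            continue
        current.append(word)
        if position == length:  # last word of the multi-word unit
            units.append((current, base))
            current = []
    return units

units = group_units(["in_II31", "terms_II32", "of_II33", "cost_NN1"])
```

Here the three words of “in terms of” end up in one unit with the base tag II, while cost keeps its plain tag NN1.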

Such ditto tags are not included in the lexicon, but are assigned automatically by a program called IDIOMTAG which looks for a range of multi-word sequences included in the idiomlist. The following sample entries from the idiomlist show that syntactic ambiguity is taken into account, and also that, depending on the context, ditto tags may or may not be required for a particular word sequence:

		at_RR21 length_RR22
		a_DD21/RR21 lot_DD22/RR22
		in_CS21/II that_CS22/DD1
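The slash notation in these sample entries can be read as alternative taggings of the same word sequence. The Python sketch below separates those alternatives and reports which of them use ditto tags. It is based only on the three sample lines shown above. The real idiomlist format may differ, the sketch assumes every word in an entry lists the same number of alternatives, and the function names are hypothetical.

```python
import re

# Assumption: a ditto tag is any tag that ends in two digits.
DITTO_RE = re.compile(r".+\d\d")

def is_ditto(tag):
    """Return True if the tag carries a two-digit ditto suffix."""
    return DITTO_RE.fullmatch(tag) is not None

def parse_entry(line):
    """Split an entry into (word, [alternative tags]) pairs."""
    entry = []
    for item in line.split():
        word, _, tags = item.rpartition("_")
        entry.append((word, tags.split("/")))
    return entry

def readings(line):
    """Return the words and, per alternative, (tags, uses_ditto_tags)."""
    entry = parse_entry(line)
    words = [word for word, _ in entry]
    alternatives = []
    # zip(*...) lines up the first alternative of every word, then the second, etc.
    for tags in zip(*(tag_list for _, tag_list in entry)):
        alternatives.append((list(tags), all(is_ditto(t) for t in tags)))
    return words, alternatives

words, alts = readings("in_CS21/II that_CS22/DD1")
```

For the last sample entry this yields one reading in which “in that” is a single two-word unit (CS21, CS22) and one in which the two words are tagged independently (II, DD1).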

Source: http://ucrel.lancs.ac.uk/claws5tags.html
