Shallow tagging is used for languages that cannot be tagged with an existing tagger. Each token receives one of the following tags, assigned using regular expressions and the frequency properties of tokens:

  • FREQ – frequent words (the 200 most frequent words in the language)
  • CONTENT – other words
  • CRD – numerals
  • PUN – punctuation
  • OTHER – any other token
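
The tagset above can be illustrated with a minimal Python sketch. It is not the actual implementation: the regular expressions, the function names (build_frequent_set, shallow_tag), the lowercasing, and the order in which the rules are checked are all assumptions made for illustration. Only the five tags and the 200-word threshold come from the description above.

```python
import re
from collections import Counter

# Assumed patterns; the real regular expressions are not given in this document.
NUMERAL_RE = re.compile(r"\d+(?:[.,]\d+)*")               # 42, 3.14, 1,000
PUNCT_RE = re.compile(r"[^\w\s]+")                        # . , ! ...
WORD_RE = re.compile(r"[^\W\d_]+(?:['\-][^\W\d_]+)*")     # letters, optional ' or - inside


def build_frequent_set(tokens, top_n=200):
    """Return the top_n most frequent word forms (lowercased) in the token list."""
    counts = Counter(t.lower() for t in tokens if WORD_RE.fullmatch(t))
    return {word for word, _ in counts.most_common(top_n)}


def shallow_tag(token, frequent):
    """Assign one of FREQ, CONTENT, CRD, PUN, OTHER to a single token."""
    if NUMERAL_RE.fullmatch(token):
        return "CRD"
    if PUNCT_RE.fullmatch(token):
        return "PUN"
    if WORD_RE.fullmatch(token):
        return "FREQ" if token.lower() in frequent else "CONTENT"
    return "OTHER"  # mixed tokens such as "abc123"


if __name__ == "__main__":
    corpus = "the cat saw the dog . 2 cats ran , abc123".split()
    # top_n is lowered here only because the toy corpus is tiny.
    frequent = build_frequent_set(corpus, top_n=1)
    for tok in corpus:
        print(tok, shallow_tag(tok, frequent), sep="\t")
```

In this sketch the frequency list is derived from the corpus being tagged; a list computed from a separate reference corpus of the language would work the same way.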

Once a corpus is tagged with this simple tagset, it can be processed with the Universal Sketch Grammar by Siva Reddy, Adam Kilgarriff and Pavel Rychlý.