The Chinese Wiki corpus is first segmented with the Stanford Word Segmenter and then tagged with the Stanford Tagger, using a model trained on a combination of Chinese Treebank texts from Chinese and Hong Kong sources.

The tag set used for tagging is the LDC Chinese Treebank tag set.
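
As a rough illustration of working with segmented and tagged text of this kind, the sketch below splits a line of word/tag tokens into pairs and looks up a few Chinese Treebank tags. The `word#TAG` token layout, the `#` separator, the sample sentence, and the names `parse_tagged_line` and `CTB_TAGS` are assumptions made for the example; the format in which this corpus is actually distributed may differ, and the tag table is only a small subset of the full tag set.

```python
# A small, non-exhaustive subset of the LDC Chinese Treebank tag set.
CTB_TAGS = {
    "NN": "common noun",
    "NR": "proper noun",
    "PN": "pronoun",
    "VV": "other verb",
    "AD": "adverb",
    "P": "preposition",
    "CD": "cardinal number",
    "M": "measure word",
    "PU": "punctuation",
}


def parse_tagged_line(line, sep="#"):
    """Split whitespace-delimited word<sep>TAG tokens into (word, tag) pairs.

    The separator is an assumption; pass a different `sep` if the data
    uses another character. Tokens without a separator are skipped.
    """
    pairs = []
    for token in line.split():
        # Split on the LAST separator so a word containing `sep` stays intact.
        word, _, tag = token.rpartition(sep)
        if not word:
            continue
        pairs.append((word, tag))
    return pairs


# Hypothetical sample line, already segmented and tagged.
example = "我#PN 喜欢#VV 北京#NR 。#PU"

for word, tag in parse_tagged_line(example):
    print(word, tag, CTB_TAGS.get(tag, "(not in this subset)"))
```

Splitting on the last separator rather than the first keeps tokens intact when the word itself contains the separator character.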

Changelog

v1.0 (16 April 2012)

  • initial version – 0.1 billion tokens