The Simplified Chinese TenTen corpus was created from Internet texts in 2011. It contains almost 2.6 million documents, with more than 1.7 billion words in over 72 million sentences.

The corpus was processed with the Stanford Chinese Word Segmenter and the Stanford Log-linear Part-Of-Speech Tagger, using the standard Chinese Penn Treebank models.

The sketch grammar was prepared by Simon Smith.

Tagset overview

See the tagset legend of the Chinese Penn Treebank.

Changelog

v1.0 (2 December 2011)

  • initial version – 1.7 billion words