The Simplified Chinese TenTen corpus was compiled from the Internet in 2011. It contains almost 2.6 million documents, with more than 1.7 billion words in over 72 million sentences.
The sketch grammar was prepared by Simon Smith.
See the tagset legend of the Chinese Penn Treebank.
Changelog
v1.0 (2 December 2011)
- initial version – 1.7 billion words