The Simplified Chinese TenTen corpus was built from texts collected on the Internet in 2011. It contains almost 2.6 million documents, with more than 1.7 billion words in over 72 million sentences.

The corpus was processed with the Stanford Chinese Word Segmenter and the Stanford Log-linear Part-Of-Speech Tagger, using the Chinese Penn Treebank standard models.

The sketch grammar was prepared by Simon Smith.

Tagset overview

The full tagset description, including Chinese characters, is available here (this page was taken from a Wayback Machine dump).

AD adverb
AS aspect marker
BA 把 in ba-construction
CC coordinating conjunction
CD cardinal number
CS subordinating conjunction
DEC 的 in a relative-clause
DEG associative 的
DER 得 in V-de const. and V-de-R
DEV 地 before VP
DT determiner
ETC for words 等, 等等
FW foreign words
IJ interjection
JJ other noun-modifier
LB 被 in long bei-const
LC localizer
M measure word
MSP other particle
NN common noun
NR proper noun
NT temporal noun
OD ordinal number
ON onomatopoeia
P preposition
PN pronoun
PU punctuation
SB 被 in short bei-const
SP sentence-final particle
VA predicative adjective
VC 是 (copula)
VE 有 as the main verb
VV other verb
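
As a rough illustration of how the tags above might be used, the following Python sketch reads a tagged sentence and counts tags, looking up each description in a small table. The `word#TAG` token format, the sample sentence, and the names `TAG_DESCRIPTIONS`, `parse_tagged` and `tag_counts` are assumptions made for this example, not part of the corpus tooling; only a subset of the tagset is included.

```python
from collections import Counter

# Subset of the tagset listed above (tag -> description).
TAG_DESCRIPTIONS = {
    "AD": "adverb",
    "NN": "common noun",
    "NR": "proper noun",
    "PN": "pronoun",
    "PU": "punctuation",
    "VV": "other verb",
}


def parse_tagged(line, sep="#"):
    """Split 'word#TAG word#TAG ...' into (word, tag) pairs.

    The '#' separator is an assumption about the tagged format;
    pass a different `sep` if your data uses another character.
    Tokens without a separator get an empty tag.
    """
    pairs = []
    for token in line.split():
        if sep in token:
            # Split on the last separator so words containing it survive.
            word, _, tag = token.rpartition(sep)
        else:
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


def tag_counts(pairs):
    """Count how often each tag occurs in a list of (word, tag) pairs."""
    return Counter(tag for _, tag in pairs)


# Hypothetical tagged sentence, for illustration only.
sample = "我#PN 喜欢#VV 北京#NR 。#PU"
pairs = parse_tagged(sample)
counts = tag_counts(pairs)

for tag, n in counts.items():
    print(tag, TAG_DESCRIPTIONS.get(tag, "unknown tag"), n)
```

Splitting on the last separator keeps tokens intact when the word itself contains the separator character.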


v1.0 (2 December 2011)

  • initial version – 1.7 billion words