Chinese corpus from Wikipedia

Corpus of Chinese Wikipedia

The Chinese Wikipedia corpus is a Chinese corpus created from the Chinese-language edition of the online encyclopedia Wikipedia in 2012. The corpus was built from a Wikipedia dump (from April 2014). The text was segmented with the Stanford Word Segmenter and then tagged with the Stanford Tagger, using a model trained on a combination of Chinese Treebank texts from Chinese and Hong Kong sources.

Part-of-speech tagset

The POS tags are based on the Penn Chinese Treebank tagset.
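
To illustrate how such tags can be used, the following minimal sketch groups a few Penn Chinese Treebank tags into coarse word classes and builds frequency-ordered word lists from tagged tokens. The sample tokens, the COARSE_CLASS mapping (including the choice to count the copula as a verb) and the function name word_lists are hypothetical and are not part of the corpus tools. Only a handful of tags are shown.

```python
from collections import Counter, defaultdict

# A few Penn Chinese Treebank tags mapped to coarse classes.
# This mapping is an illustrative assumption, not the full tagset.
COARSE_CLASS = {
    "NN": "noun",       # common noun
    "NR": "noun",       # proper noun
    "NT": "noun",       # temporal noun
    "VV": "verb",       # other verb
    "VC": "verb",       # copula (counted as a verb here by choice)
    "VE": "verb",       # "you" (to have) as main verb
    "VA": "adjective",  # predicative adjective
    "JJ": "adjective",  # other noun modifier
    "AD": "adverb",     # adverb
}

def word_lists(tagged_tokens):
    """Return {class: [(word, count), ...]} sorted by descending frequency."""
    lists = defaultdict(Counter)
    for word, tag in tagged_tokens:
        cls = COARSE_CLASS.get(tag)
        if cls is not None:  # tags outside the mapping are skipped
            lists[cls][word] += 1
    return {cls: counter.most_common() for cls, counter in lists.items()}

# Hypothetical (word, tag) pairs, as they might look after tagging.
sample = [
    ("我", "PN"), ("喜欢", "VV"), ("北京", "NR"), ("，", "PU"),
    ("北京", "NR"), ("很", "AD"), ("大", "VA"),
]

if __name__ == "__main__":
    for cls, items in word_lists(sample).items():
        print(cls, items)
```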

Available tools

A complete set of tools is available to work with this Chinese corpus to generate:

word sketch – Chinese collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
word lists – lists of Chinese nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
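
As a rough illustration of what the last two items compute, the sketch below counts n-grams and produces simple keyword-in-context lines over an already segmented token list. It does not show how the corpus tools themselves work. The function names, the window parameter and the sample tokens are all hypothetical.

```python
from collections import Counter

def ngram_frequencies(tokens, n):
    """Return [(ngram_tuple, count), ...] sorted by descending frequency."""
    # zip over n shifted copies of the token list yields every n-gram.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common()

def concordance(tokens, keyword, window=2):
    """Return (left_context, keyword, right_context) for every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append((left, tok, right))
    return lines

# Hypothetical segmented text; real input would come from the segmenter.
tokens = ["我", "喜欢", "北京", "，", "我", "喜欢", "上海"]

if __name__ == "__main__":
    print(ngram_frequencies(tokens, 2)[:3])
    for left, kw, right in concordance(tokens, "喜欢", window=1):
        print(" ".join(left), "[" + kw + "]", " ".join(right))
```

Working on tokens rather than raw characters matters for Chinese, since the text has no spaces and the units counted depend on the segmentation.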