The Chinese Gigaword corpus from the Linguistic Data Consortium consists of two parts: one written in simplified Chinese characters and one in traditional Chinese characters.

In Sketch Engine, this corpus was divided into two separate corpora:

Chinese GigaWord 2 Corpus: Mainland, simplified

  • source data is journalism from the Xinhua News Agency, Beijing, from 1990 to 2002
  • size more than 200 million words

Chinese GigaWord 2 Corpus: Taiwan, traditional

  • source data is journalism from the Central News Agency, Taiwan, from 1991 to 2002
  • size more than 380 million words

Tokenisation and tagging for both corpora were carried out at Academia Sinica, Taiwan, and are described in:
Wei-yun Ma and Chu-Ren Huang. 2006. Uniform and Effective Tagging of a Heterogeneous Giga-word Corpus. Proc. 5th International Conference on Language Resources and Evaluation (LREC2006). Genoa, Italy. 24-28 May, 2006.