The Chinese Gigaword corpus from the Linguistic Data Consortium consists of two parts: one written in simplified Chinese characters and one written in traditional Chinese characters.
In Sketch Engine, this corpus was divided into two separate corpora:
Chinese GigaWord 2 Corpus: Mainland, simplified
- source data is journalism from the Xinhua News Agency, Beijing, published between 1991 and 2002
- size more than 200 million words
Chinese GigaWord 2 Corpus: Taiwan, traditional
- source data is journalism from the Central News Agency, Taiwan, published between 1990 and 2002
- size more than 380 million words
Tokenisation and tagging for both corpora were carried out at Academia Sinica, Taiwan, and are described in:
Wei-yun Ma and Chu-Ren Huang. 2006. Uniform and Effective Tagging of a Heterogeneous Giga-word Corpus. Proc. 5th International Conference on Language Resources and Evaluation (LREC2006). Genoa, Italy. 24-28 May, 2006.