The corpus is part of the Chinese Gigaword corpus from the Linguistic Data Consortium. The material is journalism from the Central News Agency, Taiwan, dating from the 1990s and 2000s. Because the Chinese Gigaword is partly in simplified characters and partly in traditional characters, we have encoded it as two separate parts; this part contains only the traditional-character material.
Tokenisation and tagging were undertaken at Academia Sinica, Taiwan, and are described in:
Wei-yun Ma and Chu-Ren Huang. 2006. Uniform and Effective Tagging of a Heterogeneous Giga-word Corpus. Proc. 5th International Conference on Language Resources and Evaluation (LREC2006). Genoa, Italy. 24-28 May, 2006.
We shall shortly be upgrading to the Second edition, which has additional and more recent data.