The TurkishWaC corpus is a 32-million-word collection of text sampled from Turkish websites. The corpus contains word sketches generated using the Turkish MaltParser. The data for the corpus were prepared using the Corpus Factory method; for full details, see the paper A corpus factory for many languages (Kilgarriff et al., LREC 2010).
v3.0 (19 Nov 2014)
- added document structures
- added lempos
- corrected tokenisation of quotes
v2.0 (25 Oct 2011)
Word Sketches are compiled using the resources below.
The morphological analyser and morphological disambiguator (POS tagger) are by Kemal Oflazer and Deniz Yüret and can be downloaded from http://deniz.yuret.com/turkish/tr-disamb.tgz.
Word Sketches are generated from the output of an existing dependency parser. The dependency parser can be downloaded from http://web.itu.edu.tr/gulsenc/TurkishDepModel.html
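As a rough illustration of how dependency parses can feed word sketches, the sketch below counts (head lemma, relation, dependent lemma) triples from parser output. It assumes the parser writes the tab-separated CoNLL-X format, as MaltParser can; the sample sentence, its tags and relation labels, and the function names are hypothetical, and this is not the pipeline actually used to build the corpus. Real word sketches also rank collocates by a salience score, which is omitted here.

```python
from collections import Counter

# Hypothetical toy sentence in CoNLL-X format (10 tab-separated columns):
# ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL
SAMPLE = (
    "1\tAli\tAli\tNoun\tProp\t_\t3\tSUBJECT\t_\t_\n"
    "2\tkitabı\tkitap\tNoun\tNoun\t_\t3\tOBJECT\t_\t_\n"
    "3\tokudu\toku\tVerb\tVerb\t_\t0\tROOT\t_\t_\n"
    "\n"
)


def read_conllx(text):
    """Yield each sentence as a list of token dicts."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append({
            "id": int(cols[0]),
            "lemma": cols[2],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    if sentence:
        yield sentence


def count_relations(sentences):
    """Count (head lemma, relation, dependent lemma) triples."""
    counts = Counter()
    for sent in sentences:
        by_id = {tok["id"]: tok for tok in sent}
        for tok in sent:
            head = by_id.get(tok["head"])
            if head is None:
                # HEAD 0 marks the root, which has no governing word.
                continue
            counts[(head["lemma"], tok["deprel"], tok["lemma"])] += 1
    return counts


counts = count_relations(read_conllx(SAMPLE))
for (head, rel, dep), n in sorted(counts.items()):
    print(head, rel, dep, n)
```

Grouping these counts by head lemma and relation gives, for each word, the list of collocates per grammatical relation, which is the basic shape of a word sketch.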
We would like to thank Gülşen Eryiğit and Kemal Oflazer for answering our emails and providing us with the tools.
Ambati, Bharat Ram, Siva Reddy, and Adam Kilgarriff (2012). Word Sketches for Turkish. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pp. 2945–2950.