The Web as Corpus (WaC) corpora were prepared using the Corpus Factory method, which is described in full in the paper listed in the bibliography below. The corpora are listed alphabetically by language:

A

Arabic (Arabic web corpus)

B

Basque (basque_WaC) Bengali (bengaliWaC) Bosnian (bosnianWaC14)

C

Cantonese (Cantonese WaC) Chinese (ChineseTaiwanWaC) Croatian (hrWaC, hrWaC_10M)

D

Danish (danishWaC) Dutch (Dutch web corpus, nlWaC, nlWaC_1)

E

English (pukWaC, ukWaC, ukWaC_1, ukWaC_10M, ukWaC_10M_1, ukWaC2, ukWaC2_1, ukWaC3, ukWaC_mcd, ukWaCsst)

F

Filipino (filipinoWaC) Finnish (finnishWaC) French (frWaC, frWaC1_1) Frisian (frisianWaC)

G

Georgian (georgianWaC) German (deWaC, Parsed DeWaC (sDeWaC)) Greek (gkWaC) Gujarati (gujarathiWaC)

H

Hebrew (hebWaC) Hindi (hindiWaC, hindiWaC3)

I

Igbo (igboWaC) Indonesian (indonesianWaC) Italian (itWaC)

J

Japanese (jpWaC, jpWaC_10M, jpWaC2)

K

Kannada (Kannada WaC) Korean (koreanWaC)

L

Latin (latinWaC, latinWaC2) Latvian (latvianWaC, latvianWaC_shallow) Lithuanian (lithuanianWaC, lithuanianWaC_v2, lithuanianWaC_v2_10M)

M

Malay (malaysianWaC2) Malayalam (malayalamWaC) Maltese (malteseWaC, malteseWaC2, malteseWaC2_sample) Maori (maoriWaC) Mongolian (MongolianWaC)

N

Nepali (nepaliWaC) Norwegian (norwegianWaC)

P

Persian (WBC-Per) Polish (Polish Web Corpus)

R

Romanian (romanian_WaC) Russian (Russian Web Corpus)

S

Samoan (SamoanWaC) Serbian (serbianWaC, serbianWaC14, srWaC, srWaC22M) Setswana (setswanaWaC, setswanaWaC2) Spanish (Spanish web corpus) Swahili (swahiliWaC, swahiliWaC_1) Swedish (swedishWaC, swedish_WaC, swedish_WaC_10M)

T

Tamil (tamilWaC) Tatar (Tatar Sample) Telugu (teluguWaC, teluguWaC2) Thai (thaiWaC) Turkish (turkishWaC, turkishWaC2, turkishWaC2_1, turkishWaC2_1_s, turkishWaC2_1_uniattr)

U

Urdu

V

Vietnamese (vietnameseWaC2)

W

Welsh (welshWaC)

Y

Yoruba (Yoruba web corpus)


Bibliography

Adam Kilgarriff, Siva Reddy, Jan Pomikálek and Avinesh PVS (2010). A Corpus Factory for Many Languages. In LREC workshop on Web Services and Processing Pipelines, Malta, May 2010.