Overview of text corpora publicly available in Sketch Engine

A corpus counts as public if it is available to trial or paying subscribers, or if it can be used without registration through the open access interface. Beyond these public corpora, Sketch Engine also holds corpora whose access is restricted by copyright regulations, or which are owned and controlled by third parties.

Category

main – corpora available only to regular (paying) users

trial – corpora available to both trial and regular users

open – corpora available without registration
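
The three categories amount to a small access hierarchy, which can be sketched as a lookup table. This is only an illustration: the user-type labels (`anonymous`, `trial`, `regular`) and the function name `can_access` are assumptions made for the example and are not Sketch Engine identifiers; only the category names come from the list above.

```python
# Category names ("main", "trial", "open") are taken from the list above.
# The user-type labels are hypothetical names chosen for this sketch.
ACCESS = {
    "main": {"regular"},                        # paying users only
    "trial": {"trial", "regular"},              # trial and paying users
    "open": {"anonymous", "trial", "regular"},  # no registration needed
}


def can_access(category: str, user_type: str) -> bool:
    """Return True if a user of the given type may use corpora in the category."""
    return user_type in ACCESS[category]
```

For instance, `can_access("trial", "regular")` is true, while `can_access("main", "trial")` is false.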


Name Language Category Number of tokens
CHILDES Afrikaans Corpus Afrikaans main 33,134
OPUS2 Afrikaans Afrikaans trial 743,954
OPUS2 Albanian Albanian trial 55,099,328
Amharic WaC [2013 + 2015 + 2016] Amharic trial 20,287,250
Arabic Web Arabic main 174,239,600
KSUCCA (Classical Arabic) Arabic main 59,693,146
Arabic Learner Corpus (ALC) Arabic main 386,583
Arabic Web 2012 sample 115M (arTenTen12, Mada tagger) Arabic main 131,159,731
OPUS2 Arabic Arabic main 406,527,277
Quran annotated corpus [unvowelled Arabic] Arabic main 128,243
Quran annotated corpus [unvowelled Latin] Arabic main 128,243
Quran annotated corpus [vowelled Arabic] Arabic main 128,243
Quran annotated corpus [vowelled Latin] Arabic main 128,243
Timestamped JSI web corpus 2014-2016 Arabic Arabic trial 1,084,155,423
Arabic Web 2012 (arTenTen12, Stanford tagger) Arabic trial 8,322,097,229
Turkic web – Azerbaijani Azerbaijani trial 115,280,755
Basque Web (BasqueWaC) Basque trial 123,856,183
Bengali Web (BengaliWaC) Bengali trial 13,752,575
OPUS2 Bosnian Bosnian main 55,224,138
Bosnian Web 2014 (BosnianWaC14) Bosnian trial 290,176,507
Bulgarian National Corpus (BulgarianNC) Bulgarian main 26,518,884
Bulgarian National Corpus nonweb genres Bulgarian main 27,721,533
Bulgarian National Corpus with web Bulgarian main 545,637,740
DGT, Bulgarian Bulgarian main 32,778,982
EUR-Lex judgments Bulgarian 12/2016 Bulgarian main 21,537,635
OPUS2 Bulgarian Bulgarian main 238,945,836
Bulgarian Web 2012 (bgTenTen12) Bulgarian trial 843,328,184
EUR-Lex Bulgarian 2/2016 Bulgarian trial 457,463,831
EUROPARL7, Bulgarian Bulgarian trial 10,602,635
CHILDES Catalan Corpus Catalan main 277,816
Catalan Web 2014 (caTenTen14) Catalan trial 4,777,786,899
Timestamped JSI web corpus 2014-2016 Catalan Catalan trial 114,317,450
Chinese GigaWord 2 Corpus: Mainland, simplified Chinese Simplified main 250,124,230
Chinese Web (Internet-ZH) Chinese Simplified main 277,931,664
OPUS2 Chinese Simplified Chinese Simplified main 299,338,099
Chinese Web 2011 (zhTenTen11, sample 10M) Chinese Simplified main 11,028,308
Guangwai - Lancaster Chinese Learner Corpus Chinese Simplified open 1,664,237
Chinese Web 2011 (zhTenTen11) Chinese Simplified trial 2,106,661,021
Chinese GigaWord 2 Corpus: Taiwan, traditional Chinese Traditional main 455,526,209
Chinese Traditional Web (TaiwanWaC) Chinese Traditional main 349,198,060
Chinese Traditional Web (TaiwanWaC, Universal Sketch Grammar) Chinese Traditional main 349,198,060
OPUS2 Chinese Traditional Chinese Traditional main 622,382
CHILDES Croatian Corpus Croatian main 389,674
DGT, Croatian Croatian main 5,123,494
EUR-Lex judgments Croatian 12/2016 Croatian main 7,416,811
OPUS2 Croatian Croatian main 156,942,211
Croatian Web 2011 & 2013 (hrWaC 2.2) Croatian trial 1,397,757,548
EUR-Lex Croatian 2/2016 Croatian trial 156,309,317
csSkELL v1 (whole documents) Czech main 2,072,446,673
csSkELL v2.1 (only sentences with GDEX scores) Czech main 1,863,757,837
csSkELL v2.2 (sentences with GDEX scores) Czech main 1,726,564,383
csSkELL v2 (only sentences with GDEX scores) Czech main 1,946,264,694
CzechParl 2012 Czech main 51,366,108
Czech news and web 1995–2002 (czes2) Czech main 458,225,771
Czech Web 2012 (czTenTen12 v8, sample) Czech main 64,607,138
DGT, Czech Czech main 57,094,285
EUR-Lex judgments Czech 12/2016 Czech main 23,906,139
OPUS2 Czech Czech main 275,519,334
Timestamped JSI web corpus 2014-2016 Czech Czech trial 344,176,348
Czech Web 2012 (czTenTen12 v9) Czech trial 5,069,447,935
EUR-Lex Czech 2/2016 Czech trial 501,361,784
EUROPARL7, Czech Czech trial 15,290,586
CHILDES Danish Corpus Danish main 372,811
Danish Web (DanishWaC) Danish main 353,703,002
DGT, Danish Danish main 58,810,703
EUR-Lex judgments Danish 12/2016 Danish main 45,307,188
OPUS2 Danish Danish main 153,261,335
Danish Web 2014 (daTenTen14) Danish trial 2,395,139,491
EUR-Lex Danish 2/2016 Danish trial 731,423,452
EUROPARL7, Danish Danish trial 55,794,038
CHILDES Dutch Corpus Dutch main 7,592,039
DGT, Dutch Dutch main 62,654,517
EUR-Lex judgments Dutch 12/2016 Dutch main 49,746,950
Araneum Nederlandicum Maius [2013] Dutch main 1,200,000,837
OPUS2 Dutch Dutch main 446,240,037
EUR-Lex Dutch 2/2016 Dutch trial 783,154,917
EUROPARL7, Dutch Dutch trial 59,756,704
Timestamped JSI web corpus 2014-2016 Dutch Dutch trial 463,471,686
Dutch Web 2014 (nlTenTen14) Dutch trial 3,013,056,738
British Law Report Corpus English main 10,036,051
Brown Family, CLAWS + TreeTagger tags English main 8,073,482
Brown Family English main 8,099,732
CHILDES English Corpus English main 29,480,736
Cambridge Academic English English main 3,738,308
DGT, English English main 74,365,007
English Historical Book Collection (EEBO, ECCO, Evans) English main 987,242,247
e-flux (International art English) English main 6,238,592
Araneum Anglicum Africanum Maius [2015] English main 1,200,000,194
Araneum Anglicum Asiaticum Maius [2015] English main 1,200,000,489
English Preposition Corpus English main 2,430,218
English Web 2012 (enTenTen12, sample 40M) English main 40,920,950
English Web 2008 (enTenTen08) English main 3,268,798,627
English Wikipedia English main 1,632,582,504
Project Gutenberg English English main 529,531,582
EUR-Lex judgments English 12/2016 English main 51,499,120
London English Corpus English main 2,959,320
LEXMCI English main 1,720,056,987
New Model Corpus English main 114,627,650
New corpus for English (NCI English) English main 257,900,777
Open American National Corpus (spoken) English main 3,369,613
Open American National Corpus (written) English main 13,572,382
OPUS2 English English main 1,441,844,046
pukWaC (ukWaC parsed with MaltParser) English main 46,256,586
ScienceBlogs English main 122,942,494
SiBol/Port (English broadsheet newspapers) English main 387,585,716
English Corpus for SkELL 3.6 English main 1,237,286,904
TED_en (transcripts of TED talks) English main 3,421,262
ukWaC (British Web corpus) English main 1,559,716,979
UKWaC super sensed English main 370,023,634
ACL Anthology Reference Corpus (ARC) English open 49,348,397
British Academic Spoken English Corpus (BASE) English open 1,252,256
British Academic Written English Corpus (BAWE) English open 8,336,262
Brown English open 1,175,675
EcoLexicon English corpus English open 28,616,037
British National Corpus (BNC), tagged by CLAWS English trial 112,181,015
British National Corpus (BNC) English trial 112,345,722
Directory of Open Access Journals (English) English trial 3,349,931,737
Araneum Anglicum Maius [2015] English trial 1,200,023,361
Timestamped JSI web corpus 2014-2016 English English trial 21,336,894,049
Timestamped web corpus combined 2005-2015 (Newsfeed+Feed) English trial 9,928,357,596
English Web 2013 (enTenTen13) English trial 22,728,686,012
EUR-Lex English 2/2016 English trial 845,040,420
EUROPARL7, English English trial 60,741,877
Timestamped web corpus 2005-2014 (Feed) English trial 640,820,898
Susanne English trial 150,426
CHILDES Estonian Corpus Estonian main 399,547
DGT, Estonian Estonian main 46,445,829
Estonian Reference corpus with Web (EstonianNC) Estonian main 563,220,548
Estonian Reference corpus (EstonianRC) Estonian main 249,923,332
Estonian Web 2013 (etTenTen13) [New Word Sketches] Estonian main 330,045,196
EUR-Lex judgments Estonian 12/2016 Estonian main 20,279,247
OPUS2 Estonian Estonian main 88,432,596
Estonian Web 2013 (etTenTen13) Estonian trial 330,045,196
EUR-Lex Estonian 2/2016 Estonian trial 437,435,453
EUROPARL7, Estonian Estonian trial 13,162,640
Filipino Web (FilipinoWaC) Filipino trial 31,845,404
Philippine Web (philippineWaC16) Filipino trial 40,302,836
DGT, Finnish Finnish main 47,397,459
Araneum Finnicum Maius [2014] Finnish main 1,200,000,486
EUR-Lex judgments Finnish 12/2016 Finnish main 30,993,755
OPUS2 Finnish Finnish main 180,134,681
EUR-Lex Finnish 2/2016 Finnish trial 558,884,960
EUROPARL7, Finnish Finnish trial 40,979,520
Timestamped JSI web corpus 2014-2016 Finnish Finnish trial 143,709,979
Finnish Web 2014 (fiTenTen14, TreeTagger v2) Finnish trial 1,703,429,270
CHILDES French Corpus French main 3,287,017
DGT, French French main 70,602,745
Frantext (French literature of the 18th-20th century) French main 26,265,698
Araneum Francogallicum Maius [2015] French main 1,200,004,721
French Web 2012 sample (frTenTen12) French main 39,472,639
EUR-Lex judgments French 12/2016 French main 58,993,172
OPUS2 French French main 956,614,852
French web corpus French main 126,850,281
EUR-Lex French 2/2016 French trial 920,640,086
EUROPARL7, French French trial 66,661,141
Timestamped JSI web corpus 2014-2016 French French trial 2,188,593,260
French Web 2012 (frTenTen12) French trial 11,444,973,582
Western Frisian Web 2013 (FrisianWaC) Frisian trial 3,738,968
Georgian Web (georgianWaC) Georgian trial 63,632,861
Araneum Germanicum Maius [2013] German main 1,200,000,146
German Web 2013 sample (deTenTen13) German main 65,804,983
German Web (deWaC) German main 1,627,169,557
DGT, German German main 58,319,542
GerManC (German Newspapers 1650-1800) German main 800,783
EUR-Lex judgments German 12/2016 German main 44,891,478
OPUS2 German German main 157,849,124
Parsed German Web (sDeWaC) German main 886,661,231
German Web 2013 (deTenTen13) German trial 19,808,173,163
Timestamped JSI web corpus 2014-2016 German German trial 2,378,228,966
EUR-Lex German 2/2016 German trial 718,370,201
EUROPARL7, German German trial 55,251,638
DGT, Greek Greek main 64,538,668
Greek Web (GkWaC) Greek main 149,067,023
EUR-Lex judgments Greek 12/2016 Greek main 44,825,698
OPUS2 Greek Greek main 305,404,357
Greek Web 2014 (elTenTen14) Greek trial 1,958,348,129
EUR-Lex Greek 2/2016 Greek trial 775,079,501
EUROPARL7, Greek Greek trial 44,097,921
Gujarati Web (GujarathiWaC) Gujarati trial 22,201,247
CHILDES Hebrew Corpus Hebrew main 1,034,238
Hebrew General Corpus (web crawled, mostly newspapers) Hebrew main 192,119,449
Hebrew Web (HebWaC) Hebrew main 60,351,738
OPUS2 Hebrew Hebrew main 252,278,074
Timestamped JSI web corpus 2014-2016 Hebrew Hebrew trial 134,830,039
Hebrew Web 2014 (heTenTen14) Hebrew trial 1,061,788,271
Hindi Web (HindiWaC v. 3) Hindi main 65,772,188
Hindi Web (HindiWaC v. 4) Hindi main 120,600,574
OPUS2 Hindi Hindi main 1,642,973
Hindi Web 2013 (hiTenTen13) Hindi trial 405,366,140
CHILDES Hungarian Corpus Hungarian main 311,543
DGT, Hungarian Hungarian main 55,276,730
Hungarian Web 2012 (huTenTen12) Hungarian main 3,184,161,466
EUR-Lex judgments Hungarian 12/2016 Hungarian main 24,542,189
OPUS2 Hungarian Hungarian main 218,409,426
EUR-Lex Hungarian 2/2016 Hungarian trial 499,799,589
EUROPARL7, Hungarian Hungarian trial 14,655,015
Araneum Hungaricum Maius [2014] Hungarian trial 1,200,001,609
Timestamped JSI web corpus 2014-2016 Hungarian Hungarian trial 218,405,214
Icelandic texts [sample] Icelandic trial 9,968,822
Igbo Web 2015 (IgboWaC15) Igbo trial 396,276
Indonesian Web (IndonesianWaC) Indonesian trial 109,281,359
CHILDES Gaelic Corpus Irish main 20,823
DGT, Irish Irish main 1,251,732
EUR-Lex Irish 2/2016 Irish trial 37,467,080
New Corpus for Ireland (NCI Irish) Irish trial 34,358,267
CHILDES Italian Corpus Italian main 572,217
DGT, Italian Italian main 65,936,285
Araneum Italicum Maius (Italian, 14.12) 1,20 G Italian main 1,200,000,174
Italian Web 2010 sample (itTenTen) Italian main 48,904,255
Italian web corpus (itWaC) Italian main 1,909,535,703
EUR-Lex judgments Italian 12/2016 Italian main 52,943,414
OPUS2 Italian Italian main 231,143,960
EUR-Lex Italian 2/2016 Italian trial 829,319,312
EUROPARL7, Italian Italian trial 59,177,399
Timestamped JSI web corpus 2014-2016 Italian Italian trial 1,573,777,557
Italian Web 2010 (itTenTen) Italian trial 3,076,908,415
CHILDES Japanese Corpus Japanese main 2,187,308
Japanese Web (JpWaC) Japanese main 413,310,996
OPUS2 Japanese Japanese main 6,596,733
Japanese Web 2011 (jpTenTen11) Japanese trial 10,321,875,664
Japanese Web 2011 sample (jpTenTen11, LUW) Japanese trial 203,674,569
Turkic web – Kazakh Kazakh trial 175,445,327
CHILDES Korean Corpus Korean main 53,339
OPUS2 Korean Korean main 500,152
Timestamped JSI web corpus 2014-2016 Korean Korean trial 547,918,466
Korean Web 2012 (koTenTen12) Korean trial 560,945,022
Turkic web – Kyrgyz Kyrgyz trial 24,084,100
LatinISE historical corpus v2 Latin trial 12,995,824
DGT, Latvian Latvian main 54,287,472
EUR-Lex judgments Latvian 12/2016 Latvian main 21,977,367
Latvian Web (LatvianWaC) Latvian main 74,447,302
OPUS2 Latvian Latvian main 34,012,690
EUR-Lex Latvian 2/2016 Latvian trial 491,388,506
EUROPARL7, Latvian Latvian trial 14,253,247
Latvian Web 2014 (lvTenTen14) Latvian trial 657,522,048
DGT, Lithuanian Lithuanian main 52,155,372
EUR-Lex judgments Lithuanian 12/2016 Lithuanian main 21,558,688
Lithuanian Web (LithuanianWaC v2) Lithuanian main 63,645,700
OPUS2 Lithuanian Lithuanian main 40,933,573
EUR-Lex Lithuanian 2/2016 Lithuanian trial 476,891,405
EUROPARL7, Lithuanian Lithuanian trial 13,733,247
Lithuanian Web 2014 (ltTenTen14) Lithuanian trial 981,517,649
OPUS2 Macedonian Macedonian trial 49,066,513
Malayalam Web (malayalamWaC) Malayalam trial 21,193,984
Malaysian Web (MalaysianWaC) Malay trial 230,509,568
DGT, Maltese Maltese main 30,172,433
EUR-Lex judgments Maltese 12/2016 Maltese main 26,865,968
EUR-Lex Maltese 2/2016 Maltese trial 466,854,303
Maltese MLRS Corpus Maltese trial 125,267,653
Maori Web (MaoriWaC) Maori trial 8,351,983
Mongolian Web Texts 2016 (mnWaC16) Mongolian trial 7,540,919
Nepali Web (NepaliWaC) Nepali main 1,464,492
Nepali National Corpus Nepali trial 15,137,459
Corpus Nko ߒߞߏ ߝߊ߬ߘߌ߬ߞߋ߬ߟߋ߲߬ߡߊ N'Ko open 4,636,227
CHILDES Norwegian Corpus Norwegian main 61,075
Norwegian dictionary corpus (Nynorskkorpuset) Norwegian main 87,228,361
OPUS2 Norwegian Norwegian main 26,467,755
Norwegian Web 2015 (noTenTen15; Bokmål and Nynorsk) Norwegian trial 1,953,892,201
Oromo WaC [2016] Oromo trial 5,091,696
Kannada Web (KannadaWaC) Kannada main 16,031,481
CHILDES Farsi Corpus Persian main 150,505
TalkBank Persian (blog posts) Persian main 549,165,952
OPUS2 Persian Persian trial 5,367,401
BIBLE Polish-Swahili Polish main 169,934
CHILDES Polish Corpus Polish main 1,247,919
DGT, Polish Polish main 58,520,395
EUR-Lex judgments Polish 12/2016 Polish main 23,884,080
OPUS2 Polish Polish main 285,188,755
Araneum Polonicum Maius [2013] Polish main 1,110,120,694
Polish Web 2012 sample (plTenTen12) Polish main 55,381,476
Polish Web (PolishWac) Polish main 128,185,119
EUR-Lex Polish 2/2016 Polish trial 510,957,144
EUROPARL7, Polish Polish trial 15,171,493
Polish Web 2012 (plTenTen12) Polish trial 9,387,142,186
Timestamped JSI web corpus 2014-2016 Polish Polish trial 190,687,002
Brazilian Portuguese corpus (Corpus Brasileiro) Portuguese main 1,133,416,757
CHILDES Portuguese Corpus Portuguese main 245,805
DGT, Portuguese Portuguese main 65,967,069
EUR-Lex judgments Portuguese 12/2016 Portuguese main 44,247,824
OPUS2 Brazilian Portuguese Portuguese main 355,049,778
OPUS2 Portuguese Portuguese main 377,677,225
Newspapers in Portuguese (CetemPúblico, CetenFolha) Portuguese main 66,319,147
Araneum Portugallicum Maius [2015] Portuguese main 1,200,006,068
Portuguese Web 2011 sample (ptTenTen11, Freeling) Portuguese main 44,446,042
Portuguese Web 2011 (ptTenTen11, Palavras parsed) Portuguese main 3,245,834,337
EUR-Lex Portuguese 2/2016 Portuguese trial 801,597,194
EUROPARL7, Portuguese Portuguese trial 61,414,188
Timestamped JSI web corpus 2014-2016 Portuguese Portuguese trial 1,312,377,855
Portuguese Web 2011 (ptTenTen11, Freeling v3) Portuguese trial 4,626,584,246
DGT, Romanian Romanian main 33,395,126
EUR-Lex judgments Romanian 12/2016 Romanian main 22,055,262
OPUS2 Romanian Romanian main 360,212,949
EUR-Lex Romanian 2/2016 Romanian trial 461,819,855
EUROPARL7, Romanian Romanian trial 10,795,858
Romanian Web (roWaC) Romanian trial 53,457,522
Romanian Web 2016 (roTenTen16) Romanian trial 3,142,636,172
CHILDES Russian Corpus Russian main 59,759
OPUS2 Russian Russian main 381,468,257
Araneum Russicum Maius (Russian, 15.02) 1,20 G Russian main 1,200,001,911
Araneum Russicum Externum Maius (non-Russia Russian, 15.03) 1,20 G Russian main 1,200,053,619
Araneum Russicum Maius [2013] Russian main 1,216,800,424
ruSkELL 1.3 Russian main 1,223,960,925
Russian web corpus Russian main 187,965,822
Araneum Russicum Russicum Maius (Russia-only Russian, 15.03) 1,20 G Russian trial 1,200,000,258
Timestamped JSI web corpus 2014-2016 Russian Russian trial 1,402,853,056
Russian Web 2011 (ruTenTen11) Russian trial 18,280,486,876
Samoan Web (SamoanWac1) Samoan trial 3,583,362
Scottish Gaelic Wiki 2015 (gdWiki) Scottish Gaelic trial 1,223,562
OPUS2 Serbian Serbian main 198,141,613
Serbian Web 2014 (srWaC14) Serbian trial 561,529,963
Timestamped JSI web corpus 2014-2016 Serbian Serbian trial 100,816,398
Setswana/Tswana Web (SetswanaWaC v2) Setswana trial 13,511,692
DGT, Slovak Slovak main 56,095,893
EUR-Lex judgments Slovak 12/2016 Slovak main 23,707,422
OPUS2 Slovak Slovak main 82,952,296
Slovak Web 2011 (skTenTen11, ambiguity tag) Slovak main 876,003,720
EUR-Lex Slovak 2/2016 Slovak trial 366,709,333
EUROPARL7, Slovak Slovak trial 15,042,066
Araneum Slovacum Maius [2013] Slovak trial 1,200,005,746
Slovak Web 2011 (skTenTen11) Slovak trial 656,067,998
DGT, Slovenian Slovenian main 57,009,023
Lektor (Learner corpus of proofread and translations) Slovenian main 1,244,028
EUR-Lex judgments Slovenian 12/2016 Slovenian main 23,991,001
KAS-Dipl (diplome) Slovenian main 713,212,210
KAS-Dr (doktorati) Slovenian main 39,850,036
KAS-Mag (magisteriji) Slovenian main 196,745,908
OPUS2 Slovenian Slovenian main 163,160,520
EUR-Lex Slovenian 2/2016 Slovenian trial 509,063,338
EUROPARL7, Slovenian Slovenian trial 14,616,666
Slovenian reference corpus (FidaPLUS v2) Slovenian trial 738,503,145
Slovenian Web 2015 (slTenTen15) Slovenian trial 988,513,467
Somali WaC [2016] Somali trial 79,741,231
CHILDES Spanish Corpus Spanish main 1,358,475
DGT, Spanish Spanish main 68,721,827
Araneum Hispanicum Maius [2013] Spanish main 1,200,000,609
Spanish Web 2011 sample (esTenTen11, Eu + Am, Freeling v4) Spanish main 73,597,801
EUR-Lex judgments Spanish 12/2016 Spanish main 47,235,792
OPUS2 Spanish Spanish main 870,615,999
Spanish Web corpus (SpanishWaC) Spanish main 116,900,060
American Spanish Web 2011 (esamTenTen11) Spanish trial 8,641,717,816
European Spanish Web 2011 (eseuTenTen11) Spanish trial 2,343,829,757
Spanish Web 2011 (esTenTen11, Eu + Am) Spanish trial 10,985,547,573
EUR-Lex Spanish 2/2016 Spanish trial 811,673,158
EUROPARL7, Spanish Spanish trial 60,862,330
Timestamped JSI web corpus 2014-2016 Spanish Spanish trial 4,665,332,420
BIBLE Swahili-Polish Swahili main 169,612
Swahili Web 2014 (SwahiliWaC) Swahili trial 21,359,529
CHILDES Swedish Corpus Swedish main 665,889
DGT, Swedish Swedish main 55,407,291
EUR-Lex judgments Swedish 12/2016 Swedish main 37,061,009
OPUS2 Swedish Swedish main 128,245,911
SwedishParole Swedish main 25,731,328
EUR-Lex Swedish 2/2016 Swedish trial 640,815,888
EUROPARL7, Swedish Swedish trial 51,759,122
Swedish Web 2014 (svTenTen14) Swedish trial 3,900,846,988
Tajik Web (TajikWaC) Tajik trial 109,805,133
CHILDES Tamil Corpus Tamil main 21,865
Tamil Web 2015 (TamilWaC) Tamil trial 32,861,569
Tatar Web 2015 sample Tatar trial 290,351
Telugu Web (TeluguWaC) Telugu trial 4,697,932
CHILDES Thai Corpus Thai main 299,962
Thai Web (ThaiWaC) Thai trial 108,013,897
Tibetan Corpus 2 Tibetan trial 91,107,466
Tigrinya WaC [2016] Tigrinya trial 2,531,443
CHILDES Turkish Corpus Turkish main 233,097
OPUS2 Turkish Turkish main 207,223,730
Turkish Web Turkish main 40,539,507
Turkish Web 2012 (trTenTen12) Turkish trial 4,124,558,200
Turkic web – Turkmen Turkmen trial 2,536,935
OPUS2 Ukrainian Ukrainian main 3,374,552
Ukrainian Web 2014 (uaTenTen14) Ukrainian trial 2,734,851,744
Urdu Web (UrduWaC) Urdu trial 60,808,847
Turkic web – Uzbek Uzbek trial 24,570,516
Vietnamese Web (VietnameseWaC) Vietnamese trial 129,781,089
Welsh web corpus Welsh main 62,753,279
Welsh Web 2013 (WelshWaC) Welsh trial 14,786,791
Yoruba Web 2015 (YorubaWaC15) Yoruba trial 3,500,353
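
Each row of the list above ends with a category and a token count, so those two fields can be recovered reliably by matching from the right. The following is a minimal sketch under that assumption; `Row` and `parse_row` are hypothetical names introduced for this example. Because both corpus names and language names may contain spaces (for example "Chinese Simplified"), the sketch leaves the name and language together in one field rather than guessing where to split them.

```python
import re
from typing import NamedTuple


class Row(NamedTuple):
    name_and_language: str  # corpus name followed by its language, unsplit
    category: str           # "main", "trial" or "open"
    tokens: int             # token count with thousands separators removed


# A row ends with "<category> <count>", where the count uses commas
# as thousands separators, e.g. "trial 743,954".
ROW_TAIL = re.compile(
    r"^(?P<head>.+) (?P<category>main|trial|open) (?P<tokens>\d{1,3}(?:,\d{3})*)$"
)


def parse_row(line: str) -> Row:
    """Split one listing row into its head, category and token count."""
    match = ROW_TAIL.match(line.strip())
    if match is None:
        raise ValueError(f"not a corpus row: {line!r}")
    tokens = int(match.group("tokens").replace(",", ""))
    return Row(match.group("head"), match.group("category"), tokens)
```

Parsed this way, rows can be filtered by category or have their token counts summed, for example `sum(parse_row(r).tokens for r in rows)` for a list of row strings `rows`.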