Overview of text corpora publicly available in Sketch Engine

A corpus counts as public if it is available to trial or paying subscribers, or if it can be accessed through the open access interface. In addition to these corpora, Sketch Engine holds other corpora whose access is restricted by copyright regulations or which are owned and controlled by third parties.

Category

main – corpora available only for regular (paying) users

trial – corpora available for both trial and regular users

open – corpora available without registration

Click a corpus name for full details.

Name Language Category Number of tokens
CHILDES Afrikaans Corpus Afrikaans main 33,134
OPUS2 Afrikaans Afrikaans trial 743,954
OPUS2 Albanian Albanian trial 55,099,328
Arabic Web Arabic main 174,239,600
KSUCCA (Classical Arabic) Arabic main 59,693,146
Arabic Learner Corpus (ALC) Arabic main 386,583
Arabic Web 2012 sample 115M (arTenTen12, Mada tagger) Arabic main 131,159,731
OPUS2 Arabic Arabic main 406,527,277
Quran annotated corpus [unvowelled Arabic] Arabic main 128,243
Quran annotated corpus [unvowelled Latin] Arabic main 128,243
Quran annotated corpus [vowelled Arabic] Arabic main 128,243
Quran annotated corpus [vowelled Latin] Arabic main 128,243
Arabic Web 2012 (arTenTen12, Stanford tagger) Arabic trial 8,322,097,229
Turkic web – Azerbaijani Azerbaijani trial 115,280,755
Bengali Web (BengaliWaC) Bengali trial 13,719,158
OPUS2 Bosnian Bosnian main 55,224,138
Bosnian Web 2014 (BosnianWaC14) Bosnian trial 290,176,507
Bulgarian National Corpus (BulgarianNC) Bulgarian main 26,518,884
Bulgarian National Corpus nonweb genres Bulgarian main 27,721,533
Bulgarian National Corpus with web Bulgarian main 545,637,740
DGT, Bulgarian Bulgarian main 32,778,982
EUR-Lex Bulgarian 2/2016 Bulgarian main 457,463,831
OPUS2 Bulgarian Bulgarian main 238,945,836
Bulgarian Web 2012 (bgTenTen12) Bulgarian trial 846,834,715
EUROPARL7, Bulgarian Bulgarian trial 10,602,635
CHILDES Catalan Corpus Catalan main 277,816
Catalan Web 2014 (caTenTen14) Catalan trial 4,777,786,899
Chinese GigaWord 2 Corpus: Mainland, simplified Chinese Simplified main 250,124,230
Chinese Web (Internet-ZH) Chinese Simplified main 277,931,664
OPUS2 Chinese Simplified Chinese Simplified main 299,338,099
Chinese Web 2011 (zhTenTen11, sample 10M) Chinese Simplified main 11,028,308
Chinese Web 2011 (zhTenTen11) Chinese Simplified trial 2,106,661,021
Chinese GigaWord 2 Corpus: Taiwan, traditional Chinese Traditional main 455,526,209
Chinese Traditional Web (TaiwanWaC) Chinese Traditional main 349,198,060
Chinese Traditional Web (TaiwanWaC, Universal Sketch Grammar) Chinese Traditional main 349,198,060
OPUS2 Chinese Traditional Chinese Traditional main 622,382
CHILDES Croatian Corpus Croatian main 389,674
DGT, Croatian Croatian main 5,123,494
EUR-Lex Croatian 2/2016 Croatian main 156,309,317
OPUS2 Croatian Croatian main 156,942,211
Croatian Web 2014 (hrWaC14) Croatian trial 1,404,262,704
Czech news and web 1995–2002 (czes2) Czech main 458,225,771
Czech Web 2012 (czTenTen12 v8, sample) Czech main 64,607,138
DGT, Czech Czech main 57,094,285
EUR-Lex Czech 2/2016 Czech main 501,361,784
OPUS2 Czech Czech main 275,519,334
Czech Web 2012 (czTenTen12 v8) Czech trial 5,069,447,935
EUROPARL7, Czech Czech trial 15,290,586
CHILDES Danish Corpus Danish main 372,811
Danish Web (DanishWaC) Danish main 353,703,002
DGT, Danish Danish main 58,810,703
EUR-Lex Danish 2/2016 Danish main 731,423,452
OPUS2 Danish Danish main 153,261,335
Danish Web 2014 (daTenTen14) Danish trial 2,395,139,491
EUROPARL7, Danish Danish trial 55,794,038
CHILDES Dutch Corpus Dutch main 7,592,039
DGT, Dutch Dutch main 62,654,517
EUR-Lex Dutch 2/2016 Dutch main 783,154,917
Araneum Nederlandicum Maius [2013] Dutch main 1,200,000,837
OPUS2 Dutch Dutch main 446,240,037
EUROPARL7, Dutch Dutch trial 59,756,704
Dutch Web 2014 (nlTenTen14) Dutch trial 3,013,056,738
British Law Report Corpus English main 10,036,051
Brown Family, CLAWS + TreeTagger tags English main 8,073,482
Brown Family English main 8,099,732
CHILDES English Corpus English main 29,480,736
DGT, English English main 74,365,007
Early English Books Online (EEBO) English main 987,242,247
e-flux English main 6,238,592
Araneum Anglicum Africanum Maius [2015] English main 1,200,000,194
Araneum Anglicum Asiaticum Maius [2015] English main 1,200,000,489
English Jozef Stefan Institute Newsfeed English main 9,287,536,698
English Web 2012 (enTenTen12, sample 40M) English main 40,920,950
English Wikipedia English main 1,632,582,504
EUR-Lex English 2/2016 English main 845,040,420
FeedCorpus v6 English main 640,820,898
Project Gutenberg English English main 529,531,582
London English Corpus English main 2,959,320
LEXMCI English main 1,720,056,987
New Model Corpus English main 114,627,650
NCI English English main 257,900,777
Open American National Corpus (spoken) English main 3,369,613
Open American National Corpus (written) English main 13,572,382
OPUS2 English English main 1,441,844,046
pukWaC English main 46,256,586
ScienceBlog English main 122,942,494
SiBol/Port English main 387,585,716
English Corpus for SkELL 3.3 English main 1,520,438,256
English Corpus for SkELL 3.4 English main 1,316,028,475
Susanne English main 150,426
TED_en English main 3,421,262
ukWaC English main 1,559,716,979
UKWaC super sensed English main 370,023,634
ACL Anthology Reference Corpus (ARC) English open 49,348,397
British Academic Spoken English Corpus (BASE) English open 1,252,256
British Academic Written English Corpus (BAWE) English open 8,336,262
Brown English open 1,175,675
British National Corpus (BNC), tagged by CLAWS English trial 112,181,015
British National Corpus (BNC) English trial 112,289,776
Araneum Anglicum Maius [2015] English trial 1,200,023,361
English Web 2013 (enTenTen13) English trial 22,728,686,012
EUROPARL7, English English trial 60,741,877
CHILDES Estonian Corpus Estonian main 399,547
DGT, Estonian Estonian main 46,445,829
EstonianNC Estonian main 563,220,548
EstonianRC Estonian main 249,923,332
EUR-Lex Estonian 2/2016 Estonian main 437,435,453
OPUS2 Estonian Estonian main 88,432,596
Estonian Web 2013 (etTenTen13) Estonian trial 330,045,196
EUROPARL7, Estonian Estonian trial 13,162,640
Filipino Web Corpus (FilipinoWaC) Filipino trial 31,845,404
Philippine Web corpus (philippineWaC16) Filipino trial 40,302,836
DGT, Finnish Finnish main 47,397,459
EUR-Lex Finnish 2/2016 Finnish main 558,884,960
Araneum Finnicum Maius [2014] Finnish main 1,200,000,486
fiTenTen [2014] Finnish main 1,706,310,900
OPUS2 Finnish Finnish main 180,134,681
EUROPARL7, Finnish Finnish trial 40,979,520
Finnish Web 2014 (fiTenTen14, TreeTagger v2) Finnish trial 1,703,429,270
CHILDES French Corpus French main 3,287,017
DGT, French French main 70,602,745
EUR-Lex French 2/2016 French main 920,640,086
Frantext (copyright-free part only) French main 26,265,698
Araneum Francogallicum Maius [2015] French main 1,200,004,721
frTenTen [2012, sample] French main 39,472,639
OPUS2 French French main 956,614,852
French web corpus French main 126,850,281
EUROPARL7, French French trial 66,661,141
French Web 2012 (frTenTen12) French trial 11,444,973,582
Frisian web corpus (FrisianWaC) Frisian trial 3,738,968
Georgian Web Corpus (georgianWaC) Georgian trial 63,632,861
Araneum Germanicum Maius [2013] German main 1,200,000,146
deTenTen [2013, sample] German main 65,804,983
deTenTen10 (simple WS) German main 2,844,839,761
deWaC German main 1,627,169,557
DGT, German German main 58,319,542
EUR-Lex German 2/2016 German main 738,563,342
GerManC German main 800,783
OPUS2 German German main 157,849,124
Parsed DeWaC (sDeWaC) German main 886,661,231
German Web 2013 (deTenTen13) German trial 19,918,263,493
EUROPARL7, German German trial 54,899,037
DGT, Greek Greek main 64,538,668
EUR-Lex Greek 2/2016 Greek main 775,079,501
GkWaC Greek main 149,067,023
OPUS2 Greek Greek main 305,404,357
Greek Web 2014 (elTenTen14) Greek trial 1,958,348,129
EUROPARL7, Greek Greek trial 44,097,921
Gujarati Web Corpus (GujarathiWaC) Gujarati trial 22,201,247
CHILDES Hebrew Corpus Hebrew main 1,034,238
HebrewGC Hebrew main 192,119,449
HebWaC Hebrew main 60,351,738
OPUS2 Hebrew Hebrew main 252,278,074
Hebrew Web 2014 (heTenTen2014) Hebrew trial 1,061,788,271
OPUS2 Hindi Hindi main 1,642,973
Hindi Web Corpus (HindiWaC) Hindi trial 65,772,188
CHILDES Hungarian Corpus Hungarian main 311,543
DGT, Hungarian Hungarian main 55,276,730
EUR-Lex Hungarian 2/2016 Hungarian main 499,799,589
huTenTen12 Hungarian main 3,184,161,466
OPUS2 Hungarian Hungarian main 218,409,426
EUROPARL7, Hungarian Hungarian trial 14,655,015
Araneum Hungaricum Maius [2014] Hungarian trial 1,200,001,609
Icelandic texts [sample] Icelandic trial 9,968,822
Igbo Web corpus (IgboWaC15) Igbo trial 396,276
Indonesian Web Corpus (IndonesianWaC) Indonesian trial 109,281,359
CHILDES Gaelic Corpus Irish main 20,823
DGT, Irish Irish main 1,251,732
EUR-Lex Irish 2/2016 Irish main 37,467,080
New Corpus for Ireland (NCI Irish) Irish trial 34,358,267
CHILDES Italian Corpus Italian main 572,217
DGT, Italian Italian main 65,936,285
EUR-Lex Italian 2/2016 Italian main 829,319,312
Araneum Italicum Maius (Italian, 14.12) 1,20 G Italian main 1,200,000,174
itTenTen [2010, sample] Italian main 48,904,255
itWaC Italian main 1,909,535,703
OPUS2 Italian Italian main 231,143,960
EUROPARL7, Italian Italian trial 59,177,399
Italian Web 2010 (itTenTen) Italian trial 3,076,908,415
CHILDES Japanese Corpus Japanese main 2,187,308
jpTenTen11 [LUW, sample] with term grammar Japanese main 203,674,569
JpWaC Japanese main 413,310,996
OPUS2 Japanese Japanese main 6,596,733
Japanese Web 2011 (jpTenTen11) Japanese trial 10,321,875,664
Japanese Web 2011 (jpTenTen11 [LUW, sample]) Japanese trial 203,674,569
Turkic web – Kazakh Kazakh trial 175,445,327
CHILDES Korean Corpus Korean main 53,339
OPUS2 Korean Korean main 500,152
Korean Web 2012 (koTenTen12) Korean trial 560,945,022
Turkic web – Kyrgyz Kyrgyz trial 24,084,100
LatinISE historical corpus v2 Latin trial 12,995,824
DGT, Latvian Latvian main 54,287,472
EUR-Lex Latvian 2/2016 Latvian main 491,388,506
LatvianWaC Latvian main 74,447,302
OPUS2 Latvian Latvian main 34,012,690
EUROPARL7, Latvian Latvian trial 14,253,247
Latvian Web 2014 (lvTenTen14) Latvian trial 657,522,048
DGT, Lithuanian Lithuanian main 52,155,372
EUR-Lex Lithuanian 2/2016 Lithuanian main 476,891,405
LithuanianWaC Lithuanian main 63,645,700
OPUS2 Lithuanian Lithuanian main 40,933,573
EUROPARL7, Lithuanian Lithuanian trial 13,733,247
Lithuanian Web 2014 (ltTenTen14) Lithuanian trial 981,517,649
OPUS2 Macedonian Macedonian trial 49,066,513
Malayalam Web Corpus (malayalamWaC) Malayalam trial 21,193,984
Malayan Web Corpus (MalayWaC) Malay trial 230,509,568
DGT, Maltese Maltese main 30,172,433
EUR-Lex Maltese 2/2016 Maltese main 466,854,303
Maltese MLRS Corpus Maltese trial 125,267,653
Maori Web Corpus (MaoriWaC) Maori trial 8,351,983
Mongolian Web Texts 2016 (mnWaC16) Mongolian trial 7,540,919
NepaliWaC Nepali main 1,464,492
Nepali National Corpus Nepali trial 15,137,459
CHILDES Norwegian Corpus Norwegian main 61,075
Nynorskkorpuset Norwegian main 87,228,361
OPUS2 Norwegian Norwegian main 26,467,755
Norwegian Web 2015 (noTenTen15) Norwegian trial 1,953,892,201
Kannada Web (KannadaWaC) -- other (UTF-8) -- main 16,031,481
CHILDES Farsi Corpus Persian main 150,505
TalkBank Persian Persian main 549,165,952
OPUS2 Persian Persian trial 5,367,401
BIBLE Polish, Swahili-Polish Polish main 169,934
CHILDES Polish Corpus Polish main 1,247,919
DGT, Polish Polish main 58,520,395
EUR-Lex Polish 2/2016 Polish main 510,957,144
OPUS2 Polish Polish main 285,188,755
Araneum Polonicum Maius [2013] Polish main 1,110,120,694
plTenTen12 [sample] Polish main 55,381,476
Polish Web Corpus Polish main 128,185,119
EUROPARL7, Polish Polish trial 15,171,493
Polish Web 2012 (plTenTen12) Polish trial 9,677,787,906
Corpus Brasileiro (CB) Portuguese main 1,133,416,757
CHILDES Portuguese Corpus Portuguese main 245,805
DGT, Portuguese Portuguese main 65,967,069
EUR-Lex Portuguese 2/2016 Portuguese main 801,597,194
OPUS2 Brazilian Portuguese Portuguese main 355,049,778
OPUS2 Portuguese Portuguese main 377,677,225
Cetenfolha, Cetempublico Portuguese main 66,319,147
Araneum Portugallicum Maius [2015] Portuguese main 1,200,006,068
ptTenTen11 [Freeling, sample] Portuguese main 44,446,042
ptTenTen [2011, Palavras parsed] Portuguese main 3,245,834,337
EUROPARL7, Portuguese Portuguese trial 61,414,188
Portuguese Web 2011 (ptTenTen11, Freeling v3) Portuguese trial 4,626,584,246
DGT, Romanian Romanian main 33,395,126
EUR-Lex Romanian 2/2016 Romanian main 461,819,855
OPUS2 Romanian Romanian main 360,212,949
EUROPARL7, Romanian Romanian trial 10,795,858
Romanian web corpus (roWaC) Romanian trial 53,457,522
Romanian web [2016] Romanian trial 3,142,636,172
CHILDES Russian Corpus Russian main 59,759
OPUS2 Russian Russian main 381,468,257
Araneum Russicum Maius (Russian, 15.02) 1,20 G Russian main 1,200,001,911
Araneum Russicum Externum Maius (non-Russia Russian, 15.03) 1,20 G Russian main 1,200,053,619
Araneum Russicum Maius [2013] Russian main 1,216,800,424
Russian web corpus Russian main 187,965,822
Araneum Russicum Russicum Maius (Russia-only Russian, 15.03) 1,20 G Russian trial 1,200,000,258
Russian Web 2011 (ruTenTen11) Russian trial 18,280,486,876
Samoan Web corpus (SamoanWac1) Samoan trial 3,583,362
Scottish Gaelic Wiki corpus (gdWiki) Scottish Gaelic trial 1,223,562
OPUS2 Serbian Serbian main 198,141,613
Serbian Web Corpus (srWaC14) Serbian trial 561,529,963
Setswana/Tswana Web Corpus (SetswanaWaC v2) Setswana trial 13,511,692
DGT, Slovak Slovak main 56,095,893
EUR-Lex Slovak 2/2016 Slovak main 366,709,333
OPUS2 Slovak Slovak main 82,952,296
skTenTen [2011] Slovak main 802,785,426
skTenTen11 Slovak main 876,003,720
EUROPARL7, Slovak Slovak trial 15,042,066
Araneum Slovacum Maius [2013] Slovak trial 1,200,005,746
DGT, Slovenian Slovenian main 57,009,023
EUR-Lex Slovenian 2/2016 Slovenian main 509,063,338
Lektor Slovenian main 1,244,029
Kres Slovenian main 120,447,573
OPUS2 Slovenian Slovenian main 163,160,520
EUROPARL7, Slovenian Slovenian trial 14,616,666
Slovenian reference corpus (FidaPLUS v2) Slovenian trial 738,503,145
CHILDES Spanish Corpus Spanish main 1,358,475
DGT, Spanish Spanish main 68,721,827
Araneum Hispanicum Maius [2013] Spanish main 1,200,000,609
esTenTen [2011, Eu + Am, Freeling v4, sample] Spanish main 73,597,801
EUR-Lex Spanish 2/2016 Spanish main 836,039,928
OPUS2 Spanish Spanish main 870,615,999
Spanish web corpus Spanish main 116,900,060
American Spanish Web 2011 (esamTenTen11, Freeling v4) Spanish trial 8,640,399,540
European Spanish Web 2011 (eseuTenTen11, Freeling v4) Spanish trial 2,354,216,667
Spanish Web 2011 (esTenTen11, Eu + Am, Freeling v4) Spanish trial 10,994,616,207
EUROPARL7, Spanish Spanish trial 60,862,330
BIBLE Swahili, Swahili-Polish Swahili main 169,612
Swahili Web Corpus (SwahiliWaC) Swahili trial 21,359,529
CHILDES Swedish Corpus Swedish main 665,889
DGT, Swedish Swedish main 55,407,291
EUR-Lex Swedish 2/2016 Swedish main 640,815,888
OPUS2 Swedish Swedish main 128,245,911
SwedishParole Swedish main 25,731,328
SwedishWaC Swedish main 114,022,801
EUROPARL7, Swedish Swedish trial 51,759,122
Swedish Web 2014 (svTenTen14) Swedish trial 3,900,846,988
Tajik Web (TajikWaC) Tajik trial 109,805,133
CHILDES Tamil Corpus Tamil main 21,865
Tamil Web Corpus (TamilWaC) Tamil trial 32,861,569
Tatar Web Corpus sample Tatar trial 290,351
Telugu Web Corpus (TeluguWaC) Telugu trial 4,697,932
CHILDES Thai Corpus Thai main 299,962
Thai Web Corpus (ThaiWaC) Thai trial 108,013,897
CHILDES Turkish Corpus Turkish main 233,097
OPUS2 Turkish Turkish main 207,223,730
TurkishWaC Turkish main 40,539,507
Turkish Web 2012 (trTenTen12) Turkish trial 4,124,558,200
Turkic web – Turkmen Turkmen trial 2,536,935
OPUS2 Ukrainian Ukrainian main 3,374,552
Ukrainian Web 2014 (uaTenTen14) Ukrainian trial 2,734,851,744
Urdu Web Corpus (UrduWaC) Urdu trial 60,808,847
Turkic web – Uzbek Uzbek trial 24,570,516
Vietnamese Web Corpus (VietnameseWaC) Vietnamese trial 129,781,089
Welsh corpus Welsh main 62,753,279
WelshWaC Welsh trial 14,786,791
Yoruba WaC [2015] Yoruba trial 3,500,353
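
For readers who want to process the listing above programmatically, the following is a minimal sketch in Python. It assumes each row is a single line whose last two space-separated fields are the category (main, trial or open) and the comma-grouped token count. The function names parse_row and tokens_per_category are hypothetical and not part of any Sketch Engine tool. Because corpus names and language names can both contain spaces, the sketch keeps them together in one field rather than guessing where the name ends.

```python
from collections import Counter

# Categories as defined in the legend at the top of this page.
CATEGORIES = {"main", "trial", "open"}


def parse_row(line):
    """Split one listing row into (name_and_language, category, tokens)."""
    # The category and the token count are the last two fields of a row.
    name_and_language, category, tokens = line.rsplit(" ", 2)
    if category not in CATEGORIES:
        raise ValueError(f"unexpected category: {category!r}")
    # Token counts use commas as thousands separators.
    return name_and_language, category, int(tokens.replace(",", ""))


def tokens_per_category(lines):
    """Sum the token counts of the given rows for each category."""
    totals = Counter()
    for line in lines:
        _, category, tokens = parse_row(line)
        totals[category] += tokens
    return totals


# Three rows copied from the listing above.
rows = [
    "CHILDES Afrikaans Corpus Afrikaans main 33,134",
    "OPUS2 Afrikaans Afrikaans trial 743,954",
    "Brown English open 1,175,675",
]
totals = tokens_per_category(rows)
```

Separating the corpus name from the language would need an explicit list of language names, since neither field has a fixed number of words.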