Reference corpora are used in keyword and term extraction. The corpus from which keywords are extracted is compared to a general reference corpus, so that words and phrases typical of the source corpus can be told apart from those that are common in the language as a whole. The following list gives the default reference corpus for each language in Sketch Engine. The user can select a different reference corpus in the Keywords and Terms settings.
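To illustrate why the comparison needs a reference corpus, here is a minimal sketch of one common way to score a candidate keyword: a smoothed ratio of frequencies normalized per million tokens. The function names, the `smoothing` parameter, and the counts in the example are assumptions made for illustration; this is not presented as the exact computation Sketch Engine performs.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalize a raw frequency to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


def keyness(source_count: int, source_size: int,
            ref_count: int, ref_size: int,
            smoothing: float = 1.0) -> float:
    """Smoothed ratio of normalized frequencies (illustrative formula).

    Scores above 1 mean the item is relatively more frequent in the
    source corpus than in the reference corpus. The smoothing constant
    avoids division by zero for items absent from the reference corpus.
    """
    source_fpm = per_million(source_count, source_size)
    ref_fpm = per_million(ref_count, ref_size)
    return (source_fpm + smoothing) / (ref_fpm + smoothing)


# Hypothetical counts: 50 hits in a 100,000-token source corpus
# versus 1,000 hits in a 1,000,000,000-token reference corpus.
score = keyness(50, 100_000, 1_000, 1_000_000_000)
print(score)
```

Because both frequencies are normalized, the very different sizes of the two corpora do not distort the score; a larger reference corpus mainly makes the reference frequency estimate more reliable.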

| Language | Default reference corpus | Words |
|---|---|---:|
| Afrikaans | Afrikaans Wikipedia 2022 | 22,227,137 |
| Albanian | Albanian Web 2020 (sqTenTen20) | 528,084,150 |
| Amharic | Amharic Web 2013-17 (amWaC17) | 25,975,846 |
| Arabic | Arabic Web 2018 (arTenTen18) | 4,637,956,234 |
| Armenian | Armenian Wikipedia corpus 2020 (hywiki20) | 51,349,694 |
| Assamese | Assamese Wikipedia 2023 (asWiki23) | 2,581,684 |
| Azerbaijani | Turkic web – Azerbaijani | 94,267,206 |
| Bashkir | Bashkir Drama Corpus | 18,723 |
| Basque | Basque Web (BasqueWaC v2) | 99,719,584 |
| Belarusian | Belarusian Web 2016 (beTenTen16) | 63,327,264 |
| Bengali | Bengali Web 2021 (bnTenTen21) | 470,732,738 |
| Bosnian | Bosnian Web (bsWaC 1.2) | 248,478,730 |
| Breton | OpenSubtitles 2018 parallel – Breton | 85,503 |
| Bulgarian | Bulgarian Web 2012 (bgTenTen12) | 705,156,683 |
| Cantonese | Cantonese Web (CantoneseWaC) | 30,898,663 |
| Catalan | Catalan Web 2014 (caTenTen14) | 182,608,420 |
| Chinese Simplified | Chinese Web 2017 (zhTenTen17) Simplified | 13,531,331,169 |
| Chinese Traditional | Chinese Web 2017 (zhTenTen17) Traditional | 2,400,405,372 |
| Crimean Tatar | National corpus of Crimean Tatar language (beta) | 2,697,093 |
| Croatian | Croatian Web (hrWaC 2.2, RFTagger) | 1,211,328,660 |
| Czech | Czech Web 2023 (csTenTen23) | 4,456,427,977 |
| Danish | Danish Web 2020 (daTenTen20) | 3,480,275,804 |
| Dutch | Dutch Web 2020 (nlTenTen20) | 5,890,009,964 |
| English | English Web 2021 (enTenTen21) | 52,268,286,493 |
| Estonian | Estonian Web 2021 (etTenTen21) | 725,832,092 |
| Filipino | Tagalog (Filipino) Web 2019 (tlTenTen19) | 198,303,250 |
| Finnish | Finnish Web 2014 (fiTenTen14) | 1,404,083,812 |
| French | French Web 2023 (frTenTen23) | 23,874,070,858 |
| Frisian | Western Frisian Web 2013 (FrisianWaC) | 3,116,119 |
| Georgian | Georgian Web 2013 (kaWaC) | 50,713,604 |
| German | German Web 2020 (deTenTen20) | 17,512,733,172 |
| Greek | Greek Web 2019 (elTenTen19) | 2,342,091,029 |
| Gujarati | Gujarati Web 2021 (guTenTen21) | 88,574,710 |
| Hausa (Boko) | Hausa Web 2015 (hausaWaC15) | 5,304,300 |
| Hebrew | Hebrew Web 2021 (heTenTen21) | 2,775,686,699 |
| Hindi | Hindi Web 2021 (hiTenTen21) | 792,395,313 |
| Hungarian | Hungarian Web 2020 (huTenTen20) | 5,164,717,029 |
| Icelandic | Icelandic Web 2020 (isTenTen20) | 518,620,759 |
| Igbo | Igbo Web 2015 (IgboWaC15) | 331,042 |
| Indonesian | Indonesian Web (IndonesianWaC) | 90,120,046 |
| Irish | Irish Web 2022 (gaTenTen22) | 125,040,541 |
| Italian | Italian Web 2020 (itTenTen20) | 12,451,734,885 |
| Japanese | Japanese Web 2011 sample (jaTenTen11, LUW) | 163,837,764 |
| Kannada | Kannada Web 2012 (knWaC12) | 11,056,526 |
| Kazakh | Turkic web – Kazakh | 139,417,763 |
| Khmer | Khmer Web 2018 (kmTenTen18) | 16,500,379 |
| Korean | Korean Web 2018 (koTenTen18) | 1,668,851,720 |
| Kyrgyz | Turkic web – Kyrgyz | 19,369,507 |
| Lao | Lao Web 2019 (loTenTen19) | 105,018,584 |
| Latin | LatinISE historical corpus v2.2 | 11,036,900 |
| Latvian | Latvian Web 2014 (lvTenTen14) | 530,367,474 |
| Lithuanian | Lithuanian Web 2014 (ltTenTen14) | 778,151,979 |
| Macedonian | MaCoCu Macedonian Web v2 (2021) | 512,171,886 |
| Malay | Malay Web 2020 (msTenTen20) | 296,419,465 |
| Malayalam | Malayalam Web (malayalamWaC) | 15,950,663 |
| Maldivian | Maldivian Wikipedia corpus 2019 (dvwiki) | 548,211 |
| Maltese | Maltese MLRS Corpus | 110,714,844 |
| Maori | Maori Web 2013 and 2020 (miTenTen20) | 11,814,825 |
| Nepali | Nepali National Corpus | 13,440,835 |
| Norwegian | Norwegian Web 2017 (noTenTen17, Bokmål) | 2,461,704,417 |
| Norwegian Bokmål | Norwegian Web 2017 (noTenTen17, Bokmål) | 2,461,704,417 |
| Norwegian Nynorsk | Norwegian Web 2017 (noTenTen17, Nynorsk) | 169,145,386 |
| Oromo | Oromo Web 2016 (orWaC16) | 4,249,953 |
| Persian | TalkBank Persian (blog posts) | 269,753,238 |
| Polish | Polish Web 2019 (plTenTen19) | 4,253,636,443 |
| Portuguese | Portuguese Web 2020 (ptTenTen20) | 12,578,775,252 |
| Punjabi (Gurmukhi) | Western Punjabi Web 2017 in Shahmukhi script (pnbTenTen17) | 2,806,904 |
| Romanian | Romanian Web 2021 (roTenTen21) | 2,763,173,824 |
| Russian | Russian Web 2017 (ruTenTen17) | 9,034,837,939 |
| Samoan | Samoan Web (SamoanWac1) | 3,115,385 |
| Scottish Gaelic | Scottish Gaelic Wiki 2015 (gdWiki) | 980,026 |
| Serbian | Serbian Web (srWaC 1.2 processed by Hunpos) | 477,724,164 |
| Serbian (Latin) | Serbian Web (srWaC 1.2 processed by RFTagger v1) | 441,888,202 |
| Setswana | Setswana/Tswana Web (SetswanaWaC v2) | 11,496,687 |
| Sinhalese | OpenSubtitles 2018 parallel – Sinhalese | 3,430,727 |
| Slovak | Slovak Web 2023 (skTenTen23) | 898,031,101 |
| Slovenian | Slovenian Web 2015 (slTenTen15, TreeTagger v2) | 829,544,337 |
| Somali | Somali Web 2016 (soWaC16) | 71,871,585 |
| Spanish | Spanish Web 2018 (esTenTen18) | 16,953,735,742 |
| Swahili | Swahili Web 2014 (swWaC) | 17,882,483 |
| Swedish | Swedish Web 2014 (svTenTen14) | 3,401,035,817 |
| Tagalog | Tagalog (Filipino) Web 2019 (tlTenTen19) | 198,303,250 |
| Tajik | Tajik Web (TajikWaC) | 93,151,897 |
| Tamil | Tamil Web 2021 (taTenTen21) | 823,837,031 |
| Tatar | Tatar Mixed Corpus | 102,779,803 |
| Telugu | Telugu Web (TeluguWaC) | 3,691,203 |
| Thai | Thai Web 2018 (thTenTen18) | 640,530,227 |
| Tigrinya | Tigrinya Web 2016 (tiWaC16) | 2,087,613 |
| Turkish | Turkish Web 2020 (trTenTen20) | 4,980,168,485 |
| Turkmen | Turkic web – Turkmen | 2,105,359 |
| Ukrainian | Ukrainian Web 2022 (ukTenTen22) | 7,594,784,148 |
| Urdu | Urdu Web (UrduWaC) | 53,269,273 |
| Uzbek | Turkic web – Uzbek | 18,720,334 |
| Vietnamese | Vietnamese Web (viWaC) | 106,664,817 |
| Welsh | Welsh Web 2013 (WelshWaC) | 12,458,397 |
| Yoruba | Yoruba Web 2015 (YorubaWaC15) | 2,816,965 |
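
The default-with-override behavior described in the introduction can be sketched as a plain lookup. This is only an illustration of the logic, not Sketch Engine's API: the dictionary `DEFAULT_REFERENCE_CORPORA` and the function `reference_corpus` are hypothetical names, and the dictionary holds just three entries copied from the list above.

```python
from typing import Optional

# A small excerpt of the list above (hypothetical structure, for illustration).
DEFAULT_REFERENCE_CORPORA = {
    "Czech": "Czech Web 2023 (csTenTen23)",
    "English": "English Web 2021 (enTenTen21)",
    "French": "French Web 2023 (frTenTen23)",
}


def reference_corpus(language: str, override: Optional[str] = None) -> str:
    """Return the user's chosen reference corpus, else the language default."""
    if override:
        return override
    try:
        return DEFAULT_REFERENCE_CORPORA[language]
    except KeyError:
        raise ValueError(f"no default reference corpus listed for {language!r}")


print(reference_corpus("English"))
print(reference_corpus("English", override="My own general corpus"))
```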