Reference corpora are used in term extraction. The corpus used as the source of the keywords has to be compared with a general (reference) corpus so that keywords and terms can be identified correctly. Below is a list of the reference corpora used for term extraction by default, together with whether part-of-speech tagging and word sketches are available for user corpora in each language. The user can change the reference corpus in the Keywords and Terms settings.

| Language | Reference corpus | Part-of-speech tagging available for user corpora | Word sketches available for user corpora |
|---|---|---|---|
| Afrikaans | OPUS2 Afrikaans (743,954 tokens; tagged; with word sketches) | no | yes |
| Albanian | OPUS2 Albanian (55,099,328 tokens; tagged; with word sketches) | no | no |
| Arabic | Arabic Web 2012 (arTenTen12, Stanford tagger) (8,322,097,229 tokens; tagged; with word sketches) | yes | yes |
| Azerbaijani | Turkic web – Azerbaijani (115,280,755 tokens) | no | no |
| Basque | Basque Web (BasqueWaC) (123,856,183 tokens; tagged; with word sketches) | no | yes |
| Bengali | Bengali Web (BengaliWaC) (13,719,158 tokens; tagged; with word sketches) | no | no |
| Bosnian | Bosnian Web 2014 (BosnianWaC14) (290,176,507 tokens; tagged) | no | no |
| Bulgarian | Bulgarian Web 2012 (bgTenTen12) (843,328,184 tokens; tagged; with word sketches) | yes | yes |
| Catalan | Catalan Web 2014 (caTenTen14) (4,777,786,899 tokens; tagged) | no | yes |
| Chinese Simplified | Chinese Web 2011 (zhTenTen11) (2,106,661,021 tokens; tagged; with word sketches) | yes | yes |
| Chinese Traditional | zhTenTen [2011] (2,106,661,021 tokens; tagged; with word sketches) | yes | yes |
| Croatian | Croatian Web 2014 (hrWaC14) (1,404,262,704 tokens; tagged; with word sketches) | yes | yes |
| Czech | Czech Web 2012 (czTenTen12 v8) (5,069,447,935 tokens; tagged; with word sketches) | yes | yes |
| Danish | Danish Web 2014, old version (2,395,139,491 tokens; tagged; with word sketches) | no | yes |
| Dutch | Dutch Web 2014 (nlTenTen14) (3,013,056,738 tokens; tagged; with word sketches) | yes | yes |
| English | English Web 2012 (enTenTen12) (12,968,375,937 tokens; tagged; with word sketches) | yes | yes |
| Estonian | Estonian Web 2013 (etTenTen13) (330,045,196 tokens; tagged; with word sketches) | yes | yes |
| Filipino | Filipino Web (FilipinoWaC) (31,845,404 tokens; tagged; with word sketches) | no | no |
| Finnish | Finnish Web 2014 (fiTenTen14, TreeTagger v2) (1,703,429,270 tokens; tagged; with word sketches) | yes | no |
| French | French Web 2012 (frTenTen12) (11,444,973,582 tokens; tagged; with word sketches) | yes | yes |
| Frisian | Western Frisian Web 2013 (FrisianWaC) (3,738,968 tokens) | no | no |
| Georgian | Georgian Web (georgianWaC) (63,632,861 tokens) | no | no |
| German | German Web 2013 (deTenTen13) (19,918,263,493 tokens; tagged; with word sketches) | yes | yes |
| Greek | Greek Web 2014 (elTenTen14) (1,958,348,129 tokens) | no | yes |
| Gujarati | Gujarati Web (GujarathiWaC) (22,201,247 tokens; tagged; with word sketches) | no | no |
| Hebrew | Hebrew Web 2014 (heTenTen14) (1,061,788,271 tokens) | yes | no |
| Hindi | Hindi Web (HindiWaC) (65,772,188 tokens; tagged; with word sketches) | no | no |
| Hungarian | Araneum Hungaricum Maius [2014] (1,200,001,609 tokens; tagged; with word sketches) | yes | no |
| Icelandic | Icelandic texts [sample] (9,968,822 tokens) | no | no |
| Igbo | Igbo Web 2015 (IgboWaC15) (396,276 tokens) | no | no |
| Indonesian | Indonesian Web (IndonesianWaC) (109,281,359 tokens; tagged; with word sketches) | no | no |
| Irish | New Corpus for Ireland (NCI Irish) (34,358,267 tokens; tagged; with word sketches) | no | yes |
| Italian | Italian Web 2010 (itTenTen) (3,076,908,415 tokens; tagged; with word sketches) | yes | yes |
| Japanese | Japanese Web 2011 (jpTenTen11 [LUW, sample]) (203,674,569 tokens; tagged; with word sketches) | yes | yes |
| Kazakh | Turkic web – Kazakh (175,445,327 tokens) | no | no |
| Korean | Korean Web 2012 (koTenTen12) (560,945,022 tokens; tagged) | yes | yes |
| Kyrgyz | Turkic web – Kyrgyz (24,084,100 tokens) | no | no |
| Latin | LatinISE historical corpus v2 (12,995,824 tokens; tagged; with word sketches) | no | yes |
| Latvian | Latvian web [2014] (658,585,131 tokens; tagged) | yes | yes |
| Lithuanian | Lithuanian Web 2014 (ltTenTen14) (981,517,649 tokens) | no | yes |
| Macedonian | OPUS2 Macedonian (49,066,513 tokens; tagged; with word sketches) | no | no |
| Malayalam | Malayalam Web Corpus (malayalamWaC) (21,193,984 tokens; tagged; with word sketches) | no | no |
| Malay | Malayan Web Corpus (MalayWaC) (230,509,568 tokens; tagged; with word sketches) | no | no |
| Maltese | Maltese MLRS Corpus (125,267,653 tokens; tagged; with word sketches) | no | no |
| Maori | Maori Web Corpus (MaoriWaC) (8,351,983 tokens) | no | no |
| Mongolian | none | no | no |
| Nepali | Nepali National Corpus (15,137,459 tokens; tagged) | no | no |
| Norwegian | Norwegian Web 2015 (noTenTen15; Bokmål and Nynorsk) (1,953,892,201 tokens; tagged; with word sketches) | no | yes |
| Persian | OPUS2 Persian (5,367,401 tokens; tagged; with word sketches) | no | yes |
| Polish | Polish Web 2012 (plTenTen12) (9,677,787,906 tokens; tagged; with word sketches) | yes | yes |
| Portuguese | ptTenTen11 (Freeling, v3) (4,637,901,353 tokens; tagged; with word sketches) | yes | yes |
| Romanian | Romanian Web (roWaC) (53,457,522 tokens; tagged; with word sketches) | yes | yes |
| Russian | Russian Web 2011 (ruTenTen11) (18,280,486,876 tokens; tagged; with word sketches) | yes | yes |
| Samoan | Samoan Web corpus (SamoanWac1) (3,583,362 tokens) | no | no |
| Scottish Gaelic | Scottish Gaelic Wiki corpus (gdWiki) (1,223,562 tokens) | no | no |
| Serbian | Serbian Web 2014 (srWaC14) (561,529,963 tokens; tagged) | yes | yes |
| Setswana | Setswana/Tswana Web (SetswanaWaC v2) (13,511,692 tokens; tagged; with word sketches) | no | no |
| Slovak | Araneum Slovacum Maius [2013] (1,200,005,746 tokens; tagged; with word sketches) | yes | yes |
| Slovenian | Slovenian reference corpus (FidaPLUS v2) (738,503,145 tokens; tagged; with word sketches) | yes | yes |
| Spanish | Spanish Web 2011 (esTenTen11, Eu + Am, Freeling v4) (10,994,616,207 tokens; tagged; with word sketches) | yes | yes |
| Swahili | Swahili Web (SwahiliWaC) (21,359,529 tokens; tagged; with word sketches) | yes | no |
| Swedish | Swedish Web 2014 (svTenTen14) (3,900,846,988 tokens; tagged; with word sketches) | yes | yes |
| Tajik | Tajik Web (TajikWaC) (109,805,133 tokens; tagged) | no | no |
| Tamil | Tamil Web (TamilWaC) (32,861,569 tokens; tagged; with word sketches) | no | no |
| Tatar | Tatar Web Corpus sample (290,351 tokens) | no | no |
| Telugu | Telugu Web (TeluguWaC) (4,697,932 tokens; tagged; with word sketches) | no | no |
| Thai | Thai Web (ThaiWaC) (108,013,897 tokens; tagged; with word sketches) | no | no |
| Turkish | Turkish Web 2012 (trTenTen12) (4,124,558,200 tokens) | no | no |
| Turkmen | Turkic web – Turkmen (2,536,935 tokens) | no | no |
| Ukrainian | Ukrainian Web 2014 (uaTenTen14) (2,734,851,744 tokens) | no | no |
| Urdu | Urdu Web Corpus (UrduWaC) (60,808,847 tokens) | no | no |
| Uzbek | Turkic web – Uzbek (24,570,516 tokens) | no | no |
| Vietnamese | Vietnamese Web Corpus (VietnameseWaC) (129,781,089 tokens; tagged; with word sketches) | no | yes |
| Welsh | WelshWaC (14,786,791 tokens; tagged; with word sketches) | no | no |
| Yoruba | Yoruba WaC [2015] (3,500,353 tokens) | no | no |
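
To illustrate why keyword extraction needs a reference corpus, here is a minimal sketch of comparing a user corpus with a general one. It assumes a smoothed ratio of per-million frequencies as the score; this page does not state which formula is actually used, so the scoring rule, the function names `per_million` and `keyness`, and the `smoothing` parameter are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def per_million(counts):
    """Convert raw counts to frequencies per million tokens."""
    total = sum(counts.values())
    return {word: n * 1_000_000 / total for word, n in counts.items()}


def keyness(focus_tokens, reference_tokens, smoothing=1.0):
    """Rank words of the focus corpus by how much more frequent they are
    there than in the reference corpus.

    Assumption: score = (fpm_focus + smoothing) / (fpm_reference + smoothing).
    The smoothing constant avoids division by zero for words that are
    absent from the reference corpus.
    """
    focus_fpm = per_million(Counter(focus_tokens))
    ref_fpm = per_million(Counter(reference_tokens))
    scores = {
        word: (fpm + smoothing) / (ref_fpm.get(word, 0.0) + smoothing)
        for word, fpm in focus_fpm.items()
    }
    # Highest score first: these are the keyword candidates.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data standing in for a user corpus and a general reference corpus.
focus = "the corpus tagger the corpus lemma".split()
reference = "the cat sat on the mat the end".split()
ranked = keyness(focus, reference)
```

With this toy data, `corpus` is ranked first because it is frequent in the focus text and absent from the reference text, while `the` scores below 1 because it is proportionally more frequent in the reference text. Without the reference corpus, `the` and `corpus` would be indistinguishable by frequency alone.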