Reference corpora are used in term extraction. To identify keywords and terms correctly, the corpus from which they are extracted must be compared with a general reference corpus. The following reference corpora are used for term extraction by default. Users can change the reference corpus in the Keywords and Terms settings.

Reference corpus                                    Language             Tokens
German Web 2013 (deTenTen13, RFTagger v2)           German               19,918,263,493
Russian Web 2011 sample (ruTenTen11)                Russian              1,253,892,814
Polish Web 2012 (plTenTen12)                        Polish               9,387,142,186
European Spanish Web 2011 (eseuTenTen11)            Spanish              2,343,829,757
Portuguese Web 2011 (ptTenTen11, Freeling v3, old)  Portuguese           4,637,901,353
Japanese Web 2011 sample (jpTenTen11, LUW)          Japanese             203,674,569
Korean Web 2012 sample (koTenTen12)                 Korean               43,113,814
Czech Web 2012 (czTenTen12 v8, sample)              Czech                64,607,138
Slovak Web 2011 (skTenTen11)                        Slovak               656,067,998
Slovenian Web 2015 (slTenTen15)                     Slovenian            988,513,467
Chinese Web 2011 (zhTenTen11)                       Chinese Simplified   2,106,661,021
Chinese Web 2011 (zhTenTen11)                       Chinese Traditional  2,106,661,021
Dutch Web 2014 (nlTenTen14)                         Dutch                3,013,056,738
Italian Web 2010 sample (itTenTen)                  Italian              48,904,255
French Web 2012 (frTenTen12)                        French               11,444,973,582
British National Corpus (BNC)                       English              112,345,722
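
To illustrate why a reference corpus is needed, the sketch below scores each word by comparing its normalized frequency in the corpus being analysed with its normalized frequency in a reference corpus. This page does not state which scoring method is used, so the formula here (a ratio of frequencies per million with a smoothing constant) is an assumption chosen for illustration. The function names, the `smoothing` parameter, and the toy token lists are hypothetical.

```python
from collections import Counter


def per_million(count: int, corpus_size: int) -> float:
    """Normalize a raw count to a frequency per million tokens."""
    return count * 1_000_000 / corpus_size


def keyness_scores(focus_counts, focus_size, ref_counts, ref_size, smoothing=1.0):
    """Score each word of the analysed (focus) corpus against a reference corpus.

    Assumed formula (illustrative only):
        (focus frequency per million + smoothing)
        / (reference frequency per million + smoothing)
    The smoothing constant avoids division by zero for words that are
    absent from the reference corpus.
    """
    scores = {}
    for word, count in focus_counts.items():
        focus_fpm = per_million(count, focus_size)
        ref_fpm = per_million(ref_counts.get(word, 0), ref_size)
        scores[word] = (focus_fpm + smoothing) / (ref_fpm + smoothing)
    return scores


# Toy data standing in for real corpora (hypothetical).
focus_tokens = "the corpus the lemma the concordance corpus".split()
reference_tokens = "the cat sat on the mat and the dog saw the corpus".split()

scores = keyness_scores(
    Counter(focus_tokens), len(focus_tokens),
    Counter(reference_tokens), len(reference_tokens),
)

# List candidates from most to least distinctive of the focus corpus.
for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{word}\t{score:.2f}")
```

Words that are frequent in both corpora, such as "the", score close to 1, while words that are rare or absent in the reference corpus score much higher. This is why the reference corpus should be a large, general corpus in the same language as the corpus being analysed.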