The Corpus Brasileiro (CB) is the result of a project funded by FAPESP (São Paulo State Research Foundation) from May 2008 to April 2010, supported by CEPRIL (Center for Research and Information on Language), the Graduate Program in Applied Linguistics (LAEL) at São Paulo Catholic University (PUCSP), Brazil. The project team included Tony Berber Sardinha (head), José Lopes Moreira Filho, and Eliane Alambert. It was an initiative of GELC (Corpus Linguistics Research Group).

The corpus is tagged with TreeTagger using Pablo Gamallo's parameter file.

Subregister                      Tokens  Percentage
Articles                    258,585,002      23.76%
Theses and dissertations    310,972,387      28.58%
Annals                        6,947,244       0.64%
Screenplays                     289,389       0.03%
Miscellanea                  89,398,389       8.22%
Wikipedia                    45,910,768       4.22%
Soccer broadcasts                86,323       0.01%
Manuals                         708,239       0.07%
Magazines                       494,974       0.05%
Newspaper                   253,732,527      23.32%
Horoscope                         4,319       0.00%
Interviews                    4,003,975       0.37%
Miscellanea                   9,097,447       0.84%
Short stories                    60,777       0.01%
Essays (crônicas)               160,525       0.01%
Miscellanea                   8,659,955       0.80%
Biographies                     534,965       0.05%
Drug labels                     113,228       0.01%
State assembly proceedings    3,977,450       0.37%
TV debates                       22,033       0.00%
Presidential speeches         1,803,404       0.17%
Sessions of congress         77,139,578       7.09%
Miscellanea                     914,786       0.08%
Bible                           859,004       0.08%
Reports and manuals          13,742,224       1.26%
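
The percentages in the table above can be reproduced from the token counts. The following is a minimal Python sketch, assuming each percentage is the subregister's share of the sum of all listed rows. The source does not state a corpus total, so the total here is derived by summing the rows, and the names ROWS and shares are illustrative only.

```python
# Token counts per subregister, copied from the table above in table order.
# "Miscellanea" appears several times in the source, so a list of pairs is
# used instead of a dict to avoid overwriting rows.
ROWS = [
    ("Articles", 258_585_002),
    ("Theses and dissertations", 310_972_387),
    ("Annals", 6_947_244),
    ("Screenplays", 289_389),
    ("Miscellanea", 89_398_389),
    ("Wikipedia", 45_910_768),
    ("Soccer broadcasts", 86_323),
    ("Manuals", 708_239),
    ("Magazines", 494_974),
    ("Newspaper", 253_732_527),
    ("Horoscope", 4_319),
    ("Interviews", 4_003_975),
    ("Miscellanea", 9_097_447),
    ("Short stories", 60_777),
    ("Essays (crônicas)", 160_525),
    ("Miscellanea", 8_659_955),
    ("Biographies", 534_965),
    ("Drug labels", 113_228),
    ("State assembly proceedings", 3_977_450),
    ("TV debates", 22_033),
    ("Presidential speeches", 1_803_404),
    ("Sessions of congress", 77_139_578),
    ("Miscellanea", 914_786),
    ("Bible", 859_004),
    ("Reports and manuals", 13_742_224),
]


def shares(rows):
    """Return (name, tokens, percentage of the summed tokens) for each row."""
    total = sum(tokens for _, tokens in rows)  # assumed denominator
    return [(name, tokens, 100 * tokens / total) for name, tokens in rows]


if __name__ == "__main__":
    for name, tokens, pct in shares(ROWS):
        print(f"{name:<28}{tokens:>12,}{pct:>9.2f}%")
```

Under this assumption the rows sum to 1,088,218,912 tokens, and the recomputed shares for the two largest subregisters round to the published 23.76% and 28.58%.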

Links