The New Model Corpus is a domain corpus of about 100 million words, built from web data in 2008. For more information, see the attachments below.

There are also versions of the corpus tagged with word families or with SuperSenseTagger.

Text types

Genres

Genre # documents
blog 13,957
news 12,388
general 10,216
business 1,433
speech (subtitles) 1,088
medical 516
law 451
fiction 123

Web top level domains

TLD # documents
com 15,954
uk 12,077
org 2,852
net 944
edu 379
gov 237
ca 154
us 104
au 94
ie 92
info 30
other 116
unknown 7,139
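
As a quick consistency check on the two tables above, the counts can be summed and turned into percentage shares. This is only a minimal sketch: the dictionary names and the `shares` helper are illustrative assumptions, and the figures are copied by hand from the tables rather than read from the corpus itself.

```python
# Document counts copied from the genre and TLD tables above.
genres = {
    "blog": 13957,
    "news": 12388,
    "general": 10216,
    "business": 1433,
    "speech (subtitles)": 1088,
    "medical": 516,
    "law": 451,
    "fiction": 123,
}

tlds = {
    "com": 15954,
    "uk": 12077,
    "org": 2852,
    "net": 944,
    "edu": 379,
    "gov": 237,
    "ca": 154,
    "us": 104,
    "au": 94,
    "ie": 92,
    "info": 30,
    "other": 116,
    "unknown": 7139,
}


def shares(counts):
    """Return each category's share of the total as a percentage (1 decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


genre_total = sum(genres.values())
tld_total = sum(tlds.values())

# Both tables should describe the same set of documents.
print("documents by genre:", genre_total)
print("documents by TLD:  ", tld_total)

for name, pct in shares(genres).items():
    print(f"{name:20s} {pct:5.1f}%")
```

Both tables add up to 40,172 documents, so every document appears to be counted once in each breakdown.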

Attachments

Further information about the New Model Corpus.