A page relevant to corpora.

Pages

Argamon corpus

The current Argamon corpus contains blog posts to various Farsi…

ACL Anthology Reference Corpus (ARC)

The corpus is prepared by Steven Bird. The process is described…

Algemeen Nederlands Woordenboek (ANW) corpus

The Algemeen Nederlands Woordenboek (ANW) corpus is a balanced…

New Model Corpus

The New Model Corpus is a ~100-million-word domain corpus built…

UKWaC corpus

The corpus was prepared by Adriano Ferraresi. The whole process…

London English corpora

The corpus consists of transcripts of informal conversation-like…

zhTenTen corpus

Simplified Chinese TenTen corpus was created from the Internet…

yoTenTen corpus

Yoruba TenTen web corpus. The corpus is cleaned by jusText,…

uaTenTen corpus

Ukrainian TenTen corpus was crawled by SpiderLing in 2014.…

trTenTen corpus

Turkish TenTen corpus. Crawled by SpiderLing in December 2011…

svTenTen corpus

Swedish TenTen web corpus. The corpus is cleaned by jusText,…

skTenTen corpus

Slovak TenTen corpus. The corpus has been tagged by the Ľ.…

ruTenTen corpus

Russian TenTen corpus. Russian web corpus crawled by SpiderLing…

ptTenTen corpus

Portuguese TenTen corpus. The corpus is processed with Eckhard…

plTenTen corpus

The Polish TenTen web corpus was crawled by the web spider SpiderLing…