Turkic web corpora

The following Turkic language family corpora are available in Sketch…

TalkBank Persian

The TalkBank Persian corpus contains blog posts to various Farsi…

TED_en corpus

A corpus of transcripts of TED talks. Prepared by Akshay Min…

jpTenTen11 LUW corpus

Japanese TenTen corpus gathered from the web in December 2011.…

SiBol/Port corpus

The SiBol/Port (Siena-Bologna, Portsmouth) corpus is a corpus…

Scottish Gaelic Wiki corpus

Scottish Gaelic Wikipedia corpus. Downloaded in February 2015.…

Russian Web Corpus

This corpus was gathered by Serge Sharoff at the University of…

pukWaC

The same as ukWaC, but with a further layer of annotation added,…

Romanian WaC (RoWaC) corpus

This Romanian web as corpus was gathered by Monica Macoveiciuc,…

Portuguese corpus

The CetemPúblico/CetenFolha Portuguese corpus installed here…

Polish Web Corpus

The Polish web as corpus has 103 million words and the encoding is…

Polish-Swahili Bible parallel corpora

These corpora are parallel Bible texts in Polish and Swahili.…

Parallel Corpora Registry Info

General Attribute Set ATTRIBUTE word STRUCTURE s{ ATTRIBUTE…

PICAE: Pearson International Corpus of Academic English

This corpus was created by Kirsten Ackermann and David Tugwell,…

OPUS parallel corpora (version 2 with m:n alignment)

The parallel corpora available here have been collected, prepared…