amWaC: Amharic web corpus
The Amharic web corpus (amWaC) is a language corpus made up of texts collected from the Internet. The corpus was prepared according to the standards described in the paper "A Corpus Factory for Many Languages" (Kilgarriff et al., LREC 2010).
The data was crawled with the SpiderLing web spider in three rounds (August 2013, October 2015, and January 2016), giving a final size of 17 million words. Texts are in Ge’ez script with a matching SERA transliteration (System for Ethiopic Representation in ASCII).
Transliteration of selected Ge’ez characters in the SERA system.
Document count – the most frequent web domains and domain size distribution:
Top-level domains
Domain size distribution: at least 1000 documents, at least 500 documents, at least 100 documents, at least 50 documents, at least 10 documents, at least 1 document
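As an illustration of how such document counts can be derived, the following Python sketch counts documents per web domain and per top-level domain from a list of source URLs, then reports how many domains reach each size threshold. The function names and the example URLs are hypothetical, and the thresholds simply mirror the categories listed above. This is not the procedure used to build amWaC.

```python
from collections import Counter
from urllib.parse import urlsplit

# Size categories taken from the distribution listed above.
THRESHOLDS = (1000, 500, 100, 50, 10, 1)


def hostnames(urls):
    """Yield the hostname of each URL, skipping URLs that have none."""
    for url in urls:
        host = urlsplit(url).hostname
        if host:
            yield host


def domain_size_distribution(urls, thresholds=THRESHOLDS):
    """Map each threshold to the number of domains with at least that many documents."""
    docs_per_domain = Counter(hostnames(urls))
    return {
        t: sum(1 for count in docs_per_domain.values() if count >= t)
        for t in thresholds
    }


def top_level_domains(urls):
    """Count documents per top-level domain (the last label of the hostname)."""
    return Counter(host.rsplit(".", 1)[-1] for host in hostnames(urls))


if __name__ == "__main__":
    # Hypothetical document URLs, one per crawled document.
    sample = [f"http://news.example.et/{i}" for i in range(12)]
    sample.append("http://blog.example.com/post")
    print(domain_size_distribution(sample))
    print(top_level_domains(sample).most_common())
```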
News, political, and religious sites account for a significant share of the corpus sources.
The corpus was created within the HaBiT project (Harvesting big text data for under-resourced languages). See more at https://habit-project.eu/wiki/AmharicCorpus
The Amharic WaC corpus was tagged with TreeTagger, based on a manual annotation of 1,065 Amharic news items containing 210,000 prosodic words. See the Amharic part-of-speech tag legend.