amWaC: Amharic web corpus

The Amharic web corpus (amWaC) is a language corpus made up of texts collected from the Internet. The corpus was prepared following the method described in “A Corpus Factory for Many Languages” (Kilgarriff et al., LREC 2010).

Data was crawled by the SpiderLing web spider three times, in August 2013, October 2015, and January 2016, reaching a final size of 17 million words. Texts are in Ge’ez script with a matching SERA transliteration (System for Ethiopic Representation in ASCII).

Figure: Transliteration of selected Ge’ez characters in the SERA system.
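To make the idea of a character-level transliteration concrete, here is a minimal sketch in Python. The mapping table is a hand-picked, illustrative subset and is an assumption of this example: it is not the full SERA table and not the tool used to build the corpus. The name `to_sera` is hypothetical, and the values should be checked against the official SERA specification before any real use.

```python
# Illustrative subset of Ge'ez-to-SERA mappings (assumption: values follow
# the usual SERA convention; the real table covers the whole syllabary).
SERA_SUBSET = {
    "ሰ": "se",
    "ላ": "la",
    "ም": "m",
    "ለ": "le",
    "ማ": "ma",
}


def to_sera(text, table=SERA_SUBSET):
    """Replace each known Ge'ez character with its ASCII form.

    Characters missing from the table are passed through unchanged.
    """
    return "".join(table.get(ch, ch) for ch in text)


print(to_sera("ሰላም"))  # prints "selam" with the subset above
```

A full transliterator would also need to handle punctuation, numerals, and SERA's escape conventions for mixed-script text, none of which this sketch attempts.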

Document counts by top-level domain, the most frequent web domains, and the domain size distribution:

Top-level domain    Documents
org                 14,582
com                 11,927
et                  5,090
net                 1,084
cz                  724
info                85

Web domain          Documents
*.jw.org            6,717
*.gov.et            4,599
waltainfo.com       2,818
ginbot7.org         2,666
eotcmk.org          1,141
ethsat.com          894

Domain size                 Number of domains
At least 1,000 documents    7
At least 500 documents      15
At least 100 documents      42
At least 50 documents       63
At least 10 documents       149
At least 1 document         573
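The domain size distribution is cumulative: each row counts the domains that contain at least the stated number of documents. A minimal Python sketch of how such a distribution could be computed from a list of document URLs follows. The function names and the sample data in the usage note are hypothetical, and the sketch does not merge subdomains into groups such as *.jw.org; it is not the pipeline actually used for amWaC.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_counts(urls):
    """Count documents per host name (subdomains are not merged here)."""
    return Counter(urlsplit(url).hostname for url in urls)


def size_distribution(counts, thresholds=(1000, 500, 100, 50, 10, 1)):
    """For each threshold, count the domains with at least that many documents."""
    return {t: sum(1 for n in counts.values() if n >= t) for t in thresholds}
```

For example, with made-up counts of 1,200, 60, and 3 documents for three domains, `size_distribution` reports one domain with at least 100 documents, two with at least 50, and three with at least 1.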

News, political, and religious sites account for a significant share of the corpus sources.

The corpus was created within the framework of the HaBiT project (Harvesting big text data for under-resourced languages). See more at https://habit-project.eu/wiki/AmharicCorpus

Part-of-speech tagset

The Amharic WaC corpus was tagged with TreeTagger, based on a manual annotation of 1,065 Amharic news items containing 210,000 prosodic words. See the Amharic part-of-speech tag legend.

Tools to work with the Amharic Web corpus

A complete set of Sketch Engine tools is available to work with this Amharic Web corpus to generate:

  • word sketch – Amharic collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Amharic nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
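As an illustration of what an n-gram frequency list contains, the following Python sketch counts contiguous n-token sequences in a tokenized text. It is a generic sketch and does not show how Sketch Engine computes its n-grams; the function name `ngram_frequencies` and the placeholder tokens are assumptions made for this example.

```python
from collections import Counter


def ngram_frequencies(tokens, n=2):
    """Count every contiguous sequence of n tokens."""
    # zip over n shifted copies of the token list yields the n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


tokens = "a b a b c".split()  # placeholder tokens, not corpus data
for gram, freq in ngram_frequencies(tokens, n=2).most_common():
    print(" ".join(gram), freq)
```

Sorting by frequency with `most_common()` gives the list ordering that a frequency list implies; a real corpus would first need Amharic-aware tokenization.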

Changelog

version 1 (21st April 2017)

  • created word sketches
  • added attribute “sera”

initial version (5th April 2017)

  • size 17 million words

Bibliography

Amharic web corpus

Rychlý, P., & Suchomel, V. (2016, September). Annotated Amharic Corpora. In International Conference on Text, Speech, and Dialogue (pp. 295-302). Springer International Publishing.

Corpus factory method

Kilgarriff, A., Reddy, S., Pomikálek, J., & Avinesh, P. V. S. (2010, May). A corpus factory for many languages. In LREC.
