The Hausa web corpus (hausaWaC) is a language corpus made up of texts collected from the Internet. The data was crawled with the SpiderLing web spider and the WebBootCat tool in June 2015 and comprises more than 5 million words. Corpus texts are written in Boko, a Latin script for the Hausa language. The corpus is not yet part-of-speech tagged.
Tools to work with the Hausa web corpus
A complete set of Sketch Engine tools is available to work with the hausaWaC15 corpus to generate:
word lists – lists of Hausa nouns, verbs, adjectives etc. organized by frequency