The frWaC corpus is a French text corpus collected from the .fr domain, using medium-frequency words from the Le Monde Diplomatique corpus and basic French vocabulary lists as seeds. The corpus consists of French websites with a total size of 1.3 billion words.
The corpus texts were POS tagged with TreeTagger using the following tagset.
Tools to work with the French web corpus
A complete set of Sketch Engine tools is available for working with the frWaC corpus. These tools generate:
word sketch – French collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
word lists – lists of French nouns, verbs, adjectives etc. organized by frequency
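To make the idea of a frequency-ordered word list concrete, here is a minimal sketch in Python. It is not how Sketch Engine computes its word lists. It only assumes input in TreeTagger's tab-separated output format (token, POS tag, lemma); the sample sentence, the tag names in it, and the word_list helper are hypothetical and chosen for illustration.

```python
from collections import Counter

# Hypothetical sample in TreeTagger's tab-separated output format:
# token <TAB> POS tag <TAB> lemma. The tags shown are assumed for illustration.
SAMPLE = (
    "Les\tDET:ART\tle\n"
    "chats\tNOM\tchat\n"
    "mangent\tVER:pres\tmanger\n"
    "le\tDET:ART\tle\n"
    "chat\tNOM\tchat\n"
    "noir\tADJ\tnoir\n"
    "dort\tVER:pres\tdormir\n"
)


def word_list(tagged_text, pos_prefix):
    """Return (lemma, count) pairs for tags starting with pos_prefix,
    ordered from most to least frequent."""
    counts = Counter()
    for line in tagged_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _token, tag, lemma = parts
        if tag.startswith(pos_prefix):
            counts[lemma] += 1
    return counts.most_common()


nouns = word_list(SAMPLE, "NOM")
verbs = word_list(SAMPLE, "VER")
print(nouns)
print(verbs)
```

Counting lemmas rather than surface forms groups "chats" and "chat" under one entry, which is usually what a list of nouns or verbs by frequency is meant to show.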