VietnameseWaC: Web as Corpus

VietnameseWaC is a 100-million-word text corpus crawled from the Web. The corpus texts are lemmatized and morphologically analyzed. The corpus includes a word sketch grammar for Vietnamese, which lets users explore the grammatical and collocational behavior of Vietnamese words. Sketch Engine has provided access to this corpus since 2010.

Part-of-speech tagset

See the Vietnamese part-of-speech tagset describing POS tags used in the corpus.

A complete set of Sketch Engine tools is available to work with this Vietnamese web corpus to generate:

  • word sketch – Vietnamese collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Vietnamese nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
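
To make the word list, n-gram and concordance outputs more concrete, here is a minimal Python sketch of the underlying ideas. It is not Sketch Engine's implementation: the function names ngram_counts and concordance are hypothetical, the sample sentence is invented, and splitting on whitespace is a simplifying assumption (in written Vietnamese, spaces separate syllables, so real word segmentation needs a dedicated tool).

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens (hypothetical helper)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    return hits

# Invented toy sample; whitespace splitting is an assumption, not real
# Vietnamese word segmentation.
sample = "tôi học tiếng Việt và tôi học tiếng Anh".split()

word_list = Counter(sample).most_common()    # tokens ordered by frequency
bigrams = ngram_counts(sample, 2)            # frequency of 2-token units
kwic = concordance(sample, "học", width=1)   # keyword with 1 token each side

for left, kw, right in kwic:
    print(" ".join(left), f"[{kw}]", " ".join(right))
```

On a real corpus the same counts would be computed over segmented, lemmatized tokens rather than raw whitespace-separated strings.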

Changelog

version 2 (2012)

  • created word sketches

version 1 (2010)

  • initial version

Bibliography

KILGARRIFF, Adam; LE-HONG, Phuong. Vietnamese Word Sketches. In: Proceedings of the First International Workshop on Vietnamese Language and Speech Processing. p. 1-4.

Search the VietnameseWaC corpus

Sketch Engine offers a range of tools to work with the VietnameseWaC corpus.

Other text corpora in Sketch Engine

Sketch Engine offers 350+ language corpora.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in context, and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.