jpTenTen: Corpus of the Japanese Web

The Japanese Web Corpus (jpTenTen) is a corpus made up of texts collected from the internet. It belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.

Detailed information about TenTen corpora is available on the separate page Common TenTen corpora attributes.

Part-of-speech tagset

The whole corpus is processed with MeCab (version 0.98), a part-of-speech tagger and morphological analyzer, together with the UniDic dictionary (version 2.1.0). The relevant POS tagset summary is available here.

A small sample of the jpTenTen corpus was tokenized into long unit words (LUW). This sample corpus is named Japanese Web 2011 sample (jpTenTen11, LUW) and uses a different POS tagging system.

Special attributes

  • 語彙素読み (lemma_kana) > lemma reading in kana
  • 活用型 (infl_type) > inflection type
  • 活用形 (infl_form) > inflected form
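
These attributes can be used as token constraints in corpus queries. The following is a minimal sketch, assuming the standard CQL token syntax [attribute="value"] with & joining constraints on the same token. The helper names cql_token and cql_sequence are hypothetical, and the attribute values shown are illustrative only; they may not match the value set actually used in the corpus.

```python
def cql_token(**constraints):
    """Build one CQL token expression from attribute="value" constraints.

    Attribute names (e.g. lemma_kana, infl_type, infl_form) are passed as
    keyword arguments. Values are inserted as-is; CQL treats them as
    regular expressions.
    """
    if not constraints:
        return "[]"  # an empty pair of brackets matches any token
    parts = []
    for attr, value in constraints.items():
        if '"' in value:
            # Quote escaping is deliberately not handled in this sketch.
            raise ValueError(f"value for {attr} must not contain a double quote")
        parts.append(f'{attr}="{value}"')
    return "[" + " & ".join(parts) + "]"


def cql_sequence(*tokens):
    """Join token expressions into a query matching consecutive tokens."""
    return " ".join(tokens)


# Illustrative values only (assumptions, not taken from the corpus tagset).
single = cql_token(lemma_kana="タベル")
combined = cql_token(lemma_kana="タベル", infl_form="連用形.*")
sequence = cql_sequence(combined, cql_token())

print(single)
print(combined)
print(sequence)
```

The resulting strings can be pasted into the CQL field of the concordance search. Building them programmatically is only a convenience when many attribute combinations need to be tried.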

Tools to work with the Japanese Web corpus

A complete set of Sketch Engine tools is available to work with this Japanese Web corpus to generate:

  • word sketch – Japanese collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Japanese nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
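
To illustrate what an n-gram frequency list contains, here is a minimal sketch that counts n-grams over an already tokenized text. It is not how Sketch Engine computes its lists; the function name ngram_frequencies and the sample tokens are assumptions made for the example, and the tokens are hand-segmented rather than produced by MeCab.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens in the token list."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n over the tokens; each window is one n-gram.
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


# Hand-segmented sample: two short sentences sharing the sequence は 本 を.
tokens = ["私", "は", "本", "を", "読む", "。",
          "彼", "は", "本", "を", "買う", "。"]

bigrams = ngram_frequencies(tokens, 2)
trigrams = ngram_frequencies(tokens, 3)

# Print the bigrams that occur more than once.
for gram, count in bigrams.items():
    if count > 1:
        print(" ".join(gram), count)
```

With n set to 1 the same function yields a plain word frequency list, which is the simplest form of the word lists mentioned above.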

Changelog

version 2.0 (June 2013)

  • tagging corrections (punctuation from Nc.v.s to Supsym, swapped brackets Supsym.b)
  • added end of sentence tags after Supsym.p (where missing)
  • word sketch grammar version 1.6, later updated to version 1.7, both by Irena Srdanović

version 1.0 (January 2013)

  • initial version – 164 million words

Bibliography

Srdanović, I., Suchomel, V., Ogiso, T., & Kilgarriff, A. (2013). Japanese Language Lexical and Grammatical Profiling Using the Web Corpus JpTenTen. In Proceedings of the 3rd Japanese Corpus Linguistics Workshop (pp. 229-238). Tokyo: NINJAL, Department of Corpus Studies/Center for Corpus Development.

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen Corpus Family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).
