The Japanese TenTen corpus (JpTenTen) was gathered from the web in December 2011 and is tokenized into short unit words (SUW).

The corpus was introduced by Irena Srdanović at the 3rd Japanese Corpus Linguistics Workshop (第3回コーパス日本語学ワークショップ) in the talk 百億語のコーパスを用いた日本語の語彙・文法情報のプロファイリング (Japanese Language Lexical and Grammatical Profiling Using the Web Corpus JpTenTen). See the presentation for more information about the content of the corpus.

The paper: Irena Srdanović, Vit Suchomel, Toshinobu Ogiso, Adam Kilgarriff: 百億語のコーパスを用いた日本語の語彙・文法情報のプロファイリング (Japanese Language Lexical and Grammatical Profiling Using the Web Corpus JpTenTen). 『第3回コーパス日本語学ワークショップ予稿集』, 国立国語研究所 言語資源研究系・コーパス開発センター (Proceedings of the 3rd Japanese Corpus Linguistics Workshop, Department of Corpus Studies/Center for Corpus Development, NINJAL), 2013, pp. 229–238.

Special attributes

  • 語彙素読み (lemma_kana): reading of the lemma in kana
  • 活用型 (infl_type): inflection type
  • 活用形 (infl_form): inflection form
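
To illustrate how the special attributes sit alongside the usual positional attributes, here is a minimal sketch that parses one token line. It assumes a tab-separated, one-token-per-line layout. The column order, the `V` tag and the sample values are hypothetical and chosen only for illustration; the actual layout of the corpus files is not described here.

```python
from typing import NamedTuple


class Token(NamedTuple):
    # Hypothetical column order; the real corpus layout may differ.
    word: str
    tag: str
    lemma: str
    lemma_kana: str  # 語彙素読み: reading of the lemma in kana
    infl_type: str   # 活用型: inflection type
    infl_form: str   # 活用形: inflection form


def parse_token_line(line: str) -> Token:
    """Split one tab-separated token line into named attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(Token._fields):
        raise ValueError(
            f"expected {len(Token._fields)} columns, got {len(fields)}"
        )
    return Token(*fields)


# Illustrative sample line (the tag "V" is a placeholder, not a real tag).
sample = "食べ\tV\t食べる\tタベル\t下一段-バ行\t連用形-一般"
token = parse_token_line(sample)
print(token.lemma_kana, token.infl_type, token.infl_form)
```

Named fields keep later processing readable, for example when filtering tokens by `infl_form`.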

Processing chain

  • crawled by SpiderLing,
  • encoding detected by chared, re-encoded into UTF-8,
  • cleaned by jusText and other tools,
  • deduplicated by onion,
  • tokenized (SUW) and tagged by MeCab 0.98 + UniDic 2.1.0,
  • MeCab tags converted to a tagset with English names.
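
The deduplication step in the chain above can be pictured with a simplified n-gram overlap filter. This is only a sketch of the general idea, not onion's actual implementation. The function names, the use of character n-grams, and the values of `n` and `threshold` are all assumptions made for this example.

```python
def ngrams(units, n):
    """Return the set of n-grams (as tuples) over a sequence of units."""
    return {tuple(units[i:i + n]) for i in range(len(units) - n + 1)}


def deduplicate(paragraphs, n=3, threshold=0.5):
    """Keep a paragraph unless too many of its n-grams were seen before.

    `paragraphs` is an iterable of sequences (here: strings, so the
    units are characters). `n` and `threshold` are illustrative values.
    """
    seen = set()
    kept = []
    for para in paragraphs:
        grams = ngrams(para, n)
        # Paragraphs shorter than n units have no n-grams and are kept.
        dup_ratio = len(grams & seen) / len(grams) if grams else 0.0
        if dup_ratio <= threshold:
            kept.append(para)
        seen |= grams
    return kept


docs = ["今日は天気がいい", "今日は天気がいい", "明日は雨が降る"]
result = deduplicate(docs)
print(result)
```

In this example the second paragraph repeats the first exactly, so all of its n-grams have already been seen and it is dropped.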

Changelog

v1.0 (April 2012)

  • initial version – 9.1 billion words

v2.0 (January 2013)

  • re-processed using new versions of the processing tools – 8.43 billion words

v3.0 (June 2013)

  • tagging corrections (punctuation from Nc.v.s to Supsym, swapped brackets Supsym.b)
  • added end-of-sentence tag/structure
  • word sketch grammar version 1.6 by Irena Srdanović
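
The punctuation correction listed under v3.0 amounts to a retagging pass over the tokens. The sketch below shows one way such a pass could look. Only the tag names `Nc.v.s` and `Supsym` come from the changelog. The rule used to recognise punctuation (Unicode general category `P*`) is an assumption, and the actual correction procedure is not documented here.

```python
import unicodedata

OLD_TAG = "Nc.v.s"  # tag previously assigned to punctuation
NEW_TAG = "Supsym"  # corrected tag


def is_punctuation(word):
    """True if every character is in a Unicode punctuation category."""
    return bool(word) and all(
        unicodedata.category(ch).startswith("P") for ch in word
    )


def retag_punctuation(tokens):
    """Return (word, tag) pairs with mis-tagged punctuation corrected."""
    return [
        (word, NEW_TAG if tag == OLD_TAG and is_punctuation(word) else tag)
        for word, tag in tokens
    ]


tokens = [("勉強", "Nc.v.s"), ("。", "Nc.v.s"), ("、", "Supsym")]
fixed = retag_punctuation(tokens)
print(fixed)
```

Tokens that already carry another tag, or that are not punctuation, pass through unchanged.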