A Japanese TenTen corpus gathered from the web in December 2011 and tokenized into long unit words (LUW). This is a 164-million-word sample of jpTenTen11.

The corpus was introduced by Irena Srdanović at the conference 第3回 コーパス日本語学ワークショップ (3rd Japanese Corpus Linguistics Workshop) in the talk 百億語のコーパスを用いた日本語の語彙・文法情報のプロファイリング (Japanese Language Lexical and Grammatical Profiling Using the Web Corpus JpTenTen). See the presentation for more information about the content of the corpus. The paper: Irena Srdanović, Vit Suchomel, Toshinobu Ogiso, Adam Kilgarriff: 百億語のコーパスを用いた日本語の語彙・文法情報のプロファイリング (Japanese Language Lexical and Grammatical Profiling Using the Web Corpus JpTenTen). 『第3回コーパス日本語学ワークショップ予稿集』, 国立国語研究所 言語資源研究系・コーパス開発センター (Proceedings of the 3rd Japanese Corpus Linguistics Workshop, Department of Corpus Studies/Center for Corpus Development, NINJAL, 2013), 229-238.

Processing chain

  • crawled by SpiderLing,
  • encoding detected by chared, re-encoded into UTF-8,
  • cleaned by jusText and other tools,
  • deduplicated by onion,
  • tokenized (LUW) and tagged by MeCab 0.98 + UniDic 2.1.0 + Comainu 0.60,
  • MeCab tags converted to a tagset with English names.
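
The re-encoding step in the chain above can be illustrated with a minimal Python sketch. It does not call chared itself: it assumes the detector has already produced an encoding name that Python's codec registry recognizes (for example "shift_jis" or "euc_jp"). The function name reencode_to_utf8 is hypothetical and is not part of any tool listed above.

```python
def reencode_to_utf8(raw: bytes, detected_encoding: str) -> bytes:
    """Decode raw bytes with the detected encoding and re-encode them as UTF-8.

    detected_encoding is assumed to be a codec name known to Python;
    the real pipeline obtains the encoding from chared.
    """
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of
    # aborting, which is a design choice of this sketch, not of the pipeline.
    text = raw.decode(detected_encoding, errors="replace")
    return text.encode("utf-8")


# Simulate a page that was served in Shift_JIS.
sample = "日本語のウェブコーパス".encode("shift_jis")
utf8_bytes = reencode_to_utf8(sample, "shift_jis")
print(utf8_bytes.decode("utf-8"))
```

Whether undecodable bytes should be replaced, dropped, or cause the page to be rejected is a policy decision; the sketch replaces them only to stay short.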


v1.0 (January 2013)

  • initial version – 164 million words

v2.0 (June 2013)

  • tagging corrections (punctuation retagged from Nc.v.s to Supsym; swapped bracket tags, Supsym.b, fixed)
  • added end-of-sentence tags after Supsym.p where they were missing
  • word sketch grammar version 1.6 by Irena Srdanović
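
The v2.0 fix that adds missing end-of-sentence tags after Supsym.p can be sketched as follows. This is an illustration under stated assumptions, not the script used for the corpus: it assumes a vertical file with one token per line, the word and tag separated by a tab, and sentence structures written as <s> and </s> lines. The function name add_missing_sentence_ends is hypothetical, and every tag in the sample other than Supsym.p is a placeholder.

```python
def add_missing_sentence_ends(lines, period_tag="Supsym.p"):
    """Insert </s> and <s> after a sentence-final token when the next line
    is another token, i.e. when the sentence boundary is missing."""

    def is_structure(line):
        # Simplification: any line of the form <...> counts as a structure tag.
        return line.startswith("<") and line.endswith(">")

    out = []
    for i, line in enumerate(lines):
        out.append(line)
        if is_structure(line):
            continue
        fields = line.split("\t")
        if len(fields) < 2 or fields[1] != period_tag:
            continue
        nxt = lines[i + 1] if i + 1 < len(lines) else None
        # A boundary is missing only if a token follows directly.
        if nxt is not None and not is_structure(nxt):
            out.append("</s>")
            out.append("<s>")
    return out


vertical = [
    "<s>",
    "今日\tN",            # placeholder tag
    "は\tP",              # placeholder tag
    "晴れ\tN",            # placeholder tag
    "。\tSupsym.p",
    "明日\tN",            # boundary missing before this token
    "も\tP",
    "。\tSupsym.p",
    "</s>",
]
fixed = add_missing_sentence_ends(vertical)
print("\n".join(fixed))
```

The sketch leaves existing boundaries untouched and ignores harder cases such as a closing bracket that follows the full stop, which a real correction pass would need to handle.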