Search the Japanese jpTenTen text corpus

The Japanese TenTen corpus (jpTenTen) was crawled from the web in December 2011 and tokenized into short unit words (SUW).

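The corpus was introduced by Irena Srdanović at the 第3回 コーパス日本語学ワークショップ (3rd Japanese Corpus Linguistics Workshop) in the talk 百億語のコーパスを用いた日本語の語彙・文法情報のプロファイリング (Japanese Language Lexical and Grammatical Profiling Using the Web Corpus JpTenTen). See the presentation for more information about the content of the corpus.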

The paper: Irena Srdanović, Vit Suchomel, Toshinobu Ogiso, Adam Kilgarriff: 百億語のコーパスを用いた日本語の語彙・文法情報のプロファイリング (Japanese Language Lexical and Grammatical Profiling Using the Web Corpus JpTenTen). 『第3回コーパス日本語学ワークショップ予稿集』, 国立国語研究所 言語資源研究系・コーパス開発センター (Proceedings of the 3rd Japanese Corpus Linguistics Workshop, Department of Corpus Studies/Center for Corpus Development, NINJAL, 2013), pp. 229–238.

Search the jpTenTen corpus

Sketch Engine offers a range of tools to work with this Japanese corpus.


Special attributes

  • 語彙素読み (lemma_kana): reading of the lemma in kana
  • 活用型 (infl_type): inflection type
  • 活用形 (infl_form): inflected form
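
The attributes above can be referenced in Sketch Engine's Corpus Query Language (CQL), where a token is written as attribute="value" pairs inside square brackets. The following is a minimal Python sketch of a helper that assembles such a token expression. The function name cql_token and the example values ("タベル", "連用形.*") are illustrative assumptions, not values confirmed for this corpus; check the actual attribute values in the corpus before relying on them.

```python
def cql_token(**attrs):
    """Build a single CQL token expression from attribute/value pairs.

    Hypothetical helper for illustration; values are treated as CQL
    regular expressions and are inserted unchanged.
    """
    if not attrs:
        return "[]"  # an empty pair of brackets matches any token in CQL
    parts = []
    for name, value in attrs.items():
        if '"' in value:
            # Keep the sketch simple: refuse values that would need escaping.
            raise ValueError(f"value for {name} must not contain a double quote")
        parts.append(f'{name}="{value}"')
    return "[" + " & ".join(parts) + "]"


# Example values are assumptions, chosen only to show the shape of a query.
query = cql_token(lemma_kana="タベル", infl_form="連用形.*")
print(query)
```

The resulting string can be pasted into the CQL field of the concordance search. Combining several attributes with & restricts the match to tokens that satisfy all of the conditions.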

Corpus processing chain


v1.0 (April 2012)

  • initial version – 9.1 billion words

v2.0 (January 2013)

  • re-processed using new versions – 8.43 billion words

v3.0 (June 2013)

  • tagging corrections (punctuation retagged from Nc.v.s to Supsym; swapped brackets fixed, Supsym.b)
  • added end of sentence tag/structure
  • word sketch grammar version 1.6 by Irena Srdanović
