The corpus was prepared by Tomaž Erjavec from a list of URLs provided by Serge Sharoff at the University of Leeds. The URLs were collected using the method described here, which is designed to produce a general language resource. The content has undergone little checking.

It was segmented, part-of-speech tagged and lemmatised using ChaSen, an open-source toolset for Japanese.

Word sketches were prepared by Irena Srdanovic.

See the Japanese tagset.


v2.0 (25 February 2011)

  • added a new corpus attribute, jp_tag, which contains the Japanese names of the part-of-speech tags