Timestamped JSI web corpus

created from the Jozef Stefan Institute Newsfeed

The Timestamped JSI web corpus is a new family of web corpora built from the IJS newsfeed, which is produced by the Jozef Stefan Institute, Slovenia (Trampus et al 2004).

The JSI web corpus is built from a clean, continuous, real-time aggregated stream of semantically enriched news articles from RSS-enabled sites across the world. The newsfeed is available in many languages (see the info box).

The project continuously processes 75,000 RSS feeds, which bring in between 100,000 and 150,000 articles every day. For English, for example, this amounts to ca. 1 billion words per month.

The Timestamped JSI web corpus was tagged for parts of speech, and the timestamps were used to augment the corpus with diachronic annotation. Currently, the corpus covers the period from 2014 to 2017. By combining this data with other web corpora, a total period of between 2009 and 2015 can be covered. We now receive regular monthly updates from the Jozef Stefan Institute and regularly amend the corpus with the latest data.

The diachronic annotation is extremely valuable in connection with Sketch Engine and its trends feature. The trends feature analyses the frequency of the use of a word in time by comparing the frequency of use across a series of comparable time periods.
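The idea of comparing frequencies across comparable time periods can be illustrated with a minimal Python sketch. It is not Sketch Engine's implementation: the function names, the monthly grouping, and the use of a plain least-squares slope are all assumptions made for illustration. The sketch counts a word per month, normalises the count to hits per million tokens so that months of different size are comparable, and then fits a straight line through the monthly values.

```python
from collections import Counter
from datetime import date


def period_key(d: date) -> str:
    # Group documents by calendar month (an assumed granularity).
    return f"{d.year:04d}-{d.month:02d}"


def relative_frequencies(docs, word):
    """Return {period: hits per million tokens} for `word`.

    `docs` is an iterable of (date, list_of_tokens) pairs.
    """
    hits = Counter()
    totals = Counter()
    for d, tokens in docs:
        key = period_key(d)
        totals[key] += len(tokens)
        hits[key] += sum(1 for t in tokens if t.lower() == word)
    return {k: 1_000_000 * hits[k] / totals[k] for k in sorted(totals) if totals[k]}


def trend_slope(freqs):
    """Least-squares slope of frequency against period index (0, 1, 2, ...)."""
    ys = [freqs[k] for k in sorted(freqs)]
    n = len(ys)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


# Toy, hand-made data; real input would come from timestamped corpus documents.
docs = [
    (date(2014, 1, 5), "the selfie was everywhere".split()),
    (date(2014, 2, 5), "a selfie and another selfie today".split()),
    (date(2014, 3, 5), "selfie selfie selfie news".split()),
]
freqs = relative_frequencies(docs, "selfie")
slope = trend_slope(freqs)  # positive slope suggests growing use over time
```

A positive slope suggests that the word is becoming more frequent and a negative slope that it is declining. A production system would also need a significance test and a minimum-frequency threshold, both of which are omitted here.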

Availability

The corpus is accessible to all users including trial users.

Tools to work with the Timestamped JSI Web corpora

A complete set of Sketch Engine tools is available to work with Timestamped JSI Web corpora to generate:

  • word sketch – collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • trends – diachronic analysis automatically identifies neologisms and changes in use

Search the Timestamped JSI web corpus

Sketch Engine offers a range of tools to work with the Timestamped JSI web corpus.

Timestamped JSI web corpus

Arabic (1.4 billion words)

Catalan (139 million)

Czech (388 million)

Dutch (529 million)

English (25.4 billion)

Finnish (159 million)

French (2.5 billion)

German (2.6 billion)

Hebrew (160 million)

Hungarian (242 million)

Italian (1.8 million)

Korean (632 million)

Polish (210 million)

Portuguese (1.5 billion)

Russian (1.5 billion)

Serbian (129 million)

Spanish (5.5 billion)

Swedish (460 million)

Other text corpora in Sketch Engine

Sketch Engine offers 350+ language corpora.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in contexts, n-grams or extracting terms is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.