Timestamped JSI web corpus

created from Jozef Stefan Institute Newsfeed

The Timestamped JSI web corpus is a new family of web corpora built from the IJS newsfeed, which is produced by the Jozef Stefan Institute, Slovenia (Trampus et al. 2004).

The JSI web corpus is built on a clean, continuous, real-time aggregated stream of semantically enriched news articles from RSS-enabled sites across the world. The newsfeed is available in many languages (see the info box).

The project continuously processes 75,000 RSS feeds, which bring in between 100,000 and 150,000 articles every day. For English, for example, this amounts to approximately 1 billion words per month.

The Timestamped JSI web corpus was tagged for parts of speech, and the timestamps were used to augment the corpus with diachronic annotation. Currently, the corpus covers the period from 2014 to 2018. By combining this data with other web corpora, a total period of between 2009 and 2015 can be covered. The Jozef Stefan Institute now supplies regular monthly updates, and the corpus is regularly amended with the latest data.

The diachronic annotation is extremely valuable in connection with Sketch Engine and its trends feature. The trends feature analyses the frequency of the use of a word in time by comparing the frequency of use across a series of comparable time periods.
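As a rough illustration of this kind of comparison (not Sketch Engine's actual algorithm), the sketch below normalises a word's hits to a frequency per million tokens in each period and then fits a least-squares slope across the periods. The function names, the toy counts, and the choice of a plain linear fit are all assumptions made for this example.

```python
from statistics import mean

def relative_frequencies(hits, sizes):
    """Frequency per million tokens for each time period."""
    return [1_000_000 * h / s for h, s in zip(hits, sizes)]

def trend_slope(values):
    """Least-squares slope over equally spaced periods 0, 1, 2, ..."""
    xs = range(len(values))
    x_mean, y_mean = mean(xs), mean(values)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

# Toy data (invented): hits for one word and corpus size in four
# consecutive, comparable periods (e.g. months).
hits = [10, 20, 30, 40]
sizes = [1_000_000, 1_000_000, 1_000_000, 1_000_000]

per_million = relative_frequencies(hits, sizes)
slope = trend_slope(per_million)  # positive: use is rising over time
```

Normalising by period size matters because the amount of text collected differs from month to month; comparing raw hit counts would confuse corpus growth with a change in usage.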


The corpus version with texts from the years 2014–2016 is accessible to all users including trial users.

The corpus version with up-to-date texts from 2014 to the present (updated monthly) is accessible to users with a regular subscription.

Tools to work with the Timestamped JSI Web corpora

A complete set of Sketch Engine tools is available to work with Timestamped corpora to generate:

  • word sketch – collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • trends – diachronic analysis automatically identifies neologisms and changes in use
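To make the n-gram item above concrete, here is a minimal sketch of a frequency list of multi-word units, assuming the text has already been split into tokens. It is a generic illustration, not how Sketch Engine computes n-grams; the function name `ngram_counts` and the sample sentence are hypothetical.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

# Invented sample text, tokenised on whitespace for simplicity.
tokens = "the news feed and the news site".split()

bigrams = ngram_counts(tokens, 2)
top = bigrams.most_common(1)  # most frequent bigram with its count
```

A real corpus tool would work on lemmatised, part-of-speech-tagged tokens rather than raw whitespace-separated strings, which is why the tagging mentioned earlier is a prerequisite for most of these tools.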

Search the Timestamped corpus

Sketch Engine offers a range of tools to work with these Timestamped corpora.


Timestamped JSI web corpus

  • Arabic (1.7 billion words)
  • Catalan (160 million)
  • Czech (440 million)
  • Dutch (600 million)
  • English (28.2 billion)
  • Finnish (180 million)
  • French (2.9 billion)
  • German (3 billion)
  • Hebrew (190 million)
  • Hungarian (280 million)
  • Italian (2.1 million)
  • Korean (750 million)
  • Polish (250 million)
  • Portuguese (1.7 billion)
  • Russian (1.8 billion)
  • Serbian (172 million)
  • Spanish (6.2 billion)
  • Swedish (520 million)

Other text corpora

Sketch Engine offers 400+ language corpora.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in context, and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.