Timestamped JSI web corpus

created from the Jožef Stefan Institute newsfeed

The Timestamped JSI web corpus is a family of web corpora created from the IJS newsfeed developed by the Jožef Stefan Institute, Slovenia (Trampus and Novak 2012).

The JSI newsfeed is a clean, continuous, real-time aggregated stream of semantically enriched news articles from RSS-enabled sites across the world. The newsfeeds are available in many languages (see the info box).

The project, which concluded in November 2022, continuously processed 75,000 RSS feeds, which brought in between 100,000 and 150,000 articles every day. For English, for example, this amounted to circa 1 billion words per month. More on the project is accessible via the Wayback Machine.

The Timestamped JSI web corpus was tagged for parts of speech, and the timestamps were used to augment the corpus with diachronic annotation. The full corpus covers the period from 2014 to November 2022. Until the project concluded, the corpus was regularly amended with monthly updates from the Jožef Stefan Institute.

The diachronic annotation is particularly valuable in combination with the Sketch Engine trends feature. The trends feature analyses how the use of a word changes over time by comparing its frequency across a series of comparable time periods.
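
The idea of comparing frequencies across comparable time periods can be illustrated with a minimal Python sketch. It is not Sketch Engine's actual implementation: the function names, the monthly bucketing, the per-million normalisation and the least-squares slope are all assumptions chosen for illustration, and the three tiny documents are invented toy data.

```python
from collections import Counter
from datetime import date


def relative_frequencies(docs, word):
    """Return {(year, month): occurrences of `word` per million tokens}.

    `docs` is an iterable of (publication_date, tokens) pairs (hypothetical format).
    """
    hits = Counter()
    totals = Counter()
    for published, tokens in docs:
        period = (published.year, published.month)
        totals[period] += len(tokens)
        hits[period] += sum(1 for t in tokens if t.lower() == word)
    # Normalise by period size so that months of different sizes are comparable.
    return {p: 1_000_000 * hits[p] / totals[p] for p in sorted(totals) if totals[p]}


def trend_slope(series):
    """Least-squares slope of the frequencies, taken in chronological order."""
    ys = [series[p] for p in sorted(series)]
    n = len(ys)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


# Invented toy data: one short "article" per month.
docs = [
    (date(2014, 1, 5), ["the", "vote", "was", "held"]),
    (date(2014, 2, 9), ["brexit", "talks", "the", "vote"]),
    (date(2014, 3, 2), ["brexit", "brexit", "deal", "news"]),
]
series = relative_frequencies(docs, "brexit")
slope = trend_slope(series)
print(series)
print(slope)
```

A positive slope suggests that the word is becoming more frequent over the periods examined, and a negative slope suggests that it is declining. Normalising by the number of tokens in each period matters because the volume of collected news varies from month to month.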

Availability

The corpus version with texts from the years 2014–2016 is accessible to all users including trial users.

The corpus version with up-to-date texts from the year 2014 to November 2022 is accessible to users with a regular subscription.

Tools to work with the Timestamped JSI Web corpora

A complete set of Sketch Engine tools is available (depending on the language) to work with Timestamped corpora to generate:

  • word sketch – collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • trends – diachronic analysis automatically identifies neologisms and changes in use
  • text type analysis – statistics of metadata in the corpus

Note: Some of the functions may not be available for some Timestamped JSI corpora.

Newsfeed corpora

Bušta, J., & Herman, O. (2017). JSI Newsfeed Corpus. In The 9th International Corpus Linguistics Conference (Corpus Linguistics 2017), University of Birmingham, 25-28 July 2017.

Newsfeed data

Trampus, M., & Novak, B. (2012). The Internals of an Aggregated Web News Feed. In Proceedings of the 15th Multiconference on Information Society 2012 (IS-2012).

Search the Timestamped corpus

Sketch Engine offers a range of tools to work with these Timestamped corpora.

Timestamped JSI web corpus

Arabic (5.5+ billion words)

Catalan (430+ million words)

Czech (1+ billion words)

Dutch (1.3+ billion words)

Estonian (270+ million words)

English (73+ billion words)

Finnish (400+ million words)

French (6.8+ billion words)

German (6.9+ billion words)

Hebrew (450+ million words)

Hungarian (860+ million words)

Italian (7.2+ billion words)

Korean (1.5+ billion words)

Polish (970+ million words)

Portuguese (4.6+ billion words)

Russian (5.7+ billion words)

Serbian (560+ million words)

Spanish (15.8+ billion words)

Swedish (1.1+ billion words)

