Timestamped corpora – time annotated corpora

Timestamped JSI web corpus

created from the Jožef Stefan Institute Newsfeed

The Timestamped JSI web corpus is a family of web corpora created from the IJS newsfeed, which was developed by the Jožef Stefan Institute, Slovenia (Trampus et al. 2012).

The JSI web corpus is built from a clean, continuous, real-time aggregated stream of semantically enriched news articles from RSS-enabled sites across the world. The newsfeeds are available in many languages.

The project, which concluded in November 2022, continuously processed 75,000 RSS feeds, which brought in between 100,000 and 150,000 articles every day. For English, this amounted to roughly 1 billion words per month. More on the project is accessible via the Wayback Machine.

The Timestamped JSI web corpus was tagged for parts of speech, and the timestamps were used to augment the corpus with diachronic annotation. The corpus covers the period from 2014 to November 2022. While the project was running, the corpus was regularly amended with monthly data updates from the Jožef Stefan Institute.

The diachronic annotation is extremely valuable in connection with Sketch Engine and its trends feature. The trends feature analyses how the use of a word changes over time by comparing its frequency across a series of comparable time periods.
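The idea of comparing frequencies across comparable time periods can be illustrated with a short Python sketch. It is not Sketch Engine's implementation: the function names, the whitespace tokenisation, the monthly granularity and the least-squares slope are all assumptions chosen for illustration. Documents are assumed to arrive as pairs of an ISO 8601 timestamp and the article text.

```python
from collections import Counter
from datetime import datetime


def monthly_relative_frequency(docs, word):
    """Return {"YYYY-MM": occurrences of `word` per million tokens}.

    `docs` is an iterable of (iso_timestamp, text) pairs (an assumed format).
    Tokenisation is a plain lowercase whitespace split, for illustration only.
    """
    hits = Counter()
    totals = Counter()
    target = word.lower()
    for timestamp, text in docs:
        # Derive the time period (here: the month) from the timestamp.
        period = datetime.fromisoformat(timestamp).strftime("%Y-%m")
        tokens = text.lower().split()
        totals[period] += len(tokens)
        hits[period] += tokens.count(target)
    # Normalise by period size so months with more text stay comparable.
    return {
        period: 1_000_000 * hits[period] / totals[period]
        for period in sorted(totals)
        if totals[period]
    }


def trend_slope(freqs):
    """Least-squares slope of frequency over consecutive periods.

    Periods are treated as equally spaced in sorted order, so a month
    with no documents at all is skipped rather than counted as zero.
    """
    ys = [freqs[period] for period in sorted(freqs)]
    n = len(ys)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical toy data: three months of one-sentence "articles".
docs = [
    ("2014-01-10T09:00:00", "markets fell as investors waited"),
    ("2014-02-11T09:00:00", "the selfie trend reached markets"),
    ("2014-03-12T09:00:00", "selfie after selfie filled the feeds"),
]
freqs = monthly_relative_frequency(docs, "selfie")
print(freqs)
print(trend_slope(freqs))
```

A positive slope suggests the word is becoming more frequent across the periods and a negative slope suggests it is declining. Normalising to occurrences per million tokens matters because the amount of text collected differs from month to month.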

Availability

The corpus version with texts from the years 2014–2016 is accessible to all users including trial users.

The full corpus version, with texts from 2014 to November 2022, is accessible to users with a regular subscription.

Tools to work with the Timestamped JSI web corpora

A complete set of Sketch Engine tools is available (depending on the language) for working with the Timestamped JSI web corpora:

word sketch – collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
trends – diachronic analysis automatically identifies neologisms and changes in use
text type analysis – statistics of metadata in the corpus

Note: Some of the functions may not be available for some Timestamped JSI corpora.

Bibliography

Newsfeed corpora

Bušta, J., & Herman, O. (2017). JSI Newsfeed Corpus. In The 9th International Corpus Linguistics Conference (Corpus Linguistics 2017), University of Birmingham, 25–28 July 2017.

Newsfeed data

Trampus, M., & Novak, B. (2012). The Internals of an Aggregated Web News Feed. In Proceedings of the 15th Multiconference on Information Society 2012 (IS-2012).