The FeedCorpus is a corpus of about 300 million words that was compiled and completed in early 2013. It comprises content posted during 2012–2013 on feeds that were discovered using the technique described in Feed Corpus: An Ever Growing Up-To-Date Corpus. The content was then downloaded from the internet using smart crawling techniques. The documents in the corpus contain the following meta fields:
- “meta” – Contains meta information, such as the headings, of the feed from which the content was taken.
- “tld” – Contains the top-level domain of the content URL.
- “quarter” – Contains the quarter in which the content was posted. For example, 2012q1 means the first quarter of 2012.
- “month” – Contains the month and year in which the content was posted. For example, 2013-01 refers to January 2013.
- “content_url” – Contains the URL from which the content was downloaded.
- “time” – Contains the timestamp of when the content URL was posted on the feed.
- “feed_source_url” – Contains the URL of the source feed where the content URL was posted.
- “domain” – Contains the domain to which the content_url belongs.
- “year” – Contains the year in which the content was posted, for example 2012 or 2013.
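The relationship between the date and URL fields above can be illustrated with a minimal sketch. It is not the pipeline used to build the corpus: the function name `derive_meta`, the ISO 8601 timestamp format, the example URL, and the choice to treat the full host name as the "domain" are all assumptions made for illustration.

```python
from datetime import datetime
from urllib.parse import urlparse


def derive_meta(content_url: str, time: str) -> dict:
    """Derive date- and URL-based meta fields (hypothetical helper).

    Assumes `time` is an ISO 8601 timestamp; the real corpus may store
    timestamps in a different format.
    """
    posted = datetime.fromisoformat(time)
    # Assumption: "domain" is the full host name of the content URL.
    host = urlparse(content_url).hostname or ""
    quarter = (posted.month - 1) // 3 + 1  # months 1-3 -> 1, 4-6 -> 2, ...
    return {
        "content_url": content_url,
        "time": time,
        "year": str(posted.year),
        "month": f"{posted.year}-{posted.month:02d}",
        "quarter": f"{posted.year}q{quarter}",
        "domain": host,
        # The top-level domain is the last dot-separated label of the host.
        "tld": host.rsplit(".", 1)[-1] if "." in host else "",
    }


example = derive_meta("http://www.example.com/news/item.html", "2013-01-15T09:30:00")
print(example["quarter"], example["month"], example["tld"])
```

For the illustrative input, the sketch yields the quarter `2013q1` and the month `2013-01`, matching the formats described in the list.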
To support searches by lemma and part of speech, the corpus has been annotated with lemmas and PoS tags using TreeTagger; see the Tagset documentation.
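As a rough illustration of what lemma and PoS annotation makes possible, the sketch below filters tokens in TreeTagger's standard tab-separated output (word, tag, lemma on each line). The sample sentence and the helper names `parse_tagged` and `find_tokens` are made up for this example and are not taken from the corpus or from any corpus query tool.

```python
def parse_tagged(text: str) -> list:
    """Parse TreeTagger-style lines of the form word<TAB>tag<TAB>lemma."""
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:  # skip blank or malformed lines
            tokens.append(tuple(parts))
    return tokens


def find_tokens(tokens, lemma=None, pos_prefix=None) -> list:
    """Return word forms matching a lemma and/or a PoS tag prefix."""
    return [
        word
        for word, tag, tok_lemma in tokens
        if (lemma is None or tok_lemma == lemma)
        and (pos_prefix is None or tag.startswith(pos_prefix))
    ]


# Illustrative sample only; not real corpus data.
sample = "The\tDT\tthe\nfeeds\tNNS\tfeed\nkeep\tVVP\tkeep\nfeeding\tVVG\tfeed"
tokens = parse_tagged(sample)

all_feed = find_tokens(tokens, lemma="feed")                    # any form of "feed"
noun_feed = find_tokens(tokens, lemma="feed", pos_prefix="N")   # noun uses only
print(all_feed, noun_feed)
```

Searching by lemma retrieves every inflected form at once, while adding a tag prefix narrows the result to one part of speech.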
- December 2015 – version 6: re-tagged using new tagger model and post-processing, fixed metadata, Sketch grammar v3.0
- August 2015 – version 5: added content from 2014, brought up to ~550M words
Minocha, Akshay, Siva Reddy, and Adam Kilgarriff (2014). Feed Corpus: An Ever Growing Up-To-Date Corpus. In Proceedings of the Eighth Web as Corpus Workshop, ACL SIGWAC 8.