TalkBank Persian corpus of blog posts

TalkBank Persian Corpus created from the Web

The TalkBank Persian corpus is made up of blog posts from various Farsi blog sites. The texts were collected by Shlomo Argamon’s research group at the Illinois Institute of Technology (IIT), which shared them with Brian MacWhinney’s group at Carnegie Mellon University as part of the IARPA Metaphor program.

Note: In this description, “Persian” and “Farsi” are considered synonyms when referring to the language.

Part-of-speech tagset

The TalkBank Persian corpus was tagged with the part-of-speech tagset of the Persian Syntactic Dependency Treebank.

More material from other blogs and news sources will be added in due course; the eventual combined corpus will be called “Farsi-CMU”.

After collection, the Farsi text pre-processing tools from Uppsala University (available in the Web Archive) were applied to normalize spacing between Farsi words and their affixes. CMU’s Farsi text normalizer was then applied to remove Arabic and Persian diacritics and to normalize variant forms of the Farsi letter “ye” to a single Unicode representation. Finally, CMU applied its Farsi part-of-speech tagger, built with TurboTagger (from Noah Smith’s TurboParser) and trained on the part-of-speech tags in the Persian Dependency Treebank of the Dadegan Research Group, with a few minor modifications.
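The normalization step described above can be illustrated with a minimal Python sketch. It is not CMU’s normalizer: the function name normalize_farsi, the exact set of diacritics removed, and the choice of U+06CC (ARABIC LETTER FARSI YEH) as the single target form of “ye” are all assumptions made for illustration.

```python
import re

# Combining marks treated as diacritics here: U+064B..U+0652
# (fathatan through sukun) plus U+0670 (superscript alef).
# Assumption: the real normalizer's exact set is not documented in this text.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Variant code points for the letter "ye", mapped to one form.
# Assumption: U+06CC (ARABIC LETTER FARSI YEH) is the target.
YE_TABLE = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA
})


def normalize_farsi(text: str) -> str:
    """Strip diacritics, then unify variant forms of 'ye'."""
    without_marks = DIACRITICS.sub("", text)
    return without_marks.translate(YE_TABLE)
```

Calling normalize_farsi on a word typed with the Arabic yeh (U+064A) returns the same word spelled with U+06CC, so both spellings count as one word form in frequency lists and searches.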

The word sketch grammar was developed by Benjamin Mericli at Carnegie Mellon.

Tools to work with the TalkBank Persian corpus

A complete set of tools is available to work with this Persian corpus of blog posts and to generate:

word sketch – Persian collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word units
word lists – lists of Persian nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus
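To make the idea of an n-gram frequency list concrete, here is a small Python sketch. It only illustrates the concept and is not how Sketch Engine computes its lists; the function name ngram_frequencies and the placeholder tokens are hypothetical.

```python
from collections import Counter


def ngram_frequencies(tokens, n=2):
    """Count every run of n consecutive tokens in a token list."""
    # zip over n shifted views of the list yields each n-gram as a tuple.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


# Placeholder tokens standing in for a tokenized, normalized sentence.
tokens = ["a", "b", "a", "b", "c"]
bigram_counts = ngram_frequencies(tokens, n=2)
```

Sorting the resulting counts with bigram_counts.most_common() gives a list ordered by frequency, which is the general shape of an n-gram or word list.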

Bibliography

Rasooli, M. S., Kouhestani, M., & Moloodi, A. Development of a Persian syntactic dependency treebank. In Proceedings of NAACL-HLT, 2013, pp. 306–314. (Document retrieved from the Wayback Machine.)

Search the TalkBank Persian corpus

Sketch Engine offers a range of tools to work with this Persian corpus of blog posts.
