The TalkBank Persian corpus contains blog posts from various Farsi blog sites. (NB: we treat “Persian” and “Farsi” as synonyms when referring to the language.)

It was collected by Shlomo Argamon’s research group at IIT. They shared it with Brian MacWhinney’s group at Carnegie Mellon as part of the IARPA metaphor program.

More material from other blog and news sources will be added in due course, with the eventual, combined corpus being called “Farsi-CMU”.

To prepare the corpus, the Farsi text pre-processing tools from Uppsala University were first applied to normalize spacing between Farsi words and their affixes. CMU’s Farsi text normalizer was then applied to remove Arabic and Persian diacritics and to normalize variant forms of the Farsi letter “ye” to a single Unicode representation. Finally, CMU applied its Farsi part-of-speech tagger, built with TurboTagger (from Noah Smith’s TurboParser) and trained on the part-of-speech tags in the Persian dependency treebank from Dadegan University (Rasooli et al., 2013), with a few minor modifications.
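
As an illustration of the normalization step, the following is a minimal Python sketch of diacritic removal and “ye” unification. It is not CMU’s normalizer: the function name normalize_farsi is hypothetical, and both the diacritic range (U+064B to U+0652, plus U+0670) and the choice of FARSI YEH (U+06CC) as the target form are assumptions, since the description above does not say which code points were used.

```python
import re

# Assumption: "diacritics" means the Arabic short-vowel marks U+064B..U+0652
# plus the superscript alef U+0670. The real normalizer may cover more.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Assumption: ARABIC YEH (U+064A) and ALEF MAKSURA (U+0649) are the variant
# forms, and FARSI YEH (U+06CC) is the single target representation.
YE_VARIANTS = str.maketrans({"\u064A": "\u06CC", "\u0649": "\u06CC"})


def normalize_farsi(text: str) -> str:
    """Strip diacritics, then map variant forms of 'ye' to one code point."""
    without_marks = DIACRITICS.sub("", text)
    return without_marks.translate(YE_VARIANTS)


if __name__ == "__main__":
    # A word written with a kasra and an Arabic yeh.
    sample = "\u0639\u0650\u0644\u0645\u064A"
    cleaned = normalize_farsi(sample)
    # Show the resulting code points rather than relying on terminal rendering.
    print([f"U+{ord(ch):04X}" for ch in cleaned])
```

Collapsing the variants to one code point means that a corpus query for a word containing “ye” matches every occurrence regardless of how the original blog author typed it.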

The Sketch Grammar was developed by Benjamin Mericli at Carnegie Mellon.


Rasooli, M. S., Kouhestani, M., & Moloodi, A. (2013). Development of a Persian syntactic dependency treebank. In Proceedings of NAACL-HLT 2013, pp. 306–314.