The ACL Anthology Reference Corpus (ACL ARC) is a digital archive of 10,921 research papers in computational linguistics, sponsored by the Association for Computational Linguistics (ACL). This release contains most of the papers that appeared in the web-based ACL Anthology up to February 2007. The Anthology is a dynamic repository that currently hosts over 16,500 articles drawn from a range of conferences and workshops as well as past issues of the Computational Linguistics journal. See more on the Linguistic Data Consortium website.
- 10,921 PDF files in the pdf/anthology-PDF tree.
- 13,551 files with metadata described in the metadata/anthology-XML tree.
- 84,542 pages in the PDF files.
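One way to check an unpacked copy against the counts listed above is to walk the two trees and compare. This is a minimal sketch, not part of the corpus distribution: the helper names `count_files` and `check_corpus` are hypothetical, and it assumes that the corpus root contains the `pdf/anthology-PDF` and `metadata/anthology-XML` directories exactly as named above, and that every file in the metadata tree counts toward the 13,551 total.

```python
from pathlib import Path

# Expected counts, taken from the corpus description above.
EXPECTED_PDFS = 10921
EXPECTED_METADATA_FILES = 13551


def count_files(root, suffix=None):
    """Count regular files under root, optionally keeping only one suffix.

    The suffix comparison is case-insensitive; a missing directory counts as 0.
    """
    root = Path(root)
    if not root.is_dir():
        return 0
    return sum(
        1
        for p in root.rglob("*")
        if p.is_file() and (suffix is None or p.suffix.lower() == suffix)
    )


def check_corpus(corpus_root):
    """Return {tree name: (files found, files expected)} for both trees."""
    corpus_root = Path(corpus_root)
    pdf_tree = corpus_root / "pdf" / "anthology-PDF"
    xml_tree = corpus_root / "metadata" / "anthology-XML"
    return {
        "pdf/anthology-PDF": (count_files(pdf_tree, ".pdf"), EXPECTED_PDFS),
        # Assumption: every file in the metadata tree is counted, whatever its extension.
        "metadata/anthology-XML": (count_files(xml_tree), EXPECTED_METADATA_FILES),
    }
```

Calling `check_corpus("path/to/unpacked/corpus")` returns a pair per tree, so a mismatch between found and expected counts points to an incomplete copy. The page count is not checked here because that would need a PDF parser.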
(copied from web archive at https://web.archive.org/web/20160503023245/http://acl-arc.comp.nus.edu.sg/)
- Version 20090501: This is the old version distributed by the Linguistic Data Consortium (LDC). The corpus is a canonicalized subset of the ACL Anthology, up to February 2007, consisting of 10,921 articles. This version adds each page in both text and image form, created by running OCR over the PDF files (Nuance Omnipage 15 or 16). For detailed information about the citation structure, see the related project, Anthology Author Network (see below).
[ DVD Disc 1 ] – Interlink data (clean), XML metadata (not clean, from Anthology), Image files in PNG format, text files from Omnipage in formatted and normal styles.
[ DVD Disc 2 ] – continuing PNG files.
[ DVD Disc 3 ] – remaining PNG files, PDFs from Anthology
[ DVD Disc 4 ] – remaining PDF files, text files from Omnipage in XML style.
If you have a copy of the ACL ARC from LDC, you may be missing some key files: those that give the textual dump of each page in three different formats. Here are a few quick links to the files:
- Version 20080325: This is the version described in the LREC paper. It contains the canonical 10,921 computational linguistics papers as PDF and plain text files, with the associated metadata. (You can also email me to request a DVD copy of the corpus.)
[ Complete tgz file from NUS ] [ Complete tgz file from Macquarie Univ. (courtesy Robert Dale) ] Warning: huge! (4621149669 bytes, ~4.3 GiB). Expect retries, and use a client with resume capability.
[ tgz file (without PDFs) ] (111001977 bytes, ~106 MiB)
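For readers without a download client that can resume, the idea can be sketched with Python's standard library alone: ask the server for only the bytes that are still missing, using an HTTP `Range` header, and append them to the partial file. This is a hedged illustration, not the distributors' recommended tool. The function names are hypothetical, the archive URL is not given here and must be supplied by the caller, and the sketch assumes the hosting server supports range requests (if it does not, the code falls back to restarting from scratch).

```python
import os
import urllib.request

# Archive sizes in bytes, as listed above.
FULL_ARCHIVE_BYTES = 4621149669
NO_PDF_ARCHIVE_BYTES = 111001977


def range_header(bytes_on_disk):
    """Build a Range header value requesting everything from this offset on."""
    return f"bytes={bytes_on_disk}-"


def is_complete(path, expected_bytes):
    """True if the file exists and has exactly the expected size."""
    return os.path.exists(path) and os.path.getsize(path) == expected_bytes


def resume_download(url, dest, expected_bytes, chunk_size=1 << 20):
    """Fetch the missing tail of url into dest and return the size on disk.

    Safe to call again after a dropped connection: each call continues
    from whatever is already on disk.
    """
    have = os.path.getsize(dest) if os.path.exists(dest) else 0
    if have >= expected_bytes:
        return have  # nothing left to fetch, no network access needed
    request = urllib.request.Request(url, headers={"Range": range_header(have)})
    with urllib.request.urlopen(request) as response:
        # 206 means the server honoured the range, so append.
        # Any other success status means it sent the whole file, so overwrite.
        mode = "ab" if response.status == 206 else "wb"
        with open(dest, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
    return os.path.getsize(dest)
```

A caller would typically loop, catching network errors and calling `resume_download(url, dest, FULL_ARCHIVE_BYTES)` again until `is_complete(dest, FULL_ARCHIVE_BYTES)` holds. Comparing against the published byte count only confirms the length, not the contents, since no checksum is listed above.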
- Steven Bird, Robert Dale, Bonnie Dorr, Bryan Gibson, Mark Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir Radev and Yee Fan Tan (2008) The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics. In Proc. of Language Resources and Evaluation Conference (LREC 08). Marrakesh, Morocco, May. [ .pdf pre-print ]