The British Academic Spoken English (BASE) corpus is a collection of transcripts of lectures and seminars recorded at two universities in the UK during the period 1998-2005. The corpus that can be accessed through Sketch Engine consists of 160 lectures recorded in a variety of university departments. Holdings are distributed across four broad disciplinary groups, each represented by 40 lectures:

  • Arts and Humanities
  • Life and Medical Sciences
  • Physical Sciences
  • Social Studies and Sciences.

The lectures have been transcribed and annotated in accordance with the TEI Guidelines. File names are made up of five letters and three digits, in which the first two letters indicate the disciplinary group, the next three indicate that the file is a transcript of a lecture, and the digits are unique identifiers:

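| Disciplinary group (two letters)     | Speech event (three letters) | Identifier (three digits) |
+--------------------------------------+------------------------------+---------------------------+
| **ah** [Arts and Humanities]         | **lct** [lecture]            | **0nn**                   |
+--------------------------------------+------------------------------+---------------------------+
| **ls** [Life and Medical Sciences]   | **lct** [lecture]            | **0nn**                   |
+--------------------------------------+------------------------------+---------------------------+
| **ps** [Physical Sciences]           | **lct** [lecture]            | **0nn**                   |
+--------------------------------------+------------------------------+---------------------------+
| **ss** [Social Studies and Sciences] | **lct** [lecture]            | **0nn**                   |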
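
The naming scheme above can be checked mechanically. The following is a minimal Python sketch, not part of the BASE distribution: the function name `parse_file_code`, the `BaseFile` type and the example code `ahlct001` are hypothetical and chosen only for illustration. It assumes the code is written as a single lower-case string with no separators or file extension, and it accepts any three digits rather than enforcing the leading zero.

```python
import re
from typing import NamedTuple

# Two-letter disciplinary group codes, as listed in the table.
GROUPS = {
    "ah": "Arts and Humanities",
    "ls": "Life and Medical Sciences",
    "ps": "Physical Sciences",
    "ss": "Social Studies and Sciences",
}

# Three-letter speech event codes; only lectures are described here.
EVENTS = {"lct": "lecture"}

# Assumed layout: group (2 letters) + event (3 letters) + identifier (3 digits).
FILE_CODE = re.compile(r"(ah|ls|ps|ss)(lct)(\d{3})")


class BaseFile(NamedTuple):
    """Hypothetical container for the three parts of a file code."""
    group: str
    event: str
    number: int


def parse_file_code(code: str) -> BaseFile:
    """Split a file code such as 'ahlct001' into its three parts."""
    match = FILE_CODE.fullmatch(code.strip().lower())
    if match is None:
        raise ValueError(f"not a recognised BASE file code: {code!r}")
    group, event, digits = match.groups()
    return BaseFile(GROUPS[group], EVENTS[event], int(digits))


if __name__ == "__main__":
    print(parse_file_code("ahlct001"))
```

Rejecting unknown prefixes with an error, rather than returning a partial result, makes it easier to spot mistyped codes when citing files.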

The Manual (PDF) explains the spelling and transcription conventions adopted. In the conversion of the corpus to Sketch Engine format, some of the mark-up has been changed; further details will be made available soon. In addition, a set of guidelines on forming CQL queries in Sketch Engine to explore the BASE corpus will be added soon.

A spreadsheet detailing the files in the BASE corpus can also be downloaded, as an Excel file.

For further information or guidance, contact the BASE team through Paul Thompson.

The British Academic Spoken English (BASE) corpus is freely available to researchers who agree to the following conditions:

  1. Corpus holdings should not be reproduced in full for a wider audience/readership (i.e. for publication or for teaching purposes), although researchers are free to quote short passages of text up to 100 running words, with a total of 200 running words from any given assignment.
  2. No part of the corpus holdings should be reproduced in teaching materials intended for publication (in print or via the internet).
  3. The corpus developers should be informed of all presentations and publications arising from analysis of the corpus.

Researchers must acknowledge their use of the BASE corpus using the following form of words: The recordings and transcriptions used in this study come from the British Academic Spoken English (BASE) corpus. The corpus was developed at the Universities of Warwick and Reading under the directorship of Hilary Nesi and Paul Thompson. Corpus development was assisted by funding from BALEAP, EURALEX, the British Academy and the Arts and Humanities Research Council.

Files should be referred to by their letter and number codes, indicating disciplinary grouping (e.g. ah = arts and humanities), type of speech event (e.g. lct = lecture) and file number.