Create a new corpus

To create a new corpus from uploaded files, log in to Sketch Engine (or click Home) and click Create corpus in the left menu. Complete the following details:

  • Corpus name: this will be displayed in the interface. A simple, unique, alphanumeric corpus ID that can be used in CQL queries will be created automatically.
  • Language: Sketch Engine will process the uploaded data using settings for the selected language. Select Other UTF-8 for a language not found in the list.

Click Create to create an empty corpus and to go to the next step.

Add a file

If you wish to add data using WebBootCaT instead, click Cancel and then click Add data from the web using WebBootCaT in the corpus screen.

Clicking add a new file gives you the following options:

  • upload a file from your computer
  • download from a URL
  • upload from the Sketch Engine server
    Connect by FTP to the Sketch Engine server on port 10021 to upload files, using the same username and password as for logging into the web interface (see the FTP tutorial).
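
For users who prefer to script the FTP upload rather than use a graphical client, the following is a minimal sketch using Python's standard ftplib module. The host name is not stated here, so `host` is a placeholder that you must supply yourself; the function names `connect` and `upload_files` are illustrative only. The sketch assumes a plain FTP connection; if the server requires TLS, ftplib.FTP_TLS could be substituted.

```python
from ftplib import FTP
from pathlib import Path

SKETCH_ENGINE_FTP_PORT = 10021  # port stated in the text above


def connect(host: str, username: str, password: str) -> FTP:
    """Open an FTP session on port 10021.

    `host` is a placeholder: the server address is an assumption
    you need to fill in. Use your web-interface username and password.
    """
    ftp = FTP()
    ftp.connect(host, SKETCH_ENGINE_FTP_PORT)
    ftp.login(username, password)
    return ftp


def upload_files(ftp, paths):
    """Upload each local file in binary mode; return the remote names used."""
    uploaded = []
    for path in map(Path, paths):
        with path.open("rb") as handle:
            # STOR saves the file under its base name in the current remote directory.
            ftp.storbinary(f"STOR {path.name}", handle)
        uploaded.append(path.name)
    return uploaded
```

A typical session would call connect(...) once, pass the returned object to upload_files(...) with a list of local paths, and finish with ftp.quit(). The uploaded files should then be selectable under the "upload from the Sketch Engine server" option.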

Uploading multiple files at once

You can also add multiple files in a single archive in one of these formats: .zip, .tar, .tar.gz, or .tar.bz2. Optionally, if the file names should be preserved, click 'Expand this archive instead of converting it to a single plaintext'. It is recommended, however, to put all metadata (including the file name) in XML structures inside the file and not to use the 'expand' option.
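
The recommendation above can be sketched in Python as follows: each plain-text file is wrapped in an XML structure that records its file name, and the results are packed into one .tar.gz archive. The structure name `doc`, the attribute name `filename`, and the function names are assumptions chosen for illustration; use whatever structures and attributes your corpus configuration expects.

```python
import io
import tarfile
from pathlib import Path
from xml.sax.saxutils import escape, quoteattr


def wrap_with_metadata(text: str, filename: str) -> str:
    """Wrap plain text in a <doc> structure carrying the file name.

    The text is XML-escaped so that characters such as < and & stay valid.
    """
    return f"<doc filename={quoteattr(filename)}>\n{escape(text)}\n</doc>\n"


def build_archive(source_dir, archive_path):
    """Pack every .txt file in source_dir into a .tar.gz archive.

    Returns the member names written, in sorted order.
    """
    written = []
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in sorted(Path(source_dir).glob("*.txt")):
            wrapped = wrap_with_metadata(path.read_text(encoding="utf-8"), path.name)
            data = wrapped.encode("utf-8")
            info = tarfile.TarInfo(name=path.name)
            info.size = len(data)  # size must be set before adding in-memory data
            archive.addfile(info, io.BytesIO(data))
            written.append(info.name)
    return written
```

Because the file name travels inside each document as metadata, the archive can be uploaded without the 'expand' option and the names are still available after the files are merged.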

You need to (re-)compile the corpus after adding one or more files.

You will find your corpus by clicking Home and going to My corpora.

Supported formats

Supported file formats include .doc, .docx, .htm, .html, .pdf, .ps, .tmx, .txt, .vert, and .xml. An XML file can also be uploaded as plain text, but it should contain only text with structural mark-up (such as document or paragraph boundaries, or document metadata). More complex XML will not be processed correctly. Here is a sample of XML text that would be processed correctly: