Create a new corpus from files

To create a new corpus from uploaded files, log in to Sketch Engine (or click Home) and click Create corpus from the left menu. Complete the following details:

  • Corpus name: this will be displayed in the interface. A simple, unique, alphanumeric corpus ID that can be used in CQL queries will be created automatically.
  • Language: Sketch Engine will process the uploaded data using settings for the selected language. Select Other UTF-8 for a language not found in the list.
    See also: Unsupported language


  • Configuration template (for advanced users): Instead of the default template for the selected language, you may select a custom configuration template. Only vertical files are supported when using custom templates. This option is not shown by default. To enable it, first create a user template in Configuration templates, as described on the pages The Corpus Configuration File: Overview and Corpus Configuration File: All Features.

  • Click Create to create an empty corpus and to go to the next step.

Add a file

(If you wish to add data using WebBootCaT instead, click Cancel and then Add data from the web using WebBootCaT in the corpus screen.)

Click add a new file and choose from these options:

  • upload a file from your computer
  • download from a URL
  • upload from the Sketch Engine server
    To upload files, connect by FTP to the.sketchengine.co.uk on port 10021, using the same username and password as for logging into the web interface. See also: FTP tutorial
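
The FTP upload can also be scripted. The following is a minimal sketch using Python's standard ftplib module, with the host and port given above. The function names upload_file and upload_files are made up for this illustration, and the sketch assumes that the server accepts a plain FTP login; see the FTP tutorial for the settings that actually apply.

```python
from ftplib import FTP
from pathlib import Path

# Host and port as given above; the login is the same as for the web interface.
HOST = "the.sketchengine.co.uk"
PORT = 10021


def upload_file(ftp, local_path):
    """Upload one local file over an already opened FTP connection.

    `ftp` is an ftplib.FTP instance (or any object with a compatible
    storbinary method). The remote name is the local file name.
    """
    path = Path(local_path)
    with path.open("rb") as handle:
        # STOR transfers the file in binary mode, so the bytes are unchanged.
        ftp.storbinary(f"STOR {path.name}", handle)
    return path.name


def upload_files(username, password, paths):
    """Connect, log in, upload each path, and close the connection."""
    ftp = FTP()
    ftp.connect(HOST, PORT)
    ftp.login(username, password)  # assumption: plain FTP login is accepted
    try:
        return [upload_file(ftp, p) for p in paths]
    finally:
        ftp.quit()
```

Calling upload_files("your_username", "your_password", ["texts.zip"]) would send texts.zip to the server, after which it can be selected with the upload from the Sketch Engine server option.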

Uploading multiple files at once

You can also add multiple files in a single archive. The supported archive formats are .zip, .tar, .tar.gz, and .tar.bz2. Optionally, if the file names should be preserved, click 'Expand this archive instead of converting it to a single plaintext'. It is recommended, however, to put all metadata (including the file name) in XML structures inside the files and not to use the 'expand' option.
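
An archive for upload can be prepared with any archiving tool. As a minimal sketch, the Python function below packs all files from one folder into a .zip archive using the standard zipfile module. The function name build_zip and the choice to store only the bare file names (without folders) are assumptions of this illustration.

```python
import zipfile
from pathlib import Path


def build_zip(source_dir, archive_path):
    """Pack every regular file directly inside source_dir into a .zip archive.

    Returns the sorted list of file names stored in the archive.
    """
    source = Path(source_dir)
    # Collect the files first, so the listing is fixed before writing starts.
    files = sorted(p for p in source.iterdir() if p.is_file())
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # arcname keeps only the file name, so no local folders leak in.
            archive.write(path, arcname=path.name)
    return [p.name for p in files]
```

For example, build_zip("my_texts", "my_texts.zip") creates my_texts.zip next to the folder my_texts. Write the archive outside the source folder so that an archive from an earlier run is not packed into the new one.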

(Re-)Compile

After adding data to your corpus, you need to (re-)compile it.

You will find your corpus by clicking Home and going to My corpora.

Supported formats

Supported file formats include .doc, .docx, .htm, .html, .pdf, .ps, .tmx, .txt, .vert, and .xml. An XML file can also be uploaded as plain text, but it should contain only text with structural mark-up (such as document or paragraph boundaries, or document metadata). More complex XML will not be processed correctly. Here is a sample of XML text that would be processed correctly:
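
As a minimal sketch of such simple structural mark-up, the Python function below wraps plain text in one document element with paragraph elements and stores the file name as document metadata. The element names doc and p, the attribute name filename, and the function name to_simple_xml are assumptions chosen for this illustration; adjust them to the structures your corpus needs.

```python
from xml.sax.saxutils import escape, quoteattr


def to_simple_xml(text, filename):
    """Wrap plain text in <doc> and <p> elements (names are assumptions).

    Blank lines separate paragraphs; the file name is kept as an attribute.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    lines = [f"<doc filename={quoteattr(filename)}>"]
    for paragraph in paragraphs:
        # escape() replaces &, < and > so the text cannot break the mark-up.
        lines.append(f"<p>{escape(paragraph)}</p>")
    lines.append("</doc>")
    return "\n".join(lines)


print(to_simple_xml("First & second.\n\nThird.", "report.txt"))
# <doc filename="report.txt">
# <p>First &amp; second.</p>
# <p>Third.</p>
# </doc>
```

Keeping the file name inside the text in this way follows the recommendation to store metadata in XML structures rather than relying on the 'expand' option.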

With regard to PDF files, please bear in mind that they are first converted into plain text in order to create a corpus. This conversion is still an unsolved problem in computer science. In particular, PDF files containing multiple columns, headers/footers, or words split at the end of lines may not be processed correctly.
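
If you convert a PDF to plain text yourself before uploading, words split at the end of lines can be partly repaired with a simple heuristic. The sketch below is only an illustration of that idea and not the method Sketch Engine uses; the function name rejoin_split_words is made up here, and the pattern handles only unaccented lowercase Latin letters. It will also wrongly merge genuine hyphenated compounds that happen to break at a line end, so check the result.

```python
import re

# A lowercase letter, a hyphen, a line break, then another lowercase letter.
_SPLIT_WORD = re.compile(r"([a-z])-\n([a-z])")


def rejoin_split_words(text):
    """Join words that were hyphenated across a line break."""
    # Keep the two letters and drop the hyphen and the line break between them.
    return _SPLIT_WORD.sub(r"\1\2", text)
```

For example, rejoin_split_words("conver-\nsion") returns "conversion", while a hyphen followed by a capital letter or a digit on the next line is left untouched.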