Have Sketch Engine create your own subject-specific corpora

Can't find a corpus that suits your needs? Do you work with subject-specific language? WebBootCaT is a simple, intuitive tool that creates a user corpus by automatically downloading relevant texts from the internet.

After logging in, click WebBootCaT.

WebBootCaT - create your own corpus from the web

(1) Name the corpus.
(2) Select the language.
(3) Choose how you want to define the topic of the corpus:
  • Seed words – type keywords and phrases that describe the topic
  • URLs – provide a list of web pages to download
  • Website – type a website to obtain up to 2000 text documents from that site
This example uses the first option, seed words.
(4) Type the seed words. The list does not have to be exhaustive; you can repeat the procedure later with different words to harvest more texts.
(5) Click Next >.
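
If you choose the URLs option instead, it can help to tidy the list of web pages before pasting it in. The following is a minimal sketch of that preparation step and is not part of Sketch Engine: the function name `clean_url_list` and the sample addresses are hypothetical, and the rule of keeping only http and https addresses is an assumption made for the example.

```python
from urllib.parse import urlsplit


def clean_url_list(lines):
    """Return unique, well-formed http(s) URLs in their original order.

    Illustrative helper only; the name and the filtering rules are
    assumptions for this example, not Sketch Engine behaviour.
    """
    seen = set()
    cleaned = []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        parts = urlsplit(url)
        # Keep only addresses with an http(s) scheme and a host name.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url in seen:
            continue  # skip exact repeats
        seen.add(url)
        cleaned.append(url)
    return cleaned


# Hypothetical input, e.g. lines read from a text file of candidate pages.
raw_lines = [
    "https://example.com/a",
    " https://example.com/a ",
    "ftp://example.org/file",
    "",
    "not a url",
    "http://example.org/b",
]

for url in clean_url_list(raw_lines):
    print(url)
```

The printed list can then be pasted into the URLs field, one address per line.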

Sketch Engine will find relevant web pages and display them in a list. To exclude a page, remove its tick. Then click Next >.

WebBootCaT - url suggestion for corpus from the web

Sketch Engine will then download the texts from the selected web pages and process them for use in Sketch Engine. With a large number of pages, this can take several minutes. The process is complete when the progress bar reaches 100%. Texts which are too short or have other issues are excluded.

Downloading and processing texts from the internet for inclusion into the corpus.

Relevant texts are downloaded, tagged and checked for duplicates and for text unsuitable for inclusion in the corpus.
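
To give a rough idea of what excluding short texts and duplicates can mean in practice, here is a small sketch. It does not describe Sketch Engine's actual processing: the function names, the `MIN_WORDS` threshold and the exact-match duplicate check based on a hash of the normalised text are all assumptions chosen for illustration.

```python
import hashlib

# Hypothetical threshold; the limit Sketch Engine really applies is not stated here.
MIN_WORDS = 50


def normalise(text):
    """Lower-case the text and collapse all runs of whitespace."""
    return " ".join(text.lower().split())


def filter_texts(texts, min_words=MIN_WORDS):
    """Drop texts shorter than min_words and exact duplicates.

    Two texts count as duplicates when they are identical after
    normalisation. The first copy is kept, later copies are dropped.
    """
    seen = set()
    kept = []
    for text in texts:
        norm = normalise(text)
        if len(norm.split()) < min_words:
            continue  # too short to be useful
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same content already kept
        seen.add(digest)
        kept.append(text)
    return kept


# Tiny made-up example with a low threshold so the effect is visible.
docs = [
    "too short",
    "alpha beta gamma delta",
    "Alpha  beta gamma DELTA",
    "one two three four five",
]
print(filter_texts(docs, min_words=4))
```

Real web data usually also contains near-duplicates, such as the same article with different navigation text, which an exact-match check like this one would not catch.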

Your corpus is now ready to use. Click Home in the main menu, go to My corpora and select the newly created corpus. Use the main menu to generate word sketches and the thesaurus, and to work with the corpus as normal.

Corpora from files, URLs or translation memory

You can also create corpora from other sources:

  • files and documents uploaded to Sketch Engine
  • a user-defined list of web pages
  • the translation memory of your CAT tool

To learn more about user corpora, please refer to the User manual.