Build a corpus from the web

Sketch Engine has a unique corpus-building tool, called WebBootCaT, which can create a corpus from relevant texts on the web. Data downloaded from the internet are deduplicated and cleaned, and spam and non-text content are removed, to obtain high-quality text material. The user can specify the topic area to be covered using one of these three options:

  • by providing some typical words defining the topic (seed words)
    (relevant Wikipedia article(s) can be used for keyword suggestions)
  • by providing a list of URLs which should be downloaded
  • by downloading a complete website

How to create a corpus from the web

Log in to Sketch Engine (or click Home) and select WebBootCaT.


(1) name of the corpus as it will appear in the system; it can be changed later

(2) language whose settings will be used to process the data (if your language is not on the list, see Unsupported languages below)

(3) Input type:

Seed words

define the topic by providing a list of words directly related to the topic (see details below)

URLs (web pages)

provide a list of URLs (web pages) to be downloaded; only the URLs provided will be downloaded, and links pointing to other pages will be ignored

the URLs must be separated by white space (a space character or a new line/paragraph)


Web site

Download the whole website (up to a limit of 1,000 pages). Downloading a large website may take longer than the seed word and URL options. Sketch Engine only downloads one page every few seconds to avoid being blocked from access to the website. A download of a 1,000-page site might take about 3 hours (6 hours for trial users).

The process runs in the background, so you can log out of Sketch Engine or continue using it while the process is running.
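The figures quoted above for a 1,000-page site imply an average delay per page, which can be used for a rough estimate for smaller sites. The sketch below is only arithmetic on those figures; the function names are made up for illustration, and the assumption that download time grows linearly with the number of pages is a simplification, since real times depend on the website.

```python
def seconds_per_page(total_hours: float, pages: int) -> float:
    """Average delay between page downloads implied by a total duration."""
    return total_hours * 3600 / pages


def estimated_hours(pages: int, delay_seconds: float) -> float:
    """Rough download time, assuming it grows linearly with the page count."""
    return pages * delay_seconds / 3600


# Figures quoted in the text: 1,000 pages in about 3 hours (6 for trial users)
regular_delay = seconds_per_page(3, 1000)
trial_delay = seconds_per_page(6, 1000)

print(f"regular: about {regular_delay:.1f} s per page")
print(f"trial:   about {trial_delay:.1f} s per page")
print(f"250 pages, regular account: about {estimated_hours(250, regular_delay):.2f} h")
```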

(4) (if the Seed words option is selected) type between 3 and 20 keywords separated by white space and enclose phrases in double quotes (20 is recommended, fewer than 8 might be too few, more than 40 is useless; rare words may produce fewer but more accurate results). Words do not have to be written in all their forms; the basic form, e.g. be, is sufficient.

(5) if you cannot think of enough keywords, point Sketch Engine to one or more Wikipedia articles. Sketch Engine will extract terminology from the articles. You can select which terminology should be used as seed words.

(6) when ticked (recommended), the downloaded data will be compiled automatically

(7) shows the following advanced options:

Bing search options

A tuple is a sequence of seed words submitted to a search engine to find relevant internet pages.

Tuple size
the number of seed words to be combined together for each web search
(3 or 4 is optimal; 4 may produce fewer but more accurate results)

Max tuples
the maximum number of tuples (searches) to be sent to the search engine. The limit is 100.

Max URLs per query
the maximum number of URLs to be retrieved from one search. The limit is 100.

Sites list
a whitelist of sites, e.g. will limit the search to only URLs ending in

  • the urls are matched from the end, e.g. will match
  • the whitelist url can go up to 2 levels deep, e.g.
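To make the Tuple size and Max tuples options above more concrete, here is a minimal sketch of how seed words could be combined into search queries. It is not Sketch Engine's actual code: the function names, the random sampling strategy, the default values and the example seeds are all assumptions made for illustration.

```python
import random


def make_tuples(seeds, tuple_size=3, max_tuples=100, rng=None):
    """Draw up to max_tuples distinct random combinations of seed words."""
    rng = rng or random.Random()
    seen = set()
    tuples = []
    attempts = 0
    # Cap the attempts so the loop ends even if few distinct tuples exist
    while len(tuples) < max_tuples and attempts < max_tuples * 20:
        attempts += 1
        combo = tuple(sorted(rng.sample(seeds, tuple_size)))
        if combo not in seen:
            seen.add(combo)
            tuples.append(combo)
    return tuples


def format_query(combo):
    """Join a tuple into one search string, quoting multiword seeds."""
    return " ".join(f'"{w}"' if " " in w else w for w in combo)


# Hypothetical seed words for a corpus about volcanoes
seeds = ["volcano", "lava", "magma", "eruption", "crater", "ash",
         "tectonic plate", "basalt"]
queries = make_tuples(seeds, tuple_size=3, max_tuples=10, rng=random.Random(0))
for combo in queries:
    print(format_query(combo))
```

Each printed line stands for one search. A larger tuple size makes every query more specific, which is in line with the note that 4 may produce fewer but more accurate results.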

Size restrictions

Exclude too small or too large files.

min file size
excludes files (documents) containing hardly any content

max file size
excludes documents which are too large and could therefore make the corpus unbalanced: a very long text on one topic among short texts about other topics might affect the representativeness of the corpus

All downloaded files are cleaned first. This involves removing text unsuitable for corpus use, such as menus, small print, disclaimers and advertisements, and consolidating white space.

min cleaned file size
minimum size of the cleaned file in number of words (punctuation is counted as words)

max cleaned file size
maximum size of the cleaned file in number of words (punctuation is counted as words)
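The cleaned file size limits count punctuation marks as words. The sketch below shows one way such a count and filter could work; the regular expression is a rough stand-in for the real tokenizer, and the function names and limit values are hypothetical.

```python
import re

# One token is either a run of word characters or a single punctuation mark
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def count_tokens(text: str) -> int:
    """Count words and punctuation marks, each as one token."""
    return len(TOKEN_RE.findall(text))


def passes_cleaned_size(text: str, min_tokens: int, max_tokens: int) -> bool:
    """Keep a cleaned document only if its token count is within the limits."""
    return min_tokens <= count_tokens(text) <= max_tokens


sample = "Hello, world! This is a test."
print(count_tokens(sample))                 # 6 words + 3 punctuation marks
print(passes_cleaned_size(sample, 5, 20))   # within the limits
print(passes_cleaned_size(sample, 10, 20))  # too short for a minimum of 10
```

A real tokenizer treats contractions, numbers and URLs differently, so the counts shown in Sketch Engine will not match this sketch exactly.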

White list keywords

Provide a list of words which must be included in the texts. Matching is case-sensitive and phrases can be enclosed in quotes e.g. “bread and butter”.

min total keywords
the minimum number of white list keywords that a web page (file) must contain after processing to be included in the corpus (multiple occurrences of the same keyword are counted)

min unique keywords
the minimum number of different white list keywords a web page (file) must contain (multiple occurrences of the same keyword are not counted)

min keywords ratio
the minimum ratio of keyword instances to non-keyword instances that must occur on a web page for it to be included
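The three white list thresholds above can be illustrated with a small sketch. It follows the definitions given here (case-sensitive matching, total count, unique count, ratio of keyword to non-keyword instances) but handles single-word keywords only; the function names and threshold values are assumptions, not Sketch Engine's implementation.

```python
def keyword_stats(tokens, keywords):
    """Return (total, unique, ratio) for white list keywords in a token list."""
    hits = [t for t in tokens if t in keywords]  # case-sensitive comparison
    total = len(hits)
    unique = len(set(hits))
    non_keyword = len(tokens) - total
    if non_keyword == 0:
        ratio = float("inf") if total else 0.0
    else:
        ratio = total / non_keyword
    return total, unique, ratio


def passes_whitelist(tokens, keywords, min_total, min_unique, min_ratio):
    """Check a page against the three white list thresholds."""
    total, unique, ratio = keyword_stats(tokens, keywords)
    return total >= min_total and unique >= min_unique and ratio >= min_ratio


tokens = "green energy and Green policy support green energy".split()
keywords = {"green", "energy"}

print(keyword_stats(tokens, keywords))
print(passes_whitelist(tokens, keywords, min_total=3, min_unique=2, min_ratio=0.5))
```

In this example "Green" with a capital letter is not counted because matching is case-sensitive, so the page has 4 keyword instances, 2 different keywords and 4 non-keyword tokens.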

Black list keywords

Provide a list of keywords which must not appear in a web page for it to be included in the corpus. The options are analogous to the white list options.

Click Next. Your seed words will be submitted to the search engine, and candidate URLs will be displayed, grouped by the query that produced them. You can deselect any of these URLs at this point.


Click OK to build the corpus. Sketch Engine will start downloading, cleaning, tokenizing and tagging your corpus.


You may have to compile the corpus if you did not select automatic compilation in the first step.

Your corpus is now ready to be used. Click Home, go to My corpora and select the corpus to start searching it.

If the corpus is small, you can add more texts by repeating this process as many times as necessary or by uploading your files.

Questions and answers about WebBootCaT

How do I decide on the correct parameters for WebBootCaT?

As a rule of thumb, do not worry about the advanced settings and use the defaults. Only if the defaults do not produce the results you need should you start looking into the advanced settings.

Max tuples (100) is the number of queries (seed word combinations) to be sent to the search engine. Use the maximum value to get as much data as possible.

A high number of URLs per query can result in a bigger but less relevant corpus because links found on the 2nd, 3rd and subsequent pages of the search engine results will also be included.

Finally, you can repeat the same procedure several times to enlarge the corpus. Sketch Engine will make sure no page, text or part of text is included twice (deduplication).

The white list keywords can be useful to avoid ambiguity of the seed words, i.e. you can make some of the unambiguous seed words compulsory to make sure the document matches the topic.

Black list keywords can also be used to reduce ambiguity (e.g. you might use “party” when collecting a corpus on the environment using seeds which include “green”). White lists and black lists are only necessary if you are getting irrelevant documents.

How to create a 10-million corpus?

The more WebBootCaT runs the better, because each run generates more queries. You should aim for 20-60 seeds if that is possible in your domain. You can repeat the process with the same seeds multiple times (there is only a very small probability that the same seed tuples will be chosen), or you can split your seeds into sets of 10 and run the tool with each set. Note that you can use multiword expressions such as “kick the bucket” by enclosing them in quotes, as well as proper names of different kinds.
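To get a feel for how the number of seeds affects the pool of possible queries, the sketch below splits a seed list that contains a quoted multiword expression and counts the distinct tuples available for a given tuple size. The function names and the example seeds are hypothetical, and the parsing assumes straight double quotes; it only illustrates the arithmetic, not how Sketch Engine selects tuples.

```python
import math
import shlex


def parse_seeds(raw: str) -> list[str]:
    """Split on white space, keeping double-quoted phrases as one seed."""
    return shlex.split(raw)


def possible_tuples(n_seeds: int, tuple_size: int) -> int:
    """Number of distinct unordered tuples that can be drawn from the seeds."""
    return math.comb(n_seeds, tuple_size)


seeds = parse_seeds('"kick the bucket" idiom proverb saying')
print(seeds)

for n in (10, 20, 60):
    print(n, "seeds, tuple size 3:", possible_tuples(n, 3), "possible tuples")
```

Note that shlex treats an apostrophe as a quote character, so a seed such as a contraction would need different handling.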

How to limit my corpus to British English or European Portuguese only?

Limit the search to UK domains only, or to the domains of Portugal. Type .uk (or .pt) into the Sites list in the advanced options.
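The effect of such a site restriction can be sketched as a suffix check on the host name of each candidate URL. This is only an illustration: the function name and example addresses are made up, and matching on the host name alone is an assumption about how the restriction behaves.

```python
from urllib.parse import urlsplit


def host_matches(url: str, site_suffixes: list[str]) -> bool:
    """True if the URL's host name ends with one of the given suffixes."""
    host = urlsplit(url).hostname or ""
    return any(host.endswith(suffix) for suffix in site_suffixes)


candidates = [
    "https://www.example.co.uk/news",
    "https://www.example.com/uk",
    "https://example.pt/",
]
for url in candidates:
    print(url, host_matches(url, [".uk"]))
```

The second address is rejected because ".uk" is matched against the end of the host name, not against the path.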

How do I get new seed words when I want to repeat the process?

To repeat the process with new seed words, use the keyword extraction from the current corpus.

  • click Home
  • locate your corpus and click the wrench button (manage your corpus)
  • in the left menu click on Keywords and terms, the process will start automatically
  • tick the keywords you want to use as new seed words
  • click Use WebBootCaT with selected words
  • you will need to name this part of your corpus and then proceed as normal.

You can repeat the process as many times as you like. You can see how much data you have at each stage by checking the corpus page.

Why are some paragraphs missing?

WebBootCaT uses an algorithm (jusText) to remove unwanted content such as page navigation, headers, footers and very short paragraphs (boilerplate). Filtering low-quality text from the internet is very difficult to do programmatically. This is why, on very rare occasions, some good content may be removed by mistake too.

A tip for downloading text-sparse pages (e.g. an internet forum): set Min file size and Min cleaned file size to zero in the advanced options. A boilerplate removal tool is used to extract the text from a web page, and it is likely to ignore short isolated paragraphs, which is often what forum discussions consist of.
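The following sketch shows why short paragraphs are at risk. It is a deliberately naive length filter, not jusText itself, which also takes other signals such as link density and stopword density into account; the function name, the threshold and the sample page are all invented for illustration.

```python
def keep_long_paragraphs(paragraphs, min_words=10):
    """Naive stand-in for boilerplate removal: drop very short paragraphs."""
    return [p for p in paragraphs if len(p.split()) >= min_words]


# A made-up forum page: a navigation bar, a short reply and a longer post
forum_page = [
    "Home | Forum | Login",
    "Thanks, that worked!",
    "I had the same problem after the update and fixed it by "
    "reinstalling the driver from the vendor site.",
]

kept = keep_long_paragraphs(forum_page)
for paragraph in kept:
    print(paragraph)
```

The navigation bar is removed as intended, but the short reply is lost as well, which is the kind of mistake described above.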

Unsupported languages

A corpus can be created from the web even if the language is not supported by Sketch Engine. Select “–other (UTF-8)–” from the language dropdown if your language is not listed. The following limitations then apply:

  • only the universal tokenizer can be applied (or use your own tokenizer before uploading data),
  • no automated taggers can be applied (or use your own tagger before uploading data),
  • automatic encoding detection might be limited – uploading files in UTF-8 is recommended,
  • search engine setting will not constrain the search to any language when using WebBootCaT.

The Word sketch feature and related functions depend on a user-defined sketch grammar; alternatively, you can select the universal generic sketch grammar.


Duplicated content

When creating a corpus or adding new texts to an existing corpus using WebBootCaT, a simple strategy is applied to avoid duplicated content:

  • Sketch Engine will not download the same URL twice into the same corpus
  • if exactly the same content (an exact copy of the same document) is found on a different url, it will not be downloaded again
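The two rules above can be sketched as a registry of URLs and content fingerprints. How Sketch Engine detects identical content is not described here, so the use of a SHA-256 hash, the class name and the example addresses are assumptions made for illustration.

```python
import hashlib


class DownloadRegistry:
    """Remembers which URLs and which exact contents a corpus already holds."""

    def __init__(self):
        self.seen_urls = set()
        self.seen_hashes = set()

    def should_keep(self, url: str, content: str) -> bool:
        # Rule 1: never take the same URL twice into the same corpus
        if url in self.seen_urls:
            return False
        self.seen_urls.add(url)
        # Rule 2: skip an exact copy of a document found at a different URL
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in self.seen_hashes:
            return False
        self.seen_hashes.add(digest)
        return True


registry = DownloadRegistry()
print(registry.should_keep("https://example.org/a", "Same text."))   # new
print(registry.should_keep("https://example.org/a", "Other text."))  # URL seen
print(registry.should_keep("https://example.org/b", "Same text."))   # copy
```

Only exact copies are caught this way; a document that differs by a single character produces a different hash and is kept.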

A more sophisticated deduplication option becomes available if the user decides to compile the corpus manually. This option has to be selected manually and uses the onion deduplication tool.

I cannot download a specific website.

The internet is a decentralised and constantly changing place; therefore, we cannot guarantee that a particular website can be downloaded. You can try another tool for downloading entire websites, e.g. HTTrack or Wget.

For more information on WebBootCaT, please see WebBootCaT: a web tool for instant corpora (2006).

WebBootCaT: instant domain-specific corpora to support human translators

  • Marco Baroni, Adam Kilgarriff, Jan Pomikálek and Pavel Rychlý (2006)
  • In Proceedings of EAMT. 11th Annual Conference of the European Association for Machine Translation. Oslo, Norway, pp. 247–252