How to build a corpus from the web

Build a corpus from the web

Sketch Engine also serves as corpus-building software. It has a unique corpus-building tool, which uses the WebBootCaT technology to create a text corpus automatically from relevant web pages. Data downloaded from the internet are cleaned, stripped of non-text content and optionally deduplicated to obtain linguistically valuable text material. The user can specify which content should be downloaded via one of these options:

by providing some typical words defining the topic (seed words)
(relevant Wikipedia article(s) can be used for seed word suggestions)
by providing a list of URLs which should be downloaded
by downloading a complete website

The user can also upload files to build a corpus.

Who can access my data?

Sketch Engine is not a public cloud. Texts you upload and corpora you create will be stored in your personal space in your account. Other users cannot access your texts. You can, however, choose to grant access to other users. An explicit action has to be taken by the user for this to happen.

How to create a corpus from the web

There are 3 ways to reach the corpus building tool:

on the corpus dashboard click NEW CORPUS
on the select corpus advanced screen click NEW CORPUS
open the corpus selector at the top of each screen and click CREATE CORPUS

In the corpus building interface

type a name for your new corpus, select the language, optionally provide a description and click NEXT
select the Find texts on the web option
click on the help icons to learn about the options and settings

This process can be repeated to make the corpus larger. Building from the web can be combined with uploading files to the corpus.

FAQs

How do I decide on the correct parameters?

As a rule of thumb, do not worry about the advanced settings and use the defaults. Only start looking into the advanced settings if the defaults do not produce the desired results.

You can repeat the same procedure several times to enlarge the corpus. Sketch Engine will make sure no page is included twice.

The allowlist keywords can be useful to avoid ambiguity of the seed words, i.e. you can make some of the unambiguous seed words compulsory to make sure the document matches the topic.

Denylist keywords can also be used to reduce ambiguity (e.g. when collecting a corpus on the environment with the seed word “party”, you might put “politics” on the denylist). The denylist and allowlist are only needed if irrelevant documents are being found.
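The effect of the two lists can be pictured with a minimal Python sketch. It is not Sketch Engine's implementation: the function names are hypothetical, and it assumes that a single allowlist hit is enough to keep a document and a single denylist hit is enough to reject it, whereas the real tool may apply different thresholds.

```python
import re


def tokenize(text):
    """Lowercase the text and return the set of word tokens it contains."""
    return set(re.findall(r"\w+", text.lower()))


def keep_document(text, allowlist=(), denylist=()):
    """Return True if the document passes both keyword lists.

    Assumption (not taken from Sketch Engine): one allowlist hit is enough
    to accept, and one denylist hit is enough to reject.
    """
    words = tokenize(text)
    # If an allowlist is given, at least one of its keywords must be present.
    if allowlist and not any(k.lower() in words for k in allowlist):
        return False
    # No denylist keyword may be present.
    return not any(k.lower() in words for k in denylist)


docs = [
    "The green party published its plan for cutting carbon emissions.",
    "Party politics dominated the debate in parliament.",
    "We hired a clown for the birthday party.",
]
kept = [
    d for d in docs
    if keep_document(d, allowlist=["emissions"], denylist=["politics"])
]
```

All three documents contain the ambiguous seed word “party”, but only the first one passes both lists.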

How to create a 10-million-word corpus?

You can run the corpus building tool many times to build a bigger corpus. Aim for 20-60 seeds if that is possible in your domain. You can repeat the process with the same seeds multiple times (most likely, different seed groups will be used each time). It is also possible to split your seeds into sets of 10 and run the tool with each set. Note that seeds can be multiword expressions enclosed in quotes, such as “kick the bucket”, and also proper names of different kinds.
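Preparing the seed sets outside the interface can be sketched in a few lines of Python. The helper names below are hypothetical, and the group size of 3 in `random_seed_groups` is an assumption chosen only to illustrate why repeated runs with the same seeds tend to use different seed groups; it is not a documented Sketch Engine value.

```python
import random


def split_into_sets(seeds, size=10):
    """Split a list of seeds into consecutive sets of at most `size` seeds."""
    return [list(seeds[i:i + size]) for i in range(0, len(seeds), size)]


def quote_multiword(seed):
    """Wrap a multiword seed in double quotes so it is treated as one unit."""
    if " " in seed and not seed.startswith('"'):
        return f'"{seed}"'
    return seed


def random_seed_groups(seeds, group_size=3, n_groups=5, rng=None):
    """Draw random groups of distinct seeds (group_size is an assumption)."""
    rng = rng or random.Random()
    return [rng.sample(list(seeds), group_size) for _ in range(n_groups)]


seeds = [f"seed{i}" for i in range(25)]   # placeholder seed words
seed_sets = split_into_sets(seeds)        # three sets: 10, 10 and 5 seeds
```

Each element of `seed_sets` can then be pasted into the tool as the seed words for one run.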

How to limit my corpus to British English or European Portuguese only?

Limit the search to UK domains or to the domains of Portugal: type .uk (for British English) or .pt (for European Portuguese) into the site list in the advanced options. Refer to the corresponding help icon in the interface.
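What such a site list restriction means for individual URLs can be illustrated with a short Python sketch. The function `in_site_list` is hypothetical and only shows the idea of matching on the domain suffix; how Sketch Engine applies the restriction internally is not described here.

```python
from urllib.parse import urlparse


def in_site_list(url, site_list):
    """Return True if the URL's host falls under one of the listed domains."""
    host = urlparse(url).hostname or ""
    for site in site_list:
        suffix = site.lower().lstrip(".")
        # Match the domain itself or any subdomain of it.
        if host == suffix or host.endswith("." + suffix):
            return True
    return False


urls = [
    "https://www.bbc.co.uk/news",
    "https://example.com/uk/page",
    "https://www.publico.pt/",
]
uk_only = [u for u in urls if in_site_list(u, [".uk"])]
```

Note that the match is on the host name, so a path segment such as `/uk/` on a `.com` site does not count.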

How do I get new seed words when I want to repeat the process?

To repeat the process with new seed words, use the keyword extraction from the current corpus.

go to Manage corpus and click Make bigger
select Find texts on the web
click SUGGESTIONS
tick the keywords you want to use as new seed words

The terms previously used as seed words are highlighted.

You can repeat the process as many times as you like. You can see how much data you have at each stage by checking the corpus page.

Why are some paragraphs missing?

The web building tool uses the jusText tool to remove boilerplate, i.e. unwanted content such as page navigation, headers, footers and very short paragraphs. Distinguishing low-quality text from good-quality text programmatically is very difficult. This is why, on very rare occasions, some good content may also be removed by mistake.

A tip for downloading pages with little text on them: set Min file size and Min cleaned file size to zero in the advanced options. The tool is still likely to ignore short isolated paragraphs, which are common in some online forums and discussions.

Refer to the help icon next to the options in the Expert settings.
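Why setting both limits to zero helps with short pages can be seen in a minimal Python sketch. It assumes that Min file size is checked against the downloaded page and Min cleaned file size against the text left after boilerplate removal, and that both are measured in bytes; the function name and the non-zero thresholds in the example are hypothetical, not Sketch Engine defaults.

```python
def passes_size_limits(raw_html, cleaned_text,
                       min_file_size=0, min_cleaned_file_size=0):
    """Return True if a page is large enough before and after cleaning.

    Assumption: both limits are byte counts of the UTF-8 encoded content.
    """
    raw_size = len(raw_html.encode("utf-8"))
    cleaned_size = len(cleaned_text.encode("utf-8"))
    return raw_size >= min_file_size and cleaned_size >= min_cleaned_file_size


# A short forum reply: very little text survives cleaning.
raw = "<html><body><p>Thanks, that worked!</p></body></html>"
cleaned = "Thanks, that worked!"

# Hypothetical non-zero limits reject the page; zero limits accept it.
rejected = not passes_size_limits(raw, cleaned, 5000, 2000)
accepted = passes_size_limits(raw, cleaned, 0, 0)
```

Even with both limits at zero, the boilerplate removal step itself may still discard very short paragraphs before the size check is reached.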

Unsupported languages

A corpus can be created from the web even if the language is not supported by Sketch Engine. Select “–other (UTF-8)–” from the language dropdown if your language is not listed. The following limitations then apply:

only the universal tokenizer can be applied (or use your own tokenizer prior to uploading data)
no automated taggers can be applied (or use your own tagger prior to uploading data)
automatic encoding detection might be limited; uploading files in UTF-8 is recommended
the search engine setting will not constrain the search to any language when using WebBootCaT

The Word sketch feature and related functions depend on a sketch grammar: either define your own or select the universal generic sketch grammar.

Solving problems

Log file
The exact reason why a page was not included in the corpus can be found in the log file (MANAGE CORPUS – LOGS; the log name contains bootcat_and_compile.log).

Forums, discussions and other text sparse pages
Set Min file size and Min cleaned file size to zero in the advanced options. Very short isolated paragraphs may still be ignored because they might be incorrectly identified as navigation menus or similar linguistically unsuitable content. If necessary, use the Save as option in your browser and upload the saved page to Sketch Engine manually.

Alternative download tools
Tools such as HTTrack, cURL or Wget might be able to download the problematic pages. These tools can also help with password-protected web pages. Bear in mind possible legal implications when using these tools to download internet content.

Bibliography

For more information on WebBootCaT, please see WebBootCaT: a web tool for instant corpora (2006).

WebBootCaT: instant domain-specific corpora to support human translators

Marco Baroni, Adam Kilgarriff, Jan Pomikálek and Pavel Rychlý (2006)
In Proceedings of EAMT. 11th Annual Conference of the European Association for Machine Translation. Oslo, Norway, pp. 247–252
