
Manual for GDEX

To start using GDEX quickly, read "How to sort sentences by GDEX in Sketch Engine?".

Introduction

GDEX is an abbreviation for "Good Dictionary EXamples". It is a system that evaluates sentences with respect to their suitability as dictionary examples. It is typically used to sort sentences so that lexicographers do not have to search for good examples among hundreds of unusable sentences. Especially in web-based corpora, it can effectively rule out sentences that are poor candidates for dictionary examples, and it offers lexicographers a selected set of sentences with a higher chance of containing a good example.

The exact way the sentences are sorted can be adapted to various languages, or even to various purposes, by changing parameters in a GDEX configuration file. Custom configurations can be created and evaluated partly with tools provided directly with GDEX and partly with external applications.

Additionally, the GDEX library contains a simple web interface called GDEX Tools that facilitates all GDEX-related tasks and provides access to supporting applications.

For the fees for a user-specific GDEX configuration, please contact inquiries@sketchengine.co.uk.

GDEX in Sketch Engine

Sketch Engine uses GDEX to sort sentences in Concordances and in TickBox Lexicography (TBL). Sorting of concordances using GDEX has to be activated in View Options; otherwise the concordance is shown in corpus order. Sorting in TickBox Lexicography is always activated, and up to 300 sentences (if available) are sorted for each collocation.

Currently, only the default GDEX configuration is available to all users. It was trained on English, so it may not give good results for other languages. It is, however, possible to create and use custom GDEX configurations.

 

Adding user GDEX configurations

The online interface provides a special page for uploading user configurations to Sketch Engine. Local installations need to register GDEX configurations manually. Currently the upload page is hidden and not advertised anywhere, as user configurations can cause errors if not set up properly. Once a GDEX configuration is uploaded, it becomes available for selection in the View Options dialog. Since configurations do not have to be corpus or language dependent, it is up to the user to use them with the correct corpora.

Uploaded user configurations can also be shared with other users or user groups.

Selecting from a list of GDEX configurations

If more than one GDEX configuration is available, a drop-down list appears in View Options. The selected configuration is used for sorting in both Concordance View and TickBox Lexicography.

Comparing two different GDEX configurations

Similarly, if more than one GDEX configuration is available, another drop-down list appears on the TBL result page. There the user can select an alternative configuration, which is used to sort the same set of sentences side by side with the first GDEX configuration.

GDEX Configuration Files

Technically, GDEX assigns each sentence a score and sorts the sentences from the best to the worst. The assigned value is composed of the results of a variety of classifiers that measure various features. The exact set of measured features and the way they are combined is specified by the GDEX configuration file. Each configuration file is a description of the sentence evaluation function.
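
As a rough illustration of this idea, the following Python sketch scores sentences with two toy feature functions and sorts them from best to worst. The feature functions, their weights and the weighted-average combination are assumptions made for this example only; real GDEX configurations define their own classifiers and the way they are combined.

```python
from typing import Callable, List, Tuple

# A hypothetical feature: a function mapping a sentence to a number in [0, 1].
Feature = Callable[[str], float]


def length_feature(sentence: str) -> float:
    """Toy feature: 1.0 for sentences of 5 to 15 tokens, 0.0 otherwise."""
    n = len(sentence.split())
    return 1.0 if 5 <= n <= 15 else 0.0


def ends_with_period(sentence: str) -> float:
    """Toy feature: 1.0 if the sentence ends with a full stop."""
    return 1.0 if sentence.rstrip().endswith(".") else 0.0


def score(sentence: str, features: List[Tuple[Feature, float]]) -> float:
    """Combine feature values as a weighted average (an assumed scheme)."""
    total_weight = sum(weight for _, weight in features)
    return sum(f(sentence) * weight for f, weight in features) / total_weight


def sort_by_score(sentences: List[str],
                  features: List[Tuple[Feature, float]]) -> List[str]:
    """Return the sentences ordered from the best score to the worst."""
    return sorted(sentences, key=lambda s: score(s, features), reverse=True)


FEATURES = [(length_feature, 2.0), (ends_with_period, 1.0)]
SENTENCES = [
    "click here",
    "The dog chased the ball across the garden.",
    "He ran",
    "Short one.",
]
RANKED = sort_by_score(SENTENCES, FEATURES)
```

Here the full sentence about the dog ranks first, while fragments such as "click here" end up at the bottom of the list.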

Configuration File Syntax

GDEX configurations can be created using the Configuration Editor or written manually, since a configuration file is simply a text file with attribute-value pairs.

  • every configuration contains exactly one attribute called classifier
  • { as a value signifies the beginning of the classifier definition
  • } closes the classifier definition
  • the attribute-value pairs are on separate lines. The values can be atomic (a single value) or a list, in which case the list itself contains attribute-value pairs, each on a separate line.
  • [ as a value signifies the beginning of a list; each line within the list represents a separate value
  • ] on a separate line closes the list
  • each classifier is described at least by its classid, name and obligatory arguments. A full description of all classifiers and their arguments can be found in the Configuration Editor, described below
  • empty lines are ignored
  • lines beginning with # are ignored. This is useful for adding comments
  • a sequence of tabs and spaces is interpreted as a single space

A simple configuration file:

# comment
name	example
classifier	{
    name        the shorter, the better
    classid     op_optimal_interval
    weight      1
    subclassifiers   [
                        {
                        name      sentence length
                        classid   s_sentence_length
                        weight    1
                        }
                     ]
    low         0
    high        1
}
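
To make the syntax rules above concrete, here is a minimal Python sketch of a reader for this format. It is not the parser used by GDEX; the function name parse_gdex_config and the choice to keep every atomic value as a string are assumptions of this example. It also treats indented lines starting with # as comments, which the rules above do not state explicitly.

```python
def parse_gdex_config(text):
    """Parse the attribute-value syntax into nested dicts and lists."""
    # Drop empty lines and comments; collapse runs of tabs and spaces.
    lines = []
    for raw in text.splitlines():
        stripped = " ".join(raw.split())
        if stripped and not stripped.startswith("#"):
            lines.append(stripped)
    pos = 0

    def parse_value(value):
        if value == "{":
            return parse_block()
        if value == "[":
            return parse_list()
        return value  # atomic values are kept as strings

    def parse_block(top_level=False):
        nonlocal pos
        result = {}
        while pos < len(lines):
            line = lines[pos]
            pos += 1
            if line == "}":
                if top_level:
                    raise ValueError("unexpected }")
                return result
            key, _, value = line.partition(" ")
            result[key] = parse_value(value)
        if not top_level:
            raise ValueError("missing closing }")
        return result

    def parse_list():
        nonlocal pos
        items = []
        while pos < len(lines):
            line = lines[pos]
            pos += 1
            if line == "]":
                return items
            items.append(parse_value(line))
        raise ValueError("missing closing ]")

    return parse_block(top_level=True)


EXAMPLE = """# comment
name\texample
classifier\t{
    name        the shorter, the better
    classid     op_optimal_interval
    weight      1
    subclassifiers   [
                        {
                        name      sentence length
                        classid   s_sentence_length
                        weight    1
                        }
                     ]
    low         0
    high        1
}
"""
CONFIG = parse_gdex_config(EXAMPLE)
```

With the sample file from above, CONFIG["classifier"]["subclassifiers"] is a list holding one nested classifier definition.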
 

GDEX Tools

In addition to the basic functionality, GDEX comes with a web interface that can be used for creating user configurations, evaluating GDEX results and some management tasks. The interface is available at http://gdex.sketchengine.co.uk/sandbox.

The first step in GDEX Tools is selecting a GDEX configuration or creating a new one in the Configuration Editor. GDEX configurations created in GDEX Tools can be downloaded from the "Configurations" section of the "Files" menu and then uploaded into Sketch Engine. You can also use GDEX configurations within GDEX Tools to evaluate annotated data and to export the intermediate classifier results for the WEKA data mining software.

Configuration Editor

To open the Configuration Editor, click the Editor link at the top of the GDEX Tools menu. The Editor page consists of three parts:

  • The configuration name field at the top. It is an obligatory attribute and it cannot be omitted.
  • A configuration tree on the left side of the screen shows the structure of the classifiers and subclassifiers in the configuration.
  • The classifier being edited is shown in the largest area. There is a short description of the classifier's function at the top, below is a list of attributes of the classifier, and the very bottom of the area shows a list of subclassifiers that can be chosen.

Categories of classifiers

The classifiers are divided into 6 categories. The first 4 can be chosen as the root classifier; the remaining 2 can serve only as subclassifiers:

  • Compositional Operators - take values of multiple classifiers for each sentence and return a single number representing the score for that sentence
  • Sentence Operators - take a result of a single classifier and modify it somehow (e.g. normalize it).
  • Sentence Analyzers - give a single numerical score for the sentence itself.
  • Token Operators - take a list of numbers for individual tokens (words or punctuation) within a sentence and return a single number for a sentence.
  • Token Analyzers - are classifiers that take a list of string values for individual tokens and return a list of numbers for each token.
  • Attribute - a single classifier that returns a list of attributes (such as tag or lemma) for each word in the sentence. The attributes correspond to the attributes available in the underlying corpus. Sentences provided via an XML file always contain the 'word' attribute, and also any attributes provided via the '/'-notation.
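
The way these categories fit together can be sketched in Python by treating each category as a function with a particular input and output type. All function names, the token representation and the numeric constants below are hypothetical and only illustrate the data flow described in the list; they are not GDEX classifiers.

```python
from typing import Dict, List

Token = Dict[str, str]      # e.g. {"word": "Dogs", "lemma": "dog"}
Sentence = List[Token]


def attribute(sentence: Sentence, name: str) -> List[str]:
    """Attribute: one string value (e.g. word or lemma) per token."""
    return [token[name] for token in sentence]


def token_length(values: List[str]) -> List[float]:
    """Token Analyzer: a string per token -> a number per token."""
    return [float(len(value)) for value in values]


def mean(numbers: List[float]) -> float:
    """Token Operator: per-token numbers -> one number for the sentence."""
    return sum(numbers) / len(numbers) if numbers else 0.0


def sentence_length(sentence: Sentence) -> float:
    """Sentence Analyzer: one number computed from the sentence itself."""
    return float(len(sentence))


def scale(value: float, maximum: float) -> float:
    """Sentence Operator: modify one result, here scale it and cap at 1.0."""
    return min(value / maximum, 1.0)


def weighted_sum(values: List[float], weights: List[float]) -> float:
    """Compositional Operator: several results -> one sentence score."""
    return sum(v * w for v, w in zip(values, weights))


SENTENCE = [
    {"word": "Dogs", "lemma": "dog"},
    {"word": "bark", "lemma": "bark"},
    {"word": ".", "lemma": "."},
]
AVG_WORD_LEN = mean(token_length(attribute(SENTENCE, "word")))
SCORE = weighted_sum(
    [scale(AVG_WORD_LEN, 10.0), scale(sentence_length(SENTENCE), 30.0)],
    [0.5, 0.5],
)
```

The nesting of the calls mirrors a configuration tree: the compositional operator is the root, and attributes are always the leaves.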

Creating A New Configuration

If you want to create a new configuration, leave the Select configuration field empty, fill in the Configuration name (avoid names that are already in use, or the old configurations will be overwritten), add at least one root classifier and click Save. Alternatively, you can open an existing configuration, edit it and save it under a new name.

Managing existing configurations

All configurations created in the Configuration Editor are stored on the server. You can view or delete them after opening the Files link in the top menu. This page has three sections: the first lists all GDEX configurations, the second lists all WEKA models uploaded to the server, and the last shows all mapping files on the server. Uploaded WEKA models can be used by the WEKA classifier in GDEX configurations (see Creating WEKA classifiers below); similarly, mapping files can be used by the mapping classifier.

Downloading GDEX configurations

To use a configuration in Sketch Engine, it is currently necessary to download it from GDEX Tools and then upload it to Sketch Engine.

TBL Logs

In this section, you can analyze the TBL logs for a specified corpus.

The first application computes the number of selected sentences per lemma and per collocation, their ratio to all sentences, etc. It also generates a set of graphs showing how many sentences were selected per query per user. Further information about GDEX usage can be obtained from these; please note, however, that the relation between user queries and selected sentences is computed from lists of sentences without any time or explicit query information.

The second application exports the sentences from the TBL logs into two plain-text files (good and bad sentences separately).

For both applications the corpus name needs to be specified. The logs can optionally be filtered by a regular expression matching selected user names.

Cooperation with WEKA

The most important feature of GDEX Tools is the ability to export the intermediate results of GDEX classifiers into an ARFF file, which can be imported into data analysis or data mining software such as WEKA. WEKA is Java software that can be run locally on the client's computer; for installation instructions, please consult its web pages. Note that only numeric values are exported, partly because more complex data would be difficult to analyze and partly because of ARFF limitations.
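
For readers unfamiliar with ARFF, the following Python sketch writes a small table of numeric values in that format. It only illustrates the general shape of an ARFF file with numeric attributes; the relation name, the column names and the sample values are invented for this example, and the files actually produced by GDEX Tools may be laid out differently.

```python
from typing import Sequence


def to_arff(relation: str, attributes: Sequence[str],
            rows: Sequence[Sequence[float]]) -> str:
    """Serialize numeric rows as ARFF text, one numeric attribute per column."""
    lines = [f"@relation {relation}", ""]
    for name in attributes:
        # Names without spaces need no quoting in ARFF.
        lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("row length does not match attribute count")
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical columns: two classifier results and a 0/1 annotation.
ARFF_TEXT = to_arff(
    "gdex_sketch",
    ["sentence_length", "avg_token_length", "selected"],
    [[8, 4.25, 1], [2, 5.0, 0]],
)
```

Each row after @data corresponds to one sentence, and each column to one exported value.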

The process is described in 2 subsections: the first covers functionality within GDEX Tools and the second covers functionality within the WEKA software itself.

1) Within GDEX Tools

Exporting intermediate GDEX results into ARFF

Open the Evaluate Annotated Data link and you will be provided with a list of export methods:

  • Evaluate concordance from an XML file - this method can be used for analysis of user sentences (not from SkE). It asks for an XML file with sentences within <zgled> tags and keywords within <i> tags. If a need to analyze XML files with different tags arises, the interface can be extended. The Default corpus field is obligatory if the configuration contains corpus-dependent classifiers (e.g. token frequency without a specified corpus); otherwise it can be left empty. The field File contains tags in '/'-notation allows you to specify other attributes present in the XML file. For example, the value "tag lemma" says that each word is followed by these attributes in the form 'word /tag /lemma'.
  • Evaluate logs from TBLex - requires the name of the corpus (e.g. fidaplus2-sld) for which the sentences selected in TBL were logged
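
The XML input for the first method can be pictured with the following Python sketch, which reads sentences from <zgled> elements and the keyword from the <i> element inside each of them. The wrapping <examples> root element and the sample sentences are assumptions of this example; only the <zgled> and <i> tags come from the description above, and the '/'-notation for extra attributes is not covered.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the <examples> root element is an assumption.
SAMPLE = """<examples>
  <zgled>The <i>dog</i> chased the ball.</zgled>
  <zgled>She walked her <i>dog</i> every morning.</zgled>
</examples>"""


def read_examples(xml_text):
    """Return (sentence, keyword) pairs from all <zgled> elements."""
    root = ET.fromstring(xml_text)
    pairs = []
    for zgled in root.iter("zgled"):
        # Join all text inside the element and normalize whitespace.
        sentence = " ".join("".join(zgled.itertext()).split())
        keyword_elem = zgled.find("i")
        keyword = keyword_elem.text if keyword_elem is not None else None
        pairs.append((sentence, keyword))
    return pairs


EXAMPLES = read_examples(SAMPLE)
```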

After running either method, you will be redirected to the Task manager, where all available tasks are listed. A Download results link appears next to a task if it ends correctly. The downloaded file can be opened in WEKA Explorer.

Uploading WEKA models

The second section on the Files page of GDEX Tools lists the available WEKA models that can be referenced by name from WEKA classifiers. New WEKA models can also be uploaded there.

2) Using WEKA to analyze annotated data

Since all the GDEX-related data are stored on the server, it is necessary to first export the required data in ARFF as described above.

Opening ARFF files in WEKA

  • run WEKA
  • open Explorer
  • click Open File and find the downloaded ARFF file

Analyzing the data with WEKA

You will instantly see a preview of the data distribution in the bottom right corner. On the Preprocess page, you can apply a variety of filters to the data and study the results: click Choose, select a filter, optionally change its parameters by clicking on the filter command, and finally click Apply.

Another interesting view is on the Visualize page, where the distribution of the data is shown in plots with any combination of axes.

Creating WEKA classifiers

In order to create a WEKA classifier, it is necessary to go through several steps:

  • create a GDEX configuration with the _weka root classifier and all its subclassifiers, but do not select a WEKA Model yet
  • use the created configuration to evaluate annotated data and open the created ARFF in WEKA
  • use WEKA to learn a WEKA model and save it
  • upload the WEKA model into GDEX Tools
  • open the WEKA configuration that was used for preparing the learning data, fill in the Weka Model and Weka Method, and store the configuration under a new name
  • download the configuration and reupload it to Sketch Engine

Learning WEKA models

  • to be used within GDEX, WEKA models need to be trained on data of exactly the same structure as the corpus data that they will be applied to, so choose your learning data carefully
  • after you open the ARFF in WEKA, go to the Classify page
  • select a learning algorithm (click on 'Choose'; after selecting the method, you can adjust it by clicking on its command line)
  • adjust the Test options and click Start
  • after the model is learned, right-click on it in the Result list and select Save Model

Last modified on Mar 9, 2012, 2:48:18 PM

Copyright © Lexical Computing, Ltd.