Task #2: Evaluating Word Sense Induction and Discrimination Systems

Organizers
Eneko Agirre, University of the Basque Country (e.agirre at ehu dot es)
Aitor Soroa, University of the Basque Country (a.soroa at ehu dot es)

Motivation
The motivation for this task is that most existing word sense induction and discrimination systems (also known as corpus-based unsupervised systems) have been evaluated by ad-hoc methods, and usually by the authors of the method themselves. We want to make clear that this task is intended for such sense-induction systems, and not for knowledge-based systems (which are also often called "unsupervised"). See Pedersen (2006) for an overview.

The goal of this task is to allow comparison across sense-induction and discrimination systems, and also to compare these systems to supervised and knowledge-based systems. With this goal in mind, the following evaluation types are proposed:

1) evaluating the induced senses as clusters of examples. The induced clusters will be compared to the sets of examples tagged with the given gold standard word senses (classes), and evaluated using the F-score measure for clusters.

2) mapping induced senses to gold standard senses, and using the mapping to tag the test corpus with gold standard tags. The mapping will be automatically produced by the task organizers, and the resulting system evaluated according to the usual precision and recall measures for supervised word sense disambiguation systems.

This double evaluation methodology has already been tried in (Agirre et al. 2006).

Datasets and formats
The dataset will comprise the texts from the English lexical-sample task in SemEval-2007 (task 17).

The input and output of participant systems will follow the usual Senseval-3 format, with one difference: the labels for senses in the output can be arbitrary symbols. Please note that the output will contain instances of different words, so the labels of the induced senses must be unique across words. For instance, assume that a participant system has induced 2 senses for the noun "brother" (named brother.n.C0 and brother.n.C1) and 3 senses for the verb "shake" (named shake.v.C0, shake.v.C1 and shake.v.C2). These are example outputs for a sample of instances of both words:

brother.n brother.n.00001 brother.n.C1
brother.n brother.n.00002 brother.n.C0 brother.n.C1
...
shake.v shake.v.00001 shake.v.C2/0.4 shake.v.C0/0.5 shake.v.C1/0.1
shake.v shake.v.00002 shake.v.C2/914 shake.v.C0/817

In the first line the system assigns sense brother.n.C1 to instance brother.n.00001 with weight 1 (the default). In the second line the system assigns equal weight (1 by default) to senses brother.n.C0 and brother.n.C1. In the last two lines the weights are given explicitly for the senses of shake. Weights need not add up to one, but must be positive. Senses not mentioned in a line get weight 0. Check this site for more details on formats.

We interpret the results as a hard clustering task, with systems assigning the sense with maximum weight. In case of ties, we interpret that the system is forming a new sense which is a combination of the tied senses. For the example above: brother.n.00001 is assigned brother.n.C1; brother.n.00002 is assigned a new sense combining brother.n.C0 and brother.n.C1 (a tie at the default weight); shake.v.00001 is assigned shake.v.C0 (the maximum weight, 0.5); and shake.v.00002 is assigned shake.v.C2.
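As an illustrative sketch of this interpretation (our own Python, not an official script; the function name hard_sense and the combined-label convention "A+B" are ours), the following reads answer lines in the format above, keeps the senses with maximum weight, and joins tied senses into a single combined label:

# Sketch of the hard-clustering interpretation described above
# (our own illustration, not the official scorer).
def hard_sense(line):
    fields = line.split()
    instance, answers = fields[1], fields[2:]
    weighted = []
    for answer in answers:
        if "/" in answer:
            sense, weight = answer.rsplit("/", 1)
            weighted.append((sense, float(weight)))
        else:
            weighted.append((answer, 1.0))  # default weight is 1
    top = max(weight for _, weight in weighted)
    tied = sorted(sense for sense, weight in weighted if weight == top)
    # A tie forms a new sense that combines the tied senses.
    return instance, "+".join(tied)

print(hard_sense("brother.n brother.n.00002 brother.n.C0 brother.n.C1"))
# ('brother.n.00002', 'brother.n.C0+brother.n.C1')
print(hard_sense("shake.v shake.v.00001 shake.v.C2/0.4 shake.v.C0/0.5 shake.v.C1/0.1"))
# ('shake.v.00001', 'shake.v.C0')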

We recommend that participants return all induced senses per instance with associated weights, as these will be used for the second variety of evaluation (see below).

Participation
These are the steps to be followed by participants (see also important dates below):
  1. register on the Semeval website
  2. download the data from the Semeval website 
  3. participants have 2 weeks to induce the "senses", tag the whole dataset with those "senses", and upload the results to the Semeval website

Evaluation
The organizers will return evaluation results in two varieties:

a. clustering-style evaluation. We interpret the gold standard (GS) as a clustering solution: all examples tagged with a given sense in the GS form a class. The examples returned by participants that share the "sense" tag with maximum weight form the clusters. We compare participants' clusters on the test data with the classes in the gold standard, and compute the F-score as usual (Agirre et al. 2006). In case of ties (or multiple sense tags in the GS), a new sense is formed.
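For concreteness, here is a minimal sketch of the F-score for clusters (our own code, not the official evaluation script): each gold class gets the F-score of its best-matching cluster, and the overall score is the average over classes, weighted by class size.

# Minimal sketch of the clustering F-score (our own illustration).
# gold and induced map instance ids to a gold class label and to an
# induced cluster label, respectively; both are plain Python dicts.
def cluster_fscore(gold, induced):
    classes, clusters = {}, {}
    for inst, label in gold.items():
        classes.setdefault(label, set()).add(inst)
    for inst, label in induced.items():
        clusters.setdefault(label, set()).add(inst)
    total = sum(len(members) for members in classes.values())
    score = 0.0
    for class_members in classes.values():
        best = 0.0
        for cluster_members in clusters.values():
            n = len(class_members & cluster_members)
            if n == 0:
                continue
            p = n / len(cluster_members)  # precision of cluster w.r.t. class
            r = n / len(class_members)    # recall of cluster w.r.t. class
            best = max(best, 2 * p * r / (p + r))
        score += (len(class_members) / total) * best  # weight by class size
    return score

With this definition, the overall F-score reaches 1 only when the induced clusters coincide exactly with the gold classes.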

b. mapping to the GS sense inventory: the organizers use the training/test split of the data (as defined in task 17) to map the participants' "senses" onto the official sense inventory. Using this mapping, the organizers convert the participants' results into the official sense inventory and compute the usual precision and recall measures. See (Agirre et al. 2006) for more details.
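As a hedged sketch of one simple way to build such a mapping (the organizers' actual procedure, described in Agirre et al. 2006, may be more elaborate): map each induced sense to the gold sense it co-occurs with most often in the training split, then score the test split with the usual precision and recall.

# Hedged sketch of the mapping evaluation: each induced sense is
# mapped to its most frequent co-occurring gold sense in training
# (the organizers' actual mapping may differ).
from collections import Counter

def build_mapping(train_induced, train_gold):
    counts = {}
    for inst, sense in train_induced.items():
        counts.setdefault(sense, Counter())[train_gold[inst]] += 1
    return {sense: c.most_common(1)[0][0] for sense, c in counts.items()}

def precision_recall(test_induced, test_gold, mapping):
    attempted = correct = 0
    for inst, gold_sense in test_gold.items():
        induced_sense = test_induced.get(inst)
        if induced_sense is None or induced_sense not in mapping:
            continue  # unanswered instances hurt recall, not precision
        attempted += 1
        correct += mapping[induced_sense] == gold_sense
    precision = correct / attempted if attempted else 0.0
    recall = correct / len(test_gold) if test_gold else 0.0
    return precision, recall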

The first evaluation variety gives better scores to induced senses that are most similar to the GS senses (e.g. having a similar number of senses). The second evaluation variety allows for comparison with other kinds of systems, and does not necessarily favor systems inducing senses similar to the GS. We have used this framework to evaluate graph-based sense-induction techniques in (Agirre et al. 2006).

We strongly encourage participants to discuss and propose alternative evaluation strategies, with the condition that they make use of the available lexical-sample data.

Download area
This section will contain evaluation software, useful scripts, complementary materials, baseline systems, etc., but not the datasets proper. The datasets are available for download at the main site.

Deadlines
The timing for this task can be summarized in the following steps:

  1. registration opens on the 26th of Feb.
  2. the deadline for submission is the 1st of Apr.
  3. participants can choose when to download the data and submit their results within this timeframe (26th of Feb. to 1st of Apr.), but will only have 2 weeks to submit, counting from the download date

Discussion
Some skepticism has been raised about using gold standard senses as a means of evaluating induced senses. The quality of the gold standard is certainly an issue, as is the fact that systems might induce senses from the text which are not fully reflected in the gold standard sense inventory.

Another issue is the granularity of the clusters, and the fact that participants might know in advance the target number of senses per word. Participants using this information should clearly indicate so when registering.

We acknowledge that some objections can be raised, especially to the evaluation using an automatic mapping to the gold standard senses, but we think there is no better alternative for performing large-scale comparative evaluations against other types of WSD systems.

Acknowledgements
We thank Ted Pedersen and Phil Edmonds for comments on this task proposal.

References
Pedersen, T. 2006. Unsupervised Corpus-Based Methods for WSD. In Agirre, E. and Edmonds, P. (Eds.) "Word Sense Disambiguation: Algorithms and Applications". Springer.

Agirre, E., Lopez de Lacalle Lekuona, O., Martinez, D., Soroa, A. 2006. Two graph-based algorithms for state-of-the-art WSD. Proceedings of EMNLP 2006.