Task #2: Evaluating Word Sense Induction and Discrimination Systems

The goal of this task is to allow for comparison across sense-induction and discrimination systems, and also to compare these systems to other supervised and knowledge-based systems. With this goal in mind, the following evaluation types are proposed:

1) Evaluating the induced senses as clusters of examples. The induced clusters will be compared to the sets of examples tagged with the given gold standard word senses (classes), and evaluated using the standard purity, entropy and F-score measures for clusters.

2) Mapping induced senses to gold standard senses, and using the mapping to tag the test corpus with gold standard tags. The mapping will be automatically produced by the task organizers, and the resulting system will be evaluated according to the usual precision and recall measures for supervised word sense disambiguation systems.
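
As an illustration of the cluster-based measures in item 1, the following is a minimal sketch in Python. It assumes the commonly used definitions of purity, entropy and clustering F-score; the exact formulas used by the organizers are not given here, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def purity_and_entropy(induced, gold):
    """Purity and entropy for parallel lists of cluster ids and gold sense tags.

    Assumption: size-weighted averages over the induced clusters.
    """
    n = len(induced)
    # Count how often each gold sense occurs inside each induced cluster.
    by_cluster = defaultdict(Counter)
    for c, g in zip(induced, gold):
        by_cluster[c][g] += 1
    purity = 0.0
    entropy = 0.0
    for counts in by_cluster.values():
        size = sum(counts.values())
        # Purity: share of examples that belong to the cluster's majority sense.
        purity += max(counts.values()) / n
        # Entropy of the gold-sense distribution inside this cluster (in bits).
        h = -sum((k / size) * math.log2(k / size) for k in counts.values())
        entropy += (size / n) * h
    return purity, entropy


def cluster_fscore(induced, gold):
    """Clustering F-score: for each gold class take the best-matching cluster,
    then average the F-scores weighted by class size (an assumed definition)."""
    n = len(gold)
    pair = Counter(zip(gold, induced))
    class_size = Counter(gold)
    cluster_size = Counter(induced)
    total = 0.0
    for g, g_n in class_size.items():
        best = 0.0
        for c, c_n in cluster_size.items():
            both = pair[(g, c)]
            if both:
                p = both / c_n  # precision of cluster c for class g
                r = both / g_n  # recall of cluster c for class g
                best = max(best, 2 * p * r / (p + r))
        total += (g_n / n) * best
    return total


# Toy data (hypothetical): six examples of one target word.
induced = ["c1", "c1", "c1", "c2", "c2", "c2"]
gold = ["a", "a", "b", "b", "b", "b"]
purity, entropy = purity_and_entropy(induced, gold)
fscore = cluster_fscore(induced, gold)
```

Higher purity and F-score, and lower entropy, indicate induced clusters that agree more closely with the gold standard classes.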

This two-part evaluation methodology has already been applied in Agirre et al. (2006).
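
The mapping-based evaluation in item 2 can be sketched in the same spirit. The procedure the organizers will use to build the mapping is not described here, so this example assumes a simple majority-vote mapping learned on the training split: each induced sense is mapped to the gold sense it co-occurs with most often. All names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def learn_mapping(train_induced, train_gold):
    """Map each induced sense to its most frequent gold sense (assumed strategy)."""
    votes = defaultdict(Counter)
    for c, g in zip(train_induced, train_gold):
        votes[c][g] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in votes.items()}


def precision_recall(test_induced, test_gold, mapping):
    """Precision over attempted examples, recall over all test examples."""
    attempted = 0
    correct = 0
    for c, g in zip(test_induced, test_gold):
        if c not in mapping:
            continue  # induced sense never seen in training: left untagged
        attempted += 1
        if mapping[c] == g:
            correct += 1
    precision = correct / attempted if attempted else 0.0
    recall = correct / len(test_gold) if test_gold else 0.0
    return precision, recall


# Toy data (hypothetical sense labels for a single target word).
train_induced = ["c1", "c1", "c2", "c2", "c2"]
train_gold = ["s1", "s1", "s2", "s2", "s1"]
mapping = learn_mapping(train_induced, train_gold)

test_induced = ["c1", "c2", "c3", "c2"]
test_gold = ["s1", "s2", "s1", "s1"]
precision, recall = precision_recall(test_induced, test_gold, mapping)
```

In this sketch, an example whose induced sense has no mapping counts against recall but not against precision, which mirrors how unanswered instances are usually scored in supervised word sense disambiguation.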

In particular, we propose to use the data from the English lexical-sample task in SemEval-2007, with the usual training + test split. The sense inventory will be a coarse-grained version of WordNet. Please refer to that task for details.

Eneko Agirre and Aitor Soroa
University of the Basque Country

Agirre E., Lopez de Lacalle Lekuona O., Martinez D., Soroa A. 2006. Two graph-based algorithms for state-of-the-art WSD. Proceedings of EMNLP 2006.