Task #2: Evaluating Word Sense Induction and Discrimination Systems
Organizers
Eneko Agirre, University of the Basque Country (e.agirre at ehu dot es)
Aitor Soroa, University of the Basque Country (a.soroa at ehu dot es)
Motivation
The motivation for this task is that most existing word sense
induction and discrimination systems (also known as corpus-based
unsupervised systems) have been evaluated with ad-hoc methods, usually
by the authors of the systems themselves. Note that this task is
intended for such sense-induction systems, not for knowledge-based
systems (which are also often called "unsupervised"). See Pedersen
(2006) for an overview.
The goal of this task is to allow for comparison across
sense-induction and discrimination systems, and also to compare these
systems to other supervised and knowledge-based systems. With this
goal in mind, the following evaluation types are proposed:
1) evaluating the induced senses as clusters of examples. The induced
clusters will be compared to the sets of examples tagged with the
given gold standard word senses (classes), and evaluated using the
F-score measure for clusters.
2) mapping induced senses to gold standard senses, and using the
mapping to tag the test corpus with gold standard tags. The mapping
will be automatically produced by the task organizers, and the
resulting system evaluated according to the usual precision and recall
measures for supervised word sense disambiguation systems.
This double evaluation methodology has already been tried in
(Agirre et al. 2006).
Datasets and formats
The dataset will consist of the texts from the English
lexical-sample task in SemEval-2007 (task 17).
The input and output of participant systems will follow the usual
Senseval-3 format, with one difference: the labels for senses in the
output can be arbitrary symbols. Please note that the output will
consist of instances from different words, and thus the label of each
induced sense must be unique. For instance, assume that a participant
system has induced 2 senses for the noun "brother" (named brother.n.C0
and brother.n.C1) and 3 senses for the verb "shake" (named shake.v.C0,
shake.v.C1 and shake.v.C2). These are example outputs for a sample of
instances of both words:
brother.n brother.n.00001 brother.n.C1
brother.n brother.n.00002 brother.n.C0 brother.n.C1
...
shake.v shake.v.00001 shake.v.C2/0.4 shake.v.C0/0.5 shake.v.C1/0.1
shake.v shake.v.00002 shake.v.C2/914 shake.v.C0/817
In the first line the system assigns sense brother.n.C1 to instance
brother.n.00001 with weight 1 (the default). In the second line the
system assigns equal weight to senses brother.n.C0 and brother.n.C1
(1 by default). In the last two lines the weight is explicitly given
for the senses of shake. Weights do not need to sum to one, but must
be positive. Senses not mentioned in the line get weight 0. Check this
site for more details on formats.
We interpret the results as a hard clustering task, with systems
assigning the sense with maximum weight. In case of ties, we take the
system to be forming a new sense which is a combination of the tied
senses. For the example above:
- instance brother.n.00001 is assigned brother.n.C1 as the induced sense
- instance brother.n.00002 is assigned brother.n.C0_brother.n.C1 as the
induced sense
- instance shake.v.00001 is assigned shake.v.C0 as the induced sense
- instance shake.v.00002 is assigned shake.v.C2 as the induced sense
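The hard-clustering interpretation above can be illustrated with a short Python sketch. It is not the organizers' scorer: the function names parse_answer_line and hard_sense are hypothetical, and joining tied labels in sorted order with "_" is an assumption chosen only because it reproduces the brother.n.C0_brother.n.C1 label in the example.

```python
from typing import Dict, Tuple


def parse_answer_line(line: str) -> Tuple[str, str, Dict[str, float]]:
    """Split one answer line into (word, instance_id, {label: weight})."""
    word, instance_id, *tags = line.split()
    weights: Dict[str, float] = {}
    for tag in tags:
        # A tag is either "label" (default weight 1) or "label/weight".
        label, sep, weight = tag.rpartition("/")
        if sep:
            weights[label] = float(weight)
        else:
            weights[tag] = 1.0
    return word, instance_id, weights


def hard_sense(weights: Dict[str, float]) -> str:
    """Return the top-weighted label; tied labels are merged into one new label."""
    top = max(weights.values())
    # Assumption: tied labels are joined in sorted order with "_".
    tied = sorted(label for label, w in weights.items() if w == top)
    return "_".join(tied)


lines = [
    "brother.n brother.n.00001 brother.n.C1",
    "brother.n brother.n.00002 brother.n.C0 brother.n.C1",
    "shake.v shake.v.00001 shake.v.C2/0.4 shake.v.C0/0.5 shake.v.C1/0.1",
    "shake.v shake.v.00002 shake.v.C2/914 shake.v.C0/817",
]
assigned = {}
for line in lines:
    _, instance_id, weights = parse_answer_line(line)
    assigned[instance_id] = hard_sense(weights)
```

Run on the four example lines, assigned holds the same instance-to-sense assignments as the list above.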
We recommend that participants return all induced senses per instance
with associated weights, as these will be used for the second variety
of evaluation (see below).
Participation
These are the steps to be followed by participants (see also
important dates below):
- register on the SemEval website
- download the data from the SemEval website
- participants have 2 weeks to induce the "senses", tag the whole
data with those "senses" and upload the results to the SemEval website
Evaluation
Organizers will return the evaluation in two varieties:
a. clustering-style evaluation. We interpret the gold standard (GS) as a
clustering solution: all examples tagged with a given sense in the GS
form a class. The examples returned by participants that share the
"sense" tag with maximum weight form the clusters. We compare the
participants' clusters on the test data with the classes in the gold
standard, and compute the F-score as usual (Agirre et al. 2006). In case
of ties (or multiple sense tags in the GS), a new sense will be formed.
b. mapping to the GS sense inventory: organizers use the training/test
split of the data (as defined in task 17) to map the participants'
"senses" into the official sense inventory. Using this mapping, the
organizers convert the participants' results into the official sense
inventory, and compute the usual precision and recall measures. See
(Agirre et al. 2006) for more details.
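As a rough illustration of the clustering-style evaluation (variety a.), the Python sketch below computes a class-weighted F-score for clusters. It assumes the common definition in which each gold class is matched with its best-scoring cluster and the per-class scores are weighted by class size; the official scorer may differ in its details. The function name cluster_fscore and the restriction to instances present in both inputs are assumptions made for this example.

```python
from collections import Counter, defaultdict
from typing import Dict


def cluster_fscore(gold: Dict[str, str], induced: Dict[str, str]) -> float:
    """Class-weighted F-score; both arguments map instance id -> label."""
    # Assumption: only instances present in both inputs are scored.
    common = gold.keys() & induced.keys()
    if not common:
        return 0.0
    class_size = Counter(gold[i] for i in common)
    cluster_size = Counter(induced[i] for i in common)
    overlap = Counter((gold[i], induced[i]) for i in common)

    # For every gold class keep the F-score of its best-matching cluster.
    best = defaultdict(float)
    for (cls, clu), n in overlap.items():
        precision = n / cluster_size[clu]
        recall = n / class_size[cls]
        f = 2 * precision * recall / (precision + recall)
        best[cls] = max(best[cls], f)

    # Weight each class by its share of the scored instances.
    return sum(class_size[c] * best[c] for c in class_size) / len(common)


gold = {"i1": "s1", "i2": "s1", "i3": "s2", "i4": "s2"}
perfect = cluster_fscore(gold, {"i1": "X", "i2": "X", "i3": "Y", "i4": "Y"})
lumped = cluster_fscore(gold, {"i1": "Z", "i2": "Z", "i3": "Z", "i4": "Z"})
```

In this toy input, clusters identical to the gold classes score 1.0, while lumping every instance into a single cluster lowers the score to 2/3.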
The first evaluation variety gives better scores to the induced senses
most similar to the GS senses (e.g. similar number of senses). The
second evaluation variety allows for comparison with other kinds of
systems. It does not necessarily favor systems inducing senses similar
to the GS. We have used this framework to evaluate graph-based
sense-induction techniques in (Agirre et al. 2006).
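The second evaluation variety can be sketched in Python as follows. This is a deliberately simplified stand-in, not the organizers' procedure: it maps each induced sense to the GS sense it co-occurs with most often in the training split (a majority-vote assumption), tags the test split through that mapping, and computes precision over the answered instances and recall over all test instances. The function names learn_mapping and precision_recall are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Tuple


def learn_mapping(train_gold: Dict[str, str],
                  train_induced: Dict[str, str]) -> Dict[str, str]:
    """Map each induced sense to its most frequent GS sense in the training split."""
    counts = defaultdict(Counter)
    for instance, cluster in train_induced.items():
        if instance in train_gold:
            counts[cluster][train_gold[instance]] += 1
    return {cluster: c.most_common(1)[0][0] for cluster, c in counts.items()}


def precision_recall(test_gold: Dict[str, str],
                     test_induced: Dict[str, str],
                     mapping: Dict[str, str]) -> Tuple[float, float]:
    """Precision over answered instances, recall over all test instances."""
    attempted = correct = 0
    for instance, sense in test_gold.items():
        cluster = test_induced.get(instance)
        if cluster is None or cluster not in mapping:
            continue  # the system gives no mappable answer for this instance
        attempted += 1
        correct += mapping[cluster] == sense
    precision = correct / attempted if attempted else 0.0
    recall = correct / len(test_gold) if test_gold else 0.0
    return precision, recall


mapping = learn_mapping({"t1": "s1", "t2": "s1", "t3": "s2"},
                        {"t1": "A", "t2": "A", "t3": "B"})
precision, recall = precision_recall(
    {"e1": "s1", "e2": "s2", "e3": "s2", "e4": "s1"},
    {"e1": "A", "e2": "A", "e3": "B", "e4": "C"},
    mapping,
)
```

In the toy data, induced sense C never occurs in the training split, so instance e4 stays unanswered: it lowers recall but does not count against precision.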
We strongly encourage participants to discuss and propose alternative
evaluation strategies, on the condition that they make use of the
available lexical-sample data.
Download area
This section will contain evaluation software, useful scripts,
complementary materials, baseline systems, etc., but not the datasets
proper. The datasets are available at the main site for download.
Deadlines
The timing for this task can be summarized in the following steps:
- participants register on the 26th of Feb.
- deadline for submission is the 1st of Apr.
- participants can choose when to download and submit in this
timeframe (26th of Feb. to 1st of Apr.), but will only have 2 weeks
to submit the results, starting from the download date.
Discussion
Some skepticism has been raised about using Gold Standard senses as a
means of evaluating induced senses. The quality of the Gold
Standard is certainly an issue, as is the fact that the systems
might be inducing senses from the text which might not be fully
reflected in the Gold Standard inventory of senses.
Another issue is the granularity of the clusters, and the fact that
the participants might know in advance the target number of senses
per word. Participants using this information should clearly
indicate this fact when registering.
We acknowledge that some objections can be raised, especially to the
evaluation using an automatic mapping to the gold standard senses, but
we think that there is no better alternative for performing large-scale
comparative evaluations against other types of WSD systems.
We encourage participants to discuss and propose alternative
evaluation strategies, on the condition that they make use of the
available lexical-sample data.
Acknowledgements
We thank Ted Pedersen and Phil Edmonds for comments on this task
proposal.
References
Pedersen, T. Unsupervised Corpus-Based Methods for WSD. In Agirre,
E. and Edmonds, P. (Eds.) "Word Sense Disambiguation: Algorithms and
Applications". Springer, 2006.
Agirre E., Lopez de Lacalle Lekuona O., Martinez D., Soroa A. 2006.
Two graph-based algorithms for state-of-the-art WSD. Proceedings of
EMNLP 2006.