Task #10: English Lexical Substitution Task
Diana McCarthy and Roberto Navigli
(For recent information, including questions from participants and a list of issues raised with the trial dataset, please visit the task web site.)
WSD has been described as a task in need of an application. Whilst
researchers do believe that it will ultimately prove useful for
applications which need some degree of semantic interpretation, the
jury is still out on this point. One problem is that WSD systems have
been tested on fine-grained inventories, rendering the task harder
than it need be for many applications (Ide and Wilks, 2006). A
significant problem is that there is no clear choice of inventory for
any given task (other than perhaps the use of a parallel corpus for a
specific language pair for a machine translation application).
Following some earlier ideas (McCarthy, 2002), we will organise a
lexical substitution task for SemEval-2007. The motivation behind this is
that such a task would better bridge the gap between the needs of NLP
applications and capabilities of systems built by researchers (rather
than large application driven teams). Finding alternative words that
can occur in given contexts would hopefully be useful to many
applications such as question answering, summarisation, paraphrase
acquisition (Dagan et al., 2006), text simplification, lexical
acquisition (McCarthy, 2002), etc.
We will use a lexical sample, since this will make the tagging task
easier, and allow us to see the potential of this method for a set of
words which have candidate substitutes. We will select a sample of up
to 100 words from each part-of-speech (nouns, verbs, adjectives and
adverbs); but with a bias to part-of-speech (PoS) where the words have
different meanings with different substitutes. The task, for both
human annotators and systems, will be to replace a target word in a
sentence with as close a word as possible. If necessary a phrase can
be used, but we will specifically instruct for single words to be used
in preference to phrases. We will also instruct the annotators that
they can additionally supply a slightly more general term if they
can't think of a good substitute. Annotators will be asked to identify
cases where the target word seems to be an integral part of a phrase
in the test sentence. This will hopefully provide useful data for
multiword exploration. We will use these annotations, where the
target is part of a multiword, for a sub-task for multiword detection
(spotting that there is a multiword) and identification (specifying
what the multiword is).
The test words that we use will be selected as having a range of
different meanings with different candidate substitutes - by
examination of lexicons such as WordNet, the Oxford Dictionary of
English, the Sketch Engine (http://www.sketchengine.co.uk/)
and by examining corpora. We will manually select a sample of test
words, and randomly select the remaining words for each PoS. The
proportion of manually selected words will depend on the number of
good candidates available for the PoS, but will not exceed 50. Each
test word will have 10 utterances. For a proportion of words (20 for
each PoS) we will manually select the utterances; for the other words
we will sample randomly. The motivation for manual selection is that
we are more likely to find examples requiring a variety of different
synonyms for a given test word. Systems which rely on the skew of word
meanings will not get such an advantage on the subset of words with
manually selected sentences. Results on the data obtained from the
manual and random selection procedures for sentences will be reported
separately.
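The selection procedure described above can be sketched roughly as follows. This is only an illustration, not the organisers' actual tooling: the function names, the fixed random seed, and the reading of "will not exceed 50" as a cap of 50 manually selected words per PoS are all assumptions made for the example.

```python
import random


def select_test_words(manual_words, candidate_pool, total=100, max_manual=50, seed=0):
    """Return (manual, random) word lists for one PoS.

    Assumption: at most `max_manual` words are hand-picked and the rest are
    drawn at random from the remaining candidates, up to `total` words.
    """
    rng = random.Random(seed)
    manual = list(manual_words)[:max_manual]
    chosen = set(manual)
    remaining = [w for w in candidate_pool if w not in chosen]
    n_random = max(0, min(total - len(manual), len(remaining)))
    return manual, rng.sample(remaining, n_random)


def sample_sentences(sentences_by_word, words, per_word=10, seed=0):
    """Randomly pick up to `per_word` sentences for each word in `words`."""
    rng = random.Random(seed)
    sampled = {}
    for word in words:
        pool = sentences_by_word[word]
        sampled[word] = rng.sample(pool, min(per_word, len(pool)))
    return sampled
```

Keeping the manually and randomly selected sets as separate return values mirrors the plan to report results on the two kinds of data separately.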
We will not require human annotators to provide synonyms from a
given inventory precisely because we wish to avoid bias towards systems
using a predefined inventory, such as WordNet, as opposed to a system
such as CBC (Pantel and Lin, 2002) which automatically induces
senses. We would also like the data collected to be of use to those
evaluating lexical resources for NLP. For this reason we will not
provide training data since this would mean we would need to specify
potential substitutes in advance. The anticipated lack of human
agreement on this task is the largest obstacle. It is quite likely
that humans will come up with different near synonyms for a given
target. Systems will undoubtedly do the same. To tackle this we
propose several different evaluation measures. The details of these
are included in the document task10documentation.pdf
that will be released with the dry run data. This document also
contains details on the baselines and agreement measures.
The Annotators
We will use adult native English speakers as annotators (living in the
UK). We do not believe the task requires lexicographers as annotators;
indeed, lexicographers might be biased by a given dictionary that they
are working on.
Gold-standard Validation
We will use two annotators (not used in the initial stage) plus one
adjudicator to check a mixed list of types returned by both systems
and annotators for each item. They will be instructed to find the
words which are good substitutes from the list for each item. We will
do this just on a sample of 100 items (sentences) to validate the
adequacy of the gold-standard. There would not be time after the
systems upload their results to conduct a full post-hoc evaluation.
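One way the mixed list for the validation stage might be assembled is sketched below. The function names, the lower-casing of substitutes, and the shuffling (so that checkers cannot tell whether a substitute came from a system or an annotator) are assumptions for illustration and are not specified in the task description.

```python
import random


def build_validation_list(annotator_subs, system_subs, seed=0):
    """Pool the distinct substitute types from both sources for one item."""
    pooled = {s.strip().lower() for s in annotator_subs}
    pooled |= {s.strip().lower() for s in system_subs}
    mixed = sorted(pooled)               # fixed order before shuffling
    random.Random(seed).shuffle(mixed)   # hide the source of each type
    return mixed


def sample_validation_items(item_ids, n=100, seed=0):
    """Draw the sample of items (sentences) to be checked."""
    ids = sorted(item_ids)
    return random.Random(seed).sample(ids, min(n, len(ids)))
```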
The Corpus
We will use the corpus produced by Sharoff (2006) from the internet
(http://corpus.leeds.ac.uk/internet.html). This
is a balanced corpus similar in flavour to the BNC, though with less
bias to British English, obtained by sampling data from the web.
Because it is collected from the web, there will be some noise. We
will do our best to remove this automatically and with some manual
screening, but some will inevitably remain.
References
Dagan, I., Glickman, O., Gliozzo, A., Marmorshtein, E. and
Strapparava, C. (2006). Direct Word Sense Matching for Lexical
Substitution. In Proceedings of ACL/COLING 2006.
Ide, N. and Wilks, Y. (2006). Making Sense About Sense. In Agirre, E.,
Edmonds, P. (Eds.), Word Sense Disambiguation: Algorithms and
Applications, Springer.
McCarthy, D. (2002). Lexical Substitution as a Task for WSD
Evaluation. In Proceedings of the ACL Workshop on Word Sense
Disambiguation: Recent Successes and Future Directions,
Philadelphia, USA.
Pantel, P. and Lin, D. (2002). Discovering Word Senses from
Text. In Proceedings of ACM Conference on Knowledge Discovery and Data
Mining (KDD-02). pp. 613-619. Edmonton, Canada.
Sharoff, S. (2006). Open-source corpora: Using the net to fish for
linguistic data. International Journal of Corpus Linguistics 11 (4):
435-462.