Task #10: English Lexical Substitution Task

Diana McCarthy and Roberto Navigli

(For recent information, including questions from participants and a list of issues raised with the trial dataset, please visit the task web site.)

WSD has been described as a task in need of an application. Whilst researchers do believe that it will ultimately prove useful for applications which need some degree of semantic interpretation, the jury is still out on this point. One problem is that WSD systems have been tested on fine-grained inventories, rendering the task harder than it need be for many applications (Ide and Wilks, 2006). A significant problem is that there is no clear choice of inventory for any given task (other than perhaps the use of a parallel corpus for a specific language pair for a machine translation application).

Following some earlier ideas (McCarthy, 2002), we will organise a lexical substitution task for SemEval-2007. The motivation behind this is that such a task would better bridge the gap between the needs of NLP applications and the capabilities of systems built by researchers (rather than large application-driven teams). Finding alternative words that can occur in given contexts would hopefully be useful to many applications, such as question answering, summarisation, paraphrase acquisition (Dagan et al., 2006), text simplification and lexical acquisition (McCarthy, 2002).

We will use a lexical sample, since this will make the tagging task easier and allow us to see the potential of this method for a set of words which have candidate substitutes. We will select a sample of up to 100 words from each part-of-speech (nouns, verbs, adjectives and adverbs), but with a bias towards parts-of-speech (PoS) where the words have different meanings with different substitutes. The task, for both human annotators and systems, will be to replace a target word in a sentence with as close a word as possible. If necessary a phrase can be used, but we will specifically instruct annotators to use single words in preference to phrases. We will also instruct the annotators that they can additionally supply a slightly more general term if they cannot think of a good substitute. Annotators will be asked to identify cases where the target word seems to be an integral part of a phrase in the test sentence. This will hopefully provide useful data for multiword exploration. We will use these annotations, where the target is part of a multiword, for a sub-task on multiword detection (spotting that there is a multiword) and identification (specifying what the multiword is).
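To make the annotation task concrete, the following is a minimal sketch of how a test item and one annotator's response might be represented. It is illustrative only: the class names, field names and item identifiers are hypothetical and do not describe the format in which the task data will be released.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Item:
    """One test sentence containing a target word (hypothetical layout)."""
    item_id: str
    target: str
    pos: str        # one of: noun, verb, adjective, adverb
    sentence: str


@dataclass
class Response:
    """One annotator's answer for one item (hypothetical layout)."""
    item_id: str
    # Preferably single words; a phrase or a slightly more general
    # term is allowed when no close single-word substitute exists.
    substitutes: List[str] = field(default_factory=list)
    # Filled in only when the annotator judges the target to be an
    # integral part of a phrase in the sentence.
    multiword: Optional[str] = None


def multiword_items(responses: List[Response]) -> List[str]:
    """Return the ids of items that at least one annotator marked as multiword."""
    return sorted({r.item_id for r in responses if r.multiword})
```

Items collected by a helper like `multiword_items` would be the ones feeding the multiword detection and identification sub-task.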

The test words that we use will be selected as having a range of different meanings with different candidate substitutes, by examining lexical resources such as WordNet, the Oxford Dictionary of English and the Sketch Engine (http://www.sketchengine.co.uk/), and by examining corpora. We will manually select a sample of test words, and randomly select the remaining words for each PoS. The proportion of manually selected words will depend on the number of good candidates available for the PoS, but will not exceed 50. Each test word will have 10 utterances. For a proportion of words (20 for each PoS) we will manually select the utterances; for the other words we will sample randomly. The motivation for manual selection is that we are more likely to find examples requiring a variety of different synonyms for a given test word. Systems which rely on the skew of word meanings will not get such an advantage on the subset of words with manually selected sentences. Results on the data obtained from the manual and random selection procedures for sentences will be reported separately.
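The selection procedure described above can be sketched roughly as follows. This is a sketch under stated assumptions, not the organisers' actual tooling: the function names and parameters are hypothetical, and the limit of 50 is read here as a count of manually selected words per PoS.

```python
import random


def select_test_words(manual_words, candidate_pool, total=100, max_manual=50, seed=0):
    """Combine hand-picked words with a random sample of the remaining candidates.

    Assumption: `max_manual` caps the number of manually chosen words per PoS.
    Returns the two groups separately so results can be reported per group.
    """
    rng = random.Random(seed)
    manual = list(manual_words)[:max_manual]
    # Sort so that the random sample is reproducible for a given seed.
    remaining = sorted(set(candidate_pool) - set(manual))
    n_random = min(max(total - len(manual), 0), len(remaining))
    return manual, rng.sample(remaining, n_random)


def sample_sentences(corpus_sentences, per_word=10, seed=0):
    """Randomly pick up to `per_word` sentences for one test word."""
    rng = random.Random(seed)
    k = min(per_word, len(corpus_sentences))
    return rng.sample(list(corpus_sentences), k)
```

Keeping the manually and randomly selected groups apart from the start makes it straightforward to report results on the two subsets separately.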

We will not require human annotators to provide synonyms from a given inventory, precisely because we wish to avoid bias from systems using a predefined inventory, such as WordNet, as opposed to a system such as CBC (Pantel and Lin, 2002) which automatically induces senses. We would also like the data collected to be of use to those evaluating lexical resources for NLP. For this reason we will not provide training data, since this would mean we would need to specify potential substitutes in advance. The anticipated lack of human agreement on this task is the largest obstacle. It is quite likely that humans will come up with different near-synonyms for a given target. Systems will undoubtedly do the same. To tackle this we propose several different evaluation measures. The details of these are included in the document task10documentation.pdf that will be released with the dry run data. This document also contains details on the baselines and agreement measures.

The Annotators

We will use adult native English speakers living in the UK as annotators. We do not believe the task requires lexicographers as annotators; indeed, lexicographers might be biased by a given dictionary that they are working on.

Gold-standard Validation

We will ask 2 annotators (not used in the initial stage) and 1 adjudicator to check a mixed list of types returned by both systems and annotators for each item. They will be instructed to find the words in the list which are good substitutes for each item. We will do this on a sample of only 100 items (sentences) to validate the adequacy of the gold standard, since there will not be time after the systems upload their results to conduct a full post-hoc evaluation.
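As an illustration of the validation step, the sketch below merges the substitute types proposed by annotators and by systems into a single shuffled list for one item, and draws the sample of items to validate. The function names are hypothetical, and the lower-casing and de-duplication of types are assumptions made for the example rather than part of the task specification.

```python
import random


def build_validation_list(annotator_subs, system_subs, seed=0):
    """Merge substitute types from both sources into one shuffled list.

    The list carries no source labels, so a validator cannot tell whether
    a candidate came from a human annotator or from a system.
    """
    rng = random.Random(seed)
    normalise = lambda s: s.strip().lower()  # assumption: case-insensitive types
    types = sorted({normalise(s) for s in annotator_subs} |
                   {normalise(s) for s in system_subs})
    rng.shuffle(types)
    return types


def sample_items(item_ids, n=100, seed=0):
    """Draw the subset of items (sentences) to be validated."""
    rng = random.Random(seed)
    ids = sorted(item_ids)
    return rng.sample(ids, min(n, len(ids)))
```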

The Corpus

We will use the corpus produced by Sharoff (2006) from the internet (http://corpus.leeds.ac.uk/internet.html). This is a balanced corpus similar in flavour to the BNC, though with less bias to British English, obtained by sampling data from the web. Because it is collected from the web there will be some noise. We will do our best to remove this automatically and with some manual screening but some will inevitably remain.


Dagan, I., Glickman, O., Gliozzo, A., Marmorshtein, E. and Strapparava, C. (2006). Direct Word Sense Matching for Lexical Substitution. In Proceedings of ACL/COLING 2006.

Ide, N. and Wilks, Y. (2006). Making Sense About Sense. In Agirre, E., Edmonds, P. (Eds.), Word Sense Disambiguation: Algorithms and Applications, Springer.

McCarthy, D. (2002). Lexical Substitution as a Task for WSD Evaluation. In Proceedings of the ACL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, USA.

Pantel, P. and Lin, D. (2002). Discovering Word Senses from Text. In Proceedings of ACM Conference on Knowledge Discovery and Data Mining (KDD-02), pp. 613-619. Edmonton, Canada.

Sharoff, S. (2006). Open-source corpora: Using the net to fish for linguistic data. International Journal of Corpus Linguistics 11 (4): 435-462.