Task #1: Evaluating WSD on Cross-Language Information Retrieval

A SemEval-2007 task run in collaboration with the Cross-Language Evaluation Forum - CLEF

Organizers
Eneko Agirre (e.agirre@ehu.es)
Bernardo Magnini (magnini@itc.it)
German Rigau (g.rigau@si.ehu.es)
Piek Vossen (Piek.Vossen@irion.nl)

Introduction
Since the Senseval evaluation campaigns started, the evaluation of Word Sense Disambiguation (WSD) as a separate task has become a mature field, with both lexical-sample and all-words tasks. In the former, participants tag the occurrences of a few words for which hand-tagged data has already been provided. In the all-words task, all occurrences of open-class words in two or three documents (a few thousand words) need to be disambiguated.

The community has long noted the need to evaluate WSD within an application, both to check which WSD strategy is best and, more importantly, to try to show that WSD can make a difference in applications. The use of WSD in Machine Translation has been the subject of some recent papers, but less attention has been paid to Information Retrieval (IR).

With this proposal we make a first attempt at defining a task where WSD is evaluated with respect to an Information Retrieval (IR) and Cross-Lingual Information Retrieval (CLIR) exercise. From the WSD perspective, this task will evaluate all-words WSD systems indirectly on a real task. From the CLIR perspective, this task will evaluate which WSD systems and strategies work best.

We are aware that the number of possible configurations for such an exercise is very large (including the choice of sense inventory, using word sense induction instead of disambiguation, query expansion, WSD strategies, IR strategies, etc.), so we have decided to focus this first edition as follows:

1) the IR/CLIR system is fixed
2) the expansion / translation strategy is fixed
3) the participants need to choose the best WSD strategy
4) the IR system is used as the upper bound for the CLIR systems

We think it is important to start carrying out this kind of application-driven evaluation, which might shed light on the intricacies of the interaction between WSD and IR strategies. We see this as the first of a series of exercises, and one outcome of this task should be that the WSD and CLIR communities discuss future evaluation possibilities together.

In fact, CLEF-2008 will have a special track with the same data, where CLIR systems will have the opportunity to use the annotated data produced as a result of the SemEval-2007 task.

The results of the task will be presented both at the SemEval workshop (ACL 2007) and at the CLEF conference.

Description of the task
This is an application-driven task, where the application is a fixed cross-lingual information retrieval system. Participants disambiguate text by assigning WordNet synsets; the system then expands the text to other languages, indexes the expanded documents and runs the retrieval for all the languages in batch. The retrieval results are taken as a measure of the fitness of the disambiguation. The modules and rules for the expansion and the retrieval will be exactly the same for all participants.
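
As an illustration of what the expansion step amounts to, the short Python sketch below looks up a sense-tagged token in a multilingual index and returns the union of its synonyms and translations. The synset identifier and the index structure are purely illustrative assumptions; the actual expansion is performed by the organizers' system using the MCR (see "Expansion and retrieval" below).

    # Illustrative sketch of the expansion step, not the actual Irion/MCR code.
    # A hypothetical index maps a synset identifier to its lexicalizations per
    # language; in the real task this information comes from the MCR.
    MULTILINGUAL_INDEX = {
        "car-n-1": {"en": ["car", "auto", "automobile"],
                    "es": ["coche", "automovil"]},
    }

    def expand_token(lemma, synset_id, languages=("en", "es")):
        """Return the original lemma plus all synonyms/translations of its synset."""
        expansions = {lemma}
        entry = MULTILINGUAL_INDEX.get(synset_id, {})
        for lang in languages:
            expansions.update(entry.get(lang, []))
        return sorted(expansions)

    print(expand_token("car", "car-n-1"))
    # ['auto', 'automobile', 'automovil', 'car', 'coche']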

We propose two specific tasks:

1. participants disambiguate the corpus; the corpus is expanded to synonyms and translations, and we measure the effects on cross-lingual retrieval. Queries are not processed.

2. participants disambiguate the queries per language; we expand the queries to synonyms and translations, and we measure the effects on cross-lingual retrieval. Documents are not processed.

The corpora and queries will be obtained from the CLEF ad-hoc tasks. The scores can be compared not only among the SemEval participants but also with those of past CLEF participants.

Supported languages for queries: English, Spanish

Supported languages in documents: English

The English CLEF data from the years 2000-2003 covers 250 topics and 169,477 documents (579 MB). The relevance judgments will be taken from CLEF. These have the disadvantage of having been produced by pooling the results of CLEF participants, which might bias the results towards systems not using WSD, especially for monolingual English retrieval. A post-hoc analysis of the participants' results will examine this effect.

Expansion and retrieval
The retrieval engine will be an adaptation of the TwentyOne search system, which was developed during the 1990s by the TNO research institute in Delft (The Netherlands) and is now further developed by Irion Technologies as a cross-lingual retrieval system (Hiemstra & Kraaij, 1999; Vossen et al., 2006). It can be stripped down to a basic vector space retrieval system.
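
For reference, here is a minimal sketch of the kind of basic vector space model the stripped-down engine corresponds to: TF-IDF weighting plus cosine-similarity ranking. It is only meant to make the retrieval model concrete and does not reproduce the actual TwentyOne implementation.

    # Minimal vector-space retrieval sketch (TF-IDF + cosine similarity),
    # purely illustrative; not the TwentyOne / Irion implementation.
    import math
    from collections import Counter

    def tf_idf_vectors(docs):
        """Build a TF-IDF vector (term -> weight) for each tokenized document."""
        df = Counter(term for doc in docs for term in set(doc))
        n = len(docs)
        return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
                for doc in docs]

    def cosine(u, v):
        """Cosine similarity between two sparse vectors represented as dicts."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = (math.sqrt(sum(w * w for w in u.values()))
                * math.sqrt(sum(w * w for w in v.values())))
        return dot / norm if norm else 0.0

    def rank(query_terms, docs):
        """Rank documents by cosine similarity to a raw-term-count query vector."""
        query = dict(Counter(query_terms))
        scored = [(cosine(query, vec), i) for i, vec in enumerate(tf_idf_vectors(docs))]
        return sorted(scored, reverse=True)

    # A toy expanded collection: the first document contains the Spanish
    # translation "coche" added by the expansion step.
    docs = [["car", "coche", "road"], ["bank", "river"], ["bank", "money"]]
    print(rank(["car", "road"], docs))   # the expanded first document ranks highest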

For expansion and translation we will use the publicly available Multilingual Central Repository (MCR, Atserias et al. 2004) from the MEANING project. The MCR follows the EuroWordNet design, and currently includes English, Spanish, Italian, Basque and Catalan wordnets tightly connected through the Interlingual Index (based on WordNet 1.6, but linked to all other WordNet versions).

The following work will be carried out by the organizers:

  1. Conversion of the CLEF corpus and queries to the TwentyOne format
  2. Normalisation of all queries to entries in the semantic network(s)
  3. Expansion of the *.wsd files, using the MCR, into *.exp files that can be indexed by the Irion TwentyOne system
  4. Adaptation of TwentyOne to run as a basic document retrieval system
  5. Adaptation of the batch query and scoring system to produce the required output
  6. Conversion of the TwentyOne results to scores that can be used within the CLEF system

Instructions for participation 
Please follow these steps in order to participate:

  1. download the trial data
  2. fax and mail a signed copy of the end user agreement
  3. register on the SemEval website
  4. the organizers will send the username/password needed to download the test data from the CLEF website (from Feb. 26 onwards)
  5. upload the output files to the SemEval website (prior to Mar. 19 and not later than 14 days after downloading the test data)
In addition to the general participation guidelines (see the SemEval website), please also note the following:

General design
The participants will be provided with (see Full Description for more details):

   1. the document collections (.nam+id files)
   2. the topics (.nam+id files)

The participants need to return a single compressed file with the input files enriched with WordNet 1.6 sense tags:

   1. for all the documents in the collection (.wsd files)
   2. for all the topics (.wsd files)

All files are in XML. Input documents and topics (.nam+id files) will follow the "docs.nam+id.dtd" and "topics.nam+id.dtd" DTDs, respectively. Output documents and topics (.wsd files) will follow the "docs.wsd.dtd" and "topics.wsd.dtd" DTDs, respectively.

The result files will be organized into a rich directory structure which follows the directory structure of the input.

See the trial data release for further details and examples of files.

Note that all senses returned by participants will be expanded, regardless of their weight. The current CLIR system does not use the weight information.

Additional data
We will also provide some of the widely used WSD features in a word-to-word fashion (Agirre et al. 2006) in order to make participation easier. These features will be available for both topics and documents (test data), as well as for all words with frequency above 10 in SemCor 1.6 (which can be taken as the training data for supervised WSD systems). They will be posted on the task webpage soon.
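
To give an idea of what such features look like, the sketch below extracts simple bag-of-words and positional context features for one occurrence of a target word. The window size and feature names are our own illustrative choices, not the exact feature set of Agirre et al. (2006) that will be distributed.

    def context_features(tokens, target_index, window=3):
        """Simple WSD features for one occurrence of a target word:
        bag-of-words and position-specific words from a +/- `window` context."""
        feats = {}
        start = max(0, target_index - window)
        end = min(len(tokens), target_index + window + 1)
        for i in range(start, end):
            if i == target_index:
                continue
            word = tokens[i].lower()
            feats["bow_" + word] = 1                            # unordered context word
            feats["pos%+d_%s" % (i - target_index, word)] = 1   # position-specific word
        return feats

    tokens = "He sat on the grassy bank of the river".split()
    print(context_features(tokens, tokens.index("bank")))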


Evaluation
The organizers will run an internal IR and CLIR system based on Vossen et al. (2006) on each of the participants' results as follows:

   1. expand the returned sense tags to all synonyms in the target
      languages (Spanish and English) for both the documents and
      queries
   2. index both original and expanded documents
   3. perform the queries in two fashions:
      a. original queries on expanded document collection
      b. expanded queries on original document collection
   4. compare the returned documents with relevance judgements

The participant systems will be scored according to
standard IR/CLIR measures as implemented in the TREC evaluation package.
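
For reference, the sketch below shows how one of those standard measures, Mean Average Precision (MAP), is computed from a ranked run and binary relevance judgments. The data structures are simplified assumptions; the actual scoring will be done with the TREC evaluation package itself.

    def average_precision(ranked_docs, relevant):
        """AP for one topic: mean precision at the rank of each retrieved relevant document."""
        hits, precisions = 0, []
        for rank, doc_id in enumerate(ranked_docs, start=1):
            if doc_id in relevant:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / len(relevant) if relevant else 0.0

    def mean_average_precision(runs, qrels):
        """MAP over all topics; `runs` maps topic -> ranked doc ids, `qrels` topic -> relevant set."""
        aps = [average_precision(runs[t], qrels.get(t, set())) for t in runs]
        return sum(aps) / len(aps) if aps else 0.0

    runs = {"topic-1": ["doc3", "doc1", "doc7"]}
    qrels = {"topic-1": {"doc1", "doc7"}}
    print(mean_average_precision(runs, qrels))   # 0.583...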

Schedule
Jan. 3             Trial data available
Jan. 15 (approx.)  End user agreement (EUA) available on the website
Feb. 26            Test data available for download (signed EUA required)
Mar. 26            Deadline for uploading results

June               SemEval Workshop
September          CLEF conference

References
Agirre, E., O. Lopez de Lacalle Lekuona, D. Martinez. "Exploring feature set combinations for WSD". In Proceedings of the Annual Meeting of the SEPLN, Spain, 2006.

Jordi Atserias, Luis Villarejo, German Rigau, Eneko Agirre, John Carroll, Bernardo Magnini, Piek Vossen. "The MEANING Multilingual Central Repository". In Proceedings of the Second International WordNet Conference (GWC 2004), pp. 23-30, Brno, Czech Republic, January 2004. ISBN 80-210-3302-9.

Djoerd Hiemstra and Wessel Kraaij. "Twenty-One at TREC-7: Ad Hoc and Cross Language Track". In Ellen M. Voorhees and Donna K. Harman, editors, The Seventh Text Retrieval Conference (TREC-7), NIST Special Publication 500-242, National Institute of Standards and Technology, 1999.

Piek Vossen, German Rigau, Iñaki Alegria, Eneko Agirre, David Farwell, Manuel Fuentes. "Meaningful results for Information Retrieval in the MEANING project". In Proceedings of the 3rd Global WordNet Conference, Jeju Island, South Korea, January 22-26, 2006.