Guide for Use of Trial Dataset

This document describes the Trial Dataset [1] for Task 4 in SemEval 2007 [2], Classification of Semantic Relations between Nominals [3]. If you have any questions about the use of the Trial Dataset that are not answered below or in the description of Task 4, please send a message to the Semantic Relations Google Group [4] or contact any of the authors of Classification of Semantic Relations between Nominals [3]. The purpose of this guide is to provide basic information about the Trial Dataset for participants in Task 4. We describe the format of the Trial Dataset, outline our plans for the Complete Dataset, give some suggestions for using the Trial Dataset, and point to some resources that may be useful to participants.

Description of Trial Dataset

The planned release date for the Trial Dataset is January 3, 2007 [5]. The Trial Dataset includes 140 sentences that are positive and negative examples of the Content-Container relation, which is defined in Relation 7: Content-Container [6]. These 140 sentences will be merged into the Complete Dataset as training examples.

An example sentence follows:

113 "After the cashier put the <e1>cash</e1> in a <e2>bag</e2>, the robber saw a bottle of scotch that he wanted behind the counter on the shelf."
WordNet(e1) = "cash%1:21:00::", WordNet(e2) = "bag%1:06:00::", Content-Container(e1, e2) = "true", Query = "the * in a bag"

The first line contains the sentence itself, preceded by a numerical identifier. The two nominals, "cash" and "bag", are marked by <e1> and <e2> tags. The second line gives the WordNet sense keys for the two nominals [7] and indicates whether the semantic relation between the nominals is a positive ("true") or negative ("false") example of the Content-Container relation [6]. We use WordNet sense keys because, unlike WordNet synset numbers, sense keys are relatively stable across different versions of WordNet. Our preferred version of WordNet is 3.0, but we believe that most of the sense keys in version 2.1 are the same as in version 3.0 [8] (although the synset numbers changed significantly between the two versions). Finally, the second line gives the Google query that was used to find the sentence. The queries are manually generated heuristic patterns that are intended to find sentences that are positive examples of the given relation (Content-Container, in this case) [9].

Some of the sentences are followed by an optional comment line:

127 "I find it hard to bend and reach and I cannot use the <e1>cupboards</e1> in my <e2>kitchen</e2>."
WordNet(e1) = "cupboard%1:06:00::", WordNet(e2) = "kitchen%1:06:00::", Content-Container(e1, e2) = "false", Query = "the * in my kitchen"
Comment: Located-Location or, better, Part-Whole.

The comment lines have been added by the annotators to explain their labeling decisions. The comments are intended for human readers; they should be ignored by the participating algorithms and will not be used in scoring the algorithms' output.
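For concreteness, here is a small Python sketch of how the two-line records (and the optional comment lines) might be read. It is only an illustration of the format described above; the regular expressions and field names are our own assumptions, not part of the official data release, and may need to be adjusted to the released files:

import re

# Assumed reader for the record format described above: a numbered sentence
# line, an annotation line, and an optional "Comment:" line (skipped here,
# since comments are intended only for human readers).
SENT_RE = re.compile(r'^(\d+)\s+"(.*)"\s*$')
ANNOT_RE = re.compile(
    r'WordNet\(e1\)\s*=\s*"([^"]*)",\s*'
    r'WordNet\(e2\)\s*=\s*"([^"]*)",\s*'
    r'Content-Container\(e1,\s*e2\)\s*=\s*"(true|false)",\s*'
    r'Query\s*=\s*"([^"]*)"')
TAG_RE = re.compile(r'<e([12])>(.*?)</e\1>')

def read_records(lines):
    records, current = [], None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("Comment:"):
            continue
        m = SENT_RE.match(line)
        if m:
            current = {"id": m.group(1), "sentence": m.group(2)}
            for num, text in TAG_RE.findall(m.group(2)):
                current["e" + num] = text   # the tagged nominals, e.g. "cash", "bag"
            records.append(current)
        elif current is not None:
            a = ANNOT_RE.search(line)
            if a:
                current["wn_e1"], current["wn_e2"] = a.group(1), a.group(2)
                current["label"] = (a.group(3) == "true")
                current["query"] = a.group(4)
    return records

With such a reader, read_records(open("trial_dataset.txt")) would yield one dictionary per sentence, containing the identifier, the tagged sentence, the two nominals, the WordNet sense keys, the true/false label, and the query (the file name here is, of course, hypothetical).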

To help motivate Task 4, consider the following potential application. Imagine that we wish to create a new type of search engine for semantic relations. For example, suppose I have just bought a new home, and I am wondering what things I will need to purchase for my new kitchen. I could search for all X such that Content-Container(X, kitchen) = "true". We assume that the search engine will have a predefined set of manually generated heuristic patterns for a few basic semantic relations, such as Content-Container(X, Y). One of the patterns might be "the X in a Y", so that a search for all X such that Content-Container(X, kitchen) = "true" will result in the query "the X in a kitchen". Some of the sentences that are found with this query will be positive examples of Content-Container(X, kitchen) and some will be near-miss negative examples. The challenge of Task 4 is to learn to automatically distinguish the positive and negative examples. A successful algorithm for this task could be used to filter the query results in a search engine for semantic relations. Other possible applications of a successful algorithm include question answering and paraphrasing.

Plans for Complete Dataset

The planned release date for the Complete Dataset is February 26, 2007 [5]. The evaluation period begins on this date and ends on April 1, 2007. The Complete Dataset will include the following seven semantic relations:

  1. Cause-Effect (e.g., virus-flu)
  2. Instrument-User (e.g., laser-printer)
  3. Product-Producer (e.g., honey-bee)
  4. Origin-Entity (e.g., rye-whiskey)
  5. Purpose-Tool (e.g., soup-pot)
  6. Part-Whole (e.g., wheel-car)
  7. Content-Container (e.g., apple-basket)

For each relation, there will be 140 training sentences and 70 testing sentences. Approximately half of the sentences will be positive examples and the other half will be near-miss negative examples. The "true" and "false" labels for the testing sentences will not be available to participants until after the end of the evaluation period, and comment lines will likewise be withheld from the testing dataset until then. All other labels will be included in the initial release of the Complete Dataset.

The above seven semantic relations are not exhaustive; for example, the Hypernym-Hyponym relation is not included. When generating the Complete Dataset, we will consider each relation on its own, as a binary positive-negative classification problem. We will not make any assumptions about whether the relations are overlapping or exclusive. Therefore a positive example of one relation is not necessarily a negative example of another relation.

Experimenting with the Trial Dataset

The Trial Dataset is intended to help participants in Task 4 develop and test their algorithms in preparation for the Complete Dataset. For development and testing purposes, participants can randomly split the Trial Dataset into training and testing sets. When the Complete Dataset is released, we will ask participants to submit predictions for the labels in the testing set based on various fractions of the training set. This can be simulated by splitting the Trial Dataset with various train/test ratios.
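For example, a simple way to set up such experiments (a sketch only, assuming the sentences have been loaded into a Python list such as the records produced by the reader sketched earlier) is to shuffle the data once and slice it at different points:

import random

def split_records(records, train_fraction, seed=0):
    # Shuffle a copy so repeated runs with the same seed give the same split.
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Simulate different amounts of training data, e.g. 25%, 50%, and 75%:
# for fraction in (0.25, 0.50, 0.75):
#     train, test = split_records(records, fraction)
#     ... train a classifier on train, evaluate it on test ...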

The performance of the participants' algorithms will be evaluated based on their success at guessing the hidden true/false labels for the testing sentences. The performance measures will be precision, recall, and F (the harmonic mean of precision and recall). Algorithms will be allowed to skip difficult sentences, for increased precision but decreased recall.

For the evaluation with the Complete Dataset, performance measures will be calculated automatically by comparing the output of each algorithm to the annotators' labels. The scoring script will accept output in the following format:

001 true
002 false
003 skipped
004 skipped
005 false
006 true
...

For example, the first line of output indicates that the algorithm has guessed that Content-Container is true for sentence number 001. Participants may wish to use this format with the Trial Dataset, to prepare for the evaluation with the Complete Dataset.
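For development with the Trial Dataset, participants could also score such files themselves. The sketch below assumes, purely for illustration, that precision and recall are computed over the "true" class and that skipped sentences count as unanswered (which is consistent with skipping raising precision and lowering recall); the official scoring script may define the measures differently, and the file names are hypothetical:

def read_answers(path):
    # Lines look like "001 true", "002 false", or "003 skipped".
    answers = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                answers[parts[0]] = parts[1]
    return answers

def score(system, gold):
    # Assumed definitions: precision and recall over the "true" class,
    # with "skipped" treated as no answer.
    true_positives = sum(1 for ident, label in gold.items()
                         if label == "true" and system.get(ident) == "true")
    predicted_true = sum(1 for label in system.values() if label == "true")
    actual_true = sum(1 for label in gold.values() if label == "true")
    precision = true_positives / predicted_true if predicted_true else 0.0
    recall = true_positives / actual_true if actual_true else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Example: precision, recall, f = score(read_answers("my_output.txt"),
#                                        read_answers("gold_labels.txt"))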

We anticipate that some of the participating algorithms will use the WordNet labels and others will ignore them (e.g., corpus-based algorithms may have no use for the WordNet labels). Therefore we will divide the results (the performance on the testing data) into two categories, based on whether the WordNet labels were used. A participating team may submit predictions for the testing labels in both categories, if its algorithm can be run both with and without the WordNet labels.

Resources

All resources are allowed for Task 4 (e.g., lexicons, corpora, part-of-speech tagging, parsing), but the algorithms must be automated (i.e., no human in the loop). We anticipate that many of the participants will use supervised machine learning algorithms to learn positive/negative classification models from the training data. We expect that the main challenge will be creating good feature vectors to represent each example. As a starting point in the search for resources, we recommend the ACL Resources List [10].
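As one very simple illustration of such feature vectors (a sketch under our own assumptions, not a recommended or official approach, and assuming records like those produced by the reader sketched earlier), the words appearing between the two tagged nominals, optionally together with the WordNet sense keys, can be turned into bag-of-words features for any off-the-shelf classifier:

import re
from collections import Counter

def features(record, use_wordnet=False):
    # Crude bag-of-words features: the words between </e1> and <e2>,
    # plus (optionally) the WordNet sense keys themselves.
    m = re.search(r'</e1>(.*?)<e2>', record["sentence"])
    middle = m.group(1) if m else ""
    feats = Counter("mid=" + w.lower() for w in re.findall(r"[A-Za-z']+", middle))
    if use_wordnet:
        feats["wn_e1=" + record.get("wn_e1", "")] += 1
        feats["wn_e2=" + record.get("wn_e2", "")] += 1
    return feats

For the example sentence above, the words between the nominals are only "in a", which is clearly too little context on its own; richer features (part-of-speech tags, parse paths, WordNet hypernyms, corpus statistics for the query pattern, and so on) are where we expect participants to invest most of their effort.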

References

[1] Relation 7: Training Data, http://docs.google.com/View?docID=w.df735kg3_8gt4b4c

[2] SemEval 2007: 4th International Workshop on Semantic Evaluations, http://nlp.cs.swarthmore.edu/semeval/

[3] Classification of Semantic Relations between Nominals: Description of Task 4 in SemEval 2007, http://docs.google.com/View?docID=w.d2jm3f3_98kcwd4

[4] Google Groups: Semantic Relations, http://groups.google.com/group/semanticrelations

[5] SemEval-2007: Schedule, http://nlp.cs.swarthmore.edu/semeval/schedule.shtml

[6] Relation 7: Content-Container, http://docs.google.com/View?docID=w.df735kg3_3gnrv95

[7] WordNet Reference Manual: Format of Sense Index File, http://wordnet.princeton.edu/man/senseidx.5WN

[8] WordNet: A Lexical Database for the English Language, http://wordnet.princeton.edu/

[9] Relation 7: Queries, http://docs.google.com/View?docID=w.df735kg3_12dpk9mx

[10] ACL Resources List, http://aclweb.org/aclwiki/index.php?title=Resources