Guide for Use of Trial Dataset
This document describes the Trial Dataset [1] for Task 4 in SemEval 2007 [2],
Classification of Semantic Relations between
Nominals [3]. If you have any questions about the use of the Trial
Dataset that are not answered below or in the description of Task 4, please send
a message to the Semantic Relations Google Group [4] or contact any of the
authors of Classification of Semantic Relations
between Nominals [3]. The purpose of this guide is to provide basic
information about the Trial Dataset for participants in Task 4. We describe the
format of the Trial Dataset and the plans for the Complete Dataset, offer some
suggestions for using the Trial Dataset, and point to some resources that may be
useful to participants.
Description of Trial Dataset
The planned release date for the Trial Dataset is January 3, 2007 [5]. The Trial
Dataset includes 140 sentences that are positive and negative examples of the
Content-Container relation, which is defined in
Relation 7: Content-Container [6]. These
140 sentences will be merged into the Complete Dataset as training examples.
An example of one of the sentences follows:
113 "After the cashier put the
<e1>cash</e1> in a <e2>bag</e2>, the robber saw a
bottle of scotch that he wanted behind the counter on the
shelf."
WordNet(e1) = "cash%1:21:00::",
WordNet(e2) = "bag%1:06:00::", Content-Container(e1, e2) = "true", Query =
"the * in a bag"
The first line includes the sentence itself, preceded by a numerical identifier.
The two nominals, "cash" and "bag", are marked by <e1> and <e2>
tags. The second line gives the WordNet sense keys for the two nominals [7] and
indicates whether the semantic relation between the nominals is a positive
("true") or negative ("false") example of the Content-Container relation [6]. We
use WordNet sense keys because, unlike WordNet synset numbers, sense keys are
relatively stable across different versions of WordNet. Our preferred version of
WordNet is 3.0, but we believe that most of the sense keys for version 2.1 are
the same as in version 3.0 [8] (although the synset numbers changed
significantly between the two versions). Finally, the second line gives the
Google query that was used to find the sentence. The queries are manually
generated heuristic patterns intended to find sentences that are likely to be
positive examples of the given relation (Content-Container, in this case) [9].
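For readers who want to look up these sense keys programmatically, the following Python sketch (not part of the official task materials) resolves the keys from the example above to synsets using NLTK's WordNet interface; it assumes NLTK is installed and its WordNet data has been downloaded.

    # Resolve the Trial Dataset's WordNet sense keys to synsets with NLTK.
    # Assumes: pip install nltk, then nltk.download('wordnet').
    from nltk.corpus import wordnet as wn

    for key in ("cash%1:21:00::", "bag%1:06:00::"):
        lemma = wn.lemma_from_key(key)   # look up the lemma by its sense key
        synset = lemma.synset()          # the synset identified by the key
        print(key, "->", synset.name(), "-", synset.definition())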
Some of the sentences are followed by an optional comment line:
127 "I find it hard to bend and reach
and I cannot use the <e1>cupboards</e1> in my
<e2>kitchen</e2>."
WordNet(e1) = "cupboard%1:06:00::",
WordNet(e2) = "kitchen%1:06:00::", Content-Container(e1, e2) = "false", Query
= "the * in my kitchen"
Comment: Located-Location or, better,
Part-Whole.
The comment lines have been added by the annotators to explain their labeling
decisions. The comments are intended for human readers; they should be ignored
by the participating algorithms and will not be used in scoring the algorithms'
output.
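Taken together, an entry consists of a numbered sentence line, an annotation line, and an optional comment line. The Python sketch below shows one way such an entry could be parsed; it assumes the layout shown in the two examples above (the exact file layout of the Complete Dataset may differ).

    import re

    def parse_entry(lines):
        """Parse one entry given as a list of its text lines: a numbered
        sentence line, an annotation line, and an optional comment line."""
        entry = {}
        # Line 1: numeric identifier followed by the quoted sentence.
        m = re.match(r'(\d+)\s+"(.*)"\s*$', lines[0])
        entry["id"], sentence = m.group(1), m.group(2)
        entry["sentence"] = sentence
        # The nominals are marked with <e1>...</e1> and <e2>...</e2> tags.
        entry["e1"] = re.search(r"<e1>(.*?)</e1>", sentence).group(1)
        entry["e2"] = re.search(r"<e2>(.*?)</e2>", sentence).group(1)
        # Line 2: WordNet sense keys, relation label, and search query.
        ann = lines[1]
        entry["wordnet_e1"] = re.search(r'WordNet\(e1\) = "(.*?)"', ann).group(1)
        entry["wordnet_e2"] = re.search(r'WordNet\(e2\) = "(.*?)"', ann).group(1)
        entry["relation"] = re.search(r'(\S+)\(e1, e2\)', ann).group(1)
        entry["label"] = re.search(r'\(e1, e2\) = "(true|false)"', ann).group(1)
        entry["query"] = re.search(r'Query\s*=\s*"(.*?)"', ann).group(1)
        # Line 3 (optional): an annotator comment, ignored for scoring.
        if len(lines) > 2 and lines[2].startswith("Comment:"):
            entry["comment"] = lines[2][len("Comment:"):].strip()
        return entry

    example = [
        '127 "I find it hard to bend and reach and I cannot use the '
        '<e1>cupboards</e1> in my <e2>kitchen</e2>."',
        'WordNet(e1) = "cupboard%1:06:00::", WordNet(e2) = "kitchen%1:06:00::", '
        'Content-Container(e1, e2) = "false", Query = "the * in my kitchen"',
        'Comment: Located-Location or, better, Part-Whole.',
    ]
    print(parse_entry(example))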
To help motivate Task 4, consider the following potential application. Imagine
that we wish to create a new type of search engine for semantic relations. For
example, suppose I have just bought a new home, and I am wondering what things I
will need to purchase for my new kitchen. I could search for all X such that
Content-Container(X, kitchen) = "true". We assume that the search engine will
have a predefined set of manually generated heuristic patterns for a few basic
semantic relations, such as Content-Container(X, Y). One of the patterns might
be "the X in a Y", so that a search for all X such that Content-Container(X,
kitchen) = "true" will result in the query "the X in a kitchen". Some of the
sentences that are found with this query will be positive examples of
Content-Container(X, kitchen) and some will be near-miss negative examples. The
challenge of Task 4 is to learn to automatically distinguish the positive and
negative examples. A successful algorithm for this task could be used to filter
the query results in a search engine for semantic relations. Other possible
applications of a successful algorithm include question answering and
paraphrasing.
Plans for Complete Dataset
The planned release date for the Complete Dataset is February 26, 2007 [5]. The
evaluation period begins on this date and ends on April 1, 2007. The Complete
Dataset will include the following seven semantic relations:
- Cause-Effect (e.g., virus-flu)
- Instrument-User (e.g., laser-printer)
- Product-Producer (e.g., honey-bee)
- Origin-Entity (e.g., rye-whiskey)
- Purpose-Tool (e.g., soup-pot)
- Part-Whole (e.g., wheel-car)
- Content-Container (e.g., apple-basket)
For each relation, there will be 140 training sentences and 70 testing
sentences. Approximately half of the sentences will be positive examples and the
other half will be near-miss negative examples. The "true" and "false" labels
for the testing sentences will not be available to the participants until after
the end of the evaluation period. Comment lines will likewise be withheld from
the testing dataset until the end of the evaluation period. All other labels
will be included in the initial release of the Complete Dataset.
The above seven semantic relations are not exhaustive; for example, the
Hypernym-Hyponym relation is not included. When generating the Complete Dataset,
we will consider each relation on its own, as a binary positive-negative
classification problem. We will not make any assumptions about whether the
relations are overlapping or exclusive. Therefore a positive example of one
relation is not necessarily a negative example of another relation.
Experimenting with the Trial Dataset
The Trial Dataset is intended to help participants in Task 4 develop and test
their algorithms, to prepare for the Complete Dataset. For development and
testing purposes, the participants can randomly split the Trial Dataset into
training and testing sets. When the Complete Dataset is released, we will ask
participants to submit predictions for the labels in the testing set, based on
various fractions of the training set. This can be simulated by experimenting
with various train/test ratios of the Trial Dataset.
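For example, a random split of the parsed entries could be produced as follows (a minimal sketch; the variable all_entries is assumed to hold entries parsed as shown earlier):

    import random

    def split_entries(all_entries, train_fraction=0.7, seed=0):
        """Randomly split the Trial Dataset entries into train and test sets."""
        shuffled = list(all_entries)   # copy; leave the original order intact
        random.Random(seed).shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        return shuffled[:cut], shuffled[cut:]

    # Simulate different amounts of training data:
    # for frac in (0.25, 0.5, 0.75):
    #     train, test = split_entries(all_entries, train_fraction=frac)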
The performance of the participants' algorithms will be evaluated based on their
success at guessing the hidden true/false labels for the testing sentences. The
performance measures will be precision, recall, and F (the harmonic mean of
precision and recall). Algorithms will be allowed to skip difficult sentences,
trading decreased recall for increased precision.
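These measures can be approximated while experimenting with the Trial Dataset. The sketch below follows one plausible reading of the description above, in which skipped sentences do not count against precision but do count against recall; the official scoring, described next, is authoritative.

    def score(gold, predicted):
        """gold: list of 'true'/'false' labels; predicted: list of
        'true'/'false'/'skipped' guesses, aligned with gold.  This is an
        informal approximation, not the official scoring definition."""
        answered = [(g, p) for g, p in zip(gold, predicted) if p != "skipped"]
        correct = sum(1 for g, p in answered if g == p)
        precision = correct / len(answered) if answered else 0.0
        recall = correct / len(gold) if gold else 0.0
        f = (2 * precision * recall / (precision + recall)
             if precision + recall > 0 else 0.0)
        return precision, recall, f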
For the evaluation with the Complete Dataset, performance measures will be
calculated automatically by comparing the output of each algorithm to the
annotators' labels. The scoring script will accept output in the following
format:
001 true
002 false
003 skipped
004 skipped
005 false
006 true
...
For example, the first line of output indicates that the algorithm has guessed
that Content-Container is true for sentence number 001. Participants may wish to
use this format with the Trial Dataset, to prepare for the evaluation with the
Complete Dataset.
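For instance, predictions could be written out in this format with a few lines of Python (a sketch; the file name and the zero-padded identifiers shown below are assumptions based on the example above):

    def write_predictions(path, predictions):
        """predictions: list of (sentence_id, label) pairs, where label is
        'true', 'false', or 'skipped' and sentence_id is the zero-padded
        identifier as it appears in the dataset."""
        with open(path, "w") as out:
            for sent_id, label in predictions:
                out.write(f"{sent_id} {label}\n")

    # write_predictions("answers.txt", [("001", "true"), ("002", "false"),
    #                                   ("003", "skipped")])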
We anticipate that some of the participating algorithms will use the WordNet
labels and others will ignore them (e.g., corpus-based algorithms may have no
use for the WordNet labels). Therefore we will divide the results (the
performance on the testing data) into two categories, based on whether WordNet
labels were used. A participating team may submit predictions for the testing
labels in both categories if its algorithm can run both with and without the
WordNet labels.
Resources
All resources are allowed for Task 4 (e.g., lexicons, corpora, part-of-speech
tagging, parsing), but the algorithms must be automated (i.e., no human in the
loop). We anticipate that many of the participants will use supervised machine
learning algorithms to learn positive/negative classification models from the
training data. We expect that the main challenge will be creating good feature
vectors to represent each example. As a starting point in the search for
resources, we recommend the ACL Resources
List [10].
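As one simple illustration (and nothing more than that), the sketch below builds a bag-of-words feature dictionary from the two nominals and the words between them, using the entry dictionaries produced by the parsing sketch earlier in this guide; it assumes that e1 precedes e2 in the sentence, as in the examples above. Such dictionaries could then be fed to any standard supervised learner.

    import re

    def features(entry):
        """A deliberately simple feature representation: the two nominals
        plus a bag of the lower-cased words between them.  Assumes e1
        precedes e2 in the sentence."""
        sentence = entry["sentence"]
        between = re.search(r"</e1>(.*?)<e2>", sentence).group(1)
        feats = {"e1=" + entry["e1"].lower(): 1, "e2=" + entry["e2"].lower(): 1}
        for word in re.findall(r"[a-z]+", between.lower()):
            feats["between=" + word] = 1
        return feats

    # For the example entry above:
    # features(parse_entry(example))
    # -> {'e1=cupboards': 1, 'e2=kitchen': 1, 'between=in': 1, 'between=my': 1}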
References
[1] Relation 7: Training Data,
http://docs.google.com/View?docID=w.df735kg3_8gt4b4c
[2] SemEval 2007: 4th International Workshop on Semantic Evaluations,
http://nlp.cs.swarthmore.edu/semeval/
[3] Classification of Semantic Relations between Nominals: Description of Task 4
in SemEval 2007,
http://docs.google.com/View?docID=w.d2jm3f3_98kcwd4
[4] Google Groups: Semantic Relations,
http://groups.google.com/group/semanticrelations
[5] SemEval-2007: Schedule,
http://nlp.cs.swarthmore.edu/semeval/schedule.shtml
[6] Relation 7: Content-Container,
http://docs.google.com/View?docID=w.df735kg3_3gnrv95
[7] WordNet Reference Manual: Format of Sense Index File,
http://wordnet.princeton.edu/man/senseidx.5WN
[8] WordNet: A Lexical Database for the English Language,
http://wordnet.princeton.edu/
[9] Relation 7: Queries,
http://docs.google.com/View?docID=w.df735kg3_12dpk9mx
[10] ACL Resources List,
http://aclweb.org/aclwiki/index.php?title=Resources