Task #4: Classification of Semantic Relations between Nominals
Roxana Girju1, Marti Hearst2, Preslav Nakov3,
Stan Szpakowicz5, Peter Turney6, Deniz Yuret7
July 28, 2006
1. Department of Linguistics, University of Illinois at Urbana-Champaign,
2. School of Information, University of California, Berkeley,
3. Department of Electrical Engineering and Computer Science, University of
California, Berkeley, email@example.com
4. School of Information Technology and Engineering, University of Ottawa,
5. School of Information Technology and Engineering, University of Ottawa,
6. Institute for Information Technology, National Research Council of Canada,
7. Department of Computer Engineering, Koc University, firstname.lastname@example.org
1.Description of the Task
There is growing interest in the task of classifying semantic relations
between pairs of words. However, many different classification schemes have
been used, which makes it difficult to compare the various classification
algorithms. We will create a benchmark dataset and evaluation task that will
enable researchers to compare their algorithms.
Rosario and Hearst (2001) classify noun-compounds from the medical
domain, using a set of 13 classes that describe the semantic
relation between the head noun and the modifier in a given
noun-compound. Rosario et al. (2002) classify noun-compounds using a
multi-level hierarchy of semantic relations, with 15 classes at the
top level. Nastase and Szpakowicz (2003) present a two-level
hierarchy for classifying noun-modifier relations in general domain
text, with 5 classes at the top and 30 classes at the bottom. Their
class scheme and dataset have been used by other researchers (Turney
and Littman, 2005; Turney, 2005; Nastase et al., 2006). Moldovan et
al. (2004) use a 35-class scheme to classify relations in noun
phrases. The same scheme has been applied to noun compounds (Girju
et al., 2005). Chklovski and Pantel (2004) use a 5-class scheme,
designed specifically for characterizing verb-verb semantic
relations. Stephens et al. (2001) use a 17-class scheme created for
relations between genes. Lapata (2002) uses a 2-class scheme for
classifying relations in nominalizations.
Algorithms for classifying semantic relations have potential
applications in Information Retrieval, Information Extraction,
Summarization, Machine Translation, Question Answering,
Paraphrasing, Recognizing Textual Entailment, Thesaurus
Construction, Semantic Network Construction, Word Sense
Disambiguation, and Language Modeling. As the techniques for
semantic relation classification mature, some of these applications
are being tested. Tatu and Moldovan (2005) applied the 35-class
scheme of Moldovan et al. (2004) to the PASCAL Recognizing Textual
Entailment (RTE) challenge, obtaining significant improvement in a
There is no consensus on schemes for classifying semantic relations,
and it seems unlikely that any single scheme could be useful for all
applications. For example, the gene-gene relation scheme of
Stephens et al. (2001) includes relations such as "X
phosphorylates Y", which are not very useful for general
domain text. Even if we focus on general domain text, the verb-verb
relations of Chklovski and Pantel (2004) are unlike the
noun-modifier relations of Nastase and Szpakowicz (2003) or the noun
phrase relations of Moldovan et al. (2004).
We will create a benchmark dataset for evaluating semantic relation
classification algorithms, embracing several different existing
classification schemes, instead of attempting the daunting chore of
creating a single unified standard classification scheme. We will
treat each semantic relation separately, as a single two-class
(positive negative) classification task, rather than taking a
whole N class scheme of relations as an
N class classification task (Nastase and Szpakowicz,
To constrain the scope of the task, we have chosen a specific
application for semantic relation classification, relational
search (Cafarella et al., 2006). We describe this application in Section 2. The
application we envision is a kind of search engine that can answer
queries such as "list all X such that X
causes asthma" (Girju, 2001). Given this
application, we have decided to focus on semantic relations between
nominals (i.e., nouns and base noun phrases, excluding named
The dataset for the task will consist of annotated sentences. We
will select a sample of relation classes from several different
classification schemes and then gather sentences from the Web using
a search engine. We will manually markup the sentences, indicating
the nominals and their relations. Algorithms will be evaluated by
their average classification performance over all of the sampled
relations, but we will also be able to see whether some relations
are more difficult to classify than others, and whether some
algorithms are best suited for certain types of relations.
2.Application Example: Relational Search
For some of the tasks that we mention in Section 1, it might be
argued that semantic relation classification plays a supporting
role, rather than a central role. We believe that semantic relation
classification is central in relational search. Cafarella et al. (2006) describe
four types of relational search tasks. Although they focus on
relations between named entities, the same kinds of tasks would be
interesting for nominals. For example, consider the task of making a
list of things that have a given relation with some constant thing:
list all X such that X causes cancer
list all X such that X is part of an automobile
list all X such that X is material for making a
list all X such that X is a type of
list all X such that X is produced from cork
For these kinds of relational search tasks, we do not need a
complete, exhaustive, non-overlapping set of classes of semantic
relations. Each class, such as X causes Y, can
be treated as a single binary classification problem. Any algorithm
that performs well on the dataset (Section 4) and task (Section 5)
described here should be directly applicable to relational search
3.Semantic Relations versus Semantic Roles
We should note that classifying semantic relations between pairs of
words is different in several ways from automatic labeling of
semantic roles (Gildea and Jurafsky, 2002), which was one of the
tasks in Senseval-3.1 Semantic roles involve frames
with many slots, but our focus is on relations between pairs of
words; semantic roles are centered on verbs and their arguments, but
semantic relations include pairwise relations between all parts of
speech (although we limit our attention to nominals in this task, to
keep the task manageable); FrameNet2 currently contains more than
8,900 lexical units, but none of the schemes discussed above contain
more than 50 classes of semantic relations.
Each slot in a frame might be considered as a binary relation, but
FrameNet and PropBank3 do not make a consistent
effort to assign the same labels to similar slots. In FrameNet, for
example, the verb "sell" ("Commerce_sell") has core slots
"buyer", "goods", and "seller", whereas the verb
"give" ("Giving") has core slots "donor",
"recipient", and "theme". There is no matching of the
similar slots, although they have very similar semantic relations:
<sell, buyer> ↔ <give,
<sell, goods> ↔ <give, theme>
<sell, seller> ↔ <give, donor>
Semantic relation classification schemes generalize relations across
wide groups of verbs (Chklovski and Pantel, 2004) and include
relations that are not verb-centered (Nastase and Szpakowicz, 2003;
Moldovan et al., 2004). Using the same labels for similar semantic
relations facilitates supervised learning. For example, a learner
that has been trained with examples of "sell" relations should
be able to transfer what it has learned to "give" relations.
4.Generating Training and Testing Data
Nastase and Szpakowicz (2003) manually labeled 600 noun-modifier pairs. Each
pair was assigned one of thirty possible labels. To facilitate classification,
each word in a pair was also labeled with its part of speech and its synset
When classifying relations in noun phrases, Moldovan et al. (2004) provided
their annotators with an example of each phrase in a sentence. We will include
parts of speech, synset numbers, and a sample sentence for each pair in our
training and testing data.
Consider the noun-modifier pair "silver ship", in which the head
noun "ship" is modified by the word "silver". Using the
classification scheme of Nastase and Szpakowicz (2003), the semantic
relation in this pair might be classified as material (the
ship is made of silver), purpose (the ship was built for
carrying silver), or content (the ship contains silver). Note
that parts of speech and WordNet synsets are not sufficient to
determine which of these three classes is intended. If "silver" is
labeled "a1" (adjective, WordNet sense number 1, "made from or
largely consisting of silver") and "ship" is labeled "n1" (noun,
WordNet sense number 1, "a vessel that carries passengers or
freight"), then the correct class must be material. However,
if "silver" is labeled "n1" (noun, WordNet sense number 1, "a soft
white precious univalent metallic element"), then the correct class
could be either purpose or content. We would represent
this example as follows:
"The <e1>silver</e1> <e2>ship</e2> usually
carried silver bullion bars, but sometimes the cargo was gold or
platinum." WordNet(e1) = "n1", WordNet(e2) = "n1", Relation(e1, e2)
We will begin by choosing seven relations from several different
schemes (e.g., Nastase and Szpakowicz, 2003; Moldovan et al.,
2004). We will focus on relations between nominals. For the purposes
of this exercise, we define a nominal as a noun or base noun phrase,
excluding named entities. A base noun phrase is a noun and its
premodifiers (e.g., nouns, adjectives, determiners). We do not
include complex noun phrases (e.g., noun phrases with attached
prepositional phrases). For example, "lawn" is a noun, "lawn mower"
is a base noun phrase, and "the engine of the lawn mower" is a
complex noun phrase. The markup will explicitly identify entity
boundaries, so the teams that attempt this task will not need to
worry about finding entity boundaries (e.g., in "<e1>macadamia nuts</e1> in the
<e2>cake</e2>", we can see that the first entity is a base
noun phrase and the second entity is a noun; there will be no need
for a chunking parser).
For each of the chosen relations, we will give a precise definition of the
relation and some prototypical examples. The definitions and examples will be
available to the annotators and will be included in the distribution of the
training and testing data.
Given a specific relation (e.g., content), we will use heuristic
patterns to search in a large corpus for sentences that illustrate the given
relation. For example, for the relation content, we may use Google to
search the Web, using queries such as "contains", "holds", "the * in the". For
each relation, we will use several different search patterns, to ensure a wide
variety of example sentences. The search patterns will be manually
constructed, using the approach of Hearst (1992).
The collected sentences will be given to two annotators, who will create
positive and negative training examples from the sentences. For example, here
are positive and negative examples for content (below,
"!=" means "does
"The <e1>macadamia nuts</e1> in the <e2>cake</e2> also
make it necessary to have a very sharp knife to cut through the cake neatly."
WordNet(e1) = "n2", WordNet(e2) = "n3", Relation(e1, e2) = "content".
"Summer was over and he knew that the
<e1>climate</e1> in the <e2>forest</e2> would only get
worse." WordNet(e1) = "n1", WordNet(e2) = "n1", Relation(e1, e2) !=
The negative example above would be classified as location in the
scheme of Nastase and Szpakowicz (2003). The use of heuristic patterns to
search for positive and negative training examples should naturally result in
negative examples that are near misses. We believe that near misses are more
useful for supervised learning than negative examples that are generated
Each example will be independently labeled by two annotators. When the
annotation is completed, the annotators will compare their labels and make a
note of the number of cases in which they agree and disagree. If the
annotators cannot come to a consensus on the correct labels for a particular
example, that example will not be included in the training and testing data,
although it will be recorded for possible future analysis.
This method of generating training and testing data is designed with
relational search in mind (Section 2). A natural approach to relational search
is to use heuristic patterns (Hearst, 1992) with a conventional search engine,
and then use supervised learning (Moldovan et al., 2004) to filter the
resulting noisy text.
We will follow the model of the Senseval-3 English Lexical Sample Task, which
had about 140 training and 70 testing samples per word. The following list
summarizes the main features of our dataset:
7 semantic relations (not exhaustive and possibly overlapping)
140 training sentences per relation (7
× 140 = 980 training sentences)
70 testing sentences per relation (7 ×
70 = 490 testing sentences)
210 combined testing and training sentences per relation (7
× 210 = 1,470 sentences)
sentence classes will be approximately 50% positive and 50% negative
(roughly 735 positive and 735 negative, for a total of 1,470 sentences)
several different search patterns will be used for each semantic relation,
to avoid biasing the sample sentences
negative examples of a relation will be "near misses"
Since most of the words in the Senseval-3 English Lexical Sample Task had more than two senses, we will have more
samples per class (positive and negative) per relation than the average word
in the English Lexical Sample Task had samples per sense per word.
For each relation, one person will retrieve sample sentences and two other
people will annotate the sentences. To encourage debate, the three people will
be chosen from three different institutions. A detailed guide will be
prepared, to maximize the agreement between annotators.
As with the Senseval-3 Lexical Sample tasks, each team participating in this
task will initially have access only to the training data. Later, the teams
will have access to unlabeled testing data (that is, there will be WordNet
labels, but no Relation labels). The teams will enter their algorithms'
guesses for the labels for the testing data. When SemEval-1 is over, the
labels for the testing data will be released to the public.
Algorithms will be allowed to skip examples that they cannot classify. An
algorithm's score for a given relation will be the F score, the harmonic mean
of precision and recall. Algorithms will be ranked according to their average
F scores for the chosen set of relations. We will also analyze the results to
see which relations are most difficult to classify. To assess the effect of
varying quantities of training data, we will ask the teams to submit several
sets of guesses for the labels for the testing data, using varying fractions
of the training data.
Some algorithms (e.g., corpus-based algorithms) may have no use for WordNet
annotations. It might also be argued that the WordNet annotation is not
practical in a real application. Therefore we will ask teams to indicate, when
they submit their answers, whether their algorithms used the WordNet labels.
We will group the submitted answers into those that used the WordNet labels
and those that did not, and we will rank the answers in each group separately.
Teams will be allowed to submit both types of answers, if their algorithms
Our collected training and testing data, including all annotation, will be
released under a Creative Commons
Cafarella, M.J., Banko, M., and Etzioni, O. (2006). Relational Web
Search. University of Washington, Department of Computer Science and
Engineering, Technical Report 2006-04-02.
Chklovski, T., and Pantel, P. (2004). VerbOcean: Mining the Web for
fine-grained semantic verb relations. In Proceedings of Conference on
Empirical Methods in Natural Language Processing (EMNLP-04). pp. 33-40.
Gildea, D., and Jurafsky, D. (2002). Automatic labeling of semantic roles.
Computational Linguistics, 28(3):245-288.
Girju, R. (2001). Answer fusion with on-line ontology development. In
Proceedings of the North American Chapter of the Association for
Computational Linguistics (NAACL) - Student Research Workshop (NAACL
2001), Pittsburgh, PA.
Girju, R., Moldovan, D., Tatu, M., Antohe, D. (2005). On the semantics of noun
compounds. Computer Speech and Language, 19:479-496.
Hearst, M. (1992). Automatic acquisition of hyponyms from large text corpora.
In Proceedings of the 14th International Conference on Computational
Linguistics (COLING-92), pages 539-545.
Lapata, M. (2002). The disambiguation of nominalisations. Computational
Moldovan, D., Badulescu, A., Tatu, M., Antohe, D., and Girju, R. (2004).
Models for the semantic classification of noun phrases. In Proceedings of
the Computational Lexical Semantics Workshop at HLT-NAACL 2004, pp. 60-67,
Nastase, V., and Szpakowicz, S. (2003). Exploring noun-modifier semantic
relations. In Fifth International Workshop on Computational Semantics
(IWCS-5), pp. 285-301. Tilburg, The Netherlands.
Nastase, V., Sayyad-Shirabad, J., Sokolova, M., and Szpakowicz, S. (2006).
Learning noun-modifier semantic relations with corpus-based and WordNet-based
features. In Proceedings of the 21st National Conference on Artificial
Intelligence (AAAI 2006). Boston, MA.
Rosario, B., and Hearst, M. (2001). Classifying the semantic relations in
noun-compounds via a domain-specific lexical hierarchy. In Proceedings of
the 2001 Conference on Empirical Methods in Natural Language Processing
(EMNLP-01), pp. 82-90.
Rosario, B., Hearst, M., and Fillmore, C. (2002). The descent of hierarchy,
and selection in relational semantics. In Proceedings of the 40th Annual
Meeting of the Association for Computational Linguistics (ACL-02),
Philadelphia, PA, pp. 417-424.
Stephens, M., Palakal, M., Mukhopadhyay, S., Raje, R., and Mostafa, J. (2001).
Detecting gene relations from MEDLINE abstracts. In Proceedings of the
Sixth Annual Pacific Symposium on Biocomputing, pp. 483-496.
Tatu, M., and Moldovan, D. (2005). A semantic approach to recognizing textual
entailment. In Proceedings of Human Language Technology Conference and
Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP
2005), pp. 371-378, Vancouver,
Turney, P.D. (2005). Measuring semantic similarity by latent relational
analysis. In Proceedings of the Nineteenth International Joint Conference
on Artificial Intelligence (IJCAI-05), pp. 1136-1141, Edinburgh, Scotland.
Turney, P.D. and Littman, M.L.. (2005). Corpus-based learning of analogies and
semantic relations. Machine Learning, 60(1-3):251-278.