Roxana Girju (1), Marti Hearst (2), Preslav Nakov (3), Vivi Nastase (4), Stan Szpakowicz (5), Peter Turney (6), Deniz Yuret (7)
July 28, 2006
1. Department of Linguistics, University of Illinois at Urbana-Champaign,
girju@cs.uiuc.edu
2. School of Information, University of California, Berkeley,
hearst@sims.berkeley.edu
3. Department of Electrical Engineering and Computer Science, University of
California, Berkeley, nakov@cs.berkeley.edu
4. School of Information Technology and Engineering, University of Ottawa,
vnastase@site.uottawa.ca
5. School of Information Technology and Engineering, University of Ottawa,
szpak@site.uottawa.ca
6. Institute for Information Technology, National Research Council of Canada,
peter.turney@nrc-cnrc.gc.ca
7. Department of Computer Engineering, Koc University, dyuret@ku.edu.tr
1. Introduction

There is growing interest in the task of classifying semantic relations between pairs of words. However, many different classification schemes have been used, which makes it difficult to compare the various classification algorithms. We will create a benchmark dataset and evaluation task that will enable researchers to compare their algorithms.
Rosario and Hearst (2001) classify noun-compounds from the medical domain, using a set of 13 classes that describe the semantic relation between the head noun and the modifier in a given noun-compound. Rosario et al. (2002) classify noun-compounds using a multi-level hierarchy of semantic relations, with 15 classes at the top level. Nastase and Szpakowicz (2003) present a two-level hierarchy for classifying noun-modifier relations in general domain text, with 5 classes at the top and 30 classes at the bottom. Their class scheme and dataset have been used by other researchers (Turney and Littman, 2005; Turney, 2005; Nastase et al., 2006). Moldovan et al. (2004) use a 35-class scheme to classify relations in noun phrases. The same scheme has been applied to noun compounds (Girju et al., 2005). Chklovski and Pantel (2004) use a 5-class scheme, designed specifically for characterizing verb-verb semantic relations. Stephens et al. (2001) use a 17-class scheme created for relations between genes. Lapata (2002) uses a 2-class scheme for classifying relations in nominalizations.
Algorithms for classifying semantic relations have potential applications in Information Retrieval, Information Extraction, Summarization, Machine Translation, Question Answering, Paraphrasing, Recognizing Textual Entailment, Thesaurus Construction, Semantic Network Construction, Word Sense Disambiguation, and Language Modeling. As the techniques for semantic relation classification mature, some of these applications are being tested. Tatu and Moldovan (2005) applied the 35-class scheme of Moldovan et al. (2004) to the PASCAL Recognizing Textual Entailment (RTE) challenge, obtaining a significant improvement in the performance of a state-of-the-art algorithm.
There is no consensus on schemes for classifying semantic relations, and it seems unlikely that any single scheme could be useful for all applications. For example, the gene-gene relation scheme of Stephens et al. (2001) includes relations such as "X phosphorylates Y", which are not very useful for general domain text. Even if we focus on general domain text, the verb-verb relations of Chklovski and Pantel (2004) are unlike the noun-modifier relations of Nastase and Szpakowicz (2003) or the noun phrase relations of Moldovan et al. (2004).
We will create a benchmark dataset for evaluating semantic relation classification algorithms, embracing several different existing classification schemes, instead of attempting the daunting chore of creating a single unified standard classification scheme. We will treat each semantic relation separately, as a single two-class (positive/negative) classification task, rather than taking a whole N-class scheme of relations as an N-class classification task (Nastase and Szpakowicz, 2003).
To constrain the scope of the task, we have chosen a specific application for semantic relation classification: relational search (Cafarella et al., 2006). We describe this application in Section 2. The application we envision is a kind of search engine that can answer queries such as "list all X such that X causes asthma" (Girju, 2001). Given this application, we have decided to focus on semantic relations between nominals (i.e., nouns and base noun phrases, excluding named entities).
The dataset for the task will consist of annotated sentences. We will select a sample of relation classes from several different classification schemes and then gather sentences from the Web using a search engine. We will manually mark up the sentences, indicating the nominals and their relations. Algorithms will be evaluated by their average classification performance over all of the sampled relations, but we will also be able to see whether some relations are more difficult to classify than others, and whether some algorithms are better suited to certain types of relations.
2. The Application: Relational Search

For some of the tasks that we mention in Section 1, it might be argued that semantic relation classification plays a supporting role, rather than a central role. We believe that semantic relation classification is central in relational search. Cafarella et al. (2006) describe four types of relational search tasks. Although they focus on relations between named entities, the same kinds of tasks would be interesting for nominals. For example, consider the task of making a list of things that have a given relation with some constant thing:
- list all X such that X causes cancer
- list all X such that X is part of an automobile engine
- list all X such that X is material for making a ship's hull
- list all X such that X is a type of transportation
- list all X such that X is produced from cork trees
For these kinds of relational search tasks, we do not need an exhaustive, non-overlapping set of classes of semantic relations. Each class, such as X causes Y, can be treated as a single binary classification problem. Any algorithm that performs well on the dataset (Section 4) and task (Section 5) described here should be directly applicable to relational search applications.
3. Semantic Relations versus Semantic Roles

We should note that classifying semantic relations between pairs of words is different in several ways from automatic labeling of semantic roles (Gildea and Jurafsky, 2002), which was one of the tasks in Senseval-3 [1]. Semantic roles involve frames with many slots, but our focus is on relations between pairs of words; semantic roles are centered on verbs and their arguments, but semantic relations include pairwise relations between all parts of speech (although we limit our attention to nominals in this task, to keep the task manageable); FrameNet [2] currently contains more than 8,900 lexical units, but none of the schemes discussed above contains more than 50 classes of semantic relations.
Each slot in a frame might be considered as a binary relation, but FrameNet and PropBank [3] do not make a consistent effort to assign the same labels to similar slots. In FrameNet, for example, the verb "sell" ("Commerce_sell") has core slots "buyer", "goods", and "seller", whereas the verb "give" ("Giving") has core slots "donor", "recipient", and "theme". There is no matching of the similar slots, although they have very similar semantic relations:
<sell, buyer> ↔ <give, recipient>
<sell, goods> ↔ <give, theme>
<sell, seller> ↔ <give, donor>
Semantic relation classification schemes generalize relations across wide groups of verbs (Chklovski and Pantel, 2004) and include relations that are not verb-centered (Nastase and Szpakowicz, 2003; Moldovan et al., 2004). Using the same labels for similar semantic relations facilitates supervised learning. For example, a learner that has been trained with examples of "sell" relations should be able to transfer what it has learned to "give" relations.
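To make this concrete, the following Python sketch (illustrative only; the generic labels "agent", "recipient", and "theme" are our own choices, not part of FrameNet) shows the kind of normalization that a semantic relation classification scheme provides:

```python
# Map frame-specific core slots to shared, generic relation labels, so
# that labeled examples of "sell" can serve as training data for "give".
# The generic labels are hypothetical; FrameNet itself does not provide them.
SHARED_ROLE = {
    ("Commerce_sell", "seller"): "agent",
    ("Commerce_sell", "buyer"):  "recipient",
    ("Commerce_sell", "goods"):  "theme",
    ("Giving", "donor"):         "agent",
    ("Giving", "recipient"):     "recipient",
    ("Giving", "theme"):         "theme",
}

def generalize(frame, slot):
    """Return the shared label for a frame-specific slot."""
    return SHARED_ROLE[(frame, slot)]

# The similar slots of "sell" and "give" now carry the same label.
assert generalize("Commerce_sell", "goods") == generalize("Giving", "theme")
```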
4. The Dataset

Nastase and Szpakowicz (2003) manually labeled 600 noun-modifier pairs. Each pair was assigned one of thirty possible labels. To facilitate classification, each word in a pair was also labeled with its part of speech and its synset number in WordNet [4]. When classifying relations in noun phrases, Moldovan et al. (2004) provided their annotators with an example of each phrase in a sentence. We will include parts of speech, synset numbers, and a sample sentence for each pair in our training and testing data.
Consider the noun-modifier pair "silver ship", in which the head noun "ship" is modified by the word "silver". Using the classification scheme of Nastase and Szpakowicz (2003), the semantic relation in this pair might be classified as material (the ship is made of silver), purpose (the ship was built for carrying silver), or content (the ship contains silver). Note that parts of speech and WordNet synsets are not sufficient to determine which of these three classes is intended. If "silver" is labeled "a1" (adjective, WordNet sense number 1, "made from or largely consisting of silver") and "ship" is labeled "n1" (noun, WordNet sense number 1, "a vessel that carries passengers or freight"), then the correct class must be material. However, if "silver" is labeled "n1" (noun, WordNet sense number 1, "a soft white precious univalent metallic element"), then the correct class could be either purpose or content. We would represent this example as follows:
"The <e1>silver</e1> <e2>ship</e2> usually carried silver bullion bars, but sometimes the cargo was gold or platinum." WordNet(e1) = "n1", WordNet(e2) = "n1", Relation(e1, e2) = "content".
We will begin by choosing seven relations from several different schemes (e.g., Nastase and Szpakowicz, 2003; Moldovan et al., 2004). We will focus on relations between nominals. For the purposes of this exercise, we define a nominal as a noun or base noun phrase, excluding named entities. A base noun phrase is a noun and its premodifiers (e.g., nouns, adjectives, determiners). We do not include complex noun phrases (e.g., noun phrases with attached prepositional phrases). For example, "lawn" is a noun, "lawn mower" is a base noun phrase, and "the engine of the lawn mower" is a complex noun phrase. The markup will explicitly identify entity boundaries, so the teams that attempt this task will not need to worry about finding entity boundaries (e.g., in "<e1>macadamia nuts</e1> in the <e2>cake</e2>", we can see that the first entity is a base noun phrase and the second entity is a noun; there will be no need for a chunking parser).
For each of the chosen relations, we will give a precise definition of the relation and some prototypical examples. The definitions and examples will be available to the annotators and will be included in the distribution of the training and testing data.
Given a specific relation (e.g., content), we will use heuristic patterns to search in a large corpus for sentences that illustrate the given relation. For example, for the relation content, we may use Google to search the Web, using queries such as "contains", "holds", "the * in the". For each relation, we will use several different search patterns, to ensure a wide variety of example sentences. The search patterns will be manually constructed, using the approach of Hearst (1992).
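As an illustration, a harvesting script might cycle through the patterns as in the sketch below; issue_query is a hypothetical stand-in for a search engine API, not a real interface:

```python
# Patterns for the "content" relation, as mentioned above.
CONTENT_PATTERNS = ['"contains"', '"holds"', '"the * in the"']

def gather_candidates(patterns, issue_query, per_pattern=100):
    """Collect candidate sentences from several different patterns,
    so that no single pattern biases the sample."""
    candidates = []
    for pattern in patterns:
        candidates.extend(issue_query(pattern, limit=per_pattern))
    return candidates

# Demo with a stub in place of a real search API:
stub = lambda pattern, limit: ["(sentence matching %s)" % pattern]
print(gather_candidates(CONTENT_PATTERNS, stub, per_pattern=1))
```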
The collected sentences will be given to two annotators, who will create positive and negative examples from the sentences. For example, here are positive and negative examples for content (below, "!=" means "does not equal"):
"The <e1>macadamia nuts</e1> in the <e2>cake</e2> also make it necessary to have a very sharp knife to cut through the cake neatly." WordNet(e1) = "n2", WordNet(e2) = "n3", Relation(e1, e2) = "content".
"Summer was over and he knew that the <e1>climate</e1> in the <e2>forest</e2> would only get worse." WordNet(e1) = "n1", WordNet(e2) = "n1", Relation(e1, e2) != "content".
The negative example above would be classified as location in the scheme of Nastase and Szpakowicz (2003). The use of heuristic patterns to search for positive and negative training examples should naturally result in negative examples that are near misses. We believe that near misses are more useful for supervised learning than negative examples that are generated purely randomly.
Each example will be independently labeled by two annotators. When the annotation is completed, the annotators will compare their labels and make a note of the number of cases in which they agree and disagree. If the annotators cannot come to a consensus on the correct labels for a particular example, that example will not be included in the training and testing data, although it will be recorded for possible future analysis.
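The bookkeeping we have in mind might look like the following sketch, which assumes (hypothetically) that each annotator's final, post-discussion labels arrive as a list aligned with the list of examples:

```python
def adjudicate(examples, labels_a, labels_b):
    """Keep the examples on which the annotators agree; set aside (but
    record) the unresolved disagreements; report raw agreement."""
    kept, recorded = [], []
    for example, a, b in zip(examples, labels_a, labels_b):
        if a == b:
            kept.append((example, a))
        else:
            recorded.append((example, a, b))  # excluded, saved for analysis
    return kept, recorded, len(kept) / len(examples)
```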
This method of generating training and testing data is designed with relational search in mind (Section 2). A natural approach to relational search is to use heuristic patterns (Hearst, 1992) with a conventional search engine, and then use supervised learning (Moldovan et al., 2004) to filter the resulting noisy text.
We will follow the model of the Senseval-3 English Lexical Sample Task, which had about 140 training and 70 testing samples per word. The following list summarizes the main features of our dataset:
- 7 semantic relations (not exhaustive and possibly overlapping)
- 140 training sentences per relation (7 × 140 = 980 training sentences)
- 70 testing sentences per relation (7 × 70 = 490 testing sentences)
- 210 combined testing and training sentences per relation (7 × 210 = 1,470 sentences)
- sentence classes will be approximately 50% positive and 50% negative (roughly 735 positive and 735 negative, for a total of 1,470 sentences)
- several different search patterns will be used for each semantic relation, to avoid biasing the sample sentences
- negative examples of a relation will be "near misses"
Since most of the words in the Senseval-3 English Lexical Sample Task had more than two senses, we will have more samples per class (positive or negative) per relation than the average word in that task had samples per sense: roughly 70 training sentences per class per relation (50% of 140), versus fewer than 70 per sense (140 divided among three or more senses) for a typical word in the Lexical Sample Task.
For each relation, one person will retrieve sample sentences and two other people will annotate the sentences. To encourage debate, the three people will be chosen from three different institutions. A detailed guide will be prepared, to maximize the agreement between annotators.
5. The Task

As with the Senseval-3 Lexical Sample tasks, each team participating in this task will initially have access only to the training data. Later, the teams will have access to unlabeled testing data (that is, there will be WordNet labels, but no Relation labels). The teams will enter their algorithms' guesses for the labels for the testing data. When SemEval-1 is over, the labels for the testing data will be released to the public.
Algorithms will be allowed to skip examples that they cannot classify. An algorithm's score for a given relation will be the F score, the harmonic mean of precision and recall. Algorithms will be ranked according to their average F scores for the chosen set of relations. We will also analyze the results to see which relations are most difficult to classify. To assess the effect of varying quantities of training data, we will ask the teams to submit several sets of guesses for the labels for the testing data, using varying fractions of the training data.
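As a sketch of the scoring, under the assumption (still to be finalized) that skipped examples count against recall but not against precision:

```python
def f_score(guesses, gold):
    """guesses: dict mapping example id -> predicted label; skipped
    examples are simply absent. gold: dict mapping example id -> correct
    label, covering all examples."""
    correct = sum(1 for ex, label in guesses.items() if gold[ex] == label)
    precision = correct / len(guesses) if guesses else 0.0
    recall = correct / len(gold)
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```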
Some algorithms (e.g., corpus-based algorithms) may have no use for WordNet annotations. It might also be argued that the WordNet annotation is not practical in a real application. Therefore we will ask teams to indicate, when they submit their answers, whether their algorithms used the WordNet labels. We will group the submitted answers into those that used the WordNet labels and those that did not, and we will rank the answers in each group separately. Teams will be allowed to submit both types of answers, if their algorithms permit it.
Our collected training and testing data, including all annotation, will be released under a Creative Commons License [5].
References

Cafarella, M.J., Banko, M., and Etzioni, O. (2006). Relational Web Search. University of Washington, Department of Computer Science and Engineering, Technical Report 2006-04-02.
Chklovski, T., and Pantel, P. (2004). VerbOcean: Mining the Web for fine-grained semantic verb relations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-04), pp. 33-40, Barcelona, Spain.
Gildea, D., and Jurafsky, D. (2002). Automatic labeling of semantic roles. Computational Linguistics, 28(3):245-288.
Girju, R. (2001). Answer fusion with on-line ontology development. In Proceedings of the Student Research Workshop at the Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2001), Pittsburgh, PA.
Girju, R., Moldovan, D., Tatu, M., and Antohe, D. (2005). On the semantics of noun compounds. Computer Speech and Language, 19:479-496.
Hearst, M. (1992). Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), pp. 539-545.
Lapata, M. (2002). The disambiguation of nominalisations. Computational Linguistics, 28(3):357-388.
Moldovan, D., Badulescu, A., Tatu, M., Antohe, D., and Girju, R. (2004). Models for the semantic classification of noun phrases. In Proceedings of the Computational Lexical Semantics Workshop at HLT-NAACL 2004, pp. 60-67, Boston, MA.
Nastase, V., and Szpakowicz, S. (2003). Exploring noun-modifier semantic relations. In Fifth International Workshop on Computational Semantics (IWCS-5), pp. 285-301. Tilburg, The Netherlands.
Nastase, V., Sayyad-Shirabad, J., Sokolova, M., and Szpakowicz, S. (2006). Learning noun-modifier semantic relations with corpus-based and WordNet-based features. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI 2006). Boston, MA.
Rosario, B., and Hearst, M. (2001). Classifying the semantic relations in noun-compounds via a domain-specific lexical hierarchy. In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing (EMNLP-01), pp. 82-90.
Rosario, B., Hearst, M., and Fillmore, C. (2002). The descent of hierarchy, and selection in relational semantics. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), Philadelphia, PA, pp. 417-424.
Stephens, M., Palakal, M., Mukhopadhyay, S., Raje, R., and Mostafa, J. (2001). Detecting gene relations from MEDLINE abstracts. In Proceedings of the Sixth Annual Pacific Symposium on Biocomputing, pp. 483-496.
Tatu, M., and Moldovan, D. (2005). A semantic approach to recognizing textual entailment. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP 2005), pp. 371-378, Vancouver, Canada.
Turney, P.D. (2005). Measuring semantic similarity by latent relational analysis. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI-05), pp. 1136-1141, Edinburgh, Scotland.
Turney, P.D., and Littman, M.L. (2005). Corpus-based learning of analogies and semantic relations. Machine Learning, 60(1-3):251-278.
[1] http://www.senseval.org/
[2] http://framenet.icsi.berkeley.edu/
[3] http://www.cis.upenn.edu/~mpalmer/project_pages/ACE.htm
[4] http://wordnet.princeton.edu/
[5] http://creativecommons.org/