STOCKHOLMS UNIVERSITET
Institutionen för lingvistik

Using WordNet and Semantic Similarity to Disambiguate an Ontology

Martin Warin

Abstract

In this work I have used a number of different semantic similarity measures and WordNet to enrich an ontology (the Common Procurement Vocabulary) with information about its leaf-nodes. Nouns are extracted from the leaf-nodes and compared for semantic similarity against the nouns extracted from their parent-node. The leaf-node noun-senses with the highest overall similarity to the parent-node noun-senses are seen as good descriptors of the leaf-node and are added to the ontology. It is shown that the similarity measure proposed by Leacock and Chodorow is the most suitable for this task, out of the five measures compared. Leacock-Chodorow shows average precision between 0.644 and 0.711 and recall between 0.534 and 0.977, depending on the kind of threshold used. Baseline average precision peaks at 0.592. In the end, this work aims to contribute to the work of Henrik Oxhammar (PhD student at Stockholm University), who is investigating the possibilities of automatically classifying product descriptions on the WWW according to the Common Procurement Vocabulary. For this to be possible, more information about the leaf-nodes than the ontology currently contains is needed.

D-uppsats (Master's thesis) in General Linguistics with a specialization in Computational Linguistics, 10 credits, spring term 2004. Supervisor: Martin Volk

Contents

Part I   Introduction
1  Structure
2  Notation
3  Aim
4  Material
   4.1  The Common Procurement Vocabulary
   4.2  WordNet
        4.2.1  The synset
        4.2.2  The tennis-problem
   4.3  Measuring semantic similarity
        4.3.1  Semantic similarity versus semantic relatedness
        4.3.2  Information content
        4.3.3  Path length
        4.3.4  Leacock-Chodorow
        4.3.5  Resnik
        4.3.6  Wu-Palmer
        4.3.7  Lin
        4.3.8  Jiang-Conrath
        4.3.9  Semantic relatedness measures
        4.3.10 Evaluating performance of semantic similarity measures
   4.4  Previous work
        4.4.1  Previous work on ontologies
        4.4.2  Previous work using semantic similarity

Part II  Method
5  Description of the program
   5.1  How to find WordNet-senses given a CPV-node
   5.2  Comparing word-senses for semantic similarity
   5.3  When things go wrong
        5.3.1  Error scenario #1: No words in Wleaf
        5.3.2  Error scenario #2: No words in Wparent
        5.3.3  Error scenario #3: Only one and the same word in Wleaf and Wparent
6  Precision and recall in word-sense disambiguation
   6.1  Gold standard
7  Evaluation of the measures
   7.1  Thresholds

Part III Results, discussion and conclusion
8  Results
9  Discussion
   9.1  Quantitative analysis of results
   9.2  Qualitative analysis of results
   9.3  Biased towards Resnik?
10 Conclusion
   10.1 Future work

A  Test set

Part I

Introduction

1 Structure

First of all, some matters concerning the notation are explained in Section 2. After that, the aim of this work is presented in Section 3. In Section 4 I explain the two sources of data - the Common Procurement Vocabulary (Section 4.1) and WordNet (Section 4.2) - what they are, how they are structured and how they can be used. After that comes an introduction to semantic similarity and some ways to measure it (Section 4.3). This is followed by some previous research on ontologies and some previous work using semantic similarity measures (Section 4.4). Part II describes how the task was executed and evaluated, and in Part III the results are presented and discussed, followed by conclusions and thoughts about future work in this area.

2 Notation

• WordNet synsets are written as the lexical items they contain, e.g. "beet/beetroot".
• WordNet-senses and glosses are written in angle brackets.
• All WordNet-senses and synsets are nouns unless otherwise stated.
• Very long WordNet-glosses will be shortened.
• Nodes in the Common Procurement Vocabulary are written within square brackets: [Beetroot].
• The code of a node in the Common Procurement Vocabulary will only be included when important ([01121111 Beetroot]), and otherwise left out.

3 Aim

The aim of this work is to successfully disambiguate the ontology called the Common Procurement Vocabulary (CPV) using a measure of semantic similarity and WordNet. By disambiguating is meant providing a small ranked list of WordNet-senses for each leaf-node in the ontology. These WordNet-senses should be good candidates for describing the node as a whole, or parts of it. Consider the CPV-node [Jackets], which is a child of [Jackets and blazers]. There are five different senses of "jacket" in WordNet. To disambiguate "jacket", we compare these five senses with the information found in the parent node: the single sense of "blazer". If the semantic similarity measure we use is accurate, we will find that the "jacket"-sense most similar to the "blazer"-sense is the garment sense. This sense will top the list of senses that will be assigned to [Jackets], followed by the other senses. Primarily, the reason for doing this work is to enable automatic classification of products or services according to the Common Procurement Vocabulary. Henrik Oxhammar (PhD student at Stockholm University) is working on such a project. He is building a vector space model, where products will be matched to nodes in the Common Procurement Vocabulary. The nodes must first be broadened by building "clouds" of semantically similar items around them. This work will hopefully also be of use to anyone wishing to add information from WordNet to their ontology or to disambiguate it with WordNet-senses. In one sentence, this is what I want to find out: is the method described in this work sufficient for successfully disambiguating the leaf-nodes of a whole ontology?

4 Material

In this section I present the material I have used. First is the Common Procurement Vocabulary (4.1), the ontology that is to be enriched with information about its leaf-nodes. Second is WordNet (4.2), the lexical database from which the enriching information is taken. After that, I describe the WordNet::Similarity package (4.3), a collection of tools for measuring semantic similarity; it is with the help of these tools that it is decided which pieces of information are to be used for enriching. Last in this section comes some information about previous work in this field (4.4).

4.1 The Common Procurement Vocabulary

The Common Procurement Vocabulary (henceforth CPV) is an EU standard document, compiled by the EU project Simap, whose objective is to develop information systems supporting effective procurement within the union. The CPV is used for classifying products or services in common procurement, and using it is recommended by the Commission. They argue that the CPV improves the openness and transparency of common procurement, and that it helps in overcoming language barriers [European Union and European Parliament 2002]. As of now, the CPV is available in 11 European languages: Spanish, Danish, German, Greek, English, French, Italian, Dutch, Portuguese, Finnish and Swedish.1

It can be downloaded at http://simap.eu.int/EN/pub/src/main5.htm, last visited September 27, 2004


The CPV is an ontology, i.e. a collection of knowledge about a domain, structured in a meaningful way. Its nodes consist of a unique eight-digit code and a classification, e.g. "30233171-0 Disks.". The ninth digit is only used for control purposes and can in practice be ignored; in fact, that is what will be done throughout the rest of this work. The codes are structured in such a way that they represent hierarchical relationships (IS-A or hypernymy / hyponymy relations) between nodes.

30000000 [Office and computing machinery, equipment and supplies.]
30200000 [Computer equipment and supplies.]
30230000 [Computer hardware.]
30233000 [Storage, input or output units.]
30233100 [Computer storage units.]
30233170 [Storage media.]
30233171 [Disks.]
Figure 1: CPV: The ancestors to "Disks".

Parent:   01117100 [Raw vegetable materials used in textile production.]
Children: 01117110 [Cotton.]
          01117120 [Jute.]
          01117130 [Flax.]
Figure 2: CPV: Siblings.

The design is fairly simple. A top-level node has a code with three digits to the left identifying the branch, followed by zeros. The next generation inherits these three digits and adds a non-zero digit of its own, the next generation in turn inherits these four digits and adds a non-zero digit of its own, etc. If a node has children, the first child inherits the parent code and substitutes the leftmost zero with 1, the second child substitutes it with 2 instead, etc. A case like this is illustrated in Figure 2. By substituting the rightmost non-zero digit with a zero in any given CPV-code, one gets the code belonging to its parent. It is also possible to find the siblings of any given CPV-node by systematically substituting the rightmost non-zero digit with all other non-zero digits (a small sketch of this code arithmetic follows Figure 3 below). The CPV contains 8323 nodes of the type discussed above, out of which 5644 are leaf-nodes. Apart from these, there is also a shorter list of supplementary vocabulary with a code structure of its own. It is used in combination with regular CPV-nodes to further describe them, e.g. what they will be used for (Figure 3). This supplementary vocabulary will not be treated in this work.

E001-0 For cleaning
E002-7 For children
E003-4 For day homes
E004-1 For graphic purposes
E005-8 For access
Figure 3: CPV: Supplementary vocabulary.
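The code arithmetic described above is easy to make concrete. The following is a minimal Python sketch for illustration only; it is not part of the thesis software, and the helper names are hypothetical.

def parent_code(code):
    """Return the parent of an eight-digit CPV code by substituting the
    rightmost non-zero digit with a zero, as described above."""
    digits = list(code)
    for i in range(len(digits) - 1, -1, -1):
        if digits[i] != "0":
            digits[i] = "0"
            return "".join(digits)
    return code  # a code consisting only of zeros has no parent

def sibling_codes(code):
    """Return the candidate sibling codes, obtained by substituting the
    rightmost non-zero digit with every other non-zero digit."""
    digits = list(code)
    for i in range(len(digits) - 1, -1, -1):
        if digits[i] != "0":
            return ["".join(digits[:i]) + d + "".join(digits[i + 1:])
                    for d in "123456789" if d != digits[i]]
    return []

# Example from Figure 1: the parent of 30233171 [Disks] is 30233170 [Storage media].
assert parent_code("30233171") == "30233170"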

During my work with the CPV, I have found a few oddities, for example that [28528000 Swords] would be the parent of [28528900 Metal fittings], and that [30192130 Pencils] would be a kind of [30192100 Erasers]. Such oddities are probably due to the fact that the CPV has been revised several times. They are very few as far as I have seen, and they do not seem to affect the outcome in any significant way.

4.2 WordNet

WordNet is a semantic network database for English, developed at Princeton under the direction of George A. Miller. The latest version is WordNet 2.0; however, the version used in this work is 1.7.1, from 2001.2 Versions for other languages have been made, e.g. EuroWordNet [Vossen 1998], where cross-lingual links also exist. The basic building-block in WordNet is the synset (Figure 4). A synset is a set of synonyms denoting the same concept, paired with a description (a.k.a. gloss) of the synset. The synsets are interconnected with different relational links, such as hypernymy (is-a-kind-of), meronymy (is-a-part-of), antonymy (is-an-opposite-of) and others. Since only nouns are used in this work, only noun synsets will be discussed here. The synsets for the other parts of speech are constructed in a slightly different fashion; the interested reader may consult [Fellbaum 1998], chapter 2 for adjectives and adverbs and chapter 3 for verbs. The lexical items in a synset are not absolute synonyms; rather, they are interchangeable within some context ([Fellbaum 1998], pp. 23-24). The synsets for the different parts of speech are stored in separate files: data.noun, data.verb etc. When using WordNet, the user types in a query, e.g. "beetroot", and the senses of that query are looked up. The user can then choose to look at synsets related to the one shown.

4.2.1 The synset

All noun synsets contain the following data, which can be seen in Figure 4:

• A synset ID (06429642). This is used by all other synsets when referring to this one.

The reason for using WordNet 1.7.1 is that the tool WordNet::Similarity had not been made compatible with WordNet 2.0 by the time I began this project.


06429642 13 n 02 beet 0 beetroot 0 004 @06420440 n 0000 #p 09775909 n 0000 ~06429887 n 0000 ~06429988 n 0000 round red root vegetable
Figure 4: WordNet: A typical synset.

• A two-digit code (13) between 03 and 28, identifying the synset as descending from one of 25 so-called unique beginners. The unique beginners are top nodes in the WordNet hierarchy and can be seen as a sort of semantic prime. Unique beginner 13, which is the case here, is "food". Other unique beginners are 05 "animal, fauna", 20 "plant, flora", 27 "substance" etc. In fact, there is another synset for "beetroot/beet", descending from unique beginner 20 "plant, flora" and with a slightly different description focusing on the plant-sense of the word. This is the case with all polysemous words; they have one synset for every sense of the word, usually descending from different unique beginners.

• A part-of-speech tag. n is for nouns.

• A number (02) indicating how many lexical items the synset contains (beetroot and beet).

• Pairs containing one lexical item and a number (beet 0) indicating where in the file index.sense this sense of the word can be found. index.sense provides an alternate way of searching for synsets, but it will not be used here.

• A three-digit number (004) specifying how many other synsets this synset points to.

• A number of relation pointers. The first relation is always that to the hypernym synset (@06420440), which in this case is "root vegetables"; other relation-type and synset pairs usually follow. In this case the #p stands for a part-holonym relation, and ~ for hyponym relations. Since only the hypernym relation is used in this work, the others will not be discussed here.3 Relation pointers are usually confined to within the parts of speech, i.e. there is no relation pointing from e.g. the noun concept "snow" to the adjective concepts "white" or "cold", but there are relations to other noun concepts such as "water". The exceptions are the relations attribute and domain-term. The domain-term relation first appeared in WordNet 2.0 and is described in Section 4.2.2. The attribute relation is a relation between certain nouns and adjectives; the noun "weight" is an attribute for which the adjectives "light" and "heavy" express values. In WordNet 1.7.1, there are only 325 noun-synsets with attribute relations.

The interested reader can find more about the relations used in WordNet at www.cogsci.princeton.edu/∼wn/man/wninput.5WN.html or in [Fellbaum 1998] pp 37-41.


• A description (also called gloss) of the synset. It is seldom longer than a couple of sentences. Often the description also contains one or a few example sentences, so the user may see how the word is used in context. The example sentences are generally cut off in this work, in order to improve readability.

Another type of file, index.noun (and index.verb etc.), contains one lexical item per line, paired with the synset ID(s) of the synset(s) in which the lexical item occurs (Figure 5). This file is used when a query is sent to WordNet, in order for it to find the synsets to retrieve. Later in this work, index.noun will be used to check whether a word is included in WordNet or not.

beethoven n 2 1 @ 2 0 08870823 06064724
beetle n 2 3 @ ~ #m 2 1 01829893 03236882
beetleweed n 1 2 @ #m 1 0 10192028
beetroot n 2 4 @ ~ #p %p 2 0 09775909 06429642
Figure 5: WordNet: index.noun.
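To make the layout of Figure 5 concrete, here is a minimal parsing sketch in Python. It is an illustration only, not part of the thesis software; the interpretation of the fields follows WordNet's wninput documentation and is an assumption on my part.

def parse_index_noun_line(line):
    """Split one index.noun line into lemma, part of speech and synset offsets.
    The third field gives the number of synsets the lemma occurs in, and the
    synset offsets are the trailing fields of the line."""
    fields = line.split()
    lemma, pos = fields[0], fields[1]
    synset_count = int(fields[2])
    offsets = fields[-synset_count:]
    return lemma, pos, offsets

# The "beetroot" line from Figure 5:
lemma, pos, offsets = parse_index_noun_line(
    "beetroot n 2 4 @ ~ #p %p 2 0 09775909 06429642")
print(lemma, offsets)   # beetroot ['09775909', '06429642']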

4.2.2 The tennis-problem

In WordNet version 1.7.1, the relations between noun synsets are antonymy, hypernymy, hyponymy, holonymy and meronymy. With these, it is possible to relate "tennis" with "badminton" (a coordinate term, i.e. both have the same hypernym: both "tennis" and "badminton" are kinds of "court game"), with "court game" (a hypernym), with "doubles" (a hyponym: "doubles" is a kind of "tennis") and with "set point" (a meronym: a "set point" is a part of "tennis"). There are no antonyms, i.e. opposites, to "tennis", but there is an antonymy relation between e.g. "defeat" and "victory". But there is no way to link "tennis" to many things that most humans would agree really relate to "tennis" in some way ("forehand", "racquet", "serve" etc.). With WordNet 2.0 came the relation "domain-term", with which these loose relations can be captured. It works in two ways: one can either find the domain(s) in which "serve" occurs, or search for the terms that come with a domain. This relation might prove useful in tasks such as the one undertaken in this work. However, there are as yet no semantic similarity tools which can exploit this relation. In the WordNet::Similarity package, the only measure that uses WordNet relations other than hyponymy relations is the semantic relatedness measure Hirst-St-Onge [Hirst and St-Onge 1998]. However, the domain-term relation is not used even by the implementation in version 1.0, which is adapted to WordNet 2.0. This might change in coming versions.4

Ted Pedersen, personal correspondence, August 2004.


4.3 Measuring semantic similarity

The task at hand in this work is to find WordNet-senses which describe a leaf-node in the CPV (the node as a whole or parts of it) in the context of its parent node. In practice, this means that all nouns are extracted from the leaf-node and its parent node, and all nouns extracted from the leaf-node are evaluated for semantic similarity against all nouns from the parent node. This is described in more detail in Part II. How, then, are the nouns evaluated for semantic similarity? Luckily, there is a Perl package called WordNet::Similarity [Pedersen et al. 2004], containing implementations of eight algorithms for measuring semantic similarity or relatedness. In this work, WordNet::Similarity version 0.09 is used.5 Out of these eight measures, I have chosen to run a test on five of them; this is described in Sections 4.3.10 and 7. The five measures used here will be referred to as the Leacock-Chodorow measure ([Leacock and Chodorow 1998]), the Jiang-Conrath measure ([Jiang and Conrath 1997]), the Lin measure ([Lin 1998]), the Resnik measure ([Resnik 1995b]) and the Wu-Palmer measure ([Wu and Palmer 1994]). The reasons for choosing these are partly that they are all designed for dealing with semantic similarity, as opposed to semantic relatedness, within IS-A taxonomies (although the Lin measure is intended to be useful in nearly any environment, see 4.3.7 below), and partly time limitations. The measures Lin, Resnik and Jiang-Conrath are based on the notion of information content (4.3.2), and the measures Leacock-Chodorow and Wu-Palmer are based on path length. When describing the five measures below, I will give examples of how they differ in scoring semantic similarity. It must be noted, however, that one cannot tell how good a measure is just by looking at its scores. Only when the measures are used to perform a task can one actually evaluate how well they work ([Budanitsky and Hirst 2004], p. 2).

4.3.1 Semantic similarity versus semantic relatedness

Semantic similarity is a special case of semantic relatedness. Semantic relatedness is the question of how related two concepts are, using any kind of relation, whereas semantic similarity only considers the IS-A relation (hypernymy / hyponymy). "Car" and "gasoline" may be closely related to each other, e.g. because gasoline is the fuel most often used by cars. "Car" and "bicycle" are semantically similar, not because they both have wheels and means of steering and propulsion, but because they are both instances of "vehicle". The relation between semantically similar and semantically related is asymmetric: if two concepts are similar, they are also related, but they are not necessarily similar just because they are related.

WordNet::Similarity can be downloaded freely at: groups.yahoo.com/group/wn-similarity/. There is also a web interface at: www.d.umn.edu/∼mich0212/cgi-bin/similarity/similarity.cgi. Last visited September 27, 2004.


Table 1: The effect of using different corpora on the information content-based measures' correlation with human judges.
Corpus                    Size (word tokens)   Resnik   Lin     Jiang-Conrath
SemCor                    ~200,000             0.714    0.698   0.727
Brown Corpus              ~1 million           0.730    0.744   0.802
Penn Treebank             ~1.3 million         0.746    0.749   0.826
British National Corpus   ~100 million         0.753    0.751   0.812

There are three measures for semantic relatedness in WordNet::Similarity: Extended gloss overlap [Banerjee and Pedersen 2003], Vector [Patwardhan 2003] and Hirst-St-Onge [Hirst and St-Onge 1998]. Because of time limitations, they were not included in this work. How they perform at tasks similar to the one at hand, compared to the semantic similarity measures used here, remains to be seen. These measures will be briefly discussed in Section 4.3.9.

4.3.2 Information content

Since three of the five measures below are based on the notion of information content, here follows a short description of what that is, after which the semantic similarity measures are explained. Each node in the WordNet noun-taxonomy is a synset (4.2.1). The information content of a node is the negative logarithm of the sum of the probabilities (computed from corpus frequencies) of all words in that synset ([Resnik 1995b]). Let x be a synset in WordNet and p(x) the probability of encountering an instance (i.e. one of the lexical items in the synset) of x, computed from a corpus. The information content of x is then -log p(x). The default counting method in WordNet::Similarity is to count an occurrence of e.g. the word "beet" as an instance of each synset it occurs in, i.e. the frequencies of both "beet" as a plant and "beet" as a food increase by one. The smoothing technique used is add-1. By default in WordNet::Similarity, word-sense frequencies are computed from the SemCor corpus [Landes et al. 1998], a portion of the Brown Corpus tagged with WordNet-senses; this default is used in this work too. [Patwardhan 2003] (pp. 41-43) has shown that the performance of the information content-based measures is noticeably influenced by the corpus used, as can be seen in Table 1. The numbers in Table 1 show the correlation with human judges ([Miller and Charles 1991]; I shall return to this experiment in 4.3.10). All measures correlate more closely with human judgement as the corpus size increases. The only exception is the Jiang-Conrath measure, which has the highest correlation using the second largest corpus. [Patwardhan 2003] believes that the Jiang-Conrath measure is relatively independent of corpus size, once the corpus is above a certain minimum size and is relatively balanced (p. 42).
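As a minimal illustration of the information content computation described above (this is not the WordNet::Similarity internals; the data structures and the add-1 normalization shown here are assumptions):

import math
from collections import Counter

def information_content(freq, total, synset):
    """IC(x) = -log p(x), where p(x) is estimated from corpus counts of the
    lexical items in synset x, with add-1 smoothing so that unseen synsets
    still get a non-zero probability."""
    count = freq.get(synset, 0) + 1            # add-1 smoothing
    probability = count / (total + len(freq))  # one possible normalization
    return -math.log(probability)

# Toy example with a tiny "corpus" of synset counts.
freq = Counter({"food#beet": 3, "plant#beet": 1, "root_vegetable": 10})
total = sum(freq.values())
print(information_content(freq, total, "food#beet"))   # lower IC: seen several times
print(information_content(freq, total, "unseen"))      # highest IC in this toy corpus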

4.3.3 Path length

The two measures Leacock-Chodorow and Wu-Palmer are based on path length. Simply counting the number of nodes or relation links between nodes in a taxonomy may seem a plausible way of measuring semantic similarity: the lower the distance between two concepts, the higher their similarity. However, this has proved not to be a successful method of measuring semantic similarity. A problem with this method is that it relies on the links in the taxonomy representing uniform distances ([Resnik 1995b]). Therefore, most similarity or relatedness measures based on path length use some value to scale the path length with. Path length measures have the advantage of being independent of corpus statistics, and are therefore uninfluenced by sparse data.

4.3.4 Leacock-Chodorow

The Leacock-Chodorow measure [Leacock and Chodorow 1998] is not based on the notion of information content, unlike Lin, Resnik and Jiang-Conrath, but on path length. The similarity between two concepts a and b is the negative logarithm of the number of nodes along the shortest path between them, divided by double the maximum depth D (from the lowest node to the top) of the taxonomy in which a and b occur. The number of nodes between two siblings, i.e. two nodes with the same parent node, is three.

sim_{LCH}(a,b) = \max\left[-\log\frac{length(a,b)}{2D}\right]

The Leacock-Chodorow measure assumes a virtual top node dominating all nodes, and will therefore always return a value greater than zero as long as the two concepts compared can be found in WordNet, since there will then always be a path between them. No corpus data is used by the Leacock-Chodorow measure, so it cannot be affected by sparse data problems. The Leacock-Chodorow measure gives a score of 3.583 for maximum similarity (at least when using WordNet 1.7.1; versions with a different depth will give a different value), that is, the similarity of a concept with itself. For the example pair of related concepts used in this section it gives a score of 2.484, and it gives the same score when the second concept is replaced by one of its own immediate children.
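A quick sanity check on the numbers reported above, as a Python sketch. The maximum depth of 18 is an assumption chosen because it reproduces the reported maximum score; the real value depends on how depth is counted in WordNet 1.7.1.

import math

def sim_lch(path_length, max_depth):
    """Leacock-Chodorow similarity: -log(length(a, b) / (2 * D))."""
    return -math.log(path_length / (2 * max_depth))

print(round(sim_lch(1, 18), 3))   # 3.583: a concept compared with itself (path of one node)
print(round(sim_lch(3, 18), 3))   # 2.485: two siblings (three nodes on the path, as defined above)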

4.3.5 Resnik

The semantic similarity measure proposed in [Resnik 1995b] is defined as follows: the similarity score of two concepts in an IS-A taxonomy equals the information content value of their lowest common subsumer (LCS, the lowest node subsuming / dominating them both).


sim_{RES}(a,b) = \max\left[IC(LCS(a,b))\right]

Resnik's measure does not capture the difference in semantic similarity in the following case: assume three concepts a, b and c_a, where c_a is a special instance of a. The lowest common subsumer for a and b is then the same as for c_a and b, so in practice the Resnik score for a and b is no different from the score for c_a and b. The Resnik score for two identical concepts still equals the information content value of their lowest common subsumer. This means that the Resnik score for a compared with itself differs from the score for c_a compared with itself, since the lowest common subsumers (which in this special case are an immediate parent) do not have exactly the same frequency in SemCor. This is rather counterintuitive; is a not as semantically similar to itself as c_a is to itself? The Resnik measure is the only one of the five measures discussed here that does not give the same value for every pair of identical concepts. If the only LCS is the virtual root node, Resnik returns zero.

4.3.6 Wu-Palmer

The semantic similarity measure in [Wu and Palmer 1994] was first used in a system for English-Mandarin machine translation of verbs. Like the Leacock-Chodorow measure, it is not based on information content but on path length. It is defined as:

sim_{WUP}(a,b) = \max\left[\frac{2 \times depth(LCS(a,b))}{length(a,b) + 2 \times depth(LCS(a,b))}\right]
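The same formula as a small Python sketch, for illustration only; the edge-counting convention used for identical concepts here is an assumption on my part.

def sim_wup(depth_lcs, path_length):
    """Wu-Palmer similarity: 2 * depth(LCS) / (length(a, b) + 2 * depth(LCS))."""
    return (2 * depth_lcs) / (path_length + 2 * depth_lcs)

# With a path length of 0 for identical concepts, the score is exactly 1, matching
# the behaviour described below; any positive path length gives a score below 1,
# so scores above 1 can only arise from an inconsistent depth/length computation.
print(sim_wup(depth_lcs=5, path_length=0))              # 1.0
print(round(sim_wup(depth_lcs=5, path_length=2), 2))    # 0.83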

Though this formula may look more complex than the others, it is rather straightforward: twice the depth of the lowest common subsumer of the two concepts is divided by the length (number of nodes) between the concepts plus twice that depth. Two identical concepts receive similarity score 1 from Wu-Palmer. Sometimes, due to a bug in the implementation, similarity scores higher than 1 can be assigned to concepts.7 For one of the pairs tested, there are paths of different lengths from the (virtual) top node to the two concepts. This somehow results in an LCS that the algorithm (or the implementation of it) believes is situated deeper than the concepts it subsumes, and the similarity score reported is 1.2; another pair also results in a score above 1, namely 1.125. Despite this bug, the measure will be evaluated with the others; it need not be a fatal flaw. The same thing goes for Wu-Palmer as for Leacock-Chodorow: both will always return a value greater than 0, and neither is affected by sparse data problems.

Jason Michelizzi, personal correspondence, August 2004.


4.3.7 Lin

In [Lin 1998], the semantic similarity between two concepts a and b in a taxonomy is defined as:

sim_{LIN}(a,b) = \max\left[\frac{2 \times \log p(LCS(a,b))}{\log p(a) + \log p(b)}\right]

The algorithm is intended to be useful in any domain, as long as a probabilistic model of that domain exists, unlike Resnik, Jiang-Conrath and Leacock-Chodorow, which all presuppose a taxonomy. Lin motivates his measure with two arguments: that there were no similarity measures available that were not tied to a particular application or domain model (such as a taxonomy), and that the fundamental assumptions of previous similarity measures were not explicitly stated. Lin is very firm on the second point: he lists the intuitions and assumptions underlying the measure, and then gives a logical proof that the measure actually conforms to them. The Lin measure gives scores between 0 and 1. The closely related example pair used in this section gets a score of 0.935, meaning that the two senses are highly similar; two identical concepts are scored 1, and a pair involving a sense unseen in the corpus is scored 0. The zero score is caused by sparse data, as described in 4.3.8 below.

4.3.8 Jiang-Conrath

Originally, the Jiang-Conrath measure [Jiang and Conrath 1997] measures semantic distance, i.e. a high number indicates less similarity between two concepts than a lower number does. In the WordNet::Similarity implementation the scoring has been inverted,

sim_{JCN}(a,b) = \frac{1}{dist_{JCN}(a,b)}

so that a high value indicates greater semantic similarity between two concepts than a low value. The Jiang-Conrath measure can be weighted so that it takes into consideration factors such as the depth at which the compared concepts are situated, and the fact that some areas of a taxonomy are denser (i.e. nodes have more children) than others. When, as in the WordNet::Similarity implementation, these weights are not used, the un-inverted Jiang-Conrath measure (showing semantic distance) can be defined as:

dist_{JCN}(a,b) = \max\left[IC(a) + IC(b) - 2 \times IC(LCS(a,b))\right]

The fact that the information content of a, b and LCS(a, b) must be known also makes the Jiang-Conrath measure especially sensitive to sparse data problems. Identical concepts are scored close to 29 million, which might seem a bit strange, since the closely related example pair used in this section is only scored 0.665. This has to do with the original nature of the Jiang-Conrath algorithm: the semantic distance between two identical concepts must be zero, and when inverted, the value must be very high. A pair involving the sense for "cortland" gets a score of zero, because "cortland" was never seen in the corpus. Add-1 smoothing gives it a frequency of 1 anyway, but then the information content of "cortland" becomes very high and would indicate high similarity with practically any concept. Therefore, the implementation returns zero whenever either of the two concepts has frequency 1 (this is discussed further at http://groups.yahoo.com/group/wn-similarity/message/7 and http://groups.yahoo.com/group/wn-similarity/message/8, last visited September 27, 2004).
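A sketch of the inverted scoring and the sparse-data workaround just described, for illustration only; the actual WordNet::Similarity implementation reports a very large finite value rather than infinity for identical concepts.

def sim_jcn(ic_a, ic_b, ic_lcs, freq_a, freq_b):
    """Inverted Jiang-Conrath score: 1 / (IC(a) + IC(b) - 2 * IC(LCS(a, b))).
    Returns 0 when either concept has corpus frequency 1 (i.e. was unseen
    before add-1 smoothing), mirroring the workaround described above."""
    if freq_a <= 1 or freq_b <= 1:
        return 0.0
    distance = ic_a + ic_b - 2 * ic_lcs
    if distance == 0:               # identical concepts: zero distance
        return float("inf")         # stands in for the very large value reported
    return 1.0 / distance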

4.3.9 Semantic relatedness measures

A semantic relatedness measure based on gloss overlaps is presented in [Banerjee and Pedersen 2003]. The core idea is that if the glosses (definitions) of two concepts share words, then the concepts are related; the more words the two concept glosses share, the more related the concepts are. The glosses in WordNet tend to be too short for this kind of evaluation to be successful. Therefore, the glosses of the concepts related by hypernymy, hyponymy, meronymy and holonymy are also taken into consideration. The relatedness score for this measure is the number of words occurring both in the gloss of concept a together with the glosses of the concepts related to it (by the above relations), and in the gloss of concept b together with the glosses of the concepts related to it. Multi-word units are given extra weight (the number of words in the multi-word unit is squared). Its correlation with human judgement is reported as 0.67 by [Banerjee and Pedersen 2003]. Another semantic relatedness measure, proposed in [Patwardhan 2003], uses context vectors combining gloss overlaps and corpus statistics. This measure shows very high correlation with human judgement, 0.877 (p. 40), which will be shown below to be close to the upper bound. The Hirst-St-Onge measure [Hirst and St-Onge 1998] treats different relations in WordNet as directions: hypernymy and meronymy are upwards, antonymy and attribute are horizontal, etc. Similarity between two concepts (the original algorithm does not support single senses being compared, but this has been made possible in the WordNet::Similarity implementation) is computed by taking the shortest path minus the number of changes in direction. The Hirst-St-Onge measure has a correlation with human judgement of 0.68 according to [Seco et al. 2004].

4.3.10 Evaluating performance of semantic similarity measures

When a measure of semantic similarity (for English) is evaluated, the Gold Standard most often used is [Miller and Charles 1991]. In that experiment, 30 pairs of nouns were given to 38 undergraduate students, who were asked to give "similarity of meaning" ratings for each pair. The scale went from 0 (no similarity) to 4 (perfect similarity). The average score for each pair was seen as a measure of how semantically similar humans found that pair of words. This experiment was later replicated by [Resnik 1995b] on a smaller scale, using the same word pairs but only ten subjects. The outcome was almost entirely consistent with [Miller and Charles 1991].


[Resnik 1995b] also found that the average correlation between his subjects and [Miller and Charles 1991] was r = 0.885. This is usually seen as the upper bound on what a computer might achieve at the same task. In [Seco et al. 2004] all modules in the WordNet::Similarity package (and their own implementations of the same algorithms) were evaluated with the same noun pairs as in [Miller and Charles 1991]. The scores for the five measures used here (WordNet::Similarity implementations) according to [Seco et al. 2004] can be seen in Table 2. One should note that [Seco et al. 2004] used WordNet 2.0, so the numbers for WordNet 1.7.1 might be somewhat different. The numbers also differ somewhat from those in Table 1, probably because different versions of both WordNet and WordNet::Similarity were used, and because [Patwardhan 2003] and [Seco et al. 2004] used different corpora and smoothing techniques for the information content-based measures.

Table 2: Five semantic similarity measures' correlation with human judgement.
Leacock-Chodorow   0.82
Jiang-Conrath      0.81
Lin                0.80
Resnik             0.77
Wu-Palmer          0.74

[Budanitsky and Hirst 2004] evaluated the measures Lin, Resnik, Leacock-Chodorow, Jiang-Conrath and Hirst-St-Onge (their own implementations) in a malapropism detection and correction task. They found that Jiang-Conrath performed best, followed by Lin and Leacock-Chodorow in shared second place, then Resnik, followed by Hirst-St-Onge. They used the Brown corpus for word counts. In order to see how well the different measures perform at the task at hand in this work, I ran a test, letting the five measures disambiguate 73 CPV-nodes. The test is described in Section 7, and the results are presented and discussed in Section 8 and Part III.

4.4 Previous work

4.4.1 Previous work on ontologies

The construction and maintenance of an ontology is difficult, expensive and highly time-consuming, as one must train domain experts in formal knowledge representation ([Faatz and Steinmetz 2002], p. 2). Therefore, every way in which ontology construction and / or maintenance can be successfully automated or semi-automated is welcome. Ontology enrichment is one such way. This usually means that new nodes are created in an existing ontology. New concepts are identified by mining large corpora: either domain-specific corpora such as a collection of computer magazines, general-purpose corpora such as the Brown corpus, or corpora constructed from search-engine results.


Ontologies covering e.g. technical domains constantly need updating, since new products with new names appear all the time. A new clustering algorithm for finding new synonyms for a concept (such as "P II" for "Pentium 2") is presented in [Valarakos et al. 2004]. If there exists a node "Pentium 2", new versions of that name will be clustered around the existing node. New names which are not (orthographically) similar enough to any existing node will result in a new node. In [Roux et al. 2000] an information extraction system for the domain of genomics is presented. Its knowledge of the domain is represented in an ontology as verb frames such as represses(protein(Antp), gene(BicD)), meaning that the Antp protein represses the BicD gene. The verb repress with a protein as its first argument will usually take a gene as its second argument, so the system can learn new gene names by looking for any frame matching repress(protein(), ()). [Agirre et al. 2000] wanted to find out if they could collapse some of the senses in WordNet, arguing that WordNet often makes too fine distinctions between concepts. They also wanted to find information about topical relations that WordNet could be enriched with. For this, they constructed queries like "(all words in the description of WordNet-sense x) NOT (all words in the descriptions of the other WordNet-senses for that word)" for all senses of 20 words. These queries were sent to AltaVista, and the first 100 retrieved documents were tied to the word-sense for that particular query. When all word-senses had a collection of documents, the documents for each word-sense were compared to the documents belonging to all the other senses of that word. The words (in the documents) with a distinctive frequency in one of the collections were seen as topic signatures for the word-sense to which their collection was tied. They were then able to cluster senses of a word which had a high overlap in topic-signature words, thus reducing the number of sense distinctions in WordNet. The topic-signature words for a sense could also be used for topical relations in WordNet.

4.4.2 Previous work using semantic similarity

In [Resnik 1995a] the semantic similarity measure described in [Resnik 1995b] is used to disambiguate groups of similar polysemous nouns. He observes that when two polysemous nouns are similar, i.e. they have one sense each which are similar, their lowest common subsumer tells us which of their senses are relevant. For instance, take "doctor" and "nurse". They are both polysemous, and the lowest common subsumer for all of their senses taken together is "health professional". When this is known, the senses descending from "health professional" can be assigned. An algorithm disambiguating large groups of nouns (groups like "tie, jacket, suit" or "head, body, hands, eye, voice, arm, seat, hair, mouth") was proposed, and its results were almost comparable to human judgement. Two human judges were independently given the same set of 125 noun groups from Roget's thesaurus, together with one of the words in the group to disambiguate. The human judges were forced to choose only one of the word's WordNet-senses.

Table 3: Evaluation of Resnik's disambiguation algorithm.
          Upper bound %   Algorithm correct %   Random correct %
Judge 1   65.7            58.6                  34.8
Judge 2   68.6            60.5                  33.3

They also gave a value indicating how confident they were that this was the one correct sense. Only answers with confidence higher than 1 (on a scale of 0-4) were kept. Results from Judge 2 were used to create an upper bound for the results on Judge 1, and vice versa. A baseline was also created by randomly choosing senses. It is shown in Table 3 that the algorithm performs rather well compared to the upper bound.

Part II

Method

In this part, I describe how I have solved the task at hand: how candidates for enriching a leaf-node were found and evaluated, how certain errors were worked around, and how the test and the Gold Standard were designed.

5 Description of the program

In this section I explain how my program works, in order to facilitate comparison and repeatability. First of all, let me briefly recapitulate what it is I am doing. In this work, the leaf-nodes in the CPV have been selected as the nodes to be enriched with information from WordNet. Only leaf-nodes are treated, because they are generally more informative and specific than non-terminal nodes. The nouns from each leaf-node will be extracted, as will the nouns in their parent-nodes. The senses of the nouns in the leaf-node will be compared against the senses of the nouns in the parent node for semantic similarity, using different similarity measures. The senses of the leaf-node nouns which get the highest overall similarity scores, i.e. are most semantically similar overall to the senses of the parent-node nouns, will be selected for enriching the leaf-node. The program takes CPV-nodes as input, and for each node it outputs a ranked list of senses. An overview of the stages involved, from input to output, can be seen in Table 5.

5.1 How to find WordNet-senses given a CPV-node

In order to have full control and overview of what is done, I have chosen not to use the WordNet browser (called "wnb") which is included in the WordNet package. Instead I have written a Perl module that does just the things I need in this work; nothing more, nothing less. This Perl module will simply be referred to as the "WordNet noun-searcher". Its function is to retrieve only WordNet noun-senses, given a word. The first step in the program is finding the parent of each input CPV leaf-node. Then both nodes are transformed into relevant queries which can be sent to my WordNet noun-searcher, which then returns all WordNet noun-senses for each query word. This transformation requires that all words are lemmatized. For this, the program uses bits and pieces from the WordNet lemmatizer called "Morphy". Lemmatizing is very important, since only uninflected word-forms are stored in data.noun and index.noun, and most CPV-nodes are in the plural. Information about the words in the CPV-node [Curtains, drapes, valances and textile blinds.] is stored under "curtain", "drape", "valance", "textile" and "blind" in WordNet. First of all, the lemmatizer looks for the input word in a long list of irregular nouns, where their lemma forms are also listed, enabling easy substitution. If the input word is not found there, a simple set of rules for lemmatizing regular nouns is employed. All rules are tried; the resulting forms which cannot be found in index.noun are immediately discarded, and the ones that can be found are kept. Words that are included in a stoplist, containing 448 common or unwanted words, are filtered out. In order to find compound words as well (only two-word compounds are considered here; there seem to be very few or no three-word compounds in the CPV, and most more-than-two-word compounds in WordNet are of the type "imaginary part of a complex number" or "breach of the covenant of warranty"), all pairs that can be made of the lemmatized words, such that the second word is not the same as or preceding the first word, are also looked up in index.noun. That is, for all words in the original node [w1, w2 ... wn], make pairs of the kind [wx, wy] such that y > x; discard all pairs that cannot be found in index.noun and save those that can (a small sketch of this pairing is given at the end of this section). Generating compounds this way makes sure that, in e.g. the case of the CPV-node [Lead plates, sheets, strip and foil.], not only "lead plate" but also "lead sheet", "lead strip" and "lead foil" are tried. The constraint that y > x also makes sure that, in the case of e.g. [Command and control system.], "control system" and "command system" are looked for, but "system command" is not. The queries that have been extracted from the CPV-node are then sent to the WordNet noun-searcher, which retrieves at least one word-sense per query. These word-senses are saved for later. This procedure is done twice: once for the leaf-node and once for its parent. A second, smaller stoplist, containing the words figurative, slang, informal, astrology, (ethnic) slur, baseball, chess, folklore, old/new testament, archaic, someone, anatomy, psychoanalysis and obscene, is used to filter out some of the WordNet-senses retrieved. If the sense gloss contains any of the above words, it is discarded. All WordNet-senses with years in their gloss are also filtered out. Such concepts often denote historical events or persons. These concepts are not useful when working with the CPV, and are therefore discarded. When working with other ontologies, stoplists must certainly be revised to fit their needs.


This stoplist was manually compiled during the earliest stages of this work. It is a simple and rather crude heuristic for lowering the number of unique comparisons somewhat. Without this stoplist, the number of unique comparisons made for the whole CPV is about 198,000; with it, about 184,000, i.e. roughly 14,000 fewer comparisons. A stoplist of this kind could also have been compiled automatically, by investigating the glosses of senses that are always given low or no similarity. Time limitations prohibited me from investigating this possibility.
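The compound-pair generation with the y > x constraint, described earlier in this section, can be sketched as follows. This is an illustration only; in_wordnet is a hypothetical lookup callback, e.g. a check against index.noun.

def candidate_compounds(words, in_wordnet):
    """Build two-word candidates [wx, wy] with y > x from the lemmatized words
    of a CPV-node and keep only those that WordNet knows about."""
    kept = []
    for x in range(len(words)):
        for y in range(x + 1, len(words)):
            compound = words[x] + " " + words[y]
            if in_wordnet(compound):
                kept.append(compound)
    return kept

# For [Lead plates, sheets, strip and foil.] the words ["lead", "plate",
# "sheet", "strip", "foil"] generate "lead plate", "lead sheet", "lead strip",
# "lead foil", "plate sheet", ... and only the pairs found in index.noun are kept.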

5.2 Comparing word-senses for semantic similarity

From the leaf-node, we now have a set of words Wleaf = {w1 ... wn}, with each word wi having an associated set of senses Si = {s1 ... sn}. There is also a set of words from the parent, Wparent = {w1 ... wn}, with each word wj having a set of senses Sj = {s1 ... sn}. L is the set of all senses associated with the words in Wleaf, and P is the set of all senses associated with the words in Wparent. This corresponds to stage 5 in Table 5. All pairs of L- and P-senses which are not associated with the same word (we do not want to compare senses of two identical words with each other, except in the specific case described in Section 5.3.3) are sent to the WordNet::Similarity module, where their semantic similarity is computed according to the chosen similarity measure. The output from the module is a semantic similarity value for that pair of word-senses. Table 4 shows how the semantic similarity values for all pairs of L- and P-senses are computed; the CPV-node disambiguated is [Beetroot], child of [Root vegetables]. The scores for all pairs containing the same L-sense are added together and assigned to that L-sense. The assumption is that the higher the score, the more likely that L-sense is to be a good description of the CPV-node from which it was extracted. When all L-senses and P-senses have been compared, the L-senses are sorted according to their total score. This corresponds to stages 6 and 7 in Table 5. In Table 4 we see that beetroot#2 got an almost five times higher score than beetroot#1. Most CPV-nodes involve many more senses; in this particular case there were only two, and both of them are good descriptions of beetroots.
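The scoring and ranking just described can be summarized in a few lines of Python. This is an illustration only; similarity and word_of stand for a WordNet::Similarity measure and a sense-to-word lookup, and are assumptions on my part, not the author's code.

from collections import defaultdict

def rank_leaf_senses(L, P, similarity, word_of):
    """Sum each L-sense's similarity to all P-senses, skipping pairs whose
    senses belong to the same word, and return the L-senses ranked by score."""
    score = defaultdict(float)
    for a in L:
        for b in P:
            if word_of(a) == word_of(b):
                continue            # senses of identical words are not compared
            score[a] += similarity(a, b)
    return sorted(score.items(), key=lambda item: item[1], reverse=True)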

5.3 When things go wrong

There are a few scenarios in which the program is unable to produce any candidates for enriching a CPV-node. They will be discussed in turn below.

5.3.1 Error scenario #1: No words in Wleaf

In some cases, the words in the leaf-node are all included in the stoplist or simply not in WordNet, and are therefore not put in Wleaf. For instance, the only Wleaf-word from the CPV-node [Infill work.] is "work"; "infill" is not included in WordNet. "work" has been added to the stoplist since, first of all, it is a very common word; it appears 303 times in the leaf-nodes of the CPV.

Table 4: Finding a good description for [Beetroot].
L-sense      P-sense            Similarity
beetroot#1   root#1             1.33735833155429
beetroot#1   root#2             0
beetroot#1   root#3             0.828662971392306
beetroot#1   root#4             0
beetroot#1   root#5             0
beetroot#1   root#7             0
beetroot#1   root#8             0.828662971392306
beetroot#1   vegetable#1        0.828662971392306
beetroot#1   vegetable#2        0
beetroot#1   root vegetable#1   0.828662971392306
beetroot#2   root#1             0.828662971392306
beetroot#2   root#3             0.828662971392306
beetroot#2   root#4             0
beetroot#2   root#5             0
beetroot#2   root#7             0
beetroot#2   root#8             0.828662971392306
beetroot#2   vegetable#1        8.8602465125736
beetroot#2   vegetable#2        0.828662971392306
beetroot#2   root vegetable#1   10.1759233064795
Sum beetroot#1 = 4.65201021712351
Sum beetroot#2 = 22.3508217046223

Word-sense         Gloss
beetroot#1         beet having a massively swollen red root; widely grown ...
beetroot#2         round red root vegetable
root#1             the usually underground organ that lacks buds or leaves ...
root#2             (linguistics) the form of a word after all affixes ...
root#3             the place where something begins, where it springs ...
root#4             a number that when multiplied by itself some number ...
root#5             the set of values that give a true statement when ...
root#7             a simple form inferred as the common basis from ...
root#8             the part of a tooth that is embedded in the jaw ...
vegetable#1        edible seeds or roots or stems or leaves ...
vegetable#2        any of various herbaceous plants cultivated ...
root vegetable#1   any of various fleshy edible underground roots ...

Table 5: A schematic description of the seven stages involved in the program, from input (a CPV leaf-node) to output (a ranked list of senses).

1) Find the parent of the current CPV leaf-node.
   Example: leaf-node = [15223000 Frozen fish steaks], parent = [15220000 Frozen fish, fish fillets and other fish meat]
2) Extract words (nouns that can be found in WordNet) from leaf-node and parent. Lemmatize words, search for compounds.
   Example: Wleaf = {fish, steak, fish steak}, Wparent = {fish, meat, fillet, fish fillet}
3) Filter out stoplisted words.
   Example: Wleaf = {fish, steak, fish steak}, Wparent = {fish, meat, fillet, fish fillet}
4) Look up all remaining words in WordNet and assign senses to all words.
   Example: L = {fish#1-4, steak#1, fish steak#1}, P = {fish#1-4, meat#1-3, fillet#1-5, fish fillet#1}
5) Filter out senses whose gloss contains stoplisted words.
   Example: L = {fish#1-2+4, steak#1, fish steak#1}, P = {fish#1-2+4, meat#1-3, fillet#1-5, fish fillet#1}
6) Compute a similarity score for each pair of L and P senses; assign the score to the L-sense.
   for each pair of senses (a, b) where a ∈ L and b ∈ P { compute similarity(a, b) unless a = b; add the similarity score to score(a) }
7) Sort the L-senses according to score(sense)/score(all senses).
   Example scores: 0.343, 0.343, 0.189, 0.070, 0.054

Second, the 8 senses returned from WordNet given the query "work" are also too generic to be of much use (even the possibly correct ones). Since the semantic similarity computation relies on there being words in both Wleaf and Wparent, something has to be done in order to fill Wleaf with something. I see two solutions. First, one could relax the constraint that Wleaf-words must not be included in the stoplist. Second, one could let the parent node take the place of the leaf-node, and let the grandparent take the place of the parent node; in other words, moving up one generation in the CPV. I have chosen the latter solution, assuming that one will in this way be able to capture some, although often less specific, information relevant to the original leaf-node. Information captured using the first solution will often be too generic, and is less likely to be restricted to the relevant domain.

5.3.2 Error scenario #2: No words in Wparent

Another, similar case is when there are no words in Wparent, for the same reasons as in the case above. Here too, one can either relax the stoplist constraint or move up one generation. I have chosen the same sort of solution here, letting Wparent take words from the grandparent. In the case of [Dental hygiene products], the parent is [Dental consumables]. Wleaf will here consist of only "hygiene", since "product" is stoplisted and "dental" is not in WordNet as a noun. Using the adjective form of "dental" is not really an option, since the similarity measures used here are confined to within part-of-speech boundaries. Wparent will be empty, since "consumable" is also not in WordNet as a noun. Instead, the grandparent [Disposable non-chemical medical consumables and haematological consumables] will be the source of words for Wparent. The domain will hopefully remain the same, even if there is now a wider gap between Wleaf and Wparent.

5.3.3 Error scenario #3: Only one and the same word in Wleaf and Wparent

Another possibility is that Wparent and Wleaf contain only one and the same word. In such cases, the constraint that two identical words may not be sent together for similarity computation is relaxed. It is replaced by a more forgiving constraint saying that two identical words may be sent, but not two identical word-senses. The parent of [Tropical wood] is simply [Wood], so both Wparent and Wleaf will contain the single word "wood". Senses 4-6 of "wood" are left out by the program, since they all denote historical persons. Senses 1, 2, 7 and 8 are compared with each other, but not with themselves. At an earlier stage of this work, whenever Wparent and Wleaf contained one and the same word, Wparent was instead filled with words from the grandparent. This resulted in slightly fewer cases where no candidates were found, but the candidates found with the relaxed constraint were more often relevant.

6 Precision and recall in word-sense disambiguation

Average precision is an evaluation measure widely used in word-sense disambiguation and information retrieval. I find that the easiest way to describe average precision is in terms of information retrieval, so that is what I will do here. Let the words to be disambiguated be the query, and the set of all WordNet-senses the documents in your collection. Assume you want to find the best documents matching the CPV-node [Printing services for commercial catalogues]; then the query you send is "printing, services, commercial, service, catalogue" (extracted from the CPV-node as described in 5.1). The ranked list of retrieved documents for the query above can be seen in Table 6.

Table 6: A ranked list of WordNet-senses.
Ranking   Document
1         [The business of printing]
2         [A commercially sponsored ad on radio or television]
3         [A complete list of things; usually arranged systematically]
4         [A book or pamphlet containing an enumeration of things]
5         [All the copies of a work printed at one time]
6         [Text handwritten in the style of printed matter]
7         [Reproduction by applying ink to paper as for publication]

Now, the average precision of a list such as Table 6 can be computed as follows. First, pick out all correct documents (according to your Gold Standard) and mark them with their position in the original list. Next, for each document in the new list, divide its position in the new list by its position in the old list. The average of these divisions is the average precision for this particular CPV-node (Table 7). Recall is computed as usual, by dividing the number of correct senses found by the total number of correct senses.

Table 7: A demonstration of average precision.
Ranking   Document                                                        New / Old
1         [The business of printing]                                      1/1 = 1
2         [A complete list of things; usually arranged systematically]    2/3 = 0.67
3         [A book or pamphlet containing an enumeration of things]        3/4 = 0.75
4         [Reproduction by applying ink to paper as for publication]      4/7 = 0.57
Average precision = (1 + 0.67 + 0.75 + 0.57) / 4 = 0.7475
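Average precision as defined above, as a small Python sketch (illustration only):

def average_precision(ranked, correct):
    """Keep the correct senses in their ranked order and average
    (position in the filtered list) / (position in the original list)."""
    hit_positions = [pos for pos, sense in enumerate(ranked, start=1)
                     if sense in correct]
    if not hit_positions:
        return 0.0
    return sum(new / old for new, old in enumerate(hit_positions, start=1)) / len(hit_positions)

# Reproducing Table 7: correct documents at original ranks 1, 3, 4 and 7.
print(round(average_precision(["d1", "x", "d2", "d3", "y", "z", "d4"],
                              {"d1", "d2", "d3", "d4"}), 3))
# 0.747 (Table 7 reports 0.7475 because each term is rounded before averaging)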

6.1 Gold standard

In order to be able to determine recall and average precision, one must first have a Gold Standard. A good way of acquiring a Gold Standard would be to compile a list of a number of CPV-nodes and all WordNet-senses which can be extracted from them. Then a number of subjects, preferably native English speakers, would be instructed to (without collaborating) pick out those WordNet-senses which they think are good candidates for describing the whole or parts of each CPV-node. One could then regard the WordNet-senses for a CPV-node which all subjects agreed on as correct. Unfortunately, this was not possible to achieve, because of limited time and resources. Instead, the Gold Standard was made by myself and Henrik Oxhammar. It consists of 73 randomly selected CPV-nodes and the WordNet-senses which we regarded as good candidates for describing the nodes in whole or in part.

7 Evaluation of the measures

For the test, the five measures Leacock-Chodorow, Lin, Jiang-Conrath, Wu-Palmer and Resnik were given the 73 randomly selected CPV-nodes listed in the appendix. After stoplists and constraints have filtered out some of the word-senses extracted from these nodes, there are 362 different word-senses in L-position for the measures to rank, out of which 184 are marked as correct in the Gold Standard. All correct senses are assumed to be equally correct. I have also assumed that compound words generally are more informative and less polysemous than simplex words. Therefore, extra weights were assigned to compound words in L-position. The weights were ×1.25 and ×1.75. If average precision and recall increase when weights are used, this would indicate that the assumption is correct. The output from the five similarity measures was evaluated with average precision (Section 6). A baseline was created by taking the average results from ten runs with randomly generated numbers instead of similarity values. If a measure does not perform better than the baseline, there is a serious flaw in either the method used, the (implementation of the) similarity measure, or both.

7.1 Thresholds

I also wanted to investigate how the different measures perform when thresholds are used. When this program is used in an application, one will most likely not want to see all candidate senses, just a few of the highest ranked. When none of the thresholds described below is used, all word-senses with a score higher than zero are included in the ranked list.

The first threshold keeps the top three candidate word-senses plus all lower-ranked senses whose score is no less than two thirds of the score of the third-ranked word-sense. If there are three or fewer word-senses in the ranked list, all are kept. This threshold is intended to cut off the lowest-ranked word-senses, and will reward a measure which gives all correct word-senses equally high scores. It is somewhat hazardous, though, since it would also reward a measure for giving all word-senses the same score. The second threshold looks at how many correct word-senses there are in the Gold Standard for a CPV-node; if, for example, there are four correct word-senses for a CPV-node, only the four top candidates for that node are kept. The two thresholds are never used in combination. Both are sketched in code below.
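The two thresholds can be stated compactly in code. The Python sketch below is an interpretation of the prose above, not the original implementation; it operates on a candidate list already sorted by descending score, and the sense labels in the example are invented.

    def threshold_one(ranked):
        # ranked: list of (sense, score) pairs sorted by descending score.
        # Keep the top three, plus any lower-ranked sense whose score is at
        # least two thirds of the score of the third-ranked sense.
        if len(ranked) <= 3:
            return list(ranked)
        cutoff = ranked[2][1] * 2 / 3
        return ranked[:3] + [(s, sc) for s, sc in ranked[3:] if sc >= cutoff]

    def threshold_two(ranked, gold_senses):
        # Keep as many top candidates as there are correct senses for this
        # CPV-node in the Gold Standard.
        return ranked[:len(gold_senses)]

    ranked = [("mix#n#1", 2.4), ("stock#n#2", 2.1), ("stock#n#13", 1.8),
              ("stock#n#5", 1.3), ("mix#n#4", 0.9)]
    print(threshold_one(ranked))   # keeps the top three plus ("stock#n#5", 1.3),
                                   # since 1.3 >= 1.8 * 2/3 = 1.2
    print(threshold_two(ranked, {"mix#n#1", "stock#n#2"}))   # keeps the top two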

Part III
Results, discussion and conclusion

In this part, the results from the test in Section 7 are presented (Section 8). After that follow a discussion and an analysis of the results (Section 9), conclusions (Section 10) and thoughts regarding what can be done in the future (Section 10.1).

8 Results

The results from the test described in Section 7 are presented in Tables 8–11. Table 8 shows the results when no threshold is used, Table 9 shows what happens when the first threshold is applied, and Table 10 shows the effect of the second threshold (Section 7.1). Table 11 shows the mean score for the five measures when the weights have been averaged out (recall has been omitted there, since the variation between weight classes is so small). The highest value in each column of each table is written in bold.

9 Discussion

When looking at Table 8 it is important not to be deceived by the high recall values for Leacock-Chodorow, Wu-Palmer and Baseline. There is no threshold used there, and since these three measures always return a value greater than zero, all word-senses are ranked and recall becomes very high. The fact that recall is not 1 is due to a few word extraction failures of the kind described in Section 5.3. The differences in recall for the information content based measures (Resnik, Lin and Jiang-Conrath) are due to their different tendencies to return zero.

9.1 Quantitative analysis of results

What do all the numbers in Tables 8–11 tell us? First of all, that the Leacock-Chodorow measure clearly performs best, followed by the Resnik measure. In Table 8 Resnik consistently shows higher average precision than Leacock-Chodorow, but when the two thresholds

Table 8: Disambiguation results from 73 CPV-nodes. No threshold.

Measure             Weight   Average precision   Recall
Resnik              ×1.00    0.678               0.839
                    ×1.25    0.697               0.839
                    ×1.75    0.705               0.839
Leacock-Chodorow    ×1.00    0.663               0.977
                    ×1.25    0.684               0.977
                    ×1.75    0.684               0.977
Wu-Palmer           ×1.00    0.655               0.977
                    ×1.25    0.677               0.977
                    ×1.75    0.677               0.977
Lin                 ×1.00    0.604               0.580
                    ×1.25    0.614               0.580
                    ×1.75    0.616               0.580
Jiang-Conrath       ×1.00    0.607               0.724
                    ×1.25    0.622               0.724
                    ×1.75    0.626               0.724
Baseline            ×1.00    0.554               0.977
                    ×1.25    0.560               0.977
                    ×1.75    0.565               0.977


Table 9: Disambiguation results from 73 CPV-nodes. Threshold 1 used.

Measure             Weight   Average precision   Recall
Resnik              ×1.00    0.673               0.661
                    ×1.25    0.696               0.667
                    ×1.75    0.711               0.667
Leacock-Chodorow    ×1.00    0.688               0.845
                    ×1.25    0.711               0.845
                    ×1.75    0.711               0.845
Wu-Palmer           ×1.00    0.680               0.770
                    ×1.25    0.703               0.770
                    ×1.75    0.703               0.770
Lin                 ×1.00    0.608               0.483
                    ×1.25    0.624               0.483
                    ×1.75    0.626               0.483
Jiang-Conrath       ×1.00    0.615               0.661
                    ×1.25    0.629               0.661
                    ×1.75    0.633               0.655
Baseline            ×1.00    0.572               0.774
                    ×1.25    0.578               0.774
                    ×1.75    0.592               0.774


Table 10: Disambiguation results from 73 CPV-nodes. Threshold 2 used.

Measure             Weight   Average precision   Recall
Resnik              ×1.00    0.633               0.506
                    ×1.25    0.653               0.529
                    ×1.75    0.668               0.529
Leacock-Chodorow    ×1.00    0.644               0.534
                    ×1.25    0.677               0.552
                    ×1.75    0.677               0.552
Wu-Palmer           ×1.00    0.624               0.517
                    ×1.25    0.652               0.529
                    ×1.75    0.652               0.529
Lin                 ×1.00    0.555               0.425
                    ×1.25    0.577               0.431
                    ×1.75    0.577               0.437
Jiang-Conrath       ×1.00    0.568               0.448
                    ×1.25    0.596               0.460
                    ×1.75    0.590               0.471
Baseline            ×1.00    0.505               0.454
                    ×1.25    0.525               0.459
                    ×1.75    0.546               0.476

Table 11: Average precision when weight has been evened out.

                Resnik   Leacock-Chodorow   Wu-Palmer   Lin     Jiang-Conrath   Baseline
No threshold    0.693    0.677              0.669       0.611   0.618           0.559
Threshold 1     0.693    0.703              0.695       0.619   0.626           0.581
Threshold 2     0.651    0.667              0.643       0.570   0.585           0.525

are used, the relation is reversed (Tables 9 and 10). This is also apparent in Table 11.

It is also clear that Lin and Jiang-Conrath did not perform very well, which is rather remarkable. In Table 10, for example, they show average precision dangerously close to the baseline, and recall mostly lower than the baseline. As shown by [Budanitsky and Hirst 2004] (Section 4.3.10), both the Jiang-Conrath measure and the Lin measure (sharing place with Leacock-Chodorow) perform very well, outperforming the Resnik measure at malapropism detection and correction. Also, in the comparison made by [Seco et al. 2004] (Table 2), both Lin and Jiang-Conrath show higher correlation with human judgement than Resnik and Wu-Palmer.

We see that Wu-Palmer comes third, but closer to Resnik than to Jiang-Conrath. The question of whether the bug in the WordNet::Similarity implementation of the Wu-Palmer measure (Section 4.3.6) was fatal has been answered: no, the bug is not fatal. It only occurs twice in the test set, and the two affected word-sense pairs do not seem to have influenced the outcome. (In fact, during the process of finishing this thesis a new version of WordNet::Similarity was released in which this particular bug has been fixed, so even though it turned out not to be fatal, it need no longer be a worry.)

Another observation is that the weighting of compounds seems to have had a positive effect on both recall and average precision in nearly all cases. A weight of 1.25 makes the measures perform better than no weight at all, and a weight of 1.75 makes them perform better still. Only Jiang-Conrath shows a slight loss of recall or precision in two cases, in Tables 9 and 10, with weight 1.75. This probably has to do with an incorrect compound extracted from a CPV-node: when it is given enough weight, it moves up one place in the ranking and switches places with a correct non-compound word-sense. Incorrect compounds do occur, but not very frequently; one example is “service book”, extracted from [Printing services for account books]. The positive effect of compound weighting is probably rather limited. Using weights ten times larger than the ones used here would probably not increase performance much, since compounds are not that common and since incorrect compounds do appear.

9.2 Qualitative analysis of results

In this test, where 73 CPV-nodes were disambiguated, 5 nodes did not receive any candidate word-senses at all. They are [Dental haemostatic], [Aerial photography services], [Non-destructive testing services], [Repair and maintenance services of furniture] and [Incubators]. That is ∼7% of the nodes in the test set. For all of the leaf-nodes in the CPV the percentage is somewhat lower, ∼4.9%, or 274 nodes, using the heuristics described. That these nodes do not get any candidates is a flaw in the program itself, and has nothing to do with any of the similarity measures. If 274 CPV leaf-nodes do not get any candidate word-senses, it still means that 5370 nodes do. The reason that five of the nodes in the test set did not receive any candidate word-senses is errors like the ones described in Section 5.3. Improved heuristics (such as smarter stoplists), and other sources of information, such as also extracting words from siblings in the CPV, might help here.

I would also like to look a little closer at a few of the CPV-nodes in the test set and see how the measures perform. No weights or thresholds are used here. First, let us look at [Lead], a child of [Lead, zinc and tin]. This is a perfect example: “lead” is a highly polysemous word with as many as 16 senses in WordNet, and just one of the senses is correct here. The information in the parent node is potentially very helpful; “zinc” is monosemous and highly similar to “lead”, and “tin” has only three senses, one of which denotes the metal. Interestingly, all of the measures rank the correct sense of “lead” highest, and all measures give maximum recall and average precision for this particular CPV-node.

Another interesting CPV-node in the test set is [Mixes for stocks], an instance of [Stocks]. Two of the 21 senses extracted by the program are correct. The results are shown in Table 12.

Table 12: Results for disambiguating [Mixes for stocks].

Measure             Average precision   Recall
Leacock-Chodorow    0.567               1
Resnik              0.327               1
Lin                 1                   0.5
Wu-Palmer           0.317               1
Jiang-Conrath       1                   0.5
Baseline            0.095               1

The words extracted by the program are Wparent = {stocks, stock} and Wleaf = {mix, stocks, stock}. The L-senses of “mix” are compared for semantic similarity with the P-senses of both “stocks” and “stock”, whereas the L-senses of “stocks” are only compared with the P-senses of “stock”, and the L-senses of “stock” only with the one P-sense of “stocks”. This is because senses belonging to the same word are not compared with each other if there are senses of other words available. All measures give the correct L-sense of “mix” a high ranking, first or second place, whereas the correct sense of “stock” is often ranked much lower. Leacock-Chodorow, the most successful measure even in this small example, places the correct sense of “mix” first and the correct sense of “stock” at rank 15. Resnik, on the other hand, places the correct sense of “mix” second and “stock” thirteenth. Why, then, is the average precision of Leacock-Chodorow so much higher than that of Resnik? The fact is that average precision can be rather unforgiving of such seemingly small differences:

lch = (1/1 + 2/15) / 2 = 0.567

res = (1/2 + 2/13) / 2 = 0.327

We saw in this example that the senses of “stock” were not compared with as many other senses as the senses of “mix”. A possible way of evening out ranking differences that are due only to the number of senses a candidate is compared with would be to divide its total score by the number of comparisons made, as sketched below.
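A minimal Python sketch of that normalisation: each candidate's total score is divided by the number of pairwise comparisons that produced it. The figures are invented for illustration; this normalisation was not part of the program evaluated above.

    def normalised_scores(totals, comparison_counts):
        # totals: sense -> sum of similarity scores over all comparisons made
        # comparison_counts: sense -> number of comparisons made for that sense
        return {sense: totals[sense] / comparison_counts[sense]
                for sense in totals if comparison_counts[sense] > 0}

    # The senses of "mix" were compared against the P-senses of two words,
    # the senses of "stock" against the P-sense of only one, so the raw
    # totals are not directly comparable.
    totals = {"mix#n#1": 4.2, "stock#n#2": 1.9}
    counts = {"mix#n#1": 6, "stock#n#2": 3}
    print(normalised_scores(totals, counts))
    # {'mix#n#1': 0.7, 'stock#n#2': 0.6333333333333333}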

9.3 Biased towards Resnik?

Three of the measures can return zero when computing the similarity of two concepts (Resnik, Jiang-Conrath and Lin), and they show different tendencies to do so. Looking at how often they returned zero for the nodes in the test set, we see that Resnik returns zero 163 times, Jiang-Conrath 174 times and Lin 301 times. Lin's poor results can partly be explained by this: the relatively small SemCor corpus does not suffice for the word frequencies the Lin measure needs. Jiang-Conrath, on the other hand, returns zero only a few times more than Resnik, so its poor performance cannot be explained as easily. Another explanation might be a bias towards the Resnik measure: the default mode of WordNet::Similarity (the one used here) uses the kind of frequency counting and smoothing done by Resnik. [Patwardhan 2003], p. 42, shows that Jiang-Conrath's correlation with human judgement increases by 13% relative when optimal settings are used. If Jiang-Conrath's performance in this test were to rise by 13% relative, it would beat Resnik in Table 8 and Leacock-Chodorow in Table 9, but Leacock-Chodorow would still be best in Table 10, as the calculation below suggests.
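That claim can be checked with a few lines of arithmetic on the weight-averaged figures in Table 11; this is only a back-of-the-envelope calculation, not a re-run of the experiment.

    # Weight-averaged average precision for Jiang-Conrath (Table 11) and the
    # best competing measure under each threshold setting.
    jiang_conrath = {"No threshold": 0.618, "Threshold 1": 0.626, "Threshold 2": 0.585}
    best_rival    = {"No threshold": 0.693,   # Resnik
                     "Threshold 1": 0.703,    # Leacock-Chodorow
                     "Threshold 2": 0.667}    # Leacock-Chodorow

    for setting, score in jiang_conrath.items():
        boosted = round(score * 1.13, 3)      # a 13% relative improvement
        print(setting, boosted, ">" if boosted > best_rival[setting] else "<",
              best_rival[setting])
    # No threshold 0.698 > 0.693
    # Threshold 1 0.707 > 0.703
    # Threshold 2 0.661 < 0.667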

10 Conclusion

I have shown a method of disambiguating an ontology using five different measures of semantic similarity and WordNet. The measure best suited for this task is shown to be the Leacock-Chodorow measure. The average precision of this similarity measure ranges from 0.644 to 0.711, and recall from 0.534 to 0.977, both depending on the kind of threshold used. Out of the three semantic similarity measures based on information content, the Resnik, Jiang-Conrath and Lin measures, only Resnik proved to be of much use. However, the performance of the other two might improve considerably when corpora larger than SemCor are used, as was shown by [Patwardhan 2003]. The method described, regardless of the semantic similarity measure used, has the potential of enriching ∼95% of the CPV leaf-nodes (5370 out of 5644) with some kind of information. Average precision and recall for Leacock-Chodorow suggest that this information is often relevant. How useful it is will be shown in the coming work of Henrik Oxhammar. This method could also be used for disambiguating other English-based ontologies, provided that the information in them is not too domain-specific to be found in WordNet. Though the actual program written here uses WordNet 1.7.1 and WordNet::Similarity 0.09, only a few minor changes would be needed in order to use later versions.

10.1 Future work

It would be interesting to see how other computer-readable information sources, e.g. online dictionaries, thesauri or search engines, could be used instead of WordNet for similar tasks. To some extent, I think it would be possible. The only part of WordNet used in this work was the noun taxonomy, and similar structures could be obtained from thesauri and the like. Another way of disambiguating an ontology such as the CPV would be to take advantage of the fact that exact copies of it exist in different languages. A word which is polysemous in one language may be monosemous in another. For example, to disambiguate the English CPV-node [31521320 Torches], you could look up the Swedish node simply by searching for the code. There you find [31521320 Ficklampor], which is monosemous in Swedish. All you then need to do is look up “ficklampa” (the singular form of “ficklampor”) in a machine-readable Swedish-English dictionary to find the correct sense of “torch”. A sketch of this cross-lingual lookup is given below.
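A rough Python sketch of the cross-lingual lookup follows. The CPV tables and the Swedish-English dictionary are represented as plain dictionaries with invented entries (including the sense label "torch#n#2"); a real implementation would load the official CPV files, a machine-readable lexicon and a proper Swedish lemmatiser.

    # Toy data: the same CPV code in two language versions, plus a tiny
    # Swedish-English dictionary mapping lemmas to WordNet-style senses.
    cpv_en = {"31521320": "Torches"}
    cpv_sv = {"31521320": "Ficklampor"}
    sv_en_dictionary = {"ficklampa": ["torch#n#2"]}

    def lemmatise_sv(word):
        # Placeholder: a real system would use a Swedish lemmatiser here.
        return {"ficklampor": "ficklampa"}.get(word.lower(), word.lower())

    def disambiguate_via_swedish(code):
        lemma = lemmatise_sv(cpv_sv[code])
        senses = sv_en_dictionary.get(lemma, [])
        # If the Swedish label maps to exactly one English sense, propose
        # that sense for the English node.
        return senses[0] if len(senses) == 1 else None

    print(cpv_en["31521320"], "->", disambiguate_via_swedish("31521320"))
    # Torches -> torch#n#2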

References

[Agirre et al. 2000] E. Agirre, O. Ansa, E. Hovy, and D. Martinez. 2000. Enriching very large ontologies using the WWW. In Proceedings of the Ontology Learning Workshop, ECAI, Berlin.

[Banerjee and Pedersen 2003] S. Banerjee and T. Pedersen. 2003. Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pages 805–810, Acapulco.

[Budanitsky and Hirst 2004] Alexander Budanitsky and Graeme Hirst. 2004. Evaluating WordNet-based measures of semantic distance. Submitted for publication.

[European Union and European Parliament 2002] European Union and European Parliament. 2002. On the common procurement vocabulary (CPV). Regulation (EC) No 2195/2002 of the European Parliament and of the Council, November 5.

[Faatz and Steinmetz 2002] Andreas Faatz and Ralf Steinmetz. 2002. Ontology enrichment with texts from the WWW. In Proceedings of the 13th European Conference on Machine Learning (ECML'02) and the 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'02), Helsinki.

[Fellbaum 1998] Christiane Fellbaum, editor. 1998. WordNet: An electronic lexical database. Language, Speech, and Communication. MIT Press, Cambridge, MA.

[Hirst and St-Onge 1998] Graeme Hirst and David St-Onge. 1998. Lexical chains as representations of context for the detection and correction of malapropisms. Chapter 13, pages 305–332. MIT Press, Cambridge, MA.

[Jiang and Conrath 1997] Jay Jiang and David Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, Taiwan.

[Landes et al. 1998] Shari Landes, Claudia Leacock, and Randee Tengi. 1998. Building semantic concordances. Chapter 8, pages 199–216. MIT Press, Cambridge, MA.

[Leacock and Chodorow 1998] Claudia Leacock and Martin Chodorow. 1998. Combining local context and WordNet similarity for word sense identification. Chapter 11, pages 265–283. MIT Press, Cambridge, MA.

[Lin 1998] Dekang Lin. 1998. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, pages 296–304. Morgan Kaufmann, San Francisco, CA.

[Miller and Charles 1991] George A. Miller and Walter G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28.

[Patwardhan 2003] Siddharth Patwardhan. 2003. Incorporating dictionary and corpus information into a context vector measure of semantic relatedness. Master's thesis, University of Minnesota.

[Pedersen et al. 2004] Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. 2004. WordNet::Similarity - measuring the relatedness of concepts.

[Resnik 1995a] Philip Resnik. 1995a. Disambiguating noun groupings with respect to WordNet senses. In David Yarowsky and Kenneth Church, editors, Proceedings of the Third Workshop on Very Large Corpora, pages 54–68, Somerset, New Jersey. Association for Computational Linguistics.

[Resnik 1995b] Philip Resnik. 1995b. Using information content to evaluate semantic similarity in a taxonomy. In Chris Mellish, editor, Proceedings of IJCAI-95, pages 448–453, Montréal, Canada.

[Roux et al. 2000] Claude Roux, Denys Proux, Francois Rechenmann, and Laurent Julliard. 2000. An ontology enrichment method for a pragmatic information extraction system gathering data on genetic interactions. In Proceedings of the Workshop on Ontology Learning at ECAI 2000, Berlin.

[Seco et al. 2004] Nuno Seco, Tony Veale, and Jer Hayes. 2004. An intrinsic information content metric for semantic similarity in WordNet. In Proceedings of ECAI'2004, Valencia, Spain. The 16th European Conference on Artificial Intelligence.

[Valarakos et al. 2004] Alexandros Valarakos, Georgios Paliouras, Vangelis Karkaletsis, and George Vouros. 2004. A name-matching algorithm for supporting ontology enrichment. In Proceedings of SETN'04, 3rd Hellenic Conference on Artificial Intelligence, Samos, Greece.

[Vossen 1998] Piek Vossen, editor. 1998. EuroWordNet: A multilingual database with lexical semantic networks. Kluwer Academic Publishers, Dordrecht.

[Wu and Palmer 1994] Zhibiao Wu and Martha Palmer. 1994. Verb semantics and lexical selection. In 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138.


A Test set

Land-reclamation work
Urban solid-refuse disposal services
Dental haemostatic
Repair and maintenance services of furniture
Multi-functional buildings
Industrial process control equipment
Network components
Unloading semi-trailers for agriculture
Marine patrol vessels
Printing services for stamp-impressed paper
Dampers
Railway transport of bulk liquids or gases
Aerial photography services
Fire hoses
Copper sulphate
Gully-emptying services
Chests of drawers
Leather waste
Electric pumps
Structural steelworks
Overhead projectors
Painters’ brushes
Carbon paper
Flat-rolled products of iron
Sweet pies
Notepaper
Tomatoes
Non-destructive testing services
Shorts
Installation services of beverage-processing machinery
Clinical-waste collection services
Motor oils
Smoke-extraction equipment
Underground railway works
Apparatus for detecting fluids
Natural sponges
Chicken cuts
Landscaping work for roof gardens
Safe-deposit lockers
Time recorders
Ferry boats
Ass, mule or hinny meat
Studio mixing console
Paints and wallcoverings
Shampoos
System quality assurance planning services
Breakwater
Facilities management services
Mixes for stocks
Cable-laying ship services
Plastic self-adhesive sheets
Insulated cable joints
Refrigerated showcases
Dressing gowns
Lead
Repair and maintenance services of dampers
Incubators
Electrical signalling equipment for railways
Pollution-monitoring services
File covers
Pig iron
Fittings for loose-leaf binders or files
Sausage products
Reconditioning services of rolling stock seats
Gas-detection equipment
Printing services for account books
Headbands
Diced, sliced and other frozen potatoes
Electronic cards
Ship refuelling services
Processed pulses
Wheelchair tyres
Metal sheets
