Mining Semantically Related Terms from Biomedical Literature


GORAN NENADIĆ and SOPHIA ANANIADOU

University of Manchester and National Centre for Text Mining

Discovering links and relationships is one of the main challenges in biomedical research, as scientists are interested in uncovering entities that have similar functions, take part in the same processes, or are coregulated. This article discusses the extraction of such semantically related entities (represented by domain terms) from biomedical literature. The method combines various text-based aspects, such as lexical, syntactic, and contextual similarities between terms. Lexical similarities are based on the level of sharing of word constituents. Syntactic similarities rely on expressions (such as term enumerations and conjunctions) in which a sequence of terms appears as a single syntactic unit. Finally, contextual similarities are based on automatic discovery of relevant contexts shared among terms. The approach is evaluated using the Genia resources, and the results of experiments are presented. Lexical and syntactic links have shown high precision and low recall, while contextual similarities have resulted in significantly higher recall with moderate precision. By combining the three metrics, we achieved F measures of 68% for semantically related terms and 37% for highly related entities.

Categories and Subject Descriptors: I.2.7 [Natural Language Processing]: Text analysis; H.3.1 [Content Analysis and Indexing]: Linguistic processing; J.3 [Life and Medical Sciences]: Biology and genetics

General Terms: Algorithms, Documentation, Languages

Additional Key Words and Phrases: biomedical literature, contextual patterns, term similarities, text mining

1. INTRODUCTION

Dynamic progress in biology, molecular biology, and biomedicine has resulted in a huge body of knowledge that is represented by various concepts, entities, events, processes, functions, and relationships among them. Such knowledge can be represented by semantic networks that link related entities with specific and/or general relations, or by classification, clustering or ontology-based annotations. In order to allow biologists to efficiently acquire, analyze, and use such information, domain resources (e.g., genomic databases) are being continuously adapted to integrate new knowledge as it becomes available.

This research was partially supported by the UK BBSRC grant “Mining Term Associations from Literature to Support Knowledge Discovery in Biology” (BB/C007360/1).

Authors’ addresses: G. Nenadić, School of Informatics, University of Manchester, Manchester M60 1QD, UK; email: G.Nenadic@manchester.ac.uk. S. Ananiadou, School of Informatics, University of Manchester, P.O. Box 88, Sackville Street, Manchester M60 1QD; email: Sophia.Ananiadou@manchester.ac.uk.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or permissions@acm.org.

© 2006 ACM 1530-0226/06/0300-0022 $5.00

While manual update (known as curation) involves labor-intensive work on producing summarized information [Blake et al. 2003], automatic and semi-automatic methods rely on mining either experimental data (e.g., assigning the Gene Ontology annotations using homology searching [Camon et al. 2003]) or the literature to suggest relationships and associations among bioentities [Shatkay and Feldman 2003]. The availability of vast textual resources has spurred huge interest in designing text-mining methods that can help scientists in locating, collecting, and extracting relevant knowledge represented in the literature. Traditionally, however, biomedical text-mining applications have mainly focused on document retrieval and restricted information extraction, typically without linking and combining information that spans multiple documents.

In this article, we focus on the extraction of semantically related biomedical entities by combining information collected from multiple documents and multiple sources. The approach is terminology-driven, as terminology represents a means to communicate knowledge among scientists, in particular, when it is expressed in the literature. Terms represent domain concepts (such as entities, processes, and functions) and are particularly relevant in the biomedical domain, which is terminologically extremely dynamic, dense, and variable. Term identification in text is still far from being completely solved in the biomedical domain (cf. [Friedman et al. 2001a; Ananiadou and Nenadić 2006; Harkema et al. 2004; Krauthammer and Nenadic 2004; Hirschman et al. 2005]). However, discovering and establishing links and associations among entities is one of the main challenges in biomedical knowledge acquisition [Camon et al. 2005].

Biomedical entities are related in many ways: they have functional, structural, causal, hyponymous or other links [cf. Skuce and Meyer 1991; Stapley et al. 2002]. Relationships include diverse types of general (such as generalization, specialization, meronymy) and domain-specific relations (such as binding, phosphorylation, and inhibition). For example, the term NF-kappa B is a hyponym of the term transcription factor, while the binding relationship links amino acid and amino acid receptor, as well as CREP and CREP-binding protein; further examples of diverse relationships are colocation of proteins ScNFU1 and Nfs1p in yeast mitochondria [Leon et al. 2003], or structural and functional similarities between proteins red blood cell protein 4.1 and synapsin I [Krebs et al. 1987]. The aim of this study is to mine such (pairs of) terms that are (potentially) semantically linked. We do not aim at identifying the type of the relationship(s) that exists among them, but rather at discovering links regardless of the type of the relationship (such terms are considered as semantically related).

The article is organized as follows. In Section 2, we overview related approaches to the extraction of term relationships from text. Section 3 introduces the term similarity measures that are used to establish links among entities, while Section 4 presents experiments and evaluation, which are further discussed in Section 5. Finally, Section 6 concludes the article.


2. RELATED WORK

Several methods have been suggested for the extraction of relationships from literature [for overviews see Mack and Hehenberger 2002; Shatkay and Feldman 2003]. The most straightforward approach for establishing term links is to measure lexical similarities among the words that constitute terms [Bourigault and Jacquemin 1999; Yeganova et al. 2004]. However, as naming conventions do not necessarily systematically reflect any particular functional property or relatedness between biological entities (in particular when abbreviations or adhoc names are used), term relationships are typically extracted using context analysis of occurrences of terms within and across corpora. Contexts, however, may be selected in a number of ways: as an entire abstract or document [Stapley and Benoit 2000], as a sentence [Grefenstette 1994], or as a chunk (e.g., a verb complementation phrase [Spasic et al. 2003]).

Ding et al. [2002] investigated the effectiveness of these contexts (based on a term-term cooccurrence measure): they reported that larger units naturally provided better recall, while smaller units (e.g., phrases) typically delivered significantly better precision. With respect to overall effectiveness (i.e., F measure), sentences were significantly better than phrases.

Identifying features within selected contexts that are informative for identification of relatedness among terms proved to be challenging. Although, as Blake and Pratt [2001] have indicated, “few researchers would claim that a word representation is optimal”, many approaches rely on simple words [e.g., Stapley and Benoit 2000; Stapley et al. 2002; Ding et al. 2002]. Orthographic [e.g., Collier et al. 1999] and morphological features [e.g., Hatzivassiloglou et al. 2001; Kazama et al. 2002] are also used, as well as grammatical roles (e.g., objects or subjects [Grefenstette 1994]) and shallow-syntactic information [Yakushiji et al. 2001]. Some approaches rely on terminological features (e.g., Blake and Pratt [2001] rely on the distributions of controlled keywords, while Nenadić et al. [2002] use dynamically recognised terms).

Various methods have been applied to extract relationships from text. Reported rule-based approaches range from those based on predefined lexical patterns [Blaschke et al. 1999; Ng and Wong 1999] and templates [Maynard and Ananiadou 1999; Pustejovsky et al. 2002], to parsing of documents using domain-specific grammars [Friedman et al. 2001b; Yakushiji et al. 2001; Gaizauskas et al. 2003]. Various statistical approaches, mainly based on mutual information and cooccurrence frequency counts, were used to associate terms that are not explicitly linked in text [Andrade and Valencia 1997; Stapley and Benoit 2000; Raychaudhuri et al. 2002; Ding et al. 2002; Nenadić et al. 2002]. Similarly, machine-learning approaches have been widely used to learn lexical contexts expressing a given relationship [Craven and Kumlien 1999; Marcotte et al. 2001; Stapley et al. 2002; Donaldson et al. 2003; Nenadić et al. 2003b; Spasic and Ananiadou 2005].

As a rule, many of these approaches extract a specific, predefined type of relationship (e.g., binding and activation), and rarely do so by combining information from more than one text segment. If relationship-specific rules are handcrafted, this significantly prolongs the construction of a knowledge-mining system, reduces its adaptability, and makes it impossible to extract term relationships that do not correspond to the predefined patterns and templates.

On the other hand, cooccurrences and statistical distributions within larger text units (such as documents) may not reveal significant links for some types of relationships. For example, many studies reported that 40% of cooccurrence-based relationships in the domain of biomedicine were biologically meaningless [cf. Jenssen et al. 2001; Tao and Leibel 2002].

For these reasons, in the following section, we introduce a hybrid method for the identification of semantically related entities, which is based on lexical, syntactic, and contextual similarities between the terms in question.

3. LEXICAL, SYNTACTIC, AND CONTEXTUAL TERM SIMILARITIES

Our approach to mining semantically related terms from literature builds on approaches that use more complex features than simple cooccurrence statistics. We also combine rule-based and statistical methods and extract information from sentences rather than from entire abstracts/documents. Finally, we merge information mined from multiple sources. The method incorporates three text-based aspects, namely, lexical, syntactic and contextual similarities between terms. In the following subsections, we briefly describe each of them [see also Nenadić et al. 2004a].

3.1 Lexical Similarities

We generalized the lexical approaches mentioned earlier [Bourigault and Jacquemin 1999; Yeganova et al. 2004]. We consider constituents (head and modifiers) shared by terms as a basis for measuring their lexical similarity.

The rationale behind the approach involves the following hypothesis: a term derived by modifying another term may indicate further concept specialization (e.g., orphan nuclear receptor is a kind of receptor), or some functional relationship (e.g., CREP-binding protein is linked to CREP through the binding relationship; this type of link has been studied in detail in Ogren et al. [2004] for the Gene Ontology). In particular, terms sharing a terminological head¹ are assumed to be (in)direct hyponyms of the same term (e.g., progesterone receptor and oestrogen receptor are both receptors). More generally, when a term is nested inside another term, we assume that the terms in question are somehow semantically related.

For each term we define its lexical profile containing its terminological head and all of its substrings (see Table I for examples). We then use a weighted Dice-like coefficient to compare the lexical profiles of two terms. We give more credit to pairs that share longer nested constituents, with an additional weight given to the similarity if the two terms have common heads. More precisely, lexical similarity (LS) between two terms is defined as:

$$
\mathrm{LS}(t_1, t_2) = \frac{|P(h_1) \cap P(h_2)|}{|P(h_1)| + |P(h_2)|} + \frac{|P(t_1) \cap P(t_2)|}{|P(t_1)| + |P(t_2)|} \tag{1}
$$

¹ The notion of terminological head refers to the element that awards termhood to the whole term [cf. Ananiadou 1994].

Table I. Examples of Lexical Profiles

| Term | Lexical profile, P(term) |
|---|---|
| nuclear receptor | {nuclear, receptor, nuclear receptor} |
| orphan receptor | {orphan, receptor, orphan receptor} |
| orphan nuclear receptor | {orphan, nuclear, receptor, orphan nuclear, nuclear receptor, orphan nuclear receptor} |

Table II. Examples of Lexical Similarities Between Terms

| Term1 | Term2 | LS(term1, term2) |
|---|---|---|
| nuclear receptor | orphan nuclear receptor | 0.83 |
| orphan receptor | nuclear orphan receptor | 0.83 |
| orphan nuclear receptor | nuclear orphan receptor | 0.75 |
| nuclear receptor | nuclear orphan receptor | 0.72 |
| orphan receptor | orphan nuclear receptor | 0.72 |
| nuclear receptor | orphan receptor | 0.67 |
| nuclear receptor | nuclear translocator | 0.17 |
| orphan nuclear receptor | nuclear translocator | 0.11 |
| orphan receptor | nuclear translocator | 0.00 |

In Eq. (1), h₁ and h₂ are the terminological heads of terms t₁ and t₂, respectively, and P(s) refers to the set of all nonempty contiguous word sequences of s (as illustrated in Table I). Examples of lexical similarities are provided in Table II.
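The computation in Eq. (1) can be illustrated with a minimal Python sketch. It assumes that terms are already normalized, that a lexical profile consists of all contiguous word sequences of a term (as Table I suggests), and, as a simplification that is not part of the method itself, that the terminological head is the last word of the term. The function names are hypothetical.

```python
def lexical_profile(term: str) -> set[str]:
    """All contiguous word sequences of a term (compare Table I)."""
    words = term.split()
    return {
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, len(words) + 1)
    }


def head(term: str) -> str:
    # Assumption for illustration only: the terminological head is the
    # last word of an already-normalized term.
    return term.split()[-1]


def lexical_similarity(t1: str, t2: str) -> float:
    """Eq. (1): overlap of head profiles plus overlap of term profiles."""
    ph1, ph2 = lexical_profile(head(t1)), lexical_profile(head(t2))
    pt1, pt2 = lexical_profile(t1), lexical_profile(t2)
    head_part = len(ph1 & ph2) / (len(ph1) + len(ph2))
    term_part = len(pt1 & pt2) / (len(pt1) + len(pt2))
    return head_part + term_part
```

Under these assumptions, the sketch gives 0.5 + 3/9 for nuclear receptor and orphan nuclear receptor, and 0 + 1/6 for nuclear receptor and nuclear translocator, in line with the values listed in Table II.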

This lexical metric is obviously useful for comparing multiword terms, but is rather limited when it comes to ad hoc names (since they may have arbitrary constituents) or single-word terms. Also, lexical similarities can capture only restricted types of links (typically specialization/generalisation relationships), although, in some cases, domain-specific relationships can be lexically expressed (e.g., CREP-binding protein is concerned with binding). Finally, assessing lexical similarities depends on the ability to neutralise some lexical variations (e.g., inflection and simple structural variations) and to expand acronyms; a method that we have used for this is described in Section 4.2.

3.2 Syntactic Similarities

It is widely accepted that specific syntactic expressions may indicate functional similarities among terms [Hearst 1992]. For instance, when encountered in text, an enumeration of terms (e.g., steroid receptors, such as estrogen receptor, glucocorticoid receptor, and progesterone receptor), term coordination² (e.g., adrenal glands and gonads), or conjunction of terms (e.g., estrogen receptor and progesterone receptor) typically indicates that the terms involved are highly related and functionally similar. In these cases, a sequence of terms appears as a single syntactic unit, and, thus, the terms involved are used within the same context (in the same sentence) and in combination with the same verb and/or preposition.

² In this article, we assume that, in a term coordination, a lexical constituent(s) common to two or more terms is shared (appears only once), while their distinct lexical parts are enumerated and coordinated; term conjunctions involve terms that are lexically represented as separate units (no constituents are shared among them, i.e., represented as belonging to each of them).

Table III. Examples of Term Enumeration and Coordination Patterns^a

<TERM> ([(](such as)|like|(e.g.,[,]))<TERM> (,<TERM>)* [[,] <&> <TERM>] [)]

<TERM> (,<TERM>)* [,] <&> other <TERM>

<TERM> [,] (including | especially) <TERM> (,<TERM>)* [[,]<&><TERM>]

both <TERM> and <TERM>

either <TERM> or <TERM>

neither <TERM> nor <TERM>

ᵃ The standard regular-expression notation is used: [] denotes optional elements; * denotes repetition; | denotes alternatives; and <&> denotes a coordination conjunction.

For the extraction of syntactic similarities (SS), we have generalized the approach of Hearst [1992] and defined a set of lexical patterns³ (see Table III). They are applied as filters in order to retrieve sets of terms appearing in enumerations, coordinations, and conjunctions. Note that corresponding patterns are typically ambiguous, as they may retrieve both coordinated terms and conjunctions of terms [Nenadić et al. 2004c]. Still, in either case, the retrieved terms are highly related.
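As an illustration of how such a pattern can act as a filter, the following Python sketch encodes a restricted version of the first pattern in Table III. It is a simplified sketch rather than the implementation used in the experiments: the `[[...]]` markup for previously identified term occurrences, the restriction of the coordinating conjunction to "and"/"or", and the function name are all assumptions made for the example.

```python
import re
from itertools import combinations

# Hypothetical markup: each recognized term occurrence is wrapped as [[...]].
TERM = r"\[\[[^\]]+\]\]"
CONJ = r"(?:and|or)"  # stands in for <&> in Table III

# Restricted form of: <TERM> (such as|like|e.g.,) <TERM> (,<TERM>)* [[,] <&> <TERM>]
ENUMERATION = re.compile(
    rf"{TERM},?\s+\(?(?:such as|like|e\.g\.,)\s+{TERM}"
    rf"(?:\s*,\s*{TERM})*(?:\s*,?\s*{CONJ}\s+{TERM})?"
)


def syntactically_related_pairs(sentence: str) -> set[frozenset[str]]:
    """Return unordered pairs of terms that appear in one enumeration."""
    pairs = set()
    for match in ENUMERATION.finditer(sentence):
        terms = re.findall(r"\[\[([^\]]+)\]\]", match.group(0))
        pairs.update(frozenset(pair) for pair in combinations(terms, 2))
    return pairs
```

For a sentence such as "[[steroid receptors]], such as [[estrogen receptor]], [[glucocorticoid receptor]], and [[progesterone receptor]] were studied.", the sketch pairs every two of the four terms, including the term that introduces the enumeration.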

Syntactic similarity typically captures taxonomic relationships (hyponyms, siblings, etc.). To discover more generic types of links between terms, it is therefore necessary to compare the other textual contexts in which the terms appear individually.

3.3 Contextual Similarities

Contextual similarities rely on the comparison of contexts in which terms tend to appear. We do not use the bag-of-words approach or full parsing to model contexts, but rather aim at describing (in a more generic way) a context in which a given term occurred. For example, by analyzing the following set of textual contexts:

. . . receptor is bound to these DNA sequences . . .
. . . estrogen receptor bound to DNA . . .
. . . RXRs bound to respective DNA elements in vitro. . .
. . . TR when bound to DNA . . .

one can note that terms receptor, estrogen receptor, RXRs, and TR appear in a context that can be “roughly” described using the following (right) contextual “pattern”: TERM VERB:bind to CLASS:dna. Sharing this context may indicate functional similarity among these terms (i.e., these terms may have similar functions).

In order to generate generic contextual “descriptions,” we need to discard less informative contextual constituents (such as determiners, adverbs, linking phrases, and auxiliary verbs), and neutralize lexical variability. On the other hand, context elements with high information and domain-specific content (e.g., terms and terminological verbs) can additionally be lemmatized (and “instantiated”) to highlight their importance for establishing links.

We represent a context of an individual term occurrence by a generic regular expression called context pattern (CP). A CP contains only relevant elements, their part-of-speech and syntactic tags, terminological and additional ontological (or class) information (when available), and lemmatized significant contextual elements.

³ These patterns assume that term occurrences have been previously identified in text.

When mining CPs from text, one of the challenges is to deal with CPs of various lengths, in particular, with “nested” contexts. For example, the following contexts are nested in the context TERM VERB:bind to CLASS:dna PREP:in CLASS:location:

TERM VERB:bind to CLASS:dna PREP:in
TERM VERB:bind to CLASS:dna
TERM VERB:bind to

In our approach, we generate and consider all possible nested patterns when comparing contexts.
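Generating nested patterns can be sketched as follows, assuming that a CP is stored as a tuple of constituents and that a nested pattern is obtained by trimming the end of the context that is farthest from the term (the right end for right contexts, the left end for left contexts). The function name and the `side` parameter are hypothetical.

```python
def nested_patterns(
    cp: tuple[str, ...], min_len: int = 2, side: str = "right"
) -> list[tuple[str, ...]]:
    """Return cp followed by all shorter patterns nested in it.

    For a right context, constituents are trimmed from the right end;
    for a left context, they are trimmed from the left end.
    """
    lengths = range(len(cp), min_len - 1, -1)
    if side == "right":
        return [cp[:n] for n in lengths]
    return [cp[len(cp) - n:] for n in lengths]
```

For the six-constituent context above and `min_len=3`, the sketch returns the full pattern followed by the three nested patterns listed in the example.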

Another problem is to distinguish contexts that are relevant for the domain, as terms also appear in contexts that are not relevant for establishing their properties.⁴ Our approach aims at automatic identification of relevant CPs by providing a weighting mechanism called CP value.⁵ The CP-value measure assigns importance weights as follows: the weights of CPs that do not appear as nested elsewhere are proportional to their frequency (in a given corpus) and length (sharing two longer CPs is more important than sharing two shorter ones). If a CP also appears as nested, we take into account both the number of times it appears as a maximal one and the number of times it appears as nested. More precisely, the CP value of pattern p is defined as

$$
\text{CP-value}(p) =
\begin{cases}
\log_2 |p| \cdot f(p), & \text{if } p \text{ is not nested} \\
\log_2 |p| \cdot \left( f(p) - \dfrac{1}{|T_p|} \sum_{q \in T_p} f(q) \right), & \text{if } p \text{ is nested}
\end{cases}
\tag{2}
$$

where f(p) is the absolute frequency of pattern p in the corpus, |p| is its length (as the number of constituents), T_p is the set of all CPs that contain p, and, consequently, |T_p| is the frequency of its occurrence within other CPs. The CPs whose CP values are within a certain interval can be deemed important: CPs with very high CP values are, as a rule, general contextual patterns, while CPs with low CP values may be irrelevant for comparisons (they typically have low frequencies). Table IV presents some examples of CPs. Note that the mined CPs are not information extraction patterns; they are rather used as an approximation of the contexts in which terms appear.
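The CP-value computation can be sketched in Python as follows. The sketch assumes that patterns are tuples of constituents, that the frequency table already counts every occurrence of a pattern (including occurrences inside longer patterns), and that "p is nested in q" means that p is a proper prefix of q, which matches right contexts; these representation choices and the function names are assumptions made for illustration.

```python
import math

Pattern = tuple[str, ...]


def is_nested_in(p: Pattern, q: Pattern) -> bool:
    # Assumption: for right contexts, p is nested in q if it is a proper prefix.
    return len(p) < len(q) and q[: len(p)] == p


def cp_values(freq: dict[Pattern, int]) -> dict[Pattern, float]:
    """Compute the CP value of every pattern in a frequency table."""
    values = {}
    for p, f_p in freq.items():
        containers = [q for q in freq if is_nested_in(p, q)]  # T_p
        if containers:
            mean_container_freq = sum(freq[q] for q in containers) / len(containers)
            values[p] = math.log2(len(p)) * (f_p - mean_container_freq)
        else:
            values[p] = math.log2(len(p)) * f_p
    return values
```

For example, with a four-constituent pattern of frequency 3 and its three-constituent prefix of frequency 5, the longer pattern receives log₂4 · 3 = 6, while the nested one receives log₂3 · (5 − 3).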

Finally, each term is associated with a set of the most characteristic patterns in which it occurs (we extract left and right patterns separately). Such patterns represent a contextual profile of a term. As we treat CPs (i.e., contextual profiles) as terms’ features, we use a Dice-like coefficient to assess contextual similarity (CS) between terms t₁ and t₂ as follows:

$$
\mathrm{CS}(t_1, t_2) = \frac{|C_{L1} \cap C_{L2}| + |C_{R1} \cap C_{R2}|}{|C_{L1}| + |C_{L2}| + |C_{R1}| + |C_{R2}|} \tag{3}
$$

⁴ Apart from infrequent patterns, consider, for example, a frequently used, but noninformative and nondiscriminative context We report on TERM . . .

⁵ This measure is analogous to the C-value termhood measure [Frantzi et al. 2000].

Table IV. Examples of (Left) CPsᵃ

| Context pattern | CP value | “Type” |
|---|---|---|
| PREP:of NP | 232.65 | general patterns |
| PREP:in NP PREP:of NP | 126.47 | general patterns |
| . . . | . . . | |
| PREP:with V:interact NP | 12.32 | domain-specific patterns |
| TERM PREP:on NP PREP:of | 10.17 | domain-specific patterns |
| TERM V:regulate NP PREP:by V:bind NP PREP:to | 4.64 | domain-specific patterns |
| NP V:mediate NP PREP:through NP PREP:of | 4.27 | domain-specific patterns |
| NP PREP:of NP V:separate PREP:by TERM V:mediate | 4.00 | domain-specific patterns |
| TERM PREP:of NP V:inhibit PREP:by | 4.00 | domain-specific patterns |
| . . . | . . . | |
| TERM PREP:at NP V:induce | 2.00 | low-freq. patterns |
| TERM NP V:activate PREP:as | 2.00 | low-freq. patterns |
| . . . | . . . | |

ᵃ Prepositions and most frequent verbs are instantiated.

Table V. Examples of Contextual Patterns Shared Between NF Kappa B and Transcription Factorᵃ

| Side | Examples of shared patterns | Frequency |
|---|---|---|
| Left CPs | TERM V:inhibit NP PREP:of TERM | 4 |
| Left CPs | NP V:bind TERM | 3 |
| Left CPs | TERM V:affect NP PREP:of | 3 |
| Left CPs | NP V:induce | 2 |
| Left CPs | TERM PREP:of TERM PREP:of NP PREP:between TERM | 2 |
| Left CPs | PREP:by NP PREP:in NP V:involve NP PREP:between TERM | 2 |
| Left CPs | NP V:mediate PREP:by TERM | 2 |
| Left CPs | NP V:stimulate PREP:with TERM V:inhibit NP PREP:of TERM | 2 |
| Right CPs | V:activate PREP:by TERM | 8 |
| Right CPs | V:bind PREP:to NP PREP:in TERM | 6 |
| Right CPs | V:involve PREP:in NP PREP:of NP | 4 |
| Right CPs | TERM V:control NP PREP:of TERM | 2 |
| Right CPs | NP V:associate PREP:with TERM | 2 |
| Right CPs | TERM V:inhibit TERM PREP:of TERM | 2 |
| Right CPs | V:contribute PREP:to NP PREP:of TERM | 2 |
| Right CPs | V:allow NP PREP:of TERM | 2 |

ᵃ In these patterns, verbs and prepositions are instantiated. CS calculated for these two terms from their contexts extracted from the Genia corpus is 0.56 (note that NF kappa B is a hyponym of transcription factor).

In the definition of CS, C_L1, C_R1, C_L2, and C_R2 are the sets of left and right CPs associated with terms t₁ and t₂, respectively. Table V shows examples of contextual patterns shared between two terms.
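A minimal Python sketch of the contextual similarity coefficient is given below. It follows the formula exactly as printed in this section (without the factor of 2 that the standard Dice coefficient has in its numerator, so the printed form cannot exceed 0.5), and it assumes that each contextual profile is stored as a set of pattern strings; the function and parameter names are hypothetical.

```python
def contextual_similarity(
    left1: set[str], right1: set[str], left2: set[str], right2: set[str]
) -> float:
    """Shared left and right CPs over the total size of both profiles."""
    shared = len(left1 & left2) + len(right1 & right2)
    total = len(left1) + len(left2) + len(right1) + len(right2)
    # Two terms without any associated CPs are treated as unrelated.
    return shared / total if total else 0.0
```

For instance, two terms that share one of their left CPs and one of their right CPs, with six CPs associated with them in total, obtain a value of 2/6.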

4. EXPERIMENTS AND EVALUATION

Here we report on an experiment with extracting semantically related terms from the Genia corpus [Kim et al. 2003]. First we present our evaluation methodology and briefly overview the implementation of the methods (discussed in the previous section) used to extract related terms. Evaluation and comparisons are presented in Sections 4.3–4.5 and 4.6–4.7, respectively; further discussion is given in Section 5.

Table VI. Examples of Term Distances (Based on the Genia Ontology) and Their Contextual, Lexical, and Syntactic Similarities (as Extracted from the Genia Corpus)

| Term1 | Term2 | Distance | CS | LS | SS |
|---|---|---|---|---|---|
| human monocyte | monocyte | 0 | 0.48 | 0.75 | 1.00 |
| B cell | T cell | 0 | 0.51 | 0.67 | 1.00 |
| B cell line | T cell line | 0 | 0.35 | 0.75 | 1.00 |
| hella cell | Jurkat cell | 0 | 0.43 | 0.67 | 1.00 |
| hella cell | Jurkat T cell | 0 | 0.46 | 0.61 | 1.00 |
| p50 | p65 | 0 | 0.51 | 0.00 | 1.00 |
| cell survival | cell cycle progression | 0 | 0.00 | 0.11 | 1.00 |
| tumor necrosis factor | tumor necrosis factor α | 0 | 0.37 | 0.83 | 0.00 |
| RAI therapy | surgery | 0 | 0.00 | 0.00 | 1.00 |
| adult T cell | primary T cell | 0 | 0.00 | 0.75 | 0.00 |
| NF kappa B | transcription factor | 1 | 0.56 | 0.00 | 0.00 |

4.1 Evaluation Environment and Methodology

The testing Genia corpus contains 2000 abstracts with manually marked part-of-speech (POS) information. Furthermore, each occurrence of more than 30,000 different terms is tagged and annotated with a corresponding class from the Genia ontology. The ontology contains around 50 hierarchically organized classes, but only leaves (35 nodes) are used for annotations. These annotations were used to evaluate term links mined from the corpus.

It is obvious that defining and applying a consistent and meaningful evaluation approach for assessing relatedness among terms is a huge challenge [Camon et al. 2005]. In our approach, relatedness between two terms was estimated via the distance between the corresponding annotations in the Genia ontology (assigned to the terms in question). More precisely, the distance between two terms is calculated as the mean of the distances (the numbers of edges) of their respective classes from the nearest common ancestor in the Genia ontology. Thus, if two terms share an annotation, then their distance is 0, while if they belong to sibling classes (i.e., have an immediate common ancestor), then their distance is (1 + 1)/2 = 1. The maximal distance between two classes (and, consequently, associated terms) in the Genia ontology is 10.
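The distance just described can be sketched in Python as follows. The class hierarchy used here is a hypothetical toy tree given as a child-to-parent map, not the actual Genia ontology, and the function names are assumptions made for the example.

```python
# Hypothetical toy hierarchy (child -> parent); NOT the Genia ontology.
PARENT = {"A": "root", "B": "root", "A1": "A", "A2": "A", "A1a": "A1"}


def path_to_root(cls: str) -> list[str]:
    """The class itself followed by all of its ancestors up to the root."""
    path = [cls]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def class_distance(c1: str, c2: str) -> float:
    """Mean number of edges from each class to the nearest common ancestor."""
    path1, path2 = path_to_root(c1), path_to_root(c2)
    edges_from_c2 = {node: i for i, node in enumerate(path2)}
    for edges_from_c1, node in enumerate(path1):
        if node in edges_from_c2:
            return (edges_from_c1 + edges_from_c2[node]) / 2
    raise ValueError("classes do not share an ancestor")
```

In this toy tree, two terms annotated with the same class are at distance 0, the sibling classes A1 and A2 are at distance (1 + 1)/2 = 1, and A1a and A2 are at distance (2 + 1)/2 = 1.5.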

In the experiments reported here, we assume that terms with distances that are less than or equal to 1 (i.e., terms that belong either to the same class or to sibling classes) are highly related, while terms within the distance of 3 are considered as related. For example, the distance between cell survival and cell cycle progression is 0 (highly related terms), between NF kappa B and transcription factor is 1 (highly related), and between CRE and CRE binding protein is 2.5 (related terms). On the other hand, for the purpose of this study, terms with distances above 3 were deemed weakly or nonrelated. Examples of term distances and the values of their similarities mined from the Genia resources are given in Table VI.

We analyzed links between terms from a controlled set of 1749 terms with frequencies of occurrence above 5 in the Genia corpus. Table VII shows the distribution of distances in the controlled set: 15.07% of term pairs contain terms that belong to the same class, 22.32% of pairs have term distances ≤ 1, and 57.54% have distances ≤ 3.

Table VII. Distribution of Term Distances in the Controlled Set

| Distance | 0 | ≤1 | ≤3 | >3 |
|---|---|---|---|---|
| Term pairs (%) | 15.07 | 22.32 | 57.54 | 42.46 |

In order to evaluate the approach, we measured the distances (induced by the Genia ontology) among terms that have been suggested as related by their similarities mined from the corpus. These distances were used to assess the effectiveness of the approach. More precisely, for the suggested set of related term pairs, we calculated separate precision/recall values with respect to distances 0, ≤ 1, and ≤ 3 as follows. Precision with respect to distance d is the number of extracted pairs whose distance is less than or equal to d in the Genia ontology, over the number of all extracted pairs. Recall with respect to distance d is the number of extracted pairs whose distance is less than or equal to d, over the total number of pairs from the controlled set whose distance is less than or equal to d in the Genia ontology. F measure with respect to distance d was calculated by taking into account the corresponding precision/recall values. We evaluated individual types of term similarities with various thresholds (i.e., when similarities were above a certain value), as well as a combination of the measures (see Section 4.6). Before presenting the results, we briefly describe the implementation of the methods presented in Section 3.
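These definitions can be written down as a short Python sketch. It assumes that the ontology distances of all term pairs in the controlled set are available in a dictionary and that the F measure is the balanced harmonic mean of precision and recall; the text does not state the weighting, so this is an assumption, as are the function and parameter names.

```python
Pair = tuple[str, str]


def evaluate(
    extracted: set[Pair], distances: dict[Pair, float], d: float
) -> tuple[float, float, float]:
    """Precision, recall, and (balanced) F measure with respect to distance d.

    `distances` maps every pair in the controlled set to its ontology distance.
    """
    correct = sum(1 for pair in extracted if distances[pair] <= d)
    relevant = sum(1 for dist in distances.values() if dist <= d)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / relevant if relevant else 0.0
    total = precision + recall
    # Assumption: balanced F measure (harmonic mean of precision and recall).
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For example, if two pairs are extracted and only one of them has a distance ≤ 3, while the controlled set contains three pairs with distances ≤ 3, precision is 1/2 and recall is 1/3.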

4.2 Implementation of the Methods

Lexical similarities were calculated for each pair of the controlled terms using Eq. (1). As indicated in Section 3.1, the method for assessing lexical similarities depends on the neutralization of term variations (including expansion of acronyms) and the accurate identification of terminological heads. The latter was mainly dealt with using heuristics: in English, the head is typically the rightmost noun, but, in some cases, biomedical terms appear in a prepositional form (e.g., level of transcription) or end with a postmodifier (e.g., human immunodeficiency virus type 1 or tumor necrosis factor alpha). If the heads are not correctly recognized (as level, virus, and factor, respectively, for the above examples), then the corresponding similarities will not be consistent. Therefore, some neutralization of variation (such as prepositional and inflectional) is necessary to help in this process. For example, the Genia term nuclear factor of activated T cells needs to be normalized (i.e., transformed) into activated T cell nuclear factor prior to any comparisons. In order to neutralize inflectional and simple structural variations, we used the method described in Nenadić et al. [2004b]. This method essentially generates singular terms, resolves acronyms, and transforms (by inversion) each term containing a preposition to an equivalent form.
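The inversion step can be illustrated with a deliberately crude Python sketch. It is not the method of Nenadić et al. [2004b]: it handles only a single preposition "of", uses a naive suffix rule for singularization, and does not resolve acronyms; the function name is hypothetical.

```python
import re


def normalize_term(term: str) -> str:
    """Rewrite 'X of Y' as 'Y X', crudely singularizing the last word of Y."""
    match = re.fullmatch(r"(.+?) of (.+)", term)
    if not match:
        return term
    head_part, modifier = match.groups()
    words = modifier.split()
    last = words[-1]
    # Naive singularization (assumption): strip a plural -s, but leave
    # words ending in -ss, -us, or -is untouched.
    if last.endswith("s") and not last.endswith(("ss", "us", "is")):
        words[-1] = last[:-1]
    return " ".join(words + [head_part])
```

With this sketch, nuclear factor of activated T cells becomes activated T cell nuclear factor, so that the last word of the normalized term (factor) can serve as its head.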

The extraction of syntactic and contextual similarities needs corpus preprocessing, but the Genia corpus has already been tagged with the necessary information. Term occurrences have been marked in the corpus and we have, in addition, cross-linked some types of terminological variants⁶ (such as acronyms and terms containing prepositions) with respective terms (for details, see also Nenadić et al. [2004b]), in order to improve coverage of the mined links between them.

Syntactic similarities were extracted using the patterns presented in Table III, which have been applied to a version of the Genia corpus where only terms and coordination conjunctions have been tagged. We applied the patterns separately for enumerations and term conjunctions. As term coordinations were disambiguated and marked in the Genia corpus, there were no ambiguities when processing term coordinations and term conjunctions. Pairwise similarities were calculated as follows: if two terms appeared in any expression described by the patterns, the syntactic measure for the pair was set to 1, and to 0 otherwise. Thus, when calculating syntactic similarity, we did not discriminate among different relationships among terms (represented by various patterns), but instead considered terms appearing in the same syntactic role in the same sentence as (highly) related.

Contextual similarities were extracted from the POS-tagged version of the Genia corpus. We first collected concordances for all controlled terms. For each term occurrence, the maximal⁷ left and right contexts were extracted (without crossing the sentence boundary) and normalized. Contexts containing “stop-list” constituents (e.g., report on, result in) were removed. From the remaining contexts, we first discarded noninformative constituents, namely adjectives (that are not part of terms), adverbs and determiners, as well as so-called linking “expressions” (e.g., however, moreover). Adjacent nouns/noun phrases were replaced by appropriate regular expressions. At the same time, constituents deemed relevant were instantiated. For the experiments reported here, we instantiated terms and verbs found in contexts, as we have previously shown that they were useful for mining relationships [Nenadić et al. 2003b; Spasic et al. 2003]. Finally, nested contexts were generated by “trimming” the left/right side until the contexts of the minimal length were reached.

Once the CPs had been extracted, we calculated their CP values in order to estimate their importance. The top 5% of the ranked patterns were discarded as too general, while the lower CP value threshold was chosen empirically after several experiments. Each controlled term was then associated with the set of remaining CPs in which it appeared, and pairwise similarities were calculated using formula (3).
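A minimal sketch of this filtering step is given below. It assumes that CP values are already available as a mapping from pattern to score; the function name, the parameter names, and the example threshold are hypothetical, since the empirically chosen lower threshold is not stated here.

```python
def select_patterns(cp_values, top_percent=5, min_value=0.0):
    """Keep patterns that are neither too general nor below the threshold."""
    ranked = sorted(cp_values, key=cp_values.get, reverse=True)
    n_general = len(ranked) * top_percent // 100  # size of the discarded top slice
    return {p for p in ranked[n_general:] if cp_values[p] >= min_value}


# Hypothetical scores for 20 patterns, p0 being the highest ranked.
cp = {f"p{i}": 20 - i for i in range(20)}
kept = select_patterns(cp, top_percent=5, min_value=5)
```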

4.3 Evaluation of Lexical Similarities

The results achieved with lexical similarities generally show low coverage and high precision. More precisely, by using lexical similarities, we were able to extract only 5% of all term pairs that are highly related (distance ≤ 1) in the controlled set. For more semantically distant terms (distances ≤ 3), recall was as low as 2.8%. On the other hand, lexical links were fairly accurate. Figure 1

⁶ Note that terminological variants are not marked or linked in the Genia corpus.

⁷ The minimal and maximal lengths of CPs were chosen empirically; in the experiments reported in this article, these lengths were set to 2 and 10, respectively.


Fig. 1. Precision of lexically related term pairs with regard to their distances (various threshold values for LS are given on the X axis).

shows precision for various thresholds used to cut off pairs with lower LS. For example, if we consider all lexical similarities (the threshold set to 0), then in 42% of the cases the terms involved belonged either to the same class or to sibling classes (distance ≤ 1), while as many as two thirds of term pairs with LS above 0.25 had terms belonging to the same class, as did all terms with LS > 0.90. However, for such values, the number of involved terms and term pairs fell dramatically: for example, precision of 99% (with respect to terms with distances ≤ 3) was achieved at the recall point of only 1%.
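The threshold-based precision discussed here can be computed along the following lines. The sketch assumes that each extracted pair is available as a (similarity, ontology distance) tuple and that a pair is selected when its similarity is strictly above the threshold; both the data layout and the sample values are hypothetical.

```python
def precision_at(pairs, threshold, max_distance):
    """Share of pairs above the threshold whose distance is <= max_distance."""
    selected = [dist for sim, dist in pairs if sim > threshold]
    if not selected:
        return None  # no pairs survive the cut-off
    return sum(dist <= max_distance for dist in selected) / len(selected)


# Hypothetical (LS value, class distance) pairs.
scored = [(0.9, 0), (0.5, 1), (0.3, 4), (0.1, 2)]
```

Raising the threshold shrinks the selected set, which is why high precision values are reached only at very low recall.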

4.4 Evaluation of Syntactic Similarities

Similar results were obtained for syntactic similarities. In 71% of the cases, terms occurring in an enumeration expression belonged to the same class (precision for distance 0), in 86% of the cases terms found in term enumerations were members of either the same class or sibling classes (distance ≤ 1), and virtually all were within distance ≤ 3. In the case of term conjunctions, the results were slightly less accurate (the corresponding values were 66, 76, and 98%, respectively). The average distance among terms appearing in conjunction expressions was also double the average distance of terms appearing in enumerations. This suggests that term enumerations express stronger similarity than term conjunctions. However, enumerations were eight times less frequent than conjunctions.

With respect to the coverage, the results were expectedly disappointing: only 0.25% of term pairs that belonged to the same class had syntactic similarities extracted from the Genia corpus. Of course, the size of the testing corpus here is a limiting factor; a larger corpus may reveal a larger coverage for this measure.

4.5 Evaluation of Contextual Similarities

As opposed to LS and SS, contextual similarities resulted in significantly higher coverage and modest precision. When no threshold for CS values was used, 15% of contextually related terms belonged to the same class (precision for distance 0) and 58% of them were within distances ≤ 3 (related terms). However, in the


Fig. 2. Precision of contextually related term pairs with regard to their distances (various threshold values for CS are given on the X axis).

latter case, the contextual measure covered more than 80% of related term pairs (compared to 2.8% covered by lexical similarity). When the threshold value was set to one-half of the maximal CS value in the controlled set, almost one-half of contextually related pairs contained terms either from the same or from sibling classes, while in 78% of cases distances were ≤ 3 (see Figure 2). The results for contextual similarities were, to some extent, comparable to those achieved for lexical similarities in terms of precision when extracted term sets with a comparable number of elements were considered.

Further experiments with contextual similarities have demonstrated that terms belonging to the same or sibling classes have a higher degree of contextual relatedness than terms belonging to different classes. More precisely, for the controlled set of terms and associated classes, we calculated that the average CS for terms that belonged to the same class was 0.15 (microaverage⁸) and 0.17 (macroaverage) compared to 0.12 for across-classes pairs.
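The difference between the two averages can be illustrated with a short sketch. It assumes that the within-class CS values are grouped by class in a dictionary and that microaveraging pools all values before taking the mean; the pooling unit, the class names, and the numbers are illustrative assumptions, not data from the experiments.

```python
from statistics import mean


def micro_macro(values_by_class):
    """Return (microaverage, macroaverage) of values grouped by class."""
    pooled = [v for values in values_by_class.values() for v in values]
    micro = mean(pooled)  # every value counts equally
    macro = mean(mean(values) for values in values_by_class.values())  # every class counts equally
    return micro, macro


# Hypothetical within-class CS values; each class must be non-empty.
cs_by_class = {"protein": [0.1, 0.1, 0.1, 0.1], "DNA": [0.3]}
micro, macro = micro_macro(cs_by_class)
```

A large class dominates the microaverage, whereas the macroaverage gives small classes the same weight as large ones.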

As an additional test, we further examined the quality and consistency of contextual similarities by comparing the values of CS among variants that denoted the same terms (in this case, we used the original Genia corpus without linking variants). Typically, terminological variants were mutually highly contextually similar and, in the majority of cases, even the most similar⁹ (see Table VIII for examples). This demonstrates that CS can be used as a consistent indicator of relatedness between terms.

4.6 Combining Similarities

In order to make use of all information mined from the literature and in an attempt to improve accuracy and coverage of the mined relationships, we experimented with a linear combination of the similarity measures. We calculated

⁸ In microaveraging, precision is averaged over the number of terms, while macroaverage gives the mean precision for each class [Yang 1997].

⁹ For example, HIV and human immunodeficiency virus were mutually the most contextually similar terms.


Table VIII. Examples of Contextual Similarity Among Terminological Variantsᵃ

Term1                                 Term2                                  CS
open reading frame                    open reading frames                    0.53
transcription activation              transcriptional activation             0.39
HIV infection                         AIDS                                   0.34
HIV 1                                 AIDS                                   0.25
HIV                                   human immunodeficiency virus           0.23
HIV                                   HIV 1                                  0.19
human immunodeficiency virus          human immunodeficiency virus type 1    0.18
human T cell leukemia virus type 1    human T cell leukemia virus type I     0.18

ᵃ The maximal CS in the Genia corpus was 0.60.

a hybrid term similarity measure (called the CLS similarity) as follows:

$$\mathrm{CLS}(t_1, t_2) = \alpha\,\mathrm{CS}(t_1, t_2) + \beta\,\mathrm{LS}(t_1, t_2) + \gamma\,\mathrm{SS}(t_1, t_2) \qquad (4)$$

where α + β + γ = 1. The choice of the weights in Eq. (4) is not a trivial problem, and an automatic learning method can be used to suggest an optimal solution [Spasic et al. 2002]. For the experiments performed here (see the following section), the best performance in terms of F measure was achieved for α = 0.2, β = 0.7, and γ = 0.1.
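As a minimal sketch, the linear combination can be written as a single function whose default weights are the values reported above. The function name and the weight check are illustrative; whether the individual measures were rescaled before being combined is not stated here, so the inputs are used as given.

```python
def cls_similarity(cs, ls, ss, alpha=0.2, beta=0.7, gamma=0.1):
    """Weighted sum of contextual, lexical, and syntactic similarity."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("the weights must sum to 1")
    return alpha * cs + beta * ls + gamma * ss


# Hypothetical pair: CS = 0.5, LS = 0.4, and the terms share an enumeration (SS = 1).
score = cls_similarity(0.5, 0.4, 1)
```

With these weights the lexical component dominates the combined score, while a syntactic link adds at most 0.1.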

4.7 Comparisons

Tables IX–XI show detailed comparisons among the different similarity metrics¹⁰ with respect to distances 0, 1, and 3. In all cases, contextual similarities have much wider coverage, and their F scores are significantly higher than those for lexical and syntactic similarities. Precision-wise, syntactic similarities are the most accurate, but at extremely low recall. The results have also shown a slightly improved F measure for CLS compared to any measure on its own. By combining various types of metrics, we achieved F measures of 68% for related terms (distances ≤ 3) and 37% for highly related entities (distances ≤ 1).
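The reported F measures are consistent with the balanced F measure, that is, the harmonic mean of precision and recall. The sketch below assumes that this standard definition is the one used and recomputes a few values from the precision and recall figures given in Tables X and XI.

```python
def f_measure(precision, recall):
    """Balanced F measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from Tables X and XI.
f_cs_related = round(f_measure(0.57, 0.82), 2)   # CS, distances <= 3
f_cls_related = round(f_measure(0.58, 0.82), 2)  # CLS, distances <= 3
f_cls_high = round(f_measure(0.24, 0.83), 2)     # CLS, distances <= 1
```

Because the harmonic mean is pulled toward the smaller of the two values, the very low recall of LS and SS keeps their F measures low despite high precision.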

In order to compare the methods presented in this article with other approaches, we further analyzed term associations based on standard term co-occurrences within sentences and abstracts (using the same controlled set). We adopted the following approach: two terms were deemed semantically related if they cooccurred in the same sentence/abstract. The recall results show that when using term co-occurrences within sentences and abstracts in the Genia corpus, we were able to extract only 1.5 and 4.5% of all pairs with distances ≤ 3, respectively. As for precision, as one can expect, within-sentence co-occurrences were more accurate: in 30% of the cases terms cooccurring within a sentence have distances ≤ 1, compared to 24% for within-abstract co-occurrences. If we consider terms with distances ≤ 3, then similarities based on term co-occurrences have precision of 60% for within-sentence co-occurrences, and 57% for abstract-based co-occurrences. Thus, in more than 40% of the cases cooccurring term pairs have distances > 3 (these outcomes confirm the results presented previously in Jenssen et al. [2001] and Tao and Leibel [2002]). In most cases, the performance of cooccurrence-based similarities is, to some

¹⁰ In all cases, the thresholds were set to 0.


Table IX. Comparison of the Precision, Recall, and F Measure Values for the Similarity Measures (Distances = 0)

Similarity Measure           Recall    Precision   F Measure
LS                           0.05      0.36        0.09
SS                           0.0025    0.65        0.005
CS                           0.83      0.15        0.25
CLS                          0.83      0.16        0.27
cooccurrence in sentences    0.02      0.20        0.04
cooccurrence in abstracts    0.05      0.16        0.08

Table X. Comparison of the Precision, Recall, and F Measure Values for the Similarity Measures (Distances ≤ 1)

Similarity Measure           Recall    Precision   F Measure
LS                           0.04      0.42        0.07
SS                           0.002     0.76        0.004
CS                           0.83      0.23        0.36
CLS                          0.83      0.24        0.37

Table XI. Comparison of the Precision, Recall, and F Measure Values for the Similarity Measures (Distances ≤ 3)

Similarity Measure           Recall    Precision   F Measure
LS                           0.03      0.78        0.06
SS                           0.0009    0.93        0.002
CS                           0.82      0.57        0.67
CLS                          0.82      0.58        0.68

extent, comparable to lexical relatedness and outperforms syntactic similarities, while the performance of contextual similarities is significantly better than cooccurrence-based relationships. More precisely, while the precisions of the cooccurrence-based approach and contextual similarities are comparable, the recall of the latter is considerably better.
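The co-occurrence baseline and its evaluation can be sketched as follows. The sketch assumes that each sentence (or abstract) has been reduced to the list of controlled terms it contains and that a gold set of related pairs has been derived from the ontology distances; the helper names and the toy data are hypothetical.

```python
from itertools import combinations


def cooccurring_pairs(units):
    """Unordered term pairs that appear together in at least one unit."""
    pairs = set()
    for terms in units:
        pairs.update(combinations(sorted(set(terms)), 2))
    return pairs


def precision_recall(predicted, gold):
    """Precision and recall of predicted pairs against gold related pairs."""
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical units (sentences) and gold pairs (e.g., distance <= 3).
sentences = [["a", "b"], ["b", "c", "d"]]
gold = {("a", "b"), ("a", "c"), ("c", "d")}
p, r = precision_recall(cooccurring_pairs(sentences), gold)
```

Passing abstracts instead of sentences as units gives the within-abstract variant of the same baseline.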

5. DISCUSSION

The presented methods for mining semantically related terms are based on either internal lexical similarities or external aspects of term occurrences in documents (co-occurrences, syntactic, and contextual similarities). The internal aspect makes use of naming associations that have been “built” into terms, while the external aspects rely on various levels of similarity in their usage.

While co-occurrences rely on simple within-sentence or within-document distributions, syntactic similarities capture appearances in specific expressions (i.e., phrases) and contextual similarities indicate overall resemblances of contexts in which terms appear. In general, our results suggest that phrases (used for syntactic similarities) were relatively more accurate than other approaches, but extremely sparse. Term co-occurrences within sentences and documents might not reveal many term links as they are typically confined to single sentences/documents, while contextual similarities integrate information from various sources and, consequently, improve recall.

The results presented show that only 5% of the highly related terms (terms with distances ≤ 1) are lexically linked. In terms of recall, LS is more effective for highly than for weakly related terms. Lexical similarities are also accurate in predicting links among terms that have high values for LS. To further examine the consistency of the measure, we analyzed within-class lexical similarities for a subset of the Genia classes induced by the controlled set of terms and also the mean across-class LS. Apart from the protein class, the average LS among term pairs belonging to the same class (microaverage 0.43) was greater than the average LS for the whole term collection (0.27), which, in turn, was greater than the average LS among terms belonging to different classes (0.18).

The only class with the mean LS lower than the average for the whole collection was the protein class. This confirms that protein names do exhibit a higher degree of lexical variability than names of other biological classes (as indicated in Fukuda et al. [1998] and Tanabe and Wilbur [2002]).

Relationships that rely on term co-occurrences in enumeration and conjunction expressions provide a similarity measure with the highest precision, but with extremely low recall, as terms do not frequently appear in such expressions relative to the number of occurrences (in particular, for smaller corpora). As indicated above, the size of the corpus is an important factor and a larger corpus may reveal better recall. Analogously, bigger corpora may improve the performance of cooccurrence-based relationships.

Contextual similarity presented here is, to some extent, a generalization of the work of Hindle [1990] and Grefenstette [1994], who used only subject–verb, object–verb, and adjective–noun relationships as a basis for establishing word similarities. On the other hand, we use more general patterns that describe various contexts, not necessarily predicate–argument structures. Further, our approach generalizes the contextual clustering approach presented in Maynard and Ananiadou [1999], which was based on a set of manually predefined semantic frames that were semiautomatically tuned by corpus processing.

Contextual similarities have shown significantly higher recall compared to other measures. Although one can argue that these results could be biased to some extent as we considered more frequently occurring terms, additional experiments have shown that similar results were obtained for more or less frequently occurring terms. The experiments have also demonstrated the consistency of CS, as terms belonging to the same or sibling classes have a higher degree of contextual similarity than terms belonging to different classes.

In addition, tests on contextual correspondence among equivalent term variants have shown that the CS performance is coherent: in many cases, terminological variants (i.e., synonyms) are mutually the most contextually similar counterparts.

Since it is based on capturing unrestricted recurrent contextual patterns, CS can reveal not only named and known relationships (as is the case with LS and SS), but also some latent links among terms, which can be beneficial for mining new or unknown relationships among entities. For example, CS revealed links (such as the link between breast cancer and cell proliferation, or between 9-cis-RA and signal transduction pathways [Heyman et al. 1992]) that were not captured by the other two approaches (lexical and syntactic). In our controlled set, only 2% of related term pairs could have been discovered by each of the three metrics. By combining single similarities into the CLS similarity, we used all information mined for a pair of terms.

Note that the results obtained on the controlled set may be biased, to some extent, as our evaluation methodology is mainly based on the hyponymy relationships represented through the Genia ontology, which are, in some cases, consistently reflected via lexical modifications of terms. This suggests that the precision of LS might be lower if another evaluation environment were used. Also, syntactic similarity could have benefited more from the chosen methodology.

On the other hand, some relationships recognized by CS were considered false positives as they have not been suitably represented in the Genia ontology. For example, although suggested as contextually related, the pair glucocorticoid receptor and glucocorticoid receptor function was considered a false positive as their distance was above 3. Thus, precision of CS might be higher if another evaluation scheme was used.

Further improvements can be made to the measures presented. For example, contextual similarity can be enhanced by incorporating additional weights and statistical and distributional properties for comparing term contextual profiles.

For example, if two terms appear exclusively and/or frequently in a certain context, then this fact is more important than an “incidental” sharing of a context pattern. Another challenge is to handle modality and negation if appearing in a given context. Further, links based on various syntactic similarities among terms (represented by different patterns) can be weighted (e.g., enumerations seem to be more accurate than conjunctions); the values of syntactic similarity can be also parameterized by the number of patterns in which two terms appear simultaneously. Lexical similarity can be generalized (in particular, for single-word terms) by combining alternative methods for lexical comparison (e.g., approximate string matching or character-based n-gram comparisons).

Finally, the three measures were integrated by their linear combination, but other approaches (such as a polynomial combination) could improve performance. A further challenge is to automatically discover the type of the link among semantically related terms.

The suggested method for the extraction of semantically related terms can be used for several biomedical text-mining scenarios. Apart from mining links among terms, extracted similarities can be used as a basis for term classification [cf. Spasic et al. 2004; Spasic and Ananiadou 2005] and for term sense disambiguation (e.g., by comparing a contextual pattern corresponding to a given ambiguous term occurrence with patterns relevant to each of the term senses). Furthermore, the most significant CPs extracted from a domain corpus may be used to semiautomatically suggest patterns relevant for various information extraction tasks. For example, the pattern V:inhibit TERM1 PREP:of TERM2, associated with transcription factor and other related terms, can be used as a template to extract information about the inhibition of certain bioprocesses: TERM1 typically contains the information about the process¹¹ in question, which is influenced by a given transcription factor, while TERM2 fills the “slot” corresponding to the respective target¹² of the inhibition.

Finally, the demand from the biomedical user community is directed toward systems that are able to recognize, extract, and relate entities and events from a large body of literature, so that they can be visualized and analyzed [Fukuda and Takagi 2004; Camon et al. 2005]. For this, the extraction of term relationships is essential. For example, many researchers are interested in information latent in huge repositories of biomedical text that can help in generating new hypotheses, and are mainly interested in wider coverage of possible links between entities. Once relatedness between two terms has been identified, it can be used either to propagate knowledge that we have about one of them to the other, or to hypothesize a novel link between them. This would be particularly feasible when text mining is harnessed with experimental data derived by postgenomic techniques, such as expression array and sequence analysis.

Outlier detection between textual and nontext information can also be a very powerful method for knowledge discovery. If, for instance, entities that appear linked from the results of text-mining behave very differently under a particular set of experimental conditions, then this can suggest the experiment is uncovering something that was previously unknown and is worthy of further investigation. Similarly, relationships mined from text can reveal some inconsistencies or contradictions in the literature, or identify gaps in existing knowledge by suggesting possible links among entities.

6. CONCLUSION

In this article we presented and evaluated a method for the automatic mining of semantically related terms that is based on comparison of their lexical, syntactic, and contextual profiles. One of the most important advantages of the approach is that it is entirely data-driven, as the terminological information is collected automatically from the literature without using external resources.

Lexical similarities are based on the level of sharing constituents among terms and are highly accurate. However, lexical similarity has low coverage: only 5% of the semantically closest terms are lexically related. Syntactic similarities rely on co-occurrences in specific expressions (such as enumerations and term conjunctions), which provide a term similarity measure with high precision, but with extremely low recall. These two methods are rather limited when it comes to discovering new relationships among terms, as they mainly rely on explicitly expressed associations (within term names or specific phrases). Therefore, if we aim at supporting systematic knowledge acquisition and discovery, then higher recall and contextual usage patterns are essential. In our approach, contextual similarities are extracted by automatic pattern mining. Such patterns are used as an approximation of the contexts

¹¹ In the Genia corpus, processes found in this pattern include transcription, activation, apoptosis, differentiation, translation, cell death, proliferation, etc.

¹² For example, mRNA, carcinoma cells, macrophages, plasmids, etc.


in which terms appear. Contextual similarities are not confined to information represented only in individual documents, as the method collects patterns from several documents. Compared to other measures, recall of contextual similarity was significantly higher (for similar precision values). Overall, the combined CLS metric achieved F measures of 68% for semantically related terms and 37% for highly related entities.

As opposed to approaches that are designed to extract only predefined and specific types of relations, our method can reveal not only named and explicit relationships, but also some latent links among terms. Such links can be beneficial for mining new or unknown relationships among entities, since the method is based on capturing unrestricted recurrent contextual patterns. This can assist in the process of discovering and formulating new hypotheses or predictions by possibly suggesting new relations among terms. Term relationships can also help in bridging the gap that exists between collective knowledge (represented by the domain literature) and individual requirements (or acquaintance) of domain specialists. They can be used to support systematic curation, updates and adjustments of existing terminological resources, resolving term ambiguities, semantic document indexing (annotation), similarity-based document retrieval, document categorization/classification, as well as learning domain-specific patterns relevant for information extraction. All these comprise future research directions.

ACKNOWLEDGMENTS

The UK National Centre for Text Mining (NaCTeM, http://www.nactem.ac.uk) is funded by the Joint Information Systems Committee (JISC), the Biotechnology and Biological Sciences Research Council (BBSRC), and the Engineering and Physical Sciences Research Council (EPSRC).

REFERENCES

ANANIADOU, S. 1994. A methodology for automatic term recognition. In Proceedings of COLING 94, Kyoto, Japan. 1034–1038.

ANANIADOU, S. AND NENADIĆ, G. 2006. Automatic Terminology Management in Biomedicine. In S. Ananiadou, J. McNaught (Eds.), Text Mining for Biology and Biomedicine, Artech House Books, pp. 67–98.

ANDRADE, M. AND VALENCIA, A. 1997. Automatic annotation for biological sequences by extraction of keywords from Medline abstracts. Development of a prototype system. In Proceedings of Intelligent Systems for Molecular Biology 5, 1, 25–32.

BLAKE, C. AND PRATT, W. 2001. Better rules, fewer features: a semantic approach to selecting features from text. In Proceedings of IEEE Data Mining Conf, San Jose, California. 59–66.

BLAKE, J. A., RICHARDSON, J. E., BULT, C. J., KADIN, J. A., EPPIG, J. T., AND THE MOUSE GENOME DATABASE GROUP. 2003. MGD: the Mouse Genome Database. Nucleic Acids Research 31, 193–195.

BLASCHKE, C., ANDRADE, M., OUZOUNIS, C., AND VALENCIA, A. 1999. Automatic extraction of biological information from scientific text: protein-protein interactions. In Proceedings of Intelligent Systems for Molecular Biology 99, 60–67.

BOURIGAULT, D. AND JACQUEMIN, C. 1999. Term extraction + term clustering: an integrated platform for computer-aided terminology. In Proceedings of the 8th EACL, Bergen. 15–22.

CAMON, E., MAGRANE, M., BARRELL, D., BINNS, D., FLEISCHMANN, W., KERSEY, P., MULDER, N., OINN, T., MASLEN, J., COX, A., AND APWEILER, R. 2003. The Gene Ontology Annotation (GOA) project: