Protein-protein Interaction: Network Alignment∗

Lecturer: Roded Sharan
Scribes: Ofer Lavi and Lev Ferdinskoif
Lecture 9, December 21, 2006

1 Introduction

In the last few years the amount of available data on protein-protein interaction (PPI) networks has increased rapidly, spanning different species such as yeast, bacteria, fly, worm and human. This rapid growth is shown in Figure 1. Besides the availability of the data, other incentives for analyzing several PPI networks at once are the validation of conclusions across several networks and the prediction of unknown protein functions and interactions.

The scribe is organized as follows: In section 2 we describe the network alignment and network querying problems. In section 3 we describe pairwise network alignment and its use in finding conserved protein paths and complexes in PPI networks. The comparative analysis approach, which allows us to detect similar functionality by looking at multiple highly conserved interactions between similar proteins from different species, is presented in section 4 using QPath, an efficient algorithm for path queries based on dynamic programming. In section 5 we show how PPI networks can be used to predict functionally orthologous genes. In section 6 we describe a model that extends the pairwise alignment model to multiple alignment.

Finally, section 7 contains a brief summary.

2 Network Alignment and Querying

A fundamental problem in molecular biology is the identification of cellular machinery, that is, protein pathways and complexes. PPI data present a valuable resource for this task, but interpreting them is a considerable challenge because of the high noise levels in the data and the lack of good models for pathways and complexes. Comparative analysis is used to tackle these problems and to improve the accuracy of the predictions.

The main paradigm behind comparison of PPI networks is that evolutionary conservation implies functional significance. Conservation of protein subnetworks is measured both in terms of protein sequence similarity, and in terms of similarity in interaction topology.

This section describes some basic notions that appear in many previous works that find conserved pathways and complexes in the PPI networks of different organisms.

2.1 Network Alignment

A PPI network is conveniently modeled by an undirected graph G(V, E), where V denotes the set of proteins, and (u, v) ∈ E denotes an interaction between proteins u ∈ V and v ∈ V.

∗Based on a scribe by Irit Levy and Oved Ourfali, 2005


Figure 1: This graph shows the number of species whose PPI networks have been measured. The number of species has grown rapidly since 2003.

The network alignment problem: Given k different PPI networks belonging to different species, we wish to find conserved subnetworks within these networks. In order to find these conserved subnetworks an alignment graph is built. This graph consists of nodes representing sets of k sequence-similar proteins (one per species), and edges representing conserved interactions between the species. An illustration of such an alignment is shown in Figure 2. This concept was first introduced and used by Ogata et al. [13] and Kelley et al. [10].

Creating an alignment graph from a set of k original networks is one heuristic that enables us to search all k PPI networks simultaneously. A heuristic approach is required here since the problem of finding conserved subnetworks in a group of networks is NP-hard: subgraph isomorphism, which is known to be NP-complete, can be reduced to it. Other heuristics or approximation methods are applicable as well.
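To make the construction concrete, the following is a minimal Python sketch of an alignment graph for k = 2 species. It is an illustration under simplifying assumptions, not the procedure of [10] or [13]: the function name, the toy protein names and the rule that an edge requires a direct interaction in both networks are all assumptions made for this example.

```python
from itertools import combinations

def build_alignment_graph(edges_a, edges_b, similar_pairs):
    """Nodes are sequence-similar protein pairs (a, b), one protein per species.
    An edge joins two nodes when the interaction exists in both networks."""
    # Store interactions as unordered pairs so that (u, v) equals (v, u).
    inter_a = {frozenset(e) for e in edges_a}
    inter_b = {frozenset(e) for e in edges_b}
    nodes = sorted(set(similar_pairs))
    edges = set()
    for (a1, b1), (a2, b2) in combinations(nodes, 2):
        if a1 == a2 or b1 == b2:
            continue  # an edge needs two distinct proteins in each species
        if frozenset((a1, a2)) in inter_a and frozenset((b1, b2)) in inter_b:
            edges.add(((a1, b1), (a2, b2)))
    return nodes, edges

# Hypothetical toy data: two small networks and three similar protein pairs.
yeast = [("Y1", "Y2"), ("Y2", "Y3")]
bacteria = [("B1", "B2"), ("B2", "B4")]
similar = [("Y1", "B1"), ("Y2", "B2"), ("Y3", "B3")]
nodes, edges = build_alignment_graph(yeast, bacteria, similar)
```

In this toy input only the interaction Y1-Y2 has a counterpart (B1-B2) between similar proteins, so the alignment graph has three nodes and a single edge.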

2.2 Network Querying Problem Definition

Given a PPI network G, and a subnetwork S, we wish to find subnetworks in G that are similar to S.

Similarity is measured both in terms of sequence similarity and topological similarity.

The network querying problem can be reduced to a network alignment problem, as shown by Kelley et al. [10], simply by aligning the subnetwork S with the network G. Also, more general formulations are possible, which allow the insertion of proteins into the matched subnetwork, or deletion of vertices from the query subnetwork S.

Network queries can be used to identify conserved functional modules across multiple species, as will be described in the following sections.

2.3 Protein Similarity

In order to build an alignment graph we need to define a similarity measure between proteins. First, let us define homology relations between proteins (Figure 3 illustrates speciation and duplication events and the protein relations described below):


Figure 2: This figure illustrates an alignment graph of two species. Each node is a pair of proteins, one per species, that exhibit a high level of sequence similarity. Edges represent conserved interactions between proteins in the original networks, meaning that they exist with a high level of confidence in both original networks.

Figure 3: This figure shows a gene that diverged after a speciation event into a mouse gene and a rat gene. Within each species the gene was then duplicated, giving rat gene 1 and rat gene 2 in the rat, and mouse gene 1 and mouse gene 2 in the mouse. Every pair of these genes is homologous. Every pair that consists of a rat gene and a mouse gene is orthologous, and every pair that consists of genes from the same species is paralogous.

• Orthologous proteins - two proteins from different species that diverged after a speciation event. In a speciation event one species evolves into a different species (anagenesis) or one species diverges to become two or more species (cladogenesis).

• Paralogous proteins - two proteins from the same species that diverged after a duplication event, in which part of the genome is duplicated.

• Homologous proteins - two proteins that have common ancestry. This is often detected by checking the sequence similarity between these proteins. The proteins can be either from the same species, or from different species (either orthologous or paralogous).

We define similar proteins as potentially homologous proteins, i.e. proteins whose sequences maintain a certain degree of similarity.


3 Pairwise Alignment

In this section we take a closer look at the network alignment problem for two PPI networks.

3.1 PathBLAST

Kelley et al. [10] introduced PathBLAST, an efficient computational procedure for aligning two PPI networks and identifying their conserved interaction pathways. This method searches for high-scoring pathway alignments involving two paths, one from each network, in which proteins of the first path are paired with putative homologous proteins occurring in the same order in the second path (Figure 4). Since PPI data are noisy, and in order to overcome evolutionary variations in module structures, both gaps and mismatches were allowed:

• Gaps - A gap occurs when a protein interaction in one path skips over a protein in the other path. In the global alignment graph this is shown by one direct protein interaction edge and one indirect protein interaction edge.

• Mismatches - A mismatch occurs when aligned proteins do not share sequence similarity, and thus are not a pair in the alignment graph. In the global alignment graph this is shown by two indirect protein interaction edges.

3.1.1 Global Alignment and Scoring

In order to build the global alignment graph we need to measure the similarity between proteins in the PPI networks. This similarity is measured using BLAST [2], which quantifies the similarity and assigns it a p-value, indicating the probability of observing such similarity at random. Protein sequence alignments were computed using BLAST 2.0 with parameters b = 0, e = 1 × 10⁶, f = "C; S", and v = 6 × 10⁵. BLAST 2.0 also computes an E-value, or Expectation Value, for each BLAST hit, which is the number of different sequence pairs with a score equivalent to or better than this hit's score that are expected to result from a random search. Unalignable proteins were assigned a maximum E-value of 5. A path through this combined graph represents a conserved pathway between the two networks. A log probability score S(P) for linear paths in the combined graph was formulated as follows:

$$S(P) = \sum_{v \in P} \log_{10} \frac{p(v)}{p_{random}} + \sum_{e \in P} \log_{10} \frac{q(e)}{q_{random}} \qquad (1)$$

where p(v) is the probability of true homology within the protein pair represented by v, and q(e) is the probability that the protein-protein interactions represented by e are indeed real, i.e., not false positives. The background probabilities p_random and q_random are the expected values of p(v) and q(e) over all possible vertices and edges in the combined graph.
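As a small illustration of equation (1), the sketch below evaluates the score of one path from its vertex and edge probabilities. The function name and all numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def path_score(node_probs, edge_probs, p_random, q_random):
    """Log-odds score of a path, following equation (1)."""
    # Vertex term: homology probability of each protein pair vs. background.
    node_term = sum(math.log10(p / p_random) for p in node_probs)
    # Edge term: interaction reliability of each edge vs. background.
    edge_term = sum(math.log10(q / q_random) for q in edge_probs)
    return node_term + edge_term

# Hypothetical path with 4 vertices and 3 edges; every ratio equals 10,
# so each vertex and each edge contributes log10(10) = 1 to the score.
score = path_score([0.9, 0.9, 0.9, 0.9], [0.5, 0.5, 0.5],
                   p_random=0.09, q_random=0.05)
```

A vertex or edge whose probability is below the background value contributes a negative term, so weak homologies and unreliable interactions lower the score of a path.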

3.1.2 Path Search in PathBLAST

After the alignment graph is built, it is searched for high-scoring simple paths of length 4. A simple path is one with no repeated nodes. Since the original networks and the alignment graph are undirected, finding simple paths using DFS and backtracking would be very costly.

In order to find simple paths efficiently, the random acyclic orientation technique [1] is used, in which acyclic subgraphs are generated by randomly assigning an orientation to each edge. Searching for a maximal-score simple path of length L in a directed acyclic graph can be done in linear time using dynamic programming. Hence, by generating a sufficient number, 5L!, of acyclic subgraphs, the maximal-score path of length L can be found with high probability in time linear in the size of the graph for fixed L.

Figure 4: Source [10]. This figure shows an example of a pathway alignment and its merged representation. (a) Vertical solid lines indicate direct protein-protein interactions within a single pathway, and horizontal dotted lines link proteins with significant sequence similarity (E-value ≤ E_cutoff). An interaction in one pathway may skip over a protein in the other (protein C), introducing a "gap". Proteins at a particular position that are dissimilar in sequence (E-value > E_cutoff, proteins E and g) introduce a "mismatch". The same protein pair may not occur more than once per pathway, and neither gaps nor mismatches may occur consecutively. (b) Pathways are combined as a global alignment graph in which each node represents a homologous protein pair and links represent protein interaction relationships of three types: direct interaction, gap (one interaction is indirect), and mismatch (both interactions are indirect).

Every directed path of length L in the acyclic subgraph is simple and corresponds to two paths, one from each of the two original networks. Although the acyclic orientation technique detects only simple paths in the alignment graph, it is possible that the corresponding paths in the original networks are actually not simple, due to one of the following:

1. If a path is not simple in only one of the PPI networks then this path may also not be simple in the alignment graph, due to the use of gaps.

2. Even if a path is simple in both PPI networks, it may not be simple in the alignment graph, due to the use of mismatches.

The probability that a path of length L in the original graph appears in a given acyclic subgraph is 2/L! (1/L! in each direction). Thus, by generating 5L! acyclic subgraphs we expect to find the optimal path in one of them: the probability of missing it in all of them is (1 − 2/L!)^(5L!), which is at most e^(−10).
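The search can be sketched in Python as follows. This is a minimal illustration, not the PathBLAST implementation. It assumes that the acyclic orientation is obtained by ranking the vertices at random and directing every edge from the lower to the higher rank, that L counts the vertices of a path, and that path scores are sums of hypothetical vertex and edge weights. All names are invented for the example.

```python
import math
import random

def best_path_one_orientation(nodes, edges, node_w, edge_w, L, rng):
    """Best path with L vertices in one random acyclic orientation."""
    rank = {v: i for i, v in enumerate(rng.sample(nodes, len(nodes)))}
    preds = {v: [] for v in nodes}
    for u, v in edges:
        if rank[u] > rank[v]:
            u, v = v, u
        preds[v].append(u)  # edge directed from lower to higher rank
    # best[v][k] holds (score, path) of the best directed path with
    # k + 1 vertices that ends at v, or None if no such path exists.
    best = {}
    for v in sorted(nodes, key=rank.get):
        best[v] = [(node_w[v], [v])] + [None] * (L - 1)
        for k in range(1, L):
            for u in preds[v]:
                if best[u][k - 1] is None:
                    continue
                s, path = best[u][k - 1]
                cand = s + edge_w[frozenset((u, v))] + node_w[v]
                if best[v][k] is None or cand > best[v][k][0]:
                    best[v][k] = (cand, path + [v])
    full = [best[v][L - 1] for v in nodes if best[v][L - 1] is not None]
    return max(full, key=lambda t: t[0]) if full else None

def best_simple_path(nodes, edges, node_w, edge_w, L, seed=0):
    """Repeat the search over 5 * L! random orientations, keep the best path."""
    rng = random.Random(seed)
    best = None
    for _ in range(5 * math.factorial(L)):
        cand = best_path_one_orientation(nodes, edges, node_w, edge_w, L, rng)
        if cand is not None and (best is None or cand[0] > best[0]):
            best = cand
    return best

# Hypothetical toy alignment graph with one cycle (A-B-C).
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C")]
node_w = {v: 0.0 for v in nodes}
edge_w = {frozenset(("A", "B")): 1.0, frozenset(("B", "C")): 2.0,
          frozenset(("C", "D")): 3.0, frozenset(("A", "C")): 0.5}
result = best_simple_path(nodes, edges, node_w, edge_w, L=3)
```

Every directed path in an oriented graph is simple, because ranks strictly increase along it, so the dynamic program never needs to check for repeated vertices. In the toy graph the best path with three vertices is B-C-D (in either direction), with score 5.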

Conserved regions of the network can be highly interconnected (e.g., a conserved protein complex), so it was sometimes possible to identify a large number of distinct paths involving the same small set of proteins. Rather than enumerating each of these, PathBLAST filtered them iteratively. In each iteration k, the 50 highest-scoring pathway alignments were recorded, with average score ⟨S_k⟩, and their vertices and edges were then removed from the alignment graph before the next iteration. The p-value of each iteration was assessed by comparing ⟨S_k⟩ to the distribution of average scores ⟨S_1⟩ observed over 100 random global alignment graphs (constructed as per the data in Figure 5), and was assigned to every conserved network region resulting from that iteration (Figures 6 and 7). The p-values for pathway queries (Figure 8) were computed individually, by comparing each pathway-alignment score to the best scores achieved over 100 random alignment graphs involving the query and the target (yeast) network.

3.1.3 Experimental Results

The authors performed three experiments:

1. Yeast (S. cerevisiae) vs. Bacteria (H. pylori): orthologous pathways between the networks of two species.

2. Yeast vs. Yeast: paralogous pathways within the network of a single species, by aligning the yeast PPI network versus itself.

3. Yeast vs. Yeast: interrogating the protein network with pathway queries, by aligning the yeast PPI network versus simple pathways.

3.1.4 Yeast vs. Bacteria: Orthologous Pathways Between the Networks of Two Species

In this experiment a global alignment between the PPI networks of yeast and bacteria was performed. The yeast network was constructed using the Database of Interacting Proteins (DIP) [29], as of November 2002, which included interactions from different data sets derived through systematic co-immunoprecipitation and two-hybrid studies. The bacterial network was also constructed using DIP and represented a single two-hybrid study (Rain et al. [8]).

Figure 5 compares the yeast and bacteria global alignment graph to the corresponding randomized networks obtained by permuting the protein names. As shown, both the graph size and the best pathway-alignment scores were significantly larger for the real aligned networks than for the randomized ones. This suggests that the two species indeed share conserved interaction pathways. Surprisingly, the conservation of a single direct interaction between both networks was rare (although such interactions were more frequent in the real networks than in the randomized ones). However, permitting "mismatches" and "gaps" made it possible to find much larger conserved regions. The use of gaps and mismatches allowed PathBLAST to overcome false negatives in the PPI data.

Figure 5: Source [10]. This figure summarizes the results of testing the yeast vs. bacteria networks, and the yeast network vs. itself. The results were compared against random graphs that were constructed by permuting the protein names in each network before creating the alignment graph. The first column presents the number of vertices (homologs) between the two tested networks. The second column presents the number of different edges constructed by the alignment algorithm. The third column presents the CPU time that was needed to create the alignment graph. The last column presents the average of all the scores vs. the average of the top 50 scores.

The top-scoring pathway alignments between bacteria and yeast are described in Figure 6. As validation that the pathway segments found indeed correspond to specific conserved cellular functions, it was observed that the network regions were significantly functionally enriched for particular protein functional categories from the Munich Information Center for Protein Sequences (MIPS - http://mips.gsf.de) for yeast, and the Institute for Genomic Research (TIGR - www.tigr.org) for bacteria.

In addition to the recognition of conserved pathways between the two PPI networks, other insights can be drawn from the results of this work. For example, due to the degree of freedom allowed in the alignment process, a conserved pathway can be found even though it includes a node corresponding to two proteins from the original networks that are not known to be similar. Their matching positions in the pathway topology suggest that these two proteins may have similar functions. Thus we can infer the function of an uncharacterized protein from the function of the characterized one, a prediction that can also be validated biologically.

Another insight we can deduce is a relation between seemingly unrelated processes, when a pathway of one process is aligned with a pathway of the other and the two are performed by aligned proteins in conserved structures (Figure 6).

3.1.5 Yeast vs. Yeast: Paralogous Pathways Within the Network of a Single Species

In addition to identifying homologous features between the protein networks of yeast and bacteria, a search was also performed within each network individually to identify its potentially paralogous pathways, that is, pathways with proteins and interactions that have been duplicated one or more times in the course of evolution.

Such an approach is similar to performing an ”all vs. all” BLAST of sequences encoded by a single genome in order to find gene families. This procedure was explored in the context of yeast, by constructing a global alignment graph merging the yeast protein interaction network with an identical copy of itself.

Figure 6: Source [10]. This figure shows the top-scoring alignments between the yeast (green vertices) and bacteria (orange vertices) networks, their functional annotation and their p-value. (a) Both PPI networks. (b) Protein synthesis and cell rescue functionality. (c) Protein fate (chaperoning and heat shock) functionality. (d) Cytoplasmic and nuclear membrane transport. (e) Protein degradation / DNA replication. (f) RNA polymerase and associated transcriptional machinery.

To ensure that pathway alignments occurred between two distinct network regions and to avoid aligning a path with its exact copy, proteins were not allowed to pair with themselves or their network neighbors.

The resulting graph was analyzed to obtain the 300 highest-scoring pathway alignments of length four, corresponding to a level of significance of p ≤ 0.0001. Several regions involve alignments between protein complexes with related functions, confirming that the approach is capable of identifying paralogous network structures (Figure 7).

3.1.6 Yeast vs. Yeast: Interrogating the Protein Network with Pathway Queries

The last experiment was to query a single protein network with specific pathways of interest. Using PathBLAST in this mode is similar to using BLAST to interrogate a sequence database with a short nucleotide or amino acid sequence query.

The yeast protein network was queried with a classic MAPK pathway associated with the filamentation response, consisting of a MAPK kinase kinase (Ste11), a MAPK kinase (Ste7), and a MAPK (Kss1). MAPK pathways transmit incoming signals to the nucleus through activation cascades in which each kinase phosphorylates the next one downstream. PathBLAST identified two other well-known MAPK pathways as the highest-scoring hits (the low- and high-osmolarity response pathways Bck1-Mkk1-Slt2 and Ssk2-Pbs2-Hog1), indicating that the algorithm was sufficiently sensitive and specific to identify known paralogous pathways.

This strategy was repeated to search for new components of the cellular ubiquitin and ubiquitin-like conjugation machinery. Ubiquitin targets proteins for degradation by the proteasome and modifies different sets of proteins through distinct pathways, some of which are unknown. These tests showed that pathway-based queries using PathBLAST are capable of identifying both known and potentially novel paralogous pathways within an organism. The pathways searched and the results are shown in Figure 8.

3.2 Identifying Conserved Protein Complexes

The previous section handled the problem of finding conserved linear pathways. It is not uncommon for such pathways to overlap, and the following heuristic uses those overlaps to identify more complex conserved structures: first, PathBLAST is used to find conserved paths, and then overlapping paths are merged into complexes. An example is shown in Figure 9, where a conserved complex is found using two conserved intersecting pathways.

This section describes a direct approach for identifying conserved complexes. Sharan et al. [18] introduced a method for finding conserved complexes by comparative analysis of two PPI networks. This work assumed that protein complexes are manifested as dense subgraphs (clusters). Indeed, in order for a complex to act as a single mechanism, all its proteins should be connected among themselves. Moreover, the average density of currently known complexes is around 0.4 (40% of all possible interactions exist).

3.2.1 A Probabilistic Model for Protein Complexes

To measure how good a complex is, a likelihood ratio is used. The measure looks at the ratio between the likelihood of the complex to exist assuming all its proteins interact with each other, and the likelihood of the complex to exist assuming a random distribution of the protein interactions in the graph.

The two models are defined as follows:

1. The protein-complex model, M_c, assumes that every two proteins in a complex interact with some high probability p (0.8 is used in this work). In terms of the graph, the assumption is that two vertices that belong to the same complex are connected by an edge with probability p, independently of all other information.

2. The random model, M_n, assumes that each edge is present with the probability that one would expect if the edges of G were randomly distributed but respected the degrees of the vertices. More precisely, let F_G represent the family of all graphs having the same vertex set as G and the same degree sequence. The probability of observing the edge (u, v), p(u, v), is defined to be the fraction of graphs in F_G that include this edge. Note that in this way, edges incident on vertices with higher degrees have higher probability. We assume that all pairwise relations are independent.

Figure 7: Source [10]. This figure shows paralogous pathways within yeast, found by merging the yeast network with itself and searching for pathways. The two aligned pathways are drawn in different colors (green/blue). The different regions in the figure show top-scoring alignments, their functional annotation, and their p-value. (a) RNA polymerase II vs. I/III. (b) Protein transport. (c+i+j) Paralogous kinase signaling cascades: mating, osmolarity, and nutrient control of cell growth. (d) Establishment of cell polarity. (e) Nuclear transport. (f) Chromatin remodeling in osmoreg vs. DNA damage. (g) Glucose vs. phosphate signaling. (h) Sporulation: cell wall vs. recombination. (k) Cytoplasmic vs. mitochondrial and peroxisomal proteases. (l) Kinase pathways regulating nutrient response. (m) Mismatch repair vs. crossing-over machinery. (n) Regulation of mitosis. (o) Cell polarity control.

Figure 8: Source [10]. This figure shows the results of querying the yeast network with specific pathways. The different regions in the figure show top-scoring alignments (one sub-figure for each pathway queried). The high-scoring alignments are indicated in red. The p-value is also shown for each alignment. (a) Filamentation response. (b) Skp1-Cdc53/cullin-F-Box (SCF) complex. (c) Anaphase Promoting Complex (APC). (d) SUMO conjugation.

Figure 9: This figure shows two pairs of aligned paths from two networks on the left side, and the way they are joined into aligned complexes on the right.

Given a protein complex C = (V', E'), a naive approach could be to define the score of this complex as follows:

$$L(C) = \prod_{(u,v) \in E'} \frac{p}{p(u,v)} \times \prod_{(u,v) \notin E'} \frac{1-p}{1-p(u,v)} \qquad (2)$$

It can be easily seen that complexes with higher density will have more edges and thus higher scores.

However, such a score ignores information on the reliability of interactions. A more rigorous scoring would treat data of interactions as noisy observations of interactions. In other words, we will incorporate the edge confidence scores into our complex score to deal with noisy data.

Let T_uv denote the event that two proteins u, v interact, and F_uv denote the event that they do not interact. Let O_uv denote the (possibly empty) set of available observations on the proteins u and v, that is, the set of experiments in which u and v were tested for interaction and the outcome of these tests. Using prior biological information (see Section 4.1 of [18]), one can estimate for each protein pair the probability Pr(O_uv|T_uv) of the observations on this pair, given that it interacts, and the probability Pr(O_uv|F_uv) of those observations, given that this pair does not interact. Also, one can estimate the prior probability Pr(T_uv) that two random proteins interact.

Given a subset U of the vertices, the likelihood of U under a protein-complex model (M_c) and a random model (M_n) is computed. Denoting by O_U the collection of all observations on vertex pairs in U, the probability that this collection of observations will occur under the complex model can be computed as follows:

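$$Pr(O_U \mid M_c) = \prod_{(u,v) \in U \times U} \left( p \, Pr(O_{uv} \mid T_{uv}) + (1-p) \, Pr(O_{uv} \mid F_{uv}) \right) \qquad (3)$$

and the probability that this collection of observations will occur under the random model can be computed as follows:

$$Pr(O_U \mid M_n) = \prod_{(u,v) \in U \times U} \left( p(u,v) \, Pr(O_{uv} \mid T_{uv}) + (1-p(u,v)) \, Pr(O_{uv} \mid F_{uv}) \right) \qquad (4)$$

Therefore, the likelihood ratio score of a complex C with vertex set U can be calculated as follows:

$$L(C) = \frac{Pr(O_U \mid M_c)}{Pr(O_U \mid M_n)} = \prod_{(u,v) \in U \times U} \frac{p \, Pr(O_{uv} \mid T_{uv}) + (1-p) \, Pr(O_{uv} \mid F_{uv})}{p(u,v) \, Pr(O_{uv} \mid T_{uv}) + (1-p(u,v)) \, Pr(O_{uv} \mid F_{uv})} \qquad (5)$$

Taking the logarithm of L(C) turns this product into a sum with one term per vertex pair, which gives the log likelihood score of the complex.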
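The following Python sketch evaluates the logarithm of the score in equation (5) for a candidate vertex set. It is an illustration only: the function and parameter names are invented, the probabilities are hypothetical inputs rather than estimates from [18], and the product is taken over unordered pairs of distinct vertices, which is an assumption about how U × U is meant.

```python
import math
from itertools import combinations

def log_likelihood_ratio(members, p_complex, p_random, obs_given_t, obs_given_f):
    """Log of equation (5), summed over unordered pairs of distinct members.

    p_random[pair]    -- p(u, v), the edge probability under the random model
    obs_given_t[pair] -- Pr(O_uv | T_uv)
    obs_given_f[pair] -- Pr(O_uv | F_uv)
    """
    total = 0.0
    for u, v in combinations(sorted(members), 2):
        pair = frozenset((u, v))
        t, f = obs_given_t[pair], obs_given_f[pair]
        numerator = p_complex * t + (1 - p_complex) * f            # complex model
        denominator = p_random[pair] * t + (1 - p_random[pair]) * f  # random model
        total += math.log(numerator / denominator)
    return total

# Hypothetical pair whose observations strongly support an interaction.
ab = frozenset(("a", "b"))
supported = log_likelihood_ratio({"a", "b"}, 0.8, {ab: 0.1}, {ab: 0.9}, {ab: 0.1})
# The same pair when the observations argue against an interaction.
unsupported = log_likelihood_ratio({"a", "b"}, 0.8, {ab: 0.1}, {ab: 0.1}, {ab: 0.9})
```

With these numbers the supported pair contributes a positive term and the unsupported pair a negative one, which is the behavior that lets the score reward dense, well-supported subgraphs.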

3.2.2 Scoring for Two Species

Consider two network subsets C and C', one for each species, and a mapping θ between them. Then, we can compute the likelihood score as follows:

$$L(C, C') = L(C) \, L(C') \qquad (6)$$

However, this does not take into account the degree of sequence conservation among the pairs of proteins mapped by θ. In order to include such information, a conserved complex model and a random model for pairs of proteins from two species were defined. Let E_uv denote the BLAST E-value assigned to the similarity between proteins u and v, and let h_uv and h'_uv denote the events that u and v are orthologous or nonorthologous, respectively. Multiplying in the likelihood ratio corresponding to each matched pair of proteins (u, v) gives:

$$L(C, C') = L(C) \, L(C') \prod_{(u,v) \text{ matched}} \frac{Pr(E_{uv} \mid h_{uv})}{Pr(E_{uv} \mid h_{uv}) \, Pr(h_{uv}) + Pr(E_{uv} \mid h'_{uv}) \, Pr(h'_{uv})} \qquad (7)$$

A downside of this scoring method is that it treats the aligned complexes independently, meaning that it ignores the preservation of interactions between the complexes. Nevertheless, because most of the currently available PPI networks originate from evolutionarily distant species, this scoring produces results similar to those of other, more complex methods, which do incorporate interaction preservation scores.
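In log space, the two-species score of equation (7) is a sum of the two single-species scores and one conservation term per matched pair. The sketch below shows this under stated assumptions: the function names and the input probabilities are hypothetical, and Pr(h'_uv) is taken to be 1 − Pr(h_uv) because the two events are complementary.

```python
import math

def conservation_log_term(pr_e_given_h, pr_e_given_not_h, pr_h):
    """Log of the per-pair factor in equation (7)."""
    # Denominator: total probability of the observed E-value.
    marginal = pr_e_given_h * pr_h + pr_e_given_not_h * (1 - pr_h)
    return math.log(pr_e_given_h / marginal)

def two_species_log_score(log_l_c, log_l_c_prime, matched_pairs):
    """log L(C, C') = log L(C) + log L(C') + sum of conservation terms."""
    return log_l_c + log_l_c_prime + sum(
        conservation_log_term(*pair) for pair in matched_pairs)

# Hypothetical values: an E-value that is 9 times more likely under orthology,
# and one that is equally likely under both hypotheses (uninformative).
informative = conservation_log_term(0.9, 0.1, 0.1)
uninformative = conservation_log_term(0.5, 0.5, 0.3)
total = two_species_log_score(1.0, 2.0, [(0.5, 0.5, 0.3)])
```

An E-value that does not discriminate between the two hypotheses contributes a term of (approximately) zero, so only informative sequence similarity changes the score.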

3.2.3 Searching Conserved Protein Complexes

Using the model explained above, the problem of identifying conserved protein complexes reduces to the problem of finding heavy subgraphs in the alignment graph.

3.2.4 The Search Strategy

The problem of searching for heavy induced subgraphs in a graph is NP-hard even when considering a single species where all edge weights are 1 or -1 and all vertex weights are 0 (Shamir et al. [16]). Thus, heuristic strategies for searching the alignment graph for conserved complexes were proposed.

A bottom-up search for heavy subgraphs in the alignment graph is performed, by starting from high weight seeds, refining them by exhaustive enumeration, and then expanding them using local search. An edge in the alignment graph is defined as strong if the sum of its associated weights (the edge weights within each species graph) is positive.

The search proceeds as follows:

1. Compute a seed around each node v, which consists of v and all its neighbors u such that (u, v) is a strong edge.

2. If the size of this set is above a threshold (e.g., 10), iteratively remove from it the node whose contribution to the subgraph score is minimum, until we reach the desired size.

3. Enumerate all subsets of the seed that have size at least 3 and contain v. Each such subset is a refined seed on which a local search heuristic is applied.

4. Local search: Iteratively add a node, whose contribution to the current seed is maximum, or remove a node, whose contribution to the current seed is minimum, as long as this operation increases the overall score of the seed. Throughout the process the original refined seed is preserved and nodes are not deleted from it.

5. For each node in the alignment graph record up to k (e.g. 5) heaviest subgraphs that were discovered around that node.
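Step 4 of the search can be sketched as follows. This is a simplified illustration, not the implementation of [18]: the subgraph score is assumed to be the sum of pair weights inside the set, pairs without a recorded weight get a hypothetical default penalty, and each round applies the single addition or removal that improves the score the most.

```python
from itertools import combinations

def subgraph_score(members, weight, missing=-1.0):
    """Sum of pair weights inside the set; unrecorded pairs get `missing`."""
    return sum(weight.get(frozenset(p), missing)
               for p in combinations(sorted(members), 2))

def local_search(seed, all_nodes, weight):
    """Greedy local search that never deletes a node of the refined seed."""
    seed = set(seed)
    current = set(seed)
    score = subgraph_score(current, weight)
    improved = True
    while improved:
        improved = False
        # Candidate moves: add one outside node, or remove one non-seed node.
        moves = [current | {v} for v in all_nodes if v not in current]
        moves += [current - {v} for v in current if v not in seed]
        if not moves:
            break
        best = max(moves, key=lambda s: subgraph_score(s, weight))
        best_score = subgraph_score(best, weight)
        if best_score > score:  # accept only strict improvements
            current, score, improved = best, best_score, True
    return current, score

# Hypothetical weights: a, b, c form a heavy triangle, d attaches well, e does not.
w = {frozenset(p): s for p, s in [(("a", "b"), 2.0), (("a", "c"), 2.0),
                                  (("b", "c"), 2.0), (("a", "d"), 1.0),
                                  (("b", "d"), 1.0), (("c", "d"), 1.0)]}
cluster, cluster_score = local_search({"a", "b", "c"}, ["a", "b", "c", "d", "e"], w)
```

The loop terminates because the score strictly increases with every accepted move and there are finitely many vertex subsets. In the toy example, d is added (raising the score from 6 to 9) and e is rejected.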

Notice that the resulting subgraphs may overlap considerably. To handle this, a greedy algorithm is used to filter out subgraphs whose percentage of intersection is above a threshold, as follows:

1. Iteratively find the highest weight subgraph.

2. Add that subgraph to the final output list.

3. Remove all other highly intersecting subgraphs. Large overlap between two complexes was disallowed: a complex was filtered out if it shared 60 percent or more of its members with another complex and had a worse p-value than that complex. The p-value of a complex measures the fraction of random runs in which the output complex had a higher score, as explained next.

Figure 10: Source [18]. This figure shows the conserved protein complexes identified between yeast and bacteria. Table columns: ID (the ID of the complex), Score (−ln(p-value) adjusted for multiple testing), size (with the number of distinct bacterial and yeast proteins in parentheses), purity and complex category for the yeast, and purity and functional category for the bacteria.
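The greedy filtering can be sketched in a few lines of Python. The sketch makes two assumptions that the text does not fix: subgraphs are ranked by a numeric score rather than by p-value, and the overlap between two subgraphs is measured as the size of their intersection divided by the size of the smaller one. The function name and data are hypothetical.

```python
def filter_overlapping(subgraphs, max_overlap=0.6):
    """Keep subgraphs in decreasing score order, dropping heavy overlaps.

    subgraphs -- list of (score, members) tuples
    """
    kept = []
    for score, members in sorted(subgraphs, key=lambda t: t[0], reverse=True):
        members = set(members)
        # Keep the candidate only if it overlaps lightly with everything kept.
        if all(len(members & other) / min(len(members), len(other)) < max_overlap
               for _, other in kept):
            kept.append((score, members))
    return kept

# Hypothetical candidates: the second shares 3 of 4 members with the first.
candidates = [(10.0, {"a", "b", "c", "d"}),
              (8.0, {"a", "b", "c", "e"}),
              (5.0, {"x", "y", "z"})]
kept = filter_overlapping(candidates)
```

Here the second candidate is dropped because 75% of it is already covered by a higher-scoring subgraph, while the disjoint third candidate is kept.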

3.2.5 Evaluation of Complexes

The statistical significance of identified complexes was tested in two ways:

1. The first is based on the z-scores that are computed for each subgraph and assumes a normal approximation to the likelihood ratio of a subgraph. The approximation relies on the assumption that the subgraph's nodes and edges contribute independent terms to the score. The resulting p-value is Bonferroni corrected ([6]) for multiple testing, according to the size of the subgraph.

2. The second is based on empirical runs on randomized data. The randomized data are produced by random shuffling of the input interaction graphs of the two species, preserving their degree sequences, as well as random shuffling of the orthology relations, preserving the number of orthologs associated with each protein. For each randomized dataset, a heuristic search is used to find the highest-scoring conserved complex of a given size. Then a p-value is estimated for a suggested complex of the same size, as the fraction of random runs in which the output complex had higher score.
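The second test can be illustrated with the sketch below. It assumes that degree-preserving randomization is done by repeated edge swaps, a common technique that the text does not confirm as the one actually used, and it omits the shuffling of orthology relations. All function names and the toy data are hypothetical.

```python
import random

def degree_preserving_shuffle(edges, n_swaps, seed=0):
    """Randomize a graph by swapping edge endpoints; all degrees are preserved."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if len({a, b, c, d}) < 4:
            continue  # a swap would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # a swap would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list

def degree_sequence(edges):
    """Map each vertex to its degree."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

def empirical_p_value(observed, random_scores):
    """Fraction of random runs whose best score exceeds the observed score."""
    return sum(1 for s in random_scores if s > observed) / len(random_scores)

# Hypothetical interaction graph and hypothetical scores from random runs.
graph = [("a", "b"), ("c", "d"), ("e", "f"), ("a", "c"), ("b", "e")]
shuffled = degree_preserving_shuffle(graph, n_swaps=100)
p_value = empirical_p_value(5.0, [1.0, 2.0, 6.0, 7.0])
```

Each swap replaces edges (a, b) and (c, d) with (a, d) and (c, b), so every vertex keeps exactly the number of neighbors it had before.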

3.2.6 Complex Identification and Validation

The algorithm was applied to the yeast and bacteria networks, identifying 11 nonredundant complexes with significant p-values (p < 0.05 after the correction for multiple testing). The score was also compared against empirical runs on randomized data (p < 0.05). The results are listed in Figure 10.


In order to validate these results, the MIPS database (www.mips.gsf.de), which contains an assignment of yeast genes to known complexes, was used. The purity of a complex was defined as follows: denote by x the highest number of proteins in the complex that belong to a single complex category in MIPS, and denote by y the number of proteins in the complex that are categorized in MIPS; the purity is then x/y.
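The purity computation is simple enough to state as code. The sketch below assumes, for simplicity, that every annotated protein belongs to exactly one category; the function name, protein names and annotation table are hypothetical.

```python
from collections import Counter

def purity(complex_members, annotation):
    """x / y, where y counts annotated members and x the largest category."""
    categories = [annotation[p] for p in complex_members if p in annotation]
    if not categories:
        return None  # purity is undefined when no member is annotated
    x = Counter(categories).most_common(1)[0][1]
    return x / len(categories)

# Hypothetical complex: three proteasome proteins, one ribosomal protein,
# and one protein without any annotation.
annotation = {"p1": "proteasome", "p2": "proteasome",
              "p3": "proteasome", "p4": "ribosome"}
value = purity(["p1", "p2", "p3", "p4", "p5"], annotation)
```

In this example x = 3 and y = 4, so the purity is 0.75; the unannotated protein p5 does not enter the calculation, which matches the definition of y above.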

High purity indicates a conserved complex that corresponds to a known complex in yeast and serves as a validation for the result. Low purity may indicate either an incorrect complex or a previously unidentified correct one. Note that most of the predicted complexes also contain proteins that are not known to belong to any complex in yeast. Thus, the results could be used in order to suggest additional members in known complexes.

For bacteria, since experimental information on complexes is unavailable, functional annotations were used in order to calculate the purity of a complex. The functional annotations were taken from the TIGR database (www.tigr.org).

The significant complexes that have been identified exhibit a nice correspondence between the protein complex annotation in yeast and the functional annotation in bacteria, as presented in Figure 10 and further visualized in Figure 11. For instance, complex 17 contains proteins from both yeast and bacteria that are involved in protein degradation, complexes 19 and 28 consist predominantly of proteins that are involved in translation, and complex 30 includes proteins that are involved in membrane transport.

The conserved protein complexes that were found imply new functions for a variety of uncharacterized proteins. For instance, complex 17 (Figure 11(a)) defines a set of conserved interactions for the cell's protein degradation machinery. Bacterial proteins HP0849 and HP0879 (emphasized in Figure 11(a)) are uncharacterized, but their appearance within yeast and bacterial complexes involved in proteolysis suggests that they also play an important role in this process. Another example is the yeast proteins Hsm3 and Rfa1 (with known functional roles in DNA-damage repair) that may also be associated with the yeast proteasome (see Figure 11(b,d)).

As another example of protein functional prediction, Figure 11(b) shows a conserved complex which contains yeast proteins that function in the nuclear pore (NUP) complex. The NUP complex is integral to the eukaryotic nuclear membrane and serves to selectively recognize and shuttle molecular cargos (e.g., proteins) between the nucleus and cytoplasm. Unlike the yeast proteins, the corresponding bacterial proteins are less well characterized, although three have been associated with the cell envelope due to their predicted transmembrane domains. The results therefore indicate that the bacterial proteins may function as a coherent cellular membrane transport system in bacteria, similar to the nuclear pore in eukaryotes, or perhaps are part of some sort of ancient predecessor of the yeast NUP complex.

3.2.7 Comparison to Previous Methods

The authors performed a comparison between three methods/applications:

• Yeast vs. Bacteria: the algorithm was applied to the yeast-bacteria alignment graph in search of conserved complexes.

• Yeast only: this method is a noncomparative variant of the algorithm that uses the protein-protein interactions in yeast only. That is, this variant searches for heavy subgraphs in the yeast interaction graph, where the edges of the graph are weighted according to the log-likelihood ratio model.

• A variant of the algorithm that relies on the previous probabilistic model for protein interactions by Kelley et al. [10].

The following measures were used to compare the experiments:


Figure 11: Source [18]. This figure shows conserved protein complexes: (a) proteolysis complexes, (b,d) protein synthesis complexes, (c) nuclear transport complexes. Conserved complexes are connected subgraphs within the bacteria-yeast alignment graph, whose nodes represent orthologous protein pairs and edges represent conserved protein interactions of three types: direct interactions in both species (solid edges); direct in bacteria but distance 2 in the yeast interaction graph (dark dashed edges); and distance 2 in the bacterial interaction graph but direct in yeast (light dashed edges). The number of each complex indicates the corresponding complex ID listed in Figure 10.


Figure 12: Source [18]. This figure shows a comparison between the three experiments using three comparison measures: the Jaccard measure, the Sensitivity measure and the Specificity measure.

• The Jaccard measure: Two proteins are called mates in a solution if they appear together in at least one complex in that solution. Given two solutions, let $n_{11}$ be the number of pairs that are mates in both, and let $n_{10}$ ($n_{01}$) be the number of pairs that are mates in the first (second) only. The Jaccard score is $n_{11}/(n_{11} + n_{10} + n_{01})$. Hence, it measures the correspondence between protein pairs that belong to a common complex according to one or both solutions. Two identical solutions would get a score of 1, and the higher the score the better the correspondence (for more on the Jaccard score see [9]).

• The Sensitivity measure - quantifies the extent to which a solution captures complexes from the different yeast categories. It is formally defined as the number of categories for which there was a complex with at least half its annotated elements being members of that category, divided by the number of categories with at least three annotated proteins.

• The Specificity measure - quantifies the accuracy of the solution. Formally, it is the fraction of predicted complexes whose purity exceeded 0.5.
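As an illustration of the Jaccard measure defined above, the following sketch counts mate pairs in two solutions, each given as a list of complexes. The function names and the toy complexes are hypothetical, and returning 1.0 when neither solution contains any pair is a convention chosen here, not something stated in the source.

```python
from itertools import combinations


def mate_pairs(solution):
    """All unordered protein pairs that share at least one complex."""
    pairs = set()
    for cplx in solution:
        for a, b in combinations(sorted(set(cplx)), 2):
            pairs.add((a, b))
    return pairs


def jaccard(solution_a, solution_b):
    """n11 / (n11 + n10 + n01) over mate pairs of the two solutions."""
    a, b = mate_pairs(solution_a), mate_pairs(solution_b)
    n11, n10, n01 = len(a & b), len(a - b), len(b - a)
    total = n11 + n10 + n01
    return n11 / total if total else 1.0  # convention for two empty solutions


# Toy solutions: mates are {AB, AC, BC} versus {AB, CD}.
solution_1 = [["A", "B", "C"]]
solution_2 = [["A", "B"], ["C", "D"]]
print(jaccard(solution_1, solution_2))
```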

A comparison of the performance of the three approaches is shown in Figure 12. Analysis of the results shows that:

• The Jaccard score is significantly better in the current approach than in Kelley et al. [10].

• The sensitivity is lower, as fewer categories are captured, but the specificity is much higher, so the predicted complexes are much more accurate.

• Using data on yeast only, Sharan et al. get even higher sensitivity, although again at the cost of specificity. The Jaccard score of this run is comparable to that of the comparative algorithm. This shows that the new probabilistic model can be effectively used, even for detecting complexes using interaction data from a single species.

• The results of the yeast vs. bacteria experiment were evaluated using data on yeast complexes only, not all of which are expected to be conserved. Still, the use of the bacterial data significantly improved the specificity of the results.

3.3 Evolutionary Based Scoring

The methods above for scoring and searching for conserved complexes do not take into account the evolutionary process shaping protein interaction. Koyuturk et al. introduce a scoring method which is based on the duplication/divergence model (see [11]).


Figure 13: Source [11]. This figure shows duplication, elimination and emergence events on a PPI network.

The network starts with three interactions among proteins $u_1$, $u_2$ and $u_3$. Node $u_1$ is then duplicated to node $u_1'$, together with its interactions (dashed circle and lines). Next, node $u_1$ loses its interaction with $u_3$ (elimination, dotted line). Finally, an interaction between $u_1$ and $u_1'$ is added to the network (emergence, dashed line).

3.3.1 The Duplication/Divergence Model

The duplication/divergence model is a common model used to explain the evolution of protein interaction networks via preferential attachment. According to this model, when a gene is duplicated in the genome, the node corresponding to the product of this gene is also duplicated together with its interactions. An example of protein duplication is shown in Figure 13. A protein loses many aspects of its functions rapidly after being duplicated. This translates to divergence of duplicated (paralogous) proteins in the interactome through elimination and emergence of interactions:

• Elimination of an interaction in a PPI network implies the loss of an interaction between two proteins due to mutations in their interface.

• Emergence of an interaction in a PPI network implies the introduction of a new interaction between two non-interacting proteins, again caused by mutations that change protein surfaces.

Examples of elimination and emergence of interactions are also illustrated in Figure 13.

If an elimination or emergence is related to a recently duplicated protein, it is said to be correlated; otherwise, it is uncorrelated ([14]). Since newly duplicated proteins are more tolerant to interaction loss because of redundancy, correlated elimination is generally more probable than emergence and uncorrelated elimination ([24]). In the context of duplication two types of pairs of proteins are defined as follows:

• A pair of paralogous proteins will be called in-paralogs if they are the result of a duplication that occurred after a speciation event.

• A pair of paralogous proteins will be called out-paralogs if they are the result of a duplication that occurred before a speciation event.

The interaction profiles of duplicated proteins tend to almost totally diverge in about 200 million years, as estimated on the yeast interactome. On the other hand, the correlation between interaction profiles of duplicated proteins is significant for up to 150 million years after duplication, with more than half of the interactions being conserved for proteins that were duplicated less than 50 million years ago. Thus, while comparatively analyzing the proteome and interactome, it is important to distinguish in-paralogs from out-paralogs since the former are more likely to be functionally related. This, however, is a difficult task since out-paralogs also show sequence similarity.


3.3.2 Local Alignment of the PPI Network

Given two PPI networks $G(U, E)$ and $H(V, F)$, a protein subset pair $P = \{\tilde{U}, \tilde{V}\}$ is defined as a pair of protein subsets $\tilde{U} \subseteq U$ and $\tilde{V} \subseteq V$. Any protein subset pair $P$ induces a local alignment $A(G, H, S, P) = \{M, N, D\}$ of $G$ and $H$ with respect to $S$, the similarity function defined on each pair of proteins in $U \cup V$:

• M - set of matches. A match corresponds to a conserved interaction between two orthologous protein pairs, which is rewarded by a match score that reflects the confidence in both protein pairs being orthologous.

• N - set of mismatches. A mismatch is the lack of an interaction in the PPI network of one organism between a pair of proteins whose orthologs interact in the other organism. A mismatch may correspond to the emergence of a new interaction or the elimination of a previously existing interaction in one of the species after the split, or an experimental error. Thus, mismatches are penalized to account for the divergence from the common ancestor.

• D - set of duplications. A duplication is the duplication of a gene in the course of evolution. Each duplication is associated with a score that reflects the divergence of function between the two proteins, estimated using their similarity.

Let the functions $\Delta_G(u, u')$ and $\Delta_H(v, v')$ denote the distance between two corresponding proteins in the interaction graphs $G$ and $H$, respectively. Given a pairwise similarity function $S$, a distance cutoff $\bar{\Delta}$, and the set $P$ from above, we get:

$$M = \{u, u' \in \tilde{U},\ v, v' \in \tilde{V} : S(u, v) > 0,\ S(u', v') > 0,\ ((uu' \in E \wedge \Delta_H(v, v') \le \bar{\Delta}) \vee (vv' \in F \wedge \Delta_G(u, u') \le \bar{\Delta}))\} \qquad (8)$$

$$N = \{u, u' \in \tilde{U},\ v, v' \in \tilde{V} : S(u, v) > 0,\ S(u', v') > 0,\ ((uu' \in E \wedge \Delta_H(v, v') > \bar{\Delta}) \vee (vv' \in F \wedge \Delta_G(u, u') > \bar{\Delta}))\} \qquad (9)$$

$$D = \{u, u' \in \tilde{U} : S(u, u') > 0\} \cup \{v, v' \in \tilde{V} : S(v, v') > 0\} \qquad (10)$$

Following the definitions of match and mismatch, not only direct but also indirect interactions are allowed. If two proteins directly interact with each other in one organism, and their orthologs are reachable from each other via at most $\bar{\Delta}$ interactions in the other (the value $\bar{\Delta} = 2$ is used), this is considered a match. Conversely, a mismatch corresponds to the situation in which two proteins are not reachable via at most $\bar{\Delta}$ interactions in one network while their orthologs directly interact in the other.

There are two observations that explain the use of the distance cutoff:

1. Proteins that are linked by a short alternate path are more likely to tolerate interaction loss because of relaxation of evolutionary pressure ([11]).

2. High-throughput methods such as TAP ([7]) identify complexes that are associated with a single central protein and these complexes are recorded in the interaction database as star networks with the central protein serving as a hub.
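The match and mismatch definitions with a distance cutoff can be sketched as below. This is an illustrative sketch, not the implementation of [11]: networks are assumed to be given as adjacency dictionaries, the two protein pairs are assumed to already satisfy S(u, v) > 0 and S(u', v') > 0, and all function and protein names are hypothetical.

```python
from collections import deque


def distance(adj, source, target):
    """Shortest-path length in an unweighted graph; None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, d = queue.popleft()
        for nb in adj.get(node, ()):
            if nb == target:
                return d + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, d + 1))
    return None


def classify(u, u2, v, v2, adj_g, adj_h, cutoff=2):
    """Return 'match', 'mismatch', or None when neither pair interacts directly."""
    def within(adj, a, b):
        d = distance(adj, a, b)
        return d is not None and d <= cutoff

    direct_g = u2 in adj_g.get(u, ())
    direct_h = v2 in adj_h.get(v, ())
    if not direct_g and not direct_h:
        return None
    if (direct_g and within(adj_h, v, v2)) or (direct_h and within(adj_g, u, u2)):
        return "match"
    return "mismatch"


# Toy networks: a-b interact directly in G; x and z are two steps apart in H.
adj_g = {"a": {"b"}, "b": {"a"}}
adj_h = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}}
print(classify("a", "b", "x", "z", adj_g, adj_h, cutoff=2))
print(classify("a", "b", "x", "z", adj_g, adj_h, cutoff=1))
```

The same pair is a match with cutoff 2 and a mismatch with cutoff 1, which is the effect of allowing indirect interactions.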


3.3.3 Scoring Match, Mismatch and Duplications

For scoring the matches and mismatches, the similarity between two protein pairs is defined as follows:

$$S(uu', vv') = S(u, v)\, S(u', v') \qquad (11)$$

The similarity value is calculated using Inparanoid ([15]), which is a sequence-based method for finding orthology relations. It uses clustering in order to derive orthology families, leaving some of the orthology relations ambiguous. $S(uu', vv')$ quantifies the likelihood that the interaction between $u$ and $u'$ and the interaction between $v$ and $v'$ are orthologous. Consequently, a match that corresponds to a conserved pair of orthologous interactions is rewarded as follows:

$$\mu(uu', vv') = \bar{\mu}\, S(uu', vv') \qquad (12)$$

Here, $\bar{\mu}$ is the match coefficient that is used to tune the relative weight of matches against mismatches and duplications, based on the evolutionary distance between the species that are being compared.

A mismatch may correspond to the functional divergence of either interacting partner after speciation. It might also be due to a false positive or negative in one of the networks that is caused by incompleteness of data or experimental error. However, this problem is already partly addressed by considering indirect interactions as matches. According to Wagner ([26]), after a duplication event, duplicate proteins that retain similar functions in terms of being part of similar processes are likely to be part of the same complex. Moreover, since conservation of proteins in a particular module is correlated with interconnectedness ([28]), we expect that interacting partners that are part of a common functional module will at least be linked by short alternative paths. Based on these observations, mismatches are penalized for possible divergence in function as follows:

$$\nu(uu', vv') = -\bar{\nu}\, S(uu', vv') \qquad (13)$$

As for the match score, the mismatch penalty is also normalized by a coefficient $\bar{\nu}$ that determines the relative weight of mismatches with respect to matches and duplications.

A duplication has an evolutionary significance. Since duplicated proteins rapidly lose their interactions, it is more likely that in-paralogs, i.e., the proteins that are duplicated after a speciation event, will share more interacting partners than out-paralogs do ([26]). Furthermore, sequence similarity is employed as a means for distinguishing in-paralogs from out-paralogs. This is based on the observation that sequence similarity provides a crude approximation for the age of duplication ([27]). Moreover, recently duplicated proteins are more likely to be in-paralogs, and thus show more significant sequence similarity than older paralogs.

Therefore, the duplication score is defined as follows:

$$\delta(u, u') = \bar{\delta}\,(S(u, u') - \bar{d}) \qquad (14)$$

Here $\bar{d}$ is the cutoff for being considered in-paralogs. If $S(u, u') > \bar{d}$, suggesting that $u$ and $u'$ are likely to be in-paralogs, the duplication is rewarded by a positive score. If, on the other hand, $S(u, u') < \bar{d}$, the proteins are considered out-paralogs, and thus the duplication is penalized.
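The three scores of this subsection can be written down directly. The sketch below assumes the similarity function S is stored as a dictionary keyed by ordered protein pairs, with missing pairs treated as similarity 0. The default coefficients 1.0, 1.0 and 0.1 are the values reported for the experiments later in this section, whereas the cutoff `d_bar=0.5` and all protein names are hypothetical.

```python
def pair_similarity(S, u, u2, v, v2):
    """Eq. (11): S(uu', vv') = S(u, v) * S(u', v')."""
    return S.get((u, v), 0.0) * S.get((u2, v2), 0.0)


def match_score(S, u, u2, v, v2, mu_bar=1.0):
    """Eq. (12): reward for a conserved interaction."""
    return mu_bar * pair_similarity(S, u, u2, v, v2)


def mismatch_score(S, u, u2, v, v2, nu_bar=1.0):
    """Eq. (13): penalty for a non-conserved interaction."""
    return -nu_bar * pair_similarity(S, u, u2, v, v2)


def duplication_score(S, u, u2, delta_bar=0.1, d_bar=0.5):
    """Eq. (14): positive above the cutoff d_bar, negative below it.
    d_bar=0.5 is an assumed value, not taken from the source."""
    return delta_bar * (S.get((u, u2), 0.0) - d_bar)


# Hypothetical similarity values.
S = {("u", "v"): 0.8, ("u2", "v2"): 0.5, ("u", "u2"): 0.9}
print(match_score(S, "u", "u2", "v", "v2"))
print(mismatch_score(S, "u", "u2", "v", "v2"))
print(duplication_score(S, "u", "u2"))
```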

3.3.4 Alignment Score and the Optimization Problem

Given PPI networks $G$ and $H$, the score of alignment $A(G, H, S, P) = \{M, N, D\}$ is defined as:

$$\sigma(A) = \sum_{m \in M} \mu(m) + \sum_{n \in N} \nu(n) + \sum_{d \in D} \delta(d) \qquad (15)$$

The PPI network alignment problem is that of finding all maximal protein subset pairs $P$ such that $\sigma(A(G, H, S, P))$ is locally maximal, i.e., the alignment score cannot be improved by adding individual proteins to or removing proteins from $P$.


The information regarding matches, mismatches and duplications of the two PPI networks is represented using a single weighted alignment graph. Given $G(U, E)$, $H(V, F)$, and a protein similarity function $S$, the corresponding weighted alignment graph $G(\bar{V}, \bar{E})$ is computed as follows:

$$\bar{V} = \{\bar{v} = \{u, v\} : u \in U,\ v \in V \text{ and } S(u, v) > 0\} \qquad (16)$$

In other words, there is a node in the alignment graph for each pair of putatively orthologous proteins. Each edge $\bar{v}\bar{v}'$, where $\bar{v} = \{u, v\}$ and $\bar{v}' = \{u', v'\}$, is assigned a weight:

$$w(\bar{v}, \bar{v}') = \mu(uu', vv') + \nu(uu', vv') + \delta(u, u') + \delta(v, v') \qquad (17)$$

Here, $\mu(uu', vv') = 0$ if $(uu', vv') \notin M$, and similarly for mismatches and duplications.

The authors used a greedy search heuristic in order to find the conserved complexes in the alignment graph. For more information on this heuristic refer to section 3.3 in [11].

3.3.5 Significance Evaluation

To evaluate the statistical significance of discovered high-scoring alignments, a comparison is made between the alignments and a reference model generated by a random source. In the reference model, it is assumed that the interaction networks of the two species are independent of each other. In order to assess the significance of the conservation of interactions between orthologous proteins rather than the conservation of the proteins themselves, it is assumed that the orthology relationship between proteins is already established, i.e., is not generated by a random source. The interactions, on the other hand, are generated randomly while preserving the degree sequence.

Given proteins $u$ and $u'$ that interact with $d_u$ and $d_{u'}$ proteins, respectively, the probability $p_{uu'}$ of an interaction between them can be estimated as:

$$p_{uu'} = \frac{d_u\, d_{u'}}{\sum_{v \in U} d_v} \qquad (18)$$

Recall that the weight of a subgraph of the alignment graph is equal to the score of the corresponding alignment. Therefore, in the reference model, the expected value of the score of an alignment induced by a node subset $\tilde{V} \subseteq \bar{V}$ of the alignment graph is:

$$E[W(\tilde{V})] = \sum_{\bar{v}, \bar{v}' \in \tilde{V}} E[w(\bar{v}\bar{v}')] \qquad (19)$$

where, for $\bar{v} = \{u, v\}$ and $\bar{v}' = \{u', v'\}$,

$$E[w(\bar{v}\bar{v}')] = \bar{\mu}\, S(uu', vv')\, p_{uu'}\, p_{vv'} - \bar{\nu}\, S(uu', vv')\,(p_{uu'}(1 - p_{vv'}) + (1 - p_{uu'})\, p_{vv'}) + \delta(u, u') + \delta(v, v') \qquad (20)$$

is the expected weight of an edge in the alignment graph. With the simplifying assumption of independence of interactions, the authors obtain

$$Var[W(\tilde{V})] = \sum_{\bar{v}, \bar{v}' \in \tilde{V}} Var[w(\bar{v}\bar{v}')] \qquad (21)$$

which enables them to compute a z-score to evaluate the statistical significance of each discovered high-scoring alignment, under the assumed normal approximation.
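A small numerical sketch of equations (18) and (20), together with the z-score, is given below. It is illustrative only: the function names and toy values are hypothetical, the variance is taken as a given number because its per-edge formula is not spelled out here, and the degree-based estimate of equation (18) is used as is (it is not clipped, so it can exceed 1 for high-degree proteins in very small networks).

```python
import math


def interaction_probability(degree, u, u2):
    """Eq. (18): d_u * d_u' divided by the sum of all degrees."""
    return degree[u] * degree[u2] / sum(degree.values())


def expected_edge_weight(s_pair, p_g, p_h, dup_g, dup_h, mu_bar=1.0, nu_bar=1.0):
    """Eq. (20): expected weight of one alignment-graph edge.
    s_pair is S(uu', vv'); p_g and p_h are p_uu' and p_vv';
    dup_g and dup_h are the duplication scores delta(u, u') and delta(v, v')."""
    match_term = mu_bar * s_pair * p_g * p_h
    mismatch_term = nu_bar * s_pair * (p_g * (1 - p_h) + (1 - p_g) * p_h)
    return match_term - mismatch_term + dup_g + dup_h


def z_score(observed, expected, variance):
    """Standardized score under the normal approximation."""
    return (observed - expected) / math.sqrt(variance)


degree = {"a": 2, "b": 1, "c": 1}           # toy degree sequence
p_ab = interaction_probability(degree, "a", "b")
e_w = expected_edge_weight(0.4, 0.5, 0.5, 0.0, 0.0)
print(p_ab, e_w, z_score(5.0, 1.0, 4.0))
```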


Species            No. of proteins    No. of interactions
S. cerevisiae      5157               18192
C. elegans         3345               5988
D. melanogaster    8577               28829

Table 1: Source [11]. Number of proteins and interactions for yeast (S. cerevisiae), worm (C. elegans) and fly (D. melanogaster).

Figure 14: Source [11]. The number of nodes, matched nodes, matches, mismatches and duplications for each experiment: SC vs. CE (yeast vs. worm), SC vs. DM (yeast vs. fly) and CE vs. DM (worm vs. fly). The data are shown for both $\bar{\Delta} = 1$ and $\bar{\Delta} = 2$.

3.3.6 Experimental Results

The interaction data were downloaded from the BIND ([3]) and DIP ([29]) molecular interaction databases. The statistics for the PPI networks of yeast (S. cerevisiae), worm (C. elegans) and fly (D. melanogaster) are shown in Table 1.

The authors performed pairwise alignments of the three pairs of PPI networks, using the following alignment parameters: $\bar{\mu} = 1.0$, $\bar{\nu} = 1.0$ and $\bar{\delta} = 0.1$. The alignment was done between yeast-worm, yeast-fly, and worm-fly, for both $\bar{\Delta} = 1$ and $\bar{\Delta} = 2$. The results are shown in Figure 14.

Alignment of yeast PPI network with fly PPI network results in identification of 412 conserved subnetworks. Ten of the conserved subnetworks with highest alignment scores are shown in Figure 15. In total, 83 conserved subnetworks are identified on yeast-worm alignment, and 146 are identified on worm-fly alignment.

While most of the conserved subnetworks are dominated by one particular process and the dominant processes are generally consistent across species, there also exist different processes in different organisms that are mapped to each other by the discovered alignments. This illustrates that the comparative analysis of PPI networks is effective not only in identifying particular functional modules, pathways, and complexes, but also in discovering relationships between different processes in separate organisms and crosstalk between known functional modules and pathways. Moreover, alignment results provide a means for discovery of new functional modules in relatively less studied organisms through mapping of functions at a modular level rather than at the level of single protein homologies. This significant use of the experimental results was also noted by both Kelley et al. with PathBLAST ([10]) and Sharan et al. ([18]).

A selection of interesting conserved subnetworks is shown in Figure 16. The alignments in the figure illustrate that the alignment algorithm takes into account the conservation of interactions in addition to sequence similarity while mapping orthologous proteins to each other. In all of the alignments shown in the figure, the interactions of proteins that belong to the same orthologous group are highly conserved, suggesting relatively recent duplications.

Figure 15: Source [11]. This figure shows the representative top-scoring subnetworks identified by the alignment of yeast and fly. The dominant biological process/functionality for each species, in which the majority of proteins in the conserved subnetwork participate, is also shown in the second row of each subnetwork. For each subnetwork we also show the z-score, number of proteins (for each species in parentheses), number

Figure 16: Source [11]. This figure shows a sample of conserved subnetworks identified by the alignment algorithm. Orthologous and paralogous proteins are either vertically aligned, or connected by blue dotted lines. Existing interactions are shown by green solid lines, and missing interactions that have an orthologous counterpart are shown by red dashed lines. The organisms aligned and the rank of the alignment are shown in the label. (a,b,c) yeast vs. fly. (d,e) worm vs. fly. (f) yeast vs. worm.

4 Path Queries

Sequence comparison is a basic tool in biological research, widely used for comparing and searching nucleotide sequences (as in RNA and DNA molecules) and for comparing amino acid sequences (as in protein homology discovery). It is used in both evolutionary and functional research on genes and proteins.

The availability of PPI Networks allows us to extend the use of sequence comparison methods to more complex functional units, such as protein pathways and modules, and thus elevate homology detection from the level of single protein homology to the level of functional protein pathways and modules homology.

This section describes another method for comparing and aligning PPI networks, QPath ([20]), which overcomes some fundamental drawbacks of the PathBLAST algorithm introduced in section 3:

1. In a PathBLAST result, a matched pathway may contain the same protein more than once, which is biologically implausible.

2. The resulting matched pathways must be very close to each other, while we might want to allow a higher degree of freedom and support more than a single consecutive insertion or a single consecutive deletion between the paths, which is the maximum PathBLAST allows.

3. The running time of the algorithm involves a factorial function of the pathway length, limiting its applicability to short pathways (in practice, it was applied to paths of up to 5 proteins).

4.1 The Path Query Problem

The problem setting is defined as follows: the input is a target network, represented as an undirected weighted graph $G(V, E)$ with a weight function on the edges $w : E \to \mathbb{R}$, and a path query $Q = (q_1, \dots, q_k)$. Additionally, a scoring function $H : Q \times V \to \mathbb{R}$ is given. The output is a set of best matching pathways $P = (p_1, \dots, p_k)$ in $G$, where a good match is measured in two respects:

1. Each node in the matched pathway and its corresponding node in the query are similar with respect to the given scoring function H.

2. The reliability of edges in the matched pathway is high.

If we don’t force the size of the query and matching paths to be equal, we can still measure the match between a query $Q = (q_1, \dots, q_k)$ and a pathway $P = (p_1, \dots, p_l)$ by introducing dummy nodes, which allow for deletions if inserted in the matching path, and for insertions if inserted in the query.

In the PPI Network framework, as described in section 2, the target graph is a PPI Network of species 1, where the vertices are proteins, and edges’ weights represent the interaction probability between two proteins. The query pathway Q is a pathway extracted from a PPI Network of species 2, and the function H is a similarity measure between proteins in the two species.

4.2 The QPath algorithm

First, in order to allow more flexibility in deletions and insertions, deletions are allowed by introducing a mapping $M$ from $Q$ to $P \cup \{0\}$, where deleted query nodes are mapped to 0 by $M$. The total score of an alignment reflects both the protein homology and the interaction probabilities along the path, while allowing a certain degree of freedom for insertions and deletions, and is set to be:

$$\sum_{i=1}^{l-1} w(p_i, p_{i+1}) + \sum_{i=1,\ M(q_i) \ne 0}^{k} h(q_i, M(q_i))$$

where the first summation is the interaction score and the second is the sequence score. Edge weights represent the logarithm of the reliability of the interaction between two proteins, and the protein similarity scoring function $h$ (the function $H$ of the problem definition) is set to be the BLAST E-value for the two proteins, normalized by the maximal E-value over all pairs of proteins from the two networks.

Figure 17: An example of an alignment that induces insertions (F’) and deletions (C).

4.3 Avoiding cycles (non-trivial paths)

In order to find only simple paths, QPath uses the color coding technique (Alon et al. [1]). The method allows finding simple paths of size k by randomly choosing a color out of k colors for every vertex in the graph, and looking only for subgraphs that do not contain more than one vertex of the same color. Since a particular path may be assigned non-distinct colors, the method requires choosing many random colorings, and running the search for each of them separately.
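The need for many random colorings can be quantified with a short sketch. A fixed simple path on k vertices receives k distinct colors with probability k!/k^k under a uniform random coloring with k colors, so the number of independent colorings needed to make that path colorful at least once with probability 1 − ε follows directly. The function names are hypothetical, and this calculation concerns one fixed path only; it is not QPath's own analysis.

```python
import math
import random


def colorful_probability(k):
    """Probability that k fixed vertices get k distinct colors out of k."""
    return math.factorial(k) / k ** k


def trials_needed(k, eps):
    """Colorings needed so a fixed k-vertex path is colorful at least once
    with probability >= 1 - eps."""
    p = colorful_probability(k)
    if p >= 1.0:
        return 1
    return math.ceil(math.log(eps) / math.log(1.0 - p))


def random_coloring(vertices, k, rng):
    """Assign each vertex one of k colors uniformly at random."""
    return {v: rng.randrange(k) for v in vertices}


def is_colorful(path, coloring):
    """True if no two vertices on the path share a color."""
    colors = [coloring[v] for v in path]
    return len(set(colors)) == len(colors)


rng = random.Random(0)                      # fixed seed for reproducibility
coloring = random_coloring(["A", "B", "C"], 3, rng)
print(colorful_probability(3), trials_needed(3, 0.05))
print(is_colorful(["A", "B", "C"], coloring))
```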

4.4 Finding the best matching paths

QPath sets in advance two parameters, $N_{ins}$ and $N_{del}$, which are the numbers of insertions and deletions allowed in the matched path. When looking for a path of size k, QPath assigns $k + N_{ins}$ colors to the vertices.

The following dynamic programming recursion is then used for dynamically building the best path:

$$W(i, j, S, \theta_{del}) = \max_{m \in V} \begin{cases} W(i-1, m, S - c(j), \theta_{del}) + w(m, j) + h(q_i, j), & (m, j) \in E \\ W(i, m, S - c(j), \theta_{del}) + w(m, j), & (m, j) \in E \\ W(i-1, j, S, \theta_{del} - 1), & \theta_{del} \le N_{del} \end{cases}$$


$W(i, j, S, \theta_{del})$ is the maximum weight of an alignment for the first $i$ nodes in the query that ends at vertex $j \in V$, induces $\theta_{del}$ deletions, and visits a vertex of each color in $S$; $c(j)$ denotes the color of vertex $j$. In the first case, $q_i$ is aligned with vertex $j$, and thus we add the score $h(q_i, j)$ to the best alignment so far and remove the color of $j$ from the set of colors $S$. In the second case, vertex $j$ is not aligned with any query node, meaning $j$ is an insertion, so only the interaction weight $w(m, j)$ is added and no sequence score. The third case is a deletion of $q_i$, and therefore the deletion count of the remaining subproblem is reduced by one.

The best alignment score is $\max_{j \in V,\ S \subseteq C,\ \theta_{del} \le N_{del}} W(k, j, S, \theta_{del})$, where $C$ is the set of colors, and the alignment itself can be found by backtracking. The running time for each coloring is $2^{O(k + N_{ins})}\, m\, N_{del}$. For a choice of $\varepsilon \in (0, 1)$ such that the probability of finding the optimal match is at least $1 - \varepsilon$, we would need to choose $\ln(n/\varepsilon)$ random colorings, which gives a total running time of $\ln(n/\varepsilon)\, 2^{O(k + N_{ins})}\, m\, N_{del}$.
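The recursion can be sketched for a single fixed coloring as follows. This is an illustrative sketch under stated assumptions, not the QPath implementation: color sets are encoded as bitmasks, the first query node is assumed to be aligned with the first path vertex (so it is never deleted), insertions are limited only through the number of colors, no explicit insertion or deletion penalties are used, and the similarity function `h` is a hypothetical "higher is better" score. All names and toy values are made up for the example.

```python
from functools import lru_cache

NEG_INF = float("-inf")


def qpath_best_score(query, vertices, edges, color, h, n_colors, n_del):
    """Best alignment score for one coloring.
    edges: {(a, b): weight} for an undirected graph, each pair listed once.
    color: {vertex: color index in range(n_colors)}."""
    nbrs = {v: [] for v in vertices}
    for (a, b), w in edges.items():
        nbrs[a].append((b, w))
        nbrs[b].append((a, w))
    k = len(query)

    @lru_cache(maxsize=None)
    def W(i, j, mask, dels):
        bit = 1 << color[j]
        if i == 0 or not mask & bit:
            return NEG_INF
        rest = mask ^ bit                     # S - c(j)
        best = NEG_INF
        if i == 1 and rest == 0 and dels == 0:
            best = h(query[0], j)             # base case: q_1 aligned with j
        for m, w in nbrs[j]:
            # Case 1: q_i is aligned with j.
            best = max(best, W(i - 1, m, rest, dels) + w + h(query[i - 1], j))
            # Case 2: j is an insertion (no sequence score).
            best = max(best, W(i, m, rest, dels) + w)
        if dels > 0:
            # Case 3: q_i is deleted; the path still ends at j.
            best = max(best, W(i - 1, j, mask, dels - 1))
        return best

    return max(W(k, j, mask, d)
               for j in vertices
               for mask in range(1, 1 << n_colors)
               for d in range(n_del + 1))


# Toy target path A-B-C-D with log-reliability weights (hypothetical values).
vertices = ["A", "B", "C", "D"]
edges = {("A", "B"): -0.1, ("B", "C"): -0.2, ("C", "D"): -0.3}
color = {"A": 0, "B": 1, "C": 2, "D": 3}      # k + N_ins = 3 + 1 colors
similarity = {("q1", "A"): 1.0, ("q2", "B"): 1.0, ("q3", "D"): 1.0}


def h(q, v):
    return similarity.get((q, v), -5.0)       # assumed penalty for non-homologs


best = qpath_best_score(["q1", "q2", "q3"], vertices, edges, color, h,
                        n_colors=4, n_del=0)
print(best)
```

In the toy example the best alignment maps q1, q2, q3 to A, B, D and treats C as an insertion. In practice the search would be repeated over many random colorings and the best score kept.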

In order to use QPath for searching homologous paths between two given PPI networks, it is first required to extract good candidates from the first network, and then search for these paths in the target network. QPath can find good candidates by searching the first network for a dummy path query, consisting of dummy proteins that have the same similarity score H to all vertices in the network. Such a search yields pathways with high interaction scores in the first network, regardless of the path query itself.

4.5 Running QPath on yeast and fly PPI networks

The yeast (S. cerevisiae) PPI network contains 4,726 proteins and 15,166 known interactions between them. The fly (D. melanogaster) PPI network contains 7,028 proteins and 22,837 interactions, but in spite of its larger size, it is much less complete than the yeast network.

The algorithm was tested first on the more complete yeast PPI network, finding good candidates for querying the fly PPI network next. It discovered 271 pathways which were better than 99% of randomly chosen pathways obtained by setting all interaction scores to be equal and running the query on the tweaked data. The 271 pathways were then used as queries for the fly PPI network.

The results of running the algorithm on the yeast PPI network were assessed by looking at the functional enrichment of the paths found. 80% of the paths found were functionally enriched, implying their biological significance. In comparison, running dummy queries on the less complete fly network resulted in only 39% of the 132 fly paths found being functionally enriched.

Running the 271 paths found in the yeast PPI network as queries on the fly network showed that 63% of them had a match in the fly network (Figure 18).

The results show that pathway similarity can be used for identification of functionally significant pathways, and that the query pathways can help us infer the actual function of the matched pathways. A first annotation map of protein pathways in fly that are conserved from yeast was obtained this way by QPath.

4.6 Scoring the paths

After setting the scoring framework, there is a need to set the weight parameters and define the actual contribution of the different scoring components: the interaction score, the sequence score and the cost of insertions and deletions.

The goal is to find a weight function that maximizes the probability that a path with a high score is indeed functionally enriched. This was done by using logistic regression on the path attributes (interaction reliability, sequence similarity, number of insertions, and number of deletions), using known functionally enriched paths in the yeast network for training.


Figure 18: Source [20]. Functional significance of best-match pathways in fly. Functional enrichment (a) and expression coherency (b) of fly best-match pathways obtained by QPath compared to fly pathways that are not the result of a query

4.7 Is the insertion and deletion flexibility really required?

As mentioned before, one of the most important features QPath introduced is the ability to align paths with a large number of consecutive insertions or deletions. Figure 19 illustrates that this feature is indeed important, as most of the conserved paths between yeast and fly required more than one insertion and deletion.

In the same manner, discovering functionally enriched paths was also found to depend strongly on allowing a high number of insertions and deletions (see Figure 20).

4.8 Functional conservation

Running QPath on the yeast and fly PPI networks showed that for 64% of the conserved paths, the matched paths in the fly network conserved one or more functions of the yeast query pathways. In contrast, a random shuffling of the matches was tested and resulted in a conservation rate of only 31%. Interestingly, the functional conservation was much lower when limiting the protein homology only to the best pairs, one from each species. This implies that pathway homology can be used to predict function. More explicit methods for such a prediction, which make use of network homology on top of straightforward sequence alignment, will be presented in the next section.

Figure 19: Source [20]. Fraction of matched queries between yeast and fly networks with respect to the number of deletions and insertions in the conserved paths.

Figure 20: Source [20]. Fraction of functionally enriched matches with respect to the number of deletions and insertions in the conserved paths.

5 Orthology Mapping

Annotating protein function across species is an important task which is often complicated by the presence of large paralogous gene families. Most methods for dealing with this problem are sequence-based: the sequences of proteins from different species are compared in order to find groups of proteins that have the same functional annotation. Two such methods are COG (Clusters of Orthologous Groups) (Tatusov et al. [23]) and Inparanoid ([15]).

The COG approach defines orthologs using sets of proteins that contain reciprocal best BLAST matches across a minimum of three species. The Inparanoid approach is a sequence-based method of finding functional annotation. It uses clustering in order to derive orthology families, leaving some of the orthology relations ambiguous. For more information see [15].

Based on the concept that a protein and its functional ortholog are likely to interact with proteins in their respective networks that are themselves functional orthologs, Bandyopadhyay et al. in [4] introduced a novel strategy for identifying functionally related proteins that supplements sequence-based comparisons with information on conserved protein-protein interactions.

While the tools described in the previous sections used orthology to identify conserved protein interactions, the approach shown here reverses that logic and uses conserved protein interactions to predict functional orthology.

5.1 Functional Orthology

Ambiguities in the functional annotation process arise when the protein in question has similarity to not one but many paralogous proteins, making it harder to distinguish which of these is the true ortholog, that is, the protein that is directly inherited from a common ancestor. Especially in the genomes of mammals and other higher eukaryotes, large protein families are typically not the exception but the rule.