MODULARITY STRUCTURE MATRIX FOR INVESTIGATING PROTEIN

A crucial step toward understanding the properties of cellular systems is to analyze the topology of biological networks and biochemical progress in cells. Many graph features have been proposed to measure the role of proteins and to identify local modularity structures of high connectivity in a protein-protein interaction (PPI) network. The Laplacian matrix is a matrix representation of a given network. Here, we propose the modularity structure matrix (MS-matrix), the pseudoinverse of the Laplacian matrix that describes the kernels on a graph, to evaluate the modularity structure properties of a PPI network. To our knowledge, the modularity structure property is the first property that identifies both globally important proteins and local modularity structures within a network. For a given PPI network of S. cerevisiae, our results demonstrate that the important proteins identified by the MS-matrix are related to essential biological processes (i.e. essential genes) and are highly consistent with topological features (i.e. degree, closeness centrality, and betweenness centrality). The relationship between proteins derived from the MS-matrix reflects Gene Ontology similarity and could be useful for module identification. Furthermore, the biological characterization (e.g. Gene Ontology) of the modules derived from the MS-matrix is similar to that of the modules collected from an experimental database (e.g. MIPS). Our results demonstrate that the MS-matrix provides insight for investigating a PPI network through important proteins and local modularity structures.

5-1. Introduction

A crucial step toward understanding cellular systems properties is to analyze the topology

interaction (PPI) network as completely as possible, genome-scale interaction discovery approaches, such as high-throughput yeast two-hybrid screening ^25,26 and coaffinity purification ²⁷, have been proposed. Because of the complexity of a PPI network, many graph features (e.g. degree, closeness centrality, and betweenness centrality) have been proposed to measure the role of proteins in a PPI network ¹¹⁵. In addition, several agglomerative algorithmic approaches ^116,117 have been developed to identify local modularity structures of high connectivity with relatively low connectivity to the rest of the network. These dense sub-graphs are treated as potential functional modules.

In the mathematical and computational field of graph theory, the Laplacian matrix (or Kirchhoff matrix) is a matrix representation of a graph. The pseudoinverse of the Laplacian matrix plays a key role: it has a natural interpretation in terms of random walks on a graph, and it defines kernels on a graph ¹¹⁸. In biology, the Gaussian network model, an application of this idea, has succeeded in describing the local modularity structures (e.g. flexible/rigid regions and domains of proteins) and the important residues of a given protein ^119,120. A PPI network, which has functional local modularity structures (i.e. modules and complexes) and important hubs, behaves similarly to a protein in this respect.

To address these issues, we propose the MS-matrix to evaluate the modularity structure property within a PPI network. To our knowledge, the MS-matrix is the first property that identifies both globally important proteins and local modularity structures within a network. For a given PPI network of S. cerevisiae, our results demonstrate that the important proteins identified by the MS-matrix are related to essential biological processes (i.e. essential genes). In addition, the important proteins derived from the MS-matrix are highly consistent with topological features (i.e. degree, closeness centrality, and betweenness centrality). The relationship between proteins derived from the MS-matrix reflects Gene Ontology similarity and could be useful for module identification. Furthermore, the biological characterization (e.g. Gene Ontology) of the modules derived from the MS-matrix is similar to that of the modules collected from an experimental database (e.g. MIPS). Our results demonstrate that the MS-matrix provides insight for investigating a PPI network through important proteins and local modularity structures.

5-2. Methods

Modularity structure matrix

Figure 5-1. Overview of evaluating the importance of each node in a simple network through the MS-matrix

(A) A simple network with three locally dense regions (red, blue and green nodes). (B) The Laplacian matrix of the simple network. (C) The MS-matrix, derived as the pseudoinverse of the Laplacian matrix.

Here, we consider a PPI network as an undirected graph. The Laplacian matrix is a matrix representation of a graph. Here, we use a simple network (Fig. 5-1A) with 17 proteins to

matrix M (Fig. 5-1B) for the network. The M_ij is given as

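$$
M_{ij} =
\begin{cases}
-1, & \text{if } i \neq j \text{ and protein } i \text{ interacts with protein } j \\
0, & \text{if } i \neq j \text{ and protein } i \text{ does not interact with protein } j \\
k, & \text{if } i = j \text{, where } k \text{ is the degree of protein } i
\end{cases}
\qquad (1)
$$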

For example, the degree of node 8 is 3 (it interacts with nodes 6, 9, and 14), so M_{8,8}, M_{8,6}, M_{8,9}, and M_{8,14} are 3, -1, -1, and -1, respectively. The MS-matrix (MS) (Fig. 5-1C) is then the pseudoinverse of the Laplacian matrix M. Here, we computed the pseudoinverse of the Laplacian matrix with the Scientific Tools for Python (SciPy).
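The construction in Eq. (1) and the pseudoinverse step can be sketched as follows. This is a minimal illustration rather than the original implementation: the four-node graph is hypothetical (the edge list of Fig. 5-1A is not given in the text), `laplacian` is a helper name introduced here, and `numpy.linalg.pinv` is assumed as the pseudoinverse routine in place of the SciPy call used in the study.

```python
import numpy as np

def laplacian(n_nodes, edges):
    """Laplacian of Eq. (1): -1 for interacting pairs, degree on the diagonal."""
    M = np.zeros((n_nodes, n_nodes))
    for i, j in edges:
        M[i, j] = M[j, i] = -1.0
    # Off-diagonal row sums equal -degree, so negate them for the diagonal.
    np.fill_diagonal(M, -M.sum(axis=1))
    return M

# Hypothetical toy graph (not Fig. 5-1): triangle 0-1-2 plus pendant node 3 on node 2.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
M = laplacian(4, toy_edges)
MS = np.linalg.pinv(M)  # Moore-Penrose pseudoinverse (assumed routine)

ms_diag = np.diag(MS)
print("most central node:", int(np.argmin(ms_diag)))
print("most peripheral node:", int(np.argmax(ms_diag)))
```

In this toy graph the hub (node 2) receives the lowest diagonal value and the pendant node the highest, which mirrors the ordering described for the simple network.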

According to the off-diagonal values (MS_ij), the 17 proteins in the matrix MS can be clustered into three local modularity structures that match the original network (red, blue and green regions). Additionally, the nodes with the three lowest diagonal values of the MS-matrix (MS_ii), nodes 6, 9 and 14, are the central nodes; conversely, the nodes with the two highest MS_ii values, nodes 7 and 13, are peripheral nodes. These results are highly consistent with graph features such as degree, closeness centrality and betweenness centrality (Table 5-1).

Table 5-1. The degree, clustering coefficient, closeness centrality, betweenness centrality, and dynamic property of each node in the simple network (Fig. 5-1)

| ID | Degree | Clustering coefficient | Closeness centrality | Betweenness centrality | Qii |
|---|---|---|---|---|---|
| 1 | 3 | 0.667 | 0.421 | 0.004 | 0.52 |
| 2 | 3 | 0.667 | 0.421 | 0.004 | 0.52 |
| 3 | 3 | 0.667 | 0.421 | 0.004 | 0.52 |
| 4 | 3 | 0.667 | 0.421 | 0.004 | 0.52 |
| 5 | 3 | 0.667 | 0.421 | 0.004 | 0.52 |
| 6 | 9 | 0.222 | 0.64 | 0.563 | 0.183 |
| 7 | 1 | 0 | 0.4 | 0 | 1.066 |
| 8 | 3 | 1 | 0.516 | 0 | 0.36 |
| 9 | 6 | 0.4 | 0.593 | 0.4 | 0.242 |
| 10 | 3 | 1 | 0.41 | 0 | 0.595 |
| 11 | 3 | 1 | 0.41 | 0 | 0.595 |
| 12 | 4 | 0.5 | 0.421 | 0.125 | 0.566 |
| 13 | 1 | 0 | 0.302 | 0 | 1.448 |
| 14 | 5 | 0.3 | 0.571 | 0.329 | 0.272 |
| 15 | 2 | 0 | 0.39 | 0.058 | 0.845 |
| 16 | 2 | 0 | 0.39 | 0.058 | 0.845 |
| 17 | 2 | 0 | 0.296 | 0.004 | 1.036 |

Centrality properties

Here, we introduce two centrality measures that determine the relative importance of a node within a network. The betweenness centrality C_b(i) measures the centrality of node i by counting the shortest paths between all pairs of other nodes that pass through node i. C_b(i) is defined as follows:

$$
C_b(i) = \sum_{s \neq i \neq t} \frac{\sigma_{st}(i)}{\sigma_{st}} \qquad (2)
$$

where s and t are nodes different from i, σ_st denotes the number of shortest paths from s to t, and σ_st(i) is the number of shortest paths from s to t that pass through i. The betweenness value of node i is normalized by dividing by the number of node pairs excluding i, (N-1)(N-2)/2, where N is the total number of nodes in the paths that i belongs to.

The closeness centrality C_c(i) of a node i is defined as the reciprocal of the average shortest path length and is computed as follows:

$$
C_c(i) = \frac{1}{\operatorname{avg}\big(L(i, m)\big)} \qquad (3)
$$

where L(i,m) is the length of the shortest path between nodes i and m, and the average is taken over all other nodes m. The closeness centrality of each node is a value between 0 and 1.
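Both centrality measures can be computed from breadth-first searches, as sketched below. The sketch assumes an unweighted, connected, undirected graph stored as an adjacency dictionary; the toy graph and the function names are illustrative, and the betweenness accumulation uses Brandes' algorithm, which is one standard way to evaluate Eq. (2) and is not necessarily the tool used in the study.

```python
from collections import deque

def bfs_paths(adj, s):
    """Shortest-path distances, path counts and predecessors from source s."""
    dist = {s: 0}
    sigma = {v: 0 for v in adj}
    sigma[s] = 1
    preds = {v: [] for v in adj}
    order = []
    queue = deque([s])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                preds[w].append(v)
    return dist, sigma, preds, order

def closeness(adj, i):
    """Eq. (3): reciprocal of the average shortest path length from node i."""
    dist, _, _, _ = bfs_paths(adj, i)
    lengths = [d for v, d in dist.items() if v != i]
    return len(lengths) / sum(lengths)

def betweenness(adj):
    """Eq. (2) via Brandes' accumulation, normalized by (N-1)(N-2)/2."""
    cb = {v: 0.0 for v in adj}
    for s in adj:
        _, sigma, preds, order = bfs_paths(adj, s)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                cb[w] += delta[w]
    n = len(adj)
    pairs = (n - 1) * (n - 2) / 2
    # Every unordered pair (s, t) is visited from both ends, hence the factor 2.
    return {v: cb[v] / 2 / pairs for v in adj}

# Hypothetical toy graph: triangle 0-1-2 plus pendant node 3 on node 2.
toy_adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(closeness(toy_adj, 2), betweenness(toy_adj)[2])
```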

The modular similarity between protein pair

The off-diagonal value of the MS-matrix (MS_ij) describes the relationship between the modularity properties of proteins i and j. For a given protein A, the values MS_Ai between A and all proteins i form a profile of its overall modularity relationships. We can therefore quantify the similarity between a protein pair (A and B) by comparing the profiles MS_Ai and MS_Bi. Here, the similarity is evaluated by the Pearson correlation coefficient (r), computed as follows:

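$$
r(A, B) = \frac{\sum_{k=1}^{n} \left(MS_{Ak} - \overline{MS}_{A}\right)\left(MS_{Bk} - \overline{MS}_{B}\right)}{\sqrt{\sum_{k=1}^{n} \left(MS_{Ak} - \overline{MS}_{A}\right)^{2}}\,\sqrt{\sum_{k=1}^{n} \left(MS_{Bk} - \overline{MS}_{B}\right)^{2}}} \qquad (4)
$$

where \overline{MS}_A and \overline{MS}_B are the averages of MS_Ak and MS_Bk, respectively.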

For example, r(5,6) between nodes 5 and 6, which are located in the same region (red part in Fig. 5-1A), is 0.88. In contrast, r between nodes 6 and 9, which are in different regions (red and blue), is -0.53.
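A small sketch of Eq. (4) applied to rows of an MS-matrix is given below. The graph is a hypothetical one (two triangles joined by a single edge), chosen only because its two dense regions are obvious; it is not the network of Fig. 5-1, so the values differ from the 0.88 and -0.53 quoted above. `numpy.linalg.pinv` is assumed for the pseudoinverse, and `modular_similarity` is a name introduced here.

```python
import numpy as np

# Hypothetical toy graph: triangles 0-1-2 and 3-4-5 joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
M = np.zeros((n, n))
for i, j in edges:
    M[i, j] = M[j, i] = -1.0
np.fill_diagonal(M, -M.sum(axis=1))  # degrees on the diagonal, as in Eq. (1)
MS = np.linalg.pinv(M)

def modular_similarity(MS, a, b):
    """Eq. (4): Pearson correlation between rows a and b of the MS-matrix."""
    da = MS[a] - MS[a].mean()
    db = MS[b] - MS[b].mean()
    return float(da @ db / np.sqrt((da @ da) * (db @ db)))

same_region = modular_similarity(MS, 0, 1)       # both nodes in the first triangle
different_region = modular_similarity(MS, 0, 5)  # nodes in different triangles
print(round(same_region, 2), round(different_region, 2))
```

As in the worked example from the simple network, the pair inside one dense region correlates strongly and positively, while the pair spanning two regions correlates negatively.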

The protein-protein interaction network of S. cerevisiae

High-throughput data usually contain unreliable protein-protein interactions. To construct a high-quality yeast protein interaction network, we collected protein-protein interaction data from the core subset (named DIPc) of the DIP database ⁹, which consists of 1,882 proteins and 4,104 protein-protein interactions (the version dated 10 October 2010). The DIPc consists of only the most reliable interactions ¹²¹.

Data set of module of S. cerevisiae

To evaluate the reliability of the modules identified through the MS-matrix, we collected a positive set of yeast modules derived from MIPS ⁸⁵. Of the 193 modules derived from MIPS, we selected the 160 modules that have more than half of their proteins in the network constructed from DIPc. According to the definitions of a module in previous studies ^84,122,123, a module should have a higher connectivity. Here, the connectivity is defined as in a previous study ¹²⁴ and calculated as follows:

$$
\text{connectivity} = \frac{\text{No. of PPIs within a module}}{k \times (k - 1)} \qquad (11)
$$

where k is the number of proteins within a module. Finally, we defined a golden positive dataset of 69 MIPS modules whose connectivity is more than 0.6.
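Eq. (11) can be evaluated directly from a list of interactions, as in the sketch below. One point is an assumption made here and not stated in the text: whether each undirected interaction is counted once or once per direction. Because the reported connectivities exceed 0.5, the sketch defaults to counting both directions, so that a fully connected module scores 1; the flag `count_both_directions` and the function name are introduced for illustration only.

```python
def connectivity(module, interactions, count_both_directions=True):
    """Eq. (11): number of PPIs within a module divided by k * (k - 1)."""
    members = set(module)
    k = len(members)
    if k < 2:
        return 0.0  # Eq. (11) is undefined for fewer than two proteins
    # Keep each undirected interaction once; ignore self-interactions.
    within = {
        frozenset((a, b))
        for a, b in interactions
        if a != b and a in members and b in members
    }
    # Assumption: each interaction counts in both directions (A-B and B-A).
    n_ppi = len(within) * (2 if count_both_directions else 1)
    return n_ppi / (k * (k - 1))

# Hypothetical three-protein module.
triangle = [("A", "B"), ("B", "C"), ("A", "C")]
path = [("A", "B"), ("B", "C"), ("C", "D")]
print(connectivity(["A", "B", "C"], triangle))
print(connectivity(["A", "B", "C"], path))
```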

5-3. Results

The diagonal value of MS-matrix infers essential genes in PPI network of S. cerevisiae

Essential genes are usually involved in the fundamental cellular processes that are required for the survival of an organism ^96,97,125. As a result, the proteins that are products of essential genes should play an important role in the protein-protein interaction network of an organism.

To further investigate the relationship between essential genes and the important proteins detected by the diagonal values of the MS-matrix (MS_ii), we constructed the yeast protein interaction network from the high-quality protein-protein interaction data extracted from the core subset of the DIP database (named DIPc). Figure 5-2 displays the progressive ratio of essential proteins for MS_ii from 0 to the corresponding value. Approximately one-half of the proteins whose MS_ii values are less than 0.2 are recorded as essential proteins, and the proportion of essential proteins decreases as MS_ii increases. Furthermore, YBR160W (main cell cycle cyclin-dependent kinase ¹²⁶) and YJR045C (Hsp70 family ATPase ¹²⁷), the proteins with the lowest MS_ii values, are recorded as essential genes and play key roles in important biological processes (e.g. cell cycle and protein folding). These two proteins have many interactions and are located at the center of the network. In contrast, YGL001C and YLR100W, which are related to a non-essential process (ergosterol biosynthesis), have the highest MS_ii values and only one interaction each in the network. These results suggest that proteins with lower MS_ii are located within the steadier regions of the network and are more critical for the survival of an organism.

Figure 5-2. The relationship between the importance of proteins and essential proteins

The importance of a protein is measured by the MS-matrix diagonal value (MS_ii). Each MS_ii interval shows the progressive ratio of essential proteins; the lower the MS_ii value, the larger the proportion of essential proteins in the network.

The characterization and quantification of network topology derived from the diagonal value of MS-matrix

For a given network, there are various measures for determining the relative importance of a node (protein). For example, degree (degree centrality) is defined as the number of links incident upon a node. According to the degree distribution, P(k), a network can be identified as a scale-free network, which is the architecture of many cellular networks ⁹⁴. Closeness centrality is defined as the inverse of the average shortest path length of a given node. The average shortest path length can be regarded as a measure of how long it takes to spread information from a node to all other nodes sequentially ¹²⁸. The betweenness represents the fraction of all shortest paths between all nodes in a network that pass through a given node ¹¹⁵.

[Figure 5-2 plot. x-axis: MS-matrix diagonal value (MS_ii), binned from <0.2 to >1.8; y-axis: essential proteins (%), 30 to 50]

Our experimental result confirms that MS_ii can represent essential genes within the yeast PPI network. Next, we evaluated the relationship between MS_ii and the relative importance (i.e. degree, closeness centrality, and betweenness centrality) of proteins within a PPI network (Fig. 5-3). Although the Pearson correlation coefficient (r) between degree and MS_ii is only -0.50, the Spearman correlation (s) is -0.85. This result implies that the order of relative importance given by MS_ii is related to the order of relative importance given by degree. For example, the protein with the lowest MS_ii, YBR160W (main cell cycle cyclin-dependent kinase ¹²⁶), is also the node with the highest degree (58). Furthermore, r between closeness centrality and MS_ii is -0.78. For example, in the network described in Figure 5-1, node 8 is relatively important by closeness centrality (0.52; top 4) and is also identified by MS_ii. In addition, MS_ii is moderately related (r=-0.3 and s=-0.70) to betweenness centrality.
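The gap between the Pearson and Spearman coefficients is what one expects when two quantities are monotonically but not linearly related. The sketch below illustrates this with made-up numbers: the degree list and the reciprocal relationship are purely hypothetical and are not taken from the yeast network; they only show how a perfect rank agreement can coexist with a modest linear correlation.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            out[order[k]] = avg
        pos = end + 1
    return out

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: a score that falls off as the reciprocal of the degree.
degree = [1, 2, 3, 4, 5, 10, 20, 58]
score = [1.0 / d for d in degree]
print(pearson(degree, score), spearman(degree, score))
```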

Figure 5-3. Evaluation of the importance of proteins by (A) degree centrality, (B) closeness centrality, and (C) betweenness centrality

(A) The Spearman correlation between degree centrality and MS_ii is -0.85. (B) The Pearson correlation between closeness centrality and MS_ii is -0.78. (C) The Spearman correlation between betweenness centrality and MS_ii is -0.70.

The non-diagonal value of MS-matrix reflects the relationship between proteins in yeast PPI network

[Figure 5-3 plot. Panels A, B, and C; x-axis: rank of MS-matrix diagonal value (MS_ii)]

network, we utilize the similarity of Gene Ontology ³⁶ and the distance of a given protein pair (i and j) to evaluate MS_ij. The similarity of Gene Ontology is measured by the relative specificity similarity (RSS), proposed by Wu et al. ¹²², for the biological process, molecular function, and cellular component ontologies.

For a given protein A, the values MS_Ai between A and all proteins i form a profile of its overall modularity structure relationships. We can therefore quantify the similarity of the overall modularity relationships between a protein pair (A and B) from MS_Ai and MS_Bi. Here, the similarity between A and B is evaluated by the Pearson correlation coefficient, as given in equation (4).

Figure 5-4. The distribution of gene ontology similarities (i.e. RSS of BP, CC, and MF) and the shortest path between protein pairs under different modular similarities

RSS-BP and RSS-MF have their highest values when the modular similarity is more than 0.9; moreover, the average distance is lower than 2. RSS-CC is higher than 0.7 when the modular similarity is higher than 0.4.

[Figure 5-4 plot. x-axis: Pearson's correlation of MS-matrix value (-0.4 to 0.9); y-axes: RSS of gene ontology (0 to 1) and average shortest path (0 to 7); series: BP, CC, MF, Distance]

Figure 5-4 illustrates the distribution of gene ontology similarities and the shortest path between protein pairs. When protein pairs have a modular similarity ≥ 0.1, their average distance decreases markedly (from 4.88 to 3.48) and all of the average RSS values increase markedly. In addition, when protein pairs have a modular similarity of more than 0.4, their average distance is less than 2.5 and they share more biological process and cellular component annotations (RSS-BP > 0.7 and RSS-CC > 0.8). RSS-BP and RSS-MF have their highest values when the modular similarity is more than 0.9; moreover, the average distance is lower than 2. This result implies that protein pairs with a high modular similarity share a significant Gene Ontology similarity, especially for BP and CC, and are neighboring proteins (e.g. an interacting protein pair) in the PPI network.

Identification of modules based on the non-diagonal value of MS-matrix

According to the definitions of a module in previous studies ^84,122,123, the proteins of a module should be located in the same component, join the same biological process, carry out similar or related functions, and be relatively autonomous from the whole network. We have shown that the modular similarity of a protein pair (A and B), derived from the Pearson correlation coefficient of MS_Ai and MS_Bi, reflects the Gene Ontology similarity and the relationship between A and B within the PPI network. Therefore, we believe that the MS-matrix could be useful for identifying the modules of a given PPI network. Here, we use hierarchical clustering to identify the modules, with the distance between a protein pair (A and B) calculated from the modular similarity (i.e. the Pearson correlation coefficient of MS_Ai and MS_Bi). We identified 126 modules comprising 724 proteins from the MS-matrix. To further investigate the reliability of these modules, we compare them with the modules recorded in MIPS and analyze their Gene Ontology and connectivity.
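The clustering step can be sketched with SciPy's hierarchical clustering routines. Several choices below are assumptions, since the text does not specify them: the similarity is converted to a distance as 1 - r, average linkage is used, and the tree is cut at a distance of 1.0. The graph is again a hypothetical pair of triangles joined by one edge, not the yeast network.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical toy graph: triangles 0-1-2 and 3-4-5 joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
M = np.zeros((n, n))
for i, j in edges:
    M[i, j] = M[j, i] = -1.0
np.fill_diagonal(M, -M.sum(axis=1))
MS = np.linalg.pinv(M)

r = np.corrcoef(MS)        # modular similarity (Eq. 4) for every protein pair
D = 1.0 - r                # assumed conversion from similarity to distance
D = (D + D.T) / 2.0        # remove floating-point asymmetry
np.fill_diagonal(D, 0.0)

Z = linkage(squareform(D), method="average")        # linkage method is an assumption
labels = fcluster(Z, t=1.0, criterion="distance")   # cut height is an assumption
print(labels)
```

With these settings the two triangles are recovered as two separate clusters; on a real network the linkage method and the cut height would need to be tuned and justified.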

Of the 193 modules derived from MIPS, we selected the 160 modules that have more than half of their proteins in the network constructed from DIPc. According to the definitions of a module in previous studies ^84,122,123, a module should have a higher connectivity. Finally, we defined a golden positive dataset of 69 MIPS modules whose connectivity is more than 0.6.

The overlap between a reference MIPS module R and a predicted module M can be quantified by the Jaccard index ¹²⁹. The Jaccard index is calculated as follows:

$$
\text{Jaccard index} = \frac{|R \cap M|}{|R \cup M|} \qquad (12)
$$

where |R ∩ M| is the number of proteins in the intersection of R and M, and |R ∪ M| is the number of proteins in the union of R and M.

For each reference module, we find the predicted module that has the highest Jaccard index. In total, 47 reference modules are related to our modules (Jaccard index > 0). If a reference module with a Jaccard index ≥ 0.5 is considered a hit, our method hits 36 (52%) of the modules in the golden positive dataset.
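The matching procedure can be written in a few lines, as sketched below. The module names and protein labels are hypothetical placeholders, and `best_matches` is a helper introduced here for illustration.

```python
def jaccard(r, m):
    """Eq. (12): size of the intersection divided by size of the union."""
    r, m = set(r), set(m)
    union = r | m
    return len(r & m) / len(union) if union else 0.0

def best_matches(reference, predicted):
    """Highest Jaccard index over all predicted modules, per reference module."""
    return {
        name: max((jaccard(ref, p) for p in predicted), default=0.0)
        for name, ref in reference.items()
    }

# Hypothetical reference and predicted modules.
reference = {"ref1": {"A", "B", "C", "D"}, "ref2": {"X", "Y"}, "ref3": {"K", "L"}}
predicted = [{"A", "B", "C"}, {"X", "Q", "R", "S"}, {"M", "N"}]

best = best_matches(reference, predicted)
related = sum(1 for j in best.values() if j > 0)   # Jaccard index > 0
hits = sum(1 for j in best.values() if j >= 0.5)   # Jaccard index >= 0.5
print(best, related, hits)
```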

Next, because modules are relatively autonomous from the whole network, the connectivity of a module should be higher than the connectivity of the set formed by the module's proteins together with the proteins connected to the module (named "extent 1 layer"). Table 5-2 shows the connectivity of our modules, the 160 MIPS modules, and the golden positive dataset. Because the 160 MIPS modules are filtered only by the number of their proteins within the PPI network, they have a lower connectivity. In contrast, both our modules and the golden positive dataset have a higher average connectivity (0.73 and 0.84, respectively). For all sets, the average connectivity decreases markedly from the modules to the extent 1 layer. In addition, all modules derived from the MS-matrix and all modules in the golden positive dataset have a higher connectivity than their extent 1 layer (Table 5-2).

Table 5-2. Connectivity of modules and of the proteins that include the module and the proteins connecting to the module (named "extent 1 layer")

| Module set | No. of modules | Average connectivity | Average connectivity of extent 1 layer | No. of modules with connectivity > connectivity of extent 1 layer |
|---|---|---|---|---|
| Our | 126 | 0.73 | 0.32 | 126 |
| MIPS | 160 | 0.49 | 0.18 | 150 |
| Golden positive | 69 | 0.84 | 0.28 | 69 |

Furthermore, we annotated modules with the consensus GO terms within a given module. To annotate a module with Y proteins, we define the consensus ratio (CRM) of GO term i as CRM = Yi/Y, where Yi is the number of proteins with GO term i in the module. Next, the enrichment of each module in each GO term was determined by the p-value of the hypergeometric distribution, and this p-value was then adjusted by Bonferroni correction ^130,131. Here, a GO term is considered a representative GO term of a module if CRM > 0.6 and the adjusted p-value of the GO term is ≤ 0.05 ^130,131. Figure 5-5 illustrates the distribution of the number of representative GO terms within a given module derived from the MS-matrix and from MIPS. We then applied the two-tailed t-test to further investigate the difference between the MS-matrix and MIPS. All of the p-values (0.18, 0.30, and 0.13) imply that the number of representative GO terms within a given module does not differ significantly between the MS-matrix and MIPS. In addition, we investigated the representative GO terms that have the top 5 ratios in our modules or in the MIPS modules. The Jaccard indices of BP, CC, and MF are 0.67, 0.67, and 1, respectively. This result implies that the biological characterization (i.e. the number of representative GO terms in a module and the top 5 terms) of our modules derived from the MS-matrix is similar to that of the MIPS modules, which are identified by experiments.
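The annotation criteria can be illustrated with a short sketch. It makes two assumptions that the text does not spell out: the hypergeometric p-value is taken as the upper tail (the probability of seeing at least the observed number of annotated proteins), and the Bonferroni factor is the number of GO terms tested. The term labels, counts, and function names are hypothetical.

```python
from math import comb

def consensus_ratio(module_terms, term):
    """CRM = Yi / Y: share of the module's proteins annotated with the term."""
    return sum(term in terms for terms in module_terms) / len(module_terms)

def hypergeom_pvalue(N, K, n, x):
    """P(X >= x) when n proteins are drawn from N, of which K carry the term."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(x, upper + 1))
    return favourable / comb(N, n)

def is_representative(cr, p_value, n_tests, cr_cutoff=0.6, alpha=0.05):
    """CRM > 0.6 and Bonferroni-adjusted p-value <= 0.05."""
    adjusted = min(1.0, p_value * n_tests)  # Bonferroni correction
    return cr > cr_cutoff and adjusted <= alpha

# Hypothetical module of 4 proteins; each set holds one protein's GO terms.
module_terms = [{"term_a"}, {"term_a", "term_b"}, {"term_a"}, {"term_a"}]
cr_a = consensus_ratio(module_terms, "term_a")
# Hypothetical background: 20 proteins in the network, 5 carry term_a.
p_a = hypergeom_pvalue(N=20, K=5, n=4, x=4)
print(cr_a, p_a, is_representative(cr_a, p_a, n_tests=10))
```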

Figure 5-5. The distribution of the number of gene ontology annotations (i.e. (A) BP, (B) CC, and (C) MF) within a module

Based on the two-tailed t-test between the MS-matrix and MIPS, none of the gene ontology annotations (i.e. BP, CC and MF) differ significantly (the p-values are 0.18, 0.30, and 0.13, respectively).

Example of modules derived from the MS-matrix

Among the 126 modules (724 proteins) derived from the MS-matrix, Figures 5-6A and 5-6B illustrate the 9 modules with more than 10 proteins, both on the network and as dense regions of the MS-matrix. The two modules with the lowest average MS_ii values (0.1 and 0.16) are the 19S proteasome and the U4/U6 x U5 tri-snRNP complex (purple and light blue regions in Fig. 5-6A). The proteasome is a protease that controls diverse processes in eukaryotic cells, and snRNPs are large RNA-protein complexes on which the splicing of pre-mRNA occurs. Both modules play essential roles in the yeast PPI network. In addition, the two largest modules (19 and 17 proteins) are the F1-F0 ATP synthase and the peroxisomes. We then use two modules (i.e. the anaphase-promoting complex/cyclosome (APC/C) and the peroxisomes) as examples to further illustrate module identification with the MS-matrix.

Figure 5-6. The modules derived from the MS-matrix

(A) Yeast protein interaction network with 9 colored modules (e.g. F1-F0 ATP synthase (red), 19S proteasome (purple), anaphase-promoting complex/cyclosome (pink), and peroxisome (light green)). (B) The MS-matrix of

[Figure 5-6 labels: 19S proteasome, F1F0 ATP synthase, peroxisomes, Rab family GTPase, U4/U6 x U5 tri-snRNP complex, CCR4-NOT complex]
