
Inferring Protein-protein Interactions from Domain-domain Interactions

Chapter 2 Materials and Methods

2.3 Inferring Protein-protein Interactions from Domain-domain Interactions

2.3.1 Generate Protein-protein Interaction Candidates

Domains are structural subunits of proteins that can be thought of as “building blocks” conserved during evolution, so domain-domain interactions are a good indicator for inferring protein-protein interactions. Because domains are conserved, we can infer protein-protein interactions across different organisms via domain-domain interactions. For instance, if we observe a domain-domain interaction in Saccharomyces cerevisiae, we can infer protein-protein interactions in other organisms (e.g. Homo sapiens) between proteins with a similar domain-pair composition. We generated the protein-protein interaction candidates according to the domain composition of proteins, where a domain is defined as a Pfam domain (Figure 4). We employed “swisspfam” to identify the domain composition of proteins. From the domain-domain interactions we obtained, we inferred over one million protein-protein interactions; each structural domain pair generates thousands of protein-protein interaction candidates. We applied our approach to eight common model organisms: Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Helicobacter pylori, and Escherichia coli.
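The candidate-generation step can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name, the data layout (a protein-to-domains dictionary) and the toy identifiers are all assumptions, and skipping self-pairs is a simplification made here.

```python
from itertools import product


def generate_candidates(ddi_pairs, protein_domains):
    """Return protein pairs whose domains match a known domain-domain interaction."""
    # Invert the protein -> domains map into domain -> proteins.
    domain_to_proteins = {}
    for protein, domains in protein_domains.items():
        for domain in domains:
            domain_to_proteins.setdefault(domain, set()).add(protein)

    candidates = set()
    for dom_a, dom_b in ddi_pairs:
        proteins_a = domain_to_proteins.get(dom_a, set())
        proteins_b = domain_to_proteins.get(dom_b, set())
        for pa, pb in product(proteins_a, proteins_b):
            if pa != pb:  # simplification: self-pairs are skipped in this sketch
                # Store each unordered pair once.
                candidates.add(tuple(sorted((pa, pb))))
    return candidates


# Hypothetical toy data: placeholder protein names and domain identifiers.
protein_domains = {
    "P1": {"PF00001"},
    "P2": {"PF00002"},
    "P3": {"PF00001", "PF00003"},
}
ddi_pairs = [("PF00001", "PF00002")]
candidates = generate_candidates(ddi_pairs, protein_domains)
```

Because every protein carrying one domain is paired with every protein carrying the other, a single domain pair can expand into a large number of candidates, which is why a scoring step is needed afterwards.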

2.3.2 Scoring Function

Generating protein-protein interaction candidates without any criteria yields many false positives. In our approach, we inferred protein-protein interactions according to the domain composition of proteins, but many proteins contain the same sequence-based domain while having different biological functions. We therefore developed a scoring function to measure the similarity between the proteins we inferred (candidates) and the original proteins in which we identified domain-domain interactions (templates). Our scoring function is based on biological annotation, sequence similarity, and the degree of binding site conservation.

Our scoring function is given as

$$S_{total} = W_k S_k + W_e S_e + W_p S_p + W_b S_b \qquad (1)$$

where $W_k$, $W_e$, $W_p$ and $W_b$ are the weights of the corresponding terms. $S_k$ is the score to measure biological function similarity between proteins. $S_e$ and $S_p$ are the scores of measuring the sequence similarity between two proteins. $S_b$ is the score to measure the degree of binding site conservation.

We developed a novel approach to measure biological function similarity between proteins. Swiss-Prot is a curated protein sequence database that strives to provide a high level of annotation, and we found that this annotation is a good indicator of biological function similarity between proteins. Our scoring function for this similarity is analogous to information retrieval. We downloaded “uniprot_sprot.dat.gz” from:

ftp://us.expasy.org/databases/swiss-prot/release/

The uniprot_sprot.dat file provides a high level of annotation, such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc. We focus on the keyword annotation (Figure 5) of each protein. Table 3 shows the total keywords in the Swiss-Prot database. We drew on well-known information retrieval techniques to develop our scoring function, since the keywords annotated for each protein should differ in importance.

According to the frequency of each keyword, we transformed the frequency to a score by TF/IDF. The Term Frequency (TF) of a word is the number of its occurrences in a document of the corpus. In our study, the TF value is always 1, because a keyword does not appear twice in a protein annotation. The document frequency of a word is the number of documents in which the word occurs at least once, and the Inverse Document Frequency (IDF) decreases as this number grows. The IDF of each keyword is given as

$$V_i = \log\left(\frac{s}{f_i}\right) \qquad (2)$$

where $V_i$ is the vector strength of keyword i, $s$ is the total number of proteins stored in Swiss-Prot (in this thesis s = 163235), and $f_i$ is the frequency of keyword i in the Swiss-Prot database (we remove keywords with $f_i$ < 10 or $f_i$ > 10000). After transforming the frequency of each keyword to a score, we noticed that some keywords might be more important than others. We downloaded “spkw2go” from:

http://www.geneontology.org/external2go/spkw2go

The spkw2go file maps Swiss-Prot keywords to Gene Ontology (GO) [38]. Some keywords are repeated in this file, and we consider that these repeated keywords might be of greater importance (Table 2); our scoring function therefore counts these keywords two or three times. On the other hand, we remove the IDF of keywords with very high frequency, because a keyword that appears in nearly every protein has no power to distinguish proteins from each other. Each keyword has its own score, and a protein might contain more than one keyword. How, then, do we measure the similarity between proteins via keyword annotations? We apply the “Vector Space Model”. The vector space model is given as

$$S_k(P_x, P_y) = \frac{\sum_{i=1}^{n} V_{ix} V_{iy}}{\sqrt{\sum_{i=1}^{n} V_{ix}^{2}} \, \sqrt{\sum_{i=1}^{n} V_{iy}^{2}}} \qquad (3)$$

where $P_x$, $P_y$ are protein x and protein y, n is the total number of unique keywords in the Swiss-Prot database (n-dimensional vectors; in this thesis n = 949), and $V_{ix}$, $V_{iy}$ are the vector strengths of proteins x and y in the ith dimension. In this scoring function, we extracted the proteins whose keyword annotation corresponds to that of the template protein. After the $S_k$ values were calculated, we normalized the score to a Z-score and scaled it to 0~1. The normalized Z-score is used for measuring the keyword score separation between template and candidates.
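The keyword score can be illustrated with a short sketch. It assumes the common IDF form log(s / f_i) with a natural logarithm and the cosine form of the vector space model; neither the logarithm base nor the exact similarity formula is confirmed by the surviving text. The function names and the toy keyword frequencies are hypothetical, and the double or triple counting of keywords repeated in spkw2go is left out.

```python
import math


def idf_weights(keyword_freq, total_proteins, min_freq=10, max_freq=10000):
    """IDF score per keyword, dropping keywords with frequency < 10 or > 10000."""
    return {
        kw: math.log(total_proteins / f)  # assumed form: log(s / f_i)
        for kw, f in keyword_freq.items()
        if min_freq <= f <= max_freq
    }


def keyword_similarity(keywords_x, keywords_y, idf):
    """Cosine similarity of two keyword sets; TF is 1, so each weight is the IDF."""
    vx = {kw: idf[kw] for kw in keywords_x if kw in idf}
    vy = {kw: idf[kw] for kw in keywords_y if kw in idf}
    dot = sum(vx[kw] * vy[kw] for kw in vx.keys() & vy.keys())
    norm_x = math.sqrt(sum(v * v for v in vx.values()))
    norm_y = math.sqrt(sum(v * v for v in vy.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # no usable keywords on one side
    return dot / (norm_x * norm_y)


# Hypothetical keyword frequencies; s = 163235 is the value stated in the text.
keyword_freq = {"Kinase": 5000, "Transferase": 8000, "Membrane": 20000, "Rare": 3}
idf = idf_weights(keyword_freq, total_proteins=163235)
same = keyword_similarity({"Kinase", "Transferase"}, {"Transferase", "Kinase"}, idf)
disjoint = keyword_similarity({"Kinase"}, {"Transferase"}, idf)
```

Storing each protein as a sparse dictionary rather than a 949-dimensional array keeps the sketch short; the result is the same because absent keywords contribute zero to every sum.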

We employed PSI-BLAST [39] to measure the sequence similarity between proteins. In Section 2.2.1 we downloaded the standalone BLAST and the FASTA-format sequence data. The command to perform PSI-BLAST is:

blastpgp -d DATABASE -i INPUT -o OUTPUT -F F -G 8 -E 2 -j 3 -t F -h 5

The program “blastpgp” takes a protein query and performs a PSI-BLAST search against a protein database to create a position-specific matrix. Some of the arguments used in PSI-BLAST are the same as in BLAST. Options that differ between BLAST and PSI-BLAST include “-j 3”, the maximum number of rounds; “-t F”, which means that the program does not use composition-based statistics; and “-h 5”, the e-value threshold for including sequences in the score matrix model. The default e-value threshold is 0.001. However, in order to obtain correct results and the best performance, we changed the value from 0.001 to 5 for PSI-BLAST.
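For batch runs over many template proteins, the command line can be assembled programmatically. The sketch below only builds the argument list; the helper name, the database name and the file names are hypothetical, and the “-F F” pair (query filtering switched off) is an assumed reading of the “F F” fragment in the quoted command.

```python
import shlex


def build_blastpgp_command(database, query, output, rounds=3, inclusion_evalue=5):
    """Assemble the blastpgp argument list quoted in the text."""
    return [
        "blastpgp",
        "-d", database,               # protein database to search
        "-i", query,                  # FASTA file with the template sequence
        "-o", output,                 # report file
        "-F", "F",                    # assumed reading of "F F": no query filtering
        "-G", "8",                    # gap opening cost
        "-E", "2",                    # gap extension cost
        "-j", str(rounds),            # maximum number of rounds
        "-t", "F",                    # no composition-based statistics
        "-h", str(inclusion_evalue),  # e-value threshold for inclusion in the model
    ]


# Hypothetical database and file names.
cmd = build_blastpgp_command("swissprot", "template.fasta", "template.out")
print(shlex.join(cmd))
# Running it requires the legacy NCBI BLAST package, e.g.:
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.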

The top part of the PSI-BLAST output for each round divides the sequences into two groups: sequences found previously and used in the score model, and sequences not used in the score model. The output currently includes many diagnostics requested by users at NCBI. To skip quickly from the output of one round to the next, search for the string “producing”, which is part of the header for each round and is unlikely to appear elsewhere in the output. PSI-BLAST “converges” and stops if all sequences found at round i+1 below the e-value threshold were already in the model at the beginning of the round.
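The convergence rule amounts to a subset test. The following sketch states it in code; the function name and the sequence identifiers are hypothetical, and it illustrates the rule only, not how blastpgp implements it internally.

```python
def has_converged(model_ids, hits_below_threshold):
    """True when the new round found no sequence outside the current model."""
    return set(hits_below_threshold) <= set(model_ids)


# Hypothetical identifiers: the model at the start of round i+1 ...
model = {"seqA", "seqB"}
# ... and two possible sets of hits below the e-value threshold in that round.
converged = has_converged(model, ["seqA", "seqB"])
not_converged = has_converged(model, ["seqA", "seqC"])
```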

We took the sequence of the template protein as the PSI-BLAST input, changed the e-value threshold to 5, and set the maximum number of rounds to 3. The $S_e$, $S_p$ and $S_b$ scores are based on the PSI-BLAST result.

The $S_e$ is given as

$$S_e = -\log E \qquad (4)$$

where E is the e-value of the candidate protein. We transform the e-value of the PSI-BLAST output into our sequence similarity score. Then, we normalize the score to a Z-score and scale it to 0~1.
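The e-value transformation and the normalization step can be sketched as follows. Several details are assumptions rather than statements from the text: the natural logarithm, the floor that guards against e-values reported as 0, the population standard deviation, and min-max rescaling as the way of bringing Z-scores into 0~1.

```python
import math
import statistics


def evalue_score(evalue, floor=1e-300):
    """S_e = -log(E); the floor avoids log(0) for e-values reported as 0."""
    return -math.log(max(evalue, floor))


def zscore_scaled(scores):
    """Z-score the raw scores, then min-max rescale the Z-scores to [0, 1]."""
    mean = statistics.fmean(scores)
    stdev = statistics.pstdev(scores)
    if stdev == 0:
        return [0.0 for _ in scores]  # all scores identical: nothing to separate
    z = [(s - mean) / stdev for s in scores]
    z_min, z_max = min(z), max(z)
    return [(v - z_min) / (z_max - z_min) for v in z]


# Hypothetical e-values for three candidates.
evalues = [1e-50, 1e-10, 1.0]
raw = [evalue_score(e) for e in evalues]
scaled = zscore_scaled(raw)
```

Under these assumptions the rescaled values depend only on the relative spacing of the raw scores, so the candidate with the smallest e-value always maps to 1 and the one with the largest to 0.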

The $S_p$ is given as

$$S_p = \mathit{iden} + \mathit{pos} \qquad (5)$$

where pos is the percentage of positives in the PSI-BLAST sequence alignment result and iden is the percentage of identities in the PSI-BLAST sequence alignment result. Then, we normalize the score to a Z-score and scale it to 0~1.

The $S_b$ is given as

$$S_b = \frac{b_{candidate}}{b_{template}} \qquad (6)$$

where $b_{candidate}$ measures the binding site conservation of the PSI-BLAST sequence alignment result (candidate) and $b_{template}$ is the binding site score of the query protein (template).

The $b_{candidate}$ is given as

$$b_{candidate} = \sum_{i=1}^{n} S_i \qquad (7)$$

where n is the length of the sequence alignment of the binding site and $S_i$ is the substitution score of the sequence alignment result at position i. The $b_{template}$ is computed with the same equation. We use the “BLOSUM62” substitution matrix to calculate the binding site score for each PSI-BLAST sequence alignment result. Then, we transform the score to a percentage.

Then, we normalize the score to a Z-score and scale it to 0~1.
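A sketch of the binding site term follows. It assumes that the binding site score is the sum of BLOSUM62 substitution scores over the aligned binding site positions, that the template score is the template site aligned to itself, and that the percentage is the ratio of the two. Only a small excerpt of BLOSUM62 is typed in here, gaps are not handled, the site is assumed non-empty, and all names and residues are hypothetical.

```python
# Excerpt of the BLOSUM62 matrix (symmetric); a real run would load the full matrix.
BLOSUM62_EXCERPT = {
    ("A", "A"): 4, ("D", "D"): 6, ("E", "E"): 5, ("K", "K"): 5, ("R", "R"): 5,
    ("D", "E"): 2, ("K", "R"): 2, ("A", "D"): -2,
}


def substitution(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up a substitution score, trying both residue orders."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


def binding_site_score(query_residues, subject_residues):
    """Sum of substitution scores over the aligned binding site positions."""
    return sum(substitution(q, s) for q, s in zip(query_residues, subject_residues))


def binding_site_conservation(template_site, candidate_site):
    """Candidate score as a fraction of the template's self-alignment score."""
    b_template = binding_site_score(template_site, template_site)
    b_candidate = binding_site_score(template_site, candidate_site)
    return b_candidate / b_template


# Hypothetical three-residue binding site: template D-K-A aligned to candidate E-R-A.
conservation = binding_site_conservation("DKA", "ERA")
```

With these residues the template scores 6 + 5 + 4 = 15 against itself and the candidate scores 2 + 2 + 4 = 8, so the conservation is 8/15, or about 53%.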

Our purpose is to find proteins that are composed of specific domains and are similar to the template protein in biological function. We sum the four scoring terms for each protein candidate; the total score shows how similar a candidate is to the template. In general, the template protein has the highest score under our scoring function. We define a threshold to determine whether a candidate is sufficiently similar to the template protein; the threshold is defined as a percentage of the candidate's score relative to the template protein's score.
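The final selection step can be sketched as below. The weights default to 1 so that the total is the plain sum of the four terms; the threshold of 0.8, the function names and the toy scores are hypothetical, since no threshold value is given here.

```python
def total_score(terms, weights=None):
    """Weighted sum of the four normalized terms S_k, S_e, S_p and S_b."""
    weights = weights or {"k": 1.0, "e": 1.0, "p": 1.0, "b": 1.0}
    return sum(weights[name] * terms[name] for name in ("k", "e", "p", "b"))


def select_candidates(template_terms, candidate_terms, threshold=0.8):
    """Keep candidates whose total reaches `threshold` times the template's total."""
    template_total = total_score(template_terms)
    return [
        name
        for name, terms in candidate_terms.items()
        if total_score(terms) >= threshold * template_total
    ]


# Hypothetical normalized scores for one template and two candidates.
template = {"k": 1.0, "e": 1.0, "p": 1.0, "b": 1.0}
candidates = {
    "Q1": {"k": 0.9, "e": 0.8, "p": 0.9, "b": 1.0},
    "Q2": {"k": 0.2, "e": 0.5, "p": 0.4, "b": 0.1},
}
selected = select_candidates(template, candidates)
```

Expressing the threshold relative to the template's own total keeps it comparable across templates whose absolute scores differ.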
