Integration of Web Biological Databases

Chapter 2 Biological Resource

2.1 Integration of Web Biological Databases

The main data resources come from two web sites. The first is the Metalloprotein Database and Browser (MDB) of the Metalloprotein Structure and Design Program at the Scripps Research Institute (http://metallo.scripps.edu). The second is the Protein Data Bank (PDB, http://www.rcsb.org/pdb/), which provides general information about every protein structure. By combining these two data sets, a detailed description of each metalloprotein can be derived. For simplicity, the PDB information can be replaced by a more compact data set, PDBFinder (http://www.cmbi.kun.nl/gv/pdbfinder/), released on September 14, 2003.

In MDB, all proteins with bound metals can be extracted, and each binding site is defined by its nearby amino acid residues and compounds, which supports a sufficient understanding of metalloproteins. To achieve this, it is necessary to understand the set of structural, environmental, and functional requirements for metal-binding sites in existing metalloproteins or metalloenzymes.

Structurally, MDB catalogues several important properties: which types of (metal) ions are bound to the protein molecule, which types of ligands bind these metal ions (i.e., the first-shell ligands), and which residues contact the metal-binding ligands (i.e., the second-shell ligands), as illustrated in Figure 2.1.a.

As a result, three tables (Table 2.1.a to Table 2.1.c), namely the Protein Table, the Site Table and the Ligand Table, are needed to describe the structural relationship between a protein and its metal-binding sites. In other words, the objective can be restated as discovering the metal-binding second-shell ligands (residues) from the protein primary structure (the amino acid sequence).

Fig 2.1.a Metal-Ligand diagram in metal-binding protein

Table 2.1.a Protein Table in MDB

Table 2.1.b Site Table in MDB

Table 2.1.c Ligand Table in MDB

However, these databases are not released directly to the public. The only available information is formatted into 43 ligand text files, one for each of 43 different binding metals.

The latest version of the file package is “18”, updated on January 17, 2003. Figure 2.1.b shows the format of a ligand file.

Fig 2.1.b Format of ligand file in MDB package

Each line in a file represents one binding site surrounding one center atom in a protein. The format of each line in a ligand file can be expressed as

$$[\text{protein information}] + [\text{center information}] + [\text{number of ligands } (N)] + \sum_{i=1}^{N} [\text{ligand information}]_i$$

The first term, protein information, is the name of the PDB file, given as the unique PDB ID followed by “.pdb”. The second term is the information about the metal center, a text with 5 fields: type of central atom, recognition type (A or H) of the central atom, identifier of the protein chain where the central atom is located, residue series number of the central atom, and symbol of the central atom. For the recognition type, “A” means that the central element is recognized as an atom of a standard residue, whereas “H” means that it is recognized as an atom of a non-standard group in the protein.

The third term is an integer that gives the number of binding ligands (N) for this central atom. It is followed by N ligand records. Each ligand (binding atom) record has 7 fields: type of binding atom (P, N, M, W, A or H), recognition type of the binding atom (A or H, following the same rule as for the central atom), identifier of the protein chain where the binding atom is located, name of the residue where the binding atom is located, residue series number of the binding atom, symbol of the binding atom, and distance (in angstroms) between the central atom and the binding atom. The binding atom types “P”, “N”, “M”, “W”, “A” and “H” classify the binding atom as an atom of a protein, an atom of a nucleic acid, a metal atom, an atom of a water molecule, an anion (negatively charged ion) or a hetero atom, respectively. This classification is very useful when searching the database for metal-binding residues (recognition type “A”).
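The line layout described above can be illustrated with a minimal parsing sketch. It assumes that the fields are separated by whitespace and that no field is empty; the actual delimiter is defined by the MDB package (Figure 2.1.b) and may differ. The names `Center`, `Ligand`, `Site`, `parse_site` and `residue_ligands`, as well as the sample line, are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Center:
    atom_type: str       # type of central atom
    recognition: str     # "A" = standard residue, "H" = non-standard group
    chain: str           # protein chain identifier
    residue_number: str  # residue series number (kept as text)
    symbol: str          # symbol of central atom


@dataclass
class Ligand:
    atom_type: str       # P, N, M, W, A or H
    recognition: str     # "A" or "H", same rule as for the central atom
    chain: str           # protein chain identifier
    residue_name: str    # residue name of the binding atom
    residue_number: str  # residue series number (kept as text)
    symbol: str          # symbol of binding atom
    distance: float      # distance to the central atom, in angstroms


@dataclass
class Site:
    pdb_file: str        # PDB ID followed by ".pdb"
    center: Center
    ligands: List[Ligand]


def parse_site(line: str) -> Site:
    """Parse one ligand-file line (ASSUMED whitespace-separated)."""
    tokens = line.split()
    if len(tokens) < 7:
        raise ValueError("line is too short to hold protein, center and N")
    center = Center(*tokens[1:6])   # the 5 center fields
    n = int(tokens[6])              # number of ligands N
    body = tokens[7:]
    if len(body) != 7 * n:          # N ligand records of 7 fields each
        raise ValueError(f"expected {7 * n} ligand fields, got {len(body)}")
    ligands = []
    for i in range(n):
        f = body[7 * i: 7 * i + 7]
        ligands.append(Ligand(f[0], f[1], f[2], f[3], f[4], f[5], float(f[6])))
    return Site(tokens[0], center, ligands)


def residue_ligands(site: Site) -> List[Ligand]:
    """Keep binding atoms recognized as atoms of standard residues ("A")."""
    return [lig for lig in site.ligands if lig.recognition == "A"]


# Invented sample line, not taken from an MDB file.
SAMPLE = "1abc.pdb M H A 201 ZN 2 P A A HIS 94 NE2 2.05 W H A HOH 301 O 2.20"
site = parse_site(SAMPLE)
print([lig.residue_name for lig in residue_ligands(site)])
```

Residue numbers are kept as text because the source does not state whether they are always plain integers. If a real file leaves a field such as the chain identifier blank, whitespace splitting would misalign the fields, and fixed-width slicing would be needed instead.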

Similarly, the PDB and PDBFinder databases are released as text files or accessed through an HTML browser. For efficiency, it is necessary to build a stand-alone database on a local machine by parsing these released files, so each field in the text files must be identified and clarified. PDBFinder provides three major levels of information: (1) the entire protein, (2) each chain in the protein and (3) the hetero groups of the entire protein.

Level (1) PROTEIN provides several important pieces of information: whether the protein is an enzyme, the experimental details of how the protein structure was determined, the total number of aligned sequences in HSSP (the database of Homology-derived Secondary Structure of Proteins), the fraction of helix or beta sheet (the major secondary structures), the total number of amino acid residues (standard and non-standard), the total number of nucleic acids in the protein, and the total number of water molecules.

Level (2) CHAIN offers a detailed description of each chain in the protein: statistics about secondary structures (helix, 3/10 helix, pi helix, beta sheet, beta bridges, extended bridges, and the numbers of parallel and anti-parallel strand hydrogen bonds); amino acid counts (standard amino acids, non-standard amino acids, backbone-missing amino acids, sidechain-missing amino acids, amino acids with only the Cα atom given, unknown amino acids, Cys residues, and chain breaks larger than 4.5 angstroms); the numbers of nucleic acids, enzyme substrates and water molecules; and the primary structure sequence of the chain.

Level (3) HET-Groups gives hetero group information, corresponding to the records headed by HET in a PDB file. Tables 2.1.d, 2.1.e and 2.1.f show the tree organization of the records in levels (1), (2) and (3), respectively. By combining these databases, the binding sites of each protein can be extracted, and all proteins can be classified into enzyme or non-enzyme groups. Figure 2.1.c shows an example from a PDBFinder released file.
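As a rough illustration of combining the two sources in a stand-alone local database, the sketch below joins binding sites with protein-level records on the PDB ID. It is only a sketch under stated assumptions: the table and column names (`protein`, `site`, `pdb_id`, `ec_number`, `metal`) are hypothetical, the enzyme status is assumed to be stored as an X.X.X.X-style text value that is absent for non-enzymes, and the rows are placeholders rather than real database entries.

```python
import sqlite3


def build_db(proteins, sites):
    """Create an in-memory database with hypothetical table layouts."""
    conn = sqlite3.connect(":memory:")
    # Protein-level record; ec_number is NULL for non-enzymes (assumption).
    conn.execute(
        "CREATE TABLE protein (pdb_id TEXT PRIMARY KEY, ec_number TEXT)"
    )
    # One row per metal-binding site, as extracted from the ligand files.
    conn.execute("CREATE TABLE site (pdb_id TEXT, metal TEXT)")
    conn.executemany("INSERT INTO protein VALUES (?, ?)", proteins)
    conn.executemany("INSERT INTO site VALUES (?, ?)", sites)
    conn.commit()
    return conn


def classify(conn):
    """Label the protein of each binding site as enzyme or non-enzyme."""
    rows = conn.execute(
        "SELECT s.pdb_id, s.metal, "
        "CASE WHEN p.ec_number IS NULL THEN 'non-enzyme' ELSE 'enzyme' END "
        "FROM site AS s JOIN protein AS p ON p.pdb_id = s.pdb_id "
        "ORDER BY s.pdb_id, s.metal"
    )
    return rows.fetchall()


# Placeholder rows with invented identifiers, not real PDB entries.
conn = build_db(
    proteins=[("0AAA", "1.1.1.1"), ("0BBB", None)],
    sites=[("0AAA", "ZN"), ("0BBB", "CU")],
)
print(classify(conn))
```

An in-memory database keeps the sketch self-contained; passing a file path to `sqlite3.connect` would persist the tables on the local machine.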

Because the PDB file format is too verbose to describe here, its details are given in the appendix.

Field: Description

HSSP-N-Align: # of aligned sequences in HSSP

T-Frac-Helix, Beta: total fraction of helix or beta

T-Nres_Prot: total # of AA residues within the protein, including non-standard

T-non-std: total # of non-standard residues

T-Water-Mols: total # of water molecules

T-Nres-Nucl: total # of NA residues

Possible values/data types listed in the table: Integer; Text; {X, NMR, FIBER, MODEL, NEUTRON, OTHER}; X.X.X.X (X is one integer)

Table 2.1.d Record organization in level (1) PROTEIN

Protein polymer chain ID

# of residues which have SS

# of residues which are helix

# of residues which are 3/10 helix

# of residues which are pi helix

# of residues which are beta

# of residues which are beta bridge

# of residues which are extended bridge

# of parallel strand hydrogen bonds

# of anti-parallel strand hydrogen bonds

# of AA residues, including non-standard

# of non-standard AA

# of backbone-missing AA

# of sidechain-missing AA

# of only-CA-given AA

# of unknown type AA

# of Cys residues (related to SS bonds)

Nucl-Acids

Sequence: 1-letter code of AA or NA

Table 2.1.e Record organization in level (2) CHAIN

Field: Description (Possible value/Data type)

Name: full name of HET group (Text)

HET residue series number (Integer)

# of atoms within each HET group (Integer)

Table 2.1.f Record organization in level (3) HET-Groups

• ID : 1AOO

• Header : METALLOTHIONEIN

• Date : 1997-07-08

• Compound : ag-metallothionein

• Compound : (ag-mt)

• Compound : biological_unit: monomer;

• Source : (saccharomyces cerevisiae)

Fig 2.1.c An example of a released text file in PDBFinder
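The "key : value" layout in the example above lends itself to a simple line-by-line parser. The following is a minimal sketch that assumes each line holds one key and one value separated by the first colon, and that a key such as Compound may repeat; the function name `parse_record` is hypothetical, and the leading bullet is stripped only because it appears in the figure.

```python
from collections import defaultdict


def parse_record(text):
    """Collect 'key : value' lines into a dict of lists (keys may repeat)."""
    record = defaultdict(list)
    for raw in text.splitlines():
        # Drop surrounding blanks and an optional leading bullet.
        line = raw.strip().lstrip("•").strip()
        if ":" not in line:
            continue  # skip blank or unrecognized lines
        # Split on the FIRST colon only, since values may contain colons.
        key, _, value = line.partition(":")
        record[key.strip()].append(value.strip())
    return dict(record)


# The entry shown in Fig 2.1.c.
EXAMPLE = """\
• ID : 1AOO
• Header : METALLOTHIONEIN
• Date : 1997-07-08
• Compound : ag-metallothionein
• Compound : (ag-mt)
• Compound : biological_unit: monomer;
• Source : (saccharomyces cerevisiae)
"""

record = parse_record(EXAMPLE)
print(record["ID"], record["Compound"])
```

Storing every value in a list keeps repeated keys such as Compound intact, so continuation lines are not lost when the record is loaded into a local database.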
