ProteMiner-SSM: a web server for efficient analysis of similar protein tertiary substructures.

Darby Tien-Hau Chang, Chien-Yu Chen, Wen-Chin Chung, Yen-Jen Oyang*, Hsueh-Fen Juan¹ and Hsuan-Cheng Huang²

Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, ROC, ¹Institute of Biotechnology and Department of Chemical Engineering, National Taipei University of Technology, Taipei, Taiwan, ROC and ²Institute of Biological Chemistry, Academia Sinica, Taipei, Taiwan, ROC

Received February 15, 2004; Revised April 1, 2004; Accepted April 12, 2004

ABSTRACT

Analysis of protein–ligand interactions is a fundamental issue in drug design. As the detailed and accurate analysis of protein–ligand interactions involves calculation of binding free energy based on thermodynamics and even quantum mechanics, which is highly expensive in terms of computing time, conformational and structural analysis of proteins and ligands has been widely employed as a screening process in computer-aided drug design. In this paper, a web server called ProteMiner-SSM designed for efficient analysis of similar protein tertiary substructures is presented. In one experiment reported in this paper, the web server has been exploited to obtain some clues about a biochemical hypothesis. The main distinction in the software design of the web server is the filtering process incorporated to expedite the analysis. The filtering process extracts the residues located in the caves of the protein tertiary structure for analysis and operates with O(n log n) time complexity, where n is the number of residues in the protein. In comparison, the α-hull algorithm, which is a widely used algorithm in computer graphics for identifying those instances that are on the contour of a three-dimensional object, features O(n²) time complexity. Experimental results show that the filtering process presented in this paper is able to speed up the analysis by a factor of 3.15 to 9.37. The ProteMiner-SSM web server can be found at http://proteminer.csie.ntu.edu.tw/. There is a mirror site at http://p4.sbl.bc.sinica.edu.tw/proteminer/.

INTRODUCTION

One of the fundamental issues in drug design is analysis of protein–ligand interactions (1,2). The detailed and accurate analysis of protein–ligand interactions involves calculation of binding free energy based on thermodynamics and even quantum mechanics (3,4). However, this approach is highly expensive in terms of computing time. As a result, conformational and structural analysis of proteins and ligands has been widely employed as a screening process in computer-aided drug design (5–8).

In this paper, a web server designed for efficient analysis of similar protein tertiary substructures, named ProteMiner-SSM, is presented. Figure 1 illustrates one application that the design of ProteMiner-SSM addresses. In this application, the biochemist is given the crystal structure of a protein bound with a specific ligand and wants to search the Protein Data Bank (PDB) database (9) for other proteins that contain a similar binding site and therefore could interact with the specific ligand. In one experiment reported in this paper, ProteMiner-SSM has been exploited to investigate whether some proteins in the caspase family contain a binding site similar to that in the structure of integrin reported in (10). The experimental results provide biochemists with some valuable clues that conform to a biochemical hypothesis.

In terms of the application illustrated in Figure 1, it is apparent that only the substructures in the caves of the protein tertiary structure are of interest. Therefore, in order to expedite the analysis process, it is desirable to incorporate a mechanism that can effectively extract the residues in the caves of the protein tertiary structure. In this paper, an efficient filtering process with O(n log n) time complexity is employed, where n is the number of residues in the protein. In comparison with the α-hull algorithm (11), which is a widely used algorithm in computer graphics for identifying those instances on the contour of a three-dimensional (3D) object, the filtering process employed in this paper features a lower time complexity, O(n log n) versus O(n²). Experimental results show that the filtering process presented in this paper is able to speed up the analysis by a factor of 3.15 to 9.37.

*To whom correspondence should be addressed. Tel: +886 2 23625336 #431; Fax: +886 2 23688675; Email: [email protected]

Correspondence may also be addressed to Chien-Yu Chen. Email: [email protected]

Present address: Chien-Yu Chen, Graduate School of Biotechnology and Bioinformatics, Yuan-Ze University, Chung-Li, Taiwan

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated.

© 2004, the authors

The next section of this paper elaborates the software design of ProteMiner-SSM. We then report the experiments conducted to evaluate the performance of ProteMiner-SSM. Finally, concluding remarks are presented, and two appendices give the mathematical basis of Equations 1 and 2 in the text.

SOFTWARE DESIGN OF ProteMiner-SSM

ProteMiner-SSM carries out analysis in two steps. In the first step, a filtering process based on an efficient kernel density estimation algorithm is invoked to identify the crucial tertiary substructures on the contour of the protein that the analysis should focus on. In the second step, the geometric hashing algorithm from computer graphics (12,13) is invoked to compare the crucial substructures of the target protein with the binding/active site of the reference protein. In this paper, we refer to the protein that contains the binding/active site of interest as the reference protein and to the proteins in PDB against which the alignment is to be performed as the target proteins.

ProteMiner-SSM conducts analysis at the residue level with each residue represented by its alpha carbon in the vector space. In other words, a protein substructure is defined by the coordinates of the alpha carbons included in the substructure. The efficient kernel density estimation algorithm that forms the basis of the filtering process treats the set of residues $\{\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_n\}$ of a protein as $n$ samples randomly taken from a probability distribution in the 3D vector space and employs the learning algorithm that we have recently proposed (14,15) to construct an approximate probability density function of the following form:

$$\hat{f}(\mathbf{v}) = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\beta}{\lambda\sigma_i}\right)^{m}\exp\left(-\frac{\|\mathbf{v}-\mathbf{s}_i\|^{2}}{2\sigma_i^{2}}\right), \tag{1}$$

where

(i) $\mathbf{v}$ is a vector in an $m$-dimensional vector space and, in this paper, $m = 3$,

(ii) $\beta$ is the parameter that controls the smoothness of the approximation function,

(iii) $\sigma_i = \beta\delta_i = \beta\,\dfrac{R(\mathbf{s}_i)\sqrt{\pi}}{\sqrt[m]{(k+1)\,\Gamma((m/2)+1)}}$, where $R(\mathbf{s}_i)$ is the distance between sample $\mathbf{s}_i$ and its $k$-th nearest neighbor, $k$ is a parameter to be set by the user, and $\Gamma(\cdot)$ is the Gamma function (18),

(iv) $\lambda = \displaystyle\sum_{h=-\infty}^{\infty}\exp\left(-\frac{h^{2}}{2\beta^{2}}\right)$.

One interesting observation is that, regardless of which $\beta = \sigma_i/\delta_i$ ratio is employed, we have $\lambda/\beta \cong \sqrt{2\pi}$. If this observation can be proved to be generally correct, then we can further simplify Equation 1 and obtain

$$\hat{f}(\mathbf{v}) = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{\sqrt{2\pi}\,\sigma_i}\right)^{m}\exp\left(-\frac{\|\mathbf{v}-\mathbf{s}_i\|^{2}}{2\sigma_i^{2}}\right). \tag{2}$$

The mathematical basis of Equations 1 and 2 is elaborated in the appendices.
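The ratio λ/β that motivates Equation 2 can be checked numerically. The following is a minimal sketch, assuming a truncated sum (the `terms` cutoff and the sampled β values are illustrative choices, not taken from the paper); it prints the ratio next to √(2π) so that the quality of the agreement can be inspected for each β.

```python
import math

def lattice_gaussian_sum(beta, terms=200):
    """Truncated lambda = sum over integers h of exp(-h^2 / (2 * beta^2))."""
    return sum(math.exp(-(h * h) / (2.0 * beta * beta))
               for h in range(-terms, terms + 1))

target = math.sqrt(2.0 * math.pi)
for beta in (0.5, 1.0, 2.0, 4.0):  # illustrative values of beta = sigma_i / delta_i
    ratio = lattice_gaussian_sum(beta) / beta
    print(f"beta={beta}: lambda/beta={ratio:.10f}  sqrt(2*pi)={target:.10f}")
```

In this sketch the agreement becomes tighter as β grows, so the smallest β in the list is the most informative one to inspect.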

As the approximate probability density function shown in Equations 1 and 2 is a continuous and smooth function in the vector space, we can expect that the function values at the residues located on the contour of the protein tertiary structure are generally smaller than the function values at the inner residues. Accordingly, we can set a threshold of the function values to distinguish those residues that are located on the contour from those that are not.

Once the residues on the contour of the protein tertiary structure have been identified, the next task of the filtering process is to classify each of these residues according to whether or not it is located in a cave. This task can be carried out by applying Equation 1 or 2 again but with a larger β value. Applying Equation 1 or 2 with a larger β value implies that the approximate probability density function obtained is smoother. As a result, the function values at those residues that are located in a cave will generally be larger than the function values at those residues that are on the contour of the protein tertiary structure but not in a cave. Accordingly, a threshold can be set to classify these residues.
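The two-pass filter described above can be sketched as follows. This is an illustration under stated assumptions rather than the server's implementation: it uses a brute-force neighbor search (quadratic in n, unlike the O(n log n) algorithm the paper refers to), it evaluates the density of Equation 2, and the parameter names `beta_fine`, `beta_coarse`, `contour_threshold` and `cave_threshold` are hypothetical.

```python
import math

def kth_neighbor_distance(points, i, k):
    """R(s_i): distance from points[i] to its k-th nearest neighbor (brute force)."""
    dists = sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
    return dists[k - 1]

def bandwidths(points, beta, k):
    """sigma_i = beta * delta_i, with delta_i as in item (iii) of Equation 1."""
    m = len(points[0])
    scale = math.sqrt(math.pi) / ((k + 1) * math.gamma(m / 2 + 1)) ** (1.0 / m)
    return [beta * kth_neighbor_distance(points, i, k) * scale
            for i in range(len(points))]

def density(v, points, sigmas):
    """Equation 2 evaluated at v."""
    m = len(v)
    total = 0.0
    for s, sigma in zip(points, sigmas):
        sq_dist = sum((a - b) ** 2 for a, b in zip(v, s))
        total += (1.0 / (math.sqrt(2.0 * math.pi) * sigma)) ** m \
                 * math.exp(-sq_dist / (2.0 * sigma * sigma))
    return total / len(points)

def filter_cave_residues(points, beta_fine, beta_coarse, k,
                         contour_threshold, cave_threshold):
    """Return indices of residues classified as lying in a cave."""
    # Pass 1: low density with the smaller beta marks residues on the contour.
    fine = bandwidths(points, beta_fine, k)
    contour = [i for i, p in enumerate(points)
               if density(p, points, fine) < contour_threshold]
    # Pass 2: among contour residues, high density under the smoother
    # (larger beta) estimate marks residues located in a cave.
    coarse = bandwidths(points, beta_coarse, k)
    return [i for i in contour
            if density(points[i], points, coarse) > cave_threshold]
```

Here `points` is a list of alpha-carbon coordinates given as 3-tuples; suitable threshold values depend on the protein and are not specified in this section.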

[Figure 1. Flowchart: given the co-crystal structure of a protein (protein A) bound with a ligand; extract substructure from protein A; search in PDB; proteins containing similar substructures.]

With the filtering process applied to both the reference protein and the target protein, the next task that ProteMiner-SSM carries out is conducting structural alignment on the crucial substructures identified. In ProteMiner-SSM, we have adopted the common practice for carrying out protein structural alignment with the geometric hashing algorithm (5–8,16). With this practice, the coordinate systems examined by the geometric hashing algorithm are limited to those defined by the two backbone bonds connected to the alpha carbon of each residue. With the filtering process incorporated, in our implementation, the geometric hashing algorithm further narrows down its search space to only the coordinate systems defined by the residues located in the caves. In the design of ProteMiner-SSM, the likelihood of residue substitution is also taken into account. If the entry in the PAM 250 matrix (1,17) corresponding to a pair of residues aligned by the geometric hashing algorithm is < 2, then this pair of residues is excluded from the list of those successfully aligned.
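To make the notion of a coordinate system defined by the two backbone bonds concrete, here is a minimal sketch. The exact frame construction is not given in the text, so the choice below (origin at the alpha carbon, first axis along the Cα to N bond, third axis normal to the plane of the two bonds) is an assumption, and the quantization step `cell` is a hypothetical parameter.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    norm = math.sqrt(dot(a, a))
    return tuple(x / norm for x in a)

def backbone_frame(n_atom, ca_atom, c_atom):
    """Right-handed orthonormal frame built from the CA-N and CA-C bonds."""
    e1 = unit(sub(n_atom, ca_atom))             # along the CA-N bond
    e3 = unit(cross(e1, sub(c_atom, ca_atom)))  # normal to the plane of both bonds
    e2 = cross(e3, e1)                          # completes the right-handed frame
    return ca_atom, (e1, e2, e3)

def to_local(point, frame):
    """Coordinates of point expressed in the given frame."""
    origin, axes = frame
    d = sub(point, origin)
    return tuple(dot(d, e) for e in axes)

def hash_key(local, cell=1.0):
    """Quantize local coordinates into a hash-table key (cell size is hypothetical)."""
    return tuple(math.floor(c / cell) for c in local)
```

Because the local coordinates are unchanged when the whole protein is rotated and translated, alpha carbons of two proteins that fall into the same cells under some pair of frames are candidates for alignment.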

The discussions presented in this section so far elaborate the basics of the software design of ProteMiner-SSM. Additional details, including parameter settings and time complexity analysis, can be found in the supplementary material.
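The residue-substitution check described in this section amounts to a simple threshold on a substitution score. The sketch below assumes the scores are available as a lookup table; the entries marked as taken from Table 4 are copied from that table, while the low score used to demonstrate exclusion is an illustrative value and the function name is hypothetical.

```python
PAM250_MIN_SCORE = 2  # pairs scoring below this value are excluded

# Keys are alphabetically sorted residue-type pairs.
SCORES = {
    ("TYR", "TYR"): 10,  # from Table 4
    ("ASP", "GLU"): 3,   # from Table 4
    ("ASP", "GLN"): 2,   # from Table 4
    ("PRO", "PRO"): 6,   # from Table 4
    ("ASP", "TYR"): -4,  # illustrative low score, not taken from the paper
}

def keep_substitutable(aligned_pairs, scores, min_score=PAM250_MIN_SCORE):
    """Drop geometrically aligned pairs whose substitution score is below min_score."""
    kept = []
    for ref_res, tgt_res in aligned_pairs:
        key = tuple(sorted((ref_res, tgt_res)))
        if scores[key] >= min_score:
            kept.append((ref_res, tgt_res))
    return kept

pairs = [("TYR", "TYR"), ("GLN", "ASP"), ("ASP", "TYR")]
print(keep_substitutable(pairs, SCORES))
```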

EXPERIMENTAL RESULTS

This section reports two experiments conducted to evaluate the performance of ProteMiner-SSM. The main objective of the first experiment is to test the accuracy of ProteMiner-SSM. The second experiment demonstrates how biochemists can exploit ProteMiner-SSM to facilitate their research.

In the first experiment, three datasets, each of which contains a reference structure and a number of target proteins, are used to test whether ProteMiner-SSM is able to identify the region on the contour of the target protein that contains a substructure similar to that of the reference protein. Table 1 shows the characteristics of these three reference protein structures.

The first two reference structures are two enzymes in PDB, PDB ID = 1HDZ (alcohol dehydrogenase) and 1BL5 (isocitrate dehydrogenase), and the third reference structure, PDB ID = 1L5G, contains an integrin αVβ3 bound with a peptide ligand as reported in (10). For each of the two enzyme proteins, five proteins from the same family in PDB are employed as the target proteins. For integrin, the alternative structures of integrin αVβ3 with different bindings, PDB ID = 1JV2 and 1M1X, are employed as the target proteins.

Table 2 reports the results of the first experiment. The experimental results show that, with a high degree of accuracy, ProteMiner-SSM is able to identify the residues in the binding/active sites of the target protein. The only miss occurs when protein 1HJ6 is aligned with reference protein 1BL5. However, as Table 2 shows, the miss is not due to the filtering process invoked to expedite the analysis. Even without the filtering process, the geometric hashing algorithm successfully aligns only seven of the eight residues in the active site of protein 1HJ6 with the residues in the active site of the reference protein.

In the second experiment, ProteMiner-SSM is used to determine whether some proteins in the caspase family may contain a binding site similar to that in the structure of integrin reported in (10). Table 3 shows the results output by ProteMiner-SSM. It is observed that caspase-7 (PDB ID = 1F1J and 1K86), procaspase-7 (PDB ID = 1GQF), caspase-8 (PDB ID = 1F9E) and caspase-9 (PDB ID = 1JXQ) have the largest numbers of residues successfully aligned with the residues in the binding site of integrin. This result conforms to a hypothesis put forward by biochemists. However, the outputs of ProteMiner-SSM can only be regarded as interesting clues and, as shown in Table 4, it is typical that multiple possible alignments are found. Therefore, more in-depth analyses, such as protein docking or protein affinity analysis, must be conducted to further confirm the hypothesis.
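The ranking behind this observation can be reproduced directly from the aligned-residue counts in Table 3. The sketch below copies those counts (without and with filtering) and sorts the targets; the variable names are illustrative.

```python
# PDB ID -> (residues aligned without filtering, residues aligned with filtering),
# copied from Table 3.
TABLE3_ALIGNED = {
    "1C15": (9, 9), "1CWW": (10, 8), "1CY5": (9, 9), "1F1J": (16, 13),
    "1F9E": (15, 15), "1GQF": (14, 13), "1JXQ": (14, 12), "1K86": (14, 12),
    "1K88": (13, 13), "1NME": (12, 11), "2YGS": (9, 9),
}

# Rank targets by the unfiltered count, breaking ties by PDB ID.
ranked = sorted(TABLE3_ALIGNED, key=lambda pid: (-TABLE3_ALIGNED[pid][0], pid))
top_five = ranked[:5]

# Targets for which filtering reduced the number of aligned residues.
reduced = [pid for pid, (plain, filtered) in TABLE3_ALIGNED.items()
           if filtered < plain]

print("top five:", top_five)
print("fewer aligned residues after filtering:", reduced)
```

The second list corresponds to the cases in Table 3 where some accuracy is traded for speed.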

The results in Tables 1–4 also show that the filtering process incorporated in ProteMiner-SSM is able to speed up the analysis by a factor of 3.15 to 9.37. However, for the case reported in Table 3, the experimental results reveal that a certain degree of accuracy has been traded for efficiency. On the other hand, no such tradeoff has been observed for the cases reported in Table 2. Nevertheless, our experience is that the loss of accuracy due to the filtering process is generally within an acceptable range. In the supplementary material, we present in-depth discussions on parameter setting.

Table 1. Characteristics of the reference proteins in the first experiment

| PDB ID | Number of residues | Number of residues in the binding/active site | Number of residues remaining with the filtering process applied |
|---|---|---|---|
| 1HDZ | 748 | 14 | 307 |
| 1BL5 | 414 | 8 | 130 |
| 1L5G | 1470 | 18 | 833 |

Table 2. Experimental results for the first experiment

| Target protein | 3HUD | 1HTB | 1HDY | 1DEH | 1HDX | 1IDE | 1HJ6 | 1IDC | 1IDD | 1IDF | 1JV2 | 1M1X |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Reference protein | 1HDZ | 1HDZ | 1HDZ | 1HDZ | 1HDZ | 1BL5 | 1BL5 | 1BL5 | 1BL5 | 1BL5 | 1L5G | 1L5G |
| Number of residues in the active/binding site | 14 | 14 | 14 | 14 | 14 | 8 | 8 | 8 | 8 | 8 | 18 | 18 |
| Without filtering: execution time of geometric hashing (s) | 66.69 | 67.07 | 67.23 | 66.88 | 67.00 | 10.76 | 10.80 | 10.76 | 10.73 | 10.77 | 447.14 | 444.12 |
| Without filtering: number of residues in the active/binding site that are successfully aligned | 14 | 14 | 14 | 14 | 14 | 8 | 7 | 8 | 8 | 8 | 18 | 18 |
| Without filtering: RMSD of aligned pairs | 0.79 | 0.37 | 0.47 | 0.36 | 0.49 | 0.52 | 0.42 | 0.55 | 0.49 | 0.43 | 1.21 | 1.22 |
| With filtering: execution time of filtering (s) | 0.13 | 0.14 | 0.13 | 0.14 | 0.14 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.33 | 0.33 |
| With filtering: execution time of geometric hashing (s) | 11.95 | 11.32 | 11.86 | 11.78 | 11.59 | 1.25 | 1.10 | 1.10 | 1.15 | 1.09 | 140.99 | 140.74 |
| With filtering: number of residues in the active/binding site that are successfully aligned | 14 | 14 | 14 | 14 | 14 | 8 | 7 | 8 | 8 | 8 | 18 | 18 |
| With filtering: RMSD of aligned pairs | 0.96 | 0.37 | 0.54 | 0.36 | 0.66 | 0.56 | 0.54 | 0.62 | 0.5 | 0.57 | 1.23 | 1.22 |
| Speedup due to the filtering process | 5.52 | 5.85 | 5.61 | 5.61 | 5.71 | 8.21 | 9.31 | 9.28 | 8.87 | 9.37 | 3.16 | 3.15 |

RMSD = root-mean-square deviation.
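The speedup row of Table 2 appears consistent with dividing the unfiltered hashing time by the sum of the filtering time and the filtered hashing time. This reading is inferred from the numbers in the table rather than stated in the text; the sketch below recomputes the ratio for each target so it can be compared with the reported value.

```python
# Target -> (hashing time without filtering, filtering time,
#            hashing time with filtering, reported speedup), copied from Table 2.
TABLE2_TIMES = {
    "3HUD": (66.69, 0.13, 11.95, 5.52),
    "1HTB": (67.07, 0.14, 11.32, 5.85),
    "1HDY": (67.23, 0.13, 11.86, 5.61),
    "1DEH": (66.88, 0.14, 11.78, 5.61),
    "1HDX": (67.00, 0.14, 11.59, 5.71),
    "1IDE": (10.76, 0.06, 1.25, 8.21),
    "1HJ6": (10.80, 0.06, 1.10, 9.31),
    "1IDC": (10.76, 0.06, 1.10, 9.28),
    "1IDD": (10.73, 0.06, 1.15, 8.87),
    "1IDF": (10.77, 0.06, 1.09, 9.37),
    "1JV2": (447.14, 0.33, 140.99, 3.16),
    "1M1X": (444.12, 0.33, 140.74, 3.15),
}

def speedup(t_plain, t_filter, t_hash_filtered):
    """Assumed definition: unfiltered time over total filtered time."""
    return t_plain / (t_filter + t_hash_filtered)

for target, (t_plain, t_filter, t_hash, reported) in TABLE2_TIMES.items():
    computed = speedup(t_plain, t_filter, t_hash)
    print(f"{target}: computed {computed:.2f}, reported {reported:.2f}")
```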

CONCLUSION AND FUTURE WORK

In this paper, a web server designed for efficient analysis of similar protein tertiary substructures is presented. In one experiment presented in this paper, ProteMiner-SSM has been exploited to investigate whether some proteins in the caspase family contain a binding site similar to that in the structure of integrin αVβ3, and the experimental results conform to the biochemical hypothesis. However, the predictions made by ProteMiner-SSM can only be regarded as interesting clues that require more in-depth investigation. The experimental results also show that the filtering process presented in this paper is able to speed up the analysis process by a factor of 3.15 to 9.37.

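Table 4. Two possible mappings from the second experiment of the residues in the crucial substructures of caspase-8 to the residues in the binding site of integrin αVβ3

First mapping

| Integrin αVβ3 (reference protein): chain | Residue index | Residue type | Caspase-8 (PDB ID = 1F9E): chain | Residue index | Residue type | PAM250 score |
|---|---|---|---|---|---|---|
| A | 178 | TYR | D | 320 | TYR | 10 |
| A | 218 | ASP | A | 297 | GLU | 3 |
| B | 119 | ASP | B | 388 | GLN | 2 |
| B | 121 | SER | B | 339 | SER | 2 |
| B | 122 | TYR | B | 340 | TYR | 10 |
| B | 123 | SER | B | 378 | SER | 2 |
| B | 126 | ASP | B | 351 | GLN | 2 |
| B | 158 | ASP | A | 291 | GLN | 2 |
| B | 215 | ASN | B | 381 | ASP | 2 |
| B | 216 | ARG | B | 384 | LYS | 3 |
| B | 217 | ASP | D | 323 | ASP | 4 |
| B | 219 | PRO | D | 322 | PRO | 6 |
| B | 220 | GLU | D | 324 | GLU | 4 |
| B | 251 | ASP | B | 374 | ASN | 2 |

Second mapping

| Integrin αVβ3 (reference protein): chain | Residue index | Residue type | Caspase-8 (PDB ID = 1F9E): chain | Residue index | Residue type | PAM250 score |
|---|---|---|---|---|---|---|
| A | 150 | ASP | K | 289 | ASN | 2 |
| A | 178 | TYR | K | 290 | TYR | 10 |
| A | 218 | ASP | L | 385 | GLN | 2 |
| B | 119 | ASP | K | 170 | ASN | 2 |
| B | 121 | SER | K | 236 | SER | 2 |
| B | 122 | TYR | K | 244 | TYR | 10 |
| B | 126 | ASP | K | 178 | ASP | 4 |
| B | 127 | ASP | K | 180 | ASN | 2 |
| B | 215 | ASN | K | 239 | ASP | 2 |
| B | 216 | ARG | K | 240 | LYS | 3 |
| B | 217 | ASP | K | 286 | GLN | 2 |
| B | 218 | ALA | K | 284 | ALA | 2 |
| B | 219 | PRO | L | 387 | PRO | 6 |
| B | 220 | GLU | K | 283 | GLN | 2 |
| B | 251 | ASP | V | 4604 | ASP | 4 |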

Table 3. Output of ProteMiner-SSM for the second experiment

| PDB ID of the target protein | 1C15 | 1CWW | 1CY5 | 1F1J | 1F9E | 1GQF | 1JXQ | 1K86 | 1K88 | 1NME | 2YGS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of residues | 97 | 102 | 93 | 469 | 1476 | 530 | 940 | 464 | 461 | 238 | 92 |
| Number of residues remaining with filtering applied | 26 | 31 | 40 | 184 | 542 | 103 | 329 | 110 | 112 | 66 | 33 |
| Without filtering: execution time of geometric hashing | 5.85 | 6.13 | 5.38 | 65.37 | 429.70 | 81.19 | 217.50 | 63.42 | 63.76 | 21.01 | 5.33 |
| Without filtering: number of residues in a cave that are successfully aligned | 9 | 10 | 9 | 16 | 15 | 14 | 14 | 14 | 13 | 12 | 9 |
| Without filtering: RMSD of aligned pairs | 3.91 | 3.46 | 3.37 | 4.24 | 3.99 | 4.75 | 4.24 | 4.09 | 4.04 | 4.14 | 3.51 |
| With filtering: execution time of filtering | 0.01 | 0.01 | 0.01 | 0.07 | 0.32 | 0.08 | 0.17 | 0.07 | 0.07 | 0.03 | 0.01 |
| With filtering: execution time of geometric hashing | 0.92 | 1.08 | 1.34 | 14.81 | 91.48 | 9.22 | 44.52 | 8.81 | 9.12 | 3.38 | 1.12 |
| With filtering: number of residues in a cave that are successfully aligned | 9 | 8 | 9 | 13 | 15 | 13 | 12 | 12 | 13 | 11 | 9 |
| With filtering: RMSD of aligned pairs | 3.91 | 4.08 | 3.37 | 4.01 | 4.25 | 5.06 | 4.16 | 4.15 | 4.04 | 3.87 | 3.51 |


As the experience gained from this research has been encouraging, it is of interest to investigate how to extend the ideas presented in this paper to other protein analysis problems. Possible topics include protein function prediction, protein structural clustering and protein structural classification.

SUPPLEMENTARY MATERIAL

Supplementary Material is available at NAR Online.

ACKNOWLEDGEMENTS

This research is sponsored by National Science Council of ROC under contract NSC 92-2323-B-002-013 and NSC 92-3112-B-027-001.

REFERENCES

1. Krane,D.E. and Raymer,M.L. (2002) Fundamental Concepts of Bioinformatics, Benjamin Cummings.

2. Lesk,A.M. (2002) Introduction to Bioinformatics, Oxford University Press, New York.

3. Atkins,P.W. and Depaula,J. (2001) Physical Chemistry, 7th edn. W H Freeman & Co.

4. Bourne,P.E. and Weissig,H. (eds) (2003) Structural Bioinformatics, Wiley-Liss Inc., New Jersey.

5. Boutonnet,N.S., Rooman,M.J., Ochagavia,M.E., Richelle,J. and Wodak,S.J. (1995) Optimal protein structure alignments by multiple linkage clustering: application to distantly related proteins. Protein Eng., 8, 647–662.

6. Orengo,C. and Taylor,W. (1996) SSAP: sequential structure alignment program for protein structure comparison. Methods Enzymol., 266, 617–635.

7. Pennec,X. and Ayache,N. (1994) An O(n²) algorithm for 3D substructure matching of proteins. In Califano,A., Rigoutsos,I. and Wolfson,H.J. (eds), Shape and Pattern Matching in Computational Biology. Proceedings of the First International Workshop, Seattle, Plenum Publishing, pp. 25–40.

8. Pennec,X. and Ayache,N. (1998) A geometric algorithm to find small but highly similar 3D substructures in proteins. Bioinformatics, 14, 516–522.

9. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.

10. Xiong,J.P., Stehle,T., Zhang,R., Joachimiak,A., Frech,M., Goodman,S.L. and Arnaout,M.A. (2002) Crystal structure of the extracellular segment of integrin alpha Vbeta3 in complex with an Arg-Gly-Asp ligand. Science, 296, 151–155.

11. Edelsbrunner,H. and Mucke,E.P. (1994) Three-dimensional alpha shapes. ACM Trans. Graphics, 13, 43–72.

12. Haim,J.W. (1997) Geometric hashing: an overview. IEEE Comput. Sci. Eng., 4, 10–21.

13. Lamdan,Y. and Wolfson,H. (1988) Geometric Hashing: A General and Efficient Model-Based Recognition Scheme. Proceedings of International Conference on Computer Vision, pp. 238–249.

14. Oyang,Y.-J., Chang,D.T.-H., Chen,C.-Y. and Hwang,S.-C. (2003) Expediting Protein Structural Analysis with an Efficient Kernel Density Estimation Algorithm. Proceedings of IEEE 5th International Symposium on Multimedia Software Engineering, Taichung, Taiwan.

15. Oyang,Y.-J., Hwang,S.-C., Ou,Y.-Y., Chen,C.-Y. and Chen,Z.-W. (2002) A Novel Learning Algorithm for Data Classification with Radial Basis Function Networks. Proceedings of 9th International Conference on Neural Information Processing (ICONIP-2002), Singapore.

16. Tu,J.-T. (2003) Protein active site prediction by matching 3D structural data. Master thesis, Department of Computer Science and Information Engineering, National Taiwan University.

17. Altschul,S.F. (1991) Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol., 219, 555–565.

18. Artin,E. (1964) The Gamma Function, Holt, Rinehart and Winston, New York.

APPENDIX A

The efficient kernel density estimation algorithm that forms the basis of the filtering process employed in this paper treats a given set of instances $\{\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_n\}$ in the vector space as $n$ samples randomly taken from a probability distribution and constructs an approximate probability density function of the following form:

$$\hat{f}(\mathbf{v}) = \sum_{i=1}^{n} w_i \exp\left(-\frac{\|\mathbf{v}-\mathbf{s}_i\|^{2}}{2\sigma_i^{2}}\right), \tag{A.1}$$

where $\mathbf{v}$ is a vector in the vector space and $\|\mathbf{v}-\mathbf{s}_i\|$ is the distance between vectors $\mathbf{v}$ and $\mathbf{s}_i$. Accordingly, the task that the efficient kernel density estimation algorithm carries out is to determine the values of $w_i$ and $\sigma_i$ in Equation A.1, so that $\hat{f}$ provides a good approximation of the original probability density function $f$. In fact, the kernel density estimation problem described here can be transformed to a kernel smoothing problem, if we employ the following equation to estimate the values of $f$ at $\mathbf{s}_i$, $i = 1, 2, \ldots, n$:

$$f(\mathbf{s}_i) \cong \frac{k+1}{n}\left[\frac{R(\mathbf{s}_i)^{m}\,\pi^{m/2}}{\Gamma((m/2)+1)}\right]^{-1}, \tag{A.2}$$

where

(i) $m$ is the dimension of the vector space,

(ii) $R(\mathbf{s}_i)$ is the distance between instance $\mathbf{s}_i$ and its $k$-th nearest neighbor,

(iii) $R(\mathbf{s}_i)^{m}\,\pi^{m/2}/\Gamma((m/2)+1)$ is the volume of a hypersphere with radius $R(\mathbf{s}_i)$ in an $m$-dimensional vector space,

(iv) $\Gamma(\cdot)$ is the Gamma function (18) and

(v) $k$ is a parameter to be set by the user.
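Equation A.2 can be illustrated with a short sketch: the density at an instance is taken as the k + 1 instances inside the hypersphere reaching its k-th nearest neighbor, divided by n times the hypersphere volume. The brute-force neighbor search and the function names are assumptions made for illustration.

```python
import math

def hypersphere_volume(radius, m):
    """Volume of an m-dimensional hypersphere: R^m * pi^(m/2) / Gamma(m/2 + 1)."""
    return radius ** m * math.pi ** (m / 2) / math.gamma(m / 2 + 1)

def knn_density(points, i, k):
    """Estimate f(s_i) as in Equation A.2 using a brute-force neighbor search."""
    n = len(points)
    dists = sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
    radius = dists[k - 1]  # R(s_i): distance to the k-th nearest neighbor
    return (k + 1) / (n * hypersphere_volume(radius, len(points[i])))

# Eleven evenly spaced instances on a line (m = 1), unit spacing.
line = [(float(x),) for x in range(11)]
print(knn_density(line, 5, 2))
```

For m = 1 the "hypersphere" is an interval of length 2R, so the middle instance above has k + 1 = 3 instances within an interval of length 2.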

As shown in Equation A.1, the efficient kernel density estimation algorithm places one spherical Gaussian function at each instance. For an instance $\mathbf{s}_i$, the efficient kernel density estimation algorithm conducts a mathematical analysis on a synthesized data set. The synthesized data set is derived from two ideal assumptions and serves as an analogy of the distribution of the instances in the proximity of $\mathbf{s}_i$. The first ideal assumption is that the sampling density in the proximity of $\mathbf{s}_i$ is sufficiently high and, therefore, the variation of the probability density function $f$ at $\mathbf{s}_i$ and its neighboring instances approaches 0. The second ideal assumption is that the instances in the proximity of $\mathbf{s}_i$ are evenly spaced by a distance determined by the value of $f(\mathbf{s}_i)$. The details of the synthesized data set are elaborated in the following:

(i) Instance $\mathbf{s}_i$ is located at the origin and the neighboring instances are located at $(h_1\delta_i, h_2\delta_i, \ldots, h_m\delta_i)$, where $h_1, h_2, \ldots, h_m$ are integers and $\delta_i$ is the average distance between two adjacent instances in the proximity of $\mathbf{s}_i$. How $\delta_i$ is determined will be addressed later on.

(ii) The values of the probability density function at the instances in the synthesized data set, including $\mathbf{s}_i$, are all equal to $f(\mathbf{s}_i)$. The value of $f(\mathbf{s}_i)$ is estimated based on Equation A.2.

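The efficient kernel density estimation algorithm begins with an analysis on the synthesized data set to figure out the values of $w_i$ and $\sigma_i$ that make the function $g_i(\cdot)$ defined in the following virtually a constant function equal to $f(\mathbf{s}_i)$:

$$g_i(\mathbf{x}) = w_i\left[\sum_{h_1=-\infty}^{\infty}\sum_{h_2=-\infty}^{\infty}\cdots\sum_{h_m=-\infty}^{\infty}\exp\left(-\frac{\|\mathbf{x}-(h_1\delta_i, h_2\delta_i, \ldots, h_m\delta_i)\|^{2}}{2\sigma_i^{2}}\right)\right] \cong f(\mathbf{s}_i). \tag{A.3}$$

In other words, the objective is to make $g_i(\mathbf{x})$ a good approximator of $f(\mathbf{x})$ in the proximity of $\mathbf{s}_i$. Let $\mathbf{x} = (x_1, x_2, \ldots, x_m)$; then we have

$$g_i(\mathbf{x}) = w_i\left[\sum_{h_1=-\infty}^{\infty}\exp\left(-\frac{(x_1-h_1\delta_i)^{2}}{2\sigma_i^{2}}\right)\right]\left[\sum_{h_2=-\infty}^{\infty}\exp\left(-\frac{(x_2-h_2\delta_i)^{2}}{2\sigma_i^{2}}\right)\right]\cdots\left[\sum_{h_m=-\infty}^{\infty}\exp\left(-\frac{(x_m-h_m\delta_i)^{2}}{2\sigma_i^{2}}\right)\right].$$

It is shown in Appendix B that, with $\sigma_i = \delta_i$,

$$2.5066282745 - 1.34\times10^{-8} < \sum_{h=-\infty}^{\infty}\exp\left(-\frac{(y-h\delta_i)^{2}}{2\sigma_i^{2}}\right) < 2.5066282745 + 1.34\times10^{-8}.$$

Therefore, with $\sigma_i = \delta_i$, $g_i(\mathbf{x})$ defined in Equation A.3 is virtually a constant function. In fact, it can be shown that, as long as $\sigma_i > 0.45\,\delta_i$, $g_i(\mathbf{x})$ is virtually a constant function. Accordingly, the next thing to do is to find the appropriate value of $w_i$ that makes $g_i(\mathbf{x})$ approximately equal to $f(\mathbf{s}_i)$. We have

$$g_i(\mathbf{s}_i) = g_i(0, \ldots, 0) = w_i\left[\sum_{h_1=-\infty}^{\infty}\sum_{h_2=-\infty}^{\infty}\cdots\sum_{h_m=-\infty}^{\infty}\exp\left(-\frac{(h_1^{2}+h_2^{2}+\cdots+h_m^{2})\,\delta_i^{2}}{2\sigma_i^{2}}\right)\right] = w_i\left[\sum_{h=-\infty}^{\infty}\exp\left(-\frac{h^{2}}{2\beta^{2}}\right)\right]^{m},$$

where $\beta = \sigma_i/\delta_i$. Therefore, we need to set $w_i$ as follows:

$$w_i\left[\sum_{h=-\infty}^{\infty}\exp\left(-\frac{h^{2}}{2\beta^{2}}\right)\right]^{m} = f(\mathbf{s}_i).$$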

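If we employ Equation A.2 to estimate the value of $f(\mathbf{s}_i)$, then we have

$$w_i = \frac{(k+1)\,\Gamma((m/2)+1)}{\lambda^{m}\,n\,R(\mathbf{s}_i)^{m}\,\pi^{m/2}}, \quad\text{where } \lambda = \sum_{h=-\infty}^{\infty}\exp\left(-\frac{h^{2}}{2\beta^{2}}\right). \tag{A.4}$$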

So far, we have found that if we set an appropriate ratio of $\beta = \sigma_i/\delta_i$ and set $w_i$ according to Equation A.4, we can make $g_i(\mathbf{x})$ a good approximator of $f(\mathbf{x})$ in the proximity of $\mathbf{s}_i$. The only remaining issue is to derive a closed form of $\sigma_i$. In this paper, $\delta_i$ is set to the average distance between two adjacent instances in the proximity of sample $\mathbf{s}_i$. In an $m$-dimensional vector space, the number of uniformly distributed instances, $N$, in a hypercube with volume $V$ can be computed by $N \cong V/\alpha^{m}$, where $\alpha$ is the spacing between two adjacent instances. Accordingly, we set

$$\delta_i = \frac{R(\mathbf{s}_i)\sqrt{\pi}}{\sqrt[m]{(k+1)\,\Gamma((m/2)+1)}}. \tag{A.5}$$

Finally, with Equations A.4 and A.5 incorporated into Equation A.1, we obtain an approximate probability density function of the form shown in Equation 1 in the main text.
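The substitution behind this last step can be spelled out as follows. This is a worked restatement that uses only Equations A.4 and A.5 and the definition β = σ_i/δ_i; raising Equation A.5 to the m-th power gives the first line, and inserting it into Equation A.4 gives the coefficient that appears in Equation 1.

```latex
\delta_i^{m} = \frac{R(\mathbf{s}_i)^{m}\,\pi^{m/2}}{(k+1)\,\Gamma((m/2)+1)}
\quad\Longrightarrow\quad
w_i = \frac{(k+1)\,\Gamma((m/2)+1)}{\lambda^{m}\,n\,R(\mathbf{s}_i)^{m}\,\pi^{m/2}}
    = \frac{1}{n\,\lambda^{m}\,\delta_i^{m}}
    = \frac{1}{n}\left(\frac{\beta}{\lambda\,\sigma_i}\right)^{m},
\qquad \text{since } \delta_i = \frac{\sigma_i}{\beta}.
```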

APPENDIX B

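Let $q(y) = \sum_{h=-\infty}^{\infty}\exp\left(-\frac{(y-h\delta)^{2}}{2\sigma^{2}}\right)$, where $\delta$ and $\sigma$ are two coefficients and $y$ is a real number. We have

$$q'(y) = \frac{dq(y)}{dy} = -\frac{1}{\sigma^{2}}\sum_{h=-\infty}^{\infty}(y-h\delta)\exp\left(-\frac{(y-h\delta)^{2}}{2\sigma^{2}}\right).$$

Since $q(y)$ is a symmetric and periodic function, if we want to find the global maximum and minimum values of $q(y)$, we only need to analyze $q(y)$ within the interval $[0, \delta/2]$. Let $y_0 \in [0, \delta/2]$ and $y_0 = (\delta/2)(j/n) + \varepsilon$, where $n > 1$ and $0 \le j \le n-1$ are integers, and $0 < \varepsilon < \delta/(2n)$. We have

$$q(y_0) = q\left(\frac{j\delta}{2n}\right) + \int_{j\delta/2n}^{(j\delta/2n)+\varepsilon} q'(t)\,dt.$$

Let us consider the special case with $\sigma = \delta$. Then, we have

$$q(y_0) = \sum_{h=-\infty}^{\infty}\left[\exp\left(-\frac{1}{2}\left(\frac{j}{2n}-h\right)^{2}\right) - \frac{1}{\sigma^{2}}\int_{j\delta/2n}^{(j\delta/2n)+\varepsilon}(t-h\delta)\exp\left(-\frac{(t-h\delta)^{2}}{2\sigma^{2}}\right)dt\right].$$

Let $r(h) = -\frac{1}{\sigma^{2}}\int_{j\delta/2n}^{(j\delta/2n)+\varepsilon}(t-h\delta)\exp\left(-\frac{(t-h\delta)^{2}}{2\sigma^{2}}\right)dt$. Since $-\frac{1}{\sigma^{2}}(t-h\delta)\exp\left(-\frac{(t-h\delta)^{2}}{2\sigma^{2}}\right)$ is a decreasing function for $t \in [(h-1)\delta, (h+1)\delta]$ and is an increasing function for $t \notin [(h-1)\delta, (h+1)\delta]$, we have

(i) $r(0) < -\varepsilon\,\dfrac{1}{\sigma^{2}}\,\dfrac{j\delta}{2n}\exp\left[-\dfrac{1}{2\sigma^{2}}\left(\dfrac{j\delta}{2n}\right)^{2}\right] = -\varepsilon\,\dfrac{1}{\sigma}\,\dfrac{j}{2n}\exp\left[-\dfrac{1}{2}\left(\dfrac{j}{2n}\right)^{2}\right]$;

(ii) $r(1) < -\varepsilon\,\dfrac{1}{\sigma^{2}}\left(\dfrac{j\delta}{2n}-\delta\right)\exp\left[-\dfrac{1}{2\sigma^{2}}\left(\dfrac{j\delta}{2n}-\delta\right)^{2}\right] = -\varepsilon\,\dfrac{1}{\sigma}\left(\dfrac{j}{2n}-1\right)\exp\left[-\dfrac{1}{2}\left(\dfrac{j}{2n}-1\right)^{2}\right]$;

(iii) for $h \ne 0$ and $h \ne 1$, $r(h) < -\varepsilon\,\dfrac{1}{\sigma^{2}}\left(\dfrac{(j+1)\delta}{2n}-h\delta\right)\exp\left[-\dfrac{1}{2\sigma^{2}}\left(\dfrac{(j+1)\delta}{2n}-h\delta\right)^{2}\right] = -\varepsilon\,\dfrac{1}{\sigma}\left(\dfrac{j+1}{2n}-h\right)\exp\left[-\dfrac{1}{2}\left(\dfrac{j+1}{2n}-h\right)^{2}\right]$.

Therefore,

$$q(y_0) = \sum_{h=-\infty}^{\infty}\left[\exp\left(-\frac{1}{2}\left(\frac{j}{2n}-h\right)^{2}\right) + r(h)\right] < \sum_{h=-\infty}^{\infty}\exp\left(-\frac{1}{2}\left(\frac{j}{2n}-h\right)^{2}\right) + \varepsilon\theta,$$

where

$$\theta = -\frac{1}{\sigma}\,\frac{j}{2n}\exp\left[-\frac{1}{2}\left(\frac{j}{2n}\right)^{2}\right] - \frac{1}{\sigma}\left(\frac{j}{2n}-1\right)\exp\left[-\frac{1}{2}\left(\frac{j}{2n}-1\right)^{2}\right] - \frac{1}{\sigma}\sum_{\substack{h=-\infty\\ h\ne 0,1}}^{\infty}\left(\frac{j+1}{2n}-h\right)\exp\left[-\frac{1}{2}\left(\frac{j+1}{2n}-h\right)^{2}\right].$$

If $\theta > 0$, then we have, for any $0 < \varepsilon < \delta/(2n)$,

$$\sum_{h=-\infty}^{\infty}\exp\left(-\frac{1}{2}\left(\frac{j}{2n}-h\right)^{2}\right) + \varepsilon\theta < \sum_{h=-\infty}^{\infty}\exp\left(-\frac{1}{2}\left(\frac{j}{2n}-h\right)^{2}\right) + \frac{\delta}{2n}\theta. \tag{B.1}$$

On the other hand, if $\theta < 0$, then we have, for any $0 < \varepsilon < \delta/(2n)$,

$$\sum_{h=-\infty}^{\infty}\exp\left(-\frac{1}{2}\left(\frac{j}{2n}-h\right)^{2}\right) + \varepsilon\theta < \sum_{h=-\infty}^{\infty}\exp\left(-\frac{1}{2}\left(\frac{j}{2n}-h\right)^{2}\right). \tag{B.2}$$

Combining Equations B.1 and B.2, we obtain, for all $y \in [0, \delta/2]$,

$$q(y) < \lim_{n\to\infty}\ \max_{0\le j\le n-1}\left\{\sum_{h=-\infty}^{\infty}\exp\left(-\frac{1}{2}\left(\frac{j}{2n}-h\right)^{2}\right),\ \sum_{h=-\infty}^{\infty}\exp\left(-\frac{1}{2}\left(\frac{j}{2n}-h\right)^{2}\right) + \frac{\delta}{2n}\theta\right\}.$$

Similarly, we can show that

$$q(y) > \lim_{n\to\infty}\ \min_{0\le j\le n-1}\left\{\sum_{h=-\infty}^{\infty}\exp\left(-\frac{1}{2}\left(\frac{j}{2n}-h\right)^{2}\right),\ \sum_{h=-\infty}^{\infty}\exp\left(-\frac{1}{2}\left(\frac{j}{2n}-h\right)^{2}\right) + \frac{\delta}{2n}\rho\right\},$$

where

$$\rho = -\frac{1}{\sigma}\,\frac{j+1}{2n}\exp\left[-\frac{1}{2}\left(\frac{j+1}{2n}\right)^{2}\right] - \frac{1}{\sigma}\left(\frac{j+1}{2n}-1\right)\exp\left[-\frac{1}{2}\left(\frac{j+1}{2n}-1\right)^{2}\right] - \frac{1}{\sigma}\sum_{\substack{h=-\infty\\ h\ne 0,1}}^{\infty}\left(\frac{j}{2n}-h\right)\exp\left[-\frac{1}{2}\left(\frac{j}{2n}-h\right)^{2}\right].$$

If we set $n = 100\,000$, then we have, with $\sigma = \delta$,

$$2.506628261 < q(y) < 2.506628288, \quad\text{for } y \in [0, \delta/2].$$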
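The near-constancy of q(y) for σ = δ can also be inspected by direct summation. The sketch below is a numerical illustration, not a substitute for the argument above: it truncates the infinite sum at a hypothetical cutoff `terms`, samples y on a grid over [0, δ/2] with δ = σ = 1, and prints the smallest and largest values found next to √(2π).

```python
import math

def q(y, delta=1.0, sigma=1.0, terms=60):
    """Truncated q(y) = sum over integers h of exp(-(y - h*delta)^2 / (2*sigma^2))."""
    return sum(math.exp(-((y - h * delta) ** 2) / (2.0 * sigma ** 2))
               for h in range(-terms, terms + 1))

# Sample y on [0, delta/2]; by symmetry and periodicity this covers all y.
samples = [q(0.5 * t / 1000) for t in range(1001)]
lowest, highest = min(samples), max(samples)

print(f"min q(y)   = {lowest:.12f}")
print(f"max q(y)   = {highest:.12f}")
print(f"sqrt(2*pi) = {math.sqrt(2.0 * math.pi):.12f}")
```

In this sketch both extremes lie within about 1.4 × 10⁻⁸ of √(2π), in line with the bounds quoted above to the precision at which they are stated.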