Sequence Homology


Academic year: 2022

(1)

BLAST@NCBI

Group members: 林哲賢 謝友恆 李沂芳 黃堂榮 林資皓

(2)

Outline

• A brief introduction to the various kinds of BLAST

• Different sequences: introduction to NCBI and the FASTA format

• Web version of BLAST

• BLAST on a Linux system

• An application of BLAST in bioengineering

(3)

A Brief Introduction to the Various Kinds of BLAST

R05921040 Yu-Heng Hsieh

(4)

Sequence Homology

• Definition:

• Shared ancestry in evolutionary history of life

• Biological homology between DNA or protein sequences

• How do we detect sequence homology?

• Two homologous sequences tend to be similar

• Sequence similarity!!!

(5)

Sequence Similarity

• Global vs Local

• A dynamic programming method (Needleman & Wunsch, 1970)

• High computational complexity

• Impractical for searching large databases
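To make the cost concrete, here is a minimal sketch of the global alignment score computed by dynamic programming in the style of Needleman and Wunsch. The scoring values (match, mismatch, gap) are illustrative assumptions, not values from the slides. The table has one cell per pair of positions, so time and memory grow with the product of the two sequence lengths, which is why this approach does not scale to database searches.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b.

    The scoring parameters are assumed for illustration only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]
```

For a query of length m and a database of total length n, this fills m × n cells, so searching hundreds of billions of bases this way is not practical.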

(6)

Objective

• Find sequence homology between species

• DNA and amino acid sequence databases

• A database contains known gene sequences

• Hundreds of millions of sequences and hundreds of billions of bases

• Will be introduced later

• With databases of this size, an efficient tool is needed to find sequence homology

(7)

BLAST algorithm

• Maximal Segment Pair (MSP):

• The highest-scoring pair of identical-length segments chosen from 2 sequences.

• In other words, the most similar part of 2 sequences.

• Local Maximal Segment Pair:

• One may be interested not only in the single most similar part, but in all similar regions.

• A segment pair is a local MSP if its score cannot be improved either by extending or by shortening both segments

• BLAST searches for all local MSPs that score above a cutoff
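As an illustration of the MSP definition, the sketch below finds the highest-scoring ungapped segment pair exhaustively: it walks every diagonal of the comparison and keeps the best-scoring run (a maximum-subarray scan). This is not how BLAST itself searches; it is the slow reference computation that BLAST approximates. The function name and the match/mismatch scores are assumptions made for the example.

```python
def maximal_segment_pair(a, b, match=5, mismatch=-4):
    """Return (score, start_a, start_b, length) of the best ungapped segment pair.

    Scores are assumed values for illustration.
    """
    best = (0, 0, 0, 0)
    # Diagonal d = j - i pairs a[i] with b[i + d].
    for d in range(-(len(a) - 1), len(b)):
        i = max(0, -d)
        j = i + d
        run_score, run_start = 0, i
        while i < len(a) and j < len(b):
            if run_score <= 0:
                # A non-positive prefix can never help: restart the run here.
                run_score, run_start = 0, i
            run_score += match if a[i] == b[j] else mismatch
            if run_score > best[0]:
                best = (run_score, run_start, run_start + d, i - run_start + 1)
            i += 1
            j += 1
    return best


# Example with made-up sequences: the shared segment is "ACGT".
result = maximal_segment_pair("GGGACGTCC", "TTACGTAA")
```

Here `result` is `(20, 3, 2, 4)`: four matches at 5 points each, starting at offset 3 in the first sequence and offset 2 in the second.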

(8)

Algorithm steps

1. Find the list of interesting words

2. Find all word matches with score > T

3. Extend these words to find MSPs
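The three steps can be sketched on a toy DNA example. Everything here is a simplification under stated assumptions: the alphabet, the match/mismatch scores, the word length `w`, the drop-off limit `X`, and treating a word as interesting when it scores at or above `T` are all choices made for this sketch, not parameters taken from the slides or from the BLAST source code.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed toy alphabet


def pair_score(x, y, match=5, mismatch=-4):
    return match if x == y else mismatch


def word_score(u, v):
    return sum(pair_score(x, y) for x, y in zip(u, v))


def neighborhood(query, w, T):
    """Step 1: map every word scoring >= T against a query word to query offsets."""
    words = {}
    for q in range(len(query) - w + 1):
        qword = query[q:q + w]
        for cand in map("".join, product(ALPHABET, repeat=w)):
            if word_score(qword, cand) >= T:
                words.setdefault(cand, []).append(q)
    return words


def find_hits(words, subject, w):
    """Step 2: scan the subject and report (query_pos, subject_pos) word hits."""
    hits = []
    for s in range(len(subject) - w + 1):
        for q in words.get(subject[s:s + w], []):
            hits.append((q, s))
    return hits


def extend(query, subject, q, s, w, X=10):
    """Step 3: ungapped extension; stop once the score falls X below the best seen.

    Returns (score, query_start, subject_start, length).
    """
    score = word_score(query[q:q + w], subject[s:s + w])
    start, end = q, q + w  # best segment on the query, half-open
    best, run, i, j = score, score, q + w, s + w
    while i < len(query) and j < len(subject) and best - run < X:
        run += pair_score(query[i], subject[j])
        i += 1
        j += 1
        if run > best:
            best, end = run, i
    run, i, j = best, q - 1, s - 1
    while i >= 0 and j >= 0 and best - run < X:
        run += pair_score(query[i], subject[j])
        if run > best:
            best, start = run, i
        i -= 1
        j -= 1
    return best, start, start - q + s, end - start


query, subject, w = "GGGACGTCC", "TTACGTAA", 3
strict_hits = find_hits(neighborhood(query, w, T=15), subject, w)  # exact words only
loose_hits = find_hits(neighborhood(query, w, T=6), subject, w)    # one mismatch allowed
segments = [extend(query, subject, q, s, w) for q, s in strict_hits]
```

With these made-up sequences, lowering `T` from 15 to 6 produces more word hits to extend, while both strict hits extend to the same segment.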

(9)

Analysis of BLAST

• A parameter T controls the trade-off between speed and sensitivity

• A higher value of T increases the speed but also increases the probability of missing weak similarities

• What is the bottleneck of the BLAST algorithm?

• The extension step.

• How about a lower T value, but a stricter extension rule?

• That’s what Gapped BLAST does.

(10)

Gapped BLAST

• Lower the T value to get more hits in phase 1

• However, only extend word hits that lie on the same diagonal and within a given distance of each other

• Since fewer hits have to be extended in this step, the running time decreases significantly (up to a 3x speedup)

• However, the resulting subsequence alignments may become insignificant due to the low T
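The same-diagonal rule can be sketched as a filter over word hits. A hit at query position q and subject position s lies on diagonal s - q; a hit is kept as a seed only if an earlier, non-overlapping hit sits on the same diagonal within a window. The function name, the window size `A`, and the handling of overlapping hits are assumptions made for this sketch.

```python
from collections import defaultdict


def two_hit_seeds(hits, w, A=40):
    """Keep hits preceded by a non-overlapping hit on the same diagonal within A.

    hits is a list of (query_pos, subject_pos) pairs; w is the word length.
    A is an assumed window size for illustration.
    """
    by_diagonal = defaultdict(list)
    for q, s in hits:
        by_diagonal[s - q].append(q)

    seeds = []
    for d, positions in by_diagonal.items():
        positions.sort()
        last = None  # query position of the most recent accepted hit
        for q in positions:
            if last is not None:
                gap = q - last
                if gap < w:
                    continue  # overlaps the previous hit: ignore it
                if gap <= A:
                    seeds.append((q, q + d))  # second hit of a pair: extend here
            last = q
    return seeds


# Made-up hits: only (10, 15) has a close partner on its diagonal.
example_hits = [(0, 5), (10, 15), (11, 16), (100, 105), (3, 50)]
seeds = two_hit_seeds(example_hits, w=3)
```

Only one of the five hits survives the filter, which shows why the extension step has far less work to do even when a lower T yields more raw hits.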

(11)

Gapped BLAST (continued)

• To make the resulting subsequence alignments more significant, we have to increase T

• Change the extension rule to a dynamic programming method that explores a region near both ends of the hit, allowing gaps.

(12)

PSI-BLAST

• Motif search

• Search for motifs in the sequence

• More sensitive than pairwise comparison methods at detecting distant relationships

• However, it typically needs substantial user intervention when running.

• PSI-BLAST automates this process!!!

• It modifies BLAST to generate a position-specific score matrix at each iteration and uses it as the input for the next iteration.
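A position-specific score matrix can be illustrated with a much-simplified sketch: given a set of already aligned, equal-length sequences, count the letters in each column, add pseudocounts, and store log-odds scores against a background frequency. PSI-BLAST works on proteins and uses more elaborate sequence weighting; the DNA alphabet, uniform background, pseudocount value, and function names below are assumptions for illustration only.

```python
import math


def build_pssm(aligned, alphabet="ACGT", pseudocount=1.0):
    """Return a list of per-column dicts mapping each letter to a log-odds score.

    Assumes all sequences in `aligned` have the same length and contain no gaps.
    """
    background = 1.0 / len(alphabet)  # assumed uniform background
    pssm = []
    for pos in range(len(aligned[0])):
        column = [seq[pos] for seq in aligned]
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for letter in alphabet:
            freq = (column.count(letter) + pseudocount) / total
            scores[letter] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_with_pssm(pssm, segment):
    """Score a segment of the same length as the matrix."""
    return sum(column[letter] for column, letter in zip(pssm, segment))


# Made-up alignment: columns 1-3 are conserved, column 4 varies.
pssm = build_pssm(["ACGT", "ACGA", "ACGT"])
```

In an iterative search, the matrix built from one round's significant hits replaces the fixed substitution scores in the next round, so conserved positions weigh more than variable ones.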

(13)

Different Sequences: Introduction to NCBI and the FASTA Format

R09549010 李沂芳

(14)

NCBI

• National Center for Biotechnology Information

• houses a series of databases relevant to biotechnology

• important resource for bioinformatics tools and services

• DNA sequence database GenBank (with EMBL in Europe and DDBJ in Japan)

(15)

NCBI

(16)

Search for Sequence

(17)
(18)
(19)

FASTA

• text format for amino acid and nucleic acid sequences

• begins with a single-line description

• followed by lines of sequence data

• the description line starts with a “>” symbol

• a bar “|” separates the different fields

(20)

FASTA format

gb|M73307|AGMA13GT

gb: tag indicating the record is from GenBank

M73307: GenBank accession number

AGMA13GT: GenBank locus
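A minimal sketch of reading this format: split the text into records at lines starting with ">", join the sequence lines, and split the identifier on "|". The function names and the sequence data in the example are made up for illustration; only the `gb|M73307|AGMA13GT` identifier comes from the slide.

```python
def parse_fasta(text):
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)  # sequence data may span several lines
    if description is not None:
        yield description, "".join(chunks)


def split_header(description):
    """Split the identifier (first word of the description) on '|'."""
    parts = description.split()
    return parts[0].split("|") if parts else []


# The sequence lines below are invented placeholder data.
example = ">gb|M73307|AGMA13GT example record\nACGT\nTTGA\n>seq2\nGGCC\n"
records = list(parse_fasta(example))
fields = split_header(records[0][0])
```

For the first record, `fields` holds the database tag, the accession number, and the locus, in that order.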

(21)

FASTA

(22)

Web version BLAST

R05921043 林哲賢

(23)
(24)

Step 1

(25)

Step 2

(26)

Step 3

(27)

Step 4

(28)

Step 5

(29)

Step 6

(30)

Other resources

• NCBI API

• Image on cloud server

(31)

BLAST on a Linux System

R05945018 林資皓

(32)

BLAST on Linux

• Commands (query → database):

• blastn: nucleotide → nucleotide

• blastx: nucleotide → protein (the query is translated)

• tblastn: protein → nucleotide (the database is translated)

• blastp: protein → protein

(33)

Example -- blastn

• -db: database (use “makeblastdb” to create your own database)

• -query: input file (FASTA)

• -out: output file

• -outfmt: 0~11 (selects the output format)

• -evalue: E-value threshold (e.g. 1e-100)

• -perc_identity: minimum percent identity (float value)

• -max_target_seqs: maximum number of aligned sequences to keep

• -num_threads: number of threads (integer)

(34)

Example -- blastn

• blastn -db blast_db/rna_refseq_human/refseq_rna -query trinity_out_dir/trinity_len_523_upper.fa -out blast_out_len_523 -evalue 1e-100 -num_threads 8 -max_target_seqs 1 -perc_identity 100.0 -outfmt 6
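When the search is driven from a script, the same command line can be assembled as an argument list. The sketch below only builds the list from the options shown on this slide; the helper names are hypothetical, and actually running it assumes that `blastn` is installed and that the database and query paths exist.

```python
import subprocess


def build_blastn_command(db, query, out, evalue=1e-100, num_threads=8,
                         max_target_seqs=1, perc_identity=100.0, outfmt=6):
    """Return the blastn invocation as a list of arguments (hypothetical helper)."""
    return [
        "blastn",
        "-db", db,
        "-query", query,
        "-out", out,
        "-evalue", str(evalue),
        "-num_threads", str(num_threads),
        "-max_target_seqs", str(max_target_seqs),
        "-perc_identity", str(perc_identity),
        "-outfmt", str(outfmt),
    ]


def run_blastn(**options):
    """Run blastn; requires BLAST+ on the PATH. Raises if blastn exits with an error."""
    subprocess.run(build_blastn_command(**options), check=True)


command = build_blastn_command(
    db="blast_db/rna_refseq_human/refseq_rna",
    query="trinity_out_dir/trinity_len_523_upper.fa",
    out="blast_out_len_523",
)
```

Passing a list rather than a single shell string avoids quoting problems when paths contain spaces.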

(35)

Example -- blastn

• Output (outfmt 6), tab-separated columns in order:

Query ID

Subject ID

Identity (%)

Alignment length

Mismatches

Gap opens

Query start & end

Subject start & end

E-value

Bit score
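The tabular output is easy to load for downstream filtering. This sketch parses the twelve default columns of outfmt 6 into named fields; the field names and the sample line are invented for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query_id: str
    subject_id: str
    identity: float   # percent identity
    length: int       # alignment length
    mismatches: int
    gap_opens: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    bit_score: float


def parse_outfmt6(text):
    """Parse tab-separated outfmt 6 text into a list of Hit records."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]), int(f[4]), int(f[5]),
                        int(f[6]), int(f[7]), int(f[8]), int(f[9]),
                        float(f[10]), float(f[11])))
    return hits


# Made-up example line, not real BLAST output.
sample = "query1\tsubject1\t100.000\t523\t0\t0\t1\t523\t11\t533\t0.0\t966\n"
hits = parse_outfmt6(sample)
```

Each record can then be filtered by `identity`, `evalue`, or `bit_score` without re-running the search.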

(36)

Example -- blastn

• Output (outfmt 0)

(37)

Let’s talk about HLA typing.

HLA typing: human leukocyte antigen typing

Reference:

Next-Generation Sequencing (NGS) HLA Typing: Beyond Allele Assignment, Pedro Cano et al., Abstracts / Human Immunology 77 (2016) 40–156

R05945037 Tang-Jung Huang

(38)

Aim:

To create a method to open the data collected by NGS to any kind of query.

• Biological information

Allele assignment

• Variation of HLA

- Located on Chr6

- Polygeny

- Genetic polymorphism

(39)

NGS:

HLA typing: a test of tissue compatibility between different people

(40)

Method:

BLAST is still one of the most robust and efficient sequence-matching and sequence-alignment methods.

(41)

Method:

1. Compile a database

2. Convert the sample format

3. Create a BLAST database

4. Run any BLAST query

Here the approach is reversed: build a database of sample sequences, against which we query for matches for particular sequences of interest.

Old-fashioned: the database is built with reference sequences, against which a sample sequence is queried for matches.

(42)

Discussion/Result:

• The dataset was collected only for typing purposes

• The BLAST output gave accurate information: the sequences returned carried the query polymorphism, which matched what is known about the association of these SNPs with HLA-C alleles.

(43)

Conclusion:

• NGS provides data that goes beyond the need for simple allele assignment.

• The method (BLAST) presented here provides

- a robust and reliable way to store this accumulated information

- a quick and simple way to query this database of sequence data

- an open method to ask any sequence question
