BLAST@NCBI
Team members: 林哲賢, 謝友恆, 李沂芳, 黃堂榮, 林資皓
Outline
• A brief introduction to the various kinds of BLAST
• Different Sequences: introduction to NCBI and FASTA format
• Web version BLAST
• BLAST on Linux system
• An application of BLAST in bioengineering
A Brief Introduction to the Various Kinds of BLAST
R05921040 Yu-Heng Hsieh
Sequence Homology
• Definition:
• Shared ancestry in evolutionary history of life
• Biological homology between DNA and protein sequences
• How do we detect sequence homology?
• Two homologous sequences tend to be similar
• Sequence similarity!!!
Sequence Similarity
• Global vs Local
• A dynamic programming method (Needleman & Wunsch, 1970)
• High computational complexity
• Impractical for searching large databases
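A minimal sketch of the global dynamic programming approach, to show where the cost comes from. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions made for illustration; the slides do not specify them. The table has one cell per pair of positions, so the work for a single pair of sequences of lengths m and n grows as m × n, which is what makes it impractical against a large database.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell takes the best of: align two characters, or open a gap
    # in either sequence.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    return score[-1][-1]
```

For example, `needleman_wunsch_score("ACGT", "AGT")` aligns three matching characters and one gap.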
Objective
• Find sequence homology between species
• DNA and amino acid sequence databases
• A database contains known gene sequences
• Hundreds of millions of sequences and hundreds of billions of bases
• Will be introduced later
• With databases of this size, an efficient tool is needed to find sequence homology
BLAST algorithm
• Maximal Segment Pair (MSP):
• The highest-scoring pair of identical-length segments chosen from 2 sequences.
• In other words, the most similar part of 2 sequences.
• Local Maximal Segment Pair:
• One may be interested not only in the most similar part, but in all similar regions.
• A segment pair is a local MSP if its score cannot be improved either by extending or by shortening both segments
• BLAST searches for all local MSPs with a score above a cutoff
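The MSP definition can be made concrete with a brute-force sketch: an ungapped segment pair always lies on one diagonal of the comparison matrix, so it is enough to find the best-scoring run on every diagonal. This is not how BLAST finds MSPs (BLAST uses the word heuristic precisely to avoid this exhaustive scan); the match/mismatch scores of +5/-4 and the function name are assumptions for illustration.

```python
def maximal_segment_pair(a, b, match=5, mismatch=-4):
    """Exhaustively find the best ungapped segment pair of a and b.

    Returns (score, start_in_a, start_in_b, length).
    Scores are illustrative assumptions.
    """
    best = (0, 0, 0, 0)
    # Each diagonal is identified by offset = (position in b) - (position in a).
    for offset in range(-(len(a) - 1), len(b)):
        i, j = (-offset, 0) if offset < 0 else (0, offset)
        run, start = 0, 0
        k = 0
        while i + k < len(a) and j + k < len(b):
            s = match if a[i + k] == b[j + k] else mismatch
            if run <= 0:
                # A non-positive prefix can never help: restart the run here.
                run, start = s, k
            else:
                run += s
            if run > best[0]:
                best = (run, i + start, j + start, k - start + 1)
            k += 1
    return best
```

With `a = "TTACGTACGTT"` and `b = "GGACGTACGAA"`, the shared stretch `ACGTACG` is reported as the maximal segment pair.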
Algorithm steps
1. Find the list of interesting words
2. Find all word matches with score > T
3. Extend these matches to find MSPs
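The three steps above can be sketched on DNA strings as follows. This is a simplified illustration under stated assumptions, not the NCBI implementation: words have length `w`, word pairs are scored with +5/-4 match/mismatch values, and the extension stops once the running score falls `x_drop` below the best score seen. All function names and parameter values are hypothetical.

```python
from itertools import product


def word_score(u, v, match=5, mismatch=-4):
    """Score two equal-length words position by position."""
    return sum(match if x == y else mismatch for x, y in zip(u, v))


def neighborhood_index(query, w, T, alphabet="ACGT"):
    """Step 1: map every word scoring > T against a query word to query positions."""
    index = {}
    for pos in range(len(query) - w + 1):
        qword = query[pos:pos + w]
        for letters in product(alphabet, repeat=w):
            cand = "".join(letters)
            if word_score(qword, cand) > T:
                index.setdefault(cand, []).append(pos)
    return index


def find_hits(index, subject, w):
    """Step 2: scan the subject and report (query_pos, subject_pos) word hits."""
    hits = []
    for spos in range(len(subject) - w + 1):
        for qpos in index.get(subject[spos:spos + w], []):
            hits.append((qpos, spos))
    return hits


def extend_hit(query, subject, qpos, spos, w, x_drop=10, match=5, mismatch=-4):
    """Step 3: ungapped extension of one hit in both directions.

    Returns (score, query_start, subject_start, length).
    """
    def pair(i, j):
        return match if query[i] == subject[j] else mismatch

    best = sum(pair(qpos + k, spos + k) for k in range(w))

    # Extend to the right of the word.
    run, right, k = best, 0, w
    while qpos + k < len(query) and spos + k < len(subject) and best - run < x_drop:
        run += pair(qpos + k, spos + k)
        k += 1
        if run > best:
            best, right = run, k - w

    # Extend to the left of the word.
    run, left, k = best, 0, 1
    while qpos - k >= 0 and spos - k >= 0 and best - run < x_drop:
        run += pair(qpos - k, spos - k)
        if run > best:
            best, left = run, k
        k += 1

    return best, qpos - left, spos - left, left + w + right
```

With `w = 3`, a threshold of `T = 14` admits only exact word matches (score 15), while `T = 5` also admits words with one mismatch (score 6), so lowering T produces more hits to extend.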
Analysis of BLAST
• A parameter T controls the trade-off between speed and sensitivity
• A higher value of T increases the speed but also increases the probability of missing weak similarities
• What is the bottleneck of the BLAST algorithm?
• The extension step.
• How about a lower T value, but a stricter extension rule?
• That’s what Gapped BLAST does.
Gapped BLAST
• Lower the T value to get more hits in phase 1
• However, only extend words that are on the same diagonal and within a given distance of each other
• Since fewer hits have to be extended in this step, the running time decreases significantly (up to 3x speedup)
• However, the resulting subsequence alignments may become insignificant due to the low T
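A rough sketch of the "same diagonal and within a distance" rule: a hit triggers an extension only if an earlier, non-overlapping hit lies on the same diagonal no more than `window` positions before it. The function name, the `(query_pos, subject_pos)` hit representation, and the parameter values are assumptions for illustration.

```python
def two_hit_seeds(hits, w, window):
    """Keep only hits that have a partner on the same diagonal.

    hits:   iterable of (query_pos, subject_pos) word hits
    w:      word length (hits closer than w overlap and are ignored)
    window: maximum distance between the two hits
    """
    last_on_diagonal = {}
    seeds = []
    for qpos, spos in sorted(hits, key=lambda h: h[1]):
        diagonal = spos - qpos
        previous = last_on_diagonal.get(diagonal)
        if previous is not None:
            distance = spos - previous
            if distance < w:
                # Overlaps the previous hit: skip without updating it.
                continue
            if distance <= window:
                seeds.append((qpos, spos))
        last_on_diagonal[diagonal] = spos
    return seeds
```

Only the returned seeds would be passed to the extension step, which is why far fewer extensions are needed even though the lower T produces more word hits.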
Gapped BLAST (continued)
• To make the resulting subsequence alignments more significant, we have to increase T
• Change the extension rule to a dynamic programming method that looks at the area near both ends of the hit.
PSI-BLAST
• Motif search
• Search for motifs in the sequence
• More sensitive than pairwise comparison methods at detecting distant relationships
• However, it typically needs substantial user intervention when running.
• PSI-BLAST automates this process!!!
• Modifies BLAST to generate a position-specific score matrix at each iteration, and uses it as the input for the next iteration.
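To illustrate what a position-specific score matrix is, here is a small sketch that builds log-odds scores from a set of equal-length, ungapped aligned sequences. It is a simplification: PSI-BLAST works on protein alignments and uses sequence weighting and more careful pseudocounts. The DNA alphabet, the uniform background frequencies, the pseudocount of 1, and the function names are all assumptions.

```python
import math


def build_pssm(aligned, alphabet="ACGT", pseudocount=1.0):
    """Return one {letter: log-odds score} dict per alignment column.

    aligned: list of equal-length strings without gap characters.
    """
    background = 1.0 / len(alphabet)  # assumed uniform background
    pssm = []
    for col in range(len(aligned[0])):
        counts = {c: pseudocount for c in alphabet}
        for seq in aligned:
            counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append(
            {c: math.log2((counts[c] / total) / background) for c in alphabet}
        )
    return pssm


def score_with_pssm(pssm, segment):
    """Score a segment by summing the per-position scores of its letters."""
    return sum(column[ch] for column, ch in zip(pssm, segment))
```

In an iterative search, the sequences found in one round would be aligned and turned into such a matrix, which then replaces the plain query when scoring candidates in the next round.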
Different Sequences:
Introduction to NCBI and FASTA format
R09549010 李沂芳
NCBI
• National Center for Biotechnology Information
• Houses a series of databases relevant to biotechnology
• An important resource for bioinformatics tools and services
• DNA sequence database GenBank (with EMBL in Europe and DDBJ in Japan)
NCBI
Search for Sequence
FASTA
• A text format for amino acid and nucleic acid sequences
• Begins with a single-line description
• Followed by lines of sequence data
• A “>” symbol at the beginning of the description line
• A bar “|” separates the different fields
FASTA format
gb|M73307|AGMA13GT
• gb: tag indicating the record is from GenBank
• M73307: GenBank accession number
• AGMA13GT: GenBank locus
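A small sketch of reading this format: records are split at lines starting with ">", and the identifier at the start of the description line is split on "|" into its fields. The function names are hypothetical, and any description text or sequence data passed in for testing is made up for illustration rather than taken from the GenBank record.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_header(description):
    """Split the identifier (first word of the description) on '|'."""
    identifier = description.split(None, 1)[0]
    return identifier.split("|")
```

For the header shown above, `split_header("gb|M73307|AGMA13GT")` yields the database tag, the accession number, and the locus as three separate fields.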
FASTA
Web version BLAST
R05921043 林哲賢
Step 1
Step 2
Step 3
Step 4
Step 5
Step 6
Other resources
• NCBI API
• Image on cloud server
BLAST on Linux system
R05945018 林資皓
BLAST on Linux
• Command:
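• blastn: nucleotide query against a nucleotide database
• blastx: translated nucleotide query against a protein database
• tblastn: protein query against a translated nucleotide database
• blastp: protein query against a protein database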
Example -- blastn
• -db: database (use “makeblastdb” to create your own database)
• -query: input file in FASTA format
• -out: output file
• -outfmt: 0~11 (selects the output format)
• -evalue: E-value threshold (e.g. 1e-100)
• -perc_identity: minimum percent identity (float value)
• -max_target_seqs: maximum number of target sequences to keep
• -num_threads: number of threads (integer)
Example -- blastn
• blastn -db blast_db/rna_refseq_human/refseq_rna -query trinity_out_dir/trinity_len_523_upper.fa -out blast_out_len_523 -evalue 1e-100 -num_threads 8 -max_target_seqs 1 -perc_identity 100.0 -outfmt 6
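When the search is part of a larger pipeline, the same command can be assembled and launched from a script. The sketch below assumes that the BLAST+ `blastn` executable is on the PATH; the function names are hypothetical, and the defaults simply mirror the values in the example command above.

```python
import subprocess


def build_blastn_command(db, query, out, evalue=1e-100, perc_identity=100.0,
                         max_target_seqs=1, num_threads=8, outfmt=6):
    """Assemble the blastn argument list using the flags listed above."""
    return [
        "blastn",
        "-db", db,
        "-query", query,
        "-out", out,
        "-evalue", str(evalue),
        "-num_threads", str(num_threads),
        "-max_target_seqs", str(max_target_seqs),
        "-perc_identity", str(perc_identity),
        "-outfmt", str(outfmt),
    ]


def run_blastn(command):
    """Run the command; raises CalledProcessError if blastn exits with an error."""
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file paths.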
Example -- blastn
• Output (outfmt 6)
Columns, in order:
• Query ID
• Subject ID
• Identity
• Alignment length
• Mismatches
• Gap opens
• Query start & end
• Subject start & end
• E-value
• Bit score
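The tabular output is tab-separated with one hit per line, so it is easy to load into a script. The sketch below uses the default column names of `-outfmt 6` (`qseqid`, `sseqid`, `pident`, `length`, `mismatch`, `gapopen`, `qstart`, `qend`, `sstart`, `send`, `evalue`, `bitscore`); the function name is hypothetical, and it assumes the default column set was not customized.

```python
# Default -outfmt 6 columns with the type each value is converted to.
FIELDS = [
    ("qseqid", str), ("sseqid", str), ("pident", float), ("length", int),
    ("mismatch", int), ("gapopen", int), ("qstart", int), ("qend", int),
    ("sstart", int), ("send", int), ("evalue", float), ("bitscore", float),
]


def parse_outfmt6_line(line):
    """Convert one tab-separated output line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    return {name: cast(value) for (name, cast), value in zip(FIELDS, values)}
```

Each parsed hit can then be filtered further, for example by `pident` or `evalue`.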
Example -- blastn
• Output (outfmt 0)
Let’s talk about HLA typing.
HLA typing: human leukocyte antigen typing
Reference:
Next-Generation Sequencing (NGS) HLA Typing: Beyond Allele Assignment, Pedro Cano et al., Abstracts / Human Immunology 77 (2016) 40–156
R05945037 Tang-Jung,Huang
Aim:
To create a method to open the data collected by NGS to any kind of query.
NGS:
• Biological information
• Allele assignment
HLA typing:
• A test of compatibility between tissues from different people
• Variation of HLA
-located on chromosome 6
-polygeny
-genetic polymorphism
Method:
BLAST is still one of the most robust and efficient sequence-matching and sequence-alignment methods.
Method:
1. Compile a database
2. Convert sample format
3. Create a BLAST database
4. Run any BLAST query
Here we reverse the approach: build a database of sample sequences against which we query for matches for particular sequences of interest.
Old-fashioned approach: the database is built with reference sequences, against which a sample sequence is queried for matches.
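The reversed approach could look roughly like this with the BLAST+ command-line tools: the sample reads become the database, and a sequence of interest becomes the query. This is a sketch under assumptions, not the procedure used in the referenced abstract: it assumes nucleotide data in FASTA files, uses `makeblastdb` and `blastn`, and all file names, database names, and function names are hypothetical.

```python
def build_sample_db_command(samples_fasta, db_name):
    """Command that turns a FASTA file of sample sequences into a BLAST database."""
    return ["makeblastdb", "-in", samples_fasta, "-dbtype", "nucl", "-out", db_name]


def build_probe_query_command(db_name, probe_fasta, out):
    """Command that queries the sample database with a sequence of interest."""
    return [
        "blastn",
        "-db", db_name,
        "-query", probe_fasta,
        "-out", out,
        "-outfmt", "6",
    ]


# Hypothetical file names, for illustration only.
make_db = build_sample_db_command("samples.fasta", "sample_db")
query_db = build_probe_query_command("sample_db", "snp_probe.fasta", "snp_hits.tsv")
```

Each command list can be passed to `subprocess.run(..., check=True)`. The database is built once, and any number of different query sequences can then be run against it.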
Discussion/Result:
• The dataset was collected only for typing purposes
• The BLAST output gave accurate information: the sequences returned carried the query polymorphism, which matched what is known about the association of these SNPs with HLA-C alleles.
Conclusion:
• NGS provides data that goes beyond the need for simple allele assignment.
• The method (BLAST) presented here provides
-a robust and reliable way to store this accumulated information
-a quick and simple way to query this database of sequence data
-an open method to ask any sequence question