BLAST+ user manual
Introduction
The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences.
The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.
BLAST can be used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families [more...].
Programs available for the BLAST search
Programs available on ExPASy
- blastp
-
compares an amino acid query sequence against a protein sequence database
- blastn
-
compares a nucleotide query sequence against a nucleotide sequence database
- tblastn
-
compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
- blastx
-
compares a nucleotide query sequence translated in all reading frames against a protein sequence database
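The choice among the four programs above follows directly from the type of the query and the type of the target database. A minimal sketch of that mapping is given here; the function name and the 'protein'/'nucleotide' labels are illustrative assumptions and are not part of the ExPASy service.

```python
# Program names are taken from the list above; the labels "protein" and
# "nucleotide" and the function name are hypothetical, for illustration only.
PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "nucleotide"): "tblastn",
    ("nucleotide", "protein"): "blastx",
}


def choose_program(query_type: str, database_type: str) -> str:
    """Return the BLAST program for a (query type, database type) pair."""
    try:
        return PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"unsupported combination: {query_type!r} vs {database_type!r}"
        ) from None


print(choose_program("protein", "nucleotide"))  # tblastn
```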
Other programs
- blastp with different algorithms: PSI-BLAST, PHI-BLAST and DELTA-BLAST
-
compare an amino acid query sequence against a protein sequence database.
Available at the NCBI.
- blastn with different algorithms: megablast and discontiguous megablast
-
compare a nucleotide query sequence against a nucleotide sequence database.
Available at the NCBI.
- tblastx
-
compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
Available at the NCBI.
Enter a sequence
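Enter a query protein or nucleotide sequence into the text area. Accepted input: a UniProtKB accession (e.g. P00750), an isoform identifier (e.g. P05067-5), a UniProtKB identifier (e.g. A4_HUMAN) or a bare sequence (e.g. acccgtggtcgctgctg...).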
The format and the nature of the input (protein or nucleotide) are determined automatically.
Choose a database
Protein databases
UniProt Knowledgebase
The UniProt Knowledgebase (UniProtKB) is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation.
UniProtKB is composed of two sections:
- Reviewed (Swiss-Prot) - Manually annotated
-
Records with information extracted from literature and curator-evaluated computational analysis.
- Unreviewed (TrEMBL) - Computationally analyzed
-
Records that await full manual annotation.
In addition to capturing the core data mandatory for each UniProtKB entry (mainly the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added [more...].
N.B. UniProtKB contains a huge number of sequences. If you are interested in a particular taxon that is listed under 'UniProtKB taxonomic subsets', you are strongly advised to select it from that menu.
You can also greatly reduce the search space by running the BLAST against UniProtKB/Swiss-Prot only instead of the whole UniProtKB (UniProtKB/Swiss-Prot + UniProtKB/TrEMBL).
This applies whether you choose UniProtKB 'Complete database', 'Proteomes', 'Reference proteomes' or any of the 'UniProtKB taxonomic subsets'.
- Complete database
-
UniProtKB/Swiss-Prot + UniProtKB/TrEMBL
- Proteomes
-
Proteomes are sets of proteins thought to be expressed by organisms whose genomes have been completely sequenced
[more...].
- Reference proteomes
-
Reference proteomes are 'Complete proteomes' which have been selected as reference proteomes.
Reference proteomes are both manually defined and algorithmically selected according to a number of criteria [more...].
Notes:
-
For database option 'Complete database',
you may filter the displayed results to a particular taxon by entering a species name, a TaxID or the Latin name of a taxonomic group (elements of the UniProtKB OC, OS and OX lines).
If you wish to enter more than one term, separate them with an exclamation mark e.g. 'Fungi! Homo sapiens' or '4751! 9606'.
This filter is applied only if the selected output format is 'html' (default).
WARNING: This is post-processing of the results: the BLAST is performed on 'Complete database', and only results fulfilling the taxonomic criteria you have entered are shown.
This will decrease your hits and statistically bias your results.
ADVICE: If the taxon you're interested in is in the 'UniProtKB taxonomic subsets' select menu, we strongly suggest you use that list instead.
- For any of the options 'Complete database', 'Proteomes' or 'Reference proteomes',
if the checkbox 'UniProtKB/Swiss-Prot (manually annotated) only' is checked, only UniProtKB/Swiss-Prot sequences will be considered.
UniProtKB taxonomic subsets
Each option constitutes a taxonomic subset of UniProtKB. If the checkbox 'UniProtKB/Swiss-Prot (manually annotated) only' is checked, only UniProtKB/Swiss-Prot sequences will be considered.
Prokaryotic proteomes
The HAMAP proteomes are non-redundant sets of all the proteins from selected complete genome sequencing projects, compiled from UniProtKB.
Other databases
- UniRef100, UniRef90 and UniRef50
-
The UniProt Reference Clusters (UniRef) provide clustered sets of sequences from the UniProt Knowledgebase (including isoforms) and
selected UniParc records in order to obtain complete coverage of the sequence space at several resolutions while hiding redundant sequences (but not their descriptions) from view.
UniRef100 contains all UniProt Knowledgebase records plus selected UniParc records. In UniRef100, all identical sequences and subfragments with 11 or more residues are placed into a single record.
UniRef50 and UniRef90 are built based on UniRef100 [more...].
- PDB
-
The Protein Data Bank (PDB) is a database of protein 3D structures [more...].
Sequences extracted from the PDB SEQRES lines are processed into a non-redundant set where identical sequences are merged into single records.
Nucleotide databases
Sequence databases
The European Nucleotide Archive (ENA, formerly EMBL-Bank) constitutes the world's primary nucleotide sequence resource.
The main sources of DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications [more...].
ENA records are broken down into different 'Data classes', e.g. 'Annotation' and 'Reads', and each of these data classes can be further broken down into 'Data domains' such as 'HTC' and 'HTG' for 'Annotation' [more...].
- ENA
-
Standard annotated assembled sequences
- HTG
-
High throughput assembled genomic sequences with optional annotation
- EST
-
Raw expressed sequence tag sequence data (no qualities) and sample/library information
- Unigene - EST
-
Database of EST clusters (lists of ESTs known to match the same cDNA) from the NCBI (updated occasionally).
This database also contains useful information such as Sequence Tagged Site (STS) matches, tissue distribution or transcript map [more...].
- GSS
-
Genome survey sequences; single pass, single direction sequences
- STS
-
Sequence tagged sites
- Patents
-
Sequence associated with a patent process
Taxonomic groups
Each option represents a subset of the ENA Standard annotated assembled sequences (STD) database.
Each subset is taxonomy-based except for the last one, 'Synthetic'.
- Eukaryota
- Sequences belonging to superkingdom 'Eukaryota'
- Vertebrata
- Sequences belonging to 'Vertebrata'
- Mammalia
- Sequences belonging to class 'Mammalia'
- Homo sapiens
- Sequences belonging to species 'Homo sapiens'
- Rodentia
- Sequences belonging to order 'Rodentia'
- Other mammalia
- Sequences belonging to class 'Mammalia' but not species 'Homo sapiens', nor order 'Rodentia'
- Other vertebrata
- Sequences belonging to 'Vertebrata' but not to the 'Mammalia' class
- Invertebrata
- Sequences belonging to superkingdom 'Eukaryota' but not to 'Vertebrata'
- Fungi
- Sequences belonging to kingdom 'Fungi'
- Viridiplantae
- Sequences belonging to kingdom 'Viridiplantae'
- Prokaryota
- Sequences belonging to superkingdom 'Bacteria' or superkingdom 'Archaea'
- Bacteriophages
- Sequences belonging to different groups of superkingdom 'Viruses'
- Viruses
- Sequences belonging to superkingdom 'Viruses'
- Unclassified
- Sequences not belonging to superkingdom 'Eukaryota', 'Bacteria', 'Archaea' or 'Viruses'
- Synthetic
- Sequences corresponding to synthetic molecules and constructs [more...]
Microbial genomes
One of the selected genomes released in the form of a complete, assembled sequence.
Select options
Comparison matrix
The comparison (substitution) matrix assigns a score to each pair of aligned residues in an alignment.
The BLOSUM matrices derive these scores from the frequency with which each substitution is known to occur among consensus blocks within related proteins.
BLOSUM62 is among the best of the available matrices for detecting weak protein similarities. The PAM set of matrices is also available.
If the "Auto-select" option is selected (default), the matrix will be selected depending on the query sequence length, based on the following (empirically constructed) table:
Default values
Query length | Substitution matrix |
<35 | PAM-30 |
35-50 | PAM-70 |
50-85 | BLOSUM-80 |
>85 | BLOSUM-62 |
Note: If the input sequence is shorter than 30 amino acids and the blast program is blastp, the matrix is automatically set to be PAM-30 [more...].
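The auto-selection table above can be read as a simple lookup on the query length. The sketch below is illustrative only: the function name is hypothetical, the matrix names use the spelling of the matrix= parameter values, and because the table lists 50 and 85 in two adjacent ranges, the sketch assumes that each upper bound belongs to the lower range.

```python
def auto_select_matrix(query_length: int) -> str:
    """Pick a substitution matrix from the query length, per the table above.

    Assumption: the boundary lengths 50 and 85 fall in the lower range
    (35-50 and 50-85 respectively); the table does not settle this.
    """
    if query_length < 1:
        raise ValueError("query length must be a positive integer")
    if query_length < 35:
        return "PAM30"
    if query_length <= 50:
        return "PAM70"
    if query_length <= 85:
        return "BLOSUM80"
    return "BLOSUM62"


for length in (20, 40, 70, 300):
    print(length, auto_select_matrix(length))
```

Under this reading, a 20-residue query gets PAM30, which agrees with the note on blastp queries shorter than 30 amino acids.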
E-value threshold
Default value: 10
The expectation value (E) threshold is a statistical measure of the number of expected matches in a random database.
The lower the E-value, the more likely the match is to be significant.
E-values between 0.1 and 10 are generally dubious, and over 10 are unlikely to have biological significance.
In all cases, those matches need to be verified manually. You may need to increase the E threshold in the following cases:
- if you have a very short query sequence
- to detect very weak similarities, or similarities in a short region
- if your sequence has a low complexity region and you use the masking option ('Filter low complexity regions').
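The rule of thumb given above for reading E-values can be sketched as a small helper. This is an illustration, not part of BLAST: the function name and the returned labels are hypothetical, and the sketch assumes that the boundary values 0.1 and 10 themselves count as 'dubious'.

```python
def describe_evalue(evalue: float) -> str:
    """Label an E-value using the rule of thumb from the text above.

    Assumption: 0.1 and 10 are treated as part of the 'dubious' band.
    Whatever the label, matches still need to be verified manually.
    """
    if evalue < 0:
        raise ValueError("an E-value cannot be negative")
    if evalue > 10:
        return "unlikely to have biological significance"
    if evalue >= 0.1:
        return "dubious"
    return "more likely to be significant"


for e in (1e-30, 0.5, 50):
    print(e, "->", describe_evalue(e))
```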
Filter low complexity regions
Default value: yes
'yes' translates to '-seg yes', 'no' to ''
Low-complexity regions, e.g. stretches of cysteine in CSP_DROME (Q03751) or hydrophobic regions in membrane proteins, tend to produce spurious, insignificant matches with database sequences that have the same kind of low-complexity regions but are biologically unrelated.
If this option is checked, the query sequence will be run through the program SEG, and all amino acids in low-complexity regions will be replaced by X's, which will appear in the alignment. The masked regions will also be visible as slashed regions in the PaintBlast image.
Note: If the input sequence is shorter than 30 amino acids and the blast program is blastp, this parameter is automatically set to 'no' [more...].
Allow introduction of gaps in alignments
Default value: yes
'yes' translates to '', 'no' to '-comp_based_stats F -ungapped'
This will allow gaps to be introduced in the sequences when the comparison is done.
Note: If the input sequence is shorter than 30 amino acids and the blast program is blastp, this parameter is automatically transformed to '-comp_based_stats 0' [more...].
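The way the two form options 'Filter low complexity regions' and 'Allow introduction of gaps in alignments' translate into command-line flags can be sketched as follows. The flag strings are the ones quoted in this manual; the function name and its boolean arguments are hypothetical, and the special handling of short blastp queries is deliberately left out.

```python
from typing import List


def option_flags(filter_low_complexity: bool = True, gapped: bool = True) -> List[str]:
    """Translate the two form options into flags, as quoted in this manual.

    Defaults mirror the form defaults (both 'yes'). Short blastp queries,
    which override these settings, are not handled in this sketch.
    """
    flags: List[str] = []
    if filter_low_complexity:
        flags += ["-seg", "yes"]  # 'yes' translates to '-seg yes'
    if not gapped:
        flags += ["-comp_based_stats", "F", "-ungapped"]  # 'no' for gaps
    return flags


print(option_flags())                    # ['-seg', 'yes']
print(option_flags(False, False))        # ['-comp_based_stats', 'F', '-ungapped']
```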
blastp and short sequences
For comparison of amino acid sequences shorter than 30 amino acids against protein sequence databases, some input parameters are automatically overridden to optimize the search, following The BLAST® Command Line Applications User Manual (table 2, "blastp-short" task):
- The chosen matrix is set to PAM30 (-matrix PAM30).
- The 'Filter low complexity regions' option is set to 'no'.
- The 'Gapped alignments' parameter is set to '-comp_based_stats 0'.
Additional parameters that cannot be tuned using the BLAST form on ExPASy are added, to end up with the following parameters:
-word_size 2 -gapopen 9 -gapextend 1 -matrix PAM30 -threshold 16 -comp_based_stats 0 -window_size 15
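As an illustration of the override described above, the sketch below returns the short-query parameter set when the program is blastp and the query is shorter than 30 amino acids. The flag values are copied from the line above; the constant and function names are hypothetical, and how the server actually assembles its command is not documented here.

```python
from typing import List

SHORT_QUERY_LIMIT = 30  # queries shorter than this trigger the override

# Flags copied verbatim from the parameter line in this manual.
BLASTP_SHORT_FLAGS = [
    "-word_size", "2",
    "-gapopen", "9",
    "-gapextend", "1",
    "-matrix", "PAM30",
    "-threshold", "16",
    "-comp_based_stats", "0",
    "-window_size", "15",
]


def short_query_flags(program: str, query_length: int) -> List[str]:
    """Return the override flags for short blastp queries, else an empty list."""
    if program == "blastp" and query_length < SHORT_QUERY_LIMIT:
        return list(BLASTP_SHORT_FLAGS)  # copy, so callers cannot mutate the constant
    return []


print(" ".join(short_query_flags("blastp", 12)))
```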
Programmatic access
Examples
-
https://web.expasy.org/cgi-bin/blast/blast.pl?seq=P00750&prot_db1=UniProtKB&curated=on&ethr=10&Gap=T&matrix=auto&Filter=T&showal=100&showsc=100&format=html
blastp
-
https://web.expasy.org/cgi-bin/blast/blast.pl?seq=P00750&curated=on
blastp - generates the same blast command as the one in the previous example.
-
https://web.expasy.org/cgi-bin/blast/blast.pl?seq=P00750&curated=on&Tax=9989! 9605
blastp - generates the same blast command as the one in the previous example; the results are then filtered to keep only matches belonging to the taxa 'Rodentia' or 'Homo sapiens'.
Note that this filter is applied only if prot_db1=UniProtKB ('UniProtKB - Complete database' in the form, default) and 'format=html' (default).
-
https://web.expasy.org/cgi-bin/blast/blast.pl?seq=ACACG...&nt_db1=embl&format=txt
blastn
-
https://web.expasy.org/cgi-bin/blast/blast.pl?seq=A4_HUMAN&nt_db2=rod&format=xml
tblastn
-
https://web.expasy.org/cgi-bin/blast/blast.pl?seq=ACACG...&prot_db4=pdbaa&matrix=PAM30&ethr=1&Filter=F&Gap=F&showsc=50&showal=50
blastx
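URLs like the examples above can be assembled programmatically rather than by hand. The following is a minimal sketch using only the Python standard library: the base URL and the parameter names are those documented in this manual, while the function name and the check against a list of known parameter names are assumptions made for illustration. The sketch only builds the URL and does not send a request; values such as the 'Tax' filter are percent-encoded, which the server is assumed to decode in the usual way.

```python
from urllib.parse import urlencode

BASE_URL = "https://web.expasy.org/cgi-bin/blast/blast.pl"

# Parameter names documented in this manual.
KNOWN_PARAMS = {
    "seq", "prot_db1", "prot_db2", "prot_db3", "prot_db4",
    "nt_db1", "nt_db2", "nt_db3", "curated", "Tax",
    "matrix", "ethr", "Filter", "Gap", "showsc", "showal", "format",
}


def build_blast_url(seq: str, **params) -> str:
    """Build a query URL for the BLAST form; unknown parameter names are rejected."""
    unknown = set(params) - KNOWN_PARAMS
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
    query = {"seq": seq}
    query.update(params)
    return BASE_URL + "?" + urlencode(query)


url = build_blast_url("P00750", prot_db1="UniProtKB", curated="on", ethr=10, format="html")
print(url)
# https://web.expasy.org/cgi-bin/blast/blast.pl?seq=P00750&prot_db1=UniProtKB&curated=on&ethr=10&format=html
```

Rejecting unknown names is a design choice of this sketch: it catches typos such as a missing underscore in 'prot_db1' before a request is made.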
Enter a sequence
seq= | Input sequence |
---|
P00750 | UniProtKB accession |
A4_HUMAN | UniProtKB identifier |
ACACGGTCATCGCGCGCCTGCGCAAGGAG... | bare protein or nucleotide sequence (for programmatic access, remove the FASTA header, if any) |
Choose a database
prot_db1= | UniProt Knowledgebase (UniProtKB) |
---|
UniProtKB | Complete database (default) |
Complete_proteomes | Proteomes |
Reference_proteomes | Reference proteomes |
Tax= | Taxonomic filter, only applied if prot_db1=UniProtKB (default) and format=html (default) |
---|
Fungi! Homo sapiens | Fungi or Homo sapiens |
4751! 9606 | Fungi or Homo sapiens |
prot_db2= | UniProtKB taxonomic subsets |
Archaea | Archaea |
Bacteria | Bacteria |
Eukaryota | Eukaryota |
Viruses | Viruses |
Arthropoda | Arthropoda |
Fungi | Fungi |
Mammalia | Mammalia |
Metazoa | Metazoa |
Primates | Primates |
Rodentia | Rodentia |
Vertebrata | Vertebrata |
Viridiplantae | Viridiplantae |
ARATH | Arabidopsis thaliana |
CAEEL | Caenorhabditis elegans |
DICDI | Dictyostelium discoideum |
DROME | Drosophila melanogaster |
ECOLI | Escherichia coli |
HUMAN | Homo sapiens |
MOUSE | Mus musculus |
PLAFA | Plasmodium falciparum |
RAT | Rattus norvegicus |
YEAST | Saccharomyces cerevisiae |
SCHPO | Schizosaccharomyces pombe |
curated= | UniProtKB/Swiss-Prot (manually annotated) only |
---|
on | UniProtKB/Swiss-Prot (manually annotated) only. Only relevant if the target database is UniProtKB (prot_db1 or prot_db2) |
prot_db3= | Prokaryotic proteomes |
OS_code | OS_code OS_name |
ACEP3 | ACEP3 Acetobacter pasteurianus (strain NBRC 3283 / LMG 1513 / CCTM 1153) |
prot_db4= | Other database |
UniRef100 | UniRef100 |
UniRef90 | UniRef90 |
UniRef50 | UniRef50 |
pdbaa | PDB |
nt_db1= | Sequence databases |
embl | ENA |
htg | HTG |
est | EST |
unigene_est | Unigene EST |
gss | GSS |
sts | STS |
pat | Patents |
nt_db2= | ENA taxonomic subsets |
euk | Eukaryota |
all_vrt | Vertebrata |
all_mam | Mammalia |
hum | Homo sapiens |
rod | Rodentia |
mam | Other mammalia |
vrt | Other vertebrata |
inv | Invertebrata |
fun | Fungi |
pln | Viridiplantae |
pro | Prokaryota |
vrl | Viruses |
phg | Bacteriophages |
unc | Unclassified |
syn | Synthetic |
nt_db3= | Prokaryotic genomes |
OS_code | OS_code OS_name |
NANEQ | NANEQ Nanoarchaeum equitans (strain Kin4-M) |
Select options
matrix= | Comparison matrix |
auto | Auto-select (default) |
BLOSUM62 | BLOSUM62 |
BLOSUM45 | BLOSUM45 |
BLOSUM80 | BLOSUM80 |
PAM30 | PAM30 |
PAM70 | PAM70 |
ethr= | E-value |
0.0001 | 0.0001 |
0.001 | 0.001 |
0.01 | 0.01 |
0.1 | 0.1 |
1 | 1 |
10 | 10 (default) |
100 | 100 |
1000 | 1000 |
10000 | 10000 |
Filter= | Filter low complexity regions |
T | yes (translates to '-seg yes', default) |
F | no (translates to '') |
Gap= | Gapped alignments |
T | yes (translates to '', default) |
F | no (translates to '-comp_based_stats F -ungapped') |
showsc= | Number of best scoring sequences to show |
50 | 50 |
100 | 100 (default) |
250 | 250 |
1000 | 1000 |
3000 | 3000 |
showal= | Number of best alignments to show |
50 | 50 |
100 | 100 (default) |
250 | 250 |
1000 | 1000 |
format= | Output format |
html | HTML (default) |
txt | Plain text |
xml | XML |