BLAST+ user manual






Introduction

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families [more...].

Programs available for the BLAST search

Programs available on ExPASy
blastp
compares an amino acid query sequence against a protein sequence database
blastn
compares a nucleotide query sequence against a nucleotide sequence database
tblastn
compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
blastx
compares a nucleotide query sequence translated in all reading frames against a protein sequence database
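
The four programs above differ only in the type of the query and the type of the database. As an illustration, this can be sketched as a small lookup table. The function name and the type labels 'protein' and 'nucleotide' are chosen for this sketch and are not part of the ExPASy service.

```python
# Query type and database type mapped to the program, as listed above.
PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "nucleotide"): "tblastn",
    ("nucleotide", "protein"): "blastx",
}


def choose_program(query_type: str, database_type: str) -> str:
    """Return the BLAST program for a query/database type pair."""
    try:
        return PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"unsupported combination: {query_type!r} query, {database_type!r} database"
        )


print(choose_program("protein", "nucleotide"))  # tblastn
```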

Other programs
blastp with different algorithms: PSI-BLAST, PHI-BLAST and DELTA-BLAST
compare an amino acid query sequence against a protein sequence database. Available at the NCBI.
blastn with different algorithms: megablast and discontiguous megablast
compare a nucleotide query sequence against a nucleotide sequence database. Available at the NCBI.
tblastx
compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Available at the NCBI.

Enter a sequence

Enter a query protein or nucleotide sequence into the text area. Accepted input: e.g. P00750, P05067-5, A4_HUMAN or acccgtggtcgctgctg...

The format and the nature of the input (protein or nucleotide) are determined automatically.
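
How the server performs this detection is not documented here. The sketch below is only one plausible heuristic, given as an assumption: it recognises UniProtKB accessions (with an optional isoform suffix) and identifiers by pattern, and treats any remaining input made only of the letters A, C, G, T, U and N as nucleotide. The function name and the returned labels are hypothetical.

```python
import re

# UniProtKB accession pattern, with an optional isoform suffix such as "-5".
ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})(?:-\d+)?$"
)
# UniProtKB identifier (entry name), e.g. A4_HUMAN.
IDENTIFIER = re.compile(r"^[A-Z0-9]{1,10}_[A-Z0-9]{1,5}$")
# Assumption: input using only these letters is treated as nucleotide.
NUCLEOTIDE_LETTERS = set("ACGTUN")


def classify_input(text: str) -> str:
    """Guess whether the input is an accession, an identifier or a bare sequence."""
    value = "".join(text.split()).upper()
    if not value:
        raise ValueError("empty input")
    if ACCESSION.match(value):
        return "accession"
    if IDENTIFIER.match(value):
        return "identifier"
    if set(value) <= NUCLEOTIDE_LETTERS:
        return "nucleotide"
    return "protein"


for example in ("P00750", "P05067-5", "A4_HUMAN", "acccgtggtcgctgctg"):
    print(example, classify_input(example))
```

A short peptide made only of A, C, G and T would be misclassified by this heuristic, which is why it should be read as an illustration rather than as the rule the service applies.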

Choose a database

Protein databases

UniProt Knowledgebase

The UniProt Knowledgebase (UniProtKB) is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation. UniProtKB is composed of two sections:
Reviewed (Swiss-Prot) - Manually annotated
Records with information extracted from literature and curator-evaluated computational analysis.
Unreviewed (TrEMBL) - Computationally analyzed
Records that await full manual annotation.
In addition to capturing the core data mandatory for each UniProtKB entry (mainly, the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added [more...].

N.B. Since UniProtKB contains a huge number of sequences, if you are interested in a particular taxon that is listed under 'UniProtKB taxonomic subsets', you are strongly advised to select it from that menu. You can also greatly reduce the search space by running BLAST against UniProtKB/Swiss-Prot only instead of the whole UniProtKB (UniProtKB/Swiss-Prot + UniProtKB/TrEMBL). This applies whether you choose UniProtKB 'Complete database', 'Proteomes', 'Reference proteomes' or any of the 'UniProtKB taxonomic subsets'.
Complete database
UniProtKB/Swiss-Prot + UniProtKB/TrEMBL
Proteomes
Proteomes are sets of proteins thought to be expressed by organisms whose genomes have been completely sequenced [more...].
Reference proteomes
Reference proteomes are 'Complete proteomes' which have been selected as reference proteomes. Reference proteomes are both manually defined and algorithmically selected according to a number of criteria [more...].
Notes:
  • For database option 'Complete database', you may filter the displayed results to a particular taxon by entering a species name, a TaxID or the Latin name of a taxonomic group (elements of the UniProtKB OC, OS and OX lines). If you wish to enter more than one term, separate them with an exclamation mark, e.g. 'Fungi! Homo sapiens' or '4751! 9606'.
    This filter is applied only if the selected output format is 'html' (default).
    WARNING: This is post-processing of the results: the BLAST is performed on 'Complete database', and only results fulfilling the taxonomic criteria you have entered are shown. This will decrease your hits and statistically bias your results.
    ADVICE: If the taxon you're interested in is in the 'UniProtKB taxonomic subsets' select menu, we strongly suggest you use that list instead.
  • For any of the options 'Complete database', 'Proteomes' or 'Reference proteomes', if the checkbox 'UniProtKB/Swiss-Prot (manually annotated) only' is checked, only UniProtKB/Swiss-Prot sequences will be considered.
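
As an illustration of the taxonomic filter described in the notes above, the following sketch joins several terms with an exclamation mark and then mimics the post-processing step on a list of hits. The hit structure (a dict with 'id' and 'lineage' keys) and both function names are hypothetical; the actual filter runs on the server.

```python
def tax_filter_value(terms):
    """Join species names, TaxIDs or group names into one filter string."""
    cleaned = [str(term).strip() for term in terms if str(term).strip()]
    return "! ".join(cleaned)


def filter_hits(hits, terms):
    """Keep hits whose lineage contains at least one of the terms.

    This only mimics the post-processing: the search itself has already
    been run against the complete database.
    """
    wanted = {str(term).strip().lower() for term in terms}
    return [
        hit for hit in hits
        if wanted & {str(entry).lower() for entry in hit["lineage"]}
    ]


print(tax_filter_value(["Fungi", "Homo sapiens"]))  # Fungi! Homo sapiens
print(tax_filter_value([4751, 9606]))               # 4751! 9606
```

Because the hits are discarded after the search, the E-values of the remaining hits still refer to the complete database, which is the bias mentioned in the warning.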

UniProtKB taxonomic subsets

Each option constitutes a taxonomic subset of UniProtKB. If the checkbox 'UniProtKB/Swiss-Prot (manually annotated) only' is checked, only UniProtKB/Swiss-Prot sequences will be considered.

Prokaryotic proteomes

The HAMAP proteomes are non-redundant sets of all the proteins from selected complete genome sequencing projects, compiled from UniProtKB.

Other databases
UniRef100, UniRef90 and UniRef50
The UniProt Reference Clusters (UniRef) provide clustered sets of sequences from the UniProt Knowledgebase (including isoforms) and selected UniParc records in order to obtain complete coverage of the sequence space at several resolutions while hiding redundant sequences (but not their descriptions) from view. UniRef100 contains all UniProt Knowledgebase records plus selected UniParc records. In UniRef100, all identical sequences and subfragments with 11 or more residues are placed into a single record. UniRef50 and UniRef90 are built based on UniRef100 [more...].
PDB
The Protein Data Bank (PDB) is a database of protein 3D structures [more...]. Sequences extracted from the PDB SEQRES lines are processed into a non-redundant set where identical sequences are merged into single records.

Nucleotide databases

Sequence databases

The European Nucleotide Archive (ENA, formerly EMBL-Bank) constitutes the world's primary nucleotide sequence resource. Main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications [more...].
ENA records are broken down into different 'Data classes', e.g. 'Annotation' and 'Reads', and each of these data classes can be further broken down into 'Data domains' such as 'HTC' and 'HTG' for 'Annotation' [more...].
ENA
Standard annotated assembled sequences
HTG
High throughput assembled genomic sequences with optional annotation
EST
Raw expressed sequence tag sequence data (no qualities) and sample/library information
Unigene - EST
Database of EST clusters (lists of ESTs known to match the same cDNA) from the NCBI (updated occasionally). This database also contains useful information such as Sequence Tagged Site (STS) matches, tissue distribution and transcript maps [more...].
GSS
Genome survey sequences; single pass, single direction sequences
STS
Sequence tagged sites
Patents
Sequence associated with a patent process

Taxonomic groups

Each option represents a subset of the ENA Standard annotated assembled sequences (STD) database. Each subset is taxonomy-based except for the last one, 'Synthetic'.
Eukaryota
Sequences belonging to superkingdom 'Eukaryota'
Vertebrata
Sequences belonging to 'Vertebrata'
Mammalia
Sequences belonging to class 'Mammalia'
Homo sapiens
Sequences belonging to species 'Homo sapiens'
Rodentia
Sequences belonging to order 'Rodentia'
Other mammalia
Sequences belonging to class 'Mammalia' but not species 'Homo sapiens', nor order 'Rodentia'
Other vertebrata
Sequences belonging to 'Vertebrata' but not to the 'Mammalia' class
Invertebrata
Sequences belonging to superkingdom 'Eukaryota' but not to 'Vertebrata'
Fungi
Sequences belonging to kingdom 'Fungi'
Viridiplantae
Sequences belonging to kingdom 'Viridiplantae'
Prokaryota
Sequences belonging to superkingdom 'Bacteria' or superkingdom 'Archaea'
Bacteriophages
Sequences belonging to different groups of superkingdom 'Viruses'
Viruses
Sequences belonging to superkingdom 'Viruses'
Unclassified
Sequences not belonging to superkingdom 'Eukaryota', 'Bacteria', 'Archaea' or 'Viruses'
Synthetic
Sequences corresponding to synthetic molecules and constructs [more...]

Microbial genomes

One of the selected genomes released in the form of a complete, assembled sequence.

Select options

Comparison matrix

The comparison matrix assigns a score to each position in an alignment. In the BLOSUM matrices, this score is based on the frequency with which that substitution is known to occur among consensus blocks within related proteins.
BLOSUM62 is among the best of the available matrices for detecting weak protein similarities. The PAM set of matrices is also available.
If the "Auto-select" option is selected (default), the matrix will be selected depending on the query sequence length, based on the following (empirically constructed) table:

Default values
  Query length    Substitution matrix
  <35             PAM-30
  35-50           PAM-70
  50-85           BLOSUM-80
  >85             BLOSUM-62

Note: If the input sequence is shorter than 30 amino acids and the blast program is blastp, the matrix is automatically set to be PAM-30 [more...].
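
The auto-select table can be sketched as a simple function of the query length. The table leaves the boundaries at 50 and 85 ambiguous; this sketch assumes that a length of exactly 50 falls in the 35-50 row and a length of exactly 85 in the 50-85 row. The function name is hypothetical and the matrix names are written as in the 'matrix' parameter values.

```python
def auto_select_matrix(query_length: int) -> str:
    """Pick a substitution matrix from the query length (auto-select table)."""
    if query_length < 1:
        raise ValueError("query length must be positive")
    if query_length < 35:
        return "PAM30"
    if query_length <= 50:   # assumption: 50 belongs to the 35-50 row
        return "PAM70"
    if query_length <= 85:   # assumption: 85 belongs to the 50-85 row
        return "BLOSUM80"
    return "BLOSUM62"


for length in (20, 40, 70, 300):
    print(length, auto_select_matrix(length))
```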


E-value threshold

Default value: 10
The expectation value (E) threshold is a statistical measure of the number of expected matches in a random database. The lower the E-value, the more likely the match is to be significant. E-values between 0.1 and 10 are generally dubious, and those over 10 are unlikely to have biological significance. In all cases, matches need to be verified manually. You may need to increase the E threshold in the following cases:
  • if you have a very short query sequence
  • to detect very weak similarities, or similarities in a short region
  • if your sequence has a low complexity region and you use the masking option ('Filter low complexity regions').
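
The rule of thumb given above for reading E-values can be sketched as follows. The labels, the function names and the representation of a hit as an (identifier, E-value) pair are chosen for this illustration only. Whether a hit exactly at the threshold is kept is also an assumption.

```python
def describe_evalue(evalue: float) -> str:
    """Rough reading of an E-value, following the ranges given above."""
    if evalue < 0:
        raise ValueError("an E-value cannot be negative")
    if evalue < 0.1:
        return "more likely to be significant"
    if evalue <= 10:
        return "dubious"
    return "unlikely to be significant"


def apply_threshold(hits, threshold=10.0):
    """Keep (identifier, evalue) pairs whose E-value does not exceed the threshold."""
    return [(name, evalue) for name, evalue in hits if evalue <= threshold]


hits = [("hit1", 1e-30), ("hit2", 0.5), ("hit3", 42.0)]
for name, evalue in apply_threshold(hits):
    print(name, evalue, describe_evalue(evalue))
```

In every case the matches still need to be checked manually, as stated above.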


Filter low complexity regions

Default value: yes
'yes' translates to '-seg yes', 'no' to ''

Low-complexity regions (e.g. stretches of cysteine in CSP_DROME (Q03751) or hydrophobic regions in membrane proteins) tend to produce spurious, insignificant matches with sequences in the database which have the same kind of low-complexity regions, but are unrelated biologically.
If this option is checked, the query sequence will be run through the program SEG, and all amino acids in low-complexity regions will be replaced by X's which will appear in the alignment. The masked regions will also be visible as slashed regions in the PaintBlast image.

Note: If the input sequence is shorter than 30 amino acids and the blast program is blastp, this parameter is automatically set to 'no' [more...].
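
The effect of the masking on the query can be illustrated with the sketch below. It does not implement SEG: the low-complexity regions are assumed to be already known and are passed in as 1-based, inclusive (start, end) pairs. The function name and the example sequence are hypothetical.

```python
def mask_regions(sequence: str, regions) -> str:
    """Replace the residues of each (start, end) region with X.

    Positions are 1-based and inclusive. Finding the regions is the job of
    SEG and is not attempted here.
    """
    residues = list(sequence)
    for start, end in regions:
        if not 1 <= start <= end <= len(residues):
            raise ValueError(f"region {start}-{end} is outside the sequence")
        residues[start - 1:end] = "X" * (end - start + 1)
    return "".join(residues)


print(mask_regions("MKCCCCCCAL", [(3, 8)]))  # MKXXXXXXAL
```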

Allow introduction of gaps in alignments

Default value: yes
'yes' translates to '', 'no' to '-comp_based_stats F -ungapped'

This will allow gaps to be introduced in the sequences when the comparison is done.

Note: If the input sequence is shorter than 30 amino acids and the blast program is blastp, this parameter is automatically transformed to '-comp_based_stats 0' [more...].
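
The way the two yes/no options translate into command-line options, as described above, can be sketched like this. The function and argument names are hypothetical. The short-sequence override for blastp is deliberately left out of this sketch.

```python
def option_flags(filter_low_complexity: bool = True, gapped: bool = True):
    """Translate the two form options into command-line options."""
    flags = []
    if filter_low_complexity:
        # 'yes' translates to '-seg yes'; 'no' adds nothing.
        flags += ["-seg", "yes"]
    if not gapped:
        # 'yes' adds nothing; 'no' translates to '-comp_based_stats F -ungapped'.
        flags += ["-comp_based_stats", "F", "-ungapped"]
    return flags


print(" ".join(option_flags()))              # -seg yes
print(" ".join(option_flags(False, False)))  # -comp_based_stats F -ungapped
```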

blastp and short sequences

For comparison of amino acid sequences shorter than 30 amino acids against protein sequence databases, some input parameters are automatically overridden to optimize the search, following The BLAST® Command Line Applications User Manual (table 2, "blastp-short" task):
  • The chosen matrix is set to PAM30 (-matrix PAM30).
  • The 'Filter low complexity regions' option is set to 'no'.
  • The 'Gapped alignments' parameter is set to '-comp_based_stats 0'.
Additional parameters that cannot be tuned using the BLAST form on ExPASy are added, resulting in the following parameters:
-word_size 2 -gapopen 9 -gapextend 1 -matrix PAM30 -threshold 16 -comp_based_stats 0 -window_size 15
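
The override for short queries can be sketched as follows. Below the 30-residue limit the user's choices are ignored and the fixed list above is returned. Otherwise the options are built from the form values. The function name, the argument names and the handling of longer queries are assumptions made for this sketch.

```python
SHORT_QUERY_LIMIT = 30  # queries shorter than this trigger the override

# Fixed options used for short blastp queries, as listed above.
BLASTP_SHORT_FLAGS = [
    "-word_size", "2", "-gapopen", "9", "-gapextend", "1",
    "-matrix", "PAM30", "-threshold", "16",
    "-comp_based_stats", "0", "-window_size", "15",
]


def blastp_flags(query_length, matrix="BLOSUM62",
                 filter_low_complexity=True, gapped=True):
    """Return blastp options, overriding the form values for short queries."""
    if query_length < SHORT_QUERY_LIMIT:
        return list(BLASTP_SHORT_FLAGS)
    flags = ["-matrix", matrix]
    if filter_low_complexity:
        flags += ["-seg", "yes"]
    if not gapped:
        flags += ["-comp_based_stats", "F", "-ungapped"]
    return flags


print(" ".join(blastp_flags(12, matrix="BLOSUM80")))
print(" ".join(blastp_flags(250)))
```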


Programmatic access

Examples

  1. http://web.expasy.org/cgi-bin/blast/blast.pl?seq=P00750&prot_db1=UniProtKB&curated=on&ethr=10&Gap=T&matrix=auto&Filter=T&showal=100&showsc=100&format=html
    blastp

  2. http://web.expasy.org/cgi-bin/blast/blast.pl?seq=P00750&curated=on
    blastp - generates the same blast command as the one in the previous example.

  3. http://web.expasy.org/cgi-bin/blast/blast.pl?seq=P00750&curated=on&Tax=9989! 9605
    blastp - generates the same blast command as the one in the previous example; the results are then filtered to keep only matches belonging to taxa 'Rodentia' or 'Homo sapiens'.
    Note that this filter is applied only if prot_db1=UniProtKB ('UniProtKB - Complete database' in the form, default) and 'format=html' (default).


  4. http://web.expasy.org/cgi-bin/blast/blast.pl?seq=ACACG...&nt_db1=embl&format=txt
    blastn

  5. http://web.expasy.org/cgi-bin/blast/blast.pl?seq=A4_HUMAN&nt_db2=rod&format=xml
    tblastn

  6. http://web.expasy.org/cgi-bin/blast/blast.pl?seq=ACACG...&prot_db4=pdbaa&matrix=PAM30&ethr=1&Filter=F&Gap=F&showsc=50&showal=50
    blastx
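
URLs such as the ones above can be assembled programmatically. The sketch below uses only the Python standard library and builds the query string from the sequence and any further parameters. The function name is hypothetical, and the script address is copied from the examples.

```python
from urllib.parse import urlencode

# Address of the BLAST script, as used in the examples above.
BASE_URL = "http://web.expasy.org/cgi-bin/blast/blast.pl"


def blast_url(seq: str, **params) -> str:
    """Build a BLAST request URL from the sequence and optional parameters."""
    query = {"seq": seq}
    query.update(params)
    return BASE_URL + "?" + urlencode(query)


print(blast_url("P00750", curated="on"))
print(blast_url("A4_HUMAN", nt_db2="rod", format="xml"))
```

Note that urlencode percent-encodes characters such as the exclamation mark and the space of a 'Tax' value. This sketch assumes that the server decodes them in the usual way.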

Enter a sequence

seq=  Input sequence
  P00750                              UniProtKB accession
  A4_HUMAN                            UniProtKB identifier
  ACACGGTCATCGCGCGCCTGCGCAAGGAG...    bare protein or nucleotide sequence; for programmatic access, remove the FASTA header if any


Choose a database

prot_db1=  UniProt Knowledgebase (UniProtKB)
  UniProtKB              Complete database (default)
  Complete_proteomes     Proteomes
  Reference_proteomes    Reference proteomes
Tax=  Taxonomic filter
  only applied if prot_db1=UniProtKB (default) and format=html (default)
  Fungi! Homo sapiens    Fungi or Homo sapiens
  4751! 9606             Fungi or Homo sapiens


prot_db2=  UniProtKB taxonomic subsets
  Archaea          Archaea
  Bacteria         Bacteria
  Eukaryota        Eukaryota
  Viruses          Viruses
  Arthropoda       Arthropoda
  Fungi            Fungi
  Mammalia         Mammalia
  Metazoa          Metazoa
  Primates         Primates
  Rodentia         Rodentia
  Vertebrata       Vertebrata
  Viridiplantae    Viridiplantae
  ARATH            Arabidopsis thaliana
  CAEEL            Caenorhabditis elegans
  DICDI            Dictyostelium discoideum
  DROME            Drosophila melanogaster
  ECOLI            Escherichia coli
  HUMAN            Homo sapiens
  MOUSE            Mus musculus
  PLAFA            Plasmodium falciparum
  RAT              Rattus norvegicus
  YEAST            Saccharomyces cerevisiae
  SCHPO            Schizosaccharomyces pombe


curated=  UniProtKB/Swiss-Prot (manually annotated) only
  on    UniProtKB/Swiss-Prot (manually annotated) only
  Only relevant if the target database is UniProtKB (prot_db1 or prot_db2)


prot_db3=  Prokaryotic proteomes
  OS_code    OS_code OS_name
  ACEP3      ACEP3 Acetobacter pasteurianus (strain NBRC 3283 / LMG 1513 / CCTM 1153)

prot_db4=  Other database
  UniRef100    UniRef100
  UniRef90     UniRef90
  UniRef50     UniRef50
  pdbaa        PDB


nt_db1=  Sequence databases
  embl           ENA
  htg            HTG
  est            EST
  unigene_est    Unigene EST
  gss            GSS
  sts            STS
  pat            Patents

nt_db2=  ENA taxonomic subsets
  euk        Eukaryota
  all_vrt    Vertebrata
  all_mam    Mammalia
  hum        Homo sapiens
  rod        Rodentia
  mam        Other mammalia
  vrt        Other vertebrata
  inv        Invertebrata
  fun        Fungi
  pln        Viridiplantae
  pro        Prokaryota
  vrl        Viruses
  phg        Bacteriophages
  unc        Unclassified
  syn        Synthetic

nt_db3=  Prokaryotic genomes
  OS_code    OS_code OS_name
  NANEQ      NANEQ Nanoarchaeum equitans (Strain Kin4-M)


Select options

matrix=  Comparison matrix
  auto        Auto-select (default)
  BLOSUM62    BLOSUM62
  BLOSUM45    BLOSUM45
  BLOSUM80    BLOSUM80
  PAM30       PAM30
  PAM70       PAM70

ethr=  E-value
  0.0001    0.0001
  0.001     0.001
  0.01      0.01
  0.1       0.1
  1         1
  10        10 (default)
  100       100
  1000      1000
  10000     10000

Filter=  Filter low complexity regions
  T    yes (translates to '-seg yes', default)
  F    no (translates to '')

Gap=  Gapped alignments
  T    yes (translates to '', default)
  F    no (translates to '-comp_based_stats F -ungapped')

showsc=  Number of best scoring sequences to show
  50      50
  100     100 (default)
  250     250
  1000    1000
  3000    3000

showal=  Number of best alignments to show
  50      50
  100     100 (default)
  250     250
  1000    1000

format=  Output format
  html    HTML (default)
  txt     Plain text
  xml     XML
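
Before sending a request, the option values listed above can be checked on the client side. The sketch below records the accepted values and the defaults from this section and rejects anything else. The function name is hypothetical, and the server may well behave differently when given an unknown value.

```python
# Accepted values per option, as listed in this section.
ALLOWED = {
    "matrix": {"auto", "BLOSUM62", "BLOSUM45", "BLOSUM80", "PAM30", "PAM70"},
    "ethr": {"0.0001", "0.001", "0.01", "0.1", "1", "10", "100", "1000", "10000"},
    "Filter": {"T", "F"},
    "Gap": {"T", "F"},
    "showsc": {"50", "100", "250", "1000", "3000"},
    "showal": {"50", "100", "250", "1000"},
    "format": {"html", "txt", "xml"},
}

# Default value of each option, as listed in this section.
DEFAULTS = {
    "matrix": "auto", "ethr": "10", "Filter": "T", "Gap": "T",
    "showsc": "100", "showal": "100", "format": "html",
}


def with_defaults(options):
    """Validate the given options and fill in the defaults for the others."""
    merged = dict(DEFAULTS)
    for name, value in options.items():
        value = str(value)
        if name not in ALLOWED:
            raise ValueError(f"unknown option: {name}")
        if value not in ALLOWED[name]:
            raise ValueError(f"value {value!r} is not accepted for {name}")
        merged[name] = value
    return merged


print(with_defaults({"ethr": 1, "format": "xml"}))
```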