
Help for the ExPASy BLAST Interface

Query sequence

Enter a query protein sequence in raw format (no fasta header, use one-letter amino acid codes) or a UniProt Knowledgebase (Swiss-Prot or TrEMBL) accession number.
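As a rough illustration of what "raw format" means, the sketch below strips a FASTA header and checks that only one-letter amino acid codes remain. The function name to_raw_sequence, the set of accepted ambiguity codes, and the silent removal of digits and whitespace are assumptions made for this example; they are not a description of how the ExPASy server parses its input.

```python
import re

# The 20 standard one-letter amino acid codes plus the ambiguity codes B, Z, X
# and the rarer U and O. Which extra codes the server accepts is an assumption.
VALID_CODES = set("ACDEFGHIKLMNPQRSTVWY" "BZXUO")


def to_raw_sequence(text: str) -> str:
    """Return a protein sequence in raw format (hypothetical helper).

    FASTA header lines (starting with '>') are dropped, whitespace and
    digits are removed, and the result is upper-cased.
    """
    lines = [ln for ln in text.splitlines() if not ln.startswith(">")]
    seq = re.sub(r"[\s\d]", "", "".join(lines)).upper()
    if not seq:
        raise ValueError("empty sequence")
    invalid = sorted(set(seq) - VALID_CODES)
    if invalid:
        raise ValueError(f"invalid characters: {''.join(invalid)}")
    return seq


if __name__ == "__main__":
    print(to_raw_sequence(">example header\nMKT AYI\nakq 10\n"))
```

A sequence pasted with a header, spaces, or position numbers can be cleaned this way before it is entered in the query field.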

Output format

HTML - BLAST native output format with hyperlinks and some formatting.
NiceBlast - View with full descriptions and organism sources.
Plain Text - Text format with no links.

BLAST program and databases

Programs available on ExPASy

blastp compares a protein query sequence against a protein sequence database.
tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
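To make "translated in all reading frames" concrete, here is a minimal sketch of a six-frame translation using the standard genetic code. It only illustrates the idea behind tblastn's dynamic translation; the function names are hypothetical, and the real program handles ambiguity codes, alternative genetic codes, and frame bookkeeping in its own way.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate one reading frame; codons not in the table become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> dict:
    """Return the three forward and three reverse-strand translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in sorted(six_frame_translations("ATGGCCTAA").items()):
        print(name, protein)
```

Each database sequence yields six candidate protein sequences in this way, which is one reason tblastn searches take longer than blastp searches on databases of comparable size.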

Programs available elsewhere

blastn compares a nucleotide query sequence against a nucleotide sequence database.
Available at NCBI

blastx compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
Available at NCBI

tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
Available at NCBI

PSI-BLAST (Position-Specific Iterative BLAST) detects weak homologs by building a profile from a multiple alignment of the highest scoring hits in an initial BLAST search.
Available at NCBI

PHI-BLAST (Pattern-Hit Initiated BLAST) combines matching of regular expressions with local alignments surrounding the match.
Available at NCBI

Databases

Protein Databases

UniProt Knowledgebase (UniProtKB) UniProt (Universal Protein Resource) is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR. The UniProt Knowledgebase consists of two sections: Swiss-Prot, containing manually annotated records with information extracted from literature and curator-evaluated computational analysis, and TrEMBL, a section with computationally analyzed records that await full manual annotation. UniProtKB is updated biweekly and includes splice variants.

Since UniProtKB contains a huge number of sequences, it may be useful to restrict the search using the following criteria:

  • Taxonomic groups

    database subsection: a number of specific subsections have been prepared. This is the fastest and recommended way to limit a BLAST search.
    a specific taxonomic group: you may enter either a numeric NCBI TaxID (e.g. 10090), or a taxon (e.g. Bacteria), or a species name either in Latin or in English. For the list of known species names and synonyms, see Swiss-Prot species list. As the hits will be filtered in a post-processing stage, this may result in a significant delay.
    a complete microbial proteome: non-redundant sets of all the proteins from complete genome sequencing projects, compiled from Swiss-Prot and TrEMBL.

    A display of the BLAST hits as a taxonomic tree is also available from the result page, by clicking on the "Taxonomic view of BLAST hits" button.

  • Search only UniProtKB/Swiss-Prot (manually annotated sequences)
UniRef100, UniRef90 and UniRef50 The UniProt Non-redundant Reference (UniRef) databases combine closely related sequences into a single record to speed searches. The UniRef100 database combines identical sequences and sub-fragments of the UniProt Knowledgebase (from any species) into a single UniRef entry, displaying the sequence of a representative protein, the accession numbers of all the merged UniProt entries, and links to the corresponding UniProt and UniParc records. UniRef90 and UniRef50 are built by clustering UniRef100 sequences with 11 or more residues such that each cluster is composed of sequences that have at least 90% or 50% sequence identity, respectively, to the representative sequence. UniRef90 and UniRef50 yield a database size reduction of approximately 40% and 65%, respectively, providing for significantly faster sequence searches.
PDB Protein Data Bank for protein 3D structures. Sequences extracted from the PDB SEQRES lines are processed into a non-redundant set where identical sequences are merged into a single record.
Translated EST Protein sequences derived from EST sequencing data (human, mouse, rat, zebrafish, drosophila, bovine, arabidopsis). This database contains many potential errors because of the low quality of the data.

DNA Databases (for tblastn)

All databases are subdivided into taxonomic sections, selectable from the Taxonomic groups drop-down list.

All EMBL + GSS All entries from the EMBL database (equivalent to GenBank and DDBJ).
HTG Unverified data from high-throughput genomic sequencing. Usually in the form of cosmids.
dbEST Expressed sequence tag database from the NCBI.
EST contigs Database of contigs based on EST clusters from Unigene (human, mouse, rat, bovine, zebrafish) and SwissClusters (Drosophila melanogaster, Arabidopsis thaliana).
Unigene EST Database of EST clusters (lists of ESTs known to match the same cDNA) from the NCBI (updated occasionally). This database also contains useful information such as STS matches, tissue distribution, and transcript maps.
Complete genomes Genomes released in the form of a complete, assembled sequence.
Select a microbial genome One of the genomes released in the form of a complete, assembled sequence.

E-mail address

Enter your e-mail address to receive the results by e-mail. Otherwise, they will be displayed interactively in your browser. The e-mail option is recommended for tblastn searches on big databases such as EMBL. If your interactive search takes too long, you will receive an error message asking you to resubmit via e-mail.

Options

Comparison matrix

The comparison matrix assigns a score to each pair of aligned residues in an alignment. The BLOSUM matrices base these scores on the frequency with which each substitution is known to occur among consensus blocks within related proteins. BLOSUM62 is among the best of the available matrices for detecting weak protein similarities. The PAM set of matrices is also available. If the "Auto-select" option is selected (default), the matrix is chosen according to the query sequence length, based on the following (empirically constructed) table:
Query length    Substitution matrix
<35             PAM-30
35-50           PAM-70
50-85           BLOSUM-80
>85             BLOSUM-62
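The table can be read as a simple lookup on query length. The sketch below is one possible reading, not the server's actual code: the function name auto_select_matrix is hypothetical, and because the table lists a length of 50 in two rows, assigning it to PAM-70 is an assumption.

```python
def auto_select_matrix(query_length: int) -> str:
    """Pick a substitution matrix from the query length (illustrative only)."""
    if query_length < 1:
        raise ValueError("query length must be positive")
    if query_length < 35:
        return "PAM-30"
    if query_length <= 50:   # assumption: a length of exactly 50 uses PAM-70
        return "PAM-70"
    if query_length <= 85:
        return "BLOSUM-80"
    return "BLOSUM-62"


if __name__ == "__main__":
    for length in (20, 35, 50, 51, 85, 86, 400):
        print(length, auto_select_matrix(length))
```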

Setting the E threshold

The expectation value (E) of a match is the number of matches with an equal or better score that would be expected by chance in a search of a random database of the same size. The E threshold is the largest E-value a match may have and still be reported. The lower the E-value, the more likely the match is to be significant. E-values between 0.1 and 10 are generally dubious, and E-values over 10 are unlikely to have biological significance. In all cases, those matches need to be verified manually. You may need to increase the E threshold in the following cases:

Filter the sequence for low-complexity regions

Low-complexity regions (e.g. stretches of cysteine in CSP_DROME (Q03751), hydrophobic regions in membrane proteins) tend to produce spurious, insignificant matches with sequences in the database which have the same kind of low-complexity regions, but are unrelated biologically. If this option is checked, the query sequence will be run through the program SEG, and all amino acids in low-complexity regions will be replaced by X's which will appear in the alignment. The masked regions will also be visible as slashed regions in the PaintBlast image.
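The masking idea can be sketched with a sliding-window Shannon entropy test. This is a deliberately simplified stand-in and not the SEG algorithm itself: the function names are hypothetical, and the window length of 12 and the threshold of 2.2 bits are illustrative values chosen for this example.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy of the residue composition of a window, in bits."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2) -> str:
    """Replace every residue in a low-entropy window with X (simplified sketch)."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= threshold:
            masked[start:start + window] = "X" * window
    return "".join(masked)


if __name__ == "__main__":
    query = "ACDEFGHIKLMN" + "C" * 12
    print(mask_low_complexity(query))
```

A window made up of a single repeated residue has an entropy of 0 bits and is always masked, while a window of 12 different residues has an entropy of about 3.6 bits and is left untouched.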

Gapped alignment

This will allow gaps to be introduced in the sequences when the comparison is done, and is usually left checked.

Output page

The output page is divided into three sections. The first is a summary of the hits, including the score and E-value of the best HSP for each hit. The second is a graphical view summarizing the matching portions for each hit. The third contains the alignments between the query and the hits.

From the summary of the hits, several operations may be performed on selected sequences. These operations are only available for blastp searches against the protein databases:

ClustalW is a multiple sequence alignment program.
T-COFFEE is an alignment program that often gives better results than ClustalW, especially when dealing with divergent sequences and long insertions.
Reduce redundancy is a program to reduce the redundancy in a set of unaligned sequences.
PRATT is a tool to discover patterns that are conserved in a set of protein sequences.
Retrieve selected entries/sequences/accession numbers allows several sequences (complete entries, accession numbers only or fasta format) to be retrieved at a time from the database. Individual entries are always available by clicking on the accession numbers.

Graphical view

The graphical view is composed of two images.

Other references

BLAST tutorial at NCBI

BLAST Frequently Asked Questions at NCBI (includes error messages)

The Statistics of Sequence Similarity Scores by Altschul