PRATT version 2.1 documentation

PRATT is a tool to discover patterns conserved in a set of protein sequences. It was created by Inge Jonassen at the University of Bergen, Norway.

This page is an edited version of the original PRATT version 2.1 documentation tailored for the use of the PRATT form on this server.






User manual

Running PRATT from the web form on ExPASy is a simple multi-step process: start at the top of the page and follow the steps to the bottom. The PRATT submission form contains three steps:
  1. Enter a set of PROTEIN sequences or an alignment
    Enter either a set of protein sequences or a protein sequence alignment, then specify the nature of your input (sequences or alignment).
  2. Modify default parameters (optional)
    Modify the default parameters if you wish, either by tuning the parameters that are listed or by typing parameters directly in command-line format.
  3. Submit your job
    Decide whether you want to submit the best pattern found by PRATT to ScanProsite or if you want to see the PRATT results file, then submit your job.

STEP 1 - Enter a set of PROTEIN sequences or an alignment

In STEP 1 of the PRATT submission form, you must enter a set of protein sequences or an alignment of protein sequences. If you enter sequences, accepted formats are FASTA or UniProtKB. If you choose to submit an alignment, it must be in FASTA format.
  • FASTA format
    Each sequence starts with a '>' followed by the name of the sequence (free text) on the same line. The end of a sequence is identified by looking either for the start of a new sequence ('>') or the end of the whole input if it is the last sequence.
    Note that PRATT does not allow for annotation in a FASTA input.
  • UniProtKB format
    Each entry starts with a line of the form 'ID   name_of_the_sequence', followed by an arbitrary number of lines and then a line starting with 'SQ', followed by the actual sequence on subsequent lines, followed by a line containing only '//', marking the end of the entry.
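As a rough illustration of the FASTA rule described above (a record starts at '>' and runs to the next '>' or to the end of the input), here is a minimal Python sketch. The function name parse_fasta is hypothetical and is not part of PRATT; skipping blank lines is an assumption of this sketch, not a documented PRATT behaviour.

```python
def parse_fasta(text):
    """Split FASTA text into (name, sequence) pairs.

    A record starts at a line beginning with '>' and runs until the next
    '>' line or the end of the input.
    """
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption of this sketch: blank lines are ignored
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Everything after '>' on the same line is the free-text name.
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


example = ">seq1 first protein\nMKTAYIAK\nQRQISF\n>seq2\nMKTLYVAK\n"
print(parse_fasta(example))
```

The sequence lines of a record are simply concatenated, so a sequence may be wrapped over any number of lines.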
You can then specify if your input is a set of sequences or an alignment.
If 'alignment' is chosen, then the input is considered as an alignment which will be used to guide the pattern search and only patterns consistent with it will be considered by PRATT.
Loosely, a pattern is considered consistent with the alignment if each symbol in the pattern corresponds to an ungapped column in the alignment and all the characters of the column in question match the pattern symbol. Also, the wildcards in the pattern must be compatible with the number of residues between the corresponding columns in the alignment.
For instance, the pattern A-x(2,3)-B is consistent with the following alignment:

            ALVGB
            AG-LB
            ALD-B
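The loose consistency rule above can be sketched in Python as follows. This is only an illustration of the definition given in this section, not PRATT's actual implementation; the function name is_consistent and the way the pattern is passed in (as separate lists of components, wildcard ranges and chosen column indices) are assumptions made for the example.

```python
def is_consistent(alignment, components, wildcards, columns):
    """Check a pattern against an alignment, following the loose definition.

    alignment  : list of equal-length aligned rows, '-' marks a gap
    components : list of sets of allowed letters, one per pattern symbol
    wildcards  : list of (min, max) pairs, one between each pair of components
    columns    : alignment column index chosen for each component
    """
    for allowed, col in zip(components, columns):
        column = [row[col] for row in alignment]
        # Each symbol needs an ungapped column whose letters all match it.
        if "-" in column or any(ch not in allowed for ch in column):
            return False
    for (low, high), left, right in zip(wildcards, columns, columns[1:]):
        for row in alignment:
            # Count residues (non-gap characters) strictly between the columns.
            between = sum(ch != "-" for ch in row[left + 1:right])
            if not low <= between <= high:
                return False
    return True


alignment = ["ALVGB", "AG-LB", "ALD-B"]
# A-x(2,3)-B with A placed on column 0 and B on column 4
print(is_consistent(alignment, [{"A"}, {"B"}], [(2, 3)], [0, 4]))  # True
# A-x(3)-B fails: two of the rows have only 2 residues between A and B
print(is_consistent(alignment, [{"A"}, {"B"}], [(3, 3)], [0, 4]))  # False
```

In the example alignment the three rows have 3, 2 and 2 residues between A and B, which is why x(2,3) is compatible and x(3) is not.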

        
STEP 2 - Modify default parameters

In the form, parameters are divided into three groups by theme: Pattern parameters, Search parameters and Output parameters.
If you're comfortable with PRATT options, you can modify default parameters by directly entering parameters in a command-line format.

Pattern parameters

Each command-line option is listed with its description, followed by its possible values and its default value.
  • CM: Minimum number of input sequences that must be matched by reported patterns
    Possible values: any number in ℕ* (maximum value: total number of input sequences)
    Default value: total number of input sequences
  • C%: Minimum percentage of input sequences that must be matched by reported patterns
    Possible values: any number in ℚ*+ (maximum value: 100)
    Default value: 100
  • PP (see note 1): Position of reported patterns in input sequences
    Possible values: off, complete, start
    Default value: off
  • PL: Maximum length of reported patterns
    For instance, G-G-[PS]-L-x(1,3)-R has a length of 8 (1+1+1+1+3+1).
    Increasing PL increases the memory requirement.
    Possible values: any number in ℕ*
    Default value: 50
  • PN: Maximum number of different pattern symbols that reported patterns can contain
    For instance, G-G-[PS]-L-x(1,3)-R has 4 different symbols: G, [PS], L and R.
    Increasing PN increases the memory requirement.
    Possible values: any number in ℕ*
    Default value: 50
  • PX: Maximum length that any given wildcard (x) in reported patterns can have
    For instance, x has a length of 1, x(7) has a length of 7 and x(1,3) has a length of 3.
    Increasing PX increases the time used by PRATT and also, slightly, the memory requirement.
    Possible values: any number in ℕ
    Default value: 5
  • FN: Maximum number of flexible wildcards (x) in reported patterns
    For instance, A-x(2)-P-x-G-x(0,2)-D-x(3,5)-S contains 2 flexible wildcards: x(0,2) and x(3,5).
    Increasing FN increases the time used by PRATT.
    Possible values: any number in ℕ
    Default value: 2
  • FL: Maximum flexibility of wildcards (x) in reported patterns
    For instance, x(3) has a flexibility of 0, x(1,2) has a flexibility of 1 and x(0,2) has a flexibility of 2.
    Increasing FL increases the time used by PRATT.
    Possible values: any number in ℕ
    Default value: 2
  • FP: Maximum product of wildcard (x) flexibility in reported patterns
    The product is calculated as (flexibility of wildcard_1 + 1) * ... * (flexibility of wildcard_n + 1).
    For instance, for C-x(2,4)-[DE]-x(10)-F the product is (2+1) * (0+1) = 3, and for C-x(2,4)-[DE]-x(10,14)-F the product is (2+1) * (4+1) = 15.
    Increasing FP increases the memory requirement.
    Possible values: any number in ℕ*
    Default value: 10
  • BI (see note 2): Input a pattern symbol file
    Possible values: off, on
    Default value: off
  • BN: Maximum number of pattern symbols to be used in the initial search
    PRATT uses a set of pattern symbols to perform the search. This set contains the 20 one-letter amino acid symbols, like G, followed by ambiguous symbols for amino acids sharing some physico-chemical properties, like [DE].
    Increasing BN slows down the search and increases the memory requirement, but allows for more ambiguous pattern symbols to be used.
    Possible values: any number in ℕ*
    Default value: 20
  • S (see note 3): Pattern scoring method
    Possible values: info, mdl, tree, dist, ppv
    Default value: info

1) The possibility to change the PP value to 'complete' or 'start' is not implemented on this server (always 'off').
2) The possibility to input a pattern symbol file is not implemented on this server (always 'off').
3) The only pattern scoring methods available on this server are 'info' and 'mdl':
  • info: patterns are scored by their information content as defined in "Finding flexible patterns in unaligned protein sequences" Protein Science,4(8)(1995),pp.1587-1595.
    Note that with this scheme a pattern's score is independent of how many sequences it matches.
  • mdl: this scoring method is derived from the Minimum Description Length (MDL) principle. It is related to the 'info' scheme, but the number of sequences matched is taken into account, i.e. patterns matching few sequences are penalized in comparison with patterns matching many.
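To make the quantities limited by PL, PN, PX, FN, FL and FP more concrete, the following Python sketch computes them for a pattern written in PROSITE style. It follows the worked examples given for these options above; it is not PRATT code, and the function name pattern_metrics and the dictionary keys are hypothetical. In line with the PN example, repeated symbols are counted once.

```python
import re
from math import prod

# Matches 'x', 'x(n)' and 'x(i,j)'.
WILDCARD = re.compile(r"^x(?:\((\d+)(?:,(\d+))?\))?$")


def pattern_metrics(pattern):
    """Return the quantities limited by PL, PN, PX, FN, FL and FP."""
    symbols, wildcards = [], []
    for element in pattern.split("-"):
        match = WILDCARD.match(element)
        if match:
            low = int(match.group(1)) if match.group(1) else 1
            high = int(match.group(2)) if match.group(2) else low
            wildcards.append((low, high))
        else:
            symbols.append(element)
    flexibilities = [high - low for low, high in wildcards]
    return {
        # One position per symbol plus the maximum length of each wildcard.
        "length": len(symbols) + sum(high for _, high in wildcards),
        "different_symbols": len(set(symbols)),
        "longest_wildcard": max((high for _, high in wildcards), default=0),
        "flexible_wildcards": sum(f > 0 for f in flexibilities),
        "largest_flexibility": max(flexibilities, default=0),
        "flexibility_product": prod(f + 1 for f in flexibilities),
    }


print(pattern_metrics("G-G-[PS]-L-x(1,3)-R"))
print(pattern_metrics("C-x(2,4)-[DE]-x(10,14)-F")["flexibility_product"])  # 15
```

For G-G-[PS]-L-x(1,3)-R this gives a length of 8 and 4 different symbols, matching the examples given for PL and PN.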



Search parameters

Each command-line option is listed with its description, followed by its accepted values and its default value.
  • G (see note 4): Use an alignment or a query sequence to restrict the pattern search
    Accepted values: seq, al, query
    Default value: seq
  • E: Greediness of the search
    Setting the greediness to 0 makes the search exhaustive. Increasing the greediness decreases the time used in the search.
    Accepted values: any number in ℕ
    Default value: 3
  • R: Pattern refinement
    Applies an algorithm to the patterns found during the initial search phase, in which more ambiguous pattern symbols can be added. The refinement could lead for C-x(4)-D to be refined to C-x-[ILV]-x-D-x(3)-[DEF].
    Accepted values: off, on
    Default value: on
  • RG: Generalise pattern symbols during the pattern refinement phase
    If this option is off, only the letters needed to match the input sequences are included in the ambiguous pattern positions. If it is on, only symbols present in the symbol set may be reported. This set contains the 20 one-letter amino acid symbols, like I, followed by ambiguous symbols for amino acids sharing some physico-chemical properties, like [ILV].
    For example, take input sequences that contain I or L at the same position. If this option is off, [IL] will be reported, while if it is on, [ILV] will be reported instead, because [ILV] is in the symbol set while [IL] is not.
    Accepted values: off, on
    Default value: off

4) On this server, the possibility to use a query sequence to guide the pattern search is not implemented (only 'seq' or 'al') and the possibility to switch to 'al' is addressed in STEP 1.
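The effect of the RG option on a single ambiguous position can be sketched as below. The symbol set shown is a small hypothetical one chosen for illustration (PRATT's real default set is not reproduced here), and picking the smallest symbol that covers the observed letters is an assumption of this sketch rather than documented PRATT behaviour.

```python
# Hypothetical symbol set: the 20 amino acids first, then a few
# illustrative ambiguous groups (not PRATT's actual default set).
SYMBOL_SET = [frozenset(aa) for aa in "ACDEFGHIKLMNPQRSTVWY"] + [
    frozenset("DE"),
    frozenset("ILV"),
    frozenset("FWY"),
]


def report_symbol(observed, generalise):
    """Return the pattern symbol reported for the letters seen at a position.

    observed   : letters found at this position in the matched sequences
    generalise : False mimics RG off, True mimics RG on
    """
    observed = frozenset(observed)
    if not generalise:
        # RG off: only the letters actually needed are included.
        chosen = observed
    else:
        # RG on: only symbols from the symbol set may be reported.
        candidates = [s for s in SYMBOL_SET if observed <= s]
        if not candidates:
            return None
        chosen = min(candidates, key=len)  # assumption: smallest covering symbol
    letters = "".join(sorted(chosen))
    return letters if len(letters) == 1 else f"[{letters}]"


print(report_symbol("IL", generalise=False))  # [IL]
print(report_symbol("IL", generalise=True))   # [ILV]
```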

Output parameters

Note that, except for the kind of patterns that must be reported (-P option), modifying the other output parameters only makes sense if you select to view the PRATT output file in STEP 3 of the submission form.

Each command-line option is listed with its description, followed by its accepted values and its default value.
  • OP: Pattern format
    A pattern expressed as C-x(2,4)-DE in PROSITE format is written Cxx--DE in simple consensus format, where 'x' matches exactly one arbitrary sequence symbol and '-' matches zero or one arbitrary sequence symbol.
    Accepted values: off (simple consensus), on (PROSITE)
    Default value: on
  • ON: Maximum number of patterns
    Set the maximum number of patterns to be reported.
    Accepted values: any number in ℕ*
    Default value: 50
  • OA: Maximum number of alignments
    Set the maximum number of patterns for which an alignment of the sequence regions matching it will be reported.
    The alignments appear in the section 'Best patterns with alignments' of the PRATT output file.
    Accepted values: any number in ℕ
    Default value: 50
  • M: Print patterns in sequences
    Report the location of the input sequence segments matching each of the best patterns.
    This information is reported in the section 'Pattern matches' of the PRATT output file.
    Note that at most 52 patterns in sequences can be printed, even if the number of input sequences is higher.
    Accepted values: off, on
    Default value: on
  • MR: Ratio for printing
    Set the K value (ratio) used for printing the summary information about where in each sequence the pattern matches are found.
    Accepted values: any number in ℕ*
    Default value: 10
  • MV: Print vertically
    Print patterns in sequences vertically instead of horizontally. Vertical output may be better for large sequence sets.
    Accepted values: off (horizontal), on (vertical)
    Default value: off
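The relationship between the two pattern formats of the OP option can be illustrated with a small conversion sketch: a wildcard x(i,j) becomes i 'x' characters followed by j-i '-' characters. The function name prosite_to_consensus is hypothetical, and the input pattern is written here with a dash between every element (C-x(2,4)-D-E), which is an assumption about how the example pattern above would be spelled in full PROSITE syntax.

```python
import re

# Matches 'x', 'x(n)' and 'x(i,j)'.
WILDCARD = re.compile(r"^x(?:\((\d+)(?:,(\d+))?\))?$")


def prosite_to_consensus(pattern):
    """Rewrite a PROSITE-style pattern in the simple consensus format."""
    out = []
    for element in pattern.split("-"):
        match = WILDCARD.match(element)
        if match:
            low = int(match.group(1)) if match.group(1) else 1
            high = int(match.group(2)) if match.group(2) else low
            # 'x' = exactly one symbol, '-' = zero or one symbol.
            out.append("x" * low + "-" * (high - low))
        else:
            out.append(element)
    return "".join(out)


print(prosite_to_consensus("C-x(2,4)-D-E"))  # Cxx--DE
```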


STEP 3 - Submit your job

In this step, simply choose whether you want the best pattern found by PRATT to be transmitted to ScanProsite, in which case you will be directed to ScanProsite, or whether you want to see the output file.
Note that if no pattern is found, you will view the output file no matter what you've chosen.


PRATT output file

If run with default parameters, a PRATT output/results file contains the following sections:
  • Introduction
    The introduction contains summary information about the PRATT run reported, including a breakdown of the parameters applied.
  • Best patterns before refinement
    In this section, the best patterns found before the refinement phase are listed in decreasing order of fitness (score). For each pattern reported, its rank, its score, the number of hits and the number of input sequences hit are also reported.
  • Best pattern (after refinement phase)
    In this section, the best patterns found after the refinement phase are listed in decreasing order of fitness (score). For each pattern reported, its rank, its score, the number of hits and the number of input sequences hit are also reported.
  • Best patterns with alignments
    When showing the sequence segments matching each pattern, the sequence symbols matching non-wildcard positions (components) in the pattern are written in upper-case, while sequence symbols matching wildcards are in lower-case. Gaps are also added to align the symbols matching each pattern component.
  • Pattern matches
    This section provides summary information about where pattern matches are located in input sequences. Patterns are given labels A->Z and a->z in order of decreasing pattern score. Each sequence is printed on a line, one character per K-tuple in the sequence, where K is the ratio set with the MR option. If the pattern with label 'C' matches the 3rd K-tuple in a sequence, C will be printed out. If several patterns match in the same K-tuple, only the best will be printed.
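The 'Pattern matches' summary can be pictured with the following Python sketch. It only illustrates the labelling rule described above; it does not reproduce PRATT's exact output. Several details are assumptions made for the example: the function name summary_line, the use of '.' for K-tuples without a match, and assigning a match to the K-tuple that contains its start position.

```python
import string

# 52 labels in order of decreasing pattern score: A-Z, then a-z.
LABELS = string.ascii_uppercase + string.ascii_lowercase


def summary_line(seq_length, k, matches, blank="."):
    """Build one summary line for a sequence.

    seq_length : length of the sequence
    k          : K value (one printed character per K residues)
    matches    : list of (rank, start), rank 0 being the best pattern and
                 start the 0-based position where the match begins
    """
    n_tuples = -(-seq_length // k)  # ceiling division
    best = [None] * n_tuples
    for rank, start in matches:
        idx = start // k
        # If several patterns fall in the same K-tuple, keep the best one.
        if best[idx] is None or rank < best[idx]:
            best[idx] = rank
    return "".join(blank if r is None else LABELS[r] for r in best)


# Pattern 'C' (rank 2) matching in the 3rd K-tuple of a 50-residue sequence
print(summary_line(50, 10, [(2, 25)]))                    # ..C..
# 'A' and 'C' compete for the 3rd K-tuple, 'B' is in the 5th
print(summary_line(50, 10, [(2, 25), (0, 27), (1, 41)]))  # ..A.B
```

Having exactly 52 labels available is consistent with the limit of 52 patterns mentioned for the 'Print patterns in sequences' option.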


Pattern terminology

Patterns that can be found by PRATT constitute a subset of the ones that can be described using PROSITE syntax. A pattern that can be found by PRATT can be written in the following form:
A1-x(i1,j1)-A2-x(i2,j2)-...-A(p-1)-x(i(p-1),j(p-1))-Ap
where:
  • Ak is a component of the pattern. There are two types of components:
    • identity component:
      specifies exactly one amino acid, for instance 'C' or 'L'.
    • ambiguous component:
      specifies more than one amino acid, for instance '[ILV]' or '[FWY]'.
  • x(ik,jk) specifies a wildcard region matching between ik and jk arbitrary amino acids, where ik and jk are non-negative integers such that ik <= jk for all k.
    There are two types of wildcard regions:
    • fixed wildcard region
      where jk is equal to ik, e.g. x(2,2) which can be written as x(2).
    • flexible wildcard region
      where jk is greater than ik, e.g. x(2,3).
      The flexibility of such a wildcard region is equal to jk-ik. For example the flexibility of x(2,3) is 3-2 = 1.
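      The product of flexibility for a pattern is obtained by adding 1 to the flexibility of each wildcard region in the pattern and multiplying the results, as in the formula given for the FP parameter. For example, the product for C-x(2,4)-[DE]-x(10,14)-F is (2+1) * (4+1) = 15.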
Examples:
C-x(2)-H
This pattern contains two components, 'C' and 'H', and one fixed wildcard region, 'x(2)'.
It matches any sequence containing a C followed by two arbitrary amino acids followed by an H, for instance 'CnaH'.
C-x(2,3)-H
This pattern contains two components, 'C' and 'H', and one flexible wildcard region, 'x(2,3)'.
It matches any sequence containing a C followed by two or three arbitrary amino acids followed by an H, for instance 'CnaH' or 'ClwgH'.
C-x(2,3)-[ILV]
This pattern contains two components, 'C' and '[ILV]', and one flexible wildcard region, 'x(2,3)'.
It matches any sequence containing a C followed by two or three arbitrary amino acids followed by I, L or V, for instance 'CvqgL' or 'CvyV'.
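The matching behaviour in these examples can be reproduced with Python's standard re module by translating each pattern into a regular expression: x becomes '.', x(n) becomes '.{n}' and x(i,j) becomes '.{i,j}'. This is a minimal sketch for the pattern form described in this section only, not the way PRATT or ScanProsite match patterns; the function names pattern_to_regex and matches are hypothetical. Sequences are upper-cased first because the examples write wildcard positions in lower-case.

```python
import re


def pattern_to_regex(pattern):
    """Translate a pattern such as C-x(2,3)-[ILV] into a compiled regex."""
    parts = []
    for element in pattern.split("-"):
        if element == "x":
            parts.append(".")
        elif element.startswith("x("):
            bounds = element[2:-1]             # "2" or "2,3"
            parts.append(".{" + bounds + "}")
        else:
            parts.append(element)              # 'C' or '[ILV]' work as-is
    return re.compile("".join(parts))


def matches(pattern, sequence):
    """True if the pattern occurs anywhere in the sequence."""
    return pattern_to_regex(pattern).search(sequence.upper()) is not None


print(matches("C-x(2)-H", "CnaH"))         # True
print(matches("C-x(2,3)-H", "ClwgH"))      # True
print(matches("C-x(2,3)-[ILV]", "CvyV"))   # True
print(matches("C-x(2)-H", "ClwgH"))        # False
```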


References

If you find patterns using PRATT, please cite the first reference below.
  1. I. Jonassen, J.F. Collins, D. Higgins
    Finding flexible patterns in unaligned protein sequences
    Protein Science,4(8)(1995),pp.1587-1595

  2. I. Jonassen
    Efficient discovery of conserved patterns using a pattern graph
    Computer applications in the biosciences: CABIOS,13(5)(1997),pp.509-522

  3. I. Jonassen, C. Helgesen, D. Higgins
    Scoring function for pattern discovery programs taking into account sequence diversity
    Department of Informatics, University of Bergen, Reports in Informatics,116(Feb 1996)

  4. A. Brazma, I. Jonassen, E. Ukkonen, J. Vilo
    Discovering patterns and subfamilies in biosequences
    In Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology (ISMB-96), AAAI Press,(1996),pp.34-43