NAME
psa - biological sequence alignment file format
DESCRIPTION
psa is an output format used by the pftools package to
describe alignments between biological sequences (DNA or
protein) and PROSITE profiles.
psa is related to the widely used fasta biological sequence
file format. In addition to the biological sequence itself,
it records how that sequence aligns to a motif descriptor
such as a PROSITE profile. This information is included in
the header and reflected in the structure of the sequence
data following the header line.
SYNTAX
Each sequence in a psa alignment file or output must be pre-
ceded by a fasta header line.
The general syntax of such a fasta header line is as fol-
lows:
>seq_id [ free_text ]
The header must start with a '>' character which is directly
followed by the seq_id field. This field is interpreted by
most programs as the sequence's identifier and/or accession
number. It ends at the first encountered whitespace charac-
ter.
The pftools programs will use the free_text to add informa-
tion about the match score, position and description of the
sequence or motif. Please refer to the man page of the cor-
responding programs for further information about the output
formats.
The header can only extend over one line. The following
lines up to a new line starting with a '>' character or the
end of the file are interpreted as sequence data.
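The header and record rules above can be sketched in Python as
follows. This is a minimal illustration, not part of pftools: the
function name parse_psa, the line break inside the first alignment
and the second record (seq2) are assumptions made for the example.

```python
import re

def parse_psa(text):
    """Split psa text into (seq_id, free_text, alignment) tuples."""
    records = []
    seq_id = free_text = None
    chunks = []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if seq_id is not None:
                records.append((seq_id, free_text, "".join(chunks)))
            # seq_id directly follows '>' and ends at the first
            # whitespace character; the rest is free text.
            m = re.match(r"(\S*)\s*(.*)", line[1:])
            seq_id, free_text = m.group(1), m.group(2).strip()
            chunks = []
        elif seq_id is not None:
            # Data lines may differ in length; join them in order.
            chunks.append(line.strip())
    if seq_id is not None:
        records.append((seq_id, free_text, "".join(chunks)))
    return records

# Hypothetical input: the first record reuses the pfsearch example,
# wrapped over two lines; the second record is invented.
sample = (
    ">YD28_SCHPO 556 pos. 291 - 332 sp|Q10256|YD28_SCHPO\n"
    "PTDPGlnsKIAQLVSMGFDP\n"
    "LEAAQALDAANGDLDVAASFLL--\n"
    ">seq2\n"
    "ACGT\n"
)
records = parse_psa(sample)
for seq_id, free_text, alignment in records:
    print(seq_id, len(alignment))
```

Lines that appear before the first header are ignored in this
sketch; a stricter reader might report them as an error instead.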
The alignment data between a sequence and a PROSITE profile
starts on the line following the header. This data can span
several lines of differing lengths.
The data consists of upper- or lower-case characters of the
corresponding sequence alphabet (DNA or protein). The gap
characters '.' and '-' are also supported.
The alignment is always at least as long as the matching
profile. Insertions or deletions detected during the
motif/sequence alignment step change the length of the
reported data, and can be identified using the following
conventions:
upper-case character
Any upper-case character of the sequence alphabet
identifies a match position between the sequence
and the motif descriptor.
lower-case character
A lower-case character of the sequence alphabet is
used to symbolize an insertion in the sequence
compared to the motif descriptor.
'-' (dash) character
A '-' character in the output identifies the pres-
ence of a deletion in the sequence compared to the
motif descriptor.
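These conventions can be applied mechanically. The Python sketch
below counts each kind of alignment column; the function name
summarize_alignment and the two derived totals are assumptions made
for illustration, and the '.' character is counted like '-', which
is how psa2msa(1) is described as treating it.

```python
def summarize_alignment(alignment):
    """Count match, insertion and deletion columns in psa data."""
    matches = sum(1 for c in alignment if c.isupper())
    insertions = sum(1 for c in alignment if c.islower())
    # Assumption: '.' is handled in the same manner as '-'.
    deletions = sum(1 for c in alignment if c in "-.")
    return {
        "matches": matches,
        "insertions": insertions,
        "deletions": deletions,
        # Derived values (an inference, not stated by the format):
        # profile positions are matches plus deletions, sequence
        # residues are matches plus insertions.
        "profile_positions": matches + deletions,
        "sequence_residues": matches + insertions,
    }

summary = summarize_alignment(
    "PTDPGlnsKIAQLVSMGFDPLEAAQALDAANGDLDVAASFLL--"
)
print(summary)
```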
EXAMPLES
(1) >YD28_SCHPO 556 pos. 291 - 332 sp|Q10256|YD28_SCHPO
PTDPGlnsKIAQLVSMGFDPLEAAQALDAANGDLDVAASFLL--
This is an example of the output produced by
pfsearch(1) using the '-x' (i.e. psa output) option.
The first line starting with the '>' character is the
fasta header. It also contains information about the
raw score of the alignment as well as its position in
the input sequence.
The next line holds the alignment itself. Starting at
position 6 of the alignment, the three residues 'lns'
are inserted in the sequence relative to the motif.
The last two positions of the motif are absent from
the sequence (i.e. they are deleted), as indicated by
the two '-' (dash) characters at the end of the
alignment.
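The header of this example can be cross-checked against its
alignment. The Python sketch below assumes the free text has the
layout "score pos. start - end description", which is inferred from
this single example only; other pfsearch(1) options may produce a
different layout, so HEADER_RE is a hypothetical pattern rather
than a definition of the format.

```python
import re

# Assumed layout: ">seq_id score pos. start - end description"
HEADER_RE = re.compile(
    r"^>(\S+)\s+(-?\d+(?:\.\d+)?)\s+pos\.\s+(\d+)\s*-\s*(\d+)\s*(.*)$"
)

header = ">YD28_SCHPO 556 pos. 291 - 332 sp|Q10256|YD28_SCHPO"
alignment = "PTDPGlnsKIAQLVSMGFDPLEAAQALDAANGDLDVAASFLL--"

m = HEADER_RE.match(header)
seq_id = m.group(1)
score = float(m.group(2))
start, end = int(m.group(3)), int(m.group(4))
description = m.group(5)

# Residues of the input sequence are the letters of either case;
# gap characters do not consume sequence positions.
residues = sum(1 for c in alignment if c.isalpha())
span = end - start + 1

print(seq_id, score, start, end, residues == span)
```

Here the 42 letters of the alignment agree with the reported span
of positions 291 to 332 in the input sequence.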
NOTES
(1) The xpsa(5) format defines a stricter syntax for the
header line, allowing the exchange of information
between different sequence analysis tools. It uses
keyword=value pairs to annotate the current match between
a sequence and a motif descriptor. This syntax can be
easily parsed and extended, according to the needs of
bioinformatics tools.
(2) The current implementation of the pftools package does not
use the '.' (dot) character in the psa output. Nevertheless
psa2msa(1) will read it and interpret it in the same manner
as the '-' (dash) character.
SEE ALSO
xpsa(5), pfsearch(1), pfscan(1), pfw(1), pfmake(1),
psa2msa(1)
AUTHOR
This manual page was originally written by Volker Flegel.
The pftools package was developed by Philipp Bucher.
Any comments or suggestions should be addressed to
<pftools@isb-sib.ch>.