The following is an excerpt from the chapter
Protein Identification and Analysis Tools in the ExPASy Server
published in the book
2-D Proteome Analysis Protocols (1998). Editor A.J. Link. Humana Press, New Jersey.
The AACompIdent tool (see references) can identify proteins by their amino acid (AA) composition.
The program matches the empirically measured percent AA composition of an
unknown protein against the theoretical percent AA compositions of proteins
in the Swiss-Prot and/or TrEMBL
databases. A score, which represents the degree of
difference between the composition of the unknown protein and a protein in
the database, is calculated for each database entry as the sum, over all
amino acids, of the squared differences between the percent AA composition of
the unknown protein and that of the database entry. All proteins in the database are
then ranked according to their score, from lowest (best match) to highest
(worst match). Estimated protein pI and MW, as well as a species and keyword of interest,
can also be used in the identification procedure.
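As an illustration of the scoring described above, the following Python sketch computes the sum of squared differences between two percent compositions and ranks database entries from lowest to highest score. The function names, the dictionary layout and the toy three-residue compositions are assumptions made for this example; they are not taken from the AACompIdent implementation.

```python
def aa_score(query, entry):
    """Sum of squared differences between two percent AA compositions."""
    return sum((query[aa] - entry[aa]) ** 2 for aa in query)


def rank_entries(query, database):
    """Return (score, entry_id) pairs sorted from best (lowest) to worst."""
    scored = [(aa_score(query, comp), entry_id)
              for entry_id, comp in database.items()]
    return sorted(scored)


# Toy compositions in molar percent; a real search would use 16 values.
query = {"Ala": 50.0, "Gly": 30.0, "Ser": 20.0}
database = {
    "PROT_A": {"Ala": 48.0, "Gly": 31.0, "Ser": 21.0},  # hypothetical entry
    "PROT_B": {"Ala": 20.0, "Gly": 40.0, "Ser": 40.0},  # hypothetical entry
}
ranking = rank_entries(query, database)
print(ranking)
```

Here PROT_A scores 6.0 and PROT_B scores 1400.0, so PROT_A is ranked first.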
Basic Use of the AACompIdent Tool
After selecting the AACompIdent tool, you must first
choose the relevant AA constellation to use in matching. For AA
compositions determined by standard methods, use Constellation 2. This
constellation is for 16 AAs (Asx, Glx, Ser, His, Gly, Thr, Ala, Pro, Tyr,
Arg, Val, Met, Ile, Leu, Phe, Lys), does not consider Cys or Trp, and
calculates Asn and Asp together as Asx, and Glu and Gln together as Glx.
You should then specify the e-mail address to which the results should be
sent, then scroll down to the "Unknown Protein" field. Here you should
specify a name for the search that will appear as the subject of the e-mail
message, the protein pI and MW estimated from the 2-D gel as well as error
ranges that reflect the accuracy of these estimates (see comments), the Swiss-Prot
abbreviation for the species of interest (a list of species abbreviations
is available, and this list can easily be searched using the Netscape "Find" function in the "Edit"
pull-down menu), and if desired a Swiss-Prot keyword (list of keywords
used in Swiss-Prot). If desired, matching can also be done against all species
in the database by specifying "ALL". Finally, specify the experimentally
determined AA composition of the protein, with compositional data expressed
as molar percent. If you have analysed a calibration protein in parallel
with unknowns as part of your AA analysis procedure, the composition of this
protein can be used to compensate for error inherent to the AA analysis
procedure. To do this, go to the "Calibration Protein" field, specify the
Swiss-Prot ID name for the protein (e.g. ALBU_BOVIN for bovine serum
albumin) and enter the experimentally determined AA composition of the
protein, with data expressed as molar percent. Finally, select the "Search"
button to submit the data to the ExPASy server. Results will be sent to your
e-mail address.
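To make Constellation 2 concrete, the sketch below derives a 16-value percent composition from a protein sequence: Asp and Asn are pooled as Asx, Glu and Gln as Glx, and Cys and Trp are ignored. It assumes that percentages are expressed relative to the counted residues only (so that the 16 values sum to 100); that normalisation, like the function and variable names, is an assumption of this example rather than a documented detail of the tool.

```python
from collections import Counter

# One-letter code -> Constellation 2 label (Cys and Trp are absent on purpose).
CONSTELLATION_2 = {
    "D": "Asx", "N": "Asx", "E": "Glx", "Q": "Glx",
    "S": "Ser", "H": "His", "G": "Gly", "T": "Thr",
    "A": "Ala", "P": "Pro", "Y": "Tyr", "R": "Arg",
    "V": "Val", "M": "Met", "I": "Ile", "L": "Leu",
    "F": "Phe", "K": "Lys",
}


def constellation_2_percent(sequence):
    """Return the 16-value percent composition of a one-letter sequence."""
    counts = Counter(CONSTELLATION_2[aa] for aa in sequence.upper()
                     if aa in CONSTELLATION_2)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no residues of Constellation 2")
    labels = sorted(set(CONSTELLATION_2.values()))
    # Assumption: percentages are relative to the counted residues only.
    return {label: 100.0 * counts[label] / total for label in labels}


# Toy sequence: D, N, E, Q, four A, plus C and W, which are skipped.
composition = constellation_2_percent("DNEQCWAAAA")
print(composition["Asx"], composition["Glx"], composition["Ala"])
```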
Use of the AACompIdent Tool with sequence tags
Protein samples from 2-D gels can be submitted to Edman protein sequencing
to create a "sequence tag" of 3 or 4 amino acids, after which the same
protein sample can be used for AA composition analysis (2). This approach
provides protein identification of higher confidence than identification by
amino acid composition analysis alone. To use AA composition and sequence
tag data together for protein identification, fill in the AACompIdent form
as described above but do not immediately submit it to ExPASy. Go to the bottom of
the form, select the tagging option by clicking in the small box, and enter
a protein sequence tag of up to 6 amino acids in single AA code into the
"Tag" text field. Finally, specify if the sequence tag is N- or C-terminal,
and select the "Search" button to submit the data to the ExPASy server.
Results will be sent to your e-mail address.
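The tag input described above (up to 6 residues in single-letter code, assigned to the N- or C-terminus) can be sketched as follows. The code validates a tag and looks for it in the terminal region of a candidate sequence, marking a hit in lower case. The 40-residue window mirrors the length of terminal sequence shown in the AACompIdent output; the function names, and the choice to accept the tag anywhere inside that window, are assumptions of this sketch.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def validate_tag(tag):
    """Return the tag in upper case, or raise ValueError if it is malformed."""
    tag = tag.strip().upper()
    if not 1 <= len(tag) <= 6:
        raise ValueError("tag must contain 1 to 6 residues")
    if not set(tag) <= VALID_RESIDUES:
        raise ValueError("tag must use single-letter amino acid codes")
    return tag


def highlight_tag(sequence, tag, terminus="N", window=40):
    """Return (found, shown): the terminal window, with the tag in lower case."""
    tag = validate_tag(tag)
    sequence = sequence.upper()
    if terminus == "N":
        shown = sequence[:window]
    elif terminus == "C":
        shown = sequence[-window:]
    else:
        raise ValueError("terminus must be 'N' or 'C'")
    pos = shown.find(tag)  # first occurrence inside the window, or -1
    if pos < 0:
        return False, shown
    return True, shown[:pos] + tag.lower() + shown[pos + len(tag):]


# Toy sequence; not a real database entry.
print(highlight_tag("MKTAYIAKQR", "TAY", "N"))
print(highlight_tag("MKTAYIAKQR", "KQR", "C"))
```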
Interpretation of AACompIdent results
The output of AACompIdent contains three lists of proteins ranked according
to their AA score (see figure 5A). The first list contains the results
from matching the AA composition of the query protein against all proteins
from the species of interest, but without considering the specified pI and
MW. The second list shows the result of matching the AA composition of the
query protein against all proteins from all species in Swiss-Prot, again
without considering pI and MW. The third list contains the results of
matching the AA composition of the query protein only against the proteins
from the species of interest that lie within the specified pI and MW range.
This is the most powerful search. In all lists, a score of 0 indicates a
perfect match between the query protein and a protein in the database, with
larger scores indicating increasing difference.
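The relationship between the three lists can be sketched in Python as below. The example assumes that each database entry already carries its AA score, that the pI range is given in absolute pI units and that the MW range is given as a percentage of the estimated MW. The `Entry` class, the function names and the entry identifiers are hypothetical and serve only to illustrate the filtering.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    entry_id: str
    species: str
    pi: float
    mw: float
    score: float  # AA score against the query, assumed precomputed


def in_window(entry, pi, pi_range, mw, mw_percent):
    """True if the entry lies within pI +/- pi_range and MW +/- mw_percent %."""
    mw_range = mw * mw_percent / 100.0
    return abs(entry.pi - pi) <= pi_range and abs(entry.mw - mw) <= mw_range


def three_lists(entries, species, pi, pi_range, mw, mw_percent):
    """Return the species, all-species and pI/MW-restricted lists, best first."""
    by_score = sorted(entries, key=lambda e: e.score)
    same_species = [e for e in by_score if e.species == species]
    windowed = [e for e in same_species
                if in_window(e, pi, pi_range, mw, mw_percent)]
    return same_species, by_score, windowed


entries = [
    Entry("PRTA_ECOLI", "ECOLI", 5.0, 20000.0, 12.0),
    Entry("PRTB_ECOLI", "ECOLI", 9.0, 50000.0, 8.0),
    Entry("PRTC_HUMAN", "HUMAN", 5.1, 20500.0, 10.0),
]
first, second, third = three_lists(entries, "ECOLI", pi=5.2, pi_range=0.5,
                                   mw=21000.0, mw_percent=10.0)
print([e.entry_id for e in third])
```

In this toy data set PRTB_ECOLI has the lowest score in the first two lists but falls outside the pI and MW window, so only PRTA_ECOLI remains in the third list.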
We have found that a top-ranked protein is likely to represent a correct
identification if it meets three conditions (figure 5A). Firstly, the same
protein, or type of protein, should appear at the top of the three lists.
Secondly, the top-ranked protein in the third list should have a score less
than 30 (indicating a "good fit" of the query protein with that database
entry). Finally, the third list should show a large score difference
between the top-ranked protein and the second ranked protein (indicating a
unique matching of the query protein with the top-ranked database entry).
For proteins from E. coli, we have shown that a score difference greater
than a factor of 2 gives high confidence that the top-ranked protein
represents the correct identity (1). If the top-ranking protein in the
results does not meet these three conditions, the correct identity is often
within the list of best-matching proteins. In such cases, the use of
AACompIdent with a protein sequence tag can provide unambiguous
identification due to the high specificity of sequence tag data (2). Figure
5B shows the result of protein identification by AA composition, pI, MW,
species and sequence tag. Note that when the sequence tag option is
selected, the AACompIdent output will show 40 amino acids of each protein's
predicted N- or C-terminal sequence instead of its description, and show an
asterisk to the left of a protein's rank if it carries the sequence tag. If
the tag is found in the displayed N- or C-terminal sequence, it will be
shown in lower case letters. We are confident that a protein from
Swiss-Prot represents a correct identification if the query protein's
empirically determined sequence tag of 3 amino acids or more is present at
the expected N- or C-terminal position, and if this protein is ranked
within the first 10 or so closest entries by amino acid composition.
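The three conditions for a confident identification by AA composition can be expressed as a small check. This is a sketch under stated assumptions: each list holds (score, entry name) pairs sorted best first, "same type of protein" is approximated by comparing the part of the Swiss-Prot entry name before the underscore, and the factor of 2 reported for E. coli is used as the default threshold for all species. The function names and example entries are hypothetical.

```python
def protein_type(entry_name):
    """Approximate the protein type by the mnemonic before the underscore."""
    return entry_name.split("_")[0]


def is_confident_match(species_list, all_list, windowed_list,
                       max_score=30.0, min_ratio=2.0):
    """Apply the three conditions to lists of (score, entry_name) pairs."""
    # Assumption: without a second-ranked entry, uniqueness cannot be judged.
    if not (species_list and all_list and len(windowed_list) >= 2):
        return False
    top_score, top_name = windowed_list[0]
    second_score = windowed_list[1][0]
    # 1. The same protein, or type of protein, tops all three lists.
    same_top = (protein_type(species_list[0][1])
                == protein_type(all_list[0][1])
                == protein_type(top_name))
    # 2. The top score in the third list indicates a good fit.
    good_fit = top_score < max_score
    # 3. The second-ranked score is larger by more than the chosen factor.
    unique = second_score > min_ratio * top_score
    return same_top and good_fit and unique


species_list = [(12.0, "PRTA_ECOLI"), (40.0, "PRTB_ECOLI")]
all_list = [(11.0, "PRTA_SALTY"), (12.0, "PRTA_ECOLI")]
windowed_list = [(12.0, "PRTA_ECOLI"), (40.0, "PRTB_ECOLI")]
print(is_confident_match(species_list, all_list, windowed_list))
```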
Comments
- Protein pI and MW in AACompIdent are calculated as described for
Compute pI/MW.
- Care must be taken in the use of estimated pI and MW from gels as part of
protein identification strategies (see documentation for TagIdent (Comments)).
- When calibration proteins are used, AACompIdent compares the experimental
composition of the protein against the theoretical composition in the
Swiss-Prot database to create a factor set. This factor set is then applied
to the experimental composition of the unknown protein before it is matched
against the Swiss-Prot database. Use of calibration proteins can increase
identification efficiency dramatically, and is advised wherever possible.
Note however that calibration proteins should be electrophoretically
prepared in the same manner as unknown proteins, and subjected to AA
analysis in parallel with unknown proteins. It is also essential that
calibration proteins be in the Swiss-Prot database.
- Protein AA composition and MW are highly conserved across species
boundaries and serve as useful parameters for cross-species protein
identification (4, 5). Protein pI is, however, poorly conserved between
species. Cross-species protein identification in AACompIdent can be done by
specifying "ALL" for the species of interest, or specifying the Swiss-Prot
species code of a well-defined organism that is closely related to the
species under study. Note that high-confidence cross-species
protein identification usually requires peptide mass or sequence data as
well as AA composition (see the MultiIdent tool).
- If you do not know either the pI or the MW, or would
like to search using only one of them, you can specify an
unrestricted window that covers all possible values of the parameter you
wish to ignore. For
example, a search where pI is set to 7.0 ± 7 units but where a MW window of
20000 ± 10% is used will return all proteins with a MW of 18000 to 22000,
regardless of their pI.
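The calibration step described in the comments above can be illustrated with a short Python sketch. The text states only that a factor set is derived by comparing the experimental and theoretical compositions of the calibration protein and is then applied to the unknown. This example assumes that each factor is the ratio of theoretical to experimental percent for one amino acid, and that the corrected composition is rescaled to sum to 100; both choices, and all names used, are assumptions rather than documented behaviour.

```python
def calibration_factors(experimental, theoretical):
    """Per-residue correction factors from a calibration protein (assumed ratio)."""
    return {aa: theoretical[aa] / experimental[aa]
            for aa in experimental if experimental[aa] > 0}


def apply_calibration(unknown, factors):
    """Correct an experimental composition and rescale it to 100 percent."""
    corrected = {aa: value * factors.get(aa, 1.0)
                 for aa, value in unknown.items()}
    total = sum(corrected.values())
    return {aa: 100.0 * value / total for aa, value in corrected.items()}


# Toy two-residue compositions in molar percent.
calib_experimental = {"Ala": 40.0, "Gly": 60.0}
calib_theoretical = {"Ala": 50.0, "Gly": 50.0}
factors = calibration_factors(calib_experimental, calib_theoretical)

unknown = {"Ala": 20.0, "Gly": 80.0}
corrected = apply_calibration(unknown, factors)
print(factors["Ala"], corrected)
```

In this toy case Ala was under-measured in the calibration protein, so the Ala value of the unknown is adjusted upwards before matching.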
References
1. Wilkins, M.R., Pasquali, C., Appel, R.D., Ou, K., Golaz, O., Sanchez, J.-C., Yan, J.X., Gooley, A.A.,
Hughes, G., Humphery-Smith, I., Williams, K.L. and Hochstrasser, D.F. (1996) From Proteins to
Proteomes: Large scale protein identification by two-dimensional electrophoresis and amino acid analysis.
Bio/Technology 14, 61-65.
2. Wilkins, M.R., Ou, K., Appel, R.D., Sanchez, J.-C., Yan, J.X., Golaz, O., Farnsworth, V., Cartier, P.,
Hochstrasser, D.F., Williams, K.L. and Gooley, A.A. (1996) Rapid protein identification using N-terminal
"sequence tag" and amino acid analysis. Biochem. Biophys. Res. Commun. 221, 609-613.
3. Golaz, O., Wilkins, M.R., Sanchez, J.-C., Appel, R.D., Hochstrasser, D.F. and Williams, K.L. (1996)
Identification of proteins by their amino acid composition: an evaluation of the method. Electrophoresis
4. Cordwell, S.J., Wilkins, M.R., Cerpa-Poljak, A., Gooley, A.A., Duncan, M., Williams, K.L. and
Humphery-Smith, I. (1995) Cross-species identification of proteins separated by two-dimensional gel
electrophoresis using matrix-assisted laser desorption time of flight mass spectrometry and amino acid
composition. Electrophoresis 16, 438-443.
5. Wilkins, M.R. and Williams, K.L. (1997) Cross-species protein identification using amino acid
composition, peptide mass fingerprinting, isoelectric point and molecular mass: a theoretical evaluation. J.
Theor. Biol. 186, 7-15.