
AACompIdent Tool

The following is an excerpt from the chapter
Protein Identification and Analysis Tools in the ExPASy Server by Wilkins et al.

published in the book
2-D Proteome Analysis Protocols (1998). Editor A.J. Link. Humana Press, New Jersey.

The AACompIdent tool (references) can identify proteins by their AA composition. The program matches the empirically measured percent AA composition of an unknown protein against the theoretical percent AA compositions of proteins in the Swiss-Prot and/or TrEMBL databases. A score, which represents the degree of difference between the composition of the unknown protein and a protein in the database, is calculated for each database entry as the sum, over all amino acids, of the squared differences between the percent AA composition of the unknown protein and that of the database entry. All proteins in the database are then ranked according to their score, from lowest (best match) to highest (worst match). Estimated protein pI and MW, as well as a species and keyword of interest, can also be used in the identification procedure.
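
The scoring and ranking described above can be sketched in a few lines of Python. This is a minimal illustration of the sum-of-squared-differences idea, not the program's actual code; the function names, the entry names, and the three-residue compositions are hypothetical.

```python
def composition_score(query, entry):
    """Sum of squared differences between two percent compositions."""
    return sum((query[aa] - entry[aa]) ** 2 for aa in query)


def rank_entries(query, database):
    """Return (score, entry_id) pairs sorted from best (lowest) to worst."""
    return sorted((composition_score(query, comp), entry_id)
                  for entry_id, comp in database.items())


# Hypothetical three-residue compositions (percent), for illustration only.
query = {"Ala": 50.0, "Gly": 30.0, "Ser": 20.0}
database = {
    "PROT_A": {"Ala": 50.0, "Gly": 30.0, "Ser": 20.0},
    "PROT_B": {"Ala": 48.0, "Gly": 31.0, "Ser": 21.0},
    "PROT_C": {"Ala": 40.0, "Gly": 40.0, "Ser": 20.0},
}
ranking = rank_entries(query, database)
```

Here PROT_A scores 0 (a perfect match), PROT_B scores 6 and PROT_C scores 200, so the entries are ranked in that order.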

Basic Use of the AACompIdent Tool

After selecting the AACompIdent tool, you must first choose the relevant AA constellation to use in matching. For AA compositions determined by standard methods, use Constellation 2. This constellation is for 16 AAs (Asx, Glx, Ser, His, Gly, Thr, Ala, Pro, Tyr, Arg, Val, Met, Ile, Leu, Phe, Lys), does not consider Cys or Trp, and calculates Asn and Asp together as Asx, and Glu and Gln together as Glx.

You should then specify the e-mail address to which the results should be sent, then scroll down to the "Unknown Protein" field. Here you should specify a name for the search that will appear as the subject of the e-mail message, the protein pI and MW estimated from the 2-D gel as well as error ranges that reflect the accuracy of these estimates (see comments 6.), the Swiss-Prot abbreviation for the species of interest (a list of species abbreviations is available, and this list can easily be searched using the Netscape "Find" function in the "Edit" pull-down menu), and if desired a Swiss-Prot keyword (list of keywords used in Swiss-Prot). If desired, matching can also be done against all species in the database by specifying "ALL". Then specify the experimentally determined AA composition of the protein, with compositional data expressed as molar percent.

If you have analysed a calibration protein in parallel with unknowns as part of your AA analysis procedure, the composition of this protein can be used to compensate for error inherent to the AA analysis procedure. To do this, go to the "Calibration Protein" field, specify the Swiss-Prot ID name for the protein (e.g. ALBU_BOVIN for bovine serum albumin) and enter the experimentally determined AA composition of the protein, with data expressed as molar percent. Finally, select the "Search" button to submit the data to the ExPASy server. Results will be sent to your e-mail address.
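
To make the Constellation 2 grouping concrete, the sketch below computes a theoretical molar percent composition from a one-letter protein sequence. It assumes that percentages are normalised over the 16 categories only, so that Cys and Trp do not contribute to the total; the text does not state how the server handles this. The function and variable names are hypothetical.

```python
from collections import Counter

# One-letter code -> Constellation 2 category; Cys (C) and Trp (W) are omitted.
CONSTELLATION_2 = {
    "D": "Asx", "N": "Asx", "E": "Glx", "Q": "Glx",
    "S": "Ser", "H": "His", "G": "Gly", "T": "Thr",
    "A": "Ala", "P": "Pro", "Y": "Tyr", "R": "Arg",
    "V": "Val", "M": "Met", "I": "Ile", "L": "Leu",
    "F": "Phe", "K": "Lys",
}
CATEGORIES = sorted(set(CONSTELLATION_2.values()))  # the 16 categories


def constellation2_percent(sequence):
    """Molar percent per category; residues outside the table are ignored."""
    counts = Counter(CONSTELLATION_2[aa] for aa in sequence.upper()
                     if aa in CONSTELLATION_2)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no Constellation 2 residues")
    # Assumption: normalise over the counted residues only.
    return {cat: 100.0 * counts[cat] / total for cat in CATEGORIES}


# Toy sequence: D and N pool into Asx, E and Q into Glx, C and W are skipped.
composition = constellation2_percent("DNEQCWAAAA")
```

For the toy sequence, eight residues are counted, giving 25% Asx, 25% Glx and 50% Ala, with every other category at 0%.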

Use of the AACompIdent Tool with sequence tags

Protein samples from 2-D gels can be submitted to Edman protein sequencing to create a "sequence tag" of 3 or 4 amino acids, after which the same protein sample can be used for AA composition analysis (2). This approach provides protein identification of higher confidence than identification by amino acid composition analysis alone. To use AA composition and sequence tag data together for protein identification, fill in the AACompIdent form as described above but do not immediately submit it to ExPASy. Go to the bottom of the form, select the tagging option by clicking in the small box, and enter a protein sequence tag of up to 6 amino acids in single AA code into the "Tag" text field. Finally, specify if the sequence tag is N- or C-terminal, and select the "Search" button to submit the data to the ExPASy server. Results will be sent to your e-mail address.
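
As an illustration of how a sequence tag might be checked against a database entry, the sketch below looks for the tag in the N- or C-terminal region of a sequence and returns that region with the tag in lower case. The function name is hypothetical, and treating the terminal region as a fixed window (40 residues by default here) is an assumption made for the example.

```python
def find_tag(sequence, tag, terminus="N", window=40):
    """Return the terminal region with the tag in lower case, or None."""
    sequence, tag = sequence.upper(), tag.upper()
    if not 1 <= len(tag) <= 6:
        raise ValueError("tag must be 1 to 6 residues long")
    if terminus == "N":
        region = sequence[:window]
        pos = region.find(tag)    # occurrence closest to the N-terminus
    elif terminus == "C":
        region = sequence[-window:]
        pos = region.rfind(tag)   # occurrence closest to the C-terminus
    else:
        raise ValueError("terminus must be 'N' or 'C'")
    if pos < 0:
        return None
    return region[:pos] + tag.lower() + region[pos + len(tag):]


# Toy sequences, for illustration only.
n_hit = find_tag("MKWVTFISLLLLFSSAYS", "KWVT")
c_hit = find_tag("MKWVTFISLL", "SLL", terminus="C", window=5)
miss = find_tag("MKWV", "AAA")
```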

Interpretation of AACompIdent results

The output of AACompIdent contains three lists of proteins ranked according to their AA score (figures 5A and 5B). The first list contains the results from matching the AA composition of the query protein against all proteins from the species of interest, but without considering the specified pI and MW. The second list shows the result of matching the AA composition of the query protein against all proteins from all species in Swiss-Prot, again without considering pI and MW. The third list contains the results of matching the AA composition of the query protein only against the proteins from the species of interest that lie within the specified pI and MW range. This is the most powerful search. In all lists, a score of 0 indicates a perfect match between the query protein and a protein in the database, with larger scores indicating increasing difference.

We have found that a top-ranked protein is likely to represent a correct identification if it meets three conditions (figure 5A). Firstly, the same protein, or type of protein, should appear at the top of the three lists. Secondly, the top-ranked protein in the third list should have a score less than 30 (indicating a "good fit" of the query protein with that database entry). Finally, the third list should show a large score difference between the top-ranked protein and the second-ranked protein (indicating a unique matching of the query protein with the top-ranked database entry). For proteins from E. coli, we have shown that a score difference greater than a factor of 2 gives high confidence that the top-ranked protein represents the correct identity (1). If the top-ranking protein in the results does not meet these three conditions, the correct identity is often within the list of best-matching proteins. In such cases, the use of AACompIdent with a protein sequence tag can provide unambiguous identification due to the high specificity of sequence tag data (2). Figure 5B shows the result of protein identification by AA composition, pI, MW, species and sequence tag. Note that when the sequence tag option is selected, the AACompIdent output will show 40 amino acids of each protein's predicted N- or C-terminal sequence instead of its description, and show an asterisk to the left of a protein's rank if it carries the sequence tag. If the tag is found in the displayed N- or C-terminal sequence, it will be shown in lower case letters. We are confident that a protein from Swiss-Prot represents a correct identification if the query protein's empirically determined sequence tag of 3 amino acids or more is present at the expected N- or C-terminal position, and if this protein is ranked within the first 10 or so closest entries by amino acid composition.
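
The three conditions can be written down as a simple check over the three ranked lists. The sketch below is only an illustration: the function name and list layout are hypothetical, the "same protein, or type of protein" test is simplified to comparing names supplied by the caller, and the factor of 2 is the value reported above for E. coli, so it may not suit other organisms.

```python
def is_confident_match(species_list, all_species_list, window_list,
                       max_score=30.0, min_ratio=2.0):
    """Apply the three rule-of-thumb conditions to ranked (score, name) lists.

    Each list is sorted from best (lowest score) to worst. window_list is
    the third list: species of interest within the pI and MW range.
    """
    if not species_list or not all_species_list or len(window_list) < 2:
        return False
    top_score, top_name = window_list[0]
    second_score = window_list[1][0]
    # 1. The same protein heads all three lists (simplified to name equality).
    same_top = species_list[0][1] == top_name == all_species_list[0][1]
    # 2. The top score in the third list indicates a good fit.
    good_fit = top_score < max_score
    # 3. The runner-up scores more than min_ratio times the top score.
    well_separated = second_score > min_ratio * top_score
    return same_top and good_fit and well_separated


# Hypothetical ranked lists, for illustration only.
list1 = [(12.0, "ALBU"), (40.0, "PROT_X")]
list2 = [(12.0, "ALBU"), (35.0, "PROT_Y")]
list3 = [(12.0, "ALBU"), (40.0, "PROT_X")]
confident = is_confident_match(list1, list2, list3)
ambiguous = is_confident_match(list1, list2, [(12.0, "ALBU"), (20.0, "PROT_X")])
```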


Comments

  1. Protein pI and MW in AACompIdent are calculated as described for Compute pI / MW.
  2. Care must be taken in the use of estimated pI and MW from gels as part of protein identification strategies (see documentation for TagIdent (Comments)).
  3. When calibration proteins are used, AACompIdent compares the experimental composition of the protein against the theoretical composition in the Swiss-Prot database to create a factor set. This factor set is then applied to the experimental composition of the unknown protein before it is matched against the Swiss-Prot database. Use of calibration proteins can increase identification efficiency dramatically, and is advised wherever possible. Note however that calibration proteins should be electrophoretically prepared in the same manner as unknown proteins, and subjected to AA analysis in parallel with unknown proteins. It is also essential that calibration proteins be in the Swiss-Prot database.
  4. Protein AA composition and MW are highly conserved across species boundaries and serve as useful parameters for cross-species protein identification (4, 5). Protein pI is, however, poorly conserved between species. Cross-species protein identification in AACompIdent can be done by specifying "ALL" for the species of interest, or by specifying the Swiss-Prot species code of a well-defined organism that is closely related to the species under study. It must be noted that high-confidence cross-species protein identification usually requires peptide mass data or sequence data as well as AA composition (see the MultiIdent Tool).
  5. If you do not know one of the pI or MW parameters, or would like to search using only one of them, you can specify an unrestricted window for the other parameter so that it covers all possibilities. For example, a search where pI is set to 7.0 ± 7 units but where a MW window of 20000 ± 10% is used will return all proteins of MW 18000 to 22000, regardless of their pI.
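
Two of the comments above lend themselves to a short sketch: the calibration factor set (comment 3) and the pI and MW windows (comment 5). The code below is an illustration under stated assumptions, not the server's implementation. In particular, defining each factor as theoretical divided by experimental percent, and renormalising the corrected composition to 100%, are assumptions; all function names are hypothetical.

```python
def calibration_factors(experimental, theoretical):
    """Assumed factor per residue: theoretical / experimental percent."""
    return {aa: theoretical[aa] / experimental[aa]
            for aa in experimental if experimental[aa] > 0}


def apply_factors(unknown, factors):
    """Scale the unknown's composition, then renormalise to 100 percent."""
    scaled = {aa: value * factors.get(aa, 1.0) for aa, value in unknown.items()}
    total = sum(scaled.values())
    return {aa: 100.0 * value / total for aa, value in scaled.items()}


def in_window(pi, mw, pi_centre, pi_range, mw_centre, mw_percent):
    """True if an entry lies inside both the pI and the MW window."""
    mw_delta = mw_centre * mw_percent / 100.0
    return (abs(pi - pi_centre) <= pi_range
            and mw_centre - mw_delta <= mw <= mw_centre + mw_delta)


# Hypothetical two-residue calibration data (percent), for illustration only.
factors = calibration_factors({"Ala": 40.0, "Gly": 60.0},
                              {"Ala": 50.0, "Gly": 50.0})
corrected = apply_factors({"Ala": 40.0, "Gly": 60.0}, factors)

# The example from comment 5: pI 7.0 +/- 7 is unrestricted, MW 20000 +/- 10%.
kept = in_window(4.5, 21000, 7.0, 7.0, 20000, 10)
dropped = in_window(10.2, 25000, 7.0, 7.0, 20000, 10)
```

With these toy numbers, an unknown showing the same measurement bias as the calibration protein is corrected back to 50% Ala and 50% Gly. In the window example, the 21000 MW entry is kept whatever its pI, while the 25000 MW entry falls outside the 18000 to 22000 range.
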
References

  1. Wilkins, M.R., Pasquali, C., Appel, R.D., Ou, K., Golaz, O., Sanchez, J.-C., Yan, J.X., Gooley, A.A., Hughes, G., Humphery-Smith, I., Williams, K.L. and Hochstrasser, D.F. (1996) From Proteins to Proteomes: Large scale protein identification by two-dimensional electrophoresis and amino acid analysis. Bio/Technology 14, 61-65.
  2. Wilkins, M.R., Ou, K., Appel, R.D., Sanchez, J.-C., Yan, J.X., Golaz, O., Farnsworth, V., Cartier, P., Hochstrasser, D.F., Williams, K.L. and Gooley, A.A. (1996) Rapid protein identification using N-terminal "sequence tag" and amino acid analysis. Biochem. Biophys. Res. Commun. 221, 609-613.
  3. Golaz, O., Wilkins, M.R., Sanchez, J.-C., Appel, R.D., Hochstrasser, D.F. and Williams, K.L. (1996) Identification of proteins by their amino acid composition: an evaluation of the method. Electrophoresis 17, 573-579.
  4. Cordwell, S.J., Wilkins, M.R., Cerpa-Poljak, A., Gooley, A.A., Duncan, M., Williams, K.L. and Humphery-Smith, I. (1995) Cross-species identification of proteins separated by two-dimensional gel electrophoresis using matrix-assisted laser desorption time of flight mass spectrometry and amino acid composition. Electrophoresis 16, 438-443.
  5. Wilkins, M.R. and Williams, K.L. (1997) Cross-species protein identification using amino acid composition, peptide mass fingerprinting, isoelectric point and molecular mass: a theoretical evaluation. J. Theor. Biol. 186, 7-15.