Using functional annotations to assess the coherence of a protein set

A web supplement to the work published in:
Chagoyen M, Carazo JM and Pascual-Montano A.
Assessment of protein set coherence using functional annotations
[BMC Bioinformatics 2008, 9:444]


Computational analysis of systematic experiments frequently produces one or more sets of genes or proteins. Several methods exist for the functional interpretation and validation of such sets. Nevertheless, little attention has been paid to assessing the coherence of those sets using functional annotations.

Supplementary data files

Positive sets: Random sets:
  • Figure (coherence score and significance measures) [PDF file]
Matlab code:

  • To measure the degree of coherence of a protein set based on the global similarity of their functional annotations.
  • To assess the statistical significance of that coherence in the context of a reference set.


To evaluate our methodology we analyzed both positive and random sets in the context of the Saccharomyces cerevisiae genome (
Positive sets correspond to macromolecular complexes, cellular components and proteins participating in the same pathway. These sets were compiled from:

  1. Gene Ontology cellular component subontology (release 12/2007)
  2. MIPS complex catalogue (release 18-05-2006)
  3. KEGG Pathway database (downloaded 17-12-2008)

As the catalogue of MIPS complexes comprises both curated data and results from systematic analyses [1-3], we analysed these two datasets separately.
For each set we computed the coherence score in terms of GO biological processes (release 12/2007), together with significance measures in the context of the whole S. cerevisiae genome.

Method overview:
  1. Each protein is represented as an n-dimensional vector, where each dimension corresponds to one of the n functional terms used in the reference set. Both direct GO term associations and their ancestors are used in this representation.
  2. The similarity of two proteins is computed using their functional representations.
  3. The coherence score of a set is defined as the mean similarity of all protein pairs (excluding self-similarities).
  4. The statistical significance of this score is assessed with the hypergeometric test. For that purpose, three neighborhoods are established for the set, and therefore three distinct p-values are calculated.
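
The four steps above can be illustrated with a minimal Python sketch. It is not the Matlab code distributed with this supplement, and several details are assumptions made only for illustration: the toy ontology in PARENTS, the binary term weighting, and the use of cosine similarity (this page does not say which similarity measure is used). The function hypergeom_tail only shows the generic upper-tail hypergeometric probability. How the three neighborhoods are defined is not described here, so the mapping of its arguments to a reference set and a neighborhood is likewise hypothetical.

```python
from itertools import combinations
from math import comb, sqrt

# Hypothetical toy ontology (assumption): term -> set of direct parent terms.
PARENTS = {
    "GO:a": set(),
    "GO:b": {"GO:a"},
    "GO:c": {"GO:b"},
    "GO:d": {"GO:a"},
}


def with_ancestors(terms, parents=PARENTS):
    """Return the given terms together with all of their ancestors."""
    seen, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen


def to_vector(terms, vocabulary):
    """Step 1: binary vector with one dimension per term in the vocabulary."""
    expanded = with_ancestors(terms)
    return [1.0 if term in expanded else 0.0 for term in vocabulary]


def cosine(u, v):
    """Step 2 (assumed measure): cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def coherence(vectors):
    """Step 3: mean similarity over all distinct pairs (no self-pairs)."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("a set needs at least two proteins")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def hypergeom_tail(k, N, K, n):
    """Step 4 (generic form): P(X >= k) when drawing n items without
    replacement from a population of N items, K of which are 'successes'."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


# Usage: three hypothetical proteins annotated with direct terms only.
vocabulary = sorted(PARENTS)
annotations = [{"GO:c"}, {"GO:b"}, {"GO:d"}]
vectors = [to_vector(terms, vocabulary) for terms in annotations]
score = coherence(vectors)
```

Including ancestor terms means that two proteins annotated with different but related terms still share dimensions, so their similarity is greater than zero even when their direct annotations do not overlap.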


[1] Gavin, A.C., et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes, Nature, 415, 141-147.

[2] Ho, Y., et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry, Nature, 415, 180-183.

[3] Krogan, N.J., et al. (2004) High-definition macromolecular composition of yeast RNA-processing complexes, Mol Cell, 13, 225-239.


Contact: monica.chagoyen [at]