How to...

prepare input data (for sim_score.m)


A web supplement to the work published in:
Chagoyen M, Carazo JM and Pascual-Montano A.
Assessment of protein set coherence using functional annotations
[BMC Bioinformatics 2008, 9:444]



To compute the overall functional coherence score of a protein set and its statistical significance given a reference set, you need to construct a protein x protein functional similarity matrix for the whole reference set. Below are two examples of how to obtain this matrix.

1. Obtain the annotation matrix (Am) of the reference set:
Am is a binary matrix of (proteins x functional terms), where known functional associations are represented as 1 and all other (protein, term) pairs are represented as 0. For hierarchical functional schemes (e.g. Gene Ontology, FunCat, ...), both the directly associated terms and their ancestor terms can be used to build Am.

You also need to keep track of the indices of the rows corresponding to your protein set (set).
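
As an illustration of step 1, here is a minimal sketch that builds Am from a list of (protein, term) pairs. The protein names, term names, the pairs themselves and the ancestor matrix Anc are all hypothetical toy data, and the ancestor propagation is only one possible way to include ancestor terms:

```matlab
% Hypothetical toy data: 4 proteins and 3 terms (names are illustrative only)
proteins = {'P1','P2','P3','P4'};
terms    = {'T1','T2','T3'};
% Each row is one known (protein, term) association
pairs = {'P1','T1'; 'P1','T2'; 'P2','T2'; 'P3','T3'; 'P4','T1'; 'P4','T3'};

[~, r] = ismember(pairs(:,1), proteins);   % row index of each protein
[~, c] = ismember(pairs(:,2), terms);      % column index of each term

Am = zeros(numel(proteins), numel(terms));
Am(sub2ind(size(Am), r, c)) = 1;           % known associations become 1

% Assumption: Anc(i,j) = 1 if term j is term i itself or one of its ancestors.
% Here T1 is taken to be the parent of T2.
Anc = eye(numel(terms));
Anc(2,1) = 1;
Am = double(Am*Anc > 0);                   % add ancestor terms to each protein

% Row indices of the protein set to analyze (here P1 and P4)
[~, set] = ismember({'P1','P4'}, proteins);
ng = size(Am, 1);                          % size of the reference set
```

In this toy case P2 is directly annotated with T2 only, and gains T1 through the ancestor step.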

2. Now you can apply a weight to each functional term to account for its specificity/generality. E.g., the Information Content (IC) of a term is inversely related to its probability of annotation in the reference set, and is computed as:
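f=sum(Am,1);           % number of annotated proteins per term (column sums)
ic=-log(f./sum(f));    % IC weight per term: rarer terms get larger values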
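
A small worked example may help here. The sketch below uses a hypothetical 4 x 4 annotation matrix whose last term is never annotated. Such a term has f = 0 and therefore an infinite IC, so this sketch drops unused terms before computing the weights; that filtering step is an assumption of the example, not part of the original procedure:

```matlab
% Hypothetical toy annotation matrix: 4 proteins x 4 terms
Am = [1 1 0 0;
      1 1 0 0;
      0 0 1 0;
      1 0 1 0];             % the 4th term has no annotations

used = any(Am, 1);          % terms with at least one annotated protein
Am   = Am(:, used);         % drop unused terms (they would give ic = Inf)

f  = sum(Am, 1);            % annotation counts per term: [3 2 2]
ic = -log(f ./ sum(f));     % the most frequent term gets the smallest weight
```

Here the first term is annotated to three proteins and receives a lower IC than the other two terms, which are annotated to two proteins each.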

3. Finally, compute the similarity matrix (Msim) of the whole reference set:
  • Using cosine similarity with IC weights:
Msim=squareform(1-pdist(Am*diag(ic),'cosine'));
  • Alternatively, you can obtain a straightforward Jaccard similarity (based on the binary annotation matrix):
Msim=squareform(1-pdist(Am,'jaccard'));
IMPORTANT NOTE: in both cases, the diagonal of Msim contains all zeros, because squareform fills the diagonal with zeros (although conceptually, functional self-similarity should be equal to 1).
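
The following sketch runs both variants of step 3 on a hypothetical 4 x 3 annotation matrix and checks the properties described above. It assumes that pdist and squareform are available (Statistics Toolbox); the toy matrix and the names Msim_cos, Msim_jac, zeroDiag and isSym are illustrative only:

```matlab
% Hypothetical toy annotation matrix: 4 proteins x 3 terms
Am = [1 1 0;
      1 1 0;
      0 0 1;
      1 0 1];

f  = sum(Am, 1);
ic = -log(f ./ sum(f));                       % IC weights, as in step 2

% Cosine similarity on IC-weighted annotations
Msim_cos = squareform(1 - pdist(Am*diag(ic), 'cosine'));
% Jaccard similarity on the binary annotations
Msim_jac = squareform(1 - pdist(Am, 'jaccard'));

% Sanity checks: zero diagonal and symmetry
zeroDiag = all(diag(Msim_cos) == 0) && all(diag(Msim_jac) == 0);
isSym    = isequal(Msim_cos, Msim_cos.') && isequal(Msim_jac, Msim_jac.');
```

Proteins 1 and 2 have identical annotations, so their off-diagonal similarity is 1 in both matrices, while the diagonal itself stays at 0.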

4. Now you are ready to use sim_score.m, with the following input: Msim as the similarity matrix, the indices of the protein set to analyze (set), and the size of the reference set (ng).


Send comments to: monica.chagoyen [at] cnb.csic.es