To compute the overall functional coherence score of a protein set and its
statistical significance given a reference set, you need to construct a
protein x protein functional similarity matrix for the whole reference
set. Two examples of how to obtain this matrix are given below.

1. Obtain the annotation matrix (Am) of the reference set. Am is a binary
   (proteins x functional terms) matrix in which known functional
   associations are represented as 1 and all other (protein, term) pairs
   as 0. For hierarchical functional schemes (e.g. Gene Ontology, FunCat,
   ...), both the directly associated terms and their ancestor terms can be
   used to build Am. You also need to keep track of the indices of the rows
   corresponding to your protein set (set).

2. You can now apply a weight to each functional term to account for its
   specificity/generality. For example, the Information Content (IC) of a
   term is inversely related to its probability of annotation in the
   reference set, and is computed as:

   f=sum(Am); ic=-log(f./sum(f));

3. Finally, compute the similarity matrix (Msim) of the whole reference set,
   using either of the following:

   Cosine similarity on the IC-weighted annotations:
   Msim=squareform(1-pdist(Am*diag(ic),'cosine'));

   Jaccard similarity on the unweighted annotations:
   Msim=squareform(1-pdist(Am,'jaccard'));

IMPORTANT NOTE: in both cases, the diagonal of Msim contains only zeros
(although, conceptually, functional self-similarity should equal 1).

4. You are now ready to run sim_score.m, using as input Msim as the
   similarity matrix, the indices of the protein set to analyze (set), and
   the size of the reference set (ng).
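
The steps above can be sketched end to end on a toy reference set. This is
only an illustration: the 4 x 3 matrix, the chosen protein indices, and the
names Msim_cos and Msim_jac are made up for the example, and it assumes that
pdist and squareform are available (Statistics and Machine Learning Toolbox).
The argument order of sim_score.m is not stated here, so the final call is
left as a comment.

```matlab
% Toy annotation matrix: 4 proteins (rows) x 3 functional terms (columns).
Am = [1 1 0;
      1 0 0;
      1 1 1;
      0 1 1];

set = [1 3];        % rows of the protein set to analyze (shadows built-in set)
ng  = size(Am, 1);  % size of the reference set

% Step 2: Information Content of each term.
% Assumes every term annotates at least one protein; a column with f == 0
% gives ic = Inf and NaN values after weighting, so drop such columns first.
f  = sum(Am, 1);
ic = -log(f ./ sum(f));

% Step 3, example 1: cosine similarity on IC-weighted annotations.
Msim_cos = squareform(1 - pdist(Am * diag(ic), 'cosine'));

% Step 3, example 2: Jaccard similarity on unweighted annotations.
Msim_jac = squareform(1 - pdist(Am, 'jaccard'));

% squareform fills the diagonal with zeros, as stated in the note above.
disp(diag(Msim_cos)');
disp(diag(Msim_jac)');

% Step 4: pass one of the matrices, set and ng to sim_score.m
% (check sim_score.m for the exact argument order and outputs).
```

Both matrices are ng x ng and symmetric. Proteins with no annotations at all
(all-zero rows of Am) would produce NaN similarities, so it may be worth
removing them from the reference set before building Am.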