Motivation: A complete characterization of biological processes at the genomic and the proteomic level requires the combination of numerous aspects, among which functional information is one of the most difficult to automatically acquire and interpret.
The aim is to discover biological topics in the biomedical literature that are relevant to sets of genes/proteins and that allow functional associations to be inferred.
Since the literature covers all aspects of biology, chemistry, and medicine, there is almost no limit to the types of information that may be recovered through careful and exhaustive mining [1]. Data mining techniques that extract biological patterns for large lists of genes from the biomedical literature are therefore useful tools for interpreting experimental data and deriving new biological knowledge.
In this work we present a method for extracting common biological topics, in the form of semantic features, from the biomedical literature associated with sets of genes/proteins. This characterization of topics provides the means to associate genes with semantic profiles that indicate their functional role, and to establish functional relationships among genes.
Our approach applies non-negative matrix factorization (NMF), an algorithm capable of identifying local patterns that exist in only a sub-portion of the data [2]. NMF was originally applied to image and text analysis and has since been used to analyse gene expression data [3, 4], sequence data [5] and gene functional annotations [6].
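To make the idea concrete, the following is a minimal, dependency-free sketch of NMF applied to a toy gene-by-term matrix. It is an illustration under stated assumptions, not the implementation used in this work: the gene names, the terms, the counts, the number of factors `k`, and the choice of multiplicative update rules for the squared-error objective are all hypothetical, since the text does not specify how the matrix is built or which NMF variant is used. Rows of `h` play the role of semantic features (weights over terms), and rows of `w` play the role of per-gene semantic profiles.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def nmf(v, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate non-negative v (genes x terms) as w (genes x k) times h (k x terms).

    Uses multiplicative updates for the squared-error objective, which keep
    every entry of w and h non-negative (an assumed, commonly used variant).
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # h <- h * (w^T v) / (w^T w h), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # w <- w * (v h^T) / (w h h^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


def squared_error(v, w, h):
    """Sum of squared differences between v and the reconstruction w times h."""
    wh = matmul(w, h)
    return sum((a - b) ** 2 for rv, rr in zip(v, wh) for a, b in zip(rv, rr))


# Hypothetical toy data: term counts in the literature linked to each gene.
genes = ["geneA", "geneB", "geneC", "geneD"]
terms = ["kinase", "phosphorylation", "ribosome", "translation"]
v = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 4],
]

w, h = nmf(v, k=2)

# Each row of h is a semantic feature: report its most heavily weighted term.
for idx, feature in enumerate(h):
    top_term = terms[max(range(len(terms)), key=lambda j: feature[j])]
    print(f"feature {idx}: top term = {top_term}")

# Each row of w is a gene's semantic profile over the features.
for gene, profile in zip(genes, w):
    print(gene, [round(x, 2) for x in profile])
```

Because the factors are constrained to be non-negative, each feature can only add terms to a gene's description, which is what makes the features readable as topics that cover only part of the data. Genes whose profiles load on the same feature are candidates for a functional relationship.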
We have applied our method to two datasets in order to test its performance.
[1] Shatkay, H. and R. Feldman, Mining the biomedical literature in the genomic era: An overview. Journal of Computational Biology, 2003. 10(6): p. 821-855.
[2] Lee, D.D. and H.S. Seung, Learning the parts of objects by non-negative matrix factorization. Nature, 1999. 401(6755): p. 788-91.
[3] Kim, P.M. and B. Tidor, Subsystem identification through dimensionality reduction of large-scale gene expression data. Genome Res, 2003. 13(7): p. 1706-18.
[4] Brunet, J.P., et al., Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci U S A, 2004. 101(12): p. 4164-9.
[5] Heger, A. and L. Holm, Sensitive pattern discovery with 'fuzzy' alignments of distantly related proteins. Bioinformatics, 2003. 19 Suppl 1: p. i130-9.
[6] Pehkonen, P., et al., Theme discovery from gene lists for identification and viewing of multiple functional groups. BMC Bioinformatics, 2005. 6: p. 162.
[7] Homayouni, R., et al., Gene clustering by latent semantic indexing of MEDLINE abstracts. Bioinformatics, 2005. 21(1): p. 104-15.