Motivation:

Supplementary data files
Datasets: (gene/PMIDs files)
Results:
SGD8 data:
Reelin data:

Objective:
To discover biological topics in the biomedical literature relevant to sets of genes/proteins, from which functional associations can be inferred.
Abstract:
Since the literature covers all aspects of biology, chemistry, and medicine, there is almost no limit to the types of information that may be recovered through careful and exhaustive mining [1]. Therefore, data mining techniques that can extract biological patterns for large lists of genes from the biomedical literature are very useful tools for interpreting experimental data and deriving new biological knowledge. In this work we present a method for extracting common biological topics from the biomedical literature associated with sets of genes/proteins, in the form of semantic features. This characterization of topics provides the means to associate genes with semantic profiles that indicate their functional role, and to establish functional relationships among genes. Our approach applies non-negative matrix factorization (NMF), a machine-learning algorithm capable of identifying local patterns that exist in only a sub-portion of the data [2]. NMF was originally applied to image and text analysis and has more recently been used to analyse gene expression data [3, 4], sequence data [5] and gene functional annotations [6]. We have applied our method to two datasets in order to test its performance.
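
To illustrate the kind of factorization the abstract refers to, the following is a minimal sketch of NMF on a small gene-by-term matrix. It is not the authors' implementation: the toy matrix, the gene and term names, the function name nmf and the choice of squared-error multiplicative updates (in the style of Lee and Seung) are all assumptions made for illustration only.

```python
import numpy as np

def nmf(V, k, n_iter=500, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (genes x terms) as W @ H.

    Uses multiplicative updates for the squared-error objective.
    W holds one k-dimensional profile per gene; each row of H is a
    'semantic feature', i.e. a non-negative weighting over terms.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # eps in the denominators guards against division by zero
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy data: term counts in the literature linked to each gene.
genes = ["geneA", "geneB", "geneC", "geneD"]
terms = ["kinase", "phosphorylation", "ribosome", "translation"]
V = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 6, 5],
    [1, 0, 5, 6],
], dtype=float)

W, H = nmf(V, k=2)

# Highest-weighted terms of each semantic feature.
for f in range(H.shape[0]):
    top = [terms[j] for j in np.argsort(H[f])[::-1][:2]]
    print(f"semantic feature {f}: {top}")

# Semantic profile of each gene: its weights over the features.
for gene, profile in zip(genes, W):
    print(gene, np.round(profile, 2))
```

In this sketch, genes whose rows of W load on the same feature would be the candidates for a functional relationship. How the gene-document and term weights are actually built in the method described here is not specified in this section.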
References:
[1] Shatkay, H. and R. Feldman, Mining the biomedical literature in the genomic era: An overview. Journal of Computational Biology, 2003. 10(6): p. 821-855.
[2] Lee, D.D. and H.S. Seung, Learning the parts of objects by non-negative matrix factorization. Nature, 1999. 401(6755): p. 788-91.
[3] Kim, P.M. and B. Tidor, Subsystem identification through dimensionality reduction of large-scale gene expression data. Genome Res, 2003. 13(7): p. 1706-18.
[4] Brunet, J.P., et al., Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci U S A, 2004. 101(12): p. 4164-9.
[5] Heger, A. and L. Holm, Sensitive pattern discovery with 'fuzzy' alignments of distantly related proteins. Bioinformatics, 2003. 19 Suppl 1: p. i130-9.
[6] Pehkonen, P., et al., Theme discovery from gene lists for identification and viewing of multiple functional groups. BMC Bioinformatics, 2005. 6: p. 162.
[7] Homayouni, R., et al., Gene clustering by latent semantic indexing of MEDLINE abstracts. Bioinformatics, 2005. 21(1): p. 104-15.