dorsal/arxiv
Motif Discovery through Predictive Modeling of Gene Regulation
| Authors | Manuel Middendorf, Anshul Kundaje, Mihir Shah, Yoav Freund, Chris H. Wiggins, Christina Leslie |
|---|---|
| Categories | q-bio.GN |
| ArXiv ID | q-bio/0701021 |
| URL | https://arxiv.org/abs/q-bio/0701021 |
| DOI | 10.1007/11415770_41 |
| Journal | Research in Computational Molecular Biology 2005 |
Abstract
We present MEDUSA, an integrative method for learning motif models of transcription factor binding sites by incorporating promoter sequence and gene expression data. We use a modern large-margin machine learning approach, based on boosting, to enable feature selection from the high-dimensional search space of candidate binding sequences while avoiding overfitting. At each iteration of the algorithm, MEDUSA builds a motif model whose presence in the promoter region of a gene, coupled with activity of a regulator in an experiment, is predictive of differential expression. In this way, we learn motifs that are functional and predictive of regulatory response rather than motifs that are simply overrepresented in promoter sequences. Moreover, MEDUSA produces a model of the transcriptional control logic that can predict the expression of any gene in the organism, given the sequence of the promoter region of the target gene and the expression state of a set of known or putative transcription factors and signaling molecules. Each motif model is either a $k$-length sequence, a dimer, or a PSSM that is built by agglomerative probabilistic clustering of sequences with similar boosting loss. By applying MEDUSA to a set of environmental stress response expression data in yeast, we learn motifs whose ability to predict differential expression of target genes outperforms motifs from the TRANSFAC dataset and from a previously published candidate set of PSSMs. We also show that MEDUSA retrieves many experimentally confirmed binding sites associated with environmental stress response from the literature.
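The core idea in the abstract, a weak rule that fires when a motif is present in a gene's promoter and a regulator is active in the experiment, can be illustrated with a toy boosting loop. The following is a minimal sketch under stated assumptions, not the MEDUSA implementation: it uses plain k-mers only (no dimers or PSSMs), a confidence-rated AdaBoost update with abstaining rules, and hypothetical names (`kmers`, `fit_rules`, `predict`) and toy data that do not come from the paper.

```python
import math


def kmers(seq, k):
    """Return the set of all length-k substrings of a promoter sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fit_rules(promoters, active, labels, k=3, rounds=2, eps=1e-6):
    """Toy boosting over (motif present AND regulator active) rules.

    promoters: gene -> promoter sequence (hypothetical toy input)
    active:    experiment -> set of active regulator names
    labels:    (gene, experiment) -> +1 (up-regulated) or -1 (down-regulated)
    Returns a list of (motif, regulator, alpha) rules.
    """
    examples = sorted(labels)
    weights = {ex: 1.0 / len(examples) for ex in examples}
    present = {gene: kmers(seq, k) for gene, seq in promoters.items()}
    motifs = sorted(set().union(*present.values()))
    regs = sorted(set().union(*active.values()))
    model = []
    for _ in range(rounds):
        best = None
        for motif in motifs:
            for reg in regs:
                # Weighted mass of up/down examples on which the rule fires.
                w_up = w_down = 0.0
                for gene, exp in examples:
                    if motif in present[gene] and reg in active[exp]:
                        if labels[(gene, exp)] > 0:
                            w_up += weights[(gene, exp)]
                        else:
                            w_down += weights[(gene, exp)]
                # Boosting loss bound for a rule that abstains when it does not fire.
                z = (1.0 - w_up - w_down) + 2.0 * math.sqrt(w_up * w_down)
                if best is None or z < best[0]:
                    best = (z, motif, reg, w_up, w_down)
        _, motif, reg, w_up, w_down = best
        # Smoothed vote: positive predicts up-regulation, negative predicts down.
        alpha = 0.5 * math.log((w_up + eps) / (w_down + eps))
        model.append((motif, reg, alpha))
        # Reweight only the examples on which the chosen rule fired, then normalize.
        for gene, exp in examples:
            if motif in present[gene] and reg in active[exp]:
                weights[(gene, exp)] *= math.exp(-alpha * labels[(gene, exp)])
        total = sum(weights.values())
        weights = {ex: w / total for ex, w in weights.items()}
    return model


def predict(model, promoter, active_regs, k=3):
    """Sum the votes of all rules that fire; the sign is the predicted direction."""
    present = kmers(promoter, k)
    return sum(a for motif, reg, a in model if motif in present and reg in active_regs)


# Toy usage with made-up sequences and a made-up regulator name.
toy_promoters = {"g1": "ACGTAA", "g2": "TTGCAA", "g3": "ACGTTT"}
toy_active = {"e1": {"TF1"}}
toy_labels = {("g1", "e1"): 1, ("g2", "e1"): -1, ("g3", "e1"): 1}
toy_model = fit_rules(toy_promoters, toy_active, toy_labels)
```

Because each rule depends only on promoter sequence and regulator state, `predict` can score a gene that was never seen during training, which mirrors the abstract's claim that the learned model applies to any gene given its promoter and the regulators' expression states.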
{
  "annotation_id": "9e251911-969d-4c14-803a-37c1eaf99424",
  "date_created": "2026-03-02T18:01:35.412000Z",
  "date_modified": "2026-03-02T18:01:35.412000Z",
  "file_hash": "744473c70136c4144b0d28598f344174e4a70a3f1b65f3f4d52090498545b73a",
  "private": false,
  "record": {
    "abstract": "We present MEDUSA, an integrative method for learning motif models of\ntranscription factor binding sites by incorporating promoter sequence and gene\nexpression data. We use a modern large-margin machine learning approach, based\non boosting, to enable feature selection from the high-dimensional search space\nof candidate binding sequences while avoiding overfitting. At each iteration of\nthe algorithm, MEDUSA builds a motif model whose presence in the promoter\nregion of a gene, coupled with activity of a regulator in an experiment, is\npredictive of differential expression. In this way, we learn motifs that are\nfunctional and predictive of regulatory response rather than motifs that are\nsimply overrepresented in promoter sequences. Moreover, MEDUSA produces a model\nof the transcriptional control logic that can predict the expression of any\ngene in the organism, given the sequence of the promoter region of the target\ngene and the expression state of a set of known or putative transcription\nfactors and signaling molecules. Each motif model is either a $k$-length\nsequence, a dimer, or a PSSM that is built by agglomerative probabilistic\nclustering of sequences with similar boosting loss. By applying MEDUSA to a set\nof environmental stress response expression data in yeast, we learn motifs\nwhose ability to predict differential expression of target genes outperforms\nmotifs from the TRANSFAC dataset and from a previously published candidate set\nof PSSMs. We also show that MEDUSA retrieves many experimentally confirmed\nbinding sites associated with environmental stress response from the\nliterature.",
    "arxiv_id": "q-bio/0701021",
    "authors": [
      "Manuel Middendorf",
      "Anshul Kundaje",
      "Mihir Shah",
      "Yoav Freund",
      "Chris H. Wiggins",
      "Christina Leslie"
    ],
    "categories": [
      "q-bio.GN"
    ],
    "doi": "10.1007/11415770_41",
    "journal_ref": "Research in Computational Molecular Biology 2005",
    "title": "Motif Discovery through Predictive Modeling of Gene Regulation",
    "url": "https://arxiv.org/abs/q-bio/0701021"
  },
  "schema_id": "dorsal/arxiv",
  "source": {
    "execution_id": "dfbc922b-db09-40b2-a2ee-a9140296bee4",
    "id": "arXiv Dataset IDs",
    "type": "Model",
    "variant": "snapshot-2026-03-01",
    "version": "0.1.0"
  },
  "user_id": 1000002
}