Annotation: dorsal/arxiv

Authors	Wentian Li, Fengzhu Sun, Ivo Grosse
Categories	q-bio.QM q-bio.GN
ArXiv ID	q-bio/0403038
URL	https://arxiv.org/abs/q-bio/0403038
DOI	10.1089/1066527041410445
Journal	Journal of Computational Biology, 11(2-3):215-226 (2004)

Abstract

One important issue commonly encountered in the analysis of microarray data is deciding which and how many genes should be selected for further study. For discriminant microarray data analyses based on statistical models, such as logistic regression models, gene selection can be accomplished by comparing the maximum likelihood of the model given the real data, $\hat{L}(D|M)$, with the expected maximum likelihood of the model given an ensemble of surrogate data with randomly permuted labels, $\hat{L}(D_0|M)$. Typically, the computational burden of obtaining $\hat{L}(D_0|M)$ is immense, often exceeding the limits of available computing resources by orders of magnitude. Here, we propose an approach that circumvents such heavy computations by mapping the simulation problem to an extreme-value problem. We present the derivation of an asymptotic distribution of the extreme value as well as its mean, median, and variance. Using this distribution, we propose two gene selection criteria, and we apply them to two microarray datasets and three classification tasks for illustration.
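
For orientation, the brute-force comparison that the abstract describes (and that the extreme-value approach is meant to avoid) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the synthetic data, the restriction to single-gene models, the Newton fit with step halving, and the names `max_loglik`, `best_gene_loglik`, and `permutation_baseline` are all assumptions made for this example.

```python
import numpy as np


def _loglik(beta, X, y):
    """Bernoulli log-likelihood of a logistic model with linear predictor X @ beta."""
    eta = X @ beta
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))


def max_loglik(x, y, n_iter=50, tol=1e-10):
    """Maximized log-likelihood of y ~ intercept + x (Newton steps with halving)."""
    # Standardizing x leaves the maximized likelihood unchanged; assumes x is not constant.
    X = np.column_stack([np.ones(len(x)), (x - x.mean()) / x.std()])
    beta = np.zeros(2)
    ll = _loglik(beta, X, y)
    for _ in range(n_iter):
        eta = X @ beta
        p = np.exp(eta - np.logaddexp(0.0, eta))  # numerically stable sigmoid
        # Tiny diagonal jitter keeps the Hessian invertible for near-separable data.
        hess = X.T @ (X * (p * (1.0 - p))[:, None]) + 1e-8 * np.eye(2)
        step = np.linalg.solve(hess, X.T @ (y - p))
        t = 1.0
        ll_new = _loglik(beta + t * step, X, y)
        while ll_new < ll and t > 1e-6:  # halve the step until the fit improves
            t *= 0.5
            ll_new = _loglik(beta + t * step, X, y)
        if ll_new <= ll + tol:  # converged, or no further improvement possible
            return max(ll, ll_new)
        beta, ll = beta + t * step, ll_new
    return ll


def best_gene_loglik(expr, y):
    """Largest single-gene maximized log-likelihood over all genes (rows of expr)."""
    return max(max_loglik(gene, y) for gene in expr)


def permutation_baseline(expr, y, n_perm, rng):
    """Best-gene log-likelihood for each of n_perm label-permuted surrogate datasets."""
    return np.array([best_gene_loglik(expr, rng.permutation(y)) for _ in range(n_perm)])


# Synthetic stand-in for a microarray matrix: genes in rows, samples in columns.
rng = np.random.default_rng(0)
n_genes, n_samples = 100, 40
y = np.repeat([0.0, 1.0], n_samples // 2)
expr = rng.normal(size=(n_genes, n_samples))
expr[0] += 2.0 * y  # one artificially informative gene

real = best_gene_loglik(expr, y)                       # plays the role of L-hat(D|M)
null = permutation_baseline(expr, y, n_perm=20, rng=rng)
print("real data:", real, "permuted-label mean:", null.mean())
```

Each surrogate requires refitting every gene, so the cost grows as (number of genes) x (number of permutations) model fits, and it grows much faster once models with several genes are considered. That scaling is the burden the abstract refers to.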

{ "annotation_id": "b74bf926-833a-47ee-9f7b-9dbeb943e497", "date_created": "2026-03-02T18:01:31.991000Z", "date_modified": "2026-03-02T18:01:31.991000Z", "file_hash": "dc39250ca7885bff23f759ab63e536be4caaa7e1e3a18be921d306e8eba4f3b9", "private": false, "record": { "abstract": "One important issue commonly encountered in the analysis of microarray data\nis to decide which and how many genes should be selected for further studies.\nFor discriminant microarray data analyses based on statistical models, such as\nthe logistic regression models, gene selection can be accomplished by a\ncomparison of the maximum likelihood of the model given the real data,\n$\\hat{L}(D|M)$, and the expected maximum likelihood of the model given an\nensemble of surrogate data with randomly permuted label, $\\hat{L}(D_0|M)$.\nTypically, the computational burden for obtaining $\\hat{L}(D_0|M)$ is immense,\noften exceeding the limits of computing available resources by orders of\nmagnitude. Here, we propose an approach that circumvents such heavy\ncomputations by mapping the simulation problem to an extreme-value problem. We\npresent the derivation of an asymptotic distribution of the extreme-value as\nwell as its mean, median, and variance. Using this distribution, we propose two\ngene selection criteria, and we apply them to two microarray datasets and three\nclassification tasks for illustration.", "arxiv_id": "q-bio/0403038", "authors": [ "Wentian Li", "Fengzhu Sun", "Ivo Grosse" ], "categories": [ "q-bio.QM", "q-bio.GN" ], "doi": "10.1089/1066527041410445", "journal_ref": "Journal of Computational Biology, 11(2-3):215-226 (2004)", "title": "Extreme Value Distribution Based Gene Selection Criteria for Discriminant Microarray Data Analysis Using Logistic Regression", "url": "https://arxiv.org/abs/q-bio/0403038" }, "schema_id": "dorsal/arxiv", "source": { "execution_id": "42dfd072-f539-4064-b393-183231a318bb", "id": "arXiv Dataset IDs", "type": "Model", "variant": "snapshot-2026-03-01", "version": "0.1.0" }, "user_id": 1000002 }