Zipf's Law in Importance of Genes for Cancer Classification Using Microarray Data
| Authors | Wentian Li |
|---|---|
| Categories | physics.bio-ph, physics.data-an, q-bio.QM |
| ArXiv ID | physics/0104028 |
| URL | https://arxiv.org/abs/physics/0104028 |
| DOI | 10.1006/jtbi.2002.3145 |
| Journal | W Li and Y Yang (2002), J. Theoretical Biology, 219(4):539-551. |
Abstract
Microarray data consist of mRNA expression levels of thousands of genes under certain conditions. A difference in the expression level of a gene between two conditions or phenotypes (cancerous versus non-cancerous, one subtype of cancer versus another, before versus after a drug treatment) indicates that the gene is relevant to the difference in the high-level phenotype. Each gene can be ranked by its ability to distinguish the two conditions. We study how single-gene classification ability decreases with rank (a Zipf's plot). A power-law function is observed in the Zipf's plot for the four microarray datasets obtained from various cancer studies. This power-law behavior is reminiscent of similar power-law curves in other natural and social phenomena (Zipf's law). However, because we measure importance in classification ability by the maximized likelihood in a logistic regression, the exponent of the power-law function depends on the sample size, rather than being a fixed value close to 1 as in a typical example of Zipf's law. The presence of this power-law behavior is important for deciding the number of genes to use in a discriminant microarray data analysis.
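The rank-versus-score relationship described in the abstract can be illustrated with a minimal sketch. It assumes each gene already has a positive importance score (for instance, a maximized likelihood from a single-gene logistic regression) and estimates the power-law exponent by ordinary least squares on the log-log Zipf's plot. The function name `zipf_fit`, the least-squares fitting choice, and the synthetic scores are illustrative assumptions; the abstract does not state how the paper fits the exponent.

```python
import math

def zipf_fit(scores):
    """Fit score ~ c * rank**(-alpha) by least squares on log-log axes.

    Hypothetical helper: assumes all scores are strictly positive.
    Returns (alpha, c).
    """
    ranked = sorted(scores, reverse=True)          # rank 1 = most important gene
    xs = [math.log(r) for r in range(1, len(ranked) + 1)]
    ys = [math.log(s) for s in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                              # slope of the log-log line
    alpha = -slope                                 # power-law exponent
    c = math.exp(mean_y - slope * mean_x)          # prefactor (score at rank 1)
    return alpha, c

# Synthetic scores following an exact power law (not real microarray data).
synthetic_scores = [2.0 * r ** -0.5 for r in range(1, 201)]
alpha, c = zipf_fit(synthetic_scores)
```

On real data the fit is usually restricted to a range of ranks where the log-log plot looks linear; that choice, like the scoring measure itself, affects the estimated exponent.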
{
"annotation_id": "12754938-840d-4464-bcff-c82381b7c904",
"date_created": "2026-03-02T18:00:35.938000Z",
"date_modified": "2026-03-02T18:00:35.938000Z",
"file_hash": "b1881411e00bdf661cd0c953f65df887d6320c92a3ed9ef30da5633e2fc4c9ab",
"private": false,
"record": {
"abstract": "Microarray data consists of mRNA expression levels of thousands of genes\nunder certain conditions. A difference in the expression level of a gene at two\ndifferent conditions/phenotypes, such as cancerous versus non-cancerous, one\nsubtype of cancer versus another, before versus after a drug treatment, is\nindicative of the relevance of that gene to the difference of the high-level\nphenotype. Each gene can be ranked by its ability to distinguish the two\nconditions. We study how the single-gene classification ability decreases with\nits rank (a Zipf\u0027s plot). Power-law function in the Zipf\u0027s plot is observed for\nthe four microarray datasets obtained from various cancer studies. This\npower-law behavior in the Zipf\u0027s plot is reminiscent of similar power-law\ncurves in other natural and social phenomena (Zipf\u0027s law). However, due to our\nchoice of the measure of importance in classification ability, i.e., the\nmaximized likelihood in a logistic regression, the exponent of the power-law\nfunction is a function of the sample size, instead of a fixed value close to 1\nfor a typical example of Zipf\u0027s law. The presence of this power-law behavior is\nimportant for deciding the number of genes to be used for a discriminant\nmicroarray data analysis.",
"arxiv_id": "physics/0104028",
"authors": [
"Wentian Li"
],
"categories": [
"physics.bio-ph",
"physics.data-an",
"q-bio.QM"
],
"doi": "10.1006/jtbi.2002.3145",
"journal_ref": "W Li and Y Yang (2002), J. Theoretical Biology, 219(4):539-551.",
"title": "Zipf\u0027s Law in Importance of Genes for Cancer Classification Using Microarray Data",
"url": "https://arxiv.org/abs/physics/0104028"
},
"schema_id": "dorsal/arxiv",
"source": {
"execution_id": "35e3cf9f-afa0-4af9-bbeb-36bfc1e023ca",
"id": "arXiv Dataset IDs",
"type": "Model",
"variant": "snapshot-2026-03-01",
"version": "0.1.0"
},
"user_id": 1000002
}