dorsal/arxiv
The Mystery of Two Straight Lines in Bacterial Genome Statistics. Release 2007
| Authors | A. N. Gorban, A. Yu. Zinovyev |
|---|---|
| Categories | q-bio.GN, q-bio.BM, stat.AP |
| ArXiv ID | q-bio/0412015 |
| URL | https://arxiv.org/abs/q-bio/0412015 |
| DOI | 10.1007/s11538-007-9229-6 |
| Journal | Bulletin of Mathematical Biology 69 (2007), 2429--2442 |
Abstract
In special coordinates (codon position-specific nucleotide frequencies) bacterial genomes form two straight lines in 9-dimensional space: one line for eubacterial genomes, another for archaeal genomes. All 348 distinct bacterial genomes available in GenBank in April 2007 belong to these lines with high accuracy. The main challenge now is to explain the observed high accuracy. The new phenomenon of complementary symmetry for codon position-specific nucleotide frequencies is observed. The results of analysis of several codon usage models are presented. We demonstrate that the mean-field approximation, which is also known as the context-free model, the complete independence model, or the Segre variety, can serve as a reasonable approximation to the real codon usage. The first two principal components of codon usage correlate strongly with genomic G+C content and the optimal growth temperature, respectively. The variation of codon usage along the third component is related to the curvature of the mean-field approximation. The first three eigenvalues in codon usage PCA explain 59.1%, 7.8% and 4.7% of the variation. Codon usage in eubacterial and archaeal genomes is clearly distributed along two third-order curves with genomic G+C content as a parameter.
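The two constructions named in the abstract, codon position-specific nucleotide frequencies and the mean-field (complete independence) approximation, can be illustrated with a minimal sketch. It is not the authors' code. The function names, the toy sequence, and the choice to skip codons containing ambiguous bases are assumptions made for illustration. It computes the 12 frequencies (4 nucleotides at each of 3 codon positions) and approximates each codon's frequency as the product of its three position-specific frequencies. Because the frequencies at each position sum to 1, only 9 of the 12 values are independent, which is consistent with the 9-dimensional space mentioned above.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGT"


def position_specific_frequencies(cds):
    """Return freqs[k][n]: frequency of nucleotide n at codon position k (k = 0, 1, 2).

    `cds` is an in-frame coding sequence; trailing bases that do not fill a
    codon are ignored, and codons with non-ACGT symbols are skipped
    (an assumption of this sketch).
    """
    cds = cds.upper()
    usable_length = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable_length, 3)]
    codons = [c for c in codons if set(c) <= set(NUCLEOTIDES)]
    if not codons:
        raise ValueError("no complete unambiguous codons in sequence")

    freqs = []
    for k in range(3):
        counts = Counter(c[k] for c in codons)
        freqs.append({n: counts[n] / len(codons) for n in NUCLEOTIDES})
    return freqs


def mean_field_codon_usage(freqs):
    """Approximate each of the 64 codon frequencies as a product of the
    three position-specific nucleotide frequencies (independence model)."""
    return {
        "".join(c): freqs[0][c[0]] * freqs[1][c[1]] * freqs[2][c[2]]
        for c in product(NUCLEOTIDES, repeat=3)
    }


# Toy in-frame sequence (hypothetical): codons ATG, GCC, ATG, TAA.
toy_freqs = position_specific_frequencies("ATGGCCATGTAA")
toy_usage = mean_field_codon_usage(toy_freqs)
```

Applied to the concatenated coding regions of a genome, `position_specific_frequencies` would give one point per genome in the coordinates the abstract describes. Comparing `mean_field_codon_usage` against observed codon counts is one way to judge how reasonable the independence approximation is for a given genome.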
{
"annotation_id": "6dfc6d98-4062-454b-8c92-a2b73d2a1cde",
"date_created": "2026-03-02T18:01:32.323000Z",
"date_modified": "2026-03-02T18:01:32.323000Z",
"file_hash": "dae4bf6dd78113f52c5c75c0f1f0ce904cb299352e029cb81ee9d2694b14c06e",
"private": false,
"record": {
"abstract": "In special coordinates (codon position--specific nucleotide frequencies)\nbacterial genomes form two straight lines in 9-dimensional space: one line for\neubacterial genomes, another for archaeal genomes. All the 348 distinct\nbacterial genomes available in Genbank in April 2007, belong to these lines\nwith high accuracy. The main challenge now is to explain the observed high\naccuracy. The new phenomenon of complementary symmetry for codon\nposition--specific nucleotide frequencies is observed. The results of analysis\nof several codon usage models are presented. We demonstrate that the\nmean--field approximation, which is also known as context--free, or complete\nindependence model, or Segre variety, can serve as a reasonable approximation\nto the real codon usage. The first two principal components of codon usage\ncorrelate strongly with genomic G+C content and the optimal growth temperature\nrespectively. The variation of codon usage along the third component is related\nto the curvature of the mean-field approximation. First three eigenvalues in\ncodon usage PCA explain 59.1%, 7.8% and 4.7% of variation. The eubacterial and\narchaeal genomes codon usage is clearly distributed along two third order\ncurves with genomic G+C content as a parameter.",
"arxiv_id": "q-bio/0412015",
"authors": [
"A. N. Gorban",
"A. Yu. Zinovyev"
],
"categories": [
"q-bio.GN",
"q-bio.BM",
"stat.AP"
],
"doi": "10.1007/s11538-007-9229-6",
"journal_ref": "Bulletin of Mathematical Biology 69 (2007), 2429--2442",
"title": "The Mystery of Two Straight Lines in Bacterial Genome Statistics. Release 2007",
"url": "https://arxiv.org/abs/q-bio/0412015"
},
"schema_id": "dorsal/arxiv",
"source": {
"execution_id": "c39ce32b-861a-476c-a7d7-577f3e66ffb7",
"id": "arXiv Dataset IDs",
"type": "Model",
"variant": "snapshot-2026-03-01",
"version": "0.1.0"
},
"user_id": 1000002
}