dorsal/arxiv
PCA and K-Means decipher genome
| Authors | A. N. Gorban, A. Y. Zinovyev |
|---|---|
| Categories | q-bio.QM, q-bio.GN |
| ArXiv ID | q-bio/0504013 |
| URL | https://arxiv.org/abs/q-bio/0504013 |
| DOI | 10.1007/978-3-540-73750-6_14 |
| Journal | A.N. Gorban, B. Kegl, D.C. Wunsch, A. Zinovyev (eds.) Principal Manifolds for Data Visualization and Dimension Reduction, Lecture Notes in Computational Science and Engineering 58, Springer, Berlin - Heidelberg, 2008, 307-323 |
Abstract
In this paper, we aim to give a tutorial for undergraduate students studying statistical methods and/or bioinformatics. The students will learn how data visualization can help in genomic sequence analysis. Students start with a fragment of the genetic text of a bacterial genome and analyze its structure. By means of principal component analysis they "discover" that the information in the genome is encoded by non-overlapping triplets. Next, they learn how to find gene positions. This exercise on PCA and K-Means clustering enables active study of basic bioinformatics notions. Appendix 1 contains program listings that go along with this exercise. Appendix 2 includes 2D PCA plots of triplet usage in a moving frame for a series of bacterial genomes, from GC-poor to GC-rich ones. Animated 3D PCA plots are attached as separate gif files. The topology (cluster structure) and geometry (mutual positions of clusters) of these plots depend clearly on GC-content.
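The "triplet usage in a moving frame" mentioned in the abstract can be illustrated with a minimal sketch of the feature-extraction step that would precede PCA and K-Means. This is not the program listing from Appendix 1. The function names `triplet_frequencies` and `sliding_window_table`, and the default `width` and `step` values, are hypothetical choices made for illustration only.

```python
from collections import Counter
from itertools import product

# All 64 possible triplets in a fixed order, so every window maps
# to a vector with the same 64 coordinates.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]


def triplet_frequencies(fragment):
    """Relative frequencies of non-overlapping triplets in one fragment.

    Triplets are read starting at the first letter of the fragment.
    Triplets containing letters other than A, C, G, T are ignored.
    """
    counts = Counter(fragment[i:i + 3] for i in range(0, len(fragment) - 2, 3))
    total = sum(counts[t] for t in TRIPLETS)
    if total == 0:
        return [0.0] * len(TRIPLETS)
    return [counts[t] / total for t in TRIPLETS]


def sliding_window_table(sequence, width=300, step=10):
    """Build one 64-dimensional frequency vector per window.

    width and step are illustrative defaults (assumptions, not values
    taken from the paper). A step that is not a multiple of 3 makes
    successive windows start in different phases relative to any
    triplet reading frame present in the sequence.
    """
    rows = []
    for start in range(0, len(sequence) - width + 1, step):
        rows.append(triplet_frequencies(sequence[start:start + width]))
    return rows
```

Each row of the resulting table is a point in 64-dimensional space. Such a table could then be centered and passed to any PCA routine for plotting, and to K-Means to group windows by their triplet usage.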
{
"annotation_id": "9fe94846-cc77-40a9-886b-36ef91616e81",
"date_created": "2026-03-02T18:01:31.163000Z",
"date_modified": "2026-03-02T18:01:31.163000Z",
"file_hash": "1501ecaa0d469299df30f75d06f643d744030885df5b2eeb7d2334972e037227",
"private": false,
"record": {
"abstract": "In this paper, we aim to give a tutorial for undergraduate students studying\nstatistical methods and/or bioinformatics. The students will learn how data\nvisualization can help in genomic sequence analysis. Students start with a\nfragment of genetic text of a bacterial genome and analyze its structure. By\nmeans of principal component analysis they ``discover\u0027\u0027 that the information in\nthe genome is encoded by non-overlapping triplets. Next, they learn how to find\ngene positions. This exercise on PCA and K-Means clustering enables active\nstudy of the basic bioinformatics notions. Appendix 1 contains program listings\nthat go along with this exercise. Appendix 2 includes 2D PCA plots of triplet\nusage in moving frame for a series of bacterial genomes from GC-poor to GC-rich\nones. Animated 3D PCA plots are attached as separate gif files. Topology\n(cluster structure) and geometry (mutual positions of clusters) of these plots\ndepends clearly on GC-content.",
"arxiv_id": "q-bio/0504013",
"authors": [
"A. N. Gorban",
"A. Y. Zinovyev"
],
"categories": [
"q-bio.QM",
"q-bio.GN"
],
"doi": "10.1007/978-3-540-73750-6_14",
"journal_ref": "A.N. Gorban, B. Kegl, D.C. Wunsch, A. Zinovyev (eds.) Principal\n Manifolds for Data Visualization and Dimension Reduction, Lecture Notes in\n Computational Science and Engineering 58, Springer, Berlin - Heidelberg,\n 2008, 307-323",
"title": "PCA and K-Means decipher genome",
"url": "https://arxiv.org/abs/q-bio/0504013"
},
"schema_id": "dorsal/arxiv",
"source": {
"execution_id": "f3514116-343a-493a-99c8-5f68ec1f2cde",
"id": "arXiv Dataset IDs",
"type": "Model",
"variant": "snapshot-2026-03-01",
"version": "0.1.0"
},
"user_id": 1000002
}