Annotation: dorsal/arxiv

Authors	A. N. Gorban, A. Y. Zinovyev
Categories	q-bio.QM q-bio.GN
ArXiv ID	q-bio/0504013
URL	https://arxiv.org/abs/q-bio/0504013
DOI	10.1007/978-3-540-73750-6_14
Journal	A.N. Gorban, B. Kegl, D.C. Wunsch, A. Zinovyev (eds.) Principal Manifolds for Data Visualization and Dimension Reduction, Lecture Notes in Computational Science and Engineering 58, Springer, Berlin - Heidelberg, 2008, 307-323

Abstract

In this paper, we aim to give a tutorial for undergraduate students studying statistical methods and/or bioinformatics. The students will learn how data visualization can help in genomic sequence analysis. Students start with a fragment of genetic text of a bacterial genome and analyze its structure. By means of principal component analysis they "discover" that the information in the genome is encoded by non-overlapping triplets. Next, they learn how to find gene positions. This exercise on PCA and K-Means clustering enables active study of the basic bioinformatics notions. Appendix 1 contains program listings that go along with this exercise. Appendix 2 includes 2D PCA plots of triplet usage in a moving frame for a series of bacterial genomes, from GC-poor to GC-rich ones. Animated 3D PCA plots are attached as separate gif files. The topology (cluster structure) and geometry (mutual positions of clusters) of these plots depend clearly on GC-content.
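
The triplet-usage features mentioned in the abstract can be illustrated with a minimal sketch. This is not the listing from the paper's Appendix 1. The function names, the window width, and the step size are hypothetical choices made here for illustration. The sketch only builds the 64-dimensional triplet-frequency vectors for sliding windows. PCA and K-Means would then be applied to the resulting rows with a library of the reader's choice.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"
# All 64 possible triplets in a fixed order, so every vector has the same layout.
TRIPLETS = ["".join(p) for p in product(ALPHABET, repeat=3)]


def triplet_frequencies(fragment, frame=0):
    """Relative frequencies of non-overlapping triplets read from offset `frame`."""
    fragment = fragment.upper()
    counts = Counter(
        fragment[i:i + 3] for i in range(frame, len(fragment) - 2, 3)
    )
    # Triplets containing symbols other than A, C, G, T are ignored.
    total = sum(counts[t] for t in TRIPLETS)
    if total == 0:
        return [0.0] * len(TRIPLETS)
    return [counts[t] / total for t in TRIPLETS]


def sliding_window_features(sequence, width=300, step=12, frame=0):
    """One 64-dimensional row per window; width and step are illustrative values."""
    rows = []
    for start in range(0, len(sequence) - width + 1, step):
        window = sequence[start:start + width]
        rows.append(triplet_frequencies(window, frame))
    return rows


if __name__ == "__main__":
    toy = "ATG" * 10  # toy sequence, not real genomic data
    features = sliding_window_features(toy, width=9, step=3)
    print(len(features), "windows,", len(features[0]), "features each")
```

Each row is a point in 64-dimensional space. Changing `frame` among 0, 1, and 2 shifts the reading position, which is one way to compare triplet usage across the three possible phases of a fragment.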

{ "annotation_id": "9fe94846-cc77-40a9-886b-36ef91616e81", "date_created": "2026-03-02T18:01:31.163000Z", "date_modified": "2026-03-02T18:01:31.163000Z", "file_hash": "1501ecaa0d469299df30f75d06f643d744030885df5b2eeb7d2334972e037227", "private": false, "record": { "abstract": "In this paper, we aim to give a tutorial for undergraduate students studying\nstatistical methods and/or bioinformatics. The students will learn how data\nvisualization can help in genomic sequence analysis. Students start with a\nfragment of genetic text of a bacterial genome and analyze its structure. By\nmeans of principal component analysis they ``discover\u0027\u0027 that the information in\nthe genome is encoded by non-overlapping triplets. Next, they learn how to find\ngene positions. This exercise on PCA and K-Means clustering enables active\nstudy of the basic bioinformatics notions. Appendix 1 contains program listings\nthat go along with this exercise. Appendix 2 includes 2D PCA plots of triplet\nusage in moving frame for a series of bacterial genomes from GC-poor to GC-rich\nones. Animated 3D PCA plots are attached as separate gif files. Topology\n(cluster structure) and geometry (mutual positions of clusters) of these plots\ndepends clearly on GC-content.", "arxiv_id": "q-bio/0504013", "authors": [ "A. N. Gorban", "A. Y. Zinovyev" ], "categories": [ "q-bio.QM", "q-bio.GN" ], "doi": "10.1007/978-3-540-73750-6_14", "journal_ref": "A.N. Gorban, B. Kegl, D.C. Wunsch, A. Zinovyev (eds.) Principal\n Manifolds for Data Visualization and Dimension Reduction, Lecture Notes in\n Computational Science and Engineering 58, Springer, Berlin - Heidelberg,\n 2008, 307-323", "title": "PCA and K-Means decipher genome", "url": "https://arxiv.org/abs/q-bio/0504013" }, "schema_id": "dorsal/arxiv", "source": { "execution_id": "f3514116-343a-493a-99c8-5f68ec1f2cde", "id": "arXiv Dataset IDs", "type": "Model", "variant": "snapshot-2026-03-01", "version": "0.1.0" }, "user_id": 1000002 }