Annotation: dorsal/arxiv

Authors	Ricardo ZN Vêncio, Leonardo Varuzza, Carlos AB Pereira, Helena Brentani, Ilya Shmulevich
Categories	q-bio.QM q-bio.TO
ArXiv ID	q-bio/0703007
URL	https://arxiv.org/abs/q-bio/0703007
DOI	10.1186/1471-2105-8-246
Journal	BMC Bioinformatics 2007, 8:246

Abstract

Transcript enumeration methods such as SAGE, MPSS, and sequencing-by-synthesis EST "digital northern" are important high-throughput techniques for digital gene expression measurement. Like other counting or voting processes, these measurements constitute compositional data, which exhibit properties particular to the simplex space, where the sum of the components is constrained. These properties are not present in regular Euclidean spaces, on which hybridization-based microarray data are often modeled. Therefore, pattern recognition methods commonly used for microarray data analysis may be non-informative for data generated by transcript enumeration techniques, since they ignore certain fundamental properties of this space. Here we present a software tool, Simcluster, designed to perform clustering analysis for data on the simplex space. We present Simcluster as a stand-alone command-line C package and as a user-friendly on-line tool. Both versions are available at: http://xerad.systemsbiology.net/simcluster/ . Simcluster is designed in accordance with a well-established mathematical framework for compositional data analysis, which provides principled procedures for dealing with the simplex space, and is thus applicable in a number of contexts, including enumeration-based gene expression data.
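
To illustrate why the simplex constraint matters, the following is a minimal sketch, not Simcluster's code. It assumes the Aitchison distance (Euclidean distance after a centered log-ratio transform), which is a standard choice in compositional data analysis; the abstract does not state which distance Simcluster uses. The function names `closure`, `clr`, and `aitchison_distance` and the toy tag counts are hypothetical, and the sketch assumes strictly positive counts (zero counts would need separate handling).

```python
import math


def closure(counts):
    """Rescale strictly positive counts so they sum to 1 (a point on the simplex)."""
    if any(c <= 0 for c in counts):
        raise ValueError("this sketch assumes strictly positive counts")
    total = sum(counts)
    return [c / total for c in counts]


def clr(composition):
    """Centered log-ratio transform: log of each part minus the mean log."""
    logs = [math.log(p) for p in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the clr-transformed compositions of x and y."""
    cx, cy = clr(closure(x)), clr(closure(y))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cy)))


def euclidean_distance(x, y):
    """Plain Euclidean distance on the raw counts, for comparison."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical tag counts for three genes in three libraries.
# lib_b has the same proportions as lib_a but ten times the sequencing depth.
lib_a = [10, 20, 70]
lib_b = [100, 200, 700]
lib_c = [70, 20, 10]

print(aitchison_distance(lib_a, lib_b))  # approximately 0: same composition
print(euclidean_distance(lib_a, lib_b))  # large: dominated by library size
print(aitchison_distance(lib_a, lib_c))  # clearly positive: different composition
```

In this toy example, the raw Euclidean distance separates `lib_a` and `lib_b` only because of their different totals, while the log-ratio distance treats them as the same expression profile. This is the kind of property the abstract says Euclidean methods ignore.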

{ "annotation_id": "ceefbd54-dda9-455e-83ac-ce292a58d87a", "date_created": "2026-03-02T18:01:35.093000Z", "date_modified": "2026-03-02T18:01:35.093000Z", "file_hash": "676307ae015845163a83259685c1d711a94870e4072eb3b0206800ab52fd7851", "private": false, "record": { "abstract": "Transcript enumeration methods such as SAGE, MPSS, and\nsequencing-by-synthesis EST ``digital northern\u0027\u0027, are important high-throughput\ntechniques for digital gene expression measurement. As other counting or voting\nprocesses, these measurements constitute compositional data exhibiting\nproperties particular to the simplex space where the summation of the\ncomponents is constrained. These properties are not present on regular\nEuclidean spaces, on which hybridization-based microarray data is often\nmodeled. Therefore, pattern recognition methods commonly used for microarray\ndata analysis may be non-informative for the data generated by transcript\nenumeration techniques since they ignore certain fundamental properties of this\nspace. Here we present a software tool, Simcluster, designed to perform\nclustering analysis for data on the simplex space. We present Simcluster as a\nstand-alone command-line C package and as a user-friendly on-line tool. 
Both\nversions are available at: http://xerad.systemsbiology.net/simcluster/ .\nSimcluster is designed in accordance with a well-established mathematical\nframework for compositional data analysis, which provides principled procedures\nfor dealing with the simplex space, and is thus applicable in a number of\ncontexts, including enumeration-based gene expression data.", "arxiv_id": "q-bio/0703007", "authors": [ "Ricardo ZN V\u00eancio", "Leonardo Varuzza", "Carlos AB Pereira", "Helena Brentani", "Ilya Shmulevich" ], "categories": [ "q-bio.QM", "q-bio.TO" ], "doi": "10.1186/1471-2105-8-246", "journal_ref": "BMC Bioinformatics 2007, 8:246", "title": "Simcluster: clustering enumeration gene expression data on the simplex space", "url": "https://arxiv.org/abs/q-bio/0703007" }, "schema_id": "dorsal/arxiv", "source": { "execution_id": "fcd3963d-4895-4659-b852-bcbb67253fb5", "id": "arXiv Dataset IDs", "type": "Model", "variant": "snapshot-2026-03-01", "version": "0.1.0" }, "user_id": 1000002 }