Annotation: dorsal/arxiv

Title	Information based clustering
Authors	Noam Slonim, Gurinder Singh Atwal, Gasper Tkacik, William Bialek
Categories	q-bio.QM
ArXiv ID	q-bio/0511043
URL	https://arxiv.org/abs/q-bio/0511043
DOI	10.1073/pnas.0507432102

Abstract

In an age of increasingly large data sets, investigators in many different disciplines have turned to clustering as a tool for data analysis and exploration. Existing clustering methods, however, typically depend on several nontrivial assumptions about the structure of data. Here we reformulate the clustering problem from an information theoretic perspective which avoids many of these assumptions. In particular, our formulation obviates the need for defining a cluster "prototype", does not require an a priori similarity metric, is invariant to changes in the representation of the data, and naturally captures non-linear relations. We apply this approach to different domains and find that it consistently produces clusters that are more coherent than those extracted by existing algorithms. Finally, our approach provides a way of clustering based on collective notions of similarity rather than the traditional pairwise measures.
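
The abstract names two properties, invariance to changes in the representation of the data and sensitivity to non-linear relations, without saying which quantity delivers them. As a minimal sketch, assuming mutual information as the measure (a standard information-theoretic quantity with both properties; the abstract itself does not confirm that this is the paper's choice), the Python below compares it with Pearson correlation on a toy quadratic relation. The function names `mutual_information` and `pearson` and the toy data are hypothetical, introduced only for illustration, and this is not the clustering algorithm itself.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    px = Counter(xs)               # marginal counts of x
    py = Counter(ys)               # marginal counts of y
    # sum over observed pairs of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def pearson(xs, ys):
    """Pearson correlation coefficient, a purely linear pairwise measure."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Toy data (an assumption for illustration): y depends on x only through x^2.
xs = [-2, -1, 0, 1, 2] * 4
ys = [x * x for x in xs]

# The same data in a different representation: an invertible relabeling of x.
xs_relabeled = [x ** 3 for x in xs]

mi_original = mutual_information(xs, ys)
mi_relabeled = mutual_information(xs_relabeled, ys)
corr = pearson(xs, ys)

print("Pearson correlation:", corr)
print("Mutual information (bits):", mi_original)
print("Mutual information after relabeling x (bits):", mi_relabeled)
```

On this data the linear correlation vanishes while the mutual information stays positive, and relabeling x through an invertible map leaves the mutual information unchanged. A plug-in estimate like this one is biased for small samples, so a practical implementation would need a more careful estimator.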

{
  "annotation_id": "98f6dd4b-a600-48f8-a3de-7ef23da5fd6f",
  "date_created": "2026-03-02T18:01:35.123000Z",
  "date_modified": "2026-03-02T18:01:35.123000Z",
  "file_hash": "1136879b2b1ef725fc650ca0b3d63f4c65078819d642649a2773788831cdab4a",
  "private": false,
  "record": {
    "abstract": "In an age of increasingly large data sets, investigators in many different\ndisciplines have turned to clustering as a tool for data analysis and\nexploration. Existing clustering methods, however, typically depend on several\nnontrivial assumptions about the structure of data. Here we reformulate the\nclustering problem from an information theoretic perspective which avoids many\nof these assumptions. In particular, our formulation obviates the need for\ndefining a cluster \"prototype\", does not require an a priori similarity metric,\nis invariant to changes in the representation of the data, and naturally\ncaptures non-linear relations. We apply this approach to different domains and\nfind that it consistently produces clusters that are more coherent than those\nextracted by existing algorithms. Finally, our approach provides a way of\nclustering based on collective notions of similarity rather than the\ntraditional pairwise measures.",
    "arxiv_id": "q-bio/0511043",
    "authors": [
      "Noam Slonim",
      "Gurinder Singh Atwal",
      "Gasper Tkacik",
      "William Bialek"
    ],
    "categories": [
      "q-bio.QM"
    ],
    "doi": "10.1073/pnas.0507432102",
    "title": "Information based clustering",
    "url": "https://arxiv.org/abs/q-bio/0511043"
  },
  "schema_id": "dorsal/arxiv",
  "source": {
    "execution_id": "34381bee-875d-4056-a218-e6e659c9c3d2",
    "id": "arXiv Dataset IDs",
    "type": "Model",
    "variant": "snapshot-2026-03-01",
    "version": "0.1.0"
  },
  "user_id": 1000002
}