dorsal/arxiv
How many clusters? An information theoretic perspective
| Authors | Susanne Still, William Bialek |
|---|---|
| Categories | physics.data-an, physics.gen-ph |
| ArXiv ID | physics/0303011 |
| URL | https://arxiv.org/abs/physics/0303011 |
Abstract
Clustering provides a common means of identifying structure in complex data, and there is renewed interest in clustering as a tool for the analysis of large data sets in many fields. A natural question is how many clusters are appropriate for the description of a given system. Traditional approaches to this problem are based either on a framework in which clusters of a particular shape are assumed as a model of the system or on a two-step procedure in which a clustering criterion determines the optimal assignments for a given number of clusters and a separate criterion measures the goodness of the classification to determine the number of clusters. In a statistical mechanics approach, clustering can be seen as a trade-off between energy- and entropy-like terms, with lower temperature driving the proliferation of clusters to provide a more detailed description of the data. For finite data sets, we expect that there is a limit to the meaningful structure that can be resolved and therefore a minimum temperature beyond which we will capture sampling noise. This suggests that correcting the clustering criterion for the bias which arises due to sampling errors will allow us to find a clustering solution at a temperature which is optimal in the sense that we capture maximal meaningful structure, without having to define an external criterion for the goodness or stability of the clustering. We show that, in a general information theoretic framework, the finite size of a data set determines an optimal temperature, and we introduce a method for finding the maximal number of clusters which can be resolved from the data in the hard clustering limit.
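The temperature picture in the abstract can be illustrated with a small generic sketch. This is not the authors' algorithm and does not include their correction for sampling bias. It assumes one-dimensional data, fixed hypothetical cluster centers, squared distance as the energy-like term, and Boltzmann-weighted soft assignments. The names `soft_assignments` and `mean_assignment_entropy` are made up for this example. The point is only to show that lowering the temperature pushes soft assignments toward the hard clustering limit.

```python
import math


def soft_assignments(points, centers, temperature):
    """Return p(cluster | point) proportional to exp(-squared_distance / temperature).

    Assumption for illustration: squared distance plays the role of the
    energy-like term; the centers are fixed rather than optimized.
    """
    result = []
    for x in points:
        energies = [(x - c) ** 2 for c in centers]
        e_min = min(energies)  # subtract the minimum for numerical stability
        weights = [math.exp(-(e - e_min) / temperature) for e in energies]
        total = sum(weights)
        result.append([w / total for w in weights])
    return result


def mean_assignment_entropy(assignments):
    """Average Shannon entropy (in bits) of the per-point assignment distributions."""
    total = 0.0
    for row in assignments:
        total -= sum(p * math.log2(p) for p in row if p > 0.0)
    return total / len(assignments)


# Hypothetical toy data and centers, chosen only for the demonstration.
points = [0.0, 0.2, 0.9, 1.1, 2.0]
centers = [0.0, 1.0, 2.0]

for temperature in (10.0, 1.0, 0.1, 0.01):
    probs = soft_assignments(points, centers, temperature)
    print(temperature, round(mean_assignment_entropy(probs), 4))
```

In this sketch the printed entropy shrinks as the temperature drops, and at the lowest temperature each point is assigned almost entirely to its nearest center. Deciding how low the temperature can go before the assignments start fitting sampling noise is the question the paper addresses, and this example does not attempt to answer it.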
{
"annotation_id": "f0853a14-86b7-492a-80a7-6301471a4ae9",
"date_created": "2026-03-02T18:00:42.949000Z",
"date_modified": "2026-03-02T18:00:42.949000Z",
"file_hash": "171140e78245d235f764c9253bccdb0a91c0185d9b0a0137e9d2596fd9dcedcb",
"private": false,
"record": {
"abstract": "Clustering provides a common means of identifying structure in complex data,\nand there is renewed interest in clustering as a tool for the analysis of large\ndata sets in many fields. A natural question is how many clusters are\nappropriate for the description of a given system. Traditional approaches to\nthis problem are based either on a framework in which clusters of a particular\nshape are assumed as a model of the system or on a two-step procedure in which\na clustering criterion determines the optimal assignments for a given number of\nclusters and a separate criterion measures the goodness of the classification\nto determine the number of clusters. In a statistical mechanics approach,\nclustering can be seen as a trade--off between energy-- and entropy--like\nterms, with lower temperature driving the proliferation of clusters to provide\na more detailed description of the data. For finite data sets, we expect that\nthere is a limit to the meaningful structure that can be resolved and therefore\na minimum temperature beyond which we will capture sampling noise. This\nsuggests that correcting the clustering criterion for the bias which arises due\nto sampling errors will allow us to find a clustering solution at a temperature\nwhich is optimal in the sense that we capture maximal meaningful structure --\nwithout having to define an external criterion for the goodness or stability of\nthe clustering. We show that, in a general information theoretic framework, the\nfinite size of a data set determines an optimal temperature, and we introduce a\nmethod for finding the maximal number of clusters which can be resolved from\nthe data in the hard clustering limit.",
"arxiv_id": "physics/0303011",
"authors": [
"Susanne Still",
"William Bialek"
],
"categories": [
"physics.data-an",
"physics.gen-ph"
],
"title": "How many clusters? An information theoretic perspective",
"url": "https://arxiv.org/abs/physics/0303011"
},
"schema_id": "dorsal/arxiv",
"source": {
"execution_id": "2fe9bc22-4a60-4b93-97ab-87a9d72b42ea",
"id": "arXiv Dataset IDs",
"type": "Model",
"variant": "snapshot-2026-03-01",
"version": "0.1.0"
},
"user_id": 1000002
}