dorsal/arxiv
Compression ratios based on the Universal Similarity Metric still yield protein distances far from CATH distances
| Authors | Jairo Rocha, Francesc Rosselló, Joan Segura |
|---|---|
| Categories | q-bio.QM, cs.CE, physics.data-an, q-bio.OT |
| ArXiv ID | q-bio/0603007 |
| URL | https://arxiv.org/abs/q-bio/0603007 |
Abstract
Kolmogorov complexity has inspired several alignment-free distance measures, based on the comparison of lengths of compressions, which have been applied successfully in many areas. One of these measures, the so-called Universal Similarity Metric (USM), has been used by Krasnogor and Pelta to compare simple protein contact maps, showing that it yielded good clustering on four small datasets. We report an extensive test of this metric using a much larger and representative protein dataset: the domain dataset used by Sierk and Pearson to evaluate seven protein structure comparison methods and two protein sequence comparison methods. One result is that Krasnogor-Pelta method has less domain discriminant power than any one of the methods considered by Sierk and Pearson when using these simple contact maps. In another test, we found that the USM based distance has low agreement with the CATH tree structure for the same benchmark of Sierk and Pearson. In any case, its agreement is lower than the one of a standard sequential alignment method, SSEARCH. Finally, we manually found lots of small subsets of the database that are better clustered using SSEARCH than USM, to confirm that Krasnogor-Pelta's conclusions were based on datasets that were too small.
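As a rough illustration of what "comparison of lengths of compressions" means, the sketch below computes the normalized compression distance, a commonly used computable stand-in for the USM. It is not the procedure used in the paper: the choice of `zlib` as the compressor, the helper names `compressed_len` and `ncd`, and the toy byte strings standing in for serialized contact maps are all assumptions made for this example.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed length.
    """
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical toy inputs, not real protein contact maps.
map_a = b"0110100110010110" * 40
map_b = b"0110100110010110" * 39 + b"1111000011110000"  # near copy of map_a
map_c = bytes((i * 37 + 11) % 251 for i in range(640))  # unrelated content

print("a vs a:", ncd(map_a, map_a))
print("a vs b:", ncd(map_a, map_b))
print("a vs c:", ncd(map_a, map_c))
```

Smaller values indicate that one input helps compress the other, so the near copy should score closer to `map_a` than the unrelated string does. With a real compressor the distance of an object to itself is small but not exactly zero, which is one practical limitation of this kind of measure.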
{
"annotation_id": "bc376165-93d1-431f-8149-aff137b6ce00",
"date_created": "2026-03-02T18:01:35.480000Z",
"date_modified": "2026-03-02T18:01:35.480000Z",
"file_hash": "1d91f834938c8fe24509085bf2a65da014acd33ed0682bac1af4d0fada04ccac",
"private": false,
"record": {
"abstract": "Kolmogorov complexity has inspired several alignment-free distance measures,\nbased on the comparison of lengths of compressions, which have been applied\nsuccessfully in many areas. One of these measures, the so-called Universal\nSimilarity Metric (USM), has been used by Krasnogor and Pelta to compare simple\nprotein contact maps, showing that it yielded good clustering on four small\ndatasets. We report an extensive test of this metric using a much larger and\nrepresentative protein dataset: the domain dataset used by Sierk and Pearson to\nevaluate seven protein structure comparison methods and two protein sequence\ncomparison methods. One result is that Krasnogor-Pelta method has less domain\ndiscriminant power than any one of the methods considered by Sierk and Pearson\nwhen using these simple contact maps. In another test, we found that the USM\nbased distance has low agreement with the CATH tree structure for the same\nbenchmark of Sierk and Pearson. In any case, its agreement is lower than the\none of a standard sequential alignment method, SSEARCH. Finally, we manually\nfound lots of small subsets of the database that are better clustered using\nSSEARCH than USM, to confirm that Krasnogor-Pelta\u0027s conclusions were based on\ndatasets that were too small.",
"arxiv_id": "q-bio/0603007",
"authors": [
"Jairo Rocha",
"Francesc Rossell\u00f3",
"Joan Segura"
],
"categories": [
"q-bio.QM",
"cs.CE",
"physics.data-an",
"q-bio.OT"
],
"title": "Compression ratios based on the Universal Similarity Metric still yield protein distances far from CATH distances",
"url": "https://arxiv.org/abs/q-bio/0603007"
},
"schema_id": "dorsal/arxiv",
"source": {
"execution_id": "6090a2bc-3a99-41cf-ab43-5834bb42f55e",
"id": "arXiv Dataset IDs",
"type": "Model",
"variant": "snapshot-2026-03-01",
"version": "0.1.0"
},
"user_id": 1000002
}