Annotation: dorsal/arxiv

Authors	Jeffrey B. Endelman, Jesse D. Bloom, Christopher R. Otey, Marco Landwehr, Frances H. Arnold
Categories	q-bio.BM
ArXiv ID	q-bio/0505018
URL	https://arxiv.org/abs/q-bio/0505018

Abstract

Proteins created by combinatorial methods in vitro are an important source of information for understanding sequence-structure-function relationships. Alignments of folded proteins from combinatorial libraries can be analyzed using methods developed for naturally occurring proteins, but this neglects the information contained in the unfolded sequences of the library. We introduce two algorithms, logistic regression and excess information analysis, that use both the folded and unfolded sequences and compare them against contingency table and statistical coupling analysis, which only use the former. The test set for this benchmark study is a library of fictitious proteins that fold according to a hypothetical energy model. Of the four methods studied, only logistic regression is able to correctly recapitulate the energy model from the sequence alignment. The other algorithms predict spurious interactions between alignment positions with strong but individual influences on protein stability. When present in the same protein, stabilizing amino acids tend to lower the energy below the threshold needed for folding. As a result, their frequencies in the alignment can be correlated even if the positions do not interact. We believe any algorithm that neglects the nonlinear relationship between folding and energy is susceptible to this error.
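The last point in the abstract, that two non-interacting stabilizing positions can still appear correlated among folded sequences, can be illustrated with a small calculation. The sketch below is not the paper's energy model. It assumes two binary positions with hypothetical additive stabilizing energies `A` and `B`, a library that samples each position uniformly, and a soft folding threshold in which the unmodeled rest of the protein contributes logistic noise. Under those assumptions the folding probability is a logistic function of the energy, so the model is exactly additive on the logit scale, while a contingency table built from the folded sequences alone shows a nonzero association.

```python
import math

# Hypothetical additive energies for the stabilizing residue at each position.
# There is deliberately no interaction term between the two positions.
A = -2.0
B = -2.0


def energy(x1, x2):
    """Additive energy; x1 and x2 are 1 if the stabilizing residue is present."""
    return A * x1 + B * x2


def fold_probability(e, threshold=0.0, temperature=1.0):
    """Soft threshold (an assumption): logistic noise around the folding cutoff."""
    return 1.0 / (1.0 + math.exp((e - threshold) / temperature))


def logit(p):
    return math.log(p / (1.0 - p))


# Probability of folding for each of the four residue combinations.
p = {(x1, x2): fold_probability(energy(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)}

# With a uniform library, folded-alignment frequencies are proportional to p,
# so the log odds ratio of the folded-only contingency table is:
folded_log_odds_ratio = math.log(p[1, 1] * p[0, 0] / (p[1, 0] * p[0, 1]))

# The same contrast on the logit scale, which uses folded AND unfolded counts
# (as logistic regression does), recovers the absence of an interaction.
logit_interaction = logit(p[1, 1]) + logit(p[0, 0]) - logit(p[1, 0]) - logit(p[0, 1])

print("log odds ratio, folded sequences only:", folded_log_odds_ratio)
print("interaction contrast on the logit scale:", logit_interaction)
```

In this toy setting the folded-only log odds ratio is negative rather than zero, even though the energy has no coupling term, whereas the logit-scale contrast vanishes up to rounding error. The choice of logistic noise is what makes the logit link exact here; other noise models would only make it approximate.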

{ "annotation_id": "6d29a055-87da-4946-9330-f49986a15931", "date_created": "2026-03-02T18:01:31.811000Z", "date_modified": "2026-03-02T18:01:31.811000Z", "file_hash": "20e3c83ceb6e751446f161cfa204091a73bf95956d0e23e34bea9a45ab7c3807", "private": false, "record": { "abstract": "Proteins created by combinatorial methods in vitro are an important source of\ninformation for understanding sequence-structure-function relationships.\nAlignments of folded proteins from combinatorial libraries can be analyzed\nusing methods developed for naturally occurring proteins, but this neglects the\ninformation contained in the unfolded sequences of the library. We introduce\ntwo algorithms, logistic regression and excess information analysis, that use\nboth the folded and unfolded sequences and compare them against contingency\ntable and statistical coupling analysis, which only use the former. The test\nset for this benchmark study is a library of fictitious proteins that fold\naccording to a hypothetical energy model. Of the four methods studied, only\nlogistic regression is able to correctly recapitulate the energy model from the\nsequence alignment. The other algorithms predict spurious interactions between\nalignment positions with strong but individual influences on protein stability.\nWhen present in the same protein, stabilizing amino acids tend to lower the\nenergy below the threshold needed for folding. As a result, their frequencies\nin the alignment can be correlated even if the positions do not interact. We\nbelieve any algorithm that neglects the nonlinear relationship between folding\nand energy is susceptible to this error.", "arxiv_id": "q-bio/0505018", "authors": [ "Jeffrey B. Endelman", "Jesse D. Bloom", "Christopher R. Otey", "Marco Landwehr", "Frances H. Arnold" ], "categories": [ "q-bio.BM" ], "title": "Inferring interactions from combinatorial protein libraries", "url": "https://arxiv.org/abs/q-bio/0505018" }, "schema_id": "dorsal/arxiv", "source": { "execution_id": "c345d44a-3ca5-4dc3-b19c-387a7a99e6eb", "id": "arXiv Dataset IDs", "type": "Model", "variant": "snapshot-2026-03-01", "version": "0.1.0" }, "user_id": 1000002 }