Annotation: dorsal/arxiv

Authors	Changyu Hu, Xiang Li, Jie Liang
Categories	q-bio.BM
ArXiv ID	q-bio/0407040
URL	https://arxiv.org/abs/q-bio/0407040

Abstract

Motivation. Protein design aims to identify sequences compatible with a given protein fold but incompatible with any alternative fold. To select the correct sequences and to guide the search process, a design scoring function is critically important. Such a scoring function should be able to characterize the global fitness landscape of many proteins simultaneously. Results. To find optimal design scoring functions, we introduce two geometric views and propose a formulation using a mixture of nonlinear Gaussian kernel functions. We aim to solve a simplified protein sequence design problem. Our goal is to distinguish each native sequence for a major portion of representative protein structures from a large number of alternative decoy sequences, each a fragment from a protein of a different fold. Our scoring function perfectly discriminates a set of 440 native proteins from 14 million sequence decoys. We show that no linear scoring function can succeed in this task. In a blind test of unrelated proteins, our scoring function misclassifies only 13 native proteins out of 194. This compares favorably with about 3-4 times more misclassifications when optimal linear functions reported in the literature are used. We also discuss how to develop a protein folding scoring function.
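
As a rough illustration of the kernel formulation mentioned in the abstract, the Python sketch below scores a descriptor vector with a mixture of Gaussian kernels centered on known native and decoy examples. The functional form, the sign convention (negative means native-like), the uniform weights, the value of sigma, and the toy two-dimensional data are all assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import numpy as np


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))


def kernel_score(c, natives, decoys, alpha_native, alpha_decoy, sigma=1.0):
    """Hypothetical mixture-of-kernels score for a descriptor vector c.

    Assumed convention: decoy-centered kernels add to the score and
    native-centered kernels subtract from it, so a negative score
    means "native-like".
    """
    decoy_term = sum(a * gaussian_kernel(c, d, sigma)
                     for a, d in zip(alpha_decoy, decoys))
    native_term = sum(a * gaussian_kernel(c, n, sigma)
                      for a, n in zip(alpha_native, natives))
    return decoy_term - native_term


# Toy descriptors in an XOR layout: no single straight line separates the
# two classes, so a linear score cannot classify all four points correctly.
natives = np.array([[1.0, 0.0], [0.0, 1.0]])
decoys = np.array([[0.0, 0.0], [1.0, 1.0]])
alpha_native = [1.0, 1.0]  # assumed uniform weights; a real method would fit these
alpha_decoy = [1.0, 1.0]
sigma = 0.5                # assumed kernel width

native_scores = [kernel_score(c, natives, decoys, alpha_native, alpha_decoy, sigma)
                 for c in natives]
decoy_scores = [kernel_score(c, natives, decoys, alpha_native, alpha_decoy, sigma)
                for c in decoys]

print("native scores:", native_scores)
print("decoy scores:", decoy_scores)
```

In this toy setting every native point receives a negative score and every decoy a positive one, which mirrors in miniature the abstract's claim that a nonlinear kernel function can separate cases that no linear function can. In a real application the weights would be obtained by optimization over the training set rather than fixed by hand.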

{ "annotation_id": "9bb41057-9c5b-4d1e-a3f1-968d3e232551", "date_created": "2026-03-02T18:01:31.485000Z", "date_modified": "2026-03-02T18:01:31.485000Z", "file_hash": "a1716e19c6b7f4551dd9a9546030079fcf9912a2f0ff97b7fdafe68947b3612f", "private": false, "record": { "abstract": "Motivation. Protein design aims to identify sequences compatible with a given\nprotein fold but incompatible to any alternative folds. To select the correct\nsequences and to guide the search process, a design scoring function is\ncritically important. Such a scoring function should be able to characterize\nthe global fitness landscape of many proteins simultaneously.\n Results. To find optimal design scoring functions, we introduce two geometric\nviews and propose a formulation using mixture of nonlinear Gaussian kernel\nfunctions. We aim to solve a simplified protein sequence design problem. Our\ngoal is to distinguish each native sequence for a major portion of\nrepresentative protein structures from a large number of alternative decoy\nsequences, each a fragment from proteins of different fold. Our scoring\nfunction discriminate perfectly a set of 440 native proteins from 14 million\nsequence decoys. We show that no linear scoring function can succeed in this\ntask. In a blind test of unrelated proteins, our scoring function misclassfies\nonly 13 native proteins out of 194. This compares favorably with about 3-4\ntimes more misclassifications when optimal linear functions reported in\nliterature are used. We also discuss how to develop protein folding scoring\nfunction.", "arxiv_id": "q-bio/0407040", "authors": [ "Changyu Hu", "Xiang Li", "Jie Liang" ], "categories": [ "q-bio.BM" ], "title": "Developing optimal nonlinear scoring function for protein design", "url": "https://arxiv.org/abs/q-bio/0407040" }, "schema_id": "dorsal/arxiv", "source": { "execution_id": "46e9c7fd-7eed-41df-ba68-b47ed18641bc", "id": "arXiv Dataset IDs", "type": "Model", "variant": "snapshot-2026-03-01", "version": "0.1.0" }, "user_id": 1000002 }