dorsal/arxiv
Developing optimal nonlinear scoring function for protein design
| Authors | Changyu Hu, Xiang Li, Jie Liang |
|---|---|
| Categories | q-bio.BM |
| ArXiv ID | q-bio/0407040 |
| URL | https://arxiv.org/abs/q-bio/0407040 |
Abstract
Motivation. Protein design aims to identify sequences that are compatible with a given protein fold but incompatible with any alternative fold. To select the correct sequences and to guide the search process, a design scoring function is critically important. Such a scoring function should be able to characterize the global fitness landscape of many proteins simultaneously.
Results. To find optimal design scoring functions, we introduce two geometric views and propose a formulation using a mixture of nonlinear Gaussian kernel functions. We aim to solve a simplified protein sequence design problem. Our goal is to distinguish each native sequence for a major portion of representative protein structures from a large number of alternative decoy sequences, each a fragment from a protein of a different fold. Our scoring function discriminates perfectly a set of 440 native proteins from 14 million sequence decoys. We show that no linear scoring function can succeed in this task. In a blind test of unrelated proteins, our scoring function misclassifies only 13 native proteins out of 194. This compares favorably with about 3-4 times more misclassifications when optimal linear functions reported in the literature are used. We also discuss how to develop a protein folding scoring function.
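The abstract's central idea, a scoring function built from a mixture of Gaussian kernels that separates natives from decoys where no linear function can, can be illustrated with a minimal sketch. Everything below is a toy assumption and not the authors' fitted function: the two-dimensional descriptor vectors, the uniform weights, the kernel width `sigma`, and the sign convention (negative score means native-like) are all hypothetical choices made for illustration. The four points form an XOR-style layout, which no linear score of the form w·c + b can separate.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def score(c, natives, decoys, w_native, w_decoy, sigma=1.0):
    """Mixture-of-Gaussians score for a descriptor vector c.

    Assumed convention (illustrative only): kernels centered on decoys
    raise the score, kernels centered on natives lower it, so a negative
    score means "native-like".
    """
    decoy_term = sum(w * gaussian_kernel(c, d, sigma)
                     for w, d in zip(w_decoy, decoys))
    native_term = sum(w * gaussian_kernel(c, n, sigma)
                      for w, n in zip(w_native, natives))
    return decoy_term - native_term

# Hypothetical toy descriptors in an XOR-style layout: natives sit on one
# diagonal, decoys on the other, so no linear function separates them.
natives = [(0.0, 0.0), (1.0, 1.0)]
decoys = [(0.0, 1.0), (1.0, 0.0)]
w_native = [1.0, 1.0]  # uniform weights, chosen by hand rather than optimized
w_decoy = [1.0, 1.0]

native_scores = [score(c, natives, decoys, w_native, w_decoy, sigma=0.5)
                 for c in natives]
decoy_scores = [score(c, natives, decoys, w_native, w_decoy, sigma=0.5)
                for c in decoys]

print("native scores:", native_scores)  # all negative in this toy setup
print("decoy scores: ", decoy_scores)   # all positive in this toy setup
```

In a real setting the weights would be obtained by optimization over the training natives and decoys rather than set by hand; the sketch only shows why a nonlinear kernel mixture can succeed on a layout where a linear function provably fails.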
{
"annotation_id": "9bb41057-9c5b-4d1e-a3f1-968d3e232551",
"date_created": "2026-03-02T18:01:31.485000Z",
"date_modified": "2026-03-02T18:01:31.485000Z",
"file_hash": "a1716e19c6b7f4551dd9a9546030079fcf9912a2f0ff97b7fdafe68947b3612f",
"private": false,
"record": {
"abstract": "Motivation. Protein design aims to identify sequences compatible with a given\nprotein fold but incompatible to any alternative folds. To select the correct\nsequences and to guide the search process, a design scoring function is\ncritically important. Such a scoring function should be able to characterize\nthe global fitness landscape of many proteins simultaneously.\n Results. To find optimal design scoring functions, we introduce two geometric\nviews and propose a formulation using mixture of nonlinear Gaussian kernel\nfunctions. We aim to solve a simplified protein sequence design problem. Our\ngoal is to distinguish each native sequence for a major portion of\nrepresentative protein structures from a large number of alternative decoy\nsequences, each a fragment from proteins of different fold. Our scoring\nfunction discriminate perfectly a set of 440 native proteins from 14 million\nsequence decoys. We show that no linear scoring function can succeed in this\ntask. In a blind test of unrelated proteins, our scoring function misclassfies\nonly 13 native proteins out of 194. This compares favorably with about 3-4\ntimes more misclassifications when optimal linear functions reported in\nliterature are used. We also discuss how to develop protein folding scoring\nfunction.",
"arxiv_id": "q-bio/0407040",
"authors": [
"Changyu Hu",
"Xiang Li",
"Jie Liang"
],
"categories": [
"q-bio.BM"
],
"title": "Developing optimal nonlinear scoring function for protein design",
"url": "https://arxiv.org/abs/q-bio/0407040"
},
"schema_id": "dorsal/arxiv",
"source": {
"execution_id": "46e9c7fd-7eed-41df-ba68-b47ed18641bc",
"id": "arXiv Dataset IDs",
"type": "Model",
"variant": "snapshot-2026-03-01",
"version": "0.1.0"
},
"user_id": 1000002
}