Quantitative Biology

2604 Submissions

[1] viXra:2604.0030 [pdf] submitted on 2026-04-10 20:07:00

Reshaping ESM-2 Representation Geometry for Viral Protein Classification

Authors: Ishir Rao
Comments: 8 Pages.

Distinguishing viral proteins from their human host counterparts is a fundamental challenge in computational virology, with direct implications for gene therapy vector design and antiviral therapeutics. We present a systematic comparison of three classification frameworks on 26,771 SwissProt-reviewed sequences (6,350 viral, 20,421 human): a TF-IDF k-mer Random Forest baseline, a logistic regression probe on frozen ESM-2 embeddings [1], and a supervised contrastive learning (SupCon [2]) projection head trained on those same embeddings. The k-mer baseline achieves 84% overall accuracy but fails on viral sequences (recall = 40%), while ESM-2 embeddings alone raise accuracy to 98% and viral recall to 96%, confirming that evolutionary pretraining encodes substantial host-viral discriminative signal without any task-specific supervision. Supervised contrastive fine-tuning further improves overall accuracy to 98.69% and viral F1 to 0.97, but the most consequential gains appear among proteins where biology itself is ambiguous: viral sequences that have evolved human-like surface features to evade immune detection show a disproportionate improvement under contrastive training, with mean classification accuracy on host-mimicry proteins rising from 55.5% (k-mers) to 69.4% (ESM-2) to 96.1% (ESM-2 + SupCon), a 26.7 percentage-point gain over frozen ESM-2 embeddings that is attributable directly to the contrastive objective. Manifold analysis via UMAP confirms that SupCon progressively restructures the embedding geometry over training, tightening intra-class cohesion and widening the inter-class margin in precisely the regions where host and viral proteomes overlap most.
Category: Quantitative Biology
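
For readers unfamiliar with the supervised contrastive objective named in the abstract, the following is a minimal sketch of a SupCon-style loss in plain Python. It is an illustration under stated assumptions, not the author's implementation: the function names `l2_normalize` and `supcon_loss`, the temperature value of 0.1, and the toy 2-D vectors are all hypothetical, and a real training run would use an autodiff framework on the projection-head outputs rather than Python lists.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss averaged over anchors that have a positive.

    For each anchor, every other sample with the same label is a positive;
    all other samples form the softmax denominator. The temperature default
    is an assumed value, not one taken from the abstract.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        # Temperature-scaled similarity of anchor i to every other sample.
        logits = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n) if a != i
        }
        positives = [a for a in logits if labels[a] == labels[i]]
        if not positives:
            continue  # an anchor with no same-class partner contributes nothing
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

# Toy vectors: two tight pairs pointing in nearly orthogonal directions.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matched = supcon_loss(toy, ["viral", "viral", "human", "human"])
loss_mismatched = supcon_loss(toy, ["viral", "human", "viral", "human"])
print(loss_matched < loss_mismatched)
```

The loss is small when same-class embeddings already sit close together and large when class labels cut across the geometric clusters, which is the pressure that pulls same-class points together and pushes the two classes apart during training.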