Publications
Thesis, Jan 2025

Machine Learning Approaches to Understanding Codon Choice

Sakharova, HA
Product Used
Variant Libraries
Abstract
Identical proteins can be encoded in DNA using different synonymous codons, which are translated by the ribosome at different rates. The mechanisms by which, and the extent to which, codon choice impacts biological processes remain a fundamental open question. Elucidating the rules governing codon choice is vital both to understanding disorders caused by synonymous mutations and to improving our ability to design synthetic mRNAs. The structure and function of a protein may set requirements on the process of translation that create pressure to select for more slowly or more quickly translated codons. Leveraging existing protein language models, I build a machine learning model to predict codon choice from amino acid sequence. My model effectively combines information about position and protein structure to learn subtle but wide-reaching constraints on codon choice in yeast. In parallel, I conduct a genome-wide screen in yeast to reliably identify synonymous variants that significantly decrease or increase fitness, using Cas9 retron editing to create thousands of synonymous codon substitutions in endogenous loci. Lastly, we extend our exploration of codon usage to create Trias, a generative codon-language model applicable to human sequences. We demonstrate that Trias can be used to generate realistic mRNA sequences with high protein output.
