ReverTra is a practical tool for mapping protein (amino-acid) sequences to species-optimized codon sequences. It relies on transformer-based AI models developed by Tomer Sidi, Shir Bahiri-Elitzur, Tamir Tuller, and Rachel Kolodny to study the evolutionarily selected codons encoding proteins in 4 species: S. cerevisiae, S. pombe, E. coli, and B. subtilis. For detailed insights into the models, please refer to our paper link. The project code can be found here, along with working notebooks for model inference and data exploration.
Codon usage plays a critical role in the efficiency of protein expression. In biological systems, different species exhibit variations in their preferred codon usage patterns, which can significantly impact translational efficiency and other aspects of gene expression. ReverTra predicts codon sequences for 4 host species: S. cerevisiae, S. pombe, E. coli, and B. subtilis. By providing a user-friendly tool for this purpose, we aim to empower researchers and bioengineers to streamline their protein expression efforts, facilitating more accurate and effective studies across diverse biological contexts.
To generate a codon sequence, the user must provide an out-of-host protein (amino-acid) sequence, specify the target host species, and set the desired expression level of the translated protein. In addition, the model configuration section lets the user define which model to use for generating the sequences: the window size the model was trained on, and whether the model receives a single (amino-acid) sequence or a pair of sequences that also includes an aligned codon sequence from one of the original training hosts (i.e., mimicking).
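The sketch below shows how these inputs might be assembled into a single request; the field names and values are illustrative assumptions for documentation purposes, not the tool's actual API.

```python
# Illustrative sketch of the inputs described above (assumed field names, not the tool's API).
request = {
    "protein": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",  # out-of-host amino-acid sequence
    "host": "S. cerevisiae",     # target host: S. cerevisiae, S. pombe, E. coli, or B. subtilis
    "expression_level": "high",  # desired expression level of the translated protein
    "window_size": 50,           # model window size: 10/30/50/75/100/150
    "inference_type": "mask",    # "mask" (AA sequence only) or "mimic" (AA + aligned codons)
}
print(request)
```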
* For inference on the test-set proteins from the paper, please visit ReverTra-Evaluation-TestSets.
(1) Inference type - Mask/Mimic; the two inference types are presented in the paper. In mask mode, the input to the model is the AA sequence of the target protein. In mimic mode, an additional codon sequence aligned to the target AA sequence is also provided to the model (see the sketch after this list).
(2) Model window size - 10/30/50/75/100/150; in the paper, we present models trained with different window sizes, and this option selects which of them is used for prediction.
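As a rough illustration of how the two inference types differ, the following sketch pairs each amino acid with either a mask token or an aligned codon. The token format here is an assumption made for clarity and does not reflect the exact encoding used by the models.

```python
# Hypothetical encoding, for illustration only.
aa_seq = "MKT"                      # target protein (amino acids)
ref_codons = ["ATG", "AAA", "ACC"]  # codon sequence aligned to the target AA sequence (mimic mode)

# Mask mode: the model sees only the amino-acid sequence; every codon position is masked.
mask_input = [(aa, "<mask>") for aa in aa_seq]

# Mimic mode: each amino acid is paired with the aligned codon from a reference host gene.
mimic_input = list(zip(aa_seq, ref_codons))

print(mask_input)   # [('M', '<mask>'), ('K', '<mask>'), ('T', '<mask>')]
print(mimic_input)  # [('M', 'ATG'), ('K', 'AAA'), ('T', 'ACC')]
```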