Augmenting RNA-Ligand Binding Prediction with Machine Learning: A Leap Towards Enhanced Drug Discovery

July 18, 2024

In pharmaceutical research, the exploration of RNA-ligand interactions is a significant challenge, marking a stark contrast to the more developed understanding of protein-ligand interactions.


Despite extensive study, RNA-ligand interactions remain relatively uncharted, which complicates the development of RNA-targeted drugs. To tackle this challenge, we partnered with scientists from the International Institute of Molecular and Cell Biology in Warsaw (IIMCB).

The researchers provided us with comprehensive datasets, to which we applied our machine learning expertise to develop predictive models for RNA-ligand binding.

This collaboration led to very promising results: on test sets, our models achieved Area Under the Receiver Operating Characteristic curve (AUC) values between 0.65 and 0.68. This surpasses molecular docking, which currently delivers state-of-the-art results, is widely used for virtual screening, and reaches roughly 0.5 to 0.6 on the same RNAs. This approach marks a significant step forward in RNA-targeted drug discovery.

Figure 1. **AUC curves and scores** reflecting performance of molecular docking as well as our model’s performance in test and validation conditions. Test on RNA X: model trained on two other RNAs and tested on RNA X. Validation on RNA X: average performance on a held-out subset of samples for RNA X when training on RNA X and another RNA. The class of the ligands (i.e. binders or non-binders) was determined experimentally.

The Challenge in Detail

Certain RNA molecules have been identified as crucial targets for therapeutic agents, highlighting the potential of RNA molecules in revolutionizing drug development. These include bacterial ribosomes and the human pre-mRNA of the survival of motor neuron 2 (SMN2) protein, targeted by specific drugs like bacterial ribosome-targeting antibiotics and risdiplam, respectively. Moreover, other RNAs such as mRNAs, regulatory RNAs in humans, riboswitches in bacteria, and conserved non-coding RNAs in viruses are acknowledged as promising candidates for novel therapeutics.

The flexible and dynamic nature of RNA structures poses significant challenges for in silico modeling and prediction, making it difficult to accurately target RNAs with drugs. Protein domains are in most cases relatively rigid and well-defined, which makes them easier to model, and tools building on AlphaFold’s results and architecture provide a solid foundation. RNAs lack similar structural predictability, which complicates the drug design process. Another challenge is the limited availability of experimental data on RNA and small molecule ligands, which makes careful curation and preparation of the dataset particularly important in method development.

“Effective prediction of binding of small molecule ligands to RNA is the ultimate challenge of rational drug discovery. The Machine Learning-based methods developed with Appsilon are taking us closer to that goal.”

Filip Stefaniak, PhD
International Institute of Molecular and Cell Biology in Warsaw

Our Approach

Combining Expertise:

Our collaboration with IIMCB brought together expertise in RNA structural biology, bioinformatics, and machine learning. We teamed up with RNA-ligand interaction specialists Filip Stefaniak and Natalia Szulc, who created initial models for predicting the binding of small molecule ligands to RNA [1]. Our role was to contribute our experience in building custom neural networks, in particular ones that handle variable-length input, reflecting the differing lengths of RNA sequences.

Data Curation and Preparation:

The IIMCB researchers provided us with carefully curated datasets containing experimental results on the interactions between three RNAs and tens of thousands of small molecules, known as ligands. The data showed whether or not a ligand binds to a specific RNA. For every pair of RNA and ligand 3D structures, their software, fingeRNAt, generated for each nucleotide in the RNA a sequence of numbers representing the nature of its noncovalent interactions with the ligand.

Leveraging Transformer Architecture:

To handle variable-length sequences of nucleotides, we used the transformer architecture, best known as the basis of large language models. Because it can process sequences of varying length, it is well suited to RNA-ligand binding prediction. As it turns out, it is useful for building smaller models as well!
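To illustrate how attention copes with RNAs of different lengths, here is a minimal pure-Python sketch. It is not the model built in this project: it uses a single attention head with identity projections (no learned weights), and the helper names `pad_batch`, `masked_self_attention`, and `masked_mean_pool`, as well as the toy feature vectors, are assumptions made for illustration. The point it shows is that padded positions are masked out, so the pooled representation of an RNA does not depend on how much padding was added.

```python
import math

def pad_batch(sequences, pad_value=0.0):
    """Pad per-nucleotide feature sequences to a common length; return data and masks."""
    max_len = max(len(seq) for seq in sequences)
    dim = len(sequences[0][0])
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append([list(v) for v in seq] + [[pad_value] * dim for _ in range(n_pad)])
        masks.append([True] * len(seq) + [False] * n_pad)  # True marks a real nucleotide
    return padded, masks

def masked_self_attention(x, mask):
    """Single-head scaled dot-product self-attention that ignores padded positions."""
    dim = len(x[0])
    out = []
    for i, query in enumerate(x):
        if not mask[i]:
            out.append([0.0] * dim)  # padded rows carry no information
            continue
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) if mask[j] else -math.inf
            for j, key in enumerate(x)
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # masked positions get weight 0.0
        total = sum(weights)
        out.append([sum(w / total * v[c] for w, v in zip(weights, x)) for c in range(dim)])
    return out

def masked_mean_pool(x, mask):
    """Average only the real (unmasked) positions into one fixed-size vector."""
    real = [v for v, m in zip(x, mask) if m]
    return [sum(col) / len(real) for col in zip(*real)]

# Toy per-nucleotide features for two RNAs of different lengths (made-up numbers).
rna_short = [[1.0, 0.0], [0.0, 1.0]]
rna_long = [[1.0, 1.0], [0.5, 0.0], [0.0, 2.0], [1.0, 0.0]]

padded, masks = pad_batch([rna_short, rna_long])
pooled = [masked_mean_pool(masked_self_attention(x, m), m) for x, m in zip(padded, masks)]
print(pooled)  # one fixed-size vector per RNA, regardless of its length
```

A real transformer adds learned query, key, and value projections, multiple heads, positional information, and feed-forward layers, and would normally be built with a deep learning framework rather than by hand.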

We aimed to create models that could learn from data determined for some RNAs and then predict binding for other RNAs, where little to no binding data might exist. To achieve this, we trained and validated our models with data from two of the RNAs, and then tested the models' performance using the third RNA. This strategy ensured that our models could potentially be applied to a broad range of RNAs in future screenings.
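The evaluation scheme described above can be sketched as a leave-one-RNA-out split. This is a hedged illustration rather than the project's actual pipeline: the record keys (`rna`, `ligand`, `binds`), the function name `leave_one_rna_out_splits`, the 20% validation fraction, and the toy data are all assumptions.

```python
import random

def leave_one_rna_out_splits(samples, val_fraction=0.2, seed=0):
    """Yield (test_rna, train, val, test), holding out each RNA in turn.

    Training and validation samples come only from the remaining RNAs,
    so the test RNA is never seen during model fitting or selection.
    """
    rng = random.Random(seed)
    for test_rna in sorted({s["rna"] for s in samples}):
        test = [s for s in samples if s["rna"] == test_rna]
        rest = [s for s in samples if s["rna"] != test_rna]
        rng.shuffle(rest)  # shuffles the new list, not the caller's data
        n_val = int(len(rest) * val_fraction)
        yield test_rna, rest[n_val:], rest[:n_val], test

# Toy dataset: three hypothetical RNAs with ten ligands each.
samples = [
    {"rna": rna, "ligand": f"lig{i}", "binds": i % 3 == 0}
    for rna in ("RNA_A", "RNA_B", "RNA_C")
    for i in range(10)
]

folds = list(leave_one_rna_out_splits(samples))
for test_rna, train, val, test in folds:
    print(test_rna, len(train), len(val), len(test))
```

Splitting by RNA rather than by individual sample is what makes the test score an estimate of performance on an RNA for which no binding data was available during training.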

Achievements and Results

Our collaborative efforts bore fruit: our models achieved AUC scores between 0.65 and 0.68 on test sets and higher scores of 0.70 to 0.72 on validation sets, suggesting potential for better performance with additional data or refined models. This is a significant improvement over existing approaches, which in our test setup achieve AUC scores in the range of 0.53 to 0.61 (see Figure 1).
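For readers less familiar with AUC, the sketch below computes it from its rank interpretation: the probability that a randomly chosen binder receives a higher score than a randomly chosen non-binder, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are assumptions for illustration, not data from this project.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (binder, non-binder) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one binder and one non-binder")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

# Toy example: 2 binders, 3 non-binders; 5 of the 6 pairs are ranked correctly.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
print(round(roc_auc(labels, scores), 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one. The pairwise loop is quadratic, so for datasets with tens of thousands of ligands a sorted-rank implementation such as scikit-learn's `roc_auc_score` would be the practical choice.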

In addition to AUC, our evaluation included the enrichment factor at 10% (EF₁₀), a metric that captures how well a method surfaces true positives early in the screening process. EF₁₀ is the ratio of the fraction of active compounds among the top 10% of samples ranked by the model to the fraction of active compounds in the whole dataset, so it indicates how efficiently the model prioritizes potential binders. Our models achieved EF₁₀ scores of around 3, compared to state-of-the-art (SoTA) results, which only reach between 1.1 and 2.0 on this dataset (see Figure 2).
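A minimal sketch of how EF₁₀ can be computed from model scores follows, using the standard definition of the enrichment factor. The function name `enrichment_factor` and the toy data are assumptions for illustration.

```python
def enrichment_factor(labels, scores, fraction=0.10):
    """Binder rate in the top-scored fraction divided by the overall binder rate."""
    n_total = len(labels)
    n_binders = sum(labels)
    if n_total == 0 or n_binders == 0:
        raise ValueError("need a non-empty dataset with at least one binder")
    n_top = max(1, int(round(n_total * fraction)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:n_top])
    return (hits / n_top) / (n_binders / n_total)

# Toy example: 20 ligands, 4 binders, scores already in descending order.
labels = [1 if i % 5 == 0 else 0 for i in range(20)]  # binders at positions 0, 5, 10, 15
scores = [float(20 - i) for i in range(20)]

# The top 10% is 2 ligands and contains 1 binder, so EF10 = (1/2) / (4/20) = 2.5.
ef10 = enrichment_factor(labels, scores)
print(round(ef10, 2))
```

Random ranking gives an EF₁₀ of 1 on average, and the largest possible value at a 10% cutoff is 10. Multiplying EF₁₀ by 10% gives the share of all binders recovered in the top 10%.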

Figure 2. The percentage of all binders in the datasets that are ranked in the top 10% of the scores by each method, equal to EF₁₀ × 10%. *Test/validation conditions: see Fig. 1 caption.*


Implications for Drug Discovery

Our approach may have profound implications for drug discovery. With an EF₁₀ score of 3, our models demonstrate that, for the RNAs and molecules tested, selecting just 10% of the ligands and testing them in a wet lab would likely yield approximately 30-45 active binders from a dataset containing about 100-150 binders overall. This efficiency has the potential to drastically reduce the amount of time and resources needed for screening by increasing the likelihood of discovering an effective drug among those top-ranked binders. By effectively narrowing the field of candidates, our models enable a more focused and economically feasible approach to initial drug testing, and we look forward to making further improvements that greatly enhance the capabilities of our models.
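The arithmetic behind these numbers can be checked with a short sketch. The helper name `expected_binders_found` is an assumption, and the random-selection baseline uses the fact that EF₁₀ equals 1 for a random ranking.

```python
def expected_binders_found(n_binders, ef, top_percent=10):
    """Expected binders in the top-ranked slice: n_binders * EF * (top_percent / 100)."""
    return min(n_binders, n_binders * ef * top_percent / 100)

for n_binders in (100, 150):
    with_model = expected_binders_found(n_binders, ef=3.0)
    at_random = expected_binders_found(n_binders, ef=1.0)
    print(n_binders, with_model, at_random)
```

For 100 to 150 binders, an EF₁₀ of 3 corresponds to 30 to 45 binders in the selected 10%, against the 10 to 15 expected from picking the same number of ligands at random.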

Moreover, the rapid development and deployment of our models underscore our capability to deliver significant advancements in predictive analytics for pharmaceutical research. These results were achieved within a limited preliminary effort, on a relatively small dataset, illustrating our team's ability to quickly adapt and develop customized machine learning models tailored to specific datasets. Partnering with us allows you to leverage our expertise and cutting-edge technology to accelerate your drug discovery processes, ultimately reducing costs and expediting the development of new treatments.

Develop your AI models with us!

The advent of AI in biotechnology heralds a new era of innovation and discovery, with machine learning at its core catalyzing advancements in drug discovery and biomedical research. Our pioneering work in RNA-ligand binding prediction is a testament to the transformative potential of integrating AI with biotechnology. If your organization is looking to sort through complex data more efficiently or speed up your research and development efforts, we’re ready to help.

As we continue to explore the frontiers of cheminformatics and biology, we envision ourselves contributing significantly to this revolution, aiding in the discovery of new drug targets, the development of novel therapeutics, and expanding our collective understanding of the molecular world. To understand better how our past projects have paved the way for innovative developments, we invite you to explore our previous work, in particular how we apply computer vision to speed up drug development.

Let’s discuss how we can support your goals with a customized proof-of-concept model. Contact us to see how a partnership can benefit your research needs and help forge the path to groundbreaking discoveries in pharmaceutical research.

[1] Szulc NA, Mackiewicz Z, Bujnicki JM, Stefaniak F (2022) fingeRNAt—A novel tool for high-throughput analysis of nucleic acid-ligand interactions. PLoS Comput Biol 18(6): e1009783. https://doi.org/10.1371/journal.pcbi.1009783


This blog post was co-authored by Jędrzej Świeżewski, Natalia Szulc and Filip Stefaniak.
