Introduction

The pharmaceutical industry faces a steep challenge in drug discovery: the estimated cost of developing a new drug reached $6.16 billion in 2023, more than double the adjusted cost of $2.87 billion in 2016. Drug development is lengthy and expensive, and the preclinical research and drug discovery stages demand substantial time and money. Because so many candidates fail during drug screening and optimization, preclinical research accounts for roughly 43% of total costs. Moreover, despite the growing reliance on artificial intelligence (AI) and machine learning (ML) in drug discovery, many of the challenges posed by data and model evaluation remain underappreciated. This article explores the critical data-related issues in AI-driven drug discovery, along with strategies to overcome them and so improve the efficiency and efficacy of the drug development process.

The Current State of Drug Discovery and the Role of AI

The early stages of drug discovery—particularly drug screening and optimization—are often the most time-consuming and costly. Large pharmaceutical companies have historically been slow to adopt more efficient methods, and research shows that a significant proportion of drug discovery projects fail to produce viable lead compounds. A study by GlaxoSmithKline (GSK) revealed that 93% of their antibacterial projects failed to produce any lead candidates. Unlike large pharmaceutical companies, smaller biotech firms typically demonstrate higher research and development (R&D) efficiency. As a result, many major pharmaceutical companies are increasingly turning to biotech firms and academic drug discovery centers to collaborate on promising new drug candidates.

AI, particularly machine learning, has emerged as a powerful tool in accelerating drug discovery. Many AI-driven biotech startups are signing lucrative partnerships with major pharmaceutical companies, demonstrating the vast potential of AI in small-molecule drug discovery. However, while the potential of AI in this field is widely acknowledged, its implementation is far from straightforward. The application of AI faces significant data-related challenges, some of which are unique to drug discovery, while others are common to many scientific disciplines.

This article delves into the various data-related challenges encountered in AI-based drug discovery, as well as strategies to mitigate these issues, focusing on aspects such as data bias, inconsistency, model evaluation, and the potential biases of researchers involved in the process.

Key Data Challenges in AI-Driven Drug Discovery

  • Data Bias

AI models in drug discovery rely heavily on large training datasets, and any bias in these datasets can severely limit how well the models generalize. For instance, a drug activity prediction model may be biased toward molecules within a specific distribution range, making it difficult for the model to extrapolate to new, previously uncharted regions. The result is reduced predictive accuracy on molecules that differ from those in the original training set. To mitigate this issue, researchers recommend training-test splitting strategies that account for distribution shift, which help estimate how changes in the data distribution affect model performance.
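One simple splitting strategy that accounts for distribution shift is to keep related molecules together, so that the test set contains only groups the model has never seen. The sketch below assumes each molecule already carries a group label (for example a scaffold or cluster ID); the labels and the function name `group_split` are hypothetical, and how the groups are computed is outside the scope of the example.

```python
import random
from collections import defaultdict

def group_split(group_ids, test_fraction=0.2, seed=0):
    """Split sample indices so that no group appears in both train and test.

    group_ids: one hashable, sortable label per molecule (assumed to be
    precomputed, e.g. a scaffold or cluster ID).
    """
    groups = defaultdict(list)
    for idx, gid in enumerate(group_ids):
        groups[gid].append(idx)

    keys = sorted(groups)              # deterministic order before shuffling
    random.Random(seed).shuffle(keys)

    n_test_target = test_fraction * len(group_ids)
    train, test = [], []
    for key in keys:
        # Assign whole groups to the test set until the target size is reached.
        if len(test) < n_test_target:
            test.extend(groups[key])
        else:
            train.extend(groups[key])
    return sorted(train), sorted(test)

# Hypothetical group labels for ten molecules.
scaffolds = ["A", "A", "A", "B", "B", "C", "C", "D", "E", "E"]
train_idx, test_idx = group_split(scaffolds, test_fraction=0.3, seed=1)
```

Comparing the score on such a split with the score on a purely random split gives a rough indication of how much performance depends on the test molecules resembling the training molecules.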

  • Data Inconsistency

Variability in experimental conditions across different laboratories can result in data inconsistencies, which may affect the generalization capability of AI models. For example, even if the same cell lines and drug conditions are used, drug activity measurements may vary significantly between laboratories. Standardizing experimental procedures is essential to minimize these inconsistencies. Cross-validation techniques are also recommended to detect and assess the degree of variability in the data, providing a more robust evaluation of model performance.
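One way to use cross-validation for this purpose is to hold out one laboratory at a time: if the error on the held-out laboratory is much larger than the error within laboratories, lab-specific effects are probably present. The following is a minimal sketch, assuming every measurement is tagged with a source label; the labels and the function name `leave_one_lab_out` are hypothetical.

```python
def leave_one_lab_out(lab_ids):
    """Yield (held_out_lab, train_indices, test_indices) once per laboratory."""
    for lab in sorted(set(lab_ids)):
        test = [i for i, source in enumerate(lab_ids) if source == lab]
        train = [i for i, source in enumerate(lab_ids) if source != lab]
        yield lab, train, test

# Hypothetical source labels for six drug activity measurements.
labs = ["lab1", "lab1", "lab2", "lab2", "lab3", "lab3"]
folds = list(leave_one_lab_out(labs))
```

Each fold trains on all laboratories except one and tests on the one left out, so the spread of the per-fold scores reflects between-laboratory variability.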

  • Class Imbalance

In drug discovery datasets, active molecules are typically far outnumbered by inactive ones, leading to class imbalance. This imbalance can skew the model’s predictions, making it more likely to predict inactivity for the majority of molecules. To address this, techniques such as generating synthetic negative samples (presumed inactive compounds), active learning, oversampling of the minority class, and semi-supervised learning can be applied to improve model performance and make the models more sensitive to active molecules.
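Of the techniques listed, random oversampling is the simplest to illustrate: minority-class examples are duplicated until both classes are the same size. The sketch below assumes binary labels (1 for active, 0 for inactive); the molecule identifiers and the function name `oversample_minority` are hypothetical.

```python
import random

def oversample_minority(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until classes are balanced."""
    rng = random.Random(seed)
    actives = [s for s, y in zip(samples, labels) if y == 1]
    inactives = [s for s, y in zip(samples, labels) if y == 0]
    if len(actives) < len(inactives):
        minority, majority, minority_label = actives, inactives, 1
    else:
        minority, majority, minority_label = inactives, actives, 0
    if not minority:
        raise ValueError("cannot oversample an empty class")
    # Draw with replacement from the minority class to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(samples) + extra, list(labels) + [minority_label] * len(extra)

# Hypothetical training set: 2 actives (label 1) and 6 inactives (label 0).
mols = ["m0", "m1", "m2", "m3", "m4", "m5", "m6", "m7"]
y = [1, 0, 0, 1, 0, 0, 0, 0]
bal_mols, bal_y = oversample_minority(mols, y)
```

Oversampling should be applied to the training split only; duplicating molecules before splitting would place copies of the same compound in both training and test sets and inflate the measured performance.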

  • Small Datasets

Many tasks in drug discovery involve small datasets, which can present challenges for supervised learning models. Small datasets often result in overfitting, where models perform well on the training set but fail to generalize to new data. To overcome this challenge, techniques like self-supervised learning and transfer learning have proven effective. By pre-training models on large, unlabeled datasets and fine-tuning them on small, labeled datasets, these techniques enable models to adapt more effectively to specific drug discovery tasks.

  • High-Dimensional Data

High-dimensional data, such as genomic or metabolomic data, can pose additional challenges for AI models, particularly in cancer drug response prediction tasks. These types of data often contain far more features than samples, which raises the risk of overfitting and makes the model’s performance harder to evaluate reliably. Feature selection becomes a critical technique in this context, as it helps identify the most relevant features that contribute to the model’s predictive power. Dimensionality reduction methods, such as principal component analysis (PCA), can improve model accuracy and generalization, while cross-validation gives a more trustworthy estimate of performance on unseen data.
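As a small illustration of feature selection, the sketch below keeps the k features with the highest variance. This filter is only one of many possible selection criteria and is an assumption of this example rather than a method named in the text; the data and function names are hypothetical. The point it aims to show is that the selection is derived from the training rows alone and then reused unchanged on the test rows.

```python
from statistics import pvariance

def top_variance_features(train_rows, k):
    """Return the column indices of the k highest-variance features."""
    n_features = len(train_rows[0])
    variances = [pvariance([row[j] for row in train_rows]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])

def select_columns(rows, columns):
    """Keep only the given columns of every row."""
    return [[row[j] for j in columns] for row in rows]

# Hypothetical feature matrix: 4 training samples x 5 features.
train = [
    [1.0, 0.0, 5.0, 2.0, 0.1],
    [1.0, 1.0, 9.0, 2.1, 0.1],
    [1.0, 0.0, 1.0, 1.9, 0.1],
    [1.0, 1.0, 7.0, 2.0, 0.1],
]
test = [[1.0, 1.0, 4.0, 2.0, 0.1]]

keep = top_variance_features(train, k=2)   # chosen from training rows only
train_small = select_columns(train, keep)
test_small = select_columns(test, keep)    # same columns reused, nothing refit
```

When feature selection is combined with cross-validation, the selection step belongs inside each fold; choosing features on the full dataset first lets information from the test folds leak into the model.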

  • Uncertainty Quantification and Model Evaluation

Uncertainty quantification (UQ) is an emerging area of focus in AI-based drug discovery. The reliability of AI predictions can be enhanced by quantifying the uncertainty in the model’s outputs. Approaches such as Gaussian Processes (GP) and Conformal Prediction (CP) are increasingly used to assess model uncertainty. Although these methods have not yet been widely implemented in prospective studies, their potential benefits have been validated in several retrospective studies. By incorporating uncertainty quantification into AI models, researchers can better assess the confidence in their predictions, which is critical for high-stakes drug discovery applications.
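To make the idea concrete, here is a minimal sketch of split conformal prediction for a regression model. It assumes a model has already been trained and that its residuals on a separate calibration set are available; the residual values, the predicted value of 6.5, and the function names are hypothetical.

```python
import math

def conformal_quantile(calibration_residuals, alpha=0.1):
    """Return the half-width of (1 - alpha) split-conformal prediction intervals."""
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]

def predict_interval(point_prediction, half_width):
    """Symmetric interval around a point prediction."""
    return point_prediction - half_width, point_prediction + half_width

# Hypothetical residuals (measured minus predicted activity) on a calibration set.
residuals = [0.1, -0.3, 0.2, -0.5, 0.4, 0.0, -0.2, 0.6, -0.1, 0.3,
             0.2, -0.4, 0.1, -0.6, 0.5, 0.3, -0.2, 0.1, 0.7]
q = conformal_quantile(residuals, alpha=0.1)
low, high = predict_interval(6.5, q)
```

Under the assumption that calibration and new molecules are exchangeable, intervals built this way cover the true value with probability of at least 1 - alpha. That assumption weakens under the distribution shifts discussed earlier, which is one reason prospective validation matters.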

Model Evaluation Challenges

  • Conceptual Errors

One common conceptual error in AI drug discovery concerns overfitting. Overfitting occurs when a model performs exceptionally well on the training dataset but fails to generalize to new, unseen data. However, overfitting is not a binary issue but rather a continuum that depends on the dataset and algorithm used. To properly assess a model’s true performance, multiple random training-test splits should be used rather than a single one. Averaging over several splits, and reporting the spread between them, gives a more accurate picture of a model’s ability to generalize to real-world data.
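The evaluation over multiple random splits can be sketched as follows. The example assumes an `evaluate` callback that trains on one index list and returns a score on the other; the activity values and the trivial mean-predicting baseline are hypothetical placeholders for a real model.

```python
import random
from statistics import mean, stdev

def repeated_split_scores(n_samples, evaluate, n_repeats=10, test_fraction=0.2, seed=0):
    """Call evaluate(train_idx, test_idx) on several random splits and collect the scores."""
    rng = random.Random(seed)
    n_test = max(1, round(test_fraction * n_samples))
    scores = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        scores.append(evaluate(train_idx, test_idx))
    return scores

# Hypothetical activity values for ten molecules.
y = [5.1, 6.3, 4.8, 7.0, 5.5, 6.1, 4.9, 6.8, 5.2, 6.0]

def mean_baseline_mae(train_idx, test_idx):
    """Baseline 'model': predict the training mean, score by mean absolute error."""
    prediction = mean(y[i] for i in train_idx)
    return mean(abs(y[i] - prediction) for i in test_idx)

scores = repeated_split_scores(len(y), mean_baseline_mae, n_repeats=20)
summary = (mean(scores), stdev(scores))
```

Reporting both the mean and the standard deviation across splits shows whether a difference between two models is larger than the variation caused by the choice of split.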

  • Performance Misjudgment

Evaluating model performance with certain standard metrics, such as ROC-AUC, can be misleading, especially for early-recognition tasks in which only the top-ranked compounds will actually be tested. ROC-AUC summarizes ranking quality over the entire dataset, so it can look strong even when few actives appear near the top of the list. Other performance metrics, such as hit rate and normalized enrichment factor (NEF), may better align with the needs of virtual screening tasks. Thus, careful selection of appropriate metrics is essential for evaluating model efficacy in drug discovery.
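The sketch below computes hit rate and enrichment factor (EF) for the top-ranked fraction of a screened library. It takes NEF to be EF divided by the largest EF achievable at that fraction, which is one common definition and an assumption of this example; the scores, labels, and function name are hypothetical.

```python
def top_k_metrics(scores, is_active, fraction=0.1):
    """Hit rate, enrichment factor (EF) and normalized EF in the top-ranked fraction."""
    n = len(scores)
    k = max(1, round(fraction * n))
    n_actives = sum(is_active)
    if n_actives == 0:
        raise ValueError("no actives in the dataset")
    # Rank compounds from highest to lowest predicted score.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    hits = sum(is_active[i] for i in ranked[:k])
    hit_rate = hits / k
    base_rate = n_actives / n
    ef = hit_rate / base_rate
    ef_max = (min(k, n_actives) / k) / base_rate
    return {"hit_rate": hit_rate, "ef": ef, "nef": ef / ef_max}

# Hypothetical screen: ten compounds, two of them active.
predicted = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
active = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = top_k_metrics(predicted, active, fraction=0.2)
```

In this toy case one of the two selected compounds is active, so the hit rate is 0.5 and the selection is 2.5 times richer in actives than the library as a whole, or half of the best possible enrichment at that cutoff.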

  • Unrealistic Benchmarks

Many benchmark datasets in drug discovery focus too much on optimizing specific methods’ performance, often neglecting their practical applicability in real-world scenarios. For example, some datasets aim to enhance chemical diversity by filtering molecules, which can inadvertently exclude potentially active compounds. More realistic benchmarks, like LIT-PCBA, that reflect the real-world challenges of drug discovery are essential for better evaluating the applicability of AI models.

Addressing Bias in Drug Discovery Research

Researcher biases can hinder the effective development and application of AI technologies in drug discovery. These biases can stem from both AI researchers and drug discovery experts.

  • Biases in AI Research

AI researchers may have a tendency to view scientific problems solely through the lens of algorithm optimization, potentially overlooking the importance of domain-specific knowledge. To make meaningful advances, AI teams need to collaborate with experts from the drug discovery field to incorporate crucial biological and chemical insights into AI models.

  • Biases in Drug Discovery

On the other hand, drug discovery experts may harbor skepticism toward AI technologies due to fears of losing control over the process. Some professionals may use AI as a secondary tool rather than leveraging its full potential. Overcoming this requires drug discovery experts to acquire a fundamental understanding of AI and work collaboratively with AI specialists to ensure that AI-driven approaches are used to their maximum benefit.

  • Lack of Prospective Application

Many researchers focus on retrospective studies, which, while useful for validation, fail to offer the practical insights needed for real-world applications. Encouraging more prospective research and validating AI models in real-life scenarios will help build confidence in AI-based drug discovery and facilitate the wider adoption of these technologies.

Future Directions and Conclusion

To address the challenges outlined above, the following strategies are recommended:

  1. Data-Related Solutions: Standardize experimental processes, generate high-quality negative samples, and optimize feature selection methods.
  2. Uncertainty Quantification: Develop more effective tools for predicting and quantifying prediction errors.
  3. Improved Model Evaluation: Use diverse, task-aligned performance metrics and establish more realistic benchmark datasets.
  4. Cross-Disciplinary Collaboration: Foster stronger collaboration between AI researchers and drug discovery experts to ensure a comprehensive approach to AI model development.
  5. Prospective Applications: Encourage more prospective studies to validate AI models in real-world drug discovery applications.

By addressing these issues, AI’s role in drug discovery can be significantly enhanced, potentially reducing the costs and timelines of drug development while opening doors to new and innovative therapies, as well as personalized medicine breakthroughs.
