Introduction
Digital technologies—computers, smartphones, the internet, and other devices that work with digital data—let us store, process, and share information electronically, making them the backbone of digitalization [1]. They power almost every field, from business and education to entertainment, and their influence on healthcare has grown rapidly. In medicine, digitalization appears in telemedicine, where video or phone consultations connect patients and clinicians without a physical visit—especially valuable for people in remote or underserved areas [2]. High-resolution imaging tools such as CT and MRI reveal detailed internal views that guide diagnosis and treatment, while medical robots can dispense drugs or assist in surgery. By integrating these technologies into care delivery and risk assessment, healthcare systems can improve outcomes, boost efficiency, and lower costs.
With the rise of digital technologies, the field of toxicology is moving toward digital toxicology [3]. Traditionally, toxicology has relied on small-scale experimental approaches to assess the adverse effects of chemicals on living organisms. These methods, while scientifically rigorous, are time-consuming, costly, and often limited in their ability to evaluate the growing number of chemicals in modern industrial and environmental contexts. Therefore, new digitalized methods for toxicity testing, such as big data [4], artificial intelligence (AI) for toxicity prediction [5], molecular dynamics [6], and physiologically based pharmacokinetic modeling [6], are being developed rapidly.
Among them, big data and AI have emerged as core digital toxicological techniques in recent years [7]. The proliferation of big data—driven by high-throughput screening (HTS), multi-omics platforms, environmental monitoring, and extensive toxicological databases—creates an unprecedented opportunity for data-driven toxicology. When coupled with AI, these data are shifting the field toward a more predictive, efficient, and systems-based discipline [8]. Machine learning (ML) algorithms and other AI techniques enable the analysis of high-dimensional datasets, uncovering patterns and relationships that may be invisible to traditional statistical methods [9]. From predicting the toxicity of chemicals to identifying biomarkers of exposure, AI empowers researchers to extract actionable insights from big data [10,11].
In response, this review paper explores how big data and AI are used in toxicology, highlighting both current and emerging applications. It also identifies key challenges and future directions for AI-based toxicity prediction, ultimately contributing to the modernization of chemical risk assessment.
Big Data in Toxicology
The term "big data" refers to structured or unstructured datasets that encompass a wide variety of data types and are generated at high speed and in large volumes, often requiring high-performance computing and advanced computational approaches for effective analysis[12]. Large public databases store vast amounts of toxicity data, regulation, exposure, network information which are crucial for data-driven toxicological research (Figure 1).
In vitro mechanism data
In vitro mechanism data include high-throughput and high-content screening data, as well as data generated through omics technologies and gene arrays, such as transcriptomics, metabolomics, proteomics, and microbiome studies. These data are often used for biomarker identification or mechanism-based chemical screening in various studies (Ref). Moreover, a prerequisite for applying AI approaches is the availability of large-scale datasets, since sufficiently large training sets are needed to develop reliable models without overfitting. Diverse sources of in vitro mechanism data are therefore used to develop toxicity prediction models. For example, MoleculeNet is widely used for training AI-based toxicity prediction models; beyond ensuring data homogeneity and sufficiency, it provides a standardized data format that is easily interpretable by modelers. As AI research in toxicology advances, even datasets that require specialized toxicological expertise are being utilized in increasingly diverse applications. Toxicogenomics datasets, such as those in ToxCast and Open TG-GATEs, are also widely used to develop toxicity prediction models, further enhancing the field of predictive toxicology.
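To illustrate how such standardized benchmark data are typically consumed by modelers, the following minimal Python sketch loads the Tox21 dataset from MoleculeNet using the DeepChem library; the featurizer and splitting choices are illustrative assumptions rather than a recommended protocol.

```python
# Minimal sketch: loading a MoleculeNet toxicity benchmark with DeepChem.
# Assumes DeepChem is installed; the featurizer and splitter choices are
# illustrative, not a prescribed modeling workflow.
import deepchem as dc

# Tox21 provides 12 binary toxicity tasks (nuclear receptor / stress response assays)
tasks, datasets, transformers = dc.molnet.load_tox21(
    featurizer="ECFP",      # circular fingerprints as input features
    splitter="scaffold",    # scaffold split probes generalization to new chemotypes
)
train, valid, test = datasets

print(f"{len(tasks)} tasks, e.g. {tasks[:3]}")
print("Train/valid/test sizes:", train.X.shape[0], valid.X.shape[0], test.X.shape[0])
```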
In vivo animal testing data
Traditionally, large numbers of animals have been used to assess the safety of chemicals through testing. The significance of results from these animal studies extends beyond past risk assessments; they remain valuable for validating data from alternative methods, including new approach methodologies (NAMs) [19, 20]. Although animal test results are not the definitive gold standard and have clear limitations in validating NAMs, the current regulatory framework remains largely dependent on animal testing. Thus, comparing animal test results with NAMs-derived data can provide meaningful insights into the validation and refinement of NAMs [21].
Thus, systematically collecting, curating, and structuring existing in vivo experimental data into reference databases is essential. Efforts to make these databases publicly accessible align with the OECD’s Mutual Acceptance of Data (MAD) principle, which aims to reduce redundant testing and promote international data sharing [22].
To support this transition, large-scale initiatives in Europe and the U.S. have been actively developing in vivo reference databases (Table 3). For instance, ToxRefDB, launched by the U.S. Environmental Protection Agency (EPA), compiles in vivo toxicity data from thousands of studies, primarily focusing on pesticide chemicals. By providing key in vivo endpoints such as NOAELs (No Observed Adverse Effect Levels) and LOAELs (Lowest Observed Adverse Effect Levels), ToxRefDB serves as a critical benchmark for evaluating and validating NAM-based toxicity predictions.
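As a simplified illustration of how such a reference database can serve as a benchmark, the sketch below compares NAM-derived points of departure against in vivo NOAELs from a ToxRefDB-style export; the file names and column names are hypothetical placeholders, not the actual ToxRefDB schema.

```python
# Minimal sketch: benchmarking NAM-derived points of departure (PODs) against
# in vivo NOAELs from a ToxRefDB-style export. File names and column names
# ("casrn", "noael_mg_kg_day", "nam_pod_mg_kg_day") are hypothetical placeholders.
import numpy as np
import pandas as pd

in_vivo = pd.read_csv("toxrefdb_export.csv")     # hypothetical export with in vivo NOAELs
nam = pd.read_csv("nam_predictions.csv")         # hypothetical NAM-based POD predictions

merged = in_vivo.merge(nam, on="casrn", how="inner")
# Log10 ratio of predicted POD to in vivo NOAEL: values near 0 indicate agreement
merged["log10_ratio"] = np.log10(merged["nam_pod_mg_kg_day"] / merged["noael_mg_kg_day"])

print(merged["log10_ratio"].describe())
print("Fraction within 10-fold of the NOAEL:", (merged["log10_ratio"].abs() <= 1).mean())
```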
Additionally, eChemPortal, a global regulatory toxicology hub managed by the Organisation for Economic Co-operation and Development (OECD), provides worldwide access to in vivo regulatory toxicology studies by integrating datasets from governmental and international agencies. By offering comprehensive information on physicochemical properties, environmental fate, and ecotoxicology, this platform plays a crucial role in harmonizing regulatory frameworks across different regions.
Artificial Intelligence in Toxicology
Various AI algorithms, ranging from conventional machine learning approaches to large language models (LLMs), which have recently gained significant attention across many fields, are actively being used to develop toxicity prediction models (Figure 2). In this study, we provide a brief overview of current methodologies for AI-based toxicity prediction models while highlighting several limitations that need to be addressed.
Machine learning
Early in-silico toxicology relied on hand-crafted molecular descriptors combined with algorithms such as random forests, support vector machines, gradient boosting, and k-nearest neighbors [11]. These models rely on molecular fingerprints and physicochemical properties to predict toxicity endpoints, making them effective for analyzing structured datasets with well-defined chemical properties. One of the key advantages of machine learning in toxicology is its efficiency, as these models can perform well even with relatively small training datasets. Additionally, traditional machine learning models, such as decision trees and random forests, offer greater transparency and interpretability compared to complex deep learning models, making them more accessible for regulatory applications [29].
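The following minimal sketch illustrates this descriptor-plus-classifier workflow using Morgan fingerprints from RDKit and a random forest from scikit-learn; the SMILES strings and labels are toy placeholders rather than curated toxicity data.

```python
# Minimal sketch of a fingerprint-based toxicity classifier (RDKit + scikit-learn).
# The SMILES strings and 0/1 labels below are toy placeholders, not real toxicity data.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O",
          "CCN(CC)CC", "ClC(Cl)(Cl)Cl", "O=C(N)c1ccccc1"]
labels = np.array([0, 0, 0, 1, 1, 1])          # hypothetical binary toxicity calls

def featurize(smi, radius=2, n_bits=2048):
    """Convert a SMILES string into a Morgan (ECFP-like) bit vector."""
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp)

X = np.array([featurize(s) for s in smiles])

model = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, labels, cv=3, scoring="roc_auc")
print("Cross-validated ROC AUC:", scores.mean())
```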
Despite these advantages, machine learning approaches have several limitations. Many models require manual feature engineering, which can be labor-intensive and may not fully capture complex biological interactions. Moreover, traditional machine learning models struggle with predicting the toxicity of novel chemicals, particularly those that interact with biological systems in ways that are not well represented in the training data. To overcome these limitations, hybrid approaches that integrate machine learning with deep learning and mechanistic toxicology are being explored to enhance predictive accuracy.
Deep learning
Deep learning has emerged as a powerful tool for toxicity prediction, particularly for modeling complex chemical features [30]. Unlike traditional machine learning models, deep learning techniques such as graph neural networks (GNNs) [31], convolutional neural networks (CNNs) [32], and recurrent neural networks (RNNs) [33] can automatically extract features from raw toxicological data without requiring extensive manual preprocessing. These models have often outperformed approaches based on simplified molecular descriptors in toxicity prediction.
Among deep learning models, GNNs have proven particularly effective for representing molecular structures as mathematical graphs, capturing intricate structural relationships that are difficult to model using conventional QSAR approaches. Additionally, generative models such as autoencoders and generative adversarial networks (GANs) [34] are being used to augment toxicology datasets by synthesizing artificial data to improve AI training.
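A minimal sketch of this idea is shown below, assuming the PyTorch Geometric library and a hand-built toy graph in place of a featurized molecule; it illustrates the message-passing-and-readout pattern rather than any published architecture.

```python
# Minimal sketch of a graph neural network for molecular toxicity prediction,
# assuming the PyTorch Geometric library. The tiny hand-built graph stands in
# for a featurized molecule; the model is untrained and purely illustrative.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class ToxGCN(torch.nn.Module):
    def __init__(self, num_node_features=4, hidden=32):
        super().__init__()
        self.conv1 = GCNConv(num_node_features, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.readout = torch.nn.Linear(hidden, 1)        # one binary toxicity endpoint

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))            # message passing over bonds
        x = F.relu(self.conv2(x, edge_index))
        x = global_mean_pool(x, batch)                   # pool atom vectors into a molecule vector
        return torch.sigmoid(self.readout(x))

# Toy 3-atom "molecule": random node features and two undirected bonds
x = torch.rand(3, 4)
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]], dtype=torch.long)
batch = torch.zeros(3, dtype=torch.long)                 # all atoms belong to graph 0

model = ToxGCN()
print("Predicted toxicity probability:", model(x, edge_index, batch).item())
```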
However, despite their advantages, deep learning methods present challenges that must be addressed for broader adoption in regulatory toxicology. One of the most significant concerns is the "black box" nature of these models, which makes it difficult to understand how they arrive at specific toxicity predictions [35]. This lack of transparency raises concerns among regulatory agencies, as decision-making in chemical risk assessment requires a clear rationale for predictions. Furthermore, deep learning models require large, high-quality datasets for training, which are not always available in toxicology. Addressing these issues requires the development of explainable AI (XAI) techniques [36,37], such as attention mechanisms and Shapley additive explanations (SHAP) [38], which help interpret model predictions and provide insights into the key features driving toxicity assessments.
Generative artificial intelligence
Generative deep-learning frameworks have become a cornerstone of modern molecular design because they internalize the statistical signatures of experimentally characterized compounds and then propose entirely new structures that manifest comparable—or deliberately improved—properties [7]. Early successes came from recurrent-neural-network architectures trained on SMILES strings; when these sequence models are placed under a reinforcement-learning regime, the network receives reward signals for the structures it generates, allowing it to bias generation toward library-like compounds that already meet predefined physicochemical envelopes or show predicted affinity for a chosen biological target [39]. Variational auto-encoders extend this idea by mapping molecules into a smooth, differentiable latent space and then performing gradient-guided walks through that space; as a result, they can be steered automatically toward lower lipophilicity, higher solubility or enhanced drug-likeness before decoding the optimized latent vectors back into valid chemical graphs [40,41]. Collectively, these methods not only expand the explorable chemical universe by orders of magnitude but also compress the design–make–test–learn cycle, enabling chemists to front-load toxicity considerations and converge more rapidly on candidate molecules with balanced profiles of potency, safety, and manufacturability.
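As a rough illustration of the sequence-generation idea only (not a reproduction of any published model), the untrained sketch below samples SMILES-like strings token by token from a character-level LSTM; in practice such a network would first be trained on a large SMILES corpus and then optionally fine-tuned with a reward function.

```python
# Minimal, untrained sketch of character-level SMILES generation with an LSTM
# (PyTorch). Real generative models are trained on large SMILES corpora and may
# be fine-tuned with reinforcement learning; this shows only the sampling loop,
# and its outputs will generally not be valid molecules.
import torch
import torch.nn as nn

vocab = ["<start>", "<end>", "C", "c", "O", "N", "1", "(", ")", "="]
stoi = {t: i for i, t in enumerate(vocab)}

class SmilesRNN(nn.Module):
    def __init__(self, vocab_size, embed=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def step(self, token, state=None):
        out, state = self.lstm(self.embed(token).unsqueeze(1), state)
        return self.head(out.squeeze(1)), state

model = SmilesRNN(len(vocab))
token, state, generated = torch.tensor([stoi["<start>"]]), None, []
for _ in range(30):                                  # sample up to 30 tokens
    logits, state = model.step(token, state)
    token = torch.multinomial(torch.softmax(logits, dim=-1), 1).squeeze(1)
    if token.item() == stoi["<end>"]:
        break
    generated.append(vocab[token.item()])
print("Sampled string:", "".join(generated))
```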
Large language model (LLM)
Large language models (LLMs), such as GPT-4, BERT, and BioBERT, are increasingly being utilized in toxicology for tasks ranging from literature mining to predictive modeling [42]. These models have demonstrated the ability to extract toxicological insights from vast scientific corpora, enabling automated data curation and knowledge discovery. One of the key applications of LLMs in toxicology is their use in analyzing and summarizing toxicity-related research, facilitating the integration of diverse sources of information for risk assessment. LLMs can also generate molecular descriptions and predict toxicological profiles based on natural language descriptions of chemical properties, providing an alternative approach to structure-based modeling [43]. While LLMs offer new opportunities for toxicological research, they also present challenges. One of the primary concerns is the potential for generating false or misleading information, commonly referred to as the "hallucination problem" [44]. Since LLMs rely on probabilistic language generation rather than direct chemical-mechanistic reasoning, their outputs may not always be scientifically accurate. Additionally, while these models excel in text-based knowledge retrieval, they are less effective in making precise numerical toxicity predictions compared to structured AI models. To mitigate these limitations, hybrid approaches that integrate LLMs with structured toxicological databases need to be explored, allowing for improved accuracy and regulatory compliance.
Application of Artificial Intelligence in Toxicology
Chemical prioritization and next generation risk assessment
The ultimate goal of developing AI-based toxicity prediction models is to apply them for chemical screening and prioritization while reducing unnecessary animal testing. Although AI-driven toxicity prediction models inherently have some level of uncertainty, this is a common challenge in all modeling approaches. As the well-known saying in modeling states, "all models are wrong, but some are useful," the key is to effectively leverage these models by incorporating accurate uncertainty quantification and evaluation [45]. Ensuring that these models provide reliable toxicity predictions despite their limitations is crucial for regulatory acceptance and practical application in chemical risk assessment.
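One simple, commonly used way to attach such an uncertainty estimate to screening-level predictions, sketched below under the assumption of a fingerprint-style random forest and placeholder data, is to examine the spread of predictions across ensemble members.

```python
# Minimal sketch of ensemble-based uncertainty for a toxicity classifier:
# the spread of per-tree predictions in a random forest is used as a crude
# confidence signal when prioritizing chemicals. Data here are random placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.integers(0, 2, (200, 64)), rng.integers(0, 2, 200)
X_new = rng.integers(0, 2, (5, 64))             # "untested" chemicals to prioritize

model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# Per-tree probability of the toxic class; mean = prediction, std = uncertainty proxy
per_tree = np.stack([t.predict_proba(X_new)[:, 1] for t in model.estimators_])
mean, std = per_tree.mean(axis=0), per_tree.std(axis=0)

for i, (m, s) in enumerate(zip(mean, std)):
    print(f"chemical {i}: P(toxic) = {m:.2f} +/- {s:.2f}")
```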
In this study, we analyzed several representative research efforts that apply AI-based toxicity prediction models for chemical prioritization (Table 3). These studies utilized AI-driven approaches to assess various chemicals, including consumer product chemicals, microplastics, diesel particulate matter, per- and polyfluoroalkyl substances (PFAS), bisphenols, and nanoparticles. The applicability domain of AI-driven toxicity prediction has been steadily expanding, moving beyond organic compounds to include inorganic chemicals and nanomaterials, demonstrating the increasing versatility of these predictive models [46].
Currently, AI-based toxicity models are primarily used as low-tier screening tools within risk assessment frameworks. While these in silico approaches are not yet considered standalone replacements for traditional experimental methods, they enable the assessment of thousands of chemicals efficiently. By prioritizing chemicals for further experimental evaluation, AI-based models contribute to reducing the number of required in vivo and in vitro tests, thereby improving efficiency in chemical safety assessment.
Emerging applications of artificial intelligence in toxicology
Recent advances in artificial intelligence are opening new frontiers in toxicology beyond advancing toxicity prediction models. Key areas include (1) large language models for literature mining and knowledge extraction, (2) image-based toxicity analysis using deep learning, and (3) precision toxicology integrating genomic and exposure data. Below, we review each of these areas, summarizing state-of-the-art approaches and findings.
Large language models for literature mining and knowledge extraction
In many cases, safety information about chemicals is scattered across sources, which makes it challenging for risk assessors to synthesize evidence. One application is the use of LLMs to assist in extracting safety knowledge from unstructured text. For example, the European Food Safety Authority’s AI4NAMS project evaluated GPT-based models for rapid evidence retrieval on chemicals of concern [53,54]. In a case study on Bisphenol A, a baseline GPT-3 model was compared to a fine-tuned version that had been trained on domain-specific toxicology data. The fine-tuned model substantially outperformed the generic model in accurately extracting and consolidating toxicological information from scientific publications. This demonstrates that with domain tuning, LLMs can reliably identify key toxicity findings (e.g., dose–response data, hazard classifications) and summarize them, thereby supporting risk assessors in reviewing evidence.

LLMs are also being explored for automating systematic reviews and regulatory document preparation. Given a large set of study reports or publications, an LLM can be prompted to output structured summaries, highlight critical endpoints, and even draft sections of risk assessment reports. Early demonstrations suggest that a regulatory-focused LLM could significantly streamline tasks like compiling dossiers under REACH, by automatically extracting toxicity data from both structured databases (e.g., PubChem, ToxCast) and unstructured sources (journal articles, study reports). Importantly, the role of these models is to assist experts rather than replace them – for instance, flagging relevant studies and pulling out key results for human evaluation (a minimal illustration of such an extraction step is sketched at the end of this subsection). Nonetheless, the efficiency gains are clear: an AI co-worker that can read and distill thousands of pages of toxicology information in minutes enables scientists to focus on interpretation and decision-making.

However, challenges remain in ensuring accuracy (avoiding AI “hallucinations” or misinterpretation of data). A recent study demonstrated that introducing as little as 0.001% of false information into a training dataset could cause LLMs to generate medically harmful content while still passing standard performance benchmarks [55]. This phenomenon, known as data poisoning, raises serious concerns about LLMs trained on unverified toxicological sources, where subtle misinformation could lead to erroneous safety assessments or misclassification of chemical hazards. These risks highlight the urgent need for external validation, provenance tracking, and the use of curated knowledge when deploying LLMs in domains like toxicology that require high levels of reliability and accountability.

In summary, LLMs offer a transformative approach to literature mining in toxicology, from building knowledge of chemical–effect relationships to drafting evidence-based assessments. As these models continue to improve and incorporate toxicology-specific training, they are expected to become invaluable for keeping pace with the ever-growing scientific knowledge base in environmental health.
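A minimal sketch of this kind of assisted extraction is shown below; it assumes access to an OpenAI-compatible chat API, the model name, prompt, and output schema are illustrative assumptions, and any output would still require expert verification.

```python
# Minimal sketch of LLM-assisted extraction of toxicity endpoints from free text.
# Assumes an OpenAI-compatible API key is configured in the environment; the model
# name, prompt, and JSON keys are illustrative assumptions, and results must be
# verified by a toxicologist before use.
import json
from openai import OpenAI

abstract = (
    "In a 90-day oral gavage study in rats, compound X produced hepatocellular "
    "hypertrophy at 100 mg/kg/day; the NOAEL was 30 mg/kg/day."
)

prompt = (
    "Extract toxicology endpoints from the text below. Respond with JSON containing "
    "the keys 'species', 'study_type', 'noael_mg_kg_day', and 'critical_effect'; "
    "use null for anything not reported.\n\n" + abstract
)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",                          # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},      # constrain output to valid JSON
    temperature=0,
)
print(json.loads(response.choices[0].message.content))
```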
Image-based toxicity analysis using deep learning
AI is also being harnessed to interpret complex biological images in toxicology. Deep learning models, particularly convolutional neural networks (CNNs), can automatically recognize morphological patterns indicative of toxicity in microscopy and histopathology images. This is enabling high-throughput image-based toxicity assays that historically required laborious human scoring. A prime example is the comet assay for DNA damage, where individual cell nuclei exhibit “comet tails” when DNA strand breaks occur. Recent studies applied CNNs to classify comet assay images and achieved accuracy levels on par with expert analysis [56]. In one study, a CNN was trained on ~800 labeled comet images (augmented to nearly 10,000) and learned to categorize DNA damage into four severity classes (healthy to highly damaged) with 96.1% overall accuracy. This automated image analysis removes the bottleneck of manual scoring and adds consistency in genotoxicity evaluation. Similarly, deep learning has been introduced to toxicologic histopathology – the examination of tissue slides from animal studies [57]. Whole-slide imaging and CNN-based image analysis are poised to assist pathologists by detecting microscopic lesions or classifying tissue abnormalities caused by toxicants. For instance, prototype deep neural networks have been trained on thousands of annotated histology images from rodent toxicity studies to learn features of “normal” vs. diseased tissue, facilitating objective identification of treatment-related changes. These approaches have demonstrated the potential to improve workflow efficiency and quantitative rigor in pathology assessment. Early applications include automated identification of liver and kidney lesions and mapping their distribution across a slide, which can help flag subtle toxic effects that might be missed by the human eye. In high-content cell imaging assays, deep learning-based phenotypic profiling is also used to predict toxic mechanisms. Imaging platforms like Cell Painting, which capture multichannel cellular morphology, combined with deep neural networks can cluster compounds by phenotypic signatures and even predict modes of action or toxicity endpoints [58]. For example, AI models integrating Cell Painting images have successfully distinguished specific forms of cell death (e.g., apoptosis vs. ferroptosis) and improved detection of mitochondrial toxicity by combining image features with gene expression data [59,60]. Overall, deep learning is enabling image-based toxicology to move from qualitative observations to quantitative, scalable analysis. By extracting rich feature patterns from cell and tissue images, these models enhance our ability to detect toxicity earlier (in vitro) and more reliably (in vivo), supporting more rapid and mechanistically anchored safety evaluations.
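For readers interested in how such an image classifier is typically set up, the sketch below outlines training a small CNN on severity-labeled comet-assay images with PyTorch and torchvision; the directory layout and four-class labels are hypothetical stand-ins for a curated image set, not the pipeline of the cited study.

```python
# Minimal sketch of image-based severity scoring with a CNN (PyTorch/torchvision).
# Assumes comet-assay images are organized in four class subfolders under
# "comet_images/" (a hypothetical layout mirroring the severity grades above).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("comet_images/", transform=transform)   # 4 class folders
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Start from a standard CNN backbone and replace the classification head
model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 4)        # healthy ... highly damaged

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, targets in loader:                        # one illustrative training pass
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
print("finished one training pass")
```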
Precision toxicology integrating genomic and exposure data
Another frontier is the integration of toxicogenomic data and personal exposure information to enable precision or personalized toxicology. Just as precision medicine tailors treatments to a patient’s genetic makeup, precision toxicology aims to predict individual risk and susceptibilities to chemical exposures by considering omics data (genomics, transcriptomics, metabolomics) alongside environmental context. AI and machine learning are critical to analyze these high-dimensional datasets and uncover biomarker signatures predictive of toxicity. One approach uses toxicogenomic profiles (e.g. gene expression changes or protein biomarkers in cells exposed to chemicals) to forecast adverse outcomes in vivo. For example, Rahman et al. (2022) employed machine learning to identify an optimal panel of DNA damage response proteins that serve as predictors of genotoxicity [61]. By applying feature-selection algorithms to in vitro transcriptomic data, they discovered that only a handful of biomarkers (five genes/proteins) could accurately predict rodent carcinogenicity and Ames test mutagenicity (with ≥ 70% accuracy for both). Notably, these top biomarkers were all involved in conserved DNA repair pathways, providing mechanistic insight. This illustrates how integrating omics endpoints with AI can bridge molecular-level data to whole-animal outcomes, improving predictivity while reducing testing. Another pioneering strategy uses generative AI to simulate toxicogenomic responses. Chen et al. (2022) developed a GAN-based model (“Tox-GAN”) trained on a large toxicogenomics database of rat studies, which can generate in vivo gene expression profiles for untested chemicals across dose levels [62]. The synthetic profiles showed > 87% agreement with actual data in terms of affected pathways, pointing to a future where animal experiments could be augmented or partially replaced by AI-simulated omics data.
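As a generic illustration of the biomarker panel selection idea described above (not the published pipeline), the sketch below applies recursive feature elimination to synthetic omics-style data to pick a small predictive feature set.

```python
# Minimal sketch of selecting a small biomarker panel from omics-style features
# with recursive feature elimination (scikit-learn). Synthetic data stand in for
# transcriptomic profiles; this is illustrative, not the cited study's pipeline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score

# 200 "compounds" x 500 "genes", with only a few truly informative features
X, y = make_classification(n_samples=200, n_features=500, n_informative=5, random_state=0)
gene_names = np.array([f"gene_{i}" for i in range(X.shape[1])])

selector = RFE(RandomForestClassifier(n_estimators=200, random_state=0),
               n_features_to_select=5, step=0.2)
selector.fit(X, y)

panel = gene_names[selector.support_]
score = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                        X[:, selector.support_], y, cv=5, scoring="accuracy").mean()
print("Selected panel:", panel)
print("Cross-validated accuracy with the 5-feature panel:", round(score, 2))
```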
A hallmark of precision toxicology is accounting for inter-individual variability. Factors such as genetic polymorphisms, pre-existing health conditions, microbiome, and cumulative exposures can cause people to respond differently to the same chemical. AI-driven models are starting to integrate these layers. For instance, researchers are combining human genomic data (e.g. risk alleles), personal exposure biomarkers, and toxicogenomic readouts to build personalized risk assessment models. By incorporating individual-level data, these models can stratify populations into more or less susceptible groups and predict toxic responses with higher resolution. As described by Singh et al. (2023), an AI/ML precision toxicology framework might integrate a person’s omics profile (gene or protein expression signatures indicating sensitivity) with their exposure history to tailor risk estimates [63]. Such models have been piloted in studies of air pollutants and occupational chemicals, where omics-based biomarkers of exposure and early effect are used to forecast long-term health outcomes in specific subpopulations. Likewise, environmental epidemiology is adopting multi-omics and AI to identify vulnerable cohorts – for example, using gene expression changes in blood as sentinels of chemical exposure in a community, then applying machine learning to link those changes with health records and personal exposure measures. The PrecisionTox initiative (EU) is one large-scale effort applying such integrative approaches across species to improve cross-species extrapolation and human relevance of toxicity findings [64]. Overall, precision toxicology represents a convergence of big data from toxicogenomics, exposomics, and human genetics. AI methods are essential to sift through these data and identify the combinations of molecular markers that best predict adverse outcomes in defined populations or even individuals. The outcome is a more tailored risk assessment, where instead of a one-size-fits-all safety threshold, we might predict a range of responses and safeguard those most at risk. Though still in its infancy, this approach holds promise for more equitable and effective public health protection, as interventions could be directed to high-risk groups identified by their molecular and environmental profiles. In coming years, we can expect AI-fueled models to increasingly inform regulatory science by integrating diverse data streams – from genomics to wearable sensors – thus refining toxicity predictions to reflect real-world human variability.
Regulatory Applications of Toxicity Big Data and Artificial Intelligence
In an exceptional display of regulatory coordination, three leading U.S. agencies signaled an important turn away from animal models in the spring of 2025 [65]. On 10 April, FDA Commissioner Martin A. Makary released a Preclinical Safety Roadmap that commits the Agency to sharply decrease animal use, citing evidence that >90% of drugs successful in animal studies ultimately fail in humans because of unforeseen efficacy or safety issues. That same day, EPA Administrator Lee Zeldin revived and strengthened the Agency’s earlier pledge to eliminate vertebrate testing by 2035, extending the target to both industrial chemicals and pesticides. Nineteen days later, NIH Director Dr. Jay Bhattacharya announced the creation of the Office of Research Innovation, Validation and Application (ORIVA), a hub charged with developing, validating, and mainstreaming NAMs across the NIH research portfolio.
Within this context, several regulatory agencies have published frameworks and guidance to support the integration of AI and big data into chemical risk assessment and regulatory decision-making. These efforts reflect a growing recognition of AI’s potential to enhance chemical safety evaluations, reduce reliance on animal testing, and improve the predictive accuracy and efficiency of toxicological assessments.

For instance, the European Food Safety Authority (EFSA) has developed an extensive framework through the AI4NAMS initiative (Artificial Intelligence for New Approach Methodologies), published in 2024 [54]. This report outlines a seven-step AI workflow that includes literature search, AI-based ranking, structuring, harmonization, data extraction, and integration with adverse outcome pathways (AOPs). EFSA provides 96 technical recommendations across domains such as ontology-based term normalization, automated picklist generation for reporting templates (OHT201), and the use of LLMs for data extraction with embedded reliability scoring. These efforts are aligned with the SPIDO NAMs Roadmap, which outlines EFSA’s plan to transition the majority of chemical safety assessments to NAMs by 2027, supported by AI-enabled dashboards, repositories, and knowledge graphs.

The U.S. Food and Drug Administration (FDA) has also explored the use of AI in regulatory science. Its 2023 draft guidance, Considerations for the Use of Artificial Intelligence to Support Regulatory Decision-Making for Drug and Biological Products, recommends a risk-based approach for evaluating the reliability of AI-generated outputs throughout the product lifecycle. It emphasizes methodological transparency, model documentation, performance monitoring, and change management procedures.

The European Medicines Agency (EMA) has adopted a holistic and strategic approach to AI integration. Its Artificial Intelligence Workplan lays out a roadmap for embedding AI across regulatory domains, including pharmacovigilance and product evaluation. Complementing this, the 2024 guiding principles on the use of LLMs provide ethical and operational guidance for using advanced AI in regulatory science, highlighting issues such as data governance, bias mitigation, and continuous learning.

In parallel, the OECD continues to lead in harmonizing international standards for computational toxicology. The QSAR Assessment Framework (QAF), originally published in 2007 and periodically updated, offers detailed criteria for evaluating the scientific validity and regulatory acceptability of predictive models. The OECD’s ongoing initiatives increasingly focus on transparent documentation, applicability domain (AD) definition, and cross-agency interoperability of AI tools.

While these agencies share a commitment to transparency, risk assessment, and scientific rigor, their strategies differ in emphasis. EFSA focuses on AI-driven automation of hazard identification using real-world toxicological data, with a strong emphasis on workflow standardization and ontology harmonization. FDA’s approach centers on model credibility and risk management within regulated product development. EMA promotes a governance-based model that integrates policy, ethics, and stakeholder engagement. OECD, by contrast, supports international alignment of computational models through quality assurance frameworks and collaborative testing. Together, these regulatory efforts underscore the increasing role of AI and big data in toxicology and chemical management.
As AI tools continue to evolve, sustained guidance, robust validation practices, and harmonized policy development will be critical to ensuring the responsible, reproducible, and ethical use of AI in regulatory science.
Challenges and Future Perspectives
The integration of big data and AI in toxicology presents significant opportunities for improving chemical risk assessment and reducing reliance on animal testing. However, several challenges must be addressed to fully realize the potential of these technologies. Two critical obstacles are (1) the limited availability of high-quality data and (2) the black box nature of AI models, both of which hinder regulatory acceptance and broader application in toxicology.
Limited data quality and quantity
One of the primary challenges in AI-driven toxicology is the limited availability of high-quality and homogeneous datasets. Although vast amounts of toxicological data exist, they are often fragmented, inconsistent, or inaccessible due to proprietary restrictions. Many datasets originate from diverse sources, but differences in study design, data formats, and reporting standards make integration difficult and inefficient. This gap in data completeness limits the effectiveness of predictive models, as AI algorithms require rich, well-annotated datasets to achieve high reliability and generalizability.
To address this issue, efforts should focus on harmonizing data formats, improving data-sharing frameworks, and increasing data accessibility. These initiatives align with the OECD MAD System, which ensures that a test performed in one country is accepted in over 40 others, thereby reducing duplicative testing and enhancing international regulatory collaboration. Additionally, collaborative initiatives involving regulatory agencies, academia, and industry, such as the Accelerating the Pace of Chemical Risk Assessment (APCRA) project and the Toxicology in the 21st Century (Tox21) program, aim to improve data quality, availability, and integration across different platforms [19].
As advanced data processing capabilities continue to evolve, it is critical to establish effective strategies for managing, structuring, and sharing existing toxicological data. Developing unified data standards and centralized repositories will be essential for ensuring that AI-driven toxicology can be applied effectively and transparently in chemical risk assessment and regulatory decision-making.
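A minimal sketch of what such harmonization can look like in practice is given below; the source data, column names, and unit conventions are hypothetical placeholders used only to illustrate mapping heterogeneous records onto a unified schema.

```python
# Minimal sketch of harmonizing heterogeneous toxicity records into one schema
# with pandas. Identifiers, column names, and unit conventions are hypothetical
# placeholders used only to illustrate the idea.
import pandas as pd

# Source A reports doses in mg/kg/day; source B reports ug/kg/day under other names
source_a = pd.DataFrame({"CHEM_ID": ["chem-001"], "NOAEL_mg_kg_day": [15.0], "species": ["Rat"]})
source_b = pd.DataFrame({"chem_id": ["chem-002"], "noael_ug_per_kg": [5000.0], "Species": ["mouse"]})

def harmonize_a(df):
    out = df.rename(columns={"CHEM_ID": "chem_id", "NOAEL_mg_kg_day": "noael_mg_kg_day"})
    out["species"] = out["species"].str.lower()
    return out

def harmonize_b(df):
    return pd.DataFrame({
        "chem_id": df["chem_id"],
        "noael_mg_kg_day": df["noael_ug_per_kg"] / 1000.0,   # convert ug -> mg
        "species": df["Species"].str.lower(),
    })

unified = pd.concat([harmonize_a(source_a), harmonize_b(source_b)], ignore_index=True)
print(unified)
```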
The black box problem
Another significant challenge in applying AI to toxicology is the black box nature of many machine learning models, particularly deep learning algorithms. These models often lack interpretability and transparency, making it difficult for regulatory agencies and toxicologists to trust their predictions. Unlike traditional QSAR models, which provide intuitive prediction mechanisms, many AI-driven approaches generate output without clearly explaining how or why a specific toxicity prediction was made.
This lack of interpretability raises concerns in regulatory decision-making, risk assessment, and hazard identification, where explainability is crucial for ensuring scientific credibility. Without a clear understanding of how AI models arrive at their conclusions, regulatory bodies may hesitate to fully integrate AI-based predictions into risk assessment frameworks.
To address this challenge, researchers are actively developing explainable AI (XAI) techniques, such as attention mechanisms, feature attribution methods, and rule-based models, to improve model transparency. In toxicological research, feature attribution methods such as SHAP (Shapley Additive Explanations) and LIME (Local Interpretable Model-agnostic Explanations) have been introduced; they provide attribution scores that help researchers identify which molecular descriptors or biological signals contributed most to a given prediction. For instance, SHAP has been used to interpret models trained on Tox21 data, offering insights into the physicochemical properties that drive compound activity in bioassays. Similarly, LIME has helped evaluate the local fidelity of predictions in QSAR toxicity classification tasks (a minimal sketch of this attribution workflow appears at the end of this subsection). However, while these methods represent meaningful progress toward model transparency, they remain partial solutions. They typically offer post hoc explanations that approximate the model’s logic rather than fully capturing its internal reasoning, and their reliability may vary depending on model complexity and input perturbation. Therefore, we believe that these techniques, though valuable, are not yet sufficient for establishing fundamental trust in high-stakes applications such as regulatory toxicology. To build more trustworthy AI systems, future research should focus on inherently interpretable models and hybrid approaches that incorporate mechanistic toxicological knowledge—such as the adverse outcome pathway (AOP) framework—directly into model design [11]. These efforts will be critical in developing reproducible, scientifically justifiable AI models capable of gaining regulatory acceptance and broader adoption in the field.
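As an illustration of the feature-attribution workflow described above, the following sketch obtains SHAP values for a fingerprint-style random forest model; the random bit vectors and continuous potency labels are placeholders, and the example shows only how per-feature attributions are computed, not a validated interpretation pipeline.

```python
# Minimal sketch of post hoc explanation with SHAP for a fingerprint-based
# random forest model. Random bit vectors stand in for molecular fingerprints;
# the point is only to show how per-feature attributions are obtained.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 2, (300, 128)).astype(float)    # placeholder fingerprint bits
y = rng.normal(size=300)                            # placeholder potency values

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:10])         # (10 compounds, 128 features)

# Rank fingerprint bits by mean absolute contribution across the 10 compounds
top_bits = np.argsort(np.abs(shap_values).mean(axis=0))[::-1][:5]
print("Most influential fingerprint bits:", top_bits)
```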
Moving forward
Despite these challenges, big data and AI have the potential to revolutionize toxicology by enabling faster, more cost-effective, and ethically responsible toxicity assessments. Addressing data limitations through standardization, improved data sharing, and data augmentation techniques will enhance the quality and quantity of available toxicological datasets. Additionally, advancing explainable AI approaches will help bridge the gap between machine learning predictions and regulatory requirements.
By fostering global collaboration, investing in data infrastructure, and developing interpretable AI models, the field of toxicology can fully harness the power of AI to improve chemical safety assessments, minimize reliance on animal testing, and advance the adoption of NAMs in regulatory science (Figure 3).










