Retrosynthesis lies at the core of organic chemistry and modern drug discovery, offering a systematic strategy for breaking down complex target molecules into simpler, synthetically accessible precursors. It serves as a guiding principle in the design of synthetic routes and the identification of viable reaction pathways for novel compounds. Traditional retrosynthetic approaches have largely relied on expert-driven heuristics and rule-based frameworks, which, although effective for well-known reaction types, are limited in scalability, flexibility, and adaptability across the vast and diverse chemical space. As the complexity of molecular structures and reaction mechanisms continues to increase, there is an urgent need for data-driven, intelligent systems capable of generalizing beyond human-defined rules and learning from large-scale reaction data.
In response to these challenges, this thesis introduces SB-Net, a deep learning framework that integrates Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory (Bi-LSTM) networks to advance retrosynthesis prediction. SB-Net adopts a dual-branch architecture designed to exploit both the sequential and structural properties of molecules. It processes two complementary molecular representations: Simplified Molecular Input Line Entry System (SMILES) strings, which encode molecular syntax and connectivity, and Extended Connectivity Fingerprints (ECFPs), which capture topological and substructural features at varying levels of molecular depth. This hybrid representation allows SB-Net to extract multi-scale contextual and structural information, enabling it to model complex chemical transformations with greater accuracy and interpretability.
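To make the sequence branch concrete, the following is a minimal sketch of how a SMILES string could be split into tokens and mapped to fixed-length integer IDs before being fed to a CNN or Bi-LSTM. It is an illustration under stated assumptions, not the preprocessing used by SB-Net: the regular expression is a commonly used SMILES tokenization pattern, and the names `tokenize_smiles`, `build_vocab`, and `encode`, as well as the `<pad>` and `<unk>` symbols, are hypothetical. The ECFP branch is not shown, since fingerprints are typically computed with a cheminformatics toolkit.

```python
import re

# Assumption: a commonly used SMILES tokenization pattern. Bracket atoms and
# two-letter halogens (Cl, Br) are kept as single tokens.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; reject characters the pattern misses."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def build_vocab(smiles_list):
    """Assign an integer ID to every token seen in the given SMILES strings."""
    vocab = {"<pad>": 0, "<unk>": 1}  # hypothetical special symbols
    for smiles in smiles_list:
        for token in tokenize_smiles(smiles):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(smiles, vocab, max_len):
    """Map tokens to IDs, truncating or right-padding to max_len."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokenize_smiles(smiles)]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))
```

For example, `tokenize_smiles("ClCCBr")` keeps `Cl` and `Br` as single tokens rather than splitting them into individual characters, which is the main reason to tokenize with a pattern instead of iterating over the string.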
The thesis presents a detailed ablation study to evaluate the contribution of each molecular descriptor and network component. Results show that combining SMILES and ECFP features significantly enhances prediction performance, confirming their complementary roles in encoding molecular information. Similarly, integrating the CNN and Bi-LSTM components produces a synergistic effect: the CNNs capture local feature patterns, while the Bi-LSTM layers model long-range dependencies within molecular sequences. Comparative analyses across benchmark datasets, including USPTO-50k for chemical retrosynthesis and MetaNetX for bioretrosynthesis, reveal that SB-Net consistently outperforms existing models in top-k accuracy and generalization capability.
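For readers unfamiliar with the metric, top-k accuracy is the fraction of test targets whose ground-truth answer appears among the model's k highest-ranked predictions. The sketch below illustrates the computation under simplifying assumptions: the function name `top_k_accuracy` is hypothetical, predictions are assumed to be already sorted from most to least likely, and matching is by exact string equality, whereas a real evaluation would normally compare canonicalized SMILES.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of targets found among the first k ranked predictions.

    ranked_predictions: one list per example, best prediction first.
    targets: the ground-truth answer for each example.
    """
    if len(ranked_predictions) != len(targets):
        raise ValueError("ranked_predictions and targets differ in length")
    if not targets:
        return 0.0
    hits = sum(
        1
        for predictions, target in zip(ranked_predictions, targets)
        if target in predictions[:k]
    )
    return hits / len(targets)


# Toy data with placeholder labels instead of real reactant SMILES.
predictions = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]]
targets = ["A", "A", "A"]
top1 = top_k_accuracy(predictions, targets, 1)
top2 = top_k_accuracy(predictions, targets, 2)
top3 = top_k_accuracy(predictions, targets, 3)
```

In this toy case the correct answer is ranked first, second, and third respectively, so one of three examples is a hit at k = 1, two at k = 2, and all three at k = 3.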
Beyond its superior predictive performance, SB-Net represents an interpretable and extensible framework adaptable to various cheminformatics tasks. Its design principles, centered on multi-scale feature extraction and hybrid representation learning, can be applied to other molecular prediction domains such as reaction condition optimization, forward reaction prediction, and enzyme-catalyzed reaction modeling. By bridging chemical informatics and deep learning, SB-Net contributes toward the development of AI-enabled synthesis planning systems capable of accelerating the discovery and design of new chemical entities.
In essence, this work advances the frontier of computational retrosynthesis by demonstrating how hybrid deep learning architectures can efficiently learn from molecular data, generalize across diverse reaction types, and provide scalable, data-driven insights into chemical reactivity and synthesis planning.