Advancing chemical synthesis with machine learning: opportunities and limitations
dataset
posted on 2024-09-04, 15:28 authored by Bozhao Nan
With advances in computational power and growing data availability, machine learning (ML) has been applied to predicting chemical reactions and proposing synthetic pathways. This thesis contributes to chemical reaction discovery through ML across three primary domains. First, computational methods were used to analyze transition states in reaction routes generated by the Sarpong group using Synthia™, to assess the routes' feasibility computationally. Next, industrial electronic lab notebook (ELN) data, supported by AZ, were processed and featurized, and various ML techniques, including Random Forests (RF), k-Nearest Neighbors (KNN), Neural Networks (NN), and Graph Neural Networks (GNN), were applied to predict reaction yields. Yield imbalances in high-throughput experimentation (HTE) and ELN data were addressed with imbalanced regression methods to improve yield prediction in critical regions. Large Language Models (LLMs) were integrated for data extraction, resolving inconsistencies in USPTO datasets from multiple sources and investigating the intricate information space of reaction procedures through a specific study on t-butyl ester deprotection. In the second part, substantial advances were made in Molecular Representation Learning (MRL) to accurately capture molecular structures and physical behavior. By evaluating 3D GNNs and conformer ensemble-based models, this research extends beyond traditional SMILES, fingerprints, and 2D molecular graphs, improving the precision of predictions for molecule- and reaction-level properties. These improvements are crucial for tasks such as enantiomeric excess (ee) prediction and binding energy (BE) prediction.