Thermodynamics-Informed Machine Learning for the Design of Sustainable Materials: The Dawn of Digital Molecular Discovery
dataset
posted on 2024-04-30, 16:10, authored by João Dinis Oliveira Abranches
Scientists have traditionally employed trial-and-error methodologies to design novel materials, often complemented by basic heuristic rules or chemical intuition (e.g., “like dissolves like”). To date, however, this simplistic approach has led to the discovery and characterization of only a small fraction of all synthesizable compounds. Data-driven approaches such as machine learning are a promising alternative to these traditional trial-and-error methodologies. Unfortunately, most machine learning models proposed so far do not embed chemical or thermodynamic information in their architectures or molecular descriptors. This leads to overly complex models that require a very large volume of experimental data to be properly trained.
At the interface between artificial intelligence and green chemistry, the work developed throughout this dissertation uses thermodynamics-informed machine learning to bridge the gap between small, scarce datasets and data-driven approaches. This is accomplished through two major avenues. The first is the development of active learning workflows, based on Gaussian process machine learning models, that target the description of activity coefficients. This approach was directed in particular at capturing the physicochemical properties of mixtures, namely deep eutectic solvents. Active learning was able to efficiently guide the acquisition of experimental data, and, in many cases, a single data point was sufficient to accurately describe mixture properties (namely phase equilibria diagrams), dramatically reducing the effort and cost necessary to characterize novel sustainable materials.
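As an illustration of the general idea of Gaussian-process-guided data acquisition, the following is a minimal sketch and not the workflow developed in the dissertation. It assumes a plain zero-mean Gaussian process with a squared-exponential kernel, a one-dimensional pool of candidate mole fractions, and a maximum-variance acquisition rule. The function `toy_measurement` is a hypothetical stand-in for an experiment (a Margules-type expression with made-up parameters), and all names and hyperparameters are illustrative assumptions.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)


def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_tq = rbf_kernel(x_train, x_query)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    v = np.linalg.solve(chol, k_tq)
    mean = k_tq.T @ alpha
    var = np.clip(1.0 - np.sum(v ** 2, axis=0), 0.0, None)
    return mean, np.sqrt(var)


def toy_measurement(x):
    """Hypothetical 'experiment': Margules-type ln(gamma_1) at mole fraction x.

    The parameters are invented for illustration only.
    """
    return (1.0 - x) ** 2 * (-1.5 + 2.0 * x)


def active_learning(n_queries=8):
    """Start from one measurement and repeatedly measure where the GP is least certain."""
    pool = np.linspace(0.0, 1.0, 101)   # candidate compositions
    x_train = np.array([0.5])           # a single initial data point
    y_train = toy_measurement(x_train)
    for _ in range(n_queries):
        _, std = gp_posterior(x_train, y_train, pool)
        x_next = pool[np.argmax(std)]   # maximum-variance acquisition
        x_train = np.append(x_train, x_next)
        y_train = np.append(y_train, toy_measurement(x_next))
    mean, std = gp_posterior(x_train, y_train, pool)
    return x_train, mean, std, pool


x_train, mean, std, pool = active_learning()
```

In this generic sketch several queries are needed because the model carries no thermodynamic information. The abstract's point is that embedding such information can reduce the number of required measurements, in many cases to a single data point.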
The second major avenue lies in the development of a digital molecular space based on sigma profiles. These molecular descriptors, derived from quantum chemistry, were shown to be a powerful feature set for neural networks, leading to the accurate prediction of assorted physicochemical properties (e.g., boiling points and aqueous solubilities) for organic and inorganic molecules. A graph neural network was also developed to predict sigma profiles, bypassing the need for expensive quantum chemistry calculations. Finally, sigma profiles were shown to behave as a digital molecular space where optimization tasks can be performed. A notable example is the use of Bayesian optimization to maximize the normal boiling temperature. Holding no knowledge of chemistry except for the sigma profile and normal boiling temperature of carbon monoxide (the worst possible initial guess), Bayesian optimization found the global maximum of the available normal boiling temperature dataset (over 1000 molecules encompassing more than 40 families of organic and inorganic compounds) in just fifteen iterations (i.e., fifteen property measurements). This result cements sigma profiles as an ideal digital chemical space for molecular optimization and discovery, particularly when little experimental data is available.
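The optimization loop described above can be sketched on synthetic data. This is an illustrative sketch only, under several assumptions not taken from the dissertation: random three-dimensional vectors stand in for sigma profiles, a made-up `hidden` array stands in for measured boiling temperatures, the surrogate is a Gaussian process with a squared-exponential kernel, and the acquisition rule is an upper confidence bound (the abstract does not state which acquisition function was used). As in the abstract, the search starts from the worst candidate and takes fifteen measurements.

```python
import numpy as np


def rbf(a, b, length_scale):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.clip(sq, 0.0, None) / length_scale ** 2)


def gp_predict(x_obs, y_obs, x_all, length_scale, noise=1e-6):
    """GP posterior mean and std at x_all, with observations centered on their mean."""
    y_mean = y_obs.mean()
    k = rbf(x_obs, x_obs, length_scale) + noise * np.eye(len(x_obs))
    k_star = rbf(x_obs, x_all, length_scale)
    chol = np.linalg.cholesky(k)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_obs - y_mean))
    v = np.linalg.solve(chol, k_star)
    mean = y_mean + k_star.T @ alpha
    std = np.sqrt(np.clip(1.0 - np.sum(v ** 2, axis=0), 0.0, None))
    return mean, std


def bayesian_optimization(descriptors, measure, start, n_iter=15, kappa=2.0, length_scale=1.0):
    """Maximize a property over a finite candidate pool, one measurement per iteration."""
    evaluated = [start]
    values = [measure(start)]
    for _ in range(n_iter):
        mean, std = gp_predict(descriptors[evaluated], np.array(values), descriptors, length_scale)
        score = mean + kappa * std      # upper confidence bound (assumed acquisition)
        score[evaluated] = -np.inf      # never re-measure a candidate
        nxt = int(np.argmax(score))
        evaluated.append(nxt)
        values.append(measure(nxt))
    best = evaluated[int(np.argmax(values))]
    return best, evaluated, values


# Synthetic stand-ins: 200 "molecules" with 3-D descriptors and a hidden property.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(200, 3))
hidden = -np.sum((descriptors - 0.5) ** 2, axis=1)
start = int(np.argmin(hidden))          # worst possible initial guess
best, evaluated, values = bayesian_optimization(descriptors, lambda i: hidden[i], start)
```

Whether such a loop reaches the global maximum within fifteen iterations depends on the descriptor space and the surrogate model. The sketch makes no such guarantee for the synthetic data used here.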