The LCGC Blog: Molecular Feature Generation for Machine Learning in Analytical Measurements

October 1, 2024

Predicting physicochemical properties for molecules and optimizing chemical analysis processes are both important aspects of modern analytical science. These are data-intensive processes, which can often involve a multitude of chemical measurements or high-level theoretical computations. The development of artificial intelligence (AI) strategies, such as those based on machine learning (ML) and deep learning (DL), has helped make these processes more efficient, but they still require a significant amount of high-quality data.

Encoding molecular characteristics for optimization and prediction can be challenging. Molecular characteristics that must be generated experimentally will ultimately be limited to the set of chemicals for which they have been determined. Abraham descriptors, though powerful for modeling molecular interactions, fall into this category. Other molecular characteristics might be easy to access but have limited predictive power. Measures like log P and pKa can be readily calculated, but on their own they describe only a narrow range of molecular interactions.

An alternative strategy is to generate molecular features directly from a chemical structure. This requires no laboratory experimentation and can be accomplished quickly for any new molecule of interest.

The three most common categories of molecular features generated from chemical structures are fingerprints, custom descriptor sets, and counts over chemical bonds. These features can be extracted from Simplified Molecular-Input Line-Entry System (SMILES) representations of molecules, which are easily generated from structure data files (.sdf).

Fingerprints encode properties of molecules, such as size, shape, polarity, and electronic structure, as a set of numerical values or binary bits. There are a variety of fingerprinting strategies; popular ones are E-state and Morgan fingerprints. Custom descriptor sets are collections of descriptors, each of which maps a molecule to a scalar value; descriptors are usually chosen based on physical intuition and computational efficiency. Descriptors that require additional computations or measurements are not usually included. Log P or oxygen balance could be used as components of a set of custom descriptors. Counts over chemical bonds create bond count vectors to represent each molecule. These are based on the intuition that chemical bonding and the presence of different functional groups will control the physicochemical properties of a molecule. Each of these molecular feature categories can be expanded or combined to create more comprehensive molecular feature representations for various tasks.
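
To make two of these categories concrete, the following is a minimal sketch in plain Python of one custom descriptor (oxygen balance) and a bond count vector. It is an illustration only, not the feature code used in the work cited here. The function names, the hand-written bond list for acetic acid, and the five-entry bond vocabulary are assumptions made for this example, and the oxygen balance expression is the commonly used form for balance to CO2, OB% = -1600 (2C + H/2 - O) / MW.

```python
from collections import Counter

# Standard atomic masses (g/mol) for the elements used in this sketch.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def oxygen_balance(formula):
    """Oxygen balance (%) to CO2 for a formula given as a dict of atom counts."""
    c = formula.get("C", 0)
    h = formula.get("H", 0)
    o = formula.get("O", 0)
    mol_weight = sum(ATOMIC_MASS[element] * n for element, n in formula.items())
    return -1600.0 * (2 * c + h / 2 - o) / mol_weight


def bond_count_vector(bonds, vocabulary):
    """Count each bond type and return the counts in the fixed order of `vocabulary`."""
    counts = Counter()
    for atom_a, atom_b, order in bonds:
        # Sort the element symbols so that C-O and O-C fall into the same bin.
        first, second = sorted((atom_a, atom_b))
        counts[f"{first}{order}{second}"] += 1
    return [counts[key] for key in vocabulary]


# Hypothetical fixed vocabulary; a real one would cover every bond type in the data set.
BOND_VOCABULARY = ["C-C", "C-H", "C-O", "C=O", "H-O"]

# Acetic acid, CH3COOH, written out by hand as (atom, atom, bond order) tuples.
acetic_acid_bonds = [
    ("C", "C", "-"),
    ("C", "H", "-"), ("C", "H", "-"), ("C", "H", "-"),
    ("C", "O", "="),
    ("C", "O", "-"),
    ("O", "H", "-"),
]

acetic_acid_vector = bond_count_vector(acetic_acid_bonds, BOND_VOCABULARY)

# 2,4,6-Trinitrotoluene, C7H5N3O6, as a familiar oxygen balance example.
tnt_oxygen_balance = oxygen_balance({"C": 7, "H": 5, "N": 3, "O": 6})

print(acetic_acid_vector)            # [1, 3, 1, 1, 1]
print(round(tnt_oxygen_balance, 1))  # -74.0
```

Keeping the vocabulary fixed means that every molecule maps to a vector of the same length, which is what allows bond counts to be concatenated with fingerprints or scalar descriptors.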

We recently explored and reported the use of molecular feature representations in conjunction with ML for the prediction of gas-phase vacuum ultraviolet–ultraviolet (VUV/UV; 125–240 nm) absorption spectra (1). Our data set comprised approximately 1400 VUV/UV spectra, which form a portion of the library for a commercial VUV/UV absorbance detector made by VUV Analytics, Inc. for gas chromatography (2). We obtained the .sdf files for each of the compounds and used RDKit to create SMILES strings and generate molecular features.
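
A workflow of this general kind can be sketched with RDKit as shown below. This is a hedged illustration that assumes RDKit is installed. It is not the code from reference 1 and it does not reproduce the ABOCH feature set. The function names, the choice of a 2048-bit Morgan fingerprint of radius 2, the use of Crippen log P as the scalar descriptor, and the restriction to heavy-atom bonds are all assumptions made for this example.

```python
from collections import Counter

from rdkit import Chem
from rdkit.Chem import Descriptors, rdFingerprintGenerator

# Map RDKit's numeric bond order to a short symbol (1.5 denotes an aromatic bond).
BOND_SYMBOL = {1.0: "-", 1.5: ":", 2.0: "=", 3.0: "#"}


def smiles_from_sdf(sdf_path):
    """Read an .sdf file and return a SMILES string for each readable record."""
    supplier = Chem.SDMolSupplier(sdf_path)
    return [Chem.MolToSmiles(mol) for mol in supplier if mol is not None]


def featurize(smiles, radius=2, n_bits=2048):
    """Return Morgan bits, a log P estimate, and heavy-atom bond counts for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")

    # Fingerprint: fixed-length bit vector of circular substructures.
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=n_bits)
    bit_string = generator.GetFingerprint(mol).ToBitString()
    morgan_bits = [int(bit) for bit in bit_string]

    # Custom descriptor: Crippen estimate of log P.
    log_p = Descriptors.MolLogP(mol)

    # Bond counts: implicit hydrogens are not included here.
    bond_counts = Counter()
    for bond in mol.GetBonds():
        first, second = sorted(
            (bond.GetBeginAtom().GetSymbol(), bond.GetEndAtom().GetSymbol())
        )
        symbol = BOND_SYMBOL.get(bond.GetBondTypeAsDouble(), "~")
        bond_counts[f"{first}{symbol}{second}"] += 1

    return {"morgan_bits": morgan_bits, "log_p": log_p, "bond_counts": bond_counts}


# Toluene: one aromatic ring plus one single bond to the methyl carbon.
toluene_features = featurize("Cc1ccccc1")
print(len(toluene_features["morgan_bits"]))
print(dict(toluene_features["bond_counts"]))
```

In practice, `smiles_from_sdf` would be called on the structure file for each compound and the resulting SMILES strings passed to `featurize`. The literal SMILES string is used here only so that the sketch runs without an input file.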

We tested different molecular feature generation techniques in connection with different ML techniques for predicting VUV/UV spectra. We found that the introduction of a new feature set, used in combination with other established feature sets, provided the best performance. Our new feature set, termed ABOCH, better captured aromatic and unsaturated units, among other structural units in the molecules; these features are known to contribute significantly to the shape of VUV/UV absorption spectra. The combined features used in conjunction with a random forest ML model provided the best performance relative to other ML techniques and several DL techniques. The random forest model not only outperformed the DL models, but it was also much faster and provided results that were more easily interpretable. The ML model also outperformed previous predictions based on time-dependent density functional theory computations (3,4).
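
The general pattern of combining feature blocks and fitting a random forest to whole spectra can be sketched as follows, assuming NumPy and scikit-learn are available. Everything in this example is a placeholder: the arrays are random synthetic numbers rather than measured spectra, the block sizes and hyperparameters are arbitrary, and none of it reflects the settings or results of reference 1. The point is only to show how feature sets are concatenated, how a multi-output regressor predicts one value per wavelength, and where the interpretability comes from.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic placeholder data: 200 "molecules" and 50 "wavelength" points.
n_molecules, n_wavelengths = 200, 50
fingerprint_block = rng.integers(0, 2, size=(n_molecules, 64))  # binary bits
bond_count_block = rng.integers(0, 6, size=(n_molecules, 8))    # small counts

# Combining feature sets is a column-wise concatenation.
X = np.hstack([fingerprint_block, bond_count_block])

# Synthetic "spectra": a random linear map of the features plus noise.
weights = rng.normal(size=(X.shape[1], n_wavelengths))
Y = X @ weights + rng.normal(scale=0.1, size=(n_molecules, n_wavelengths))

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# RandomForestRegressor handles multi-output targets directly,
# so one model predicts the full spectrum for each molecule.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

# Feature importances indicate which input columns drive the predictions.
importances = model.feature_importances_
top_features = np.argsort(importances)[::-1][:5]

print(Y_pred.shape)
print(top_features)
```

Because each column of `X` corresponds to a named bit, descriptor, or bond count, the importance ranking can be read back in chemical terms, which is harder to do with the hidden layers of a DL model.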

There is still room for improvement. It was straightforward to add new molecular features based on the intuition that certain functional units contribute more significantly to the task at hand, in this case the prediction of molecular absorption properties. Additional feature generation techniques may better capture other functional units that contribute to spectral shape, such as aromatic units substituted with electron-withdrawing and electron-donating groups. This remains to be investigated.

A similar strategy to evolve new molecular features is being pursued for studies aimed at supercritical fluid extraction–supercritical fluid chromatography (SFE–SFC) method development. In this case, features are needed that effectively capture the functional units that drive intermolecular noncovalent interactions (for example, adsorption) and chromatographic partitioning processes. On-line SFE–SFC has the potential for an extremely wide application base, but optimization requires the consideration of numerous variables as parameter settings (5). The use of molecular feature generation techniques for efficient model development, as well as the use of advanced optimization strategies, should make method development easier for users.

In a time when AI has begun to permeate a great deal of the science we seek to develop, it is refreshing to realize that there is still some quite low-hanging fruit to pick. Molecular feature generation is an area that can be substantially developed further for specific tasks using chemical intuition. Additionally, it may not always be necessary to use the most complicated models to achieve strong performance. Simpler ML models can be faster and more easily interpretable than more complicated DL models, where many layers of information can be hidden from view.

This research was supported by a grant from the National Science Foundation (CHE-2108767).

Disclaimer: KAS is a scientific advisory board member for VUV Analytics. A management plan has been created to preserve objectivity in research in accordance with UTA policy.

References

(1) Ho Manh, L.; Chen, V.; Rosenberger, J.; Wang, S.; Yang, Y.; Schug, K. A. Prediction of Vacuum Ultraviolet/Ultraviolet Gas Phase Absorption Spectra using Molecular Feature Representations and Machine Learning. J. Chem. Inf. Model. 2024, 64, 5547–5556. DOI: 10.1021/acs.jcim.4c00676

(2) Schug, K. A.; Sawicki, I.; Carlton Jr., D. D.; Fan, H.; McNair, H. M.; Nimmo, J. P.; Kroll, P.; Smuts, J.; Walsh, P.; Harrison, D. Vacuum Ultraviolet Detector for Gas Chromatography. Anal. Chem. 2014, 86, 8329–8335. DOI: 10.1021/ac5018343

(3) Skultety, L.; Frycak, P.; Qiu, C.; Smuts, J.; Shear-Laude, L.; Lemr, K.; Mao, J. X.; Kroll, P.; Schug, K. A.; Szewczak, A.; Vaught, C.; Lurie, I.; Havlicek, V. Resolution of Isomeric New Designer Stimulants Using Gas Chromatography–Vacuum Ultraviolet Spectroscopy and Theoretical Computations. Anal. Chim. Acta 2017, 971, 55–67. DOI: 10.1016/j.aca.2017.03.023

(4) Schenk, J.; Mao, X.; Smuts, J.; Walsh, P.; Kroll, P.; Schug, K. A. Analysis and Deconvolution of Dimethylnaphthalene Isomers Using Gas Chromatography Vacuum Ultraviolet Spectroscopy and Theoretical Computations. Anal. Chim. Acta 2016, 945, 1–8. DOI: 10.1016/j.aca.2016.09.021

(5) Wicker, A. P.; Tanaka, K.; Nishimura, M.; Chen, V.; Ogura, T.; Hedgepeth, W.; Schug, K. A. Multivariate Approach to On-Line Supercritical Fluid Extraction–Supercritical Fluid Chromatography-Mass Spectrometry Method Development. Anal. Chim. Acta 2020, 1127, 282–294. DOI: 10.1016/j.aca.2020.04.068