A research group has developed an improved workflow for constructing machine learning (ML) models in oligonucleotide separation.
A team of scientists from the Department of Engineering and Chemical Sciences and the Department of Mathematics and Computer Science at Karlstad University, Sweden, has developed an improved workflow for constructing machine learning (ML) models to predict retention times and peak widths in oligonucleotide separation. Their work was published in the Journal of Chromatography A (1).
Oligonucleotides are short nucleic acid molecules used in therapeutic applications; they present a unique challenge in chromatography because of their complex structures. Any analytical method must be capable of separating, quantifying, and characterizing oligonucleotides and their potential impurities, which can arise from the multistep manufacturing process (2). The goal of this research was to create an ML-driven system that could accurately predict retention times and peak widths from large datasets, removing the need for time-consuming manual analysis.
Using a combination of ML techniques, the researchers built a systematic workflow capable of handling extensive datasets. They analyzed oligonucleotide forms, ranging from native to fully phosphorothioated structures, using three different gradient slopes. These oligonucleotides were separated on a C18 chromatographic system using tributylaminium ion-pair reagents. The study generated retention time data for approximately 900 sequences per gradient.
To process the large amount of data, the team implemented a semi-automated, rule-based approach for determining retention times, decomposing peaks and assessing their widths, and evaluating signal-to-noise ratio and skewness. The workflow also incorporated probability density functions (PDFs) to fit elution profiles, with an F-test used for PDF selection. Coeluting peaks were addressed using a multiple Gaussian PDF approach.
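As a rough illustration of how an F-test can choose between a single Gaussian and a multiple Gaussian description of an elution profile, the minimal sketch below fits both models to a synthetic trace with SciPy. The signal, noise level, starting guesses, and the 0.05 significance threshold are all assumptions made for this example; none of them come from the published workflow.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import f as f_dist


def gaussian(t, area, mu, sigma):
    """Gaussian PDF scaled by peak area."""
    return area / (sigma * np.sqrt(2 * np.pi)) * np.exp(-0.5 * ((t - mu) / sigma) ** 2)


def two_gaussians(t, a1, mu1, s1, a2, mu2, s2):
    """Sum of two Gaussian peaks, used here for a coeluting pair."""
    return gaussian(t, a1, mu1, s1) + gaussian(t, a2, mu2, s2)


def f_test(rss_simple, p_simple, rss_complex, p_complex, n_points):
    """Nested-model F-test: does the extra peak reduce the residuals enough?"""
    df_num = p_complex - p_simple
    df_den = n_points - p_complex
    f_stat = ((rss_simple - rss_complex) / df_num) / (rss_complex / df_den)
    return f_stat, f_dist.sf(f_stat, df_num, df_den)


# Synthetic chromatogram segment (assumed values): two overlapping peaks plus noise.
rng = np.random.default_rng(0)
t = np.linspace(8.0, 12.0, 400)
signal = two_gaussians(t, 1.0, 9.8, 0.12, 0.6, 10.2, 0.12) + rng.normal(0, 0.01, t.size)

# Fit both candidate models from rough starting guesses.
p1, _ = curve_fit(gaussian, t, signal, p0=[1.5, 10.0, 0.2])
p2, _ = curve_fit(two_gaussians, t, signal, p0=[1.0, 9.7, 0.1, 0.5, 10.3, 0.1])

rss1 = np.sum((signal - gaussian(t, *p1)) ** 2)
rss2 = np.sum((signal - two_gaussians(t, *p2)) ** 2)

f_stat, p_value = f_test(rss1, 3, rss2, 6, t.size)
use_two_peaks = p_value < 0.05  # assumed significance threshold
print(f"F = {f_stat:.1f}, p = {p_value:.3g}, two peaks preferred: {use_two_peaks}")
```

The same comparison could in principle be repeated with other PDFs or more components; the F-test only applies when the simpler model is nested in the more complex one.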
The encoded sequence data were modeled using multiple ML algorithms, including support vector regression (SVR), gradient boosting (GB), random forest (RF), and decision tree (DT). The results indicated that GB and SVR were the most effective models for predicting retention times. Although the RF and DT models were fast, they showed limited generalization capabilities.
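A comparison of this kind can be sketched with scikit-learn. The example below is only an illustration: the one-hot encoding, the synthetic sequences and retention times, and the hyperparameters are assumptions for this sketch, since the article does not describe how the sequences were encoded or how the models were tuned.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

BASES = "ACGT"


def one_hot(seq):
    """Encode a sequence as a flat 0/1 vector, four slots per position (assumed encoding)."""
    vec = np.zeros(len(seq) * len(BASES))
    for i, base in enumerate(seq):
        vec[i * len(BASES) + BASES.index(base)] = 1.0
    return vec


# Synthetic data: random 12-mers with made-up per-base retention contributions.
rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list(BASES), size=12)) for _ in range(300)]
contrib = {"A": 0.30, "C": 0.10, "G": 0.20, "T": 0.40}  # hypothetical values
y = np.array([5.0 + sum(contrib[b] for b in s) for s in seqs])
y = y + rng.normal(0, 0.05, len(seqs))  # simulated measurement noise
X = np.array([one_hot(s) for s in seqs])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "SVR": SVR(kernel="rbf", C=10.0, epsilon=0.01),
    "GB": GradientBoostingRegressor(random_state=0),
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    "DT": DecisionTreeRegressor(random_state=0),
}

# Fit each model and record its mean absolute error on held-out sequences.
mae = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mae[name] = mean_absolute_error(y_test, model.predict(X_test))

for name, err in sorted(mae.items(), key=lambda kv: kv[1]):
    print(f"{name}: held-out MAE = {err:.3f} min")
```

Evaluating on held-out sequences, as here, is what exposes limited generalization: a model can reproduce its training data closely and still predict unseen sequences poorly.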
The ML models encountered larger prediction errors for shallower gradient slopes and lower predictability for P=O sequences. The authors suggested that signal intensity and sequence heterogeneity contributed to these errors. Future improvements in signal-to-noise ratios, such as incorporating mass spectrometry in selected ion monitoring mode, could enhance predictability.
By using these ML models, scientists can predict chromatograms for various gradient slopes, allowing them to simulate impurity peak resolution across different experimental conditions. This could lead to more efficient drug development processes, especially in the production of therapeutic oligonucleotides. The ability to anticipate peak behavior before running actual experiments could reduce costs and improve analytical accuracy in pharmaceutical research. The approach also enables the prediction of resolution between critical solutes. The researchers acknowledge, however, that the models have limitations. Caution is advised when interpreting predicted separation performance, particularly resolution, because the main challenge lies in accurately predicting peak width (1).
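To see why peak width matters so much for predicted resolution, consider the standard resolution formula for two neighboring peaks, Rs = 2(tR2 - tR1)/(w1 + w2), where w is the baseline peak width. The short sketch below uses hypothetical retention times and widths (not values from the study) and assumes baseline widths are what the model predicts.

```python
def resolution(t_r1, w1, t_r2, w2):
    """Resolution between two peaks from retention times and baseline widths."""
    return 2.0 * abs(t_r2 - t_r1) / (w1 + w2)


# Hypothetical predictions for a critical pair (minutes).
t_r1, t_r2 = 10.0, 10.3
w_pred = 0.20  # predicted baseline width for both peaks

rs_nominal = resolution(t_r1, w_pred, t_r2, w_pred)

# Same retention times, but the true widths are 20% larger than predicted.
w_true = w_pred * 1.2
rs_actual = resolution(t_r1, w_true, t_r2, w_true)

print(f"Predicted Rs = {rs_nominal:.2f}, actual Rs = {rs_actual:.2f}")
```

In this made-up case, a 20% underestimate of width turns an apparent baseline separation (Rs of about 1.5) into a partial overlap (Rs of about 1.25), even though the retention times were predicted perfectly.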
(1) Samuelsson, J.; Enmark, M.; Szabados, G.; et al. Improved Workflow for Constructing Machine Learning Models: Predicting Retention Times and Peak Widths in Oligonucleotide Separation. J. Chromatogr. A 2025, 1747, 465746. DOI: 10.1016/j.chroma.2025.465746
(2) Fornstedt, T.; Enmark, M. Separation of Therapeutic Oligonucleotides Using Ion-pair Reversed-phase Chromatography Based on Fundamental Separation Science. J. Chromatogr. Open 2023, 3, 100079. DOI: 10.1016/j.jcoa.2023.100079