Hardware and Software Challenges for the Near Future: Structure Elucidation Concepts via Hyphenated Chromatographic Techniques

Balogh,Michael;Kind,Tobias;Fiehn,Oliver;

February 1, 2008

By Michael P. Balogh
Tobias Kind

LCGC North America

Volume 26

Issue 2

Pages: 176–187

This month, guest columnists Kind and Fiehn discuss small-molecule structure elucidation (excluding peptides) using hyphenated chromatographic techniques, mass spectrometers, and other spectroscopic detectors.

Choosing and effectively employing analytical technology is the broader discussion that underlies the diversity of topics that appear in this installment of "MS—The Practical Art." Over the last five years, we have presented the reasoning of the practitioners we consult, to disseminate their wisdom and further our own knowledge in a number of areas such as accurate mass measurement, sample preparation, modern small-particle ultrahigh-pressure liquid chromatography (UHPLC) technology, and evaluating the viability of emerging non-LC ionization prospects.

In almost every instance, I relied on the expertise of well-recognized practitioners, not only for their "nuts-and-bolts" explanations but also for the unique insight their experience affords, insight that we cannot glean from reading peer-reviewed work.

This month, Drs. Kind and Fiehn discuss small-molecule structure elucidation (excluding peptides) using hyphenated chromatographic techniques, mass spectrometers, and other spectroscopic detectors. Their opinions help us glimpse the hardware and software challenges that we face in the near future.

The World Outside Analytical Chemistry Fails to Recognize the Myriad and Difficult Intellectual Challenges That Attend the Structure Elucidation of Small Molecules

In the last 20 years, no major discussion of large-scale efforts on the structure elucidation of small molecules has found its way into high-impact journals like Science and Nature. Innovative, breakthrough technologies like comprehensive gas chromatography (GC×GC), Orbitrap mass spectrometry (MS) from Thermo Fisher Scientific (San Jose, California), DART ionization from JEOL (Peabody, Massachusetts), and UHPLC separations developed by Waters Corporation (UPLC; Milford, Massachusetts) appeared largely in method-oriented journals. Yet high-impact journals targeting broad readerships frequently showcase techniques like gene sequencing and expression, which are relatively mature. Why? Results obtained by genomic analysis can be linked directly to biological activity, gene names, and protein structures and their functions. Because they can deliver results that answer biological questions, such techniques attract major research interest.

There is also the problem of difficulty. Structure elucidation of small molecules using only hyphenated techniques like GC–MS and LC–MS without nuclear magnetic resonance (NMR) spectroscopy is extremely difficult to perform. The failure 40 years ago of the Dendral project (1) is an apt example. In that project, scientists pioneered computer-assisted structure elucidation (CASE) techniques. In doing so, they laid the foundation for most of the subsequent research in the field. Yet since then, no similar project has attracted funding, this despite the increasing sophistication of software technology and computer design.

Hyphenated Chromatographic Techniques Deliver Multidimensional Data

Let us first address the pure chromatographic technologies: orthogonal techniques like LC×LC, GC×GC, ultraperformance techniques like UHPLC, new separation principles, and new phases like monolithic capillary phases, hydrophilic interaction, and aqueous normal-phase chromatography. Unfortunately, those separation techniques cannot solve all of our current problems regarding structure elucidation of small molecules. Nevertheless, we must use them in conjunction with spectroscopic and MS detectors to separate complex mixtures and, finally, to obtain the true isomer structures of molecules.

We can say the same for pure MS technologies: ultrahigh resolving power, high mass accuracy, and high isotopic abundance accuracy alone cannot solve all our structure elucidation problems. Only together with chromatographic separations do they unfold their power. The process of structure elucidation requires use of all dimensions of the acquired multidimensional data. Running a GC–MS analysis on the basis of a mass spectral library search alone, without retention-index information, is not the best approach to confirming identity when an added informational dimension is easily available. Nor is it a good approach to acquire LC–MS chromatograms and then neglect the MSⁿ and MS^e information because software concepts for using it are lacking. MS^e is the practice of using high and low collision energies to enhance the spectral quality of an analyte in MS-MS. MSⁿ, a technique popularized in ion traps, successively fragments ions that are themselves products of an earlier fragmentation experiment, again to enhance spectral quality.

The Robustness of Experimental Platforms and Speed of Data Generation Versus Software Automation

Fortunately, for a wide range of products, LC and MS technologies have become robust and mature in recent years. Gone are the times when an ion source required weekly cleaning, defective resistors required manual soldering, and chromatography columns required manual coating. Yet such platforms deliver data, not answers, and they do so at such a speed and quality that we cannot keep pace evaluating the torrent of data. So, where lies the glue, the dark matter, the answer to our problem? In software databases? Automation techniques? New web technologies? No single technology can solve the riddle that voluminous and complex structural annotations pose, or tease biological answers from them. Only the trinity of technologies (separation, detection, and software) can do that. From our own research platform, which can handle several thousand samples a year, we observed that automation on the software side plays a crucial role. An experiment with 100 GC–MS metabolic profiling runs finishes in two days. But manually inspecting 100 chromatograms, each of which consists of 500 deconvoluted peaks (50,000 peaks in all), would require weeks of work. Automated peak annotation systems that apply high-quality criteria reduce this once formidable task to a matter of minutes, with far more consistency and reliability.

Blending Analytical Chemistry with Cheminformatics, Machine Learning, and Database Concepts

One of the basic ideas for improving structure elucidation is to use multiple chemometric capabilities comprehensively. These capabilities depend upon tools for handling small molecules (viewers, structure editors, and local database tools), such as those found in the free Instant JChem database (www.chemaxon.com/product/ijc.html), and upon tools for calculating compound properties (logP, boiling points, and water solubility), such as those found in the free EPA EPI-Suite (www.epa.gov/opptintr/exposure/pubs/episuitedl.htm). Instant JChem Personal includes all search, viewing, structure import and export, forms, and relational data capabilities as standard; accessing external database engines, sharing local database tables, defining custom structure canonicalization, and batch processing structure-based calculations require a paid license. The capabilities also depend upon quantitative structure–property relationships (QSPR).

A key issue is access to the molecular information in large databases such as PubChem, ChemSpider, or eMolecules. Database access is important because it limits the search space of a given problem. Molecular isomer generators (like MOLGEN–MS or generators in the Chemistry Development Kit) can enumerate all possible structural isomers from a given molecular formula. However, even when they include molecular substructure information (for example, that a specific compound contains a methyl group but not a nitro group), these data alone are insufficient because of the combinatorial explosion in the number of isomers. So a reasonable approach is to first compare the physicochemical properties of compound peaks to those of known molecules instead of trying to identify all peaks de novo. Working on a small subset of only, say, a million known molecules, instead of trillions of molecules, does not always lead to the correct solution, but it does limit the computational effort tremendously.
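The database-first idea can be sketched in a few lines of Python. This is a minimal illustration only: the four-entry `KNOWN` list stands in for a large structure database such as PubChem, and the `groups` tags and the `candidates` function are hypothetical names introduced here, not part of any real database API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KnownCompound:
    name: str
    formula: str                      # molecular formula in Hill notation
    groups: frozenset = frozenset()   # hypothetical substructure tags

# Tiny stand-in for a large database of known molecules (illustrative only).
KNOWN = [
    KnownCompound("octanoic acid", "C8H16O2", frozenset({"methyl", "carboxylic acid"})),
    KnownCompound("ethyl hexanoate", "C8H16O2", frozenset({"methyl", "ester"})),
    KnownCompound("hexyl acetate", "C8H16O2", frozenset({"methyl", "ester"})),
    KnownCompound("nitrobenzene", "C6H5NO2", frozenset({"nitro", "phenyl"})),
]

def candidates(formula, required=(), forbidden=()):
    """Return known compounds matching a formula and substructure constraints."""
    required, forbidden = set(required), set(forbidden)
    return [
        c for c in KNOWN
        if c.formula == formula
        and required <= c.groups          # every required group is present
        and not (forbidden & c.groups)    # no forbidden group is present
    ]

hits = candidates("C8H16O2", required=["methyl", "ester"], forbidden=["nitro"])
print([c.name for c in hits])
```

The lookup only ever touches molecules that are already known, which is exactly why it is cheap and also why it cannot find a truly novel structure.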

Interpretation and in silico prediction constitute two general ways to improve the structure elucidation process. Interpretation extracts structural information from molecular and mass spectra using expert algorithms or machine-learning techniques. Many such tools are already used in analytical labs, for example, the MS Interpreter included with the NIST MS Search program. In silico prediction estimates molecular properties, like LC and GC retention information, from candidate structures and calculates molecular spectra, like infrared and NMR, and (possibly) mass spectra. A large set of in silico spectra can then be matched against experimentally obtained spectra. Examples include the NIST retention index prediction algorithm and MassFrontier's mass spectral prediction algorithm, the latter of which can calculate "barcode" mass spectra (lacking intensity information).
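To make the "barcode" comparison concrete, the following Python sketch scores a predicted fragment list against an experimental peak list while ignoring intensities. It is not MassFrontier's algorithm; the function name `barcode_match`, the 0.5 m/z tolerance, and all peak values are assumptions chosen for illustration.

```python
def barcode_match(predicted_mz, experimental_mz, tol=0.5):
    """Fraction of predicted fragment m/z values found in the experimental spectrum.

    Intensities are ignored ("barcode" comparison); tol is in m/z units.
    """
    if not predicted_mz:
        return 0.0
    found = sum(
        any(abs(p - e) <= tol for e in experimental_mz)
        for p in predicted_mz
    )
    return found / len(predicted_mz)

# Illustrative peak lists, not measured data.
experimental = [43.0, 60.0, 73.0, 85.0, 101.0, 115.0]
candidate_a = [43.0, 60.0, 73.0, 101.0]   # hypothetical predicted fragments
candidate_b = [43.0, 57.0, 88.0, 99.0]

print(barcode_match(candidate_a, experimental))
print(barcode_match(candidate_b, experimental))
```

A score like this can rank candidates, but because it discards intensities it is better treated as a filter than as proof of identity.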

Comparing Experimental Data with in silico Calculated Properties and Spectra

The availability of large structure databases allows an approach to structure elucidation that was unimaginable before the existence of free databases like PubChem. Today, anyone can perform the task with commodity hardware and freely available software tools. Molecular spectra, like infrared, ultraviolet, or NMR, can be calculated using either some of the higher-order expert systems or quantum mechanical ab initio or semiempirical methods (see Table I for possible approaches). Researchers who perform GC–IR or LC–NMR measurements could match a precalculated database of molecular spectra against spectra they obtained experimentally, excluding the majority of structures. The success rate depends not only upon the speed of such a calculation, but also on the accuracy of the prediction. In the case of higher-order NMR expert systems, the prediction accuracy (for NMR shifts) is between 1 and 3 ppm (2). You can avoid the sensitivity issues associated with Raman, IR, circular dichroism (CD), fluorescence, and ultraviolet molecular spectra by using robotized solid-phase extraction (SPE) and automated fractionation processes for offline investigations.

Table I: Structure elucidation concepts for hyphenated chromatographic techniques

Though the technology has advanced rapidly and far, computer prediction of mass spectra is still far removed from reality, simply because we cannot yet successfully model gas-phase ion processes. Fast semiempirical methods like AM1, PM3, or PM6 and slower (but more accurate) density functional theory (DFT) models already are used successfully during interpretation of mass spectral rearrangements and fragmentation pathways. Systems such as MassFrontier, which includes a reaction library of 19,000 mechanisms collected from thousands of MS publications, can be used as state-of-the-art tools for in silico fragmentation predictions of compounds. Such an approach works well when applied to compound classes with similar structures, such as phospholipids or glycerolipids. Theoretical fragmentations of these lipids can be compared with experimentally obtained tandem mass spectral (MS-MS) patterns. Similar functions are used for carbohydrate sequencing, or they are included in databases such as LipidMaps (Lipid Metabolites and Pathways Strategy, www.lipidmaps.org/) for use in identifying lipids.

In silico Prediction of Retention Index or Retention Time From Molecular Structures

For more than 20 years, investigators have examined retention-time or retention-index predictions for chromatographic separations. In GC, most prediction models are based upon relatively few compounds (usually < 500), which is a severe problem. A second problem, though one of lesser consequence, is that only specific compound classes were used for model building, so, given the small sample size, the models do not apply to other compound classes. Other problems include differing column selectivities and GC temperature programs.

It was only in 2005 that the Mass Spectrometry division of NIST (National Institute of Standards and Technology, USA), led by Steve Stein, released a large collection of GC retention indices collected from the literature. The database includes 121,112 Kovats retention indices for 25,983 compounds. Additionally, a retention-index estimation program (3) was released together with the free NIST MS Search program. Its prediction accuracy (an absolute error of about 50–70 Kovats units) is not sufficient to obtain unique identifications. Nevertheless, you can apply it as a powerful filter in a more complex setup.

Figure 1 shows all Kovats indices for all structural isomers of the molecular formula C₈H₁₆O₂. The formula could be obtained by accurate mass measurement and isotopic pattern filtering from a time-of-flight instrument using soft ionization techniques. Such an orthogonal filter can be powerful: only 20 compounds out of all 13,190 structural isomers have a Kovats index above 1300. However, many more candidate structures would remain if a C₈H₁₆O₂ peak was detected at a Kovats index of about 1050.
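A retention-index filter of this kind is simple to express in code. The Python sketch below keeps only candidates whose predicted index falls within a tolerance window around the measured one; the 70-unit default mirrors the upper end of the prediction error quoted above, while the isomer labels and index values are invented for illustration and the function name `ri_filter` is an assumption.

```python
def ri_filter(candidates, measured_ri, tolerance=70.0):
    """Keep candidates whose predicted retention index lies within the window.

    candidates: iterable of (name, predicted_ri) pairs.
    """
    return [
        name for name, predicted_ri in candidates
        if abs(predicted_ri - measured_ri) <= tolerance
    ]

# Hypothetical isomers with made-up predicted Kovats indices.
isomers = [
    ("isomer A", 975.0),
    ("isomer B", 1040.0),
    ("isomer C", 1110.0),
    ("isomer D", 1320.0),
]

print(ri_filter(isomers, measured_ri=1050.0))
print(ri_filter(isomers, measured_ri=1310.0))
```

As in Figure 1, the filter is most selective where few isomers elute; in a crowded index region several candidates survive and further filters are needed.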

Figure 1

There are a number of reasons why no general LC retention index system has yet been developed, including the influence of different solvent, temperature, and buffer systems as well as large differences in column selectivity. A recent review by Roman Kaliszan (4) in Chemical Reviews listed over 370 references that use quantitative structure–retention relationships (QSRR). Octanol–water partition coefficient (logP) values show good correlations to LC retention times. Such logP algorithms, widely used in the pharmaceutical industry, demonstrate good prediction capabilities. One can tackle the remaining problem by taking the more complex distribution coefficients (logD) into account. You calculate logD values by including the extent to which a compound ionizes at different pH values, using pK_a values. Even if none of these prediction algorithms is accurate enough to yield a unique result, they are nonetheless useful in removing most false candidates. Such filters also can be used in powerful multifilter cascades, limiting the search space even further.
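For a compound with a single ionizable group, the relationship between logP, pK_a, and logD can be written down directly. The Python sketch below uses the common textbook approximation that only the neutral species partitions into octanol; the function name `log_d` and the example values (logP 3.0, pK_a 4.8) are assumptions for illustration, and commercial logD predictors are considerably more elaborate.

```python
import math

def log_d(log_p, pka, ph, kind="acid"):
    """logD of a monoprotic acid or base, assuming only the neutral form partitions."""
    if kind == "acid":
        ionized_ratio = 10 ** (ph - pka)   # [A-]/[HA]
    elif kind == "base":
        ionized_ratio = 10 ** (pka - ph)   # [BH+]/[B]
    else:
        raise ValueError("kind must be 'acid' or 'base'")
    return log_p - math.log10(1.0 + ionized_ratio)

# Hypothetical carboxylic acid: logP 3.0, pKa 4.8.
for ph in (2.0, 4.8, 7.4):
    print(ph, round(log_d(3.0, 4.8, ph, "acid"), 2))
```

At low pH the acid is neutral and logD approaches logP; above the pK_a the value drops by roughly one unit per pH unit, which is why buffer pH changes LC retention so strongly.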

Future Technology Requirements for GC–MS Based Structure Elucidation

For GC–MS the ultimate technology platform for structure elucidation would be a comprehensive, two-dimensional GC×GC coupled to a fast-scanning (50–100 spectra per second), high-mass-accuracy (1–5 ppm), and high-resolving-power (10,000 RP) mass spectrometer. Such an instrument would allow positive and negative switching and MSⁿ technologies together with soft ionization techniques. GC×GC would deliver the best chromatographic resolution. The required scanning speed is due to the high resolution of the chromatographic separation itself, as 2D peaks can have peak widths of about 100–200 ms. For deconvolution purposes, a large number of mass spectra must be acquired along a peak. For GC–MS, the detection of the molecular ion is crucial to calculate sum formulas of the intact molecule. Hence, improved GC×GC–MS instruments should utilize soft ionization techniques such as chemical ionization, field emission, or field desorption. The detection of the molecular ion, together with accurate isotopic abundances, would allow the calculation of correct molecular formulas. Mass fragmentations would be required for further structure annotations.
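The link between peak width and scan speed is simple arithmetic, sketched here in Python using only the figures quoted above (peak widths of 100–200 ms, 50–100 spectra per second); the function name is an assumption.

```python
def spectra_per_peak(peak_width_ms, scan_rate_hz):
    """Number of spectra acquired across one chromatographic peak."""
    return peak_width_ms * scan_rate_hz / 1000.0

# Combinations of the peak widths and scan rates quoted in the text.
for width_ms in (100, 200):
    for rate_hz in (50, 100):
        print(f"{width_ms} ms peak at {rate_hz} spectra/s:",
              spectra_per_peak(width_ms, rate_hz), "spectra")
```

Even at the quoted rates, a narrow second-dimension peak is sampled only a handful of times, which is why slower-scanning instruments leave too few points for deconvolution.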

Current trends in miniaturization could lead to detector hybrid systems that would include MS and element-specific detectors in a single modular instrument. These instruments would include as many as four simultaneous detectors using column-split technologies and would satisfy the requirements for obtaining a complete analytical overview from a complex sample. Moreover, such instruments were demonstrated 20 years ago (5), so there is no question of whether they can be designed and built. Rather, the challenge lies in developing data systems that can accommodate the complex and voluminous data stream that such multifarious instruments would produce.

Future Technology Requirements for LC–MS Based Structure Elucidation

In LC–MS, good chromatographic peak resolution and total peak capacity are more important than speed for peak detection and de novo structure elucidation. Comprehensive, two-dimensional LC×LC (6) or monolithic columns in reversed-phase, normal-phase, and hydrophilic interaction mode (7) can provide such enhanced resolution. You can couple LC to multidetector arrays that include fluorescence, electrochemical, evaporative light scattering, and infrared detectors. Indeed, connecting a UV detector with a mass spectrometer is a common standard today. It is essential in MS to switch ionization polarities and to employ multimode ion sources (electrospray, atmospheric pressure chemical ionization, and atmospheric pressure photoionization) and, thus, cover different polarity ranges and compound classes.

As we previously suggested, the capability to combine multiple data streams from different detector systems and synchronize them to adjust for retention-time shifts is crucial. The optimal approach would obtain as many signals as possible from different detectors in a single run. Otherwise, you must use internal marker substances to combine detector data obtained from the different sources. However, we cannot meaningfully discuss in this column the many technical concerns that would attend a multiple-data-stream approach, like sensitivity issues, which for multimode detectors would require additional preconcentration steps, or incompatible solvent systems. Nor can we address the plethora of different mass spectrometers and their hybrids. Bear in mind, nevertheless, that for structure elucidation, our goal is high resolving power (RP > 30,000), high mass accuracy (less than 1 ppm), and high isotopic abundance accuracy (< 5% absolute error).
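What these targets mean in mass units can be worked out with a short Python sketch. It computes a monoisotopic mass from element counts, the mass error in ppm, and the peak width (full width at half maximum) implied by a given resolving power. The function names, the "measured" value, and the choice of the neutral C₈H₁₆O₂ mass (ignoring adducts and the electron mass) are assumptions made for illustration.

```python
# Monoisotopic masses of the most abundant isotopes (u), rounded to six decimals.
MONOISOTOPIC = {"C": 12.000000, "H": 1.007825, "N": 14.003074, "O": 15.994915}

def monoisotopic_mass(counts):
    """Sum of monoisotopic masses for a dict of element counts, e.g. {'C': 8}."""
    return sum(MONOISOTOPIC[element] * n for element, n in counts.items())

def ppm_error(measured, theoretical):
    """Relative mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6

def fwhm(mz, resolving_power):
    """Peak width at half maximum implied by RP = m / delta-m."""
    return mz / resolving_power

theoretical = monoisotopic_mass({"C": 8, "H": 16, "O": 2})   # C8H16O2
measured = 144.11517                                          # hypothetical reading
print(round(theoretical, 5))
print(round(ppm_error(measured, theoretical), 2), "ppm")
print(fwhm(300.0, 30_000), "m/z units at RP 30,000")
```

A 1 ppm window at this mass is only about 0.00014 u wide, which shows why sub-ppm accuracy narrows the list of possible molecular formulas so effectively.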

Additionally, the mass spectrometer must be of the fast-scanning type, able to acquire MSⁿ (n < 4), MS^e, and collision-induced dissociation (CID) spectra. Yet the rule is that the higher the resolving power used, the slower the resultant scan speed. You can, however, lower the mass spectrometer's resolving power without compromising a scan's overall efficacy when chromatographic peak resolution already separates the compounds. When doing so, note that MS scans over a full chromatographic peak should be averaged. You also must perform multiple scans over the full width of a peak to calculate the peak's purity during the deconvolution process. If peak purity is not high enough, you must assume that the MS-MS spectra were obtained from potentially overlapping compounds, and that they, too, are not pure enough.

Software Tools for GC–MS and LC–MS Based Structure Elucidation

Evaluating GC–MS data is, in general, a dual-stage process. First, retention-time information is converted into Kovats retention indices, which allows lookup in precalculated retention-index tables or databases. Second, deconvolution and peak detection are performed. Electron ionization spectra can be interpreted with the mass spectral tools found in AMDIS (Varmuza classifiers) and with expert algorithms for substructure determination from mass spectra, such as those in the NIST MS Search program (Stein algorithm) or the free NIST MS Interpreter program. These expert algorithms detect special functional groups or substructures. You will find the substructures listed in the AMDIS handbook.
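The first stage, converting retention times into indices, can be sketched as follows. Note that this Python example uses the linear interpolation between bracketing n-alkanes that is customary for temperature-programmed runs (the van den Dool and Kratz form) rather than the logarithmic isothermal Kovats formula; the function name and the alkane retention times are invented for illustration.

```python
from bisect import bisect_right

def linear_retention_index(rt, alkanes):
    """Linear retention index by interpolation between bracketing n-alkanes.

    alkanes: list of (carbon_number, retention_time), sorted by retention time.
    """
    times = [t for _, t in alkanes]
    if rt < times[0] or rt > times[-1]:
        raise ValueError("retention time outside the alkane ladder")
    # Index of the alkane eluting at or just before rt (clamped to a valid pair).
    i = min(bisect_right(times, rt) - 1, len(alkanes) - 2)
    n, t_n = alkanes[i]
    n_next, t_next = alkanes[i + 1]
    return 100.0 * (n + (n_next - n) * (rt - t_n) / (t_next - t_n))

# Hypothetical alkane ladder: (carbon number, retention time in minutes).
ladder = [(10, 6.0), (11, 7.5), (12, 9.0)]
print(linear_retention_index(6.75, ladder))
```

Because the index is anchored to co-analyzed alkanes, it stays comparable across instruments and days in a way that raw retention times do not.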

Only some of these classifiers have a high discriminating power. As an example, you may find many targets pointing to elemental sulfur when there is absolutely no sulfur present (www.amdis.net/What_is_AMDIS/AMDIS_Advanced/amdis_advanced.html).

Additionally, you can perform in silico fragmentation of possible structures using MassFrontier or ACDLabs Fragmenter and then match the obtained barcode spectra against experimental spectra. You can use structure generators for low-molecular-weight compounds, such as MOLGEN–MS (Bayreuth) or the isomer generator from the open-source Chemistry Development Kit (CDK), to generate possible isomers and match the obtained features against experimental retention indices and mass spectra.

With LC–MS technology, challenges abound, including those associated with peak deconvolution, peak extraction, and adduct detection. Furthermore, retention-index concepts are not yet fully developed because of the problems we mentioned earlier. The solution, when it comes, will be to produce retention-index models. Such models would comprise a diverse set of compounds (n > 1000) and would be developed under standardized solvent, buffer, and temperature conditions for a specific set of reversed, normal, and hydrophilic interaction phases.

A library search is usually the initial step in a compound identification procedure. Yet no large public or commercially available spectral library (n > 100,000) for LC–MS and MSⁿ data exists (8). The fact that we lack such a library obviously limits the new approaches we can devise.

To create libraries, we must acquire large collections (n > 10,000) of mass spectral "trees." Such trees would include MSⁿ and MS^e and CID spectra at different voltages, because mass spectra and tandem-mass spectra sometimes contain only a few features or a single prominent peak. Using robotized mass-spectral infusion you can obtain a rich set of mass spectral fragmentation patterns and create such databases. Similar core structures in the gas phase result in similar mass spectral patterns. A diverse and large set of such compounds can collectively yield a database of core structures and fragmentation patterns that you can apply to further structure elucidation. The additional use of accurate tandem and MSⁿ spectra allows the correct calculation of molecular formulas and better interpretation of such obtained fragments.

Many issues remain unresolved because data sets acquired from modern analytical techniques produce far more information than we currently use. For example, in Figure 2, an ion map experiment generates a three-dimensional map of precursor ions and corresponding product ions over the full mass range of a molecule. The third-dimension intensities are color coded. Note that a product ion was obtained for every precursor ion. Currently no software exists that can interpret such a multiple-feature ion map. The richness of this experiment and others like it requires new software approaches to better understand MS fragmentations and rearrangements.

Figure 2

Metabolomics Requires a Metabolite-Scoring Algorithm Similar to SEQUEST

A general solution for better and faster structure elucidation is preparative, robotic fractionation with high-resolution, reversed- or normal-phase LC used in conjunction with 96-well-plate fraction collectors. Alternatively, even preparative GC fractionation and subsequent identification using ¹H, ¹³C, and 2D NMR works well. However, you must track such compounds in databases so that they are not rediscovered. You must also annotate them by MS and chromatographic techniques so that you can find them in independent analysis runs.

It is currently almost impossible to perform de novo structure elucidation with chromatographic and MS techniques alone. Therefore, embracing new data-sharing policies for compound spectra, and using cheminformatics and chemometric approaches together with free and open structure databases like PubChem, would hasten the structure elucidation of small molecules.

Natural-product researchers have shown the correct way to elucidate complex structures. Doctoral theses cover the wealth of structure elucidation of natural products and reveal the broad range of analytical techniques involved. But they do not address the time-consuming and cumbersome procedures that must be undertaken first, or the lengthy road that still lies ahead. Eventually, we still must perform organic syntheses of the authentic compounds to validate structural claims. Consider that in academia, the speed of structural elucidation may be as low as one compound per year per scientist. Metabolomics aims to identify all metabolites (in contrast to drug metabolism or biomarker studies), so metabolomics scientists must advance compound annotations much faster than is currently possible. For techniques that screen large numbers of unknowns, we need a compound scoring system similar to the BLAST, MASCOT, or SEQUEST algorithms used for nucleotide or protein analysis. A small-molecule scoring system must consist of a minimum set of different orthogonal parameters with a weighting formula that is yet to be developed. Instead of annotating a metabolite as an unknown compound, the metabolite score would carry a list of possible structures or structure sets together with the level of uncertainty and the methods used for identification.
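Since no accepted weighting formula exists, the following Python sketch is purely illustrative of what such a score might look like: a weighted mean of per-filter scores between 0 and 1, reported together with the methods that contributed. The weights, the evidence categories, the candidate names, and the function names are all hypothetical placeholders, not a proposal for the actual formula.

```python
# Hypothetical weights for four orthogonal lines of evidence (placeholders only).
WEIGHTS = {"mass": 0.3, "isotope": 0.2, "retention": 0.2, "msms": 0.3}

def metabolite_score(evidence, weights=WEIGHTS):
    """Weighted mean of per-filter scores in [0, 1]; missing evidence counts as 0."""
    total = sum(weights.values())
    return sum(w * evidence.get(key, 0.0) for key, w in weights.items()) / total

def rank(candidates):
    """Return (name, score, methods used) tuples, best score first."""
    scored = [
        (name, metabolite_score(ev), sorted(ev))
        for name, ev in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Illustrative candidates for one unknown peak.
unknown_peak = {
    "candidate A": {"mass": 1.0, "isotope": 0.9, "retention": 0.8, "msms": 0.7},
    "candidate B": {"mass": 1.0, "isotope": 0.9},
}
for name, score, methods in rank(unknown_peak):
    print(name, round(score, 2), methods)
```

The output is a ranked list with the supporting methods attached, which matches the idea of reporting candidate structures with their uncertainty rather than a bare "unknown."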

Oliver Fiehn studied analytical chemistry at the Freie Universitaet Berlin and worked as a research scientist for water quality control at TU Berlin, Germany. He received his PhD from Technical University Berlin in 1997 and in 2000 became group leader for Metabolomics at the Max-Planck-Institute for Molecular Plant Physiology in Potsdam, Germany. Since 2004 he has been an associate professor at the UC Davis Genome Center. He is on the board of directors of the Metabolomics Society and is the current chair of the Metabolomics Standards Initiative.

Tobias Kind studied analytical chemistry in Merseburg, Bayreuth and Leipzig. Later he worked for four years at the UFZ (Centre for Environmental Research) Leipzig-Halle and finished his PhD studies focused on the combination of GC-MS with chemometrics at Leipzig University. He was a member of the Metabolomic Analysis Group at the Max Planck Institute of Molecular Plant Physiology in Golm. In 2004 he joined Oliver Fiehn's lab at the UC Davis Genome Center - California (USA) as a Post-Doc. His primary research area is structure elucidation of small molecules with a focus on metabolomics by using a comprehensive approach including analytical techniques, chemometrics, cheminformatics and databases.

Michael P. Balogh

"MS — The Practical Art" Editor Michael P. Balogh is principal scientist, MS technology development, at Waters Corp. (Milford, Massachusetts); President and co-founder of the Society for Small Molecule Science which organizes the annual CoSMoS conference; and a member of LCGC's editorial advisory board.

References

(1) Applications of Artificial Intelligence for Organic Chemistry: The DENDRAL Project, profiles.nlm.nih.gov/BB/A/L/A/F/

(2) M.E. Elyashberg, A.J. Williams, and G.E. Martin, Computer-assisted structure verification and elucidation tools in NMR-based structure elucidation. Progress in Nuclear Magnetic Resonance Spectroscopy, in press, corrected proof (Elsevier, Amsterdam, The Netherlands).

(3) S.E. Stein, V.I. Babushok, R.L. Brown, and P.J. Linstrom, J. Chem. Inf. Model. 47(3), 975–980 (2007).

(4) R. Kaliszan, Chem. Rev. 107(7), 3212–3246 (2007).

(5) D. Janssen, Fresenius Zeitschrift Fur Analytische Chemie 331(1), 20–26 (1988).

(6) P. Jandera, LCGC Eur. 20(10), 510 (2007).

(7) K. Horie, T. Ikegami, K. Hosoya, N. Saad, O. Fiehn, and N. Tanaka, J. Chromatogr., A 1164(1–2), 198–205 (2007).

(8) J.M. Halket, D. Waterman, A.M. Przyborowska, R.K.P. Patel, P.D. Fraser, and P.M., J. Experimental Botany 56(410).
