LCGC North America
Decision trees offer great visuals to observe complex data sets and to classify data according to simple decision rules.
Decision trees, or classification trees, offer a simple visual representation to understand and interpret your data, or to build automated predictive models. The figure obtained has a tree-like shape, where each end branch contains one group (of samples, methods, or any objects to classify). Nodes usually represent data or simple "if-then-else" decision rules to classify one object into one group. The predicted variable (belonging to one class) may be defined by qualitative (categorical data) or quantitative (numerical data) attributes. Ideally, the tree should represent all data with the smallest number of nodes, but a data set containing high variability would necessitate many nodes. Decision trees are, again, an example of supervised methods, but with one significant advantage over other methods from previous chapters: little data preparation is required (usually no normalization is necessary).
Here are a few simple examples to understand the features of this straightforward method.
Imagine you are a Parisian planning your next vacation, and would like to go on a trip abroad. Destinations may be classified according to a number of criteria: for example, climate, cultural interest (museums and places of interest), possibility for outdoor activities, and traveling distance. To illustrate this example, we have selected some places where one may want to go on holiday, and have defined their characteristic features (Table I). Each criterion has scores from 0 to 10, based on climatic conditions (amount of sunshine, average ambient temperature, rainfall), places to visit, nature interest, child-friendliness, shopping opportunities, and travel distance from Paris, France. "Interest" criteria were defined according to rankings proposed by travel agency websites.
Then I asked two colleagues (called "candidates") whether they would spontaneously want to go to each place. Using their "yes" or "no" answers as a qualitative dependent variable and the selected criteria as quantitative classifiers, decision trees were calculated. For some candidates, the decision trees were too complicated and difficult to interpret (a typical case of overfitting, where the data cannot be grouped into a small number of classes). One possible cause of failure was that the selected criteria did not reflect the reasons why the candidate preferred a certain destination. In other words, the chosen variables were not discriminating: they could not explain the candidate's choices. Another reason why a decision tree sometimes failed to give adequate results was that the candidate wanted to go everywhere. For example, those who really love to travel and want to discover the whole world gave too few negative answers, so no criterion appeared to weigh against a destination in their decision process. A balanced data set is necessary to obtain a meaningful decision tree.
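To make the idea of a "discriminating" variable concrete, here is a minimal sketch of how one node of a classification tree might be chosen. It assumes a CART-style Gini impurity criterion, which is an assumption on my part (the software and splitting criterion behind the trees in this article are not stated). The scores and answers are invented for illustration and are not the values of Table I; the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best 'value <= t' split.

    Returns (None, impurity_of_all_labels) when no split lowers the impurity.
    """
    best_t, best_imp = None, gini(labels)
    ordered = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(ordered, ordered[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if imp < best_imp:
            best_t, best_imp = t, imp
    return best_t, best_imp


# Hypothetical cultural-interest scores (0-10) and answers, NOT Table I.
culture = [10, 9.5, 9, 6, 4, 10, 3, 7]
answers = ["yes", "yes", "no", "no", "no", "yes", "no", "no"]
print(best_split(culture, answers))

# A candidate who wants to go everywhere: there is nothing to split on.
print(best_split(culture, ["yes"] * len(culture)))
```

With these invented numbers, the first call finds a threshold that separates the two answers cleanly, while the second call returns no threshold at all. This is the situation described above for the candidate who answers "yes" to every destination.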
For some candidates, however, the decision criteria were quite good and produced easy-to-read decision trees. Two of them are presented in Figure 1. Eric's decision tree (Figure 1a) is quite straightforward; clearly, cultural interest is a strong decision criterion, as it appears as the first node, with most "yes" answers in the group of destinations with high cultural interest (scores in the range 9.5–10.0, on the right of the figure). For destinations with less cultural interest (on the left of the figure), he may still be interested if the destination offers nature interest (second node at the bottom left and bottom middle).
Figure 1: Decision trees for travel destinations for (a) Eric and (b) Elise.
Elise had very different decision criteria (Figure 1b). First, she likes to travel to long-distance destinations, as the first node is "flying time", with most "yes" answers in the long-distance groups (middle and right of the figure) and most "no" answers in the short-distance group (left of the figure). When the distance is not long (left of the figure), she may still be interested in sunny places (bottom left), but cares little for nature or child-friendliness (middle nodes).
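Once calculated, a tree of this kind is nothing more than nested "if-then-else" rules. As a rough sketch, Eric's tree could be written as follows. Only the 9.5 cut-off on cultural interest comes from the description of Figure 1a; the cut-off on nature interest is a placeholder I made up, and each branch returns the majority answer of its group (the text says "most", not all, answers agree).

```python
def eric_wants_to_go(culture: float, nature: float,
                     culture_cut: float = 9.5,
                     nature_cut: float = 7.0) -> str:
    """Majority answer predicted by a two-node tree resembling Figure 1a.

    culture_cut follows the 9.5-10.0 range quoted in the text;
    nature_cut is a hypothetical value chosen for illustration only.
    """
    # First node: cultural interest.
    if culture >= culture_cut:
        return "yes"
    # Second node, reached only for lower cultural interest: nature interest.
    if nature >= nature_cut:
        return "yes"
    return "no"


# Hypothetical destinations (scores invented for illustration).
print(eric_wants_to_go(culture=10, nature=2))  # high culture
print(eric_wants_to_go(culture=6, nature=9))   # low culture, high nature
print(eric_wants_to_go(culture=6, nature=3))   # low on both
```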
Let us now observe a similar strategy applied to chromatographic experiments, to unravel the enantioselective mechanisms contributing to successful chiral separations. Understanding the structural features that contribute to the enantiorecognition process is a difficult task. In a recent study (1), the enantioresolution of some chiral sulfoxide species on polysaccharide stationary phases was explored. Particularly good resolution on chlorinated polysaccharides was observed. It seemed that the general shape of the molecule was an indication of the resolution that could be achieved; indeed, compact or spherical shapes (when the molecule was folded) yielded higher resolution values than linear shapes (when the molecule was preferentially in an extended conformation). To illustrate this example, the values of enantioselectivity measured for 24 chiral sulfoxides on one chiral stationary phase in one set of operating conditions were used as the quantitative dependent variable. Molecular descriptors quantifying the structural attributes were computed and used as quantitative independent variables (the classifiers) for a decision tree. The result can be observed in Figure 2. The whole group of racemates is first divided according to sphericity, with low-sphericity compounds on the left yielding low enantioselectivity (red color), and high-sphericity species on the right yielding generally higher enantioselectivity values. In the left group, higher enantioselectivity values were obtained for the analytes that possessed more pi and nonbonding electrons (second node). In the right group, the capability for hydrogen bonding was a second classifier, although perhaps not as clearly discriminating. Clearly, additional data would be required to obtain a more reliable model that would be suitable for prediction, but these observations are still in accordance with the general observations on chromatographic results within this data set.
Figure 2: Decision tree for enantiorecognition process on a given chiral stationary phase for a set of chiral sulfoxide compounds.
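When the dependent variable is quantitative, as enantioselectivity is here, a node can be chosen so that the two resulting groups are each as homogeneous as possible. The sketch below assumes a common least-squares criterion (minimizing the summed squared deviation from each group mean); this is my assumption, not necessarily what was used to build Figure 2. The sphericity and enantioselectivity values are invented and are not data from reference 1.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_regression_split(x, y):
    """Return (threshold, left_mean, right_mean) minimizing the total SSE.

    Returns None when no split improves on the unsplit data.
    """
    best, best_sse = None, sse(y)
    ordered = sorted(set(x))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(ordered, ordered[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        total = sse(left) + sse(right)
        if total < best_sse:
            best_sse = total
            best = (t, mean(left), mean(right))
    return best


# Hypothetical descriptor and response values (NOT from reference 1).
sphericity = [0.50, 0.55, 0.60, 0.80, 0.85, 0.90]
enantioselectivity = [1.0, 1.1, 1.0, 1.8, 2.0, 1.9]

threshold, low_group_mean, high_group_mean = best_regression_split(
    sphericity, enantioselectivity)
print(threshold, low_group_mean, high_group_mean)
```

Each end branch then reports the mean response of its group, which is what the color scale of a figure such as Figure 2 summarizes.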
Another example of predictive classification trees to predict enantioseparation capability can be found in the works of Del Rio and Gasteiger (2).
Other examples of decision trees, applied to the selection of orthogonal chromatographic systems (3) or to the prediction of the sensory attributes of olive oil samples (4), may be found elsewhere. In the latter example, the decision tree was generated based on two previous multivariate analysis methods, soft independent modeling of class analogy (SIMCA) and partial least squares (PLS) regression, which served for both classification and quantitation purposes. This decision tree could then classify unknown olive oil samples according to their sensory attributes based on the volatile species identified with headspace–mass spectrometry.
Naturally, there is more to decision trees than is explained here. For predictive purposes, validation of the model with adequate statistics would be required. Also, constructing a multitude of decision trees is the basis of the random forest methods used in machine learning.
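As one illustration of what such validation can look like, the sketch below applies leave-one-out cross-validation to a one-node tree: each object is set aside in turn, the node is recalculated on the remaining objects, and the set-aside object is then predicted. Leave-one-out is only one of several possible schemes, chosen here because it suits small data sets. The data and function names are hypothetical.

```python
def fit_stump(x, y):
    """Fit a one-node tree that predicts 'yes' when x > threshold.

    Returns the candidate threshold with the fewest training errors.
    """
    ordered = sorted(set(x))
    midpoints = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    # Also allow "always yes" and "always no" thresholds outside the data.
    candidates = [ordered[0] - 1.0] + midpoints + [ordered[-1] + 1.0]

    def errors(t):
        return sum(("yes" if xi > t else "no") != yi for xi, yi in zip(x, y))

    return min(candidates, key=errors)


def leave_one_out_accuracy(x, y):
    """Fraction of objects predicted correctly when each is held out once."""
    hits = 0
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        t = fit_stump(x_train, y_train)
        prediction = "yes" if x[i] > t else "no"
        hits += prediction == y[i]
    return hits / len(x)


# Hypothetical scores and answers, invented for illustration.
scores = [3, 4, 6, 7, 9, 9.5, 10, 10]
answers = ["no", "no", "no", "no", "no", "yes", "yes", "yes"]

print(fit_stump(scores, answers))               # threshold from all data
print(leave_one_out_accuracy(scores, answers))  # cross-validated accuracy
```

On these invented data the tree classifies every training object correctly, yet the cross-validated accuracy is lower, because the objects closest to the threshold are mispredicted when they are left out. This gap is the reason a model that merely describes the data should not be trusted for prediction without validation.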
(1) C. West, M.-L. Konjaria, N. Shashviashvili, E. Lemasson, P. Bonnet, R. Kakava, A. Volonterio, and B. Chankvetadze, J. Chromatogr. A 1499, 174–182 (2017).
(2) A. Del Rio and J. Gasteiger, J. Chromatogr. A 1185, 49–58 (2008).
(3) R. Put, E. Van Gyseghem, D. Coomans, and Y. Vander Heyden, J. Chromatogr. A 1096, 187–198 (2006).
(4) S. López-Feria, S. Cárdenas, J.A. García-Mesa, and M. Valcárcel, J. Chromatogr. A 1188, 308–313 (2008).
Caroline West is an Associate Professor of analytical chemistry at the University of Orleans. Her scientific interests lie in the fundamentals of chromatographic selectivity, both in the achiral and chiral modes, mainly in SFC but also in HPLC. In 2015, she received the LCGC award for "Emerging Leader in Chromatography". Direct correspondence to caroline.west@univ-orleans.fr.