I’ve been working with chemical instrumentation for a while. I always joke with my family that I enjoy “tinkering” with my instruments the way someone else might enjoy working on their car in a garage (if you like that kind of thing). The primary research tool in my laboratory is comprehensive two-dimensional gas chromatography with time-of-flight mass spectrometry (GC×GC-TOFMS). We predominantly generate highly complex datasets of more than 50 samples at a time. The individual files are large, and each constitutes just a small piece of the information network contained in the even larger batch dataset. It will come as no surprise that computers play a key role not only in the operation of these sophisticated tools, but also in the challenging task of extracting meaningful interpretations from the sea of information.
I recently moved institutions, and as I was setting up my laboratory, I decided to investigate options to make some of our computing workflows easier, with fewer headaches, and ideally to make the work as “futureproof” as possible. Multidimensional chromatography software has come a long way in the past 10 years, but pairing a new instrument with an ill-suited data workstation can cause tragic bottlenecks. I started to reframe the acquisition of a powerful offline data workstation as the acquisition of an instrument. Now, I am no longer surprised at the budget I must invest in this part of our research workflow–it is just as critical to our research success as the chemical instrumentation itself.
In acquiring new software with our instrument, I started reading about the minimum specifications for its operation and knew I would need something far beyond those recommendations for the types of datasets I deal with. I asked for further guidance on suggested specifications, but this type of information doesn’t really exist, since every user is so different. I wanted to avoid a “guess-and-check” approach, since that had previously landed me in iterative purchases and a lot of lost productivity. So, I decided to do what any good scientist in search of information might do: phone a friend.
Luckily, I have built an extensive network of colleagues around the world doing similar work. This is, of course, only possible through participation in scientific conferences, collaborative research projects, and professional committees–all things I highly recommend if you want to increase your “phone a friend” opportunities. In my case, I reached out to a few different people, all of whom pointed me to someone they had spoken to about similar questions–Prof. James Harynuk from the University of Alberta. Here, we’ve documented some of our conversations to highlight things you may want to consider when setting up a dedicated data workstation specifically for GC×GC-TOFMS research. I like to think that much of this also applies to many nontargeted chemical analyses and to laboratories dealing with large amounts of GC-MS data.
Q: What are the main components to consider when building a data workflow?
A: The main things to consider include (1) how you get your data from your instrument computer to your processing computer; (2) how much storage space you’ll need to accommodate your data both for current projects and for archiving older data; and (3) how you’re going to be processing your data to generate results. This last point includes both the software and the hardware for the processing computer.
Q: Where are the worst bottlenecks in this workflow?
A: The two worst bottlenecks in data workflows are migrating the data off the instrument computer and setting up the data processing computer. If you get either of those wrong, working with your data becomes a horrendous chore at best and a complete nightmare at worst.
Q: Which hardware do you suggest putting on the list when setting up the data workflow?
A: The first thing you need is some kind of network-attached storage (NAS) device, and I always get a decent network switch with enough ports to connect the NAS and all of the instrument and data processing computers. Keeping these computers and their data off my campus network isolates and protects them from the outside, and it means I’m not relying on the campus network to move gigabytes of data quickly and reliably. The computers all have secondary (or tertiary) network cards so they can connect to the outside world when needed.
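As a quick sanity check on that kind of two-network setup, a short script can confirm which interfaces are up and what link speed each one negotiated. Here is a minimal sketch in Python using the third-party psutil library; the interface names it prints depend on your machines, and this is an illustration rather than part of the setup described above:

```python
# Minimal sketch (not from the original setup): list each network interface,
# whether it is up, and its negotiated link speed, so you can confirm the
# instrument-network NIC and the outward-facing NIC are both running at the
# speeds you expect.
import psutil

for name, stats in psutil.net_if_stats().items():
    status = "up" if stats.isup else "down"
    # stats.speed is the negotiated link speed in Mbps (0 if unknown)
    print(f"{name:25s} {status:5s} {stats.speed:6d} Mbps  MTU {stats.mtu}")
```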
Q: How do you protect your data from being lost?
A: Network storage is key. Our instrument computers immediately dump acquired data to our NAS, which has a pair of partitions. The partition where data from the instruments lands is write-only, with one folder per instrument. It stays organized, and nobody other than the manager can delete data, so it is safe. People processing data can copy the files to local hard drives or to folders on the other NAS partition to work them up, but the original data stays safe. Then, you just need enough storage for the samples you’ll be collecting for active projects. In our experience, you need about 600–700 MB of space per hour of 200 Hz data acquired from a LECO system (Peg IV, BT, or HRT), and about 4 GB of space per hour of 100 Hz data from a Markes GC×GC BenchTOF system. We have an 8 × 18 TB RAID 5 array of disks on our NAS, giving us 126 TB of usable space, and the data is safe even if one drive fails. We are presently looking at setting up a larger off-site archive/backup of our NAS, just in case the NAS itself fails. A key feature of your NAS is that it should have enough onboard RAM to cache a few data files at once. That way, if multiple files are being read from and/or written to it simultaneously, it can minimize bottlenecks and maximize performance. Ours has 4 GB and seems fine so far.
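To put numbers on that sizing exercise, here is a back-of-the-envelope sketch in Python. The per-hour figures are the ones quoted above; the instrument labels and the 200-sample project are hypothetical placeholders, not our actual projects:

```python
# Back-of-the-envelope storage estimate (sketch). The GB-per-hour figures come
# from the text above; everything else here is a made-up example.
GB_PER_HOUR = {
    "LECO_200Hz": 0.7,             # ~600-700 MB per hour of 200 Hz data
    "Markes_BenchTOF_100Hz": 4.0,  # ~4 GB per hour of 100 Hz data
}

def project_storage_gb(instrument: str, hours_per_sample: float, n_samples: int) -> float:
    """Estimate raw-data storage (GB) for one project."""
    return GB_PER_HOUR[instrument] * hours_per_sample * n_samples

# Example: 200 one-hour LECO runs vs. the usable space of an 8 x 18 TB RAID 5 array
needed_gb = project_storage_gb("LECO_200Hz", hours_per_sample=1.0, n_samples=200)
usable_tb = (8 - 1) * 18           # RAID 5 gives up one drive's worth of capacity
print(f"Project needs ~{needed_gb:.0f} GB; the NAS offers ~{usable_tb} TB usable")
```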
Q: What workstation specs do I need to be concerned about?
A: It depends a bit on the software that you’re using for processing your data. You for sure want the fastest processor with the most cores that you can afford. We’ve had good luck with both high-end AMD Threadripper PRO and Intel i9 chips, so which one is best likely depends on who wins for cost vs performance on the day you buy your processor. The bottleneck with ChromaTOF 5 seems to be processing the data and doing all the deconvolution. It appears LECO has parallelized some of these tasks, as all the cores on our CPU get maxed out when processing.
From a RAM perspective, in our experience with ChromaTOF 5 (BT and HRT) and GCImage, neither one asks for much more than about 25–30 GB of RAM. So, you’re likely fine with 64 GB, or more if you plan on having other software open at the same time or want headroom for future features that may demand more memory. One recommendation our computer technician gave us was that if we wanted maximum performance, we should have a stick of memory in each slot on the motherboard.
Looking at the GPU, it seems as though it is used for rendering graphics and driving the multiple monitors you’re bound to have, but in our testing, it basically sits idle while data is being processed. You want one that meets or exceeds the minimum specs for the software so that you can render the graphics quickly, but a bigger GPU likely won’t help much with speeding up data processing with current commercial software offerings.
Finally, there is disk space. One of the big bottlenecks we were surprised by (especially with GCImage) was the impact of drive read/write speeds. Classic 7200 RPM hard drives are very slow. So slow, in fact, that they cripple your data processing speed. So, you likely don’t want a giant spinning drive in your computer. A 1 or 2 TB SSD to hold your operating system and your programs is just fine (an M.2 NVMe SSD is preferred, as they’re a lot faster than SATA SSDs). For your “big storage” and the folder holding the data you’re actively processing, you’re better off using your NAS, as it and the Ethernet connection will give faster read/write speeds than a local SATA HDD.
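If you want to see this effect on your own hardware before committing to a configuration, a crude sequential read/write timing test is easy to run. Below is a minimal Python sketch; the paths are hypothetical placeholders, and the read figure can be flattered by the operating system’s file cache, so treat the numbers as rough comparisons only:

```python
# Crude sequential throughput test (sketch, not from the interview): write and
# read back a ~1 GB file at each candidate location and report MB/s. Paths are
# placeholders; point them at your local SSD, HDD, and NAS share.
import os
import time

def throughput_mb_s(path, size_mb=1024, chunk_mb=16):
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    test_file = os.path.join(path, "io_test.bin")

    t0 = time.perf_counter()
    with open(test_file, "wb") as f:
        for _ in range(size_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())            # force the data onto the disk/NAS
    write_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    with open(test_file, "rb") as f:
        while f.read(chunk_mb * 1024 * 1024):
            pass
    read_s = time.perf_counter() - t0

    os.remove(test_file)
    return size_mb / write_s, size_mb / read_s

for label, path in [("local NVMe SSD", r"D:\scratch"),
                    ("NAS share", r"\\nas\processing")]:
    write_speed, read_speed = throughput_mb_s(path)
    print(f"{label:15s} write {write_speed:6.0f} MB/s   read {read_speed:6.0f} MB/s")
```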
Q: If I have a little extra room in my budget, what should I spend it on to get the biggest return on my investment for data processing?
A: If your motherboard can accommodate it, a card that lets you set up a big, fast RAID array of M.2 NVMe drives is a nice touch. Our data processing computers came with this option, and we set up a 4 × 4 TB RAID 0 array (optimized for maximum space and speed). This made a noticeable difference in startup/database connection time with ChromaTOF 5 and a huge difference with GCImage. With the latter, processing a series of 3800 chromatograms would have taken about 15 days reading and writing from our local SATA HDD; with the local NVMe RAID array, the same job took about 2.3 days. One caveat with this setup is that if one drive dies, all data on the array is lost. As such, you want to have the data backed up somewhere else when you’re done working with it.
If your motherboard cannot accommodate an internal NVMe RAID, external drive housings that take 4 × M.2 NVMe drives are pretty inexpensive. As long as the housing has a Thunderbolt 3 or 4 port and your computer does as well, you should get pretty good read/write performance, but we haven’t tested this yet.
Q: What about longevity of the hardware I purchase – warranties, maintenance, budgeting for replacement devices, etc.?
A: Our local computer shop offers up to 4-year warranties on devices, and if something fails, they’ll give a straight replacement with an identical or better compatible part. No hassle, super quick. We do that with all our storage drives and data processing computers; it adds an extra 15% or so to the cost, but when something dies, it’s worth it. Many institutions will also require some kind of warranty/service plan on electronic purchases, so check the policy of your workplace as well.
Q: Any other advice?
A: If you’re using software other than what I’ve mentioned, watching your computer during processing can really help you figure out where your bottlenecks are. If the software has a way for you to monitor the data processing progress, watch how long it takes to do different tasks and watch what’s happening in Task Manager and Resource Monitor. These few things can let you pinpoint where any bottlenecks are.
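If staring at Task Manager gets tedious, the same information can be logged automatically and reviewed afterward. Below is a minimal Python sketch using the psutil library (the sampling interval and output file name are arbitrary choices, and this is an illustration rather than something from our own workflow); start it before a processing run, stop it with Ctrl+C when the run finishes, then look for whichever resource sat at its ceiling:

```python
# Minimal resource logger (sketch): sample CPU, RAM, and disk activity while
# your processing software runs, writing one CSV row per interval. Whichever
# column is pinned at its maximum is your likely bottleneck.
import csv
import time
import psutil

INTERVAL_S = 5  # seconds between samples (arbitrary choice)

with open("processing_monitor.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["elapsed_s", "cpu_percent", "ram_used_gb",
                     "disk_read_mb", "disk_write_mb"])
    start = time.time()
    last_io = psutil.disk_io_counters()
    while True:
        cpu = psutil.cpu_percent(interval=INTERVAL_S)  # averaged over the interval
        ram_gb = psutil.virtual_memory().used / 1e9
        io = psutil.disk_io_counters()
        read_mb = (io.read_bytes - last_io.read_bytes) / 1e6
        write_mb = (io.write_bytes - last_io.write_bytes) / 1e6
        last_io = io
        writer.writerow([round(time.time() - start), cpu,
                         round(ram_gb, 1), round(read_mb), round(write_mb)])
        f.flush()
```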
As our chemical instrumentation produces ever more complex data, choosing a suitable workstation and data workflow for your lab is becoming more and more important. If you’re in the fortunate position of redesigning your computational workflow in the laboratory, or perhaps have an upcoming grant where you can budget for some of these important items, our hope is that we’ve provided you with some new tools to think about during that process. Don’t underestimate the power of your networking when tackling some of these complex issues – phone your nearest friend (or reach out to an expert), and hopefully you’ll find some good tips to apply in your work.