The great weakness of artificial intelligence (AI) approaches to technical automation and creation is the quality of the learning data sets.
The internet is generally the worst place for AI input because there is no vetting of the data there. For artistic and creative endeavors, this weakness is of little consequence and can even lead to very interesting, unexpected outcomes.
In contrast, the amazing success of AlphaFold (1)—in which structures were predicted for roughly 200 million proteins—can be traced to the high-quality input data in the form of available protein structures, along with the linearity of the amino acid sequences. As we move into more complex, real-world environments, the criticality of high-quality information—versus junk or noise—must be understood if we are going to have successful AI models.
How do we quantify and discuss these learning sets?
Well, we start with some terribly simple statements like, “There exists a true ‘signal’.” That signal might be the identity of a protein, the concentration of a biomarker, or a distribution of something, such as in magnetic resonance imaging (MRI). In addition to these examples from biology and medicine, this concept applies equally to fields like monitoring the environment or looking for extraterrestrial life in the solar system.
That signal must contain “important information” (more on that in a moment), be transduced, “transmitted,” and then “read” by a receiver. The signal can be transduced by a sensor that responds to a single physical property (such as a refractive index detector) or by information-rich detectors like mass spectrometers.
The “important information” is the identity and distribution of atoms, molecules, and physical structures throughout time and space. And that information must have some relevance—a clinical decision (you have an infection), an emergency (a carbon monoxide sensor triggering), or a life-changing event (you are pregnant).
What will AI want as input?
Everything.
The trick is to get high-quality information, and a lot of it.
To uncover this information, nearly all transducers need nice, clean samples within a limited concentration range. That is where separation science shines. It is already an integral part of many systems, but its impact will only grow.
How much “important information” is out there? How much of it are we already measuring, and how much can we measure? Using information theory (2), a sensor for a single analyte, such as blood glucose, produces a few bits of information. One-dimensional separation science in analytical (linear) modes can produce, say, 10³–10⁴ bits. Mass spectrometry, NMR, cryo-electron microscopy (cryo-EM), and other high-information-content detectors can generate ~10⁴–10⁶ bits. Compare these numbers to a rough estimate of the information carried in a drop of blood: It approaches 10²² bits (coincidentally also the number of stars in the known universe). We’ve got a lot of wiggle room to grow.
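As a rough, back-of-the-envelope sketch of where numbers like these come from (the peak capacity and dynamic range below are hypothetical, not taken from this column), one can treat each resolvable peak position as a channel whose amplitude can be distinguished into a signal-to-noise-limited number of levels:

```python
import math

def separation_bits(peak_capacity: int, dynamic_range: float) -> float:
    """Rough information content (bits) of one separation dimension.

    Assumes each of `peak_capacity` resolvable peak positions carries an
    amplitude reading distinguishable into `dynamic_range` levels
    (signal-to-noise limited), contributing log2(dynamic_range) bits each.
    """
    return peak_capacity * math.log2(dynamic_range)

# Hypothetical 1D separation: ~500 resolvable peaks and ~10^4 distinguishable
# amplitude levels give a few thousand bits, consistent with the
# 10^3-10^4 bit range quoted above.
print(f"{separation_bits(500, 1e4):,.0f} bits")  # ~6,644 bits
```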
The true power of separation comes from its preparative mode: It can amplify the information that can be gathered from sensors and detectors, multiplying it by several orders of magnitude—especially for multidimensional separations. You can already see 10⁸–10⁹ bits popping out of comprehensive two-dimensional gas chromatography (GC×GC), size-exclusion chromatography × high performance liquid chromatography × capillary zone electrophoresis–mass spectrometry (SEC×HPLC×CZE-MS), and so on. For gradient techniques, the numbers can grow even more.
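To see how the multiplication works, here is a small extension of the same sketch: in a comprehensive multidimensional separation, the peak capacities of (ideally orthogonal) dimensions multiply, and each resolvable spot can be paired with an information-rich detector. The figures are again hypothetical.

```python
import math

def multidim_bits(peak_capacities: list[int], bits_per_spot: float) -> float:
    """Rough information content (bits) of a comprehensive multidimensional
    separation: the peak capacities of orthogonal dimensions multiply, and
    each resolvable spot contributes `bits_per_spot` of detector
    information (for example, a full mass spectrum).
    """
    return math.prod(peak_capacities) * bits_per_spot

# Hypothetical GCxGC-MS: 500 x 50 resolvable spots and ~10^4 bits per mass
# spectrum give ~2.5 x 10^8 bits, in the 10^8-10^9 range mentioned above.
print(f"{multidim_bits([500, 50], 1e4):.2e} bits")
```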
There is a reasonable argument to be made that 10²² bits is attainable, although I am not smart enough to know what that would look like. But if AI is used to query these data sets and the data are of high quality and relevant to the questions being posed, I suspect some amazing things can happen across many fields.
So, what is the “AI Granddaddy”? The Phoenix metropolitan area where I live has a weekly gathering of entrepreneurs largely focused on biosciences, and a young man there calls himself “AI Daddy” (@aidaddy.io). We’ve had extensive discussions about the future of healthcare and the potential impact of AI. What would an app look like that could diagnose and treat disease, writ large? In these conversations, I always circle back to the quality of the data sets and to objectively determining the impact of the outputs of these approaches.
Thinking about all of this has leaked back into my day job. I truly think that separation science will be the “force multiplier” that uncovers increasingly large amounts of information from complex samples, from blood to environmental samples to the search for life in the solar system.
Separation science will be the mechanism that can scale with the inputs needed for data-thirsty AI. I am a lot older than he is, so I decided to call myself “AI Granddaddy”—although you won’t find me on TikTok anytime soon.
1) DeepMind. AlphaFold Reveals the Structure of the Protein Universe. https://www.deepmind.com/blog/alphafold-reveals-the-structure-of-the-protein-universe (accessed 2023-03-15).
2) Brillouin, L. Science and Information Theory, 2nd ed.; Academic Press: New York and London, 1962.
3) Huber, J. F. K.; Smit, H. C. Information Flow and Automatic Data Processing in Chromatography. Z. Anal. Chem. 1969, 245 (1–2), 84–88.
This blog is a collaboration between LCGC and the American Chemical Society Analytical Division Subdivision on Chromatography and Separations Chemistry (ACS AD SCSC).
For more information about the subdivision, or to get involved, please visit https://acsanalytical.org/subdivisions/separations/.