ML vs Wet Lab: The Great Impedance Mismatch

In my last couple of posts, I discussed how different mental models between digital and wet lab teams cause friction in the flow of information across biotech organizations, particularly for so-called tech biotechs that rely on extensive collaboration between the two. In this post, I will explore how this mismatch affects the second leg of the information cycle between the two teams: deciding which questions or problems to try to answer, and how they fit together into a research pipeline.

Rather than dive straight into the differences between the mental models, I want to start by understanding the contexts in which these mental models were formed, and how this caused the models to evolve so differently. The main difference I’ll focus on is the cost and availability of data. This is closely related to an idea I wrote about a while ago that I call the Experiment Cost Inflection Point.

If these ideas resonate with your own experience, please let me know in the comments below. Also consider signing up for my weekly newsletter where I send out short ideas to get you thinking about this in your day-to-day, and announce upcoming blog posts that will go deeper into this topic.

Data Scarcity and Biological Context

From the beginning, biology evolved in a context in which broadly applicable data was not readily available. In physics, and to a large extent chemistry, a small number of experiments can lead to universal insights (literally). Biology, on the other hand, is so heterogeneous that for the most part, data can only be collected in a very narrowly scoped context. It’s often difficult to tell which aspects of the information you gathered are applicable more widely, and what is only true in this narrow scope.

Even when universal theories like evolution by natural selection and the Central Dogma of molecular biology uncover universal truths, they still leave plenty of heterogeneity. Each species must be observed independently. Each gene requires a body of experiments and papers. That’s why biology has historically had a reputation for focusing on observational studies rather than fundamental principles.

Fancy new instruments allow us to collect more data, but for the most part they don’t expand the scope in which the data is relevant. It’s still about a single sample or cell line, in a single set of experimental conditions. We get more data points (and increased cost), but it’s often still unclear which aspects are relevant in a broader context.

Investing in a large dataset for general, to-be-determined future analysis can lead to a huge payoff. But the more limited the scope in which you can collect data, the more expensive it is to get to the payoff. And for biological data, expanding the scope gets expensive much faster than increasing the volume within the existing scope. So the investment needed to get to that payoff can end up being much higher than you initially thought.

Data Abundance and the ML Revolution

The type of machine learning that most people think of when they use the term today only became feasible in the last decade or so, as the advent of “Big Data” made larger and larger datasets available. Thanks to the internet, ubiquitous devices and infrastructure for managing data at scale, ML in the last decade has benefited from seemingly unlimited sources of new and interesting data. Moreover, this data is typically available in a form that addresses a very broad scope.

Besides technical issues of scale, questions that come up in the current context of ML tend to involve matching the available data to useful questions, and identifying potential bias based on how the data was collected. So while the scope of the data is typically wider, analysts often expect to have less control over what data is collected. The quintessential example is ImageNet, a database of millions of annotated images that was released in 2009. Any data scientist can download and use this dataset for free, but they don’t get to decide what the images and labels are.

Mathematical Models

To understand how these two contexts – biology and machine learning – translate into different approaches to problem definition and analysis, we first need to understand where they overlap. And for this, we need a mental model for mathematical models.

A mathematical model is a way of encoding a mental model into a precise form that can be communicated to other people or to a computer for them to evaluate. For example, a common mental model is “Your height is determined by the average of your parents’ heights plus some unknown smaller factors.” If you translate this into a mathematical model, you get linear regression. Many mental models are too complex/subtle/subconscious to translate into mathematical models. We’re interested in the others.
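To make the height example concrete, here is a minimal sketch of that mental model written as a linear regression. The data is synthetic and the numbers (average heights, noise level) are my own assumptions for illustration, not real measurements. The name predict_height is hypothetical.

```python
import numpy as np

# Synthetic, illustrative data only: heights in cm, not real measurements.
rng = np.random.default_rng(0)
n_families = 200
mother_cm = rng.normal(163.0, 6.0, size=n_families)
father_cm = rng.normal(176.0, 7.0, size=n_families)
midparent_cm = (mother_cm + father_cm) / 2.0

# The "unknown smaller factors" from the mental model become a noise term.
child_cm = midparent_cm + rng.normal(0.0, 3.0, size=n_families)

# Baked-in assumption: child height is a straight-line function of the
# parents' average height.
# Learned from data: the slope and intercept of that line.
slope, intercept = np.polyfit(midparent_cm, child_cm, deg=1)


def predict_height(mother_height_cm, father_height_cm):
    """Predict a child's height from the fitted line."""
    midparent = (mother_height_cm + father_height_cm) / 2.0
    return slope * midparent + intercept


print(f"slope={slope:.2f}, intercept={intercept:.1f}")
print(f"predicted height: {predict_height(160.0, 180.0):.1f} cm")
```

Notice how little the model is allowed to learn: just two numbers. Everything else, including the idea that only the parents' average matters, was decided before any data arrived.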

All mathematical models have certain assumptions built into them. Many also leave flexibility to adjust the model in response to data. This is the “learning” in machine learning, but it’s also a component of plenty of non-ML models. Once you’ve decided what the inputs and outputs of a model are, the rest is defined by what assumptions you want to bake into it, and what you want the model to learn from data.

Baking or Overfitting: Choose your bias

So, how do you decide what assumptions to bake into your model, and what to learn? To some extent this is defined by what information is “in” the data – a subtle, if not philosophical, question. In practice, it depends greatly on how much data you have.

For a toddler, every four-legged, furry animal is a dog, until they see enough cats to realize it also needs to bark and wag its tail. That’s because their mental model is given a great deal of flexibility to learn from data. For a mathematical model, you can narrow down that flexibility, to ensure that the diversity in the available examples is enough to really learn the answer. If your dataset only contains dogs, or only contains large dogs and small cats, you’ll need to bake some distinctions into the model rather than learn them.
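Here is a small sketch of the "large dogs and small cats" problem. The weights are made up, and the "model" is deliberately simple: it is only allowed to learn a single weight cutoff. All of the names are hypothetical.

```python
import numpy as np

# Hypothetical body weights in kg. The training set is confounded:
# every dog in it is large and every cat in it is small.
train_weight_kg = np.array([30.0, 35.0, 28.0, 40.0, 4.0, 3.5, 5.0, 4.5])
train_is_dog = np.array([True, True, True, True, False, False, False, False])

# "Learning": put the cutoff halfway between the heaviest cat
# and the lightest dog seen in training.
heaviest_cat = train_weight_kg[~train_is_dog].max()
lightest_dog = train_weight_kg[train_is_dog].min()
threshold_kg = (heaviest_cat + lightest_dog) / 2.0


def predict_is_dog(weight_kg):
    """Classify as a dog whenever the animal is heavier than the cutoff."""
    return weight_kg > threshold_kg


# The rule looks perfect on the data it was learned from...
train_accuracy = np.mean(predict_is_dog(train_weight_kg) == train_is_dog)

# ...but a small dog, which the training set never contained, is misclassified.
chihuahua_prediction = predict_is_dog(2.5)

print(f"cutoff: {threshold_kg} kg, training accuracy: {train_accuracy}")
print(f"2.5 kg dog classified as dog? {chihuahua_prediction}")
```

Nothing in this dataset can tell the model that size is the wrong cue. That distinction has to come from somewhere else, which is what baking it in means.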

Overfitting is the term in machine learning for when you make a model too flexible for the available data. The idea is that the model does such a good job learning the patterns in the data that it ends up learning artifacts and noise such as “cats are small, dogs are big”. One way to avoid this is to add assumptions to the model that aren’t learnable from the data, which reduces its overall flexibility. Classical statistics tends to put less emphasis on the term, because its models are deliberately designed to bake in enough assumptions to avoid the issue.
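A quick way to see overfitting in action is to fit the same ten noisy points with a rigid model and a flexible one. This is a sketch on synthetic data. The straight-line "truth", the noise level and the polynomial degrees are all assumptions I picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)


def true_signal(x):
    """The pattern we hope to recover (an assumption of this toy example)."""
    return 2.0 * x + 1.0


# Ten noisy training points: a small dataset.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = true_signal(x_train) + rng.normal(0.0, 0.3, size=x_train.size)

# A dense grid of noise-free points to measure distance from the truth.
x_test = np.linspace(-1.0, 1.0, 201)
y_test = true_signal(x_test)


def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse


# Degree 1 bakes in "the relationship is a straight line".
rigid_train, rigid_test = fit_and_score(1)
# Degree 9 has ten free coefficients, enough to hit every training point.
flexible_train, flexible_test = fit_and_score(9)

print(f"rigid:    train={rigid_train:.4f}  test={rigid_test:.4f}")
print(f"flexible: train={flexible_train:.2e}  test={flexible_test:.4f}")
```

The flexible model reproduces the training points almost exactly, noise included, and pays for it away from those points. The rigid model can't chase the noise because the straight-line assumption doesn't leave it room to.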

Baking more assumptions into your model reduces the potential for overfitting, but it comes with its own risks. Every assumption is a simplification. Every assumption closes off an opportunity to find a pattern that you haven’t noticed yet. Every assumption could be wrong if you aren’t careful. So baking in assumptions and leaving flexibility both come with risks of different types of bias.

Because the trade-off depends on the type of data that’s available, biology and machine learning developed very different approaches to building models, and evolved very different shared mental models in the process.

The Scientific Method

In the sparse-data context in which biology evolved, you have no choice but to bake lots of assumptions into your models. Where do you get these assumptions from? That’s the point of the scientific method.

The scientific method is an iterative process in which you add assumptions to each successive model. You start with a narrowly scoped question and gather just enough data for it to “learn” one or two assumptions. Then you make a new model that adds those assumptions into the mix and do it again.

What you end up with is a sequence of models, each feeding into the next and adding up to the larger problem. You also get a mental model, both for each step and for how they fit together. So you can inspect the intermediate results from the individual models to sanity check the final answer. In other words, you can easily interpret the overall mathematical model because it relies heavily on the assumptions that also power your mental model.

Machine Learning

From the earliest days of computing in the 1950s, the field of Artificial Intelligence explored a range of approaches from models that were completely defined by baked-in rules to models that were almost entirely trained from data. Machine Learning came to refer to the models that involved any level of learning, including models with a mix of baked-in rules and flexibility to learn.

For most of its history, the field was dominated by rule-heavy models because there just wasn’t enough data to avoid overfitting with rule-lite models. These models weren’t fundamentally very different from what you’d see in a field like biology. They were just being used for different types of problems. A few people decided to ignore the lack of data and look for ways to remove baked-in rules, particularly with a type of model called a neural network. But the rule-heavy models continued to do better on benchmarks, so that’s where the field focused.

Then, along came ImageNet in 2009. Within a few years, image recognition research shifted from an obsession with baked-in rules about things like image segmentation and anatomy to an obsession with convolutional neural networks where the only baked-in rules are about translational symmetry. (A cat in the upper left of the picture is the same as a cat in the middle.) By the end of the decade, as massive datasets in other domains became available, the rest of the ML field followed suit.
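The translational symmetry rule is easy to demonstrate on a toy example. In this sketch, a one-dimensional signal stands in for an image and the same small filter is applied at every position, which is the baked-in part of a convolutional layer. The filter values here are fixed by hand for illustration (in a real network they would be learned), and the wrap-around at the edges is a simplification I chose to keep the math exact.

```python
import numpy as np


def circular_conv1d(signal, kernel):
    """Apply the same kernel at every position, wrapping around at the edges."""
    n = signal.size
    out = np.zeros(n)
    for i in range(n):
        for j, weight in enumerate(kernel):
            out[i] += weight * signal[(i + j) % n]
    return out


# A bump near the left of the signal plays the role of the cat.
signal = np.array([0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0])
kernel = np.array([1.0, -2.0, 1.0])  # hand-picked stand-in for a learned filter

# Move the "cat" three positions to the right.
shift = 3
shifted_signal = np.roll(signal, shift)

response = circular_conv1d(signal, kernel)
shifted_response = circular_conv1d(shifted_signal, kernel)

# Shifting the input just shifts the response by the same amount...
print(np.allclose(shifted_response, np.roll(response, shift)))
# ...so the strongest response is the same wherever the cat sits.
print(response.max() == shifted_response.max())
```

The model never has to see a cat in every possible position to learn this. It holds by construction, which is exactly what a baked-in rule buys you.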

In the presence of these massive datasets, machine learning researchers were able to remove the simplifications and potential bias of baked-in rules, while minimizing the potential bias from overfitting. But those baked-in rules were the only thing keeping the mathematical models aligned with their mental models, so they also lost the automatic interpretability.

You’d think that spending decades obsessed with rule-heavy models and just recently starting to care about rule-lite models would lead to a shared mental model that emphasizes both. But that would be missing a key element: shortly after this shift in emphasis, the advent of data science created a surge in the number of people studying machine learning. Most of the ML experts you meet today only became experts after this shift.

Because the field shifted so completely into the rule-lite world of neural networks, then rapidly grew in this context, the field of machine learning has come to think of itself as primarily focused on rule-lite models, mostly neural networks. And its shared mental models reflect this.

Mental Models for Mathematical Models

A core component of a task mental model is an understanding of the tools available to solve a problem and their cost/benefit analysis – the criteria for when you would choose one over another. We’re going to look at how this plays out for two types of tools: data and models.

A wet lab biologist will be very familiar with the available data sources in the lab, including all the intricacies of how reliable each one is and the cost to acquire it, in terms of both time and money. For the cost/benefit analysis, they’ll tend to focus on immediate, concrete use cases, and because these tend to be narrow in scope, they’ll underestimate the benefit and over-emphasize the costs.

A typical data scientist, on the other hand, will have a much more limited understanding of the available data sources, and their model of available data will focus more on what has already been or is being collected than on what could be collected if they asked the right people. Their cost/benefit analysis, however, will include potential future use cases, and thus tend to over-emphasize the potential benefit.

When it comes to available models, a wet lab biologist’s list will tend towards the kinds of sequentially stacked, rule-heavy models that come out of the scientific method, while a data scientist will default to rule-lite models like neural networks. There’s a gap in between these, in which you have more baked-in rules than the rule-lite models, but more flexibility to learn than the rule-heavy models.

Because modern biological data is more readily available than it was a decade or two ago, you might think the rule-lite models are the way to go. But there’s a catch: The ways in which data is more readily available are… complicated. Even though you can generate massive datasets through sequencing and imaging and a number of other modalities, they often have a limited scope in subtle ways, such as a particular cell line, or an assay that only replicates certain aspects of a desired environment.

So the ideal model will probably have some baked-in assumptions to account for the limited scope, and some flexibility to learn from the large volume of data. These are exactly the models in the gap between what the digital and wet lab teams are typically familiar with.
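One simple way to picture a model in this gap is a regression that starts from a prior guess and is penalized for straying from it. This is just one possible sketch, not a recommendation for any particular dataset. The function fit_with_prior, the trust_in_prior knob and all of the numbers are hypothetical, and the data is synthetic.

```python
import numpy as np


def fit_with_prior(X, y, prior_weights, trust_in_prior):
    """Least squares with a penalty for straying from prior_weights.

    Minimizes ||X w - y||^2 + trust_in_prior * ||w - prior_weights||^2.
    Setting the gradient to zero gives the linear system solved below.
    """
    n_features = X.shape[1]
    lhs = X.T @ X + trust_in_prior * np.eye(n_features)
    rhs = X.T @ y + trust_in_prior * prior_weights
    return np.linalg.solve(lhs, rhs)


# Synthetic stand-in for a small, narrowly scoped dataset.
rng = np.random.default_rng(2)
X = rng.normal(size=(8, 2))
true_weights = np.array([1.5, -0.5])
y = X @ true_weights + rng.normal(0.0, 0.1, size=8)

# Hypothetical prior knowledge, e.g. from earlier experiments.
prior_weights = np.array([1.0, 0.0])

w_learned = fit_with_prior(X, y, prior_weights, trust_in_prior=0.0)  # rule-lite
w_rules = fit_with_prior(X, y, prior_weights, trust_in_prior=1e6)    # rule-heavy
w_hybrid = fit_with_prior(X, y, prior_weights, trust_in_prior=5.0)   # the gap

print("learned only:", w_learned)
print("rules only:  ", w_rules)
print("hybrid:      ", w_hybrid)
```

With trust_in_prior at zero this is ordinary least squares, and with a very large value the data is effectively ignored. The interesting models live at the settings in between, where the prior covers what the narrow scope can't teach and the data corrects what the prior gets wrong.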

Creating Shared Mental Models

Between wet lab biologists’ and data scientists’ default mental models, there is a very lopsided understanding of data sources and a giant gap in the understanding of potential models. But this doesn’t mean all is lost. It just means you need to do the work to create better shared mental models.

Wet lab biologists and data scientists are already doing this in labs around the world by working to understand how their colleagues think differently about problems. There are also plenty of labs where this isn’t happening. It’s a time-consuming process, and it can be frustrating for team members who don’t realize just how deep these differences run. The idea and vocabulary of shared mental models provide a lens for understanding the gaps and identifying ways to speed up the process. In my upcoming posts, I’ll continue to explore how these ideas can help solve the problems described above.