Scaling a biotech research platform requires an organization to define and optimize the flow of data, which can be broken down into three questions:
1. What tasks do we need data for (and what data do we need for these tasks)?
2. How will we collect/acquire/generate this data?
3. How will we organize the data to connect collection to usage?
This post is about the third question. But even though it’s the last question you should answer, understanding the possible answers is necessary to properly address the first two questions.
In fact, this post is secretly about a concept with an intimidating name: Data Normalization.
But don’t stop reading yet. We’ll focus on the context and motivation for this concept, leaving the technical details to others.
My goal is for you to understand data normalization just well enough to make informed strategic decisions about the three questions above.
And the main takeaway is this: Data normalization helps manage organizational complexity by enabling different conceptual views of the same data for different users across an organization, e.g. one team that writes the data and another that reads it.
Consistent with the immediacy/generality trade-off, different individuals and teams will have different understandings of what a particular collection of data should look like.
For example, a biologist collecting data from a lab instrument will think in terms of process and readout: What steps did they do to prepare each sample, and what numbers came out of the instrument?
But a computational biologist or data scientist analyzing the data afterwards will likely think about it in terms of experimental conditions and variation: How did the readouts from different instruments change under a range of shared experimental conditions?
Abstractly, the information content is the same in both cases.
But their forms are very different, so optimizing for each use case requires a different approach to organizing the data.
And while this is one example, you’ll find that in many cases the people collecting data will want to do so in a different form than the people who will eventually use it.
The Elephant and the Cave
This is a specific instance of a broader theme.
There’s a well-known parable originating in the Indian subcontinent about three blind individuals trying to understand the concept of an elephant by feeling different parts of it.
The person feeling the tail ends up with a very different mental model of the elephant than the person feeling its trunk.
If the information content is the elephant, the view of the data collected by the lab scientist might be its tail while the data scientist wants to view it as a trunk.
But a slightly better analogy is Plato’s cave: If someone spent their life in a cave, only seeing the shadows of the things passing by outside, they would have a very limited (literally two-dimensional) view of the world.
Now, imagine that the lab scientist and the data scientist are in different but nearby caves, close enough that the same objects cast shadows from different angles.
When an elephant stands outside the cave facing the data scientist, they see the outline of its ears, its left legs and its right legs.
But the lab scientist in the adjacent cave sees the shadow from the side, an outline of its tail and trunk, its front legs and its back legs.
Both see the outline of the same elephant, but the way they understand it is very different.
The data scientist understands the concepts of left and right legs, but doesn’t distinguish between front and back legs. For the lab scientist, it’s the opposite.
Reconstructing the Elephant
So how do you design the structure of the data that comes from the lab and goes to the data scientist?
If you were going to put it into an Excel file, how would you choose tabs for the different concepts and column headers for the different attributes of each concept?
If you’re putting the data in a relational database, you’d have to make the same decisions, but it’s tables instead of tabs, fields instead of columns, and it’s a bit harder to change your mind once you’ve done it.
In either case, the set of tabs/tables and their columns/fields is called a schema.
And you have a few major options for how you design a schema:
- You could design tables based on how the lab scientist thinks about the data – the elephant’s shadow from the side.
- You could design tables based on the data scientist’s view – the elephant’s shadow from the front.
- Or, you could try to design a more general schema that isn’t biased by either perspective – the three-dimensional elephant.
If you design a schema based on the three-dimensional elephant rather than one of its shadows, we call that a normal form of the data. In fact, there are different levels of normal forms, based on how three-dimensional we want to get.
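To make the third option a little more concrete, here is a minimal sketch of what a normalized schema might look like for the lab example, using Python's built-in sqlite3 module. The table and column names (sample, instrument, readout, dose_um and so on) are assumptions invented for illustration, not a recommended design. The point is only that each fact lives in exactly one place, rather than being arranged around one user's shadow.

```python
import sqlite3

# Hypothetical normalized schema: sample attributes, instruments and
# readouts each get their own table, so each fact is stored exactly once.
SCHEMA = """
CREATE TABLE sample (
    sample_id   INTEGER PRIMARY KEY,
    cell_line   TEXT NOT NULL,
    compound    TEXT NOT NULL,
    dose_um     REAL NOT NULL
);
CREATE TABLE instrument (
    instrument_id INTEGER PRIMARY KEY,
    name          TEXT NOT NULL UNIQUE
);
CREATE TABLE readout (
    readout_id    INTEGER PRIMARY KEY,
    sample_id     INTEGER NOT NULL REFERENCES sample(sample_id),
    instrument_id INTEGER NOT NULL REFERENCES instrument(instrument_id),
    value         REAL NOT NULL
);
"""

def build_database():
    """Create an in-memory database containing the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

conn = build_database()
# List the tables that were created, in alphabetical order.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

Neither the lab scientist nor the data scientist would naturally lay the data out this way, which is exactly the trade-off the rest of this post is about.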
An Evolving Problem
In the early days of databases, when computers were slow and memory was expensive, schema design was highly constrained by the technical limitations of the systems.
If you built a schema representing the three-dimensional elephant, your queries would take too long and you’d run out of memory.
A schema based on one of its shadows would be simple enough to work efficiently. But then you have to pick which one.
So database engineers went to great lengths to build data structures that were close enough to the whole elephant to be accurate, but close enough to a shadow to be practical.
The star schema is one well-known approach to this.
And for the users who ended up with a view of the data that didn’t quite match their mental models, well, it wasn’t always so great.
Today, as computers get faster and memory gets cheaper, there are still cases where the volume of data is growing to match.
But there are also an increasing number of cases where the volume hasn’t grown so fast, allowing the technical constraints to melt away.
In biology, where data collection tends to be quite expensive, there are a lot of these.
For these cases, where from a technical perspective you can choose whichever view of the data you want, the factor of organizational complexity remains:
Do you choose a schema based on the shadow that makes sense to the lab scientist, the shadow that makes sense to the data scientist, or the three-dimensional elephant?
Skirting the Issue
The key to answering this question is to recognize that the way that we store the data under the hood doesn’t need to be the same as how either of the users accesses it.
In other words, we can have the lab scientist enter the data in a form that matches their view of the elephant, then transform it into a different schema to store it.
Then when the data scientist accesses it, we can again transform it to a schema matching the shadow they see.
So if the storage schema doesn’t need to match the shadow that the lab scientist sees, or the shadow that the data scientist sees, how do we pick?
This is where the three-dimensional elephant schema – the normal form – shines: It’s the most accurate view of the real world and allows us to capture the most information content.
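The two transformations described above can be sketched in a few lines of Python with the built-in sqlite3 module. Everything specific here is an assumption made up for illustration: the sample labels, compound name, instrument names and values are not real data, and the two-table storage schema is just one plausible normalized layout. The lab scientist's rows go in as one row per instrument run, get stored in normalized tables, and come back out as one row per experimental condition.

```python
import sqlite3

# Hypothetical rows as a lab scientist might enter them: one row per
# instrument run, repeating the sample details each time.
LAB_ROWS = [
    # (sample_label, compound, dose_um, instrument, value)
    ("S1", "drugA", 0.1, "plate_reader", 0.82),
    ("S1", "drugA", 0.1, "flow_cytometer", 1540.0),
    ("S2", "drugA", 1.0, "plate_reader", 0.47),
    ("S2", "drugA", 1.0, "flow_cytometer", 980.0),
]

def load(conn, rows):
    """Transform lab-entry rows into the normalized storage tables."""
    conn.executescript("""
        CREATE TABLE sample (
            label TEXT PRIMARY KEY, compound TEXT, dose_um REAL
        );
        CREATE TABLE readout (
            label TEXT REFERENCES sample(label),
            instrument TEXT,
            value REAL,
            PRIMARY KEY (label, instrument)
        );
    """)
    for label, compound, dose, instrument, value in rows:
        # Sample attributes are stored once, however many runs mention them.
        conn.execute("INSERT OR IGNORE INTO sample VALUES (?, ?, ?)",
                     (label, compound, dose))
        conn.execute("INSERT INTO readout VALUES (?, ?, ?)",
                     (label, instrument, value))

def analysis_view(conn):
    """Transform storage into one row per condition, one column per instrument."""
    query = """
        SELECT s.compound, s.dose_um,
               MAX(CASE WHEN r.instrument = 'plate_reader' THEN r.value END),
               MAX(CASE WHEN r.instrument = 'flow_cytometer' THEN r.value END)
        FROM sample s JOIN readout r ON r.label = s.label
        GROUP BY s.compound, s.dose_um
        ORDER BY s.dose_um
    """
    return conn.execute(query).fetchall()

conn = sqlite3.connect(":memory:")
load(conn, LAB_ROWS)
view = analysis_view(conn)
for row in view:
    print(row)
```

Four rows of lab entry become two sample rows and four readout rows in storage, and the data scientist gets back two rows, one per dose, with a column for each instrument. Neither user ever has to see the storage tables.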
Of course, there may be some issues in turning the lab scientist’s shadow into a three-dimensional elephant: Suddenly we need to know about left and right legs, not just front and back.
The analogy begins to break down a bit, but the point is that there are subtleties here.
This is the best approach in principle, but in practice an approximation of normal form is usually better.
If you store the data in a schema that the lab scientist understands, you maximize its immediacy for solving the problem that the lab scientist cares about.
If you use the schema that the data scientist understands, you lose some immediacy in favor of the generality required by a down-stream user.
When we introduce the normalized schema, connected to the shadow schemas by transformations, we effectively decouple them into separate processes – one optimized for immediacy and the other for generality.
But this approach pushes things along the cost vs reliability tradeoff:
It makes the overall process more reliable by ensuring that all parties are able to speak their own language.
But now experiments that involve changing the schema to iterate on the larger process are more expensive.
While database normalization is a fairly deep technical topic, it has a simple purpose: It allows organizations to manage organizational complexity by enabling different users to interact with data based on their own particular view of the world.
This post didn’t go into any of the technical detail, but there’s a clear takeaway when it comes to answering the questions of what you need data for, and where this data will come from:
Database normalization gives you flexibility to improve both immediacy and generality by decoupling the schema for how data is collected from the schema for how it will be used.
For many datasets, the requirements for this are defined more by organizational complexity than technical constraints.
And while the additional technical complexity of the transformations may push the process towards the cost end of the cost/reliability trade-off, it can make a dramatic improvement to both immediacy and generality.