In my last two posts I’ve explored the high-level design decisions related to two of the three layers that define each pipeline stage in each category of data use cases: Control and Compute. The Control layer defines how the user interacts with the system, while the Compute layer defines how the system does the work. In this post, I’ll explore the design options for the final layer, Storage, which defines how data is stored between, during, and after all the stages in a pipeline or system.
In the same way that an application can provide multiple options for the Control layer, and can split Compute between different options, an application can leverage more than one type of storage. But while most of the decisions around the Control and Compute layers are determined primarily by performance requirements and user preference, decisions in the Storage layer are largely driven by the nature of the data that’s being stored.
Below, I’ll describe the major categories of storage options, with a brief overview of each.
A local file is a file on the hard drive of the user’s laptop or desktop. This is the simplest option and typically the fastest, or at least the lowest latency, since data doesn’t need to be transferred over the network.
Local files can store structured or unstructured data. In fact, most storage options use local files under the hood, even if those local files are on a remote server. So on the one hand this gives you the most flexibility, but on the other it means your application has to define the kind of structure and functionality that other Storage options provide automatically.
Local files can only be used with a local Compute layer, unless you have the user manually upload and download files to/from a remote session or service. This option also relies on each user to manage their files and file system, including backups, security and transferring files when they get a new computer. So it can both limit your other options and create a maintenance headache.
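As a minimal sketch of how the application itself has to supply the structure that other Storage options provide automatically, here's a hypothetical example using Python's built-in csv module (the file name and columns are made up for illustration):

```python
import csv
import os
import tempfile

# A hypothetical two-column dataset stored as a plain local CSV file.
rows = [{"id": "1", "name": "alice"}, {"id": "2", "name": "bob"}]

path = os.path.join(tempfile.gettempdir(), "users.csv")

# The application, not the storage layer, decides on and enforces the schema.
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name"])
    writer.writeheader()
    writer.writerows(rows)

# Reading the data back also depends on the application knowing the format.
with open(path) as f:
    loaded = list(csv.DictReader(f))
```

Nothing stops another program from writing a file with different columns to the same path, which is exactly the kind of consistency a database would enforce for you.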
Remote files live on a remote system that allows the user’s local compute or other remote system to read and write files. It’s usually defined by a remote service that allows multiple users to connect and read/write files at the same time. This gives you the flexibility of local files while centralizing management in one place that usually takes care of things like backup and security.
Remote file systems are typically designed to handle much larger scale than a local file system, and often store multiple copies of each file on different physical drives to guard against hardware failure. These were the core of the Big Data movement, beginning with the Hadoop File System (HDFS). Today, most Cloud providers have a built-in option, such as S3 on AWS, Azure Blob Storage on Azure and Cloud Storage on GCP.
The tradeoff for the centralized maintenance is that the network traffic makes it slower (higher latency) and the implementation is more complex. Because these systems are designed for scale, they tend to have additional tradeoffs such as limited functionality compared to a local file system. For example, they typically don’t allow users to read and write arbitrary parts of a file – you have to either download the whole thing or scan through it in chunks. So the benefits of these systems mostly apply to use cases that involve large amounts of data.
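The scan-in-chunks pattern can be sketched in a few lines of Python. Here an in-memory buffer stands in for a remote object, since the read pattern is the same (the chunk size and data are arbitrary):

```python
import io


def scan_in_chunks(fileobj, chunk_size=4):
    # Object stores typically expose sequential reads rather than
    # random in-place writes, so consumers process the stream chunk
    # by chunk instead of seeking around the file.
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk


# An in-memory buffer standing in for a remote object's byte stream.
data = io.BytesIO(b"abcdefghij")
chunks = list(scan_in_chunks(data))
```

Real object-store clients expose the same idea as streaming or ranged reads, but the key constraint is identical: you process the object front to back rather than editing it in place.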
Remote file systems are sometimes referred to as Data Lakes. However the term often implies additional functionality beyond storage. So “Data Lake” more often refers to a larger system built around a remote storage service.
Note that many Cloud providers provide options such as EBS and EFS on AWS that allow you to mount additional storage to a virtual machine across the network. Even though this storage is technically remote, the VM treats it as if it were a physically-attached hard drive. So I would still classify this under local files.
A relational database organizes data into a collection of tables, each of which behaves like a spreadsheet with two important differences: 1) You have to decide what the columns are beforehand, after which you can add and edit the data. 2) The system allows rows in one table to refer to rows in other tables, defining relationships between the tables. So compared to a spreadsheet, a relational database trades away flexibility in favor of structure, consistency and functionality.
It is possible, of course, to change the tables and columns (aka the schema) after you start using a database, but the process is intentionally slower and more involved. Because of this, you have to put careful thought into how you’ll use a relational database to model the structures and processes it will support, and how you’ll handle future schema changes.
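To make the pre-defined schema and cross-table relationships concrete, here's a small sketch using Python's built-in sqlite3 module; the table and column names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The columns must be declared before any rows can be added.
conn.execute("CREATE TABLE state (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    """CREATE TABLE person (
           id INTEGER PRIMARY KEY,
           name TEXT,
           state_id INTEGER REFERENCES state(id)
       )"""
)

conn.execute("INSERT INTO state VALUES (1, 'California')")
conn.execute("INSERT INTO person VALUES (1, 'Ada', 1)")

# A join follows the relationship from rows in one table to rows in another.
row = conn.execute(
    """SELECT person.name, state.name
       FROM person JOIN state ON person.state_id = state.id"""
).fetchone()
```

The `state_id` column is what turns two independent spreadsheets into related tables: the database can follow that reference for you at query time.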
Relational databases are optimized for small, individual transactions – adding and updating one or a handful of rows at a time. Because of this, they’re often behind the scenes of interactive applications for Operational use cases that need to keep track of each individual change a user makes. They can also run large-scale queries via a language such as SQL, which allows them to power some Monitoring, Exploration or Prediction use cases.
However, because relational databases are optimized for individual transactions they typically don’t scale well. With modern hardware, millions of entries may be no problem but the billions of data points you commonly see with “Big Data” is probably out of the question. Even at the smaller scale, running large queries can slow down or halt operational interactions, leading many organizations to export the data to a separate system to support non-Operational use cases.
A relational database can be run on the user’s local system for applications with a Local compute layer, but it’s more common for them to be run as a remote service.
Analytics databases structure data very similarly to a relational database, but they’re optimized for querying data at large scales. They’re sometimes called Data Warehouses, though as with “Data Lake”, this term often includes additional functionality beyond just storage. In order to optimize for scale, many analytics databases require data to be loaded in batches and don’t allow individual add/edit transactions. So a common pattern is to load regular snapshots from a relational database supporting Operational use cases into an analytics database supporting the other use cases.
While analytics databases typically require a pre-defined set of columns for each table, data that’s loaded in batches usually arrives with its schema already established by the source system. In fact, it’s common for an analytics database to have many more tables than a relational database, each of which may be a variant or snapshot of the same underlying table. These tables can then be merged when you query them, managing any differences in schema based on the immediate use case.
This distinction is often referred to as ELT (Extract Load Transform) or Schema-on-read as opposed to ETL (Extract Transform Load) or Schema-on-write: In a relational database, all data must be Transformed into the pre-defined schema before it can be Loaded/Written into the tables. In an analytics database, data can be Loaded into tables, then Transformed into a common schema when you query/Read it.
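A minimal sketch of schema-on-read in Python, assuming two hypothetical batches that landed with slightly different column names and only get mapped to a common schema at query time:

```python
# Two "loaded" batches with slightly different schemas, as might land
# in an analytics database before any transformation (the EL of ELT).
batch_v1 = [{"user": "alice", "amount": 10}]
batch_v2 = [{"user_name": "bob", "amount_usd": 20}]


def read_normalized(batches):
    # Schema-on-read: each schema variant is mapped onto a common
    # shape at query time, not at load time.
    for batch in batches:
        for row in batch:
            yield {
                "user": row.get("user") or row.get("user_name"),
                "amount": row.get("amount", row.get("amount_usd")),
            }


rows = list(read_normalized([batch_v1, batch_v2]))
```

In an ETL/schema-on-write system, that mapping would have to happen before the data was allowed into the tables at all.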
Because they’re optimized for scale, analytics databases can run queries on massive volumes of data in a fraction of the time it would take a relational database. They do this by distributing the calculations over a cluster of machines that each work on a subset of the data. This ends up being a latency/speed trade-off: For the same query, relational databases will typically be faster than analytics databases on small amounts of data (low latency) while analytics databases will be faster on large amounts of data (high speed). So relational databases may actually be a better choice for use cases that involve small amounts of data.
Structured Data Files
There’s a hybrid system between a remote file system and an analytics database, in which you store structured files, then use a separate application to keep track of how these files map to “tables” and run queries on them. A number of analytics databases secretly do this behind the scenes. But there are also systems like AWS Athena where you explicitly manage the data files and interact with the query system separately.
The performance trade-offs for structured data files are similar to those for an analytics database. Using a query engine on structured data files allows you to defer defining the schema even closer to the query (ELT), but the larger difference is that it gives you more flexibility (and responsibility) for how data is stored and managed, and how the Compute layer is deployed.
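The pattern of a query engine sitting on top of raw structured files can be sketched with a toy "catalog" in plain Python (the table name, file contents and query are all made up for illustration; real systems like Athena work on files in an object store):

```python
import csv
import io

# A toy catalog mapping a "table" name to the raw structured files
# that back it. The query engine scans the files at query time.
catalog = {
    "events": [
        io.StringIO("user,action\nalice,login\nbob,logout\n"),
        io.StringIO("user,action\nalice,logout\n"),
    ]
}


def query(table, predicate):
    # The engine treats the collection of files as one logical table,
    # scanning each file and filtering rows as it goes.
    for fileobj in catalog[table]:
        fileobj.seek(0)
        for row in csv.DictReader(fileobj):
            if predicate(row):
                yield row


logins = [r["user"] for r in query("events", lambda r: r["action"] == "login")]
```

The catalog is the part you manage explicitly in a system like Athena: the files live in storage you control, and the engine just needs to know where they are and how to interpret them.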
A Key-Value Store, Object Store or Document Store is a system that allows you to store chunks of information (Values) and look up/retrieve them later based on a unique name (a Key). They’re typically optimized for both individual read/write operations like a relational database and the volume of an analytics database. The values may be completely unstructured – images, text, documents – or they may be structured as columns, lists, etc. For unstructured values, a Key-Value store isn’t very different from a remote file system. In fact, most service-based remote storage systems are just key-value stores in which the keys have slashes, so they look like a file system.
For structured values, a key-value store feels more like a relational or analytics database. In particular, if you have multiple values with the same schema (columns), you can treat them as a single table. However, most key-value stores won’t enforce any consistency between values. These are often called NoSQL databases because you can’t query them via SQL, though they often provide a query language that comes close. But because key-value stores are optimized for schema flexibility rather than querying, these queries will typically be much slower than in a relational or analytics database.
So key-value stores are ideal for use cases where you want to capture data while being flexible about what form it takes. These may include Operational use cases where you want to allow users maximal freedom in what data they record and how. Or they could include the Collect stage of any use case where you can’t guarantee the consistency of incoming data, particularly streaming data. These will tend to be Exploratory use cases, where a data scientist can manually handle inconsistency in the schema.
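A minimal key-value store sketch in Python shows why schema consistency is left entirely to the application (the class and keys here are hypothetical):

```python
import json


class KeyValueStore:
    """Minimal sketch of a key-value store: values are opaque blobs,
    so the store enforces no schema or consistency between them."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


store = KeyValueStore()

# Keys with slashes make the store look like a file system.
store.put("users/1", json.dumps({"name": "alice", "tags": ["admin"]}))
# A value with a different "schema" is perfectly legal.
store.put("users/2", json.dumps({"name": "bob"}))

name = json.loads(store.get("users/2"))["name"]
```

Looking up a single key is fast, but answering "which users have the admin tag?" would require scanning and parsing every value, which is exactly why ad-hoc queries over key-value stores tend to be slow.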
A graph database stores structured data similar to a relational database or a key-value store, but with a focus on the relationships/connections between different elements. This type of database presents the data as a graph or network in which the data elements are the vertices/nodes and the relationships between the elements define edges/links. Many graph databases allow you to bucket the data elements into classes such that all the nodes in a class have the same fields. So you can then think of the nodes as a table like you’d find in a relational database.
However, a graph database stores data in a way that is optimized for queries that take advantage of this graph structure. The trade-off is that the types of queries that you’d run on a relational database will be significantly slower. Conversely, if you were to put the data into a relational database and try to run a graph-style query, it would typically require a long series of relational queries followed by custom post-processing. Some queries might not be possible at all.
For example, in a relational database you might have a table of US states and a table of people, with a column that records the state each person lives in. This column is a reference from the “person” table to the “state” table. A graph database would present this as an edge from the vertex for each person to the vertex for the state they live in. The relational database would be very fast to answer a question such as “How many people live in California?” The graph database would be much slower at this, but if you asked it “How is person A related to person B?” it would very quickly discover that they live in the same state (assuming this is the case).
So while you can store essentially the same data in a relational database or a graph database, the choice comes down to a question of what types of queries you expect to run. It’s possible to store copies of the same data in both a relational database and a graph database and query whichever version matches your use case. However, keeping the two copies in sync as you update adds a level of complexity that may not be worth it.
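The “How is person A related to person B?” query above is essentially a shortest-path search, the kind of traversal a graph database runs natively. Here's a toy sketch in plain Python, with an invented adjacency list standing in for the stored graph:

```python
from collections import deque

# A tiny adjacency list: people connected to the state they live in,
# with edges stored in both directions for traversal.
edges = {
    "person:A": ["state:CA"],
    "person:B": ["state:CA"],
    "state:CA": ["person:A", "person:B"],
}


def shortest_path(start, goal):
    # Breadth-first search over the graph: follow edges outward from
    # the start node until the goal is reached.
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in edges.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


path = shortest_path("person:A", "person:B")
```

In a relational database, the same answer would require a self-join per hop, with the number of hops unknown in advance, which is why these queries get awkward fast outside a graph-native store.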
Some key-value stores/NoSQL databases also allow you to run reasonably efficient graph-style queries. As noted above, these databases often also have a SQL-like query language, so this can be a good compromise for running both types of queries without having to sync two different databases. Any solution will involve a trade-off, so the better you understand the types of queries you’ll want to run, the better you can optimize your storage.
In this post, we’ve discussed the major categories of options for the storage layer of your data management infrastructure. Each option is optimized for a certain type of data and certain types of use cases, whether these are very general such as file storage or very narrow, such as graph or relational databases.
In practice, you’ll often want to split data between different storage options, even within the same use case. Every storage system comes with its own complexity and overhead, so it’s usually best to keep this to a minimum. And if you can share some of these options between multiple use cases, or even categories of use cases, you can cut down the overall complexity even more.
So as I’ve described throughout this series, it’s useful to take a step back and think about all the different data use cases your organization is likely to encounter. If you can organize them via the use case categories I described previously and settle on a minimal set of shared options for the Control, Compute and Storage layers, you can not only save yourself a good deal of infrastructure overhead, you’ll also create a more consistent and coherent experience for everyone involved.