As the head of software engineering at a small startup with ambitions to grow much larger, I think a lot about how to design data infrastructure that will both address our immediate needs and adapt to future needs. I’ve seen what happens at large companies when each team has their own set of data infrastructure: You end up spending all your time either managing integrations between these systems or complaining about not being able to share data between teams. So the problem is to come up with the smallest set of tools that can be used across all the different teams and functions, both the ones that we have now and the ones that we can expect to add in the coming years.
Ideally, all the data would be in a single system so we can have seamless data sharing without having to manage cross-system integrations at all. However, different teams and functions have different requirements in terms of both technical specs and how they fundamentally interact with data. So to better understand which of these teams and functions should be able to share infrastructure, I’ve found it useful to group these different sets of requirements into four categories of use cases described below.
Every organization will need to support a different mix of use cases that fall into these different categories. And while it may be impossible to predict exactly what activities those will be in five to ten years, you can usually get a rough idea within each category. In this post, I’ll describe each category and the circumstances that define its unique requirements. In upcoming posts, I plan to present additional ways of thinking about these requirements that will allow us to go into more detail.
While some tools and platforms can address more than one of these categories, they are fundamentally too different for a single system to address all of them. So even if we can’t reach the ideal of a single system for everything, understanding these four categories and which ones you will need to support can help to select the smallest and most effective set of tools and systems for the medium- to long-term.
The Categories of Use Cases
Before we go into the specific details of each category, I’ll quickly summarize what they are and how they’re related:
Operations – Activities that involve entering and retrieving individual data elements. The data gathered here will feed into the other use cases. This is where most if not all raw data comes from, whether it’s within your organization or in an organization that you acquired the data from. Examples include inventory, HR and accounting systems, Electronic Health Records (EHRs) and system logs. In fact, a data catalog is an operational use case, though the data is metadata for other datasets.
Monitoring – Calculating and serving pre-determined metrics and visualizations over snapshots of data at ongoing time points. The key here is that you’re calculating the same statistics and drawing the same charts in a deterministic and consistently repeatable manner as the data evolves. The results are often displayed in dashboards.
Exploration – One-off, custom analysis. The datasets or data sources may be new to the organization or the user. The goal may be to answer questions that will influence a strategic decision. Or it may be part of the exploration step in the development process of the other use cases.
Prediction – Training and deploying machine learning models, then querying them for either batch or individual predictions. These models will typically be used over a period of time and need to be consistently and repeatably updated. However, unlike the Monitoring category, the results come from trained models rather than deterministic algorithms. And while batch predictions may be fed into a dashboard, Prediction use cases can also support transactional requests for individual predictions.
These four use cases interact extensively with each other: Data from operations feeds into the other categories, whether it’s your internal operations or those of an external data source. Exploration is the basis for the development work that defines the Monitoring and Prediction use cases. Conversely, Exploration work may make use of prediction APIs, or may be motivated by trends identified from Monitoring use cases.
You may have noticed that there isn’t a category called “Data Science”. That’s because the term is used to mean a lot of different things that fall into any of the last three categories. Two of these, Monitoring and Exploration, existed before data science was a thing: Monitoring is a large component of Business Intelligence and a number of other areas. Exploration has long been a specialty of management consultants, among others. However, data science provides a new take on both of these categories while helping to introduce the Prediction category into general usage.
Once you have an understanding of which of these categories you’ll need to support, and which use cases within each category, you can begin considering what requirements your data platform will need to satisfy for each one, and what type of software will allow you to meet these requirements.
Operations

The Operations category differs from the others in that users are almost exclusively concerned with individual data elements. In the other categories, minor inaccuracies are smoothed out by statistics or thrown away as outliers. But in Operations use cases, a single wrong entry could be fatal (figuratively or literally).
This makes operations use cases the best place to enforce standards that will improve data quality. However, the types of accuracy that the operational user is concerned with may be different from those that will matter in the other categories. For example, there may be systematic biases or drift that don’t impact operations, but cause statistical artifacts. Fields that are vitally important for analysis may be unnecessary for operations. It can be difficult to motivate operations users to change their processes to minimize these data quality issues that don’t immediately impact them. But that’s a topic for a different post.
The Operations category is also unique in that the transactional nature of user interaction makes the underlying data highly dynamic. Monitoring, Exploration and Prediction, on the other hand, are primarily concerned with static snapshots of a dataset. This makes data consistency particularly difficult for things like reference tables, which has led to specialized tools and approaches such as Master Data Management (MDM).
Software that supports Operational interaction with data tends to include other functionality specific to the data type, and is thus often specialized by domain: Customer Relationship Management (CRM) systems, Lab Information Management Systems (LIMS), Human Resources software, etc. So organizations tend to end up with multiple different systems in the Operations category and spend a fair amount of resources syncing data between them. (Again, see MDM.) However, a few platforms such as Airtable have been making a convincing argument that many of these domain-specific operations use cases could be handled by a single system.
Because of the transactional nature of Operations use cases, the technical requirements tend to emphasize latency over throughput. Since the other use case categories often require longer-running queries that could block Operational queries, it’s common to mirror Operations databases into a separate system for the other categories of use cases. This also provides the opportunity to transform the data into a schema that has been simplified, denormalized or otherwise modified to make analysis easier.
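As a minimal sketch of this mirroring step, assuming a hypothetical operational schema with `customers` and `orders` tables (the table names and fields here are invented for illustration), the analytical copy might flatten them into one wide, denormalized table that is easier to query:

```python
import sqlite3

# Hypothetical operational schema: normalized customers and orders tables.
op = sqlite3.connect(":memory:")
op.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         total REAL, created_at TEXT);
    INSERT INTO customers VALUES (1, 'Acme', 'EMEA'), (2, 'Globex', 'AMER');
    INSERT INTO orders VALUES (10, 1, 99.5, '2023-01-05'),
                              (11, 2, 15.0, '2023-01-06');
""")

# Analytical mirror: one wide, denormalized table for the other use cases.
an = sqlite3.connect(":memory:")
an.execute("""CREATE TABLE orders_wide (order_id INTEGER, customer_name TEXT,
              region TEXT, total REAL, created_at TEXT)""")

# The "mirroring" step: join away the normalization while copying the data over.
rows = op.execute("""
    SELECT o.id, c.name, c.region, o.total, o.created_at
    FROM orders o JOIN customers c ON c.id = o.customer_id
""").fetchall()
an.executemany("INSERT INTO orders_wide VALUES (?, ?, ?, ?, ?)", rows)

# Analytical queries now run against the mirror, never blocking operations.
print(an.execute("""SELECT region, SUM(total) FROM orders_wide
                    GROUP BY region ORDER BY region""").fetchall())
# → [('AMER', 15.0), ('EMEA', 99.5)]
```

In a real deployment the mirror would be a separate warehouse fed by change-data-capture or batch ETL, but the shape of the transformation is the same.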
Today, with faster hardware and distributed systems, this separation is less of a necessity. However, there are now a lot of Operational use cases that involve transactions from a large customer base (e-commerce) or the general public (network security). These use cases require dedicated systems for capturing high-volume data streams and transforming them into an analysis-ready form.
Monitoring

Requirements for Monitoring use cases come from both the end users who view the analysis and the developers who maintain the pipelines.
End users viewing the results may need to be able to drill down, or otherwise adjust how they view the results, but this is only within pre-defined limits. Any use case where the user has complete flexibility in how they view the data would fall under the Exploration category. The goal of monitoring use cases is more to identify trends and issues than to understand them. Since the end user requirements are fairly limited, the discussion here will focus on the developer requirements.
For Monitoring use cases, the pre-defined analysis is often performed at set intervals based on when data snapshots are updated, ideally with minimal or no human intervention. This may involve a mix of internal and external data sources that requires data to be joined across these sources. Most of the difficulty and complexity of this use case stems from the requirement that the analysis be consistent over time, given the potential inconsistency of datasets outside the developers’ control.
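A minimal sketch of what such a pipeline boils down to, using invented snapshot data and a hypothetical `daily_avg_order` metric: the same deterministic calculation is rerun against each new snapshot, with no human in the loop.

```python
from statistics import mean

# Hypothetical daily snapshots of the same dataset; a later snapshot may
# contain surprises (here, a missing value) that the pipeline must tolerate.
snapshots = {
    "2023-01-01": [{"amount": 100.0}, {"amount": 250.0}],
    "2023-01-02": [{"amount": 120.0}, {"amount": None}, {"amount": 260.0}],
}

def daily_avg_order(rows):
    """The pre-defined metric: always the same deterministic calculation."""
    values = [r["amount"] for r in rows if r.get("amount") is not None]
    return round(mean(values), 2) if values else None

# The "pipeline": rerun the identical logic against every snapshot.
dashboard = {day: daily_avg_order(rows) for day, rows in sorted(snapshots.items())}
print(dashboard)  # → {'2023-01-01': 175.0, '2023-01-02': 190.0}
```

In practice the snapshots would arrive via a scheduler and the results would feed a dashboard, but the core requirement is visible even here: the metric function must give consistent results as the underlying data evolves.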
Unlike with the Operations category, an error in a single entry won’t be a problem since it will be smoothed out in the statistics. However, small systematic errors or statistical drift across snapshots can be a huge problem.
While the analysis in this use case is deterministic (as opposed to Prediction use cases), the logic can be very complex. It often models business processes that are formally or informally defined elsewhere, such as Key Performance Indicators (KPIs). These may require pulling from multiple data sources with different conventions and reference tables. So a great deal of effort typically goes into monitoring data quality and dealing with corner cases.
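As one illustration of guarding against the systematic errors and drift mentioned above, a pipeline might compare summary statistics across snapshots and raise a flag when a field shifts by more than some tolerance. The data and the three-standard-deviation threshold below are illustrative assumptions, not a prescription:

```python
from statistics import mean, stdev

# Hypothetical: the same field sampled from two consecutive data snapshots.
last_week = [10.2, 9.8, 10.5, 10.1, 9.9, 10.3]
this_week = [12.9, 13.4, 13.1, 12.7, 13.2, 13.0]  # a systematic shift

def drifted(old, new, threshold=3.0):
    """Flag a shift in the mean larger than `threshold` standard deviations
    of the old snapshot. A crude check, but it catches unit changes,
    upstream schema changes, and similar silent breakage."""
    return abs(mean(new) - mean(old)) > threshold * stdev(old)

print(drifted(last_week, this_week))  # → True: time to investigate upstream
```

A single bad row would barely move the mean, which matches the point above: individual errors wash out, but systematic shifts like this one are exactly what the pipeline needs to catch.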
The software for these use cases tends to be domain agnostic, and spans a spectrum from code-based solutions for technical experts to graphical interfaces such as Tableau for domain experts. Some of these tools can also be used for Exploration use cases. But within the Monitoring category, the goal of using this software is to develop a robust and repeatable process that the developer can deploy, then update as little as possible.
Exploration

In the Exploration category, the end user and the developer are the same person. The analysis is by definition one-off, so unlike with Monitoring use cases, there is no notion of consistency across snapshots of the data. In fact, this use case often involves new data sources that the analyst hasn’t seen before. The complexity thus has more to do with discovering and understanding these new data sources, and with going deeper into the analysis using more complex statistics.
Again, individual data errors tend to be smoothed out by the statistics, while systematic errors can be a problem. However, because a person is involved in every calculation, these issues are more likely to be caught and accounted for. A good analyst or data scientist will start by running summary statistics that are explicitly aimed at finding these kinds of errors.
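As an illustration, here is the kind of quick first pass such an analyst might run on a hypothetical new field (the ages and sentinel values are invented), where basic summary statistics immediately surface suspect entries:

```python
from statistics import mean, median
from collections import Counter

# Hypothetical new data source: ages containing a repeated sentinel value
# and a physically impossible entry.
ages = [34, 29, -1, 41, 999, 37, -1, 52, 28]

# A first pass before any real analysis:
print("n        :", len(ages))
print("min/max  :", min(ages), max(ages))          # -1 and 999 stand out
print("mean     :", round(mean(ages), 1))
print("median   :", median(ages))                  # far from the mean: skew or bad data
print("top vals :", Counter(ages).most_common(3))  # -1 repeats: likely a null sentinel
```

The gap between the mean and the median, plus the repeated -1, is exactly the kind of systematic issue a human in the loop catches before it quietly biases a downstream pipeline or model.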
On the other hand, the outcome of this exploration is a single ephemeral analysis such as a set of slides or a notebook, rather than a product that can be used repeatedly. The result doesn’t need to be robust and production-grade so the goal is to minimize the time and effort required to answer the question at hand.
As with Monitoring, the software tools and platforms tend to be domain agnostic, allowing multiple teams to share tools within the category. The graphical tools geared towards domain experts are often the same as those used for development in the Monitoring category. Tableau, the canonical example, allows you to quickly define graphs with many different views of the data (Exploration), then publish the best ones to a dashboard (Monitoring).
For code-based tools, the canonical choice has become the notebook – typically Jupyter, though there are a number of alternatives. This is in contrast to Monitoring use cases, which tend to involve more traditional software engineering tools (IDEs, debuggers, etc.). A number of development flows have been introduced to transition smoothly from notebooks to software development, but there isn’t yet a well-established winner.
There are hundreds of choices for languages and libraries that one can use within the notebook, and depending on the circumstances it may or may not be important to standardize these within your organization.
Prediction

In the final category, we have use cases where a trained model is deployed to make predictions on demand. This is similar to the Monitoring category in that there’s both an end user and a developer with different requirements, and in that it involves an evolving series of potentially inconsistent snapshots of the data. However, this category adds the complexity of managing and tracking an evolving series of models. The models also tend to involve more statistical complexity rather than complexity from business logic.
As with Monitoring and Exploration, individual errors tend to be smoothed out, while systematic data issues are potentially problematic. As with Monitoring, statistical drift is also an issue. But in the Prediction setting these make it into the results indirectly via a trained model with bias. While some models can be interpreted in a way that makes the bias detectable, for other models it may only be discovered by carefully and systematically analyzing the predictions.
To identify and debug this kind of bias, you need to be able to track which model versions served which results and what snapshots of the data each model was trained on. Capturing and analyzing this metadata requires special tools and infrastructure.
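A minimal sketch of what that lineage metadata might look like. The registry shape here is a hypothetical illustration, not a reference to any particular tool; real systems (model registries, experiment trackers) store the same relationships with far more detail:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ModelVersion:
    version: str
    trained_on_snapshot: date  # which data snapshot this model was trained on

@dataclass
class Registry:
    models: dict = field(default_factory=dict)
    served: list = field(default_factory=list)  # (request_id, version) pairs

    def register(self, m: ModelVersion):
        self.models[m.version] = m

    def log_prediction(self, request_id: str, version: str):
        """Record which model version served which result."""
        self.served.append((request_id, version))

    def snapshot_for_request(self, request_id: str) -> date:
        """Trace a suspicious prediction back to its training data."""
        version = dict(self.served)[request_id]
        return self.models[version].trained_on_snapshot

reg = Registry()
reg.register(ModelVersion("v1", date(2023, 1, 1)))
reg.register(ModelVersion("v2", date(2023, 2, 1)))
reg.log_prediction("req-42", "v1")
reg.log_prediction("req-43", "v2")
print(reg.snapshot_for_request("req-43"))  # → 2023-02-01
```

The point is the two links in the chain: result → model version, and model version → training snapshot. If either link is missing, a biased prediction cannot be traced back to the data that caused it.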
There are a number of platforms that allow domain experts to define, train and deploy models, but these are relatively new. Many organizations that address these use cases still do so with code written by technical experts.
The initial development often involves Exploration use cases, while the later phases typically require more of a software development-style workflow to build, deploy and maintain a production-grade model. So this is where the handoff from notebook to IDE is particularly important. Many teams still work by building a prototype in a notebook environment (an Exploration use case) then having a separate team re-implement it from scratch for production (a Prediction use case). There are a number of tools that attempt to make this smoother, but this is still a very new area without a canonical solution.
As with Exploration, there is a wide variety of libraries and frameworks available, many of which can also be used for Exploration use cases. The extent to which an organization or team needs to standardize on a small number of choices will again depend on the circumstances.
Conclusion

The four categories of use cases defined above are not mutually exclusive. In fact, as you’ve seen above, there is some overlap and plenty of interaction between them. However, when creating a data platform for an organization, you must decide how to support each category and its particular requirements.
In this post, I’ve given an overview of their unique needs and touched on the software that’s typically used. However, to get a complete understanding you need to go deeper into each category. In my next few posts I plan to discuss additional frameworks for analyzing each category that I’ve found helpful in my own work.