Data Warehouse Architecture
Data warehouses used by businesses today need to contain a variety of data from tools like HubSpot, Google Analytics, Stripe, and more and make it available for analysis. Because of the variety of data, data warehouses need to be architected in a way that maximizes the warehouse’s flexibility while maintaining its speed of operation.
In this article, you'll learn more about data warehouse architectures, as well as what you should consider when setting up your own data warehouse.
The Key Elements of a Data Warehouse Architecture
In this section, you’ll look at how data warehouses differ from other forms of data storage, including both the technical differences and the ways in which these differences affect user experience.
Columnar storage: Traditionally, when a database records a transaction in a row, all fields in that row contain a value related to that particular transaction. In this scenario, storing all rows in contiguous blocks) on a hard drive makes perfect sense.
The following example outlines how a row-based database stores transaction lines for a store that sells televisions (yes, in reality, it's a little more complex).
Although a relational database management system (DBMS) like MySQL or PostgreSQL is perfectly capable of calculating the total number of sold devices by Samsung, it probably has to access a lot of disk storage blocks to do so. That's very inefficient. Not only that, but these blocks contain a lot of data that’s unnecessary for this particular calculation. Ad hoc analyses rarely require all columns, so it's safe to say that row-based systems aren't suited for (or are at least suboptimal at) analytical queries.
That's why most data warehouses store their columns in contiguous blocks. Here's what the same example would look like in a column-oriented database:
Columnar storage is a fundamental technical principle of data warehouses. It's efficient in multiple ways:
- Fewer blocks need to be accessed
- These blocks do not contain unneeded data
- Storing homogeneous data within a single block enables better compression
You should know that this is a technical property almost entirely invisible to the end user. Data warehouses support the same—if not more—SQL commands as a traditional database.
Separation of storage and processing: Operational databases are designed for processing lots of transactions at millisecond latency. Keeping storage close to computing power is a big help in ensuring ACID properties.
But when these ACID properties are relaxed, keeping storage close to compute is often costly and unnecessary. Typically, when data is stored on local drives attached to the processing nodes, database systems partition the data so that each node owns a portion of the data. Scaling a database in this way requires that the whole cluster be rebalanced when adding additional nodes. Otherwise, some nodes will remain idle when there's no data allocated to them.
That's where the decoupling of storage and compute comes in. When storage is added via a network, you can add a node and point it to one of the existing network drives. This way, the new node can get ownership over a portion of the data without actually moving that data from one partition to another. Furthermore, if a node fails, no data is lost—not even temporarily. That's because the other nodes in the cluster can also access the data that initially belonged to the failed node.
Massively parallel processing: By moving their analytical workloads to the cloud, organizations open the doors to virtually unlimited storage and computing power. The benefit? Queries that used to take days—or that broke production systems altogether—can now get answers in seconds. The secret ingredient is massively parallel processing, or MPP.
It shouldn't surprise you that adding more servers (horizontal scaling) to a database system can speed up queries. But for decades, this was an extraordinarily complex and expensive architecture to manage.
That changed when major cloud providers abstracted these difficulties away by offering data warehouses as a service. Amazon Web Services has Redshift. Google Cloud Platform has BigQuery. Here's how a Google whitepaper boasts BigQuery's capabilities: "You would need to run 10,000 disk drives and 5,000 processors simultaneously to execute the full scan of 1TB of data within one second. Because Google already owns a huge number of disk drives in its own data centers, why not use them to realize this kind of massive parallelism?"
Of course, these technical bells and whistles would be pointless if they didn't come with any functional benefits. Below, you can find six of them: speed, cost, instantaneousness, zero-maintenance, scalability, and availability.
Speed: The be-all and end-all for data warehouses is their speed. There are many benchmarks out there, and frequently these result in all-out war. Although independent benchmarks don't always line up and vendors dispute each other's results, it's indisputable that cloud data warehouses reduce time to insight drastically.
Zero startup costs: Using a cloud data warehouse means no investments in physical hardware and no fees for expensive consultants to manage it. The costs for storing data in a cloud data warehouse are ridiculously low, and the price for processing data is on a per-query basis.
Instant deployment: In the case of Redshift, deploying a data warehouse is a matter of minor configurations. For Snowflake, you only need to select a cloud provider. And BigQuery is entirely serverless. In other words, getting started with your data warehouse has never been easier.
Zero-maintenance: Because cloud data warehouses are software as a service (SaaS), maintenance isn’t a concern. No one needs to monitor the performance of underlying hardware to ensure that the CEO gets his weekly report on time.
Scalability: Because compute and storage are strictly separated, it's possible to modify the system's configuration dynamically. A serverless data warehouse like BigQuery even scales resources on a per-query basis.
Availability: The separation of processing power and storage is also responsible for near-perfect availability. There are no nodes to manage, and there’s no downtime with a cloud provider that has network centers across the globe.
Designing a Data Warehouse Architecture
By now, it should be clear what a data warehouse is and why you should consider using one. In this section, you'll learn about the considerations regarding setting up a data warehouse architecture.
Typically, data warehouses support all kinds of structured data, and some even support semi-structured data. The following types of data are relevant in a variety of contexts.
Since data warehouses generate a lot of interest from within business intelligence and marketing analysis communities, organizations tend to prioritize data sets related to their customers and their buying behavior.
Transaction and order lines are an obvious first pick. That's because sales trackers are simply the most sought-after reports. But that's not the only reason. Transactions are usually stored in a data repository (such as an OLTP database) owned and maintained within the company. Rarely does setting up a data replication pipeline from a transactional database to a data warehouse require third-party integration software. Complete database dumps or SQL statement logs often suffice.
Another point of interest is the data stored within an organization's customer relationship management (CRM) tool. It shouldn't surprise you that salespeople looking for new opportunities want to know who their customers are—or aren't. The same can be said of marketing departments. They're in the business of promoting the right products and services to the right customers. Combining transaction data with CRM data to get to those insights is an easy win.
Both transaction data and customer properties are explicit data intentionally provided by the customer. But organizations can also acquire implicit data, which can be collected by monitoring how users or customers interact with their (often digital) channels.
- Email: Tools like Adobe Campaign, Selligent, and Mailchimp store information on every email that is sent out: who it’s delivered to, who opens it, who clicks on a link. Getting the data out can be done programmatically or by using data integration tools that maintain connectors for interfacing with these tools' APIs.
- Web: A popular way for tracking web behavior is Google Analytics, which integrates seamlessly with BigQuery (Google's cloud data warehouse). But there are other tools like Snowplow and RudderStack, which can stream directly into your data warehouse.
- App: Installing tracking on websites has become less painful throughout the years. Nowadays, tools that are mostly installed for tracking web behavior can also be used for tracking app interactions.
It doesn't end here. With support for all kinds of structured and semi-structured data, there are nearly infinite use cases for which a data warehouse is the most suitable data repository.
Fetching a lot of data is one thing; storing it in the most suitable way for your use case is another. In operational databases, data typically gets normalized, resulting in a snowflake or a star schema. That has two significant advantages. First, storing all dimensions in separate tables makes transactions more compact. Second, it deduplicates data and reduces storage volumes.
However, data warehouses are mainly used for reading operations. Furthermore, given the separation of storage and computing power, reducing storage volumes isn't a concern. This raises the question: how should one organize data inside a data warehouse?
The answer: flattening tables, or "denormalization."
For example, with storage being extremely cheap, Google recommends optimizing queries instead of schemas in BigQuery: "Denormalization is a common strategy for increasing read performance for relational datasets that were previously normalized. […] The storage savings from using normalized data has less of an effect in modern systems." Nevertheless, many vendors offer features so that their customers can have it both ways: optimization of both storage and computing power. Firebolt has semi-structured tables, Redshift has the SUPER data type, and BigQuery has nested and repeated fields. This way, data redundancy can be avoided while optimizing query speed and keeping costs down.
And what about slowly changing dimensions? For example, when a user in your database changes addresses, do you overwrite his previous location? Or do you want to keep a historical overview? There are at least half a dozen ways of handling this issue, but without storage volume concerns, this isn't much of a problem for a data warehouse.
- If your data arrives via an event stream, you'll always have an event timestamp, which can be used to keep track of historical changes to a particular field.
- If your data arrives in batch (e.g., per day), you'll have the snapshot timestamp. This date field can be used to keep a daily historical log.
Finally, warehouse-first tools like RudderStack come with a predefined schema. The data these tools generate are optimized for analytical queries and don't require post-processing steps.
From ETL to ELT
With their processing power and storage capacity tightly linked together, tools like OLAP cubes or relational databases were terrible data sinks). You couldn't just dump troves of data in them without sacrificing performance or paying for expensive scaling operations. For this reason, most data warehouse architectures were designed around an ETL workflow.
ETL to ELT
ETL is an acronym for extract, transform, and load and describes how data flows from source systems into analytical data stores. In this workflow, data is extracted from source systems into a data sink. The data sink could be as simple as a server with a vast storage capacity, but data lakes were truly fit for purpose. Next, data would flow between the sink and the analytical store through data pipelines that transformed it into its final form.
In the past couple of years, the following three data engineering evolutions were notable:
- Data integration is now completely commoditized by tools such as RudderStack, Fivetran, and Airbyte.
- Data warehouses now increasingly support the loading of streaming data (e.g., Snowpipe or BigQuery's streaming inserts).
- We've seen the third one before. Due to cheap storage, unlinked from processing capacity, the data warehouse can now be leveraged both as a sink and for managing the data pipeline.
These technological evolutions sparked a whole new paradigm known as ELT, an acronym for extract, load, and transform.
In this workflow, data gets extracted directly into the data warehouse from its source system. Remember that most data warehousing technologies also support semi-structured data. Consequently, extracted data in a data warehouse is often completely schemaless.
Although stored procedures and (materialized) views) can theoretically manage the transformation from raw data to perfect tables in a data mart, the ELT paradigm has spawned a whole new software category that allows organizations to transform, model, and track the lineage of their data at scale. Most popular are dbt, Matillion, and Dataform (now Google-owned).
Many organizations have discovered that they can do more with their data than analyze it for improved decision-making. This ambition often means that the data needs to be extracted from the data warehouse into other (third-party SaaS) tools. Think about the following data-driven use cases:
- Segments for personalizing Google search and display ads
- Personalized emails in your marketing campaigns
- Propensity scores to feed your product recommendation engine
- Audience building in Facebook Ads
- Survey recipient selection
- Next-best-action scores for contacts in a CRM
But getting data into a data warehouse is one thing; getting cleaned, combined, and enriched data out is another.
Until recently, it was very cumbersome to maintain integrations with each of these products. However, with data warehouses taking up an increasingly important role, another new warehouse-first software category rose to prominence: the Reverse ETL tool. While traditional data integration tools focus on getting data out of SaaS applications, reverse ETL tools such as Hightouch and Census are focused on moving data back into them.
The Modern Data Stack
When we put all the puzzle pieces together, we come to a software architecture with the following components:
- a data integration tool to extract data from operational databases and SaaS platforms;
- a cloud-based columnar database that can be leveraged as a data sink, transformation engine, and data mart;
- a data modeling layer for managing transformations at scale; and
- a reverse ETL tool to activate data across a wide range of SaaS tools.
This set of tools is often called the "modern data stack" (MDS).
The modern data stack scales incredibly well, while the technical barriers to entry have almost completely evaporated. It is completely modular, hosted in the cloud, and requires a minimal amount of configuration and maintenance. Anyone with a visualization tool or notebook can interface with the organization's data repository. These properties have made the modern data stack a favorite in analyst circles.
In this article, we defined the data warehouse as a cloud-hosted columnar data store that scales amazingly due to the separation of storage and computing power.