Data Warehouses versus Data Lakes

The ever-increasing amount of data needed to evaluate past performance, assess existing processes, and model future scenarios presents business intelligence challenges that can seem daunting. Several data storage methodologies tame this wild landscape, including data warehouses and data lakes. Though the two terms are sometimes used interchangeably, significant differences and distinct use cases separate the two systems.

What is a data warehouse?

Data warehouses contain structured data that has been processed and sanitized. Source data in a warehouse is usually organized as historical time series, preparing sets for modeling and inference. Data warehouses are designed to be highly responsive to the demands of analysts, enabling flexible, real-time iteration on data insights. For more details, read our articles Key Concepts of a Data Warehouse and What are the Benefits of a Data Warehouse?

What is a data lake?

In contrast to the well-ordered data warehouse, a data lake contains a vast pool of unstructured, raw data. Whereas the filtering process used to populate a data warehouse is informed by the needs of the intended audience and the purpose(s) to which they’ll put the data, lakes contain data aggregated without a specific intention; the query scenarios for lake data are undefined. A data lake is an archive of data, safely kept to satisfy future needs.

An emerging architecture — the “data lakehouse” — combines the archival flexibility of a data lake with the optimized, query-ready structure of a warehouse.

Key differences between data warehouses and data lakes

The sections below detail the key differences between the data warehouse and data lake methodologies: intended purpose, audience, data structure, access and update cost, access model, and storage and computing.

Intended purpose

The data stored in a data warehouse has been deliberately gathered from external data sources with a particular purpose in mind. Upon retrieval, the source data is processed, sanitized, and otherwise readied for storage in a manner that's optimal for the planned query needs. This processing is computationally expensive in both time and infrastructure (electricity, computing resources, interim storage).
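
As a rough sketch, that extract, transform, load (ETL) pattern might look like the following Python example (assuming pandas and pyarrow are installed; the file names, columns, and cleaning rules are hypothetical):

```python
# A minimal ETL sketch of the "process before storing" pattern. The source
# file, column names, and cleaning rules are hypothetical.
import os

import pandas as pd

# Extract: pull raw records from an external source (here, a CSV export).
raw = pd.read_csv("orders_export.csv")

# Transform: sanitize and conform the rows to the warehouse's planned shape.
clean = (
    raw.dropna(subset=["order_id", "amount"])        # drop unusable rows
       .assign(
           amount=lambda df: df["amount"].astype(float),
           ordered_at=lambda df: pd.to_datetime(df["ordered_at"], utc=True),
       )
       .drop_duplicates(subset=["order_id"])
)

# Load: write the query-ready table to warehouse storage. This is the
# computationally expensive step described above.
os.makedirs("warehouse", exist_ok=True)
clean.to_parquet("warehouse/orders.parquet", index=False)
```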

Data is aggregated into a data lake without specific purposes, use cases, or even audiences in mind. It is purely for archival needs, to preserve every bit of information collected over a span of time, ready for the future demands of business analysis.

Audience

Business intelligence efforts rely on the warehouse's aggregated, cleaned, well-structured data. This consistent data lends itself to straightforward mining for insights that scattered source data would obscure. The data mining process may take the form of human data analysis, algorithmic machine pattern matching, or more sophisticated AI models.

Data scientists, in contrast, are the primary consumers of the unstructured raw data stored in the lake. Having all the primary source data at hand enables the investigative process that determines what relationships and supporting data are available to the business intelligence community. Additionally, the purely unstructured nature of a data lake is unsuitable for most machine-automated pattern seeking.

Data structure

Data warehouses contain data that’s already been tidied into a planned format, ready for extraction for specific purposes. This provides maximum speed of consumption (at the cost of some flexibility). Data warehouses also typically prepare data with a shared timestamp schema that lends itself to machine learning training and time-bracketed data lookup.

Data lakes aggregate information from multiple sources in their native formats, ready for whatever purpose lies in store. This provides maximum flexibility (at some possible future computational cost). This aggregated data can also be loaded into the lake faster, and made available sooner, because pre-load processing is side-stepped.
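
The contrast might look like this minimal Python sketch, which stores one hypothetical event both ways; the paths and field names are illustrative, not any specific product's API:

```python
# A sketch contrasting the two storage styles with one hypothetical event.
import json
import os

import pandas as pd

event = {"user": "u123", "type": "page_view", "ts": "2023-03-01T12:00:00Z",
         "props": {"path": "/pricing"}}

# Data lake: persist the record in its native format, untouched. Loading is
# fast, but every later consumer must interpret the raw shape.
os.makedirs("lake/events", exist_ok=True)
with open("lake/events/2023-03-01.jsonl", "a") as f:
    f.write(json.dumps(event) + "\n")

# Data warehouse: conform the record to a planned, typed schema with a
# shared timestamp column before it is stored.
os.makedirs("warehouse", exist_ok=True)
row = pd.DataFrame([{
    "user_id": event["user"],
    "event_type": event["type"],
    "event_ts": pd.to_datetime(event["ts"], utc=True),
    "page_path": event["props"]["path"],
}])
row.to_parquet("warehouse/events.parquet", index=False)
```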

Access and update cost

Information in a data warehouse is accessed primarily through specific, tailored queries constructed to satisfy particular use cases. Updates to the warehouse occur only after the computationally intensive transformation process has completed for the new data. Choosing a data warehouse thus trades some immediacy of access to new data for the guarantee that data is formatted before it's stored.

In contrast, data inserted into a data lake can be accessed sooner, because no sanitizing process stands between ingestion and storage (saving the associated computation costs). Having the raw data at hand also opens up more ad hoc opportunities: exploring the data, running proofs of concept, and self-service, free-form predictive modeling.
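
To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in warehouse engine; the table, files, and fields are hypothetical:

```python
# Warehouse access vs. lake access, side by side.
import json
import os
import sqlite3

# Warehouse side: access goes through a tailored query over a table that
# was transformed before storing (here, a time-bracketed revenue rollup).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id TEXT, amount REAL, ordered_at TEXT)")
con.execute("INSERT INTO orders VALUES ('o1', 19.99, '2023-03-05T12:00:00Z')")
march_total, = con.execute(
    "SELECT SUM(amount) FROM orders "
    "WHERE ordered_at BETWEEN '2023-03-01' AND '2023-03-31'"
).fetchone()

# Lake side: a raw record is available for ad hoc exploration the moment it
# lands, with no transformation step in between.
os.makedirs("lake/orders", exist_ok=True)
with open("lake/orders/2023-03-05.jsonl", "a") as f:
    f.write(json.dumps({"order_id": "o2", "amount": "5.00"}) + "\n")
with open("lake/orders/2023-03-05.jsonl") as f:
    raw_orders = [json.loads(line) for line in f]

print(march_total, raw_orders)
```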

Access model

Data warehouses require a computationally intensive process of cleansing and transforming information prior to storing it. Technically, this scheme of organizing data is called schema on write.

In the planning process for deploying a data warehouse, a series of exercises must be undertaken to determine the ultimate schema (the organization of, and relationships among, the data), one that provides the most usable data to the widest audience. Changes in requirements demand updates to both the schema and the insertion process, typically a time- and labor-intensive undertaking.

Data lakes, in contrast, store data in its native format. Information is extracted and transformed on the fly for each need, an approach known as schema on read. As with the savings on updates and storage, schema on read offloads the work of designing a schema to later consumers of the data, while sidestepping the potential debt of a data warehouse schema migration.
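
A minimal sketch of the two access models, with a hypothetical record and coercion rules (real warehouses enforce schemas in the engine itself):

```python
# Schema on write vs. schema on read, reduced to two small functions.
import json

record = '{"user": "u123", "amount": "19.99", "ts": "2023-03-01T12:00:00Z"}'

# Schema on write: validate and coerce BEFORE storing. Bad data fails here,
# up front, and every later reader receives the agreed-upon shape.
def write_to_warehouse(raw: str) -> dict:
    doc = json.loads(raw)
    return {
        "user_id": str(doc["user"]),
        "amount": float(doc["amount"]),  # a type error surfaces at write time
        "event_ts": doc["ts"],
    }

# Schema on read: store the raw string as-is. Each consumer imposes its own
# interpretation when reading, paying the parsing cost on every query.
def read_from_lake(raw: str, field: str):
    return json.loads(raw).get(field)

print(write_to_warehouse(record))
print(read_from_lake(record, "amount"))  # still the string "19.99"
```

Note that the lake reader gets the amount back as the string "19.99": each consumer must repeat that coercion, which is exactly the deferred work schema on read trades for cheap writes.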

Storage and computing

Like the relationship between content and formatting, which were deliberately separated for ease of making stylistic changes in the early days of the web, there’s a relationship between storing data and the associated computing needs.

Data warehouses, with their pre-storage computation needs, have a “tightly-coupled storage and computation” relationship. Increasing storage capacity demands a proportional increase in processing needs.

Incidentally, data warehouses are generally designed to store larger quantities of data than data lakes. One of the largest data warehouses on record, built by SAP and its partners, holds over 12 petabytes of information.

Separating the data storage from the transformation computation — technically called “decoupled storage and computation” — optimizes a business's costs by matching storage requirements against the frequency of data access. Archiving business data in less expensive tiers — the data lake — saves money and provides more immediate availability of the information. As mentioned before, this immediacy enables staff to more quickly begin experiments in data exploration, analysis, and modeling.
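
As one hedged illustration, an embedded engine such as DuckDB can spin up compute on demand and scan files kept in object storage; the bucket, path, and columns below are hypothetical, and the duckdb package is assumed to be installed:

```python
# Decoupled storage and compute: an on-demand local engine scans files that
# live in cheap object storage. Private buckets would also need credentials.
import duckdb

con = duckdb.connect()         # compute spins up only when needed
con.execute("INSTALL httpfs")  # extension for reading remote object storage
con.execute("LOAD httpfs")

result = con.execute(
    "SELECT event_type, COUNT(*) AS n "
    "FROM read_parquet('s3://example-lake/events/*.parquet') "
    "GROUP BY event_type"
).df()
print(result)
```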

Data warehouse or data lake: which is better?

Deciding which data storage solution to implement depends upon a delicate balance between needs, value extracted from data analysis, and the costs of infrastructure, storage, and computation.

Organizations that require great agility and work on smaller quantities of information may choose the way of the data lake. Those in industries where far greater quantities of data exist, and where the data must be massaged in order to be most useful to the greatest audience, may choose the data warehouse. Some may choose both to provide maximal flexibility.
