Data Warehouses versus Data Lakes
The ever-increasing amount of data needed to evaluate past performance and existing processes and to model future scenarios presents business intelligence challenges that may seem daunting. Several data storage methodologies tame this wild landscape, including data warehouses and data lakes. While these terms are sometimes used interchangeably, significant differences and use cases separate the two data systems.
What is a data warehouse?
Data warehouses contain structured data that’s been processed and sanitized. A warehouse’s data sources are usually organized along a historical time series, preparing sets for modeling and inference. Data warehouses are designed to be highly responsive to the demands of analysts, enabling flexible, real-time iteration on data insights. For more details, read our articles Key Concepts of a Data Warehouse and What are the Benefits of a Data Warehouse?
What is a data lake?
In contrast to the well-ordered data warehouse, a data lake usually contains a vast pool of unstructured, raw data, although it can also be used to store the same structured data one would find in a warehouse. Whereas the filtering, cleaning, and transformation processes used to populate a data warehouse are informed by the needs of the intended audience and the purpose(s) to which they’ll put the data, lakes contain data aggregated without a specific intention; the query scenarios for lake data are undefined. A data lake is an archive of data, safely kept to satisfy future needs.
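As a minimal sketch of this store-first approach (assuming an Amazon S3 bucket named example-data-lake and the boto3 client; the event fields and key layout are hypothetical), a raw event can be dropped into the lake exactly as it arrived, with no cleaning or schema applied:

```python
import json
from datetime import datetime, timezone

import boto3  # AWS SDK; any object store client would look similar

# Hypothetical raw clickstream event, kept exactly as it arrived.
event = {
    "user_id": "u_123",
    "action": "page_view",
    "properties": {"path": "/pricing", "referrer": "google"},
    "received_at": datetime.now(timezone.utc).isoformat(),
}

s3 = boto3.client("s3")

# Data lake pattern: land the untouched payload in cheap object storage,
# partitioned by arrival date so future consumers can prune their scans.
key = f"events/raw/dt={event['received_at'][:10]}/{event['user_id']}-{event['received_at']}.json"
s3.put_object(
    Bucket="example-data-lake",  # hypothetical bucket name
    Key=key,
    Body=json.dumps(event).encode("utf-8"),
)
```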
An emerging architecture — the data lakehouse — combines the archival flexibility of a data lake with the optimized, query-ready structure of a warehouse.
Key differences between data warehouses and data lakes
The following table summarizes the key differences between the data warehouse and data lake architecture methodologies. Details about each difference follow.
| | Data Warehouse | Data Lake |
| --- | --- | --- |
| Intended purpose | Known before insertion | Not yet known |
| Audience | Business intelligence | Data science |
| Data structure | Transformed for use cases | Raw, native format |
| Access and update cost | Accessed primarily via tailored queries per use case; updated after an extensive transformation process | Easy to access; frequently updated |
| Access model | Schema on write | Schema on read |
| Storage and computing | Tightly coupled | Loosely coupled |
Intended purpose
The data stored in a data warehouse has been deliberately gathered from external data sources with a particular purpose in mind. Upon retrieval, the source data is processed, sanitized, and otherwise readied for storing in a manner that’s optimal for the planned query needs. This processing is computationally expensive in both time and infrastructure (electricity, computing resources, interim storage needs).
In a data lake, data is aggregated without specific purposes, use cases, or even audiences in mind. It is purely for archival needs, to preserve every bit of information collected over a span of time, ready for the future demands of business analysis.
Audience
Business intelligence efforts require the warehouse’s aggregated, cleaned, well-structured data. This consistent data lends itself to mining for insights in a straightforward way that scattered source data would otherwise obscure. The data mining process may take the form of human data analysis, algorithmic machine pattern matching, or more intelligent AI models.
Data scientists, in contrast, are the primary consumers of the unstructured raw data stored in the lake. Having all the primary source data at hand enables the investigative process that determines what relationships and supporting data are available to the business intelligence community. Additionally, the purely unstructured nature of a data lake is unsuitable for most machine-automated pattern seeking.
Data structure
Data warehouses contain data that’s already been tidied into a planned format, ready for extraction for specific purposes. This provides maximum speed of consumption (at the cost of some flexibility). Data warehouses also typically prepare data with a shared timestamp schema that lends itself to machine learning training and time-bracketed data lookup.
Data lakes aggregate information from multiple sources in their native formats, ready for whatever purpose lies in store. This provides maximum flexibility (at some possible future computational cost). This aggregated data can be loaded into the lake faster (and accessed more quickly) because the pre-loading processing is side-stepped.
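To make that contrast concrete, here is a small sketch using pandas (the field names and sample records are hypothetical): the same events can stay in their native, inconsistent shape in the lake, or be coerced into one typed layout with a shared UTC timestamp column for the warehouse:

```python
import pandas as pd

# Lake-style records in their native shape: inconsistent field names,
# string-typed numbers, and differing timezone offsets (hypothetical sample).
raw_records = [
    {"uid": "u_1", "amount": "19.99", "ts": "2023-05-01T10:15:00+00:00"},
    {"user_id": "u_2", "amount": 42, "ts": "2023-05-02T09:00:00+02:00"},
]

# Warehouse-style preparation: coerce everything into one consistently
# named, typed schema with a shared UTC timestamp column.
df = pd.DataFrame(raw_records)
df["user_id"] = df["user_id"].fillna(df["uid"])      # unify the identifier field
df["event_ts"] = pd.to_datetime(df["ts"], utc=True)  # shared UTC timestamp schema
df["amount"] = pd.to_numeric(df["amount"])           # enforce a numeric type
warehouse_ready = df[["user_id", "event_ts", "amount"]]
print(warehouse_ready.dtypes)
```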
Access and update cost
Information from data warehouses is accessed primarily via specific, tailored queries that have been constructed to satisfy specific use cases such as customer 360 modeling, data visualization, or scorecard reporting. Updates to the warehouse occur only after the computationally intensive transformation process is completed for the new data. Choosing a data warehouse therefore balances the immediacy of getting updated data against the need to format new data before storing it.
In contrast, updated data inserted into a data lake may be accessed more quickly, because the sanitizing process doesn’t happen (saving the costs of computation). Having access to the raw data may provide more ad hoc opportunities for exploring the data, executing proof-of-concepts, and self-service, free-form prediction modeling.
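As a small sketch of that immediacy (using DuckDB as an example query engine; the local data-lake/ directory and field names are hypothetical, following the lake layout sketched earlier), raw files can be profiled directly, with no load or transformation step in between:

```python
import duckdb

# Ad hoc exploration straight over raw lake files: DuckDB infers a schema
# from the JSON as it reads, so nothing has to be loaded or cleaned first.
con = duckdb.connect()
profile = con.execute("""
    SELECT
        count(*)         AS events,
        min(received_at) AS earliest,
        max(received_at) AS latest
    FROM read_json_auto('data-lake/events/raw/dt=*/*.json')
""").df()
print(profile)
```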
Access model
Data warehouses require a computationally intensive process of cleansing and transforming information prior to storing. Technically, this scheme of organizing data is called schema on write, since you can think of data warehouses as highly scalable databases with fixed database and table schemas.
In the planning process for deploying the data warehouse, a series of exercises must be undertaken to determine the ultimate schema (organization and relationships) for storing the data, one that provides the most usable data to the most audiences. Changes in requirements demand updates to the schema and process to insert data, typically a time- and labor-intensive process.
Data lakes, in contrast, store data in their native formats. Information is extracted and transformed on the fly for each need. This is known as schema on read, since data is stored in raw files, each of which could contain different data schemas. It is therefore impossible to know the full schema for all data before the query. As with the overhead benefits of updates and storage, schema on read offloads the work of designing a schema to later users of the data, while sidestepping the potential debt of a data warehouse schema migration.
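The two access models can be sketched side by side (a hedged illustration using pyarrow and plain JSON files; the paths, field names, and schema are hypothetical):

```python
import json
import pathlib
from datetime import datetime, timezone

import pyarrow as pa
import pyarrow.parquet as pq

# Schema on write (warehouse-style): the table layout is fixed up front,
# and every write must conform to it before anything is stored.
warehouse_schema = pa.schema([
    ("user_id", pa.string()),
    ("event_ts", pa.timestamp("us", tz="UTC")),
    ("amount", pa.float64()),
])
batch = pa.table(
    {
        "user_id": ["u_1"],
        "event_ts": [datetime(2023, 5, 1, 10, 15, tzinfo=timezone.utc)],
        "amount": [19.99],
    },
    schema=warehouse_schema,
)
pq.write_table(batch, "orders.parquet")  # stored already shaped for its intended queries

# Schema on read (lake-style): files keep whatever shape they arrived in,
# and each reader discovers the structure only when it reads them.
raw = [
    json.loads(path.read_text())
    for path in pathlib.Path("data-lake/events/raw").rglob("*.json")
]
fields_seen = {key for record in raw for key in record}  # the schema emerges here, not at write time
print(fields_seen)
```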
Storage and computing
Just as the early web deliberately separated content and structure (HTML) from formatting (CSS) to make stylistic changes easier, data lakes and data warehouses differ in how tightly they couple data storage to the computation that processes it.
Data warehouses, with their pre-storage computation needs, have a “tightly-coupled storage and computation” relationship. Increasing storage capacity demands a proportional increase in processing needs.
Incidentally, individual data warehouses can reach enormous scale: the largest on record, built with SAP technology, holds over 12 petabytes of information.
Separating the data storage from the transformation computation — technically called “decoupled storage and computation” — optimizes the business’ costs by matching storage requirements against the frequency of data access. Archiving business data in less expensive tiers — the data lake — saves money and provides more immediate availability of the information. As mentioned before, this immediacy enables staff to more quickly begin experiments in data exploration, analysis, and modeling.
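One concrete form that tiering can take (a sketch assuming the hypothetical example-data-lake bucket on Amazon S3 and the boto3 client) is a lifecycle rule that migrates aging raw objects to cheaper storage classes without touching the warehouse:

```python
import boto3

s3 = boto3.client("s3")

# Move raw lake objects to colder, less expensive storage as they age;
# the bucket name and prefix are hypothetical.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-raw-events",
                "Filter": {"Prefix": "events/raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},  # infrequent access after 90 days
                    {"Days": 365, "StorageClass": "GLACIER"},     # archive after a year
                ],
            }
        ]
    },
)
```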
Data warehouse or data lake: which is better?
Deciding which data storage solution to implement depends upon a delicate balance between needs, value extracted from data analysis, and the costs of infrastructure, storage, and computation.
Organizations that require great agility and work on smaller quantities of information may choose the way of the data lake. For example, if you can load somewhat structured event or ETL data into a data lake, its query capabilities will often suffice for many small to mid-sized organizations’ reporting and analysis needs.
However, those in industries where far greater quantities of data exist, and where the data must be massaged in order to be most useful to the greatest audience, may choose the data warehouse.
Some may choose both to provide maximal flexibility. For example, you can use a cost-effective data lake for the raw storage of event stream and ETL data, but combine it with a data warehouse that selectively loads data from the data lake to support more advanced business needs like data modeling or connections to business intelligence tools.
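A minimal sketch of that hybrid flow (with DuckDB standing in for the warehouse engine; the paths, table, and fields are hypothetical) loads only a modeled summary out of the raw lake files:

```python
import duckdb

# Hybrid pattern: raw files stay in the lake; only a compact, query-ready
# summary is materialized as a warehouse table for BI tools to hit.
con = duckdb.connect("warehouse.duckdb")
con.execute("""
    CREATE OR REPLACE TABLE event_summary AS
    SELECT
        action,
        count(*)                AS events,
        count(DISTINCT user_id) AS unique_users
    FROM read_json_auto('data-lake/events/raw/dt=*/*.json')
    GROUP BY action
""")

# Dashboards query the small modeled table instead of scanning raw files.
print(con.execute("SELECT * FROM event_summary ORDER BY events DESC").df())
```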
In conclusion, a data lake is a cost-effective way to store vast amounts of raw data, at the cost of more complicated querying and data modeling down the road. A data warehouse is a larger investment but offers higher performance and extra features. Many organizations use a combination of both to balance cost with functionality.