🗓️ Live Webinar August 17: How Allbirds solves identity resolution in the warehouse with dbt Labs, Snowflake, and RudderStack

Register Now

Key Concepts of a Data Warehouse

This article covers key concepts of the data warehouse, comparing its capabilities with that of relational databases, data marts, and data lakes — all common solutions to the challenges of consuming large, varied types of data.

A data warehouse (DW) pulls together data from different sources into a single target for business intelligence (BI) analysis and support for strategic decisions. It is sometimes referred to as an enterprise data warehouse (EDW).

Introduction to data warehouses

One of the primary challenges to running a business, or any other data-centric operation, is making good decisions based on data that may be scattered far and wide. A follow-on challenge is capturing institutional knowledge into discoverable data. The general field that covers these challenges is called business intelligence.

Data warehouses were created to address this critical business need. DWs contain the aggregation of multiple sources (possibly scattered across the internet) into a single central data store. This aggregated consistent data lends itself to mining for truths in a straightforward way that the scattered data obscures. This mining may take the form of human data analysis, machine data mining (for significant trends), artificial intelligence (AI), or machine learning (ML).

In addition to the benefits provided by a single source of truth, data warehouses have been engineered to store quantities of data that challenge traditional database implementations. The current record holder for largest data warehouse is SAP, holding 12.1 petabytes.

Databases, data marts, data lakes

Colloquially, the terms databases, data warehouses, data lakes, and data marts are used interchangeably and imprecisely. While all deal with aggregated data, important differences exist between these concepts.

Databases

Database software has traditionally been a monolithic solution for acquiring and storing data from a well-defined source application, like online order systems. Databases are designed for speedy transaction storage and processing, coupled with a query language optimized for reporting (as opposed to analytics). Databases generally deal with updating real-time data.

Data warehouses, in contrast, are designed to store data in varying formats from many different sources. They capture current and historical data to use for predictive analytics, as inputs for machine learning, and for other model generation and analysis.

Data marts

Whereas data warehouses contain the aggregation of multiple data sources and are ready for varied uses across a business enterprise, data marts contain data specific to a particular business department or function. This smaller subset of data enables faster discovery of focused insights.

Data lakes

Data warehouses gather data from multiple sources into a central access point. That data is then transformed into structures specified in predefined schemas designed for data analytics. (More about these transformations is covered in “The architecture of a modern data warehouse,” below.) Schemas used in data warehouses will be covered in a separate article.

Data lakes store data without the use of predefined schemas, designed for consumption by big data platforms such as Apache Hadoop or Google BigQuery and Google Cloud Dataproc.

Benefits of cloud-based data warehouses

In recent times the option to store data warehouses in the cloud has made these endeavors more affordable, more reliable, and easier to deploy than purchasing and maintaining on-premises hardware. Cloud-based data warehouses are:

  • Speedy: With the chores of purchasing, installing, and maintaining hardware taken on by cloud computing companies, the process of instantiating a cloud-based data warehouse is reduced to a short online session. Specifying initial configurations for the warehouse and sandboxes and watching them become available is short work.
  • Inexpensive: Compared with the cost of acquiring, operating, and expanding on-premises hardware, data warehouse-as-a-service (DWaaS) schemes charge only for resources as needed.
  • Elastic: Upgrading capacity to handle heavier workloads is trivial compared to the arduous and expensive process of improving local hardware. Budgeting separately for storage and computation makes it easier to demonstrate the value of ongoing operations. The use of virtualized computing appliances in a highly distributed environment makes it much easier to spin up instances of known configurations, boosting reliability.
  • Up to date: Cloud providers handle the acquisition, deployment, and maintenance of software. This makes the timely integration of new technologies and software packages uncomplicated.
  • Secure: Regulatory and public-relations needs for data security are more easily satisfied with competent management of challenging technologies like cryptographic key management services (KMSs).
  • Robust: Disaster mitigation and recovery are provided for by off-premises implementation of distributed duplication of data and periodic backups.

The architecture of a modern data warehouse

Generally, the process of populating a data warehouse consists of:

  • extracting data from other source
  • cleansing and preparing the data into a well-defined format
  • loading the data into the warehouse
  • maintaining and validating the data integrity on an ongoing basis

Before we delve into the details of these processes we should explore the architecture of a modern data warehouse.

Architecture, in this context, refers to the functional structure of the primary computing roles in running a data warehouse. These roles can reside on one piece of hardware or divided among several.

Data warehouses generally have a three-tiered structure, sometimes referred to as data, semantics, and analysis layers.

Collecting and cleaning data—the bottom tier

The bottom tier, the entry point for data into the warehouse, retrieves data from external sources, cleanses and sanity-checks the data, and transforms the data into a well-defined format.

This process is commonly either one of Extract, Transform, and Load (ETL) or Extract, Load, and Transform (ELT).

Searching and modifying data—the middle tier

Once the data has been converted into the desired format, it’s ready to be searched and modified. These actions, grouped into analysis and transaction-processing, depend upon the nature of the data and the ultimate needs of the data owner.

OLAP (online analytical processing) performs multidimensional analysis of the aggregated data for typical business intelligence needs, data mining, and predictive scenarios. The results support financial analysis, budgeting, and forecasting.

OLTP (online transaction processing) performs large-scale real-time execution of transaction-oriented processing such as is needed by e-commerce platforms, including online booking and reservation systems, credit card payments and ATM transactions, and other record-keeping.

Digging through the aggregated data—the top tier

The top tier implements interactions with the aggregated data: either canned business intelligence reporting or live ad-hoc queries. These access tools allow users to interact with the data. Users can create dashboards and reports, monitor key performance indicators (KPIs), and use a workbench or sandbox area for interactive data exploration and the development of new analysis models.

Types of data warehouses

Knowing the use cases, benefits, and architecture of data warehouses will help you decide which type(s) of warehouse to use.

On-premises data warehouse software

Purchasing data warehouse software for deployment on on-premises infrastructure is a good choice if you need to comply with strict security protocols or mandated data privacy regulatory standards. This enhanced control over data is typically seen in financial institutions, government entities, and for-profit research organizations.

Cloud-based data warehouse

The advantage of this managed data warehouse as a service (DWaaS) is that off-premises entities deal with the complexities of acquiring, deploying, and managing computing resources. You don’t need to make an upfront investment in fixed infrastructure, you have the flexibility of changing to handle different capacity needs, and the service includes canned backup and disaster recovery systems. This makes DWaaS a very attractive option, especially if you want to reduce your on-premises data center footprint.

Data warehouse appliance

Bundling together hardware and software into a plug-and-play business solution, data warehouse appliances are an intermediate option between on-premises and cloud-based data warehouses. The decisions about computing hardware, operating system software, and data warehouse software and configuration are made by a third-party provider. They dictate the upfront costs, types of administrative control and data access tools, and scalability options.

Data warehouse best practices

Whether you’re starting from scratch or enhancing an existing data warehouse, we’ve highlighted some industry best practices below.

  • Define what success looks like. Understand what types of reporting and analyses would satisfy your needs. A clear expectation of the end result will help to inform the following steps; not having a good vision of success will result in wasted time, effort, and cost.
  • Determine the information required for success. Identify the data sources needed to support the required analyses, whether they live within your organization or are provided externally by authoritative entities.
  • Document the structure, quality, location, and availability of data. Knowing the freshness and quality of what’s available sets realistic expectations for timely insights. Knowing the location and availability of data informs the operations needed to retrieve it. Identifying providers of data and establishing contracts for continued data availability are key to making promises to downstream customers.
  • Establish data gaps and transformation rules. Bridge the gap between what data you have and the data you’ll need to achieve your goals.

Conclusions and next steps

Data warehouses are increasingly becoming essential to business intelligence, allowing key personnel access to fresh, germane data derived from a wide variety of sources. Combining home-grown data from internal business systems and authoritative information from external sources, data warehouses are the foundation for informed decision-making.

Data warehouses offer fast, complex data mining and analytics that complement existing business systems. They support the needs of staff, management, and executives for flexible models, reports, KPIs, dashboards, and situational alerts.

Customer Data Platform for Developers | RudderStack
HIPPA Compliant
SOC 2 TYPE 2