How to Move Data in Data Warehouses
Business Intelligence, or BI, alone may be unsuitable or ineffective in indicating the next steps for your business. Even though it enables you to analyze data to surface trends that help you arrive at actionable insights, the sourcing, moving, and storage of the data it uses is equally important. In order to develop a data-led culture and make BI operational, companies need to ensure simplified access to accurate, unified, and real-time data. It’s here that data warehousing is invaluable.
Data warehousing systems have been a traditional part of business intelligence suites for no less than three decades now. However, they have taken on a life of their own, morphing in tandem with the emergence of new data types and data hosting techniques. A data warehouse is like any physical warehouse; it’s a container for data from multiple sources. It’s beneficial to use a data warehouse if you have to make data-informed decisions, consolidate or merge data from multiple sources, or analyze historical data.
In this article, we’ll explore the various ways that data enters, moves through, and exits warehouses, as well as how it’s formatted and stored. You’ll also learn how a successful data warehouse strategy helps businesses understand their current position and set measurable benchmarks to drive sustainable long-term growth.
A High-Level Overview of How Data Passes through a Data Warehouse
A data warehouse, or formally, an enterprise data warehouse (EDW), is a go-to repository of information in an organization. This central repository exists to ease access to information for analysis to bring about informed decision-making.
How Does Using a Data Warehouse Make Collected Data More Useful?
When organizations collect data, the process usually involves several sources. A data warehouse is similar to a consolidating tank for all those data streams, including transactional systems and relational databases.
However, the data isn’t quite ready for use at the time of collection. In a nutshell, a data warehouse works to merge information from various sources into one comprehensive database with usable data. There are several types of collecting tanks that a company could decide to use. Therefore, it’s important to differentiate common data collection entities, such as data warehouses, databases, and data lakes.
First, for a business to thrive in the face of modern competition, it must pay attention to data and analytics. As a result, dashboards, reports, and analytics tools have become prevalent for mining insights from their data, monitoring business performance, and supporting decision-making. Besides data analysis, the data store also enables data mining, artificial intelligence (AI), and machine learning.
Data warehouses are the engines behind these analytics tools, dashboards, and reports. They store data efficiently in a way that minimizes data I/O (input and output) and concurrently deliver query results to a large number of users, ranging from hundreds to thousands.
A database captures and stores data in tabular format using a schema. Data could be the details of an e-commerce transaction, for instance. This includes items in carts, email addresses, credit card information, and billing and shipping addresses. So, for each customer, you can have a database of all their purchases and the attendant payment and shipping information.
On the other hand, a data lake is a data container designed to hold data in its native or raw format. A data lake supports scalability to massive amounts of data. There are three broad classes of data in a data lake:
- Structured data
- Semi-structured data
- Unstructured data
In order to store and secure large amounts of these three classes of data lake elements, many opt to use data lakes mainly for their analytic performance and native integration.
Having considered tabular databases and free-form data lakes, we can now revisit the concept of data warehouses. One may call them a hybrid of sorts, since data warehouses need all data organized in tables to allow the use of SQL to query the data. But, not every application needs tabular data. Big data analytics, full-text search, and machine learning can access unstructured or semi-structured data.
In the data warehouse, processing, transformation, and ingestion occur before users access the processed data using BI tools, SQL clients, and even simple spreadsheets.
Thus, a data warehouse allows companies to analyze their customers more thoroughly. As a convenient repository for data analytics, it involves reading large amounts of data to detect relationships and trends. Using a data warehouse automatically means you have to consider every possible source of information in terms of volume and type. Furthermore, data mining patterns become easier to spot when using a data warehouse, leading to better sales and profits.
How Do Data Warehouses Receive and Manage Data?
The primary role of a data warehouse is to aggregate multi-source data into a single, central, and consistent store of data. Relational databases, transactional systems, and other sources funnel data into a data warehouse. Business analysts, data engineers, data scientists, and critical decision-makers then access the data using business intelligence tools.
It used to be common to have a data warehouse located on-premise, typically on a mainframe computer. Its primary functionalities were data extraction from other sources, cleansing and preparing the data, and loading and maintaining the data in an RDBMS (Relational Database Management System).
However, more modern data warehouses might feature on a dedicated appliance or on the cloud. This enables remote administration of the data warehouse and makes it easier for authorized personnel to carry out their work functions outside the confines of the office. Many even ship with analytical tools and the ability to perform data visualization and presentation. Such tools are indispensable in the modern enterprise for building products and services that capture the most essential needs of customers.
Variations on How a Data Warehouse Can Receive and Return Data
Better decisions are possible with more data sources; this is the crux of analytics and business intelligence. It is essential to explore how businesses can receive and return data from data warehouses.
Data ingestion describes the transportation of data from different sources to a storage hmedium, such as a data warehouse, where it can be accessed, used, and analyzed by the organization.
Data ingestion precedes data digestion. The ingestion layer is the core of any analytics architecture. Consistent and accessible data is crucial to downstream reporting and analytics.
There's an assortment of options to ingest data, and the design of a specific ingestion layer may depend on several models and architectures.
Basic data warehouse architecture diagram
Data Ingestion Methods
Specific business requirements and constraints determine the appropriate data ingestion layer to use. This supports an optimal data strategy, and businesses will choose the appropriate model for each data source depending on how quickly and often they’ll need analytical access to the data.
Batch processing is the most common data ingestion type. Here, the ingestion layer collects and groups source data periodically and then sends it to the destination system. The processing of data groups depends on any logical ordering, activation of specific preconditions, or a simple schedule.
Where near-real-time data is not paramount, batch processing is a logical option. It's more accessible and more affordable than streaming data ingestion.
Also known as streaming or real-time processing, this kind of processing requires no grouping. Data sourcing, manipulation, and loading happen as soon as the data is created or recognized by the ingestion layer.
Stream processing is more resource-intensive, and therefore expensive. It requires systems to monitor sources continuously and accept new information. It works great for analytics requiring continually refreshed data.
Some streaming platforms, such as Apache Spark Streaming, use a variant of batch processing that involves smaller ingestion groups or smaller preparation intervals (typically a few minutes), but with no individual processing.
Imagine a large e-commerce website that wants to know how the recent overhaul of its user interface impacted user behavior, as reduced conversion rates could mean lower revenues. Waiting for a full day as you might with batch processing may not be feasible. Therefore, micro batching may be a suitable alternative.
With micro batch processing, your data is available in near real time. Don’t be surprised to find some data architecture descriptions that use the term interchangeably with event stream processing (which we’ll talk about more later). However, many analysts still consider micro batch processing a distinct category of data ingestion. If you need fresh data—not precisely real-time, though—such as web analytics or user behavior, you’ll probably need to look no further than micro batch processing.
Loading Raw Files into a Data Warehouse Using Python and PostgreSQL
Another way to accomplish data ingestion in a data warehouse is by loading data via automation with software code. One way to achieve this is using the open-source Python programming language and PostgreSQL database together. Once you install PostgreSQL, which should be a breeze, you’ll need to provision a database by creating one. The command is simply
CREATE DATABASE lusiadas;.
Then, you’ll install the
sqlalchemy package to connect your database through Python. Type the following code into your shell program or Anaconda prompt:
>>> pip install sqlalchemy
Then download, install, and import all the libraries you'll need in your Python script:
python from sqlalchemy import create_engine import psycopg2 import pandas as pd import streamlit as st Now establish a connection to your
records_db database. Also, create a table in which to store arrays and records.
You’ll also need to create a separate connection to the
datasets_db database that your data sets will go into:
python engine = create_engine("postgresql://<username>:<password>@localhost:5432/records_db") engine_dataset = create_engine("postgresql://<username>:<password>@localhost:5432/datasets_db") engine.execute("CREATE TABLE IF NOT EXISTS records(name text PRIMARY KEY, details text")
Note that Postgres’s naming conventions require table names to begin with underscores or letters only. Additionally, they may not contain dashes and must be less than 64 characters long.
Your records table needs a
name field that’s a
text data type. Declare it as a
PRIMARY KEY. Another
details field should be added as
text to tell Postgres that it's a single-dimension array.
You'll use the primary key
name field to search records when interacting with the data.
As a side note, it's more secure to store database credentials in a configuration file and then invoke them as parameters in your code.
Now, you can create five functions to write, update, read, and load raw data to and from your database:
```python def write_record(name, details, engine): engine.execute("INSERT INTO records (name, details) VALUES ('%s', '%s'" % (name, details))
def read_record(field, name, engine): result = engine.execute("SELECT %s FROM records WHERE name = '%s'" % (field, name)) return result.first() def read_record(field, name, new_value, engine): engine.execute("UPDATE records SET %s = '%s' WHERE name = '%s'" % (field, new_value, name)) def write_dataset(name, dataset, engine): dataset.to_sql('%s' % (name), engine, index=False, if_exists='replace', chunksize=1000) def read_dataset(name, engine): try: dataset = pd.read_sql_table(name, engine) except: dataset = pd.DataFrame() return dataset def list_datasets(engine): datasets = engine.execute("SELECT table_name FROM information_schema.tables WHERE table_schema = 'public' ORDER by table_name;") return datasets.fetchall()
Comparing Data Sources and How They Inform Data Warehouse Architecture
Every data warehouse architecture consists of tiers. The top tier is the front-end client that enables the presentation of results through analysis, reporting, and data mining tools. Tools for accessing data warehouses include analytics applications and SQL clients.
The middle tier comprises the analytics engine helpful in accessing and analyzing data.
Lastly, the bottom tier of a data warehouse architecture is the database server, where data is loaded and stored. Data may be stored in fast storage (such as SSD drives) for frequent access, or in a cheap object store like Amazon S3 for infrequent access.
The data warehouse will automatically move frequently accessed data into fast storage to optimize query speed. Most data warehouse architectures will deal with structured data: most will come from streams, though others will result from transformation processes such as ETL and reverse ETL. Let’s take a closer look at each of these data sources.
Like event-driven architecture, event streaming centers around events. An event is a record of a past occurrence. It could be a mouse click, a keystroke, or a successful loading of a program.
In event streaming services such as Apache Kafka, streams of events are published to a broker. The consumers of event streams can access each stream and consume the events they want. The broker retains those events.
Some view event stream processing as a complement to batch processing. But, the latter focuses on taking action on a large set of static data or data at rest, whereas event stream processing takes action on a constant flow of data or data in motion.
Event streams work when you need instant action on data; therefore, you'll hear experts call it "real-time processing." Streaming data warehouse architecture would then comprise software components built and wired together to process streaming data from several sources as soon as the data is available or collected.
It's essential to consider the characteristics of the data streams when designing event stream architecture. They usually generate massive volumes of data and require further pre-processing, extraction, and transformation to be of any significance.
Event streams are prevalent in data warehouse implementations such as big data integration, natural language processing (NLP), tactical reporting, and auditing and compliance.
Extract-Transform-Load (ETL) is a three-step data integration process for synthesizing raw data from a data source to a data warehouse or other destination such as a relational database or data lake.
ETL blends data from multiple sources to build the data warehouse. That is, you take data from a source system, transform (convert) it into an analysis-friendly format, and load (store) it into your data warehouse.
The idea behind ETL is to push processing down to the database and improve performance. It provides a deep historical context for the business when combined with an enterprise data warehouse. It also makes things easier for business users to analyze and report data essential to their work.
ETL codifies and reuses processes that move data, but doesn't require coding skills.
ETL has an exact opposite process called a reverse ETL, in which data is copied from the data warehouse into systems of records across a company. There are times where it may be necessary to move data out of the centralized data warehouse location.
One such scenario might occur if your data warehouse has also become a data silo, despite having acquired the warehouse to eliminate data silos, which prevent information sharing and collaboration across departments. Reverse ETL also allows your business's core definitions to live outside the data warehouse.
With reverse ETL, there are a few prominent use cases:
- Operational analytics, to feed insights from analytics to business teams for improved decision-making
- Data automation, such as when your finance unit needs to issue invoices using a CSV
- Data infrastructure, as a general-purpose software engineering pattern to handle the growing number of data sources in modern business
- Identifying at-risk customers and spotting signs of potential customer churn
- Driving new sales by correlating data from the CRM and other interfaces
Reverse ETL is the final piece in your data stack jigsaw puzzle. You have a warehouse to hold data, a reverse ETL to query and retrieve data, and a customer data platform as a real-time personalization smart hub capable of modeling and activating data in diverse business platforms.
Processing Data in a Data Warehouse
Imagine that a company needs to coordinate and manage its data transformation workflows. Some data analysts might schedule queries on tools such as BigQuery for transformation workflows before testing the transformed data.
However, many lack a comprehensive infrastructure to automate the building, compiling, testing, and documenting of SQL models. This limits the scalability of the process, but it's precisely where dbt shines. Let’s take a closer look at this tool that helps data analysts and engineers to transform data in a reliable, fun, and fast way.
dbt stands for data building tool. It is a command-line utility that enables data analysts and engineers to transform data in their warehouses using simple select statements. The analysts can use it to clean data and minimize time spent on manual checks. Thus, a company can write transformations as queries and efficiently orchestrate them.
One advantage of using dbt is that non-engineers can also use it, promoting shared data knowledge between engineering and non-engineering teams. It also supports a highly flexible data model, so it's easy to recreate data and backfills.
dbt also ensures that data warehouse-level transformations are a piece of cake, allowing you to count on in-built data quality testing. As easy as dbt is to use, it still includes reusable macros and an online, searchable data catalog and lineage.
Other Third-Party Tools
Besides dbt, other third-party tools for processing data in a data warehouse include:
| Tool | Use(s) | | ------------- |---------------| | BigQuery | Cloud data warehouse tool | | CloverDX | Data integration platform for full control | | Dundas | Dashboard, analytics, and reporting tool | | Hitachi Vantara | Open-source tool for analytics and BI | | PostgreSQL | Open-source object-related database system | | QuerySurge | RTTS-developed solution for ETL testing | | SAS | Accessing data from various sources | | Sisense | BI tool for real-time analysis and visualization | | Solver | Reporting, data storage, and interactive dashboards | | Tableau | Data visualization in BI | | Teradata | Displaying and handling large quantities of information |
What Does Access Look Like with These Methods?
Data access is a crucial aspect of storing data in a data warehouse. Now let’s turn our attention to two essential methods for accessing data in a data warehouse.
Data access using OLTP, or online transactional processing, enables real-time execution of large numbers of database transactions by large numbers of people simultaneously.
OLTP typically takes place over the internet, and its design mainly supports transaction-oriented applications that need to process recent transactions quickly and accurately.
You’ll find OLTP in active use for ATMs, credit card payment processing, e-commerce software, online booking, record-keeping tools, and reservation systems.
OLAP, or online analytical processing, is software for performing high-speed data analysis in multiple dimensions.
Online analytical processing has routine applications in enterprises that generate large volumes of data from one unified and centralized data store, such as a data warehouse. Its typical applications are in data mining (besides other BI applications), complex analytical calculations, and predictive scenarios.
Other uses for OLAP include business reporting functions, including budgeting, financial analysis, and forecast planning.
The fundamental difference between OLAP and OLTP is that whereas OLAP is analytical in nature, OLTP is transactional (that is, based on interactions between users and a data store).
A data warehouse provides an efficient way to manage multiple-source data. Businesses invest in a data warehouse architecture because of the edge it offers them in terms of analytics, data mining, insights, and decision-making. To win and retain customers and drive sales, paying attention to data generated during the buyer journey is essential.
It’s important to mention in closing that it's not enough to move data into data warehouses. It's even more important to deploy an active maintenance routine. This ensures that column, table, schema, and database names are straightforward and consistent, and that only authorized users will run simple select statements using the root user account.
Also, this means that the public schema and public role of a properly maintained data warehouse are used sparingly. Ownership, privileges, and lineage are never ambiguous in good data warehouses. All of these factors combine to efficiently provide answers to business questions and data administration questions.