Best Practices for Accessing Your Data Warehouse
A lot of the data businesses generate every day, such as emails, product analytics events, or customer data, needs to be analyzed for business intelligence purposes. In order for this data to be useful, it needs to be in a format that’s suitable for analysis. A data warehouse is frequently used to address this need—it stores data in a specified model that makes it readily available for analysis. A data warehouse is usually accessed by the company’s employees through business intelligence tools, internal applications, dashboards, and sometimes SQL queries.
In this article, you will learn more about how data is stored in a data warehouse, how it’s accessed, some best practices for writing SQL queries for the data warehouses, who the major cloud data warehouse vendors are, and the various types of data storage models available.
Background on Data Warehouses
In the modern data analytics world, a data warehouse is defined as a repository where data is cleaned, transformed, organized into a specific data model, and then stored. Consider the following two use cases for a data warehouse.
- Suppose you are working for a company that owns a SaaS product. The application is an aggregator where customers can buy tickets for multiple airlines, trains, and buses. Since the tickets are ultimately purchased from the individual transit vendors, you can connect to those vendors, extract their data, clean and transform it into a data model that your company accepts, and then store it in a data warehouse. Once the data is in the data warehouse, in addition to being used to inform your product, it can also be used by downstream applications for machine learning or data visualization.
- Another use case would be a banking company. They store data about their customers, including personal information, transaction history, credit card activity, and loans. In order to consolidate all their data in one place, the banking company might want a data warehouse that would clean, transform, and store their highly sensitive data. Machine-learning algorithms can be run on top of the data warehouse to extract insights such as credit risk, or to raise alerts about potentially fraudulent transactions.
Data Access Methods
Data access methods are the various ways users can query or access the data being stored in the databases. There are multiple ways in which a user can get access to the data, some of which are discussed below.
If a user has direct access to the database or a table within the database, then they may write an SQL query that can read and display data from the table. However, it’s important that they optimize the query correctly. If it’s not optimized, the execution might take longer, which can affect the performance of the database and increase cost due to unnecessary resource use.
A secure way of accessing data in large operational databases is by making intermediate tables, which run on separate instances from the primary database and contain a subset of the primary database’s data. This way of accessing data is much more secure, as it eliminates the risk of running unoptimized queries on the primary database and accidentally updating or deleting any records.
Views are wrappers on SQL queries that are ephemeral in nature. They can be used like tables when you write queries on top of them, but their results are not stored in the warehouse; they are computed only when a query runs on them. Some warehouses optimize how views are treated by caching previous results and reusing them if the data in the underlying tables doesn't change. Views are a great way to break down complex queries into manageable chunks.
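The behavior of a view can be sketched with Python's built-in sqlite3 module standing in for a warehouse. The bookings table and the revenue_by_vendor view are hypothetical names chosen for illustration, and a real warehouse may cache or optimize views differently.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a warehouse (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bookings (id INTEGER PRIMARY KEY, vendor TEXT, amount REAL);
    INSERT INTO bookings (vendor, amount) VALUES
        ('airline', 120.0), ('train', 40.0), ('airline', 80.0);

    -- The view stores only the query text, not the rows it returns.
    CREATE VIEW revenue_by_vendor AS
        SELECT vendor, SUM(amount) AS revenue
        FROM bookings
        GROUP BY vendor;
""")

VIEW_QUERY = "SELECT vendor, revenue FROM revenue_by_vendor ORDER BY vendor"

before = conn.execute(VIEW_QUERY).fetchall()

# Because the view is evaluated at query time, new rows show up immediately.
conn.execute("INSERT INTO bookings (vendor, amount) VALUES ('train', 10.0)")
after = conn.execute(VIEW_QUERY).fetchall()

print(before)
print(after)
```

The second read reflects the newly inserted row without any refresh step, which is the trade-off a plain view makes: nothing is stored, so the query is recomputed on every read.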
Materialized views are similar to views in that they're the result of SQL queries on top of other tables. Unlike views, though, they're materialized (that is, the resulting data is stored), which means that they come with additional storage costs. The advantage is that if they are used in multiple downstream queries, the results don't need to be recomputed every time.
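SQLite has no native materialized views, so the sketch below only emulates one by storing a query result in a regular table and rebuilding it on demand. The names revenue_snapshot and refresh_revenue_snapshot are hypothetical, and real warehouses provide their own syntax and refresh behavior for materialized views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bookings (vendor TEXT, amount REAL);
    INSERT INTO bookings VALUES ('airline', 120.0), ('train', 40.0);
""")


def refresh_revenue_snapshot(conn):
    """Emulate a materialized view refresh: recompute and store the result."""
    conn.executescript("""
        DROP TABLE IF EXISTS revenue_snapshot;
        CREATE TABLE revenue_snapshot AS
            SELECT vendor, SUM(amount) AS revenue
            FROM bookings
            GROUP BY vendor;
    """)


def read_snapshot(conn):
    """Read the stored rows; no aggregation happens at read time."""
    return conn.execute(
        "SELECT vendor, revenue FROM revenue_snapshot ORDER BY vendor"
    ).fetchall()


refresh_revenue_snapshot(conn)
conn.execute("INSERT INTO bookings VALUES ('train', 10.0)")

stale = read_snapshot(conn)      # stored result does not include the new row yet
refresh_revenue_snapshot(conn)
fresh = read_snapshot(conn)      # the refresh picks up the new row

print(stale)
print(fresh)
```

Reads are cheap because the aggregation has already been done, at the price of extra storage and a result that can lag behind the base table until the next refresh.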
Access management involves assigning users to specific groups, then assigning group-specific roles to those groups. Individual users aren’t granted any specific role, but inherit the roles of the group to which they belong. This is especially helpful if you have a large user base. As the number of users grows, granting and maintaining data access for each user becomes a complicated, time-consuming task, and can lead to problems like lapses in needed access, or users who have more access than they need. Group-based access policies solve these problems, making sure that every group on your team has access to what they need—and nothing else.
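The group-based model described above can be sketched in a few lines of Python. The group names, role names, and users here are hypothetical, and in practice these mappings live in the warehouse or identity provider rather than in application code.

```python
# Hypothetical mapping of groups to the roles they carry.
GROUP_ROLES = {
    "analysts": {"read_reports"},
    "engineers": {"read_reports", "write_pipelines"},
}

# Users are assigned to groups only; they never receive roles directly.
USER_GROUPS = {
    "alice": {"analysts"},
    "bob": {"analysts", "engineers"},
}


def roles_for(user):
    """Collect every role the user inherits from their groups."""
    roles = set()
    for group in USER_GROUPS.get(user, set()):
        roles |= GROUP_ROLES.get(group, set())
    return roles


def can(user, role):
    """Return True if the user inherits the given role from any group."""
    return role in roles_for(user)


print(can("alice", "write_pipelines"))
print(can("bob", "write_pipelines"))
```

Changing what analysts may do then means editing a single entry in GROUP_ROLES, rather than touching every analyst's account.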
Best Practices for SQL Queries
In this section, you will learn about some of the important aspects and best practices to consider while writing your SQL queries.
Common Table Expressions (CTEs)
When authoring SQL queries, developers often come across a situation where they need to create a temporary dataset that's then used to calculate the final results. While this approach could be achieved by introducing a view and using the view as an intermediate layer, most SQL databases provide a feature known as a common table expression, or CTE, that allows you to achieve the same result more easily.
CTEs can be used multiple times within the same query, but the results of the CTE are only available until the query has finished executing. A CTE can be used in a view, a stored procedure, or while writing complex ETL transformations based on SQL.
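As a minimal sketch, the query below defines a CTE once and references it twice in the same statement, using SQLite through Python's sqlite3 module. The bookings table and the customer_totals CTE are hypothetical examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bookings (customer TEXT, amount REAL);
    INSERT INTO bookings VALUES
        ('ann', 120.0), ('ann', 80.0), ('ben', 40.0), ('cat', 300.0);
""")

query = """
    -- customer_totals exists only for the duration of this statement.
    WITH customer_totals AS (
        SELECT customer, SUM(amount) AS total
        FROM bookings
        GROUP BY customer
    )
    SELECT customer, total
    FROM customer_totals                                    -- first use
    WHERE total > (SELECT AVG(total) FROM customer_totals)  -- second use
    ORDER BY customer
"""

above_average = conn.execute(query).fetchall()
print(above_average)
```

Once the statement finishes, customer_totals is gone; running a separate SELECT against it would fail, which is what distinguishes a CTE from a view.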
Data sanitization is the practice of removing or escaping special characters in user input before it reaches your SQL query. For example, user input might contain characters like *, #, $, @, %, and &. These characters can be misinterpreted by the query engine and result in undesirable code execution. Look at this article for more detail about why we should sanitize our inputs before they're executed by the SQL query engine.
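The sketch below shows the risk, and one common way to address it. Note that it uses parameter binding rather than the character removal described above: the driver passes the input as data, so special characters are never interpreted as SQL. The customers table and the find_customer helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (name TEXT, email TEXT);
    INSERT INTO customers VALUES
        ('ann', 'ann@example.com'), ('ben', 'ben@example.com');
""")


def find_customer(conn, name):
    """Look up a customer with a bound parameter instead of string formatting."""
    return conn.execute(
        "SELECT name, email FROM customers WHERE name = ?", (name,)
    ).fetchall()


malicious = "ann' OR '1'='1"

# Bound parameter: the whole string is compared as a literal name.
safe_rows = find_customer(conn, malicious)

# String formatting (shown only to illustrate the problem): the quote characters
# change the meaning of the WHERE clause and every row is returned.
unsafe_sql = f"SELECT name, email FROM customers WHERE name = '{malicious}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

print(len(safe_rows), len(unsafe_rows))
```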
Indexing a database means organizing data so that the columns used most frequently in queries can be looked up quickly. Indexing is used to boost the performance of SQL queries, but it comes at the cost of storage space. The more indexes there are, the more storage space is required, and the more it will cost. You should balance the need for indexing against the need to minimize how much storage it takes up. To learn more about indexing, you can read this answer on StackOverflow.
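A minimal sketch of creating an index on a frequently filtered column, using SQLite as a stand-in. The events table and the index name idx_events_user_id are hypothetical, and many cloud warehouses rely on mechanisms such as sort keys or clustering instead of explicit indexes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, kind TEXT)"
)
conn.executemany(
    "INSERT INTO events (user_id, kind) VALUES (?, ?)",
    [(i % 100, "click") for i in range(1000)],
)

# Index the column that queries filter on most often. The index is an extra
# on-disk structure, which is where the additional storage cost comes from.
conn.execute("CREATE INDEX idx_events_user_id ON events (user_id)")

# PRAGMA index_list reports the indexes defined on a table; column 1 is the name.
index_names = [row[1] for row in conn.execute("PRAGMA index_list('events')")]

match_count = conn.execute(
    "SELECT COUNT(*) FROM events WHERE user_id = ?", (42,)
).fetchone()[0]

print(index_names, match_count)
```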
Batching in SQL queries involves grouping similar queries together in a batch, then executing them together. When you execute multiple SQL queries one by one, each of those queries begins a transaction in the database, manipulates the data as required, and then commits the transaction. This creates multiple commits to the database, since each query commits. In a batch, however, all the queries in the batch will be executed, and only when they’re all done will the transaction be committed. In some cases, this is used to improve database performance.
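The difference between per-statement commits and a single batched commit can be sketched as follows, again with SQLite standing in for the warehouse. The events table is hypothetical, and the actual performance gain depends on the database and driver.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, kind TEXT)")

rows = [(i, "click") for i in range(500)]
insert_sql = "INSERT INTO events (user_id, kind) VALUES (?, ?)"

# One by one: every statement is followed by its own commit.
for row in rows[:5]:
    conn.execute(insert_sql, row)
    conn.commit()

# Batched: all remaining statements run inside one transaction. Using the
# connection as a context manager commits once at the end, or rolls back
# the whole batch if any statement raises an exception.
with conn:
    conn.executemany(insert_sql, rows[5:])

total = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(total)
```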
When you write and execute an SQL query, the database engine generates a plan that describes how the query is going to be executed. You can inspect this plan with EXPLAIN PLAN commands (the exact syntax varies by database), then rewrite your queries to optimize how they'll be executed. Many engines also cache the plan after the first execution and reuse it when you run the same query in the future.
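As an illustration, SQLite exposes its plan through EXPLAIN QUERY PLAN; other engines use different commands and output formats, so treat this only as a sketch of the workflow. The events table, the index name, and the plan_for helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, kind TEXT)"
)


def plan_for(conn, sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


sql = "SELECT kind FROM events WHERE user_id = 42"

# Without an index, the engine has to scan the whole table.
plan_before = plan_for(conn, sql)

# After adding an index on the filtered column, the plan switches to an
# index search.
conn.execute("CREATE INDEX idx_events_user_id ON events (user_id)")
plan_after = plan_for(conn, sql)

print(plan_before)
print(plan_after)
```

Comparing the plan before and after a change like this is a quick way to confirm that a rewrite or a new index has the effect you expected, before running the query against real data volumes.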
How to Adopt Modern Data Warehousing Tools and Access Data
Modern data warehousing tools provide cloud-based services that allow customers to start using them easily, without having to worry about infrastructure setup and maintenance. This allows businesses to focus on more important things, such as the development of the warehouse, the data pipelines, and implementing the business logic, without having to worry if it will be supported by the backend. The infrastructure of data warehouses is usually maintained by the vendors, and is kept abstract from the customers. The following are four of the most popular cloud data warehouse vendors.
- Snowflake
- Amazon Redshift
- Google BigQuery
- Azure Synapse Analytics
Let's look at each of them in detail to better understand how to adopt them.
Snowflake is a scalable, fully managed cloud-based data warehouse solution from Snowflake, Inc. You can implement your Snowflake instance on any public cloud vendor, including AWS, Azure, and GCP. This allows customers to easily integrate Snowflake with the cloud provider they are already using, keeping the overall experience the same. One of the things that makes Snowflake stand out is that it separates the compute nodes from the storage nodes, decoupling the two services. This means that if there aren’t any queries being executed, customers pay only for the storage during that time.
Snowflake also supports the storage of both structured and semi-structured data in its database schemas. This approach made Snowflake one of the first data warehouse systems capable of dealing with JSON-like data. Snowflake also offers several security measures to protect your data. Since it’s cloud based, you can control access to data by whitelisting IP addresses that are allowed access. It also offers both single sign-on and user-based multi-factor authentication. These factors make Snowflake one of the most popular data warehouse solutions on the cloud.
Amazon Redshift is an easy choice if customers are already building their infrastructure on the AWS cloud. Because it’s part of the AWS ecosystem, Redshift offers excellent integration with other AWS resources, including S3, Lambda, Kinesis, and Athena. This means that when customers want to build a customized data warehouse to fit their needs, integrating Amazon Redshift is very easy. One of the most commonly used features of Redshift is the ability to directly move data from S3 buckets into Redshift using the
Amazon Redshift is a column-oriented database, in contrast to traditional row-oriented databases. Because aggregate calculations typically read only a few columns, Redshift can perform them faster than a row-oriented store. It also leverages massively parallel processing, or MPP, which allows it to execute large data jobs in parallel within the distributed cluster. Amazon Redshift also supports compliance with major privacy regulations, such as the GDPR, HIPAA, and the CCPA.
BigQuery, a Google offering, has been popular since its inception. It gives users a petabyte-scale data warehouse on the cloud. Some of the main draws of BigQuery are that it is multi-cloud, can run machine-learning models within your data warehouse, and can analyze spatial or geographic data.
Azure Synapse Analytics
Azure Synapse Analytics is the cloud-based data warehousing service from Microsoft Azure, and is especially appealing to customers who are already using the Microsoft ecosystem. Azure Synapse Analytics provides easy integration with machine-learning tools within the Azure environment, which helps build analytics solutions faster. Data access in Azure Synapse Analytics can be managed by integrating it with the Azure Active Directory service, which allows SSO-based user access to the platform’s data.
Data Models for Optimizing Data Access
When storing data, the first thing customers usually look for is a database where they can store data in the form of tables. A table comprises rows and columns, and can be related to other tables within the database. This way of storing data is called the relational model, and the systems that implement it are relational database management systems. Relational databases have been popular since the earliest days of databases, and many essential modern business applications still run on them. However, with the rise of Big Data, the velocity and variety of the data being generated have skyrocketed. Relational databases are well suited to structured data, but when it comes to unstructured or semi-structured data, they’re often not as good a choice. In the following section, you’ll look at some other popular data models that can be used to store unstructured or semi-structured data.
A document store, or document database, is a system that stores information in documents, rather than in rows, as in relational databases. Documents are organized in collections, and a collection can store multiple documents. A document in a document database is usually stored in a JSON-like format. It’s easy to add new documents, and documents can be updated or deleted using a query language. Some popular examples of document databases are MongoDB, DocumentDB by Amazon, and Cosmos DB by Azure. Most of these vendors provide an API that can be used to query and access the data, and some also offer a SQL-like query syntax.
In order to optimize data access in a document database, the database should be partitioned and indexed properly. Another important practice is to agree on a document structure up front, even though the database does not enforce one, and to keep that structure as compact as possible.
- Schemaless: Document databases are schemaless, meaning you are free to choose the structure of each record at the time you write it.
- No foreign-key constraints: Records are not tied to other records by foreign-key constraints, which allows each document to be stored and retrieved independently of other documents and collections.
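To make the document model concrete, the sketch below uses a plain Python list of dictionaries as a stand-in for a collection. It is not the API of any particular document database; the collection contents and the find helper are hypothetical.

```python
import json

# A hypothetical "customers" collection. Each document chooses its own fields,
# and related data is nested inside the document instead of being joined in
# through a foreign key.
collection = [
    {"_id": 1, "name": "Ann", "emails": ["ann@example.com"]},
    {"_id": 2, "name": "Ben", "address": {"city": "Oslo", "zip": "0150"}},
]


def find(collection, **criteria):
    """Return the documents whose top-level fields match all criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


matches = find(collection, name="Ben")

# Documents serialize directly to JSON, which is how they are commonly stored.
serialized = json.dumps(collection[1])

print(matches[0]["address"]["city"])
print(serialized)
```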
A key-value database stores data as pairs of keys and values. It is a non-relational database that’s used to store data that can be identified by a unique key. Values are commonly stored as JSON, so nested JSON documents can easily be stored in a key-value database. When accessing the data, you use the key to look up the object directly, and the database returns the value it holds. One common use case for key-value stores is storing the session information of users who log into your website. They’re also commonly used for storing product details, or for managing e-commerce shopping carts.
- High throughput: With key-value databases, users can obtain a very high throughput while reading data. While in some analytical data stores, it may be acceptable for query execution to take several minutes, production-ready systems require quicker response times.
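The session use case can be sketched with a small in-memory class. KeyValueStore is a hypothetical stand-in rather than a real client library, and the key format and session payload are assumptions for illustration.

```python
import json


class KeyValueStore:
    """Minimal in-memory stand-in for a key-value database."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Values are stored as JSON text, so nested structures are preserved.
        self._data[key] = json.dumps(value)

    def get(self, key, default=None):
        # A lookup is a single direct access by key; no scan or join is needed.
        raw = self._data.get(key)
        return default if raw is None else json.loads(raw)

    def delete(self, key):
        self._data.pop(key, None)


sessions = KeyValueStore()
sessions.put("session:abc123", {"user_id": 42, "cart": [{"sku": "T-1", "qty": 2}]})

session = sessions.get("session:abc123")
print(session["cart"][0]["qty"])

sessions.delete("session:abc123")
expired = sessions.get("session:abc123")
print(expired)
```

Because every read and write addresses exactly one key, the work per request stays small and predictable, which is what makes high read throughput achievable.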
Graph databases are a special type of database that stores data in a graph-based semantic model using nodes, edges, and properties. Nodes are connected to each other by edges, which store information on how the nodes are related. Graph databases resemble the way many people would sketch connections between entities on a whiteboard, or in a family tree. You might also consider a social network like LinkedIn, where there are many users, each of them connected to some (but not all) of the other users. This relationship can be modeled in a graph database: the nodes represent the different users, and the edges represent the relationships between them. Popular graph databases in today's market include Neo4j, Amazon Neptune, and Azure Cosmos DB (through its Gremlin API).
- Schemaless: Graph databases are schemaless, which provides flexibility while storing data.
- Performance: These databases offer high performance for connected data, as relationships are stored directly as edges and can be traversed without expensive joins.
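The social network example can be sketched with dictionaries for nodes and edges and a breadth-first traversal. The people, their properties, and the degrees_of_separation helper are hypothetical; a real graph database would express this as a query in its own language.

```python
from collections import deque

# Nodes with properties (hypothetical users of a social network).
nodes = {
    "ann": {"title": "Engineer"},
    "ben": {"title": "Analyst"},
    "cat": {"title": "Designer"},
    "dan": {"title": "Manager"},
}

# Undirected "connected to" edges, stored as an adjacency mapping.
edges = {
    "ann": {"ben"},
    "ben": {"ann", "cat"},
    "cat": {"ben"},
    "dan": set(),
}


def degrees_of_separation(edges, start, goal):
    """Breadth-first search: number of edges on the shortest path, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        if person == goal:
            return distance
        for neighbor in edges.get(person, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


print(degrees_of_separation(edges, "ann", "cat"))
print(degrees_of_separation(edges, "ann", "dan"))
```

The traversal only follows edges from one node to its neighbors, which mirrors why relationship-heavy questions tend to be a natural fit for a graph model.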
In simple terms, metadata means information about your existing data. When you know what kind of data you have, it becomes much more intuitive to interact with it. Metadata management deals with managing your existing data sources and targets, setting proper definitions of what your data is about, setting correct access policies, and establishing guidelines for data governance. For example, your metadata might answer the following questions.
- How is this data created?
- How is it transformed, and how is it stored?
- What is the business definition of this column?
- How is a KPI calculated? What are the underlying fields used to define the KPI metric?
- Who owns the data, and how should it be accessed?
These questions can be used as a starting point for managing your metadata. This information can lead to building operational data catalogs that can be used as a single source of reference for all of your metadata.
- Operational excellence: Maintaining your metadata gives users access to accurate information about the data they work with.
- Less confusion: Users will have less confusion while using metrics, as definitions are readily available in the data catalog.
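One way to picture a data catalog entry is as a small record that answers the questions listed above. The sketch below is a hypothetical structure, not the format of any specific catalog tool; the field names and the sample entry are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical catalog record for a single column or metric."""
    column: str
    business_definition: str   # what the column or KPI means to the business
    source: str                # how the data is created
    transformation: str        # how it is transformed and stored
    owner: str                 # who owns the data
    allowed_groups: set = field(default_factory=set)  # who may access it


catalog = {}


def register(entry):
    """Add or replace an entry in the in-memory catalog."""
    catalog[entry.column] = entry


def describe(column):
    """Return a one-line summary that users can read before using a column."""
    entry = catalog[column]
    return f"{entry.column}: {entry.business_definition} (owner: {entry.owner})"


register(CatalogEntry(
    column="monthly_active_users",
    business_definition="Distinct users with at least one session in the month",
    source="product analytics events",
    transformation="deduplicated by user_id, aggregated per calendar month",
    owner="analytics team",
    allowed_groups={"analysts"},
))

summary = describe("monthly_active_users")
print(summary)
```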
Programming Languages for Accessing Data Warehouses
Data warehouses, in simple terms, are database solutions designed with a different approach to schemas and tables. Most data warehouse solutions can be built using native ANSI SQL commands. When working with data warehouses, you can use SQL to write Data Definition Language (DDL) statements for things like creating databases, tables, and views. You can also use Data Query Language and Data Manipulation Language statements to select, insert, delete, or update data in various tables, and Data Control Language statements to control how data can be accessed by different users. For example, you can use SQL commands such as REVOKE to adjust permissions for users, groups, and data warehouse assets.
Many programmers and developers use Python and R routinely, because both languages have strong library support in many fields, including data science, analysis, statistics, and machine learning. That makes it easy to use Python and R to interact with your data warehouse solutions as well. Snowflake offers a Python connector that allows you to directly access your data warehouse tables from your Python code. Amazon Redshift accepts JDBC connections, which lets users connect to Redshift instances from their R console. This makes accessing modern data warehouse solutions easier.
In this article, you’ve learned all about cloud data warehouses and the best ways to access data in them. Data warehouses are a great way to support your complex analytical workloads.