Redshift vs Snowflake vs BigQuery: Choosing a Warehouse
In our previous blog post, we described our experience setting up an open-source analytics stack using RudderStack, Apache Superset, and Redshift. When choosing a data warehouse for such a stack, it is common to pit warehouses against each other: Redshift vs. BigQuery, Redshift vs. Snowflake, and so on.
In this post, we explain our reasoning for choosing Amazon Redshift over other popular data warehousing solutions such as Google BigQuery and Snowflake.
Amazon Redshift (part of AWS) and Google BigQuery (part of Google Cloud Platform) are market leaders among the cloud-based data warehouse solutions available today. Because both use columnar storage, they offer improved performance over traditional RDBMS and are what most companies adopt when migrating off of on-premise infrastructure. Both platforms provide mature, intuitive management consoles and a rich set of programmatic interfaces (primarily for standard SQL). Snowflake, on the other hand, has emerged as a modern data warehouse solution with features such as ease of use, instant workload scalability, powerful analytics and security, and robust API functionality. (Microsoft Azure is an additional provider, but in this post we cover the top three.)
There are many use cases for warehouses and big data, such as business intelligence, data analysis, data science, and other workflows. For this comparison, we used a data set of approximately 20 million event records, triggered directly or indirectly by the users of a mobile casino game. The mobile application contains multiple games - both free and paid - that mimic casino slot machines.
Data for comparison
The event data itself is captured as JSON, and RudderStack then runs jobs to load the data into the cloud provider (this works roughly the same way across Amazon Web Services, GCP, and Snowflake). It's worth noting that this job is different from a traditional ETL job and can be tuned to load data extremely fast, including near real-time loads where the warehouse supports that speed of ingestion.
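To make the captured-as-JSON step concrete, here is a minimal sketch of what a single event payload might look like. The field names and values are illustrative assumptions, not the game's actual schema:

```python
import json

# Hypothetical shape of one captured event (field names are
# illustrative, not the app's real schema): an event name plus
# free-form properties, serialized as JSON before loading.
event = {
    "event": "spin_result",
    "userId": "user-1234",
    "timestamp": "2020-06-01T12:34:56Z",
    "properties": {
        "game_id": "slots_gold",
        "is_paid_game": False,
        "coins_won": 150,
    },
}

payload = json.dumps(event)
print(payload)
```

Each such payload lands as a row in the warehouse, with common fields going to a parent table and event-specific properties going to per-event tables.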
We used the data from the following three tables:
- revenue: details for purchase events
- spin_result: details for events triggered by automatic or manual player spin actions
- tracks: the parent table containing common details of all events
Note that the data sets here came from one data pipeline (RudderStack's Event Stream) pulling from a single app, to limit variables in the test, but data can be loaded into each of these warehouses from multiple data sources (a common pattern, especially for machine learning). We also used a single data type for this test, but all of these warehouses can ingest multiple data types.
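The parent/child layout of the three tables above can be sketched in a few lines of SQL. The column names below are assumptions for illustration, and we use SQLite only so the sketch is runnable without a warehouse:

```python
import sqlite3

# Minimal sketch of the tracks/revenue/spin_result layout described
# above; column names are illustrative assumptions, and SQLite stands
# in for the actual warehouse.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tracks (id TEXT PRIMARY KEY, user_id TEXT, event TEXT, sent_at TEXT)")
cur.execute("CREATE TABLE revenue (id TEXT, amount REAL)")
cur.execute("CREATE TABLE spin_result (id TEXT, coins_won INTEGER)")

cur.execute("INSERT INTO tracks VALUES ('e1', 'u1', 'revenue', '2020-06-01')")
cur.execute("INSERT INTO tracks VALUES ('e2', 'u1', 'spin_result', '2020-06-01')")
cur.execute("INSERT INTO revenue VALUES ('e1', 4.99)")
cur.execute("INSERT INTO spin_result VALUES ('e2', 150)")

# A typical analytics query joins an event table back to the common
# details kept in the parent tracks table:
cur.execute("""
    SELECT t.user_id, SUM(r.amount) AS total_spend
    FROM tracks t JOIN revenue r ON r.id = t.id
    GROUP BY t.user_id
""")
rows = cur.fetchall()
print(rows)
```

Queries of this join-and-aggregate shape are representative of what product managers run for analytics, which is why the benchmark queries stress them.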
We compared Redshift vs. BigQuery vs. Snowflake on two major aspects: query performance and cost. For the query performance evaluation, we ran a set of 11 queries on each warehouse - queries used extensively by the product managers of the gaming application for data analytics. As pricing models vary across the data warehouse ecosystem (on-demand, upfront, flat-rate, credits, etc.), we also attempted to define a common, usage-based standard for cost evaluation.
Some other considerations that guided this performance comparison exercise were:
- Data loading time was not considered as part of this exercise.
- Performance comparisons were done using two Redshift cluster configurations (one with a single dc2.large instance and another with two dc2.large instances). Higher configurations were deemed too costly for this setup, as discussed later.
- All the queries were executed from a Python program. The module used for accessing Redshift was psycopg2, while we used the google.cloud module for BigQuery. For Snowflake, we used SnowSQL, the Snowflake command-line interface.
- Multiple readings were taken to balance one-off factors such as network disruptions.
- No platform optimizations, vendor-provided or otherwise, were implemented during the exercise.
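The practice of taking multiple readings to damp one-off factors can be sketched with a small stdlib-only timing helper. The query runner here is a stand-in for the actual psycopg2/google.cloud/SnowSQL calls, which are omitted so the sketch runs anywhere:

```python
import time
from statistics import median

def time_query(run_query, repeats=5):
    """Run a query callable several times and return the median
    wall-clock duration in seconds. The median damps one-off
    factors such as network disruptions."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # stand-in for the real warehouse call
        durations.append(time.perf_counter() - start)
    return median(durations)

# Dummy "query" so the sketch is runnable without a warehouse:
elapsed = time_query(lambda: sum(range(100_000)))
print(f"median runtime: {elapsed:.6f}s")
```

With a real warehouse, `run_query` would wrap a cursor execute-and-fetch for each of the 11 benchmark queries.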
Cost Comparison: Redshift vs BigQuery vs Snowflake
When comparing the cost of storing the data and running the queries, the key takeaways for each data warehouse are:
For Redshift:
- A dc2.large node with 160GB of storage costs $0.25/node/hour, that is, $180/node/month, with no separate charge for processing queries.
- Since we used two dc2.large nodes for the cluster, the total cost comes to $360/month.

For BigQuery:
- BigQuery charges $20/TB/month for storage and $5/TB scanned by queries.
- The total cost of running the 11 queries (once each) came to roughly $0.56.
- The dataset being queried is 39.30 GB (0.0393 TB), so the storage cost came to around $0.79/month.
Snowflake offers several platform editions to choose from based on usage. For this exercise, we used the Business Critical edition for our data storage. It offers two storage pricing options:
- On-demand: Snowflake charges $40/TB per month for storage. Although our total cost of storage comes to $40 * 0.0393 TB = $1.57/month, there is a $25 minimum monthly charge for on-demand customers.
- Pre-purchased: Snowflake offers a pre-purchased plan of $23/TB/month for data storage.
For querying, Snowflake credits pay for compute, plus any cloud-services consumption that exceeds 10% of daily compute usage. The total number of credits used to execute the 11 queries was 1.866761388. Given that the cost per credit on the Business Critical edition is $4, the total cost of querying comes to $4 * 1.866761388 ≈ $7.5.
- If the queries are run just once in a month, Redshift ($360) is approximately 267 times more expensive than BigQuery ($1.35).
- Storing the data and running the queries once in BigQuery ($1.35) cost far less than doing so in Snowflake ($7.5).
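The arithmetic behind these takeaways can be reproduced in a few lines; all figures come from the pricing quoted above:

```python
# Reproducing the cost arithmetic above (prices in USD).

# Redshift: two dc2.large nodes at $180/node/month, queries free.
redshift_monthly = 2 * 180  # $360

# BigQuery: $20/TB/month storage on 0.0393 TB, plus $0.56 of queries.
bigquery_monthly = 20 * 0.0393 + 0.56  # ≈ $1.35

# Snowflake (Business Critical): $4/credit on 1.866761388 credits.
snowflake_queries = 4 * 1.866761388  # ≈ $7.5

ratio = redshift_monthly / bigquery_monthly
print(f"Redshift is ~{ratio:.0f}x BigQuery's monthly cost")
```

Note that the Redshift figure is a fixed monthly commitment regardless of query volume, while the BigQuery and Snowflake query costs grow with usage, which is why the ratio shrinks as query frequency rises.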
Performance Comparison: Redshift vs BigQuery vs Snowflake
The following table shows the performance comparison when it comes to Redshift vs BigQuery vs Snowflake, as per our exercise:
Based on the performance and cost evaluation of the three warehouses, choosing Redshift as our preferred data warehouse solution makes sense. Given our large data sets and real-time or near real-time dashboards accessed by a high number of users, Redshift gives us the best ROI on both performance and cost.
Depending on your use case, you should review the docs for each major warehouse, because they vary in their support for data integration, connectors, MPP (massively parallel processing) features, concurrency, caching, and other data warehouse services that may matter more to you than query performance and cost.
Also, if you anticipate processing data at petabyte scale, you will likely need to consider adding a data lake (such as Amazon S3) to your architecture and pre-processing data before it enters the warehouse to control cost at scale.