Pricing
Log in

Learning Center

Learning Topics

Subscription

Subscribe

We'll send you updates from the blog and monthly release notes.

Learning Center

Redshift vs Snowflake vs BigQuery: Choosing a Warehouse

Conclusion

In our previous blog post, we described our experience setting up an Open-Source Analytics Stack using RudderStack, Apache SuperSet, and Redshift. When it comes to choosing a data warehouse for the open-source analytics stack, it is very common to pit data warehouses against each other. Redshift vs BigQuery, Redshift vs Snowflake, etc. are some common examples.

In this post, we will go into the reasoning behind choosing AWS Redshift over other popular data warehousing solutions such as Google BigQuery and Snowflake.

Background

AWS Amazon Redshift and Google BigQuery (part of Google Cloud Platform) are market leaders for cloud-based data warehouse solutions available today. As they both utilize columnar storage, they offer improved performance over traditional RDBMS and are what most companies use when migrating off of on-premise infrastructure. These platforms provide mature, intuitive management consoles and a rich set of programmatic interfaces (primarily for standard SQL). On the other hand, Snowflake has emerged as a modern data warehouse solution with features such as ease of access, instant scalability for workloads, as well as powerful analytics and security and robust API functionality. (Microsoft Azure is an additional provider, but in this post we will cover the top three.)

There are lots of use cases for warehouses and big data, like business intelligence, data analysis, data science and other workflows. For this comparison, we used a data set that consists of approximately 20 million event records, triggered directly or indirectly by the users of a mobile casino game. The mobile application contains multiple games - both free and paid - that mimic the casino slot machines.

Data for comparison

The event data itself is captured as JSON, then RudderStack runs jobs to load data into the cloud provider (this works roughly the same way across Amazon Web Services, GCP and Snowflake). It’s worth noting that this job is different from a traditional ETL job and can be tuned to load data extremely fast, including real-time data where the warehouse supports that benchmark for data ingestion.

We used the data from the following three tables:

  • revenue - details for purchase events
  • spin_result - details for events triggered by automatic or manual player spin action
  • tracks - parent table containing common details of all events

Note that the data sets here came from one data pipeline (RudderStack’s Event Stream) pulling from a single app to limit variables in the test, but data can be loaded to each of these warehouses from multiple data sources (a common use case, especially for machine learning use cases). Also, we used a single data type for this test, but all of these warehouses can ingest multiple data types.

Evaluation Methodology

We compared Redshift vs BigQuery vs Snowflake on two major aspects - query performance and cost. For query engine performance evaluation, we ran a set of 11 queries on each warehouse that were used extensively by the product managers of the gaming application for data analytics. As the pricing model varies in the data warehouse ecosystem (i.e., on-demand pricing, upfront, flat-rate, credits, etc.), we also made an attempt to create a common cost evaluation standard based on usage.

Some other considerations that guided this performance comparison exercise were:

  • Data loading time was not considered as part of this exercise.
  • Performance comparisons were done using two Redshift cluster configurations (one with single dc2.large instance and another with two dc2.large instances). Higher configurations were deemed to be too costly for the given setup as discussed later.
  • All the queries were executed from a Python program. The module used for accessing Redshift was psycopg2 while we used the google.cloud module for BigQuery. For Snowflake, SnowSQL i.e. the Snowflake Command Line Interface was used
  • Multiple readings were taken to balance one-off factors such as network disruptions.
  • No platform optimizations, vendor-provided or otherwise, were implemented during the exercise.

Cost Comparison: Redshift vs BigQuery vs Snowflake

When comparing the cost of data storage for running the queries, the following are the key takeaways for each data warehouse:

AWS Redshift

  • For dc2.large with 160GB/node storage, the cost is $0.25/node/hour, that is $180/month + no cost for processing queries.
  • As we use two dc2.large nodes for each cluster, the total cost comes up to $360/month.

Google BigQuery

  • BigQuery charges $20/TB/month for storage and $5/TB for queries.
  • The total cost of running the 11 queries (only once) came up to roughly around $0.56.
  • The size of the dataset being queried is 39.30 GB, that is, 0.0393 TB. Therefore, the storage cost per month came up to around $0.79.

Snowflake

Snowflake offers different kinds of platforms that can be opted for, as per our usage. For this exercise, we used the Business Critical platform for our data storage. It also offers two types of services:

  • On-demand: Snowflake charges $40/TB per month for storage. Although our total cost of storage comes to $40 * 0.0393 TB = $1.57/month, there is a $25 minimum monthly charge for on-demand customers.
  • Pre-purchased: Snowflake offers a pre-purchased plan of $23/TB/month for data storage.

For querying, SnowFlake Credits are used to pay for the consumption of the cloud services that exceed 10% of the daily usage of the compute resources. The total number of credits used for executing the 11 queries is 1.866761388. Given the cost per credit for a business-critical platform is $4, the total cost of querying comes to $4*1.866761388 = $7.5.

Key Takeaways

  • If the queries are run just once in a month, Redshift is approximately 267 times more expensive than BigQuery.
  • When it comes to storing and querying data in BigQuery as compared to Snowflake, the cost of storing and running the queries once on the former ($1.35) was much less than doing so in the latter ($7.5).

Performance Comparison: Redshift vs BigQuery vs Snowflake

The following table shows the performance comparison when it comes to Redshift vs BigQuery vs Snowflake, as per our exercise:

Based on the performance and cost evaluation of the three warehouses, we see that using Redshift as our preferred data warehouse solution makes sense. Considering there are large data sets and real-time or near real-time dashboards that are being accessed by a high number of users, Redshift gives us the best ROI on the performance and cost aspects.

Depending on your use case, you should review the docs for each major warehouse because they vary in terms of support for data integration, connectors, indexes for mpp (massively parallel processing), concurrency, caching and other data warehouse services that might be more important to you that query performance and cost.

Also, if you anticipate processing data at petabyte scale, you will likely need to consider adding a data lake (like Amazon S3) to your architecture and pre-process data that will enter the warehouse to control cost at scale.

Get the Data Maturity Guide

Our comprehensive, 80-page Data Maturity Guide will help you build on your existing tools and take the next step on your journey.

Get the Guide
Build a data pipeline in less than 5 minutes
Create an account
See RudderStack in action
Get a personalized demo
Collaborate with our community of data engineers
Join Slack Community

COMPANY

  • About
  • Contact us
  • Partner with us
  • 🚀 We’re hiring!
  • Privacy policy
  • Terms of service

JOIN THE CONVERSATION

Learn more about the product and how other engineers are building their customer data pipelines.

Join our Slack Community

READ OUR DOCUMENTATION

Technical documentation on using RudderStack to collect, route and manage your event data securely.

Go to Docs

© RudderStack Inc.

This site uses cookies to improve your experience. If you want to learn more about cookies and why we use them, visit our cookie policy. We’ll assume you’re ok with this, but you can opt-out if you wish cookie settings.