By Rudderstack Team

How to Access and Query Your Amazon Redshift Data Using Python and R


In this post, we will see how to access and query your Amazon Redshift data using Python. We follow two steps in this process:

  • Connecting to the Redshift warehouse instance and loading the data using Python
  • Querying the data and storing the results for analysis

Since Redshift is compatible with PostgreSQL, we use the psycopg2 Python driver to access and query the data from Redshift. We will then store the query results as a pandas dataframe with the help of the SQLAlchemy library.
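Because Redshift speaks the PostgreSQL dialect, the SQLAlchemy side of this flow mostly amounts to building a standard PostgreSQL connection URL. A minimal sketch, where the helper name and the `events` table are our own illustrative choices:

```python
def redshift_url(user, password, host, port, dbname):
    # Redshift uses the PostgreSQL wire protocol, so SQLAlchemy's
    # postgresql+psycopg2 dialect works against it.
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{dbname}"

# With sqlalchemy and pandas installed, query results land in a dataframe:
#   engine = sqlalchemy.create_engine(redshift_url("user", "pwd", "host", 5439, "dbname"))
#   df = pandas.read_sql("SELECT * FROM events LIMIT 100;", engine)
```

We will see the querying step in detail below.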

The purpose of this exercise is to leverage the statistical techniques available in Python to get useful insights from your Redshift data. Some of the insights that you can get include a better understanding of your customers’ product behavior, predicting churn rate, etc.

We also have a dedicated blog post on manipulating and querying your Google BigQuery data using Python and R, in case you are interested.

For this post, we assume that you already have the data loaded in your Redshift instance. In case you haven't, the best way to load your data into Redshift is to leverage Customer Data Infrastructure tools such as RudderStack. They allow you to collect your data across all the customer touchpoints and load it securely into Redshift – or any other warehouse of your choice – with minimal effort.

Connecting to Your Redshift Data Using Python

To access your Redshift data using Python, we will first need to connect to our instance. As mentioned above, Redshift is compatible with other database solutions such as PostgreSQL. Hence, you can safely use the same tools you would use to access and query your PostgreSQL data.

We will use the psycopg Python driver to connect to our Redshift instance. That said, please feel free to experiment with any other library of your choice to do so.

import psycopg2

con = psycopg2.connect(dbname='dbname', host='host',
                       port='port', user='user', password='pwd')

We highly recommend that you use the above code as part of your pipeline, and wrap it in a function that handles any errors. The parameters that you need are typical for connecting to any database:

  • Name of the database
  • Host Name
  • Port
  • User Name
  • Password
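An error-handling wrapper around the connection might look like the sketch below. The function name and the re-raised RuntimeError are our own choices, and psycopg2 is imported inside the function only so that the helper can be defined before the driver is installed:

```python
def connect_to_redshift(dbname, host, port, user, password):
    """Open a connection to the Redshift cluster, raising a clear error on failure."""
    import psycopg2  # imported here so the helper is importable without the driver

    try:
        return psycopg2.connect(
            dbname=dbname, host=host, port=port,
            user=user, password=password,
        )
    except psycopg2.OperationalError as err:
        raise RuntimeError(f"Could not connect to Redshift: {err}") from err
```

In a real pipeline, you would read these parameters from environment variables or a secrets manager rather than hard-coding them.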

Executing Queries on Your Redshift Data Using Psycopg

Once the database connection is set up, we can start querying the Redshift data. We use SQL queries to narrow down the amount of data we want for our analysis.

To do so with psycopg, we perform these steps:

  • We get a cursor from our database connection, like so:
cur = con.cursor()