Databricks

Send data from Databricks to your entire stack.

Databricks is a data analytics platform that lets you easily integrate with open source libraries. It offers a simple collaborative environment to run interactive and scheduled data analysis workloads.

RudderStack supports Databricks as a source from which you can ingest data and route it to your desired downstream destinations.

success
You can now ingest data into RudderStack by running queries on your Databricks cluster or SQL warehouse.

Granting permissions

RudderStack requires you to grant certain user permissions on Databricks to successfully access data from it.

Complete the steps in the following sections, in the order listed, to grant these permissions:

Step 1: Add a user

Step 2: Creating the RudderStack schema and granting permissions

  1. Create a dedicated schema _rudderstack.
CREATE SCHEMA `_rudderstack`;
warning
The _rudderstack schema is used by RudderStack for storing the state of each data sync. This name should not be changed.
  2. Grant full access to the _rudderstack schema to the user created in step 1.
GRANT ALL PRIVILEGES ON SCHEMA `_rudderstack` TO `user@example.com`;

Replace user@example.com with the user created in step 1.
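
To confirm that the schema exists and the grant took effect, you can inspect both from the Databricks SQL editor. The following is a minimal sketch using standard Databricks SQL statements; user@example.com is the same placeholder as above, and the exact output depends on your workspace configuration.

```sql
-- Confirm that the schema exists and view its metadata.
DESCRIBE SCHEMA `_rudderstack`;

-- List the privileges the user from step 1 holds on the schema.
-- Replace the placeholder with your actual user.
SHOW GRANTS `user@example.com` ON SCHEMA `_rudderstack`;
```

If your workspace uses Unity Catalog, an unqualified schema name resolves against the current catalog, so it may be worth running USE CATALOG with your catalog name before these statements.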

Setting up the Databricks source in RudderStack

  1. Log in to your RudderStack dashboard.
  2. From the left panel, go to Collect > Sources > New source > Reverse ETL. Then, select Databricks.
  3. Assign a name to your source and click Continue.

Configuring the connection credentials

  1. Choose from the Table or Model option to sync data from either a warehouse table or a model.
  2. Enter the connection details of your Databricks cluster or SQL warehouse in the Connection Credentials section:
success
For most use cases, RudderStack recommends using a SQL warehouse over a cluster, as SQL warehouses generally cost less and are faster to spin up. Clusters, in contrast, are suited to much larger operations that require more resources.
  • Host - Enter the server hostname.
  • Port - Enter the port number.
  • Path - Enter the HTTP path.
  • Token - Enter the personal access token.
  • Catalog - Enter the name of your Unity Catalog catalog. See Databricks documentation for more information on getting the catalog details.
info

Note the following:

  • See this FAQ for more information on getting the host, port, path, and token for your Databricks cluster.
  • See this FAQ for more information on getting the host, port, path, and token for your SQL warehouse.
  • If you’ve already configured Databricks as a source before, your existing credentials will automatically appear under Use Existing Credentials.
  3. Click Continue to proceed.
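
Before entering the credentials, it can help to confirm that the token's user and the catalog are the ones you expect. The following is a minimal sketch, assuming you run it in the Databricks SQL editor (or another client) authenticated as the user from step 1. The field shapes in the comments are typical illustrative placeholders, not values from your workspace.

```sql
-- Typical shapes of the connection fields (placeholders only):
--   Host:    <workspace-host>, usually without the https:// prefix
--   Port:    443 is the usual HTTPS port for Databricks connections
--   Path:    /sql/1.0/warehouses/<warehouse-id> for a SQL warehouse
--   Catalog: the Unity Catalog catalog that contains your data

-- Returns the authenticated user and the catalog and schema in effect.
SELECT current_user()    AS connected_user,
       current_catalog() AS active_catalog,
       current_schema()  AS active_schema;
```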

Schedule settings

  1. Specify the Schedule Settings to schedule the data syncs from your Databricks source.
info
RudderStack lets you schedule data syncs for your Reverse ETL sources and specify how and when the syncs will run. For more information on the Basic, CRON, and Manual schedule types, refer to the Sync Schedule Settings guide.
  2. After specifying the schedule type and run settings, click Continue to finish the setup.

Databricks is now configured as a source in your RudderStack dashboard. To connect this source to your preferred destination, click the Add Destination button.

Specifying the data to import

While connecting a destination to your Reverse ETL source, you can use the default JSON mapping or the Visual Data Mapping feature.

info

Based on the option (Table or Model) you chose while setting up the Reverse ETL source, follow the relevant guide for detailed steps.
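
With the Model option, the data to import is typically defined by a SQL query rather than a single table. The following is a sketch of what such a query might look like on Databricks. The catalog, schema, table, and column names are hypothetical and are not part of RudderStack or Databricks.

```sql
-- Hypothetical model: one row per recently active user, with a stable
-- identifier column (user_id) that can be mapped to the destination.
SELECT
  user_id,
  email,
  plan_name,
  last_seen_at
FROM my_catalog.analytics.users  -- hypothetical table
WHERE email IS NOT NULL
  AND last_seen_at >= date_sub(current_date(), 30);
```

Keeping the query narrow (only the columns you intend to map, only the rows you intend to sync) usually reduces the work each sync has to do on the warehouse or cluster.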

FAQ

Where can I obtain the connection credentials for the Databricks cluster?

To obtain the Host, Path, and Port number, go to your Databricks account and follow these steps:

  1. Go to the Compute tab and select your Databricks cluster.
  2. Expand Advanced options and open the JDBC/ODBC tab to find the required settings.

To obtain the Token, go to Settings > User Settings in your Databricks account and generate a new personal access token.
info
See Databricks documentation for more details on generating a personal access token.

Where can I obtain the connection credentials for the SQL warehouse?

To obtain the Host, Path, and Port number for your SQL warehouse, go to your Databricks account and follow these steps:

  1. Go to the SQL warehouses tab and select your warehouse.
  2. Click the Connection details tab to find the Host, Path, and Port number.
SQL warehouse connection details

To obtain the Token, go to Settings > User Settings in your Databricks account and generate a new personal access token:

Databricks access token
info
See Databricks documentation for more details on generating a personal access token.

What do the three validations under Verifying Credentials imply?

When setting up a Reverse ETL source, after you enter the connection credentials and proceed, you will see the following three validations under the Verifying Credentials option:

Validations

These options are explained below:

warning
Make sure your Databricks SQL warehouse/cluster is active when running the validations. Otherwise, the validations might fail.
  • Verifying Connection: This option indicates that RudderStack is trying to connect to the warehouse with the information specified in the connection credentials.
info
If this option gives an error, it means that one or more fields specified in the connection credentials are incorrect. Verify your credentials in this case.
  • Able to List Schema: This option checks if RudderStack is able to fetch all schema details using the provided credentials.
  • Able to Access RudderStack Schema: This option checks whether RudderStack can access the _rudderstack schema that you created by running the commands in the Creating the RudderStack schema and granting permissions section.
info
If this option gives an error, verify that you have created the _rudderstack schema and granted RudderStack the required permissions to access it. For more information, refer to the Creating the RudderStack schema and granting permissions section.
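
If a validation fails, you can approximate each check manually. The statements below are a rough sketch of comparable checks in Databricks SQL, run as the user whose token you entered. They are assumptions about what each validation exercises, not the exact commands RudderStack runs.

```sql
-- Verifying Connection: a trivial query succeeds only if the host, port,
-- path, and token are valid and the warehouse or cluster is running.
SELECT 1;

-- Able to List Schema: lists the schemas visible in the current catalog.
SHOW SCHEMAS;

-- Able to Access RudderStack Schema: these fail if the schema is missing
-- or the user lacks privileges on it.
DESCRIBE SCHEMA `_rudderstack`;
SHOW TABLES IN `_rudderstack`;
```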

Does my SQL warehouse/cluster need to be active when running the validations?

Yes - otherwise, the validations might fail.

