Databricks Delta Lake

Sync data from RudderStack to Databricks Delta Lake.

Delta Lake is a popular data lake used for both streaming and batch operations. It lets you store structured, unstructured, and semi-structured data securely and reliably.

success
You can now use Databricks Partner Connect to set up your Databricks Delta Lake destination in RudderStack without following the setup instructions.

Find the open source code for this destination in the GitHub repository.

Configuring Delta Lake destination in RudderStack

warning

Before configuring Delta Lake as a destination in RudderStack, it is highly recommended to go through the following sections to obtain the necessary configuration settings. These sections also contain the steps to grant RudderStack and Databricks the necessary permissions to your preferred storage bucket.

To send event data to Delta Lake, you first need to add it as a destination in RudderStack and connect it to your data source. Once the destination is enabled, events will automatically start flowing to Delta Lake via RudderStack.

To configure Delta Lake as a destination in RudderStack, follow these steps:

  1. In your RudderStack dashboard, set up the data source. Then, select Databricks Delta Lake from the list of destinations.
  2. Assign a name to your destination and then click Next.

Connection settings

  • Host: Enter your server hostname from the Databricks dashboard.
  • Port: Enter the port number.
  • HTTP Path: Enter the cluster’s HTTP path.
info
For more information on obtaining the server hostname, port, and the cluster’s HTTP path, refer to the Obtaining the JDBC/ODBC configuration section below.
  • Personal Access Token: Enter your Databricks access token. For more information on generating the access token, refer to the Generating the Databricks access token section below.
  • Enable delta tables creation in an external location: Enable this setting to specify the external location to create the delta tables. You can specify the external location in the External delta table location setting. When disabled, RudderStack creates the delta tables at a default storage location for the non-external Apache Hive tables.
info

If you have not configured a Unity catalog, you can access the delta tables at {path_to_table}/{schema}/{table}.


If you have configured a Unity catalog, follow these steps:

  1. Create an external location by following this Databricks documentation. Your location will look something like s3://{bucket_path}/{external_location}.
  2. Specify the absolute location in the External delta table location setting. It will look something like s3://{bucket_path}/{external_location}/{path_to_table}/{schema}/{table}.
  • Catalog: If you have configured a Unity catalog, enter the catalog name where your data assets are organized. For more information on creating a Unity Catalog, refer to the Databricks documentation.
info
If you do not specify the catalog name, RudderStack uses the default catalog configured for your workspace.
  • Namespace: Enter the the name of the schema where RudderStack will create the tables. If you don’t specify a namespace in the dashboard settings, RudderStack will set it to the source name, by default.
  • Sync Frequency: Specify how often RudderStack should sync the data to your Delta Lake instance.
  • Sync Starting At: This optional setting lets you specify the particular time of the day (in UTC) when you want RudderStack to sync the data to the Delta Lake instance.
  • Exclude Window: This optional setting lets you specify the time window (in UTC) when RudderStack will skip the data sync.
  • Object Storage Configuration: Use this setting to specify your object storage.
    • Use RudderStack managed object storage: Enable this setting to use RudderStack-managed buckets for object storage. Note that this option is applicable only for RudderStack-hosted data planes. For self-hosted data planes, you need to specify your own object storage configuration settings.
    • Choose your storage provider: If Use RudderStack managed object storage is disabled, you can select any one of the following platforms for storing your staging files:

See How RudderStack stores data in an object storage platform for more information.

info
If you select S3 as your storage provider, RudderStack provides the option to specify your IAM role ARN or the AWS access key ID/secret access key by enabling the Use STS Tokens to copy staging files setting. For more information, refer to the Amazon S3 storage bucket settings section below.

Advanced settings

  • Warehouse Append: This setting is turned on by default - RudderStack appends your incoming Event Stream data to the existing data in your warehouse. Turning it off causes RudderStack to merge your incoming data into your warehouse to ensure 100% non-duplicate data.
info

The append operation helps to achieve faster data syncs while reducing warehouse costs. However, note that it may increase the number of duplicates in the warehouse, especially if the existing data is older than two weeks. A common scenario where duplication might occur is when the SDKs retry sending events in case of failures.

A merge strategy ensures deduplication but can lead to longer sync times and increased warehouse costs.

  • Skip User Table: Toggle on this setting to send events exclusively to the identifies table and skip the users table, eliminating the need for a merge operation on the users table.
  • Skip Tracks Table: Toggle on this setting to skip sending events to the tracks table.

Granting RudderStack access to your storage bucket

This section contains the steps to edit your bucket policy to grant RudderStack the necessary permissions, depending on your preferred cloud platform.

Amazon S3

Follow these steps to grant RudderStack access to your S3 bucket based on the following two cases:

Case 1: Use STS Token to copy staging files is disabled in the dashboard

warning
Follow the steps listed in this section if the Use STS Token to copy staging files option is disabled, that is, you don’t want to specify the AWS credentials while configuring your Delta Lake destination.

For RudderStack Cloud

If you are using RudderStack Cloud, edit your bucket policy using the following JSON:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:iam::422074288268:user/s3-copy"
    },
    "Action": [
      "s3:GetObject",
      "s3:PutObject",
      "s3:PutObjectAcl",
      "s3:ListBucket"
    ],
    "Resource": [
      "arn:aws:s3:::YOUR_BUCKET_NAME/*",
      "arn:aws:s3:::YOUR_BUCKET_NAME"
    ]
  }]
}
info
Make sure you replace YOUR_BUCKET_NAME with the name of your S3 bucket.

For self-hosted RudderStack

If you are self-hosting RudderStack, follow these steps:

  1. Create an IAM policy with the following JSON:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "*",
    "Resource": "arn:aws:s3:::*"
  }]
}
  1. Then, create an IAM user with programmatic access. Attach the above IAM policy to this user.
info
Copy the ARN of this newly-created user. This is required in the next step.
  1. Next, edit your bucket policy with the following JSON to allow RudderStack to write to your S3 bucket.
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:iam::ACCOUNT_ID:user/USER_ARN"
    },
    "Action": [
      "s3:GetObject",
      "s3:PutObject",
      "s3:PutObjectAcl",
      "s3:ListBucket"
    ],
    "Resource": [
      "arn:aws:s3:::YOUR_BUCKET_NAME/*",
      "arn:aws:s3:::YOUR_BUCKET_NAME"
    ]
  }]
}
info
Make sure you replace USER_ARN with the ARN copied in the previous step. Also, replace ACCOUNT_ID with your AWS account ID and YOUR_BUCKET_NAME with the name of your S3 bucket.
  1. Finally, add the programmatic access credentials to the env file present in your RudderStack installation:
RUDDER_AWS_S3_COPY_USER_ACCESS_KEY_ID=<user_access_key>
RUDDER_AWS_S3_COPY_USER_ACCESS_KEY=<user_access_key_secret>

Case 2: Use STS Token to copy staging files is enabled in the dashboard

In this case, provide the configuration directly while setting up the Delta Lake destination in RudderStack:

S3 settings in RudderStack dashboard
  • Role-based Authentication: Enable this setting to use the RudderStack IAM role for authentication. For more information on creating an AWS IAM role for RudderStack, refer to this guide.
    • IAM Role ARN: Enter the ARN of the IAM role.
warning
It is highly recommended to enable this setting as the access keys-based authentication method is now deprecated.

If Role-based Authentication is disabled, you need to enter the AWS Access Key ID and AWS Secret Access Key to authorize RudderStack to write to your S3 bucket.

info
In both the role-based and access key-based authentication methods, you need to set a policy specifying the required permissions for RudderStack to write to your intermediary S3 bucket. Refer to the S3 permissions for warehouse destinations section for more information.

Google Cloud Storage

You can provide the necessary GCS bucket configuration while setting up the Delta Lake destination in RudderStack. For more information, refer to the Google Cloud Storage bucket settings.

Azure Blob Storage

You can provide the necessary Blob Storage container configuration while setting up the Delta Lake destination in RudderStack. For more information, refer to the Azure Blob Storage settings.

Granting Databricks access to your staging bucket

This section contains the steps to grant Databricks the necessary permissions to access your staging bucket, depending on your preferred cloud platform.

Amazon S3

Follow these steps to grant Databricks access to your S3 bucket depending on your case:

Case 1: Use STS Token to copy staging files is disabled in the dashboard

warning
Follow the steps listed in this section if the Use STS Token to copy staging files option is disabled, i.e. you don’t want to specify the AWS access key and secret access key while configuring your Delta Lake destination.

In this case, you will be required to configure your AWS account to create an instance profile which will then be attached with your Databricks cluster.

Follow these steps in the exact order:

  1. Create an instance profile to access the S3 bucket.
  2. Create a bucket policy for the target S3 bucket.
  3. Note the IAM role used to create the Databricks deployment.
  4. Add the S3 IAM role to the EC2 policy.
  5. Add the instance profile to Databricks.

Case 2: Use STS Token to copy staging files is enabled in the dashboard

info
Follow the steps listed in this section if the Use STS Token to copy staging files option is enabled, i.e. you are specifying the AWS access key and secret access key in the dashboard while configuring your Delta Lake destination.

Add the following Spark configuration to your Databricks cluster:

spark.hadoop.fs.s3.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3n.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3.impl.disable.cache true
spark.hadoop.fs.s3a.impl.disable.cache true
spark.hadoop.fs.s3n.impl.disable.cache true
info
For more information on adding custom Spark configuration properties in a Databricks cluster, refer to Spark configuration guide.

Google Cloud Storage

To grant Databricks access to your GCS bucket, follow these steps:

  1. Follow the steps listed in this user permissions section to set up the required role and user permissions.
  2. Then, add the following Spark configuration to your Databricks cluster:
spark.hadoop.fs.gs.auth.service.account.email <client_email>
spark.hadoop.fs.gs.project.id <project_id>
spark.hadoop.fs.gs.auth.service.account.private.key <private_key>
spark.hadoop.fs.gs.auth.service.account.private.key.id <private_key_id>
info
For more information on adding custom Spark configuration properties in a Databricks cluster, refer to Spark configuration guide.
  1. Finally, replace the following fields with the values obtained from the downloaded JSON in the previous step: <project_id>,<private_key>, <private_key_id>,<client_email>.

Azure Blob Storage

To grant Databricks access to your Azure Blob Storage container, follow these steps:

  1. Add the following Spark configuration to your Databricks cluster.
spark.hadoop.fs.azure.account.key.<storage-account-name>.blob.core.windows.net <storage-account-access-key>
info
For more information on adding custom Spark configuration properties in a Databricks cluster, refer to Spark configuration guide.
  1. Replace the following fields with the relevant values from your Blob Storage account settings: <storage-account-name>,<storage-account-access-key>.

Creating a new Databricks cluster

To create a new Databricks cluster, follow these steps:

  1. Sign into your Databricks account. Then, click the Compute option on the dashboard:
Delta Lake Compute option
  1. Click the Create Cluster option.
  2. Next, enter the cluster details. Fill in the Cluster Name:
Delta Lake Cluster name
  1. Select the Cluster Mode depending on your use-case. The following image highlights the three cluster modes:
Delta Lake cluster modes
  1. Then, select the Databricks Runtime Version as 7.1 or higher:
Delta Lake runtime version
  1. Configure the rest of the settings as per your requirement.
  2. In the Advanced Options section, configure the Instances field as shown in the following image:
Delta Lake instances
  1. In the Instance Profile dropdown menu, select the Databricks instance profile that you added to your account in the previous step.
Delta Lake instances field
  1. Finally, click the Create Cluster button to complete the configuration and create the Databricks cluster.
Delta Lake create cluster option

Obtaining the JDBC/ODBC configuration

Follow these steps to get the JDBC/ODBC configuration:

  1. In your Databricks dashboard, click the Compute option:
Delta Lake Compute option
  1. Then, select the cluster you created in the previous section.
Delta Lake cluster
  1. In the Advanced Options section, select the JDBC/ODBC field and copy the Server Hostname, Port, and HTTP Path values:
Delta Lake JDBC/ODBC settings
info
The Server Hostname, Port, and HTTP Path values are required to configure Delta Lake as a destination in RudderStack.

Generating the Databricks access token

To generate the Databricks access token, follow these steps:

  1. In your Databricks dashboard, go to Settings and click User Settings:
Databricks user settings
  1. Then, go to the Access Tokens section and click Generate New Token:
Access tokens
  1. Enter your comment in the Comment field and click Generate:
Databricks generating new token
warning
Keep the Lifetime (days) field blank. If you enter a number, your access token will expire after that number of days.
  1. Finally, copy the access token as it will be used during the Delta Lake destination setup in RudderStack.

IPs to be allowlisted

To enable network access to RudderStack, allowlist the following RudderStack IPs depending on your region and RudderStack Cloud plan:

Plan
Region
US
EU
Free, Starter, and Growth
  • 23.20.96.9
  • 18.214.35.254
  • 52.38.160.231
  • 34.211.241.254
  • 18.198.90.215
  • 18.196.167.201
Enterprise
  • 34.198.90.241
  • 54.147.40.62
  • 3.216.35.97
  • 100.20.239.77
  • 44.236.60.231
  • 3.66.99.198
  • 3.64.201.167
info
All the outbound traffic is routed through these RudderStack IPs.

FAQ

How does RudderStack handle deduplication in Databricks?

RudderStack takes the staging table and runs MERGE queries against the target Databricks tables to ensure no duplicate data is sent.

What are the reserved keys for Delta Lake?

Refer to this documentation for a complete list of the reserved keywords.

Does the Databricks cluster attached to the destination need to be up all time?

No, your Databricks cluster attached to the destination need not be up all time.

What happens if the cluster or the destination service is down? Is there a possibility of data loss?

If a warehouse destination is down or unavailable, RudderStack will continue to retry sending events (on an exponential backoff basis, for up to 3 hours).

RudderStack stores the syncs as staging files and retries sending them at a later time when the cluster is up again. This allows for a successful delivery without any missing data.

After retrying for up to 3 hours, RudderStack marks the syncs as aborted. Once the service is up and running again, you can go to the Syncs tab in the RudderStack dashboard and retry sending the data.

Does RudderStack automatically spin the Databricks cluster/SQL endpoint every time it needs to write?

No, RudderStack does not spin the Databricks cluster or the SQL endpoint on its own every time it needs to write to the cluster.

Databricks itself starts up the cluster/endpoint when the connection is established. You just need to configure the automatic termination settings in the Autopilot Options on the cluster creation page:

Databricks autopilot options

How does RudderStack handle the reserved words in a column, table, or schema?

There are some limitations when it comes to using reserved words as a schema, table, or column name. If such words are used in event names, traits or properties, they will be prefixed with a _ when RudderStack creates tables or columns for them in your schema.

Also, integers are not allowed at the start of a schema or table name. Hence, such schema, column, or table names will be prefixed with a _. For example, '25dollarpurchase' will be changed to '_25dollarpurchase'.

How can I modify an existing table to a partitioned table?

To modify an existing table to a partitioned table, follow these steps:

  1. Set an exclusion window (using the Exclude window connection setting) so that RudderStack does not process any data while performing the below changes.
  2. Make the required changes in connection settings of the configured Delta Lake destination.
  3. Run the following queries in the Databricks Cluster/SQL endpoints to:
    • Rename the existing table with _temp suffix.
    • Add event_date column to the _temp table.
    • Backfill the data into original table.
ALTER TABLE x RENAME TO x_temp;
ALTER TABLE x_temp ADD COLUMN TO event_date DATE;
INSERT INTO x SELECT * FROM x_temp;
info
RudderStack will create the new tables with partition support. Refer to the Databricks documentation on partitions for more information.

How can I convert an existing managed or unmanaged table at a location to an unmanaged table at a new location?

  1. Set an exclusion window (using the Exclude window connection setting) so that RudderStack does not process any data while performing the below changes.
  2. Run the following queries in the Databricks Cluster/SQL endpoints to:
    • Create a temporary table using the new location.
    • Drop the temporary table.
    • Drop the original table.
CREATE OR REPLACE TABLE namespace.x_temp DEEP CLONE namespace.x LOCATION '/path/to/new/location/namespace/x';
// where namespace represents the namespace attached to the destination in RudderStack.
// where x represents the original table created by RudderStack.

DROP TABLE namespace.x_temp;
DROP TABLE namespace.x;
  1. Enable the Enable delta tables creation in an external location setting in RudderStack dashboard and update the location.
  2. Remove the exclusion window and make the required changes in connection settings of the configured Delta Lake destination.

RudderStack will create the table again during the subsequent data syncs.

How do I convert an existing unmanaged table at a specific location to a managed table (at default location)?

  1. Set an exclusion window (using the Exclude window connection setting) so that RudderStack does not process anything while performing the below changes.
  2. Run the following queries in the Databricks Cluster/SQL Endpoints to:
    • Create a temporary table.
    • Drop original table.
    • Rename temporary table to original table.
CREATE TABLE IF NOT EXISTS namespace.x_temp DEEP CLONE namespace.x;
// where namespace represents the namespace attached to the destination in RudderStack.
// where x represents the original table created by RudderStack.

DROP TABLE namespace.x;
ALTER TABLE namespace.x_temp RENAME TO namespace.x;
  1. Remove the exclusion window and make sure the Enable delta tables creation in an external location setting is disabled in the RudderStack dashboard.

RudderStack will create the table again during the subsequent data syncs.

info
For a more comprehensive FAQ list, refer to the Warehouse FAQ guide.
Databricks Partner Connect

Connect your Databricks cluster to RudderStack and set up your Databricks Delta Lake destination.


Questions? Contact us by email or on Slack