BigQuery Stream (All Events) Private Beta

Sync data from RudderStack to BigQuery Stream (All Events).

announcement

The BigQuery Stream (All Events) destination is currently in Private Beta.

Reach out to Customer Success if you are interested in enabling this destination for your workspace.

This guide helps you set up the BigQuery Stream (All Events) destination in RudderStack.

Overview

The BigQuery Stream (All Events) destination lets you stream all RudderStack events to BigQuery in near real-time using BigQuery Storage Write API.

success

Advantages of using BigQuery Stream (All Events)

This destination provides the near real-time speed of streaming and the full schema management you’d normally only get from batch loading without the trade-offs of either approach:

  • vs. BigQuery Stream: Adds automatic schema management and support for all event types (not just track events), and uses BigQuery’s modern, lower-cost Storage Write API.
  • vs. BigQuery (Batch destination): Delivers data in near real-time instead of waiting for each sync, while keeping the same warehouse schema you’re used to.

With this integration, you get fresh data in BigQuery within seconds, full schema parity with your other warehouse tables, and lower streaming costs.

Prerequisites

Follow the steps in this section before setting up the BigQuery Stream (All Events) destination in RudderStack.

1. Important considerations

Before you set up the BigQuery Stream (All Events) destination in RudderStack, note the following limitation:

No merge mode support

This destination does not support merge mode. The BigQuery Storage Write API only supports inserts and writes rows directly into tables to reduce latency.

See BigQuery Storage Write API for more information on how it operates.

2. BigQuery project setup

  1. Create a Google Cloud Platform project if you don’t have one already. See BigQuery documentation for more information.
warning
Make sure to enable billing for the project to allow RudderStack to load data into your BigQuery cluster.
  1. Enable the BigQuery API for your existing project if not done already.
  2. Log in to your BigQuery console and copy the project ID – this ID is required for configuring the destination in RudderStack.
GCP Project ID

3. Set up the RudderStack service account

Create the service account for the BigQuery project you set up above by following these steps:

  1. Go to IAM & Admin > Service Accounts.
  2. Create a new role with the following permissions:
bigquery.datasets.create // Skip if the dataset already exists.
bigquery.datasets.get
bigquery.jobs.create
bigquery.routines.get
bigquery.routines.list
bigquery.tables.create
bigquery.tables.delete
bigquery.tables.get
bigquery.tables.getData
bigquery.tables.list
bigquery.tables.update
bigquery.tables.updateData
tip
Tip: Skip the bigquery.datasets.create permission if the dataset already exists.
  1. Assign the role to the service account.
info

If you do not want to add the above fine-grained permissions to the service account individually, add the below roles to your service account:

  • BigQuery Job User
  • BigQuery Data Owner

If the dataset name already exists (configurable by the Namespace setting during the destination setup), you can assign the BigQuery Data Editor role instead of BigQuery Data Owner.

Service account role assignment
  1. Optional: Create and download the private JSON key required for accessing the service account. You need this key to configure the destination in RudderStack.
Service account JSON key download

IPs to allowlist

By default, you can access BigQuery via the Google APIs, which are publicly accessible. As such, allowlisting any IPs is not required.

However, if your VPC service restricts the BigQuery APIs, allowlist the below IPs by setting up network access control for BigQuery.

To enable network access to RudderStack, allowlist the following RudderStack IPs depending on your region and RudderStack plan:

Plan
Region
US
EU
IN
Free and Growth
  • 3.216.35.97
  • 18.214.35.254
  • 23.20.96.9
  • 34.198.90.241
  • 34.211.241.254
  • 52.38.160.231
  • 54.147.40.62
  • 3.123.104.182
  • 3.125.132.33
  • 18.198.90.215
  • 18.196.167.201
Not applicable
Enterprise
  • 3.216.35.97
  • 34.198.90.241
  • 44.236.60.231
  • 54.147.40.62
  • 100.20.239.77
  • 3.66.99.198
  • 3.64.201.167
  • 3.123.104.182
  • 3.125.132.33
  • 3.7.235.227
  • 13.200.113.188
  • 35.154.198.69
  • 3.6.122.214
info
All the outbound traffic is routed through these RudderStack IPs.

Setup

  1. In your RudderStack dashboard, add a source. Then select BigQuery Stream (All Events) from the list of destinations.
  2. Assign a name to your destination and click Continue.

Connection settings

Setting
Description
Project IDSpecify the project ID where the BigQuery database is located.
LocationSpecify the GCP region for your dataset.
NamespaceSpecify the schema name where RudderStack creates all tables. You cannot change the namespace later.

info
If you don’t specify any namespace, RudderStack sets the namespace to the source name, by default.
CredentialsEnter the GCP service account credentials JSON as created in the Set up the RudderStack service account section above.
Partition ColumnSpecify how you want to partition your tables by choosing from the following options:

OptionDescription
Ingestion TimeTime at which BigQuery ingests the data. See Ingestion time partitioning for more information.
Loaded AtTime at which RudderStack loads the data in the warehouse (loaded_at column).
Received AtTime at which RudderStack receives the data (received_at column).
TimestampTimestamp calculated by RudderStack to account for the client-side clock skew (timestamp column).
Sent AtTime at which the data was sent from the client to RudderStack (sent_at column).
Original TimestampTime at which the data was generated at the source (original_timestamp column).
See Time-unit column partitioning for more information on how BigQuery puts the data into the partition based on the values in the Loaded At, Received At, Timestamp, Sent At, and Original Timestamp columns.
Partition TypeSpecify the partition’s granularity level from the dropdown. RudderStack provides two options - Hour and Day.
warning

Partition settings for existing destinations are not editable

You cannot edit the partition settings for existing BigQuery destinations through the dashboard. RudderStack automatically sets their partitioning type to Ingestion time partitioning with a day level granularity.

  • To change your partition configuration, RudderStack recommends creating a new BigQuery destination and deprecating the old one.
  • To apply partition changes to an existing destination, contact RudderStack Support.

Advanced settings

Setting
Description
Skip Users TableThis destination does not support the users table.
Skip Tracks TableToggle on this setting to skip sending events to the tracks table.
Skip Views CreationToggle on this setting to disable views creation in BigQuery. See Partitioned tables and views for more information.

warning
This setting is configurable only while creating a new BigQuery destination and cannot be changed later.
JSON ColumnsThis setting lets you ingest semi-structured event data not defined by a fixed schema. Specify the required JSON column paths in the dot notation, separated by commas.

See the JSON Column Support guide for more information.

How RudderStack creates the dataset

RudderStack uses the source name (written in snake case, for example, source_name) to create a dataset in BigQuery.

See the Warehouse Schema guide for more details on the tables and columns created by RudderStack.

info
By default, RudderStack uses the partitioned tables method to ingest data into BigQuery.

Partitioned tables

RudderStack creates ingestion-time partition tables based on the load date, so you can take advantage of it to query a subset of data.

For information on how RudderStack creates these tables on load, see the Creating partitioned tables section of the BigQuery documentation.

warning
RudderStack does not discard duplicate data while loading it into BigQuery.

Views

In addition to tables, RudderStack creates a view (<table_name>_view) for every table for de-duplication purposes, ensuring that queried events are unique and contain the latest records.

Note that:

  • RudderStack recommends using the corresponding view (containing the events from the last 60 days) to avoid duplicate events in your query results.
  • Since BigQuery views are merely logical views and are not cached, you can create a native table from it to save costs - by avoiding running the query that defines the view every time.
  • You can modify the view query to change the time window of the view - the default value is set to 60 days.

Skip views creation

Toggle on the Skip Views Creation connection setting to disable creating views in BigQuery.

RudderStack recommends skipping views creation only if:

  • You do not need views and want to take care of deduplication by yourself.
  • You have alternative deduplication methods (for example, partition columns and types).
  • You want a materialized view instead of a logical view.
  • You have any security reasons and want to limit access to the tables.

FAQ

I am getting an “Failed to add columns for table <table_name> in the name ” error even though the table column limit is not reached. How do I resolve this issue?

According to BigQuery documentation, the maximum columns in a table, query result, or view definition cannot exceed 10000 columns. This includes recently deleted columns that persist in the total columns quota until it resets.

To circumvent the “Failed to add columns for table <table_name> in the name ” error when the table column limit is not reached, you can create a new table using CLONE and take a backup of the existing table to resolve the issue:

  1. Create new table using CLONE by using the below query:
CREATE OR REPLACE TABLE <YOUR_PROJECT>.dataset.<NEW_TABLE> CLONE <YOUR_PROJECT>.dataset.<OLD_TABLE>;
  1. Rename and keep the current table as backup:
ALTER TABLE <YOUR_PROJECT>.dataset.<OLD_TABLE> RENAME TO <OLD_TABLE_BACKUP>;
  1. Rename the new table to match the old table name:
ALTER TABLE <YOUR_PROJECT>.dataset.<NEW_TABLE> RENAME TO <OLD_TABLE>;

Where do I add the allowlisted IPs in BigQuery?

By default, BigQuery is accessible via publicly accessible Google APIs. As such, allowlisting any IPs is not required. However, if your VPC service restricts the BigQuery APIs, you will need to allowlist the IPs by setting up network access control for BigQuery.

What happens if my custom property name matches a standard RudderStack property?

RudderStack normalizes property keys to snake case when building warehouse columns.

If a property name (in properties or traits) matches a standard RudderStack property, RudderStack drops your property value in favor of the standard mapping.

For example, the user_id column is populated from root-level userId of an event, not from a separate properties.user_id field, so if you send a property named user_id, its value will not appear in the table.

Note that warehouse loads will still complete successfully, which means you might not see a sync failure when this happens. To capture the data under another name, use a different property (for example user_id_v2) or rename the field using a transformation.

warning

This behavior is not the same as SQL reserved keyword handling.

See How RudderStack handles SQL reserved keywords for more information.

How does RudderStack handle SQL reserved keywords in dataset, table, or column names?

BigQuery treats certain words as SQL reserved keywords. If you use them in event names, traits, or properties, RudderStack prefixes an underscore (_) when creating the corresponding tables or columns in your dataset.

Integers are not allowed at the start of a schema or table name. In those cases, RudderStack also prefixes the name with an underscore. For example, 25dollarpurchase becomes _25dollarpurchase.

For BigQuery-specific keyword lists, see the BigQuery documentation.

info
Conflicts with standard RudderStack properties follow discard rules, not underscore prefixing.

When sending data into a data warehouse, how can I change the table where this data is sent?

Set the Namespace field in the destination settings to change the table. If you don’t specify any namespace, RudderStack sets the namespace to the source name, by default.

How can I understand what data is populated in each column?

See the Warehouse Schema guide for details on how RudderStack generates the schema in the warehouse and populates the data in each column.

Do I need to stop the running pipeline to change my sync frequency? Or will the new change be effective even without stopping the pipeline?

To change the sync frequency, you need not stop the pipeline.

When configuring the BigQuery destination, where does Google use the credentials JSON from?

BigQuery uses the credentials JSON from the dashboard configuration when setting up the destination.

For more information, refer to the Set up the RudderStack service account section.

When configuring the BigQuery destination, should the user permissions be set for the specific dataset or the whole project?

You need to set the user permissions for the whole project. Otherwise, you may encounter issues.

How long are the failed syncs retried before being aborted?

RudderStack retries the failed syncs for up to 3 hours before aborting them. See this FAQ for more information.


Questions? We're here to help.

Join the RudderStack Slack community or email us for support