Google BigQuery is an industry-leading, fully-managed cloud data warehouse that lets you efficiently store and analyze petabytes of data.
RudderStack lets you configure BigQuery as a destination to which you can send your event data seamlessly.
Before you set up BigQuery as a destination in RudderStack, follow these steps to set up your BigQuery project:
- Create a Google Cloud Platform (GCP) project if you don't have one already. For more details, refer to this BigQuery documentation.
- Enable the BigQuery API for your existing project if it's not done already. For more information, refer to this BigQuery documentation.
- Log into your BigQuery console. Copy the project ID as shown in the following image:
- Create a new Google Cloud Storage (GCS) bucket or provide an existing one to store files before loading the data into your BigQuery instance.
For RudderStack to successfully send events to your BigQuery instance, you also need to set up a service account with the necessary permissions.
Follow the steps below to set up a service account:
- Create a new service account by going to IAM & Admin > Service Accounts.
- Add the service account permissions as specified below:
- Add the
Storage Object Creatorand
Storage Object Viewerroles to the account.
- Add the
BigQuery Job Userand
BigQuery Data Ownerroles to the account.
- Create a key for the service account with JSON as the type and store it.
- Create and download the private JSON key required for configuring BigQuery as a destination in RudderStack, as shown:
To enable network access to RudderStack, you will need to allowlist the following RudderStack IPs:
To send event data to SQL Server, you first need to add it as a destination in RudderStack and connect it to your data source. Once the destination is enabled, events will automatically start flowing to SQL Server via RudderStack.
To configure SQL Server as a destination in RudderStack, follow these steps:
- In your RudderStack dashboard, set up the data source. Then, select BigQuery from the list of destinations.
- Assign a name to your destination and then click on Next.
Enter the following credentials in the Connection Credentials page:
- Project: The GCP project ID where the BigQuery database is located.
- Location: The GCP region for your dataset.
- Staging GCS Storage Bucket Name: The name of the storage bucket as specified in the Setting up the BigQuery project section.
- Prefix: If specified, RudderStack creates a folder in the bucket with this prefix and pushes all the data within that folder.
- Namespace: Enter the schema name where RudderStack will create all the tables. If you don't specify any namespace, RudderStack will set this to the source name, by default.
- Credentials: Your GCP service account credentials JSON as created in the Setting up the service account for RudderStack section.
- Sync Frequency: Specify how often RudderStack should sync the data to your BigQuery dataset.
- Sync Starting At: This optional setting lets you specify the particular time of the day (in UTC) when you want RudderStack to sync the data to BigQuery.
- Exclude Window: This optional setting lets you set a time window when RudderStack will not sync the data to your database.
- JSON Columns: Use this optional setting to specify the required JSON column paths in dot notation, separated by commas. This option applies to all the incoming
trackevents for this destination.
RudderStack uses the source name (written in snake case, for example,
source_name) to create a dataset in BigQuery.
RudderStack supports two modes by which data is ingested into BigQuery:
- Partitioned tables (default method)
If you are using RudderStack Open Source or hosting the RudderStack data plane, you can manage these modes via the environment variable
RSERVER_WAREHOUSE_BIGQUERY_IS_DEDUP_ENABLED, as shown in the following table:
RudderStack creates ingestion-time partition tables based on the load date, so you can take advantage of it to query a subset of data.
In addition to tables, RudderStack creates a view (
<table_name>_view) for every table for de-duplication purposes.
It is highly recommended that you use the corresponding view (containing the events from the last 60 days) to avoid duplicate events in your query results. Since BigQuery views are merely logical views and are not cached, you can create a native table from it to save money - by avoiding running the query that defines the view every time.
In this method, RudderStack automatically discards any duplicate events while loading them into the BigQuery tables. To enable this method, set the environment variable
RSERVER_WAREHOUSE_BIGQUERY_IS_DEDUP_ENABLED in your RudderStack data plane setup to
trackstable every 30 minutes to upload any new events.
You can easily switch from using the deduplication method to ingest your data in BigQuery to the partitioned tables by setting the environment variable
RSERVER_WAREHOUSE_BIGQUERY_IS_DEDUP_ENABLED in your RudderStack setup to
RudderStack then does the following:
users_viewwhile loading the
userstable, if not already present.
- Creates the views for the new
- All the other skipped views are not regenerated. You can either manually create those views or contact us to get them created.
There are some limitations when it comes to using reserved words in a schema, table, or column names. If such words are used in event names, traits or properties, they will be prefixed with a
_when RudderStack creates tables or columns for them in your schema.
Besides, integers are not allowed at the start of the schema or table name. Hence, such schema, column or table names will be prefixed with a
'25dollarpurchase' will be changed to
By default, RudderStack sends the data to the table/dataset based on the source it is connected to. For example, if the source is Google Tag Manager, RudderStack sets the schema name as
gtm_*. However, you can override this behavior by setting the Namespace field in the BigQuery destination settings, as shown:
I'm looking to send data to BigQuery through RudderStack and I'm trying to understand what data is populated in each column. How do I go about this?
Refer to the Warehouse Schema documentation for details on how RudderStack generates the schema in the warehouse and populates the data in each column.
I am trying to load data into my BigQuery destination and I get the error "Cannot read and write in different locations". What should I do?
Make sure that both your BigQuery dataset and the bucket have the same region.
When piping data to a BigQuery destination, I can set the bucket but not a folder within the bucket. Is there a way to put Rudderstack data in a specific bucket folder?
Yes, you can set the desired folder name in the Prefix field while setting up your BigQuery destination in RudderStack.
The near-realtime BigQuery syncing feature is currently under development and is planned to be released in the coming months. Unfortunately, Event Replay is not a part of open-source RudderStack currently.
If you're using open source RudderStack, the minimum sync frequency is 30 minutes. If you're self-hosting the data plane or using RudderStack Cloud Pro / Enterprise, you can tweak the config to set
0 and determine the best possible value for near real-time sync. For more information, refer to this FAQ.
Do I need to stop the running pipeline to change my sync frequency? Or will the new change be effective even without stopping the pipeline?
To change the sync frequency, you need not stop the pipeline.
BigQuery uses the credentials JSON from the dashboard configuration when setting up the destination. For more information, refer to the Setting up the service account for RudderStack section.
When configuring the BigQuery destination, should the user permissions be set for the specific dataset or the whole project?
You need to set the user permissions for the whole project. Otherwise, you may encounter issues.
RudderStack retries the failed syncs for up to 3 hours before aborting them. For more information, refer to this FAQ.