Files
supabase/apps/docs/content/guides/database/replication/bigquery.mdx
2026-06-15 12:22:46 +02:00

181 lines
10 KiB
Plaintext

---
id: 'bigquery-destination'
title: 'BigQuery destination'
description: 'Configure BigQuery as an external replication (ETL) destination.'
subtitle: 'Replicate Supabase Postgres tables to BigQuery.'
sidebar_label: 'BigQuery'
---
<Admonition type="caution" title="Private Alpha">
External replication (ETL) is currently in private alpha. Managed pipelines run through Supabase ETL. Access is limited and features may change.
</Admonition>
[BigQuery](https://cloud.google.com/bigquery) is Google's fully managed data warehouse. You can replicate your database tables to BigQuery for analytics and reporting.
## Prepare GCP resources
Before configuring BigQuery as a destination, set up the following in Google Cloud Platform:
1. **Google Cloud Platform (GCP) account**: [Sign up for GCP](https://cloud.google.com/gcp) if you don't have one
2. **BigQuery dataset**: Create a [BigQuery dataset](https://cloud.google.com/bigquery/docs/datasets-intro) in your GCP project
- Open the BigQuery console in GCP
- Select your project
- Click **Create Dataset**
- Provide a dataset ID, for example `supabase_replication`
3. **GCP service account key**: Create a [service account](https://cloud.google.com/iam/docs/keys-create-delete) with appropriate permissions
- Go to **IAM & Admin > Service Accounts**
- Click **Create Service Account**
- Grant the "BigQuery Data Editor" and "BigQuery Job User" roles
- Create and download the JSON key file
Required permissions:
- `bigquery.datasets.get`
- `bigquery.jobs.create`
- `bigquery.tables.create`
- `bigquery.tables.delete`
- `bigquery.tables.get`
- `bigquery.tables.getData`
- `bigquery.tables.list`
- `bigquery.tables.update`
- `bigquery.tables.updateData`
- `bigquery.routines.get`
- `bigquery.routines.list`
## Configure BigQuery as a destination
1. Navigate to the [**Database > Replication**](/dashboard/project/_/database/replication) section of the Dashboard
2. Click **Add destination**
3. Configure the general settings:
- **Destination name**: A name to identify this destination, for example "BigQuery Warehouse"
- **Publication**: The publication to replicate data from
- **Destination type**: Select **BigQuery**
4. Configure BigQuery-specific settings:
- **Project ID**: Your BigQuery project identifier, found in the GCP Console
- **Dataset ID**: The name of your BigQuery dataset, without the project ID
<Admonition type="note">
In the GCP Console, the dataset is shown as `project-id.dataset-id`. Enter only the part after the dot. For example, if you see `my-project.my_dataset`, enter `my_dataset`.
</Admonition>
- **Service Account Key**: Your GCP service account key in JSON format
5. Optionally expand **Advanced settings** for BigQuery-specific performance tuning:
| Setting | Default | Description |
| ------------------------ | ------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Connection pool size** | `4` connections | Size of the BigQuery Storage Write API connection pool. More connections allow more parallel writes, but consume more resources. |
| **Maximum staleness** | No staleness limit | Maximum allowed age, in minutes, for BigQuery cached metadata before reading base tables. Lower values improve freshness. Higher values can reduce query cost and latency. |
6. Click **Create and start** to begin replication
Your replication pipeline now starts copying data from your database to BigQuery.
## How it works
Once configured, replication to BigQuery:
1. Captures changes from your Postgres database (INSERT, UPDATE, DELETE, TRUNCATE operations)
2. Optimizes delivery automatically
3. Creates BigQuery tables automatically to match your Postgres schema
4. Streams data to BigQuery
## Source table requirements
BigQuery replication requires each source table to have a primary key, and the publication must include the primary-key columns. Supabase ETL declares those columns as the BigQuery destination primary key so BigQuery change data capture (CDC) can apply `UPSERT` and `DELETE` rows.
BigQuery primary keys are `NOT ENFORCED`, and BigQuery change data capture (CDC) supports composite primary keys with up to 16 columns. Your source primary key must stay unique and non-null because BigQuery uses it to match CDC rows.
Source tables must also use a BigQuery-compatible Postgres `REPLICA IDENTITY` setting. Most tables can keep the Postgres default, as long as they have a primary key and all primary-key columns are included in the publication.
| Source table setting | BigQuery support | Guidance |
| ------------------------------------------------ | ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `REPLICA IDENTITY DEFAULT` with a primary key | Supported | Recommended for most tables. BigQuery uses the replicated source primary key to apply upserts and deletes. |
| `REPLICA IDENTITY FULL` | Supported | Recommended for tables with large `text`, `jsonb`, `bytea`, or other values that Postgres may store out-of-line using TOAST, especially when those rows update. |
| `REPLICA IDENTITY USING INDEX` | Not supported | BigQuery change data capture (CDC) rows are keyed by the source primary key, not by an alternative unique index. |
| `REPLICA IDENTITY NOTHING` | Not supported | Updates and deletes do not include enough row identity for BigQuery to apply them safely. |
| `REPLICA IDENTITY DEFAULT` without a primary key | Not supported | BigQuery requires a source primary key. |
For a general explanation of how replica identity affects update and delete events, see [How does replica identity affect updates and deletes?](/docs/guides/database/replication/external-replication-faq#how-does-replica-identity-affect-updates-and-deletes).
For updates, Postgres does not always send a complete old row through logical replication. It can also mark unchanged toasted values as `unchanged toast` instead of resending the value. BigQuery change data capture (CDC) upserts require a complete new row because omitted columns are not preserved in the destination. The replication pipeline can reconstruct a complete update when the old row image contains the missing value, which is reliable with `REPLICA IDENTITY FULL`.
If a BigQuery pipeline fails with an error about a partial update row, set `REPLICA IDENTITY FULL` on the affected source table and restart the pipeline. Changing replica identity only affects new WAL records, so a retained update that was written before the change may still need to be skipped by recreating the pipeline or re-copying the affected table.
Check a table's current replica identity:
```sql
select
n.nspname as schema_name,
c.relname as table_name,
c.relreplident as replica_identity
from
pg_class as c
join pg_namespace as n on n.oid = c.relnamespace
where n.nspname = 'public' and c.relname = 'your_table';
```
The `replica_identity` value is `d` for default, `f` for full, `i` for index, and `n` for nothing.
Set full replica identity when a table has toasted columns and update replication must be reliable:
```sql
alter table public.your_table replica identity full;
```
`REPLICA IDENTITY FULL` increases WAL volume because Postgres logs the full old row for updates and deletes. Use it on tables where update correctness is more important than the extra replication overhead.
## How tables are structured
Due to BigQuery limitations, replicated tables use a versioned structure:
- The table you query is a **view**, for example `users`
- The actual data is stored in versioned tables with a `_version` suffix, for example `users_version`
- When a table is truncated in your database, a new version is created and the view automatically points to the latest version
This structure handles table truncations while maintaining query compatibility.
## Schema change support
Schema change support for BigQuery is currently in beta. Supabase ETL supports a limited set of schema changes for BigQuery while the feature is developed further.
Supported schema changes:
- Adding a nullable column
- Removing a column
- Renaming a column
- Dropping a `NOT NULL` constraint
- Setting or dropping supported column default metadata
Unsupported or limited schema changes:
- Changing a column's data type
- Adding `NOT NULL` with `SET NOT NULL`
- Filling existing rows for `ADD COLUMN ... DEFAULT`
- Unsupported default expressions
BigQuery requires added columns to be nullable. When a replicated `ADD COLUMN` includes a default, external replication (ETL) can apply supported default metadata for future rows, but BigQuery does not backfill existing rows through that DDL. Existing destination rows remain `NULL` unless you run a separate backfill.
Supported defaults are best-effort translations to BigQuery SQL. Unsupported defaults are skipped with a warning instead of failing replication.
## Limitations
- **Row size**: Limited to 10 MB per row due to BigQuery Storage Write API constraints
- **Primary keys**: Source tables must have a primary key, the replicated primary key can contain at most 16 columns, and BigQuery does not enforce key uniqueness
- **Replica identity**: Source tables must use primary-key replica identity or `REPLICA IDENTITY FULL`
- **Table names**: Source table names cannot start or end with `_` when replicating to BigQuery
- **Schema changes**: Limited to the supported schema changes listed above
## Additional resources
- [BigQuery documentation](https://cloud.google.com/bigquery/docs) - Official Google BigQuery documentation
- [BigQuery change data capture](https://cloud.google.com/bigquery/docs/change-data-capture) - BigQuery change data capture (CDC) requirements and limitations