mirror of
https://github.com/supabase/supabase.git
synced 2026-07-02 23:24:21 +08:00
181 lines
10 KiB
Plaintext
181 lines
10 KiB
Plaintext
---
|
|
id: 'bigquery-destination'
|
|
title: 'BigQuery destination'
|
|
description: 'Configure BigQuery as an external replication (ETL) destination.'
|
|
subtitle: 'Replicate Supabase Postgres tables to BigQuery.'
|
|
sidebar_label: 'BigQuery'
|
|
---
|
|
|
|
<Admonition type="caution" title="Private Alpha">
|
|
|
|
External replication (ETL) is currently in private alpha. Managed pipelines run through Supabase ETL. Access is limited and features may change.
|
|
|
|
</Admonition>
|
|
|
|
[BigQuery](https://cloud.google.com/bigquery) is Google's fully managed data warehouse. You can replicate your database tables to BigQuery for analytics and reporting.
|
|
|
|
## Prepare GCP resources
|
|
|
|
Before configuring BigQuery as a destination, set up the following in Google Cloud Platform:
|
|
|
|
1. **Google Cloud Platform (GCP) account**: [Sign up for GCP](https://cloud.google.com/gcp) if you don't have one
|
|
|
|
2. **BigQuery dataset**: Create a [BigQuery dataset](https://cloud.google.com/bigquery/docs/datasets-intro) in your GCP project
|
|
- Open the BigQuery console in GCP
|
|
- Select your project
|
|
- Click **Create Dataset**
|
|
- Provide a dataset ID, for example `supabase_replication`
|
|
|
|
3. **GCP service account key**: Create a [service account](https://cloud.google.com/iam/docs/keys-create-delete) with appropriate permissions
|
|
- Go to **IAM & Admin > Service Accounts**
|
|
- Click **Create Service Account**
|
|
- Grant the "BigQuery Data Editor" and "BigQuery Job User" roles
|
|
- Create and download the JSON key file
|
|
|
|
Required permissions:
|
|
|
|
- `bigquery.datasets.get`
|
|
- `bigquery.jobs.create`
|
|
- `bigquery.tables.create`
|
|
- `bigquery.tables.delete`
|
|
- `bigquery.tables.get`
|
|
- `bigquery.tables.getData`
|
|
- `bigquery.tables.list`
|
|
- `bigquery.tables.update`
|
|
- `bigquery.tables.updateData`
|
|
- `bigquery.routines.get`
|
|
- `bigquery.routines.list`
|
|
|
|
## Configure BigQuery as a destination
|
|
|
|
1. Navigate to the [**Database > Replication**](/dashboard/project/_/database/replication) section of the Dashboard
|
|
2. Click **Add destination**
|
|
|
|
3. Configure the general settings:
|
|
- **Destination name**: A name to identify this destination, for example "BigQuery Warehouse"
|
|
- **Publication**: The publication to replicate data from
|
|
- **Destination type**: Select **BigQuery**
|
|
|
|
4. Configure BigQuery-specific settings:
|
|
- **Project ID**: Your BigQuery project identifier, found in the GCP Console
|
|
- **Dataset ID**: The name of your BigQuery dataset, without the project ID
|
|
|
|
<Admonition type="note">
|
|
|
|
In the GCP Console, the dataset is shown as `project-id.dataset-id`. Enter only the part after the dot. For example, if you see `my-project.my_dataset`, enter `my_dataset`.
|
|
|
|
</Admonition>
|
|
|
|
- **Service Account Key**: Your GCP service account key in JSON format
|
|
|
|
5. Optionally expand **Advanced settings** for BigQuery-specific performance tuning:
|
|
|
|
| Setting | Default | Description |
|
|
| ------------------------ | ------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| **Connection pool size** | `4` connections | Size of the BigQuery Storage Write API connection pool. More connections allow more parallel writes, but consume more resources. |
|
|
| **Maximum staleness** | No staleness limit | Maximum allowed age, in minutes, for BigQuery cached metadata before reading base tables. Lower values improve freshness. Higher values can reduce query cost and latency. |
|
|
|
|
6. Click **Create and start** to begin replication
|
|
|
|
Your replication pipeline now starts copying data from your database to BigQuery.
|
|
|
|
## How it works
|
|
|
|
Once configured, replication to BigQuery:
|
|
|
|
1. Captures changes from your Postgres database (INSERT, UPDATE, DELETE, TRUNCATE operations)
|
|
2. Optimizes delivery automatically
|
|
3. Creates BigQuery tables automatically to match your Postgres schema
|
|
4. Streams data to BigQuery
|
|
|
|
## Source table requirements
|
|
|
|
BigQuery replication requires each source table to have a primary key, and the publication must include the primary-key columns. Supabase ETL declares those columns as the BigQuery destination primary key so BigQuery change data capture (CDC) can apply `UPSERT` and `DELETE` rows.
|
|
|
|
BigQuery primary keys are `NOT ENFORCED`, and BigQuery change data capture (CDC) supports composite primary keys with up to 16 columns. Your source primary key must stay unique and non-null because BigQuery uses it to match CDC rows.
|
|
|
|
Source tables must also use a BigQuery-compatible Postgres `REPLICA IDENTITY` setting. Most tables can keep the Postgres default, as long as they have a primary key and all primary-key columns are included in the publication.
|
|
|
|
| Source table setting | BigQuery support | Guidance |
|
|
| ------------------------------------------------ | ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `REPLICA IDENTITY DEFAULT` with a primary key | Supported | Recommended for most tables. BigQuery uses the replicated source primary key to apply upserts and deletes. |
|
|
| `REPLICA IDENTITY FULL` | Supported | Recommended for tables with large `text`, `jsonb`, `bytea`, or other values that Postgres may store out-of-line using TOAST, especially when those rows update. |
|
|
| `REPLICA IDENTITY USING INDEX` | Not supported | BigQuery change data capture (CDC) rows are keyed by the source primary key, not by an alternative unique index. |
|
|
| `REPLICA IDENTITY NOTHING` | Not supported | Updates and deletes do not include enough row identity for BigQuery to apply them safely. |
|
|
| `REPLICA IDENTITY DEFAULT` without a primary key | Not supported | BigQuery requires a source primary key. |
|
|
|
|
For a general explanation of how replica identity affects update and delete events, see [How does replica identity affect updates and deletes?](/docs/guides/database/replication/external-replication-faq#how-does-replica-identity-affect-updates-and-deletes).
|
|
|
|
For updates, Postgres does not always send a complete old row through logical replication. It can also mark unchanged toasted values as `unchanged toast` instead of resending the value. BigQuery change data capture (CDC) upserts require a complete new row because omitted columns are not preserved in the destination. The replication pipeline can reconstruct a complete update when the old row image contains the missing value, which is reliable with `REPLICA IDENTITY FULL`.
|
|
|
|
If a BigQuery pipeline fails with an error about a partial update row, set `REPLICA IDENTITY FULL` on the affected source table and restart the pipeline. Changing replica identity only affects new WAL records, so a retained update that was written before the change may still need to be skipped by recreating the pipeline or re-copying the affected table.
|
|
|
|
Check a table's current replica identity:
|
|
|
|
```sql
|
|
select
|
|
n.nspname as schema_name,
|
|
c.relname as table_name,
|
|
c.relreplident as replica_identity
|
|
from
|
|
pg_class as c
|
|
join pg_namespace as n on n.oid = c.relnamespace
|
|
where n.nspname = 'public' and c.relname = 'your_table';
|
|
```
|
|
|
|
The `replica_identity` value is `d` for default, `f` for full, `i` for index, and `n` for nothing.
|
|
|
|
Set full replica identity when a table has toasted columns and update replication must be reliable:
|
|
|
|
```sql
|
|
alter table public.your_table replica identity full;
|
|
```
|
|
|
|
`REPLICA IDENTITY FULL` increases WAL volume because Postgres logs the full old row for updates and deletes. Use it on tables where update correctness is more important than the extra replication overhead.
|
|
|
|
## How tables are structured
|
|
|
|
Due to BigQuery limitations, replicated tables use a versioned structure:
|
|
|
|
- The table you query is a **view**, for example `users`
|
|
- The actual data is stored in versioned tables with a `_version` suffix, for example `users_version`
|
|
- When a table is truncated in your database, a new version is created and the view automatically points to the latest version
|
|
|
|
This structure handles table truncations while maintaining query compatibility.
|
|
|
|
## Schema change support
|
|
|
|
Schema change support for BigQuery is currently in beta. Supabase ETL supports a limited set of schema changes for BigQuery while the feature is developed further.
|
|
|
|
Supported schema changes:
|
|
|
|
- Adding a nullable column
|
|
- Removing a column
|
|
- Renaming a column
|
|
- Dropping a `NOT NULL` constraint
|
|
- Setting or dropping supported column default metadata
|
|
|
|
Unsupported or limited schema changes:
|
|
|
|
- Changing a column's data type
|
|
- Adding `NOT NULL` with `SET NOT NULL`
|
|
- Filling existing rows for `ADD COLUMN ... DEFAULT`
|
|
- Unsupported default expressions
|
|
|
|
BigQuery requires added columns to be nullable. When a replicated `ADD COLUMN` includes a default, external replication (ETL) can apply supported default metadata for future rows, but BigQuery does not backfill existing rows through that DDL. Existing destination rows remain `NULL` unless you run a separate backfill.
|
|
|
|
Supported defaults are best-effort translations to BigQuery SQL. Unsupported defaults are skipped with a warning instead of failing replication.
|
|
|
|
## Limitations
|
|
|
|
- **Row size**: Limited to 10 MB per row due to BigQuery Storage Write API constraints
|
|
- **Primary keys**: Source tables must have a primary key, the replicated primary key can contain at most 16 columns, and BigQuery does not enforce key uniqueness
|
|
- **Replica identity**: Source tables must use primary-key replica identity or `REPLICA IDENTITY FULL`
|
|
- **Table names**: Source table names cannot start or end with `_` when replicating to BigQuery
|
|
- **Schema changes**: Limited to the supported schema changes listed above
|
|
|
|
## Additional resources
|
|
|
|
- [BigQuery documentation](https://cloud.google.com/bigquery/docs) - Official Google BigQuery documentation
|
|
- [BigQuery change data capture](https://cloud.google.com/bigquery/docs/change-data-capture) - BigQuery change data capture (CDC) requirements and limitations
|