Snowflake Iceberg Quickstart guide
This Quickstart guide is designed to introduce all the concepts you need to get started with Iceberg on Snowflake.
In brief, we'll cover:
- Creating an S3 bucket, and the AWS IAM role and policy that will enable Snowflake to use it
- Provisioning the Snowflake resources needed to access the bucket, including an External Volume
- Creating a simple data pipeline that uses Snowflake as a transformation engine, but keeps all data located in your own AWS account
The Quickstart assumes you have "root" access to the Snowflake account you'll be using (i.e. you have the accountadmin role), and also that you have all the relevant permissions on the AWS account. If you run into any permission-related errors, you will need to contact your account administrator to resolve them.
So let's get started!
Prerequisites
For this Quickstart we'll be using the Snowflake CLI and the AWS CLI. If you haven't used the Snowflake CLI before, you can use our Snowflake CLI Quickstart guide to get set up. For help installing and authenticating with the AWS CLI, see the documentation on AWS's website.
To confirm that your Snowflake CLI is configured, use the following command to test your connection.
snow connection test
Create an S3 bucket for Snowflake to use
The first step is to create an S3 bucket that lives in your own AWS account and which Snowflake will use to store the Iceberg tables.
Before creating this S3 bucket, we need to decide which region to put it in. You should create the S3 bucket in the same region that your Snowflake account is using; otherwise, you'll incur unnecessary data transfer fees.
To identify the AWS region that your Snowflake account is using, use the following query.
snow sql -q 'select current_region()'
You should see a value like the following, which shows that our Snowflake account is running in the AWS region us-east-2.
AWS_US_EAST_2
Use the following command to create the bucket in your target region.
aws s3 mb s3://corbett-iceberg-quickstart --region us-east-2
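If you want to confirm the bucket was created successfully, head-bucket exits silently with a zero status when the bucket exists and you have access to it (a quick optional check):

aws s3api head-bucket --bucket corbett-iceberg-quickstart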
Grant Snowflake access to the S3 bucket
To grant Snowflake access to the bucket, we need to provision the following resources:
- An AWS policy that allows an AWS role to read and write to the S3 bucket we just created
- An AWS role in our account that Snowflake will assume, and which has been granted the policy
- A Snowflake External Volume, which will provision an IAM user inside Snowflake's AWS account. Eventually we will grant that Snowflake-managed IAM user access to the AWS role created in our own AWS account
Additionally, we will need our AWS account ID. To retrieve it, use the following command.
aws sts get-caller-identity --query Account --output text
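We'll need this account ID again when writing the trust policy below, so it can be handy to capture it in a shell variable (the variable name here is just a suggestion):

#!/bin/bash
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
echo "$AWS_ACCOUNT_ID"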
Create an AWS IAM Policy granting permissions on the bucket
First we create an IAM Policy in our own account using the AWS CLI. Notice that we are using the bucket path in multiple places in this policy, and that we are also granting full access to all paths inside the bucket.
#!/bin/bash
POLICY_DOCUMENT='{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:DeleteObject",
        "s3:DeleteObjectVersion"
      ],
      "Resource": "arn:aws:s3:::corbett-iceberg-quickstart/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::corbett-iceberg-quickstart",
      "Condition": {
        "StringLike": {
          "s3:prefix": [
            "*"
          ]
        }
      }
    }
  ]
}'

aws iam create-policy \
  --policy-name iceberg-quickstart-tutorial \
  --policy-document "$POLICY_DOCUMENT"
The command above will produce an output that includes the Policy ARN (Amazon Resource Name). Make a note of this value as we will need it shortly.
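Policy ARNs also follow a predictable format, so if you lose the output you can reconstruct the ARN from your account ID (a sketch assuming the AWS_ACCOUNT_ID variable captured earlier):

#!/bin/bash
POLICY_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:policy/iceberg-quickstart-tutorial"
echo "$POLICY_ARN"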
Create an AWS IAM Role for Snowflake to use
Next, we create an AWS IAM Role inside our AWS account. Later we will enable Snowflake to assume this role.
To create the IAM role, we need to attach a trust policy to the role. This policy specifies which AWS entities can assume the role.
For now, we are just going to use a dummy value for the AWS account that can assume the role: specifically, our own AWS account ID. Eventually, we will modify this dummy value to enable the Snowflake AWS account to assume the role.
#!/bin/bash
TRUST_POLICY='{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Principal": {
        "AWS": "<YOUR_AWS_ACCOUNT_ID>"
      },
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<YOUR_AWS_ACCOUNT_ID>"
        }
      }
    }
  ]
}'

aws iam create-role \
  --role-name iceberg-quickstart-tutorial \
  --assume-role-policy-document "$TRUST_POLICY"
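We'll need this role's ARN when creating the Snowflake External Volume below. If you didn't note it from the create-role output, you can retrieve it at any time:

#!/bin/bash
aws iam get-role \
  --role-name iceberg-quickstart-tutorial \
  --query Role.Arn \
  --output text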
Attach the policy to the role
Using the Policy ARN from the previous step, we can attach the policy to the role we just created.
#!/bin/bash
aws iam attach-role-policy \
--role-name iceberg-quickstart-tutorial \
--policy-arn arn:aws:iam::730335345979:policy/iceberg-quickstart-tutorial
So far we have:
- Created an IAM policy that grants permissions to read and write data in an S3 bucket
- Created an IAM role with an associated trust policy. The trust policy is just a placeholder - we will modify it shortly.
- Attached the IAM policy to the IAM role
Create a Snowflake External Volume
A Snowflake external volume will automatically provision an IAM User in Snowflake's AWS account. We will soon grant that IAM User permission to assume the IAM Role we created previously.
Use the following query to create the external volume.
create or replace external volume iceberg_quickstart
  storage_locations = (
    (
      NAME = 'iceberg-quickstart'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://corbett-iceberg-quickstart'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::730335345979:role/iceberg-quickstart-tutorial'
      STORAGE_AWS_EXTERNAL_ID = 'iceberg-quickstart-external-id'
    )
  );
Note the following values that we chose:
- STORAGE_BASE_URL: the path of the bucket we created previously
- STORAGE_AWS_ROLE_ARN: the ARN of the AWS Role we created in the previous step
- STORAGE_AWS_EXTERNAL_ID: an arbitrary value; you can choose whatever you like
Next, we need to retrieve the STORAGE_AWS_IAM_USER_ARN associated with the external volume. We will use that value in the next step to enable Snowflake to assume the role that has access to our S3 bucket.
Use the following command to retrieve that value.
snow sql -q 'describe external volume iceberg_quickstart'
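The output includes a STORAGE_LOCATION_1 property whose value embeds the STORAGE_AWS_IAM_USER_ARN field. As a convenience, you can filter the output for it (a rough sketch; the exact layout of the describe output may vary by CLI version):

snow sql -q 'describe external volume iceberg_quickstart' | grep -o 'STORAGE_AWS_IAM_USER_ARN[^,]*'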
Modify the role's trust policy to enable the Snowflake External Volume to assume the role
The last step to enabling Snowflake to authenticate with our S3 bucket is to modify the Trust Policy associated with the AWS Role we created.
Modify the trust policy like so, replacing the values as described in the script.
#!/bin/bash
TRUST_POLICY='{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Principal": {
        "AWS": "<REPLACE_WITH_STORAGE_AWS_IAM_USER_ARN>"
      },
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<REPLACE_WITH_STORAGE_AWS_EXTERNAL_ID>"
        }
      }
    }
  ]
}'

aws iam update-assume-role-policy \
  --role-name iceberg-quickstart-tutorial \
  --policy-document "$TRUST_POLICY"
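To check that the trust relationship is wired up end to end, Snowflake provides a built-in verification function for external volumes. A quick way to run it from the CLI (using the volume name we chose above):

snow sql -q "select system\$verify_external_volume('iceberg_quickstart')"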
Choosing a catalog
In the next step we'll create a database that references the external volume we created previously.
By referencing the external volume, any Iceberg tables created inside the database will store their contents in the S3 bucket in our own AWS account.
However, we also still need to choose which catalog to use.
For a detailed description of catalogs, you can see the Iceberg documentation.
But in brief, the catalog is the component responsible for tracking all of the tables in the S3 bucket. It enables commands like "show tables in database…". Whenever a new table is created, the catalog needs to know about it.
There are two types of catalog that we can choose. Either we can use Snowflakeâs native catalog, or we can specify an external catalog. There are many external catalogs, including the AWS Glue Catalog, Nessie, and Databricks Unity Catalog, among others.
The choice of catalog is important, as it significantly affects the features available when using Snowflake Iceberg tables.
If we choose to use the Snowflake native catalog, then Snowflake Iceberg tables work similarly to how native Snowflake tables function. Most importantly, we can both read and write data to those tables.
But if we choose to use an external catalog, then Snowflake can only read data from the tables contained in the catalog. Snowflake does not support writing or modifying any of the tables tracked via an external catalog.
For a more detailed comparison of the native catalog and external catalogs, see the documentation from Snowflake.
So the choice of which catalog to use depends on the use case. If you anticipate Snowflake being exclusively a "reader" of the Iceberg tables, while using a different engine to produce those tables, then using an external catalog is fine. However, if you anticipate using Snowflake to transform the Iceberg tables in your S3 bucket, then you'll need to use the Snowflake native catalog.
For this tutorial, we'll use the Snowflake native catalog, as setting up an external catalog is beyond the scope of this guide.
Creating an Iceberg table
First we create a database that's configured to use the external volume. Specify the external volume name we chose earlier, and specify SNOWFLAKE as the catalog.
create database corbett;
create schema corbett.iceberg_tests;
alter database corbett set
external_volume = iceberg_quickstart
catalog = 'SNOWFLAKE';
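To confirm that the settings took effect, you can inspect the database-level parameters we just set (a quick optional check):

snow sql -q "show parameters like 'EXTERNAL_VOLUME' in database corbett"
snow sql -q "show parameters like 'CATALOG' in database corbett"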
Next we can create a table specifying base_location, and populate it with some mock data.
create iceberg table corbett.iceberg_tests.first_iceberg_table (a int)
base_location = 'first_iceberg_table';
insert into corbett.iceberg_tests.first_iceberg_table (a) values (1), (2), (3);
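Before checking S3, you can sanity-check the table from Snowflake itself; this query should return the three rows we just inserted.

snow sql -q 'select * from corbett.iceberg_tests.first_iceberg_table'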
To validate that Snowflake has stored the contents of the table in our own S3 bucket, we can list the files inside the Iceberg table's directory on S3.
aws s3 ls s3://corbett-iceberg-quickstart/first_iceberg_table/metadata/
Querying from a read-only engine
In the last section of this Quickstart we are going to demonstrate reading from the Iceberg table using a separate query engine.
The engine we'll be using is duckdb. For help installing it, see the duckdb website. Also ensure you're using duckdb version 1.0 or later.
To start a session, run the command:
duckdb
Next, install and load the iceberg and httpfs extensions.
install 'iceberg';
install 'httpfs';
load 'iceberg';
load 'httpfs';
Create a secret that allows access to the S3 bucket. If you have been using the aws commands throughout this Quickstart, then you should be able to retrieve your AWS access key ID and secret access key from ~/.aws/credentials.
create secret corbett_iceberg (type S3, key_id '...', secret '...', region 'us-east-2');
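Alternatively, recent duckdb versions can pick up credentials from the standard AWS provider chain rather than pasting keys inline. This is a sketch, and assumes you also install duckdb's aws extension, which supplies the credential_chain provider:

install 'aws';
load 'aws';
create secret corbett_iceberg (type S3, provider credential_chain, region 'us-east-2');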
Finally, you should now be able to query the table directly from duckdb.
select count(1)
from iceberg_scan(
's3://corbett-iceberg-quickstart/first_iceberg_table',
allow_moved_paths = true
);
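If everything is configured correctly, the query should return a count of 3, matching the three rows we inserted from Snowflake earlier.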
Summary
In this Quickstart guide we introduced all the concepts you need to get started with Iceberg on Snowflake.
This included provisioning an S3 bucket, and granting Snowflake access to use that bucket via IAM Roles and Policies.
We also created a database configured to store Iceberg tables on the external volume, using the Snowflake native Iceberg catalog.
Finally, we created a table in this database, and used an external engine for reading the data from that table.