Snowflake Iceberg Quickstart guide
This Quickstart guide is designed to introduce all the concepts you need to get started with Iceberg on Snowflake.
In brief, we'll cover:
- Creating an S3 bucket, and the AWS IAM role and policy that will enable Snowflake to use it
- Provisioning the Snowflake resources needed to access the bucket, including an External Volume
- Creating a simple data pipeline that uses Snowflake as a transformation engine, but keeps all data located in your own AWS account
The Quickstart assumes you have "root" access to the Snowflake account you'll be using (i.e. you have the accountadmin role), and also that you have all the relevant permissions on the AWS account. If you run into any permission-related errors, you will need to contact your account administrator to resolve them.
So let's get started!
Prerequisites
For this Quickstart we'll be using the Snowflake CLI and the AWS CLI. If you haven't used the Snowflake CLI before, you can use our Snowflake CLI Quickstart guide to get set up. For help installing and authenticating with the AWS CLI, see the documentation on AWS's website.
To confirm that your Snowflake CLI is configured, use the following command to test your connection.
snow connection test
Create an S3 bucket for Snowflake to use
The first step is to create an S3 bucket that lives in your own AWS account and which Snowflake will use to store the Iceberg tables.
Before creating this S3 bucket, we need to decide which region to put it in. You should create the S3 bucket in the same region that your Snowflake account is using; otherwise, you'll incur unnecessary data transfer fees.
To identify the AWS region that your Snowflake account is using, use the following query.
snow sql -q 'select current_region()'
You should see a value like the following, which shows that our Snowflake account is running in the AWS region us-east-2.
AWS_US_EAST_2
Use the following command to create the bucket in your target region.
aws s3 mb s3://corbett-iceberg-quickstart --region us-east-2
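If you want to confirm the bucket was created successfully, head-bucket exits silently with a zero status when the bucket exists and you have access to it (a quick optional check):

aws s3api head-bucket --bucket corbett-iceberg-quickstart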
Grant Snowflake access to the S3 bucket
To grant Snowflake access to the bucket, we need to provision the following resources:
- An AWS policy that allows an AWS role to read and write to the S3 bucket we just created
- An AWS role in our account that Snowflake will assume, and which has been granted the policy
- A Snowflake External Volume, which will provision an IAM user inside Snowflake's AWS account. Eventually we will grant that Snowflake-managed IAM user access to the AWS role created in our own AWS account
Additionally, we will need our AWS account ID. To retrieve it, use the following command.
aws sts get-caller-identity --query Account --output text
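We'll need this account ID again when writing the trust policy below, so it can be handy to capture it in a shell variable (the variable name here is just a suggestion):

#!/bin/bash
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
echo "$AWS_ACCOUNT_ID"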
Create an AWS IAM Policy granting permissions on the bucket
First we create an IAM Policy in our own account using the AWS CLI. Notice that we are using the bucket path in multiple places in this policy, and that we are also granting full access to all paths inside the bucket.
#!/bin/bash
POLICY_DOCUMENT='{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:DeleteObject",
        "s3:DeleteObjectVersion"
      ],
      "Resource": "arn:aws:s3:::corbett-iceberg-quickstart/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::corbett-iceberg-quickstart",
      "Condition": {
        "StringLike": {
          "s3:prefix": [
            "*"
          ]
        }
      }
    }
  ]
}'

aws iam create-policy \
  --policy-name iceberg-quickstart-tutorial \
  --policy-document "$POLICY_DOCUMENT"
The command above will produce an output that includes the Policy ARN (Amazon Resource Name). Make a note of this value as we will need it shortly.
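Policy ARNs also follow a predictable format, so if you lose the output you can reconstruct the ARN from your account ID (a sketch assuming the AWS_ACCOUNT_ID variable captured earlier):

#!/bin/bash
POLICY_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:policy/iceberg-quickstart-tutorial"
echo "$POLICY_ARN"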
Create an AWS IAM Role for Snowflake to use
Next, we create an AWS IAM Role inside our AWS account. Later we will enable Snowflake to assume this role.
To create the IAM role, we need to attach a trust policy to the role. This policy specifies which AWS entities can assume the role.
For now, we are just going to use a dummy value for the AWS account that can assume the role: specifically, our own AWS account ID. Eventually, we will modify this dummy value to enable the Snowflake AWS account to assume the role.
#!/bin/bash
TRUST_POLICY='{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Principal": {
        "AWS": "<YOUR_AWS_ACCOUNT_ID>"
      },
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<YOUR_AWS_ACCOUNT_ID>"
        }
      }
    }
  ]
}'

aws iam create-role \
  --role-name iceberg-quickstart-tutorial \
  --assume-role-policy-document "$TRUST_POLICY"
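We'll need this role's ARN when creating the Snowflake External Volume below. If you didn't note it from the create-role output, you can retrieve it at any time:

#!/bin/bash
aws iam get-role \
  --role-name iceberg-quickstart-tutorial \
  --query Role.Arn \
  --output text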
Attach the policy to the role
Using the Policy ARN from the previous step, we can attach the policy to the role we just created.
#!/bin/bash
aws iam attach-role-policy \
--role-name iceberg-quickstart-tutorial \
--policy-arn arn:aws:iam::730335345979:policy/iceberg-quickstart-tutorial
So far we have:
- Created an IAM policy that grants permissions to read and write data in an S3 bucket
- Created an IAM role with an associated trust policy. The trust policy is just a placeholder - we will modify it shortly.
- Attached the IAM policy to the IAM role
Create a Snowflake External Volume
A Snowflake external volume will automatically provision an IAM User in Snowflake's AWS account. We will soon grant that IAM User permission to assume the IAM Role we created previously.
Use the following query to create the external volume.
create or replace external volume iceberg_quickstart
  storage_locations = (
    (
      NAME = 'iceberg-quickstart'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://corbett-iceberg-quickstart'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::730335345979:role/iceberg-quickstart-tutorial'
      STORAGE_AWS_EXTERNAL_ID = 'iceberg-quickstart-external-id'
    )
  );
Note the following values that we chose:
- STORAGE_BASE_URL: the path of the bucket we created previously
- STORAGE_AWS_ROLE_ARN: the ARN of the AWS Role we created in the previous step
- STORAGE_AWS_EXTERNAL_ID: an arbitrary value; you can choose whatever you like
Next, we need to retrieve the STORAGE_AWS_IAM_USER_ARN associated with the external volume. We will use that value in the next step to enable Snowflake to assume the role that has access to our S3 bucket.
Use the following command to retrieve that value.
snow sql -q 'describe external volume iceberg_quickstart'
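The output includes a STORAGE_LOCATION_1 property whose value embeds the STORAGE_AWS_IAM_USER_ARN field. As a convenience, you can filter the output for it (a rough sketch; the exact layout of the describe output may vary by CLI version):

snow sql -q 'describe external volume iceberg_quickstart' | grep -o 'STORAGE_AWS_IAM_USER_ARN[^,]*'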
Modify the role's trust policy to enable the Snowflake External Volume to assume the role
The last step to enabling Snowflake to authenticate with our S3 bucket is to modify the Trust Policy associated with the AWS Role we created.
Modify the trust policy like so, replacing the values as described in the script.
#!/bin/bash
TRUST_POLICY='{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Principal": {
        "AWS": "<REPLACE_WITH_STORAGE_AWS_IAM_USER_ARN>"
      },
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<REPLACE_WITH_STORAGE_AWS_EXTERNAL_ID>"
        }
      }
    }
  ]
}'

aws iam update-assume-role-policy \
  --role-name iceberg-quickstart-tutorial \
  --policy-document "$TRUST_POLICY"
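To check that the trust relationship is wired up end to end, Snowflake provides a built-in verification function for external volumes. A quick way to run it from the CLI (using the volume name we chose above):

snow sql -q "select system\$verify_external_volume('iceberg_quickstart')"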
Choosing a catalog
In the next step we'll create a database that references the external volume we created previously.
By referencing the external volume, any Iceberg tables created inside the database will store their contents in the S3 bucket in our own AWS account.
However, we also still need to choose which catalog to use.
For a detailed description of catalogs, you can see the Iceberg documentation.
But in brief, the catalog is the component responsible for tracking all of the tables in the S3 bucket. It enables commands like "show tables in database…". Whenever a new table is created, the catalog needs to know about it.
There are two types of catalog that we can choose. Either we can use Snowflakeâs native catalog, or we can specify an external catalog. There are many external catalogs, including the AWS Glue Catalog, Nessie, and Databricks Unity Catalog, among others.
The choice of catalog is important, as it significantly affects the features available when using Snowflake Iceberg tables.
If we choose to use the Snowflake native catalog, then Snowflake Iceberg tables work similarly to how native Snowflake tables function. Most importantly, we can both read and write data to those tables.
But if we choose to use an external catalog, then Snowflake can only read data from the tables contained in the catalog. Snowflake does not support writing or modifying any of the tables tracked via an external catalog.
For a more detailed comparison of the native catalog and external catalogs, see the documentation from Snowflake.
So the choice of which catalog to use depends on the use case. If you anticipate Snowflake being exclusively a "reader" of the Iceberg tables, while using a different engine to produce those tables, then using an external catalog is fine. However, if you anticipate using Snowflake to transform the Iceberg tables in your S3 bucket, then you'll need to use the Snowflake native catalog.
For this tutorial, we'll use the Snowflake native catalog, as setting up an external catalog is beyond the scope of this guide.
Creating an Iceberg table
First we create a database that's configured to use the external volume. Specify the external volume name we chose earlier, and specify SNOWFLAKE as the catalog.
create database corbett;
create schema corbett.iceberg_tests;
alter database corbett set
external_volume = iceberg_quickstart
catalog = 'SNOWFLAKE';
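To confirm that the settings took effect, you can inspect the database-level parameters we just set (a quick optional check):

snow sql -q "show parameters like 'EXTERNAL_VOLUME' in database corbett"
snow sql -q "show parameters like 'CATALOG' in database corbett"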
Next we can create a table specifying base_location, and populate it with some mock data.
create iceberg table corbett.iceberg_tests.first_iceberg_table (a int)
base_location = 'first_iceberg_table';
insert into corbett.iceberg_tests.first_iceberg_table (a) values (1), (2), (3);
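Before checking S3, you can sanity-check the table from Snowflake itself; this query should return the three rows we just inserted.

snow sql -q 'select * from corbett.iceberg_tests.first_iceberg_table'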
To validate that Snowflake has stored the contents of the table in our own S3 bucket, we can list the files inside the Iceberg table's directory on S3.
aws s3 ls s3://corbett-iceberg-quickstart/first_iceberg_table/metadata/
Querying from a read-only engine
In the last section of this Quickstart we are going to demonstrate reading from the Iceberg table using a separate query engine.
The engine we'll be using is duckdb. For help installing it, see the duckdb website. Also ensure you're using duckdb version 1.0 or later.
To start a session, run the command:
duckdb
Next, install and load the iceberg and httpfs extensions.
install 'iceberg';
install 'httpfs';
load 'iceberg';
load 'httpfs';
Create a secret that allows access to the S3 bucket. If you have been using the aws commands throughout this Quickstart, then you should be able to retrieve your AWS access key ID and secret access key from ~/.aws/credentials.
create secret corbett_iceberg (type S3, key_id '...', secret '...', region 'us-east-2');
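Alternatively, recent duckdb versions can pick up credentials from the standard AWS provider chain rather than pasting keys inline. This is a sketch, and assumes you also install duckdb's aws extension, which supplies the credential_chain provider:

install 'aws';
load 'aws';
create secret corbett_iceberg (type S3, provider credential_chain, region 'us-east-2');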
Finally, you should now be able to query the table directly from duckdb.
select count(1)
from iceberg_scan(
's3://corbett-iceberg-quickstart/first_iceberg_table',
allow_moved_paths = true
);
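If everything is configured correctly, the query should return a count of 3, matching the three rows we inserted from Snowflake earlier.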
Summary
In this Quickstart guide we introduced all the concepts you need to get started with Iceberg on Snowflake.
This included provisioning an S3 bucket, and granting Snowflake access to use that bucket via IAM Roles and Policies.
We also created a database configured to store Iceberg tables on the external volume, using the Snowflake native Iceberg catalog.
Finally, we created a table in this database, and used an external engine for reading the data from that table.