Getting started with Snowflake's native data quality framework

On March 29, 2024, Snowflake announced the public preview of a "native" data quality framework. The native framework is a competitor of sorts to other data quality monitoring platforms, with the advantage that it runs directly in Snowflake.

So naturally we wanted to take this new capability for a test drive to see if it's really ready for production.

The short answer is No, it's probably not something that you should use to replace any existing data quality capabilities at your company.

But it's still very early. At the time I'm writing this, the public preview of the native data quality framework is just a few months old. There are some clear areas for improvement that will likely be addressed in the future, such as the ability to define data quality checks using Python, not just SQL.

We're optimistic that investment into this area will continue and that Snowflake's "native" capabilities will eventually become viable alternatives to the current standard practice, even if the capabilities are a bit green today.

Key concepts of Snowflake's native data quality framework

Snowflake's native data quality framework is built around the concept of a data metric function (DMF).

These functions are very similar to user-defined functions (UDFs). They live inside a database schema, and can be called by any role that has permission to use them. Snowflake also provides some pre-built data metric functions.

Data metric functions follow the same pattern as other data quality testing tools: their job is to hunt down "bad" records in a table.

Data metric functions are associated with a table or view by running an alter table ... (or alter view ...) statement on that table or view. The same data metric function can be used across multiple tables or views. Currently data metric functions can only be added to vanilla tables and views; they cannot be added to dynamic tables, materialized views, and so on.

The data metric functions associated with a table can be run on a schedule by setting the data_metric_schedule parameter of that specific table or view. All of the data metric functions associated with a table run on the same schedule. The compute consumed by running the functions is not associated with a warehouse; instead it uses Snowflake's "serverless" compute capabilities.

The results of the scheduled executions of these data metric functions are stored in a centralized logging table. This table can then be used to build reporting and alerting based on the results of the data metric functions.

We'll show some examples of using the above concepts in practice shortly. But first we want to discuss some key considerations before implementing data metric functions in production.

Key considerations of Snowflake's native data quality framework

The most salient consideration is that data quality tests which run natively on Snowflake via data metric functions cannot "stop the line" when they encounter data quality issues. In other words, when a data metric function "fails" and identifies invalid records in a table, the pipeline that built or updated that table has no way to know this, so the downstream parts of the pipeline will continue to run, and invalid data could then get propagated throughout the whole data warehouse.

This is a critical consideration, because other pipeline tools like Airflow or dbt do not have this limitation, and recovering from invalid data that spreads throughout an entire warehouse is a complete nightmare. This doesn't mean that Snowflake's data metric functions are useless, but it does mean that they can't replace data quality checks that are triggered by an orchestrator, and which stop the line when they encounter data quality issues.

Another consideration is that custom data metric functions can currently only be defined with SQL. This means that building ML/AI-driven quality monitors with machine learning libraries is not possible today, so the data quality framework can't be used for anomaly detection or other more complex use cases. However, it's reasonable to predict that defining data metric functions via Snowpark will be added in the future.

The lifecycle of data metric functions is also a bit brittle. For example, if a table is recreated, then all of the data metric functions defined on that table need to be "reconnected". Similarly, if a data metric function is recreated, it needs to be re-applied to every table that it was previously connected to. So some care needs to be taken when using data metric functions alongside a tool like dbt, which recreates tables as standard practice.

Additionally, it does not seem possible to store the failing records associated with a data metric function, a feature that dbt supports and that makes debugging data quality issues easier.

So, caveats aside, let's take a look at actually using the data metric functions in practice.

Enabling Snowflake's native data quality framework

Currently Snowflake's native data quality framework is only available for Enterprise Edition accounts, and is also not available on trial accounts.

To enable a specific Snowflake role to create and execute Snowflake's built-in data metric functions, we need to grant various privileges to the role, as demonstrated below. For this demo we're using a role called dq_tester, and we need access to the accountadmin role to grant the necessary privileges.

use role accountadmin;
-- access to the shared snowflake database, where monitoring results are logged
grant imported privileges on database snowflake 
    to role dq_tester;
-- allows viewing the data quality monitoring results
grant application role snowflake.data_quality_monitoring_viewer 
    to role dq_tester;
-- allows calling the built-in data metric functions in snowflake.core
grant database role snowflake.data_metric_user 
    to role dq_tester;
-- allows running data metric functions on serverless compute
grant execute data metric function on account
    to role dq_tester;
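
To double-check that everything landed on the role, we can list its grants (show grants is a standard Snowflake command, not something specific to the data quality framework):

show grants to role dq_tester;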

Using the built-in data metric functions

To test out one of Snowflake's built-in data metric functions, we'll first create a simple dummy test table called orders, and populate it with some sample data.

create table orders (order_id int, order_value int);
insert into orders values (1, 10), (1, null);

The built-in metric function we'll use is called null_count, and it simply counts the number of records that are null in a given column. There are currently about ten other built-in data metric functions covering data freshness, uniqueness, and volume, and it's likely Snowflake will add more built-in tests in the future.

We can execute the data metric function as follows.

select snowflake.core.null_count(
  select order_value from orders
);
NULL_COUNT
1

Notice how the function takes a SQL statement as its parameter. There are limitations on the SQL statement you provide, such as not being able to use joins or functions like current_date().
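
The same calling pattern works for the other built-in functions. As a sketch, here's the built-in duplicate_count function (also in the snowflake.core schema) pointed at the order_id column, which contains a repeated value in our sample data:

select snowflake.core.duplicate_count(
  select order_id from orders
);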

We can also run multiple data metric functions at the same time as follows.

select 
    snowflake.core.null_count(select order_value from orders) as order_value_null_count,
    snowflake.core.null_count(select order_id from orders) as order_id_null_count
;
ORDER_VALUE_NULL_COUNT  ORDER_ID_NULL_COUNT
1                       0

At this point, it's probably not clear why you would want to use a metric function instead of an ordinary query for counting null values. After all, these metric functions look basically the same as running select count(...) against a table.

There are two reasons for using metric functions.

First, it's possible to schedule data metric functions to run automatically against a table on a given frequency.

Second, the results of scheduled data metric functions are automatically stored in a centralized event log, which we'll examine shortly. (The results of manually triggered metric functions do not land in the event log.)

Together, those capabilities mean you could configure an observability and monitoring solution directly within Snowflake, without having to configure any external scheduling or orchestration tools.

We'll explore the scheduling and logging capabilities in the next sections.

Scheduling a data metric function

To schedule a data metric function to run against a table, we first need to define a data_metric_schedule for that table. All data metric functions associated with the table will run on that single schedule.

To add a data_metric_schedule to the table we use an alter table command.

alter table orders set data_metric_schedule = '5 MINUTE';

The schedule interval must be at least 5 minutes, and cron syntax is also supported. If you use the shorthand minute-based syntax shown above, then only specific intervals are currently supported (5, 15, 30, 60, 720, or 1440 minutes).
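
Cron syntax lets you pin the checks to a specific time instead of a rolling interval. Here's a sketch that would run the table's data metric functions every day at 8am UTC (the time and timezone are arbitrary choices for illustration):

alter table orders set data_metric_schedule = 'USING CRON 0 8 * * * UTC';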

To inspect the data metric schedule of a table, we can use the following command.

show parameters like 'data_metric_schedule' in table orders;

After the data_metric_schedule has been defined for the table, we can add a data metric function to the table.

alter table orders 
add data metric function snowflake.core.null_count
on (order_id);

Notice that we can only add this specific metric function to a single column at a time. To add the same function to multiple columns, we would need to use multiple alter table commands.
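
For example, to also monitor order_value with null_count, we would run a second statement (a sketch continuing with the orders table from above):

alter table orders 
add data metric function snowflake.core.null_count
on (order_value);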

Additionally, when a table is dropped and recreated (using a statement like create or replace table), the data metric functions associated with that table do not automatically get reconnected.
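
In that situation, the schedule and the functions need to be re-applied by hand. A sketch of what that looks like (assuming the data_metric_schedule parameter is also reset by the recreate):

create or replace table orders (order_id int, order_value int);

-- re-apply the schedule, then re-add each data metric function
alter table orders set data_metric_schedule = '5 MINUTE';

alter table orders 
add data metric function snowflake.core.null_count
on (order_id);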

Viewing the logs of metric function executions

To inspect the results of scheduled data metric function executions, we can query the following view in the shared snowflake database.

select 
    measurement_time,
    table_name,
    metric_name,
    argument_names,
    value
from snowflake.local.data_quality_monitoring_results;

The query shows us the results of all test executions. In our case, the value field is the number of null records found by the null_count test, and the argument_names field shows which column the data metric function was run against.
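
This view is also the natural place to hang reporting or alerting. A minimal sketch, assuming (as with null_count) that a non-zero value means the check found bad records, would be to surface only the failures from the last day:

select
    measurement_time,
    table_name,
    metric_name,
    argument_names,
    value
from snowflake.local.data_quality_monitoring_results
where value > 0
  and measurement_time > dateadd('day', -1, current_timestamp())
order by measurement_time desc;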

Viewing the data metric functions on a table

To view the data metric functions associated with a table, we currently need to use SQL to query a table function in the database's information schema. There is not currently any graphical display that visualizes the associated functions.

select *
  from table(
    information_schema.data_metric_function_references(
      ref_entity_name => 'orders',
      ref_entity_domain => 'table'
    )
  );

For more details on the fields returned by data_metric_function_references, see Snowflake's reference documentation.

Creating custom data metric functions

We can create a custom data metric function as follows.

Note that a data metric function should be written so that it hunts down (i.e. flags) dirty or invalid records inside the table. Currently, custom data metric functions can only use SQL for their logic.

create or replace data metric function orders_positive_values (
  arg_t table(
    arg_c1 number
  )
)
returns number
as
$$
  -- count the "bad" records, i.e. order values that are negative
  select
    count(*)
  from arg_t
  where
    arg_c1 < 0
$$;

The custom metric function then needs to be associated with the various tables as follows.

alter table orders 
add data metric function orders_positive_values
on (order_value);
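
As with the built-in functions, we can also call the custom function manually to sanity-check its logic before relying on the schedule (a sketch using the sample orders data from earlier, where no order value is negative, so we'd expect it to return 0):

select orders_positive_values(
  select order_value from orders
);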

Conclusion

Overall, Snowflake's native data quality features are currently fairly limited. Personally, I feel they're not yet ready to be a viable alternative to using an orchestrator like Airflow to run standard data quality tests.

That said, the capabilities are still very green, and I'd expect them to improve materially in the future. If Snowflake is able to build a platform that genuinely doesn't need a team to run an orchestrator or scheduler on the side, that will certainly help more companies adopt Snowflake (and of course make it harder to leave Snowflake as well).

But for today, I'd stick with your current orchestrator and data quality testing framework, as there aren't yet compelling reasons to switch.