Snowflake Unistore and Hybrid Tables: performance, cost, and limitations

Snowflake's Unistore offering aims to bring some of the capabilities of row-store transactional systems (OLTP) into Snowflake's primarily column-store analytic system (OLAP).

The attempt to consolidate the specialized capabilities of these two different types of systems was given the acronym HTAP (hybrid transactional/analytical processing) by Gartner around 2014. Since then, many HTAP databases have emerged, but in general most companies have not been able to completely replace their different specialized systems with a single unified one.

More often, the ability to add a sprinkling of OLTP capabilities into an existing OLAP system has reduced some of the need to move data between systems. By letting users choose whether an individual table in the database uses a column-store or row-store layout, OLAP systems can offer some of the capabilities of OLTP systems.

The most important capability that a row store provides is fast point lookups: finding a single record in a large table very quickly, typically in milliseconds. Other capabilities include transactions and, more generally, the ability to enforce strong data integrity across the rows of a table. For example, a row-store oriented table can ensure that all the values of a column are unique, whereas a column-store oriented table cannot efficiently provide this guarantee.

In the case of Snowflake, users can leverage the Hybrid Table type to create an individual table that uses a row-store format. These hybrid tables then provide much faster point lookup speeds than Snowflake's normal table type, even on tables that contain billions of rows and terabytes of data, and even when running hundreds of lookups every second.
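
For illustration, here is a minimal sketch of what that looks like (the table and column names are ours, chosen for the example):

```sql
-- A minimal sketch of creating a hybrid table. The HYBRID keyword
-- selects the row-store layout, and unlike standard Snowflake tables,
-- hybrid tables require a primary key.
CREATE OR REPLACE HYBRID TABLE app_users (
    user_id NUMBER NOT NULL PRIMARY KEY,
    email   VARCHAR
);

-- Point lookups by primary key are the workload hybrid tables target.
SELECT email FROM app_users WHERE user_id = 42;
```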

Of course, these hybrid tables then lose some of the important capabilities that a column-store layout provides, namely the ability to quickly aggregate a large number of rows. But for use cases where point lookup speed matters more than aggregate query speed, the ability to use row-store formats directly inside of Snowflake can eliminate the need to move data out of Snowflake and into a different system.

Use cases for Snowflake Hybrid Tables

The most common use case for Snowflake Hybrid Tables is serving so-called "online" feature stores. These feature stores typically hold the results of modeling a user's attributes, stored in a table for fast retrieval later.

For example, consider a project where a company wants to classify the users of a software application into different segments. These segments can then be used by the application to provide different notifications for different types of users, with the goal of improving user engagement. Every time the application serves a notification to a user, it needs to retrieve that user's segment(s).

The assignment of users to different segments might be handled by a set of data transformations that run in Snowflake each night. For this example, suppose the segments are stored in a table called user_segments. Every row in the table contains an individual user_id, along with the segment(s) that user is part of.
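
A few illustrative rows:

```
user_id | segments
--------+---------------------
    101 | power_user
    102 | new_user, at_risk
```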

The application serving the notifications requires that retrieving each user's segments be fast, and it also needs to run many lookups per second. Because the speed and rate of these transactions cannot be supported by a normal Snowflake table, we need to consider alternative ways to store and expose the user segments after they've been built in Snowflake.

There are a few options:

  1. Procuring specialized tools. Purchase or adopt a set of tools for deploying to specialized feature stores. For example, we could use a service like Hightouch to push the data from Snowflake into a platform like LaunchDarkly. We could also use self-hosted open-source alternatives to those tools.
  2. Using an OLTP system. Copy the user_segments table out of Snowflake and into a table inside a transactional system, e.g., a Postgres database. Our application could retrieve the segments by connecting to the Postgres database and querying that table.
  3. Using the HTAP capabilities in Snowflake. Copy the user_segments table into a Snowflake hybrid table (sketched below). Our application could retrieve the segments by connecting directly to Snowflake and querying the hybrid table.
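
To make the third option concrete, here is a hedged sketch of populating the hybrid table with a CTAS from the nightly-built standard table (the _hybrid suffix is our naming choice, and CTAS is one of several ways to bulk-load a hybrid table):

```sql
-- Copy the nightly results into a row-store (hybrid) copy.
-- The column list must declare the required primary key.
CREATE OR REPLACE HYBRID TABLE user_segments_hybrid (
    user_id  NUMBER NOT NULL PRIMARY KEY,
    segments VARCHAR
)
AS
SELECT user_id, segments
FROM user_segments;
```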

The first option likely provides the richest set of capabilities over the long run, and can potentially be achieved even by teams with limited engineering resources. However, this option is also the largest commitment, and potentially requires a procurement process. Self-hosting open-source alternatives to those tools also means managing the deployment of multiple applications, which can be risky.

The second option also provides a strong set of capabilities, as it uses a true OLTP system. But it has the disadvantage of requiring a new database instance and a data pipeline to maintain.

The third option is the simplest and potentially the easiest choice, but it provides the smallest set of capabilities.

To better understand the third option, we should also assess the performance characteristics of Snowflake's Hybrid Tables, as well as their potential costs and other limitations.

Snowflake Hybrid Table performance benchmarks

To assess the performance of Hybrid Tables, we ran a simple load test that is representative of the user_segments application described above.

You can find the code for the load tests here: https://github.com/corbettanalytics/snowflake-hybrid-tables-load-tests.

In short, we created a simple application that retrieved the segments of a single user from a Hybrid Table.

The Hybrid Table contained 1 million user records. Every request to the application would use the table's primary key to load the user's segments.
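
Each request boiled down to a single primary-key lookup of roughly this shape (the literal user_id value is illustrative):

```sql
-- The point lookup issued on every request to the application.
SELECT segments
FROM user_segments
WHERE user_id = 123456;
```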

The results of the load test are summarized here:

[Figure: Hybrid Table load test results]

The load test showed that we were able to achieve about 70 lookups per second.

However, the response times for these lookups varied widely: 95th-percentile response times fluctuated from ~200 milliseconds during good periods to almost 1,000 milliseconds (1 second) during slow periods.

For comparison, this workload can easily achieve sub-10-millisecond lookups on virtually any row-store system, including SQLite, Postgres, and MongoDB.

Based on the results of this load test, it's fair to conclude that Hybrid Tables cannot achieve consistent single-digit-millisecond latency, which is what you would expect when running this workload on a true OLTP system.

Snowflake Hybrid Table costs

Hybrid tables incur three types of costs: hybrid table storage costs, hybrid table request costs, and virtual warehouse costs. As we'll see, the virtual warehouse costs are the largest driver of expense, because the warehouse needs to run close to 24/7 in order to support typical workloads.

The hybrid table request costs are based on the amount of data read from or written to the hybrid table, billed in 4 KB units. For example, a request that reads a single record containing 1 KB of data is charged for reading 4 KB.
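
As a rough illustration: at the ~70 lookups per second observed in our load test, with each lookup billed at the 4 KB minimum, the workload would accrue about 70 × 86,400 × 4 KB ≈ 24 GB of billable request units per day.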

The hybrid table storage costs are straightforward: you simply pay based on the volume of data stored.

The cost of using a virtual warehouse to query hybrid tables works the same as for any other virtual warehouse: you pay for at least 60 seconds of compute and then per second thereafter, and the warehouse can auto-suspend when it is not being used.

These prices vary by region and cloud provider; the details are in Snowflake's credit consumption tables. We'll use pricing values from the AWS US East 2 region, running on the Enterprise plan ($3 per credit).

To estimate the cost of using hybrid tables for the user_segments workload we've been describing, we'll make the following assumptions:

  - The application runs against a dedicated virtual warehouse, and because requests arrive more than once per minute, the warehouse never auto-suspends and runs 24/7.
  - Credits are billed at $3 each (Enterprise plan, AWS US East 2).
  - The table holds 1 million user records, so storage and per-request data volumes are small.

These assumptions imply the following consumption levels and cost:

[Figure: Hybrid Table cost estimates]

Note that the virtual warehouse pricing is by far the largest driver of expense, since we're assuming a dedicated warehouse running 24/7. It's possible to significantly reduce this expense by using a shared warehouse, i.e., a warehouse whose consumption is shared with other workloads. This reduces the isolation of resources, which creates some risk, but it can also produce significant cost savings.
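
For intuition on the warehouse line item: assuming the smallest (X-Small) warehouse, which consumes 1 credit per hour, a dedicated warehouse running 24/7 at $3 per credit works out to 24 × 365 × $3 ≈ $26,280 per year, consistent with the roughly $25K annual figure in the conclusion below.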

Conclusion

Snowflake Hybrid Tables and their HTAP capabilities can simplify data pipelines, as they allow applications to perform fast row-level operations directly against Snowflake, without needing to move data into a dedicated OLTP system.

However, the performance of Snowflake's Hybrid Tables is relatively slow compared to most dedicated OLTP systems: we observed 95th-percentile response times of about 500 ms when performing primary-key lookups at about 70 requests per second.

Hybrid tables are also comparatively expensive, costing about $25K per year when using a dedicated warehouse to serve a workload that makes more than one request per minute (frequent enough that the warehouse never auto-suspends).