Snowflake Unistore and Hybrid Tables: performance, cost, and limitations
Snowflake's Unistore offering aims to bring some of the capabilities of row-store transactional systems (OLTP) into Snowflake's primarily column-store analytic system (OLAP).
The attempt to consolidate the specialized capabilities of these two types of systems was given the acronym HTAP (hybrid transactional/analytical processing) by Gartner around 2014. Since then, many HTAP databases have emerged, but in general most companies have not been able to completely replace their specialized systems with a single unified one.
More often, adding a sprinkling of OLTP capabilities to an existing OLAP system has reduced some of the need to move data between systems. By letting users choose whether an individual table uses a column-store or row-store layout, OLAP systems can offer some of the capabilities of OLTP systems.
The most important capability a row store provides is fast point lookups: finding a single record in a large table very quickly, typically in milliseconds. Other capabilities include transactions and, more generally, the ability to enforce strong data integrity across the rows of a table. For example, a row-store oriented table can ensure that all the values of a column are unique, whereas a column-store oriented table cannot provide this guarantee efficiently.
In the case of Snowflake, users can leverage the Hybrid Table type to create an individual table that uses a row-store format. These hybrid tables provide much faster point lookups than Snowflake's normal table type, even on tables that contain billions of rows and terabytes of data, and even when running hundreds of lookups every second.
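As a quick illustration of what this looks like in practice, a hybrid table is declared with the HYBRID keyword and requires a primary key, which is what backs the fast point lookups. The sketch below uses the snowflake-connector-python client with hypothetical connection details and table names; it is an illustration, not code from Snowflake's documentation.

```python
import snowflake.connector

# Hypothetical connection details; replace with your own account settings.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    warehouse="lookup_wh",
    database="analytics",
    schema="public",
)
cur = conn.cursor()

# Hybrid tables are created with the HYBRID keyword and require a primary key,
# which provides the index used for fast point lookups.
cur.execute("""
    CREATE OR REPLACE HYBRID TABLE orders (
        order_id NUMBER PRIMARY KEY,
        customer_id NUMBER,
        status VARCHAR
    )
""")

# A point lookup by primary key is the access pattern hybrid tables are built for.
cur.execute("SELECT status FROM orders WHERE order_id = %s", (12345,))
print(cur.fetchone())
```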
Of course, hybrid tables give up some of the important capabilities that a column-store layout provides, namely the ability to quickly aggregate a large number of rows. But for use cases where point-lookup speed matters more than aggregate-query speed, the ability to use a row-store format directly inside Snowflake can reduce the need to move data out of Snowflake and into a different system.
Use cases for Snowflake Hybrid tables
The most common use case for Snowflake Hybrid tables is creating so-called "offline" feature stores. These feature stores typically hold the results of modeling the attributes of a user, stored in a table for later retrieval.
For example, consider a project where a company wants to classify the users of a software application into different segments. These segments can then be used by the application to provide different notifications for different types of users, with the goal of improving user engagement. Every time the application serves a notification to a user, it needs to retrieve that user's segment(s).
The assignment of users to different segments might be handled by a set of data transformations that run in Snowflake each night. For this example, suppose the segments are stored in a table called user_segments. Every row in the table contains an individual user_id, along with the segment(s) that user is part of.
The application serving the notifications requires that retrieving each user's segments is fast, and it also needs to run many lookups per second. Because the speed and rate of these lookups cannot be supported by a normal Snowflake table, we need to consider alternative ways to store and expose the user segments after they've been built in Snowflake.
There are a few options:
- Procure specialized tools. Purchase or use a set of tools for deploying to specialized feature stores. For example, we could use a service like HighTouch to push the data from Snowflake into a platform like LaunchDarkly, or we could use self-hosted open-source alternatives to those tools.
- Use an OLTP system. Copy the user_segments table out of Snowflake and into a table inside a transactional system, e.g. a Postgres database. Our application could retrieve the segments by connecting to the Postgres database and querying that table.
- Use the HTAP capabilities in Snowflake. Copy the user_segments table into a Snowflake Hybrid Table. Our application could retrieve the segments by connecting directly to Snowflake and querying the hybrid table (see the sketch after this list).
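To make the third option concrete, here is a rough sketch of the copy step and the application-side lookup, assuming a simple user_id/segments schema and using the snowflake-connector-python client. The names are hypothetical, and bulk-loading via CREATE TABLE ... AS SELECT is just one way to populate a hybrid table from an existing table.

```python
import snowflake.connector

# Hypothetical connection details.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    warehouse="transform_wh",
    database="analytics",
    schema="public",
)
cur = conn.cursor()

# Copy the nightly user_segments results into a row-store (hybrid) table.
# The primary key on user_id is what makes the per-user lookups fast.
cur.execute("""
    CREATE OR REPLACE HYBRID TABLE user_segments_hybrid (
        user_id NUMBER PRIMARY KEY,
        segments VARCHAR
    )
    AS SELECT user_id, segments FROM user_segments
""")

# The notification service then reads a single user's segments by primary key.
cur.execute(
    "SELECT segments FROM user_segments_hybrid WHERE user_id = %s",
    (42,),
)
print(cur.fetchone())
```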
The first option likely provides the richest set of capabilities over the long run, and can potentially be adopted even by teams with limited engineering resources. However, it is also the largest commitment, and potentially requires a procurement process. Self-hosting open-source alternatives to those tools also means managing the deployment of multiple applications, which carries its own risk.
The second option also provides a strong set of capabilities, since it uses a true OLTP system. But it has the disadvantage of requiring a data pipeline and a new database instance to maintain.
The third option is the simplest and potentially the easiest choice, but it provides the smallest set of capabilities.
To better understand the third option, we should assess the performance characteristics of Snowflake's Hybrid Tables, as well as their potential costs and other limitations.
Snowflake Hybrid Table performance benchmarks
To assess the performance of Hybrid Tables, we ran a simple load test that is representative of the user_segments application described above.
You can find the code for the load tests here: https://github.com/corbettanalytics/snowflake-hybrid-tables-load-tests.
In short, we created a simple application that retrieved the segments of a single user from a Hybrid Table.
The Hybrid Table contained 1 million user records. Every request to the application would use the table's primary key to load the user's segments.
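The linked repository contains the actual load-test code; as a minimal, single-threaded sketch of the kind of lookup loop involved (with hypothetical connection details and table names, not the code from the repository), it could look roughly like this:

```python
import random
import statistics
import time

import snowflake.connector

# Hypothetical connection details; the real load test lives in the repository above.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    warehouse="lookup_wh",
    database="analytics",
    schema="public",
)
cur = conn.cursor()

latencies_ms = []
for _ in range(1000):
    user_id = random.randint(1, 1_000_000)  # the table held 1 million user records
    start = time.perf_counter()
    cur.execute(
        "SELECT segments FROM user_segments_hybrid WHERE user_id = %s",
        (user_id,),
    )
    cur.fetchone()
    latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
p95 = latencies_ms[int(len(latencies_ms) * 0.95)]
print(f"median: {statistics.median(latencies_ms):.1f} ms, p95: {p95:.1f} ms")
```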
The results of the load test showed that we were able to achieve about 70 lookups per second. However, the response times for these lookups varied widely: 95th-percentile response times fluctuated from roughly 200 milliseconds during good periods to almost 1,000 milliseconds (1 second) during slow periods.
For comparison, this workload can easily achieve response times under 10 milliseconds on virtually any row-oriented system, including SQLite, Postgres, and MongoDB.
Based on the results of this load test, it's fair to conclude that Hybrid Tables cannot achieve the consistent single-digit-millisecond latency you would expect when running this workload on a true OLTP system.
Snowflake Hybrid Table costs
Hybrid tables incur three types of costs: Hybrid Table storage costs, Hybrid Table request costs, and virtual warehouse costs. As we'll see, the virtual warehouse costs are the largest driver of expense, because the warehouse needs to run close to 24/7 in order to support typical workloads.
The Hybrid Table request costs are based on the amount of data read from or written to the Hybrid Table, in 4 KB units. For example, a request that reads a single record containing 1 KB of data is charged for reading 4 KB.
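In other words, each read or write is rounded up to a whole number of 4 KB units. A tiny sketch of that rounding (our interpretation of the billing rule described above, not Snowflake's actual billing code):

```python
import math

def request_units(bytes_transferred: int, unit_bytes: int = 4096) -> int:
    """Round a read or write up to whole 4 KB request units (assumed rounding rule)."""
    return max(1, math.ceil(bytes_transferred / unit_bytes))

print(request_units(1024))  # 1 KB record -> billed as one 4 KB unit
print(request_units(9000))  # 9 KB record -> billed as three 4 KB units
```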
The Hybrid table storage costs are straightforward. You just pay based on the volume of data stored.
The cost of using a virtual warehouse to query hybrid tables works the same as for any other virtual warehouse: you pay for at least 60 seconds of compute and then per second after that, and the warehouse can auto-suspend when it is not being used.
These prices vary by region and cloud provider, and you can see the details in Snowflake's Credit Consumption Tables. We'll use pricing values from the AWS US East 2 region, running on the Enterprise plan ($3 per credit).
To estimate the cost of using hybrid tables for the user_segments workload we've been describing, we'll make the following assumptions:
- The application will serve 1 request every second. Critically, this implies the virtual warehouse we use will need to run 24 hours a day, seven days a week.
- The user_segments table will contain 1 million users and 6 segments, at a total size of 0.1 GB. Reading a single record will consume less than 4 KB of data. With 1 request per second, the application will serve about 100K requests per day.
- The table will be fully rewritten once per day with updated user segment values.
- The virtual warehouse will be an X-Small, consuming 1 credit per hour and running 24/7 in order to serve the 1 request per second.
These assumptions imply the following consumption levels and cost:
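The dominant line item is the dedicated 24/7 X-Small warehouse. Under the $3-per-credit assumption above, it can be estimated with simple arithmetic; the storage and request charges are small by comparison and are omitted from this sketch.

```python
# Rough annual cost of the dedicated X-Small warehouse under the assumptions above.
CREDITS_PER_HOUR = 1       # X-Small warehouse
PRICE_PER_CREDIT = 3.00    # USD, Enterprise plan in AWS US East (assumed)
HOURS_PER_YEAR = 24 * 365  # warehouse runs continuously to serve 1 request per second

warehouse_cost_per_year = CREDITS_PER_HOUR * PRICE_PER_CREDIT * HOURS_PER_YEAR
print(f"Dedicated warehouse: ~${warehouse_cost_per_year:,.0f} per year")  # ~$26,280
```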
Note that the virtual warehouse pricing is by far the largest driver of expense, since we're assuming a dedicated warehouse that runs 24/7. It is therefore possible to significantly reduce this expense by using a shared warehouse, i.e. a warehouse whose consumption is shared with other workloads. This reduces resource isolation and so creates some risk, but it can also produce significant cost savings.
Conclusion
Snowflake Hybrid Tables and their HTAP capabilities can simplify data pipelines, as they allow applications to perform faster row-level operations directly within Snowflake, without needing to move data into a dedicated OLTP system.
However, the performance of Snowflake's Hybrid Tables is relatively slow compared to most dedicated OLTP systems: in our test, primary-key lookups at about 70 requests per second had 95th-percentile response times of about 500 ms.
Hybrid tables are also comparatively expensive, costing about $25K per year when using a dedicated warehouse for a workload that requires more than 1 request per minute, i.e. a workload frequent enough to keep the warehouse running 24/7.