AWS just released AppFlow

Note: AppFlow was released publicly on 04/22/2020, and is available in most AWS regions.

AppFlow is Amazon's first product that automatically pulls third-party data directly into an AWS service without requiring custom-built ETL code. At release, AppFlow supported 13 data sources and offered S3, Redshift, Snowflake, and Salesforce as destinations.

Google Cloud Platform and Microsoft Azure didn't offer an equivalent service at the time this article was published. But data integration services that connect SaaS products to Redshift, S3, and other destinations have been around for a while: Talend's Stitch, for example, offers 90+ integrations and supports destinations in AWS as well as Google Cloud Platform and Azure.

AWS's investment in AppFlow follows the growing trend for companies to use "dumb" data pipelines, moving data between their services without processing the data while it's in transit. Instead, data is processed once it reaches its destination, which enables companies to outsource much of their ETL process to SaaS providers.

The trend towards "dumb" data pipelines has become known more commonly as "ELT," emphasizing that the data Transformation happens after the data has been loaded into a warehouse like Redshift. Other tools that support the "ELT" pattern have also received growing levels of investment recently, most notably Andreessen Horowitz's recent $12.9M Series A round of the company behind DBT.

AppFlow will likely continue to add more integrations, although I expect they'll only support more "enterprise"-focused services.

AppFlow's pricing uses the same "pay as you go" model as other AWS services. Each flow run costs $0.001, plus $0.02 per GB of data transferred. If you transfer data to a service outside of AWS without using AWS PrivateLink, the data transfer cost is $0.10 per GB instead.
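As a rough illustration of how those prices add up, here's a back-of-the-envelope estimate in Python (the run count and data volume are made-up numbers for a hypothetical daily flow, not anything AppFlow prescribes):

    # Hypothetical month: one run per day, 2 GB per run, destination inside AWS.
    runs_per_month = 30
    gb_per_run = 2

    run_cost = runs_per_month * 0.001                    # $0.001 per flow run
    transfer_cost = runs_per_month * gb_per_run * 0.02   # $0.02 per GB within AWS

    print(f"Estimated monthly cost: ${run_cost + transfer_cost:.2f}")  # ~$1.23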

Test driving AppFlow

To see how easy it was to get started with AppFlow, I decided to try setting up a pipeline between Google Analytics and S3. Unfortunately, AppFlow doesn't support moving Google Analytics data into Redshift, although all of the other supported data sources do allow pushing data directly into Redshift, Snowflake, or Salesforce.
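Everything the console exposes is also reachable through the API. As a quick sanity check, the sketch below uses boto3's appflow client to list the connector types available to an account; it assumes AWS credentials are already configured, and the example names in the comment are from memory rather than a captured response.

    import boto3

    # List the connectors AppFlow supports in this account/region.
    appflow = boto3.client("appflow")
    response = appflow.describe_connectors()

    for connector_type in response.get("connectorConfigurations", {}):
        print(connector_type)  # e.g. "Googleanalytics", "S3", "Redshift", ...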

Personally, I found AppFlow's documentation a little terse, especially as I had only limited experience using Google Cloud Platform's API. In general, AppFlow's instructions assume you're quite familiar with the external service's developer tools, as well as authentication schemes like OAuth, so AppFlow is impractical for non-developers to use.

Once I'd configured my Google Cloud Platform account to allow AppFlow to connect to Google Analytics, I was able to select whether I wanted the flow to run on a schedule or on demand, and I chose on-demand for the test run. I was not able to run the flow in response to an event when Google Analytics was the source.
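Once a flow exists, an on-demand run can also be kicked off from code rather than the console; here's a minimal sketch with boto3, where the flow name is a placeholder for whatever you named your flow:

    import boto3

    appflow = boto3.client("appflow")

    # Trigger a single on-demand run of an existing flow.
    # "ga-to-s3-test" is a hypothetical flow name; substitute your own.
    response = appflow.start_flow(flowName="ga-to-s3-test")
    print(response.get("flowStatus"), response.get("executionId"))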

After I'd selected the trigger for my flow, I was able to choose which fields I wanted to extract from Google Analytics, and then map those "source fields" onto "destination fields," i.e. rename them. At this point, it was clear that exporting the data also required a good understanding of the fields available in Google Analytics: what did it mean to export User: Metric: ga:users and User: Dimension: ga:userBucket in the same report?

Data Export options in AppFlow
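Under the hood, those field selections and mappings become "tasks" in the flow definition. If you built the flow through the CreateFlow API instead of the console, the task list might look roughly like the sketch below; the field names are real Google Analytics fields, but the exact task shapes are my reading of the boto3 docs rather than a verified configuration.

    # Hypothetical task list for a Google Analytics -> S3 flow (unverified sketch).
    tasks = [
        {
            # Select which source fields to pull from Google Analytics.
            "taskType": "Filter",
            "connectorOperator": {"GoogleAnalytics": "PROJECTION"},
            "sourceFields": ["ga:date", "ga:users"],
        },
        {
            # Map (i.e. rename) a source field onto a destination field.
            "taskType": "Map",
            "sourceFields": ["ga:users"],
            "destinationField": "users",
        },
    ]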

Once I'd selected the fields I wanted to include, I was able to apply some limited data validations, which are essentially filters applied to each row, and then add a filter based on the date range I wanted to extract.

I was disappointed that the date range filters weren't dynamic: what if I wanted to run the flow on a schedule every day and only extract that single day's data? I did investigate using a "scheduled" flow, which gave me the option to "transfer new data only," but I wasn't sure whether that would limit the export to daily dumps.
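Reading the API documentation suggests that "transfer new data only" corresponds to an incremental data-pull mode on a scheduled trigger. A hedged sketch of what that trigger configuration might look like when creating a flow through the API (the schedule expression format and pull mode come from the boto3 docs, but I didn't verify them end to end):

    # Hypothetical scheduled trigger: run daily and only pull new data.
    trigger_config = {
        "triggerType": "Scheduled",
        "triggerProperties": {
            "Scheduled": {
                "scheduleExpression": "rate(1days)",  # once per day
                "dataPullMode": "Incremental",        # "transfer new data only"
            }
        },
    }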

At this point I was able to run the flow, but my configuration failed each time, reporting only the cryptic error null!

A cryptic error...

I made an educated guess that Google Analytics didn't support exporting both User Dimensions and User Metrics at the same time, so I updated the list of exported fields to be more sensible, but that didn't fix the error.

Eventually I realized that I had created an invalid filter while I was exploring the filtering options: all of the criteria specified on the filter were blank, so I didn't realize that it would still be applied. I have suffered through software with worse UX issues, but I felt the error message could have been more helpful here.

A valid but broken filter

The run completed successfully once I deleted the filter.

The data dumped into S3 was JSON formatted with the schema specified by the Google Analytics Reporting API's batchGet method, so I was disappointed that I couldn't easily query the result in S3 using AWS Athena (i.e. AWS's managed Presto service). Since AppFlow doesn't support transferring Google Analytics data to Redshift, it makes sense that it also doesn't support transforming the data into a tabular format when using S3 as the destination.
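If you do want to query the dump with Athena, one option is a small post-processing step that flattens the batchGet-style JSON into rows. Here's a rough sketch that assumes the file follows the reports/columnHeader/rows layout documented for the Reporting API; the file names are placeholders, and I haven't checked AppFlow's output against every edge case.

    import csv
    import json

    # Flatten a Google Analytics Reporting API batchGet-style dump into CSV
    # so it can be queried with Athena (or loaded elsewhere).
    with open("ga_dump.json") as f:
        payload = json.load(f)

    with open("ga_dump.csv", "w", newline="") as out:
        writer = csv.writer(out)
        for report in payload.get("reports", []):
            header = report["columnHeader"]
            dimension_names = header.get("dimensions", [])
            metric_names = [m["name"] for m in header["metricHeader"]["metricHeaderEntries"]]
            writer.writerow(dimension_names + metric_names)
            for row in report.get("data", {}).get("rows", []):
                dimensions = row.get("dimensions", [])
                for metric_set in row.get("metrics", []):
                    writer.writerow(dimensions + metric_set["values"])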

Additionally, the file that was dumped into the S3 bucket had a random name, which I found unhelpful. If I were trying to build a pipeline with daily exports of Google Analytics data and suffered any delays in processing, it would be impossible to determine which day's data was in the file! This is such a common use case for these kinds of ETL jobs that I was surprised AppFlow didn't support it better.
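The workaround I'd reach for is a small post-processing step that copies each new object to a date-stamped key, for example from a Lambda triggered by the S3 "object created" event. A minimal sketch with boto3, where the bucket name and key layout are made up:

    from datetime import date

    import boto3

    s3 = boto3.client("s3")

    # Copy the randomly named AppFlow output to a predictable, dated key.
    # The bucket and keys here are hypothetical placeholders.
    bucket = "my-appflow-bucket"
    source_key = "ga-to-s3-test/1a2b3c4d-random-output-name"
    dated_key = f"google-analytics/{date.today():%Y/%m/%d}/export.json"

    s3.copy_object(
        Bucket=bucket,
        Key=dated_key,
        CopySource={"Bucket": bucket, "Key": source_key},
    )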

Overall, I expect AppFlow is more useful if you're exporting data that comes in a tabular format, and can be automatically loaded into Redshift. While I didn't find AppFlow particularly helpful for analyzing the Google Analytics data, I'd be willing to give it another try if I needed to pull data from a different source.