The past four years I’ve spent at Snowflake, helping it grow from a relatively small startup to a major player in the data space, have been the most exciting of my career so far. Snowflake was and continues to be a game-changing technology with a limitless ceiling for innovation.
With its unique architecture, Snowflake can do many things not possible on any other platform. It allows you to work with data of any type (structured, semi-structured, or unstructured) at any scale you need. It is a powerful platform for running data pipelines at massive scale to transform those vast quantities of raw data into information using SQL, as well as Java and Python. It is also an open platform in the sense that any user who needs access to this information can seamlessly and securely connect with their tool of choice and start using Snowflake to make better decisions to drive their business forward.
While there is a lot to like about Snowflake, one area where I have seen customers struggle is the actual authoring and orchestration of those data pipelines. The problem comes down to the options data engineers have had for building data pipelines in general. We can group those options into three main categories:
- Creating and managing SQL (or Python, or insert language of choice) scripts manually
- Implementing a full stack ETL tool
- Cobbling together a set of open-source transformation tools
None of these options are ideal in the context of Snowflake, for reasons detailed below.
Manual Script Management
As an Oracle and SQL Server consultant for almost 10 years, I have written more than my fair share of PL/SQL and T-SQL. I have also worked on projects where many other people were writing large volumes of PL/SQL and T-SQL scripts. Some people may be natural-born coders, but most are not. Good coders are hard to find, and code quality, standardization, automation, supportability, and sustainability are all major issues with this approach. This is especially true for non-IT types who are more concerned with getting the job done, as they rightly should be, than with perfecting and productionizing the code.
Full Stack ETL Tool
Another option, especially for enterprise customers with money to burn, was implementing complex, expensive full stack ETL tools. I have significant experience with ETL tools like Business Objects Data Services, Informatica, Oracle Warehouse Builder, Oracle Data Integrator, and SSIS, going back to 2005. Compared to raw scripting, these tools bring a lot of benefits: development frameworks, automation, repositories with versioning, visual UI, lineage, repeatability, reduced delivery time, and more. But there are major downsides to using these legacy tools in the modern cloud era.
Full stack ETL tools are not optimal with Snowflake. The main reason is that these tools were created more than 25 years ago and tailored to the technology limitations of the day. They were not built with Snowflake in mind.
Snowflake allows you to change the paradigm from ETL to ELT. Ideally, you want your data pipeline tools to push down as many operations as possible to Snowflake, so that Snowflake can handle the heavy lifting. The fact is, however, that many of these full stack ETL tools struggle, to varying degrees, to generate optimal Snowflake code and fully push down transformations to Snowflake. This results in large amounts of data movement, poor performance, and ultimately higher cost.
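To make the pushdown idea concrete, here is a minimal sketch of an ELT-style transformation expressed as a single Snowflake SQL statement. All table and column names are hypothetical; the point is that once the raw data has landed in Snowflake, the entire transformation executes inside Snowflake, with no data leaving the platform and no external engine doing the work.

```sql
-- Hypothetical example: raw data already lives in Snowflake, so the
-- transformation runs entirely inside Snowflake -- full pushdown.
CREATE OR REPLACE TABLE analytics.daily_revenue AS
SELECT
    o.order_date,
    c.region,
    SUM(o.amount) AS total_revenue
FROM raw.orders o
JOIN raw.customers c
    ON o.customer_id = c.customer_id
GROUP BY o.order_date, c.region;
```

A tool that cannot generate statements like this, and instead pulls rows out of Snowflake to transform them on its own servers, pays the data-movement and performance costs described above.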
Full stack ETL tools are also very expensive to procure and maintain, require additional hardware, and typically have many features you don’t need with Snowflake but end up paying for anyway. They also require highly skilled, specially trained people to run them.
Varying Mixes of Open-Source Tools
Today, there are relatively newer open-source alternatives in the transformation space. Instead of trying to tackle the “full” ETL stack, they work in an ELT paradigm. Best-of-breed vendors handle the “EL”: extracting data from source systems and loading it into Snowflake. Once raw data lands in Snowflake, open-source tools handle the “T” for transformation.
This accomplishes the “pushdown” so processing takes place in Snowflake, which is great. Five years ago, simply pushing down SQL select statements to Snowflake was enough for many customers, especially newer startups looking for something free to manage simple data pipelines.
There are problems with these open-source alternatives, though:
- A steep learning curve shuts many non-coders out.
- The lack of a GUI feels like stepping back in time to managing PL/SQL scripts by hand.
- Tools aren’t optimized for Snowflake and take a lowest common denominator approach to support many technologies, resulting in inefficient SQL in Snowflake.
- As projects grow, performance suffers.
- As projects grow, managing project metadata / infrastructure becomes its own challenge.
Organizations that have chosen Snowflake deserve a better experience building data pipelines, and Coalesce delivers exactly that.
In a lot of ways, I think Coalesce and Snowflake are very similar. Like Snowflake, Coalesce is built in the cloud, for the cloud, and delivered as a service. There is no infrastructure to manage, no software to install. You simply connect to Snowflake and start using it to build data pipelines, and it scales with you.
I could list many reasons why I wanted to join Coalesce, but if I had to pick my top three, they would be the product, the people, and the incredible potential I see for our customers and our business.
Coalesce is a cloud-native solution with a GUI built specifically for Snowflake; it fits the platform like a glove. I think of it as a wrapper around Snowflake that makes it easy for anybody to access all of Snowflake’s powerful data transformation functionality through a simple drag-and-drop interface. When you create a pipeline in Coalesce, it generates Snowflake SQL, and that SQL is only ever executed in Snowflake: it is 100% pushdown. For people who prefer to code, it also includes a powerful CLI.
Automation is another big differentiator: Coalesce comes with many predefined node types for creating pipelines in Snowflake. Even better, you can create your own custom nodes based on your needs.
- Want to build a Type 2 SCD following data warehouse best practices? There is a node for that.
- Want to create a streaming pipeline using Streams and Tasks but don’t know how to code it? There is a node for that.
- Want to implement a Deferred Merge (Lambda Architecture) to reduce ingestion cost and data latency? There is a node for that.
- Want to quickly implement the Deferred Merge logic across 5,000 tables? Coalesce automates that.
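For readers unfamiliar with the Snowflake features behind the second bullet, here is a hand-written sketch of the Streams and Tasks pattern those nodes are meant to generate for you. All object names (tables, stream, task, warehouse) are hypothetical; this is the kind of boilerplate that otherwise has to be written and maintained by hand for every table.

```sql
-- Hypothetical sketch of a Streams + Tasks micro-pipeline.
-- A stream records changes to the raw table; a scheduled task wakes up,
-- checks whether the stream has new data, and loads only the new rows.
CREATE OR REPLACE STREAM raw.orders_stream ON TABLE raw.orders;

CREATE OR REPLACE TASK transform_orders
  WAREHOUSE = transform_wh
  SCHEDULE  = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('raw.orders_stream')
AS
  INSERT INTO analytics.orders_clean
  SELECT order_id, customer_id, amount, order_date
  FROM raw.orders_stream
  WHERE METADATA$ACTION = 'INSERT';

-- Tasks are created in a suspended state and must be resumed to run.
ALTER TASK transform_orders RESUME;
```

Writing this once is manageable; writing and maintaining it consistently across hundreds or thousands of tables is exactly the kind of repetitive work that node-based automation is designed to eliminate.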
The Coalesce founders come from the industry and have been working with Snowflake for a long time. They understand the problems customers face and have come up with an elegant solution for helping them. What strikes me is that they genuinely want to help Snowflake customers use Snowflake more effectively.
Over the past few months, I have interacted with many Coalesce employees. They are a tight-knit team excited to grow the company and ready to take on any challenges they may face. There is obviously a focus not just on building a great product, but also on creating a great company culture. If I were ever going to leave Snowflake, it would have to be for a company similarly focused on both customer success and employee satisfaction.
Best of all, Coalesce employees are ardent Snowflake supporters, like me. This isn’t surprising because Snowflake is the only platform Coalesce supports, which made my decision to join that much easier.
Coalesce helps Snowflake customers get more out of Snowflake. There is already a large and growing market for what Coalesce can do. Combine that with the phenomenal growth that Snowflake continues to experience, particularly in its number of customers, and the sky really is the limit. The future is bright, I’m excited for a new adventure, and I cannot wait to get started!