It can be hard, even for techies, to understand the numerous companies operating in the data storage and processing space. Some have evolutionary solutions, some have revolutionary solutions. To make matters harder for us investors, not only are there many companies with extremely smart, talented people solving problems, but those companies rarely stand still.
Two such companies are Snowflake and Databricks. This post is a first attempt at understanding where these 2 companies came from, where they’re going, and whether they compete.
TL;DR: Snowflake and Databricks are close friends today, but becoming frenemies. Still, in my view, neither will ever completely replace the other, and I don’t see any real competition between them in the near or mid term.
What’s most interesting to me is that both Snowflake and Databricks were created in response to Hadoop’s deficiencies. While it’s beyond the scope of this post, suffice it to say that Hadoop was created to handle Big Data storage (via HDFS) and processing (via MapReduce). HDFS was modeled on the Google File System, and MapReduce also came from Google back in the day.
Snowflake was founded in 2012 by a couple of Oracle engineers who set out from the beginning to create a Data Warehouse in the cloud. That meant not only moving away from the On-Premise model, but also moving away from the heavily structured Relational Data model. But, instead of completely unstructured data (as might be found in a Data Lake), they chose a loosely-structured approach. HDFS is complex and costly to implement and maintain. Snowflake fixes that and enables elastic scaling of storage and compute in the cloud.
Snowflake isn’t just a better HDFS, it’s a better Enterprise Data Warehouse. Whereas companies like Oracle are centered around heavily structured Relational Data models running on hardware the customer owned, Snowflake was designed to handle loosely structured data in the cloud. And while companies like MongoDB handled loosely structured data (known as “NoSQL”), they didn’t do so in the cloud until relatively recently with their Atlas product, and even today still don’t support the very commonly used SQL language. One of Snowflake’s secret sauces is that they can store and process what’s called a “sparse table” efficiently.
For laypeople, here’s some explanation. A traditional Relational Database (“RDB”) can be thought of as a table in Excel. Each “thing” in the database would be a single row in that table, with entries for each column heading. A typical table might have dozens of columns, but thousands or even millions of rows. Now if you were to try to merge different tables together (say IoT data from cars and from motorcycles), you have the problem that many of the columns won’t match. Maybe you can map between them if they’re essentially the same, but often they cover different things (cars don’t have a chain wear indicator and motorcycles don’t have a cabin temperature). So, you end up with a table in which many of the rows have no entries for many of the columns. You can handle this in your processing of the data, but the real problem is that the combined table requires far more storage space than the two tables from which it was built, and performance slows down as well. The NoSQL models get around this by not actually building the table, but then you can’t use standard SQL. One of Snowflake’s genius capabilities is that they can store and process that sparse table very efficiently, so customers can still use SQL.
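For the programmers out there, here’s a minimal sketch of the sparse-table problem described above, in plain Python. The field names (vehicle_id, speed_kph, cabin_temp_c, chain_wear_pct) and the readings are made up for illustration, and this says nothing about how Snowflake actually stores data; it only shows how merging two tables with mismatched columns leaves empty cells behind.

```python
# Hypothetical IoT readings; all field names and values are invented.
cars = [
    {"vehicle_id": "car-1", "speed_kph": 88, "cabin_temp_c": 21.5},
    {"vehicle_id": "car-2", "speed_kph": 64, "cabin_temp_c": 19.0},
]
motorcycles = [
    {"vehicle_id": "moto-1", "speed_kph": 102, "chain_wear_pct": 35},
]


def merge_tables(*tables):
    """Combine rows from every table under one shared list of columns."""
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # A column that a row never had becomes None (an empty cell).
    merged = [
        {name: row.get(name) for name in columns}
        for table in tables
        for row in table
    ]
    return columns, merged


def empty_fraction(columns, rows):
    """Share of cells in the merged table that hold no value."""
    total = len(columns) * len(rows)
    empty = sum(1 for row in rows for name in columns if row[name] is None)
    return empty / total


columns, merged = merge_tables(cars, motorcycles)
print(columns)
print(empty_fraction(columns, merged))
```

With only two vehicle types, a quarter of the merged cells are already empty. Add more device types, each with its own columns, and the empty share keeps growing, which is why efficient handling of sparse tables matters.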
Databricks was founded a year after Snowflake by the UC Berkeley creators of Apache Spark, an open-source project that set out to fix the complexity and performance of MapReduce (remember, HDFS and MapReduce are essential components of Hadoop). In layman’s terms, Spark is an engine that runs analytics jobs that Data Scientists write in Java, Scala, or Python to operate on data stored in HDFS (it has expanded to more languages and DBs since). Databricks started out providing a fully usable eco-system around Spark (you can see what Databricks says they add here: https://databricks.com/spark/comparing-databricks-to-apache-… ).
Again, for the laypeople out there, Spark is better than MapReduce for a variety of reasons, the top two being that 1) Spark uses a streaming architecture in which data is read in and sent out incrementally and so can be processed in memory, which is much more efficient, and 2) Spark jobs are much easier to write and manage. An important value-add for Databricks is that they completely rewrote Spark in C++ so it runs much faster than the original JVM-based version (open-source Spark is written largely in Scala), and this C++ code is proprietary (not open-source) to them. They call this product the “Delta Engine” (https://databricks.com/blog/2020/06/24/introducing-delta-eng… ).
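To give a feel for point 1, here’s a rough sketch of incremental, in-memory processing using plain Python generators. This is not Spark code and not how Spark is implemented; the stage names, the record format, and the threshold are all assumptions I made up. The only point is that each record flows through every stage one at a time, without an intermediate result being written out between stages.

```python
def read_records(lines):
    """Stage 1: hand out raw records one at a time."""
    for line in lines:
        yield line.strip()


def parse(records):
    """Stage 2: turn 'id,speed' text into a (vehicle_id, speed) pair."""
    for record in records:
        vehicle_id, speed = record.split(",")
        yield vehicle_id, int(speed)


def only_fast(rows, threshold):
    """Stage 3: keep only the rows above the speed threshold."""
    for vehicle_id, speed in rows:
        if speed > threshold:
            yield vehicle_id, speed


# Hypothetical input; in real life this would be a large file or feed.
raw_lines = ["car-1,88\n", "car-2,64\n", "moto-1,102\n"]

# Chaining the generators builds the pipeline; no work happens yet.
pipeline = only_fast(parse(read_records(raw_lines)), threshold=80)

# Pulling results drives one record at a time through all three stages.
fast_vehicles = list(pipeline)
print(fast_vehicles)
```

A job that instead wrote the full output of each stage to disk before starting the next would do the same work with a lot more I/O, which is the inefficiency being described.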
To usefully over-simplify, Snowflake was a better HDFS for storage and Databricks was a better MapReduce for compute. BTW, small wonder that Hortonworks and Cloudera, which were both pure-play Hadoop companies, struggled and merged and are still struggling.
II. BFFs (Best Friends Forever)
With Snowflake being a way-better HDFS replacement (among other things) and Databricks being a complete Spark eco-system (ie, a way-better MapReduce replacement), these companies would seem - and are(!) - natural friends. Indeed, they announced a strategic partnership way back in 2018: https://databricks.com/company/newsroom/press-releases/datab…
You can see the CEO of Databricks on stage getting along quite well with one of the founders of Snowflake in this TechCrunch video from 2019: https://youtu.be/Xi70_iTY3FY
For a third-party perspective, here’s a talk from a Smartsheet engineer on why and how they migrated from an internal MySQL database that didn’t scale for their needs to a combination of Snowflake and Databricks: https://youtu.be/0GELzB9UZmM
Smartsheet has databases with 50 million rows and needed to process 100 million rows a day. It took them a total of 8 months to completely migrate 50K lines of SQL over, including 5 months of parallel storage/processing between the two systems before they were convinced there were no big issues in the new system. One of their slides compares their on-premise MySQL implementation to Snowflake. In one example job, the user gave up after waiting 1.5 hours on MySQL, while the same job took just 20 minutes on Snowflake. Table lockups are rare on Snowflake and, most importantly, elasticity is easy on Snowflake. Snowflake also didn’t limit them to SQL, as it also supports Java & Python. Smartsheet also uses Databricks for ML (machine learning) and advanced analytics (quote from the talk: “Databricks for ML; Snowflake for everything else”). BTW, they use Tableau for reports and have about 1000 reports internally.
A company called Datalytyx also uses both Snowflake and Databricks in a similar manner (https://www.datalytyx.com/snowflake-and-databricks-a-hyper-m… ).
Note that both Snowflake and Databricks are centered around data analytics. Neither is going after transactional database problems.
III. Maybe Not Forever
If you watched the TechCrunch video I linked above, you might have picked up on a difference in perspective between Databricks’ CEO and Snowflake’s founder regarding where data is kept. Snowflake naturally wants you to store all your data in a central Data Warehouse (them) so that you can easily run all sorts of analytics and other processing jobs. Databricks wants you to take advantage of the programmability of streaming jobs in Spark/Delta Engine to access data from multiple databases, which could be in multiple places. Snowflake’s thinking is that they’ll make it easy to pipe data from anywhere into the central data warehouse, and that’s worth doing since you’ll be running all kinds of analytical jobs on it in the future, while Databricks is saying you don’t need to bother to fix all your company’s existing data silos.
Still, as we saw in Section II, it’s easy to store your data in Snowflake and then run Spark jobs in Databricks on that data. If you have other analytics jobs that aren’t Spark compatible (or don’t need Spark), you can run those in Snowflake, and if you have data not in Snowflake you can run your Spark jobs on it in Databricks. So, friends, but not exclusive. Nice.
Where the cracks are appearing in this friendship is on the Databricks side. The Spark/Delta Engine is a general purpose processing environment that can also be used for everything from data cleansing to ETL (Extract, Transform, Load) to Big Data analytics. It doesn’t need structured data since it’s essentially running a program you wrote, and you can write that program to handle whatever data you’ve got. So, Databricks has decided that you don’t need a Data Warehouse (eg, Snowflake), and that a Data Lake is all you need.
The term Data Lake comes from the Hadoop-era Big Data days. At that time, companies like Google were gathering tons of data, but they didn’t yet know what they would do with that data. So, HDFS is a big unstructured pool of data that they called a lake, since lakes are bigger than pools. One advantage of Data Lakes is that they’re cheaper than Data Warehouses. Amazon’s cheapest storage, S3 (Simple Storage Service), is commonly used as a Data Lake, whereas Amazon’s Data Warehouse, Redshift, is much more expensive per byte.
So, what Databricks has done is to open-source what they call a “Delta Lake”, which runs on top of simple Data Lakes and, in conjunction with their Delta Engine, also supports SQL in addition to Python and Java. Remember, SQL is the traditional language of choice for relational databases (RDBs), so this opens up Databricks to lots more users, especially users that aren’t programmers. Databricks is branding the combination of these offerings as a “Data Lakehouse” (see https://databricks.com/blog/2020/01/30/what-is-a-data-lakeho… and https://www.datanami.com/2020/11/12/data-lake-or-warehouse-d…).
This would seem to remove Snowflake from the pipeline, but I think it’s important to understand that not every company wants to write Spark jobs, even in SQL, to analyze data. Databricks is still very much tied to requiring Spark/Delta Engine jobs be written by Data Scientists - if you want to use anything else then you’re somewhat stuck since your data is in a Data Lake and needs additional processing and transfer to a Data Warehouse to use most other tools on it. If you put your data into Snowflake, then you can support multiple processing pipelines on it.
Note that Snowflake has been clear that it doesn’t want to take on analytics processing itself. It’s not looking to compete with Spark or Databricks or Alteryx, etc. They want to be like Swiss bankers for your data. Central, secure and easy access for whatever purpose you want.
My conclusion is that Databricks might be successful in getting some companies that use Spark/Delta Engine jobs to put data that is only processed by Databricks’ Delta Engine into cheaper Delta Lakes instead of Snowflake, but that this is probably a small subset of data at a subset of the companies that would be considering Snowflake.