What is Snowflake?

In post https://discussion.fool.com/smorg-any-thoughts-on-how-to-think-a… , Dreamer asked a number of questions about Snowflake, including:

You have a link to a layman’s view of what use cases look like for Snowflake? Seems like data warehouses have been around for a while, so I am not sure what Snowflake is doing that is so unique.

So I thought now might be a good time to explain what Snowflake is, what it does, and what differentiates it from other database solutions. I actually found a decent summary of Snowflake on DataDog’s monitoring site. Here’s the link and a couple snippets: https://www.datadoghq.com/blog/snowflake-monitoring-datadog/…

Snowflake is a data platform that enables users to easily store, manage, analyze, and share high volumes of structured and semi-structured data. Whereas traditional data architectures often consist of multiple databases, data warehouses, and data lakes, Snowflake’s Data Cloud breaks down the silos between your different data sources and serves as a single source of truth for a wide range of simultaneous workloads.

and

Snowflake offers a single, persistent storage system for all of your data, regardless of where it’s coming from, how it’s structured, and what it’s needed for. Data can be staged and loaded into Snowflake in several ways, such as with the Snowflake Web UI, the Snowflake CLI, and a variety of third-party tools. This flexibility allows data to be migrated from an existing database or data lake, or continuously ingested from a stream-processing service such as Kafka.

and

With Snowflake, every virtual data warehouse is able to access the same storage layer without competing with other warehouses for resources, which means data can be loaded and queried at the same time.

That’s all well and fine, but let me break Snowflake’s characteristics down into bite-sized chunks. Put simply, Snowflake is a database that:
A) Runs in the cloud.
B) Is provided as a service (DBaaS) (administration and management of the database is automatically handled).
C) Supports both fully structured (SQL) data and semi-structured (NoSQL) data efficiently.
D) Provides both storage and compute efficiently in the platform.
E) Hides the internal specifics of the database without sacrificing performance or cost.
F) Uses a single control plane for accessing all data you store.
G) Enables sharing of read-only data (as tables or views) with external organizations.

There are many database products available today. Only Snowflake provides all of the above. And Snowflake is great at each one. Taking each characteristic separately:

A) Runs in the cloud. Hard to believe, but up until 2016, the big database providers like Oracle and MongoDB (I’m just picking on two companies here) only had database products that customers had to buy and provision on a server located in the company itself.

B) Is DBaaS (DB as a Service). Running in the cloud removes the hardware considerations from customers, but those cloud instances still have to be administered, provisioned, and managed. In 2016 both Oracle and MongoDB released DBaaS products: Oracle Cloud and MongoDB Atlas. Both were version 1 products with numerous limitations, but have gotten way better over the years. But note that it’s only a very recent event that Atlas made up more than half of MongoDB’s total business. Microsoft was actually ahead of the game with their Azure product in 2010.

C) Supports both structured (aka relational) and semi-structured (aka non-relational) data efficiently. Let’s back up and talk about data organization. Originally, databases were set up as tables like you might find in MS Excel, with defined column names and cell content datatypes, and then as many rows as you have data entries. This format enables all sorts of data processing. But things aren’t always so clean in the real world. Sometimes you don’t have values for some of the cells. Sometimes you have additional data which needs new columns. Heck, sometimes the values aren’t of the same datatype. For example, in an IoT world you might have dozens of different devices sending data to your database. Those different devices might have different software versions; they might even have different hardware. But, in a structured world you can’t take the data unless you can massage it into the structure of your database or modify the database to handle the superset of all possibilities.
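To make the “table” idea concrete, here’s a minimal SQL sketch (the table and column names are made up for illustration):

-- A classic structured table: fixed column names, fixed datatypes.
CREATE TABLE device_readings (
    device_id   INTEGER,
    recorded_at TIMESTAMP,
    temperature FLOAT
);

-- A row that matches the schema loads cleanly...
INSERT INTO device_readings VALUES (42, '2021-06-01 12:00:00', 71.3);

-- ...but a reading with an extra field (say, humidity), or a temperature
-- sent as text, can't be stored without first changing the table itself.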

This is where semi-structured data comes in. Rather than storing data in tables, data is simply stored as Key/Value pairs. Instead of having a column called “Name,” for instance, you insert something like JSON (JavaScript Object Notation) into the database, like this:

"Name" : "Saul"

You can literally make up Keys and datatypes for each data record you add. The downside to this flexibility is that there is no longer any guarantee that any data record has any particular set of keys, but you can program around that as long as you know the complete set of all keys in existence and what they mean. But, it does mean that you can store any data from any source in your database as it comes in, and worry about processing/analyzing it later.

BTW, structured data is often called “SQL,” which stands for “Structured Query Language,” and is really just a set of APIs that structured databases support. For the most part, if you know SQL you can do most of what any structured database can do. And semi-structured databases are called “NoSQL,” because, well, they don’t support SQL constructs.
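As a tiny taste of SQL, using the hypothetical table from the sketch above:

-- Average temperature per device, hottest first - one line of logic,
-- and it runs on essentially any structured database.
SELECT device_id, AVG(temperature) AS avg_temp
FROM device_readings
GROUP BY device_id
ORDER BY avg_temp DESC;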

One obvious question that comes up is why not use structured data (tables), and simply add a new column every time you have a new Key in a data record? The answer is that this blows up the size of the database, since it has to reserve space for that column in every row of the database, as well as provide default values for every pre-existing row (there can be thousands or millions of rows). And then, it’s not just the column names, but the datatypes. In SQL, the datatype of every entry in a column must be the same (think of formulas in Excel). So, you can, in theory, simply create new columns for every new type of data to be stored, but you lose efficiency as the database size grows, and you incur cost from the processing that adding every new column entails on the existing data. On the other side, you can easily insert structured data into a NoSQL database, but then you won’t have the power of SQL for processing.

Another way to think about this is that both SQL and NoSQL require care and processing. With SQL, you have to first prep the data to match the definition of the table (called a “schema”) into which the data is going. With NoSQL, you can just add it to the database, but when you want to do something with that data you then have to process it, which can be a multi-hour thing for big NoSQL databases.

Snowflake tastes great and is less filling. It supports most SQL constructs, yet handles semi-structured data quickly (without hours of pre-processing). It even lets you use SQL write constructs on NoSQL data! I don’t know all the magic sauce involved, but it’s a combination of being able to efficiently handle sparse tables (tables that are missing many cells’ contents) combined with captured metadata that is used for NoSQL type processing. While many structured databases were architected before NoSQL was popular and so are very inefficient with NoSQL data, and most NoSQL databases were created only for semi-structured workflows, Snowflake is a newer architecture that handles both well.
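I can’t show the magic sauce, but I can show what it looks like from the outside. Here’s a minimal sketch in Snowflake’s SQL dialect (the table and keys are made up; VARIANT is Snowflake’s column type for semi-structured values):

-- One VARIANT column holds JSON records, each with whatever keys it came with.
CREATE TABLE raw_events (payload VARIANT);

INSERT INTO raw_events
  SELECT PARSE_JSON('{"Name": "Saul", "Device": "thermostat", "Temp": 71.3}');
INSERT INTO raw_events
  SELECT PARSE_JSON('{"Name": "Smorg", "Humidity": 40}');  -- different keys, no schema change

-- Plain SQL over the JSON: drill into keys with ":" and cast with "::".
SELECT payload:Name::STRING AS name, payload:Temp::FLOAT AS temp
FROM raw_events
WHERE payload:Device::STRING = 'thermostat';

Records that are missing a key simply return NULL for it, which is how that sparse-table handling shows up in practice.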

Snowflake describes at a high-level what its architecture is: Snowflake’s architecture is a hybrid of traditional shared-disk and shared-nothing database architectures. Similar to shared-disk architectures, Snowflake uses a central data repository for persisted data that is accessible from all compute nodes in the platform. But similar to shared-nothing architectures, Snowflake processes queries using MPP (massively parallel processing) compute clusters where each node in the cluster stores a portion of the entire data set locally. This approach offers the data management simplicity of a shared-disk architecture, but with the performance and scale-out benefits of a shared-nothing architecture. (https://docs.snowflake.com/en/user-guide/intro-key-concepts… )

D) Handles Storage and Compute efficiently. Now this doesn’t sound like a big deal to many people at first. After all, you can have your DBaaS in the cloud and simply create AWS EC2 or Azure compute instances to run your processing. And sure, that works. The issue there, however, is that this separate compute process must access the data, which is stored in another AWS or Azure instance, and so the data must be extracted and transferred to your compute process. This takes processing memory and time, which cost money at the cloud providers and slows things down. With Snowflake, the compute job you create has direct access to the database, so there’s no data extraction/loading/copying needed.

But, with Snowflake you only pay for what you use. If you have data that you don’t access for a week or a month or whatever, Snowflake charges you about what Amazon’s AWS S3 charges for data storage - and Snowflake has compressed that data tightly for you (remember the SQL tables) to reduce those charges. Only when you fire up a compute process do you pay for that. You can scale storage and compute up and down immediately and independently as you desire.
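A sketch with real Snowflake commands (the warehouse name is made up) shows how independent compute and storage feel in practice:

-- Compute is a "virtual warehouse" that suspends itself when idle;
-- a suspended warehouse costs nothing while your data sits compressed in storage.
CREATE WAREHOUSE analytics_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60     -- suspend after 60 seconds of inactivity
  AUTO_RESUME    = TRUE;  -- wake automatically when a query arrives

-- Need more horsepower for one big job? Resize on the fly, then shrink back.
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'XLARGE';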

E) Hides the internal specifics of the database without sacrificing performance or cost. I’ve covered this above in how Snowflake internally keeps structured data compressed but provides both SQL and NoSQL access methods. Hiding the internals from users of the database has many benefits. One such benefit was Snowflake’s announcement in June 2021 (https://www.snowflake.com/news/snowflake-announces-new-featu… ) that it found a way to compress data even more than before. As Snowflake said: This enhanced storage technology is available transparently, with no user action, no configuration changes, and no application or query changes. This kind of improvement can only be done if the internal details are not exposed to end users, for if the format of the internal data storage is exposed, it can’t be changed without forcing users to change. Better than a Tesla car getting an OTA update, Snowflake got better overnight without any user having to do anything differently (not even clicking on “update”). What other database provider has done anything like this?

F) Uses a single control plane for accessing all data you store. What this means is that when you put data into Snowflake, there’s no partitioning or splitting up of your data among different nodes or clusters. It’s not like many DBaaS providers that simply spin up an instance of their product and manage it for you to use, such that if you ask them they’ll spin up 5 or 6 or 100 separate instances (maybe in different regions of the world), with no easy/fast connection between them. With Snowflake, you don’t worry about instances; you just set up databases, you access them all the same way, and your compute can access them all if you want.

G) Enables sharing of read-only data (as tables or views) with external organizations. A lot has been discussed around Data Sharing, in Snowflake’s conference calls, in the press, and here on this board. My view is that this is a major differentiator and will remain a big moat for Snowflake. With Snowflake, live data set views can be shared directly under control of the owner. Others may try to copy the feature by supporting read-only copy/clone access methods, but that’s not only slow and costly, it won’t be “live” up-to-date data. Snowflake has built a system around controlling and monitoring access that I suspect won’t be easily reproduced by other DBaaS providers.
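To give a flavor of how simple the owner’s side is, here’s a hedged sketch using Snowflake’s share commands (database, table, and account names are made up):

-- Create a share and grant read-only access to live data.
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales_db.public.monthly_summary TO SHARE sales_share;

-- Invite the partner's Snowflake account. They query the live table directly;
-- no copies are made, and nothing has to be kept in sync.
ALTER SHARE sales_share ADD ACCOUNTS = partner_org.partner_account;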

Now, to be fair, Snowflake isn’t going to replace all other databases. In particular, Snowflake isn’t good at Transactional workflows, where you might be updating individual rows in a table independently. Think of banking where an ATM needs to update the row for your account’s balance as you take money out, or inventory control in a warehouse as items come and go. Those are called OLTP workflows (OnLine Transactional Processing). The reason for this has to do with how Snowflake compresses and stores the data, which is optimized for data access, but is costly for writing/updating.

Instead, Snowflake is great at OLAP workflows (OnLine Analytical Processing). Think analytic, data mining, and business intelligence workflows where you are mostly reading data and processing it to provide insights. This is why Snowflake is often thought of as an EDW (Enterprise Data Warehouse) or Data Lake, where massive amounts of data are stored for later analytical processing. As IoT devices proliferate and gather more and more data that companies wish to analyze to improve their business, these kinds of workflows are becoming more prevalent.

253 Likes

Smorg, that was an awesome write-up about Snowflake. You have the gift of explaining the technical issues so that they can be understood by a tech novice like me, in plain words, and with all the acronyms defined and explained. Wow! I won’t remember all of it but what I’ll remember is the essence, that Snowflake does all these things that no one else can do as well, and that I once read it and understood it. (That gives a lot more confidence.)

Thanks,

Saul

43 Likes

Thanks, but I need to make a correction. While editing my post down (yes, it was even longer at one point), I mixed up a date and milestone. What I should have said was that it wasn’t until 2016 that DBaaS products like MongoDB Atlas and Oracle Cloud were first made available, but that both were able to be run on the cloud, self-managed, as early as 2008 (maybe even earlier for MongoDB’s case as that was open source). Sorry about that.

Here’s one way to look at tiers for ease of use/deployment:
0) Runs on-premise. This means you, the customer, buys hardware and installs and configures everything.

1) Runs on the cloud. This means you don’t buy hardware, but you’re still installing and configuring things yourself (or hiring another company to do that for you).

2) Runs as a Service (DBaaS). This is the easiest - you sign up, arrange payment, and then start using almost immediately. Note that within this tier there are still ease-of-use differences. For instance, some DBaaS, like Snowflake, will auto-scale up and down for you, while other services require manual scaling. While MongoDB has had its Atlas cloud-native version for over 5 years, it’s only just now introducing Atlas Serverless (in Preview/Beta) to provide this kind of auto-scaling feature, as well as a pricing model more similar to Snowflake’s separation of storage and compute, and reduced configuration and management.
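For a concrete look at what tier-2 auto-scaling means in Snowflake’s case, here’s a sketch (warehouse name made up; multi-cluster warehouses are an Enterprise-edition feature):

-- A multi-cluster warehouse adds clusters under query load and
-- retires them as demand falls, with no manual intervention.
CREATE WAREHOUSE bi_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY    = 'STANDARD';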

I also wanted to talk a bit more about Snowflake’s Data Sharing. Most of the world’s businesses are only just now figuring out their “Digital Transformation” strategy. Sound crazy? Well, according to this (https://www.precisely.com/blog/mainframe/9-mainframe-statist… ), 71% of the Fortune 500 still use mainframes, mainframes handle 90% of all credit card transactions and 68% of the world’s production IT loads. IBM is even still releasing new mainframe products, like the z15. One reason for the slow adoption is that it can be hard to justify the ROI on the large investment in changing how the business operates to first gather all the data, and then figure out how to process it. Until you’ve done the actual work, you don’t know how much more efficient you can be or what marketing/customer insights you can gather to improve sales, etc. So, going to your BoD with a multi-year, multi-million dollar plan when the best ROI you can cite is what other companies have experienced is a really hard sell. I’ve been in such BoD meetings, and they attack your plan with things like “We are already better than that other company, so why do you think we’ll gain as much?”

What Data Sharing does is open up a new revenue stream to help justify the cost. Companies may already be sharing some data with other companies on an ad-hoc basis (Insurance is a big buyer of data), but Snowflake’s model opens that up greatly in a number of dimensions in terms of how much data, how “live” that data is, and the controls on what can be done with the data. If part of your BoD pitch on your company’s Digital Transformation includes new revenue streams, that’s going to help you get that approval.

Another area worthy of discussion is Snowflake’s competition. I’d guess the two most-mentioned contenders are Amazon’s Redshift and Databricks.

Snowflake vs Amazon Redshift:
These are similar products. Both run as a Service, both support SQL queries, both provide a unified storage and compute model with MPP (Massively Parallel Processing) for high performance, and with a recent change in Redshift, both have pricing models that separate storage from compute.
In terms of differences, Snowflake handles semi-structured data efficiently, while Redshift does not. Snowflake requires less maintenance than Redshift and provides auto-scaling, which must be done manually in Redshift (at least last time I looked). Redshift runs only on Amazon’s AWS, while Snowflake also supports Azure and GCP (Google Cloud Platform). And, of course, only Snowflake has Data Sharing.
In terms of security, Snowflake has an always-on encryption model with access control, while Redshift requires/enables the customer to configure how much security they want. Some customers may want the “don’t worry about it” model, while other customers may want the “I want security control” model.
Unfortunately, I’m not involved with either enough to know how pricing compares.

Snowflake vs Databricks:
These are not similar products, so I’m somewhat surprised that there is much to discuss here, but Databricks seems intent on the comparison, so here goes:

Similarities:
• Both are provided as a Service
• Both can handle large amounts of semi-structured data
• Both support OLAP (Analytical) workflows
• Both support auto-scaling based on demand
• Both support a variety of languages for programming/access
• Both support a “Time Travel” feature that preserves the state of data before each transformation in case you need to roll back some processing mistake (see the sketch just below).
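Here’s what Snowflake’s flavor of Time Travel looks like (table name made up; Databricks’ Delta Lake has an analogous “VERSION AS OF” syntax):

-- Query the table as it looked an hour ago (offset is in seconds)...
SELECT * FROM orders AT (OFFSET => -3600);

-- ...or recover from an accidental drop entirely.
DROP TABLE orders;
UNDROP TABLE orders;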

Differences:
• Snowflake requires less configuration and management than Databricks
• Snowflake stores data within its system; Databricks does not have its own data layer. Databricks says that’s an advantage because it can work with data you already have on Amazon’s S3, Azure’s Blob Storage, or Google’s Cloud Storage (and some others, too).
• Snowflake has Data Sharing
• By far the biggest difference is in the data organization and processing models. This deserves its own discussion:

Snowflake mainly provides SQL (Structured Query Language) access to data (whether that data is fully or semi-structured). SQL actually dates back to the 1970s, although it has undergone many revisions (the last official ANSI version was 2016).

Databricks mainly provides an optimized version of Apache Spark, an open source engine for large-scale data processing, including advanced ML (Machine Learning) workloads. The folks at Databricks had originally worked on the open source Spark project for many years - now they’ve written their own proprietary version of Spark in C++, and they call it the Delta Engine (actually, the rewritten Spark piece is something they call Photon, but Photon is fully encapsulated within the Delta Engine). The Delta Engine executes more efficiently and faster than open source Spark, but the APIs are fully compatible. It’s very clear that if you already use Spark, Databricks is fantastic.

SQL is an easy-to-learn interface, and some people who use it daily don’t have computer science degrees. OTOH, Spark/Delta Engine requires complex programming concepts and languages. What you get for this extra effort is advanced processing, especially the MLlib (Machine Learning Library). Note that Spark with MLlib has pretty much killed off prior Hadoop-based storage and processing, since it’s about an order of magnitude faster. Spark also includes the GraphX component to enable graph processing, another advanced concept.

OK, this is starting to get too complex, but that’s kind of the point of the comparison: Databricks is like a Formula 1 race car while Snowflake is like a Tesla Model S Plaid. In the right hands, Databricks is an amazing tool for advanced analysis, especially when you need Machine Learning capabilities. But for more common analysis, like Business Intelligence, charting, end-user queries, etc., Snowflake is not only easier, but quicker. You just mash the accelerator on the Tesla and you’re at 60 MPH in under 2 seconds (vs ~2.5 for the F1), but if you regularly need to travel at high speeds (say 120 MPH or more), the F1 is the car you want to be in, as long as you’re skilled enough.

The two products do overlap somewhat: Databricks provides an SQL interface (see https://www.crn.com/news/applications-os/databricks-offers-s… ), while Snowflake does have integrations with Spark (see https://www.snowflake.com/trending/machine-learning-tools or https://www.snowflake.com/guides/machine-learning-platforms ). That said, neither is as good as the other in its domain of strength.

You’ll also hear a lot about Data Warehouse versus Data Lakes versus Data Lakehouse, at least from Databricks.

Data Warehouse, or Enterprise Data Warehouse (EDW) is simply a big place to store lots of structured data. Snowflake extends this to be able to support semi-structured data as well.

Data Lakes are big places to store lots of raw, typically semi-structured, data. You have to perform ETL (Extract, Transform, Load) processes on that data to organize it - then you can process it for analytics.
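The loading half of that is straightforward in Snowflake; here’s a sketch with made-up bucket and table names (credentials omitted):

-- Point an external stage at the data lake's bucket...
CREATE STAGE lake_stage
  URL = 's3://my-data-lake/events/'
  FILE_FORMAT = (TYPE = 'JSON');

-- ...then bulk-load the raw, semi-structured files into a VARIANT table
-- (like raw_events above) to transform and analyze later.
COPY INTO raw_events FROM @lake_stage;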

Data Lakehouse is something Databricks invented. They are trying to make it an open standard. I haven’t seen an implementation personally, but it appears to be a Data Warehouse for semi-structured data that combines the cost and ease in gathering & storing data of a Data Lake with the processing speed benefits of an EDW. It’s still unproven in terms of practicality. In terms of impact on Snowflake, I believe that Snowflake’s storage pricing is close to Amazon’s raw S3 pricing, so there is hardly any cost benefit to a Lakehouse over a Warehouse. Other EDW solutions, however, may not fare as well. This is an area receiving lots of attention lately. Heck, there’s even something new (2019) called a Data Mesh that I haven’t figured out yet.

One thing I’ve learned is that as fast as the technology and offerings evolve, the business world is mostly very slow to adopt. MongoDB came out with its DBaaS product, Atlas, half a decade ago, and it’s only just recently that half its business is from that. And it’s only this year that their fully-managed Serverless product is coming out. It’s easy for techies to say that Databricks, with its ML capabilities and speed, is the future, but business-wise I continue to believe that many companies are just now undergoing the “Digital Transformation.” They are just now capturing business data and will be happy to gather any BI (Business Intelligence) on that, so easy and cost-effective charts, graphs, and interactive queries are the name of the game here, and will be for some years. Yes, some companies are at the forefront of the technology and using advanced analytical techniques to drive their business forward, but that’s still a small percentage of overall database usage.

TTCL;DR Still Too Techy/Complex/Long, so you Didn’t Read? How about this:
• Data Sharing could become a nice moat for Snowflake.
• Snowflake is probably the best traditional large-scale Analytical database out there, beating Oracle, Amazon, etc.
• Databricks is probably the best high-end Machine Learning compatible database.
• I believe that the TAM for Snowflake is much larger than for Databricks, and will be for the next 5-10 years.
• I don’t believe that either Snowflake or Databricks will steal much business from the other, at least for the next 5-10 years.
• If you have Transactional processing needs (banks, inventory, etc.) then you need another kind of database. There is no such thing as one database to rule them all.

114 Likes

Really excellent write-up, Smorg, describing the difference between Snowflake and Databricks.

We are in the midst of implementing our modern data platform design based on the Lakehouse approach.

The reason is ML.

One thing to keep in mind is the types of compute and distributed scalability required to support operational ML. A design that follows the Lakehouse path gets our organization the best of both worlds.

We can efficiently ingest and transform data along our highly performant operational pipeline, siphon data from that pipeline to support operational ML, and then ultimately rely on other downstream transformations to build an eventual EDW for analytic workloads. That EDW could even be Snowflake.

The trick there is to design a platform that can support both needs (if you are future-proofing), and that is why Databricks “invented” and is pushing the Lakehouse path. Nothing precludes you from moving your Lakehouse data into Snowflake itself if you like. It is actually a very nice design that plays well with others. In fact, we intend to use Snowflake at the analytic end to support our business users.

Human consumers of data vs Machine consumers of data. Another thing to watch out for is when we think Machines will have higher demands on data than us humans do. That day is coming. I like the 5 year timeframe on that, but at the rate and pace of change that day may be sooner. In that world Databricks and their design path will begin to take over. I think Snowflake sees this, and that is why they have come up with Snowpark as an alternative to consider to stay in their ecosystem. We evaluated Snowpark as part of a POC. Unfortunately the product is not ready for prime time yet, but they are actively working on it.

In the short term I think Snowflake is fine. But after the hypergrowth phase is over I’d be looking to watch this space.

Best,
–Kevin

44 Likes

Kevin - great to hear from someone with boots on the ground, and interesting to hear that Snowpark isn’t fully baked yet. That’s certainly something to watch. And I completely agree that heavy users can and will use multiple analytical workflows, even multiple copies of their data, which unfortunately is rather commonplace.

In the short term I think Snowflake is fine. But after the hypergrowth phase is over I’d be looking to watch this space.

Well, I can point to my own past investing experience, which is that I tend to be way ahead of the curve on technology adoption. For instance, I got out of MDB way too soon. They are finally now going Serverless (in Preview), which if you asked me a couple years ago I would have said was already too late, but the continued high growth of Atlas revenue even without that has proven me wrong, as so many MongoDB customers are still running on premises.

Another thing to watch out for is when we think Machines will have higher demands on data than us humans do.

Oh wow, interesting to think that the output of business and ML analytics won’t be consumed by humans, but consumed by programs that steer the business directly. I read that Amazon tried this several years ago and got burnt by their algos not recognizing trends as being seasonal or fads, so maybe that day is further off than we think. That said, Wall St is already overrun with fully algorithmic trading, so there’s no denying that day will come.

10 Likes