In post https://discussion.fool.com/smorg-any-thoughts-on-how-to-think-a… , Dreamer asked a number of questions about Snowflake, including:
You have a link to a laymans view of what use cases look like for snowflake? Seems like data warehouses have been around for a while, so i am not sure what snowflake is doing that is so unique.
So I thought now might be a good time to explain what Snowflake is, what it does, and what differentiates it from other database solutions. I actually found a decent summary of Snowflake on DataDog’s monitoring site. Here’s the link and a couple snippets: https://www.datadoghq.com/blog/snowflake-monitoring-datadog/…
Snowflake is a data platform that enables users to easily store, manage, analyze, and share high volumes of structured and semi-structured data. Whereas traditional data architectures often consist of multiple databases, data warehouses, and data lakes, Snowflake’s Data Cloud breaks down the silos between your different data sources and serves as a single source of truth for a wide range of simultaneous workloads.
and
Snowflake offers a single, persistent storage system for all of your data, regardless of where it’s coming from, how it’s structured, and what it’s needed for. Data can be staged and loaded into Snowflake in several ways, such as with the Snowflake Web UI, the Snowflake CLI, and a variety of third-party tools. This flexibility allows data to be migrated from an existing database or data lake, or continuously ingested from a stream-processing service such as Kafka.
and
With Snowflake, every virtual data warehouse is able to access the same storage layer without competing with other warehouses for resources, which means data can be loaded and queried at the same time.
That’s all well and fine, but let me break Snowflake’s characteristics down into bite-sized chunks. Put simply, Snowflake is a database that:
A) Runs in the cloud.
B) Is provided as a service (DBaaS) (administration and management of the database is automatically handled).
C) Supports both fully structured (SQL) data and semi-structured (NoSQL) data efficiently.
D) Provides both storage and compute efficiently in the platform.
E) Hides the internal specifics of the database without sacrificing performance or cost.
F) Uses a single control plane for accessing all data you store.
G) Enables sharing of read-only data (as tables or views) with external organizations.
There are many database products available today. Only Snowflake provides all of the above. And Snowflake is great at each one. Taking each characteristic separately:
A) Runs in the cloud. Hard to believe, but up until 2016, the big database providers like Oracle and MongoDB (I’m just picking on two companies here) only had database products that customers had to buy and provision on servers located within the company itself.
B) Is DBaaS (DB as a Service). Running in the cloud removes the hardware considerations from customers, but those cloud instances still have to be administered, provisioned, and managed. In 2016 both Oracle and MongoDB released DBaaS products: Oracle Cloud and MongoDB Atlas. Both were version 1 products with numerous limitations, but they have gotten way better over the years. Note, though, that it’s only very recently that Atlas came to make up more than half of MongoDB’s total business. Microsoft was actually ahead of the game with its Azure offering in 2010.
C) Supports both structured (aka relational) and semi-structured (aka non-relational) data efficiently. Let’s back up and talk about data organization. Originally, databases were set up as tables like you might find in MS Excel, with defined column names and cell content datatypes, and then as many rows as you have data entries. This format enables all sorts of data processing. But things aren’t always so clean in the real world. Sometimes you don’t have values for some of the cells. Sometimes you have additional data which needs new columns. Heck, sometimes the values aren’t even of the same datatype. For example, in an IoT world you might have dozens of different devices sending data to your database. Those different devices might have different software versions; they might even have different hardware. But in a structured world you can’t take the data unless you can massage it into the structure of your database or modify the database to handle the superset of all possibilities.
This is where semi-structured data comes in. Rather than storing data in tables, data is simply stored as Key/Value pairs. Instead of having a column called “Name,” for instance, you insert something like JSON (JavaScript Object Notation) directly into the database, like this:
"Name" : "Saul"
You can literally make up Keys and datatypes for each data record you add. The downside to this flexibility is that there is no longer any guarantee that any data record has any particular set of keys, but you can program around that as long as you know the complete set of all keys in existence and what they mean. But it does mean that you can store any data from any source in your database as it comes in, and worry about processing/analyzing it later.
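To make that concrete, here are two made-up IoT records (device names and fields invented for illustration) that could sit side by side in the same NoSQL store even though their keys don’t match:

{ "device" : "thermostat-01", "fw_version" : "2.1", "temp_c" : 21.5 }
{ "device" : "door-sensor-07", "fw_version" : "1.4", "open" : true, "battery_pct" : 88 }

A structured table would force you to define columns for temp_c, open, and battery_pct up front; here each record just carries whatever keys it happens to have.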
BTW, structured data is often called “SQL,” which stands for “Structured Query Language” and is really just a set of APIs that structured databases support. For the most part, if you know SQL you can do most of what any structured database can do. Semi-structured databases, meanwhile, are called “NoSQL” because, well, they don’t support SQL constructs.
One obvious question that comes up is why not use structured data (tables) and simply add a new column every time you have a new Key in a data record? The answer is that this blows up the size of the database, since it has to reserve space for that column in every row of the database, as well as provide default values for every pre-existing row (and there can be thousands or millions of rows). And then it’s not just the column names, but the datatypes. In SQL, the datatype of every entry in a column must be the same (think of formulas in Excel). So you can, in theory, simply create new columns for every new type of data to be stored, but you lose efficiency as the database grows and incur the cost of the processing that adding every new column entails on the existing data. On the other side, you can easily insert structured data into a NoSQL database, but then you won’t have the power of SQL for processing.
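To picture that overhead, here’s a quick sketch in plain SQL (the table and column names are made up). Every time a new field shows up, you’d have to run something like:

-- add a column for a field only some devices report;
-- every existing row now carries this column as NULL or a default value
ALTER TABLE sensor_readings ADD COLUMN battery_pct FLOAT;

Do that for dozens of device variants and the table gets wide, sparse, and expensive to maintain.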
Another way to think about this is that both SQL and NoSQL require care and processing. With SQL, you have to first prep the data to match the table’s defined layout (called a “schema”) before it goes in. With NoSQL, you can just add it to the database, but when you want to do something with that data you then have to process it, which can be a multi-hour job for big NoSQL databases.
Snowflake tastes great and is less filling. It supports most SQL constructs, yet handles semi-structured data quickly (without hours of pre-processing). It even lets you use SQL constructs directly on NoSQL data! I don’t know all the magic sauce involved, but it’s a combination of being able to efficiently handle sparse tables (tables that are missing many cells’ contents) combined with captured metadata that is used for NoSQL-type processing. Many structured databases were architected before NoSQL was popular and so are very inefficient with NoSQL data, and most NoSQL databases were created only for semi-structured workflows; Snowflake is a newer architecture that handles both well.
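As a rough illustration of what that looks like in practice, here’s a short sketch using Snowflake’s VARIANT column type, which holds semi-structured data and can be queried with ordinary SQL path notation (the table and field names are invented for this example):

-- one VARIANT column holds the raw JSON, whatever keys each record has
CREATE TABLE iot_events (payload VARIANT);

-- plain SQL over the semi-structured data: drill into fields with ':' and cast with '::'
SELECT payload:device::string AS device,
       AVG(payload:temp_c::float) AS avg_temp
FROM iot_events
WHERE payload:fw_version::string = '2.1'
GROUP BY 1;

No hours-long transform step first; records with extra or missing keys simply come back with NULL for the fields they lack.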
Snowflake describes at a high level what its architecture is: Snowflake’s architecture is a hybrid of traditional shared-disk and shared-nothing database architectures. Similar to shared-disk architectures, Snowflake uses a central data repository for persisted data that is accessible from all compute nodes in the platform. But similar to shared-nothing architectures, Snowflake processes queries using MPP (massively parallel processing) compute clusters where each node in the cluster stores a portion of the entire data set locally. This approach offers the data management simplicity of a shared-disk architecture, but with the performance and scale-out benefits of a shared-nothing architecture. (https://docs.snowflake.com/en/user-guide/intro-key-concepts… )
D) Handles Storage and Compute efficiently. Now this doesn’t sound like a big deal to many people at first. After all, you can have your DBaaS in the cloud and simply create AWS EC2 or Azure compute instances to run your processing. And sure, that works. The issue there, however, is that this separate compute process must access the data, which is stored in another AWS or Azure instance and so must be extracted and transferred to your compute process. This takes processing, memory, and time, which cost money at the cloud providers and slow things down. With Snowflake, the compute job you create has direct access to the database, so there’s no data extraction/loading/copying needed.
But, with Snowflake you only pay for what you use. If you have data that you don’t access for a week or a month or whatever, Snowflake charges you about what Amazon’s AWS S3 charges for data storage - and Snowflake has compressed that data tightly for you (remember the SQL tables) to reduce those charges. Only when you fire up a compute process do you pay for that. You can scale storage and compute up and down immediately and independently as you desire.
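To give a flavor of how that pay-for-what-you-use model is expressed, here’s a hedged sketch in Snowflake SQL (the warehouse name is made up; auto-suspend/auto-resume are the knobs that stop the compute meter when you’re idle):

-- a small compute cluster that suspends itself after 60 idle seconds
CREATE WAREHOUSE analytics_wh
  WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

-- scale compute up for a heavy job, then back down, without touching storage
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'XSMALL';

Storage just sits there at roughly S3-like rates; the warehouse only bills while it’s actually running.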
E) Hides the internal specifics of the database without sacrificing performance or cost. I’ve covered this above in how Snowflake internally keeps structured data compressed but provides both SQL and NoSQL access methods. Hiding the internals from users of the database has many benefits. One such benefit was Snowflake’s announcement in June 2021 (https://www.snowflake.com/news/snowflake-announces-new-featu… ) that it found a way to compress data even more than before. As Snowflake said: This enhanced storage technology is available transparently, with no user action, no configuration changes, and no application or query changes. This kind of improvement can only be done if the internal details are not exposed to end users, for if the format of the internal data storage is exposed, it can’t be changed without forcing users to change. Better than a Tesla car getting an OTA update, Snowflake got better overnight without any user having to do anything differently (not even clicking on “update”). What other database provider has done anything like this?
F) Uses a single control plane for accessing all data you store. What this means is that when you put data into Snowflake, there’s no partitioning or splitting up of your data among different nodes or clusters. It’s not like many DBaaS providers that simply spin up an instance of their product and manage it for you, so that if you ask them they’ll spin up 5 or 6 or 100 separate instances (maybe in different regions of the world) with no easy/fast connection between them. With Snowflake, you don’t worry about instances; you just set up databases, you access them all the same way, and your compute can access all of them if you want.
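A small sketch of what that buys you, assuming two made-up databases living in the same Snowflake account: a single query can join across them with fully qualified names, no cross-instance plumbing required:

SELECT o.order_id, c.region
FROM sales_db.public.orders o
JOIN marketing_db.public.customers c ON o.customer_id = c.customer_id;

Try doing that across 5 or 100 separately managed instances and you’re building an ETL pipeline instead of writing a simple join.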
G) Enables sharing of read-only data (as tables or views) with external organizations. A lot has been discussed around Data Sharing, in Snowflake’s conference calls, in the press, and here on this board. My view is that this is a major differentiator and will remain a big moat for Snowflake. With Snowflake, live data set views can be shared directly, under the control of the owner. Others may try to copy the feature by supporting read-only copy/clone access methods, but that’s not only slow and costly, it also won’t be “live,” up-to-date data. Snowflake has built a system around controlling and monitoring access that I suspect won’t be easily reproduced by other DBaaS providers.
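For a sense of how this is exposed, here’s a hedged sketch using Snowflake’s share commands (the share, database, table, and account names are all invented). The consumer queries the live table directly; no copy is made and no data moves:

-- producer side: create a share and grant read access to specific objects
CREATE SHARE partner_share;
GRANT USAGE ON DATABASE sales_db TO SHARE partner_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE partner_share;
GRANT SELECT ON TABLE sales_db.public.orders TO SHARE partner_share;

-- make the share visible to the partner's Snowflake account
ALTER SHARE partner_share ADD ACCOUNTS = partner_org.partner_account;

The owner can narrow or revoke that access at any time, which is the controlling-and-monitoring piece mentioned above.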
Now, to be fair, Snowflake isn’t going to replace all other databases. In particular, Snowflake isn’t good at Transactional workflows, where you might be updating individual rows in a table independently. Think of banking where an ATM needs to update the row for your account’s balance as you take money out, or inventory control in a warehouse as items come and go. Those are called OLTP workflows (OnLine Transactional Processing). The reason for this has to do with how Snowflake compresses and stores the data, which is optimized for data access, but is costly for writing/updating.
Instead, Snowflake is great at OLAP workflows (OnLine Analytical Processing). Think analytics, data mining, and business intelligence workflows where you are mostly reading data and processing it to provide insights. This is why Snowflake is often thought of as an EDW (Enterprise Data Warehouse) or Data Lake, where massive amounts of data are stored for later analytical processing. As IoT devices proliferate and gather more and more data that companies wish to analyze to improve their business, these kinds of workflows are becoming more prevalent.