Can anyone explain to me what they do, and who their competitors are and how Snowflake is differentiated from them?
Snowflake is Data Warehousing As A Service for enterprises. “Data Warehousing” is a fancy name for storing large amounts of data in a form that’s organized for querying and analysis. The “As A Service” indicates that it’s cloud-based and hosted, with no installs, no need for on-premise servers, and no need for maintenance. Services like this are typically charged via a subscription model; in Snowflake’s case, the charge is based on usage.
Snowflake is self-managing, automatically providing redundancy, security, file structure, compression, metadata, and backups. It integrates with a lot of different data services, from Informatica at the front-end (for ETL [Extract, Transform, Load] operations) to Tableau at the back-end for visualization, for instance. It also integrates with Salesforce, Talend, etc. You use standard SQL for data access.
Probably the biggest claim to fame for Snowflake is that it separates storage from compute. You set up whatever storage you need for your data, and then when you need to process it, you create these “Virtual Warehouses” to run compute (which can use MPP, Massively Parallel Processing). You can have multiple Virtual Warehouses on a single data set, and they operate independently, so one workload doesn’t slow down another.
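To make the usage-based billing and the storage/compute split a bit more concrete, here’s a rough back-of-the-envelope sketch in Python. Every number in it (the rates, the node counts, the hours) is made up for illustration and is not real Snowflake or Redshift pricing, and the function names are just mine. It compares a cluster sized for the biggest job and left running all month against storage billed by volume plus compute billed only while each workload runs.

```python
# Toy cost model. All rates and workloads are invented for illustration;
# they are NOT actual Snowflake or Redshift prices.

HOURS_PER_MONTH = 730


def coupled_cost(peak_nodes, node_rate_per_hour):
    """Cluster sized for the largest workload and left running all month.

    Assumes storage is bundled into the per-node price.
    """
    return peak_nodes * node_rate_per_hour * HOURS_PER_MONTH


def separated_cost(storage_tb, storage_rate_per_tb, workloads, node_rate_per_hour):
    """Storage billed by volume; each workload billed only while it runs.

    workloads: list of (nodes, hours_run_per_month) tuples.
    """
    storage = storage_tb * storage_rate_per_tb
    compute = sum(nodes * hours * node_rate_per_hour for nodes, hours in workloads)
    return storage + compute


# Hypothetical month: light dashboard queries plus one heavy nightly batch.
workloads = [
    (2, 200),   # small compute, 200 hours in the month
    (16, 20),   # large compute, 20 hours in the month
]

coupled = coupled_cost(peak_nodes=16, node_rate_per_hour=1.0)
separated = separated_cost(10, 25.0, workloads, 1.0)
print(f"sized for peak, always on: {coupled}")
print(f"storage + compute on demand: {separated}")
```

With these invented numbers the always-on cluster comes to 11,680 units against 970 for the separated setup. The gap depends entirely on how bursty the workloads are, so treat it as a way to reason about the model and not as a price comparison.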
If you’re familiar with Amazon’s Redshift, Snowflake is somewhat similar, with some differences:
- Snowflake separates compute from storage. One advantage of this is that you only pay for what you use. You don’t have to size the environment for the largest workflow; you establish what you need for the data, and then spin up what you need for the compute (analysis, visualization, etc.). The other advantage is performance, since each compute environment is separate.
- It’s cloud-provider agnostic. While the first versions of Snowflake were built on top of AWS, there are now versions for Microsoft’s Azure and Google Cloud Platform. AND, you can move your data between providers with almost no reconfiguration in Snowflake. This alone is a reason some enterprises will choose Snowflake.
- Snowflake scales better. There are two aspects to this. The first is that since compute and storage are separate, the compute services can instantly scale up. The second is that Snowflake has a hybrid architecture combining aspects of a central repository (“shared-data”) with local storage (“shared-nothing”) (kind of an edge storage/compute thing).
- Snowflake handles JSON data much better. It’s kind of hard to explain this in non-technical terms, but JSON (JavaScript Object Notation) is a format for semi-structured data that a lot of engineers like to use. This data doesn’t have to conform to a fixed schema, so it’s very flexible.
- Snowflake has more automation - it’s simpler to use. For instance, compression is automatic on Snowflake, but has to be configured and controlled on Redshift.
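
To illustrate the JSON point in the list above, here’s a small Python sketch of what “doesn’t have to conform to a schema” looks like in practice. The sample events and the `get_path` helper are hypothetical, and this is plain Python, not how Snowflake stores or queries JSON internally. It just shows records with different fields living side by side, and a lookup that tolerates missing keys.

```python
import json

# Three hypothetical IoT events. Note that no two share the same set of fields.
raw_events = [
    '{"device": "thermostat-1", "temp_c": 21.5}',
    '{"device": "door-7", "open": true, "battery": {"pct": 83}}',
    '{"device": "thermostat-1", "temp_c": 22.0, "firmware": "2.1"}',
]

events = [json.loads(line) for line in raw_events]


def get_path(record, path, default=None):
    """Follow a dotted path such as 'battery.pct'.

    Returns default if any key along the way is missing.
    """
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# The same lookup works on every record, whether or not the field exists.
temps = [get_path(e, "temp_c") for e in events]
battery = [get_path(e, "battery.pct") for e in events]

# The "schema" is only discoverable after the fact, by looking at the data.
fields = sorted({key for e in events for key in e})
print(temps, battery, fields)
```

In a rigid table you’d have to add `open`, `battery` and `firmware` columns up front (or alter the table later). With semi-structured data you load the records as they arrive and deal with missing fields at query time.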
Google’s BigQuery is another competitor. I’m not as familiar with it, but it’s certainly not cloud-provider agnostic.
As for why enterprises turn to Data Warehousing, that comes from all the data that many enterprises capture and want to leverage. IoT data pours in minute by minute, and customer data is valuable and provides business insights. Do enterprises want to store that data on premises and try to keep up as the data set grows exponentially? Many will want to put it in the cloud and not have to worry about configuration and management. But then how do your applications get access to the data, and how do they perform? Many enterprises are worried about cloud provider lock-in: once you’re up on AWS or Azure or GCP it can be really hard to migrate to a different provider, and so you’re stuck with whatever pricing they give you. Most enterprises will simply leave the data where it is and start new work on a different cloud provider, which becomes a headache for their IT departments.
This isn’t a sexy business, it’s infrastructure.