Snowflake published a January new-release blog post that highlights some new datasets. It's really quite remarkable to see what other companies are doing and providing on the platform.
That’s a pretty nice blog post, @FinallyFoolin, thanks.
Just to help some of the less techy investors, here are some thoughts of mine on it:
• External Tables for On-Premises Storage
The Snowflake model has been: load your data into Snowflake storage and then run your data jobs inside Snowflake. You only pay for the storage and compute you actually use, and the data storage cost is on par with plain old Amazon S3. Snowflake internally optimizes the stored data so jobs run efficiently, and it can hold both structured and unstructured data.
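For anyone who wants to see what that model looks like in practice, here's a minimal sketch of the load-then-query flow (the table, stage, and file names are all made up for illustration):

```sql
-- Hypothetical example: land a CSV file in Snowflake-managed storage, then query it.
CREATE TABLE orders (order_id INTEGER, customer STRING, amount NUMBER(10,2));

-- An internal stage is Snowflake-managed storage you can upload files to.
CREATE STAGE orders_stage FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Load the staged file into the table; compute is billed only while this runs.
COPY INTO orders FROM @orders_stage/orders.csv;

-- From here on, jobs run against data stored and optimized inside Snowflake.
SELECT customer, SUM(amount) AS total_spend FROM orders GROUP BY customer;
```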
What this new feature does is enable Snowflake compute to operate on data not stored in Snowflake. I can think of a couple cases where this might be interesting to customers:
- They might have data in a private cloud that for various governance (privacy) or historical reasons they don’t want to put in a public cloud.
- They might have data in an on-premise hardware solution that they want to run some Snowflake jobs on. Think solutions like Cloudian or PureStorage for instance.
At any rate, as long as that storage supports the complete Amazon S3 REST API and is reachable from your Snowflake deployment on AWS, Azure, or Google Cloud, you can point Snowflake at it and run Snowflake jobs. Probably at an efficiency cost, but perhaps better/cheaper than importing all the data into Snowflake for one-offs (or few-offs).
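Mechanically, the setup looks something like this. It's a rough sketch assuming an S3-compatible device on your own network; the bucket name, endpoint, and credentials are made up, and the exact syntax may vary:

```sql
-- Hypothetical external stage pointing at S3-compatible storage outside Snowflake.
CREATE STAGE onprem_stage
  URL = 's3compat://sensor-archive/2023/'
  ENDPOINT = 'storage.mycompany.internal'
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>');

-- An external table reads those files in place: the data never moves into
-- Snowflake storage, but Snowflake compute can query it.
CREATE EXTERNAL TABLE sensor_readings
  LOCATION = @onprem_stage
  FILE_FORMAT = (TYPE = PARQUET)
  AUTO_REFRESH = FALSE;

SELECT COUNT(*) FROM sensor_readings;
```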
I don’t see this as a big customer attraction feature, but it does show that Snowflake isn’t worried about locking customers in to using only their storage for everything. It might even enable trials without the need to import existing data.
• Column Lineage in Data Governance
Snowflake has a pretty complete Access History feature (if you have Enterprise Edition or higher), which "facilitates regulatory compliance auditing and provides insights on popular and frequently accessed tables and columns". This has been available on a per-row basis (see who accessed or modified a row of data), but now it's also available on a per-column basis. Data privacy/governance is a complicated topic, but having per-column controls makes setup easier and monitoring more thorough, helping with things like GDPR compliance as well as tracing where data came from.
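For the curious, the lineage shows up in the ACCESS_HISTORY view in the shared SNOWFLAKE database. Here's a rough sketch of pulling column-level write lineage for one table (the view and its OBJECTS_MODIFIED column are real; the filtered table name is made up):

```sql
-- Sketch: which users' queries wrote to which columns of a given table.
-- ACCESS_HISTORY lives in the shared SNOWFLAKE database (Enterprise Edition+);
-- OBJECTS_MODIFIED is a JSON array carrying the per-column lineage.
SELECT
  ah.query_start_time,
  ah.user_name,
  obj.value:"objectName"::STRING AS target_table,
  col.value:"columnName"::STRING AS target_column
FROM snowflake.account_usage.access_history AS ah,
     LATERAL FLATTEN(input => ah.objects_modified) AS obj,
     LATERAL FLATTEN(input => obj.value:"columns") AS col
WHERE obj.value:"objectName"::STRING = 'MYDB.PUBLIC.CUSTOMERS'  -- made-up table
ORDER BY ah.query_start_time DESC;
```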
• Aggregate Usage Analytics in the Snowflake Data Cloud
The Snowflake Data Cloud (aka Snowflake Marketplace) is Snowflake’s unique mechanism for a company to sell/share its data with other companies. What this feature does is give data providers insights into how their customers are using the data - which customers are using it the most, and which listings are the most popular, etc.
This blog also includes a number of companies that have data available in the Snowflake Marketplace. You can browse the Marketplace listings yourself at this URL:
https://app.snowflake.com/marketplace?sortBy=popular
Everything from Covid-19 data to Accu-Weather and more. Interestingly, some companies have limited subsets of data available for free as well as more complete/complex data sets available for money.
Smorg, thank you.
One more technical note on the External Tables. There is a further big use for them: they can be used to INGEST data into Snowflake by putting a Materialized View on top of them, which can bypass the painful traditional ETL methods of ingestion. (A quick sketch follows below.)
Also note, this feature is pretty common among database platforms.
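Here's a minimal sketch of that materialized-view-over-external-table pattern, reusing the hypothetical sensor_readings external table from earlier (all names made up):

```sql
-- Sketch: "ingest" external data by materializing it inside Snowflake.
-- Assumes the sensor_readings external table defined in the earlier example,
-- whose VALUE column holds each row as a VARIANT.
CREATE MATERIALIZED VIEW sensor_daily AS
  SELECT
    value:"device_id"::STRING          AS device_id,
    value:"reading"::FLOAT             AS reading,
    TO_DATE(value:"ts"::TIMESTAMP_NTZ) AS reading_date
  FROM sensor_readings;

-- The materialized view's results are stored (and kept refreshed) inside
-- Snowflake, so downstream queries run at normal Snowflake speed.
SELECT reading_date, AVG(reading) AS avg_reading
FROM sensor_daily
GROUP BY reading_date;
```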
Interesting - how does that compare to using SnowPipe for ingestion? Since SnowPipe can ingest data in “micro-batches,” as they call it, one can keep Snowflake data up to date easily and without manual operations for each copy.
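To make the comparison concrete, the SnowPipe pattern I have in mind looks roughly like this (names made up; the cloud event notification wiring that triggers auto-ingest is set up outside this snippet):

```sql
-- Sketch: a pipe that continuously COPYs new files from a stage into a table.
CREATE PIPE sensor_pipe AUTO_INGEST = TRUE AS
  COPY INTO sensor_events
  FROM @sensor_landing_stage
  FILE_FORMAT = (TYPE = JSON);
```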
My understanding of Materialized Views is that they’re a “pre-computed” view from a SELECT SQL operation, and so the purpose is query performance optimization. What are the mechanics around using Materialized Views for ingestion? Can you establish a Materialized View on an External Table and have the result reside in Snowflake? Or something else?
I know this is getting a bit far afield from investing, but while Snowflake is this wonderful garden, getting data into it as part of your solution workflow is not always straightforward, particularly with random small additions/changes like you'd have in an IoT world. In the past I've seen workflows that typically use another database, like MDB, for capturing real-time data, which is then SnowPiped into Snowflake. The easier it is to get real-time data into Snowflake, the more attractive it becomes, especially if one can reduce the use of other products.