Please forgive the slight departure from the usual investment analysis discussion. This is somewhat of a deep dive discussion on three types of data platforms that enterprises are focused on for their particular business needs.
The material is germane to understanding where our investments fit in with the grander plans of the business enterprise.
A bit of background on the source for this information. Andreessen Horowitz is a venture capital firm that backs bold entrepreneurs who are building the future through technology. They recently asked practitioners from leading data organizations: (1) what their internal technology stacks looked like, and (2) what they would do if they were to build a new one from scratch. The result of these discussions is the reference architecture diagram found in the link below. This architecture covers a full multi-modal model (more on that later).
There is a lot going on in this diagram. Far more than you’d find in most production systems. It is a unified picture that covers almost all use cases - from analytics to machine learning operational models.
• The sources generate relevant business and operational data.
• Ingestion and Transformation allows us to do ELT:
  ◦ Extract data from operational systems
  ◦ Load the data into a staging area that aligns the schemas between source and destination
  ◦ Transform the data into a structure ready for analysis
• Storage is where we keep data in a format that is accessible to query and processing systems
  ◦ Here we try to optimize the trade-off between cost, scalability and the requirements of analytics AND data science workloads
• In the Historical and Predictive columns we are either doing descriptive statistics or inferential statistics, and providing an interface for analysts and data scientists to do their work. In descriptive statistics we are describing what happened. In inferential statistics we are predicting what will happen. It’s also where we build data-driven ML applications.
• Finally, in Output we present results of data analysis to stakeholders using tools like Tableau or PowerBI, or we embed machine learning models into operational systems, applications and custom-built data products.
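The ELT steps above can be sketched in a few lines of Python. This is a minimal illustration, using two in-memory SQLite databases as stand-ins for the operational source system and the warehouse; the table and column names (orders, stg_orders, fct_revenue_by_region) are invented for the example:

```python
import sqlite3

# Stand-ins for an operational source system and a warehouse,
# using two in-memory SQLite databases for illustration.
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

# Operational data: raw order records.
source.execute("CREATE TABLE orders (id INTEGER, amount REAL, region TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, 120.0, "EU"), (2, 80.0, "US"), (3, 200.0, "EU")])

# Extract: pull rows out of the operational system.
rows = source.execute("SELECT id, amount, region FROM orders").fetchall()

# Load: land the raw rows in a staging table whose schema mirrors the source.
warehouse.execute("CREATE TABLE stg_orders (id INTEGER, amount REAL, region TEXT)")
warehouse.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", rows)

# Transform: reshape staged data into an analysis-ready summary table.
warehouse.execute("""
    CREATE TABLE fct_revenue_by_region AS
    SELECT region, SUM(amount) AS revenue
    FROM stg_orders GROUP BY region
""")

print(warehouse.execute(
    "SELECT region, revenue FROM fct_revenue_by_region ORDER BY region"
).fetchall())  # -> [('EU', 320.0), ('US', 80.0)]
```

Real pipelines would of course replace SQLite with the operational database and the warehouse engine, but the Extract/Load/Transform shape stays the same.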
From this Unified Architecture emerge three common branches.
One is for business intelligence, which focuses on cloud-native data warehouses and analytics use cases (https://i0.wp.com/a16z.com/wp-content/uploads/2020/10/Data-R…). Note the sections highlighted in orange. There is a data warehouse here. This is where Snowflake, BigQuery and Redshift fit in. Note also in this model the world looks fairly simple.
Then there is a second for multimodal data processing (https://i2.wp.com/a16z.com/wp-content/uploads/2020/10/Data-R…). This blueprint covers BOTH analytic and operational use cases built around a data lake. The data lake is where Databricks/Delta Lake and other lesser-known players like Iceberg and Hive fit in. Note the file-based format that the lake is built on top of (e.g. the Parquet format) and also note where the lake is typically stored (e.g. S3).
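The file-based layout mentioned above is worth seeing concretely. Below is a minimal sketch of the date-partitioned path convention a lake typically uses, with a local temp directory standing in for S3 and JSON lines standing in for Parquet (so the example needs no extra libraries); the bucket/event names are invented:

```python
import json
import tempfile
from pathlib import Path

# Stand-in for an object store like S3: a local temp directory.
# Real lakes would use Parquet files; JSON lines keeps this example
# dependency-free while showing the same partitioned path layout.
lake_root = Path(tempfile.mkdtemp())

events = [
    {"event_date": "2020-10-15", "user": "a", "action": "click"},
    {"event_date": "2020-10-15", "user": "b", "action": "view"},
    {"event_date": "2020-10-16", "user": "a", "action": "view"},
]

# Write raw events into date-partitioned paths, the common lake convention
# (compare s3://bucket/events/event_date=2020-10-15/part-0000.json).
for event in events:
    part_dir = lake_root / "events" / f"event_date={event['event_date']}"
    part_dir.mkdir(parents=True, exist_ok=True)
    with open(part_dir / "part-0000.json", "a") as f:
        f.write(json.dumps(event) + "\n")

# A query engine can now prune partitions by path instead of scanning everything.
oct_15 = list((lake_root / "events" / "event_date=2020-10-15").glob("*.json"))
print(len(oct_15))  # one partition file for that date
```

The point of the layout is that a query filtered on event_date only has to open the matching directories, which is how lake engines stay fast over raw files.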
Finally, there is a third blueprint which focuses more squarely on the operational use cases for building data-driven machine learning applications and products (https://i1.wp.com/a16z.com/wp-content/uploads/2020/10/Data-R…). This is the new and emerging tech stack that supports robust development, testing, and operationalization of machine learning models used by major tech companies that are building data-enabled applications. Note that the data sources can be a streaming engine, a data lake or a data warehouse. But the use of a warehouse for the ML use case (like in the multimodal blueprint above) is not an efficient path for ML. The data needs to be moved into a data lake first.
I’ll fill in with a section of content from the study conducted by a16z.
Two parallel ecosystems have grown up around these broad use cases. The data warehouse forms the foundation of the analytics ecosystem. Most data warehouses store data in a structured format and are designed to quickly and easily generate insights from core business metrics, usually with SQL (although Python is growing in popularity). The data lake is the backbone of the operational ecosystem. By storing data in raw form, it delivers the flexibility, scale, and performance required for bespoke applications and more advanced data processing needs. Data lakes operate on a wide range of languages including Java/Scala, Python, R, and SQL.
Each of these technologies has religious adherents, and building around one or the other turns out to have a significant impact on the rest of the stack (more on this later). But what’s really interesting is that modern data warehouses and data lakes are starting to resemble one another – both offering commodity storage, native horizontal scaling, semi-structured data types, ACID transactions, interactive SQL queries, and so on.
The key question going forward: are data warehouses and data lakes on a path toward convergence? That is, are they becoming interchangeable in the stack? Some experts believe this is taking place and driving simplification of the technology and vendor landscape. Others believe parallel ecosystems will persist due to differences in languages, use cases, or other factors.
Read the full piece here: https://a16z.com/2020/10/15/the-emerging-architectures-for-m…
Building an architecture along the lines of blueprint #2 or #3 isn’t easy. Creating a best-of-breed ML pipeline solution in-house and at scale is one of the most challenging data problems today. The convergence mentioned above is what people in the industry are referring to as the “Data Lakehouse”. The intention: to cover both analytics and data science workloads; to do blueprint #2 with just one underlying data technology.
But don’t take what a16z says about a Data Lakehouse as gospel. Read about it from Snowflake (https://www.snowflake.com/guides/what-data-lakehouse) and then read about it from Databricks (https://databricks.com/blog/2020/01/30/what-is-a-data-lakeho…). Note the title of the articles from both vendors: “What is a lakehouse?”. Research their approach to addressing the Data Lakehouse need. Then make your decision and monitor your investments accordingly.
I hope you find this informative and useful.