Warehouses and Data Lakes and Lakehouses, Oh My!

Please forgive the slight departure from the usual investment analysis discussion. This is something of a deep dive into three types of data platforms that enterprises adopt for their particular business needs.

The material is germane to understanding where our investments fit into the grander plans of the business enterprise.

A bit of backdrop on the source for this information. Andreessen Horowitz is a venture capital firm that backs bold entrepreneurs who are building the future through technology. They recently asked practitioners from leading data organizations: (1) what their internal technology stacks looked like, and (2) what they would do if they were to build a new one from scratch. The result of these discussions is the reference architecture diagram found in the link below. This architecture covers a full multi-modal model (more on that later).

https://i1.wp.com/a16z.com/wp-content/uploads/2020/10/Data-R…

There is a lot going on in this diagram. Far more than you’d find in most production systems. It is a unified picture that covers almost all use cases, from analytics to machine learning operational models.

• The sources generate relevant business and operational data.
• Ingestion and Transformation allows us to do ELT (a minimal sketch of these steps follows the list below):
  ◦ Extract data from operational systems
  ◦ Land the data in a staging area that aligns the schemas between source and destination
  ◦ Transform the data into a structure ready for analysis
• Storage is where we keep data in a format that is accessible to query and processing systems
  ◦ Here we try to balance the pull between cost, scalability, and the requirements of both analytics AND data science workloads
• In the Historical and Predictive columns we are doing either descriptive statistics or inferential statistics, and providing an interface for analysts and data scientists to do their work. In descriptive statistics we are describing what happened. In inferential statistics we are predicting what will happen. It’s also where we build data-driven ML applications.
• Finally, in Output we present the results of data analysis to stakeholders using tools like Tableau or Power BI, or we embed machine learning models into operational systems, applications, and custom-built data products.
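
To make the ELT bullet above concrete, here is a minimal, hypothetical sketch of the three steps in Python. The connection string, table, and file names are invented for illustration only.

    # Hypothetical ELT sketch: extract from an operational database,
    # land the raw rows in a staging file, then transform for analysis.
    import pandas as pd
    from sqlalchemy import create_engine

    # Extract: pull raw rows from an operational system (connection string is made up).
    source = create_engine("postgresql://user:pass@ops-db:5432/orders")
    raw = pd.read_sql("SELECT order_id, customer_id, amount, created_at FROM orders", source)

    # Land: write the untouched extract to a staging area so source and destination schemas stay aligned.
    raw.to_parquet("orders_raw_staged.parquet", index=False)

    # Transform: reshape the staged data into a structure ready for analysis.
    staged = pd.read_parquet("orders_raw_staged.parquet")
    daily_revenue = (
        staged.assign(order_date=staged["created_at"].dt.date)
              .groupby("order_date", as_index=False)["amount"].sum()
    )
    daily_revenue.to_parquet("daily_revenue.parquet", index=False)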

From this Unified Architecture emerge three common branches.

The first is for business intelligence, which focuses on cloud-native data warehouses and analytics use cases (https://i0.wp.com/a16z.com/wp-content/uploads/2020/10/Data-R…). Note the sections highlighted in orange. There is a data warehouse here. This is where Snowflake, BigQuery and Redshift fit in. Note also that in this model the world looks fairly simple.

The second is for multimodal data processing (https://i2.wp.com/a16z.com/wp-content/uploads/2020/10/Data-R…). This blueprint covers BOTH analytic and operational use cases built around a data lake. The data lake is where Databricks/Delta Lake and lesser-known players like Iceberg and Hive fit in. Note the file-based format that the lake is built on top of (e.g., Parquet) and also note where the lake is typically stored (e.g., S3).
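
As a quick illustration of what a file-based, columnar format sitting in object storage means in practice, here is a hedged pandas/PyArrow sketch; the bucket name and columns are invented.

    # Hypothetical sketch of lake-style storage: columnar Parquet files in S3.
    # Requires pandas, pyarrow, and s3fs; the bucket and schema are made up.
    import pandas as pd

    events = pd.DataFrame({
        "user_id": [101, 102, 103],
        "event": ["click", "view", "click"],
        "ts": pd.to_datetime(["2021-06-01", "2021-06-01", "2021-06-02"]),
    })
    events["day"] = events["ts"].dt.date.astype(str)

    # Write the raw events to the lake as Parquet, partitioned by day.
    events.to_parquet("s3://example-data-lake/events/", partition_cols=["day"])

    # Any engine that understands Parquet (Spark, Presto, pandas, ...) can read it back.
    events_back = pd.read_parquet("s3://example-data-lake/events/")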

Finally, there is a third blueprint which focuses more squarely on the operational use cases for building data-driven machine learning applications and products (https://i1.wp.com/a16z.com/wp-content/uploads/2020/10/Data-R…). This is the new and emerging tech stack that supports robust development, testing, and operationalization of machine learning models used by major tech companies that are building data-enabled applications. Note that the data sources can be a streaming engine, a data lake or a data warehouse. But the use of a warehouse for the ML use case (like in the multimodal blueprint above) is not an efficient path for ML. The data needs to be moved into a data lake first.

I’ll fill in with a section of content from the study conducted by a16z.

Two parallel ecosystems have grown up around these broad use cases. The data warehouse forms the foundation of the analytics ecosystem. Most data warehouses store data in a structured format and are designed to quickly and easily generate insights from core business metrics, usually with SQL (although Python is growing in popularity). The data lake is the backbone of the operational ecosystem. By storing data in raw form, it delivers the flexibility, scale, and performance required for bespoke applications and more advanced data processing needs. Data lakes operate on a wide range of languages including Java/Scala, Python, R, and SQL.
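
To ground the “two parallel ecosystems” idea, here is a rough, hypothetical sketch of what day-to-day access looks like on each side; the connection parameters, table, and file paths are invented, not taken from the article.

    # Warehouse side: structured tables queried with SQL for core business metrics.
    # Connection parameters below are placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(account="example_account", user="analyst", password="...")
    cur = conn.cursor()
    cur.execute("SELECT region, SUM(revenue) AS revenue FROM sales GROUP BY region")
    revenue_by_region = cur.fetchall()

    # Lake side: raw, as-landed files in object storage, processed with general-purpose code.
    import pandas as pd

    raw_clicks = pd.read_parquet("s3://example-lake/raw/clickstream/")  # needs pyarrow + s3fs
    events_per_session = raw_clicks.groupby("session_id")["event"].count()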

Each of these technologies has religious adherents, and building around one or the other turns out to have a significant impact on the rest of the stack (more on this later). But what’s really interesting is that modern data warehouses and data lakes are starting to resemble one another – both offering commodity storage, native horizontal scaling, semi-structured data types, ACID transactions, interactive SQL queries, and so on.

The key question going forward: are data warehouses and data lakes on a path toward convergence? That is, are they becoming interchangeable in the stack? Some experts believe this is taking place and driving simplification of the technology and vendor landscape. Others believe parallel ecosystems will persist due to differences in languages, use cases, or other factors.

Read the full piece here: https://a16z.com/2020/10/15/the-emerging-architectures-for-m…

Building an architecture in the model of blueprint #2 or #3 isn’t easy. Creating a best-of-breed ML pipeline solution in-house and at scale is one of the most challenging data problems today. The convergence mentioned above is what people in the industry are referring to as the “Data Lakehouse”. The intention: to cover both analytics and data science workloads, i.e., to do blueprint #2 with just one underlying data technology.

But don’t take what a16z says about the Data Lakehouse as gospel. Read about it from Snowflake (https://www.snowflake.com/guides/what-data-lakehouse) and then read about it from Databricks (https://databricks.com/blog/2020/01/30/what-is-a-data-lakeho…). Note that both vendors title their articles “What is a lakehouse?”. Research each vendor’s approach to addressing the Data Lakehouse need. Then make your decision and monitor your investments accordingly.

I hope you find this information useful.

Best,
–Kevin

30 Likes

Folks,

Quick follow-up on the above, because my favorite pastime on these boards has always been connecting the dots. Another company the board likes to discuss is using the multimodal model from bucket #2. That company is Upstart.

Just a small bit of sleuthing from their data engineering job description.

https://www.upstart.com/careers/69257/apply?gh_jid=2044333

By leveraging Upstart’s AI platform, Upstart-powered banks can have higher approval rates and lower loss rates, while simultaneously delivering the exceptional digital-first lending experience their customers demand. Upstart’s patent-pending platform is the first to receive a no-action letter from the Consumer Financial Protection Bureau related to fair lending.

Upstart’s Data Engineering team builds the data infrastructure and platform for our AI lending products. Data engineering is part of our engineering organization, as we believe great data engineering relies on solid software engineering fundamentals.

Our stack:

Python, SQL, Bash, Spark, Kafka, Airflow, Avro, Postgres, Redshift, Looker, Kubernetes, and Docker

The inclusion of Spark, Kafka, and Airflow, all the way through Kubernetes and Docker, tells you how they are building their pipeline. A quick click through the images from the last post will show you how these technologies slot in together.
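
Just to make the “slotting together” a bit more tangible, here is a purely illustrative Airflow sketch (my own guess at the shape of such a pipeline, not Upstart’s actual code); the task names, scripts, and paths are invented.

    # Illustrative-only Airflow DAG: schedule a Spark job over Kafka-landed data,
    # then load the result into the warehouse. Everything named here is hypothetical.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="loan_features_daily",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        # Batch-process the raw events a Kafka consumer has already landed in object storage.
        build_features = BashOperator(
            task_id="build_features",
            bash_command="spark-submit jobs/build_loan_features.py --date {{ ds }}",
        )

        # Load the resulting feature table into the warehouse for analysts.
        load_warehouse = BashOperator(
            task_id="load_to_redshift",
            bash_command="python jobs/copy_to_redshift.py --date {{ ds }}",
        )

        build_features >> load_warehouse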

Again, I hope this information is helpful. It is an interesting way to look at our companies, for me at least.

Best,
–Kevin

19 Likes

Kevin,

Thank you for these informative posts.

Building an architecture in the model of blueprint #2 or #3 isn’t easy. Creating a best-of-breed ML pipeline solution in-house and at scale is one of the most challenging data problems today. The convergence mentioned above is what people in the industry are referring to as the “Data Lakehouse”. The intention: to cover both analytics and data science workloads, i.e., to do blueprint #2 with just one underlying data technology.

You mention that Upstart seems to be using blueprint #2, and that blueprints #2 and #3 are not easy to build.

Is this difficulty really because of the need for a large team of extremely technical personnel and talent?
If so, would this be the biggest barrier to entry for big banks that try to build competing tech against UPST?

Just the other day I watched a 2017 video of Upstart’s CEO discussing their AI/ML.

It was a great talk by the way, I’ve timestamped the link to begin where Dave Girouard outlines Upstart’s mission statement in a crystal clear manner: https://www.youtube.com/watch?v=koOH4Hs2s9A&t=445s

But later on he mentions (I’m paraphrasing):
"It is our belief in a decade virtually every lending decision will be driven by AI/ML, it will be necessary to stay competitive.
AI/ML is the core of what we do. You cannot just take the same data that everyone already has access to from the credit bureaus and slice and dice it in a new way and expect to create a better credit model.

There are three key elements required:
(1) More data, and high-quality, clean data, plus real-time continuous learning: the data feedback loop has to be real-time (in Upstart’s case, repayment flow data for the models to learn from).

(2) Advanced math and techniques like gradient boosting (I’m guessing that was cutting-edge stuff back in 2017?) and others used in voice recognition or autonomous-driving AI/ML.

(3) Lots of people with PhDs.

(On a tangent, I notice the CEO mentions toward the end of the video that his ‘friend Douglas Merrill at ZestFinance’ is also working on AI/ML to disrupt credit. This is pure speculation, but I hope Upstart will try to use their giant cash pile to buy out Zest AI and fold all their PhDs and data scientists into UPST’s teams.)

35 Likes

Good summary. However, I did not get why a data lake is necessarily a better solution than a data warehouse plus a streaming data store.

Essentially, whether it’s for data analytics or ML/DL/RL, we need a persistent offline data store (traditionally provided by a data warehouse) as well as another data store for real-time data. My company has some of the richest ML and BI use cases in the world. We don’t have a data lake. Our systems just run without any problems.

Therefore, I think the first blueprint in the shared link is misleading. It seems to be saying that a data warehouse can only serve data analytics and cannot help with ML. A data lake is definitely not necessary for ML. A data warehouse plus any powerful real-time data store can provide pretty much all the data needed for ML.

In addition, there are other errors in the charts as well. For example, they label Hive as a batch query engine, which is not correct. Hive is a data warehouse, primarily used by Facebook.

Best,
Luffy

10 Likes

Well here is a 101 from my perspective.

Most organizations have large, complex data repositories; larger financial organizations have thousands of different data repositories, mostly legacy databases such as Oracle and SQL Server. They are moving to the cloud and simplifying those architectures, but as you can imagine, corralling this data is not trivial, and finding quality people to work in those legacy environments, with legacy thinking and poor management, is challenging.

Data lakes are necessary for most AI-driven organizations for the following reasons:

It is important to preserve the incoming format of data as it was provided, instead of formatting it to fit the current needs of the organization. One never knows which of the discarded information will be needed again, and if there are errors in processing that data, there will be no way to trace them back to the original data.

Data arrives in varying formats (structured, unstructured, and streaming), and relational databases such as Snowflake and Redshift can’t handle it as-is.

Data is sourced from multiple places: from areas of the business that produce transaction data, and from scores of third-party financial data providers with data about everything from loan default rates to traffic patterns (think Refinitiv and Nasdaq).

Data warehouses are needed for the more analytical cases across various departments in an organization: marketing wants to correlate marketing spend with product sales in various regions; sales folks might want to see which regions are profitable and how that correlates to seasons; CEOs want to see which regions are underperforming in order to figure out whom to fire. Not to mention a lot of other reporting use cases for departments that use that data for inventory, staffing, etc. People might use visualization tools like Tableau for this analysis.

Some of the data in a data lake can make it into a data warehouse, and a lot might not. One important point is that data lake storage is much cheaper because it sits in object storage provided by the cloud providers: S3 from AWS, Blob Storage from Azure, and GCS from Google Cloud. Cost is usually around $30 to $60 per terabyte, while databases like Oracle cost thousands of dollars per terabyte.
So unless there is a real analytics use case, you don’t just dump data into an expensive data warehouse.

Data is not information.

For data to turn into actionable information, we need to apply various levels of algorithms to it after we distill it down to the features we care about, enrich it, ensure its quality, etc. It is a no-brainer that we need sales information by region and user, but do we need ice cream sales on cold days, and do they relate to increased loan applications? These advanced cases need the kind of algorithms employed by the PhDs, working on all the data sitting in a data lake along with some data sitting in a data warehouse.
Now these PhDs are smart people, good with ML algorithms, and may know how to use Python with scikit-learn and the like. They usually need help from data engineers with CS degrees to format the data in a way that makes it easier to do their analysis and come up with interesting insights, such as which loans will perform. These data engineers might use Databricks to build out the cleansing, formatting, and feature-extraction pipelines for the data science users. The ML users might use Databricks or might not. However, once they find actionable insights, they might persist some of that information to a data warehouse so that other analysts can use it.
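
For a rough idea of what such a cleansing and feature-extraction pipeline might look like, here is a hypothetical PySpark sketch; the paths, column names, and JDBC settings are all invented.

    # Rough PySpark sketch of a cleansing / feature-extraction pipeline, then
    # persisting the distilled result to a warehouse table. Names are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("loan_cleansing").getOrCreate()

    raw = spark.read.parquet("s3://example-lake/raw/loan_applications/")

    features = (
        raw.dropDuplicates(["application_id"])
           .filter(F.col("income").isNotNull())
           .withColumn("debt_to_income", F.col("total_debt") / F.col("income"))
    )

    # Write where analysts can reach it (a warehouse table via JDBC here;
    # requires the appropriate JDBC driver on the cluster).
    features.write.format("jdbc").options(
        url="jdbc:redshift://example-cluster:5439/analytics",
        dbtable="loan_features",
        user="etl_user",
        password="...",
    ).mode("overwrite").save()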

So what does Upstart use? We can only guess, based on the job postings, that they have a data lake and that they have the data they need to produce the information a bank needs.
So where do they acquire this data, and what features do they use to come up with the results? That is their secret sauce and their moat.

The biggest challenges for an organization whose business relies on AI in my opinion are

  1. Getting enough quality data; if not, it is garbage in, garbage out. This is typically the biggest challenge for smaller organizations. It is also the moat that Amazon and Google have: enough data to come up with better models to predict consumer behavior and how to spend their dollars to grow and provide the greatest bang for their buck.
  2. Smart people with a good understanding of ML algorithms to filter the signal from the noise.
  3. A competent engineering organization to fill in all the other supporting needs of these goals.

I can keep going, but hopefully what I have typed out here is comprehensible. If not I suck at presenting and will keep quiet in the future.

57 Likes

Thanks for your explanation, rxkfoo!

So unless there is a real analytics use case, you don’t just dump data into an expensive data warehouse.

This is probably what I was missing. My company would store all the data for several months in the data warehouse, probably because we did not care too much about cost and we believed the historical data could still be very valuable. But I’m curious whether there is any disadvantage to storing data in S3 instead of a data warehouse. What kind of query engine supports querying such raw data? How does its performance compare with the common query engines that work against a data warehouse? How do we enforce privacy rules in a data lake so that the raw data won’t be misused? Could that be more challenging than in a data warehouse?

It is important to preserve the incoming format of data as it was provided, instead of formatting it to fit the current needs of the organization. One never knows which of the discarded information will be needed again, and if there are errors in processing that data, there will be no way to trace them back to the original data.

Can’t we just make sure we store the necessary information in the data warehouse? It should be very easy to add a few more columns to the tables in the data warehouse when we find that information is needed by some new feature.

Data arrives in varying formats (structured, unstructured, and streaming), and relational databases such as Snowflake and Redshift can’t handle it as-is.

Snowflake and Redshift are not relational databases in the OLTP sense; they are OLAP-purpose data stores. I don’t understand why a data warehouse cannot handle different formats of event data; there should be enough data transformation tools to convert any form of data into the format required by the warehouse. In terms of streaming data, there are tools like Confluent to wire the data into the warehouse if we use Kafka as the event pub-sub system. A data warehouse is for offline data, so it usually has about a one-day delay. Therefore, if we want to access real-time data, we need additional real-time data stores.

To conclude, I learned that a data lake is cheaper than a data warehouse. However, I was not convinced that data lakes are necessary for most AI-driven organizations. A data lake probably has some advantages, but the ML infrastructure built with data warehouses is simply working well.

Best,
Luffy

5 Likes

To conclude, I learned that a data lake is cheaper than a data warehouse. However, I was not convinced that data lakes are necessary for most AI-driven organizations. A data lake probably has some advantages, but the ML infrastructure built with data warehouses is simply working well.

Hi Luffy! I’ll take a stab at an answer and try to provide the details behind that answer which lurk beneath the surface.

For the 1,000-foot answer, I’ll borrow a transcript from an interview between a16z and the CEO of Databricks, Ali Ghodsi. Because this is an interview with someone who has a vested interest in the Databricks product lineup, you can take what is said with measured skepticism. That said, coming from the ML/AI side of the equation myself, I can vouch that Ali is correct in his summary assessment. Please take a look at the entire interview for the full set of discussion topics if you are interested (https://future.a16z.com/podcasts/evolution-of-data-architect…)

Here is what I plucked to answer your question.

But can’t we just use SQL?

Martin: If you talk to some folks that come from the traditional analyst side, they’ll say, “AI and ML is cool, but if you really look at what they’re doing, they’re just doing simple regressions. Why don’t we just use the traditional model of data warehouses with SQL, and then we’ll just extend SQL to do basic regressions, and we’ll cover 99% of the use cases?”

Ali: Yeah, that’s interesting that you ask because we actually tried that at UC Berkeley. There was a research project that looked at: Is there a way we can take an existing relational model and augment it with machine learning?

And after five years, they realized that it’s actually really hard to bolt machine learning and data science on top of these systems. The reason is a little bit technical — it just has to do with the fact that these are iterative, recursive algorithms that continue improving the statistical measure until it reaches a certain threshold and then they stop. That’s hard to implement on top of data warehousing.

If you look at the papers that were published out of that project, the conclusion was we have to really hack it hard, and it’s not going to be pretty. If you’re thinking of the relational Codd model with SQL on top of it, it’s not sufficient for doing things like deep learning and so on.

Let me try to unpack the relevant nugget in his assertion above:

The reason is a little bit technical — it just has to do with the fact that these are iterative, recursive algorithms that continue improving the statistical measure until it reaches a certain threshold and then they stop.

The reason machine learning is more difficult to do against a regular EDW is the workflow that exists in the machine learning pipeline. Unfortunately, the term “pipeline” has been used and re-used so much that it’s hard to understand what it means in the specific sense that relates to machine learning.

As backdrop, building predictive models requires a different skill set (Data Scientists), different tools (Jupyter notebooks, Python, R and the machine learning libraries built for that ecosystem) and different data (specially engineered and transformed feature data that predictive models can be trained on).

The reason Data Scientists don’t (or usually don’t) train directly on EDW data is that the data must be processed, transformed, and encoded as it moves down the machine learning pipeline, and it is staged at each step along the way. It is a workflow process, and the tools and libraries that Data Scientists use have all been built up around that workflow. The tools aren’t built against the EDW use case, unless of course you do the query once and then rely on performing every other step along the way in memory. This is not impossible, but it gets to be really impractical when you get to big data.

The end result of your transformed data is a feature matrix ready to be fed to machine learning algorithms. Preparing that feature matrix is iterative. Data Scientists restart their workflow based on insights they discover about how well their models perform, and then they go back to an earlier step in the process that began with acquisition. But they don’t go all the way back to acquisition. The process evolves along the way: sometimes they pick additional features they want to play with, or they engineer brand new ones. Sometimes they decide to one-hot encode categorical data as additional features. They also need to do scaling and normalization, and turn categorical data into encodings. Each of those operations is its own step in the pipeline. The pipeline gets altered along the way, each part is staged (stored back to the file system), and it is built up iteratively.
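
To make those staged steps concrete, here is a minimal scikit-learn sketch, on made-up data, of the scaling and one-hot encoding steps being composed into a pipeline that produces the feature matrix; the column names and model are placeholders.

    # Minimal sketch of the preprocessing steps described above composed into a pipeline.
    # Column names and the toy data are invented for illustration.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression

    df = pd.DataFrame({
        "income": [52000, 87000, 34000, 61000],
        "loan_amount": [10000, 25000, 5000, 15000],
        "employment_type": ["salaried", "self_employed", "salaried", "contract"],
        "defaulted": [0, 0, 1, 0],
    })

    preprocess = ColumnTransformer([
        ("scale_numeric", StandardScaler(), ["income", "loan_amount"]),                       # scaling / normalization
        ("encode_categorical", OneHotEncoder(handle_unknown="ignore"), ["employment_type"]),  # one-hot encoding
    ])

    model = Pipeline([
        ("features", preprocess),            # produces the feature matrix
        ("classifier", LogisticRegression()),
    ])

    model.fit(df.drop(columns="defaulted"), df["defaulted"])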

If you have small data, then you can do all of that transformational buildup in memory. You can connect to the EDW, pull the data in, and repeat it again and again since the query execution will be short.

But if we are talking about big data, then that pipeline is best sourced from the data lake itself. The lake is in the format that most of the tools in the data science toolkit are built around: data files organized in a columnar format (note, as above, that EDW queries come back in tabular format, and for the algorithms to work with that data it must be physically converted into vector format in memory). But here’s where things get interesting. When the data is already arranged in a vector or columnar format (e.g., your data lake), tools like Spark can come in and perform all those transformations I mentioned in a very procedural way on distributed clusters (the transformations are applied in execution order using something called a Directed Acyclic Graph, or DAG).
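
Here is a tiny, hypothetical PySpark fragment to illustrate that point: the transformations below are only recorded into the DAG, and nothing executes on the cluster until an action is called. The paths and columns are invented.

    # Hypothetical sketch: Spark records these transformations as a DAG and only
    # executes them, distributed across the cluster, when an action is called.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("lake_dag_demo").getOrCreate()

    events = spark.read.parquet("s3://example-lake/raw/events/")    # columnar files in the lake

    pipeline = (
        events.filter(F.col("amount") > 0)                          # transformation (lazy)
              .withColumn("log_amount", F.log1p(F.col("amount")))   # transformation (lazy)
              .groupBy("user_id")
              .agg(F.avg("log_amount").alias("avg_log_amount"))     # transformation (lazy)
    )

    pipeline.explain()           # prints the physical plan derived from the DAG
    result = pipeline.collect()  # action: now the DAG actually runs on the cluster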

If you attempted to do the same thing with the source data coming from an EDW, continuing to do those transformations on your own against the EDW, you would eventually run out of memory (if it is big data). For small data you can get away with doing all of these operations yourself in memory, but that approach will most certainly not carry over to Spark if you ever need it. So the process of sourcing your data directly against the EDW does not scale. Also, EDWs being what they are, their data is processed for human analytic consumption. Data Scientists want to work with raw data so they can apply their skill set (this is where their work becomes a craft) to make the best use of that data.

I hope my explanation is helpful. Please feel free to ping me with more questions through email if I missed the mark.

Best,
–Kevin

9 Likes

Hi johnwayne!

I think rxkfoo has done a wonderful job answering your questions.

I’ll fill in a few additional details from a data science viewpoint.

You mention that Upstart seems to be using blueprint #2, and that blueprints #2 and #3 are not easy to build.

Is this difficulty really because of the need for a large team of extremely technical personnel and talent?
If so, would this be the biggest barrier to entry for big banks that try to build competing tech against UPST?

I personally think their secret sauce is the fact that they are making incremental improvements on live data by applying reinforcement learning approaches (two common branches that expand on this are online learning with retraining, and the use of multi-armed bandits). The technology stack of course needs to be there to support this type of adaptive learning. And then you need to manage the stack and the continuous refinement of models. That process is called MLOps.
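
For a flavor of what incremental improvement on live data can look like mechanically, here is a toy online-learning sketch (my own illustration, not Upstart’s actual approach) in which a model is updated batch by batch as new outcomes arrive, rather than retrained from scratch.

    # Toy online-learning sketch (illustrative only): update the model incrementally
    # as each new batch of outcomes arrives, instead of retraining from scratch.
    import numpy as np
    from sklearn.linear_model import SGDClassifier

    model = SGDClassifier()
    classes = np.array([0, 1])          # e.g., repaid / defaulted

    rng = np.random.default_rng(0)
    for day in range(30):
        # Pretend these are the features and outcomes that arrived today.
        X_batch = rng.normal(size=(200, 5))
        y_batch = (X_batch[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

        model.partial_fit(X_batch, y_batch, classes=classes)  # incremental update, no full retrain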

"It is our belief in a decade virtually every lending decision will be driven by AI/ML, it will be necessary to stay competitive.
AI/ML is the core of what we do. You cannot just take the same data that everyone already has access to from the credit bureaus and slice and dice it in a new way and expect to create a better credit model.

He’s right. Organizations are recognizing they need to embrace AI/ML to stay competitive, and that is what has inspired yet another discipline, the “Machine Learning Engineer”, and what they do: build and operationalize models using MLOps.

Regarding your question about gradient boosting: yes, it is still in use today, but the technology has evolved. There are newer algorithm choices that do an even better job. One that is also based on gradient boosting, and that I like, is LightGBM.
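
For anyone curious how little ceremony that takes at the API level, here is a minimal LightGBM sketch on made-up data; real credit models obviously train on far richer, carefully engineered features.

    # Minimal gradient-boosting sketch with LightGBM on invented data.
    import numpy as np
    from lightgbm import LGBMClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 10))
    y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=1000) > 0).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LGBMClassifier(n_estimators=200, learning_rate=0.05)
    model.fit(X_train, y_train)
    print("holdout accuracy:", model.score(X_test, y_test))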

Really great questions about Upstart and analysis of their moat. Let’s say for now that their moat is the fact that they started this much sooner than everyone else. They’ll draw market share, and with that, stickiness. Then they can keep improving and provide additional services. But what they are doing isn’t 100% unrepeatable in a technology or execution sense (nor is hiring your own PhDs with the talent to do it yourself). All of that can be done, but Upstart has the first-mover advantage. That’s their moat, IMO.

They had the foresight to start on all of this in 2017. Brilliant vision followed by brilliant execution. Chances are pretty good that they’ll continue to innovate as long as the leadership stays in place.

Best,
–Kevin

11 Likes

it just has to do with the fact that these are iterative, recursive algorithms that continue improving the statistical measure until it reaches a certain threshold and then they stop.

I almost forgot another relevant part: stopping criteria. Machine learning algorithms are not only iterative in the pipeline transformations they apply, but also in how weights and coefficients are calculated during the training process. In deep learning those calculations continue until you effectively meet the stopping criteria. In other algorithms, training continues until you reach convergence. The basis for that convergence is something called a cost function.
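
To make the idea of converging against a cost function concrete, here is a bare-bones gradient descent loop for least squares; the data, learning rate, and tolerance are arbitrary.

    # Bare-bones gradient descent: keep adjusting the weights until the cost
    # function stops improving by more than a tolerance (the stopping criterion).
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 3))
    true_w = np.array([2.0, -1.0, 0.5])
    y = X @ true_w + rng.normal(scale=0.1, size=500)

    w = np.zeros(3)
    lr, tol = 0.1, 1e-8
    prev_cost = np.inf

    for step in range(10_000):
        residuals = X @ w - y
        cost = (residuals ** 2).mean()       # the cost function (mean squared error)
        if prev_cost - cost < tol:           # convergence: improvement below threshold, so stop
            break
        prev_cost = cost
        grad = 2 * X.T @ residuals / len(y)  # gradient of the cost w.r.t. the weights
        w -= lr * grad                       # the "gradient descent" adjustment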

Sorry to add that last tidbit late; just mentioning it for completeness. When Spark applies its algorithms in MLlib, it is also keeping track of these “gradient descent” adjustments.

Best,
–Kevin

5 Likes

Hi Kevin,

Thank you for the valuable insights into the tech side of things. This is really great. I’ve got follow-up questions if you could shed some more light on this subject.

I am recalling an interview with UPST cofounder Paul Gu (timestamp 12m): https://m.youtube.com/watch?v=C2cKcXOwibA&feature=youtu…

In the video he says:

“The first big thing is compute availability. Not having enough compute to do the sort of work in the space of time you want to do it. Even today, in a world where you can in theory arbitrarily scale up the number of EC2 machines that you’ve got on the Amazon cloud, we are still constrained by the amount of compute we can access, and the way it manifests itself is runtimes that take longer and longer. For our full model training process, we often in our history got into places where the complexity of the learning algorithms, interacting with the amount of compute and efficiency available, meant we faced runtimes of 24 to 72 hours for a single run of the model training… If you have a thousand runs to do and it takes 24 hours for a run, you are looking at a multi-year project… that’s the state of things today, and a lot of the investment we do is in improving the efficiency of the learning algorithms: how can we shortcut the search so that instead of 48 hours it only takes 24 hours, and instead of a thousand searches we only have to do 20 searches… Wind back the clock ten years and you were looking at compute that was an order of magnitude less available at the same cost. In parallel with this sort of improvement in underlying infrastructure, there have been many advancements in the actual algorithmic technology on both the ‘theoretical math and statistics’ side and the ‘implementing computer science’ side, and of course we benefit from these advances.”

This video was from May 3, 2021. So I presume compute availability remains a huge constraint on UPST’s ability to quickly tweak their AI models.

Can you give some color to this? Is this another huge barrier to entry by other big banks seeking to replicate UPST’s AI/ML success?

That they would not only need to:

  1. hire and retain a large pool of talented PhDs and engineers,

  2. spend years training their AI models with live, real-time, continuous repayment flow/default data in a reinforcement learning approach,

  3. also face limited cloud compute availability that can turn algorithm-improvement projects into multi-year timeframes,

  4. and, not to mention, address fair AI modeling/regulatory issues (like getting a CFPB no-action letter similar to Upstart’s)?

Or is this not accurate at all, and big banks with deep pockets can just buy their way out of cloud computing limitations and purchase entire servers for themselves? Or perhaps further mathematical/MLOps advancements will continue to shortcut the entire process and lower the compute-constraint barrier for everyone?

By the way…what are EC2 machines on Amazon that Paul Gu references?

15 Likes

EC2 is Amazon’s Elastic Compute Cloud, the rentable virtual servers that make up the compute side of AWS.
https://en.wikipedia.org/wiki/Amazon_Elastic_Compute_Cloud

1 Like