And it’s still early days in the shift from traditional legacy SQL relational databases to non-relational (NoSQL) databases, but that is certainly where the market has been headed. I bet it continues that way for well over the next 10 years, and MongoDB is the NoSQL leader and continuing to get stronger.
The rise of NoSQL can be tied to the rise of IoT and the general massive increase in gathering data from users. Contrary to what mekong states, I believe the shift to NoSQL is old hat today, and it has created more problems for today’s important analytics use cases. People jumped onto NoSQL because it was expedient for gathering the data, but it has created problems in using that data.
Most of the companies that are already MDB’s largest customers use MongoDB for only a very small percentage of their database business; the large majority of their databases are still old-school relational ones, and they want to shift more and more to MDB. So that’s a ton of high-likelihood, relatively easy growth, before I even start to think about the new customers out there that haven’t started doing business with MDB at all yet.
We’re at a big juncture in DBs right now. Database technology is, surprisingly to me at least, fast-moving these days. OTOH, companies using databases are often slow adopters. I remember the days when companies would skip one or two SAP versions, since a new version came out each year and they weren’t capable of keeping up. Often, companies would postpone upgrading until a feature they really wanted was included, or until the supplier forced them to upgrade by discontinuing support for the older version. So, there’s been a historical mismatch between how quickly new capabilities hit the market and how quickly companies adopt those new products.
Here’s a brief SQL/NoSQL primer: SQL is like a table with well-defined columns, where each chunk of data you gather is a row in that table. I’m sure this isn’t the way Amazon does it, but think of a table of Amazon customers, where each row represents a customer and the columns are for things like email address, delivery address, credit card #, Prime or not, etc.
NoSQL is free-form, really just a bunch of “key-value pairs.” Think of a text document where you place a word on the left, then a colon (or other delimiter), then some text (including numbers) representing the value on the right. You can use any word (the “key”) you want on the left-hand side. An SQL row in that Amazon customer table might then be represented as a NoSQL “document” like:
CustomerAddress: 123 Main St. AnyTown, MA 01234
CreditCard#: 1234 5678 90123
Because this is just a document, you can actually add new keys, and/or you can omit some keys, too. So, you will have more information on some users than others. This is NoSQL.
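To make the contrast concrete, here’s a minimal Python sketch (all field names and values here are made up for illustration; this isn’t how any real database stores data internally):

```python
# SQL-style: every row has exactly the same columns, in the same order.
columns = ("email", "address", "credit_card", "prime")
row = ("jane@example.com", "123 Main St. AnyTown, MA 01234", "1234 5678 90123", True)
assert len(row) == len(columns)        # the shape is guaranteed

# NoSQL-style: each document is just key-value pairs; keys can differ per document.
doc_a = {"CustomerAddress": "123 Main St. AnyTown, MA 01234",
         "CreditCard#": "1234 5678 90123"}
doc_b = {"CustomerAddress": "9 Elm Ave. OtherTown, CA 90210",
         "Prime": True,
         "WholeFoodsDelivery": True}   # an extra key that doc_a doesn't have

assert doc_a.keys() != doc_b.keys()    # NoSQL offers no such guarantee
```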
NoSQL is more flexible: when Amazon adds WholeFoods as a sub-business, they simply start putting in WholeFoods related keys for users that invoke that service. It’s literally like adding new text to documents.
With SQL, Amazon would have to modify the table (with an SQL ALTER command, which can take a long time to process, so sometimes they actually create a new table and copy the data over), AND they would have to figure out what values to put in the new column for customers that haven’t yet used WholeFoods, since every row in that table now has that new column and needs some entry. And not only that: you have to define the data type for that column, which has to be the same for all cells in that column.
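Here’s a minimal sketch of that schema-change dance, using SQLite purely for illustration (the table and column names are made up; real systems differ, but the ALTER-plus-backfill problem is the same idea):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (email TEXT, prime INTEGER)")
con.execute("INSERT INTO customers VALUES ('jane@example.com', 1)")

# Adding WholeFoods data means every existing row needs *something* in the
# new column -- here a DEFAULT of 0, whether or not the customer uses it.
con.execute("ALTER TABLE customers ADD COLUMN whole_foods INTEGER DEFAULT 0")

row = con.execute("SELECT email, whole_foods FROM customers").fetchone()
print(row)  # ('jane@example.com', 0) -- the old row got the default value
```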
So, why hasn’t everyone switched over to NoSQL? Because while NoSQL makes it super-easy to add new keys, it makes it harder to deal with that data later (those “analytics” workflows). SQL guarantees that every row in the table has the same number and type of data fields - NoSQL has no such guarantee. With NoSQL you have to handle missing fields, extra fields, and - worst of all - the potential that the values for a given field are in different formats. So, when you go to make use of the data in a NoSQL database, you not only have to handle missing data, you have to handle any variations in the data values. Heck, if you’ve got multiple applications putting data into the database, you might find that they use slightly different keys, and you’ve got to handle that on the processing side as well. The “processing” side of analyzing data in a NoSQL database is often a time-consuming, expensive process.
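A tiny Python sketch of what that “processing side” looks like in practice (the documents, key names, and normalization rules are all invented for the example):

```python
# Documents from different apps: missing keys, extra keys,
# inconsistent key names, and inconsistent value encodings.
docs = [
    {"state": "CA", "prime": True},
    {"State": "California"},                      # different key AND value format
    {"state": "ca", "prime": "yes", "extra": 1},  # different value encodings
]

def normalize(doc):
    """Coerce each document into one consistent shape before analytics."""
    state = doc.get("state") or doc.get("State") or ""
    state = {"california": "CA"}.get(state.lower(), state.upper())
    prime = doc.get("prime", False)
    prime = prime in (True, "yes", "y", 1)
    return {"state": state, "prime": prime}

cleaned = [normalize(d) for d in docs]
# Every document now has the same two fields in the same formats.
```

Every application that reads the database has to agree on this kind of cleanup logic - that’s the hidden cost the up-front SQL schema would have paid once.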
When you think about it, SQL forces you to think up front about how your data is going to be stored, which makes dealing with that data later easier. NoSQL lets you do what you want to capture the data, and then you have to process that data to get it into shape for analytics. In many organizations today, data is captured from various sources: not just web sites and mobile apps, but small IoT devices and other B2B services (think of a 3rd-party seller on Amazon having an automated system to capture orders, and so getting data from Amazon, but also selling stuff on Shopify or Etsy). There’s no way to enforce that all these sources, some of which you don’t own, will use the same “schema” (a fancy way of saying which columns) for your SQL tables.
With NoSQL, you just capture the data and deal with it as it comes in. If you’re doing “transactional” type work (for instance, processing orders), then you deal with each input source to send the order out and store the data for later. With SQL, you’d have to write applications that translate each order source into your format before you can store it; if a source changes, you have to change your translation, and you might have to add or subtract columns (changing the “schema”). So, NoSQL is a clear win for transactional (OLTP) use.
OTOH, more and more companies have the desire/need to leverage all the data they’ve gathered to create more profits. Knowing what customers have ordered in the past can help you decide not only what marketing to send their way, but also help with inventory prediction, deciding which new products to design and manufacture, etc. This is done through “analytics,” which is simply analyzing the data you have. Here things are the reverse: NoSQL’s free-form input creates headaches in trying to get data from various sources to represent the same things, while SQL’s rigidity means you just analyze what you have. (Yes, I’m oversimplifying, but this is basically it.) This is the OLAP use case. (I hate both those terms since they’re just one inside letter apart.)
In terms of competition, let’s talk Snowflake. Snowflake is mostly an SQL database, so it has all of the analytical use case advantages, but it also smooths over many of SQL’s input/transactional issues. For instance, one obvious workaround people tried with SQL databases when additional information came in was simply to create additional columns for the data arriving in different formats or representing slightly different things. This works, but then you have a space problem, since SQL databases have rigid space requirements: you have to be able to insert or re-order rows, meaning each row has to be the same size regardless of whether you use all the columns or not. Snowflake solves this with internal compression that you as a Snowflake customer don’t see (and it doesn’t noticeably impact performance).
Snowflake also lets you store NoSQL data within a “document” that is inside an SQL table cell. And you can directly reference data in that document using a “dot” notation, which essentially means you can write your efficient SQL analytical workflows that access both SQL and NoSQL data. Tastes great; less filling. You can also convert your NoSQL data into SQL data (“NoSQL flattening”) if you want. Note that internally, Snowflake is doing all kinds of conversions and processing to enable fast access to both the SQL and NoSQL data. This is all behind the scenes - which means that as Snowflake develops better algorithms you as a Snowflake customer just see better performance. Any application you’ve written does not need to be changed. Contrast that to MongoDB, where any change to the data storage or access requires a rewrite on your part.
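A rough Python analogy for that path-into-a-document idea (Snowflake’s actual syntax and internals are different; `get_path` is a hypothetical helper that just shows what it means to address one value inside a nested document without flattening the whole thing first):

```python
def get_path(doc, path):
    """Walk a dot-separated path like 'orders.wholefoods.count' into nested dicts."""
    for key in path.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None          # missing key -> a NULL-like result
        doc = doc[key]
    return doc

# A made-up customer document with nested NoSQL-style data inside it.
customer = {"email": "jane@example.com",
            "orders": {"wholefoods": {"count": 3}}}

print(get_path(customer, "orders.wholefoods.count"))  # 3
print(get_path(customer, "orders.pharmacy.count"))    # None (path doesn't exist)
```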
Another major technological shift that’s been going on for a while is from on-premises servers (machines in your own machine room) to Cloud servers (AWS, Azure, etc.), then to hosted services on Clouds, and now to full-on Database as a Service (DaaS, I guess) that combines both data storage and data processing compute. Here are some of the differences:
• OnPrem: You buy machines, install and configure software, run your own connections, upgrade manually, run and manage backups, and perform all hardware and software maintenance.
• Cloud: You use machines configured by Amazon or Google (or Oracle, etc.), and you install software on them. You may or may not need to perform upgrades and backups manually, but you at least have to configure backups yourself.
• Hosted: The software is installed and upgraded and backed up for you, but you still choose parameters that affect performance since that also affects cost. More importantly, your applications are running in some other hosted Cloud instance and so there’s communication and scaling costs.
• DaaS: You don’t worry about anything except your data - the hosting is ALL handled for you transparently. Even more, however, most/all of your applications can run inside Snowflake itself. You don’t need a separate compute hosted service that gathers data from your hosted DB and then processes it, the processing is done inside of the database itself.
MongoDB started out as open-source OnPrem software. They then enabled it to run on the Cloud, but you still have to set up your own clusters and backups and such. Their latest product is Mongo Atlas, which is a “fully-managed” service, which is great, although Mongo was slow to recognize the need for it. See some of the differences on Mongo’s AWS page: https://docs.aws.amazon.com/quickstart/latest/mongodb/overvi…
But, even a fully-hosted database is not what I call DaaS. With DaaS you not only don’t have to worry about the mechanics of storing, maintaining, and backing up your data, you also get performance advantages, since you often don’t need a separate Cloud instance to process that data. Imagine you want to find out what percentage of Amazon customers in California have Prime AND use WholeFoods delivery services.
With Mongo Atlas, you write a program that runs in an AWS instance, queries your Atlas database (another instance), and gets each record, checking whether that user is in CA, has Prime, and has used WholeFoods; if so, you increment a counter. With Snowflake, you write a program (perhaps in Snowpark) that runs inside of Snowflake and returns you the result. The difference is in data flow. With Mongo, you’re not only creating a new instance to do this processing, you’re paying Amazon for the data query AND waiting for the interprocess communication to happen. (This is an over-simplified example in many ways: it hides the NoSQL pre-processing that might need to happen and overlooks some Mongo built-ins for simple operations, but the general point remains.)
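The client-side half of that comparison looks roughly like this over-simplified sketch (the records and field names are invented; a real Mongo app would push much of this into an aggregation pipeline - the point is where the data has to travel, not Mongo’s query features):

```python
# Stand-in for records fetched over the network from the database.
customers = [
    {"state": "CA", "prime": True,  "whole_foods": True},
    {"state": "CA", "prime": True,  "whole_foods": False},
    {"state": "NY", "prime": True,  "whole_foods": True},
    {"state": "CA", "prime": False, "whole_foods": False},
]

# Filter and count in your own code, after every record crossed the wire.
ca = [c for c in customers if c["state"] == "CA"]
hits = [c for c in ca if c["prime"] and c["whole_foods"]]
pct = 100 * len(hits) / len(ca)
print(f"{pct:.0f}% of CA customers have Prime and use WholeFoods")  # 33%
```

With the DaaS model, that filtering and counting happens next to the data, and only the final percentage comes back to you.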
You can read about SnowPark here: https://www.snowflake.com/blog/welcome-to-snowpark-new-data-…
Snowpark takes care of pushing all of your logic to Snowflake, so it runs right next to your data. To host that code, we’ve built a secure, sandboxed JVM right into Snowflake’s warehouses—more on that in a bit.
But wait; there’s more! Let’s say that you wanted to apply your PII detection logic to all of the string columns in a table. With SQL, you’d have to hand-code a query for each table—or write code to generate the query. With Snowpark, you can easily write a generic routine… and with this generic routine in hand, you can mask all of the PII in any table with ease…Snowpark takes care of dynamically generating the correct query in a robust, schema-driven way.
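To show what “dynamically generating the correct query in a schema-driven way” means in plain terms, here’s a hedged Python sketch (Snowpark’s real API is different, and `mask_pii` is a hypothetical masking function; this just illustrates the generic routine):

```python
def mask_query(table, schema):
    """Build one SELECT that masks every string column and passes others through."""
    cols = []
    for name, col_type in schema:
        if col_type == "STRING":
            cols.append(f"mask_pii({name}) AS {name}")  # hypothetical PII-masking UDF
        else:
            cols.append(name)
    return f"SELECT {', '.join(cols)} FROM {table}"

# The same routine works for ANY table, driven entirely by its schema.
schema = [("email", "STRING"), ("prime", "BOOLEAN"), ("notes", "STRING")]
print(mask_query("customers", schema))
# SELECT mask_pii(email) AS email, prime, mask_pii(notes) AS notes FROM customers
```

With plain SQL you’d hand-write that query once per table; here the schema drives it, which is the Snowpark selling point the quote is making.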
OK, this has gotten pretty involved technically, but I hope I’ve expressed it in terms laypeople can understand.
So, what’s my point? It’s that MongoDB is doomed. Certainly not this year or next, but in the long run it’s doomed. I freely admit the decline will be gradual. Remember how I started this post talking about how companies skip versions? They’re even slower to adopt newer/better technologies. But, more and more companies are realizing that just being able to perform transactions within their DB isn’t good enough business-wise. They need to be able to gather intelligence and take action based on the data they’ve collected (analytics). If they don’t, they’ll simply be out-competed in the marketplace.
My last job involved setting up a two-way IoT system for vehicles. The proposal called for three databases: Amazon S3 (Simple Storage Service) to store all the data coming in, since it is easy and cheap (and NoSQL); Amazon Redshift (a competitor to Snowflake) for analytics, which meant moving data we wanted to analyze from S3 to Redshift; and a MongoDB database for transactional processing to support our mobile app. With Snowflake separating the cost of storage from the cost of compute, today I’d push to put it all in Snowflake for storage and analytics, and I think Snowpark could handle supporting the mobile apps as well. Any cost savings from S3 or even Mongo would be minimal compared to the programming ease and performance gains.
Investing-wise, I clearly got out of MDB too soon, and even now it may be “too soon” business-wise, but I think that’s better than too late. With every company that has data needing to analyze that data, I don’t see how the superior Snowflake model doesn’t win in the long run.