You are the Fool IT guy. Imagine Fool has 10,000 servers in a warehouse to handle all the traffic reading and posting on Fool.
ibuildthings,
This was an excellent example! Very well done. Please allow me to take your example one step further (because I build things, too).
Not only do you have 10,000 servers in a data center somewhere, along with a huge, complex, and ever-expanding database containing every iota of data about every Fool who has an account with TMF, but you also have a tremendous amount of storage (disk drives, RAID arrays, etc.) to keep that database on.
Now imagine that FoolHQ is in the middle of a migration from those 10,000 servers, with their databases and storage, to a cloud-native re-design of everything! On the one hand you have 10,000 servers, which are in reality probably a combination of “bare-metal” machines and VMs (virtual machines), and on the other, you have a smattering of systems up in “the cloud”. These systems in the cloud are likely a mix of VMs, possibly controlled by auto-scaling groups to better handle dynamic load, and things like Kubernetes clusters orchestrating a bunch of disparate Docker containers, each of which may be running smaller micro-services. Layer on top of that the further complexity of many components moving into a “serverless” framework, which may be a combination of things like Lambdas, static content stored in S3 buckets and distributed via CDN, API Gateways, etc. And you’ve got a variety of databases, from Postgres clusters, to DynamoDB, to Redis, etc.
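To make just one of those serverless pieces concrete, here’s a rough sketch of a Lambda handler sitting behind an API Gateway and reading from a DynamoDB table. The table, key, and function names are entirely made up for illustration; the point is how small each individual piece is.

```python
# Minimal sketch of one serverless piece from the mix above: a Lambda handler
# behind an API Gateway that reads a record out of DynamoDB.
# The table and key names are hypothetical.
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("fool-boards-posts")  # hypothetical table name


def handler(event, context):
    # API Gateway passes query string parameters through on the event
    post_id = (event.get("queryStringParameters") or {}).get("post_id", "")

    # One network hop to one of the several database flavors in play
    result = table.get_item(Key={"post_id": post_id})

    return {
        "statusCode": 200 if "Item" in result else 404,
        "body": json.dumps(result.get("Item", {}), default=str),
    }
```

Now multiply that little handler by dozens or hundreds of functions and containers, each with its own logs, its own permissions, and its own failure modes.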
Now, with all that complexity, how do you find where the slowdown is? Or even what it is? Some of these things can’t be “logged into” to check logs. They do have logs, and they store them “somewhere”. But where? And how do you read them? How do you correlate that the lambda over here that took a long time to spin up and hit the API Gateway is at the root of the problem? Or was it the lambda that launched a long-running query against 3 different types of databases and timed out before the query returned? Or is the problem that the instance running the giant Java app that was merely forklifted into the cloud lacks the permissions to access the DB?
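Take just the Lambda piece as an example. The “somewhere” is usually CloudWatch Logs, and even reading those takes code (or a console session). A rough sketch using boto3, with a hypothetical function name:

```python
# Rough sketch: pulling one Lambda's logs out of CloudWatch for a time window.
# The log group name is hypothetical; every function gets its own group,
# so multiply this by every Lambda, container, and instance you run.
import time
import boto3

logs = boto3.client("logs")

now_ms = int(time.time() * 1000)
window_ms = 15 * 60 * 1000  # the last 15 minutes

response = logs.filter_log_events(
    logGroupName="/aws/lambda/fool-get-post",  # hypothetical function name
    startTime=now_ms - window_ms,
    endTime=now_ms,
    filterPattern='"Task timed out"',  # only the timeouts
)

for event in response["events"]:
    print(event["timestamp"], event["message"].strip())
```

And that’s one log group, for one function, in one service. Every other component has its own pile of logs somewhere else.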
Maybe the instance is allowed to query after all, but returns the results incorrectly, causing the lambda to spin until it times out, and the whole thing manifests as if it’s a CDN problem distributing the next page?
Datadog (and Splunk before it) lets you have all of your cloud widgets log to a central location, then ingests those logs and allows you to do something called “time-event correlation”: correlating events based on their timestamps. It answers the question, “What was happening at this moment in time across my entire infrastructure?”
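Stripped of all the scale and tooling, the core idea looks something like this toy sketch (the events and sources are invented for illustration; the real products do this across billions of events, with parsing, indexing, and retention on top):

```python
# Toy illustration of time-event correlation: merge events from several log
# sources into one timeline, then ask "what was happening, across everything,
# within a few seconds of this moment?"
from datetime import datetime, timedelta

# Hypothetical events, already parsed into (timestamp, source, message)
events = [
    (datetime(2020, 3, 2, 14, 0, 1), "api-gateway", "502 from upstream"),
    (datetime(2020, 3, 2, 14, 0, 0), "lambda/get-post", "cold start, 2.4s init"),
    (datetime(2020, 3, 2, 13, 59, 58), "postgres", "slow query: 9200 ms"),
    (datetime(2020, 3, 2, 12, 15, 0), "cdn", "cache purge completed"),
]


def around(moment, window_seconds=5):
    """Everything, from every source, within +/- window_seconds of moment."""
    lo = moment - timedelta(seconds=window_seconds)
    hi = moment + timedelta(seconds=window_seconds)
    return sorted(e for e in events if lo <= e[0] <= hi)


for ts, source, message in around(datetime(2020, 3, 2, 14, 0, 0)):
    print(ts.isoformat(), source, message)
```

That one merged timeline is what turns “the site is slow” into “the Postgres query went long, the lambda cold-started, and the gateway threw a 502.”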
This is not an easy problem to solve. And with so many different moving parts across cloud-based infrastructures, data centers, hybrid clouds, etc., it is a NIGHTMARE to debug!
At one time you had “a” server to check for a particular type of problem. Now you could literally have 1000s. Additionally, the system that actually caused the problem may not even exist anymore by the time you start debugging it. Things like auto-scaling groups and container orchestration tools like Kubernetes constantly spin things up and down based on a variety of criteria and demand. So it’s very possible, probable even, that you need to be able to debug code that was running somewhere that no longer exists!
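Which is why the log events themselves have to carry enough context to stand on their own after the host is gone. A hedged sketch of what I mean, with made-up field names:

```python
# Sketch of emitting structured log lines that carry their own context, so an
# event can still be debugged after the container or instance that produced it
# has been scaled away. Field names here are made up for illustration.
import json
import socket
import sys
from datetime import datetime, timezone


def log(level, message, **fields):
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "host": socket.gethostname(),  # the pod/instance that won't exist later
        "message": message,
        **fields,
    }
    # In production this would go to a log shipper/agent rather than stdout
    print(json.dumps(record), file=sys.stdout)


log("ERROR", "query timed out", service="get-post", db="postgres", elapsed_ms=9200)
```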
Time-event correlation is an absolute necessity in this environment. It used to be that data centers could get away with minimal monitoring of only the essentials. But now, with products like Datadog, we can, and even have to in some cases, add instrumentation and logging to every bit of code so that when something goes wrong, we can easily pinpoint when, where, why, and how, and then figure out how to fix it. Sometimes the fix is actually easier than identifying the problem itself, but we can’t fix it because we don’t know the when, where, why, and how!
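For a taste of what that instrumentation can look like in practice, here’s a minimal sketch assuming the datadogpy client and a local DogStatsD agent; the metric names and tags are hypothetical, not anything Datadog prescribes:

```python
# Hedged sketch of "instrument every bit of code": assuming the datadogpy
# client and a DogStatsD agent listening locally, each call gets counted and
# timed so you can later pinpoint exactly when and where things slowed down.
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)  # local agent assumed


@statsd.timed("fool.get_post.duration", tags=["service:boards"])  # hypothetical metric
def get_post(post_id):
    statsd.increment("fool.get_post.calls", tags=["service:boards"])
    # ... the actual lookup work would go here ...
    return {"post_id": post_id}


get_post("12345")
```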
BobbyBe: I hate the business it’s in. It is hard to understand and seems very competitive and susceptible to disruption at any time (like security). It’s also an infrastructure play and doesn’t get wired into the company culture the way software like Alteryx does.
While I understand your concerns about this being an infrastructure play, and about it feeling very competitive and susceptible to disruption, take a look around. It’s Top Dog in its area. And it’s a VERY VERY HARD problem to solve, never mind to solve well!
There are absolutely open source solutions which could compete with it. One can build out a complete ELK (Elasticsearch, Logstash, Kibana) stack using Elastic’s (ESTC) products free of charge. But this is even harder. It will require at least one full-time person designing and implementing this infrastructure. And, being exactly the person who builds out that sort of infrastructure, let me assure you, it’s not simple! Elastic’s products are well designed, but they are incredibly complex! More likely you need a small team of people who are Elastic experts working with a team of infrastructure people. At that point, you’re looking at at least 3, maybe 4 FTEs… That gets very expensive very fast.
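Just to give a taste of the do-it-yourself flavor, here is the absolute simplest interaction with that stack, assuming a recent elasticsearch-py client (8.x-style keyword arguments) and a cluster someone has already stood up locally. The index and field names are invented:

```python
# Sketch: push one log event into Elasticsearch and search it back.
# Assumes a local dev cluster at localhost:9200 and an 8.x-style client.
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local dev cluster

es.index(
    index="app-logs",  # hypothetical index name
    document={
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "service": "get-post",
        "level": "ERROR",
        "message": "query timed out after 9200 ms",
    },
    refresh="wait_for",  # make the document searchable before we query
)

hits = es.search(index="app-logs", query={"match": {"message": "timed out"}})
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["message"])
```

And that’s the trivial part. The FTEs go into running the cluster itself: sizing, ingest pipelines, retention, Kibana dashboards, upgrades, and so on.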
Or you can buy Datadog’s product (or Splunk’s), which is expensive, but it’s probably worth every single penny! And maybe, being a small start-up strapped for cash, you decide to build your ELK stack at first. And maybe it’s “good enough” and you don’t buy Datadog. At some point, if your company is growing and successful, you will outgrow your ELK stack, or want it to do more but run out of bandwidth on your team to grow it. And you will quickly realize, or be pressured into realizing by those same infrastructure people, that Datadog is the better solution.
People like me, who build infrastructure, do not enjoy monitoring. And we don’t enjoy building the infrastructure to deal with monitoring. It’s tedious, boring, sucks up a lot of time and resources, and our skills can be better put to use elsewhere. We want to use things like Datadog, if for no other reason than to allow us to do more interesting things than deal with monitoring!
While DDOG is technically an infrastructure play, it is one at a level higher than ESTC. DDOG is a company that benefits from the infrastructure that ESTC lays down. ESTC is like dark fiber laid all over the place, there and ready for anyone to use. Whereas DDOG is like the service running over the newly lit-up fiber. DDOG benefits from the fiber being laid, and that allows it to sell very useful things to people who don’t have to, or want to, know how to light up the fiber!
I don’t really see a competitive threat to DDOG at this point. Sure, there are other companies in the space, but they’re either legacy companies like Splunk trying to re-position themselves, or they’re smaller start-ups trying to attack the exact same space with an inferior product.
It is entirely possible they’ll be challenged. It’s high tech. Everything and everyone is constantly being challenged. And when another upstart comes along to steal market share from DDOG, if they have a better mousetrap and better numbers, we’ll all move over there, just as those customers who first used Splunk and moved to DDOG are now going to Latest-Shiny-Thing.
–
Paul - who hates building monitoring infrastructure…